Pandas Correlation Analysis
Correlation analysis is an important part of data analysis, used to discover relationships and dependencies between variables. Pandas provides powerful tools for calculating and visualizing correlations in data. This chapter will detail how to perform various correlation analyses using Pandas.
1. Correlation Analysis Basics
1.1 Concept of Correlation
Correlation measures the strength and direction of linear relationships between two or more variables:
- Positive correlation: When one variable increases, the other tends to increase
- Negative correlation: When one variable increases, the other tends to decrease
- No correlation: No obvious linear relationship between variables
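These three regimes are easy to demonstrate numerically. A minimal sketch using NumPy (the variable names and coefficients are illustrative, not part of the chapter's dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)

pos = 2 * x + rng.normal(scale=0.5, size=500)   # tends to rise with x
neg = -2 * x + rng.normal(scale=0.5, size=500)  # tends to fall as x rises
ind = rng.normal(size=500)                      # generated independently of x

r_pos = np.corrcoef(x, pos)[0, 1]
r_neg = np.corrcoef(x, neg)[0, 1]
r_ind = np.corrcoef(x, ind)[0, 1]
print(r_pos, r_neg, r_ind)  # strongly positive, strongly negative, near zero
```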
1.2 Types of Correlation Coefficients
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
# Configure default figure size and font size for plots
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10
# Create sample dataset
np.random.seed(42)
n_samples = 1000
# Generate correlated data
data = {
    'temperature': np.random.normal(25, 5, n_samples),   # Temperature
    'humidity': np.random.normal(60, 15, n_samples),     # Humidity
    'pressure': np.random.normal(1013, 20, n_samples),   # Pressure
    'wind_speed': np.random.exponential(10, n_samples),  # Wind speed
    'rainfall': np.random.gamma(2, 2, n_samples),        # Rainfall
}
# Add some correlations
data['ice_cream_sales'] = (data['temperature'] * 2 +
                           np.random.normal(0, 10, n_samples) + 50)
data['umbrella_sales'] = (data['rainfall'] * 3 +
                          np.random.normal(0, 15, n_samples) + 20)
data['air_conditioner_usage'] = (data['temperature'] * 1.5 +
                                 data['humidity'] * 0.3 +
                                 np.random.normal(0, 8, n_samples))
# Create DataFrame
df = pd.DataFrame(data)
print("Weather and Sales Dataset:")
print(df.head())
print(f"\nData Shape: {df.shape}")
print(f"Data Types:\n{df.dtypes}")
2. Pearson Correlation Coefficient
2.1 Calculating Pearson Correlation Coefficient
The Pearson correlation coefficient measures the strength of the linear relationship between variables, taking values in the interval [-1, 1].
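Since r is defined as the covariance divided by the product of the standard deviations, the value returned by `corr()` can be verified directly against that definition. A sketch on two illustrative toy series:

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
s2 = pd.Series([2.0, 1.0, 4.0, 3.0, 5.0])

# r = cov(x, y) / (std(x) * std(y)); pandas uses the sample (ddof=1) versions
r_manual = s1.cov(s2) / (s1.std() * s2.std())
r_pandas = s1.corr(s2)  # method='pearson' is the default
print(r_manual, r_pandas)  # both ≈ 0.8
```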
# 1. Calculate correlation matrix using corr() method
print("Pearson Correlation Matrix:")
corr_matrix = df.corr(method='pearson')
print(corr_matrix.round(3))
# 2. Calculate correlation between specific columns
print("\nCorrelation between Temperature and Ice Cream Sales:")
temp_ice_corr = df['temperature'].corr(df['ice_cream_sales'])
print(f"Correlation Coefficient: {temp_ice_corr:.4f}")
# 3. Calculate correlation of multiple variables with target variable
print("\nCorrelation of All Variables with Ice Cream Sales:")
target_corr = df.corr()['ice_cream_sales'].sort_values(ascending=False)
print(target_corr.round(4))
# 4. Using corrwith() method
print("\nUsing corrwith() to Calculate Correlations:")
weather_vars = ['temperature', 'humidity', 'pressure', 'wind_speed', 'rainfall']
sales_corr = df[weather_vars].corrwith(df['ice_cream_sales'])
print(sales_corr.round(4))
2.2 Statistical Significance Testing of Correlation
from scipy.stats import pearsonr
def correlation_with_pvalue(df, col1, col2):
    """
    Calculate correlation coefficient and p-value
    """
    corr_coef, p_value = pearsonr(df[col1], df[col2])
    return corr_coef, p_value
# Calculate correlation and significance
print("Correlation and Significance Tests:")
variable_pairs = [
('temperature', 'ice_cream_sales'),
('rainfall', 'umbrella_sales'),
('temperature', 'air_conditioner_usage'),
('humidity', 'air_conditioner_usage')
]
for var1, var2 in variable_pairs:
    corr, p_val = correlation_with_pvalue(df, var1, var2)
    significance = "Significant" if p_val < 0.05 else "Not Significant"
    print(f"{var1} vs {var2}: r={corr:.4f}, p={p_val:.4f} ({significance})")
3. Spearman Rank Correlation Coefficient
3.1 Spearman Correlation Analysis
The Spearman correlation coefficient is computed on the ranks of the variables, making it suitable for non-linear but monotonic relationships.
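The rank-based definition can be checked directly: applying the default Pearson `corr()` to ranked data reproduces the Spearman result. A sketch on illustrative synthetic data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
x = pd.Series(rng.uniform(0, 10, 100))
y = np.exp(x)  # strictly monotonic but strongly non-linear in x

# Spearman is Pearson applied to the ranks of each variable
r_spearman = x.corr(y, method='spearman')
r_via_ranks = x.rank().corr(y.rank())
print(r_spearman, r_via_ranks)  # the two values agree
print(x.corr(y))                # Pearson is noticeably below 1 here
```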
# 1. Calculate Spearman correlation matrix
print("Spearman Correlation Matrix:")
spearman_corr = df.corr(method='spearman')
print(spearman_corr.round(3))
# 2. Compare Pearson and Spearman correlations
print("\nPearson vs Spearman Correlation Comparison:")
comparison_df = pd.DataFrame({
    'Pearson': df.corr(method='pearson')['ice_cream_sales'],
    'Spearman': df.corr(method='spearman')['ice_cream_sales']
})
comparison_df['Difference'] = abs(comparison_df['Pearson'] - comparison_df['Spearman'])
print(comparison_df.round(4))
# 3. Create non-linear relationship data for comparison
np.random.seed(42)
x = np.random.uniform(0, 10, 200)
y_linear = 2 * x + np.random.normal(0, 1, 200)
y_nonlinear = x**2 + np.random.normal(0, 5, 200)
nonlinear_df = pd.DataFrame({
    'x': x,
    'y_linear': y_linear,
    'y_nonlinear': y_nonlinear
})
print("\nLinear vs Non-linear Relationship Correlations:")
print("Linear Relationship:")
print(f" Pearson: {nonlinear_df['x'].corr(nonlinear_df['y_linear'], method='pearson'):.4f}")
print(f" Spearman: {nonlinear_df['x'].corr(nonlinear_df['y_linear'], method='spearman'):.4f}")
print("Non-linear Relationship:")
print(f" Pearson: {nonlinear_df['x'].corr(nonlinear_df['y_nonlinear'], method='pearson'):.4f}")
print(f" Spearman: {nonlinear_df['x'].corr(nonlinear_df['y_nonlinear'], method='spearman'):.4f}")
4. Kendall Tau Correlation Coefficient
4.1 Kendall Correlation Analysis
The Kendall tau correlation coefficient is based on the balance of concordant and discordant pairs, which makes it more robust to outliers.
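The pair-counting definition can be reproduced by hand on a tiny sample and compared with `scipy.stats.kendalltau`. The five-point series below is illustrative:

```python
from itertools import combinations

import pandas as pd
from scipy.stats import kendalltau

x = pd.Series([1, 3, 2, 5, 4])
y = pd.Series([2, 4, 1, 5, 3])

# Count concordant and discordant pairs directly (no ties in this sample)
concordant = discordant = 0
for i, j in combinations(range(len(x)), 2):
    s = (x.iloc[i] - x.iloc[j]) * (y.iloc[i] - y.iloc[j])
    if s > 0:
        concordant += 1
    elif s < 0:
        discordant += 1

n_pairs = len(x) * (len(x) - 1) // 2
tau_manual = (concordant - discordant) / n_pairs
tau_scipy, _ = kendalltau(x, y)
print(tau_manual, tau_scipy)  # both equal 0.6 (up to float rounding)
```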
# 1. Calculate Kendall correlation matrix
print("Kendall Correlation Matrix:")
kendall_corr = df.corr(method='kendall')
print(kendall_corr.round(3))
# 2. Comparison of three correlation coefficients
print("\nComparison of Three Correlation Coefficients:")
corr_comparison = pd.DataFrame({
    'Pearson': df.corr(method='pearson')['ice_cream_sales'],
    'Spearman': df.corr(method='spearman')['ice_cream_sales'],
    'Kendall': df.corr(method='kendall')['ice_cream_sales']
})
print(corr_comparison.round(4))
# 3. Visualize differences between three correlation coefficients
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for i, method in enumerate(['pearson', 'spearman', 'kendall']):
    corr_matrix = df.corr(method=method)
    sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0,
                ax=axes[i], cbar_kws={'shrink': 0.8})
    axes[i].set_title(f'{method.capitalize()} Correlation Matrix')
    axes[i].tick_params(axis='x', rotation=45)
    axes[i].tick_params(axis='y', rotation=0)
plt.tight_layout()
plt.show()
5. Partial Correlation Analysis
5.1 Calculating Partial Correlation Coefficient
Partial correlation analysis measures the correlation between two variables while controlling for the influence of other variables.
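One common implementation regresses out the control variables and correlates the residuals, as in the function below. An equivalent cross-check uses a standard identity: the partial correlations are, up to sign, the normalized off-diagonal entries of the inverse correlation (precision) matrix. A sketch on synthetic data (the names `a`, `b`, `z` are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
z = rng.normal(size=300)                      # common driver
a = z + rng.normal(scale=0.5, size=300)
b = z + rng.normal(scale=0.5, size=300)
frame = pd.DataFrame({'a': a, 'b': b, 'z': z})

# Identity: with P = inverse of the correlation matrix,
# partial r of variables i and j given all the rest is -P_ij / sqrt(P_ii * P_jj)
p = np.linalg.inv(frame.corr().values)
partial_ab = -p[0, 1] / np.sqrt(p[0, 0] * p[1, 1])

simple_ab = frame['a'].corr(frame['b'])
print(f"simple r(a, b)       = {simple_ab:.4f}")   # inflated by the shared driver z
print(f"partial r(a, b | z)  = {partial_ab:.4f}")  # near zero once z is controlled
```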
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
def partial_correlation(df, x_col, y_col, control_cols):
    """
    Calculate partial correlation coefficient
    """
    # Prepare data
    X_control = df[control_cols].values
    x_values = df[x_col].values
    y_values = df[y_col].values
    # Regress x and y separately, removing influence of control variables
    reg_x = LinearRegression().fit(X_control, x_values)
    reg_y = LinearRegression().fit(X_control, y_values)
    # Calculate residuals
    x_residuals = x_values - reg_x.predict(X_control)
    y_residuals = y_values - reg_y.predict(X_control)
    # Calculate correlation between residuals
    partial_corr, p_value = pearsonr(x_residuals, y_residuals)
    return partial_corr, p_value
# Calculate partial correlation coefficients
print("Partial Correlation Analysis:")
# Control for humidity and pressure, calculate partial correlation between temperature and AC usage
partial_corr, p_val = partial_correlation(
df, 'temperature', 'air_conditioner_usage', ['humidity', 'pressure']
)
print(f"Partial Correlation of Temperature and AC Usage (controlling humidity, pressure): {partial_corr:.4f}, p={p_val:.4f}")
# Simple correlation vs partial correlation comparison
simple_corr = df['temperature'].corr(df['air_conditioner_usage'])
print(f"Simple Correlation: {simple_corr:.4f}")
print(f"Partial Correlation: {partial_corr:.4f}")
print(f"Difference: {abs(simple_corr - partial_corr):.4f}")
# Multiple partial correlation analyses
print("\nPartial Correlation Analysis for Multiple Variables:")
partial_results = []
variables = ['temperature', 'humidity', 'pressure', 'wind_speed', 'rainfall']
for var in variables:
    if var != 'temperature':
        control_vars = [v for v in variables if v not in [var, 'temperature']]
        partial_corr, p_val = partial_correlation(
            df, 'temperature', var, control_vars
        )
        simple_corr = df['temperature'].corr(df[var])
        partial_results.append({
            'Variable': var,
            'Simple_Correlation': simple_corr,
            'Partial_Correlation': partial_corr,
            'P_Value': p_val,
            'Difference': abs(simple_corr - partial_corr)
        })
partial_df = pd.DataFrame(partial_results)
print(partial_df.round(4))
6. Visualization of Correlation Matrix
6.1 Heatmap Visualization
# 1. Basic heatmap
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
# Basic heatmap
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', center=0,
ax=axes[0,0], cbar_kws={'shrink': 0.8})
axes[0,0].set_title('Basic Correlation Heatmap')
# Lower triangle only
mask = np.triu(np.ones_like(df.corr(), dtype=bool))
sns.heatmap(df.corr(), mask=mask, annot=True, cmap='coolwarm', center=0,
ax=axes[0,1], cbar_kws={'shrink': 0.8})
axes[0,1].set_title('Lower Triangle Correlation Heatmap')
# Custom colors and format
corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True, fmt='.2f',
cmap='RdYlBu_r', center=0, square=True,
ax=axes[1,0], cbar_kws={'shrink': 0.8})
axes[1,0].set_title('Custom Format Heatmap')
# Highlight strong correlations
strong_corr = corr_matrix.copy()
strong_corr[abs(strong_corr) < 0.3] = 0
sns.heatmap(strong_corr, annot=True, cmap='coolwarm', center=0,
ax=axes[1,1], cbar_kws={'shrink': 0.8})
axes[1,1].set_title('Strong Correlations Heatmap (|r| ≥ 0.3)')
plt.tight_layout()
plt.show()
6.2 Scatter Plot Matrix
# Create scatter plot matrix
from pandas.plotting import scatter_matrix
# Select main variables
main_vars = ['temperature', 'humidity', 'ice_cream_sales',
'umbrella_sales', 'air_conditioner_usage']
# scatter_matrix creates its own figure via figsize; a separate plt.subplots call would just open an empty extra figure
scatter_matrix(df[main_vars], alpha=0.6, figsize=(12, 10), diagonal='hist')
plt.suptitle('Variable Scatter Plot Matrix', y=0.95)
plt.tight_layout()
plt.show()
# Using seaborn pairplot
sns.pairplot(df[main_vars], diag_kind='hist', plot_kws={'alpha': 0.6})
plt.suptitle('Seaborn Scatter Plot Matrix', y=1.02)
plt.show()
7. Practical Applications of Correlation Analysis
7.1 Feature Selection
def feature_correlation_analysis(df, target_col, threshold=0.1):
    """
    Feature selection analysis based on correlation
    """
    # Calculate correlation with target variable
    correlations = df.corr()[target_col].abs().sort_values(ascending=False)
    # Remove target variable itself
    correlations = correlations.drop(target_col)
    # Filter features with strong correlations
    strong_features = correlations[correlations >= threshold]
    weak_features = correlations[correlations < threshold]
    print(f"Correlation Analysis with {target_col} (threshold: {threshold}):")
    print(f"\nStrongly Correlated Features ({len(strong_features)} total):")
    for feature, corr in strong_features.items():
        print(f" {feature}: {corr:.4f}")
    print(f"\nWeakly Correlated Features ({len(weak_features)} total):")
    for feature, corr in weak_features.items():
        print(f" {feature}: {corr:.4f}")
    return strong_features.index.tolist(), weak_features.index.tolist()
# Feature selection for ice cream sales
strong_features, weak_features = feature_correlation_analysis(
    df, 'ice_cream_sales', threshold=0.3
)
# Multicollinearity check between features
def multicollinearity_check(df, features, threshold=0.8):
    """
    Check multicollinearity between features
    """
    corr_matrix = df[features].corr()
    high_corr_pairs = []
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            corr_val = abs(corr_matrix.iloc[i, j])
            if corr_val >= threshold:
                high_corr_pairs.append({
                    'Feature1': features[i],
                    'Feature2': features[j],
                    'Correlation': corr_val
                })
    return pd.DataFrame(high_corr_pairs)
print("\nMulticollinearity Check:")
all_features = [col for col in df.columns if col != 'ice_cream_sales']
multicollinearity_df = multicollinearity_check(df, all_features, threshold=0.7)
if len(multicollinearity_df) > 0:
    print(multicollinearity_df)
else:
    print("No highly correlated feature pairs found")
7.2 Business Insight Analysis
def business_correlation_insights(df):
    """
    Extract business insights from correlation analysis
    """
    print("=== Business Correlation Insights Analysis ===")
    # 1. Sales correlation analysis
    print("\n1. Sales Impact Factor Analysis:")
    # Ice cream sales impact factors
    ice_cream_factors = df.corr()['ice_cream_sales'].abs().sort_values(ascending=False)
    ice_cream_factors = ice_cream_factors.drop('ice_cream_sales')
    print(" Ice Cream Sales Key Impact Factors:")
    for factor, corr in ice_cream_factors.head(3).items():
        direction = "Positive" if df[factor].corr(df['ice_cream_sales']) > 0 else "Negative"
        print(f" {factor}: {corr:.3f} ({direction} correlation)")
    # Umbrella sales impact factors
    umbrella_factors = df.corr()['umbrella_sales'].abs().sort_values(ascending=False)
    umbrella_factors = umbrella_factors.drop('umbrella_sales')
    print("\n Umbrella Sales Key Impact Factors:")
    for factor, corr in umbrella_factors.head(3).items():
        direction = "Positive" if df[factor].corr(df['umbrella_sales']) > 0 else "Negative"
        print(f" {factor}: {corr:.3f} ({direction} correlation)")
    # 2. Weather factor correlations
    print("\n2. Weather Factor Relationships:")
    weather_vars = ['temperature', 'humidity', 'pressure', 'wind_speed', 'rainfall']
    weather_corr = df[weather_vars].corr()
    # Find strongest weather factor correlations
    weather_pairs = []
    for i in range(len(weather_vars)):
        for j in range(i + 1, len(weather_vars)):
            corr_val = weather_corr.iloc[i, j]
            weather_pairs.append({
                'Pair': f"{weather_vars[i]} - {weather_vars[j]}",
                'Correlation': corr_val
            })
    weather_pairs_df = pd.DataFrame(weather_pairs)
    weather_pairs_df = weather_pairs_df.reindex(
        weather_pairs_df['Correlation'].abs().sort_values(ascending=False).index
    )
    print(" Weather Factor Correlations Ranked:")
    for _, row in weather_pairs_df.head(3).iterrows():
        direction = "Positive" if row['Correlation'] > 0 else "Negative"
        print(f" {row['Pair']}: {row['Correlation']:.3f} ({direction} correlation)")
    # 3. Business recommendations
    print("\n3. Business Recommendations:")
    temp_ice_corr = df['temperature'].corr(df['ice_cream_sales'])
    rain_umbrella_corr = df['rainfall'].corr(df['umbrella_sales'])
    if temp_ice_corr > 0.5:
        print(f" • Ice cream sales strongly positively correlated with temperature ({temp_ice_corr:.3f}), recommend adjusting inventory based on weather forecasts")
    if rain_umbrella_corr > 0.5:
        print(f" • Umbrella sales strongly positively correlated with rainfall ({rain_umbrella_corr:.3f}), recommend monitoring rainfall forecasts")
    # AC usage analysis
    ac_temp_corr = df['temperature'].corr(df['air_conditioner_usage'])
    ac_humidity_corr = df['humidity'].corr(df['air_conditioner_usage'])
    if ac_temp_corr > 0.3 or ac_humidity_corr > 0.3:
        print(f" • AC usage correlated with temperature ({ac_temp_corr:.3f}) and humidity ({ac_humidity_corr:.3f})")
        print(" Recommend power companies predict electricity demand based on weather forecasts")
# Execute business insights analysis
business_correlation_insights(df)
8. Time Series Correlation Analysis
8.1 Lag Correlation Analysis
# Create time series data
dates = pd.date_range('2023-01-01', periods=365, freq='D')
np.random.seed(42)
# Generate time series with seasonality and trend
time_trend = np.arange(365) * 0.01
seasonal = 10 * np.sin(2 * np.pi * np.arange(365) / 365.25)
noise = np.random.normal(0, 2, 365)
ts_data = pd.DataFrame({
    'date': dates,
    'temperature': 20 + seasonal + noise,
    'sales': 100 + time_trend * 50 + seasonal * 5 + noise * 3
})
# Add lagged sales (previous day's sales affects today's inventory demand)
ts_data['inventory_demand'] = ts_data['sales'].shift(1) * 0.8 + np.random.normal(0, 5, 365)
ts_data.set_index('date', inplace=True)
print("Time Series Data:")
print(ts_data.head(10))
def lag_correlation_analysis(df, col1, col2, max_lag=30):
    """
    Calculate lag correlations
    """
    correlations = []
    for lag in range(max_lag + 1):
        if lag == 0:
            corr = df[col1].corr(df[col2])
        else:
            corr = df[col1].corr(df[col2].shift(lag))
        correlations.append({
            'Lag': lag,
            'Correlation': corr
        })
    return pd.DataFrame(correlations)
# Analyze lag correlation between temperature and sales
lag_corr_df = lag_correlation_analysis(ts_data, 'temperature', 'sales', max_lag=14)
print("\nLag Correlation between Temperature and Sales:")
print(lag_corr_df.head(10))
# Find lag with strongest correlation
max_corr_lag = lag_corr_df.loc[lag_corr_df['Correlation'].abs().idxmax()]
print(f"\nStrongest Correlation: Lag {max_corr_lag['Lag']} days, Correlation Coefficient {max_corr_lag['Correlation']:.4f}")
# Visualize lag correlations
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.plot(lag_corr_df['Lag'], lag_corr_df['Correlation'], 'o-')
plt.axhline(y=0, color='r', linestyle='--', alpha=0.5)
plt.xlabel('Lag Days')
plt.ylabel('Correlation Coefficient')
plt.title('Lag Correlation between Temperature and Sales')
plt.grid(True, alpha=0.3)
# Plot original time series
plt.subplot(1, 2, 2)
plt.plot(ts_data.index, ts_data['temperature'], label='Temperature', alpha=0.7)
plt.plot(ts_data.index, ts_data['sales']/10, label='Sales/10', alpha=0.7)
plt.xlabel('Date')
plt.ylabel('Value')
plt.title('Temperature and Sales Time Series')
plt.legend()
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
8.2 Rolling Correlation Analysis
def rolling_correlation_analysis(df, col1, col2, window=30):
    """
    Calculate rolling correlation
    """
    rolling_corr = df[col1].rolling(window=window).corr(df[col2])
    return rolling_corr
# Calculate 30-day rolling correlation
rolling_corr = rolling_correlation_analysis(ts_data, 'temperature', 'sales', window=30)
print("Rolling Correlation Statistics:")
print(f"Mean Correlation: {rolling_corr.mean():.4f}")
print(f"Correlation Std Dev: {rolling_corr.std():.4f}")
print(f"Max Correlation: {rolling_corr.max():.4f}")
print(f"Min Correlation: {rolling_corr.min():.4f}")
# Visualize rolling correlation
fig, axes = plt.subplots(2, 1, figsize=(12, 8))
# Original data
axes[0].plot(ts_data.index, ts_data['temperature'], label='Temperature', alpha=0.7)
axes[0].plot(ts_data.index, ts_data['sales']/10, label='Sales/10', alpha=0.7)
axes[0].set_ylabel('Value')
axes[0].set_title('Original Time Series')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Rolling correlation
axes[1].plot(ts_data.index, rolling_corr, color='red', linewidth=2)
axes[1].axhline(y=0, color='black', linestyle='--', alpha=0.5)
axes[1].axhline(y=rolling_corr.mean(), color='blue', linestyle='--', alpha=0.5,
label=f'Mean: {rolling_corr.mean():.3f}')
axes[1].set_xlabel('Date')
axes[1].set_ylabel('Correlation Coefficient')
axes[1].set_title('30-Day Rolling Correlation')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
9. Considerations in Correlation Analysis
9.1 Correlation vs Causation
def correlation_vs_causation_demo():
    """
    Demonstrate the difference between correlation and causation
    """
    print("=== Correlation vs Causation Demo ===")
    # Create spurious correlation example
    np.random.seed(42)
    n = 1000
    # Spurious correlation due to common cause
    economic_growth = np.random.normal(0, 1, n)
    ice_cream_consumption = economic_growth * 0.8 + np.random.normal(0, 0.5, n)
    crime_rate = economic_growth * 0.6 + np.random.normal(0, 0.7, n)
    spurious_df = pd.DataFrame({
        'economic_growth': economic_growth,
        'ice_cream_consumption': ice_cream_consumption,
        'crime_rate': crime_rate
    })
    # Calculate correlation
    ice_crime_corr = spurious_df['ice_cream_consumption'].corr(spurious_df['crime_rate'])
    print(f"Correlation between ice cream consumption and crime rate: {ice_crime_corr:.4f}")
    print("Note: This is spurious correlation! The real cause is economic growth affecting both")
    # Partial correlation after controlling for economic growth
    partial_corr, p_val = partial_correlation(
        spurious_df, 'ice_cream_consumption', 'crime_rate', ['economic_growth']
    )
    print(f"Partial correlation after controlling for economic growth: {partial_corr:.4f}")
    print("Partial correlation is near 0, indicating no relationship after controlling for economic growth")
    return spurious_df
spurious_df = correlation_vs_causation_demo()
9.2 Impact of Outliers on Correlation
def outlier_impact_on_correlation():
    """
    Demonstrate the impact of outliers on correlation
    """
    print("\n=== Impact of Outliers on Correlation ===")
    # Create normal data
    np.random.seed(42)
    n = 100
    x_normal = np.random.normal(0, 1, n)
    y_normal = 0.5 * x_normal + np.random.normal(0, 0.5, n)
    # Add outliers
    x_with_outlier = np.append(x_normal, [5, 6])
    y_with_outlier = np.append(y_normal, [8, 10])
    # Calculate correlations
    corr_normal = np.corrcoef(x_normal, y_normal)[0, 1]
    corr_with_outlier = np.corrcoef(x_with_outlier, y_with_outlier)[0, 1]
    print(f"Normal Data Correlation: {corr_normal:.4f}")
    print(f"Correlation with Outliers: {corr_with_outlier:.4f}")
    print(f"Impact of Outliers: {abs(corr_with_outlier - corr_normal):.4f}")
    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    axes[0].scatter(x_normal, y_normal, alpha=0.6)
    axes[0].set_title(f'Normal Data (r={corr_normal:.3f})')
    axes[0].set_xlabel('X')
    axes[0].set_ylabel('Y')
    axes[0].grid(True, alpha=0.3)
    axes[1].scatter(x_normal, y_normal, alpha=0.6, label='Normal Data')
    axes[1].scatter([5, 6], [8, 10], color='red', s=100, label='Outliers')
    axes[1].set_title(f'With Outliers (r={corr_with_outlier:.3f})')
    axes[1].set_xlabel('X')
    axes[1].set_ylabel('Y')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    return x_normal, y_normal, x_with_outlier, y_with_outlier
outlier_data = outlier_impact_on_correlation()
10. Correlation Analysis Best Practices
10.1 Complete Correlation Analysis Workflow
def comprehensive_correlation_analysis(df, target_col=None):
    """
    Complete correlation analysis workflow
    """
    print("=== Comprehensive Correlation Analysis Report ===")
    # 1. Basic data information
    print("\n1. Basic Data Information:")
    print(f" Data Shape: {df.shape}")
    print(f" Numeric Columns: {len(df.select_dtypes(include=[np.number]).columns)}")
    print(f" Total Missing Values: {df.isnull().sum().sum()}")
    # 2. Correlation matrix calculation
    numeric_df = df.select_dtypes(include=[np.number])
    print("\n2. Correlation Matrix Statistics:")
    corr_matrix = numeric_df.corr()
    # Extract upper triangle (avoid duplicates and self-correlation)
    upper_triangle = corr_matrix.where(
        np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
    )
    correlations = upper_triangle.stack().reset_index()
    correlations.columns = ['Variable1', 'Variable2', 'Correlation']
    correlations['Abs_Correlation'] = correlations['Correlation'].abs()
    print(f" Total Variable Pairs: {len(correlations)}")
    print(f" Strong Correlation Pairs (|r| > 0.7): {len(correlations[correlations['Abs_Correlation'] > 0.7])}")
    print(f" Moderate Correlation Pairs (0.3 < |r| ≤ 0.7): {len(correlations[(correlations['Abs_Correlation'] > 0.3) & (correlations['Abs_Correlation'] <= 0.7)])}")
    print(f" Weak Correlation Pairs (|r| ≤ 0.3): {len(correlations[correlations['Abs_Correlation'] <= 0.3])}")
    # 3. Strongest correlations
    print("\n3. Strongest Correlations (Top 5):")
    top_correlations = correlations.nlargest(5, 'Abs_Correlation')
    for _, row in top_correlations.iterrows():
        direction = "Positive" if row['Correlation'] > 0 else "Negative"
        print(f" {row['Variable1']} - {row['Variable2']}: {row['Correlation']:.4f} ({direction})")
    # 4. If target variable specified
    if target_col and target_col in numeric_df.columns:
        print(f"\n4. Correlation with Target Variable '{target_col}':")
        target_corr = numeric_df.corr()[target_col].abs().sort_values(ascending=False)
        target_corr = target_corr.drop(target_col)  # Remove self-correlation
        print(" Strongly Correlated Features (|r| > 0.5):")
        strong_features = target_corr[target_corr > 0.5]
        if len(strong_features) > 0:
            for feature, corr in strong_features.items():
                direction = "Positive" if numeric_df[feature].corr(numeric_df[target_col]) > 0 else "Negative"
                print(f" {feature}: {corr:.4f} ({direction})")
        else:
            print(" No strongly correlated features")
    # 5. Multicollinearity check
    print("\n5. Multicollinearity Check:")
    high_corr = correlations[correlations['Abs_Correlation'] > 0.8]
    if len(high_corr) > 0:
        print(" Highly correlated variable pairs found:")
        for _, row in high_corr.iterrows():
            print(f" {row['Variable1']} - {row['Variable2']}: {row['Correlation']:.4f}")
        print(" Recommendation: Consider removing one variable or using PCA")
    else:
        print(" No serious multicollinearity issues found")
    # 6. Correlation distribution
    print("\n6. Correlation Distribution Statistics:")
    corr_stats = correlations['Correlation'].describe()
    print(f" Mean Correlation: {corr_stats['mean']:.4f}")
    print(f" Correlation Std Dev: {corr_stats['std']:.4f}")
    print(f" Correlation Range: [{corr_stats['min']:.4f}, {corr_stats['max']:.4f}]")
    return correlations, corr_matrix
# Execute comprehensive analysis
correlations_result, corr_matrix_result = comprehensive_correlation_analysis(
    df, target_col='ice_cream_sales'
)
10.2 Correlation Analysis Checklist
def correlation_analysis_checklist(df):
    """
    Correlation analysis checklist
    """
    print("=== Correlation Analysis Checklist ===")
    checklist = {
        "Data Preprocessing": [
            "✓ Check for missing values",
            "✓ Identify outliers",
            "✓ Confirm data types",
            "✓ Handle categorical variables"
        ],
        "Correlation Calculation": [
            "✓ Choose appropriate correlation coefficient type",
            "✓ Check linear relationship assumption",
            "✓ Consider non-linear relationships",
            "✓ Calculate statistical significance"
        ],
        "Result Interpretation": [
            "✓ Distinguish correlation from causation",
            "✓ Consider third variable effects",
            "✓ Check multicollinearity",
            "✓ Verify business logic reasonableness"
        ],
        "Visualization": [
            "✓ Plot correlation heatmap",
            "✓ Create scatter plot matrix",
            "✓ Check data distributions",
            "✓ Annotate important findings"
        ]
    }
    for category, items in checklist.items():
        print(f"\n{category}:")
        for item in items:
            print(f" {item}")
    # Actual checks
    print("\n=== Current Dataset Check Results ===")
    # Missing value check
    missing_count = df.isnull().sum().sum()
    print(f"Missing Value Check: {'✓ Passed' if missing_count == 0 else f'⚠ Found {missing_count} missing values'}")
    # Data type check
    numeric_cols = len(df.select_dtypes(include=[np.number]).columns)
    total_cols = len(df.columns)
    print(f"Data Type Check: {numeric_cols}/{total_cols} columns are numeric")
    # Outlier check (using IQR method)
    numeric_df = df.select_dtypes(include=[np.number])
    outlier_count = 0
    for col in numeric_df.columns:
        Q1 = numeric_df[col].quantile(0.25)
        Q3 = numeric_df[col].quantile(0.75)
        IQR = Q3 - Q1
        outliers = ((numeric_df[col] < (Q1 - 1.5 * IQR)) |
                    (numeric_df[col] > (Q3 + 1.5 * IQR))).sum()
        outlier_count += outliers
    print(f"Outlier Check: {'✓ Passed' if outlier_count == 0 else f'⚠ Found {outlier_count} outliers'}")
    # Multicollinearity check
    if len(numeric_df.columns) > 1:
        corr_matrix = numeric_df.corr()
        high_corr_count = 0
        for i in range(len(corr_matrix.columns)):
            for j in range(i + 1, len(corr_matrix.columns)):
                if abs(corr_matrix.iloc[i, j]) > 0.8:
                    high_corr_count += 1
        print(f"Multicollinearity Check: {'✓ Passed' if high_corr_count == 0 else f'⚠ Found {high_corr_count} highly correlated variable pairs'}")
# Execute checklist
correlation_analysis_checklist(df)
Chapter Summary
This chapter comprehensively covered methods and techniques for correlation analysis using Pandas:
- Correlation Coefficient Types: Pearson, Spearman, Kendall - characteristics and use cases for each
- Partial Correlation Analysis: Controlling for other variables to calculate pure correlation
- Visualization Methods: Heatmaps, scatter plot matrices, and other visualization techniques
- Practical Applications: Feature selection, business insights, multicollinearity checking
- Time Series Correlation: Lag correlation and rolling correlation analysis
- Considerations: Difference between correlation and causation, outlier effects
- Best Practices: Complete analysis workflow and checklist
Correlation analysis is a fundamental skill in data science. Correctly understanding and applying correlation analysis can help us:
- Discover relationship patterns between variables
- Perform feature selection and dimensionality reduction
- Identify multicollinearity issues
- Generate business insights and hypotheses
- Prepare for further modeling work
Exercises
- Perform a complete correlation analysis using a real dataset
- Compare the performance of different correlation coefficients on non-linear relationships
- Implement an automated correlation analysis report generator
- Analyze lag correlations in time series data
- Design a multicollinearity detection and handling scheme
In the next chapter, we will learn about Pandas data sorting and aggregation, exploring more advanced data manipulation techniques.