
Pandas Correlation Analysis

Correlation analysis is an important part of data analysis, used to discover relationships and dependencies between variables. Pandas provides powerful tools for calculating and visualizing correlations in data. This chapter will detail how to perform various correlation analyses using Pandas.

1. Correlation Analysis Basics

1.1 Concept of Correlation

Correlation measures the strength and direction of linear relationships between two or more variables:

  • Positive correlation: When one variable increases, the other tends to increase
  • Negative correlation: When one variable increases, the other tends to decrease
  • No correlation: No obvious linear relationship between variables
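These three cases are easy to see numerically. The following minimal sketch uses made-up NumPy data to produce one clearly positive, one clearly negative, and one uncorrelated pair:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)

pos = 2 * x + rng.normal(scale=0.5, size=500)   # positive: rises with x
neg = -2 * x + rng.normal(scale=0.5, size=500)  # negative: falls as x rises
ind = rng.normal(size=500)                      # independent: no linear link

r_pos = np.corrcoef(x, pos)[0, 1]
r_neg = np.corrcoef(x, neg)[0, 1]
r_ind = np.corrcoef(x, ind)[0, 1]

print(f"positive: {r_pos:.3f}, negative: {r_neg:.3f}, none: {r_ind:.3f}")
```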

1.2 Types of Correlation Coefficients

python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Set default figure size and font size for plots
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10

# Create sample dataset
np.random.seed(42)
n_samples = 1000

# Generate correlated data
data = {
    'temperature': np.random.normal(25, 5, n_samples),  # Temperature
    'humidity': np.random.normal(60, 15, n_samples),    # Humidity
    'pressure': np.random.normal(1013, 20, n_samples),  # Pressure
    'wind_speed': np.random.exponential(10, n_samples), # Wind speed
    'rainfall': np.random.gamma(2, 2, n_samples),       # Rainfall
}

# Add some correlations
data['ice_cream_sales'] = (data['temperature'] * 2 + 
                          np.random.normal(0, 10, n_samples) + 50)
data['umbrella_sales'] = (data['rainfall'] * 3 + 
                         np.random.normal(0, 15, n_samples) + 20)
data['air_conditioner_usage'] = (data['temperature'] * 1.5 + 
                                data['humidity'] * 0.3 + 
                                np.random.normal(0, 8, n_samples))

# Create DataFrame
df = pd.DataFrame(data)

print("Weather and Sales Dataset:")
print(df.head())
print(f"\nData Shape: {df.shape}")
print(f"Data Types:\n{df.dtypes}")

2. Pearson Correlation Coefficient

2.1 Calculating Pearson Correlation Coefficient

The Pearson correlation coefficient measures the strength of the linear relationship between two variables; its value lies in the interval [-1, 1].

python
# 1. Calculate correlation matrix using corr() method
print("Pearson Correlation Matrix:")
corr_matrix = df.corr(method='pearson')
print(corr_matrix.round(3))

# 2. Calculate correlation between specific columns
print("\nCorrelation between Temperature and Ice Cream Sales:")
temp_ice_corr = df['temperature'].corr(df['ice_cream_sales'])
print(f"Correlation Coefficient: {temp_ice_corr:.4f}")

# 3. Calculate correlation of multiple variables with target variable
print("\nCorrelation of All Variables with Ice Cream Sales:")
target_corr = df.corr()['ice_cream_sales'].sort_values(ascending=False)
print(target_corr.round(4))

# 4. Using corrwith() method
print("\nUsing corrwith() to Calculate Correlations:")
weather_vars = ['temperature', 'humidity', 'pressure', 'wind_speed', 'rainfall']
sales_corr = df[weather_vars].corrwith(df['ice_cream_sales'])
print(sales_corr.round(4))
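The value returned by `corr()` can be checked against the definition r = cov(X, Y) / (σ_X · σ_Y). Since pandas uses the same sample (ddof=1) estimates in `cov()` and `std()`, the two routes agree to floating-point precision. A quick check on synthetic data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
x = pd.Series(rng.normal(25, 5, 100))
y = 2 * x + pd.Series(rng.normal(0, 10, 100))

# Definition: r = cov(x, y) / (std(x) * std(y))
r_manual = x.cov(y) / (x.std() * y.std())
r_builtin = x.corr(y)

print(f"manual: {r_manual:.6f}, built-in: {r_builtin:.6f}")
```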

2.2 Statistical Significance Testing of Correlation

python
from scipy.stats import pearsonr

def correlation_with_pvalue(df, col1, col2):
    """
    Calculate correlation coefficient and p-value
    """
    corr_coef, p_value = pearsonr(df[col1], df[col2])
    return corr_coef, p_value

# Calculate correlation and significance
print("Correlation and Significance Tests:")
variable_pairs = [
    ('temperature', 'ice_cream_sales'),
    ('rainfall', 'umbrella_sales'),
    ('temperature', 'air_conditioner_usage'),
    ('humidity', 'air_conditioner_usage')
]

for var1, var2 in variable_pairs:
    corr, p_val = correlation_with_pvalue(df, var1, var2)
    significance = "Significant" if p_val < 0.05 else "Not Significant"
    print(f"{var1} vs {var2}: r={corr:.4f}, p={p_val:.4f} ({significance})")

3. Spearman Rank Correlation Coefficient

3.1 Spearman Correlation Analysis

The Spearman correlation coefficient is computed from the ranks of the variables, making it suitable for monotonic relationships that need not be linear.

python
# 1. Calculate Spearman correlation matrix
print("Spearman Correlation Matrix:")
spearman_corr = df.corr(method='spearman')
print(spearman_corr.round(3))

# 2. Compare Pearson and Spearman correlations
print("\nPearson vs Spearman Correlation Comparison:")
comparison_df = pd.DataFrame({
    'Pearson': df.corr(method='pearson')['ice_cream_sales'],
    'Spearman': df.corr(method='spearman')['ice_cream_sales']
})
comparison_df['Difference'] = abs(comparison_df['Pearson'] - comparison_df['Spearman'])
print(comparison_df.round(4))

# 3. Create non-linear relationship data for comparison
np.random.seed(42)
x = np.random.uniform(0, 10, 200)
y_linear = 2 * x + np.random.normal(0, 1, 200)
y_nonlinear = x**2 + np.random.normal(0, 5, 200)

nonlinear_df = pd.DataFrame({
    'x': x,
    'y_linear': y_linear,
    'y_nonlinear': y_nonlinear
})

print("\nLinear vs Non-linear Relationship Correlations:")
print("Linear Relationship:")
print(f"  Pearson: {nonlinear_df['x'].corr(nonlinear_df['y_linear'], method='pearson'):.4f}")
print(f"  Spearman: {nonlinear_df['x'].corr(nonlinear_df['y_linear'], method='spearman'):.4f}")

print("Non-linear Relationship:")
print(f"  Pearson: {nonlinear_df['x'].corr(nonlinear_df['y_nonlinear'], method='pearson'):.4f}")
print(f"  Spearman: {nonlinear_df['x'].corr(nonlinear_df['y_nonlinear'], method='spearman'):.4f}")
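Spearman's coefficient is, by definition, the Pearson coefficient of the ranks. A small sketch with synthetic data confirms that `method='spearman'` and a plain Pearson correlation on `rank()`-transformed series give the same value:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = pd.Series(rng.uniform(0, 10, 200))
y = x ** 2 + pd.Series(rng.normal(0, 5, 200))  # monotonic, non-linear

spearman = x.corr(y, method='spearman')
rank_pearson = x.rank().corr(y.rank())  # plain Pearson on the ranks

print(f"spearman: {spearman:.6f}, pearson-on-ranks: {rank_pearson:.6f}")
```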

4. Kendall Tau Correlation Coefficient

4.1 Kendall Correlation Analysis

The Kendall tau correlation coefficient is based on the difference between the numbers of concordant and discordant pairs of observations, which makes it more robust to outliers than Pearson's r.

python
# 1. Calculate Kendall correlation matrix
print("Kendall Correlation Matrix:")
kendall_corr = df.corr(method='kendall')
print(kendall_corr.round(3))

# 2. Comparison of three correlation coefficients
print("\nComparison of Three Correlation Coefficients:")
corr_comparison = pd.DataFrame({
    'Pearson': df.corr(method='pearson')['ice_cream_sales'],
    'Spearman': df.corr(method='spearman')['ice_cream_sales'],
    'Kendall': df.corr(method='kendall')['ice_cream_sales']
})
print(corr_comparison.round(4))

# 3. Visualize differences between three correlation coefficients
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for i, method in enumerate(['pearson', 'spearman', 'kendall']):
    corr_matrix = df.corr(method=method)
    sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0,
                ax=axes[i], cbar_kws={'shrink': 0.8})
    axes[i].set_title(f'{method.capitalize()} Correlation Matrix')
    axes[i].tick_params(axis='x', rotation=45)
    axes[i].tick_params(axis='y', rotation=0)

plt.tight_layout()
plt.show()
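For data without ties, tau can be reproduced directly from its definition, τ = (n_concordant − n_discordant) / n_pairs, as this small hand-computed example shows:

```python
from itertools import combinations

import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])
y = pd.Series([2, 1, 4, 3, 5])

concordant = discordant = 0
for i, j in combinations(range(len(x)), 2):
    # A pair is concordant when both variables move in the same direction
    s = (x[i] - x[j]) * (y[i] - y[j])
    if s > 0:
        concordant += 1
    elif s < 0:
        discordant += 1

n_pairs = len(x) * (len(x) - 1) // 2
tau_manual = (concordant - discordant) / n_pairs
tau_pandas = x.corr(y, method='kendall')

print(f"manual: {tau_manual}, pandas: {tau_pandas}")
```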

5. Partial Correlation Analysis

5.1 Calculating Partial Correlation Coefficient

Partial correlation analysis measures the correlation between two variables after removing the linear influence of one or more control variables.

python
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression

def partial_correlation(df, x_col, y_col, control_cols):
    """
    Calculate partial correlation coefficient
    """
    # Prepare data
    X_control = df[control_cols].values
    x_values = df[x_col].values
    y_values = df[y_col].values
    
    # Regress x and y separately, removing influence of control variables
    reg_x = LinearRegression().fit(X_control, x_values)
    reg_y = LinearRegression().fit(X_control, y_values)
    
    # Calculate residuals
    x_residuals = x_values - reg_x.predict(X_control)
    y_residuals = y_values - reg_y.predict(X_control)
    
    # Calculate correlation between residuals
    partial_corr, p_value = pearsonr(x_residuals, y_residuals)
    
    return partial_corr, p_value

# Calculate partial correlation coefficients
print("Partial Correlation Analysis:")

# Control for humidity and pressure, calculate partial correlation between temperature and AC usage
partial_corr, p_val = partial_correlation(
    df, 'temperature', 'air_conditioner_usage', ['humidity', 'pressure']
)
print(f"Partial Correlation of Temperature and AC Usage (controlling humidity, pressure): {partial_corr:.4f}, p={p_val:.4f}")

# Simple correlation vs partial correlation comparison
simple_corr = df['temperature'].corr(df['air_conditioner_usage'])
print(f"Simple Correlation: {simple_corr:.4f}")
print(f"Partial Correlation: {partial_corr:.4f}")
print(f"Difference: {abs(simple_corr - partial_corr):.4f}")

# Multiple partial correlation analyses
print("\nPartial Correlation Analysis for Multiple Variables:")
partial_results = []

variables = ['temperature', 'humidity', 'pressure', 'wind_speed', 'rainfall']
for var in variables:
    if var != 'temperature':
        control_vars = [v for v in variables if v not in [var, 'temperature']]
        partial_corr, p_val = partial_correlation(
            df, 'temperature', var, control_vars
        )
        simple_corr = df['temperature'].corr(df[var])
        
        partial_results.append({
            'Variable': var,
            'Simple_Correlation': simple_corr,
            'Partial_Correlation': partial_corr,
            'P_Value': p_val,
            'Difference': abs(simple_corr - partial_corr)
        })

partial_df = pd.DataFrame(partial_results)
print(partial_df.round(4))
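When controlling for all remaining variables at once, the same partial correlations can also be read off the inverse of the correlation matrix (the precision matrix P), via partial_ij = -P_ij / sqrt(P_ii · P_jj). A sketch with made-up data, in which a common driver `z` induces a spurious link between `a` and `b`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
z = rng.normal(size=500)                 # common driver
a = z + rng.normal(scale=0.5, size=500)
b = z + rng.normal(scale=0.5, size=500)
data = pd.DataFrame({'a': a, 'b': b, 'z': z})

# Invert the correlation matrix; the normalized off-diagonal entries of
# the precision matrix are (minus) the partial correlations
P = np.linalg.inv(data.corr().values)
partial_ab = -P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])

raw_ab = data['a'].corr(data['b'])
print(f"raw: {raw_ab:.3f}, partial (z controlled): {partial_ab:.3f}")
```

This agrees with the regression-residual approach above: the raw correlation is strong, while the partial correlation collapses toward zero once `z` is controlled for.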

6. Visualization of Correlation Matrix

6.1 Heatmap Visualization

python
# 1. Basic heatmap
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Basic heatmap
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', center=0,
            ax=axes[0,0], cbar_kws={'shrink': 0.8})
axes[0,0].set_title('Basic Correlation Heatmap')

# Lower triangle only
mask = np.triu(np.ones_like(df.corr(), dtype=bool))
sns.heatmap(df.corr(), mask=mask, annot=True, cmap='coolwarm', center=0,
            ax=axes[0,1], cbar_kws={'shrink': 0.8})
axes[0,1].set_title('Lower Triangle Correlation Heatmap')

# Custom colors and format
corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True, fmt='.2f', 
            cmap='RdYlBu_r', center=0, square=True,
            ax=axes[1,0], cbar_kws={'shrink': 0.8})
axes[1,0].set_title('Custom Format Heatmap')

# Highlight strong correlations
strong_corr = corr_matrix.copy()
strong_corr[abs(strong_corr) < 0.3] = 0
sns.heatmap(strong_corr, annot=True, cmap='coolwarm', center=0,
            ax=axes[1,1], cbar_kws={'shrink': 0.8})
axes[1,1].set_title('Strong Correlations Heatmap (|r| ≥ 0.3)')

plt.tight_layout()
plt.show()

6.2 Scatter Plot Matrix

python
# Create scatter plot matrix
from pandas.plotting import scatter_matrix

# Select main variables
main_vars = ['temperature', 'humidity', 'ice_cream_sales', 
             'umbrella_sales', 'air_conditioner_usage']

# scatter_matrix creates its own figure, so no separate plt.subplots() call is needed
scatter_matrix(df[main_vars], alpha=0.6, figsize=(12, 10), diagonal='hist')
plt.suptitle('Variable Scatter Plot Matrix', y=0.95)
plt.tight_layout()
plt.show()

# Using seaborn pairplot
sns.pairplot(df[main_vars], diag_kind='hist', plot_kws={'alpha': 0.6})
plt.suptitle('Seaborn Scatter Plot Matrix', y=1.02)
plt.show()

7. Practical Applications of Correlation Analysis

7.1 Feature Selection

python
def feature_correlation_analysis(df, target_col, threshold=0.1):
    """
    Feature selection analysis based on correlation
    """
    # Calculate correlation with target variable
    correlations = df.corr()[target_col].abs().sort_values(ascending=False)
    
    # Remove target variable itself
    correlations = correlations.drop(target_col)
    
    # Filter features with strong correlations
    strong_features = correlations[correlations >= threshold]
    weak_features = correlations[correlations < threshold]
    
    print(f"Correlation Analysis with {target_col} (threshold: {threshold}):")
    print(f"\nStrongly Correlated Features ({len(strong_features)} total):")
    for feature, corr in strong_features.items():
        print(f"  {feature}: {corr:.4f}")
    
    print(f"\nWeakly Correlated Features ({len(weak_features)} total):")
    for feature, corr in weak_features.items():
        print(f"  {feature}: {corr:.4f}")
    
    return strong_features.index.tolist(), weak_features.index.tolist()

# Feature selection for ice cream sales
strong_features, weak_features = feature_correlation_analysis(
    df, 'ice_cream_sales', threshold=0.3
)

# Multicollinearity check between features
def multicollinearity_check(df, features, threshold=0.8):
    """
    Check multicollinearity between features
    """
    corr_matrix = df[features].corr()
    high_corr_pairs = []
    
    for i in range(len(features)):
        for j in range(i+1, len(features)):
            corr_val = abs(corr_matrix.iloc[i, j])
            if corr_val >= threshold:
                high_corr_pairs.append({
                    'Feature1': features[i],
                    'Feature2': features[j],
                    'Correlation': corr_val
                })
    
    return pd.DataFrame(high_corr_pairs)

print("\nMulticollinearity Check:")
all_features = [col for col in df.columns if col != 'ice_cream_sales']
multicollinearity_df = multicollinearity_check(df, all_features, threshold=0.7)
if len(multicollinearity_df) > 0:
    print(multicollinearity_df)
else:
    print("No highly correlated feature pairs found")

7.2 Business Insight Analysis

python
def business_correlation_insights(df):
    """
    Extract business insights from correlation analysis
    """
    print("=== Business Correlation Insights Analysis ===")
    
    # 1. Sales correlation analysis
    print("\n1. Sales Impact Factor Analysis:")
    
    # Ice cream sales impact factors
    ice_cream_factors = df.corr()['ice_cream_sales'].abs().sort_values(ascending=False)
    ice_cream_factors = ice_cream_factors.drop('ice_cream_sales')
    
    print("   Ice Cream Sales Key Impact Factors:")
    for factor, corr in ice_cream_factors.head(3).items():
        direction = "Positive" if df[factor].corr(df['ice_cream_sales']) > 0 else "Negative"
        print(f"     {factor}: {corr:.3f} ({direction} correlation)")
    
    # Umbrella sales impact factors
    umbrella_factors = df.corr()['umbrella_sales'].abs().sort_values(ascending=False)
    umbrella_factors = umbrella_factors.drop('umbrella_sales')
    
    print("\n   Umbrella Sales Key Impact Factors:")
    for factor, corr in umbrella_factors.head(3).items():
        direction = "Positive" if df[factor].corr(df['umbrella_sales']) > 0 else "Negative"
        print(f"     {factor}: {corr:.3f} ({direction} correlation)")
    
    # 2. Weather factor correlations
    print("\n2. Weather Factor Relationships:")
    weather_vars = ['temperature', 'humidity', 'pressure', 'wind_speed', 'rainfall']
    weather_corr = df[weather_vars].corr()
    
    # Find strongest weather factor correlations
    weather_pairs = []
    for i in range(len(weather_vars)):
        for j in range(i+1, len(weather_vars)):
            corr_val = weather_corr.iloc[i, j]
            weather_pairs.append({
                'Pair': f"{weather_vars[i]} - {weather_vars[j]}",
                'Correlation': corr_val
            })
    
    weather_pairs_df = pd.DataFrame(weather_pairs)
    weather_pairs_df = weather_pairs_df.reindex(
        weather_pairs_df['Correlation'].abs().sort_values(ascending=False).index
    )
    
    print("   Weather Factor Correlations Ranked:")
    for _, row in weather_pairs_df.head(3).iterrows():
        direction = "Positive" if row['Correlation'] > 0 else "Negative"
        print(f"     {row['Pair']}: {row['Correlation']:.3f} ({direction} correlation)")
    
    # 3. Business recommendations
    print("\n3. Business Recommendations:")
    
    temp_ice_corr = df['temperature'].corr(df['ice_cream_sales'])
    rain_umbrella_corr = df['rainfall'].corr(df['umbrella_sales'])
    
    if temp_ice_corr > 0.5:
        print(f"   • Ice cream sales strongly positively correlated with temperature ({temp_ice_corr:.3f}), recommend adjusting inventory based on weather forecasts")
    
    if rain_umbrella_corr > 0.5:
        print(f"   • Umbrella sales strongly positively correlated with rainfall ({rain_umbrella_corr:.3f}), recommend monitoring rainfall forecasts")
    
    # AC usage analysis
    ac_temp_corr = df['temperature'].corr(df['air_conditioner_usage'])
    ac_humidity_corr = df['humidity'].corr(df['air_conditioner_usage'])
    
    if ac_temp_corr > 0.3 or ac_humidity_corr > 0.3:
        print(f"   • AC usage correlated with temperature ({ac_temp_corr:.3f}) and humidity ({ac_humidity_corr:.3f})")
        print("     Recommend power companies predict electricity demand based on weather forecasts")

# Execute business insights analysis
business_correlation_insights(df)

8. Time Series Correlation Analysis

8.1 Lag Correlation Analysis

python
# Create time series data
dates = pd.date_range('2023-01-01', periods=365, freq='D')
np.random.seed(42)

# Generate time series with seasonality and trend
time_trend = np.arange(365) * 0.01
seasonal = 10 * np.sin(2 * np.pi * np.arange(365) / 365.25)
noise = np.random.normal(0, 2, 365)

ts_data = pd.DataFrame({
    'date': dates,
    'temperature': 20 + seasonal + noise,
    'sales': 100 + time_trend * 50 + seasonal * 5 + noise * 3
})

# Add lagged sales (previous day's sales affects today's inventory demand)
ts_data['inventory_demand'] = ts_data['sales'].shift(1) * 0.8 + np.random.normal(0, 5, 365)

ts_data.set_index('date', inplace=True)

print("Time Series Data:")
print(ts_data.head(10))

def lag_correlation_analysis(df, col1, col2, max_lag=30):
    """
    Calculate lag correlations
    """
    correlations = []
    
    for lag in range(max_lag + 1):
        if lag == 0:
            corr = df[col1].corr(df[col2])
        else:
            corr = df[col1].corr(df[col2].shift(lag))
        
        correlations.append({
            'Lag': lag,
            'Correlation': corr
        })
    
    return pd.DataFrame(correlations)

# Analyze lag correlation between temperature and sales
lag_corr_df = lag_correlation_analysis(ts_data, 'temperature', 'sales', max_lag=14)

print("\nLag Correlation between Temperature and Sales:")
print(lag_corr_df.head(10))

# Find lag with strongest correlation
max_corr_lag = lag_corr_df.loc[lag_corr_df['Correlation'].abs().idxmax()]
print(f"\nStrongest Correlation: Lag {max_corr_lag['Lag']} days, Correlation Coefficient {max_corr_lag['Correlation']:.4f}")

# Visualize lag correlations
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.plot(lag_corr_df['Lag'], lag_corr_df['Correlation'], 'o-')
plt.axhline(y=0, color='r', linestyle='--', alpha=0.5)
plt.xlabel('Lag Days')
plt.ylabel('Correlation Coefficient')
plt.title('Lag Correlation between Temperature and Sales')
plt.grid(True, alpha=0.3)

# Plot original time series
plt.subplot(1, 2, 2)
plt.plot(ts_data.index, ts_data['temperature'], label='Temperature', alpha=0.7)
plt.plot(ts_data.index, ts_data['sales']/10, label='Sales/10', alpha=0.7)
plt.xlabel('Date')
plt.ylabel('Value')
plt.title('Temperature and Sales Time Series')
plt.legend()
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

8.2 Rolling Correlation Analysis

python
def rolling_correlation_analysis(df, col1, col2, window=30):
    """
    Calculate rolling correlation
    """
    rolling_corr = df[col1].rolling(window=window).corr(df[col2])
    return rolling_corr

# Calculate 30-day rolling correlation
rolling_corr = rolling_correlation_analysis(ts_data, 'temperature', 'sales', window=30)

print("Rolling Correlation Statistics:")
print(f"Mean Correlation: {rolling_corr.mean():.4f}")
print(f"Correlation Std Dev: {rolling_corr.std():.4f}")
print(f"Max Correlation: {rolling_corr.max():.4f}")
print(f"Min Correlation: {rolling_corr.min():.4f}")

# Visualize rolling correlation
fig, axes = plt.subplots(2, 1, figsize=(12, 8))

# Original data
axes[0].plot(ts_data.index, ts_data['temperature'], label='Temperature', alpha=0.7)
axes[0].plot(ts_data.index, ts_data['sales']/10, label='Sales/10', alpha=0.7)
axes[0].set_ylabel('Value')
axes[0].set_title('Original Time Series')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Rolling correlation
axes[1].plot(ts_data.index, rolling_corr, color='red', linewidth=2)
axes[1].axhline(y=0, color='black', linestyle='--', alpha=0.5)
axes[1].axhline(y=rolling_corr.mean(), color='blue', linestyle='--', alpha=0.5, 
                label=f'Mean: {rolling_corr.mean():.3f}')
axes[1].set_xlabel('Date')
axes[1].set_ylabel('Correlation Coefficient')
axes[1].set_title('30-Day Rolling Correlation')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

9. Considerations in Correlation Analysis

9.1 Correlation vs Causation

python
def correlation_vs_causation_demo():
    """
    Demonstrate the difference between correlation and causation
    """
    print("=== Correlation vs Causation Demo ===")
    
    # Create spurious correlation example
    np.random.seed(42)
    n = 1000
    
    # Spurious correlation due to common cause
    economic_growth = np.random.normal(0, 1, n)
    ice_cream_consumption = economic_growth * 0.8 + np.random.normal(0, 0.5, n)
    crime_rate = economic_growth * 0.6 + np.random.normal(0, 0.7, n)
    
    spurious_df = pd.DataFrame({
        'economic_growth': economic_growth,
        'ice_cream_consumption': ice_cream_consumption,
        'crime_rate': crime_rate
    })
    
    # Calculate correlation
    ice_crime_corr = spurious_df['ice_cream_consumption'].corr(spurious_df['crime_rate'])
    
    print(f"Correlation between ice cream consumption and crime rate: {ice_crime_corr:.4f}")
    print("Note: This is a spurious correlation! The real driver is economic growth, which affects both.")
    
    # Partial correlation after controlling for economic growth
    partial_corr, p_val = partial_correlation(
        spurious_df, 'ice_cream_consumption', 'crime_rate', ['economic_growth']
    )
    
    print(f"Partial correlation after controlling for economic growth: {partial_corr:.4f}")
    print("Partial correlation is near 0, indicating no relationship after controlling for economic growth")
    
    return spurious_df

spurious_df = correlation_vs_causation_demo()

9.2 Impact of Outliers on Correlation

python
def outlier_impact_on_correlation():
    """
    Demonstrate the impact of outliers on correlation
    """
    print("\n=== Impact of Outliers on Correlation ===")
    
    # Create normal data
    np.random.seed(42)
    n = 100
    x_normal = np.random.normal(0, 1, n)
    y_normal = 0.5 * x_normal + np.random.normal(0, 0.5, n)
    
    # Add outliers
    x_with_outlier = np.append(x_normal, [5, 6])
    y_with_outlier = np.append(y_normal, [8, 10])
    
    # Calculate correlations
    corr_normal = np.corrcoef(x_normal, y_normal)[0, 1]
    corr_with_outlier = np.corrcoef(x_with_outlier, y_with_outlier)[0, 1]
    
    print(f"Normal Data Correlation: {corr_normal:.4f}")
    print(f"Correlation with Outliers: {corr_with_outlier:.4f}")
    print(f"Impact of Outliers: {abs(corr_with_outlier - corr_normal):.4f}")
    
    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    
    axes[0].scatter(x_normal, y_normal, alpha=0.6)
    axes[0].set_title(f'Normal Data (r={corr_normal:.3f})')
    axes[0].set_xlabel('X')
    axes[0].set_ylabel('Y')
    axes[0].grid(True, alpha=0.3)
    
    axes[1].scatter(x_normal, y_normal, alpha=0.6, label='Normal Data')
    axes[1].scatter([5, 6], [8, 10], color='red', s=100, label='Outliers')
    axes[1].set_title(f'With Outliers (r={corr_with_outlier:.3f})')
    axes[1].set_xlabel('X')
    axes[1].set_ylabel('Y')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    return x_normal, y_normal, x_with_outlier, y_with_outlier

outlier_data = outlier_impact_on_correlation()

10. Correlation Analysis Best Practices

10.1 Complete Correlation Analysis Workflow

python
def comprehensive_correlation_analysis(df, target_col=None):
    """
    Complete correlation analysis workflow
    """
    print("=== Comprehensive Correlation Analysis Report ===")
    
    # 1. Basic data information
    print(f"\n1. Basic Data Information:")
    print(f"   Data Shape: {df.shape}")
    print(f"   Numeric Columns: {len(df.select_dtypes(include=[np.number]).columns)}")
    print(f"   Total Missing Values: {df.isnull().sum().sum()}")
    
    # 2. Correlation matrix calculation
    numeric_df = df.select_dtypes(include=[np.number])
    
    print(f"\n2. Correlation Matrix Statistics:")
    corr_matrix = numeric_df.corr()
    
    # Extract upper triangle (avoid duplicates and self-correlation)
    upper_triangle = corr_matrix.where(
        np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
    )
    
    correlations = upper_triangle.stack().reset_index()
    correlations.columns = ['Variable1', 'Variable2', 'Correlation']
    correlations['Abs_Correlation'] = correlations['Correlation'].abs()
    
    print(f"   Total Variable Pairs: {len(correlations)}")
    print(f"   Strong Correlation Pairs (|r| > 0.7): {len(correlations[correlations['Abs_Correlation'] > 0.7])}")
    print(f"   Moderate Correlation Pairs (0.3 < |r| ≤ 0.7): {len(correlations[(correlations['Abs_Correlation'] > 0.3) & (correlations['Abs_Correlation'] <= 0.7)])}")
    print(f"   Weak Correlation Pairs (|r| ≤ 0.3): {len(correlations[correlations['Abs_Correlation'] <= 0.3])}")
    
    # 3. Strongest correlations
    print(f"\n3. Strongest Correlations (Top 5):")
    top_correlations = correlations.nlargest(5, 'Abs_Correlation')
    for _, row in top_correlations.iterrows():
        direction = "Positive" if row['Correlation'] > 0 else "Negative"
        print(f"   {row['Variable1']} - {row['Variable2']}: {row['Correlation']:.4f} ({direction})")
    
    # 4. If target variable specified
    if target_col and target_col in numeric_df.columns:
        print(f"\n4. Correlation with Target Variable '{target_col}':")
        target_corr = numeric_df.corr()[target_col].abs().sort_values(ascending=False)
        target_corr = target_corr.drop(target_col)  # Remove self-correlation
        
        print("   Strongly Correlated Features (|r| > 0.5):")
        strong_features = target_corr[target_corr > 0.5]
        if len(strong_features) > 0:
            for feature, corr in strong_features.items():
                direction = "Positive" if numeric_df[feature].corr(numeric_df[target_col]) > 0 else "Negative"
                print(f"     {feature}: {corr:.4f} ({direction})")
        else:
            print("     No strongly correlated features")
    
    # 5. Multicollinearity check
    print(f"\n5. Multicollinearity Check:")
    high_corr = correlations[correlations['Abs_Correlation'] > 0.8]
    if len(high_corr) > 0:
        print("   Highly correlated variable pairs found:")
        for _, row in high_corr.iterrows():
            print(f"     {row['Variable1']} - {row['Variable2']}: {row['Correlation']:.4f}")
        print("   Recommendation: Consider removing one variable or using PCA")
    else:
        print("   No serious multicollinearity issues found")
    
    # 6. Correlation distribution
    print(f"\n6. Correlation Distribution Statistics:")
    corr_stats = correlations['Correlation'].describe()
    print(f"   Mean Correlation: {corr_stats['mean']:.4f}")
    print(f"   Correlation Std Dev: {corr_stats['std']:.4f}")
    print(f"   Correlation Range: [{corr_stats['min']:.4f}, {corr_stats['max']:.4f}]")
    
    return correlations, corr_matrix

# Execute comprehensive analysis
correlations_result, corr_matrix_result = comprehensive_correlation_analysis(
    df, target_col='ice_cream_sales'
)

10.2 Correlation Analysis Checklist

python
def correlation_analysis_checklist(df):
    """
    Correlation analysis checklist
    """
    print("=== Correlation Analysis Checklist ===")
    
    checklist = {
        "Data Preprocessing": [
            "✓ Check for missing values",
            "✓ Identify outliers",
            "✓ Confirm data types",
            "✓ Handle categorical variables"
        ],
        "Correlation Calculation": [
            "✓ Choose appropriate correlation coefficient type",
            "✓ Check linear relationship assumption",
            "✓ Consider non-linear relationships",
            "✓ Calculate statistical significance"
        ],
        "Result Interpretation": [
            "✓ Distinguish correlation from causation",
            "✓ Consider third variable effects",
            "✓ Check multicollinearity",
            "✓ Verify business logic reasonableness"
        ],
        "Visualization": [
            "✓ Plot correlation heatmap",
            "✓ Create scatter plot matrix",
            "✓ Check data distributions",
            "✓ Annotate important findings"
        ]
    }
    
    for category, items in checklist.items():
        print(f"\n{category}:")
        for item in items:
            print(f"  {item}")
    
    # Actual checks
    print("\n=== Current Dataset Check Results ===")
    
    # Missing value check
    missing_count = df.isnull().sum().sum()
    print(f"Missing Value Check: {'✓ Passed' if missing_count == 0 else f'⚠ Found {missing_count} missing values'}")
    
    # Data type check
    numeric_cols = len(df.select_dtypes(include=[np.number]).columns)
    total_cols = len(df.columns)
    print(f"Data Type Check: {numeric_cols}/{total_cols} columns are numeric")
    
    # Outlier check (using IQR method)
    numeric_df = df.select_dtypes(include=[np.number])
    outlier_count = 0
    for col in numeric_df.columns:
        Q1 = numeric_df[col].quantile(0.25)
        Q3 = numeric_df[col].quantile(0.75)
        IQR = Q3 - Q1
        outliers = ((numeric_df[col] < (Q1 - 1.5 * IQR)) | 
                   (numeric_df[col] > (Q3 + 1.5 * IQR))).sum()
        outlier_count += outliers
    
    print(f"Outlier Check: {'✓ Passed' if outlier_count == 0 else f'⚠ Found {outlier_count} outliers'}")
    
    # Multicollinearity check
    if len(numeric_df.columns) > 1:
        corr_matrix = numeric_df.corr()
        high_corr_count = 0
        for i in range(len(corr_matrix.columns)):
            for j in range(i+1, len(corr_matrix.columns)):
                if abs(corr_matrix.iloc[i, j]) > 0.8:
                    high_corr_count += 1
        
        print(f"Multicollinearity Check: {'✓ Passed' if high_corr_count == 0 else f'⚠ Found {high_corr_count} highly correlated variable pairs'}")

# Execute checklist
correlation_analysis_checklist(df)

Chapter Summary

This chapter comprehensively covered methods and techniques for correlation analysis using Pandas:

  1. Correlation Coefficient Types: Pearson, Spearman, Kendall - characteristics and use cases for each
  2. Partial Correlation Analysis: Controlling for other variables to calculate pure correlation
  3. Visualization Methods: Heatmaps, scatter plot matrices, and other visualization techniques
  4. Practical Applications: Feature selection, business insights, multicollinearity checking
  5. Time Series Correlation: Lag correlation and rolling correlation analysis
  6. Considerations: Difference between correlation and causation, outlier effects
  7. Best Practices: Complete analysis workflow and checklist

Correlation analysis is a fundamental skill in data science. Correctly understanding and applying correlation analysis can help us:

  • Discover relationship patterns between variables
  • Perform feature selection and dimensionality reduction
  • Identify multicollinearity issues
  • Generate business insights and hypotheses
  • Prepare for further modeling work

Exercises

  1. Perform a complete correlation analysis using a real dataset
  2. Compare the performance of different correlation coefficients on non-linear relationships
  3. Implement an automated correlation analysis report generator
  4. Analyze lag correlations in time series data
  5. Design a multicollinearity detection and handling scheme

In the next chapter, we will learn about Pandas data sorting and aggregation, exploring more advanced data manipulation techniques.
