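Pandas Correlation Analysis

Correlation analysis is a core step in data analysis: it reveals how variables relate to and depend on one another. Pandas computes correlation coefficients directly, and together with Matplotlib and Seaborn it makes them easy to visualize. This chapter walks through the main kinds of correlation analysis you can run with Pandas.

1. Correlation Analysis Basics

1.1 The Concept of Correlation

Correlation measures the strength and direction of the linear relationship between variables:

Positive correlation: when one variable increases, the other tends to increase as well
Negative correlation: when one variable increases, the other tends to decrease
No correlation: there is no apparent linear relationship between the variables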
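
To make the three cases concrete, here is a minimal sketch on toy data. The column names, coefficients, and random seed are illustrative assumptions, not part of the chapter's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: names and coefficients are illustrative only
rng = np.random.default_rng(0)
x = rng.normal(size=500)
demo = pd.DataFrame({
    "x": x,
    "y_positive": 2 * x + rng.normal(scale=0.5, size=500),   # rises with x
    "y_negative": -2 * x + rng.normal(scale=0.5, size=500),  # falls as x rises
    "y_unrelated": rng.normal(size=500),                     # independent of x
})

# Pearson correlation of every column with x (self-correlation dropped)
r = demo.corr()["x"].drop("x")
print(r.round(3))
```

The first coefficient should come out close to +1, the second close to -1, and the third close to 0.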

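1.2 Types of Correlation Coefficients

`DataFrame.corr()` supports three coefficients through its `method` parameter: Pearson (the default), Spearman, and Kendall. Sections 2 to 4 cover them in turn. The code below sets up the imports and the sample dataset used throughout this chapter.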

python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Use a font with CJK glyphs (SimHei) so non-ASCII labels render correctly
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

# Create the sample dataset
np.random.seed(42)
n_samples = 1000

# Generate the base weather variables (drawn independently of one another)
data = {
    'temperature': np.random.normal(25, 5, n_samples),  # temperature
    'humidity': np.random.normal(60, 15, n_samples),    # humidity
    'pressure': np.random.normal(1013, 20, n_samples),  # air pressure
    'wind_speed': np.random.exponential(10, n_samples), # wind speed
    'rainfall': np.random.gamma(2, 2, n_samples),       # rainfall
}

# Add variables that depend on the weather, which introduces correlations
data['ice_cream_sales'] = (data['temperature'] * 2 + 
                          np.random.normal(0, 10, n_samples) + 50)
data['umbrella_sales'] = (data['rainfall'] * 3 + 
                         np.random.normal(0, 15, n_samples) + 20)
data['air_conditioner_usage'] = (data['temperature'] * 1.5 + 
                                data['humidity'] * 0.3 + 
                                np.random.normal(0, 8, n_samples))

# Build the DataFrame
df = pd.DataFrame(data)

print("Weather and sales dataset:")
print(df.head())
print(f"\nShape: {df.shape}")
print(f"Data types:\n{df.dtypes}")

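2. Pearson Correlation Coefficient

2.1 Computing the Pearson Correlation Coefficient

The Pearson correlation coefficient measures the linear correlation between variables and takes values in [-1, 1].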
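
As a quick illustration of the definition, the sketch below computes Pearson's r by hand as the covariance divided by the product of the standard deviations and compares it with `Series.corr()`. The data is synthetic and the variable names are assumptions made for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = pd.Series(rng.normal(size=200))
y = 0.6 * x + pd.Series(rng.normal(size=200))

# Pearson r = cov(x, y) / (std(x) * std(y)); pandas uses ddof=1 for both by default
r_manual = x.cov(y) / (x.std() * y.std())
r_pandas = x.corr(y)  # method='pearson' is the default

print(f"manual: {r_manual:.6f}, pandas: {r_pandas:.6f}")
```

Both values should agree up to floating-point error, and the result always falls within [-1, 1].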

python

# 1. Compute the correlation matrix with corr()
print("Pearson correlation matrix:")
corr_matrix = df.corr(method='pearson')
print(corr_matrix.round(3))

# 2. Correlation between two specific columns
print("\nCorrelation between temperature and ice cream sales:")
temp_ice_corr = df['temperature'].corr(df['ice_cream_sales'])
print(f"Correlation coefficient: {temp_ice_corr:.4f}")

# 3. Correlation of every variable with a target variable
print("\nCorrelation of each variable with ice cream sales:")
target_corr = df.corr()['ice_cream_sales'].sort_values(ascending=False)
print(target_corr.round(4))

# 4. Using corrwith()
print("\nCorrelations computed with corrwith():")
weather_vars = ['temperature', 'humidity', 'pressure', 'wind_speed', 'rainfall']
sales_corr = df[weather_vars].corrwith(df['ice_cream_sales'])
print(sales_corr.round(4))

2.2 Testing the Statistical Significance of a Correlation

python

from scipy.stats import pearsonr

def correlation_with_pvalue(df, col1, col2):
    """
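    Return the Pearson correlation coefficient and its p-value.
    """
    corr_coef, p_value = pearsonr(df[col1], df[col2])
    return corr_coef, p_value

# Compute correlations and their significance
print("Correlation and significance tests:")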
variable_pairs = [
    ('temperature', 'ice_cream_sales'),
    ('rainfall', 'umbrella_sales'),
    ('temperature', 'air_conditioner_usage'),
    ('humidity', 'air_conditioner_usage')
]

for var1, var2 in variable_pairs:
    corr, p_val = correlation_with_pvalue(df, var1, var2)
    significance = "significant" if p_val < 0.05 else "not significant"
    # .4g keeps very small p-values visible instead of printing 0.0000
    print(f"{var1} vs {var2}: r={corr:.4f}, p={p_val:.4g} ({significance})")
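
For intuition about where the p-value comes from, here is a minimal sketch assuming the usual null hypothesis of zero correlation: the statistic t = r * sqrt((n - 2) / (1 - r^2)) follows a t distribution with n - 2 degrees of freedom. The data and variable names are made up for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 40
x = rng.normal(size=n)
y = 0.4 * x + rng.normal(size=n)

r, p_scipy = stats.pearsonr(x, y)

# Under H0 (true correlation is 0), t follows a t distribution with n - 2 degrees of freedom
t_stat = r * np.sqrt((n - 2) / (1 - r**2))
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided p-value

print(f"r={r:.4f}, scipy p={p_scipy:.6f}, manual p={p_manual:.6f}")
```

The two p-values should match closely. Keep in mind that with large samples even a very weak correlation can be statistically significant, so report r alongside p.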

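3. Spearman Rank Correlation Coefficient

3.1 Spearman Correlation Analysis

The Spearman correlation coefficient is computed on the ranks of the values, which makes it suitable for monotonic relationships that are not linear.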
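
One way to see what "computed on the ranks" means: the Spearman coefficient equals the Pearson coefficient of the rank-transformed data. The sketch below checks this on synthetic data; the exponential relationship is an assumption chosen only to be monotonic and non-linear.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
x = pd.Series(rng.uniform(0, 10, size=300))
# Monotonic but strongly non-linear relationship, plus a little noise
y = np.exp(x / 2) + pd.Series(rng.normal(scale=1.0, size=300))

rho_pandas = x.corr(y, method="spearman")
rho_manual = x.rank().corr(y.rank(), method="pearson")  # Pearson on ranks

print(f"Spearman (pandas): {rho_pandas:.6f}")
print(f"Pearson on ranks:  {rho_manual:.6f}")
print(f"Pearson on values: {x.corr(y):.6f}")
```

The first two numbers should be identical, while the plain Pearson coefficient is noticeably lower because the relationship is curved.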

python

# 1. Compute the Spearman correlation matrix
print("Spearman correlation matrix:")
spearman_corr = df.corr(method='spearman')
print(spearman_corr.round(3))

# 2. Compare Pearson and Spearman correlations
print("\nPearson vs Spearman comparison:")
comparison_df = pd.DataFrame({
    'Pearson': df.corr(method='pearson')['ice_cream_sales'],
    'Spearman': df.corr(method='spearman')['ice_cream_sales']
})
comparison_df['Difference'] = abs(comparison_df['Pearson'] - comparison_df['Spearman'])
print(comparison_df.round(4))

# 3. Build data with a non-linear relationship for comparison
np.random.seed(42)
x = np.random.uniform(0, 10, 200)
y_linear = 2 * x + np.random.normal(0, 1, 200)
y_nonlinear = x**2 + np.random.normal(0, 5, 200)

nonlinear_df = pd.DataFrame({
    'x': x,
    'y_linear': y_linear,
    'y_nonlinear': y_nonlinear
})

print("\nCorrelation for linear vs non-linear relationships:")
print("Linear relationship:")
print(f"  Pearson: {nonlinear_df['x'].corr(nonlinear_df['y_linear'], method='pearson'):.4f}")
print(f"  Spearman: {nonlinear_df['x'].corr(nonlinear_df['y_linear'], method='spearman'):.4f}")

print("Non-linear relationship:")
print(f"  Pearson: {nonlinear_df['x'].corr(nonlinear_df['y_nonlinear'], method='pearson'):.4f}")
print(f"  Spearman: {nonlinear_df['x'].corr(nonlinear_df['y_nonlinear'], method='spearman'):.4f}")

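4. Kendall Tau Correlation Coefficient

4.1 Kendall Correlation Analysis

Kendall's tau is based on the numbers of concordant and discordant pairs of observations, which makes it more robust to outliers than the Pearson coefficient.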
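
The pair-counting idea can be sketched directly. The example below counts concordant and discordant pairs by brute force and compares the result with pandas. It assumes continuous data without ties, in which case tau is simply (concordant - discordant) / total pairs; the data itself is synthetic.

```python
from itertools import combinations

import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
x = pd.Series(rng.normal(size=30))
y = 0.7 * x + pd.Series(rng.normal(size=30))

concordant = discordant = 0
for i, j in combinations(range(len(x)), 2):
    # A pair is concordant when x and y move in the same direction
    s = np.sign(x.iloc[i] - x.iloc[j]) * np.sign(y.iloc[i] - y.iloc[j])
    if s > 0:
        concordant += 1
    elif s < 0:
        discordant += 1

n_pairs = len(x) * (len(x) - 1) // 2
tau_manual = (concordant - discordant) / n_pairs
tau_pandas = x.corr(y, method="kendall")

print(f"manual: {tau_manual:.6f}, pandas: {tau_pandas:.6f}")
```

The brute-force loop is quadratic in the number of rows, so it is only meant as an explanation; use `corr(method='kendall')` on real data.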

python

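# 1. Compute the Kendall correlation matrix
print("Kendall correlation matrix:")
kendall_corr = df.corr(method='kendall')
print(kendall_corr.round(3))

# 2. Compare the three correlation coefficients
print("\nComparison of the three correlation coefficients:")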
corr_comparison = pd.DataFrame({
    'Pearson': df.corr(method='pearson')['ice_cream_sales'],
    'Spearman': df.corr(method='spearman')['ice_cream_sales'],
    'Kendall': df.corr(method='kendall')['ice_cream_sales']
})
print(corr_comparison.round(4))

# 3. Visualize how the three coefficients differ
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for i, method in enumerate(['pearson', 'spearman', 'kendall']):
    corr_matrix = df.corr(method=method)
    sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0,
                ax=axes[i], cbar_kws={'shrink': 0.8})
    axes[i].set_title(f'{method.capitalize()} correlation matrix')
    axes[i].tick_params(axis='x', rotation=45)
    axes[i].tick_params(axis='y', rotation=0)

plt.tight_layout()
plt.show()

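5. Partial Correlation Analysis

5.1 Computing Partial Correlation Coefficients

Partial correlation analysis controls for the influence of other variables and measures the correlation that remains between two variables once that influence is removed.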
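
As a compact alternative view, partial correlations given all remaining columns can be read off the inverse of the correlation matrix: rho_ij = -P_ij / sqrt(P_ii * P_jj). The sketch below uses this identity on a toy dataset in which a hypothetical shared driver `z` induces a correlation between `x` and `y`; all names and coefficients are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
n = 500
z = rng.normal(size=n)              # shared driver (the control variable)
x = 0.8 * z + rng.normal(size=n)
y = 0.8 * z + rng.normal(size=n)
toy = pd.DataFrame({"x": x, "y": y, "z": z})

# P is the inverse of the correlation matrix (the precision matrix)
precision = np.linalg.inv(toy.corr().to_numpy())
d = np.sqrt(np.diag(precision))
partial_arr = -precision / np.outer(d, d)
np.fill_diagonal(partial_arr, 1.0)
partial = pd.DataFrame(partial_arr, index=toy.columns, columns=toy.columns)

print(f"Simple correlation x-y:        {toy['x'].corr(toy['y']):.4f}")
print(f"Partial correlation x-y | z:   {partial.loc['x', 'y']:.4f}")
```

The simple correlation is clearly positive, while the partial correlation given `z` is close to zero. Note that this formula conditions on every other column in the frame, so select only the columns you intend to control for.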

python

from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression

def partial_correlation(df, x_col, y_col, control_cols):
    """
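    Compute the partial correlation between x_col and y_col,
    controlling for control_cols.

    The p-value comes from a plain Pearson test on the residuals and does not
    account for the degrees of freedom used by the controls, so treat it as
    approximate.
    """
    # Prepare the data
    X_control = df[control_cols].values
    x_values = df[x_col].values
    y_values = df[y_col].values
    
    # Regress x and y on the control variables to remove their influence
    reg_x = LinearRegression().fit(X_control, x_values)
    reg_y = LinearRegression().fit(X_control, y_values)
    
    # Compute the residuals
    x_residuals = x_values - reg_x.predict(X_control)
    y_residuals = y_values - reg_y.predict(X_control)
    
    # Correlate the residuals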
    partial_corr, p_value = pearsonr(x_residuals, y_residuals)
    
    return partial_corr, p_value

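# Compute partial correlations
print("Partial correlation analysis:")

# Partial correlation of temperature and air conditioner usage, controlling for humidity and pressure
partial_corr, p_val = partial_correlation(
    df, 'temperature', 'air_conditioner_usage', ['humidity', 'pressure']
)
print(f"Partial correlation of temperature and air conditioner usage (controlling for humidity, pressure): {partial_corr:.4f}, p={p_val:.4g}")

# Simple vs partial correlation
simple_corr = df['temperature'].corr(df['air_conditioner_usage'])
print(f"Simple correlation: {simple_corr:.4f}")
print(f"Partial correlation: {partial_corr:.4f}")
print(f"Difference: {abs(simple_corr - partial_corr):.4f}")

# Several partial correlations at once
print("\nPartial correlations for several variables:")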
partial_results = []

variables = ['temperature', 'humidity', 'pressure', 'wind_speed', 'rainfall']
for var in variables:
    if var != 'temperature':
        control_vars = [v for v in variables if v not in [var, 'temperature']]
        partial_corr, p_val = partial_correlation(
            df, 'temperature', var, control_vars
        )
        simple_corr = df['temperature'].corr(df[var])
        
        partial_results.append({
            'Variable': var,
            'Simple_Correlation': simple_corr,
            'Partial_Correlation': partial_corr,
            'P_Value': p_val,
            'Difference': abs(simple_corr - partial_corr)
        })

partial_df = pd.DataFrame(partial_results)
print(partial_df.round(4))

6. Visualizing the Correlation Matrix

6.1 Heatmaps

python

# 1. Heatmap variants on a 2x2 grid
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Basic heatmap
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', center=0,
            ax=axes[0,0], cbar_kws={'shrink': 0.8})
axes[0,0].set_title('Basic correlation heatmap')

# Show only the lower triangle
mask = np.triu(np.ones_like(df.corr(), dtype=bool))
sns.heatmap(df.corr(), mask=mask, annot=True, cmap='coolwarm', center=0,
            ax=axes[0,1], cbar_kws={'shrink': 0.8})
axes[0,1].set_title('Lower-triangle correlation heatmap')

# Custom colors and number format
corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True, fmt='.2f', 
            cmap='RdYlBu_r', center=0, square=True,
            ax=axes[1,0], cbar_kws={'shrink': 0.8})
axes[1,0].set_title('Heatmap with custom formatting')

# Highlight stronger correlations by zeroing out everything below the cutoff
strong_corr = corr_matrix.copy()
strong_corr[abs(strong_corr) < 0.3] = 0
sns.heatmap(strong_corr, annot=True, cmap='coolwarm', center=0,
            ax=axes[1,1], cbar_kws={'shrink': 0.8})
axes[1,1].set_title('Strong correlations only (|r| ≥ 0.3)')

plt.tight_layout()
plt.show()

6.2 Scatter-Plot Matrix

python

# Build a scatter-plot matrix
from pandas.plotting import scatter_matrix

# Select the main variables
main_vars = ['temperature', 'humidity', 'ice_cream_sales', 
             'umbrella_sales', 'air_conditioner_usage']

# scatter_matrix creates its own figure, so a separate plt.subplots() call
# would only leave an empty extra figure behind
scatter_matrix(df[main_vars], alpha=0.6, figsize=(12, 10), diagonal='hist')
plt.suptitle('Scatter-plot matrix of the variables', y=0.95)
plt.tight_layout()
plt.show()

# The same idea with seaborn's pairplot
sns.pairplot(df[main_vars], diag_kind='hist', plot_kws={'alpha': 0.6})
plt.suptitle('Seaborn pair plot', y=1.02)
plt.show()

7. Practical Applications of Correlation Analysis

7.1 Feature Selection

python

def feature_correlation_analysis(df, target_col, threshold=0.1):
    """
    Correlation-based feature selection analysis.
    """
    # Absolute correlation of every column with the target
    correlations = df.corr()[target_col].abs().sort_values(ascending=False)
    
    # Drop the target's correlation with itself
    correlations = correlations.drop(target_col)
    
    # Split the features by correlation strength
    strong_features = correlations[correlations >= threshold]
    weak_features = correlations[correlations < threshold]
    
    print(f"Correlation with {target_col} (threshold: {threshold}):")
    print(f"\nStrongly correlated features ({len(strong_features)}):")
    for feature, corr in strong_features.items():
        print(f"  {feature}: {corr:.4f}")
    
    print(f"\nWeakly correlated features ({len(weak_features)}):")
    for feature, corr in weak_features.items():
        print(f"  {feature}: {corr:.4f}")
    
    return strong_features.index.tolist(), weak_features.index.tolist()

# Feature selection for ice cream sales
strong_features, weak_features = feature_correlation_analysis(
    df, 'ice_cream_sales', threshold=0.3
)

# Check for multicollinearity among the features
def multicollinearity_check(df, features, threshold=0.8):
    """
    Find feature pairs whose absolute correlation is at or above the threshold.
    """
    corr_matrix = df[features].corr()
    high_corr_pairs = []
    
    for i in range(len(features)):
        for j in range(i+1, len(features)):
            corr_val = abs(corr_matrix.iloc[i, j])
            if corr_val >= threshold:
                high_corr_pairs.append({
                    'Feature1': features[i],
                    'Feature2': features[j],
                    'Correlation': corr_val
                })
    
    return pd.DataFrame(high_corr_pairs)

print("\nMulticollinearity check:")
all_features = [col for col in df.columns if col != 'ice_cream_sales']
multicollinearity_df = multicollinearity_check(df, all_features, threshold=0.7)
if len(multicollinearity_df) > 0:
    print(multicollinearity_df)
else:
    print("No highly correlated feature pairs found")
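
Once highly correlated pairs have been found, one common follow-up is to keep a single feature from each pair. The helper below, `drop_highly_correlated`, is a hypothetical sketch of that step and not part of pandas. It drops the later column of every pair whose absolute correlation reaches the threshold, which is a simple heuristic rather than the only reasonable rule.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(frame, threshold=0.8):
    """Hypothetical helper: drop the later column of every pair with |r| >= threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] >= threshold).any()]
    return frame.drop(columns=to_drop), to_drop

# Toy data: "a_copy" nearly duplicates "a", while "b" is independent
rng = np.random.default_rng(6)
a = rng.normal(size=300)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.05, size=300),
    "b": rng.normal(size=300),
})

reduced, dropped = drop_highly_correlated(toy)
print("Dropped:", dropped)
print("Remaining:", list(reduced.columns))
```

Which member of a pair to keep is a modeling decision; in practice you might prefer the feature that is cheaper to collect or more strongly related to the target.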

7.2 Business Insight Analysis

python

def business_correlation_insights(df):
    """
    Extract business insights from the correlation analysis.
    """
    print("=== Business Correlation Insights ===")
    
    # 1. Sales correlations
    print("\n1. Sales drivers:")
    
    # Factors associated with ice cream sales
    ice_cream_factors = df.corr()['ice_cream_sales'].abs().sort_values(ascending=False)
    ice_cream_factors = ice_cream_factors.drop('ice_cream_sales')
    
    print("   Top factors for ice cream sales:")
    for factor, corr in ice_cream_factors.head(3).items():
        direction = "positive" if df[factor].corr(df['ice_cream_sales']) > 0 else "negative"
        print(f"     {factor}: {corr:.3f} ({direction})")
    
    # Factors associated with umbrella sales
    umbrella_factors = df.corr()['umbrella_sales'].abs().sort_values(ascending=False)
    umbrella_factors = umbrella_factors.drop('umbrella_sales')
    
    print("\n   Top factors for umbrella sales:")
    for factor, corr in umbrella_factors.head(3).items():
        direction = "positive" if df[factor].corr(df['umbrella_sales']) > 0 else "negative"
        print(f"     {factor}: {corr:.3f} ({direction})")
    
    # 2. Correlations among weather factors
    print("\n2. Relationships among weather factors:")
    weather_vars = ['temperature', 'humidity', 'pressure', 'wind_speed', 'rainfall']
    weather_corr = df[weather_vars].corr()
    
    # Collect every pair of weather factors with its correlation, to rank them below
    weather_pairs = []
    for i in range(len(weather_vars)):
        for j in range(i+1, len(weather_vars)):
            corr_val = weather_corr.iloc[i, j]
            weather_pairs.append({
                'Pair': f"{weather_vars[i]} - {weather_vars[j]}",
                'Correlation': corr_val
            })
    
    weather_pairs_df = pd.DataFrame(weather_pairs)
    weather_pairs_df = weather_pairs_df.reindex(
        weather_pairs_df['Correlation'].abs().sort_values(ascending=False).index
    )
    
    print("   Weather factor pairs ranked by correlation:")
    for _, row in weather_pairs_df.head(3).iterrows():
        direction = "positive" if row['Correlation'] > 0 else "negative"
        print(f"     {row['Pair']}: {row['Correlation']:.3f} ({direction})")
    
    # 3. Recommendations
    print("\n3. Business recommendations:")
    
    temp_ice_corr = df['temperature'].corr(df['ice_cream_sales'])
    rain_umbrella_corr = df['rainfall'].corr(df['umbrella_sales'])
    
    if temp_ice_corr > 0.5:
        print(f"   • Ice cream sales are strongly positively correlated with temperature ({temp_ice_corr:.3f}); consider adjusting inventory based on weather forecasts")
    
    if rain_umbrella_corr > 0.5:
        print(f"   • Umbrella sales are strongly positively correlated with rainfall ({rain_umbrella_corr:.3f}); consider tracking rainfall forecasts")
    
    # Air conditioner usage
    ac_temp_corr = df['temperature'].corr(df['air_conditioner_usage'])
    ac_humidity_corr = df['humidity'].corr(df['air_conditioner_usage'])
    
    if ac_temp_corr > 0.3 or ac_humidity_corr > 0.3:
        print(f"   • Air conditioner usage is correlated with temperature ({ac_temp_corr:.3f}) and humidity ({ac_humidity_corr:.3f})")
        print("     Utilities could use weather forecasts to anticipate electricity demand")

# Run the business insight analysis
business_correlation_insights(df)

8. Correlation Analysis for Time Series

8.1 Lagged Correlation

python

# Create the time series data
dates = pd.date_range('2023-01-01', periods=365, freq='D')
np.random.seed(42)

# Generate series with a seasonal component and a trend
time_trend = np.arange(365) * 0.01
seasonal = 10 * np.sin(2 * np.pi * np.arange(365) / 365.25)
noise = np.random.normal(0, 2, 365)

ts_data = pd.DataFrame({
    'date': dates,
    'temperature': 20 + seasonal + noise,
    'sales': 100 + time_trend * 50 + seasonal * 5 + noise * 3
})

# Add a lagged-sales variable (yesterday's sales drive today's inventory demand)
ts_data['inventory_demand'] = ts_data['sales'].shift(1) * 0.8 + np.random.normal(0, 5, 365)

ts_data.set_index('date', inplace=True)

print("Time series data:")
print(ts_data.head(10))

def lag_correlation_analysis(df, col1, col2, max_lag=30):
    """
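    Compute lagged correlations.

    For each lag, col1 at time t is correlated with col2 at time t - lag
    (col2 is shifted forward by `lag` rows).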
    """
    correlations = []
    
    for lag in range(max_lag + 1):
        if lag == 0:
            corr = df[col1].corr(df[col2])
        else:
            corr = df[col1].corr(df[col2].shift(lag))
        
        correlations.append({
            'Lag': lag,
            'Correlation': corr
        })
    
    return pd.DataFrame(correlations)

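# Lagged correlation between temperature and sales
lag_corr_df = lag_correlation_analysis(ts_data, 'temperature', 'sales', max_lag=14)

print("\nLagged correlation between temperature and sales:")
print(lag_corr_df.head(10))

# Find the lag with the strongest correlation
max_corr_lag = lag_corr_df.loc[lag_corr_df['Correlation'].abs().idxmax()]
# The selected row is a float Series, so cast the lag back to int for display
print(f"\nStrongest correlation: lag of {int(max_corr_lag['Lag'])} days, correlation {max_corr_lag['Correlation']:.4f}")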

# Plot the lagged correlations
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.plot(lag_corr_df['Lag'], lag_corr_df['Correlation'], 'o-')
plt.axhline(y=0, color='r', linestyle='--', alpha=0.5)
plt.xlabel('Lag (days)')
plt.ylabel('Correlation coefficient')
plt.title('Lagged correlation: temperature vs sales')
plt.grid(True, alpha=0.3)

# Plot the original time series
plt.subplot(1, 2, 2)
plt.plot(ts_data.index, ts_data['temperature'], label='Temperature', alpha=0.7)
plt.plot(ts_data.index, ts_data['sales']/10, label='Sales / 10', alpha=0.7)
plt.xlabel('Date')
plt.ylabel('Value')
plt.title('Temperature and sales time series')
plt.legend()
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()
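
The direction of the shift matters when you read a lagged correlation. The sketch below builds a series with a known delay and recovers it, as a sanity check on the sign convention. The names `driver`, `response`, and `true_lag` are assumptions for this example and do not refer to the chapter's dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n, true_lag = 400, 3
driver = pd.Series(rng.normal(size=n))
# response[t] depends on driver[t - true_lag], plus noise
response = driver.shift(true_lag) + pd.Series(rng.normal(scale=0.3, size=n))

# Shift the leading series forward so that driver[t - lag] lines up with response[t]
lag_corr = pd.Series(
    {lag: driver.shift(lag).corr(response) for lag in range(0, 8)}
)
best_lag = int(lag_corr.abs().idxmax())

print(lag_corr.round(3))
print(f"Recovered lag: {best_lag}")
```

Because `shift(lag)` moves values forward in time, shifting the series that leads is what aligns a cause with its delayed effect. `corr()` skips the NaN rows that the shift introduces.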

8.2 Rolling Correlation

python

def rolling_correlation_analysis(df, col1, col2, window=30):
    """
    Compute the rolling correlation between two columns.
    """
    rolling_corr = df[col1].rolling(window=window).corr(df[col2])
    return rolling_corr

# 30-day rolling correlation
rolling_corr = rolling_correlation_analysis(ts_data, 'temperature', 'sales', window=30)

print("Rolling correlation statistics:")
print(f"Mean correlation: {rolling_corr.mean():.4f}")
print(f"Standard deviation: {rolling_corr.std():.4f}")
print(f"Maximum correlation: {rolling_corr.max():.4f}")
print(f"Minimum correlation: {rolling_corr.min():.4f}")

# Plot the rolling correlation
fig, axes = plt.subplots(2, 1, figsize=(12, 8))

# Original data
axes[0].plot(ts_data.index, ts_data['temperature'], label='Temperature', alpha=0.7)
axes[0].plot(ts_data.index, ts_data['sales']/10, label='Sales / 10', alpha=0.7)
axes[0].set_ylabel('Value')
axes[0].set_title('Original time series')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Rolling correlation
axes[1].plot(ts_data.index, rolling_corr, color='red', linewidth=2)
axes[1].axhline(y=0, color='black', linestyle='--', alpha=0.5)
axes[1].axhline(y=rolling_corr.mean(), color='blue', linestyle='--', alpha=0.5, 
                label=f'Mean: {rolling_corr.mean():.3f}')
axes[1].set_xlabel('Date')
axes[1].set_ylabel('Correlation coefficient')
axes[1].set_title('30-day rolling correlation')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
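
A detail worth knowing about rolling correlations is how the start of the series is handled. By default a value is produced only once the window is full, so the first `window - 1` results are NaN; the `min_periods` argument relaxes this. A minimal sketch on synthetic data, with variable names chosen for the example:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(8)
idx = pd.date_range("2023-01-01", periods=120, freq="D")
a = pd.Series(rng.normal(size=120), index=idx)
b = 0.5 * a + pd.Series(rng.normal(size=120), index=idx)

window = 30
strict = a.rolling(window=window).corr(b)                   # needs a full window
relaxed = a.rolling(window=window, min_periods=10).corr(b)  # starts after 10 points

print("NaN values with the default setting:", strict.isna().sum())
print("NaN values with min_periods=10:", relaxed.isna().sum())
```

Early values computed from only a few observations are noisy, so a small `min_periods` trades stability for coverage.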

9. Caveats in Correlation Analysis

9.1 Correlation vs Causation

python

def correlation_vs_causation_demo():
    """
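    Demonstrate the difference between correlation and causation.
    """
    print("=== Correlation vs Causation Demo ===")
    
    # Build an example of spurious correlation
    np.random.seed(42)
    n = 1000
    
    # Spurious correlation produced by a common cause
    economic_growth = np.random.normal(0, 1, n)
    ice_cream_consumption = economic_growth * 0.8 + np.random.normal(0, 0.5, n)
    crime_rate = economic_growth * 0.6 + np.random.normal(0, 0.7, n)
    
    spurious_df = pd.DataFrame({
        'economic_growth': economic_growth,
        'ice_cream_consumption': ice_cream_consumption,
        'crime_rate': crime_rate
    })
    
    # Compute the correlation
    ice_crime_corr = spurious_df['ice_cream_consumption'].corr(spurious_df['crime_rate'])
    
    print(f"Correlation between ice cream consumption and crime rate: {ice_crime_corr:.4f}")
    print("Note: this correlation is spurious. Economic growth drives both variables")
    
    # Partial correlation after controlling for economic growth
    partial_corr, p_val = partial_correlation(
        spurious_df, 'ice_cream_consumption', 'crime_rate', ['economic_growth']
    )
    
    print(f"Partial correlation controlling for economic growth: {partial_corr:.4f}")
    print("The partial correlation is close to 0: once economic growth is controlled for, ice cream consumption and crime rate are unrelated")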
    
    return spurious_df

spurious_df = correlation_vs_causation_demo()

9.2 The Effect of Outliers on Correlation

python

def outlier_impact_on_correlation():
    """
    Demonstrate how outliers affect correlation.
    """
    print("\n=== Effect of Outliers on Correlation ===")
    
    # Create well-behaved data
    np.random.seed(42)
    n = 100
    x_normal = np.random.normal(0, 1, n)
    y_normal = 0.5 * x_normal + np.random.normal(0, 0.5, n)
    
    # Add outliers
    x_with_outlier = np.append(x_normal, [5, 6])
    y_with_outlier = np.append(y_normal, [8, 10])
    
    # Compute the correlations
    corr_normal = np.corrcoef(x_normal, y_normal)[0, 1]
    corr_with_outlier = np.corrcoef(x_with_outlier, y_with_outlier)[0, 1]
    
    print(f"Correlation without outliers: {corr_normal:.4f}")
    print(f"Correlation with outliers: {corr_with_outlier:.4f}")
    print(f"Change caused by the outliers: {abs(corr_with_outlier - corr_normal):.4f}")
    
    # Visualize
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    
    axes[0].scatter(x_normal, y_normal, alpha=0.6)
    axes[0].set_title(f'Without outliers (r={corr_normal:.3f})')
    axes[0].set_xlabel('X')
    axes[0].set_ylabel('Y')
    axes[0].grid(True, alpha=0.3)
    
    axes[1].scatter(x_normal, y_normal, alpha=0.6, label='Regular data')
    axes[1].scatter([5, 6], [8, 10], color='red', s=100, label='Outliers')
    axes[1].set_title(f'With outliers (r={corr_with_outlier:.3f})')
    axes[1].set_xlabel('X')
    axes[1].set_ylabel('Y')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    return x_normal, y_normal, x_with_outlier, y_with_outlier

outlier_data = outlier_impact_on_correlation()
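
Rank-based coefficients are usually less sensitive to a few extreme points. The sketch below compares how much each of the three coefficients moves when a single extreme point that contradicts the trend is appended. It uses its own synthetic data and an illustrative outlier at (20, -20), which differ from the function above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(9)
x = rng.normal(size=100)
y = x + rng.normal(scale=0.3, size=100)
clean = pd.DataFrame({"x": x, "y": y})

# Append one extreme point that contradicts the trend (illustrative values)
outlier = pd.DataFrame({"x": [20.0], "y": [-20.0]})
dirty = pd.concat([clean, outlier], ignore_index=True)

rows = []
for method in ["pearson", "spearman", "kendall"]:
    before = clean["x"].corr(clean["y"], method=method)
    after = dirty["x"].corr(dirty["y"], method=method)
    rows.append({"method": method, "clean": before,
                 "with_outlier": after, "shift": abs(after - before)})

shifts = pd.DataFrame(rows).set_index("method")
print(shifts.round(3))
```

A single point is enough to move the Pearson coefficient substantially, while Spearman and Kendall shift only slightly because the outlier changes one rank, not the scale of the data.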

10. Best Practices for Correlation Analysis

10.1 A Complete Correlation Analysis Workflow

python

def comprehensive_correlation_analysis(df, target_col=None):
    """
    End-to-end correlation analysis workflow.
    """
    print("=== Comprehensive Correlation Analysis Report ===")
    
    # 1. Basic information about the data
    print(f"\n1. Basic data information:")
    print(f"   Shape: {df.shape}")
    print(f"   Numeric columns: {len(df.select_dtypes(include=[np.number]).columns)}")
    print(f"   Total missing values: {df.isnull().sum().sum()}")
    
    # 2. Correlation matrix
    numeric_df = df.select_dtypes(include=[np.number])
    
    print(f"\n2. Correlation matrix statistics:")
    corr_matrix = numeric_df.corr()
    
    # Keep the upper triangle only (avoids duplicates and self-correlations)
    upper_triangle = corr_matrix.where(
        np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
    )
    
    # dropna() keeps the result the same whether or not stack() drops NaN itself
    correlations = upper_triangle.stack().dropna().reset_index()
    correlations.columns = ['Variable1', 'Variable2', 'Correlation']
    correlations['Abs_Correlation'] = correlations['Correlation'].abs()
    
    print(f"   Variable pairs: {len(correlations)}")
    print(f"   Strongly correlated pairs (|r| > 0.7): {len(correlations[correlations['Abs_Correlation'] > 0.7])}")
    print(f"   Moderately correlated pairs (0.3 < |r| ≤ 0.7): {len(correlations[(correlations['Abs_Correlation'] > 0.3) & (correlations['Abs_Correlation'] <= 0.7)])}")
    print(f"   Weakly correlated pairs (|r| ≤ 0.3): {len(correlations[correlations['Abs_Correlation'] <= 0.3])}")
    
    # 3. Strongest correlations
    print(f"\n3. Strongest correlations (Top 5):")
    top_correlations = correlations.nlargest(5, 'Abs_Correlation')
    for _, row in top_correlations.iterrows():
        direction = "positive" if row['Correlation'] > 0 else "negative"
        print(f"   {row['Variable1']} - {row['Variable2']}: {row['Correlation']:.4f} ({direction})")
    
    # 4. If a target variable was given
    if target_col and target_col in numeric_df.columns:
        print(f"\n4. Correlation with target variable '{target_col}':")
        target_corr = numeric_df.corr()[target_col].abs().sort_values(ascending=False)
        target_corr = target_corr.drop(target_col)  # drop the self-correlation
        
        print("   Strongly correlated features (|r| > 0.5):")
        strong_features = target_corr[target_corr > 0.5]
        if len(strong_features) > 0:
            for feature, corr in strong_features.items():
                direction = "positive" if numeric_df[feature].corr(numeric_df[target_col]) > 0 else "negative"
                print(f"     {feature}: {corr:.4f} ({direction})")
        else:
            print("     No strongly correlated features")
    
    # 5. Multicollinearity check
    print(f"\n5. Multicollinearity check:")
    high_corr = correlations[correlations['Abs_Correlation'] > 0.8]
    if len(high_corr) > 0:
        print("   Highly correlated variable pairs found:")
        for _, row in high_corr.iterrows():
            print(f"     {row['Variable1']} - {row['Variable2']}: {row['Correlation']:.4f}")
        print("   Suggestion: consider removing one variable from each pair or applying principal component analysis")
    else:
        print("   No serious multicollinearity found")
    
    # 6. Distribution of the correlations
    print(f"\n6. Correlation distribution statistics:")
    corr_stats = correlations['Correlation'].describe()
    print(f"   Mean correlation: {corr_stats['mean']:.4f}")
    print(f"   Standard deviation: {corr_stats['std']:.4f}")
    print(f"   Range: [{corr_stats['min']:.4f}, {corr_stats['max']:.4f}]")
    
    return correlations, corr_matrix

# Run the comprehensive analysis
correlations_result, corr_matrix_result = comprehensive_correlation_analysis(
    df, target_col='ice_cream_sales'
)

10.2 Correlation Analysis Checklist

python

def correlation_analysis_checklist(df):
    """
    Correlation analysis checklist.
    """
    print("=== Correlation Analysis Checklist ===")
    
    checklist = {
        "Data preprocessing": [
            "✓ Check for missing values",
            "✓ Identify outliers",
            "✓ Confirm data types",
            "✓ Handle categorical variables"
        ],
        "Computing correlations": [
            "✓ Choose an appropriate correlation coefficient",
            "✓ Check the linearity assumption",
            "✓ Consider non-linear relationships",
            "✓ Compute statistical significance"
        ],
        "Interpreting results": [
            "✓ Distinguish correlation from causation",
            "✓ Consider the influence of third variables",
            "✓ Check for multicollinearity",
            "✓ Verify that the results make business sense"
        ],
        "Visualization": [
            "✓ Plot a correlation heatmap",
            "✓ Create a scatter-plot matrix",
            "✓ Inspect the data distributions",
            "✓ Annotate key findings"
        ]
    }
    
    for category, items in checklist.items():
        print(f"\n{category}:")
        for item in items:
            print(f"  {item}")
    
    # Run the actual checks
    print("\n=== Results for the Current Dataset ===")
    
    # Missing values
    missing_count = df.isnull().sum().sum()
    print(f"Missing values: {'✓ passed' if missing_count == 0 else f'⚠ found {missing_count} missing values'}")
    
    # Data types
    numeric_cols = len(df.select_dtypes(include=[np.number]).columns)
    total_cols = len(df.columns)
    print(f"Data types: {numeric_cols}/{total_cols} columns are numeric")
    
    # Outliers (IQR method)
    numeric_df = df.select_dtypes(include=[np.number])
    outlier_count = 0
    for col in numeric_df.columns:
        Q1 = numeric_df[col].quantile(0.25)
        Q3 = numeric_df[col].quantile(0.75)
        IQR = Q3 - Q1
        outliers = ((numeric_df[col] < (Q1 - 1.5 * IQR)) | 
                   (numeric_df[col] > (Q3 + 1.5 * IQR))).sum()
        outlier_count += outliers
    
    print(f"Outliers: {'✓ passed' if outlier_count == 0 else f'⚠ found {outlier_count} outliers'}")
    
    # Multicollinearity
    if len(numeric_df.columns) > 1:
        corr_matrix = numeric_df.corr()
        high_corr_count = 0
        for i in range(len(corr_matrix.columns)):
            for j in range(i+1, len(corr_matrix.columns)):
                if abs(corr_matrix.iloc[i, j]) > 0.8:
                    high_corr_count += 1
        
        print(f"Multicollinearity: {'✓ passed' if high_corr_count == 0 else f'⚠ found {high_corr_count} highly correlated variable pairs'}")

# Run the checklist
correlation_analysis_checklist(df)

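Chapter Summary

This chapter covered the main methods and techniques for correlation analysis with Pandas:

Correlation coefficient types: the characteristics and use cases of the Pearson, Spearman, and Kendall coefficients
Partial correlation analysis: measuring the correlation that remains after controlling for other variables
Visualization methods: heatmaps, scatter-plot matrices, and related techniques
Practical applications: feature selection, business insights, and multicollinearity checks
Time series correlation: lagged and rolling correlation analysis
Caveats: the difference between correlation and causation, and the effect of outliers
Best practices: an end-to-end analysis workflow and a checklist

Correlation analysis is a foundational skill in data science. Understanding and applying it correctly helps us:

Discover patterns in how variables relate
Perform feature selection and dimensionality reduction
Identify multicollinearity problems
Generate business insights and hypotheses
Prepare for further modeling work

Exercises

Run a complete correlation analysis on a real dataset
Compare how the different correlation coefficients behave on non-linear relationships
Implement an automated correlation analysis report generator
Analyze lagged correlations in time series data
Design a scheme for detecting and handling multicollinearity

In the next chapter we will look at sorting and aggregation in Pandas and explore more advanced data manipulation techniques.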
