Pandas Data Sorting and Aggregation
Data sorting and aggregation are core operations in data analysis. Sorting helps us understand data distribution and patterns, while aggregation allows us to extract meaningful statistical information from large datasets. This chapter will detail various sorting and aggregation techniques in Pandas.
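Before diving into the details, a minimal preview of both operations on a tiny hypothetical DataFrame (the values and column names here are illustrative, not part of this chapter's dataset):

```python
import pandas as pd

# Tiny illustrative dataset
toy = pd.DataFrame({
    'department': ['IT', 'HR', 'IT', 'HR'],
    'salary': [9000, 6000, 7000, 6500],
})

# Sorting: reorder rows by one or more columns
print(toy.sort_values('salary', ascending=False))

# Aggregation: collapse each group to a summary statistic
print(toy.groupby('department')['salary'].mean())
```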
1. Data Sorting Basics
1.1 Single Column Sorting
python
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
import seaborn as sns
# Set plot style
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10
# Create sample dataset
np.random.seed(42)
n_employees = 1000
# Generate employee data
employee_data = {
    'employee_id': range(1, n_employees + 1),
    'name': [f'Employee{i:04d}' for i in range(1, n_employees + 1)],
    'department': np.random.choice(['IT', 'HR', 'Finance', 'Marketing', 'Operations'], n_employees),
    'position': np.random.choice(['Junior', 'Mid-level', 'Senior', 'Expert', 'Manager'], n_employees),
    'salary': np.random.normal(8000, 2000, n_employees).astype(int),
    'age': np.random.randint(22, 60, n_employees),
    'experience_years': np.random.randint(0, 20, n_employees),
    'performance_score': np.random.uniform(60, 100, n_employees),
    'join_date': pd.date_range('2015-01-01', periods=n_employees, freq='D')
}
# Ensure salary is positive
employee_data['salary'] = np.maximum(employee_data['salary'], 3000)
df = pd.DataFrame(employee_data)
print("Employee Dataset:")
print(df.head())
print(f"\nData Shape: {df.shape}")
# 1. Basic sorting - sort_values()
print("\n=== Single Column Sorting ===")
# Sort by salary ascending
print("Sort by Salary Ascending (Top 5):")
salary_asc = df.sort_values('salary')
print(salary_asc[['name', 'department', 'salary']].head())
# Sort by salary descending
print("\nSort by Salary Descending (Top 5):")
salary_desc = df.sort_values('salary', ascending=False)
print(salary_desc[['name', 'department', 'salary']].head())
# Sort by performance score
print("\nSort by Performance Score Descending (Top 5):")
performance_desc = df.sort_values('performance_score', ascending=False)
print(performance_desc[['name', 'department', 'performance_score']].head())
1.2 Multi-Column Sorting
python
print("\n=== Multi-Column Sorting ===")
# Sort by department and salary
print("Sort by Department (Ascending) and Salary (Descending):")
multi_sort = df.sort_values(['department', 'salary'], ascending=[True, False])
print(multi_sort[['name', 'department', 'salary']].head(10))
# Complex multi-column sorting
print("\nSort by Department, Position, and Performance Score:")
complex_sort = df.sort_values(
    ['department', 'position', 'performance_score'],
    ascending=[True, True, False]
)
print(complex_sort[['name', 'department', 'position', 'performance_score']].head(10))
# Sorting with missing values
df_with_na = df.copy()
df_with_na.loc[np.random.choice(df.index, 50), 'salary'] = np.nan
print("\nSorting with Missing Values:")
# Missing values at the end
na_last = df_with_na.sort_values('salary', na_position='last')
print("Missing Values at End:")
print(na_last[['name', 'salary']].tail())
# Missing values at the beginning
na_first = df_with_na.sort_values('salary', na_position='first')
print("\nMissing Values at Beginning:")
print(na_first[['name', 'salary']].head())
1.3 Index Sorting
python
print("\n=== Index Sorting ===")
# Set multi-level index
df_indexed = df.set_index(['department', 'position'])
# Sort by index
print("Sort by Index:")
index_sorted = df_indexed.sort_index()
print(index_sorted.head(10))
# Sort by specific level
print("\nSort by Second Level Index (Position):")
level_sorted = df_indexed.sort_index(level=1)
print(level_sorted.head(10))
# Descending index sort
print("\nSort by Index Descending:")
index_desc = df_indexed.sort_index(ascending=False)
print(index_desc.head(10))
2. Advanced Sorting Techniques
2.1 Custom Sorting
python
print("\n=== Custom Sorting ===")
# Define position hierarchy
position_order = ['Junior', 'Mid-level', 'Senior', 'Expert', 'Manager']
# Use Categorical for custom sorting
df['position_cat'] = pd.Categorical(df['position'], categories=position_order, ordered=True)
print("Sort by Position Hierarchy:")
position_sorted = df.sort_values('position_cat')
print(position_sorted[['name', 'position', 'salary']].head(10))
# Using key parameter for custom sorting (Pandas 1.1+)
def custom_sort_key(series):
    """Custom sort key function"""
    position_map = {'Junior': 1, 'Mid-level': 2, 'Senior': 3, 'Expert': 4, 'Manager': 5}
    return series.map(position_map)
print("\nUsing Custom Key Function:")
custom_sorted = df.sort_values('position', key=custom_sort_key)
print(custom_sorted[['name', 'position', 'salary']].head(10))
# Complex custom sorting
def salary_performance_score(df_subset):
    """Composite salary and performance score"""
    normalized_salary = (df_subset['salary'] - df_subset['salary'].min()) / (df_subset['salary'].max() - df_subset['salary'].min())
    normalized_performance = (df_subset['performance_score'] - df_subset['performance_score'].min()) / (df_subset['performance_score'].max() - df_subset['performance_score'].min())
    return normalized_salary * 0.6 + normalized_performance * 0.4
df['composite_score'] = salary_performance_score(df)
print("\nSort by Composite Score (Top 10):")
composite_sorted = df.sort_values('composite_score', ascending=False)
print(composite_sorted[['name', 'salary', 'performance_score', 'composite_score']].head(10))
2.2 Within-Group Sorting
python
print("\n=== Within-Group Sorting ===")
# Sort by salary within each department
print("Salary Ranking within Departments:")
df['dept_salary_rank'] = df.groupby('department')['salary'].rank(ascending=False)
dept_ranking = df.sort_values(['department', 'dept_salary_rank'])
print(dept_ranking[['name', 'department', 'salary', 'dept_salary_rank']].head(15))
# Use nlargest to get top earners in each department
print("\nTop 3 Highest Salaries per Department:")
top_earners = df.groupby('department').apply(
    lambda x: x.nlargest(3, 'salary')[['name', 'salary']]
).reset_index(level=1, drop=True)
print(top_earners)
# Top performers per department
print("\nTop Performer per Department:")
top_performers = df.groupby('department').apply(
    lambda x: x.nlargest(1, 'performance_score')[['name', 'performance_score']]
).reset_index(level=1, drop=True)
print(top_performers)
3. Data Aggregation Basics
3.1 Basic Aggregation Functions
python
print("\n=== Basic Aggregation Operations ===")
# Basic statistical aggregation
print("Overall Statistics:")
basic_stats = df[['salary', 'age', 'experience_years', 'performance_score']].agg([
    'count', 'mean', 'median', 'std', 'min', 'max'
])
print(basic_stats.round(2))
# Single column multiple aggregations
print("\nSalary Statistics:")
salary_stats = df['salary'].agg({
    'Count': 'count',
    'Mean': 'mean',
    'Median': 'median',
    'Std Dev': 'std',
    'Min': 'min',
    'Max': 'max',
    'Sum': 'sum'
})
print(salary_stats.round(2))
# Multiple columns different aggregations
print("\nMulti-column Custom Aggregation:")
custom_agg = df.agg({
    'salary': ['mean', 'median', 'std'],
    'age': ['mean', 'min', 'max'],
    'performance_score': ['mean', 'count'],
    'experience_years': 'mean'
})
print(custom_agg.round(2))
3.2 GroupBy Aggregation
python
print("\n=== GroupBy Aggregation ===")
# Aggregate by department
print("Aggregate by Department:")
dept_agg = df.groupby('department').agg({
    'salary': ['mean', 'median', 'count'],
    'age': 'mean',
    'performance_score': ['mean', 'std'],
    'experience_years': 'mean'
}).round(2)
print(dept_agg)
# Aggregate by position
print("\nAggregate by Position:")
position_agg = df.groupby('position').agg({
    'salary': ['mean', 'count'],
    'performance_score': 'mean',
    'age': 'mean'
}).round(2)
print(position_agg)
# Multi-level groupby aggregation
print("\nAggregate by Department and Position:")
multi_group_agg = df.groupby(['department', 'position']).agg({
    'salary': 'mean',
    'performance_score': 'mean',
    'employee_id': 'count'  # Count
}).round(2)
multi_group_agg.columns = ['Avg Salary', 'Avg Performance', 'Count']
print(multi_group_agg.head(10))
3.3 Time Series Aggregation
python
print("\n=== Time Series Aggregation ===")
# Aggregate by join year
df['join_year'] = df['join_date'].dt.year
print("Aggregate by Join Year:")
yearly_agg = df.groupby('join_year').agg({
    'employee_id': 'count',
    'salary': 'mean',
    'age': 'mean'
}).round(2)
yearly_agg.columns = ['New Hires', 'Avg Salary', 'Avg Age']
print(yearly_agg)
# Aggregate by join month
df['join_month'] = df['join_date'].dt.month
print("\nAggregate by Join Month:")
monthly_agg = df.groupby('join_month').agg({
    'employee_id': 'count',
    'salary': 'mean'
}).round(2)
monthly_agg.columns = ['New Hires', 'Avg Salary']
print(monthly_agg)
# Using resample for time resampling
df_time_indexed = df.set_index('join_date')
print("\nQuarterly Resample Aggregation:")
quarterly_agg = df_time_indexed.resample('Q').agg({
    'employee_id': 'count',
    'salary': 'mean',
    'performance_score': 'mean'
}).round(2)
print(quarterly_agg.head())
4. Advanced Aggregation Techniques
4.1 Custom Aggregation Functions
python
print("\n=== Custom Aggregation Functions ===")
# Define custom aggregation functions
def salary_range(series):
    """Calculate salary range"""
    return series.max() - series.min()
def top_quartile_mean(series):
    """Calculate the mean of the top 25% of values"""
    return series[series >= series.quantile(0.75)].mean()
def coefficient_of_variation(series):
    """Calculate coefficient of variation"""
    return series.std() / series.mean() if series.mean() != 0 else 0
# Apply custom aggregation functions
print("Using Custom Aggregation Functions:")
custom_agg_result = df.groupby('department')['salary'].agg([
    ('Avg Salary', 'mean'),
    ('Salary Range', salary_range),
    ('Top 25% Mean', top_quartile_mean),
    ('CV', coefficient_of_variation)
]).round(2)
print(custom_agg_result)
# Complex custom aggregation
def department_summary(group):
    """Department comprehensive summary"""
    return pd.Series({
        'Employee Count': len(group),
        'Avg Salary': group['salary'].mean(),
        'Median Salary': group['salary'].median(),
        'Avg Performance': group['performance_score'].mean(),
        'High Performer Ratio': (group['performance_score'] > 85).mean(),
        'Avg Experience': group['experience_years'].mean(),
        'Salary Std Dev': group['salary'].std()
    })
print("\nDepartment Comprehensive Summary:")
dept_summary = df.groupby('department').apply(department_summary).round(2)
print(dept_summary)
4.2 Conditional Aggregation
python
print("\n=== Conditional Aggregation ===")
# Use where for conditional aggregation
print("Salary Statistics for High Performers (performance > 85):")
high_performance_salary = df.groupby('department')['salary'].agg([
    ('High Performer Count', lambda x: (df.loc[x.index, 'performance_score'] > 85).sum()),
    ('High Performer Avg Salary', lambda x: x[df.loc[x.index, 'performance_score'] > 85].mean()),
    ('High Performer Median Salary', lambda x: x[df.loc[x.index, 'performance_score'] > 85].median())
]).round(2)
print(high_performance_salary)
# Use query with aggregation
print("\nDepartment Statistics for Employees with 5+ Years Experience:")
experienced_stats = df.query('experience_years >= 5').groupby('department').agg({
    'salary': ['mean', 'count'],
    'performance_score': 'mean',
    'age': 'mean'
}).round(2)
print(experienced_stats)
# Quantile aggregation
print("\nSalary Quantiles by Department:")
salary_quantiles = df.groupby('department')['salary'].quantile([0.25, 0.5, 0.75]).unstack()
salary_quantiles.columns = ['Q1', 'Median', 'Q3']
print(salary_quantiles.round(0))
4.3 Rolling Window Aggregation
python
print("\n=== Rolling Window Aggregation ===")
# Create time series data
daily_data = df.groupby('join_date').agg({
    'employee_id': 'count',
    'salary': 'mean'
}).reset_index()
daily_data.columns = ['date', 'new_hires', 'avg_salary']
daily_data = daily_data.set_index('date').sort_index()
print("Daily Hire Data (First 10 Days):")
print(daily_data.head(10))
# 7-day rolling average
print("\n7-Day Rolling Average:")
daily_data['new_hires_7d_avg'] = daily_data['new_hires'].rolling(window=7).mean()
daily_data['avg_salary_7d_avg'] = daily_data['avg_salary'].rolling(window=7).mean()
print(daily_data[['new_hires', 'new_hires_7d_avg', 'avg_salary', 'avg_salary_7d_avg']].head(10).round(2))
# 30-day rolling statistics
print("\n30-Day Rolling Statistics:")
rolling_stats = daily_data['new_hires'].rolling(window=30).agg([
    'mean', 'std', 'min', 'max', 'sum'
]).round(2)
print(rolling_stats.head(35))
# Expanding window aggregation
print("\nCumulative Statistics:")
daily_data['cumulative_hires'] = daily_data['new_hires'].expanding().sum()
daily_data['cumulative_avg_salary'] = daily_data['avg_salary'].expanding().mean()
print(daily_data[['new_hires', 'cumulative_hires', 'avg_salary', 'cumulative_avg_salary']].head(10).round(2))
5. Pivot Tables and Cross-tabulations
5.1 Pivot Table
python
print("\n=== Pivot Table ===")
# Basic pivot table
print("Department-Position Salary Pivot Table:")
pivot_salary = pd.pivot_table(
    df,
    values='salary',
    index='department',
    columns='position',
    aggfunc='mean'
).round(0)
print(pivot_salary)
# Multi-value pivot table
print("\nMulti-metric Pivot Table:")
multi_pivot = pd.pivot_table(
    df,
    values=['salary', 'performance_score'],
    index='department',
    columns='position',
    aggfunc={'salary': 'mean', 'performance_score': 'mean'}
).round(2)
print(multi_pivot)
# Pivot table with margins
print("\nPivot Table with Totals:")
pivot_with_margins = pd.pivot_table(
    df,
    values='salary',
    index='department',
    columns='position',
    aggfunc='mean',
    margins=True,
    margins_name='Total'
).round(0)
print(pivot_with_margins)
# Multi-level index pivot table
df['age_group'] = pd.cut(df['age'], bins=[20, 30, 40, 50, 60], labels=['20-30', '31-40', '41-50', '51-60'])
print("\nMulti-level Index Pivot Table:")
multi_index_pivot = pd.pivot_table(
    df,
    values='salary',
    index=['department', 'age_group'],
    columns='position',
    aggfunc='mean'
).round(0)
print(multi_index_pivot.head(10))
5.2 Cross-tabulation
python
print("\n=== Cross-tabulation ===")
# Basic cross-tabulation
print("Department-Position Count Cross-tab:")
crosstab_basic = pd.crosstab(df['department'], df['position'])
print(crosstab_basic)
# Cross-tabulation with proportions
print("\nDepartment-Position Proportion Cross-tab:")
crosstab_prop = pd.crosstab(df['department'], df['position'], normalize='index').round(3)
print(crosstab_prop)
# Cross-tabulation with margins
print("\nCross-tab with Totals:")
crosstab_margins = pd.crosstab(
    df['department'],
    df['position'],
    margins=True,
    margins_name='Total'
)
print(crosstab_margins)
# Multi-dimensional cross-tabulation
print("\nMulti-dimensional Cross-tab (Department-Position-Age Group):")
multi_crosstab = pd.crosstab(
    [df['department'], df['age_group']],
    df['position']
)
print(multi_crosstab.head(10))
# Cross-tabulation with values
print("\nCross-tab with Average Salary:")
crosstab_values = pd.crosstab(
    df['department'],
    df['position'],
    values=df['salary'],
    aggfunc='mean'
).round(0)
print(crosstab_values)
6. Group Transform and Filter
6.1 Group Transform
python
print("\n=== Group Transform ===")
# Calculate relative salary within department
df['dept_avg_salary'] = df.groupby('department')['salary'].transform('mean')
df['salary_vs_dept_avg'] = df['salary'] - df['dept_avg_salary']
df['salary_pct_of_dept_avg'] = (df['salary'] / df['dept_avg_salary'] * 100).round(1)
print("Employee Salary Relative to Department Average:")
print(df[['name', 'department', 'salary', 'dept_avg_salary', 'salary_vs_dept_avg', 'salary_pct_of_dept_avg']].head(10))
# Calculate ranking within department
df['dept_salary_rank'] = df.groupby('department')['salary'].transform(
    lambda x: x.rank(ascending=False)
)
df['dept_performance_rank'] = df.groupby('department')['performance_score'].transform(
    lambda x: x.rank(ascending=False)
)
print("\nRanking within Departments:")
print(df[['name', 'department', 'salary', 'dept_salary_rank', 'performance_score', 'dept_performance_rank']].head(10))
# Standardized scores
df['salary_zscore'] = df.groupby('department')['salary'].transform(
    lambda x: (x - x.mean()) / x.std()
)
df['performance_zscore'] = df.groupby('department')['performance_score'].transform(
    lambda x: (x - x.mean()) / x.std()
)
print("\nStandardized Scores within Departments:")
print(df[['name', 'department', 'salary_zscore', 'performance_zscore']].head(10).round(3))
6.2 Group Filter
python
print("\n=== Group Filter ===")
# Filter large departments (employee count > 150)
large_departments = df.groupby('department').filter(lambda x: len(x) > 150)
print(f"Large Department Employee Count: {len(large_departments)}")
print("Large Department Distribution:")
print(large_departments['department'].value_counts())
# Filter high-salary departments (avg salary > 8000)
high_salary_depts = df.groupby('department').filter(
    lambda x: x['salary'].mean() > 8000
)
print(f"\nHigh Salary Department Employee Count: {len(high_salary_depts)}")
print("High Salary Department Distribution:")
print(high_salary_depts['department'].value_counts())
# Filter high-performance departments (avg performance > 80)
high_performance_depts = df.groupby('department').filter(
    lambda x: x['performance_score'].mean() > 80
)
print(f"\nHigh Performance Department Employee Count: {len(high_performance_depts)}")
print("High Performance Department Distribution:")
print(high_performance_depts['department'].value_counts())
# Compound condition filter
complex_filter = df.groupby('department').filter(
    lambda x: (len(x) > 100) and (x['salary'].mean() > 7500) and (x['performance_score'].mean() > 75)
)
print(f"\nEmployees Meeting Compound Conditions: {len(complex_filter)}")
if len(complex_filter) > 0:
    print("Qualifying Departments:")
    print(complex_filter['department'].value_counts())
7. Aggregation Visualization
7.1 Aggregated Data Visualization
python
print("\n=== Aggregation Visualization ===")
# Department salary statistics visualization
dept_salary_stats = df.groupby('department')['salary'].agg(['mean', 'median', 'std']).round(0)
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
# Average salary bar chart
dept_salary_stats['mean'].plot(kind='bar', ax=axes[0,0], color='skyblue')
axes[0,0].set_title('Average Salary by Department')
axes[0,0].set_ylabel('Salary')
axes[0,0].tick_params(axis='x', rotation=45)
# Salary distribution box plot
df.boxplot(column='salary', by='department', ax=axes[0,1])
axes[0,1].set_title('Salary Distribution by Department')
axes[0,1].set_xlabel('Department')
axes[0,1].set_ylabel('Salary')
# Position distribution stacked bar chart
position_dist = pd.crosstab(df['department'], df['position'])
position_dist.plot(kind='bar', stacked=True, ax=axes[1,0])
axes[1,0].set_title('Position Distribution by Department')
axes[1,0].set_ylabel('Count')
axes[1,0].tick_params(axis='x', rotation=45)
axes[1,0].legend(title='Position', bbox_to_anchor=(1.05, 1), loc='upper left')
# Performance vs Salary scatter plot
for dept in df['department'].unique():
    dept_data = df[df['department'] == dept]
    axes[1,1].scatter(dept_data['performance_score'], dept_data['salary'],
                      label=dept, alpha=0.6)
axes[1,1].set_xlabel('Performance Score')
axes[1,1].set_ylabel('Salary')
axes[1,1].set_title('Performance vs Salary')
axes[1,1].legend()
plt.tight_layout()
plt.show()
7.2 Time Series Aggregation Visualization
python
# Hire trend visualization
monthly_hires = df.groupby(df['join_date'].dt.to_period('M')).size()
fig, axes = plt.subplots(2, 1, figsize=(12, 8))
# Monthly hire count
monthly_hires.plot(ax=axes[0], marker='o')
axes[0].set_title('Monthly New Hires Trend')
axes[0].set_ylabel('New Hires')
axes[0].grid(True, alpha=0.3)
# Cumulative hire count
monthly_hires.cumsum().plot(ax=axes[1], marker='s')
axes[1].set_title('Cumulative New Hires')
axes[1].set_ylabel('Cumulative Count')
axes[1].set_xlabel('Time')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
8. Performance Optimization Tips
8.1 Efficient Aggregation Methods
python
print("\n=== Performance Optimization Tips ===")
import time
# Create large dataset for performance testing
large_df = pd.concat([df] * 10, ignore_index=True)
print(f"Large Dataset Size: {large_df.shape}")
# Method 1: Using agg
start_time = time.time()
result1 = large_df.groupby('department').agg({
    'salary': ['mean', 'std'],
    'performance_score': 'mean'
})
time1 = time.time() - start_time
# Method 2: Separate calculations
start_time = time.time()
result2_mean = large_df.groupby('department')['salary'].mean()
result2_std = large_df.groupby('department')['salary'].std()
result2_perf = large_df.groupby('department')['performance_score'].mean()
time2 = time.time() - start_time
print(f"Using agg method time: {time1:.4f} seconds")
print(f"Separate calculations time: {time2:.4f} seconds")
# Memory optimization
print("\nMemory Optimization:")
print(f"Original data memory: {large_df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
# Optimize data types
optimized_df = large_df.copy()
optimized_df['department'] = optimized_df['department'].astype('category')
optimized_df['position'] = optimized_df['position'].astype('category')
optimized_df['salary'] = optimized_df['salary'].astype('int32')
optimized_df['age'] = optimized_df['age'].astype('int8')
print(f"Optimized data memory: {optimized_df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print(f"Memory savings: {(1 - optimized_df.memory_usage(deep=True).sum() / large_df.memory_usage(deep=True).sum()) * 100:.1f}%")
8.2 Parallel Processing
python
# Using numba to accelerate custom aggregation functions (requires numba installation)
try:
    from numba import jit
    @jit
    def fast_coefficient_of_variation(arr):
        """Fast coefficient of variation calculation"""
        mean_val = arr.mean()
        if mean_val == 0:
            return 0
        return arr.std() / mean_val
    print("\nUsing Numba Accelerated Aggregation:")
    start_time = time.time()
    fast_cv = large_df.groupby('department')['salary'].apply(
        lambda x: fast_coefficient_of_variation(x.values)
    )
    fast_time = time.time() - start_time
    print(f"Numba accelerated time: {fast_time:.4f} seconds")
except ImportError:
    print("Numba not installed, skipping acceleration demo")
# Chunked processing for large data
def chunked_aggregation(df, chunk_size=1000):
    """Chunked aggregation processing"""
    sums, counts = [], []
    for i in range(0, len(df), chunk_size):
        chunk = df.iloc[i:i+chunk_size]
        sums.append(chunk.groupby('department')['salary'].sum())
        counts.append(chunk.groupby('department')['salary'].count())
    # Combine per-chunk sums and counts so the result equals the exact overall mean
    # (averaging chunk means directly would weight unevenly sized groups incorrectly)
    total_sum = pd.concat(sums, axis=1).sum(axis=1)
    total_count = pd.concat(counts, axis=1).sum(axis=1)
    return total_sum / total_count
print("\nChunked Processing Result:")
chunked_result = chunked_aggregation(large_df)
print(chunked_result.head())
9. Practical Application Cases
9.1 Sales Data Analysis
python
print("\n=== Sales Data Analysis Case ===")
# Create sales data
np.random.seed(42)
sales_data = {
    'date': pd.date_range('2023-01-01', periods=365, freq='D'),
    'product': np.random.choice(['Product A', 'Product B', 'Product C', 'Product D'], 365),
    'region': np.random.choice(['North', 'South', 'East', 'West'], 365),
    'salesperson': np.random.choice([f'Sales{i}' for i in range(1, 21)], 365),
    'quantity': np.random.randint(1, 100, 365),
    'unit_price': np.random.uniform(50, 500, 365),
}
sales_df = pd.DataFrame(sales_data)
sales_df['revenue'] = sales_df['quantity'] * sales_df['unit_price']
sales_df['month'] = sales_df['date'].dt.month
sales_df['quarter'] = sales_df['date'].dt.quarter
print("Sales Data Sample:")
print(sales_df.head())
# 1. Monthly sales analysis
print("\nMonthly Sales Analysis:")
monthly_sales = sales_df.groupby('month').agg({
    'revenue': ['sum', 'mean', 'count'],
    'quantity': 'sum'
}).round(2)
monthly_sales.columns = ['Total Revenue', 'Avg Revenue', 'Transactions', 'Total Quantity']
print(monthly_sales)
# 2. Product-region analysis
print("\nProduct-Region Sales Matrix:")
product_region_matrix = pd.pivot_table(
    sales_df,
    values='revenue',
    index='product',
    columns='region',
    aggfunc='sum'
).round(0)
print(product_region_matrix)
# 3. Salesperson performance ranking
print("\nSalesperson Performance Ranking (Top 10):")
salesperson_performance = sales_df.groupby('salesperson').agg({
    'revenue': 'sum',
    'quantity': 'sum',
    'date': 'count'
}).round(2)
salesperson_performance.columns = ['Total Revenue', 'Total Quantity', 'Transactions']
salesperson_performance['Avg Transaction'] = (salesperson_performance['Total Revenue'] / salesperson_performance['Transactions']).round(2)
top_performers = salesperson_performance.sort_values('Total Revenue', ascending=False)
print(top_performers.head(10))
9.2 Customer Behavior Analysis
python
print("\n=== Customer Behavior Analysis Case ===")
# Create customer data
customer_data = {
    'customer_id': [],
    'purchase_date': [],
    'amount': [],
    'category': []
}
# Generate purchase records (customer_id is built in the same loop so all columns stay the same length)
for cid in range(1, 201):
    n_purchases = np.random.randint(1, 10)
    start_date = pd.Timestamp('2023-01-01')
    dates = pd.date_range(start_date, periods=n_purchases, freq=f'{np.random.randint(1, 30)}D')
    customer_data['customer_id'].extend([cid] * n_purchases)
    customer_data['purchase_date'].extend(dates)
    customer_data['amount'].extend(np.random.uniform(20, 500, n_purchases))
    customer_data['category'].extend(np.random.choice(['Electronics', 'Clothing', 'Food', 'Home'], n_purchases))
customer_df = pd.DataFrame(customer_data)
print(f"Customer Data Records: {len(customer_df)}")
print("Customer Data Sample:")
print(customer_df.head())
# 1. Customer value analysis
print("\nCustomer Value Analysis:")
customer_value = customer_df.groupby('customer_id').agg({
    'amount': ['sum', 'mean', 'count'],
    'purchase_date': ['min', 'max']
})
customer_value.columns = ['Total Spent', 'Avg Spent', 'Purchases', 'First Purchase', 'Last Purchase']
customer_value['Purchase Span'] = (customer_value['Last Purchase'] - customer_value['First Purchase']).dt.days
# Customer segmentation
customer_value['Customer Tier'] = pd.cut(
    customer_value['Total Spent'],
    bins=[0, 500, 1500, 3000, float('inf')],
    labels=['Bronze', 'Silver', 'Gold', 'Diamond']
)
print("Customer Tier Distribution:")
print(customer_value['Customer Tier'].value_counts())
print("\nStatistics by Customer Tier:")
tier_stats = customer_value.groupby('Customer Tier').agg({
    'Total Spent': ['mean', 'count'],
    'Purchases': 'mean',
    'Purchase Span': 'mean'
}).round(2)
print(tier_stats)
# 2. Category preference analysis
print("\nCustomer Category Preference Analysis:")
category_preference = pd.crosstab(
    customer_df['customer_id'],
    customer_df['category'],
    values=customer_df['amount'],
    aggfunc='sum'
).fillna(0)
# Find each customer's main category
customer_main_category = category_preference.idxmax(axis=1)
print("Customer Main Category Distribution:")
print(customer_main_category.value_counts())
10. Best Practices and Considerations
10.1 Sorting and Aggregation Best Practices
python
print("\n=== Best Practices Summary ===")
def sorting_aggregation_best_practices():
    """
    Sorting and aggregation best practices guide
    """
    practices = {
        "Sorting Best Practices": [
            "1. Use sort_values() instead of sort()",
            "2. For large datasets, consider using nlargest() or nsmallest()",
            "3. Use na_position parameter appropriately for missing values",
            "4. Use Categorical type for custom sorting",
            "5. Avoid repeated sorting in loops"
        ],
        "Aggregation Best Practices": [
            "1. Use agg() for multiple aggregation operations",
            "2. Prefer built-in aggregation functions over custom ones",
            "3. Use transform() appropriately to maintain data structure",
            "4. Use filter() for conditional filtering",
            "5. Consider memory usage, optimize data types"
        ],
        "Performance Optimization Tips": [
            "1. Filter data before aggregation",
            "2. Use categorical data types to reduce memory",
            "3. Avoid complex lambda functions in groupby",
            "4. Consider using numba to accelerate custom functions",
            "5. For very large datasets, consider chunked processing"
        ],
        "Common Pitfalls": [
            "1. Watch for index changes after aggregation",
            "2. Handle missing values carefully in aggregation results",
            "3. Avoid accidentally including non-numeric columns in aggregation",
            "4. Watch for timezone issues in time series aggregation",
            "5. Verify business logic reasonableness of aggregation results"
        ]
    }
    for category, items in practices.items():
        print(f"\n{category}:")
        for item in items:
            print(f" {item}")
sorting_aggregation_best_practices()
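The first pitfall above deserves a concrete illustration: groupby() moves the grouping keys into the result's index, which surprises code that expects them as ordinary columns; passing as_index=False (or calling reset_index()) keeps a flat frame. A minimal sketch with made-up numbers:

```python
import pandas as pd

demo = pd.DataFrame({
    'department': ['IT', 'HR', 'IT'],
    'salary': [9000, 6000, 7000],
})

# Default: 'department' becomes the index of the result
grouped = demo.groupby('department')['salary'].mean()
print(grouped.index.name)

# as_index=False keeps 'department' as a regular column
flat = demo.groupby('department', as_index=False)['salary'].mean()
print(flat.columns.tolist())
```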
# Performance comparison example
print("\n=== Performance Comparison Example ===")
# Inefficient method
def inefficient_aggregation(df):
    results = []
    for dept in df['department'].unique():
        dept_data = df[df['department'] == dept]
        avg_salary = dept_data['salary'].mean()
        results.append({'department': dept, 'avg_salary': avg_salary})
    return pd.DataFrame(results)
# Efficient method
def efficient_aggregation(df):
    return df.groupby('department')['salary'].mean().reset_index()
# Performance test
start_time = time.time()
inefficient_result = inefficient_aggregation(df)
inefficient_time = time.time() - start_time
start_time = time.time()
efficient_result = efficient_aggregation(df)
efficient_time = time.time() - start_time
print(f"Inefficient method time: {inefficient_time:.4f} seconds")
print(f"Efficient method time: {efficient_time:.4f} seconds")
print(f"Performance improvement: {inefficient_time / efficient_time:.1f}x")
Chapter Summary
This chapter comprehensively covered data sorting and aggregation techniques in Pandas:
Sorting Techniques
- Basic Sorting: Using sort_values() and sort_index()
- Advanced Sorting: Custom sorting, within-group sorting, multi-level sorting
- Sorting Optimization: Handling missing values, using categorical data types
Aggregation Techniques
- Basic Aggregation: Various uses of agg() and groupby()
- Advanced Aggregation: Custom aggregation functions, conditional aggregation, time series aggregation
- Group Operations: Differences and applications of transform(), filter(), and apply()
Data Pivot
- Pivot Tables: Multi-dimensional analysis with pivot_table()
- Cross-tabulations: Statistical analysis with crosstab()
- Data Reshaping: Wide-to-long transformations and multi-level index handling
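The wide-to-long transformation mentioned here was not shown earlier in the chapter; a minimal sketch with pd.melt() on a small hypothetical wide table (column names illustrative):

```python
import pandas as pd

# Hypothetical wide table: one revenue column per quarter
wide = pd.DataFrame({
    'department': ['IT', 'HR'],
    'Q1': [100, 80],
    'Q2': [120, 90],
})

# melt stacks the quarter columns into (quarter, revenue) rows
long_df = pd.melt(wide, id_vars='department',
                  value_vars=['Q1', 'Q2'],
                  var_name='quarter', value_name='revenue')
print(long_df)
```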
Performance Optimization
- Memory Optimization: Data type optimization, chunked processing
- Computation Optimization: Vectorized operations, parallel processing
- Best Practices: Avoiding common pitfalls, improving code efficiency
Mastering these sorting and aggregation techniques will help you:
- Quickly understand data distribution and patterns
- Perform multi-dimensional data analysis
- Generate various statistical reports and business insights
- Prepare for data visualization and modeling
Exercises
- Create a complex multi-level groupby aggregation analysis
- Implement a custom rolling window aggregation function
- Design a customer segmentation analysis system
- Optimize aggregation performance for a large dataset
- Build a dynamic pivot table analysis tool
In the next chapter, we will learn about Pandas data visualization, exploring how to transform data analysis results into intuitive charts.