Pandas Data Sorting and Aggregation
Data sorting and aggregation are core operations in data analysis. Sorting helps us understand data distribution and patterns, while aggregation allows us to extract meaningful statistical information from large datasets. This chapter will detail various sorting and aggregation techniques in Pandas.
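Before diving into the details, a minimal preview of both operations on a tiny hypothetical DataFrame (the values and column names here are illustrative, not part of this chapter's dataset):

```python
import pandas as pd

# Tiny illustrative dataset
toy = pd.DataFrame({
    'department': ['IT', 'HR', 'IT', 'HR'],
    'salary': [9000, 6000, 7000, 6500],
})

# Sorting: reorder rows by one or more columns
print(toy.sort_values('salary', ascending=False))

# Aggregation: collapse each group to a summary statistic
print(toy.groupby('department')['salary'].mean())
```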
1. Data Sorting Basics
1.1 Single Column Sorting
python
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
import seaborn as sns
# Set plot style
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10
# Create sample dataset
np.random.seed(42)
n_employees = 1000
# Generate employee data
employee_data = {
    'employee_id': range(1, n_employees + 1),
    'name': [f'Employee{i:04d}' for i in range(1, n_employees + 1)],
    'department': np.random.choice(['IT', 'HR', 'Finance', 'Marketing', 'Operations'], n_employees),
    'position': np.random.choice(['Junior', 'Mid-level', 'Senior', 'Expert', 'Manager'], n_employees),
    'salary': np.random.normal(8000, 2000, n_employees).astype(int),
    'age': np.random.randint(22, 60, n_employees),
    'experience_years': np.random.randint(0, 20, n_employees),
    'performance_score': np.random.uniform(60, 100, n_employees),
    'join_date': pd.date_range('2015-01-01', periods=n_employees, freq='D')
}
# Ensure salary is positive
employee_data['salary'] = np.maximum(employee_data['salary'], 3000)
df = pd.DataFrame(employee_data)
print("Employee Dataset:")
print(df.head())
print(f"\nData Shape: {df.shape}")
# 1. Basic sorting - sort_values()
print("\n=== Single Column Sorting ===")
# Sort by salary ascending
print("Sort by Salary Ascending (Top 5):")
salary_asc = df.sort_values('salary')
print(salary_asc[['name', 'department', 'salary']].head())
# Sort by salary descending
print("\nSort by Salary Descending (Top 5):")
salary_desc = df.sort_values('salary', ascending=False)
print(salary_desc[['name', 'department', 'salary']].head())
# Sort by performance score
print("\nSort by Performance Score Descending (Top 5):")
performance_desc = df.sort_values('performance_score', ascending=False)
print(performance_desc[['name', 'department', 'performance_score']].head())
1.2 Multi-Column Sorting
python
print("\n=== Multi-Column Sorting ===")
# Sort by department and salary
print("Sort by Department (Ascending) and Salary (Descending):")
multi_sort = df.sort_values(['department', 'salary'], ascending=[True, False])
print(multi_sort[['name', 'department', 'salary']].head(10))
# Complex multi-column sorting
print("\nSort by Department, Position, and Performance Score:")
complex_sort = df.sort_values(
    ['department', 'position', 'performance_score'],
    ascending=[True, True, False]
)
print(complex_sort[['name', 'department', 'position', 'performance_score']].head(10))
# Sorting with missing values
df_with_na = df.copy()
df_with_na.loc[np.random.choice(df.index, 50), 'salary'] = np.nan
print("\nSorting with Missing Values:")
# Missing values at the end
na_last = df_with_na.sort_values('salary', na_position='last')
print("Missing Values at End:")
print(na_last[['name', 'salary']].tail())
# Missing values at the beginning
na_first = df_with_na.sort_values('salary', na_position='first')
print("\nMissing Values at Beginning:")
print(na_first[['name', 'salary']].head())
1.3 Index Sorting
python
print("\n=== Index Sorting ===")
# Set multi-level index
df_indexed = df.set_index(['department', 'position'])
# Sort by index
print("Sort by Index:")
index_sorted = df_indexed.sort_index()
print(index_sorted.head(10))
# Sort by specific level
print("\nSort by Second Level Index (Position):")
level_sorted = df_indexed.sort_index(level=1)
print(level_sorted.head(10))
# Descending index sort
print("\nSort by Index Descending:")
index_desc = df_indexed.sort_index(ascending=False)
print(index_desc.head(10))
2. Advanced Sorting Techniques
2.1 Custom Sorting
python
print("\n=== Custom Sorting ===")
# Define position hierarchy
position_order = ['Junior', 'Mid-level', 'Senior', 'Expert', 'Manager']
# Use Categorical for custom sorting
df['position_cat'] = pd.Categorical(df['position'], categories=position_order, ordered=True)
print("Sort by Position Hierarchy:")
position_sorted = df.sort_values('position_cat')
print(position_sorted[['name', 'position', 'salary']].head(10))
# Using key parameter for custom sorting (Pandas 1.1+)
def custom_sort_key(series):
    """Custom sort key function"""
    position_map = {'Junior': 1, 'Mid-level': 2, 'Senior': 3, 'Expert': 4, 'Manager': 5}
    return series.map(position_map)
print("\nUsing Custom Key Function:")
custom_sorted = df.sort_values('position', key=custom_sort_key)
print(custom_sorted[['name', 'position', 'salary']].head(10))
# Complex custom sorting
def salary_performance_score(df_subset):
    """Composite salary and performance score"""
    normalized_salary = (df_subset['salary'] - df_subset['salary'].min()) / (df_subset['salary'].max() - df_subset['salary'].min())
    normalized_performance = (df_subset['performance_score'] - df_subset['performance_score'].min()) / (df_subset['performance_score'].max() - df_subset['performance_score'].min())
    return normalized_salary * 0.6 + normalized_performance * 0.4
df['composite_score'] = salary_performance_score(df)
print("\nSort by Composite Score (Top 10):")
composite_sorted = df.sort_values('composite_score', ascending=False)
print(composite_sorted[['name', 'salary', 'performance_score', 'composite_score']].head(10))
2.2 Within-Group Sorting
python
print("\n=== Within-Group Sorting ===")
# Sort by salary within each department
print("Salary Ranking within Departments:")
df['dept_salary_rank'] = df.groupby('department')['salary'].rank(ascending=False)
dept_ranking = df.sort_values(['department', 'dept_salary_rank'])
print(dept_ranking[['name', 'department', 'salary', 'dept_salary_rank']].head(15))
# Use nlargest to get top earners in each department
print("\nTop 3 Highest Salaries per Department:")
top_earners = df.groupby('department').apply(
    lambda x: x.nlargest(3, 'salary')[['name', 'salary']]
).reset_index(level=1, drop=True)
print(top_earners)
# Top performers per department
print("\nTop Performer per Department:")
top_performers = df.groupby('department').apply(
    lambda x: x.nlargest(1, 'performance_score')[['name', 'performance_score']]
).reset_index(level=1, drop=True)
print(top_performers)
3. Data Aggregation Basics
3.1 Basic Aggregation Functions
python
print("\n=== Basic Aggregation Operations ===")
# Basic statistical aggregation
print("Overall Statistics:")
basic_stats = df[['salary', 'age', 'experience_years', 'performance_score']].agg([
    'count', 'mean', 'median', 'std', 'min', 'max'
])
print(basic_stats.round(2))
# Single column multiple aggregations
print("\nSalary Statistics:")
salary_stats = df['salary'].agg({
    'Count': 'count',
    'Mean': 'mean',
    'Median': 'median',
    'Std Dev': 'std',
    'Min': 'min',
    'Max': 'max',
    'Sum': 'sum'
})
print(salary_stats.round(2))
# Multiple columns different aggregations
print("\nMulti-column Custom Aggregation:")
custom_agg = df.agg({
    'salary': ['mean', 'median', 'std'],
    'age': ['mean', 'min', 'max'],
    'performance_score': ['mean', 'count'],
    'experience_years': 'mean'
})
print(custom_agg.round(2))
3.2 GroupBy Aggregation
python
print("\n=== GroupBy Aggregation ===")
# Aggregate by department
print("Aggregate by Department:")
dept_agg = df.groupby('department').agg({
    'salary': ['mean', 'median', 'count'],
    'age': 'mean',
    'performance_score': ['mean', 'std'],
    'experience_years': 'mean'
}).round(2)
print(dept_agg)
# Aggregate by position
print("\nAggregate by Position:")
position_agg = df.groupby('position').agg({
    'salary': ['mean', 'count'],
    'performance_score': 'mean',
    'age': 'mean'
}).round(2)
print(position_agg)
# Multi-level groupby aggregation
print("\nAggregate by Department and Position:")
multi_group_agg = df.groupby(['department', 'position']).agg({
    'salary': 'mean',
    'performance_score': 'mean',
    'employee_id': 'count'  # Count
}).round(2)
multi_group_agg.columns = ['Avg Salary', 'Avg Performance', 'Count']
print(multi_group_agg.head(10))
3.3 Time Series Aggregation
python
print("\n=== Time Series Aggregation ===")
# Aggregate by join year
df['join_year'] = df['join_date'].dt.year
print("Aggregate by Join Year:")
yearly_agg = df.groupby('join_year').agg({
    'employee_id': 'count',
    'salary': 'mean',
    'age': 'mean'
}).round(2)
yearly_agg.columns = ['New Hires', 'Avg Salary', 'Avg Age']
print(yearly_agg)
# Aggregate by join month
df['join_month'] = df['join_date'].dt.month
print("\nAggregate by Join Month:")
monthly_agg = df.groupby('join_month').agg({
    'employee_id': 'count',
    'salary': 'mean'
}).round(2)
monthly_agg.columns = ['New Hires', 'Avg Salary']
print(monthly_agg)
# Using resample for time resampling
df_time_indexed = df.set_index('join_date')
print("\nQuarterly Resample Aggregation:")
quarterly_agg = df_time_indexed.resample('Q').agg({
    'employee_id': 'count',
    'salary': 'mean',
    'performance_score': 'mean'
}).round(2)
print(quarterly_agg.head())
4. Advanced Aggregation Techniques
4.1 Custom Aggregation Functions
python
print("\n=== Custom Aggregation Functions ===")
# Define custom aggregation functions
def salary_range(series):
    """Calculate salary range"""
    return series.max() - series.min()
def top_quartile_mean(series):
    """Calculate the mean of the top 25% of values"""
    return series[series >= series.quantile(0.75)].mean()
def coefficient_of_variation(series):
    """Calculate coefficient of variation"""
    return series.std() / series.mean() if series.mean() != 0 else 0
# Apply custom aggregation functions
print("Using Custom Aggregation Functions:")
custom_agg_result = df.groupby('department')['salary'].agg([
    ('Avg Salary', 'mean'),
    ('Salary Range', salary_range),
    ('Top 25% Mean', top_quartile_mean),
    ('CV', coefficient_of_variation)
]).round(2)
print(custom_agg_result)
# Complex custom aggregation
def department_summary(group):
    """Department comprehensive summary"""
    return pd.Series({
        'Employee Count': len(group),
        'Avg Salary': group['salary'].mean(),
        'Median Salary': group['salary'].median(),
        'Avg Performance': group['performance_score'].mean(),
        'High Performer Ratio': (group['performance_score'] > 85).mean(),
        'Avg Experience': group['experience_years'].mean(),
        'Salary Std Dev': group['salary'].std()
    })
print("\nDepartment Comprehensive Summary:")
dept_summary = df.groupby('department').apply(department_summary).round(2)
print(dept_summary)
4.2 Conditional Aggregation
python
print("\n=== Conditional Aggregation ===")
# Use where for conditional aggregation
print("Salary Statistics for High Performers (performance > 85):")
high_performance_salary = df.groupby('department')['salary'].agg([
    ('High Performer Count', lambda x: (df.loc[x.index, 'performance_score'] > 85).sum()),
    ('High Performer Avg Salary', lambda x: x[df.loc[x.index, 'performance_score'] > 85].mean()),
    ('High Performer Median Salary', lambda x: x[df.loc[x.index, 'performance_score'] > 85].median())
]).round(2)
print(high_performance_salary)
# Use query with aggregation
print("\nDepartment Statistics for Employees with 5+ Years Experience:")
experienced_stats = df.query('experience_years >= 5').groupby('department').agg({
    'salary': ['mean', 'count'],
    'performance_score': 'mean',
    'age': 'mean'
}).round(2)
print(experienced_stats)
# Quantile aggregation
print("\nSalary Quantiles by Department:")
salary_quantiles = df.groupby('department')['salary'].quantile([0.25, 0.5, 0.75]).unstack()
salary_quantiles.columns = ['Q1', 'Median', 'Q3']
print(salary_quantiles.round(0))
4.3 Rolling Window Aggregation
python
print("\n=== Rolling Window Aggregation ===")
# Create time series data
daily_data = df.groupby('join_date').agg({
    'employee_id': 'count',
    'salary': 'mean'
}).reset_index()
daily_data.columns = ['date', 'new_hires', 'avg_salary']
daily_data = daily_data.set_index('date').sort_index()
print("Daily Hire Data (First 10 Days):")
print(daily_data.head(10))
# 7-day rolling average
print("\n7-Day Rolling Average:")
daily_data['new_hires_7d_avg'] = daily_data['new_hires'].rolling(window=7).mean()
daily_data['avg_salary_7d_avg'] = daily_data['avg_salary'].rolling(window=7).mean()
print(daily_data[['new_hires', 'new_hires_7d_avg', 'avg_salary', 'avg_salary_7d_avg']].head(10).round(2))
# 30-day rolling statistics
print("\n30-Day Rolling Statistics:")
rolling_stats = daily_data['new_hires'].rolling(window=30).agg([
    'mean', 'std', 'min', 'max', 'sum'
]).round(2)
print(rolling_stats.head(35))
# Expanding window aggregation
print("\nCumulative Statistics:")
daily_data['cumulative_hires'] = daily_data['new_hires'].expanding().sum()
daily_data['cumulative_avg_salary'] = daily_data['avg_salary'].expanding().mean()
print(daily_data[['new_hires', 'cumulative_hires', 'avg_salary', 'cumulative_avg_salary']].head(10).round(2))
5. Pivot Tables and Cross-tabulations
5.1 Pivot Table
python
print("\n=== Pivot Table ===")
# Basic pivot table
print("Department-Position Salary Pivot Table:")
pivot_salary = pd.pivot_table(
    df,
    values='salary',
    index='department',
    columns='position',
    aggfunc='mean'
).round(0)
print(pivot_salary)
# Multi-value pivot table
print("\nMulti-metric Pivot Table:")
multi_pivot = pd.pivot_table(
    df,
    values=['salary', 'performance_score'],
    index='department',
    columns='position',
    aggfunc={'salary': 'mean', 'performance_score': 'mean'}
).round(2)
print(multi_pivot)
# Pivot table with margins
print("\nPivot Table with Totals:")
pivot_with_margins = pd.pivot_table(
    df,
    values='salary',
    index='department',
    columns='position',
    aggfunc='mean',
    margins=True,
    margins_name='Total'
).round(0)
print(pivot_with_margins)
# Multi-level index pivot table
df['age_group'] = pd.cut(df['age'], bins=[20, 30, 40, 50, 60], labels=['20-30', '31-40', '41-50', '51-60'])
print("\nMulti-level Index Pivot Table:")
multi_index_pivot = pd.pivot_table(
    df,
    values='salary',
    index=['department', 'age_group'],
    columns='position',
    aggfunc='mean'
).round(0)
print(multi_index_pivot.head(10))
5.2 Cross-tabulation
python
print("\n=== Cross-tabulation ===")
# Basic cross-tabulation
print("Department-Position Count Cross-tab:")
crosstab_basic = pd.crosstab(df['department'], df['position'])
print(crosstab_basic)
# Cross-tabulation with proportions
print("\nDepartment-Position Proportion Cross-tab:")
crosstab_prop = pd.crosstab(df['department'], df['position'], normalize='index').round(3)
print(crosstab_prop)
# Cross-tabulation with margins
print("\nCross-tab with Totals:")
crosstab_margins = pd.crosstab(
    df['department'],
    df['position'],
    margins=True,
    margins_name='Total'
)
print(crosstab_margins)
# Multi-dimensional cross-tabulation
print("\nMulti-dimensional Cross-tab (Department-Position-Age Group):")
multi_crosstab = pd.crosstab(
    [df['department'], df['age_group']],
    df['position']
)
print(multi_crosstab.head(10))
# Cross-tabulation with values
print("\nCross-tab with Average Salary:")
crosstab_values = pd.crosstab(
    df['department'],
    df['position'],
    values=df['salary'],
    aggfunc='mean'
).round(0)
print(crosstab_values)
6. Group Transform and Filter
6.1 Group Transform
python
print("\n=== Group Transform ===")
# Calculate relative salary within department
df['dept_avg_salary'] = df.groupby('department')['salary'].transform('mean')
df['salary_vs_dept_avg'] = df['salary'] - df['dept_avg_salary']
df['salary_pct_of_dept_avg'] = (df['salary'] / df['dept_avg_salary'] * 100).round(1)
print("Employee Salary Relative to Department Average:")
print(df[['name', 'department', 'salary', 'dept_avg_salary', 'salary_vs_dept_avg', 'salary_pct_of_dept_avg']].head(10))
# Calculate ranking within department
df['dept_salary_rank'] = df.groupby('department')['salary'].transform(
    lambda x: x.rank(ascending=False)
)
df['dept_performance_rank'] = df.groupby('department')['performance_score'].transform(
    lambda x: x.rank(ascending=False)
)
print("\nRanking within Departments:")
print(df[['name', 'department', 'salary', 'dept_salary_rank', 'performance_score', 'dept_performance_rank']].head(10))
# Standardized scores
df['salary_zscore'] = df.groupby('department')['salary'].transform(
    lambda x: (x - x.mean()) / x.std()
)
df['performance_zscore'] = df.groupby('department')['performance_score'].transform(
    lambda x: (x - x.mean()) / x.std()
)
print("\nStandardized Scores within Departments:")
print(df[['name', 'department', 'salary_zscore', 'performance_zscore']].head(10).round(3))
6.2 Group Filter
python
print("\n=== Group Filter ===")
# Filter large departments (employee count > 150)
large_departments = df.groupby('department').filter(lambda x: len(x) > 150)
print(f"Large Department Employee Count: {len(large_departments)}")
print("Large Department Distribution:")
print(large_departments['department'].value_counts())
# Filter high-salary departments (avg salary > 8000)
high_salary_depts = df.groupby('department').filter(
    lambda x: x['salary'].mean() > 8000
)
print(f"\nHigh Salary Department Employee Count: {len(high_salary_depts)}")
print("High Salary Department Distribution:")
print(high_salary_depts['department'].value_counts())
# Filter high-performance departments (avg performance > 80)
high_performance_depts = df.groupby('department').filter(
    lambda x: x['performance_score'].mean() > 80
)
print(f"\nHigh Performance Department Employee Count: {len(high_performance_depts)}")
print("High Performance Department Distribution:")
print(high_performance_depts['department'].value_counts())
# Compound condition filter
complex_filter = df.groupby('department').filter(
    lambda x: (len(x) > 100) and (x['salary'].mean() > 7500) and (x['performance_score'].mean() > 75)
)
print(f"\nEmployees Meeting Compound Conditions: {len(complex_filter)}")
if len(complex_filter) > 0:
    print("Qualifying Departments:")
    print(complex_filter['department'].value_counts())
7. Aggregation Visualization
7.1 Aggregated Data Visualization
python
print("\n=== Aggregation Visualization ===")
# Department salary statistics visualization
dept_salary_stats = df.groupby('department')['salary'].agg(['mean', 'median', 'std']).round(0)
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
# Average salary bar chart
dept_salary_stats['mean'].plot(kind='bar', ax=axes[0,0], color='skyblue')
axes[0,0].set_title('Average Salary by Department')
axes[0,0].set_ylabel('Salary')
axes[0,0].tick_params(axis='x', rotation=45)
# Salary distribution box plot
df.boxplot(column='salary', by='department', ax=axes[0,1])
axes[0,1].set_title('Salary Distribution by Department')
axes[0,1].set_xlabel('Department')
axes[0,1].set_ylabel('Salary')
# Position distribution stacked bar chart
position_dist = pd.crosstab(df['department'], df['position'])
position_dist.plot(kind='bar', stacked=True, ax=axes[1,0])
axes[1,0].set_title('Position Distribution by Department')
axes[1,0].set_ylabel('Count')
axes[1,0].tick_params(axis='x', rotation=45)
axes[1,0].legend(title='Position', bbox_to_anchor=(1.05, 1), loc='upper left')
# Performance vs Salary scatter plot
for dept in df['department'].unique():
    dept_data = df[df['department'] == dept]
    axes[1,1].scatter(dept_data['performance_score'], dept_data['salary'],
                      label=dept, alpha=0.6)
axes[1,1].set_xlabel('Performance Score')
axes[1,1].set_ylabel('Salary')
axes[1,1].set_title('Performance vs Salary')
axes[1,1].legend()
plt.tight_layout()
plt.show()
7.2 Time Series Aggregation Visualization
python
# Hire trend visualization
monthly_hires = df.groupby(df['join_date'].dt.to_period('M')).size()
fig, axes = plt.subplots(2, 1, figsize=(12, 8))
# Monthly hire count
monthly_hires.plot(ax=axes[0], marker='o')
axes[0].set_title('Monthly New Hires Trend')
axes[0].set_ylabel('New Hires')
axes[0].grid(True, alpha=0.3)
# Cumulative hire count
monthly_hires.cumsum().plot(ax=axes[1], marker='s')
axes[1].set_title('Cumulative New Hires')
axes[1].set_ylabel('Cumulative Count')
axes[1].set_xlabel('Time')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
8. Performance Optimization Tips
8.1 Efficient Aggregation Methods
python
print("\n=== Performance Optimization Tips ===")
import time
# Create large dataset for performance testing
large_df = pd.concat([df] * 10, ignore_index=True)
print(f"Large Dataset Size: {large_df.shape}")
# Method 1: Using agg
start_time = time.time()
result1 = large_df.groupby('department').agg({
    'salary': ['mean', 'std'],
    'performance_score': 'mean'
})
time1 = time.time() - start_time
# Method 2: Separate calculations
start_time = time.time()
result2_mean = large_df.groupby('department')['salary'].mean()
result2_std = large_df.groupby('department')['salary'].std()
result2_perf = large_df.groupby('department')['performance_score'].mean()
time2 = time.time() - start_time
print(f"Using agg method time: {time1:.4f} seconds")
print(f"Separate calculations time: {time2:.4f} seconds")
# Memory optimization
print("\nMemory Optimization:")
print(f"Original data memory: {large_df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
# Optimize data types
optimized_df = large_df.copy()
optimized_df['department'] = optimized_df['department'].astype('category')
optimized_df['position'] = optimized_df['position'].astype('category')
optimized_df['salary'] = optimized_df['salary'].astype('int32')
optimized_df['age'] = optimized_df['age'].astype('int8')
print(f"Optimized data memory: {optimized_df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print(f"Memory savings: {(1 - optimized_df.memory_usage(deep=True).sum() / large_df.memory_usage(deep=True).sum()) * 100:.1f}%")
8.2 Parallel Processing
python
# Using numba to accelerate custom aggregation functions (requires numba installation)
try:
    from numba import jit
    @jit
    def fast_coefficient_of_variation(arr):
        """Fast coefficient of variation calculation"""
        mean_val = arr.mean()
        if mean_val == 0:
            return 0
        return arr.std() / mean_val
    print("\nUsing Numba Accelerated Aggregation:")
    start_time = time.time()
    fast_cv = large_df.groupby('department')['salary'].apply(
        lambda x: fast_coefficient_of_variation(x.values)
    )
    fast_time = time.time() - start_time
    print(f"Numba accelerated time: {fast_time:.4f} seconds")
except ImportError:
    print("Numba not installed, skipping acceleration demo")
# Chunked processing for large data
def chunked_aggregation(df, chunk_size=1000):
    """Chunked aggregation processing"""
    sums, counts = [], []
    for i in range(0, len(df), chunk_size):
        chunk = df.iloc[i:i+chunk_size]
        sums.append(chunk.groupby('department')['salary'].sum())
        counts.append(chunk.groupby('department')['salary'].count())
    # Combine per-chunk sums and counts so the result equals the exact overall mean
    # (averaging chunk means directly would weight unevenly sized groups incorrectly)
    total_sum = pd.concat(sums, axis=1).sum(axis=1)
    total_count = pd.concat(counts, axis=1).sum(axis=1)
    return total_sum / total_count
print("\nChunked Processing Result:")
chunked_result = chunked_aggregation(large_df)
print(chunked_result.head())
9. Practical Application Cases
9.1 Sales Data Analysis
python
print("\n=== Sales Data Analysis Case ===")
# Create sales data
np.random.seed(42)
sales_data = {
    'date': pd.date_range('2023-01-01', periods=365, freq='D'),
    'product': np.random.choice(['Product A', 'Product B', 'Product C', 'Product D'], 365),
    'region': np.random.choice(['North', 'South', 'East', 'West'], 365),
    'salesperson': np.random.choice([f'Sales{i}' for i in range(1, 21)], 365),
    'quantity': np.random.randint(1, 100, 365),
    'unit_price': np.random.uniform(50, 500, 365),
}
sales_df = pd.DataFrame(sales_data)
sales_df['revenue'] = sales_df['quantity'] * sales_df['unit_price']
sales_df['month'] = sales_df['date'].dt.month
sales_df['quarter'] = sales_df['date'].dt.quarter
print("Sales Data Sample:")
print(sales_df.head())
# 1. Monthly sales analysis
print("\nMonthly Sales Analysis:")
monthly_sales = sales_df.groupby('month').agg({
    'revenue': ['sum', 'mean', 'count'],
    'quantity': 'sum'
}).round(2)
monthly_sales.columns = ['Total Revenue', 'Avg Revenue', 'Transactions', 'Total Quantity']
print(monthly_sales)
# 2. Product-region analysis
print("\nProduct-Region Sales Matrix:")
product_region_matrix = pd.pivot_table(
    sales_df,
    values='revenue',
    index='product',
    columns='region',
    aggfunc='sum'
).round(0)
print(product_region_matrix)
# 3. Salesperson performance ranking
print("\nSalesperson Performance Ranking (Top 10):")
salesperson_performance = sales_df.groupby('salesperson').agg({
    'revenue': 'sum',
    'quantity': 'sum',
    'date': 'count'
}).round(2)
salesperson_performance.columns = ['Total Revenue', 'Total Quantity', 'Transactions']
salesperson_performance['Avg Transaction'] = (salesperson_performance['Total Revenue'] / salesperson_performance['Transactions']).round(2)
top_performers = salesperson_performance.sort_values('Total Revenue', ascending=False)
print(top_performers.head(10))
9.2 Customer Behavior Analysis
python
print("\n=== Customer Behavior Analysis Case ===")
# Create customer data
customer_data = {
    'customer_id': [],
    'purchase_date': [],
    'amount': [],
    'category': []
}
# Generate purchase records (customer_id is built in the same loop so all columns stay the same length)
for cid in range(1, 201):
    n_purchases = np.random.randint(1, 10)
    start_date = pd.Timestamp('2023-01-01')
    dates = pd.date_range(start_date, periods=n_purchases, freq=f'{np.random.randint(1, 30)}D')
    customer_data['customer_id'].extend([cid] * n_purchases)
    customer_data['purchase_date'].extend(dates)
    customer_data['amount'].extend(np.random.uniform(20, 500, n_purchases))
    customer_data['category'].extend(np.random.choice(['Electronics', 'Clothing', 'Food', 'Home'], n_purchases))
customer_df = pd.DataFrame(customer_data)
print(f"Customer Data Records: {len(customer_df)}")
print("Customer Data Sample:")
print(customer_df.head())
# 1. Customer value analysis
print("\nCustomer Value Analysis:")
customer_value = customer_df.groupby('customer_id').agg({
    'amount': ['sum', 'mean', 'count'],
    'purchase_date': ['min', 'max']
})
customer_value.columns = ['Total Spent', 'Avg Spent', 'Purchases', 'First Purchase', 'Last Purchase']
customer_value['Purchase Span'] = (customer_value['Last Purchase'] - customer_value['First Purchase']).dt.days
# Customer segmentation
customer_value['Customer Tier'] = pd.cut(
    customer_value['Total Spent'],
    bins=[0, 500, 1500, 3000, float('inf')],
    labels=['Bronze', 'Silver', 'Gold', 'Diamond']
)
print("Customer Tier Distribution:")
print(customer_value['Customer Tier'].value_counts())
print("\nStatistics by Customer Tier:")
tier_stats = customer_value.groupby('Customer Tier').agg({
    'Total Spent': ['mean', 'count'],
    'Purchases': 'mean',
    'Purchase Span': 'mean'
}).round(2)
print(tier_stats)
# 2. Category preference analysis
print("\nCustomer Category Preference Analysis:")
category_preference = pd.crosstab(
    customer_df['customer_id'],
    customer_df['category'],
    values=customer_df['amount'],
    aggfunc='sum'
).fillna(0)
# Find each customer's main category
customer_main_category = category_preference.idxmax(axis=1)
print("Customer Main Category Distribution:")
print(customer_main_category.value_counts())
10. Best Practices and Considerations
10.1 Sorting and Aggregation Best Practices
python
print("\n=== Best Practices Summary ===")
def sorting_aggregation_best_practices():
    """
    Sorting and aggregation best practices guide
    """
    practices = {
        "Sorting Best Practices": [
            "1. Use sort_values() instead of sort()",
            "2. For large datasets, consider using nlargest() or nsmallest()",
            "3. Use na_position parameter appropriately for missing values",
            "4. Use Categorical type for custom sorting",
            "5. Avoid repeated sorting in loops"
        ],
        "Aggregation Best Practices": [
            "1. Use agg() for multiple aggregation operations",
            "2. Prefer built-in aggregation functions over custom ones",
            "3. Use transform() appropriately to maintain data structure",
            "4. Use filter() for conditional filtering",
            "5. Consider memory usage, optimize data types"
        ],
        "Performance Optimization Tips": [
            "1. Filter data before aggregation",
            "2. Use categorical data types to reduce memory",
            "3. Avoid complex lambda functions in groupby",
            "4. Consider using numba to accelerate custom functions",
            "5. For very large datasets, consider chunked processing"
        ],
        "Common Pitfalls": [
            "1. Watch for index changes after aggregation",
            "2. Handle missing values carefully in aggregation results",
            "3. Avoid accidentally including non-numeric columns in aggregation",
            "4. Watch for timezone issues in time series aggregation",
            "5. Verify business logic reasonableness of aggregation results"
        ]
    }
    for category, items in practices.items():
        print(f"\n{category}:")
        for item in items:
            print(f" {item}")
sorting_aggregation_best_practices()
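The first pitfall above deserves a concrete illustration: groupby() moves the grouping keys into the result's index, which surprises code that expects them as ordinary columns; passing as_index=False (or calling reset_index()) keeps a flat frame. A minimal sketch with made-up numbers:

```python
import pandas as pd

demo = pd.DataFrame({
    'department': ['IT', 'HR', 'IT'],
    'salary': [9000, 6000, 7000],
})

# Default: 'department' becomes the index of the result
grouped = demo.groupby('department')['salary'].mean()
print(grouped.index.name)

# as_index=False keeps 'department' as a regular column
flat = demo.groupby('department', as_index=False)['salary'].mean()
print(flat.columns.tolist())
```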
# Performance comparison example
print("\n=== Performance Comparison Example ===")
# Inefficient method
def inefficient_aggregation(df):
    results = []
    for dept in df['department'].unique():
        dept_data = df[df['department'] == dept]
        avg_salary = dept_data['salary'].mean()
        results.append({'department': dept, 'avg_salary': avg_salary})
    return pd.DataFrame(results)
# Efficient method
def efficient_aggregation(df):
    return df.groupby('department')['salary'].mean().reset_index()
# Performance test
start_time = time.time()
inefficient_result = inefficient_aggregation(df)
inefficient_time = time.time() - start_time
start_time = time.time()
efficient_result = efficient_aggregation(df)
efficient_time = time.time() - start_time
print(f"Inefficient method time: {inefficient_time:.4f} seconds")
print(f"Efficient method time: {efficient_time:.4f} seconds")
print(f"Performance improvement: {inefficient_time / efficient_time:.1f}x")
Chapter Summary
This chapter comprehensively covered data sorting and aggregation techniques in Pandas:
Sorting Techniques
- Basic Sorting: Using sort_values() and sort_index()
- Advanced Sorting: Custom sorting, within-group sorting, multi-level sorting
- Sorting Optimization: Handling missing values, using categorical data types
Aggregation Techniques
- Basic Aggregation: Various uses of agg() and groupby()
- Advanced Aggregation: Custom aggregation functions, conditional aggregation, time series aggregation
- Group Operations: Differences and applications of transform(), filter(), and apply()
Data Pivot
- Pivot Tables: Multi-dimensional analysis with pivot_table()
- Cross-tabulations: Statistical analysis with crosstab()
- Data Reshaping: Wide-to-long transformations and multi-level index handling
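The wide-to-long transformation mentioned here was not shown earlier in the chapter; a minimal sketch with pd.melt() on a small hypothetical wide table (column names illustrative):

```python
import pandas as pd

# Hypothetical wide table: one revenue column per quarter
wide = pd.DataFrame({
    'department': ['IT', 'HR'],
    'Q1': [100, 80],
    'Q2': [120, 90],
})

# melt stacks the quarter columns into (quarter, revenue) rows
long_df = pd.melt(wide, id_vars='department',
                  value_vars=['Q1', 'Q2'],
                  var_name='quarter', value_name='revenue')
print(long_df)
```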
Performance Optimization
- Memory Optimization: Data type optimization, chunked processing
- Computation Optimization: Vectorized operations, parallel processing
- Best Practices: Avoiding common pitfalls, improving code efficiency
Mastering these sorting and aggregation techniques will help you:
- Quickly understand data distribution and patterns
- Perform multi-dimensional data analysis
- Generate various statistical reports and business insights
- Prepare for data visualization and modeling
Exercises
- Create a complex multi-level groupby aggregation analysis
- Implement a custom rolling window aggregation function
- Design a customer segmentation analysis system
- Optimize aggregation performance for a large dataset
- Build a dynamic pivot table analysis tool
In the next chapter, we will learn about Pandas data visualization, exploring how to transform data analysis results into intuitive charts.