
Pandas Common Functions

Pandas provides a rich set of functions for data processing and analysis. Mastering these common functions is key to using Pandas efficiently. This chapter will detail the most important and commonly used Pandas functions to help you navigate daily data analysis tasks with ease.

1. Data Viewing and Information Functions

1.1 Basic Information Viewing

python
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Create sample data
np.random.seed(42)
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Henry'],
    'age': [25, 30, 35, 28, 32, 27, 29, 31],
    'department': ['IT', 'HR', 'Finance', 'IT', 'Marketing', 'HR', 'IT', 'Finance'],
    'salary': [8000, 7000, 9500, 8500, 7500, 6800, 9000, 9200],
    'join_date': pd.date_range('2020-01-01', periods=8, freq='3M'),
    'performance': np.random.choice(['Excellent', 'Good', 'Average'], 8),
    'bonus': np.random.randint(1000, 5000, 8)
}

df = pd.DataFrame(data)
print("Sample Data:")
print(df)

# 1. head() - View first few rows
print("\nFirst 3 rows:")
print(df.head(3))

# 2. tail() - View last few rows
print("\nLast 3 rows:")
print(df.tail(3))

# 3. info() - Basic data information
print("\nData Info:")
print(df.info())

# 4. describe() - Statistical description
print("\nStatistical Description of Numeric Columns:")
print(df.describe())

# 5. shape - Data shape
print(f"\nData Shape: {df.shape}")

# 6. columns - Column names
print(f"\nColumn Names: {df.columns.tolist()}")

# 7. dtypes - Data types
print("\nData Types:")
print(df.dtypes)

# 8. index - Index information
print(f"\nIndex: {df.index}")

1.2 Data Overview Functions

python
# 1. sample() - Random sampling
print("Random Sample of 3 Rows:")
print(df.sample(3))

# 2. nunique() - Number of unique values
print("\nNumber of Unique Values per Column:")
print(df.nunique())

# 3. value_counts() - Value counts
print("\nDepartment Distribution:")
print(df['department'].value_counts())

# 4. unique() - Unique values
print("\nUnique Departments:")
print(df['department'].unique())

# 5. memory_usage() - Memory usage
print("\nMemory Usage:")
print(df.memory_usage(deep=True))

2. Data Selection and Filtering Functions

2.1 Conditional Selection

python
# 1. query() - Query expression
print("Employees older than 30:")
result1 = df.query('age > 30')
print(result1[['name', 'age', 'department']])

print("\nIT department employees with salary > 8000:")
result2 = df.query('department == "IT" and salary > 8000')
print(result2[['name', 'department', 'salary']])

# 2. where() - Conditional replacement
print("\nSet salary below 8000 to NaN:")
salary_filtered = df['salary'].where(df['salary'] >= 8000)
print(salary_filtered)

# 3. mask() - Conditional mask (opposite of where)
print("\nMask salary above 8000:")
salary_masked = df['salary'].mask(df['salary'] > 8000)
print(salary_masked)

# 4. isin() - Membership check
print("\nIT and HR department employees:")
it_hr_employees = df[df['department'].isin(['IT', 'HR'])]
print(it_hr_employees[['name', 'department']])

# 5. between() - Range selection
print("\nEmployees aged between 28-32:")
age_range = df[df['age'].between(28, 32)]
print(age_range[['name', 'age']])
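query() can also reference outside Python variables by prefixing them with @, which keeps thresholds out of the expression string. A small sketch using a hypothetical min_age variable:

```python
import pandas as pd

df_q = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'],
                     'age': [25, 30, 35]})

min_age = 28  # a local Python variable, referenced in the expression with @
older = df_q.query('age > @min_age')
print(older['name'].tolist())
```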

2.2 Position and Label Selection

python
# 1. loc[] - Label-based selection
print("Using loc to select specific rows and columns:")
print(df.loc[0:2, ['name', 'age', 'salary']])

# 2. iloc[] - Position-based selection
print("\nUsing iloc to select by position:")
print(df.iloc[0:3, 1:4])

# 3. at[] - Single value selection (label)
print(f"\nFirst employee's name: {df.at[0, 'name']}")

# 4. iat[] - Single value selection (position)
print(f"Value at row 0, column 1: {df.iat[0, 1]}")

# 5. filter() - Filter by name
print("\nColumns containing 'a':")
filtered_cols = df.filter(regex='a')
print(filtered_cols.columns.tolist())

# Filter by column names
print("\nSelect specific columns:")
selected_cols = df.filter(['name', 'salary', 'department'])
print(selected_cols.head(3))

3. Data Statistics and Aggregation Functions

3.1 Basic Statistical Functions

python
# 1. sum() - Sum
print(f"Total Salary: {df['salary'].sum():,}")

# 2. mean() - Mean
print(f"\nAverage Salary: {df['salary'].mean():.2f}")

# 3. median() - Median
print(f"Median Salary: {df['salary'].median()}")

# 4. std() - Standard deviation
print(f"Salary Standard Deviation: {df['salary'].std():.2f}")

# 5. var() - Variance
print(f"Salary Variance: {df['salary'].var():.2f}")

# 6. min() / max() - Minimum/Maximum
print(f"\nMinimum Salary: {df['salary'].min()}")
print(f"Maximum Salary: {df['salary'].max()}")

# 7. quantile() - Quantiles
print("\nSalary Quantiles:")
print(df['salary'].quantile([0.25, 0.5, 0.75]))

# 8. count() - Non-null count
print(f"\nNon-null Salary Count: {df['salary'].count()}")

# 9. mode() - Mode
print("\nDepartment Mode:")
print(df['department'].mode())

3.2 Aggregation Functions

python
# 1. agg() / aggregate() - Custom aggregation
print("Salary Statistics:")
salary_stats = df['salary'].agg(['mean', 'std', 'min', 'max'])
print(salary_stats)

# Multi-column aggregation
print("\nMulti-column Aggregation:")
multi_agg = df.agg({
    'salary': ['mean', 'sum'],
    'age': ['mean', 'min', 'max'],
    'bonus': 'sum'
})
print(multi_agg)

# 2. apply() - Apply function
print("\nApply Custom Function:")
# Calculate salary grade
def salary_grade(salary):
    if salary >= 9000:
        return 'High'
    elif salary >= 7500:
        return 'Medium'
    else:
        return 'Low'

df['salary_grade'] = df['salary'].apply(salary_grade)
print(df[['name', 'salary', 'salary_grade']])

# 3. map() - Map values
print("\nDepartment Mapping:")
dept_mapping = {'IT': 'Information Technology', 'HR': 'Human Resources', 'Finance': 'Finance', 'Marketing': 'Marketing'}
df['dept_full'] = df['department'].map(dept_mapping)
print(df[['name', 'department', 'dept_full']].head())
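One caveat worth knowing: map() returns NaN for any value that is missing from the mapping dictionary. A small illustration with a deliberately incomplete mapping:

```python
import pandas as pd

depts = pd.Series(['IT', 'HR', 'Legal'])  # 'Legal' is not in the mapping
partial_mapping = {'IT': 'Information Technology', 'HR': 'Human Resources'}

mapped = depts.map(partial_mapping)
print(mapped)  # the unmapped 'Legal' becomes NaN
```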

4. Data Sorting Functions

4.1 Sorting Operations

python
# 1. sort_values() - Sort by values
print("Sort by Salary (Descending):")
sorted_by_salary = df.sort_values('salary', ascending=False)
print(sorted_by_salary[['name', 'salary']].head())

# Multi-column sorting
print("\nSort by Department and Salary:")
multi_sorted = df.sort_values(['department', 'salary'], ascending=[True, False])
print(multi_sorted[['name', 'department', 'salary']])

# 2. sort_index() - Sort by index
print("\nSort by Index (Descending):")
index_sorted = df.sort_index(ascending=False)
print(index_sorted.head(3))

# 3. nlargest() / nsmallest() - Top/Bottom n values
print("\nTop 3 Highest Salaries:")
top3_salary = df.nlargest(3, 'salary')
print(top3_salary[['name', 'salary']])

print("\nTop 3 Youngest Employees:")
youngest3 = df.nsmallest(3, 'age')
print(youngest3[['name', 'age']])

4.2 Ranking Functions

python
# 1. rank() - Ranking
print("Salary Ranking:")
df['salary_rank'] = df['salary'].rank(ascending=False)
print(df[['name', 'salary', 'salary_rank']].sort_values('salary_rank'))

# Different ranking methods
print("\nDifferent Ranking Methods:")
ranking_methods = ['average', 'min', 'max', 'first', 'dense']
for method in ranking_methods:
    df[f'rank_{method}'] = df['salary'].rank(method=method, ascending=False)

print(df[['name', 'salary'] + [f'rank_{m}' for m in ranking_methods]])
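Because every salary in the sample data is distinct, all five methods produce identical ranks above; their differences only show up with ties. A minimal illustration on a Series with duplicates:

```python
import pandas as pd

tied = pd.Series([100, 200, 200, 300])
for method in ['average', 'min', 'max', 'first', 'dense']:
    # tied values share a rank differently under each method
    print(method, tied.rank(method=method).tolist())
```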

5. Data Grouping Functions

5.1 GroupBy Operations

python
# 1. groupby() - Grouping
print("Group by Department Statistics:")
dept_stats = df.groupby('department').agg({
    'salary': ['mean', 'sum', 'count'],
    'age': 'mean',
    'bonus': 'sum'
})
print(dept_stats)

# 2. Apply function to groups
print("\nSalary Ranking within Departments:")
def rank_in_group(group):
    group['dept_salary_rank'] = group['salary'].rank(ascending=False)
    return group

df_with_dept_rank = df.groupby('department', group_keys=False).apply(rank_in_group)
print(df_with_dept_rank[['name', 'department', 'salary', 'dept_salary_rank']])

# 3. transform() - Transform within groups
print("\nDepartment Average Salary:")
df['dept_avg_salary'] = df.groupby('department')['salary'].transform('mean')
df['salary_vs_dept_avg'] = df['salary'] - df['dept_avg_salary']
print(df[['name', 'department', 'salary', 'dept_avg_salary', 'salary_vs_dept_avg']])

5.2 Group Statistics Functions

python
# 1. size() - Group size
print("Number of Employees per Department:")
dept_size = df.groupby('department').size()
print(dept_size)

# 2. count() - Non-null count
print("\nNon-null Count per Department:")
dept_count = df.groupby('department').count()
print(dept_count)

# 3. nunique() - Number of unique values in groups
print("\nNumber of Performance Levels per Department:")
perf_variety = df.groupby('department')['performance'].nunique()
print(perf_variety)

# 4. first() / last() - First/Last value
print("\nFirst Employee per Department:")
first_employee = df.groupby('department').first()
print(first_employee[['name', 'salary']])

6. Data Merging and Joining Functions

6.1 Merge Operations

python
# Create additional data for demonstration
dept_info = pd.DataFrame({
    'department': ['IT', 'HR', 'Finance', 'Marketing'],
    'manager': ['Manager Zhang', 'Manager Li', 'Manager Wang', 'Manager Zhao'],
    'budget': [1000000, 500000, 800000, 600000]
})

project_data = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Alice'],
    'project': ['Project A', 'Project B', 'Project C', 'Project D'],
    'hours': [40, 35, 45, 30]
})

print("Department Info:")
print(dept_info)
print("\nProject Data:")
print(project_data)

# 1. merge() - Merge
print("\nMerge Employee and Department Info:")
merged_df = pd.merge(df, dept_info, on='department', how='left')
print(merged_df[['name', 'department', 'manager', 'budget']].head())

# 2. join() - Join
print("\nUsing join to connect:")
df_indexed = df.set_index('name')
project_indexed = project_data.set_index('name')
joined_df = df_indexed.join(project_indexed, how='left')
print(joined_df[['department', 'salary', 'project', 'hours']].head())

# 3. concat() - Concatenate
print("\nVertical Concatenation:")
df1 = df.head(3)
df2 = df.tail(3)
concatenated = pd.concat([df1, df2], ignore_index=True)
print(concatenated[['name', 'department']])

# Horizontal concatenation
print("\nHorizontal Concatenation:")
horizontal_concat = pd.concat([df[['name', 'age']], df[['salary', 'department']]], axis=1)
print(horizontal_concat.head(3))
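When debugging joins, merge() also accepts indicator=True, which adds a _merge column recording whether each row came from the left frame, the right frame, or both. A quick sketch with two small hypothetical frames:

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [20, 30, 40]})

# outer join keeps all keys; _merge shows each row's origin
merged = pd.merge(left, right, on='key', how='outer', indicator=True)
print(merged)
```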

6.2 Data Reshaping Functions

python
# 1. pivot_table() - Pivot table
print("Department-Performance Pivot Table:")
pivot_table = df.pivot_table(values='salary', index='department', 
                            columns='performance', aggfunc='mean')
print(pivot_table)

# 2. melt() - Wide to long format
print("\nWide to Long Format:")
wide_df = df[['name', 'salary', 'bonus']]
melted_df = pd.melt(wide_df, id_vars=['name'], 
                   value_vars=['salary', 'bonus'],
                   var_name='income_type', value_name='amount')
print(melted_df.head())

# 3. stack() / unstack() - Stack/Unstack
print("\nStack Operation:")
stacked = df.set_index(['name', 'department'])[['salary', 'bonus']].stack()
print(stacked.head())

print("\nUnstack Operation:")
unstacked = stacked.unstack()
print(unstacked.head())

7. String Processing Functions

7.1 String Operations

python
# Create data with strings
string_data = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'JOHN', 'jane'],
    'email': ['alice@company.com', 'bob@COMPANY.COM', 'charlie@company.cn', 
              'david@company.com', 'john@company.com', 'jane@company.com'],
    'phone': ['138-0013-8000', '139 0013 9000', '13700137000', 
              '136-0013-6000', '135-0013-5000', '134-0013-4000']
})

print("String Data:")
print(string_data)

# 1. str.len() - String length
print("\nName Length:")
string_data['name_length'] = string_data['name'].str.len()
print(string_data[['name', 'name_length']])

# 2. str.upper() / str.lower() - Case conversion
print("\nEmail Lowercase:")
string_data['email_lower'] = string_data['email'].str.lower()
print(string_data[['email', 'email_lower']])

# 3. str.strip() - Remove whitespace
print("\nStrip Name:")
string_data['name_stripped'] = string_data['name'].str.strip()
# no surrounding whitespace in this sample, so the values are unchanged
print(string_data[['name', 'name_stripped']])

# 4. str.replace() - Replace
print("\nStandardize Phone Format:")
string_data['phone_clean'] = string_data['phone'].str.replace(r'[\s-]', '', regex=True)
print(string_data[['phone', 'phone_clean']])

# 5. str.contains() - Contains check
print("\nEmails with 'company' domain:")
# case=False also matches the uppercase 'COMPANY.COM' entry
company_emails = string_data[string_data['email'].str.contains('company', case=False)]
print(company_emails['email'])

# 6. str.startswith() / str.endswith() - Start/End check
print("\nNames starting with 'A':")
a_names = string_data[string_data['name'].str.startswith('A')]
print(a_names['name'])

# 7. str.split() - Split
print("\nSplit Email:")
email_parts = string_data['email'].str.split('@', expand=True)
email_parts.columns = ['username', 'domain']
print(email_parts)

7.2 Regular Expression Functions

python
# 1. str.extract() - Extract pattern
print("Extract Email Username:")
username_pattern = r'([^@]+)@'
usernames = string_data['email'].str.extract(username_pattern)
print(usernames)

# 2. str.findall() - Find all matches
print("\nFind All Digits:")
digits = string_data['phone'].str.findall(r'\d+')
print(digits)

# 3. str.match() - Match pattern
print("\nMatch Alphabetic Names:")
alpha_pattern = r'^[A-Za-z]+$'
alpha_names = string_data[string_data['name'].str.match(alpha_pattern)]
print(alpha_names['name'])
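str.match() only anchors the pattern at the start of the string (the `$` in the pattern above is what forces a whole-string match). To require the entire string to match without explicit anchors, pandas also provides str.fullmatch(). A quick comparison:

```python
import pandas as pd

s = pd.Series(['Alice', 'Alice123'])
print(s.str.match(r'[A-Za-z]+'))      # both True: pattern matches at the start
print(s.str.fullmatch(r'[A-Za-z]+'))  # only 'Alice' matches the whole string
```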

8. Date and Time Functions

8.1 Date and Time Operations

python
# Create time series data
date_data = pd.DataFrame({
    'date': pd.date_range('2023-01-01', periods=10, freq='D'),
    'sales': np.random.randint(100, 1000, 10),
    'temperature': np.random.normal(20, 5, 10)
})

print("Time Series Data:")
print(date_data)

# 1. dt.year, dt.month, dt.day - Extract date components
print("\nExtract Date Components:")
date_data['year'] = date_data['date'].dt.year
date_data['month'] = date_data['date'].dt.month
date_data['day'] = date_data['date'].dt.day
date_data['weekday'] = date_data['date'].dt.day_name()
print(date_data[['date', 'year', 'month', 'day', 'weekday']].head())

# 2. dt.strftime() - Format date
print("\nFormat Date:")
date_data['date_formatted'] = date_data['date'].dt.strftime('%Y-%m-%d')
print(date_data[['date', 'date_formatted']].head())

# 3. Date arithmetic
print("\nDate Arithmetic:")
date_data['days_from_start'] = (date_data['date'] - date_data['date'].min()).dt.days
date_data['future_date'] = date_data['date'] + pd.Timedelta(days=30)
print(date_data[['date', 'days_from_start', 'future_date']].head())

8.2 Time Series Resampling

python
# Create longer time series
long_ts = pd.DataFrame({
    'datetime': pd.date_range('2023-01-01', periods=100, freq='h'),  # hourly ('H' is deprecated in pandas >= 2.2)
    'value': np.random.randn(100).cumsum()
})
long_ts.set_index('datetime', inplace=True)

print("Original Hourly Data (First 5 Rows):")
print(long_ts.head())

# 1. resample() - Resampling
print("\nResample by Day (Mean):")
daily_avg = long_ts.resample('D').mean()
print(daily_avg.head())

print("\nResample by Day (Multiple Statistics):")
daily_stats = long_ts.resample('D').agg(['mean', 'std', 'min', 'max'])
print(daily_stats.head())

# 2. rolling() - Rolling window
print("\n24-Hour Rolling Mean:")
long_ts['rolling_24h'] = long_ts['value'].rolling(window=24).mean()
print(long_ts[['value', 'rolling_24h']].head(25))
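Note that the first 23 values of rolling_24h above are NaN, because a full 24-observation window is not yet available. The min_periods parameter relaxes this by allowing partial windows; a small sketch:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])
print(s.rolling(window=3).mean())                 # first two values are NaN
print(s.rolling(window=3, min_periods=1).mean())  # partial windows allowed
```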

9. Missing Value Functions

9.1 Missing Value Detection

python
# Create data with missing values
data_with_na = df.copy()
data_with_na.loc[1, 'salary'] = np.nan
data_with_na.loc[3, 'age'] = np.nan
data_with_na.loc[5, 'bonus'] = np.nan

print("Data with Missing Values:")
print(data_with_na[['name', 'age', 'salary', 'bonus']])

# 1. isnull() / isna() - Detect missing values
print("\nMissing Value Detection:")
print(data_with_na.isnull().sum())

# 2. notnull() / notna() - Detect non-missing values
print("\nNon-Missing Value Statistics:")
print(data_with_na.notnull().sum())

# 3. dropna() - Drop missing values
print("\nDrop Rows with Missing Values:")
no_na_rows = data_with_na.dropna()
print(f"Original Rows: {len(data_with_na)}, After Dropping: {len(no_na_rows)}")

# Drop specific column missing values
print("\nDrop Rows with Missing Salary:")
no_salary_na = data_with_na.dropna(subset=['salary'])
print(f"After Dropping Salary NaN: {len(no_salary_na)}")

9.2 Missing Value Filling

python
# 1. fillna() - Fill missing values
print("Fill Missing Values with 0:")
filled_zero = data_with_na.fillna(0)
print(filled_zero[['name', 'age', 'salary', 'bonus']])

# Fill different columns with different values
print("\nFill with Different Strategies:")
fill_values = {
    'age': data_with_na['age'].mean(),
    'salary': data_with_na['salary'].median(),
    'bonus': 0
}
filled_custom = data_with_na.fillna(fill_values)
print(filled_custom[['name', 'age', 'salary', 'bonus']])

# 2. ffill() / bfill() - Forward/Backward fill
# (fillna(method=...) is deprecated; use ffill()/bfill() directly)
print("\nForward Fill:")
ffilled = data_with_na.ffill()
print(ffilled[['name', 'age', 'salary', 'bonus']])

print("\nBackward Fill:")
bfilled = data_with_na.bfill()
print(bfilled[['name', 'age', 'salary', 'bonus']])

# 3. interpolate() - Interpolation fill
print("\nLinear Interpolation Fill:")
interpolated = data_with_na[['age', 'salary', 'bonus']].interpolate()
print(interpolated)

10. Data Type Conversion Functions

10.1 Type Conversion

python
# 1. astype() - Type conversion
print("Data Type Conversion:")
type_converted = df.copy()
type_converted['age'] = type_converted['age'].astype('float64')
type_converted['department'] = type_converted['department'].astype('category')
print(type_converted.dtypes)

# 2. to_numeric() - Convert to numeric
string_numbers = pd.Series(['1', '2', '3.5', 'abc', '5'])
print("\nString to Numeric:")
numeric_converted = pd.to_numeric(string_numbers, errors='coerce')
print(numeric_converted)

# 3. to_datetime() - Convert to datetime
date_strings = pd.Series(['2023-01-01', '2023/02/15', '2023-03-20', 'invalid'])
print("\nString to Datetime:")
# pandas >= 2.0 infers a single format from the first element; format='mixed'
# parses heterogeneous formats, and errors='coerce' turns failures into NaT
date_converted = pd.to_datetime(date_strings, format='mixed', errors='coerce')
print(date_converted)

# 4. pd.Categorical() - Convert to categorical
print("\nConvert to Categorical:")
categorical_dept = pd.Categorical(df['department'])
print(f"Categories: {categorical_dept.categories}")
print(f"Codes: {categorical_dept.codes}")

11. Utility Functions

11.1 Data Validation and Cleaning

python
# 1. duplicated() - Detect duplicates
print("Detect Duplicates:")
duplicates = df.duplicated()
print(f"Duplicate Rows: {duplicates.sum()}")

# 2. drop_duplicates() - Remove duplicates
print("\nRemove Duplicates:")
no_duplicates = df.drop_duplicates()
print(f"Original Rows: {len(df)}, After Deduplication: {len(no_duplicates)}")

# 3. reset_index() - Reset index
print("\nReset Index:")
reset_df = df.reset_index(drop=True)
print(reset_df.index)

# 4. set_index() - Set index
print("\nSet Name as Index:")
name_indexed = df.set_index('name')
print(name_indexed.head(3))

11.2 Data Import and Export

python
# 1. to_csv() - Export to CSV
df.to_csv('employees.csv', index=False, encoding='utf-8')
print("Data exported to employees.csv")

# 2. to_excel() - Export to Excel
df.to_excel('employees.xlsx', index=False, sheet_name='Employee Data')
print("Data exported to employees.xlsx")

# 3. to_json() - Export to JSON
df.to_json('employees.json', orient='records', force_ascii=False, indent=2)
print("Data exported to employees.json")

# 4. to_dict() - Convert to dictionary
print("\nConvert to Dictionary:")
dict_data = df.head(2).to_dict('records')
print(dict_data)
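Each export function has a matching read_* counterpart (read_csv(), read_excel(), read_json()). A round-trip sketch using an in-memory buffer so no file is written to disk:

```python
import io
import pandas as pd

df_out = pd.DataFrame({'name': ['Alice', 'Bob'], 'salary': [8000, 7000]})

# write CSV into a string buffer, then read it back
buffer = io.StringIO()
df_out.to_csv(buffer, index=False)
buffer.seek(0)

df_back = pd.read_csv(buffer)
print(df_back)
```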

12. Performance Optimization Functions

12.1 Memory and Performance Optimization

python
# 1. memory_usage() - Memory usage
print("Memory Usage Details:")
memory_info = df.memory_usage(deep=True)
print(memory_info)
print(f"Total Memory Usage: {memory_info.sum()} bytes")

# 2. Data type optimization
def optimize_dtypes(df):
    """
    Optimize DataFrame data types to save memory
    """
    optimized_df = df.copy()
    
    # Optimize integer types
    for col in optimized_df.select_dtypes(include=['int64']).columns:
        col_min = optimized_df[col].min()
        col_max = optimized_df[col].max()
        
        if col_min >= 0:
            if col_max <= 255:
                optimized_df[col] = optimized_df[col].astype('uint8')
            elif col_max <= 65535:
                optimized_df[col] = optimized_df[col].astype('uint16')
        else:
            if col_min >= -128 and col_max <= 127:
                optimized_df[col] = optimized_df[col].astype('int8')
            elif col_min >= -32768 and col_max <= 32767:
                optimized_df[col] = optimized_df[col].astype('int16')
    
    # Convert to categorical type
    for col in optimized_df.select_dtypes(include=['object']).columns:
        if optimized_df[col].nunique() / len(optimized_df) < 0.5:
            optimized_df[col] = optimized_df[col].astype('category')
    
    return optimized_df

optimized_df = optimize_dtypes(df)
print("\nMemory Comparison Before and After Optimization:")
print(f"Before: {df.memory_usage(deep=True).sum()} bytes")
print(f"After: {optimized_df.memory_usage(deep=True).sum()} bytes")

13. Method Chaining

13.1 Method Chaining Examples

python
# Method chaining example
result = (df
    .query('age > 25')  # Filter employees older than 25
    .groupby('department')['salary']  # Group by department
    .agg(['mean', 'count'])  # Calculate mean and count
    .round(2)  # Round to 2 decimal places
    .sort_values('mean', ascending=False)  # Sort by mean salary descending
)

print("Method Chaining Result:")
print(result)

# Complex method chaining
complex_result = (df
    .assign(  # Add new columns
        salary_level=lambda x: pd.cut(x['salary'], 
                                    bins=[0, 7000, 8500, float('inf')], 
                                    labels=['Low', 'Medium', 'High']),
        age_group=lambda x: pd.cut(x['age'], 
                                 bins=[0, 28, 32, float('inf')], 
                                 labels=['Young', 'Middle', 'Senior'])
    )
    .groupby(['department', 'salary_level'])  # Multi-level grouping
    .size()  # Calculate group size
    .unstack(fill_value=0)  # Pivot table format
)

print("\nComplex Method Chaining Result:")
print(complex_result)

14. Common Function Combinations

14.1 Data Analysis Workflow

python
def analyze_employee_data(df):
    """
    Employee Data Analysis Workflow
    """
    print("=== Employee Data Analysis Report ===")
    
    # 1. Basic information
    print(f"\n1. Basic Information")
    print(f"   Total Employees: {len(df)}")
    print(f"   Number of Departments: {df['department'].nunique()}")
    print(f"   Data Completeness: {(1 - df.isnull().sum().sum() / df.size) * 100:.1f}%")
    
    # 2. Salary analysis
    print(f"\n2. Salary Analysis")
    salary_stats = df['salary'].describe()
    print(f"   Average Salary: {salary_stats['mean']:.0f}")
    print(f"   Median Salary: {salary_stats['50%']:.0f}")
    print(f"   Salary Range: {salary_stats['min']:.0f} - {salary_stats['max']:.0f}")
    
    # 3. Department analysis
    print(f"\n3. Department Analysis")
    dept_analysis = df.groupby('department').agg({
        'salary': ['mean', 'count'],
        'age': 'mean'
    }).round(1)
    print(dept_analysis)
    
    # 4. Age analysis
    print(f"\n4. Age Distribution")
    age_groups = pd.cut(df['age'], bins=[0, 28, 32, float('inf')], 
                       labels=['≤28', '29-32', '>32'])
    age_dist = age_groups.value_counts().sort_index()
    print(age_dist)
    
    # 5. Performance analysis
    print(f"\n5. Performance Distribution")
    perf_dist = df['performance'].value_counts()
    print(perf_dist)
    
    return {
        'total_employees': len(df),
        'avg_salary': df['salary'].mean(),
        'dept_count': df['department'].nunique(),
        'age_distribution': age_dist.to_dict(),
        'performance_distribution': perf_dist.to_dict()
    }

# Execute analysis
analysis_result = analyze_employee_data(df)

Chapter Summary

This chapter detailed the most important and commonly used functions in Pandas:

  1. Data Viewing Functions: head(), tail(), info(), describe(), shape, dtypes, etc.
  2. Data Selection Functions: query(), where(), isin(), between(), loc[], iloc[], etc.
  3. Statistical Aggregation Functions: sum(), mean(), std(), agg(), apply(), map(), etc.
  4. Sorting Functions: sort_values(), sort_index(), rank(), nlargest(), nsmallest(), etc.
  5. Grouping Functions: groupby(), transform(), size(), count(), etc.
  6. Merging and Joining Functions: merge(), join(), concat(), pivot(), melt(), etc.
  7. String Functions: str.len(), str.upper(), str.replace(), str.contains(), etc.
  8. Time Functions: dt.year, dt.month, resample(), rolling(), etc.
  9. Missing Value Functions: isnull(), dropna(), fillna(), interpolate(), etc.
  10. Type Conversion Functions: astype(), to_numeric(), to_datetime(), etc.
  11. Utility Functions: duplicated(), reset_index(), to_csv(), memory_usage(), etc.

Mastering these common functions is key to using Pandas efficiently. In practice, these functions often need to be combined, and method chaining can build concise yet powerful data processing workflows.

Exercises

  1. Complete a full data analysis workflow using method chaining
  2. Write a function to automatically generate a data quality report
  3. Implement a data cleaning pipeline using multiple common functions
  4. Create an employee performance analysis system using various aggregation functions
  5. Design a time series data analysis tool

In the next chapter, we will learn about Pandas correlation analysis to explore relationships and patterns between variables.

Content is for learning and research only.