Pandas DataFrame Data Structure
DataFrame is the most important two-dimensional data structure in Pandas; it can be thought of as a labeled two-dimensional array or table. This chapter comprehensively introduces the creation, operations, and applications of DataFrame.
📚 DataFrame Overview
What is DataFrame
DataFrame is a two-dimensional labeled data structure with the following characteristics:
- Both rows and columns have labels: Each row and column has a corresponding index
- Columns can have different data types: Integers, floats, strings, etc.
- Variable size: Columns can be inserted and deleted
- Spreadsheet-like: works much like an Excel sheet or a SQL table, in Python
Components of DataFrame
- Data (values): The actual stored data
- Row index (index): Labels for rows
- Column index (columns): Labels for columns
- Data types (dtypes): Data type of each column
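Each of these components is exposed as a DataFrame attribute; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2], 'y': [0.5, 1.5]}, index=['a', 'b'])

print(df.values)   # the underlying 2D array of values
print(df.index)    # row labels: Index(['a', 'b'], dtype='object')
print(df.columns)  # column labels: Index(['x', 'y'], dtype='object')
print(df.dtypes)   # per-column data types (x: int64, y: float64)
```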
🔨 Creating DataFrame
Creating from Dictionary
import pandas as pd
import numpy as np
# Create DataFrame from dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'City': ['Beijing', 'Shanghai', 'Guangzhou', 'Shenzhen'],
    'Salary': [8000, 12000, 15000, 9500]
}
df = pd.DataFrame(data)
print(df)
# Output:
#       Name  Age       City  Salary
# 0    Alice   25    Beijing    8000
# 1      Bob   30   Shanghai   12000
# 2  Charlie   35  Guangzhou   15000
# 3    David   28   Shenzhen    9500

Creating from List
# Create from 2D list
data_list = [
    ['Alice', 25, 'Beijing', 8000],
    ['Bob', 30, 'Shanghai', 12000],
    ['Charlie', 35, 'Guangzhou', 15000],
    ['David', 28, 'Shenzhen', 9500]
]
df_list = pd.DataFrame(data_list,
                       columns=['Name', 'Age', 'City', 'Salary'])
print(df_list)

Creating from Series
# Create from multiple Series
names = pd.Series(['Alice', 'Bob', 'Charlie', 'David'])
ages = pd.Series([25, 30, 35, 28])
cities = pd.Series(['Beijing', 'Shanghai', 'Guangzhou', 'Shenzhen'])
df_series = pd.DataFrame({
    'Name': names,
    'Age': ages,
    'City': cities
})
print(df_series)

Creating from NumPy Array
# Create from NumPy array
array_data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df_array = pd.DataFrame(array_data,
                        columns=['A', 'B', 'C'],
                        index=['Row1', 'Row2', 'Row3'])
print(df_array)

Creating Empty DataFrame
# Create empty DataFrame
empty_df = pd.DataFrame()
print("Empty DataFrame:")
print(empty_df)
# Create empty DataFrame with specified structure
structured_empty = pd.DataFrame(columns=['Name', 'Age', 'City'])
print(structured_empty)

Creating from Files (Preview)
# Create from CSV file (example)
# df_csv = pd.read_csv('data.csv')
# Create from Excel file (example)
# df_excel = pd.read_excel('data.xlsx')
# Create from JSON (example)
# df_json = pd.read_json('data.json')

🔍 DataFrame Attributes
Basic Attributes
# Create example DataFrame
students = pd.DataFrame({
    'ID': ['S001', 'S002', 'S003', 'S004', 'S005'],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Math': [85, 92, 78, 96, 88],
    'English': [90, 85, 92, 89, 94],
    'Physics': [88, 90, 85, 92, 87]
})
# View basic attributes
print(f"Shape: {students.shape}") # (5, 5)
print(f"Size: {students.size}") # 25
print(f"Dimensions: {students.ndim}") # 2
print(f"Row count: {len(students)}") # 5
print(f"Column count: {len(students.columns)}") # 5
print(f"Column names: {list(students.columns)}") # ['ID', 'Name', 'Math', 'English', 'Physics']
print(f"Index: {list(students.index)}") # [0, 1, 2, 3, 4]
print(f"Data types:\n{students.dtypes}") # dtype of each column

Viewing Data Overview
# View first few rows
print("First 3 rows:")
print(students.head(3))
# View last few rows
print("\nLast 2 rows:")
print(students.tail(2))
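Besides `head()` and `tail()`, `DataFrame.sample` draws random rows, which is handy for eyeballing large tables (`n` and `random_state` are its standard parameters):

```python
import pandas as pd

df = pd.DataFrame({'x': range(100)})

# Draw 3 random rows; random_state makes the draw reproducible
print(df.sample(n=3, random_state=42))
```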
# View basic info
print("\nBasic info:")
students.info()  # info() prints directly and returns None, so don't wrap it in print()
# View statistical summary
print("\nStatistical summary:")
print(students.describe())

Memory Usage
# Check memory usage
print("Memory usage:")
print(students.memory_usage(deep=True))
print(f"\nTotal memory usage: {students.memory_usage(deep=True).sum()} bytes")

🎯 Indexing and Selection
Column Selection
# Select single column (returns Series)
math_scores = students['Math']
print("Math scores:")
print(math_scores)
print(f"Type: {type(math_scores)}")
# Select multiple columns (returns DataFrame)
scores = students[['Name', 'Math', 'English']]
print("\nNames and scores:")
print(scores)
print(f"Type: {type(scores)}")
# Access column using dot notation (column name cannot have spaces or special characters)
# names = students.Name  # works, but not recommended

Row Selection
# Use iloc for position-based row selection
print("First row:")
print(students.iloc[0])
print("\nFirst three rows:")
print(students.iloc[0:3])
print("\nLast row:")
print(students.iloc[-1])
# Use loc for label-based row selection
print("\nRow with index 2:")
print(students.loc[2])
# Select multiple rows
print("\nRows with index 1 and 3:")
print(students.loc[[1, 3]])

Conditional Selection
# Boolean indexing
high_math = students[students['Math'] > 90]
print("Students with Math score > 90:")
print(high_math)
# Multiple conditions selection
excellent = students[(students['Math'] > 85) & (students['English'] > 90)]
print("\nStudents with Math > 85 and English > 90:")
print(excellent)
# Using isin method
selected_students = students[students['Name'].isin(['Alice', 'Charlie'])]
print("\nSelected students:")
print(selected_students)
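Boolean masks can also be inverted with `~`, which pairs naturally with `isin` (a small sketch reusing the column names from this example):

```python
import pandas as pd

students = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Math': [85, 92, 78]
})

# ~ negates a boolean mask: everyone NOT named Alice or Charlie
others = students[~students['Name'].isin(['Alice', 'Charlie'])]
print(others)  # only Bob's row
```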
# Using query method
query_result = students.query('Math > 85 and English > 88')
print("\nUsing query:")
print(query_result)

Advanced Indexing
# loc: Label-based indexing
print("Using loc to select specific rows and columns:")
print(students.loc[1:3, ['Name', 'Math']])
# iloc: Position-based indexing
print("\nUsing iloc to select specific positions:")
print(students.iloc[1:4, 2:4])
# at and iat: fast access to a single value
print(f"\nMath score at row label 1: {students.at[1, 'Math']}")
print(f"Value at position (1, 2): {students.iat[1, 2]}")

🔧 DataFrame Operations
Adding Columns
# Add new columns
students_copy = students.copy()
# Calculate total score
students_copy['Total'] = students_copy['Math'] + students_copy['English'] + students_copy['Physics']
# Calculate average score
students_copy['Average'] = students_copy['Total'] / 3
# Add grade column
def get_grade(avg_score):
    if avg_score >= 90:
        return 'A'
    elif avg_score >= 80:
        return 'B'
    elif avg_score >= 70:
        return 'C'
    else:
        return 'D'
students_copy['Grade'] = students_copy['Average'].apply(get_grade)
print("After adding columns:")
print(students_copy)

Deleting Columns
# Drop columns (doesn't modify original DataFrame)
df_dropped = students_copy.drop(['Total', 'Average'], axis=1)
print("After dropping columns:")
print(df_dropped)
# Drop columns (modifies original DataFrame)
students_copy.drop(['Grade'], axis=1, inplace=True)
print("\nAfter in-place drop:")
print(students_copy)
# Delete column using del
del students_copy['Total']
print("\nAfter using del:")
print(students_copy)

Adding Rows
# Add row using loc
students_new = students.copy()
students_new.loc[5] = ['S006', 'Frank', 89, 91, 86]
print("After adding row:")
print(students_new)
# Add row using concat
new_student = pd.DataFrame({
    'ID': ['S007'],
    'Name': ['Grace'],
    'Math': [93],
    'English': [88],
    'Physics': [90]
})
students_concat = pd.concat([students, new_student], ignore_index=True)
print("\nUsing concat to add row:")
print(students_concat)

Deleting Rows
# Delete rows by index
students_drop_row = students.drop([0, 2])
print("After dropping rows:")
print(students_drop_row)
# Delete rows based on condition
students_filtered = students[students['Math'] >= 85]
print("\nKeeping rows where Math >= 85:")
print(students_filtered)

📊 Data Operations and Calculations
Mathematical Operations
# Create numeric DataFrame
scores_df = students[['Math', 'English', 'Physics']].copy()
# Scalar operations
print("All scores plus 5:")
print(scores_df + 5)
print("\nAll scores times 1.1:")
print(scores_df * 1.1)
# Column-wise operations
print("\nDifference between Math and English:")
print(scores_df['Math'] - scores_df['English'])
# Row-wise operations
print("\nTotal score for each student:")
print(scores_df.sum(axis=1))
print("\nAverage score per subject:")
print(scores_df.mean(axis=0))

Statistical Functions
# Basic statistics
print("Basic statistical info:")
print(scores_df.describe())
print("\nMaximum per subject:")
print(scores_df.max())
print("\nMinimum per subject:")
print(scores_df.min())
print("\nAverage per subject:")
print(scores_df.mean())
print("\nStandard deviation per subject:")
print(scores_df.std())
# Correlation analysis
print("\nCorrelation between subjects:")
print(scores_df.corr())

Sorting
# Sort by single column
sorted_by_math = students.sort_values('Math', ascending=False)
print("Sorted by Math score descending:")
print(sorted_by_math)
# Sort by multiple columns
sorted_multi = students.sort_values(['Math', 'English'], ascending=[False, True])
print("\nSorted by Math descending, English ascending:")
print(sorted_multi)
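`nlargest` and `nsmallest` are convenient shortcuts for sorting by a column and taking the top rows:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Math': [85, 92, 78]})

# Top 2 rows by Math, equivalent to sort_values('Math', ascending=False).head(2)
print(df.nlargest(2, 'Math'))
```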
# Sort by index
sorted_by_index = students.sort_index(ascending=False)
print("\nSorted by index descending:")
print(sorted_by_index)

🔄 Data Processing
Handling Missing Values
# Create DataFrame with missing values
data_with_nan = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, np.nan],
    'C': [1, 2, 3, np.nan, 5],
    'D': [1, np.nan, np.nan, 4, 5]
})
print("Data with missing values:")
print(data_with_nan)
# Detect missing values
print("\nMissing value detection:")
print(data_with_nan.isnull())
print("\nCount of missing values per column:")
print(data_with_nan.isnull().sum())
# Drop missing values
print("\nDrop rows with missing values:")
print(data_with_nan.dropna())
print("\nDrop columns with missing values:")
print(data_with_nan.dropna(axis=1))
# Fill missing values
print("\nFill missing values with 0:")
print(data_with_nan.fillna(0))
print("\nFill missing values with mean:")
print(data_with_nan.fillna(data_with_nan.mean()))
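Another option worth knowing is `interpolate()`, which estimates each gap from its neighboring values (a sketch on a small toy column, separate from the example above):

```python
import pandas as pd
import numpy as np

s = pd.DataFrame({'A': [1.0, np.nan, 3.0, np.nan, 5.0]})

# Linear interpolation fills each NaN from the values around it
print(s.interpolate())  # A: 1.0, 2.0, 3.0, 4.0, 5.0
```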
print("\nForward fill:")
print(data_with_nan.ffill())  # fillna(method='ffill') is deprecated; use ffill()

Handling Duplicate Values
# Create DataFrame with duplicate values
data_with_duplicates = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
    'Age': [25, 30, 25, 35, 30],
    'City': ['Beijing', 'Shanghai', 'Beijing', 'Guangzhou', 'Shanghai']
})
print("Data with duplicates:")
print(data_with_duplicates)
# Detect duplicates
print("\nDuplicate detection:")
print(data_with_duplicates.duplicated())
# Drop duplicates
print("\nAfter dropping duplicates:")
print(data_with_duplicates.drop_duplicates())
# Drop duplicates based on specific columns
print("\nDrop duplicates based on Name:")
print(data_with_duplicates.drop_duplicates(subset=['Name']))

Data Type Conversion
# Create mixed type data
mixed_data = pd.DataFrame({
    'NumericString': ['1', '2', '3', '4', '5'],
    'Float': [1.1, 2.2, 3.3, 4.4, 5.5],
    'Integer': [1, 2, 3, 4, 5]
})
print("Original data types:")
print(mixed_data.dtypes)
# Convert data types
mixed_data['NumericString'] = pd.to_numeric(mixed_data['NumericString'])
mixed_data['Float'] = mixed_data['Float'].astype(int)
print("\nAfter type conversion:")
print(mixed_data.dtypes)
print(mixed_data)

🔗 DataFrame Merging and Joining
Concatenating DataFrames
# Create example data
df1 = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})
df2 = pd.DataFrame({
    'A': [7, 8, 9],
    'B': [10, 11, 12]
})
# Vertical concatenation
vertical_concat = pd.concat([df1, df2])
print("Vertical concatenation:")
print(vertical_concat)
# Reset index
vertical_reset = pd.concat([df1, df2], ignore_index=True)
print("\nAfter resetting index:")
print(vertical_reset)
# Horizontal concatenation
df3 = pd.DataFrame({
    'C': [13, 14, 15],
    'D': [16, 17, 18]
})
horizontal_concat = pd.concat([df1, df3], axis=1)
print("\nHorizontal concatenation:")
print(horizontal_concat)

Merging DataFrames
# Create data for merging
employees = pd.DataFrame({
    'EmployeeID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'DeptID': [101, 102, 101, 103]
})
departments = pd.DataFrame({
    'DeptID': [101, 102, 103],
    'DeptName': ['Engineering', 'Sales', 'HR']
})
print("Employees table:")
print(employees)
print("\nDepartments table:")
print(departments)
# Inner join
inner_merge = pd.merge(employees, departments, on='DeptID')
print("\nInner join result:")
print(inner_merge)
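Besides inner and left joins, `how='outer'` keeps unmatched rows from both sides, and `indicator=True` adds a `_merge` column recording where each row came from; a small sketch with toy tables (the IDs here are made up for illustration):

```python
import pandas as pd

employees = pd.DataFrame({'EmployeeID': [1, 2], 'DeptID': [101, 104]})
departments = pd.DataFrame({'DeptID': [101, 102], 'DeptName': ['Engineering', 'Sales']})

# Outer join keeps unmatched rows from both tables, filling gaps with NaN
outer = pd.merge(employees, departments, on='DeptID', how='outer', indicator=True)
print(outer)  # '_merge' column shows 'both', 'left_only', or 'right_only'
```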
# Left join
left_merge = pd.merge(employees, departments, on='DeptID', how='left')
print("\nLeft join result:")
print(left_merge)

🎨 Practical Application Examples
Example 1: Sales Data Analysis
# Create sales data
sales_data = pd.DataFrame({
    'Date': pd.date_range('2024-01-01', periods=10, freq='D'),
    'Product': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C'],
    'Quantity': [100, 150, 120, 80, 200, 110, 90, 180, 130, 95],
    'UnitPrice': [10.5, 15.2, 10.5, 8.8, 15.2, 10.5, 8.8, 15.2, 10.5, 8.8],
    'Salesperson': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob', 'Alice', 'Charlie', 'Bob', 'Alice', 'Charlie']
})
# Calculate revenue
sales_data['Revenue'] = sales_data['Quantity'] * sales_data['UnitPrice']
print("Sales data:")
print(sales_data)
# Group by product
product_summary = sales_data.groupby('Product').agg({
    'Quantity': 'sum',
    'Revenue': 'sum',
    'UnitPrice': 'mean'
}).round(2)
print("\nSummary by product:")
print(product_summary)
# Group by salesperson
salesperson_summary = sales_data.groupby('Salesperson').agg({
    'Quantity': 'sum',
    'Revenue': 'sum'
}).sort_values('Revenue', ascending=False)
print("\nSummary by salesperson:")
print(salesperson_summary)
# Find best sales day
best_day = sales_data.loc[sales_data['Revenue'].idxmax()]
print(f"\nBest sales day: {best_day['Date'].strftime('%Y-%m-%d')}, Revenue: {best_day['Revenue']}")

Example 2: Student Grade Management System
# Create student grade data
student_scores = pd.DataFrame({
    'ID': ['S001', 'S002', 'S003', 'S004', 'S005', 'S006'],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Class': ['1A', '1A', '1B', '1B', '1A', '1B'],
    'Chinese': [85, 92, 78, 96, 88, 91],
    'Math': [90, 85, 92, 89, 94, 87],
    'English': [88, 90, 85, 92, 87, 89],
    'Physics': [82, 88, 90, 85, 91, 86],
    'Chemistry': [89, 87, 88, 90, 85, 92]
})
print("Student grade data:")
print(student_scores)
# Calculate total and average
subjects = ['Chinese', 'Math', 'English', 'Physics', 'Chemistry']
student_scores['Total'] = student_scores[subjects].sum(axis=1)
student_scores['Average'] = student_scores[subjects].mean(axis=1).round(2)
# Calculate ranking
student_scores['Rank'] = student_scores['Total'].rank(ascending=False, method='min').astype(int)
print("\nAfter adding statistics:")
print(student_scores[['Name', 'Class', 'Total', 'Average', 'Rank']])
# Class statistics
class_summary = student_scores.groupby('Class').agg({
    'Total': ['mean', 'max', 'min'],
    'Average': 'mean'
}).round(2)
print("\nClass statistics:")
print(class_summary)
# Subject statistics
subject_stats = student_scores[subjects].describe().round(2)
print("\nSubject statistics:")
print(subject_stats)
# Find top student for each subject
print("\nTop student for each subject:")
for subject in subjects:
    max_idx = student_scores[subject].idxmax()
    max_student = student_scores.loc[max_idx]
    print(f"{subject}: {max_student['Name']} ({max_student[subject]} points)")

Example 3: Financial Data Analysis
# Create financial data
financial_data = pd.DataFrame({
    'Month': pd.date_range('2024-01-01', periods=12, freq='ME'),  # 'M' is deprecated; 'ME' = month end
    'Revenue': [120000, 135000, 142000, 158000, 163000, 171000,
                185000, 192000, 178000, 165000, 155000, 148000],
    'Expenses': [80000, 85000, 90000, 95000, 98000, 102000,
                 108000, 115000, 110000, 105000, 98000, 92000],
    'Employees': [50, 52, 55, 58, 60, 62, 65, 68, 66, 64, 62, 60]
})
# Calculate profit and profit margin
financial_data['Profit'] = financial_data['Revenue'] - financial_data['Expenses']
financial_data['ProfitMargin'] = (financial_data['Profit'] / financial_data['Revenue'] * 100).round(2)
# Calculate revenue per employee
financial_data['RevenuePerEmployee'] = (financial_data['Revenue'] / financial_data['Employees']).round(0)
print("Financial data:")
print(financial_data)
# Quarterly statistics
financial_data['Quarter'] = financial_data['Month'].dt.quarter
quarterly_summary = financial_data.groupby('Quarter').agg({
    'Revenue': 'sum',
    'Expenses': 'sum',
    'Profit': 'sum',
    'Employees': 'mean'
}).round(0)
quarterly_summary['QuarterlyProfitMargin'] = (quarterly_summary['Profit'] / quarterly_summary['Revenue'] * 100).round(2)
print("\nQuarterly statistics:")
print(quarterly_summary)
# Annual summary
annual_summary = {
    'Annual Total Revenue': financial_data['Revenue'].sum(),
    'Annual Total Expenses': financial_data['Expenses'].sum(),
    'Annual Total Profit': financial_data['Profit'].sum(),
    'Average Monthly Profit Margin': financial_data['ProfitMargin'].mean(),
    'Highest Monthly Profit': financial_data['Profit'].max(),
    'Lowest Monthly Profit': financial_data['Profit'].min()
}
print("\nAnnual Summary:")
for key, value in annual_summary.items():
    if 'Margin' in key:
        print(f"{key}: {value:.2f}%")
    else:
        print(f"{key}: {value:,.0f}")

📈 Performance Optimization Tips
Vectorized Operations
import time
# Create large DataFrame
large_df = pd.DataFrame({
    'A': np.random.randn(100000),
    'B': np.random.randn(100000),
    'C': np.random.randn(100000)
})
# Compare performance of loops vs vectorized operations
# Method 1: Loop (slow)
start_time = time.time()
result1 = []
for i in range(len(large_df)):
    if large_df.iloc[i]['A'] > 0:
        result1.append(large_df.iloc[i]['B'] * large_df.iloc[i]['C'])
    else:
        result1.append(0)
loop_time = time.time() - start_time
# Method 2: Vectorized (fast)
start_time = time.time()
result2 = np.where(large_df['A'] > 0, large_df['B'] * large_df['C'], 0)
vectorized_time = time.time() - start_time
print(f"Loop method time: {loop_time:.4f} seconds")
print(f"Vectorized method time: {vectorized_time:.4f} seconds")
print(f"Performance improvement: {loop_time/vectorized_time:.1f}x")

Memory Optimization
# Data type optimization
original_df = pd.DataFrame({
    'IntColumn': [1, 2, 3, 4, 5] * 1000,
    'FloatColumn': [1.1, 2.2, 3.3, 4.4, 5.5] * 1000,
    'CategoryColumn': (['A', 'B', 'C'] * 1667)[:5000]  # trim to 5000 so all columns have equal length
})
print(f"Original memory usage: {original_df.memory_usage(deep=True).sum()} bytes")
# Optimize data types
optimized_df = original_df.copy()
optimized_df['IntColumn'] = optimized_df['IntColumn'].astype('int8')
optimized_df['FloatColumn'] = optimized_df['FloatColumn'].astype('float32')
optimized_df['CategoryColumn'] = optimized_df['CategoryColumn'].astype('category')
print(f"Optimized memory usage: {optimized_df.memory_usage(deep=True).sum()} bytes")
print(f"Memory savings: {(1 - optimized_df.memory_usage(deep=True).sum()/original_df.memory_usage(deep=True).sum())*100:.1f}%")

📝 Chapter Summary
Through this chapter, you should have mastered:
✅ DataFrame Basic Concepts: Understanding the structure and characteristics of DataFrame
✅ Creating DataFrame: Mastering various methods for creating DataFrame
✅ Indexing and Selection: Proficiently using various indexing and selection methods
✅ Data Operations: Performing CRUD operations
✅ Data Processing: Handling missing values, duplicates, and other data quality issues
✅ Data Merging: Joining and merging multiple DataFrames
✅ Practical Applications: Solving real data analysis problems
✅ Performance Optimization: Improving code execution efficiency and memory usage
Key Points
- DataFrame is the Core of Pandas: Mastering DataFrame is fundamental to data analysis
- Flexibility of Indexing: loc, iloc, boolean indexing provide powerful data selection capabilities
- Importance of Vectorized Operations: Avoid loops, use Pandas built-in methods
- Data Quality Management: Handle missing values and duplicates promptly
- Memory Optimization: Choosing appropriate data types can significantly save memory
Next Steps
Now that you've mastered Pandas core data structures, next we'll learn how to handle CSV and Excel files.
Next Chapter: Pandas CSV and Excel Handling