Pandas DataFrame Data Structure

DataFrame is the most important two-dimensional data structure in Pandas. It can be understood as a labeled two-dimensional array or table. This chapter comprehensively covers creating, operating on, and applying DataFrames.

📚 DataFrame Overview

What is DataFrame

DataFrame is a two-dimensional labeled data structure with the following characteristics:

  • Both rows and columns have labels: Each row and column has a corresponding index
  • Columns can have different data types: Integers, floats, strings, etc.
  • Variable size: Columns can be inserted and deleted
  • Spreadsheet-like: A Python counterpart to an Excel sheet or SQL table

Components of DataFrame

  • Data (values): The actual stored data
  • Row index (index): Labels for rows
  • Column index (columns): Labels for columns
  • Data types (dtypes): Data type of each column
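All four components can be inspected directly as attributes of any DataFrame; a minimal sketch (the column names here are illustrative):

```python
import pandas as pd

# A tiny DataFrame to inspect the four components
df = pd.DataFrame({'x': [1, 2, 3], 'y': [0.5, 1.5, 2.5]})

print(df.values)    # the stored data, as a NumPy array
print(df.index)     # row labels: RangeIndex(start=0, stop=3, step=1)
print(df.columns)   # column labels: Index(['x', 'y'], dtype='object')
print(df.dtypes)    # per-column data types: x is int64, y is float64
```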

🔨 Creating DataFrame

Creating from Dictionary

python
import pandas as pd
import numpy as np

# Create DataFrame from dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'City': ['Beijing', 'Shanghai', 'Guangzhou', 'Shenzhen'],
    'Salary': [8000, 12000, 15000, 9500]
}

df = pd.DataFrame(data)
print(df)
# Output:
#       Name  Age       City  Salary
# 0    Alice   25    Beijing    8000
# 1      Bob   30   Shanghai   12000
# 2  Charlie   35  Guangzhou   15000
# 3    David   28   Shenzhen    9500

Creating from List

python
# Create from 2D list
data_list = [
    ['Alice', 25, 'Beijing', 8000],
    ['Bob', 30, 'Shanghai', 12000],
    ['Charlie', 35, 'Guangzhou', 15000],
    ['David', 28, 'Shenzhen', 9500]
]

df_list = pd.DataFrame(data_list, 
                      columns=['Name', 'Age', 'City', 'Salary'])
print(df_list)

Creating from Series

python
# Create from multiple Series
names = pd.Series(['Alice', 'Bob', 'Charlie', 'David'])
ages = pd.Series([25, 30, 35, 28])
cities = pd.Series(['Beijing', 'Shanghai', 'Guangzhou', 'Shenzhen'])

df_series = pd.DataFrame({
    'Name': names,
    'Age': ages,
    'City': cities
})
print(df_series)

Creating from NumPy Array

python
# Create from NumPy array
array_data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df_array = pd.DataFrame(array_data, 
                       columns=['A', 'B', 'C'],
                       index=['Row1', 'Row2', 'Row3'])
print(df_array)

Creating Empty DataFrame

python
# Create empty DataFrame
empty_df = pd.DataFrame()
print("Empty DataFrame:")
print(empty_df)

# Create empty DataFrame with specified structure
structured_empty = pd.DataFrame(columns=['Name', 'Age', 'City'])
print(structured_empty)

Creating from Files (Preview)

python
# Create from CSV file (example)
# df_csv = pd.read_csv('data.csv')

# Create from Excel file (example)
# df_excel = pd.read_excel('data.xlsx')

# Create from JSON (example)
# df_json = pd.read_json('data.json')
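Since the files above don't exist yet, here is a self-contained sketch: `read_csv` also accepts any file-like object, so you can try it with an in-memory `io.StringIO` instead of a real file:

```python
import io
import pandas as pd

# Simulate a CSV file in memory -- no data.csv needed
csv_text = "Name,Age\nAlice,25\nBob,30\n"
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)
```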

🔍 DataFrame Attributes

Basic Attributes

python
# Create example DataFrame
students = pd.DataFrame({
    'ID': ['S001', 'S002', 'S003', 'S004', 'S005'],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Math': [85, 92, 78, 96, 88],
    'English': [90, 85, 92, 89, 94],
    'Physics': [88, 90, 85, 92, 87]
})

# View basic attributes
print(f"Shape: {students.shape}")                    # (5, 5)
print(f"Size: {students.size}")                      # 25
print(f"Dimensions: {students.ndim}")                # 2
print(f"Row count: {len(students)}")                 # 5
print(f"Column count: {len(students.columns)}")      # 5
print(f"Column names: {list(students.columns)}")     # ['ID', 'Name', 'Math', 'English', 'Physics']
print(f"Index: {list(students.index)}")              # [0, 1, 2, 3, 4]
print(f"Data types:\n{students.dtypes}")             # Data types of each column

Viewing Data Overview

python
# View first few rows
print("First 3 rows:")
print(students.head(3))

# View last few rows
print("\nLast 2 rows:")
print(students.tail(2))

# View basic info (info() prints its report directly and returns None)
print("\nBasic info:")
students.info()

# View statistical summary
print("\nStatistical summary:")
print(students.describe())

Memory Usage

python
# Check memory usage
print("Memory usage:")
print(students.memory_usage(deep=True))

print(f"\nTotal memory usage: {students.memory_usage(deep=True).sum()} bytes")

🎯 Indexing and Selection

Column Selection

python
# Select single column (returns Series)
math_scores = students['Math']
print("Math scores:")
print(math_scores)
print(f"Type: {type(math_scores)}")

# Select multiple columns (returns DataFrame)
scores = students[['Name', 'Math', 'English']]
print("\nNames and scores:")
print(scores)
print(f"Type: {type(scores)}")

# Access a column with dot notation (only works when the column name
# is a valid Python identifier, with no spaces or special characters)
# names = students.Name  # Works, but bracket notation is preferred

Row Selection

python
# Use iloc for position-based row selection
print("First row:")
print(students.iloc[0])

print("\nFirst three rows:")
print(students.iloc[0:3])

print("\nLast row:")
print(students.iloc[-1])

# Use loc for label-based row selection
print("\nRow with index 2:")
print(students.loc[2])

# Select multiple rows
print("\nRows with index 1 and 3:")
print(students.loc[[1, 3]])

Conditional Selection

python
# Boolean indexing
high_math = students[students['Math'] > 90]
print("Students with Math score > 90:")
print(high_math)

# Multiple conditions selection
excellent = students[(students['Math'] > 85) & (students['English'] > 90)]
print("\nStudents with Math > 85 and English > 90:")
print(excellent)

# Using isin method
selected_students = students[students['Name'].isin(['Alice', 'Charlie'])]
print("\nSelected students:")
print(selected_students)

# Using query method
query_result = students.query('Math > 85 and English > 88')
print("\nUsing query:")
print(query_result)
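One more `query` feature worth knowing: prefixing a Python variable with `@` lets you reference it inside the query string. A minimal sketch (the threshold value is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Math': [85, 92, 78]})

threshold = 80
result = df.query('Math > @threshold')  # @ pulls in the local variable
print(result)
```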

Advanced Indexing

python
# loc: Label-based indexing
print("Using loc to select specific rows and columns:")
print(students.loc[1:3, ['Name', 'Math']])

# iloc: Position-based indexing
print("\nUsing iloc to select specific positions:")
print(students.iloc[1:4, 2:4])

# at and iat: fast access to a single value
print(f"\nMath score at row label 1: {students.at[1, 'Math']}")
print(f"Value at position (1, 2): {students.iat[1, 2]}")

🔧 DataFrame Operations

Adding Columns

python
# Add new columns
students_copy = students.copy()

# Calculate total score
students_copy['Total'] = students_copy['Math'] + students_copy['English'] + students_copy['Physics']

# Calculate average score
students_copy['Average'] = students_copy['Total'] / 3

# Add grade column
def get_grade(avg_score):
    if avg_score >= 90:
        return 'A'
    elif avg_score >= 80:
        return 'B'
    elif avg_score >= 70:
        return 'C'
    else:
        return 'D'

students_copy['Grade'] = students_copy['Average'].apply(get_grade)

print("After adding columns:")
print(students_copy)

Deleting Columns

python
# Drop columns (doesn't modify original DataFrame)
df_dropped = students_copy.drop(['Total', 'Average'], axis=1)
print("After dropping columns:")
print(df_dropped)

# Drop columns (modifies original DataFrame)
students_copy.drop(['Grade'], axis=1, inplace=True)
print("\nAfter in-place drop:")
print(students_copy)

# Delete column using del
del students_copy['Total']
print("\nAfter using del:")
print(students_copy)
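A related option, shown here as a small sketch: `DataFrame.pop` removes a column like `del` does, but also returns it as a Series so you can keep using it:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
b = df.pop('B')            # removes 'B' from df and returns it as a Series

print(list(df.columns))    # ['A']
print(list(b))             # [3, 4]
```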

Adding Rows

python
# Add row using loc
students_new = students.copy()
students_new.loc[5] = ['S006', 'Frank', 89, 91, 86]
print("After adding row:")
print(students_new)

# Add row using concat
new_student = pd.DataFrame({
    'ID': ['S007'],
    'Name': ['Grace'],
    'Math': [93],
    'English': [88],
    'Physics': [90]
})

students_concat = pd.concat([students, new_student], ignore_index=True)
print("\nUsing concat to add row:")
print(students_concat)

Deleting Rows

python
# Delete rows by index
students_drop_row = students.drop([0, 2])
print("After dropping rows:")
print(students_drop_row)

# Delete rows based on condition
students_filtered = students[students['Math'] >= 85]
print("\nKeeping rows where Math >= 85:")
print(students_filtered)

📊 Data Operations and Calculations

Mathematical Operations

python
# Create numeric DataFrame
scores_df = students[['Math', 'English', 'Physics']].copy()

# Scalar operations
print("All scores plus 5:")
print(scores_df + 5)

print("\nAll scores times 1.1:")
print(scores_df * 1.1)

# Column-wise operations
print("\nDifference between Math and English:")
print(scores_df['Math'] - scores_df['English'])

# Row-wise operations
print("\nTotal score for each student:")
print(scores_df.sum(axis=1))

print("\nAverage score per subject:")
print(scores_df.mean(axis=0))

Statistical Functions

python
# Basic statistics
print("Basic statistical info:")
print(scores_df.describe())

print(f"\nMaximum per subject:")
print(scores_df.max())

print(f"\nMinimum per subject:")
print(scores_df.min())

print(f"\nAverage per subject:")
print(scores_df.mean())

print(f"\nStandard deviation per subject:")
print(scores_df.std())

# Correlation analysis
print("\nCorrelation between subjects:")
print(scores_df.corr())

Sorting

python
# Sort by single column
sorted_by_math = students.sort_values('Math', ascending=False)
print("Sorted by Math score descending:")
print(sorted_by_math)

# Sort by multiple columns
sorted_multi = students.sort_values(['Math', 'English'], ascending=[False, True])
print("\nSorted by Math descending, English ascending:")
print(sorted_multi)

# Sort by index
sorted_by_index = students.sort_index(ascending=False)
print("\nSorted by index descending:")
print(sorted_by_index)
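When you only need the top rows after a sort, `nlargest` (and its counterpart `nsmallest`) is a concise shortcut; a minimal sketch on a toy table:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Score': [85, 92, 78]})

# Equivalent to df.sort_values('Score', ascending=False).head(2)
top2 = df.nlargest(2, 'Score')
print(top2)
```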

🔄 Data Processing

Handling Missing Values

python
# Create DataFrame with missing values
data_with_nan = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, np.nan],
    'C': [1, 2, 3, np.nan, 5],
    'D': [1, np.nan, np.nan, 4, 5]
})

print("Data with missing values:")
print(data_with_nan)

# Detect missing values
print("\nMissing value detection:")
print(data_with_nan.isnull())

print("\nCount of missing values per column:")
print(data_with_nan.isnull().sum())

# Drop missing values
print("\nDrop rows with missing values:")
print(data_with_nan.dropna())

print("\nDrop columns with missing values:")
print(data_with_nan.dropna(axis=1))

# Fill missing values
print("\nFill missing values with 0:")
print(data_with_nan.fillna(0))

print("\nFill missing values with mean:")
print(data_with_nan.fillna(data_with_nan.mean()))

print("\nForward fill:")
print(data_with_nan.ffill())  # fillna(method='ffill') is deprecated in newer pandas

Handling Duplicate Values

python
# Create DataFrame with duplicate values
data_with_duplicates = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
    'Age': [25, 30, 25, 35, 30],
    'City': ['Beijing', 'Shanghai', 'Beijing', 'Guangzhou', 'Shanghai']
})

print("Data with duplicates:")
print(data_with_duplicates)

# Detect duplicates
print("\nDuplicate detection:")
print(data_with_duplicates.duplicated())

# Drop duplicates
print("\nAfter dropping duplicates:")
print(data_with_duplicates.drop_duplicates())

# Drop duplicates based on specific columns
print("\nDrop duplicates based on Name:")
print(data_with_duplicates.drop_duplicates(subset=['Name']))
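Both `duplicated` and `drop_duplicates` also accept a `keep` parameter: `'first'` (the default) keeps the first occurrence, `'last'` keeps the last, and `False` drops every duplicated row entirely. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice'],
                   'Age': [25, 30, 25]})

print(df.drop_duplicates(keep='last'))   # keeps the last 'Alice' row
print(df.drop_duplicates(keep=False))    # only 'Bob' survives
```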

Data Type Conversion

python
# Create mixed type data
mixed_data = pd.DataFrame({
    'NumericString': ['1', '2', '3', '4', '5'],
    'Float': [1.1, 2.2, 3.3, 4.4, 5.5],
    'Integer': [1, 2, 3, 4, 5]
})

print("Original data types:")
print(mixed_data.dtypes)

# Convert data types
mixed_data['NumericString'] = pd.to_numeric(mixed_data['NumericString'])
mixed_data['Float'] = mixed_data['Float'].astype(int)

print("\nAfter type conversion:")
print(mixed_data.dtypes)
print(mixed_data)
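Note that `pd.to_numeric` raises an error on strings it cannot parse; passing `errors='coerce'` converts those to NaN instead (the 'oops' value below is an illustrative bad entry):

```python
import pandas as pd

s = pd.Series(['1', '2', 'oops', '4'])
converted = pd.to_numeric(s, errors='coerce')
print(converted)   # 1.0, 2.0, NaN, 4.0 (dtype float64, since NaN forces float)
```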

🔗 DataFrame Merging and Joining

Concatenating DataFrames

python
# Create example data
df1 = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})

df2 = pd.DataFrame({
    'A': [7, 8, 9],
    'B': [10, 11, 12]
})

# Vertical concatenation
vertical_concat = pd.concat([df1, df2])
print("Vertical concatenation:")
print(vertical_concat)

# Reset index
vertical_reset = pd.concat([df1, df2], ignore_index=True)
print("\nAfter resetting index:")
print(vertical_reset)

# Horizontal concatenation
df3 = pd.DataFrame({
    'C': [13, 14, 15],
    'D': [16, 17, 18]
})

horizontal_concat = pd.concat([df1, df3], axis=1)
print("\nHorizontal concatenation:")
print(horizontal_concat)

Merging DataFrames

python
# Create data for merging
employees = pd.DataFrame({
    'EmployeeID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'DeptID': [101, 102, 101, 103]
})

departments = pd.DataFrame({
    'DeptID': [101, 102, 103],
    'DeptName': ['Engineering', 'Sales', 'HR']
})

print("Employees table:")
print(employees)
print("\nDepartments table:")
print(departments)

# Inner join
inner_merge = pd.merge(employees, departments, on='DeptID')
print("\nInner join result:")
print(inner_merge)

# Left join
left_merge = pd.merge(employees, departments, on='DeptID', how='left')
print("\nLeft join result:")
print(left_merge)
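Beyond inner and left joins, `how='outer'` keeps keys from both tables, and `indicator=True` adds a `_merge` column showing where each row came from. A small sketch (DeptID 104 is introduced here on purpose, to show an unmatched key):

```python
import pandas as pd

employees = pd.DataFrame({
    'EmployeeID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'DeptID': [101, 102, 101, 104]   # 104 has no matching department
})
departments = pd.DataFrame({
    'DeptID': [101, 102, 103],
    'DeptName': ['Engineering', 'Sales', 'HR']
})

# Outer join: keep keys from both sides; indicator marks each row's origin
outer = pd.merge(employees, departments, on='DeptID', how='outer', indicator=True)
print(outer)
```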

🎨 Practical Application Examples

Example 1: Sales Data Analysis

python
# Create sales data
sales_data = pd.DataFrame({
    'Date': pd.date_range('2024-01-01', periods=10, freq='D'),
    'Product': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C'],
    'Quantity': [100, 150, 120, 80, 200, 110, 90, 180, 130, 95],
    'UnitPrice': [10.5, 15.2, 10.5, 8.8, 15.2, 10.5, 8.8, 15.2, 10.5, 8.8],
    'Salesperson': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob', 'Alice', 'Charlie', 'Bob', 'Alice', 'Charlie']
})

# Calculate revenue
sales_data['Revenue'] = sales_data['Quantity'] * sales_data['UnitPrice']

print("Sales data:")
print(sales_data)

# Group by product
product_summary = sales_data.groupby('Product').agg({
    'Quantity': 'sum',
    'Revenue': 'sum',
    'UnitPrice': 'mean'
}).round(2)

print("\nSummary by product:")
print(product_summary)

# Group by salesperson
salesperson_summary = sales_data.groupby('Salesperson').agg({
    'Quantity': 'sum',
    'Revenue': 'sum'
}).sort_values('Revenue', ascending=False)

print("\nSummary by salesperson:")
print(salesperson_summary)

# Find best sales day
best_day = sales_data.loc[sales_data['Revenue'].idxmax()]
print(f"\nBest sales day: {best_day['Date'].strftime('%Y-%m-%d')}, Revenue: {best_day['Revenue']}")

Example 2: Student Grade Management System

python
# Create student grade data
student_scores = pd.DataFrame({
    'ID': ['S001', 'S002', 'S003', 'S004', 'S005', 'S006'],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Class': ['1A', '1A', '1B', '1B', '1A', '1B'],
    'Chinese': [85, 92, 78, 96, 88, 91],
    'Math': [90, 85, 92, 89, 94, 87],
    'English': [88, 90, 85, 92, 87, 89],
    'Physics': [82, 88, 90, 85, 91, 86],
    'Chemistry': [89, 87, 88, 90, 85, 92]
})

print("Student grade data:")
print(student_scores)

# Calculate total and average
subjects = ['Chinese', 'Math', 'English', 'Physics', 'Chemistry']
student_scores['Total'] = student_scores[subjects].sum(axis=1)
student_scores['Average'] = student_scores[subjects].mean(axis=1).round(2)

# Calculate ranking
student_scores['Rank'] = student_scores['Total'].rank(ascending=False, method='min').astype(int)

print("\nAfter adding statistics:")
print(student_scores[['Name', 'Class', 'Total', 'Average', 'Rank']])

# Class statistics
class_summary = student_scores.groupby('Class').agg({
    'Total': ['mean', 'max', 'min'],
    'Average': 'mean'
}).round(2)

print("\nClass statistics:")
print(class_summary)

# Subject statistics
subject_stats = student_scores[subjects].describe().round(2)
print("\nSubject statistics:")
print(subject_stats)

# Find top student for each subject
print("\nTop student for each subject:")
for subject in subjects:
    max_idx = student_scores[subject].idxmax()
    max_student = student_scores.loc[max_idx]
    print(f"{subject}: {max_student['Name']} ({max_student[subject]} points)")

Example 3: Financial Data Analysis

python
# Create financial data
financial_data = pd.DataFrame({
    'Month': pd.date_range('2024-01-01', periods=12, freq='ME'),  # 'ME' = month end; use 'M' on pandas < 2.2
    'Revenue': [120000, 135000, 142000, 158000, 163000, 171000,
                185000, 192000, 178000, 165000, 155000, 148000],
    'Expenses': [80000, 85000, 90000, 95000, 98000, 102000,
                 108000, 115000, 110000, 105000, 98000, 92000],
    'Employees': [50, 52, 55, 58, 60, 62, 65, 68, 66, 64, 62, 60]
})

# Calculate profit and profit margin
financial_data['Profit'] = financial_data['Revenue'] - financial_data['Expenses']
financial_data['ProfitMargin'] = (financial_data['Profit'] / financial_data['Revenue'] * 100).round(2)

# Calculate revenue per employee
financial_data['RevenuePerEmployee'] = (financial_data['Revenue'] / financial_data['Employees']).round(0)

print("Financial data:")
print(financial_data)

# Quarterly statistics
financial_data['Quarter'] = financial_data['Month'].dt.quarter
quarterly_summary = financial_data.groupby('Quarter').agg({
    'Revenue': 'sum',
    'Expenses': 'sum',
    'Profit': 'sum',
    'Employees': 'mean'
}).round(0)

quarterly_summary['QuarterlyProfitMargin'] = (quarterly_summary['Profit'] / quarterly_summary['Revenue'] * 100).round(2)

print("\nQuarterly statistics:")
print(quarterly_summary)

# Annual summary
annual_summary = {
    'Annual Total Revenue': financial_data['Revenue'].sum(),
    'Annual Total Expenses': financial_data['Expenses'].sum(),
    'Annual Total Profit': financial_data['Profit'].sum(),
    'Average Monthly Profit Margin': financial_data['ProfitMargin'].mean(),
    'Highest Monthly Profit': financial_data['Profit'].max(),
    'Lowest Monthly Profit': financial_data['Profit'].min()
}

print("\nAnnual Summary:")
for key, value in annual_summary.items():
    if 'Margin' in key:
        print(f"{key}: {value:.2f}%")
    else:
        print(f"{key}: {value:,.0f}")

📈 Performance Optimization Tips

Vectorized Operations

python
import time

# Create large DataFrame
large_df = pd.DataFrame({
    'A': np.random.randn(100000),
    'B': np.random.randn(100000),
    'C': np.random.randn(100000)
})

# Compare performance of loops vs vectorized operations
# Method 1: Loop (slow)
start_time = time.time()
result1 = []
for i in range(len(large_df)):
    if large_df.iloc[i]['A'] > 0:
        result1.append(large_df.iloc[i]['B'] * large_df.iloc[i]['C'])
    else:
        result1.append(0)
loop_time = time.time() - start_time

# Method 2: Vectorized (fast)
start_time = time.time()
result2 = np.where(large_df['A'] > 0, large_df['B'] * large_df['C'], 0)
vectorized_time = time.time() - start_time

print(f"Loop method time: {loop_time:.4f} seconds")
print(f"Vectorized method time: {vectorized_time:.4f} seconds")
print(f"Performance improvement: {loop_time/vectorized_time:.1f}x")

Memory Optimization

python
# Data type optimization
original_df = pd.DataFrame({
    'IntColumn': [1, 2, 3, 4, 5] * 1000,
    'FloatColumn': [1.1, 2.2, 3.3, 4.4, 5.5] * 1000,
    'CategoryColumn': (['A', 'B', 'C'] * 1667)[:5000]  # trim to 5000 to match the other columns
})

print(f"Original memory usage: {original_df.memory_usage(deep=True).sum()} bytes")

# Optimize data types
optimized_df = original_df.copy()
optimized_df['IntColumn'] = optimized_df['IntColumn'].astype('int8')
optimized_df['FloatColumn'] = optimized_df['FloatColumn'].astype('float32')
optimized_df['CategoryColumn'] = optimized_df['CategoryColumn'].astype('category')

print(f"Optimized memory usage: {optimized_df.memory_usage(deep=True).sum()} bytes")
print(f"Memory savings: {(1 - optimized_df.memory_usage(deep=True).sum()/original_df.memory_usage(deep=True).sum())*100:.1f}%")

📝 Chapter Summary

Through this chapter, you should have mastered:

  • DataFrame Basic Concepts: Understanding the structure and characteristics of DataFrame
  • Creating DataFrame: Mastering various methods for creating DataFrame
  • Indexing and Selection: Proficiently using various indexing and selection methods
  • Data Operations: Performing CRUD operations
  • Data Processing: Handling missing values, duplicates, and other data quality issues
  • Data Merging: Joining and merging multiple DataFrames
  • Practical Applications: Solving real data analysis problems
  • Performance Optimization: Improving code execution efficiency and memory usage

Key Points

  1. DataFrame is the Core of Pandas: Mastering DataFrame is fundamental to data analysis
  2. Flexibility of Indexing: loc, iloc, boolean indexing provide powerful data selection capabilities
  3. Importance of Vectorized Operations: Avoid loops, use Pandas built-in methods
  4. Data Quality Management: Handle missing values and duplicates promptly
  5. Memory Optimization: Choosing appropriate data types can significantly save memory

Next Steps

Now that you've mastered Pandas core data structures, next we'll learn how to handle CSV and Excel files.


Next Chapter: Pandas CSV and Excel Handling

Content is for learning and research only.