Prerequisites
- Basic understanding of programming concepts 📚
- A Python installation (3.8+) 🐍
- VS Code or your preferred IDE 💻
What you'll learn
- Understand the fundamentals of grouping and reshaping data 🎯
- Apply GroupBy and pivot tables in real projects 🏗️
- Debug common issues 🐛
- Write clean, Pythonic code ✨
🐼 Pandas Data Analysis: GroupBy and Pivot
Welcome, data explorer! 🚀 Today we're diving into two of the most powerful tools in pandas for data analysis: GroupBy and pivot tables. These techniques will transform how you analyze and understand your data: think of them as your data Swiss Army knife! 🔪
🎯 Introduction
Ever wondered how to:
- Calculate average sales by region? 📈
- Find the highest-scoring player in each game? 🎮
- Transform messy data into beautiful summary tables? 📋
That's exactly what GroupBy and pivot tables are for! By the end of this tutorial, you'll be slicing and dicing data like a pro chef! 👨‍🍳
📖 Understanding GroupBy and Pivot
What is GroupBy? 🤔
Think of GroupBy like sorting your laundry:
- Group similar items together (whites, colors, delicates) 👕
- Apply an operation to each group (wash, dry, fold) 🧺
- Combine the results into a clean pile! ✨
In pandas, GroupBy follows the same split-apply-combine pattern:
- Split the data into groups based on some criteria
- Apply a function to each group independently
- Combine the results into a new data structure
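To make the split-apply-combine pattern concrete, here is a minimal sketch on a tiny, made-up DataFrame (the toy column names are ours, purely for illustration):
import pandas as pd

toy = pd.DataFrame({'color': ['red', 'blue', 'red'], 'value': [1, 2, 3]})
grouped = toy.groupby('color')   # split: one group per color
print(grouped.groups)            # which row labels landed in each group
result = grouped['value'].sum()  # apply sum to each group, then combine
print(result)                    # blue 2, red 4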
What is Pivot? 🔄
Pivot tables are like reorganizing your bookshelf:
- Take books scattered everywhere 📚
- Arrange them by genre (rows) and author (columns) 📖
- See patterns you couldn't before! 🎯
🔧 Basic Syntax and Usage
Let's start with some fun data about a gaming tournament! 🎮
import pandas as pd
import numpy as np

# 🎮 Create sample gaming data
gaming_data = pd.DataFrame({
    'player': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie'] * 2,
    'game': ['Fortnite', 'Fortnite', 'Fortnite', 'Minecraft', 'Minecraft', 'Minecraft'] * 2,
    'region': ['NA', 'EU', 'ASIA', 'NA', 'EU', 'ASIA'] * 2,
    'score': [850, 920, 780, 1200, 1150, 1300, 900, 880, 810, 1250, 1100, 1350],
    'playtime_hours': [2.5, 3.0, 2.0, 4.0, 3.5, 4.5, 2.8, 2.9, 2.2, 4.2, 3.7, 4.8]
})
print("🎯 Our Gaming Tournament Data:")
print(gaming_data.head())
GroupBy Basics 🏃‍♂️
# 🥇 Group by player and calculate average score
player_stats = gaming_data.groupby('player')['score'].mean()
print("\n🏆 Average Score by Player:")
print(player_stats)

# 🎮 Group by game and get multiple statistics
game_summary = gaming_data.groupby('game').agg({
    'score': ['mean', 'max', 'min'],
    'playtime_hours': 'sum'
})
print("\n📊 Game Statistics:")
print(game_summary)
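One note on the dict-style agg above: it produces a MultiIndex on the result's columns, which can be awkward to index into. On pandas 0.25+, named aggregation is an alternative spelling of the same summary that yields flat, readable column names:
# 🏷️ Named aggregation: flat column names instead of a MultiIndex
game_summary_flat = gaming_data.groupby('game').agg(
    avg_score=('score', 'mean'),
    max_score=('score', 'max'),
    min_score=('score', 'min'),
    total_hours=('playtime_hours', 'sum')
)
print(game_summary_flat)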
Pivot Table Basics 📊
# 📊 Create a pivot table: players vs games
pivot_scores = gaming_data.pivot_table(
    values='score',
    index='player',
    columns='game',
    aggfunc='mean'
)
print("\n🎯 Player Performance Across Games:")
print(pivot_scores)
💡 Practical Examples
Example 1: E-commerce Sales Analysis 🛒
Let's analyze an online store's sales data!
# 🛍️ Create e-commerce data
np.random.seed(42)
dates = pd.date_range('2024-01-01', periods=100, freq='D')
products = ['Laptop', 'Phone', 'Headphones', 'Tablet', 'Watch']
regions = ['North', 'South', 'East', 'West']

sales_data = pd.DataFrame({
    'date': np.random.choice(dates, 500),
    'product': np.random.choice(products, 500),
    'region': np.random.choice(regions, 500),
    'quantity': np.random.randint(1, 10, 500),
    'price': np.random.uniform(50, 2000, 500).round(2)
})
sales_data['revenue'] = sales_data['quantity'] * sales_data['price']

# 💰 GroupBy: Total revenue by product
product_revenue = sales_data.groupby('product')['revenue'].sum().sort_values(ascending=False)
print("💵 Revenue by Product:")
print(product_revenue)

# 📊 GroupBy with multiple columns
regional_product_sales = sales_data.groupby(['region', 'product']).agg({
    'quantity': 'sum',
    'revenue': ['sum', 'mean']
}).round(2)
print("\n🌍 Regional Product Performance:")
print(regional_product_sales.head(10))
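A handy bridge between the two tools: a grouped result like this can be reshaped into a wide, pivot-style table with .unstack(), which moves the inner index level ('product') into the columns. A quick sketch:
# 🔄 From long groupby output to a wide table
revenue_wide = sales_data.groupby(['region', 'product'])['revenue'].sum().unstack()
print(revenue_wide.round(2))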
Example 2: Restaurant Reviews Analysis 🍕
# 🍕 Create restaurant review data
restaurants = ['Pizza Palace', 'Burger Barn', 'Sushi Supreme', 'Taco Town', 'Pasta Paradise']
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

reviews_data = pd.DataFrame({
    'restaurant': np.random.choice(restaurants, 200),
    'day': np.random.choice(days, 200),
    'rating': np.random.uniform(3.0, 5.0, 200).round(1),
    'num_reviews': np.random.randint(5, 50, 200),
    'avg_price': np.random.uniform(10, 50, 200).round(2)
})

# 📊 Pivot table: Restaurant ratings by day
# (reindex puts the columns back in calendar order instead of alphabetical)
rating_pivot = reviews_data.pivot_table(
    values='rating',
    index='restaurant',
    columns='day',
    aggfunc='mean'
).round(2).reindex(columns=days)
print("⭐ Restaurant Ratings by Day:")
print(rating_pivot)

# 📈 Advanced pivot with multiple values
comprehensive_pivot = reviews_data.pivot_table(
    values=['rating', 'num_reviews'],
    index='restaurant',
    columns='day',
    aggfunc={'rating': 'mean', 'num_reviews': 'sum'},
    fill_value=0
)
print("\n🏆 Comprehensive Restaurant Analysis:")
print(comprehensive_pivot)
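Because we passed two values, the result carries a MultiIndex on its columns (the value name on top, the day underneath). You can pull out one block at a time by indexing with the top level:
# 🔎 Select one block of the MultiIndex columns
print(comprehensive_pivot['rating'].head())       # just the average ratings
print(comprehensive_pivot['num_reviews'].head())  # just the review counts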
Example 3: Student Performance Tracker 🎓
# 🎓 Create student performance data
subjects = ['Math', 'Science', 'English', 'History']
students = ['Emma', 'Liam', 'Olivia', 'Noah', 'Ava']
terms = ['Fall 2023', 'Spring 2024']

grades_data = pd.DataFrame({
    'student': np.repeat(students, len(subjects) * len(terms)),
    'subject': np.tile(np.repeat(subjects, len(terms)), len(students)),
    'term': np.tile(terms, len(students) * len(subjects)),
    'grade': np.random.randint(70, 100, len(students) * len(subjects) * len(terms)),
    'attendance': np.random.uniform(0.8, 1.0, len(students) * len(subjects) * len(terms)).round(2)
})

# 🎯 GroupBy: Student performance
student_performance = grades_data.groupby(['student', 'term']).agg({
    'grade': ['mean', 'min', 'max'],
    'attendance': 'mean'
}).round(2)
print("📊 Student Performance Summary:")
print(student_performance)

# 📚 Pivot: Subject grades by student
subject_pivot = grades_data.pivot_table(
    values='grade',
    index='student',
    columns='subject',
    aggfunc='mean'
).round(1)
print("\n🎓 Subject Performance by Student:")
print(subject_pivot)
🚀 Advanced Concepts
Custom Aggregation Functions 🛠️
# 🎯 Define a custom aggregation function
def grade_category(grades):
    avg = grades.mean()
    if avg >= 90:
        return 'A'
    elif avg >= 80:
        return 'B'
    elif avg >= 70:
        return 'C'
    else:
        return 'D'

# 📊 Apply the custom function to each student's grades
grade_categories = grades_data.groupby('student')['grade'].apply(grade_category)
print("🎓 Student Grade Categories:")
print(grade_categories)
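A design note: for simple threshold buckets like this, pd.cut can do the same job without a Python-level function, which vectorizes nicely on larger data. A sketch of the equivalent (the bin edges mirror grade_category above):
# ✂️ Vectorized bucketing with pd.cut
student_avgs = grades_data.groupby('student')['grade'].mean()
letter = pd.cut(student_avgs, bins=[0, 70, 80, 90, 101], right=False,
                labels=['D', 'C', 'B', 'A'])
print(letter)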
# 💡 Multiple custom aggregations in one pass
def performance_summary(group):
    return pd.Series({
        'avg_grade': group['grade'].mean(),
        'consistency': group['grade'].std(),
        'best_subject': group.loc[group['grade'].idxmax(), 'subject'],
        'attendance_rate': group['attendance'].mean()
    })

student_summary = grades_data.groupby('student').apply(performance_summary)
print("\n📈 Detailed Student Analysis:")
print(student_summary.round(2))
Multi-level GroupBy and Pivot 🏗️
# 🌍 Complex grouping with multiple levels
# (on pandas 2.2+, prefer freq='ME' for month-end; plain 'M' is deprecated there)
multi_group = sales_data.groupby(['region', 'product', pd.Grouper(key='date', freq='M')]).agg({
    'quantity': 'sum',
    'revenue': ['sum', 'mean', 'count']
})
print("📊 Multi-level Sales Analysis:")
print(multi_group.head(10))

# 🎨 Advanced pivot with margins
advanced_pivot = sales_data.pivot_table(
    values='revenue',
    index='region',
    columns='product',
    aggfunc='sum',
    margins=True,         # add row/column totals
    margins_name='Total'  # name for the totals row/column
).round(2)
print("\n💰 Revenue by Region and Product (with totals):")
print(advanced_pivot)
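pandas also ships pd.crosstab, which builds the same kind of cross-tabulation directly from columns without calling a DataFrame method; a sketch of the equivalent table:
# 🔀 Same table via pd.crosstab
crosstab_revenue = pd.crosstab(
    index=sales_data['region'],
    columns=sales_data['product'],
    values=sales_data['revenue'],
    aggfunc='sum',
    margins=True,
    margins_name='Total'
).round(2)
print(crosstab_revenue)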
Transform and Filter Operations 🔄
# 🔄 Transform: add group statistics back onto the original rows
gaming_data['player_avg_score'] = gaming_data.groupby('player')['score'].transform('mean')
gaming_data['score_vs_avg'] = gaming_data['score'] - gaming_data['player_avg_score']
print("🎮 Scores vs Player Average:")
print(gaming_data[['player', 'game', 'score', 'player_avg_score', 'score_vs_avg']].head())

# 🔍 Filter: keep only rows belonging to high-performing groups
# (with our sample data, only Charlie's mean score clears 1050)
high_performers = gaming_data.groupby('player').filter(lambda x: x['score'].mean() > 1050)
print("\n🏆 High Performing Players' Records:")
print(high_performers)
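transform also accepts arbitrary functions, which is handy for standardizing values within each group. A small sketch computing a per-player z-score (the score_z column name is ours, not part of the tutorial data):
# 📐 Standardize each score against that player's own mean and spread
gaming_data['score_z'] = gaming_data.groupby('player')['score'].transform(
    lambda s: (s - s.mean()) / s.std()
)
print(gaming_data[['player', 'score', 'score_z']].head())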
⚠️ Common Pitfalls and Solutions
❌ Wrong: Forgetting to handle missing values
# NaN values in 'value' are silently skipped by mean()!
df.groupby('category')['value'].mean()
✅ Right: Handle missing values explicitly
# Decide what a missing value should mean, e.g. fill before grouping
df['value'].fillna(0).groupby(df['category']).mean()
# Note: dropna handles NaN in the grouping *keys*, not the values:
# dropna=False keeps rows whose 'category' is NaN as their own group
df.groupby('category', dropna=False)['value'].mean()
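To see the two situations side by side, here is a tiny, made-up frame with a NaN in the value column and a NaN in the key column:
# 🧪 NaN in values vs. NaN in keys
demo = pd.DataFrame({'category': ['a', 'a', None, 'b'],
                     'value': [1.0, np.nan, 3.0, 4.0]})
print(demo.groupby('category')['value'].mean())                # NaN value skipped, NaN key dropped
print(demo.groupby('category', dropna=False)['value'].mean())  # NaN key kept as its own group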
❌ Wrong: Using pivot with duplicate entries
# pivot() raises a ValueError if an index/column pair appears more than once!
df.pivot(index='date', columns='product', values='sales')
✅ Right: Use pivot_table for aggregation
# pivot_table resolves duplicates by aggregating them
df.pivot_table(index='date', columns='product', values='sales', aggfunc='sum')
❌ Wrong: Chaining too many operations
# Hard to read and debug
result = df.groupby('A').mean().reset_index().merge(other_df).sort_values('B')
✅ Right: Break it into clear steps
# Clear and maintainable
grouped = df.groupby('A').mean().reset_index()
merged = grouped.merge(other_df, on='A')
result = merged.sort_values('B')
🛠️ Best Practices
1. Choose the Right Tool 🎯
# Use GroupBy when:
# - you need to apply several different aggregations
# - you're working with time series data
# - you need custom aggregation functions
# Use pivot_table when:
# - you're creating cross-tabulation reports
# - you're reshaping data for visualization
# - you need a quick summary table
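The two tools overlap more than the checklist suggests: a pivot_table is essentially a groupby on two keys followed by .unstack(). Seeing the equivalence makes it easier to pick whichever reads better; a quick sketch with the gaming data from earlier:
# 🪞 Two spellings of the same table
via_pivot = gaming_data.pivot_table(values='score', index='player', columns='game', aggfunc='mean')
via_groupby = gaming_data.groupby(['player', 'game'])['score'].mean().unstack()
print(via_pivot.equals(via_groupby))  # should print True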
2. Performance Tips ⚡
# 🚀 Use the categorical dtype for repeated string keys
df['category'] = df['category'].astype('category')
grouped = df.groupby('category', observed=True).sum()  # faster grouping; observed=True skips empty categories
# 💡 Use ngroups to check the group count
n_groups = df.groupby('category').ngroups
print(f"Number of groups: {n_groups}")
# 📊 Use describe() for quick statistics
summary = df.groupby('category')['value'].describe()
3. Handle Large Datasets 📦
# 🚀 Process in chunks for memory efficiency
def process_large_groupby(df, group_cols, value_col='value', chunk_size=10000):
    results = []
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        results.append(chunk.groupby(group_cols)[value_col].sum())
    # re-group across chunks on every key level, not just the first
    return pd.concat(results).groupby(level=group_cols).sum()
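For instance, running it against the sales data from earlier (chunk_size is tiny here just to force several chunks):
# 📦 Chunked total revenue per region/product
chunked = process_large_groupby(sales_data, ['region', 'product'],
                                value_col='revenue', chunk_size=100)
print(chunked.head())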
🧪 Hands-On Exercise
Time to practice! Create a fitness tracking analysis system! 💪
Your Challenge:
# 🏃‍♂️ Fitness tracking data
fitness_data = pd.DataFrame({
    'user': ['Alex', 'Sam', 'Jordan', 'Alex', 'Sam', 'Jordan'] * 10,
    'activity': np.random.choice(['Running', 'Cycling', 'Swimming'], 60),  # randomized so each user mixes activities
    'date': pd.date_range('2024-01-01', periods=60, freq='D'),
    'duration_mins': np.random.randint(20, 90, 60),
    'calories': np.random.randint(150, 600, 60),
    'heart_rate': np.random.randint(100, 180, 60)
})

# TODO: Your tasks
# 1. Calculate average calories burned per activity
# 2. Find each user's most frequent activity
# 3. Create a pivot table showing total duration by user and activity
# 4. Identify the user with the highest average heart rate
# 5. Group by week and show weekly fitness trends
💡 Solution
# 1. Average calories per activity
avg_calories = fitness_data.groupby('activity')['calories'].mean().round(1)
print("🔥 Average Calories by Activity:")
print(avg_calories)

# 2. Most frequent activity per user
most_frequent = fitness_data.groupby('user')['activity'].agg(lambda x: x.value_counts().index[0])
print("\n⭐ Favorite Activity by User:")
print(most_frequent)

# 3. Duration pivot table
duration_pivot = fitness_data.pivot_table(
    values='duration_mins',
    index='user',
    columns='activity',
    aggfunc='sum',
    fill_value=0
)
print("\n⏱️ Total Duration by User and Activity:")
print(duration_pivot)

# 4. Highest average heart rate
avg_heart_rate = fitness_data.groupby('user')['heart_rate'].mean()
top_performer = avg_heart_rate.idxmax()
print(f"\n🏆 Highest Average Heart Rate: {top_performer} ({avg_heart_rate[top_performer]:.1f} bpm)")

# 5. Weekly trends
fitness_data['week'] = fitness_data['date'].dt.isocalendar().week
weekly_trends = fitness_data.groupby('week').agg({
    'duration_mins': 'sum',
    'calories': 'sum',
    'heart_rate': 'mean'
}).round(1)
print("\n📈 Weekly Fitness Trends:")
print(weekly_trends.head())
🎓 Key Takeaways
You've mastered GroupBy and Pivot! Here's what you learned:
- 🎯 GroupBy splits data into groups and applies operations
- 🔄 Pivot reshapes data for better analysis
- 📊 Both tools are essential for data analysis
- 🛠️ Custom functions make analysis more powerful
- ⚡ Performance matters with large datasets
- 🧩 Combine both for comprehensive analysis
Remember:
- GroupBy for calculations 🧮
- Pivot for reshaping 🔄
- Together for insights 💡
🤔 Next Steps
Congratulations, data analyst! 🎉 You're now equipped to tackle real-world data analysis challenges!
Here's what to explore next:
- 📊 Data visualization with matplotlib/seaborn
- 🔍 Advanced filtering and querying
- 📈 Time series analysis with pandas
- 🤖 Machine learning data preparation
- 🌍 Working with real-world datasets
Keep practicing with your own data: maybe analyze your music listening habits, gaming statistics, or fitness data! The possibilities are endless! 🚀
Happy data wrangling! 🐼✨