📘 Pandas Basics: DataFrames and Series

Welcome to the wonderful world of pandas! 🐼 If you’ve ever struggled with Excel spreadsheets or wished you could analyze data like a pro, you’re in the right place. Today, we’re diving into pandas - Python’s superpower for data manipulation. Get ready to transform raw data into insights! 🚀

🎯 Introduction

Have you ever tried to analyze thousands of rows of data in Excel and felt like pulling your hair out? 😅 That’s where pandas comes to the rescue! Think of pandas as Excel on steroids - an Excel worksheet tops out at 1,048,576 rows, while pandas can work through millions of rows (as long as they fit in memory), run complex calculations in seconds, and make data analysis actually fun!

In this tutorial, we’ll explore:

What pandas is and why it’s a game-changer 🎮
The two main data structures: Series and DataFrames 📊
How to create, manipulate, and analyze data like a pro 💪
Real-world examples that’ll make you go “Aha!” 💡

📚 Understanding Pandas

Pandas is like having a super-smart assistant who can organize, analyze, and transform your data in the blink of an eye. At its core, pandas gives us two powerful data structures:

Series: Your Data’s Best Friend 🤝

Think of a Series as a super-charged list: a one-dimensional array in which every value carries an index label. It’s like a single column in a spreadsheet, but much more powerful!

DataFrame: The Data Superhero 🦸‍♂️

A DataFrame is like an entire Excel spreadsheet in Python. It has rows, columns, and can do magical things with your data!

Let’s see them in action:

import pandas as pd  # 👋 Hello pandas!

# 🎨 Creating a Series - like a single column
temperatures = pd.Series([72, 68, 75, 71, 69])
print("Today's temperatures: 🌡️")
print(temperatures)

# 🎨 Creating a DataFrame - like a whole spreadsheet
weather_data = pd.DataFrame({
    'Day': ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'],
    'Temperature': [72, 68, 75, 71, 69],
    'Humidity': [65, 70, 60, 68, 72],
    'Weather': ['Sunny ☀️', 'Cloudy ☁️', 'Sunny ☀️', 'Rainy 🌧️', 'Cloudy ☁️']
})
print("\nWeekly weather report: 📊")
print(weather_data)
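What sets a Series apart from a plain list is its index and its element-wise arithmetic. Here is a minimal sketch of both, assuming we label the same five temperatures with day abbreviations (the labels and the `celsius` name are made up for illustration):

```python
import pandas as pd

# Illustrative labels: the day abbreviations are an assumption, not part of the data above
temperatures = pd.Series(
    [72, 68, 75, 71, 69],
    index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri'],
    name='Temperature'
)

# Look up a value by its label instead of its position
wednesday_temp = temperatures['Wed']
print(f"Wednesday: {wednesday_temp}°F")

# Arithmetic applies to every element at once, no loop needed
celsius = ((temperatures - 32) * 5 / 9).round(1)
print(celsius)
```

The converted Series keeps the same index, so `celsius['Wed']` still points at Wednesday.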

🔧 Basic Syntax and Usage

Let’s start with the basics and build up our pandas skills! 💪

Creating DataFrames - Your Data Container 📦

# Method 1: From a dictionary (most common!)
student_grades = pd.DataFrame({
    'Name': ['Alice 👧', 'Bob 👦', 'Charlie 🧒', 'Diana 👩'],
    'Math': [95, 87, 92, 88],
    'Science': [92, 89, 94, 91],
    'English': [88, 94, 87, 93]
})
print("Class grades: 📚")
print(student_grades)

# Method 2: From a list of lists
data = [
    ['Apple 🍎', 1.50, 100],
    ['Banana 🍌', 0.75, 150],
    ['Orange 🍊', 2.00, 80]
]
inventory = pd.DataFrame(data, columns=['Product', 'Price', 'Quantity'])
print("\nStore inventory: 🛒")
print(inventory)
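A third option you will often meet is a list of dictionaries, for example rows parsed from JSON. The sketch below reuses the inventory idea; the `records` name and the deliberately missing `Quantity` key are assumptions chosen to show how pandas fills gaps:

```python
import pandas as pd

# Method 3 (sketch): from a list of dictionaries, one dictionary per row
records = [
    {'Product': 'Apple', 'Price': 1.50, 'Quantity': 100},
    {'Product': 'Banana', 'Price': 0.75, 'Quantity': 150},
    {'Product': 'Orange', 'Price': 2.00},  # no 'Quantity' key on purpose
]
inventory_from_records = pd.DataFrame(records)

# The missing key shows up as NaN instead of raising an error
print(inventory_from_records)
```

Column names come from the dictionary keys, so there is no separate `columns=` argument to keep in sync.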

Accessing Data - Finding What You Need 🔍

# 🎯 Accessing columns (like selecting a column in Excel)
print("\nJust the names:")
print(student_grades['Name'])

# 🎯 Accessing multiple columns
print("\nMath and Science scores:")
print(student_grades[['Math', 'Science']])

# 🎯 Accessing rows by index
print("\nFirst student's data:")
print(student_grades.iloc[0])  # iloc = integer location (position-based)

# 🎯 Accessing specific cells
print("\nAlice's Math score:")
print(student_grades.loc[0, 'Math'])  # loc = label-based; the default labels here are 0, 1, 2, 3

💡 Practical Examples

Let’s dive into some real-world examples that’ll make you love pandas! 🎉

Example 1: Analyzing Sales Data 💰

# 🛍️ Creating a sales dataset
sales_data = pd.DataFrame({
    'Date': pd.date_range('2024-01-01', periods=7),
    'Product': ['Laptop 💻', 'Phone 📱', 'Tablet 📲', 'Laptop 💻', 
                'Headphones 🎧', 'Phone 📱', 'Laptop 💻'],
    'Price': [999, 699, 399, 999, 199, 699, 999],
    'Quantity': [2, 3, 1, 1, 5, 2, 3],
    'Customer': ['John', 'Sarah', 'Mike', 'Emma', 'Lisa', 'Tom', 'Anna']
})

# 💡 Calculate total sales for each transaction
sales_data['Total'] = sales_data['Price'] * sales_data['Quantity']

print("Sales Report 📊")
print(sales_data)

# 🎯 Find total revenue
total_revenue = sales_data['Total'].sum()
print(f"\nTotal Revenue: ${total_revenue:,.2f} 💰")

# 🎯 Best-selling product (by units sold)
product_sales = sales_data.groupby('Product')['Quantity'].sum()
print("\nProduct Sales Summary:")
print(product_sales.sort_values(ascending=False))

Example 2: Student Performance Tracker 🎓

# 📚 Creating a student performance dataset
students = pd.DataFrame({
    'Name': ['Emma 👩‍🎓', 'Liam 👨‍🎓', 'Olivia 👩‍🎓', 'Noah 👨‍🎓', 'Ava 👩‍🎓'],
    'Math': [85, 92, 78, 95, 88],
    'Science': [90, 88, 85, 91, 92],
    'English': [88, 85, 90, 87, 94],
    'PE': [95, 98, 92, 96, 90]
})

# 💡 Calculate average grade for each student
students['Average'] = students[['Math', 'Science', 'English', 'PE']].mean(axis=1)

# 🌟 Add letter grades
def get_letter_grade(score):
    if score >= 90: return 'A 🌟'
    elif score >= 80: return 'B 👍'
    elif score >= 70: return 'C ✅'
    else: return 'D 📚'

students['Grade'] = students['Average'].apply(get_letter_grade)

print("Student Report Card 📋")
print(students)

# 🎯 Find top performer
top_student = students.loc[students['Average'].idxmax()]
print(f"\nTop Student: {top_student['Name']} with {top_student['Average']:.1f}% 🏆")

Example 3: Weather Data Analysis 🌦️

# 🌡️ Creating weather data
weather = pd.DataFrame({
    'Date': pd.date_range('2024-01-01', periods=10),
    'Temperature': [32, 35, 28, 30, 33, 38, 40, 35, 32, 29],
    'Humidity': [65, 70, 80, 75, 68, 60, 55, 65, 72, 85],
    'Conditions': ['Snow ❄️', 'Cloudy ☁️', 'Snow ❄️', 'Cloudy ☁️', 
                   'Sunny ☀️', 'Sunny ☀️', 'Sunny ☀️', 'Cloudy ☁️', 
                   'Rain 🌧️', 'Snow ❄️']
})

# 💡 Temperature statistics
print("Weather Summary 🌡️")
print(f"Average Temperature: {weather['Temperature'].mean():.1f}°F")
print(f"Highest Temperature: {weather['Temperature'].max()}°F 🔥")
print(f"Lowest Temperature: {weather['Temperature'].min()}°F 🧊")

# 🎯 Count weather conditions
weather_counts = weather['Conditions'].value_counts()
print("\nWeather Conditions:")
print(weather_counts)

🚀 Advanced Concepts

Ready to level up? Let’s explore some powerful pandas features! 💪

Data Filtering - Finding Exactly What You Need 🔍

# 🎯 Creating a product database
products = pd.DataFrame({
    'Name': ['Gaming PC 🖥️', 'Laptop 💻', 'Tablet 📲', 'Monitor 🖥️', 
             'Keyboard ⌨️', 'Mouse 🖱️', 'Headset 🎧'],
    'Price': [1500, 999, 499, 299, 79, 49, 99],
    'Category': ['Computer', 'Computer', 'Mobile', 'Accessory', 
                 'Accessory', 'Accessory', 'Accessory'],
    'InStock': [True, True, False, True, True, False, True]
})

# 💡 Filter products under $100
budget_items = products[products['Price'] < 100]
print("Budget-friendly items 💰")
print(budget_items)

# 💡 Multiple conditions (wrap each one in parentheses and combine with &)
available_computers = products[
    (products['Category'] == 'Computer') &
    products['InStock']  # already a boolean column, so no == True needed
]
print("\nAvailable computers:")
print(available_computers)
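Filters can also be combined with OR (`|`), negated with NOT (`~`), or matched against a list with `isin`. A small sketch on the same product table (emoji dropped from the names for brevity; the result variable names are illustrative):

```python
import pandas as pd

products = pd.DataFrame({
    'Name': ['Gaming PC', 'Laptop', 'Tablet', 'Monitor',
             'Keyboard', 'Mouse', 'Headset'],
    'Price': [1500, 999, 499, 299, 79, 49, 99],
    'Category': ['Computer', 'Computer', 'Mobile', 'Accessory',
                 'Accessory', 'Accessory', 'Accessory'],
    'InStock': [True, True, False, True, True, False, True]
})

# OR: anything under $100 or in the Computer category
cheap_or_computer = products[
    (products['Price'] < 100) | (products['Category'] == 'Computer')
]

# NOT: flip a boolean column with ~ to find out-of-stock items
out_of_stock = products[~products['InStock']]

# isin: keep rows whose category matches any value in the list
computers_and_mobile = products[products['Category'].isin(['Computer', 'Mobile'])]

print(cheap_or_computer['Name'].tolist())
print(out_of_stock['Name'].tolist())
print(computers_and_mobile['Name'].tolist())
```

Note that pandas needs `&`, `|` and `~` here; Python's `and`, `or` and `not` raise an error on a whole column.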

Data Aggregation - Powerful Analytics 📊

# 🏪 Creating sales by region
regional_sales = pd.DataFrame({
    'Region': ['North', 'South', 'East', 'West'] * 3,
    'Month': ['Jan', 'Jan', 'Jan', 'Jan', 'Feb', 'Feb', 
              'Feb', 'Feb', 'Mar', 'Mar', 'Mar', 'Mar'],
    'Sales': [15000, 18000, 16000, 14000, 17000, 19000, 
              18000, 15000, 16000, 20000, 19000, 17000]
})

# 💡 Group by region and calculate totals
regional_totals = regional_sales.groupby('Region')['Sales'].agg(['sum', 'mean', 'max'])
print("Regional Performance 📈")
print(regional_totals)

# 💡 Pivot table magic!
monthly_pivot = regional_sales.pivot_table(
    values='Sales', 
    index='Region', 
    columns='Month', 
    aggfunc='sum'
)
print("\nMonthly Sales by Region:")
print(monthly_pivot)
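One thing to watch for: the pivot table sorts its column labels alphabetically, so the months come out as Feb, Jan, Mar. A possible fix, sketched here with `reindex` (the `ordered_pivot` name is illustrative), is to spell out the order you want:

```python
import pandas as pd

regional_sales = pd.DataFrame({
    'Region': ['North', 'South', 'East', 'West'] * 3,
    'Month': ['Jan', 'Jan', 'Jan', 'Jan', 'Feb', 'Feb',
              'Feb', 'Feb', 'Mar', 'Mar', 'Mar', 'Mar'],
    'Sales': [15000, 18000, 16000, 14000, 17000, 19000,
              18000, 15000, 16000, 20000, 19000, 17000]
})

monthly_pivot = regional_sales.pivot_table(
    values='Sales', index='Region', columns='Month', aggfunc='sum'
)
print(list(monthly_pivot.columns))  # ['Feb', 'Jan', 'Mar'] (alphabetical)

# Put the months back into calendar order
ordered_pivot = monthly_pivot.reindex(columns=['Jan', 'Feb', 'Mar'])
print(ordered_pivot)
```

For real date data, storing the month as a datetime or an ordered categorical is usually a sturdier choice than hard-coding the list.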

⚠️ Common Pitfalls and Solutions

Let’s learn from common mistakes so you can avoid them! 🛡️

Pitfall 1: Forgetting to Handle Missing Data 😱

# ❌ Wrong way - ignoring missing values
messy_data = pd.DataFrame({
    'Name': ['Alice', 'Bob', None, 'Diana'],
    'Score': [95, None, 87, 92]
})

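# This won't raise an error: mean() silently skips missing values (skipna=True),
# so the result covers only 3 of the 4 students without telling you
# average = messy_data['Score'].mean()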

# ✅ Correct way - handle missing values
average = messy_data['Score'].dropna().mean()
print(f"Average score: {average:.1f}")

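# Or fill missing values (assign the result back; calling fillna(..., inplace=True)
# on a selected column is deprecated in recent pandas versions)
messy_data['Score'] = messy_data['Score'].fillna(0)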
print("\nCleaned data:")
print(messy_data)
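How you handle the gaps changes the answer, so it helps to count the missing values first and compare strategies side by side. A minimal sketch on the same data (the variable names are illustrative):

```python
import pandas as pd

messy_data = pd.DataFrame({
    'Name': ['Alice', 'Bob', None, 'Diana'],
    'Score': [95, None, 87, 92]
})

# Count missing values per column before doing any math
missing_per_column = messy_data.isna().sum()
print(missing_per_column)

# mean() skips missing values by default (skipna=True): average of 3 scores
mean_skipping_missing = messy_data['Score'].mean()

# Filling with 0 treats the missing score as a real zero: average of 4 scores
mean_with_zero_fill = messy_data['Score'].fillna(0).mean()

print(f"Skipping missing: {mean_skipping_missing:.1f}")
print(f"Filling with 0:   {mean_with_zero_fill:.1f}")
```

Neither number is "the" right one; which strategy fits depends on whether a missing score means "not taken yet" or "scored nothing".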

Pitfall 2: Modifying DataFrames Incorrectly 🚫

# ❌ Wrong way - forgetting inplace or assignment
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df.drop('B', axis=1)  # This doesn't actually drop the column!
print("DataFrame still has column B:")
print(df)

# ✅ Correct way - use inplace or assignment
df = df.drop('B', axis=1)  # Or use df.drop('B', axis=1, inplace=True)
print("\nNow column B is gone:")
print(df)

Pitfall 3: Confusing loc and iloc 🤔

# 🎯 Understanding the difference
data = pd.DataFrame({
    'Product': ['Apple', 'Banana', 'Cherry'],
    'Price': [1.5, 0.75, 2.0]
}, index=['A', 'B', 'C'])

# ❌ Wrong - mixing up loc and iloc
# price = data.iloc['A', 'Price']  # 💥 iloc uses integers!

# ✅ Correct ways
price_loc = data.loc['A', 'Price']  # Using labels
price_iloc = data.iloc[0, 1]  # Using positions
print(f"Apple price: ${price_loc}")
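The two indexers also slice differently, which is another easy place to slip. A short sketch on the same fruit table (the `by_label` and `by_position` names are illustrative):

```python
import pandas as pd

data = pd.DataFrame({
    'Product': ['Apple', 'Banana', 'Cherry'],
    'Price': [1.5, 0.75, 2.0]
}, index=['A', 'B', 'C'])

# loc slices by label and INCLUDES the end label
by_label = data.loc['A':'B']
print(by_label)      # rows A and B

# iloc slices by position and EXCLUDES the end position, like a Python list
by_position = data.iloc[0:1]
print(by_position)   # row A only
```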

🛠️ Best Practices

Follow these tips to write clean, efficient pandas code! ✨

1. Always Check Your Data First 👀

# 🎯 Essential data exploration commands
df = pd.DataFrame({
    'Name': ['Product A', 'Product B', 'Product C'],
    'Sales': [1000, 1500, 800],
    'Profit': [200, 450, 150]
})

# Always start with these!
print(df.head())      # First few rows
df.info()             # Data types and missing values (prints directly and returns None)
print(df.describe())  # Statistical summary
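A few more one-liners round out the first look at a dataset. This sketch uses the same small product table; the `row_count`, `column_count` and `missing_counts` names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Product A', 'Product B', 'Product C'],
    'Sales': [1000, 1500, 800],
    'Profit': [200, 450, 150]
})

# shape is an attribute, not a method: (rows, columns)
row_count, column_count = df.shape
print(f"{row_count} rows x {column_count} columns")

# The data type of each column
print(df.dtypes)

# Missing values per column
missing_counts = df.isna().sum()
print(missing_counts)
```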

2. Use Meaningful Column Names 📝

# ❌ Bad naming
df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [4, 5, 6],
    'c': [7, 8, 9]
})

# ✅ Good naming
sales_data = pd.DataFrame({
    'product_id': [1, 2, 3],
    'quantity_sold': [4, 5, 6],
    'revenue_usd': [7, 8, 9]
})

3. Chain Operations for Cleaner Code 🔗

# 🎯 Elegant method chaining
result = (sales_data
    .groupby('product_id')
    .agg({'quantity_sold': 'sum', 'revenue_usd': 'sum'})
    .sort_values('revenue_usd', ascending=False)
    .head(5)
)
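Chaining pays off most when a step creates a column that the next step uses. Here is a sketch using `assign` on the prices and quantities from the sales example earlier (emoji dropped from the product names; `sales` and `revenue_by_product` are illustrative names):

```python
import pandas as pd

sales = pd.DataFrame({
    'Product': ['Laptop', 'Phone', 'Tablet', 'Laptop',
                'Headphones', 'Phone', 'Laptop'],
    'Price': [999, 699, 399, 999, 199, 699, 999],
    'Quantity': [2, 3, 1, 1, 5, 2, 3]
})

revenue_by_product = (
    sales
    .assign(Total=lambda d: d['Price'] * d['Quantity'])  # new column inside the chain
    .groupby('Product', as_index=False)['Total'].sum()   # one row per product
    .sort_values('Total', ascending=False)               # biggest earner first
    .reset_index(drop=True)                               # tidy 0..n-1 index
)
print(revenue_by_product)
```

Because `assign` returns a new DataFrame, the original `sales` table is left untouched.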

🧪 Hands-On Exercise

Time to put your skills to the test! 💪 Create a movie ratings analyzer:

Challenge: Create a DataFrame with movie data and answer these questions:

1. What’s the average rating for each genre?
2. Which movie has the highest rating?
3. How many movies are in each genre?

Try it yourself first! 🎯

🔑 Solution

# 🎬 Creating movie database
movies = pd.DataFrame({
    'Title': ['The Matrix 🤖', 'Inception 🌀', 'Toy Story 🧸', 
              'The Godfather 🎭', 'Shrek 🐸', 'Interstellar 🚀'],
    'Genre': ['Sci-Fi', 'Thriller', 'Animation', 'Drama', 'Animation', 'Sci-Fi'],
    'Rating': [8.7, 8.8, 8.3, 9.2, 7.9, 8.6],
    'Year': [1999, 2010, 1995, 1972, 2001, 2014],
    'Duration': [136, 148, 81, 175, 90, 169]
})

print("Movie Database 🎬")
print(movies)

# 1. Average rating by genre
genre_ratings = movies.groupby('Genre')['Rating'].mean()
print("\n📊 Average Ratings by Genre:")
print(genre_ratings.round(2))

# 2. Highest rated movie
best_movie = movies.loc[movies['Rating'].idxmax()]
print(f"\n🏆 Highest Rated Movie: {best_movie['Title']} ({best_movie['Rating']})")

# 3. Movies per genre
genre_counts = movies['Genre'].value_counts()
print("\n📈 Movies per Genre:")
print(genre_counts)

# Bonus: Movies longer than 2 hours
long_movies = movies[movies['Duration'] > 120]
print("\n⏰ Movies over 2 hours:")
print(long_movies[['Title', 'Duration']])

Great job! 🎉 You’ve just analyzed movie data like a pro!

🎓 Key Takeaways

Congratulations! 🎊 You’ve just mastered the basics of pandas! Here’s what you’ve learned:

Series are like super-powered lists perfect for single columns of data 📋
DataFrames are your go-to for working with tabular data (like spreadsheets) 📊
You can create, filter, and analyze data with just a few lines of code 💪
Pandas makes data analysis fun and efficient! 🚀

Remember:

Always explore your data first with .head(), .info(), and .describe() 🔍
Handle missing values appropriately 🛡️
Use meaningful variable names 📝
Chain operations for cleaner code 🔗

🤝 Next Steps

You’re on fire! 🔥 Here’s what to explore next:

Data Cleaning: Learn about handling messy real-world data 🧹
Advanced Indexing: Master MultiIndex and advanced selection techniques 🎯
Time Series: Work with dates and time-based data 📅
Data Visualization: Combine pandas with matplotlib for stunning charts 📈

Keep practicing with your own datasets - maybe analyze your favorite sports team’s statistics or your music listening history! The more you practice, the more natural it becomes.

You’ve got this! 💪 Happy data wrangling! 🐼✨
