Part 401 of 541

📘 Data Pipelines: ETL with Python

Master ETL data pipelines in Python with practical examples, best practices, and real-world applications 🚀

🚀 Intermediate
25 min read

Prerequisites

  • Basic understanding of programming concepts 📝
  • Python installation (3.8+) 🐍
  • VS Code or preferred IDE 💻

What you'll learn

  • Understand ETL fundamentals 🎯
  • Apply ETL in real projects 🏗️
  • Debug common pipeline issues 🐛
  • Write clean, Pythonic code ✨

🎯 Introduction

Welcome to the exciting world of data pipelines and ETL (Extract, Transform, Load)! 🎉 In today's data-driven world, knowing how to build robust data pipelines is like having a superpower.

You'll discover how ETL processes can transform messy data into golden insights 💎. Whether you're processing customer data 👥, analyzing sales trends 📊, or building machine learning pipelines 🤖, understanding ETL is essential for every data professional.

By the end of this tutorial, you'll be building data pipelines like a pro! Let's dive in! 🏊‍♂️

📚 Understanding ETL and Data Pipelines

🤔 What is ETL?

ETL is like being a data chef 👨‍🍳. Think of it as a three-step recipe:

  • Extract 📥: Gathering ingredients (data) from various sources
  • Transform 🔄: Preparing and cooking (cleaning, formatting)
  • Load 📤: Serving the final dish (storing processed data)

In Python terms, ETL means:

  • ✨ Extract: Reading data from files, APIs, and databases
  • 🚀 Transform: Cleaning, validating, and aggregating data
  • 🛡️ Load: Saving to databases, data warehouses, or files
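
Before the fuller example later in this tutorial, here is a bare-bones sketch of those three steps in one place. The file, table, and column names (orders.csv, warehouse.db, price, quantity) are placeholders, not a real dataset:

# 👋 Minimal ETL sketch (placeholder file and column names)
import sqlite3
import pandas as pd

raw = pd.read_csv('orders.csv')                      # 📥 Extract: read a source file
raw['total'] = raw['price'] * raw['quantity']        # 🔄 Transform: add a derived column
with sqlite3.connect('warehouse.db') as conn:        # 📤 Load: write to a local SQLite database
    raw.to_sql('orders', conn, if_exists='replace', index=False)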

💡 Why Use Data Pipelines?

Here's why developers love data pipelines:

  1. Automation 🤖: Process data without manual intervention
  2. Scalability 📈: Handle growing data volumes effortlessly
  3. Reliability 🛡️: Consistent, repeatable processes
  4. Time Savings ⏰: Focus on insights, not data wrangling

Real-world example: Imagine an e-commerce store 🛒. Every day you need to process orders, update inventory, and generate reports. A data pipeline automates all of this!

🔧 Basic ETL Syntax and Usage

📝 Simple ETL Example

Let's start with a friendly example using pandas:

# 👋 Hello, ETL!
import pandas as pd
import numpy as np
from datetime import datetime

# 📥 EXTRACT: Load sales data
def extract_data():
    # 🎨 Simulating data extraction from CSV
    data = {
        'order_id': [1001, 1002, 1003, 1004, 1005],
        'product': ['Laptop 💻', 'Mouse 🖱️', 'Keyboard ⌨️', 'Monitor 🖥️', 'Laptop 💻'],
        'price': [999.99, 29.99, 79.99, 299.99, 1199.99],
        'quantity': [1, 2, 1, 1, 1],
        'date': ['2024-01-15', '2024-01-15', '2024-01-16', '2024-01-16', '2024-01-17']
    }
    df = pd.DataFrame(data)
    print("📥 Extracted data successfully! 🎉")
    return df

# 🔄 TRANSFORM: Clean and enrich data
def transform_data(df):
    # 🧹 Clean data
    df['date'] = pd.to_datetime(df['date'])

    # 💰 Calculate total revenue
    df['total_revenue'] = df['price'] * df['quantity']

    # 📊 Add day of week
    df['day_of_week'] = df['date'].dt.day_name()

    # 🏷️ Add price category
    df['price_category'] = pd.cut(df['price'],
                                   bins=[0, 50, 200, 1000, 5000],
                                   labels=['Budget 💵', 'Mid-range 💰', 'Premium 💎', 'Luxury 👑'])

    print("🔄 Transformed data successfully! ✨")
    return df

# 📤 LOAD: Save processed data
def load_data(df, filename='processed_sales.csv'):
    # 💾 Save to CSV
    df.to_csv(filename, index=False)
    print(f"📤 Loaded data to {filename} successfully! 🚀")

# 🎮 Run the ETL pipeline
def run_etl_pipeline():
    print("🚀 Starting ETL Pipeline...")

    # Execute ETL steps
    raw_data = extract_data()
    transformed_data = transform_data(raw_data)
    load_data(transformed_data)

    print("✅ ETL Pipeline completed successfully! 🎉")
    return transformed_data

# Run it!
result = run_etl_pipeline()
print("\n📊 Sample of processed data:")
print(result.head())

💡 Explanation: Notice how we break down the process into clear steps! Each function has a single responsibility, making our pipeline modular and maintainable.

🎯 Common ETL Patterns

Here are patterns you'll use daily:

# 🏗️ Pattern 1: Data Validation
def validate_data(df):
    # ✅ Check for missing values
    if df.isnull().sum().sum() > 0:
        print("⚠️ Found missing values!")
        df = df.ffill()  # Forward fill (fillna(method='ffill') is deprecated in newer pandas)

    # 🛡️ Remove duplicates
    initial_rows = len(df)
    df = df.drop_duplicates()
    if len(df) < initial_rows:
        print(f"🧹 Removed {initial_rows - len(df)} duplicate rows")

    return df

# 🎨 Pattern 2: Data Type Conversion
def convert_data_types(df):
    # 📅 Convert date strings to datetime
    date_columns = ['order_date', 'ship_date', 'delivery_date']
    for col in date_columns:
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], errors='coerce')

    # 💯 Convert numeric strings to numbers
    numeric_columns = ['price', 'quantity', 'discount']
    for col in numeric_columns:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors='coerce')

    return df

# 🔄 Pattern 3: Data Aggregation
def aggregate_sales_data(df):
    # 📊 Group by product category
    summary = df.groupby('product_category').agg({
        'total_revenue': 'sum',
        'quantity': 'sum',
        'order_id': 'count'
    }).rename(columns={'order_id': 'order_count'})

    # 📈 Add average order value
    summary['avg_order_value'] = summary['total_revenue'] / summary['order_count']

    return summary
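
To see how these patterns compose, here is a quick usage sketch that reuses the three functions above on a tiny made-up DataFrame (the column names are chosen to match the patterns, not taken from a real dataset):

# 🎮 Chaining the patterns on a small made-up dataset
import pandas as pd

raw = pd.DataFrame({
    'order_id': [1, 2, 2, 3],
    'order_date': ['2024-01-15', '2024-01-16', '2024-01-16', 'not-a-date'],
    'price': ['10.0', '20.0', '20.0', '30.0'],
    'quantity': [1, 2, 2, None],
    'product_category': ['Toys', 'Books', 'Books', 'Toys'],
})

clean = validate_data(raw)            # fills the gap, drops the duplicate row
clean = convert_data_types(clean)     # strings -> datetime / numeric (bad date becomes NaT)
clean['total_revenue'] = clean['price'] * clean['quantity']
print(aggregate_sales_data(clean))    # revenue, quantity, and order counts per category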

💡 Practical Examples

🛒 Example 1: E-commerce Data Pipeline

Let's build a real e-commerce ETL pipeline:

# ๐Ÿ›๏ธ E-commerce ETL Pipeline
import pandas as pd
import requests
from sqlalchemy import create_engine

class EcommercePipeline:
    def __init__(self):
        self.data = None
        self.transformed_data = None
        
    # ๐Ÿ“ฅ Extract from multiple sources
    def extract(self):
        print("๐Ÿ“ฅ Extracting data from multiple sources...")
        
        # ๐ŸŒ Extract from API
        # api_data = requests.get('https://api.store.com/orders').json()
        
        # ๐Ÿ“Š Extract from CSV
        orders_data = {
            'order_id': ['ORD001', 'ORD002', 'ORD003', 'ORD004'],
            'customer_id': ['CUST101', 'CUST102', 'CUST101', 'CUST103'],
            'product_id': ['PROD201', 'PROD202', 'PROD203', 'PROD201'],
            'quantity': [2, 1, 3, 1],
            'order_date': ['2024-01-15', '2024-01-15', '2024-01-16', '2024-01-17'],
            'status': ['completed', 'pending', 'completed', 'completed']
        }
        
        products_data = {
            'product_id': ['PROD201', 'PROD202', 'PROD203'],
            'product_name': ['Wireless Mouse ๐Ÿ–ฑ๏ธ', 'Mechanical Keyboard โŒจ๏ธ', 'USB Hub ๐Ÿ”Œ'],
            'price': [29.99, 89.99, 19.99],
            'category': ['Accessories', 'Accessories', 'Accessories']
        }
        
        self.orders_df = pd.DataFrame(orders_data)
        self.products_df = pd.DataFrame(products_data)
        
        print("โœ… Data extraction complete! ๐Ÿ“Š")
        
    # ๐Ÿ”„ Transform the data
    def transform(self):
        print("๐Ÿ”„ Transforming data...")
        
        # ๐Ÿ”— Join orders with products
        self.transformed_data = self.orders_df.merge(
            self.products_df, 
            on='product_id', 
            how='left'
        )
        
        # ๐Ÿ’ฐ Calculate revenue
        self.transformed_data['revenue'] = (
            self.transformed_data['quantity'] * 
            self.transformed_data['price']
        )
        
        # ๐Ÿ“… Convert dates
        self.transformed_data['order_date'] = pd.to_datetime(
            self.transformed_data['order_date']
        )
        
        # ๐Ÿ“Š Add time-based features
        self.transformed_data['order_month'] = (
            self.transformed_data['order_date'].dt.month_name()
        )
        self.transformed_data['order_day'] = (
            self.transformed_data['order_date'].dt.day_name()
        )
        
        # ๐Ÿท๏ธ Categorize order value
        self.transformed_data['order_size'] = pd.cut(
            self.transformed_data['revenue'],
            bins=[0, 50, 100, 500, 1000],
            labels=['Small ๐Ÿญ', 'Medium ๐Ÿ•', 'Large ๐Ÿ˜', 'Huge ๐Ÿฆ•']
        )
        
        # ๐Ÿงน Clean up
        self.transformed_data = self.transformed_data[
            self.transformed_data['status'] == 'completed'
        ]
        
        print("โœจ Transformation complete!")
        
    # ๐Ÿ“ค Load to destination
    def load(self):
        print("๐Ÿ“ค Loading data to destinations...")
        
        # ๐Ÿ’พ Save to CSV
        self.transformed_data.to_csv('sales_report.csv', index=False)
        
        # ๐Ÿ“Š Create summary report
        summary = self.transformed_data.groupby('product_name').agg({
            'revenue': 'sum',
            'quantity': 'sum',
            'order_id': 'count'
        }).rename(columns={'order_id': 'total_orders'})
        
        summary.to_csv('product_summary.csv')
        
        # ๐Ÿ—„๏ธ Load to database (example)
        # engine = create_engine('sqlite:///sales.db')
        # self.transformed_data.to_sql('sales_fact', engine, if_exists='append')
        
        print("๐ŸŽ‰ Data loaded successfully!")
        
    # ๐Ÿš€ Run the complete pipeline
    def run(self):
        print("๐Ÿš€ Starting E-commerce ETL Pipeline...\n")
        
        try:
            self.extract()
            self.transform()
            self.load()
            
            print("\nโœ… Pipeline completed successfully! ๐ŸŽŠ")
            print(f"\n๐Ÿ“Š Processed {len(self.transformed_data)} orders")
            print(f"๐Ÿ’ฐ Total revenue: ${self.transformed_data['revenue'].sum():.2f}")
            
        except Exception as e:
            print(f"โŒ Pipeline failed: {str(e)}")
            raise

# ๐ŸŽฎ Run the pipeline
pipeline = EcommercePipeline()
pipeline.run()

# ๐Ÿ“Š View results
print("\n๐Ÿ” Sample transformed data:")
print(pipeline.transformed_data.head())

🎮 Example 2: Real-time Data Stream ETL

Let's create a streaming ETL pipeline:

# ๐ŸŒŠ Streaming ETL Pipeline
import time
import random
from datetime import datetime
from collections import deque

class StreamingETL:
    def __init__(self, window_size=100):
        self.window_size = window_size
        self.data_buffer = deque(maxlen=window_size)
        self.processed_count = 0
        
    # ๐Ÿ“ฅ Extract streaming data
    def extract_stream(self):
        # ๐ŸŽฒ Simulate streaming sensor data
        sensor_types = ['Temperature ๐ŸŒก๏ธ', 'Humidity ๐Ÿ’ง', 'Pressure ๐ŸŒช๏ธ']
        
        data_point = {
            'timestamp': datetime.now(),
            'sensor_type': random.choice(sensor_types),
            'value': random.uniform(20, 100),
            'location': random.choice(['Room A ๐Ÿ ', 'Room B ๐Ÿข', 'Room C ๐Ÿญ']),
            'status': 'active' if random.random() > 0.1 else 'error'
        }
        
        return data_point
    
    # ๐Ÿ”„ Transform streaming data
    def transform_stream(self, data):
        # ๐ŸŒก๏ธ Add temperature conversion
        if data['sensor_type'] == 'Temperature ๐ŸŒก๏ธ':
            data['value_fahrenheit'] = (data['value'] * 9/5) + 32
            data['value_kelvin'] = data['value'] + 273.15
        
        # ๐Ÿšจ Add alert level
        if data['value'] > 80:
            data['alert_level'] = 'High โš ๏ธ'
        elif data['value'] > 60:
            data['alert_level'] = 'Medium ๐ŸŸก'
        else:
            data['alert_level'] = 'Low ๐ŸŸข'
        
        # ๐Ÿ“Š Add rolling statistics
        recent_values = [d['value'] for d in self.data_buffer if d['sensor_type'] == data['sensor_type']]
        if recent_values:
            data['rolling_avg'] = sum(recent_values) / len(recent_values)
            data['rolling_max'] = max(recent_values)
            data['rolling_min'] = min(recent_values)
        
        return data
    
    # ๐Ÿ“ค Load streaming data
    def load_stream(self, data):
        # ๐Ÿ’พ Add to buffer
        self.data_buffer.append(data)
        self.processed_count += 1
        
        # ๐Ÿšจ Trigger alerts
        if data['alert_level'] == 'High โš ๏ธ':
            print(f"๐Ÿšจ ALERT: High {data['sensor_type']} reading in {data['location']}!")
        
        # ๐Ÿ“Š Periodic summary
        if self.processed_count % 10 == 0:
            self.print_summary()
    
    # ๐Ÿ“Š Print summary statistics
    def print_summary(self):
        if not self.data_buffer:
            return
            
        print(f"\n๐Ÿ“Š Stream Summary (Last {len(self.data_buffer)} readings):")
        
        # Group by sensor type
        sensor_stats = {}
        for data in self.data_buffer:
            sensor = data['sensor_type']
            if sensor not in sensor_stats:
                sensor_stats[sensor] = []
            sensor_stats[sensor].append(data['value'])
        
        for sensor, values in sensor_stats.items():
            avg_value = sum(values) / len(values)
            print(f"  {sensor}: Avg={avg_value:.2f}, Count={len(values)}")
    
    # ๐Ÿš€ Run streaming pipeline
    def run_stream(self, duration=30):
        print(f"๐ŸŒŠ Starting streaming ETL pipeline for {duration} seconds...\n")
        
        start_time = time.time()
        
        while time.time() - start_time < duration:
            # ETL cycle
            raw_data = self.extract_stream()
            transformed_data = self.transform_stream(raw_data)
            self.load_stream(transformed_data)
            
            # Display current reading
            print(f"๐Ÿ“ก {transformed_data['timestamp'].strftime('%H:%M:%S')} | "
                  f"{transformed_data['sensor_type']} | "
                  f"Value: {transformed_data['value']:.2f} | "
                  f"Alert: {transformed_data['alert_level']}")
            
            # Simulate streaming delay
            time.sleep(1)
        
        print(f"\nโœ… Streaming pipeline completed! Processed {self.processed_count} readings ๐ŸŽ‰")

# ๐ŸŽฎ Run the streaming pipeline
streaming_etl = StreamingETL(window_size=50)
streaming_etl.run_stream(duration=10)  # Run for 10 seconds

🚀 Advanced ETL Concepts

🧙‍♂️ Advanced Topic 1: Parallel Processing

When you're ready to level up, try parallel ETL processing:

# 🎯 Parallel ETL Processing
import random
from datetime import datetime  # random/datetime are used by _extract_file below

import multiprocessing as mp
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import pandas as pd

class ParallelETL:
    def __init__(self, num_workers=4):
        self.num_workers = num_workers
    
    # ๐Ÿš€ Parallel extraction
    def parallel_extract(self, file_list):
        print(f"๐Ÿ“ฅ Extracting {len(file_list)} files in parallel...")
        
        with ThreadPoolExecutor(max_workers=self.num_workers) as executor:
            # ๐ŸŽฏ Map extract function to each file
            results = list(executor.map(self._extract_file, file_list))
        
        # ๐Ÿ”— Combine all dataframes
        combined_df = pd.concat(results, ignore_index=True)
        print(f"โœ… Extracted {len(combined_df)} total records!")
        
        return combined_df
    
    def _extract_file(self, filename):
        # ๐Ÿ“„ Simulate file extraction
        print(f"  ๐Ÿ“„ Extracting {filename}...")
        
        # Simulate data
        data = {
            'file': [filename] * 100,
            'value': [random.random() * 100 for _ in range(100)],
            'timestamp': [datetime.now() for _ in range(100)]
        }
        
        return pd.DataFrame(data)
    
    # ๐Ÿ”„ Parallel transformation
    def parallel_transform(self, df):
        print("๐Ÿ”„ Transforming data in parallel...")
        
        # ๐Ÿ”ช Split dataframe into chunks
        chunk_size = len(df) // self.num_workers
        chunks = [df[i:i+chunk_size] for i in range(0, len(df), chunk_size)]
        
        with ProcessPoolExecutor(max_workers=self.num_workers) as executor:
            # ๐ŸŽฏ Transform each chunk
            transformed_chunks = list(executor.map(self._transform_chunk, chunks))
        
        # ๐Ÿ”— Combine results
        result = pd.concat(transformed_chunks, ignore_index=True)
        print(f"โœจ Transformed {len(result)} records!")
        
        return result
    
    @staticmethod
    def _transform_chunk(chunk):
        # ๐Ÿ”ง Apply transformations
        chunk['value_squared'] = chunk['value'] ** 2
        chunk['value_category'] = pd.cut(
            chunk['value'], 
            bins=[0, 25, 50, 75, 100],
            labels=['Low ๐Ÿ“‰', 'Medium ๐Ÿ“Š', 'High ๐Ÿ“ˆ', 'Very High ๐Ÿš€']
        )
        
        return chunk

# ๐ŸŽฎ Use parallel ETL
files = [f'data_file_{i}.csv' for i in range(1, 5)]
parallel_etl = ParallelETL(num_workers=4)

# Run parallel extraction
# data = parallel_etl.parallel_extract(files)
# transformed = parallel_etl.parallel_transform(data)

🏗️ Advanced Topic 2: Error Handling and Recovery

For production-ready pipelines:

# 🚀 Robust ETL with Error Handling
import random
import time
from datetime import datetime

import pandas as pd

class RobustETL:
    def __init__(self):
        self.error_log = []
        self.checkpoint_data = None
        
    # ๐Ÿ›ก๏ธ Decorator for error handling
    def handle_errors(func):
        def wrapper(self, *args, **kwargs):
            try:
                return func(self, *args, **kwargs)
            except Exception as e:
                error_msg = f"โŒ Error in {func.__name__}: {str(e)}"
                print(error_msg)
                self.error_log.append({
                    'function': func.__name__,
                    'error': str(e),
                    'timestamp': datetime.now()
                })
                
                # ๐Ÿ”„ Try recovery
                if hasattr(self, f'recover_{func.__name__}'):
                    print(f"๐Ÿ”„ Attempting recovery for {func.__name__}...")
                    recovery_func = getattr(self, f'recover_{func.__name__}')
                    return recovery_func(*args, **kwargs)
                
                raise
        
        return wrapper
    
    # ๐Ÿ’พ Checkpoint management
    def save_checkpoint(self, data, stage):
        self.checkpoint_data = {
            'data': data.copy(),
            'stage': stage,
            'timestamp': datetime.now()
        }
        print(f"๐Ÿ’พ Checkpoint saved at stage: {stage}")
    
    def load_checkpoint(self):
        if self.checkpoint_data:
            print(f"๐Ÿ“ฅ Loading checkpoint from stage: {self.checkpoint_data['stage']}")
            return self.checkpoint_data['data']
        return None
    
    @handle_errors
    def extract_with_retry(self, source, max_retries=3):
        for attempt in range(max_retries):
            try:
                print(f"๐Ÿ“ฅ Extraction attempt {attempt + 1}/{max_retries}...")
                
                # Simulate extraction that might fail
                if random.random() < 0.3:  # 30% chance of failure
                    raise Exception("Network timeout")
                
                # Success!
                data = pd.DataFrame({
                    'id': range(1000),
                    'value': [random.random() * 100 for _ in range(1000)]
                })
                
                self.save_checkpoint(data, 'extraction')
                return data
                
            except Exception as e:
                if attempt == max_retries - 1:
                    raise
                print(f"โš ๏ธ Attempt {attempt + 1} failed: {str(e)}. Retrying...")
                time.sleep(2 ** attempt)  # Exponential backoff
    
    # ๐Ÿ”„ Data quality checks
    def validate_data_quality(self, df):
        quality_report = {
            'total_rows': len(df),
            'null_values': df.isnull().sum().sum(),
            'duplicate_rows': df.duplicated().sum(),
            'data_types': df.dtypes.to_dict()
        }
        
        # ๐Ÿšจ Quality alerts
        if quality_report['null_values'] > len(df) * 0.1:
            print("โš ๏ธ Warning: More than 10% null values detected!")
        
        if quality_report['duplicate_rows'] > 0:
            print(f"๐Ÿงน Removing {quality_report['duplicate_rows']} duplicates...")
            df = df.drop_duplicates()
        
        return df, quality_report

# ๐ŸŽฎ Example usage
robust_etl = RobustETL()
# data = robust_etl.extract_with_retry('data_source')

⚠️ Common Pitfalls and Solutions

😱 Pitfall 1: Memory Overflow with Large Datasets

# ❌ Wrong way - loading the entire dataset into memory
def bad_etl():
    huge_df = pd.read_csv('10GB_file.csv')  # 💥 Memory error!
    processed = huge_df.apply(complex_transform)  # complex_transform is a placeholder

# ✅ Correct way - process in chunks
def good_etl():
    chunk_size = 10000
    first_chunk = True

    # 📦 Process file in chunks
    for chunk in pd.read_csv('10GB_file.csv', chunksize=chunk_size):
        processed_chunk = chunk.apply(simple_transform)  # simple_transform is a placeholder

        # 💾 Append to the output file, writing the header only once
        processed_chunk.to_csv('output.csv', mode='w' if first_chunk else 'a',
                               header=first_chunk, index=False)
        first_chunk = False

    print("✅ Large file processed successfully! 🎉")

🤯 Pitfall 2: Not Handling Schema Changes

# ❌ Dangerous - assumes a fixed schema
def rigid_etl(df):
    return df[['col1', 'col2', 'col3']]  # 💥 KeyError if columns change!

# ✅ Safe - handle schema flexibility
def flexible_etl(df):
    # 🛡️ Define expected columns
    expected_cols = ['col1', 'col2', 'col3']

    # 🔍 Check what columns exist
    existing_cols = [col for col in expected_cols if col in df.columns]
    missing_cols = [col for col in expected_cols if col not in df.columns]

    if missing_cols:
        print(f"⚠️ Missing columns: {missing_cols}")
        # 🔧 Add missing columns with default values
        for col in missing_cols:
            df[col] = None

    # ✅ Now safe to select columns
    return df[expected_cols]

🛠️ Best Practices

  1. 🎯 Idempotency: Make your ETL processes repeatable without side effects (see the sketch after this list)
  2. 📊 Monitoring: Track metrics like processing time, record counts, and errors
  3. 🛡️ Error Handling: Implement comprehensive error handling and recovery
  4. 💾 Checkpointing: Save progress for long-running pipelines
  5. 📝 Documentation: Document data sources, transformations, and business rules
  6. 🧪 Testing: Test with edge cases and bad data
  7. 🚀 Performance: Optimize for your specific use case (batch vs. streaming)
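
What does idempotency look like in practice? Here is a minimal sketch, assuming a local SQLite warehouse and a table name (daily_sales) chosen purely for illustration: each run deletes the rows it owns before inserting, so re-running the same day never double-counts.

# 🎯 Idempotent load: re-running the same day never double-counts rows
# (warehouse.db and daily_sales are illustrative names)
import sqlite3
import pandas as pd

def load_daily_sales(df: pd.DataFrame, run_date: str, db_path: str = 'warehouse.db') -> None:
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS daily_sales (run_date TEXT, product TEXT, revenue REAL)"
        )
        # 🧹 Delete anything this run already wrote, then insert fresh rows
        conn.execute("DELETE FROM daily_sales WHERE run_date = ?", (run_date,))
        df.assign(run_date=run_date).to_sql('daily_sales', conn, if_exists='append', index=False)

# Running this twice for the same date leaves exactly one copy of the rows
sales = pd.DataFrame({'product': ['Laptop', 'Mouse'], 'revenue': [999.99, 29.99]})
load_daily_sales(sales, run_date='2024-01-15')
load_daily_sales(sales, run_date='2024-01-15')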

🧪 Hands-On Exercise

🎯 Challenge: Build a Weather Data ETL Pipeline

Create a comprehensive weather data ETL pipeline:

📋 Requirements:

  • ✅ Extract data from multiple weather stations
  • 🌡️ Convert between temperature units (Celsius/Fahrenheit/Kelvin)
  • 📊 Calculate daily aggregates (min, max, average)
  • 🌦️ Categorize weather conditions
  • 📈 Generate weather trend analysis
  • 🚨 Alert system for extreme weather

🚀 Bonus Points:

  • Add data validation and quality checks
  • Implement parallel processing for multiple stations
  • Create visualizations of weather patterns
  • Add predictive analytics

💡 Solution

# ๐ŸŽฏ Weather Data ETL Pipeline Solution
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import matplotlib.pyplot as plt

class WeatherETL:
    def __init__(self):
        self.stations = ['Station_A ๐Ÿ™๏ธ', 'Station_B ๐ŸŒŠ', 'Station_C ๐Ÿ”๏ธ']
        self.weather_data = None
        self.processed_data = None
        self.alerts = []
        
    # ๐Ÿ“ฅ Extract weather data
    def extract(self):
        print("๐Ÿ“ฅ Extracting weather data from stations...")
        
        # ๐ŸŽฒ Generate sample weather data
        data_list = []
        
        for station in self.stations:
            for day in range(30):  # Last 30 days
                date = datetime.now() - timedelta(days=day)
                
                # ๐ŸŒก๏ธ Generate temperature data
                base_temp = 20 + (10 * np.sin(day/7))  # Seasonal variation
                
                for hour in range(24):
                    temp_c = base_temp + np.random.normal(0, 5) + (5 * np.sin(hour/6))
                    
                    data_list.append({
                        'station': station,
                        'timestamp': date.replace(hour=hour),
                        'temperature_c': round(temp_c, 1),
                        'humidity': round(50 + np.random.normal(0, 15), 1),
                        'pressure': round(1013 + np.random.normal(0, 10), 1),
                        'wind_speed': round(abs(np.random.normal(10, 5)), 1)
                    })
        
        self.weather_data = pd.DataFrame(data_list)
        print(f"โœ… Extracted {len(self.weather_data)} weather records!")
        
    # ๐Ÿ”„ Transform weather data
    def transform(self):
        print("๐Ÿ”„ Transforming weather data...")
        
        df = self.weather_data.copy()
        
        # ๐ŸŒก๏ธ Temperature conversions
        df['temperature_f'] = (df['temperature_c'] * 9/5) + 32
        df['temperature_k'] = df['temperature_c'] + 273.15
        
        # ๐Ÿ“… Add time features
        df['date'] = df['timestamp'].dt.date
        df['hour'] = df['timestamp'].dt.hour
        df['day_of_week'] = df['timestamp'].dt.day_name()
        
        # ๐ŸŒฆ๏ธ Categorize weather conditions
        def categorize_weather(row):
            temp = row['temperature_c']
            humidity = row['humidity']
            wind = row['wind_speed']
            
            if temp > 30:
                return 'Hot ๐Ÿ”ฅ'
            elif temp < 0:
                return 'Freezing โ„๏ธ'
            elif humidity > 80 and temp > 20:
                return 'Humid ๐Ÿ’ฆ'
            elif wind > 25:
                return 'Windy ๐ŸŒช๏ธ'
            elif 15 <= temp <= 25 and humidity < 70:
                return 'Pleasant โ˜€๏ธ'
            else:
                return 'Mild ๐ŸŒค๏ธ'
        
        df['weather_condition'] = df.apply(categorize_weather, axis=1)
        
        # ๐Ÿ“Š Calculate daily aggregates
        daily_stats = df.groupby(['station', 'date']).agg({
            'temperature_c': ['min', 'max', 'mean'],
            'humidity': 'mean',
            'pressure': 'mean',
            'wind_speed': 'max'
        }).round(1)
        
        daily_stats.columns = ['temp_min', 'temp_max', 'temp_avg', 
                              'humidity_avg', 'pressure_avg', 'wind_max']
        daily_stats = daily_stats.reset_index()
        
        # ๐Ÿšจ Generate alerts
        extreme_temps = df[
            (df['temperature_c'] > 35) | (df['temperature_c'] < -5)
        ]
        
        for _, row in extreme_temps.iterrows():
            alert = {
                'station': row['station'],
                'timestamp': row['timestamp'],
                'alert_type': 'Extreme Temperature ๐Ÿšจ',
                'value': f"{row['temperature_c']}ยฐC",
                'severity': 'High'
            }
            self.alerts.append(alert)
        
        self.processed_data = df
        self.daily_stats = daily_stats
        
        print(f"โœจ Transformation complete! Generated {len(self.alerts)} weather alerts.")
        
    # ๐Ÿ“ค Load processed data
    def load(self):
        print("๐Ÿ“ค Loading processed data...")
        
        # ๐Ÿ’พ Save detailed data
        self.processed_data.to_csv('weather_detailed.csv', index=False)
        
        # ๐Ÿ“Š Save daily summaries
        self.daily_stats.to_csv('weather_daily_summary.csv', index=False)
        
        # ๐Ÿšจ Save alerts
        if self.alerts:
            alerts_df = pd.DataFrame(self.alerts)
            alerts_df.to_csv('weather_alerts.csv', index=False)
            
            print("\n๐Ÿšจ Weather Alerts:")
            for alert in self.alerts[:5]:  # Show first 5 alerts
                print(f"  {alert['station']} - {alert['alert_type']}: {alert['value']}")
        
        # ๐Ÿ“ˆ Generate visualizations
        self.create_visualizations()
        
        print("๐ŸŽ‰ Data loaded successfully!")
        
    # ๐Ÿ“ˆ Create weather visualizations
    def create_visualizations(self):
        print("๐Ÿ“Š Creating weather visualizations...")
        
        fig, axes = plt.subplots(2, 2, figsize=(12, 10))
        fig.suptitle('Weather Analysis Dashboard ๐ŸŒก๏ธ', fontsize=16)
        
        # Temperature trends
        for station in self.stations:
            station_data = self.daily_stats[self.daily_stats['station'] == station]
            axes[0, 0].plot(station_data['date'], station_data['temp_avg'], 
                          label=station, marker='o')
        
        axes[0, 0].set_title('Average Temperature Trends')
        axes[0, 0].set_ylabel('Temperature (ยฐC)')
        axes[0, 0].legend()
        axes[0, 0].tick_params(axis='x', rotation=45)
        
        # Weather condition distribution
        condition_counts = self.processed_data['weather_condition'].value_counts()
        axes[0, 1].pie(condition_counts.values, labels=condition_counts.index, 
                      autopct='%1.1f%%')
        axes[0, 1].set_title('Weather Condition Distribution')
        
        # Station comparison
        station_avg = self.daily_stats.groupby('station')['temp_avg'].mean()
        axes[1, 0].bar(station_avg.index, station_avg.values, 
                      color=['#ff9999', '#66b3ff', '#99ff99'])
        axes[1, 0].set_title('Average Temperature by Station')
        axes[1, 0].set_ylabel('Temperature (ยฐC)')
        
        # Humidity vs Temperature
        sample_data = self.processed_data.sample(min(1000, len(self.processed_data)))
        scatter = axes[1, 1].scatter(sample_data['temperature_c'], 
                                   sample_data['humidity'],
                                   c=sample_data['temperature_c'], 
                                   cmap='coolwarm', alpha=0.6)
        axes[1, 1].set_title('Temperature vs Humidity')
        axes[1, 1].set_xlabel('Temperature (ยฐC)')
        axes[1, 1].set_ylabel('Humidity (%)')
        plt.colorbar(scatter, ax=axes[1, 1])
        
        plt.tight_layout()
        plt.savefig('weather_dashboard.png', dpi=300, bbox_inches='tight')
        print("๐Ÿ“Š Dashboard saved as weather_dashboard.png")
        
    # ๐Ÿš€ Run complete pipeline
    def run(self):
        print("๐Ÿš€ Starting Weather ETL Pipeline...\n")
        
        try:
            self.extract()
            self.transform()
            self.load()
            
            print(f"\nโœ… Pipeline completed successfully! ๐ŸŽŠ")
            print(f"๐Ÿ“Š Processed {len(self.weather_data)} records")
            print(f"๐ŸŒก๏ธ Temperature range: {self.processed_data['temperature_c'].min():.1f}ยฐC to {self.processed_data['temperature_c'].max():.1f}ยฐC")
            print(f"๐Ÿšจ Generated {len(self.alerts)} weather alerts")
            
        except Exception as e:
            print(f"โŒ Pipeline failed: {str(e)}")
            raise

# ๐ŸŽฎ Run the weather ETL pipeline
weather_etl = WeatherETL()
weather_etl.run()

# ๐Ÿ“Š Display sample results
print("\n๐Ÿ” Sample Daily Statistics:")
print(weather_etl.daily_stats.head())

🎓 Key Takeaways

You've learned so much about ETL and data pipelines! Here's what you can now do:

  • ✅ Build ETL pipelines from scratch with confidence 💪
  • ✅ Extract data from multiple sources (files, APIs, databases) 📥
  • ✅ Transform data with cleaning, validation, and enrichment 🔄
  • ✅ Load data to various destinations reliably 📤
  • ✅ Handle errors gracefully with retry logic and checkpointing 🛡️
  • ✅ Process large datasets efficiently with chunking and parallel processing 🚀
  • ✅ Monitor and alert on data quality issues 🚨

Remember: ETL is the backbone of data engineering! Every data scientist and engineer needs these skills. 🤝

🤝 Next Steps

Congratulations! 🎉 You've mastered ETL and data pipelines with Python!

Here's what to do next:

  1. 💻 Practice with the weather ETL exercise above
  2. 🏗️ Build an ETL pipeline for your own data project
  3. 📚 Learn about Apache Airflow for production ETL orchestration (see the sketch after this list)
  4. 🚀 Explore streaming ETL with Apache Kafka or Apache Spark
  5. 🌟 Share your ETL projects with the data community!
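
To give a flavour of step 3, here is a rough sketch of how the simple pipeline from this tutorial might be expressed as an Airflow DAG. It assumes Airflow 2.4+ with the TaskFlow API; the DAG id, schedule, and task bodies are illustrative, not a drop-in implementation:

# 🌬️ Hypothetical Airflow DAG wrapping the extract/transform/load steps
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False, tags=["tutorial"])
def sales_etl():
    @task
    def extract() -> dict:
        # 📥 In a real DAG this would read from an API, database, or file drop
        return {"order_id": [1001, 1002], "price": [999.99, 29.99], "quantity": [1, 2]}

    @task
    def transform(raw: dict) -> dict:
        # 🔄 Derive total revenue per order
        raw["total_revenue"] = [p * q for p, q in zip(raw["price"], raw["quantity"])]
        return raw

    @task
    def load(processed: dict) -> None:
        # 📤 Replace with a warehouse write (e.g. to_sql) in a real pipeline
        print(f"Loaded {len(processed['order_id'])} orders")

    load(transform(extract()))

sales_etl()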

Remember: Every data engineering expert started with their first ETL pipeline. Keep building, keep learning, and most importantly, have fun transforming data! 🚀


Happy data pipelining! 🎉🚀✨