Prerequisites
- Basic understanding of programming concepts
- Python installation (3.8+)
- VS Code or preferred IDE
What you'll learn
- Understand the fundamentals of regression
- Apply regression in real projects
- Debug common issues
- Write clean, Pythonic code
Introduction
Welcome to the wonderful world of regression analysis! Have you ever wondered how Netflix predicts your movie ratings, how real estate apps estimate house prices, or how weather forecasts predict tomorrow's temperature? That's regression in action!
Unlike classification (which puts things in boxes), regression predicts continuous values - actual numbers! Today, you'll learn how to build powerful prediction models that can forecast everything from stock prices to energy consumption. Let's unlock the magic of regression together!
Understanding Regression
Regression is like drawing the "best-fit line" through a cloud of data points. Remember plotting graphs in math class? It's that, but supercharged with machine learning!
The Magic of Regression
# Regression in action!
# Input: Features (X)
# Output: Continuous value (y)
# Example: Predicting House Prices
house_features = {
    "size_sqft": 1500,        # Feature 1
    "bedrooms": 3,            # Feature 2
    "location_score": 8.5,    # Feature 3
    "age_years": 10           # Feature 4
}
# Output: $325,000 (an actual number, not a category!)
Think of regression as teaching a computer to understand relationships: "When X goes up, Y tends to go up (or down) by this much." It's pattern recognition for numbers!
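Here is a minimal sketch of that idea using plain NumPy, with made-up numbers purely for illustration: a straight-line fit tells you roughly how much y moves for each unit of x.
# Toy illustration (made-up numbers): how much does y change per unit of x?
import numpy as np

x = np.array([1, 2, 3, 4, 5])        # e.g. years of experience
y = np.array([30, 35, 41, 44, 50])   # e.g. salary in $1000s
slope, intercept = np.polyfit(x, y, deg=1)  # fit y ≈ slope * x + intercept
print(f"Each extra unit of x adds about {slope:.1f} to y (intercept {intercept:.1f})")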
Basic Syntax and Usage
Let's start with the simplest form of regression - Linear Regression. It's like finding the perfect straight line through your data!
Your First Regression Model
# Let's build our first regression model!
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
# Create some fun data - ice cream sales vs temperature
np.random.seed(42)  # For reproducible randomness
temperature = np.random.uniform(20, 35, 100)  # Temperature in Celsius
ice_cream_sales = 2 * temperature + np.random.normal(0, 5, 100) + 50  # Sales
# Reshape for sklearn (it expects 2D feature arrays)
X = temperature.reshape(-1, 1)
y = ice_cream_sales
# Split the data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
predictions = model.predict(X_test)
print(f"Model trained! Slope: {model.coef_[0]:.2f}, Intercept: {model.intercept_:.2f}")
Visualizing Your Model
# Let's see our model in action!
plt.figure(figsize=(10, 6))
# Plot the data points
plt.scatter(X_train, y_train, color='blue', alpha=0.5, label='Training data')
plt.scatter(X_test, y_test, color='green', alpha=0.7, label='Test data')
# Plot the regression line
X_line = np.linspace(20, 35, 100).reshape(-1, 1)
y_line = model.predict(X_line)
plt.plot(X_line, y_line, color='red', linewidth=2, label='Model prediction')
plt.xlabel('Temperature (°C)')
plt.ylabel('Ice Cream Sales')
plt.title('Ice Cream Sales vs Temperature - Linear Regression')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
Practical Examples
Example 1: Real Estate Price Predictor
Let's build something practical - a house price predictor!
# House Price Prediction System
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
# Create a small, illustrative house dataset
house_data = pd.DataFrame({
    'size': [750, 1200, 1800, 2400, 3000, 1500, 2000, 2800],
    'bedrooms': [1, 2, 3, 4, 4, 3, 3, 5],
    'bathrooms': [1, 1, 2, 2.5, 3, 2, 2.5, 3],
    'age': [20, 15, 10, 5, 2, 8, 12, 1],
    'garage': [0, 1, 1, 2, 3, 1, 2, 3],
    'price': [150000, 250000, 350000, 450000, 600000, 320000, 380000, 750000]
})
# Prepare features and target
X = house_data[['size', 'bedrooms', 'bathrooms', 'age', 'garage']]
y = house_data['price']
# Scale the features (important when comparing coefficients or regularizing)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Split and train
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.25, random_state=42
)
# Multiple Linear Regression
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
predictions = model.predict(X_test)
# Evaluate the model
rmse = np.sqrt(mean_squared_error(y_test, predictions))
r2 = r2_score(y_test, predictions)
print("Model Performance:")
print(f"  Root Mean Squared Error: ${rmse:,.0f}")
print(f"  R² Score: {r2:.3f} (closer to 1 is better!)")
# Feature importance (coefficients on the scaled features)
feature_importance = pd.DataFrame({
    'feature': ['size', 'bedrooms', 'bathrooms', 'age', 'garage'],
    'coefficient': model.coef_,
    'impact': ['up' if c > 0 else 'down' for c in model.coef_]
})
print("\nFeature Impact on Price:")
print(feature_importance)
# Predict a new house (keep the same columns the scaler was fit on)
new_house = pd.DataFrame([[2200, 3, 2, 7, 2]], columns=X.columns)
new_house_scaled = scaler.transform(new_house)
predicted_price = model.predict(new_house_scaled)[0]
print(f"\nPredicted price for new house: ${predicted_price:,.0f}")
Example 2: Energy Consumption Predictor
Let's predict energy usage based on weather and time of day!
# Energy Consumption Prediction
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures
# Generate energy consumption data
hours = np.arange(24)  # Hours of the day
base_consumption = 100  # Base load
# Temperature effect (more AC/heating at extremes)
temperature = 20 + 10 * np.sin((hours - 6) * np.pi / 12)  # Daily temp cycle
temp_effect = 0.5 * (temperature - 22) ** 2  # Quadratic effect
# Human activity pattern
activity = np.where((hours >= 9) & (hours <= 17), 1.5, 1.0)  # Work hours
activity[20:23] = 1.3  # Evening peak
# Total consumption
consumption = base_consumption + temp_effect * 10 + activity * 50 + np.random.normal(0, 10, 24)
# Create feature matrix with polynomial features
X_energy = np.column_stack([hours, temperature])
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_energy)
# Compare different regression models
models = {
    'Linear': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),  # With regularization
    'Lasso': Lasso(alpha=0.1)   # Feature selection
}
plt.figure(figsize=(15, 5))
for i, (name, model) in enumerate(models.items(), 1):
    # Train the model
    model.fit(X_poly, consumption)
    predictions = model.predict(X_poly)
    # Plot results
    plt.subplot(1, 3, i)
    plt.scatter(hours, consumption, alpha=0.6, label='Actual')
    plt.plot(hours, predictions, 'r-', linewidth=2, label=name)
    plt.xlabel('Hour of Day')
    plt.ylabel('Energy (kWh)')
    plt.title(f'{name} Regression')
    plt.legend()
    plt.grid(True, alpha=0.3)
    # Calculate R² score (on the data the model was trained on)
    r2 = r2_score(consumption, predictions)
    plt.text(0.02, 0.98, f'R² = {r2:.3f}', transform=plt.gca().transAxes,
             bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5),
             verticalalignment='top')
plt.tight_layout()
plt.show()
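One caveat: those R² values are computed on the same 24 points each model was fit on, so they flatter every model. A cross-validated score is more honest - a minimal sketch (the shuffled 4-fold split is an arbitrary choice):
# Cross-validated R² for the same three models - a minimal sketch
from sklearn.model_selection import cross_val_score, KFold

cv = KFold(n_splits=4, shuffle=True, random_state=0)
for name, m in models.items():
    cv_scores = cross_val_score(m, X_poly, consumption, cv=cv, scoring='r2')
    print(f"{name}: CV R² = {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")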
Advanced Concepts
Polynomial Regression - When Lines Aren't Enough!
Sometimes relationships aren't straight lines. That's where polynomial regression shines!
# Polynomial Regression - Capturing Curves!
from sklearn.pipeline import Pipeline
# Create non-linear data (like a rollercoaster!)
X_curve = np.linspace(0, 10, 100).reshape(-1, 1)
y_curve = 3 * X_curve.ravel() ** 2 - 20 * X_curve.ravel() + 50 + np.random.normal(0, 10, 100)
# Create a polynomial pipeline
poly_pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=3)),  # Create polynomial features
    ('model', LinearRegression())            # Fit a linear model to the poly features
])
# Train the model
poly_pipeline.fit(X_curve, y_curve)
y_poly_pred = poly_pipeline.predict(X_curve)
# Visualize the fit
plt.figure(figsize=(10, 6))
plt.scatter(X_curve, y_curve, alpha=0.5, label='Data points')
plt.plot(X_curve, y_poly_pred, 'r-', linewidth=3, label='Polynomial fit')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Polynomial Regression - Capturing Complex Patterns!')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
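The degree is a knob you have to choose: too low and the model misses the curve, too high and it chases noise. One common way to pick it is cross-validation over a few candidate degrees - a minimal sketch reusing X_curve and y_curve (the candidate list is arbitrary):
# Choosing the polynomial degree by cross-validation - a minimal sketch
from sklearn.model_selection import cross_val_score, KFold

cv = KFold(n_splits=5, shuffle=True, random_state=0)
for degree in [1, 2, 3, 5, 9]:
    pipe = Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('model', LinearRegression())
    ])
    scores = cross_val_score(pipe, X_curve, y_curve, cv=cv, scoring='r2')
    print(f"degree {degree}: CV R² = {scores.mean():.3f}")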
Regularization - Preventing Overfitting
When models get too complex, they need guardrails. Enter regularization!
# Ridge vs Lasso - The Regularization Showdown!
from sklearn.linear_model import ElasticNet
# Create a dataset with many features (some useless!)
n_samples, n_features = 100, 20
X_complex = np.random.randn(n_samples, n_features)
# Only the first few features actually matter!
true_coefficients = np.zeros(n_features)
true_coefficients[:5] = [3, -2, 1.5, 0, -1]
y_complex = X_complex @ true_coefficients + np.random.normal(0, 0.5, n_samples)
# Compare regularization methods
alphas = [0.001, 0.01, 0.1, 1.0, 10.0]
methods = {
    'Ridge': Ridge,
    'Lasso': Lasso,
    'ElasticNet': ElasticNet
}
plt.figure(figsize=(15, 5))
for i, (name, Model) in enumerate(methods.items(), 1):
    plt.subplot(1, 3, i)
    for alpha in alphas:
        if name == 'ElasticNet':
            model = Model(alpha=alpha, l1_ratio=0.5)
        else:
            model = Model(alpha=alpha)
        model.fit(X_complex, y_complex)
        plt.plot(range(n_features), model.coef_,
                 marker='o', label=f'alpha={alpha}', alpha=0.7)
    plt.axhline(y=0, color='black', linestyle='--', alpha=0.3)
    plt.xlabel('Feature Index')
    plt.ylabel('Coefficient Value')
    plt.title(f'{name} Regularization')
    plt.legend()
    plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
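In practice you rarely hand-pick alpha; scikit-learn's CV variants can search it for you. A minimal sketch on the same synthetic data (the alpha grid mirrors the one above):
# Letting cross-validation choose alpha - a minimal sketch
from sklearn.linear_model import RidgeCV, LassoCV

alpha_grid = [0.001, 0.01, 0.1, 1.0, 10.0]
ridge_cv = RidgeCV(alphas=alpha_grid).fit(X_complex, y_complex)
lasso_cv = LassoCV(alphas=alpha_grid, cv=5).fit(X_complex, y_complex)
print(f"Ridge chose alpha={ridge_cv.alpha_}, Lasso chose alpha={lasso_cv.alpha_}")
print(f"Lasso zeroed out {int((lasso_cv.coef_ == 0).sum())} of {n_features} coefficients")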
Common Pitfalls and Solutions
Pitfall 1: Forgetting to Scale Features
# Wrong way - features on very different scales
X_unscaled = np.array([[1, 1000], [2, 2000], [3, 3000]])  # Big difference!
model_bad = LinearRegression()
model_bad.fit(X_unscaled, [10, 20, 30])
# Correct way - scale your features!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_unscaled)  # All features normalized
model_good = LinearRegression()
model_good.fit(X_scaled, [10, 20, 30])
print("Unscaled coefficients:", model_bad.coef_)
print("Scaled coefficients:", model_good.coef_)
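A related habit: fit the scaler on the training data only (or wrap it in a Pipeline) so nothing about the test set leaks into preprocessing. A minimal sketch on the house data from Example 1, with Ridge picked arbitrarily as the estimator:
# Scaling inside a Pipeline so it is fit on training data only - a minimal sketch
from sklearn.pipeline import make_pipeline

X_raw = house_data[['size', 'bedrooms', 'bathrooms', 'age', 'garage']]
y_raw = house_data['price']
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_raw, test_size=0.25, random_state=42)

pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pipe.fit(X_tr, y_tr)   # the scaler's mean/std come from X_tr only
print(f"Test R²: {pipe.score(X_te, y_te):.3f}")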
Pitfall 2: Ignoring Multicollinearity
# Dangerous - highly correlated features
bedrooms = np.array([1, 2, 3, 4, 5])
rooms = bedrooms * 2.5  # Perfectly correlated (an exact multiple)!
# Solution - check the correlation matrix!
import seaborn as sns
data = pd.DataFrame({'bedrooms': bedrooms, 'rooms': rooms})
correlation = data.corr()
plt.figure(figsize=(6, 4))
sns.heatmap(correlation, annot=True, cmap='coolwarm')
plt.title('Feature Correlation Matrix')
plt.show()
# If correlation > 0.8, consider dropping one feature!
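Correlation matrices only catch pairwise problems; variance inflation factors (VIF) also flag features that are predictable from a combination of the others. A minimal sketch, assuming statsmodels is installed, run on the house features from Example 1:
# Variance inflation factors - a minimal sketch (assumes statsmodels is available)
from statsmodels.stats.outliers_influence import variance_inflation_factor

features = house_data[['size', 'bedrooms', 'bathrooms', 'age', 'garage']]
for i, col in enumerate(features.columns):
    vif = variance_inflation_factor(features.values, i)
    print(f"{col}: VIF = {vif:.1f}")   # rule of thumb: VIF above ~5-10 signals trouble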
Pitfall 3: Overfitting with Small Datasets
# Wrong - complex model with little data
X_small = np.random.randn(10, 1)
y_small = 2 * X_small.squeeze() + np.random.normal(0, 0.1, 10)
# Too complex!
poly_overfit = Pipeline([
    ('poly', PolynomialFeatures(degree=9)),
    ('model', LinearRegression())
])
poly_overfit.fit(X_small, y_small)
# Better - a simpler model or regularization
simple_model = LinearRegression()
simple_model.fit(X_small, y_small)
# Or use Ridge for regularization
ridge_model = Ridge(alpha=1.0)
poly_ridge = Pipeline([
    ('poly', PolynomialFeatures(degree=9)),
    ('model', ridge_model)
])
poly_ridge.fit(X_small, y_small)
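You can see the overfitting directly by cross-validating all three models on those 10 points - a minimal sketch (with so little data the exact numbers vary a lot from run to run, but the unregularized degree-9 fit usually collapses):
# Cross-validated comparison on the tiny dataset - a minimal sketch
from sklearn.model_selection import cross_val_score

for label, m in [('degree-9 poly', poly_overfit),
                 ('plain linear', simple_model),
                 ('degree-9 + Ridge', poly_ridge)]:
    scores = cross_val_score(m, X_small, y_small, cv=5, scoring='r2')
    print(f"{label}: CV R² = {scores.mean():.3f}")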
Best Practices
1. Always Split Your Data
# Golden rule: never test on training data!
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
2. Scale Your Features
# Especially important for regularized models
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
3. Check Your Assumptions
# Residual analysis is your friend!
residuals = y_test - predictions
plt.scatter(predictions, residuals)
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot - Should Look Random!')
4. Try Multiple Models
# No single model rules them all!
models = {
    'Linear': LinearRegression(),
    'Ridge': Ridge(),
    'Lasso': Lasso(),
    'ElasticNet': ElasticNet()
}
for name, model in models.items():
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    print(f"{name}: R² = {score:.3f}")
5. Cross-Validation is Key
# Get robust performance estimates
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(f"CV R² Score: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
Hands-On Exercise
Challenge: Build a Stock Price Predictor
Create a regression model that predicts tomorrow's stock price based on historical data!
Requirements:
- Load historical stock data (volume, open, close, high, low)
- Create features (moving averages, price changes, volume ratios)
- Split data properly (time series awareness!)
- Try multiple regression models
- Visualize predictions vs actual prices
- Calculate performance metrics
Bonus Points:
- Add technical indicators (RSI, MACD)
- Implement walk-forward validation (see the sketch after the solution)
- Create a simple trading strategy based on predictions
Solution
Click to see solution
# Stock Price Prediction System
import pandas as pd
from sklearn.metrics import mean_absolute_error
# Create synthetic stock data (replace with real data!)
np.random.seed(42)
dates = pd.date_range('2023-01-01', periods=365, freq='D')
price = 100  # Starting price
# Generate realistic stock movements
prices = [price]
for _ in range(364):
    change = np.random.normal(0, 2)  # Daily % change
    price *= (1 + change / 100)
    prices.append(price)
stock_data = pd.DataFrame({
    'date': dates,
    'close': prices,
    'volume': np.random.uniform(1e6, 5e6, 365),
    'high': [p * np.random.uniform(1.0, 1.02) for p in prices],
    'low': [p * np.random.uniform(0.98, 1.0) for p in prices]
})
# Feature Engineering
def create_features(df):
    df = df.copy()
    # Price-based features
    df['returns'] = df['close'].pct_change()
    df['ma_5'] = df['close'].rolling(5).mean()
    df['ma_20'] = df['close'].rolling(20).mean()
    df['volatility'] = df['returns'].rolling(20).std()
    # Technical indicators
    df['rsi'] = calculate_rsi(df['close'])
    df['volume_ratio'] = df['volume'] / df['volume'].rolling(20).mean()
    # Target: next day's price
    df['target'] = df['close'].shift(-1)
    return df.dropna()
def calculate_rsi(prices, period=14):
    delta = prices.diff()
    gain = (delta.where(delta > 0, 0)).rolling(window=period).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window=period).mean()
    rs = gain / loss
    return 100 - (100 / (1 + rs))
# Prepare the data
stock_features = create_features(stock_data)
# Define features and target
feature_cols = ['returns', 'ma_5', 'ma_20', 'volatility', 'rsi', 'volume_ratio']
X = stock_features[feature_cols]
y = stock_features['target']
# Time series split (no random shuffle!)
split_idx = int(len(X) * 0.8)
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train multiple models
models = {
    'Linear': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.1),
    'ElasticNet': ElasticNet(alpha=0.1, l1_ratio=0.5)
}
results = {}
for name, model in models.items():
    # Train
    model.fit(X_train_scaled, y_train)
    # Predict
    predictions = model.predict(X_test_scaled)
    # Evaluate
    mae = mean_absolute_error(y_test, predictions)
    r2 = r2_score(y_test, predictions)
    results[name] = {
        'predictions': predictions,
        'mae': mae,
        'r2': r2
    }
    print(f"{name} - MAE: ${mae:.2f}, R²: {r2:.3f}")
# Visualize the best model's predictions
best_model = max(results.items(), key=lambda x: x[1]['r2'])
model_name, model_results = best_model
plt.figure(figsize=(15, 6))
# Plot 1: Predictions vs Actual
plt.subplot(1, 2, 1)
test_dates = stock_features['date'].iloc[split_idx:]  # use actual dates on the x-axis
plt.plot(test_dates, y_test.values, label='Actual Price', linewidth=2)
plt.plot(test_dates, model_results['predictions'],
         label=f'{model_name} Predictions', linewidth=2, alpha=0.8)
plt.xlabel('Date')
plt.ylabel('Stock Price ($)')
plt.title(f'Stock Price Predictions - {model_name} Model')
plt.legend()
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3)
# Plot 2: Prediction Error Distribution
plt.subplot(1, 2, 2)
errors = y_test.values - model_results['predictions']
plt.hist(errors, bins=30, alpha=0.7, color='skyblue', edgecolor='black')
plt.axvline(x=0, color='red', linestyle='--', linewidth=2)
plt.xlabel('Prediction Error ($)')
plt.ylabel('Frequency')
plt.title('Prediction Error Distribution')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Simple Trading Strategy
print("\nSimple Trading Strategy Results:")
predictions = model_results['predictions']
returns = []
for i in range(len(predictions) - 1):
    if predictions[i + 1] > y_test.iloc[i]:  # Model predicts a price increase
        daily_return = (y_test.iloc[i + 1] - y_test.iloc[i]) / y_test.iloc[i]
        returns.append(daily_return)
    else:  # Stay out of the market
        returns.append(0)
strategy_return = np.prod([1 + r for r in returns]) - 1
buy_hold_return = (y_test.iloc[-1] - y_test.iloc[0]) / y_test.iloc[0]
print(f"Strategy Return: {strategy_return * 100:.2f}%")
print(f"Buy & Hold Return: {buy_hold_return * 100:.2f}%")
print(f"Strategy {'beats' if strategy_return > buy_hold_return else 'loses to'} buy & hold!")
Key Takeaways
You've mastered the essentials of regression analysis! Here's what you can now do:
- Build regression models for real-world predictions
- Handle different types of regression (Linear, Polynomial, Regularized)
- Evaluate model performance using R², RMSE, and MAE
- Avoid common pitfalls like overfitting and scaling issues
- Apply advanced techniques like regularization and feature engineering
Remember: regression is about finding patterns in numbers - you're now equipped to uncover insights hidden in data!
Next Steps
Congratulations on mastering regression! You're becoming a true data scientist!
Here's what to explore next:
- Practice with real datasets from Kaggle or the UCI Machine Learning Repository
- Try time series regression with ARIMA models
- Learn about advanced regression techniques (Gaussian Process Regression, Support Vector Regression)
- Build a complete ML pipeline with feature engineering
- Move on to our next tutorial: Unsupervised Learning: Clustering
Keep predicting, keep learning, and remember - every expert was once a beginner. You're doing amazing!
Happy predicting!