Prerequisites
- Basic understanding of programming concepts
- Python installation (3.8+)
- VS Code or preferred IDE
What you'll learn
- Understand the fundamentals of regression
- Apply regression in real projects
- Debug common issues
- Write clean, Pythonic code
Introduction
Welcome to the wonderful world of regression analysis! Have you ever wondered how Netflix predicts your movie ratings, how real estate apps estimate house prices, or how weather forecasts predict tomorrow's temperature? That's regression in action!
Unlike classification (which puts things in boxes), regression predicts continuous values - actual numbers! Today, you'll learn how to build powerful prediction models that can forecast everything from stock prices to energy consumption. Let's unlock the magic of regression together!
Understanding Regression
Regression is like drawing the "best-fit line" through a cloud of data points. Remember plotting graphs in math class? It's that, but supercharged with machine learning!
The Magic of Regression
# Regression in action!
# Input: Features (X)
# Output: Continuous value (y)
# Example: Predicting House Prices
house_features = {
    "size_sqft": 1500,        # Feature 1
    "bedrooms": 3,            # Feature 2
    "location_score": 8.5,    # Feature 3
    "age_years": 10           # Feature 4
}
# Output: $325,000 (an actual number, not a category!)
Think of regression as teaching a computer to understand relationships: "When X goes up, Y tends to go up (or down) by this much." It's pattern recognition for numbers!
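Here is a minimal sketch of that idea using plain NumPy, with made-up numbers purely for illustration: a straight-line fit tells you roughly how much y moves for each unit of x.
# Toy illustration (made-up numbers): how much does y change per unit of x?
import numpy as np

x = np.array([1, 2, 3, 4, 5])        # e.g. years of experience
y = np.array([30, 35, 41, 44, 50])   # e.g. salary in $1000s
slope, intercept = np.polyfit(x, y, deg=1)  # fit y ≈ slope * x + intercept
print(f"Each extra unit of x adds about {slope:.1f} to y (intercept {intercept:.1f})")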
Basic Syntax and Usage
Let's start with the simplest form of regression - Linear Regression. It's like finding the perfect straight line through your data!
Your First Regression Model
# Let's build our first regression model!
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
# Create some fun data - ice cream sales vs temperature
np.random.seed(42)  # For reproducible randomness
temperature = np.random.uniform(20, 35, 100)  # Temperature in Celsius
ice_cream_sales = 2 * temperature + np.random.normal(0, 5, 100) + 50  # Sales
# Reshape for sklearn (it expects 2D feature arrays)
X = temperature.reshape(-1, 1)
y = ice_cream_sales
# Split the data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
predictions = model.predict(X_test)
print(f"Model trained! Slope: {model.coef_[0]:.2f}, Intercept: {model.intercept_:.2f}")
Visualizing Your Model
# Let's see our model in action!
plt.figure(figsize=(10, 6))
# Plot the data points
plt.scatter(X_train, y_train, color='blue', alpha=0.5, label='Training data')
plt.scatter(X_test, y_test, color='green', alpha=0.7, label='Test data')
# Plot the regression line
X_line = np.linspace(20, 35, 100).reshape(-1, 1)
y_line = model.predict(X_line)
plt.plot(X_line, y_line, color='red', linewidth=2, label='Model prediction')
plt.xlabel('Temperature (°C)')
plt.ylabel('Ice Cream Sales')
plt.title('Ice Cream Sales vs Temperature - Linear Regression')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
Practical Examples
Example 1: Real Estate Price Predictor
Let's build something practical - a house price predictor!
# House Price Prediction System
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
# Create a small, illustrative house dataset
house_data = pd.DataFrame({
    'size': [750, 1200, 1800, 2400, 3000, 1500, 2000, 2800],
    'bedrooms': [1, 2, 3, 4, 4, 3, 3, 5],
    'bathrooms': [1, 1, 2, 2.5, 3, 2, 2.5, 3],
    'age': [20, 15, 10, 5, 2, 8, 12, 1],
    'garage': [0, 1, 1, 2, 3, 1, 2, 3],
    'price': [150000, 250000, 350000, 450000, 600000, 320000, 380000, 750000]
})
# Prepare features and target
X = house_data[['size', 'bedrooms', 'bathrooms', 'age', 'garage']]
y = house_data['price']
# Scale the features (important when comparing coefficients or regularizing)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Split and train
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.25, random_state=42
)
# Multiple Linear Regression
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
predictions = model.predict(X_test)
# Evaluate the model
rmse = np.sqrt(mean_squared_error(y_test, predictions))
r2 = r2_score(y_test, predictions)
print("Model Performance:")
print(f"  Root Mean Squared Error: ${rmse:,.0f}")
print(f"  R² Score: {r2:.3f} (closer to 1 is better!)")
# Feature importance (coefficients on the scaled features)
feature_importance = pd.DataFrame({
    'feature': ['size', 'bedrooms', 'bathrooms', 'age', 'garage'],
    'coefficient': model.coef_,
    'impact': ['up' if c > 0 else 'down' for c in model.coef_]
})
print("\nFeature Impact on Price:")
print(feature_importance)
# Predict a new house (keep the same columns the scaler was fit on)
new_house = pd.DataFrame([[2200, 3, 2, 7, 2]], columns=X.columns)
new_house_scaled = scaler.transform(new_house)
predicted_price = model.predict(new_house_scaled)[0]
print(f"\nPredicted price for new house: ${predicted_price:,.0f}")
Example 2: Energy Consumption Predictor
Let's predict energy usage based on weather and time of day!
# Energy Consumption Prediction
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures
# Generate energy consumption data
hours = np.arange(24)  # Hours of the day
base_consumption = 100  # Base load
# Temperature effect (more AC/heating at extremes)
temperature = 20 + 10 * np.sin((hours - 6) * np.pi / 12)  # Daily temp cycle
temp_effect = 0.5 * (temperature - 22) ** 2  # Quadratic effect
# Human activity pattern
activity = np.where((hours >= 9) & (hours <= 17), 1.5, 1.0)  # Work hours
activity[20:23] = 1.3  # Evening peak
# Total consumption
consumption = base_consumption + temp_effect * 10 + activity * 50 + np.random.normal(0, 10, 24)
# Create feature matrix with polynomial features
X_energy = np.column_stack([hours, temperature])
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_energy)
# Compare different regression models
models = {
    'Linear': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),  # With regularization
    'Lasso': Lasso(alpha=0.1)   # Feature selection
}
plt.figure(figsize=(15, 5))
for i, (name, model) in enumerate(models.items(), 1):
    # Train the model
    model.fit(X_poly, consumption)
    predictions = model.predict(X_poly)
    # Plot results
    plt.subplot(1, 3, i)
    plt.scatter(hours, consumption, alpha=0.6, label='Actual')
    plt.plot(hours, predictions, 'r-', linewidth=2, label=name)
    plt.xlabel('Hour of Day')
    plt.ylabel('Energy (kWh)')
    plt.title(f'{name} Regression')
    plt.legend()
    plt.grid(True, alpha=0.3)
    # Calculate R² score (on the data the model was trained on)
    r2 = r2_score(consumption, predictions)
    plt.text(0.02, 0.98, f'R² = {r2:.3f}', transform=plt.gca().transAxes,
             bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5),
             verticalalignment='top')
plt.tight_layout()
plt.show()
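One caveat: those R² values are computed on the same 24 points each model was fit on, so they flatter every model. A cross-validated score is more honest - a minimal sketch (the shuffled 4-fold split is an arbitrary choice):
# Cross-validated R² for the same three models - a minimal sketch
from sklearn.model_selection import cross_val_score, KFold

cv = KFold(n_splits=4, shuffle=True, random_state=0)
for name, m in models.items():
    cv_scores = cross_val_score(m, X_poly, consumption, cv=cv, scoring='r2')
    print(f"{name}: CV R² = {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")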
Advanced Concepts
Polynomial Regression - When Lines Aren't Enough!
Sometimes relationships aren't straight lines. That's where polynomial regression shines!
# Polynomial Regression - Capturing Curves!
from sklearn.pipeline import Pipeline
# Create non-linear data (like a rollercoaster!)
X_curve = np.linspace(0, 10, 100).reshape(-1, 1)
y_curve = 3 * X_curve.ravel() ** 2 - 20 * X_curve.ravel() + 50 + np.random.normal(0, 10, 100)
# Create a polynomial pipeline
poly_pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=3)),  # Create polynomial features
    ('model', LinearRegression())            # Fit a linear model to the poly features
])
# Train the model
poly_pipeline.fit(X_curve, y_curve)
y_poly_pred = poly_pipeline.predict(X_curve)
# Visualize the fit
plt.figure(figsize=(10, 6))
plt.scatter(X_curve, y_curve, alpha=0.5, label='Data points')
plt.plot(X_curve, y_poly_pred, 'r-', linewidth=3, label='Polynomial fit')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Polynomial Regression - Capturing Complex Patterns!')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
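The degree is a knob you have to choose: too low and the model misses the curve, too high and it chases noise. One common way to pick it is cross-validation over a few candidate degrees - a minimal sketch reusing X_curve and y_curve (the candidate list is arbitrary):
# Choosing the polynomial degree by cross-validation - a minimal sketch
from sklearn.model_selection import cross_val_score, KFold

cv = KFold(n_splits=5, shuffle=True, random_state=0)
for degree in [1, 2, 3, 5, 9]:
    pipe = Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('model', LinearRegression())
    ])
    scores = cross_val_score(pipe, X_curve, y_curve, cv=cv, scoring='r2')
    print(f"degree {degree}: CV R² = {scores.mean():.3f}")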
Regularization - Preventing Overfitting
When models get too complex, they need guardrails. Enter regularization!
# Ridge vs Lasso - The Regularization Showdown!
from sklearn.linear_model import ElasticNet
# Create a dataset with many features (some useless!)
n_samples, n_features = 100, 20
X_complex = np.random.randn(n_samples, n_features)
# Only the first few features actually matter!
true_coefficients = np.zeros(n_features)
true_coefficients[:5] = [3, -2, 1.5, 0, -1]
y_complex = X_complex @ true_coefficients + np.random.normal(0, 0.5, n_samples)
# Compare regularization methods
alphas = [0.001, 0.01, 0.1, 1.0, 10.0]
methods = {
    'Ridge': Ridge,
    'Lasso': Lasso,
    'ElasticNet': ElasticNet
}
plt.figure(figsize=(15, 5))
for i, (name, Model) in enumerate(methods.items(), 1):
    plt.subplot(1, 3, i)
    for alpha in alphas:
        if name == 'ElasticNet':
            model = Model(alpha=alpha, l1_ratio=0.5)
        else:
            model = Model(alpha=alpha)
        model.fit(X_complex, y_complex)
        plt.plot(range(n_features), model.coef_,
                 marker='o', label=f'alpha={alpha}', alpha=0.7)
    plt.axhline(y=0, color='black', linestyle='--', alpha=0.3)
    plt.xlabel('Feature Index')
    plt.ylabel('Coefficient Value')
    plt.title(f'{name} Regularization')
    plt.legend()
    plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
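In practice you rarely hand-pick alpha; scikit-learn's CV variants can search it for you. A minimal sketch on the same synthetic data (the alpha grid mirrors the one above):
# Letting cross-validation choose alpha - a minimal sketch
from sklearn.linear_model import RidgeCV, LassoCV

alpha_grid = [0.001, 0.01, 0.1, 1.0, 10.0]
ridge_cv = RidgeCV(alphas=alpha_grid).fit(X_complex, y_complex)
lasso_cv = LassoCV(alphas=alpha_grid, cv=5).fit(X_complex, y_complex)
print(f"Ridge chose alpha={ridge_cv.alpha_}, Lasso chose alpha={lasso_cv.alpha_}")
print(f"Lasso zeroed out {int((lasso_cv.coef_ == 0).sum())} of {n_features} coefficients")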
Common Pitfalls and Solutions
Pitfall 1: Forgetting to Scale Features
# Wrong way - features on very different scales
X_unscaled = np.array([[1, 1000], [2, 2000], [3, 3000]])  # Big difference!
model_bad = LinearRegression()
model_bad.fit(X_unscaled, [10, 20, 30])
# Correct way - scale your features!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_unscaled)  # All features normalized
model_good = LinearRegression()
model_good.fit(X_scaled, [10, 20, 30])
print("Unscaled coefficients:", model_bad.coef_)
print("Scaled coefficients:", model_good.coef_)
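A related habit: fit the scaler on the training data only (or wrap it in a Pipeline) so nothing about the test set leaks into preprocessing. A minimal sketch on the house data from Example 1, with Ridge picked arbitrarily as the estimator:
# Scaling inside a Pipeline so it is fit on training data only - a minimal sketch
from sklearn.pipeline import make_pipeline

X_raw = house_data[['size', 'bedrooms', 'bathrooms', 'age', 'garage']]
y_raw = house_data['price']
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_raw, test_size=0.25, random_state=42)

pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pipe.fit(X_tr, y_tr)   # the scaler's mean/std come from X_tr only
print(f"Test R²: {pipe.score(X_te, y_te):.3f}")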
Pitfall 2: Ignoring Multicollinearity
# Dangerous - highly correlated features
bedrooms = np.array([1, 2, 3, 4, 5])
rooms = bedrooms * 2.5  # Perfectly correlated (an exact multiple)!
# Solution - check the correlation matrix!
import seaborn as sns
data = pd.DataFrame({'bedrooms': bedrooms, 'rooms': rooms})
correlation = data.corr()
plt.figure(figsize=(6, 4))
sns.heatmap(correlation, annot=True, cmap='coolwarm')
plt.title('Feature Correlation Matrix')
plt.show()
# If correlation > 0.8, consider dropping one feature!
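Correlation matrices only catch pairwise problems; variance inflation factors (VIF) also flag features that are predictable from a combination of the others. A minimal sketch, assuming statsmodels is installed, run on the house features from Example 1:
# Variance inflation factors - a minimal sketch (assumes statsmodels is available)
from statsmodels.stats.outliers_influence import variance_inflation_factor

features = house_data[['size', 'bedrooms', 'bathrooms', 'age', 'garage']]
for i, col in enumerate(features.columns):
    vif = variance_inflation_factor(features.values, i)
    print(f"{col}: VIF = {vif:.1f}")   # rule of thumb: VIF above ~5-10 signals trouble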
Pitfall 3: Overfitting with Small Datasets
# Wrong - complex model with little data
X_small = np.random.randn(10, 1)
y_small = 2 * X_small.squeeze() + np.random.normal(0, 0.1, 10)
# Too complex!
poly_overfit = Pipeline([
    ('poly', PolynomialFeatures(degree=9)),
    ('model', LinearRegression())
])
poly_overfit.fit(X_small, y_small)
# Better - a simpler model or regularization
simple_model = LinearRegression()
simple_model.fit(X_small, y_small)
# Or use Ridge for regularization
ridge_model = Ridge(alpha=1.0)
poly_ridge = Pipeline([
    ('poly', PolynomialFeatures(degree=9)),
    ('model', ridge_model)
])
poly_ridge.fit(X_small, y_small)
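You can see the overfitting directly by cross-validating all three models on those 10 points - a minimal sketch (with so little data the exact numbers vary a lot from run to run, but the unregularized degree-9 fit usually collapses):
# Cross-validated comparison on the tiny dataset - a minimal sketch
from sklearn.model_selection import cross_val_score

for label, m in [('degree-9 poly', poly_overfit),
                 ('plain linear', simple_model),
                 ('degree-9 + Ridge', poly_ridge)]:
    scores = cross_val_score(m, X_small, y_small, cv=5, scoring='r2')
    print(f"{label}: CV R² = {scores.mean():.3f}")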
Best Practices
1. Always Split Your Data
# Golden rule: never test on training data!
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
2. Scale Your Features
# Especially important for regularized models
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
3. Check Your Assumptions
# Residual analysis is your friend!
residuals = y_test - predictions
plt.scatter(predictions, residuals)
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot - Should Look Random!')
4. Try Multiple Models
# No single model rules them all!
models = {
    'Linear': LinearRegression(),
    'Ridge': Ridge(),
    'Lasso': Lasso(),
    'ElasticNet': ElasticNet()
}
for name, model in models.items():
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    print(f"{name}: R² = {score:.3f}")
5. Cross-Validation is Key
# Get robust performance estimates
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(f"CV R² Score: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
Hands-On Exercise
Challenge: Build a Stock Price Predictor
Create a regression model that predicts tomorrow's stock price based on historical data!
Requirements:
- Load historical stock data (volume, open, close, high, low)
- Create features (moving averages, price changes, volume ratios)
- Split data properly (time series awareness!)
- Try multiple regression models
- Visualize predictions vs actual prices
- Calculate performance metrics
Bonus Points:
- Add technical indicators (RSI, MACD)
- Implement walk-forward validation (see the sketch after the solution)
- Create a simple trading strategy based on predictions
Solution
Click to see solution
# Stock Price Prediction System
import pandas as pd
from sklearn.metrics import mean_absolute_error
# Create synthetic stock data (replace with real data!)
np.random.seed(42)
dates = pd.date_range('2023-01-01', periods=365, freq='D')
price = 100  # Starting price
# Generate realistic stock movements
prices = [price]
for _ in range(364):
    change = np.random.normal(0, 2)  # Daily % change
    price *= (1 + change / 100)
    prices.append(price)
stock_data = pd.DataFrame({
    'date': dates,
    'close': prices,
    'volume': np.random.uniform(1e6, 5e6, 365),
    'high': [p * np.random.uniform(1.0, 1.02) for p in prices],
    'low': [p * np.random.uniform(0.98, 1.0) for p in prices]
})
# Feature Engineering
def create_features(df):
    df = df.copy()
    # Price-based features
    df['returns'] = df['close'].pct_change()
    df['ma_5'] = df['close'].rolling(5).mean()
    df['ma_20'] = df['close'].rolling(20).mean()
    df['volatility'] = df['returns'].rolling(20).std()
    # Technical indicators
    df['rsi'] = calculate_rsi(df['close'])
    df['volume_ratio'] = df['volume'] / df['volume'].rolling(20).mean()
    # Target: next day's price
    df['target'] = df['close'].shift(-1)
    return df.dropna()
def calculate_rsi(prices, period=14):
    delta = prices.diff()
    gain = (delta.where(delta > 0, 0)).rolling(window=period).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window=period).mean()
    rs = gain / loss
    return 100 - (100 / (1 + rs))
# Prepare the data
stock_features = create_features(stock_data)
# Define features and target
feature_cols = ['returns', 'ma_5', 'ma_20', 'volatility', 'rsi', 'volume_ratio']
X = stock_features[feature_cols]
y = stock_features['target']
# Time series split (no random shuffle!)
split_idx = int(len(X) * 0.8)
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train multiple models
models = {
    'Linear': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.1),
    'ElasticNet': ElasticNet(alpha=0.1, l1_ratio=0.5)
}
results = {}
for name, model in models.items():
    # Train
    model.fit(X_train_scaled, y_train)
    # Predict
    predictions = model.predict(X_test_scaled)
    # Evaluate
    mae = mean_absolute_error(y_test, predictions)
    r2 = r2_score(y_test, predictions)
    results[name] = {
        'predictions': predictions,
        'mae': mae,
        'r2': r2
    }
    print(f"{name} - MAE: ${mae:.2f}, R²: {r2:.3f}")
# Visualize the best model's predictions
best_model = max(results.items(), key=lambda x: x[1]['r2'])
model_name, model_results = best_model
plt.figure(figsize=(15, 6))
# Plot 1: Predictions vs Actual
plt.subplot(1, 2, 1)
test_dates = stock_features['date'].iloc[split_idx:]  # use actual dates on the x-axis
plt.plot(test_dates, y_test.values, label='Actual Price', linewidth=2)
plt.plot(test_dates, model_results['predictions'],
         label=f'{model_name} Predictions', linewidth=2, alpha=0.8)
plt.xlabel('Date')
plt.ylabel('Stock Price ($)')
plt.title(f'Stock Price Predictions - {model_name} Model')
plt.legend()
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3)
# Plot 2: Prediction Error Distribution
plt.subplot(1, 2, 2)
errors = y_test.values - model_results['predictions']
plt.hist(errors, bins=30, alpha=0.7, color='skyblue', edgecolor='black')
plt.axvline(x=0, color='red', linestyle='--', linewidth=2)
plt.xlabel('Prediction Error ($)')
plt.ylabel('Frequency')
plt.title('Prediction Error Distribution')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Simple Trading Strategy
print("\nSimple Trading Strategy Results:")
predictions = model_results['predictions']
returns = []
for i in range(len(predictions) - 1):
    if predictions[i + 1] > y_test.iloc[i]:  # Model predicts a price increase
        daily_return = (y_test.iloc[i + 1] - y_test.iloc[i]) / y_test.iloc[i]
        returns.append(daily_return)
    else:  # Stay out of the market
        returns.append(0)
strategy_return = np.prod([1 + r for r in returns]) - 1
buy_hold_return = (y_test.iloc[-1] - y_test.iloc[0]) / y_test.iloc[0]
print(f"Strategy Return: {strategy_return * 100:.2f}%")
print(f"Buy & Hold Return: {buy_hold_return * 100:.2f}%")
print(f"Strategy {'beats' if strategy_return > buy_hold_return else 'loses to'} buy & hold!")
Key Takeaways
You've mastered the essentials of regression analysis! Here's what you can now do:
- Build regression models for real-world predictions
- Handle different types of regression (Linear, Polynomial, Regularized)
- Evaluate model performance using R², RMSE, and MAE
- Avoid common pitfalls like overfitting and scaling issues
- Apply advanced techniques like regularization and feature engineering
Remember: regression is about finding patterns in numbers - you're now equipped to uncover insights hidden in data!
Next Steps
Congratulations on mastering regression! You're becoming a true data scientist!
Here's what to explore next:
- Practice with real datasets from Kaggle or the UCI Machine Learning Repository
- Try time series regression with ARIMA models
- Learn about advanced regression techniques (Gaussian Process Regression, Support Vector Regression)
- Build a complete ML pipeline with feature engineering
- Move on to our next tutorial: Unsupervised Learning: Clustering
Keep predicting, keep learning, and remember - every expert was once a beginner. You're doing amazing!
Happy predicting!