Prerequisites
- Basic understanding of programming concepts
- Python installation (3.8+)
- VS Code or preferred IDE
What you'll learn
- Understand the concept fundamentals
- Apply the concept in real projects
- Debug common issues
- Write clean, Pythonic code
Introduction
Welcome to this exciting tutorial on Statsmodels! In this guide, we'll explore how to perform powerful statistical modeling and analysis in Python using the statsmodels library.
You'll discover how statsmodels can transform your data analysis workflow. Whether you're building predictive models, conducting hypothesis tests, or exploring relationships in your data, understanding statsmodels is essential for data scientists and analysts.
By the end of this tutorial, you'll feel confident using statsmodels in your own projects. Let's dive in!
Understanding Statsmodels
What is Statsmodels?
Statsmodels is like having a complete statistics laboratory in Python! Think of it as your personal statistical advisor that helps you understand relationships in data, test hypotheses, and build predictive models.
In Python terms, statsmodels provides:
- Comprehensive statistical tests and models
- Regression analysis (linear, logistic, and more)
- Time series analysis tools
- Statistical summaries and diagnostics
Why Use Statsmodels?
Here's why data scientists love statsmodels:
- Rich Statistical Output: Detailed model summaries similar to R's summary() output
- Academic Rigor: Based on established statistical methods
- Diagnostic Tools: Check model assumptions easily
- Formula API: R-like formula syntax for specifying models
Real-world example: Imagine analyzing sales data. With statsmodels, you can build models to understand what factors drive sales and make predictions!
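Statsmodels actually exposes two interfaces for the same models: an array-based one and the formula one. A minimal sketch of the difference (the toy data here is purely illustrative):
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({'x': np.arange(10.0)})
df['y'] = 2 * df['x'] + np.random.normal(0, 1, 10)

# Array interface: you add the intercept column yourself
fit_arrays = sm.OLS(df['y'], sm.add_constant(df['x'])).fit()

# Formula interface: the intercept is implicit, R-style
fit_formula = smf.ols('y ~ x', data=df).fit()
Both fits give the same coefficients; the formula interface just reads more like the model you have in mind.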
Basic Syntax and Usage
Simple Linear Regression
Let's start with a friendly example:
# Hello, Statsmodels!
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

# Create some sample data
np.random.seed(42)
data = pd.DataFrame({
    'hours_studied': np.random.uniform(1, 10, 100),  # Study hours
    'coffee_cups': np.random.randint(0, 5, 100),     # Coffee consumed
})

# Generate test scores with some relationship
data['test_score'] = (
    50 + 5 * data['hours_studied'] +
    2 * data['coffee_cups'] +
    np.random.normal(0, 5, 100)  # Add some randomness
)

# Build our first model!
model = smf.ols('test_score ~ hours_studied + coffee_cups', data=data)
results = model.fit()

# View the summary!
print(results.summary())
Explanation: We're predicting test scores based on study hours and coffee consumption! The formula syntax makes the model specification super readable.
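The summary table is the headline, but the fitted results object also exposes each number directly. A quick sketch, continuing from the results object above:
print(results.params)                        # coefficient estimates
print(results.pvalues)                       # per-coefficient p-values
print(results.conf_int())                    # 95% confidence intervals
print(f"R-squared: {results.rsquared:.3f}")  # goodness of fit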
Common Statistical Tests
Here are tests you'll use daily:
# T-test example
from statsmodels.stats.weightstats import ttest_ind

# Compare two groups
group_a = np.random.normal(100, 15, 50)  # Group A scores
group_b = np.random.normal(105, 15, 50)  # Group B scores

# Perform t-test
t_stat, p_value, df = ttest_ind(group_a, group_b)
print(f"T-statistic: {t_stat:.3f}")
print(f"P-value: {p_value:.3f}")
print(f"Significant? {'Yes!' if p_value < 0.05 else 'No'}")
# ANOVA example
import statsmodels.stats.anova as anova

# Fitness data across three training programs
data = pd.DataFrame({
    'fitness_score': np.concatenate([
        np.random.normal(70, 10, 30),  # Program A
        np.random.normal(75, 10, 30),  # Program B
        np.random.normal(80, 10, 30)   # Program C
    ]),
    'program': ['A']*30 + ['B']*30 + ['C']*30
})

# One-way ANOVA
model = smf.ols('fitness_score ~ program', data=data).fit()
anova_table = anova.anova_lm(model)
print("\nANOVA Results:")
print(anova_table)
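A significant ANOVA only says the group means differ somewhere; it doesn't tell you which pairs differ. One common follow-up, sketched here on the same data frame, is Tukey's HSD post-hoc test:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

tukey = pairwise_tukeyhsd(endog=data['fitness_score'],
                          groups=data['program'], alpha=0.05)
print(tukey.summary())  # one row per pair of programs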
Practical Examples
Example 1: E-commerce Sales Analysis
Let's analyze what drives online sales:
# Create e-commerce dataset
np.random.seed(123)
n_days = 365
ecommerce_data = pd.DataFrame({
    'date': pd.date_range('2023-01-01', periods=n_days),
    'ad_spend': np.random.uniform(100, 1000, n_days),          # Daily ad spend
    'email_sent': np.random.randint(1000, 5000, n_days),       # Emails sent
    'website_visits': np.random.randint(5000, 20000, n_days),  # Daily visits
    'is_weekend': np.tile([0,0,0,0,0,1,1], n_days//7 + 1)[:n_days],  # Weekend flag
})
# Generate sales with realistic relationships
ecommerce_data['sales'] = (
    2000 +                                    # Base sales
    0.5 * ecommerce_data['ad_spend'] +        # Ad impact
    0.02 * ecommerce_data['email_sent'] +     # Email impact
    0.1 * ecommerce_data['website_visits'] +  # Traffic impact
    -500 * ecommerce_data['is_weekend'] +     # Weekend effect
    np.random.normal(0, 500, n_days)          # Random variation
)
# Build the model
sales_model = smf.ols(
    'sales ~ ad_spend + email_sent + website_visits + is_weekend',
    data=ecommerce_data
)
sales_results = sales_model.fit()

# Interpret results
print("Sales Analysis Results:")
print(sales_results.summary())

# Visualize actual vs predicted
ecommerce_data['predicted_sales'] = sales_results.predict()
plt.figure(figsize=(10, 6))
plt.scatter(ecommerce_data['sales'], ecommerce_data['predicted_sales'],
            alpha=0.5, color='blue', label='Daily Sales')
plt.plot([ecommerce_data['sales'].min(), ecommerce_data['sales'].max()],
         [ecommerce_data['sales'].min(), ecommerce_data['sales'].max()],
         'r--', label='Perfect Prediction')
plt.xlabel('Actual Sales')
plt.ylabel('Predicted Sales')
plt.title('Model Performance: Actual vs Predicted Sales')
plt.legend()
plt.show()

# Feature importance
print("\nFeature Impact on Sales:")
for feature, coef in sales_results.params.iloc[1:].items():  # skip the intercept
    impact = "Positive" if coef > 0 else "Negative"
    print(f"{feature}: {coef:.2f} ({impact})")
Try it yourself: Add seasonal effects or product categories to the model! One possible starting point is sketched below.
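A hedged sketch of the seasonal-effects idea: derive a month column from the date and let the formula API dummy-code it with C(). This is one assumption about how seasonality might enter the model, not the only option:
ecommerce_data['month'] = ecommerce_data['date'].dt.month
seasonal_model = smf.ols(
    'sales ~ ad_spend + email_sent + website_visits + is_weekend + C(month)',
    data=ecommerce_data
).fit()
print(seasonal_model.summary())  # one dummy coefficient per month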
Example 2: A/B Testing for Game Features
Let's test if a new game feature improves player engagement:
# Game A/B test data
np.random.seed(456)

# Control group (old feature)
control_group = pd.DataFrame({
    'group': 'control',
    'player_id': range(1000),
    'daily_playtime': np.random.gamma(2, 30, 1000),   # Minutes played
    'sessions': np.random.poisson(3, 1000),           # Daily sessions
    'purchases': np.random.binomial(1, 0.1, 1000),    # Made purchase?
})

# Treatment group (new feature)
treatment_group = pd.DataFrame({
    'group': 'treatment',
    'player_id': range(1000, 2000),
    'daily_playtime': np.random.gamma(2.3, 32, 1000),  # Slightly higher!
    'sessions': np.random.poisson(3.5, 1000),          # More sessions!
    'purchases': np.random.binomial(1, 0.15, 1000),    # More purchases!
})

# Combine data
ab_test_data = pd.concat([control_group, treatment_group])
# Statistical tests
print("A/B Test Results:\n")

# Test playtime difference
from statsmodels.stats.weightstats import ttest_ind
t_stat, p_val, _ = ttest_ind(
    treatment_group['daily_playtime'],
    control_group['daily_playtime']
)
print("Playtime Test:")
print(f"  Control: {control_group['daily_playtime'].mean():.1f} min")
print(f"  Treatment: {treatment_group['daily_playtime'].mean():.1f} min")
print(f"  P-value: {p_val:.4f} {'Significant!' if p_val < 0.05 else 'Not significant'}")
# Test purchase rate difference
from statsmodels.stats.proportion import proportions_ztest
control_purchases = control_group['purchases'].sum()
treatment_purchases = treatment_group['purchases'].sum()
counts = [treatment_purchases, control_purchases]
nobs = [len(treatment_group), len(control_group)]
z_stat, p_val = proportions_ztest(counts, nobs)
print("\nPurchase Rate Test:")
print(f"  Control: {control_purchases/len(control_group)*100:.1f}%")
print(f"  Treatment: {treatment_purchases/len(treatment_group)*100:.1f}%")
print(f"  P-value: {p_val:.4f} {'Significant!' if p_val < 0.05 else 'Not significant'}")
# Effect size (Cohen's d, using the control group's standard deviation)
effect_size = (treatment_group['daily_playtime'].mean() -
               control_group['daily_playtime'].mean()) / control_group['daily_playtime'].std()
print(f"\nEffect Size: {effect_size:.3f}")
print('Large effect!' if abs(effect_size) > 0.8
      else 'Medium effect' if abs(effect_size) > 0.5 else 'Small effect')
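The inverse question is also worth asking before running the experiment: how many players per group would you need to detect a given effect? A sketch with statsmodels' power tools (the effect_size=0.2 target and 80% power are illustrative choices, not recommendations):
from statsmodels.stats.power import TTestIndPower

needed_n = TTestIndPower().solve_power(effect_size=0.2, alpha=0.05, power=0.8)
print(f"Players needed per group for a small effect: {needed_n:.0f}")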
Advanced Concepts
Time Series Analysis
When you're ready to level up, try time series modeling:
# Advanced time series with ARIMA
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Create time series data
dates = pd.date_range('2020-01-01', periods=365*3, freq='D')
trend = np.linspace(100, 200, len(dates))
seasonal = 10 * np.sin(2 * np.pi * np.arange(len(dates)) / 365.25)
noise = np.random.normal(0, 5, len(dates))
sales = trend + seasonal + noise
ts_data = pd.Series(sales, index=dates)

# Fit ARIMA model
model = ARIMA(ts_data, order=(1, 1, 1))
results = model.fit()

# Make predictions
forecast = results.forecast(steps=30)
print(f"Next 30 days forecast (mean): {forecast.mean():.2f}")
# Plot results with a model-based 95% interval
# (get_forecast gives per-step confidence bounds; 1.96 * the std of the
# point forecasts would not be a valid forecast interval)
conf = results.get_forecast(steps=30).conf_int()
plt.figure(figsize=(12, 6))
plt.plot(ts_data.index, ts_data.values, label='Historical Sales')
plt.plot(forecast.index, forecast.values, 'r--', label='Forecast')
plt.fill_between(conf.index, conf.iloc[:, 0], conf.iloc[:, 1],
                 alpha=0.3, color='red', label='95% Confidence')
plt.legend()
plt.title('Time Series Forecast')
plt.show()
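The (1, 1, 1) order above was picked by hand. The plot_acf and plot_pacf functions imported earlier are the usual tools for informing that choice; a short sketch on the differenced series:
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(ts_data.diff().dropna(), lags=40, ax=axes[0])   # informs the MA order q
plot_pacf(ts_data.diff().dropna(), lags=40, ax=axes[1])  # informs the AR order p
plt.show()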
Logistic Regression for Classification
For the brave data scientists:
# Customer churn prediction
churn_data = pd.DataFrame({
    'tenure': np.random.uniform(0, 72, 1000),             # Months as customer
    'monthly_charges': np.random.uniform(20, 100, 1000),  # Monthly bill
    'total_charges': np.random.uniform(100, 7000, 1000),  # Total spent
    'num_services': np.random.randint(1, 8, 1000),        # Services subscribed
})
# Generate churn based on features
churn_prob = 1 / (1 + np.exp(
    -(-3 + 0.05 * churn_data['tenure'] -
      0.02 * churn_data['monthly_charges'] +
      0.3 * churn_data['num_services'])
))
churn_data['churned'] = np.random.binomial(1, churn_prob)

# Build logistic regression
logit_model = smf.logit(
    'churned ~ tenure + monthly_charges + total_charges + num_services',
    data=churn_data
)
logit_results = logit_model.fit()
print("Churn Prediction Model:")
print(logit_results.summary())
# Interpret odds ratios
print("\nOdds Ratios (Impact on Churn):")
odds_ratios = np.exp(logit_results.params)
for feature, odds in odds_ratios.iloc[1:].items():  # skip the intercept
    if odds > 1:
        print(f"{feature}: {odds:.3f} (Increases churn risk)")
    else:
        print(f"{feature}: {odds:.3f} (Decreases churn risk)")
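With the formula API, the fitted Logit results predict probabilities directly from a data frame with the same columns. A sketch of turning those into class labels (the 0.5 cutoff is an arbitrary illustrative choice, and this is in-sample accuracy, so treat it optimistically):
churn_data['churn_prob'] = logit_results.predict(churn_data)  # P(churned = 1)
churn_data['predicted'] = (churn_data['churn_prob'] > 0.5).astype(int)
accuracy = (churn_data['predicted'] == churn_data['churned']).mean()
print(f"In-sample accuracy: {accuracy:.1%}")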
Common Pitfalls and Solutions
Pitfall 1: Forgetting to Check Assumptions
# Wrong way - jumping straight to conclusions!
model = smf.ols('y ~ x', data=df).fit()
print("X causes Y!")  # Whoa there!

# Correct way - check assumptions first!
model = smf.ols('y ~ x', data=df).fit()

# Check residuals
residuals = model.resid
plt.figure(figsize=(10, 4))

# Residual plot
plt.subplot(1, 2, 1)
plt.scatter(model.fittedvalues, residuals, alpha=0.5)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Fitted')

# Q-Q plot (pass ax, otherwise qqplot opens its own figure)
from statsmodels.graphics.gofplots import qqplot
plt.subplot(1, 2, 2)
qqplot(residuals, line='s', ax=plt.gca())
plt.title('Q-Q Plot')
plt.tight_layout()
plt.show()

# Statistical tests
from statsmodels.stats.diagnostic import het_breuschpagan
_, pval, _, _ = het_breuschpagan(residuals, model.model.exog)
print(f"Heteroscedasticity test p-value: {pval:.4f}")
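If the Breusch-Pagan test flags heteroscedasticity, one common remedy is refitting with heteroscedasticity-robust standard errors rather than abandoning the model. A sketch with the same toy formula (cov_type='HC3' is one of several robust options statsmodels accepts):
robust_results = smf.ols('y ~ x', data=df).fit(cov_type='HC3')
print(robust_results.summary())  # same coefficients, corrected standard errors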
Pitfall 2: P-hacking and Multiple Testing
# Dangerous - testing everything!
for col in df.columns:
    model = smf.ols(f'outcome ~ {col}', data=df).fit()
    if model.pvalues.iloc[1] < 0.05:
        print(f"Found significance with {col}!")  # False discoveries!

# Safe - adjust for multiple comparisons!
from statsmodels.stats.multitest import multipletests
p_values = []
features = []
for col in df.columns:
    if col != 'outcome':
        model = smf.ols(f'outcome ~ {col}', data=df).fit()
        p_values.append(model.pvalues.iloc[1])
        features.append(col)

# Bonferroni correction
rejected, p_adjusted, _, _ = multipletests(p_values, method='bonferroni')
print("Adjusted Results:")
for feature, p_orig, p_adj, significant in zip(features, p_values, p_adjusted, rejected):
    print(f"{feature}: p={p_orig:.4f} -> adjusted p={p_adj:.4f} "
          f"{'significant' if significant else 'not significant'}")
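Bonferroni is the most conservative correction; if controlling the false discovery rate is enough for your use case, Benjamini-Hochberg is a common alternative via the same function:
rejected_fdr, p_fdr, _, _ = multipletests(p_values, method='fdr_bh')  # FDR control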
Best Practices
- Start Simple: Build basic models before complex ones
- Visualize First: Plot your data before modeling
- Check Assumptions: Validate model assumptions
- Cross-validate: Use holdout data for validation (see the sketch below)
- Document Results: Keep track of model versions
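A minimal holdout sketch for the cross-validation point, assuming a generic data frame df with outcome y and predictor x (the names and the 80/20 split are placeholders):
train = df.sample(frac=0.8, random_state=0)  # 80% for fitting
test = df.drop(train.index)                  # 20% held out
fit = smf.ols('y ~ x', data=train).fit()
rmse = np.sqrt(((test['y'] - fit.predict(test)) ** 2).mean())
print(f"Holdout RMSE: {rmse:.2f}")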
Hands-On Exercise
Challenge: Build a Health Analytics Dashboard
Create a comprehensive health study analysis:
Requirements:
- Load health metrics data (weight, exercise, sleep)
- Perform hypothesis tests between groups
- Build regression models for health outcomes
- Include demographic factors
- Create visualizations of findings!
Bonus Points:
- Add interaction effects between variables
- Implement model diagnostics
- Create a summary report function
Solution
# Health Analytics Solution!
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
from statsmodels.stats.weightstats import ttest_ind  # used in the demographic analysis

# Generate health study data
np.random.seed(789)
n_participants = 500
health_data = pd.DataFrame({
    'participant_id': range(n_participants),
    'age': np.random.normal(45, 15, n_participants),
    'gender': np.random.choice(['M', 'F'], n_participants),
    'exercise_hours': np.random.gamma(2, 2, n_participants),   # Weekly exercise
    'sleep_hours': np.random.normal(7, 1.5, n_participants),   # Daily sleep
    'diet_quality': np.random.randint(1, 11, n_participants),  # Diet score 1-10
    'stress_level': np.random.randint(1, 11, n_participants),  # Stress 1-10
})
# Generate BMI with realistic relationships
health_data['bmi'] = (
    25 +
    0.1 * health_data['age'] +
    -0.5 * health_data['exercise_hours'] +
    -0.3 * health_data['sleep_hours'] +
    -0.4 * health_data['diet_quality'] +
    0.3 * health_data['stress_level'] +
    2 * (health_data['gender'] == 'M') +
    np.random.normal(0, 2, n_participants)
)
class HealthAnalyzer:
    def __init__(self, data):
        self.data = data
        self.results = {}

    def demographic_analysis(self):
        """Analyze differences by demographics"""
        print("Demographic Analysis:")
        print("=" * 50)

        # Gender comparison
        male_bmi = self.data[self.data['gender'] == 'M']['bmi']
        female_bmi = self.data[self.data['gender'] == 'F']['bmi']
        t_stat, p_val, _ = ttest_ind(male_bmi, female_bmi)
        print("\nBMI by Gender:")
        print(f"  Male: {male_bmi.mean():.2f} ± {male_bmi.std():.2f}")
        print(f"  Female: {female_bmi.mean():.2f} ± {female_bmi.std():.2f}")
        print(f"  P-value: {p_val:.4f} {'Significant' if p_val < 0.05 else 'Not significant'}")

        # Age correlation
        age_corr = self.data[['age', 'bmi', 'exercise_hours', 'sleep_hours']].corr()
        print("\nAge Correlations:")
        for var in ['bmi', 'exercise_hours', 'sleep_hours']:
            corr = age_corr.loc['age', var]
            print(f"  Age vs {var}: {corr:.3f}")
    def build_health_model(self):
        """Build comprehensive health model"""
        print("\nHealth Prediction Model:")
        print("=" * 50)

        # Full model with an interaction term
        model = smf.ols('''bmi ~ age + gender + exercise_hours + sleep_hours +
                           diet_quality + stress_level +
                           exercise_hours:diet_quality''', data=self.data)
        self.results['model'] = model.fit()
        print(self.results['model'].summary())

        # Feature importance (by absolute coefficient size)
        print("\nFeature Importance:")
        params = self.results['model'].params.iloc[1:]  # Skip intercept
        importance = abs(params).sort_values(ascending=False)
        for feature, value in importance.items():
            print(f"  {feature}: {value:.3f}")
        return self.results['model']
    def diagnostic_plots(self):
        """Create diagnostic visualizations"""
        model = self.results['model']
        fig, axes = plt.subplots(2, 2, figsize=(12, 10))

        # Residuals vs Fitted
        axes[0, 0].scatter(model.fittedvalues, model.resid, alpha=0.5)
        axes[0, 0].axhline(y=0, color='r', linestyle='--')
        axes[0, 0].set_xlabel('Fitted Values')
        axes[0, 0].set_ylabel('Residuals')
        axes[0, 0].set_title('Residuals vs Fitted')

        # Q-Q Plot
        from statsmodels.graphics.gofplots import qqplot
        qqplot(model.resid, line='s', ax=axes[0, 1])
        axes[0, 1].set_title('Normal Q-Q Plot')

        # Scale-Location
        axes[1, 0].scatter(model.fittedvalues, np.sqrt(np.abs(model.resid)), alpha=0.5)
        axes[1, 0].set_xlabel('Fitted Values')
        axes[1, 0].set_ylabel('√|Residuals|')
        axes[1, 0].set_title('Scale-Location Plot')

        # Actual vs Predicted
        axes[1, 1].scatter(self.data['bmi'], model.fittedvalues, alpha=0.5)
        axes[1, 1].plot([self.data['bmi'].min(), self.data['bmi'].max()],
                        [self.data['bmi'].min(), self.data['bmi'].max()], 'r--')
        axes[1, 1].set_xlabel('Actual BMI')
        axes[1, 1].set_ylabel('Predicted BMI')
        axes[1, 1].set_title('Actual vs Predicted')
        plt.tight_layout()
        plt.show()
    def generate_report(self):
        """Generate summary report"""
        print("\nHealth Study Summary Report")
        print("=" * 50)
        print(f"Participants: {len(self.data)}")
        print(f"Gender Split: {self.data['gender'].value_counts().to_dict()}")
        print(f"Age Range: {self.data['age'].min():.0f} - {self.data['age'].max():.0f}")
        print("\nKey Findings:")

        # Extract key coefficients
        model = self.results['model']
        for var in ['exercise_hours', 'sleep_hours', 'diet_quality']:
            coef = model.params[var]
            pval = model.pvalues[var]
            impact = "increases" if coef > 0 else "decreases"
            sig = "significant" if pval < 0.05 else "not significant"
            print(f"  - Each unit increase in {var} {impact} BMI by {abs(coef):.2f} ({sig})")
# Run the analysis!
analyzer = HealthAnalyzer(health_data)
analyzer.demographic_analysis()
model = analyzer.build_health_model()
analyzer.diagnostic_plots()
analyzer.generate_report()

# Success message
print("\nCongratulations! You've completed a full statistical analysis!")
Key Takeaways
You've learned so much! Here's what you can now do:
- Perform statistical tests with confidence
- Build regression models for prediction
- Analyze time series data like a pro
- Check model assumptions properly
- Interpret statistical output correctly
Remember: Statistics is about understanding relationships in data, not just running tests!
Next Steps
Congratulations! You've mastered the statsmodels essentials!
Here's what to do next:
- Practice with the exercises above
- Apply statsmodels to your own datasets
- Explore advanced models (GLM, GAM, State Space)
- Combine with scikit-learn for machine learning
Remember: Every data scientist started as a beginner. Keep analyzing, keep learning, and most importantly, have fun with data!
Happy statistical modeling!