+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Part 383 of 541

๐Ÿ“˜ Statsmodels: Statistical Modeling

Master statsmodels: statistical modeling in Python with practical examples, best practices, and real-world applications ๐Ÿš€

๐Ÿš€Intermediate
25 min read

Prerequisites

  • Basic understanding of programming concepts ๐Ÿ“
  • Python installation (3.8+) ๐Ÿ
  • VS Code or preferred IDE ๐Ÿ’ป

What you'll learn

  • Understand the concept fundamentals ๐ŸŽฏ
  • Apply the concept in real projects ๐Ÿ—๏ธ
  • Debug common issues ๐Ÿ›
  • Write clean, Pythonic code โœจ

๐ŸŽฏ Introduction

Welcome to this exciting tutorial on Statsmodels! ๐ŸŽ‰ In this guide, weโ€™ll explore how to perform powerful statistical modeling and analysis in Python using the statsmodels library.

Youโ€™ll discover how statsmodels can transform your data analysis experience. Whether youโ€™re building predictive models ๐Ÿ“Š, conducting hypothesis tests ๐Ÿ”ฌ, or exploring relationships in your data ๐Ÿ“ˆ, understanding statsmodels is essential for data scientists and analysts.

By the end of this tutorial, youโ€™ll feel confident using statsmodels in your own projects! Letโ€™s dive in! ๐ŸŠโ€โ™‚๏ธ

๐Ÿ“š Understanding Statsmodels

๐Ÿค” What is Statsmodels?

Statsmodels is like having a complete statistics laboratory in Python! ๐Ÿงช Think of it as your personal statistical advisor that helps you understand relationships in data, test hypotheses, and build predictive models.

In Python terms, statsmodels provides:

  • โœจ Comprehensive statistical tests and models
  • ๐Ÿš€ Regression analysis (linear, logistic, and more)
  • ๐Ÿ›ก๏ธ Time series analysis tools
  • ๐Ÿ“Š Statistical summaries and diagnostics

๐Ÿ’ก Why Use Statsmodels?

Hereโ€™s why data scientists love statsmodels:

  1. Rich Statistical Output ๐Ÿ“‹: Detailed summaries like R
  2. Academic Rigor ๐ŸŽ“: Based on established statistical methods
  3. Diagnostic Tools ๐Ÿ”: Check model assumptions easily
  4. Formula API ๐Ÿ“: R-like formula syntax for models

Real-world example: Imagine analyzing sales data ๐Ÿ’ฐ. With statsmodels, you can build models to understand what factors drive sales and make predictions!

๐Ÿ”ง Basic Syntax and Usage

๐Ÿ“ Simple Linear Regression

Letโ€™s start with a friendly example:

# ๐Ÿ‘‹ Hello, Statsmodels!
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

# ๐ŸŽจ Create some sample data
np.random.seed(42)
data = pd.DataFrame({
    'hours_studied': np.random.uniform(1, 10, 100),  # ๐Ÿ“š Study hours
    'coffee_cups': np.random.randint(0, 5, 100),     # โ˜• Coffee consumed
})
# ๐ŸŽฏ Generate test scores with some relationship
data['test_score'] = (
    50 + 5 * data['hours_studied'] + 
    2 * data['coffee_cups'] + 
    np.random.normal(0, 5, 100)  # ๐ŸŽฒ Add some randomness
)

# ๐Ÿš€ Build our first model!
model = smf.ols('test_score ~ hours_studied + coffee_cups', data=data)
results = model.fit()

# ๐Ÿ“Š View the magical summary!
print(results.summary())

๐Ÿ’ก Explanation: Weโ€™re predicting test scores based on study hours and coffee consumption! The formula syntax makes it super readable.

๐ŸŽฏ Common Statistical Tests

Here are tests youโ€™ll use daily:

# ๐Ÿงช T-test example
from statsmodels.stats.weightstats import ttest_ind

# ๐ŸŽฎ Compare two groups
group_a = np.random.normal(100, 15, 50)  # ๐Ÿ…ฐ๏ธ Group A scores
group_b = np.random.normal(105, 15, 50)  # ๐Ÿ…ฑ๏ธ Group B scores

# ๐Ÿ”ฌ Perform t-test
t_stat, p_value, df = ttest_ind(group_a, group_b)
print(f"๐Ÿ“Š T-statistic: {t_stat:.3f}")
print(f"๐ŸŽฏ P-value: {p_value:.3f}")
print(f"๐Ÿ’ก Significant? {'Yes! ๐ŸŽ‰' if p_value < 0.05 else 'No ๐Ÿ˜'}")

# ๐Ÿ“ˆ ANOVA example
import statsmodels.stats.anova as anova

# ๐Ÿƒ Fitness data across three training programs
data = pd.DataFrame({
    'fitness_score': np.concatenate([
        np.random.normal(70, 10, 30),   # ๐Ÿƒ Program A
        np.random.normal(75, 10, 30),   # ๐ŸŠ Program B
        np.random.normal(80, 10, 30)    # ๐Ÿšด Program C
    ]),
    'program': ['A']*30 + ['B']*30 + ['C']*30
})

# ๐ŸŽช One-way ANOVA
model = smf.ols('fitness_score ~ program', data=data).fit()
anova_table = anova.anova_lm(model)
print("\n๐ŸŽฏ ANOVA Results:")
print(anova_table)

๐Ÿ’ก Practical Examples

๐Ÿ›’ Example 1: E-commerce Sales Analysis

Letโ€™s analyze what drives online sales:

# ๐Ÿ›๏ธ Create e-commerce dataset
np.random.seed(123)
n_days = 365

ecommerce_data = pd.DataFrame({
    'date': pd.date_range('2023-01-01', periods=n_days),
    'ad_spend': np.random.uniform(100, 1000, n_days),      # ๐Ÿ’ต Daily ad spend
    'email_sent': np.random.randint(1000, 5000, n_days),   # ๐Ÿ“ง Emails sent
    'website_visits': np.random.randint(5000, 20000, n_days), # ๐ŸŒ Daily visits
    'is_weekend': np.tile([0,0,0,0,0,1,1], n_days//7 + 1)[:n_days], # ๐Ÿ“… Weekend flag
})

# ๐Ÿ’ฐ Generate sales with realistic relationships
ecommerce_data['sales'] = (
    2000 +  # Base sales
    0.5 * ecommerce_data['ad_spend'] +          # ๐Ÿ“ˆ Ad impact
    0.02 * ecommerce_data['email_sent'] +       # ๐Ÿ“ง Email impact
    0.1 * ecommerce_data['website_visits'] +    # ๐ŸŒ Traffic impact
    -500 * ecommerce_data['is_weekend'] +       # ๐Ÿ“… Weekend effect
    np.random.normal(0, 500, n_days)            # ๐ŸŽฒ Random variation
)

# ๐Ÿ”ฎ Build the model
sales_model = smf.ols(
    'sales ~ ad_spend + email_sent + website_visits + is_weekend', 
    data=ecommerce_data
)
sales_results = sales_model.fit()

# ๐Ÿ“Š Interpret results
print("๐ŸŽฏ Sales Analysis Results:")
print(sales_results.summary())

# ๐ŸŽจ Visualize actual vs predicted
ecommerce_data['predicted_sales'] = sales_results.predict()

plt.figure(figsize=(10, 6))
plt.scatter(ecommerce_data['sales'], ecommerce_data['predicted_sales'], 
            alpha=0.5, color='blue', label='Daily Sales')
plt.plot([ecommerce_data['sales'].min(), ecommerce_data['sales'].max()], 
         [ecommerce_data['sales'].min(), ecommerce_data['sales'].max()], 
         'r--', label='Perfect Prediction')
plt.xlabel('Actual Sales ๐Ÿ’ฐ')
plt.ylabel('Predicted Sales ๐Ÿ“Š')
plt.title('๐ŸŽฏ Model Performance: Actual vs Predicted Sales')
plt.legend()
plt.show()

# ๐Ÿ’ก Feature importance
print("\n๐Ÿ† Feature Impact on Sales:")
for feature, coef in sales_results.params[1:].items():
    impact = "๐Ÿ“ˆ Positive" if coef > 0 else "๐Ÿ“‰ Negative"
    print(f"{feature}: {coef:.2f} ({impact})")

๐ŸŽฏ Try it yourself: Add seasonal effects or product categories to the model!

๐ŸŽฎ Example 2: A/B Testing for Game Features

Letโ€™s test if a new game feature improves player engagement:

# ๐ŸŽฎ Game A/B test data
np.random.seed(456)

# ๐Ÿ…ฐ๏ธ Control group (old feature)
control_group = pd.DataFrame({
    'group': 'control',
    'player_id': range(1000),
    'daily_playtime': np.random.gamma(2, 30, 1000),  # โฑ๏ธ Minutes played
    'sessions': np.random.poisson(3, 1000),          # ๐ŸŽฏ Daily sessions
    'purchases': np.random.binomial(1, 0.1, 1000),   # ๐Ÿ’Ž Made purchase?
})

# ๐Ÿ…ฑ๏ธ Treatment group (new feature)
treatment_group = pd.DataFrame({
    'group': 'treatment',
    'player_id': range(1000, 2000),
    'daily_playtime': np.random.gamma(2.3, 32, 1000),  # ๐Ÿ“ˆ Slightly higher!
    'sessions': np.random.poisson(3.5, 1000),          # ๐Ÿ“ˆ More sessions!
    'purchases': np.random.binomial(1, 0.15, 1000),    # ๐Ÿ’ฐ More purchases!
})

# ๐Ÿ”„ Combine data
ab_test_data = pd.concat([control_group, treatment_group])

# ๐Ÿ“Š Statistical tests
print("๐ŸŽฏ A/B Test Results:\n")

# ๐Ÿ• Test playtime difference
from statsmodels.stats.weightstats import ttest_ind
t_stat, p_val, _ = ttest_ind(
    treatment_group['daily_playtime'], 
    control_group['daily_playtime']
)
print(f"โฑ๏ธ Playtime Test:")
print(f"   Control: {control_group['daily_playtime'].mean():.1f} min")
print(f"   Treatment: {treatment_group['daily_playtime'].mean():.1f} min")
print(f"   P-value: {p_val:.4f} {'โœ… Significant!' if p_val < 0.05 else 'โŒ Not significant'}")

# ๐Ÿ’Ž Test purchase rate difference
from statsmodels.stats.proportion import proportions_ztest
control_purchases = control_group['purchases'].sum()
treatment_purchases = treatment_group['purchases'].sum()
counts = [treatment_purchases, control_purchases]
nobs = [len(treatment_group), len(control_group)]

z_stat, p_val = proportions_ztest(counts, nobs)
print(f"\n๐Ÿ’Ž Purchase Rate Test:")
print(f"   Control: {control_purchases/len(control_group)*100:.1f}%")
print(f"   Treatment: {treatment_purchases/len(treatment_group)*100:.1f}%")
print(f"   P-value: {p_val:.4f} {'โœ… Significant!' if p_val < 0.05 else 'โŒ Not significant'}")

# ๐Ÿ† Effect size calculation
from statsmodels.stats.power import ttest_power
effect_size = (treatment_group['daily_playtime'].mean() - 
               control_group['daily_playtime'].mean()) / control_group['daily_playtime'].std()
print(f"\n๐Ÿ“ Effect Size: {effect_size:.3f}")
print(f"{'๐Ÿš€ Large effect!' if abs(effect_size) > 0.8 else 'โœจ Medium effect' if abs(effect_size) > 0.5 else '๐Ÿ’ซ Small effect'}")

๐Ÿš€ Advanced Concepts

๐Ÿง™โ€โ™‚๏ธ Time Series Analysis

When youโ€™re ready to level up, try time series modeling:

# ๐ŸŽฏ Advanced time series with ARIMA
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# ๐Ÿ“ˆ Create time series data
dates = pd.date_range('2020-01-01', periods=365*3, freq='D')
trend = np.linspace(100, 200, len(dates))
seasonal = 10 * np.sin(2 * np.pi * np.arange(len(dates)) / 365.25)
noise = np.random.normal(0, 5, len(dates))
sales = trend + seasonal + noise

ts_data = pd.Series(sales, index=dates)

# ๐Ÿ”ฎ Fit ARIMA model
model = ARIMA(ts_data, order=(1, 1, 1))
results = model.fit()

# ๐ŸŽจ Make predictions
forecast = results.forecast(steps=30)
print(f"๐Ÿ”ฎ Next 30 days forecast: {forecast.mean():.2f}")

# ๐Ÿ“Š Plot results
plt.figure(figsize=(12, 6))
plt.plot(ts_data.index, ts_data.values, label='Historical Sales ๐Ÿ“ˆ')
plt.plot(forecast.index, forecast.values, 'r--', label='Forecast ๐Ÿ”ฎ')
plt.fill_between(forecast.index, 
                 forecast.values - 1.96*forecast.values.std(), 
                 forecast.values + 1.96*forecast.values.std(), 
                 alpha=0.3, color='red', label='95% Confidence ๐ŸŽฏ')
plt.legend()
plt.title('๐Ÿš€ Time Series Forecast')
plt.show()

๐Ÿ—๏ธ Logistic Regression for Classification

For the brave data scientists:

# ๐Ÿš€ Customer churn prediction
churn_data = pd.DataFrame({
    'tenure': np.random.uniform(0, 72, 1000),           # ๐Ÿ“… Months as customer
    'monthly_charges': np.random.uniform(20, 100, 1000), # ๐Ÿ’ต Monthly bill
    'total_charges': np.random.uniform(100, 7000, 1000), # ๐Ÿ’ฐ Total spent
    'num_services': np.random.randint(1, 8, 1000),      # ๐Ÿ“ฆ Services subscribed
})

# ๐ŸŽฒ Generate churn based on features
churn_prob = 1 / (1 + np.exp(
    -(-3 + 0.05 * churn_data['tenure'] - 
      0.02 * churn_data['monthly_charges'] + 
      0.3 * churn_data['num_services'])
))
churn_data['churned'] = np.random.binomial(1, churn_prob)

# ๐Ÿ”ฎ Build logistic regression
logit_model = smf.logit(
    'churned ~ tenure + monthly_charges + total_charges + num_services', 
    data=churn_data
)
logit_results = logit_model.fit()

print("๐ŸŽฏ Churn Prediction Model:")
print(logit_results.summary())

# ๐Ÿ’ก Interpret odds ratios
print("\n๐ŸŽฒ Odds Ratios (Impact on Churn):")
odds_ratios = np.exp(logit_results.params)
for feature, odds in odds_ratios[1:].items():
    if odds > 1:
        print(f"{feature}: {odds:.3f} (๐Ÿ“ˆ Increases churn risk)")
    else:
        print(f"{feature}: {odds:.3f} (๐Ÿ“‰ Decreases churn risk)")

โš ๏ธ Common Pitfalls and Solutions

๐Ÿ˜ฑ Pitfall 1: Forgetting to Check Assumptions

# โŒ Wrong way - jumping straight to conclusions!
model = smf.ols('y ~ x', data=df).fit()
print("X causes Y!") # ๐Ÿ’ฅ Whoa there!

# โœ… Correct way - check assumptions first!
model = smf.ols('y ~ x', data=df).fit()

# ๐Ÿ” Check residuals
residuals = model.resid
plt.figure(figsize=(10, 4))

# ๐Ÿ“Š Residual plot
plt.subplot(1, 2, 1)
plt.scatter(model.fittedvalues, residuals, alpha=0.5)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('๐ŸŽฏ Residuals vs Fitted')

# ๐Ÿ“ˆ Q-Q plot
from statsmodels.graphics.gofplots import qqplot
plt.subplot(1, 2, 2)
qqplot(residuals, line='s')
plt.title('๐Ÿ“Š Q-Q Plot')
plt.tight_layout()
plt.show()

# ๐Ÿงช Statistical tests
from statsmodels.stats.diagnostic import het_breuschpagan
_, pval, _, _ = het_breuschpagan(residuals, model.model.exog)
print(f"๐Ÿ”ฌ Heteroscedasticity test p-value: {pval:.4f}")

๐Ÿคฏ Pitfall 2: P-hacking and Multiple Testing

# โŒ Dangerous - testing everything!
for col in df.columns:
    model = smf.ols(f'outcome ~ {col}', data=df).fit()
    if model.pvalues[1] < 0.05:
        print(f"Found significance with {col}!") # ๐Ÿ’ฅ False discoveries!

# โœ… Safe - adjust for multiple comparisons!
from statsmodels.stats.multitest import multipletests

p_values = []
features = []

for col in df.columns:
    if col != 'outcome':
        model = smf.ols(f'outcome ~ {col}', data=df).fit()
        p_values.append(model.pvalues[1])
        features.append(col)

# ๐Ÿ›ก๏ธ Bonferroni correction
rejected, p_adjusted, _, _ = multipletests(p_values, method='bonferroni')

print("๐Ÿ“Š Adjusted Results:")
for feature, p_orig, p_adj, significant in zip(features, p_values, p_adjusted, rejected):
    print(f"{feature}: p={p_orig:.4f} โ†’ adjusted p={p_adj:.4f} {'โœ…' if significant else 'โŒ'}")

๐Ÿ› ๏ธ Best Practices

  1. ๐ŸŽฏ Start Simple: Build basic models before complex ones
  2. ๐Ÿ“Š Visualize First: Plot your data before modeling
  3. ๐Ÿ” Check Assumptions: Validate model assumptions
  4. ๐Ÿ“ˆ Cross-validate: Use holdout data for validation
  5. โœจ Document Results: Keep track of model versions

๐Ÿงช Hands-On Exercise

๐ŸŽฏ Challenge: Build a Health Analytics Dashboard

Create a comprehensive health study analysis:

๐Ÿ“‹ Requirements:

  • โœ… Load health metrics data (weight, exercise, sleep)
  • ๐Ÿท๏ธ Perform hypothesis tests between groups
  • ๐Ÿ‘ค Build regression models for health outcomes
  • ๐Ÿ“… Include demographic factors
  • ๐ŸŽจ Create visualizations of findings!

๐Ÿš€ Bonus Points:

  • Add interaction effects between variables
  • Implement model diagnostics
  • Create a summary report function

๐Ÿ’ก Solution

๐Ÿ” Click to see solution
# ๐ŸŽฏ Health Analytics Solution!
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns

# ๐Ÿฅ Generate health study data
np.random.seed(789)
n_participants = 500

health_data = pd.DataFrame({
    'participant_id': range(n_participants),
    'age': np.random.normal(45, 15, n_participants),
    'gender': np.random.choice(['M', 'F'], n_participants),
    'exercise_hours': np.random.gamma(2, 2, n_participants),     # ๐Ÿƒ Weekly exercise
    'sleep_hours': np.random.normal(7, 1.5, n_participants),     # ๐Ÿ˜ด Daily sleep
    'diet_quality': np.random.randint(1, 11, n_participants),    # ๐Ÿฅ— Diet score 1-10
    'stress_level': np.random.randint(1, 11, n_participants),    # ๐Ÿ˜ฐ Stress 1-10
})

# ๐Ÿ’ช Generate BMI with realistic relationships
health_data['bmi'] = (
    25 + 
    0.1 * health_data['age'] +
    -0.5 * health_data['exercise_hours'] +
    -0.3 * health_data['sleep_hours'] +
    -0.4 * health_data['diet_quality'] +
    0.3 * health_data['stress_level'] +
    2 * (health_data['gender'] == 'M') +
    np.random.normal(0, 2, n_participants)
)

class HealthAnalyzer:
    def __init__(self, data):
        self.data = data
        self.results = {}
    
    def demographic_analysis(self):
        """๐Ÿ“Š Analyze differences by demographics"""
        print("๐Ÿ“Š Demographic Analysis:")
        print("="*50)
        
        # ๐ŸŽฏ Gender comparison
        male_bmi = self.data[self.data['gender'] == 'M']['bmi']
        female_bmi = self.data[self.data['gender'] == 'F']['bmi']
        
        t_stat, p_val, _ = ttest_ind(male_bmi, female_bmi)
        print(f"\n๐Ÿ‘ฅ BMI by Gender:")
        print(f"   Male: {male_bmi.mean():.2f} ยฑ {male_bmi.std():.2f}")
        print(f"   Female: {female_bmi.mean():.2f} ยฑ {female_bmi.std():.2f}")
        print(f"   P-value: {p_val:.4f} {'โœ… Significant' if p_val < 0.05 else 'โŒ Not significant'}")
        
        # ๐ŸŽ‚ Age correlation
        age_corr = self.data[['age', 'bmi', 'exercise_hours', 'sleep_hours']].corr()
        print(f"\n๐ŸŽ‚ Age Correlations:")
        for var in ['bmi', 'exercise_hours', 'sleep_hours']:
            corr = age_corr.loc['age', var]
            print(f"   Age vs {var}: {corr:.3f}")
    
    def build_health_model(self):
        """๐Ÿ”ฎ Build comprehensive health model"""
        print("\n๐Ÿ”ฎ Health Prediction Model:")
        print("="*50)
        
        # ๐Ÿ—๏ธ Full model with interactions
        model = smf.ols('''bmi ~ age + gender + exercise_hours + sleep_hours + 
                        diet_quality + stress_level + 
                        exercise_hours:diet_quality''', data=self.data)
        self.results['model'] = model.fit()
        
        print(self.results['model'].summary())
        
        # ๐Ÿ’ก Feature importance
        print("\n๐Ÿ† Feature Importance:")
        params = self.results['model'].params[1:]  # Skip intercept
        importance = abs(params).sort_values(ascending=False)
        for feature, value in importance.items():
            print(f"   {feature}: {value:.3f}")
        
        return self.results['model']
    
    def diagnostic_plots(self):
        """๐Ÿ“Š Create diagnostic visualizations"""
        model = self.results['model']
        
        fig, axes = plt.subplots(2, 2, figsize=(12, 10))
        
        # ๐ŸŽฏ Residuals vs Fitted
        axes[0, 0].scatter(model.fittedvalues, model.resid, alpha=0.5)
        axes[0, 0].axhline(y=0, color='r', linestyle='--')
        axes[0, 0].set_xlabel('Fitted Values')
        axes[0, 0].set_ylabel('Residuals')
        axes[0, 0].set_title('๐ŸŽฏ Residuals vs Fitted')
        
        # ๐Ÿ“Š Q-Q Plot
        from statsmodels.graphics.gofplots import qqplot
        qqplot(model.resid, line='s', ax=axes[0, 1])
        axes[0, 1].set_title('๐Ÿ“Š Normal Q-Q Plot')
        
        # ๐Ÿ“ˆ Scale-Location
        axes[1, 0].scatter(model.fittedvalues, np.sqrt(np.abs(model.resid)), alpha=0.5)
        axes[1, 0].set_xlabel('Fitted Values')
        axes[1, 0].set_ylabel('โˆš|Residuals|')
        axes[1, 0].set_title('๐Ÿ“ˆ Scale-Location Plot')
        
        # ๐ŸŽจ Actual vs Predicted
        axes[1, 1].scatter(self.data['bmi'], model.fittedvalues, alpha=0.5)
        axes[1, 1].plot([self.data['bmi'].min(), self.data['bmi'].max()],
                       [self.data['bmi'].min(), self.data['bmi'].max()], 'r--')
        axes[1, 1].set_xlabel('Actual BMI')
        axes[1, 1].set_ylabel('Predicted BMI')
        axes[1, 1].set_title('๐ŸŽจ Actual vs Predicted')
        
        plt.tight_layout()
        plt.show()
    
    def generate_report(self):
        """๐Ÿ“‹ Generate summary report"""
        print("\n๐Ÿ“‹ Health Study Summary Report")
        print("="*50)
        print(f"๐Ÿ“Š Participants: {len(self.data)}")
        print(f"๐Ÿ‘ฅ Gender Split: {self.data['gender'].value_counts().to_dict()}")
        print(f"๐ŸŽ‚ Age Range: {self.data['age'].min():.0f} - {self.data['age'].max():.0f}")
        print(f"\n๐Ÿ’ช Key Findings:")
        
        # Extract key coefficients
        model = self.results['model']
        for var in ['exercise_hours', 'sleep_hours', 'diet_quality']:
            coef = model.params[var]
            pval = model.pvalues[var]
            impact = "๐Ÿ“ˆ increases" if coef > 0 else "๐Ÿ“‰ decreases"
            sig = "โœ…" if pval < 0.05 else "โŒ"
            print(f"   โ€ข Each hour of {var} {impact} BMI by {abs(coef):.2f} {sig}")

# ๐ŸŽฎ Run the analysis!
analyzer = HealthAnalyzer(health_data)
analyzer.demographic_analysis()
model = analyzer.build_health_model()
analyzer.diagnostic_plots()
analyzer.generate_report()

# ๐ŸŽŠ Success message
print("\n๐ŸŽ‰ Congratulations! You've completed a full statistical analysis!")

๐ŸŽ“ Key Takeaways

Youโ€™ve learned so much! Hereโ€™s what you can now do:

  • โœ… Perform statistical tests with confidence ๐Ÿ’ช
  • โœ… Build regression models for prediction ๐Ÿ”ฎ
  • โœ… Analyze time series data like a pro ๐Ÿ“ˆ
  • โœ… Check model assumptions properly ๐Ÿ”
  • โœ… Interpret statistical output correctly ๐Ÿ“Š

Remember: Statistics is about understanding relationships in data, not just running tests! ๐Ÿค

๐Ÿค Next Steps

Congratulations! ๐ŸŽ‰ Youโ€™ve mastered statsmodels!

Hereโ€™s what to do next:

  1. ๐Ÿ’ป Practice with the exercises above
  2. ๐Ÿ—๏ธ Apply statsmodels to your own datasets
  3. ๐Ÿ“š Explore advanced models (GLM, GAM, State Space)
  4. ๐ŸŒŸ Combine with scikit-learn for machine learning

Remember: Every data scientist started as a beginner. Keep analyzing, keep learning, and most importantly, have fun with data! ๐Ÿš€


Happy statistical modeling! ๐ŸŽ‰๐Ÿš€โœจ