Prerequisites
- Basic understanding of programming concepts
- Python installation (3.8+)
- VS Code or preferred IDE
What you'll learn
- Understand the concept fundamentals
- Apply the concept in real projects
- Debug common issues
- Write clean, Pythonic code
Introduction
Welcome to this exciting tutorial on Statsmodels! In this guide, we'll explore how to perform powerful statistical modeling and analysis in Python using the statsmodels library.
You'll discover how statsmodels can transform your data analysis workflow. Whether you're building predictive models, conducting hypothesis tests, or exploring relationships in your data, understanding statsmodels is essential for data scientists and analysts.
By the end of this tutorial, you'll feel confident using statsmodels in your own projects. Let's dive in!
Understanding Statsmodels
What is Statsmodels?
Statsmodels is like having a complete statistics laboratory in Python! Think of it as your personal statistical advisor that helps you understand relationships in data, test hypotheses, and build predictive models.
In Python terms, statsmodels provides:
- Comprehensive statistical tests and models
- Regression analysis (linear, logistic, and more)
- Time series analysis tools
- Statistical summaries and diagnostics
Why Use Statsmodels?
Here's why data scientists love statsmodels:
- Rich Statistical Output: Detailed model summaries similar to R's summary() output
- Academic Rigor: Based on established statistical methods
- Diagnostic Tools: Check model assumptions easily
- Formula API: R-like formula syntax for specifying models
Real-world example: Imagine analyzing sales data. With statsmodels, you can build models to understand what factors drive sales and make predictions!
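Statsmodels actually exposes two interfaces for the same models: an array-based one and the formula one. A minimal sketch of the difference (the toy data here is purely illustrative):
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({'x': np.arange(10.0)})
df['y'] = 2 * df['x'] + np.random.normal(0, 1, 10)

# Array interface: you add the intercept column yourself
fit_arrays = sm.OLS(df['y'], sm.add_constant(df['x'])).fit()

# Formula interface: the intercept is implicit, R-style
fit_formula = smf.ols('y ~ x', data=df).fit()
Both fits give the same coefficients; the formula interface just reads more like the model you have in mind.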
Basic Syntax and Usage
Simple Linear Regression
Let's start with a friendly example:
# Hello, Statsmodels!
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

# Create some sample data
np.random.seed(42)
data = pd.DataFrame({
    'hours_studied': np.random.uniform(1, 10, 100),  # Study hours
    'coffee_cups': np.random.randint(0, 5, 100),     # Coffee consumed
})

# Generate test scores with some relationship
data['test_score'] = (
    50 + 5 * data['hours_studied'] +
    2 * data['coffee_cups'] +
    np.random.normal(0, 5, 100)  # Add some randomness
)

# Build our first model!
model = smf.ols('test_score ~ hours_studied + coffee_cups', data=data)
results = model.fit()

# View the summary!
print(results.summary())
Explanation: We're predicting test scores based on study hours and coffee consumption! The formula syntax makes the model specification super readable.
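The summary table is the headline, but the fitted results object also exposes each number directly. A quick sketch, continuing from the results object above:
print(results.params)                        # coefficient estimates
print(results.pvalues)                       # per-coefficient p-values
print(results.conf_int())                    # 95% confidence intervals
print(f"R-squared: {results.rsquared:.3f}")  # goodness of fit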
Common Statistical Tests
Here are tests you'll use daily:
# T-test example
from statsmodels.stats.weightstats import ttest_ind

# Compare two groups
group_a = np.random.normal(100, 15, 50)  # Group A scores
group_b = np.random.normal(105, 15, 50)  # Group B scores

# Perform t-test
t_stat, p_value, df = ttest_ind(group_a, group_b)
print(f"T-statistic: {t_stat:.3f}")
print(f"P-value: {p_value:.3f}")
print(f"Significant? {'Yes!' if p_value < 0.05 else 'No'}")
# ANOVA example
import statsmodels.stats.anova as anova

# Fitness data across three training programs
data = pd.DataFrame({
    'fitness_score': np.concatenate([
        np.random.normal(70, 10, 30),  # Program A
        np.random.normal(75, 10, 30),  # Program B
        np.random.normal(80, 10, 30)   # Program C
    ]),
    'program': ['A']*30 + ['B']*30 + ['C']*30
})

# One-way ANOVA
model = smf.ols('fitness_score ~ program', data=data).fit()
anova_table = anova.anova_lm(model)
print("\nANOVA Results:")
print(anova_table)
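A significant ANOVA only says the group means differ somewhere; it doesn't tell you which pairs differ. One common follow-up, sketched here on the same data frame, is Tukey's HSD post-hoc test:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

tukey = pairwise_tukeyhsd(endog=data['fitness_score'],
                          groups=data['program'], alpha=0.05)
print(tukey.summary())  # one row per pair of programs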
Practical Examples
Example 1: E-commerce Sales Analysis
Let's analyze what drives online sales:
# Create e-commerce dataset
np.random.seed(123)
n_days = 365
ecommerce_data = pd.DataFrame({
    'date': pd.date_range('2023-01-01', periods=n_days),
    'ad_spend': np.random.uniform(100, 1000, n_days),          # Daily ad spend
    'email_sent': np.random.randint(1000, 5000, n_days),       # Emails sent
    'website_visits': np.random.randint(5000, 20000, n_days),  # Daily visits
    'is_weekend': np.tile([0,0,0,0,0,1,1], n_days//7 + 1)[:n_days],  # Weekend flag
})
# Generate sales with realistic relationships
ecommerce_data['sales'] = (
    2000 +                                    # Base sales
    0.5 * ecommerce_data['ad_spend'] +        # Ad impact
    0.02 * ecommerce_data['email_sent'] +     # Email impact
    0.1 * ecommerce_data['website_visits'] +  # Traffic impact
    -500 * ecommerce_data['is_weekend'] +     # Weekend effect
    np.random.normal(0, 500, n_days)          # Random variation
)
# Build the model
sales_model = smf.ols(
    'sales ~ ad_spend + email_sent + website_visits + is_weekend',
    data=ecommerce_data
)
sales_results = sales_model.fit()

# Interpret results
print("Sales Analysis Results:")
print(sales_results.summary())

# Visualize actual vs predicted
ecommerce_data['predicted_sales'] = sales_results.predict()
plt.figure(figsize=(10, 6))
plt.scatter(ecommerce_data['sales'], ecommerce_data['predicted_sales'],
            alpha=0.5, color='blue', label='Daily Sales')
plt.plot([ecommerce_data['sales'].min(), ecommerce_data['sales'].max()],
         [ecommerce_data['sales'].min(), ecommerce_data['sales'].max()],
         'r--', label='Perfect Prediction')
plt.xlabel('Actual Sales')
plt.ylabel('Predicted Sales')
plt.title('Model Performance: Actual vs Predicted Sales')
plt.legend()
plt.show()

# Feature importance
print("\nFeature Impact on Sales:")
for feature, coef in sales_results.params.iloc[1:].items():  # skip the intercept
    impact = "Positive" if coef > 0 else "Negative"
    print(f"{feature}: {coef:.2f} ({impact})")
Try it yourself: Add seasonal effects or product categories to the model! One possible starting point is sketched below.
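A hedged sketch of the seasonal-effects idea: derive a month column from the date and let the formula API dummy-code it with C(). This is one assumption about how seasonality might enter the model, not the only option:
ecommerce_data['month'] = ecommerce_data['date'].dt.month
seasonal_model = smf.ols(
    'sales ~ ad_spend + email_sent + website_visits + is_weekend + C(month)',
    data=ecommerce_data
).fit()
print(seasonal_model.summary())  # one dummy coefficient per month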
Example 2: A/B Testing for Game Features
Let's test if a new game feature improves player engagement:
# Game A/B test data
np.random.seed(456)

# Control group (old feature)
control_group = pd.DataFrame({
    'group': 'control',
    'player_id': range(1000),
    'daily_playtime': np.random.gamma(2, 30, 1000),   # Minutes played
    'sessions': np.random.poisson(3, 1000),           # Daily sessions
    'purchases': np.random.binomial(1, 0.1, 1000),    # Made purchase?
})

# Treatment group (new feature)
treatment_group = pd.DataFrame({
    'group': 'treatment',
    'player_id': range(1000, 2000),
    'daily_playtime': np.random.gamma(2.3, 32, 1000),  # Slightly higher!
    'sessions': np.random.poisson(3.5, 1000),          # More sessions!
    'purchases': np.random.binomial(1, 0.15, 1000),    # More purchases!
})

# Combine data
ab_test_data = pd.concat([control_group, treatment_group])
# Statistical tests
print("A/B Test Results:\n")

# Test playtime difference
from statsmodels.stats.weightstats import ttest_ind
t_stat, p_val, _ = ttest_ind(
    treatment_group['daily_playtime'],
    control_group['daily_playtime']
)
print("Playtime Test:")
print(f"  Control: {control_group['daily_playtime'].mean():.1f} min")
print(f"  Treatment: {treatment_group['daily_playtime'].mean():.1f} min")
print(f"  P-value: {p_val:.4f} {'Significant!' if p_val < 0.05 else 'Not significant'}")
# Test purchase rate difference
from statsmodels.stats.proportion import proportions_ztest
control_purchases = control_group['purchases'].sum()
treatment_purchases = treatment_group['purchases'].sum()
counts = [treatment_purchases, control_purchases]
nobs = [len(treatment_group), len(control_group)]
z_stat, p_val = proportions_ztest(counts, nobs)
print("\nPurchase Rate Test:")
print(f"  Control: {control_purchases/len(control_group)*100:.1f}%")
print(f"  Treatment: {treatment_purchases/len(treatment_group)*100:.1f}%")
print(f"  P-value: {p_val:.4f} {'Significant!' if p_val < 0.05 else 'Not significant'}")
# Effect size (Cohen's d, using the control group's standard deviation)
effect_size = (treatment_group['daily_playtime'].mean() -
               control_group['daily_playtime'].mean()) / control_group['daily_playtime'].std()
print(f"\nEffect Size: {effect_size:.3f}")
print('Large effect!' if abs(effect_size) > 0.8
      else 'Medium effect' if abs(effect_size) > 0.5 else 'Small effect')
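The inverse question is also worth asking before running the experiment: how many players per group would you need to detect a given effect? A sketch with statsmodels' power tools (the effect_size=0.2 target and 80% power are illustrative choices, not recommendations):
from statsmodels.stats.power import TTestIndPower

needed_n = TTestIndPower().solve_power(effect_size=0.2, alpha=0.05, power=0.8)
print(f"Players needed per group for a small effect: {needed_n:.0f}")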
Advanced Concepts
Time Series Analysis
When you're ready to level up, try time series modeling:
# Advanced time series with ARIMA
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Create time series data
dates = pd.date_range('2020-01-01', periods=365*3, freq='D')
trend = np.linspace(100, 200, len(dates))
seasonal = 10 * np.sin(2 * np.pi * np.arange(len(dates)) / 365.25)
noise = np.random.normal(0, 5, len(dates))
sales = trend + seasonal + noise
ts_data = pd.Series(sales, index=dates)

# Fit ARIMA model
model = ARIMA(ts_data, order=(1, 1, 1))
results = model.fit()

# Make predictions
forecast = results.forecast(steps=30)
print(f"Next 30 days forecast (mean): {forecast.mean():.2f}")
# Plot results with a model-based 95% interval
# (get_forecast gives per-step confidence bounds; 1.96 * the std of the
# point forecasts would not be a valid forecast interval)
conf = results.get_forecast(steps=30).conf_int()
plt.figure(figsize=(12, 6))
plt.plot(ts_data.index, ts_data.values, label='Historical Sales')
plt.plot(forecast.index, forecast.values, 'r--', label='Forecast')
plt.fill_between(conf.index, conf.iloc[:, 0], conf.iloc[:, 1],
                 alpha=0.3, color='red', label='95% Confidence')
plt.legend()
plt.title('Time Series Forecast')
plt.show()
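The (1, 1, 1) order above was picked by hand. The plot_acf and plot_pacf functions imported earlier are the usual tools for informing that choice; a short sketch on the differenced series:
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(ts_data.diff().dropna(), lags=40, ax=axes[0])   # informs the MA order q
plot_pacf(ts_data.diff().dropna(), lags=40, ax=axes[1])  # informs the AR order p
plt.show()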
Logistic Regression for Classification
For the brave data scientists:
# Customer churn prediction
churn_data = pd.DataFrame({
    'tenure': np.random.uniform(0, 72, 1000),             # Months as customer
    'monthly_charges': np.random.uniform(20, 100, 1000),  # Monthly bill
    'total_charges': np.random.uniform(100, 7000, 1000),  # Total spent
    'num_services': np.random.randint(1, 8, 1000),        # Services subscribed
})
# Generate churn based on features
churn_prob = 1 / (1 + np.exp(
    -(-3 + 0.05 * churn_data['tenure'] -
      0.02 * churn_data['monthly_charges'] +
      0.3 * churn_data['num_services'])
))
churn_data['churned'] = np.random.binomial(1, churn_prob)

# Build logistic regression
logit_model = smf.logit(
    'churned ~ tenure + monthly_charges + total_charges + num_services',
    data=churn_data
)
logit_results = logit_model.fit()
print("Churn Prediction Model:")
print(logit_results.summary())
# Interpret odds ratios
print("\nOdds Ratios (Impact on Churn):")
odds_ratios = np.exp(logit_results.params)
for feature, odds in odds_ratios.iloc[1:].items():  # skip the intercept
    if odds > 1:
        print(f"{feature}: {odds:.3f} (Increases churn risk)")
    else:
        print(f"{feature}: {odds:.3f} (Decreases churn risk)")
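With the formula API, the fitted Logit results predict probabilities directly from a data frame with the same columns. A sketch of turning those into class labels (the 0.5 cutoff is an arbitrary illustrative choice, and this is in-sample accuracy, so treat it optimistically):
churn_data['churn_prob'] = logit_results.predict(churn_data)  # P(churned = 1)
churn_data['predicted'] = (churn_data['churn_prob'] > 0.5).astype(int)
accuracy = (churn_data['predicted'] == churn_data['churned']).mean()
print(f"In-sample accuracy: {accuracy:.1%}")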
Common Pitfalls and Solutions
Pitfall 1: Forgetting to Check Assumptions
# Wrong way - jumping straight to conclusions!
model = smf.ols('y ~ x', data=df).fit()
print("X causes Y!")  # Whoa there!

# Correct way - check assumptions first!
model = smf.ols('y ~ x', data=df).fit()

# Check residuals
residuals = model.resid
plt.figure(figsize=(10, 4))

# Residual plot
plt.subplot(1, 2, 1)
plt.scatter(model.fittedvalues, residuals, alpha=0.5)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Fitted')

# Q-Q plot (pass ax, otherwise qqplot opens its own figure)
from statsmodels.graphics.gofplots import qqplot
plt.subplot(1, 2, 2)
qqplot(residuals, line='s', ax=plt.gca())
plt.title('Q-Q Plot')
plt.tight_layout()
plt.show()

# Statistical tests
from statsmodels.stats.diagnostic import het_breuschpagan
_, pval, _, _ = het_breuschpagan(residuals, model.model.exog)
print(f"Heteroscedasticity test p-value: {pval:.4f}")
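If the Breusch-Pagan test flags heteroscedasticity, one common remedy is refitting with heteroscedasticity-robust standard errors rather than abandoning the model. A sketch with the same toy formula (cov_type='HC3' is one of several robust options statsmodels accepts):
robust_results = smf.ols('y ~ x', data=df).fit(cov_type='HC3')
print(robust_results.summary())  # same coefficients, corrected standard errors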
Pitfall 2: P-hacking and Multiple Testing
# Dangerous - testing everything!
for col in df.columns:
    model = smf.ols(f'outcome ~ {col}', data=df).fit()
    if model.pvalues.iloc[1] < 0.05:
        print(f"Found significance with {col}!")  # False discoveries!

# Safe - adjust for multiple comparisons!
from statsmodels.stats.multitest import multipletests
p_values = []
features = []
for col in df.columns:
    if col != 'outcome':
        model = smf.ols(f'outcome ~ {col}', data=df).fit()
        p_values.append(model.pvalues.iloc[1])
        features.append(col)

# Bonferroni correction
rejected, p_adjusted, _, _ = multipletests(p_values, method='bonferroni')
print("Adjusted Results:")
for feature, p_orig, p_adj, significant in zip(features, p_values, p_adjusted, rejected):
    print(f"{feature}: p={p_orig:.4f} -> adjusted p={p_adj:.4f} "
          f"{'significant' if significant else 'not significant'}")
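Bonferroni is the most conservative correction; if controlling the false discovery rate is enough for your use case, Benjamini-Hochberg is a common alternative via the same function:
rejected_fdr, p_fdr, _, _ = multipletests(p_values, method='fdr_bh')  # FDR control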
Best Practices
- Start Simple: Build basic models before complex ones
- Visualize First: Plot your data before modeling
- Check Assumptions: Validate model assumptions
- Cross-validate: Use holdout data for validation (see the sketch below)
- Document Results: Keep track of model versions
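A minimal holdout sketch for the cross-validation point, assuming a generic data frame df with outcome y and predictor x (the names and the 80/20 split are placeholders):
train = df.sample(frac=0.8, random_state=0)  # 80% for fitting
test = df.drop(train.index)                  # 20% held out
fit = smf.ols('y ~ x', data=train).fit()
rmse = np.sqrt(((test['y'] - fit.predict(test)) ** 2).mean())
print(f"Holdout RMSE: {rmse:.2f}")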
Hands-On Exercise
Challenge: Build a Health Analytics Dashboard
Create a comprehensive health study analysis:
Requirements:
- Load health metrics data (weight, exercise, sleep)
- Perform hypothesis tests between groups
- Build regression models for health outcomes
- Include demographic factors
- Create visualizations of findings!
Bonus Points:
- Add interaction effects between variables
- Implement model diagnostics
- Create a summary report function
Solution
# Health Analytics Solution!
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
from statsmodels.stats.weightstats import ttest_ind  # used in the demographic analysis

# Generate health study data
np.random.seed(789)
n_participants = 500
health_data = pd.DataFrame({
    'participant_id': range(n_participants),
    'age': np.random.normal(45, 15, n_participants),
    'gender': np.random.choice(['M', 'F'], n_participants),
    'exercise_hours': np.random.gamma(2, 2, n_participants),   # Weekly exercise
    'sleep_hours': np.random.normal(7, 1.5, n_participants),   # Daily sleep
    'diet_quality': np.random.randint(1, 11, n_participants),  # Diet score 1-10
    'stress_level': np.random.randint(1, 11, n_participants),  # Stress 1-10
})
# Generate BMI with realistic relationships
health_data['bmi'] = (
    25 +
    0.1 * health_data['age'] +
    -0.5 * health_data['exercise_hours'] +
    -0.3 * health_data['sleep_hours'] +
    -0.4 * health_data['diet_quality'] +
    0.3 * health_data['stress_level'] +
    2 * (health_data['gender'] == 'M') +
    np.random.normal(0, 2, n_participants)
)
class HealthAnalyzer:
    def __init__(self, data):
        self.data = data
        self.results = {}

    def demographic_analysis(self):
        """Analyze differences by demographics"""
        print("Demographic Analysis:")
        print("=" * 50)

        # Gender comparison
        male_bmi = self.data[self.data['gender'] == 'M']['bmi']
        female_bmi = self.data[self.data['gender'] == 'F']['bmi']
        t_stat, p_val, _ = ttest_ind(male_bmi, female_bmi)
        print("\nBMI by Gender:")
        print(f"  Male: {male_bmi.mean():.2f} ± {male_bmi.std():.2f}")
        print(f"  Female: {female_bmi.mean():.2f} ± {female_bmi.std():.2f}")
        print(f"  P-value: {p_val:.4f} {'Significant' if p_val < 0.05 else 'Not significant'}")

        # Age correlation
        age_corr = self.data[['age', 'bmi', 'exercise_hours', 'sleep_hours']].corr()
        print("\nAge Correlations:")
        for var in ['bmi', 'exercise_hours', 'sleep_hours']:
            corr = age_corr.loc['age', var]
            print(f"  Age vs {var}: {corr:.3f}")
    def build_health_model(self):
        """Build comprehensive health model"""
        print("\nHealth Prediction Model:")
        print("=" * 50)

        # Full model with an interaction term
        model = smf.ols('''bmi ~ age + gender + exercise_hours + sleep_hours +
                           diet_quality + stress_level +
                           exercise_hours:diet_quality''', data=self.data)
        self.results['model'] = model.fit()
        print(self.results['model'].summary())

        # Feature importance (by absolute coefficient size)
        print("\nFeature Importance:")
        params = self.results['model'].params.iloc[1:]  # Skip intercept
        importance = abs(params).sort_values(ascending=False)
        for feature, value in importance.items():
            print(f"  {feature}: {value:.3f}")
        return self.results['model']
    def diagnostic_plots(self):
        """Create diagnostic visualizations"""
        model = self.results['model']
        fig, axes = plt.subplots(2, 2, figsize=(12, 10))

        # Residuals vs Fitted
        axes[0, 0].scatter(model.fittedvalues, model.resid, alpha=0.5)
        axes[0, 0].axhline(y=0, color='r', linestyle='--')
        axes[0, 0].set_xlabel('Fitted Values')
        axes[0, 0].set_ylabel('Residuals')
        axes[0, 0].set_title('Residuals vs Fitted')

        # Q-Q Plot
        from statsmodels.graphics.gofplots import qqplot
        qqplot(model.resid, line='s', ax=axes[0, 1])
        axes[0, 1].set_title('Normal Q-Q Plot')

        # Scale-Location
        axes[1, 0].scatter(model.fittedvalues, np.sqrt(np.abs(model.resid)), alpha=0.5)
        axes[1, 0].set_xlabel('Fitted Values')
        axes[1, 0].set_ylabel('√|Residuals|')
        axes[1, 0].set_title('Scale-Location Plot')

        # Actual vs Predicted
        axes[1, 1].scatter(self.data['bmi'], model.fittedvalues, alpha=0.5)
        axes[1, 1].plot([self.data['bmi'].min(), self.data['bmi'].max()],
                        [self.data['bmi'].min(), self.data['bmi'].max()], 'r--')
        axes[1, 1].set_xlabel('Actual BMI')
        axes[1, 1].set_ylabel('Predicted BMI')
        axes[1, 1].set_title('Actual vs Predicted')
        plt.tight_layout()
        plt.show()
    def generate_report(self):
        """Generate summary report"""
        print("\nHealth Study Summary Report")
        print("=" * 50)
        print(f"Participants: {len(self.data)}")
        print(f"Gender Split: {self.data['gender'].value_counts().to_dict()}")
        print(f"Age Range: {self.data['age'].min():.0f} - {self.data['age'].max():.0f}")
        print("\nKey Findings:")

        # Extract key coefficients
        model = self.results['model']
        for var in ['exercise_hours', 'sleep_hours', 'diet_quality']:
            coef = model.params[var]
            pval = model.pvalues[var]
            impact = "increases" if coef > 0 else "decreases"
            sig = "significant" if pval < 0.05 else "not significant"
            print(f"  - Each unit increase in {var} {impact} BMI by {abs(coef):.2f} ({sig})")
# Run the analysis!
analyzer = HealthAnalyzer(health_data)
analyzer.demographic_analysis()
model = analyzer.build_health_model()
analyzer.diagnostic_plots()
analyzer.generate_report()

# Success message
print("\nCongratulations! You've completed a full statistical analysis!")
Key Takeaways
You've learned so much! Here's what you can now do:
- Perform statistical tests with confidence
- Build regression models for prediction
- Analyze time series data like a pro
- Check model assumptions properly
- Interpret statistical output correctly
Remember: Statistics is about understanding relationships in data, not just running tests!
Next Steps
Congratulations! You've mastered the statsmodels essentials!
Here's what to do next:
- Practice with the exercises above
- Apply statsmodels to your own datasets
- Explore advanced models (GLM, GAM, State Space)
- Combine with scikit-learn for machine learning
Remember: Every data scientist started as a beginner. Keep analyzing, keep learning, and most importantly, have fun with data!
Happy statistical modeling!