Influential Points
What Influential Points Analysis Checks
Influential points analysis identifies observations that disproportionately affect the regression results. These points can dramatically change coefficient estimates, model fit, and conclusions if included or excluded from the analysis.
Purpose: Identifies observations that disproportionately affect model results, helping detect data quality issues and outliers.
Why Identifying Influential Points Matters
Understanding influential points helps you:
Ensure Model Robustness: Verify results aren't driven by a few unusual observations
Detect Data Quality Issues: Find data entry errors, anomalies, or special events
Improve Model Reliability: Decide whether to keep, investigate, or exclude unusual observations
Communicate Limitations: Understand which time periods drive your results
A good model should produce stable results that do not change dramatically when individual observations are removed.
Types of Problematic Points
MixModeler identifies three categories of concerning observations:
Outliers:
- Observations with large residuals (prediction errors) 
- Far from the regression line 
- Threshold: |Standardized residual| > 3 
High Leverage Points:
- Observations with unusual predictor values 
- Far from the average of X variables 
- Threshold: Hat value > 2(p+1)/n, where p = number of predictors, n = sample size 
Influential Points:
- Observations that significantly affect the regression 
- Combine high leverage with large residuals 
- Threshold: Cook's Distance > 4/n 
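As a rough illustration of how the three thresholds above could be applied, the sketch below flags observations from precomputed diagnostics. The function name `classify_observations` and the toy values are assumptions made for this example and are not part of MixModeler.

```python
import numpy as np

def classify_observations(std_resid, leverage, cooks_d, n_predictors):
    """Return boolean masks for the three categories, using the listed thresholds."""
    std_resid = np.asarray(std_resid, dtype=float)
    leverage = np.asarray(leverage, dtype=float)
    cooks_d = np.asarray(cooks_d, dtype=float)
    n = std_resid.size

    return {
        "outlier": np.abs(std_resid) > 3.0,                        # |std. residual| > 3
        "high_leverage": leverage > 2.0 * (n_predictors + 1) / n,  # hat > 2(p+1)/n
        "influential": cooks_d > 4.0 / n,                          # Cook's D > 4/n
    }

# Toy diagnostics for 8 observations and 1 predictor (illustrative, not real data).
std_resid = [0.2, -0.5, 3.4, 0.1, -1.1, 0.7, 2.6, -0.3]
leverage = [0.10, 0.12, 0.15, 0.11, 0.09, 0.13, 0.70, 0.60]
cooks_d = [0.00, 0.02, 0.90, 0.00, 0.05, 0.03, 1.20, 0.04]

flags = classify_observations(std_resid, leverage, cooks_d, n_predictors=1)
for name, mask in flags.items():
    print(name, np.flatnonzero(mask).tolist())
```

An observation can fall into more than one category: in the toy data, observation 6 has high leverage and is also influential, while observation 7 has high leverage but a small residual and is not influential.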
Diagnostic Metrics
Standardized Residuals
Definition: Residuals divided by their standard deviation
Purpose: Measures how far each observation is from the fitted regression line
Interpretation:
- |Std. Residual| < 2: Normal observation 
- 2 < |Std. Residual| < 3: Potential outlier 
- |Std. Residual| > 3: Definite outlier 
Hat Values (Leverage)
Definition: Diagonal elements of the hat matrix measuring distance in X-space
Purpose: Identifies observations with unusual predictor combinations
Interpretation:
- Hat value > 2(p+1)/n: High leverage point 
- High leverage is not a problem by itself; it becomes a concern when combined with a large residual
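The following is a minimal sketch of how hat values can be computed with NumPy, assuming an ordinary least squares model with an intercept. It takes the diagonal of the hat matrix from a thin QR decomposition rather than forming the full n × n matrix. The helper name `hat_values` and the simulated data are assumptions for illustration only.

```python
import numpy as np

def hat_values(X):
    """Leverage h_ii for predictors X (n rows, p columns, no intercept column)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])   # add the intercept column
    Q, _ = np.linalg.qr(design)                 # thin QR: design = Q R
    return np.sum(Q**2, axis=1)                 # row sums of Q^2 = diag(Q Q^T)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                    # simulated predictors
X[0] = [8.0, -8.0]                              # one deliberately unusual combination

h = hat_values(X)
n, p = X.shape
threshold = 2 * (p + 1) / n

print("sum of hat values:", round(float(h.sum()), 6))  # equals p + 1 for a full-rank design
print("high leverage rows:", np.flatnonzero(h > threshold).tolist())
```

Because the hat values always sum to p + 1, their average is (p+1)/n, so the 2(p+1)/n threshold flags observations with more than twice the average leverage.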
Cook's Distance
Definition: Measures overall influence combining leverage and residuals
Purpose: Identifies observations that change the regression when removed
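Formula: Combines the standardized residual rᵢ and the leverage hᵢᵢ: Dᵢ = (rᵢ² / (p+1)) × (hᵢᵢ / (1-hᵢᵢ)), where p+1 is the number of estimated coefficients (p predictors plus the intercept)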
Interpretation:
- Cook's D < 0.5: Little influence 
- 0.5 < Cook's D < 1: Moderate influence 
- Cook's D > 1: High influence (investigate) 
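- Cook's D > 4/n: Rule-of-thumb threshold that scales with sample size; this is the threshold used to flag points and the line drawn on the Cook's Distance plot, and for typical sample sizes it is much stricter than the fixed cutoffs above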
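To make the definition concrete, here is a hedged sketch that computes Cook's Distance from standardized residuals and leverage, then checks one value against the deletion definition (refit without the observation and measure how far all fitted values move). It assumes OLS with an intercept, internally studentized residuals, and a divisor equal to the number of estimated coefficients (predictors plus intercept). The helper names and simulated data are assumptions for this example and do not describe MixModeler's internals.

```python
import numpy as np

def ols_influence(X, y):
    """Standardized residuals, leverage, and Cook's D for OLS with an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])
    k = design.shape[1]                          # estimated coefficients = p + 1

    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    Q, _ = np.linalg.qr(design)
    h = np.sum(Q**2, axis=1)                     # leverage h_ii

    s2 = resid @ resid / (n - k)                 # residual variance estimate
    r = resid / np.sqrt(s2 * (1.0 - h))          # internally studentized residuals
    cooks = (r**2 / k) * (h / (1.0 - h))
    return {"beta": beta, "std_resid": r, "leverage": h,
            "cooks_d": cooks, "s2": s2, "design": design}

def cooks_by_deletion(X, y, i):
    """Cook's D for row i from its definition: refit the model without row i."""
    full = ols_influence(X, y)
    design = full["design"]
    keep = np.arange(len(y)) != i
    beta_i, *_ = np.linalg.lstsq(design[keep], np.asarray(y)[keep], rcond=None)
    shift = design @ (full["beta"] - beta_i)     # change in every fitted value
    return shift @ shift / (design.shape[1] * full["s2"])

rng = np.random.default_rng(42)
n = 40
X = rng.normal(size=(n, 2))                      # simulated predictors
X[5] = [4.0, 4.0]                                # unusual predictor values
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 0.5, size=n)
y[5] += 8.0                                      # and a large error on the same row

out = ols_influence(X, y)
i = int(np.argmax(out["cooks_d"]))
print("most influential row:", i)
print("Cook's D (formula): ", out["cooks_d"][i])
print("Cook's D (deletion):", cooks_by_deletion(X, y, i))
print("4/n threshold:      ", 4 / n)
```

The two Cook's D values agree up to floating-point error, which is why the closed-form expression can stand in for refitting the model once per observation.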
Visual Diagnostics
MixModeler provides several plots to identify influential points:
Cook's Distance Plot:
- Bar chart showing Cook's D for each observation 
- Horizontal line at threshold (4/n) 
- Good: All bars below threshold 
- Problem: Bars extending above threshold 
Leverage vs Residuals Plot:
- Scatter plot with leverage on x-axis, standardized residuals on y-axis 
- Shows Cook's Distance contours 
- Good: Points clustered in center 
- Problem: Points in upper right or lower right (high leverage + large residual) 
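The Cook's Distance contours on this plot follow directly from the Cook's D formula: for a fixed level D and leverage h, the contour sits at |r| = sqrt(D × k × (1-h) / h), where k is the number of estimated coefficients. The sketch below tabulates a few contour points. It assumes k = p + 1 (predictors plus intercept), and the sample size and predictor count are hypothetical.

```python
import math

def contour_residual(level, leverage, n_coefficients):
    """|standardized residual| at which Cook's D equals `level` for a given leverage."""
    return math.sqrt(level * n_coefficients * (1.0 - leverage) / leverage)

n, p = 104, 4        # hypothetical: two years of weekly data, four predictors
k = p + 1            # coefficients including the intercept

for h in (0.05, 0.10, 0.20, 0.40):
    r_rule = contour_residual(4 / n, h, k)
    r_half = contour_residual(0.5, h, k)
    print(f"h={h:.2f}  |r| at D=4/n: {r_rule:.2f}   |r| at D=0.5: {r_half:.2f}")
```

The contour moves toward zero as leverage grows, so a high-leverage observation needs only a modest residual to cross a given Cook's D level. That is why the right-hand side of the plot deserves the closest look.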
Influential Observations Table:
- Lists top influential points with all metrics 
- Sortable by Cook's D, leverage, or residual 
- Shows observation index, date, and classification type 
Interpreting Test Results
Passed Tests (✓)
What it means: Few or no influential points detected
Implications:
- Model results are robust 
- No single observation drives the results 
- Data quality appears good 
- Model is stable across observations 
Action: No action needed - model is robust to individual observations
Failed Tests (⚠)
What it means: Multiple influential points, outliers, or high leverage observations detected
Implications:
- Results may change if influential points are removed 
- Potential data quality issues 
- Model may not generalize well 
- Coefficient estimates may be unstable 
Common Causes:
- Data entry errors or measurement mistakes 
- Special events (promotions, crises, holidays) 
- Structural breaks in relationships 
- Natural extreme values in business data 
- Missing variables that explain extreme values 
What to Do When Points Are Identified
When influential points are detected, follow this decision process:
1. Investigate the Observation
Check data quality:
- Verify data entry accuracy 
- Look for measurement errors 
- Confirm unusual values are real 
Understand business context:
- Was there a special event that period? 
- Promotion, crisis, launch, or external shock? 
- Does it represent normal business variation? 
2. Determine Appropriate Action
For data errors:
- ✓ Correct if error can be fixed 
- ✓ Remove if beyond repair 
- Document the correction 
For special events:
- Consider adding a dummy variable for the event 
- This preserves the observation while accounting for its uniqueness 
- Example: Super_Bowl_Week, COVID_Period, BlackFriday 
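A small simulated example may help show what an event dummy does. The sketch below assumes a single-week event and plain OLS; the variable names (`tv_spend`, `black_friday`, `event_week`) and the data are invented for illustration. A dummy that equals 1 in exactly one period leaves the other coefficients the same as if that period had been removed, while keeping the row in the dataset and giving an explicit estimate of the event lift.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 60
tv_spend = rng.uniform(10, 50, size=n)             # hypothetical weekly TV spend
sales = 100 + 3.0 * tv_spend + rng.normal(0, 5, size=n)
event_week = 20                                    # hypothetical Black Friday week
sales[event_week] += 120                           # one-off promotional lift

def fit(design, y):
    """Ordinary least squares coefficients for a given design matrix."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

ones = np.ones(n)
black_friday = np.zeros(n)
black_friday[event_week] = 1.0                     # 1 in the event week, 0 elsewhere

base = fit(np.column_stack([ones, tv_spend]), sales)
with_dummy = fit(np.column_stack([ones, tv_spend, black_friday]), sales)

keep = np.arange(n) != event_week
dropped = fit(np.column_stack([ones[keep], tv_spend[keep]]), sales[keep])

print("TV coefficient, no dummy:    ", round(float(base[1]), 3))
print("TV coefficient, with dummy:  ", round(float(with_dummy[1]), 3))
print("TV coefficient, week removed:", round(float(dropped[1]), 3))
print("Estimated event lift:        ", round(float(with_dummy[2]), 1))
```

For a recurring event, the same dummy would be 1 in several periods, and the equivalence with deletion no longer holds: the model then estimates one shared lift across all occurrences.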
For legitimate extreme values:
- Keep in the dataset (reflects real business variation) 
- Report sensitivity analysis showing results with/without 
- Acknowledge in interpretation 
For structural breaks:
- Consider splitting analysis into before/after periods 
- Add trend break variables 
- Model the regime change explicitly 
3. Test Model Stability
Run sensitivity analysis:
- Fit model with and without influential points 
- Compare coefficient estimates 
- Check if business conclusions change 
If results are stable:
- Keep all observations 
- Note influential points in documentation 
If results change dramatically:
- Investigate why (missing variables, wrong specification) 
- Consider robust regression methods 
- Report both sets of results 
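The sensitivity analysis described in this step can be sketched as a with/without comparison of coefficients. The example below assumes plain OLS and a list of row indices that have already been flagged; `coefficient_sensitivity`, the variable names, and the simulated data are hypothetical and only illustrate the idea.

```python
import numpy as np

def fit_ols(X, y):
    """OLS coefficients (intercept first) for predictors X and response y."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

def coefficient_sensitivity(X, y, flagged, names):
    """Compare coefficients fitted with and without the flagged rows."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = np.ones(len(y), dtype=bool)
    keep[list(flagged)] = False

    full = fit_ols(X, y)
    reduced = fit_ols(X[keep], y[keep])
    pct = 100.0 * (reduced - full) / np.abs(full)   # change relative to the full fit

    labels = ["intercept", *names]
    return {
        name: {"full": float(f), "reduced": float(r), "pct_change": float(c)}
        for name, f, r, c in zip(labels, full, reduced, pct)
    }

rng = np.random.default_rng(7)
n = 52
tv = rng.uniform(10, 50, size=n)                    # simulated weekly spend
radio = rng.uniform(5, 20, size=n)
flagged = [12, 30]                                  # rows assumed to be flagged already
tv[flagged] = 80.0                                  # unusual spend level
sales = 10 + 2.0 * tv + 1.0 * radio + rng.normal(0, 1, size=n)
sales[flagged] += 150                               # and unusually high sales

report = coefficient_sensitivity(np.column_stack([tv, radio]), sales, flagged, ["tv", "radio"])
for name, row in report.items():
    print(f"{name:9s} full={row['full']:8.3f}  reduced={row['reduced']:8.3f}  "
          f"change={row['pct_change']:+7.1f}%")
```

A large percentage change on a coefficient that drives business conclusions (here the simulated TV effect) is the signal to investigate further and to report both sets of results.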
Practical Guidelines
When to Remove Observations:
- Clear data entry errors 
- Impossible values 
- Duplicate records 
- Extreme outliers from known data quality issues 
When to Keep Observations:
- Real business variation (even if extreme) 
- Important events that may recur 
- Legitimately unusual periods 
- When removal would bias sample 
When to Add Control Variables:
- Special events that can be modeled 
- Seasonal anomalies 
- Temporary shocks 
- Regime changes 
Number of Influential Points:
- 1-3 points in 100+ observations: Generally acceptable 
- 5-10% of observations: Investigate carefully 
- More than 10% of observations: Serious data or model issues
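These bands can be turned into a simple check on the share of flagged observations. The sketch below is one possible reading of the guideline, not a MixModeler rule: it treats the last band as more than 10%, and it returns a separate "review" label for cases the guideline does not cover (for example, 4 flagged points out of 104). The function name and labels are assumptions.

```python
def flagged_share_severity(n_flagged, n_obs):
    """Map the number of flagged observations to a rough severity label."""
    share = n_flagged / n_obs
    if share > 0.10:
        return "serious"          # more than 10% of observations
    if share >= 0.05:
        return "investigate"      # 5-10% of observations
    if n_flagged <= 3 and n_obs >= 100:
        return "acceptable"       # 1-3 points in 100+ observations
    return "review"               # assumption: cases not covered by the guideline

for n_flagged in (2, 4, 8, 16):
    label = flagged_share_severity(n_flagged, 104)
    print(f"{n_flagged:2d} of 104 flagged ({n_flagged / 104:.1%}): {label}")
```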
Example Interpretation
Scenario 1 - Passed:
- 2 observations with Cook's D > threshold 
- Both are holiday weeks with promotions 
- Removing them doesn't change coefficients meaningfully 
Interpretation: Minor influential points related to known promotional events. Model is robust. Adding holiday dummy variables is worth considering but not strictly necessary.
Scenario 2 - Moderate Issues:
- 5 high leverage points (unusual spend combinations) 
- 3 outliers (large residuals) 
- 2 highly influential points that change TV coefficient by 30% 
Interpretation: Some influential observations warrant investigation. Check the 2 highly influential points for data quality. If valid, add dummy variables for those periods or report sensitivity analysis. Model results should be interpreted cautiously.
Scenario 3 - Severe Issues:
- 15% of observations flagged as influential 
- Removing top 3 changes multiple coefficients by >50% 
- Several impossible values detected 
Interpretation: Serious data quality and model specification issues. Clean data thoroughly, check for systematic errors, and reconsider model specification before using for business decisions.
Marketing Mix Modeling Context
In MMM, influential points often represent:
Promotional Events: Black Friday, Cyber Monday, Holiday seasons with extreme lift
Product Launches: Periods with unusual marketing mix and amplified response
Competitive Actions: Periods when competitors did something unusual
External Shocks: COVID, economic crises, supply chain disruptions
Media Tests: Periods with experimental spend levels
Best Practices for MMM:
- Document all influential periods with business explanations 
- Add dummy variables for recurring special events 
- Use hold-out validation to test model stability 
- Report attribution with and without extreme periods 
- Be transparent about model limitations 
Related Diagnostics
After identifying influential points:
- Check Residual Normality as outliers can cause non-normality 
- Review Heteroscedasticity as outliers can create variance patterns 
- Examine Actual vs Predicted to visualize which periods fit poorly 
- Check raw data quality in Data Upload section 