Influential Points
What Influential Points Analysis Checks
Influential points analysis identifies observations that disproportionately affect the regression results. These points can dramatically change coefficient estimates, model fit, and conclusions if included or excluded from the analysis.
Purpose: Identifies observations that disproportionately affect model results, helping detect data quality issues and outliers.
Why Identifying Influential Points Matters
Understanding influential points helps you:
Ensure Model Robustness: Verify results aren't driven by a few unusual observations
Detect Data Quality Issues: Find data entry errors, anomalies, or special events
Improve Model Reliability: Decide whether to keep, investigate, or exclude unusual observations
Communicate Limitations: Understand which time periods drive your results
A good model should have stable results that don't change dramatically when removing individual observations.
Types of Problematic Points
MixModeler identifies three categories of concerning observations:
Outliers:
Observations with large residuals (prediction errors)
Far from the regression line
Threshold: |Standardized residual| > 3
High Leverage Points:
Observations with unusual predictor values
Far from the average of X variables
Threshold: Hat value > 2(p+1)/n, where p = number of predictors, n = sample size
Influential Points:
Observations that significantly affect the regression
Combine high leverage with large residuals
Threshold: Cook's Distance > 4/n
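The three thresholds above can be applied together as a simple rule. The following is a minimal sketch, not MixModeler's internal code: the function name `classify_observation` and the example numbers (104 weekly observations, 5 predictors) are hypothetical and chosen only for illustration.

```python
def classify_observation(std_resid, hat_value, cooks_d, n, p):
    """Return the flags one observation earns under the thresholds above.

    n = sample size, p = number of predictors (intercept not counted).
    """
    flags = []
    if abs(std_resid) > 3:
        flags.append("outlier")
    if hat_value > 2 * (p + 1) / n:
        flags.append("high_leverage")
    if cooks_d > 4 / n:
        flags.append("influential")
    return flags


# Hypothetical example: two years of weekly data, five predictors.
n, p = 104, 5
print(round(2 * (p + 1) / n, 4))  # leverage threshold -> 0.1154
print(round(4 / n, 4))            # Cook's D threshold -> 0.0385

print(classify_observation(3.4, 0.05, 0.02, n, p))  # ['outlier']
print(classify_observation(1.1, 0.20, 0.09, n, p))  # ['high_leverage', 'influential']
```

An observation can carry more than one flag, which is why the sketch returns a list rather than a single label.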
Diagnostic Metrics
Standardized Residuals
Definition: Residuals divided by their estimated standard deviation
Purpose: Measures how far each observation is from the fitted regression line
Interpretation:
|Std. Residual| < 2: Normal observation
2 < |Std. Residual| < 3: Potential outlier
|Std. Residual| > 3: Definite outlier
Hat Values (Leverage)
Definition: Diagonal elements hᵢᵢ of the hat matrix H = X(XᵀX)⁻¹Xᵀ, measuring how far an observation's predictor values lie from their average in X-space
Purpose: Identifies observations with unusual predictor combinations
Interpretation:
Hat value > 2(p+1)/n: High leverage point
High leverage alone isn't problematic unless combined with large residuals
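To make the leverage calculation concrete, here is a small NumPy sketch that computes hat values and applies the 2(p+1)/n threshold. It assumes an ordinary least squares fit with an intercept. The helper name `hat_values` and the synthetic data are illustrative, not part of MixModeler.

```python
import numpy as np


def hat_values(X):
    """Leverage h_ii for a design matrix X that already includes the intercept column."""
    # With the thin QR factorization X = QR, the hat matrix is H = Q Q^T,
    # so each h_ii is the squared norm of the matching row of Q.
    Q, _ = np.linalg.qr(X)
    return np.sum(Q**2, axis=1)


# Synthetic data: 50 observations, 2 predictors, one unusual predictor combination.
rng = np.random.default_rng(0)
n, p = 50, 2
predictors = rng.normal(size=(n, p))
predictors[0] = [8.0, -8.0]
X = np.column_stack([np.ones(n), predictors])

h = hat_values(X)
threshold = 2 * (p + 1) / n
high_leverage = np.flatnonzero(h > threshold)
print(threshold, high_leverage)
```

The hat values always sum to the number of estimated coefficients (p + 1 here), so the threshold 2(p+1)/n amounts to "twice the average leverage".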
Cook's Distance
Definition: Measures overall influence combining leverage and residuals
Purpose: Identifies observations that change the regression when removed
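Formula: Combines standardized residual and leverage: Dᵢ = (rᵢ² / (p+1)) × (hᵢᵢ / (1-hᵢᵢ)), where rᵢ is the standardized residual, hᵢᵢ is the hat value, and p+1 is the number of estimated coefficients (p predictors plus the intercept)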
Interpretation:
Cook's D < 0.5: Little influence
0.5 < Cook's D < 1: Moderate influence
Cook's D > 1: High influence (investigate)
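Cook's D > 4/n: Rule-of-thumb threshold used to flag influential points; for typical sample sizes it is much stricter than the 0.5 and 1 cutoffs (4/n = 0.04 at n = 100)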
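The sketch below shows one way to compute all three metrics for an OLS fit with NumPy. It assumes the standardized residual is the internally studentized one (residual divided by s·√(1 − hᵢᵢ)) and divides by the number of estimated coefficients, which is p + 1 when p counts predictors only and the model has an intercept. The function name `influence_metrics` and the synthetic data are hypothetical; how MixModeler computes these values internally is not documented here.

```python
import numpy as np


def influence_metrics(X, y):
    """Return (standardized residuals, hat values, Cook's D) for an OLS fit.

    X must already contain the intercept column, so X.shape[1] == p + 1.
    """
    n, k = X.shape                        # k = p + 1 estimated coefficients
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)              # leverage h_ii
    s2 = resid @ resid / (n - k)          # residual variance estimate
    r = resid / np.sqrt(s2 * (1 - h))     # internally studentized residuals
    cooks_d = (r**2 / k) * (h / (1 - h))
    return r, h, cooks_d


# Synthetic data: a clean linear relationship plus one planted influential point.
rng = np.random.default_rng(1)
n = 60
x = rng.normal(size=n)
y = 2 + 3 * x + rng.normal(scale=0.5, size=n)
x[0], y[0] = 6.0, 10.0                    # unusual x value and far below the line
X = np.column_stack([np.ones(n), x])

r, h, cooks_d = influence_metrics(X, y)
flagged = np.flatnonzero(cooks_d > 4 / n)
print(flagged, cooks_d[0])
```

The planted point combines high leverage with a large residual, so it should dominate the Cook's D ranking. A point with only one of the two properties would score much lower.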
Visual Diagnostics
MixModeler provides several plots to identify influential points:
Cook's Distance Plot:
Bar chart showing Cook's D for each observation
Horizontal line at threshold (4/n)
Good: All bars below threshold
Problem: Bars extending above threshold
Leverage vs Residuals Plot:
Scatter plot with leverage on x-axis, standardized residuals on y-axis
Shows Cook's Distance contours
Good: Points clustered in center
Problem: Points in upper right or lower right (high leverage + large residual)
Influential Observations Table:
Lists top influential points with all metrics
Sortable by Cook's D, leverage, or residual
Shows observation index, date, and classification type
Interpreting Test Results
Passed Tests (✓)
What it means: Few or no influential points detected
Implications:
Model results are robust
No single observation drives the results
Data quality appears good
Model is stable across observations
Action: No action needed - model is robust to individual observations
Failed Tests (⚠)
What it means: Multiple influential points, outliers, or high leverage observations detected
Implications:
Results may change if influential points are removed
Potential data quality issues
Model may not generalize well
Coefficient estimates may be unstable
Common Causes:
Data entry errors or measurement mistakes
Special events (promotions, crises, holidays)
Structural breaks in relationships
Natural extreme values in business data
Missing variables that explain extreme values
What to Do When Points Are Identified
When influential points are detected, follow this decision process:
1. Investigate the Observation
Check data quality:
Verify data entry accuracy
Look for measurement errors
Confirm unusual values are real
Understand business context:
Was there a special event that period?
Promotion, crisis, launch, or external shock?
Does it represent normal business variation?
2. Determine Appropriate Action
For data errors:
Correct the value if the error can be fixed
Remove the observation if it cannot be repaired
Document the correction or removal
For special events:
Consider adding a dummy variable for the event
This preserves the observation while accounting for its uniqueness
Example: Super_Bowl_Week, COVID_Period, BlackFriday
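As an illustration of what such a dummy variable looks like in the data, the sketch below builds a 0/1 column from a list of week start dates. The helper `event_dummy` and the dates are hypothetical. This only shows the shape of the variable and is not a description of how variables are created inside MixModeler.

```python
from datetime import date


def event_dummy(week_starts, event_weeks):
    """Return a 0/1 list marking the weeks that belong to a special event."""
    event_weeks = set(event_weeks)
    return [1 if week in event_weeks else 0 for week in week_starts]


# Four illustrative weekly periods; the third one contains Black Friday 2023.
weeks = [date(2023, 11, 6), date(2023, 11, 13), date(2023, 11, 20), date(2023, 11, 27)]
black_friday = event_dummy(weeks, [date(2023, 11, 20)])
print(black_friday)  # [0, 0, 1, 0]
```

Including this column as a predictor lets the model absorb the event's lift in its own coefficient, so the unusual week no longer distorts the media coefficients.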
For legitimate extreme values:
Keep in the dataset (reflects real business variation)
Report sensitivity analysis showing results with/without
Acknowledge in interpretation
For structural breaks:
Consider splitting analysis into before/after periods
Add trend break variables
Model the regime change explicitly
3. Test Model Stability
Run sensitivity analysis:
Fit model with and without influential points
Compare coefficient estimates
Check if business conclusions change
If results are stable:
Keep all observations
Note influential points in documentation
If results change dramatically:
Investigate why (missing variables, wrong specification)
Consider robust regression methods
Report both sets of results
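The with/without comparison described above can be sketched in a few lines of NumPy. This is an illustrative example under simple assumptions (plain OLS, synthetic data, one planted influential point). The function `coefficient_sensitivity` is a hypothetical helper, not a MixModeler feature.

```python
import numpy as np


def coefficient_sensitivity(X, y, flagged):
    """Fit OLS with and without the flagged rows; return both fits and the % change."""
    keep = np.ones(len(y), dtype=bool)
    keep[list(flagged)] = False
    beta_all, *_ = np.linalg.lstsq(X, y, rcond=None)
    beta_trim, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    # Percent change relative to the full-sample estimate
    # (undefined if a full-sample coefficient is exactly zero).
    pct_change = 100 * (beta_trim - beta_all) / np.abs(beta_all)
    return beta_all, beta_trim, pct_change


# Synthetic data: true intercept 2 and slope 3, with one planted influential point.
rng = np.random.default_rng(1)
n = 60
x = rng.normal(size=n)
y = 2 + 3 * x + rng.normal(scale=0.5, size=n)
x[0], y[0] = 6.0, 10.0
X = np.column_stack([np.ones(n), x])

beta_all, beta_trim, pct_change = coefficient_sensitivity(X, y, flagged=[0])
for name, a, t, c in zip(["intercept", "slope"], beta_all, beta_trim, pct_change):
    print(f"{name}: all={a:.3f} without={t:.3f} change={c:+.1f}%")
```

If the percent changes are small and the business conclusions hold, keeping all observations is reasonable. Large shifts, like the slope in this synthetic case, are the signal to investigate and report both sets of results.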
Practical Guidelines
When to Remove Observations:
Clear data entry errors
Impossible values
Duplicate records
Extreme outliers from known data quality issues
When to Keep Observations:
Real business variation (even if extreme)
Important events that may recur
Legitimately unusual periods
When removal would bias sample
When to Add Control Variables:
Special events that can be modeled
Seasonal anomalies
Temporary shocks
Regime changes
Number of Influential Points:
1-3 points in 100+ observations: Generally acceptable
5-10% of observations: Investigate carefully
>10% of observations: Serious data or model issues
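These bands can be turned into a quick check on the share of flagged observations. The sketch below makes two assumptions that the guidelines do not state explicitly: the last band is read as "more than 10%", and any share below 5% is treated like the "1-3 points" case. The function name `influence_severity` is hypothetical.

```python
def influence_severity(n_flagged, n_obs):
    """Map the share of flagged observations to the guideline bands above."""
    share = n_flagged / n_obs
    if share > 0.10:
        return "serious data or model issues"
    if share >= 0.05:
        return "investigate carefully"
    # Assumption: shares below 5% are treated as generally acceptable.
    return "generally acceptable"


print(influence_severity(2, 104))   # generally acceptable
print(influence_severity(8, 104))   # investigate carefully
print(influence_severity(16, 104))  # serious data or model issues
```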
Example Interpretation
Scenario 1 - Passed:
2 observations with Cook's D > threshold
Both are holiday weeks with promotions
Removing them doesn't change coefficients meaningfully
Interpretation: Minor influential points related to known promotional events. Model is robust. Adding holiday dummy variables is worth considering but not strictly necessary.
Scenario 2 - Moderate Issues:
5 high leverage points (unusual spend combinations)
3 outliers (large residuals)
2 highly influential points that change TV coefficient by 30%
Interpretation: Some influential observations warrant investigation. Check the 2 highly influential points for data quality. If valid, add dummy variables for those periods or report sensitivity analysis. Model results should be interpreted cautiously.
Scenario 3 - Severe Issues:
15% of observations flagged as influential
Removing top 3 changes multiple coefficients by >50%
Several impossible values detected
Interpretation: Serious data quality and model specification issues. Clean data thoroughly, check for systematic errors, and reconsider model specification before using for business decisions.
Marketing Mix Modeling Context
In MMM, influential points often represent:
Promotional Events: Black Friday, Cyber Monday, Holiday seasons with extreme lift
Product Launches: Periods with unusual marketing mix and amplified response
Competitive Actions: Periods when competitors did something unusual
External Shocks: COVID, economic crises, supply chain disruptions
Media Tests: Periods with experimental spend levels
Best Practices for MMM:
Document all influential periods with business explanations
Add dummy variables for recurring special events
Use hold-out validation to test model stability
Report attribution with and without extreme periods
Be transparent about model limitations
Related Diagnostics
After identifying influential points:
Check Residual Normality as outliers can cause non-normality
Review Heteroscedasticity as outliers can create variance patterns
Examine Actual vs Predicted to visualize which periods fit poorly
Check raw data quality in Data Upload section