Influential Points

What Influential Points Analysis Checks

Influential points analysis identifies observations that disproportionately affect the regression results. These points can dramatically change coefficient estimates, model fit, and conclusions if included or excluded from the analysis.

Purpose: Identifies observations that disproportionately affect model results, helping detect data quality issues and outliers.

Why Identifying Influential Points Matters

Understanding influential points helps you:

Ensure Model Robustness: Verify results aren't driven by a few unusual observations

Detect Data Quality Issues: Find data entry errors, anomalies, or special events

Improve Model Reliability: Decide whether to keep, investigate, or exclude unusual observations

Communicate Limitations: Understand which time periods drive your results

A good model should produce stable results that don't change dramatically when individual observations are removed.

Types of Problematic Points

MixModeler identifies three categories of concerning observations:

Outliers:

Observations with large residuals (prediction errors)
Far from the regression line
Threshold: |Standardized residual| > 3

High Leverage Points:

Observations with unusual predictor values
Far from the average of X variables
Threshold: Hat value > 2(p+1)/n, where p = number of predictors, n = sample size

Influential Points:

Observations that significantly affect the regression
Combine high leverage with large residuals
Threshold: Cook's Distance > 4/n
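
As a rough illustration of how the three thresholds above could be applied together, the sketch below takes precomputed diagnostics and returns the flagged observation indices for each category. The function name classify_points and all input values are invented for this example; it is not MixModeler's implementation.

```python
import numpy as np

def classify_points(std_resid, hat, cooks_d, p):
    """Return flagged observation indices per category.

    p is the number of predictors (excluding the intercept).
    """
    std_resid = np.asarray(std_resid, dtype=float)
    hat = np.asarray(hat, dtype=float)
    cooks_d = np.asarray(cooks_d, dtype=float)
    n = std_resid.size
    return {
        "outlier": np.flatnonzero(np.abs(std_resid) > 3),
        "high_leverage": np.flatnonzero(hat > 2 * (p + 1) / n),
        "influential": np.flatnonzero(cooks_d > 4 / n),
    }

# Invented diagnostics for n = 20 observations and p = 3 predictors.
n, p = 20, 3
std_resid = np.full(n, 0.5)
hat = np.full(n, 0.18)
cooks_d = np.full(n, 0.014)

std_resid[3], cooks_d[3] = 3.4, 0.63                 # large residual, ordinary leverage
std_resid[8], hat[8], cooks_d[8] = -1.3, 0.58, 0.58  # unusual predictors, moderate residual

flags = classify_points(std_resid, hat, cooks_d, p)
print({name: idx.tolist() for name, idx in flags.items()})
```

With n = 20 and p = 3, the leverage cutoff is 0.4 and the Cook's Distance cutoff is 0.2. Observation 3 is flagged as an outlier and as influential, while observation 8 is flagged for high leverage and as influential.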

Diagnostic Metrics

Standardized Residuals

Definition: Residuals divided by their standard deviation

Purpose: Measures how far each observation is from the fitted regression line

Interpretation:

|Std. Residual| ≤ 2: Normal observation
2 < |Std. Residual| ≤ 3: Potential outlier
|Std. Residual| > 3: Definite outlier
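
A minimal NumPy sketch of this calculation follows. It assumes "their standard deviation" means the internally studentized form, where each residual is divided by s·√(1 − hᵢᵢ). MixModeler's exact scaling is not stated here, so treat that choice, the function name, and the synthetic data as assumptions.

```python
import numpy as np

def standardized_residuals(X, y):
    """Internally studentized residuals for an OLS fit with an intercept."""
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])            # add intercept column
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)    # OLS coefficients
    resid = y - Xd @ beta
    # Hat values: diagonal of Xd (Xd'Xd)^-1 Xd'
    h = np.einsum("ij,ji->i", Xd, np.linalg.pinv(Xd))
    s = np.sqrt(resid @ resid / (n - Xd.shape[1]))   # residual standard error
    return resid / (s * np.sqrt(1.0 - h))

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=60)
y[10] += 6.0                                         # inject one large prediction error

r = standardized_residuals(X, y)
outliers = np.flatnonzero(np.abs(r) > 3)
print("Flagged as outliers:", outliers.tolist())
```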

Hat Values (Leverage)

Definition: Diagonal elements of the hat matrix, which measure how far an observation's predictor values lie from the average in X-space

Purpose: Identifies observations with unusual predictor combinations

Interpretation:

Hat value > 2(p+1)/n: High leverage point
High leverage alone isn't problematic unless combined with large residuals
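
The hat values and the 2(p+1)/n cutoff can be sketched in a few lines of NumPy. The function name and the synthetic predictors are assumptions for illustration; the model is assumed to include an intercept, which is why the hat values sum to p+1.

```python
import numpy as np

def hat_values(X):
    """Diagonal of the hat matrix for a design with an intercept column."""
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])
    return np.einsum("ij,ji->i", Xd, np.linalg.pinv(Xd))

# Synthetic predictors for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
X[7] = [6.0, -6.0, 6.0]                 # unusual predictor combination

h = hat_values(X)
n, p = X.shape
threshold = 2 * (p + 1) / n             # twice the average hat value
high_leverage = np.flatnonzero(h > threshold)
print("Threshold:", threshold, "High leverage:", high_leverage.tolist())
```

Note that leverage depends only on the predictors, not on the response, which is why a high leverage point is not a problem by itself.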

Cook's Distance

Definition: Measures overall influence combining leverage and residuals

Purpose: Identifies observations that change the regression when removed

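Formula: Combines the standardized residual and leverage: Dᵢ = (rᵢ² / (p+1)) × (hᵢᵢ / (1 − hᵢᵢ)), where rᵢ is the standardized residual, hᵢᵢ is the hat value, and p+1 is the number of estimated coefficients (p predictors plus the intercept)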

Interpretation:

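Cook's D < 0.5: Little influence
0.5 ≤ Cook's D ≤ 1: Moderate influence
Cook's D > 1: High influence (investigate)
Cook's D > 4/n: Rule-of-thumb flagging threshold; it scales with sample size and is much stricter than the fixed cutoffs above for any realistic n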
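
The sketch below computes Cook's Distance from standardized residuals and hat values, then cross-checks one observation against the deletion definition (refit without that observation and measure how much all fitted values move). It assumes an intercept, so the divisor is the number of estimated coefficients, p+1. Function names and data are invented for illustration.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's D from standardized residuals and hat values."""
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])
    k = Xd.shape[1]                                   # p + 1 coefficients
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    h = np.einsum("ij,ji->i", Xd, np.linalg.pinv(Xd))
    s2 = resid @ resid / (n - k)
    r = resid / np.sqrt(s2 * (1.0 - h))               # standardized residuals
    return (r**2 / k) * (h / (1.0 - h))

def cooks_by_deletion(X, y, i):
    """Same quantity from its definition: refit without observation i."""
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])
    k = Xd.shape[1]
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    s2 = resid @ resid / (n - k)
    keep = np.arange(n) != i
    beta_i, *_ = np.linalg.lstsq(Xd[keep], y[keep], rcond=None)
    shift = Xd @ (beta - beta_i)                      # change in all fitted values
    return shift @ shift / (k * s2)

# Synthetic data for illustration only.
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 2))
y = 3.0 + X @ np.array([1.5, 0.5]) + rng.normal(scale=0.4, size=40)
X[5] = [4.0, 4.0]                                     # high leverage
y[5] = -5.0                                           # far from the fitted plane

d = cooks_distance(X, y)
flagged = np.flatnonzero(d > 4 / len(y))
print("Flagged:", flagged.tolist())
print("Formula vs deletion for obs 5:", d[5], cooks_by_deletion(X, y, 5))
```

The two calculations agree, which is why the closed-form version is preferred in practice: it avoids refitting the model once per observation.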

Visual Diagnostics

MixModeler provides several plots to identify influential points:

Cook's Distance Plot:

Bar chart showing Cook's D for each observation
Horizontal line at threshold (4/n)
Good: All bars below threshold
Problem: Bars extending above threshold

Leverage vs Residuals Plot:

Scatter plot with leverage on x-axis, standardized residuals on y-axis
Shows Cook's Distance contours
Good: Points clustered in center
Problem: Points in upper right or lower right (high leverage + large residual)

Influential Observations Table:

Lists top influential points with all metrics
Sortable by Cook's D, leverage, or residual
Shows observation index, date, and classification type

Interpreting Test Results

Passed Tests (✓)

What it means: Few or no influential points detected

Implications:

Model results are robust
No single observation drives the results
Data quality appears good
Model is stable across observations

Action: No action needed - model is robust to individual observations

Failed Tests (⚠)

What it means: Multiple influential points, outliers, or high leverage observations detected

Implications:

Results may change if influential points are removed
Potential data quality issues
Model may not generalize well
Coefficient estimates may be unstable

Common Causes:

Data entry errors or measurement mistakes
Special events (promotions, crises, holidays)
Structural breaks in relationships
Natural extreme values in business data
Missing variables that explain extreme values

What to Do When Points Are Identified

When influential points are detected, follow this decision process:

1. Investigate the Observation

Check data quality:

Verify data entry accuracy
Look for measurement errors
Confirm unusual values are real

Understand business context:

Was there a special event that period?
Promotion, crisis, launch, or external shock?
Does it represent normal business variation?

2. Determine Appropriate Action

For data errors:

Correct if the error can be fixed
Remove if beyond repair
Document the correction or removal

For special events:

Consider adding a dummy variable for the event
This preserves the observation while accounting for its uniqueness
Example: Super_Bowl_Week, COVID_Period, BlackFriday
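
To make the dummy-variable option concrete, here is a small sketch on synthetic weekly data. The week index, spend levels, lift size, and the black_friday column are all invented; the point is only to show the mechanics of adding a 0/1 column and refitting.

```python
import numpy as np

def fit_ols(X, y):
    """OLS with an intercept; returns (coefficients, residuals)."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return beta, y - Xd @ beta

# Synthetic data: true TV effect is 2.0 per unit of spend.
rng = np.random.default_rng(3)
n_weeks = 52
tv = rng.uniform(10, 20, size=n_weeks)
tv[47] = 40.0                                    # heavy spend in the event week
sales = 100 + 2.0 * tv + rng.normal(scale=2.0, size=n_weeks)
sales[47] += 80.0                                # event lift not explained by TV

# Dummy variable: 1 in the event week, 0 otherwise.
black_friday = np.zeros(n_weeks)
black_friday[47] = 1.0

beta_plain, resid_plain = fit_ols(tv, sales)
beta_dummy, resid_dummy = fit_ols(np.column_stack([tv, black_friday]), sales)

print("TV coefficient without dummy:", round(beta_plain[1], 2))
print("TV coefficient with dummy:   ", round(beta_dummy[1], 2))
print("Estimated event lift:        ", round(beta_dummy[2], 1))
```

A dummy that covers a single period fits that period exactly (its residual becomes zero), so the other coefficients are estimated as if that week were excluded. A dummy for a recurring event, covering several periods, keeps more information in the fit.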

For legitimate extreme values:

Keep in the dataset (reflects real business variation)
Report sensitivity analysis showing results with/without
Acknowledge in interpretation

For structural breaks:

Consider splitting analysis into before/after periods
Add trend break variables
Model the regime change explicitly
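
One possible way to build trend break variables is sketched below, assuming the break date is already known. The break week, variable names, and synthetic series are hypothetical; a level-shift column captures a one-time jump and a trend-break column captures a change in slope after the break.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
t = np.arange(n, dtype=float)
break_at = 60                                             # hypothetical break week

level_shift = (t >= break_at).astype(float)               # 0 before, 1 after
trend_break = np.where(t >= break_at, t - break_at, 0.0)  # extra slope after the break

# Synthetic series with a known break, for illustration only.
y = (10 + 0.5 * t + 8.0 * level_shift + 0.3 * trend_break
     + rng.normal(scale=0.5, size=n))

Xd = np.column_stack([np.ones(n), t, level_shift, trend_break])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
names = ["intercept", "trend", "level_shift", "trend_break"]
print(dict(zip(names, beta.round(2))))
```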

3. Test Model Stability

Run sensitivity analysis:

Fit model with and without influential points
Compare coefficient estimates
Check if business conclusions change

If results are stable:

Keep all observations
Note influential points in documentation

If results change dramatically:

Investigate why (missing variables, wrong specification)
Consider robust regression methods
Report both sets of results
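
The with/without comparison can be sketched as follows. The helper names, the synthetic data, and the choice to report change as a percentage of the full-sample coefficient are assumptions for illustration.

```python
import numpy as np

def fit_ols(X, y):
    """OLS coefficients, intercept first."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return beta

def sensitivity(X, y, drop_idx):
    """Fit with all rows and without drop_idx; return both fits and % change."""
    keep = np.setdiff1d(np.arange(len(y)), drop_idx)
    full = fit_ols(X, y)
    reduced = fit_ols(X[keep], y[keep])
    # Percent change relative to the full-sample estimate
    # (undefined if a full-sample coefficient is exactly zero).
    pct_change = 100.0 * (reduced - full) / np.abs(full)
    return full, reduced, pct_change

# Synthetic data: columns stand in for standardized tv and radio variables.
rng = np.random.default_rng(5)
X = rng.normal(size=(60, 2))
X[12] = [5.0, 0.0]                                # unusual spend combination
y = 50 + X @ np.array([3.0, 1.0]) + rng.normal(scale=1.0, size=60)
y[12] -= 20.0                                     # response far below the pattern

full, reduced, pct = sensitivity(X, y, drop_idx=[12])
for name, f, r_, c in zip(["intercept", "tv", "radio"], full, reduced, pct):
    print(f"{name:9s} full={f:7.3f} without={r_:7.3f} change={c:6.1f}%")
```

If the percentage changes are small and the business conclusions are unchanged, the influential points can stay in the dataset with a note in the documentation.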

Practical Guidelines

When to Remove Observations:

Clear data entry errors
Impossible values
Duplicate records
Extreme outliers from known data quality issues

When to Keep Observations:

Real business variation (even if extreme)
Important events that may recur
Legitimately unusual periods
When removal would bias the sample

When to Add Control Variables:

Special events that can be modeled
Seasonal anomalies
Temporary shocks
Regime changes

Number of Influential Points:

1-3 points in 100+ observations: Generally acceptable
5-10% of observations: Investigate carefully
More than 10% of observations: Serious data or model issues

Example Interpretation

Scenario 1 - Passed:

2 observations with Cook's D > threshold
Both are holiday weeks with promotions
Removing them doesn't change coefficients meaningfully

Interpretation: Minor influential points related to known promotional events. Model is robust. Adding holiday dummy variables is worth considering but not strictly necessary.

Scenario 2 - Moderate Issues:

5 high leverage points (unusual spend combinations)
3 outliers (large residuals)
2 highly influential points that change the TV coefficient by 30%

Interpretation: Some influential observations warrant investigation. Check the 2 highly influential points for data quality. If valid, add dummy variables for those periods or report sensitivity analysis. Model results should be interpreted cautiously.

Scenario 3 - Severe Issues:

15% of observations flagged as influential
Removing the top 3 changes multiple coefficients by more than 50%
Several impossible values detected

Interpretation: Serious data quality and model specification issues. Clean data thoroughly, check for systematic errors, and reconsider model specification before using for business decisions.

Marketing Mix Modeling Context

In MMM, influential points often represent:

Promotional Events: Black Friday, Cyber Monday, Holiday seasons with extreme lift

Product Launches: Periods with unusual marketing mix and amplified response

Competitive Actions: Periods when competitors did something unusual

External Shocks: COVID, economic crises, supply chain disruptions

Media Tests: Periods with experimental spend levels

Best Practices for MMM:

Document all influential periods with business explanations
Add dummy variables for recurring special events
Use hold-out validation to test model stability
Report attribution with and without extreme periods
Be transparent about model limitations

Related Diagnostics

After identifying influential points:

Check Residual Normality, as outliers can cause non-normality
Review Heteroscedasticity, as outliers can create variance patterns
Examine Actual vs Predicted to visualize which periods fit poorly
Check raw data quality in the Data Upload section
