Influential Points

What Influential Points Analysis Checks

Influential points analysis identifies observations that disproportionately affect the regression results. These points can dramatically change coefficient estimates, model fit, and conclusions if included or excluded from the analysis.

Purpose: Identifies observations that disproportionately affect model results, helping detect data quality issues and outliers.

Why Identifying Influential Points Matters

Understanding influential points helps you:

Ensure Model Robustness: Verify results aren't driven by a few unusual observations

Detect Data Quality Issues: Find data entry errors, anomalies, or special events

Improve Model Reliability: Decide whether to keep, investigate, or exclude unusual observations

Communicate Limitations: Understand which time periods drive your results

A good model should produce stable results that don't change dramatically when individual observations are removed.

Types of Problematic Points

MixModeler identifies three categories of concerning observations:

Outliers:

  • Observations with large residuals (prediction errors)

  • Far from the regression line

  • Threshold: |Standardized residual| > 3

High Leverage Points:

  • Observations with unusual predictor values

  • Far from the average of X variables

  • Threshold: Hat value > 2(p+1)/n, where p = number of predictors, n = sample size

Influential Points:

  • Observations that significantly affect the regression

  • Combine high leverage with large residuals

  • Threshold: Cook's Distance > 4/n
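
The three thresholds above can be applied to one observation at a time. The following is a minimal sketch, not MixModeler's implementation: the function name `classify_observation` and the input values are hypothetical, and the diagnostics are assumed to have been computed already.

```python
def classify_observation(std_resid, leverage, cooks_d, n, p):
    """Return the flags raised for one observation.

    n is the sample size and p the number of predictors (intercept excluded).
    The cutoffs are the rule-of-thumb thresholds listed in this section.
    """
    flags = []
    if abs(std_resid) > 3:
        flags.append("outlier")
    if leverage > 2 * (p + 1) / n:
        flags.append("high_leverage")
    if cooks_d > 4 / n:
        flags.append("influential")
    return flags


# Hypothetical diagnostics for a model with n = 104 weeks and p = 5 predictors
print(classify_observation(3.4, 0.05, 0.02, n=104, p=5))  # ['outlier']
print(classify_observation(1.1, 0.20, 0.09, n=104, p=5))  # ['high_leverage', 'influential']
```

An observation can carry more than one flag, which is why the function returns a list rather than a single label.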

Diagnostic Metrics

Standardized Residuals

Definition: Each residual divided by its estimated standard deviation, which puts all residuals on a common scale

Purpose: Measures how far each observation is from the fitted regression line

Interpretation:

  • |Std. Residual| ≤ 2: Normal observation

  • 2 < |Std. Residual| ≤ 3: Potential outlier

  • |Std. Residual| > 3: Flagged as an outlier
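
As an illustration, standardized residuals can be computed with NumPy as below. This sketch assumes an ordinary least squares fit with an intercept and uses the internally studentized form, where each residual is divided by s·√(1 − hᵢᵢ). Whether MixModeler uses exactly this variant is an assumption, and the data is synthetic.

```python
import numpy as np


def standardized_residuals(X, y):
    """Internally studentized residuals for an OLS fit with an intercept.

    X has shape (n, p) and does not include the intercept column.
    """
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])            # design matrix with intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)    # OLS coefficients
    resid = y - A @ beta
    # Leverage h_ii: diagonal of the hat matrix A (A'A)^-1 A' = A pinv(A)
    h = np.einsum("ij,ji->i", A, np.linalg.pinv(A))
    s2 = resid @ resid / (n - p - 1)                # residual variance estimate
    return resid / np.sqrt(s2 * (1.0 - h))


# Synthetic data with one planted outlier at index 10
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=60)
y[10] += 8.0

r = standardized_residuals(X, y)
flagged = np.flatnonzero(np.abs(r) > 3)
print(flagged)
```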

Hat Values (Leverage)

Definition: Diagonal elements hᵢᵢ of the hat matrix H = X(XᵀX)⁻¹Xᵀ, measuring how far an observation's predictor values lie from the center of X-space

Purpose: Identifies observations with unusual predictor combinations

Interpretation:

  • Hat value > 2(p+1)/n: High leverage point

  • High leverage alone isn't problematic unless combined with large residuals
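
A short sketch of how hat values could be obtained, assuming an OLS model with an intercept. It uses a QR decomposition so the full n × n hat matrix is never formed. The function name `hat_values` and the synthetic data are illustrative only.

```python
import numpy as np


def hat_values(X):
    """Leverage of each observation; X has shape (n, p) without the intercept."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])   # add the intercept column
    Q, _ = np.linalg.qr(A)                 # reduced QR, so H = Q Q'
    return np.sum(Q**2, axis=1)            # diagonal of H


# Synthetic predictors with one unusual combination planted at index 0
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
X[0] = [10.0, 10.0]

h = hat_values(X)
n, p = X.shape
threshold = 2 * (p + 1) / n                # 2(p+1)/n rule from this section
high_leverage = np.flatnonzero(h > threshold)
print(threshold, high_leverage)
```

Hat values always sum to the number of estimated coefficients (p + 1 here), so the threshold is twice the average leverage.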

Cook's Distance

Definition: Measures overall influence combining leverage and residuals

Purpose: Identifies observations that change the regression when removed

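Formula: Combines the standardized residual rᵢ and the leverage hᵢᵢ: Dᵢ = (rᵢ² / (p+1)) × (hᵢᵢ / (1-hᵢᵢ)), where p + 1 is the number of estimated coefficients (p predictors plus the intercept)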

Interpretation:

  • Cook's D < 0.5: Little influence

  • 0.5 < Cook's D < 1: Moderate influence

  • Cook's D > 1: High influence (investigate)

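  • Cook's D > 4/n: Rule-of-thumb threshold used to flag points for review; for typical sample sizes it is much stricter than the cutoffs above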
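
A minimal NumPy sketch of Cook's Distance built from the standardized residual and the leverage. It divides by the number of estimated coefficients, which is p + 1 when p counts the predictors and the model has an intercept. The function name, the synthetic data, and the OLS setting are assumptions made for illustration.

```python
import numpy as np


def cooks_distance(X, y):
    """Cook's D for an OLS fit with an intercept; X has shape (n, p)."""
    n, p = X.shape
    k = p + 1                                      # estimated coefficients
    A = np.column_stack([np.ones(n), X])
    Q, _ = np.linalg.qr(A)
    h = np.sum(Q**2, axis=1)                       # leverage h_ii
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    s2 = resid @ resid / (n - k)                   # residual variance estimate
    r = resid / np.sqrt(s2 * (1.0 - h))            # standardized residuals
    return r**2 / k * h / (1.0 - h)


# Synthetic data: index 5 has unusual predictors AND an unusual response
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=40)
X[5] = [6.0, 6.0]
y[5] = 17.0                                        # 10 above the true line at (6, 6)

D = cooks_distance(X, y)
influential = np.flatnonzero(D > 4 / len(y))       # 4/n rule of thumb
print(influential)
```

The same value results from refitting the model without observation i and measuring how far the fitted values move, which is why Cook's D is read as a deletion diagnostic.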

Visual Diagnostics

MixModeler provides several plots to identify influential points:

Cook's Distance Plot:

  • Bar chart showing Cook's D for each observation

  • Horizontal line at threshold (4/n)

  • Good: All bars below threshold

  • Problem: Bars extending above threshold

Leverage vs Residuals Plot:

  • Scatter plot with leverage on x-axis, standardized residuals on y-axis

  • Shows Cook's Distance contours

  • Good: Points clustered in center

  • Problem: Points in upper right or lower right (high leverage + large residual)

Influential Observations Table:

  • Lists top influential points with all metrics

  • Sortable by Cook's D, leverage, or residual

  • Shows observation index, date, and classification type

Interpreting Test Results

Passed Tests (✓)

What it means: Few or no influential points detected

Implications:

  • Model results are robust

  • No single observation drives the results

  • Data quality appears good

  • Model is stable across observations

Action: No action needed - model is robust to individual observations

Failed Tests (⚠)

What it means: Multiple influential points, outliers, or high leverage observations detected

Implications:

  • Results may change if influential points are removed

  • Potential data quality issues

  • Model may not generalize well

  • Coefficient estimates may be unstable

Common Causes:

  • Data entry errors or measurement mistakes

  • Special events (promotions, crises, holidays)

  • Structural breaks in relationships

  • Natural extreme values in business data

  • Missing variables that explain extreme values

What to Do When Points Are Identified

When influential points are detected, follow this decision process:

1. Investigate the Observation

Check data quality:

  • Verify data entry accuracy

  • Look for measurement errors

  • Confirm unusual values are real

Understand business context:

  • Was there a special event that period?

  • Promotion, crisis, launch, or external shock?

  • Does it represent normal business variation?

2. Determine Appropriate Action

For data errors:

  • ✓ Correct if error can be fixed

  • ✓ Remove if beyond repair

  • Document the correction

For special events:

  • Consider adding a dummy variable for the event

  • This preserves the observation while accounting for its uniqueness

  • Example: Super_Bowl_Week, COVID_Period, BlackFriday
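
An event dummy of this kind is a 0/1 column that equals 1 in the affected periods. The sketch below assumes weekly data keyed by week start date. The helper `event_dummy` is hypothetical and only illustrates the idea.

```python
from datetime import date, timedelta


def event_dummy(week_starts, event_start, event_end):
    """Return 1 for each week that overlaps [event_start, event_end], else 0."""
    return [
        1 if ws <= event_end and ws + timedelta(days=6) >= event_start else 0
        for ws in week_starts
    ]


# Five weeks starting Monday 2023-11-06; Black Friday 2023 fell on 2023-11-24
weeks = [date(2023, 11, 6) + timedelta(weeks=i) for i in range(5)]
black_friday = event_dummy(weeks, date(2023, 11, 24), date(2023, 11, 24))
print(black_friday)  # [0, 0, 1, 0, 0]
```

Added as a predictor, the column absorbs the event's lift so the observation stays in the data without distorting the media coefficients.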

For legitimate extreme values:

  • Keep in the dataset (reflects real business variation)

  • Report a sensitivity analysis showing results with and without the observation

  • Acknowledge in interpretation

For structural breaks:

  • Consider splitting analysis into before/after periods

  • Add trend break variables

  • Model the regime change explicitly
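
One common way to model a break explicitly is a pair of regressors: a level shift and a trend break. This is a sketch under the assumption that the break date is known. The function `break_regressors` is hypothetical, and other parameterizations are possible.

```python
def break_regressors(n_periods, break_index):
    """Build level-shift and trend-break columns for a known break point.

    level_shift is 0 before the break and 1 from break_index onward.
    trend_break is 0 before the break, then counts 1, 2, 3, ... after it.
    """
    level_shift = [1 if t >= break_index else 0 for t in range(n_periods)]
    trend_break = [max(0, t - break_index + 1) for t in range(n_periods)]
    return level_shift, trend_break


level_shift, trend_break = break_regressors(n_periods=6, break_index=3)
print(level_shift)  # [0, 0, 0, 1, 1, 1]
print(trend_break)  # [0, 0, 0, 1, 2, 3]
```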

3. Test Model Stability

Run sensitivity analysis:

  • Fit model with and without influential points

  • Compare coefficient estimates

  • Check if business conclusions change

If results are stable:

  • Keep all observations

  • Note influential points in documentation

If results change dramatically:

  • Investigate why (missing variables, wrong specification)

  • Consider robust regression methods

  • Report both sets of results
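
The with-and-without comparison can be sketched as follows, assuming an OLS model with an intercept. The function names and the toy data are illustrative, and the percentage change is undefined for a coefficient that is exactly zero in the full fit.

```python
import numpy as np


def fit_ols(X, y):
    """OLS coefficients, intercept first; X has shape (n, p)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta


def coefficient_sensitivity(X, y, drop_idx):
    """Refit without the rows in drop_idx and report percent change per coefficient."""
    keep = np.ones(len(y), dtype=bool)
    keep[list(drop_idx)] = False
    full = fit_ols(X, y)
    reduced = fit_ols(X[keep], y[keep])
    pct_change = 100.0 * (reduced - full) / np.abs(full)
    return full, reduced, pct_change


# Toy data: an exact line y = 1 + 2x with one distorted point at the last index
X = np.arange(10.0).reshape(-1, 1)
y = 1.0 + 2.0 * X[:, 0]
y[9] += 20.0

full, reduced, pct_change = coefficient_sensitivity(X, y, drop_idx=[9])
print(np.round(full, 3), np.round(reduced, 3), np.round(pct_change, 1))
```

Here the slope falls by roughly a third once the distorted point is removed, which is the kind of shift that would change a business conclusion and should be reported.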

Practical Guidelines

When to Remove Observations:

  • Clear data entry errors

  • Impossible values

  • Duplicate records

  • Extreme outliers from known data quality issues

When to Keep Observations:

  • Real business variation (even if extreme)

  • Important events that may recur

  • Legitimately unusual periods

  • When removal would bias sample

When to Add Control Variables:

  • Special events that can be modeled

  • Seasonal anomalies

  • Temporary shocks

  • Regime changes

Number of Influential Points:

  • 1-3 points in 100+ observations: Generally acceptable

  • 5-10% of observations: Investigate carefully

  • More than 10% of observations: Serious data or model issues
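
These guidelines can be turned into a simple check. The sketch below is hypothetical: the bands leave some cases uncovered (for example 4% flagged), so the fallback label is an assumption and not part of the guidance.

```python
def flag_rate_assessment(n_flagged, n_obs):
    """Map the share of flagged observations to the guideline bands above."""
    share = n_flagged / n_obs
    if share > 0.10:
        return "serious data or model issues"
    if share >= 0.05:
        return "investigate carefully"
    if n_obs >= 100 and n_flagged <= 3:
        return "generally acceptable"
    return "not covered by the guideline; use judgment"  # assumed fallback


print(flag_rate_assessment(2, 104))   # generally acceptable
print(flag_rate_assessment(8, 104))   # investigate carefully
print(flag_rate_assessment(16, 104))  # serious data or model issues
```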

Example Interpretation

Scenario 1 - Passed:

  • 2 observations with Cook's D > threshold

  • Both are holiday weeks with promotions

  • Removing them doesn't change coefficients meaningfully

Interpretation: Minor influential points related to known promotional events. Model is robust. Adding holiday dummy variables is worth considering but not strictly necessary.

Scenario 2 - Moderate Issues:

  • 5 high leverage points (unusual spend combinations)

  • 3 outliers (large residuals)

  • 2 highly influential points that change the TV coefficient by 30%

Interpretation: Some influential observations warrant investigation. Check the 2 highly influential points for data quality. If valid, add dummy variables for those periods or report sensitivity analysis. Model results should be interpreted cautiously.

Scenario 3 - Severe Issues:

  • 15% of observations flagged as influential

  • Removing the top 3 changes multiple coefficients by more than 50%

  • Several impossible values detected

Interpretation: Serious data quality and model specification issues. Clean data thoroughly, check for systematic errors, and reconsider model specification before using for business decisions.

Marketing Mix Modeling Context

In MMM, influential points often represent:

Promotional Events: Black Friday, Cyber Monday, Holiday seasons with extreme lift

Product Launches: Periods with unusual marketing mix and amplified response

Competitive Actions: Periods when competitors did something unusual

External Shocks: COVID, economic crises, supply chain disruptions

Media Tests: Periods with experimental spend levels

Best Practices for MMM:

  • Document all influential periods with business explanations

  • Add dummy variables for recurring special events

  • Use hold-out validation to test model stability

  • Report attribution with and without extreme periods

  • Be transparent about model limitations

After identifying influential points:

  • Check Residual Normality as outliers can cause non-normality

  • Review Heteroscedasticity as outliers can create variance patterns

  • Examine Actual vs Predicted to visualize which periods fit poorly

  • Check raw data quality in Data Upload section
