Data Quality Best Practices
Ensuring Reliable and Accurate MMM Results
The quality of your Marketing Mix Model is directly tied to the quality of your input data. This page provides comprehensive best practices for data preparation, validation, and maintenance to ensure your MMM delivers accurate, actionable insights.
The Data Quality Principle
"Garbage In, Garbage Out"
Core Truth: No amount of sophisticated statistical modeling can overcome poor data quality. A model built on flawed data will produce unreliable results, leading to poor business decisions.
Investment in Data Quality: Spending 70% of time on data preparation and 30% on modeling is better than the reverse.
Data Quality Dimensions
Accuracy
Definition: Data correctly represents the real-world phenomena being measured
Questions to Ask:
- Are marketing spend numbers from authoritative sources (invoices, billing statements)? 
- Is KPI data from validated systems (e.g., verified sales reports)? 
- Have data entry errors been checked and corrected? 
Red Flags:
- Round numbers that seem "too perfect" (e.g., exactly $10,000 every week) 
- Sudden unexplained spikes or drops 
- Values that contradict known business events 
Completeness
Definition: All necessary data points are present
Questions to Ask:
- Are there gaps in the time series (missing weeks/months)? 
- Do we have data for all major marketing channels? 
- Are seasonality and external factors included? 
Red Flags:
- Missing values (blank cells) 
- Entire channels with zero spend for extended periods (when you know campaigns ran) 
- Sporadic data availability 
Consistency
Definition: Data follows the same format, units, and definitions throughout
Questions to Ask:
- Are all spend values in the same currency? 
- Is the time frequency consistent (all weekly OR all monthly)? 
- Do channel definitions remain constant over time? 
Red Flags:
- Switching from weekly to monthly mid-dataset 
- Channel names changing (TV_Spend becomes Television_Advertising) 
- Unit changes (thousands to actual values mid-series) 
Validity
Definition: Data values fall within acceptable ranges
Questions to Ask:
- Are there negative values where they don't belong (e.g., negative spend doesn't make sense)? 
- Do percentages stay between 0-100%? 
- Are outliers explainable by real business events? 
Red Flags:
- Impossible values (110% discount, negative revenue) 
- Extreme outliers with no explanation 
- Values outside expected business ranges 
Timeliness
Definition: Data represents the correct time period
Questions to Ask:
- Are dates aligned correctly? 
- Do marketing investments align with campaign timing? 
- Is there proper lag between spend and sales (if expected)? 
Red Flags:
- TV spend recorded in wrong week 
- Mismatch between campaign calendar and data 
- Timing inconsistencies 
Pre-Modeling Data Checks
Check 1: Inspect Summary Statistics
For Each Variable:
Calculate and review:
- Mean: Average value 
- Median: Middle value 
- Min/Max: Range of values 
- Standard Deviation: Variability 
Red Flag Examples: a standard deviation of zero (no variation), a mean far from the median (heavy skew), or a maximum several orders of magnitude above typical values (possible unit or data entry error).
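If your data lives in a spreadsheet, a few lines of pandas can produce these statistics for every variable at once. This is a minimal sketch only; the file name and column names are hypothetical placeholders for your own dataset.

```python
import pandas as pd

# Hypothetical weekly MMM dataset: one row per week, one column per variable
df = pd.read_excel("MMM_Data_2024Q1_v1.xlsx", parse_dates=["Week"])

# Mean, median, min/max, and standard deviation for every numeric column
summary = df.describe().T[["mean", "50%", "min", "max", "std"]]
summary = summary.rename(columns={"50%": "median"})
print(summary)

# Simple red-flag scan: zero variance or a max far above the typical value
for col in df.select_dtypes("number").columns:
    if df[col].std() == 0:
        print(f"{col}: no variation - check the source")
    elif df[col].median() > 0 and df[col].max() > 10 * df[col].median():
        print(f"{col}: max is >10x the median - possible outlier or unit issue")
```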
Check 2: Visualize Time Series
Create Line Charts for:
- KPI over time 
- Each marketing channel over time 
- Control variables over time 
Look For:
- Consistent patterns 
- Explainable seasonality 
- Sudden structural breaks 
- Data entry errors (obvious spikes) 
Example Issues to Spot:
Sales Chart:
Normal: ~~~∿~~~∿~~~∿~~~
Issue:  ~~~∿~~~↑~~~∿~~~  (unexplained spike - data error?)
Check 3: Correlation Analysis
Calculate Correlations Between:
- KPI and each marketing channel 
- Marketing channels with each other 
Expected Patterns:
- KPI should positively correlate with most marketing channels 
- Some channels may correlate with each other (multicollinearity - manageable) 
Red Flags:
- Negative correlation between marketing and KPI (when positive expected) 
- Perfect correlation (r = 1.0) between two channels (likely same data duplicated) 
- Zero correlation when strong relationship expected 
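A correlation matrix covers both checks in one pass. The sketch below is illustrative; the file name, KPI column, and channel columns are hypothetical and should be replaced with your own.

```python
import pandas as pd

df = pd.read_excel("MMM_Data_2024Q1_v1.xlsx")   # hypothetical file
kpi = "Sales"                                    # hypothetical column names
channels = ["TV_Spend", "Radio_Spend", "Search_Spend", "Social_Spend"]

corr = df[[kpi] + channels].corr()
print(corr[kpi].drop(kpi))  # KPI vs. each channel: expect mostly positive values

# Flag near-perfect channel-to-channel correlations (possible duplicated data)
for i, a in enumerate(channels):
    for b in channels[i + 1:]:
        r = corr.loc[a, b]
        if abs(r) > 0.95:
            print(f"{a} vs {b}: r = {r:.2f} - check for duplicated or always co-flighted data")
```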
Check 4: Missing Data Assessment
Identify:
- How many missing values per variable? 
- Are missings random or systematic? 
- Which time periods have missing data? 
Decision Rules: less than 5% missing can usually be interpolated; 5-20% should be interpolated or filled with zeros and documented; more than 30% missing in a key variable means the data is not ready for modeling (see the red lines later on this page). A quick way to quantify the gaps is sketched below.
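This minimal pandas sketch counts missing values per variable and lists the affected weeks; the file name and the Week column are hypothetical placeholders.

```python
import pandas as pd

df = pd.read_excel("MMM_Data_2024Q1_v1.xlsx", parse_dates=["Week"])  # hypothetical file

# Missing values per variable, as a count and a share of all rows
missing = pd.DataFrame({
    "missing": df.isna().sum(),
    "missing_%": (df.isna().mean() * 100).round(1),
})
print(missing)

# Which weeks have any missing data? Helps spot systematic (vs. random) gaps
gap_weeks = df.loc[df.isna().any(axis=1), "Week"]
print(gap_weeks.to_list())
```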
Check 5: Outlier Detection
Methods:
Z-Score Method: Values more than 3 standard deviations from mean are outliers
IQR Method: Values beyond Q1 - 1.5×IQR or Q3 + 1.5×IQR
Visual Method: Create box plots for each variable
Response to Outliers:
- Verify: Is this a data error or real event? 
- Real Event: Keep data, possibly add dummy variable for that period 
- Data Error: Correct the value 
- Uncertain: Consider excluding that observation or winsorizing (capping) 
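Both the z-score and IQR methods above take only a few lines in pandas. The sketch is a rough illustration; the file name and the Sales column are hypothetical.

```python
import pandas as pd

df = pd.read_excel("MMM_Data_2024Q1_v1.xlsx", parse_dates=["Week"])  # hypothetical file
col = "Sales"  # variable to screen

# Z-score method: more than 3 standard deviations from the mean
z = (df[col] - df[col].mean()) / df[col].std()
z_outliers = df.loc[z.abs() > 3, ["Week", col]]

# IQR method: beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR
q1, q3 = df[col].quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = df.loc[(df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr), ["Week", col]]

print(z_outliers)
print(iqr_outliers)
```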
Data Preparation Best Practices
Marketing Spend Data
Source from Authoritative Systems
Best Sources:
✅ Billing invoices from media vendors
✅ Finance/accounting systems
✅ Media agency reports (reconciled)
✅ Platform spend data (Google Ads, Facebook Ads API)
Avoid:
❌ Estimates or approximations
❌ Budget allocations (use actual spend)
❌ Un-reconciled preliminary numbers
Handle Zero Spend Correctly
When Channel Has Zero Spend:
- Enter 0 explicitly (don't leave blank)
- Ensure this reflects reality (campaign off, not missing data) 
Why It Matters: Model needs to know when marketing was inactive to isolate its effect
Aggregate to Consistent Level
Ensure Consistency:
- All channels at same geographic level (national vs. regional) 
- All channels at same time period (weekly totals, not daily within weeks) 
- Units consistent (all dollars, or all thousands of dollars) 
Example Issue:
❌ TV spend in thousands, Digital spend in actual dollars
✅ All spend in actual dollars
KPI Data
Use Clean, Validated Data
Best Practices:
- Source from validated sales/transaction systems 
- Remove returns and cancellations if not part of model scope 
- Ensure consistent definition across time periods 
- Align timing (sales attributed to correct week) 
Consider KPI Transformations
Log Transformation: Use when KPI has exponential growth or high variance
Log_Sales = ln(Sales)
Benefits:
- Stabilizes variance 
- Interprets coefficients as elasticities 
- Better fit for multiplicative models 
When to Use:
- KPI grows exponentially over time 
- Wide range of values (min = 1000, max = 1000000) 
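A minimal sketch of the transformation, assuming the KPI column is called Sales and the file name is a placeholder:

```python
import numpy as np
import pandas as pd

df = pd.read_excel("MMM_Data_2024Q1_v1.xlsx")  # hypothetical file

# Natural log of the KPI; requires strictly positive values
assert (df["Sales"] > 0).all(), "Log transform needs positive sales values"
df["Log_Sales"] = np.log(df["Sales"])

# In a log-log specification, a channel coefficient reads as an elasticity:
# a 1% change in spend is associated with a coefficient% change in sales.
```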
Address Seasonality
Options:
Option 1: Include Seasonal Variables Add month dummies or quarter indicators
Option 2: Deseasonalize KPI Remove seasonal component before modeling (advanced)
Recommendation: Option 1 (include seasonality variables) is preferred - transparent and flexible
Seasonality Variables
Month Dummies (Recommended)
Create 11 binary variables (omit one month as reference):
Why 11 not 12? Avoid multicollinearity - one month serves as baseline
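One way to build the dummies, sketched with a hypothetical weekly file and a Week date column; drop_first=True omits month 1 (January) as the baseline:

```python
import pandas as pd

df = pd.read_excel("MMM_Data_2024Q1_v1.xlsx", parse_dates=["Week"])  # hypothetical file

# 11 binary variables: one per month, with January dropped as the reference
month_dummies = pd.get_dummies(df["Week"].dt.month, prefix="Month", drop_first=True).astype(int)
df = pd.concat([df, month_dummies], axis=1)
```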
Holiday Indicators
Create binary flags for major holidays:
Benefits:
- Captures demand spikes 
- Prevents misattribution to marketing 
- Improves model fit 
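A rough sketch of one way to flag holiday weeks; the dates and column names are hypothetical and should come from your own campaign/holiday calendar.

```python
import pandas as pd

df = pd.read_excel("MMM_Data_2024Q1_v1.xlsx", parse_dates=["Week"])  # hypothetical file

# Hypothetical holiday week-start dates (e.g., Black Friday, Christmas weeks)
holiday_weeks = pd.to_datetime(["2024-11-25", "2024-12-23"])

df["Holiday_Flag"] = df["Week"].isin(holiday_weeks).astype(int)
```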
External Variables
Economic Indicators
Consider Including:
- Consumer confidence index 
- Unemployment rate 
- GDP growth 
- Industry-specific indices 
When to Include:
- B2C products sensitive to economic conditions 
- Luxury goods 
- Discretionary spending categories 
Competitor Activity
Data Sources:
- Industry reports 
- Competitive intelligence firms 
- Media monitoring services 
- Web scraping (within legal limits) 
Variables:
- Competitor advertising spend index 
- Competitor price index 
- New competitor entries 
Weather Data
When Relevant:
- Retail (foot traffic affected by weather) 
- Food & beverage 
- Seasonal products 
- Outdoor activities 
Variables:
- Temperature 
- Precipitation 
- Severe weather events 
Handling Common Data Issues
Issue 1: Sparse Data (Many Zeros)
Scenario: Channel only active periodically (e.g., TV campaigns are 4-week flights)
Implications:
- Harder to estimate effects 
- Lower statistical power 
- Potential confounding with seasonality 
Solutions:
- Aggregate channels: Combine similar sparse channels 
- Extend data collection: Gather more historical data 
- Use Bayesian priors: Inform model with external knowledge 
- Accept limitations: Lower confidence in sparse channel estimates 
Issue 2: Highly Correlated Channels
Scenario: Two channels always move together (e.g., TV and Radio run simultaneously)
Symptoms:
- High VIF (>5) 
- Unstable coefficients 
- Wide confidence intervals 
Solutions:
- Combine into one variable: Create a TV_Plus_Radio composite 
- Use one as control: Model only TV, include Radio as control 
- Regularization: Use Bayesian priors to constrain estimates 
- Gather more data: Periods where channels vary independently 
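Checking VIF and building a composite takes a few lines with statsmodels and pandas. This is a sketch under assumed column names; the VIF > 5 threshold matches the symptom listed above.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_excel("MMM_Data_2024Q1_v1.xlsx")          # hypothetical file
channels = ["TV_Spend", "Radio_Spend", "Search_Spend"]  # hypothetical columns

# VIF per channel (values above ~5 indicate problematic multicollinearity)
X = sm.add_constant(df[channels].dropna())
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif.drop("const"))

# Option: combine two co-moving channels into one composite variable
df["TV_Plus_Radio"] = df["TV_Spend"] + df["Radio_Spend"]
```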
Issue 3: Structural Breaks
Scenario: Business change affects baseline (e.g., new product launch, market expansion)
Symptoms:
- Sudden shift in KPI level 
- Model diagnostics fail 
- Residuals show patterns 
Solutions:
- Add dummy variable: Post_Launch = 1 after structural change 
- Interaction terms: Allow marketing effects to differ pre/post 
- Split data: Model pre and post periods separately 
- Control explicitly: Include variable for the structural factor 
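A minimal sketch of the dummy and interaction options, assuming a hypothetical launch date and column names:

```python
import pandas as pd

df = pd.read_excel("MMM_Data_2024Q1_v1.xlsx", parse_dates=["Week"])  # hypothetical file
launch_date = pd.Timestamp("2023-06-01")  # hypothetical structural-break date

# Step dummy: 1 from the launch week onward, 0 before
df["Post_Launch"] = (df["Week"] >= launch_date).astype(int)

# Optional interaction: lets the TV effect differ before vs. after the break
df["TV_x_Post_Launch"] = df["TV_Spend"] * df["Post_Launch"]
```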
Issue 4: Data Frequency Mismatch
Scenario: Different variables available at different frequencies
Example:
- Sales data: Weekly 
- TV spend: Weekly 
- Economic indicators: Monthly 
Solutions:
- Interpolation: Distribute monthly value across weeks 
- Forward fill: Repeat monthly value for each week in that month 
- Aggregate: Convert all data to monthly (if feasible) 
Recommendation: Forward fill for economic/external variables (they change slowly)
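One way to implement the forward fill is to align each week with the most recent monthly reading. The sketch below uses pandas merge_asof; both file names and key columns are hypothetical.

```python
import pandas as pd

weekly = pd.read_excel("MMM_Data_2024Q1_v1.xlsx", parse_dates=["Week"])      # hypothetical weekly data
monthly = pd.read_excel("Economic_Indicators.xlsx", parse_dates=["Month"])   # hypothetical monthly data

# Forward fill: each week inherits the latest monthly value on or before that week
merged = pd.merge_asof(
    weekly.sort_values("Week"),
    monthly.sort_values("Month"),
    left_on="Week",
    right_on="Month",
    direction="backward",
)
```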
Issue 5: Insufficient Variation
Scenario: Variable barely changes over time (e.g., price constant at $49.99 all year)
Implications:
- Model cannot estimate effect 
- Variable adds noise without signal 
- Degrees of freedom wasted 
Solutions:
- Exclude variable: If truly constant, it's absorbed in intercept 
- Combine with other variables: Create interaction or ratio 
- Extend time window: Gather data across periods with more variation 
Data Transformation Guidelines
When to Transform Variables
Log Transformation:
- Variable has wide range (spanning multiple orders of magnitude) 
- Relationship with KPI is multiplicative, not additive 
- Elasticity interpretation desired 
Square Root:
- Reduces impact of outliers 
- Moderate skewness 
Standardization:
- Want coefficients on comparable scale 
- Comparing effect sizes across very different units 
When NOT to Transform
Avoid Transformations If:
- Data already well-behaved (normal-ish distribution) 
- Interpretation becomes too complex 
- Business stakeholders need direct interpretation 
Remember: MixModeler handles saturation curves and adstock - these are specialized transformations. Standard transformations (log, sqrt) are optional.
Data Update and Maintenance
Regular Data Refresh Cycle
Recommended Schedule:
Monthly Refresh:
- Add latest week/month of data 
- Re-run models with updated data 
- Check for any anomalies 
Quarterly Deep Dive:
- Validate data sources 
- Review model performance 
- Update priors if needed 
- Assess new variables 
Annual Review:
- Complete data audit 
- Evaluate need for structural changes 
- Update documentation 
Version Control
Best Practices:
File Naming:
MMM_Data_2024Q1_v1.xlsx
MMM_Data_2024Q1_v2.xlsx
MMM_Data_2024Q2_v1.xlsx
Change Log: Maintain a separate sheet or document tracking:
- Date of change 
- What changed 
- Why (data correction, new source, etc.) 
- Who made the change 
Data Documentation
Create Data Dictionary:
Why It Matters:
- Reproducibility 
- Knowledge transfer 
- Troubleshooting 
- Stakeholder transparency 
Data Quality Checklist
Before finalizing data for modeling:
Accuracy
Completeness
Consistency
Validity
Timeliness
Advanced Data Quality Techniques
Statistical Process Control
Create Control Charts: Monitor variables over time to detect:
- Out-of-control points 
- Trending patterns 
- Systematic shifts 
UCL/LCL Limits:
Upper Control Limit = Mean + 3×StdDev
Lower Control Limit = Mean - 3×StdDev
Action: Investigate points outside control limits
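Computing the control limits and flagging out-of-control weeks is straightforward; the sketch below assumes hypothetical Week and Sales columns.

```python
import pandas as pd

df = pd.read_excel("MMM_Data_2024Q1_v1.xlsx", parse_dates=["Week"])  # hypothetical file
col = "Sales"

mean, std = df[col].mean(), df[col].std()
ucl = mean + 3 * std  # Upper Control Limit
lcl = mean - 3 * std  # Lower Control Limit

# Weeks outside the control limits: investigate each one
out_of_control = df.loc[(df[col] > ucl) | (df[col] < lcl), ["Week", col]]
print(out_of_control)
```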
Data Reconciliation
Cross-Check Sources:
- Marketing spend: Media agency vs. finance system 
- Sales: CRM vs. accounting vs. point-of-sale 
- Digital metrics: Platform reporting vs. Google Analytics 
Acceptable Variance:
- Within 5%: Likely timing differences, acceptable 
- 5-10%: Investigate discrepancies 
- >10%: Resolve before modeling 
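A rough sketch of how the cross-check could be automated for one spend series; the file names, the shared Week key, and the TV_Spend column in both sources are assumptions.

```python
import pandas as pd

# Hypothetical spend totals from two sources, keyed by week
agency = pd.read_excel("Agency_Spend.xlsx", parse_dates=["Week"])
finance = pd.read_excel("Finance_Spend.xlsx", parse_dates=["Week"])

merged = agency.merge(finance, on="Week", suffixes=("_agency", "_finance"))
merged["variance_pct"] = (
    (merged["TV_Spend_agency"] - merged["TV_Spend_finance"]).abs()
    / merged["TV_Spend_finance"] * 100
)

# Apply the thresholds above
print(merged.loc[merged["variance_pct"] > 10, "Week"])             # resolve before modeling
print(merged.loc[merged["variance_pct"].between(5, 10), "Week"])   # investigate
```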
Holdout Sample Validation
Approach:
- Set aside last 8-12 weeks of data (don't use in model) 
- Build model on remaining data 
- Predict holdout period 
- Compare predictions to actuals 
Benefits:
- Validates data quality 
- Tests model performance 
- Identifies overfitting 
- Builds stakeholder confidence 
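The sketch below shows the holdout mechanics with a simple OLS stand-in for whatever model you actually use (e.g., your MixModeler specification); the file name, channel columns, and 12-week holdout are assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_excel("MMM_Data_2024Q1_v1.xlsx", parse_dates=["Week"]).sort_values("Week")
channels = ["TV_Spend", "Search_Spend"]  # hypothetical columns

# Hold out the last 12 weeks; fit on the rest
holdout_weeks = 12
train, holdout = df.iloc[:-holdout_weeks], df.iloc[-holdout_weeks:]

# Simple OLS stand-in for the real model
X_train = sm.add_constant(train[channels])
model = sm.OLS(train["Sales"], X_train).fit()

X_hold = sm.add_constant(holdout[channels], has_constant="add")
predictions = model.predict(X_hold)

# Compare predictions to actuals over the holdout period
mape = np.mean(np.abs((holdout["Sales"] - predictions) / holdout["Sales"])) * 100
print(f"Holdout MAPE: {mape:.1f}%")
```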
Common Data Quality Pitfalls
Pitfall 1: Using Budget Instead of Actual Spend
Problem: Budgets don't reflect actual spending patterns
Impact:
- Overestimates effects when underspent 
- Underestimates effects when overspent 
- Wrong attribution timing 
Fix: Always use actual spend from invoices or billing systems
Pitfall 2: Inconsistent Channel Definitions
Problem: TV spend includes digital video some months, excludes it others
Impact:
- Artificially high variance 
- Misleading coefficient estimates 
- Incomparable periods 
Fix: Maintain rigorous, consistent channel taxonomy across entire time series
Pitfall 3: Ignoring Data Lags
Problem: TV spend recorded when invoice paid (4 weeks after airing)
Impact:
- Wrong timing attribution 
- Reduced statistical significance 
- Misleading adstock estimates 
Fix: Align data to when marketing actually occurred (flight dates, not payment dates)
Pitfall 4: Aggregating Too Much
Problem: Combining all digital channels into "Digital" when they behave differently
Impact:
- Loss of actionable insights 
- Averaging out different ROIs 
- Can't optimize within digital 
Fix: Disaggregate to actionable decision level (Display, Search, Social separately)
Pitfall 5: Not Adjusting for Inflation
Problem: Using nominal dollars across multi-year period
Impact:
- Artificial growth trends 
- Misleading year-over-year comparisons 
- Biased coefficient estimates 
Fix: Convert to real (inflation-adjusted) dollars using CPI or category-specific price indices
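A minimal sketch of CPI deflation, assuming a hypothetical monthly CPI file with the base period indexed to 100:

```python
import pandas as pd

df = pd.read_excel("MMM_Data_2024Q1_v1.xlsx", parse_dates=["Week"])  # hypothetical spend data
cpi = pd.read_excel("CPI.xlsx", parse_dates=["Month"])               # hypothetical CPI series (base = 100)

# Attach the most recent CPI reading to each week
merged = pd.merge_asof(df.sort_values("Week"), cpi.sort_values("Month"),
                       left_on="Week", right_on="Month", direction="backward")

# Convert nominal spend to real (base-period) dollars
merged["TV_Spend_real"] = merged["TV_Spend"] * 100 / merged["CPI"]
```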
Data Quality Metrics
Quantitative Metrics to Track
Completeness Rate:
Completeness = (Non-Missing Values / Total Values) × 100%
Target: >95%
Accuracy Rate (When Ground Truth Available):
Accuracy = (Correct Values / Total Values) × 100%
Target: >98%
Timeliness:
Data Lag = Days Between Event Date and Data Available Date
Target: <7 days for weekly data
Consistency Score:
Consistency = (Matched Values Across Sources / Total Values) × 100%
Target: >95%
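Two of these metrics are easy to track programmatically. The sketch below computes completeness per variable and, under the assumption that a hypothetical Loaded_Date column records when each row was delivered, the data lag in days.

```python
import pandas as pd

df = pd.read_excel("MMM_Data_2024Q1_v1.xlsx", parse_dates=["Week"])  # hypothetical file

# Completeness rate per variable (target: >95%)
completeness = (df.notna().mean() * 100).round(1)
print(completeness)

# Timeliness: days between the event week and when the data became available
# (assumes a hypothetical 'Loaded_Date' column)
if "Loaded_Date" in df.columns:
    data_lag = (pd.to_datetime(df["Loaded_Date"]) - df["Week"]).dt.days
    print(f"Median data lag: {data_lag.median():.0f} days (target: <7 for weekly data)")
```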
Tools and Techniques for Data Quality
Excel Data Validation
Set Rules:
- Select data column 
- Data → Data Validation 
- Set criteria (e.g., "Whole number", "Between 0 and 1000000") 
- Excel will flag violations 
Benefits:
- Prevents data entry errors 
- Enforces valid ranges 
- Visual feedback 
Conditional Formatting for Outliers
Highlight Unusual Values:
- Select data range 
- Home → Conditional Formatting → New Rule 
- Format cells where: =ABS(B2-AVERAGE($B$2:$B$100))/STDEV($B$2:$B$100)>3 
- Set format (red fill) 
Result: Outliers (>3 standard deviations) highlighted automatically
Pivot Tables for Quick Checks
Use Cases:
- Sum spend by channel (verify totals) 
- Count observations by month (detect gaps) 
- Average KPI by quarter (spot trends) 
Quick QA Method: Create pivot of channel spend → look for unexpectedly high/low totals
Charts for Visual Inspection
Create:
- Line chart of KPI over time (spot anomalies) 
- Scatter plots of marketing vs. KPI (check relationships) 
- Box plots for each variable (identify outliers) 
Visual inspection often reveals issues statistical tests miss
Data Quality Communication
Reporting Data Quality to Stakeholders
Include in Documentation:
- Data sources: Where each variable comes from 
- Known limitations: Missing periods, data quality concerns 
- Assumptions made: How missing data handled, outliers treated 
- Validation performed: Cross-checks completed 
- Confidence level: High/Medium/Low data quality rating per variable 
Data Quality Score
Create Simple Score:
Scale:
- A (90-100%): High confidence 
- B (75-89%): Good, minor concerns 
- C (60-74%): Moderate concerns 
- D (<60%): Significant quality issues 
When to Reject or Fix Data
Red Lines: Don't Model If...
❌ Less than 26 observations (insufficient statistical power)
❌ KPI has unexplained structural break (model will fail)
❌ >30% missing data in key variables (too many gaps)
❌ Data sources unverifiable (can't validate accuracy)
❌ Timing misalignment >4 weeks (marketing/sales out of sync)
Yellow Flags: Fix Before Modeling
⚠️ 5-20% missing data → Interpolate or fill with zeros
⚠️ Minor outliers → Investigate and cap if data errors
⚠️ Some inconsistency → Standardize definitions
⚠️ Moderate multicollinearity → Consider combining variables
⚠️ Limited variance → Extend time window or exclude variable
Continuous Improvement
Quarterly Data Quality Review
Agenda:
- Review data collection processes 
- Assess new data sources 
- Validate existing sources still accurate 
- Update data dictionary 
- Address known issues 
- Plan improvements 
Feedback Loop
Process:
- Model results reveal unexpected findings 
- Investigate potential data issues 
- Validate data against source systems 
- Correct if errors found 
- Document learnings 
- Improve processes to prevent recurrence 
Example: Model shows TV coefficient negative → Investigate → Discover timing lag in data → Fix → Re-model → Positive coefficient (expected)
Summary
Key Takeaways:
📊 Quality over quantity - 52 weeks of clean data beats 104 weeks of messy data
🔍 Always validate - inspect, visualize, cross-check before modeling
📋 Document everything - data sources, assumptions, decisions
🎯 Source from authority - use official systems, not estimates
✅ Run checklist - systematic validation prevents errors
🔄 Maintain continuously - data quality is ongoing, not one-time
⚠️ Know your limitations - be transparent about data quality concerns
🛠️ Fix issues early - addressing data problems saves time later
Remember: The best statistical model cannot overcome poor data. Invest in data quality first, modeling sophistication second. Your MMM is only as good as the data that feeds it!