Data Quality Best Practices
Ensuring Reliable and Accurate MMM Results
The quality of your Marketing Mix Model is directly tied to the quality of your input data. This page provides comprehensive best practices for data preparation, validation, and maintenance to ensure your MMM delivers accurate, actionable insights.
The Data Quality Principle
"Garbage In, Garbage Out"
Core Truth: No amount of sophisticated statistical modeling can overcome poor data quality. A model built on flawed data will produce unreliable results, leading to poor business decisions.
Investment in Data Quality: Spending 70% of time on data preparation and 30% on modeling is better than the reverse.
Data Quality Dimensions
Accuracy
Definition: Data correctly represents the real-world phenomena being measured
Questions to Ask:
Are marketing spend numbers from authoritative sources (invoices, billing statements)?
Is KPI data from validated systems (e.g., verified sales reports)?
Have data entry errors been checked and corrected?
Red Flags:
Round numbers that seem "too perfect" (e.g., exactly $10,000 every week)
Sudden unexplained spikes or drops
Values that contradict known business events
Completeness
Definition: All necessary data points are present
Questions to Ask:
Are there gaps in the time series (missing weeks/months)?
Do we have data for all major marketing channels?
Are seasonality and external factors included?
Red Flags:
Missing values (blank cells)
Entire channels with zero spend for extended periods (when you know campaigns ran)
Sporadic data availability
Consistency
Definition: Data follows the same format, units, and definitions throughout
Questions to Ask:
Are all spend values in the same currency?
Is the time frequency consistent (all weekly OR all monthly)?
Do channel definitions remain constant over time?
Red Flags:
Switching from weekly to monthly mid-dataset
Channel names changing (TV_Spend becomes Television_Advertising)
Unit changes (thousands to actual values mid-series)
Validity
Definition: Data values fall within acceptable ranges
Questions to Ask:
Are there negative values where they make no sense (e.g., negative spend)?
Do percentages stay between 0 and 100%?
Are outliers explainable by real business events?
Red Flags:
Impossible values (110% discount, negative revenue)
Extreme outliers with no explanation
Values outside expected business ranges
Timeliness
Definition: Data represents the correct time period
Questions to Ask:
Are dates aligned correctly?
Do marketing investments align with campaign timing?
Is there proper lag between spend and sales (if expected)?
Red Flags:
TV spend recorded in wrong week
Mismatch between campaign calendar and data
Timing inconsistencies
Pre-Modeling Data Checks
Check 1: Inspect Summary Statistics
For Each Variable:
Calculate and review:
Mean: Average value
Median: Middle value
Min/Max: Range of values
Standard Deviation: Variability
Red Flag Examples: a negative minimum on a spend variable, a mean far above the median (skew from outliers), or a standard deviation of zero (no variation at all).
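A minimal sketch of this check in Python with pandas, assuming a weekly dataset in a file mmm_data.csv with a Week date column (the file and column names here, and in the sketches that follow, are illustrative):

```python
import pandas as pd

# Load the modeling dataset (file and column names are illustrative)
df = pd.read_csv("mmm_data.csv", parse_dates=["Week"])

# Mean, std, min, max, and quartiles for every numeric column
print(df.describe())

# Quick red-flag scan on each numeric column
for col in df.select_dtypes("number").columns:
    if df[col].min() < 0:
        print(f"{col}: negative values found - check for data errors")
    if df[col].median() > 0 and df[col].max() > 10 * df[col].median():
        print(f"{col}: max is >10x the median - inspect for outliers")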
Check 2: Visualize Time Series
Create Line Charts for:
KPI over time
Each marketing channel over time
Control variables over time
Look For:
Consistent patterns
Explainable seasonality
Sudden structural breaks
Data entry errors (obvious spikes)
Example Issues to Spot: a one-week spike at many times normal levels (possible data entry error), a long perfectly flat stretch (possible placeholder values), or an abrupt level shift (possible structural break).
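A quick plotting sketch using the same illustrative mmm_data.csv:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("mmm_data.csv", parse_dates=["Week"])

# One chart per variable so spikes, gaps, and level shifts stand out
for col in ["KPI", "TV_Spend", "Search_Spend"]:  # illustrative names
    df.plot(x="Week", y=col, title=f"{col} over time", legend=False)
plt.show()
```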
Check 3: Correlation Analysis
Calculate Correlations Between:
KPI and each marketing channel
Marketing channels with each other
Expected Patterns:
KPI should positively correlate with most marketing channels
Some channels may correlate with each other (multicollinearity - manageable)
Red Flags:
Negative correlation between marketing and KPI (when positive expected)
Perfect correlation (r = 1.0) between two channels (likely same data duplicated)
Zero correlation when strong relationship expected
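A minimal correlation sketch, again with illustrative column names:

```python
import pandas as pd

df = pd.read_csv("mmm_data.csv", parse_dates=["Week"])
channels = ["TV_Spend", "Radio_Spend", "Search_Spend"]  # illustrative

# KPI vs. each channel: expect mostly positive correlations
print(df[channels].corrwith(df["KPI"]))

# Channel-vs-channel matrix: r = 1.0 off the diagonal suggests duplicated data
print(df[channels].corr().round(2))
```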
Check 4: Missing Data Assessment
Identify:
How many missing values per variable?
Are the missing values random or systematic?
Which time periods have missing data?
Decision Rules: small gaps can be interpolated or, for spend, filled with explicit zeros when the channel was genuinely inactive; 5-20% missing should be fixed before modeling (see the yellow flags later on this page); more than 30% missing in a key variable means the data is not ready to model (see the red lines).
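To quantify the gaps, a small pandas sketch on the same illustrative dataset:

```python
import pandas as pd

df = pd.read_csv("mmm_data.csv", parse_dates=["Week"])

# Count and share of missing values per variable
report = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "pct_missing": (df.isna().mean() * 100).round(1),
})
print(report)

# Which weeks are affected? Clustered gaps suggest a systematic cause
print(df.loc[df.isna().any(axis=1), "Week"])
```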
Check 5: Outlier Detection
Methods:
Z-Score Method: Values more than 3 standard deviations from mean are outliers
IQR Method: Values beyond Q1 - 1.5×IQR or Q3 + 1.5×IQR
Visual Method: Create box plots for each variable
Response to Outliers:
Verify: Is this a data error or real event?
Real Event: Keep data, possibly add dummy variable for that period
Data Error: Correct the value
Uncertain: Consider excluding that observation or winsorizing (capping)
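Both detection methods above fit in a few lines of pandas (column names illustrative):

```python
import pandas as pd

df = pd.read_csv("mmm_data.csv", parse_dates=["Week"])
s = df["KPI"]

# Z-score method: flag points more than 3 standard deviations from the mean
z_outliers = df[(s - s.mean()).abs() / s.std() > 3]

# IQR method: flag points beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = df[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

print(z_outliers[["Week", "KPI"]])
print(iqr_outliers[["Week", "KPI"]])
```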
Data Preparation Best Practices
Marketing Spend Data
Source from Authoritative Systems
Best Sources:
✅ Billing invoices from media vendors
✅ Finance/accounting systems
✅ Media agency reports (reconciled)
✅ Platform spend data (Google Ads, Facebook Ads API)
Avoid:
❌ Estimates or approximations
❌ Budget allocations (use actual spend)
❌ Un-reconciled preliminary numbers
Handle Zero Spend Correctly
When Channel Has Zero Spend:
Enter 0 explicitly (don't leave blank)
Ensure this reflects reality (campaign off, not missing data)
Why It Matters: Model needs to know when marketing was inactive to isolate its effect
Aggregate to Consistent Level
Ensure Consistency:
All channels at same geographic level (national vs. regional)
All channels at same time period (weekly totals, not daily within weeks)
Units consistent (all dollars, or all thousands of dollars)
Example Issue: TV spend logged as national weekly totals while Search spend is exported daily by region - roll everything up to the same national weekly level before combining.
KPI Data
Use Clean, Validated Data
Best Practices:
Source from validated sales/transaction systems
Remove returns and cancellations if not part of model scope
Ensure consistent definition across time periods
Align timing (sales attributed to correct week)
Consider KPI Transformations
Log Transformation: Use when KPI has exponential growth or high variance
Benefits:
Stabilizes variance
Interprets coefficients as elasticities
Better fit for multiplicative models
When to Use:
KPI grows exponentially over time
Wide range of values (min = 1000, max = 1000000)
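A minimal sketch of the transformation, using the same illustrative dataset:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("mmm_data.csv", parse_dates=["Week"])

# log1p = log(1 + x) stays defined when a week has a zero value
df["log_KPI"] = np.log1p(df["KPI"])
df["log_TV_Spend"] = np.log1p(df["TV_Spend"])
# In a log-log specification, coefficients read directly as elasticities
```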
Address Seasonality
Options:
Option 1: Include Seasonal Variables - add month dummies or quarter indicators
Option 2: Deseasonalize KPI - remove the seasonal component before modeling (advanced)
Recommendation: Option 1 (include seasonality variables) is preferred - transparent and flexible
Seasonality Variables
Month Dummies (Recommended)
Create 11 binary variables, omitting one month as the reference - see the sketch below.
Why 11, not 12? To avoid perfect multicollinearity (the dummy variable trap) - one month serves as the baseline
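A minimal pandas sketch, assuming the illustrative Week column from earlier:

```python
import pandas as pd

df = pd.read_csv("mmm_data.csv", parse_dates=["Week"])

# drop_first=True keeps 11 of the 12 dummies; the omitted month is the baseline
month = df["Week"].dt.month_name().str[:3]
dummies = pd.get_dummies(month, prefix="Month", drop_first=True, dtype=int)
df = pd.concat([df, dummies], axis=1)
```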
Holiday Indicators
Create binary flags for major holidays such as Black Friday and Christmas (see the sketch below):
Benefits:
Captures demand spikes
Prevents misattribution to marketing
Improves model fit
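One way to build such a flag in pandas, assuming Week is the week's start date (the holiday date is illustrative):

```python
import pandas as pd

df = pd.read_csv("mmm_data.csv", parse_dates=["Week"])

# Flag the week whose 7-day window contains the holiday
black_friday = pd.Timestamp("2023-11-24")
in_week = (df["Week"] <= black_friday) & (black_friday < df["Week"] + pd.Timedelta(weeks=1))
df["Holiday_BlackFriday"] = in_week.astype(int)
```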
External Variables
Economic Indicators
Consider Including:
Consumer confidence index
Unemployment rate
GDP growth
Industry-specific indices
When to Include:
B2C products sensitive to economic conditions
Luxury goods
Discretionary spending categories
Competitor Activity
Data Sources:
Industry reports
Competitive intelligence firms
Media monitoring services
Web scraping (within legal limits)
Variables:
Competitor advertising spend index
Competitor price index
New competitor entries
Weather Data
When Relevant:
Retail (foot traffic affected by weather)
Food & beverage
Seasonal products
Outdoor activities
Variables:
Temperature
Precipitation
Severe weather events
Handling Common Data Issues
Issue 1: Sparse Data (Many Zeros)
Scenario: Channel only active periodically (e.g., TV campaigns are 4-week flights)
Implications:
Harder to estimate effects
Lower statistical power
Potential confounding with seasonality
Solutions:
Aggregate channels: Combine similar sparse channels
Extend data collection: Gather more historical data
Use Bayesian priors: Inform model with external knowledge
Accept limitations: Lower confidence in sparse channel estimates
Issue 2: Highly Correlated Channels
Scenario: Two channels always move together (e.g., TV and Radio run simultaneously)
Symptoms:
High VIF (>5)
Unstable coefficients
Wide confidence intervals
Solutions:
Combine into one variable: Create a TV_Plus_Radio composite
Use one as control: Model only TV, include Radio as control
Regularization: Use Bayesian priors to constrain estimates
Gather more data: Periods where channels vary independently
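Checking VIF directly takes only a few lines with statsmodels (channel names illustrative):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("mmm_data.csv", parse_dates=["Week"])
X = sm.add_constant(df[["TV_Spend", "Radio_Spend", "Search_Spend"]])

# VIF > 5 signals problematic multicollinearity for that channel
for i, col in enumerate(X.columns):
    if col != "const":
        print(col, round(variance_inflation_factor(X.values, i), 1))
```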
Issue 3: Structural Breaks
Scenario: Business change affects baseline (e.g., new product launch, market expansion)
Symptoms:
Sudden shift in KPI level
Model diagnostics fail
Residuals show patterns
Solutions:
Add dummy variable: Post_Launch = 1 after the structural change
Interaction terms: Allow marketing effects to differ pre/post
Split data: Model pre and post periods separately
Control explicitly: Include variable for the structural factor
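A minimal sketch of the dummy and interaction approach (the launch date is illustrative):

```python
import pandas as pd

df = pd.read_csv("mmm_data.csv", parse_dates=["Week"])

# Post_Launch = 1 from the structural change onward
launch_date = pd.Timestamp("2023-06-01")
df["Post_Launch"] = (df["Week"] >= launch_date).astype(int)

# Optional interaction: lets the TV effect differ pre/post launch
df["TV_x_PostLaunch"] = df["TV_Spend"] * df["Post_Launch"]
```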
Issue 4: Data Frequency Mismatch
Scenario: Different variables available at different frequencies
Example:
Sales data: Weekly
TV spend: Weekly
Economic indicators: Monthly
Solutions:
Interpolation: Distribute monthly value across weeks
Forward fill: Repeat monthly value for each week in that month
Aggregate: Convert all data to monthly (if feasible)
Recommendation: Forward fill for economic/external variables (they change slowly)
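A forward-fill sketch in pandas, assuming a hypothetical econ_data.csv holding the monthly series:

```python
import pandas as pd

weekly = pd.read_csv("mmm_data.csv", parse_dates=["Week"]).set_index("Week")
monthly = pd.read_csv("econ_data.csv", parse_dates=["Month"]).set_index("Month")

# Carry each monthly reading forward onto the weekly dates
weekly["Consumer_Confidence"] = (
    monthly["Consumer_Confidence"].sort_index().reindex(weekly.index, method="ffill")
)
```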
Issue 5: Insufficient Variation
Scenario: Variable barely changes over time (e.g., price constant at $49.99 all year)
Implications:
Model cannot estimate effect
Variable adds noise without signal
Degrees of freedom wasted
Solutions:
Exclude variable: If truly constant, it's absorbed in intercept
Combine with other variables: Create interaction or ratio
Extend time window: Gather data across periods with more variation
Data Transformation Guidelines
When to Transform Variables
Log Transformation:
Variable has wide range (spanning multiple orders of magnitude)
Relationship with KPI is multiplicative, not additive
Elasticity interpretation desired
Square Root:
Reduces impact of outliers
Moderate skewness
Standardization:
Want coefficients on comparable scale
Comparing effect sizes across very different units
When NOT to Transform
Avoid Transformations If:
Data already well-behaved (normal-ish distribution)
Interpretation becomes too complex
Business stakeholders need direct interpretation
Remember: MixModeler handles saturation curves and adstock - these are specialized transformations. Standard transformations (log, sqrt) are optional.
Data Update and Maintenance
Regular Data Refresh Cycle
Recommended Schedule:
Monthly Refresh:
Add latest week/month of data
Re-run models with updated data
Check for any anomalies
Quarterly Deep Dive:
Validate data sources
Review model performance
Update priors if needed
Assess new variables
Annual Review:
Complete data audit
Evaluate need for structural changes
Update documentation
Version Control
Best Practices:
File Naming: include the date and a version number in every filename (e.g., mmm_data_2024-03_v2.xlsx) so working copies are never overwritten
Change Log: Maintain a separate sheet or document tracking:
Date of change
What changed
Why (data correction, new source, etc.)
Who made the change
Data Documentation
Create Data Dictionary: for each variable, record its name, definition, source system, units, frequency, date range covered, and owner
Why It Matters:
Reproducibility
Knowledge transfer
Troubleshooting
Stakeholder transparency
Data Quality Checklist
Before finalizing data for modeling:
Accuracy - spend and KPI values verified against authoritative sources; no unexplained spikes or drops
Completeness - no gaps in the time series; all major channels and control variables present
Consistency - one currency, one time frequency, and stable channel definitions throughout
Validity - no impossible values; every outlier explained or resolved
Timeliness - spend aligned to when the marketing actually ran, not when it was invoiced
Advanced Data Quality Techniques
Statistical Process Control
Create Control Charts: Monitor variables over time to detect:
Out-of-control points
Trending patterns
Systematic shifts
UCL/LCL Limits: upper and lower control limits, conventionally set at the mean ± 3 standard deviations of the variable
Action: Investigate points outside control limits
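A minimal control-limit sketch on the illustrative dataset:

```python
import pandas as pd

df = pd.read_csv("mmm_data.csv", parse_dates=["Week"])
s = df["TV_Spend"]

# Conventional control limits: mean +/- 3 standard deviations
ucl = s.mean() + 3 * s.std()
lcl = max(s.mean() - 3 * s.std(), 0)  # spend cannot go negative

out_of_control = df[(s > ucl) | (s < lcl)]
print(out_of_control[["Week", "TV_Spend"]])
```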
Data Reconciliation
Cross-Check Sources:
Marketing spend: Media agency vs. finance system
Sales: CRM vs. accounting vs. point-of-sale
Digital metrics: Platform reporting vs. Google Analytics
Acceptable Variance:
Within 5%: Likely timing differences, acceptable
5-10%: Investigate discrepancies
>10%: Resolve before modeling
Holdout Sample Validation
Approach:
Set aside last 8-12 weeks of data (don't use in model)
Build model on remaining data
Predict holdout period
Compare predictions to actuals
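A minimal sketch of the approach, using plain OLS as a stand-in for the actual MMM and illustrative column names:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mmm_data.csv", parse_dates=["Week"]).sort_values("Week")

# Hold out the last 12 weeks; fit on the rest
train, holdout = df.iloc[:-12], df.iloc[-12:]
features = ["TV_Spend", "Radio_Spend", "Search_Spend"]
model = sm.OLS(train["KPI"], sm.add_constant(train[features])).fit()

# Compare predictions to actuals (assumes KPI is never zero in the holdout)
pred = model.predict(sm.add_constant(holdout[features]))
mape = ((holdout["KPI"] - pred).abs() / holdout["KPI"]).mean() * 100
print(f"Holdout MAPE: {mape:.1f}%")
```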
Benefits:
Validates data quality
Tests model performance
Identifies overfitting
Builds stakeholder confidence
Common Data Quality Pitfalls
Pitfall 1: Using Budget Instead of Actual Spend
Problem: Budgets don't reflect actual spending patterns
Impact:
Overestimates effects when underspent
Underestimates effects when overspent
Wrong attribution timing
Fix: Always use actual spend from invoices or billing systems
Pitfall 2: Inconsistent Channel Definitions
Problem: TV spend includes digital video some months, excludes it others
Impact:
Artificially high variance
Misleading coefficient estimates
Incomparable periods
Fix: Maintain rigorous, consistent channel taxonomy across entire time series
Pitfall 3: Ignoring Data Lags
Problem: TV spend recorded when invoice paid (4 weeks after airing)
Impact:
Wrong timing attribution
Reduced statistical significance
Misleading adstock estimates
Fix: Align data to when marketing actually occurred (flight dates, not payment dates)
Pitfall 4: Aggregating Too Much
Problem: Combining all digital channels into "Digital" when they behave differently
Impact:
Loss of actionable insights
Averaging out different ROIs
Can't optimize within digital
Fix: Disaggregate to actionable decision level (Display, Search, Social separately)
Pitfall 5: Not Adjusting for Inflation
Problem: Using nominal dollars across multi-year period
Impact:
Artificial growth trends
Misleading year-over-year comparisons
Biased coefficient estimates
Fix: Convert to real (inflation-adjusted) dollars using CPI or category-specific price indices
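A deflation sketch, assuming a hypothetical CPI column holding each week's index value:

```python
import pandas as pd

df = pd.read_csv("mmm_data.csv", parse_dates=["Week"])

# Deflate to the most recent period's dollars
base_cpi = df["CPI"].iloc[-1]
for col in ["KPI", "TV_Spend", "Radio_Spend"]:
    df[col + "_real"] = df[col] * base_cpi / df["CPI"]
```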
Data Quality Metrics
Quantitative Metrics to Track
Completeness Rate: non-missing values ÷ total expected values × 100 - Target: >95%
Accuracy Rate (When Ground Truth Available): values matching the verified source ÷ values checked × 100 - Target: >98%
Timeliness: days between period end and data availability - Target: <7 days for weekly data
Consistency Score: values passing format, unit, and definition checks ÷ values checked × 100 - Target: >95%
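The completeness rate, for example, is a one-liner in pandas:

```python
import pandas as pd

df = pd.read_csv("mmm_data.csv", parse_dates=["Week"])

# Completeness rate per variable: share of non-missing cells
completeness = (df.notna().mean() * 100).round(1)
print(completeness)
print("Below 95% target:", list(completeness[completeness < 95].index))
```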
Tools and Techniques for Data Quality
Excel Data Validation
Set Rules:
Select data column
Data → Data Validation
Set criteria (e.g., "Whole number", "Between 0 and 1000000")
Excel will flag violations
Benefits:
Prevents data entry errors
Enforces valid ranges
Visual feedback
Conditional Formatting for Outliers
Highlight Unusual Values:
Select data range
Home → Conditional Formatting → New Rule
Format cells where:
=ABS(B2-AVERAGE($B$2:$B$100))/STDEV($B$2:$B$100)>3
Set format (red fill)
Result: Outliers (>3 standard deviations) highlighted automatically
Pivot Tables for Quick Checks
Use Cases:
Sum spend by channel (verify totals)
Count observations by month (detect gaps)
Average KPI by quarter (spot trends)
Quick QA Method: Create pivot of channel spend → look for unexpectedly high/low totals
Charts for Visual Inspection
Create:
Line chart of KPI over time (spot anomalies)
Scatter plots of marketing vs. KPI (check relationships)
Box plots for each variable (identify outliers)
Visual inspection often reveals issues statistical tests miss
Data Quality Communication
Reporting Data Quality to Stakeholders
Include in Documentation:
Data sources: Where each variable comes from
Known limitations: Missing periods, data quality concerns
Assumptions made: How missing data handled, outliers treated
Validation performed: Cross-checks completed
Confidence level: High/Medium/Low data quality rating per variable
Data Quality Score
Create Simple Score: rate each of the five dimensions (accuracy, completeness, consistency, validity, timeliness) from 0-100% and average them into a single grade:
Scale:
A (90-100%): High confidence
B (75-89%): Good, minor concerns
C (60-74%): Moderate concerns
D (<60%): Significant quality issues
When to Reject or Fix Data
Red Lines: Don't Model If...
❌ Less than 26 observations (insufficient statistical power)
❌ KPI has unexplained structural break (model will fail)
❌ >30% missing data in key variables (too many gaps)
❌ Data sources unverifiable (can't validate accuracy)
❌ Timing misalignment >4 weeks (marketing/sales out of sync)
Yellow Flags: Fix Before Modeling
⚠️ 5-20% missing data → Interpolate or fill with zeros
⚠️ Minor outliers → Investigate and cap if data errors
⚠️ Some inconsistency → Standardize definitions
⚠️ Moderate multicollinearity → Consider combining variables
⚠️ Limited variance → Extend time window or exclude variable
Continuous Improvement
Quarterly Data Quality Review
Agenda:
Review data collection processes
Assess new data sources
Validate existing sources still accurate
Update data dictionary
Address known issues
Plan improvements
Feedback Loop
Process:
Model results reveal unexpected findings
Investigate potential data issues
Validate data against source systems
Correct if errors found
Document learnings
Improve processes to prevent recurrence
Example: Model shows TV coefficient negative → Investigate → Discover timing lag in data → Fix → Re-model → Positive coefficient (expected)
Summary
Key Takeaways:
📊 Quality over quantity - 52 weeks of clean data beats 104 weeks of messy data
🔍 Always validate - inspect, visualize, cross-check before modeling
📋 Document everything - data sources, assumptions, decisions
🎯 Source from authority - use official systems, not estimates
✅ Run checklist - systematic validation prevents errors
🔄 Maintain continuously - data quality is ongoing, not one-time
⚠️ Know your limitations - be transparent about data quality concerns
🛠️ Fix issues early - addressing data problems saves time later
Remember: The best statistical model cannot overcome poor data. Invest in data quality first, modeling sophistication second. Your MMM is only as good as the data that feeds it!