Data Quality Best Practices
Ensuring Reliable and Accurate MMM Results
The quality of your Marketing Mix Model is directly tied to the quality of your input data. This page provides comprehensive best practices for data preparation, validation, and maintenance to ensure your MMM delivers accurate, actionable insights.
The Data Quality Principle
"Garbage In, Garbage Out"
Core Truth: No amount of sophisticated statistical modeling can overcome poor data quality. A model built on flawed data will produce unreliable results, leading to poor business decisions.
Investment in Data Quality: Spending 70% of time on data preparation and 30% on modeling is better than the reverse.
Data Quality Dimensions
Accuracy
Definition: Data correctly represents the real-world phenomena being measured
Questions to Ask:
- Are marketing spend numbers from authoritative sources (invoices, billing statements)? 
- Is KPI data from validated systems (e.g., verified sales reports)? 
- Have data entry errors been checked and corrected? 
Red Flags:
- Round numbers that seem "too perfect" (e.g., exactly $10,000 every week) 
- Sudden unexplained spikes or drops 
- Values that contradict known business events 
Completeness
Definition: All necessary data points are present
Questions to Ask:
- Are there gaps in the time series (missing weeks/months)? 
- Do we have data for all major marketing channels? 
- Are seasonality and external factors included? 
Red Flags:
- Missing values (blank cells) 
- Entire channels with zero spend for extended periods (when you know campaigns ran) 
- Sporadic data availability 
Consistency
Definition: Data follows the same format, units, and definitions throughout
Questions to Ask:
- Are all spend values in the same currency? 
- Is the time frequency consistent (all weekly OR all monthly)? 
- Do channel definitions remain constant over time? 
Red Flags:
- Switching from weekly to monthly mid-dataset 
- Channel names changing (TV_Spend becomes Television_Advertising) 
- Unit changes (thousands to actual values mid-series) 
Validity
Definition: Data values fall within acceptable ranges
Questions to Ask:
- Are there negative values where they don't belong (e.g., negative spend doesn't make sense)? 
- Do percentages stay between 0-100%? 
- Are outliers explainable by real business events? 
Red Flags:
- Impossible values (110% discount, negative revenue) 
- Extreme outliers with no explanation 
- Values outside expected business ranges 
Timeliness
Definition: Data represents the correct time period
Questions to Ask:
- Are dates aligned correctly? 
- Do marketing investments align with campaign timing? 
- Is there proper lag between spend and sales (if expected)? 
Red Flags:
- TV spend recorded in wrong week 
- Mismatch between campaign calendar and data 
- Timing inconsistencies 
Pre-Modeling Data Checks
Check 1: Inspect Summary Statistics
For Each Variable:
Calculate and review:
- Mean: Average value 
- Median: Middle value 
- Min/Max: Range of values 
- Standard Deviation: Variability 
Red Flag Examples: a standard deviation of zero (no variation), a mean far from the median (heavy skew), or a maximum several orders of magnitude above typical values (possible unit or data entry error).
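If your data lives in a spreadsheet, a few lines of pandas can produce these statistics for every variable at once. This is a minimal sketch only; the file name and column names are hypothetical placeholders for your own dataset.

```python
import pandas as pd

# Hypothetical weekly MMM dataset: one row per week, one column per variable
df = pd.read_excel("MMM_Data_2024Q1_v1.xlsx", parse_dates=["Week"])

# Mean, median, min/max, and standard deviation for every numeric column
summary = df.describe().T[["mean", "50%", "min", "max", "std"]]
summary = summary.rename(columns={"50%": "median"})
print(summary)

# Simple red-flag scan: zero variance or a max far above the typical value
for col in df.select_dtypes("number").columns:
    if df[col].std() == 0:
        print(f"{col}: no variation - check the source")
    elif df[col].median() > 0 and df[col].max() > 10 * df[col].median():
        print(f"{col}: max is >10x the median - possible outlier or unit issue")
```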
Check 2: Visualize Time Series
Create Line Charts for:
- KPI over time 
- Each marketing channel over time 
- Control variables over time 
Look For:
- Consistent patterns 
- Explainable seasonality 
- Sudden structural breaks 
- Data entry errors (obvious spikes) 
Example Issues to Spot:
Sales Chart:
Normal: ~~~∿~~~∿~~~∿~~~
Issue:  ~~~∿~~~↑~~~∿~~~  (unexplained spike - data error?)
Check 3: Correlation Analysis
Calculate Correlations Between:
- KPI and each marketing channel 
- Marketing channels with each other 
Expected Patterns:
- KPI should positively correlate with most marketing channels 
- Some channels may correlate with each other (multicollinearity - manageable) 
Red Flags:
- Negative correlation between marketing and KPI (when positive expected) 
- Perfect correlation (r = 1.0) between two channels (likely same data duplicated) 
- Zero correlation when strong relationship expected 
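A correlation matrix covers both checks in one pass. The sketch below is illustrative; the file name, KPI column, and channel columns are hypothetical and should be replaced with your own.

```python
import pandas as pd

df = pd.read_excel("MMM_Data_2024Q1_v1.xlsx")   # hypothetical file
kpi = "Sales"                                    # hypothetical column names
channels = ["TV_Spend", "Radio_Spend", "Search_Spend", "Social_Spend"]

corr = df[[kpi] + channels].corr()
print(corr[kpi].drop(kpi))  # KPI vs. each channel: expect mostly positive values

# Flag near-perfect channel-to-channel correlations (possible duplicated data)
for i, a in enumerate(channels):
    for b in channels[i + 1:]:
        r = corr.loc[a, b]
        if abs(r) > 0.95:
            print(f"{a} vs {b}: r = {r:.2f} - check for duplicated or always co-flighted data")
```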
Check 4: Missing Data Assessment
Identify:
- How many missing values per variable? 
- Are missings random or systematic? 
- Which time periods have missing data? 
Decision Rules: less than 5% missing can usually be interpolated; 5-20% should be interpolated or filled with zeros and documented; more than 30% missing in a key variable means the data is not ready for modeling (see the red lines later on this page). A quick way to quantify the gaps is sketched below.
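This minimal pandas sketch counts missing values per variable and lists the affected weeks; the file name and the Week column are hypothetical placeholders.

```python
import pandas as pd

df = pd.read_excel("MMM_Data_2024Q1_v1.xlsx", parse_dates=["Week"])  # hypothetical file

# Missing values per variable, as a count and a share of all rows
missing = pd.DataFrame({
    "missing": df.isna().sum(),
    "missing_%": (df.isna().mean() * 100).round(1),
})
print(missing)

# Which weeks have any missing data? Helps spot systematic (vs. random) gaps
gap_weeks = df.loc[df.isna().any(axis=1), "Week"]
print(gap_weeks.to_list())
```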
Check 5: Outlier Detection
Methods:
Z-Score Method: Values more than 3 standard deviations from mean are outliers
IQR Method: Values beyond Q1 - 1.5×IQR or Q3 + 1.5×IQR
Visual Method: Create box plots for each variable
Response to Outliers:
- Verify: Is this a data error or real event? 
- Real Event: Keep data, possibly add dummy variable for that period 
- Data Error: Correct the value 
- Uncertain: Consider excluding that observation or winsorizing (capping) 
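Both the z-score and IQR methods above take only a few lines in pandas. The sketch is a rough illustration; the file name and the Sales column are hypothetical.

```python
import pandas as pd

df = pd.read_excel("MMM_Data_2024Q1_v1.xlsx", parse_dates=["Week"])  # hypothetical file
col = "Sales"  # variable to screen

# Z-score method: more than 3 standard deviations from the mean
z = (df[col] - df[col].mean()) / df[col].std()
z_outliers = df.loc[z.abs() > 3, ["Week", col]]

# IQR method: beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR
q1, q3 = df[col].quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = df.loc[(df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr), ["Week", col]]

print(z_outliers)
print(iqr_outliers)
```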
Data Preparation Best Practices
Marketing Spend Data
Source from Authoritative Systems
Best Sources:
✅ Billing invoices from media vendors
✅ Finance/accounting systems
✅ Media agency reports (reconciled)
✅ Platform spend data (Google Ads, Facebook Ads API)
Avoid:
❌ Estimates or approximations
❌ Budget allocations (use actual spend)
❌ Un-reconciled preliminary numbers
Handle Zero Spend Correctly
When Channel Has Zero Spend:
- Enter 0 explicitly (don't leave blank)
- Ensure this reflects reality (campaign off, not missing data) 
Why It Matters: Model needs to know when marketing was inactive to isolate its effect
Aggregate to Consistent Level
Ensure Consistency:
- All channels at same geographic level (national vs. regional) 
- All channels at same time period (weekly totals, not daily within weeks) 
- Units consistent (all dollars, or all thousands of dollars) 
Example Issue:
❌ TV spend in thousands, Digital spend in actual dollars
✅ All spend in actual dollars
KPI Data
Use Clean, Validated Data
Best Practices:
- Source from validated sales/transaction systems 
- Remove returns and cancellations if not part of model scope 
- Ensure consistent definition across time periods 
- Align timing (sales attributed to correct week) 
Consider KPI Transformations
Log Transformation: Use when KPI has exponential growth or high variance
Log_Sales = ln(Sales)
Benefits:
- Stabilizes variance 
- Interprets coefficients as elasticities 
- Better fit for multiplicative models 
When to Use:
- KPI grows exponentially over time 
- Wide range of values (min = 1000, max = 1000000) 
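A minimal sketch of the transformation, assuming the KPI column is called Sales and the file name is a placeholder:

```python
import numpy as np
import pandas as pd

df = pd.read_excel("MMM_Data_2024Q1_v1.xlsx")  # hypothetical file

# Natural log of the KPI; requires strictly positive values
assert (df["Sales"] > 0).all(), "Log transform needs positive sales values"
df["Log_Sales"] = np.log(df["Sales"])

# In a log-log specification, a channel coefficient reads as an elasticity:
# a 1% change in spend is associated with a coefficient% change in sales.
```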
Address Seasonality
Options:
Option 1: Include Seasonal Variables Add month dummies or quarter indicators
Option 2: Deseasonalize KPI Remove seasonal component before modeling (advanced)
Recommendation: Option 1 (include seasonality variables) is preferred - transparent and flexible
Seasonality Variables
Month Dummies (Recommended)
Create 11 binary variables (omit one month as reference):
Why 11 not 12? Avoid multicollinearity - one month serves as baseline
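One way to build the dummies, sketched with a hypothetical weekly file and a Week date column; drop_first=True omits month 1 (January) as the baseline:

```python
import pandas as pd

df = pd.read_excel("MMM_Data_2024Q1_v1.xlsx", parse_dates=["Week"])  # hypothetical file

# 11 binary variables: one per month, with January dropped as the reference
month_dummies = pd.get_dummies(df["Week"].dt.month, prefix="Month", drop_first=True).astype(int)
df = pd.concat([df, month_dummies], axis=1)
```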
Holiday Indicators
Create binary flags for major holidays:
Benefits:
- Captures demand spikes 
- Prevents misattribution to marketing 
- Improves model fit 
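A rough sketch of one way to flag holiday weeks; the dates and column names are hypothetical and should come from your own campaign/holiday calendar.

```python
import pandas as pd

df = pd.read_excel("MMM_Data_2024Q1_v1.xlsx", parse_dates=["Week"])  # hypothetical file

# Hypothetical holiday week-start dates (e.g., Black Friday, Christmas weeks)
holiday_weeks = pd.to_datetime(["2024-11-25", "2024-12-23"])

df["Holiday_Flag"] = df["Week"].isin(holiday_weeks).astype(int)
```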
External Variables
Economic Indicators
Consider Including:
- Consumer confidence index 
- Unemployment rate 
- GDP growth 
- Industry-specific indices 
When to Include:
- B2C products sensitive to economic conditions 
- Luxury goods 
- Discretionary spending categories 
Competitor Activity
Data Sources:
- Industry reports 
- Competitive intelligence firms 
- Media monitoring services 
- Web scraping (within legal limits) 
Variables:
- Competitor advertising spend index 
- Competitor price index 
- New competitor entries 
Weather Data
When Relevant:
- Retail (foot traffic affected by weather) 
- Food & beverage 
- Seasonal products 
- Outdoor activities 
Variables:
- Temperature 
- Precipitation 
- Severe weather events 
Handling Common Data Issues
Issue 1: Sparse Data (Many Zeros)
Scenario: Channel only active periodically (e.g., TV campaigns are 4-week flights)
Implications:
- Harder to estimate effects 
- Lower statistical power 
- Potential confounding with seasonality 
Solutions:
- Aggregate channels: Combine similar sparse channels 
- Extend data collection: Gather more historical data 
- Use Bayesian priors: Inform model with external knowledge 
- Accept limitations: Lower confidence in sparse channel estimates 
Issue 2: Highly Correlated Channels
Scenario: Two channels always move together (e.g., TV and Radio run simultaneously)
Symptoms:
- High VIF (>5) 
- Unstable coefficients 
- Wide confidence intervals 
Solutions:
- Combine into one variable: Create a TV_Plus_Radio composite 
- Use one as control: Model only TV, include Radio as control 
- Regularization: Use Bayesian priors to constrain estimates 
- Gather more data: Periods where channels vary independently 
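Checking VIF and building a composite takes a few lines with statsmodels and pandas. This is a sketch under assumed column names; the VIF > 5 threshold matches the symptom listed above.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_excel("MMM_Data_2024Q1_v1.xlsx")          # hypothetical file
channels = ["TV_Spend", "Radio_Spend", "Search_Spend"]  # hypothetical columns

# VIF per channel (values above ~5 indicate problematic multicollinearity)
X = sm.add_constant(df[channels].dropna())
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif.drop("const"))

# Option: combine two co-moving channels into one composite variable
df["TV_Plus_Radio"] = df["TV_Spend"] + df["Radio_Spend"]
```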
Issue 3: Structural Breaks
Scenario: Business change affects baseline (e.g., new product launch, market expansion)
Symptoms:
- Sudden shift in KPI level 
- Model diagnostics fail 
- Residuals show patterns 
Solutions:
- Add dummy variable: Post_Launch = 1 after structural change 
- Interaction terms: Allow marketing effects to differ pre/post 
- Split data: Model pre and post periods separately 
- Control explicitly: Include variable for the structural factor 
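A minimal sketch of the dummy and interaction options, assuming a hypothetical launch date and column names:

```python
import pandas as pd

df = pd.read_excel("MMM_Data_2024Q1_v1.xlsx", parse_dates=["Week"])  # hypothetical file
launch_date = pd.Timestamp("2023-06-01")  # hypothetical structural-break date

# Step dummy: 1 from the launch week onward, 0 before
df["Post_Launch"] = (df["Week"] >= launch_date).astype(int)

# Optional interaction: lets the TV effect differ before vs. after the break
df["TV_x_Post_Launch"] = df["TV_Spend"] * df["Post_Launch"]
```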
Issue 4: Data Frequency Mismatch
Scenario: Different variables available at different frequencies
Example:
- Sales data: Weekly 
- TV spend: Weekly 
- Economic indicators: Monthly 
Solutions:
- Interpolation: Distribute monthly value across weeks 
- Forward fill: Repeat monthly value for each week in that month 
- Aggregate: Convert all data to monthly (if feasible) 
Recommendation: Forward fill for economic/external variables (they change slowly)
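One way to implement the forward fill is to align each week with the most recent monthly reading. The sketch below uses pandas merge_asof; both file names and key columns are hypothetical.

```python
import pandas as pd

weekly = pd.read_excel("MMM_Data_2024Q1_v1.xlsx", parse_dates=["Week"])      # hypothetical weekly data
monthly = pd.read_excel("Economic_Indicators.xlsx", parse_dates=["Month"])   # hypothetical monthly data

# Forward fill: each week inherits the latest monthly value on or before that week
merged = pd.merge_asof(
    weekly.sort_values("Week"),
    monthly.sort_values("Month"),
    left_on="Week",
    right_on="Month",
    direction="backward",
)
```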
Issue 5: Insufficient Variation
Scenario: Variable barely changes over time (e.g., price constant at $49.99 all year)
Implications:
- Model cannot estimate effect 
- Variable adds noise without signal 
- Degrees of freedom wasted 
Solutions:
- Exclude variable: If truly constant, it's absorbed in intercept 
- Combine with other variables: Create interaction or ratio 
- Extend time window: Gather data across periods with more variation 
Data Transformation Guidelines
When to Transform Variables
Log Transformation:
- Variable has wide range (spanning multiple orders of magnitude) 
- Relationship with KPI is multiplicative, not additive 
- Elasticity interpretation desired 
Square Root:
- Reduces impact of outliers 
- Moderate skewness 
Standardization:
- Want coefficients on comparable scale 
- Comparing effect sizes across very different units 
When NOT to Transform
Avoid Transformations If:
- Data already well-behaved (normal-ish distribution) 
- Interpretation becomes too complex 
- Business stakeholders need direct interpretation 
Remember: MixModeler handles saturation curves and adstock - these are specialized transformations. Standard transformations (log, sqrt) are optional.
Data Update and Maintenance
Regular Data Refresh Cycle
Recommended Schedule:
Monthly Refresh:
- Add latest week/month of data 
- Re-run models with updated data 
- Check for any anomalies 
Quarterly Deep Dive:
- Validate data sources 
- Review model performance 
- Update priors if needed 
- Assess new variables 
Annual Review:
- Complete data audit 
- Evaluate need for structural changes 
- Update documentation 
Version Control
Best Practices:
File Naming:
MMM_Data_2024Q1_v1.xlsx
MMM_Data_2024Q1_v2.xlsx
MMM_Data_2024Q2_v1.xlsx
Change Log: Maintain a separate sheet or document tracking:
- Date of change 
- What changed 
- Why (data correction, new source, etc.) 
- Who made the change 
Data Documentation
Create Data Dictionary:
Why It Matters:
- Reproducibility 
- Knowledge transfer 
- Troubleshooting 
- Stakeholder transparency 
Data Quality Checklist
Before finalizing data for modeling:
Accuracy
Completeness
Consistency
Validity
Timeliness
Advanced Data Quality Techniques
Statistical Process Control
Create Control Charts: Monitor variables over time to detect:
- Out-of-control points 
- Trending patterns 
- Systematic shifts 
UCL/LCL Limits:
Upper Control Limit = Mean + 3×StdDev
Lower Control Limit = Mean - 3×StdDev
Action: Investigate points outside control limits
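Computing the control limits and flagging out-of-control weeks is straightforward; the sketch below assumes hypothetical Week and Sales columns.

```python
import pandas as pd

df = pd.read_excel("MMM_Data_2024Q1_v1.xlsx", parse_dates=["Week"])  # hypothetical file
col = "Sales"

mean, std = df[col].mean(), df[col].std()
ucl = mean + 3 * std  # Upper Control Limit
lcl = mean - 3 * std  # Lower Control Limit

# Weeks outside the control limits: investigate each one
out_of_control = df.loc[(df[col] > ucl) | (df[col] < lcl), ["Week", col]]
print(out_of_control)
```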
Data Reconciliation
Cross-Check Sources:
- Marketing spend: Media agency vs. finance system 
- Sales: CRM vs. accounting vs. point-of-sale 
- Digital metrics: Platform reporting vs. Google Analytics 
Acceptable Variance:
- Within 5%: Likely timing differences, acceptable 
- 5-10%: Investigate discrepancies 
- >10%: Resolve before modeling 
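A rough sketch of how the cross-check could be automated for one spend series; the file names, the shared Week key, and the TV_Spend column in both sources are assumptions.

```python
import pandas as pd

# Hypothetical spend totals from two sources, keyed by week
agency = pd.read_excel("Agency_Spend.xlsx", parse_dates=["Week"])
finance = pd.read_excel("Finance_Spend.xlsx", parse_dates=["Week"])

merged = agency.merge(finance, on="Week", suffixes=("_agency", "_finance"))
merged["variance_pct"] = (
    (merged["TV_Spend_agency"] - merged["TV_Spend_finance"]).abs()
    / merged["TV_Spend_finance"] * 100
)

# Apply the thresholds above
print(merged.loc[merged["variance_pct"] > 10, "Week"])             # resolve before modeling
print(merged.loc[merged["variance_pct"].between(5, 10), "Week"])   # investigate
```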
Holdout Sample Validation
Approach:
- Set aside last 8-12 weeks of data (don't use in model) 
- Build model on remaining data 
- Predict holdout period 
- Compare predictions to actuals 
Benefits:
- Validates data quality 
- Tests model performance 
- Identifies overfitting 
- Builds stakeholder confidence 
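The sketch below shows the holdout mechanics with a simple OLS stand-in for whatever model you actually use (e.g., your MixModeler specification); the file name, channel columns, and 12-week holdout are assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_excel("MMM_Data_2024Q1_v1.xlsx", parse_dates=["Week"]).sort_values("Week")
channels = ["TV_Spend", "Search_Spend"]  # hypothetical columns

# Hold out the last 12 weeks; fit on the rest
holdout_weeks = 12
train, holdout = df.iloc[:-holdout_weeks], df.iloc[-holdout_weeks:]

# Simple OLS stand-in for the real model
X_train = sm.add_constant(train[channels])
model = sm.OLS(train["Sales"], X_train).fit()

X_hold = sm.add_constant(holdout[channels], has_constant="add")
predictions = model.predict(X_hold)

# Compare predictions to actuals over the holdout period
mape = np.mean(np.abs((holdout["Sales"] - predictions) / holdout["Sales"])) * 100
print(f"Holdout MAPE: {mape:.1f}%")
```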
Common Data Quality Pitfalls
Pitfall 1: Using Budget Instead of Actual Spend
Problem: Budgets don't reflect actual spending patterns
Impact:
- Overestimates effects when underspent 
- Underestimates effects when overspent 
- Wrong attribution timing 
Fix: Always use actual spend from invoices or billing systems
Pitfall 2: Inconsistent Channel Definitions
Problem: TV spend includes digital video some months, excludes it others
Impact:
- Artificially high variance 
- Misleading coefficient estimates 
- Incomparable periods 
Fix: Maintain rigorous, consistent channel taxonomy across entire time series
Pitfall 3: Ignoring Data Lags
Problem: TV spend recorded when invoice paid (4 weeks after airing)
Impact:
- Wrong timing attribution 
- Reduced statistical significance 
- Misleading adstock estimates 
Fix: Align data to when marketing actually occurred (flight dates, not payment dates)
Pitfall 4: Aggregating Too Much
Problem: Combining all digital channels into "Digital" when they behave differently
Impact:
- Loss of actionable insights 
- Averaging out different ROIs 
- Can't optimize within digital 
Fix: Disaggregate to actionable decision level (Display, Search, Social separately)
Pitfall 5: Not Adjusting for Inflation
Problem: Using nominal dollars across multi-year period
Impact:
- Artificial growth trends 
- Misleading year-over-year comparisons 
- Biased coefficient estimates 
Fix: Convert to real (inflation-adjusted) dollars using CPI or category-specific price indices
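A minimal sketch of CPI deflation, assuming a hypothetical monthly CPI file with the base period indexed to 100:

```python
import pandas as pd

df = pd.read_excel("MMM_Data_2024Q1_v1.xlsx", parse_dates=["Week"])  # hypothetical spend data
cpi = pd.read_excel("CPI.xlsx", parse_dates=["Month"])               # hypothetical CPI series (base = 100)

# Attach the most recent CPI reading to each week
merged = pd.merge_asof(df.sort_values("Week"), cpi.sort_values("Month"),
                       left_on="Week", right_on="Month", direction="backward")

# Convert nominal spend to real (base-period) dollars
merged["TV_Spend_real"] = merged["TV_Spend"] * 100 / merged["CPI"]
```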
Data Quality Metrics
Quantitative Metrics to Track
Completeness Rate:
Completeness = (Non-Missing Values / Total Values) × 100%
Target: >95%
Accuracy Rate (When Ground Truth Available):
Accuracy = (Correct Values / Total Values) × 100%
Target: >98%
Timeliness:
Data Lag = Days Between Event Date and Data Available Date
Target: <7 days for weekly data
Consistency Score:
Consistency = (Matched Values Across Sources / Total Values) × 100%
Target: >95%
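Two of these metrics are easy to track programmatically. The sketch below computes completeness per variable and, under the assumption that a hypothetical Loaded_Date column records when each row was delivered, the data lag in days.

```python
import pandas as pd

df = pd.read_excel("MMM_Data_2024Q1_v1.xlsx", parse_dates=["Week"])  # hypothetical file

# Completeness rate per variable (target: >95%)
completeness = (df.notna().mean() * 100).round(1)
print(completeness)

# Timeliness: days between the event week and when the data became available
# (assumes a hypothetical 'Loaded_Date' column)
if "Loaded_Date" in df.columns:
    data_lag = (pd.to_datetime(df["Loaded_Date"]) - df["Week"]).dt.days
    print(f"Median data lag: {data_lag.median():.0f} days (target: <7 for weekly data)")
```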
Tools and Techniques for Data Quality
Excel Data Validation
Set Rules:
- Select data column 
- Data → Data Validation 
- Set criteria (e.g., "Whole number", "Between 0 and 1000000") 
- Excel will flag violations 
Benefits:
- Prevents data entry errors 
- Enforces valid ranges 
- Visual feedback 
Conditional Formatting for Outliers
Highlight Unusual Values:
- Select data range 
- Home → Conditional Formatting → New Rule 
- Format cells where: =ABS(B2-AVERAGE($B$2:$B$100))/STDEV($B$2:$B$100)>3 
- Set format (red fill) 
Result: Outliers (>3 standard deviations) highlighted automatically
Pivot Tables for Quick Checks
Use Cases:
- Sum spend by channel (verify totals) 
- Count observations by month (detect gaps) 
- Average KPI by quarter (spot trends) 
Quick QA Method: Create pivot of channel spend → look for unexpectedly high/low totals
Charts for Visual Inspection
Create:
- Line chart of KPI over time (spot anomalies) 
- Scatter plots of marketing vs. KPI (check relationships) 
- Box plots for each variable (identify outliers) 
Visual inspection often reveals issues statistical tests miss
Data Quality Communication
Reporting Data Quality to Stakeholders
Include in Documentation:
- Data sources: Where each variable comes from 
- Known limitations: Missing periods, data quality concerns 
- Assumptions made: How missing data handled, outliers treated 
- Validation performed: Cross-checks completed 
- Confidence level: High/Medium/Low data quality rating per variable 
Data Quality Score
Create Simple Score:
Scale:
- A (90-100%): High confidence 
- B (75-89%): Good, minor concerns 
- C (60-74%): Moderate concerns 
- D (<60%): Significant quality issues 
When to Reject or Fix Data
Red Lines: Don't Model If...
❌ Less than 26 observations (insufficient statistical power)
❌ KPI has unexplained structural break (model will fail)
❌ >30% missing data in key variables (too many gaps)
❌ Data sources unverifiable (can't validate accuracy)
❌ Timing misalignment >4 weeks (marketing/sales out of sync)
Yellow Flags: Fix Before Modeling
⚠️ 5-20% missing data → Interpolate or fill with zeros
⚠️ Minor outliers → Investigate and cap if data errors
⚠️ Some inconsistency → Standardize definitions
⚠️ Moderate multicollinearity → Consider combining variables
⚠️ Limited variance → Extend time window or exclude variable
Continuous Improvement
Quarterly Data Quality Review
Agenda:
- Review data collection processes 
- Assess new data sources 
- Validate existing sources still accurate 
- Update data dictionary 
- Address known issues 
- Plan improvements 
Feedback Loop
Process:
- Model results reveal unexpected findings 
- Investigate potential data issues 
- Validate data against source systems 
- Correct if errors found 
- Document learnings 
- Improve processes to prevent recurrence 
Example: Model shows TV coefficient negative → Investigate → Discover timing lag in data → Fix → Re-model → Positive coefficient (expected)
Summary
Key Takeaways:
📊 Quality over quantity - 52 weeks of clean data beats 104 weeks of messy data
🔍 Always validate - inspect, visualize, cross-check before modeling
📋 Document everything - data sources, assumptions, decisions
🎯 Source from authority - use official systems, not estimates
✅ Run checklist - systematic validation prevents errors
🔄 Maintain continuously - data quality is ongoing, not one-time
⚠️ Know your limitations - be transparent about data quality concerns
🛠️ Fix issues early - addressing data problems saves time later
Remember: The best statistical model cannot overcome poor data. Invest in data quality first, modeling sophistication second. Your MMM is only as good as the data that feeds it!