Data Preparation Tips
Overview
Data quality is the foundation of reliable marketing mix modeling. Well-prepared data leads to faster model building, more accurate results, and greater confidence in recommendations. This guide provides practical tips for preparing your data for MixModeler.
Data Structure Essentials
Excel File Format
Required Structure:
Date | Revenue | TV_Spend | Digital_Spend | Print_Spend
2024-01-07 | 125000 | 15000 | 8000 | 3000
2024-01-14 | 132000 | 18000 | 9500 | 3200Column Order:
First column: Date (required)
Second column: KPI variable (recommended but not required)
Remaining columns: Marketing and control variables
Key Rules:
One sheet only (MixModeler reads first sheet)
No blank rows or columns
No merged cells
Headers in first row only
Data starts in row 2
Date Formatting
Supported Formats:
YYYY-MM-DD(recommended): 2024-01-07MM/DD/YYYY: 01/07/2024DD/MM/YYYY: 07/01/2024Excel date serial numbers
Best Practice: Use YYYY-MM-DD format consistently
Common Mistakes:
Mixed date formats in same column
Text dates ("January 7, 2024")
Dates as text instead of dates
Inconsistent granularity (mix of weeks and months)
Fix in Excel:
Select date column
Format → Cells → Date → YYYY-MM-DD
Verify all dates converted correctly
Column Naming
Good Names:
TV_SpendDigital_ImpressionsEmail_ClicksPromo_Flag
Avoid:
Spaces: "TV Spend" → Use underscores: "TV_Spend"
Special characters: "TV (Broadcast)" → Use: "TV_Broadcast"
Numbers only: "2024_Campaign" → OK if descriptive
Very long names: "Television_Advertising_Total_Weekly_Spend" → Use: "TV_Spend"
Convention: Use Channel_Metric format (e.g., Facebook_Spend, Google_Clicks)
Data Quality Checks
Completeness
Check for Missing Values:
In Excel, use filter to find blanks
For each column, count non-blank cells
Decide how to handle missing data:
Fill with zero: For spending (no campaign that week)
Interpolate: For continuous variables (average of neighbors)
Remove row: If critical data missing (KPI)
Flag: Create indicator variable for missing periods
Target: <5% missing values per variable, <2% missing in KPI
Consistency
Verify Data Ranges:
Red Flags:
Negative values (unless intentional, like discounts)
Extreme outliers (100x typical value)
Sudden jumps or drops
Values outside known constraints
Excel Formula to Check:
Accuracy
Reconcile Totals:
Sum of weekly marketing spend should match monthly budgets
Total revenue should match financial reports
Campaign dates align with marketing calendar
Cross-Validate:
Compare data from multiple sources
Verify with stakeholders (marketing, finance)
Spot-check random samples
Document Discrepancies:
Note any data quality issues
Record assumptions made
Flag periods with known problems
Common Data Issues and Solutions
Issue 1: Outliers
Identify:
Values >3 standard deviations from mean
Sudden spikes or drops in time series
Visual inspection in charts
Investigate:
Data entry error? (Extra zero, decimal place)
Real business event? (Super Bowl, Black Friday)
System glitch? (Reporting bug)
Solutions:
If Error: Correct to true value
If Real Event: Keep, possibly create indicator variable
If Can't Determine: Consider capping at 2-3 standard deviations
Example:
Issue 2: Zero Inflation
Problem: Many zero values in marketing variables
Common in: Campaigns that run intermittently
Example:
Solutions:
Keep zeros: If truly no spend
Create binary flag: Campaign_Active (0/1)
Split variable: Print_Spend_Active (non-zero periods only)
Use only active periods: Filter model to weeks with spend >0
Issue 3: Highly Correlated Variables
Identify: Correlation >0.8 between predictors
Check in MixModeler:
Go to Variable Charts
Generate correlation heatmap
Look for dark red cells (high correlation)
Solutions:
Remove one variable: Keep more important one
Combine variables: Create sum or average
Principal components: Outside MixModeler, then import
Accept correlation: If both theoretically important
Example:
Issue 4: Changing Metrics
Problem: Variable definition changes mid-period
Example:
Solutions:
Split variable: Google_Clicks_Period1, Google_Clicks_Period2
Adjust historical data: Retroactively calculate consistent metric
Use overlap period: Derive conversion factor, adjust
Start fresh: Model only consistent period (weeks 27-52)
Issue 5: Seasonality Artifacts
Problem: Metrics affected by calendar (holidays, month-end)
Adjust for:
Number of days in week/month
Holiday weeks (Thanksgiving, Christmas)
Weekday composition (5 vs 6 day weeks)
Create Control Variables:
Holiday_Flag: 1 for holiday weeks, 0 otherwiseDays_in_Period: Actual number of daysMonthorQuarter: Seasonal dummy variables
Variable Preparation Strategies
Marketing Spend Variables
Recommendations:
Include even if spend is zero some periods
Keep spending in original currency units
Don't pre-normalize or standardize (MixModeler handles this)
Consider log transformation if range is very wide (>100x)
Adstock Consideration:
Most media should have adstock applied
Test multiple decay rates (30%, 50%, 70%)
Longer-lasting effects: Print, TV (50-70%)
Shorter effects: Digital, Radio (30-50%)
Impression/Reach Variables
Best Practices:
Use in addition to (not instead of) spend
Helps model account for efficiency changes
Consider Rate metrics (CPM, CPC) as separate variables
Example Variables:
Benefit: Model can separate volume effects from efficiency
Campaign/Event Variables
Binary Flags (0/1):
Product launches
Major promotions
Rebranding campaigns
Competitive events
Continuous Variables:
Promotion depth (% discount)
Number of SKUs on promotion
Days of campaign activity
Timing:
Align dates precisely
Consider lead time (awareness before purchase)
May need lag/lead transformations
Control Variables
Always Consider:
Seasonality: Month, quarter, holiday indicators
Trend: Linear time trend, year indicator
External Factors: Weather, economic indicators
Competitive: Competitor spending (if available)
Price: Your pricing changes
Rule of Thumb: Include if theoretically important and data available
Data Granularity
Weekly vs Monthly
Weekly Data (Recommended):
More observations (52-156 data points)
Better for capturing marketing dynamics
Preferred for most MMM applications
Aligns with media planning cycles
Monthly Data:
Fewer observations (24-60 data points)
Suitable for strategic, long-term analysis
Better when media changes are slow
Easier data collection in some organizations
Trade-offs:
Sample Size
Larger (better)
Smaller
Data Collection
More work
Easier
Noise
More
Less
Granularity
High (better for optimization)
Coarse
Recommendation: Use weekly if possible, monthly if weekly data unavailable
Alignment Across Sources
Ensure Consistent Periods:
All variables must use same time aggregation
Week definitions consistent (Sunday-Saturday vs Monday-Sunday)
Month-end dates align
Fiscal vs calendar year clarity
Example Problem:
Advanced Preparation Techniques
Creating Interaction Variables
When Useful: Suspected synergy between channels
Prepare in Excel:
Save as: TV_Digital_Interaction
Test: Include in Variable Testing to verify significance
Lagged Variables
For Delayed Effects:
Create in Excel:
Example:
TV_Spend_Lag1(last week's TV spend)Promo_Lag2(promotion 2 weeks ago)
Use When: Effect occurs with delay (awareness → purchase)
Rate/Ratio Variables
Efficiency Metrics:
Benefits:
Captures efficiency changes
Controls for price variations
Helps model separate volume from value
Data Validation Checklist
Before uploading to MixModeler:
[ ] Single Excel sheet with all data
[ ] First column is Date in consistent format
[ ] KPI variable included and complete
[ ] All columns have clear, descriptive names
[ ] No blank rows or columns
[ ] No merged cells
[ ] Missing values addressed (<5% per variable)
[ ] Outliers investigated and handled
[ ] Data ranges verified (min/max make sense)
[ ] Totals reconciled with source systems
[ ] Time periods align across all variables
[ ] 26+ weeks of data (52+ recommended)
[ ] Documentation of data sources and assumptions
Best Practices Summary
Start Clean: Invest time in data quality upfront - saves hours later
Document Everything: Track data sources, transformations, and assumptions
Validate Continuously: Check data at every step, not just at upload
Keep Original: Save raw data separately, create working copy for modifications
Iterate: First upload won't be perfect - refine as you build models
Ask Questions: Verify ambiguous data with stakeholders early
Version Control: Save dated versions of data files as you make changes
Next Steps: Review Model Building Strategy for using your prepared data effectively, or see Common Pitfalls to Avoid for frequent data mistakes.
Last updated