Data Preparation Tips
Overview
Data quality is the foundation of reliable marketing mix modeling. Well-prepared data leads to faster model building, more accurate results, and greater confidence in recommendations. This guide provides practical tips for preparing your data for MixModeler.
Data Structure Essentials
Excel File Format
Required Structure:
Date        | Revenue | TV_Spend | Digital_Spend | Print_Spend
2024-01-07  | 125000  | 15000    | 8000          | 3000
2024-01-14  | 132000  | 18000    | 9500          | 3200Column Order:
- First column: Date (required) 
- Second column: KPI variable (recommended but not required) 
- Remaining columns: Marketing and control variables 
Key Rules:
- One sheet only (MixModeler reads first sheet) 
- No blank rows or columns 
- No merged cells 
- Headers in first row only 
- Data starts in row 2 
Date Formatting
Supported Formats:
- YYYY-MM-DD(recommended): 2024-01-07
- MM/DD/YYYY: 01/07/2024
- DD/MM/YYYY: 07/01/2024
- Excel date serial numbers 
Best Practice: Use YYYY-MM-DD format consistently
Common Mistakes:
- Mixed date formats in same column 
- Text dates ("January 7, 2024") 
- Dates as text instead of dates 
- Inconsistent granularity (mix of weeks and months) 
Fix in Excel:
- Select date column 
- Format → Cells → Date → YYYY-MM-DD 
- Verify all dates converted correctly 
Column Naming
Good Names:
- TV_Spend
- Digital_Impressions
- Email_Clicks
- Promo_Flag
Avoid:
- Spaces: "TV Spend" → Use underscores: "TV_Spend" 
- Special characters: "TV (Broadcast)" → Use: "TV_Broadcast" 
- Numbers only: "2024_Campaign" → OK if descriptive 
- Very long names: "Television_Advertising_Total_Weekly_Spend" → Use: "TV_Spend" 
Convention: Use Channel_Metric format (e.g., Facebook_Spend, Google_Clicks)
Data Quality Checks
Completeness
Check for Missing Values:
- In Excel, use filter to find blanks 
- For each column, count non-blank cells 
- Decide how to handle missing data: - Fill with zero: For spending (no campaign that week) 
- Interpolate: For continuous variables (average of neighbors) 
- Remove row: If critical data missing (KPI) 
- Flag: Create indicator variable for missing periods 
 
Target: <5% missing values per variable, <2% missing in KPI
Consistency
Verify Data Ranges:
Check: Min and Max values make sense
TV_Spend: Min = $0, Max = $150,000 ✓
Revenue: Min = $50,000, Max = $800,000 ✓
Digital_CPC: Min = $0.10, Max = $2,500 ✗ (investigate!)Red Flags:
- Negative values (unless intentional, like discounts) 
- Extreme outliers (100x typical value) 
- Sudden jumps or drops 
- Values outside known constraints 
Excel Formula to Check:
=MIN(B2:B100)  // Minimum value
=MAX(B2:B100)  // Maximum value
=AVERAGE(B2:B100)  // Average
=STDEV(B2:B100)  // Standard deviationAccuracy
Reconcile Totals:
- Sum of weekly marketing spend should match monthly budgets 
- Total revenue should match financial reports 
- Campaign dates align with marketing calendar 
Cross-Validate:
- Compare data from multiple sources 
- Verify with stakeholders (marketing, finance) 
- Spot-check random samples 
Document Discrepancies:
- Note any data quality issues 
- Record assumptions made 
- Flag periods with known problems 
Common Data Issues and Solutions
Issue 1: Outliers
Identify:
- Values >3 standard deviations from mean 
- Sudden spikes or drops in time series 
- Visual inspection in charts 
Investigate:
- Data entry error? (Extra zero, decimal place) 
- Real business event? (Super Bowl, Black Friday) 
- System glitch? (Reporting bug) 
Solutions:
- If Error: Correct to true value 
- If Real Event: Keep, possibly create indicator variable 
- If Can't Determine: Consider capping at 2-3 standard deviations 
Example:
Week of 2024-06-15: TV_Spend = $1,500,000 (10x normal)
→ Investigate: Was this a major campaign launch?
→ If yes: Keep value, add "Major_Campaign_Flag" variable
→ If no: Likely error, check source dataIssue 2: Zero Inflation
Problem: Many zero values in marketing variables
Common in: Campaigns that run intermittently
Example:
Print_Spend: 
Weeks 1-10: $0
Weeks 11-15: $5,000-$8,000
Weeks 16-52: $0Solutions:
- Keep zeros: If truly no spend 
- Create binary flag: Campaign_Active (0/1) 
- Split variable: Print_Spend_Active (non-zero periods only) 
- Use only active periods: Filter model to weeks with spend >0 
Issue 3: Highly Correlated Variables
Identify: Correlation >0.8 between predictors
Check in MixModeler:
- Go to Variable Charts 
- Generate correlation heatmap 
- Look for dark red cells (high correlation) 
Solutions:
- Remove one variable: Keep more important one 
- Combine variables: Create sum or average 
- Principal components: Outside MixModeler, then import 
- Accept correlation: If both theoretically important 
Example:
Facebook_Spend and Instagram_Spend: r = 0.95
→ Solution: Create "Meta_Social_Spend" = Facebook + InstagramIssue 4: Changing Metrics
Problem: Variable definition changes mid-period
Example:
Weeks 1-26: Google_Clicks (organic + paid)
Weeks 27-52: Google_Clicks (paid only)Solutions:
- Split variable: Google_Clicks_Period1, Google_Clicks_Period2 
- Adjust historical data: Retroactively calculate consistent metric 
- Use overlap period: Derive conversion factor, adjust 
- Start fresh: Model only consistent period (weeks 27-52) 
Issue 5: Seasonality Artifacts
Problem: Metrics affected by calendar (holidays, month-end)
Adjust for:
- Number of days in week/month 
- Holiday weeks (Thanksgiving, Christmas) 
- Weekday composition (5 vs 6 day weeks) 
Create Control Variables:
- Holiday_Flag: 1 for holiday weeks, 0 otherwise
- Days_in_Period: Actual number of days
- Monthor- Quarter: Seasonal dummy variables
Variable Preparation Strategies
Marketing Spend Variables
Recommendations:
- Include even if spend is zero some periods 
- Keep spending in original currency units 
- Don't pre-normalize or standardize (MixModeler handles this) 
- Consider log transformation if range is very wide (>100x) 
Adstock Consideration:
- Most media should have adstock applied 
- Test multiple decay rates (30%, 50%, 70%) 
- Longer-lasting effects: Print, TV (50-70%) 
- Shorter effects: Digital, Radio (30-50%) 
Impression/Reach Variables
Best Practices:
- Use in addition to (not instead of) spend 
- Helps model account for efficiency changes 
- Consider Rate metrics (CPM, CPC) as separate variables 
Example Variables:
TV_Spend
TV_GRPs
TV_CPP (Cost Per Point)Benefit: Model can separate volume effects from efficiency
Campaign/Event Variables
Binary Flags (0/1):
- Product launches 
- Major promotions 
- Rebranding campaigns 
- Competitive events 
Continuous Variables:
- Promotion depth (% discount) 
- Number of SKUs on promotion 
- Days of campaign activity 
Timing:
- Align dates precisely 
- Consider lead time (awareness before purchase) 
- May need lag/lead transformations 
Control Variables
Always Consider:
- Seasonality: Month, quarter, holiday indicators 
- Trend: Linear time trend, year indicator 
- External Factors: Weather, economic indicators 
- Competitive: Competitor spending (if available) 
- Price: Your pricing changes 
Rule of Thumb: Include if theoretically important and data available
Data Granularity
Weekly vs Monthly
Weekly Data (Recommended):
- More observations (52-156 data points) 
- Better for capturing marketing dynamics 
- Preferred for most MMM applications 
- Aligns with media planning cycles 
Monthly Data:
- Fewer observations (24-60 data points) 
- Suitable for strategic, long-term analysis 
- Better when media changes are slow 
- Easier data collection in some organizations 
Trade-offs:
Sample Size
Larger (better)
Smaller
Data Collection
More work
Easier
Noise
More
Less
Granularity
High (better for optimization)
Coarse
Recommendation: Use weekly if possible, monthly if weekly data unavailable
Alignment Across Sources
Ensure Consistent Periods:
- All variables must use same time aggregation 
- Week definitions consistent (Sunday-Saturday vs Monday-Sunday) 
- Month-end dates align 
- Fiscal vs calendar year clarity 
Example Problem:
Revenue: Calendar weeks (Sunday-Saturday)
TV Spend: Broadcast weeks (Monday-Sunday)
→ Solution: Realign all to same week definitionAdvanced Preparation Techniques
Creating Interaction Variables
When Useful: Suspected synergy between channels
Prepare in Excel:
=TV_Spend * Digital_SpendSave as: TV_Digital_Interaction
Test: Include in Variable Testing to verify significance
Lagged Variables
For Delayed Effects:
Create in Excel:
// Lag by 1 period (previous week's value)
=OFFSET(B2,-1,0)  // In row 3 onwardsExample:
- TV_Spend_Lag1(last week's TV spend)
- Promo_Lag2(promotion 2 weeks ago)
Use When: Effect occurs with delay (awareness → purchase)
Rate/Ratio Variables
Efficiency Metrics:
=Digital_Spend / Digital_Clicks  // CPC
=TV_Spend / TV_GRPs  // CPP
=Revenue / Units_Sold  // Average PriceBenefits:
- Captures efficiency changes 
- Controls for price variations 
- Helps model separate volume from value 
Data Validation Checklist
Before uploading to MixModeler:
- [ ] Single Excel sheet with all data 
- [ ] First column is Date in consistent format 
- [ ] KPI variable included and complete 
- [ ] All columns have clear, descriptive names 
- [ ] No blank rows or columns 
- [ ] No merged cells 
- [ ] Missing values addressed (<5% per variable) 
- [ ] Outliers investigated and handled 
- [ ] Data ranges verified (min/max make sense) 
- [ ] Totals reconciled with source systems 
- [ ] Time periods align across all variables 
- [ ] 26+ weeks of data (52+ recommended) 
- [ ] Documentation of data sources and assumptions 
Best Practices Summary
Start Clean: Invest time in data quality upfront - saves hours later
Document Everything: Track data sources, transformations, and assumptions
Validate Continuously: Check data at every step, not just at upload
Keep Original: Save raw data separately, create working copy for modifications
Iterate: First upload won't be perfect - refine as you build models
Ask Questions: Verify ambiguous data with stakeholders early
Version Control: Save dated versions of data files as you make changes
Next Steps: Review Model Building Strategy for using your prepared data effectively, or see Common Pitfalls to Avoid for frequent data mistakes.
Last updated