Data Preparation Tips

Overview

Data quality is the foundation of reliable marketing mix modeling. Well-prepared data leads to faster model building, more accurate results, and greater confidence in recommendations. This guide provides practical tips for preparing your data for MixModeler.

Data Structure Essentials

Excel File Format

Required Structure:

Date        | Revenue | TV_Spend | Digital_Spend | Print_Spend
2024-01-07  | 125000  | 15000    | 8000          | 3000
2024-01-14  | 132000  | 18000    | 9500          | 3200

Column Order:

  1. First column: Date (required)

  2. Second column: KPI variable (recommended but not required)

  3. Remaining columns: Marketing and control variables

Key Rules:

  • One sheet only (MixModeler reads first sheet)

  • No blank rows or columns

  • No merged cells

  • Headers in first row only

  • Data starts in row 2

Date Formatting

Supported Formats:

  • YYYY-MM-DD (recommended): 2024-01-07

  • MM/DD/YYYY: 01/07/2024

  • DD/MM/YYYY: 07/01/2024

  • Excel date serial numbers

Best Practice: Use YYYY-MM-DD format consistently

Common Mistakes:

  • Mixed date formats in same column

  • Text dates ("January 7, 2024")

  • Dates as text instead of dates

  • Inconsistent granularity (mix of weeks and months)

Fix in Excel:

  1. Select date column

  2. Format → Cells → Date → YYYY-MM-DD

  3. Verify all dates converted correctly

Column Naming

Good Names:

  • TV_Spend

  • Digital_Impressions

  • Email_Clicks

  • Promo_Flag

Avoid:

  • Spaces: "TV Spend" → Use underscores: "TV_Spend"

  • Special characters: "TV (Broadcast)" → Use: "TV_Broadcast"

  • Numbers only: "2024_Campaign" → OK if descriptive

  • Very long names: "Television_Advertising_Total_Weekly_Spend" → Use: "TV_Spend"

Convention: Use Channel_Metric format (e.g., Facebook_Spend, Google_Clicks)

Data Quality Checks

Completeness

Check for Missing Values:

  1. In Excel, use filter to find blanks

  2. For each column, count non-blank cells

  3. Decide how to handle missing data:

    • Fill with zero: For spending (no campaign that week)

    • Interpolate: For continuous variables (average of neighbors)

    • Remove row: If critical data missing (KPI)

    • Flag: Create indicator variable for missing periods

Target: <5% missing values per variable, <2% missing in KPI

Consistency

Verify Data Ranges:

Check: Min and Max values make sense
TV_Spend: Min = $0, Max = $150,000 ✓
Revenue: Min = $50,000, Max = $800,000 ✓
Digital_CPC: Min = $0.10, Max = $2,500 ✗ (investigate!)

Red Flags:

  • Negative values (unless intentional, like discounts)

  • Extreme outliers (100x typical value)

  • Sudden jumps or drops

  • Values outside known constraints

Excel Formula to Check:

=MIN(B2:B100)  // Minimum value
=MAX(B2:B100)  // Maximum value
=AVERAGE(B2:B100)  // Average
=STDEV(B2:B100)  // Standard deviation

Accuracy

Reconcile Totals:

  • Sum of weekly marketing spend should match monthly budgets

  • Total revenue should match financial reports

  • Campaign dates align with marketing calendar

Cross-Validate:

  • Compare data from multiple sources

  • Verify with stakeholders (marketing, finance)

  • Spot-check random samples

Document Discrepancies:

  • Note any data quality issues

  • Record assumptions made

  • Flag periods with known problems

Common Data Issues and Solutions

Issue 1: Outliers

Identify:

  • Values >3 standard deviations from mean

  • Sudden spikes or drops in time series

  • Visual inspection in charts

Investigate:

  • Data entry error? (Extra zero, decimal place)

  • Real business event? (Super Bowl, Black Friday)

  • System glitch? (Reporting bug)

Solutions:

  • If Error: Correct to true value

  • If Real Event: Keep, possibly create indicator variable

  • If Can't Determine: Consider capping at 2-3 standard deviations

Example:

Week of 2024-06-15: TV_Spend = $1,500,000 (10x normal)
→ Investigate: Was this a major campaign launch?
→ If yes: Keep value, add "Major_Campaign_Flag" variable
→ If no: Likely error, check source data

Issue 2: Zero Inflation

Problem: Many zero values in marketing variables

Common in: Campaigns that run intermittently

Example:

Print_Spend: 
Weeks 1-10: $0
Weeks 11-15: $5,000-$8,000
Weeks 16-52: $0

Solutions:

  • Keep zeros: If truly no spend

  • Create binary flag: Campaign_Active (0/1)

  • Split variable: Print_Spend_Active (non-zero periods only)

  • Use only active periods: Filter model to weeks with spend >0

Issue 3: Highly Correlated Variables

Identify: Correlation >0.8 between predictors

Check in MixModeler:

  1. Go to Variable Charts

  2. Generate correlation heatmap

  3. Look for dark red cells (high correlation)

Solutions:

  • Remove one variable: Keep more important one

  • Combine variables: Create sum or average

  • Principal components: Outside MixModeler, then import

  • Accept correlation: If both theoretically important

Example:

Facebook_Spend and Instagram_Spend: r = 0.95
→ Solution: Create "Meta_Social_Spend" = Facebook + Instagram

Issue 4: Changing Metrics

Problem: Variable definition changes mid-period

Example:

Weeks 1-26: Google_Clicks (organic + paid)
Weeks 27-52: Google_Clicks (paid only)

Solutions:

  • Split variable: Google_Clicks_Period1, Google_Clicks_Period2

  • Adjust historical data: Retroactively calculate consistent metric

  • Use overlap period: Derive conversion factor, adjust

  • Start fresh: Model only consistent period (weeks 27-52)

Issue 5: Seasonality Artifacts

Problem: Metrics affected by calendar (holidays, month-end)

Adjust for:

  • Number of days in week/month

  • Holiday weeks (Thanksgiving, Christmas)

  • Weekday composition (5 vs 6 day weeks)

Create Control Variables:

  • Holiday_Flag: 1 for holiday weeks, 0 otherwise

  • Days_in_Period: Actual number of days

  • Month or Quarter: Seasonal dummy variables

Variable Preparation Strategies

Marketing Spend Variables

Recommendations:

  • Include even if spend is zero some periods

  • Keep spending in original currency units

  • Don't pre-normalize or standardize (MixModeler handles this)

  • Consider log transformation if range is very wide (>100x)

Adstock Consideration:

  • Most media should have adstock applied

  • Test multiple decay rates (30%, 50%, 70%)

  • Longer-lasting effects: Print, TV (50-70%)

  • Shorter effects: Digital, Radio (30-50%)

Impression/Reach Variables

Best Practices:

  • Use in addition to (not instead of) spend

  • Helps model account for efficiency changes

  • Consider Rate metrics (CPM, CPC) as separate variables

Example Variables:

TV_Spend
TV_GRPs
TV_CPP (Cost Per Point)

Benefit: Model can separate volume effects from efficiency

Campaign/Event Variables

Binary Flags (0/1):

  • Product launches

  • Major promotions

  • Rebranding campaigns

  • Competitive events

Continuous Variables:

  • Promotion depth (% discount)

  • Number of SKUs on promotion

  • Days of campaign activity

Timing:

  • Align dates precisely

  • Consider lead time (awareness before purchase)

  • May need lag/lead transformations

Control Variables

Always Consider:

  • Seasonality: Month, quarter, holiday indicators

  • Trend: Linear time trend, year indicator

  • External Factors: Weather, economic indicators

  • Competitive: Competitor spending (if available)

  • Price: Your pricing changes

Rule of Thumb: Include if theoretically important and data available

Data Granularity

Weekly vs Monthly

Weekly Data (Recommended):

  • More observations (52-156 data points)

  • Better for capturing marketing dynamics

  • Preferred for most MMM applications

  • Aligns with media planning cycles

Monthly Data:

  • Fewer observations (24-60 data points)

  • Suitable for strategic, long-term analysis

  • Better when media changes are slow

  • Easier data collection in some organizations

Trade-offs:

Aspect
Weekly
Monthly

Sample Size

Larger (better)

Smaller

Data Collection

More work

Easier

Noise

More

Less

Granularity

High (better for optimization)

Coarse

Recommendation: Use weekly if possible, monthly if weekly data unavailable

Alignment Across Sources

Ensure Consistent Periods:

  • All variables must use same time aggregation

  • Week definitions consistent (Sunday-Saturday vs Monday-Sunday)

  • Month-end dates align

  • Fiscal vs calendar year clarity

Example Problem:

Revenue: Calendar weeks (Sunday-Saturday)
TV Spend: Broadcast weeks (Monday-Sunday)
→ Solution: Realign all to same week definition

Advanced Preparation Techniques

Creating Interaction Variables

When Useful: Suspected synergy between channels

Prepare in Excel:

=TV_Spend * Digital_Spend

Save as: TV_Digital_Interaction

Test: Include in Variable Testing to verify significance

Lagged Variables

For Delayed Effects:

Create in Excel:

// Lag by 1 period (previous week's value)
=OFFSET(B2,-1,0)  // In row 3 onwards

Example:

  • TV_Spend_Lag1 (last week's TV spend)

  • Promo_Lag2 (promotion 2 weeks ago)

Use When: Effect occurs with delay (awareness → purchase)

Rate/Ratio Variables

Efficiency Metrics:

=Digital_Spend / Digital_Clicks  // CPC
=TV_Spend / TV_GRPs  // CPP
=Revenue / Units_Sold  // Average Price

Benefits:

  • Captures efficiency changes

  • Controls for price variations

  • Helps model separate volume from value

Data Validation Checklist

Before uploading to MixModeler:

  • [ ] Single Excel sheet with all data

  • [ ] First column is Date in consistent format

  • [ ] KPI variable included and complete

  • [ ] All columns have clear, descriptive names

  • [ ] No blank rows or columns

  • [ ] No merged cells

  • [ ] Missing values addressed (<5% per variable)

  • [ ] Outliers investigated and handled

  • [ ] Data ranges verified (min/max make sense)

  • [ ] Totals reconciled with source systems

  • [ ] Time periods align across all variables

  • [ ] 26+ weeks of data (52+ recommended)

  • [ ] Documentation of data sources and assumptions

Best Practices Summary

Start Clean: Invest time in data quality upfront - saves hours later

Document Everything: Track data sources, transformations, and assumptions

Validate Continuously: Check data at every step, not just at upload

Keep Original: Save raw data separately, create working copy for modifications

Iterate: First upload won't be perfect - refine as you build models

Ask Questions: Verify ambiguous data with stakeholders early

Version Control: Save dated versions of data files as you make changes


Next Steps: Review Model Building Strategy for using your prepared data effectively, or see Common Pitfalls to Avoid for frequent data mistakes.

Last updated