Best Practices for Variable Creation
Guidelines for Effective Variable Engineering
Creating the right variables is crucial for building accurate and actionable MMM models. This page provides proven strategies and best practices for variable engineering in MixModeler.
Core Principles
1. Start Simple, Add Complexity Gradually
Initial Model:
- Raw marketing variables (no transformations) 
- Basic seasonality (month dummies) 
- KPI as-is 
Build Incrementally:
- Add adstock to media channels 
- Apply saturation curves 
- Test interaction terms 
- Create composite variables 
Why: Easier to understand what each transformation contributes, simpler debugging
2. Every Transformation Should Have a Purpose
Bad Practice: "Let me try every transformation and see what sticks"
Good Practice: "TV ads persist for weeks, so I'll apply adstock with 60% decay rate based on industry benchmarks"
Rule: Each transformation should address a specific business hypothesis or known marketing behavior
3. Test Before Committing
Before creating a variable:
- Preview the transformation 
- Check the distribution (min, max, mean) 
- Visualize the effect (charts) 
- Understand what it represents 
After creating:
- Test in model (add to Model Builder) 
- Check t-statistic (is it significant?) 
- Verify coefficient sign (does it make sense?) 
- Compare R² with/without variable 
If no improvement: Don't use the transformation
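The pre-creation checks above (min, max, mean) can be scripted as a quick helper. This is a sketch, not a MixModeler API; the `preview` function name is hypothetical:

```python
def preview(series):
    """Summarize a candidate variable before committing it:
    check min, max, and mean, as suggested above."""
    return {
        "min": min(series),
        "max": max(series),
        "mean": sum(series) / len(series),
    }

stats = preview([10.0, 40.0, 25.0, 5.0])  # {'min': 5.0, 'max': 40.0, 'mean': 20.0}
```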
Naming Conventions
Use Clear, Descriptive Names
Good Names:
TV_Spend_ads60              (TV with 60% adstock)
Digital_Display_ATAN_a12    (Display with saturation curve)
Radio_Q4_Only               (Radio spend, Q4 periods only)
Social_Media_Mix_WGTD       (Weighted combination)
OOH_AVO_85                  (OOH above 85% threshold)
Bad Names:
TV_transformed
Variable1
temp_test
X_final_v2
Benefits:
- Easy to understand at a glance 
- Clear what transformation was applied 
- Easier to document and share 
Include Transformation Details
Format:
{Base Variable}_{Transformation}_{Parameters}
Examples:
TV_Spend_ads60_ATAN_a15_p12
Digital_lag2
Price_Q4_Only
What to Include:
- Base variable name 
- Transformation type (ads, ATAN, AVO, WGTD, etc.) 
- Key parameters (adstock rate, threshold, etc.) 
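The naming format above is easy to automate so names stay consistent across a project. A minimal sketch (the `variable_name` helper is hypothetical, not part of MixModeler):

```python
def variable_name(base, *tags):
    """Build a descriptive name: the base variable, then one tag per
    transformation (type plus key parameters), joined by underscores."""
    return "_".join([base, *tags])

name = variable_name("TV_Spend", "ads60", "ATAN_a15_p12")
# 'TV_Spend_ads60_ATAN_a15_p12'
```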
Transformation Best Practices
Adstock Transformations
✅ Do:
- Apply to all media channels (TV, Radio, Print, Display) 
- Test multiple rates (40%, 50%, 60%, 70%) 
- Use Variable Testing to find optimal rate 
- Document why specific rate chosen 
❌ Don't:
- Apply same rate to all channels (they decay differently) 
- Use adstock on non-media variables (price, weather) 
- Apply adstock AND lag (choose one) 
- Use rates > 90% (unrealistic persistence) 
Typical Rates:
- TV: 50-70% 
- Radio: 40-60% 
- Print: 60-80% 
- Digital Display: 30-50% 
- Search: 10-30% 
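Geometric adstock, the transformation these rates refer to, can be sketched in a few lines. This is the standard carry-over formulation; MixModeler applies it for you, so the code is illustrative only:

```python
def adstock(spend, rate):
    """Geometric adstock: each period keeps its own spend plus
    `rate` times the previous period's adstocked value."""
    out, carry = [], 0.0
    for x in spend:
        carry = x + rate * carry
        out.append(carry)
    return out

# A single burst of TV spend decays over the following weeks:
tv_ads60 = adstock([100, 0, 0, 0], 0.6)  # ~[100, 60, 36, 21.6]
```

Note how a 60% rate means each week retains 60% of the previous week's accumulated effect, which is why rates above 90% imply unrealistically long persistence.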
Saturation Curves
✅ Do:
- Apply to media channels with large spend variance 
- Use Curve Testing to find optimal parameters 
- Test both S-shape and concave curves 
- Apply AFTER adstock (adstock first, then saturation) 
❌ Don't:
- Apply to variables with limited range (little saturation to model) 
- Use overly aggressive parameters (creates flat line) 
- Combine with too many other transformations (over-complicating) 
When to Use:
- Media channels with a wide spend range (at least a 3× difference between min and max) 
- Channels where diminishing returns expected 
- When linear model shows unrealistic ROI at high spend 
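An arctan curve like the `ATAN_a15_p12` examples on this page could be sketched as follows. The exact functional form and the meaning of alpha and power are assumptions here; check how MixModeler defines its ATAN parameters:

```python
import math

def atan_saturation(spend, alpha, power=1.0):
    """Concave response: roughly linear at low spend, flattening toward
    a ceiling at high spend (diminishing returns).
    Assumed form atan((x / alpha) ** power); MixModeler's exact ATAN
    parameterization may differ."""
    return [math.atan((x / alpha) ** power) for x in spend]

# Apply AFTER adstock: saturate the adstocked series, not raw spend.
curve = atan_saturation([10, 20, 30], alpha=15)
```

Equal spend increments produce shrinking response increments, which is exactly the diminishing-returns behavior the curve is meant to capture.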
Lead/Lag Transformations
✅ Do:
- Use for non-media variables (price, promotions, external factors) 
- Test multiple lag periods (1, 2, 3 weeks) 
- Choose lag with highest t-statistic 
- Document the delay hypothesis 
❌ Don't:
- Use for media channels (use adstock instead) 
- Create excessive lags (>4 weeks rarely needed) 
- Use both lag and lead for same variable 
Common Applications:
- Price_lag1 (price changes take time to affect behavior) 
- DirectMail_lag2 (2-week delivery + response time) 
- Competitor_Activity_lag1 (delayed competitive response) 
Split by Date
✅ Do:
- Align splits with real business events (campaigns, rebrand, market entry) 
- Create complementary splits (Period A + Period B = Total) 
- Ensure sufficient data in each split (15+ observations) 
- Document the reason for split 
❌ Don't:
- Split arbitrarily without business rationale 
- Create too many splits (> 3-4 per variable) 
- Split into very short periods (< 10 observations) 
Good Use Cases:
- Before/After major change (product launch, rebrand) 
- Campaign vs. baseline periods 
- Seasonal effectiveness (Q4 vs. non-Q4) 
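The complementary-split rule above (Period A + Period B = Total) can be sketched with a simple mask. The function and flag names are illustrative, not MixModeler's API:

```python
def split_by_period(values, in_period_flags):
    """Split a variable into complementary series: values inside the
    period and values outside it (zeros elsewhere), so the two splits
    always sum back to the original."""
    inside = [v if f else 0.0 for v, f in zip(values, in_period_flags)]
    outside = [0.0 if f else v for v, f in zip(values, in_period_flags)]
    return inside, outside

radio = [5, 8, 6, 9]
is_q4 = [False, True, True, True]   # e.g. weeks falling in Oct-Dec
radio_q4, radio_non_q4 = split_by_period(radio, is_q4)
```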
Weighted Variables (WGTD)
✅ Do:
- Combine highly correlated channels (reduces VIF) 
- Start with OLS coefficients as weights 
- Adjust weights based on business knowledge 
- Document weight rationale 
❌ Don't:
- Combine unrelated channels 
- Use arbitrary weights without justification 
- Over-combine (lose actionable insights) 
Best Applications:
- Multiple digital channels (PPC, Meta, Display, LinkedIn) 
- Multiple TV campaigns running simultaneously 
- Regional media that should be consolidated 
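A weighted composite is a per-period weighted sum of the member channels. This sketch assumes plain lists and hypothetical channel names; in practice the weights might start from OLS coefficients and then be adjusted with business knowledge, per the guidance above:

```python
def weighted_mix(channels, weights):
    """Combine correlated channels into one composite series:
    a weighted sum across channels for each time period."""
    periods = len(next(iter(channels.values())))
    return [sum(weights[name] * series[t] for name, series in channels.items())
            for t in range(periods)]

digital = {"PPC": [10, 20], "Meta": [30, 10], "Display": [5, 5]}
mix = weighted_mix(digital, {"PPC": 0.5, "Meta": 0.3, "Display": 0.2})
```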
AVO (Above Value Operator)
✅ Do:
- Test multiple thresholds (70, 80, 90) 
- Check distribution (% of 1s vs. 0s) 
- Use for campaign flight detection 
- Combine with continuous spend variable 
❌ Don't:
- Use extreme thresholds (too few or too many 1s) 
- Confuse with percentile (AVO 90 ≠ 90th percentile) 
- Use as only variable for that channel 
Typical Thresholds:
- AVO 80-90: Identify heavy campaign weeks 
- AVO 60-70: Moderate campaign activity 
- AVO 40-50: General activity indicator 
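One plausible reading of AVO, consistent with the warning that AVO 90 is not the 90th percentile, is a flag for periods exceeding a percentage of the series maximum. The formulation below is an assumption; confirm MixModeler's exact AVO definition:

```python
def avo(series, threshold_pct):
    """Above Value Operator sketch: 1 where the value exceeds
    threshold_pct% of the series maximum, else 0. This is NOT a
    percentile, and the exact definition here is an assumption."""
    cutoff = max(series) * threshold_pct / 100.0
    return [1 if x > cutoff else 0 for x in series]

tv = [10, 95, 40, 88, 100]
flights = avo(tv, 85)                        # flags heavy campaign weeks
share_of_ones = sum(flights) / len(flights)  # distribution check from above
```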
Variable Testing Strategy
Systematic Testing Process
Step 1: Hypothesis. Define what you're testing and why
- "TV ads persist 4-6 weeks based on past studies" 
Step 2: Create Candidates. Build multiple versions
- TV_ads40, TV_ads50, TV_ads60, TV_ads70 
Step 3: Test in Model. Use the Variable Testing page
- Compare t-statistics 
- Check coefficients make sense 
- Review model R² 
Step 4: Select Winner. Choose the best-performing version
- Highest t-stat (most significant) 
- Makes business sense 
- Improves model fit 
Step 5: Document. Record the decision rationale
- Why this transformation? 
- What did we test? 
- What did we find? 
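The five steps above can be sketched end to end on synthetic data. In MixModeler the Variable Testing page performs this comparison for you; the candidate rates, noise level, and `t_stat` helper below are illustrative assumptions:

```python
import numpy as np

def adstock(x, rate):
    """Geometric adstock (the same transformation described earlier)."""
    out, carry = [], 0.0
    for v in x:
        carry = v + rate * carry
        out.append(carry)
    return np.array(out)

def t_stat(x, y):
    """t-statistic of the slope in a simple OLS of y on x (with intercept)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - 2)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(0)
tv = rng.uniform(0, 100, size=52)                       # 52 weeks of spend
sales = 3 * adstock(tv, 0.6) + rng.normal(0, 30, 52)    # synthetic KPI

# Steps 2-4: build candidates, compare significance, pick the winner
candidates = {r: t_stat(adstock(tv, r), sales) for r in (0.4, 0.5, 0.6, 0.7)}
best_rate = max(candidates, key=candidates.get)
```

Because adstocked series at neighboring rates are highly correlated, the t-statistics will be close; that is why Step 5 (documenting why a rate was chosen) matters as much as the numbers.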
Common Pitfalls to Avoid
Pitfall 1: Transformation Overload
Problem: Applying too many transformations to one variable
Example:
TV → adstock → saturation → standardization → lag → AVO
Result: Impossible to interpret, overfitted
Fix: Maximum 2-3 transformations per variable (typically adstock + saturation)
Pitfall 2: Ignoring Business Logic
Problem: Purely statistical approach without business validation
Example: Model shows TV with negative coefficient because of confounding
Fix: Always validate results with business stakeholders
Pitfall 3: Not Testing Alternatives
Problem: Applying one transformation without testing alternatives
Example: Using 50% adstock without testing 40%, 60%, 70%
Fix: Always test multiple parameter values
Pitfall 4: Inconsistent Application
Problem: Applying transformations inconsistently
Example: TV with adstock, Radio without (when both are brand media)
Fix: Apply same logic to similar channel types
Pitfall 5: Creating Too Many Variables
Problem: Explosion of variables from transformations
Example: Starting with 20 variables, ending with 80 after transformations
Fix: Be selective, only create variables that improve model
Variable Management
Organization Strategy
Group by Type:
- Raw Variables: Original uploaded data 
- Time Transformations: Lags, leads, splits 
- Marketing Transformations: Adstock, saturation 
- Composite Variables: Weighted, multiplied 
- Indicators: AVO, dummies 
Naming Prefix: Consider consistent prefixes for easy filtering
raw_TV_Spend
trans_TV_ads60
comp_Digital_Mix_WGTD
ind_TV_AVO_90
Version Control
Track Changes:
- Keep notes on why variables were created 
- Date of creation 
- Parameters used 
- Performance in models 
Excel Export: Export model with transformations documented for reproducibility
Decision Framework
Should I Create This Variable?
Ask:
1. Does it address a real business hypothesis?
✅ Yes → Proceed | ❌ No → Reconsider
2. Will it improve model interpretability or fit?
✅ Yes → Proceed | ❌ No → Skip
3. Can I clearly explain what it represents?
✅ Yes → Proceed | ❌ No → Simplify first
4. Have I tested it properly?
✅ Yes → Proceed | ❌ No → Test first
5. Does it make business sense?
✅ Yes → Use it | ❌ No → Don't use it
Model Complexity vs. Interpretability
Finding the Balance
Simple Model:
- 10-15 variables 
- Minimal transformations 
- Easy to explain 
- May miss some effects 
Complex Model:
- 30+ variables 
- Many transformations 
- Hard to explain 
- May overfit 
Optimal Model:
- 15-25 variables 
- Purposeful transformations 
- Interpretable 
- Captures key effects 
Guideline: If you can't easily explain a variable to a stakeholder, it's probably too complex
Documentation Best Practices
What to Document
For Each Created Variable:
- Base variable(s) used 
- Transformation type and parameters 
- Business rationale 
- Date created 
- Performance (t-stat, significance) 
- Decision to keep or exclude 
Example Log:
Variable: TV_Spend_ads60_ATAN_a15_p12
Created: 2024-01-15
Base: TV_Spend
Transformations: 
  - Adstock 60% (tested 40%, 50%, 60%, 70% - 60% had highest t-stat)
  - ATAN saturation (alpha=15, power=1.2)
Rationale: TV shows strong persistence and diminishing returns
Performance: t-stat = 4.2, R² improvement = 0.03
Status: ACTIVE in Model_v2
Quality Checklist
Before finalizing variables, verify:
Statistical Quality:
- t-statistic is significant 
- Coefficient sign makes business sense 
- R² improves with the variable included 
Business Quality:
- Addresses a real business hypothesis 
- Can be clearly explained to stakeholders 
- Validated with business logic, not statistics alone 
Technical Quality:
- Descriptive name with transformation details 
- No more than 2-3 transformations applied 
- Rationale, parameters, and performance documented 
Summary
Key Takeaways:
🎯 Start simple, add complexity gradually - don't over-engineer initially
📝 Document everything - rationale, parameters, decisions
🧪 Test before committing - verify transformations improve model
✅ Every variable needs a purpose - no "just because" transformations
📊 Name clearly - descriptive names with transformation details
🔍 Validate with business logic - statistics + domain knowledge
⚖️ Balance complexity vs. interpretability - aim for 15-25 final variables
🎓 Less is often more - 20 well-chosen variables beat 50 random ones
Bottom Line: Great variable engineering is both an art and science. Use statistical methods to test, business logic to guide, and common sense to validate. When in doubt, keep it simple!