Large Dataset Handling

Overview

MixModeler is optimized to handle large marketing mix models with hundreds of variables and years of data. Through intelligent chunking, memory management, and acceleration technologies, the platform processes datasets that would overwhelm traditional browser-based applications.

This guide explains how MixModeler handles large datasets, what limits exist, and best practices for working efficiently with extensive marketing data.

Dataset Size Categories

Small Datasets

Characteristics:

  • Variables: 10-30

  • Observations: 26-52 (6 months to 1 year of weekly data)

  • Total data points: 260-1,560

Performance: Instant processing (<1 second for most operations)

Memory Usage: <100 MB

Ideal For: Single-channel testing, small business MMM, pilot projects

Medium Datasets

Characteristics:

  • Variables: 30-100

  • Observations: 52-156 (1-3 years of weekly data)

  • Total data points: 1,560-15,600

Performance: Fast processing (1-3 seconds with acceleration)

Memory Usage: 100-500 MB

Ideal For: Multi-channel MMM, standard business applications, most use cases

Large Datasets

Characteristics:

  • Variables: 100-300

  • Observations: 104-260 (2-5 years of weekly data)

  • Total data points: 10,400-78,000

Performance: Good performance with GPU (3-8 seconds), acceptable with WASM only (8-20 seconds)

Memory Usage: 500-2000 MB

Ideal For: Enterprise MMM, comprehensive multi-market models, advanced analytics

Very Large Datasets

Characteristics:

  • Variables: 300-500 (subscription limit)

  • Observations: 260+ (5+ years of weekly data)

  • Total data points: 78,000-130,000+

Performance: Requires GPU acceleration (10-30 seconds), slow without GPU (>60 seconds)

Memory Usage: 2-4 GB

Ideal For: Global enterprise MMM, exhaustive market analysis, research applications

Note: Professional/Business plan required for 300+ variables
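
All of the "total data points" figures above are simply variables × observations, with each value stored as an 8-byte float. A minimal sketch of that arithmetic (the category cutoffs here are taken from the variable ranges above, not from MixModeler's internal logic):

```typescript
// Each cell is one 8-byte floating-point value, so the raw size in
// bytes is variables * observations * 8. Cutoffs mirror the documented
// variable ranges and are illustrative only.
function describeDataset(variables: number, observations: number): string {
  const dataPoints = variables * observations;
  const rawKB = (dataPoints * 8) / 1000;
  const category =
    variables <= 30 ? "small" :
    variables <= 100 ? "medium" :
    variables <= 300 ? "large" : "very large";
  return `${category}: ${dataPoints.toLocaleString()} data points (~${rawKB} KB raw)`;
}

console.log(describeDataset(100, 156)); // "medium: 15,600 data points (~124.8 KB raw)"
```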

Subscription Limits

Variable Limits by Plan

Free Plan:

  • Maximum 20 variables per dataset

  • Suitable for testing and small models

  • All features available

Professional Plan:

  • Maximum 500 variables per dataset

  • Suitable for most business needs

  • Priority support

Business Plan:

  • Unlimited variables

  • Enterprise-scale modeling

  • Dedicated support

Observation Limits

Practical Limits:

  • Minimum: 26 observations (6 months of weekly data)

  • Recommended minimum: 52 observations (1 year)

  • Maximum: No hard limit; 500+ observations are supported

  • Optimal: 52-260 observations (1-5 years)

Best Practice: More observations generally improve model reliability, but returns diminish beyond 3-5 years of data

Memory Management

Browser Memory Architecture

How MixModeler Uses Memory:

Application Code: ~50-100 MB (fixed)

Loaded Data: Variable count × Observations × 8 bytes

  • Example: 100 vars × 200 obs = 160 KB

Working Memory: 5-10x loaded data during operations

  • Example: 160 KB → 800 KB - 1.6 MB during processing

WASM Memory: Separate heap, 500 MB - 2 GB depending on operation

GPU Memory: Separate VRAM when GPU acceleration active

Browser Overhead: ~200-500 MB for Chrome/Edge

Total Typical Usage: 1-4 GB for large models
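
Putting those figures together gives a rough memory budget. The sketch below is a back-of-the-envelope estimate built only from the constants in this section (it ignores GPU VRAM, which lives outside system RAM), not MixModeler's actual accounting:

```typescript
// Back-of-the-envelope memory estimate from the figures above.
// Excludes GPU VRAM, which is tracked separately from system RAM.
function estimateMemoryMB(
  variables: number,
  observations: number,
): { low: number; high: number } {
  const loadedMB = (variables * observations * 8) / 1e6; // 8 bytes per value
  const appCodeMB = 100;  // application code: ~50-100 MB (upper bound)
  const browserMB = 500;  // Chrome/Edge overhead: ~200-500 MB (upper bound)
  return {
    low: appCodeMB + browserMB + 500 + loadedMB * 5,    // WASM heap 500 MB, working memory 5x
    high: appCodeMB + browserMB + 2000 + loadedMB * 10, // WASM heap 2 GB, working memory 10x
  };
}

// Example: 300 variables x 260 observations (a "very large" dataset)
const est = estimateMemoryMB(300, 260);
console.log(`${est.low.toFixed(0)}-${est.high.toFixed(0)} MB`); // "1103-2606 MB"
```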

Memory-Efficient Processing

Chunked Operations: Large datasets processed in smaller chunks

How Chunking Works (see the sketch after these steps):

  1. Dataset split into manageable pieces (chunk size: 1,000-10,000 rows)

  2. Each chunk processed independently

  3. Results aggregated at the end

  4. Memory released after each chunk
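
A minimal sketch of this split/process/aggregate pattern; processChunk and mergeResults are hypothetical stand-ins for per-chunk work, not MixModeler's API:

```typescript
// Generic chunked processing: split, process each piece independently,
// aggregate at the end, and let each chunk be garbage-collected before
// the next one is created.
function processInChunks<T, R>(
  rows: T[],
  chunkSize: number,
  processChunk: (chunk: T[]) => R,
  mergeResults: (partials: R[]) => R,
): R {
  const partials: R[] = [];
  for (let start = 0; start < rows.length; start += chunkSize) {
    const chunk = rows.slice(start, start + chunkSize); // 1. split
    partials.push(processChunk(chunk));                 // 2. process independently
    // 4. `chunk` goes out of scope here, so its memory can be reclaimed
  }
  return mergeResults(partials);                        // 3. aggregate
}

// Example: chunked sum over a large column
const total = processInChunks(
  Array.from({ length: 100_000 }, (_, i) => i),
  5_000,
  (chunk) => chunk.reduce((a, b) => a + b, 0),
  (sums) => sums.reduce((a, b) => a + b, 0),
);
```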

Operations Using Chunking:

  • Data upload and validation

  • Statistical summary calculations

  • Correlation matrix generation (for very large variable sets)

  • Diagnostic test suites

User Impact: Transparent; operations complete normally, just slightly more slowly

Automatic Memory Management

Garbage Collection: Browser automatically reclaims unused memory

Memory Monitoring: MixModeler tracks usage and warns if approaching limits

Automatic Cleanup: Memory released immediately after operations complete

No User Action Required: Memory management is fully automatic

Performance Optimization for Large Datasets

Acceleration Technology Stack

For 100-200 Variables:

  • WASM: 5-10x speedup (always available)

  • GPU: Additional 3-8x speedup (when available)

  • Combined: 15-80x faster than pure JavaScript

For 200-500 Variables:

  • WASM: 6-12x speedup (essential)

  • GPU: Additional 5-15x speedup (highly recommended)

  • Combined: 30-180x faster than pure JavaScript
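
The combined ranges are simply the products of the stacked multipliers:

```typescript
// Stacked speedups multiply: WASM 5-10x with GPU 3-8x gives 15-80x.
const combined = (wasm: [number, number], gpu: [number, number]): [number, number] =>
  [wasm[0] * gpu[0], wasm[1] * gpu[1]];

console.log(combined([5, 10], [3, 8]));  // [15, 80]  -> 100-200 variables
console.log(combined([6, 12], [5, 15])); // [30, 180] -> 200-500 variables
```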

Chunk Size Optimization

MixModeler automatically selects optimal chunk sizes; a sketch of the selection logic follows the tiers below:

Small Datasets (<10,000 data points):

  • No chunking needed

  • Process entire dataset at once

Medium Datasets (10,000-50,000 data points):

  • Chunk size: 10,000 rows

  • 2-5 chunks typical

  • Minimal overhead

Large Datasets (50,000-100,000 data points):

  • Chunk size: 5,000 rows

  • 10-20 chunks typical

  • Managed overhead

Very Large Datasets (>100,000 data points):

  • Chunk size: 2,000-5,000 rows

  • 20-50 chunks typical

  • More processing time but prevents memory issues
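
A sketch of that selection logic using the thresholds listed above (the actual heuristic may differ; note the tiers are stated in total data points while chunk sizes are in rows):

```typescript
// Chunk-size heuristic mirroring the documented tiers.
function chooseChunkSize(dataPoints: number): number | null {
  if (dataPoints < 10_000) return null;     // small: process in one pass
  if (dataPoints <= 50_000) return 10_000;  // medium: 2-5 chunks typical
  if (dataPoints <= 100_000) return 5_000;  // large: 10-20 chunks typical
  return 2_000;                             // very large: 2,000-5,000 rows; lower bound shown
}
```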

Parallel Processing

Multi-Core CPU Utilization (see the worker sketch after this list):

  • Chunk processing distributed across CPU cores

  • 4-core CPU: Up to 4 chunks simultaneously

  • 8-core CPU: Up to 8 chunks simultaneously
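
Conceptually this is the standard Web Worker fan-out pattern, sketched below. The worker script name and message shapes are hypothetical; MixModeler's actual scheduling is internal:

```typescript
// Fan chunks out across one worker per CPU core, then gather results.
// "chunk-worker.js" is a hypothetical script that processes one chunk
// per message and posts back a partial result.
async function processParallel<R>(chunks: unknown[]): Promise<R[]> {
  const cores = navigator.hardwareConcurrency ?? 4; // 4-core -> 4 workers, etc.
  const workers = Array.from(
    { length: Math.min(cores, chunks.length) },
    () => new Worker("chunk-worker.js"),
  );
  let next = 0;
  const results: R[] = [];
  await Promise.all(
    workers.map(async (worker) => {
      while (next < chunks.length) {
        const i = next++; // claim the next unprocessed chunk
        results[i] = await new Promise<R>((resolve) => {
          worker.onmessage = (e) => resolve(e.data as R);
          worker.postMessage(chunks[i]);
        });
      }
      worker.terminate();
    }),
  );
  return results;
}
```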

GPU Parallel Processing:

  • Thousands of operations simultaneously

  • Particularly effective for matrix operations

  • Essential for 200+ variable models

Bayesian MCMC:

  • Multiple chains run in parallel

  • 4 chains utilize 4 CPU cores optimally

  • Each chain processes independently

Practical Guidelines

Hardware Recommendations by Dataset Size

For 100-200 Variables:

| Component | Minimum | Recommended | Optimal |
| --- | --- | --- | --- |
| RAM | 8 GB | 16 GB | 32 GB |
| CPU | Quad-core 2.5 GHz | Quad-core 3.0 GHz | 6-8 core 3.5 GHz |
| GPU | Integrated | GTX 1660 / RX 5600 | RTX 3060 / RX 6700 |
| Storage | HDD | SSD | NVMe SSD |

Expected Performance: 2-5 seconds per model iteration with recommended setup

For 200-400 Variables:

| Component | Minimum | Recommended | Optimal |
| --- | --- | --- | --- |
| RAM | 16 GB | 32 GB | 64 GB |
| CPU | 6-core 3.0 GHz | 8-core 3.5 GHz | 12+ core 4.0 GHz |
| GPU | GTX 1660 | RTX 3060 / RX 6700 | RTX 3070+ / RX 6800+ |
| Storage | SSD | NVMe SSD | NVMe SSD (fast) |

Expected Performance: 5-15 seconds per model iteration with recommended setup

For 400-500 Variables:

| Component | Minimum | Recommended | Optimal |
| --- | --- | --- | --- |
| RAM | 32 GB | 64 GB | 128 GB |
| CPU | 8-core 3.5 GHz | 12-core 4.0 GHz | 16+ core 4.5 GHz |
| GPU | RTX 3060 | RTX 3070 / RX 6800 | RTX 4080+ / RX 7900+ |
| Storage | NVMe SSD | NVMe SSD (fast) | NVMe SSD (fastest) |

Expected Performance: 15-30 seconds per model iteration with recommended setup

Software Optimization

Browser Choice:

  • Best: Chrome or Edge (latest version)

  • Good: Brave, Chromium-based browsers

  • Acceptable: Firefox (latest)

  • Not Recommended: Safari (limited WebGPU), older browsers

Browser Settings:

  • Enable hardware acceleration

  • Allow sufficient memory per tab (Chrome: 2-4 GB)

  • Disable unnecessary extensions

  • Keep browser updated

Operating System:

  • Use 64-bit OS (required for large memory access)

  • Keep OS updated for latest performance optimizations

  • Close unnecessary background applications

Workflow Optimization

Start Small, Scale Up:

  1. Begin with subset of variables (50-100)

  2. Develop and test model structure

  3. Gradually add variables

  4. Final run with full variable set

Benefits: Faster iteration during development, full dataset only when needed

Reduce Diagnostic Frequency:

  • Run full diagnostics on final model only

  • Use quick validation during development

  • Enable all tests only when necessary

Use OLS Before Bayesian:

  • OLS is 50-100x faster than Bayesian

  • Validate model structure with OLS first

  • Run Bayesian only on vetted model specifications

Leverage Fast Inference:

  • Use Fast Inference (SVI) for Bayesian exploration

  • Switch to full MCMC only for final production model

  • Can iterate 10-20x faster during development

Handling Extremely Large Datasets

When You Hit Limits

Symptoms:

  • Browser becomes unresponsive

  • "Out of memory" errors

  • Very long operation times (>5 minutes for non-Bayesian)

  • Browser tab crashes

Immediate Solutions:

1. Reduce Variables (most effective):

  • Remove highly correlated variables (VIF > 10)

  • Eliminate non-significant variables from previous runs

  • Focus on key marketing channels

  • Group similar variables (e.g., combine social platforms)

2. Reduce Observations:

  • Use most recent data (e.g., last 2 years instead of 5)

  • Consider monthly instead of weekly data (if appropriate)

  • Focus on relevant time period for current business question

3. Process in Batches:

  • Split variables into logical groups

  • Run separate models for each group

  • Combine insights from multiple models

4. Upgrade Hardware:

  • Add more RAM (biggest impact)

  • Get dedicated GPU (significant speedup)

  • Use faster CPU (moderate improvement)

Data Reduction Techniques

Variable Selection:

Business Prioritization: Keep only strategically important channels

Statistical Filtering: Remove variables with the following (see the sketch after this list):

  • Very low variance (contribute little information)

  • Very high correlation with other variables (redundant)

  • Missing data >20% of observations
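
A sketch of these filters (the missing-data threshold is the documented 20%; the variance and correlation cutoffs are example values you would tune to your data):

```typescript
// Screen out variables that are mostly missing, nearly constant, or
// redundant with an already-kept variable. `data` maps variable name
// to its column, with NaN marking a missing observation.
function pearson(x: number[], y: number[]): number {
  const n = x.length;
  const mx = x.reduce((a, b) => a + b, 0) / n;
  const my = y.reduce((a, b) => a + b, 0) / n;
  let sxy = 0, sxx = 0, syy = 0;
  for (let i = 0; i < n; i++) {
    sxy += (x[i] - mx) * (y[i] - my);
    sxx += (x[i] - mx) ** 2;
    syy += (y[i] - my) ** 2;
  }
  return sxy / Math.sqrt(sxx * syy);
}

function screenVariables(data: Record<string, number[]>): string[] {
  // Pass 1: missing-data and variance filters.
  const candidates = Object.keys(data).filter((name) => {
    const col = data[name];
    const present = col.filter((v) => !Number.isNaN(v));
    if (present.length < col.length * 0.8) return false; // >20% missing
    const mean = present.reduce((a, b) => a + b, 0) / present.length;
    const variance =
      present.reduce((a, v) => a + (v - mean) ** 2, 0) / present.length;
    return variance > 1e-8; // near-constant columns carry no information
  });
  // Pass 2: greedy correlation screen (|r| > 0.95 as an example cutoff).
  // Assumes surviving columns are complete; handle missing values first.
  const kept: string[] = [];
  for (const name of candidates) {
    const redundant = kept.some(
      (k) => Math.abs(pearson(data[k], data[name])) > 0.95,
    );
    if (!redundant) kept.push(name);
  }
  return kept;
}
```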

Dimensionality Reduction:

  • Create composite variables (e.g., "Total_Digital" instead of 10 digital channels; see the sketch after this list)

  • Use principal components (outside MixModeler, then import reduced set)

  • Aggregate similar channels
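
For example, collapsing several digital channels into one composite column (the channel names here are illustrative):

```typescript
// Sum several channel columns into one composite variable, e.g. a
// "Total_Digital" column built from individual platform spends.
function composite(
  data: Record<string, number[]>,
  channels: string[],
): number[] {
  const n = data[channels[0]].length;
  return Array.from({ length: n }, (_, t) =>
    channels.reduce((sum, ch) => sum + data[ch][t], 0),
  );
}

// data["Total_Digital"] = composite(data, ["Search", "Social", "Display"]);
```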

Temporal Aggregation:

Weekly → Bi-Weekly: Reduces observations by 50%, minimal information loss

Weekly → Monthly: Reduces observations by 75%, some information loss

Considerations: Ensure aggregation makes business sense and doesn't hide important patterns
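
A sketch of the aggregation itself, for spend- or volume-type variables (rates and indices should be averaged rather than summed):

```typescript
// Collapse a weekly series by summing every `k` weeks: k = 2 for
// bi-weekly, k = 4 for an approximate month. Any trailing partial
// period is dropped.
function aggregateWeeks(weekly: number[], k: number): number[] {
  const out: number[] = [];
  for (let i = 0; i + k <= weekly.length; i += k) {
    out.push(weekly.slice(i, i + k).reduce((a, b) => a + b, 0));
  }
  return out; // 104 weekly obs -> 52 bi-weekly or 26 four-weekly
}
```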

Splitting Complex Models

Approach 1: Geographic Split

  • Model each region/market separately

  • Combine insights at reporting stage

  • Allows more variables per model

Approach 2: Channel Category Split

  • Model digital channels separately from traditional

  • Model brand vs performance marketing separately

  • Run comprehensive model with top performers from each

Approach 3: Time Period Split

  • Model recent period (high priority)

  • Model historical period separately

  • Compare for structural changes

Monitoring Performance

Key Metrics to Track

Load Time: Time to upload and validate data

  • Target: <5 seconds for large datasets

  • Concern: >15 seconds

Model Fitting Time: Time to estimate coefficients

  • Target: <10 seconds for OLS, <5 minutes for Bayesian

  • Concern: >30 seconds for OLS, >15 minutes for Bayesian

Memory Usage: Peak RAM consumption

  • Target: <2 GB

  • Concern: >3 GB (browser may become unstable)
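
In Chromium browsers you can spot-check heap usage yourself from the DevTools console with the non-standard performance.memory API. Note it reports only the JavaScript heap, not the separate WASM heap or GPU VRAM described earlier:

```typescript
// Chromium-only, non-standard API; absent in Firefox and Safari.
const mem = (performance as any).memory;
if (mem) {
  const usedMB = mem.usedJSHeapSize / 1e6;
  const limitMB = mem.jsHeapSizeLimit / 1e6;
  console.log(`JS heap: ${usedMB.toFixed(0)} / ${limitMB.toFixed(0)} MB`);
} else {
  console.log("performance.memory is not available in this browser");
}
```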

Operation Success Rate: Percentage of operations completing without error

  • Target: 100%

  • Concern: <95%

When to Optimize

Optimization Triggers:

  • Operations taking >2x expected time

  • Memory usage approaching 3-4 GB

  • Browser responsiveness degrading

  • Frequent need to restart browser

Optimization Actions (in order of impact):

  1. Reduce number of variables (biggest impact)

  2. Enable/upgrade GPU acceleration

  3. Close other applications/tabs

  4. Upgrade RAM

  5. Consider data aggregation

Best Practices Summary

Data Preparation:

  • Remove unnecessary variables before upload

  • Ensure data quality (no excessive missing values)

  • Use appropriate temporal granularity

  • Test with subset before full dataset

Hardware Utilization:

  • Ensure GPU acceleration active for large models

  • Close unnecessary applications

  • Use single browser tab for intensive operations

  • Monitor memory usage

Workflow Efficiency:

  • Start small, scale gradually

  • Use OLS for structure testing

  • Run full diagnostics only when needed

  • Leverage Fast Inference for Bayesian exploration

Resource Management:

  • Restart browser periodically during long sessions

  • Clear browser cache if performance degrades

  • Save models frequently

  • Export results before major operations

Troubleshooting Large Dataset Issues

Issue 1: "Out of Memory" Error

Cause: Dataset exceeds available RAM

Solutions:

  1. Close all other browser tabs and applications

  2. Reduce number of variables (remove low-importance ones)

  3. Reduce observations (use more recent data)

  4. Restart browser to clear memory leaks

  5. Upgrade system RAM

Issue 2: Browser Freezing/Unresponsive

Cause: Operation overwhelming browser

Solutions:

  1. Wait 2-3 minutes (may still be processing)

  2. If no progress, close tab and restart

  3. Reduce model complexity

  4. Enable GPU acceleration

  5. Use faster hardware

Issue 3: Very Slow Operations (>5 minutes for OLS)

Cause: Insufficient acceleration or hardware

Diagnosis:

  • Check whether the GPU badge is present (large models should show it)

  • Check the console for "using CPU" messages (a sign acceleration is inactive)

  • Monitor CPU/GPU usage in task manager

Solutions:

  1. Verify GPU acceleration active

  2. Update graphics drivers

  3. Close background applications

  4. Reduce variable count

  5. Consider hardware upgrade

Issue 4: Inconsistent Performance

Cause: Browser or system resource contention

Solutions:

  1. Restart browser fresh

  2. Close all unnecessary tabs

  3. Check for system updates or background processes

  4. Allow system to cool down (thermal throttling)

  5. Use performance mode in power settings

Issue 5: Upload Failing for Large Files

Cause: File size or browser limitations

Solutions:

  1. Save Excel as .csv (often smaller)

  2. Remove unnecessary columns in Excel before upload

  3. Split into multiple files if needed

  4. Ensure stable internet connection

  5. Try different browser

Future Scalability

Planned Enhancements

Advanced Chunking: More intelligent adaptive chunk sizing

Distributed Processing: Leverage multiple browser tabs/windows

Server-Side Options: Optional server processing for very large models (enterprise plans)

Improved Caching: Faster reloading of previously analyzed datasets

Memory Optimization: Reduced memory footprint for same dataset sizes

Current Limitations

Browser Constraints: Inherent browser memory limits (2-4 GB per tab)

Single-Tab Processing: Cannot currently distribute across multiple tabs

No Disk Caching: All data held in memory during session

Sequential Operations: Most operations cannot run simultaneously

Practical Impact: Very large models (500+ variables) will always require substantial hardware


Next Steps: Review the full Advanced Features section, or proceed to Exporting & Reporting to learn how to share your large model results with stakeholders.
