Large Dataset Handling

Overview

MixModeler is optimized to handle large marketing mix models with hundreds of variables and years of data. Through intelligent chunking, memory management, and acceleration technologies, the platform processes datasets that would overwhelm traditional browser-based applications.

This guide explains how MixModeler handles large datasets, what limits exist, and best practices for working efficiently with extensive marketing data.

Dataset Size Categories

Small Datasets

Characteristics:

  • Variables: 10-30

  • Observations: 26-52 (6 months to 1 year of weekly data)

  • Total data points: 260-1,560

Performance: Instant processing (<1 second for most operations)

Memory Usage: <100 MB

Ideal For: Single-channel testing, small business MMM, pilot projects

Medium Datasets

Characteristics:

  • Variables: 30-100

  • Observations: 52-156 (1-3 years of weekly data)

  • Total data points: 1,560-15,600

Performance: Fast processing (1-3 seconds with acceleration)

Memory Usage: 100-500 MB

Ideal For: Multi-channel MMM, standard business applications, most use cases

Large Datasets

Characteristics:

  • Variables: 100-300

  • Observations: 104-260 (2-5 years of weekly data)

  • Total data points: 10,400-78,000

Performance: Good performance with GPU (3-8 seconds), acceptable with WASM only (8-20 seconds)

Memory Usage: 500-2000 MB

Ideal For: Enterprise MMM, comprehensive multi-market models, advanced analytics

Very Large Datasets

Characteristics:

  • Variables: 300-500 (subscription limit)

  • Observations: 260+ (5+ years of weekly data)

  • Total data points: 78,000-130,000+

Performance: Requires GPU acceleration (10-30 seconds), slow without GPU (>60 seconds)

Memory Usage: 2-4 GB

Ideal For: Global enterprise MMM, exhaustive market analysis, research applications

Note: Professional/Business plan required for 300+ variables
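
All of the "total data points" figures above are simply variables × observations, with each value stored as an 8-byte float. A minimal sketch of that arithmetic (the category cutoffs here are taken from the variable ranges above, not from MixModeler's internal logic):

```typescript
// Each cell is one 8-byte floating-point value, so the raw size in
// bytes is variables * observations * 8. Cutoffs mirror the documented
// variable ranges and are illustrative only.
function describeDataset(variables: number, observations: number): string {
  const dataPoints = variables * observations;
  const rawKB = (dataPoints * 8) / 1000;
  const category =
    variables <= 30 ? "small" :
    variables <= 100 ? "medium" :
    variables <= 300 ? "large" : "very large";
  return `${category}: ${dataPoints.toLocaleString()} data points (~${rawKB} KB raw)`;
}

console.log(describeDataset(100, 156)); // "medium: 15,600 data points (~124.8 KB raw)"
```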

Subscription Limits

Variable Limits by Plan

Free Plan:

  • Maximum 20 variables per dataset

  • Suitable for testing and small models

  • All features available

Professional Plan:

  • Maximum 500 variables per dataset

  • Suitable for most business needs

  • Priority support

Business Plan:

  • Unlimited variables

  • Enterprise-scale modeling

  • Dedicated support

Observation Limits

Practical Limits:

  • Minimum: 26 observations (6 months of weekly data)

  • Recommended minimum: 52 observations (1 year)

  • Maximum: No hard limit; 500+ observations are supported

  • Optimal: 52-260 observations (1-5 years)

Best Practice: More observations generally improve model reliability, but returns diminish beyond 3-5 years of data

Memory Management

Browser Memory Architecture

How MixModeler Uses Memory:

Application Code: ~50-100 MB (fixed)

Loaded Data: Variable count × Observations × 8 bytes

  • Example: 100 vars × 200 obs = 160 KB

Working Memory: 5-10x loaded data during operations

  • Example: 160 KB → 800 KB - 1.6 MB during processing

WASM Memory: Separate heap, 500 MB - 2 GB depending on operation

GPU Memory: Separate VRAM when GPU acceleration active

Browser Overhead: ~200-500 MB for Chrome/Edge

Total Typical Usage: 1-4 GB for large models
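
Putting those figures together gives a rough memory budget. The sketch below is a back-of-the-envelope estimate built only from the constants in this section (it ignores GPU VRAM, which lives outside system RAM), not MixModeler's actual accounting:

```typescript
// Back-of-the-envelope memory estimate from the figures above.
// Excludes GPU VRAM, which is tracked separately from system RAM.
function estimateMemoryMB(
  variables: number,
  observations: number,
): { low: number; high: number } {
  const loadedMB = (variables * observations * 8) / 1e6; // 8 bytes per value
  const appCodeMB = 100;  // application code: ~50-100 MB (upper bound)
  const browserMB = 500;  // Chrome/Edge overhead: ~200-500 MB (upper bound)
  return {
    low: appCodeMB + browserMB + 500 + loadedMB * 5,    // WASM heap 500 MB, working memory 5x
    high: appCodeMB + browserMB + 2000 + loadedMB * 10, // WASM heap 2 GB, working memory 10x
  };
}

// Example: 300 variables x 260 observations (a "very large" dataset)
const est = estimateMemoryMB(300, 260);
console.log(`${est.low.toFixed(0)}-${est.high.toFixed(0)} MB`); // "1103-2606 MB"
```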

Memory-Efficient Processing

Chunked Operations: Large datasets processed in smaller chunks

How Chunking Works (see the sketch after these steps):

  1. Dataset split into manageable pieces (chunk size: 1,000-10,000 rows)

  2. Each chunk processed independently

  3. Results aggregated at the end

  4. Memory released after each chunk
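
A minimal sketch of this split/process/aggregate pattern; processChunk and mergeResults are hypothetical stand-ins for per-chunk work, not MixModeler's API:

```typescript
// Generic chunked processing: split, process each piece independently,
// aggregate at the end, and let each chunk be garbage-collected before
// the next one is created.
function processInChunks<T, R>(
  rows: T[],
  chunkSize: number,
  processChunk: (chunk: T[]) => R,
  mergeResults: (partials: R[]) => R,
): R {
  const partials: R[] = [];
  for (let start = 0; start < rows.length; start += chunkSize) {
    const chunk = rows.slice(start, start + chunkSize); // 1. split
    partials.push(processChunk(chunk));                 // 2. process independently
    // 4. `chunk` goes out of scope here, so its memory can be reclaimed
  }
  return mergeResults(partials);                        // 3. aggregate
}

// Example: chunked sum over a large column
const total = processInChunks(
  Array.from({ length: 100_000 }, (_, i) => i),
  5_000,
  (chunk) => chunk.reduce((a, b) => a + b, 0),
  (sums) => sums.reduce((a, b) => a + b, 0),
);
```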

Operations Using Chunking:

  • Data upload and validation

  • Statistical summary calculations

  • Correlation matrix generation (for very large variable sets)

  • Diagnostic test suites

User Impact: Transparent; operations complete normally, just slightly more slowly

Automatic Memory Management

Garbage Collection: Browser automatically reclaims unused memory

Memory Monitoring: MixModeler tracks usage and warns if approaching limits

Automatic Cleanup: Memory released immediately after operations complete

No User Action Required: Memory management is fully automatic

Performance Optimization for Large Datasets

Acceleration Technology Stack

For 100-200 Variables:

  • WASM: 5-10x speedup (always available)

  • GPU: Additional 3-8x speedup (when available)

  • Combined: 15-80x faster than pure JavaScript

For 200-500 Variables:

  • WASM: 6-12x speedup (essential)

  • GPU: Additional 5-15x speedup (highly recommended)

  • Combined: 30-180x faster than pure JavaScript
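
The combined ranges are simply the products of the stacked multipliers:

```typescript
// Stacked speedups multiply: WASM 5-10x with GPU 3-8x gives 15-80x.
const combined = (wasm: [number, number], gpu: [number, number]): [number, number] =>
  [wasm[0] * gpu[0], wasm[1] * gpu[1]];

console.log(combined([5, 10], [3, 8]));  // [15, 80]  -> 100-200 variables
console.log(combined([6, 12], [5, 15])); // [30, 180] -> 200-500 variables
```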

Chunk Size Optimization

MixModeler automatically selects optimal chunk sizes; a sketch of the selection logic follows the tiers below:

Small Datasets (<10,000 data points):

  • No chunking needed

  • Process entire dataset at once

Medium Datasets (10,000-50,000 data points):

  • Chunk size: 10,000 rows

  • 2-5 chunks typical

  • Minimal overhead

Large Datasets (50,000-100,000 data points):

  • Chunk size: 5,000 rows

  • 10-20 chunks typical

  • Managed overhead

Very Large Datasets (>100,000 data points):

  • Chunk size: 2,000-5,000 rows

  • 20-50 chunks typical

  • More processing time but prevents memory issues
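
A sketch of that selection logic using the thresholds listed above (the actual heuristic may differ; note the tiers are stated in total data points while chunk sizes are in rows):

```typescript
// Chunk-size heuristic mirroring the documented tiers.
function chooseChunkSize(dataPoints: number): number | null {
  if (dataPoints < 10_000) return null;     // small: process in one pass
  if (dataPoints <= 50_000) return 10_000;  // medium: 2-5 chunks typical
  if (dataPoints <= 100_000) return 5_000;  // large: 10-20 chunks typical
  return 2_000;                             // very large: 2,000-5,000 rows; lower bound shown
}
```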

Parallel Processing

Multi-Core CPU Utilization (see the worker sketch after this list):

  • Chunk processing distributed across CPU cores

  • 4-core CPU: Up to 4 chunks simultaneously

  • 8-core CPU: Up to 8 chunks simultaneously
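
Conceptually this is the standard Web Worker fan-out pattern, sketched below. The worker script name and message shapes are hypothetical; MixModeler's actual scheduling is internal:

```typescript
// Fan chunks out across one worker per CPU core, then gather results.
// "chunk-worker.js" is a hypothetical script that processes one chunk
// per message and posts back a partial result.
async function processParallel<R>(chunks: unknown[]): Promise<R[]> {
  const cores = navigator.hardwareConcurrency ?? 4; // 4-core -> 4 workers, etc.
  const workers = Array.from(
    { length: Math.min(cores, chunks.length) },
    () => new Worker("chunk-worker.js"),
  );
  let next = 0;
  const results: R[] = [];
  await Promise.all(
    workers.map(async (worker) => {
      while (next < chunks.length) {
        const i = next++; // claim the next unprocessed chunk
        results[i] = await new Promise<R>((resolve) => {
          worker.onmessage = (e) => resolve(e.data as R);
          worker.postMessage(chunks[i]);
        });
      }
      worker.terminate();
    }),
  );
  return results;
}
```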

GPU Parallel Processing:

  • Thousands of operations simultaneously

  • Particularly effective for matrix operations

  • Essential for 200+ variable models

Bayesian MCMC:

  • Multiple chains run in parallel

  • 4 chains utilize 4 CPU cores optimally

  • Each chain processes independently

Practical Guidelines

Hardware Recommendations by Dataset Size

For 100-200 Variables:

| Component | Minimum | Recommended | Optimal |
| --- | --- | --- | --- |
| RAM | 8 GB | 16 GB | 32 GB |
| CPU | Quad-core 2.5 GHz | Quad-core 3.0 GHz | 6-8 core 3.5 GHz |
| GPU | Integrated | GTX 1660 / RX 5600 | RTX 3060 / RX 6700 |
| Storage | HDD | SSD | NVMe SSD |

Expected Performance: 2-5 seconds per model iteration with recommended setup

For 200-400 Variables:

| Component | Minimum | Recommended | Optimal |
| --- | --- | --- | --- |
| RAM | 16 GB | 32 GB | 64 GB |
| CPU | 6-core 3.0 GHz | 8-core 3.5 GHz | 12+ core 4.0 GHz |
| GPU | GTX 1660 | RTX 3060 / RX 6700 | RTX 3070+ / RX 6800+ |
| Storage | SSD | NVMe SSD | NVMe SSD (fast) |

Expected Performance: 5-15 seconds per model iteration with recommended setup

For 400-500 Variables:

| Component | Minimum | Recommended | Optimal |
| --- | --- | --- | --- |
| RAM | 32 GB | 64 GB | 128 GB |
| CPU | 8-core 3.5 GHz | 12-core 4.0 GHz | 16+ core 4.5 GHz |
| GPU | RTX 3060 | RTX 3070 / RX 6800 | RTX 4080+ / RX 7900+ |
| Storage | NVMe SSD | NVMe SSD (fast) | NVMe SSD (fastest) |

Expected Performance: 15-30 seconds per model iteration with recommended setup

Software Optimization

Browser Choice:

  • Best: Chrome or Edge (latest version)

  • Good: Brave, Chromium-based browsers

  • Acceptable: Firefox (latest)

  • Not Recommended: Safari (limited WebGPU), older browsers

Browser Settings:

  • Enable hardware acceleration

  • Allow sufficient memory per tab (Chrome: 2-4 GB)

  • Disable unnecessary extensions

  • Keep browser updated

Operating System:

  • Use 64-bit OS (required for large memory access)

  • Keep OS updated for latest performance optimizations

  • Close unnecessary background applications

Workflow Optimization

Start Small, Scale Up:

  1. Begin with subset of variables (50-100)

  2. Develop and test model structure

  3. Gradually add variables

  4. Final run with full variable set

Benefits: Faster iteration during development, full dataset only when needed

Reduce Diagnostic Frequency:

  • Run full diagnostics on final model only

  • Use quick validation during development

  • Enable all tests only when necessary

Use OLS Before Bayesian:

  • OLS is 50-100x faster than Bayesian

  • Validate model structure with OLS first

  • Run Bayesian only on vetted model specifications

Leverage Fast Inference:

  • Use Fast Inference (SVI) for Bayesian exploration

  • Switch to full MCMC only for final production model

  • Can iterate 10-20x faster during development

Handling Extremely Large Datasets

When You Hit Limits

Symptoms:

  • Browser becomes unresponsive

  • "Out of memory" errors

  • Very long operation times (>5 minutes for non-Bayesian)

  • Browser tab crashes

Immediate Solutions:

1. Reduce Variables (most effective):

  • Remove highly correlated variables (VIF > 10)

  • Eliminate non-significant variables from previous runs

  • Focus on key marketing channels

  • Group similar variables (e.g., combine social platforms)

2. Reduce Observations:

  • Use most recent data (e.g., last 2 years instead of 5)

  • Consider monthly instead of weekly data (if appropriate)

  • Focus on relevant time period for current business question

3. Process in Batches:

  • Split variables into logical groups

  • Run separate models for each group

  • Combine insights from multiple models

4. Upgrade Hardware:

  • Add more RAM (biggest impact)

  • Get dedicated GPU (significant speedup)

  • Use faster CPU (moderate improvement)

Data Reduction Techniques

Variable Selection:

Business Prioritization: Keep only strategically important channels

Statistical Filtering: Remove variables with the following (see the sketch after this list):

  • Very low variance (contribute little information)

  • Very high correlation with other variables (redundant)

  • Missing data >20% of observations
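
A sketch of these filters (the missing-data threshold is the documented 20%; the variance and correlation cutoffs are example values you would tune to your data):

```typescript
// Screen out variables that are mostly missing, nearly constant, or
// redundant with an already-kept variable. `data` maps variable name
// to its column, with NaN marking a missing observation.
function pearson(x: number[], y: number[]): number {
  const n = x.length;
  const mx = x.reduce((a, b) => a + b, 0) / n;
  const my = y.reduce((a, b) => a + b, 0) / n;
  let sxy = 0, sxx = 0, syy = 0;
  for (let i = 0; i < n; i++) {
    sxy += (x[i] - mx) * (y[i] - my);
    sxx += (x[i] - mx) ** 2;
    syy += (y[i] - my) ** 2;
  }
  return sxy / Math.sqrt(sxx * syy);
}

function screenVariables(data: Record<string, number[]>): string[] {
  // Pass 1: missing-data and variance filters.
  const candidates = Object.keys(data).filter((name) => {
    const col = data[name];
    const present = col.filter((v) => !Number.isNaN(v));
    if (present.length < col.length * 0.8) return false; // >20% missing
    const mean = present.reduce((a, b) => a + b, 0) / present.length;
    const variance =
      present.reduce((a, v) => a + (v - mean) ** 2, 0) / present.length;
    return variance > 1e-8; // near-constant columns carry no information
  });
  // Pass 2: greedy correlation screen (|r| > 0.95 as an example cutoff).
  // Assumes surviving columns are complete; handle missing values first.
  const kept: string[] = [];
  for (const name of candidates) {
    const redundant = kept.some(
      (k) => Math.abs(pearson(data[k], data[name])) > 0.95,
    );
    if (!redundant) kept.push(name);
  }
  return kept;
}
```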

Dimensionality Reduction:

  • Create composite variables (e.g., "Total_Digital" instead of 10 digital channels; see the sketch after this list)

  • Use principal components (outside MixModeler, then import reduced set)

  • Aggregate similar channels
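
For example, collapsing several digital channels into one composite column (the channel names here are illustrative):

```typescript
// Sum several channel columns into one composite variable, e.g. a
// "Total_Digital" column built from individual platform spends.
function composite(
  data: Record<string, number[]>,
  channels: string[],
): number[] {
  const n = data[channels[0]].length;
  return Array.from({ length: n }, (_, t) =>
    channels.reduce((sum, ch) => sum + data[ch][t], 0),
  );
}

// data["Total_Digital"] = composite(data, ["Search", "Social", "Display"]);
```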

Temporal Aggregation:

Weekly → Bi-Weekly: Reduces observations by 50%, minimal information loss

Weekly → Monthly: Reduces observations by 75%, some information loss

Considerations: Ensure aggregation makes business sense and doesn't hide important patterns
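
A sketch of the aggregation itself, for spend- or volume-type variables (rates and indices should be averaged rather than summed):

```typescript
// Collapse a weekly series by summing every `k` weeks: k = 2 for
// bi-weekly, k = 4 for an approximate month. Any trailing partial
// period is dropped.
function aggregateWeeks(weekly: number[], k: number): number[] {
  const out: number[] = [];
  for (let i = 0; i + k <= weekly.length; i += k) {
    out.push(weekly.slice(i, i + k).reduce((a, b) => a + b, 0));
  }
  return out; // 104 weekly obs -> 52 bi-weekly or 26 four-weekly
}
```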

Splitting Complex Models

Approach 1: Geographic Split

  • Model each region/market separately

  • Combine insights at reporting stage

  • Allows more variables per model

Approach 2: Channel Category Split

  • Model digital channels separately from traditional

  • Model brand vs performance marketing separately

  • Run comprehensive model with top performers from each

Approach 3: Time Period Split

  • Model recent period (high priority)

  • Model historical period separately

  • Compare for structural changes

Monitoring Performance

Key Metrics to Track

Load Time: Time to upload and validate data

  • Target: <5 seconds for large datasets

  • Concern: >15 seconds

Model Fitting Time: Time to estimate coefficients

  • Target: <10 seconds for OLS, <5 minutes for Bayesian

  • Concern: >30 seconds for OLS, >15 minutes for Bayesian

Memory Usage: Peak RAM consumption

  • Target: <2 GB

  • Concern: >3 GB (browser may become unstable)
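
In Chromium browsers you can spot-check heap usage yourself from the DevTools console with the non-standard performance.memory API. Note it reports only the JavaScript heap, not the separate WASM heap or GPU VRAM described earlier:

```typescript
// Chromium-only, non-standard API; absent in Firefox and Safari.
const mem = (performance as any).memory;
if (mem) {
  const usedMB = mem.usedJSHeapSize / 1e6;
  const limitMB = mem.jsHeapSizeLimit / 1e6;
  console.log(`JS heap: ${usedMB.toFixed(0)} / ${limitMB.toFixed(0)} MB`);
} else {
  console.log("performance.memory is not available in this browser");
}
```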

Operation Success Rate: Percentage of operations completing without error

  • Target: 100%

  • Concern: <95%

When to Optimize

Optimization Triggers:

  • Operations taking >2x expected time

  • Memory usage approaching 3-4 GB

  • Browser responsiveness degrading

  • Frequent need to restart browser

Optimization Actions (in order of impact):

  1. Reduce number of variables (biggest impact)

  2. Enable/upgrade GPU acceleration

  3. Close other applications/tabs

  4. Upgrade RAM

  5. Consider data aggregation

Best Practices Summary

Data Preparation:

  • Remove unnecessary variables before upload

  • Ensure data quality (no excessive missing values)

  • Use appropriate temporal granularity

  • Test with subset before full dataset

Hardware Utilization:

  • Ensure GPU acceleration active for large models

  • Close unnecessary applications

  • Use single browser tab for intensive operations

  • Monitor memory usage

Workflow Efficiency:

  • Start small, scale gradually

  • Use OLS for structure testing

  • Run full diagnostics only when needed

  • Leverage Fast Inference for Bayesian exploration

Resource Management:

  • Restart browser periodically during long sessions

  • Clear browser cache if performance degrades

  • Save models frequently

  • Export results before major operations

Troubleshooting Large Dataset Issues

Issue 1: "Out of Memory" Error

Cause: Dataset exceeds available RAM

Solutions:

  1. Close all other browser tabs and applications

  2. Reduce number of variables (remove low-importance ones)

  3. Reduce observations (use more recent data)

  4. Restart browser to clear memory leaks

  5. Upgrade system RAM

Issue 2: Browser Freezing/Unresponsive

Cause: Operation overwhelming browser

Solutions:

  1. Wait 2-3 minutes (may still be processing)

  2. If no progress, close tab and restart

  3. Reduce model complexity

  4. Enable GPU acceleration

  5. Use faster hardware

Issue 3: Very Slow Operations (>5 minutes for OLS)

Cause: Insufficient acceleration or hardware

Diagnosis:

  • Check whether the GPU badge is present (large models should show it)

  • Check the console for "using CPU" messages (a sign acceleration is inactive)

  • Monitor CPU/GPU usage in task manager

Solutions:

  1. Verify GPU acceleration active

  2. Update graphics drivers

  3. Close background applications

  4. Reduce variable count

  5. Consider hardware upgrade

Issue 4: Inconsistent Performance

Cause: Browser or system resource contention

Solutions:

  1. Restart browser fresh

  2. Close all unnecessary tabs

  3. Check for system updates or background processes

  4. Allow system to cool down (thermal throttling)

  5. Use performance mode in power settings

Issue 5: Upload Failing for Large Files

Cause: File size or browser limitations

Solutions:

  1. Save Excel as .csv (often smaller)

  2. Remove unnecessary columns in Excel before upload

  3. Split into multiple files if needed

  4. Ensure stable internet connection

  5. Try different browser

Future Scalability

Planned Enhancements

Advanced Chunking: More intelligent adaptive chunk sizing

Distributed Processing: Leverage multiple browser tabs/windows

Server-Side Options: Optional server processing for very large models (enterprise plans)

Improved Caching: Faster reloading of previously analyzed datasets

Memory Optimization: Reduced memory footprint for same dataset sizes

Current Limitations

Browser Constraints: Inherent browser memory limits (2-4 GB per tab)

Single-Tab Processing: Cannot currently distribute across multiple tabs

No Disk Caching: All data held in memory during session

Sequential Operations: Most operations cannot run simultaneously

Practical Impact: Very large models (500+ variables) will always require substantial hardware


Next Steps: Review the full Advanced Features section, or proceed to Exporting & Reporting to learn how to share your large model results with stakeholders.
