Large Dataset Handling
Overview
MixModeler is optimized to handle large marketing mix models with hundreds of variables and years of data. Through intelligent chunking, memory management, and acceleration technologies, the platform processes datasets that would overwhelm traditional browser-based applications.
This guide explains how MixModeler handles large datasets, what limits exist, and best practices for working efficiently with extensive marketing data.
Dataset Size Categories
Small Datasets
Characteristics:
Variables: 10-30
Observations: 26-52 (6 months to 1 year of weekly data)
Total data points: 260-1,560
Performance: Instant processing (<1 second for most operations)
Memory Usage: <100 MB
Ideal For: Single-channel testing, small business MMM, pilot projects
Medium Datasets
Characteristics:
Variables: 30-100
Observations: 52-156 (1-3 years weekly)
Total data points: 1,560-15,600
Performance: Fast processing (1-3 seconds with acceleration)
Memory Usage: 100-500 MB
Ideal For: Multi-channel MMM, standard business applications, most use cases
Large Datasets
Characteristics:
Variables: 100-300
Observations: 104-260 (2-5 years weekly)
Total data points: 10,400-78,000
Performance: Good performance with GPU (3-8 seconds), acceptable with WASM only (8-20 seconds)
Memory Usage: 500 MB - 2 GB
Ideal For: Enterprise MMM, comprehensive multi-market models, advanced analytics
Very Large Datasets
Characteristics:
Variables: 300-500 (subscription limit)
Observations: 260+ (5+ years weekly)
Total data points: 78,000-130,000+
Performance: Requires GPU acceleration (10-30 seconds), slow without GPU (>60 seconds)
Memory Usage: 2-4 GB
Ideal For: Global enterprise MMM, exhaustive market analysis, research applications
Note: Professional/Business plan required for 300+ variables
Subscription Limits
Variable Limits by Plan
Free Plan:
Maximum 20 variables per dataset
Suitable for testing and small models
All features available
Professional Plan:
Maximum 500 variables per dataset
Suitable for most business needs
Priority support
Business Plan:
Unlimited variables
Enterprise-scale modeling
Dedicated support
Observation Limits
Practical Limits:
Minimum: 26 observations (6 months weekly data)
Recommended minimum: 52 observations (1 year)
Maximum: No hard limit; 500+ observations are supported
Optimal: 52-260 observations (1-5 years)
Best Practice: More observations generally improve model reliability, but returns diminish beyond 3-5 years of data
Memory Management
Browser Memory Architecture
How MixModeler Uses Memory:
Application Code: ~50-100 MB (fixed)
Loaded Data: Variable count × Observations × 8 bytes (see the worked sketch after this list)
Example: 100 vars × 200 obs × 8 bytes = 160 KB
Working Memory: 5-10x the loaded data size during operations
Example: 160 KB → 800 KB - 1.6 MB during processing
WASM Memory: Separate heap, 500 MB - 2 GB depending on operation
GPU Memory: Separate VRAM when GPU acceleration active
Browser Overhead: ~200-500 MB for Chrome/Edge
Total Typical Usage: 1-4 GB for large models
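You can apply the same arithmetic to your own dataset before uploading. The helper below is an illustrative TypeScript sketch of the Variables × Observations × 8 bytes rule and the 5-10x working-memory multiplier described above; it is not part of MixModeler's API.

```typescript
// Rough estimate of dataset memory footprint, assuming 8 bytes
// (one 64-bit float) per data point, as described above.
function estimateMemoryKB(variables: number, observations: number) {
  const loadedKB = (variables * observations * 8) / 1000;
  return {
    loadedKB,
    // Working memory during operations is typically 5-10x the loaded data.
    workingKBRange: [loadedKB * 5, loadedKB * 10],
  };
}

// 100 vars x 200 obs -> 160 KB loaded, roughly 800 KB - 1.6 MB working memory.
console.log(estimateMemoryKB(100, 200));
```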
Memory-Efficient Processing
Chunked Operations: Large datasets processed in smaller chunks
How Chunking Works:
Dataset split into manageable pieces (chunk size: 1,000-10,000 rows)
Each chunk processed independently
Results aggregated at the end
Memory released after each chunk
Operations Using Chunking:
Data upload and validation
Statistical summary calculations
Correlation matrix generation (for very large variable sets)
Diagnostic test suites
User Impact: Transparent; operations complete normally and simply take slightly longer (see the sketch below for the general pattern)
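The following is a minimal TypeScript sketch of the chunking pattern described above, using column sums as a stand-in for the real statistics. It is illustrative only, not MixModeler's internal code.

```typescript
// Illustrative chunked processing: split rows into chunks, process each
// independently, and aggregate the partial results (here, column sums).
function processInChunks(rows: number[][], chunkSize = 5000): number[] {
  const columnSums: number[] = new Array(rows[0]?.length ?? 0).fill(0);
  for (let start = 0; start < rows.length; start += chunkSize) {
    const chunk = rows.slice(start, start + chunkSize); // one manageable piece
    for (const row of chunk) {
      row.forEach((value, col) => (columnSums[col] += value));
    }
    // The chunk goes out of scope here, so its memory can be reclaimed
    // by the browser's garbage collector before the next iteration.
  }
  return columnSums;
}
```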
Automatic Memory Management
Garbage Collection: Browser automatically reclaims unused memory
Memory Monitoring: MixModeler tracks usage and warns if approaching limits
Automatic Cleanup: Memory released immediately after operations complete
No User Action Required: Memory management is fully automatic
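For reference, Chromium-based browsers expose a non-standard performance.memory object that a web application can poll to warn when heap usage approaches the tab limit. The snippet below is a generic illustration of that pattern (the 80% threshold is an assumption), not MixModeler's actual monitoring code.

```typescript
// Illustrative memory check using the non-standard (Chromium-only)
// performance.memory API; it is not available in Firefox or Safari.
function checkMemoryPressure(warnRatio = 0.8): void {
  const mem = (performance as any).memory;
  if (!mem) return; // API not exposed in this browser
  const used: number = mem.usedJSHeapSize;
  const limit: number = mem.jsHeapSizeLimit;
  if (used / limit > warnRatio) {
    console.warn(
      `Heap usage ${(used / 1e6).toFixed(0)} MB exceeds ` +
      `${warnRatio * 100}% of the ${(limit / 1e6).toFixed(0)} MB tab limit.`
    );
  }
}
```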
Performance Optimization for Large Datasets
Acceleration Technology Stack
For 100-200 Variables:
WASM: 5-10x speedup (always available)
GPU: Additional 3-8x speedup (when available)
Combined: 15-80x faster than pure JavaScript
For 200-500 Variables:
WASM: 6-12x speedup (essential)
GPU: Additional 5-15x speedup (highly recommended)
Combined: 30-180x faster than pure JavaScript
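Whether the GPU tier of this stack is available depends on your browser exposing WebGPU. The snippet below is a generic feature check you can run in the browser console; it uses the standard navigator.gpu and WebAssembly objects and is not a MixModeler API.

```typescript
// Generic WebGPU availability check (standard navigator.gpu API).
async function hasGpuAcceleration(): Promise<boolean> {
  if (!("gpu" in navigator)) return false;              // WebGPU not exposed
  const adapter = await (navigator as any).gpu.requestAdapter();
  return adapter !== null;                              // null = no usable GPU
}

// WebAssembly support check (available in all modern browsers).
const hasWasm = typeof WebAssembly === "object";

hasGpuAcceleration().then((gpu) =>
  console.log(`WASM: ${hasWasm}, GPU: ${gpu}`)
);
```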
Chunk Size Optimization
MixModeler automatically selects optimal chunk sizes:
Small Datasets (<10,000 data points):
No chunking needed
Process entire dataset at once
Medium Datasets (10,000-50,000 data points):
Chunk size: 10,000 rows
2-5 chunks typical
Minimal overhead
Large Datasets (50,000-100,000 data points):
Chunk size: 5,000 rows
10-20 chunks typical
Managed overhead
Very Large Datasets (>100,000 data points):
Chunk size: 2,000-5,000 rows
20-50 chunks typical
More processing time but prevents memory issues
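The tiers above translate into a simple selection rule. The sketch below mirrors them in TypeScript as an illustration of adaptive chunk sizing; the exact values MixModeler uses internally may differ.

```typescript
// Illustrative chunk-size selection based on total data points
// (variables x observations), mirroring the tiers listed above.
function chooseChunkSize(variables: number, observations: number): number {
  const dataPoints = variables * observations;
  if (dataPoints < 10_000) return observations;   // no chunking: single pass
  if (dataPoints < 50_000) return 10_000;         // medium datasets
  if (dataPoints < 100_000) return 5_000;         // large datasets
  return 2_000;                                   // very large datasets
}

// e.g. 300 variables x 260 observations = 78,000 points -> 5,000-row chunks
console.log(chooseChunkSize(300, 260));
```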
Parallel Processing
Multi-Core CPU Utilization:
Chunk processing distributed across CPU cores
4-core CPU: Up to 4 chunks simultaneously
8-core CPU: Up to 8 chunks simultaneously
GPU Parallel Processing:
Thousands of operations simultaneously
Particularly effective for matrix operations
Essential for 200+ variable models
Bayesian MCMC:
Multiple chains run in parallel
4 chains utilize 4 CPU cores optimally
Each chain processes independently
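In a browser, this kind of multi-core distribution is typically implemented with Web Workers, one per logical core. The sketch below shows the general pattern only; the "chunk-worker.js" script is hypothetical, and this is not MixModeler's implementation.

```typescript
// Illustrative multi-core pattern: one Web Worker per logical CPU core,
// each handed chunks to process. Assumes a hypothetical "chunk-worker.js"
// that posts back { chunkIndex, result } for every chunk it receives.
function processChunksInParallel(chunks: number[][][]): Promise<number[][]> {
  const cores = navigator.hardwareConcurrency || 4;
  const workers = Array.from(
    { length: Math.min(cores, chunks.length) },
    () => new Worker("chunk-worker.js")
  );

  return Promise.all(
    chunks.map((chunk, i) => new Promise<number[]>((resolve) => {
      const worker = workers[i % workers.length]; // round-robin assignment
      worker.addEventListener("message", (e: MessageEvent) => {
        if (e.data.chunkIndex === i) resolve(e.data.result);
      });
      worker.postMessage({ chunkIndex: i, chunk });
    }))
  );
}
```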
Practical Guidelines
Hardware Recommendations by Dataset Size
For 100-200 Variables (minimum | recommended | optimal):
RAM: 8 GB | 16 GB | 32 GB
CPU: Quad-core 2.5 GHz | Quad-core 3.0 GHz | 6-8 core 3.5 GHz
GPU: Integrated | GTX 1660 / RX 5600 | RTX 3060 / RX 6700
Storage: HDD | SSD | NVMe SSD
Expected Performance: 2-5 seconds per model iteration with recommended setup
For 200-400 Variables (minimum | recommended | optimal):
RAM: 16 GB | 32 GB | 64 GB
CPU: 6-core 3.0 GHz | 8-core 3.5 GHz | 12+ core 4.0 GHz
GPU: GTX 1660 | RTX 3060 / RX 6700 | RTX 3070+ / RX 6800+
Storage: SSD | NVMe SSD | NVMe SSD (fast)
Expected Performance: 5-15 seconds per model iteration with recommended setup
For 400-500 Variables (minimum | recommended | optimal):
RAM: 32 GB | 64 GB | 128 GB
CPU: 8-core 3.5 GHz | 12-core 4.0 GHz | 16+ core 4.5 GHz
GPU: RTX 3060 | RTX 3070 / RX 6800 | RTX 4080+ / RX 7900+
Storage: NVMe SSD | NVMe SSD (fast) | NVMe SSD (fastest)
Expected Performance: 15-30 seconds per model iteration with recommended setup
Software Optimization
Browser Choice:
Best: Chrome or Edge (latest version)
Good: Brave, Chromium-based browsers
Acceptable: Firefox (latest)
Not Recommended: Safari (limited WebGPU), older browsers
Browser Settings:
Enable hardware acceleration
Allow sufficient memory per tab (Chrome: 2-4 GB)
Disable unnecessary extensions
Keep browser updated
Operating System:
Use 64-bit OS (required for large memory access)
Keep OS updated for latest performance optimizations
Close unnecessary background applications
Workflow Optimization
Start Small, Scale Up:
Begin with subset of variables (50-100)
Develop and test model structure
Gradually add variables
Final run with full variable set
Benefits: Faster iteration during development, full dataset only when needed
Reduce Diagnostic Frequency:
Run full diagnostics on final model only
Use quick validation during development
Enable all tests only when necessary
Use OLS Before Bayesian:
OLS is 50-100x faster than Bayesian
Validate model structure with OLS first
Run Bayesian only on vetted model specifications
Leverage Fast Inference:
Use Fast Inference (SVI) for Bayesian exploration
Switch to full MCMC only for final production model
Can iterate 10-20x faster during development
Handling Extremely Large Datasets
When You Hit Limits
Symptoms:
Browser becomes unresponsive
"Out of memory" errors
Very long operation times (>5 minutes for non-Bayesian)
Browser tab crashes
Immediate Solutions:
1. Reduce Variables (most effective):
Remove highly correlated variables (VIF > 10)
Eliminate non-significant variables from previous runs
Focus on key marketing channels
Group similar variables (e.g., combine social platforms)
2. Reduce Observations:
Use most recent data (e.g., last 2 years instead of 5)
Consider monthly instead of weekly data (if appropriate)
Focus on relevant time period for current business question
3. Process in Batches:
Split variables into logical groups
Run separate models for each group
Combine insights from multiple models
4. Upgrade Hardware:
Add more RAM (biggest impact)
Get dedicated GPU (significant speedup)
Use faster CPU (moderate improvement)
Data Reduction Techniques
Variable Selection:
Business Prioritization: Keep only strategically important channels
Statistical Filtering: Remove variables with:
Very low variance (contribute little information)
Very high correlation with other variables (redundant)
Missing data >20% of observations
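The first two filters can be applied in a small script before upload. The sketch below is an illustrative TypeScript pre-processing helper; the variance and correlation thresholds are assumptions you should tune to your data, and it is not part of MixModeler.

```typescript
// Illustrative pre-upload filter: drop variables with near-zero variance
// or a very high absolute correlation with a variable already kept.
function variance(x: number[]): number {
  const mean = x.reduce((a, b) => a + b, 0) / x.length;
  return x.reduce((a, b) => a + (b - mean) ** 2, 0) / x.length;
}

function correlation(x: number[], y: number[]): number {
  const mx = x.reduce((a, b) => a + b, 0) / x.length;
  const my = y.reduce((a, b) => a + b, 0) / y.length;
  let num = 0, dx = 0, dy = 0;
  for (let i = 0; i < x.length; i++) {
    num += (x[i] - mx) * (y[i] - my);
    dx += (x[i] - mx) ** 2;
    dy += (y[i] - my) ** 2;
  }
  return num / Math.sqrt(dx * dy);
}

function selectVariables(
  data: Record<string, number[]>,
  minVariance = 1e-8,        // assumed threshold: near-constant columns
  maxAbsCorrelation = 0.95   // assumed threshold: effectively redundant columns
): string[] {
  const kept: string[] = [];
  for (const [name, values] of Object.entries(data)) {
    if (variance(values) < minVariance) continue;
    const redundant = kept.some(
      (k) => Math.abs(correlation(data[k], values)) > maxAbsCorrelation
    );
    if (!redundant) kept.push(name);
  }
  return kept;
}
```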
Dimensionality Reduction:
Create composite variables (e.g., "Total_Digital" instead of 10 digital channels)
Use principal components (outside MixModeler, then import reduced set)
Aggregate similar channels
Temporal Aggregation:
Weekly → Bi-Weekly: Reduces observations by 50%, minimal information loss
Weekly → Monthly: Reduces observations by 75%, some information loss
Considerations: Ensure aggregation makes business sense and doesn't hide important patterns
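If you aggregate outside MixModeler before upload, the operation is a simple group-and-sum. The sketch below (illustrative TypeScript) collapses consecutive weekly rows of spend into coarser periods; summing is appropriate for spend and volume columns, while rate-type columns should be averaged instead.

```typescript
// Illustrative temporal aggregation: sum consecutive weekly rows into
// coarser periods (period = 2 for bi-weekly, 4 to approximate monthly).
function aggregateWeekly(rows: number[][], period = 2): number[][] {
  const aggregated: number[][] = [];
  for (let start = 0; start < rows.length; start += period) {
    const group = rows.slice(start, start + period);
    const summed = group[0].map((_, col) =>
      group.reduce((total, row) => total + row[col], 0)
    );
    aggregated.push(summed);
  }
  return aggregated; // e.g. 104 weekly observations -> 52 bi-weekly
}
```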
Splitting Complex Models
Approach 1: Geographic Split
Model each region/market separately
Combine insights at reporting stage
Allows more variables per model
Approach 2: Channel Category Split
Model digital channels separately from traditional
Model brand vs performance marketing separately
Run comprehensive model with top performers from each
Approach 3: Time Period Split
Model recent period (high priority)
Model historical period separately
Compare for structural changes
Monitoring Performance
Key Metrics to Track
Load Time: Time to upload and validate data
Target: <5 seconds for large datasets
Concern: >15 seconds
Model Fitting Time: Time to estimate coefficients
Target: <10 seconds for OLS, <5 minutes for Bayesian
Concern: >30 seconds for OLS, >15 minutes for Bayesian
Memory Usage: Peak RAM consumption
Target: <2 GB
Concern: >3 GB (browser may become unstable)
Operation Success Rate: Percentage of operations completing without error
Target: 100%
Concern: <95%
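The timing metrics can be tracked with the standard Performance API. The helper below is a generic timing wrapper with assumed target values; the uploadAndValidate call in the usage comment is hypothetical, and none of this is a MixModeler feature.

```typescript
// Generic timing wrapper using the standard Performance API.
// Logs a warning-style status when an operation exceeds its target.
async function timed<T>(
  label: string,
  targetMs: number,
  operation: () => Promise<T>
): Promise<T> {
  const start = performance.now();
  const result = await operation();
  const elapsed = performance.now() - start;
  const status = elapsed > targetMs ? "over target" : "ok";
  console.log(`${label}: ${(elapsed / 1000).toFixed(1)} s (${status})`);
  return result;
}

// Usage (hypothetical wrapper around your own workflow step):
// await timed("Data upload", 5000, () => uploadAndValidate(file));
```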
When to Optimize
Optimization Triggers:
Operations taking >2x expected time
Memory usage approaching 3-4 GB
Browser responsiveness degrading
Frequent need to restart browser
Optimization Actions (in order of impact):
Reduce number of variables (biggest impact)
Enable/upgrade GPU acceleration
Close other applications/tabs
Upgrade RAM
Consider data aggregation
Best Practices Summary
Data Preparation:
Remove unnecessary variables before upload
Ensure data quality (no excessive missing values)
Use appropriate temporal granularity
Test with subset before full dataset
Hardware Utilization:
Ensure GPU acceleration active for large models
Close unnecessary applications
Use single browser tab for intensive operations
Monitor memory usage
Workflow Efficiency:
Start small, scale gradually
Use OLS for structure testing
Run full diagnostics only when needed
Leverage Fast Inference for Bayesian exploration
Resource Management:
Restart browser periodically during long sessions
Clear browser cache if performance degrades
Save models frequently
Export results before major operations
Troubleshooting Large Dataset Issues
Issue 1: "Out of Memory" Error
Cause: Dataset exceeds available RAM
Solutions:
Close all other browser tabs and applications
Reduce number of variables (remove low-importance ones)
Reduce observations (use more recent data)
Restart browser to clear memory leaks
Upgrade system RAM
Issue 2: Browser Freezing/Unresponsive
Cause: Operation overwhelming browser
Solutions:
Wait 2-3 minutes (may still be processing)
If no progress, close tab and restart
Reduce model complexity
Enable GPU acceleration
Use faster hardware
Issue 3: Very Slow Operations (>5 minutes for OLS)
Cause: Insufficient acceleration or hardware
Diagnosis:
Check whether the GPU badge is present (GPU acceleration should be active for large models)
Check the browser console for "using CPU" messages (a sign that processing has fallen back to the CPU)
Monitor CPU/GPU usage in your system's task manager
Solutions:
Verify GPU acceleration active
Update graphics drivers
Close background applications
Reduce variable count
Consider hardware upgrade
Issue 4: Inconsistent Performance
Cause: Browser or system resource contention
Solutions:
Restart browser fresh
Close all unnecessary tabs
Check for system updates or background processes
Allow the system to cool down (to avoid thermal throttling)
Use performance mode in power settings
Issue 5: Upload Failing for Large Files
Cause: File size or browser limitations
Solutions:
Save Excel as .csv (often smaller)
Remove unnecessary columns in Excel before upload
Split into multiple files if needed
Ensure stable internet connection
Try different browser
Future Scalability
Planned Enhancements
Advanced Chunking: More intelligent adaptive chunk sizing
Distributed Processing: Leverage multiple browser tabs/windows
Server-Side Options: Optional server processing for very large models (enterprise plans)
Improved Caching: Faster reloading of previously analyzed datasets
Memory Optimization: Reduced memory footprint for same dataset sizes
Current Limitations
Browser Constraints: Inherent browser memory limits (2-4 GB per tab)
Single-Tab Processing: Cannot currently distribute across multiple tabs
No Disk Caching: All data held in memory during session
Sequential Operations: Most operations cannot run simultaneously
Practical Impact: Very large models (500+ variables) will always require substantial hardware
Next Steps: Review the full Advanced Features section, or proceed to Exporting & Reporting to learn how to share your large model results with stakeholders.