Large Dataset Handling
Overview
MixModeler is optimized to handle large marketing mix models with hundreds of variables and years of data. Through intelligent chunking, memory management, and acceleration technologies, the platform processes datasets that would overwhelm traditional browser-based applications.
This guide explains how MixModeler handles large datasets, what limits exist, and best practices for working efficiently with extensive marketing data.
Dataset Size Categories
Small Datasets
Characteristics:
- Variables: 10-30 
- Observations: 26-52 (6 months - 1 year weekly) 
- Total data points: 260-1,560 
Performance: Instant processing (<1 second for most operations)
Memory Usage: <100 MB
Ideal For: Single-channel testing, small business MMM, pilot projects
Medium Datasets
Characteristics:
- Variables: 30-100 
- Observations: 52-156 (1-3 years weekly) 
- Total data points: 1,560-15,600 
Performance: Fast processing (1-3 seconds with acceleration)
Memory Usage: 100-500 MB
Ideal For: Multi-channel MMM, standard business applications, most use cases
Large Datasets
Characteristics:
- Variables: 100-300 
- Observations: 104-260 (2-5 years weekly) 
- Total data points: 10,400-78,000 
Performance: Good performance with GPU (3-8 seconds), acceptable with WASM only (8-20 seconds)
Memory Usage: 500-2000 MB
Ideal For: Enterprise MMM, comprehensive multi-market models, advanced analytics
Very Large Datasets
Characteristics:
- Variables: 300-500 (subscription limit) 
- Observations: 260+ (5+ years weekly) 
- Total data points: 78,000-130,000+ 
Performance: Requires GPU acceleration (10-30 seconds), slow without GPU (>60 seconds)
Memory Usage: 2-4 GB
Ideal For: Global enterprise MMM, exhaustive market analysis, research applications
Note: Professional/Business plan required for 300+ variables
Subscription Limits
Variable Limits by Plan
Free Plan:
- Maximum 20 variables per dataset 
- Suitable for testing and small models 
- All features available 
Professional Plan:
- Maximum 500 variables per dataset 
- Suitable for most business needs 
- Priority support 
Business Plan:
- Unlimited variables 
- Enterprise-scale modeling 
- Dedicated support 
Observation Limits
Practical Limits:
- Minimum: 26 observations (6 months weekly data) 
- Recommended minimum: 52 observations (1 year) 
- Maximum: No hard limit, 500+ observations supported 
- Optimal: 52-260 observations (1-5 years) 
Best Practice: More observations generally improve model reliability, but returns diminish beyond 3-5 years of data
Memory Management
Browser Memory Architecture
How MixModeler Uses Memory:
Application Code: ~50-100 MB (fixed)
Loaded Data: Variable count × Observations × 8 bytes
- Example: 100 vars × 200 obs = 160 KB 
Working Memory: 5-10x loaded data during operations
- Example: 160 KB → 800 KB - 1.6 MB during processing 
WASM Memory: Separate heap, 500 MB - 2 GB depending on operation
GPU Memory: Separate VRAM when GPU acceleration active
Browser Overhead: ~200-500 MB for Chrome/Edge
Total Typical Usage: 1-4 GB for large models
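As a rough illustration, the sketch below (TypeScript) turns the sizing rule above into a quick estimate. The function name is hypothetical and is not part of the MixModeler API.

```typescript
// Hypothetical helper illustrating the sizing rule above; not MixModeler internals.
function estimateMemoryBytes(variables: number, observations: number) {
  const loadedBytes = variables * observations * 8; // 8 bytes per 64-bit float value
  return {
    loadedBytes,                         // raw dataset held in memory
    workingBytesPeak: loadedBytes * 10,  // upper end of the 5-10x working-memory multiplier
  };
}

// Example: 100 variables x 200 observations
const est = estimateMemoryBytes(100, 200);
console.log(`${est.loadedBytes / 1e3} KB loaded`);     // 160 KB
console.log(`${est.workingBytesPeak / 1e6} MB peak`);  // 1.6 MB during processing
```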
Memory-Efficient Processing
Chunked Operations: Large datasets processed in smaller chunks
How Chunking Works:
- Dataset split into manageable pieces (chunk size: 1,000-10,000 rows) 
- Each chunk processed independently 
- Results aggregated at the end 
- Memory released after each chunk 
Operations Using Chunking:
- Data upload and validation 
- Statistical summary calculations 
- Correlation matrix generation (for very large variable sets) 
- Diagnostic test suites 
User Impact: Chunking is transparent; operations complete normally and simply take slightly longer.
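The pattern is simple: split, process each piece, release it, then aggregate. The sketch below is a generic illustration of that flow (function names and chunk size are assumptions, not MixModeler internals).

```typescript
// Illustrative chunk -> process -> aggregate pattern, as described above.
function processInChunks<T, R>(
  rows: T[],
  chunkSize: number,
  processChunk: (chunk: T[]) => R,
  aggregate: (partials: R[]) => R,
): R {
  const partials: R[] = [];
  for (let start = 0; start < rows.length; start += chunkSize) {
    const chunk = rows.slice(start, start + chunkSize); // manageable piece (1,000-10,000 rows)
    partials.push(processChunk(chunk));                 // each chunk processed independently
    // the chunk goes out of scope here, so the browser can reclaim its memory
  }
  return aggregate(partials);                           // results combined at the end
}

// Example: a column total computed chunk by chunk
const dataRows = Array.from({ length: 20_000 }, (_, i) => ({ spend: (i % 7) * 10 }));
const totalSpend = processInChunks(
  dataRows,
  5_000,
  chunk => chunk.reduce((sum, row) => sum + row.spend, 0),
  partials => partials.reduce((sum, p) => sum + p, 0),
);
```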
Automatic Memory Management
Garbage Collection: Browser automatically reclaims unused memory
Memory Monitoring: MixModeler tracks usage and warns if approaching limits
Automatic Cleanup: Memory released immediately after operations complete
No User Action Required: Memory management is fully automatic
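If you want to keep an eye on memory yourself, Chromium-based browsers expose a non-standard `performance.memory` API. The sketch below shows one way to poll it; the threshold is illustrative and this is not MixModeler's internal monitor.

```typescript
// Browser-side memory check using the non-standard performance.memory API
// (Chromium-based browsers only; silently does nothing elsewhere).
function checkMemoryPressure(warnRatio = 0.8): void {
  const mem = (performance as any).memory;
  if (!mem) return; // API not available (e.g., Firefox, Safari)
  const used = mem.usedJSHeapSize;
  const limit = mem.jsHeapSizeLimit;
  if (used / limit > warnRatio) {
    console.warn(`Heap usage at ${(100 * used / limit).toFixed(0)}% of the tab's heap limit`);
  }
}

// Poll periodically during long-running operations
setInterval(checkMemoryPressure, 10_000);
```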
Performance Optimization for Large Datasets
Acceleration Technology Stack
For 100-200 Variables:
- WASM: 5-10x speedup (always available) 
- GPU: Additional 3-8x speedup (when available) 
- Combined: 15-80x faster than pure JavaScript 
For 200-500 Variables:
- WASM: 6-12x speedup (essential) 
- GPU: Additional 5-15x speedup (highly recommended) 
- Combined: 30-180x faster than pure JavaScript 
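Which path is available depends on what the browser exposes. The sketch below uses standard Web APIs (WebGPU and WebAssembly feature detection) to pick a tier; it is illustrative only and not MixModeler's internal selection logic.

```typescript
// Feature detection for the acceleration tiers described above.
async function detectAcceleration(): Promise<"gpu" | "wasm" | "js"> {
  // WebGPU: navigator.gpu is only defined when the browser has GPU support enabled
  if ("gpu" in navigator) {
    const adapter = await (navigator as any).gpu.requestAdapter();
    if (adapter) return "gpu";
  }
  // WebAssembly is available in all modern browsers
  if (typeof WebAssembly === "object") return "wasm";
  return "js"; // pure JavaScript fallback
}

detectAcceleration().then(path => console.log(`Acceleration path: ${path}`));
```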
Chunk Size Optimization
MixModeler automatically selects optimal chunk sizes:
Small Datasets (<10,000 data points):
- No chunking needed 
- Process entire dataset at once 
Medium Datasets (10,000-50,000 data points):
- Chunk size: 10,000 rows 
- 2-5 chunks typical 
- Minimal overhead 
Large Datasets (50,000-100,000 data points):
- Chunk size: 5,000 rows 
- 10-20 chunks typical 
- Managed overhead 
Very Large Datasets (>100,000 data points):
- Chunk size: 2,000-5,000 rows 
- 20-50 chunks typical 
- More processing time but prevents memory issues 
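The tiers above map naturally to a simple selection rule. The following sketch mirrors those thresholds for illustration; MixModeler's actual selection logic may differ.

```typescript
// Illustrative mapping of the chunk-size tiers listed above.
// Returns null when no chunking is needed.
function selectChunkSize(dataPoints: number): number | null {
  if (dataPoints < 10_000) return null;     // small: process the entire dataset at once
  if (dataPoints <= 50_000) return 10_000;  // medium: 2-5 chunks, minimal overhead
  if (dataPoints <= 100_000) return 5_000;  // large: 10-20 chunks, managed overhead
  return 2_000;                             // very large: conservative chunks to protect memory
}
```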
Parallel Processing
Multi-Core CPU Utilization:
- Chunk processing distributed across CPU cores 
- 4-core CPU: Up to 4 chunks simultaneously 
- 8-core CPU: Up to 8 chunks simultaneously 
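In a browser, this kind of multi-core distribution is typically done with Web Workers. The sketch below shows the general pattern of handing chunks to a pool of workers sized by `navigator.hardwareConcurrency`; the worker script name is a placeholder and this is not MixModeler's internal implementation.

```typescript
// Distribute chunk work across CPU cores with a pool of Web Workers.
// Assumes "chunkWorker.js" processes one chunk per message and posts back a number.
async function processChunksInParallel(chunks: Float64Array[]): Promise<number[]> {
  const cores = navigator.hardwareConcurrency || 4; // e.g. 4 or 8 logical cores
  const workers = Array.from(
    { length: Math.min(cores, chunks.length) },
    () => new Worker("chunkWorker.js"),
  );

  const results: number[] = new Array(chunks.length);
  let next = 0;

  await Promise.all(workers.map(worker => new Promise<void>(resolve => {
    const dispatch = () => {
      if (next >= chunks.length) { worker.terminate(); resolve(); return; }
      const index = next++;
      worker.onmessage = (e: MessageEvent<number>) => { results[index] = e.data; dispatch(); };
      worker.postMessage(chunks[index]); // each worker pulls the next available chunk
    };
    dispatch();
  })));

  return results;
}
```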
GPU Parallel Processing:
- Thousands of operations simultaneously 
- Particularly effective for matrix operations 
- Essential for 200+ variable models 
Bayesian MCMC:
- Multiple chains run in parallel 
- 4 chains utilize 4 CPU cores optimally 
- Each chain processes independently 
Practical Guidelines
Hardware Recommendations by Dataset Size
For 100-200 Variables:
| Component | Minimum | Recommended | Optimal |
| --- | --- | --- | --- |
| RAM | 8 GB | 16 GB | 32 GB |
| CPU | Quad-core 2.5 GHz | Quad-core 3.0 GHz | 6-8 core 3.5 GHz |
| GPU | Integrated | GTX 1660 / RX 5600 | RTX 3060 / RX 6700 |
| Storage | HDD | SSD | NVMe SSD |
Expected Performance: 2-5 seconds per model iteration with recommended setup
For 200-400 Variables:
| Component | Minimum | Recommended | Optimal |
| --- | --- | --- | --- |
| RAM | 16 GB | 32 GB | 64 GB |
| CPU | 6-core 3.0 GHz | 8-core 3.5 GHz | 12+ core 4.0 GHz |
| GPU | GTX 1660 | RTX 3060 / RX 6700 | RTX 3070+ / RX 6800+ |
| Storage | SSD | NVMe SSD | NVMe SSD (fast) |
Expected Performance: 5-15 seconds per model iteration with recommended setup
For 400-500 Variables:
| Component | Minimum | Recommended | Optimal |
| --- | --- | --- | --- |
| RAM | 32 GB | 64 GB | 128 GB |
| CPU | 8-core 3.5 GHz | 12-core 4.0 GHz | 16+ core 4.5 GHz |
| GPU | RTX 3060 | RTX 3070 / RX 6800 | RTX 4080+ / RX 7900+ |
| Storage | NVMe SSD | NVMe SSD (fast) | NVMe SSD (fastest) |
Expected Performance: 15-30 seconds per model iteration with recommended setup
Software Optimization
Browser Choice:
- Best: Chrome or Edge (latest version) 
- Good: Brave, Chromium-based browsers 
- Acceptable: Firefox (latest) 
- Not Recommended: Safari (limited WebGPU), older browsers 
Browser Settings:
- Enable hardware acceleration 
- Allow sufficient memory per tab (Chrome: 2-4 GB) 
- Disable unnecessary extensions 
- Keep browser updated 
Operating System:
- Use 64-bit OS (required for large memory access) 
- Keep OS updated for latest performance optimizations 
- Close unnecessary background applications 
Workflow Optimization
Start Small, Scale Up:
- Begin with subset of variables (50-100) 
- Develop and test model structure 
- Gradually add variables 
- Final run with full variable set 
Benefits: Faster iteration during development, full dataset only when needed
Reduce Diagnostic Frequency:
- Run full diagnostics on final model only 
- Use quick validation during development 
- Enable all tests only when necessary 
Use OLS Before Bayesian:
- OLS is 50-100x faster than Bayesian 
- Validate model structure with OLS first 
- Run Bayesian only on vetted model specifications 
Leverage Fast Inference:
- Use Fast Inference (SVI) for Bayesian exploration 
- Switch to full MCMC only for final production model 
- Iterate 10-20x faster during development
Handling Extremely Large Datasets
When You Hit Limits
Symptoms:
- Browser becomes unresponsive 
- "Out of memory" errors 
- Very long operation times (>5 minutes for non-Bayesian) 
- Browser tab crashes 
Immediate Solutions:
1. Reduce Variables (most effective):
- Remove highly correlated variables (VIF > 10) 
- Eliminate non-significant variables from previous runs 
- Focus on key marketing channels 
- Group similar variables (e.g., combine social platforms) 
2. Reduce Observations:
- Use most recent data (e.g., last 2 years instead of 5) 
- Consider monthly instead of weekly data (if appropriate) 
- Focus on relevant time period for current business question 
3. Process in Batches:
- Split variables into logical groups 
- Run separate models for each group 
- Combine insights from multiple models 
4. Upgrade Hardware:
- Add more RAM (biggest impact) 
- Get dedicated GPU (significant speedup) 
- Use faster CPU (moderate improvement) 
Data Reduction Techniques
Variable Selection:
Business Prioritization: Keep only strategically important channels
Statistical Filtering: Remove variables with:
- Very low variance (contribute little information) 
- Very high correlation with other variables (redundant) 
- Missing data >20% of observations 
Dimensionality Reduction:
- Create composite variables (e.g., "Total_Digital" instead of 10 digital channels) 
- Use principal components (outside MixModeler, then import reduced set) 
- Aggregate similar channels 
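A minimal sketch of the statistical filtering described above is shown below: it flags variables with near-zero variance or near-duplicate correlation before upload. The function names and thresholds are illustrative, not part of MixModeler.

```typescript
// Flag low-information and redundant variables prior to upload.
type Series = { name: string; values: number[] };

const mean = (v: number[]) => v.reduce((s, x) => s + x, 0) / v.length;
const variance = (v: number[]) => { const m = mean(v); return mean(v.map(x => (x - m) ** 2)); };

function correlation(a: number[], b: number[]): number {
  const ma = mean(a), mb = mean(b);
  const cov = mean(a.map((x, i) => (x - ma) * (b[i] - mb)));
  return cov / Math.sqrt(variance(a) * variance(b));
}

function flagForRemoval(series: Series[], minVariance = 1e-8, maxCorr = 0.95): string[] {
  const flagged = new Set<string>();
  for (const s of series) {
    if (variance(s.values) < minVariance) flagged.add(s.name); // contributes little information
  }
  for (let i = 0; i < series.length; i++) {
    for (let j = i + 1; j < series.length; j++) {
      if (Math.abs(correlation(series[i].values, series[j].values)) > maxCorr) {
        flagged.add(series[j].name); // redundant with an earlier variable
      }
    }
  }
  return [...flagged];
}
```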
Temporal Aggregation:
Weekly → Bi-Weekly: Reduces observations by 50%, minimal information loss
Weekly → Monthly: Reduces observations by 75%, some information loss
Considerations: Ensure aggregation makes business sense and doesn't hide important patterns
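The sketch below shows one way to roll weekly rows up to monthly totals, assuming each row carries an ISO-format week-start date and channel columns that should be summed. Adapt it to your own column layout; rates or index variables may need averaging rather than summing.

```typescript
// Weekly -> monthly aggregation by summing channel columns within each calendar month.
interface WeeklyRow {
  weekStart: string;                    // e.g. "2024-03-04"
  [column: string]: string | number;
}

function aggregateToMonthly(rows: WeeklyRow[], channels: string[]) {
  const byMonth = new Map<string, Record<string, number>>();
  for (const row of rows) {
    const month = String(row.weekStart).slice(0, 7); // "2024-03-04" -> "2024-03"
    let bucket = byMonth.get(month);
    if (!bucket) {
      bucket = {};
      for (const c of channels) bucket[c] = 0;
      byMonth.set(month, bucket);
    }
    for (const c of channels) bucket[c] += Number(row[c]); // sum channel values within the month
  }
  return [...byMonth.entries()].map(([month, totals]) => ({ month, ...totals }));
}
```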
Splitting Complex Models
Approach 1: Geographic Split
- Model each region/market separately 
- Combine insights at reporting stage 
- Allows more variables per model 
Approach 2: Channel Category Split
- Model digital channels separately from traditional 
- Model brand vs performance marketing separately 
- Run comprehensive model with top performers from each 
Approach 3: Time Period Split
- Model recent period (high priority) 
- Model historical period separately 
- Compare for structural changes 
Monitoring Performance
Key Metrics to Track
Load Time: Time to upload and validate data
- Target: <5 seconds for large datasets 
- Concern: >15 seconds 
Model Fitting Time: Time to estimate coefficients
- Target: <10 seconds for OLS, <5 minutes for Bayesian 
- Concern: >30 seconds for OLS, >15 minutes for Bayesian 
Memory Usage: Peak RAM consumption
- Target: <2 GB 
- Concern: >3 GB (browser may become unstable) 
Operation Success Rate: Percentage of operations completing without error
- Target: 100% 
- Concern: <95% 
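If you want to track these metrics in your own workflow, a simple timing wrapper is enough. The sketch below uses `performance.now()` and mirrors the targets listed above; the wrapper and the placeholder fit function are illustrative, not part of MixModeler.

```typescript
// Time an async operation and warn when it exceeds its target.
async function timeOperation<T>(label: string, targetMs: number, op: () => Promise<T>): Promise<T> {
  const start = performance.now();
  const result = await op();
  const elapsed = performance.now() - start;
  if (elapsed > targetMs) {
    console.warn(`${label} took ${(elapsed / 1000).toFixed(1)} s (target: ${targetMs / 1000} s)`);
  }
  return result;
}

// Example: flag an OLS fit that exceeds the 10-second target
// (fitOlsModel is a placeholder for whatever triggers the fit in your workflow)
// const model = await timeOperation("OLS fit", 10_000, () => fitOlsModel(dataset));
```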
When to Optimize
Optimization Triggers:
- Operations taking >2x expected time 
- Memory usage approaching 3-4 GB 
- Browser responsiveness degrading 
- Frequent need to restart browser 
Optimization Actions (in order of impact):
- Reduce number of variables (biggest impact) 
- Enable/upgrade GPU acceleration 
- Close other applications/tabs 
- Upgrade RAM 
- Consider data aggregation 
Best Practices Summary
Data Preparation:
- Remove unnecessary variables before upload 
- Ensure data quality (no excessive missing values) 
- Use appropriate temporal granularity 
- Test with subset before full dataset 
Hardware Utilization:
- Ensure GPU acceleration active for large models 
- Close unnecessary applications 
- Use single browser tab for intensive operations 
- Monitor memory usage 
Workflow Efficiency:
- Start small, scale gradually 
- Use OLS for structure testing 
- Run full diagnostics only when needed 
- Leverage Fast Inference for Bayesian exploration 
Resource Management:
- Restart browser periodically during long sessions 
- Clear browser cache if performance degrades 
- Save models frequently 
- Export results before major operations 
Troubleshooting Large Dataset Issues
Issue 1: "Out of Memory" Error
Cause: Dataset exceeds available RAM
Solutions:
- Close all other browser tabs and applications 
- Reduce number of variables (remove low-importance ones) 
- Reduce observations (use more recent data) 
- Restart browser to clear memory leaks 
- Upgrade system RAM 
Issue 2: Browser Freezing/Unresponsive
Cause: Operation overwhelming browser
Solutions:
- Wait 2-3 minutes (may still be processing) 
- If no progress, close tab and restart 
- Reduce model complexity 
- Enable GPU acceleration 
- Use faster hardware 
Issue 3: Very Slow Operations (>5 minutes for OLS)
Cause: Insufficient acceleration or hardware
Diagnosis:
- Check if GPU badge present (should have GPU for large models) 
- Check console for "using CPU" messages (bad sign) 
- Monitor CPU/GPU usage in task manager 
Solutions:
- Verify GPU acceleration active 
- Update graphics drivers 
- Close background applications 
- Reduce variable count 
- Consider hardware upgrade 
Issue 4: Inconsistent Performance
Cause: Browser or system resource contention
Solutions:
- Restart browser fresh 
- Close all unnecessary tabs 
- Check for system updates or background processes 
- Allow system to cool down (thermal throttling) 
- Use performance mode in power settings 
Issue 5: Upload Failing for Large Files
Cause: File size or browser limitations
Solutions:
- Save Excel as .csv (often smaller) 
- Remove unnecessary columns in Excel before upload 
- Split into multiple files if needed 
- Ensure stable internet connection 
- Try different browser 
Future Scalability
Planned Enhancements
Advanced Chunking: More intelligent adaptive chunk sizing
Distributed Processing: Leverage multiple browser tabs/windows
Server-Side Options: Optional server processing for very large models (enterprise plans)
Improved Caching: Faster reloading of previously analyzed datasets
Memory Optimization: Reduced memory footprint for same dataset sizes
Current Limitations
Browser Constraints: Inherent browser memory limits (2-4 GB per tab)
Single-Tab Processing: Cannot currently distribute across multiple tabs
No Disk Caching: All data held in memory during session
Sequential Operations: Most operations cannot run simultaneously
Practical Impact: Very large models (500+ variables) will always require substantial hardware
Next Steps: Review the full Advanced Features section, or proceed to Exporting & Reporting to learn how to share your large model results with stakeholders.