Memory estimation is unreliable across different hardware configurations. Instead of guessing, we measure what actually works on your specific system.
IMPORTANT: Previous benchmarks (before v0.2.5) used synthetic data and gave inaccurate results. The new benchmark uses real training conditions.
Run the accurate memory testing script to find your hardware's optimal settings:
cd /Users/arosboro/your_ai
source venv/bin/activate
# Accurate benchmark with real training data (RECOMMENDED)
python scripts/find_optimal_profile.py --model NousResearch/Hermes-2-Pro-Mistral-7B

What the benchmark does:
- Uses Real Training Data: Loads actual JSONL files and tokenizes them
- Computes Distrust Loss: Includes the full distrust loss overhead
- Allocates Optimizer State: AdamW momentum + variance tensors
- Runs 15 Steps: Captures late-allocating buffers and peak memory
- Adds Safety Margin: Reports memory × 1.15 for real training headroom
- Detects Memory Growth: Warns if memory increases >10% between steps 10-15
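The measurement loop described above can be sketched as follows. This is an illustrative reconstruction, not the script's actual code; `run_trial` and the `step` callable (which returns current memory use in MB after one training step) are hypothetical names.

```python
# Sketch of one benchmark trial: run 15 steps, track peak memory,
# apply the 1.15 safety margin, and flag growth between steps 10 and 15.
# Names here are illustrative, not the real script's API.

SAFETY_MARGIN = 1.15   # report peak * 1.15 for real-training headroom
GROWTH_LIMIT = 0.10    # warn if memory grows >10% between steps 10 and 15

def run_trial(step, n_steps=15):
    """Run n_steps training steps; return (adjusted_peak_mb, grew)."""
    readings = [step(i) for i in range(1, n_steps + 1)]
    peak = max(readings)
    # Late-allocating buffers show up as growth late in the run.
    grew = readings[14] > readings[9] * (1 + GROWTH_LIMIT)
    return peak * SAFETY_MARGIN, grew

# Example with a fake memory trace mirroring the sample output below:
trace = {1: 14500, 5: 14520, 10: 14530, 15: 14535}
adjusted, grew = run_trial(lambda i: trace.get(i, 14510))
```

With this trace the adjusted peak comes out to roughly 16715 MB, matching the "adjusted" figure in the sample output.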
ACCURATE MEMORY BENCHMARK - Real Training Conditions
══════════════════════════════════════════════════════════════════════════
Finding optimal batch size for rank=64, layers=16
🔬 Testing: batch=1, rank=64, layers=16 (15 steps)
Loading model and data...
Loaded 500 training samples
Running 15 training steps with REAL data...
Step 1: loss=12.34, mem=14500MB, time=2.1s
Step 5: loss=11.89, mem=14520MB, time=2.0s
Step 10: loss=11.45, mem=14530MB, time=2.0s
Step 15: loss=11.02, mem=14535MB, time=2.0s
✅ SUCCESS!
Peak memory: 14535MB (adjusted: 16715MB)
Avg step time: 2.0s
🔬 Testing: batch=32, rank=64, layers=16 (15 steps)
❌ OOM - Configuration exceeds available memory
... (binary search continues)
══════════════════════════════════════════════════════════════════════════
BENCHMARK COMPLETE
══════════════════════════════════════════════════════════════════════════
OPTIMAL CONFIGURATION:
Batch size: 17
LoRA rank: 128
LoRA layers: 16
Peak memory: 18500MB (with 15% safety margin)
Step time: 3.2s
══════════════════════════════════════════════════════════════════════════
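The search shown in the sample output can be sketched as a standard binary search over batch size. Here `fits` is a hypothetical predicate standing in for one full 15-step trial (True means the trial completed without OOM); the real script's search logic may differ in detail.

```python
# Binary search for the largest batch size that fits in memory.
# fits(batch) -> bool is a stand-in for running one benchmark trial.

def find_max_batch(fits, low=1, high=32):
    """Largest batch in [low, high] for which fits() is True, or None."""
    if not fits(low):
        return None  # even the smallest batch OOMs
    best = low
    while low <= high:
        mid = (low + high) // 2
        if fits(mid):
            best, low = mid, mid + 1
        else:
            high = mid - 1
    return best

# With a memory budget where anything above batch=17 OOMs:
result = find_max_batch(lambda b: b <= 17)
```

This converges in at most log2(32) ≈ 5 trials, which is why the script can explore the range quickly even though each trial runs 15 real training steps.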
The script will ask if you want to save the results as your hardware profile:
Save as hardware profile for future use? [Y/n] y
✅ Hardware profile saved!
Future training will use these validated settings
The benchmark outputs a JSON file with validated configurations. Use the reported settings explicitly:
# Use the optimal configuration from benchmark output
python src/train_qlora.py \
--model NousResearch/Hermes-2-Pro-Mistral-7B \
--batch-size 17 \
--lora-rank 128 \
--lora-layers 16 \
--max-steps 5000

Why use the benchmarked settings:
- Accurate Results: Tested with real training data and distrust loss
- No OOM Crashes: Settings include 15% safety margin
- Peak Memory Detection: 15-step tests capture late-allocating buffers
- Memory Growth Detection: Warns if memory increases during test
- One-time Setup: Test once, train confidently
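If you prefer to pull the settings from the JSON file programmatically rather than typing them, a small helper like the one below works. The file name and field names (`batch_size`, `lora_rank`, `lora_layers`) are placeholders; check the benchmark's actual output file for the real schema.

```python
# Build the train_qlora.py argument string from a saved profile JSON.
# The key names below are assumptions about the profile schema.
import json

def training_args(profile_path):
    with open(profile_path) as f:
        p = json.load(f)
    return (f"--batch-size {p['batch_size']} "
            f"--lora-rank {p['lora_rank']} "
            f"--lora-layers {p['lora_layers']}")
```

You can then splice the returned string into your training command.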
If you don't want to run the benchmark, use these proven safe settings:
For M3 Ultra 96GB with Hermes-7B:
- batch=17, rank=128, layers=16 (tested, stable)
Generic conservative defaults:
- Small models (7-8B): batch=2, rank=64, layers=16
- Medium models (14B): batch=2, rank=64, layers=16
- Large models (32B+): batch=2, rank=96, layers=20
These conservative settings should work reliably on most systems, but they may not maximize your hardware's capacity.
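The conservative defaults above can be expressed as a simple lookup keyed by parameter count in billions. The thresholds are taken directly from the list; the function name is illustrative.

```python
# Conservative fallback settings by model size (parameter count in billions).

def conservative_defaults(params_b):
    if params_b >= 32:            # large models (32B+)
        return {"batch": 2, "rank": 96, "layers": 20}
    return {"batch": 2, "rank": 64, "layers": 16}  # 7-14B models
```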
You can always override settings manually:
python src/train_qlora.py \
--model NousResearch/Hermes-2-Pro-Mistral-7B \
--batch-size 8 \
--lora-rank 128 \
--lora-layers 24

Test keeps failing at batch=1?
- Your system may have other processes using GPU memory
- Close other applications and try again
- Check available memory in Activity Monitor → Memory
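If you want to check free memory from the command line instead of Activity Monitor, the optional psutil package (pip install psutil) offers a portable way; this is a generic helper, not part of the project's scripts.

```python
# Report available system memory in MB using psutil (optional dependency).
import psutil

def free_memory_mb():
    return psutil.virtual_memory().available / (1024 ** 2)

print(f"Available memory: {free_memory_mb():.0f} MB")
```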
Test is taking too long?
- Press Ctrl+C to cancel
- The test saves progress as it goes
- Each configuration test runs for only 15 training steps
Want to retest with different model?
- Run the script again with a different --model argument
- Results are model-specific (larger models need different settings)