## Problem
`run_cluster.sh` launches all datasets as background processes within a single cluster job (one GPU). Each process independently loads the full model into VRAM:

```shell
python3 run_eval.py ... --datasets afrimedqa_mcq &
python3 run_eval.py ... --datasets medqa_usmle &
# ... 4 more datasets
wait
```
With 6 processes competing for one GPU, VRAM is exhausted and the processes fall back to CPU. This is why Meditron3 8B ran at ~9 s/q despite having a GPU allocated: it was running on CPU the whole time.
## Proposed Fix
Submit one cluster job per dataset via `submit_job.sh`, each with its own GPU. The model is loaded once per job but gets the full VRAM:

```
mamai-eval-meditron3-afrimedqa → GPU 0 (loads model, runs 660q)
mamai-eval-meditron3-medqa     → GPU 1 (loads model, runs 1025q)
mamai-eval-meditron3-medmcqa   → GPU 2 (loads model, runs 500q)
...
```
Wall time becomes `max(time_per_dataset)` instead of `sum(all_datasets) / degree_of_contention`.
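To make the formula concrete, here is a toy calculation with hypothetical per-dataset runtimes (the minute values are illustrative, not measured):

```shell
# Hypothetical per-dataset runtimes in minutes on a dedicated GPU (illustrative).
times=(5 8 4 6 7 5)

sum=0
max=0
for t in "${times[@]}"; do
  sum=$((sum + t))                       # sequential total: sum(all_datasets)
  if (( t > max )); then max=$t; fi      # parallel wall time: max(time_per_dataset)
done

echo "one shared GPU, run sequentially: ${sum} min"   # 35 min
echo "one GPU per dataset, in parallel: ${max} min"   # 8 min
```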
## Expected Impact
| Model | Current | After fix |
|---|---|---|
| MedGemma 4B | ~10 min | ~5 min |
| Meditron3 8B | ~2.5 hrs (mostly CPU) | ~15 min (true GPU) |
## Implementation
Add a `submit_job_per_dataset.sh` wrapper that loops over the datasets and calls `submit_job.sh` once per dataset with `DATASETS=<single_dataset>`. No changes to `run_cluster.sh` or `run_eval.py` needed.
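A minimal sketch of such a wrapper, assuming `submit_job.sh` picks up the dataset from the `DATASETS` environment variable as described above. Only the three dataset names mentioned earlier are listed; the rest would be appended:

```shell
#!/usr/bin/env bash
# submit_job_per_dataset.sh -- hypothetical sketch; assumes submit_job.sh
# reads the dataset to run from the DATASETS environment variable.

datasets=(afrimedqa_mcq medqa_usmle medmcqa)  # ... plus the remaining datasets

submitted=()
for ds in "${datasets[@]}"; do
  if [[ "${DRY_RUN:-1}" == 1 ]]; then
    # Dry run by default so the loop is safe to try out; set DRY_RUN=0 to submit.
    echo "would submit: DATASETS=$ds ./submit_job.sh"
  else
    # One submission per dataset -> one cluster job, and therefore one GPU, each.
    DATASETS="$ds" ./submit_job.sh
  fi
  submitted+=("$ds")
done
```

The dry-run guard makes the fan-out logic checkable without touching the cluster scheduler.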
## Trade-off
Each job loads the model independently (~4GB per job). For 6 datasets that's 6 GPU allocations instead of 1. Acceptable on a shared cluster where GPUs are the bottleneck anyway — better to use 6 GPUs for 15 minutes than 1 GPU for 2.5 hours.
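The arithmetic behind that claim, using the Meditron3 8B figures above (~2.5 hrs on one contended GPU versus ~15 min per job across 6 GPUs, the latter being an upper bound since shorter datasets finish earlier):

```shell
# GPU-minute accounting for the Meditron3 8B case (figures from the table above).
before=$((1 * 150))  # one GPU held for ~2.5 hours = 150 minutes
after=$((6 * 15))    # six GPUs held for at most ~15 minutes each
echo "GPU-minutes before: ${before}"  # 150
echo "GPU-minutes after:  ${after}"   # 90
```

So the fan-out not only cuts wall time, it also consumes fewer total GPU-minutes, because none of them are wasted idling behind CPU-bound processes.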