
eval: run each dataset as a separate cluster job for true GPU parallelism #47

@nmrenyi

Description


Problem

run_cluster.sh launches all datasets as background processes within a single cluster job (single GPU). Each process independently loads the full model into VRAM:

```bash
python3 run_eval.py ... --datasets afrimedqa_mcq &
python3 run_eval.py ... --datasets medqa_usmle &
# ... 4 more datasets
wait
```

With 6 processes competing for one GPU, VRAM is exhausted and processes fall back to CPU. This is why Meditron3 8B ran at ~9 s/question despite having a GPU allocated: it was running on CPU the whole time.

Proposed Fix

Submit one cluster job per dataset via submit_job.sh, each with its own GPU. The model is loaded once per job but gets full VRAM:

```
mamai-eval-meditron3-afrimedqa   → GPU 0  (loads model, runs 660q)
mamai-eval-meditron3-medqa       → GPU 1  (loads model, runs 1025q)
mamai-eval-meditron3-medmcqa     → GPU 2  (loads model, runs 500q)
...
```

Wall time drops to max(time_per_dataset) instead of roughly sum(time_per_dataset), since processes contending for one GPU effectively serialize (or worse, spill to CPU).
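With toy per-dataset runtimes (illustrative numbers, not measurements from this repo), the max-vs-sum difference can be sketched in shell arithmetic:

```shell
#!/usr/bin/env bash
# Toy per-dataset runtimes in minutes (illustrative, not measured).
times="5 8 4 6 3 7"

sum=0
max=0
for t in $times; do
  sum=$((sum + t))                          # contended single GPU: ~serialized
  if [ "$t" -gt "$max" ]; then max=$t; fi   # one GPU per dataset: slowest job
done

echo "single shared GPU (roughly serialized): ~${sum} min"
echo "one GPU per dataset:                    ~${max} min"
```

The contended case is in practice even worse than the sum, because CPU fallback inflates each dataset's runtime.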

Expected Impact

| Model | Current | After fix |
| --- | --- | --- |
| MedGemma 4B | ~10 min | ~5 min |
| Meditron3 8B | ~2.5 hrs (mostly CPU) | ~15 min (true GPU) |

Implementation

Add a submit_job_per_dataset.sh wrapper that loops over the datasets and calls submit_job.sh once per dataset with DATASETS=<single_dataset>. No changes to run_cluster.sh or run_eval.py are needed.
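A minimal sketch of what that wrapper could look like. The script name comes from this proposal, but the `SUBMIT_CMD` override and the exact dataset list are assumptions for illustration, not the repo's actual API:

```shell
#!/usr/bin/env bash
# submit_job_per_dataset.sh -- hypothetical sketch.
set -u

# First three of the six datasets, as illustration.
DATASETS_LIST="afrimedqa_mcq medqa_usmle medmcqa"

# Command that submits one cluster job; defaults to submit_job.sh.
# Overridable (e.g. SUBMIT_CMD=true) for a dry run.
SUBMIT_CMD="${SUBMIT_CMD:-./submit_job.sh}"

submit_per_dataset() {
  local ds
  for ds in $DATASETS_LIST; do
    echo "submitting job for ${ds}"
    # submit_job.sh is assumed to read DATASETS from the environment.
    DATASETS="$ds" "$SUBMIT_CMD"
  done
}
```

Calling `submit_per_dataset` then produces one submission per dataset, each carrying a single-dataset DATASETS value.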

Trade-off

Each job loads the model independently (~4 GB per job), so for 6 datasets that's 6 GPU allocations instead of 1. This is acceptable on a shared cluster where GPUs are the bottleneck anyway: better to use 6 GPUs for 15 minutes than 1 GPU for 2.5 hours.
