## Problem
`run_cluster.sh` launches all datasets as background processes within a single cluster job (one GPU). Each process independently loads the full model into VRAM:

```shell
python3 run_eval.py ... --datasets afrimedqa_mcq &
python3 run_eval.py ... --datasets medqa_usmle &
# ... 4 more datasets
wait
```
With 6 processes competing for one GPU, VRAM is exhausted and the processes fall back to CPU. This is why Meditron3 8B ran at ~9 s/q despite having a GPU allocated: it was running on CPU the whole time.
## Proposed Fix
Submit one cluster job per dataset via `submit_job.sh`, each with its own GPU. The model is loaded once per job but gets the full VRAM:

```
mamai-eval-meditron3-afrimedqa → GPU 0 (loads model, runs 660q)
mamai-eval-meditron3-medqa     → GPU 1 (loads model, runs 1025q)
mamai-eval-meditron3-medmcqa   → GPU 2 (loads model, runs 500q)
...
```
Wall time becomes `max(time_per_dataset)` instead of `sum(all_datasets) / degree_of_contention`.
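To make the formula concrete, here is a toy calculation with hypothetical per-dataset runtimes (the minute values are illustrative, not measured):

```shell
# Hypothetical per-dataset runtimes in minutes on a dedicated GPU (illustrative).
times=(5 8 4 6 7 5)

sum=0
max=0
for t in "${times[@]}"; do
  sum=$((sum + t))                       # sequential total: sum(all_datasets)
  if (( t > max )); then max=$t; fi      # parallel wall time: max(time_per_dataset)
done

echo "one shared GPU, run sequentially: ${sum} min"   # 35 min
echo "one GPU per dataset, in parallel: ${max} min"   # 8 min
```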
## Expected Impact
| Model | Current | After fix |
|---|---|---|
| MedGemma 4B | ~10 min | ~5 min |
| Meditron3 8B | ~2.5 hrs (mostly CPU) | ~15 min (true GPU) |
## Implementation
Add a `submit_job_per_dataset.sh` wrapper that loops over the datasets and calls `submit_job.sh` once per dataset with `DATASETS=<single_dataset>`. No changes to `run_cluster.sh` or `run_eval.py` needed.
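A minimal sketch of such a wrapper, assuming `submit_job.sh` picks up the dataset from the `DATASETS` environment variable as described above. Only the three dataset names mentioned earlier are listed; the rest would be appended:

```shell
#!/usr/bin/env bash
# submit_job_per_dataset.sh -- hypothetical sketch; assumes submit_job.sh
# reads the dataset to run from the DATASETS environment variable.

datasets=(afrimedqa_mcq medqa_usmle medmcqa)  # ... plus the remaining datasets

submitted=()
for ds in "${datasets[@]}"; do
  if [[ "${DRY_RUN:-1}" == 1 ]]; then
    # Dry run by default so the loop is safe to try out; set DRY_RUN=0 to submit.
    echo "would submit: DATASETS=$ds ./submit_job.sh"
  else
    # One submission per dataset -> one cluster job, and therefore one GPU, each.
    DATASETS="$ds" ./submit_job.sh
  fi
  submitted+=("$ds")
done
```

The dry-run guard makes the fan-out logic checkable without touching the cluster scheduler.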
## Trade-off
Each job loads the model independently (~4GB per job). For 6 datasets that's 6 GPU allocations instead of 1. Acceptable on a shared cluster where GPUs are the bottleneck anyway — better to use 6 GPUs for 15 minutes than 1 GPU for 2.5 hours.
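The arithmetic behind that claim, using the Meditron3 8B figures above (~2.5 hrs on one contended GPU versus ~15 min per job across 6 GPUs, the latter being an upper bound since shorter datasets finish earlier):

```shell
# GPU-minute accounting for the Meditron3 8B case (figures from the table above).
before=$((1 * 150))  # one GPU held for ~2.5 hours = 150 minutes
after=$((6 * 15))    # six GPUs held for at most ~15 minutes each
echo "GPU-minutes before: ${before}"  # 150
echo "GPU-minutes after:  ${after}"   # 90
```

So the fan-out not only cuts wall time, it also consumes fewer total GPU-minutes, because none of them are wasted idling behind CPU-bound processes.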