Measures speaker diarization accuracy of the MimicScribe pipeline: Parakeet TDT 0.6B (on-device ASR) + Pyannote (on-device diarization) + Gemini 3 Flash (LLM speaker attribution).
| Corpus | Scenario | Sessions | Hours | Speakers |
|---|---|---|---|---|
| AMI IHM-mix | Headset mix (simulates multi-mic remote call) | 16 eval | ~8h | 4 per session |
| Earnings-21 | Real corporate conference calls | 11 eval | ~10h | 2-15 |

See `results/RESULTS.md` after running the benchmark.
- macOS 15.0+ with Xcode and Swift 6.2+
- MimicScribe built and run at least once (to download CoreML ASR + diarization models)
- Gemini API key configured in the project root `.env` file (the pipeline calls Gemini for speaker attribution)
- Python 3.10+ for the benchmark harness
- HuggingFace token (optional): required only if Earnings-21 becomes gated. Set `HF_TOKEN` or `HUGGINGFACE_PAT_READ` in `.env`.
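
The token lookup described in the last item can be sketched as follows. This is a minimal illustration, not the harness's actual code: the function name `resolve_hf_token` is hypothetical, the precedence (`HF_TOKEN` first, then `HUGGINGFACE_PAT_READ`) is an assumption, and the sketch presumes the `.env` values have already been loaded into the process environment.

```python
import os


def resolve_hf_token(env=None):
    """Return the first non-empty HuggingFace token found, or None.

    Checks HF_TOKEN before HUGGINGFACE_PAT_READ (assumed precedence).
    """
    env = os.environ if env is None else env
    for name in ("HF_TOKEN", "HUGGINGFACE_PAT_READ"):
        value = env.get(name, "").strip()
        if value:
            return value
    # The token is optional, so a missing value is not an error.
    return None


if __name__ == "__main__":
    token = resolve_hf_token()
    print("token found" if token else "no token set (fine unless the dataset is gated)")
```

Returning `None` instead of raising keeps the token optional, matching the note that it is only needed if Earnings-21 becomes gated.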
cd benchmark
python -m venv .venv && source .venv/bin/activate
pip install -e .
# Download eval data (~10 GB for both corpora)
bench-download
# Run MimicScribe pipeline on all audio files
bench-run
# Score and generate results
bench-score

Or run everything at once:

bench-all

To limit each step to a single corpus, pass `--corpus`:

bench-download --corpus ami
bench-run --corpus ami
bench-score --corpus ami

- DER (Diarization Error Rate): (missed speech + false alarm + speaker confusion) / total reference speech duration. Scored with the standard 0.25 s collar.
- Missed: speech in reference but not in hypothesis
- False Alarm: speech in hypothesis but not in reference
- Confusion: speech attributed to the wrong speaker
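
As a worked example of how the three components combine, here is a minimal sketch that computes DER from per-component durations. The function name and the session numbers are hypothetical and chosen only for illustration. The benchmark itself scores with pyannote.metrics, which also handles the collar and the speaker mapping that this sketch omits.

```python
def diarization_error_rate(missed, false_alarm, confusion, reference_speech):
    """Return DER as a fraction of total reference speech time.

    All arguments are durations in seconds.
    """
    if reference_speech <= 0:
        raise ValueError("reference_speech must be positive")
    return (missed + false_alarm + confusion) / reference_speech


# Hypothetical session with 600 s of reference speech.
example = diarization_error_rate(
    missed=18.0,        # speech in reference but not in hypothesis
    false_alarm=12.0,   # speech in hypothesis but not in reference
    confusion=30.0,     # speech attributed to the wrong speaker
    reference_speech=600.0,
)
print(f"DER = {example:.1%}")  # DER = 10.0%
```

Because false alarm time is not bounded by the reference duration, DER can exceed 100% on sessions where the hypothesis contains a lot of spurious speech.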
- Download: Fetches audio + ground-truth RTTM files from AMI and Earnings-21
- Run: Runs each audio file through `swift run mimicscribe --process-file` (ASR + diarization + Gemini speaker attribution), then exports segments from SQLite as hypothesis RTTM files
- Score: Compares hypothesis RTTM against reference RTTM using pyannote.metrics, renders results to `results/RESULTS.md`
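
Both the reference and hypothesis files use the RTTM format, where each speaker turn is one `SPEAKER` line with ten whitespace-separated fields. The sketch below shows how a segment could be written to and read back from such a line. The helper names and the sample segment are hypothetical, and the harness's actual SQLite export is not shown here.

```python
def rttm_line(file_id, onset, duration, speaker):
    """Format one speaker turn as an RTTM SPEAKER record.

    Fields: type, file ID, channel, onset (s), duration (s),
    then <NA> placeholders around the speaker label.
    """
    return (
        f"SPEAKER {file_id} 1 {onset:.3f} {duration:.3f} "
        f"<NA> <NA> {speaker} <NA> <NA>"
    )


def parse_rttm_line(line):
    """Parse an RTTM SPEAKER record into (file_id, onset, duration, speaker)."""
    fields = line.split()
    if len(fields) < 8 or fields[0] != "SPEAKER":
        raise ValueError(f"not an RTTM SPEAKER record: {line!r}")
    return fields[1], float(fields[3]), float(fields[4]), fields[7]


# Hypothetical segment: speaker "spk0" talks from 1.5 s for 2.25 s.
line = rttm_line("session01", 1.5, 2.25, "spk0")
print(line)
print(parse_rttm_line(line))
```

Speaker labels in the hypothesis do not need to match the reference labels, since diarization scoring maps hypothesis speakers to reference speakers before counting confusion.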
- AMI Meeting Corpus: Carletta, J. et al. (2005). The AMI Meeting Corpus: A Pre-announcement. CC BY 4.0
- AMI RTTM ground truth: BUTSpeechFIT/AMI-diarization-setup
- Earnings-21: Del Rio, M. et al. (2021). Earnings-21: A Practical Benchmark for ASR in the Wild. Interspeech 2021. speech-datasets