Merged
Commits
35 commits
- 13ac00b refactor versa with oo update -> a major update (ftshijt, Jun 16, 2025)
- 58eb805 Merge branch 'main' into refactor (ftshijt, Jun 19, 2025)
- 8d2f6f0 Merge branch 'main' into refactor (ftshijt, Jun 19, 2025)
- 8796810 add asvspoof.py (ftshijt, Jun 30, 2025)
- 53c3e0a Merge branch 'refactor' of https://github.com/ftshijt/versa into refa… (ftshijt, Jun 30, 2025)
- 7e95ee8 update discrets speech / chroma_alignment (ftshijt, Jun 30, 2025)
- b7b9dd4 update test function and versa with black and emo_vad (ftshijt, Jun 30, 2025)
- dcc1822 Merge branch 'main' into refactor (ftshijt, Jun 30, 2025)
- 8ff163a update emo_similarity (ftshijt, Jun 30, 2025)
- 78894ee fix metric list and set setup.py (ftshijt, Jun 30, 2025)
- e5e10bf fix setup.py (ftshijt, Jun 30, 2025)
- 892a13b fix scorer shared for all cases (ftshijt, Jul 5, 2025)
- ce9f828 update code multiple new metrics (ftshijt, Jul 5, 2025)
- e95cd4b fix versa/test for test functions (ftshijt, Jul 5, 2025)
- f4799fd add pam fixed (ftshijt, Jul 5, 2025)
- 03ccbda add pesq (ftshijt, Jul 5, 2025)
- 20e155d Migrate base metrics to OO interface (ftshijt, Apr 29, 2026)
- b6f50f1 Migrate VAD metric to OO interface (ftshijt, Apr 29, 2026)
- fcdd9af Migrate additional utterance metrics (ftshijt, Apr 29, 2026)
- 61cc53f Fix metric migration real setup (ftshijt, Apr 29, 2026)
- e7d494b Merge pull request #1 from wavlab-speech/codex/pr-37-refactor (ftshijt, Apr 29, 2026)
- 4e6913a Restore legacy metric support (ftshijt, Apr 30, 2026)
- 1454f78 Restore legacy scorer compatibility (ftshijt, May 5, 2026)
- 404fc77 Use local cache for ESPnet metrics (ftshijt, May 5, 2026)
- bc32cbb Fix legacy metric setup paths (ftshijt, May 5, 2026)
- 7a1ee6f Route Hugging Face metric caches locally (ftshijt, May 5, 2026)
- feea48d Fix legacy metric installers and pipeline baselines (ftshijt, May 5, 2026)
- 2a7d7ad Clean up metric cache installers (ftshijt, May 5, 2026)
- 200bb4a Clean up singer identity cache installer (ftshijt, May 5, 2026)
- 22410f4 Merge pull request #2 from wavlab-speech/codex/pr-37-refactor (ftshijt, May 5, 2026)
- 19f270a Merge main metric additions into refactor interface (ftshijt, May 5, 2026)
- 76b9da9 Avoid WVMOS import-time downloads (ftshijt, May 5, 2026)
- b788c1e Fix PR 37 CI failures (ftshijt, May 5, 2026)
- f2f4c81 Merge upstream main into refactor (ftshijt, May 6, 2026)
- bcec446 Make README example commands runnable (ftshijt, May 6, 2026)
10 changes: 10 additions & 0 deletions .gitignore
@@ -169,4 +169,14 @@ fadtk/
scoreq/
fairseq/
UTMOSv2/

# Versa optional metric installer output and model caches
versa_cache/
tools/NISQA/
tools/Noresqa/
tools/SRMRpy/
tools/audiobox-aesthetics/
tools/emotion2vec/
ssl-singer-identity/
pretrained_models/
wvmos/
12 changes: 6 additions & 6 deletions README.md
@@ -55,10 +55,10 @@ For metrics marked without "x" in the "Auto-Install" column of our metrics table

```bash
# Test core functionality
-python versa/test/test_pipeline/test_general.py
+python -m pytest test/test_general.py

# Test specific metrics that require additional installation
-python versa/test/test_pipeline/test_{metric}.py
+python -m pytest test/test_metrics/test_{metric}.py
```


@@ -69,31 +69,31 @@ python versa/test/test_pipeline/test_{metric}.py
```bash
# Direct usage with file paths
python versa/bin/scorer.py \
-    --score_config egs/speech.yaml \
+    --score_config egs/speech_cpu.yaml \
--gt test/test_samples/test1 \
--pred test/test_samples/test2 \
--output_file test_result \
--io dir

# With SCP-style input
python versa/bin/scorer.py \
-    --score_config egs/speech.yaml \
+    --score_config egs/speech_cpu.yaml \
--gt test/test_samples/test1.scp \
--pred test/test_samples/test2.scp \
--output_file test_result \
--io soundfile

# With Kaldi-ARK style input (compatible with ESPnet)
python versa/bin/scorer.py \
-    --score_config egs/speech.yaml \
+    --score_config egs/speech_cpu.yaml \
--gt test/test_samples/test1.scp \
--pred test/test_samples/test2.scp \
--output_file test_result \
--io kaldi

# Including text transcription information
python versa/bin/scorer.py \
-    --score_config egs/separate_metrics/wer.yaml \
+    --score_config egs/separate_metrics/wer_tiny.yaml \
--gt test/test_samples/test1.scp \
--pred test/test_samples/test2.scp \
--output_file test_result \
151 changes: 151 additions & 0 deletions docs/metric_migration.md
@@ -0,0 +1,151 @@
# Metric Migration Guide

This guide summarizes the preferred process for migrating existing Versa metrics
to the new object-oriented metric interface.

## Migration Goal

Use `versa.definition.BaseMetric` as the source of truth for metric
implementations. Preserve user-facing behavior, but do not preserve legacy
internal helper APIs unless they are still needed by public callers.

Preserve:

- YAML metric names
- CLI/scorer behavior
- output score keys
- documented config defaults
- optional dependency behavior

Clean up:

- old function-style metric internals
- duplicated setup code
- eager optional dependency imports
- tests that only exercise legacy helper functions

## Required Metric Shape

Each migrated metric should provide:

- a `BaseMetric` subclass
- `_setup(self)` for config defaults, dependency checks, and model setup
- `compute(self, predictions, references=None, metadata=None)` for scoring
- `get_metadata(self)` returning `MetricMetadata`
- `register_<metric>_metric(registry)` as the registry integration point

`compute` should:

- validate required inputs
- read sample rate from `metadata.get("sample_rate", 16000)` when needed
- return the same output keys users already receive
- avoid changing user-visible numeric conventions unless the migration requires it
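A minimal sketch of this shape, assuming a toy RMS-energy metric; `BaseMetric` and `MetricMetadata` are stubbed locally so the example is self-contained, and the real classes in `versa.definition` may differ in signature:

```python
from dataclasses import dataclass, field

# Local stand-ins for versa.definition.BaseMetric / MetricMetadata;
# the real interfaces may carry more fields and different signatures.
@dataclass
class MetricMetadata:
    name: str
    requires_reference: bool = False
    aliases: list = field(default_factory=list)

class BaseMetric:
    def __init__(self, config=None):
        self.config = config or {}
        self._setup()

    def _setup(self):
        pass

class RMSEnergyMetric(BaseMetric):
    """Toy independent metric: RMS energy of the predicted signal."""

    def _setup(self):
        # config defaults and dependency checks belong here
        self.eps = float(self.config.get("eps", 1e-12))

    def compute(self, predictions, references=None, metadata=None):
        metadata = metadata or {}
        sample_rate = metadata.get("sample_rate", 16000)  # per the guide; unused by this toy metric
        if not predictions:
            raise ValueError("predictions must be a non-empty sequence")
        rms = (sum(x * x for x in predictions) / len(predictions)) ** 0.5
        # return the same output key users already receive
        return {"rms_energy": rms}

    def get_metadata(self):
        return MetricMetadata(name="rms_energy", aliases=["rms"])

def register_rms_energy_metric(registry):
    # registry integration point; the real registry API may differ
    registry["rms_energy"] = RMSEnergyMetric
```

The point of the sketch is the division of labor: `_setup` owns configuration and dependencies, `compute` owns validation and scoring, and the register function is the only place the registry is touched.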

## Metadata Checklist

Every metric registration should define:

- canonical metric name
- `MetricCategory`: `INDEPENDENT`, `DEPENDENT`, `NON_MATCH`, or `DISTRIBUTIONAL`
- `MetricType`: usually `FLOAT` for one score or `DICT` for grouped scores
- `requires_reference`
- `requires_text`
- `gpu_compatible`
- `auto_install`
- dependency import names
- short description
- paper reference and implementation source when known
- useful aliases for existing YAML or common names
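As an illustration only, here is the checklist written out as plain fields for PESQ (a dependent, auto-installed metric per the tables in `docs/supported_metrics.md`); this is a sketch, not versa's actual `MetricMetadata` signature, and the `gpu_compatible` value and `pesq_score` alias are assumptions:

```python
# Field names mirror the checklist above; treat this as a sketch, not
# versa's actual MetricMetadata constructor.
pesq_metadata = dict(
    name="pesq",
    category="DEPENDENT",      # MetricCategory.DEPENDENT in the real enum
    metric_type="FLOAT",       # one score per utterance
    requires_reference=True,   # PESQ compares prediction against reference
    requires_text=False,
    gpu_compatible=False,      # assumption for this sketch
    auto_install=True,
    dependencies=["pesq"],     # import name of the optional package
    description="Perceptual Evaluation of Speech Quality",
    paper="https://ieeexplore.ieee.org/document/941023",
    aliases=["pesq_score"],    # hypothetical alias, for illustration
)
```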

## Optional Dependencies

Optional dependencies must not break `import versa`.

Use guarded imports inside metric modules, and raise a clear `ImportError` from
`_setup` when a required optional package is missing. Register optional metrics
from `versa/__init__.py` through `_optional_metric_import(...)`.
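The guarded-import pattern can be sketched as follows; `some_optional_pkg`, the class name, and the error text are placeholders:

```python
# Guarded import: the module itself always imports cleanly,
# so `import versa` stays safe without the optional package.
try:
    import some_optional_pkg  # hypothetical optional dependency
except ImportError:
    some_optional_pkg = None

class SomeOptionalMetric:
    def _setup(self):
        # fail loudly only when the metric is actually set up
        if some_optional_pkg is None:
            raise ImportError(
                "some_optional_pkg is required for SomeOptionalMetric; "
                "install it with the matching tools/ installer script."
            )
```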

## Tests

Prefer tests for the new public path:

- metric class behavior
- registry registration and aliases
- `VersaScorer` pipeline behavior with existing sample audio when lightweight
- missing optional dependency behavior
- unchanged user-facing output keys

Do not add tests solely to preserve old internal helper APIs unless those APIs
remain part of the public interface.

Base-install focused tests currently live in:

- `test/test_metrics/test_base_metrics.py`
- `test/test_pipeline/test_base_metrics_pipeline.py`
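A lightweight test along these lines, with the registry stubbed locally (versa's real registry API may differ), could check registration, aliases, and stable output keys in one place:

```python
class FakeRegistry:
    """Minimal stand-in for versa's metric registry."""

    def __init__(self):
        self._metrics = {}

    def register(self, name, cls, aliases=()):
        self._metrics[name] = cls
        for alias in aliases:
            self._metrics[alias] = cls

    def get(self, name):
        return self._metrics[name]

def test_registration_aliases_and_keys():
    registry = FakeRegistry()

    class DummyMetric:
        def compute(self, predictions, references=None, metadata=None):
            return {"dummy_score": 0.0}  # user-facing key must stay stable

    registry.register("dummy", DummyMetric, aliases=("dummy_legacy",))
    # canonical name and legacy alias resolve to the same class
    assert registry.get("dummy") is registry.get("dummy_legacy")
    # output keys unchanged for users
    assert set(DummyMetric().compute([0.1])) == {"dummy_score"}
```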

## Migration Candidates

The following modules still appear to use the old interface because they do not
define or import `BaseMetric`. This list is based on a repository scan and should
be updated as each metric is migrated.

### Corpus and Distributional Metrics

- `versa/corpus_metrics/fad.py`
- `versa/corpus_metrics/individual_fad.py`
- `versa/corpus_metrics/kid.py`
- `versa/corpus_metrics/clap_score.py`

### Already Migrated Examples

Use these as local references when migrating the remaining metrics:

- `versa/sequence_metrics/mcd_f0.py`
- `versa/sequence_metrics/signal_metric.py`
- `versa/sequence_metrics/warpq.py`
- `versa/corpus_metrics/espnet_wer.py`
- `versa/corpus_metrics/owsm_wer.py`
- `versa/corpus_metrics/whisper_wer.py`
- `versa/utterance_metrics/log_wmse.py`
- `versa/utterance_metrics/pseudo_mos.py`
- `versa/utterance_metrics/qwen2_audio.py`
- `versa/utterance_metrics/qwen_omni.py`
- `versa/utterance_metrics/speaking_rate.py`
- `versa/utterance_metrics/scoreq.py`
- `versa/utterance_metrics/se_snr.py`
- `versa/utterance_metrics/sheet_ssqa.py`
- `versa/utterance_metrics/singer.py`
- `versa/utterance_metrics/speaker.py`
- `versa/utterance_metrics/stoi.py`
- `versa/utterance_metrics/pesq_score.py`
- `versa/utterance_metrics/squim.py`
- `versa/utterance_metrics/universa.py`
- `versa/utterance_metrics/vad.py`
- `versa/utterance_metrics/visqol_score.py`
- `versa/utterance_metrics/vqscore.py`

## Verification

Run focused checks before broader validation:

```bash
/opt/homebrew/bin/mamba run -n versa-dev python -m pytest <focused tests> -q
/opt/homebrew/bin/mamba run -n versa-dev python -m black --check <touched files>
/opt/homebrew/bin/mamba run -n versa-dev python -m flake8 <touched files>
```

The base migration tests use mocks for heavy model-backed metrics. They validate
registry integration, pipeline wiring, input handling, and output keys, but they
do not exercise checkpoint downloads or real inference.

Run optional real model checks locally after installing the metric dependencies:

```bash
tools/install_scoreq.sh
VERSA_RUN_REAL_MODEL_TESTS=1 \
/opt/homebrew/bin/mamba run -n versa-dev python -m pytest \
test/test_pipeline/test_scoreq.py -q -s
```

These tests are marked `real_model` and are skipped unless
`VERSA_RUN_REAL_MODEL_TESTS=1` is set.
27 changes: 12 additions & 15 deletions docs/supported_metrics.md
@@ -13,7 +13,7 @@ We include x mark if the metric is auto-installed in versa.
| 6 | x | PESQ in TorchAudio-Squim | squim_no_ref | torch_squim_pesq | [torch_squim](https://pytorch.org/audio/main/tutorials/squim_tutorial.html) | [paper](https://arxiv.org/abs/2304.01448) |
| 7 | x | STOI in TorchAudio-Squim | squim_no_ref | torch_squim_stoi | [torch_squim](https://pytorch.org/audio/main/tutorials/squim_tutorial.html) | [paper](https://arxiv.org/abs/2304.01448) |
| 8 | x | SI-SDR in TorchAudio-Squim | squim_no_ref | torch_squim_si_sdr | [torch_squim](https://pytorch.org/audio/main/tutorials/squim_tutorial.html) | [paper](https://arxiv.org/abs/2304.01448) |
-| 9 | x | Singing voice MOS | pseudo_mos | singmos_v1 |[singmos](https://github.com/South-Twilight/SingMOS) | [paper](https://arxiv.org/abs/2406.10911) |
+| 9 | x | Singing voice MOS | singmos | singmos |[singmos](https://github.com/South-Twilight/SingMOS/tree/main) | [paper](https://arxiv.org/abs/2406.10911) |
| 10 | x | Sheet SSQA MOS Models | sheet_ssqa | sheet_ssqa |[Sheet](https://github.com/unilight/sheet/tree/main) | [paper](https://arxiv.org/abs/2411.03715) |
| 11 | | UTMOSv2: UTokyo-SaruLab MOS Prediction System | utmosv2 | utmosv2 |[UTMOSv2](https://github.com/sarulab-speech/UTMOSv2) | [paper](https://arxiv.org/abs/2409.09305) |
| 12 | | Speech Contrastive Regression for Quality Assessment without reference (ScoreQ) | scoreq_nr | scoreq_nr |[ScoreQ](https://github.com/ftshijt/scoreq/tree/main) | [paper](https://arxiv.org/pdf/2410.06675) |
@@ -50,9 +50,9 @@ We include x mark if the metric is auto-installed in versa.
| 43 | x | Qwen2 Recording Environment - Background | qwen2_speech_background_environment_metric | qwen2_speech_background_environment_metric | [Qwen2 Audio](https://github.com/QwenLM/Qwen2-Audio) | [paper](https://arxiv.org/abs/2407.10759) |
| 44 | x | Qwen2 Recording Environment - Quality | qwen2_recording_quality_metric | qwen2_recording_quality_metric | [Qwen2 Audio](https://github.com/QwenLM/Qwen2-Audio) | [paper](https://arxiv.org/abs/2407.10759) |
| 45 | x | Qwen2 Recording Environment - Channel Type | qwen2_channel_type_metric | qwen2_channel_type_metric | [Qwen2 Audio](https://github.com/QwenLM/Qwen2-Audio) | [paper](https://arxiv.org/abs/2407.10759) |
-| 46 | x | Dimensional Emotion | w2v2_dimensional_emotion | w2v2_dimensional_emotion | [w2v2-how-to](https://github.com/audeering/w2v2-how-to) | [paper](https://arxiv.org/pdf/2203.07378) |
-| 47 | | Uni-VERSA (Versatile Speech Assessment with a Unified Framework) - No Reference | universa_noref | universa_score | [Uni-VERSA](https://huggingface.co/collections/espnet/universa-6834e7c0a28225bffb6e2526) | [paper](https://arxiv.org/abs/2505.20741) |
-| 48 | | ARECHO (Audio Reference Echo Cancellation and Codec Quality Assessment) - No Reference | arecho_noref | arecho_score | [ARECHO](https://huggingface.co/espnet/arecho_base_v0) | [paper](https://arxiv.org/abs/2505.20741) |
+| 46 | x | Dimensional Emotion | emo_vad | arousal_emo_vad, valence_emo_vad, dominance_emo_vad | [w2v2-how-to](https://github.com/audeering/w2v2-how-to) | [paper](https://arxiv.org/pdf/2203.07378) |
+| 47 | x | Uni-VERSA (Versatile Speech Assessment with a Unified Framework) | universa, universa_noref, universa_audioref, universa_textref, universa_fullref | universa_{sub_metrics} | [Uni-VERSA](https://huggingface.co/collections/espnet/universa-6834e7c0a28225bffb6e2526) | [paper](https://arxiv.org/abs/2505.20741) |
+| 48 | | ARECHO (Audio Reference Echo Cancellation and Codec Quality Assessment) - No Reference | arecho, arecho_noref | arecho_{sub_metrics} | [ARECHO](https://huggingface.co/espnet/arecho_base_v0) | [paper](https://arxiv.org/abs/2505.20741) |
| 49 | x | DNSMOS Pro: A Reduced-Size DNN for Probabilistic MOS of Speech | pseudo_mos | dnsmos_pro_bvcc | [DNSMOSPro](https://github.com/fcumlin/DNSMOSPro/tree/main) | [paper](https://www.isca-archive.org/interspeech_2024/cumlin24_interspeech.html) |
| 50 | x | DNSMOS Pro: A Reduced-Size DNN for Probabilistic MOS of Speech | pseudo_mos | dnsmos_pro_nisqa | [DNSMOSPro](https://github.com/fcumlin/DNSMOSPro/tree/main) | [paper](https://www.isca-archive.org/interspeech_2024/cumlin24_interspeech.html) |
| 51 | x | DNSMOS Pro: A Reduced-Size DNN for Probabilistic MOS of Speech | pseudo_mos | dnsmos_pro_vcc2018 | [DNSMOSPro](https://github.com/fcumlin/DNSMOSPro/tree/main) | [paper](https://www.isca-archive.org/interspeech_2024/cumlin24_interspeech.html) |
@@ -61,6 +61,7 @@ We include x mark if the metric is auto-installed in versa.
| 54 | x | VQScore (Self-Supervised Speech Quality Estimation and Enhancement Using Only Clean Speech) | vqscore | vqscore | [VQScore](https://github.com/JasonSWFu/VQscore) | [paper](https://arxiv.org/abs/2402.16321) |
| 55 | x | Singing voice MOS | pseudo_mos | singmos_pro |[singmos](https://github.com/South-Twilight/SingMOS) | [paper](https://arxiv.org/abs/2510.01812) |


### Dependent Metrics
|Number| Auto-Install | Metric Name (Auto-Install) | Key in config | Key in report | Code Source | References |
|---|---|------------------|---------------|---------------|-----------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------|
@@ -70,7 +71,7 @@ We include x mark if the metric is auto-installed in versa.
| 4 | x | Signal-to-interference Ratio (SIR) | signal_metric | sir | [espnet](https://github.com/espnet/espnet) | - |
| 5 | x | Signal-to-artifact Ratio (SAR) | signal_metric | sar | [espnet](https://github.com/espnet/espnet) | - |
| 6 | x | Signal-to-distortion Ratio (SDR) | signal_metric | sdr | [espnet](https://github.com/espnet/espnet) | - |
-| 7 | x | Convolutional scale-invariant signal-to-distortion ratio (CI-SDR) | signal_metric | ci-sdr | [ci_sdr](https://github.com/fgnt/ci_sdr) | [paper](https://arxiv.(org/abs/2011.15003) |
+| 7 | x | Convolutional scale-invariant signal-to-distortion ratio (CI-SDR) | signal_metric | ci-sdr | [ci_sdr](https://github.com/fgnt/ci_sdr) | [paper](https://arxiv.org/abs/2011.15003) |
| 8 | x | Scale-invariant signal-to-noise ratio (SI-SNR) | signal_metric | si-snr | [espnet](https://github.com/espnet/espnet) | [paper](https://arxiv.org/abs/1711.00541) |
| 9 | x | Perceptual Evaluation of Speech Quality (PESQ) | pesq | pesq | [pesq](https://pypi.org/project/pesq/) | [paper](https://ieeexplore.ieee.org/document/941023) |
| 10 | x | Short-Time Objective Intelligibility (STOI) | stoi | stoi | [pystoi](https://github.com/mpariente/pystoi) | [paper](https://ieeexplore.ieee.org/document/5495701) |
@@ -89,11 +90,10 @@ We include x mark if the metric is auto-installed in versa.
| 23 | | Composite Objective Speech Quality (composite) | pysepm | pysepm_Csig, pysepm_Cbak, pysepm_Covl | [pysepm](https://github.com/shimhz/pysepm.git) | [Paper](https://ecs.utdallas.edu/loizou/speech/obj_paper_jan08.pdf)|
| 24 | | Coherence and speech intelligibility index (CSII) | pysepm | pysepm_csii_high, pysepm_csii_mid, pysepm_csii_low | [pysepm](https://github.com/shimhz/pysepm.git) | [Paper](https://www.researchgate.net/profile/James-Kates-2/publication/7842209_Coherence_and_the_speech_intelligibility_index/links/546f5dab0cf2d67fc0310f88/Coherence-and-the-speech-intelligibility-index.pdf)|
| 25 | | Normalized-covariance measure (NCM) | pysepm | pysepm_ncm | [pysepm](https://github.com/shimhz/pysepm.git) | [Paper](https://pmc.ncbi.nlm.nih.gov/articles/PMC3037773/pdf/JASMAN-000128-003715_1.pdf)|
-| 26 | | Uni-VERSA (Versatile Speech Assessment with a Unified Framework) with Audio Reference | universa_audioref | universa_score | [Uni-VERSA](https://huggingface.co/collections/espnet/universa-6834e7c0a28225bffb6e2526) | [paper](https://arxiv.org/abs/2505.20741) |
-| 27 | | ARECHO (Audio Reference Echo Cancellation and Codec Quality Assessment) with Audio Reference | arecho_audioref | arecho_score | [ARECHO](https://huggingface.co/espnet/arecho_base_v0) | [paper](https://arxiv.org/abs/2505.20741) |
-| 28 | x | Chroma-related Alignment | chroma_alignment | chroma_{stft,cqt,cens}_{cosine, euclidean}_dtw{"", _log, _raw} | - | - |
-| 29 | x | Deep Perceptual Audio Metric (DPAM) | dpam | dpam_distance | [PerceptualAudio_Pytorch](https://github.com/adrienchaton/PerceptualAudio_pytorch) | [paper](https://arxiv.org/abs/2001.04460) |
-| 30 | x | Contrastive learning-based Deep Perceptual Audio Metric (CDPAM) | cdpam | cdpam_distance | [PerceptualAudio](https://github.com/pranaymanocha/PerceptualAudio/cdpam) | [paper](https://arxiv.org/abs/2102.05109) |
+| 26 | x | Uni-VERSA (Versatile Speech Assessment with a Unified Framework) with Paired Reference | universa | universa_{sub_metrics} | [Uni-VERSA](https://huggingface.co/collections/espnet/universa-6834e7c0a28225bffb6e2526) | [paper](https://arxiv.org/abs/2505.20741) |
+| 27 | x | Chroma-related Alignment | chroma_alignment | chroma_{stft,cqt,cens}_{cosine, euclidean}_dtw{"", _log, _raw} | - | - |
+| 28 | x | Deep Perceptual Audio Metric (DPAM) | dpam | dpam_distance | [PerceptualAudio_Pytorch](https://github.com/adrienchaton/PerceptualAudio_pytorch) | [paper](https://arxiv.org/abs/2001.04460) |
+| 29 | x | Contrastive learning-based Deep Perceptual Audio Metric (CDPAM) | cdpam | cdpam_distance | [PerceptualAudio](https://github.com/pranaymanocha/PerceptualAudio/cdpam) | [paper](https://arxiv.org/abs/2102.05109) |


### Non-match Metrics
@@ -111,11 +111,8 @@ We include x mark if the metric is auto-installed in versa.
| 9 | | Contrastive Language-Audio Pretraining Score (CLAP Score) | clap_score | clap_score | [fadtk](https://github.com/gudgud96/frechet-audio-distance) | [paper](https://arxiv.org/abs/2301.12661) |
| 10 | | Accompaniment Prompt Adherence (APA) | apa | apa | [Sony-audio-metrics](https://github.com/SonyCSLParis/audio-metrics) | [paper](https://arxiv.org/abs/2404.00775) |
| 11 | | Log Likelihood Ratio (LLR) | pysepm | pysepm_llr | [pysepm](https://github.com/shimhz/pysepm.git) | [Paper](https://ecs.utdallas.edu/loizou/speech/obj_paper_jan08.pdf)|
-| 12 | | Uni-VERSA (Versatile Speech Assessment with a Unified Framework) with Text Reference | universa_textref | universa_score | [Uni-VERSA](https://huggingface.co/collections/espnet/universa-6834e7c0a28225bffb6e2526) | [paper](https://arxiv.org/abs/2505.20741) |
-| 13 | | Uni-VERSA (Versatile Speech Assessment with a Unified Framework) with Full Reference | universa_fullref | universa_score | [Uni-VERSA](https://huggingface.co/collections/espnet/universa-6834e7c0a28225bffb6e2526) | [paper](https://arxiv.org/abs/2505.20741) |
-| 14 | | ARECHO (Audio Reference Echo Cancellation and Codec Quality Assessment) with Text Reference | arecho_textref | arecho_score | [ARECHO](https://huggingface.co/espnet/arecho_base_v0) | [paper](https://arxiv.org/abs/2505.20741) |
-| 15 | | ARECHO (Audio Reference Echo Cancellation and Codec Quality Assessment) with Full Reference | arecho_fullref | arecho_score | [ARECHO](https://huggingface.co/espnet/arecho_base_v0) | [paper](https://arxiv.org/abs/2505.20741) |
-| 16 | | Singer Embedding Similarity | singer | singer_similarity | [SSL-Singer-Identity](https://github.com/SonyCSLParis/ssl-singer-identity) | [paper](https://hal.science/hal-04186048v1) |
+| 12 | x | Uni-VERSA (Versatile Speech Assessment with a Unified Framework) with Paired Text | universa | universa_{sub_metrics} | [Uni-VERSA](https://huggingface.co/collections/espnet/universa-6834e7c0a28225bffb6e2526) | [paper](https://arxiv.org/abs/2505.20741) |
+| 13 | | Singer Embedding Similarity | singer | singer_similarity | [SSL-Singer-Identity](https://github.com/SonyCSLParis/ssl-singer-identity) | [paper](https://hal.science/hal-04186048v1) |

### Distributional Metrics (in verifying)

2 changes: 1 addition & 1 deletion egs/demo/se.yaml
@@ -90,4 +90,4 @@
# --nisqa_loud_pred: NISQA loudness prediction
# NOTE(jiatong): pretrain model can be downloaded with `./tools/setup_nisqa.sh`
- name: nisqa
-  nisqa_model_path: ./tools/NISQA/weights/nisqa.tar
+  nisqa_model_path: versa_cache/nisqa/nisqa.tar
5 changes: 5 additions & 0 deletions egs/separate_metrics/cdpam_distance.yaml
@@ -0,0 +1,5 @@
# CDPAM distance metrics
# CDPAM distance between audio samples
# More info in https://github.com/pranaymanocha/PerceptualAudio
# -- cdpam_distance: the CDPAM distance between audio samples
- name: cdpam_distance