This repository is in support of the ICASSP 2026 submission titled "Unseen but not Unknown: Using Dataset Concealment to Robustly Evaluate Speech Quality Estimation Models."
The software builds on the published AlignNet repository.
Clone the repository.
For the clone to include the trained models, you will need Git LFS.
Installation instructions are here.
conda env create -f environment.yml
conda activate dsc
pip install .
To create an environment with package versions exactly matching those used to run the tests for the paper, run
conda env create -f environment-paper.yml
conda activate dsc-paper
pip install .
In order to quickly define the Individual, Concealed, and Global dataset groups and train models with them, use the prep-dataset-groups.py CLI.
Here we present an example of generating Dataset Concealment results with three example datasets.
The process extends easily to more datasets.
The file example_dataset_input.yaml has an example of what a dataset group dictionary looks like.
It is formatted as
"Dataset1": "path/to/dataset1"
"Dataset2": "path/to/dataset2"
"Dataset3": "path/to/dataset3"
It is expected that each dataset path is structured as follows:
path/to/dataset1
├── test.csv
├── train.csv
└── valid.csv
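Before generating the group files, it can be useful to confirm that every dataset path in the dictionary contains these three csv files. The following is a minimal sketch of such a check, not part of the repository's CLI, assuming PyYAML is installed and the dictionary is saved as example_dataset_input.yaml.
import pathlib
import yaml  # PyYAML
# Load the dataset group dictionary described above.
with open("example_dataset_input.yaml") as f:
    datasets = yaml.safe_load(f)
# Confirm each dataset directory contains the expected split files.
for name, path in datasets.items():
    for split in ("train.csv", "valid.csv", "test.csv"):
        csv_path = pathlib.Path(path) / split
        if not csv_path.is_file():
            print(f"{name}: missing {csv_path}")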
Running
python prep-dataset-groups.py example_dataset_input.yaml --output-dir dsc/config/data/data_dirs/example
will generate all the required dataset group files in the dsc/config/data/data_dirs/example folder.
The data_dirs/example folder has the following structure.
dsc/config/data/data_dirs/example
├── Concealed
│ ├── Conceal-Dataset1.yaml
│ ├── Conceal-Dataset2.yaml
│ └── Conceal-Dataset3.yaml
├── Global.yaml
├── Individual
│ ├── Dataset1.yaml
│ ├── Dataset2.yaml
│ └── Dataset3.yaml
The following enables Global training on the first available optimization device. This may need to be updated depending on your configuration.
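If it is unclear which device indices are available on your machine, a quick check such as the following can help. This is a minimal sketch assuming a PyTorch backend (suggested by the optimization.devices override), not a command provided by the repository.
import torch
# List the CUDA devices visible to PyTorch; the optimization.devices
# override presumably indexes into this list.
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(i, torch.cuda.get_device_name(i))
else:
    print("No CUDA devices found.")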
Wav2Vec
python train.py \
data/data_dirs=Global \
'optimization.devices=[0]' \
--config-dir dsc/config/models/ \
--config-name alignnet-wav2vec
NISQA
python train.py \
data/data_dirs=Global \
'optimization.devices=[0]' \
--config-dir dsc/config/models/ \
--config-name alignnet-nisqa
MOSNet
python train.py \
data/data_dirs=Global \
'optimization.devices=[0]' \
--config-dir dsc/config/models/ \
--config-name alignnet-MOSNet
The following example demonstrates quickly training all the concealed dataset groups. The logic for other models or individual dataset groups is similar.
python train.py -m \
data/data_dirs=Concealed/Conceal-Dataset1,Concealed/Conceal-Dataset2,Concealed/Conceal-Dataset3 \
'optimization.devices=[0]' \
--config-dir dsc/config/models/ \
--config-name alignnet-nisqa
The config files in the dsc/config directory contain all the configuration details used to train the models that generated results for the paper "Unseen but not Unknown: Using Dataset Concealment to Robustly Evaluate Speech Quality Estimation Models."
The trained_models folder contains three example trained model checkpoints used in the paper.
The results from the paper come from 10 independently trained versions of each model, so the included checkpoints cannot fully replicate results on their own.
They are included for convenience.
Trained models can easily be used for inference via the CLI built into inference.py.
Some basic help can be seen via
python inference.py --help
In general, three overrides must be set:
- model.path - path to a trained model.
- data.data_dirs - list containing absolute paths to csv files that list the audio files to perform inference on.
- output.file - path to the file where inference output will be stored.
After running inference, a csv will be created at output.file with the following columns:
- file - filename the audio was loaded from.
- estimate - estimate generated by the model.
- dataset - index listing which file from data.data_files this file belongs to.
- AlignNet dataset index - index listing which dataset within the model the scores come from. This will be the same for every file in the csv. The default dataset will always be the reference dataset, but this can be overridden via model.dataset_index.
For example, to run inference using the included NISQA-AlignNet-Global model, one would run
python inference.py \
data.data_dirs=[/path/to/datafile.csv] \
model.path=trained_models/NISQA-AlignNet-Global \
output.file=estimations.csv \
--config-name nisqa.yaml
The example Wav2Vec model can be used like:
python inference.py \
data.data_dirs=[/path/to/datafile.csv] \
model.path=trained_models/Wav2Vec-AlignNet-Global \
output.file=estimations.csv \
--config-name wav2vec.yaml
And the example MOSNet model can be used like:
python inference.py \
data.data_dirs=[/path/to/datafile.csv] \
model.path=trained_models/MOSNet-AlignNet-Global \
output.file=estimations.csv \
--config-name mosnet.yaml
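Once any of these commands finish, the resulting csv can be inspected with standard tools. The following is a minimal sketch using pandas, assuming the column names described above and the estimations.csv output path used in the examples; it is not part of the repository's CLI.
import pandas as pd
# Load the inference output; columns are file, estimate, dataset,
# and AlignNet dataset index as described above.
estimates = pd.read_csv("estimations.csv")
# Summary statistics of the estimates, grouped by input data file index.
print(estimates.groupby("dataset")["estimate"].describe())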
Most of the datasets used in the paper can be found via the links and references here.
The remaining dataset can be found at:
- TMHINT-QI
- Chen, Y.-W., Tsao, Y. (2022) InQSS: a speech intelligibility and quality assessment model using a multi-task learning network. Proc. Interspeech 2022, 3088-3092, doi: 10.21437/Interspeech.2022-10153.
For datasets that provide specifically curated training/testing data splits, those splits are used. Specifically, the following datasets provide this information:
- NISQA SIM
- VMC22
- TMHINT-QI
For both NISQA SIM and TMHINT-QI we randomly split the data provided for training into training and validation sets, with 90% of the data in the training set and 10% in the validation set. VMC22 provides specifically curated training, validation, and test sets, which were used for all tests. To assess the stability of training individually with VMC22, randomness was introduced by using different random seeds; for all other tests the seed was kept fixed, so randomness comes only from the different data splits.
For most of the other datasets we randomly split the data to achieve 70% in the training set, 15% in the validation set, and 15% in the test set. This applies to the following datasets:
- Tencent
- VCC18
- IU
- PSTN
The FFTNet and NOIZEUS datasets are significantly smaller than the other datasets used during training: they have 1200 and 1664 audio files respectively, an order of magnitude fewer than most of the other datasets. Because of this, we opted for larger training splits for these two datasets to ensure that the model sees a sufficient number of examples from each when training with multiple datasets at once. For these two datasets we randomly split the data to achieve 80% in the training set, 10% in the validation set, and 10% in the test set.
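For reference, the following is a minimal sketch of one way to produce such random splits with pandas. It is not the script used for the paper; the input file all_data.csv and its columns are hypothetical, and the fractions and seed should be set to match the splits described above.
import pandas as pd
# Hypothetical input: a single csv listing every audio file and label in a dataset.
data = pd.read_csv("all_data.csv")
# Example fractions: 70/15/15; use 0.8/0.1/0.1 for the smaller datasets,
# or 0.9/0.1 when the test split is already provided.
train_frac, valid_frac = 0.70, 0.15
seed = 0  # fixed seed; varying it introduces randomness across runs
# Shuffle once, then slice into the splits expected by the repository.
shuffled = data.sample(frac=1, random_state=seed).reset_index(drop=True)
n_train = int(train_frac * len(shuffled))
n_valid = int(valid_frac * len(shuffled))
shuffled[:n_train].to_csv("train.csv", index=False)
shuffled[n_train:n_train + n_valid].to_csv("valid.csv", index=False)
shuffled[n_train + n_valid:].to_csv("test.csv", index=False)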