This is the code repository for the neural speech codec presented in the EMNLP 2024 paper "ESC: Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers" [paper].
- Our neural speech codec ESC, at only 30MB, efficiently compresses 16kHz speech at bitrates of 1.5, 3, 4.5, 6, 7.5, and 9 kbps while maintaining reconstruction quality comparable to Descript's audio codec (DAC).
- We provide pretrained model checkpoints [download] for different ESC variants and DAC models, as well as a demo webpage [link] including multilingual speech samples.
conda create -n esc python=3.8
conda activate esc
pip install -r requirements.txt

python -m scripts.compress --input /path/to/input.wav --save_path /path/to/output --model_path /path/to/model --num_streams 6 --device cpu

This will create .pth (codes) and .wav (reconstructed audio) files under the specified save_path. Our codec supports num_streams from 1 to 6, corresponding to bitrates from 1.5 to 9.0 kbps. For programmatic usage, you can compress audio tensors using torchaudio as follows:
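import torch
import torchaudio
from esc import ESC

# `config` must hold the model hyperparameters matching the checkpoint being loaded.
model = ESC(**config)
model.load_state_dict(torch.load("model.pth", map_location="cpu"))
x, _ = torchaudio.load("input.wav")  # waveform tensor; the returned sample rate is discarded
# Encode at num_streams * 1.5 kbps (6 streams = 9 kbps)
codes, f_shape = model.encode(x, num_streams=6)
# Decode the codes back into a waveform
recon_x = model.decode(codes, f_shape)

For more details, see the example.ipynb notebook.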
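The relation between num_streams and bitrate (1.5 kbps per stream, 1 to 6 streams) can be written out as a small helper. This is an illustrative sketch and not part of the esc package: the function names are hypothetical, and the compression ratio assumes the uncompressed reference is 16-bit mono PCM at 16 kHz (256 kbps), which is not stated here.

```python
KBPS_PER_STREAM = 1.5  # each stream contributes 1.5 kbps
MAX_STREAMS = 6        # supported range is 1..6


def bitrate_kbps(num_streams: int) -> float:
    """Bitrate in kbps for a given number of streams."""
    if not 1 <= num_streams <= MAX_STREAMS:
        raise ValueError(f"num_streams must be in 1..{MAX_STREAMS}, got {num_streams}")
    return num_streams * KBPS_PER_STREAM


def compression_ratio(num_streams: int, sample_rate: int = 16000,
                      bits_per_sample: int = 16) -> float:
    """Ratio of raw PCM bitrate to codec bitrate (assumes mono 16-bit PCM input)."""
    raw_kbps = sample_rate * bits_per_sample / 1000
    return raw_kbps / bitrate_kbps(num_streams)


if __name__ == "__main__":
    for n in range(1, MAX_STREAMS + 1):
        print(f"{n} stream(s): {bitrate_kbps(n):.1f} kbps, "
              f"about {compression_ratio(n):.1f}x smaller than 16-bit PCM")
```

Under that PCM assumption, 6 streams (9 kbps) corresponds to a ratio of roughly 28x and 1 stream (1.5 kbps) to roughly 171x.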
We provide development training and evaluation datasets on Hugging Face. For custom training, set the train_data_path in exp.yaml to the parent directory containing .wav audio segments. Run the following to start training:
WANDB_API_KEY=your_API_key \
accelerate launch main.py --exp_name esc9kbps --config_path ./configs/9kbps_esc_base.yaml --wandb_project efficient-speech-codec --lr 1.0e-4 --num_epochs 80 --num_pretraining_epochs 15 --num_devices 4 --dropout_rate 0.75 --save_path /path/to/output --seed 53

We use the accelerate library to handle distributed training and the wandb library for monitoring. To enable adversarial training with the same discriminator as DAC, include the --adv_training flag.
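Before launching a custom training run, it can help to sanity-check the directory you point train_data_path at. The sketch below is not part of this repository: scan_wav_dir is a hypothetical helper, the recursive search is an assumption about how segments are laid out, and the 16 kHz check simply reflects the sample rate the codec targets. It uses only the standard-library wave module, which reads PCM WAV files only.

```python
import wave
from pathlib import Path


def scan_wav_dir(root, expected_sr=16000):
    """Split the .wav files under `root` into those matching `expected_sr` and the rest.

    Returns two lists of (path, sample_rate) tuples: (matching, mismatched).
    Assumption: files are PCM WAV; other encodings raise wave.Error.
    """
    matching, mismatched = [], []
    for path in sorted(Path(root).rglob("*.wav")):
        with wave.open(str(path), "rb") as wf:
            sample_rate = wf.getframerate()
        bucket = matching if sample_rate == expected_sr else mismatched
        bucket.append((path, sample_rate))
    return matching, mismatched


if __name__ == "__main__":
    ok, bad = scan_wav_dir("/path/to/train_data")
    print(f"{len(ok)} files at 16 kHz, {len(bad)} files at another sample rate")
    for path, sample_rate in bad[:10]:
        print(f"  {path}: {sample_rate} Hz")
```

Files reported in the second list would need resampling before they match the 16 kHz speech the codec is described as compressing.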
Training a base ESC model on 4 RTX4090 GPUs takes ~16 hours for 250k steps on 3-second speech clips with a batch size of 36. Detailed experiment configurations can be found in the configs/ folder. For complete experiments presented in the paper, refer to scripts_all.sh.
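The figures above imply a rough training budget, which the following arithmetic sketch works out. It assumes the batch size of 36 is the global batch across all four GPUs; if it is per device, the amount of audio processed would be four times larger.

```python
steps = 250_000
batch_size = 36          # assumed to be the global batch size
clip_seconds = 3
wall_clock_hours = 16
num_gpus = 4

# Total audio seen by the model, counting repeated clips across epochs.
audio_hours = steps * batch_size * clip_seconds / 3600

# Average optimizer steps per second of wall-clock time.
steps_per_second = steps / (wall_clock_hours * 3600)

# Aggregate compute budget.
gpu_hours = wall_clock_hours * num_gpus

print(f"audio processed: {audio_hours:.0f} hours")      # 7500 hours
print(f"throughput:      {steps_per_second:.2f} steps/s")  # 4.34 steps/s
print(f"compute budget:  {gpu_hours} GPU-hours")         # 64 GPU-hours
```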
CUDA_VISIBLE_DEVICES=0 \
python -m scripts.test --eval_folder_path path/to/data --batch_size 12 --model_path /path/to/model --device cuda

This will run codec evaluation across all available bandwidths on the specified test set folder. We report four metrics: PESQ, Mel-Distance, SI-SDR, and Bitrate-Utilization-Rate. Evaluation statistics are saved under model_path by default.
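For readers unfamiliar with SI-SDR, here is a minimal pure-Python sketch of one common formulation (zero-mean signals, projection of the estimate onto the reference). It is meant only to illustrate what the metric measures; the evaluation script in this repository may compute it differently, and the function name si_sdr is an assumption made for this example.

```python
import math


def si_sdr(reference, estimate, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB for two equal-length sequences."""
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    n = len(reference)

    # Remove the DC offset from both signals.
    ref_mean = sum(reference) / n
    est_mean = sum(estimate) / n
    ref = [r - ref_mean for r in reference]
    est = [e - est_mean for e in estimate]

    # Project the estimate onto the reference to get the scaled target.
    dot = sum(r * e for r, e in zip(ref, est))
    ref_energy = sum(r * r for r in ref)
    alpha = dot / (ref_energy + eps)
    target = [alpha * r for r in ref]

    # Everything in the estimate that is not the scaled target counts as distortion.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(x * x for x in noise)
    return 10 * math.log10((target_energy + eps) / (noise_energy + eps))
```

Because the target is rescaled by the projection coefficient, multiplying the reconstruction by a constant gain leaves the score essentially unchanged, which is why the metric is useful for codecs whose output level may differ from the input.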
You can download the pre-trained model checkpoints below:
| Codec | Checkpoint | #Param. |
|---|---|---|
| ESC-Base | Download | 8.39M |
| ESC-Base(adv) | Download | 8.39M |
| ESC-Large | Download | 15.58M |
| DAC-Tiny(adv) | Download | 8.17M |
| DAC-Tiny | Download | 8.17M |
| DAC-Base(adv) | Download | 74.31M |
We provide a comprehensive performance comparison of ESC with Descript's audio codec (DAC) at different model sizes, with and without adversarial training.
If you find our work useful or relevant to your research, please kindly cite our paper:
@article{gu2024esc,
title={ESC: Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers},
author={Gu, Yuzhe and Diao, Enmao},
journal={arXiv preprint arXiv:2404.19441},
year={2024}
}