
VoiceTextBlender

This repo contains the code for our paper:

Yifan Peng*, Krishna C. Puvvada*, Zhehuai Chen*, Piotr Zelasko, He Huang, Kunal Dhawan, Ke Hu, Shinji Watanabe, Jagadeesh Balam, and Boris Ginsburg, "VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning," in Proc. NAACL, 2025. [arXiv]

Overview

We propose a novel single-stage joint speech-text SFT approach for training SpeechLMs using LoRA adapters. Our model achieves excellent performance across various speech benchmarks while retaining performance on text-only benchmarks. Our 3B model even outperforms previous 7B or 13B SpeechLMs on most evaluated benchmarks. Furthermore, our model exhibits emergent capabilities in handling previously unseen instructions and multi-turn mixed-modal conversations.

Specifically, we combine multi-turn text-only SFT data with single-turn speech-related SFT data during training. To extend beyond speech-based QA tasks, we also propose a novel data generation method that creates mixed-modal, interleaved speech-text inputs.

Installation

The code is based on an older version of NVIDIA NeMo. We use a Docker container for both training and decoding.
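A minimal sketch of the container workflow is below. The image tag is a placeholder (use the NeMo release this fork is based on), and the mount path is only an example:

```bash
# Pull a NeMo training container (tag is a placeholder; pick the release this fork tracks)
docker pull nvcr.io/nvidia/nemo:24.01.speech

# Launch with GPU access, mounting this repo into the container
docker run --gpus all -it --rm \
    -v "$PWD":/workspace/NeMo_VoiceTextBlender \
    --shm-size=8g \
    nvcr.io/nvidia/nemo:24.01.speech
```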

Generating speech SFT data
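As a rough illustration of the kind of single-turn speech SFT record the training consumes, here is a NeMo-style JSONL manifest line. All field names are assumptions for illustration; see the data scripts in this repo for the actual schema:

```bash
# Illustrative only: one JSON object per line in a NeMo-style manifest.
# Field names below are assumptions; check this repo's data scripts for the real schema.
cat > sample_speech_sft.jsonl << 'EOF'
{"audio_filepath": "audio/utt1.wav", "duration": 4.2, "context": "Summarize the speaker's request.", "answer": "The speaker asks for tomorrow's weather forecast."}
EOF
```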

Running inference using a pre-trained model
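A minimal decoding sketch is shown below. The script path and Hydra overrides are assumptions modeled on NeMo's speech_llm examples; check this repo for the exact entry point and config keys:

```bash
# Decode a test manifest with a trained checkpoint (paths/overrides are assumptions)
python examples/multimodal/speech_llm/modular_audio_gpt_eval.py \
    model.restore_from_path=/path/to/voicetextblender_checkpoint.nemo \
    model.data.test_ds.manifest_filepath=/path/to/test_manifest.jsonl
```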

Training a new model
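Below is a minimal sketch of launching single-stage joint speech-text SFT with LoRA. The script path, config name, and overrides are assumptions modeled on NeMo's speech_llm examples; consult the configs in this repo for the real values:

```bash
# Joint speech-text SFT with a LoRA adapter (all paths and overrides are assumptions)
python examples/multimodal/speech_llm/modular_audio_gpt_train.py \
    --config-path=conf --config-name=modular_audio_gpt_config_peft \
    trainer.devices=8 \
    model.restore_from_path=/path/to/base_llm.nemo \
    model.data.train_ds.manifest_filepath=[/path/to/speech_sft.jsonl,/path/to/text_sft.jsonl]
```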

Citation

```bibtex
@inproceedings{vtblender,
    title={{VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning}},
    author={Yifan Peng and Krishna C. Puvvada and Zhehuai Chen and Piotr Zelasko and He Huang and Kunal Dhawan and Ke Hu and Shinji Watanabe and Jagadeesh Balam and Boris Ginsburg},
    year={2025},
    booktitle={Proceedings of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL)},
}
```
