This repo contains the code for our paper:
Yifan Peng*, Krishna C. Puvvada*, Zhehuai Chen*, Piotr Zelasko, He Huang, Kunal Dhawan, Ke Hu, Shinji Watanabe, Jagadeesh Balam, and Boris Ginsburg, "VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning," in Proc. NAACL, 2025. (accepted) [arXiv]
We propose a novel single-stage joint speech-text SFT approach for training SpeechLMs using LoRA adapters. Our model achieves excellent performance across various speech benchmarks while retaining performance on text-only benchmarks. Our 3B model even outperforms previous 7B or 13B SpeechLMs on most evaluated benchmarks. Furthermore, our model exhibits emergent capabilities in handling previously unseen instructions and multi-turn mixed-modal conversations.
Specifically, we combine multi-turn text-only SFT data with single-turn speech-related SFT data during training. To extend beyond speech-based QA tasks, we propose a novel data generation method that creates mixed-modal inputs with interleaved speech and text, as sketched below.
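To make the idea concrete, here is a minimal, illustrative sketch of how a mixed-modal training sample could be assembled by interleaving written text with audio placeholders. This is plain Python, not the repo's actual NeMo data pipeline; the names `AUDIO_TOKEN`, `InterleavedSample`, and `build_interleaved_sample` are hypothetical and only meant to convey the format.

```python
# Illustrative sketch only: assembles a prompt that interleaves written text with
# audio placeholders, each backed by an audio file (e.g., synthesized with TTS).
# All names here are hypothetical and do not reflect the repo's actual API.
from dataclasses import dataclass, field
from typing import List

AUDIO_TOKEN = "<audio>"  # placeholder the prompt template would map to speech features


@dataclass
class InterleavedSample:
    prompt: str                                            # text with audio placeholders inserted
    audio_paths: List[str] = field(default_factory=list)   # one audio file per placeholder, in order


def build_interleaved_sample(text_segments: List[str],
                             spoken_segments: List[str]) -> InterleavedSample:
    """Interleave written text with spoken (audio) segments.

    text_segments[i] is kept as text; spoken_segments[i] is represented by an
    audio placeholder in the prompt and its file path is recorded in audio_paths.
    """
    parts, audio_paths = [], []
    for text, audio in zip(text_segments, spoken_segments):
        parts.append(text)
        parts.append(AUDIO_TOKEN)
        audio_paths.append(audio)
    return InterleavedSample(prompt=" ".join(parts), audio_paths=audio_paths)


if __name__ == "__main__":
    sample = build_interleaved_sample(
        text_segments=["Summarize what the speaker says here:",
                       "and compare it with this second clip:"],
        spoken_segments=["clip_0001.wav", "clip_0002.wav"],
    )
    print(sample.prompt)       # "Summarize what the speaker says here: <audio> and compare ... <audio>"
    print(sample.audio_paths)  # ["clip_0001.wav", "clip_0002.wav"]
```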
The code is based on an older version of NVIDIA NeMo. We use a Docker container to train the model and run decoding.
@inproceedings{vtblender,
  title={{VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning}},
  author={Yifan Peng and Krishna C. Puvvada and Zhehuai Chen and Piotr Zelasko and He Huang and Kunal Dhawan and Ke Hu and Shinji Watanabe and Jagadeesh Balam and Boris Ginsburg},
  booktitle={Proceedings of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL)},
  year={2025},
}