Starred repositories
Zonos-v0.1 is a leading open-weight text-to-speech model trained on more than 200k hours of varied multilingual speech, delivering expressiveness and quality on par with, or surpassing, top TTS …
The open source code for SimpleSpeech series
[ICASSP 2025] FreeSVC: Towards Zero-shot Multilingual Singing Voice Conversion
Hibiki is a model for streaming speech translation (also known as simultaneous translation). Unlike offline translation, where one waits for the end of the source utterance to start translating, H…
[Interspeech 2024] Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation
An open-source multimodal large language model that can hear and talk while thinking, featuring real-time end-to-end speech input and streaming audio output for conversation.
Repository for the paper "Combining audio control and style transfer using latent diffusion", accepted at ISMIR 2024
Companion code for ISMIR 2017 paper "Deep Salience Representations for $F_0$ Estimation in Polyphonic Music"
YuE: Open Full-song Music Generation Foundation Model, an open alternative to Suno.ai
🗣️🇧🇷 Transcribed audio datasets in Brazilian Portuguese
Multilingual large voice generation model, providing full-stack inference, training, and deployment capabilities.
Awesome music generation model: MG²
Implementation of MusicLM, Google's new SOTA model for music generation using attention networks, in Pytorch
Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
21 Lessons, Get Started Building with Generative AI 🔗 https://microsoft.github.io/generative-ai-for-beginners/
A Non-Autoregressive Transformer-based Text-to-Speech system, supporting a family of SOTA transformers with supervised and unsupervised duration modeling. This project grows with the research community, …
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec.
A HuggingFace compatible Small Language Model trainer.
Speech To Speech: an effort toward an open-source, modular GPT-4o
ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations
SOTA discrete acoustic codec models with 40 tokens per second for audio language modeling
LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models