diff --git a/mint.json b/mint.json index 01d1efe..adddb42 100644 --- a/mint.json +++ b/mint.json @@ -162,8 +162,8 @@ "server/services/stt/gladia", "server/services/stt/google", "server/services/stt/groq", + "server/services/stt/riva", "server/services/stt/openai", - "server/services/stt/parakeet", "server/services/stt/ultravox", "server/services/stt/whisper" ] @@ -198,12 +198,12 @@ "server/services/tts/cartesia", "server/services/tts/deepgram", "server/services/tts/elevenlabs", - "server/services/tts/fastpitch", "server/services/tts/fish", "server/services/tts/google", "server/services/tts/groq", "server/services/tts/lmnt", "server/services/tts/neuphonic", + "server/services/tts/riva", "server/services/tts/openai", "server/services/tts/piper", "server/services/tts/playht", diff --git a/server/services/stt/parakeet.mdx b/server/services/stt/parakeet.mdx deleted file mode 100644 index dc5096f..0000000 --- a/server/services/stt/parakeet.mdx +++ /dev/null @@ -1,188 +0,0 @@ ---- -title: "NVIDIA Parakeet" -description: "Speech-to-text service implementation using NVIDIA’s Parakeet speech recognition model" ---- - -## Overview - -`ParakeetSTTService` provides real-time speech-to-text capabilities using NVIDIA's Riva Parakeet model. It supports interim results and configurable recognition parameters for enhanced accuracy. - -## Installation - -To use `ParakeetSTTService`, install the required dependencies: - -```bash -pip install "pipecat-ai[riva]" -``` - -You'll also need to set up your NVIDIA API key as an environment variable: `NVIDIA_API_KEY`. - - - You can obtain an NVIDIA API key by signing up through [NVIDIA's developer - portal](https://developer.nvidia.com). 
- - -## Configuration - -### Constructor Parameters - - - Your NVIDIA API key - - - - NVIDIA Riva server address - - - - NVIDIA function identifier for the STT service - - - - Audio sample rate in Hz - - - - Additional configuration parameters - - -### InputParams - - - The language for speech recognition - - -## Input - -The service processes audio frames containing: - -- Raw PCM audio data -- 16-bit depth -- Single channel (mono) - -## Output Frames - -### TranscriptionFrame - -Generated for final transcriptions, containing: - - - Transcribed text - - - - User identifier - - - - ISO 8601 formatted timestamp - - - - Language used for transcription - - -### InterimTranscriptionFrame - -Generated during ongoing speech, containing same fields as TranscriptionFrame but with preliminary results. - -## Methods - -See the [STT base class methods](/server/base-classes/speech#methods) for additional functionality. - -## Usage Example - -```python -from pipecat.services.riva.stt import ParakeetSTTService -from pipecat.transcriptions.language import Language - -# Configure service -stt = ParakeetSTTService( - api_key="your-nvidia-api-key", - params=ParakeetSTTService.InputParams( - language=Language.EN_US - ) -) - -# Use in pipeline -pipeline = Pipeline([ - transport.input(), - stt, - llm, - ... 
-]) -``` - -## Language Support - -Parakeet STT primarily supports English with various regional accents: - -| Language Code | Description | Service Codes | -| ---------------- | ------------ | ------------- | -| `Language.EN_US` | English (US) | `en-US` | - -## Frame Flow - -```mermaid -graph TD - A[InputAudioRawFrame] --> B[ParakeetSTTService] - B --> C[InterimTranscriptionFrame] - B --> D[TranscriptionFrame] - B --> E[ErrorFrame] - C --> F[Real-time Processing] - D --> G[Final Processing] -``` - -## Advanced Configuration - -The service supports several advanced configuration options that can be adjusted: - - - Filter profanity from transcription - - - - Automatically add punctuation - - - - Whether to disable verbatim transcripts - - - - List of words to boost in the language model - - - - Score applied to boosted words - - -## Example with Advanced Configuration - -```python -# Configure service with advanced parameters -stt = ParakeetSTTService( - api_key="your-nvidia-api-key", - params=ParakeetSTTService.InputParams( - language=Language.EN_US - ) -) - -# Configure advanced options -stt._profanity_filter = True -stt._automatic_punctuation = True -stt._boosted_lm_words = ["PipeCat", "AI", "speech"] -``` - -## Notes - -- Uses NVIDIA's Riva AI Services platform -- Handles streaming audio input -- Provides real-time transcription results -- Manages connection lifecycle -- Uses asyncio for asynchronous processing -- Automatically cleans up resources on stop/cancel diff --git a/server/services/stt/riva.mdx b/server/services/stt/riva.mdx new file mode 100644 index 0000000..03eddad --- /dev/null +++ b/server/services/stt/riva.mdx @@ -0,0 +1,292 @@ +--- +title: "NVIDIA Riva" +description: "Speech-to-text service implementation using NVIDIA Riva" +--- + +## Overview + +`RivaSTTService` provides real-time speech-to-text capabilities using NVIDIA's Riva Parakeet model. It supports interim results and configurable recognition parameters for enhanced accuracy. 
`RivaSegmentedSTTService` provides speech-to-text capabilities via NVIDIA's Riva Canary model.
+
+## Installation
+
+To use `RivaSTTService` or `RivaSegmentedSTTService`, install the required dependencies:
+
+```bash
+pip install "pipecat-ai[riva]"
+```
+
+You'll also need to set up your NVIDIA API key as an environment variable: `NVIDIA_API_KEY`.
+
+
+  You can obtain an NVIDIA API key by signing up through [NVIDIA's developer
+  portal](https://developer.nvidia.com).
+
+
+## RivaSTTService
+
+### Configuration
+
+
+  Your NVIDIA API key
+
+
+
+  NVIDIA Riva server address
+
+
+
+  A mapping of the NVIDIA function identifier to the model name for the STT service.
+
+
+
+  Audio sample rate in Hz
+
+
+
+  Additional configuration parameters
+
+
+#### InputParams
+
+
+  The language for speech recognition
+
+
+### Input
+
+The service processes audio frames containing:
+
+- Raw PCM audio data
+- 16-bit depth
+- Single channel (mono)
+
+### Output Frames
+
+#### TranscriptionFrame
+
+Generated for final transcriptions, containing:
+
+
+  Transcribed text
+
+
+
+  User identifier
+
+
+
+  ISO 8601 formatted timestamp
+
+
+
+  Language used for transcription
+
+
+#### InterimTranscriptionFrame
+
+Generated during ongoing speech, containing the same fields as TranscriptionFrame but with preliminary results.
+
+## RivaSegmentedSTTService
+
+### Configuration
+
+
+  Your NVIDIA API key
+
+
+
+  NVIDIA Riva server address
+
+
+
+  A mapping of the NVIDIA function identifier to the model name for the STT service.
+
+
+
+  Audio sample rate in Hz
+
+
+
+  Additional configuration parameters
+
+
+#### InputParams
+
+
+  The language for speech recognition
+
+
+### Input
+
+The service processes audio frames containing:
+
+- Raw audio bytes in WAV format
+
+### Output Frames
+
+#### TranscriptionFrame
+
+Generated for final transcriptions, containing:
+
+
+  Transcribed text
+
+
+
+  User identifier
+
+
+
+  ISO 8601 formatted timestamp
+
+
+
+  Language used for transcription
+
+
+#### InterimTranscriptionFrame
+
+Generated during ongoing speech, containing the same fields as TranscriptionFrame but with preliminary results.
+
+## Methods
+
+See the [STT base class methods](/server/base-classes/speech#methods) for additional functionality.
+
+## Models
+
+| Model                   | Pipecat Class           | Model Card Link                                                                      |
+| ----------------------- | ----------------------- | ------------------------------------------------------------------------------------ |
+| `parakeet-ctc-1.1b-asr` | RivaSTTService          | [NVIDIA Model Card](https://build.nvidia.com/nvidia/parakeet-ctc-1_1b-asr/modelcard) |
+| `canary-1b-asr`         | RivaSegmentedSTTService | [NVIDIA Model Card](https://build.nvidia.com/nvidia/canary-1b-asr/modelcard)         |
+
+## Usage Examples
+
+### RivaSTTService
+
+```python
+from pipecat.services.riva.stt import RivaSTTService
+from pipecat.transcriptions.language import Language
+
+# Configure service
+stt = RivaSTTService(
+    api_key="your-nvidia-api-key",
+    params=RivaSTTService.InputParams(
+        language=Language.EN_US
+    )
+)
+
+# Use in pipeline
+pipeline = Pipeline([
+    transport.input(),
+    stt,
+    llm,
+    ...
+]) +``` + +### RivaSegmentedSTTService + +```python +from pipecat.services.riva.stt import RivaSegmentedSTTService +from pipecat.transcriptions.language import Language + +# Configure service +stt = RivaSegmentedSTTService( + api_key="your-nvidia-api-key", + params=RivaSegmentedSTTService.InputParams( + language=Language.EN_US + ) +) + +# Use in pipeline +pipeline = Pipeline([ + transport.input(), + stt, + llm, + ... +]) +``` + +## Language Support + +Riva model `parakeet-ctc-1.1b-asr` (default) primarily supports English with various regional accents: + +| Language Code | Description | Service Codes | +| ---------------- | ------------ | ------------- | +| `Language.EN_US` | English (US) | `en-US` | + +## Frame Flow + +```mermaid +graph TD + A[InputAudioRawFrame] --> B[RivaSTTService] + B --> C[InterimTranscriptionFrame] + B --> D[TranscriptionFrame] + B --> E[ErrorFrame] + C --> F[Real-time Processing] + D --> G[Final Processing] +``` + +## Advanced Configuration + +The service supports several advanced configuration options that can be adjusted: + + + Filter profanity from transcription + + + + Automatically add punctuation + + + + Whether to disable verbatim transcripts + + + + List of words to boost in the language model + + + + Score applied to boosted words + + +## Example with Advanced Configuration + +```python +# Configure service with advanced parameters +stt = RivaSTTService( + api_key="your-nvidia-api-key", + params=RivaSTTService.InputParams( + language=Language.EN_US + ) +) + +# Configure advanced options +stt._profanity_filter = True +stt._automatic_punctuation = True +stt._boosted_lm_words = ["PipeCat", "AI", "speech"] +``` + +## Notes + +- Uses NVIDIA's Riva AI Services platform +- Handles streaming audio input +- Provides real-time transcription results +- Manages connection lifecycle +- Uses asyncio for asynchronous processing +- Automatically cleans up resources on stop/cancel diff --git a/server/services/supported-services.mdx 
b/server/services/supported-services.mdx index 6d45937..175597f 100644 --- a/server/services/supported-services.mdx +++ b/server/services/supported-services.mdx @@ -14,19 +14,19 @@ description: "AI services integrated with Pipecat and their setup requirements" ## Speech-to-Text -| Service | Setup | -| ------------------------------------------------ | -------------------------------------- | -| [AssemblyAI](/server/services/stt/assemblyai) | `pip install "pipecat-ai[assemblyai]"` | -| [Azure](/server/services/stt/azure) | `pip install "pipecat-ai[azure]"` | -| [Deepgram](/server/services/stt/deepgram) | `pip install "pipecat-ai[deepgram]"` | -| [Fal Wizper](/server/services/stt/fal) | `pip install "pipecat-ai[fal]"` | -| [Gladia](/server/services/stt/gladia) | `pip install "pipecat-ai[gladia]"` | -| [Google](/server/services/stt/google) | `pip install "pipecat-ai[google]"` | -| [Groq (Whisper)](/server/services/stt/groq) | `pip install "pipecat-ai[groq]"` | -| [NVIDIA Parakeet](/server/services/stt/parakeet) | `pip install "pipecat-ai[riva]"` | -| [OpenAI (Whisper)](/server/services/stt/openai) | `pip install "pipecat-ai[openai]"` | -| [Ultravox](/server/services/stt/ultravox) | `pip install "pipecat-ai[ultravox]"` | -| [Whisper](/server/services/stt/whisper) | `pip install "pipecat-ai[whisper]"` | +| Service | Setup | +| ----------------------------------------------- | -------------------------------------- | +| [AssemblyAI](/server/services/stt/assemblyai) | `pip install "pipecat-ai[assemblyai]"` | +| [Azure](/server/services/stt/azure) | `pip install "pipecat-ai[azure]"` | +| [Deepgram](/server/services/stt/deepgram) | `pip install "pipecat-ai[deepgram]"` | +| [Fal Wizper](/server/services/stt/fal) | `pip install "pipecat-ai[fal]"` | +| [Gladia](/server/services/stt/gladia) | `pip install "pipecat-ai[gladia]"` | +| [Google](/server/services/stt/google) | `pip install "pipecat-ai[google]"` | +| [Groq (Whisper)](/server/services/stt/groq) | `pip install 
"pipecat-ai[groq]"` | +| [NVIDIA Riva](/server/services/stt/riva) | `pip install "pipecat-ai[riva]"` | +| [OpenAI (Whisper)](/server/services/stt/openai) | `pip install "pipecat-ai[openai]"` | +| [Ultravox](/server/services/stt/ultravox) | `pip install "pipecat-ai[ultravox]"` | +| [Whisper](/server/services/stt/whisper) | `pip install "pipecat-ai[whisper]"` | ## Large Language Models @@ -52,24 +52,24 @@ description: "AI services integrated with Pipecat and their setup requirements" ## Text-to-Speech -| Service | Setup | -| -------------------------------------------------- | -------------------------------------- | -| [Amazon Polly](/server/services/tts/aws) | `pip install "pipecat-ai[aws]"` | -| [Azure](/server/services/tts/azure) | `pip install "pipecat-ai[azure]"` | -| [Cartesia](/server/services/tts/cartesia) | `pip install "pipecat-ai[cartesia]"` | -| [Deepgram](/server/services/tts/deepgram) | `pip install "pipecat-ai[deepgram]"` | -| [ElevenLabs](/server/services/tts/elevenlabs) | `pip install "pipecat-ai[elevenlabs]"` | -| [Fish](/server/services/tts/fish) | `pip install "pipecat-ai[fish]"` | -| [Google](/server/services/tts/google) | `pip install "pipecat-ai[google]"` | -| [Groq](/server/services/tts/groq) | `pip install "pipecat-ai[groq]"` | -| [LMNT](/server/services/tts/lmnt) | `pip install "pipecat-ai[lmnt]"` | -| [Neuphonic](/server/services/tts/neuphonic) | `pip install "pipecat-ai[neuphonic]"` | -| [NVIDIA FastPitch](/server/services/tts/fastpitch) | `pip install "pipecat-ai[riva]"` | -| [OpenAI](/server/services/tts/openai) | `pip install "pipecat-ai[openai]"` | -| [Piper](/server/services/tts/piper) | No dependencies required | -| [PlayHT](/server/services/tts/playht) | `pip install "pipecat-ai[playht]"` | -| [Rime](/server/services/tts/rime) | `pip install "pipecat-ai[rime]"` | -| [XTTS](/server/services/tts/xtts) | `pip install "pipecat-ai[xtts]"` | +| Service | Setup | +| --------------------------------------------- | 
-------------------------------------- | +| [Amazon Polly](/server/services/tts/aws) | `pip install "pipecat-ai[aws]"` | +| [Azure](/server/services/tts/azure) | `pip install "pipecat-ai[azure]"` | +| [Cartesia](/server/services/tts/cartesia) | `pip install "pipecat-ai[cartesia]"` | +| [Deepgram](/server/services/tts/deepgram) | `pip install "pipecat-ai[deepgram]"` | +| [ElevenLabs](/server/services/tts/elevenlabs) | `pip install "pipecat-ai[elevenlabs]"` | +| [Fish](/server/services/tts/fish) | `pip install "pipecat-ai[fish]"` | +| [Google](/server/services/tts/google) | `pip install "pipecat-ai[google]"` | +| [Groq](/server/services/tts/groq) | `pip install "pipecat-ai[groq]"` | +| [LMNT](/server/services/tts/lmnt) | `pip install "pipecat-ai[lmnt]"` | +| [Neuphonic](/server/services/tts/neuphonic) | `pip install "pipecat-ai[neuphonic]"` | +| [NVIDIA Riva](/server/services/tts/riva) | `pip install "pipecat-ai[riva]"` | +| [OpenAI](/server/services/tts/openai) | `pip install "pipecat-ai[openai]"` | +| [Piper](/server/services/tts/piper) | No dependencies required | +| [PlayHT](/server/services/tts/playht) | `pip install "pipecat-ai[playht]"` | +| [Rime](/server/services/tts/rime) | `pip install "pipecat-ai[rime]"` | +| [XTTS](/server/services/tts/xtts) | `pip install "pipecat-ai[xtts]"` | ## Speech-to-Speech diff --git a/server/services/tts/fastpitch.mdx b/server/services/tts/riva.mdx similarity index 54% rename from server/services/tts/fastpitch.mdx rename to server/services/tts/riva.mdx index 5a409ec..4373c07 100644 --- a/server/services/tts/fastpitch.mdx +++ b/server/services/tts/riva.mdx @@ -1,15 +1,15 @@ --- -title: "NVIDIA FastPitch" -description: "Text-to-speech service implementation using NVIDIA’s FastPitch model" +title: "NVIDIA Riva" +description: "Text-to-speech service implementation using NVIDIA Riva" --- ## Overview -`FastPitchTTSService` converts text to speech using NVIDIA's Riva FastPitch TTS model. 
It provides high-quality text-to-speech synthesis with configurable voice options.
+`RivaTTSService` converts text to speech using NVIDIA Riva. It provides high-quality text-to-speech synthesis with configurable voice options, including multilingual voices.

## Installation

-To use `FastPitchTTSService`, install the required dependencies:
+To use `RivaTTSService`, install the required dependencies:

```bash
pip install "pipecat-ai[riva]"
```

@@ -29,7 +29,7 @@ You'll also need to set up your NVIDIA API key as an environment variable: `NVID
   NVIDIA Riva server address
 
-
+
   Voice identifier to use for synthesis
 
@@ -38,11 +38,14 @@ You'll also need to set up your NVIDIA API key as an environment variable: `NVID
-  NVIDIA function identifier for the TTS service
+  A mapping of the NVIDIA function identifier to the model name for the TTS service.
 
@@ -93,26 +96,37 @@ Signals the completion of audio generation.
 See the [TTS base class methods](/server/base-classes/speech#ttsservice) for additional functionality.
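+The `model_function_map` parameter pairs an NVIDIA function identifier with the model it serves. As a minimal sketch of the expected shape (reusing the `fastpitch-hifigan-tts` function ID shown in the configuration example on this page), it is a plain dictionary with `function_id` and `model_name` keys:

```python
# Sketch of the model_function_map shape accepted by RivaTTSService.
# The function_id / model_name pair is the fastpitch-hifigan-tts example
# used elsewhere on this page; substitute the values for your deployment.
model_function_map = {
    "function_id": "0149dedb-2be8-4195-b9a0-e57e0e14f972",  # NVIDIA function identifier
    "model_name": "fastpitch-hifigan-tts",                  # Riva model served by that function
}
```

Pass this dictionary as the `model_function_map` argument when constructing `RivaTTSService`.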
+## Models
+
+| Model                     | Model Card Link                                                                        |
+| ------------------------- | -------------------------------------------------------------------------------------- |
+| `magpie-tts-multilingual` | [NVIDIA Model Card](https://build.nvidia.com/nvidia/magpie-tts-multilingual/modelcard) |
+| `fastpitch-hifigan-tts`   | [NVIDIA Model Card](https://build.nvidia.com/nvidia/fastpitch-hifigan-tts/modelcard)   |
+
## Language Support

-FastPitch TTS primarily supports English with various regional accents:
+Riva model `magpie-tts-multilingual` (default) supports English, Spanish, and French:
+
+| Language Code    | Description     | Service Codes |
+| ---------------- | --------------- | ------------- |
+| `Language.EN_US` | English (US)    | `en-US`       |
+| `Language.ES_US` | Spanish (US)    | `es-US`       |
+| `Language.FR_FR` | French (France) | `fr-FR`       |

-| Language Code | Description | Service Codes |
-| ---------------- | ------------ | ------------- |
-| `Language.EN_US` | English (US) | `en-US` |
+## Usage Examples

-## Usage Example
+### TTS Language and Voice Configuration

```python
-from pipecat.services.riva.tts import FastPitchTTSService
+from pipecat.services.riva.tts import RivaTTSService
from pipecat.transcriptions.language import Language

# Configure service
-tts = FastPitchTTSService(
+tts = RivaTTSService(
    api_key="your-nvidia-api-key",
-    voice_id="English-US.Female-1",
-    params=FastPitchTTSService.InputParams(
-        language=Language.EN_US,
+    voice_id="Magpie-Multilingual.FR-FR.Louise",
+    params=RivaTTSService.InputParams(
+        language=Language.FR_FR,
        quality=20
    )
)
@@ -126,11 +140,33 @@ pipeline = Pipeline([
])
```

+### Model, Function ID, and Voice Configuration
+
+```python
+# Configure TTS with a specific model and function ID
+tts = RivaTTSService(
+    api_key="your-nvidia-api-key",
+    voice_id="English-US.Female-1",
+    model_function_map={
+        "function_id": "0149dedb-2be8-4195-b9a0-e57e0e14f972",
+        "model_name": "fastpitch-hifigan-tts",
+    }
+)
+
+# Use in pipeline
+pipeline = Pipeline([
+    ...,
+    llm,
+    
tts, + transport.output(), +]) +``` + ## Frame Flow ```mermaid graph TD - A[TextFrame] --> B[FastPitchTTSService] + A[TextFrame] --> B[RivaTTSService] B --> C[TTSStartedFrame] B --> D[TTSAudioRawFrame] B --> E[TTSStoppedFrame]