Changes from 20 of 22 commits:
- `9ac86ea` LLM Chain: Add foundation for chain execution with database schema (vprashrex, Feb 20, 2026)
- `6451bb0` LLM Chain: Add documentation and update endpoint description for chai… (vprashrex, Feb 21, 2026)
- `c9f94e2` LLM Chain: Move guardrails into execute_llm_call for per-block suppor… (vprashrex, Feb 21, 2026)
- `fb18356` Merge branch 'main' into feature/llm-chain-setup (vprashrex, Feb 27, 2026)
- `baaac95` prettify format (vprashrex, Mar 1, 2026)
- `5177bfb` refactor: update STTLLMParams to allow optional instructions and impr… (vprashrex, Mar 1, 2026)
- `2fb81b1` feat: add metadata to BlockResult and update job execution to use res… (vprashrex, Mar 1, 2026)
- `113488a` feat: add tests for LLM chain execution and job handling (vprashrex, Mar 2, 2026)
- `a62c433` Merge branch 'main' into feature/llm-chain-setup (vprashrex, Mar 2, 2026)
- `6421465` fix: correct variable name from job_id to job_uuid in execute_job fun… (vprashrex, Mar 2, 2026)
- `50acc8c` Merge branch 'main' into feature/llm-chain-setup (vprashrex, Mar 5, 2026)
- `19d6f58` refactor: streamline LLM chain execution and enhance callback handling (vprashrex, Mar 5, 2026)
- `e04c374` Merge branch 'main' into feature/llm-chain-setup (vprashrex, Mar 5, 2026)
- `9cc5cf8` docs: enhance llm_chain.md with detailed input specifications and gua… (vprashrex, Mar 6, 2026)
- `f7797d1` refactor: remove unused timestamps from LlmChain model and update rel… (vprashrex, Mar 6, 2026)
- `4624f55` Merge branch 'main' into feature/llm-chain-setup (vprashrex, Mar 6, 2026)
- `5b9a4e9` feat: basic speech-to-speech impl on top of llm_chain (Prajna1999, Mar 5, 2026)
- `c1807df` feat: add s2s blocks (Prajna1999, Mar 5, 2026)
- `e7de797` Merge branch 'main' into feature/speech-to-speech (Prajna1999, Mar 6, 2026)
- `9eeb999` Merge branch 'main' into feature/speech-to-speech (Prajna1999, Mar 9, 2026)
- `56920b1` feat: detected lang in the webhook rsponse, context passing across li… (Prajna1999, Mar 9, 2026)
- `96ea78e` chore: docs (Prajna1999, Mar 9, 2026)
228 changes: 228 additions & 0 deletions backend/app/api/docs/llm/speech_to_speech.md
@@ -0,0 +1,228 @@
# Speech-to-Speech (STS) with RAG

Execute a complete speech-to-speech workflow with knowledge base retrieval.

## Endpoint

```
POST /llm/sts
```

## Flow

```
Voice Input → STT (auto language) → RAG (Knowledge Base) → TTS → Voice Output
```
Comment on lines +7 to +15

⚠️ Potential issue | 🟡 Minor

Add language identifiers to fenced code blocks.

The fenced code blocks at lines 7-9 and 13-15 are missing language specifiers per MD040. Use http for the endpoint block and text for the flow diagram.

📝 Proposed fix

````diff
 ## Endpoint
 
-```
+```http
 POST /llm/sts
 ```
 
 ## Flow
 
-```
+```text
 Voice Input → STT (auto language) → RAG (Knowledge Base) → TTS → Voice Output
 ```
````
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 7-7: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 13-13: Fenced code blocks should have a language specified

(MD040, fenced-code-language)



## Input

- **Voice note**: WhatsApp-compatible audio format (required)
- **Knowledge base IDs**: One or more knowledge bases for RAG (required)
- **Languages**: Input and output languages (optional, defaults to Hindi)
- **Models**: STT, LLM, and TTS model selection (optional, defaults to Sarvam)

## Output

You will receive **3 callbacks** to your webhook URL:

1. **STT Callback** (Intermediate): Transcribed text from audio
2. **LLM Callback** (Intermediate): RAG-enhanced response text
3. **TTS Callback** (Final): Audio output + response text

Each callback includes:
- Output from that step
- Token usage
- Latency information (check timestamps)
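
Since all three callbacks arrive at the same webhook URL, a receiver has to tell them apart. A minimal dispatcher sketch — the field names (`block_index`, `output.type`) follow the example payloads later in this document, but the function itself is illustrative, not part of the API:

```python
# Hypothetical webhook dispatcher for the three STS callbacks.
def handle_sts_callback(payload: dict) -> str:
    """Classify an STS callback by which chain step produced it."""
    data = payload.get("data", {})
    output_type = data.get("response", {}).get("output", {}).get("type")

    if output_type == "audio":
        return "tts_final"          # Callback 3: final audio + text
    if data.get("block_index") == 1:
        return "stt_intermediate"   # Callback 1: transcription
    return "llm_intermediate"       # Callback 2: RAG response
```

In a real handler you would branch on the returned label to store the transcript, the RAG text, or the final audio.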

## Supported Languages

### Primary Indian Languages
- English, Hindi, Hinglish (code-switching)
- Bengali, Kannada, Malayalam, Marathi
- Odia, Punjabi, Tamil, Telugu, Gujarati

### Additional Languages (Sarvam Saaras V3)
- Assamese, Urdu, Nepali
- Konkani, Kashmiri, Sindhi
- Sanskrit, Santali, Manipuri
- Bodo, Maithili, Dogri

**Total: 25 languages** with automatic language detection

## Available Models

### STT (Speech-to-Text)
- `saaras:v3` - Sarvam Saaras V3 (**default**, fast, auto language detection, optimized for Indian languages)
- `gemini-2.5-pro` - Google Gemini 2.5 Pro

**Note:** Sarvam STT uses automatic language detection. No need to specify input language.

### LLM (RAG)
**Collaborator:** we support more than just these two models for RAG

- `gpt-4o` - OpenAI GPT-4o (**default**, best quality)
- `gpt-4o-mini` - OpenAI GPT-4o Mini (faster, lower cost)

### TTS (Text-to-Speech)
- `bulbul:v3` - Sarvam Bulbul V3 (**default**, natural Indian voices, MP3 output)
- `gemini-2.5-pro-preview-tts` - Google Gemini 2.5 Pro (OGG OPUS output)

## Edge Cases & Error Handling

### Empty STT Output
If speech-to-text returns empty/blank:
- Chain fails immediately
- Error message: "STT returned no transcription"
- No subsequent blocks are executed

### Audio Size Limit
WhatsApp limit: 16MB
- TTS providers may fail if output exceeds limit
- Error is caught and reported in callback
- Consider using shorter responses or compression

### Invalid Audio Format
If input audio format is unsupported:
- STT provider fails with format error
- Error reported in callback
- Supported: MP3, WAV, OGG, OPUS, M4A

### Provider Failures
Each block has independent error handling:
- STT fails → Chain stops, STT error reported
- LLM fails → Chain stops, RAG error reported
- TTS fails → Chain stops, TTS error reported
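
Because any block failure aborts the whole chain, the practical client-side recovery is to resubmit the full request. A backoff sketch (which errors are actually transient is an assumption; inspect the error reported in the callback before retrying):

```python
import random
import time

def with_retries(call, max_attempts: int = 3, base_delay: float = 1.0):
    """Retry a callable with exponential backoff and jitter.

    Illustrative pattern: `call` would resubmit the whole /llm/sts request,
    since a failed block aborts the chain server-side.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise
            # 1s, 2s, 4s, ... scaled by +/-50% jitter to avoid thundering herds
            time.sleep(base_delay * 2 ** (attempt - 1) * random.uniform(0.5, 1.5))
```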

## Example Request

```bash
curl -X POST https://api.kaapi.ai/llm/sts \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "query": {
    "type": "audio",
    "content": {
      "format": "base64",
      "value": "base64_encoded_audio_data",
      "mime_type": "audio/ogg"
    }
  },
  "knowledge_base_ids": ["kb_abc123"],
  "input_language": "hindi",
  "output_language": "english",
  "callback_url": "https://your-app.com/webhook"
}
EOF
```

**Note:** `stt_model`, `llm_model`, and `tts_model` are optional and will use defaults if not specified.
**Collaborator:** add "specifying" before `stt_model`, etc.


## Example Callbacks

### Callback 1: STT Output (Intermediate)
```json
{
  "success": true,
  "data": {
    "block_index": 1,
    "total_blocks": 3,
    "response": {
      "provider_response_id": "stt_xyz789",
      "provider": "sarvamai-native",
      "model": "saarika:v1",
      "output": {
        "type": "text",
        "content": {
          "value": "नमस्ते, मुझे अपने अकाउंट के बारे में जानकारी चाहिए"
        }
      }
    },
    "usage": {
      "input_tokens": 0,
      "output_tokens": 12,
      "total_tokens": 12
    }
  },
  "metadata": {
    "speech_to_speech": true,
    "input_language": "hi-IN"
  }
}
```
Comment on lines +122 to +151

⚠️ Potential issue | 🟡 Minor

STT callback example shows incorrect model name.

Line 132 shows "model": "saarika:v1" but the documented default STT model is "saaras:v3" (line 55). Consider updating the example to use the correct model name for consistency.

📝 Proposed fix

```diff
     "response": {
       "provider_response_id": "stt_xyz789",
       "provider": "sarvamai-native",
-      "model": "saarika:v1",
+      "model": "saaras:v3",
       "output": {
```

### Callback 2: LLM Output (Intermediate)
```json
{
  "success": true,
  "data": {
    "block_index": 2,
    "total_blocks": 3,
    "response": {
      "provider_response_id": "chatcmpl_abc123",
      "provider": "openai",
      "model": "gpt-4o",
      "output": {
        "type": "text",
        "content": {
          "value": "आपके अकाउंट में कुल बैलेंस ₹5,000 है। पिछले महीने में 3 ट्रांजैक्शन हुए हैं।"
        }
      }
    },
    "usage": {
      "input_tokens": 150,
      "output_tokens": 45,
      "total_tokens": 195
    }
  },
  "metadata": {
    "speech_to_speech": true
  }
}
```

### Callback 3: TTS Output (Final)
```json
{
  "success": true,
  "data": {
    "response": {
      "provider_response_id": "tts_def456",
      "provider": "sarvamai-native",
      "model": "bulbul:v1",
      "output": {
        "type": "audio",
        "content": {
          "format": "base64",
          "value": "base64_encoded_audio_output",
          "mime_type": "audio/ogg"
        }
      }
    },
    "usage": {
      "input_tokens": 15,
      "output_tokens": 0,
      "total_tokens": 15
    }
  },
  "metadata": {
    "speech_to_speech": true,
    "output_language": "hi-IN"
  }
}
```
Comment on lines +183 to +212

⚠️ Potential issue | 🟡 Minor

TTS callback example shows incorrect model version.

Line 191 shows "model": "bulbul:v1" but the documented default TTS model is "bulbul:v3" (line 65). Update the example for consistency.

📝 Proposed fix

```diff
     "response": {
       "provider_response_id": "tts_def456",
       "provider": "sarvamai-native",
-      "model": "bulbul:v1",
+      "model": "bulbul:v3",
       "output": {
```


## Latency Tracking

Calculate latency from callback timestamps:
- **STT latency**: Time from request to first callback
- **LLM latency**: Time between first and second callback
- **TTS latency**: Time between second and third callback
- **Total latency**: Time from request to final callback
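
Since latencies are derived rather than reported directly, one simple approach is to record the arrival time of each callback locally and take differences. A sketch (epoch-second floats; the three arrival times are in chain order):

```python
def compute_latencies(request_sent: float, callback_times: list[float]) -> dict:
    """Derive per-stage latencies from locally recorded arrival times.

    callback_times holds the arrival times of the STT, LLM, and TTS
    callbacks, in order.
    """
    stt, llm, tts = callback_times
    return {
        "stt_latency": stt - request_sent,
        "llm_latency": llm - stt,
        "tts_latency": tts - llm,
        "total_latency": tts - request_sent,
    }
```

Note these figures include network and queueing time on top of provider inference time.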

## Best Practices

1. **Language Consistency**: If not translating, keep input_language = output_language
2. **Model Selection**: Use Sarvam models for Indian languages (faster, better quality)
3. **Knowledge Base**: Ensure KB is properly indexed and relevant to expected queries
4. **Error Handling**: Implement retry logic for transient provider failures
5. **Webhook Security**: Validate webhook signatures and use HTTPS
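
For practice 5, a typical scheme is an HMAC-SHA256 signature over the raw request body, compared in constant time. The signing scheme and hex encoding here are assumptions, not documented platform behavior — use whatever the webhook sender actually specifies:

```python
import hashlib
import hmac

def verify_webhook_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Constant-time check of an HMAC-SHA256 signature over the raw body."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the match position via timing
    return hmac.compare_digest(expected, signature_hex)
```

Verify against the raw bytes as received, before any JSON parsing or re-serialization, or the digests will not match.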
2 changes: 2 additions & 0 deletions backend/app/api/main.py
@@ -11,6 +11,7 @@
     languages,
     llm,
     llm_chain,
+    llm_speech,
     organization,
     openai_conversation,
     project,
@@ -43,6 +44,7 @@
 api_router.include_router(languages.router)
 api_router.include_router(llm.router)
 api_router.include_router(llm_chain.router)
+api_router.include_router(llm_speech.router)
 api_router.include_router(login.router)
 api_router.include_router(onboarding.router)
 api_router.include_router(openai_conversation.router)
141 changes: 141 additions & 0 deletions backend/app/api/routes/llm_speech.py
@@ -0,0 +1,141 @@
"""Speech-to-Speech (STS) API endpoint with RAG."""

import logging

from fastapi import APIRouter, Depends

from app.api.deps import AuthContextDep, SessionDep
from app.api.permissions import Permission, require_permission
from app.models import Message
from app.models.llm.request import (
    LLMChainRequest,
    QueryParams,
    SpeechToSpeechRequest,
)
from app.services.llm.chain.utils import (
    LANGUAGE_CODES,
    build_rag_block,
    build_stt_block,
    build_tts_block,
    get_language_code,
)
from app.services.llm.jobs import start_chain_job
from app.utils import APIResponse, load_description, validate_callback_url

logger = logging.getLogger(__name__)

router = APIRouter(tags=["LLM"])


@router.post(
    "/llm/sts",
    description=load_description("llm/speech_to_speech.md"),
    response_model=APIResponse[Message],
    dependencies=[Depends(require_permission(Permission.REQUIRE_PROJECT))],
)
def speech_to_speech(
    _current_user: AuthContextDep,
    _session: SessionDep,
    request: SpeechToSpeechRequest,
):
    """
    Speech-to-speech (STS) endpoint with RAG.

    Executes a 3-block chain:
    1. STT (Speech-to-Text) - Transcribes audio to text (auto-detects language for Sarvam)
    2. RAG (Retrieval-Augmented Generation) - Processes text with knowledge base
    3. TTS (Text-to-Speech) - Converts response back to audio

    Input: Voice note (WhatsApp compatible)
    Output: Voice note + text (via callback)

    Edge cases:
    - Empty STT output: Chain fails with clear error
    - Audio > 16MB: TTS provider will fail (caught and reported)
    - Invalid audio format: STT provider will fail (caught and reported)
    """
    project_id = _current_user.project_.id
    organization_id = _current_user.organization_.id

    # Validate callback URL
    if request.callback_url:
        validate_callback_url(str(request.callback_url))

    # Validate and determine languages
    if request.input_language and request.input_language != "auto":
        if request.input_language not in LANGUAGE_CODES:
            from fastapi import HTTPException

            raise HTTPException(
                status_code=400,
                detail=f"Unsupported input language: {request.input_language}. Supported: {', '.join(LANGUAGE_CODES.keys())}",
            )

    if request.output_language and request.output_language not in LANGUAGE_CODES:
        from fastapi import HTTPException

        raise HTTPException(
            status_code=400,
            detail=f"Unsupported output language: {request.output_language}. Supported: {', '.join(LANGUAGE_CODES.keys())}",
        )

🛠️ Refactor suggestion | 🟠 Major

Move HTTPException import to module level.

HTTPException is imported twice inside the function body (lines 67 and 75). Move this import to the top of the file with other FastAPI imports for better performance and maintainability.

📝 Proposed fix

```diff
-from fastapi import APIRouter, Depends
+from fastapi import APIRouter, Depends, HTTPException
```

Then remove the inline imports at lines 67-68 and 75-76:

```diff
     if request.input_language and request.input_language != "auto":
         if request.input_language not in LANGUAGE_CODES:
-            from fastapi import HTTPException
-
             raise HTTPException(
                 status_code=400,
                 detail=f"Unsupported input language: {request.input_language}. Supported: {', '.join(LANGUAGE_CODES.keys())}",
             )

     if request.output_language and request.output_language not in LANGUAGE_CODES:
-        from fastapi import HTTPException
-
         raise HTTPException(
             status_code=400,
             detail=f"Unsupported output language: {request.output_language}. Supported: {', '.join(LANGUAGE_CODES.keys())}",
         )
```


    input_lang_code = get_language_code(request.input_language)
    output_lang_code = get_language_code(
        request.output_language, default=request.input_language or "auto"
    )

    logger.info(
        f"[speech_to_speech] Starting STS chain | "
        f"project_id={project_id}, "
        f"input_lang={input_lang_code}, "
        f"output_lang={output_lang_code}, "
        f"stt_model={request.stt_model.value}, "
        f"llm_model={request.llm_model.value}, "
        f"tts_model={request.tts_model.value}"
    )

    # Build 3-block chain: STT → RAG → TTS
    blocks = [
        build_stt_block(request.stt_model, input_lang_code),
        build_rag_block(request.llm_model, request.knowledge_base_ids),
        build_tts_block(request.tts_model, output_lang_code),
    ]

    # Add metadata to track STS-specific info
    metadata = request.request_metadata or {}
    metadata.update(
        {
            "speech_to_speech": True,
            "input_language": input_lang_code,
            "output_language": output_lang_code,
            "stt_model": request.stt_model.value,
            "llm_model": request.llm_model.value,
            "tts_model": request.tts_model.value,
        }
    )

⚠️ Potential issue | 🟡 Minor

Avoid mutating the request's metadata dictionary in-place.

Line 105 assigns request.request_metadata directly to metadata, then line 106 mutates it with .update(). If request.request_metadata is not None, this modifies the original dictionary in-place, which could cause unintended side effects.

📝 Proposed fix

```diff
     # Add metadata to track STS-specific info
-    metadata = request.request_metadata or {}
+    metadata = dict(request.request_metadata) if request.request_metadata else {}
     metadata.update(
```
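
The aliasing this comment describes can be reproduced in isolation (hypothetical dicts standing in for `request.request_metadata`):

```python
# `x or {}` binds the caller's dict when it is truthy, so update() mutates it.
caller_metadata = {"trace_id": "abc"}

aliased = caller_metadata or {}               # no copy made
aliased.update({"speech_to_speech": True})
assert "speech_to_speech" in caller_metadata  # side effect on the caller's dict

caller_metadata = {"trace_id": "abc"}         # reset
copied = dict(caller_metadata)                # shallow copy, as the fix suggests
copied.update({"speech_to_speech": True})
assert "speech_to_speech" not in caller_metadata
```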


    # Create chain request
    chain_request = LLMChainRequest(
        query=QueryParams(input=request.query),
        blocks=blocks,
        callback_url=request.callback_url,
        request_metadata=metadata,
    )

    # Start async chain job
    start_chain_job(
        db=_session,
        request=chain_request,
        project_id=project_id,
        organization_id=organization_id,
    )

    return APIResponse.success_response(
        data=Message(
            message=(
                "Speech-to-speech processing initiated. "
                "You will receive intermediate callbacks for STT and LLM outputs, "
                "followed by the final callback with audio and text."
            )
        )
    )