Agentic Voice RAG β ask questions about any web page using your voice. Powered by Microsoft VibeVoice-ASR for speech recognition, Qdrant as the vector database, and quantized HuggingFace LLMs for answer generation β no OpenAI API key required.
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β USER INTERFACE β
β Streamlit (app.py) β
βββββββββββββββββ¬βββββββββββββββββββββββββ¬βββββββββββββββββββββ
β β
Voice Input π€ URL Ingestion π₯
β β
βΌ βΌ
βββββββββββββββββββββββ ββββββββββββββββββββββββββββ
β Microsoft β β Web Scraper β
β VibeVoice-ASR β β (requests + BeautifulSoupβ
β (ASR / STT) β β + lxml) β
ββββββββββ¬βββββββββββββ ββββββββββββββ¬ββββββββββββββ
β transcript β raw text
β βΌ
β ββββββββββββββββββββββββββββ
β β Text Chunker β
β β RecursiveCharacterText β
β β Splitter (512 tok, β
β β 64 tok overlap) β
β ββββββββββββββ¬ββββββββββββββ
β β chunks
β βΌ
β ββββββββββββββββββββββββββββ
β β Embeddings β
β β all-MiniLM-L6-v2 β
β β (sentence-transformers) β
β ββββββββββββββ¬ββββββββββββββ
β β 384-dim vectors
β βΌ
β ββββββββββββββββββββββββββββ
β β Qdrant Vector DB β
ββββ query βββββββΊβ (in-memory or server) β
β embedding β cosine similarity search β
β ββββββββββββββ¬ββββββββββββββ
β β top-K chunks
βΌ βΌ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β HuggingFace Quantized LLM β
β Default: Qwen/Qwen2.5-1.5B-Instruct β
β Quantization: none | 4bit (NF4) | 8bit (LLM.int8()) β
βββββββββββββββββββββββββββββ¬βββββββββββββββββββββββββββ
β answer text
βΌ
βββββββββββββββββββββββββ
β gTTS β
β Text-to-Speech π β
βββββββββββββββββββββββββ
| Component | Technology |
|---|---|
| Speech-to-Text | Microsoft VibeVoice-ASR (7 B, via π€ Transformers β₯ 5.3.0) |
| Vector Database | Qdrant (in-memory or server mode) |
| Embeddings | all-MiniLM-L6-v2 via sentence-transformers (384-dim) |
| Text Chunking | LangChain RecursiveCharacterTextSplitter (512 tokens, 64 overlap) |
| LLM | HuggingFace causal LM β default Qwen/Qwen2.5-1.5B-Instruct (no API key required) |
| Quantization | BitsAndBytes NF4 (4-bit) or LLM.int8() (8-bit) β CUDA only |
| Text-to-Speech | gTTS |
| UI | Streamlit |
- π₯ URL ingestion β scrape any public web page, chunk the text, embed it, and store it in Qdrant
- π€ Voice input β record questions directly in the browser; VibeVoice-ASR handles up to 60 minutes of audio in a single pass
- π Semantic search β top-K cosine similarity retrieval from Qdrant
- π€ Local RAG answers β quantized HuggingFace LLM generates grounded answers from retrieved context β no external API needed
- π Voice output β answers are read back using gTTS
- β¨οΈ Text fallback β type questions when a microphone is unavailable
- π Conversation history β full Q&A history displayed in the chat tab
git clone https://github.com/inamdarmihir/agentvoicerag.git
cd agentvoicerag
pip install -r requirements.txtNote: VibeVoice-ASR is a 7 B parameter model. A CUDA-capable GPU is strongly recommended. On CPU it will still work but inference will be slow. Set
VIBEVOICE_ASR_MODEL=openai/whisper-basein.envfor a lightweight CPU-friendly alternative.
cp .env.example .env
# Edit .env β all settings have sensible defaults; no API key is requireddocker compose up -dThe UI also ships an in-memory Qdrant mode that requires no extra setup.
streamlit run app.pyOpen http://localhost:8501 in your browser.
- Click the π₯ Ingest URLs tab.
- Paste one or more URLs (one per line) into the text area.
- Click π Ingest.
The app scrapes each page, splits the text into 512-token chunks with 64-token
overlap, generates embeddings with all-MiniLM-L6-v2, and upserts them into Qdrant.
- Click the ποΈ Voice Chat tab.
- Click the microphone icon and record your question.
- Click π Transcribe & Answer.
The pipeline will:
- Transcribe your audio with VibeVoice-ASR.
- Embed the transcript and retrieve the top-K most relevant chunks from Qdrant.
- Feed the context + question to the HuggingFace LLM and return an answer.
- Read the answer aloud using gTTS.
Use the β¨οΈ Type a question section at the bottom of the Voice Chat tab as a text-only fallback.
All settings can also be changed at runtime from the βοΈ Settings sidebar.
| Environment Variable | Default | Description |
|---|---|---|
HF_LLM_MODEL |
Qwen/Qwen2.5-1.5B-Instruct |
HuggingFace model repo ID for answer generation |
HF_LLM_QUANTIZATION |
none |
Quantization level: none, 4bit, or 8bit (CUDA only) |
HF_TOKEN |
(empty) | HuggingFace access token β required only for gated models (Llama, Gemma, β¦) |
QDRANT_URL |
http://localhost:6333 |
Qdrant server URL (Server mode only) |
QDRANT_API_KEY |
(empty) | Qdrant API key for Qdrant Cloud |
VIBEVOICE_ASR_MODEL |
microsoft/VibeVoice-ASR |
HuggingFace model ID for speech recognition |
agentvoicerag/
βββ app.py # Streamlit UI & pipeline orchestration
βββ src/
β βββ __init__.py
β βββ ingestion.py # URL scraping, chunking, embedding, Qdrant upsert
β βββ retrieval.py # Qdrant cosine-similarity search & context builder
β βββ voice.py # VibeVoice-ASR wrapper (STT) + gTTS (TTS)
β βββ llm.py # HFQuantizedLLM β BitsAndBytes 4-bit/8-bit inference
βββ docker-compose.yml # Qdrant server
βββ requirements.txt
βββ .env.example
βββ README.md
The LLM layer (src/llm.py) loads any HuggingFace causal language model via the
transformers pipeline. No OpenAI API key is required.
| Model | Size | Notes |
|---|---|---|
Qwen/Qwen2.5-0.5B-Instruct |
0.5 B | Runs on any CPU |
Qwen/Qwen2.5-1.5B-Instruct |
1.5 B | Default β fast on CPU |
TinyLlama/TinyLlama-1.1B-Chat-v1.0 |
1.1 B | Very small footprint |
microsoft/Phi-3-mini-4k-instruct |
3.8 B | Strong quality |
google/gemma-2-2b-it |
2.0 B | Google Gemma 2 |
mistralai/Mistral-7B-Instruct-v0.3 |
7.0 B | GPU recommended |
| Model | Size |
|---|---|
meta-llama/Llama-3.2-3B-Instruct |
3.0 B |
meta-llama/Meta-Llama-3-8B-Instruct |
8.0 B |
| Mode | Requirement | Description |
|---|---|---|
none |
CPU / MPS / CUDA | Full precision (fp32 on CPU/MPS, fp16 on CUDA) |
4bit |
CUDA + bitsandbytes | NF4 double-quantization β ~4Γ memory reduction |
8bit |
CUDA + bitsandbytes | LLM.int8() β ~2Γ memory reduction |
VibeVoice-ASR is a 7 B parameter unified speech-to-text model from Microsoft Research that:
- Processes up to 60 minutes of audio in a single pass
- Performs speaker diarization (Who), timestamping (When), and transcription (What) jointly
- Supports 50+ languages and code-switching
- Supports custom hotwords for domain-specific accuracy
It is available natively in π€ Transformers β₯ 5.3.0:
from transformers import pipeline
pipe = pipeline("automatic-speech-recognition", model="microsoft/VibeVoice-ASR")
result = pipe("path/to/audio.wav")
print(result["text"])Lightweight alternative: set
VIBEVOICE_ASR_MODEL=openai/whisper-basefor CPU-friendly transcription at the cost of some accuracy.
Qdrant is a high-performance vector similarity search engine. This project supports two modes:
| Mode | Setup | Use case |
|---|---|---|
| In-Memory | No setup required | Quick demos, development |
| Server | docker compose up -d |
Persistent storage, production |
MIT License β see LICENSE for details.
VibeVoice-ASR is Β© Microsoft and licensed under the MIT License. Please review Microsoft's responsible AI guidelines before deploying in production.