Skip to content

inamdarmihir/agentvoicerag

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

6 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

πŸŽ™οΈ AgentVoiceRAG

Agentic Voice RAG – ask questions about any web page using your voice. Powered by Microsoft VibeVoice-ASR for speech recognition, Qdrant as the vector database, and quantized HuggingFace LLMs for answer generation β€” no OpenAI API key required.


Architecture

 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
 β”‚                        USER INTERFACE                        β”‚
 β”‚                     Streamlit (app.py)                       β”‚
 β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                 β”‚                        β”‚
         Voice Input 🎀           URL Ingestion πŸ“₯
                 β”‚                        β”‚
                 β–Ό                        β–Ό
   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
   β”‚  Microsoft           β”‚    β”‚  Web Scraper             β”‚
   β”‚  VibeVoice-ASR       β”‚    β”‚  (requests + BeautifulSoupβ”‚
   β”‚  (ASR / STT)         β”‚    β”‚   + lxml)                β”‚
   β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
            β”‚ transcript                   β”‚ raw text
            β”‚                             β–Ό
            β”‚                 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
            β”‚                 β”‚  Text Chunker             β”‚
            β”‚                 β”‚  RecursiveCharacterText   β”‚
            β”‚                 β”‚  Splitter (512 tok,       β”‚
            β”‚                 β”‚  64 tok overlap)          β”‚
            β”‚                 β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
            β”‚                             β”‚ chunks
            β”‚                             β–Ό
            β”‚                 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
            β”‚                 β”‚  Embeddings               β”‚
            β”‚                 β”‚  all-MiniLM-L6-v2         β”‚
            β”‚                 β”‚  (sentence-transformers)  β”‚
            β”‚                 β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
            β”‚                             β”‚ 384-dim vectors
            β”‚                             β–Ό
            β”‚                 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
            β”‚                 β”‚  Qdrant Vector DB         β”‚
            │◄── query ──────►│  (in-memory or server)   β”‚
            β”‚   embedding     β”‚  cosine similarity search β”‚
            β”‚                 β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
            β”‚                             β”‚ top-K chunks
            β–Ό                             β–Ό
   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
   β”‚          HuggingFace Quantized LLM                    β”‚
   β”‚  Default: Qwen/Qwen2.5-1.5B-Instruct                 β”‚
   β”‚  Quantization: none | 4bit (NF4) | 8bit (LLM.int8()) β”‚
   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                               β”‚ answer text
                               β–Ό
                   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                   β”‚  gTTS                 β”‚
                   β”‚  Text-to-Speech πŸ”Š    β”‚
                   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Component Summary

Component Technology
Speech-to-Text Microsoft VibeVoice-ASR (7 B, via πŸ€— Transformers β‰₯ 5.3.0)
Vector Database Qdrant (in-memory or server mode)
Embeddings all-MiniLM-L6-v2 via sentence-transformers (384-dim)
Text Chunking LangChain RecursiveCharacterTextSplitter (512 tokens, 64 overlap)
LLM HuggingFace causal LM β€” default Qwen/Qwen2.5-1.5B-Instruct (no API key required)
Quantization BitsAndBytes NF4 (4-bit) or LLM.int8() (8-bit) β€” CUDA only
Text-to-Speech gTTS
UI Streamlit

Features

  • πŸ“₯ URL ingestion – scrape any public web page, chunk the text, embed it, and store it in Qdrant
  • 🎀 Voice input – record questions directly in the browser; VibeVoice-ASR handles up to 60 minutes of audio in a single pass
  • πŸ” Semantic search – top-K cosine similarity retrieval from Qdrant
  • πŸ€– Local RAG answers – quantized HuggingFace LLM generates grounded answers from retrieved context β€” no external API needed
  • πŸ”Š Voice output – answers are read back using gTTS
  • ⌨️ Text fallback – type questions when a microphone is unavailable
  • πŸ“œ Conversation history – full Q&A history displayed in the chat tab

Quick Start

1. Clone & install

git clone https://github.com/inamdarmihir/agentvoicerag.git
cd agentvoicerag
pip install -r requirements.txt

Note: VibeVoice-ASR is a 7 B parameter model. A CUDA-capable GPU is strongly recommended. On CPU it will still work but inference will be slow. Set VIBEVOICE_ASR_MODEL=openai/whisper-base in .env for a lightweight CPU-friendly alternative.

2. Configure environment variables

cp .env.example .env
# Edit .env β€” all settings have sensible defaults; no API key is required

3. (Optional) Start a Qdrant server with Docker

docker compose up -d

The UI also ships an in-memory Qdrant mode that requires no extra setup.

4. Run the app

streamlit run app.py

Open http://localhost:8501 in your browser.


Usage

Ingest URLs

  1. Click the πŸ“₯ Ingest URLs tab.
  2. Paste one or more URLs (one per line) into the text area.
  3. Click πŸš€ Ingest.

The app scrapes each page, splits the text into 512-token chunks with 64-token overlap, generates embeddings with all-MiniLM-L6-v2, and upserts them into Qdrant.

Ask questions with your voice

  1. Click the πŸŽ™οΈ Voice Chat tab.
  2. Click the microphone icon and record your question.
  3. Click πŸ” Transcribe & Answer.

The pipeline will:

  1. Transcribe your audio with VibeVoice-ASR.
  2. Embed the transcript and retrieve the top-K most relevant chunks from Qdrant.
  3. Feed the context + question to the HuggingFace LLM and return an answer.
  4. Read the answer aloud using gTTS.

Ask questions by typing

Use the ⌨️ Type a question section at the bottom of the Voice Chat tab as a text-only fallback.


Configuration

All settings can also be changed at runtime from the βš™οΈ Settings sidebar.

Environment Variable Default Description
HF_LLM_MODEL Qwen/Qwen2.5-1.5B-Instruct HuggingFace model repo ID for answer generation
HF_LLM_QUANTIZATION none Quantization level: none, 4bit, or 8bit (CUDA only)
HF_TOKEN (empty) HuggingFace access token β€” required only for gated models (Llama, Gemma, …)
QDRANT_URL http://localhost:6333 Qdrant server URL (Server mode only)
QDRANT_API_KEY (empty) Qdrant API key for Qdrant Cloud
VIBEVOICE_ASR_MODEL microsoft/VibeVoice-ASR HuggingFace model ID for speech recognition

Project Structure

agentvoicerag/
β”œβ”€β”€ app.py                  # Streamlit UI & pipeline orchestration
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ ingestion.py        # URL scraping, chunking, embedding, Qdrant upsert
β”‚   β”œβ”€β”€ retrieval.py        # Qdrant cosine-similarity search & context builder
β”‚   β”œβ”€β”€ voice.py            # VibeVoice-ASR wrapper (STT) + gTTS (TTS)
β”‚   └── llm.py              # HFQuantizedLLM β€” BitsAndBytes 4-bit/8-bit inference
β”œβ”€β”€ docker-compose.yml      # Qdrant server
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ .env.example
└── README.md

HuggingFace LLM

The LLM layer (src/llm.py) loads any HuggingFace causal language model via the transformers pipeline. No OpenAI API key is required.

Supported free models (no token needed)

Model Size Notes
Qwen/Qwen2.5-0.5B-Instruct 0.5 B Runs on any CPU
Qwen/Qwen2.5-1.5B-Instruct 1.5 B Default β€” fast on CPU
TinyLlama/TinyLlama-1.1B-Chat-v1.0 1.1 B Very small footprint
microsoft/Phi-3-mini-4k-instruct 3.8 B Strong quality
google/gemma-2-2b-it 2.0 B Google Gemma 2
mistralai/Mistral-7B-Instruct-v0.3 7.0 B GPU recommended

Gated models (HF token required)

Model Size
meta-llama/Llama-3.2-3B-Instruct 3.0 B
meta-llama/Meta-Llama-3-8B-Instruct 8.0 B

Quantization

Mode Requirement Description
none CPU / MPS / CUDA Full precision (fp32 on CPU/MPS, fp16 on CUDA)
4bit CUDA + bitsandbytes NF4 double-quantization β€” ~4Γ— memory reduction
8bit CUDA + bitsandbytes LLM.int8() β€” ~2Γ— memory reduction

VibeVoice-ASR

VibeVoice-ASR is a 7 B parameter unified speech-to-text model from Microsoft Research that:

  • Processes up to 60 minutes of audio in a single pass
  • Performs speaker diarization (Who), timestamping (When), and transcription (What) jointly
  • Supports 50+ languages and code-switching
  • Supports custom hotwords for domain-specific accuracy

It is available natively in πŸ€— Transformers β‰₯ 5.3.0:

from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="microsoft/VibeVoice-ASR")
result = pipe("path/to/audio.wav")
print(result["text"])

Lightweight alternative: set VIBEVOICE_ASR_MODEL=openai/whisper-base for CPU-friendly transcription at the cost of some accuracy.


Qdrant

Qdrant is a high-performance vector similarity search engine. This project supports two modes:

Mode Setup Use case
In-Memory No setup required Quick demos, development
Server docker compose up -d Persistent storage, production

License

MIT License – see LICENSE for details.

VibeVoice-ASR is Β© Microsoft and licensed under the MIT License. Please review Microsoft's responsible AI guidelines before deploying in production.

About

Agentic Voice RAG built with VibeVoice and Qdrant

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages