A production-ready Retrieval-Augmented Generation (RAG) system with multi-modal storage backends, semantic search, and a modern realtime web interface (websocket).
- Framework: FastAPI, LangChain, LangGraph
- LLM: Ollama (local models)
- Embeddings: LangChain embeddings with ChromaDB
- Storage:
- Vector Store: ChromaDB
- Knowledge Graph: NetworkX
- Keyword Index: SQLite FTS5
- Web: WebSocket support via FastAPI
- MCP: Model Context Protocol server for IDE integration
- Framework: React 18 with Vite
- Styling: Tailwind CSS
- Build: Vite with PostCSS/Autoprefixer
- Scraping: BeautifulSoup4 for web content parsing
- Chunking: Configurable text chunking with overlap
- Formatting: Wikipedia content cleaner
- Tokenization: tiktoken for token counting
Build an intelligent RAG agent that ingests external content (e.g., Wikipedia articles), indexes it across multiple storage backends (vector, graph, keyword), and enables semantic search and retrieval through both REST API and web interface. The system supports local LLM inference via Ollama for complete on-device processing.
The system is organized into 5 main stages:
- Cleaner: Formats raw web content (Wikipedia articles)
- Chunker: Splits text into overlapping chunks for processing
- Ingestor: Orchestrates the ingestion pipeline
- Indexing Strategies:
vector.py: Generates embeddings and stores in ChromaDBkeyword.py: Creates FTS5 full-text search indexgraph.py: Extracts and stores knowledge graph triplets
Persistent data layer with three backends:
- VectorStore: ChromaDB for semantic similarity search
- KeywordStore: SQLite FTS5 for exact text matching
- GraphStore: NetworkX for knowledge graph relationships
Multi-strategy query interface combining all retrieval methods:
- Vector Search (
vector.py): Semantic similarity via embeddings (k=2-4) - Keyword Search (
keyword.py): Full-text search via FTS5 (k=2-4) - Graph Queries (
graph.py): Entity relationship triplets - Hybrid Mode: Runs all three in parallel, deduplicates by content prefix, returns top 8 results
The hybrid approach ensures comprehensive retrieval without redundancy—semantic for meaning, keywords for exact matches, and graphs for entity relationships.
- ChatService (
chat.py): LLM interface with streaming and token counting
LangGraph Workflow with two-stage processing:
- Retrieve Node: Executes all retrieval strategies, deduplicates results, formats context (1200 chars text + 400 chars triplets)
- Generate Node: Passes context to LLM with strict factual system prompt, maintains 6-message history window
- Memory: MemorySaver checkpointer for conversation persistence
- Design: Prevents hallucination by enforcing "answer from context only" principle
React-based frontend with:
- Real-time chat interface
- WebSocket connection to backend
- Responsive Tailwind CSS UI
- Vite-powered development server
The system supports three independent ways to interact with RAG functionality:
- Direct interactive terminal interface
- Uses LangGraph workflow with memory persistence
- Runs all three retrieval strategies sequentially
- Ideal for testing and local development
- FastAPI REST endpoints for chat operations
- WebSocket support for real-time streaming responses
- SQLite database for persistent chat history
- CORS-enabled for web client access
- Runs on
http://localhost:8000with interactive docs at/docs
- Model Context Protocol interface via stdio
- Retrieval-only — exposes 4 tools for external LLM clients to call:
semantic_search: Vector similarity searchkeyword_search: Exact keyword matchinggraph_query: Entity relationship querieshybrid_search: Combined search with deduplication
- Integrates with Continue IDE, Cline, Claude, and other AI tools (they handle LLM generation)
- Tuned for higher k values (k=4) for IDE context
- External client orchestrates: receives raw retrieval results → passes to their LLM → returns answer
All three entry points use the same underlying retrieval engines and storage backends.
The system enforces strict adherence to provided context:
- System prompt requires answers to be based only on retrieved documents
- Prevents internal knowledge or training data from polluting responses
- Responds with "Not found in the provided text" when context is insufficient
- Deduplicates overlapping results to reduce hallucination risk
- Async threading: Synchronous retrievers run in thread pool to prevent event loop blocking
- Parallel retrieval: All three search strategies execute simultaneously
- Context limiting: Fixed-size windows (1200 chars text, 400 chars triplets, 8 results max) keep LLM focused
- History management: 6-message conversation window balances context and token efficiency
- Vector Store: Semantic understanding via embeddings
- Keyword Index: Precision for exact term matching
- Knowledge Graph: Structured entity relationships
- Each backend tuned independently for its retrieval strategy
- Python 3.9+
- Node.js 18+
- Ollama (for local LLM inference)
-
Clone and navigate to project:
cd /path/to/rag-agent -
Create Python virtual environment:
python -m venv venv source venv/bin/activate # On Windows: venv\Scripts\activate
-
Install Python dependencies:
pip install -r requirements.txt
-
Configure environment variables:
cp .env.example .env # If available, or create manuallyRequired variables in
.env:VECTOR_PATH=./.chroma GRAPH_PATH=./.networkx/graph.gml KEYWORD_PATH=./.sqlite/keyword.sql COLLECTION_NAME=greek_myth EMBED_MODEL=nomic-embed-text CHAT_MODEL=llama3.2:3b KG_MODEL=qwen2.5-coder:3b SOURCE_URL=https://wikipedia.org/wiki/Greek_mythology USER_AGENT=Mozilla/5.0 (compatible; RAGNode/1.0)
-
Start Ollama (if not running):
ollama serve
Then pull required models:
ollama pull llama3.2:3b # chat model ollama pull qwen2.5-coder:3b # KG model ollama pull nomic-embed-text # embed model
-
Navigate to client directory:
cd rag-client -
Install dependencies:
npm install
-
Start development server:
npm run dev
Frontend runs on
http://localhost:5173
Ingest data:
python ingest.pyPopulates all three storage backends (vector, keyword, graph) with indexed content.
-
Start backend (Terminal 1):
python rag-server/server.py
Server runs on
http://localhost:8000with API docs at/docs -
Start frontend (Terminal 2):
cd rag-client npm run devClient runs on
http://localhost:5173 -
Open browser: Navigate to
http://localhost:5173
python prompt.pyInteractive terminal chat with direct LangGraph pipeline access. Useful for debugging and local development.
For integration with Continue IDE, Cline, or other AI tools:
python mcp-server/server.pyExposes 4 tools (semantic_search, keyword_search, graph_query, hybrid_search) via Model Context Protocol on stdio.
Backend with auto-reload (requires watchdog):
pip install watchdog
watchmedo auto-restart -d . -p '*.py' -- python rag-server/server.pyFrontend already has hot reload enabled with npm run dev.
rag-agent/
├── config.py # Configuration management
├── ingest.py # Data ingestion entry point
├── prompt.py # RAG prompt and response generation
├── requirements.txt # Python dependencies
│
├── formatting/ # Content cleaning & formatting
│ ├── base.py
│ └── wikipedia.py
│
├── ingestion/ # Data ingestion pipeline
│ ├── chunk.py # Text chunking logic
│ ├── ingestor.py # Pipeline orchestrator
│ └── indexing/ # Indexing strategies
│ ├── vector.py
│ ├── keyword.py
│ └── graph.py
│
├── storage/ # Storage backends
│ ├── vector.py # ChromaDB wrapper
│ ├── keyword.py # SQLite FTS5 wrapper
│ └── graph.py # NetworkX wrapper
│
├── retrieval/ # Retrieval methods
│ ├── vector.py
│ ├── keyword.py
│ └── graph.py
│
├── services/ # High-level services
│ └── chat.py
│
├── mcp-server/ # MCP Protocol server
│ ├── rag.py
│ └── server.py
│
├── rag-server/ # FastAPI REST server
│ └── server.py
│
└── rag-client/ # React frontend
├── package.json
├── vite.config.js
├── tailwind.config.js
└── src/
├── App.jsx
├── main.jsx
├── api.js
└── index.css
The RAG Server provides:
GET /- Health checkPOST /chat- Send chat message and get RAG responseWebSocket /ws- Real-time chat via WebSocket
For detailed API docs, visit http://localhost:8000/docs when the server is running.
All configuration is centralized in config.py and .env:
- Storage paths: Configure where to store vector DBs, graphs, and keyword indices
- Models: Select embedding and LLM models available in Ollama
- Collection: Configure collection name for ChromaDB
- Source: Set the data source URL for ingestion
ChromaDB connection issues:
# Clear local ChromaDB cache
rm -rf .chroma/
python ingest.py # Re-ingest dataOllama models not found:
# List available models
ollama list
# Pull required models
ollama pull llama3.2:3b # chat model
ollama pull qwen2.5-coder:3b # KG model
ollama pull nomic-embed-text # embed modelFrontend WebSocket connection fails:
- Ensure backend is running on
http://localhost:8000 - Check CORS settings in
rag-server/main.py
- Create a feature branch
- Make changes following the architecture patterns
- Test with both ingestion and retrieval workflows
- Submit a pull request
MIT