sajibsrs/ragnode

RAG Node

A production-ready Retrieval-Augmented Generation (RAG) system with three storage backends (vector, graph, keyword), semantic search, and a real-time web interface over WebSocket.

Tech Stack

Backend

  • Framework: FastAPI, LangChain, LangGraph
  • LLM: Ollama (local models)
  • Embeddings: LangChain embeddings with ChromaDB
  • Storage:
    • Vector Store: ChromaDB
    • Knowledge Graph: NetworkX
    • Keyword Index: SQLite FTS5
  • Web: WebSocket support via FastAPI
  • MCP: Model Context Protocol server for IDE integration

Frontend

  • Framework: React 18 with Vite
  • Styling: Tailwind CSS
  • Build: Vite with PostCSS/Autoprefixer

Data Processing

  • Scraping: BeautifulSoup4 for web content parsing
  • Chunking: Configurable text chunking with overlap
  • Formatting: Wikipedia content cleaner
  • Tokenization: tiktoken for token counting

Goal

Build an intelligent RAG agent that ingests external content (e.g., Wikipedia articles), indexes it across multiple storage backends (vector, graph, keyword), and enables semantic search and retrieval through both REST API and web interface. The system supports local LLM inference via Ollama for complete on-device processing.

Architecture

The system is organized into six main components:

1. Ingestion (ingestion/)

  • Cleaner: Formats raw web content (Wikipedia articles)
  • Chunker: Splits text into overlapping chunks for processing
  • Ingestor: Orchestrates the ingestion pipeline
  • Indexing Strategies:
    • vector.py: Generates embeddings and stores in ChromaDB
    • keyword.py: Creates FTS5 full-text search index
    • graph.py: Extracts and stores knowledge graph triplets
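
The Chunker's overlapping split can be sketched as follows. This is a minimal character-based illustration, not the repository's implementation: the function name `chunk_text` and the `size` and `overlap` defaults are assumptions, and the project may well chunk by tokens (via tiktoken) rather than characters.

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most `size` characters.

    Consecutive chunks share `overlap` characters, so text near a chunk
    boundary appears in both neighbouring chunks.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    step = size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

A larger overlap makes it less likely that a fact is split across chunks, at the cost of storing and embedding more duplicated text.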

2. Storage (storage/)

Persistent data layer with three backends:

  • VectorStore: ChromaDB for semantic similarity search
  • KeywordStore: SQLite FTS5 for exact text matching
  • GraphStore: NetworkX for knowledge graph relationships
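
To illustrate the keyword backend, here is a small sketch of an FTS5 index using Python's built-in `sqlite3` module. The table name `chunks`, the column `content`, and the helper `keyword_search` are hypothetical; the actual schema lives in storage/keyword.py. It assumes the local SQLite build includes the FTS5 extension, which is the case for most Python distributions.

```python
import sqlite3

# In-memory database for illustration; the project persists to KEYWORD_PATH.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one FTS5 virtual table with a single text column.
conn.execute("CREATE VIRTUAL TABLE chunks USING fts5(content)")
conn.executemany(
    "INSERT INTO chunks (content) VALUES (?)",
    [
        ("Zeus is the king of the Olympian gods.",),
        ("Athena is the goddess of wisdom.",),
        ("Poseidon rules the sea.",),
    ],
)


def keyword_search(query: str, k: int = 2) -> list[str]:
    """Return up to k rows matching the query, best-ranked first."""
    rows = conn.execute(
        "SELECT content FROM chunks WHERE chunks MATCH ? ORDER BY rank LIMIT ?",
        (query, k),
    ).fetchall()
    return [row[0] for row in rows]
```

For example, `keyword_search("wisdom")` returns only the Athena sentence, while a term that appears nowhere in the index returns an empty list.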

3. Retrieval (retrieval/)

Multi-strategy query interface combining all retrieval methods:

  • Vector Search (vector.py): Semantic similarity via embeddings (k=2-4)
  • Keyword Search (keyword.py): Full-text search via FTS5 (k=2-4)
  • Graph Queries (graph.py): Entity relationship triplets
  • Hybrid Mode: Runs all three in parallel, deduplicates by content prefix, returns top 8 results

The hybrid approach aims for comprehensive retrieval without redundancy: semantic search for meaning, keyword search for exact matches, and graph queries for entity relationships.
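
One way the merge step of hybrid mode could work is sketched below. It is an illustration under stated assumptions rather than the project's code: the function name `merge_results`, the 100-character prefix length, and the round-robin interleaving order are all hypothetical. Only the prefix-based deduplication and the cap of 8 results come from the description above.

```python
from itertools import chain, zip_longest

_MISSING = object()  # sentinel used to pad the shorter result lists


def merge_results(*result_lists: list[str], prefix_len: int = 100, limit: int = 8) -> list[str]:
    """Interleave results from several retrievers and drop duplicates.

    Two results count as duplicates when their first `prefix_len` characters
    match after whitespace and case normalisation.
    """
    seen = set()
    merged = []
    # Round-robin across strategies so each contributes its best hits first.
    for item in chain.from_iterable(zip_longest(*result_lists, fillvalue=_MISSING)):
        if item is _MISSING:
            continue
        key = " ".join(item.split()).lower()[:prefix_len]
        if key in seen:
            continue
        seen.add(key)
        merged.append(item)
        if len(merged) == limit:
            break
    return merged
```

Comparing only a prefix is cheap and catches the common case where vector and keyword search return the same chunk, though it will miss duplicates that differ in their opening characters.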

4. Services (services/)

  • ChatService (chat.py): LLM interface with streaming and token counting

5. RAG Pipeline (prompt.py)

LangGraph Workflow with two-stage processing:

  • Retrieve Node: Executes all retrieval strategies, deduplicates results, formats context (1200 chars text + 400 chars triplets)
  • Generate Node: Passes context to LLM with strict factual system prompt, maintains 6-message history window
  • Memory: MemorySaver checkpointer for conversation persistence
  • Design: Reduces hallucination by enforcing an "answer from context only" principle
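
The context and history limits above can be sketched in plain Python, independent of LangGraph. The function names, the "Text:" and "Facts:" section labels, and the choice to apply each budget to the joined string (rather than per passage) are assumptions for illustration; only the numbers 1200, 400, and 6 come from the description.

```python
TEXT_BUDGET = 1200     # characters of retrieved passage text
TRIPLET_BUDGET = 400   # characters of knowledge graph triplets
HISTORY_WINDOW = 6     # most recent messages kept in the prompt


def build_context(passages: list[str], triplets: list[str]) -> str:
    """Join retrieved material and truncate each part to its character budget."""
    text = "\n\n".join(passages)[:TEXT_BUDGET]
    facts = "\n".join(triplets)[:TRIPLET_BUDGET]
    return f"Text:\n{text}\n\nFacts:\n{facts}"


def trim_history(messages: list[dict]) -> list[dict]:
    """Keep only the last HISTORY_WINDOW messages of the conversation."""
    return messages[-HISTORY_WINDOW:]
```

Fixed character budgets keep the prompt size predictable for a small local model, at the cost of occasionally cutting a passage mid-sentence.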

6. Client (rag-client/)

React-based frontend with:

  • Real-time chat interface
  • WebSocket connection to backend
  • Responsive Tailwind CSS UI
  • Vite-powered development server

Entry Points

The system supports three independent ways to interact with RAG functionality:

1. CLI Chat (python prompt.py)

  • Direct interactive terminal interface
  • Uses LangGraph workflow with memory persistence
  • Runs all three retrieval strategies sequentially
  • Ideal for testing and local development

2. Web API (python rag-server/server.py)

  • FastAPI REST endpoints for chat operations
  • WebSocket support for real-time streaming responses
  • SQLite database for persistent chat history
  • CORS-enabled for web client access
  • Runs on http://localhost:8000 with interactive docs at /docs

3. MCP Server (python mcp-server/server.py)

  • Model Context Protocol interface via stdio
  • Retrieval-only; exposes 4 tools that external LLM clients can call:
    • semantic_search: Vector similarity search
    • keyword_search: Exact keyword matching
    • graph_query: Entity relationship queries
    • hybrid_search: Combined search with deduplication
  • Integrates with Continue IDE, Cline, Claude, and other AI tools (they handle LLM generation)
  • Tuned for higher k values (k=4) for IDE context
  • The external client orchestrates the rest: it receives the raw retrieval results, passes them to its own LLM, and returns the answer

All three entry points use the same underlying retrieval engines and storage backends.

Design Principles

Factual Grounding

The system enforces strict adherence to provided context:

  • System prompt requires answers to be based only on retrieved documents
  • Instructs the model not to draw on internal knowledge or training data
  • Responds with "Not found in the provided text" when context is insufficient
  • Deduplicates overlapping results to reduce hallucination risk

Performance Optimization

  • Async threading: Synchronous retrievers run in thread pool to prevent event loop blocking
  • Parallel retrieval: All three search strategies execute simultaneously
  • Context limiting: Fixed-size windows (1200 chars text, 400 chars triplets, 8 results max) keep LLM focused
  • History management: 6-message conversation window balances context and token efficiency
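
The threading and parallel-retrieval points can be illustrated with `asyncio.to_thread` (available from Python 3.9) and `asyncio.gather`. The three retriever functions below are stand-ins that simulate blocking I/O with `time.sleep`; their names and return values are hypothetical, and the project's own code may use a different mechanism such as an explicit executor.

```python
import asyncio
import time


# Stand-ins for the synchronous retrievers; the real ones would query
# ChromaDB, SQLite FTS5, and NetworkX respectively.
def vector_search(query: str) -> list[str]:
    time.sleep(0.1)  # simulate blocking I/O
    return [f"vector hit for {query!r}"]


def keyword_search(query: str) -> list[str]:
    time.sleep(0.1)
    return [f"keyword hit for {query!r}"]


def graph_query(query: str) -> list[str]:
    time.sleep(0.1)
    return [f"graph hit for {query!r}"]


async def retrieve_all(query: str) -> list[list[str]]:
    """Run the three blocking retrievers concurrently in worker threads."""
    return await asyncio.gather(
        asyncio.to_thread(vector_search, query),
        asyncio.to_thread(keyword_search, query),
        asyncio.to_thread(graph_query, query),
    )


if __name__ == "__main__":
    print(asyncio.run(retrieve_all("Zeus")))
```

Because each call runs in its own worker thread, the event loop stays free to serve other WebSocket clients, and the total wait is close to the slowest retriever rather than the sum of all three.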

Multi-Modal Storage

  • Vector Store: Semantic understanding via embeddings
  • Keyword Index: Precision for exact term matching
  • Knowledge Graph: Structured entity relationships
  • Each backend tuned independently for its retrieval strategy

Installation

Prerequisites

  • Python 3.9+
  • Node.js 18+
  • Ollama (for local LLM inference)

Backend Setup

  1. Clone and navigate to project:

    cd /path/to/rag-agent
  2. Create Python virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install Python dependencies:

    pip install -r requirements.txt
  4. Configure environment variables:

    cp .env.example .env  # If available, or create manually

    Required variables in .env:

     VECTOR_PATH=./.chroma
     GRAPH_PATH=./.networkx/graph.gml
     KEYWORD_PATH=./.sqlite/keyword.sql
    
     COLLECTION_NAME=greek_myth
    
     EMBED_MODEL=nomic-embed-text
     CHAT_MODEL=llama3.2:3b
     KG_MODEL=qwen2.5-coder:3b
    
     SOURCE_URL=https://wikipedia.org/wiki/Greek_mythology
     USER_AGENT=Mozilla/5.0 (compatible; RAGNode/1.0)
  5. Start Ollama (if not running):

    ollama serve

    Then pull required models:

     ollama pull llama3.2:3b         # chat model
     ollama pull qwen2.5-coder:3b    # KG model
     ollama pull nomic-embed-text    # embed model

Frontend Setup

  1. Navigate to client directory:

    cd rag-client
  2. Install dependencies:

    npm install
  3. Start development server:

    npm run dev

    Frontend runs on http://localhost:5173

Data Setup (One-Time)

Ingest data:

python ingest.py

Populates all three storage backends (vector, keyword, graph) with indexed content.

Option 1: Web Interface (Recommended for UI)

  1. Start backend (Terminal 1):

    python rag-server/server.py

    Server runs on http://localhost:8000 with API docs at /docs

  2. Start frontend (Terminal 2):

    cd rag-client
    npm run dev

    Client runs on http://localhost:5173

  3. Open browser: Navigate to http://localhost:5173

Option 2: CLI Chat (Quick Testing)

python prompt.py

Interactive terminal chat with direct LangGraph pipeline access. Useful for debugging and local development.

Option 3: MCP Server (IDE Integration)

For integration with Continue IDE, Cline, or other AI tools:

python mcp-server/server.py

Exposes 4 tools (semantic_search, keyword_search, graph_query, hybrid_search) via Model Context Protocol on stdio.

Development Mode

Backend with auto-reload (requires watchdog):

pip install watchdog
watchmedo auto-restart -d . -p '*.py' -- python rag-server/server.py

Frontend already has hot reload enabled with npm run dev.

Project Structure

rag-agent/
├── config.py               # Configuration management
├── ingest.py               # Data ingestion entry point
├── prompt.py               # RAG prompt and response generation
├── requirements.txt        # Python dependencies
│
├── formatting/             # Content cleaning & formatting
│   ├── base.py
│   └── wikipedia.py
│
├── ingestion/              # Data ingestion pipeline
│   ├── chunk.py            # Text chunking logic
│   ├── ingestor.py         # Pipeline orchestrator
│   └── indexing/           # Indexing strategies
│       ├── vector.py
│       ├── keyword.py
│       └── graph.py
│
├── storage/                # Storage backends
│   ├── vector.py           # ChromaDB wrapper
│   ├── keyword.py          # SQLite FTS5 wrapper
│   └── graph.py            # NetworkX wrapper
│
├── retrieval/              # Retrieval methods
│   ├── vector.py
│   ├── keyword.py
│   └── graph.py
│
├── services/               # High-level services
│   └── chat.py
│
├── mcp-server/             # MCP Protocol server
│   ├── rag.py
│   └── server.py
│
├── rag-server/             # FastAPI REST server
│   └── server.py
│
└── rag-client/             # React frontend
    ├── package.json
    ├── vite.config.js
    ├── tailwind.config.js
    └── src/
        ├── App.jsx
        ├── main.jsx
        ├── api.js
        └── index.css

API Endpoints

The RAG Server provides:

  • GET / - Health check
  • POST /chat - Send chat message and get RAG response
  • WebSocket /ws - Real-time chat via WebSocket

For detailed API docs, visit http://localhost:8000/docs when the server is running.
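
As a rough sketch, the chat endpoint could be called from Python using only the standard library. The request body schema is not documented here, so the `"message"` field name is an assumption; check the generated docs at /docs for the actual schema before relying on it. The helper names `build_chat_request` and `send` are likewise hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def build_chat_request(message: str) -> urllib.request.Request:
    """Build a POST /chat request. The "message" field name is an assumption."""
    body = json.dumps({"message": message}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Requires the backend to be running on localhost:8000.
    print(send(build_chat_request("Who is Zeus?")))
```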

Configuration

All configuration is centralized in config.py and .env:

  • Storage paths: Configure where to store vector DBs, graphs, and keyword indices
  • Models: Select embedding and LLM models available in Ollama
  • Collection: Configure collection name for ChromaDB
  • Source: Set the data source URL for ingestion
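
A minimal sketch of how such settings might be read is shown below. It is not the contents of config.py: the `Settings` class, the `load_settings` function, and the use of the example .env values as fallbacks are assumptions. It also reads from the process environment directly, whereas the project presumably loads the .env file first.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    vector_path: str
    graph_path: str
    keyword_path: str
    collection_name: str
    embed_model: str
    chat_model: str
    kg_model: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping, falling back to the example .env values."""
    return Settings(
        vector_path=env.get("VECTOR_PATH", "./.chroma"),
        graph_path=env.get("GRAPH_PATH", "./.networkx/graph.gml"),
        keyword_path=env.get("KEYWORD_PATH", "./.sqlite/keyword.sql"),
        collection_name=env.get("COLLECTION_NAME", "greek_myth"),
        embed_model=env.get("EMBED_MODEL", "nomic-embed-text"),
        chat_model=env.get("CHAT_MODEL", "llama3.2:3b"),
        kg_model=env.get("KG_MODEL", "qwen2.5-coder:3b"),
    )
```

Passing a plain dictionary instead of `os.environ`, for example `load_settings({"CHAT_MODEL": "mistral"})`, makes it easy to override a single value in tests.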

Troubleshooting

ChromaDB connection issues:

# Clear local ChromaDB cache
rm -rf .chroma/
python ingest.py  # Re-ingest data

Ollama models not found:

# List available models
ollama list

# Pull required models
ollama pull llama3.2:3b         # chat model
ollama pull qwen2.5-coder:3b    # KG model
ollama pull nomic-embed-text    # embed model

Frontend WebSocket connection fails:

  • Ensure backend is running on http://localhost:8000
  • Check CORS settings in rag-server/server.py

Contributing

  1. Create a feature branch
  2. Make changes following the architecture patterns
  3. Test with both ingestion and retrieval workflows
  4. Submit a pull request

License

MIT
