RAG Node

A production-ready Retrieval-Augmented Generation (RAG) system with multi-modal storage backends, semantic search, and a modern realtime web interface (websocket).

Tech Stack

Backend

Framework: FastAPI, LangChain, LangGraph
LLM: Ollama (local models)
Embeddings: LangChain embeddings with ChromaDB
Storage:
- Vector Store: ChromaDB
- Knowledge Graph: NetworkX
- Keyword Index: SQLite FTS5
Web: WebSocket support via FastAPI
MCP: Model Context Protocol server for IDE integration

Frontend

Framework: React 18 with Vite
Styling: Tailwind CSS
Build: Vite with PostCSS/Autoprefixer

Data Processing

Scraping: BeautifulSoup4 for web content parsing
Chunking: Configurable text chunking with overlap
Formatting: Wikipedia content cleaner
Tokenization: tiktoken for token counting

Goal

Build an intelligent RAG agent that ingests external content (e.g., Wikipedia articles), indexes it across multiple storage backends (vector, graph, keyword), and enables semantic search and retrieval through both REST API and web interface. The system supports local LLM inference via Ollama for complete on-device processing.

Architecture

The system is organized into 5 main stages:

1. Ingestion (`ingestion/`)

Cleaner: Formats raw web content (Wikipedia articles)
Chunker: Splits text into overlapping chunks for processing
Ingestor: Orchestrates the ingestion pipeline
Indexing Strategies:
- vector.py: Generates embeddings and stores in ChromaDB
- keyword.py: Creates FTS5 full-text search index
- graph.py: Extracts and stores knowledge graph triplets

2. Storage (`storage/`)

Persistent data layer with three backends:

VectorStore: ChromaDB for semantic similarity search
KeywordStore: SQLite FTS5 for exact text matching
GraphStore: NetworkX for knowledge graph relationships

3. Retrieval (`retrieval/`)

Multi-strategy query interface combining all retrieval methods:

Vector Search (vector.py): Semantic similarity via embeddings (k=2-4)
Keyword Search (keyword.py): Full-text search via FTS5 (k=2-4)
Graph Queries (graph.py): Entity relationship triplets
Hybrid Mode: Runs all three in parallel, deduplicates by content prefix, returns top 8 results

The hybrid approach ensures comprehensive retrieval without redundancy—semantic for meaning, keywords for exact matches, and graphs for entity relationships.

4. Services (`services/`)

ChatService (chat.py): LLM interface with streaming and token counting

5. RAG Pipeline (`prompt.py`)

LangGraph Workflow with two-stage processing:

Retrieve Node: Executes all retrieval strategies, deduplicates results, formats context (1200 chars text + 400 chars triplets)
Generate Node: Passes context to LLM with strict factual system prompt, maintains 6-message history window
Memory: MemorySaver checkpointer for conversation persistence
Design: Prevents hallucination by enforcing "answer from context only" principle

6. Client (`rag-client/`)

React-based frontend with:

Real-time chat interface
WebSocket connection to backend
Responsive Tailwind CSS UI
Vite-powered development server

Entry Points

The system supports three independent ways to interact with RAG functionality:

1. CLI Chat (`python prompt.py`)

Direct interactive terminal interface
Uses LangGraph workflow with memory persistence
Runs all three retrieval strategies sequentially
Ideal for testing and local development

2. Web API (`python rag-server/server.py`)

FastAPI REST endpoints for chat operations
WebSocket support for real-time streaming responses
SQLite database for persistent chat history
CORS-enabled for web client access
Runs on http://localhost:8000 with interactive docs at /docs

3. MCP Server (`python mcp-server/server.py`)

Model Context Protocol interface via stdio
Retrieval-only — exposes 4 tools for external LLM clients to call:
- semantic_search: Vector similarity search
- keyword_search: Exact keyword matching
- graph_query: Entity relationship queries
- hybrid_search: Combined search with deduplication
Integrates with Continue IDE, Cline, Claude, and other AI tools (they handle LLM generation)
Tuned for higher k values (k=4) for IDE context
External client orchestrates: receives raw retrieval results → passes to their LLM → returns answer

All three entry points use the same underlying retrieval engines and storage backends.

Design Principles

Factual Grounding

The system enforces strict adherence to provided context:

System prompt requires answers to be based only on retrieved documents
Prevents internal knowledge or training data from polluting responses
Responds with "Not found in the provided text" when context is insufficient
Deduplicates overlapping results to reduce hallucination risk

Performance Optimization

Async threading: Synchronous retrievers run in thread pool to prevent event loop blocking
Parallel retrieval: All three search strategies execute simultaneously
Context limiting: Fixed-size windows (1200 chars text, 400 chars triplets, 8 results max) keep LLM focused
History management: 6-message conversation window balances context and token efficiency

Multi-Modal Storage

Vector Store: Semantic understanding via embeddings
Keyword Index: Precision for exact term matching
Knowledge Graph: Structured entity relationships
Each backend tuned independently for its retrieval strategy

Installation

Prerequisites

Python 3.9+
Node.js 18+
Ollama (for local LLM inference)

Backend Setup

Clone and navigate to project:
```
cd /path/to/rag-agent
```

Create Python virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install Python dependencies:
```
pip install -r requirements.txt
```

Configure environment variables:

cp .env.example .env  # If available, or create manually

Required variables in .env:

 VECTOR_PATH=./.chroma
 GRAPH_PATH=./.networkx/graph.gml
 KEYWORD_PATH=./.sqlite/keyword.sql

 COLLECTION_NAME=greek_myth

 EMBED_MODEL=nomic-embed-text
 CHAT_MODEL=llama3.2:3b
 KG_MODEL=qwen2.5-coder:3b

 SOURCE_URL=https://wikipedia.org/wiki/Greek_mythology
 USER_AGENT=Mozilla/5.0 (compatible; RAGNode/1.0)

Start Ollama (if not running):

ollama serve

Then pull required models:

 ollama pull llama3.2:3b         # chat model
 ollama pull qwen2.5-coder:3b    # KG model
 ollama pull nomic-embed-text    # embed model

Frontend Setup

Navigate to client directory:
```
cd rag-client
```
Install dependencies:
```
npm install
```
Start development server:
```
npm run dev
```
Frontend runs on http://localhost:5173

Data Setup (One-Time)

Ingest data:

python ingest.py

Populates all three storage backends (vector, keyword, graph) with indexed content.

Option 1: Web Interface (Recommended for UI)

Start backend (Terminal 1):
```
python rag-server/server.py
```
Server runs on http://localhost:8000 with API docs at /docs
Start frontend (Terminal 2):
```
cd rag-client
npm run dev
```
Client runs on http://localhost:5173
Open browser: Navigate to http://localhost:5173

Option 2: CLI Chat (Quick Testing)

python prompt.py

Interactive terminal chat with direct LangGraph pipeline access. Useful for debugging and local development.

Option 3: MCP Server (IDE Integration)

For integration with Continue IDE, Cline, or other AI tools:

python mcp-server/server.py

Exposes 4 tools (semantic_search, keyword_search, graph_query, hybrid_search) via Model Context Protocol on stdio.

Development Mode

Backend with auto-reload (requires watchdog):

pip install watchdog
watchmedo auto-restart -d . -p '*.py' -- python rag-server/server.py

Frontend already has hot reload enabled with npm run dev.

Project Structure

rag-agent/
├── config.py               # Configuration management
├── ingest.py               # Data ingestion entry point
├── prompt.py               # RAG prompt and response generation
├── requirements.txt        # Python dependencies
│
├── formatting/             # Content cleaning & formatting
│   ├── base.py
│   └── wikipedia.py
│
├── ingestion/              # Data ingestion pipeline
│   ├── chunk.py            # Text chunking logic
│   ├── ingestor.py         # Pipeline orchestrator
│   └── indexing/           # Indexing strategies
│       ├── vector.py
│       ├── keyword.py
│       └── graph.py
│
├── storage/                # Storage backends
│   ├── vector.py           # ChromaDB wrapper
│   ├── keyword.py          # SQLite FTS5 wrapper
│   └── graph.py            # NetworkX wrapper
│
├── retrieval/              # Retrieval methods
│   ├── vector.py
│   ├── keyword.py
│   └── graph.py
│
├── services/               # High-level services
│   └── chat.py
│
├── mcp-server/             # MCP Protocol server
│   ├── rag.py
│   └── server.py
│
├── rag-server/             # FastAPI REST server
│   └── server.py
│
└── rag-client/             # React frontend
    ├── package.json
    ├── vite.config.js
    ├── tailwind.config.js
    └── src/
        ├── App.jsx
        ├── main.jsx
        ├── api.js
        └── index.css

API Endpoints

The RAG Server provides:

GET / - Health check
POST /chat - Send chat message and get RAG response
WebSocket /ws - Real-time chat via WebSocket

For detailed API docs, visit http://localhost:8000/docs when the server is running.

Configuration

All configuration is centralized in config.py and .env:

Storage paths: Configure where to store vector DBs, graphs, and keyword indices
Models: Select embedding and LLM models available in Ollama
Collection: Configure collection name for ChromaDB
Source: Set the data source URL for ingestion

Troubleshooting

ChromaDB connection issues:

# Clear local ChromaDB cache
rm -rf .chroma/
python ingest.py  # Re-ingest data

Ollama models not found:

# List available models
ollama list

# Pull required models
ollama pull llama3.2:3b         # chat model
ollama pull qwen2.5-coder:3b    # KG model
ollama pull nomic-embed-text    # embed model

Frontend WebSocket connection fails:

Ensure backend is running on http://localhost:8000
Check CORS settings in rag-server/main.py

Contributing

Create a feature branch
Make changes following the architecture patterns
Test with both ingestion and retrieval workflows
Submit a pull request

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
formatting		formatting
ingestion		ingestion
mcp-server		mcp-server
rag-client		rag-client
rag-server		rag-server
retrieval		retrieval
services		services
storage		storage
.gitignore		.gitignore
README.md		README.md
config.py		config.py
ingest.py		ingest.py
prompt.py		prompt.py
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

RAG Node

Tech Stack

Backend

Frontend

Data Processing

Goal

Architecture

1. Ingestion (ingestion/)

2. Storage (storage/)

3. Retrieval (retrieval/)

4. Services (services/)

5. RAG Pipeline (prompt.py)

6. Client (rag-client/)

Entry Points

1. CLI Chat (python prompt.py)

2. Web API (python rag-server/server.py)

3. MCP Server (python mcp-server/server.py)

Design Principles

Factual Grounding

Performance Optimization

Multi-Modal Storage

Installation

Prerequisites

Backend Setup

Frontend Setup

Data Setup (One-Time)

Option 1: Web Interface (Recommended for UI)

Option 2: CLI Chat (Quick Testing)

Option 3: MCP Server (IDE Integration)

Development Mode

Project Structure

API Endpoints

Configuration

Troubleshooting

Contributing

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

1. Ingestion (`ingestion/`)

2. Storage (`storage/`)

3. Retrieval (`retrieval/`)

4. Services (`services/`)

5. RAG Pipeline (`prompt.py`)

6. Client (`rag-client/`)

1. CLI Chat (`python prompt.py`)

2. Web API (`python rag-server/server.py`)

3. MCP Server (`python mcp-server/server.py`)

Packages