Production-ready RAG (Retrieval-Augmented Generation) pipeline for law firms, consulting companies, and research teams working with 100+ page document corpora. Upload documents, ask questions, get cited answers with hallucination protection.
- Hybrid Search: Vector (Chroma) + BM25 keyword search fused with Reciprocal Rank Fusion for superior retrieval accuracy
- 3-Layer Hallucination Guard: Retrieval confidence check → generation constraint → citation verification
- Inline Citations: Every factual claim tagged with `[S1]`-`[S5]` source references, each verified via embedding similarity
- Multi-Tenant Isolation: JWT-based auth, separate Chroma collections and BM25 indexes per tenant — zero data leakage
- Semantic Chunking: spaCy sentence boundaries, 800-token target with 150-token overlap, context injection per chunk
- Supported Formats: PDF (PyMuPDF + OCR fallback), DOCX (heading-aware), URL (Firecrawl)
- RAGAS Evaluation: Built-in faithfulness, answer relevancy, and context precision scoring
```mermaid
graph TB
    subgraph Ingestion Pipeline
        A[PDF / DOCX / URL] --> B[Parsers<br/>PyMuPDF · python-docx · Firecrawl]
        B --> C[Semantic Chunker<br/>spaCy · 800 tokens · 150 overlap]
        C --> D[Batch Embedder<br/>OpenAI text-embedding-3-small]
        D --> E[(Chroma Vector DB)]
        D --> F[(BM25 Index)]
    end
    subgraph Query Pipeline
        G[User Question] --> H[Query Rewriter<br/>Claude Haiku → 3 variants]
        H --> I[Hybrid Retriever]
        I --> J[Dense: Chroma Top-20]
        I --> K[Sparse: BM25 Top-20]
        J --> L[RRF Fusion k=60]
        K --> L
        L -->|Top 20| M[Cohere Reranker]
        M -->|Top 5| N{Relevance > 0.4?}
        N -->|No| O[No relevant content found]
        N -->|Yes| P[Answer Generator<br/>Claude Sonnet]
        P --> Q[Citation Validator<br/>Cosine similarity > 0.75]
        Q --> R[Response with<br/>inline citations]
    end
    subgraph Multi-Tenant Isolation
        S[Tenant A Collection] -.-> E
        T[Tenant B Collection] -.-> E
        S2[Tenant A Index] -.-> F
        T2[Tenant B Index] -.-> F
    end
```
- Parse: Extract text page-by-page preserving structure. Claude Haiku extracts document title and section headings.
- Chunk: Split into ~800-token segments using spaCy sentence boundaries (never mid-sentence). Each chunk gets a context prefix: `[Document: {title} | Section: {heading}]`
- Embed: Batch embed via OpenAI `text-embedding-3-small` with exponential backoff retry.
- Store: Vectors → tenant-namespaced Chroma collection. Text → tenant-specific BM25 pickle index.
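The chunking step can be sketched in a few lines. This is an illustrative simplification, not the project's `chunker.py`: it approximates token counts by whitespace splitting (the pipeline uses tiktoken) and takes pre-split sentences as input (the pipeline uses spaCy); the sentence-level carry-back for overlap is an assumption about how the 150-token overlap is realized.

```python
def chunk_sentences(sentences, target=800, overlap=150):
    """Pack whole sentences into ~target-token chunks with ~overlap carry-back."""
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = len(sent.split())  # crude token count; real pipeline uses tiktoken
        if current and current_len + n > target:
            chunks.append(" ".join(current))
            # carry back whole trailing sentences until ~overlap tokens are reused
            carried, carried_len = [], 0
            for s in reversed(current):
                carried_len += len(s.split())
                carried.insert(0, s)
                if carried_len >= overlap:
                    break
            current, current_len = carried, carried_len
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks

def with_context(chunk, title, heading):
    """Prepend the per-chunk context prefix described above."""
    return f"[Document: {title} | Section: {heading}] {chunk}"
```

Because only whole sentences are packed or carried back, no chunk boundary ever falls mid-sentence, which is the property the pipeline relies on.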
- Rewrite: Claude Haiku generates 3 query variants (literal, semantic expansion, entity-focused)
- Retrieve: Each variant runs against both Chroma (dense) and BM25 (sparse) → 6 ranked lists
- Fuse: Reciprocal Rank Fusion (k=60) merges all lists → top 20 unique chunks
- Rerank: Cohere `rerank-v3.5` scores chunks against original question → top 5
- Guard Layer 1: If best chunk relevance < 0.4 → return "no relevant content" (skip LLM)
- Generate: Claude Sonnet produces answer with mandatory `[Sx]` citations per claim
- Guard Layer 2: System prompt enforces `INSUFFICIENT_CONTEXT` when sources don't cover the question
- Guard Layer 3: Each citation verified via cosine similarity (claim embedding vs chunk embedding). Below 0.75 → flagged. Over 30% unverified → hallucination warning + confidence penalty
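The fusion step can be illustrated with a minimal Reciprocal Rank Fusion sketch. `rrf_fuse` is a hypothetical helper name, not the project's actual function; it implements the standard RRF formula with the k=60 used above:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60, top_n=20):
    """Merge several ranked lists of chunk IDs.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1; higher total score ranks first.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

With 3 query variants hitting 2 indexes, six ranked lists go in and one merged top-20 list comes out; a chunk that appears near the top of several lists outranks one that tops a single list.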
| Component | Technology |
|---|---|
| API Framework | FastAPI + Uvicorn |
| Vector Database | ChromaDB (persistent mode) |
| Sparse Index | rank-bm25 (pickle persistence) |
| Embeddings | OpenAI text-embedding-3-small |
| Answer Generation | Claude Sonnet (claude-sonnet-4-6) |
| Query Rewriting | Claude Haiku (claude-haiku-4-5-20251001) |
| Reranking | Cohere rerank-v3.5 |
| PDF Parsing | PyMuPDF + Unstructured (OCR fallback) |
| DOCX Parsing | python-docx |
| URL Parsing | Firecrawl |
| NLP | spaCy (sentence segmentation) |
| Tokenization | tiktoken (cl100k_base) |
| Auth | PyJWT (HS256) |
| Cache | Redis |
| Containerization | Docker Compose |
- Docker & Docker Compose
- API keys: OpenAI, Anthropic, Cohere (optional — falls back gracefully)
```bash
cp .env.example .env
# Fill in your API keys
```

Required environment variables:
| Variable | Description |
|---|---|
| `OPENAI_API_KEY` | OpenAI API key for embeddings |
| `ANTHROPIC_API_KEY` | Anthropic API key for Claude Haiku/Sonnet |
| `COHERE_API_KEY` | Cohere API key for reranking (optional — falls back to RRF order) |
| `JWT_SECRET` | Secret key for JWT token signing (change from default) |
```bash
docker-compose up -d
```

| Service | Port | Description |
|---|---|---|
| FastAPI | 8080 | Main API server |
| Chroma | 8000 | Vector database (persistent) |
| Redis | 6379 | BM25 index cache |
```bash
pip install -r requirements.txt
python -m spacy download en_core_web_sm

# Start Chroma and Redis separately, then:
uvicorn api.main:app --host 0.0.0.0 --port 8080 --reload
```

```bash
curl -X POST http://localhost:8080/tenants \
  -H "Content-Type: application/json" \
  -d '{"name": "Acme Law Firm"}'
```

```json
{
  "tenant_id": "a1b2c3d4-...",
  "api_key": "eyJhbGciOiJIUzI1NiIs..."
}
```

File upload:
```bash
curl -X POST http://localhost:8080/documents \
  -H "Authorization: Bearer <api_key>" \
  -F "file=@contract.pdf"
```

URL ingestion:

```bash
curl -X POST http://localhost:8080/documents \
  -H "Authorization: Bearer <api_key>" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/report.html"}'
```

```json
{
  "doc_id": "d5e6f7g8-...",
  "chunk_count": 0,
  "status": "processing"
}
```

Processing happens in the background. Poll `GET /documents` to check when status becomes `"ready"`.
```bash
curl -X POST http://localhost:8080/query \
  -H "Authorization: Bearer <api_key>" \
  -H "Content-Type: application/json" \
  -d '{"question": "What are the payment terms in the contract?", "top_k": 5}'
```

```json
{
  "question": "What are the payment terms in the contract?",
  "rewritten_queries": [
    "payment terms contract",
    "financial obligations billing schedule fees",
    "Acme Corp payment net-30 $150,000"
  ],
  "answer": "The payment terms require annual fees of $150,000, payable net-30 from invoice date [S1]. Late payments incur a 1.5% monthly interest charge [S3].",
  "citations": [
    {
      "citation_id": 1,
      "chunk_id": "...",
      "claim_text": "The payment terms require annual fees of $150,000, payable net-30 from invoice date",
      "source_similarity": 0.92,
      "verified": true
    }
  ],
  "confidence": 0.95,
  "hallucination_warning": false
}
```

```bash
curl http://localhost:8080/documents \
  -H "Authorization: Bearer <api_key>"
```

```bash
curl -X DELETE http://localhost:8080/documents/<doc_id> \
  -H "Authorization: Bearer <api_key>"
```

| Layer | Mechanism | Trigger | Action |
|---|---|---|---|
| 1 — Retrieval confidence | Top reranked chunk relevance score | Score < 0.4 | Skip generation, return null answer |
| 2 — Generation constraint | System prompt forces `[Sx]` citations | Sources insufficient | Return `INSUFFICIENT_CONTEXT` |
| 3 — Citation validation | Cosine similarity: claim vs source chunk | Similarity < 0.75 | Flag as unverified; >30% unverified → warning + confidence penalty |
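Layer 3 can be sketched with plain cosine similarity. `validate_citations` is a hypothetical helper, not the project's `citation_validator.py` API; the 0.75 threshold and 30% warning ratio come from the table above:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def validate_citations(pairs, threshold=0.75, warn_ratio=0.30):
    """pairs: list of (claim_embedding, cited_chunk_embedding).

    Returns per-citation verified flags, plus a hallucination warning
    when more than warn_ratio of citations fall below the threshold.
    """
    verified = [cosine(claim, chunk) >= threshold for claim, chunk in pairs]
    unverified_share = verified.count(False) / len(verified)
    return verified, unverified_share > warn_ratio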
Every operation is scoped to `tenant_id` extracted from the verified JWT — never from client input:
- Chroma: Separate collection per tenant (`tenant_{tenant_id}_docs`)
- BM25: Separate index file per tenant (`./indexes/{tenant_id}.pkl`)
- API: All endpoints derive `tenant_id` from Bearer token, not request body/params
- Deletion: Removing a tenant destroys its Chroma collection and BM25 index
```bash
# Generate a realistic 15-page service agreement PDF
python evaluation/fixtures/generate_contract.py
```

The generated contract includes: parties (Acme Corp & TechFlow Solutions), payment terms ($150K/year, net-30), termination clauses (90-day notice), liability cap ($500K), confidentiality (5-year survival), SLA (99.5% uptime), and governing law (Delaware).

```bash
python -m evaluation.ragas_eval --tenant test --questions evaluation/fixtures/eval_questions.json
```

10 evaluation questions: 8 answerable from the contract + 2 intentionally unanswerable (tests hallucination guard).

Results written to `evaluation/eval_results.md`:
| Metric | Target | Description |
|---|---|---|
| Faithfulness | > 0.90 | Claims in answer grounded in retrieved context |
| Answer Relevancy | > 0.85 | Answer addresses the question asked |
| Context Precision | > 0.88 | Retrieved chunks actually relevant to question |
```bash
pytest tests/ -v
```

87+ tests across 3 test files:
- `test_ingestion.py` — PDF/DOCX parsing, chunking bounds, overlap, context injection, sentence boundaries
- `test_retrieval.py` — BM25 persistence, RRF fusion correctness, hybrid vs vector-only comparison, reranker fallback
- `test_generation.py` — Citation parsing, cosine similarity, hallucination warning triggers, INSUFFICIENT_CONTEXT handling

All external APIs (OpenAI, Anthropic, Cohere, Chroma) are mocked — tests run without API keys.
```
documind/
├── ingestion/
│   ├── parsers.py            # PDF/DOCX/URL → raw text + page metadata
│   ├── chunker.py            # Semantic chunking (800 tok, 150 overlap, spaCy)
│   └── embedder.py           # Batch embedding with retry + progress bar
├── retrieval/
│   ├── vector_store.py       # Chroma wrapper: tenant-namespaced collections
│   ├── bm25_index.py         # BM25 index: build, persist, query, LRU cache
│   ├── hybrid_retriever.py   # RRF fusion of vector + BM25 results
│   └── reranker.py           # Cohere Rerank with graceful fallback
├── generation/
│   ├── query_rewriter.py     # Question → 3 search variants (Claude Haiku)
│   ├── answer_generator.py   # Grounded answer + confidence (Claude Sonnet)
│   └── citation_validator.py # Claim-to-chunk similarity verification
├── api/
│   └── main.py               # FastAPI: upload, query, tenant management
├── evaluation/
│   ├── ragas_eval.py         # RAGAS faithfulness + relevancy + precision
│   └── fixtures/
│       ├── generate_contract.py # Generates 15-page test PDF
│       └── eval_questions.json  # 10 Q&A pairs with ground truth
├── tests/
│   ├── test_ingestion.py     # Parser + chunker unit tests
│   ├── test_retrieval.py     # Hybrid retrieval accuracy tests
│   └── test_generation.py    # Citation validation tests
├── models.py                 # Pydantic v2 data models
├── config.py                 # Settings from environment variables
├── docker-compose.yml        # FastAPI + Chroma + Redis
├── Dockerfile
├── requirements.txt          # Pinned dependency versions
└── .env.example
```
| Model | Purpose | Calls | Notes |
|---|---|---|---|
| Claude Haiku | Query rewriting | 1 | 3 search variants |
| Claude Haiku | Confidence scoring | 1 | Structured JSON output |
| Claude Sonnet | Answer generation | 1 | With citation instructions |
| OpenAI text-embedding-3-small | Query + citation embeddings | 1-2 | Batched |
| Cohere rerank-v3.5 | Chunk reranking | 1 | Falls back to RRF if unavailable |
Token usage is logged per query with estimated cost in USD.
MIT