
DocuMind — Enterprise Document Intelligence System

Production-ready RAG (Retrieval-Augmented Generation) pipeline for law firms, consulting companies, and research teams working with 100+ page document corpora. Upload documents, ask questions, get cited answers with hallucination protection.

Key Features

  • Hybrid Search: Vector (Chroma) + BM25 keyword search fused with Reciprocal Rank Fusion for superior retrieval accuracy
  • 3-Layer Hallucination Guard: Retrieval confidence check → generation constraint → citation verification
  • Inline Citations: Every factual claim tagged with [S1]-[S5] source references, each verified via embedding similarity
  • Multi-Tenant Isolation: JWT-based auth, separate Chroma collections and BM25 indexes per tenant — zero data leakage
  • Semantic Chunking: spaCy sentence boundaries, 800-token target with 150-token overlap, context injection per chunk
  • Supported Formats: PDF (PyMuPDF + OCR fallback), DOCX (heading-aware), URL (Firecrawl)
  • RAGAS Evaluation: Built-in faithfulness, answer relevancy, and context precision scoring

Architecture

```mermaid
graph TB
    subgraph "Ingestion Pipeline"
        A[PDF / DOCX / URL] --> B[Parsers<br/>PyMuPDF · python-docx · Firecrawl]
        B --> C[Semantic Chunker<br/>spaCy · 800 tokens · 150 overlap]
        C --> D[Batch Embedder<br/>OpenAI text-embedding-3-small]
        D --> E[(Chroma Vector DB)]
        D --> F[(BM25 Index)]
    end

    subgraph "Query Pipeline"
        G[User Question] --> H[Query Rewriter<br/>Claude Haiku → 3 variants]
        H --> I[Hybrid Retriever]
        I --> J[Dense: Chroma Top-20]
        I --> K[Sparse: BM25 Top-20]
        J --> L[RRF Fusion k=60]
        K --> L
        L -->|Top 20| M[Cohere Reranker]
        M -->|Top 5| N{"Relevance > 0.4?"}
        N -->|No| O[No relevant content found]
        N -->|Yes| P[Answer Generator<br/>Claude Sonnet]
        P --> Q["Citation Validator<br/>Cosine similarity > 0.75"]
        Q --> R[Response with<br/>inline citations]
    end

    subgraph "Multi-Tenant Isolation"
        S[Tenant A Collection] -.-> E
        T[Tenant B Collection] -.-> E
        S2[Tenant A Index] -.-> F
        T2[Tenant B Index] -.-> F
    end
```

How It Works

Document Ingestion

  1. Parse: Extract text page-by-page preserving structure. Claude Haiku extracts document title and section headings.
  2. Chunk: Split into ~800-token segments using spaCy sentence boundaries (never mid-sentence). Each chunk gets a context prefix: [Document: {title} | Section: {heading}]
  3. Embed: Batch embed via OpenAI text-embedding-3-small with exponential backoff retry.
  4. Store: Vectors → tenant-namespaced Chroma collection. Text → tenant-specific BM25 pickle index.
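The chunking step above can be sketched in a few lines. This is a dependency-free approximation: it splits on a naive regex in place of spaCy's sentence segmenter and counts whitespace words in place of tiktoken tokens, but the packing, overlap carry-over, and context-prefix logic mirror the description:

```python
import re

def chunk_text(text, title, heading, target=800, overlap=150):
    """Pack sentences into ~target-token chunks, carry ~overlap tokens
    forward between consecutive chunks, and prepend a context prefix.
    Sentence splitting and token counting are simplified stand-ins for
    spaCy and tiktoken (cl100k_base)."""
    sents = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def toks(s):
        return len(s.split())  # word count approximates token count

    chunks, cur, cur_n = [], [], 0
    for sent in sents:
        n = toks(sent)
        if cur and cur_n + n > target:
            chunks.append(" ".join(cur))
            # walk back from the end of the flushed chunk until ~overlap
            # tokens have been collected, then start the next chunk there
            carried, carried_n = [], 0
            for prev in reversed(cur):
                carried.insert(0, prev)
                carried_n += toks(prev)
                if carried_n >= overlap:
                    break
            cur, cur_n = carried, carried_n
        cur.append(sent)
        cur_n += n
    if cur:
        chunks.append(" ".join(cur))
    prefix = f"[Document: {title} | Section: {heading}] "
    return [prefix + c for c in chunks]
```

Because sentences are never split, each chunk lands near, not exactly at, the 800-token target.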

Query Processing

  1. Rewrite: Claude Haiku generates 3 query variants (literal, semantic expansion, entity-focused)
  2. Retrieve: Each variant runs against both Chroma (dense) and BM25 (sparse) → 6 ranked lists
  3. Fuse: Reciprocal Rank Fusion (k=60) merges all lists → top 20 unique chunks
  4. Rerank: Cohere rerank-v3.5 scores chunks against original question → top 5
  5. Guard Layer 1: If best chunk relevance < 0.4 → return "no relevant content" (skip LLM)
  6. Generate: Claude Sonnet produces answer with mandatory [Sx] citations per claim
  7. Guard Layer 2: System prompt enforces INSUFFICIENT_CONTEXT when sources don't cover the question
  8. Guard Layer 3: Each citation verified via cosine similarity (claim embedding vs chunk embedding). Below 0.75 → flagged. Over 30% unverified → hallucination warning + confidence penalty
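The fusion in step 3 is compact: with 3 query variants and 2 retrievers there are 6 ranked lists of chunk ids, and RRF sums a reciprocal-rank score per id across all of them (a sketch; the real code lives in retrieval/hybrid_retriever.py):

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60, top_n=20):
    """Reciprocal Rank Fusion: each list contributes 1/(k + rank) per
    chunk id (rank is 1-based); scores sum across lists, so chunks that
    appear near the top of several lists win."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

The constant k=60 dampens the gap between adjacent ranks, which is why consistent mid-rank chunks can outscore a chunk that tops only one list.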

Tech Stack

| Component | Technology |
|---|---|
| API Framework | FastAPI + Uvicorn |
| Vector Database | ChromaDB (persistent mode) |
| Sparse Index | rank-bm25 (pickle persistence) |
| Embeddings | OpenAI text-embedding-3-small |
| Answer Generation | Claude Sonnet (claude-sonnet-4-6) |
| Query Rewriting | Claude Haiku (claude-haiku-4-5-20251001) |
| Reranking | Cohere rerank-v3.5 |
| PDF Parsing | PyMuPDF + Unstructured (OCR fallback) |
| DOCX Parsing | python-docx |
| URL Parsing | Firecrawl |
| NLP | spaCy (sentence segmentation) |
| Tokenization | tiktoken (cl100k_base) |
| Auth | PyJWT (HS256) |
| Cache | Redis |
| Containerization | Docker Compose |

Quick Start

Prerequisites

  • Docker & Docker Compose
  • API keys: OpenAI and Anthropic required; Cohere optional (reranking falls back to RRF order without it)

1. Environment Setup

```bash
cp .env.example .env
# Fill in your API keys
```

Required environment variables:

| Variable | Description |
|---|---|
| OPENAI_API_KEY | OpenAI API key for embeddings |
| ANTHROPIC_API_KEY | Anthropic API key for Claude Haiku/Sonnet |
| COHERE_API_KEY | Cohere API key for reranking (optional — falls back to RRF order) |
| JWT_SECRET | Secret key for JWT token signing (change from default) |

2. Run with Docker Compose

```bash
docker-compose up -d
```

| Service | Port | Description |
|---|---|---|
| FastAPI | 8080 | Main API server |
| Chroma | 8000 | Vector database (persistent) |
| Redis | 6379 | BM25 index cache |

3. Local Development

```bash
pip install -r requirements.txt
python -m spacy download en_core_web_sm

# Start Chroma and Redis separately, then:
uvicorn api.main:app --host 0.0.0.0 --port 8080 --reload
```

API Reference

POST /tenants — Create a Tenant

```bash
curl -X POST http://localhost:8080/tenants \
  -H "Content-Type: application/json" \
  -d '{"name": "Acme Law Firm"}'
```

Response:

```json
{
  "tenant_id": "a1b2c3d4-...",
  "api_key": "eyJhbGciOiJIUzI1NiIs..."
}
```

POST /documents — Upload a Document

File upload:

```bash
curl -X POST http://localhost:8080/documents \
  -H "Authorization: Bearer <api_key>" \
  -F "file=@contract.pdf"
```

URL ingestion:

```bash
curl -X POST http://localhost:8080/documents \
  -H "Authorization: Bearer <api_key>" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/report.html"}'
```

Response:

```json
{
  "doc_id": "d5e6f7g8-...",
  "chunk_count": 0,
  "status": "processing"
}
```

Processing happens in the background. Poll GET /documents to check when status becomes "ready".
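The polling loop can be sketched as a small helper. The `doc_id` and `status` field names are taken from the responses above; a terminal `failed` status is an assumption, and `fetch_docs` stands in for whatever HTTP client you use to call GET /documents:

```python
import time

def wait_until_ready(fetch_docs, doc_id, timeout=120, interval=2.0):
    """Poll the document listing (via the supplied fetch_docs callable,
    which returns the GET /documents payload as a list of dicts) until
    the given doc reports status 'ready'. Raises on timeout, and on a
    hypothetical 'failed' status."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        for doc in fetch_docs():
            if doc["doc_id"] == doc_id:
                if doc["status"] == "ready":
                    return doc
                if doc["status"] == "failed":
                    raise RuntimeError(f"ingestion failed for {doc_id}")
        time.sleep(interval)
    raise TimeoutError(f"{doc_id} not ready after {timeout}s")
```

With requests installed, `fetch_docs` could be `lambda: requests.get(url, headers=auth).json()`.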

POST /query — Ask a Question

```bash
curl -X POST http://localhost:8080/query \
  -H "Authorization: Bearer <api_key>" \
  -H "Content-Type: application/json" \
  -d '{"question": "What are the payment terms in the contract?", "top_k": 5}'
```

Response:

```json
{
  "question": "What are the payment terms in the contract?",
  "rewritten_queries": [
    "payment terms contract",
    "financial obligations billing schedule fees",
    "Acme Corp payment net-30 $150,000"
  ],
  "answer": "The payment terms require annual fees of $150,000, payable net-30 from invoice date [S1]. Late payments incur a 1.5% monthly interest charge [S3].",
  "citations": [
    {
      "citation_id": 1,
      "chunk_id": "...",
      "claim_text": "The payment terms require annual fees of $150,000, payable net-30 from invoice date",
      "source_similarity": 0.92,
      "verified": true
    }
  ],
  "confidence": 0.95,
  "hallucination_warning": false
}
```
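The same call from Python, using only the standard library (endpoint path and field names as in the curl example above):

```python
import json
import urllib.request

def build_query_request(base_url: str, api_key: str,
                        question: str, top_k: int = 5):
    """Build the POST /query request with the Bearer auth header."""
    return urllib.request.Request(
        f"{base_url}/query",
        data=json.dumps({"question": question, "top_k": top_k}).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

def ask(base_url: str, api_key: str, question: str, top_k: int = 5) -> dict:
    """POST the question and return the parsed JSON response."""
    req = build_query_request(base_url, api_key, question, top_k)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

A caller would then check `answer`, `citations`, and `hallucination_warning` on the returned dict.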

GET /documents — List Documents

```bash
curl http://localhost:8080/documents \
  -H "Authorization: Bearer <api_key>"
```

DELETE /documents/{doc_id} — Delete a Document

```bash
curl -X DELETE http://localhost:8080/documents/<doc_id> \
  -H "Authorization: Bearer <api_key>"
```

Hallucination Guard — 3 Layers

| Layer | Mechanism | Trigger | Action |
|---|---|---|---|
| 1 — Retrieval confidence | Top reranked chunk relevance score | Score < 0.4 | Skip generation, return null answer |
| 2 — Generation constraint | System prompt forces [Sx] citations | Sources insufficient | Return INSUFFICIENT_CONTEXT |
| 3 — Citation validation | Cosine similarity: claim vs source chunk | Similarity < 0.75 | Flag as unverified; >30% unverified → warning + confidence penalty |
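Layer 3 reduces to one cosine comparison per claim plus the 30% ratio check. A minimal sketch, assuming the claim and source-chunk embeddings (from text-embedding-3-small) are already computed:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def validate_citations(claims, threshold=0.75, max_unverified=0.30):
    """claims: list of (claim_embedding, chunk_embedding) pairs.
    Returns a per-claim verified flag and whether the hallucination
    warning fires (more than max_unverified of claims below threshold)."""
    verified = [cosine(c, s) >= threshold for c, s in claims]
    unverified = 1 - sum(verified) / len(verified) if verified else 0.0
    return verified, unverified > max_unverified
```

The confidence penalty applied alongside the warning is a separate concern and is omitted here.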

Multi-Tenant Isolation

Every operation is scoped to tenant_id extracted from the verified JWT — never from client input:

  • Chroma: Separate collection per tenant (tenant_{tenant_id}_docs)
  • BM25: Separate index file per tenant (./indexes/{tenant_id}.pkl)
  • API: All endpoints derive tenant_id from Bearer token, not request body/params
  • Deletion: Removing a tenant destroys its Chroma collection and BM25 index
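The tenant lookup amounts to verifying the HS256 signature before trusting any claim. The service uses PyJWT (`jwt.decode`); the stdlib sketch below makes the check explicit, and the `tenant_id` claim name is an assumption:

```python
import base64
import hashlib
import hmac
import json

def _b64url_decode(s: str) -> bytes:
    """Base64url decode, restoring stripped padding."""
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

def tenant_from_token(token: str, secret: str) -> str:
    """Verify an HS256 JWT and return its tenant_id claim.
    The signature is checked before the payload is trusted, so the
    tenant id can never be spoofed from client input."""
    header_b64, payload_b64, sig_b64 = token.split(".")
    signing_input = f"{header_b64}.{payload_b64}".encode()
    expected = hmac.new(secret.encode(), signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise ValueError("invalid JWT signature")
    return json.loads(_b64url_decode(payload_b64))["tenant_id"]
```

The verified id then drives every downstream name, e.g. the Chroma collection `f"tenant_{tenant_id}_docs"`.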

Evaluation

Generate Test Fixtures

```bash
# Generate a realistic 15-page service agreement PDF
python evaluation/fixtures/generate_contract.py
```

The generated contract includes: parties (Acme Corp & TechFlow Solutions), payment terms ($150K/year, net-30), termination clauses (90-day notice), liability cap ($500K), confidentiality (5-year survival), SLA (99.5% uptime), and governing law (Delaware).

Run RAGAS Evaluation

```bash
python -m evaluation.ragas_eval --tenant test --questions evaluation/fixtures/eval_questions.json
```

10 evaluation questions: 8 answerable from the contract + 2 intentionally unanswerable (tests hallucination guard).

Results written to evaluation/eval_results.md:

| Metric | Target | Description |
|---|---|---|
| Faithfulness | > 0.90 | Claims in answer grounded in retrieved context |
| Answer Relevancy | > 0.85 | Answer addresses the question asked |
| Context Precision | > 0.88 | Retrieved chunks actually relevant to question |

Testing

```bash
pytest tests/ -v
```

87+ tests across 3 test files:

  • test_ingestion.py — PDF/DOCX parsing, chunking bounds, overlap, context injection, sentence boundaries
  • test_retrieval.py — BM25 persistence, RRF fusion correctness, hybrid vs vector-only comparison, reranker fallback
  • test_generation.py — Citation parsing, cosine similarity, hallucination warning triggers, INSUFFICIENT_CONTEXT handling

All external APIs (OpenAI, Anthropic, Cohere, Chroma) are mocked — tests run without API keys.

Project Structure

```text
documind/
├── ingestion/
│   ├── parsers.py              # PDF/DOCX/URL → raw text + page metadata
│   ├── chunker.py              # Semantic chunking (800 tok, 150 overlap, spaCy)
│   └── embedder.py             # Batch embedding with retry + progress bar
├── retrieval/
│   ├── vector_store.py         # Chroma wrapper: tenant-namespaced collections
│   ├── bm25_index.py           # BM25 index: build, persist, query, LRU cache
│   ├── hybrid_retriever.py     # RRF fusion of vector + BM25 results
│   └── reranker.py             # Cohere Rerank with graceful fallback
├── generation/
│   ├── query_rewriter.py       # Question → 3 search variants (Claude Haiku)
│   ├── answer_generator.py     # Grounded answer + confidence (Claude Sonnet)
│   └── citation_validator.py   # Claim-to-chunk similarity verification
├── api/
│   └── main.py                 # FastAPI: upload, query, tenant management
├── evaluation/
│   ├── ragas_eval.py           # RAGAS faithfulness + relevancy + precision
│   └── fixtures/
│       ├── generate_contract.py    # Generates 15-page test PDF
│       └── eval_questions.json     # 10 Q&A pairs with ground truth
├── tests/
│   ├── test_ingestion.py       # Parser + chunker unit tests
│   ├── test_retrieval.py       # Hybrid retrieval accuracy tests
│   └── test_generation.py      # Citation validation tests
├── models.py                   # Pydantic v2 data models
├── config.py                   # Settings from environment variables
├── docker-compose.yml          # FastAPI + Chroma + Redis
├── Dockerfile
├── requirements.txt            # Pinned dependency versions
└── .env.example
```

LLM Usage Per Query

| Model | Purpose | Calls | Notes |
|---|---|---|---|
| Claude Haiku | Query rewriting | 1 | 3 search variants |
| Claude Haiku | Confidence scoring | 1 | Structured JSON output |
| Claude Sonnet | Answer generation | 1 | With citation instructions |
| OpenAI text-embedding-3-small | Query + citation embeddings | 1-2 | Batched |
| Cohere rerank-v3.5 | Chunk reranking | 1 | Falls back to RRF if unavailable |

Token usage is logged per query with estimated cost in USD.

License

MIT