
DocuMind — Enterprise Document Intelligence System

Production-ready RAG (Retrieval-Augmented Generation) pipeline for law firms, consulting companies, and research teams working with 100+ page document corpora. Upload documents, ask questions, get cited answers with hallucination protection.

Key Features

  • Hybrid Search: Vector (Chroma) + BM25 keyword search fused with Reciprocal Rank Fusion for superior retrieval accuracy
  • 3-Layer Hallucination Guard: Retrieval confidence check → generation constraint → citation verification
  • Inline Citations: Every factual claim tagged with [S1]-[S5] source references, each verified via embedding similarity
  • Multi-Tenant Isolation: JWT-based auth, separate Chroma collections and BM25 indexes per tenant — zero data leakage
  • Semantic Chunking: spaCy sentence boundaries, 800-token target with 150-token overlap, context injection per chunk
  • Supported Formats: PDF (PyMuPDF + OCR fallback), DOCX (heading-aware), URL (Firecrawl)
  • RAGAS Evaluation: Built-in faithfulness, answer relevancy, and context precision scoring

Architecture

```mermaid
graph TB
    subgraph "Ingestion Pipeline"
        A[PDF / DOCX / URL] --> B[Parsers<br/>PyMuPDF · python-docx · Firecrawl]
        B --> C[Semantic Chunker<br/>spaCy · 800 tokens · 150 overlap]
        C --> D[Batch Embedder<br/>OpenAI text-embedding-3-small]
        D --> E[(Chroma Vector DB)]
        D --> F[(BM25 Index)]
    end

    subgraph "Query Pipeline"
        G[User Question] --> H[Query Rewriter<br/>Claude Haiku → 3 variants]
        H --> I[Hybrid Retriever]
        I --> J[Dense: Chroma Top-20]
        I --> K[Sparse: BM25 Top-20]
        J --> L[RRF Fusion k=60]
        K --> L
        L -->|Top 20| M[Cohere Reranker]
        M -->|Top 5| N{"Relevance > 0.4?"}
        N -->|No| O[No relevant content found]
        N -->|Yes| P[Answer Generator<br/>Claude Sonnet]
        P --> Q["Citation Validator<br/>Cosine similarity > 0.75"]
        Q --> R[Response with<br/>inline citations]
    end

    subgraph "Multi-Tenant Isolation"
        S[Tenant A Collection] -.-> E
        T[Tenant B Collection] -.-> E
        S2[Tenant A Index] -.-> F
        T2[Tenant B Index] -.-> F
    end
```

How It Works

Document Ingestion

  1. Parse: Extract text page-by-page preserving structure. Claude Haiku extracts document title and section headings.
  2. Chunk: Split into ~800-token segments using spaCy sentence boundaries (never mid-sentence). Each chunk gets a context prefix: [Document: {title} | Section: {heading}]
  3. Embed: Batch embed via OpenAI text-embedding-3-small with exponential backoff retry.
  4. Store: Vectors → tenant-namespaced Chroma collection. Text → tenant-specific BM25 pickle index.
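The chunking step above can be sketched in a few lines. This is a dependency-free approximation: it splits on a naive regex in place of spaCy's sentence segmenter and counts whitespace words in place of tiktoken tokens, but the packing, overlap carry-over, and context-prefix logic mirror the description:

```python
import re

def chunk_text(text, title, heading, target=800, overlap=150):
    """Pack sentences into ~target-token chunks, carry ~overlap tokens
    forward between consecutive chunks, and prepend a context prefix.
    Sentence splitting and token counting are simplified stand-ins for
    spaCy and tiktoken (cl100k_base)."""
    sents = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def toks(s):
        return len(s.split())  # word count approximates token count

    chunks, cur, cur_n = [], [], 0
    for sent in sents:
        n = toks(sent)
        if cur and cur_n + n > target:
            chunks.append(" ".join(cur))
            # walk back from the end of the flushed chunk until ~overlap
            # tokens have been collected, then start the next chunk there
            carried, carried_n = [], 0
            for prev in reversed(cur):
                carried.insert(0, prev)
                carried_n += toks(prev)
                if carried_n >= overlap:
                    break
            cur, cur_n = carried, carried_n
        cur.append(sent)
        cur_n += n
    if cur:
        chunks.append(" ".join(cur))
    prefix = f"[Document: {title} | Section: {heading}] "
    return [prefix + c for c in chunks]
```

Because sentences are never split, each chunk lands near, not exactly at, the 800-token target.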

Query Processing

  1. Rewrite: Claude Haiku generates 3 query variants (literal, semantic expansion, entity-focused)
  2. Retrieve: Each variant runs against both Chroma (dense) and BM25 (sparse) → 6 ranked lists
  3. Fuse: Reciprocal Rank Fusion (k=60) merges all lists → top 20 unique chunks
  4. Rerank: Cohere rerank-v3.5 scores chunks against original question → top 5
  5. Guard Layer 1: If best chunk relevance < 0.4 → return "no relevant content" (skip LLM)
  6. Generate: Claude Sonnet produces answer with mandatory [Sx] citations per claim
  7. Guard Layer 2: System prompt enforces INSUFFICIENT_CONTEXT when sources don't cover the question
  8. Guard Layer 3: Each citation verified via cosine similarity (claim embedding vs chunk embedding). Below 0.75 → flagged. Over 30% unverified → hallucination warning + confidence penalty
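The fusion in step 3 is compact: with 3 query variants and 2 retrievers there are 6 ranked lists of chunk ids, and RRF sums a reciprocal-rank score per id across all of them (a sketch; the real code lives in retrieval/hybrid_retriever.py):

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60, top_n=20):
    """Reciprocal Rank Fusion: each list contributes 1/(k + rank) per
    chunk id (rank is 1-based); scores sum across lists, so chunks that
    appear near the top of several lists win."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

The constant k=60 dampens the gap between adjacent ranks, which is why consistent mid-rank chunks can outscore a chunk that tops only one list.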

Tech Stack

| Component | Technology |
|---|---|
| API Framework | FastAPI + Uvicorn |
| Vector Database | ChromaDB (persistent mode) |
| Sparse Index | rank-bm25 (pickle persistence) |
| Embeddings | OpenAI text-embedding-3-small |
| Answer Generation | Claude Sonnet (claude-sonnet-4-6) |
| Query Rewriting | Claude Haiku (claude-haiku-4-5-20251001) |
| Reranking | Cohere rerank-v3.5 |
| PDF Parsing | PyMuPDF + Unstructured (OCR fallback) |
| DOCX Parsing | python-docx |
| URL Parsing | Firecrawl |
| NLP | spaCy (sentence segmentation) |
| Tokenization | tiktoken (cl100k_base) |
| Auth | PyJWT (HS256) |
| Cache | Redis |
| Containerization | Docker Compose |

Quick Start

Prerequisites

  • Docker & Docker Compose
  • API keys: OpenAI and Anthropic required; Cohere optional (reranking falls back to RRF order without it)

1. Environment Setup

```bash
cp .env.example .env
# Fill in your API keys
```

Required environment variables:

| Variable | Description |
|---|---|
| OPENAI_API_KEY | OpenAI API key for embeddings |
| ANTHROPIC_API_KEY | Anthropic API key for Claude Haiku/Sonnet |
| COHERE_API_KEY | Cohere API key for reranking (optional — falls back to RRF order) |
| JWT_SECRET | Secret key for JWT token signing (change from default) |

2. Run with Docker Compose

```bash
docker-compose up -d
```

| Service | Port | Description |
|---|---|---|
| FastAPI | 8080 | Main API server |
| Chroma | 8000 | Vector database (persistent) |
| Redis | 6379 | BM25 index cache |

3. Local Development

```bash
pip install -r requirements.txt
python -m spacy download en_core_web_sm

# Start Chroma and Redis separately, then:
uvicorn api.main:app --host 0.0.0.0 --port 8080 --reload
```

API Reference

POST /tenants — Create a Tenant

```bash
curl -X POST http://localhost:8080/tenants \
  -H "Content-Type: application/json" \
  -d '{"name": "Acme Law Firm"}'
```

Response:

```json
{
  "tenant_id": "a1b2c3d4-...",
  "api_key": "eyJhbGciOiJIUzI1NiIs..."
}
```

POST /documents — Upload a Document

File upload:

```bash
curl -X POST http://localhost:8080/documents \
  -H "Authorization: Bearer <api_key>" \
  -F "file=@contract.pdf"
```

URL ingestion:

```bash
curl -X POST http://localhost:8080/documents \
  -H "Authorization: Bearer <api_key>" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/report.html"}'
```

Response:

```json
{
  "doc_id": "d5e6f7g8-...",
  "chunk_count": 0,
  "status": "processing"
}
```

Processing happens in the background. Poll GET /documents to check when status becomes "ready".
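The polling loop can be sketched as a small helper. The `doc_id` and `status` field names are taken from the responses above; a terminal `failed` status is an assumption, and `fetch_docs` stands in for whatever HTTP client you use to call GET /documents:

```python
import time

def wait_until_ready(fetch_docs, doc_id, timeout=120, interval=2.0):
    """Poll the document listing (via the supplied fetch_docs callable,
    which returns the GET /documents payload as a list of dicts) until
    the given doc reports status 'ready'. Raises on timeout, and on a
    hypothetical 'failed' status."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        for doc in fetch_docs():
            if doc["doc_id"] == doc_id:
                if doc["status"] == "ready":
                    return doc
                if doc["status"] == "failed":
                    raise RuntimeError(f"ingestion failed for {doc_id}")
        time.sleep(interval)
    raise TimeoutError(f"{doc_id} not ready after {timeout}s")
```

With requests installed, `fetch_docs` could be `lambda: requests.get(url, headers=auth).json()`.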

POST /query — Ask a Question

```bash
curl -X POST http://localhost:8080/query \
  -H "Authorization: Bearer <api_key>" \
  -H "Content-Type: application/json" \
  -d '{"question": "What are the payment terms in the contract?", "top_k": 5}'
```

Response:

```json
{
  "question": "What are the payment terms in the contract?",
  "rewritten_queries": [
    "payment terms contract",
    "financial obligations billing schedule fees",
    "Acme Corp payment net-30 $150,000"
  ],
  "answer": "The payment terms require annual fees of $150,000, payable net-30 from invoice date [S1]. Late payments incur a 1.5% monthly interest charge [S3].",
  "citations": [
    {
      "citation_id": 1,
      "chunk_id": "...",
      "claim_text": "The payment terms require annual fees of $150,000, payable net-30 from invoice date",
      "source_similarity": 0.92,
      "verified": true
    }
  ],
  "confidence": 0.95,
  "hallucination_warning": false
}
```
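The same call from Python, using only the standard library (endpoint path and field names as in the curl example above):

```python
import json
import urllib.request

def build_query_request(base_url: str, api_key: str,
                        question: str, top_k: int = 5):
    """Build the POST /query request with the Bearer auth header."""
    return urllib.request.Request(
        f"{base_url}/query",
        data=json.dumps({"question": question, "top_k": top_k}).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

def ask(base_url: str, api_key: str, question: str, top_k: int = 5) -> dict:
    """POST the question and return the parsed JSON response."""
    req = build_query_request(base_url, api_key, question, top_k)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

A caller would then check `answer`, `citations`, and `hallucination_warning` on the returned dict.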

GET /documents — List Documents

```bash
curl http://localhost:8080/documents \
  -H "Authorization: Bearer <api_key>"
```

DELETE /documents/{doc_id} — Delete a Document

```bash
curl -X DELETE http://localhost:8080/documents/<doc_id> \
  -H "Authorization: Bearer <api_key>"
```

Hallucination Guard — 3 Layers

| Layer | Mechanism | Trigger | Action |
|---|---|---|---|
| 1 — Retrieval confidence | Top reranked chunk relevance score | Score < 0.4 | Skip generation, return null answer |
| 2 — Generation constraint | System prompt forces [Sx] citations | Sources insufficient | Return INSUFFICIENT_CONTEXT |
| 3 — Citation validation | Cosine similarity: claim vs source chunk | Similarity < 0.75 | Flag as unverified; >30% unverified → warning + confidence penalty |
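Layer 3 reduces to one cosine comparison per claim plus the 30% ratio check. A minimal sketch, assuming the claim and source-chunk embeddings (from text-embedding-3-small) are already computed:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def validate_citations(claims, threshold=0.75, max_unverified=0.30):
    """claims: list of (claim_embedding, chunk_embedding) pairs.
    Returns a per-claim verified flag and whether the hallucination
    warning fires (more than max_unverified of claims below threshold)."""
    verified = [cosine(c, s) >= threshold for c, s in claims]
    unverified = 1 - sum(verified) / len(verified) if verified else 0.0
    return verified, unverified > max_unverified
```

The confidence penalty applied alongside the warning is a separate concern and is omitted here.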

Multi-Tenant Isolation

Every operation is scoped to tenant_id extracted from the verified JWT — never from client input:

  • Chroma: Separate collection per tenant (tenant_{tenant_id}_docs)
  • BM25: Separate index file per tenant (./indexes/{tenant_id}.pkl)
  • API: All endpoints derive tenant_id from Bearer token, not request body/params
  • Deletion: Removing a tenant destroys its Chroma collection and BM25 index
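The tenant lookup amounts to verifying the HS256 signature before trusting any claim. The service uses PyJWT (`jwt.decode`); the stdlib sketch below makes the check explicit, and the `tenant_id` claim name is an assumption:

```python
import base64
import hashlib
import hmac
import json

def _b64url_decode(s: str) -> bytes:
    """Base64url decode, restoring stripped padding."""
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

def tenant_from_token(token: str, secret: str) -> str:
    """Verify an HS256 JWT and return its tenant_id claim.
    The signature is checked before the payload is trusted, so the
    tenant id can never be spoofed from client input."""
    header_b64, payload_b64, sig_b64 = token.split(".")
    signing_input = f"{header_b64}.{payload_b64}".encode()
    expected = hmac.new(secret.encode(), signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise ValueError("invalid JWT signature")
    return json.loads(_b64url_decode(payload_b64))["tenant_id"]
```

The verified id then drives every downstream name, e.g. the Chroma collection `f"tenant_{tenant_id}_docs"`.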

Evaluation

Generate Test Fixtures

```bash
# Generate a realistic 15-page service agreement PDF
python evaluation/fixtures/generate_contract.py
```

The generated contract includes: parties (Acme Corp & TechFlow Solutions), payment terms ($150K/year, net-30), termination clauses (90-day notice), liability cap ($500K), confidentiality (5-year survival), SLA (99.5% uptime), and governing law (Delaware).

Run RAGAS Evaluation

```bash
python -m evaluation.ragas_eval --tenant test --questions evaluation/fixtures/eval_questions.json
```

10 evaluation questions: 8 answerable from the contract + 2 intentionally unanswerable (tests hallucination guard).

Results written to evaluation/eval_results.md:

| Metric | Target | Description |
|---|---|---|
| Faithfulness | > 0.90 | Claims in answer grounded in retrieved context |
| Answer Relevancy | > 0.85 | Answer addresses the question asked |
| Context Precision | > 0.88 | Retrieved chunks actually relevant to question |

Testing

```bash
pytest tests/ -v
```

87+ tests across 3 test files:

  • test_ingestion.py — PDF/DOCX parsing, chunking bounds, overlap, context injection, sentence boundaries
  • test_retrieval.py — BM25 persistence, RRF fusion correctness, hybrid vs vector-only comparison, reranker fallback
  • test_generation.py — Citation parsing, cosine similarity, hallucination warning triggers, INSUFFICIENT_CONTEXT handling

All external APIs (OpenAI, Anthropic, Cohere, Chroma) are mocked — tests run without API keys.

Project Structure

```text
documind/
├── ingestion/
│   ├── parsers.py              # PDF/DOCX/URL → raw text + page metadata
│   ├── chunker.py              # Semantic chunking (800 tok, 150 overlap, spaCy)
│   └── embedder.py             # Batch embedding with retry + progress bar
├── retrieval/
│   ├── vector_store.py         # Chroma wrapper: tenant-namespaced collections
│   ├── bm25_index.py           # BM25 index: build, persist, query, LRU cache
│   ├── hybrid_retriever.py     # RRF fusion of vector + BM25 results
│   └── reranker.py             # Cohere Rerank with graceful fallback
├── generation/
│   ├── query_rewriter.py       # Question → 3 search variants (Claude Haiku)
│   ├── answer_generator.py     # Grounded answer + confidence (Claude Sonnet)
│   └── citation_validator.py   # Claim-to-chunk similarity verification
├── api/
│   └── main.py                 # FastAPI: upload, query, tenant management
├── evaluation/
│   ├── ragas_eval.py           # RAGAS faithfulness + relevancy + precision
│   └── fixtures/
│       ├── generate_contract.py    # Generates 15-page test PDF
│       └── eval_questions.json     # 10 Q&A pairs with ground truth
├── tests/
│   ├── test_ingestion.py       # Parser + chunker unit tests
│   ├── test_retrieval.py       # Hybrid retrieval accuracy tests
│   └── test_generation.py      # Citation validation tests
├── models.py                   # Pydantic v2 data models
├── config.py                   # Settings from environment variables
├── docker-compose.yml          # FastAPI + Chroma + Redis
├── Dockerfile
├── requirements.txt            # Pinned dependency versions
└── .env.example
```

LLM Usage Per Query

| Model | Purpose | Calls | Notes |
|---|---|---|---|
| Claude Haiku | Query rewriting | 1 | 3 search variants |
| Claude Haiku | Confidence scoring | 1 | Structured JSON output |
| Claude Sonnet | Answer generation | 1 | With citation instructions |
| OpenAI text-embedding-3-small | Query + citation embeddings | 1-2 | Batched |
| Cohere rerank-v3.5 | Chunk reranking | 1 | Falls back to RRF if unavailable |

Token usage is logged per query with estimated cost in USD.

License

MIT