A hybrid RAG stack: dense embeddings (LanceDB) plus BM25 (Tantivy) with Reciprocal Rank Fusion, then cross-encoder reranking with Qwen3-Reranker-8B. Plus a sentence-transformers fine-tuning recipe for adapting Qwen3-Embedding-8B to your own corpus.
Built originally for a personal knowledge-base RAG running on a DGX Spark, serving a multi-domain corpus of technical documentation, personal notes, and reference material.
| File | Purpose |
|---|---|
| `rag_server.py` | FastAPI server: `/embed`, `/rerank`, `/search` (hybrid + RRF + rerank), `/health`, `/metrics` |
| `rag_eval.py` | Eval harness: runs a list of queries, scores keyword hits, writes per-query JSON + summary CSV |
| `finetune_embedding.py` | Fine-tune Qwen3-Embedding-8B (or any sentence-transformers base) on query/positive pairs with MultipleNegativesRankingLoss |
| `eval_queries.example.json` | Schema example for eval queries |
```
query
  |
  +----> embed (dense) -----+
  |                         |
  +----> BM25 (tantivy) ----+
                            |
                            v
                   RRF fusion (k=60)
                            |
                            v
                   top N candidates
                            |
                            v
             Qwen3-Reranker-8B cross-encoder
                            |
                            v
                       final top_k
```
- Dense: LanceDB + Qwen3-Embedding-8B (4096-d), optionally fine-tuned on your corpus
- Sparse: Tantivy BM25 with English stemming
- Fusion: RRF with configurable `k` (default 60); a minimal sketch follows this list
- Reranking: Qwen3-Reranker-8B scoring a yes/no logit on (query, doc) pairs
- Optional: per-domain filtering, max-per-source diversification, domain boosts
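For reference, here is a minimal sketch of the fusion step. The function below is illustrative rather than lifted from `rag_server.py`, but it implements the standard RRF formula: each document's fused score is the sum of 1/(k + rank) across the dense and BM25 result lists, with k defaulting to 60.

```python
# Minimal Reciprocal Rank Fusion sketch (illustrative; not the server's code).
from collections import defaultdict

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc ids; score(doc) = sum over lists of 1/(k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):  # rank is 1-based
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse dense and BM25 candidates, then pass the top N to the reranker.
dense_ids = ["doc_7", "doc_2", "doc_9"]
bm25_ids = ["doc_2", "doc_4", "doc_7"]
candidates = rrf_fuse([dense_ids, bm25_ids])[:50]
```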
```bash
pip install -r requirements.txt
```

You'll need to build `indexes/lancedb/` (vector) and `indexes/tantivy/` (BM25) from your documents. This repo doesn't ship an ingest script because corpus prep is inherently bespoke. At minimum, each document chunk should have: `id`, `text`, `domain`, `filename`, `url`, `source`, `format`, `chunk_index`, `vector`.
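As a starting point, here is a hypothetical ingest sketch. It is not part of the repo: the chunk values, the `en_stem` tokenizer choice, and which fields get indexed in Tantivy are assumptions to adapt to your own corpus; only the field names match the list above.

```python
# Hypothetical ingest sketch -- adapt to your corpus; this script is not in the repo.
from pathlib import Path

import lancedb
import tantivy
from sentence_transformers import SentenceTransformer

chunks = [
    {
        "id": "notes-0001",
        "text": "Reciprocal Rank Fusion sums 1/(k + rank) across result lists.",
        "domain": "notes",
        "filename": "rrf.md",
        "url": "",
        "source": "personal-notes",
        "format": "markdown",
        "chunk_index": 0,
    },
    # ... one dict per chunk
]

# Dense index: embed each chunk and write a LanceDB table named "corpus".
model = SentenceTransformer("Qwen/Qwen3-Embedding-8B")
for chunk in chunks:
    chunk["vector"] = model.encode(chunk["text"]).tolist()
db = lancedb.connect("./indexes/lancedb")
db.create_table("corpus", data=chunks, mode="overwrite")

# Sparse index: Tantivy with English stemming on the text field.
Path("./indexes/tantivy").mkdir(parents=True, exist_ok=True)
builder = tantivy.SchemaBuilder()
builder.add_text_field("id", stored=True)
builder.add_text_field("text", stored=True, tokenizer_name="en_stem")
builder.add_text_field("domain", stored=True)
schema = builder.build()
index = tantivy.Index(schema, path="./indexes/tantivy")
writer = index.writer()
for chunk in chunks:
    writer.add_document(tantivy.Document(
        id=chunk["id"], text=chunk["text"], domain=chunk["domain"],
    ))
writer.commit()
```

Once both indexes exist, start the server: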
```bash
# With env-configured paths
export EMBED_MODEL_PATH=Qwen/Qwen3-Embedding-8B
export RERANKER_MODEL_PATH=Qwen/Qwen3-Reranker-8B
export LANCEDB_PATH=./indexes/lancedb
export TANTIVY_PATH=./indexes/tantivy
export LANCE_TABLE=corpus
export PORT=9000
export DEVICE=cuda # or "mps" / "cpu"
python rag_server.py
# or: uvicorn rag_server:app --host 0.0.0.0 --port 9000
```

Then query it:

```bash
curl -s http://localhost:9000/search \
  -H "Content-Type: application/json" \
  -d '{"query": "How does reciprocal rank fusion work?", "top_k": 5}' | jq .
```

Copy `eval_queries.example.json` to `eval_queries.json` and author your own queries + expected keywords:
```bash
python rag_eval.py --server http://localhost:9000 --queries eval_queries.json
```

This writes per-query JSON to `eval_results/` plus a summary CSV.
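As a rough illustration of what "scores keyword hits" means, the sketch below checks how many expected keywords show up in the retrieved text for one query. The field name `expected_keywords` and the response key `results` are assumptions; `eval_queries.example.json` and `rag_eval.py` are the source of truth.

```python
# Illustrative keyword-hit scoring for a single eval query (assumed field names).
import requests

entry = {
    "query": "How does reciprocal rank fusion work?",
    "expected_keywords": ["reciprocal", "rank", "fusion"],
}

resp = requests.post(
    "http://localhost:9000/search",
    json={"query": entry["query"], "top_k": 5},
    timeout=60,
).json()

# Assumes the search response returns retrieved chunks under a "results" key.
retrieved = " ".join(hit.get("text", "") for hit in resp.get("results", [])).lower()
hits = [kw for kw in entry["expected_keywords"] if kw.lower() in retrieved]
print(f"{len(hits)}/{len(entry['expected_keywords'])} expected keywords retrieved")
```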
If the base embedding model misses domain-specific vocabulary, fine-tune it on query/positive pairs:
```bash
# training_data/embedding_pairs.jsonl format:
# {"query": "...", "positive": "..."}
# {"query": "...", "positive": "..."}
python finetune_embedding.py \
--base-model Qwen/Qwen3-Embedding-8B \
--data-path ./training_data/embedding_pairs.jsonl \
--output-dir ./models/qwen3-embedding-8b-ft \
--epochs 3 --batch-size 4 --lr 3e-5
```

Uses MultipleNegativesRankingLoss (in-batch negatives), so you only need positive pairs. The evaluator is InformationRetrievalEvaluator over a held-out split. A 3600-pair corpus takes roughly 9 hours on a DGX Spark (H100-class) at batch size 4 for 3 epochs.
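For orientation, here is a minimal sketch of the kind of training loop `finetune_embedding.py` wraps, using the classic sentence-transformers `fit` API. Treat it as illustrative: the warmup steps and data handling are assumptions, and the script itself is authoritative.

```python
# Sketch of MultipleNegativesRankingLoss fine-tuning on query/positive pairs.
import json

from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("Qwen/Qwen3-Embedding-8B")

# Each JSONL line: {"query": "...", "positive": "..."}
examples = []
with open("./training_data/embedding_pairs.jsonl") as f:
    for line in f:
        pair = json.loads(line)
        examples.append(InputExample(texts=[pair["query"], pair["positive"]]))

# In-batch negatives: every other positive in the batch serves as a negative,
# so no explicit negatives are needed.
loader = DataLoader(examples, shuffle=True, batch_size=4)
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(loader, loss)],
    epochs=3,
    warmup_steps=100,                    # illustrative value
    optimizer_params={"lr": 3e-5},
    output_path="./models/qwen3-embedding-8b-ft",
)
```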
All paths and models are env-configurable. Server defaults:
| Variable | Default |
|---|---|
| `EMBED_MODEL_PATH` | `Qwen/Qwen3-Embedding-8B` |
| `RERANKER_MODEL_PATH` | `Qwen/Qwen3-Reranker-8B` |
| `LANCEDB_PATH` | `./indexes/lancedb` |
| `TANTIVY_PATH` | `./indexes/tantivy` |
| `LANCE_TABLE` | `corpus` |
| `DEVICE` | `cuda` if available, else `cpu` |
| `PORT` | `9000` |
The server exposes Prometheus metrics at /metrics:
- `llm_llm_tokens_input_total{endpoint, model, job}` — embedded + reranked input tokens
- `llm_llm_tokens_output_total{endpoint, model, job}` — reranker yes/no logit count
- `llm_llm_requests_total{endpoint, model, job}` — request count per endpoint
- `llm_llm_tool_calls_total` — search invocations
Pair with Grafana for a basic retrieval dashboard.
Part of a self-hosted LLM operations toolkit:
- blockops-proxy — tool-call-translating proxy that fronts this RAG server for OpenAI-compatible clients
- llm-otel-proxy — OTel metrics proxy that tracks tokens/cost/latency on this server's traffic
- alfred-infra — monitoring + backup infrastructure for multi-machine local-LLM clusters
- context-bench — benchmarks embedding/reranker throughput across context sizes
MIT