RAG (Retrieval-Augmented Generation) project with three interfaces over the same backend logic:
- CLI (uv run rag ...)
- FastAPI server (uv run rag-api)
- React + TypeScript + Tailwind UI (frontend/)
flowchart TD
subgraph uiLayer [Frontend]
reactApp[ReactApp]
end
subgraph apiLayer [BackendAPI]
fastApi[FastAPI]
ingestManager[IngestJobManager]
sseEndpoint[SSEEndpoint]
end
subgraph serviceLayer [SharedServices]
ragService[RagService]
end
subgraph dataLayer [DataAndModels]
localLoader[DocumentLoader]
notionLoader[NotionLoader]
vectorStore[VectorStore]
ragChain[RagChain]
chromaDb[ChromaDB]
llmModel[LLMModel]
embedModel[EmbeddingModel]
end
subgraph cliLayer [CLI]
cliCmd[CLICmd]
end
reactApp -->|"POST /chat"| fastApi
reactApp -->|"POST /ingest"| fastApi
reactApp -->|"GET /ingest/{jobId}/events"| sseEndpoint
reactApp -->|"GET /status"| fastApi
reactApp -->|"POST /clear"| fastApi
fastApi --> ragService
fastApi --> ingestManager
sseEndpoint --> ingestManager
ingestManager --> ragService
ragService --> localLoader
ragService --> notionLoader
ragService --> vectorStore
ragService --> ragChain
vectorStore --> chromaDb
vectorStore --> embedModel
ragChain --> llmModel
ragChain --> vectorStore
cliCmd --> ragService
- Python 3.10+
- uv package manager
- llama-cpp-python via uv sync (default wheel is CPU-only). On Windows with an NVIDIA GPU, run .\scripts\install_llamacpp_cuda_windows.ps1 after sync for CUDA offload (needs CUDA toolkit + MSVC; see LLM GPU acceleration).
- Node.js 18+ (for frontend)
Install backend dependencies:
uv sync
Install frontend dependencies:
cd frontend
npm install
uv run rag ingest
uv run rag ingest --source local
uv run rag ingest --source notion
uv run rag query "What powers does Congress have?"
uv run rag query "What is the role of the President?" --show-sources
uv run rag status
uv run rag clear
Run API server:
uv run rag-api
Endpoints:
- GET /health
- GET /status
- POST /clear
- POST /chat
- POST /ingest (starts async ingestion job)
- GET /ingest/{job_id} (job snapshot)
- GET /ingest/{job_id}/events (SSE progress stream)
- POST /ingest with {"source":"all"|"local"|"notion"}.
- Receive job_id.
- Subscribe to GET /ingest/{job_id}/events.
- Update UI progress bar from SSE event payload (status, progress, stage, message).
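The progress stream from the steps above can also be consumed outside the React UI. The following is a minimal sketch of the parsing side only, assuming each SSE event carries a JSON object in its data field with the keys listed above. The function names and the terminal status values ("completed", "failed") are illustrative assumptions, not taken from the project.

```python
import json
from typing import Iterable, Iterator


def iter_sse_payloads(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the JSON payload of each SSE event found in an iterable of text lines."""
    data_parts: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            # The SSE format drops one optional space after the colon.
            value = line[5:]
            if value.startswith(" "):
                value = value[1:]
            data_parts.append(value)
        elif line == "" and data_parts:
            # A blank line terminates the current event.
            yield json.loads("\n".join(data_parts))
            data_parts = []
        # Other fields (event:, id:, retry:) and comment lines are ignored here.


def is_finished(payload: dict) -> bool:
    # Assumed terminal states; check the real API for the actual status values.
    return payload.get("status") in {"completed", "failed"}
```

Any HTTP client that can iterate over response lines can feed this generator; stop reading once is_finished returns True for a payload.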
cd frontend
npm run dev
Set API URL (optional):
# frontend/.env
VITE_API_BASE_URL=http://127.0.0.1:8001
To ingest from Notion, create .env in the project root:
NOTION_TOKEN=secret_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
NOTION_DATABASE_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
Then share your target Notion database with your integration.
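A quick pre-flight check can catch a missing Notion setting before an ingest job starts. This is a hedged sketch: missing_notion_settings is a hypothetical helper, and it assumes the .env values have already been loaded into the process environment (it does not read the .env file itself).

```python
import os
from typing import Mapping, Optional

# The two variables named in the .env example above.
REQUIRED_NOTION_KEYS = ("NOTION_TOKEN", "NOTION_DATABASE_ID")


def missing_notion_settings(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the required Notion variables that are unset or blank."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_NOTION_KEYS if not env.get(key, "").strip()]


if __name__ == "__main__":
    missing = missing_notion_settings()
    if missing:
        print("Missing Notion settings:", ", ".join(missing))
    else:
        print("Notion settings present.")
```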
Edit src/config.py:
- EMBEDDING_MODEL
- EMBEDDING_DEVICE (examples: cuda:0, cuda:1, cpu)
- LLM_MODEL
- LLAMACPP_N_GPU_LAYERS
- LLAMACPP_N_CTX
- LLAMACPP_N_THREADS
- LLAMACPP_TEMPERATURE
- LLAMACPP_VERBOSE
- CHUNK_SIZE
- CHUNK_OVERLAP
- TOP_K_RESULTS
- SUPPORTED_EXTENSIONS (now includes image files like .png, .jpg, .jpeg, .webp, .tiff)
- VISION_ENABLED
- VISION_CAPTION_PROVIDER
- VISION_CAPTION_MODEL
- VISION_MMPROJ_MODEL
- VISION_MAX_IMAGES_PER_DOC
- OCR_ENABLED
PyTorch CUDA (torch==...+cu124) only affects embeddings. The chat LLM uses a separate native library: llama-cpp-python. The default wheel you get from a plain uv sync / pip install is often CPU-only (llama_supports_gpu_offload() returns False). In that case LLAMACPP_N_GPU_LAYERS is ignored and you will see no GPU usage during /chat.
- Check: uv run python -c "import llama_cpp; print(llama_cpp.llama_supports_gpu_offload())" (expect True for GPU offload).
- API / CLI: GET /status and rag status include llamacpp_gpu_offload and llamacpp_n_gpu_layers.
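The two status fields can be combined into a single human-readable verdict. The sketch below is illustrative only: llm_gpu_state is a hypothetical helper, and it assumes llamacpp_gpu_offload is a boolean and llamacpp_n_gpu_layers an integer in the decoded /status JSON, where -1 means "offload all layers" as in the configuration examples.

```python
def llm_gpu_state(status: dict) -> str:
    """Summarize whether the chat LLM can actually use the GPU."""
    offload = bool(status.get("llamacpp_gpu_offload", False))
    n_layers = int(status.get("llamacpp_n_gpu_layers", 0) or 0)
    if not offload:
        # CPU-only llama-cpp-python build: the layer setting has no effect.
        return "cpu-only build: LLAMACPP_N_GPU_LAYERS is ignored"
    if n_layers == 0:
        return "gpu-capable build, but no layers are offloaded"
    # Any non-zero value (including -1 for all layers) requests offload.
    return "gpu offload active"


if __name__ == "__main__":
    print(llm_gpu_state({"llamacpp_gpu_offload": False, "llamacpp_n_gpu_layers": -1}))
```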
Windows (recommended): from the repo root, with CUDA toolkit + MSVC build tools installed:
.\scripts\install_llamacpp_cuda_windows.ps1
That sets FORCE_CMAKE and CMAKE_ARGS=-DGGML_CUDA=ON, then reinstalls llama-cpp-python from source against your CUDA toolkit.
Manual one-liner (same effect):
$env:FORCE_CMAKE = "1"
$env:CMAKE_ARGS = "-DGGML_CUDA=ON"
uv pip install --force-reinstall --no-cache-dir llama-cpp-python
If CMake still skips GPU, try -DLLAMA_CUDA=ON instead (older CMake layouts). Prefer running from Developer PowerShell for VS so MSVC and SDK are on PATH. If nvcc fails with "unsupported Microsoft Visual Studio version" (common when CUDA <= 12.1 sees VS 2025/2026), install Build Tools for Visual Studio 2022 (MSVC v143, C++ workload) so scripts/install_llamacpp_cuda_windows.ps1 can pick the 17.x toolchain, or upgrade CUDA to a release that supports your MSVC. The script also appends -allow-unsupported-compiler, which helps some setups but does not fix CMake's CUDA compiler detection when the host compiler is too new.
Important: another plain uv sync can reinstall the CPU wheel from PyPI and reset llama_supports_gpu_offload() to False. Run the script again after that.
Linux: use the same FORCE_CMAKE / CMAKE_ARGS env vars with uv pip install --force-reinstall --no-cache-dir llama-cpp-python, or install a CUDA-enabled wheel from a published index if your platform provides one — see llama-cpp-python.
Local ingestion now supports text + visual processing for:
- PDF text and embedded images
- PPTX text and slide images
- Standalone image files (.png, .jpg, .jpeg, .webp, .tiff)
When vision is enabled, extracted images are captioned and the captions are embedded as additional retrievable chunks.
Example .env settings:
VISION_ENABLED=true
VISION_CAPTION_PROVIDER=llamacpp
VISION_CAPTION_MODEL=mys/ggml_llava-v1.5-7b/ggml-model-q5_k.gguf
VISION_MMPROJ_MODEL=mys/ggml_llava-v1.5-7b/mmproj-model-f16.gguf
VISION_MAX_IMAGES_PER_DOC=16
# Optional OCR extraction to append visible text from images
OCR_ENABLED=false
llama.cpp model configuration examples:
# local GGUF file under models/
LLM_MODEL=models/gpt-oss-20b-Q4_K_M.gguf
# or auto-download from HuggingFace to models/
LLM_MODEL=unsloth/gpt-oss-20b-GGUF/gpt-oss-20b-Q4_K_M.gguf
LLAMACPP_N_GPU_LAYERS=-1
LLAMACPP_N_CTX=4096
LLAMACPP_N_THREADS=16
LLAMACPP_TEMPERATURE=0.1
LLAMACPP_VERBOSE=false
Caption-derived chunks include metadata such as:
- modality=image_caption
- page_or_slide
- image_mime
- parent_source
- caption_model
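To make the metadata shape concrete, here is a minimal sketch of how a caption could be wrapped as a retrievable chunk. CaptionChunk and make_caption_chunk are hypothetical names used for illustration; the project's actual chunk type and any extra metadata keys are not shown in this README.

```python
from dataclasses import dataclass, field


@dataclass
class CaptionChunk:
    """Illustrative container: caption text plus the metadata keys listed above."""
    text: str
    metadata: dict = field(default_factory=dict)


def make_caption_chunk(
    caption: str,
    parent_source: str,
    page_or_slide: int,
    image_mime: str,
    caption_model: str,
) -> CaptionChunk:
    # The caption text is what gets embedded; metadata links it back to its document.
    return CaptionChunk(
        text=caption.strip(),
        metadata={
            "modality": "image_caption",
            "page_or_slide": page_or_slide,
            "image_mime": image_mime,
            "parent_source": parent_source,
            "caption_model": caption_model,
        },
    )


if __name__ == "__main__":
    chunk = make_caption_chunk(
        "  A bar chart of quarterly revenue. ", "deck.pptx", 3, "image/png", "example-model"
    )
    print(chunk.text, chunk.metadata["page_or_slide"])
```

Keeping parent_source and page_or_slide on the chunk lets a retrieved caption be traced back to the page or slide it was extracted from.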
For query-time CUDA OOM issues, set these in your project .env:
# Keep embedding on GPU
EMBEDDING_DEVICE=cuda:0
# Start balanced; lower if memory pressure continues
EMBEDDING_BATCH_SIZE=24
EMBEDDING_OOM_RETRY_BATCH_SIZE=8
# Keep vector quality and speed defaults
EMBEDDING_NORMALIZE=true
# Keep disabled for speed-first profile (enable only if needed)
EMBEDDING_OOM_CPU_FALLBACK=false
# Recommended by PyTorch to reduce fragmentation
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
If OOM persists:
- Lower EMBEDDING_BATCH_SIZE to 16 or 8.
- Keep EMBEDDING_OOM_RETRY_BATCH_SIZE at 4 or 8.
- Enable EMBEDDING_OOM_CPU_FALLBACK=true only if stability is more important than speed.
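The interplay of the three settings can be sketched as a retry loop. This is an assumption about the general pattern, not the project's actual implementation: OutOfMemory stands in for a CUDA OOM error, and embed_gpu / embed_cpu are hypothetical callables that map a list of texts to vectors.

```python
from typing import Callable, Optional, Sequence

Vector = list[float]
EmbedFn = Callable[[Sequence[str]], list[Vector]]


class OutOfMemory(RuntimeError):
    """Stand-in for a CUDA out-of-memory error (hypothetical)."""


def embed_with_oom_retry(
    texts: Sequence[str],
    embed_gpu: EmbedFn,
    batch_size: int = 24,        # mirrors EMBEDDING_BATCH_SIZE
    retry_batch_size: int = 8,   # mirrors EMBEDDING_OOM_RETRY_BATCH_SIZE
    embed_cpu: Optional[EmbedFn] = None,  # set only when CPU fallback is enabled
) -> list[Vector]:
    vectors: list[Vector] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        try:
            vectors.extend(embed_gpu(batch))
        except OutOfMemory:
            # Re-run only the failed batch in smaller pieces.
            vectors.extend(_retry_smaller(batch, embed_gpu, retry_batch_size, embed_cpu))
    return vectors


def _retry_smaller(
    batch: Sequence[str],
    embed_gpu: EmbedFn,
    retry_batch_size: int,
    embed_cpu: Optional[EmbedFn],
) -> list[Vector]:
    out: list[Vector] = []
    for start in range(0, len(batch), retry_batch_size):
        small = batch[start:start + retry_batch_size]
        try:
            out.extend(embed_gpu(small))
        except OutOfMemory:
            if embed_cpu is None:
                raise  # fallback disabled: surface the error
            out.extend(embed_cpu(small))  # slower, but keeps the request alive
    return out
```

In this sketch a smaller retry batch lowers peak memory at the cost of more calls, and the CPU path is used only when the small batch also fails, which matches the speed-versus-stability trade-off described above.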
Validation checklist:
- Run repeated /chat requests and confirm no progressive VRAM growth.
- Run /ingest while sending /chat requests and confirm no CUDA OOM.
- Confirm latency remains acceptable after batch-size tuning.