Study Assistant: RAG-Powered Document Q&A

A production-grade Retrieval-Augmented Generation system for querying personal study notes and documents. Built to support exam preparation by enabling natural language search across multiple document sources.

Motivation

Most students accumulate notes across PDFs, text files, and web resources. Finding specific information across all of them during exam prep is slow and frustrating. This system lets you ask questions in plain English and get grounded, cited answers from your own material.

Architecture

Documents (PDF / TXT / URL)
        ↓
  Load & Parse
        ↓
  Recursive Chunking (chunk_size=500, overlap=50)
        ↓
  Embedding (all-MiniLM-L6-v2, 384 dimensions)
        ↓
  ChromaDB Vector Store (persistent)
        ↓
User Query -> Embed -> MMR Similarity Search (k=4)
        ↓
  Retrieved Chunks + Prompt V2
        ↓
  Llama 3.1 70B via Groq
        ↓
  Grounded Answer with Source Citations
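
The chunking step above can be illustrated with a minimal sketch. It uses a plain character sliding window rather than the recursive, separator-aware splitter the pipeline names, so it only demonstrates how chunk_size=500 and overlap=50 interact. `chunk_text` is a hypothetical helper, not a function from this repository.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # each window starts this far after the previous one
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With these settings each chunk starts 450 characters after the previous one, so a 1,000-character document yields three chunks.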

Stack

Component	Technology
Embedding Model	`all-MiniLM-L6-v2` (sentence-transformers, local)
Vector Database	ChromaDB (persistent, local)
LLM	Llama 3.1 70B via Groq API
RAG Framework	LangChain
Backend	FastAPI
Frontend	Streamlit
Containerization	Docker

Evaluation

Systematic hyperparameter search across 12 configurations (k in {2, 4, 8}, chunk_size in {200, 300, 400, 500}) using a custom LLM-as-judge evaluation framework with four metrics:

Faithfulness: is every claim in the answer grounded in retrieved context?
Answer Relevancy: does the answer address the question asked?
Context Recall: did retrieval find the chunks needed to answer?
Completeness: did the system actually answer or deflect?
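
The 12 configurations are the Cartesian product of the three k values and the four chunk sizes listed above. A minimal sketch of enumerating that grid and selecting a winner follows. `build_grid`, `pick_best`, and the shape of the result dictionaries are assumptions for illustration, not the contents of evaluation/evaluate.py.

```python
from itertools import product

K_VALUES = (2, 4, 8)
CHUNK_SIZES = (200, 300, 400, 500)


def build_grid() -> list[dict[str, int]]:
    """Enumerate every (k, chunk_size) combination in the search space."""
    return [
        {"k": k, "chunk_size": size}
        for k, size in product(K_VALUES, CHUNK_SIZES)
    ]


def pick_best(results: list[dict], metric: str) -> dict:
    """Return the result entry with the highest score for `metric`.

    Each entry is assumed to look like:
    {"config": {"k": ..., "chunk_size": ...}, "scores": {metric_name: float}}
    """
    return max(results, key=lambda r: r["scores"][metric])
```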

Optimal Configuration: k=4, chunk_size=500

Metric	Score
Faithfulness	0.9222
Answer Relevancy	0.9778
Context Recall	0.7818
Completeness	0.7273

Retrieval Tuning Results (8B model, fixed config)

Technique	Completeness	Verdict
Baseline	0.6818	Starting point
+ Query Rewriting	0.5545	Dropped
+ MMR Retrieval	0.7273	Kept
+ Prompt V2	0.7727	Best overall
+ Hybrid Search (0.5/0.5)	0.7273	Dropped
+ Hybrid Search (0.7/0.3)	0.7273	No improvement
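
The two hybrid rows differ only in how dense (embedding) and keyword scores are weighted. The sketch below shows one common way to fuse them: a weighted sum of min-max normalized scores. It assumes the first weight applies to the dense retriever and the second to the keyword retriever, which the table does not state. `hybrid_rank` is a hypothetical name, and the project's actual fusion method may differ.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so the two retrievers are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(
    dense: dict[str, float],
    sparse: dict[str, float],
    w_dense: float = 0.7,   # assumed to be the first weight in "0.7/0.3"
    w_sparse: float = 0.3,  # assumed to be the second weight
    k: int = 4,
) -> list[str]:
    """Return the top-k document ids by weighted sum of normalized scores."""
    d, s = min_max(dense), min_max(sparse)
    fused = {
        doc: w_dense * d.get(doc, 0.0) + w_sparse * s.get(doc, 0.0)
        for doc in set(d) | set(s)
    }
    # Sort by descending fused score; break ties by id for determinism.
    return sorted(fused, key=lambda doc: (-fused[doc], doc))[:k]
```

A document found by only one retriever contributes zero from the other, which is one way keyword matches can push weakly related chunks into the top k.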

Key findings:

Larger chunks (500) consistently outperform smaller chunks across all k values, preserving semantic context better for these document types.
k=4 balances retrieval coverage against added noise.
Query rewriting hurt completeness on factual Q&A, since raw questions are already precise search queries in this domain.
MMR improved completeness by reducing redundant chunk retrieval.
Hybrid search added noise for this dataset and was rejected.
LLM-as-judge evaluation needs a completeness metric; without one, "I don't know" responses are rated as faithful and inflate scores.
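
Maximal Marginal Relevance (MMR), referenced in the findings above, picks chunks one at a time, trading relevance to the query against similarity to chunks already selected. A minimal pure-Python sketch follows, assuming cosine similarity and a `lambda_mult` trade-off of 0.5; the value used in this project is not stated, and `mmr` is an illustrative function rather than the vector store's implementation.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mmr(
    query: list[float],
    docs: list[list[float]],
    k: int = 4,
    lambda_mult: float = 0.5,
) -> list[int]:
    """Return indices of up to k docs chosen greedily by MMR score."""
    selected: list[int] = []
    candidates = list(range(len(docs)))

    def score(i: int) -> float:
        relevance = cosine(query, docs[i])
        # Penalize similarity to the closest already-selected document.
        redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

Setting `lambda_mult=1.0` removes the redundancy penalty and reduces the selection to plain similarity ranking.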

Project Structure

rag-project/
├── src/
│   ├── document_loader.py      # PDF, TXT, URL loading + chunking
│   ├── embeddings.py           # HuggingFace embedding model
│   ├── vector_store.py         # ChromaDB operations (add, search, clear)
│   └── rag_chain.py            # LangChain RAG chains + prompt variants
├── api/
│   └── app.py                  # FastAPI backend
├── app/
│   └── streamlit_app.py        # Streamlit frontend
├── evaluation/
│   ├── evaluate.py             # LLM-as-judge evaluation harness
│   ├── test_set.json           # 11 manually curated test questions
│   └── results_*.json          # All experiment results
├── data/                       # User documents (gitignored)
├── .chroma/                    # ChromaDB persistent store (gitignored)
├── .env                        # API keys (gitignored)
├── Dockerfile
├── docker-compose.yml
└── requirements.txt

Setup

Prerequisites: Python 3.10+, Groq API key (free at console.groq.com)

git clone https://github.com/yourusername/rag-project
cd rag-project

python -m venv venv
venv\Scripts\activate  # Windows
source venv/bin/activate  # Mac/Linux

pip install -r requirements.txt

echo "GROQ_API_KEY=your_key_here" > .env

streamlit run app/streamlit_app.py

Docker:

docker-compose up --build

Usage

Upload PDF or TXT files, or paste a URL in the sidebar
Wait for documents to be processed and indexed
Ask questions in plain English
Answers include source citations showing which document was used
Use "Clear Session" to start fresh with new documents

Supported Document Types

Type	Notes
PDF	Text-based PDFs only; scanned PDFs require OCR preprocessing
TXT	Plain text files, UTF-8 encoded
URL	Web pages parsed and cleaned automatically

Known Limitations

Scanned and handwritten PDFs are not supported without OCR preprocessing
Context recall drops for very specific factual queries due to embedding model limitations
Groq free tier has rate limits; sustained heavy usage may hit them
Clearing session removes all indexed documents from ChromaDB
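
One way to soften the rate-limit issue is to retry failed calls with exponential backoff. This is a generic sketch, not something the project is stated to implement: `call_with_backoff` is a hypothetical helper, and which exception signals a rate limit depends on the client library, so it is passed in as `retry_on`.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 1.0,
    retry_on: tuple[type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on the given exceptions with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Delay doubles each attempt; jitter avoids synchronized retries.
            delay = base_delay * 2 ** attempt + random.uniform(0, base_delay)
            sleep(delay)
    raise RuntimeError("unreachable when max_retries >= 0")
```

The `sleep` parameter exists only so the waiting can be replaced in tests.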

What I Learned

Proper RAG evaluation requires more than eyeballing answers: LLM-as-judge needs a completeness metric to avoid rewarding "I don't know" responses
Chunk size affects answer quality more than retrieval depth for this document type
Query rewriting helps for conversational queries but hurts precise factual Q&A
PDF parsing quality directly impacts embedding quality and downstream retrieval
MMR retrieval consistently improves completeness by reducing redundant chunk selection
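
The first lesson above can be made concrete with a toy example. If faithfulness is the fraction of answer claims supported by the context, a deflection makes no claims and is trivially faithful. The sketch assumes that scoring rule and uses a crude phrase match as a stand-in for completeness; the project's judge is LLM-based, and both functions and the phrase list are hypothetical.

```python
# Hypothetical phrases that mark an answer as a deflection.
DEFLECTIONS = ("i don't know", "i do not know", "not enough information")


def faithfulness(claim_supported: list[bool]) -> float:
    """Fraction of claims grounded in context; no claims counts as fully faithful."""
    if not claim_supported:
        return 1.0
    return sum(claim_supported) / len(claim_supported)


def completeness(answer: str) -> float:
    """Crude stand-in: 0.0 if the answer deflects, 1.0 otherwise."""
    text = answer.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return 0.0 if any(phrase in text for phrase in DEFLECTIONS) else 1.0
```

Under these assumptions, "I don't know" scores 1.0 on faithfulness and 0.0 on completeness, which is the gap the extra metric is meant to expose.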

Roadmap

Agentic extension: ReAct agent that decides between local retrieval and web search
Web search integration via Tavily API
OCR support for scanned PDFs and handwritten notes
Conversation memory for follow-up questions
Multi-user support with separate ChromaDB collections

Related Projects

[Energy Consumption Forecasting](github link): LSTM time series forecasting with FastAPI and Docker
