A Retrieval-Augmented Generation (RAG) system that answers family-focused questions across three domains: Health & First Aid, Nutrition & Recipes, and Children's Education. Built as a final project for a RAG/LLM certification.
- Ingests 3 public PDF documents (first aid, nutrition, education) into a Pinecone vector store
- Classifies user queries and applies role-based access control (Parent / Child / Guest)
- Answers questions using 4 different retrieval strategies, compared via RAGAS evaluation
- Provides an interactive Gradio chat interface
| Model | Use |
|---|---|
| text-embedding-3-small (1536d) | Document & query embeddings |
| gpt-4o-mini (temp=0.3) | Answer generation |
| gpt-3.5-turbo (temp=0) | Topic classification (deterministic) |
- Pinecone serverless index (`family-assistant`, AWS us-east-1, cosine similarity)
- Two namespaces: `main_chunks` (604 vectors) and `child_chunks` (1,166 vectors)
| Type | Size | Overlap |
|---|---|---|
| Parent | 2000 chars | 400 |
| Child | 500 chars | 75 |
| Regular | 1000 chars | 200 |
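The sizes and overlaps above can be illustrated with a plain sliding-window splitter. This is a minimal sketch, not the notebook's actual splitter (which may well be a LangChain text splitter that respects separators); `ChunkConfig`, `CHUNK_CONFIGS`, and `split_text` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkConfig:
    size: int     # maximum characters per chunk
    overlap: int  # characters shared between consecutive chunks


# Sizes and overlaps copied from the table above.
CHUNK_CONFIGS = {
    "parent": ChunkConfig(size=2000, overlap=400),
    "child": ChunkConfig(size=500, overlap=75),
    "regular": ChunkConfig(size=1000, overlap=200),
}


def split_text(text: str, config: ChunkConfig) -> list[str]:
    """Split text into fixed-size windows that overlap by config.overlap characters."""
    if config.overlap >= config.size:
        raise ValueError("overlap must be smaller than size")
    step = config.size - config.overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + config.size])
        if start + config.size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

With these settings, the last 400 characters of one parent chunk reappear as the first 400 characters of the next, so a sentence cut at a boundary still appears whole in at least one chunk.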
| Role | Accessible Topics |
|---|---|
| Parent | all, parents, kids |
| Child | all, kids |
| Guest | all |
Health documents are parents-only; Nutrition and Education are open to all.
- Baseline — Simple k=3 similarity search on main chunks
- Parent-Child — Search child chunks, return full parent chunks for richer context
- Multi-Query + Reranking — Expand query into 3 variations, retrieve k=5, rerank top 4 with FlashRank cross-encoder
- All Combined — Parent-Child + Multi-Query + Reranking (most comprehensive)
All strategies share: topic classification filtering, role-based access control, and 5-turn conversation memory.
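The building blocks behind these strategies can be sketched in plain Python. Everything here is illustrative: the `parent_id` and `id` fields, the helper names, and `score_fn` (a stand-in for the FlashRank cross-encoder) are assumptions, not the notebook's code.

```python
from collections import deque
from typing import Callable


def resolve_parents(child_hits: list[dict], parents: dict[str, str]) -> list[str]:
    """Parent-Child: map ranked child hits to their parent chunks, without duplicates."""
    seen, resolved = set(), []
    for hit in child_hits:
        parent_id = hit["parent_id"]
        if parent_id in parents and parent_id not in seen:
            seen.add(parent_id)
            resolved.append(parents[parent_id])
    return resolved


def merge_results(result_lists: list[list[dict]]) -> list[dict]:
    """Multi-Query: union the hits of all query variations, keeping first-seen order."""
    seen, merged = set(), []
    for results in result_lists:
        for doc in results:
            if doc["id"] not in seen:
                seen.add(doc["id"])
                merged.append(doc)
    return merged


def rerank(query: str, docs: list[dict],
           score_fn: Callable[[str, dict], float], top_n: int = 4) -> list[dict]:
    """Reranking: keep the top_n documents by relevance score (highest first)."""
    return sorted(docs, key=lambda doc: score_fn(query, doc), reverse=True)[:top_n]


class ConversationMemory:
    """Keeps only the most recent question/answer turns (5 by default)."""

    def __init__(self, max_turns: int = 5) -> None:
        self._turns: deque = deque(maxlen=max_turns)

    def add(self, question: str, answer: str) -> None:
        self._turns.append((question, answer))

    def history(self) -> list[tuple[str, str]]:
        return list(self._turns)
```

The All Combined strategy would chain these: merge child hits across query variations, resolve them to parents, then rerank the parents.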
PDF → Clean text → Auto-detect topic (LLM) → Enrich metadata → Store (main + child namespaces)
Query → Classify topic → Filter by role → Retrieve → [Rerank] → Prompt → gpt-4o-mini → Answer
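The query pipeline above can be read as a simple function composition. In this sketch every stage is passed in as a callable so it runs without API keys; the function name, the filter field names, and the prompt template are hypothetical.

```python
from typing import Callable, Optional


def answer_query(
    question: str,
    role: str,
    classify: Callable[[str], str],
    retrieve: Callable[[str, dict], list],
    generate: Callable[[str], str],
    role_access: dict,
    rerank: Optional[Callable[[str, list], list]] = None,
) -> str:
    """Run one question through classify -> filter -> retrieve -> [rerank] -> prompt -> LLM."""
    topic = classify(question)                     # Classify topic
    metadata_filter = {                            # Filter by role (field names assumed)
        "topic": {"$eq": topic},
        "access": {"$in": role_access[role]},
    }
    docs = retrieve(question, metadata_filter)     # Retrieve
    if rerank is not None:                         # [Rerank] is optional
        docs = rerank(question, docs)
    context = "\n\n".join(docs)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"  # Prompt (template assumed)
    return generate(prompt)                        # gpt-4o-mini in the notebook
```

Injecting the stages this way also makes it straightforward to swap retrieval strategies while keeping classification, access control, and generation fixed.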
Uses RAGAS with 4 metrics across 6 ground-truth Q&A pairs (2 per topic):
| Metric | Measures |
|---|---|
| Faithfulness | Answer grounded in context (no hallucinations) |
| Answer Relevancy | Answer addresses the question |
| Context Precision | Relevant docs ranked at the top |
| Context Recall | All needed information was retrieved |
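To make the metrics more concrete, here is a rough sketch of the arithmetic behind them once the per-item judgments exist. In RAGAS those judgments come from an LLM judge; the functions below only do the final scoring and are an approximation of the documented definitions, not RAGAS code.

```python
def context_precision(relevance: list[bool]) -> float:
    """Average of precision@k over the ranks k that hold a relevant chunk.

    relevance[i] says whether the chunk at rank i + 1 was judged relevant.
    Relevant chunks near the top of the ranking yield a higher score.
    """
    hits = 0
    total = 0.0
    for k, relevant in enumerate(relevance, start=1):
        if relevant:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0


def supported_fraction(verdicts: list[bool]) -> float:
    """Share of claims judged as supported.

    Faithfulness: verdicts are for claims in the answer, checked against the context.
    Context Recall: verdicts are for claims in the ground truth, checked against the context.
    """
    return sum(verdicts) / len(verdicts) if verdicts else 0.0
```

For example, a ranking judged relevant, irrelevant, relevant scores (1/1 + 2/3) / 2 ≈ 0.83, while the same two relevant chunks ranked first and second would score 1.0.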
Designed to run in Google Colab.
Dependencies: `langchain-openai`, `langchain-pinecone`, `pypdf`, `langchain`, `ragas`, `datasets`, `gradio`, `flashrank`, `python-dotenv`
Set the following secrets in Colab (or a .env file for local use):
- `OPENAI_API_KEY`
- `PINECONE_API_KEY`
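A small check like the following can fail fast when a key is missing. It is a sketch that only inspects environment variables; `missing_keys` is a hypothetical helper, and for local use it assumes the `.env` file has already been loaded (for example with python-dotenv's `load_dotenv()`).

```python
import os
from typing import Mapping, Optional

REQUIRED_KEYS = ("OPENAI_API_KEY", "PINECONE_API_KEY")


def missing_keys(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the required keys that are absent or empty in the given environment."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]
```

Calling `missing_keys()` before ingestion and raising an error on a non-empty result avoids a partially completed run.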
Upload the 3 PDFs to the Colab file browser. The notebook auto-detects and ingests any .pdf files present.
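The auto-detection step could look roughly like this. It is a sketch assuming the PDFs sit directly in the working directory; `find_pdfs` is a hypothetical name, not the notebook's function.

```python
from pathlib import Path


def find_pdfs(directory: str = ".") -> list[Path]:
    """Return all PDF files directly inside the directory, sorted by path."""
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

Matching on the lowercased suffix also picks up files uploaded as `.PDF`.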
- St John Ambulance — First Aid Reference Guide (Health)
- Georgetown University — Family Food and Fitness Guide (Nutrition)
- U.S. Department of Education — Parent's Guide to Student Success (Education)
Bruno Mirrado — family_assistant_v1