This project is a complete Retrieval-Augmented Generation (RAG) pipeline that turns a domain description into a searchable research assistant powered by arXiv papers.
You describe a domain (for example: "An expert on machine learning especially on memory") and the system will:
- Generate ranked search keywords using OpenAI
- Search arXiv using those keywords
- Download and parse research PDFs
- Embed the papers locally using all-MiniLM-L6-v2
- Store them in a Chroma vector database
- Let you ask questions over the papers using ChatGPT
The result is a domain-specific research assistant that answers questions using real academic sources.
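As an illustration of the arXiv search step, here is a minimal sketch that builds a query URL for the public arXiv API and reads paper titles and PDF links out of the Atom response. The helper names `build_query_url` and `parse_feed` are hypothetical and are not taken from `librerag.py`; joining the keywords with `OR` over all fields is also an assumption about how the keywords might be combined.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"  # XML namespace used by the Atom feed


def build_query_url(keywords, max_results=5):
    """Build an arXiv API URL that ORs the keywords across all fields.

    Hypothetical helper: the real project may combine keywords differently.
    """
    terms = " OR ".join(f'all:"{kw}"' for kw in keywords)
    params = {"search_query": terms, "start": 0, "max_results": max_results}
    return ARXIV_API + "?" + urllib.parse.urlencode(params)


def parse_feed(atom_xml):
    """Extract the title and PDF link from each entry of an Atom response."""
    root = ET.fromstring(atom_xml)
    papers = []
    for entry in root.findall(ATOM + "entry"):
        # Titles can contain line breaks; collapse them to single spaces.
        title = " ".join(entry.findtext(ATOM + "title", default="").split())
        pdf_url = None
        for link in entry.findall(ATOM + "link"):
            if link.get("title") == "pdf":
                pdf_url = link.get("href")
        papers.append({"title": title, "pdf_url": pdf_url})
    return papers
```

The response body could be fetched with `requests.get(build_query_url(keywords)).text` and passed to `parse_feed`; the resulting `pdf_url` values are what the download step would consume.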
Domain description → Keyword generation → arXiv search → PDF download → Parsing → Embeddings → Chroma → RAG QA
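Between parsing and embedding, a pipeline like this usually splits each paper into smaller passages, because sentence-embedding models truncate long inputs. The source does not say how `librerag.py` does this, so the sketch below is an assumption: a hypothetical `chunk_text` helper that cuts the parsed text into overlapping windows of words, with illustrative default sizes.

```python
def chunk_text(text, chunk_size=200, overlap=40):
    """Split text into overlapping windows of words.

    Hypothetical helper; chunk_size and overlap are illustrative values,
    not settings taken from the project.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the cost of storing some words twice.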
- Domain → ranked search keywords
- Automatic arXiv ingestion
- PDF parsing
- Local embeddings (no cloud vector DB)
- Chroma vector store
- RAG-based question answering
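To make the retrieval and question-answering features concrete, the following sketch ranks stored chunks by cosine similarity to a query vector and assembles a grounded prompt. It is a simplified stand-in, not the project's code: in the actual pipeline the vectors would come from all-MiniLM-L6-v2 and the nearest-neighbor lookup would be handled by Chroma, whereas here plain Python lists are used so the ranking logic is visible. The names `cosine`, `top_k`, and `build_prompt`, and the prompt wording, are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k=2):
    """Return the k chunk texts most similar to the query.

    `store` is a list of (chunk_text, vector) pairs standing in for the
    vector database.
    """
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, chunks):
    """Combine retrieved chunks and the question into a single prompt."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer the question using only the numbered excerpts below.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```

The string returned by `build_prompt` is what would be sent to the chat model, so the answer stays tied to the retrieved excerpts.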
- Python 3.9+
- OpenAI API key
- Internet connection
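The first two requirements above can be checked with a short preflight script before running the pipeline. This is an optional sketch, not part of the project; `check_requirements` is a hypothetical helper that takes the environment and Python version as arguments so it is easy to test.

```python
import os
import sys


def check_requirements(env=None, version=None):
    """Return a list of human-readable problems; empty means all good.

    Hypothetical helper: defaults to the current process environment
    and the running interpreter's version.
    """
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else version
    problems = []
    if tuple(version) < (3, 9):
        problems.append("Python 3.9 or newer is required")
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems
```

Calling `check_requirements()` with no arguments inspects the current interpreter and environment; it does not test the internet connection.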
Install the dependencies:

    pip install openai chromadb sentence-transformers pypdf requests

Set your API key:

    export OPENAI_API_KEY="your-api-key"

Run the pipeline:

    python librerag.py

This project is for researchers, engineers, and builders who want their own domain-specific AI assistant grounded in real academic literature, not hallucinations.
MIT