A production-ready Serverless Retrieval-Augmented Generation (RAG) system built on AWS Lambda, API Gateway, Pinecone, and OpenAI. The system allows users to upload documents (PDF) and ask natural language questions via a ChatGPT-like web UI.
- 🔍 Semantic Search (RAG) using vector embeddings
- 📚 PDF ingestion & chunking
- 🧠 LLM-based answer generation (OpenAI)
- ⚡ Serverless architecture (AWS Lambda + API Gateway)
- 🐳 Container-based Lambda (ECR)
- 🌐 ChatGPT-like Web UI (pure HTML/CSS/JS)
- 🔐 Secrets managed via GitHub Secrets
- 📦 Infrastructure as Code with Terraform
- 🔄 CI/CD with GitHub Actions
```
User (Browser)
    ↓
index.html (Chat UI)
    ↓ POST /ask
AWS API Gateway
    ↓
AWS Lambda (Query Handler)
 ├─ Retrieval  → Pinecone Vector DB
 └─ Generation → OpenAI Chat Model
    ↓
Answer returned to UI
```
A separate Ingestion Lambda is used to process and embed documents.
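To make the API Gateway → Lambda hop concrete, here is a minimal sketch of what the query Lambda's entry point could look like. The answer logic is injected as a stub (`answer_fn`) so only the proxy-integration plumbing is shown; the function and field names are illustrative, not the project's actual code.

```python
import json

def handler(event, context, answer_fn=lambda q: f"(stub answer for: {q})"):
    """Sketch of a query Lambda handler behind an API Gateway proxy
    integration. The real handler would call Pinecone and OpenAI where
    answer_fn is invoked here."""
    body = json.loads(event.get("body") or "{}")
    query = body.get("query", "").strip()
    if not query:
        return _response(400, {"error": "Missing 'query' field"})
    return _response(200, {"answer": answer_fn(query)})

def _response(status, payload):
    # API Gateway proxy integrations expect exactly this shape; the CORS
    # header is what lets the static index.html call the API from a browser.
    return {
        "statusCode": status,
        "headers": {
            "Content-Type": "application/json",
            "Access-Control-Allow-Origin": "*",
        },
        "body": json.dumps(payload),
    }
```

The dependency-injected `answer_fn` also keeps the HTTP layer testable without network access.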
```
SERVERLESS_RAG_PROJECT
├── .github/
│   └── workflows/
│       └── deploy.yml          # CI/CD pipeline
├── infra/                      # Terraform IaC
│   ├── main.tf
│   ├── variables.tf
│   └── outputs.tf
├── src/
│   ├── common/
│   │   └── logger.py
│   ├── ingestion/
│   │   ├── handler.py          # Document ingestion Lambda
│   │   └── service.py
│   └── retrieval/
│       ├── handler.py          # Query Lambda
│       ├── search.py
│       └── generator.py
├── index.html                  # ChatGPT-like frontend
├── Dockerfile
├── requirements.txt
└── README.md
```
- Upload PDF documents
- Extract text per page
- Split text into chunks
- Generate embeddings using OpenAI Embeddings
- Store vectors in Pinecone
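The chunking step above can be sketched as a simple sliding window. The 800-character window and 100-character overlap are illustrative defaults, not the project's actual settings; each resulting chunk would then be embedded and upserted to Pinecone.

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split extracted page text into overlapping chunks.

    The overlap keeps sentences that straddle a chunk boundary
    retrievable from either side. Sizes here are assumptions.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece.strip():
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break
    return chunks
```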
- User submits a question
- Convert question to embedding
- Perform similarity search in Pinecone
- Retrieve top-K relevant chunks
- Inject context into prompt
- Generate final answer via OpenAI Chat Model
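The query flow above can be sketched end to end with the external clients injected as functions (`embed_fn` → OpenAI embeddings, `search_fn` → Pinecone query, `generate_fn` → OpenAI chat). The prompt wording and `top_k=4` are illustrative assumptions.

```python
def answer_question(query: str, embed_fn, search_fn, generate_fn, top_k: int = 4) -> str:
    """Sketch of the retrieval flow; not the project's actual code."""
    query_vector = embed_fn(query)            # 1. convert question to embedding
    chunks = search_fn(query_vector, top_k)   # 2. similarity search, top-K chunks
    context = "\n\n".join(chunks)             # 3. inject context into the prompt
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    return generate_fn(prompt)                # 4. generate final answer
```

Keeping the clients injectable makes the flow unit-testable with stubs, with the real OpenAI/Pinecone calls wired in only in the handler.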
| Layer | Technology |
|---|---|
| Frontend | HTML, CSS, Vanilla JS |
| API | AWS API Gateway |
| Compute | AWS Lambda (Container Image) |
| Container | Docker + Amazon ECR |
| Vector DB | Pinecone |
| Embeddings | OpenAI text-embedding-3-small |
| LLM | OpenAI Chat Models |
| IaC | Terraform |
| CI/CD | GitHub Actions |
Configured via GitHub Secrets and injected by Terraform:
```
OPENAI_API_KEY=
PINECONE_API_KEY=
PINECONE_INDEX_NAME=
PINECONE_NAMESPACE=default
```
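Inside the Lambdas, these variables can be read with a small loader like the sketch below (the function name and dict keys are assumptions). Failing fast on missing keys surfaces a misconfigured deploy at cold start rather than mid-request.

```python
import os

def load_config() -> dict:
    """Read the environment variables listed above.

    PINECONE_NAMESPACE falls back to "default"; the other keys
    are required and raise early if absent.
    """
    required = ["OPENAI_API_KEY", "PINECONE_API_KEY", "PINECONE_INDEX_NAME"]
    missing = [name for name in required if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "openai_api_key": os.environ["OPENAI_API_KEY"],
        "pinecone_api_key": os.environ["PINECONE_API_KEY"],
        "pinecone_index_name": os.environ["PINECONE_INDEX_NAME"],
        "pinecone_namespace": os.environ.get("PINECONE_NAMESPACE", "default"),
    }
```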
⚠️ Changing env vars does NOT require rebuilding Docker images — only a redeploy.
Triggered on push to master:
- Checkout code
- Build Docker image
- Push image to Amazon ECR
- Run Terraform
- Update Lambda functions
Image tagging strategy:
- `latest` (moving tag)
- `${GITHUB_SHA}` (immutable)
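The pipeline steps above might look roughly like the following `deploy.yml` sketch. Job names, action versions, the `serverless-rag` repository name, the region, and the `image_tag` Terraform variable are all assumptions, not the project's actual workflow.

```yaml
name: deploy
on:
  push:
    branches: [master]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - id: ecr
        uses: aws-actions/amazon-ecr-login@v2
      - name: Build and push image
        run: |
          IMAGE=${{ steps.ecr.outputs.registry }}/serverless-rag
          docker build -t "$IMAGE:latest" -t "$IMAGE:${GITHUB_SHA}" .
          docker push "$IMAGE:latest"
          docker push "$IMAGE:${GITHUB_SHA}"
      - name: Terraform apply
        working-directory: infra
        run: |
          terraform init
          terraform apply -auto-approve -var "image_tag=${GITHUB_SHA}"
```

Passing the immutable `${GITHUB_SHA}` tag into Terraform is what updates the Lambda functions to the new image.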
- Base image: `public.ecr.aws/lambda/python:3.12`
- Optimized for fast cold starts
- No heavy ML libraries (no `torch`, no `sentence-transformers`)
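A Dockerfile following these constraints could look like the sketch below. The details are assumptions; in particular, the `CMD` shown targets the query Lambda, and Terraform can override the image command per function so one image serves both handlers.

```dockerfile
FROM public.ecr.aws/lambda/python:3.12

# Install only the lightweight dependencies listed in requirements.txt
COPY requirements.txt .
RUN pip install -r requirements.txt --target "${LAMBDA_TASK_ROOT}"

# Copy application code into the Lambda task root
COPY src/ ${LAMBDA_TASK_ROOT}/src/

# Default handler; overridden per function at deploy time
CMD ["src.retrieval.handler.handler"]
```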
Lambda handlers:
- `src.ingestion.handler.handler`
- `src.retrieval.handler.handler`

Open `index.html` in a browser and ask questions:
POST /ask

```json
{
  "query": "What is the main content of ML.pdf?"
}
```

Features:
- ChatGPT-style UI
- User / Bot message alignment
- Loading indicator
- CORS enabled
- 💵 Pay-as-you-go (OpenAI)
- Pinecone: free / starter tier suitable for small docs
- AWS Lambda: extremely low cost for light usage
Example:
- 1 PDF (~6 pages): < $0.01 embedding cost
- Typical query: fractions of a cent
- No secrets in Docker images
- Immutable image tags
- Environment-based configuration
- Stateless Lambdas
- Warm-start optimization
- 🔐 Authentication (Cognito / JWT)
- 📎 Source citation in answers
- 🔄 Streaming responses
- 📊 Observability & metrics
- 🌍 Multi-language support
Built by Thanh – Backend Engineer
Focused on AWS, Serverless, and AI-powered systems.
This project demonstrates a clean, scalable, and production-ready RAG architecture using modern cloud-native and AI technologies.
If you are learning Serverless + GenAI, this is a solid real-world reference.