
RAG-Pipeline-With-FastAPI-and-Graylog

A FastAPI-based service implementing a Retrieval-Augmented Generation (RAG) pipeline for text generation, integrated with Graylog for API monitoring. The service allows users to upload files, store them in a Chroma vector store, and generate context-aware text using a pre-trained model endpoint (microsoft/Phi-3-mini-4k-instruct) from HuggingFace.

Architecture

The pipeline consists of two services, which are brought up with Docker Compose:

  • chatservice: serves the API endpoints
  • graylog: monitors API performance and server metrics

API Endpoints

POST /generate

Request:
- query: the question or prompt for text generation.

Response:
Generated text based on the query (without context awareness).
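
A hypothetical client call as a quick illustration; the exact request schema and the localhost:8000 host/port are assumptions, not confirmed by the repo:

import requests

# Hypothetical request; the payload shape {"query": ...} is an assumption.
resp = requests.post(
    "http://localhost:8000/generate",  # assumed host/port
    json={"query": "What is retrieval-augmented generation?"},
)
print(resp.json())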

POST /generate-text-with-context

Request:
- query: the question or prompt for text generation.

Response:
A context-aware answer based on the uploaded documents stored in the vector database.
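
A rough sketch of how such a context-aware route could work, assuming chromadb for retrieval and huggingface_hub's InferenceClient for generation; the repo's actual prompt format, collection name, and persistence path may differ:

import chromadb
from huggingface_hub import InferenceClient

client = chromadb.PersistentClient(path="./chroma_db")     # assumed path
collection = client.get_or_create_collection("documents")  # assumed name
llm = InferenceClient("microsoft/Phi-3-mini-4k-instruct")  # HF token read from env

def generate_with_context(query: str, k: int = 3) -> str:
    """Retrieve the k most similar chunks from ChromaDB and prepend
    them to the prompt before calling the model endpoint."""
    # Note: the collection's embedding function must match the one
    # used at ingestion time (here, BAAI/bge-small-en-v1.5).
    hits = collection.query(query_texts=[query], n_results=k)
    context = "\n".join(hits["documents"][0])
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return llm.text_generation(prompt, max_new_tokens=256)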

POST /upload

Request:
- files: a list of PDF files to be uploaded, processed, and stored in the vector store.

Response:
Number of files uploaded and stored successfully.
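
A hypothetical upload call; the multipart field name files follows the description above, while the host/port is assumed:

import requests

# Upload two local PDFs as a multipart request under the "files" field.
with open("doc1.pdf", "rb") as f1, open("doc2.pdf", "rb") as f2:
    resp = requests.post(
        "http://localhost:8000/upload",  # assumed host/port
        files=[("files", f1), ("files", f2)],
    )
print(resp.json())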

GET /system-metrics

Response:
Returns CPU, memory, and disk usage metrics and logs them to the Graylog service.
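
The metrics could be gathered with psutil along these lines (a sketch; the repo's actual field names may differ):

import psutil

def system_metrics() -> dict:
    """Collect CPU, memory, and disk usage as percentages."""
    return {
        "cpu_usage_percent": psutil.cpu_percent(interval=1),
        "memory_usage_percent": psutil.virtual_memory().percent,
        "disk_usage_percent": psutil.disk_usage("/").percent,
    }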

GET /health

Response:
Returns the health status of the service.

RAG Pipeline

The pipeline follows these steps to answer your questions and to send logs and system health details to the Graylog service:

  1. The user uploads PDF files via /upload to populate the ChromaDB store.

  2. Using BAAI/bge-small-en-v1.5 as the embedding model, the files' contents are extracted, split into smaller chunks, and embedded; the resulting embeddings are saved to ChromaDB (see the ingestion sketch after this list).

  3. The user asks a question by submitting a query request to the /generate-text-with-context API.

  4. The pipeline calls the Microsoft model microsoft/Phi-3-mini-4k-instruct to process the user's query and generate a suitable response based on the relevant context retrieved from ChromaDB.

  5. Every API request is logged and sent to the Graylog service; this is handled by a FastAPI middleware, namely async def log_request() (a sketch of such middleware also follows this list).

  6. System metrics of the server are sent to Graylog every minute, driven by the Docker HEALTHCHECK command in chatservice/Dockerfile.
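
A minimal sketch of step 2's ingestion path, using pypdf, sentence-transformers, and chromadb directly; the chunking strategy, collection name, and persistence path here are assumptions, and the repo may use a framework such as LangChain instead:

import chromadb
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")
client = chromadb.PersistentClient(path="./chroma_db")     # assumed path
collection = client.get_or_create_collection("documents")  # assumed name

def ingest_pdf(path: str, chunk_size: int = 500) -> int:
    """Extract text from a PDF, split it into fixed-size chunks,
    embed each chunk, and store the embeddings in ChromaDB."""
    text = "".join(page.extract_text() or "" for page in PdfReader(path).pages)
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    collection.add(
        ids=[f"{path}-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embedder.encode(chunks).tolist(),
    )
    return len(chunks)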
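
And a minimal sketch of what the logging middleware in step 5 might look like, assuming the graypy library and Graylog's default GELF UDP input on port 12201; the graylog hostname matching the compose service name and the extra field names are assumptions:

import logging
import time

import graypy
from fastapi import FastAPI, Request

app = FastAPI()

logger = logging.getLogger("api")
logger.setLevel(logging.INFO)
# "graylog" resolves to the Graylog container on the compose network;
# 12201/udp is Graylog's default GELF input port.
logger.addHandler(graypy.GELFUDPHandler("graylog", 12201))

@app.middleware("http")
async def log_request(request: Request, call_next):
    """Measure each request's latency and ship it to Graylog via GELF."""
    start = time.perf_counter()
    response = await call_next(request)
    elapsed = time.perf_counter() - start
    logger.info(
        "%s %s -> %d",
        request.method, request.url.path, response.status_code,
        extra={"response_time": elapsed, "path": request.url.path},
    )
    return response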

Setup

  1. Clone the repo:
git clone https://github.com/IvanLauLinTiong/rag-fastapi-graylog-monitoring
cd rag-fastapi-graylog-monitoring

  2. Rename .env.example to .env and set your HUGGINGFACEHUB_API_TOKEN (see the Hugging Face docs to get your token).

  3. Install Docker on your system and run the command below:

docker compose up -d

The first run takes a while (5-10 minutes) to spin up the containers, since the Docker images need to be built; subsequent runs come up faster.

  4. After all containers are up and running, you may access the service UIs.

  5. Inside the Graylog UI, you can:
  • enter any search query, for example response_time:>2, to view logs whose API response time is greater than 2 seconds

  • configure an alert notification for the Slow API response event. Currently the event condition searches the last 5 minutes of log messages, runs every 5 minutes, and sends the notification via email.

  6. Run the command below to stop all running services:
docker compose down --remove-orphans

Caveats

  1. The pipeline is slow to bring up on the first run, since all dependencies and images need to be downloaded and built before it can run.

  2. The email alert notification for slow API response times is currently not sent successfully (the email transport fails).

  3. The pipeline does not scale to support access by more than one user.

MLOps Principles Reflection

Currently, only two MLOps principles are applied throughout the development:

  1. Version Control for Code and Models

In MLOps, it's important to maintain version control of both the code and the machine learning models. I am using a pre-trained model (microsoft/Phi-3-mini-4k-instruct) from HuggingFace, which ensures consistency by pinning a specific version of the model. I could extend this setup by using a tool like DVC (Data Version Control) to version the data (retrieved documents), or MLflow for the model if I wanted to fine-tune or swap models.

  2. Monitoring and Logging

Graylog provides system-level monitoring, which aligns with the MLOps principle of continuously monitoring the ML infrastructure. I monitor API performance (response times, error rates, etc.) and system resources (CPU, memory, etc.). An email alert notification is also configured for when the API response time exceeds a certain threshold (here, 2 seconds). This keeps the system responsive and reliable and helps meet SLAs (Service Level Agreements), which is crucial in production-grade ML systems.

Future Improvements

  1. Address the caveats highlighted in the Caveats section:
  • Docker optimizations such as a multi-stage build
  • troubleshoot whether the Graylog server's firewall is open for the email SMTP protocol
  • add a message queue to the pipeline and configure ChromaDB to support multi-user access

  2. Add a CI/CD pipeline that performs unit testing for the APIs, evaluates the pre-trained model's output, and auto-deploys the FastAPI Docker image.

  3. Add horizontal scalability and resilience to the container services, for example by deploying them on a Kubernetes platform.
