> **Warning**
> This project is incomplete and under active development. The infrastructure and documentation are subject to significant changes.
Infrastructure-as-code for deploying and managing GPU servers for machine learning research support.
This repository contains the configuration, deployment scripts, and documentation for running:
- LLM inference services (vLLM + LiteLLM proxy)
- Monitoring stack (Prometheus + Grafana + DCGM exporter)
- Workshop environments (JupyterHub for training sessions)
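The LiteLLM proxy presents an OpenAI-compatible HTTP API in front of vLLM, so any OpenAI-style client can talk to the service. As a hedged sketch (the base URL, port, and model name below are illustrative assumptions about a deployment, not values fixed by this repo), the request body for a chat completion looks like:

```python
import json

# Assumed LiteLLM proxy address; substitute your deployment's host and port.
BASE_URL = "http://localhost:4000/v1"


def build_chat_request(model: str, prompt: str, max_tokens: int = 128) -> dict:
    """Build the JSON body for an OpenAI-style /chat/completions request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


# "llama-3.1-8b" is a placeholder model alias, not necessarily one this repo serves.
payload = build_chat_request("llama-3.1-8b", "Hello!")
print(json.dumps(payload))
```

Sending this payload as a POST to `{BASE_URL}/chat/completions` with an `Authorization: Bearer <key>` header is essentially all an OpenAI-compatible client does under the hood.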
Our research group found ourselves with a server and a dream: serve large language model endpoints to our community for free, so they could experiment with LLMs. But the path from zero to a robust, scalable language model service did not look easy. We faced questions like: Which inference engine do we use? How do we manage access? How do we monitor usage? Which models can we supply, and how many users can we feasibly serve? How do we assess the quality of our service?
We quickly noticed that this information is scattered across blog posts, subreddits, tutorials, technical documentation, and tribal knowledge. In trying to answer these questions, we realised that other people must have run into the same problems: no doubt there are pockets of researchers, small businesses, and even homelab enthusiasts with their own hardware grappling with the same questions.
In some ways, this repo serves as a call to all those doing something similar: here is what we tried; how about you? To those in the first stages of this process, we hope it serves as a useful starting point. Within this repo, we aim to provide not only the software infrastructure to serve LLMs, but also documentation that doubles as a set of tutorials. We also offer our Architectural Decision Records (ADRs), so that readers can understand why we made the decisions we did.
We offer this with one caveat: many areas may be... suboptimal. If so, we welcome any well-intentioned feedback or advice in our issues.
- Clone this repository on your GPU server.
- Run the setup script to install base dependencies, Docker, and NVIDIA drivers:

  ```bash
  sudo ./scripts/setup.sh
  ```

- Deploy the monitoring stack:

  ```bash
  ./scripts/monitoring.sh
  ```

- Deploy the LLM service:

  ```bash
  ./scripts/monitoring.sh
  ```

Repository layout:

```
ansible/   # Ansible playbooks for server configuration
docs/      # Documentation and Architecture Decision Records
scripts/   # Operational scripts (mode switching, maintenance)
stacks/    # Docker Compose definitions for each service
```
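For a sense of what lives under `stacks/`, here is a hypothetical sketch of a Docker Compose definition for a vLLM service; the image tag, model name, and port are illustrative assumptions, not the repository's actual configuration:

```yaml
# Hypothetical sketch -- not the repository's actual stack definition.
services:
  vllm:
    image: vllm/vllm-openai:latest          # vLLM's OpenAI-compatible server image
    command: --model meta-llama/Llama-3.1-8B-Instruct
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia                # requires the NVIDIA Container Toolkit
              count: 1
              capabilities: [gpu]
```

Running `docker compose up -d` in the stack directory would bring such a service up, assuming the NVIDIA Container Toolkit is installed alongside the drivers.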
- Getting Started - Detailed setup instructions
- System Architecture - How the components fit together
- ADRs - Architecture Decision Records explaining key choices
- Ubuntu 22.04 LTS (server)
- NVIDIA GPU with recent drivers
- Docker and Docker Compose
GNU GPLv3 - See LICENSE for details.