Here, we detail the initial setup stages. We make some basic assumptions:
- You have a bare metal server
- You have no other software installed
- You have Ubuntu 22.04
We provide two setup methods, a script and Ansible, with Ansible being the preferred option.
The Ansible process is as follows:
- Connect to your server (however you do this)
- Clone this repo (this may require the installation of Git)
- Generate security keys and deposit the public key on your server in `~/.ssh/authorized_keys`
- Exit your server
- Set your `inventory.ini` target
- Run the `setup.yml` playbook
- Run the `monitoring.yml` playbook
In the sections below, we walk through this process in detail, assuming you have already completed the first two steps (connecting to your server and cloning this repo).
Ansible connects to your server via SSH using key-based authentication. If you don't already have an SSH key pair, generate one on your local machine:

    ssh-keygen -t ed25519 -C "<your-email>"

Accept the default location (`~/.ssh/id_ed25519`) or specify a custom path. You can optionally add a passphrase for additional security.
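
If you script your workstation setup, a small guard can avoid overwriting a key that already exists. The following is a minimal sketch rather than part of this repo: the function name `ensure_keypair` and its arguments are hypothetical, and it assumes `ssh-keygen` is available on your PATH.

```shell
# Hypothetical helper: create an ed25519 key pair only if no private key
# exists at the given path.
# Usage: ensure_keypair <private-key-path> <comment>
ensure_keypair() {
    key_path="$1"
    comment="$2"
    if [ -f "$key_path" ]; then
        # Never overwrite an existing private key.
        echo "exists: $key_path"
        return 0
    fi
    # Create the parent directory with owner-only permissions if it is missing.
    mkdir -p -m 700 "$(dirname "$key_path")"
    # No -N flag is passed, so ssh-keygen prompts for an optional passphrase.
    ssh-keygen -t ed25519 -C "$comment" -f "$key_path" && echo "created: $key_path"
}
```

For example, `ensure_keypair "$HOME/.ssh/id_ed25519" "<your-email>"` leaves an existing key untouched and only generates a new one when the file is absent.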
Next, copy your public key to the server. If you're still connected to the server, you can do this manually:

    # On the server
    mkdir -p ~/.ssh
    chmod 700 ~/.ssh
    echo "your-public-key-content" >> ~/.ssh/authorized_keys
    chmod 600 ~/.ssh/authorized_keys

Alternatively, from your local machine:

    ssh-copy-id -i ~/.ssh/id_ed25519.pub ubuntu@<server-ip>

Verify the key works by disconnecting and reconnecting without a password:

    ssh -i ~/.ssh/id_ed25519 ubuntu@<server-ip>

The inventory file tells Ansible which servers to manage. Copy the example file and edit it:

    cd ansible
    cp inventory.ini.example inventory.ini

Edit `inventory.ini` to add your server details:

    [gpu_servers]
    gpu-server ansible_host=<YOUR_SERVER_IP> ansible_user=ubuntu ansible_ssh_private_key_file=~/.ssh/id_ed25519

Replace `<YOUR_SERVER_IP>` with your server's IP address. If you used a different SSH key path, update `ansible_ssh_private_key_file` accordingly.
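
If you would rather generate the inventory than edit it by hand, something like the sketch below could work. The function name `render_inventory` and its defaults are assumptions made for illustration; the group name, host alias, and variables mirror the example above.

```shell
# Hypothetical helper: print an inventory with one host in the gpu_servers group.
# Usage: render_inventory <server-ip> [user] [private-key-path]
render_inventory() {
    server_ip="$1"
    user="${2:-ubuntu}"
    key_file="$3"
    # Default to the key path used in this guide; the tilde is kept literal.
    [ -n "$key_file" ] || key_file='~/.ssh/id_ed25519'
    if [ -z "$server_ip" ]; then
        echo "usage: render_inventory <server-ip> [user] [private-key-path]" >&2
        return 1
    fi
    printf '[gpu_servers]\n'
    printf 'gpu-server ansible_host=%s ansible_user=%s ansible_ssh_private_key_file=%s\n' \
        "$server_ip" "$user" "$key_file"
}
```

Running `render_inventory 203.0.113.10 > inventory.ini` from the `ansible` folder would write a file equivalent to the hand-edited one, with `203.0.113.10` standing in for your server's address.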
Test the connection with:

    ansible gpu_servers -m ping -i ./ansible/inventory.ini

The `-i` flag specifies the path to the inventory file; the path shown assumes you are in the repository root. If you run the command from inside the `ansible` folder, you don't need to specify it. You should see a successful `pong` response.
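
When the inventory grows beyond one host, it can help to summarise the ping results instead of reading them one by one. The sketch below is an assumption-laden illustration: `summarize_ping` is a hypothetical name, and it relies on Ansible's usual ad-hoc output, where each host line looks like `gpu-server | SUCCESS => {` or `gpu-server | UNREACHABLE! => {`.

```shell
# Hypothetical helper: count successful and failed hosts in the output of
# an ad-hoc "ansible ... -m ping" run, read from stdin.
summarize_ping() {
    awk '
        /[|] SUCCESS/                { ok++ }
        /[|] (UNREACHABLE|FAILED)!/  { bad++ }
        END { printf "ok=%d failed=%d\n", ok, bad }
    '
}
```

A possible use is `ansible gpu_servers -m ping -i ./ansible/inventory.ini | summarize_ping`, which prints a single line with both counts.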
The setup playbook installs Docker, the NVIDIA Container Toolkit, and other base dependencies. Before running it, ensure you have the required Ansible collections installed on your local machine:

    ansible-galaxy collection install community.docker

Then run the playbook:

    ansible-playbook playbooks/setup.yml

This playbook will update system packages, install Docker CE, configure the NVIDIA Container Toolkit, and verify GPU access. The process typically takes a few minutes. At the end, you should see the output of `nvidia-smi` confirming your GPU is detected.
With the base setup complete, deploy the monitoring stack:

    ansible-playbook playbooks/monitoring.yml

This deploys Prometheus, Grafana, node-exporter (for CPU/RAM metrics), and dcgm-exporter (for GPU metrics). The playbook performs health checks on all services before completing.
Once finished, you can access the monitoring interfaces:
| Service | URL | Notes |
|---|---|---|
| Grafana | `http://<server-ip>:3000` | Default login: admin / admin |
| Prometheus | `http://<server-ip>:9090` | Query interface |
| Node Exporter | `http://<server-ip>:9100/metrics` | Raw system metrics |
| DCGM Exporter | `http://<server-ip>:9400/metrics` | Raw GPU metrics |
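
To confirm from your local machine that all four endpoints respond, a quick loop over the URLs in the table may be enough. This is a rough sketch, not the health check the playbook itself runs: the function names are hypothetical, and it assumes `curl` is installed and the ports are reachable from where you run it.

```shell
# Hypothetical helper: print the four monitoring URLs for a given host.
monitoring_urls() {
    host="$1"
    printf 'http://%s:3000\n' "$host"           # Grafana
    printf 'http://%s:9090\n' "$host"           # Prometheus
    printf 'http://%s:9100/metrics\n' "$host"   # Node Exporter
    printf 'http://%s:9400/metrics\n' "$host"   # DCGM Exporter
}

# Hypothetical helper: request each URL and report whether it answered.
# curl -f treats HTTP status codes of 400 and above as failures.
check_monitoring() {
    rc=0
    for url in $(monitoring_urls "$1"); do
        if curl -fsS -o /dev/null --max-time 5 "$url"; then
            echo "ok   $url"
        else
            echo "FAIL $url"
            rc=1
        fi
    done
    return "$rc"
}
```

Calling `check_monitoring <server-ip>` prints one line per service and returns a non-zero status if any endpoint did not respond.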
On first login to Grafana, you'll be prompted to change the admin password. After that, you need to add Prometheus as a data source:
- Navigate to Connections → Data sources
- Click Add data source
- Select Prometheus
- Set the URL to `http://prometheus:9090` (using the Docker network hostname)
- Click Save & test
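
The same data source can likely be created without the UI through Grafana's HTTP API (`POST /api/datasources`). The sketch below is one possible approach under stated assumptions: the function names are hypothetical, the data source name `Prometheus` and the `isDefault` setting are illustrative choices, and the credentials are whatever you set for the admin user.

```shell
# Hypothetical helper: print the JSON body for a Prometheus data source.
# Usage: prometheus_datasource_json [prometheus-url]
prometheus_datasource_json() {
    prom_url="${1:-http://prometheus:9090}"
    printf '{"name":"Prometheus","type":"prometheus","access":"proxy","url":"%s","isDefault":true}\n' \
        "$prom_url"
}

# Hypothetical helper: send that body to Grafana's data source endpoint.
# Usage: add_prometheus_datasource <grafana-base-url> <user:password>
add_prometheus_datasource() {
    prometheus_datasource_json | curl -fsS -u "$2" \
        -H 'Content-Type: application/json' \
        --data-binary @- "$1/api/datasources"
}
```

For instance, `add_prometheus_datasource "http://<server-ip>:3000" "admin:<your-password>"` would register the data source using the Docker network hostname, just as in the manual steps.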
To visualise your metrics, import pre-built dashboards:
- Navigate to Dashboards → Import
- Enter the dashboard ID and click Load
- Select your Prometheus data source and click Import
Recommended dashboards:
| Dashboard | ID | Description |
|---|---|---|
| Node Exporter Full | 1860 | Comprehensive system metrics |
| NVIDIA DCGM Exporter | 12239 | GPU utilisation, memory, temperature, power |
Ansible can't connect to the server
Check that your SSH key is correctly configured and that the inventory file has the correct IP, username, and key path. Test manually with `ssh -i <key-path> <user>@<ip>`.
Docker commands fail with permission denied
The setup playbook adds your user to the docker group, but this requires a new login session to take effect. Either reconnect to the server or run newgrp docker.
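
To tell whether the group change is merely waiting for a new session, you can compare the groups of the current shell with those recorded for your user. This is an illustrative sketch with hypothetical function names; it assumes the standard `id` utility.

```shell
# Hypothetical helper: succeed if <group> appears in a space-separated list.
# Usage: in_group_list <group> "<list of group names>"
in_group_list() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: report whether docker group membership is active in
# this session, pending a new login, or missing altogether.
docker_group_status() {
    if in_group_list docker "$(id -nG)"; then
        echo "active"
    elif in_group_list docker "$(id -nG "$(id -un)")"; then
        echo "pending: start a new login session or run newgrp docker"
    else
        echo "missing: user is not in the docker group"
    fi
}
```

Running `docker_group_status` on the server prints one of the three states, which narrows down whether reconnecting is all that is needed.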
DCGM exporter container fails to start
Ensure the NVIDIA drivers are installed and working (nvidia-smi should show your GPU). The NVIDIA Container Toolkit must also be configured correctly. Re-run the setup playbook if needed.
Grafana can't reach Prometheus
When adding the data source, use http://prometheus:9090 (the Docker network hostname), not localhost. The containers communicate over the Docker bridge network.
For the script method, the SSH key setup stays the same, but instead of exiting you remain on the server and run:

    ./scripts/setup.sh

and then

    ./scripts/monitoring

Follow the same instructions as in the Ansible section for Grafana.