This collection of software lets you use your own equipment to observe the health and operation of the Internet Computer nodes in your data center. It is a set of Docker containers that, once deployed and configured, collect metrics about the performance of selected nodes and send alerts if something is off.
The stack selects IC nodes and collects metrics from each node every few seconds, saving them in a local Prometheus database. This database is queryable through a Grafana instance deployed alongside Prometheus.
All nodes of the Internet Computer expose a set of metrics to the public (in Prometheus text format) that can be collected and analyzed by software such as this.
The stack has to be deployed within the same network subnet as the nodes so that scraping can take place. In practice, this means that the nodes and the machine running this stack have to be in the same data center and connected to the same router.
- IPv6 connectivity.
- Root-equivalent access on your workstation, needed to install the software this stack depends on.
- Hardware.
- It is recommended to run this stack on a machine with at least 16 GB of RAM and 80-100 GB of storage.
- SSH access to the machine.
- This is needed later to connect remotely to the services and inspect issues.
- Setting up ssh server guide.
To start preparing your workstation for scraping and observing your nodes, SSH into the machine (or sit physically next to it) so that you can run commands on it.
Install Docker for your machine. The setup should not differ noticeably across operating systems; it has been tested on macOS and Manjaro (a Linux flavour) but has not been tested on Windows machines.
Once you can run docker commands and have docker compose available, you can proceed to the next step.
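As a quick sanity check, something like the following should work (a sketch; the exact version output will differ on your machine):

```bash
# Verify that Docker and the compose plugin are installed and that the daemon is reachable.
docker --version
docker compose version
docker run --rm hello-world
```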
Now that you have Docker, you can proceed to configure the stack.
To properly configure the user and group that will be used for all Docker workloads, run the following setup script:
```bash
# NOTE: you have to be within the same directory as this README!
./setup.sh
```

The first thing you have to configure is your scraping targets. To do that, find the principal id that corresponds to your node provider id. You can find it on the public dashboard. After that, find the id of the data center in which you are deploying this stack. That information is also available on the public dashboard.
With that you can run the following command:
```bash
# NOTE: you have to be within the same directory as this README!
docker compose -f ./docker-compose.tools.yaml run --rm prom-config-builder tools/prom-config-builder/prom_config_builder.py --node-provider-id <node-provider-id> --dc-id <dc-id>
```

Example command for node provider Dfinity Stiftung for data center se1:

```bash
docker compose -f ./docker-compose.tools.yaml run --rm prom-config-builder tools/prom-config-builder/prom_config_builder.py --node-provider-id bvcsg-3od6r-jnydw-eysln-aql7w-td5zn-ay5m6-sibd2-jzojt-anwag-mqe --dc-id se1
```

Once that executes, you should be able to see a new file at ./config/prometheus/config.yaml.
This file contains the definitions for the scraping targets. It will be slightly different for each node provider and each data center. It is not version controlled, and you can always recreate it by running the above command if you lose it or delete it.
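To sanity-check the generated file, you can peek at the scrape jobs it defines (a sketch; the exact structure of the generated file may differ, but job_name is the standard Prometheus key):

```bash
# List the scrape jobs defined in the generated Prometheus configuration.
grep -n "job_name" ./config/prometheus/config.yaml
```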
This stack uses Grafana to present the dashboards and to send alerts. Alerts are delivered through contact points, which need to be configured. Grafana supports various contact points; some can be used for free, some are paid. This stack was tested with the following contact points:
- Discord [free]
- Slack
- Google chat
To set up your preferred contact point, copy over the template with the following command:

```bash
cp ./config/grafana/templates/template_contact_points.yaml ./config/grafana/provisioning/alerting/contact_points.yaml
```

After that, edit the new file at ./config/grafana/provisioning/alerting/contact_points.yaml, uncomment your preferred contact point and configure it (to uncomment, remove the initial # from the contact point definition).
To deploy the stack run the following command:
```bash
# This will spawn the containers
docker compose -f ./docker-compose.yaml up -d
```

It will take some time for the services to start and to sync with one another. You can check whether containers are failing by running docker ps and looking for restarts, as shown below. For more details, read the Troubleshooting section.
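For example, one way to keep an eye on container status (the container names are whatever the compose file defines):

```bash
# Show name and status for every running container; repeated restarts or a
# "Restarting" status indicate that a service is failing.
docker ps --format "table {{.Names}}\t{{.Status}}"
```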
Once started, you will see the following applications:
- Prometheus - http://localhost:9090
- Grafana - http://localhost:3000 - default credentials can be seen in ./config/grafana/grafana.ini
- Service discovery - http://localhost:8000
After 5-10 minutes you should see the discovered targets in Prometheus on its targets page. Initially they might appear red; if you keep monitoring, they should slowly turn green, which means the targets are being scraped successfully.
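You can also check the same target health from the command line through the standard Prometheus HTTP API (this assumes jq is installed):

```bash
# List every scrape target with its job, instance and health as seen by Prometheus.
curl -s http://localhost:9090/api/v1/targets \
  | jq '.data.activeTargets[] | {job: .labels.job, instance: .labels.instance, health: .health}'
```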
You should also see data coming in on the Grafana sample dashboard for the node exporter. There you can see various information about the performance and health of the nodes.
The Alerting tab will show the obs alert evaluations, which contain a list of preconfigured alerts. Here you can see if any of the alerts are in a problematic state. Most of the time they should be in the Normal state, which means that everything is fine. Some of them may occasionally go into the Pending state, which means they crossed the alert threshold but haven't stayed there long enough to be considered a problem. When a Pending alert persists for long enough, it will go into the Firing state and you should receive an alert on your preconfigured contact point.
NOTE: Not all alerts are configured to send notifications when they fire, usually because they are just warnings and cannot be acted upon. To see which are and which aren't, open ./config/grafana/provisioning/alerting/alerts.yaml and check which alerts have the label severity: critical attached to them; only those will send alerts to your contact point. If you wish to receive notifications for all of them, replace severity: warning with severity: critical on the remaining ones.
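For example, a one-liner to promote all warnings to critical could look like this (a sketch; back up the file and double-check the result before redeploying):

```bash
# Keep a backup of the original alert definitions.
cp ./config/grafana/provisioning/alerting/alerts.yaml ./config/grafana/provisioning/alerting/alerts.yaml.bak

# Promote every warning-level alert to critical so it notifies the contact point.
# On macOS, use: sed -i '' 's/severity: warning/severity: critical/g' <file>
sed -i 's/severity: warning/severity: critical/g' ./config/grafana/provisioning/alerting/alerts.yaml
```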
To access the stack remotely you can do the following:
```bash
ssh -L 3000:localhost:3000 -L 9090:localhost:9090 -L 8000:localhost:8000 <machine-with-obs-stack>
```

Example command with all parameters:

```bash
ssh -L 3000:localhost:3000 -L 9090:localhost:9090 -i ~/.ssh/priv_key.pem [email protected]
```

Extending this stack usually means adding new alerts or dashboards. Those might be your own modifications or ones that come from someone else.
Building Grafana dashboards is usually done via the Grafana UI. You can follow this tutorial from Grafana to make your custom dashboards.
After creating the dashboard, it is suggested to export it and save it in ./config/grafana/provisioning/dashboards/. This makes sure the dashboard is persisted if you later need to restore it or do a fresh deployment of Grafana.
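A minimal sketch of exporting a dashboard through Grafana's HTTP API; the dashboard UID, credentials and output file name are placeholders (use the credentials from ./config/grafana/grafana.ini), and jq is assumed to be installed:

```bash
# Export a dashboard by UID via Grafana's HTTP API and save only the dashboard JSON.
curl -s -u admin:admin http://localhost:3000/api/dashboards/uid/<dashboard-uid> \
  | jq '.dashboard' > ./config/grafana/provisioning/dashboards/my_dashboard.json
```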
NOTE: Storing a dashboard and restarting can lead to errors due to overlapping dashboard uids. If you export a dashboard and save it, be sure to do a clean deployment of Grafana.
Adding Grafana alerts can also be done via the Grafana UI. Similarly to dashboards, it is suggested to export alerts and save them in ./config/grafana/provisioning/alerting/, which makes sure they persist across full Grafana redeployments.
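If your Grafana version supports it, the alert provisioning API can export all alert rules in provisioning format; a sketch, assuming the default credentials and that the export endpoint is available in your version (the endpoint and format parameter may differ between Grafana releases):

```bash
# Export all alert rules as provisioning-format YAML (available in recent Grafana versions).
curl -s -u admin:admin \
  "http://localhost:3000/api/v1/provisioning/alert-rules/export?format=yaml" \
  > ./config/grafana/provisioning/alerting/exported_alerts.yaml
```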
It is possible, usually due to networking issues, that the service discovery component fails. To debug why, you can run the following command and inspect the logs:
```bash
# From the same folder of this README
docker compose logs multiservice-discovery
```

You may see something along the lines of:

```
multiservice-discovery-1 | Nov 10 13:25:07.044 WARN Failed to sync registry for mercury @ interval Instant { tv_sec: 9050, tv_nsec: 707832535 }: SyncWithNnsFailed { failures: [("targets", RegistryTransportError { source: UnknownError("Failed to query get_certified_changes_since on canister rwlgt-iiaaa-aaaaa-aaaaa-cai: Request failed for http://[2606:fb40:201:1001:6801:2fff:fef5:b129]:8080/api/v2/canister/rwlgt-iiaaa-aaaaa-aaaaa-cai/query: hyper_util::client::legacy::Error(Connect, ConnectError(\"tcp connect error\", [2606:fb40:201:1001:6801:2fff:fef5:b129]:8080, Os { code: 101, kind: NetworkUnreachable, message: \"Network is unreachable\" }))") })] }
```

This is usually a transient error and may happen from time to time. It means that the discovery cannot sync with the registry canister and may be serving stale targets. Usually this is acceptable, but if it happens when deploying and the initial sync fails, you may be unable to see any of your nodes. As long as you can see your nodes at the following link, you can safely ignore the transient failures.
```bash
curl "http://localhost:8000/prom/targets?node_provider_id=<node-provider-id>&dc_id=<dc-id>"
```

NOTE: The initial sync of service discovery may take up to 15 minutes! Syncing will be clearly logged in the multiservice discovery.
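For example, with real values filled in (the ids below are the ones used earlier in this README) and jq installed, you could count the returned target groups; this assumes the endpoint serves Prometheus HTTP service-discovery JSON, i.e. an array of {targets, labels} objects:

```bash
# Count how many target groups the service discovery returns for your nodes.
curl -s "http://localhost:8000/prom/targets?node_provider_id=bvcsg-3od6r-jnydw-eysln-aql7w-td5zn-ay5m6-sibd2-jzojt-anwag-mqe&dc_id=se1" \
  | jq 'length'
```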
If you don't see anything in the Prometheus targets view, it means that Prometheus failed to receive targets from the service discovery.
To check the logs run:
```bash
# From the same folder of this README
docker compose logs prometheus
```

Check if you can see your nodes by running the following command:

```bash
curl "http://localhost:8000/prom/targets?node_provider_id=<node-provider-id>&dc_id=<dc-id>"
```

You should now see 4 jobs:

- host_node_exporter
- node_exporter
- orchestrator
- replica
If any of them are shown in red, it means that some (or all) of the targets are failing to be scraped. You can see that from the logs as well:

```bash
# From the same folder of this README
docker compose logs prometheus
```

This usually means that the Prometheus scraper cannot reach the nodes it is trying to scrape. It can be because the workstation running this observability stack isn't in the same network subnet as the nodes, or due to other network issues.
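A quick way to check basic IPv6 reachability from the workstation to one of your nodes; the node address is a placeholder, and the port and scheme shown are assumptions, so adapt them to what your generated Prometheus config actually scrapes:

```bash
# Replace <node-ipv6> with an address from ./config/prometheus/config.yaml or the
# service discovery output. On some systems the command is ping6 instead of ping -6.
ping -6 -c 3 <node-ipv6>

# Try fetching metrics directly; port 9100 and plain HTTP are assumptions, use the
# scheme and port from your generated Prometheus config.
curl -6 -sv "http://[<node-ipv6>]:9100/metrics" | head
```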
To make a full (or partial) clean restart, you can do the following:

- Ensure that everything you created through the Grafana UI is exported:
  - Dashboards (save them into ./config/grafana/provisioning/dashboards/)
  - Alerts (save them into ./config/grafana/provisioning/alerting/)
  - Contact points (save them into ./config/grafana/provisioning/alerting/)
  - Message templates (save them into ./config/grafana/provisioning/alerting/)
  - Notification policies (save them into ./config/grafana/provisioning/alerting/)
- If you haven't created such resources, or don't mind losing them, you can proceed.
- Stop the stack:

  ```bash
  docker compose -f ./docker-compose.yaml down
  ```

- Clean the volumes. You don't have to clean everything; pick just the ones that you wish to reset fully:
  - prometheus:

    ```bash
    rm -rf ./volumes/prometheus/
    ```

  - grafana:

    ```bash
    rm -rf ./volumes/grafana/
    ```

  - multiservice discovery:

    ```bash
    rm -rf ./volumes/msd/
    ```

- Reset the folder structure:

  ```bash
  git checkout -- ./volumes/
  ```

- Run the stack again:

  ```bash
  docker compose -f ./docker-compose.yaml up -d
  ```
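For convenience, the full-reset sequence above could be collected into a small script; a sketch, assuming you have already exported everything you want to keep:

```bash
#!/usr/bin/env bash
# Full reset of the observability stack. This destroys all local Prometheus,
# Grafana and service discovery state, so export anything you want to keep first.
set -euo pipefail

docker compose -f ./docker-compose.yaml down
rm -rf ./volumes/prometheus/ ./volumes/grafana/ ./volumes/msd/
git checkout -- ./volumes/
docker compose -f ./docker-compose.yaml up -d
```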