Skip to content

Add cluster activation status metric and emit to v1/jmx #673

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

amybubu
Copy link

@amybubu amybubu commented Apr 15, 2025

Description

As discussed in Issue #649 we would like to add a built-in activation status metric to Trino Gateway, to improve telemetry. This metric has a value of 1 if activated and a value of 0 if deactivated. The metric is populated with values polled directly from the database. Each cluster has it's own BackendClusterMetricsStats that can keep track of multiple metrics as we add in the future. When a backend cluster is added/deleted, an associated metric is registered/unregistered.

To keep this information up to date, the backends table in the DB is polled every 30 seconds to check if any changes have been made. We can't rely on triggering a metrics registry refresh within the add/delete backend methods because only one instance gets this API call. So if there are multiple instances, they won't all know to update their respective list of metrics. NOTE: Because these are periodically updated every 30 seconds, there may be a slight delay in new metrics showing up and old metrics disappearing. For deleted backends, their activation status metric value will be -1 until it is unregistered next refresh period.

Testing

  • built and ran locally
  • saw new metrics in /v1/jmx (NOTE: added testMetric to show proof of concept of adding other types of metrics, but removed from code before publishing PR)
Screenshot 2025-04-07 at 5 38 33 PM
  • deactivated and activated cluster via UI and curl calls, and saw changes reflected in metrics after refreshing v1/jmx
  • added, updated, and deleted backends, and saw associated metrics added, updated and deleted respectively (with expected slight delay due to refresh period)

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( x ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Apr 15, 2025
Copy link
Member

@xkrogen xkrogen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a great framework for us to build on top of for collecting various backend-specific metrics. Thanks @amybubu !

@andythsu
Copy link
Member

LGTM!

@andythsu
Copy link
Member

pinging @vishalya

private final MonitorConfiguration monitorConfiguration;
// MBeanExporter uses weak references, so statsMap is needed to maintain strong references to metric objects to prevent garbage collection
private final Map<String, ClusterMetricsStats> statsMap = new HashMap<>();
private final ScheduledExecutorService scheduledExecutor = Executors.newSingleThreadScheduledExecutor();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 - Single thread should be good.

@vishalya
Copy link
Member

vishalya commented Apr 30, 2025

LGTM me as well.
This exporter has it's own executor/thread which exports the metrics, would like to get an opinion from @oneonestar, if that's ok, We should try to avoid multiple of these threads.

@xkrogen
Copy link
Member

xkrogen commented Apr 30, 2025

LGTM me as well. This exporter has it's own executor/thread which exports the metrics, would like to get an onion from @oneonestar, if that's ok, We should try to avoid multiple of these threads.

One thing to point out here -- the executor service isn't actually used to export the metrics. It's used to refresh the list of backends which are registered as having metrics (MBeanExporter#exportWithGeneratedName()/MBeanExporter#unexportWithGeneratedName()). Then the normal MBeanExporter thread handles the actual exporting of metrics.

For a little background on why this is needed, we started with just updating the map on any changes to the backends (add/delete/etc), but of course with a multi-Gateway setup this falls out of sync when changes are made on other instances. It's essentially the same problem we've discussed in other contexts; I think I can remember specifically that we have this split-brain with the health check logic.

Long-term, probably the right place to be in is that we have one central "state sync" which propagates any DB state changes back into the Gateway's local in-memory state, and any kind of triggers involved with state changes (including metrics registration) could happen there.

Copy link
Member

@vishalya vishalya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we are aware of the limitation due to the lack of the central state.
Approved.

Copy link
Member

@ebyhr ebyhr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a test?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

5 participants