Add cluster activation status metric and emit to v1/jmx #673
I think this is a great framework for us to build on top of for collecting various backend-specific metrics. Thanks @amybubu !
gateway-ha/src/main/java/io/trino/gateway/ha/clustermonitor/BackendClusterMetricStats.java
gateway-ha/src/main/java/io/trino/gateway/ha/clustermonitor/BackendsMetricStats.java
gateway-ha/src/main/java/io/trino/gateway/ha/config/MonitorConfiguration.java
LGTM!
pinging @vishalya
private final MonitorConfiguration monitorConfiguration;
// MBeanExporter uses weak references, so statsMap is needed to maintain strong references to metric objects to prevent garbage collection
private final Map<String, ClusterMetricsStats> statsMap = new HashMap<>();
private final ScheduledExecutorService scheduledExecutor = Executors.newSingleThreadScheduledExecutor();
👍 - Single thread should be good.
LGTM from me as well.
One thing to point out here: the executor service isn't actually used to export the metrics. It's used to refresh the list of backends which are registered as having metrics. For a little background on why this is needed, we started with just updating the map on any changes to the backends (add/delete/etc), but of course with a multi-Gateway setup this falls out of sync when changes are made on other instances. It's essentially the same problem we've discussed in other contexts; I specifically remember that we have this split-brain issue with the health check logic. Long-term, the right place to be is probably to have one central "state sync" which propagates any DB state changes back into the Gateway's local in-memory state, so that any triggers involved with state changes (including metrics registration) could happen there.
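For readers following the thread, here is a minimal sketch of the refresh idea described above, not the code from this PR. The class name `MetricsRegistryRefresher`, the `Supplier` standing in for the database query, and the use of plain `Object` values as placeholder metric objects are all assumptions made for illustration; a real implementation would presumably also export and unexport the objects through the MBean exporter at the marked points.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch: names and structure are illustrative, not taken from the PR.
class MetricsRegistryRefresher {
    // Strong references to the per-cluster metric objects, mirroring the role of statsMap.
    private final Map<String, Object> registered = new HashMap<>();
    // Stands in for "read the current backend names from the database".
    private final Supplier<Set<String>> backendNames;

    MetricsRegistryRefresher(Supplier<Set<String>> backendNames) {
        this.backendNames = backendNames;
    }

    // Reconcile the locally registered metrics with the backends currently in the database.
    synchronized void refresh() {
        Set<String> current = new HashSet<>(backendNames.get());
        // Drop metrics for backends that no longer exist (a real version would unexport here).
        registered.keySet().retainAll(current);
        // Add metrics for backends seen for the first time (a real version would export here).
        for (String name : current) {
            registered.computeIfAbsent(name, ignored -> new Object());
        }
    }

    synchronized Set<String> registeredNames() {
        return new HashSet<>(registered.keySet());
    }

    // Run refresh() on a single background thread at a fixed period.
    ScheduledExecutorService start(long periodSeconds) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleAtFixedRate(this::refresh, 0, periodSeconds, TimeUnit.SECONDS);
        return executor;
    }
}
```

Because every Gateway instance runs the same reconciliation against the shared database, a backend added or deleted through one instance should show up on the others within one refresh period, without any instance-to-instance signalling.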
gateway-ha/src/main/java/io/trino/gateway/ha/clustermonitor/ClusterMetricsStats.java
gateway-ha/src/main/java/io/trino/gateway/ha/clustermonitor/ClustersMetricsStats.java
Yes, we are aware of the limitation due to the lack of the central state.
Approved.
Could you add a test?
Description
As discussed in Issue #649, we would like to add a built-in activation status metric to Trino Gateway to improve telemetry. This metric has a value of 1 if the cluster is activated and 0 if it is deactivated. The metric is populated with values polled directly from the database. Each cluster has its own BackendClusterMetricsStats, which can keep track of multiple metrics as we add more in the future. When a backend cluster is added or deleted, an associated metric is registered or unregistered.
To keep this information up to date, the backends table in the DB is polled every 30 seconds to check whether any changes have been made. We can't rely on triggering a metrics registry refresh within the add/delete backend methods because only one instance receives that API call, so with multiple instances the others would not know to update their respective lists of metrics. NOTE: Because the metrics are refreshed every 30 seconds, there may be a slight delay before new metrics show up and old metrics disappear. For a deleted backend, the activation status metric value will be -1 until the metric is unregistered in the next refresh period.
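The value semantics described above (1 for activated, 0 for deactivated, -1 for a backend that has been deleted but whose metric is not yet unregistered) can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: the class name `ActivationStatusGauge` and the lookup function standing in for the database query are hypothetical, and the JMX annotations a real metric class would carry are omitted.

```java
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch of the per-cluster activation status value; not the PR's actual class.
class ActivationStatusGauge {
    private final String clusterName;
    // Stands in for a database lookup: empty when the backend row no longer exists.
    private final Function<String, Optional<Boolean>> activeLookup;

    ActivationStatusGauge(String clusterName, Function<String, Optional<Boolean>> activeLookup) {
        this.clusterName = clusterName;
        this.activeLookup = activeLookup;
    }

    // 1 = activated, 0 = deactivated, -1 = backend not found (deleted, awaiting unregistration).
    int getActivationStatus() {
        return activeLookup.apply(clusterName)
                .map(active -> active ? 1 : 0)
                .orElse(-1);
    }
}
```

Reading the value on demand from the lookup, rather than caching it in the gauge, keeps the reported status tied to the database state at the time the metric is read.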
Testing
Additional context and related issues
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( x ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text: