Publish queue latency metrics from tracked thread pools #120488

Open

nicktindall wants to merge 25 commits into main

Conversation

@nicktindall (Contributor) commented Jan 21, 2025

We only publish queue latency for thread pools for which EsExecutors.TaskTrackingConfig#trackExecutionTime is true.

We only use existing timestamps (the queue time is measured as the time between TimedRunnable#creationTimeNanos and TimedRunnable#startTimeNanos).
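
That is, roughly (a hypothetical helper for illustration only, not code from the PR):

import java.util.concurrent.TimeUnit;

// Hypothetical helper: queue time is the gap between task creation and the start
// of execution, both of which TimedRunnable already records.
static long queueLatencyMillis(long creationTimeNanos, long startTimeNanos) {
    return TimeUnit.NANOSECONDS.toMillis(startTimeNanos - creationTimeNanos);
}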

We maintain an in-memory ExponentialBucketHistogram (formerly HandlingTimeTracker) for each monitored thread pool to track the queue latencies of the tasks executed. Each time we poll for metrics we publish a hard-coded set of percentiles (I put in the 50th and 90th to begin with) as gauge values. This makes querying possible with ES|QL and will allow ordering/filtering on those values.

After we've published the values we clear the histogram to start collecting observations for the next interval.
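
For illustration, a minimal sketch of that record-then-clear behaviour (class name, bucket layout and methods here are assumptions for the example, not the actual ExponentialBucketHistogram):

import java.util.concurrent.atomic.AtomicLongArray;

// Illustrative sketch: an exponential-bucket histogram that is cleared after each
// metrics publication, so the published percentiles cover one polling interval.
class QueueLatencyHistogramSketch {
    private static final int BUCKET_COUNT = 18;
    // Bucket i counts observations in [2^i, 2^(i+1)) ms; bucket 0 also catches values < 1.
    private final AtomicLongArray buckets = new AtomicLongArray(BUCKET_COUNT);

    void addObservation(long queueTimeMillis) {
        int bucket = queueTimeMillis < 1
            ? 0
            : Math.min(BUCKET_COUNT - 1, 63 - Long.numberOfLeadingZeros(queueTimeMillis));
        buckets.incrementAndGet(bucket);
    }

    long[] snapshot() {
        long[] counts = new long[BUCKET_COUNT];
        for (int i = 0; i < BUCKET_COUNT; i++) {
            counts[i] = buckets.get(i);
        }
        return counts;
    }

    void clear() {
        // Not atomic with respect to concurrent addObservation calls; observations
        // racing with clear() may be attributed to the next interval.
        for (int i = 0; i < BUCKET_COUNT; i++) {
            buckets.set(i, 0);
        }
    }
}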

Happy to break this up (e.g. rename the histogram/implement percentiles, then add the metric) if we think this PR is doing too much.

Closes: ES-10531

handlingTimeTracker.clear();
return metricValues;
}
);
@nicktindall (Contributor, Author):

I wonder if, rather than publishing a time series per percentile (using a percentile attribute), we should publish a metric per percentile.
The metric makes no sense if you don't filter by a percentile label.

Member:

Is it easier to plot different percentiles on the same graph with labels (and group by) compared to two different time series?

@nicktindall (Contributor, Author):

I don't think that makes a difference, but I'm not sure.

@nicktindall (Contributor, Author):

Yes, having a look at Kibana just now, it would be much easier to plot as a single metric grouped by the percentiles. As separate metrics we'd need to add them as distinct time series.
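
For example, the single-metric shape would look roughly like this (a sketch, not the PR's code: registerLongsGauge is the method named in the suggestion further down the thread, but the exact signature, metric name, unit and attribute key here are assumptions):

import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

import org.elasticsearch.telemetry.metric.LongWithAttributes;
import org.elasticsearch.telemetry.metric.MeterRegistry;

// Illustrative sketch only: one gauge whose observations carry a "percentile"
// attribute, so Kibana/ES|QL can group by percentile instead of juggling
// separate metrics. The supplier is assumed to return {p50, p90} in millis and
// reset the underlying histogram, per the PR description.
class QueueLatencyGaugeSketch {
    static void register(MeterRegistry meterRegistry, Supplier<long[]> percentilesSupplier) {
        meterRegistry.registerLongsGauge(
            "es.thread_pool.write.threads.queue.latency.histogram", // naming discussed later in the thread
            "Queue latency percentiles for the write thread pool",
            "ms",
            () -> {
                long[] percentiles = percentilesSupplier.get();
                return List.of(
                    new LongWithAttributes(percentiles[0], Map.of("percentile", "50")),
                    new LongWithAttributes(percentiles[1], Map.of("percentile", "90"))
                );
            }
        );
    }
}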

private final Function<Runnable, WrappedRunnable> runnableWrapper;
private final ExponentiallyWeightedMovingAverage executionEWMA;
private final LongAdder totalExecutionTime = new LongAdder();
private final boolean trackOngoingTasks;
// The set of currently running tasks and the timestamp of when they started execution in the Executor.
private final Map<Runnable, Long> ongoingTasks = new ConcurrentHashMap<>();
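// Records the queue latency of each executed task; cleared each time the queue latency percentiles are published.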
private final HandlingTimeTracker handlingTimeTracker = new HandlingTimeTracker();
@nicktindall (Contributor, Author) commented Mar 11, 2025:

By using a HandlingTimeTracker we can publish specific percentile streams. This is as opposed to using an APM histogram metric, which comes with a lot of limitations (it doesn't work in ES|QL and can't be sorted or filtered on) and is costly to aggregate. The histogram is cleared each time we publish the percentiles, so the percentiles are for samples received since the last publish.

If we agree with this approach, I think it might be worth moving the HandlingTimeTracker somewhere common and giving it a more generic name (e.g. ExponentialBucketHistogram).

HandlingTimeTracker is the simplest possible solution for publishing percentiles. We could get more accurate metrics over a window decoupled from the metric polling interval if we used something like exponential histograms or decaying histograms, though they would likely incur a larger synchronisation overhead.

}
assert false : "We shouldn't ever get here";
return Long.MAX_VALUE;
}
@nicktindall (Contributor, Author) commented Mar 11, 2025:

I opted to implement an "exclusive" percentile because it was easier with the way the counts are stored in HandlingTimeTracker (the arrival of a value increments the count of the first bucket with lower bound <= the value). If we wanted to get fancy we could look at interpolation, but I don't think it's necessary.
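
Concretely, the calculation is a walk over the bucket counts that returns the containing bucket's upper bound once the cumulative count reaches the requested rank (a sketch only; the bucket bounds, rank rounding and inclusive/exclusive details here are illustrative and may differ from the PR):

// Illustrative bucket-walk percentile: returns the upper bound of the first bucket
// at which the cumulative count reaches ceil(percentile * total). No interpolation
// is attempted within a bucket.
static long percentileUpperBound(long[] bucketCounts, long[] bucketUpperBounds, float percentile) {
    long total = 0;
    for (long count : bucketCounts) {
        total += count;
    }
    if (total == 0) {
        return 0;
    }
    final long targetRank = (long) Math.ceil(total * percentile);
    long cumulative = 0;
    for (int i = 0; i < bucketCounts.length; i++) {
        cumulative += bucketCounts[i];
        if (cumulative >= targetRank) {
            return bucketUpperBounds[i];
        }
    }
    assert false : "We shouldn't ever get here";
    return Long.MAX_VALUE;
}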

public String toCompositeString() {
return nodeName == null ? threadPoolName : nodeName + "/" + threadPoolName;
}
}
@nicktindall (Contributor, Author):

I added this so we could separate the node name and thread pool name for the purpose of generating valid metric names.

e.g. es.thread_pool.instance-000003/write.threads.queue.latency.histogram is invalid; we want es.thread_pool.write.threads.queue.latency.histogram instead, so it's better to pass in something structured and unambiguous than to make assumptions about the format of the string.
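
For illustration, the structured form could be as simple as this (record and method names are hypothetical; only the toCompositeString shape above comes from the PR):

// Illustrative: keep the parts separate so metric names only ever contain the
// thread pool name, never the "nodeName/threadPoolName" composite.
record ThreadPoolName(String nodeName, String threadPoolName) {

    String metricName(String suffix) {
        // e.g. metricName("threads.queue.latency.histogram")
        //   -> "es.thread_pool.write.threads.queue.latency.histogram"
        return "es.thread_pool." + threadPoolName + "." + suffix;
    }

    String toCompositeString() {
        // Human-readable form for logs, e.g. "instance-000003/write"
        return nodeName == null ? threadPoolName : nodeName + "/" + threadPoolName;
    }
}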

@ywangd (Member):

I wonder whether an alternative could be to add a new method to each of ExecutorHolder and TaskExecutionTimeTrackingEsThreadPoolExecutor, something like:

class ExecutorHolder {
    void registerMetrics(MeterRegistry meterRegistry) {
        if (executor instanceof TaskExecutionTimeTrackingEsThreadPoolExecutor te) {
            te.registerMetrics(info.name, meterRegistry);
        }
    }
}

class TaskExecutionTimeTrackingEsThreadPoolExecutor {
    void registerMetrics(String name, MeterRegistry meterRegistry) {
        meterRegistry.registerLongsGauge(...);
    }
}

and call ExecutorHolder#registerMetrics after building it in ThreadPool. I think that avoids the need for this class and the associated cascading changes?

@nicktindall (Contributor, Author):

Yes I think you're on to something there. I will review that tomorrow.

@nicktindall (Contributor, Author):

OK I've tidied that up now, thanks for the suggestion @ywangd

@nicktindall nicktindall marked this pull request as ready for review March 26, 2025 00:29
@nicktindall nicktindall requested a review from a team as a code owner March 26, 2025 00:29
@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label Mar 26, 2025
@nicktindall nicktindall added >enhancement :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) :Core/Infra/Metrics Metrics and metering infrastructure and removed needs:triage Requires assignment of a team area label labels Mar 26, 2025
@elasticsearchmachine (Collaborator):

Hi @nicktindall, I've created a changelog YAML for you.

@elasticsearchmachine elasticsearchmachine added Team:Core/Infra Meta label for core/infra team Team:Distributed Coordination Meta label for Distributed Coordination team labels Mar 26, 2025
@elasticsearchmachine (Collaborator):

Pinging @elastic/es-core-infra (Team:Core/Infra)

@elasticsearchmachine (Collaborator):

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

Labels
:Core/Infra/Metrics Metrics and metering infrastructure :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >enhancement Team:Core/Infra Meta label for core/infra team Team:Distributed Coordination Meta label for Distributed Coordination team v9.1.0

3 participants