Conversation

@ka3de ka3de commented Oct 13, 2025

Adjusts the publisher options to account for new "slower running" checks (periodicity up to 1h) and to reduce the amount of data discarded during a long (~15m) outage.

  • Increase the time to wait for new check execution data before removing the tenant handler
  • Increase retries and backoff to account for a possible ~20m downtime
  • Increase the buffer size limit
  • Stop discarding data based on the number of entries
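
Taken together, the changes above look roughly like the sketch below. The field names are hypothetical, chosen for illustration; the actual options are defined in the agent's publisher package and may be named differently:

```go
package main

import (
	"fmt"
	"time"
)

// publisherOpts uses hypothetical field names for illustration; the real
// option names live in the agent's publisher package.
type publisherOpts struct {
	TenantTimeout  time.Duration // wait for new check data before removing the tenant handler
	MaxRetries     int           // push attempts before a batch is discarded
	MinBackoff     time.Duration // initial backoff between attempts
	MaxBackoff     time.Duration // cap on the backoff between attempts
	MaxBufferBytes int           // per-tenant buffer limit
}

func main() {
	opts := publisherOpts{
		TenantTimeout:  65 * time.Minute,      // covers checks with up to 1h periodicity
		MaxRetries:     50,                    // was 20
		MinBackoff:     50 * time.Millisecond, // was 30ms
		MaxBackoff:     30 * time.Second,      // was 2s
		MaxBufferBytes: 1 << 20,               // 1MB, was 128KB
	}
	fmt.Printf("%+v\n", opts)
}
```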

See particular commits for more details.

Updates grafana/synthetic-monitoring/issues/378.

ka3de added 5 commits October 13, 2025 10:59
Specifies the time to wait for new check execution data for a tenant
before removing its handler; removing the handler causes any
undelivered data to be lost.

Adjust it to 65 minutes. In the worst-case scenario, where the
publisher is unable to push for 1h, a 65-minute timeout still covers
the longest-running checks (current max periodicity of 1h): the
handler survives the full gap between executions, so buffered data is
not lost while waiting for the next execution.

When retries are exhausted, the data being pushed is not re-added to
the buffer queue; instead, it is discarded.

With the previous configuration (20 retries, 30ms min backoff, 2s max
backoff), if we consider that each push request takes ~200ms, the total
time before exhausting retries and discarding data was ~35s.

With the new configuration (50 retries, 50ms min backoff, 30s max
backoff), if we consider that each push request takes ~200ms, the total
time before exhausting retries and discarding data is ~20min.

Notice that data might still be discarded earlier due to buffer size
limits.
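
To sanity-check those numbers, here is a minimal sketch that reproduces the arithmetic, assuming a simple doubling backoff capped at the maximum (the agent's real backoff implementation may differ, e.g. by adding jitter):

```go
package main

import (
	"fmt"
	"time"
)

// totalRetryTime estimates how long the publisher keeps retrying before
// a batch is discarded, assuming a doubling backoff capped at maxBackoff
// plus a fixed per-push latency. It mirrors the arithmetic above; the
// agent's actual backoff implementation may differ.
func totalRetryTime(retries int, minBackoff, maxBackoff, pushLatency time.Duration) time.Duration {
	total := time.Duration(0)
	backoff := minBackoff
	for i := 0; i < retries; i++ {
		total += pushLatency + backoff
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
	return total
}

func main() {
	// Previous configuration: ~35s before data is discarded.
	fmt.Println(totalRetryTime(20, 30*time.Millisecond, 2*time.Second, 200*time.Millisecond))
	// New configuration: ~20min before data is discarded.
	fmt.Println(totalRetryTime(50, 50*time.Millisecond, 30*time.Second, 200*time.Millisecond))
}
```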

Specifies the max number of bytes to hold in the buffer per tenant in
case of unsuccessful retries before discarding data.

In the absence of better data for calculations, we can run rough
estimates: consider each check execution to report ~15 metrics with 10
labels and one sample each. That gives a rough estimate of ~5KB per
check execution.

With the previous value of 128KB, the buffer would hold data for ~25
check executions. In the best scenario, where a tenant runs a single
check on that agent with a frequency of 1m, that allows for ~25
minutes before discarding data. In most cases this window will be much
shorter, as a tenant will be running more than one check on the same
agent, leaving a very short time before data is discarded.

Therefore, increase the buffer size limit to 1MB. Considering that
this only takes effect when data cannot be pushed to particular
Mimir/Loki cells, accept the memory increase in order to reduce
discarded data. Autoscalers will scale agents if necessary based on
the memory threshold.
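
As a quick back-of-the-envelope check of those numbers, assuming the ~5KB-per-execution estimate from above:

```go
package main

import "fmt"

func main() {
	// Rough estimate from above: ~15 metrics x 10 labels x 1 sample
	// per check execution, approximately 5KB.
	const bytesPerExecution = 5 * 1024

	for _, limitBytes := range []int{128 * 1024, 1 << 20} {
		executions := limitBytes / bytesPerExecution
		// With a single check at 1m frequency, each buffered execution
		// buys roughly one minute before data starts being discarded.
		fmt.Printf("%dKB buffer: ~%d executions buffered (~%d min at one 1m check)\n",
			limitBytes/1024, executions, executions)
	}
}
```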

A limit on the number of items is difficult to reason about, and it is
hard to set a reasonable value for it. Therefore, remove this limit
and apply only the size limit, which fulfills the same purpose.

Adjust backoff for testing purposes.
@ka3de ka3de force-pushed the adjust-publisher-options branch from 58d9d75 to a75a29e on October 13, 2025 10:52
@ka3de ka3de marked this pull request as ready for review October 13, 2025 10:56
@ka3de ka3de requested a review from a team as a code owner October 13, 2025 10:56
@ka3de ka3de requested review from Pokom and mem October 13, 2025 10:56