Conversation

@ka3de ka3de commented Oct 13, 2025

Adjusts the publisher options to account for new "slower running" checks (periodicity up to 1h) and to reduce the amount of data discarded during a long (~15m) outage.

  • Increase the time to wait for new check execution data before removing the tenant handler
  • Increase retries and backoff to account for a possible ~20m downtime
  • Increase the buffer size limit
  • Stop discarding data based on the number of entries
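
Taken together, the changes above look roughly like the sketch below. The field names are hypothetical, chosen for illustration; the actual options are defined in the agent's publisher package and may be named differently:

```go
package main

import (
	"fmt"
	"time"
)

// publisherOpts uses hypothetical field names for illustration; the real
// option names live in the agent's publisher package.
type publisherOpts struct {
	TenantTimeout  time.Duration // wait for new check data before removing the tenant handler
	MaxRetries     int           // push attempts before a batch is discarded
	MinBackoff     time.Duration // initial backoff between attempts
	MaxBackoff     time.Duration // cap on the backoff between attempts
	MaxBufferBytes int           // per-tenant buffer limit
}

func main() {
	opts := publisherOpts{
		TenantTimeout:  65 * time.Minute,      // covers checks with up to 1h periodicity
		MaxRetries:     50,                    // was 20
		MinBackoff:     50 * time.Millisecond, // was 30ms
		MaxBackoff:     30 * time.Second,      // was 2s
		MaxBufferBytes: 1 << 20,               // 1MB, was 128KB
	}
	fmt.Printf("%+v\n", opts)
}
```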

See particular commits for more details.

Updates grafana/synthetic-monitoring/issues/378.

ka3de added 5 commits October 13, 2025 10:59
Specifies the time to wait for new check execution data for a tenant
before removing its handler; removing the handler causes any
undelivered data to be lost.

Adjust it to 65 minutes. In the worst-case scenario, where the
publisher is unable to push for 1h, a 65-minute timeout still covers
the longest-running checks (current max periodicity of 1h): the
handler survives the full gap between executions, so buffered data is
not lost while waiting for the next execution.

When retries are exhausted, the data being pushed is not re-added to
the buffer queue; instead, it is discarded.

With the previous configuration (20 retries, 30ms min backoff, 2s max
backoff), if we consider that each push request takes ~200ms, the total
time before exhausting retries and discarding data was ~35s.

With the new configuration (50 retries, 50ms min backoff, 30s max
backoff), if we consider that each push request takes ~200ms, the total
time before exhausting retries and discarding data is ~20min.

Notice that data might still be discarded earlier due to buffer size
limits.
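
To sanity-check those numbers, here is a minimal sketch that reproduces the arithmetic, assuming a simple doubling backoff capped at the maximum (the agent's real backoff implementation may differ, e.g. by adding jitter):

```go
package main

import (
	"fmt"
	"time"
)

// totalRetryTime estimates how long the publisher keeps retrying before
// a batch is discarded, assuming a doubling backoff capped at maxBackoff
// plus a fixed per-push latency. It mirrors the arithmetic above; the
// agent's actual backoff implementation may differ.
func totalRetryTime(retries int, minBackoff, maxBackoff, pushLatency time.Duration) time.Duration {
	total := time.Duration(0)
	backoff := minBackoff
	for i := 0; i < retries; i++ {
		total += pushLatency + backoff
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
	return total
}

func main() {
	// Previous configuration: ~35s before data is discarded.
	fmt.Println(totalRetryTime(20, 30*time.Millisecond, 2*time.Second, 200*time.Millisecond))
	// New configuration: ~20min before data is discarded.
	fmt.Println(totalRetryTime(50, 50*time.Millisecond, 30*time.Second, 200*time.Millisecond))
}
```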

Specifies the max number of bytes to hold in the buffer per tenant in
case of unsuccessful retries before discarding data.

In the absence of better data for calculations, we can run rough
estimates: consider each check execution to report ~15 metrics with 10
labels and one sample each. That gives a rough estimate of ~5KB per
check execution.

With the previous value of 128KB, the buffer would hold data for ~25
check executions. In the best scenario, where a tenant runs a single
check on that agent with a frequency of 1m, that allows for ~25
minutes before discarding data. In most cases this window will be much
shorter, as a tenant will be running more than one check on the same
agent, leaving a very short time before data is discarded.

Therefore, increase the buffer size limit to 1MB. Considering that
this only takes effect when data cannot be pushed to particular
Mimir/Loki cells, accept the memory increase in order to reduce
discarded data. Autoscalers will scale agents if necessary based on
the memory threshold.
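
As a quick back-of-the-envelope check of those numbers, assuming the ~5KB-per-execution estimate from above:

```go
package main

import "fmt"

func main() {
	// Rough estimate from above: ~15 metrics x 10 labels x 1 sample
	// per check execution, approximately 5KB.
	const bytesPerExecution = 5 * 1024

	for _, limitBytes := range []int{128 * 1024, 1 << 20} {
		executions := limitBytes / bytesPerExecution
		// With a single check at 1m frequency, each buffered execution
		// buys roughly one minute before data starts being discarded.
		fmt.Printf("%dKB buffer: ~%d executions buffered (~%d min at one 1m check)\n",
			limitBytes/1024, executions, executions)
	}
}
```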

A limit on the number of items is difficult to reason about, and it is
hard to set a reasonable value for it. Therefore, remove this limit
and apply only the size limit, which fulfills the same purpose.

Adjust backoff for testing purposes.
@ka3de ka3de force-pushed the adjust-publisher-options branch from 58d9d75 to a75a29e on October 13, 2025 10:52
@ka3de ka3de marked this pull request as ready for review October 13, 2025 10:56
@ka3de ka3de requested a review from a team as a code owner October 13, 2025 10:56
@ka3de ka3de requested review from Pokom and mem October 13, 2025 10:56