@towseef41

@towseef41 towseef41 commented Nov 30, 2025

Description

Add configurable retries with exponential backoff and jitter to the Prometheus Remote Write exporter, so that transient 429/408/5xx responses, connection errors, and timeouts no longer drop metrics silently. The README is updated with the new retry knobs.
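
For readers unfamiliar with the technique, here is a minimal sketch of exponential backoff with full jitter and a retryable-status check. It is illustrative only and is not the code in this PR; the names `is_retryable` and `backoff_delay` and the default `base`/`cap` values are assumptions made for the example.

```python
import random

# Status codes treated as transient in this sketch: 408, 429, and all 5xx.
RETRYABLE_STATUSES = frozenset({408, 429}) | frozenset(range(500, 600))


def is_retryable(status_code: int) -> bool:
    """Return True if the response status suggests a transient failure."""
    return status_code in RETRYABLE_STATUSES


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  rng=random.random) -> float:
    """Full-jitter delay in seconds for a 0-based retry attempt.

    The ceiling doubles with each attempt and is clamped to `cap`;
    the returned delay is drawn uniformly from [0, ceiling).
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling
```

A caller would sleep for `backoff_delay(attempt)` between attempts and give up once a maximum retry count is reached.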

Fixes #3985

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

  • tox -e py311-test-exporter-prometheus-remote-write
  • tox -e lint-exporter-prometheus-remote-write

Does This PR Require a Core Repo Change?

  • Yes. - Link to PR:
  • No.

Checklist:

  • Followed the style guidelines of this project
  • Changelogs have been updated
  • Unit tests have been added
  • Documentation has been updated

@towseef41 towseef41 requested a review from a team as a code owner November 30, 2025 11:23
@linux-foundation-easycla

linux-foundation-easycla bot commented Nov 30, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: towseef41 / name: Towseef Altaf (1121cb6)

@herin049
Contributor

herin049 commented Dec 1, 2025

Can we not just use the built-in urllib3 retry functionality? (See https://urllib3.readthedocs.io/en/stable/reference/urllib3.util.html#urllib3.util.Retry and https://requests.readthedocs.io/en/latest/user/advanced/#example-automatic-retries.)

I find this approach preferable to rolling our own backoff-retry loop.

@towseef41
Author

Thanks for the suggestion. I’m happy to move to a requests.Session + HTTPAdapter using urllib3.Retry (with POST allowed), mapping our existing knobs onto it. I’ll add a tiny Retry subclass only to keep the jitter and backoff cap, drop the manual loop, and update the tests.

For context, I initially considered a custom loop to keep full control over jitter/backoff cap and explicit logging and avoid relying on adapter/session setup, but I agree urllib3.Retry is battle-tested and clearer.

@herin049
Contributor

herin049 commented Dec 1, 2025

Sounds good. We can get the opinion of other members, as I might be in the minority here.

On a related note, I'm not sure subclassing urllib3.Retry is necessary if you just want to bound the backoff delay. You can simply set backoff_jitter and backoff_max to something reasonable.

@towseef41
Author

Makes sense. I’ll switch to urllib3.Retry and avoid subclassing if possible: use a requests.Session + HTTPAdapter with Retry(total=..., backoff_factor=..., backoff_max=..., status_forcelist=..., allowed_methods={"POST"}). The requests-bundled urllib3 we have doesn’t expose backoff_jitter, so if we want jitter I’ll add the smallest possible override; otherwise I’ll stick to base Retry with a sensible backoff_max.
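
A minimal sketch of the setup described above, assuming `requests` and `urllib3` are installed. The helper name `build_retrying_session`, its parameter names, and the default values are hypothetical and are not the exporter's actual API. Because `backoff_max` is only accepted as a constructor argument by newer urllib3 releases, the sketch passes it only when the installed `Retry` supports it.

```python
import inspect

import requests
from requests.adapters import HTTPAdapter
from urllib3.util import Retry


def build_retrying_session(
    max_retries: int = 3,
    backoff_factor: float = 0.5,
    backoff_max: float = 30.0,
    retry_statuses: tuple = (408, 429, 500, 502, 503, 504),
) -> requests.Session:
    """Return a Session whose POST requests are retried by urllib3.Retry."""
    kwargs = dict(
        total=max_retries,
        backoff_factor=backoff_factor,
        status_forcelist=retry_statuses,
        allowed_methods={"POST"},  # POST is not in Retry's default method set
    )
    # Only pass backoff_max if this urllib3 version's Retry accepts it.
    if "backoff_max" in inspect.signature(Retry.__init__).parameters:
        kwargs["backoff_max"] = backoff_max

    adapter = HTTPAdapter(max_retries=Retry(**kwargs))
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```

Requests made through the returned session, for example `session.post(url, data=payload, timeout=30)`, would then be retried by the adapter without a hand-written loop.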

@xrmx
Contributor

xrmx commented Dec 1, 2025

The OTLP exporters implement retries manually but, AFAICS, don't expose any tunables (e.g. _export in exporter/opentelemetry-exporter-otlp-proto-grpc/src/opentelemetry/exporter/otlp/proto/grpc/exporter.py from the opentelemetry-python repo). If you don't expose this mechanism, I think it's fine to reuse the HTTP library's code for that.

@towseef41
Author

@xrmx
Thanks for the pointers. I’m planning to reuse the HTTP stack’s retries instead of a custom loop: requests.Session + HTTPAdapter with urllib3.Retry (POST allowed, small status_forcelist). I kept a few knobs exposed (max retries, backoff factor/cap, status list) so users can tune if needed, but otherwise it follows the built-in behavior. If you’d rather keep it non-tunable (closer to OTLP) and just lean on Retry defaults, I can pare that back.

…entelemetry/exporter/prometheus_remote_write/__init__.py

Co-authored-by: Lukas Hering <[email protected]>
@towseef41 towseef41 requested a review from herin049 December 2, 2025 03:02