fix(sos): circuit-breaker on Defra SOS endpoint #8

Merged: ruaraidhd merged 1 commit into main from fix/sos-circuit-breaker on May 30, 2026

@ruaraidhd
Contributor

Summary

When uk-air.defra.gov.uk/sos-ukair/* goes down (which it does, unannounced, for hours at a time), every getData call previously sat in tenacity retry-backoff for ~7 seconds before raising. The per-timeseries RequestException was then caught and logged inside make_sos_data_fetcher, so the overall call returned an empty DataFrame rather than raising. One get_current("AURN", [site]) call fans out to ~8 pollutants per site; during the 2026-05-13 outage I measured:

| Source | get_current wall-clock | Rows | Exception? |
| --- | --- | --- | --- |
| AURN [MY1] (SOS) | 410.7s | 0 | No |
| SAQN [GLA4] (SOS) | 154.1s | 0 | No |
| AQE [HG1] (SOS) | 0.0s (no series) | 0 | No |
| LAQN [BL0] (separate infra) | 0.3s | 0 | No |

Worse than the latency itself: because aeolus catches per-timeseries failures and returns an empty frame, downstream consumers cannot distinguish "no data right now" from "upstream is broken". Exception-based service-level circuit-breakers (e.g. Hermes's fetch.py) never see the signal and so never trip — every incoming request keeps re-saturating the asyncio thread executor with retry traffic, wedging unrelated tools (KB lookups, other sources) until the queue drains.
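The masking effect described above can be reproduced in a few lines. This is only an illustrative sketch: `fetch_series` and `fetch_all_swallowing` are hypothetical names, plain lists stand in for DataFrames, and `RequestException` is a local stand-in for `requests.exceptions.RequestException` so the snippet has no dependencies.

```python
class RequestException(Exception):
    """Local stand-in for requests.exceptions.RequestException."""


def fetch_series(series_id):
    # Hypothetical per-timeseries fetch; simulates an upstream outage.
    raise RequestException(f"upstream down for {series_id}")


def fetch_all_swallowing(series_ids):
    # Mirrors the pattern described above: each per-series failure is
    # caught (and would be logged), so the caller only sees an empty result.
    rows = []
    for series_id in series_ids:
        try:
            rows.extend(fetch_series(series_id))
        except RequestException:
            continue
    return rows


# A downstream, exception-based breaker counts failures it can observe.
failures_seen = 0
try:
    result = fetch_all_swallowing(["no2", "pm25"])
except RequestException:
    failures_seen += 1

print(result, failures_seen)
```

Both series fail, yet the caller receives an empty result and the downstream failure counter never moves, which is why an exception-based breaker cannot trip.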

What this changes

A small process-level circuit-breaker on the SOS host inside sources/sos.py. After AEOLUS_SOS_BREAKER_FAILURES (default 5) consecutive RequestExceptions from any SOS endpoint, subsequent _fetch_sos_json calls fail fast with RequestException("SOS circuit-breaker open …") for AEOLUS_SOS_BREAKER_COOLDOWN_S (default 60) seconds. The first call after the cooldown probes upstream again; a success closes the breaker.

Both thresholds are overridable via environment variables, so ops can tune them without a redeploy.

New public hook aeolus.sources.sos.reset_sos_circuit() for tests and ops use.
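For readers unfamiliar with the pattern, here is a minimal sketch of a consecutive-failure circuit breaker with a cooldown and a probe call. It is not the code in sources/sos.py: the `CircuitBreaker` class, its attributes, and the injectable `clock` are assumptions made for illustration, `RequestException` is a local stand-in for `requests.exceptions.RequestException`, and the error message is only the prefix quoted above. The two environment variable names and their defaults are taken from the description.

```python
import os
import time


class RequestException(Exception):
    """Local stand-in for requests.exceptions.RequestException."""


class CircuitBreaker:
    """Hypothetical consecutive-failure breaker (illustration only)."""

    def __init__(self, max_failures, cooldown_s, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock          # injectable so tests can control time
        self.failures = 0           # consecutive failures seen so far
        self.opened_at = None       # None means the breaker is closed

    def reset(self):
        # Plays the role the description gives to reset_sos_circuit().
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_s:
                # Still cooling down: fail fast without touching upstream.
                raise RequestException("SOS circuit-breaker open")
            # Cooldown elapsed: let this call through as a probe.
        try:
            result = fn(*args, **kwargs)
        except RequestException:
            self.failures += 1
            if self.failures >= self.max_failures:
                # Open (or re-open after a failed probe) and restart the cooldown.
                self.opened_at = self.clock()
            raise
        # Any success closes the breaker and clears the counter.
        self.reset()
        return result


# Thresholds read from the environment, with the defaults stated above.
breaker = CircuitBreaker(
    max_failures=int(os.environ.get("AEOLUS_SOS_BREAKER_FAILURES", "5")),
    cooldown_s=float(os.environ.get("AEOLUS_SOS_BREAKER_COOLDOWN_S", "60")),
)
```

Passing the clock in as a parameter is a design choice of this sketch: it lets a test advance time past the cooldown without sleeping.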

Behaviour matrix

| Upstream | Before | After |
| --- | --- | --- |
| Healthy | unchanged | unchanged |
| Single transient failure | 1 call, ~7s retry, raise | same |
| Sustained outage, request N+1 | ~7s retry, raise (or N×7s retry + 0 rows from make_sos_data_fetcher) | <1ms, raise with clear breaker message |
| Recovery after outage | unchanged | first call after cooldown probes; success closes breaker |

No behaviour change when SOS is healthy. During outages, calls now fail in under a millisecond rather than after tens of seconds, and consumers that watch for RequestException to short-circuit will see the signal.

Test plan

  • 5 new tests in TestCircuitBreaker cover: below-threshold no-op, threshold opens breaker, success resets counter, cooldown recovery, make_sos_data_fetcher short-circuits when open.
  • tests/test_sos.py: 37/37 passing.
  • Full suite: 1318 passed, 4 unrelated AirNow live-integration failures pre-date this branch.

Migration

None for healthy SOS. During outages, callers will see RequestException("SOS circuit-breaker open …") immediately instead of RequestException after 7s of retries. Existing except RequestException blocks (e.g. make_sos_data_fetcher line 380) catch it unchanged.
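As a sketch of what this means for a caller: an existing `except RequestException` clause catches the fail-fast error with no changes, and a caller that wants to tell the two cases apart could inspect the message. The function names below are hypothetical, `RequestException` is a local stand-in for `requests.exceptions.RequestException`, and matching on the message text is an assumption, since the full message is elided above.

```python
class RequestException(Exception):
    """Local stand-in for requests.exceptions.RequestException."""


def fetch_with_breaker_open():
    # Hypothetical stand-in for an SOS call made while the breaker is open.
    raise RequestException("SOS circuit-breaker open")


def fetch_with_upstream_error():
    # Hypothetical stand-in for a call that failed after exhausting retries.
    raise RequestException("connection timed out")


def classify(fetch):
    """Return 'ok', 'breaker-open', or 'upstream-error' for one attempt."""
    try:
        fetch()
    except RequestException as exc:
        # The same except clause handles both failure kinds.
        if "circuit-breaker open" in str(exc):
            return "breaker-open"
        return "upstream-error"
    return "ok"
```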

Version bumped 0.4.5.1 → 0.4.5.2.

🤖 Generated with Claude Code

When uk-air.defra.gov.uk/sos-ukair/* goes down — which it does
unannounced for hours at a time — every getData call previously sat in
tenacity retry-backoff for ~7s before raising, and the per-timeseries
RequestException was caught and logged inside make_sos_data_fetcher so
the overall call returned an empty DataFrame rather than raising. One
get_current("AURN", [site]) fans out to ~8 pollutants per site; observed
wall-clock cost during the 2026-05-13 outage was 410s for one AURN site
and 154s for one SAQN site, with the consumer seeing zero rows and no
exception — making exception-based downstream circuit-breakers blind to
the failure.

Adds a process-level circuit-breaker on _fetch_sos_json: after
AEOLUS_SOS_BREAKER_FAILURES (default 5) consecutive failures, subsequent
calls fail fast with RequestException for AEOLUS_SOS_BREAKER_COOLDOWN_S
(default 60) seconds. The first call after the cooldown probes upstream
again; a success closes the breaker. Both thresholds env-overridable.

New public hook reset_sos_circuit() for tests and ops.

5 new tests in TestCircuitBreaker covering: below-threshold no-op,
threshold opens breaker, success resets counter, cooldown recovery,
make_sos_data_fetcher short-circuits when open. 37/37 test_sos.py
green; full suite 1318 passed (4 unrelated AirNow live-integration
failures pre-date this branch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ruaraidhd ruaraidhd merged commit 8a0fd3b into main May 30, 2026