fix(sos): circuit-breaker on Defra SOS endpoint #8
Merged
When uk-air.defra.gov.uk/sos-ukair/* goes down — which it does
unannounced for hours at a time — every getData call previously sat in
tenacity retry-backoff for ~7s before raising, and the per-timeseries
RequestException was caught and logged inside make_sos_data_fetcher so
the overall call returned an empty DataFrame rather than raising. One
get_current("AURN", [site]) fans out to ~8 pollutants per site; observed
wall-clock cost during the 2026-05-13 outage was 410s for one AURN site
and 154s for one SAQN site, with the consumer seeing zero rows and no
exception — making exception-based downstream circuit-breakers blind to
the failure.
Adds a process-level circuit-breaker on _fetch_sos_json: after
AEOLUS_SOS_BREAKER_FAILURES (default 5) consecutive failures, subsequent
calls fail fast with RequestException for AEOLUS_SOS_BREAKER_COOLDOWN_S
(default 60) seconds. The first call after the cooldown probes upstream
again; a success closes the breaker. Both thresholds env-overridable.
New public hook reset_sos_circuit() for tests and ops.
5 new tests in TestCircuitBreaker covering: below-threshold no-op,
threshold opens breaker, success resets counter, cooldown recovery,
make_sos_data_fetcher short-circuits when open. 37/37 test_sos.py
green; full suite 1318 passed (4 unrelated AirNow live-integration
failures pre-date this branch).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
When uk-air.defra.gov.uk/sos-ukair/* goes down (which it does unannounced for hours at a time), every getData call previously sat in tenacity retry-backoff for ~7 seconds before raising, and the per-timeseries RequestException was caught and logged inside make_sos_data_fetcher, so the overall call returned an empty DataFrame rather than raising. One get_current("AURN", [site]) call fans out to ~8 pollutants per site; during the 2026-05-13 outage I measured:

get_current wall-clock
[MY1] (SOS)
[GLA4] (SOS)
[HG1] (SOS)
[BL0] (separate infra)

Worse than the latency itself: because aeolus catches per-timeseries failures and returns an empty frame, downstream consumers cannot distinguish "no data right now" from "upstream is broken". Exception-based service-level circuit-breakers (e.g. Hermes's fetch.py) never see the signal and so never trip. Every incoming request keeps re-saturating the asyncio thread executor with retry traffic, wedging unrelated tools (KB lookups, other sources) until the queue drains.

What this changes
A small process-level circuit-breaker on the SOS host inside sources/sos.py. After AEOLUS_SOS_BREAKER_FAILURES (default 5) consecutive RequestExceptions from any SOS endpoint, subsequent _fetch_sos_json calls fail fast with RequestException("SOS circuit-breaker open …") for AEOLUS_SOS_BREAKER_COOLDOWN_S (default 60) seconds. The first call after the cooldown probes upstream again; a success closes the breaker.

Both thresholds are env-overridable for ops tuning without a redeploy.
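The open, cooldown and probe cycle described above can be sketched roughly as follows. This is a minimal illustration, not the code in sources/sos.py: the class name SosCircuitBreaker, its call() and reset() methods, and the injectable clock are all hypothetical, and the exception text is shortened because the source elides the full message. Only the two environment variable names and their defaults come from this PR.

```python
import os
import threading
import time

try:
    from requests.exceptions import RequestException
except ImportError:  # stand-in so the sketch runs without requests installed
    class RequestException(IOError):
        pass


class SosCircuitBreaker:
    """Hypothetical consecutive-failure breaker with a fixed cooldown."""

    def __init__(self, max_failures=None, cooldown_s=None, clock=time.monotonic):
        # Thresholds fall back to the env vars named in this PR (defaults 5 and 60).
        if max_failures is None:
            max_failures = os.environ.get("AEOLUS_SOS_BREAKER_FAILURES", "5")
        if cooldown_s is None:
            cooldown_s = os.environ.get("AEOLUS_SOS_BREAKER_COOLDOWN_S", "60")
        self.max_failures = int(max_failures)
        self.cooldown_s = float(cooldown_s)
        self._clock = clock
        self._lock = threading.Lock()
        self._failures = 0
        self._opened_at = None  # None means the breaker is closed

    def call(self, fn, *args, **kwargs):
        with self._lock:
            if self._opened_at is not None:
                if self._clock() - self._opened_at < self.cooldown_s:
                    # Still cooling down: fail fast without touching upstream.
                    raise RequestException("SOS circuit-breaker open")
                # Cooldown elapsed: fall through and let this call probe upstream.
        try:
            result = fn(*args, **kwargs)
        except RequestException:
            with self._lock:
                self._failures += 1
                if self._failures >= self.max_failures:
                    # Open the breaker, or re-open it after a failed probe.
                    self._opened_at = self._clock()
            raise
        with self._lock:
            # Any success resets the counter and closes the breaker.
            self._failures = 0
            self._opened_at = None
        return result

    def reset(self):
        """Force the breaker closed, e.g. between tests."""
        with self._lock:
            self._failures = 0
            self._opened_at = None
```

Passing the clock in as a parameter is a choice made for this sketch so that cooldown recovery can be tested without sleeping. In this simplified version, several threads arriving just after the cooldown would all be let through as probes.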
New public hook aeolus.sources.sos.reset_sos_circuit() for tests and ops use.

Behaviour matrix
No behaviour change when SOS is healthy. During outages, calls now fail in microseconds rather than tens of seconds, and consumers who watch RequestException to short-circuit will see the signal.

Test plan
5 new tests in TestCircuitBreaker cover: below-threshold no-op, threshold opens breaker, success resets counter, cooldown recovery, make_sos_data_fetcher short-circuits when open.
tests/test_sos.py: 37/37 passing.

Migration
None for healthy SOS. During outages, callers will see RequestException("SOS circuit-breaker open …") immediately instead of RequestException after 7s of retries. Existing except RequestException blocks (e.g. make_sos_data_fetcher line 380) catch it unchanged.

Version bumped 0.4.5.1 → 0.4.5.2.
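To illustrate the migration note, here is a small sketch of a caller whose existing except RequestException handler also catches the fail-fast error without modification. The names fetch_all, fetch_one and open_breaker_fetch are hypothetical and do not reflect the actual body of make_sos_data_fetcher. The sketch also records errors alongside rows, which is one possible way (an assumption, not something this PR does) for a consumer to tell "no data" apart from "upstream is broken".

```python
try:
    from requests.exceptions import RequestException
except ImportError:  # stand-in so the sketch runs without requests installed
    class RequestException(IOError):
        pass


def fetch_all(timeseries_ids, fetch_one):
    """Collect rows per timeseries, recording failures instead of hiding them."""
    rows, errors = [], []
    for ts_id in timeseries_ids:
        try:
            rows.extend(fetch_one(ts_id))
        except RequestException as exc:
            # The same handler catches slow retry failures and fast breaker failures.
            errors.append((ts_id, str(exc)))
    return rows, errors


def open_breaker_fetch(ts_id):
    # Simulates what a caller sees while the breaker is open (message shortened).
    raise RequestException("SOS circuit-breaker open")


rows, errors = fetch_all(["ts-1", "ts-2"], open_breaker_fetch)
```

Here rows is empty while errors holds one entry per timeseries, so the caller can distinguish an outage from a genuinely empty result.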
🤖 Generated with Claude Code