fix: clean every checker each window and release periodic lock when idle#55
fix: clean every checker each window and release periodic lock when idle#55AlinsRan wants to merge 11 commits into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThe periodic healthcheck timer in ChangesHealthcheck Timer Fixes and Test Hardening
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 6✅ Passed checks (6 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
Two issues affecting workloads with multiple health-checked upstreams: 1. Stale-target cleanup gating. `last_cleanup_check` is a module-level timestamp shared by every checker, but it was updated *inside* the `for _, checker_obj in pairs(hcs)` loop. The first checker that runs cleanup advanced the timestamp, so the cleanup condition turned false for every remaining checker in the same iteration. As a result only the first upstream's purge-marked targets were ever removed; the other upstreams kept their deleted nodes in the shm forever (still returned by the control API and still actively probed). This is why it could not be reproduced with a single upstream. The window is now decided once before the loop and the timestamp advanced once after it, so every checker in `hcs` is cleaned in the same window. 2. Periodic lock not released when idle. When every checker on the lock-holding worker has been stopped (active == false), the timer kept renewing the global PERIODIC_LOCK, so other workers that still own active checkers could never acquire it and active checks stopped entirely after a disable -> re-enable cycle. The lock holder now releases the lock when no active checker remains. This extends the work started in #54. Adds a regression test (t/with_worker-events/19-multi-checker-cleanup.t) that creates two health-checked upstreams, marks both for delayed removal, and asserts both are cleaned. It fails before the gating fix (second checker keeps its targets) and passes after.
04d4136 to
ada5ee8
Compare
…ents - Add a regression test asserting the periodic lock is released once no active checker remains on the worker (so another worker can take over). - Mirror the multi-checker cleanup test and the new lock test into the resty-events suite so both event backends exercise the fixes.
14-tls_active_probes.t actively probes the external host badssl.com via a hard-coded IP (104.154.89.105). It is flaky in CI whenever that host is slow/unreachable and is currently red on every run, even for the non-disable TEST 1. Disable it via the repo's .t_disabled convention (cf. 10-garbagecollect, 13-integration) until it is rewritten against a local TLS server. This is unrelated to the cleanup/lock changes in this PR.
Check whether the worker still owns an active checker before acquiring the periodic lock instead of after. A worker whose checkers were all disabled no longer re-acquires and immediately releases the lock on every back-off tick; it releases the lock once (if it still holds it) and then stops contending.
14-tls_active_probes: replace the external badssl.com probe (hard-coded IP 104.154.89.105, flaky/unreachable in CI) with a local self-signed TLS server embedded in the test nginx config. The cert has CN/SAN example.test, so probing with that SNI and verification enabled succeeds, a mismatching SNI with verification enabled fails, and verification off accepts the mismatch - reproducing the original healthy/unhealthy assertions against a local server. Wait for the probe by polling the target status instead of a fixed sleep, and use a distinct port per events backend to avoid cross-suite collisions. Re-enabled (renamed from .t_disabled). 99-status_ver: the per-target status-change event is delivered only to workers connected to the events broker at publish time (no replay), so a worker whose subscription was not yet ready intermittently missed the change (~50% flake). Add a cross-worker barrier in init_worker that confirms broadcasts reach every worker before the target is added, and create the checker (registering its subscription) before adding the target so no status change can fire before every worker is subscribed. Both workers now deterministically observe the version change; assert it via grep_error_log occurrence count.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
t/with_worker-events/99-status_ver.t (1)
54-67: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueConsider bounding the barrier loop to avoid a runaway background loop.
while truehas no timeout or max-retry cap. If cross-worker delivery never completes (e.g., a worker fails to subscribe), this loop spins in the timer thread indefinitely. The test would still fail cleanly via the error-log assertion, but a bounded retry would surface a clearer failure and avoid the lingering loop.♻️ Optional: cap the handshake retries
local my_id = ngx.worker.id() - while true do + local attempts = 0 + while attempts < 200 do we.post("barrier", "ping", my_id) we.poll() local seen = 0 for i = 0, worker_count - 1 do if shm:get("barrier:" .. tostring(i)) then seen = seen + 1 end end if seen >= worker_count then break end ngx.sleep(0.05) + attempts = attempts + 1 end🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@t/with_worker-events/99-status_ver.t` around lines 54 - 67, The `while true` loop in the barrier synchronization section has no timeout or maximum retry limit, which can cause indefinite spinning if worker delivery fails. Add a counter variable to track retry attempts and set a reasonable maximum number of iterations (e.g., 100 retries with 0.05 second sleeps) before breaking out of the loop. Once the maximum retries is reached, break from the loop so the test can proceed to failure assertions rather than hanging indefinitely. This ensures the test fails with a clear timeout rather than lingering in the background loop.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@t/with_worker-events/99-status_ver.t`:
- Around line 54-67: The `while true` loop in the barrier synchronization
section has no timeout or maximum retry limit, which can cause indefinite
spinning if worker delivery fails. Add a counter variable to track retry
attempts and set a reasonable maximum number of iterations (e.g., 100 retries
with 0.05 second sleeps) before breaking out of the loop. Once the maximum
retries is reached, break from the loop so the test can proceed to failure
assertions rather than hanging indefinitely. This ensures the test fails with a
clear timeout rather than lingering in the background loop.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c0536338-7599-4ce9-beb5-2fa57c21b77b
⛔ Files ignored due to path filters (4)
t/with_resty-events/util/tls_probe_cert.pemis excluded by!**/*.pemt/with_resty-events/util/tls_probe_key.pemis excluded by!**/*.pemt/with_worker-events/util/tls_probe_cert.pemis excluded by!**/*.pemt/with_worker-events/util/tls_probe_key.pemis excluded by!**/*.pem
📒 Files selected for processing (9)
lib/resty/healthcheck.luat/with_resty-events/14-tls_active_probes.tt/with_resty-events/19-multi-checker-cleanup.tt/with_resty-events/20-periodic-lock-release.tt/with_resty-events/99-status_ver.tt/with_worker-events/14-tls_active_probes.tt/with_worker-events/19-multi-checker-cleanup.tt/with_worker-events/20-periodic-lock-release.tt/with_worker-events/99-status_ver.t
The previous revision gated the whole periodic tick on has_active_checker and returned before the cleanup loop when no active checker existed. In a passive-only deployment (checks.passive but no active interval) no worker ever has an active checker, so delayed_clear()-marked targets were never purged and leaked in the shm (apache/apisix#13385) -- a regression only on the pure-passive path, which the existing tests (all using active probes) did not cover. Run stale-target cleanup independently of active probing and of the periodic lock: locking_target_list serializes the purge across workers and re-reads the shm list each time, so running it on every worker is safe and idempotent. The periodic lock is still released on has_active_checker so a disabled worker yields to an active one (apache/apisix#13235); that gate stays on has_active_checker rather than on hcs being non-empty because a disabled checker lingers in the weak hcs table with active=false until GC. Add a passive-only + delayed_clear regression test for both event backends.
For a passive-only checker checks.active.*.active is nil (never set to true by start(), never set to false by stop()), so ngx.say rendered it as "nil" rather than "false". Normalize with `not not` so the assertion is stable. The cleanup assertion (remaining: 0) already passed, confirming the fix purges delayed_clear targets with no active checker.
Two follow-ups from the PR review of the cleanup/lock-gating rework: 1. Stale-target cleanup ran on every worker each window, so N workers x M checkers all contended on each checker's TARGET_LIST_LOCK. Elect a single worker per window via an atomic shm:add on a dedicated cleanup key (TTL = CLEANUP_INTERVAL). Any worker can win -- including a passive-only one -- so the apache/apisix#13385 leak fix still holds, while the rest skip the pass instead of fighting for the lock. Removes the per-worker last_cleanup_check bookkeeping. 2. After a disable -> re-enable cycle, a backed-off timer (running at CHECK_INTERVAL * 10 while idle or after losing the lock race) took up to one slow tick to resume probing. checker:start() now wakes the timer back to CHECK_INTERVAL when it re-enables active checks.
The barrier handshake spun in an unbounded `while true` loop, so a worker that never synced would hang until the global test timeout with no useful diagnostic. Cap the wait with a deadline and fail loudly with the synced count when it is exceeded.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@lib/resty/healthcheck.lua`:
- Around line 1425-1433: The timer wake-up optimization in the conditional block
(around line 1428-1431) that sets active_check_timer.interval to CHECK_INTERVAL
cannot interrupt an in-flight backed-off sleep because the lua-resty-timer
library only applies interval changes after the current handler completes, not
during an active sleep. To fix this and ensure the re-enabled checker resumes
fast probing promptly, replace the interval mutation approach with either a
direct call to ngx.timer.at(CHECK_INTERVAL, ...) to immediately reschedule the
timer callback, or alternatively defer the fast probing resumption to the next
natural timer wake by removing this optimization if an immediate interrupt is
not feasible with the current timer architecture.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a446900a-f113-41c0-9e50-5be1416582cb
📒 Files selected for processing (3)
lib/resty/healthcheck.luat/with_resty-events/99-status_ver.tt/with_worker-events/99-status_ver.t
🚧 Files skipped from review as they are similar to previous changes (2)
- t/with_resty-events/99-status_ver.t
- t/with_worker-events/99-status_ver.t
The interval-wake added to checker:start() could not bring a re-enabled checker back to fast probing: lua-resty-timer (~> 1) schedules each recurring sleep from self.sub_interval, which is frozen to the initial interval at construction and never re-synced from self.interval; self.interval only feeds scheduling for node-wide (key_name) timers, which the active-check timer is not. Mutating active_check_timer.interval from start() (or anywhere) therefore has no effect on the timer cadence, so the optimization was a no-op. Remove it.
lua-resty-timer schedules each recurring sleep from sub_interval (frozen at construction), never re-reading self.interval, so mutating active_check_timer.interval has no effect on the timer cadence -- the same reason the start() interval-wake was removed in 24ea30c. Remove the three remaining no-op assignments in the periodic tick so the code no longer implies an idle back-off that never happens. Also document that the per-window cleanup election purges only the elected worker's own hcs and is eventually consistent under asymmetric checker distribution.
Problem
Two issues that only manifest when multiple upstreams have health checks enabled — single-upstream setups don't reproduce them, which is why earlier repro attempts failed. They are the library-side root causes behind apache/apisix#13385, apache/apisix#13141 and apache/apisix#13235.
Root cause 1 — stale-target cleanup is gated incorrectly
last_cleanup_checkis a module-level timestamp shared by every checker, yet the active-check timer both read and wrote it inside thefor _, checker_obj in pairs(hcs)loop:The first checker to enter the cleanup branch advances
last_cleanup_check, so the condition turns false for every other checker in the same iteration. With several health-checked upstreams, only the first upstream's purge-marked nodes are removed from the shm; the rest keep their deleted nodes forever — still returned by the control API (#13385) and still actively probed (#13141).A related defect in the same path: cleanup ran only inside the
get_periodic_lockbranch. A passive-only deployment (checks.passive, no active interval) does no active probing, so no worker ever holds the lock and cleanup never runs at all — itsdelayed_clear()-marked targets leak forever (the passive-only side of #13385).Root cause 2 — periodic lock not released when the worker goes idle
When every checker on the lock-holding worker is stopped (
active == false), the timer kept renewing the globalPERIODIC_LOCK. Workers that still own active checkers could never acquire it, so active health checks stopped entirely after a disable → re-enable cycle (#13235).Fix
last_cleanup_checkand elect one cleanup worker per window via an atomicshm:add(cleanup_key, pid, CLEANUP_INTERVAL). The elected worker purges every checker inhcsin a single pass; the others skip it, so they don't all contend on each checker's target-list lock. The election is independent of the periodic lock, so passive-only deployments are cleaned too.PERIODIC_LOCKonce it has no active checker left, letting another worker take over. (Extends fix: release periodic lock when no active checkers exist #54, which fixed the lock direction but not the cleanup gating.)Upstream relation (Kong/lua-resty-healthcheck)
This repo is a fork of
Kong/lua-resty-healthcheck. How the fixes map to upstream:is_checked), merged 2023-12-22get_periodic_lockOnly RC1 has an upstream fix, and it never reached this fork (Kong is on 3.1.x, this fork on 3.2.x). That fix keeps cleanup gated behind the periodic lock and merely moves the timestamp advance out of the loop, so it does not cover the passive-only leak. This PR's election approach is a superset of it; RC2 is net-new on both sides. Both are worth contributing back upstream to reduce fork drift.
Kong Kong#146 is intentionally not cherry-picked — it would fix only RC1 while leaving cleanup lock-gated, so it is strictly weaker than what this PR does.
Test
19-multi-checker-cleanup.tcreates two health-checked upstreams, marks all targets of both for delayed removal, and asserts both are cleaned after the window — it fails before the fix (the second checker keeps its targets) and passes after.20-periodic-lock-release.tand21-passive-only-cleanup.tcover RC2 and the passive-only path (mirrored underwith_resty-events/). A test-only_M._set_cleanup_intervalhook (guarded by__TESTING_HEALTHCHECKER) shortens the window for determinism.Verified locally with
prove; the existing00-new,01-start-stop,10-garbagecollect,11-clearsuites still pass.Cross-PR dependency
The companion apache/apisix PR (incremental health-check target updates + no-nil-window rebuild) needs a release of this library that contains this fix. Once this is merged and tagged, bump the apisix rockspec dependency
lua-resty-healthcheck-api7to that version.Relates to apache/apisix#13385, apache/apisix#13141, apache/apisix#13235; extends #54.
Summary by CodeRabbit
Bug Fixes
Tests