fix: clean every checker each window and release periodic lock when idle by AlinsRan · Pull Request #55 · api7/lua-resty-healthcheck

AlinsRan · 2026-06-22T02:47:11Z

Problem

Two issues that only manifest when multiple upstreams have health checks enabled — single-upstream setups don't reproduce them, which is why earlier repro attempts failed. They are the library-side root causes behind apache/apisix#13385, apache/apisix#13141 and apache/apisix#13235.

Root cause 1 — stale-target cleanup is gated incorrectly

last_cleanup_check is a module-level timestamp shared by every checker, yet the active-check timer both read and wrote it inside the for _, checker_obj in pairs(hcs) loop:

for _, checker_obj in pairs(hcs) do
  if (last_cleanup_check + CLEANUP_INTERVAL) < cur_time then
    -- purge targets marked via delayed_clear()
    ...
    last_cleanup_check = cur_time   -- advanced on the FIRST checker
  end
  ...
end

The first checker to enter the cleanup branch advances last_cleanup_check, so the condition turns false for every other checker in the same iteration. With several health-checked upstreams, only the first upstream's purge-marked nodes are removed from the shm; the rest keep their deleted nodes forever — still returned by the control API (#13385) and still actively probed (#13141).

A related defect in the same path: cleanup ran only inside the get_periodic_lock branch. A passive-only deployment (checks.passive, no active interval) does no active probing, so no worker ever holds the lock and cleanup never runs at all — its delayed_clear()-marked targets leak forever (the passive-only side of #13385).

Root cause 2 — periodic lock not released when the worker goes idle

When every checker on the lock-holding worker is stopped (active == false), the timer kept renewing the global PERIODIC_LOCK. Workers that still own active checkers could never acquire it, so active health checks stopped entirely after a disable → re-enable cycle (#13235).

Fix

Cleanup — drop the module-level last_cleanup_check and elect one cleanup worker per window via an atomic shm:add(cleanup_key, pid, CLEANUP_INTERVAL). The elected worker purges every checker in hcs in a single pass; the others skip it, so they don't all contend on each checker's target-list lock. The election is independent of the periodic lock, so passive-only deployments are cleaned too.
Lock — the holder releases PERIODIC_LOCK once it has no active checker left, letting another worker take over. (Extends fix: release periodic lock when no active checkers exist #54, which fixed the lock direction but not the cleanup gating.)

Upstream relation (Kong/lua-resty-healthcheck)

This repo is a fork of Kong/lua-resty-healthcheck. How the fixes map to upstream:

Fix	In Kong upstream?	In this fork (before this PR)?
RC1 — multi-checker cleanup (#13385 / #13141)	✅ Kong PR Kong#146 (`is_checked`), merged 2023-12-22	❌ never synced — bug still present
Cleanup decoupled from the lock / passive-only (#13385)	❌ still gated behind `get_periodic_lock`	❌
RC2 — release the lock when idle (#13235)	❌	❌

Only RC1 has an upstream fix, and it never reached this fork (Kong is on 3.1.x, this fork on 3.2.x). That fix keeps cleanup gated behind the periodic lock and merely moves the timestamp advance out of the loop, so it does not cover the passive-only leak. This PR's election approach is a superset of it; RC2 is net-new on both sides. Both are worth contributing back upstream to reduce fork drift.

Kong Kong#146 is intentionally not cherry-picked — it would fix only RC1 while leaving cleanup lock-gated, so it is strictly weaker than what this PR does.

Test

19-multi-checker-cleanup.t creates two health-checked upstreams, marks all targets of both for delayed removal, and asserts both are cleaned after the window — it fails before the fix (the second checker keeps its targets) and passes after. 20-periodic-lock-release.t and 21-passive-only-cleanup.t cover RC2 and the passive-only path (mirrored under with_resty-events/). A test-only _M._set_cleanup_interval hook (guarded by __TESTING_HEALTHCHECKER) shortens the window for determinism.

Verified locally with prove; the existing 00-new, 01-start-stop, 10-garbagecollect, 11-clear suites still pass.

Cross-PR dependency

The companion apache/apisix PR (incremental health-check target updates + no-nil-window rebuild) needs a release of this library that contains this fix. Once this is merged and tagged, bump the apisix rockspec dependency lua-resty-healthcheck-api7 to that version.

Relates to apache/apisix#13385, apache/apisix#13141, apache/apisix#13235; extends #54.

Summary by CodeRabbit

Bug Fixes
- Improved stale-target cleanup so purge-marked targets are removed consistently across all health checker instances, including passive-only checkers.
- Refined periodic-lock behavior: when no active probing exists, the lock is released so a worker that still owns active checkers can take over; when active probing exists, only the elected worker performs active probes.
Tests
- Switched TLS active-probe coverage to a deterministic local self-signed server (with updated certificate/SNI expectations).
- Added/expanded integration tests for multi-checker cleanup, periodic-lock release, passive-only cleanup, and more deterministic cross-worker event timing.

coderabbitai · 2026-06-22T02:47:22Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

The periodic healthcheck timer in healthcheck.lua is updated to release the periodic SHM lock when all checkers are disabled, and to compute stale-target cleanup once per tick using shm-based election rather than once per checker, preventing lock starvation and cross-checker cleanup starvation. Four new integration test pairs exercise these fixes, including a dedicated test for passive-only checkers. TLS active probe tests are migrated from external public hosts to a local self-signed TLS server. Status-version tests gain a cross-worker barrier handshake for deterministic event ordering.

Changes

Healthcheck Timer Fixes and Test Hardening

Layer / File(s)	Summary
Periodic lock release and cleanup starvation fix `lib/resty/healthcheck.lua`	Timer cleanup is restructured to use shm-based election: only the elected worker purges stale targets for all checkers. Before acquiring the periodic lock, the timer scans checkers and releases the lock if none have active checks. When a checker is restarted with active probes, the timer interval is "woken" from backed-off state. A `_M._set_cleanup_interval` helper is added under `TESTING`.
Periodic lock release integration tests `t/with_worker-events/20-periodic-lock-release.t`, `t/with_resty-events/20-periodic-lock-release.t`	Both test files verify that the `PERIODIC_LOCK` SHM key is present while a checker is active and is removed after the checker is stopped and the lock renewal window elapses.
Multi-checker stale-target cleanup tests `t/with_worker-events/19-multi-checker-cleanup.t`, `t/with_resty-events/19-multi-checker-cleanup.t`	Both test files create two independent healthcheck instances with a shortened cleanup interval, mark all targets for removal, and assert both upstreams reach zero remaining targets after multiple cleanup windows.
Passive-only checker cleanup tests `t/with_worker-events/21-passive-only-cleanup.t`, `t/with_resty-events/21-passive-only-cleanup.t`	Both test files configure a passive-only healthcheck (no active probing), add targets, mark them for delayed removal, and assert that cleanup completes without active checker participation.
TLS active probe tests: local self-signed server `t/with_worker-events/14-tls_active_probes.t`, `t/with_resty-events/14-tls_active_probes.t`	External TLS targets replaced by a local Nginx SSL server on port `2115` using a bundled self-signed cert. Each test case probes `127.0.0.1:2115` with `example.test` or `wrong.host.test` SNI and asserts the expected certificate verification result.
Cross-worker barrier for status-version tests `t/with_worker-events/99-status_ver.t`, `t/with_resty-events/99-status_ver.t`	`init_worker_by_lua_block` is extended with a barrier handshake that records observed worker pings in shared memory and waits until all worker IDs are seen before adding the healthcheck target. Test delays and error-log assertions are updated to match the deterministic sequencing.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 6

✅ Passed checks (6 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely summarizes the two main fixes: stale-target cleanup for every checker and release of the periodic lock when idle.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
E2e Test Quality Review	✅ Passed	E2E tests comprehensively verify the fixes: cleanup election pattern (19), periodic lock release (20), and passive-only cleanup (21) with real worker events services. Tests confirm both issues are...
Security Check	✅ Passed	Security review across 7 categories found no vulnerabilities. No sensitive data exposure in logs (headers logged only at DEBUG level), no database storage, proper authorization via namespace isolat...
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/cleanup-gating-and-lock-scope

_{Comment @coderabbitai help to get the list of available commands.}

Two issues affecting workloads with multiple health-checked upstreams: 1. Stale-target cleanup gating. `last_cleanup_check` is a module-level timestamp shared by every checker, but it was updated *inside* the `for _, checker_obj in pairs(hcs)` loop. The first checker that runs cleanup advanced the timestamp, so the cleanup condition turned false for every remaining checker in the same iteration. As a result only the first upstream's purge-marked targets were ever removed; the other upstreams kept their deleted nodes in the shm forever (still returned by the control API and still actively probed). This is why it could not be reproduced with a single upstream. The window is now decided once before the loop and the timestamp advanced once after it, so every checker in `hcs` is cleaned in the same window. 2. Periodic lock not released when idle. When every checker on the lock-holding worker has been stopped (active == false), the timer kept renewing the global PERIODIC_LOCK, so other workers that still own active checkers could never acquire it and active checks stopped entirely after a disable -> re-enable cycle. The lock holder now releases the lock when no active checker remains. This extends the work started in #54. Adds a regression test (t/with_worker-events/19-multi-checker-cleanup.t) that creates two health-checked upstreams, marks both for delayed removal, and asserts both are cleaned. It fails before the gating fix (second checker keeps its targets) and passes after.

…ents - Add a regression test asserting the periodic lock is released once no active checker remains on the worker (so another worker can take over). - Mirror the multi-checker cleanup test and the new lock test into the resty-events suite so both event backends exercise the fixes.

14-tls_active_probes.t actively probes the external host badssl.com via a hard-coded IP (104.154.89.105). It is flaky in CI whenever that host is slow/unreachable and is currently red on every run, even for the non-disable TEST 1. Disable it via the repo's .t_disabled convention (cf. 10-garbagecollect, 13-integration) until it is rewritten against a local TLS server. This is unrelated to the cleanup/lock changes in this PR.

Check whether the worker still owns an active checker before acquiring the periodic lock instead of after. A worker whose checkers were all disabled no longer re-acquires and immediately releases the lock on every back-off tick; it releases the lock once (if it still holds it) and then stops contending.

14-tls_active_probes: replace the external badssl.com probe (hard-coded IP 104.154.89.105, flaky/unreachable in CI) with a local self-signed TLS server embedded in the test nginx config. The cert has CN/SAN example.test, so probing with that SNI and verification enabled succeeds, a mismatching SNI with verification enabled fails, and verification off accepts the mismatch - reproducing the original healthy/unhealthy assertions against a local server. Wait for the probe by polling the target status instead of a fixed sleep, and use a distinct port per events backend to avoid cross-suite collisions. Re-enabled (renamed from .t_disabled). 99-status_ver: the per-target status-change event is delivered only to workers connected to the events broker at publish time (no replay), so a worker whose subscription was not yet ready intermittently missed the change (~50% flake). Add a cross-worker barrier in init_worker that confirms broadcasts reach every worker before the target is added, and create the checker (registering its subscription) before adding the target so no status change can fire before every worker is subscribed. Both workers now deterministically observe the version change; assert it via grep_error_log occurrence count.

coderabbitai

🧹 Nitpick comments (1)

t/with_worker-events/99-status_ver.t (1)

54-67: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Consider bounding the barrier loop to avoid a runaway background loop.

while true has no timeout or max-retry cap. If cross-worker delivery never completes (e.g., a worker fails to subscribe), this loop spins in the timer thread indefinitely. The test would still fail cleanly via the error-log assertion, but a bounded retry would surface a clearer failure and avoid the lingering loop.

♻️ Optional: cap the handshake retries

             local my_id = ngx.worker.id()
-            while true do
+            local attempts = 0
+            while attempts < 200 do
                 we.post("barrier", "ping", my_id)
                 we.poll()
                 local seen = 0
                 for i = 0, worker_count - 1 do
                     if shm:get("barrier:" .. tostring(i)) then
                         seen = seen + 1
                     end
                 end
                 if seen >= worker_count then
                     break
                 end
                 ngx.sleep(0.05)
+                attempts = attempts + 1
             end

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@t/with_worker-events/99-status_ver.t` around lines 54 - 67, The `while true`
loop in the barrier synchronization section has no timeout or maximum retry
limit, which can cause indefinite spinning if worker delivery fails. Add a
counter variable to track retry attempts and set a reasonable maximum number of
iterations (e.g., 100 retries with 0.05 second sleeps) before breaking out of
the loop. Once the maximum retries is reached, break from the loop so the test
can proceed to failure assertions rather than hanging indefinitely. This ensures
the test fails with a clear timeout rather than lingering in the background
loop.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@t/with_worker-events/99-status_ver.t`:
- Around line 54-67: The `while true` loop in the barrier synchronization
section has no timeout or maximum retry limit, which can cause indefinite
spinning if worker delivery fails. Add a counter variable to track retry
attempts and set a reasonable maximum number of iterations (e.g., 100 retries
with 0.05 second sleeps) before breaking out of the loop. Once the maximum
retries is reached, break from the loop so the test can proceed to failure
assertions rather than hanging indefinitely. This ensures the test fails with a
clear timeout rather than lingering in the background loop.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c0536338-7599-4ce9-beb5-2fa57c21b77b

📥 Commits

Reviewing files that changed from the base of the PR and between 3a6ddf0 and 38d86b3.

⛔ Files ignored due to path filters (4)

t/with_resty-events/util/tls_probe_cert.pem is excluded by !**/*.pem
t/with_resty-events/util/tls_probe_key.pem is excluded by !**/*.pem
t/with_worker-events/util/tls_probe_cert.pem is excluded by !**/*.pem
t/with_worker-events/util/tls_probe_key.pem is excluded by !**/*.pem

📒 Files selected for processing (9)

lib/resty/healthcheck.lua
t/with_resty-events/14-tls_active_probes.t
t/with_resty-events/19-multi-checker-cleanup.t
t/with_resty-events/20-periodic-lock-release.t
t/with_resty-events/99-status_ver.t
t/with_worker-events/14-tls_active_probes.t
t/with_worker-events/19-multi-checker-cleanup.t
t/with_worker-events/20-periodic-lock-release.t
t/with_worker-events/99-status_ver.t

The previous revision gated the whole periodic tick on has_active_checker and returned before the cleanup loop when no active checker existed. In a passive-only deployment (checks.passive but no active interval) no worker ever has an active checker, so delayed_clear()-marked targets were never purged and leaked in the shm (apache/apisix#13385) -- a regression only on the pure-passive path, which the existing tests (all using active probes) did not cover. Run stale-target cleanup independently of active probing and of the periodic lock: locking_target_list serializes the purge across workers and re-reads the shm list each time, so running it on every worker is safe and idempotent. The periodic lock is still released on has_active_checker so a disabled worker yields to an active one (apache/apisix#13235); that gate stays on has_active_checker rather than on hcs being non-empty because a disabled checker lingers in the weak hcs table with active=false until GC. Add a passive-only + delayed_clear regression test for both event backends.

For a passive-only checker checks.active.*.active is nil (never set to true by start(), never set to false by stop()), so ngx.say rendered it as "nil" rather than "false". Normalize with `not not` so the assertion is stable. The cleanup assertion (remaining: 0) already passed, confirming the fix purges delayed_clear targets with no active checker.

Two follow-ups from the PR review of the cleanup/lock-gating rework: 1. Stale-target cleanup ran on every worker each window, so N workers x M checkers all contended on each checker's TARGET_LIST_LOCK. Elect a single worker per window via an atomic shm:add on a dedicated cleanup key (TTL = CLEANUP_INTERVAL). Any worker can win -- including a passive-only one -- so the apache/apisix#13385 leak fix still holds, while the rest skip the pass instead of fighting for the lock. Removes the per-worker last_cleanup_check bookkeeping. 2. After a disable -> re-enable cycle, a backed-off timer (running at CHECK_INTERVAL * 10 while idle or after losing the lock race) took up to one slow tick to resume probing. checker:start() now wakes the timer back to CHECK_INTERVAL when it re-enables active checks.

The barrier handshake spun in an unbounded `while true` loop, so a worker that never synced would hang until the global test timeout with no useful diagnostic. Cap the wait with a deadline and fail loudly with the synced count when it is exceeded.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@lib/resty/healthcheck.lua`:
- Around line 1425-1433: The timer wake-up optimization in the conditional block
(around line 1428-1431) that sets active_check_timer.interval to CHECK_INTERVAL
cannot interrupt an in-flight backed-off sleep because the lua-resty-timer
library only applies interval changes after the current handler completes, not
during an active sleep. To fix this and ensure the re-enabled checker resumes
fast probing promptly, replace the interval mutation approach with either a
direct call to ngx.timer.at(CHECK_INTERVAL, ...) to immediately reschedule the
timer callback, or alternatively defer the fast probing resumption to the next
natural timer wake by removing this optimization if an immediate interrupt is
not feasible with the current timer architecture.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a446900a-f113-41c0-9e50-5be1416582cb

📥 Commits

Reviewing files that changed from the base of the PR and between ae3e2f9 and 8c71840.

📒 Files selected for processing (3)

lib/resty/healthcheck.lua
t/with_resty-events/99-status_ver.t
t/with_worker-events/99-status_ver.t

🚧 Files skipped from review as they are similar to previous changes (2)

t/with_resty-events/99-status_ver.t
t/with_worker-events/99-status_ver.t

The interval-wake added to checker:start() could not bring a re-enabled checker back to fast probing: lua-resty-timer (~> 1) schedules each recurring sleep from self.sub_interval, which is frozen to the initial interval at construction and never re-synced from self.interval; self.interval only feeds scheduling for node-wide (key_name) timers, which the active-check timer is not. Mutating active_check_timer.interval from start() (or anywhere) therefore has no effect on the timer cadence, so the optimization was a no-op. Remove it.

membphis

LGTM.

lua-resty-timer schedules each recurring sleep from sub_interval (frozen at construction), never re-reading self.interval, so mutating active_check_timer.interval has no effect on the timer cadence -- the same reason the start() interval-wake was removed in 24ea30c. Remove the three remaining no-op assignments in the periodic tick so the code no longer implies an idle back-off that never happens. Also document that the per-window cleanup election purges only the elected worker's own hcs and is eventually consistent under asymmetric checker distribution.

AlinsRan mentioned this pull request Jun 22, 2026

fix(healthcheck): update targets incrementally instead of destroy-and-rebuild apache/apisix#13582

Draft

5 tasks

AlinsRan force-pushed the fix/cleanup-gating-and-lock-scope branch from 04d4136 to ada5ee8 Compare June 22, 2026 03:02

AlinsRan added 4 commits June 22, 2026 15:14

AlinsRan marked this pull request as ready for review June 23, 2026 00:59

coderabbitai Bot reviewed Jun 23, 2026

View reviewed changes

AlinsRan added 4 commits June 23, 2026 14:48

coderabbitai Bot reviewed Jun 23, 2026

View reviewed changes

Comment thread lib/resty/healthcheck.lua Outdated

membphis approved these changes Jun 24, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: clean every checker each window and release periodic lock when idle#55

fix: clean every checker each window and release periodic lock when idle#55
AlinsRan wants to merge 11 commits into
masterfrom
fix/cleanup-gating-and-lock-scope

AlinsRan commented Jun 22, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 22, 2026 •

edited

Loading

Reviews paused

Walkthrough

Changes

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

membphis left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

AlinsRan commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Root cause 1 — stale-target cleanup is gated incorrectly

Root cause 2 — periodic lock not released when the worker goes idle

Fix

Upstream relation (Kong/lua-resty-healthcheck)

Test

Cross-PR dependency

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

membphis left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

AlinsRan commented Jun 22, 2026 •

edited

Loading

coderabbitai Bot commented Jun 22, 2026 •

edited

Loading