Skip to content

fix: clean every checker each window and release periodic lock when idle#55

Open
AlinsRan wants to merge 11 commits into
masterfrom
fix/cleanup-gating-and-lock-scope
Open

fix: clean every checker each window and release periodic lock when idle#55
AlinsRan wants to merge 11 commits into
masterfrom
fix/cleanup-gating-and-lock-scope

Conversation

@AlinsRan

@AlinsRan AlinsRan commented Jun 22, 2026

Copy link
Copy Markdown

Problem

Two issues that only manifest when multiple upstreams have health checks enabled — single-upstream setups don't reproduce them, which is why earlier repro attempts failed. They are the library-side root causes behind apache/apisix#13385, apache/apisix#13141 and apache/apisix#13235.

Root cause 1 — stale-target cleanup is gated incorrectly

last_cleanup_check is a module-level timestamp shared by every checker, yet the active-check timer both read and wrote it inside the for _, checker_obj in pairs(hcs) loop:

for _, checker_obj in pairs(hcs) do
  if (last_cleanup_check + CLEANUP_INTERVAL) < cur_time then
    -- purge targets marked via delayed_clear()
    ...
    last_cleanup_check = cur_time   -- advanced on the FIRST checker
  end
  ...
end

The first checker to enter the cleanup branch advances last_cleanup_check, so the condition turns false for every other checker in the same iteration. With several health-checked upstreams, only the first upstream's purge-marked nodes are removed from the shm; the rest keep their deleted nodes forever — still returned by the control API (#13385) and still actively probed (#13141).

A related defect in the same path: cleanup ran only inside the get_periodic_lock branch. A passive-only deployment (checks.passive, no active interval) does no active probing, so no worker ever holds the lock and cleanup never runs at all — its delayed_clear()-marked targets leak forever (the passive-only side of #13385).

Root cause 2 — periodic lock not released when the worker goes idle

When every checker on the lock-holding worker is stopped (active == false), the timer kept renewing the global PERIODIC_LOCK. Workers that still own active checkers could never acquire it, so active health checks stopped entirely after a disable → re-enable cycle (#13235).

Fix

  • Cleanup — drop the module-level last_cleanup_check and elect one cleanup worker per window via an atomic shm:add(cleanup_key, pid, CLEANUP_INTERVAL). The elected worker purges every checker in hcs in a single pass; the others skip it, so they don't all contend on each checker's target-list lock. The election is independent of the periodic lock, so passive-only deployments are cleaned too.
  • Lock — the holder releases PERIODIC_LOCK once it has no active checker left, letting another worker take over. (Extends fix: release periodic lock when no active checkers exist #54, which fixed the lock direction but not the cleanup gating.)

Upstream relation (Kong/lua-resty-healthcheck)

This repo is a fork of Kong/lua-resty-healthcheck. How the fixes map to upstream:

Fix In Kong upstream? In this fork (before this PR)?
RC1 — multi-checker cleanup (#13385 / #13141) ✅ Kong PR Kong#146 (is_checked), merged 2023-12-22 ❌ never synced — bug still present
Cleanup decoupled from the lock / passive-only (#13385) ❌ still gated behind get_periodic_lock
RC2 — release the lock when idle (#13235)

Only RC1 has an upstream fix, and it never reached this fork (Kong is on 3.1.x, this fork on 3.2.x). That fix keeps cleanup gated behind the periodic lock and merely moves the timestamp advance out of the loop, so it does not cover the passive-only leak. This PR's election approach is a superset of it; RC2 is net-new on both sides. Both are worth contributing back upstream to reduce fork drift.

Kong Kong#146 is intentionally not cherry-picked — it would fix only RC1 while leaving cleanup lock-gated, so it is strictly weaker than what this PR does.

Test

19-multi-checker-cleanup.t creates two health-checked upstreams, marks all targets of both for delayed removal, and asserts both are cleaned after the window — it fails before the fix (the second checker keeps its targets) and passes after. 20-periodic-lock-release.t and 21-passive-only-cleanup.t cover RC2 and the passive-only path (mirrored under with_resty-events/). A test-only _M._set_cleanup_interval hook (guarded by __TESTING_HEALTHCHECKER) shortens the window for determinism.

Verified locally with prove; the existing 00-new, 01-start-stop, 10-garbagecollect, 11-clear suites still pass.

Cross-PR dependency

The companion apache/apisix PR (incremental health-check target updates + no-nil-window rebuild) needs a release of this library that contains this fix. Once this is merged and tagged, bump the apisix rockspec dependency lua-resty-healthcheck-api7 to that version.

Relates to apache/apisix#13385, apache/apisix#13141, apache/apisix#13235; extends #54.

Summary by CodeRabbit

  • Bug Fixes

    • Improved stale-target cleanup so purge-marked targets are removed consistently across all health checker instances, including passive-only checkers.
    • Refined periodic-lock behavior: when no active probing exists, the lock is released so a worker that still owns active checkers can take over; when active probing exists, only the elected worker performs active probes.
  • Tests

    • Switched TLS active-probe coverage to a deterministic local self-signed server (with updated certificate/SNI expectations).
    • Added/expanded integration tests for multi-checker cleanup, periodic-lock release, passive-only cleanup, and more deterministic cross-worker event timing.

@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

The periodic healthcheck timer in healthcheck.lua is updated to release the periodic SHM lock when all checkers are disabled, and to compute stale-target cleanup once per tick using shm-based election rather than once per checker, preventing lock starvation and cross-checker cleanup starvation. Four new integration test pairs exercise these fixes, including a dedicated test for passive-only checkers. TLS active probe tests are migrated from external public hosts to a local self-signed TLS server. Status-version tests gain a cross-worker barrier handshake for deterministic event ordering.

Changes

Healthcheck Timer Fixes and Test Hardening

Layer / File(s) Summary
Periodic lock release and cleanup starvation fix
lib/resty/healthcheck.lua
Timer cleanup is restructured to use shm-based election: only the elected worker purges stale targets for all checkers. Before acquiring the periodic lock, the timer scans checkers and releases the lock if none have active checks. When a checker is restarted with active probes, the timer interval is "woken" from backed-off state. A _M._set_cleanup_interval helper is added under TESTING.
Periodic lock release integration tests
t/with_worker-events/20-periodic-lock-release.t, t/with_resty-events/20-periodic-lock-release.t
Both test files verify that the PERIODIC_LOCK SHM key is present while a checker is active and is removed after the checker is stopped and the lock renewal window elapses.
Multi-checker stale-target cleanup tests
t/with_worker-events/19-multi-checker-cleanup.t, t/with_resty-events/19-multi-checker-cleanup.t
Both test files create two independent healthcheck instances with a shortened cleanup interval, mark all targets for removal, and assert both upstreams reach zero remaining targets after multiple cleanup windows.
Passive-only checker cleanup tests
t/with_worker-events/21-passive-only-cleanup.t, t/with_resty-events/21-passive-only-cleanup.t
Both test files configure a passive-only healthcheck (no active probing), add targets, mark them for delayed removal, and assert that cleanup completes without active checker participation.
TLS active probe tests: local self-signed server
t/with_worker-events/14-tls_active_probes.t, t/with_resty-events/14-tls_active_probes.t
External TLS targets replaced by a local Nginx SSL server on port 2115 using a bundled self-signed cert. Each test case probes 127.0.0.1:2115 with example.test or wrong.host.test SNI and asserts the expected certificate verification result.
Cross-worker barrier for status-version tests
t/with_worker-events/99-status_ver.t, t/with_resty-events/99-status_ver.t
init_worker_by_lua_block is extended with a barrier handshake that records observed worker pings in shared memory and waits until all worker IDs are seen before adding the healthcheck target. Test delays and error-log assertions are updated to match the deterministic sequencing.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the two main fixes: stale-target cleanup for every checker and release of the periodic lock when idle.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
E2e Test Quality Review ✅ Passed E2E tests comprehensively verify the fixes: cleanup election pattern (19), periodic lock release (20), and passive-only cleanup (21) with real worker events services. Tests confirm both issues are...
Security Check ✅ Passed Security review across 7 categories found no vulnerabilities. No sensitive data exposure in logs (headers logged only at DEBUG level), no database storage, proper authorization via namespace isolat...
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/cleanup-gating-and-lock-scope

Comment @coderabbitai help to get the list of available commands.

Two issues affecting workloads with multiple health-checked upstreams:

1. Stale-target cleanup gating. `last_cleanup_check` is a module-level
   timestamp shared by every checker, but it was updated *inside* the
   `for _, checker_obj in pairs(hcs)` loop. The first checker that runs
   cleanup advanced the timestamp, so the cleanup condition turned false
   for every remaining checker in the same iteration. As a result only
   the first upstream's purge-marked targets were ever removed; the other
   upstreams kept their deleted nodes in the shm forever (still returned
   by the control API and still actively probed). This is why it could
   not be reproduced with a single upstream. The window is now decided
   once before the loop and the timestamp advanced once after it, so
   every checker in `hcs` is cleaned in the same window.

2. Periodic lock not released when idle. When every checker on the
   lock-holding worker has been stopped (active == false), the timer kept
   renewing the global PERIODIC_LOCK, so other workers that still own
   active checkers could never acquire it and active checks stopped
   entirely after a disable -> re-enable cycle. The lock holder now
   releases the lock when no active checker remains. This extends the
   work started in #54.

Adds a regression test (t/with_worker-events/19-multi-checker-cleanup.t)
that creates two health-checked upstreams, marks both for delayed
removal, and asserts both are cleaned. It fails before the gating fix
(second checker keeps its targets) and passes after.
@AlinsRan AlinsRan force-pushed the fix/cleanup-gating-and-lock-scope branch from 04d4136 to ada5ee8 Compare June 22, 2026 03:02
AlinsRan added 4 commits June 22, 2026 15:14
…ents

- Add a regression test asserting the periodic lock is released once no
  active checker remains on the worker (so another worker can take over).
- Mirror the multi-checker cleanup test and the new lock test into the
  resty-events suite so both event backends exercise the fixes.
14-tls_active_probes.t actively probes the external host badssl.com via a
hard-coded IP (104.154.89.105). It is flaky in CI whenever that host is
slow/unreachable and is currently red on every run, even for the non-disable
TEST 1. Disable it via the repo's .t_disabled convention (cf.
10-garbagecollect, 13-integration) until it is rewritten against a local TLS
server. This is unrelated to the cleanup/lock changes in this PR.
Check whether the worker still owns an active checker before acquiring the
periodic lock instead of after. A worker whose checkers were all disabled no
longer re-acquires and immediately releases the lock on every back-off tick;
it releases the lock once (if it still holds it) and then stops contending.
14-tls_active_probes: replace the external badssl.com probe (hard-coded
IP 104.154.89.105, flaky/unreachable in CI) with a local self-signed TLS
server embedded in the test nginx config. The cert has CN/SAN
example.test, so probing with that SNI and verification enabled succeeds,
a mismatching SNI with verification enabled fails, and verification off
accepts the mismatch - reproducing the original healthy/unhealthy
assertions against a local server. Wait for the probe by polling the
target status instead of a fixed sleep, and use a distinct port per
events backend to avoid cross-suite collisions. Re-enabled (renamed from
.t_disabled).

99-status_ver: the per-target status-change event is delivered only to
workers connected to the events broker at publish time (no replay), so a
worker whose subscription was not yet ready intermittently missed the
change (~50% flake). Add a cross-worker barrier in init_worker that
confirms broadcasts reach every worker before the target is added, and
create the checker (registering its subscription) before adding the
target so no status change can fire before every worker is subscribed.
Both workers now deterministically observe the version change; assert it
via grep_error_log occurrence count.
@AlinsRan AlinsRan marked this pull request as ready for review June 23, 2026 00:59

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
t/with_worker-events/99-status_ver.t (1)

54-67: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Consider bounding the barrier loop to avoid a runaway background loop.

while true has no timeout or max-retry cap. If cross-worker delivery never completes (e.g., a worker fails to subscribe), this loop spins in the timer thread indefinitely. The test would still fail cleanly via the error-log assertion, but a bounded retry would surface a clearer failure and avoid the lingering loop.

♻️ Optional: cap the handshake retries
             local my_id = ngx.worker.id()
-            while true do
+            local attempts = 0
+            while attempts < 200 do
                 we.post("barrier", "ping", my_id)
                 we.poll()
                 local seen = 0
                 for i = 0, worker_count - 1 do
                     if shm:get("barrier:" .. tostring(i)) then
                         seen = seen + 1
                     end
                 end
                 if seen >= worker_count then
                     break
                 end
                 ngx.sleep(0.05)
+                attempts = attempts + 1
             end
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@t/with_worker-events/99-status_ver.t` around lines 54 - 67, The `while true`
loop in the barrier synchronization section has no timeout or maximum retry
limit, which can cause indefinite spinning if worker delivery fails. Add a
counter variable to track retry attempts and set a reasonable maximum number of
iterations (e.g., 100 retries with 0.05 second sleeps) before breaking out of
the loop. Once the maximum retries is reached, break from the loop so the test
can proceed to failure assertions rather than hanging indefinitely. This ensures
the test fails with a clear timeout rather than lingering in the background
loop.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@t/with_worker-events/99-status_ver.t`:
- Around line 54-67: The `while true` loop in the barrier synchronization
section has no timeout or maximum retry limit, which can cause indefinite
spinning if worker delivery fails. Add a counter variable to track retry
attempts and set a reasonable maximum number of iterations (e.g., 100 retries
with 0.05 second sleeps) before breaking out of the loop. Once the maximum
retries is reached, break from the loop so the test can proceed to failure
assertions rather than hanging indefinitely. This ensures the test fails with a
clear timeout rather than lingering in the background loop.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c0536338-7599-4ce9-beb5-2fa57c21b77b

📥 Commits

Reviewing files that changed from the base of the PR and between 3a6ddf0 and 38d86b3.

⛔ Files ignored due to path filters (4)
  • t/with_resty-events/util/tls_probe_cert.pem is excluded by !**/*.pem
  • t/with_resty-events/util/tls_probe_key.pem is excluded by !**/*.pem
  • t/with_worker-events/util/tls_probe_cert.pem is excluded by !**/*.pem
  • t/with_worker-events/util/tls_probe_key.pem is excluded by !**/*.pem
📒 Files selected for processing (9)
  • lib/resty/healthcheck.lua
  • t/with_resty-events/14-tls_active_probes.t
  • t/with_resty-events/19-multi-checker-cleanup.t
  • t/with_resty-events/20-periodic-lock-release.t
  • t/with_resty-events/99-status_ver.t
  • t/with_worker-events/14-tls_active_probes.t
  • t/with_worker-events/19-multi-checker-cleanup.t
  • t/with_worker-events/20-periodic-lock-release.t
  • t/with_worker-events/99-status_ver.t

AlinsRan added 4 commits June 23, 2026 14:48
The previous revision gated the whole periodic tick on has_active_checker
and returned before the cleanup loop when no active checker existed. In a
passive-only deployment (checks.passive but no active interval) no worker
ever has an active checker, so delayed_clear()-marked targets were never
purged and leaked in the shm (apache/apisix#13385) -- a regression only on
the pure-passive path, which the existing tests (all using active probes)
did not cover.

Run stale-target cleanup independently of active probing and of the
periodic lock: locking_target_list serializes the purge across workers and
re-reads the shm list each time, so running it on every worker is safe and
idempotent. The periodic lock is still released on has_active_checker so a
disabled worker yields to an active one (apache/apisix#13235); that gate
stays on has_active_checker rather than on hcs being non-empty because a
disabled checker lingers in the weak hcs table with active=false until GC.

Add a passive-only + delayed_clear regression test for both event backends.
For a passive-only checker checks.active.*.active is nil (never set to
true by start(), never set to false by stop()), so ngx.say rendered it as
"nil" rather than "false". Normalize with `not not` so the assertion
is stable. The cleanup assertion (remaining: 0) already passed, confirming
the fix purges delayed_clear targets with no active checker.
Two follow-ups from the PR review of the cleanup/lock-gating rework:

1. Stale-target cleanup ran on every worker each window, so N workers x M
   checkers all contended on each checker's TARGET_LIST_LOCK. Elect a single
   worker per window via an atomic shm:add on a dedicated cleanup key
   (TTL = CLEANUP_INTERVAL). Any worker can win -- including a passive-only
   one -- so the apache/apisix#13385 leak fix still holds, while the rest skip
   the pass instead of fighting for the lock. Removes the per-worker
   last_cleanup_check bookkeeping.

2. After a disable -> re-enable cycle, a backed-off timer (running at
   CHECK_INTERVAL * 10 while idle or after losing the lock race) took up to
   one slow tick to resume probing. checker:start() now wakes the timer back
   to CHECK_INTERVAL when it re-enables active checks.
The barrier handshake spun in an unbounded `while true` loop, so a worker
that never synced would hang until the global test timeout with no useful
diagnostic. Cap the wait with a deadline and fail loudly with the synced
count when it is exceeded.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@lib/resty/healthcheck.lua`:
- Around line 1425-1433: The timer wake-up optimization in the conditional block
(around line 1428-1431) that sets active_check_timer.interval to CHECK_INTERVAL
cannot interrupt an in-flight backed-off sleep because the lua-resty-timer
library only applies interval changes after the current handler completes, not
during an active sleep. To fix this and ensure the re-enabled checker resumes
fast probing promptly, replace the interval mutation approach with either a
direct call to ngx.timer.at(CHECK_INTERVAL, ...) to immediately reschedule the
timer callback, or alternatively defer the fast probing resumption to the next
natural timer wake by removing this optimization if an immediate interrupt is
not feasible with the current timer architecture.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a446900a-f113-41c0-9e50-5be1416582cb

📥 Commits

Reviewing files that changed from the base of the PR and between ae3e2f9 and 8c71840.

📒 Files selected for processing (3)
  • lib/resty/healthcheck.lua
  • t/with_resty-events/99-status_ver.t
  • t/with_worker-events/99-status_ver.t
🚧 Files skipped from review as they are similar to previous changes (2)
  • t/with_resty-events/99-status_ver.t
  • t/with_worker-events/99-status_ver.t

Comment thread lib/resty/healthcheck.lua Outdated
The interval-wake added to checker:start() could not bring a re-enabled
checker back to fast probing: lua-resty-timer (~> 1) schedules each recurring
sleep from self.sub_interval, which is frozen to the initial interval at
construction and never re-synced from self.interval; self.interval only feeds
scheduling for node-wide (key_name) timers, which the active-check timer is
not. Mutating active_check_timer.interval from start() (or anywhere) therefore
has no effect on the timer cadence, so the optimization was a no-op. Remove it.

@membphis membphis left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

lua-resty-timer schedules each recurring sleep from sub_interval (frozen at
construction), never re-reading self.interval, so mutating
active_check_timer.interval has no effect on the timer cadence -- the same
reason the start() interval-wake was removed in 24ea30c. Remove the three
remaining no-op assignments in the periodic tick so the code no longer implies
an idle back-off that never happens.

Also document that the per-window cleanup election purges only the elected
worker's own hcs and is eventually consistent under asymmetric checker
distribution.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants