Skip to content

Commit 57ec06c

Browse files
committed
fix(keys): avoid duplicate metrics from KeyIndex desync
KeyIndex could emit the same key more than once, surfacing as duplicate metrics (apache/apisix#11934). Two issues combine to cause this: 1. KeyIndex:list() iterated the stored key values (self.keys), which can transiently hold the same key value at two different slots when an expired metric is re-added at a new slot before the old slot is reclaimed. The duplicated value was then emitted twice. list() now walks the slots but only emits a key at the slot the index points at (self.index[key] == idx), so each key is listed exactly once while preserving ordering. 2. When add() re-adds a key whose shared-dict entry had already expired, it cleared the local index/keys but did not bump delete_count or clear the dict slot. Other workers therefore never did a full sync and kept the stale slot in their local self.keys, desynchronizing the index. add() now mirrors remove(): it clears the dict slot, increments the local deleted counter and bumps delete_count. Adds a regression test (TestKeyIndex:testExpiredReAddNoDuplicate) that reproduces the duplicate emission before the fix and passes after.
1 parent f86aa3e commit 57ec06c

2 files changed

Lines changed: 57 additions & 3 deletions

File tree

prometheus_keys.lua

Lines changed: 19 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -91,9 +91,18 @@ function KeyIndex:list()
9191
self:sync()
9292
local copy = {}
9393
local i = 1
94-
for _, v in pairs(self.keys) do
95-
copy[i] = v
96-
i = i + 1
94+
-- Walk the key slots in order, but only emit a key when the slot is the one
95+
-- the index currently points at (self.index[key] == idx). self.keys can
96+
-- transiently hold the same key value in two different slots (e.g. when an
97+
-- expired metric is re-added at a new slot before the old slot is reclaimed);
98+
-- listing the raw self.keys values would emit duplicate metrics. Consulting
99+
-- the index guarantees each key is listed exactly once, at its canonical slot.
100+
for idx = 0, self.last do
101+
local key = self.keys[idx]
102+
if key and self.index[key] == idx then
103+
copy[i] = key
104+
i = i + 1
105+
end
97106
end
98107
return copy
99108
end
@@ -125,6 +134,13 @@ function KeyIndex:add(key_or_keys, err_msg_lru_eviction, exptime)
125134
self.index[key] = nil
126135
self.keys[idx] = nil
127136
self.expire_keys[idx] = nil
137+
self.dict:set(self.key_prefix .. idx, nil)
138+
self.deleted = self.deleted + 1
139+
-- Bump delete_count so other workers do a full sync and reclaim the
140+
-- stale slot. Without this the old slot lingers in their local
141+
-- self.keys while the metric is re-added at a new slot, which
142+
-- desynchronizes the index and causes duplicate metric emission.
143+
self.dict:incr(self.delete_count, 1, 0)
128144
expired = true
129145
end
130146
end

prometheus_test.lua

Lines changed: 38 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -804,6 +804,44 @@ function TestKeyIndex:testList()
804804
luaunit.assertEquals(keys[3], "key3")
805805
end
806806

807+
-- Regression test for apache/apisix#11934 (duplicate metrics).
808+
-- A key with an exptime is added and synced (self.last now tracks N). The key
809+
-- then expires in the underlying shared dict without going through remove(), so
810+
-- neither key_count nor delete_count changes and the next sync() is a no-op,
811+
-- leaving self.index still pointing at the now-vanished slot. Re-adding the key
812+
-- therefore takes the "expired" path in add() and allocates a new slot while
813+
-- the stale slot lingers in self.keys. Before the fix, delete_count was not
814+
-- bumped (other workers never re-synced) and list() iterated self.keys, so the
815+
-- same key was emitted twice -> duplicate metrics.
816+
function TestKeyIndex:testExpiredReAddNoDuplicate()
817+
self.key_index:add("expkey", "eviction_err", 1)
818+
self.key_index:sync()
819+
luaunit.assertEquals(self.dict:get("_prefix_key_count"), 1)
820+
luaunit.assertEquals(self.dict:get("_prefix_key_1"), "expkey")
821+
luaunit.assertEquals(self.dict:get("_prefix_delete_count"), nil)
822+
luaunit.assertEquals(self.key_index.index["expkey"], 1)
823+
824+
-- Let the key expire in the underlying shared dict. key_count and
825+
-- delete_count are untouched, so the next sync() inside add() is a no-op and
826+
-- the local index keeps pointing at the (now gone) slot 1.
827+
sleep(2)
828+
luaunit.assertEquals(self.dict:get("_prefix_key_1"), nil)
829+
830+
-- Re-adding the now-expired key takes the expired branch and allocates slot 2.
831+
self.key_index:add("expkey", "eviction_err", 1)
832+
luaunit.assertEquals(ngx.logs, nil)
833+
luaunit.assertEquals(self.dict:get("_prefix_key_2"), "expkey")
834+
835+
-- delete_count must have been bumped on the expired re-add path so that
836+
-- other workers do a full sync and drop the stale slot.
837+
luaunit.assertEquals(self.dict:get("_prefix_delete_count"), 1)
838+
839+
-- list() must report the key exactly once, not twice.
840+
local keys = self.key_index:list()
841+
luaunit.assertEquals(#keys, 1)
842+
luaunit.assertEquals(keys[1], "expkey")
843+
end
844+
807845
function TestKeyIndex:testSync()
808846
self.key_index:sync()
809847
luaunit.assertEquals(ngx.logs, nil)

0 commit comments

Comments
 (0)