Skip to content

Race condition in retry tracking causes silent data loss #1744

@JoshSalway

Description

@JoshSalway

Description

RedisJobRepository::updateRetryInformationOnParent() and storeRetryReference() both use a non-atomic read-modify-write pattern on the retried_by hash field. If two retry jobs for the same parent complete concurrently, one update silently overwrites the other.

The code

updateRetryInformationOnParent() (line ~495):

if ($retries = $this->connection()->hget($payload->retryOf(), 'retried_by')) {
    $retries = $this->updateRetryStatus($payload, json_decode($retries, true), $failed);
    $this->connection()->hset($payload->retryOf(), 'retried_by', json_encode($retries));
}

storeRetryReference() (line ~700):

$retries = json_decode($this->connection()->hget($id, 'retried_by') ?: '[]');
$retries[] = ['id' => $retryId, 'status' => 'pending', ...];
$this->connection()->hmset($id, ['retried_by' => json_encode($retries)]);

The race

Both methods: HGET -> modify in PHP -> HSET. Two concurrent operations:

  1. Retry A reads [retry1: pending]
  2. Retry B reads [retry1: pending] (same state)
  3. Retry A updates retry1 to completed, HSETs [retry1: completed]
  4. Retry B appends retry2, HSETs [retry1: pending, retry2: pending]
  5. Retry A's status update is silently lost

How this was found

Found during a systematic audit of Horizon's Redis usage, searching for non-atomic read-modify-write patterns (HGET → modify in PHP → HSET) that are vulnerable to concurrent access. This is the same class of bug as the SETNX/EXPIRE race in Lock::get() (#1738) and the pipeline race in RedisHorizonCommandQueue::pending() (#1739).

The existing LuaScripts::updateMetrics() method already uses a Lua script for exactly this reason — atomic read-modify-write inside Redis — which confirmed these two methods should follow the same pattern.

Real-world evidence

Several existing issues describe symptoms consistent with this race condition under high concurrency:

The bug is silent — no errors are thrown, entries are simply dropped from the retried_by array or status updates are reverted. Users see missing or incorrect retry entries on the Horizon dashboard but would likely attribute it to other causes.

Suggested fix

Use Lua scripts (via the existing LuaScripts class) to perform the read-modify-write atomically inside Redis. This matches the existing pattern used by LuaScripts::updateMetrics() which does the same kind of atomic HMGET/modify/HMSET.

Impact

Retry status shown in the Horizon dashboard can be incorrect — a retry marked as "completed" may revert to "pending", or a retry reference may disappear entirely. This is a dashboard data integrity issue. Lower severity than #1738/#1739 but same root cause.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions