-
-
Couldn't load subscription status.
- Fork 2.1k
Race condition with replication means /messages backfill lacks read-after-write consistency between workers #14211
Description
We have have a problem where because events are persisted in a queue in a client_reader worker, there is no guarantee that they are available to read on other workers. So when we fire off a backfill request from /messages, those backfilled messages aren't necessarily available to paginate with after the backfill completes (even on the worker that put them in the persister queue).
CI failure: https://github.com/matrix-org/synapse/actions/runs/3182998161/jobs/5189731097#step:6:15343 (from discussion). This specific CI flake was addressed in matrix-org/complement#492
WORKERS=1 POSTGRES=1 COMPLEMENT_ALWAYS_PRINT_SERVER_LOGS=1 COMPLEMENT_DIR=../complement ./scripts-dev/complement.sh -run TestJumpToDateEndpoint/parallel/federation/can_paginate_after_getting_remote_event_from_timestamp_to_event_endpoint
Here is what happens:
serverBhasevent1stored as anoutlierfrom previous requests (specifically from MSC3030 jump to date pulling in a missingprev_eventafter backfilling)- Client on
serverBcalls/messages?dir=b serverB:client_reader1accepts the request and drives thingsserverB:client_reader1has some backward extremities in range and requests/backfillfromserverAserverB:client_reader1processes the events from backfill includingevent1and puts them in the_event_persist_queueserverB:masterpicks up the events from the_event_persist_queueand persists them to the database, de-outliersevent1and invalidates its own cache and sends them over replicationserverB:client_reader1starts assembling the/messagesresponse and getsevent1out of the stale cache still as anoutlierserverB:client_reader1responds to the/messagesrequest withoutevent1becauseoutliersare filtered outserverB:client_reader1finally gets the replication data and invalidates its own cache forevent1(too late, we already got the events from the stale cache and responded)
In a nutshell, we've written the test expecting "read-after-write consistency" but we don't have that.
It's exactly this but it really sucks that calling /messages doesn't include events we just backfilled for that request. This is a general problem with Synapse though, see issues labeled with
Z-Read-After-Write
/messages request so it's a little more insidious.
Having this be possible makes it even more of a reason that we should indicate gaps in /messages, MSC3871