- 
                Notifications
    You must be signed in to change notification settings 
- Fork 406
Description
This issue has been migrated from #14211.
We have have a problem where because events are persisted in a queue in a client_reader worker, there is no guarantee that they are available to read on other workers. So when we fire off a backfill request from /messages, those backfilled messages aren't necessarily available to paginate with after the backfill completes (even on the worker that put them in the persister queue).
CI failure: https://github.com/matrix-org/synapse/actions/runs/3182998161/jobs/5189731097#step:6:15343 (from discussion). This specific CI flake was addressed in matrix-org/complement#492
WORKERS=1 POSTGRES=1 COMPLEMENT_ALWAYS_PRINT_SERVER_LOGS=1 COMPLEMENT_DIR=../complement ./scripts-dev/complement.sh -run TestJumpToDateEndpoint/parallel/federation/can_paginate_after_getting_remote_event_from_timestamp_to_event_endpoint
Here is what happens:
- serverBhas- event1stored as an- outlierfrom previous requests (specifically from MSC3030 jump to date pulling in a missing- prev_eventafter backfilling)
- Client on serverBcalls/messages?dir=b
- serverB:client_reader1accepts the request and drives things
- serverB:client_reader1has some backward extremities in range and requests- /backfillfrom- serverA
- serverB:client_reader1processes the events from backfill including- event1and puts them in the- _event_persist_queue
- serverB:masterpicks up the events from the- _event_persist_queueand persists them to the database, de-outliers- event1and invalidates its own cache and sends them over replication
- serverB:client_reader1starts assembling the- /messagesresponse and gets- event1out of the stale cache still as an- outlier
- serverB:client_reader1responds to the- /messagesrequest without- event1because- outliersare filtered out
- serverB:client_reader1finally gets the replication data and invalidates its own cache for- event1(too late, we already got the events from the stale cache and responded)
In a nutshell, we've written the test expecting "read-after-write consistency" but we don't have that.
It's exactly this but it really sucks that calling /messages doesn't include events we just backfilled for that request. This is a general problem with Synapse though, see issues labeled with https://github.com/matrix-org/synapse/labels/Z-Read-After-Write. In this case, it's all within the same /messages request so it's a little more insidious.
Having this be possible makes it even more of a reason that we should indicate gaps in /messages, MSC3871