Skip to content

fix(id-allocation): Schedule a flush when submitting ID Allocation op before replaying pending states #24683

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
May 23, 2025

Conversation

markfields
Copy link
Member

@markfields markfields commented May 21, 2025

Description

Fixes AB#39398

In ContainerRuntime, we submit to the Outbox directly from replayPendingStates, and so we should also call scheduleFlush to ensure that op doesn't get stuck in the Outbox. Being "stuck in the Outbox" is a problem because it's assumed that all ops in a batch come during the same JS turn and have the same reference sequence number, and this will violate that invariant.

That one-line change is accompanied by some refactoring so we don't have to consider Immediate mode at that callsite (scheduleFlush handles it now).

In-Depth

In the typical case, calling PendingStateManager.replayPendingStates will trigger another op(s) to submit which will schedule the flush. But here's a counterexample:

Create a new DataStore and use a compressed ID in it. Reconnect before attaching it or making any other changes.

This will result in the replay flow, and we will submit an ID Allocation op, but there's nothing else to resubmit. This op will be "stuck" in the Outbox.

In the current code, this is kind of ok, because ID Allocation ops don't make the container dirty, and it's ok to drop that op. BUT, as mentioned above, it can easily violate the invariant that all ops in a batch have the same reference sequence number, because that "old" op will be included in the next batch, which could be some time later after new ops have been processed (which advances the refSeq). See the unit test added here.

@Copilot Copilot AI review requested due to automatic review settings May 21, 2025 23:11
@github-actions github-actions bot added area: runtime Runtime related issues base: main PRs targeted against main branch labels May 21, 2025
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR ensures that after submitting an ID Allocation operation, a flush is scheduled to prevent ops from stalling in the Outbox and refactors the flush scheduling logic to use a unified flushPending flag and doPendingFlush helper.

  • Call scheduleFlush immediately after submitting an ID Allocation op before replaying pending states.
  • Replace flushTaskExists with flushPending and introduce a doPendingFlush method.
  • Update scheduleFlush to use a switch on flushMode, removing the old currentlyBatching helper and assertions.

Prompt:
> Write a test that generates a compressed ID, then reconnects (triggering replayPendingStates), then processes an op from another client, then generates another ID and submits a simple op using that ID. It should hit the "outboxSequenceNumberCoherencyCheck" error on flush.

I had to do surprisingly little modification (besides extra refactoring I wanted to)
@markfields markfields merged commit b4e1fd1 into microsoft:main May 23, 2025
37 checks passed
@markfields markfields deleted the idc/replay-fix branch May 23, 2025 18:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: runtime Runtime related issues base: main PRs targeted against main branch Feature_StagingMode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants