Defer slot migration failover until async modules are consistent #2609
Summary
Problem: Modules that perform asynchronous processing (e.g., Valkey-search) can experience "read-your-own-write" consistency regressions during atomic slot migrations. The engine currently has no awareness of module-specific processing backlogs and can transfer slot ownership before an async module on the target node has fully processed all incoming writes, leading to data inconsistencies.
Solution: This PR introduces a synchronization mechanism that makes slot migration "module-aware." It defers the final `REQUEST-FAILOVER` step until all registered asynchronous modules on the target node certify that they are fully consistent with the data state at the moment the source node pauses. This is achieved through three core changes:
1. New Module APIs
Three new module APIs are introduced to allow the engine to query the state of a module's asynchronous processing pipeline:
`VM_RegisterAsyncProcessingStatusFunc(ValkeyModuleCtx *ctx, ValkeyModuleAsyncProcessingStatusFunc callback)`
Modules register this callback (at most one) for the engine to poll their status. Within this callback, modules must use the following two reporting functions.
`VM_ReportAsyncProcessingOffset(ValkeyModuleAsyncProcessingStatusCtx *ctx, uint64_t accepted_offset, uint64_t applied_offset)`
Reports the module's "point-in-time" data consistency. Modules define their own offset scheme; this allows the engine to know when `applied` (fully processed) data has caught up to the last `accepted` (queued) data at a specific moment (i.e., the pause). There is no canonical offset we can reuse, since the engine's replication offset is only tracked during replication, which is why modules are left to devise their own scheme.
`VM_ReportAsyncProcessingLag(ValkeyModuleAsyncProcessingStatusCtx *ctx, long long current_lag_millis)`
Reports an estimate (in milliseconds) of the module's current processing backlog. This is used by the source node to calculate the total end-to-end lag before initiating a pause.
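As a rough illustration, here is a minimal sketch of how a module with an internal ingest queue might wire these APIs up. It assumes the `VM_*` functions above are exposed to modules under the conventional `ValkeyModule_*` prefix and that the status callback receives the status context directly; the two counters and the `estimate_backlog_ms` helper are hypothetical, not part of this PR.

```c
/* Hypothetical module-side sketch: two atomic counters track the async
 * ingest pipeline, and the status callback reports them to the engine. */
#include "valkeymodule.h"
#include <stdatomic.h>

static _Atomic uint64_t accepted_offset; /* bumped when a write is queued */
static _Atomic uint64_t applied_offset;  /* bumped when a write is fully processed */

/* Hypothetical helper: converts queue depth into an estimated backlog,
 * here naively modeled as ~100 microseconds of processing per item. */
static long long estimate_backlog_ms(uint64_t depth) {
    return (long long)(depth / 10);
}

static void AsyncStatusCallback(ValkeyModuleAsyncProcessingStatusCtx *sctx) {
    /* Load applied before accepted so the snapshot can never show
     * applied running ahead of accepted. */
    uint64_t app = atomic_load(&applied_offset);
    uint64_t acc = atomic_load(&accepted_offset);
    ValkeyModule_ReportAsyncProcessingOffset(sctx, acc, app);
    ValkeyModule_ReportAsyncProcessingLag(sctx, estimate_backlog_ms(acc - app));
}

int ValkeyModule_OnLoad(ValkeyModuleCtx *ctx, ValkeyModuleString **argv, int argc) {
    VALKEYMODULE_NOT_USED(argv);
    VALKEYMODULE_NOT_USED(argc);
    if (ValkeyModule_Init(ctx, "asyncdemo", 1, VALKEYMODULE_APIVER_1) == VALKEYMODULE_ERR)
        return VALKEYMODULE_ERR;
    /* Register the (at most one) status callback from the new API. */
    ValkeyModule_RegisterAsyncProcessingStatusFunc(ctx, AsyncStatusCallback);
    return VALKEYMODULE_OK;
}
```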
2. Migration Change: Pre-Pause Lag Probing (Source Node)
This change replaces the source node's reliance on its output buffer size as the trigger for pausing. The output buffer is blind to the module-processing backlog on the target, which could lead to pause timeouts.
We now use an active lag-probing system based on two new `SYNCSLOTS` commands:
`SYNCSLOTS GET-LAG <mstime>`: Sent from source to target to request a lag report.
`SYNCSLOTS LAG <millis>`: Sent from target to source with the total calculated lag.
Flow:
1. The source sends `SYNCSLOTS GET-LAG <mstime>`, stamped with its current time.
2. The target calculates the network lag (from the received `<mstime>`), and then polls its local async modules (via the new APIs) to get the maximum `module_lag_millis`.
3. The target replies with `SYNCSLOTS LAG <total_ms>`, where `total_ms` is the sum of network and module lag.
4. The source initiates the pause only once `total_ms` drops below a configurable threshold (default: 1 second). This ensures the target is reasonably "ready" for the final synchronization. A sketch of the target-side calculation follows.
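The sketch below shows the target-side arithmetic under stated assumptions; it is not the PR's actual code, and the helper names are hypothetical.

```c
/* Illustrative sketch of target-side GET-LAG handling. The extern helpers
 * (pollMaxModuleLagMillis, sendSyncSlotsLag) are hypothetical stand-ins. */
typedef long long mstime_t;

extern mstime_t mstime(void);                      /* current time in ms */
extern long long pollMaxModuleLagMillis(void);     /* max lag across registered modules */
extern void sendSyncSlotsLag(long long total_ms);  /* emits SYNCSLOTS LAG <total_ms> */

/* Called when the target receives SYNCSLOTS GET-LAG <mstime>. */
void handleSyncSlotsGetLag(mstime_t source_mstime) {
    /* Network lag: how long the probe took to reach us (assumes loosely
     * synchronized clocks between source and target). */
    long long network_lag = mstime() - source_mstime;
    if (network_lag < 0) network_lag = 0;

    /* Module lag: worst-case backlog reported via the new status callbacks. */
    long long module_lag = pollMaxModuleLagMillis();

    /* Total end-to-end lag reported back to the source. */
    sendSyncSlotsLag(network_lag + module_lag);
}
```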
3. Migration Change: Post-Pause Module Wait (Target Node)
This is the core consistency guarantee. Simply having low lag before the pause is not enough; we must ensure modules on the target process every write that occurred before the pause.
However, the target node may still be processing writes for other, non-importing slots, so we cannot simply "wait for all activity to stop." Instead, we use the new offset APIs to create a precise synchronization barrier (sketched in code after the steps below):
1. At the moment of the pause, the target polls each registered module and records its current `accepted_offset`. This value creates a "checkpoint" representing the exact state of each module's queue at the moment of the pause.
2. Only when all modules report an `applied_offset` that is equal to (or greater than) their recorded `accepted_offset` checkpoint does the target proceed.
3. The target then sends `SYNCSLOTS REQUEST-FAILOVER` to finalize the migration.
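A minimal sketch of this barrier, assuming hypothetical helpers for polling each module's offsets via its registered status callback:

```c
/* Illustrative sketch (hypothetical names) of the post-pause barrier on the
 * target: snapshot per-module accepted_offset checkpoints at pause time,
 * then poll until every module's applied_offset has caught up. */
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t accepted;   /* last offset queued by the module */
    uint64_t applied;    /* last offset fully processed */
} ModuleOffsets;

/* Hypothetical helpers wrapping the new status-callback machinery. */
extern size_t numRegisteredModules(void);
extern ModuleOffsets pollModuleOffsets(size_t i);

/* At pause time: record each module's accepted offset as its checkpoint. */
void recordPauseCheckpoints(uint64_t *checkpoints) {
    for (size_t i = 0; i < numRegisteredModules(); i++)
        checkpoints[i] = pollModuleOffsets(i).accepted;
}

/* Polled afterwards: returns 1 once every module has applied all writes it
 * had accepted at pause time, i.e. it is safe to send REQUEST-FAILOVER. */
int modulesConsistentSincePause(const uint64_t *checkpoints) {
    for (size_t i = 0; i < numRegisteredModules(); i++) {
        if (pollModuleOffsets(i).applied < checkpoints[i]) return 0;
    }
    return 1;
}
```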
New Sequence Diagram