
Conversation

enjoy-binbin
Member

When the cluster changes, we need to persist the cluster configuration.
If I/O is delayed or blocked, possibly by disk contention, this may result
in large latencies on the main thread.

We should avoid synchronous I/O from the main thread, so in this commit
we use bio to save the config file. We add a bio job and hand it an sds
copy of the config file contents; the bio thread performs the synchronous
save, so an eventually consistent version of the config is always stored
on disk.

This may break our previous assumption that nodes.conf is in sync and
strongly consistent. For shutdown and CLUSTER SAVECONFIG, we wait for the
bio jobs to drain and then trigger a new save synchronously.

Closes #2424.


Signed-off-by: Binbin <[email protected]>
@enjoy-binbin enjoy-binbin marked this pull request as ready for review August 27, 2025 12:05

codecov bot commented Aug 27, 2025

Codecov Report

❌ Patch coverage is 96.07843% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.21%. Comparing base (1ba622b) to head (0e37d99).
⚠️ Report is 7 commits behind head on unstable.

Files with missing lines Patch % Lines
src/bio.c 87.50% 1 Missing ⚠️
src/cluster_legacy.c 97.67% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2555      +/-   ##
============================================
+ Coverage     72.19%   72.21%   +0.02%     
============================================
  Files           126      126              
  Lines         70615    70653      +38     
============================================
+ Hits          50977    51020      +43     
+ Misses        19638    19633       -5     
Files with missing lines Coverage Δ
src/bio.c 85.43% <87.50%> (-0.09%) ⬇️
src/cluster_legacy.c 87.58% <97.67%> (+0.43%) ⬆️

... and 13 files with indirect coverage changes


/* Save the config, possibly using fsync. */
if (flags & CLUSTER_TODO_SAVE_CONFIG) {
int fsync = flags & CLUSTER_TODO_FSYNC_CONFIG;
clusterSaveConfigOrDie(fsync);
Member Author

This may require some adjustments. Previously, we saved synchronously and exited if the save failed. Now that we switch to bio, do we still need to retain the exit logic?

Also see #1032

Collaborator

Left a comment on #1032

Collaborator

@hpatro hpatro left a comment

High level thought:

We should make clusterSaveConfig async-only; otherwise we have no protection mechanism preventing a stale bio job from overwriting a cluster config file freshly saved from the main thread.

}

/* Save the cluster file, if save fails, the process will exit. */
void clusterSaveConfigOrDie(int do_fsync) {
Collaborator

Just realized clusterSaveConfigOrDie always fsyncs; maybe we can get rid of the method param altogether.

Member Author

Yes, we can remove it once we make a decision on #2555 (comment).

Co-authored-by: Harkrishn Patro <[email protected]>
Signed-off-by: Binbin <[email protected]>
bioSubmitJob(BIO_RDB_SAVE, job);
}

void bioCreateClusterConfigSaveJob(sds content, int do_fsync) {


Who owns/frees `content`? `bioCreateClusterConfigSaveJob(sds content, int do_fsync)` enqueues an sds into `job->cluster_save_args.content`, then the BIO worker calls `clusterSaveConfigFromBio(...)`. Is there a clear free path for `content` afterwards? Please document ownership and ensure a single, deterministic free site to avoid leaks or double-frees.

latencyStartMonitor(latency);
while (offset < content_size) {
-        written_bytes = write(fd, ci + offset, content_size - offset);
+        written_bytes = write(fd, content + offset, content_size - offset);


If multiple saves are enqueued, older content can overwrite newer content and amplify I/O. Can we coalesce jobs so only the latest content is written (dropping older items)? Do we need some sort of versioning/epoch so that older jobs are dropped?

} else if (job_type == BIO_RDB_SAVE) {
replicaReceiveRDBFromPrimaryToDisk(job->save_to_disk_args.conn, job->save_to_disk_args.is_dual_channel);
} else if (job_type == BIO_CLUSTER_SAVE) {
if (clusterSaveConfigFromBio(job->cluster_save_args.content, job->cluster_save_args.do_fsync) == C_ERR) {


Do we need to introduce exponential backoff and retry to mitigate transient errors? Can we include observability and metrics around the types of errors we see while trying to save the file/directory?

Development

Successfully merging this pull request may close these issues.

[BUG] Synchronous I/O in clusterSaveConfig
3 participants