Save cluster file in bio to avoid the stuck io latency #2555
base: unstable
Conversation
When the cluster changes, we need to persist the cluster configuration. If I/O is delayed or blocked, possibly by disk contention, this can cause large latencies on the main thread. We should avoid synchronous I/O from the main thread, so in this commit we use bio to save the config file: we add a bio job and hand it an sds copy of the config file, and the bio thread performs the synchronous save, so an eventually consistent version of the config is still stored on disk. This may break our previous assumption that nodes.conf is always in sync and strongly consistent. For shutdown and CLUSTER SAVECONFIG, we wait for the bio jobs to drain and then trigger a new save synchronously.

Closes valkey-io#2424.

Signed-off-by: Binbin <[email protected]>
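For illustration, here is a minimal standalone sketch of the pattern described above. This is not the PR's code: the file names, the config_save_job struct, and the worker function are invented for the example; in the PR itself the snapshot is an sds and the worker is a bio thread. The idea is that the main thread snapshots the config into a heap buffer and hands it to a background thread, so the blocking write()/fsync() happens off the main thread.

```c
/* Standalone sketch (not Valkey code): snapshot the config on the main
 * thread, write it to disk on a background thread. Compile with -lpthread. */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

typedef struct {
    char *content; /* heap snapshot; ownership moves to the worker */
    size_t len;
    int do_fsync;
} config_save_job;

static void *config_save_worker(void *arg) {
    config_save_job *job = arg;
    int fd = open("nodes.conf.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd != -1) {
        size_t off = 0;
        while (off < job->len) {
            ssize_t n = write(fd, job->content + off, job->len - off);
            if (n <= 0) break; /* real code would retry on EINTR and log errors */
            off += (size_t)n;
        }
        if (job->do_fsync) fsync(fd);
        close(fd);
        rename("nodes.conf.tmp", "nodes.conf"); /* one way to avoid torn reads */
    }
    free(job->content); /* single free site: the worker owns the snapshot */
    free(job);
    return NULL;
}

int main(void) {
    const char *current_config = "node-id 127.0.0.1:6379 myself,master\n";
    config_save_job *job = malloc(sizeof(*job));
    job->content = strdup(current_config); /* point-in-time snapshot */
    job->len = strlen(current_config);
    job->do_fsync = 1;

    pthread_t tid;
    pthread_create(&tid, NULL, config_save_worker, job);
    /* ...the main thread keeps serving commands instead of blocking on disk... */
    pthread_join(tid, NULL); /* stands in for "drain bio jobs" on shutdown */
    return 0;
}
```

The temp-file-plus-rename step here is just one way to avoid readers observing a partially written file; the save path in the actual code may differ.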
Codecov Report ❌ Patch coverage is
Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #2555      +/-   ##
============================================
+ Coverage     72.19%   72.21%   +0.02%
============================================
  Files           126      126
  Lines         70615    70653      +38
============================================
+ Hits          50977    51020      +43
+ Misses        19638    19633       -5
/* Save the config, possibly using fsync. */
if (flags & CLUSTER_TODO_SAVE_CONFIG) {
    int fsync = flags & CLUSTER_TODO_FSYNC_CONFIG;
    clusterSaveConfigOrDie(fsync);
This may require some adjustments. Previously we saved synchronously and exited if the save failed. Now that we switch to bio, do we still need to retain the exit logic?
Also see #1032
Left a comment on #1032
High level thought:
We should make clusterSaveConfig async-only; otherwise we have no protection mechanism to stop a stale bio job from overwriting a cluster config file freshly saved from the main thread.
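One way to get that protection, sketched as a standalone example (the counter and function names are made up, nothing here is from the PR): stamp every save request with a monotonically increasing sequence number on the main thread, and have the bio worker drop any job whose stamp is older than the newest one already persisted.

```c
/* Standalone sketch of a sequence-number guard against stale saves. */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

static _Atomic uint64_t save_seq_issued;    /* bumped by the main thread per enqueue */
static _Atomic uint64_t save_seq_persisted; /* last sequence actually written to disk */

/* Main thread: stamp the job before handing it to bio. */
static uint64_t stamp_save_job(void) {
    return atomic_fetch_add(&save_seq_issued, 1) + 1;
}

/* Bio thread: skip jobs that are older than what is already on disk. */
static int should_write(uint64_t job_seq) {
    return job_seq > atomic_load(&save_seq_persisted);
}

static void mark_persisted(uint64_t job_seq) {
    uint64_t seen = atomic_load(&save_seq_persisted);
    while (job_seq > seen &&
           !atomic_compare_exchange_weak(&save_seq_persisted, &seen, job_seq));
}

int main(void) {
    uint64_t s1 = stamp_save_job();
    uint64_t s2 = stamp_save_job();
    /* Suppose the newer job happens to be written first... */
    if (should_write(s2)) mark_persisted(s2);
    /* ...then the older job arrives and is dropped instead of clobbering it. */
    printf("write job %llu? %d\n", (unsigned long long)s1, should_write(s1)); /* 0 */
    return 0;
}
```

If the synchronous path (shutdown, CLUSTER SAVECONFIG) also stamps its save, the same guard would keep a lingering bio job from overwriting it.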
}

/* Save the cluster file, if save fails, the process will exit. */
void clusterSaveConfigOrDie(int do_fsync) {
Just realized clusterSaveConfigOrDie is always called with fsync; maybe we can get rid of the method param altogether.
Yes, we can remove it once we make a decision on this one: #2555 (comment)
Co-authored-by: Harkrishn Patro <[email protected]>
Signed-off-by: Binbin <[email protected]>
    bioSubmitJob(BIO_RDB_SAVE, job);
}

void bioCreateClusterConfigSaveJob(sds content, int do_fsync) {
Who owns/frees content? bioCreateClusterConfigSaveJob(sds content, int do_fsync) enqueues an sds into job->cluster_save_args.content, and then the BIO worker calls clusterSaveConfigFromBio(...). Is there a clear free path for content afterwards? Could you please document the ownership and ensure a single, deterministic free site, to avoid leaks or double frees?
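For concreteness, one possible contract, sketched against the worker branch in this diff (not a drop-in patch; the serverLog call and message are illustrative): the caller transfers ownership of content when it enqueues the job, and the worker is the single free site on both the success and the error path.

```c
} else if (job_type == BIO_CLUSTER_SAVE) {
    /* Ownership: `content` was handed off by bioCreateClusterConfigSaveJob();
     * the caller must not touch it after enqueueing. */
    if (clusterSaveConfigFromBio(job->cluster_save_args.content,
                                 job->cluster_save_args.do_fsync) == C_ERR) {
        serverLog(LL_WARNING, "Background save of the cluster config failed");
    }
    sdsfree(job->cluster_save_args.content); /* single, deterministic free site */
}
```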
    latencyStartMonitor(latency);
    while (offset < content_size) {
-       written_bytes = write(fd, ci + offset, content_size - offset);
+       written_bytes = write(fd, content + offset, content_size - offset);
If multiple saves are enqueued, can older content overwrite newer content and amplify I/O? Can we coalesce jobs so only the latest content is written (dropping older items)? Do we need some sort of versioning/epoch so that older jobs are dropped?
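A standalone sketch of the "coalesce to latest" idea (all names invented for the example): the main thread just replaces a single pending snapshot under a lock, so the worker only ever writes the newest content and older snapshots are freed without touching the disk.

```c
/* Standalone sketch: coalesce queued saves so only the latest snapshot is written.
 * Compile with -lpthread. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static pthread_mutex_t pending_lock = PTHREAD_MUTEX_INITIALIZER;
static char *pending_content = NULL; /* newest snapshot not yet written */

/* Main thread: requesting a save just replaces the pending snapshot. */
static void request_config_save(const char *config) {
    char *snapshot = strdup(config);
    pthread_mutex_lock(&pending_lock);
    free(pending_content); /* the older, never-written snapshot is simply dropped */
    pending_content = snapshot;
    pthread_mutex_unlock(&pending_lock);
}

/* Worker: grab whatever is newest; NULL means nothing to do. */
static char *take_pending(void) {
    pthread_mutex_lock(&pending_lock);
    char *c = pending_content;
    pending_content = NULL;
    pthread_mutex_unlock(&pending_lock);
    return c;
}

int main(void) {
    request_config_save("configEpoch 1\n");
    request_config_save("configEpoch 2\n"); /* coalesces with the previous request */
    char *latest = take_pending();
    if (latest) {
        printf("worker writes: %s", latest); /* only the second snapshot hits the disk */
        free(latest);
    }
    return 0;
}
```

This also removes the stale-overwrite hazard by construction, since there is at most one pending snapshot and it is always the newest.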
    } else if (job_type == BIO_RDB_SAVE) {
        replicaReceiveRDBFromPrimaryToDisk(job->save_to_disk_args.conn, job->save_to_disk_args.is_dual_channel);
    } else if (job_type == BIO_CLUSTER_SAVE) {
        if (clusterSaveConfigFromBio(job->cluster_save_args.content, job->cluster_save_args.do_fsync) == C_ERR) {
Do we need to introduce exponential backoff and retries to mitigate transient errors? Can we also add observability/metrics around the type of error we see while trying to save the file/directory?
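A standalone sketch of bounded retry with exponential backoff (the attempt count, delays, and the flaky_save stub are made up; real code would also bump an error-type counter for observability):

```c
/* Standalone sketch: retry a failing save a few times with exponential backoff. */
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

static int save_with_retry(int (*save_fn)(void), int max_attempts) {
    useconds_t delay_us = 1000; /* 1ms, then 2ms, 4ms, ... */
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        if (save_fn() == 0) return 0;
        fprintf(stderr, "save attempt %d failed: errno=%d\n", attempt, errno);
        if (attempt < max_attempts) {
            usleep(delay_us);
            delay_us *= 2;
        }
    }
    return -1; /* give up; the caller logs and/or increments an error metric */
}

/* Stand-in for a transient disk hiccup: fails twice, then succeeds. */
static int attempts_seen = 0;
static int flaky_save(void) {
    return (++attempts_seen < 3) ? (errno = EIO, -1) : 0;
}

int main(void) {
    int rc = save_with_retry(flaky_save, 5);
    printf("final result: %s after %d attempts\n", rc == 0 ? "ok" : "error", attempts_seen);
    return 0;
}
```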