-
Notifications
You must be signed in to change notification settings - Fork 37
issue-1751: [Filestore] Restore and drain WriteBackCache at session recreation #4711
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
91f6686 to
a10526b
Compare
|
Note This is an automated comment that will be appended during run. 🔴 linux-x86_64-relwithdebinfo: some tests FAILED for commit 3410149.
🟢 linux-x86_64-relwithdebinfo: all tests PASSED for commit 3410149.
|
3410149 to
8626f7b
Compare
8626f7b to
650f286
Compare
|
Note This is an automated comment that will be appended during run. 🔴 linux-x86_64-relwithdebinfo: some tests FAILED for commit 650f286.
🟢 linux-x86_64-relwithdebinfo: all tests PASSED for commit 650f286.
|
|
|
||
| // Cache is drained and disabled - new requests go directly | ||
| // to the session | ||
| UNIT_ASSERT_VALUES_EQUAL(2, writeDataCalled2.load()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we check that cache file is empty?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WriteBackCache object is not accessible from this place.
And accessing the cache file directly is not reliable (this is a memory mapped file with no disk synchronization guarantee).
So the answer is: without writing tons of boilerplate code — no.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is quite easy, you can look here for example
nbs/cloud/filestore/libs/vfs_fuse/fs_ut.cpp
Line 1071 in da489bf
| pathToCache = TFsPath(bootstrap.DirectoryHandlesStoragePath) / |
I agree that it is a little bit hacky, but allow you to verify file state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure that accessing the file while it is being used by WriteBackCache is reliable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can suspend and check, but even without that, as I understand it, we can assume the test is deterministic and that nothing (including WBC) touches the file at this point. If that’s not the case, then we probably have a bigger problem.
|
For the future, this PR could be split into two parts: one for draining the cache on shutdown, and another for restoring it after suspension. |
|
| callContext = std::move(callContext), | ||
| request = std::move(request)](const auto& f) mutable | ||
| { | ||
| f.GetValue(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (can??) flush fails, we just proceed to write normally
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FlushAllData is not expected to throw an exception, we can safely make this call.
I don't like this code and I'd prefer Y_UNUSED(f) but this pattern is commonly used in other places.
f55f953 to
6530b26
Compare
6530b26 to
01d279e
Compare
|
Note This is an automated comment that will be appended during run. 🔴 linux-x86_64-relwithdebinfo: some tests FAILED for commit 20a9d0f.
🟢 linux-x86_64-relwithdebinfo: all tests PASSED for commit 20a9d0f.
|
| const auto& response = future.GetValue(); | ||
| const auto& error = response.GetError(); | ||
| self->FSyncQueue->Dequeue(reqId, error, TNodeId {ino}, THandle {handle}); | ||
| Y_ABORT_UNLESS( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's use STORAGE_VERIFY / STORAGE_VERIFY_C macros - they force the developer to provide the entity id (filesystem id / tablet id / client id / etc), otherwise we'll just see that some error happened without knowing anything about the entity that caused the problem (so the debugging process will take more time)
here we can use client id in the STORAGE_VERIFY message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can be done in a separate PR - there're many other ABORT_UNLESS usages in vfs_fuse which would be nice to replace with STORAGE_VERIFY(clientId)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say, that ABORT_UNLESS is the only usage for checks in vfs_fuse and filestore vhost for now. Wasn't able to find a single instance of Verify in those components
#1751
Resolve data loss in the following scenario:
Actual behavior: nfs-vhost completely ignores the non-flushed requests remaining in the persistent queue.
Expected behavior: nfs-vhost should check for the presence of the persistent queue and initialize WriteBackCache if the persistent queue exists regardless of the value of ServerWriteBackCacheEnabled flag. If ServerWriteBackCacheEnabled is not set, new writes should not be processed until the queue is flushed.