@e673
Collaborator

@e673 e673 commented Nov 24, 2025

#1751

Resolve data loss in the following scenario:

  1. WriteBackCache was enabled and used.
  2. ServerWriteBackCacheEnabled flag was disabled in the configuration.
  3. nfs-vhost was restarted.

Actual behavior: nfs-vhost completely ignores the non-flushed requests remaining in the persistent queue.

Expected behavior: nfs-vhost should check for the presence of the persistent queue and, if it exists, initialize WriteBackCache regardless of the value of the ServerWriteBackCacheEnabled flag. If ServerWriteBackCacheEnabled is not set, new writes should not be processed until the queue is flushed.
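
The expected behavior can be read as a small decision table. The sketch below is only an illustration of that table under stated assumptions: `ECacheMode`, `ChooseCacheMode` and `ShouldHoldNewWrite` are hypothetical names that do not come from the nfs-vhost sources, and the real change may structure this differently.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names for illustration; not taken from the nfs-vhost sources.
enum class ECacheMode
{
    Disabled,   // no cache: writes go directly to the session
    Enabled,    // cache accepts and buffers new writes
    DrainOnly,  // cache is created only to flush leftover requests
};

// Decide how to bring up the cache when the session is (re)created.
ECacheMode ChooseCacheMode(bool serverWriteBackCacheEnabled, bool persistentQueueExists)
{
    if (serverWriteBackCacheEnabled) {
        return ECacheMode::Enabled;
    }
    // The flag is off, but leftovers from a previous run must not be ignored.
    return persistentQueueExists ? ECacheMode::DrainOnly : ECacheMode::Disabled;
}

// In drain-only mode a new write has to wait until the queue is flushed.
bool ShouldHoldNewWrite(ECacheMode mode, std::size_t pendingRequests)
{
    return mode == ECacheMode::DrainOnly && pendingRequests > 0;
}
```

With the flag disabled and a queue present, `ChooseCacheMode(false, true)` yields `ECacheMode::DrainOnly`, and `ShouldHoldNewWrite` stops holding writes once the pending count reaches zero.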

@e673 e673 changed the title from "issue-1751: Restore and drain WriteBackCache at session recreation" to "issue-1751: [Filestore] Restore and drain WriteBackCache at session recreation" Nov 24, 2025
@e673 e673 requested a review from SvartMetal November 24, 2025 13:21
@e673 e673 added the filestore label Nov 24, 2025
@github-actions
Contributor

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo: all tests PASSED for commit 91f6686.

| TESTS | PASSED | ERRORS | FAILED | FAILED BUILD | SKIPPED | MUTED? |
| --- | --- | --- | --- | --- | --- | --- |
| 3181 | 3181 | 0 | 0 | 0 | 0 | 0 |

@e673 e673 force-pushed the users/nasonov/issue-4201-create-session branch from 91f6686 to a10526b Compare November 25, 2025 10:48
@github-actions
Contributor

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo: all tests PASSED for commit a10526b.

| TESTS | PASSED | ERRORS | FAILED | FAILED BUILD | SKIPPED | MUTED? |
| --- | --- | --- | --- | --- | --- | --- |
| 3181 | 3181 | 0 | 0 | 0 | 0 | 0 |

@neihar neihar self-requested a review November 26, 2025 08:49
@github-actions
Contributor

github-actions bot commented Nov 26, 2025

Note

This is an automated comment that will be appended during run.

🔴 linux-x86_64-relwithdebinfo: some tests FAILED for commit 3410149.

| TESTS | PASSED | ERRORS | FAILED | FAILED BUILD | SKIPPED | MUTED? |
| --- | --- | --- | --- | --- | --- | --- |
| 3181 | 3180 | 0 | 1 | 0 | 0 | 0 |

🟢 linux-x86_64-relwithdebinfo: all tests PASSED for commit 3410149.

| TESTS | PASSED | ERRORS | FAILED | FAILED BUILD | SKIPPED | MUTED? |
| --- | --- | --- | --- | --- | --- | --- |
| 2 | 2 | 0 | 0 | 0 | 0 | 0 |

@github-actions
Contributor

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo: all tests PASSED for commit 8626f7b.

| TESTS | PASSED | ERRORS | FAILED | FAILED BUILD | SKIPPED | MUTED? |
| --- | --- | --- | --- | --- | --- | --- |
| 3194 | 3194 | 0 | 0 | 0 | 0 | 0 |

@e673 e673 force-pushed the users/nasonov/issue-4201-create-session branch from 8626f7b to 650f286 Compare December 2, 2025 13:01
@github-actions
Contributor

github-actions bot commented Dec 2, 2025

Note

This is an automated comment that will be appended during run.

🔴 linux-x86_64-relwithdebinfo: some tests FAILED for commit 650f286.

| TESTS | PASSED | ERRORS | FAILED | FAILED BUILD | SKIPPED | MUTED? |
| --- | --- | --- | --- | --- | --- | --- |
| 3196 | 3194 | 0 | 2 | 0 | 0 | 0 |

🟢 linux-x86_64-relwithdebinfo: all tests PASSED for commit 650f286.

| TESTS | PASSED | ERRORS | FAILED | FAILED BUILD | SKIPPED | MUTED? |
| --- | --- | --- | --- | --- | --- | --- |
| 63 | 63 | 0 | 0 | 0 | 0 | 0 |

@e673 e673 requested a review from qkrorlqr December 3, 2025 19:14

// Cache is drained and disabled - new requests go directly
// to the session
UNIT_ASSERT_VALUES_EQUAL(2, writeDataCalled2.load());
Collaborator

can we check that cache file is empty?

Collaborator Author

WriteBackCache object is not accessible from this place.
And accessing the cache file directly is not reliable (this is a memory mapped file with no disk synchronization guarantee).
So the answer is: without writing tons of boilerplate code — no.

Collaborator

it is quite easy, you can look here for example

pathToCache = TFsPath(bootstrap.DirectoryHandlesStoragePath) /

I agree that it is a little bit hacky, but it allows you to verify the file state.

Collaborator Author

Are you sure that accessing the file while it is being used by WriteBackCache is reliable?

Collaborator

We can suspend and check, but even without that, as I understand it, we can assume the test is deterministic and that nothing (including WBC) touches the file at this point. If that’s not the case, then we probably have a bigger problem.

@neihar
Collaborator

neihar commented Dec 3, 2025

For the future, this PR could be split into two parts: one for draining the cache on shutdown, and another for restoring it after suspension.

@neihar
Collaborator

neihar commented Dec 3, 2025

> For the future, this PR could be split into two parts: one for draining the cache on shutdown, and another for restoring it after suspension.

I see, it is already split; this PR just contains both commits :)

callContext = std::move(callContext),
request = std::move(request)](const auto& f) mutable
{
f.GetValue();
Collaborator

If the flush fails (can it?), we just proceed to write normally.

Collaborator Author

FlushAllData is not expected to throw an exception, so we can safely make this call.
I don't like this code and I'd prefer Y_UNUSED(f), but this pattern is commonly used in other places.
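
The practical difference between calling `f.GetValue()` and ignoring `f` can be sketched with a toy stand-in. This assumes that `GetValue()` rethrows a stored exception, as is typical for future types; `TCompletedFlush`, `ContinuationObservesFailure` and `MakeFailedFlush` are hypothetical names for illustration and are not the future type used in the code under review.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>

// Toy stand-in for an already completed future; hypothetical, illustration only.
struct TCompletedFlush
{
    std::exception_ptr Error;   // null when the flush succeeded

    // Rethrows the stored error, if there is one.
    void GetValue() const
    {
        if (Error) {
            std::rethrow_exception(Error);
        }
    }
};

// Returns true if the continuation noticed that the flush had failed.
bool ContinuationObservesFailure(const TCompletedFlush& f, bool callGetValue)
{
    if (!callGetValue) {
        (void)f;    // same effect as Y_UNUSED(f): a stored error is dropped
        return false;
    }
    try {
        f.GetValue();
        return false;
    } catch (const std::exception&) {
        return true;
    }
}

// Builds a completed flush that carries an error.
TCompletedFlush MakeFailedFlush()
{
    return {std::make_exception_ptr(std::runtime_error("flush failed"))};
}
```

In this sketch the failure is visible only when `GetValue()` is called; if the flush can never fail, both variants behave the same.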

@e673 e673 force-pushed the users/nasonov/issue-4201-create-session branch 2 times, most recently from f55f953 to 6530b26 Compare December 4, 2025 12:51
@e673 e673 requested a review from neihar December 4, 2025 12:55
@e673 e673 force-pushed the users/nasonov/issue-4201-create-session branch from 6530b26 to 01d279e Compare December 4, 2025 13:07
@github-actions
Contributor

github-actions bot commented Dec 4, 2025

Note

This is an automated comment that will be appended during run.

🔴 linux-x86_64-relwithdebinfo: some tests FAILED for commit 20a9d0f.

| TESTS | PASSED | ERRORS | FAILED | FAILED BUILD | SKIPPED | MUTED? |
| --- | --- | --- | --- | --- | --- | --- |
| 3196 | 3195 | 0 | 1 | 0 | 0 | 0 |

🟢 linux-x86_64-relwithdebinfo: all tests PASSED for commit 20a9d0f.

| TESTS | PASSED | ERRORS | FAILED | FAILED BUILD | SKIPPED | MUTED? |
| --- | --- | --- | --- | --- | --- | --- |
| 2 | 2 | 0 | 0 | 0 | 0 | 0 |

const auto& response = future.GetValue();
const auto& error = response.GetError();
self->FSyncQueue->Dequeue(reqId, error, TNodeId {ino}, THandle {handle});
Y_ABORT_UNLESS(
Collaborator

let's use STORAGE_VERIFY / STORAGE_VERIFY_C macros - they force the developer to provide the entity id (filesystem id / tablet id / client id / etc), otherwise we'll just see that some error happened without knowing anything about the entity that caused the problem (so the debugging process will take more time)

here we can use client id in the STORAGE_VERIFY message
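
To make the point about entity ids concrete, here is a rough sketch of a verify-style check that carries the entity in its failure message. It is not the real STORAGE_VERIFY definition: `SKETCH_VERIFY` and `FormatVerifyFailure` are hypothetical names, and the message format is an assumption.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <sstream>
#include <string>

// Hypothetical helper: builds the text printed before aborting.
std::string FormatVerifyFailure(
    const std::string& condition,
    const std::string& entityType,
    const std::string& entityId)
{
    std::ostringstream out;
    out << "Verification failed: " << condition
        << " [" << entityType << ": " << entityId << "]";
    return out.str();
}

// Hypothetical macro: the caller cannot omit the entity type and id.
#define SKETCH_VERIFY(expr, entityType, entityId)                           \
    do {                                                                    \
        if (!(expr)) {                                                      \
            std::fputs(                                                     \
                FormatVerifyFailure(#expr, entityType, entityId).c_str(),   \
                stderr);                                                    \
            std::fputc('\n', stderr);                                       \
            std::abort();                                                   \
        }                                                                   \
    } while (false)
```

A call such as `SKETCH_VERIFY(ok, "Client", clientId)` would then abort with a message that names the client, which is the debugging benefit described above.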

Collaborator

Can be done in a separate PR - there're many other ABORT_UNLESS usages in vfs_fuse which would be nice to replace with STORAGE_VERIFY(clientId)

Collaborator

I would say that ABORT_UNLESS is currently the only kind of check used in vfs_fuse and filestore vhost. I wasn't able to find a single instance of Verify in those components.

Labels

filestore Add this label to run only cloud/filestore build and tests on PR

4 participants