Conversation

@dustinpho
Contributor

Make Volume threadsafe by using sync locks (à la parking_lot) around metadata operations and fd (file descriptor) operations. We try to keep these guards as short-lived as possible; as a result, a lot of what the thread-safe Volume returns is now cloned out before being returned. I didn't see a noticeable performance hit when walking and reading all directories within the berkeley_gnm_recon.

One exception to this rule (of holding the lock for as short a time as possible) is during a Volume walk, where we do want a consistent view. I added a WalkVolume iterator that holds the lock for the entire walk.

Another change is how readdir is implemented. Since we're copying everything out anyway, I wanted to optimize it to copy as little as possible per call. It now paginates, using a last-seen token as the offset. I had to change the FUSE layer to support this as well, since a mkdir or rmdir interleaved between readdir calls may produce duplicate or skipped entries (it's hard to guarantee that no entries get skipped).
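
Rough sketch of the pagination shape (illustrative only -- ReadDirPage, children, and the field names here are made up, not the real API):

// Pages are keyed by the name of the last entry returned, so a later call can
// resume strictly after it even though the lock was dropped in between.
pub struct ReadDirPage {
    pub entries: Vec<DirEntry>,
    /// Pass this back as the token for the next call.
    pub last_seen: Option<String>,
}

impl Volume {
    pub fn readdir_page(&self, dir: Ino, last_seen: Option<&str>, max: usize) -> Result<ReadDirPage> {
        // The guard is held only while copying out this single page.
        let meta = self.meta.read();
        let entries: Vec<DirEntry> = meta
            .children(dir)? // hypothetical: yields entries in name order
            .filter(|e| last_seen.map_or(true, |seen| e.name.as_str() > seen))
            .take(max)
            .map(|e| e.to_owned())
            .collect();
        let last_seen = entries.last().map(|e| e.name.clone());
        Ok(ReadDirPage { entries, last_seen })
    }
}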

@dustinpho dustinpho requested a review from blinsay November 24, 2025 17:04
Member

@blinsay blinsay left a comment

Looks rad, much better than the channel'd version. Two big questions:

  • We don't have any conventions around lock ordering, so nothing stops us from accidentally introducing a deadlock when we take both the metadata and fd locks. You did too good a job hiding it, but I think new_fd is probably the biggest potential offender. rustc doesn't help us here, and I'm not aware of a crate that helps either. I think we need something, but I'm not sure exactly what; one possible shape is sketched after this list.

  • Commit seems a little bit wrong now. We're locking metadata to make a copy of the staged files in StagedVolume::upload and then releasing the lock to do the upload (good), but nothing seems to stop someone from modifying one of the staged files between upload and modify (bad).
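
One shape that might work for the lock-ordering question, just to make it concrete (the field names and guard types below are placeholders, not the actual Volume layout): funnel both acquisitions through a single helper so every call site gets the same order.

use parking_lot::{Mutex, MutexGuard, RwLock, RwLockWriteGuard};

// Placeholders for the real tables.
struct Metadata;
struct FdTable;

struct Volume {
    meta: RwLock<Metadata>,
    fds: Mutex<FdTable>,
}

impl Volume {
    /// The only place that takes both locks: metadata first, fds second.
    /// If every call site (new_fd included) goes through here, the ordering
    /// can't be violated by accident.
    fn lock_meta_then_fds(&self) -> (RwLockWriteGuard<'_, Metadata>, MutexGuard<'_, FdTable>) {
        let meta = self.meta.write();
        let fds = self.fds.lock();
        (meta, fds)
    }
}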

@dustinpho
Contributor Author

Commit seems a little bit wrong now. We're locking metadata to make a copy of the staged files in StagedVolume::upload and then releasing the lock to do the upload (good), but nothing seems to stop someone from modifying one of the staged files between upload and modify (bad).

doh ... for some reason I had it in my head that I would do this somewhere else down the line, but I should be doing it here (take the lock, take a snapshot, then merge it with whatever's here after the upload finishes).

pond/src/lib.rs Outdated
}
}

impl<'a> From<DirEntryRef<'a>> for DirEntry {
Member

wdyt about calling this DirEntryRef::to_owned or something?
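
Rough sketch of what I mean (assumes DirEntryRef is Clone and just reuses the existing From impl):

impl<'a> DirEntryRef<'a> {
    /// Same conversion as the From impl, but reads better at the call site.
    pub fn to_owned(&self) -> DirEntry {
        DirEntry::from(self.clone())
    }
}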

/// Opening a Fd with write permissions will always truncate the file.
pub async fn open_read_write(&mut self, ino: Ino) -> Result<Fd> {
// known false positive: https://github.com/rust-lang/rust-clippy/issues/6446
#[allow(clippy::await_holding_lock)]
Member

let's shove the reason into the attribute; the stand-alone comment is easy to miss.

Suggested change
#[allow(clippy::await_holding_lock)]
#[allow(clippy::await_holding_lock, reason = "https://github.com/rust-lang/rust-clippy/issues/6446")]

FileDescriptor::Staged { file: file.into() },
)
ino => {
let mut metadata_guard = self.meta.write();
Member

do you think it's worth doing upgradable_read here instead? we only write to metadata if the file is committed.
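
Something like this, using parking_lot's upgradable guard (is_committed / stage_truncated are made-up names for whatever the real checks end up being):

use parking_lot::RwLockUpgradableReadGuard;

let meta = self.meta.upgradable_read();
if meta.is_committed(ino) {
    // Only take exclusivity when we actually have to swap in the new staged file.
    let mut meta = RwLockUpgradableReadGuard::upgrade(meta);
    meta.stage_truncated(ino, path)?;
}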

// truncate the file (by assigning it a brand new staged file) if it's
// committed. the alternative would be to keep a copy of the committed
// file locally as a staged file, which can be expensive if it's a large file.
let (path, file) = self.store.tempfile()?;
Member

crap, I think we missed that this should be async-ified as well, which means this branch also has to drop the lock and then grab it again. I think we need to revalidate before we do.

I could also buy an argument that store.tempfile() needs to stay sync and just return a new tempfile path instead of doing any IO. there are other potential races there (tempfile name collisions), but that might be easier to reason about.
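
For the sync variant, roughly what I'm picturing (sketch only; staging_dir and the naming scheme are made up, and the caller creates the file after dropping the lock):

use std::path::PathBuf;
use std::process;
use std::sync::atomic::{AtomicU64, Ordering};

static TEMPFILE_COUNTER: AtomicU64 = AtomicU64::new(0);

impl Store {
    /// Hand out a fresh staging path without touching the disk, so it's safe
    /// to call while holding the metadata lock.
    pub(crate) fn tempfile_path(&self) -> PathBuf {
        let n = TEMPFILE_COUNTER.fetch_add(1, Ordering::Relaxed);
        self.staging_dir.join(format!("staged-{}-{n}", process::id()))
    }
}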


pub async fn open_read(&mut self, ino: Ino) -> Result<Fd> {
// known false positive: https://github.com/rust-lang/rust-clippy/issues/6446
#[allow(clippy::await_holding_lock)]
Member

same reason=... attribute thing as above


/// Pack a local directory into a Pond volume.
pub async fn pack(&mut self, dir: impl AsRef<Path>, version: Version) -> crate::Result<()> {
pub async fn pack(&self, dir: impl AsRef<Path>, version: Version) -> crate::Result<()> {
Member

wondering if we should keep pack &mut self. It seems weird to pack a directory while we're modifying it elsewhere.

@dustinpho dustinpho marked this pull request as draft November 26, 2025 20:23
@dustinpho
Contributor Author

Ok -- so since last update there have been a few larger changes:

  1. the tempfile stuff was doing I/O while under a lock, so we changed it to do the file creation outside of the lock, during file_open. 5f9fd29
  2. we've implemented commit, but made it a bit more aggressive than I previously thought we had to. we create a snapshot of everything in the volume under a lock, then release it. if users keep writing to the volume concurrently, we treat any changes they make to staged files that we plan to commit (i.e. files captured in our snapshot) as if those files were already committed, truncating them. we still upload the snapshot we took, and afterwards we stitch the metadata back together (see the sketch after this list). c96a0a0 and cb38346 (the metadata cleanup is messed up here, as I tried to LLM it; it ended up being pretty gross even when I implemented it by hand, so I ended up changing how we intern locations).
  3. the commit made keeping the interned location vector super annoying to deal with. I ended up swapping it out for a BTreeSet, and we serialize it with an index since a BTreeSet's iteration order is stable (based on its Ord). Location is completely Arc'd now (since the underlying enums are both Arc'd), so we pass it around as a pointer. b354f30
  4. got rid of the parking_lot RwLocks, since they aren't actually proven to be better than std::sync::RwLock. this makes the callsites for those locks kinda ugly, since parking_lot actually hides the panic from us (but does panic if it's poisoned, just like we do). 56aa123
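
For (2), the generation check on open looks roughly like this (sketch; commit_in_progress, is_staged, generation_of, and truncate_to_new_staged are illustrative names, not the exact code):

// commit bumps the generation after taking its snapshot; open compares a
// staged file's recorded generation against the live one.
let current = self.inner.generation.load(Ordering::SeqCst);
let committing = self.inner.commit_in_progress.load(Ordering::SeqCst);

let mut meta = self.inner.meta.write().expect("lock was poisoned");
if committing && meta.is_staged(ino) && meta.generation_of(ino) < current {
    // The file is captured by the in-flight snapshot; treat it like a
    // committed file and hand back a fresh (truncated) staged copy.
    meta.truncate_to_new_staged(ino, path)?;
}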

@dustinpho dustinpho requested a review from blinsay December 1, 2025 20:38
@dustinpho dustinpho marked this pull request as ready for review December 1, 2025 20:38
@dustinpho
Contributor Author

the location stuff I can easily move out, but I'm including it here because you can see how simple it makes the metadata merging. merging metadata with the interned location vec (where we had to maintain the invariants around staged files, deduplication, and indexes) was super nasty.

.find(|(i, l)| l.is_staged())
else {
/// Remove any locations that no longer have references.
pub(crate) fn prune_unreferenced_locations(&mut self) {
Member

I think we can do this without walking the volume if we lean on Arc::try_unwrap.

something like:

impl Location {
  fn try_collect(self) -> Option<Self> {
     match Arc::try_unwrap(self.inner) {
       Ok(_value) => None, // nothing else references it; discard it
       Err(shared) => Some(Self { inner: shared }),
     }
  }
}

let locations = self.locations.into_iter().filter_map(|l| l.try_collect());

Member

after writing this I realized we're passing around the Arcs outside of the Volume as well so this won't work unless we decide we don't want to do that. just from a quick read of the code, I'm not sure there's a reason to let the Arc leak out, but I also don't know how we'd control that without making Location not-Clone, which seems bad.

};
let EntryData::File {
location_idx,
location: mut_location,
Member

nit: file_location?

/// Attempting to get the physical location of a directory or a symlink
/// returns an error.
pub(crate) fn location(&self, ino: Ino) -> Option<(&Location, &ByteRange)> {
pub(crate) fn location(&self, ino: Ino) -> Option<(Location, &ByteRange)> {
Member

letting the Arc leak out of metadata feels weird. do we actually end up needing to do this for anything?

.map(|c| c.as_os_str().to_string_lossy().to_string())
.collect();
self.metadata_mut().mkdir_all(Ino::Root, dirs)?;
self.meta
Member

reading this again, we should pull out the version check and the commit because the dir walk is all sync/blocking. oops! next PR.

// line in the sand on what will be included in the commit. if someone opens a staged file
// and sees that its generation number is less than the current generation, and commit is
// in progress, then we'll treat it similarly to opening a committed file (i.e. truncating
// it).
Member

this is good docco. wdyt about moving it (and the rest of the description of how commit works) onto the Commit struct so there's one place to read it all?

// and sees that its generation number is less than the current generation, and commit is
// in progress, then we'll treat it similarly to opening a committed file (i.e. truncating
// it).
self.inner.generation.fetch_add(1, Ordering::SeqCst);
Member

I might try to hide inc_generation and current_generation behind methods on Volume.
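
e.g. (assuming the generation stays an AtomicU64 on the shared inner state):

impl Volume {
    /// Read the live generation.
    fn current_generation(&self) -> u64 {
        self.inner.generation.load(Ordering::SeqCst)
    }

    /// Bump the generation and return the new value.
    fn inc_generation(&self) -> u64 {
        self.inner.generation.fetch_add(1, Ordering::SeqCst) + 1
    }
}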

let mut metadata = self.inner.meta.write().expect("lock was poisoned");
metadata.set_version(snapshot.version().clone());

let snapshot_generation = self.inner.generation.load(Ordering::SeqCst);
Member

this is the live generation, not the snapshot generation, right?

continue;
}

// retain mtime and ctime changes (e.g. if the file was touched to update mtime or
Member

I guess we don't have strong rules about marking a file staged when we modify metadata, do we? this seems quite weird, but I can't actually point to any problems.

I know we use them for apply_location_ranges to change the snapshot, but the same thing seems like it should be true here.

// regular files that we committed up to S3. only update them if they weren't updated since
// we took the snapshot. if they were updated, then any concurrent updates to the metadata
// while we were committing will take precedence.
for snapshot_entry in snapshot.walk(Ino::Root)? {
Member

we have the set of all inodes we uploaded and need to modify. it feels weird to do a walk here.
