-
Notifications
You must be signed in to change notification settings - Fork 1
core: make Volume threadsafe #25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
blinsay
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks rad, much better than the channel-d versione. Two big questions:
-
We don't have any conventions around lock ordering so we don't accidentally introduce a deadlock when we're taking the metadata and fd locks. You did (too good!) of a job hiding it, but I think
new_fdis probably the biggest potential offender.rustcdoesn't help us here. andI'm not really aware of a crate that helps either. I think we need something, but I'm not sure exactly what. -
Commit seems a little bit wrong now. We're locking metadata to make a copy of the staged files in
StagedVolume::uploadand then releasing the lock to do the upload (good) but nothing seems like it stops someone from modifying one of the staged files betweenuploadandmodify(bad).
doh ... for some reason i had it in my head that i would do this somewhere else down the line, but I should be doing that here. (the locking, taking a snapshot, then merging it after it's uploaded with whatever's here) |
pond/src/lib.rs
Outdated
| } | ||
| } | ||
|
|
||
| impl<'a> From<DirEntryRef<'a>> for DirEntry { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wdyt about calling this DirEntryRef::to_owned or something?
pond/src/volume.rs
Outdated
| /// Opening a Fd with write permissions will always truncate the file. | ||
| pub async fn open_read_write(&mut self, ino: Ino) -> Result<Fd> { | ||
| // known false positive: https://github.com/rust-lang/rust-clippy/issues/6446 | ||
| #[allow(clippy::await_holding_lock)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's shove the reason in the attribute. it's harder to spot the comment
| #[allow(clippy::await_holding_lock)] | |
| #[allow(clippy::await_holding_lock, reason = "https://github.com/rust-lang/rust-clippy/issues/6446")] |
pond/src/volume.rs
Outdated
| FileDescriptor::Staged { file: file.into() }, | ||
| ) | ||
| ino => { | ||
| let mut metadata_guard = self.meta.write(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you think it's worth doing upgradeable_read here instead? we only write to metadata if the file is comitted.
pond/src/volume.rs
Outdated
| // truncate the file (by assigning it a brand new staged file) if it's | ||
| // committed. the alternative would be to keep a copy of the committed | ||
| // file locally as a staged file, which can be expensive if it's a large file. | ||
| let (path, file) = self.store.tempfile()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
crap, I think we missed that this should be async-ified as well, which means this branch also has to drop the lock and then grab it again. I think we need to revalidate before we do.
I could also buy an argument thatstore.tempfile() needs to stay sync and just return a new tempfile path instead of doing any IO. there's other potential races there (tempfile name collisions) but that might be easier to reason about.
pond/src/volume.rs
Outdated
|
|
||
| pub async fn open_read(&mut self, ino: Ino) -> Result<Fd> { | ||
| // known false positive: https://github.com/rust-lang/rust-clippy/issues/6446 | ||
| #[allow(clippy::await_holding_lock)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same reason=... attribute thing as above
pond/src/volume.rs
Outdated
|
|
||
| /// Pack a local directory into a Pond volume. | ||
| pub async fn pack(&mut self, dir: impl AsRef<Path>, version: Version) -> crate::Result<()> { | ||
| pub async fn pack(&self, dir: impl AsRef<Path>, version: Version) -> crate::Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wondering if we should keep pack &mut self. It seems weird to pack a directory while we're modifying it elsewhere.
c853b94 to
9dac10b
Compare
52a67de to
b354f30
Compare
|
Ok -- so since last update there have been a few larger changes:
|
|
the location stuff i can easily move out, but i'm including it here because you can see how simple it makes the metadata merging. the metadata merging with the interned location vec (where we had to maintain the invariants wrt staged, deduplicates, indexes) made it super nasty. |
| .find(|(i, l)| l.is_staged()) | ||
| else { | ||
| /// Remove any locations that no longer have references. | ||
| pub(crate) fn prune_unreferenced_locations(&mut self) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can do this without walking the volume if we lean on Arc::try_unwrap.
something like:
impl Location {
fn try_collect(self: Self) -> Option<Self> {
match Arc::try_unwrap(self.inner_thingy) {
Ok(_value) => None, // discard it
Err(shared) => Some(Self{ inner: shared })),
}
}
let locations = self.locations.into_iter().filter_map(|l| l.try_collect());There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after writing this I realized we're passing around the Arcs outside of the Volume as well so this won't work unless we decide we don't want to do that. just from a quick read of the code, I'm not sure there's a reason to let the Arc leak out, but I also don't know how we'd control that without making Location not-Clone, which seems bad.
| }; | ||
| let EntryData::File { | ||
| location_idx, | ||
| location: mut_location, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: file_location?
| /// Attempting to get the physical location of a directory or a symlink | ||
| /// returns an error. | ||
| pub(crate) fn location(&self, ino: Ino) -> Option<(&Location, &ByteRange)> { | ||
| pub(crate) fn location(&self, ino: Ino) -> Option<(Location, &ByteRange)> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
letting the Arc leak out of metadata feels weird. do we actually end up needing to do this for anything?
| .map(|c| c.as_os_str().to_string_lossy().to_string()) | ||
| .collect(); | ||
| self.metadata_mut().mkdir_all(Ino::Root, dirs)?; | ||
| self.meta |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reading this again, we should pull out the version check and the commit because the dir walk is all sync/blocking. oops! next PR.
| // line in the sand on what will be included in the commit. if someone opens a staged file | ||
| // and sees that its generation number is less than the current generation, and commit is | ||
| // inprogress, then we'll treat it similarly to opening a committed file (i.e. truncating | ||
| // it). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is good docco. wdyt about moving it (and the rest of a description of how commit works) to the Commit struct so there's one place to read it all?
| // and sees that its generation number is less than the current generation, and commit is | ||
| // inprogress, then we'll treat it similarly to opening a committed file (i.e. truncating | ||
| // it). | ||
| self.inner.generation.fetch_add(1, Ordering::SeqCst); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might try to hide inc_generation and current_generation behind methods on Volume.
| let mut metadata = self.inner.meta.write().expect("lock was poisoned"); | ||
| metadata.set_version(snapshot.version().clone()); | ||
|
|
||
| let snapshot_generation = self.inner.generation.load(Ordering::SeqCst); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the live generation, not the snapshot generation, right?
| continue; | ||
| } | ||
|
|
||
| // retain mtime and ctime changes (e.g. if the file was touched to update mtime or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we don't have strong rules about marking a file staged if we modify metadata do we. this seems quite weird, but I can't actually point to any problems.
I know we use them for apply_location_ranges to change the snapshot, but the same thing seems like it should be true here.
| // regular files that we committed up to S3. only update them if they weren't updated since | ||
| // we took the snapshot. if they were updated, then any concurrent updates to the metdata | ||
| // while we were committing will take precedence. | ||
| for snapshot_entry in snapshot.walk(Ino::Root)? { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have the set of all inodes we uploaded and need to modify. it feels weird to do a walk here.
Make Volume threadsafe by using sync-locks (a la parking lot) on metadata operations and fd (file descriptor) operations. We try to make these guards as short-lived as possible. As a result of this, a lot of the things we return from the thread-safe Volume are now cloned out before being returned. I didn't see a noticeable performance hit when trying to walk and read all directories within the berkeley_gnm_recon.
One exception to this rule (of holding the lock as little as possible) is during a Volume walk where we do want a consistent view. I added a WalkVolume iterator that holds a lock during the entire duration.
Another change I made was how readdir was implemented. Since we're copying everything over, I wanted to optimize it and make it copy as little over for each call. It does pagination with a last-seen-token used as the offset. I had to change things within the FUSE to support this as well, as a mkdir or rmdir operation weaved in-between readdirs may result in duplicate or skipped entries (though it is difficult to not skip some entries entirely).