Combine MIT Engaging dandi/001 and dandi/002 storage directories into a single virtual file system/namespace? #8

Open · kabilar opened this issue Jan 30, 2025 · 10 comments

kabilar (Member) commented Jan 30, 2025

Hi @satra, based on the email discussions with the ORCD team from September 2024, each storage server has 1.1 PiB and DANDI requested 1.5 PB, so they had to split the DANDI space across two storage servers, as shown below:

  • 200T /orcd/data/linc/001
  • 620T /orcd/data/dandi/002
  • 460T /orcd/data/satra/002
  • 880T /orcd/data/dandi/001

Michel had proposed a couple of options to create a single virtual file system/namespace for the DANDI storage.

Should we pursue these options or just try to get s3invsync to work with multiple target directories (i.e. /orcd/data/dandi/001 and /orcd/data/dandi/002)? I am inclined toward the latter option, since we have more control over the timeline, but perhaps @jwodder and @yarikoptic have a preference here? Thanks.

@kabilar kabilar changed the title Combine MIT Engaging dandi/001 and dandi/002 storage durectories into a single virtual file system/namespace? Combine MIT Engaging dandi/001 and dandi/002 storage directories into a single virtual file system/namespace? Jan 30, 2025

yarikoptic (Member) commented:

A single virtual filesystem would be much better -- I do not think s3invsync should get into the business of "volume management".

satra (Member) commented Jan 31, 2025

@kabilar - check in with Michel about the virtual layer. I'm not sure that's an easy solution.

I agree that s3invsync shouldn't be in the business of volume management. However, it should be able to take a set of paths and spill over to the next location when it detects an out-of-space condition in the current one, continuing there. Users could specify a path to store the index, but the downloaded objects could be spread across filesystems.
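
For concreteness, a minimal sketch of the configuration this suggests -- a single index location plus an ordered list of data roots to spill across. The struct and field names are hypothetical and purely illustrative; s3invsync provides no such structure today.

```rust
use std::path::PathBuf;

/// Hypothetical configuration for the spill-over behavior described above.
/// (Illustrative only; not part of s3invsync.)
struct SpilloverConfig {
    /// Single, fixed location where the index/metadata is stored.
    index_root: PathBuf,
    /// Ordered data roots, e.g. /orcd/data/dandi/001 then /orcd/data/dandi/002;
    /// downloads move to the next root once the current one runs out of space.
    data_roots: Vec<PathBuf>,
}
```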

kabilar (Member, Author) commented Jan 31, 2025

> @kabilar - check in with Michel about the virtual layer. I'm not sure that's an easy solution.

Just sent an email.

yarikoptic (Member) commented:

> However, it should be able to take a set of paths and spill over to the next location when it detects an out-of-space condition in the current one, continuing there.

It isn't as simple as generating a single file and moving on to the next "part" when the first one is "full". There are no standalone "objects" -- s3invsync replicates the original hierarchy of keys in the bucket and was created with the idea of retaining that hierarchy and adjusting the state "in place".

I guess it would be possible to design logic for inspecting and dealing with multiple leading paths, but it would likely have negative impacts on performance, etc.

kabilar (Member, Author) commented Feb 4, 2025

I can certainly appreciate that this would be additional work and could affect performance. Given that it's going to cost 60k plus 6k recurring after the first year, perhaps it is most cost-effective to implement this feature in s3invsync.

@yarikoptic @jwodder Can we map out what it would take to implement this feature? And then we can make a decision.

jwodder (Member) commented Feb 4, 2025

@kabilar Exactly how do you want this feature to behave? The most obvious option would be to download files to the first filesystem until it's full, then move on to the next file system and so forth, all the while retaining intermediate directory components in paths (e.g., two adjacent keys foo/bar/baz.txt and foo/bar/quux.txt could end up in different filesystems at /orcd/data/linc/001/foo/bar/baz.txt and /orcd/data/dandi/002/foo/bar/quux.txt).

kabilar (Member, Author) commented Feb 5, 2025

> @kabilar Exactly how do you want this feature to behave? The most obvious option would be to download files to the first filesystem until it's full, then move on to the next file system and so forth, all the while retaining intermediate directory components in paths (e.g., two adjacent keys foo/bar/baz.txt and foo/bar/quux.txt could end up in different filesystems at /orcd/data/linc/001/foo/bar/baz.txt and /orcd/data/dandi/002/foo/bar/quux.txt).

Hi @jwodder, this plan makes sense to me.

How and how often would you check for the space available on the filesystem? With parallel jobs, it could get a bit tricky to ensure there is enough space remaining as the filesystem approaches capacity. Additionally, we can make it so that s3invsync is the only tool used to save data to /orcd/data/dandi/001/, but other DANDI users (e.g. Jeremy) may end up concurrently using /orcd/data/dandi/002/ for DANDI-related projects.

jwodder (Member) commented Feb 5, 2025

@kabilar

> How and how often would you check for the space available on the filesystem?

I was thinking of just checking whether any write failures had an ErrorKind of StorageFull.
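
For illustration, a minimal sketch of that check under a simple "try roots in order" strategy. The helper name and structure are assumptions, not s3invsync's actual implementation; ErrorKind::StorageFull corresponds to ENOSPC and is available on stable Rust since 1.83.

```rust
use std::fs;
use std::io::{self, ErrorKind};
use std::path::{Path, PathBuf};

/// Hypothetical helper: try to write `data` for a bucket key under each root
/// in order, spilling over to the next root when the current filesystem
/// reports it is full. (Illustrative only; not s3invsync's code.)
fn write_with_spillover(roots: &[PathBuf], rel_key: &Path, data: &[u8]) -> io::Result<PathBuf> {
    for root in roots {
        let target = root.join(rel_key);
        if let Some(parent) = target.parent() {
            fs::create_dir_all(parent)?;
        }
        match fs::write(&target, data) {
            Ok(()) => return Ok(target),
            // ErrorKind::StorageFull maps to ENOSPC on Linux.
            Err(e) if e.kind() == ErrorKind::StorageFull => {
                // Drop any partial file and try the next root.
                let _ = fs::remove_file(&target);
                continue;
            }
            Err(e) => return Err(e),
        }
    }
    Err(io::Error::new(
        ErrorKind::StorageFull,
        "all configured target filesystems are full",
    ))
}
```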

kabilar (Member, Author) commented Feb 5, 2025

> I was thinking of just checking whether any write failures had an ErrorKind of StorageFull.

Thanks. I think that would be fine for our use case.

yarikoptic (Member) commented:

Some thoughts:

  • our .s3invsync.versions.json files might need to be removed from prior file systems and retained only in the "latest" one to avoid conflicts/ambiguity
  • "move on to the next file system" should mean that even if prior file system(s) later have some space freed up, we would still operate on that "next" file system
  • I feel that we would need an "abstraction" layer for listing/manipulating files and folders, so it would abstract away having multiple leading folders and take care of:
    • listing across multiple locations (what if a path is somehow present in more than one?),
    • deleting from whichever location it is present in,
    • creating only in the latest (while potentially replacing/removing in prior ones)

Altogether it sounds feasible, but I fear that hidden obstacles would come up often, which could have been avoided by relying on proper volume management at the filesystem level -- could some LVM or another layer be put over those storage volumes?
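
To make the third bullet concrete, here is a minimal sketch of what such an abstraction layer could look like, assuming data roots are tried in order and new files are only created under the latest root. The names (MultiRootLayout, locate, delete, create_target) are hypothetical and not part of s3invsync.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical abstraction over several data roots
/// (e.g. /orcd/data/dandi/001 and /orcd/data/dandi/002).
struct MultiRootLayout {
    /// Ordered roots; new files are only created under the last ("latest") one.
    roots: Vec<PathBuf>,
}

impl MultiRootLayout {
    /// Find the full path at which a relative key currently exists, if any.
    fn locate(&self, rel: &Path) -> Option<PathBuf> {
        self.roots.iter().map(|r| r.join(rel)).find(|p| p.exists())
    }

    /// Delete a key from whichever root(s) contain it.
    fn delete(&self, rel: &Path) -> io::Result<()> {
        for root in &self.roots {
            let p = root.join(rel);
            if p.is_file() {
                fs::remove_file(&p)?;
            }
        }
        Ok(())
    }

    /// Path at which a new/updated copy of the key should be created:
    /// always under the latest root (prior copies would be removed first).
    fn create_target(&self, rel: &Path) -> PathBuf {
        self.roots
            .last()
            .expect("at least one root must be configured")
            .join(rel)
    }
}
```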
