Description
- can we reduce file size
- without affecting training
- and without requiring a ton of re-engineering of the dataset prep / datapipe class
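One low-effort option that fits all three constraints: swap `np.savez` for `np.savez_compressed`. It writes the same npz format, so loading code is untouched. A minimal sketch (filename and shapes are made up; real spectrograms will compress less well than this toy array of zeros):

```python
import tempfile
from pathlib import Path

import numpy as np

root = Path(tempfile.mkdtemp())
spect = np.zeros((128, 1000), dtype=np.float32)  # highly compressible toy spectrogram

# savez vs savez_compressed: same npz container, same loading code,
# so swapping it in would not require re-engineering the datapipe.
np.savez(root / "raw.npz", s=spect)
np.savez_compressed(root / "compressed.npz", s=spect)

raw_size = (root / "raw.npz").stat().st_size
small_size = (root / "compressed.npz").stat().st_size
assert small_size < raw_size

# Loading is identical either way.
x = np.load(root / "compressed.npz")["s"]
assert x.shape == (128, 1000)
```

The compression ratio on real spectrograms would need to be measured before deciding whether this alone is enough.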
Currently we use a "just a bunch of files" approach, which lets us use the same npz file (the spectrogram, the input to a model) with multiple npy files (the labels, the target of the model).
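For concreteness, a minimal sketch of that layout (filenames, suffixes, and shapes here are made up, not the real dataset conventions): one npz holds the spectrogram, and any number of npy label files pair with it by a shared stem.

```python
import tempfile
from pathlib import Path

import numpy as np

# Hypothetical "just a bunch of files" layout:
# one .npz spectrogram shared by multiple .npy label files.
root = Path(tempfile.mkdtemp())

spect = np.random.rand(128, 1000).astype(np.float32)  # freq bins x time bins
np.savez(root / "bird1_day1.spect.npz", s=spect)

# Two different annotation sets targeting the same spectrogram.
np.save(root / "bird1_day1.syllable-labels.npy", np.zeros(1000, dtype=np.int64))
np.save(root / "bird1_day1.onset-labels.npy", np.zeros(1000, dtype=np.int64))

# A dataset class only needs the shared stem to pair inputs with targets.
stem = "bird1_day1"
x = np.load(root / f"{stem}.spect.npz")["s"]
y = np.load(root / f"{stem}.syllable-labels.npy")
assert x.shape[1] == y.shape[0]  # one label per time bin
```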
A sort of worst case might be that we'd get a big benefit from jamming all the spectrograms into a single zarr archive, but then we'd have to re-engineer all the code that assumes the spectrograms exist as separate files: the prep step, the dataset class, etc. The reason to prefer separate files is mainly for tracking metadata and for readability, but maybe I am overvaluing this.
This doesn't need to be highest priority, but it could help make it easier to upload the dataset.
edit: if we were to cram all the spectrograms into a single zarr archive, then we might want to access it with a mem-mapping approach. The DAS docs suggest it's not easy to squeeze good performance out of this:
> While zarr, h5py, and xarray provide mechanisms for out-of-memory access, they tend to be slower in our experience or require fine tuning to reach the performance reached with memmapped npy files.
I did previously find examples of pytorch + zarr in other domains, but I similarly got the impression that it's not a simple, clear process to follow, and it's not easy to troubleshoot. Although the point about just mem-mapping npy files makes me wonder if I should try that.
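The memmapped-npy idea the DAS docs point at can be sketched with plain numpy (filename and shapes are made up): stack all the spectrogram windows into one npy file, then open it with `mmap_mode="r"` so a dataset class reads only the windows it indexes instead of loading the whole array into RAM.

```python
import tempfile
from pathlib import Path

import numpy as np

root = Path(tempfile.mkdtemp())
path = root / "all_spects.npy"

# Hypothetical: all spectrogram windows stacked into one big on-disk array.
all_spects = np.random.rand(500, 128, 64).astype(np.float32)
np.save(path, all_spects)

# mmap_mode="r" maps the file instead of reading it into RAM;
# fancy indexing then copies only the requested windows into memory.
spects = np.load(path, mmap_mode="r")
assert isinstance(spects, np.memmap)

batch = spects[[3, 17, 42]]  # a plain ndarray holding just these windows
assert batch.shape == (3, 128, 64)
```

A `__getitem__` in the dataset class could index the memmap the same way, which would keep the per-file metadata question separate from how the arrays are stored.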