
Save spectrograms with np.savez_compressed #1

@NickleDave

Description

  • can we reduce file size
  • without affecting training
  • and without requiring a ton of re-engineering of dataset prep / the datapipe class

Currently we use a "just a bunch of files" approach, which lets us use the same npz file (the spectrogram, the input to a model) with multiple npy files (the labels, the targets of the model).
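To make the trade-off concrete, here is a minimal sketch of the current approach with np.savez_compressed swapped in. The array names (s, t, f) and shapes are hypothetical stand-ins for a real spectrogram; the key point is that np.load reads both compressed and uncompressed npz files identically, so compressing should require no changes to loading code:

```python
import os
import tempfile

import numpy as np

# Hypothetical spectrogram: frequency bins x time bins, float32.
# Real spectrograms have redundant structure, which is what DEFLATE exploits;
# here we fake that structure by tiling one row.
spect = np.tile(np.linspace(0.0, 1.0, 1000, dtype=np.float32), (257, 1))
t = np.linspace(0.0, 10.0, spect.shape[1], dtype=np.float32)
f = np.linspace(0.0, 11025.0, spect.shape[0], dtype=np.float32)

tmpdir = tempfile.mkdtemp()
uncompressed = os.path.join(tmpdir, "spect.npz")
compressed = os.path.join(tmpdir, "spect.compressed.npz")

# np.savez writes a plain (stored) zip; np.savez_compressed DEFLATEs each array
np.savez(uncompressed, s=spect, t=t, f=f)
np.savez_compressed(compressed, s=spect, t=t, f=f)

# loading is identical either way, so the datapipe class shouldn't need changes
loaded = np.load(compressed)
assert np.array_equal(loaded["s"], spect)

print(os.path.getsize(uncompressed), os.path.getsize(compressed))
```

How much smaller the compressed file is depends entirely on how compressible the spectrograms are; float32 arrays with little redundancy may shrink only marginally.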

Sort of a worst case might be where we get a big benefit from jamming all the spectrograms in a single zarr archive, but that means we have to re-engineer all the code that assumes the spectrograms exist as separate files: the prep step, the dataset class, etc. The reason to prefer the separate files is mainly for tracking metadata and for readability, but maybe I am overvaluing this.

This doesn't need to be highest priority but it could help make it easier to upload the dataset.

edit: if we were to cram all the spectrograms into a single zarr archive, then we might want to access it with a mem-mapping approach. The DAS docs suggest it's not easy to squeeze good performance out of this:

While zarr, h5py, and xarray provide mechanisms for out-of-memory access, they tend to be slower in our experience or require fine tuning to reach the performance reached with memmapped npy files.

I did find examples for pytorch + zarr previously in other domains, but similarly got the impression that it's not a simple, clear process to follow and it's not easy to troubleshoot. Although the point about just mem-mapping npy files makes me wonder if I should try that.
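A rough sketch of what the memmapped-npy route could look like, assuming we padded or windowed spectrograms to a common shape so they can live in one array (the shapes and the stand-in data below are hypothetical):

```python
import os
import tempfile

import numpy as np

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "all_spects.npy")

# Hypothetical: N equal-sized spectrogram windows stacked in one big .npy on disk.
# open_memmap writes the file incrementally, so we never hold the full stack in RAM.
n_windows, n_freq, n_time = 100, 257, 88
mm = np.lib.format.open_memmap(
    path, mode="w+", dtype=np.float32, shape=(n_windows, n_freq, n_time)
)
for i in range(n_windows):
    mm[i] = np.full((n_freq, n_time), i, dtype=np.float32)  # stand-in for a real spectrogram
mm.flush()
del mm

# A Dataset.__getitem__ would do something like this: open once with
# mmap_mode="r", then index; only the pages actually touched are read from disk
spects = np.load(path, mmap_mode="r")
window = np.asarray(spects[42])  # copies just this one window into RAM
print(window.shape, window[0, 0])
```

The obvious catch relative to the current approach is the one named above: a single array sacrifices per-file metadata and readability, and mem-mapping precludes compression, so it trades file size for access speed rather than solving both.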
