Would it be possible to use a video codec as chunk compressor? #1087
-
I have a video dataset of around 500k (half a million) videos that I want to do ML on, so I am looking for an efficient data format that lets me read the data quickly. Each video is 10 sec long and subsampled to 256x256@10Hz, i.e., when decoded a video can be viewed as a

The naive approach to storing this data would be to store it as individual videos. This is the format I have now, but it isn't ideal because each file is just about 1 MB on disk. This makes loading a pain since I can't keep that many file handles open. Instead, I am constantly opening and closing small files, and I am wondering if this is really the best way to go.

My other idea would be to store the dataset (or shards of it) as Zarr arrays where each chunk is compressed using a video codec. This way I can keep the amazing compression rate of video codecs while also getting some of Python's nice ndarray semantics.

I realize that this might be a crazy idea, but part of me thinks that it sounds like the kind of crazy that deserves a try. Would something like this be achievable with Zarr?
-
We have been investigating sharding recently. So yeah, sharding may be an option at some point.
-
Adding a custom compressor is pretty straightforward. You have to implement the numcodecs Codec API. You could either contribute the codec directly to numcodecs or implement it in a third-party package. In that case, you would want to register the codec.
The delta codec is a nice simple example:
https://github.com/zarr-developers/numcodecs/blob/main/numcodecs/delta.py
The main disadvantage of a custom codec is that your data become less portable; if others can't access the codec, they can't decode the data.