Add zstd compression by robUx4 · Pull Request #866 · ietf-wg-cellar/matroska-specification

robUx4 · 2024-10-24T13:43:26Z

Based on #423 but for the v5 document.

Keeping this as a draft because dictionaries may not be usable in the wild (do they require state? That would make seeking impossible).

Also, as said in #423 (comment), there is a header to the Zstandard stream that we should strip. In that case we should explain what needs to be stripped.

robUx4 · 2024-10-24T13:43:52Z

poke @rcombs

rcombs · 2024-10-25T00:56:57Z

I would definitely want to spec out dictionary support; it should be entirely doable, though the implementation would be a little tricky on the encode side. Basically:

1. Take all the packets you're going to mux, in Matroska format (i.e. for subtitles, remove timestamp headers)
2. Pass them to `ZDICT_optimizeTrainFromBuffer_cover` or the like
3. Use the result as the dictionary, both in the Matroska header and when compressing packets
4. Compress each packet independently, passing in the now-static dictionary

And then at demux-time, the packets are decompressed independently (same as with zlib today), with the dictionary from the header passed in as context.

In principle, this technique could be applied to zlib as well, and would probably yield similar gains; there just isn't a good standard library for producing zlib dictionaries (whereas the zstd library includes routines for it).
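A minimal sketch of the steps above, using Python's standard `zlib` module as a stand-in for zstd because it also accepts a preset dictionary. The "dictionary" here is just the concatenated sample packets, not a trained one (zlib has no counterpart to the `ZDICT_*` routines), and the helper names and sample packets are made up for illustration.

```python
import zlib

def build_naive_dictionary(packets, max_size=32 * 1024):
    """Concatenate the packets and keep the trailing max_size bytes.

    Assumption: this is NOT a trained dictionary. It only shows where a
    dictionary plugs into the per-packet compression.
    """
    return b"".join(packets)[-max_size:]

def compress_packet(packet, dictionary):
    # A fresh compressor per packet: no state is shared between packets.
    comp = zlib.compressobj(level=9, zdict=dictionary)
    return comp.compress(packet) + comp.flush()

def decompress_packet(data, dictionary):
    # The dictionary (stored once in the header) is the only shared context.
    decomp = zlib.decompressobj(zdict=dictionary)
    return decomp.decompress(data) + decomp.flush()

# Hypothetical ASS packets in Matroska form (timestamps already removed).
packets = [
    b"0,0,Default,,0,0,0,,Hello there",
    b"1,0,Default,,0,0,0,,How are you today?",
    b"2,0,Default,,0,0,0,,Fine, thank you",
]

dictionary = build_naive_dictionary(packets)
compressed = [compress_packet(p, dictionary) for p in packets]

# Any packet can be decoded on its own, in any order, so seeking still works.
restored = decompress_packet(compressed[2], dictionary)
```

Since every packet gets a fresh compressor and decompressor, the only thing a demuxer has to keep around is the dictionary itself.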

robUx4 · 2024-10-25T06:36:17Z

We should probably prototype this in real world files to see how this pans out. It can be done in libavformat, mkvtoolnix or mkclean. (I probably won't have time in the coming days)

My main concern is the parts that could/should be stripped from the zstd bitstream and what parts need to be injected when decoding. All this should be properly documented.

rcombs · 2024-10-25T06:48:19Z

Definitely agreed. I think it'd be tricky to do in lavf (which isn't really set up to do a 2-pass mux), so maybe mkvtoolnix would make more sense? I'm not familiar with that codebase, though.

Alternately, I could build something in lavf that just generates the dictionary, and then separately add support for reading a pre-existing dictionary from disk? The latter is probably useful regardless; it'd be pretty straightforward to take a large sample of subtitle files and train a dictionary from them, which would likely outperform the default zstd/zlib ones (though not to the same extent as doing it per-mux).

JeromeMartinez · 2024-10-25T08:35:52Z

> Take all the packets you're going to mux,

There must be a fallback in the case you don't have any packet before you start to mux (real time streaming).

rcombs · 2024-10-25T09:02:10Z

For streaming, you could either not use a dictionary, or use a pre-generated one.

mbunkus · 2024-10-25T10:31:32Z

> I think it'd be tricky to do in lavf (which isn't really set up to do a 2-pass mux), so maybe mkvtoolnix would make more sense

In general MKVToolNix (rather: mkvmerge) isn't set up for 2-pass-muxes either. Lots of global state exists. What could be done for testing/evaluation purposes is what was done back in the day & what you described above: modifying mkvmerge to do some kind of manual 2-pass process, with the first pass producing a file on disk that contains statistics/the dictionary produced during the first pass, and the second pass reading said file. I would not want to have this type of manual invocation & external temporary file as a production mechanism, though.

What I thought about & what might be slightly more doable would be to buffer the first 5 MB of content that would be written out to the disk, build the dictionary over the accumulated data & then use the dictionary for both the buffered data & all subsequent frames. That would be quite a bit easier than implementing a full 2-pass system within a single invocation of mkvmerge, for sure.
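A rough sketch of that buffering idea, again with Python's `zlib` standing in for zstd. The class name, the tiny threshold and the "last 32 KiB of buffered data" dictionary are all assumptions for illustration; a real muxer would train a proper dictionary and write it to the track header before the first frame.

```python
import zlib

class BufferingCompressor:
    """Hold frames back until `threshold` bytes were seen, then build a
    dictionary from them and use it for buffered and later frames alike."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.pending = []
        self.pending_size = 0
        self.dictionary = None

    def _compress(self, frame):
        comp = zlib.compressobj(zdict=self.dictionary)
        return comp.compress(frame) + comp.flush()

    def _finalize_dictionary(self):
        # Naive stand-in for training: keep the last 32 KiB of buffered data.
        self.dictionary = b"".join(self.pending)[-32 * 1024:]
        out = [self._compress(f) for f in self.pending]
        self.pending = []
        return out

    def push(self, frame):
        """Return the compressed frames that are ready to be written."""
        if self.dictionary is not None:
            return [self._compress(frame)]
        self.pending.append(frame)
        self.pending_size += len(frame)
        if self.pending_size >= self.threshold:
            return self._finalize_dictionary()
        return []

    def finish(self):
        """Flush frames still buffered if the threshold was never reached."""
        if self.dictionary is None and self.pending:
            return self._finalize_dictionary()
        return []

muxer = BufferingCompressor(threshold=64)  # 5 MB in the proposal above
frames = [b"%d,0,Default,,0,0,0,,Line number %d" % (i, i) for i in range(6)]
written = []
for frame in frames:
    written.extend(muxer.push(frame))
written.extend(muxer.finish())
```

The output order matches the input order; the only cost is that nothing is written until the threshold is reached or the input ends.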

rcombs · 2024-10-25T17:11:30Z

Hmmm, the whole input subtitle file has to be loaded into memory in order to correctly implement eg ASS ReadOrder, doesn't it? Could a dictionary be trained off of that?

mbunkus · 2024-10-25T18:02:20Z

In case of mkvmerge that would only work if the subtitle file is a separate file, but not if it's a track multiplexed into any of the more complex containers (usually Matroska, MP4, MPEG-TS). In that case mkvmerge cannot simply read all data solely belonging to the subtitle track. Instead it would have to read the whole source file, making it a very costly operation.

robUx4 · 2024-10-26T08:06:44Z

I'll have a look. mkclean has a 2 pass --remux mode to handle as much header stripping as possible per track. It could be used to feed a dictionary as well. I can't promise when it will be done though. In the end it's equivalent to adding full support for zstd.

rcombs · 2024-10-26T08:16:45Z

I could imagine a world where lavf and mkvmerge add support for using a passed-in dictionary, and we just separately make a tool that takes one or more input files and creates an appropriate dictionary (that could be done as a muxer in lavf, for instance). Then the user could either run the dictionary-generation tool as a pre-mux step, or generate a generic dictionary from a corpus of representative input files (eg for streaming).

robUx4 · 2024-12-24T10:40:53Z

I implemented compressing/decompressing Zstd in mkclean. The juicy part is in this commit: Matroska-Org/foundation-source@2e5c64d in UnCompressFrameZstd() and CompressFrameZstd().

We should strip the magic number because it's useless in our case. It just needs a little bit of tweaking to use the result with the zstd library. The library has a `ZSTD_f_zstd1_magicless` mode, but it's off by default and doesn't work well.
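As an illustration of what stripping means at the byte level (not the mkclean code itself): RFC 8878 defines the frame magic number as 0xFD2FB528, stored little-endian as the first four bytes of a frame. The function names below are hypothetical, and the sketch assumes one regular Zstandard frame per Matroska frame (no skippable frames).

```python
# Zstandard frame magic number from RFC 8878, stored little-endian.
ZSTD_MAGIC = (0xFD2FB528).to_bytes(4, "little")  # b"\x28\xb5\x2f\xfd"

def strip_frame_magic(frame: bytes) -> bytes:
    """Muxer side: drop the 4-byte magic number before storing the frame."""
    if frame[:4] != ZSTD_MAGIC:
        raise ValueError("not a Zstandard frame")
    return frame[4:]

def restore_frame_magic(stored: bytes) -> bytes:
    """Demuxer side: put the magic number back so a stock decoder accepts it."""
    return ZSTD_MAGIC + stored
```

Re-adding the four bytes on read lets a decoder in its default mode be used unchanged, at the cost of one copy per frame.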

I haven't investigated the dictionary yet.

rcombs · 2025-10-27T15:35:27Z

Spec-wise, I think my only concern here is that we should be defining exactly what "a dictionary" means (like, what data should be in that header field, with what headers).


robUx4 · 2025-10-28T09:00:38Z

> Spec-wise, I think my only concern here is that we should be defining exactly what "a dictionary" means (like, what data should be in that header field, with what headers).

I added a reference to the dictionary section of RFC 8878 and removed the magic number which is a fixed value (just like the magic number in Zstandard frames).
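For reference, a small sketch of what that could look like in code. In RFC 8878 a dictionary starts with a 4-byte magic number (0xEC30A437, little-endian) followed by a 4-byte little-endian Dictionary_ID, then the entropy tables and content. The sketch assumes the stored field is everything after the magic number; the function names and the sample bytes are made up.

```python
# Zstandard dictionary magic number from RFC 8878, stored little-endian.
DICT_MAGIC = (0xEC30A437).to_bytes(4, "little")  # b"\x37\xa4\x30\xec"

def to_stored_dictionary(full: bytes) -> bytes:
    """Drop the fixed magic number before writing the dictionary."""
    if full[:4] != DICT_MAGIC:
        raise ValueError("not an RFC 8878 formatted dictionary")
    return full[4:]

def from_stored_dictionary(stored: bytes) -> bytes:
    """Rebuild the full dictionary expected by a stock zstd library."""
    return DICT_MAGIC + stored

def dictionary_id(stored: bytes) -> int:
    """Assumption: the stored data starts with the 4-byte Dictionary_ID."""
    return int.from_bytes(stored[:4], "little")

# Placeholder bytes, not a usable dictionary.
full = DICT_MAGIC + (1234).to_bytes(4, "little") + b"entropy tables and content"
stored = to_stored_dictionary(full)
```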

robUx4 · 2025-10-28T09:11:32Z

> I could imagine a world where lavf and mkvmerge add support for using a passed-in dictionary

Yes, IMO that's how it would happen in most cases. The RFC hints at that as well:

> The likely outcome will be a registry of well-tested dictionaries optimized for different use cases

You would have dictionaries suited for a particular type of content. Since the RFC was published, no "well-known" dictionaries seem to have emerged.

rcombs · 2025-10-28T09:21:57Z

Oh, I'd still expect users to generate per-file dictionaries in a fair number of cases (eg anime these days tends to be distributed using muxtools and similar build scripts, which could pretty easily have a stage that trains a dictionary on the input before muxing), but yeah, I'd imagine some people would use generic ones instead.

robUx4 · 2025-11-11T08:53:23Z

Support in mkclean is now functional via Matroska-Org/foundation-source#150. It doesn't support dictionary handling but writing/reading by stripping the magic number works.

robUx4 · 2025-12-02T09:32:33Z

For the record I added support in VLC (without dictionary support) and it works with files created by mkclean.
https://code.videolan.org/videolan/vlc/-/merge_requests/8113

robUx4 added format addition matroska-v5 labels Oct 24, 2024

robUx4 force-pushed the v5-zstd branch from 0e5f8dc to 2531f71 on December 23, 2024 15:47

robUx4 force-pushed the v5-zstd branch from 2531f71 to 0cbae52 on March 9, 2025 09:40

robUx4 force-pushed the v5-zstd branch from 0cbae52 to dba29d7 on October 27, 2025 14:35

Add zstd compression

d738422

Based on ietf-wg-cellar#423. Co-authored-by: rcombs <[email protected]>

robUx4 force-pushed the v5-zstd branch from dba29d7 to d738422 on October 28, 2025 08:58

robUx4 marked this pull request as ready for review October 28, 2025 09:11
