# Following a chain of sectors to merge them in a new stream #196
Should we unify this issue with other substream-related issues, e.g. the PNG one (data of multiple IDAT entries should be concatenated before zlib decompression; see the sketch below)? I presume it would be better if we could come up with a more universal solution, good enough to support file formats we haven't met yet. It may also be worth thinking a little about serialization (but really just a little, as it is out of scope right now): if we come up with multiple ideas, we can compare them from this point of view too. @GreyCat could we collect all the sub-stream related file formats somewhere (this issue's description, a separate wiki page, etc.)? Currently this is the list (I'll try to collect them and modify this comment):

- Microsoft CFB
- FAT filesystem
- PNG (multiple IDAT chunks)
- registry file?
- TCP?
- Ogg
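To make the PNG case concrete, here is a minimal Python sketch of "concatenate all IDAT payloads before zlib decompression"; the chunk walker is simplified (no CRC or IHDR validation), and the function names are ours, not part of any Kaitai API:

```python
import struct
import zlib

def read_png_chunks(data: bytes):
    """Yield (chunk_type, payload) pairs from a PNG byte string."""
    pos = 8  # skip the 8-byte PNG signature
    while pos < len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        yield ctype, data[pos + 8:pos + 8 + length]
        pos += 8 + length + 4  # length + type header, payload, CRC

def decode_image_data(png: bytes) -> bytes:
    # Concatenate *all* IDAT payloads first; decompressing each chunk
    # separately fails, since one zlib stream spans chunk boundaries.
    idat = b"".join(p for t, p in read_png_chunks(png) if t == b"IDAT")
    return zlib.decompress(idat)
```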
Another complex example comes from the Ogg specification. Each Ogg page has a list of physical "segments" defined like that:

```yaml
- id: len_segments
  type: u1
  repeat: expr
  repeat-expr: num_segments
- id: segments
  repeat: expr
  repeat-expr: num_segments
  size: len_segments[_index]
```

Nowadays, with the advent of `_index`, this part is expressible directly in a .ksy spec.
A packet of 255 bytes is encoded as 2 segments: a 255-byte segment plus a 0-byte segment. To add insult to injury, packets can technically even be split between different Ogg pages (i.e. higher-level structures). This way one page might end up with a segment of 255 bytes and another one might start with a segment that continues it (plus a "continuation" flag).
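To make that lacing rule concrete, here is a minimal Python sketch that reassembles one page's packets from its segment table; cross-page continuation is deliberately left out, and the function name is illustrative, not part of any Kaitai runtime:

```python
def packets_from_page(segment_lengths, segments):
    """Group one Ogg page's raw segments back into logical packets."""
    packets, current = [], b""
    for length, seg in zip(segment_lengths, segments):
        current += seg
        # Any segment shorter than 255 bytes (including the 0-byte one
        # after an exactly-255-byte packet) terminates the packet.
        if length < 255:
            packets.append(current)
            current = b""
    # A non-empty `current` here would continue on the next page.
    return packets
```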
```yaml
seq:
  .......
chains:
  chain_typed:
    doc: lol
    type: frame # if `chain` is set, `type` is forbidden and inherited from that chain
  chain_stream:
    doc: united stream of bytes ready to be parsed
types:
  ........
  aaaa:
    seq:
      - id: frame # a chain of objects of type `frame`
        chain: _root.chain_typed
      - id: size
        type: u8
      - id: byte_chunk # a chain of raw bytes, may be used for parsing after merging via `_io`
        size: size
        chain: chain_stream
```
What do you think?
Would you care to elaborate a little on how that's supposed to work, so I won't be reinventing the whole thing from the very beginning, trying to guess what you meant here?
Before we implement anything concrete, I'd like to see how the suggested solution solves the issues mentioned above (Microsoft CFB, FAT filesystem, PNG, registry file?, TCP?, Ogg, the referenced issue). Finding a solution that fits them all is probably not easy.
For Ogg, do you mean something like this?

```yaml
.....
types:
  page:
    chains:
      data: {}
    seq:
      ....
      - id: segments
        repeat: expr
        repeat-expr: num_segments
        size: len_segments[_index]
        doc: Segment content bytes make up the rest of the Ogg page.
        chain: data
```

The proposed syntax should merge all of the page's segments into the `data` stream belonging to that page.
I'm not sure if this is directly related to this enhancement proposal, but I've encountered a related issue when attempting to build a struct to describe MPEG-TS protocol captures. MPEG-TS consists of lots of small packets (188 bytes each) which have program identifiers and counters in their headers. Ideally, it would be possible to use Kaitai not only to split a capture into 188-byte packets but also to merge the payloads of packets belonging to a given program identifier according to their specified order (i.e. demultiplex) and then parse that re-assembled payload with its own Kaitai structure. Right now, I have to pop out of Kaitai into Python to do this merging, generate an intermediary binary containing the demultiplexed payloads, and then pop back into Kaitai with a different .ksy to parse this intermediary format. It'd be great to do all of this inside Kaitai for cross-language portability and clarity.
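Roughly, that merging step might look like the following Python sketch, assuming a plain capture of back-to-back 188-byte packets; it ignores adaptation fields and continuity-counter checks, which a real demuxer must honor, and all names are illustrative:

```python
from collections import defaultdict

TS_PACKET_SIZE = 188

def demux_payloads(capture: bytes) -> dict:
    """Concatenate packet payloads per PID, in capture order."""
    streams = defaultdict(bytearray)
    for off in range(0, len(capture) - TS_PACKET_SIZE + 1, TS_PACKET_SIZE):
        pkt = capture[off:off + TS_PACKET_SIZE]
        if pkt[0] != 0x47:  # every TS packet starts with the sync byte
            continue
        pid = ((pkt[1] & 0x1F) << 8) | pkt[2]  # 13-bit program identifier
        streams[pid] += pkt[4:]  # payload after the 4-byte header
    return {pid: bytes(buf) for pid, buf in streams.items()}
```

Each re-assembled per-PID stream could then be fed to its own .ksy spec, which is exactly the step that currently requires an intermediary binary.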
@pavja2 Could you demonstrate the merging algorithm for our reference here? It indeed looks like another valid use case for this feature.
@pavja2 I was also about to implement an mpeg-ts parser do u have your structure file shared somewhere? |
@kalidasya This person here actually claims to have developed that, although I'm not sure if they'll be able to open-source it.
@kalidasya I do have a basic struct file. It's rough around the edges but works well enough for my needs. It'd be awesome if someone made it better! I'm traveling ATM but will post it as soon as I can and let you know when I do.
@kalidasya @pavja2 Guys, just a heads up: please consider creating a new issue in the formats repo for that format and moving the discussion there. Otherwise, it will be virtually impossible for others who might be interested to find these and join your cause :)
To give some context for this ticket from the MPEG-TS point of view:

@pavja2 what do you think?
Moved from #555
Update: each data_chunk has a field named "value" and one named "checksum". What I want to do is to get one string containing the bytes from all the fields named "value", without data from the "checksum" fields.
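Until something like this lands, that concatenation has to happen in app code. A minimal sketch, assuming the parsed object exposes a `data_chunks` list whose elements carry the `value` and `checksum` fields mentioned above (names taken from the comment, not from any real spec):

```python
def merged_values(parsed) -> bytes:
    # Join every chunk's `value`; the `checksum` fields are simply skipped.
    return b"".join(chunk.value for chunk in parsed.data_chunks)
```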
Hi hi! This would be handy for USB as well. For example, with USB mass storage on full-speed USB, packets are 64 bytes at a time, but reads and writes of data are done in 512-byte chunks over multiple low-level (IN) packets. The same applies to things like long descriptor strings.
Squashfs (kaitai-io/kaitai_struct_formats#596) would also benefit from this. Metadata is stored in blocks that need to be processed individually and then concatenated before the result can be parsed. I have it working using custom functions, but native support would be great.
At least both the FAT filesystem and Microsoft's CFB files follow the same pattern: to specify file contents, one provides the index of a starting sector s0. A parser must follow the chain of sectors, as specified in a FAT-like table, i.e. s1 = fat[s0], s2 = fat[s1], and so on, until it meets a certain terminator (like -1 or -2) in the FAT table. After that, if we want to do further parsing on the file contents, we should reassemble all these individual sectors into one new stream (and probably trim it to the size specified in a separate field somewhere in the directory entry).
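In app code, the chain-following just described boils down to something like this Python sketch; `fat`, `sectors`, and `file_full_size` are assumed attribute names on a parsed object, not taken from any actual spec:

```python
# Terminator values as described above (-1 / -2, plus their unsigned
# 32-bit readings, depending on how the FAT entries are declared).
TERMINATORS = {-1, -2, 0xFFFFFFFF, 0xFFFFFFFE}

def reassemble_file(parsed, s0: int) -> bytes:
    chunks = []
    sector = s0
    while sector not in TERMINATORS:
        chunks.append(parsed.sectors[sector])  # raw bytes of one sector
        sector = parsed.fat[sector]            # next link in the chain
    # Trim the merged stream to the real size from the directory entry.
    return b"".join(chunks)[:parsed.file_full_size]
```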
The following structure can model most of this behavior, but not all:
This effectively allows accessing file sectors one by one using:
However, there is no simple way to unite all these sectors and trim the result to `file_full_size` in order to continue parsing, except in the app code. Any ideas on what would be the best syntax to do it?