Skip to content

Conversation

Kriskras99
Copy link
Contributor

@Kriskras99 Kriskras99 commented Jul 21, 2025

A start at #136

I've decided to do the RFC as a PR so it's easier to give inline commentary.

I strongly recommend reading the blog post mentioned in the references and watch the video about rc-zip (becomes relevant after ~10 minutes).

Feel free to edit/expand the RFC.

@martin-g
Copy link
Member

@Xuanwo @xxchan @Fokko Do you have experience with these matters in your libraries ?

Copy link
Member

@martin-g martin-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just watched the video about Sans IO and honestly I still have no idea how the implementation will look like ...
But looking at https://github.com/bearcove/rc-zip I see that we will need to add more crates for each runtime - tokio, async-std/smol, monoio, sync, etc. Tokio is pretty much the standard for async but the idea of having to add a crate for every runtime looks scary to me.

For now maybe_async looks most promising to me!

(This is not a final decision! We are discussing!)

@Kriskras99
Copy link
Contributor Author

The annoying thing about maybe_async is that it will break builds. If a crate is depending on two libraries who both depend on avro, but one enabes sync and the other async. The crate will not build.

For sans-io, we don't need to add a crate for every runtime. We can provide a sync and a tokio crate, and if someone wants to use a different runtime they can wrap the core crate themselves. The wrapper can be really simple, namely just a loop over the parser. The reason that rc-zip-* wrappers are more complex is because they provide a lot of extra trait implementations, and have implementations for the entries.

I plan on experimenting with writing a basic Sans I/O parser for Avro, just to see how it would look.

@martin-g
Copy link
Member

martin-g commented Jul 22, 2025

The annoying thing about maybe_async is that it will break builds. If a crate is depending on two libraries who both depend on avro, but one enabes sync and the other async. The crate will not build.

Didn't they solve this by introducing two features (is_async and is_sync) and the macro generates separate module/function for each ?

E.g.

#[maybe_async]
async fn foo() {...}

would generate fn foo_sync() when is_sync is enabled and async fn foo_async() when is_async is enabled.

@Kriskras99
Copy link
Contributor Author

Looking at the docs (specifically the last two codeblocks), it doesn't seem to rename the functions.

@martin-g
Copy link
Member

https://crates.io/crates/synca - one more contender. I haven't explored it too much.

@Kriskras99
Copy link
Contributor Author

That looks a lot better than maybe_async. On the state machine front, I also found this article about using async functions to generate state machines which are usable from sync code. It's code is a bit messy because they seem to want to use it for embedded and thus don't want to use the heap.

@Kriskras99
Copy link
Contributor Author

Kriskras99 commented Jul 25, 2025

I've pushed a (non-functional) prototype for parsing with state machines.

The most interesting thing about this prototype is the use of a "command" tape and a "output" tape.
I created the "command" tape as a replacement for a Schema in the state machine, as it's really annoying to keep track of where you are in the Schema (lots of if let Schema::).
The "output" tape replaces the Value while the state machine is still parsing the input. This has the same reason as not using the Schema. As without the schema, the field names are not known (or they could be known but would take a lot of space in the "command" tape).

The implementation is far from perfect and there is a lot that could be improved. For one, the ObjectStateMachine has a lot of duplication going on in the parse function. I'm pretty sure that duplication can be removed quite easily, making it a lot more readable. This implementation also cuts a lot of corners, like EOF handling or compression.

I do think it gives a good idea of what it would be like to have the core be state machines, wrapped in async and sync wrappers.

I currently only have a sync wrapper, but I'll start on an async wrapper just using the futures crate which should be compatible with almost all async runtimes (either through adapters or directly).

Note: I've not implemented converting the schema to a "command" tape. Implementing it is easy, implementing it so that the tape is compact is a bit more complicated. I've also not implemented decoding the "output" tape to a Value or T: Deserialize, this should actually be easy with almost no complexity. There are also more todos scattered around.

@Kriskras99
Copy link
Contributor Author

I've added the async implementation. ObjectContainerFileReader should really be a thin wrapper around an ObjectContainerFileStateMachine as a lot of logic is shared between the sync and async versions of next_object.
So the iterator/stream implementation are a more accurate view of how it would look to share the core.

@Kriskras99
Copy link
Contributor Author

Kriskras99 commented Jul 26, 2025

I did not check clippy before pushing, as it doesn't function anyway (too many todo!()).
It does pass cargo check.

@Kriskras99
Copy link
Contributor Author

I've done some small refactoring in the state machine POC. The logic of the Object Container file format is now moved to a state machine, making the sync and async readers really simple. I've also added proper EoF handling.

@Kriskras99
Copy link
Contributor Author

I also did an attempt at making a SyncA POC, but don't yet known what to do about the iterators

@Kriskras99
Copy link
Contributor Author

@martin-g if you like the statemachine approach, I can start work on making it a proper PR (including stuff like tests and handling edgecases). If you'd like to see a POC with SyncA first, that's possible too.

@martin-g
Copy link
Member

Thanks for all your work on this topic!
I start understanding how SansIO is supposed to work!

Yes, I would to try with SyncA too! Do you mind if I work on it in parallel to you?

At the end either your work or mine will be thrown away (or both) but I'm fine with that since I'll learn something new.

@Kriskras99
Copy link
Contributor Author

That sounds like a good plan, lets do it in different branches than this rfc one.

@Kriskras99 Kriskras99 force-pushed the rfc branch 2 times, most recently from 13b6c32 to 4d803ed Compare August 16, 2025 22:09
@Kriskras99
Copy link
Contributor Author

I've completed the reader part!

There are a few issues/notes:

  1. snap does not have a streaming decoder, it's easy to implement but getting the PR upstreamed will take some work as the maintainer isn't active on this project anymore
  2. I've disabled the reader::test_from_avro_datum_with_union_to_struct test. In my opinion the provided schema is not compatible with the bytes. The user should instead provide the right schema and use a reader schema for the schema evolution. If we keep this, we should inform users as at least one does depend on it
  3. I don't support having Schema::Null as the root schema in combination with the from_avro_datum* functions. They see an empty byteslice as an EoF. It's possible to add support for this, but I don't think anybody should depend on it.
  4. I've moved all the tests that were in decode.rs to state_machines/reading/mod.rs, so they're not gone!
  5. The current code organization is temporary and can be improved. I can imagine making decode a module and putting the state machines in there, making reader also a module with sync and async submodules. The same goes for the errors.
  6. I did not yet implement direct from tape deserialisation from Serde types. I plan on implementing this, as it's more efficient than going through Value first.
  7. There are a bunch of TODO scattered all over the place, also in places I did not modify. They are potential improvement points I ran into while implementing all of this.
  8. I've temporarily added the rustfmt.toml back, so I can sort imports more easily.

@Kriskras99
Copy link
Contributor Author

I think implementing the writer is will be really simple. encode.rs can be kept, as it just writes to the temporary buffer in the writer. So most of the logic is not file/stream related.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants