
I'm confused about EmptyOutboard #59


Open

dpc opened this issue Jan 13, 2025 · 6 comments

Comments

@dpc

dpc commented Jan 13, 2025

The documentation says an Outboard:

A binary merkle tree for blake3 hashes of a blob.

OK, so it's like a small database for the purpose of tracking things?

It confuses me because I thought "outboard" was the "extra data being written to verify chunks". So the relationship between "that data" and "the implementation of how that data is tracked" seems conflated(?) here.

But then what is EmptyOutboard?

An empty outboard, that just returns 0 hashes for all nodes.

How does it work then? How can anything get done when all hashes are 0? Does it mean extra btree information is not recorded at all during sending, and nothing gets verified during receiving?

Why does it even need a hash then? Isn't it going to discard everything anyway? And why can't it just calculate it from the data? Is it because the data is given whole, but sending can be done only partially?

I guess maybe what I'm asking for is some nicer utility API.

I want to send a bunch of data over iroh (whole); the other side expecting it has a hash, and it would be great if it stopped receiving any data as soon as it realizes that it's being fed lies. That's the whole point of using verified incremental hashing.

I don't want to send chunks, and I don't want to worry about outboards. I suspect this usage is going to be the most common, so it might deserve a helper API.

For reference, right now I have the following for sending and receiving data:


    pub async fn write_bao_content(
        send: &mut SendStream,
        bytes: &[u8],
        hash: ContentHash,
    ) -> RpcResult<()> {
        let bytes_len = u32::try_from(bytes.len())
            .ok()
            .context(MessageTooLargeSnafu {
                len: bytes.len(),
                limit: u32::MAX,
            })?;
        // Use a block size of 16 KiB, a good default for most cases.
        const BLOCK_SIZE: BlockSize = BlockSize::from_chunk_log(4);

        let mut ob = EmptyOutboard {
            tree: BaoTree::new(bytes_len.into(), BLOCK_SIZE),
            root: blake3::Hash::from_bytes(hash.into()),
        };

        // Encode the whole byte range of the blob
        let ranges = ByteRanges::from(0u64..bytes_len.into());
        let ranges = round_up_to_chunks(&ranges);
        encode_ranges_validated(
            &mut bytes.as_ref(),
            &mut ob,
            &ranges,
            TokioStreamWriter(send),
        )
        .await
        .context(EncodingBaoSnafu)?;

        Ok(())
    }

    pub async fn read_bao_content(
        read: &mut RecvStream,
        len: u32,
        hash: ContentHash,
    ) -> RpcResult<Vec<u8>> {
        const BLOCK_SIZE: BlockSize = BlockSize::from_chunk_log(4);
        let ranges = ByteRanges::from(0u64..len.into());
        let chunk_ranges = round_up_to_chunks(&ranges);
        let mut decoded = Vec::with_capacity(len.cast_into());
        let mut ob = EmptyOutboard {
            tree: bao_tree::BaoTree::new(len.into(), BLOCK_SIZE),
            root: blake3::Hash::from_bytes(hash.into()),
        };
        bao_tree::io::fsm::decode_ranges(
            TokioStreamReader(read),
            chunk_ranges,
            &mut decoded,
            &mut ob,
        )
        .await
        .context(DecodingBaoSnafu)?;

        Ok(decoded)
    }

and I came up with it using the current documentation. It would be good to know whether I shot myself in the foot somewhere or am doing the right thing. BTW, how is the verification information (is this determined by the outboard trait?) transmitted here?

In the original bao crate, the hashes are stored in a file in pre order.

Is this the same with transmitting data over network stream? Always first?
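For intuition, "pre-order" just means each parent node is written before its children. A self-contained toy sketch of that ordering over the leaf spans of a small tree (toy node labels and a simplified midpoint split, not bao-tree's actual node numbering):

```rust
// Toy binary tree over 4 leaves; a node is the (start, len) span of
// the leaves it covers. Pre-order: visit the parent, then the left
// subtree, then the right subtree.
fn preorder(start: u32, len: u32, out: &mut Vec<(u32, u32)>) {
    out.push((start, len)); // parent first
    if len > 1 {
        let half = len / 2; // simplified split; bao splits at powers of two
        preorder(start, half, out);
        preorder(start + half, len - half, out);
    }
}

fn main() {
    let mut order = Vec::new();
    preorder(0, 4, &mut order);
    // Root, then the entire left subtree, then the right subtree.
    assert_eq!(
        order,
        vec![(0, 4), (0, 2), (0, 1), (1, 1), (2, 2), (2, 1), (3, 1)]
    );
    println!("{:?}", order);
}
```

In a pre-order outboard file, the hash pairs for the parent nodes are stored in this parent-first order.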

@Arqu Arqu added this to iroh Jan 13, 2025
@matheus23
Member

I want to send bunch of data over iroh (whole), and the other side expecting it has a hash and would be great if it stopped receiving any data as soon as it realizes that it's being fed lies. That's the whole point of using verified incremental hashing.

I don't want to send chunks, and I don't want to worry about outboard.

Can you explain a little bit why you're reaching for this very low-level crate instead of using iroh-blobs?
iroh-blobs uses outboards as a cache, so you don't have to re-compute outboards every time you want to send data.
It's also specifically made to integrate with iroh. If you want it to not store outboards on disk, you can use the in-memory store. If you want to further customize the outboard storage, you can provide your own store implementation.
Also, iroh-blobs does incremental verification, so it stops receiving as soon as it realizes it's being fed lies.

If iroh-blobs does too much for your taste, and bao-tree provides too many features you don't need (such as streaming sub-ranges), have you tried looking into using the bao crate?

@matheus23
Member

But then what is EmptyOutboard?

Outboards are for caching the implicit merkle tree from blake3-hashing a blob.
When you get some bao stream of data, the merkle tree is sent along with it so you can incrementally verify.
In functions like decode_ranges you can provide EmptyOutboard, if you don't care about caching this implicit merkle tree.
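To make the "cache" framing concrete, here is a self-contained toy sketch (toy types, not bao-tree's actual `Outboard` trait, whose real methods and hash types differ): an outboard answers "do you already know the child hashes for this tree node?", and the empty variant simply never does.

```rust
use std::collections::HashMap;

// Toy stand-ins: the real crate uses blake3::Hash and a TreeNode type.
type Node = u64;
type Hash = [u8; 4];

trait ToyOutboard {
    // Return the cached (left, right) child hashes for a node, if known.
    fn load(&self, node: Node) -> Option<(Hash, Hash)>;
    // Record hashes seen while decoding (a cache write).
    fn save(&mut self, node: Node, pair: (Hash, Hash));
}

// "Black hole": caches nothing, so it never knows anything.
struct ToyEmptyOutboard;
impl ToyOutboard for ToyEmptyOutboard {
    fn load(&self, _node: Node) -> Option<(Hash, Hash)> { None }
    fn save(&mut self, _node: Node, _pair: (Hash, Hash)) {}
}

// In-memory cache: remembers what it has seen.
#[derive(Default)]
struct ToyMemOutboard {
    nodes: HashMap<Node, (Hash, Hash)>,
}
impl ToyOutboard for ToyMemOutboard {
    fn load(&self, node: Node) -> Option<(Hash, Hash)> {
        self.nodes.get(&node).copied()
    }
    fn save(&mut self, node: Node, pair: (Hash, Hash)) {
        self.nodes.insert(node, pair);
    }
}

fn main() {
    let pair = (*b"left", *b"rght");
    let mut empty = ToyEmptyOutboard;
    let mut mem = ToyMemOutboard::default();
    empty.save(1, pair);
    mem.save(1, pair);
    // The empty outboard discards everything; the mem outboard can
    // answer later queries about the same node.
    assert_eq!(empty.load(1), None);
    assert_eq!(mem.load(1), Some(pair));
    println!("ok");
}
```

Since the stream already carries the merkle tree needed for verification, a decoder can verify fine even when every `load` comes back `None`; the outboard only decides whether that tree is kept around afterwards.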

@dpc
Author

dpc commented Jan 14, 2025

Thanks for the response!

Can you explain a little bit why you're reaching for this very low-level crate instead of using iroh-blobs?

I was not considering using blobs, as it seems too high-level for me. I have a one-off need to embed some larger data in an RPC, and I'd like it to be incrementally verified.

I started with bao, but it can't do async. I found bao-tree, which looked like it could, so I gratefully went with that; I just had some confusion here and there that I'd like to report.

When you get some bao stream of data, the merkle tree is sent along with it so you can incrementally verify.
In functions like decode_ranges you can provide EmptyOutboard, if you don't care about caching this implicit merkle tree.

Half of my point is to give feedback about the API.

When I look at bao documentation it says:

The outboard encoding format is the same as the combined encoding format,

which sounds to me like "an outboard" is the verification data about the data being sent, and not some cache/database/provider. So the name and description of this trait confuse me, e.g. why do I need it at all when writing the data out? Isn't it the job of the encoding to calculate these? And how can "decoding" work with EmptyOutboard when it:

An empty outboard, that just returns 0 hashes for all nodes.

Now that I understand what it does, maybe it should say "not return any known parts of the outboard"?

I think I have some understanding why now, but the documentation did not make it clear.

Maybe trait Outboard should be named trait OutboardStore, trait OutboardCache, or trait OutboardProvider. The current description says:

This trait contains information about the geometry of the tree,

Maybe it could instead say something like: "Encoding and decoding will query the implementation of the trait for already known parts of the outboard to avoid recalculating it"?

@dpc
Author

dpc commented Feb 10, 2025

Seems like on the sending side the outboard needs to be initialized and can't be the empty one, otherwise the receiving side will reject the message (but only if it's larger than a block).

@rklaehn
Collaborator

rklaehn commented Feb 11, 2025

Hi. Sorry, have been traveling and am only now catching up.

So EmptyOutboard is just a black hole implementation that you can pass in to e.g. decode_ranges if you don't want to store an outboard at all.

E.g. you are a pure receiver and have no interest in storing the outboard because you don't want to share the data, or you are going to download the entire data anyway, so you can recompute the outboard at any time.

The reason why EmptyOutboard has to have the hash and the tree is that decode_ranges needs to know the hash to be able to verify that the incoming data is correct, and the size to know what chunks to expect from the stream.

As you have discovered yourself, if you use an EmptyOutboard on the send side, things will just fail. You need both the outboard and the data to encode a verified stream. There is one exception: for anything less than a chunk group in length there is no need for an outboard, so it will work.
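The arithmetic behind that exception, as a sketch (assuming BLAKE3's 1024-byte chunks and the `chunk_log = 4` used in the snippets above, so one chunk group is 16 KiB):

```rust
fn main() {
    // BLAKE3 operates on 1024-byte chunks; bao-tree groups 2^chunk_log
    // chunks into one block (BlockSize::from_chunk_log(4) above).
    const CHUNK_BYTES: u64 = 1024;
    let chunk_log = 4u32;
    let block_bytes = CHUNK_BYTES << chunk_log;
    assert_eq!(block_bytes, 16 * 1024); // one chunk group = 16 KiB

    // A blob that fits in a single chunk group has no inner tree nodes
    // at this block size, so there is nothing for an outboard to hold:
    // sending works even with EmptyOutboard. Anything larger needs a
    // real outboard on the send side.
    let blob_len = 10_000u64;
    let groups = blob_len.div_ceil(block_bytes);
    assert_eq!(groups, 1);
    println!("group size: {} bytes", block_bytes);
}
```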

@dpc
Author

dpc commented Mar 13, 2025

Ha. Quite relevant: https://www.iroh.computer/blog/blob-store-design-challenges
