
feat: add dagcbor.EncodedLength(Node) to calculate byte length without encoding #346

Merged: 2 commits into master, Feb 4, 2022

Conversation

rvagg
Member

@rvagg rvagg commented Jan 31, 2022

For pre-allocation if it's going to be more efficient to walk the graph twice than have sloppy allocation during encode.

Seeing a need here: ipfs/go-graphsync#332 (comment) - mainly because graphsync keeps a tight rein on its allocations, so being able to predict sizes helps with that process.

For this reason, it's kind of important this is absolutely correct! Do we have code for generating garbage nodes that I can run through this to get some more certainty about correctness? Something like https://github.com/rvagg/js-ipld-garbage?

(Also feel free to object if you think this is a terrible addition, that's a valid response too.)
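To make the "length without encoding" idea concrete: CBOR item sizes are fully determined by RFC 8949's header rules, so a length can be computed by walking values without producing bytes. The sketch below is a stdlib-only illustration of those rules for unsigned integers and strings; it is not the ipld-prime implementation, and the helper names are made up for this example.

```go
package main

import "fmt"

// cborUintLen returns the encoded size in bytes of a CBOR unsigned
// integer (major type 0), per RFC 8949: the value is packed into the
// initial byte when < 24, otherwise into 1, 2, 4, or 8 extra bytes.
func cborUintLen(v uint64) int64 {
	switch {
	case v < 24:
		return 1
	case v <= 0xff:
		return 2
	case v <= 0xffff:
		return 3
	case v <= 0xffffffff:
		return 5
	default:
		return 9
	}
}

// cborStringLen returns the encoded size of a CBOR text string:
// a header carrying the length, plus the payload bytes themselves.
func cborStringLen(s string) int64 {
	return cborUintLen(uint64(len(s))) + int64(len(s))
}

func main() {
	fmt.Println(cborUintLen(10))        // 1
	fmt.Println(cborUintLen(500))       // 3
	fmt.Println(cborStringLen("hello")) // 6: 1-byte header + 5 payload
}
```

Maps, lists, and tags follow the same pattern (a header for the count, then the lengths of the children), which is why a recursive walk can total the size exactly, and also why any off-by-one in these rules would silently corrupt the result.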

@mvdan
Contributor

mvdan commented Jan 31, 2022

The graphsync code needs to encode the node anyway - would it not make more sense to encode it once, then reuse those bytes for both the length and the body itself? Even if you need the length first, I assume you could encode then and keep the value.

"one cached encode" should be less code on the side of ipld-prime (no need for this API) and it should also be faster versus "one length encode plus one normal encode", so I imagine it would be a better solution overall.

Not opposed to this API per se, just thinking out loud that it doesn't seem like the best option in graphsync's case :)
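The "one cached encode" pattern being suggested is simply: encode once into a buffer, then use the same bytes for both the length and the payload. A minimal stdlib sketch, using encoding/json as a stand-in codec (graphsync would use dag-cbor; `encodeOnce` is an illustrative name, not a real API):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// encodeOnce encodes a value a single time and returns the bytes;
// the caller reads the length with len() and later sends the same
// bytes, instead of doing a length-only walk plus a second full encode.
func encodeOnce(v interface{}) ([]byte, error) {
	var buf bytes.Buffer
	if err := json.NewEncoder(&buf).Encode(v); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	data, err := encodeOnce(map[string]int{"a": 1})
	if err != nil {
		panic(err)
	}
	fmt.Println(len(data)) // length comes from the cached bytes
	// ...later, send `data` over the wire without re-encoding.
}
```

The trade-off discussed in this thread is where that cache lives: the bytes must be held between the length check and the send, which is exactly the memory graphsync is trying to meter.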

@mvdan
Contributor

mvdan commented Jan 31, 2022

and it should also be faster

I haven't measured (not sure if anyone has either), but if allocations are a worry for graphsync, keep in mind that just walking a node like what EncodedLength does will allocate quite a bit - each iterator allocates, as does obtaining each node field/key/etc. It will all likely allocate less than the full encode, but it will still allocate more than you probably think :)

@rvagg
Member Author

rvagg commented Jan 31, 2022

encode the node anyway

Yes, but it's a node in the middle of the entire message. The pre-allocation bit is about creating memory backpressure per peer; I don't think it's about allocating the actual memory so much as signalling "this peer wants to allocate this many more bytes, block until it's allowed to". The size calculations don't seem to account for the entirety of the messages, just the larger chunks, like blocks, and also the extension data, which could get quite large; hence this calculation: "how large is the extension data? do we need to block until this peer has released some memory?"

@hannahhoward am I getting that close to right?

In terms of API - if, as you say, walking the graph is going to result in a bunch of allocations anyway then it does decrease the general usefulness of this call, but maybe that's more of a documentation concern? That we just need to be clear with users that this isn't a freebie.

@warpfork
Collaborator

I'd probably be fine merging this function, but +1 to @mvdan's comments that whether it is useful would be situational.

How much the act of walking will cause allocations will vary based on the node implementation (I think basicnode and things produced by gogen are minimal: just the iterators allocate; bindnode might have a lot more). But it's still going to be a truism that if you at any point need the encoded form in memory, it'll be cheaper to just do len() on the bytes.

I suppose this function could also advertise that it works on data too large for the serial form to be buffered into memory all at once as one slice. Or if you just wanted to be operating streamingly, for whatever other reason. (The size thing seems a little academic, since we're starting with the tree form that's already in memory; the serial copy is just going to be roughly that size, a second time; not much more. But maybe there would be other reasons not to be stuck needing to put the whole encoded form in one contiguous byte slice; I dunno.)

@hannahhoward
Collaborator

@rvagg is correct -- we are concerned with accounting and backpressure

When GraphSync responds to a request, as the selector traversal advances, things get loaded into memory, and they stay in memory until they're packed into protocol messages and those messages get sent over the wire. We basically cause the BlockReadOpener to block when the total "in memory" amount gets too big. We're really only concerned with the big numbers -- if we're off by a few bytes, the whole system will keep working. If we're off by a few megabytes, it won't.

We also need to make sure no one message gets too large to go over the wire, so we also use the "estimate" of message size to decide when to break up messages. The point we decide to break up the message is well below the actual limit of message size for a single message, so again, we really only care about the big numbers.

In terms of applying to the solution here:

  • yes, this is definitely a solution that could work
  • if we can find an elegant way to cache either in the graphsync code or in ipld-prime, that'd be fine too. This feels a bit tough at the moment.
  • the vast majority of extensions are going to be really small, and it's possible we can just assume the size is 0. This seems risky. Maybe we just need this solution in combination with something that allows an extension producer to say "don't worry about size".
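The "block the BlockReadOpener until memory is released" accounting described above can be sketched with stdlib primitives. This is an illustration of the mechanism, not graphsync's actual code; `memBudget`, `Reserve`, and `Release` are invented names for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// memBudget tracks estimated in-flight bytes for a peer and blocks
// producers once a limit is exceeded, waking them as messages go
// over the wire and their bytes are released.
type memBudget struct {
	mu    sync.Mutex
	cond  *sync.Cond
	used  int64
	limit int64
}

func newMemBudget(limit int64) *memBudget {
	b := &memBudget{limit: limit}
	b.cond = sync.NewCond(&b.mu)
	return b
}

// Reserve blocks until `size` more bytes fit under the limit.
// (A size estimate larger than the limit would block forever; a real
// implementation would need to handle that and cancellation.)
func (b *memBudget) Reserve(size int64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for b.used+size > b.limit {
		b.cond.Wait()
	}
	b.used += size
}

// Release returns bytes to the budget once a message has been sent.
func (b *memBudget) Release(size int64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.used -= size
	b.cond.Broadcast()
}

func main() {
	b := newMemBudget(1 << 20) // illustrative 1 MiB per-peer budget
	b.Reserve(512 << 10)       // traversal blocks here if over budget
	fmt.Println(b.used)
	b.Release(512 << 10)
	fmt.Println(b.used)
}
```

Note that the size fed into `Reserve` is exactly where an estimate like `EncodedLength` would slot in: as the thread says, it only has to be right to within the big numbers for the backpressure to work.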

@warpfork
Collaborator

@warpfork left a review comment

No complaints. LGTM.

@rvagg rvagg force-pushed the rvagg/dagcbor-nomarshal-length branch 2 times, most recently from 26272de to cd72db6 on February 2, 2022 04:57
@rvagg
Member Author

rvagg commented Feb 2, 2022

I've pulled in #347 so I can add some proper tests for this, it helped me find a couple of minor problems with ints, so in this current state, a review and merge of that PR would need to come before this one.

I've also adjusted the comment on this API to make it clear that use of this function may not result in performance gains, so use it carefully.

rvagg added 2 commits February 4, 2022 15:57
…oding

For pre-allocation if it's going to be more efficient to walk the graph twice
than have sloppy allocation during encode.
@rvagg rvagg force-pushed the rvagg/dagcbor-nomarshal-length branch from e6b02d4 to feb8a2a on February 4, 2022 04:59
@rvagg
Member Author

rvagg commented Feb 4, 2022

Rebased to master and updated to the latest garbage.Generate() API. I'm going to take the lack of additional comments as a lack of disapproval.

@rvagg rvagg merged commit 679d743 into master Feb 4, 2022
@rvagg rvagg deleted the rvagg/dagcbor-nomarshal-length branch February 4, 2022 05:01