feat: add dagcbor.EncodedLength(Node) to calculate byte length without encoding #346
Conversation
The graphsync code needs to encode the node anyway - would it not make more sense to encode it once, then reuse those bytes for both the length and the body itself? Even if you need the length first, I assume you could encode at that point and keep the result. "One cached encode" should be less code on the side of ipld-prime (no need for this API), and it should also be faster than "one length encode plus one normal encode", so I imagine it would be a better solution overall. Not opposed to this API per se, just thinking out loud that it doesn't seem like the best option in graphsync's case :)
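For illustration, a minimal sketch of that "one cached encode" approach in Go (`encodeOnce` is a hypothetical helper, not an existing graphsync or ipld-prime API):

```go
import (
	"bytes"

	"github.com/ipld/go-ipld-prime/codec/dagcbor"
	"github.com/ipld/go-ipld-prime/datamodel"
)

// encodeOnce performs the single encode; len of the returned slice gives
// the length, and the same slice is reused later as the message body, so
// the node is only walked once.
func encodeOnce(n datamodel.Node) ([]byte, error) {
	var buf bytes.Buffer
	if err := dagcbor.Encode(n, &buf); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}
```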
I haven't measured (and I'm not sure anyone else has either), but if allocations are a worry for graphsync, keep in mind that just walking a node, which is what EncodedLength does, will allocate quite a bit - each iterator allocates, as does obtaining each node field/key/etc. It will all likely allocate less than the full encode, but it will still allocate more than you probably think :)
Yes, but it's a node in the middle of the entire message; the pre-allocation bit is about creating memory backpressure per-peer. I don't think it's about allocating the actual memory; it's more a signal that "this peer wants to allocate this many more bytes, block until it's allowed to". The size calculations don't seem to account for the entirety of the messages, just the larger chunks, like blocks, and also the extension data, which could get quite large - hence this calculation: "how large is the extension data? do we need to block until this peer has released some memory?" @hannahhoward am I getting that close to right?

In terms of API - if, as you say, walking the graph is going to result in a bunch of allocations anyway, then that does decrease the general usefulness of this call, but maybe that's more of a documentation concern? We just need to be clear with users that this isn't a freebie.
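For a sense of what that signal looks like, here is a rough sketch of a per-peer byte budget where a reservation blocks until earlier bytes are released. All names here are hypothetical; this is not graphsync's actual allocator:

```go
import "sync"

// byteBudget is a per-peer byte budget: Reserve blocks until the requested
// bytes fit under the limit, and Release frees them once the corresponding
// message has gone over the wire.
type byteBudget struct {
	mu    sync.Mutex
	cond  *sync.Cond
	used  int64
	limit int64
}

func newByteBudget(limit int64) *byteBudget {
	b := &byteBudget{limit: limit}
	b.cond = sync.NewCond(&b.mu)
	return b
}

// Reserve blocks the caller ("this peer wants to allocate this many more
// bytes") until the budget has room.
func (b *byteBudget) Reserve(n int64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for b.used+n > b.limit {
		b.cond.Wait()
	}
	b.used += n
}

// Release returns n bytes to the budget and wakes blocked reservations.
func (b *byteBudget) Release(n int64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.used -= n
	b.cond.Broadcast()
}
```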
I'd probably be fine merging this function, but +1 to @mvdan's comments that whether it is useful would be situational. How much the act of walking will cause allocations will vary based on the node implementation (I think basicnode and things produced by gogen are minimal, where just the iterators allocate; bindnode might have a lot more). But it's still going to be a truism that if you at any point need the encoded form in memory, it'll be cheaper to just do that encode once and reuse the bytes.

I suppose this function could also advertise that it works on data too large for the serial form to be buffered into memory all at once as one slice. Or if you just wanted to be operating streamingly, for whatever other reason; see the sketch below. (The size thing seems a little academic, since we're starting with the tree form that's already in memory; the serial copy is just going to be roughly that size, a second time; not much more. But maybe there would be other reasons not to be stuck needing to put the whole encoded form in one contiguous byte slice; I dunno.)
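A sketch of that streaming case, assuming the `EncodedLength` signature proposed in this PR (returning `(int64, error)`); the length-prefix framing and the `writeFramed` helper are hypothetical:

```go
import (
	"encoding/binary"
	"io"

	"github.com/ipld/go-ipld-prime/codec/dagcbor"
	"github.com/ipld/go-ipld-prime/datamodel"
)

// writeFramed sends a length-prefixed node without ever buffering the whole
// encoded form in memory: one walk for the length, one walk to stream the
// bytes straight to the writer.
func writeFramed(w io.Writer, n datamodel.Node) error {
	size, err := dagcbor.EncodedLength(n) // first walk: length only
	if err != nil {
		return err
	}
	var head [8]byte
	binary.BigEndian.PutUint64(head[:], uint64(size))
	if _, err := w.Write(head[:]); err != nil {
		return err
	}
	return dagcbor.Encode(n, w) // second walk: stream, no full-slice buffer
}
```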
@rvagg is correct -- we are concerned with accounting and backpressure.

When GraphSync responds to a request, as the selector traversal advances, things get loaded into memory, and they stay in memory until they're packed into protocol messages and those messages get sent over the wire. We basically cause the BlockReadOpener to block when the total "in memory" amount gets too big. We're really only concerned with the big numbers -- if we're off by a few bytes, the whole system will keep working; if we're off by a few megabytes, it won't.

We also need to make sure no single message gets too large to go over the wire, so we also use the "estimate" of message size to decide when to break up messages. The point where we decide to break up a message is well below the actual size limit for a single message, so again, we really only care about the big numbers.

In terms of applying this to the solution here:
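As an aside, a loose sketch of the estimate-based message splitting described a paragraph up (all names and the soft limit are hypothetical; this is not graphsync's actual code):

```go
// softLimit is deliberately well below the hard wire limit, since the
// estimate only needs to be right about the big numbers.
const softLimit = 512 << 10

type messageBuilder struct {
	estimated int64 // running estimate of the in-flight message's size
}

// add queues a chunk of a known (estimated) size, cutting a new message
// first if this chunk would push the estimate past the soft limit.
func (m *messageBuilder) add(chunkSize int64, flush func()) {
	if m.estimated+chunkSize > softLimit {
		flush() // send the current message and start a new one
		m.estimated = 0
	}
	m.estimated += chunkSize
}
```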
No complaints. LGTM.
I've pulled in #347 so I can add some proper tests for this; it helped me find a couple of minor problems with ints, so in its current state, a review and merge of that PR would need to come before this one. I've also adjusted the comment on this API to make it clear that use of this function may not result in performance gains, so use it carefully.
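For context on why ints are easy to get subtly wrong: the size of a CBOR head varies with the value it carries. A sketch of the general head-length rule from RFC 8949 (this is the spec's rule, not a copy of the code in this PR):

```go
// cborHeadLength returns the encoded size of a CBOR head (major type plus
// argument) for a given unsigned argument value, per RFC 8949 §3.
func cborHeadLength(v uint64) int {
	switch {
	case v < 24:
		return 1 // argument packed into the initial byte
	case v <= 0xff:
		return 2 // initial byte + 1-byte argument
	case v <= 0xffff:
		return 3 // initial byte + 2-byte argument
	case v <= 0xffffffff:
		return 5 // initial byte + 4-byte argument
	default:
		return 9 // initial byte + 8-byte argument
	}
}
```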
Rebased onto master and updated to the latest.
For pre-allocation, if it's going to be more efficient to walk the graph twice than to have sloppy allocation during encode.
Seeing a need here: ipfs/go-graphsync#332 (comment) - mainly because graphsync keeps a tight rein on its allocations, so being able to predict sizes helps with that process.
For this reason, it's kind of important this is absolutely correct! Do we have code for generating garbage nodes that I can run through this to get some more certainty about correctness? Something like https://github.com/rvagg/js-ipld-garbage?
(Also feel free to object if you think this is a terrible addition, that's a valid response too.)
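To make the ask concrete, here's a sketch of the kind of cross-check a garbage-node generator would enable (`randomNode` is a hypothetical generator, and the `EncodedLength` signature is assumed to be `(int64, error)` as proposed here):

```go
import (
	"bytes"
	"testing"

	"github.com/ipld/go-ipld-prime/codec/dagcbor"
)

// TestEncodedLengthMatchesEncode asserts that, for any node the generator
// can produce, EncodedLength agrees with the length of an actual encode.
func TestEncodedLengthMatchesEncode(t *testing.T) {
	for i := 0; i < 1000; i++ {
		n := randomNode() // hypothetical garbage-node generator
		var buf bytes.Buffer
		if err := dagcbor.Encode(n, &buf); err != nil {
			t.Fatal(err)
		}
		length, err := dagcbor.EncodedLength(n)
		if err != nil {
			t.Fatal(err)
		}
		if length != int64(buf.Len()) {
			t.Fatalf("EncodedLength = %d, actual encoded length = %d", length, buf.Len())
		}
	}
}
```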