feat: Add traces for retrievals #291

base: feat/better-batch-memory-management
Conversation
pkg/client/dagservice/exchange.go (outdated)

```go
attribute.String("get-block.space", se.space.DID().String()),
attribute.String("get-block.cid", c.String()),
```
I'd drop the prefix on the attributes here and elsewhere; they will be associated with this span in the explorer. Plus, it's nice to be able to search for all traces/spans by space, cid, etc. rather than needing to search by specific prefixes.
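For illustration, a minimal sketch of the unprefixed version, reusing the identifiers from the excerpt above (the `get-block` span name is an assumption based on the original attribute keys):

```go
// Plain attribute keys; the span name already scopes them in the explorer.
ctx, span := tracer.Start(ctx, "get-block", trace.WithAttributes(
	attribute.String("space", se.space.DID().String()),
	attribute.String("cid", c.String()),
))
defer span.End()
```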
```go
ctx, span := tracer.Start(ctx, "get-blocks-batch", trace.WithAttributes(
	attribute.String("get-blocks.batch.shard.digest", digestutil.Format(cloc.location.Commitment.Nb().Content.Hash())),
	attribute.Int("get-blocks.batch.block-count", len(cloc.slices)),
	attribute.Int64("get-blocks.batch.offset", int64(cloc.location.Position.Offset)),
	attribute.Int64("get-blocks.batch.length", int64(cloc.location.Position.Length)),
))
defer span.End()
```
I'd drop this, as it's already covered by the tracing in the `Retrieve` method.
Wait: isn't this one of the most interesting ones? This tells us how batched our retrievals are.
I think the trace at the top of this function will already convey that, no?
No, that tells us about the groups that get `GetBlocks()`ed. This tells us about each batch we actually fetch. The ratio between the two tells us how effectively we're managing to batch.
My concern is that traces might be heavier than needed for this insight; would a metric work here instead? Something like a histogram for batches-per-request or blocks-per-batch would give us the same aggregate visibility at lower cost.
OTOH, I can see traces being useful for debugging individual slow retrievals, if that's the intent.
Open to either; your call.
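A minimal sketch of that metric alternative, using the OTel metric API; the meter and instrument names here are illustrative, not from the PR:

```go
package dagservice

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric"
)

var meter = otel.Meter("dagservice")

// Hypothetical instrument name.
var blocksPerBatch, _ = meter.Int64Histogram(
	"retrieve.blocks_per_batch",
	metric.WithDescription("blocks fetched per batch request"),
	metric.WithUnit("{block}"),
)

// recordBatch takes one histogram sample per batch; the aggregate view
// (percentiles, totals) comes from the histogram at no per-span storage cost.
func recordBatch(ctx context.Context, blockCount int) {
	blocksPerBatch.Record(ctx, int64(blockCount))
}
```

A batches-per-request histogram would be the same shape, recorded once per `GetBlocks` call.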
frrist left a comment:
From where I sit, it feels a bit premature to add all this tracing, but maybe I'm missing something about where the known bottlenecks are.
If I were to try and scope the tracing back, I'd only include spans for:
- `Retrieve`, as you have already done.
- `LocateMany`, since it's always called when locating.
The metrics I'd be most interested in seeing are:
- time to retrieve, and size of retrieval, per node
- time to locate, and its cache hit/miss ratio

These will tell us "how long does it take to find the thing" and "how long does it take to get the thing". A sketch of the locate-side instruments follows; the retrieval-side ones are sketched further down at the `Execute` call site.
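To make the locate side concrete, a hedged sketch of those two instruments using the OTel metric API; all names are illustrative, not from the PR:

```go
package locator

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

var meter = otel.Meter("locator")

// Hypothetical instrument names.
var (
	locateDuration, _ = meter.Float64Histogram("locate.duration",
		metric.WithUnit("s"),
		metric.WithDescription("time to locate a batch of digests"))
	locateCache, _ = meter.Int64Counter("locate.cache",
		metric.WithDescription("locate cache lookups, labeled hit or miss"))
)

// recordLocate captures both locate-side signals in one call: the duration
// histogram, and a counter whose hit/miss attribute gives the cache ratio.
func recordLocate(ctx context.Context, start time.Time, hit bool) {
	locateDuration.Record(ctx, time.Since(start).Seconds())
	locateCache.Add(ctx, 1, metric.WithAttributes(attribute.Bool("hit", hit)))
}
```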
pkg/client/locator/indexlocator.go
Outdated
| attribute.String("locate-many.space", spaceDID.String()), | ||
| attribute.Int("locate-many.digest-count", len(digests)), | ||
| )) | ||
| defer span.End() |
Add a named return to this method, and others, then do something like this to capture the status and error (if any) on the trace:

```go
defer func() {
	if err != nil {
		// the status was left as a TODO in the original comment; codes.Error
		// (go.opentelemetry.io/otel/codes) is the standard choice
		span.SetStatus(codes.Error, err.Error())
		span.RecordError(err)
	}
	span.End()
}()
```

or similar.
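A sketch of the suggested pattern, pulled out into a small helper so the deferred call reads the named return's final value; the function name and signature here are illustrative, not the PR's:

```go
package locator

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/trace"
)

var tracer = otel.Tracer("locator")

// endSpan records the final error (if any) on the span before ending it.
// Pass a pointer to the named return so the deferred call sees its final value.
func endSpan(span trace.Span, errp *error) {
	if err := *errp; err != nil {
		span.SetStatus(codes.Error, err.Error())
		span.RecordError(err)
	}
	span.End()
}

// locateMany is an illustrative stand-in for LocateMany, showing the
// named-return-plus-defer shape; the real method's signature differs.
func locateMany(ctx context.Context) (err error) {
	ctx, span := tracer.Start(ctx, "locate-many")
	defer endSpan(span, &err)

	// ... existing lookup logic using ctx ...
	_ = ctx
	return nil
}
```

Passing `&err` matters: the deferred call must observe the value `err` holds when the function returns, not the value it held when `defer` was evaluated.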
```go
xres, hres, err := rclient.Execute(ctx, inv, conn)
if err != nil {
	return nil, fmt.Errorf("executing `space/content/retrieve` invocation: %w", err)
}
```
Here is where I'd probably add a metric (histogram) that records the duration of the execution, since I think this is where the retrieval happens. I'd include the location (the node's URL) as an attribute on the metric; then we can plot retrieval times per node.
However, I suspect the retrieval isn't actually complete until hres.Body() is fully read and closed. Maybe we can wrap the Body returned here in something that properly records the time to read the entire body. It may also be nice to have a second histogram that records retrieval size in bytes per node.
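A sketch of that body wrapper under those assumptions; the instrument names and the `node` field (standing in for the location's URL) are invented here for illustration:

```go
package dagservice

import (
	"context"
	"io"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

var meter = otel.Meter("dagservice")

// Hypothetical instruments: duration and size of each retrieval, per node.
var (
	retrieveDuration, _ = meter.Float64Histogram("retrieve.duration",
		metric.WithUnit("s"))
	retrieveBytes, _ = meter.Int64Histogram("retrieve.size",
		metric.WithUnit("By"))
)

// timedBody wraps a response body and records duration and total bytes
// once the body is closed, so the sample covers the full read of the
// response, not just the Execute call.
type timedBody struct {
	io.ReadCloser
	ctx   context.Context // held only for metric recording at Close
	start time.Time
	node  string // the serving node's URL
	n     int64
}

func (b *timedBody) Read(p []byte) (int, error) {
	n, err := b.ReadCloser.Read(p)
	b.n += int64(n)
	return n, err
}

func (b *timedBody) Close() error {
	err := b.ReadCloser.Close()
	attrs := metric.WithAttributes(attribute.String("node", b.node))
	retrieveDuration.Record(b.ctx, time.Since(b.start).Seconds(), attrs)
	retrieveBytes.Record(b.ctx, b.n, attrs)
	return err
}
```

The caller would then wrap `hres.Body()` in a `&timedBody{...}` before handing it to whatever reads the response.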
I hadn't noticed that only the DID itself appears on stdout! That's much cleaner.
- Sort blocks by offset before fetching (see the sketch below). Boxo doesn't appear to request them in any useful order.
- Rather than read the entire response and make blocks of slices over that data, make a separate byte array for each block.
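The offset sort might be as simple as this sketch; the `slices` field and `Offset` accessor are assumptions based on the excerpts above:

```go
// Fetch blocks in on-disk order rather than Boxo's request order.
sort.Slice(cloc.slices, func(i, j int) bool {
	return cloc.slices[i].Offset < cloc.slices[j].Offset
})
```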
#### PR Dependency Tree

* **PR #289** 👈
* **PR #285**
* **PR #290**
* **PR #291**

This tree was auto-generated by [Charcoal](https://github.com/danerwilliams/charcoal)
TODO:
- [x] Initial fully working version
- [x] Slop factor for CAR layout
- [x] Manual test with `doupload`, to confirm it gets used in the current code path
- → [Optimize reading](#290)
- → [Otel metrics for monitoring batching success](#291)

Closes #280
Closes #288

#### PR Dependency Tree

* **PR #285** 👈
* **PR #290**
* **PR #291**

This tree was auto-generated by [Charcoal](https://github.com/danerwilliams/charcoal)
#### PR Dependency Tree

* **PR #290** 👈
* **PR #291**

This tree was auto-generated by [Charcoal](https://github.com/danerwilliams/charcoal)
#### PR Dependency Tree

* test/`doupload` works again #289

This tree was auto-generated by Charcoal