feat: Add traces for retrievals #291

base: feat/better-batch-memory-management
Conversation
pkg/client/dagservice/exchange.go (outdated)

```go
attribute.String("get-block.space", se.space.DID().String()),
attribute.String("get-block.cid", c.String()),
```
I'd drop the prefix on the attributes here and elsewhere; they will be associated with this span in the explorer. Plus, it's nice to be able to search for all traces/spans by space, cid, etc. rather than needing to search by specific prefixes.
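For illustration, a minimal sketch of the unprefixed version, reusing the identifiers from the excerpt above (the `get-block` span name is an assumption based on the original attribute keys):

```go
// Plain attribute keys; the span name already scopes them in the explorer.
ctx, span := tracer.Start(ctx, "get-block", trace.WithAttributes(
	attribute.String("space", se.space.DID().String()),
	attribute.String("cid", c.String()),
))
defer span.End()
```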
```go
ctx, span := tracer.Start(ctx, "get-blocks-batch", trace.WithAttributes(
	attribute.String("get-blocks.batch.shard.digest", digestutil.Format(cloc.location.Commitment.Nb().Content.Hash())),
	attribute.Int("get-blocks.batch.block-count", len(cloc.slices)),
	attribute.Int64("get-blocks.batch.offset", int64(cloc.location.Position.Offset)),
	attribute.Int64("get-blocks.batch.length", int64(cloc.location.Position.Length)),
))
defer span.End()
```
I'd drop this, as it's already covered by the tracing in the `Retrieve` method.
Wait: isn't this one of the most interesting ones? This tells us how batched our retrievals are.
I think the trace at the top of this function will already convey that, no?
No, that tells us about the groups that get `GetBlocks()`ed. This tells us about each batch we actually fetch. The ratio between the two tells us how effectively we're managing to batch.
My concern is that traces might be heavier than needed for this insight; would a metric work here instead? Something like a histogram for batches-per-request or blocks-per-batch would give us the same aggregate visibility at lower cost.
OTOH, I can see traces being useful for debugging individual slow retrievals, if that's the intent.
Open to either; your call.
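A minimal sketch of that metric alternative, using the OTel metric API; the meter and instrument names here are illustrative, not from the PR:

```go
package dagservice

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric"
)

var meter = otel.Meter("dagservice")

// Hypothetical instrument name.
var blocksPerBatch, _ = meter.Int64Histogram(
	"retrieve.blocks_per_batch",
	metric.WithDescription("blocks fetched per batch request"),
	metric.WithUnit("{block}"),
)

// recordBatch takes one histogram sample per batch; the aggregate view
// (percentiles, totals) comes from the histogram at no per-span storage cost.
func recordBatch(ctx context.Context, blockCount int) {
	blocksPerBatch.Record(ctx, int64(blockCount))
}
```

A batches-per-request histogram would be the same shape, recorded once per `GetBlocks` call.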
frrist left a comment:
From where I sit, it feels a bit premature to add all this tracing, but maybe I'm missing something about where the known bottlenecks are.
If I were to try and scope the tracing back, I'd only include spans for:
- `Retrieve`, as you have already done.
- `LocateMany`, since it's always called when locating.
The metrics I'd be most interested in seeing are:
- time to retrieve, and size of retrieval, per node
- time to locate, and its cache hit/miss ratio

These will tell us "how long does it take to find the thing" and "how long does it take to get the thing". A sketch of the locate-side instruments follows; the retrieval-side ones are sketched further down at the `Execute` call site.
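To make the locate side concrete, a hedged sketch of those two instruments using the OTel metric API; all names are illustrative, not from the PR:

```go
package locator

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

var meter = otel.Meter("locator")

// Hypothetical instrument names.
var (
	locateDuration, _ = meter.Float64Histogram("locate.duration",
		metric.WithUnit("s"),
		metric.WithDescription("time to locate a batch of digests"))
	locateCache, _ = meter.Int64Counter("locate.cache",
		metric.WithDescription("locate cache lookups, labeled hit or miss"))
)

// recordLocate captures both locate-side signals in one call: the duration
// histogram, and a counter whose hit/miss attribute gives the cache ratio.
func recordLocate(ctx context.Context, start time.Time, hit bool) {
	locateDuration.Record(ctx, time.Since(start).Seconds())
	locateCache.Add(ctx, 1, metric.WithAttributes(attribute.Bool("hit", hit)))
}
```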
pkg/client/locator/indexlocator.go
Outdated
| attribute.String("locate-many.space", spaceDID.String()), | ||
| attribute.Int("locate-many.digest-count", len(digests)), | ||
| )) | ||
| defer span.End() |
Add a named return to this method, and others, then do something like this to capture the status and error (if any) on the trace:

```go
defer func() {
	if err != nil {
		// the status was left as a TODO in the original comment; codes.Error
		// (go.opentelemetry.io/otel/codes) is the standard choice
		span.SetStatus(codes.Error, err.Error())
		span.RecordError(err)
	}
	span.End()
}()
```

or similar.
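A sketch of the suggested pattern, pulled out into a small helper so the deferred call reads the named return's final value; the function name and signature here are illustrative, not the PR's:

```go
package locator

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/trace"
)

var tracer = otel.Tracer("locator")

// endSpan records the final error (if any) on the span before ending it.
// Pass a pointer to the named return so the deferred call sees its final value.
func endSpan(span trace.Span, errp *error) {
	if err := *errp; err != nil {
		span.SetStatus(codes.Error, err.Error())
		span.RecordError(err)
	}
	span.End()
}

// locateMany is an illustrative stand-in for LocateMany, showing the
// named-return-plus-defer shape; the real method's signature differs.
func locateMany(ctx context.Context) (err error) {
	ctx, span := tracer.Start(ctx, "locate-many")
	defer endSpan(span, &err)

	// ... existing lookup logic using ctx ...
	_ = ctx
	return nil
}
```

Passing `&err` matters: the deferred call must observe the value `err` holds when the function returns, not the value it held when `defer` was evaluated.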
```go
xres, hres, err := rclient.Execute(ctx, inv, conn)
if err != nil {
	return nil, fmt.Errorf("executing `space/content/retrieve` invocation: %w", err)
}
```
Here is where I'd probably add a metric (histogram) that records the duration of the execution, since I think this is where the retrieval happens. I'd include the location (the node's URL) as an attribute on the metric; then we can plot retrieval times per node.
However, I suspect the retrieval isn't actually complete until hres.Body() is fully read and closed. Maybe we can wrap the Body returned here in something that properly records the time to read the entire body. It may also be nice to have a second histogram that records retrieval size in bytes per node.
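A sketch of that body wrapper under those assumptions; the instrument names and the `node` field (standing in for the location's URL) are invented here for illustration:

```go
package dagservice

import (
	"context"
	"io"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

var meter = otel.Meter("dagservice")

// Hypothetical instruments: duration and size of each retrieval, per node.
var (
	retrieveDuration, _ = meter.Float64Histogram("retrieve.duration",
		metric.WithUnit("s"))
	retrieveBytes, _ = meter.Int64Histogram("retrieve.size",
		metric.WithUnit("By"))
)

// timedBody wraps a response body and records duration and total bytes
// once the body is closed, so the sample covers the full read of the
// response, not just the Execute call.
type timedBody struct {
	io.ReadCloser
	ctx   context.Context // held only for metric recording at Close
	start time.Time
	node  string // the serving node's URL
	n     int64
}

func (b *timedBody) Read(p []byte) (int, error) {
	n, err := b.ReadCloser.Read(p)
	b.n += int64(n)
	return n, err
}

func (b *timedBody) Close() error {
	err := b.ReadCloser.Close()
	attrs := metric.WithAttributes(attribute.String("node", b.node))
	retrieveDuration.Record(b.ctx, time.Since(b.start).Seconds(), attrs)
	retrieveBytes.Record(b.ctx, b.n, attrs)
	return err
}
```

The caller would then wrap `hres.Body()` in a `&timedBody{...}` before handing it to whatever reads the response.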
I hadn't noticed that only the DID itself appears on stdout! That's much cleaner.
- Sort blocks by offset before fetching (see the sketch below). Boxo doesn't appear to request them in any useful order.
- Rather than read the entire response and make blocks of slices over that data, make a separate byte array for each block.
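The offset sort might be as simple as this sketch; the `slices` field and `Offset` accessor are assumptions based on the excerpts above:

```go
// Fetch blocks in on-disk order rather than Boxo's request order.
sort.Slice(cloc.slices, func(i, j int) bool {
	return cloc.slices[i].Offset < cloc.slices[j].Offset
})
```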
#### PR Dependency Tree

* **PR #289** 👈
* **PR #285**
* **PR #290**
* **PR #291**

This tree was auto-generated by [Charcoal](https://github.com/danerwilliams/charcoal)
TODO:
- [x] Initial fully working version
- [x] Slop factor for CAR layout
- [x] Manual test with `doupload`, to confirm it gets used in the current code path
- → [Optimize reading](#290)
- → [Otel metrics for monitoring batching success](#291)

Closes #280
Closes #288

#### PR Dependency Tree

* **PR #285** 👈
* **PR #290**
* **PR #291**

This tree was auto-generated by [Charcoal](https://github.com/danerwilliams/charcoal)
#### PR Dependency Tree

* **PR #290** 👈
* **PR #291**

This tree was auto-generated by [Charcoal](https://github.com/danerwilliams/charcoal)
#### PR Dependency Tree

* test/`doupload` works again #289

This tree was auto-generated by Charcoal