
Conversation


@Peeja Peeja commented Jan 15, 2026

Comment on lines 72 to 73
attribute.String("get-block.space", se.space.DID().String()),
attribute.String("get-block.cid", c.String()),
@frrist frrist Jan 15, 2026


I'd drop the prefix on the attributes here and elsewhere; they will be associated with this span in the explorer. Plus, it's nice to be able to search for all traces/spans by space, cid, etc., rather than needing to search by specific prefixes.
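In code, the suggestion amounts to something like this (a sketch only; same values as above, prefixes dropped):

span.SetAttributes(
  attribute.String("space", se.space.DID().String()),
  attribute.String("cid", c.String()),
)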

Comment on lines 282 to 311

ctx, span := tracer.Start(ctx, "get-blocks-batch", trace.WithAttributes(
  attribute.String("get-blocks.batch.shard.digest", digestutil.Format(cloc.location.Commitment.Nb().Content.Hash())),
  attribute.Int("get-blocks.batch.block-count", len(cloc.slices)),
  attribute.Int64("get-blocks.batch.offset", int64(cloc.location.Position.Offset)),
  attribute.Int64("get-blocks.batch.length", int64(cloc.location.Position.Length)),
))
defer span.End()

Member

I would drop this, as it's already covered by the tracing in the Retrieve method.

Member Author

Wait: isn't this one of the most interesting ones? This tells us how batched our retrievals are.

Member

I think the trace at the top of this function will already convey that, no?

Member Author

No, that tells us about groups that are GetBlocks()ed. This tells us about each batch we actually fetch. The ratio between those tells us how effectively we're managing to batch.

Member

My concern is that traces might be heavier than needed for this insight; would a metric work here instead? Something like a histogram for batches-per-request or blocks-per-batch would give us the same aggregate visibility at lower cost.

OTOH, I can see traces being useful for debugging individual slow retrievals, if that's the intent.

Open to either, your call.
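A rough sketch of the histogram idea, assuming the OTel Go metrics API (go.opentelemetry.io/otel and go.opentelemetry.io/otel/metric); the meter and instrument names here are hypothetical:

meter := otel.Meter("blob-retrieval")

// created once, at setup time
blocksPerBatch, err := meter.Int64Histogram(
  "retrieval.blocks_per_batch",
  metric.WithDescription("number of blocks fetched in a single batch"),
)
if err != nil {
  // handle instrument creation failure
}

// recorded once per batch actually fetched
blocksPerBatch.Record(ctx, int64(len(cloc.slices)))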

@frrist frrist left a comment

From where I sit, it feels a bit premature to add all this tracing, but maybe I'm missing something about where the known bottlenecks are.

If I were to scope the tracing back, I'd only include spans for:

  1. Retrieve as you have already done.
  2. LocateMany since it's always called when locating.

The metrics I'd be most interested in seeing are:

  1. time to retrieve and size of retrieval per node.
  2. time to locate and its cache hit/miss ratio

This will inform us of "How long does it take to find the thing" and "How long does it take to get the thing".
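For concreteness, the locate-side instruments could look something like this (same assumptions as the earlier sketch; the instrument names, the "result" attribute, and a start := time.Now() taken before the LocateMany call are all hypothetical):

locateDuration, _ := meter.Float64Histogram(
  "locate.duration",
  metric.WithDescription("time to locate content"),
  metric.WithUnit("s"),
)
cacheLookups, _ := meter.Int64Counter(
  "locate.cache_lookups",
  metric.WithDescription("locate cache lookups, tagged hit or miss"),
)

// record elapsed time around the locate call, and tag each cache lookup's
// outcome; the hit/miss ratio falls out of dividing the two counter series
locateDuration.Record(ctx, time.Since(start).Seconds())
cacheLookups.Add(ctx, 1, metric.WithAttributes(attribute.String("result", "hit")))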

attribute.String("locate-many.space", spaceDID.String()),
attribute.Int("locate-many.digest-count", len(digests)),
))
defer span.End()
Member

Add a named return to this method (and the others), then do something like this to capture the span's status and record any error:

defer func() {
  if err != nil {
    span.SetStatus(codes.Error, err.Error()) // codes is go.opentelemetry.io/otel/codes
    span.RecordError(err)
  }
  span.End()
}()

or similar.

Comment on lines 87 to 114
xres, hres, err := rclient.Execute(ctx, inv, conn)
if err != nil {
  return nil, fmt.Errorf("executing `space/content/retrieve` invocation: %w", err)
}
Member

Here is where I'd probably add a metric (histogram) that records the duration of the execution, since I think this is where the retrieval happens. I'd include the location (the node's URL) as an attribute on the metric; then we can plot retrieval times per node.

However, I suspect the retrieval isn't actually complete until hres.Body() is fully read and closed. Maybe we can wrap the Body returned here in something that properly records the time to read the entire body, as sketched below. It may also be nice to have a second histogram that records retrieval size in bytes per node.
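One way to wrap the Body as described, as a sketch (assuming the OTel Go metrics API, via go.opentelemetry.io/otel/metric and go.opentelemetry.io/otel/attribute; the timedBody type, instrument names, and node.url attribute are hypothetical):

// timedBody wraps a response body and records duration and size metrics
// once the entire body has been read and closed.
type timedBody struct {
  io.ReadCloser
  ctx      context.Context
  start    time.Time
  bytes    int64
  duration metric.Float64Histogram
  size     metric.Int64Histogram
  nodeURL  string
}

func (b *timedBody) Read(p []byte) (int, error) {
  n, err := b.ReadCloser.Read(p)
  b.bytes += int64(n) // count every byte actually read
  return n, err
}

func (b *timedBody) Close() error {
  err := b.ReadCloser.Close()
  attrs := metric.WithAttributes(attribute.String("node.url", b.nodeURL))
  b.duration.Record(b.ctx, time.Since(b.start).Seconds(), attrs)
  b.size.Record(b.ctx, b.bytes, attrs)
  return err
}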

@Peeja Peeja force-pushed the feat/better-batch-memory-management branch from 8c02ecd to 6ef5bda on January 16, 2026 20:25
@Peeja Peeja force-pushed the feat/batch-request-otel branch from ff8612f to a6ead59 on January 16, 2026 20:25
@Peeja Peeja force-pushed the feat/better-batch-memory-management branch from 6ef5bda to 35c2c7f on January 20, 2026 18:00
@Peeja Peeja force-pushed the feat/batch-request-otel branch from a6ead59 to 356a654 on January 20, 2026 18:00
Peeja added a commit that referenced this pull request Jan 20, 2026
#### PR Dependency Tree


* **PR #289** 👈
  * **PR #285**
    * **PR #290**
      * **PR #291**

This tree was auto-generated by
[Charcoal](https://github.com/danerwilliams/charcoal)
Peeja added a commit that referenced this pull request Jan 21, 2026
TODO:

- [x] Initial fully working version
- [x] Slop factor for CAR layout
- [x] Manual test with `doupload`, to confirm it gets used in the
current code path
- → [Optimize reading](#290)
- → [Otel metrics for monitoring batching success](#291)

Closes #280
Closes #288

#### PR Dependency Tree


* **PR #285** 👈
  * **PR #290**
    * **PR #291**

This tree was auto-generated by
[Charcoal](https://github.com/danerwilliams/charcoal)
@Peeja Peeja force-pushed the feat/better-batch-memory-management branch from 35c2c7f to 15d6c3c on January 21, 2026 15:16
Peeja added a commit that referenced this pull request Jan 22, 2026
#### PR Dependency Tree


* **PR #290** 👈
  * **PR #291**

This tree was auto-generated by
[Charcoal](https://github.com/danerwilliams/charcoal)