feat: Add summary spans depending on the rule -- EXPERIMENTAL #1508

mterhar · 2025-03-14T01:04:26Z

Which problem is this PR solving?

When sending data through Refinery, it would be useful to have a summary of each trace.

Short description of the changes

This adds a new span that goes to a separate "Span Summary" dataset, where you can see a single span for each trace that passes through Refinery, along with whatever Refinery concluded about it while it was being evaluated.

EXPERIMENTAL

It has a lot of janky stuff right now that needs validation so it's a draft until we play with it with some realistic data and see what the summaries look like.

mterhar · 2025-03-17T14:31:55Z

Fields I've added and what they actually mean:

Number of slow spans
- meta.summarized.high_latency_span_count is a count of spans with a long duration.
- Problem: any async spans that are still running when the summary is created are not accounted for.
- meta.summarized.high_latency_threshold_ms is the number of milliseconds that qualifies a span as slow
Number of error spans
- meta.summarized.error_count is the number of error spans. Again, it only counts the ones that arrived before the summary was created.
Services that were traversed
- meta.summarized.services string concatenates all the unique service.name fields for the spans
Some stuff about the root span
- meta.root.name is the original name since I want this span to say "Summary ..." as the name.
- the timestamp is typically the root span's timestamp, though if a span with an earlier timestamp is found, that earlier timestamp is used instead
The entire duration of the trace
- This one looks for the "last end timestamp" and sets the summary span's duration to "end time - start time"
- Problem: when the root span arrives, it triggers both the decision and the summarization, in which case the root span's duration is the same as the summary span's. Otherwise, the trace timeout sets the duration of the summary span. Either way, it will never be able to factor in async work.

And then it also includes the other meta.things like meta.event_count, meta.span_count, etc.

The consequences of basing it on an existing span

By using either the root span or "the first span" as the basis for the summary, it ends up having some values like telemetry.sdk.name, span.kind, library.name, and others that are simply incorrect. The benefit is that stuff like status_code, error, and http.url show up.
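One possible way to keep the useful fields while dropping the misleading ones is to copy the base span's fields and remove a deny list before renaming. The sketch below is only an illustration of that idea: `BuildSummaryFields`, the `misleadingFields` list, and the placeholder name `"Summary"` are hypothetical (the PR does not spell out the exact summary name), and it assumes field values are scalars so copying the map is enough to avoid editing the original.

```go
package main

import "fmt"

// misleadingFields describe the original span's instrumentation and are wrong
// on a synthetic summary span. Hypothetical list taken from the comment above.
var misleadingFields = []string{"telemetry.sdk.name", "span.kind", "library.name"}

// summaryName is a placeholder; the real name format is not given in the PR.
const summaryName = "Summary"

// BuildSummaryFields copies the base span's fields into a new map so the
// original span is not modified, drops misleading fields, and preserves the
// original span name under meta.root.name.
func BuildSummaryFields(base map[string]any) map[string]any {
	out := make(map[string]any, len(base)+1)
	for k, v := range base {
		out[k] = v // assumption: values are scalars, so this copy is sufficient
	}
	for _, k := range misleadingFields {
		delete(out, k)
	}
	if name, ok := base["name"].(string); ok {
		out["meta.root.name"] = name
	}
	out["name"] = summaryName
	return out
}

func main() {
	root := map[string]any{
		"name":               "GET /cart",
		"telemetry.sdk.name": "opentelemetry",
		"span.kind":          "server",
		"http.url":           "https://example.com/cart",
		"status_code":        0,
	}
	fmt.Println(BuildSummaryFields(root))
}
```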

What to change?

From a Jobs-to-be-Done perspective, I want the summary span to be able to:

- exist in lieu of the rest of the trace for sampled spans
  - requires copying just enough context from the trace into the summary
  - needs to avoid capturing the status of a way-downstream service call as the "trace response code"
  - should be able to express to the user whether it was a full trace or a partial trace
- supplement a kept trace to be able to show what had happened by the time the decision was made
  - provides a full-fidelity feed of "every trace is definitely in this dataset"
  - mainly useful for metricizing span counts, error counts, etc.
  - relational fields make these use cases much less important

The summary will then be quite different depending on whether or not we received a root span.

| Decision Trigger | Kept Trace | Dropped Trace |
| --- | --- | --- |
| Root Arrived | Relational fields are just about as good | Can summarize trace effectively and provide hints about what was dropped |
| TraceTimeout or SpanLimit | Relational fields are actually more accurate | Shows a summary of the beginning of the trace, but can also be wildly inaccurate |

I'm going to focus on making the "root arrived + dropped trace" use case the best available, since it is the easiest to start with. I suggest a follow-up effort to build a recurring summary, which would require a separate buffer to hold onto some summary data so that follow-on spans arriving after the decision is made can still be captured.
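As a rough sketch of how the decision matrix above could surface in the summary, the decision trigger can be mapped to a "full" or "partial" label so users can tell how much of the trace the summary covers. The `Trigger` type, its string values, the label strings, and both function names are hypothetical and exist only for illustration.

```go
package main

import "fmt"

// Trigger is a hypothetical enumeration of what caused the sampling decision.
type Trigger string

const (
	RootArrived  Trigger = "root_arrived"
	TraceTimeout Trigger = "trace_timeout"
	SpanLimit    Trigger = "span_limit"
)

// Completeness labels the summary as covering the full trace (root span
// arrived) or only the part buffered when a timeout or span limit fired.
// Even "full" cannot include async spans that arrive after the decision.
func Completeness(t Trigger) string {
	if t == RootArrived {
		return "full"
	}
	return "partial"
}

// IsPrimaryUseCase reports whether a trace falls in the "root arrived +
// dropped trace" cell that this PR focuses on first.
func IsPrimaryUseCase(t Trigger, kept bool) bool {
	return t == RootArrived && !kept
}

func main() {
	for _, t := range []Trigger{RootArrived, TraceTimeout, SpanLimit} {
		fmt.Println(t, Completeness(t), IsPrimaryUseCase(t, false))
	}
}
```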

mterhar added 8 commits March 13, 2025 22:50

Add summarization setting and function

024d130

move tracedecision into types

29c94c1

Add the configuration and rules option

f632e18

update config and rules examples

91adf9b

Add summarizemode to the formatting test

960fbc5

Add summarize metric

867599c

Deep copy summary span rather than editing by accident

4facf9c

fix some tests

f8ae757

mterhar force-pushed the mterhar.add-summarize-capability branch from 85c74e3 to f8ae757 Compare March 15, 2025 00:17

mterhar changed the title ~~Experiment: Add summary spans depending on the rule~~ feat: Add summary spans depending on the rule -- EXPERIMENTAL Mar 15, 2025

mterhar added 9 commits March 20, 2025 16:03

add summary field list

2e1fdae

Allow users to decide to roll up service.name or not

0d24c25

fix test issue

e7ce8c1

update generated configs

c708eb4

allow summarizing dropped traces

2be4f56

Ensure trace ids exist

857dfca

fixed a few bugs, this commit is 2.9.3-sum04

008979d

includes root span in determining the end timestamp

e18b7d5

add other meta.refinery. fields to summary

c6d77a3
