feat: Add summary spans depending on the rule -- EXPERIMENTAL #1508

Draft
wants to merge 17 commits into
base: main


mterhar
Contributor

@mterhar mterhar commented Mar 14, 2025

Which problem is this PR solving?

When sending data through Refinery, maybe it'd be nice to have a summary of each trace.

Short description of the changes

This adds a new span that goes to a separate "Span Summary" dataset, where you can see a single span for each trace that hits Refinery, along with whatever Refinery thought about the trace when it was being evaluated.

EXPERIMENTAL

It has a lot of janky stuff right now that needs validation so it's a draft until we play with it with some realistic data and see what the summaries look like.

@mterhar mterhar force-pushed the mterhar.add-summarize-capability branch from 85c74e3 to f8ae757 Compare March 15, 2025 00:17
@mterhar mterhar changed the title Experiment: Add summary spans depending on the rule feat: Add summary spans depending on the rule -- EXPERIMENTAL Mar 15, 2025
@mterhar
Contributor Author

mterhar commented Mar 17, 2025

Fields I've added and what they actually mean:

  • Number of slow spans
    • meta.summarized.high_latency_span_count is a count of spans with a long duration.
    • Problem: any async spans that are still running when the summary is created are not accounted for.
    • meta.summarized.high_latency_threshold_ms is the number of milliseconds that qualifies a span as slow.
  • Number of error spans
    • meta.summarized.error_count: again, it's just the number that arrived before the summary was created.
  • Services that were traversed
    • meta.summarized.services is a string that concatenates all the unique service.name values from the spans.
  • Some stuff about the root span
    • meta.root.name is the original name, since I want this span to say "Summary ..." as its name.
    • The timestamp is typically the root span's timestamp, though if an earlier one is found, it'll be overridden.
  • The entire duration of the trace
    • This one looks for the "last end timestamp" and sets the summary span's duration to "end time - start time".
    • Problem: when the root span arrives, it triggers the decision and the summarization at the same time, in which case the root span's duration is the same as the summary span's. Otherwise, the trace timeout sets the duration of the summary span. Either way, it'll never be able to factor in async work.

And then it also includes the other meta.things like meta.event_count, meta.span_count, etc.

The consequences of basing it on an existing span

By using either the root span or "the first span" as the basis for the summary, it ends up with some values like telemetry.sdk.name, span.kind, library.name, and others that are simply incorrect. The benefit is that stuff like status_code, error, and http.url shows up.

What to change?

From a Jobs-to-be-Done perspective, I want the summary span to be able to:

  1. exist in lieu of the rest of the trace for traces that are sampled out
    • requires copying just enough context from the trace into the summary
    • needs to avoid capturing the status of a way-downstream service call as the "trace response code"
    • should be able to express to the user whether it was a full trace or a partial trace
  2. supplement a kept trace to show what had happened by the time the decision was made
    • provides a full-fidelity feed of "every trace is definitely in this dataset"
    • mainly useful for metricizing span counts, error counts, etc.
    • relational fields make these use cases much less important
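
One way to "copy just enough context" is an allowlist applied only to the root span, so nothing from a downstream span can masquerade as the trace's response code. The sketch below is an assumption-laden illustration: `allowedRootFields`, `BuildSummaryFields`, and the `meta.summarized.complete` field name are all hypothetical and not part of this PR.

```go
package main

// allowedRootFields is a hypothetical allowlist of root-span fields that are
// still meaningful on a summary span. Fields such as telemetry.sdk.name,
// span.kind, and library.name are deliberately left out.
var allowedRootFields = []string{"http.url", "status_code", "error", "service.name"}

// BuildSummaryFields copies allowlisted fields from the root span only.
// When the root never arrived, nothing is copied, so a downstream span's
// status cannot be mistaken for the trace's response code.
func BuildSummaryFields(root map[string]any, rootArrived bool) map[string]any {
	out := map[string]any{}
	if rootArrived {
		for _, key := range allowedRootFields {
			if v, ok := root[key]; ok {
				out[key] = v
			}
		}
		if name, ok := root["name"]; ok {
			out["meta.root.name"] = name
		}
	}
	// Hypothetical field name: tells the reader whether the summary covers
	// a full trace (root arrived) or only a partial one.
	out["meta.summarized.complete"] = rootArrived
	return out
}
```

An allowlist is more conservative than a denylist here: a new attribute on the root span stays out of the summary until someone decides it belongs there.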

The summary will then be quite different depending on whether or not we received a root span.

| Decision Trigger | Kept Trace | Dropped Trace |
| --- | --- | --- |
| Root Arrived | Relational fields are just about as good | Can summarize trace effectively and provide hints about what was dropped |
| TraceTimeout or SpanLimit | Relational fields are actually more accurate | Shows a summary of the beginning of the trace, but can also be wildly inaccurate |

I'm going to focus on making the "root arrived + dropped trace" use case as good as it can be, since it's the easiest place to start. I'd suggest a follow-up effort to build a recurring summary, which would require a separate buffer that holds onto some summary data so it can capture follow-on spans after the decision is made.
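
To make the follow-up idea a bit more concrete, here is a rough sketch of what such a separate buffer could look like. Nothing like this exists in the PR; `LateSpanBuffer`, `LateCounts`, `Record`, and `Flush` are hypothetical names, and a real version would also need eviction and a size limit.

```go
package main

import "sync"

// LateCounts tracks spans that arrived after the sampling decision.
type LateCounts struct {
	Spans  int
	Errors int
}

// LateSpanBuffer is a hypothetical side buffer keyed by trace ID. It keeps
// only counters, not span data, so it stays small.
type LateSpanBuffer struct {
	mu     sync.Mutex
	counts map[string]LateCounts
}

func NewLateSpanBuffer() *LateSpanBuffer {
	return &LateSpanBuffer{counts: map[string]LateCounts{}}
}

// Record notes one late span for a trace that already has a decision.
func (b *LateSpanBuffer) Record(traceID string, isError bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	c := b.counts[traceID]
	c.Spans++
	if isError {
		c.Errors++
	}
	b.counts[traceID] = c
}

// Flush returns the accumulated counts and resets the buffer, so a
// recurring job could emit one follow-up summary per trace per interval.
func (b *LateSpanBuffer) Flush() map[string]LateCounts {
	b.mu.Lock()
	defer b.mu.Unlock()
	out := b.counts
	b.counts = map[string]LateCounts{}
	return out
}
```

A periodic caller would invoke `Flush` and send one follow-up summary span per trace ID, which is how async work that finished after the decision could eventually be reflected.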
