Add basic Open Telemetry instrumentation for all requests #729

Open: 1 commit to merge into main

atheriel
Collaborator

@atheriel atheriel commented May 22, 2025

This commit wraps all requests in an Open Telemetry span that abides by the semantic conventions for HTTP clients (insofar as I understand them).
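
As a rough illustration of what those conventions ask for (this is not the code in this PR): an HTTP client span is typically named after the request method and carries attributes such as `http.request.method`, `server.address`, `server.port`, and `url.full`. The helper `http_client_span_info()` below is hypothetical and uses only base R; it ignores userinfo and IPv6 hosts for brevity.

```r
# Hypothetical helper (not part of httr2 or otel): derive a span name and the
# core HTTP client span attributes from a request method and URL.
http_client_span_info <- function(method, url) {
  # Extract the scheme; sub() returns the input unchanged when nothing matches.
  scheme <- sub("^(https?)://.*$", "\\1", url)
  if (identical(scheme, url)) {
    stop("Expected an http(s) URL, got: ", url)
  }

  # The authority is everything between "://" and the first "/", "?" or "#".
  authority <- sub("^https?://([^/?#]*).*$", "\\1", url)
  parts <- strsplit(authority, ":", fixed = TRUE)[[1]]
  host <- parts[1]

  # Fall back to the scheme's default port when the URL does not give one.
  port <- if (length(parts) >= 2) {
    as.integer(parts[2])
  } else if (scheme == "https") {
    443L
  } else {
    80L
  }

  list(
    name = toupper(method),
    attributes = list(
      "http.request.method" = toupper(method),
      "server.address" = host,
      "server.port" = port,
      "url.full" = url
    )
  )
}

info <- http_client_span_info("get", "https://google.com")
str(info)
```

Once a response arrives, `http.response.status_code` would be added to the same attribute list.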

Right now this instrumentation is opt-in: otel is in Suggests, and tracing must be enabled (e.g. via the OTEL_TRACES_EXPORTER environment variable). Otherwise this is costless at runtime.

For example:

library(otelsdk)

Sys.setenv(OTEL_TRACES_EXPORTER = "stderr")

request("https://google.com") |>
  req_perform()

I'm not sure that otel needs to move to Imports, because by design users actually need the otelsdk package to enable tracing anyway.

The major limitation right now is that we don't propagate the trace context to the server, because otel doesn't have an explicit mechanism for this yet.
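
For reference, propagating the context would amount to sending a W3C Trace Context `traceparent` header with each request. A minimal base R sketch of the header format follows; `format_traceparent()` is a hypothetical helper, and the IDs are the example values from the W3C specification rather than anything produced by otel.

```r
# Hypothetical helper: format a W3C Trace Context `traceparent` header value.
# Layout: version "00", 32 hex digit trace ID, 16 hex digit span ID, and a
# 2 hex digit flags field ("01" means sampled).
format_traceparent <- function(trace_id, span_id, sampled = TRUE) {
  stopifnot(
    grepl("^[0-9a-f]{32}$", trace_id),
    grepl("^[0-9a-f]{16}$", span_id)
  )
  flags <- if (sampled) "01" else "00"
  paste("00", trace_id, span_id, flags, sep = "-")
}

header <- format_traceparent(
  trace_id = "4bf92f3577b34da6a3ce929d0e0e4736",
  span_id = "00f067aa0ba902b7"
)
header
```

With httr2, a value like this could be attached using `req_headers(req, traceparent = header)`. Where the trace and span IDs would come from in otel is exactly the missing piece described above.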

@atheriel atheriel requested review from hadley and gaborcsardi May 22, 2025 13:37
@atheriel
Collaborator Author

In Grafana:

[Screenshot: trace in Grafana]

@atheriel
Collaborator Author

And logfire:

[Screenshot: trace in Logfire]

atheriel added a commit to tidyverse/ellmer that referenced this pull request May 22, 2025
This commit wraps all LLM model calls in an Open Telemetry span that
abides by the (still nascent) semantic conventions for Generative AI
clients [0].

It's very similar in approach to what was done for `httr2`, and in fact
the two of them complement one another nicely:
r-lib/httr2#729.

For example:

    library(otelsdk)

    Sys.setenv(OTEL_TRACES_EXPORTER = "stderr")

    chat <- ellmer::chat_databricks(model = "databricks-claude-3-7-sonnet")
    chat$chat("Tell me a joke in the form of an SQL query.")

[0]: https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/

Signed-off-by: Aaron Jacobs <[email protected]>
@atheriel
Collaborator Author

This is a significant rework over the original implementation after I discovered issues with it in practice. It ended up being pretty hard to get the scoping/side effect issues right, but when I run this PR with httr2's test suite against a real OTel platform the traces do look correct.

However: right now this PR doesn't support httr2's request queue for iterative/parallel requests because the span scopes get nested. I think I might need to request some changes to otel to accommodate that, and to further research the OTel ecosystem to see what other implementations do for these "batch request" traces.

Finally, there aren't any unit tests. I'd like to add some (especially considering the likelihood of regressions), but right now the testing story for otel is a little unclear. Again, I'll try to work with Gabor to map out what that should look like.

@gaborcsardi
Member

gaborcsardi commented May 26, 2025

@atheriel If you are using concurrency on a single thread, then you'll need to use sessions to get the relationships between the spans right. Essentially, you'll need to manage the sessions manually and pass the correct session to otel::start_span().

  1. Call tracer$start_session() for a new session, at the beginning of a composite request. Store the return value.
  2. Use this return value in otel::start_span() to start a span that belongs to this composite request.
  3. When the request is done, call tracer$finish_session() to close the session.

The API for this might get a little nicer soonish, but I think the mechanics are pretty solid.

There is an example for this kind of manual session management in the shiny.R file and in otel::start_span(), to manage spans from concurrent Shiny sessions correctly.

@hadley
Member

hadley commented May 27, 2025

@gaborcsardi have you thought at all about testing? It would be nice if otel offered some way to collect the spans into a data frame (or other data structure), so that we could then inspect and have some simple tests that the right information is getting passed through.

@gaborcsardi
Member

gaborcsardi commented May 27, 2025

> @gaborcsardi have you thought at all about testing?

Yes, you can write them to a file, and then read them back, e.g.

  tmp <- tempfile(fileext = "otel")
  trc_prv <- tracer_provider_stdstream_new(tmp)

  [...]

  spns <- parse_spans(tmp)

https://github.com/r-lib/otelsdk/blob/3162000f454b28a251a40cc325bbe36d56231cbe/tests/testthat/test-tracer.R#L12-L14

Member

@hadley hadley left a comment

Thanks for working on this! I think folks are going to be super excited once this all comes together.

Is there already an otel way to suppress instrumentation just for httr2?

@hadley
Member

hadley commented May 27, 2025

@gaborcsardi maybe it's worth wrapping that up into a little helper?

capture_spans <- function(code) {
  tmp <- tempfile(fileext = "otel")
  trc_prv <- tracer_provider_stdstream_new(tmp)
  
  code
  
  parse_spans(tmp)
}

@atheriel
Collaborator Author

I suggest we move testing discussion to: r-lib/otelsdk#13

@atheriel
Collaborator Author

> Is there already an otel way to suppress instrumentation just for httr2?

Good point. We could have an httr2 option for this, but I think a more general mechanism belongs in otel: r-lib/otel#4.

@atheriel
Collaborator Author

Just for fun: the main way that I tested that parent spans and attributes "looked right" was to trace httr2's entire test suite, which emits hundreds of spans:

[Screenshot, 2025-05-27: Pydantic Logfire live view of the posit1_ellmer-testing project]

@atheriel
Collaborator Author

atheriel commented Jun 5, 2025

Unit tests are now included.

@gaborcsardi
Member

Some improvements you can cherry-pick: gaborcsardi#1, including spans for req_perform_parallel: gaborcsardi/httr2@c62b119 (#1).

This commit wraps all requests in an Open Telemetry span that abides by
the semantic conventions for HTTP clients [0] (insofar as I understand
them). We also propagate the trace context [1] when there is one.

The main subtlety is that I had to tweak some of httr2's internals so
that request signing can take into account new headers. Luckily there is
fairly comprehensive test coverage so I'm fairly sure at this point that
I haven't broken anything.

Right now this instrumentation is opt in: `otel` is in `Suggests`, and
tracing must be enabled (e.g. via the `OTEL_TRACES_EXPORTER` environment
variable). Otherwise this is costless at runtime.

For example:

    library(otelsdk)

    Sys.setenv(OTEL_TRACES_EXPORTER = "stderr")

    request("https://google.com") |>
      req_perform()

I'm not sure that `otel` needs to move to `Imports`, because by design
users actually need the `otelsdk` package to enable tracing anyway.

Unit tests are included.

[0]: https://opentelemetry.io/docs/specs/semconv/http/http-spans/#http-client-span
[1]: https://www.w3.org/TR/trace-context/

Signed-off-by: Aaron Jacobs <[email protected]>
Co-authored-by: Gábor Csárdi <[email protected]>
@atheriel
Collaborator Author

I've completely rewritten this in concert with Gabor as the otel and otelsdk packages have evolved significantly. Major updates include:

  • Parallel and promisified requests now work correctly and inherit the parent context you'd expect.
  • There are extensive unit tests for all of the sites that emit spans.

The main new subtlety is that I had to tweak some of httr2's internals so that request signing can take into account new headers while preserving the scope-centric needs of otel's API. Luckily there is fairly comprehensive test coverage so I'm fairly sure at this point that I haven't broken anything.

@atheriel atheriel requested a review from hadley June 18, 2025 19:40
@hadley
Member

hadley commented Jun 19, 2025

I'm lining up for an httr2 patch release in the next week or two. I assume that this probably won't make it, since we'll need some time for otel/otelsdk to get to CRAN?

@atheriel
Collaborator Author

Probably not. Should we aim to merge it after the patch, then?

3 participants