Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add instrument tracing #149

Merged
merged 22 commits into from
Dec 6, 2023
Merged

Conversation

zxiong
Copy link
Contributor

@zxiong zxiong commented Nov 27, 2023

Add common instrument tracing function so that it can be used by pubtools-* projects

@zxiong zxiong requested a review from lubomir November 27, 2023 06:48
Copy link

codecov bot commented Nov 27, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (faa0e80) 98.66% compared to head (ea0eeee) 99.36%.
Report is 5 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #149      +/-   ##
==========================================
+ Coverage   98.66%   99.36%   +0.69%     
==========================================
  Files           4        6       +2     
  Lines          75      157      +82     
==========================================
+ Hits           74      156      +82     
  Misses          1        1              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -0,0 +1,123 @@
"""Instrument Pub task functions.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please revise the docs and the environment variable names to not be specific to pub? This code should work even when running outside of pub.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

log = logging.getLogger(__name__)


class TracingWrapper:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no delineation here between what's intended as public API, and what's considered just an implementation detail.

As far as I can tell, only the instrument_func decorator should be public API. Everything else should be hidden.

Can you please follow the same setup as other modules in order to achieve this? Check how pubtools/pluggy.py works for example. Basically it means that:

  • this file should be moved to pubtools/_impl/tracing.py and it can contain anything you want
  • in pubtools/tracing.py you should import only what's intended as public API - instrument_func I guess - and put it into __all__ .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated as your suggestion.

return cls.__instance


def instrument_func(span_name=None, args_to_attr=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please ensure that this appears in online docs at https://release-engineering.github.io/pubtools/ ?

I think you'll have to write a whole new page for this, similar to the hooks page which exists now. The page can contain the docs for this function and also general info about how to enable this (e.g. how to set the relevant env vars).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added docs for it.


Args:
span_name: str
Span name.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please mention that it's OK to omit this in order to use the function's name as the span name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

if TracingWrapper.__instance is None:
log.info("Creating TracingWrapper instance")
cls.__instance = super().__new__(cls)
otlp_exporter = OTLPSpanExporter(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that the exporter part of this should be hardcoded to use only this type of exporter.

Correct me if I'm wrong, but I think the intended design is that you can set up all your tracing code, and then choose different exporters for different scenarios without having to change the tracing code?

e.g. in https://opentelemetry.io/docs/instrumentation/python/exporters/#console-exporter it refers to using a console exporter which seems to be built in to the main otel library. It says it's good for "development and debugging tasks" and I can imagine that indeed being quite useful for us.

For instance, if I'm working on pubtools-foo and I'm adding instrumentation code, it would be really nice if I can somehow make it use the console exporter when I run commands locally, but then use the HTTP exporter when deployed in Pub. I wouldn't want to be forced to deploy some kind of OTLP/HTTP receiver locally just to test my OTEL-related changes during local development.

I think you should consider using a hook to create the exporter. You can look at existing get_cert_key_paths hook to get an idea what's needed across pubtools & pub for that.

So, it could be something like this:

@hookspec(firstresult=True)
def otel_exporter():
    """Return an OTEL exporter, used by OTEL instrumentation.

    If OTEL tracing is enabled and this hook is not implemented, a default
    `ConsoleSpanExporter` will be used.
    """

...then in Pub, implement the otel_exporter hook to return an OTLPSpanExporter.

That should allow all the OTEL related code to work even in local development and tests while exporting to console, and work in Pub by using OTLP/HTTP.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this hook.

attributes = {
"function_name": func.__qualname__,
}
if args_to_attr:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather this is moved up to the top before the PUB_OTEL_TRACING check, even though the result is not used when it's false.

The reason is that there's no guarantee the arguments can be converted to string successfully. I think we can say it's a bug to use args_to_attr=True in that case, but the problem is that the bug will not be detected unless PUB_OTEL_TRACING is set. So, this sets up a trap where somebody adds @instrument_func on some new code and it seems to work fine when they test locally with PUB_OTEL_TRACING not set, but as soon as it's deployed into production where PUB_OTEL_TRACING is set, it crashes.

By moving up this attribute calculation to the beginning, we ensure bugs in the attribute calculation will be detected immediately even if tracing is disabled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated as your suggestion.

def test_instrument_func_multiple_threads():
trace_id = "cefb2b8db35d5f3c0dfdf79d5aab1451"
span_id = "1f2bb7927f140744"
os.environ["PUB_OTEL_TRACING"] = "true"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you change os.environ directly, it's not reverted at the end of the test, so this test might influence later tests run by the same process.

Please use pytest monkeypatch fixture instead. If you monkeypatch.setenv, whatever you set is automatically rolled back at the end of the test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated those tests to use monkeypatch

zxiong and others added 7 commits November 28, 2023 10:06
- Revise the docs and the env-vars names to not be specific to pub.
- Hide instrument_func implementation.
- Use a hook to create the exporter instead of hardcode.
- Adjust function args to span attributes.
- Use pytest monkeypatch fixture instead of changing os.environ in unit
  test.
@zxiong zxiong requested a review from rohanpm December 4, 2023 10:31
docs/tracing.rst Outdated


Usage
.....................
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the number of characters in these header lines need to match exactly the line above in order for docs to render properly and without warnings.

So, for example, since "Usage" is 5 characters long this needs to be:

Suggested change
.....................
.....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed it.

Following environment variables are used in the module:

- ``OTEL_TRACING``: set ``true`` to enable instrument tracing, otherwise tracing is disabled.
- ``OTEL_SERVICE_NAME``: required, set the value of the service.name resource attribute. It's
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment I don't quite understand this, can you give me an example? (Just telling me here and not putting into the docs.) For example would we be setting this to "pub" ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"pub-hub" for Pub hub, and "pubd" (or pub-worker) for Pub worker. It can be as a filter field in field condition.

docs/tracing.rst Outdated


OTEL Exporter
~~~~~~~~~~~~~~
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, not perfectly aligned with the header text.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

log.info("Creating TracingWrapper instance")
cls.__instance = super().__new__(cls)
exporter = (
pm.hook.otel_exporter()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be called only once in case construction of the exporter is expensive or has some side-effects.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

docs/tracing.rst Outdated
Instrument tracing for functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Before instrument tracing for functions, :class:`TracingWrapper` should be called once in the service, which
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we please remove this requirement? It seems like an ugly implementation detail. The user doesn't have to actually do anything with the class except instantiate it, right?

Actually, I'm reading TracingWrapper right now and I can't understand why an instance of the class is even needed. There is never any reference to TracingWrapper.__instance anywhere. Is the point of that code just to do some one-time initialization and there is no technical reason for it to be a class (as opposed to e.g. just a plain function)?

Anyway, I would very much like us to find some way to avoid this and to do this automatically for the user.

Right now, when does the initialization need to happen? Is it necessary before @instrument_func or is it just necessary before one of the instrumented functions is invoked? I would suggest that instrument_func itself can arrange for this initialization to happen once, when needed, and don't require the caller to care about this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do see one usage of TracingWrapper in the change, which is that this happens in some tests:

tracing_wrapper.processor.force_flush()

Would you anticipate that other projects would ever need to do something like that? If so, that's a reason why it would have to remain in the API.

If it needs to remain in the API because projects would need to use API like the above, here is a different suggestion how to handle it:

There should be a public class named something like TraceContext which provides access to the processor (and whatever other otel classes that projects might need to access).

There should be a public function named something like get_trace_context() which returns the single global trace context, creating it if necessary; or returning an object which does nothing if tracing is disabled.

The instrument_func decorator, rather than being a standalone function, should be a member of TraceContext. The way of using it downstream would then be something like:

from pubtools.tracing import get_trace_context

ctx = get_trace_context()

class Whatever:
    @ctx.instrument_func(...)
    def some_method_i_want_to_instrument(self):
        ...

Note that it forces the caller to obtain a TraceContext before they can instrument anything, so it's not possible to forget to initialize the context.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do see one usage of TracingWrapper in the change, which is that this happens in some tests:

tracing_wrapper.processor.force_flush()

Would you anticipate that other projects would ever need to do something like that? If so, that's a reason why it would have to remain in the API.

tracing_wrapper.processor.force_flush() will be used in pubd to make sure all trace data to be sent to exporters before a pub task exits.

Besides that racing_wrapper.provider will be used when enabling RequestsInstrumentor() in pubd.
RequestsInstrumentor().instrument(tracer_provider=tracing_wrapper.provider)

requirements.txt Outdated
@@ -1,2 +1,5 @@
pluggy
setuptools
opentelemetry-api
opentelemetry-sdk
opentelemetry-exporter-otlp
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this dependency can be removed now, because we now expect that the implementer of the otel_exporter hook (e.g. pub) is the one which declares the dependency on this (or on whatever exporter it wants to use).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed opentelemetry-exporter-otlp.

exporter = (
pm.hook.otel_exporter()
if pm.hook.otel_exporter()
else ConsoleSpanExporter()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I've never run this, I have no idea what the console exporter does. Have you tried it yet? Does it seem like a useful default when running locally?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I tried it in my local env. This exporter can print out all spans in the trace in the console, for example:

{ "name": "test_span", "context": { "trace_id": "0x57eabe45b97e63ba6d734f1b9b751daf", "span_id": "0xcb886a22023940bc", "trace_state": "[]" }, "kind": "SpanKind.INTERNAL", "parent_id": null, "start_time": "2023-12-05T01:40:50.679578Z", "end_time": "2023-12-05T01:40:50.679598Z", "status": { "status_code": "UNSET" }, "attributes": {}, "events": [], "links": [], "resource": { "attributes": { "service.name": "client-local-test" }, "schema_url": "" } }
The benefit of console exporter is easy to setup. But, If we'd like to see the relationships between spans clearly, it's not straightforward.

For now, I am more using OLTP exporter to send trace data a local jaeger which provides web UI to see the trace in my local test env.

cls.__instance = super().__new__(cls)
exporter = pm.hook.otel_exporter()
if not exporter:
exporter = ConsoleSpanExporter()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to cheat and write it like this if you don't want to bother with explicitly covering the branch codecov is complaining about:

exporter = pm.hook.otel_exporter() or ConsoleSpanExporter()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!


def get_trace_wrapper():
"""return a global trace wrapper instance"""
return TracingWrapper()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no special case here when tracing is disabled. Isn't it going to lead to a crash? You've implemented __new__ so it'll return None in that case.

Then won't the following happen?

from pubtools.tracing import get_trace_wrapper

tw = get_trace_wrapper()

@tw.instrument_func(...)  # CRASH!  When tracing is disabled, tw is None so this can't work.
def some_func(...):
    ....

That's why I said it'd have to return an object which "does nothing" (i.e. instrument_func exists but has no effect) in the case of tracing being disabled.

This is not necessarily just about instrument_func either. You said that you expect downstream code to access tw.processor, that would also crash if tw is None. Will everywhere accessing tw.processor need to first check if tw is None, or maybe check if tw.processor is None? It would be nice to come up with some better way to handle it because that will lead to bugs where people write code which only works when tracing is enabled, or vice-versa.

One option would be not to expose processor at all, but instead on the TracingWrapper class you implement the specific methods that you expect downstream to need (e.g. have a TracingWrapper.flush() method and have it do nothing if tracing is disabled).

By the way, I'm not familiar with the otel python library at all, but it wouldn't surprise me if it already has some way to make it gracefully do nothing. If so it would be cleaner to use that than to have to introduce None vs non-None code paths.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It did crash. I will fix it.

if os.getenv("OTEL_TRACING", "").lower() == "true":
log.info("Creating TracingWrapper instance")
exporter = pm.hook.otel_exporter() or ConsoleSpanExporter()
cls.provider = TracerProvider(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a particular reason that you set these on the class object rather than the instance just created?

I suppose that it will work, but it's odd and it's effectively creating two levels of singleton. In the same way that TracingWrapper.instance is global, so is TracingWrapper.provider, TracingWrapper.processor, TracingWrapper.tracer. It's questionable then why a class is even needed if there is never any instance-specific data.

This should all work but I think it would be way cleaner if it were written like the below. Just implement TracingWrapper as a plain old class, no overriding __new__, and then implement get_trace_wrapper() so it returns an existing copy of it.

TRACE_WRAPPER = None

def get_trace_wrapper():
    """return a global trace wrapper instance"""
    if TRACE_WRAPPER is None:
        TRACE_WRAPPER = TracingWrapper()
    return TRACE_WRAPPER

# TracingWrapper is implemented "normally"
class TracingWrapper:
    def __init__(self):
        # These should be private. We want callers to only use our
        # public methods e.g. `force_flush` so we can handle the `tracing disabled` case
        # for them.
        self._processor = None
        self._provider = None
        self._tracer = None
        if os.getenv("OTEL_TRACING", "").lower() == "true":
            # OK, actually initialize self._processor etc to functioning objects now
            self._processor = ...

    def force_flush(self):
        """Flush trace data into OTEL collectors"""
        if self._processor:
            self._processor.force_flush()

Personally I think that's a lot easier to understand and deal with, and it means that tests in this repo can just write TracingWrapper() and have it work as expected - creating a clean new instance of the object.

Also, we haven't discussed it before, but the singleton initialization would need to use a lock if there is any chance that multiple threads might be doing the initialization at the same time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a particular reason that you set these on the class object rather than the instance just created?

No special reason. Just thought that TracingWrapper is only needed to instanced once in the pubd/pub task job and referred it to other projects, so use singleton. Implementing TracingWrapper as a plain old class looks better, let me update it.

"{}={}".format(k, v) for k, v in kwargs.items()
)

if not os.getenv("OTEL_TRACING", "").lower() == "true":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: it's probably a good idea to do this env var check only once, in the TracingWrapper constructor, and store the result as self._enabled or similar.

Otherwise we have to worry about what happens if you have the env var unset, construct a TracingWrapper, instrument functions, then set the env var and call the instrumented functions. It'll be kind of half enabled, half disabled at that point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed it.

@rohanpm
Copy link
Member

rohanpm commented Dec 6, 2023

I think the pre-commit failure you're running into can be fixed by adding this into tox.ini:

[isort]
profile = black

@zxiong zxiong merged commit 3ba55b8 into release-engineering:main Dec 6, 2023
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants