-
Notifications
You must be signed in to change notification settings - Fork 834
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(inst-fetch,inst-xhr) Ignore network events with zero-timing #5332
Open
chancancode
wants to merge
1
commit into
open-telemetry:main
Choose a base branch
from
tildeio:fix-5314
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+207
−151
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
chancancode
commented
Jan 11, 2025
@@ -124,7 +124,7 @@ function createResource(resource = {}): PerformanceResourceTiming { | |||
function createMainResource(resource = {}): PerformanceResourceTiming { | |||
const mainResource: any = createResource(resource); | |||
Object.keys(mainResource).forEach((key: string) => { | |||
if (typeof mainResource[key] === 'number') { | |||
if (typeof mainResource[key] === 'number' && mainResource[key] !== 0) { | |||
mainResource[key] = mainResource[key] + 30; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
chancancode
commented
Jan 11, 2025
@@ -166,7 +175,7 @@ function createResource(resource = {}): PerformanceResourceTiming { | |||
function createMainResource(resource = {}): PerformanceResourceTiming { | |||
const mainResource: any = createResource(resource); | |||
Object.keys(mainResource).forEach((key: string) => { | |||
if (typeof mainResource[key] === 'number') { | |||
if (typeof mainResource[key] === 'number' && mainResource[key] !== 0) { | |||
mainResource[key] = mainResource[key] + 30; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #5332 +/- ##
==========================================
- Coverage 94.58% 94.57% -0.02%
==========================================
Files 318 318
Lines 8069 8069
Branches 1701 1701
==========================================
- Hits 7632 7631 -1
- Misses 437 438 +1 |
13 tasks
Background: 1. For historical reasons, the perf/resource timing spec uses 0 as a special value to denote when some timing information is either not applicable (e.g. no redirect occured) or not available (when producing an opaque resource timing object for CORS resources). 2. However, in some limited cases, 0 can also be a legitimate value for these timing events. Note that this is rare in real life – these fields are high-res performance timestamp relative to the performance time-origin, which is typically the navigation event for the initial page load. For a _resource_ timing to have a 0 timestamp, it would have to be initiated simultaneously with page load, it is unclear if this can actually happen IRL. Reportedly (open-telemetry#1769), at one point this was observed in some browsers during automated tests where things happen very fast and the browser artifically supress the timing resolution. It was unclear if the report was about the navigation timing entry or resource timing entries. It is also unclear if these utilities are intended for anything other than the internal fetch/XHR instrumentation, but they are public API, so if someone uses these functions on the initial page navigation event, then it is possible for the function to see legitimate 0-value inputs. 3. When creating span events, we do not use the timeOrigin-relative perf timestamps as-is. Rather, we convert them back to the UNIX epoch timestamps. When this conversion is applied to those 0 timestamps, it creates nonsensical/misleading events that are quite difficult to distinguish for downstream consumers. It would be nice if the W3C specs would have left the N/A values as `undefined`, but that's not the world we live in and so we have to work with what we've got. History: 1. Initially, the code ignored 0-value timestamps. 2. open-telemetry#1769 identified cases of valid 0-value timestamps and removed the check. 3. This causes the other category of bugs where we created the nonsensical events (open-telemetry#2457, open-telemetry#3848, open-telemetry#4478) 4. open-telemetry#3897 added a special-case for `secureConnectionStart` by way of tapping into auxiliary metadata 5. That appoach cannot be generalized for the other cases, so open-telemetry#4486 added some rather convoluted checks. 6. As part of refactoring the tests to use service workers, a new bug open-telemetry#5314 was identified. Presumably, the patch in open-telemetry#4486 was writen that way (as opposed to just checking for 0) to avoid breaking open-telemetry#1769, but I suspect it ends up breaking some of those use cases anyway. Options: 1. Keep the patch from open-telemetry#4486 largely as-is but change the reference time from `fetchStart` to `startTime`. 2. Revert to the very original intent of the code and drop everything with 0-values. 3. This commit attempts a compromise position, when `startTime === 0` (which probably means we are called with the initial navigation event), 0-values are plausibly valid so we include them by default, but in any other cases (`startTime > 0`), legitimate 0-values should not be possible so we drop them by default. Part of the issue here is that it's unclear how these utilities are being used since they are public APIs. In core, these functions are only called by the fetch/xhr instrumentation with resource timing entries, where legitimate 0-value seems extremely unlikely. In my opinion, the rarity of legitimate 0-value timing values in the real world (recall that it doesn't just mean something happened very fast, but something happened very fast _at the same instant that the page loaded_) should yield to the conflicting interst (to avoid non-sensical events that are difficult to process downstream) here, especially when it appears that the only observed cases are from automated testsing. Personally I would be in favor of the stronger position (2), but attempted to strike a balance here to keep things useful the other cases. In open-telemetry#2457, @johnbley argued: > I think that for this piece of tech, our telemetry should report > whatever the browser says with minimal processing, and let > downstream/backend/easier-to-modify components deal with > normalization/truncation/processing if desired (e.g., throwing > away absurd times like "that page took 6 years or -5 hours to > load", or declaring that "no redirects happened so the redirect > timing is not 0 but non-existent"). Does that not work in your > world for some reason? I could see this perspective if the "minimal processing" means that we sends the 0-values to the backends as 0s. It's still unfortunate but at least it would be easy to write a query/filter to process them away. However, given we are actively normalizes the 0-values into the absolute epoch timestamp of `performance.timeOrigin`, I think this becomes actively hostile to backend/downstream processing, as it turns a relatively clear signal (value=0) into something that would require stateful machinary to reverse engineer (infer the timeOrigin from a trace), or imperfect huristics (ignore things that seems to take way too long). It also takes up bytesize/bandwidth to transmit in the vast majority of cases where it would be actively unhelpful. Instead, I'd argue it creates way less harm to drop the 0-values. When a 0-value is expected but not sent, it may skew the distrubtion on some aggregrate metrics slightly (e.g. making "average time for SSL handshake" appears large), but probably nothing would break. If we are expecting backends to do work to normalize data anyway, then IMO it would also be way easier for them to notice the missing items and insert it back than the other way around. Fixes open-telemetry#3199 Fixes open-telemetry#5314 See also open-telemetry#1769, open-telemetry#2457, open-telemetry#3848, open-telemetry#3879, open-telemetry#4478, open-telemetry#4486
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Background:
For historical reasons, the perf/resource timing spec uses 0 as a special value to denote when some timing information is either not applicable (e.g. no redirect occured) or not available (when producing an opaque resource timing object for CORS resources).
However, in some limited cases, 0 can also be a legitimate value for these timing events. Note that this is rare in real life – these fields are high-res performance timestamp relative to the performance time-origin, which is typically the navigation event for the initial page load.
For a resource timing to have a 0 timestamp, it would have to be initiated simultaneously with page load, it is unclear if this can actually happen IRL. Reportedly (Allow zero/negative performance timings #1769), at one point this was observed in some browsers during automated tests where things happen very fast and the browser artifically supress the timing resolution. It was unclear if the report was about the navigation timing entry or resource timing entries.
It is also unclear if these utilities are intended for anything other than the internal fetch/XHR instrumentation, but they are public API, so if someone uses these functions on the initial page navigation event, then it is possible for the function to see legitimate 0-value inputs.
When creating span events, we do not use the timeOrigin-relative perf timestamps as-is. Rather, we convert them back to the UNIX epoch timestamps. When this conversion is applied to those 0 timestamps, it creates nonsensical/misleading events that are quite difficult to distinguish for downstream consumers.
It would be nice if the W3C specs would have left the N/A values as
undefined
, but that's not the world we live in and so we have to work with what we've got.History:
secureConnectionStart
by way of tapping into auxiliary metadataPresumably, the patch in #4486 was writen that way (as opposed to just checking for 0) to avoid breaking #1769, but I suspect it ends up breaking some of those use cases anyway.
Options:
fetchStart
tostartTime
.startTime === 0
(which probably means we are called with the initial navigation event), 0-values are plausibly valid so we include them by default, but in any other cases (startTime > 0
), legitimate 0-values should not be possible so we drop them by default.Part of the issue here is that it's unclear how these utilities are being used since they are public APIs. In core, these functions are only called by the fetch/xhr instrumentation with resource timing entries, where legitimate 0-value seems extremely unlikely.
In my opinion, the rarity of legitimate 0-value timing values in the real world (recall that it doesn't just mean something happened very fast, but something happened very fast at the same instant that the page loaded) should yield to the conflicting interst (to avoid non-sensical events that are difficult to process downstream) here, especially when it appears that the only observed cases are from automated testsing.
Personally I would be in favor of the stronger position (2), but attempted to strike a balance here to keep things useful the other cases.
In #2457, @johnbley argued:
I could see this perspective if the "minimal processing" means that we sends the 0-values to the backends as 0s. It's still unfortunate but at least it would be easy to write a query/filter to process them away.
However, given we are actively normalizes the 0-values into the absolute epoch timestamp of
performance.timeOrigin
, I think this becomes actively hostile to backend/downstream processing, as it turns a relatively clear signal (value=0) into something that would require stateful machinary to reverse engineer (infer the timeOrigin from a trace), or imperfect huristics (ignore things that seems to take way too long).It also takes up bytesize/bandwidth to transmit in the vast majority of cases where it would be actively unhelpful.
Instead, I'd argue it creates way less harm to drop the 0-values. When a 0-value is expected but not sent, it may skew the distrubtion on some aggregrate metrics slightly (e.g. making "average time for SSL handshake" appears large), but probably nothing would break. If we are expecting backends to do work to normalize data anyway, then IMO it would also be way easier for them to notice the missing items and insert it back than the other way around.
Fixes #3199
Fixes #5314
See also #1769, #2457, #3848, #3879, #4478, #4486
Type of change
Please delete options that are not relevant.
How Has This Been Tested?
Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration
Checklist:
addedupdated