Add PSNR (Y/U/V) for outbound-rtp #794
Conversation
This is similar to qpSum but codec-independent. Since PSNR requires additional computation, it is defined with an accompanying psnrMeasurements counter to allow the computation of an average PSNR. Defined as three components for the Y, U and V planes respectively. See also https://datatracker.ietf.org/doc/html/rfc8761#section-5
@henbos ^
This needs to be presented at the next virtual interim. Youenn mentions he'd like to hear about use cases for the metric.
https://www.researchgate.net/publication/383545049_Low-Complexity_Video_PSNR_Measurement_in_Real-Time_Communication_Products has a whole paper about this. tl;dr: QP is codec-dependent, PSNR is not (but comes at a cost, hence this cannot be a simple sum). @youennf the folks who implemented https://developer.apple.com/documentation/videotoolbox/kvtcompressionpropertykey_calculatemeansquarederror?changes=l_4_8 might be able to tell you more too. cc @taste1981
I have not read the paper; is a preprint available? But based on my own interactions with PSNR, there is a source and a decoded image. Is the measurement on the outbound-rtp between the source and the encoded image, i.e., the PSNR due to the encoder?
This one is encoder PSNR, not scaling PSNR. Scaling PSNR would end up living on media-source in stats.
The other thing that I am curious about is whether the PSNR requires decoding the encoded video, or whether it is calculated as part of the encoder operation. Mainly the impact on CPU if it requires some kind of decode step; I wonder if this is only calculated for I-frames or huge frames.
@sprangerik have you looked at this?
@jesup thoughts?
I am supportive of this. For context, see also https://webrtc-review.googlesource.com/c/src/+/368960
The idea is to have it as part of the encoder process. The encoder is by definition also a decoder, so it can directly use both the raw input and reconstructed state without penalty. The actual PSNR calculation will of course often incur an extra CPU hit, unless it is already a part of e.g. a rate-distortion aware rate controller - but that's not often the case for real-time encoders. That's why it's proposed to limit the frequency of PSNR calculations. This of course means the user cannot count on PSNR metrics being populated. Even for a given stream, the PSNR values might suddenly disappear if e.g. there is a software/hardware switching event and only one implementation supports PSNR output.
Since webrtc-stats aggregates values anyway, we could do a sumPsnr and countFrames, i.e., each time a PSNR is calculated it is added and the corresponding frame count counter goes up. If it is done for all frames, we would not need a frame counter.
This issue was discussed in the WebRTC February 2025 meeting – (#794 Add PSNR)
This seems like a nice feature that could have a few uses. I do wonder if it could be a separate API instead of part of the outbound RTP stats. My initial concerns are calculating this data regardless of whether the application is even interested in it, and the lack of any specification or recommendation on the frequency of measurements. Some pros and cons come to mind if this were implemented as a separate API instead.
Cons:
If this should remain in the stats could we consider adding some sort of getStats object to enable logging for this kind of data?
WebRTC users routinely log getStats data, so adding this would not be any big overhead. If the stats are collected on a timescale of seconds, the overhead is usually negligible. (Polling stats for every frame is not a good idea.)
If I understand correctly, the concern is the overhead of the browser doing an expensive calculation most websites would never request (though per-frame is not an issue due to caching, per-second might be; is never an acceptable frequency?). How expensive is this computation? Our web stats model is like a boat we keep loading with new stuff. Eventually, it becomes problematic. At some point (maybe now?) might we wish we had something like this? await sender.getStats({verbosity: "high"}) // low | medium (default) | high
I don't think WebRTC has to do these measurements very often for the PSNR measurements to be valuable and if they aren't done very often (say every second or every several seconds) then I don't think we need to make API changes. A similar example is that if you negotiate corruption-detection we do corruptionMeasurements, but since we only make these once per second they don't have any significant performance implications compared to the rest of the decoding pipeline.
Btw this is unrelated to the polling frequency since the metrics only update when a measurement is made and a measurement happens in the background whether or not the app is polling getStats. (Polling getStats several times per second is bad because of the overhead of that call, not because of counters incrementing in the background.)
One concern is this stat seems to require making two getStats calls over some interval. E.g. is the use case here to try one encoder setting, get stats, then wait 1 second and call getStats again expecting two different measurements? If so, this might cause a divide-by-zero error in one browser but not another.
All metrics in the getStats API are used like so: "delta foo / delta bar". That is true whether it is a rate (delta bytesSent / delta timestamp), a measurement thingy (delta totalCorruptionProbability / delta corruptionMeasurements), something more exotic like (delta qpSum / delta framesEncoded), or even (jitterBufferDelay / jitterBufferEmittedCount). I could go on with more examples, but "divide by zero" is something that the user of this API should be aware of.
qpSum / framesDecoded might be a better example of a foot gun, since that could fail when the network glitches but not in a stable environment. In practice web developers will write helper functions that do lookups of deltas and rates, taking care of the foot guns. Also, you have to be prepared for a metric suddenly not being present.
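For illustration, a sketch of what such a helper might look like. The helper name, and the plain-object snapshot shape, are made up for this example; real code would read fields off RTCStats dictionaries.

// Hypothetical helper: compute "delta numerator / delta denominator" between
// two stats snapshots. Returns undefined when the metric is absent or when
// the denominator did not advance (the divide-by-zero foot gun).
function deltaRatio(prev, curr, numKey, denKey) {
  if (!(numKey in prev) || !(numKey in curr)) return undefined;
  const dNum = curr[numKey] - prev[numKey];
  const dDen = curr[denKey] - prev[denKey];
  if (!(dDen > 0)) return undefined; // also catches a missing denominator (NaN)
  return dNum / dDen;
}

E.g. deltaRatio(ortp1, ortp2, "qpSum", "framesEncoded") yields an average QP for the interval, or undefined when no frames were encoded.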
Yes, I didn't mean to suggest the divide-by-zero hazard was limited to this API. I think the concern in this case is:
PSNR is similar to QP, so having it in getStats makes sense. As the paper says, we have done this at a frequency higher than once per second on devices where battery consumption is a concern, and it works there. Hardware encoder support makes this even "cheaper". Note that the calculation is done by the encoder, so it cannot be triggered by calling getStats with some magic option. I considered whether it was possible to gate it on the corruption detection RTP header extension, but that would have been quite awkward since it is not closely related (not without precedent; quite a few statistics depend on header extensions). When I say "A/B testing", consider a project like Jitsi moving to AV1, in particular the "Metrics Captured" section which, unsurprisingly, relies on getStats. See here for how one uses PSNR to evaluate when it is available. Such experiments are designed not to compare 🍎 to 🍌 (different browsers, different operating systems), so letting a UA decide on sampling frequency is not a concern as long as it does so consistently.
Polling is just asking "do you have any new measurements for me?" It doesn't matter if the app polling interval and the browser measurement interval align or not, and it's clear from the guidelines that there is no control of the sampling period. So I would argue that the only thing that matters is whether the measurements arrive at a granular enough level to be useful. If the concern is that a browser implementer doesn't know what a useful measurement interval is, maybe we can provide some guidance there, but I fail to see the interop issue with different polling intervals that are all within a "useful" range. FTR I think 15 seconds is too large an interval since a lot can happen in that period of time.
I would not even poll getStats for A/B testing purposes. One would typically poll periodically and use the last result, or call getStats explicitly before closing the peerconnection, and then calculate the average PSNR as psnrSum_{y,u,v}/psnrMeasurements. Only calls with enough psnrMeasurements should be taken into account, which one needs to do irrespective of sampling frequency to exclude "short calls". (While we are rambling: it seems Firefox throws when calling getStats on a closed peerconnection, which is not my understanding of #3; arguably, with all the transceivers gone, all the interesting stats disappear nowadays.)
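A sketch of that end-of-call computation. Function name and the minimum-measurements threshold are made up; the psnrSum/psnrMeasurements keys follow this PR's proposal.

// End-of-call telemetry sketch: average per-plane PSNR from a final
// outbound-rtp stats object. minMeasurements (an arbitrary example value)
// excludes short calls where the average would just be noise.
function averagePsnr(outboundRtp, minMeasurements = 10) {
  const {psnrSum, psnrMeasurements} = outboundRtp;
  if (!psnrSum || !(psnrMeasurements >= minMeasurements)) return undefined;
  return {
    y: psnrSum.y / psnrMeasurements,
    u: psnrSum.u / psnrMeasurements,
    v: psnrSum.v / psnrMeasurements,
  };
}

Returning undefined (rather than throwing or returning zero) keeps "too few measurements" distinguishable from a genuinely low PSNR.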
That's fine for telemetry. I think our concern was more someone making runtime decisions off stats, e.g.:

// probe and switch to best codec for media being sent right now:
let bestCodec, bestY = 0;
for (const codec of sender.getParameters().codecs) {
  const params = sender.getParameters();
  params.encodings[0].codec = codec;
  await sender.setParameters(params);
  await wait(1000);
  const ortp1 = [...(await sender.getStats()).values()].find(({type}) => type == "outbound-rtp");
  await wait(1000);
  const ortp2 = [...(await sender.getStats()).values()].find(({type}) => type == "outbound-rtp");
  const y = (ortp2.psnrSum.y - ortp1.psnrSum.y) / (ortp2.psnrMeasurements - ortp1.psnrMeasurements);
  if (bestY < y) { bestY = y; bestCodec = codec; }
}
const params = sender.getParameters();
params.encodings[0].codec = bestCodec;
await sender.setParameters(params);
This might be good for an implementer to know.
I can replace that with "is implementation-defined", linking to https://infra.spec.whatwg.org/#implementation-defined, and copy @sprangerik's great "these metrics should primarily be used as a basis for statistical analysis rather than be used as an absolute truth on a per-frame basis".
We could say that the PSNR measurement frequency is implementation-specific, but that "the user agent SHOULD make PSNR measurements no less frequently than every X seconds, if PSNR measurements are supported for the current encoder implementation". I still think the app needs to handle the case where a PSNR measurement is not available for a given encoder implementation or browser, but this would give the app an upper bound on how long to wait.
From what I can see, browser implementations would boil down to the same line of code in libWebRTC for software encoders, no? None of the parameters for video encoding are "specified"; why do we need to be prescriptive here? Updated along the lines of #794 (comment)
webrtc-stats.html
Outdated
<p>
The PSNR is defined in [[ISO-29170-1:2017]].
</p>
This seems redundant with the same text under psnrSum (line 2235), which already links to here.
Suggested change:
-<p>
-The PSNR is defined in [[ISO-29170-1:2017]].
-</p>
webrtc-stats.html
Outdated
<p class="note">
PSNR metrics should primarily be used as a basis for statistical analysis rather
than be used as an absolute truth on a per-frame basis.
The frequency of PSNR measurements is [=implementation-defined=].
</p>
We want to avoid "should" and normative statements in non-normative notes. If we want to be normative we should lift it out.
Based on #794 (comment) are we comfortable picking a min frequency of every 5 seconds?
(Edit: added "or the encoding frame rate whichever is lower")
Suggested change:
-<p class="note">
-PSNR metrics should primarily be used as a basis for statistical analysis rather
-than be used as an absolute truth on a per-frame basis.
-The frequency of PSNR measurements is [=implementation-defined=].
-</p>
+<p>
+If the current encoder supports taking PSNR measurements, their
+frequency SHOULD be no less than every 5 seconds or the
+encoding frame rate, whichever is lower.
+</p>
+<p class="note">
+This allows for testing. PSNR measurements are intended for
+statistical analysis, and aren't expected to be accurate down
+to a frame.
+</p>
Can you explain why 5 seconds? keyFramesEncoded and keyFramesDecoded are examples of very-low-frequency events happening already, where you cannot make an assumption about the minimum interval between two getStats calls that will give you an increase. Same for packetsLost.
What value would you like? The frequency of keyFramesEncoded and keyFramesDecoded is determined by external factors unlikely to vary by user agent.
5 is arbitrary. In this thread I heard that 15 seconds was too high and that 1 second was not too expensive, but also that we'd rather not overconstrain implementations too much. 10 seconds?
This value, whatever we pick, would go into WPT and also give web developers who for some reason can't wait a minimum time to wait in order to be interoperable.
Hmm, maybe we need to also qualify that the encoder is actually encoding something? E.g. if the track is muted or is a canvas track, then the frame rate may be less than whatever value we pick. Maybe we add "... or the encoding frame rate, whichever is lower"?
tl;dr: this is a parameter of the video encoder configuration. It becomes observable in stats just like the others.
What the spec should say is "implementation defined"; the implied warning is "do not compare apples to oranges".
The frequency of keyFramesEncoded is a good example: it is determined by encoder settings such as GOP size. One wants this as large as possible for good performance, but this is something where the encoder can be tuned without "interoperability" constraints, since the decoder behavior is required to be flexible.
If one wants to get fancy about PSNR, one may need to take into account things like "is this a screen sharing track" (higher resolution, lower frame rate and wholly different content) or the frame rate (which may depend on BWE). Worth the effort for statistical analysis? Unlikely.
15 seconds was arbitrary too, I think, picked in the spirit of "I don't think this makes sense anymore". The one second picked in the code is 3x the value from the paper (which has a parameter study) and hence 1/3rd as expensive in terms of power impact.
A SHOULD implies implementation-defined. Is there a value that would be satisfactory?
Moved the normative statement out of the note (and moved the note to the actual values)
From editors meeting: Main reason for a number seems to be feature detection - people want to know that if they have waited this long between calls to getStats, and the frame counter has increased by X, and the stat doesn't show up, the feature is off. However, feature detection can be done by insisting that the stat is visible and with value zero if supported.
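A sketch of that feature-detection approach. Function names are made up; the psnrMeasurements key follows this PR's proposal, and the "present with value zero when supported" behavior is the editors' suggestion above, not shipped behavior.

// Feature detection per the editors' suggestion: psnrMeasurements is present
// (initially 0) when the current encoder supports PSNR, absent otherwise.
function psnrSupported(outboundRtp) {
  return outboundRtp.psnrMeasurements !== undefined;
}

// Distinguish "unsupported" from "supported but nothing measured yet".
function psnrState(outboundRtp) {
  if (!psnrSupported(outboundRtp)) return "unsupported";
  return outboundRtp.psnrMeasurements > 0 ? "measured" : "pending";
}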
If we have the psnr measurement frequency be implementation-defined, the spec should help the UA developer select a good frequency value. Spec should add some guidelines, for instance doing measurements every second or every 30 frames or so.
Huh? This is a counter; it cannot disappear, it can only stop increasing.
I think the gist of it is that we want the frequency to be as high as possible as long as the performance impact can be kept negligible. Perhaps we can add some text to that effect. Trying to give guidelines in terms of the expected frequency doesn't seem helpful to UA developers; on the other hand, those developers are probably in the best position to determine at which frequency the performance hit becomes non-negligible for their particular encoder implementations.
That's great, because I don't think anyone is proposing a ceiling on frequency. A floor would be nice though. I disagree that giving guidance is bad, since what you wrote sounds like good guidance already. Let's write it down.
As we say in German, "Guter Rat ist teuer" ("good advice is expensive"). The guidance should be "talk to your video encoding guys". That is what "implementation-defined" is for, no?
This naturally leads to privacy questions if the selection of the rate depends on the encoder, the system load, the CPU...
Who is proposing this?
How is this a new concern, given that video codecs in WebRTC already adapt to CPU, thermal state, etc.? Which is not described by any specification either. Note that PSNR should be possible to compute with JS and existing APIs already. And you are most welcome to invite privacy guys into your meetings, obviously.
webrtc-stats.html
Outdated
PSNR is defined in [[ISO-29170-1:2017]].
</p>
<p class="note">
PSNR metrics should primarily be used as a basis for statistical analysis rather
Avoid "should" in non-normative notes. How about

Suggested change:
-PSNR metrics should primarily be used as a basis for statistical analysis rather
+Authors are expected to use PSNR metrics primarily as a basis for statistical analysis rather
https://w3c.github.io/webrtc-stats/#dom-rtcinboundrtpstreamstats-totalcorruptionprobability -- literally copied from this note, can you explain why it is ok there but not here?
I missed it in review of #788. It's not a hard rule, but most WG editors I've spoken with agree that avoiding lowercase requirement-laden words avoids confusion and improves readability.
Both notes appear to be speaking to authors rather than implementers, which is fine. But the primary audience of specs is implementers, so it might help to clarify when speaking to someone else. Specs have no authority to say authors should or shouldn't do anything, so it seems more accurate to describe the usage the design anticipates, which is how I interpret these notes.
No, this note documents that implementers (well, the one) do not think authors should be abusing this for per-frame analysis. 810d67d avoids lower-case should.
</p>
<p>
The PSNR is defined in [[ISO-29170-1:2017]].
The frequency of PSNR measurements is [=implementation-defined=].
OK with me if for any reasonable value of X (I'm not married to any particular value, as long as we have one that lets us WPT test)
Suggested change:
-The frequency of PSNR measurements is [=implementation-defined=].
+The frequency of PSNR measurements is [=implementation-defined=],
+but SHOULD be no less than every X seconds.
A single value is sufficient for existence and can be used to write a WPT.
The argument here is for a WPT testing an interoperable floor on frequency, not just existence.
I only see benefits in the spec defining the implementor playground.
I believe @youennf also sought some implementer guidance here so we don't have to reverse engineer interoperable behavior.
If we can't resolve this in the PR let's aim to discuss this again next meeting.
I still have not seen an argument why the minimum frequency needs to be interoperable.
Also I assume you mean "intraoperable"?
Adding more unfounded numbers (unless you have run large scale experiments?) is not going to help.
1 second is quite likely to be an issue for high-resolution screen sharing.
Going once
Edit: ^ s/every second/every few seconds/
If situations arise where implementations fall outside these bounds we can always revisit.
Since a vague soft limit doesn't avoid the "numerator is zero" case, what problem does this solve?
What is the behavior when the track was removed via replaceTrack? Follow-up question in #619...
Since a vague soft limit doesn't avoid the "numerator is zero" case, what problem does this solve?
@youennf does it address your concern in #794 (comment)?
What is the behavior when the track was removed via replaceTrack?
The encoding frame rate drops to zero, which is lower than 1/15, and no PSNR measurements happen.
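For illustration, the "every X seconds or the encoding frame rate, whichever is lower" rule reduces to simple arithmetic. The function name is made up, and intervalSeconds is whatever X the spec settles on (5 and 15 were both floated in this thread):

// Floor on PSNR measurement rate: at least one measurement per X seconds,
// capped by the encoding frame rate. A muted or removed track encodes at
// 0 fps, so the required floor drops to 0 measurements per second.
function minPsnrMeasurementsPerSecond(encodingFrameRate, intervalSeconds) {
  return Math.min(1 / intervalSeconds, encodingFrameRate);
}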
My understanding is that, if the encoder is providing PSNR, there is probably no perf issue in doing so for every frame. Based on this, I would tend to remove
That's not quite the case. Some hardware encoders do expose PSNR at a minimal performance penalty, so exposing it for every frame is fine there. However, there are a number of software encoders (e.g. libvpx) that have the ability to calculate PSNR but do not typically do so. It's commonly used in RD-based control methods, which use a lot of CPU and are not suitable for realtime encoding. However, the PSNR feature can be enabled/disabled on a per-frame basis even in the fast encoding modes, which is what is being discussed here. Getting the values is still valuable, but we don't want to enable it for every frame as that will come at a noticeable CPU penalty. Then of course there are encoders that don't have the ability to output PSNR at all. Say we have a software encoder that does expose it and a hardware encoder of the same type that does not, and at runtime encoding is switched back and forth between hardware and software (e.g. due to resolution constraints or just random failures); what do we do then? I still think the easiest way is to allow the update frequency to be zero for extended periods of time. The user just has to be aware that the metric may or may not be available at any given time, so if no PSNR measurements have been added between calls to getStats() you have to interpret that as "undefined", not "not available".
I think psnrSum/psnrMeasurements gives us the information even in periods when it's not updated. I.e. if an app is calling getStats regularly, say every second or every 2 seconds, no updates in psnrSum and psnrMeasurements already tells us that no measurements were made. It would be great if PSNR were calculated per frame, but I'm okay with it being updated as often as the codec feels it can without a performance penalty.
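A sketch of that interpretation as a hypothetical helper: an interval where the counter did not advance yields "no data" rather than zero. Name and snapshot shape are made up; keys follow this PR's proposal.

// Given two consecutive outbound-rtp snapshots, return the interval's
// average Y-plane PSNR, or undefined when no new measurements landed.
function intervalPsnrY(prev, curr) {
  const n = (curr.psnrMeasurements ?? 0) - (prev.psnrMeasurements ?? 0);
  if (n <= 0) return undefined; // no update this interval: no data, not zero
  return (curr.psnrSum.y - (prev.psnrSum?.y ?? 0)) / n;
}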
The fact that some encoders can or cannot expose PSNR would be new information exposed to the web and could be used for fingerprinting. Also, this PR gives no implementor's guideline. My assumption is that a single measurement frequency would be used for all encoders, that this frequency value would be fixed for a given UA instance, and probably for a given UA across all devices it runs on (say, a specific version of Chrome). It would be good to clarify this; otherwise I could see potential additional threats. Maybe the PING WG should weigh in there.
This would map essentially 1:1 with the implementation used, and that can already pretty easily be inferred (e.g. via https://www.w3.org/TR/webrtc-stats/#dom-rtcoutboundrtpstreamstats-encoderimplementation or platform+https://www.w3.org/TR/webrtc-stats/#dom-rtcoutboundrtpstreamstats-powerefficientencoder, not to mention info from WebCodecs, WebGPU, parsing data from encoded transform, etc etc). So while it might be a new "bit", it doesn't actually provide any new information imo.
Can we let the implementor's guideline just be along the lines of what has been said above, e.g. "the frequency should be as high as possible as long as the performance impact can be kept negligible"? I don't see a reason to change the frequency based on codec type, only by implementation performance overhead. For a given implementation, though, I don't see a reason to change the frequency; detailing that this should be fixed for a given UA seems fine to me.
Both. A possibility is to restrict psnr in the same manner.
I find it useful information.
Those don't seem like great examples, as they're blocked on whether exposing hardware is allowed, unless we're suggesting adding that requirement here? What are the other examples?
Agreed. Clarifying these assumptions in the guidance can only help.
Doesn't tying it too tightly to performance make it another performance metric? I like the part that it should not vary by codec.
Even on hardware encoders it has an impact on power consumption and the return on doing it on every frame is not there, see the parts in the paper that talk about subsampling. I'm fine with gating on HW.
If so, PSNR gathering could be opt-in, something like:
Is it overkill?
Quoting https://w3c.github.io/webrtc-stats/#guidelines-for-design-of-stats-objects:
Until now, there was no concern about stats being potentially computer intensive. |
This is similar to qpSum but codec-independent.
Since PSNR requires additional computation it is defined with an
accompanying psnrMeasurements counter to allow the computation of
an average PSNR.
Defined as a record with components for the Y, U and V planes respectively.
See also
https://datatracker.ietf.org/doc/html/rfc8761#section-5