AV1 over WHEP fails for AMD-VAAPI (mesa) publishers — bitstream lacks Temporal Delimiters + emits large Padding OBUs #5774

@oblivion8282-1337

Summary

Streams encoded by Mesa's av1_vaapi (AMD GPUs) decode reliably as files but freeze in Chromium-based WHEP receivers within a few seconds — first GOP looks fine, then the reference chain breaks and the picture stalls while RTP packets keep arriving. Streams encoded by av1_nvenc (NVIDIA) over the same pipeline work flawlessly.

The Chromium-side RTCInboundRtpStreamStats for the AMD case shows the symptom cleanly (12 s sample):

framesReceived:   730   ← RTP keeps flowing
framesDecoded:    145   ← 80 % failure
keyFramesDecoded:   7
framesDropped:     21
pliCount:          53
decoderImplementation: "dav1d (fallback from: ExternalDecoder (VaapiVideoDecoder))"

Tested on 1.17.1; AV1 RTMP→WHEP isn't usable on 1.18.x because of #5728, so 1.17.1 is the relevant version for AMD-VAAPI publishers.

Repro

Push an AV1 stream over Enhanced-RTMP from AMD hardware (Mesa ≥ 23 + gpu-screen-recorder's -k av1 -c flv defaults, or any other producer using av1_vaapi), consume via WHEP from Chromium / Electron. The freeze is encoder-shape-dependent, not content-dependent, and reproduces on a static UI just as well as a game capture.

Two small FLV samples (AV1 + Opus, ~10 s, captured from gpu-screen-recorder) illustrate the structural difference:

  • amd-vaapi-av1.flv (Mesa av1_vaapi) — freezes in WHEP
  • nvenc-av1.flv (NVIDIA av1_nvenc) — works in WHEP

Happy to share these via a non-public channel if useful — they're screen recordings so I'd rather not attach them to a public issue.

Root cause — two AV1 OBU-shape asymmetries

OBU type counts from ffmpeg -bsf:v trace_headers on each file:

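| obu_type | Name | AMD-VAAPI | NVENC |
|---|---|---|---|
| 1 | Sequence Header | 6 | 4 |
| 2 | Temporal Delimiter | 0 | 331 |
| 3 | Frame Header | 504 | 331 |
| 4 | Tile Group | 504 | 331 |
| 15 | Padding | 29 (1232–8230 bytes each) | 0 |

(Names follow the AV1 spec's obu_type numbering, where 3 is OBU_FRAME_HEADER and 4 is OBU_TILE_GROUP; OBU_FRAME and OBU_REDUNDANT_FRAME_HEADER are types 6 and 7.)
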
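For anyone who would rather tally OBU types in-process than through trace_headers, here is a minimal Go sketch. It assumes temporal units are already split into one byte slice per OBU, as they are inside the OnDataAV1 callback. The helper names (`obuTypeOf`, `countOBUTypes`) are hypothetical, and the input bytes are synthetic headers, not encoder output. Type names follow the AV1 spec's obu_type assignments.

```go
package main

import "fmt"

// obuTypeOf extracts the 4-bit obu_type from the first OBU header byte.
// Header layout: forbidden bit, obu_type (4 bits), extension flag,
// has_size_field, reserved bit.
func obuTypeOf(obu []byte) (int, bool) {
	if len(obu) == 0 {
		return 0, false
	}
	return int(obu[0]>>3) & 0x0F, true
}

// obuTypeNames lists only the types relevant to this issue.
var obuTypeNames = map[int]string{
	1:  "Sequence Header",
	2:  "Temporal Delimiter",
	3:  "Frame Header",
	4:  "Tile Group",
	15: "Padding",
}

// countOBUTypes tallies obu_type values over a list of temporal units.
// Empty OBUs are skipped.
func countOBUTypes(tus [][][]byte) map[int]int {
	counts := make(map[int]int)
	for _, tu := range tus {
		for _, obu := range tu {
			if t, ok := obuTypeOf(obu); ok {
				counts[t]++
			}
		}
	}
	return counts
}

func main() {
	// Synthetic example: only the header byte matters here.
	tus := [][][]byte{
		{{0x10}, {0x18}, {0x20}},             // TD, frame header, tile group
		{{0x18}, {0x20}, {0x78, 0x00, 0x00}}, // no TD, plus a padding OBU
	}
	counts := countOBUTypes(tus)
	for _, t := range []int{1, 2, 3, 4, 15} {
		fmt.Printf("%2d %-20s %d\n", t, obuTypeNames[t], counts[t])
	}
}
```
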
  1. No OBU_TEMPORAL_DELIMITER. Spec-permitted in low-overhead bitstream form (AV1 §5.6), but libwebrtc-side AV1 RTP receivers tend to lean on them for frame-boundary detection. NVENC emits one per frame; Mesa's av1_vaapi emits none. dav1d-on-disk is tolerant either way (which is why the captured file decodes cleanly), but routed through MediaMTX→WHEP→Chromium the missing TempDelim costs ~80 % of frames on its own.
  2. Large OBU_PADDING. Mesa pads the bitstream to meet CBR bitrate on low-motion content with multi-KB padding OBUs (1.2–8.2 KB each in our samples). Per spec these are no-ops, but in practice fragmenting an 8 KB padding OBU across ~6 RTP packets seems to break libwebrtc's reassembly. The receiver-stat pattern "first ~20 frames of each GOP decode, reference chain breaks until the next keyframe" matches a per-GOP reassembly breakdown rather than a per-frame decoder reject.
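The "~6 RTP packets" figure in point 2 is a ceiling division. The sketch below assumes roughly 1400 bytes of usable payload per packet and ignores the AV1 aggregation header and any per-fragment overhead. Both the payload size and the function name `fragmentsFor` are illustrative assumptions, not values taken from MediaMTX or libwebrtc.

```go
package main

import "fmt"

// fragmentsFor returns how many packets are needed to carry obuSize bytes
// when each packet has room for payloadSize bytes of OBU data.
func fragmentsFor(obuSize, payloadSize int) int {
	if obuSize <= 0 || payloadSize <= 0 {
		return 0
	}
	return (obuSize + payloadSize - 1) / payloadSize // ceiling division
}

func main() {
	const payloadSize = 1400 // assumed usable payload per RTP packet

	// Smallest and largest padding OBUs seen in the AMD sample.
	for _, size := range []int{1232, 8230} {
		fmt.Printf("%d-byte padding OBU -> %d packet(s)\n",
			size, fragmentsFor(size, payloadSize))
	}
}
```

Under that assumption the smallest padding OBU fits in one packet and the largest needs six. All of those bytes are a no-op for the decoder.
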

Either alone is enough to make AMD-AV1 unwatchable; together they produce a ~96 % decode-failure rate.

Suggested fix

Normalising the temporal unit in the OnDataAV1 callback of internal/protocols/rtmp/to_stream.go handles both at once, and the resulting OBU layout matches the shape of NVENC's input (the frame count aside). Patch:

r.OnDataAV1(track, func(pts time.Duration, tu [][]byte) {
    // Prepend OBU_TEMPORAL_DELIMITER if missing; strip OBU_PADDING.
    needTD := len(tu) == 0 || len(tu[0]) == 0 || (tu[0][0]>>3)&0xF != 2
    hasPadding := false
    for _, obu := range tu {
        if len(obu) > 0 && (obu[0]>>3)&0xF == 15 {
            hasPadding = true
            break
        }
    }
    if needTD || hasPadding {
        out := make([][]byte, 0, len(tu)+1)
        if needTD {
            // 0x10 = TempDelim, has_size=0 per mediacommon's stored-OBU convention.
            // av1.Bitstream.Marshal re-injects has_size + LEB128(0) on the wire.
            out = append(out, []byte{0x10})
        }
        for _, obu := range tu {
            if len(obu) == 0 || (obu[0]>>3)&0xF != 15 {
                out = append(out, obu)
            }
        }
        tu = out
    }
    (*subStream).WriteUnit(medi, forma, &unit.Unit{ /* ... */ })
})

Equivalent to running ffmpeg -bsf:v "av1_metadata=td=insert:delete_padding=1" on the bitstream before muxing. Verified end-to-end on a forked 1.17.1 image — AMD-VAAPI HQ streams that froze within 5 s now decode cleanly for arbitrary durations, with the receiver stats no longer showing the dav1d fallback or PLI flood.
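For review purposes, the same normalisation can be pulled out into a standalone helper that is easy to unit-test. This is a sketch only: the names `normalizeTU` and `typeOf` are hypothetical. Unlike the patch, it always allocates a new slice instead of first checking whether anything needs to change. The 0x10 delimiter byte follows the stored-OBU convention described in the patch comment (has_size = 0).

```go
package main

import "fmt"

const (
	obuTemporalDelimiter = 2
	obuPadding           = 15
)

// typeOf returns the 4-bit obu_type of a non-empty OBU.
func typeOf(obu []byte) int {
	return int(obu[0]>>3) & 0x0F
}

// normalizeTU returns a copy of the temporal unit that starts with a
// temporal delimiter and contains no padding OBUs. Empty OBUs are passed
// through unchanged, as in the patch.
func normalizeTU(tu [][]byte) [][]byte {
	out := make([][]byte, 0, len(tu)+1)

	// Prepend a delimiter unless the unit already starts with one.
	if len(tu) == 0 || len(tu[0]) == 0 || typeOf(tu[0]) != obuTemporalDelimiter {
		out = append(out, []byte{0x10}) // obu_type 2, has_size_field 0
	}

	for _, obu := range tu {
		if len(obu) > 0 && typeOf(obu) == obuPadding {
			continue // drop padding, which is a decoder no-op
		}
		out = append(out, obu)
	}
	return out
}

func main() {
	// Synthetic AMD-shaped unit: frame header, padding, tile group, no delimiter.
	tu := [][]byte{{0x18, 0xAA}, {0x78, 0x00, 0x00}, {0x20, 0xBB}}
	for _, obu := range normalizeTU(tu) {
		fmt.Printf("type %d, %d byte(s)\n", typeOf(obu), len(obu))
	}
}
```

A unit that already starts with a delimiter and carries no padding comes back with the same OBUs in the same order, so NVENC-shaped input should be unaffected.
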

Happy to send a PR if the approach looks right. Putting it inside OnDataAV1 keeps the fix scoped to the RTMP-ingest path. Arguably nicer homes would be the AV1 packetizer in gortsplib/pkg/format/rtpav1 or even Pion itself, but that depends on where the project wants to draw the line between "publisher input that should be normalised" and "packetizer/receiver responsibility to be tolerant".

Notes

  • The publisher (gpu-screen-recorder in our case) uses Enhanced-RTMP with AV_CODEC_FLAG_GLOBAL_HEADER, so the AV1 sequence header is in the FLV config message rather than inline at each IDR. Both encoders treat that the same way, so it's not the differentiator.
  • Receiver is Chromium 148 / Electron 42; same behaviour reproduces in stock Chromium and (anecdotally) Firefox.
  • Related: #5728 ("AV1 HLS Muxer Error not enough bytes") is what keeps us on 1.17.x, so this fix would unblock AMD users on the last release where AV1-RTMP→WHEP works at all.
