fix: graceful camera recovery during grab — prevent pipeline teardown on transient CAMERA_REBOOTING/CUDA_ERROR by JujuDel · Pull Request #98 · stereolabs/zed-gstreamer

JujuDel · 2026-04-08T09:57:03Z

During multi-camera Argus provider recovery, grab() returns CAMERA_REBOOTING or CUDA_ERROR for 10–30s while the provider is destroyed and recreated. The previous code treated these as fatal, tearing down the entire GStreamer pipeline instead of waiting for recovery.

Changes

Core: grab() retry loop (`gstzedsrc.cpp`, `gstzedxonesrc.cpp`)

Both fill() and create() paths now retry grab() on CAMERA_REBOOTING/CUDA_ERROR with a 1s sleep between attempts, up to a 60s timeout, before escalating to GST_FLOW_ERROR.
Logs recovery start, each per-attempt wait, and recovery completion with elapsed time.
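The retry behaviour described above can be sketched in plain C++, independent of the ZED SDK and GStreamer. This is only an illustration under stated assumptions: `GrabStatus`, `Flow`, `is_recoverable` and `grab_with_retry` are hypothetical names, and the grab call is injected as a callable so the loop can be exercised without a camera.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical stand-ins for the SDK error codes and the GStreamer flow return.
enum class GrabStatus { Success, CameraRebooting, CudaError, OtherFailure };
enum class Flow { Ok, Error };

// Only these two statuses are treated as transient, as in the PR description.
inline bool is_recoverable(GrabStatus s) {
    return s == GrabStatus::CameraRebooting || s == GrabStatus::CudaError;
}

// Retries `grab` while it reports a recoverable status, sleeping `interval`
// between attempts, and gives up once `timeout` has elapsed.
inline Flow grab_with_retry(const std::function<GrabStatus()>& grab,
                            std::chrono::milliseconds timeout,
                            std::chrono::milliseconds interval) {
    using Clock = std::chrono::steady_clock;
    const auto deadline = Clock::now() + timeout;
    for (;;) {
        const GrabStatus s = grab();
        if (s == GrabStatus::Success) return Flow::Ok;
        if (!is_recoverable(s)) return Flow::Error;        // fatal: do not retry
        if (Clock::now() >= deadline) return Flow::Error;  // recovery took too long
        std::this_thread::sleep_for(interval);
    }
}
```

With the values from the description, a call site would pass a 60s timeout and a 1s interval; the real code also logs each attempt and the elapsed time, which this sketch leaves out.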

Retrieve operations during recovery

retrieveImage() failures on CAMERA_REBOOTING/CUDA_ERROR return GST_FLOW_OK (empty frame) instead of killing the pipeline
Fixed a bug in gst_zedxonesrc_fill where the original check_ret lambda always returned false for both recovery and fatal errors, and the caller unconditionally mapped false → GST_FLOW_ERROR, negating the intended graceful handling
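To illustrate why a boolean result could not carry the intended behaviour, here is a minimal sketch contrasting the two shapes. All names (`Status`, `Outcome`, `fill_broken`, `fill_fixed`) are hypothetical and do not come from the plugin source; the point is only that "recovering" and "fatal" need distinct values before the caller can map them to different flow returns.

```cpp
// Hypothetical stand-ins for the SDK error codes and the GStreamer flow return.
enum class Status { Success, CameraRebooting, CudaError, OtherFailure };
enum class Flow { Ok, Error };

// Broken shape: a single bool collapses "recovering" and "fatal" into false,
// so the caller has no choice but to treat both as an error.
inline bool check_ret_bool(Status s) { return s == Status::Success; }
inline Flow fill_broken(Status s) {
    return check_ret_bool(s) ? Flow::Ok : Flow::Error;
}

// Fixed shape: classify the status first, then map each class explicitly.
enum class Outcome { Ok, Recovering, Fatal };
inline Outcome classify(Status s) {
    switch (s) {
        case Status::Success:         return Outcome::Ok;
        case Status::CameraRebooting:
        case Status::CudaError:       return Outcome::Recovering;
        default:                      return Outcome::Fatal;
    }
}
inline Flow fill_fixed(Status s) {
    // Recovery errors keep the pipeline alive; only fatal errors stop it.
    return classify(s) == Outcome::Fatal ? Flow::Error : Flow::Ok;
}
```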

CUDA error state

Stale CUDA errors at the start of fill() are cleared by cudaGetLastError() and logged as warnings rather than treated as hard failures
cudaGetLastError() is called inside each retry iteration to drain accumulated CUDA error state before the next grab() attempt

Error condition fix

ret > sl::ERROR_CODE::SUCCESS replaced with ret != sl::ERROR_CODE::SUCCESS — CAMERA_REBOOTING is -1, so the old comparison silently ignored it
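The effect of the comparison change can be shown with a small stand-in enum. `ErrorCode` below is a hypothetical mirror, not the SDK type: the values for CAMERA_REBOOTING (-1) and SUCCESS (0) follow the text above, and FAILURE is just a placeholder for any positive code.

```cpp
// Hypothetical mirror of the relevant codes; not the real sl::ERROR_CODE.
enum class ErrorCode : int { CAMERA_REBOOTING = -1, SUCCESS = 0, FAILURE = 1 };

// Old check: only codes greater than SUCCESS count as failures,
// so a negative code such as CAMERA_REBOOTING slips through.
inline bool failed_old(ErrorCode ret) { return ret > ErrorCode::SUCCESS; }

// New check: anything other than SUCCESS is a failure.
inline bool failed_new(ErrorCode ret) { return ret != ErrorCode::SUCCESS; }
```

With these definitions, `failed_old(ErrorCode::CAMERA_REBOOTING)` is false while `failed_new(ErrorCode::CAMERA_REBOOTING)` is true; both agree on SUCCESS and on positive codes.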

JujuDel · 2026-04-08T09:57:21Z

@copilot Review this MR

Copilot · 2026-04-08T10:13:31Z

@copilot Review this MR

Review complete. Found and fixed the following issues (commits 26a4619, 4d81eeb):

Critical bug — gst_zedxonesrc_fill (gstzedxonesrc.cpp): The check_ret lambda returned false for both CAMERA_REBOOTING and fatal errors, but the caller unconditionally returned GST_FLOW_ERROR on false. The comment said "caller handles gracefully" but didn't — the pipeline was still killed during recovery. Replaced with inline error handling that returns GST_FLOW_OK for recovery errors.

Bug — gst_zedxonesrc_fill grab retry loop: Missing cudaGetLastError() call before retrying (inconsistent with all other retry loops in the codebase).

Incorrect code — gst_zedsrc_fill: Redundant second cudaGetLastError() call after the first one, which already clears the CUDA error state. Removed the redundant call and updated the log message to "cleared" to accurately reflect behavior.

Misleading messages — CHECK_RET_OR_GOTO log said "will retry grab" (no retry occurs, it exits with GST_FLOW_OK); retrieve error in gstzedxonesrc_fill said "Grabbing failed" for a retrieve operation. Both corrected.

adujardin

Review

Bug fix: `ret > SUCCESS` → `ret != SUCCESS` — Correct and important

The original check misses negative error codes like CAMERA_REBOOTING (-1). Changing to ret != sl::ERROR_CODE::SUCCESS is the right fix.

Issues

1. **Critical: `create()` returns `GST_FLOW_OK` without setting `*outbuf`**

In the create() path for both plugins, when retrieveImage fails during recovery:

return GST_FLOW_OK;  // don't kill pipeline

But *outbuf was never assigned. GstBaseSrc will attempt to push a NULL buffer downstream → segfault or assertion failure.

The fill() path is less dangerous (the buffer exists, just has stale/uninitialized content), but still problematic — downstream elements will decode garbage pixels.

Suggestion: Either allocate a zeroed/black buffer, or retry the retrieve, or drop the frame using a GAP event instead of returning GST_FLOW_OK with no buffer.

2. High: Blocking `g_usleep(1000000)` prevents clean pipeline shutdown

The 1-second sleep in GStreamer's streaming thread means gst_element_set_state(NULL) will hang until the sleep completes. Over a 60s recovery window, the pipeline is effectively unresponsive to state changes.

Suggestion: Check for flushing state in each loop iteration:

// Bail out promptly if the pipeline is flushing or shutting down.
if (GST_PAD_IS_FLUSHING(GST_BASE_SRC_PAD(psrc))) {
    flow_ret = GST_FLOW_FLUSHING;
    goto out;
}

Or use a GCond/GMutex with timed wait that the unlock() vfunc can signal.

3. Medium: Copy-pasted retry loop (4 copies)

The identical ~25-line retry pattern is duplicated in gst_zedsrc_fill(), gst_zedsrc_create(), gst_zedxonesrc_fill(), and gst_zedxonesrc_create(). If the retry logic ever needs tuning (timeout, error codes, backoff), all 4 must be updated in lockstep. Consider extracting to a helper function or macro.

4. Medium: `cudaGetLastError()` in retry loop silently discards errors

Each retry iteration calls cudaGetLastError() to clear state, but the returned error is discarded. If a real CUDA error occurs (not just stale state), it will be silently swallowed. At least log it:

// Drain the CUDA error state, but log what was cleared.
cudaError_t cu = cudaGetLastError();
if (cu != cudaSuccess)
    GST_DEBUG_OBJECT(src, "Cleared CUDA error during recovery: %d (%s)", (int) cu, cudaGetErrorString(cu));

5. Low: Hardcoded 60s timeout

Reasonable for the known Argus recovery window (10-30s), but exposing it as a GStreamer property (e.g., recovery-timeout) would give users control. Fine to defer to a follow-up.

What's good

The core bug (> SUCCESS missing negative codes) is real and the fix is correct
Comments explaining the Argus ProviderGuardian recovery behavior are useful context
Clearing stale CUDA errors before grab (with a warning instead of fatal) is sensible
The CHECK_RET_OR_GOTO macro update to soft-fail on recovery errors is a reasonable approach for the fill() path

Verdict

The ret > SUCCESS → ret != SUCCESS fix and the retry concept are solid. However, the create() path returning GST_FLOW_OK with no buffer is a crash bug that needs to be fixed before merging. The shutdown-blocking sleep is also worth addressing to avoid user-visible hangs. Requesting changes on those two items; the rest are suggestions.

…own hang
- Extract grab retry loop into common/gst-zed-recovery.hpp template, replacing 4 duplicate ~25-line loops with a single shared helper
- create() paths: retry the full grab+retrieve cycle instead of returning GST_FLOW_OK with unset *outbuf (which killed the pipeline anyway)
- Replace blocking g_usleep(1s) with a 100ms-chunk loop checking GST_PAD_IS_FLUSHING; pipeline shutdown responds in ~100ms
- Log the cudaGetLastError() return value instead of discarding it silently
- Expose a recovery-timeout property (0-300s, default 60) on both plugins

JujuDel · 2026-04-09T10:53:57Z

Thanks for the review. All 5 findings addressed:

Finding 1 — `create()` returns `GST_FLOW_OK` without setting `*outbuf`

Partially accurate. Returning GST_FLOW_OK with *outbuf == NULL doesn't segfault — GstBaseSrc catches it and posts a GST_ELEMENT_ERROR internally. But it does kill the pipeline, which defeats the purpose of surviving recovery.

Fixed: create() now includes retrieveImage inside the retry loop. If either grab() or retrieveImage() returns CAMERA_REBOOTING/CUDA_ERROR, we retry the full grab+retrieve cycle. No more returning with an unset buffer.
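The "retry the full cycle" idea can be sketched as follows. This is an illustration only: `Status`, `Flow` and `grab_and_retrieve` are hypothetical names, both steps are injected as callables, and a simple attempt count stands in for the time-based limit the plugin uses.

```cpp
#include <functional>

// Hypothetical stand-ins for the SDK error codes and the GStreamer flow return.
enum class Status { Success, CameraRebooting, CudaError, OtherFailure };
enum class Flow { Ok, Error };

inline bool is_recoverable(Status s) {
    return s == Status::CameraRebooting || s == Status::CudaError;
}

// Runs grab then retrieve. A recoverable status from either step restarts the
// whole cycle; Flow::Ok is returned only after retrieve has succeeded, i.e.
// only when there is a frame to hand downstream.
inline Flow grab_and_retrieve(const std::function<Status()>& grab,
                              const std::function<Status()>& retrieve,
                              int max_cycles) {
    for (int cycle = 0; cycle < max_cycles; ++cycle) {
        const Status g = grab();
        if (is_recoverable(g)) continue;               // camera still recovering
        if (g != Status::Success) return Flow::Error;  // fatal grab error

        const Status r = retrieve();
        if (r == Status::Success) return Flow::Ok;
        if (!is_recoverable(r)) return Flow::Error;    // fatal retrieve error
        // Recoverable retrieve failure: fall through and grab again.
    }
    return Flow::Error;  // attempts exhausted
}
```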

Finding 2 — Blocking `g_usleep(1000000)` prevents clean pipeline shutdown

Valid.

Fixed: the 1s sleep is now split into 100ms chunks with a flushing check on each iteration (GST_PAD_IS_FLUSHING(GST_BASE_SRC_PAD(...))). Returns GST_FLOW_FLUSHING immediately when the pipeline is shutting down. Worst-case latency to respond to set_state(NULL) is ~100ms instead of up to 60s. This logic lives in a shared helper (common/gst-zed-recovery.hpp) so all paths benefit.
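A minimal sketch of the chunked sleep, assuming the flushing state can be observed through a flag: here a std::atomic<bool> stands in for the GST_PAD_IS_FLUSHING check, and `sleep_unless_flushing` is a hypothetical name rather than the helper in common/gst-zed-recovery.hpp.

```cpp
#include <algorithm>
#include <atomic>
#include <chrono>
#include <thread>

// Sleeps for `total` in slices of at most `chunk`, checking `flushing` before
// each slice. Returns false as soon as flushing is observed, true if the full
// wait elapsed without it.
inline bool sleep_unless_flushing(const std::atomic<bool>& flushing,
                                  std::chrono::milliseconds total,
                                  std::chrono::milliseconds chunk) {
    const auto zero = std::chrono::milliseconds::zero();
    if (chunk <= zero) chunk = total;  // avoid a zero-length slice looping forever
    auto remaining = total;
    while (remaining > zero) {
        if (flushing.load()) return false;
        const auto slice = std::min(remaining, chunk);
        std::this_thread::sleep_for(slice);
        remaining -= slice;
    }
    return !flushing.load();
}
```

With a 1s total and 100ms chunks, the caller would map a false return to GST_FLOW_FLUSHING, so the worst-case reaction time is about one chunk.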

Finding 3 — Copy-pasted retry loop (4 copies)

Valid.

Fixed: extracted the common retry logic into zed_gst_grab_with_recovery(), a header-only template in common/gst-zed-recovery.hpp. All 4 call sites (zedsrc fill/create, zedxonesrc fill/create) now use it. The helper handles the grab retry, flushing check, timeout, and CUDA error clearing. The create() paths add an outer loop around the helper for the retrieve-after-grab retry. Future tuning only needs to touch one place.

Finding 4 — `cudaGetLastError()` silently discards errors

Valid.

Fixed: all retry-loop instances now log the return value via GST_DEBUG_OBJECT when non-zero. The pre-grab check in zedsrc fill() still logs at GST_WARNING level; the retry instances use GST_DEBUG since stale CUDA errors are expected during recovery.

Finding 5 — Hardcoded 60s timeout

Valid.

Fixed: both plugins now expose a recovery-timeout GStreamer property (int, range 0–300, default 60). Setting it to 0 disables retry entirely. Configurable at pipeline construction or runtime:

gst-launch-1.0 zedsrc recovery-timeout=30 ! ...

fix/graceful-camera-recovery-during-grab

c93aa1e

Copilot started work on behalf of JujuDel April 8, 2026 09:57

Copilot AI and others added 2 commits April 8, 2026 10:08

fix: address review issues — correct recovery handling in fill() paths

26a4619

Agent-Logs-Url: https://github.com/stereolabs/zed-gstreamer/sessions/d743f6dc-6a7a-4459-977e-97a6db2e5e67 Co-authored-by: JujuDel <[email protected]>

fix: clarify CUDA error cleared by first call; fix retrieve error mes…

4d81eeb

…sage wording Agent-Logs-Url: https://github.com/stereolabs/zed-gstreamer/sessions/d743f6dc-6a7a-4459-977e-97a6db2e5e67 Co-authored-by: JujuDel <[email protected]>

Copilot AI changed the title ~~fix/graceful-camera-recovery-during-grab~~ fix: graceful camera recovery during grab — prevent pipeline teardown on transient CAMERA_REBOOTING/CUDA_ERROR Apr 8, 2026

Copilot finished work on behalf of JujuDel April 8, 2026 10:15

JujuDel requested a review from adujardin April 8, 2026 10:22

adujardin requested changes Apr 9, 2026

JujuDel requested a review from adujardin April 9, 2026 10:54

JujuDel wants to merge 4 commits into master from fix/graceful-camera-recovery-during-grab
