fix: graceful camera recovery during grab — prevent pipeline teardown on transient CAMERA_REBOOTING/CUDA_ERROR #98

Open
JujuDel wants to merge 4 commits into master from fix/graceful-camera-recovery-during-grab

Conversation

JujuDel (Member) commented Apr 8, 2026

During multi-camera Argus provider recovery, grab() returns CAMERA_REBOOTING or CUDA_ERROR for 10–30s while the provider is destroyed and recreated. The previous code treated these as fatal, tearing down the entire GStreamer pipeline instead of waiting for recovery.

Changes

Core: grab() retry loop (gstzedsrc.cpp, gstzedxonesrc.cpp)

  • Both fill() and create() paths now retry grab() on CAMERA_REBOOTING/CUDA_ERROR with 1s sleep between attempts, up to a 60s timeout before escalating to GST_FLOW_ERROR
  • Logs recovery start, per-attempt wait, and recovery completion with elapsed time
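The retry behaviour described above can be sketched as follows. This is a minimal illustration, not the code from the PR: `ErrorCode` is a stand-in for `sl::ERROR_CODE` (only `CAMERA_REBOOTING == -1` is taken from this conversation, the other values are assumptions), and `grab_with_retry` and `is_recoverable` are hypothetical names. In the plugin, the interval would be 1s, the timeout 60s, and a non-SUCCESS result would be mapped to GST_FLOW_ERROR by the caller.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Stand-in for sl::ERROR_CODE. Only CAMERA_REBOOTING == -1 comes from the PR
// text; the remaining values are assumed for illustration.
enum class ErrorCode : int { CAMERA_REBOOTING = -1, SUCCESS = 0, CUDA_ERROR = 1, FAILURE = 2 };

// Transient errors that are expected while the provider is being recreated.
inline bool is_recoverable(ErrorCode e) {
    return e == ErrorCode::CAMERA_REBOOTING || e == ErrorCode::CUDA_ERROR;
}

// Calls grab() until it succeeds, fails with a non-recoverable error,
// or the timeout expires. Returns the last error code seen.
inline ErrorCode grab_with_retry(const std::function<ErrorCode()>& grab,
                                 std::chrono::milliseconds timeout,
                                 std::chrono::milliseconds interval) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    ErrorCode ret = grab();
    while (is_recoverable(ret) && std::chrono::steady_clock::now() < deadline) {
        std::this_thread::sleep_for(interval);  // wait before the next attempt
        ret = grab();
    }
    return ret;
}
```

A fatal error returns after a single grab() call, so only the two transient codes pay the waiting cost.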

Retrieve operations during recovery

  • retrieveImage() failures on CAMERA_REBOOTING/CUDA_ERROR return GST_FLOW_OK (empty frame) instead of killing the pipeline
  • Fixed a bug in gst_zedxonesrc_fill where the original check_ret lambda always returned false for both recovery and fatal errors, and the caller unconditionally mapped false → GST_FLOW_ERROR, negating the intended graceful handling

CUDA error state

  • Stale CUDA errors at the start of fill() are cleared by cudaGetLastError() and logged as warnings rather than treated as hard failures
  • cudaGetLastError() is called inside each retry iteration to drain accumulated CUDA error state before the next grab() attempt

Error condition fix

  • ret > sl::ERROR_CODE::SUCCESS replaced with ret != sl::ERROR_CODE::SUCCESS: CAMERA_REBOOTING is -1, so the old comparison silently ignored it
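To make the comparison bug concrete, here is a small sketch using a stand-in enum. `ErrorCode` is not the real `sl::ERROR_CODE`; the only value taken from this PR is `CAMERA_REBOOTING == -1`, and `FAILURE = 1` is an assumed positive code. The function names are hypothetical.

```cpp
// Stand-in for sl::ERROR_CODE (values other than CAMERA_REBOOTING are assumed).
enum class ErrorCode : int { CAMERA_REBOOTING = -1, SUCCESS = 0, FAILURE = 1 };

// Old check: only codes greater than SUCCESS count as errors.
constexpr bool old_is_error(ErrorCode ret) { return ret > ErrorCode::SUCCESS; }

// New check: anything other than SUCCESS counts as an error.
constexpr bool new_is_error(ErrorCode ret) { return ret != ErrorCode::SUCCESS; }

// The negative code slips through the old check but not the new one.
static_assert(!old_is_error(ErrorCode::CAMERA_REBOOTING), "old check misses -1");
static_assert(new_is_error(ErrorCode::CAMERA_REBOOTING), "new check catches -1");
static_assert(!new_is_error(ErrorCode::SUCCESS), "SUCCESS is never an error");
```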

JujuDel (Member, Author) commented Apr 8, 2026

@copilot Review this MR

Copilot AI commented Apr 8, 2026

@copilot Review this MR

Review complete. Found and fixed the following issues (commits 26a4619, 4d81eeb):

Critical bug: gst_zedxonesrc_fill (gstzedxonesrc.cpp). The check_ret lambda returned false for both CAMERA_REBOOTING and fatal errors, but the caller unconditionally returned GST_FLOW_ERROR on false. The comment said "caller handles gracefully", but the caller did not: the pipeline was still killed during recovery. Replaced with inline error handling that returns GST_FLOW_OK for recovery errors.

Bug: gst_zedxonesrc_fill grab retry loop. Missing cudaGetLastError() call before retrying (inconsistent with all other retry loops in the codebase).

Incorrect code: gst_zedsrc_fill. Redundant second cudaGetLastError() call after the first one, which already clears the CUDA error state. Removed the redundant call and updated the log message to "cleared" to accurately reflect behavior.

Misleading messages: the CHECK_RET_OR_GOTO log said "will retry grab" (no retry occurs, it exits with GST_FLOW_OK); the retrieve error in gst_zedxonesrc_fill said "Grabbing failed" for a retrieve operation. Both corrected.

Copilot AI changed the title from "fix/graceful-camera-recovery-during-grab" to "fix: graceful camera recovery during grab — prevent pipeline teardown on transient CAMERA_REBOOTING/CUDA_ERROR" Apr 8, 2026
@JujuDel JujuDel requested a review from adujardin April 8, 2026 10:22
adujardin (Member) left a comment

Review

Bug fix: ret > SUCCESS → ret != SUCCESS (correct and important)

The original check misses negative error codes like CAMERA_REBOOTING (-1). Changing to ret != sl::ERROR_CODE::SUCCESS is the right fix.


Issues

1. Critical: create() returns GST_FLOW_OK without setting *outbuf

In the create() path for both plugins, when retrieveImage fails during recovery:

return GST_FLOW_OK;  // don't kill pipeline

But *outbuf was never assigned. GstBaseSrc will attempt to push a NULL buffer downstream → segfault or assertion failure.

The fill() path is less dangerous (the buffer exists, just has stale/uninitialized content), but still problematic — downstream elements will decode garbage pixels.

Suggestion: Either allocate a zeroed/black buffer, or retry the retrieve, or drop the frame using a GAP event instead of returning GST_FLOW_OK with no buffer.

2. High: Blocking g_usleep(1000000) prevents clean pipeline shutdown

The 1-second sleep in GStreamer's streaming thread means gst_element_set_state(NULL) will hang until the sleep completes. Over a 60s recovery window, the pipeline is effectively unresponsive to state changes.

Suggestion: Check for flushing state in each loop iteration:

if (gst_base_src_is_flushing(GST_BASE_SRC(psrc))) {
    flow_ret = GST_FLOW_FLUSHING;
    goto out;
}

Or use a GCond/GMutex with timed wait that the unlock() vfunc can signal.
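The timed-wait alternative can be sketched with the C++ standard library. This is an analogue of the GCond/GMutex idea rather than GLib code: `InterruptibleWait` is a hypothetical class, and its `unlock()` / `unlock_stop()` methods are meant to be called from the element's GstBaseSrc unlock() and unlock_stop() vfuncs.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Sleep that can be cut short from another thread.
class InterruptibleWait {
public:
    // Blocks for up to `d`. Returns true if the full duration elapsed,
    // false if the wait was interrupted by unlock().
    bool wait_for(std::chrono::milliseconds d) {
        std::unique_lock<std::mutex> lock(mutex_);
        // The predicate form returns the predicate's value on exit:
        // true means flushing_ was set, false means the timeout expired.
        return !cond_.wait_for(lock, d, [this] { return flushing_; });
    }

    // Wake any waiter and keep later waits from blocking (unlock() vfunc).
    void unlock() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            flushing_ = true;
        }
        cond_.notify_all();
    }

    // Re-arm the wait once flushing is over (unlock_stop() vfunc).
    void unlock_stop() {
        std::lock_guard<std::mutex> lock(mutex_);
        flushing_ = false;
    }

private:
    std::mutex mutex_;
    std::condition_variable cond_;
    bool flushing_ = false;
};
```

Compared with polling, the waiter wakes as soon as unlock() is signalled, so shutdown latency does not depend on a chunk size.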

3. Medium: Copy-pasted retry loop (4 copies)

The identical ~25-line retry pattern is duplicated in gst_zedsrc_fill(), gst_zedsrc_create(), gst_zedxonesrc_fill(), and gst_zedxonesrc_create(). If the retry logic ever needs tuning (timeout, error codes, backoff), all 4 must be updated in lockstep. Consider extracting to a helper function or macro.

4. Medium: cudaGetLastError() in retry loop silently discards errors

Each retry iteration calls cudaGetLastError() to clear state, but the returned error is discarded. If a real CUDA error occurs (not just stale state), it will be silently swallowed. At least log it:

cudaError_t cu = cudaGetLastError();
if (cu != cudaSuccess)
    GST_DEBUG_OBJECT(src, "Cleared CUDA error during recovery: %d", cu);

5. Low: Hardcoded 60s timeout

Reasonable for the known Argus recovery window (10-30s), but exposing it as a GStreamer property (e.g., recovery-timeout) would give users control. Fine to defer to a follow-up.


What's good

  • The core bug (> SUCCESS missing negative codes) is real and the fix is correct
  • Comments explaining the Argus ProviderGuardian recovery behavior are useful context
  • Clearing stale CUDA errors before grab (with a warning instead of fatal) is sensible
  • The CHECK_RET_OR_GOTO macro update to soft-fail on recovery errors is a reasonable approach for the fill() path

Verdict

The ret > SUCCESS → ret != SUCCESS fix and the retry concept are solid. However, the create() path returning GST_FLOW_OK with no buffer is a crash bug that needs to be fixed before merging. The shutdown-blocking sleep is also worth addressing to avoid user-visible hangs. Requesting changes on those two items; the rest are suggestions.

…own hang

- Extract grab retry loop into common/gst-zed-recovery.hpp template,
  replacing 4 duplicate ~25-line loops with single shared helper
- create() paths: retry full grab+retrieve cycle instead of returning
  GST_FLOW_OK with unset *outbuf (which killed the pipeline anyway)
- Replace blocking g_usleep(1s) with 100ms-chunk loop checking
  GST_PAD_IS_FLUSHING — pipeline shutdown responds in ~100ms
- Log cudaGetLastError() return value instead of discarding silently
- Expose recovery-timeout property (0-300s, default 60) on both plugins

JujuDel (Member, Author) commented Apr 9, 2026

Thanks for the review. All 5 findings addressed:

Finding 1 — create() returns GST_FLOW_OK without setting *outbuf

Partially accurate. Returning GST_FLOW_OK with *outbuf == NULL doesn't segfault — GstBaseSrc catches it and posts a GST_ELEMENT_ERROR internally. But it does kill the pipeline, which defeats the purpose of surviving recovery.

Fixed: create() now includes retrieveImage inside the retry loop. If either grab() or retrieveImage() returns CAMERA_REBOOTING/CUDA_ERROR, we retry the full grab+retrieve cycle. No more returning with an unset buffer.

Finding 2 — Blocking g_usleep(1000000) prevents clean pipeline shutdown

Valid.

Fixed: the 1s sleep is now split into 100ms chunks with a flushing check on each iteration (GST_PAD_IS_FLUSHING(GST_BASE_SRC_PAD(...))). Returns GST_FLOW_FLUSHING immediately when the pipeline is shutting down. Worst-case latency to respond to set_state(NULL) is ~100ms instead of up to 60s. This logic lives in a shared helper (common/gst-zed-recovery.hpp) so all paths benefit.
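The chunked sleep described here might look roughly like the sketch below. It is not the contents of common/gst-zed-recovery.hpp: `sleep_unless_flushing` is a hypothetical name, and the `is_flushing` callback stands in for the GST_PAD_IS_FLUSHING(GST_BASE_SRC_PAD(...)) check so the example compiles without GStreamer.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <thread>

// Sleeps for up to `total`, in slices of at most `chunk`, polling
// is_flushing() before each slice. Returns true if the full duration
// elapsed, false if flushing was detected (caller returns GST_FLOW_FLUSHING).
inline bool sleep_unless_flushing(std::chrono::milliseconds total,
                                  std::chrono::milliseconds chunk,
                                  const std::function<bool()>& is_flushing) {
    // Guard against a zero or negative chunk, which would never make progress.
    if (chunk <= std::chrono::milliseconds::zero()) chunk = total;

    auto remaining = total;
    while (remaining > std::chrono::milliseconds::zero()) {
        if (is_flushing()) return false;
        const auto step = std::min(remaining, chunk);
        std::this_thread::sleep_for(step);
        remaining -= step;
    }
    return !is_flushing();  // final check after the last slice
}
```

With total = 1s and chunk = 100ms, the worst-case delay before noticing a flush is about one chunk, which matches the ~100ms figure quoted above.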

Finding 3 — Copy-pasted retry loop (4 copies)

Valid.

Fixed: extracted the common retry logic into zed_gst_grab_with_recovery(), a header-only template in common/gst-zed-recovery.hpp. All 4 call sites (zedsrc fill/create, zedxonesrc fill/create) now use it. The helper handles the grab retry, flushing check, timeout, and CUDA error clearing. The create() paths add an outer loop around the helper for the retrieve-after-grab retry. Future tuning only needs to touch one place.

Finding 4 — cudaGetLastError() silently discards errors

Valid.

Fixed: all retry-loop instances now log the return value via GST_DEBUG_OBJECT when non-zero. The pre-grab check in zedsrc fill() still logs at GST_WARNING level; the retry instances use GST_DEBUG since stale CUDA errors are expected during recovery.

Finding 5 — Hardcoded 60s timeout

Valid.

Fixed: both plugins now expose a recovery-timeout GStreamer property (int, range 0–300, default 60). Setting it to 0 disables retry entirely. Configurable at pipeline construction or runtime:

gst-launch-1.0 zedsrc recovery-timeout=30 ! ...

@JujuDel JujuDel requested a review from adujardin April 9, 2026 10:54