Conversation
|
@copilot Review this MR |
…sage wording Agent-Logs-Url: https://github.com/stereolabs/zed-gstreamer/sessions/d743f6dc-6a7a-4459-977e-97a6db2e5e67 Co-authored-by: JujuDel <[email protected]>
Review complete. Found and fixed the following issues (commits Critical bug — Bug — Incorrect code — Misleading messages — |
adujardin
left a comment
There was a problem hiding this comment.
Review
Bug fix: ret > SUCCESS → ret != SUCCESS — Correct and important
The original check misses negative error codes like CAMERA_REBOOTING (-1). Changing to ret != sl::ERROR_CODE::SUCCESS is the right fix.
Issues
1. Critical: create() returns GST_FLOW_OK without setting *outbuf
In the create() path for both plugins, when retrieveImage fails during recovery:
return GST_FLOW_OK; // don't kill pipelineBut *outbuf was never assigned. GstBaseSrc will attempt to push a NULL buffer downstream → segfault or assertion failure.
The fill() path is less dangerous (the buffer exists, just has stale/uninitialized content), but still problematic — downstream elements will decode garbage pixels.
Suggestion: Either allocate a zeroed/black buffer, or retry the retrieve, or drop the frame using a GAP event instead of returning GST_FLOW_OK with no buffer.
2. High: Blocking g_usleep(1000000) prevents clean pipeline shutdown
The 1-second sleep in GStreamer's streaming thread means gst_element_set_state(NULL) will hang until the sleep completes. Over a 60s recovery window, the pipeline is effectively unresponsive to state changes.
Suggestion: Check for flushing state in each loop iteration:
if (gst_base_src_is_flushing(GST_BASE_SRC(psrc))) {
flow_ret = GST_FLOW_FLUSHING;
goto out;
}Or use a GCond/GMutex with timed wait that the unlock() vfunc can signal.
3. Medium: Copy-pasted retry loop (4 copies)
The identical ~25-line retry pattern is duplicated in gst_zedsrc_fill(), gst_zedsrc_create(), gst_zedxonesrc_fill(), and gst_zedxonesrc_create(). If the retry logic ever needs tuning (timeout, error codes, backoff), all 4 must be updated in lockstep. Consider extracting to a helper function or macro.
4. Medium: cudaGetLastError() in retry loop silently discards errors
Each retry iteration calls cudaGetLastError() to clear state, but the returned error is discarded. If a real CUDA error occurs (not just stale state), it will be silently swallowed. At least log it:
cudaError_t cu = cudaGetLastError();
if (cu != cudaSuccess)
GST_DEBUG_OBJECT(src, "Cleared CUDA error during recovery: %d", cu);5. Low: Hardcoded 60s timeout
Reasonable for the known Argus recovery window (10-30s), but exposing it as a GStreamer property (e.g., recovery-timeout) would give users control. Fine to defer to a follow-up.
What's good
- The core bug (
> SUCCESSmissing negative codes) is real and the fix is correct - Comments explaining the Argus ProviderGuardian recovery behavior are useful context
- Clearing stale CUDA errors before grab (with a warning instead of fatal) is sensible
- The
CHECK_RET_OR_GOTOmacro update to soft-fail on recovery errors is a reasonable approach for thefill()path
Verdict
The ret > SUCCESS → ret != SUCCESS fix and the retry concept are solid. However, the create() path returning GST_FLOW_OK with no buffer is a crash bug that needs to be fixed before merging. The shutdown-blocking sleep is also worth addressing to avoid user-visible hangs. Requesting changes on those two items; the rest are suggestions.
…own hang - Extract grab retry loop into common/gst-zed-recovery.hpp template, replacing 4 duplicate ~25-line loops with single shared helper - create() paths: retry full grab+retrieve cycle instead of returning GST_FLOW_OK with unset *outbuf (which killed the pipeline anyway) - Replace blocking g_usleep(1s) with 100ms-chunk loop checking GST_PAD_IS_FLUSHING — pipeline shutdown responds in ~100ms - Log cudaGetLastError() return value instead of discarding silently - Expose recovery-timeout property (0-300s, default 60) on both plugins
|
Thanks for the review. All 5 findings addressed: Finding 1 —
|
During multi-camera Argus provider recovery,
grab()returnsCAMERA_REBOOTINGorCUDA_ERRORfor 10–30s while the provider is destroyed and recreated. The previous code treated these as fatal, tearing down the entire GStreamer pipeline instead of waiting for recovery.Changes
Core: grab() retry loop (
gstzedsrc.cpp,gstzedxonesrc.cpp)fill()andcreate()paths now retrygrab()onCAMERA_REBOOTING/CUDA_ERRORwith 1s sleep between attempts, up to a 60s timeout before escalating toGST_FLOW_ERRORRetrieve operations during recovery
retrieveImage()failures onCAMERA_REBOOTING/CUDA_ERRORreturnGST_FLOW_OK(empty frame) instead of killing the pipelinegst_zedxonesrc_fillwhere the originalcheck_retlambda always returnedfalsefor both recovery and fatal errors, and the caller unconditionally mappedfalse→GST_FLOW_ERROR, negating the intended graceful handlingCUDA error state
fill()are cleared bycudaGetLastError()and logged as warnings rather than treated as hard failurescudaGetLastError()is called inside each retry iteration to drain accumulated CUDA error state before the nextgrab()attemptError condition fix
ret > sl::ERROR_CODE::SUCCESSreplaced withret != sl::ERROR_CODE::SUCCESS—CAMERA_REBOOTINGis-1, so the old comparison silently ignored it