
Fix flex counter SAI errors during warm reboot by deferring until APPLY_VIEW #1784

Open
securely1g wants to merge 1 commit into sonic-net:master from securely1g:fix/flexcounter-defer-until-apply-view

Conversation

@securely1g

Description

During warm reboot reconciliation, flex counter events may arrive before the VIDTORID mapping is fully populated. The affected counters are registered with RID 0x0, so SAI calls fail with SAI_STATUS_INVALID_PARAMETER (-5) on every poll interval.
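The failure mode can be illustrated with a small standalone sketch. It is not syncd code: `translateVidToRid` and `mockGetStats` here are hypothetical stand-ins, and the sketch simply assumes that a VID missing from the table translates to a null RID. Only the RID value 0x0 and the status code -5 come from the description above.

```cpp
#include <cstdint>
#include <unordered_map>

using ObjectId = std::uint64_t;

constexpr ObjectId kNullObjectId = 0;        // plays the role of RID 0x0
constexpr int kStatusSuccess = 0;
constexpr int kStatusInvalidParameter = -5;  // value quoted in the description

// Hypothetical VID-to-RID table lookup (assumption): a VID that has not been
// populated yet translates to the null RID instead of raising an error.
ObjectId translateVidToRid(const std::unordered_map<ObjectId, ObjectId>& vidToRid,
                           ObjectId vid)
{
    auto it = vidToRid.find(vid);
    return it == vidToRid.end() ? kNullObjectId : it->second;
}

// Mock of a stats query that rejects the null RID, mirroring the reported error.
int mockGetStats(ObjectId rid)
{
    return rid == kNullObjectId ? kStatusInvalidParameter : kStatusSuccess;
}
```

A counter registered before its VID is in the table keeps polling with the null RID, so the same error repeats until the counter is re-registered.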

Fix (2 files changed)

Defer all flex counter events during warm boot and process them after applyView() completes, when VIDTORID is fully populated.

  • syncd/Syncd.h: Add m_warmBootReconciling flag and deferred event queue
  • syncd/Syncd.cpp:
    • Initialize flag to true on warm start
    • Queue flex counter events during reconciliation instead of processing
    • Drain queue after applyView() succeeds and VIDTORID is updated

How it works

applyView() is the clear "reconciliation complete" signal: it runs the comparison logic, creates and removes SAI objects, and calls setVidAndRidMap() to repopulate the VIDTORID hash in Redis. After this point, all VID-to-RID translations will succeed.
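The queue-and-drain behaviour described above can be sketched roughly as follows. This is a minimal standalone illustration, not the code in this PR: `FlexCounterEvent`, `DeferringProcessor`, `onFlexCounterEvent`, and `onApplyViewDone` are hypothetical names, and the handler callback stands in for whatever syncd does to process an event. Only the member names `m_warmBootReconciling` and `m_deferredFlexCounterEvents` are taken from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for a flex counter request (not the syncd type).
struct FlexCounterEvent
{
    std::string key;
    std::uint64_t vid;
};

class DeferringProcessor
{
public:
    using Handler = std::function<void(const FlexCounterEvent&)>;

    // The flag starts as true only on a warm start.
    DeferringProcessor(bool warmStart, Handler handler)
        : m_warmBootReconciling(warmStart), m_handler(std::move(handler)) {}

    // Queue the event while reconciling; otherwise process it immediately.
    void onFlexCounterEvent(FlexCounterEvent ev)
    {
        if (m_warmBootReconciling)
        {
            m_deferredFlexCounterEvents.push_back(std::move(ev));
            return;
        }
        m_handler(ev);
    }

    // Call once the view has been applied and the VID-to-RID map is complete.
    void onApplyViewDone()
    {
        // Drain in arrival order, then clear the flag.
        while (!m_deferredFlexCounterEvents.empty())
        {
            FlexCounterEvent ev = std::move(m_deferredFlexCounterEvents.front());
            m_deferredFlexCounterEvents.pop_front();
            m_handler(ev);
        }
        assert(m_deferredFlexCounterEvents.empty());
        m_warmBootReconciling = false;
    }

    std::size_t pending() const { return m_deferredFlexCounterEvents.size(); }

private:
    bool m_warmBootReconciling;
    Handler m_handler;
    std::deque<FlexCounterEvent> m_deferredFlexCounterEvents;
};
```

A FIFO queue keeps the events in the order they arrived, so a later update for the same counter is still applied after the earlier one. The sketch is single-threaded; any locking the real event path needs is out of scope here.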

Trade-offs

  • Pro: Clean architectural approach; events are processed in the correct order
  • Pro: No counter will ever be registered with an invalid RID
  • Con: No flex counter data during reconciliation (typically seconds to minutes)
  • Con: Memory usage for queued events (bounded by number of flex counter entries)

Alternative

See companion PR #1783 for a skip-based approach that guards against null RID at collection time.

Signed-off-by: securely1g <securely1g@users.noreply.github.com>

…LY_VIEW

During warm reboot reconciliation, flex counter events may arrive before
the VIDTORID mapping is fully populated. This results in counters being
registered with RID 0x0, causing SAI calls to fail with
SAI_STATUS_INVALID_PARAMETER every poll interval.

Fix: Defer all flex counter events during warm boot reconciliation and
process them after applyView() completes, when VIDTORID mapping is
fully populated.

- Add m_warmBootReconciling flag, initialized to true on warm start
- Queue flex counter events in m_deferredFlexCounterEvents during
  reconciliation instead of processing them immediately
- Drain the queue after applyView() succeeds and VIDTORID is updated
- Flag is cleared after draining, so subsequent events process normally

This approach ensures no flex counter is registered with an invalid RID,
at the cost of no counter data during reconciliation (typically seconds
to a few minutes).

Signed-off-by: securely1g <securely1g@users.noreply.github.com>
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

