[pmon]: HLD: Enhance DPU Robustness in Smart Switch #2310

Merged: vvolam merged 14 commits into sonic-net:master from vvolam:dpu_ras (Jun 1, 2026)
@vvolam vvolam commented Apr 27, 2026

Contributor

What I did

Add a High Level Design document for DPU failure scenarios on SmartSwitch from the PMON (Platform Monitor) perspective.

Why I did it

SmartSwitch DPU lifecycle management requires clear specification of failure detection, DB state tracking, and recovery actions performed by chassisd and other PMON sub-daemons. This HLD documents all failure and planned operation scenarios to guide implementation.

How I did it

Added doc/smart-switch/pmon/enhance-dpu-robustness.md covering:

  • DPU software failures: critical process restart, persistent failure, pmon/databasedpu crashes on NPU
  • DPU hardware failures: complete DPU down, power failure, PCIe failure
  • NPU/switch-level failures: kernel crash, memory exhaustion
  • Planned operations: graceful shutdown, cold reboot, full SmartSwitch reboot
  • New DB fields: ready_status, recovery_status, reset_count, last_down_time, last_ready_time in CHASSIS_STATE_DB
  • New feature flag: FEATURE|dpu-auto-recovery in CONFIG_DB
  • DPU recovery state machine: Mermaid diagram with state table
  • Timers and thresholds: configurable via platform.json
  • Race condition handling: concurrent operations on the same DPU
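
As a rough illustration of the new per-DPU fields listed above, the following Python sketch models one DPU's record. Only the five field names come from the list; the `DpuRecoveryRecord` class, the types, the defaults, and the string encoding are assumptions made for illustration and are not the schema defined in the HLD.

```python
from dataclasses import dataclass, asdict


@dataclass
class DpuRecoveryRecord:
    """Hypothetical in-memory view of the new per-DPU fields."""
    ready_status: bool = False   # assumed to be a boolean flag
    recovery_status: str = ""    # assumed to be empty when no recovery is active
    reset_count: int = 0         # assumed to count recovery power-cycles
    last_down_time: str = ""     # assumed to be a timestamp string
    last_ready_time: str = ""    # assumed to be a timestamp string

    def to_db_fields(self) -> dict:
        """Flatten to string values, since Redis hashes store strings."""
        fields = {}
        for name, value in asdict(self).items():
            if isinstance(value, bool):
                fields[name] = "true" if value else "false"
            else:
                fields[name] = str(value)
        return fields
```

For example, `DpuRecoveryRecord(ready_status=True, reset_count=1).to_db_fields()` yields a flat mapping of strings that could be written to a hash entry; the actual key naming in CHASSIS_STATE_DB is whatever the HLD specifies.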

How to verify it

Review the HLD document for completeness and correctness of failure scenarios, DB state transitions, and recovery actions.

| Repo | PR Title / Link | Status |
|------|-----------------|--------|
| sonic-platform-daemons | [chassisd]: Add DPU recovery state machine and new DB fields | PR State |
| sonic-buildimage | [smartswitch]: Add dpu-auto-recovery feature to SmartSwitch NPU default config | PR State |
| sonic-utilities | [cli]: Add DPU recovery CLI commands for SmartSwitch | PR State |

Add High Level Design document covering DPU failure scenarios
for Smart Switch, including software failures, hardware failures, and
NPU/switch level failures.

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
@mssonicbld

Collaborator

/azp run

@azure-pipelines

No pipelines are associated with this pull request.

@vvolam vvolam changed the title from "[pmon]: Add Smart Switch DPU Reliability and Availability HLD" to "[pmon]: Add Smart Switch Enhance DPU Robustness HLD" on Apr 27, 2026

DPU control plane, midplane, and data plane states are always
'down' during booting, never 'unknown'. Update terminology,
state machine table, and scenario summary accordingly.

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
@mssonicbld

Collaborator

/azp run

@azure-pipelines

No pipelines are associated with this pull request.

@vvolam vvolam changed the title from "[pmon]: Add Smart Switch Enhance DPU Robustness HLD" to "[pmon]: HLD: Enhance DPU Robustness in Smart Switch" on Apr 29, 2026
…covery gating

- Replace the ambiguous two-timer model (60s auto-recovery + 180s power-cycle) with a single, clearly-named dpu_auto_recovery_timeout (60s). Update the timer table, state machine edge labels, and all DPU software/hardware failure scenarios to use the consistent name.

- Rename 'Critical process' subsections to 'Process' for accuracy; update TOC anchors and Scope wording accordingly.

- Add ManualIntervention state to the DPU recovery state machine and gate SWFailure/HW-failure transitions on the auto-recovery feature flag. Add a global note plus per-scenario 'When auto-recovery is disabled' bullets so the FEATURE|dpu-auto-recovery=disabled behavior is consistent across every failure scenario.

- Rework NPU Kernel Crash recovery: chassisd unconditionally power-cycles every admin-up DPU via the platform vendor path (power_down/pci_detach/power_up/pci_reattach) instead of using gNOI Reboot RPC against potentially unresponsive DPUs. Admin-down DPUs are left offline. Add reset_count row to the DB transition table with a note about chassisd-restart zeroing.

- Fix 'Table of Content' typo and add Existing/New DB entries sub-entries to the table of contents.

- Replace literal pipe inside backticks in the state table cell with HTML entity so the markdown table renders correctly on GitHub.

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
@mssonicbld

Collaborator

/azp run

@azure-pipelines

No pipelines are associated with this pull request.

Drop the dpu_auto_recovery_timeout self-heal grace period. chassisd now initiates a DPU power-cycle as soon as it observes dpu_control_plane_state (or dpu_midplane_link_state) as down on its next 10s health poll, regardless of whether the failure is a transient process restart or a persistent crash-loop.

- Remove dpu_auto_recovery_timeout from the timer table; clarify chassisd health poll interval description to state immediate power-cycle on detection.

- Combine 'Process restart on DPU' and 'Process persistently down on DPU' into a single 'Process crash/restart on DPU' section since chassisd applies the same recovery path in both cases. Update TOC and DB transition table accordingly.

- State machine: keep SWFailure as a transient state on control-plane-down, branching directly into PowerCycle (auto-recovery enabled) or ManualIntervention (auto-recovery disabled) without any timer wait. HW-failure path goes directly from Ready to PowerCycle / ManualIntervention.

- Drop 'skipping dpu_auto_recovery_timeout' parentheticals from HW Failure / Power Failure / PCIe Failure scenarios. Update Scenario DB State Summary row for control plane restart to reflect the immediate power-cycle behavior.

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
@mssonicbld

Collaborator

/azp run

@azure-pipelines

No pipelines are associated with this pull request.

@gpunathilell

Contributor

Regarding the databasedpu crash on NPU section: there are two DBs involved here. One is CHASSIS_STATE_DB, to which chassisd running on the DPU sends data. That is not the same as the databasedpuN dockers running on the switch, which are the ones orchagent on the DPU reads from and writes to; chassisd is not involved there. databasedpuN issues by themselves should not be detectable. If we assume the issue is caused by a midplane failure, then CHASSIS_STATE_DB is also inaccessible to the DPU either way.

@gpunathilell

Contributor

We also need a sonic-mgmt section describing how the regular smartswitch tests are planned to be executed (with or without autorecovery).

@gpunathilell

Contributor

@vvolam it is not clear from the document exactly when autorecovery is triggered. The control plane also goes down when we execute a shutdown or reboot, so please state in the document the exact scenario in which autorecovery is triggered, and whether a timeout is configured for it.

- Clarify databasedpu crash detection: chassisd detects indirectly via
  dpu_control_plane_state going down, not by monitoring databasedpuN
  Redis instances directly.
- Add auto-recovery trigger disambiguation: document that chassisd
  checks state_transition_in_progress before triggering recovery,
  skipping auto-recovery during planned shutdown/reboot operations.
- Add Testing section with sonic-mgmt test plan covering all failure
  mode scenarios (8 test classes) and test infrastructure details.

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
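
The trigger disambiguation described in this commit can be sketched as a small decision function. This is a minimal sketch under stated assumptions, not the chassisd implementation: the function name, its parameters, and the returned labels are hypothetical. The inputs mirror `dpu_control_plane_state`, `dpu_midplane_link_state`, `state_transition_in_progress`, and the `FEATURE|dpu-auto-recovery` flag mentioned in this PR.

```python
def decide_recovery_action(control_plane_up: bool,
                           midplane_up: bool,
                           state_transition_in_progress: bool,
                           auto_recovery_enabled: bool) -> str:
    """Return a hypothetical action label for one health-poll observation."""
    # Healthy DPU: nothing to do.
    if control_plane_up and midplane_up:
        return "none"
    # A planned shutdown/reboot also takes the control plane down, so a set
    # state_transition_in_progress flag means recovery must be skipped.
    if state_transition_in_progress:
        return "skip_planned_operation"
    # With the feature flag disabled, the failure is left to the operator.
    if not auto_recovery_enabled:
        return "manual_intervention"
    return "recover"
```

Checking the in-progress flag before the feature flag is a choice made for this sketch; the point is only that a planned operation never reaches the recovery branch.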
@mssonicbld

Collaborator

/azp run

@azure-pipelines

No pipelines are associated with this pull request.

@vvolam

vvolam commented May 26, 2026

Contributor Author

@gpunathilell I have updated the document addressing the comments. Please review

Comment thread doc/smart-switch/pmon/enhance-dpu-robustness.md Outdated
Comment thread doc/smart-switch/pmon/enhance-dpu-robustness.md
Comment thread doc/smart-switch/pmon/enhance-dpu-robustness.md Outdated
- Clarify recovery timing: power-cycle triggered on same poll cycle
  that detects failure (no additional timeout beyond 10s poll interval).
- Add CLI section: show chassis modules status extended to display
  ready_status, recovery_status, reset_count, last_down_time,
  last_ready_time from CHASSIS_STATE_DB.
- Fix PCIe failure recovery: since DPU is already offline, chassisd
  updates the status and does not power-cycle.

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
@mssonicbld

Collaborator

/azp run

Comment thread doc/smart-switch/pmon/enhance-dpu-robustness.md
Comment thread doc/smart-switch/pmon/enhance-dpu-robustness.md Outdated
Comment thread doc/smart-switch/pmon/enhance-dpu-robustness.md
Comment thread doc/smart-switch/pmon/enhance-dpu-robustness.md
Comment thread doc/smart-switch/pmon/enhance-dpu-robustness.md
Comment thread doc/smart-switch/pmon/enhance-dpu-robustness.md Outdated
Comment thread doc/smart-switch/pmon/enhance-dpu-robustness.md
Comment thread doc/smart-switch/pmon/enhance-dpu-robustness.md Outdated

@judsonwilson-nvidia judsonwilson-nvidia left a comment


See my previous comments. Approving with the goal of getting the code out there so we can see if it works. If there are serious problems, we can hopefully just disable the feature until it is fixed.

Reiterating my top concerns:

  1. It seems like the PowerCycle state should set the module-state-transition flag so as not to race with graceful shutdown. Handling these races with other power up/down/reboot events has been my strongest concern with this design, and the module-state-transition flag and the "graceful shutdown" procedure seemed like the way we handle this. You know better than I do.

  2. Is it OK for the DPU to self-recover if there is a control plane failure, or should we just force a reboot in all failure scenarios after some amount of time (allowing for any necessary watchdog periods, etc.)?

gpunathilell previously approved these changes Jun 1, 2026

@gpunathilell gpunathilell left a comment

Contributor


@vvolam please specify what is done during power cycle (exact operations). other steps look good for now. need to verify code

@vvolam

vvolam commented Jun 1, 2026

Contributor Author

> @vvolam please specify what is done during power cycle (exact operations). other steps look good for now. need to verify code

Sure @gpunathilell, I will address this as a follow-up PR. Thank you

- Booting state: add dpu_data_plane_state to NOT condition
- Split 'After Power-Cycle Recovery' column into success vs. limit
  reached to clarify unrecoverable state
- Add WaitForSelfRecovery/PowerCycle → Offline transitions to Mermaid
  diagram for CLI module shutdown during recovery
- Reorder graceful shutdown steps: power_down → pci_detach → clear
  state_transition_in_progress → sensor config
- Fix Full SmartSwitch Reboot: chassisd detects reboot cause on startup
  (DPUs not guaranteed to reattach before NPU goes down)

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
@mssonicbld

Collaborator

/azp run

@vvolam vvolam requested a review from gpunathilell June 1, 2026 18:22
@azure-pipelines

No pipelines are associated with this pull request.

Simplify the state machine by combining PlannedShutdown and Offline
into a single AdminDown state. The state machine now waits for admin
state up to transition back to Booting. The actual planned shutdown
actions (gNOI HALT, power_down, pci_detach) are tracked separately
via state_transition_in_progress flag.

Also update PlannedReboot to be a direct Ready → Booting transition.

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
@mssonicbld

Collaborator

/azp run

@azure-pipelines

No pipelines are associated with this pull request.

@vvolam

vvolam commented Jun 1, 2026

Contributor Author

> > @vvolam please specify what is done during power cycle (exact operations). other steps look good for now. need to verify code
>
> Sure @gpunathilell, I will address this as a follow-up PR. Thank you

@gpunathilell I have addressed. Please review top 2 commits to the PR.

@vvolam vvolam merged commit 6adb3c6 into sonic-net:master Jun 1, 2026
2 checks passed
@vvolam vvolam deleted the dpu_ras branch June 1, 2026 21:23
WaitForSelfRecovery --> AdminDown : CLI module shutdown
WaitForSelfRecovery --> Booting : CLI reboot DPU (cancel timer, power-cycle)

PowerCycle --> Booting : Power cycle issued


PowerCycle --> Booting : Power cycle issued

I don't think this exists in the code (and probably shouldn't?). I think it's a self loop back to PowerCycle when the physical power cycling is issued?

PowerCycle --> Unrecoverable : reset count >= dpu_reset_limit
PowerCycle --> AdminDown : CLI module shutdown

ManualIntervention --> Booting : Operator power-cycle / module startup

@judsonwilson-nvidia judsonwilson-nvidia Jun 1, 2026


Is there also a transition from ManualIntervention to AdminDown that is missing? I believe this is what the code does.
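
To make the two review questions above easier to check, the edges quoted from the diagram in this thread can be written as a lookup table. This is a sketch of the quoted fragment only, not the full state machine from the HLD or the chassisd code. The state names come from the diagram lines; the event labels and helper functions are hypothetical.

```python
# Edges exactly as quoted in this review thread: (state, event) -> next state.
QUOTED_EDGES = {
    ("WaitForSelfRecovery", "cli_module_shutdown"): "AdminDown",
    ("WaitForSelfRecovery", "cli_reboot_dpu"): "Booting",
    ("PowerCycle", "power_cycle_issued"): "Booting",
    ("PowerCycle", "reset_limit_reached"): "Unrecoverable",
    ("PowerCycle", "cli_module_shutdown"): "AdminDown",
    ("ManualIntervention", "operator_startup"): "Booting",
}


def next_state(state: str, event: str):
    """Return the next state, or None if the quoted fragment has no such edge."""
    return QUOTED_EDGES.get((state, event))


def power_cycle_event(reset_count: int, dpu_reset_limit: int) -> str:
    """Pick the PowerCycle exit event using the 'reset count >= dpu_reset_limit' guard."""
    if reset_count >= dpu_reset_limit:
        return "reset_limit_reached"
    return "power_cycle_issued"
```

In this table, `next_state("ManualIntervention", "cli_module_shutdown")` returns `None`, which corresponds to the edge the reviewer suspects is missing from the diagram.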

a114j0y pushed a commit to a114j0y/SONiC that referenced this pull request Jun 5, 2026
* Add Smart Switch DPU Reliability and Availability HLD

Add High Level Design document covering DPU failure scenarios
for Smart Switch, including software failures, hardware failures, and
NPU/switch level failures.

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>

* [pmon]: Fix DPU state values - replace unknown with down

DPU control plane, midplane, and data plane states are always
'down' during booting, never 'unknown'. Update terminology,
state machine table, and scenario summary accordingly.

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>

* [doc][smart-switch][pmon]: Clarify DPU recovery semantics and auto-recovery gating

- Replace the ambiguous two-timer model (60s auto-recovery + 180s power-cycle) with a single, clearly-named dpu_auto_recovery_timeout (60s). Update the timer table, state machine edge labels, and all DPU software/hardware failure scenarios to use the consistent name.

- Rename 'Critical process' subsections to 'Process' for accuracy; update TOC anchors and Scope wording accordingly.

- Add ManualIntervention state to the DPU recovery state machine and gate SWFailure/HW-failure transitions on the auto-recovery feature flag. Add a global note plus per-scenario 'When auto-recovery is disabled' bullets so the FEATURE|dpu-auto-recovery=disabled behavior is consistent across every failure scenario.

- Rework NPU Kernel Crash recovery: chassisd unconditionally power-cycles every admin-up DPU via the platform vendor path (power_down/pci_detach/power_up/pci_reattach) instead of using gNOI Reboot RPC against potentially unresponsive DPUs. Admin-down DPUs are left offline. Add reset_count row to the DB transition table with a note about chassisd-restart zeroing.

- Fix 'Table of Content' typo and add Existing/New DB entries sub-entries to the table of contents.

- Replace literal pipe inside backticks in the state table cell with HTML entity so the markdown table renders correctly on GitHub.

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>

* [doc][smart-switch][pmon]: Reset DPU immediately on control-plane-down

Drop the dpu_auto_recovery_timeout self-heal grace period. chassisd now initiates a DPU power-cycle as soon as it observes dpu_control_plane_state (or dpu_midplane_link_state) as down on its next 10s health poll, regardless of whether the failure is a transient process restart or a persistent crash-loop.

- Remove dpu_auto_recovery_timeout from the timer table; clarify chassisd health poll interval description to state immediate power-cycle on detection.

- Combine 'Process restart on DPU' and 'Process persistently down on DPU' into a single 'Process crash/restart on DPU' section since chassisd applies the same recovery path in both cases. Update TOC and DB transition table accordingly.

- State machine: keep SWFailure as a transient state on control-plane-down, branching directly into PowerCycle (auto-recovery enabled) or ManualIntervention (auto-recovery disabled) without any timer wait. HW-failure path goes directly from Ready to PowerCycle / ManualIntervention.

- Drop 'skipping dpu_auto_recovery_timeout' parentheticals from HW Failure / Power Failure / PCIe Failure scenarios. Update Scenario DB State Summary row for control plane restart to reflect the immediate power-cycle behavior.

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>

* [doc][smart-switch][pmon]: Address review comments on DPU robustness HLD

- Clarify databasedpu crash detection: chassisd detects indirectly via
  dpu_control_plane_state going down, not by monitoring databasedpuN
  Redis instances directly.
- Add auto-recovery trigger disambiguation: document that chassisd
  checks state_transition_in_progress before triggering recovery,
  skipping auto-recovery during planned shutdown/reboot operations.
- Add Testing section with sonic-mgmt test plan covering all failure
  mode scenarios (8 test classes) and test infrastructure details.

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>

* [doc][smart-switch][pmon]: Address review round 2 on DPU robustness HLD

- Clarify recovery timing: power-cycle triggered on same poll cycle
  that detects failure (no additional timeout beyond 10s poll interval).
- Add CLI section: show chassis modules status extended to display
  ready_status, recovery_status, reset_count, last_down_time,
  last_ready_time from CHASSIS_STATE_DB.
- Fix PCIe failure recovery: since DPU is already offline, chassisd
  updates the status and does not power-cycle.

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>

* [doc][smart-switch][pmon]: Address review round 3 on DPU robustness HLD

- Add dpu_boot_timeout timer (300s default) to handle stuck-boot scenarios
- Update state machine: Booting transitions to PowerCycle/ManualIntervention on timeout
- Split CLI: keep 'show chassis modules status' lean, add 'show chassis modules recovery'
- Include Ready-Status in both CLI outputs
- Rename reset_limit to dpu_reset_limit for consistency

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>

* [doc][smart-switch][pmon]: Fix contradictions and add missing corner cases

- Fix PCIe Failure: chassisd does power-cycle (midplane down implies recovery)
- Fix pmon crash: document reset_count reset to 0 on chassisd restart
- Fix DPU Power Failure: add recovery_status and dpu_reset_limit handling
- Clarify SWFailure: both planes down = HW failure path (skips SWFailure)
- Add boot timeout note for planned reboot scenarios
- Add state machine edges: Booting->Offline, Unrecoverable->Booting via operator
- Clarify recovery_status clear: chassisd restart or operator module startup
- Add syslog warning for stuck data plane (control up, data down)
- Add multi-DPU sequential recovery note with optional parallel flag
- Add race condition: shutdown during Booting cancels timer, goes to Offline
- Change dpu_reset_limit default from 5 to 2
- Clarify data plane warning applies only during Booting state
- Fix Ready->data-plane-down: ready_status set to false (no recovery action)
- Add recovery_status to databasedpu crash DB state table
- Fix scenario summary: midplane transitions down->up during initial boot

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>

* [doc][smart-switch][pmon]: Add DPU self-recovery (HW watchdog) section

Add HW watchdog-aware recovery for midplane-down events. When
dpu_midplane_link_state goes down, chassisd enters WaitForWatchdog
and waits dpu_boot_timeout (600s) for DPU to self-recover via HW
watchdog before issuing its own power-cycle. Validates reboot cause
(Kernel Panic, Memory Exhaustion, Watchdog) to accept self-recovery.

dpu_boot_timeout is reused for both Booting (wait after power-cycle)
and WaitForWatchdog (wait for HW watchdog self-recovery).

Also fixes DPU Power Failure and PCIe Failure sections to go through
WaitForWatchdog instead of immediate power-cycle, consistent with the
state machine.

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>

* Simplify failure scenarios into unified DPU recovery model

- Add dpu_self_recovery_timeout (300s) for DPU self-recovery grace period
- Consolidate all DPU failure types into single 'DPU Failure' category
  with unified WaitForSelfRecovery state
- Consolidate NPU failures into 'NPU Ungraceful Reboot' category
- Update state machine: replace SWFailure/WaitForWatchdog with
  WaitForSelfRecovery state
- Make Key DB Indicators column explicit with exact DB field conditions
- Remove unused auto_restart/high_mem_alert from dpu-auto-recovery feature
- Simplify chassisd health poll interval and dpu_boot_timeout descriptions
- Fix gRPC abbreviation expansion

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>

* Address PR review: clarify Booting state DB indicators

Update the Booting state's Key DB Indicators to use timer-based
condition: 'dpu_boot_timeout timer running AND NOT (midplane up AND
control plane up)' instead of assuming specific intermediate link states.

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>

* Fix table rendering: escape pipe in FEATURE|dpu-auto-recovery

Escape the pipe character in the ManualIntervention row's Key DB
Indicators column to prevent GitHub Markdown from breaking the table.

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>

* Address PR review: fix state table, diagram, and step ordering

- Booting state: add dpu_data_plane_state to NOT condition
- Split 'After Power-Cycle Recovery' column into success vs. limit
  reached to clarify unrecoverable state
- Add WaitForSelfRecovery/PowerCycle → Offline transitions to Mermaid
  diagram for CLI module shutdown during recovery
- Reorder graceful shutdown steps: power_down → pci_detach → clear
  state_transition_in_progress → sensor config
- Fix Full SmartSwitch Reboot: chassisd detects reboot cause on startup
  (DPUs not guaranteed to reattach before NPU goes down)

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>

* Merge PlannedShutdown and Offline into single AdminDown state

Simplify the state machine by combining PlannedShutdown and Offline
into a single AdminDown state. The state machine now waits for admin
state up to transition back to Booting. The actual planned shutdown
actions (gNOI HALT, power_down, pci_detach) are tracked separately
via state_transition_in_progress flag.

Also update PlannedReboot to be a direct Ready → Booting transition.

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>

---------

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
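
The self-recovery wait described in the commit list above can be sketched as follows. This is a minimal sketch under several assumptions: it assumes the reboot-cause validation described for WaitForWatchdog carries over to the unified WaitForSelfRecovery state, it uses the 300s `dpu_self_recovery_timeout` value from the later commit as its default, and it assumes that a DPU which comes back with an unlisted reboot cause is not accepted. The function name, parameters, and returned labels are hypothetical.

```python
# Reboot causes named in the HW watchdog commit message.
ACCEPTED_REBOOT_CAUSES = {"Kernel Panic", "Memory Exhaustion", "Watchdog"}


def evaluate_self_recovery(elapsed_s: float,
                           dpu_is_up: bool,
                           reboot_cause: str,
                           timeout_s: float = 300.0) -> str:
    """Decide what to do on one poll while waiting for DPU self-recovery."""
    # The DPU came back on its own and reports a cause that explains it.
    if dpu_is_up and reboot_cause in ACCEPTED_REBOOT_CAUSES:
        return "accept_self_recovery"
    # Grace period exhausted: fall back to a chassisd-initiated power-cycle.
    if elapsed_s >= timeout_s:
        return "power_cycle"
    return "keep_waiting"
```

Whether the power-cycle is then actually issued would still depend on the feature flag and on `dpu_reset_limit`, which this sketch leaves out.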