[action] [PR:4572] [cli]: Add DPU recovery CLI commands for SmartSwitch #4597

Open
mssonicbld wants to merge 1 commit into sonic-net:202605 from mssonicbld:cherry/202605/4572

@mssonicbld

#### What I did

Added CLI support for DPU recovery monitoring on SmartSwitch platforms as specified in the [Enhance DPU Robustness HLD](sonic-net/SONiC#2310):

1. Extended `show chassis modules status` with a `Ready-Status` column on SmartSwitch platforms.
2. Added a new `show chassis modules recovery` command to expose detailed DPU recovery state.

#### How I did it

- Modified `show/chassis_modules.py`:
  - On SmartSwitch, the `status` command now connects to `CHASSIS_STATE_DB` and reads `ready_status` from `DPU_STATE|DPU<N>` for each module, appending it as a new column.
  - Added a new `recovery` subcommand that reads `ready_status`, `recovery_status`, `reset_count`, `last_down_time`, and `last_ready_time` from `CHASSIS_STATE_DB: DPU_STATE|DPU<N>`.
  - The `recovery` command is gated to SmartSwitch platforms only.

- Added unit tests in `tests/chassis_modules_test.py`:
  - `TestChassisModulesRecovery` class with 7 test cases covering all DPUs, single DPU filter, non-SmartSwitch guard, no-data scenario, Ready-Status in status command, unrecoverable DPU display, and missing fields handling.
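
The read path described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in `show/chassis_modules.py`: a plain dictionary stands in for `CHASSIS_STATE_DB`, and the helper name `build_recovery_rows`, the `module_filter` parameter, and the `N/A` placeholder for missing fields are assumptions. Only the key pattern `DPU_STATE|DPU<N>` and the five field names come from the description.

```python
from typing import List, Mapping, Optional

# Field names as listed in the PR description.
RECOVERY_FIELDS = [
    "ready_status",
    "recovery_status",
    "reset_count",
    "last_down_time",
    "last_ready_time",
]

# Assumption: placeholder shown when a field is absent from the table entry.
MISSING = "N/A"

KEY_PREFIX = "DPU_STATE|"


def _dpu_index(name: str) -> int:
    """Sort helper: numeric suffix of 'DPU<N>', or -1 if there is none."""
    digits = "".join(ch for ch in name if ch.isdigit())
    return int(digits) if digits else -1


def build_recovery_rows(
    state_db: Mapping[str, Mapping[str, str]],
    module_filter: Optional[str] = None,
) -> List[List[str]]:
    """Build one table row per DPU from a dict standing in for CHASSIS_STATE_DB."""
    names = [key[len(KEY_PREFIX):] for key in state_db if key.startswith(KEY_PREFIX)]
    rows = []
    for name in sorted(names, key=_dpu_index):
        # Hypothetical single-DPU filter: skip every module except the one requested.
        if module_filter is not None and name != module_filter:
            continue
        entry = state_db[KEY_PREFIX + name]
        rows.append([name] + [entry.get(field, MISSING) for field in RECOVERY_FIELDS])
    return rows
```

An entry that lacks some fields still yields a full-width row, which is one plausible way to cover the missing-fields and no-data cases the unit tests mention.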

#### How to verify it

```bash
# On a SmartSwitch platform:
show chassis modules status
# Should now include a "Ready-Status" column

show chassis modules recovery
# Should display DPU recovery details

# Run unit tests:
cd src/sonic-utilities
python3 -m pytest tests/chassis_modules_test.py::TestChassisModulesRecovery -v
```

#### Previous command output (if the output of a command-line utility has changed)

```
admin@sonic:~$ show chassis modules status
  Name    Description    Physical-Slot    Oper-Status    Admin-Status    Serial
------  -------------  ---------------  -------------  --------------  --------
  DPU0    <DPU Sku>              N/A         Online              up    <serial>
  DPU1    <DPU Sku>              N/A         Online              up    <serial>
```

#### New command output (if the output of a command-line utility has changed)

```
admin@sonic:~$ show chassis modules status
  Name    Description    Physical-Slot    Oper-Status    Admin-Status    Serial    Ready-Status
------  -------------  ---------------  -------------  --------------  --------  --------------
  DPU0    <DPU Sku>              N/A         Online              up    <serial>            true
  DPU1    <DPU Sku>              N/A         Online              up    <serial>            true

admin@sonic:~$ show chassis modules recovery
  Name    Ready-Status    Recovery-Status    Reset-Count                   Last-Down-Time                  Last-Ready-Time
------  --------------  -----------------  -------------  -------------------------------  -------------------------------
  DPU0            true        recoverable              0  Fri May 29 09:26:33 PM UTC 2026  Fri May 29 09:26:52 PM UTC 2026
  DPU1            true        recoverable              0  Fri May 29 09:26:33 PM UTC 2026  Fri May 29 09:26:52 PM UTC 2026
  DPU2            true        recoverable              0  Fri May 29 09:26:33 PM UTC 2026  Fri May 29 09:26:52 PM UTC 2026
```
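
As a reading aid for the `Last-Down-Time` and `Last-Ready-Time` columns, the sketch below computes the gap between the two timestamps. It assumes the timestamps always follow the layout seen in the sample output above; the stored format is not confirmed by this PR, and the names `TIMESTAMP_FORMAT` and `recovery_seconds` are hypothetical.

```python
from datetime import datetime

# Assumption: layout inferred from the sample output only,
# e.g. "Fri May 29 09:26:33 PM UTC 2026" (English day and month names).
TIMESTAMP_FORMAT = "%a %b %d %I:%M:%S %p UTC %Y"


def recovery_seconds(last_down_time: str, last_ready_time: str) -> float:
    """Seconds between the last down event and the last ready event."""
    down = datetime.strptime(last_down_time, TIMESTAMP_FORMAT)
    ready = datetime.strptime(last_ready_time, TIMESTAMP_FORMAT)
    return (ready - down).total_seconds()


print(recovery_seconds(
    "Fri May 29 09:26:33 PM UTC 2026",
    "Fri May 29 09:26:52 PM UTC 2026",
))  # prints 19.0
```

A negative result would mean the last ready event predates the last down event, that is, the DPU has not become ready again since it last went down.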

Signed-off-by: Sonic Build Admin <sonicbld@microsoft.com>

@mssonicbld

Original PR: #4572

@mssonicbld

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).
