Skip to content

Conversation

@vvolam
Copy link
Contributor

@vvolam vvolam commented Oct 31, 2025

Description

HLD: https://github.com/sonic-net/SONiC/blob/master/doc/smart-switch/graceful-shutdown/graceful-shutdown.md
These changes build upon enhancements in sonic-platform-common#567

This change introduces enhancements to the ModuleBase class to support graceful shutdown and startup operations for DPU and other module types.

It adds new methods and transition handling logic to ensure platform modules follow an ordered and coordinated shutdown/startup procedure, minimizing hardware inconsistencies and transient errors during reboot or DPU detachment.

Key changes include:
Added transition management APIs:

set_module_state_transition()
get_module_state_transition()
clear_module_state_transition()

Introduced graceful lifecycle handlers:

  • _graceful_shutdown_handler() to wait for external transition completion using gnoi_halt_in_progress field with timeout handling
  • Implemented database-backed transition tracking (CHASSIS_MODULE_TABLE)

Added helper functions for:

  • File-based operation locks to ensure concurrency safety during transitions
  • Included caching of transition timeout configuration from platform.json
  • Added robust error-handling and logging to prevent partial updates in Redis DB

Motivation and Context

This enhancement is part of the SmartSwitch / DPU graceful shutdown/reboot and state management effort.
Currently, ModuleBase lacks lifecycle orchestration methods for safe shutdown or startup of DPUs and peripheral modules.
By adding transition-aware handling, the system can:

Avoid race conditions between platform daemons during reboot/shutdown
Ensure state transitions are reflected in Redis (CHASSIS_MODULE_TABLE)
Support controlled detach/reattach of PCIe devices and sensor configuration reloads
Enable PMON daemons to coordinate module-level transitions consistently
This work aligns with SONiC’s graceful reboot framework and the upcoming DPU lifecycle enhancements tracked internally.

How Has This Been Tested?

Testing performed on both SmartSwitch (DPU-enabled) and non-DPU platforms:

  • ✅ Unit tests added under tests/test_module_base.py covering:
    • Transition management (set/get/clear)
    • Timeout behavior and concurrency lock handling
    • PCIe detach/reattach and sensor config updates
    • Graceful shutdown/startup flows (set_admin_state_gracefully)
  • ✅ Verified Redis DB updates for transition keys under CHASSIS_MODULE_TABLE
  • ✅ Simulated shutdown and startup sequences:
    • module_pre_shutdown() → safely detaches PCIe and updates state
    • module_post_startup() → rescans PCIe and restores sensor configuration
  • ✅ Regression-tested existing platform daemons to ensure backward compatibility

Additional Information (Optional)

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors and expands the ModuleBase class to support graceful state transitions for SmartSwitch modules, along with comprehensive test coverage improvements.

  • Adds graceful admin state management with timeout handling and state transition tracking
  • Introduces state database initialization and centralized transition flag management
  • Refactors test suite for better organization, readability, and parametrization

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
sonic_platform_base/module_base.py Adds graceful shutdown support, state DB initialization, transition lock mechanism, and state transition management APIs
tests/module_base_test.py Completely refactored tests with better organization, parametrization, and comprehensive coverage of new graceful shutdown and state transition features
Comments suppressed due to low confidence (1)

sonic_platform_base/module_base.py:116

  • The _file_operation_lock method attempts to open the lock file without ensuring the parent directory exists. If /var/lock/ doesn't exist, this will raise FileNotFoundError. The tests mock os.makedirs (line 93 in test file), suggesting directory creation was intended but never implemented in the production code. Add os.makedirs(os.path.dirname(lock_file_path), exist_ok=True) before opening the file.
    def _file_operation_lock(self, lock_file_path):
        """Common file-based lock for operations using flock"""
        with open(lock_file_path, 'w') as f:

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@vvolam vvolam changed the title Enhance ModuleBase with graceful shutdown and startup transition handling [SmartSwitch] Enhance ModuleBase with graceful shutdown and startup transition handling Nov 1, 2025
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@vvolam
Copy link
Contributor Author

vvolam commented Nov 4, 2025

@rameshraghupathy @gpunathilell could you please review this latest PR

return ModuleBase._TRANSITION_TIMEOUTS_CACHE

timeouts = self._TRANSITION_TIMEOUT_DEFAULTS.copy()
platform_json_path = "/usr/share/sonic/platform/platform.json"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vvolam if this function is being acccessed from host context (i.e for reboot command execution) then the path is different

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gpunathilell this function is used only in this context as the callers will only be calling set_module_state_transition() which is calling this function to get the timeouts.

module_name = self.get_name()
# Set the module state to administratively up.
if up:
if not self.set_module_state_transition(module_name, "startup"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vvolam So how do we handle the case when another admin state/reboot is currently being handled? do we directly fail the request?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gpunathilell That's right, we come to this point only if the request is shutdown/reboot and we directly fail as it hits 'else' condition at 592.

@rameshraghupathy
Copy link
Contributor

@vvolam A mechanism to indicate the completion of "module_pre_shutdown" to "gnoi_shutdown_daemon" is needed in module_base.py so that the gnoi request can be sent to the DPU.

@rameshraghupathy
Copy link
Contributor

@vvolam The graceful_shutdown_handler in module_base should check for the completion of gnoi from the gnoi_shutdown_daemon before proceeding to trigger set_admin_state(False)

@vvolam
Copy link
Contributor Author

vvolam commented Nov 5, 2025

@vvolam The graceful_shutdown_handler in module_base should check for the completion of gnoi from the gnoi_shutdown_daemon before proceeding to trigger set_admin_state(False)

@rameshraghupathy I think this is not considered in the HLD as we are not protecting whole state transition using transition_in_progress. Let introduce a new field gnoi_halt_in_progress to start and complete gnoi HALT and update the PR. Thank you for pointing this out

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@vvolam
Copy link
Contributor Author

vvolam commented Nov 6, 2025

@vvolam The graceful_shutdown_handler in module_base should check for the completion of gnoi from the gnoi_shutdown_daemon before proceeding to trigger set_admin_state(False)

@rameshraghupathy I think this is not considered in the HLD as we are now protecting whole state transition using transition_in_progress. Let introduce a new field gnoi_halt_in_progress to start and complete gnoi HALT and update the PR. Thank you for pointing this out

@rameshraghupathy Updated code to isolate gnoi progress to start and monitor using gnoi_halt_in_progress field. And timeout will be dpu_halt_services_timeout similar to reboot.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants