[SmartSwitch] Enhance ModuleBase with graceful shutdown and startup transition handling #608
base: master
Conversation
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Pull Request Overview
This PR refactors and expands the ModuleBase class to support graceful state transitions for SmartSwitch modules, along with comprehensive test coverage improvements.
- Adds graceful admin state management with timeout handling and state transition tracking
- Introduces state database initialization and centralized transition flag management
- Refactors test suite for better organization, readability, and parametrization
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| sonic_platform_base/module_base.py | Adds graceful shutdown support, state DB initialization, transition lock mechanism, and state transition management APIs |
| tests/module_base_test.py | Completely refactored tests with better organization, parametrization, and comprehensive coverage of new graceful shutdown and state transition features |
Comments suppressed due to low confidence (1)
sonic_platform_base/module_base.py:116
- The `_file_operation_lock` method attempts to open the lock file without ensuring the parent directory exists. If `/var/lock/` doesn't exist, this will raise `FileNotFoundError`. The tests mock `os.makedirs` (line 93 in test file), suggesting directory creation was intended but never implemented in the production code. Add `os.makedirs(os.path.dirname(lock_file_path), exist_ok=True)` before opening the file.
def _file_operation_lock(self, lock_file_path):
    """Common file-based lock for operations using flock"""
    with open(lock_file_path, 'w') as f:
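
The suggested fix can be sketched as a standalone context manager. This is a minimal illustration, not the PR's code: the free function `file_operation_lock` is a hypothetical stand-in for the method above, and it assumes a POSIX host where `fcntl` is available.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def file_operation_lock(lock_file_path):
    """Hold an exclusive flock on lock_file_path for the duration of the block."""
    # Create the parent directory first so open() cannot fail with
    # FileNotFoundError when, for example, /var/lock/ is missing.
    parent = os.path.dirname(lock_file_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(lock_file_path, "w") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until the lock is acquired
        try:
            yield f
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

With `exist_ok=True`, the directory creation is safe to repeat on every call and does not fail if another process creates the directory first.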
@rameshraghupathy @gpunathilell could you please review this latest PR?
return ModuleBase._TRANSITION_TIMEOUTS_CACHE

timeouts = self._TRANSITION_TIMEOUT_DEFAULTS.copy()
platform_json_path = "/usr/share/sonic/platform/platform.json"
@vvolam if this function is accessed from the host context (e.g. for reboot command execution), then the path is different.
@gpunathilell this function is used only in this context: callers only call set_module_state_transition(), which in turn calls this function to get the timeouts.
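
For readers following the thread, here is a rough sketch of the pattern under discussion: start from defaults, overlay values from platform.json, and cache the result. It is not the PR's implementation. The default values, the `transition_timeouts` key, and the function name are hypothetical; only the default path comes from the diff. Taking the path as a parameter is one way a host-context caller could supply a different location.

```python
import functools
import json

# Hypothetical defaults in seconds; the PR's actual values are not shown here.
TRANSITION_TIMEOUT_DEFAULTS = {"startup": 300, "shutdown": 180, "reboot": 240}


@functools.lru_cache(maxsize=None)
def load_transition_timeouts(
    platform_json_path="/usr/share/sonic/platform/platform.json",
):
    """Return defaults overlaid with any overrides found in platform.json."""
    timeouts = TRANSITION_TIMEOUT_DEFAULTS.copy()
    try:
        with open(platform_json_path) as f:
            data = json.load(f)
        # "transition_timeouts" is an assumed key name for illustration only.
        overrides = data.get("transition_timeouts", {}) if isinstance(data, dict) else {}
        for name in TRANSITION_TIMEOUT_DEFAULTS:
            if name in overrides:
                timeouts[name] = int(overrides[name])
    except (OSError, ValueError, TypeError):
        # Missing or malformed file: keep whatever has been resolved so far.
        pass
    return timeouts
```

Because `lru_cache` keys on the path argument, each distinct path is read at most once per process, and callers share the returned dict, so they should treat it as read-only.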
module_name = self.get_name()
# Set the module state to administratively up.
if up:
    if not self.set_module_state_transition(module_name, "startup"):
@vvolam So how do we handle the case where another admin state change or reboot is already being handled? Do we fail the request directly?
@gpunathilell That's right, we come to this point only if the request is shutdown/reboot and we directly fail as it hits 'else' condition at 592.
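
The fail-fast behavior described in this exchange can be illustrated with a small in-memory model. Everything here is a hypothetical stand-in: `TransitionTable` replaces the state DB, the row layout is invented, and letting a new request take over once the previous transition's timeout has elapsed is an assumption of this sketch, not something the thread confirms.

```python
import time


class TransitionTable:
    """In-memory stand-in for per-module transition rows (hypothetical layout)."""

    def __init__(self):
        self._rows = {}

    def set_transition(self, module_name, transition_type, timeout_s, now=None):
        """Mark a transition as started; return False if one is already active."""
        now = time.time() if now is None else now
        row = self._rows.get(module_name)
        # Reject the request while another transition is in progress and has
        # not yet exceeded its own timeout (assumed staleness rule).
        if row and row["in_progress"] and now - row["start"] < row["timeout_s"]:
            return False
        self._rows[module_name] = {
            "in_progress": True,
            "type": transition_type,
            "start": now,
            "timeout_s": timeout_s,
        }
        return True

    def clear_transition(self, module_name):
        """Mark the module's transition as finished."""
        row = self._rows.get(module_name)
        if row:
            row["in_progress"] = False
```

A caller in the style of the diff above would return failure as soon as `set_transition` returns False, rather than queueing or retrying the request.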
@vvolam A mechanism to indicate the completion of "module_pre_shutdown" to "gnoi_shutdown_daemon" is needed in module_base.py so that the gnoi request can be sent to the DPU.
@vvolam The graceful_shutdown_handler in module_base should check for the completion of gnoi from the gnoi_shutdown_daemon before proceeding to trigger set_admin_state(False)
@rameshraghupathy I think this is not considered in the HLD, as we are not protecting the whole state transition using transition_in_progress. Let's introduce a new field, gnoi_halt_in_progress, to start and complete the gnoi HALT, and I will update the PR. Thank you for pointing this out.
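
One way to picture the proposed wait is a bounded polling loop over the gnoi_halt_in_progress field. This is a sketch under assumptions: the helper name, the `read_flag` callback, and the convention that the string "True" means the halt is still running are all hypothetical, and the real handler would read the field from the state DB.

```python
import time


def wait_for_gnoi_halt(read_flag, timeout_s, poll_interval_s=1.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll until the halt flag clears; return False if timeout_s elapses first.

    read_flag is a caller-supplied function returning the current value of
    the gnoi_halt_in_progress field as a string (assumed "True" while active).
    """
    deadline = clock() + timeout_s
    while True:
        if read_flag() != "True":
            return True  # halt finished, safe to continue the shutdown
        if clock() >= deadline:
            return False  # timed out, caller decides how to proceed
        sleep(poll_interval_s)
```

Injecting `sleep` and `clock` keeps the loop testable without real delays, and `time.monotonic` avoids surprises if the wall clock is adjusted during the wait.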
@rameshraghupathy Updated code to isolate gnoi progress to start and monitor using |
Description
HLD: https://github.com/sonic-net/SONiC/blob/master/doc/smart-switch/graceful-shutdown/graceful-shutdown.md
These changes build upon enhancements in sonic-platform-common#567.
This change introduces enhancements to the ModuleBase class to support graceful shutdown and startup operations for DPU and other module types.
It adds new methods and transition handling logic to ensure platform modules follow an ordered and coordinated shutdown/startup procedure, minimizing hardware inconsistencies and transient errors during reboot or DPU detachment.
Key changes include:
Added transition management APIs:
Introduced graceful lifecycle handlers:
- _graceful_shutdown_handler() to wait for external transition completion using the gnoi_halt_in_progress field, with timeout handling
Added helper functions for:
Motivation and Context
This enhancement is part of the SmartSwitch / DPU graceful shutdown/reboot and state management effort.
Currently, ModuleBase lacks lifecycle orchestration methods for safe shutdown or startup of DPUs and peripheral modules.
By adding transition-aware handling, the system can:
- Avoid race conditions between platform daemons during reboot/shutdown
- Ensure state transitions are reflected in Redis (CHASSIS_MODULE_TABLE)
- Support controlled detach/reattach of PCIe devices and sensor configuration reloads
- Enable PMON daemons to coordinate module-level transitions consistently
This work aligns with SONiC’s graceful reboot framework and the upcoming DPU lifecycle enhancements tracked internally.
How Has This Been Tested?
Testing performed on both SmartSwitch (DPU-enabled) and non-DPU platforms:
Additional Information (Optional)