-
Notifications
You must be signed in to change notification settings - Fork 744
Fixing state_db not having delete_field attribute causing a crash when DPUs in bad state #4064
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…n DPUs in bad state
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
hi @prabhataravind and / or @prsunny - looks like this one needs a Reviewer assigned pls. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes a crash that occurs when DPUs are in a bad state by removing a problematic call to delete_field
method on the state database. The issue arises because the state_db
object doesn't have a delete_field
attribute, causing the application to crash during state transitions.
- Removes the redundant
delete_field
call that was causing crashes - Keeps the
transition_start_time
field in the CHASSIS_MODULE_TABLE instead of deleting it
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
@vvolam , please review and signoff. |
@prsunny I am waiting on Ramesh to address the comments. |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, Thank you.
|
||
# Use UTC everywhere for consistent comparison | ||
current_time = datetime.now(timezone.utc) | ||
if start_time.tzinfo is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy do we need to handle a case when tzinfo is none? as it is always updated by chassisd and module_base. why is this if condition needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gpunathilell It is a T1 switch and there is nothing wrong in doing the additional check to be resilient.
"""Delete a specific field from table|key.""" | ||
redis_key = f"{table}|{key}" | ||
client = self.db.get_redis_client("STATE_DB") | ||
return client.hdel(redis_key, field) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use the same redis sonic dbconnector abstraction to delete https://github.com/rameshraghupathy/sonic-utilities/blob/fc016432a4d79dbb9eebd31f7f008759dc737122/config/chassis_modules.py#L85
Since we are using the db connector abstractions for the other two functions
https://github.com/rameshraghupathy/sonic-utilities/blob/fc016432a4d79dbb9eebd31f7f008759dc737122/config/chassis_modules.py#L28
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gpunathilell Thank you for the suggestion about using the database connector abstraction consistently. While I agree this would improve consistency, this logic is currently undergoing broader refactoring in PRs #4031 and related PRs across sonic-platform-common, sonic-host-services, and sonic-platform-daemons. To avoid conflicts and duplicated effort, I believe the database access pattern consistency should be addressed as part of that broader refactoring work rather than making incremental changes here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since refactoring will be done, please do not resolve the review comments, I will approve as this is blocking bug fix. Thank you
What I did
Fixed the AttributeError caused by missing delete_field method in the StateDBHelper class when managing DPU state transitions. The code was attempting to call state_db.delete_field() to remove the 'transition_start_time' field from the database, but this method didn't exist, causing crashes when DPUs were in bad state.
How I did it
Added the missing delete_field method to the StateDBHelper class that properly removes fields from Redis using the hdel command
Maintained the existing logic in set_state_transition_in_progress that removes 'transition_start_time' from both local state and database when transitioning to 'False'
Ensured consistency between local dictionary state and database state by properly implementing field deletion
The fix addresses the root cause of the AttributeError while preserving the intended behavior of cleaning up transition timestamps when state transitions complete.
Code Changes:
Added delete_field(self, table, key, field) method to StateDBHelper class
Method uses client.hdel(redis_key, field) to properly delete fields from Redis database
Existing call to state_db.delete_field() on line 85 now works correctly
How to verify it
Make the DPU midplane unreachable for one of the DPUs
Toggle the DPU ON/OFF state a couple of times using config chassis modules startup/shutdown DPUx
Verify that the commands complete without AttributeError crashes
Confirm that 'transition_start_time' field is properly removed from STATE_DB when transitions complete
Check that both local state and database state remain synchronized
Why this approach vs. removing the line:
This fix maintains the original intended behavior of cleaning up transition timestamps, ensuring database consistency and preventing stale data accumulation, while properly implementing the missing functionality that was causing the crash.
Which release branch to backport (provide reason below if selected)
[x] 202505
[x] 202506