Skip to content

Conversation

rameshraghupathy
Copy link
Contributor

@rameshraghupathy rameshraghupathy commented Sep 22, 2025

What I did

Fixed the AttributeError caused by missing delete_field method in the StateDBHelper class when managing DPU state transitions. The code was attempting to call state_db.delete_field() to remove the 'transition_start_time' field from the database, but this method didn't exist, causing crashes when DPUs were in bad state.

How I did it

Added the missing delete_field method to the StateDBHelper class that properly removes fields from Redis using the hdel command
Maintained the existing logic in set_state_transition_in_progress that removes 'transition_start_time' from both local state and database when transitioning to 'False'
Ensured consistency between local dictionary state and database state by properly implementing field deletion
The fix addresses the root cause of the AttributeError while preserving the intended behavior of cleaning up transition timestamps when state transitions complete.

Code Changes:

Added delete_field(self, table, key, field) method to StateDBHelper class
Method uses client.hdel(redis_key, field) to properly delete fields from Redis database
Existing call to state_db.delete_field() on line 85 now works correctly

How to verify it

Make the DPU midplane unreachable for one of the DPUs
Toggle the DPU ON/OFF state a couple of times using config chassis modules startup/shutdown DPUx
Verify that the commands complete without AttributeError crashes
Confirm that 'transition_start_time' field is properly removed from STATE_DB when transitions complete
Check that both local state and database state remain synchronized
Why this approach vs. removing the line:

This fix maintains the original intended behavior of cleaning up transition timestamps, ensuring database consistency and preventing stale data accumulation, while properly implementing the missing functionality that was causing the crash.

Which release branch to backport (provide reason below if selected)
[x] 202505
[x] 202506

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@KrisNey-MSFT
Copy link

hi @prabhataravind and / or @prsunny - looks like this one needs a Reviewer assigned pls.

Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes a crash that occurs when DPUs are in a bad state by removing a problematic call to delete_field method on the state database. The issue arises because the state_db object doesn't have a delete_field attribute, causing the application to crash during state transitions.

  • Removes the redundant delete_field call that was causing crashes
  • Keeps the transition_start_time field in the CHASSIS_MODULE_TABLE instead of deleting it

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@prsunny
Copy link
Contributor

prsunny commented Oct 7, 2025

@vvolam , please review and signoff.
@KrisNey-MSFT for viz

@vvolam
Copy link
Contributor

vvolam commented Oct 7, 2025

@prsunny I am waiting on Ramesh to address the comments.

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@vvolam vvolam requested a review from Copilot October 8, 2025 03:12
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.


Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@vvolam vvolam requested a review from Copilot October 8, 2025 16:38
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.


Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@vvolam vvolam requested a review from Copilot October 9, 2025 17:37
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.


Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Copy link
Contributor

@vvolam vvolam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Thank you.


# Use UTC everywhere for consistent comparison
current_time = datetime.now(timezone.utc)
if start_time.tzinfo is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy do we need to handle a case when tzinfo is none? as it is always updated by chassisd and module_base. why is this if condition needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gpunathilell It is a T1 switch and there is nothing wrong in doing the additional check to be resilient.

"""Delete a specific field from table|key."""
redis_key = f"{table}|{key}"
client = self.db.get_redis_client("STATE_DB")
return client.hdel(redis_key, field)
Copy link
Contributor

@gpunathilell gpunathilell Oct 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gpunathilell Thank you for the suggestion about using the database connector abstraction consistently. While I agree this would improve consistency, this logic is currently undergoing broader refactoring in PRs #4031 and related PRs across sonic-platform-common, sonic-host-services, and sonic-platform-daemons. To avoid conflicts and duplicated effort, I believe the database access pattern consistency should be addressed as part of that broader refactoring work rather than making incremental changes here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since refactoring will be done, please do not resolve the review comments, I will approve as this is blocking bug fix. Thank you

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants