-
Notifications
You must be signed in to change notification settings - Fork 118
vmbus_client: fix handling of revoked channels with active requests #1100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
One question about the last part of the description about removing the workaround: Would this still be an issue if upgrading from older versions of OpenHCL that do not have the fix? For example if we upgrade from current in-market directly to this version and skip 2411? |
I'm just wondering if we need to keep that workaround a bit longer. In reply to: 2758433642 |
while let Some((id, request)) = self.channels.revoked_channel_with_pending_request() { | ||
tracelimit::info_ratelimited!( | ||
channel_id = id.0, | ||
request, | ||
"waiting for responses for channel" | ||
); | ||
assert!(self.process_next_message().await); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't there the potential for this to never finish given a malicious/buggy host?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess that was already the case with the pause/resume message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A malicious host can always DoS the guest (e.g., by not running it).
/// as well. | ||
struct ChannelRef<'a>(hash_map::OccupiedEntry<'a, ChannelId, Channel>); | ||
|
||
/// A tag value used to prove that [`ChannelRef::try_release`] has been called. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this prove anything when it can be constructed without calling try_release? Maybe just update the wording of the comment (indicate instead of prove?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, fair. OK, I've had the github copilot agent rewrite the comment, hah.
If there are active GPADL or modify channel requests when the host revokes a channel, it will still send responses for those requests. This is different from channel open, for which the host will only send a response if the channel has not yet been revoked.
vmbus_client
was not handling this properly, so it would release the channel ID after receiving a revoke and before the GPADL responses arrived. This would cause panics when the request responses eventually arrived.Fix this by tracking the state more carefully and only releasing the channel once all active requests have been completed by the host. Wait for such requests during pause so that we do not have to save channels in a revoked-with-requests-outstanding state.
Also, fix a theoretical deadlock in the pause path. And now that #963 has been merged into the release branch, remove the workaround for older releases not restoring pending messages.