Skip to content

vmbus_client: fix handling of revoked channels with active requests #1100

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

jstarks
Copy link
Member

@jstarks jstarks commented Mar 26, 2025

If there are active GPADL or modify channel requests when the host revokes a channel, it will still send responses for those requests. This is different from channel open, for which the host will only send a response if the channel has not yet been revoked.

vmbus_client was not handling this properly, so it would release the channel ID after receiving a revoke and before the GPADL responses arrived. This would cause panics when the request responses eventually arrived.

Fix this by tracking the state more carefully and only releasing the channel once all active requests have been completed by the host. Wait for such requests during pause so that we do not have to save channels in a revoked-with-requests-outstanding state.

Also, fix a theoretical deadlock in the pause path. And now that #963 has been merged into the release branch, remove the workaround for older releases not restoring pending messages.

@jstarks jstarks requested review from a team as code owners March 26, 2025 18:30
@benhillis
Copy link
Member

One question about the last part of the description about removing the workaround: Would this still be an issue if upgrading from older versions of OpenHCL that do not have the fix? For example if we upgrade from current in-market directly to this version and skip 2411?

@benhillis
Copy link
Member

I'm just wondering if we need to keep that workaround a bit longer.


In reply to: 2758433642

Comment on lines +1409 to +1417
while let Some((id, request)) = self.channels.revoked_channel_with_pending_request() {
tracelimit::info_ratelimited!(
channel_id = id.0,
request,
"waiting for responses for channel"
);
assert!(self.process_next_message().await);
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't there the potential for this to never finish given a malicious/buggy host?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that was already the case with the pause/resume message.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A malicious host can always DoS the guest (e.g., by not running it).

/// as well.
struct ChannelRef<'a>(hash_map::OccupiedEntry<'a, ChannelId, Channel>);

/// A tag value used to prove that [`ChannelRef::try_release`] has been called.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this prove anything when it can be constructed without calling try_release? Maybe just update the wording of the comment (indicate instead of prove?).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, fair. OK, I've had the github copilot agent rewrite the comment, hah.

@jstarks jstarks enabled auto-merge (squash) April 10, 2025 20:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants