nvme_driver: don't flr nvme devices #1714
Pull Request Overview
This PR modifies the NVMe device initialization logic to disable Function Level Reset (FLR) during device attach/detach operations to improve system startup and shutdown performance. The change implements a fallback mechanism that attempts device initialization without FLR first, and only enables FLR if the initial attempt fails.
- Refactors NVMe device creation into separate functions with retry logic
- Adds reset method configuration to disable FLR by default
- Implements fallback to FLR if device initialization fails without reset
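
The fallback described above can be illustrated with a minimal, self-contained sketch. This is not the code from the PR: `ResetMethod`, `InitError`, and `init_with_fallback` are hypothetical names chosen for illustration (the PR's real type is `PciDeviceResetMethod`, whose variants are not shown here), and device bring-up is abstracted behind a caller-supplied closure.

```rust
use std::fmt;

/// Hypothetical stand-in for the PR's `PciDeviceResetMethod`.
/// Only the two cases relevant to the fallback are modeled.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ResetMethod {
    /// Attach the device without any reset (the new default).
    NoReset,
    /// Attach the device with a Function Level Reset.
    Flr,
}

/// Hypothetical error type for a failed device initialization.
#[derive(Debug)]
pub struct InitError(pub String);

impl fmt::Display for InitError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "device init failed: {}", self.0)
    }
}

impl std::error::Error for InitError {}

/// Try to initialize the device without a reset first. If that attempt
/// fails, retry exactly once with FLR enabled. Returns the device together
/// with the reset method that was in effect when initialization succeeded.
pub fn init_with_fallback<D, F>(mut try_init: F) -> Result<(D, ResetMethod), InitError>
where
    F: FnMut(ResetMethod) -> Result<D, InitError>,
{
    match try_init(ResetMethod::NoReset) {
        Ok(dev) => Ok((dev, ResetMethod::NoReset)),
        Err(first) => {
            // Log the first failure, then fall back to the slower path.
            eprintln!("init without FLR failed ({first}); retrying with FLR");
            try_init(ResetMethod::Flr).map(|dev| (dev, ResetMethod::Flr))
        }
    }
}
```

In this sketch the caller passes a closure that performs one initialization attempt for a given reset method. Returning the method that succeeded lets the caller record whether the fast path (no FLR) or the fallback was taken.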
Comments suppressed due to low confidence (1)
openhcl/underhill_core/src/nvme_manager.rs:244
- [nitpick] The closure name 'update_reset' could be more descriptive. Consider renaming to 'set_device_reset_method' to better reflect its purpose.
let update_reset = |method: PciDeviceResetMethod| {
Bummer, this doesn't work yet:
This is expected with our NVMe emulator, which does not (currently) support any of Linux's reset methods.
Got it. I see confirmation that this works when testing with a physical device. In addition, I have a draft PR to add FLR support (so we can test this in CI).
Seems reasonable to me, but leaving for someone with more background here to approve.
These are snp tests, which are unaffected by these code changes (there's no nvme to vtl2 for hyperv tests). Merging main in case the issue is simply that this branch is out of date.
tracing::warn!(
    ?method,
    err = &err as &dyn std::error::Error,
    "Failed to update reset_method"
nit: lowercase "failed" to match others
I'll take that as a follow-up. I've got more changes to come in this code base. Thanks Chris!
The default `vfio` device behavior is to issue a function level reset when attaching or detaching devices. It does so because the device is in an unknown or untrusted state. However, within the context of a trusted virtualization stack, OpenHCL can reasonably trust the state and behavior of the device. So, optimize performance by removing these function level resets for nvme devices. This follows the same model as already exists for MANA devices.

The `nvme_driver` already shuts down the device (see `NvmeDriver::reset()`) and waits for the device to become disabled. A well behaved nvme device will not issue DMA after this point. That same device should tolerate a graceful start without an FLR.

Pending work before this PR is ready to commit:

- [x] Initial poc
- [x] Parameterize via command line (provide a way to disable, and also easily run A/B tests)
- [x] Fix microsoft#1714 (comment)
- [x] Check no regressions on CI
- [x] Test servicing locally with real NVMe devices