Potential bug fix for the guest VM boot failure #213

NSKernel · 2024-09-09T22:20:23Z

Description
As mentioned in #100, SEAMCALL failures occasionally occur while a guest VM boots. The failure message starts with:

SEAMCALL (0x000000000000000f) failed: 0xc0000b0800000001 RCX 0xxxxxxxxxx RDX 0x0000000000000400 R8 0xxxxxxxxxxx R9 0x0000000000000000 R10 0x0000000000000000 R11 0x0000000000000000

From the error code, this is TDX_TLB_TRACKING_NOT_DONE. On my machine, the stack dump shows it is raised by tdh_mem_page_demote, which __set_private_spte_present calls when kvm_x86_zap_private_spte fails. kvm_x86_zap_private_spte corresponds to tdx_sept_zap_private_spte, which can return -EAGAIN if the SEAMCALL tdh_mem_range_block returns TDX_ERROR_SEPT_BUSY (error code TDX_OPERAND_BUSY). In this case, as the ABI manual suggests, one can simply retry. Instead, __set_private_spte_present just performs the demoting (reversing) operation and returns, which causes the problem because the page hasn't been blocked and tracked.
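For anyone who wants to decode the status themselves, here is a minimal sketch of how the raw return value splits into a status class and a detail field. The constant values are written from memory of the kernel's TDX error header and should be checked against your own tree; the helper names `tdx_status_name` and `tdx_status_details` are made up for this example.

```c
#include <stdint.h>
#include <string.h>

/* Assumed values, recalled from the kernel's TDX error definitions.
 * Verify them against the header in your own tree. */
#define TDX_SEAMCALL_STATUS_MASK   0xFFFFFFFF00000000ULL
#define TDX_OPERAND_BUSY           0x8000020000000000ULL
#define TDX_TLB_TRACKING_NOT_DONE  0xC0000B0800000000ULL

/* Hypothetical helper: map a raw SEAMCALL return value to a status name.
 * Only the upper 32 bits select the status class. */
const char *tdx_status_name(uint64_t ret)
{
    switch (ret & TDX_SEAMCALL_STATUS_MASK) {
    case TDX_OPERAND_BUSY:
        return "TDX_OPERAND_BUSY";
    case TDX_TLB_TRACKING_NOT_DONE:
        return "TDX_TLB_TRACKING_NOT_DONE";
    default:
        return "UNKNOWN";
    }
}

/* Hypothetical helper: the lower 32 bits carry status-specific detail. */
uint32_t tdx_status_details(uint64_t ret)
{
    return (uint32_t)(ret & ~TDX_SEAMCALL_STATUS_MASK);
}
```

Passing `0xc0000b0800000001` from the log above to `tdx_status_name` selects the TDX_TLB_TRACKING_NOT_DONE branch, with a detail value of 1.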

Potential Fix

A fix that works on my side is to change __set_private_spte_present to check the return value and retry when it gets -EAGAIN. I've attached the patch here since I don't know how to submit it to Ubuntu's Git.
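The retry idea can be sketched in plain user-space C. This is not the attached patch and not kernel code: `struct fake_sept`, `fake_zap_private_spte` and `zap_with_retry` are hypothetical stand-ins, and the bounded loop is my own assumption about how one might avoid spinning forever.

```c
#include <errno.h>

/* Hypothetical stand-in for the S-EPT state touched by the zap hook. */
struct fake_sept {
    int busy_left; /* how many more times the "SEAMCALL" reports busy */
    int blocked;   /* set once the range has actually been blocked */
};

/* Hypothetical stand-in for the zap hook: returns -EAGAIN while busy,
 * then blocks the range and returns 0. */
int fake_zap_private_spte(struct fake_sept *s)
{
    if (s->busy_left > 0) {
        s->busy_left--;
        return -EAGAIN;
    }
    s->blocked = 1;
    return 0;
}

/* Retry only on -EAGAIN, up to max_tries attempts. Any other result,
 * success or a hard error, is returned to the caller unchanged. */
int zap_with_retry(struct fake_sept *s, int max_tries)
{
    int ret = -EAGAIN;

    for (int i = 0; i < max_tries && ret == -EAGAIN; i++)
        ret = fake_zap_private_spte(s);
    return ret;
}
```

The point is that the caller only moves on once the zap has really succeeded. If the retries run out, it still sees -EAGAIN and can decide what to do, rather than falling into the reversing path with an unblocked page.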

tdx-zap-fix.patch

syncronize-issues-to-jira · 2024-09-09T22:20:33Z

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/PEK-1245.

This message was autogenerated

hector-cao · 2024-09-09T23:01:50Z

Thanks @NSKernel for this patch, it is really great! We will take a look and get back to you ASAP.

NSKernel · 2024-09-10T00:41:00Z

Geez, I'm still seeing some problems with this patch. I'll need to dig further.

NSKernel · 2024-09-10T00:59:30Z

Another thing I've noticed is CONFIG_HYPERV. This option looks totally innocent but is in fact important even if you are running under KVM. When it's enabled, VT's flush_remote_tlbs_range is swapped for the TDX one, which invokes tdx_track. Without it, the same error occurs. Apart from that, the system works on my side now. I'll try to provide another patch to remove the CONFIG_HYPERV dependency.
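To illustrate why a config option can silently change behavior here, below is a toy model of an ops table with an optional flush hook. Everything in it is hypothetical (the `fake_` names, the struct layout, the fallback path), and it only mirrors the behavior described above, not the real kernel structures.

```c
#include <stddef.h>

/* Hypothetical VM state for the toy model. */
struct fake_vm {
    int is_td;   /* nonzero for a TDX guest */
    int tracked; /* set when the TLB-tracking step ran */
    int flushed; /* set when a remote TLB flush ran */
};

/* Hypothetical ops table with an optional hook. */
struct fake_x86_ops {
    void (*flush_remote_tlbs_range)(struct fake_vm *vm);
};

void fake_tdx_track(struct fake_vm *vm)
{
    vm->tracked = 1;
}

/* TDX-aware flush: track first for TD guests, then flush. */
void fake_vt_flush_remote_tlbs_range(struct fake_vm *vm)
{
    if (vm->is_td)
        fake_tdx_track(vm);
    vm->flushed = 1;
}

/* The hook is installed only when the (modeled) option is enabled. */
struct fake_x86_ops fake_make_ops(int hyperv_enabled)
{
    struct fake_x86_ops ops = { .flush_remote_tlbs_range = NULL };

    if (hyperv_enabled)
        ops.flush_remote_tlbs_range = fake_vt_flush_remote_tlbs_range;
    return ops;
}

/* With no hook installed, the generic fallback flushes without tracking. */
void fake_flush(const struct fake_x86_ops *ops, struct fake_vm *vm)
{
    if (ops->flush_remote_tlbs_range)
        ops->flush_remote_tlbs_range(vm);
    else
        vm->flushed = 1;
}
```

In this model, a TD guest flushed through `fake_make_ops(0)` ends up flushed but never tracked, which is the shape of the failure described above.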

NSKernel · 2024-09-10T01:50:31Z

Here's the second patch, which fixes the missing flush_remote_tlbs when CONFIG_HYPERV is not enabled. With this patch, the guest VM launches properly on my machine without CONFIG_HYPERV. The first patch seems to give some performance benefit but doesn't really seem to fix the bug.

fix2.patch

NSKernel · 2024-09-10T03:03:31Z

BTW, the content of the patch seems to have been discussed on the kernel mailing list, so take my patch as a hot fix until Intel's v20 patch is out.

NSKernel closed this as completed Sep 14, 2024
