This version can also be found in the "queue" branch of kvm.git.
Since it should be final, I'm including the full description again for
reference.
Both MBEC and GMET allow more granular control over execute permissions,
with different levels of separation between supervisor and user mode.
MBEC provides support for separate supervisor and user-mode bits in the
PTEs; GMET instead lacks supervisor-mode only execution (with NX=0,
"both" is represented by U=0 and user-mode only by U=1). GMET was
clearly inspired by SMEP though with some differences and annoyances.
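As a rough illustration of the two permission models described above, the
execute check could be sketched like this. This is not KVM code; the types
and function names (mbec_pte, gmet_pte, mbec_can_exec, gmet_can_exec) are
hypothetical and only encode the semantics stated in the paragraph.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical types for illustration; none of these names exist in KVM. */
enum fetch_mode { FETCH_SUPERVISOR, FETCH_USER };

/* MBEC: independent execute bits for supervisor (XS) and user (XU). */
struct mbec_pte { bool xs; bool xu; };

/* GMET: only the NX and U bits are available. */
struct gmet_pte { bool nx; bool user; };

static bool mbec_can_exec(struct mbec_pte pte, enum fetch_mode mode)
{
	assert(mode == FETCH_SUPERVISOR || mode == FETCH_USER);
	return mode == FETCH_USER ? pte.xu : pte.xs;
}

static bool gmet_can_exec(struct gmet_pte pte, enum fetch_mode mode)
{
	assert(mode == FETCH_SUPERVISOR || mode == FETCH_USER);
	if (pte.nx)
		return false;			/* NX=1: never executable */
	if (pte.user)
		return mode == FETCH_USER;	/* NX=0, U=1: user-mode only */
	return true;				/* NX=0, U=0: both modes */
}
```

Note that no gmet_pte value makes a page executable by the supervisor only,
which is the asymmetry mentioned above.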
The implementation starts from two changes to core MMU code, both
of which help make the actual feature almost trivial to implement:
- first, I'm cleaning up the implementation of nVMX exec-only, by
  properly adding read permissions to the ACC_* constants and to the
  permission bitmask machinery. Jon also had to add a fourth ACC_*
  bit, but used it only in the special case of nested MBEC; here
  instead ACC_READ_MASK is the normality, which simplifies testing
  a lot and removes gratuitous complexity.
- second, I'm enforcing that KVM runs with MBEC/GMET enabled even in
  non-nested mode, if it wants to provide the feature to nested
  hypervisors. This makes the creation of SPTEs look exactly the
  same for L1 and L2 guests, despite only the latter using MBEC/GMET
  fully; the difference lies only in the input access permissions.
This strategy adds only a limited amount of complexity to the core,
while providing for an almost entirely seamless support of nested
hypervisors.
Later patches have to use slightly different meanings for ACC_* in Intel
and AMD. On the Intel side, some work is needed in order to split
shadow_x_mask and ACC_EXEC_MASK in two; now that there is an actual
ACC_READ_MASK to be used for exec-only pages, ACC_USER_MASK is unused
and can be reused as ACC_USER_EXEC_MASK. However, unlike the older
ACC_USER_MASK hack, these differences are backed by concrete concepts
of the page table format, and there is always a 1:1 mapping from ACC_*
bits to PT_*_MASK or shadow_*_mask:
                      Intel                AMD
--------------------  -------------------  -------------------
ACC_READ_MASK         PT_PRESENT_MASK      PT_PRESENT_MASK
ACC_WRITE_MASK        PT_WRITABLE_MASK     PT_WRITABLE_MASK
ACC_EXEC_MASK         shadow_xs_mask       shadow_nx_mask
ACC_USER_MASK         ---                  shadow_user_mask
ACC_USER_EXEC_MASK    shadow_xu_mask       ---
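The table above can also be read as a lookup from access bit and vendor to
the backing page table mask. The sketch below is only a restatement of the
table in C; the enum values and the acc_backing_mask() helper are invented
for the example, and the real ACC_* constants are bit masks rather than
indices.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical indices; the real ACC_* constants in KVM are bit masks. */
enum acc_bit {
	ACC_READ, ACC_WRITE, ACC_EXEC, ACC_USER, ACC_USER_EXEC, ACC_NR_BITS
};
enum vendor { VENDOR_INTEL, VENDOR_AMD, VENDOR_NR };

/* NULL marks the "---" cells: the bit has no backing mask on that vendor. */
static const char *const acc_backing[ACC_NR_BITS][VENDOR_NR] = {
	[ACC_READ]      = { "PT_PRESENT_MASK",  "PT_PRESENT_MASK"  },
	[ACC_WRITE]     = { "PT_WRITABLE_MASK", "PT_WRITABLE_MASK" },
	[ACC_EXEC]      = { "shadow_xs_mask",   "shadow_nx_mask"   },
	[ACC_USER]      = { NULL,               "shadow_user_mask" },
	[ACC_USER_EXEC] = { "shadow_xu_mask",   NULL               },
};

/* Return the name of the mask backing @bit on @v, or NULL if unused. */
static const char *acc_backing_mask(enum acc_bit bit, enum vendor v)
{
	assert(bit < ACC_NR_BITS && v < VENDOR_NR);
	return acc_backing[bit][v];
}
```

Each row has at most one entry per vendor, which is the 1:1 property the
paragraph refers to.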
On Intel, ACC_EXEC_MASK is used for kernel-mode execution and is tied to
shadow_xs_mask (when MBEC is disabled, ACC_USER_EXEC_MASK and the XU bit
are computed but ineffective). update_permission_bitmask() precomputes
all the necessary conditions. On the AMD side, the U bit maps to
ACC_USER_MASK but nNPT adjusts the permission bitmask to ignore it for
reads and writes when GMET is active. Despite the smaller scale of the
changes compared to MBEC, some changes are still needed to use GMET
for L1 guests, because the page tables have to be created with U=0.
This means that the root page has role.access != ACC_ALL and its
permissions have to be propagated down.
Note that with MBEC the user/supervisor distinction depends on the U
bit of the page tables rather than the CPL. Processors provide this
information to the hypervisor through the "advanced EPT violation
vmexit info" feature, which is a requirement for KVM to use MBEC,
and kvm-intel.ko passes it to the MMU in PFERR_USER_MASK (unlike
kvm-amd.ko which computes it from the CPL). This needs a small change
to pass the effective XWU permissions of the page tables down to
translate_nested_gpa().
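To make the Intel/AMD difference concrete, here is a minimal sketch of the
two ways the user bit of the fault error code can be derived. The helper
names and the SKETCH_ prefix are hypothetical; only the position of the U/S
bit in the x86 page fault error code (bit 2) is taken as given.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* U/S bit of the x86 page fault error code (bit 2). */
#define SKETCH_PFERR_USER_MASK (UINT64_C(1) << 2)

/*
 * Intel with MBEC: the bit follows the U bit of the guest page tables,
 * as reported by the advanced EPT violation vmexit info.
 */
static uint64_t sketch_intel_user_bit(bool linear_addr_is_user)
{
	return linear_addr_is_user ? SKETCH_PFERR_USER_MASK : 0;
}

/* AMD: the bit is computed from the current privilege level. */
static uint64_t sketch_amd_user_bit(unsigned int cpl)
{
	assert(cpl <= 3);
	return cpl == 3 ? SKETCH_PFERR_USER_MASK : 0;
}
```

The two helpers disagree for code running at CPL 0 on a page mapped with
U=1: sketch_intel_user_bit(true) sets the bit, while sketch_amd_user_bit(0)
does not.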
The former "smep_andnot_wp" bit of cpu_role.base, now named "cr4_smep",
is repurposed for nested TDP to indicate that MBEC/GMET is on. The minor
pessimization for shadow page tables (toggling CR4.SMEP now always forces
building a separate version of the shadow page tables, even though that's
technically unnecessary if CR4.WP=1) is not really worth fretting about;
in practice, guests are not going to flip CR4.SMEP in a way that would
prevent efficient reuse of shadow page tables.
Paolo
v5->v6:
- rename make_spte_executable to change_spte_executable
- rename byte index in update_permission_bitmask to index
- use (u8) casts before "KVM: x86/mmu: introduce ACC_READ_MASK"
- make commit message for "KVM: x86/mmu: split XS/XU bits for EPT" more accurate
- add XU to shadow_acc_track_mask already in "KVM: x86/mmu: split XS/XU bits for EPT"
- fix compilation error
- use alternative code for __vmx_handle_ept_violation suggested by Sean
Jon Kohler (5):
KVM: TDX/VMX: rework EPT_VIOLATION_EXEC_FOR_RING3_LIN into PROT_MASK
KVM: x86/mmu: remove SPTE_PERM_MASK
KVM: x86/mmu: free up bit 10 of PTEs in preparation for MBEC
KVM: nVMX: advertise MBEC to nested guests
KVM: nVMX: allow MBEC with EVMCS
Paolo Bonzini (23):
KVM: x86/mmu: shuffle high bits of SPTEs in preparation for MBEC
KVM: x86/mmu: remove SPTE_EPT_*
KVM: x86/mmu: merge make_spte_{non,}executable
KVM: x86/mmu: rename and clarify BYTE_MASK
KVM: x86/mmu: separate more EPT/non-EPT permission_fault()
KVM: x86/mmu: introduce ACC_READ_MASK
KVM: x86/mmu: pass PFERR_GUEST_PAGE/FINAL_MASK to kvm_translate_gpa
KVM: x86/mmu: pass pte_access for final nGPA->GPA walk
KVM: x86: make translate_nested_gpa vendor-specific
KVM: x86/mmu: split XS/XU bits for EPT
KVM: x86/mmu: move cr4_smep to base role
KVM: VMX: enable use of MBEC
KVM: nVMX: pass advanced EPT violation vmexit info to guest
KVM: nVMX: pass PFERR_USER_MASK to MMU on EPT violations
KVM: x86/mmu: add support for MBEC to EPT page table walks
KVM: x86/mmu: propagate access mask from root pages down
KVM: x86/mmu: introduce cpu_role bit for availability of PFEC.I/D
KVM: SVM: add GMET bit definitions
KVM: x86/mmu: hard code more bits in kvm_init_shadow_npt_mmu
KVM: x86/mmu: add support for GMET to NPT page table walks
KVM: SVM: enable GMET and set it in MMU role
KVM: SVM: work around errata 1218
KVM: nSVM: enable GMET for guests
Documentation/virt/kvm/x86/mmu.rst | 10 +-
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 48 +++++---
arch/x86/include/asm/svm.h | 1 +
arch/x86/include/asm/vmx.h | 14 ++-
arch/x86/kvm/hyperv.c | 4 +-
arch/x86/kvm/mmu.h | 30 +++--
arch/x86/kvm/mmu/mmu.c | 182 ++++++++++++++++++++---------
arch/x86/kvm/mmu/mmutrace.h | 19 +--
arch/x86/kvm/mmu/paging_tmpl.h | 73 ++++++++----
arch/x86/kvm/mmu/spte.c | 92 +++++++++------
arch/x86/kvm/mmu/spte.h | 70 ++++++-----
arch/x86/kvm/mmu/tdp_mmu.c | 6 +-
arch/x86/kvm/svm/nested.c | 38 +++++-
arch/x86/kvm/svm/svm.c | 31 +++++
arch/x86/kvm/svm/svm.h | 1 +
arch/x86/kvm/vmx/capabilities.h | 12 +-
arch/x86/kvm/vmx/common.h | 26 +++--
arch/x86/kvm/vmx/hyperv_evmcs.h | 1 +
arch/x86/kvm/vmx/main.c | 9 ++
arch/x86/kvm/vmx/nested.c | 46 +++++++-
arch/x86/kvm/vmx/tdx.c | 2 +-
arch/x86/kvm/vmx/vmx.c | 27 ++++-
arch/x86/kvm/vmx/vmx.h | 1 +
arch/x86/kvm/vmx/x86_ops.h | 1 +
arch/x86/kvm/x86.c | 18 +--
27 files changed, 536 insertions(+), 228 deletions(-)
--
2.54.0
Thanks again for the updated version of this patch series.
I have been testing v6 on Intel and AMD platforms again and observed
a regression on Intel when CET and MBEC are both exposed to a Windows
guest.
Environment:
- Kernel: mainline 7.1.0-rc2 (with v6 patches applied)
- QEMU: downstream 11.0.0-1
- Guest: Windows Server 2026 (24H2, Build 26100.1742)
- virtio-win: 0.1.271
Hosts:
Intel: Intel(R) Xeon(R) Gold 6426Y
AMD: Epyc 7302P
Both hosts are running Proxmox VE (based on Debian Trixie).
Windows Guest Setup:
After the initial installation and verification [0] I enabled
Virtualization-Based Security (VBS) and Hypervisor-Protected Code
Integrity (HVCI).
I set the following in the Group Policy Editor (DeviceGuard):
* Select Platform Security Level: Secure Boot
* Virtualization Based Protection of Code Integrity: Enabled without
lock
* Require UEFI Memory Attributes Table: Checked
Issue: Host Lockups and Guest Hangs
On the Intel platform, the guest fails to boot when using:
QEMU options: -cpu host,level=30,+vmx-mbec
I observed two behaviors:
* The guest hangs indefinitely during the early boot phase. (Most
frequent)
* The guest fails to boot and ends up in Windows Recovery Mode.
When the guest hangs during early boot, the host experiences hard/soft
lockups:
watchdog: CPU11: Watchdog detected hard LOCKUP on cpu 11
watchdog: BUG: soft lockup - CPU#11 stuck for 28s [CPU 0/KVM:16105]
I also recorded a trace of the virtual guest getting stuck using:
`trace-cmd record -e kvm`
and found the following:
Frequency of top RIP:
* 987816 rip 0xfffff801b031bf36
* 985670 rip 0xfffff801b031bf35
* 184002 rip 0x7ffd3e35
Sequence of Events:
--
CPU 0/KVM-16105 [001] ..... 4327.371276: kvm_cr: cr_write 4 = 0xb50ef8
...
CPU 0/KVM-16105 [001] ..... 4327.373469: kvm_pio: pio_read at 0x608 size 4 count 1 val 0x5bcbb1
CPU 0/KVM-16105 [001] d..1. 4327.373469: kvm_entry: vcpu 0 rip 0xfffff801b031bf36
CPU 0/KVM-16105 [001] d..1. 4327.373470: kvm_exit: reason IO_INSTRUCTION rip 0xfffff801b031bf35 info 608000b 0
--
The last three lines seem to repeat in an infinite loop.
On the AMD platform, the guest had no issue booting when using:
QEMU options: -cpu host
This seems to be the case because CET is not present on AMD.
I confirmed this using:
`cpuid -1 -l 7 -s 0`
which shows that:
- CET_SS: CET shadow stack = false
- CET_IBT: CET indirect branch tracking = false
On Intel `cpuid -1 -l 7 -s 0` shows:
- CET_SS: CET shadow stack = true
- CET_IBT: CET indirect branch tracking = true
If I explicitly disable them on Intel using:
QEMU options: -cpu host,level=30,+vmx-mbec,-cet-ss,-cet-ibt
the guest boots without issues.
This regression did not occur previously because I was using QEMU
version 10.2.0-1, where these options were not yet exposed for this
particular Intel CPU [1].
Please let me know if you need further information or if there is
something else I could try/test.
[0] https://learn.microsoft.com/en-us/windows/security/hardware-security/enable-virtualization-based-protection-of-code-integrity?tabs=security
[1] https://gitlab.com/qemu-project/qemu/-/commit/5cb89cad7f30be3175dd5abbb79ae5e634476cfa
On 5/5/26 9:50 PM, Paolo Bonzini wrote:
> [...]
On Mon, May 11, 2026 at 12:54 PM David Riley <d.riley@proxmox.com> wrote:
> Environment:
> - Kernel: mainline 7.1.0-rc2 (with v6 patches applied)
Using 7.0.4 + v6 patches applied (except for the AMD ones) I could
reproduce the guest not booting, but not the host lockups. I also have
Windows Server 2025 build 26100, and in my case the host is a Meteor
Lake.
I'm running my guest with
$ qemu-kvm \
-M q35 -drive if=ide,file=win2k25.qcow2 \
-cpu host,+vmx-mbec,+cet-ss,+cet-ibt -vnc :0 \
-monitor stdio -m 8192 \
-bios /usr/share/edk2/ovmf/OVMF.stateless.secboot.fd \
-device nec-usb-xhci -usbdevice tablet -smp 8
However, the trace shows that CET is not used at all unless MBEC is
present. In particular (after "trace-cmd record -e kvm ...") I can do:
$ trace-cmd report |grep -e msr_write.*da0| sed 's/.*kvm_/kvm_/' | sort -u
and it shows as expected this with +vmx-mbec,+cet-ss,+cet-ibt:
kvm_msr: msr_write da0 = 0x800
but not with -vmx-mbec,+cet-ss,+cet-ibt. This initialization is
performed by Hyper-V even before VMXON, and the breakage happens even
if Memory Integrity is disabled inside Windows.
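For readers decoding the trace line above: MSR 0xda0 is IA32_XSS, and bit 11
of that MSR is the CET user state component, so 0x800 means the guest turned
on supervisor-managed XSAVE state for user-mode CET. A small sketch of that
decoding follows; the SKETCH_ names and the helper are hypothetical, only
the MSR index and bit positions are architectural.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MSR_IA32_XSS	0xda0u
#define SKETCH_XSS_CET_USER	(UINT64_C(1) << 11)	/* CET_U state */
#define SKETCH_XSS_CET_KERNEL	(UINT64_C(1) << 12)	/* CET_S state */

/* Return the CET-related bits of a value written to IA32_XSS. */
static uint64_t sketch_xss_cet_bits(uint32_t msr, uint64_t value)
{
	/* Only IA32_XSS writes are meaningful for this helper. */
	assert(msr == SKETCH_MSR_IA32_XSS);
	return value & (SKETCH_XSS_CET_USER | SKETCH_XSS_CET_KERNEL);
}
```

With the traced write, sketch_xss_cet_bits(0xda0, 0x800) yields only
SKETCH_XSS_CET_USER.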
Knowing that Hyper-V was not running any nested guest at the time of
the hang, I changed __vmcs_writel() to have
	if (field == SECONDARY_VM_EXEC_CONTROL)
		value &= ~SECONDARY_EXEC_MODE_BASED_EPT_EXEC;
which is admittedly a bit blunt :) but lets Hyper-V use CET, while
basically undoing the effects of this patch for non-nested operation.
This also hung for me.
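A standalone model of that hack, for anyone who wants to reason about what
it masks, might look like the following. This is a sketch outside the
kernel: sketch_filter_vmcs_write() is a hypothetical helper, and the two
constants (field encoding 0x401e and control bit 22) are my reading of the
VMX architecture rather than something stated in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values: secondary exec control field and MBEC control bit. */
#define SKETCH_SECONDARY_VM_EXEC_CONTROL		0x401eu
#define SKETCH_SECONDARY_EXEC_MODE_BASED_EPT_EXEC	(UINT64_C(1) << 22)

/* Strip the MBEC enable bit from writes to the secondary exec controls. */
static uint64_t sketch_filter_vmcs_write(uint32_t field, uint64_t value)
{
	if (field == SKETCH_SECONDARY_VM_EXEC_CONTROL) {
		value &= ~SKETCH_SECONDARY_EXEC_MODE_BASED_EPT_EXEC;
		assert(!(value & SKETCH_SECONDARY_EXEC_MODE_BASED_EPT_EXEC));
	}
	return value;
}
```

Writes to any other field pass through unchanged, so the hardware never
sees MBEC enabled while the guest-visible capability stays advertised.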
If possible, could you please check:
1) whether 7.0 + patches (up to 22/28) also causes the host to hang?
2) whether 7.1 + patches (up to 22/28) also causes the host to hang?
to understand if this is a) something caused by our different setup b)
a regression in 7.1 c) something caused by the last 6 patches?
So, while the host hanging is worrisome, this seems more likely to be
caused by the CET enablement than by MBEC.
Thanks,
Paolo
Hi Paolo, Hi Chao, Hi Sean,

I have been testing the v6 patchset (up to 22/28) this time on Arrow
Lake hardware. My results suggest a kernel version dependent regression
regarding host stability.

Environment:
* Host CPU: Intel(R) Core(TM) Ultra 7 265K (Arrow Lake)
* Motherboard: Gigabyte Z890 EAGLE (BIOS F18)
* Host OS: Proxmox VE based on Debian Trixie
* Host Kernel: mainline with patches 1-22/28 applied.
* Guest OS: Windows Server 2026 (24H2, Build 26100.1742) with VBS/Hyper-V
  enabled.
* QEMU Command: -cpu host,level=30,+vmx-mbec,+cet-ss,+cet-ibt

Results for Kernel 7.1.0-rc3 + v6 patches 1-22:
I can reproduce the guest failing to boot. This setup causes host lockups on
my Arrow Lake machine. In some cases, the guest manages to reach Windows
Recovery, but most of the time it does not.

@Chao, in the first line you can see the hard lockup. Also have a look at the
hrtimer trap I tested below.

dmesg output:
[Fri May 15 13:07:37 2026] watchdog: CPU1: Watchdog detected hard LOCKUP on cpu 1
...
[Fri May 15 13:07:37 2026] CPU: 1 UID: 0 PID: 3327 Comm: CPU 0/KVM Tainted: G E 7.1.0-rc3-v7.1-rc3-v6-p22-00022-g9ba8d1bdd861-dirty #31 PREEMPT(lazy)
[Fri May 15 13:07:37 2026] Tainted: [E]=UNSIGNED_MODULE
[Fri May 15 13:07:37 2026] Hardware name: Gigabyte Technology Co., Ltd. Z890 EAGLE/Z890 EAGLE, BIOS F18 11/27/2025
[Fri May 15 13:07:37 2026] RIP: 0010:vmx_do_nmi_irqoff+0x13/0x20 [kvm_intel]
[Fri May 15 13:07:37 2026] Code: 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 55 48 89 e5 48 83 e4 f0 6a 18 55 9c 6a 10 e8 5d cc ca f1 <c9> c3 cc cc cc cc 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90
[Fri May 15 13:07:37 2026] RSP: 0018:ffffcdf58fdf7c28 EFLAGS: 00000082
[Fri May 15 13:07:37 2026] RAX: 0000000080000200 RBX: ffff8baa8a6f4900 RCX: 0000000000000000
[Fri May 15 13:07:37 2026] RDX: 0000000080000202 RSI: 0000000000000000 RDI: ffff8baa8a6f4900
[Fri May 15 13:07:37 2026] RBP: ffffcdf58fdf7c28 R08: 0000000000000000 R09: 0000000000000000
[Fri May 15 13:07:37 2026] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[Fri May 15 13:07:37 2026] R13: 0000000000000000 R14: 0000000000000004 R15: ffff8bab70170000
[Fri May 15 13:07:37 2026] FS: 0000756cc330a6c0(0000) GS:ffff8bba5a58a000(0000) knlGS:fffff8031fbbd000
[Fri May 15 13:07:37 2026] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Fri May 15 13:07:37 2026] CR2: 00007fffffff0000 CR3: 00000001152c2001 CR4: 0000000008f72ef0
[Fri May 15 13:07:37 2026] PKRU: 55555554
[Fri May 15 13:07:37 2026] Call Trace:
[Fri May 15 13:07:37 2026] <TASK>
[Fri May 15 13:07:37 2026] vmx_handle_nmi+0x9a/0x140 [kvm_intel]
[Fri May 15 13:07:37 2026] vmx_vcpu_enter_exit+0x18f/0x300 [kvm_intel]
[Fri May 15 13:07:37 2026] vmx_vcpu_run+0x1d2/0x12c0 [kvm_intel]
[Fri May 15 13:07:37 2026] vt_vcpu_run+0x1a/0x40 [kvm_intel]
[Fri May 15 13:07:37 2026] kvm_arch_vcpu_ioctl_run+0x69e/0x18e0 [kvm]
[Fri May 15 13:07:37 2026] ? fire_user_return_notifiers+0x37/0x70
[Fri May 15 13:07:37 2026] ? __x64_sys_ioctl+0xbf/0x100
[Fri May 15 13:07:37 2026] kvm_vcpu_ioctl+0x312/0xba0 [kvm]
[Fri May 15 13:07:37 2026] ? __x64_sys_ioctl+0xbf/0x100
[Fri May 15 13:07:37 2026] ? kvm_on_user_return+0x4a/0x90 [kvm]
[Fri May 15 13:07:37 2026] ? fire_user_return_notifiers+0x37/0x70
[Fri May 15 13:07:37 2026] ? do_syscall_64+0x396/0x14c0
[Fri May 15 13:07:37 2026] __x64_sys_ioctl+0xa5/0x100
[Fri May 15 13:07:37 2026] x64_sys_call+0x103b/0x2390
[Fri May 15 13:07:37 2026] do_syscall_64+0xe6/0x14c0
[Fri May 15 13:07:37 2026] ? fire_user_return_notifiers+0x37/0x70
[Fri May 15 13:07:37 2026] ? do_syscall_64+0x396/0x14c0
[Fri May 15 13:07:37 2026] ? fire_user_return_notifiers+0x37/0x70
[Fri May 15 13:07:37 2026] ? do_syscall_64+0x396/0x14c0
[Fri May 15 13:07:37 2026] ? do_syscall_64+0x9b/0x14c0
[Fri May 15 13:07:37 2026] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[Fri May 15 13:07:37 2026] RIP: 0033:0x756cc783091b
[Fri May 15 13:07:37 2026] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[Fri May 15 13:07:37 2026] RSP: 002b:0000756cc3305b30 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[Fri May 15 13:07:37 2026] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 0000756cc783091b
[Fri May 15 13:07:37 2026] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000020
[Fri May 15 13:07:37 2026] RBP: 00005a973691c030 R08: 0000000000000000 R09: 0000000000000000
[Fri May 15 13:07:37 2026] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[Fri May 15 13:07:37 2026] R13: 0000000000000001 R14: 0000000000000608 R15: 0000000000000000
[Fri May 15 13:07:37 2026] </TASK>

Output from trace-cmd when the guest gets stuck:
CPU 0/KVM-6610 [001] d..2. 709.333183: kvm_apic_accept_irq: apicid 0 vec 209 (Fixed|edge)
CPU 0/KVM-6610 [001] d..2. 709.333183: kvm_apicv_accept_irq: apicid 0 vec 209 (Fixed|edge)
CPU 0/KVM-6610 [001] d..3. 709.333183: kvm_hv_timer_state: vcpu_id 0 hv_timer 1
CPU 0/KVM-6610 [001] d..1. 709.333183: kvm_entry: vcpu 0 rip 0xfffff800b49020ec
CPU 0/KVM-6610 [001] d..1. 709.333183: kvm_wait_lapic_expire: vcpu 0: delta 460 (late)
CPU 0/KVM-6610 [001] d..1. 709.348806: kvm_exit: reason PREEMPTION_TIMER rip 0xfffff800b49020ec info 0 0
CPU 0/KVM-6610 [001] d..2. 709.348806: kvm_apic_accept_irq: apicid 0 vec 209 (Fixed|edge)
CPU 0/KVM-6610 [001] d..2. 709.348806: kvm_apicv_accept_irq: apicid 0 vec 209 (Fixed|edge)
CPU 0/KVM-6610 [001] d..3. 709.348806: kvm_hv_timer_state: vcpu_id 0 hv_timer 1
CPU 0/KVM-6610 [001] d..1. 709.348806: kvm_entry: vcpu 0 rip 0xfffff800b49020ec
CPU 0/KVM-6610 [001] d..1. 709.348807: kvm_wait_lapic_expire: vcpu 0: delta -1624 (late)

and
trace-cmd report |grep -e msr_write.*da0| sed 's/.*kvm_/kvm_/' | sort -u
kvm_msr: msr_write da0 = 0x800

If I run:
sudo modprobe -r kvm_intel
sudo modprobe kvm_intel preemption_timer=0

I am able to boot into Windows sometimes. Other times it enters an
endless loop.

Boots into Windows recovery mode:
CPU 0/KVM-18245 [001] ..... 2628.320804: kvm_fpu: unload
CPU 0/KVM-18245 [001] ..... 2628.320804: kvm_userspace_exit: reason KVM_EXIT_IO (2)
CPU 0/KVM-18245 [001] ..... 2628.320805: kvm_fpu: load
CPU 0/KVM-18245 [001] ..... 2628.320805: kvm_pio: pio_read at 0x608 size 4 count 1 val 0xe1d664
CPU 0/KVM-18245 [001] d..1. 2628.320805: kvm_entry: vcpu 0 rip 0xfffff807c131bf36
CPU 0/KVM-18245 [001] d..1. 2628.320806: kvm_exit: reason IO_INSTRUCTION rip 0xfffff807c131bf35 info 608000b 0
CPU 0/KVM-18245 [001] ..... 2628.320806: kvm_fpu: unload
CPU 0/KVM-18245 [001] ..... 2628.320807: kvm_userspace_exit: reason KVM_EXIT_IO (2)
CPU 0/KVM-18245 [001] ..... 2628.320808: kvm_fpu: load
CPU 0/KVM-18245 [001] ..... 2628.320808: kvm_pio: pio_read at 0x608 size 4 count 1 val 0xe1d66e
CPU 0/KVM-18245 [001] d..1. 2628.320808: kvm_entry: vcpu 0 rip 0xfffff807c131bf36
CPU 0/KVM-18245 [001] d..1. 2628.320809: kvm_exit: reason IO_INSTRUCTION rip 0xfffff807c131bf35 info 608000b 0

but I also observed this (Windows is stuck in the boot stage):
CPU 0/KVM-35230 [001] d..1. 4945.511627: kvm_entry: vcpu 0 rip 0xfffff804b53020ec
CPU 0/KVM-35230 [001] d..1. 4945.511632: kvm_exit: reason EXTERNAL_INTERRUPT rip 0xfffff804b53020ec info 0 0
CPU 0/KVM-35230 [001] d..1. 4945.511632: kvm_entry: vcpu 0 rip 0xfffff804b53020ec
CPU 0/KVM-35230 [001] d..1. 4945.511634: kvm_exit: reason EXTERNAL_INTERRUPT rip 0xfffff804b53020ec info 0 0
CPU 0/KVM-35230 [001] d..1. 4945.511635: kvm_entry: vcpu 0 rip 0xfffff804b53020ec
CPU 0/KVM-35230 [001] d..1. 4945.511640: kvm_exit: reason EXTERNAL_INTERRUPT rip 0xfffff804b53020ec info 0 0
CPU 0/KVM-35230 [001] d..1. 4945.511640: kvm_entry: vcpu 0 rip 0xfffff804b53020ec
CPU 0/KVM-35230 [001] d..1. 4945.730129: kvm_exit: reason EXTERNAL_INTERRUPT rip 0xfffff804b53020f7 info 0 0
CPU 0/KVM-35230 [001] ..... 4945.730145: kvm_apic_accept_irq: apicid 0 vec 209 (Fixed|edge)
CPU 0/KVM-35230 [001] ..... 4945.730145: kvm_apicv_accept_irq: apicid 0 vec 209 (Fixed|edge)
CPU 0/KVM-35230 [001] d..1. 4945.730146: kvm_entry: vcpu 0 rip 0xfffff804b53020f7
kvm-pit/35115-35236 [012] ..... 4945.730212: kvm_set_irq: gsi 0 level 1 source 2
kvm-pit/35115-35236 [012] ...1. 4945.730215: kvm_pic_set_irq: chip 0 pin 0 (edge|masked)
kvm-pit/35115-35236 [012] ...1. 4945.730217: kvm_ioapic_set_irq: pin 2 dst 0 vec 255 (Fixed|physical|edge|masked)
kvm-pit/35115-35236 [012] ..... 4945.730217: kvm_set_irq: gsi 0 level 0 source 2
kvm-pit/35115-35236 [012] ...1. 4945.730217: kvm_pic_set_irq: chip 0 pin 0 (edge|masked)
kvm-pit/35115-35236 [012] ...1. 4945.730217: kvm_ioapic_set_irq: pin 2 dst 0 vec 255 (Fixed|physical|edge|masked)
CPU 0/KVM-35230 [001] d..1. 4945.730559: kvm_exit: reason EXTERNAL_INTERRUPT rip 0xfffff804b53020ec info 0 0
CPU 0/KVM-35230 [001] d..1. 4945.730561: kvm_entry: vcpu 0 rip 0xfffff804b53020ec
CPU 0/KVM-35230 [001] d..1. 4945.731559: kvm_exit: reason EXTERNAL_INTERRUPT rip 0xfffff804b53020ec info 0 0
CPU 0/KVM-35230 [001] d..1. 4945.731560: kvm_entry: vcpu 0 rip 0xfffff804b53020ec
CPU 0/KVM-35230 [001] d..1. 4945.732559: kvm_exit: reason EXTERNAL_INTERRUPT rip 0xfffff804b53020ec info 0 0
CPU 0/KVM-35230 [001] d..1. 4945.732560: kvm_entry: vcpu 0 rip 0xfffff804b53020ec
CPU 0/KVM-35230 [001] d..1. 4945.732934: kvm_exit: reason EXTERNAL_INTERRUPT rip 0xfffff804b53020ec info 0 0
CPU 0/KVM-35230 [001] d..1. 4945.732935: kvm_entry: vcpu 0 rip 0xfffff804b53020ec
CPU 0/KVM-35230 [001] d..1. 4945.733559: kvm_exit: reason EXTERNAL_INTERRUPT rip 0xfffff804b53020f7 info 0 0
CPU 0/KVM-35230 [001] d..1. 4945.733560: kvm_entry: vcpu 0 rip 0xfffff804b53020f7
CPU 2/KVM-35232 [006] ..... 4945.893095: kvm_halt_poll_ns: vcpu 2: halt_poll_ns 0 (shrink 10000)
CPU 2/KVM-35232 [006] ..... 4945.893097: kvm_vcpu_wakeup: wait time 8589893724 ns, polling valid
CPU 1/KVM-35231 [008] ..... 4945.893300: kvm_halt_poll_ns: vcpu 1: halt_poll_ns 0 (shrink 10000)
CPU 1/KVM-35231 [008] ..... 4945.893302: kvm_vcpu_wakeup: wait time 8590000199 ns, polling valid
CPU 3/KVM-35233 [011] ..... 4945.893332: kvm_vcpu_wakeup: wait time 8

Results for Kernel 7.1.0-rc3 + v6 patches 1-22 + hrtimer trap:

I used the mentioned trap from [0].

Before booting the VM I set up tracing and verified that it was on using:
cat /sys/kernel/tracing/tracing_on
1

and after booting the VM, which got stuck again, I checked again and it
was off:
cat /sys/kernel/tracing/tracing_on
0

I have the full compressed trace from this trigger event (captured with the
trap). It is quite large, but I can provide it if needed.

Results for Kernel 7.0.0 + v6 patches 1-22:
I used the same:
* Guest OS: Windows Server 2026 (24H2, Build 26100.1742) with VBS/Hyper-V
  enabled.
* QEMU Command: -cpu host,level=30,+vmx-mbec,+cet-ss,+cet-ibt

This also results in the Windows guest getting stuck, but there are no
indications of CPU lockups.

A trace-cmd recording shows that the guest is stuck entering and exiting:
CPU 0/KVM-14938 [001] d..1. 2355.384828: kvm_entry: vcpu 0 rip 0xfffff805879020ec
CPU 0/KVM-14938 [001] d..1. 2355.385826: kvm_exit: reason EXTERNAL_INTERRUPT rip 0xfffff805879020ec info 0 0
CPU 0/KVM-14938 [001] d..1. 2355.385827: kvm_entry: vcpu 0 rip 0xfffff805879020ec
CPU 0/KVM-14938 [001] d..1. 2355.386826: kvm_exit: reason EXTERNAL_INTERRUPT rip 0xfffff805879020ec info 0 0

And in the trace report:
trace-cmd report |grep -e msr_write.*da0| sed 's/.*kvm_/kvm_/' | sort -u
kvm_msr: msr_write da0 = 0x800

Hope this helps, feel free to suggest other tests that I should run.

Best regards,
David

[0] https://lore.kernel.org/kvm/70cd3e97fbb796e2eb2ff8cd4b7614ada05a5f24.camel@intel.com

On 5/12/26 4:30 PM, Paolo Bonzini wrote:
> If possible, could you please check:
>
> 1) whether 7.0 + patches (up to 22/28) also causes the host to hang?
> 2) whether 7.1 + patches (up to 22/28) also causes the host to hang?
>
> to understand if this is a) something caused by our different setup b)
> a regression in 7.1 c) something caused by the last 6 patches?
>
On Fri, May 15, 2026, David Riley wrote:
> Hi Paolo, Hi Chao, Hi Sean,
>
> I have been testing the v6 patchset (up to 22/28) this time on Arrow
> Lake hardware. My results suggest a kernel version dependent regression
> regarding host stability.
>
> Environment:
> * Host CPU: Intel(R) Core(TM) Ultra 7 265K (Arrow Lake)
> * Motherboard: Gigabyte Z890 EAGLE (BIOS F18)
> * Host OS: Proxmox VE based on Debian Trixie
> * Host Kernel: mainline with patches 1-22/28 applied.
> * Guest OS: Windows Server 2026 (24H2, Build 26100.1742) with VBS/Hyper-V
> enabled.
> * QEMU Command: -cpu host,level=30,+vmx-mbec,+cet-ss,+cet-ibt
>
> Results for Kernel 7.1.0-rc3 + v6 patches 1-22:
> I can reproduce the guest failing to boot. This setup causes host lockups on
> my Arrow Lake machine. In some cases, the guest manages to reach Windows
> Recovery, but most of the time it does not.
>
> @Chao, in the first line you can see the hard lockup. Also have a look at the
> hrtimer trap I tested below.
>
> dmesg output:
> [Fri May 15 13:07:37 2026] watchdog: CPU1: Watchdog detected hard LOCKUP on cpu 1

...

> If I run:
> sudo modprobe -r kvm_intel
> sudo modprobe kvm_intel preemption_timer=0
>
> I am able to boot into windows sometimes.

Hmm, this probably confirms it's the hrtimer issue?  When using the VMX preemption
timer, KVM (on Intel) doesn't use an hrtimer to emulate L1's APIC timer.  I _think_
forcing KVM to use an hrtimer would result in hrtimers being reprogrammed
in response to KVM's usage, and thus mask the deferred reprogramming bug?  That
sounds plausible-ish?

> Results for Kernel 7.1.0-rc3 + v6 patches 1-22 + hrtimer trap:
>
> I used the mentioned trap from [0]

Can you try Peter's fixes?  AIUI, the reporter's hack-a-fix was very far from a
complete fix.  Note, there's a v3 of patch 1 (b4 should take care of that for you,
if you're using b4).

https://lore.kernel.org/all/20260423155611.216805954@infradead.org
Hi Sean,

thanks for the input.

On 5/15/26 8:31 PM, Sean Christopherson wrote:
> [...]
> Hmm, this probably confirms it's the hrtimer issue?  When using the VMX preemption
> timer, KVM (on Intel) doesn't use an hrtimer to emulate L1's APIC timer.  I _think_
> forcing KVM to use an hrtimer would result in hrtimers being reprogrammed
> in response to KVM's usage, and thus mask the deferred reprogramming bug?  That
> sounds plausible-ish?
> [...]
> Can you try Peter's fixes?  AIUI, the reporter's hack-a-fix was very far from a
> complete fix.  Note, there's a v3 of patch 1 (b4 should take care of that for you,
> if you're using b4).
>
> https://lore.kernel.org/all/20260423155611.216805954@infradead.org

I tested it again with the v3 hrtimer patches [0] applied on top of the
v6 MBEC/GMET series.

Setup:
* Host CPU: Intel(R) Core(TM) Ultra 7 265K (Arrow Lake)
* Host OS: Proxmox VE (based on Debian Trixie)
* Host Kernel: mainline kernel 7.1.0-rc4 with v6 MBEC/GMET and v3 hrtimer [0]
* QEMU: 11.0.0 (downstream build)
* Guest OS: Windows Server 2026 (24H2, Build 26100.1742) with VBS/Hyper-V enabled.

Using:
* QEMU CPU Options: -cpu host,level=30,+vmx-mbec,-cet-ss,-cet-ibt

The CPU lockups no longer occurred and I was able to boot the guest. Keep in
mind that in this case cet-ss and cet-ibt are not passed along to the guest.

If I launch the same guest using
* QEMU CPU Options: -cpu host,level=30,+vmx-mbec,+cet-ss,+cet-ibt

the VM still gets stuck on boot even with the hrtimer patches applied, but
there are no longer any hard or soft lockups of the CPU.

I get this trace using: trace-cmd record -e kvm

CPU 0/KVM-11837 [001] d..2. 1363.314703: kvm_apic_accept_irq: apicid 0 vec 209 (Fixed|edge)
CPU 0/KVM-11837 [001] d..2. 1363.314703: kvm_apicv_accept_irq: apicid 0 vec 209 (Fixed|edge)
CPU 0/KVM-11837 [001] d..3. 1363.314703: kvm_hv_timer_state: vcpu_id 0 hv_timer 1
CPU 0/KVM-11837 [001] d..1. 1363.314703: kvm_entry: vcpu 0 rip 0xfffff801a59020f7
CPU 0/KVM-11837 [001] d..1. 1363.314703: kvm_wait_lapic_expire: vcpu 0: delta -590 (late)
CPU 0/KVM-11837 [001] d..1. 1363.314993: kvm_exit: reason EXTERNAL_INTERRUPT rip 0xfffff801a59020ec info 0 0
CPU 0/KVM-11837 [001] d..1. 1363.314994: kvm_entry: vcpu 0 rip 0xfffff801a59020ec
CPU 0/KVM-11837 [001] d..1. 1363.315993: kvm_exit: reason EXTERNAL_INTERRUPT rip 0xfffff801a59020ec info 0 0
CPU 0/KVM-11837 [001] d..1. 1363.315993: kvm_entry: vcpu 0 rip 0xfffff801a59020ec
CPU 0/KVM-11837 [001] d..1. 1363.316993: kvm_exit: reason EXTERNAL_INTERRUPT rip 0xfffff801a59020ec info 0 0
CPU 0/KVM-11837 [001] d..1. 1363.316994: kvm_entry: vcpu 0 rip 0xfffff801a59020ec
CPU 0/KVM-11837 [001] d..1. 1363.317992: kvm_exit: reason EXTERNAL_INTERRUPT rip 0xfffff801a59020ec info 0 0
CPU 0/KVM-11837 [001] d..1. 1363.317993: kvm_entry: vcpu 0 rip 0xfffff801a59020ec
CPU 0/KVM-11837 [001] d..1. 1363.318992: kvm_exit: reason EXTERNAL_INTERRUPT rip 0xfffff801a59020ec info 0 0
CPU 0/KVM-11837 [001] d..1. 1363.319269: kvm_entry: vcpu 0 rip 0xfffff801a59020ec
CPU 0/KVM-11837 [001] d..1. 1363.319992: kvm_exit: reason EXTERNAL_INTERRUPT rip 0xfffff801a59020ec info 0 0
CPU 0/KVM-11837 [001] d..1. 1363.319994: kvm_entry: vcpu 0 rip 0xfffff801a59020ec

I did not spot anything useful in the dmesg/journalctl output.

I also ran the same tests with mainline kernel 7.1.0-rc3 with v6 MBEC/GMET
(patches 1-22 of 28) and v3 hrtimer [0] and got the same results.

Best regards,
David

[0] https://lore.kernel.org/all/20260423155611.216805954@infradead.org
On 5/12/26 16:32, Paolo Bonzini wrote:
> The trace shows that CET is not used at all unless MBEC is
> present. In particular (after "trace-cmd record -e kvm ...") I can do:
>
> $ trace-cmd report |grep -e msr_write.*da0| sed 's/.*kvm_/kvm_/' | sort -u
>
> and it shows as expected this with +vmx-mbec,+cet-ss,+cet-ibt:
>
> kvm_msr: msr_write da0 = 0x800
>
> but not with -vmx-mbec,+cet-ss,+cet-ibt. This initialization is
> performed by Hyper-V even before VMXON, and the breakage happens even
> if Memory Integrity is disabled inside Windows.
>
> Knowing that Hyper-V was not running any nested guest at the time of
> the hang, I changed __vmcs_writel() to have
>
> 	if (field == SECONDARY_VM_EXEC_CONTROL)
> 		value &= ~SECONDARY_EXEC_MODE_BASED_EPT_EXEC;

I have now reproduced the guest hang with a one line change on top of
kvm/master:

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 937aeb474af7..43e0f20e4e26 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -7231,6 +7231,7 @@ static void nested_vmx_setup_secondary_ctls(u32 ept_caps,
 	if (enable_ept) {
 		/* nested EPT: emulate EPT also to L1 */
 		msrs->secondary_ctls_high |=
+			SECONDARY_EXEC_MODE_BASED_EPT_EXEC | /* hem hem */
 			SECONDARY_EXEC_ENABLE_EPT;
 		msrs->ept_caps =
 			VMX_EPT_PAGE_WALK_4_BIT |

(which would break very badly if Hyper-V were to start a nested guest,
but the trace says it doesn't).

Can you check what behavior you get from this (actually silly) change?
It should allow you to exercise Hyper-V's CET paths without the burden
of the MMU changes.

Paolo
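
The `__vmcs_writel()` filter quoted at the top of this message amounts to clearing one control bit whenever one specific VMCS field is written. The following is a standalone sketch of that masking, assuming the field encoding and bit positions from the Intel SDM (secondary processor-based controls at encoding 0x401e, "enable EPT" at bit 1, "mode-based execute control for EPT" at bit 22). `filter_vmcs_write()` is a hypothetical stand-in for illustration, not the kernel's function.

```c
#include <stdint.h>

/* Assumed values, mirroring the names used in the quoted snippet. */
#define SECONDARY_VM_EXEC_CONTROL           0x401eul    /* VMCS field encoding */
#define SECONDARY_EXEC_ENABLE_EPT           (1u << 1)
#define SECONDARY_EXEC_MODE_BASED_EPT_EXEC  (1u << 22)

/*
 * Hypothetical stand-in for the quoted hack: strip the MBEC control from
 * any value about to be written to the secondary execution controls and
 * leave writes to every other field untouched.
 */
static uint64_t filter_vmcs_write(unsigned long field, uint64_t value)
{
	if (field == SECONDARY_VM_EXEC_CONTROL)
		value &= ~(uint64_t)SECONDARY_EXEC_MODE_BASED_EPT_EXEC;
	return value;
}
```

With such a filter in place the MBEC control never reaches the hardware VMCS, so any remaining hang cannot come from the processor actually running with mode-based execute control enabled.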
On 5/11/26 12:53, David Riley wrote:
>
> watchdog: CPU11: Watchdog detected hard LOCKUP on cpu 11
> watchdog: BUG: soft lockup - CPU#11 stuck for 28s [CPU 0/KVM:16105]

What is the backtrace here?

Thanks,

Paolo
Is that enough?

dmesg | grep -A 50 "soft lockup - CPU#11"
[ 5565.326572] watchdog: BUG: soft lockup - CPU#11 stuck for 28s! [CPU 0/KVM:16105]
[ 5565.326576] Modules linked in: tcp_diag(E) inet_diag(E) veth(E) rpcsec_gss_krb5(E) auth_rpcgss(E) nfsv4(E) nfs(E) lockd(E) grace(E) netfs(E) ebtable_filter(E) ebtables(E) ip_set(E) ip6table_raw(E) iptable_raw(E) ip6table_filter(E) ip6_tables(E) iptable_filter(E) sunrpc(E) nf_tables(E) softdog(E) bonding(E) tls(E) binfmt_misc(E) nfnetlink_log(E) intel_rapl_msr(E) intel_rapl_common(E) intel_uncore_frequency(E) intel_uncore_frequency_common(E) intel_ifs(E) i10nm_edac(E) skx_edac_common(E) nfit(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) cxl_pci(E) cxl_mem(E) kvm_intel(E) acpi_power_meter(E) ipmi_ssif(E) cxl_acpi(E) cxl_port(E) cxl_pmem(E) kvm(E) pmt_telemetry(E) dax_hmem(E) pmt_discovery(E) irqbypass(E) pmt_class(E) intel_sdsi(E) bnxt_re(E) cxl_core(E) aesni_intel(E) ib_uverbs(E) gf128mul(E) isst_if_mmio(E) isst_if_mbox_pci(E) fwctl(E) rapl(E) cmdlinepart(E) intel_cstate(E) einj(E) pcspkr(E) wmi_bmof(E) spi_nor(E) iaa_crypto(E) isst_if_common(E) ib_core(E) intel_vsec(E) mei_me(E) ast(E) mtd(E) spd5118(E)
[ 5565.326599] i2c_algo_bit(E) mei(E) ipmi_si(E) acpi_ipmi(E) ipmi_devintf(E) ipmi_msghandler(E) acpi_pad(E) joydev(E) input_leds(E) mac_hid(E) sch_fq_codel(E) msr(E) vhost_net(E) vhost(E) vhost_iotlb(E) tap(E) efi_pstore(E) nfnetlink(E) dmi_sysfs(E) ip_tables(E) x_tables(E) autofs4(E) btrfs(E) libblake2b(E) raid6_pq(E) xor(E) hid_generic(E) usbmouse(E) usbkbd(E) usbhid(E) hid(E) cdc_ether(E) usbnet(E) mii(E) uas(E) usb_storage(E) dm_thin_pool(E) dm_persistent_data(E) dm_bio_prison(E) dm_bufio(E) nvme(E) nvme_core(E) xhci_pci(E) i40e(E) nvme_keyring(E) i2c_i801(E) idxd(E) tg3(E) ahci(E) i2c_mux(E) libie(E) idxd_bus(E) spi_intel_pci(E) bnxt_en(E) nvme_auth(E) libie_adminq(E) xhci_hcd(E) i2c_smbus(E) spi_intel(E) libahci(E) i2c_ismt(E) wmi(E) pinctrl_emmitsburg(E)
[ 5565.326618] CPU: 11 UID: 0 PID: 16105 Comm: CPU 0/KVM Tainted: G EL 7.1.0-rc2-v6-mbec-gmet-00028-g1e3b074acc33 #24 PREEMPT(lazy)
[ 5565.326620] Tainted: [E]=UNSIGNED_MODULE, [L]=SOFTLOCKUP
[ 5565.326620] Hardware name: ****, BIOS 3001 07/03/2025
[ 5565.326621] RIP: 0010:kvm_arch_vcpu_ioctl_run+0x78d/0x18e0 [kvm]
[ 5565.326680] Code: 07 00 48 83 bb 28 08 00 00 00 0f 85 69 0a 00 00 0f 1f 44 00 00 65 c6 05 08 d7 08 c7 01 c6 83 e2 0a 00 00 01 fb 0f 1f 44 00 00 <48> 83 83 48 19 00 00 01 fa 0f 1f 44 00 00 c6 83 e2 0a 00 00 00 0f
[ 5565.326681] RSP: 0018:ff6afb0ed454b9e0 EFLAGS: 00000246
[ 5565.326682] RAX: 0000000000000000 RBX: ff2a147c1758a480 RCX: 0000000000000000
[ 5565.326683] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 5565.326683] RBP: ff6afb0ed454ba90 R08: 0000000000000000 R09: 0000000000000000
[ 5565.326684] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 5565.326684] R13: ff2a147c12163000 R14: 0000000000000000 R15: ff2a147c20d2b140
[ 5565.326685] FS: 00007503727ef6c0(0000) GS:ff2a147bf748a000(0000) knlGS:fffff8013ee33000
[ 5565.326685] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5565.326686] CR2: 00007fffffff0000 CR3: 00000020948ed006 CR4: 0000000000f73ef0
[ 5565.326687] PKRU: 55555554
[ 5565.326687] Call Trace:
[ 5565.326688] <TASK>
[ 5565.326689] ? trace_event_buffer_reserve+0xa5/0xe0
[ 5565.326692] ? trace_event_raw_event_kvm_userspace_exit+0x6c/0xc0 [kvm]
[ 5565.326727] kvm_vcpu_ioctl+0x312/0xba0 [kvm]
[ 5565.326762] ? __rb_reserve_next.constprop.0+0x5c/0x420
[ 5565.326765] ? ring_buffer_lock_reserve+0x155/0x410
[ 5565.326767] __x64_sys_ioctl+0xa5/0x100
[ 5565.326769] x64_sys_call+0x103b/0x2390
[ 5565.326771] do_syscall_64+0xe6/0x14c0
[ 5565.326774] ? trace_event_buffer_reserve+0xa5/0xe0
[ 5565.326775] ? trace_event_raw_event_kvm_userspace_exit+0x6c/0xc0 [kvm]
[ 5565.326807] ? kvm_vcpu_ioctl+0x2a7/0xba0 [kvm]
[ 5565.326841] ? trace_event_buffer_reserve+0xa5/0xe0
[ 5565.326842] ? trace_event_raw_event_kvm_userspace_exit+0x6c/0xc0 [kvm]
[ 5565.326874] ? __x64_sys_ioctl+0xbf/0x100
[ 5565.326875] ? kvm_on_user_return+0x4a/0x90 [kvm]
[ 5565.326916] ? fire_user_return_notifiers+0x37/0x70
[ 5565.326918] ? do_syscall_64+0x396/0x14c0
[ 5565.326920] ? do_syscall_64+0x396/0x14c0
[ 5565.326922] ? __x64_sys_ioctl+0xbf/0x100
[ 5565.326923] ? kvm_on_user_return+0x4a/0x90 [kvm]
[ 5565.326962] ? fire_user_return_notifiers+0x37/0x70
[ 5565.326963] ? do_syscall_64+0x396/0x14c0
[ 5565.326965] ? kvm_on_user_return+0x4a/0x90 [kvm]
[ 5565.327002] ? fire_user_return_notifiers+0x37/0x70
[ 5565.327004] ? do_syscall_64+0x396/0x14c0
[ 5565.327005] ? do_syscall_64+0x396/0x14c0
[ 5565.327007] ? do_syscall_64+0x396/0x14c0
[ 5565.327008] ? do_syscall_64+0x396/0x14c0
[ 5565.327009] ? do_syscall_64+0x9b/0x14c0
[ 5565.327011] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 5565.327012] RIP: 0033:0x75037650f91b

On 5/11/26 12:54 PM, Paolo Bonzini wrote:
> On 5/11/26 12:53, David Riley wrote:
>>
>> watchdog: CPU11: Watchdog detected hard LOCKUP on cpu 11
>> watchdog: BUG: soft lockup - CPU#11 stuck for 28s [CPU 0/KVM:16105]
>
> What is the backtrace here?
>
> Thanks,
>
> Paolo
>
>
On Mon, May 11, 2026 at 01:07:33PM +0200, David Riley wrote:
>is that enough?
>
>dmesg | grep -A 50 "soft lockup - CPU#11"

Do you also have a hard lockup trace?

I want to make sure the host lockup is not the issue discussed here:

https://lore.kernel.org/kvm/70cd3e97fbb796e2eb2ff8cd4b7614ada05a5f24.camel@intel.com/
On Thu, May 14, 2026, Chao Gao wrote:
> On Mon, May 11, 2026 at 01:07:33PM +0200, David Riley wrote:
> >is that enough?
> >
> >dmesg | grep -A 50 "soft lockup - CPU#11"
>
> Do you also have a hard lockup trace?
>
> I want to make sure the host lockup is not the issue discussed here:
>
> https://lore.kernel.org/kvm/70cd3e97fbb796e2eb2ff8cd4b7614ada05a5f24.camel@intel.com

Ugh, if it is the hrtimer issue, I apologize in advance.  Despite being bitten by
that bug over, and over, and over, I somehow keep forgetting to mention it to
others when they run into problems.  Glad someone is paying attention...
On Tue, May 05, 2026, Paolo Bonzini wrote:
> This version can also be found in the "queue" branch of kvm.git.
> Since it should be final I'm including again for reference the full
> description.

I still have two nits, but on my end the most important thing is to stabilize the
hashes sooner than later, so I can use the resulting merge into kvm/next as the
basis for 7.2 topic branches.  I.e. feel free to ignore the nits for now if that
makes life easier for you, I can always send patches to apply on top.

> v5->v6:
> - rename make_spte_executable to change_spte_executable
> - rename byte index in update_permission_bitmask to index
> - use (u8) casts before "KVM: x86/mmu: introduce ACC_READ_MASK"
> - make commit message for "KVM: x86/mmu: split XS/XU bits for EPT" more accurate
> - add XU to shadow_acc_track_mask already in "KVM: x86/mmu: split XS/XU bits for EPT"
> - fix compilation error
> - use alternative code for __vmx_handle_ept_violation suggested by Sean
On Thu, May 7, 2026 at 4:45 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, May 05, 2026, Paolo Bonzini wrote:
> > This version can also be found in the "queue" branch of kvm.git.
> > Since it should be final I'm including again for reference the full
> > description.
>
> I still have two nits, but on my end the most important thing is to stabilize the
> hashes sooner than later, so I can use the resulting merge into kvm/next as the
> basis for 7.2 topic branches.  I.e. feel free to ignore the nits for now if that
> makes life easier for you, I can always send patches to apply on top.

No no, it's fine. The ff one I just missed... the other I am not sure
if it gains much but I'm not going to argue either way.

I'll push with the stable hashes tomorrow.

Paolo