Synchronize XSS from the GHCB to KVM's internal tracking if the guest
marks XSS as valid on a #VMGEXIT. Like XCR0, KVM needs an up-to-date copy
of XSS in order to compute the required XSTATE size when emulating
CPUID.0xD.0x1 for the guest.
Treat the incoming XSS change as an emulated write, i.e. validatate the
guest-provided value, to avoid letting the guest load garbage into KVM's
tracking. Simply ignore bad values, as either the guest managed to get an
unsupported value into hardware, or the guest is misbehaving and providing
pure garbage. In either case, KVM can't fix the broken guest.
Note, emulating the change as an MSR write also takes care of side effects,
e.g. marking dynamic CPUID bits as dirty.
Suggested-by: John Allen <john.allen@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/svm/sev.c | 3 +++
arch/x86/kvm/svm/svm.h | 1 +
2 files changed, 4 insertions(+)
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 0cd77a87dd84..0cd32df7b9b6 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3306,6 +3306,9 @@ static void sev_es_sync_from_ghcb(struct vcpu_svm *svm)
if (kvm_ghcb_xcr0_is_valid(svm))
__kvm_set_xcr(vcpu, 0, kvm_ghcb_get_xcr0(ghcb));
+ if (kvm_ghcb_xss_is_valid(svm))
+ __kvm_emulate_msr_write(vcpu, MSR_IA32_XSS, kvm_ghcb_get_xss(ghcb));
+
/* Copy the GHCB exit information into the VMCB fields */
exit_code = kvm_ghcb_get_sw_exit_code(ghcb);
control->exit_code = lower_32_bits(exit_code);
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index a42e95883b45..10d764878bcc 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -942,5 +942,6 @@ DEFINE_KVM_GHCB_ACCESSORS(sw_exit_info_1)
DEFINE_KVM_GHCB_ACCESSORS(sw_exit_info_2)
DEFINE_KVM_GHCB_ACCESSORS(sw_scratch)
DEFINE_KVM_GHCB_ACCESSORS(xcr0)
+DEFINE_KVM_GHCB_ACCESSORS(xss)
#endif
--
2.51.0.384.g4c02a37b29-goog
On Fri, Sep 12, 2025 at 04:23:07PM -0700, Sean Christopherson wrote: > Synchronize XSS from the GHCB to KVM's internal tracking if the guest > marks XSS as valid on a #VMGEXIT. Like XCR0, KVM needs an up-to-date copy > of XSS in order to compute the required XSTATE size when emulating > CPUID.0xD.0x1 for the guest. > > Treat the incoming XSS change as an emulated write, i.e. validatate the > guest-provided value, to avoid letting the guest load garbage into KVM's > tracking. Simply ignore bad values, as either the guest managed to get an > unsupported value into hardware, or the guest is misbehaving and providing > pure garbage. In either case, KVM can't fix the broken guest. > > Note, emulating the change as an MSR write also takes care of side effects, > e.g. marking dynamic CPUID bits as dirty. > > Suggested-by: John Allen <john.allen@amd.com> > Signed-off-by: Sean Christopherson <seanjc@google.com> > --- > arch/x86/kvm/svm/sev.c | 3 +++ > arch/x86/kvm/svm/svm.h | 1 + > 2 files changed, 4 insertions(+) > > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c > index 0cd77a87dd84..0cd32df7b9b6 100644 > --- a/arch/x86/kvm/svm/sev.c > +++ b/arch/x86/kvm/svm/sev.c > @@ -3306,6 +3306,9 @@ static void sev_es_sync_from_ghcb(struct vcpu_svm *svm) > if (kvm_ghcb_xcr0_is_valid(svm)) > __kvm_set_xcr(vcpu, 0, kvm_ghcb_get_xcr0(ghcb)); > > + if (kvm_ghcb_xss_is_valid(svm)) > + __kvm_emulate_msr_write(vcpu, MSR_IA32_XSS, kvm_ghcb_get_xss(ghcb)); > + It looks like this is the change that caused the selftest regression with sev-es. It's not yet clear to me what the problem is though. Thanks, John
On Tue, Sep 16, 2025, John Allen wrote: > On Fri, Sep 12, 2025 at 04:23:07PM -0700, Sean Christopherson wrote: > > Synchronize XSS from the GHCB to KVM's internal tracking if the guest > > marks XSS as valid on a #VMGEXIT. Like XCR0, KVM needs an up-to-date copy > > of XSS in order to compute the required XSTATE size when emulating > > CPUID.0xD.0x1 for the guest. > > > > Treat the incoming XSS change as an emulated write, i.e. validatate the > > guest-provided value, to avoid letting the guest load garbage into KVM's > > tracking. Simply ignore bad values, as either the guest managed to get an > > unsupported value into hardware, or the guest is misbehaving and providing > > pure garbage. In either case, KVM can't fix the broken guest. > > > > Note, emulating the change as an MSR write also takes care of side effects, > > e.g. marking dynamic CPUID bits as dirty. > > > > Suggested-by: John Allen <john.allen@amd.com> > > Signed-off-by: Sean Christopherson <seanjc@google.com> > > --- > > arch/x86/kvm/svm/sev.c | 3 +++ > > arch/x86/kvm/svm/svm.h | 1 + > > 2 files changed, 4 insertions(+) > > > > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c > > index 0cd77a87dd84..0cd32df7b9b6 100644 > > --- a/arch/x86/kvm/svm/sev.c > > +++ b/arch/x86/kvm/svm/sev.c > > @@ -3306,6 +3306,9 @@ static void sev_es_sync_from_ghcb(struct vcpu_svm *svm) > > if (kvm_ghcb_xcr0_is_valid(svm)) > > __kvm_set_xcr(vcpu, 0, kvm_ghcb_get_xcr0(ghcb)); > > > > + if (kvm_ghcb_xss_is_valid(svm)) > > + __kvm_emulate_msr_write(vcpu, MSR_IA32_XSS, kvm_ghcb_get_xss(ghcb)); > > + > > It looks like this is the change that caused the selftest regression > with sev-es. It's not yet clear to me what the problem is though. Do you see any WARNs in the guest kernel log? The most obvious potential bug is that KVM is missing a CPUID update, e.g. due to dropping an XSS write, consuming stale data, not setting cpuid_dynamic_bits_dirty, etc. But AFAICT, CPUID.0xD.1.EBX (only thing that consumes the current XSS) is only used by init_xstate_size(), and I would expect the guest kernel's sanity checks in paranoid_xstate_size_valid() to yell if KVM botches CPUID emulation. Another possibility is that unconditionally setting cpuid_dynamic_bits_dirty was masking a pre-existing (or just different) bug, and that "fixing" that flaw by eliding cpuid_dynamic_bits_dirty when "vcpu->arch.ia32_xss == data" exposed the bug.
On Tue, Sep 16, 2025 at 12:53:58PM -0700, Sean Christopherson wrote: > On Tue, Sep 16, 2025, John Allen wrote: > > On Fri, Sep 12, 2025 at 04:23:07PM -0700, Sean Christopherson wrote: > > > Synchronize XSS from the GHCB to KVM's internal tracking if the guest > > > marks XSS as valid on a #VMGEXIT. Like XCR0, KVM needs an up-to-date copy > > > of XSS in order to compute the required XSTATE size when emulating > > > CPUID.0xD.0x1 for the guest. > > > > > > Treat the incoming XSS change as an emulated write, i.e. validatate the > > > guest-provided value, to avoid letting the guest load garbage into KVM's > > > tracking. Simply ignore bad values, as either the guest managed to get an > > > unsupported value into hardware, or the guest is misbehaving and providing > > > pure garbage. In either case, KVM can't fix the broken guest. > > > > > > Note, emulating the change as an MSR write also takes care of side effects, > > > e.g. marking dynamic CPUID bits as dirty. > > > > > > Suggested-by: John Allen <john.allen@amd.com> > > > Signed-off-by: Sean Christopherson <seanjc@google.com> > > > --- > > > arch/x86/kvm/svm/sev.c | 3 +++ > > > arch/x86/kvm/svm/svm.h | 1 + > > > 2 files changed, 4 insertions(+) > > > > > > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c > > > index 0cd77a87dd84..0cd32df7b9b6 100644 > > > --- a/arch/x86/kvm/svm/sev.c > > > +++ b/arch/x86/kvm/svm/sev.c > > > @@ -3306,6 +3306,9 @@ static void sev_es_sync_from_ghcb(struct vcpu_svm *svm) > > > if (kvm_ghcb_xcr0_is_valid(svm)) > > > __kvm_set_xcr(vcpu, 0, kvm_ghcb_get_xcr0(ghcb)); > > > > > > + if (kvm_ghcb_xss_is_valid(svm)) > > > + __kvm_emulate_msr_write(vcpu, MSR_IA32_XSS, kvm_ghcb_get_xss(ghcb)); > > > + > > > > It looks like this is the change that caused the selftest regression > > with sev-es. It's not yet clear to me what the problem is though. > > Do you see any WARNs in the guest kernel log? > > The most obvious potential bug is that KVM is missing a CPUID update, e.g. due > to dropping an XSS write, consuming stale data, not setting cpuid_dynamic_bits_dirty, > etc. But AFAICT, CPUID.0xD.1.EBX (only thing that consumes the current XSS) is > only used by init_xstate_size(), and I would expect the guest kernel's sanity > checks in paranoid_xstate_size_valid() to yell if KVM botches CPUID emulation. Yes, actually that looks to be the case: [ 0.463504] ------------[ cut here ]------------ [ 0.464443] XSAVE consistency problem: size 880 != kernel_size 840 [ 0.465445] WARNING: CPU: 0 PID: 0 at arch/x86/kernel/fpu/xstate.c:638 paranoid_xstate_size_valid+0x101/0x140 [ 0.466443] Modules linked in: [ 0.467445] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.17.0-rc3-shstk-v15+ #6 PREEMPT(voluntary) [ 0.468443] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022 [ 0.469444] RIP: 0010:paranoid_xstate_size_valid+0x101/0x140 [ 0.470443] Code: 89 44 24 04 e8 00 fa ff ff 8b 44 24 04 eb c2 89 da 89 c6 48 c7 c7 80 f4 bc 9e 89 44 24 04 c6 05 9d a3 a4 ff 01 e8 3f fa fb fd <0f> 0b 8b 44 24 04 eb ce 80 3d 8a a3 a4 ff 00 74 09 e8 c9 f9 ff ff [ 0.471443] RSP: 0000:ffffffff9ee03e80 EFLAGS: 00010286 [ 0.472443] RAX: 0000000000000000 RBX: 0000000000000348 RCX: c0000000fffeffff [ 0.473443] RDX: 0000000000000000 RSI: 00000000fffeffff RDI: ffffffff9fd83c00 [ 0.474443] RBP: 000000000000000c R08: 0000000000000000 R09: 0000000000000003 [ 0.475443] R10: ffffffff9ee03d20 R11: ffff8c04fff8ffe8 R12: 0000000000000001 [ 0.476443] R13: ffffffffffffffff R14: 0000000000000001 R15: 000000007c135000 [ 0.477443] FS: 0000000000000000(0000) GS:ffff8c051c118000(0000) knlGS:0000000000000000 [ 0.478443] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 0.479443] CR2: ffff8c03f4c01000 CR3: 0008000f73822001 CR4: 0000000000f70ef0 [ 0.480445] PKRU: 55555554 [ 0.480967] Call Trace: [ 0.481446] <TASK> [ 0.481856] init_xstate_size+0xa8/0x160 [ 0.482444] fpu__init_system_xstate+0x1c4/0x500 [ 0.483444] fpu__init_system+0x93/0xc0 [ 0.484443] arch_cpu_finalize_init+0xd2/0x160 [ 0.485290] start_kernel+0x330/0x470 [ 0.485444] x86_64_start_reservations+0x14/0x30 [ 0.486443] x86_64_start_kernel+0xd0/0xe0 [ 0.487443] common_startup_64+0x13e/0x141 [ 0.488444] </TASK> [ 0.488879] ---[ end trace 0000000000000000 ]-- > > Another possibility is that unconditionally setting cpuid_dynamic_bits_dirty > was masking a pre-existing (or just different) bug, and that "fixing" that flaw > by eliding cpuid_dynamic_bits_dirty when "vcpu->arch.ia32_xss == data" exposed > the bug.
On Tue, Sep 16, 2025, John Allen wrote: > On Tue, Sep 16, 2025 at 12:53:58PM -0700, Sean Christopherson wrote: > > On Tue, Sep 16, 2025, John Allen wrote: > > > On Fri, Sep 12, 2025 at 04:23:07PM -0700, Sean Christopherson wrote: > > > > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c > > > > index 0cd77a87dd84..0cd32df7b9b6 100644 > > > > --- a/arch/x86/kvm/svm/sev.c > > > > +++ b/arch/x86/kvm/svm/sev.c > > > > @@ -3306,6 +3306,9 @@ static void sev_es_sync_from_ghcb(struct vcpu_svm *svm) > > > > if (kvm_ghcb_xcr0_is_valid(svm)) > > > > __kvm_set_xcr(vcpu, 0, kvm_ghcb_get_xcr0(ghcb)); > > > > > > > > + if (kvm_ghcb_xss_is_valid(svm)) > > > > + __kvm_emulate_msr_write(vcpu, MSR_IA32_XSS, kvm_ghcb_get_xss(ghcb)); > > > > + > > > > > > It looks like this is the change that caused the selftest regression > > > with sev-es. It's not yet clear to me what the problem is though. > > > > Do you see any WARNs in the guest kernel log? > > > > The most obvious potential bug is that KVM is missing a CPUID update, e.g. due > > to dropping an XSS write, consuming stale data, not setting cpuid_dynamic_bits_dirty, > > etc. But AFAICT, CPUID.0xD.1.EBX (only thing that consumes the current XSS) is > > only used by init_xstate_size(), and I would expect the guest kernel's sanity > > checks in paranoid_xstate_size_valid() to yell if KVM botches CPUID emulation. > > Yes, actually that looks to be the case: > > [ 0.463504] ------------[ cut here ]------------ > [ 0.464443] XSAVE consistency problem: size 880 != kernel_size 840 > [ 0.465445] WARNING: CPU: 0 PID: 0 at arch/x86/kernel/fpu/xstate.c:638 paranoid_xstate_size_valid+0x101/0x140 Can you run with the below printk tracing in the host (and optionally tracing in the guest for its updates)? Compile tested only. There should be very few XSS updates, so this _shouldn't_ spam/crash your host :-) --- arch/x86/kvm/svm/sev.c | 6 ++++-- arch/x86/kvm/x86.c | 15 ++++++++++++--- 2 files changed, 16 insertions(+), 5 deletions(-) diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 0cd32df7b9b6..8ac87d623767 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -3306,8 +3306,10 @@ static void sev_es_sync_from_ghcb(struct vcpu_svm *svm) if (kvm_ghcb_xcr0_is_valid(svm)) __kvm_set_xcr(vcpu, 0, kvm_ghcb_get_xcr0(ghcb)); - if (kvm_ghcb_xss_is_valid(svm)) - __kvm_emulate_msr_write(vcpu, MSR_IA32_XSS, kvm_ghcb_get_xss(ghcb)); + if (kvm_ghcb_xss_is_valid(svm)) { + if (__kvm_emulate_msr_write(vcpu, MSR_IA32_XSS, kvm_ghcb_get_xss(ghcb)); + pr_warn("Dropped XSS update, val = %llx\n", data); + } /* Copy the GHCB exit information into the VMCB fields */ exit_code = kvm_ghcb_get_sw_exit_code(ghcb); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index c78acab2ff3f..a846ed69ce2c 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4118,13 +4118,22 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) } break; case MSR_IA32_XSS: - if (!guest_cpuid_has(vcpu, X86_FEATURE_XSAVES)) + if (!guest_cpuid_has(vcpu, X86_FEATURE_XSAVES)) { + pr_warn("Guest CPUID doesn't have XSAVES\n"); return KVM_MSR_RET_UNSUPPORTED; + } - if (data & ~vcpu->arch.guest_supported_xss) + if (data & ~vcpu->arch.guest_supported_xss) { + pr_warn("Invalid XSS: supported = %llx, val = %llx\n", + vcpu->arch.guest_supported_xss, data); return 1; - if (vcpu->arch.ia32_xss == data) + } + if (vcpu->arch.ia32_xss == data) { + pr_warn("XSS already set to val = %llx, eliding updates\n", data); break; + } + + pr_warn("XSS updated to val = %llx, marking CPUID dirty\n", data); vcpu->arch.ia32_xss = data; vcpu->arch.cpuid_dynamic_bits_dirty = true; break; base-commit: 14298d819d5a6b7180a4089e7d2121ca3551dc6c --
On Tue, Sep 16, 2025 at 02:38:52PM -0700, Sean Christopherson wrote: > On Tue, Sep 16, 2025, John Allen wrote: > > On Tue, Sep 16, 2025 at 12:53:58PM -0700, Sean Christopherson wrote: > > > On Tue, Sep 16, 2025, John Allen wrote: > > > > On Fri, Sep 12, 2025 at 04:23:07PM -0700, Sean Christopherson wrote: > > > > > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c > > > > > index 0cd77a87dd84..0cd32df7b9b6 100644 > > > > > --- a/arch/x86/kvm/svm/sev.c > > > > > +++ b/arch/x86/kvm/svm/sev.c > > > > > @@ -3306,6 +3306,9 @@ static void sev_es_sync_from_ghcb(struct vcpu_svm *svm) > > > > > if (kvm_ghcb_xcr0_is_valid(svm)) > > > > > __kvm_set_xcr(vcpu, 0, kvm_ghcb_get_xcr0(ghcb)); > > > > > > > > > > + if (kvm_ghcb_xss_is_valid(svm)) > > > > > + __kvm_emulate_msr_write(vcpu, MSR_IA32_XSS, kvm_ghcb_get_xss(ghcb)); > > > > > + > > > > > > > > It looks like this is the change that caused the selftest regression > > > > with sev-es. It's not yet clear to me what the problem is though. > > > > > > Do you see any WARNs in the guest kernel log? > > > > > > The most obvious potential bug is that KVM is missing a CPUID update, e.g. due > > > to dropping an XSS write, consuming stale data, not setting cpuid_dynamic_bits_dirty, > > > etc. But AFAICT, CPUID.0xD.1.EBX (only thing that consumes the current XSS) is > > > only used by init_xstate_size(), and I would expect the guest kernel's sanity > > > checks in paranoid_xstate_size_valid() to yell if KVM botches CPUID emulation. > > > > Yes, actually that looks to be the case: > > > > [ 0.463504] ------------[ cut here ]------------ > > [ 0.464443] XSAVE consistency problem: size 880 != kernel_size 840 > > [ 0.465445] WARNING: CPU: 0 PID: 0 at arch/x86/kernel/fpu/xstate.c:638 paranoid_xstate_size_valid+0x101/0x140 > > Can you run with the below printk tracing in the host (and optionally tracing in > the guest for its updates)? Compile tested only. Interesting, I see "Guest CPUID doesn't have XSAVES" times the number of cpus followed by "XSS already set to val = 0, eliding updates" times the number of cpus. This is with host tracing only. I can try with guest tracing too in the morning. Thanks, John
On Tue, Sep 16, 2025 at 05:55:33PM -0500, John Allen wrote: > On Tue, Sep 16, 2025 at 02:38:52PM -0700, Sean Christopherson wrote: > > On Tue, Sep 16, 2025, John Allen wrote: > > > On Tue, Sep 16, 2025 at 12:53:58PM -0700, Sean Christopherson wrote: > > > > On Tue, Sep 16, 2025, John Allen wrote: > > > > > On Fri, Sep 12, 2025 at 04:23:07PM -0700, Sean Christopherson wrote: > > > > > > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c > > > > > > index 0cd77a87dd84..0cd32df7b9b6 100644 > > > > > > --- a/arch/x86/kvm/svm/sev.c > > > > > > +++ b/arch/x86/kvm/svm/sev.c > > > > > > @@ -3306,6 +3306,9 @@ static void sev_es_sync_from_ghcb(struct vcpu_svm *svm) > > > > > > if (kvm_ghcb_xcr0_is_valid(svm)) > > > > > > __kvm_set_xcr(vcpu, 0, kvm_ghcb_get_xcr0(ghcb)); > > > > > > > > > > > > + if (kvm_ghcb_xss_is_valid(svm)) > > > > > > + __kvm_emulate_msr_write(vcpu, MSR_IA32_XSS, kvm_ghcb_get_xss(ghcb)); > > > > > > + > > > > > > > > > > It looks like this is the change that caused the selftest regression > > > > > with sev-es. It's not yet clear to me what the problem is though. > > > > > > > > Do you see any WARNs in the guest kernel log? > > > > > > > > The most obvious potential bug is that KVM is missing a CPUID update, e.g. due > > > > to dropping an XSS write, consuming stale data, not setting cpuid_dynamic_bits_dirty, > > > > etc. But AFAICT, CPUID.0xD.1.EBX (only thing that consumes the current XSS) is > > > > only used by init_xstate_size(), and I would expect the guest kernel's sanity > > > > checks in paranoid_xstate_size_valid() to yell if KVM botches CPUID emulation. > > > > > > Yes, actually that looks to be the case: > > > > > > [ 0.463504] ------------[ cut here ]------------ > > > [ 0.464443] XSAVE consistency problem: size 880 != kernel_size 840 > > > [ 0.465445] WARNING: CPU: 0 PID: 0 at arch/x86/kernel/fpu/xstate.c:638 paranoid_xstate_size_valid+0x101/0x140 > > > > Can you run with the below printk tracing in the host (and optionally tracing in > > the guest for its updates)? Compile tested only. > > Interesting, I see "Guest CPUID doesn't have XSAVES" times the number of > cpus followed by "XSS already set to val = 0, eliding updates" times the > number of cpus. This is with host tracing only. I can try with guest > tracing too in the morning. Ok, I think I see the problem. The cases above where we were seeing the added print statements from kvm_set_msr_common were not situations where we were going through the __kvm_emulate_msr_write via sev_es_sync_from_ghcb. When we call __kvm_emulate_msr_write from this context, we never end up getting to kvm_set_msr_common because we hit the following statement at the top of svm_set_msr: if (sev_es_prevent_msr_access(vcpu, msr)) return vcpu->kvm->arch.has_protected_state ? -EINVAL : 0; So I'm not sure if this would force using the original method of directly setting arch.ia32_xss or if there's some additional handling here that we need in this scenario to allow the msr access. Thanks, John
On Thu, Sep 18, 2025, John Allen wrote: > On Tue, Sep 16, 2025 at 05:55:33PM -0500, John Allen wrote: > > On Tue, Sep 16, 2025 at 02:38:52PM -0700, Sean Christopherson wrote: > > > On Tue, Sep 16, 2025, John Allen wrote: > > > > On Tue, Sep 16, 2025 at 12:53:58PM -0700, Sean Christopherson wrote: > > > > > On Tue, Sep 16, 2025, John Allen wrote: > > > > > > On Fri, Sep 12, 2025 at 04:23:07PM -0700, Sean Christopherson wrote: > > > > > > > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c > > > > > > > index 0cd77a87dd84..0cd32df7b9b6 100644 > > > > > > > --- a/arch/x86/kvm/svm/sev.c > > > > > > > +++ b/arch/x86/kvm/svm/sev.c > > > > > > > @@ -3306,6 +3306,9 @@ static void sev_es_sync_from_ghcb(struct vcpu_svm *svm) > > > > > > > if (kvm_ghcb_xcr0_is_valid(svm)) > > > > > > > __kvm_set_xcr(vcpu, 0, kvm_ghcb_get_xcr0(ghcb)); > > > > > > > > > > > > > > + if (kvm_ghcb_xss_is_valid(svm)) > > > > > > > + __kvm_emulate_msr_write(vcpu, MSR_IA32_XSS, kvm_ghcb_get_xss(ghcb)); > > > > > > > + > > > > > > > > > > > > It looks like this is the change that caused the selftest regression > > > > > > with sev-es. It's not yet clear to me what the problem is though. > > > > > > > > > > Do you see any WARNs in the guest kernel log? > > > > > > > > > > The most obvious potential bug is that KVM is missing a CPUID update, e.g. due > > > > > to dropping an XSS write, consuming stale data, not setting cpuid_dynamic_bits_dirty, > > > > > etc. But AFAICT, CPUID.0xD.1.EBX (only thing that consumes the current XSS) is > > > > > only used by init_xstate_size(), and I would expect the guest kernel's sanity > > > > > checks in paranoid_xstate_size_valid() to yell if KVM botches CPUID emulation. > > > > > > > > Yes, actually that looks to be the case: > > > > > > > > [ 0.463504] ------------[ cut here ]------------ > > > > [ 0.464443] XSAVE consistency problem: size 880 != kernel_size 840 > > > > [ 0.465445] WARNING: CPU: 0 PID: 0 at arch/x86/kernel/fpu/xstate.c:638 paranoid_xstate_size_valid+0x101/0x140 > > > > > > Can you run with the below printk tracing in the host (and optionally tracing in > > > the guest for its updates)? Compile tested only. > > > > Interesting, I see "Guest CPUID doesn't have XSAVES" times the number of > > cpus followed by "XSS already set to val = 0, eliding updates" times the > > number of cpus. This is with host tracing only. I can try with guest > > tracing too in the morning. > > Ok, I think I see the problem. The cases above where we were seeing the > added print statements from kvm_set_msr_common were not situations where > we were going through the __kvm_emulate_msr_write via > sev_es_sync_from_ghcb. When we call __kvm_emulate_msr_write from this > context, we never end up getting to kvm_set_msr_common because we hit > the following statement at the top of svm_set_msr: > > if (sev_es_prevent_msr_access(vcpu, msr)) > return vcpu->kvm->arch.has_protected_state ? -EINVAL : 0; Gah, I was looking for something like that but couldn't find it, obviously. > So I'm not sure if this would force using the original method of > directly setting arch.ia32_xss or if there's some additional handling > here that we need in this scenario to allow the msr access. Does this fix things? If so, I'll slot in a patch to extract setting XSS to the helper, and then this patch can use that API. I like the symmetry between __kvm_set_xcr() and __kvm_set_xss(), and I especially like not doing a generic end-around on svm_set_msr() by calling kvm_set_msr_common() directly. diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 945f7da60107..ace9f321d2c9 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -2213,6 +2213,7 @@ unsigned long kvm_get_cr8(struct kvm_vcpu *vcpu); void kvm_lmsw(struct kvm_vcpu *vcpu, unsigned long msw); int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr); int kvm_emulate_xsetbv(struct kvm_vcpu *vcpu); +int __kvm_set_xss(struct kvm_vcpu *vcpu, u64 xss); int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr); int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr); diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 94d9acc94c9a..462aebc54135 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -3355,7 +3355,7 @@ static void sev_es_sync_from_ghcb(struct vcpu_svm *svm) __kvm_set_xcr(vcpu, 0, kvm_ghcb_get_xcr0(svm)); if (kvm_ghcb_xss_is_valid(svm)) - __kvm_emulate_msr_write(vcpu, MSR_IA32_XSS, kvm_ghcb_get_xss(svm)); + __kvm_set_xss(vcpu, kvm_ghcb_get_xss(svm)); /* Copy the GHCB exit information into the VMCB fields */ exit_code = kvm_ghcb_get_sw_exit_code(svm); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 5bbc187ab428..9b81e92a8de5 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1313,6 +1313,22 @@ int kvm_emulate_xsetbv(struct kvm_vcpu *vcpu) } EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_emulate_xsetbv); +int __kvm_set_xss(struct kvm_vcpu *vcpu, u64 xss) +{ + if (!guest_cpuid_has(vcpu, X86_FEATURE_XSAVES)) + return KVM_MSR_RET_UNSUPPORTED; + + if (xss & ~vcpu->arch.guest_supported_xss) + return 1; + if (vcpu->arch.ia32_xss == xss) + return 0; + + vcpu->arch.ia32_xss = xss; + vcpu->arch.cpuid_dynamic_bits_dirty = true; + return 0; +} +EXPORT_SYMBOL_GPL(__kvm_set_xss); + static bool kvm_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4) { return __kvm_is_valid_cr4(vcpu, cr4) && @@ -4119,16 +4135,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) } break; case MSR_IA32_XSS: - if (!guest_cpuid_has(vcpu, X86_FEATURE_XSAVES)) - return KVM_MSR_RET_UNSUPPORTED; - - if (data & ~vcpu->arch.guest_supported_xss) - return 1; - if (vcpu->arch.ia32_xss == data) - break; - vcpu->arch.ia32_xss = data; - vcpu->arch.cpuid_dynamic_bits_dirty = true; - break; + return __kvm_set_xss(vcpu, data); case MSR_SMI_COUNT: if (!msr_info->host_initiated) return 1;
On Thu, Sep 18, 2025, Sean Christopherson wrote: > On Thu, Sep 18, 2025, John Allen wrote: > > On Tue, Sep 16, 2025 at 05:55:33PM -0500, John Allen wrote: > > > Interesting, I see "Guest CPUID doesn't have XSAVES" times the number of > > > cpus followed by "XSS already set to val = 0, eliding updates" times the > > > number of cpus. This is with host tracing only. I can try with guest > > > tracing too in the morning. > > > > Ok, I think I see the problem. The cases above where we were seeing the > > added print statements from kvm_set_msr_common were not situations where > > we were going through the __kvm_emulate_msr_write via > > sev_es_sync_from_ghcb. When we call __kvm_emulate_msr_write from this > > context, we never end up getting to kvm_set_msr_common because we hit > > the following statement at the top of svm_set_msr: > > > > if (sev_es_prevent_msr_access(vcpu, msr)) > > return vcpu->kvm->arch.has_protected_state ? -EINVAL : 0; > > Gah, I was looking for something like that but couldn't find it, obviously. > > > So I'm not sure if this would force using the original method of > > directly setting arch.ia32_xss or if there's some additional handling > > here that we need in this scenario to allow the msr access. > > Does this fix things? If so, I'll slot in a patch to extract setting XSS to > the helper, and then this patch can use that API. I like the symmetry between > __kvm_set_xcr() and __kvm_set_xss(), and I especially like not doing a generic > end-around on svm_set_msr() by calling kvm_set_msr_common() directly. Scratch that, KVM supports intra-host (and inter-host?) migration of SEV-ES guests and so needs to allow the host to save/restore XSS, otherwise a guest that *knows* its XSS hasn't change could get stale/bad CPUID emulation if the guest doesn't provide XSS in the GHCB on every exit. So while seemingly hacky, I'm pretty sure the right solution is actually: diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index cabe1950b160..d48bf20c865b 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -2721,8 +2721,8 @@ static int svm_get_feature_msr(u32 msr, u64 *data) static bool sev_es_prevent_msr_access(struct kvm_vcpu *vcpu, struct msr_data *msr_info) { - return sev_es_guest(vcpu->kvm) && - vcpu->arch.guest_state_protected && + return sev_es_guest(vcpu->kvm) && vcpu->arch.guest_state_protected && + msr_info->index != MSR_IA32_XSS && !msr_write_intercepted(vcpu, msr_info->index); } Side topic, checking msr_write_intercepted() is likely wrong. It's a bad heuristic for "managed in the VMSA". MSRs that _KVM_ loads into hardware and context switches should still be accessible. I haven't looked to see if this is a problem in practice. > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > index 945f7da60107..ace9f321d2c9 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -2213,6 +2213,7 @@ unsigned long kvm_get_cr8(struct kvm_vcpu *vcpu); > void kvm_lmsw(struct kvm_vcpu *vcpu, unsigned long msw); > int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr); > int kvm_emulate_xsetbv(struct kvm_vcpu *vcpu); > +int __kvm_set_xss(struct kvm_vcpu *vcpu, u64 xss); > > int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr); > int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr); > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c > index 94d9acc94c9a..462aebc54135 100644 > --- a/arch/x86/kvm/svm/sev.c > +++ b/arch/x86/kvm/svm/sev.c > @@ -3355,7 +3355,7 @@ static void sev_es_sync_from_ghcb(struct vcpu_svm *svm) > __kvm_set_xcr(vcpu, 0, kvm_ghcb_get_xcr0(svm)); > > if (kvm_ghcb_xss_is_valid(svm)) > - __kvm_emulate_msr_write(vcpu, MSR_IA32_XSS, kvm_ghcb_get_xss(svm)); > + __kvm_set_xss(vcpu, kvm_ghcb_get_xss(svm)); > > /* Copy the GHCB exit information into the VMCB fields */ > exit_code = kvm_ghcb_get_sw_exit_code(svm); > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 5bbc187ab428..9b81e92a8de5 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -1313,6 +1313,22 @@ int kvm_emulate_xsetbv(struct kvm_vcpu *vcpu) > } > EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_emulate_xsetbv); > > +int __kvm_set_xss(struct kvm_vcpu *vcpu, u64 xss) > +{ > + if (!guest_cpuid_has(vcpu, X86_FEATURE_XSAVES)) > + return KVM_MSR_RET_UNSUPPORTED; > + > + if (xss & ~vcpu->arch.guest_supported_xss) > + return 1; > + if (vcpu->arch.ia32_xss == xss) > + return 0; > + > + vcpu->arch.ia32_xss = xss; > + vcpu->arch.cpuid_dynamic_bits_dirty = true; > + return 0; > +} > +EXPORT_SYMBOL_GPL(__kvm_set_xss); > + > static bool kvm_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4) > { > return __kvm_is_valid_cr4(vcpu, cr4) && > @@ -4119,16 +4135,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) > } > break; > case MSR_IA32_XSS: > - if (!guest_cpuid_has(vcpu, X86_FEATURE_XSAVES)) > - return KVM_MSR_RET_UNSUPPORTED; > - > - if (data & ~vcpu->arch.guest_supported_xss) > - return 1; > - if (vcpu->arch.ia32_xss == data) > - break; > - vcpu->arch.ia32_xss = data; > - vcpu->arch.cpuid_dynamic_bits_dirty = true; > - break; > + return __kvm_set_xss(vcpu, data); > case MSR_SMI_COUNT: > if (!msr_info->host_initiated) > return 1;
On Thu, Sep 18, 2025 at 01:44:13PM -0700, Sean Christopherson wrote: > On Thu, Sep 18, 2025, Sean Christopherson wrote: > > On Thu, Sep 18, 2025, John Allen wrote: > > > On Tue, Sep 16, 2025 at 05:55:33PM -0500, John Allen wrote: > > > > Interesting, I see "Guest CPUID doesn't have XSAVES" times the number of > > > > cpus followed by "XSS already set to val = 0, eliding updates" times the > > > > number of cpus. This is with host tracing only. I can try with guest > > > > tracing too in the morning. > > > > > > Ok, I think I see the problem. The cases above where we were seeing the > > > added print statements from kvm_set_msr_common were not situations where > > > we were going through the __kvm_emulate_msr_write via > > > sev_es_sync_from_ghcb. When we call __kvm_emulate_msr_write from this > > > context, we never end up getting to kvm_set_msr_common because we hit > > > the following statement at the top of svm_set_msr: > > > > > > if (sev_es_prevent_msr_access(vcpu, msr)) > > > return vcpu->kvm->arch.has_protected_state ? -EINVAL : 0; > > > > Gah, I was looking for something like that but couldn't find it, obviously. > > > > > So I'm not sure if this would force using the original method of > > > directly setting arch.ia32_xss or if there's some additional handling > > > here that we need in this scenario to allow the msr access. > > > > Does this fix things? If so, I'll slot in a patch to extract setting XSS to > > the helper, and then this patch can use that API. I like the symmetry between > > __kvm_set_xcr() and __kvm_set_xss(), and I especially like not doing a generic > > end-around on svm_set_msr() by calling kvm_set_msr_common() directly. > > Scratch that, KVM supports intra-host (and inter-host?) migration of SEV-ES > guests and so needs to allow the host to save/restore XSS, otherwise a guest > that *knows* its XSS hasn't change could get stale/bad CPUID emulation if the > guest doesn't provide XSS in the GHCB on every exit. > > So while seemingly hacky, I'm pretty sure the right solution is actually: > > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c > index cabe1950b160..d48bf20c865b 100644 > --- a/arch/x86/kvm/svm/svm.c > +++ b/arch/x86/kvm/svm/svm.c > @@ -2721,8 +2721,8 @@ static int svm_get_feature_msr(u32 msr, u64 *data) > static bool sev_es_prevent_msr_access(struct kvm_vcpu *vcpu, > struct msr_data *msr_info) > { > - return sev_es_guest(vcpu->kvm) && > - vcpu->arch.guest_state_protected && > + return sev_es_guest(vcpu->kvm) && vcpu->arch.guest_state_protected && > + msr_info->index != MSR_IA32_XSS && > !msr_write_intercepted(vcpu, msr_info->index); > } Yes, it looks like this fixes the regression. Thanks! The 32bit selftest still doesn't work properly with sev-es, but that was a problem with the previous version too. I suspect there's some incompatibility between sev-es and the test, but I haven't been able to get a good answer on why that might be. Thanks, John > > Side topic, checking msr_write_intercepted() is likely wrong. It's a bad > heuristic for "managed in the VMSA". MSRs that _KVM_ loads into hardware and > context switches should still be accessible. I haven't looked to see if this is > a problem in practice. > > > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > > index 945f7da60107..ace9f321d2c9 100644 > > --- a/arch/x86/include/asm/kvm_host.h > > +++ b/arch/x86/include/asm/kvm_host.h > > @@ -2213,6 +2213,7 @@ unsigned long kvm_get_cr8(struct kvm_vcpu *vcpu); > > void kvm_lmsw(struct kvm_vcpu *vcpu, unsigned long msw); > > int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr); > > int kvm_emulate_xsetbv(struct kvm_vcpu *vcpu); > > +int __kvm_set_xss(struct kvm_vcpu *vcpu, u64 xss); > > > > int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr); > > int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr); > > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c > > index 94d9acc94c9a..462aebc54135 100644 > > --- a/arch/x86/kvm/svm/sev.c > > +++ b/arch/x86/kvm/svm/sev.c > > @@ -3355,7 +3355,7 @@ static void sev_es_sync_from_ghcb(struct vcpu_svm *svm) > > __kvm_set_xcr(vcpu, 0, kvm_ghcb_get_xcr0(svm)); > > > > if (kvm_ghcb_xss_is_valid(svm)) > > - __kvm_emulate_msr_write(vcpu, MSR_IA32_XSS, kvm_ghcb_get_xss(svm)); > > + __kvm_set_xss(vcpu, kvm_ghcb_get_xss(svm)); > > > > /* Copy the GHCB exit information into the VMCB fields */ > > exit_code = kvm_ghcb_get_sw_exit_code(svm); > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > index 5bbc187ab428..9b81e92a8de5 100644 > > --- a/arch/x86/kvm/x86.c > > +++ b/arch/x86/kvm/x86.c > > @@ -1313,6 +1313,22 @@ int kvm_emulate_xsetbv(struct kvm_vcpu *vcpu) > > } > > EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_emulate_xsetbv); > > > > +int __kvm_set_xss(struct kvm_vcpu *vcpu, u64 xss) > > +{ > > + if (!guest_cpuid_has(vcpu, X86_FEATURE_XSAVES)) > > + return KVM_MSR_RET_UNSUPPORTED; > > + > > + if (xss & ~vcpu->arch.guest_supported_xss) > > + return 1; > > + if (vcpu->arch.ia32_xss == xss) > > + return 0; > > + > > + vcpu->arch.ia32_xss = xss; > > + vcpu->arch.cpuid_dynamic_bits_dirty = true; > > + return 0; > > +} > > +EXPORT_SYMBOL_GPL(__kvm_set_xss); > > + > > static bool kvm_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4) > > { > > return __kvm_is_valid_cr4(vcpu, cr4) && > > @@ -4119,16 +4135,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) > > } > > break; > > case MSR_IA32_XSS: > > - if (!guest_cpuid_has(vcpu, X86_FEATURE_XSAVES)) > > - return KVM_MSR_RET_UNSUPPORTED; > > - > > - if (data & ~vcpu->arch.guest_supported_xss) > > - return 1; > > - if (vcpu->arch.ia32_xss == data) > > - break; > > - vcpu->arch.ia32_xss = data; > > - vcpu->arch.cpuid_dynamic_bits_dirty = true; > > - break; > > + return __kvm_set_xss(vcpu, data); > > case MSR_SMI_COUNT: > > if (!msr_info->host_initiated) > > return 1;
On Thu, 2025-09-18 at 16:23 -0500, John Allen wrote: > The 32bit selftest still doesn't work properly with sev-es, but that was > a problem with the previous version too. I suspect there's some > incompatibility between sev-es and the test, but I haven't been able to > get a good answer on why that might be. You are talking about test_32bit() in test_shadow_stack.c? That test relies on a specific CET arch behavior. If you try to transition to a 32 bit compatibility mode segment with an SSP with high bits set (outside the 32 bit address space), a #GP will be triggered by the HW. The test verifies that this happens and the kernel handles it appropriately. Could it be platform/mode difference and not KVM issue?
On Thu, Sep 18, 2025 at 09:42:21PM +0000, Edgecombe, Rick P wrote: > On Thu, 2025-09-18 at 16:23 -0500, John Allen wrote: > > The 32bit selftest still doesn't work properly with sev-es, but that was > > a problem with the previous version too. I suspect there's some > > incompatibility between sev-es and the test, but I haven't been able to > > get a good answer on why that might be. > > You are talking about test_32bit() in test_shadow_stack.c? Yes, that's right. > > That test relies on a specific CET arch behavior. If you try to transition to a > 32 bit compatibility mode segment with an SSP with high bits set (outside the 32 > bit address space), a #GP will be triggered by the HW. The test verifies that > this happens and the kernel handles it appropriately. Could it be platform/mode > difference and not KVM issue? I'm fairly certain that this is an issue with any sev-es guest. The unexpected seg fault happens when we isolate the sigaction32 call used in the test regardless of shadow stack support. So I wonder if it's something similar to the case that the test is checking for. Maybe something to do with the C bit. Thanks, John
On 9/18/25 17:18, John Allen wrote: > On Thu, Sep 18, 2025 at 09:42:21PM +0000, Edgecombe, Rick P wrote: >> On Thu, 2025-09-18 at 16:23 -0500, John Allen wrote: >>> The 32bit selftest still doesn't work properly with sev-es, but that was >>> a problem with the previous version too. I suspect there's some >>> incompatibility between sev-es and the test, but I haven't been able to >>> get a good answer on why that might be. >> >> You are talking about test_32bit() in test_shadow_stack.c? > > Yes, that's right. > >> >> That test relies on a specific CET arch behavior. If you try to transition to a >> 32 bit compatibility mode segment with an SSP with high bits set (outside the 32 >> bit address space), a #GP will be triggered by the HW. The test verifies that >> this happens and the kernel handles it appropriately. Could it be platform/mode >> difference and not KVM issue? > > I'm fairly certain that this is an issue with any sev-es guest. The > unexpected seg fault happens when we isolate the sigaction32 call used > in the test regardless of shadow stack support. So I wonder if it's > something similar to the case that the test is checking for. Maybe > something to do with the C bit. Likely something to do with the encryption bit since, if set, will generate an invalid address in 32-bit, right? For SEV-ES, we transition to 64-bit very quickly because of the use of the encryption bit, which is why, for example, we don't support SEV-ES / SEV-SNP in the OvmfIa32X64.dsc package. Thanks, Tom > > Thanks, > John
On Fri, 2025-09-19 at 08:40 -0500, Tom Lendacky wrote: > Likely something to do with the encryption bit since, if set, will > generate an invalid address in 32-bit, right? But the SSP is a virtual address and c-bit is a physical thing. > > For SEV-ES, we transition to 64-bit very quickly because of the use of the > encryption bit, which is why, for example, we don't support SEV-ES / > SEV-SNP in the OvmfIa32X64.dsc package. This sounds like it's about the lack of ability to set the c-bit in the page table, rather than having the C-bit set in a virtual address. In compatibility mode you are not using 32 bit page tables, so the C-bit should be available like normal I think. Not an expert in 32 bit/compatibility mode though. More background on this test/behavior: During the tail end of the shadow stack enabling, there was a concern raised that we didn't un-support 32 bit shadow stack cleanly enough. We blocked it from being allowed in 32 bit apps, but nothing stopped an app from enabling it in 64 bit an then switching to 32 bit mode without the kernel getting a chance to block it. The simplest, get-it-done type solution was to just not allocate shadow stacks in the space where they could be usable in 32 bit mode and let the HW catch it. But the whole point is just to not allow 32 bit mode CET. Sounds like SEV-ES covers this another way - don't support 32 bit at all. I wonder if we should just patch the test to skip the 32 bit test on coco VMs? PS, we don't support CET on TDX currently even though it doesn't require everything in this series, but I just remembered (forehead slap) that on the way upstream the extra CET-TDX exclusion got pulled out. After this series, it would be allowed in TDX guests as well. So we need to do the same testing in TDX. Let me see how the test goes in TDX and get back to you.
+Kiryl, a CET selftest that does int80 fails on SEV-ES. On Fri, 2025-09-19 at 10:29 -0700, Rick Edgecombe wrote: > PS, we don't support CET on TDX currently even though it doesn't require > everything in this series, but I just remembered (forehead slap) that on the way > upstream the extra CET-TDX exclusion got pulled out. After this series, it would > be allowed in TDX guests as well. So we need to do the same testing in TDX. Let > me see how the test goes in TDX and get back to you. The test passes on a TDX guest: [INFO] new_ssp = 7f8c8d7ffff8, *new_ssp = 7f8c8d800001 [INFO] changing ssp from 7f8c8e1ffff0 to 7f8c8d7ffff8 [INFO] ssp is now 7f8c8d800000 [OK] Shadow stack pivot [OK] Shadow stack faults [INFO] Corrupting shadow stack [INFO] Generated shadow stack violation successfully [OK] Shadow stack violation test [INFO] Gup read -> shstk access success [INFO] Gup write -> shstk access success [INFO] Violation from normal write [INFO] Gup read -> write access success [INFO] Violation from normal write [INFO] Gup write -> write access success [INFO] Cow gup write -> write access success [OK] Shadow gup test [INFO] Violation from shstk access [OK] mprotect() test [OK] Userfaultfd test [OK] Guard gap test, other mapping's gaps [OK] Guard gap test, placement mapping's gaps [OK] Ptrace test [OK] 32 bit test [OK] Uretprobe test I guess int 80 was re-enabled for TDX, after being disabled for both coco families. See commits starting back from f4116bfc4462 ("x86/tdx: Allow 32-bit emulation by default"). Not sure why it was done that way. If there is some way to re-enable int80 for SEV-ES too, we can leave the test as is. But if you decide to disable the 32 bit test to resolve this, please leave it working for TDX.
On Fri, Sep 19, 2025 at 08:58:45PM +0000, Edgecombe, Rick P wrote: > +Kiryl, a CET selftest that does int80 fails on SEV-ES. > > On Fri, 2025-09-19 at 10:29 -0700, Rick Edgecombe wrote: > > PS, we don't support CET on TDX currently even though it doesn't require > > everything in this series, but I just remembered (forehead slap) that on the way > > upstream the extra CET-TDX exclusion got pulled out. After this series, it would > > be allowed in TDX guests as well. So we need to do the same testing in TDX. Let > > me see how the test goes in TDX and get back to you. > > The test passes on a TDX guest: > > [INFO] new_ssp = 7f8c8d7ffff8, *new_ssp = 7f8c8d800001 > [INFO] changing ssp from 7f8c8e1ffff0 to 7f8c8d7ffff8 > [INFO] ssp is now 7f8c8d800000 > [OK] Shadow stack pivot > [OK] Shadow stack faults > [INFO] Corrupting shadow stack > [INFO] Generated shadow stack violation successfully > [OK] Shadow stack violation test > [INFO] Gup read -> shstk access success > [INFO] Gup write -> shstk access success > [INFO] Violation from normal write > [INFO] Gup read -> write access success > [INFO] Violation from normal write > [INFO] Gup write -> write access success > [INFO] Cow gup write -> write access success > [OK] Shadow gup test > [INFO] Violation from shstk access > [OK] mprotect() test > [OK] Userfaultfd test > [OK] Guard gap test, other mapping's gaps > [OK] Guard gap test, placement mapping's gaps > [OK] Ptrace test > [OK] 32 bit test > [OK] Uretprobe test > > > I guess int 80 was re-enabled for TDX, after being disabled for both coco > families. See commits starting back from f4116bfc4462 ("x86/tdx: Allow 32-bit > emulation by default"). Not sure why it was done that way. If there is some way > to re-enable int80 for SEV-ES too, we can leave the test as is. But if you > decide to disable the 32 bit test to resolve this, please leave it working for > TDX. In TDX case, VAPIC state is protected VMM. It covers ISR, so guest can safely check ISR to detect if the exception is external or internal. IIUC, VAPIC state is controlled by VMM in SEV case and ISR is not reliable. I am not sure if Secure AVIC[1] changes the situation for AMD. Neeraj? [1] https://lore.kernel.org/all/20250811094444.203161-1-Neeraj.Upadhyay@amd.com/ -- Kiryl Shutsemau / Kirill A. Shutemov
> > In TDX case, VAPIC state is protected VMM. It covers ISR, so guest can > safely check ISR to detect if the exception is external or internal. > > IIUC, VAPIC state is controlled by VMM in SEV case and ISR is not > reliable. > > I am not sure if Secure AVIC[1] changes the situation for AMD. > > Neeraj? > For Secure AVIC enabled guests, guest's vAPIC ISR state is not visible to (and not controlled by) host or VMM. - Neeraj
On Mon, Sep 22, 2025 at 03:03:59PM +0530, Upadhyay, Neeraj wrote: > > > > > In TDX case, VAPIC state is protected VMM. It covers ISR, so guest can > > safely check ISR to detect if the exception is external or internal. > > > > IIUC, VAPIC state is controlled by VMM in SEV case and ISR is not > > reliable. > > > > I am not sure if Secure AVIC[1] changes the situation for AMD. > > > > Neeraj? > > > > For Secure AVIC enabled guests, guest's vAPIC ISR state is not visible to > (and not controlled by) host or VMM. In this case, I think you should make ia32_disable() in sme_early_init() conditional on !Secure AVIC. -- Kiryl Shutsemau / Kirill A. Shutemov
On Fri, Sep 19, 2025 at 08:40:15AM -0500, Tom Lendacky wrote: > On 9/18/25 17:18, John Allen wrote: > > On Thu, Sep 18, 2025 at 09:42:21PM +0000, Edgecombe, Rick P wrote: > >> On Thu, 2025-09-18 at 16:23 -0500, John Allen wrote: > >>> The 32bit selftest still doesn't work properly with sev-es, but that was > >>> a problem with the previous version too. I suspect there's some > >>> incompatibility between sev-es and the test, but I haven't been able to > >>> get a good answer on why that might be. > >> > >> You are talking about test_32bit() in test_shadow_stack.c? > > > > Yes, that's right. > > > >> > >> That test relies on a specific CET arch behavior. If you try to transition to a > >> 32 bit compatibility mode segment with an SSP with high bits set (outside the 32 > >> bit address space), a #GP will be triggered by the HW. The test verifies that > >> this happens and the kernel handles it appropriately. Could it be platform/mode > >> difference and not KVM issue? > > > > I'm fairly certain that this is an issue with any sev-es guest. The > > unexpected seg fault happens when we isolate the sigaction32 call used > > in the test regardless of shadow stack support. So I wonder if it's > > something similar to the case that the test is checking for. Maybe > > something to do with the C bit. > > Likely something to do with the encryption bit since, if set, will > generate an invalid address in 32-bit, right? > > For SEV-ES, we transition to 64-bit very quickly because of the use of the > encryption bit, which is why, for example, we don't support SEV-ES / > SEV-SNP in the OvmfIa32X64.dsc package. Ok, I knew this sounded familiar. This came up in a discussion a while back. The reason this doesn't work is "int 0x80" is blocked in SEV/SEV-ES guests. See: b82a8dbd3d2f ("x86/coco: Disable 32-bit emulation by default on TDX and SEV") So I don't think this should be a blocker for this series, but it is something we'll want to address in the selftest. However, I'm not sure how we can check if we're running from an SEV or SEV-ES guest from userspace. Maybe we could attempt the int 0x80 and catch the seg fault in which case we assume that we're running under SEV or SEV-ES or some other situation where int 0x80 isn't supported? Seems hacky and like it could mask other failures. Thanks, John
© 2016 - 2025 Red Hat, Inc.