Immediately synchronize the user-return MSR values after a successful
VP.ENTER to minimize the window where KVM is tracking stale values in the
"curr" field, and so that the tracked value is synchronized before IRQs
are enabled.
This is *very* technically a bug fix, as a forced shutdown/reboot will
invoke kvm_shutdown() without waiting for tasks to be frozen, and so the
on_each_cpu() calls to kvm_disable_virtualization_cpu() will call
kvm_on_user_return() from IRQ context and thus could consume a stale
values->curr if the IRQ hits while KVM is active. That said, the real
motivation is to minimize the window where "curr" is stale, as the same
forced shutdown/reboot flaw has effectively existed for all of non-TDX
for years, as kvm_set_user_return_msr() runs with IRQs enabled. Not to
mention that a stale MSR is the least of the kernel's concerns if a reboot
is forced while KVM is active.
Fixes: e0b4f31a3c65 ("KVM: TDX: restore user ret MSRs")
Cc: Yan Zhao <yan.y.zhao@intel.com>
Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
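For reviewers: the IRQ-context consumer in question is the user-return
notifier, which restores host values based on the tracked "curr" field.
Roughly (a from-memory sketch, not a verbatim copy of x86.c; helper names
such as kvm_nr_uret_msrs and kvm_uret_msrs_list may not match exactly):

static void kvm_on_user_return(struct user_return_notifier *urn)
{
	struct kvm_user_return_msrs *msrs =
		container_of(urn, struct kvm_user_return_msrs, urn);
	struct kvm_user_return_msr_values *values;
	unsigned int slot;

	for (slot = 0; slot < kvm_nr_uret_msrs; ++slot) {
		values = &msrs->values[slot];
		/*
		 * If "curr" is stale, this check can wrongly conclude that
		 * hardware already holds the host value and skip the
		 * restore.  Shrinking that window is the point of this patch.
		 */
		if (values->host != values->curr) {
			wrmsrl(kvm_uret_msrs_list[slot], values->host);
			values->curr = values->host;
		}
	}
}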
arch/x86/kvm/vmx/tdx.c | 20 +++++++++++++-------
arch/x86/kvm/vmx/tdx.h | 2 +-
2 files changed, 14 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 326db9b9c567..2f3dfe9804b5 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -780,6 +780,14 @@ void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
vt->msr_host_kernel_gs_base = read_msr(MSR_KERNEL_GS_BASE);
vt->guest_state_loaded = true;
+
+ /*
+ * Several of KVM's user-return MSRs are clobbered by the TDX-Module if
+ * VP.ENTER succeeds, i.e. on TD-Exit. Mark those MSRs as needing an
+ * update to synchronize the "current" value in KVM's cache with the
+ * value in hardware (loaded by the TDX-Module).
+ */
+ to_tdx(vcpu)->need_user_return_msr_sync = true;
}
struct tdx_uret_msr {
@@ -807,7 +815,6 @@ static void tdx_user_return_msr_update_cache(void)
static void tdx_prepare_switch_to_host(struct kvm_vcpu *vcpu)
{
struct vcpu_vt *vt = to_vt(vcpu);
- struct vcpu_tdx *tdx = to_tdx(vcpu);
if (!vt->guest_state_loaded)
return;
@@ -815,11 +822,6 @@ static void tdx_prepare_switch_to_host(struct kvm_vcpu *vcpu)
++vcpu->stat.host_state_reload;
wrmsrl(MSR_KERNEL_GS_BASE, vt->msr_host_kernel_gs_base);
- if (tdx->guest_entered) {
- tdx_user_return_msr_update_cache();
- tdx->guest_entered = false;
- }
-
vt->guest_state_loaded = false;
}
@@ -1059,7 +1061,11 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
update_debugctlmsr(vcpu->arch.host_debugctl);
tdx_load_host_xsave_state(vcpu);
- tdx->guest_entered = true;
+
+ if (tdx->need_user_return_msr_sync) {
+ tdx_user_return_msr_update_cache();
+ tdx->need_user_return_msr_sync = false;
+ }
vcpu->arch.regs_avail &= TDX_REGS_AVAIL_SET;
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index ca39a9391db1..9434a6371d67 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -67,7 +67,7 @@ struct vcpu_tdx {
u64 vp_enter_ret;
enum vcpu_tdx_state state;
- bool guest_entered;
+ bool need_user_return_msr_sync;
u64 map_gpa_next;
u64 map_gpa_end;
--
2.51.0.858.gf9c4a03a3a-goog
+Adrian for TDX arch MSR clobbering details
On Thu, 2025-10-16 at 15:28 -0700, Sean Christopherson wrote:
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 326db9b9c567..2f3dfe9804b5 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -780,6 +780,14 @@ void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
> vt->msr_host_kernel_gs_base = read_msr(MSR_KERNEL_GS_BASE);
>
> vt->guest_state_loaded = true;
> +
> + /*
> + * Several of KVM's user-return MSRs are clobbered by the TDX-Module if
> + * VP.ENTER succeeds, i.e. on TD-Exit. Mark those MSRs as needing an
> + * update to synchronize the "current" value in KVM's cache with the
> + * value in hardware (loaded by the TDX-Module).
> + */
I think we should be synchronizing only after a successful VP.ENTER with a real
TD exit, but today instead we synchronize after any attempt to VP.ENTER. Or more
accurately, we plan to synchronize when returning to userspace in that case.
It looks to me like if we get some VP.ENTER errors, the registers should not
get clobbered (although I'd love a second assessment on this from other TDX
devs). Then we actually desync the registers with
tdx_user_return_msr_update_cache().
I mention it because I think this change widens the issue. For the
TDX_OPERAND_BUSY, etc. cases, the issue is mostly accidentally avoided by
re-entering the TD before returning to userspace and doing the sync.
> + to_tdx(vcpu)->need_user_return_msr_sync = true;
> }
>
> struct tdx_uret_msr {
> @@ -807,7 +815,6 @@ static void tdx_user_return_msr_update_cache(void)
> static void tdx_prepare_switch_to_host(struct kvm_vcpu *vcpu)
> {
> struct vcpu_vt *vt = to_vt(vcpu);
> - struct vcpu_tdx *tdx = to_tdx(vcpu);
>
> if (!vt->guest_state_loaded)
> return;
> @@ -815,11 +822,6 @@ static void tdx_prepare_switch_to_host(struct kvm_vcpu *vcpu)
> ++vcpu->stat.host_state_reload;
> wrmsrl(MSR_KERNEL_GS_BASE, vt->msr_host_kernel_gs_base);
>
> - if (tdx->guest_entered) {
> - tdx_user_return_msr_update_cache();
> - tdx->guest_entered = false;
> - }
> -
> vt->guest_state_loaded = false;
> }
>
> @@ -1059,7 +1061,11 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
> update_debugctlmsr(vcpu->arch.host_debugctl);
>
> tdx_load_host_xsave_state(vcpu);
> - tdx->guest_entered = true;
> +
> + if (tdx->need_user_return_msr_sync) {
Not sure what the purpose of need_user_return_msr_sync is now that this is
moved here. Before, I guess guest_entered was trying to determine whether
VP.ENTER got called, but now we know that is the case. So what condition is
it avoiding?
But otherwise, as above, we might want to do it depending on the VP.ENTER error
code. Maybe:
if (!(vp_enter_ret & TDX_ERROR))?
> + tdx_user_return_msr_update_cache();
> + tdx->need_user_return_msr_sync = false;
> + }
>
> vcpu->arch.regs_avail &= TDX_REGS_AVAIL_SET;
>
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index ca39a9391db1..9434a6371d67 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -67,7 +67,7 @@ struct vcpu_tdx {
> u64 vp_enter_ret;
>
> enum vcpu_tdx_state state;
> - bool guest_entered;
> + bool need_user_return_msr_sync;
>
> u64 map_gpa_next;
> u64 map_gpa_end;
On 21/10/2025 01:55, Edgecombe, Rick P wrote:
>> + * Several of KVM's user-return MSRs are clobbered by the TDX-Module if
>> + * VP.ENTER succeeds, i.e. on TD-Exit. Mark those MSRs as needing an
>> + * update to synchronize the "current" value in KVM's cache with the
>> + * value in hardware (loaded by the TDX-Module).
>> + */
>
> I think we should be synchronizing only after a successful VP.ENTER with a real
> TD exit, but today instead we synchronize after any attempt to VP.ENTER.

If the MSR's do not get clobbered, does it matter whether or not they get
restored.
On Tue, Oct 21, 2025, Adrian Hunter wrote:
> On 21/10/2025 01:55, Edgecombe, Rick P wrote:
> >> + * Several of KVM's user-return MSRs are clobbered by the TDX-Module if
> >> + * VP.ENTER succeeds, i.e. on TD-Exit. Mark those MSRs as needing an
> >> + * update to synchronize the "current" value in KVM's cache with the
> >> + * value in hardware (loaded by the TDX-Module).
> >> + */
> >
> > I think we should be synchronizing only after a successful VP.ENTER with a real
> > TD exit, but today instead we synchronize after any attempt to VP.ENTER.
Well this is all completely @#($*#. Looking at the TDX-Module source, if the
TDX-Module synthesizes an exit, e.g. because it suspects a zero-step attack, it
will signal a "normal" exit but not "restore" VMM state.
> If the MSR's do not get clobbered, does it matter whether or not they get
> restored.
It matters because KVM needs to know the actual value in hardware. If KVM thinks
an MSR is 'X', but it's actually 'Y', then KVM could fail to write the correct
value into hardware when returning to userspace and/or when running a different
vCPU.
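For anyone following along, the elision looks roughly like this (a trimmed,
from-memory sketch of kvm_set_user_return_msr(); helper names such as
kvm_uret_msrs_list and wrmsrl_safe may not match the exact tree):

int kvm_set_user_return_msr(unsigned slot, u64 value, u64 mask)
{
	struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs);

	value = (value & mask) | (msrs->values[slot].host & ~mask);

	/*
	 * The WRMSR is skipped when the tracked value already matches, so a
	 * stale "curr" means hardware silently keeps whatever it has.
	 */
	if (value == msrs->values[slot].curr)
		return 0;

	if (wrmsrl_safe(kvm_uret_msrs_list[slot], value))
		return 1;

	msrs->values[slot].curr = value;
	kvm_user_return_register_notifier(msrs);
	return 0;
}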
Taking a step back, the entire approach of updating the "cache" after the fact is
ridiculous. TDX entry/exit is anything but fast; avoiding _at most_ 4x WRMSRs at
the start of the run loop is a very, very premature optimization. Preemptively
load hardware with the value that the TDX-Module _might_ set and call it good.
I'll replace patches 1 and 4 with this, tagged for stable@.
---
arch/x86/include/asm/kvm_host.h | 1 -
arch/x86/kvm/vmx/tdx.c | 52 +++++++++++++++------------------
arch/x86/kvm/vmx/tdx.h | 1 -
arch/x86/kvm/x86.c | 9 ------
4 files changed, 23 insertions(+), 40 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 48598d017d6f..d158dfd1842e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2378,7 +2378,6 @@ int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low,
int kvm_add_user_return_msr(u32 msr);
int kvm_find_user_return_msr(u32 msr);
int kvm_set_user_return_msr(unsigned index, u64 val, u64 mask);
-void kvm_user_return_msr_update_cache(unsigned int index, u64 val);
u64 kvm_get_user_return_msr(unsigned int slot);
static inline bool kvm_is_supported_user_return_msr(u32 msr)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 326db9b9c567..63abfa251243 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -763,25 +763,6 @@ static bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
return tdx_vcpu_state_details_intr_pending(vcpu_state_details);
}
-/*
- * Compared to vmx_prepare_switch_to_guest(), there is not much to do
- * as SEAMCALL/SEAMRET calls take care of most of save and restore.
- */
-void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
-{
- struct vcpu_vt *vt = to_vt(vcpu);
-
- if (vt->guest_state_loaded)
- return;
-
- if (likely(is_64bit_mm(current->mm)))
- vt->msr_host_kernel_gs_base = current->thread.gsbase;
- else
- vt->msr_host_kernel_gs_base = read_msr(MSR_KERNEL_GS_BASE);
-
- vt->guest_state_loaded = true;
-}
-
struct tdx_uret_msr {
u32 msr;
unsigned int slot;
@@ -795,19 +776,38 @@ static struct tdx_uret_msr tdx_uret_msrs[] = {
{.msr = MSR_TSC_AUX,},
};
-static void tdx_user_return_msr_update_cache(void)
+void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
{
+ struct vcpu_vt *vt = to_vt(vcpu);
int i;
+ if (vt->guest_state_loaded)
+ return;
+
+ if (likely(is_64bit_mm(current->mm)))
+ vt->msr_host_kernel_gs_base = current->thread.gsbase;
+ else
+ vt->msr_host_kernel_gs_base = read_msr(MSR_KERNEL_GS_BASE);
+
+ vt->guest_state_loaded = true;
+
+ /*
+ * Explicitly set user-return MSRs that are clobbered by the TDX-Module
+ * if VP.ENTER succeeds, i.e. on TD-Exit, with the values that would be
+ * written by the TDX-Module. Don't rely on the TDX-Module to actually
+ * clobber the MSRs, as the contract is poorly defined and not upheld.
+ * E.g. the TDX-Module will synthesize an EPT Violation without doing
+ * VM-Enter if it suspects a zero-step attack, and never "restore" VMM
+ * state.
+ */
for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++)
- kvm_user_return_msr_update_cache(tdx_uret_msrs[i].slot,
- tdx_uret_msrs[i].defval);
+ kvm_set_user_return_msr(i, tdx_uret_msrs[i].slot,
+ tdx_uret_msrs[i].defval);
}
static void tdx_prepare_switch_to_host(struct kvm_vcpu *vcpu)
{
struct vcpu_vt *vt = to_vt(vcpu);
- struct vcpu_tdx *tdx = to_tdx(vcpu);
if (!vt->guest_state_loaded)
return;
@@ -815,11 +815,6 @@ static void tdx_prepare_switch_to_host(struct kvm_vcpu *vcpu)
++vcpu->stat.host_state_reload;
wrmsrl(MSR_KERNEL_GS_BASE, vt->msr_host_kernel_gs_base);
- if (tdx->guest_entered) {
- tdx_user_return_msr_update_cache();
- tdx->guest_entered = false;
- }
-
vt->guest_state_loaded = false;
}
@@ -1059,7 +1054,6 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
update_debugctlmsr(vcpu->arch.host_debugctl);
tdx_load_host_xsave_state(vcpu);
- tdx->guest_entered = true;
vcpu->arch.regs_avail &= TDX_REGS_AVAIL_SET;
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index ca39a9391db1..7f258870dc41 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -67,7 +67,6 @@ struct vcpu_tdx {
u64 vp_enter_ret;
enum vcpu_tdx_state state;
- bool guest_entered;
u64 map_gpa_next;
u64 map_gpa_end;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b4b5d2d09634..639589af7cbe 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -681,15 +681,6 @@ int kvm_set_user_return_msr(unsigned slot, u64 value, u64 mask)
}
EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_set_user_return_msr);
-void kvm_user_return_msr_update_cache(unsigned int slot, u64 value)
-{
- struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs);
-
- msrs->values[slot].curr = value;
- kvm_user_return_register_notifier(msrs);
-}
-EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_user_return_msr_update_cache);
-
u64 kvm_get_user_return_msr(unsigned int slot)
{
return this_cpu_ptr(user_return_msrs)->values[slot].curr;
base-commit: f222788458c8a7753d43befef2769cd282dc008e
--
On Tue, 2025-10-21 at 08:06 -0700, Sean Christopherson wrote:
> > > I think we should be synchronizing only after a successful VP.ENTER with a real
> > > TD exit, but today instead we synchronize after any attempt to VP.ENTER.
>
> Well this is all completely @#($*#. Looking at the TDX-Module source, if the
> TDX-Module synthesizes an exit, e.g. because it suspects a zero-step attack, it
> will signal a "normal" exit but not "restore" VMM state.
Oh yea, good point. So there is no way to tell from the return code if the
clobbering happened.
>
> > If the MSR's do not get clobbered, does it matter whether or not they get
> > restored.
>
> It matters because KVM needs to know the actual value in hardware. If KVM thinks
> an MSR is 'X', but it's actually 'Y', then KVM could fail to write the correct
> value into hardware when returning to userspace and/or when running a different
> vCPU.
>
> Taking a step back, the entire approach of updating the "cache" after the fact is
> ridiculous. TDX entry/exit is anything but fast; avoiding _at most_ 4x WRMSRs at
> the start of the run loop is a very, very premature optimization. Preemptively
> load hardware with the value that the TDX-Module _might_ set and call it good.
>
> I'll replace patches 1 and 4 with this, tagged for stable@.
Seems reasonable to me in concept, but there is a bug. It looks like some
important MSR isn't getting restored right and the host gets into a bad state.
The first signs start with triggering this:
asmlinkage __visible noinstr struct pt_regs *fixup_bad_iret(struct pt_regs
*bad_regs)
{
struct pt_regs tmp, *new_stack;
/*
* This is called from entry_64.S early in handling a fault
* caused by a bad iret to user mode. To handle the fault
* correctly, we want to move our stack frame to where it would
* be had we entered directly on the entry stack (rather than
* just below the IRET frame) and we want to pretend that the
* exception came from the IRET target.
*/
new_stack = (struct pt_regs *)__this_cpu_read(cpu_tss_rw.x86_tss.sp0) -
1;
/* Copy the IRET target to the temporary storage. */
__memcpy(&tmp.ip, (void *)bad_regs->sp, 5*8);
/* Copy the remainder of the stack from the current stack. */
__memcpy(&tmp, bad_regs, offsetof(struct pt_regs, ip));
/* Update the entry stack */
__memcpy(new_stack, &tmp, sizeof(tmp));
BUG_ON(!user_mode(new_stack)); <---------------HERE
Need to debug.
On Tue, Oct 21, 2025, Rick P Edgecombe wrote:
> On Tue, 2025-10-21 at 08:06 -0700, Sean Christopherson wrote:
> > > > I think we should be synchronizing only after a successful VP.ENTER with a real
> > > > TD exit, but today instead we synchronize after any attempt to VP.ENTER.
> >
> > Well this is all completely @#($*#. Looking at the TDX-Module source, if the
> > TDX-Module synthesizes an exit, e.g. because it suspects a zero-step attack, it
> > will signal a "normal" exit but not "restore" VMM state.
>
> Oh yea, good point. So there is no way to tell from the return code if the
> clobbering happened.
>
> >
> > > If the MSR's do not get clobbered, does it matter whether or not they get
> > > restored.
> >
> > It matters because KVM needs to know the actual value in hardware. If KVM thinks
> > an MSR is 'X', but it's actually 'Y', then KVM could fail to write the correct
> > value into hardware when returning to userspace and/or when running a different
> > vCPU.
> >
> > Taking a step back, the entire approach of updating the "cache" after the fact is
> > ridiculous. TDX entry/exit is anything but fast; avoiding _at most_ 4x WRMSRs at
> > the start of the run loop is a very, very premature optimization. Preemptively
> > load hardware with the value that the TDX-Module _might_ set and call it good.
> >
> > I'll replace patches 1 and 4 with this, tagged for stable@.
>
> Seems reasonable to me in concept, but there is a bug. It looks like some
> important MSR isn't getting restored right and the host gets into a bad state.
> The first signs start with triggering this:
>
> asmlinkage __visible noinstr struct pt_regs *fixup_bad_iret(struct pt_regs
> *bad_regs)
> {
> struct pt_regs tmp, *new_stack;
>
> /*
> * This is called from entry_64.S early in handling a fault
> * caused by a bad iret to user mode. To handle the fault
> * correctly, we want to move our stack frame to where it would
> * be had we entered directly on the entry stack (rather than
> * just below the IRET frame) and we want to pretend that the
> * exception came from the IRET target.
> */
> new_stack = (struct pt_regs *)__this_cpu_read(cpu_tss_rw.x86_tss.sp0) -
> 1;
>
> /* Copy the IRET target to the temporary storage. */
> __memcpy(&tmp.ip, (void *)bad_regs->sp, 5*8);
>
> /* Copy the remainder of the stack from the current stack. */
> __memcpy(&tmp, bad_regs, offsetof(struct pt_regs, ip));
>
> /* Update the entry stack */
> __memcpy(new_stack, &tmp, sizeof(tmp));
>
> BUG_ON(!user_mode(new_stack)); <---------------HERE
>
> Need to debug.
/facepalm
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 63abfa251243..cde91a995076 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -801,8 +801,8 @@ void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
* state.
*/
for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++)
- kvm_set_user_return_msr(i, tdx_uret_msrs[i].slot,
- tdx_uret_msrs[i].defval);
+ kvm_set_user_return_msr(tdx_uret_msrs[i].slot,
+ tdx_uret_msrs[i].defval, -1ull);
}
static void tdx_prepare_switch_to_host(struct kvm_vcpu *vcpu)
On 10/22/2025 3:33 AM, Sean Christopherson wrote:
> On Tue, Oct 21, 2025, Rick P Edgecombe wrote:
>> On Tue, 2025-10-21 at 08:06 -0700, Sean Christopherson wrote:
>>>>> I think we should be synchronizing only after a successful VP.ENTER with a real
>>>>> TD exit, but today instead we synchronize after any attempt to VP.ENTER.
>>>
>>> Well this is all completely @#($*#. Looking at the TDX-Module source, if the
>>> TDX-Module synthesizes an exit, e.g. because it suspects a zero-step attack, it
>>> will signal a "normal" exit but not "restore" VMM state.
>>
>> Oh yea, good point. So there is no way to tell from the return code if the
>> clobbering happened.
>>
>>>
>>>> If the MSR's do not get clobbered, does it matter whether or not they get
>>>> restored.
>>>
>>> It matters because KVM needs to know the actual value in hardware. If KVM thinks
>>> an MSR is 'X', but it's actually 'Y', then KVM could fail to write the correct
>>> value into hardware when returning to userspace and/or when running a different
>>> vCPU.
>>>
>>> Taking a step back, the entire approach of updating the "cache" after the fact is
>>> ridiculous. TDX entry/exit is anything but fast; avoiding _at most_ 4x WRMSRs at
>>> the start of the run loop is a very, very premature optimization. Preemptively
>>> load hardware with the value that the TDX-Module _might_ set and call it good.
>>>
>>> I'll replace patches 1 and 4 with this, tagged for stable@.
>>
>> Seems reasonable to me in concept, but there is a bug. It looks like some
>> important MSR isn't getting restored right and the host gets into a bad state.
>> The first signs start with triggering this:
>>
>> asmlinkage __visible noinstr struct pt_regs *fixup_bad_iret(struct pt_regs
>> *bad_regs)
>> {
>> struct pt_regs tmp, *new_stack;
>>
>> /*
>> * This is called from entry_64.S early in handling a fault
>> * caused by a bad iret to user mode. To handle the fault
>> * correctly, we want to move our stack frame to where it would
>> * be had we entered directly on the entry stack (rather than
>> * just below the IRET frame) and we want to pretend that the
>> * exception came from the IRET target.
>> */
>> new_stack = (struct pt_regs *)__this_cpu_read(cpu_tss_rw.x86_tss.sp0) -
>> 1;
>>
>> /* Copy the IRET target to the temporary storage. */
>> __memcpy(&tmp.ip, (void *)bad_regs->sp, 5*8);
>>
>> /* Copy the remainder of the stack from the current stack. */
>> __memcpy(&tmp, bad_regs, offsetof(struct pt_regs, ip));
>>
>> /* Update the entry stack */
>> __memcpy(new_stack, &tmp, sizeof(tmp));
>>
>> BUG_ON(!user_mode(new_stack)); <---------------HERE
>>
>> Need to debug.
>
> /facepalm
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 63abfa251243..cde91a995076 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -801,8 +801,8 @@ void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
> * state.
> */
> for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++)
> - kvm_set_user_return_msr(i, tdx_uret_msrs[i].slot,
> - tdx_uret_msrs[i].defval);
> + kvm_set_user_return_msr(tdx_uret_msrs[i].slot,
> + tdx_uret_msrs[i].defval, -1ull);
> }
>
> static void tdx_prepare_switch_to_host(struct kvm_vcpu *vcpu)
With the above fix, the whole diff/implementation works. It passes our
internal TDX CI.
On Tue, 2025-10-21 at 12:33 -0700, Sean Christopherson wrote:
> /facepalm
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 63abfa251243..cde91a995076 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -801,8 +801,8 @@ void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
> * state.
> */
> for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++)
> - kvm_set_user_return_msr(i, tdx_uret_msrs[i].slot,
> - tdx_uret_msrs[i].defval);
> + kvm_set_user_return_msr(tdx_uret_msrs[i].slot,
> + tdx_uret_msrs[i].defval, -1ull);
> }
>
> static void tdx_prepare_switch_to_host(struct kvm_vcpu *vcpu)

Ah ok, I'll give it another spin after I finish debugging step 0, which is
figure out what has gone wrong with my TDX dev machine.
On 21/10/2025 18:06, Sean Christopherson wrote:
> On Tue, Oct 21, 2025, Adrian Hunter wrote:
>> On 21/10/2025 01:55, Edgecombe, Rick P wrote:
>>>> + * Several of KVM's user-return MSRs are clobbered by the TDX-Module if
>>>> + * VP.ENTER succeeds, i.e. on TD-Exit. Mark those MSRs as needing an
>>>> + * update to synchronize the "current" value in KVM's cache with the
>>>> + * value in hardware (loaded by the TDX-Module).
>>>> + */
>>>
>>> I think we should be synchronizing only after a successful VP.ENTER with a real
>>> TD exit, but today instead we synchronize after any attempt to VP.ENTER.
>
> Well this is all completely @#($*#. Looking at the TDX-Module source, if the
> TDX-Module synthesizes an exit, e.g. because it suspects a zero-step attack, it
> will signal a "normal" exit but not "restore" VMM state.
>
>> If the MSR's do not get clobbered, does it matter whether or not they get
>> restored.
>
> It matters because KVM needs to know the actual value in hardware. If KVM thinks
> an MSR is 'X', but it's actually 'Y', then KVM could fail to write the correct
> value into hardware when returning to userspace and/or when running a different
> vCPU.

I don't quite follow: if an MSR does not get clobbered, where does the
incorrect value come from?
On Tue, Oct 21, 2025, Adrian Hunter wrote:
> On 21/10/2025 18:06, Sean Christopherson wrote:
> > On Tue, Oct 21, 2025, Adrian Hunter wrote:
> >> On 21/10/2025 01:55, Edgecombe, Rick P wrote:
> >>>> + * Several of KVM's user-return MSRs are clobbered by the TDX-Module if
> >>>> + * VP.ENTER succeeds, i.e. on TD-Exit. Mark those MSRs as needing an
> >>>> + * update to synchronize the "current" value in KVM's cache with the
> >>>> + * value in hardware (loaded by the TDX-Module).
> >>>> + */
> >>>
> >>> I think we should be synchronizing only after a successful VP.ENTER with a real
> >>> TD exit, but today instead we synchronize after any attempt to VP.ENTER.
> >
> > Well this is all completely @#($*#. Looking at the TDX-Module source, if the
> > TDX-Module synthesizes an exit, e.g. because it suspects a zero-step attack, it
> > will signal a "normal" exit but not "restore" VMM state.
> >
> >> If the MSR's do not get clobbered, does it matter whether or not they get
> >> restored.
> >
> > It matters because KVM needs to know the actual value in hardware. If KVM thinks
> > an MSR is 'X', but it's actually 'Y', then KVM could fail to write the correct
> > value into hardware when returning to userspace and/or when running a different
> > vCPU.
>
> I don't quite follow: if an MSR does not get clobbered, where does the
> incorrect value come from?

kvm_set_user_return_msr() elides the WRMSR if the current value in hardware
matches the new, desired value. If KVM thinks the MSR is 'X', and KVM wants to
set the MSR to 'X', then KVM will skip the WRMSR and continue on with the wrong
value.

Using MSR_TSC_AUX as an example, let's say the vCPU task is running on CPU1,
and that there's a non-TDX vCPU (with guest-side CPU=0) also scheduled on CPU1.

Before VP.ENTER, MSR_TSC_AUX=user_return_msrs[slot].curr=1 (the host's CPU1
value). After a *failed* VP.ENTER, MSR_TSC_AUX will still be '1', but its
"curr" value in user_return_msrs will be '0' due to
kvm_user_return_msr_update_cache() incorrectly thinking the TDX-Module
clobbered the MSR to '0'.

When KVM runs the non-TDX vCPU, which wants to run with MSR_TSC_AUX=0,
kvm_set_user_return_msr() will see msrs->values[slot].curr==value==0 and not
do the WRMSR. KVM will then run the non-TDX vCPU with MSR_TSC_AUX=1 and
corrupt the guest.
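To make that sequence concrete, here's a tiny standalone model of the
cache+elide pattern (all names are illustrative, none of this is KVM code):

#include <stdint.h>
#include <stdio.h>

static uint64_t hw_msr_tsc_aux;  /* stands in for the real MSR_TSC_AUX */
static uint64_t cached_curr;     /* stands in for user_return_msrs[slot].curr */

static void set_user_return_msr(uint64_t value)
{
	if (value == cached_curr)    /* the WRMSR elision described above */
		return;
	hw_msr_tsc_aux = value;      /* the "WRMSR" */
	cached_curr = value;
}

static void update_cache_only(uint64_t value)
{
	cached_curr = value;         /* hardware untouched, i.e. the old TDX path */
}

int main(void)
{
	/* Host value on CPU1 is loaded and the cache is in sync. */
	hw_msr_tsc_aux = cached_curr = 1;

	/* Failed VP.ENTER: the MSR is not clobbered, but the cache is
	 * "synced" to the TDX default of 0 anyway. */
	update_cache_only(0);

	/* Non-TDX vCPU wants MSR_TSC_AUX=0: the write is elided because the
	 * cache already says 0, so hardware keeps 1. */
	set_user_return_msr(0);

	printf("hardware MSR_TSC_AUX = %llu, guest expected 0\n",
	       (unsigned long long)hw_msr_tsc_aux);
	return 0;
}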