Software needs to talk with P-SEAMLDR via P-SEAMLDR SEAMCALLs. So, add a
wrapper for P-SEAMLDR SEAMCALLs.
Save and restore the current VMCS using VMPTRST and VMPTRLD instructions
to avoid breaking KVM. This is needed because P-SEAMLDR SEAMCALLs
invalidate the current VMCS, as documented in Intel® Trust Domain CPU
Architectural Extensions (May 2021 edition) Chapter 2.3 [1]:
SEAMRET from the P-SEAMLDR clears the current VMCS structure pointed
to by the current-VMCS pointer. A VMM that invokes the P-SEAMLDR using
SEAMCALL must reload the current-VMCS, if required, using the VMPTRLD
instruction.
Disable interrupts to prevent KVM code from interfering with P-SEAMLDR
SEAMCALLs. For example, if a vCPU is scheduled before the current VMCS is
restored, it may encounter an invalid current VMCS, causing its VMX
instruction to fail. Additionally, if KVM sends IPIs to invalidate a
current VMCS and the invalidation occurs right after the current VMCS is
saved, that VMCS will be reloaded after P-SEAMLDR SEAMCALLs, leading to
unexpected behavior.
NMIs are not a problem, as the only scenario where instructions relying on
the current-VMCS are used is during guest PMI handling in KVM. This occurs
immediately after VM exits with IRQ and NMI disabled, ensuring no
interference with P-SEAMLDR SEAMCALLs.
Signed-off-by: Chao Gao <chao.gao@intel.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
Link: https://cdrdv2.intel.com/v1/dl/getContent/733582 # [1]
---
v2:
- don't create a new, inferior framework to save/restore VMCS
- use human-friendly language, just "current VMCS" rather than
SDM term "current-VMCS pointer"
- don't mix guard() with goto
---
arch/x86/virt/vmx/tdx/Makefile | 1 +
arch/x86/virt/vmx/tdx/seamldr.c | 56 ++++++++++++++++++++++++++++++
drivers/virt/coco/tdx-host/Kconfig | 10 ++++++
3 files changed, 67 insertions(+)
create mode 100644 arch/x86/virt/vmx/tdx/seamldr.c
diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
index 90da47eb85ee..26aea3531c36 100644
--- a/arch/x86/virt/vmx/tdx/Makefile
+++ b/arch/x86/virt/vmx/tdx/Makefile
@@ -1,2 +1,3 @@
# SPDX-License-Identifier: GPL-2.0-only
obj-y += seamcall.o tdx.o
+obj-$(CONFIG_INTEL_TDX_MODULE_UPDATE) += seamldr.o
diff --git a/arch/x86/virt/vmx/tdx/seamldr.c b/arch/x86/virt/vmx/tdx/seamldr.c
new file mode 100644
index 000000000000..b99d73f7bb08
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/seamldr.c
@@ -0,0 +1,56 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright(c) 2025 Intel Corporation.
+ *
+ * Intel TDX module runtime update
+ */
+#define pr_fmt(fmt) "seamldr: " fmt
+
+#include <linux/irqflags.h>
+#include <linux/types.h>
+
+#include "seamcall.h"
+
+static __maybe_unused int seamldr_call(u64 fn, struct tdx_module_args *args)
+{
+ unsigned long flags;
+ u64 vmcs;
+ int ret;
+
+ if (!is_seamldr_call(fn))
+ return -EINVAL;
+
+ /*
+ * SEAMRET from P-SEAMLDR invalidates the current VMCS. Save/restore
+ * the VMCS across P-SEAMLDR SEAMCALLs to avoid clobbering KVM state.
+ * Disable interrupts as KVM is allowed to do VMREAD/VMWRITE in IRQ
+ * context (but not NMI context).
+ */
+ local_irq_save(flags);
+
+ asm goto("1: vmptrst %0\n\t"
+ _ASM_EXTABLE(1b, %l[error])
+ : "=m" (vmcs) : : "cc" : error);
+
+ ret = seamldr_prerr(fn, args);
+
+ /*
+ * Restore the current VMCS pointer. VMPTSTR "returns" all ones if the
+ * current VMCS is invalid.
+ */
+ if (vmcs != -1ULL) {
+ asm goto("1: vmptrld %0\n\t"
+ "jna %l[error]\n\t"
+ _ASM_EXTABLE(1b, %l[error])
+ : : "m" (vmcs) : "cc" : error);
+ }
+
+ local_irq_restore(flags);
+ return ret;
+
+error:
+ local_irq_restore(flags);
+
+ WARN_ONCE(1, "Failed to save/restore the current VMCS");
+ return -EIO;
+}
diff --git a/drivers/virt/coco/tdx-host/Kconfig b/drivers/virt/coco/tdx-host/Kconfig
index e58bad148a35..6a9199e6c2c6 100644
--- a/drivers/virt/coco/tdx-host/Kconfig
+++ b/drivers/virt/coco/tdx-host/Kconfig
@@ -8,3 +8,13 @@ config TDX_HOST_SERVICES
Say y or m if enabling support for confidential virtual machine
support (CONFIG_INTEL_TDX_HOST). The module is called tdx_host.ko
+
+config INTEL_TDX_MODULE_UPDATE
+ bool "Intel TDX module runtime update"
+ depends on TDX_HOST_SERVICES
+ help
+ This enables the kernel to support TDX module runtime update. This
+ allows the admin to update the TDX module to another compatible
+ version without the need to terminate running TDX guests.
+
+ If unsure, say N.
--
2.47.3
On 1/23/26 06:55, Chao Gao wrote:
...
> +static __maybe_unused int seamldr_call(u64 fn, struct tdx_module_args *args)
> +{
> + unsigned long flags;
> + u64 vmcs;
> + int ret;
> +
> + if (!is_seamldr_call(fn))
> + return -EINVAL;
Why is this here? We shouldn't be silently papering over kernel bugs.
This is a WARN_ON() at *best*, but it also begs the question of how a
non-SEAMLDR call even got here.
> + /*
> + * SEAMRET from P-SEAMLDR invalidates the current VMCS. Save/restore
> + * the VMCS across P-SEAMLDR SEAMCALLs to avoid clobbering KVM state.
> + * Disable interrupts as KVM is allowed to do VMREAD/VMWRITE in IRQ
> + * context (but not NMI context).
> + */
I think you mean:
WARN_ON(in_nmi());
> + local_irq_save(flags);
> +
> + asm goto("1: vmptrst %0\n\t"
> + _ASM_EXTABLE(1b, %l[error])
> + : "=m" (vmcs) : : "cc" : error);
I'd much rather this be wrapped up in a helper function. We shouldn't
have to look at the horrors of inline assembly like this.
But this *REALLY* wants the KVM folks to look at it. One argument is
that with the inline assembly this is nice and self-contained. The other
argument is that this completely ignores all existing KVM infrastructure
and is parallel VMCS management.
I'd be shocked if this is the one and only place in the whole kernel
that can unceremoniously zap VMX state.
I'd *bet* that you don't really need to do the vmptrld and that KVM can
figure it out because it can vmptrld on demand anyway. Something along
the lines of:
local_irq_disable();
list_for_each(handwaving...)
vmcs_clear();
ret = seamldr_prerr(fn, args);
local_irq_enable();
Basically, zap this CPU's vmcs state and then make KVM reload it at some
later time.
I'm sure Sean and Paolo will tell me if I'm crazy.
> diff --git a/drivers/virt/coco/tdx-host/Kconfig b/drivers/virt/coco/tdx-host/Kconfig
> index e58bad148a35..6a9199e6c2c6 100644
> --- a/drivers/virt/coco/tdx-host/Kconfig
> +++ b/drivers/virt/coco/tdx-host/Kconfig
> @@ -8,3 +8,13 @@ config TDX_HOST_SERVICES
>
> Say y or m if enabling support for confidential virtual machine
> support (CONFIG_INTEL_TDX_HOST). The module is called tdx_host.ko
> +
> +config INTEL_TDX_MODULE_UPDATE
> + bool "Intel TDX module runtime update"
> + depends on TDX_HOST_SERVICES
> + help
> + This enables the kernel to support TDX module runtime update. This
> + allows the admin to update the TDX module to another compatible
> + version without the need to terminate running TDX guests.
... as opposed to the method that the kernel has to update the module
without terminating guests? ;)
> + If unsure, say N.
Let's call this:
config
INTEL_TDX_ONLY_DISABLE_THIS_IF_YOU_HATE_SECURITY_AND_IF_YOU_DO_WHY_ARE_YOU_RUNNING_TDX?
Can we have question marks in config symbol names? ;)
But, seriously, what the heck? Who would disable security updates for
their confidential computing infrastructure? Is this some kind of
intelligence test for our users so that if someone disables it we can
just laugh at them?
On Wed, Jan 28, 2026 at 03:36:49PM -0800, Dave Hansen wrote:
>On 1/23/26 06:55, Chao Gao wrote:
>...
>> +static __maybe_unused int seamldr_call(u64 fn, struct tdx_module_args *args)
>> +{
>> + unsigned long flags;
>> + u64 vmcs;
>> + int ret;
>> +
>> + if (!is_seamldr_call(fn))
>> + return -EINVAL;
>
>Why is this here? We shouldn't be silently papering over kernel bugs.
>This is a WARN_ON() at *best*, but it also begs the question of how a
>non-SEAMLDR call even got here.
Only SEAMLDR calls can get here. I will make it a WARN_ON_ONCE().
>
>> + /*
>> + * SEAMRET from P-SEAMLDR invalidates the current VMCS. Save/restore
>> + * the VMCS across P-SEAMLDR SEAMCALLs to avoid clobbering KVM state.
>> + * Disable interrupts as KVM is allowed to do VMREAD/VMWRITE in IRQ
>> + * context (but not NMI context).
>> + */
>
>I think you mean:
>
> WARN_ON(in_nmi());
This function only disables interrupts, not NMIs. Kirill questioned whether any
KVM operations might execute from NMI context and do VMREAD/VMWRITE. If such
operations exist and an NMI interrupts seamldr_call(), they could encounter
an invalid current VMCS.
The problematic scenario is:
seamldr_call() KVM code in NMI handler
1. vmptrst // save current-vmcs
2. seamcall // clobber current-vmcs
3. // NMI handler start
call into some KVM code and do vmread/vmwrite
// consume __invalid__ current-vmcs
// NMI handler end
4. vmptrld // restore current-vmcs
The comment clarifies that KVM doesn't do VMREAD/VMWRITE during NMI handling.
>
>> + local_irq_save(flags);
>> +
>> + asm goto("1: vmptrst %0\n\t"
>> + _ASM_EXTABLE(1b, %l[error])
>> + : "=m" (vmcs) : : "cc" : error);
>
>I'd much rather this be wrapped up in a helper function. We shouldn't
>have to look at the horrors of inline assembly like this.
>
>But this *REALLY* wants the KVM folks to look at it. One argument is
>that with the inline assembly this is nice and self-contained. The other
>argument is that this completely ignores all existing KVM infrastructure
>and is parallel VMCS management.
Exactly. Sean suggested this approach [*]. He prefers inline assembly rather than
adding new, inferior wrappers
*: https://lore.kernel.org/linux-coco/aHEYtGgA3aIQ7A3y@google.com/
>
>I'd be shocked if this is the one and only place in the whole kernel
>that can unceremoniously zap VMX state.
>
>I'd *bet* that you don't really need to do the vmptrld and that KVM can
>figure it out because it can vmptrld on demand anyway. Something along
>the lines of:
>
> local_irq_disable();
> list_for_each(handwaving...)
> vmcs_clear();
> ret = seamldr_prerr(fn, args);
> local_irq_enable();
>
>Basically, zap this CPU's vmcs state and then make KVM reload it at some
>later time.
The idea is feasible. But just calling vmcs_clear() won't work. We need to
reset all the tracking state associated with each VMCS. We should call
vmclear_local_loaded_vmcss() instead, similar to what's done before VMXOFF.
>
>I'm sure Sean and Paolo will tell me if I'm crazy.
To me, this approach needs more work since we need to either move
vmclear_local_loaded_vmcss() to the kernel or allow KVM to register a callback.
I don't think it's as straightforward as just doing the save/restore.
>
>> diff --git a/drivers/virt/coco/tdx-host/Kconfig b/drivers/virt/coco/tdx-host/Kconfig
>> index e58bad148a35..6a9199e6c2c6 100644
>> --- a/drivers/virt/coco/tdx-host/Kconfig
>> +++ b/drivers/virt/coco/tdx-host/Kconfig
>> @@ -8,3 +8,13 @@ config TDX_HOST_SERVICES
>>
>> Say y or m if enabling support for confidential virtual machine
>> support (CONFIG_INTEL_TDX_HOST). The module is called tdx_host.ko
>> +
>> +config INTEL_TDX_MODULE_UPDATE
>> + bool "Intel TDX module runtime update"
>> + depends on TDX_HOST_SERVICES
>> + help
>> + This enables the kernel to support TDX module runtime update. This
>> + allows the admin to update the TDX module to another compatible
>> + version without the need to terminate running TDX guests.
>
>... as opposed to the method that the kernel has to update the module
>without terminating guests? ;)
I will reduce this to:
This enables the kernel to update the TDX Module to another compatible
version.
>
>> + If unsure, say N.
>
>Let's call this:
>
> config
>INTEL_TDX_ONLY_DISABLE_THIS_IF_YOU_HATE_SECURITY_AND_IF_YOU_DO_WHY_ARE_YOU_RUNNING_TDX?
>
>Can we have question marks in config symbol names? ;)
>
>But, seriously, what the heck? Who would disable security updates for
>their confidential computing infrastructure? Is this some kind of
>intelligence test for our users so that if someone disables it we can
>just laugh at them?
Looks like I failed that test! ;) I'll change it to default to 'y' and
recommend 'Y' if unsure.
On 1/30/26 05:21, Chao Gao wrote:
...
>>> + /*
>>> + * SEAMRET from P-SEAMLDR invalidates the current VMCS. Save/restore
>>> + * the VMCS across P-SEAMLDR SEAMCALLs to avoid clobbering KVM state.
>>> + * Disable interrupts as KVM is allowed to do VMREAD/VMWRITE in IRQ
>>> + * context (but not NMI context).
>>> + */
>>
>> I think you mean:
>>
>> WARN_ON(in_nmi());
>
> This function only disables interrupts, not NMIs. Kirill questioned whether any
> KVM operations might execute from NMI context and do VMREAD/VMWRITE. If such
> operations exist and an NMI interrupts seamldr_call(), they could encounter
> an invalid current VMCS.
>
> The problematic scenario is:
>
> seamldr_call() KVM code in NMI handler
>
> 1. vmptrst // save current-vmcs
> 2. seamcall // clobber current-vmcs
> 3. // NMI handler start
> call into some KVM code and do vmread/vmwrite
> // consume __invalid__ current-vmcs
> // NMI handler end
> 4. vmptrld // restore current-vmcs
>
> The comment clarifies that KVM doesn't do VMREAD/VMWRITE during NMI handling.
How about something like:
P-SEAMLDR calls invalidate the current VMCS. It must be saved
and restored around the call. Exclude KVM access to the VMCS
by disabling interrupts. This is not safe against VMCS use in
NMIs, but there are none of those today.
Ideally, you'd also pair that with _some_ checks in the KVM code that
use lockdep or warnings to reiterate that NMI access to the VMCS is not OK.
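Not an actual patch, just to illustrate the sort of check I have in mind
(the helper name and its placement in KVM's common VMCS accessor path are
hypothetical):

	static __always_inline void vmcs_assert_not_nmi(void)
	{
		/* seamldr_call() only disables IRQs; VMCS use from NMI context is not allowed. */
		WARN_ON_ONCE(in_nmi());
	}

Calling something like that from the vmcs_read*()/vmcs_write*() helpers would
document the rule and catch violations at runtime.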
>>> + local_irq_save(flags);
>>> +
>>> + asm goto("1: vmptrst %0\n\t"
>>> + _ASM_EXTABLE(1b, %l[error])
>>> + : "=m" (vmcs) : : "cc" : error);
>>
>> I'd much rather this be wrapped up in a helper function. We shouldn't
>> have to look at the horrors of inline assembly like this.
>>
>> But this *REALLY* wants the KVM folks to look at it. One argument is
>> that with the inline assembly this is nice and self-contained. The other
>> argument is that this completely ignores all existing KVM infrastructure
>> and is parallel VMCS management.
>
> Exactly. Sean suggested this approach [*]. He prefers inline assembly rather than
> adding new, inferior wrappers
>
> *: https://lore.kernel.org/linux-coco/aHEYtGgA3aIQ7A3y@google.com/
Get his explicit reviews on the patch, please.
Also, I 100% object to inline assembly in the main flow. Please at least
make a wrapper for these and stick them in:
arch/x86/include/asm/special_insns.h
so the inline assembly spew is hidden from view.
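For illustration, a rough sketch of what such wrappers could look like, simply
lifting the asm goto sequences from this patch (vmptrst_safe()/vmptrld_safe()
are made-up names, not existing helpers):

	static inline int vmptrst_safe(u64 *vmcs_pa)
	{
		/* Store the physical address of the current VMCS (-1 if there is none). */
		asm goto("1: vmptrst %0\n\t"
			 _ASM_EXTABLE(1b, %l[fault])
			 : "=m" (*vmcs_pa) : : "cc" : fault);
		return 0;
	fault:
		return -EIO;
	}

	static inline int vmptrld_safe(const u64 *vmcs_pa)
	{
		/* Make the VMCS at *vmcs_pa current; fail on VM-fail or a fault. */
		asm goto("1: vmptrld %0\n\t"
			 "jna %l[fault]\n\t"
			 _ASM_EXTABLE(1b, %l[fault])
			 : : "m" (*vmcs_pa) : "cc" : fault);
		return 0;
	fault:
		return -EIO;
	}

Then seamldr_call() just checks the return values instead of open-coding the
asm and the error labels.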
>> I'd be shocked if this is the one and only place in the whole kernel
>> that can unceremoniously zap VMX state.
>>
>> I'd *bet* that you don't really need to do the vmptrld and that KVM can
>> figure it out because it can vmptrld on demand anyway. Something along
>> the lines of:
>>
>> local_irq_disable();
>> list_for_each(handwaving...)
>> vmcs_clear();
>> ret = seamldr_prerr(fn, args);
>> local_irq_enable();
>>
>> Basically, zap this CPU's vmcs state and then make KVM reload it at some
>> later time.
>
> The idea is feasible. But just calling vmcs_clear() won't work. We need to
> reset all the tracking state associated with each VMCS. We should call
> vmclear_local_loaded_vmcss() instead, similar to what's done before VMXOFF.
>
>>
>> I'm sure Sean and Paolo will tell me if I'm crazy.
>
> To me, this approach needs more work since we need to either move
> vmclear_local_loaded_vmcss() to the kernel or allow KVM to register a callback.
>
> I don't think it's as straightforward as just doing the save/restore.
Could you please just do me a favor and spend 20 minutes to see what
this looks like in practice and if the KVM folks hate it?
>>> diff --git a/drivers/virt/coco/tdx-host/Kconfig b/drivers/virt/coco/tdx-host/Kconfig
>>> index e58bad148a35..6a9199e6c2c6 100644
>>> --- a/drivers/virt/coco/tdx-host/Kconfig
>>> +++ b/drivers/virt/coco/tdx-host/Kconfig
>>> @@ -8,3 +8,13 @@ config TDX_HOST_SERVICES
>>>
>>> Say y or m if enabling support for confidential virtual machine
>>> support (CONFIG_INTEL_TDX_HOST). The module is called tdx_host.ko
>>> +
>>> +config INTEL_TDX_MODULE_UPDATE
>>> + bool "Intel TDX module runtime update"
>>> + depends on TDX_HOST_SERVICES
>>> + help
>>> + This enables the kernel to support TDX module runtime update. This
>>> + allows the admin to update the TDX module to another compatible
>>> + version without the need to terminate running TDX guests.
>>
>> ... as opposed to the method that the kernel has to update the module
>> without terminating guests? ;)
>
> I will reduce this to:
>
> This enables the kernel to update the TDX Module to another compatible
> version.
I guess I'll be explicit: Remove this Kconfig prompt.
I think you should remove INTEL_TDX_MODULE_UPDATE entirely. But I'll
settle for:
config INTEL_TDX_MODULE_UPDATE
bool
default TDX_HOST_SERVICES
so that users don't have to see it. Don't bother users with it. Period.
>>> I'd be shocked if this is the one and only place in the whole kernel
>>> that can unceremoniously zap VMX state.
>>>
>>> I'd *bet* that you don't really need to do the vmptrld and that KVM can
>>> figure it out because it can vmptrld on demand anyway. Something along
>>> the lines of:
>>>
>>> local_irq_disable();
>>> list_for_each(handwaving...)
>>> vmcs_clear();
>>> ret = seamldr_prerr(fn, args);
>>> local_irq_enable();
>>>
>>> Basically, zap this CPU's vmcs state and then make KVM reload it at some
>>> later time.
>>
>> The idea is feasible. But just calling vmcs_clear() won't work. We need to
>> reset all the tracking state associated with each VMCS. We should call
>> vmclear_local_loaded_vmcss() instead, similar to what's done before VMXOFF.
>>
>>>
>>> I'm sure Sean and Paolo will tell me if I'm crazy.
>>
>> To me, this approach needs more work since we need to either move
>> vmclear_local_loaded_vmcss() to the kernel or allow KVM to register a callback.
>>
>> I don't think it's as straightforward as just doing the save/restore.
>
>Could you please just do me a favor and spend 20 minutes to see what
>this looks like in practice and if the KVM folks hate it?
Sure. KVM tracks the current VMCS and only executes vmptrld for a new VMCS if
it differs from the current one. See arch/x86/kvm/vmx/vmx.c::vmx_vcpu_load_vmcs()
prev = per_cpu(current_vmcs, cpu);
if (prev != vmx->loaded_vmcs->vmcs) {
per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;
vmcs_load(vmx->loaded_vmcs->vmcs);
}
By resetting current_vmcs to NULL during P-SEAMLDR calls, KVM is forced to do a
vmptrld on the next VMCS load. So, we can implement seamldr_call() as:
static int seamldr_call(u64 fn, struct tdx_module_args *args)
{
int ret;
WARN_ON_ONCE(!is_seamldr_call(fn));
/*
* Serialize P-SEAMLDR calls since only a single CPU is allowed to
* interact with P-SEAMLDR at a time.
*
* P-SEAMLDR calls invalidate the current VMCS. Exclude KVM access to
* the VMCS by disabling interrupts. This is not safe against VMCS use
* in NMIs, but there are none of those today.
*
* Set the per-CPU current_vmcs cache to NULL to force KVM to reload
* the VMCS.
*/
guard(raw_spinlock_irqsave)(&seamldr_lock);
ret = seamcall_prerr(fn, args);
this_cpu_write(current_vmcs, NULL);
return ret;
}
This requires moving the per-CPU current_vmcs from KVM to the kernel, which
should be trivial with Sean's VMXON series.
And I tested this. Without this_cpu_write(), vmread/vmwrite errors occur after
TDX Module updates. But with it, no errors.
On Tue, Feb 03, 2026, Chao Gao wrote:
> >>> I'd be shocked if this is the one and only place in the whole kernel
> >>> that can unceremoniously zap VMX state.
> >>>
> >>> I'd *bet* that you don't really need to do the vmptrld and that KVM can
> >>> figure it out because it can vmptrld on demand anyway. Something along
> >>> the lines of:
> >>>
> >>> local_irq_disable();
> >>> list_for_each(handwaving...)
> >>> vmcs_clear();
> >>> ret = seamldr_prerr(fn, args);
> >>> local_irq_enable();
> >>>
> >>> Basically, zap this CPU's vmcs state and then make KVM reload it at some
> >>> later time.
> >>
> >> The idea is feasible. But just calling vmcs_clear() won't work. We need to
> >> reset all the tracking state associated with each VMCS. We should call
> >> vmclear_local_loaded_vmcss() instead, similar to what's done before VMXOFF.
> >>
> >>>
> >>> I'm sure Sean and Paolo will tell me if I'm crazy.
> >>
> >> To me, this approach needs more work since we need to either move
> >> vmclear_local_loaded_vmcss() to the kernel or allow KVM to register a callback.
> >>
> >> I don't think it's as straightforward as just doing the save/restore.
> >
> >Could you please just do me a favor and spend 20 minutes to see what
> >this looks like in practice and if the KVM folks hate it?
I hate it :-)
> Sure. KVM tracks the current VMCS and only executes vmptrld for a new VMCS if
> it differs from the current one. See arch/x86/kvm/vmx/vmx.c::vmx_vcpu_load_vmcs()
>
> prev = per_cpu(current_vmcs, cpu);
> if (prev != vmx->loaded_vmcs->vmcs) {
> per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;
> vmcs_load(vmx->loaded_vmcs->vmcs);
> }
>
> By resetting current_vmcs to NULL during P-SEAMLDR calls, KVM is forced to do a
> vmptrld on the next VMCS load. So, we can implement seamldr_call() as:
>
> static int seamldr_call(u64 fn, struct tdx_module_args *args)
> {
> int ret;
>
> WARN_ON_ONCE(!is_seamldr_call(fn));
>
> /*
> * Serialize P-SEAMLDR calls since only a single CPU is allowed to
> * interact with P-SEAMLDR at a time.
> *
> * P-SEAMLDR calls invalidate the current VMCS. Exclude KVM access to
> * the VMCS by disabling interrupts. This is not safe against VMCS use
> * in NMIs, but there are none of those today.
> *
> * Set the per-CPU current_vmcs cache to NULL to force KVM to reload
> * the VMCS.
> */
> guard(raw_spinlock_irqsave)(&seamldr_lock);
> ret = seamcall_prerr(fn, args);
> this_cpu_write(current_vmcs, NULL);
>
> return ret;
> }
>
> This requires moving the per-CPU current_vmcs from KVM to the kernel, which
> should be trivial with Sean's VMXON series.
Trivial in code, but I am very strongly opposed to moving current_vmcs out of KVM.
As stated in the cover letter of the initial VMXON RFC[*]:
: Emphasis on "only", because leaving VMCS tracking and clearing in KVM is
: another key difference from Xin's series. The "light bulb" moment on that
: front is that TDX isn't a hypervisor, and isn't trying to be a hypervisor.
: Specifically, TDX should _never_ have it's own VMCSes (that are visible to the
: host; the TDX-Module has it's own VMCSes to do SEAMCALL/SEAMRET), and so there
: is simply no reason to move that functionality out of KVM.
TDX's "use" of a VMCS should be completely transparent to KVM, because otherwise
we are stepping over that line that says the TDX subsystem isn't a hypervisor.
I also really, really don't want to add a super special case rule to KVM's VMCS
tracking logic.
After reading through the rest of this discussion, I'm doubling down on that
stance, because I agree that this is decidedly odd behavior.
Pulling in two other threads from this discussion:
On Wed, Jan 28, 2026 at 3:05 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 1/23/26 06:55, Chao Gao wrote:
> > SEAMRET from the P-SEAMLDR clears the current VMCS structure pointed
> > to by the current-VMCS pointer. A VMM that invokes the P-SEAMLDR
> > using SEAMCALL must reload the current-VMCS, if required, using the
> > VMPTRLD instruction.
>
> That seems pretty mean.
>
> This is going to need a lot more justification for why this is an
> absolutely necessary requirement.
>
> KVM folks, are you OK with this?
As above, I'm definitely not ok with the current VMCS being zapped out from
underneath KVM. As to whether or not I'm ok with the P-SEAMLDR behavior, I would
say that's more of a question for you, as it will fall on the TDX subsystem to
work around the bug/quirk.
On Fri, Jan 30, 2026 at 8:23 AM Dave Hansen <dave.hansen@intel.com> wrote:
> On 1/30/26 00:08, Chao Gao wrote:
> > AFAIK, this is a CPU implementation issue. The actual requirement is to
> > evict (flush and invalidate) all VMCSs __cached in SEAM mode__, but big
> > cores implement this by evicting the __entire__ VMCS cache. So, the
> > current VMCS is invalidated and cleared.
>
> But why is this a P-SEAMLDR thing and not a TDX module thing?
My guess is that it's because the P-SEAMLDR code loads and prepares the new TDX-
Module by constructing the VMCS used for SEAMCALL using direct writes to memory
(unless that TDX behavior has changed in the last few years). And so it needs
to ensure that in-memory representation is synchronized with the VMCS cache.
Hmm, but that doesn't make sense _if_ it really truly is SEAMRET that does the VMCS
cache invalidation, because flushing the VMCS cache would overwrite the in-memory
state.
> It seems like a bug, or at least a P-SEAMLDR implementation issue that
> needs to get fixed.
Yeah, 'tis odd behavior. IMO, that's all the more reason the TDX subsystem should
hide the quirk from the rest of the kernel.
[*] https://lore.kernel.org/all/20251010220403.987927-1-seanjc@google.com
>On Fri, Jan 30, 2026 at 8:23 AM Dave Hansen <dave.hansen@intel.com> wrote:
>> On 1/30/26 00:08, Chao Gao wrote:
>> > AFAIK, this is a CPU implementation issue. The actual requirement is to
>> > evict (flush and invalidate) all VMCSs __cached in SEAM mode__, but big
>> > cores implement this by evicting the __entire__ VMCS cache. So, the
>> > current VMCS is invalidated and cleared.
>>
>> But why is this a P-SEAMLDR thing and not a TDX module thing?
>
>My guess is that it's because the P-SEAMLDR code loads and prepares the new TDX-
>Module by constructing the VMCS used for SEAMCALL using direct writes to memory
>(unless that TDX behavior has changed in the last few years). And so it needs
>to ensure that in-memory representation is synchronized with the VMCS cache.
>
>Hmm, but that doesn't make sense _if_ it really truly is SEAMRET that does the VMCS
>cache invalidation, because flushing the VMCS cache would overwrite the in-memory
>state.

My understanding is:

1. SEAMCALL/SEAMRET use VMCSs.

2. P-SEAMLDR is single-threaded (likely for simplicity). So, it uses a _single_
   global VMCS and only one CPU can call P-SEAMLDR calls at a time.

3. After SEAMRET from P-SEAMLDR, _if_ the global VMCS isn't flushed, other CPUs
   cannot enter P-SEAMLDR because the global VMCS would be corrupted. (note the
   global VMCS is cached by the original CPU).

4. To make P-SEAMLDR callable on all CPUs, the SEAMRET instruction flushes VMCSs.
   The flush cannot be performed by the host VMM since the global VMCS is not
   visible to it. P-SEAMLDR cannot do it either because SEAMRET is its final
   instruction and requires a valid VMCS.

The TDX Module has per-CPU VMCSs, so it doesn't have this problem.

I'll check if SEAM ISA architects can join to explain this in more detail.

>
>> It seems like a bug, or at least a P-SEAMLDR implementation issue that
>> needs to get fixed.
>
>Yeah, 'tis odd behavior. IMO, that's all the more reason the TDX subsystem should
>hide the quirk from the rest of the kernel.
>
>[*] https://lore.kernel.org/all/20251010220403.987927-1-seanjc@google.com
On Wed, Feb 04, 2026, Chao Gao wrote:
> >On Fri, Jan 30, 2026 at 8:23 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >> On 1/30/26 00:08, Chao Gao wrote:
> >> > AFAIK, this is a CPU implementation issue. The actual requirement is to
> >> > evict (flush and invalidate) all VMCSs __cached in SEAM mode__, but big
> >> > cores implement this by evicting the __entire__ VMCS cache. So, the
> >> > current VMCS is invalidated and cleared.
> >>
> >> But why is this a P-SEAMLDR thing and not a TDX module thing?
> >
> >My guess is that it's because the P-SEAMLDR code loads and prepares the new TDX-
> >Module by constructing the VMCS used for SEAMCALL using direct writes to memory
> >(unless that TDX behavior has changed in the last few years). And so it needs
> >to ensure that in-memory representation is synchronized with the VMCS cache.
> >
> >Hmm, but that doesn't make sense _if_ it really truly is SEAMRET that does the VMCS
> >cache invalidation, because flushing the VMCS cache would overwrite the in-memory
> >state.
>
> My understanding is:
>
> 1. SEAMCALL/SEAMRET use VMCSs.
>
> 2. P-SEAMLDR is single-threaded (likely for simplicity). So, it uses a _single_
> global VMCS and only one CPU can call P-SEAMLDR calls at a time.
>
> 3. After SEAMRET from P-SEAMLDR, _if_ the global VMCS isn't flushed, other CPUs
> cannot enter P-SEAMLDR because the global VMCS would be corrupted. (note the
> global VMCS is cached by the original CPU).
>
> 4. To make P-SEAMLDR callable on all CPUs, the SEAMRET instruction flushes VMCSs.
> The flush cannot be performed by the host VMM since the global VMCS is not
> visible to it. P-SEAMLDR cannot do it either because SEAMRET is its final
> instruction and requires a valid VMCS.
No, this isn't the explanation. I found the explanation in the pseudocode for
SEAMRET. The "successful VM-Entry" path says this:
current-VMCS = current-VMCS.VMCS-link-pointer
IF inP_SEAMLDR == 1; THEN
If current-VMCS != FFFFFFFF_FFFFFFFFH; THEN
Ensure data for VMCS referenced by current-VMC is in memory
Initialize implementation-specific data in all VMCS referenced by current-VMCS
Set launch state of VMCS referenced by current-VMCS to “clear”
current-VMCS = FFFFFFFF_FFFFFFFFH
FI;
inP_SEAMLDR = 0
FI;
I.e. my guess about firmware (probably XuCode?) doing direct writes was correct,
I just guessed wrong on which VMCS. Or rather, I didn't guess "all".
> The TDX Module has per-CPU VMCSs, so it doesn't have this problem.
>
> I'll check if SEAM ISA architects can join to explain this in more detail.
On 2/5/26 08:29, Sean Christopherson wrote:
> No, this isn't the explanation. I found the explanation in the pseudocode for
> SEAMRET. The "successful VM-Entry" path says this:
>
> current-VMCS = current-VMCS.VMCS-link-pointer
> IF inP_SEAMLDR == 1; THEN
> If current-VMCS != FFFFFFFF_FFFFFFFFH; THEN
> Ensure data for VMCS referenced by current-VMC is in memory
> Initialize implementation-specific data in all VMCS referenced by current-VMCS
> Set launch state of VMCS referenced by current-VMCS to “clear”
> current-VMCS = FFFFFFFF_FFFFFFFFH
> FI;
> inP_SEAMLDR = 0
> FI;

Yes, in version 002 of the spec. It wasn't there in 001.

The basic problem is that the SEAM VMCSes need to get flushed when the TDX
module is being loaded. The TDX module never loads itself, thus the
"inP_SEAMLDR == 1" check.

It sounds like there was already an existing thing in microcode to just flush
VMCSes and invalidate "current-VMCS". It was much easier for microcode to just
jump over to that existing thing than to surgically target the SEAM VMCSes, or
somehow avoid zapping "current-VMCS". It makes total sense for the microcoders
to have gone this route.

I'm seeing if it can get changed back to the 001 version so we just don't even
have to deal with this whole mess.
On 2/3/26 07:41, Sean Christopherson wrote:
>> It seems like a bug, or at least a P-SEAMLDR implementation issue that
>> needs to get fixed.
> Yeah, 'tis odd behavior. IMO, that's all the more reason the TDX subsystem should
> hide the quirk from the rest of the kernel.
>
> [*] https://lore.kernel.org/all/20251010220403.987927-1-seanjc@google.com

For now, I say treat it as a bug. Don't deal with it in the series.

If it truly is unfixable P-SEAMLDR behavior, then Intel can issue an erratum
for it and we can add (ugly) code to follow Sean's suggestion.
On 1/23/26 06:55, Chao Gao wrote:
> SEAMRET from the P-SEAMLDR clears the current VMCS structure pointed
> to by the current-VMCS pointer. A VMM that invokes the P-SEAMLDR
> using SEAMCALL must reload the current-VMCS, if required, using the
> VMPTRLD instruction.

That seems pretty mean.

This is going to need a lot more justification for why this is an
absolutely necessary requirement.

KVM folks, are you OK with this?
On Wed, Jan 28, 2026 at 03:04:55PM -0800, Dave Hansen wrote:
>On 1/23/26 06:55, Chao Gao wrote:
>> SEAMRET from the P-SEAMLDR clears the current VMCS structure pointed
>> to by the current-VMCS pointer. A VMM that invokes the P-SEAMLDR
>> using SEAMCALL must reload the current-VMCS, if required, using the
>> VMPTRLD instruction.
>
>That seems pretty mean.
>
>This is going to need a lot more justification for why this is an
>absolutely necessary requirement.

AFAIK, this is a CPU implementation issue. The actual requirement is to
evict (flush and invalidate) all VMCSs __cached in SEAM mode__, but big
cores implement this by evicting the __entire__ VMCS cache. So, the
current VMCS is invalidated and cleared.

>
>KVM folks, are you OK with this?
On 1/30/26 00:08, Chao Gao wrote:
> On Wed, Jan 28, 2026 at 03:04:55PM -0800, Dave Hansen wrote:
>> On 1/23/26 06:55, Chao Gao wrote:
>>> SEAMRET from the P-SEAMLDR clears the current VMCS structure pointed
>>> to by the current-VMCS pointer. A VMM that invokes the P-SEAMLDR
>>> using SEAMCALL must reload the current-VMCS, if required, using the
>>> VMPTRLD instruction.
>>
>> That seems pretty mean.
>>
>> This is going to need a lot more justification for why this is an
>> absolutely necessary requirement.
>
> AFAIK, this is a CPU implementation issue. The actual requirement is to
> evict (flush and invalidate) all VMCSs __cached in SEAM mode__, but big
> cores implement this by evicting the __entire__ VMCS cache. So, the
> current VMCS is invalidated and cleared.

But why is this a P-SEAMLDR thing and not a TDX module thing?

It seems like a bug, or at least a P-SEAMLDR implementation issue that
needs to get fixed.
On 1/23/2026 10:55 PM, Chao Gao wrote:
> Software needs to talk with P-SEAMLDR via P-SEAMLDR SEAMCALLs. So, add a
> wrapper for P-SEAMLDR SEAMCALLs.
>
> Save and restore the current VMCS using VMPTRST and VMPTRLD instructions
> to avoid breaking KVM. This is needed because P-SEAMLDR SEAMCALLs
> invalidate the current VMCS, as documented in Intel® Trust Domain CPU
> Architectural Extensions (May 2021 edition) Chapter 2.3 [1]:
>
> SEAMRET from the P-SEAMLDR clears the current VMCS structure pointed
> to by the current-VMCS pointer. A VMM that invokes the P-SEAMLDR using
> SEAMCALL must reload the current-VMCS, if required, using the VMPTRLD
> instruction.
>
> Disable interrupts to prevent KVM code from interfering with P-SEAMLDR
> SEAMCALLs. For example, if a vCPU is scheduled before the current VMCS is
> restored, it may encounter an invalid current VMCS, causing its VMX
> instruction to fail. Additionally, if KVM sends IPIs to invalidate a
> current VMCS and the invalidation occurs right after the current VMCS is
> saved, that VMCS will be reloaded after P-SEAMLDR SEAMCALLs, leading to
> unexpected behavior.
>
> NMIs are not a problem, as the only scenario where instructions relying on
> the current-VMCS are used is during guest PMI handling in KVM. This occurs
> immediately after VM exits with IRQ and NMI disabled, ensuring no
> interference with P-SEAMLDR SEAMCALLs.
>
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> Tested-by: Farrah Chen <farrah.chen@intel.com>
> Link: https://cdrdv2.intel.com/v1/dl/getContent/733582 # [1]
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Two nits below.
> ---
> v2:
> - don't create a new, inferior framework to save/restore VMCS
> - use human-friendly language, just "current VMCS" rather than
> SDM term "current-VMCS pointer"
> - don't mix guard() with goto
> ---
> arch/x86/virt/vmx/tdx/Makefile | 1 +
> arch/x86/virt/vmx/tdx/seamldr.c | 56 ++++++++++++++++++++++++++++++
> drivers/virt/coco/tdx-host/Kconfig | 10 ++++++
> 3 files changed, 67 insertions(+)
> create mode 100644 arch/x86/virt/vmx/tdx/seamldr.c
>
> diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
> index 90da47eb85ee..26aea3531c36 100644
> --- a/arch/x86/virt/vmx/tdx/Makefile
> +++ b/arch/x86/virt/vmx/tdx/Makefile
> @@ -1,2 +1,3 @@
> # SPDX-License-Identifier: GPL-2.0-only
> obj-y += seamcall.o tdx.o
> +obj-$(CONFIG_INTEL_TDX_MODULE_UPDATE) += seamldr.o
> diff --git a/arch/x86/virt/vmx/tdx/seamldr.c b/arch/x86/virt/vmx/tdx/seamldr.c
> new file mode 100644
> index 000000000000..b99d73f7bb08
> --- /dev/null
> +++ b/arch/x86/virt/vmx/tdx/seamldr.c
> @@ -0,0 +1,56 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright(c) 2025 Intel Corporation.
Update to 2026?
> + *
> + * Intel TDX module runtime update
> + */
> +#define pr_fmt(fmt) "seamldr: " fmt
> +
> +#include <linux/irqflags.h>
> +#include <linux/types.h>
> +
> +#include "seamcall.h"
> +
> +static __maybe_unused int seamldr_call(u64 fn, struct tdx_module_args *args)
> +{
> + unsigned long flags;
> + u64 vmcs;
> + int ret;
> +
> + if (!is_seamldr_call(fn))
> + return -EINVAL;
> +
> + /*
> + * SEAMRET from P-SEAMLDR invalidates the current VMCS. Save/restore
> + * the VMCS across P-SEAMLDR SEAMCALLs to avoid clobbering KVM state.
> + * Disable interrupts as KVM is allowed to do VMREAD/VMWRITE in IRQ
> + * context (but not NMI context).
> + */
> + local_irq_save(flags);
> +
> + asm goto("1: vmptrst %0\n\t"
> + _ASM_EXTABLE(1b, %l[error])
> + : "=m" (vmcs) : : "cc" : error);
> +
> + ret = seamldr_prerr(fn, args);
> +
> + /*
> + * Restore the current VMCS pointer. VMPTSTR "returns" all ones if the
s/VMPTSTR/VMPTRST
On Fri, Jan 23, 2026 at 06:55:15AM -0800, Chao Gao wrote:
> --- a/drivers/virt/coco/tdx-host/Kconfig
> +++ b/drivers/virt/coco/tdx-host/Kconfig
> @@ -8,3 +8,13 @@ config TDX_HOST_SERVICES
>
>           Say y or m if enabling support for confidential virtual machine
>           support (CONFIG_INTEL_TDX_HOST). The module is called tdx_host.ko
> +
> +config INTEL_TDX_MODULE_UPDATE
> +       bool "Intel TDX module runtime update"
> +       depends on TDX_HOST_SERVICES
> +       help
> +         This enables the kernel to support TDX module runtime update. This
> +         allows the admin to update the TDX module to another compatible
> +         version without the need to terminate running TDX guests.
> +
> +         If unsure, say N.

How about leaving out the first "This" above:

          Enable the kernel to support TDX module runtime update. This
          allows the admin to update the TDX module to another compatible
          version without the need to terminate running TDX guests.

Other than that:

Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>