From: Rick Edgecombe <rick.p.edgecombe@intel.com>

KVM tries to take care of some required cache flushing earlier in the
kexec path in order to be kind to some long-standing races that can occur
later in the operation. Until recently, VMXOFF was handled within KVM.
Since VMX being enabled is required to make a SEAMCALL, it had the best
per-cpu scoped operation to plug the flushing into. So it is kicked off
from there.

This early kexec cache flushing in KVM happens via a syscore shutdown
callback. Now that VMX enablement control has moved to arch/x86, which has
grown its own syscore shutdown callback, it no longer makes sense for it to
live in KVM. It fits better with the TDX enablement management code.

In addition, future changes will add a SEAMCALL that happens immediately
before VMXOFF, which means the cache flush in KVM will be too late to
flush the cache before the last SEAMCALL. So move it to the newly added TDX
arch/x86 syscore shutdown handler.

Since tdx_cpu_flush_cache_for_kexec() is no longer needed by KVM, make it
static and remove the export. Since it is also not part of an operation
spread across disparate components, remove the redundant comments and
verbose naming.

In the existing KVM-based code, CPU offline also funnels through
tdx_cpu_flush_cache_for_kexec(). So the centralization to the arch/x86
syscore shutdown callback elides this CPU offline time behavior. However,
WBINVD is already generally done at CPU offline as a matter of course. So
don't bother adding TDX-specific logic for this, and rely on the normal
WBINVD to handle it.
Acked-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
---
arch/x86/include/asm/tdx.h | 6 ------
arch/x86/kvm/vmx/tdx.c | 10 ----------
arch/x86/virt/vmx/tdx/tdx.c | 39 ++++++++++++++++++++-------------------
3 files changed, 20 insertions(+), 35 deletions(-)
 
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 2917b3451491..7674fc530090 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -205,11 +205,5 @@ static inline const char *tdx_dump_mce_info(struct mce *m) { return NULL; }
static inline const struct tdx_sys_info *tdx_get_sysinfo(void) { return NULL; }
#endif /* CONFIG_INTEL_TDX_HOST */
 
-#ifdef CONFIG_KEXEC_CORE
-void tdx_cpu_flush_cache_for_kexec(void);
-#else
-static inline void tdx_cpu_flush_cache_for_kexec(void) { }
-#endif
-
#endif /* !__ASSEMBLER__ */
#endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b7264b533feb..50a5cfdbd33e 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -440,16 +440,6 @@ void tdx_disable_virtualization_cpu(void)
tdx_flush_vp(&arg);
}
local_irq_restore(flags);
-
- /*
- * Flush cache now if kexec is possible: this is necessary to avoid
- * having dirty private memory cachelines when the new kernel boots,
- * but WBINVD is a relatively expensive operation and doing it during
- * kexec can exacerbate races in native_stop_other_cpus(). Do it
- * now, since this is a safe moment and there is going to be no more
- * TDX activity on this CPU from this point on.
- */
- tdx_cpu_flush_cache_for_kexec();
}
 
#define TDX_SEAMCALL_RETRIES 10000
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index cb9b3210ab71..0802d0fd18a4 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -224,8 +224,28 @@ static int tdx_offline_cpu(unsigned int cpu)
return 0;
}
 
+static void tdx_cpu_flush_cache(void)
+{
+ lockdep_assert_preemption_disabled();
+
+ if (!this_cpu_read(cache_state_incoherent))
+ return;
+
+ wbinvd();
+ this_cpu_write(cache_state_incoherent, false);
+}
+
static void tdx_shutdown_cpu(void *ign)
{
+ /*
+ * Flush cache now if kexec is possible: this is necessary to avoid
+ * having dirty private memory cachelines when the new kernel boots,
+ * but WBINVD is a relatively expensive operation and doing it during
+ * kexec can exacerbate races in native_stop_other_cpus(). Do it
+ * now, since this is a safe moment and there is going to be no more
+ * TDX activity on this CPU from this point on.
+ */
+ tdx_cpu_flush_cache();
x86_virt_put_ref(X86_FEATURE_VMX);
}
 
@@ -1920,22 +1940,3 @@ u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page)
return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
}
EXPORT_SYMBOL_FOR_KVM(tdh_phymem_page_wbinvd_hkid);
-
-#ifdef CONFIG_KEXEC_CORE
-void tdx_cpu_flush_cache_for_kexec(void)
-{
- lockdep_assert_preemption_disabled();
-
- if (!this_cpu_read(cache_state_incoherent))
- return;
-
- /*
- * Private memory cachelines need to be clean at the time of
- * kexec. Write them back now, as the caller promises that
- * there should be no more SEAMCALLs on this CPU.
- */
- wbinvd();
- this_cpu_write(cache_state_incoherent, false);
-}
-EXPORT_SYMBOL_FOR_KVM(tdx_cpu_flush_cache_for_kexec);
-#endif
 
--
2.53.0
On Mon, Mar 23, 2026, Vishal Verma wrote:
> From: Rick Edgecombe <rick.p.edgecombe@intel.com>
>
> KVM tries to take care of some required cache flushing earlier in the
> kexec path in order to be kind to some long-standing races that can occur
> later in the operation. Until recently, VMXOFF was handled within KVM.
> Since VMX being enabled is required to make a SEAMCALL, it had the best
> per-cpu scoped operation to plug the flushing into. So it is kicked off
> from there.
>
> This early kexec cache flushing in KVM happens via a syscore shutdown
> callback. Now that VMX enablement control has moved to arch/x86, which has
> grown its own syscore shutdown callback, it no longer makes sense for it to
> live in KVM. It fits better with the TDX enablement management code.
>
> In addition, future changes will add a SEAMCALL that happens immediately
> before VMXOFF, which means the cache flush in KVM will be too late to
> flush the cache before the last SEAMCALL. So move it to the newly added TDX
> arch/x86 syscore shutdown handler.
>
> Since tdx_cpu_flush_cache_for_kexec() is no longer needed by KVM, make it
> static and remove the export. Since it is also not part of an operation
> spread across disparate components, remove the redundant comments and
> verbose naming.
>
> In the existing KVM-based code, CPU offline also funnels through
> tdx_cpu_flush_cache_for_kexec(). So the centralization to the arch/x86
> syscore shutdown callback elides this CPU offline time behavior. However,
> WBINVD is already generally done at CPU offline as a matter of course. So
> don't bother adding TDX-specific logic for this, and rely on the normal
> WBINVD to handle it.
>
> Acked-by: Kai Huang <kai.huang@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Ignoring the potential fixup needed for the existing bug...
Acked-by: Sean Christopherson <seanjc@google.com>
> ---
> arch/x86/include/asm/tdx.h | 6 ------
> arch/x86/kvm/vmx/tdx.c | 10 ----------
> arch/x86/virt/vmx/tdx/tdx.c | 39 ++++++++++++++++++++-------------------
> 3 files changed, 20 insertions(+), 35 deletions(-)
>
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 2917b3451491..7674fc530090 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -205,11 +205,5 @@ static inline const char *tdx_dump_mce_info(struct mce *m) { return NULL; }
> static inline const struct tdx_sys_info *tdx_get_sysinfo(void) { return NULL; }
> #endif /* CONFIG_INTEL_TDX_HOST */
>
> -#ifdef CONFIG_KEXEC_CORE
> -void tdx_cpu_flush_cache_for_kexec(void);
> -#else
> -static inline void tdx_cpu_flush_cache_for_kexec(void) { }
> -#endif
> -
> #endif /* !__ASSEMBLER__ */
> #endif /* _ASM_X86_TDX_H */
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index b7264b533feb..50a5cfdbd33e 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -440,16 +440,6 @@ void tdx_disable_virtualization_cpu(void)
> tdx_flush_vp(&arg);
> }
> local_irq_restore(flags);
> -
> - /*
> - * Flush cache now if kexec is possible: this is necessary to avoid
> - * having dirty private memory cachelines when the new kernel boots,
> - * but WBINVD is a relatively expensive operation and doing it during
> - * kexec can exacerbate races in native_stop_other_cpus(). Do it
> - * now, since this is a safe moment and there is going to be no more
> - * TDX activity on this CPU from this point on.
> - */
> - tdx_cpu_flush_cache_for_kexec();
> }
>
> #define TDX_SEAMCALL_RETRIES 10000
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index cb9b3210ab71..0802d0fd18a4 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -224,8 +224,28 @@ static int tdx_offline_cpu(unsigned int cpu)
> return 0;
> }
>
> +static void tdx_cpu_flush_cache(void)
> +{
> + lockdep_assert_preemption_disabled();
> +
> + if (!this_cpu_read(cache_state_incoherent))
> + return;
> +
> + wbinvd();
> + this_cpu_write(cache_state_incoherent, false);
> +}
> +
> static void tdx_shutdown_cpu(void *ign)
> {
> + /*
> + * Flush cache now if kexec is possible: this is necessary to avoid
> + * having dirty private memory cachelines when the new kernel boots,
> + * but WBINVD is a relatively expensive operation and doing it during
> + * kexec can exacerbate races in native_stop_other_cpus(). Do it
> + * now, since this is a safe moment and there is going to be no more
> + * TDX activity on this CPU from this point on.
> + */
> + tdx_cpu_flush_cache();
> x86_virt_put_ref(X86_FEATURE_VMX);
> }
>
> @@ -1920,22 +1940,3 @@ u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page)
> return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
> }
> EXPORT_SYMBOL_FOR_KVM(tdh_phymem_page_wbinvd_hkid);
> -
> -#ifdef CONFIG_KEXEC_CORE
> -void tdx_cpu_flush_cache_for_kexec(void)
> -{
> - lockdep_assert_preemption_disabled();
Is there a pre-existing bug here that gets propagated to tdx_shutdown_cpu()? When
called from kvm_offline_cpu(), preemption won't be fully disabled, but per-CPU
accesses are fine because the task is pinned to the target CPU.
See https://lore.kernel.org/all/aUVx20ZRjOzKgKqy@google.com
> -
> - if (!this_cpu_read(cache_state_incoherent))
> - return;
> -
> - /*
> - * Private memory cachelines need to be clean at the time of
> - * kexec. Write them back now, as the caller promises that
> - * there should be no more SEAMCALLs on this CPU.
> - */
> - wbinvd();
> - this_cpu_write(cache_state_incoherent, false);
> -}
> -EXPORT_SYMBOL_FOR_KVM(tdx_cpu_flush_cache_for_kexec);
> -#endif
>
> --
> 2.53.0
>
On Tue, 2026-03-31 at 12:22 -0700, Sean Christopherson wrote:
> > -
> > -#ifdef CONFIG_KEXEC_CORE
> > -void tdx_cpu_flush_cache_for_kexec(void)
> > -{
> > - lockdep_assert_preemption_disabled();
>
> Is there a pre-existing bug here that gets propagated to tdx_shutdown_cpu()?
> When called from kvm_offline_cpu(), preemption won't be fully disabled, but
> per-CPU accesses are fine because the task is pinned to the target CPU.
>
> See https://lore.kernel.org/all/aUVx20ZRjOzKgKqy@google.com
Yes. And actually it got hit during development of this series. This patch will
conflict with the fix:
https://lore.kernel.org/lkml/20260312100009.924136-1-kai.huang@intel.com/
Oh, you acked it actually. But I was under the impression that after this patch
here, the splat wouldn't be triggered. So it inadvertently fixes it. But that
other patch is much more backport friendly.
On Tue, Mar 31, 2026, Rick P Edgecombe wrote:
> On Tue, 2026-03-31 at 12:22 -0700, Sean Christopherson wrote:
> > > -
> > > -#ifdef CONFIG_KEXEC_CORE
> > > -void tdx_cpu_flush_cache_for_kexec(void)
> > > -{
> > > - lockdep_assert_preemption_disabled();
> >
> > Is there a pre-existing bug here that gets propagated to tdx_shutdown_cpu()?
> > When called from kvm_offline_cpu(), preemption won't be fully disabled, but
> > per-CPU accesses are fine because the task is pinned to the target CPU.
> >
> > See https://lore.kernel.org/all/aUVx20ZRjOzKgKqy@google.com
>
> Yes. And actually it got hit during development of this series. This patch will
> conflict with the fix:
> https://lore.kernel.org/lkml/20260312100009.924136-1-kai.huang@intel.com/
>
> Oh, you acked it actually. But I was under the impression that after this patch
> here, the splat wouldn't be triggered. So it inadvertently fixes it.
Ah, that's why I was a bit confused. I was assuming tdx_shutdown_cpu() was a
cpuhp callback, but it's actually an IPI callback.
Hmm, isn't this patch wrong then? Ah, no, the changelog says:
  However, WBINVD is already generally done at CPU offline as a matter of course.
  So don't bother adding TDX-specific logic for this, and rely on the normal
  WBINVD to handle it.
What's the "normal" WBINVD? At the very least, tdx_offline_cpu() should have a
comment that explicitly calls out where that WBINVD is. I assume you're referring
to the wbinvd() calls in things like hlt_play_dead()?
But unless the WBINVD is actually costly, why bother getting fancy?
> But that other patch is much more backport friendly.
On 3/31/26 16:04, Sean Christopherson wrote:
> But unless the WBINVD is actually costly, why bother getting fancy?

WBINVD might be the most expensive single instruction in the whole ISA.

That said, I'd much rather have a potentially unnecessary WBINVD than
miss one. The thing I'd be worried about would be something wonky like:

 1. CPU offline does WBINVD
 2. Some other TDX call gets made, dirties caches again
 3. tdx_offline_cpu() skips WBINVD

So, let's just do both for now: Do WBINVD in tdx_offline_cpu() and
comment that it might be redundant with other things in the CPU offline
procedure.

This really needs to be solved with infrastructure and keeping data
about the reasons for needing WBINVD, not relying on code ordering or
fragile semantics.
On April 1, 2026 8:03:02 AM PDT, Dave Hansen <dave.hansen@intel.com> wrote:
>On 3/31/26 16:04, Sean Christopherson wrote:
>> But unless the WBINVD is actually costly, why bother getting fancy?
>
>WBINVD might be the most expensive single instruction in the whole ISA.
>
>That said, I'd much rather have a potentially unnecessary WBINVD than
>miss one. The thing I'd be worried about would be something wonky like:
>
> 1. CPU offline does WBINVD
> 2. Some other TDX call gets made, dirties caches again
> 3. tdx_offline_cpu() skips WBINVD
>
>So, let's just do both for now: Do WBINVD in tdx_offline_cpu() and
>comment that it might be redundant with other things in the CPU offline
>procedure.
>
>This really needs to be solved with infrastructure and keeping data
>about the reasons for needing WBINVD, not relying on code ordering or
>fragile semantics.

It is, *by far*, the most expensive *uninterruptible* instruction in the
ISA. REP string instructions can of course be arbitrarily long, but are
interruptible and so don't really count.

Some MSRs used during very early (pre-OS) initialization might be even
slower on some implementations, but that's not visible to Linux and no
workload of any kind is running.
On Wed, Apr 01, 2026, H. Peter Anvin wrote:
> On April 1, 2026 8:03:02 AM PDT, Dave Hansen <dave.hansen@intel.com> wrote:
> >On 3/31/26 16:04, Sean Christopherson wrote:
> >> But unless the WBINVD is actually costly, why bother getting fancy?
> >
> >WBINVD might be the most expensive single instruction in the whole ISA.
> >
> >That said, I'd much rather have a potentially unnecessary WBINVD than
> >miss one. The thing I'd be worried about would be something wonky like:
> >
> > 1. CPU offline does WBINVD
> > 2. Some other TDX call gets made, dirties caches again
> > 3. tdx_offline_cpu() skips WBINVD
> >
> >So, let's just do both for now: Do WBINVD in tdx_offline_cpu() and
> >comment that it might be redundant with other things in the CPU offline
> >procedure.
> >
> >This really needs to be solved with infrastructure and keeping data
> >about the reasons for needing WBINVD, not relying on code ordering or
> >fragile semantics.
>
> It is, *by far*, the most expensive *uninterruptible* instruction in the ISA.
> REP string instructions can of course be arbitrarily long, but are
> interruptible and so don't really count.
>
> Some MSRs used during very early (pre-OS) initialization might be even slower
> on some implementations, but that's not visible to Linux and no workload of
> any kind is running.

Sorry, "costly" wasn't the right word. I know WBINVD is super expensive, but
unless someone cares deeply about the latency of offlining a CPU after it's
done TDX stuff, the "cost" is effectively zero.
On 4/1/26 11:12, Sean Christopherson wrote:
> Sorry, "costly" wasn't the right word. I know WBINVD is super
> expensive, but unless someone cares deeply about the latency of
> offlining a CPU after it's done TDX stuff, the "cost" is effectively
> zero.

I once increased the CPU online/offline latency and got nastygrams from
folks. IIRC, I added a synchronize_rcu() which incurs way more latency
than WBINVD, but folks _do_ care about CPU online/offline latency,
surprisingly.

In this case, though, I'm happy to add the WBINVD for simplicity and
wait for a possible repeat of the torches and pitchforks.
On Tue, 2026-03-31 at 16:04 -0700, Sean Christopherson wrote:
> > Oh, you acked it actually. But I was under the impression that after this
> > patch here, the splat wouldn't be triggered. So it inadvertently fixes it.
>
> Ah, that's why I was a bit confused. I was assuming tdx_shutdown_cpu() was a
> cpuhp callback, but it's actually an IPI callback.
>
> Hmm, isn't this patch wrong then? Ah, no, the changelog says:
>
>   However, WBINVD is already generally done at CPU offline as a matter of
>   course. So don't bother adding TDX-specific logic for this, and rely on
>   the normal WBINVD to handle it.
>
> What's the "normal" WBINVD? At the very least, tdx_offline_cpu() should have
> a comment that explicitly calls out where that WBINVD is.

I guess we could add one in tdx_offline_cpu(). Seems reasonable.

> I assume you're
> referring to the wbinvd() calls in things like hlt_play_dead()?

Yea.

> But unless the WBINVD is actually costly, why bother getting fancy?

What is the suggestion to make it less fancy? Just put the wbinvd in
tdx_offline_cpu()? Yea that works too. Probably will get a comment either
way.
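A rough sketch of what putting the wbinvd in tdx_offline_cpu() could look like (illustrative only, not a posted patch; the comment wording and placement within the function are assumptions, and the elided body is unchanged):

```c
static int tdx_offline_cpu(unsigned int cpu)
{
	...

	/*
	 * Caches may hold dirty private memory lines. Other parts of the
	 * offline path (e.g. the wbinvd() in hlt_play_dead()) generally
	 * flush too, so this may be redundant, but flush here anyway so
	 * correctness doesn't depend on code ordering elsewhere.
	 */
	wbinvd();

	return 0;
}
```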
On Mon, Mar 23, 2026 at 02:59:05PM -0600, Vishal Verma wrote:
> From: Rick Edgecombe <rick.p.edgecombe@intel.com>
>
> KVM tries to take care of some required cache flushing earlier in the
> kexec path in order to be kind to some long-standing races that can occur
> later in the operation. Until recently, VMXOFF was handled within KVM.
> Since VMX being enabled is required to make a SEAMCALL, it had the best
> per-cpu scoped operation to plug the flushing into. So it is kicked off
> from there.
>
> This early kexec cache flushing in KVM happens via a syscore shutdown
> callback. Now that VMX enablement control has moved to arch/x86, which has
> grown its own syscore shutdown callback, it no longer makes sense for it to
> live in KVM. It fits better with the TDX enablement management code.
>
> In addition, future changes will add a SEAMCALL that happens immediately
> before VMXOFF, which means the cache flush in KVM will be too late to
> flush the cache before the last SEAMCALL. So move it to the newly added TDX
> arch/x86 syscore shutdown handler.
>
> Since tdx_cpu_flush_cache_for_kexec() is no longer needed by KVM, make it
> static and remove the export. Since it is also not part of an operation
> spread across disparate components, remove the redundant comments and
> verbose naming.
>
> In the existing KVM-based code, CPU offline also funnels through
> tdx_cpu_flush_cache_for_kexec(). So the centralization to the arch/x86
> syscore shutdown callback elides this CPU offline time behavior. However,
> WBINVD is already generally done at CPU offline as a matter of course. So
> don't bother adding TDX-specific logic for this, and rely on the normal
> WBINVD to handle it.
>
> Acked-by: Kai Huang <kai.huang@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>

Acked-by: Kiryl Shutsemau (Meta) <kas@kernel.org>

--
Kiryl Shutsemau / Kirill A. Shutemov
On Mon, Mar 23, 2026 at 02:59:05PM -0600, Vishal Verma wrote:
> From: Rick Edgecombe <rick.p.edgecombe@intel.com>
>
> KVM tries to take care of some required cache flushing earlier in the
> kexec path in order to be kind to some long-standing races that can occur
> later in the operation. Until recently, VMXOFF was handled within KVM.
> Since VMX being enabled is required to make a SEAMCALL, it had the best
> per-cpu scoped operation to plug the flushing into. So it is kicked off
> from there.
>
> This early kexec cache flushing in KVM happens via a syscore shutdown
> callback. Now that VMX enablement control has moved to arch/x86, which has
> grown its own syscore shutdown callback, it no longer makes sense for it to
> live in KVM. It fits better with the TDX enablement management code.
>
> In addition, future changes will add a SEAMCALL that happens immediately
> before VMXOFF, which means the cache flush in KVM will be too late to
> flush the cache before the last SEAMCALL. So move it to the newly added TDX
> arch/x86 syscore shutdown handler.
>
> Since tdx_cpu_flush_cache_for_kexec() is no longer needed by KVM, make it
> static and remove the export. Since it is also not part of an operation
> spread across disparate components, remove the redundant comments and
> verbose naming.
>
> In the existing KVM-based code, CPU offline also funnels through
> tdx_cpu_flush_cache_for_kexec(). So the centralization to the arch/x86
> syscore shutdown callback elides this CPU offline time behavior. However,
> WBINVD is already generally done at CPU offline as a matter of course. So
> don't bother adding TDX-specific logic for this, and rely on the normal
> WBINVD to handle it.
>
> Acked-by: Kai Huang <kai.huang@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>

Reviewed-by: Chao Gao <chao.gao@intel.com>