[PATCH 2/4] x86/virt/tdx: Pull kexec cache flush logic into arch/x86

Rick Edgecombe posted 4 patches 1 month ago
There is a newer version of this series
Posted by Rick Edgecombe 1 month ago
KVM tries to take care of some required cache flushing earlier in the
kexec path in order to avoid aggravating some long-standing races that
can occur later in the operation. Until recently, VMXOFF was handled
within KVM. Since VMX must be enabled to make a SEAMCALL, KVM had the
best per-CPU scoped operation to plug the flushing into.

This early kexec cache flushing in KVM happens via a syscore shutdown
callback. Now that VMX enablement control has moved to arch/x86, which has
grown its own syscore shutdown callback, it no longer makes sense for the
flushing to live in KVM. It fits better with the TDX enablement code.

In addition, future changes will add a SEAMCALL that happens immediately
before VMXOFF, which means the cache flush in KVM will be too late to be
helpful. So move it to the newly added TDX arch/x86 syscore shutdown
handler.

Since tdx_cpu_flush_cache_for_kexec() is no longer needed by KVM, make it
static and remove the export. Since it is also not part of an operation
spread across disparate components, remove the redundant comments and
verbose naming.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/include/asm/tdx.h  |  6 ------
 arch/x86/kvm/vmx/tdx.c      | 10 ----------
 arch/x86/virt/vmx/tdx/tdx.c | 39 +++++++++++++++++++------------------
 3 files changed, 20 insertions(+), 35 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 0c1ae4954f17..f0826b0a512a 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -206,11 +206,5 @@ static inline const char *tdx_dump_mce_info(struct mce *m) { return NULL; }
 static inline const struct tdx_sys_info *tdx_get_sysinfo(void) { return NULL; }
 #endif	/* CONFIG_INTEL_TDX_HOST */
 
-#ifdef CONFIG_KEXEC_CORE
-void tdx_cpu_flush_cache_for_kexec(void);
-#else
-static inline void tdx_cpu_flush_cache_for_kexec(void) { }
-#endif
-
 #endif /* !__ASSEMBLER__ */
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b7264b533feb..50a5cfdbd33e 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -440,16 +440,6 @@ void tdx_disable_virtualization_cpu(void)
 		tdx_flush_vp(&arg);
 	}
 	local_irq_restore(flags);
-
-	/*
-	 * Flush cache now if kexec is possible: this is necessary to avoid
-	 * having dirty private memory cachelines when the new kernel boots,
-	 * but WBINVD is a relatively expensive operation and doing it during
-	 * kexec can exacerbate races in native_stop_other_cpus().  Do it
-	 * now, since this is a safe moment and there is going to be no more
-	 * TDX activity on this CPU from this point on.
-	 */
-	tdx_cpu_flush_cache_for_kexec();
 }
 
 #define TDX_SEAMCALL_RETRIES 10000
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index cb9b3210ab71..0802d0fd18a4 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -224,8 +224,28 @@ static int tdx_offline_cpu(unsigned int cpu)
 	return 0;
 }
 
+static void tdx_cpu_flush_cache(void)
+{
+	lockdep_assert_preemption_disabled();
+
+	if (!this_cpu_read(cache_state_incoherent))
+		return;
+
+	wbinvd();
+	this_cpu_write(cache_state_incoherent, false);
+}
+
 static void tdx_shutdown_cpu(void *ign)
 {
+	/*
+	 * Flush cache now if kexec is possible: this is necessary to avoid
+	 * having dirty private memory cachelines when the new kernel boots,
+	 * but WBINVD is a relatively expensive operation and doing it during
+	 * kexec can exacerbate races in native_stop_other_cpus().  Do it
+	 * now, since this is a safe moment and there is going to be no more
+	 * TDX activity on this CPU from this point on.
+	 */
+	tdx_cpu_flush_cache();
 	x86_virt_put_ref(X86_FEATURE_VMX);
 }
 
@@ -1920,22 +1940,3 @@ u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page)
 	return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
 }
 EXPORT_SYMBOL_FOR_KVM(tdh_phymem_page_wbinvd_hkid);
-
-#ifdef CONFIG_KEXEC_CORE
-void tdx_cpu_flush_cache_for_kexec(void)
-{
-	lockdep_assert_preemption_disabled();
-
-	if (!this_cpu_read(cache_state_incoherent))
-		return;
-
-	/*
-	 * Private memory cachelines need to be clean at the time of
-	 * kexec.  Write them back now, as the caller promises that
-	 * there should be no more SEAMCALLs on this CPU.
-	 */
-	wbinvd();
-	this_cpu_write(cache_state_incoherent, false);
-}
-EXPORT_SYMBOL_FOR_KVM(tdx_cpu_flush_cache_for_kexec);
-#endif
-- 
2.53.0
Re: [PATCH 2/4] x86/virt/tdx: Pull kexec cache flush logic into arch/x86
Posted by Huang, Kai 1 month ago
On Fri, 2026-03-06 at 17:03 -0800, Rick Edgecombe wrote:
> KVM tries to take care of some required cache flushing earlier in the
> kexec path in order to be kind to some long standing races that can occur
> later in the operation. Until recently, VMXOFF was handled within KVM.
> Since VMX being enabled is required to make a SEAMCALL, it had the best
> per-cpu scoped operation to plug the flushing into.
> 
> This early kexec cache flushing in KVM happens via a syscore shutdown 
> callback. Now that VMX enablement control has moved to arch/x86, which has 
> grown its own syscore shutdown callback, it no longer make sense for it to 
> live in KVM. It fits better with the TDX enablement managing code.

[...]

> 
> In addition, future changes will add a SEAMCALL that happens immediately
> before VMXOFF, which means the cache flush in KVM will be too late to be
> helpful. So move it to the newly added TDX arch/x86 syscore shutdown
> handler.

Nit: I am not sure how to interpret "too late to be helpful".  I think we
can just get rid of this paragraph.

> 
> Since tdx_cpu_flush_cache_for_kexec() is no longer needed by KVM, make it 
> static and remove the export. Since it is also not part of an operation 
> spread across disparate components, remove the redundant comments and 
> verbose naming.
> 
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

Feel free to add:

Acked-by: Kai Huang <kai.huang@intel.com>

Btw, there's a functional change here, and perhaps we should call out in
changelog:

- Currently tdx_cpu_flush_cache_for_kexec() is done in
kvm_disable_virtualization_cpu(), which is also called by KVM's CPUHP
offline() callback.  So tdx_cpu_flush_cache_for_kexec() is explicitly done
by TDX code during CPU offline.

- With this change, tdx_cpu_flush_cache_for_kexec() is no longer explicitly
done by TDX code during CPU offline.

But AFAICT this is fine, since IIUC WBINVD is always done when the kernel
offlines a CPU (see [*]), i.e., the current
tdx_cpu_flush_cache_for_kexec() done in KVM's CPUHP callback is actually
superfluous.

[*] See:

	native_play_dead() ->
		cpuidle_play_dead();
		hlt_play_dead();

cpuidle_play_dead() can invoke different enter_dead() callbacks depending on
what idle driver is being used, but AFAICT eventually it ends up calling
either acpi_idle_play_dead() or mwait_play_dead(), both of which do WBINVD
before going to idle.

If cpuidle_play_dead() doesn't idle successfully, hlt_play_dead() will
then WBINVD and hlt.

Actually, after looking at multiple commits around here, e.g.,

  ea53069231f93 ("x86, hotplug: Use mwait to offline a processor, fix the
legacy case")
  dfbba2518aac4 ("Revert "ACPI: processor: idle: Only flush cache on
entering C3")

... I believe it's a kernel policy to make sure the cache is flushed when
a CPU is offlined (which makes sense anyway, of course); I just couldn't
find the exact commit saying so (or I am not sure whether there is such a
commit).


Btw2, kinda related to this, could you help review:

https://lore.kernel.org/lkml/20260302102226.7459-1-kai.huang@intel.com/
Re: [PATCH 2/4] x86/virt/tdx: Pull kexec cache flush logic into arch/x86
Posted by Edgecombe, Rick P 1 month ago
On Mon, 2026-03-09 at 00:23 +0000, Huang, Kai wrote:
> Feel free to add:
> 
> Acked-by: Kai Huang <kai.huang@intel.com>
> 
> Btw, there's a functional change here, and perhaps we should call out
> in changelog:

Yea that makes sense.

> 
> - Currently tdx_cpu_flush_cache_for_kexec() is done in
> kvm_disable_virtualization_cpu(), which is also called by KVM's CPUHP
> offline() callback.  So tdx_cpu_flush_cache_for_kexec() is explicitly
> done in TDX code in CPU offline.
> 
> - With this change, tdx_cpu_flush_cache_for_kexec() is not explicitly
> done in TDX code in CPU offline.
> 
> But AFAICT this is fine, since IIUC the WBINVD is always done when
> kernel offlines one CPU (see [*]), i.e., the current
> tdx_cpu_flush_cache_for_kexec() done in KVM's CPUHP is actually
> superfluous.
> 
> [*] See:
> 
> 	native_play_dead() ->
> 		cpuidle_play_dead();
> 		hlt_play_dead();
> 
> cpuidle_play_dead() can invoke different enter_dead() callbacks
> depending on what idle driver is being used, but AFAICT eventually it
> ends up calling either acpi_idle_play_dead() or mwait_play_dead(),
> both of which does WBINVD before going to idle.
> 
> If cpuidle_play_dead() doesn't idle successfully, the hlt_play_dead()
> will then WBINVD and hlt.
> 
> Actually, after looking at multiple commits around here, e.g.,
> 
>   ea53069231f93 ("x86, hotplug: Use mwait to offline a processor, fix
> the legacy case")
>   dfbba2518aac4 ("Revert "ACPI: processor: idle: Only flush cache on
> entering C3")
> 
> ... I believe it's a kernel policy to make sure cache is flushed when
> it offlines a CPU (which makes sense anyway of course), I just
> couldn't find the exact commit saying this (or I am not sure whether
> there's such commit).
> 
> 
Thanks for the analysis.

> Btw2, kinda related to this, could you help review:
> 
> https://lore.kernel.org/lkml/20260302102226.7459-1-kai.huang@intel.com/

Well, I think I wrote the log for it. But yeah, I'll add a tag.