From nobody Wed Oct 8 15:53:34 2025
From: Kai Huang
To: dave.hansen@intel.com, bp@alien8.de, tglx@linutronix.de,
    peterz@infradead.org, mingo@redhat.com, hpa@zytor.com,
    thomas.lendacky@amd.com
Cc: x86@kernel.org, kirill.shutemov@linux.intel.com,
    rick.p.edgecombe@intel.com, linux-kernel@vger.kernel.org,
    pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org,
    reinette.chatre@intel.com, isaku.yamahata@intel.com,
    dan.j.williams@intel.com, ashish.kalra@amd.com, nik.borisov@suse.com,
    sagis@google.com
Subject: [PATCH v3 1/6]
 x86/sme: Use percpu boolean to control wbinvd during kexec
Date: Thu, 26 Jun 2025 22:48:47 +1200
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

On SME platforms, dirty cachelines with and without the encryption bit
for the same physical memory can coexist at the hardware level, and the
CPU can flush them back to memory in random order. During kexec, the
caches must be flushed before jumping to the new kernel to avoid silent
memory corruption when a cacheline with a different encryption property
is written back over whatever encryption properties the new kernel is
using.

TDX also needs a cache flush during kexec for the same reason. It would
be good to implement a generic way to flush the cache instead of
scattering checks for each feature all around.

During kexec, the kexec-ing CPU sends IPIs to all remote CPUs to stop
them before it boots to the new kernel. For SME the kernel basically
encrypts all memory, including the kernel itself, by manipulating page
tables. A simple memory write from the kernel could dirty cachelines.

Currently, the kernel uses WBINVD to flush the cache for SME during
kexec in two places:

1) the one in stop_this_cpu() for all remote CPUs when the kexec-ing
   CPU stops them;
2) the one in relocate_kernel() where the kexec-ing CPU jumps to the
   new kernel.

Unlike SME, TDX can only dirty cachelines when it is used (i.e., when
SEAMCALLs are performed). Since there are no more SEAMCALLs after the
aforementioned WBINVDs, leverage this for TDX.

In order to have a generic way to cover both SME and TDX, use a percpu
boolean to indicate that the cache may be in an incoherent state and
thus a cache flush is needed during kexec, and turn on the boolean for
SME. TDX can then leverage it by also turning the boolean on.

A percpu boolean isn't strictly needed for SME since it is activated at
very early boot time and on all CPUs. A global flag would be
sufficient. But using a percpu variable has two benefits. Foremost,
the state that is being tracked here (percpu cache coherency situation
requiring a flush) is percpu, so a percpu state is a more direct and
natural fit.

Secondly, it will fit TDX's usage better since the percpu var can be
set when a CPU makes a SEAMCALL, and cleared when another WBINVD on the
CPU obviates the need for a kexec-time WBINVD. Saving the kexec-time
WBINVD is valuable, because there is an existing race[*] where kexec
could proceed while another CPU is active. WBINVD could make this race
worse, so it's worth skipping it when possible.

Today the first WBINVD in stop_this_cpu() is performed when SME is
*supported* by the platform, and the second WBINVD is done in
relocate_kernel() when SME is *activated* by the kernel. Make things
simple by changing to do the second WBINVD when the platform supports
SME. This allows the kernel to simply turn on this percpu boolean when
bringing up a CPU by checking whether the platform supports SME.

No other functional change intended.

Also, currently machine_kexec() has a comment to explain why no
function call is allowed after load_segments(). After changing to use
the new percpu boolean to control whether to perform WBINVD when
calling relocate_kernel(), that comment isn't needed anymore.
But it is still a useful comment, so expand the comment around
load_segments() to mention that due to call depth tracking no function
call can be made after load_segments().

[*] The "race" in native_stop_other_cpus()

Commit 1f5e7eb7868e ("x86/smp: Make stop_other_cpus() more robust")
introduced a new 'cpus_stop_mask' to resolve an "intermittent lockups
on poweroff" issue which was caused by the WBINVD in stop_this_cpu().
Specifically, the new cpumask resolved the below problem mentioned in
that commit:

    CPU0                                    CPU1

     stop_other_cpus()
       send_IPIs(REBOOT);                   stop_this_cpu()
       while (num_online_cpus() > 1);         set_online(false);
       proceed... -> hang
                                              wbinvd()

While it fixed the reported issue, that commit explained the new
cpumask "cannot plug all holes either". This is because it doesn't
address the "race" mentioned in #3 of the comment in
native_stop_other_cpus():

    /*
     * 1) Send an IPI on the reboot vector to all other CPUs.
     *
     *    The other CPUs should react on it after leaving critical
     *    sections and re-enabling interrupts. They might still hold
     *    locks, but there is nothing which can be done about that.
     *
     * 2) Wait for all other CPUs to report that they reached the
     *    HLT loop in stop_this_cpu()
     *
     * 3) If #2 timed out send an NMI to the CPUs which did not
     *    yet report
     *
     * 4) Wait for all other CPUs to report that they reached the
     *    HLT loop in stop_this_cpu()
     *
     * #3 can obviously race against a CPU reaching the HLT loop late.
     * That CPU will have reported already and the "have all CPUs
     * reached HLT" condition will be true despite the fact that the
     * other CPU is still handling the NMI. Again, there is no
     * protection against that as "disabled" APICs still respond to
     * NMIs.
     */

Consider the below case:

    CPU 0                                   CPU 1

     native_stop_other_cpus()                stop_this_cpu()

       // sends REBOOT IPI to stop remote CPUs
                                               ...
                                               wbinvd();

       // wait times out, try NMI
       if (!cpumask_empty(&cpus_stop_mask)) {
         for_each_cpu(cpu, &cpus_stop_mask) {

                                               ...
                                               cpumask_clear_cpu(cpu,
                                                      &cpus_stop_mask);
                                               hlt;

           // send NMI        --->
                                              wakeup from hlt and run
                                              stop_this_cpu():

           // WAIT CPUs TO STOP
           while (!cpumask_empty(
               &cpus_stop_mask) && ...)
         }
                                               ...
                                               proceed ... wbinvd();
                                               ...
                                               hlt;

The "WAIT CPUs TO STOP" is supposed to wait until all remote CPUs have
stopped, but it actually quits immediately because the remote CPUs have
already been cleared from cpus_stop_mask when stop_this_cpu() was
called from the REBOOT IPI.

Doing WBINVD in stop_this_cpu() could potentially increase the chance
of triggering the above "race", although it is still rare.
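The scheme can be summarized with a small user-space model (an
illustrative sketch only, not kernel code; the flag name mirrors the
percpu boolean added below, and printf stands in for the real WBINVD):

#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 4

/* Models the percpu boolean cache_state_incoherent. */
static bool cache_state_incoherent[NR_CPUS];

/* SME: flag every CPU at bring-up when the platform supports SME. */
static void cpu_bringup(int cpu, bool sme_supported)
{
	if (sme_supported)
		cache_state_incoherent[cpu] = true;
}

/* Models stop_this_cpu()/relocate_kernel(): flush only when needed. */
static void stop_this_cpu(int cpu)
{
	if (cache_state_incoherent[cpu]) {
		printf("cpu%d: wbinvd\n", cpu);	/* stands in for WBINVD */
		cache_state_incoherent[cpu] = false;
	}
}

int main(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		cpu_bringup(cpu, /* sme_supported */ true);
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		stop_this_cpu(cpu);
	return 0;
}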
Signed-off-by: Kai Huang
Reviewed-by: Tom Lendacky
Tested-by: Tom Lendacky
---
 arch/x86/include/asm/kexec.h         |  2 +-
 arch/x86/include/asm/processor.h     |  2 ++
 arch/x86/kernel/cpu/amd.c            | 16 ++++++++++++++++
 arch/x86/kernel/machine_kexec_64.c   | 15 ++++++++++-----
 arch/x86/kernel/process.c            | 16 +++-------------
 arch/x86/kernel/relocate_kernel_64.S | 15 +++++++++++----
 6 files changed, 43 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index f2ad77929d6e..d7e93522b93d 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -122,7 +122,7 @@ relocate_kernel_fn(unsigned long indirection_page,
 		   unsigned long pa_control_page,
 		   unsigned long start_address,
 		   unsigned int preserve_context,
-		   unsigned int host_mem_enc_active);
+		   unsigned int cache_incoherent);
 #endif
 extern relocate_kernel_fn relocate_kernel;
 #define ARCH_HAS_KIMAGE_ARCH
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index bde58f6510ac..a24c7805acdb 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -731,6 +731,8 @@ void __noreturn stop_this_cpu(void *dummy);
 void microcode_check(struct cpuinfo_x86 *prev_info);
 void store_cpu_caps(struct cpuinfo_x86 *info);
 
+DECLARE_PER_CPU(bool, cache_state_incoherent);
+
 enum l1tf_mitigations {
 	L1TF_MITIGATION_OFF,
 	L1TF_MITIGATION_AUTO,
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index f18f540db58c..4c7fde344216 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -503,6 +503,22 @@ static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
 {
 	u64 msr;
 
+	/*
+	 * Mark that wbinvd is needed during kexec on processors that
+	 * support SME. This provides support for performing a successful
+	 * kexec when going from SME inactive to SME active (or vice-versa).
+	 *
+	 * The cache must be cleared so that if there are entries with the
+	 * same physical address, both with and without the encryption bit,
+	 * they don't race each other when flushed and potentially end up
+	 * with the wrong entry being committed to memory.
+	 *
+	 * Test the CPUID bit directly because the machine might've cleared
+	 * X86_FEATURE_SME due to cmdline options.
+	 */
+	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
+		__this_cpu_write(cache_state_incoherent, true);
+
 	/*
 	 * BIOS support is required for SME and SEV.
 	 *   For SME: If BIOS has enabled SME then adjust x86_phys_bits by
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 697fb99406e6..4519c7b75c49 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -29,6 +29,7 @@
 #include
 #include
 #include
+#include
 
 #ifdef CONFIG_ACPI
 /*
@@ -384,15 +385,15 @@ void __nocfi machine_kexec(struct kimage *image)
 {
 	unsigned long reloc_start = (unsigned long)__relocate_kernel_start;
 	relocate_kernel_fn *relocate_kernel_ptr;
-	unsigned int host_mem_enc_active;
+	unsigned int cache_incoherent;
 	int save_ftrace_enabled;
 	void *control_page;
 
 	/*
-	 * This must be done before load_segments() since if call depth tracking
-	 * is used then GS must be valid to make any function calls.
+	 * This must be done before load_segments(), since it resets
+	 * GS to 0 and percpu data needs the correct GS to work.
 	 */
-	host_mem_enc_active = cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT);
+	cache_incoherent = this_cpu_read(cache_state_incoherent);
 
 #ifdef CONFIG_KEXEC_JUMP
 	if (image->preserve_context)
@@ -436,6 +437,10 @@ void __nocfi machine_kexec(struct kimage *image)
 	 *
 	 * Take advantage of this here by force loading the segments,
 	 * before the GDT is zapped with an invalid value.
+	 *
+	 * load_segments() resets GS to 0. Don't make any function call
+	 * after here since call depth tracking uses percpu variables to
+	 * operate (relocate_kernel() is explicitly ignored by call depth
+	 * tracking).
 	 */
 	load_segments();
 
@@ -444,7 +449,7 @@ void __nocfi machine_kexec(struct kimage *image)
 			virt_to_phys(control_page),
 			image->start,
 			image->preserve_context,
-			host_mem_enc_active);
+			cache_incoherent);
 
 #ifdef CONFIG_KEXEC_JUMP
 	if (image->preserve_context)
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 7b94851bb37e..6b5edfbefa9a 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -88,6 +88,8 @@ EXPORT_PER_CPU_SYMBOL(cpu_tss_rw);
 DEFINE_PER_CPU(bool, __tss_limit_invalid);
 EXPORT_PER_CPU_SYMBOL_GPL(__tss_limit_invalid);
 
+DEFINE_PER_CPU(bool, cache_state_incoherent);
+
 /*
  * this gets called so that we can store lazy state into memory and copy the
  * current task into the new thread.
@@ -827,19 +829,7 @@ void __noreturn stop_this_cpu(void *dummy)
 	disable_local_APIC();
 	mcheck_cpu_clear(c);
 
-	/*
-	 * Use wbinvd on processors that support SME. This provides support
-	 * for performing a successful kexec when going from SME inactive
-	 * to SME active (or vice-versa). The cache must be cleared so that
-	 * if there are entries with the same physical address, both with and
-	 * without the encryption bit, they don't race each other when flushed
-	 * and potentially end up with the wrong entry being committed to
-	 * memory.
-	 *
-	 * Test the CPUID bit directly because the machine might've cleared
-	 * X86_FEATURE_SME due to cmdline options.
-	 */
-	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
+	if (this_cpu_read(cache_state_incoherent))
 		wbinvd();
 
 	/*
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index ea604f4d0b52..34b3a5e4fe49 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -67,7 +67,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
 	 * %rsi pa_control_page
 	 * %rdx start address
 	 * %rcx preserve_context
-	 * %r8  host_mem_enc_active
+	 * %r8  cache_incoherent
 	 */
 
 	/* Save the CPU context, used for jumping back */
@@ -129,7 +129,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
 	/*
 	 * %rdi indirection page
 	 * %rdx start address
-	 * %r8  host_mem_enc_active
+	 * %r8  cache_incoherent
 	 * %r9  page table page
 	 * %r11 preserve_context
 	 * %r13 original CR4 when relocate_kernel() was invoked
@@ -200,14 +200,21 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
 	movq	%r9, %cr3
 
 	/*
+	 * If the memory cache is in an incoherent state, e.g., due to
+	 * memory encryption, do wbinvd to flush the cache.
+	 *
 	 * If SME is active, there could be old encrypted cache line
 	 * entries that will conflict with the now unencrypted memory
 	 * used by kexec. Flush the caches before copying the kernel.
+	 *
+	 * Note SME sets this flag to true when the platform supports
+	 * SME, so the wbinvd is performed even when SME is not activated
+	 * by the kernel. But this does no harm.
 	 */
 	testq	%r8, %r8
-	jz	.Lsme_off
+	jz	.Lnowbinvd
 	wbinvd
-.Lsme_off:
+.Lnowbinvd:
 
 	call	swap_pages

--
2.49.0

From nobody Wed Oct 8 15:53:34 2025
From: Kai Huang
To: dave.hansen@intel.com, bp@alien8.de, tglx@linutronix.de,
    peterz@infradead.org, mingo@redhat.com, hpa@zytor.com,
    thomas.lendacky@amd.com
Cc: x86@kernel.org, kirill.shutemov@linux.intel.com,
    rick.p.edgecombe@intel.com, linux-kernel@vger.kernel.org,
    pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org,
    reinette.chatre@intel.com,
    isaku.yamahata@intel.com, dan.j.williams@intel.com,
    ashish.kalra@amd.com, nik.borisov@suse.com, sagis@google.com,
    Farrah Chen
Subject: [PATCH v3 2/6] x86/virt/tdx: Mark memory cache state incoherent
 when making SEAMCALL
Date: Thu, 26 Jun 2025 22:48:48 +1200
Message-ID: <323dc9e1de6a2576ca21b9c446480e5b6c6a3116.1750934177.git.kai.huang@intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

On TDX platforms, dirty cachelines with and without a TDX keyID for the
same physical memory can coexist at the hardware level, and the CPU can
flush them back to memory in random order. During kexec, the caches
must be flushed before jumping to the new kernel to avoid silent memory
corruption when a cacheline with a different encryption property is
written back over whatever encryption properties the new kernel is
using.

A percpu boolean is used to mark whether the cache of a given CPU may
be in an incoherent state, and kexec performs WBINVD on the CPUs with
that boolean turned on.

For TDX, only the TDX module or TDX guests can generate dirty
cachelines of TDX private memory, i.e., they are only generated when
the kernel does a SEAMCALL. Turn on that boolean when the kernel does
a SEAMCALL so that kexec can correctly flush the cache.

A SEAMCALL can be made from both task context and IRQ-disabled context.
Given that a SEAMCALL is just a lengthy instruction (e.g., thousands of
cycles) from the kernel's point of view and preempt_{disable|enable}()
is cheap compared to it, simply disable preemption unconditionally
while setting the percpu boolean and making the SEAMCALL.

Signed-off-by: Kai Huang
Tested-by: Farrah Chen
---

v2 -> v3:
 - Change to use __always_inline for do_seamcall() to avoid indirect
   call instructions when making a SEAMCALL.
 - Remove the sentence "not all SEAMCALLs generate dirty cachelines of
   TDX private memory but just treat all of them do." in the changelog
   and the code comment. -- Dave
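The ordering requirement in do_seamcall() below can be illustrated with
a small user-space model (a sketch only, not kernel code;
fake_seamcall() and nmi_like_observer() are made-up stand-ins for the
real SEAMCALL and for an NMI handler sampling the flag mid-call):

#include <assert.h>
#include <stdbool.h>

/* Models the percpu boolean cache_state_incoherent. */
static _Thread_local bool cache_state_incoherent;

/* Stand-in for an NMI handler that samples the flag mid-"SEAMCALL". */
static void nmi_like_observer(void)
{
	assert(cache_state_incoherent);	/* must already be set */
}

static unsigned long long fake_seamcall(void)
{
	nmi_like_observer();		/* may fire at any point inside */
	return 0;
}

static unsigned long long do_seamcall_model(void)
{
	/* Set the flag first; only then start the lengthy operation. */
	cache_state_incoherent = true;
	return fake_seamcall();
}

int main(void)
{
	return (int)do_seamcall_model();
}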
---
 arch/x86/include/asm/tdx.h | 29 ++++++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 7ddef3a69866..d4c624c69d7f 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -102,10 +102,37 @@ u64 __seamcall_ret(u64 fn, struct tdx_module_args *args);
 u64 __seamcall_saved_ret(u64 fn, struct tdx_module_args *args);
 void tdx_init(void);
 
+#include
 #include
+#include
 
 typedef u64 (*sc_func_t)(u64 fn, struct tdx_module_args *args);
 
+static __always_inline u64 do_seamcall(sc_func_t func, u64 fn,
+				       struct tdx_module_args *args)
+{
+	u64 ret;
+
+	preempt_disable();
+
+	/*
+	 * SEAMCALLs are made to the TDX module and can generate dirty
+	 * cachelines of TDX private memory. Mark cache state incoherent
+	 * so that the cache can be flushed during kexec.
+	 *
+	 * This needs to be done before actually making the SEAMCALL,
+	 * because the kexec-ing CPU could send NMIs to stop remote CPUs,
+	 * in which case even disabling IRQs won't help here.
+	 */
+	this_cpu_write(cache_state_incoherent, true);
+
+	ret = func(fn, args);
+
+	preempt_enable();
+
+	return ret;
+}
+
 static __always_inline u64 sc_retry(sc_func_t func, u64 fn,
 				    struct tdx_module_args *args)
 {
@@ -113,7 +140,7 @@ static __always_inline u64 sc_retry(sc_func_t func, u64 fn,
 	u64 ret;
 
 	do {
-		ret = func(fn, args);
+		ret = do_seamcall(func, fn, args);
 	} while (ret == TDX_RND_NO_ENTROPY && --retry);
 
 	return ret;

--
2.49.0

From nobody Wed Oct 8 15:53:34 2025
From: Kai Huang
To: dave.hansen@intel.com, bp@alien8.de, tglx@linutronix.de,
    peterz@infradead.org, mingo@redhat.com, hpa@zytor.com,
    thomas.lendacky@amd.com
Cc: x86@kernel.org, kirill.shutemov@linux.intel.com,
    rick.p.edgecombe@intel.com, linux-kernel@vger.kernel.org,
    pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org,
    reinette.chatre@intel.com, isaku.yamahata@intel.com,
    dan.j.williams@intel.com, ashish.kalra@amd.com, nik.borisov@suse.com,
    sagis@google.com, Farrah Chen
Subject: [PATCH v3 3/6] x86/kexec: Disable kexec/kdump on platforms with
 TDX partial write erratum
Date: Thu, 26 Jun 2025 22:48:49 +1200
Message-ID: <412a62c52449182e392ab359dabd3328eae72990.1750934177.git.kai.huang@intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Some early TDX-capable platforms have an erratum: a kernel partial
write (a write transaction of less than a cacheline landing at the
memory controller) to TDX private memory poisons that memory, and a
subsequent read triggers a machine check.

On those platforms, the old kernel must reset TDX private memory before
jumping to the new kernel, otherwise the new kernel may see unexpected
machine checks. Currently the kernel doesn't track which pages are TDX
private pages. For simplicity, just fail kexec/kdump on those
platforms.

Leverage the existing machine_kexec_prepare() to fail kexec/kdump by
adding a check for the presence of the TDX erratum (which is only
checked for if the kernel is built with TDX host support). This
rejects kexec/kdump when the kernel is loading the kexec/kdump kernel
image.

The alternative is to reject kexec/kdump when the kernel is jumping to
the new kernel. But for kexec this requires adding a new check (e.g.,
arch_kexec_allowed()) in the common code to fail kernel_kexec() at an
early stage. Kdump (crash_kexec()) needs a similar check, but it's
hard to justify because crash_kexec() is not supposed to abort.

It's feasible to further relax this limitation, i.e., only fail kexec
when TDX is actually enabled by the kernel. But this is still a half
measure compared to resetting TDX private memory, so just do the
simplest thing for now.

The impact to userspace is that users will get an error when loading
the kexec/kdump kernel image:

  kexec_load failed: Operation not supported

This might be confusing to users, thus also print the reason in dmesg:

  [..] kexec: not allowed on platform with tdx_pw_mce bug.

Signed-off-by: Kai Huang
Tested-by: Farrah Chen
Reviewed-by: Binbin Wu
Reviewed-by: Rick Edgecombe
---
 arch/x86/kernel/machine_kexec_64.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 4519c7b75c49..d5a85d786e61 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -347,6 +347,22 @@ int machine_kexec_prepare(struct kimage *image)
 	unsigned long reloc_end = (unsigned long)__relocate_kernel_end;
 	int result;
 
+	/*
+	 * Some early TDX-capable platforms have an erratum. A kernel
+	 * partial write (a write transaction of less than a cacheline
+	 * landing at the memory controller) to TDX private memory poisons
+	 * that memory, and a subsequent read triggers a machine check.
+	 *
+	 * On those platforms the old kernel must reset TDX private
+	 * memory before jumping to the new kernel, otherwise the new
+	 * kernel may see unexpected machine checks. For simplicity
+	 * just fail kexec/kdump on those platforms.
+	 */
+	if (boot_cpu_has_bug(X86_BUG_TDX_PW_MCE)) {
+		pr_info_once("Not allowed on platform with tdx_pw_mce bug\n");
+		return -EOPNOTSUPP;
+	}
+
 	/* Setup the identity mapped 64bit page table */
 	result = init_pgtable(image, __pa(control_page));
 	if (result)

--
2.49.0

From nobody Wed Oct 8 15:53:34 2025
From: Kai Huang
To: dave.hansen@intel.com, bp@alien8.de, tglx@linutronix.de,
    peterz@infradead.org, mingo@redhat.com, hpa@zytor.com,
    thomas.lendacky@amd.com
Cc: x86@kernel.org, kirill.shutemov@linux.intel.com,
    rick.p.edgecombe@intel.com, linux-kernel@vger.kernel.org,
    pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org,
    reinette.chatre@intel.com, isaku.yamahata@intel.com,
    dan.j.williams@intel.com, ashish.kalra@amd.com, nik.borisov@suse.com,
    sagis@google.com, Farrah Chen
Subject: [PATCH v3 4/6] x86/virt/tdx: Remove the !KEXEC_CORE dependency
Date: Thu, 26 Jun 2025 22:48:50 +1200
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

During kexec it is now guaranteed that all dirty cachelines of TDX
private memory are flushed before jumping to the new kernel. The TDX
private memory from the old kernel will remain TDX private memory in
the new kernel, but that is OK because kernel reads/writes to TDX
private memory never cause machine checks, except on the platforms
with the TDX partial write erratum, which has already been handled.

It is safe to allow kexec to work together with TDX now. Remove the
!KEXEC_CORE dependency.

Signed-off-by: Kai Huang
Tested-by: Farrah Chen
Reviewed-by: Rick Edgecombe
---
 arch/x86/Kconfig | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 71019b3b54ea..ca1c9f9e59be 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1899,7 +1899,6 @@ config INTEL_TDX_HOST
 	depends on X86_X2APIC
 	select ARCH_KEEP_MEMBLOCK
 	depends on CONTIG_ALLOC
-	depends on !KEXEC_CORE
 	depends on X86_MCE
 	help
 	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious

--
2.49.0

From nobody Wed Oct 8 15:53:34 2025
From: Kai Huang
To: dave.hansen@intel.com, bp@alien8.de, tglx@linutronix.de,
    peterz@infradead.org, mingo@redhat.com, hpa@zytor.com,
    thomas.lendacky@amd.com
Cc: x86@kernel.org, kirill.shutemov@linux.intel.com,
    rick.p.edgecombe@intel.com, linux-kernel@vger.kernel.org,
    pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org,
    reinette.chatre@intel.com, isaku.yamahata@intel.com,
    dan.j.williams@intel.com, ashish.kalra@amd.com, nik.borisov@suse.com,
    sagis@google.com, Farrah Chen
Subject: [PATCH v3 5/6] x86/virt/tdx: Update the kexec section in the TDX
 documentation
Date: Thu, 26 Jun 2025 22:48:51 +1200
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

The TDX host kernel now supports kexec/kdump. Update the documentation
to reflect that.

Opportunistically, remove the parentheses in "Kexec()" and move this
section under the "Erratum" section, because the updated "Kexec"
section now refers to that erratum.

Signed-off-by: Kai Huang
Tested-by: Farrah Chen
Reviewed-by: Rick Edgecombe
---
 Documentation/arch/x86/tdx.rst | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/Documentation/arch/x86/tdx.rst b/Documentation/arch/x86/tdx.rst
index 719043cd8b46..61670e7df2f7 100644
--- a/Documentation/arch/x86/tdx.rst
+++ b/Documentation/arch/x86/tdx.rst
@@ -142,13 +142,6 @@ but depends on the BIOS to behave correctly.
 Note TDX works with CPU logical online/offline, thus the kernel still
 allows to offline logical CPU and online it again.
 
-Kexec()
-~~~~~~~
-
-TDX host support currently lacks the ability to handle kexec. For
-simplicity only one of them can be enabled in the Kconfig. This will be
-fixed in the future.
-
 Erratum
 ~~~~~~~
 
@@ -171,6 +164,13 @@ If the platform has such erratum, the kernel prints additional message in
 machine check handler to tell user the machine check may be caused by
 kernel bug on TDX private memory.
 
+Kexec
+~~~~~~~
+
+Currently kexec doesn't work on the TDX platforms with the
+aforementioned erratum. It fails when loading the kexec kernel image.
+Otherwise it works normally.
+
 Interaction vs S3 and deeper states
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 

--
2.49.0

From nobody Wed Oct 8 15:53:34 2025
From: Kai Huang
To: dave.hansen@intel.com, bp@alien8.de, tglx@linutronix.de,
    peterz@infradead.org, mingo@redhat.com, hpa@zytor.com,
    thomas.lendacky@amd.com
Cc: x86@kernel.org, kirill.shutemov@linux.intel.com,
    rick.p.edgecombe@intel.com, linux-kernel@vger.kernel.org,
    pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org,
    reinette.chatre@intel.com, isaku.yamahata@intel.com,
    dan.j.williams@intel.com, ashish.kalra@amd.com, nik.borisov@suse.com,
    sagis@google.com, Farrah Chen
Subject: [PATCH v3 6/6] KVM: TDX: Explicitly do WBINVD upon reboot notifier
Date: Thu, 26 Jun 2025 22:48:52 +1200
Message-ID: <6cc612331718a8bdaae9ee7071b6a360d71f2ab8.1750934177.git.kai.huang@intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

On TDX platforms, during kexec, the kernel needs to make sure there are
no dirty cachelines of TDX private memory before booting to the new
kernel to avoid silent memory corruption in the new kernel.

During kexec, the kexec-ing CPU first invokes native_stop_other_cpus()
to stop all remote CPUs before booting to the new kernel. The remote
CPUs then execute stop_this_cpu() to stop themselves.

The kernel has a percpu boolean to indicate whether the cache of a CPU
may be in an incoherent state. In stop_this_cpu(), the kernel does
WBINVD if that percpu boolean is true. TDX turns on that percpu
boolean on a CPU when the kernel does a SEAMCALL. This makes sure the
caches will be flushed during kexec.

However, native_stop_other_cpus() and stop_this_cpu() have a "race"
which is extremely rare to hit but could cause the system to hang.
Specifically, native_stop_other_cpus() first sends a normal reboot IPI
to the remote CPUs and waits one second for them to stop. If that
times out, native_stop_other_cpus() then sends NMIs to the remote CPUs
to stop them.

The aforementioned race happens when NMIs are sent. Doing WBINVD in
stop_this_cpu() makes each CPU take longer to stop and increases the
chance of the race happening.

Register a reboot notifier in KVM to explicitly flush caches upon
receiving the reboot notification (e.g., during kexec) for TDX. This
moves the WBINVD to an earlier stage than stop_this_cpu(), avoiding a
possibly lengthy operation at a time where it could cause this race.
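The intended effect can be illustrated with a small user-space model
(a sketch only, not kernel code; the names mirror the kernel ones, and
printf stands in for the real WBINVD):

#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 4

static bool cache_state_incoherent[NR_CPUS];

static void flush_cache(int cpu)		/* stands in for WBINVD */
{
	printf("cpu%d: wbinvd\n", cpu);
	cache_state_incoherent[cpu] = false;
}

/* Early, safe stage: the reboot notifier flushes every flagged CPU. */
static void reboot_notifier(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (cache_state_incoherent[cpu])
			flush_cache(cpu);
}

/* Late, racy stage: nothing left to flush, so it stays fast. */
static void stop_this_cpu(int cpu)
{
	if (cache_state_incoherent[cpu])
		flush_cache(cpu);
}

int main(void)
{
	cache_state_incoherent[1] = true;	/* CPU 1 made a SEAMCALL */
	reboot_notifier();			/* flush happens here */
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		stop_this_cpu(cpu);		/* no wbinvd happens here */
	return 0;
}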
Signed-off-by: Kai Huang
Acked-by: Paolo Bonzini
Tested-by: Farrah Chen
Reviewed-by: Binbin Wu
---

v2 -> v3:
 - Update changelog to address Paolo's comments and add Paolo's Ack:
   https://lore.kernel.org/lkml/3a7c0856-6e7b-4d3d-b966-6f17f1aca42e@redhat.com/

---
 arch/x86/include/asm/tdx.h  |  3 +++
 arch/x86/kvm/vmx/tdx.c      | 45 +++++++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.c |  9 ++++++++
 3 files changed, 57 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index d4c624c69d7f..e6b11982c6c6 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -221,6 +221,8 @@ u64 tdh_mem_page_remove(struct tdx_td *td, u64 gpa, u64 level, u64 *ext_err1, u6
 u64 tdh_phymem_cache_wb(bool resume);
 u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td);
 u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page);
+
+void tdx_cpu_flush_cache(void);
 #else
 static inline void tdx_init(void) { }
 static inline int tdx_cpu_enable(void) { return -ENODEV; }
@@ -228,6 +230,7 @@ static inline int tdx_enable(void)  { return -ENODEV; }
 static inline u32 tdx_get_nr_guest_keyids(void) { return 0; }
 static inline const char *tdx_dump_mce_info(struct mce *m) { return NULL; }
 static inline const struct tdx_sys_info *tdx_get_sysinfo(void) { return NULL; }
+static inline void tdx_cpu_flush_cache(void) { }
 #endif	/* CONFIG_INTEL_TDX_HOST */
 
 #endif /* !__ASSEMBLER__ */
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 1ad20c273f3b..c567a64a6cb0 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -5,7 +5,9 @@
 #include
 #include
 #include
+#include
 #include
+#include
 #include "capabilities.h"
 #include "mmu.h"
 #include "x86_ops.h"
@@ -3347,6 +3349,33 @@ static int tdx_offline_cpu(unsigned int cpu)
 	return -EBUSY;
 }
 
+static void smp_func_cpu_flush_cache(void *unused)
+{
+	tdx_cpu_flush_cache();
+}
+
+static int tdx_reboot_notify(struct notifier_block *nb, unsigned long code,
+			     void *unused)
+{
+	/*
+	 * Flush the cache for all CPUs upon the reboot notifier. This
+	 * avoids having to do WBINVD in stop_this_cpu() during kexec.
+	 *
+	 * Kexec calls native_stop_other_cpus() to stop remote CPUs
+	 * before booting to the new kernel, but that code has a "race"
+	 * when the normal REBOOT IPI times out and NMIs are sent to
+	 * remote CPUs to stop them. Doing WBINVD in stop_this_cpu()
+	 * could potentially increase the possibility of the "race".
+	 */
+	if (code == SYS_RESTART)
+		on_each_cpu(smp_func_cpu_flush_cache, NULL, 1);
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block tdx_reboot_nb = {
+	.notifier_call = tdx_reboot_notify,
+};
+
 static void __do_tdx_cleanup(void)
 {
 	/*
@@ -3504,6 +3533,11 @@ void tdx_cleanup(void)
 {
 	if (enable_tdx) {
 		misc_cg_set_capacity(MISC_CG_RES_TDX, 0);
+		/*
+		 * Ignore the return value. See the comment in
+		 * tdx_bringup().
+		 */
+		unregister_reboot_notifier(&tdx_reboot_nb);
 		__tdx_cleanup();
 		kvm_disable_virtualization();
 	}
@@ -3587,6 +3621,17 @@ int __init tdx_bringup(void)
 		enable_tdx = 0;
 	}
 
+	if (enable_tdx)
+		/*
+		 * Ignore the return value. @tdx_reboot_nb is used to flush
+		 * the cache for all CPUs upon rebooting, to avoid having to
+		 * do WBINVD in kexec while the kexec-ing CPU stops all remote
+		 * CPUs. Failure to register isn't fatal, because if KVM
+		 * doesn't flush the cache explicitly upon rebooting, kexec
+		 * will do it anyway.
+		 */
+		register_reboot_notifier(&tdx_reboot_nb);
+
 	return r;
 
 success_disable_tdx:
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index c7a9a087ccaf..73425e9bee39 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1870,3 +1870,12 @@ u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page)
 	return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
 }
 EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_hkid);
+
+void tdx_cpu_flush_cache(void)
+{
+	lockdep_assert_preemption_disabled();
+
+	wbinvd();
+	this_cpu_write(cache_state_incoherent, false);
+}
+EXPORT_SYMBOL_GPL(tdx_cpu_flush_cache);

--
2.49.0