From nobody Sat Feb  7 15:31:54 2026
Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.15])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id EFFA01A76B6
	for <linux-kernel@vger.kernel.org>; Tue, 24 Sep 2024 12:14:15 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=198.175.65.15
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1727180058; cv=none;
 b=j6XHlat62DIebXSdPFGGFPqhpBIZNc0p1DyL1so3LikU3sEUVAP33ytOjELVB9S2093Q2eNgWUuGfs1BgYPNJmyJ/44zt2RRNlos0nlU1lwx15IemTCdF8y+sPdkEz8RSZG/ZkGVkJpfKCiOMCmziPwYHU1AOuCAXMu/5HHgZTY=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1727180058; c=relaxed/simple;
	bh=J8v7GgpB8rG4dlH7Vt2s0hHP0fI0x25cARwkAg/ekj4=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=AjEr6LIgNyVF8ltmUopWr6aLywVkA7V5kZZT/1kcMtsvXTUP7ZI0fH6cfB192SFK6WNGPEhFcTuezZyoJ5rIh4IwFT0gn69CNXZFvk2wj4PuzF4b8JsnoSmK/g/72Y+AINpf62ZiD4MfI7LpV1DiWh96+KyNSoprb/CjiziSMB4=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=intel.com;
 spf=pass smtp.mailfrom=intel.com;
 dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com
 header.b=eC/GlzIe; arc=none smtp.client-ip=198.175.65.15
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=intel.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=intel.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com
 header.b="eC/GlzIe"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1727180056; x=1758716056;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=J8v7GgpB8rG4dlH7Vt2s0hHP0fI0x25cARwkAg/ekj4=;
  b=eC/GlzIeHI7TAHLVEzEZbcy6smL2GhARxmqFGgW7Mqc45fGgozt0h8Ym
   riOcwGCOJLEfL85hcga6ziTd4E1fEdxJemKzdnRrKjecd0ul1s86ZH9Jq
   LPwEyA+riHkZdR9LPSZbw15E7JQw9X8o8TOuOPaXb/yRyI/r0IeDl5km+
   S1If2R6il3QHyS0tjgu+JxGCxLIGbbchsTF3zlfHc2bIGhdlL4FKCuQNT
   vh+FIpTJ5n6hybYq17Kvt1FigZG+++sVBzRu3sKkzka5sjdGhoxR0UKav
   AX85QfCo9Xfuyy413g0j/wa6Sm/EVvIr+qZBPa7qkYyxago5fRjzDxzvV
   g==;
X-CSE-ConnectionGUID: JdCmSTdORqerx+WX//+ckA==
X-CSE-MsgGUID: MhAGmj2wRAGJR9Pr4iQaXw==
X-IronPort-AV: E=McAfee;i="6700,10204,11204"; a="29881857"
X-IronPort-AV: E=Sophos;i="6.10,254,1719903600";
   d="scan'208";a="29881857"
Received: from orviesa009.jf.intel.com ([10.64.159.149])
  by orvoesa107.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 24 Sep 2024 05:14:16 -0700
X-CSE-ConnectionGUID: wP8ZaTNGTKKOpsAnGV5sPA==
X-CSE-MsgGUID: qWiZ4lveRTW50MLqzR8JqA==
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="6.10,254,1719903600";
   d="scan'208";a="71473241"
Received: from ccbilbre-mobl3.amr.corp.intel.com (HELO
 khuang2-desk.gar.corp.intel.com) ([10.124.221.10])
  by orviesa009-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 24 Sep 2024 05:14:12 -0700
From: Kai Huang <kai.huang@intel.com>
To: dave.hansen@intel.com,
	bp@alien8.de,
	tglx@linutronix.de,
	peterz@infradead.org,
	mingo@redhat.com,
	hpa@zytor.com,
	kirill.shutemov@linux.intel.com
Cc: x86@kernel.org,
	linux-kernel@vger.kernel.org,
	pbonzini@redhat.com,
	seanjc@google.com,
	dan.j.williams@intel.com,
	thomas.lendacky@amd.com,
	rick.p.edgecombe@intel.com,
	isaku.yamahata@intel.com,
	ashish.kalra@amd.com,
	bhe@redhat.com,
	nik.borisov@suse.com,
	sagis@google.com,
	Dave Young <dyoung@redhat.com>
Subject: [PATCH v7 1/5] x86/kexec: do unconditional WBINVD for bare-metal in
 stop_this_cpu()
Date: Wed, 25 Sep 2024 00:13:53 +1200
Message-ID: 
 <9fe9a391ba5aec1d0ef6246546f4f6cda3263ec8.1727179214.git.kai.huang@intel.com>
X-Mailer: git-send-email 2.46.0
In-Reply-To: <cover.1727179214.git.kai.huang@intel.com>
References: <cover.1727179214.git.kai.huang@intel.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

TL;DR:

Do unconditional WBINVD in stop_this_cpu() on bare metal to cover kexec
support for both AMD SME and Intel TDX.  There _was_ an issue that
prevented doing so, but it has since been fixed.

Long version:

Both AMD SME and Intel TDX can leave caches in an incoherent state due
to memory encryption, which can lead to silent memory corruption during
kexec.  To address this issue, it is necessary to flush the caches
before jumping to the second kernel.

Currently, the kernel only performs WBINVD in stop_this_cpu() when SME
is supported by hardware.  To support TDX, instead of adding one more
vendor-specific check, perform WBINVD unconditionally.  Kexec is a slow
path, and the additional WBINVD is acceptable for the sake of simplicity
and maintainability.

It is important to note that WBINVD should only be done for bare-metal
scenarios, as TDX guests and SEV-ES/SEV-SNP guests may not handle the
unexpected exception (#VE or #VC) caused by WBINVD.

Note:

Historically, there _was_ an issue preventing doing unconditional WBINVD
but that has been fixed.

When SME kexec() support was initially added in commit

  bba4ed011a52: ("x86/mm, kexec: Allow kexec to be used with SME")

WBINVD was done unconditionally.  However, it was later reported that
various Intel systems would hang or reset due to that commit.

As an attempted fix, a later commit

  f23d74f6c66c: ("x86/mm: Rework wbinvd, hlt operation in stop_this_cpu()")

then changed to only do WBINVD when hardware supports SME.

While this commit made the reported issues go away, it didn't pinpoint
the root cause.  It also missed a corner case[*], which eventually led
to the discovery of the root cause and the final fix in commit

  1f5e7eb7868e: ("x86/smp: Make stop_other_cpus() more robust")

See [1][2] for more information.

Further testing of unconditional WBINVD on top of the above fix, on the
machines where the issues were originally reported, confirmed that the
issues could no longer be reproduced.

See [3][4] for more information.

Therefore, it is safe to do unconditional WBINVD for bare-metal now.

[*] The commit didn't check whether the CPUID leaf is available.
Querying an unsupported CPUID leaf on Intel returns garbage, resulting
in an unintended WBINVD which caused some issues (followed by the
analysis and the discovery of the final root cause).  The corner case
was independently fixed by commit

  9b040453d444: ("x86/smp: Dont access non-existing CPUID leaf")

Link: https://lore.kernel.org/lkml/28a494ca-3173-4072-921c-6c5f5b257e79@amd=
.com/ [1]
Link: https://lore.kernel.org/lkml/24844584-8031-4b58-ba5c-f85ef2f4c718@amd=
.com/ [2]
Link: https://lore.kernel.org/lkml/20240221092856.GAZdXCWGJL7c9KLewv@fat_cr=
ate.local/ [3]
Link: https://lore.kernel.org/lkml/CALu+AoSZkq1kz-xjvHkkuJ3C71d0SM5ibEJurdg=
mkZqZvNp2dQ@mail.gmail.com/ [4]
Signed-off-by: Kai Huang <kai.huang@intel.com>
Suggested-by: Borislav Petkov <bp@alien8.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Dave Young <dyoung@redhat.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
---

v6 -> v7:
 - Use "Link: <permalink>".

v5 -> v6:
 - No change

v4 -> v5:
 - Add Tom's tag

v3 -> v4:
 - Update part of changelog based on Kirill's version (with minor tweak).
 - Use "exception (#VE or #VC)" for TDX and SEV-ES/SEV-SNP in changelog
   and comments.  (Kirill, Tom)
 - Point out "WBINVD is not necessary for TDX and SEV-ES/SEV-SNP guests"
   in the comment.  (Tom)

v2 -> v3:
 - Change to only do WBINVD for bare metal


---
 arch/x86/kernel/process.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index f63f8fd00a91..d1a20501e686 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -813,18 +813,17 @@ void __noreturn stop_this_cpu(void *dummy)
 	mcheck_cpu_clear(c);
=20
 	/*
-	 * Use wbinvd on processors that support SME. This provides support
-	 * for performing a successful kexec when going from SME inactive
-	 * to SME active (or vice-versa). The cache must be cleared so that
-	 * if there are entries with the same physical address, both with and
-	 * without the encryption bit, they don't race each other when flushed
-	 * and potentially end up with the wrong entry being committed to
-	 * memory.
+	 * The kernel could leave caches in incoherent state on SME/TDX
+	 * capable platforms.  Flush cache to avoid silent memory
+	 * corruption for these platforms.
 	 *
-	 * Test the CPUID bit directly because the machine might've cleared
-	 * X86_FEATURE_SME due to cmdline options.
+	 * stop_this_cpu() isn't a fast path, just do WBINVD for bare-metal
+	 * to cover both SME and TDX.  It isn't necessary to perform WBINVD
+	 * in a guest and performing one could result in an exception (#VE
+	 * or #VC) for a TDX or SEV-ES/SEV-SNP guest that the guest may
+	 * not be able to handle (e.g., TDX guest panics if it sees #VE).
 	 */
-	if (c->extended_cpuid_level >=3D 0x8000001f && (cpuid_eax(0x8000001f) & B=
IT(0)))
+	if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
 		native_wbinvd();
=20
 	/*
--=20
2.46.0
From nobody Sat Feb  7 15:31:54 2026
Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.15])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 908DD1A76CF
	for <linux-kernel@vger.kernel.org>; Tue, 24 Sep 2024 12:14:20 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=198.175.65.15
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1727180062; cv=none;
 b=if3lN4m4sOuxcZrALDdG8sH2W2TTIQQnSemKbTMJT28uw/9LnvUGfZDTul9SfRiNHHEG8La7DksTbqwJLRneor2Ofq+/T5UcxhYtccnLj1QR40B3dack/cAj2E4Pk597bhUqUgAYF1eU0o9bFpY/DfFeEQ3OLJwcpbgMGjzTlTc=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1727180062; c=relaxed/simple;
	bh=3xAr1fBNzxdVYzWS6YJPPW42FxFGJe2ZvmoVf930Cbg=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=ky9gGBcPodsmi7dBJuDyAjnKEQBX06hl70Ab91wDc1a87LOlvZQd+bUSlzRGCbqdKpdQpW2BkB7G/rOsyATheMW4qUTnY+3idKt7x0wEiA24xZUCk2gafzJ8k+DAokY7LmswzkQtH563RlZDwcJ5HtPtoH+Kw2p9YKcHqGGM/0o=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=intel.com;
 spf=pass smtp.mailfrom=intel.com;
 dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com
 header.b=LVvYtaO4; arc=none smtp.client-ip=198.175.65.15
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=intel.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=intel.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com
 header.b="LVvYtaO4"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1727180060; x=1758716060;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=3xAr1fBNzxdVYzWS6YJPPW42FxFGJe2ZvmoVf930Cbg=;
  b=LVvYtaO4WV1XjjpBid8UCgQKw48wPft+6OoH4L8+jLYaWq2dfUse4Gol
   3XfEbUXtU2wwesvGvO0AgVSGpUg4XdbeUoPNY2weP/jcKSvuQgLSwlzBH
   AyDryeZtzi3EvI8G0yzU+806tT0uTha45clepgydgtAYnc5LvYDiHVzOr
   kUi+GM3VRCq/253Z6eRvNUKP0czJQREfMgDmzuV8rMKV0/i6wLnLFZwfI
   4YpYLwk5hRpCQo7iF85EzmRfGXZWEoWPBiWuiVpntQaP4PbPXoVXWITII
   8j5klZJVcfuKNBMGf4dBR/0zRk/9ar3qZcQBKbbhissb3xWvgcwfd3K1f
   w==;
X-CSE-ConnectionGUID: o1vPqqfvQhKSVilZJNTMrw==
X-CSE-MsgGUID: nAfk5UNZQHiCTI67kI4MWg==
X-IronPort-AV: E=McAfee;i="6700,10204,11204"; a="29881874"
X-IronPort-AV: E=Sophos;i="6.10,254,1719903600";
   d="scan'208";a="29881874"
Received: from orviesa009.jf.intel.com ([10.64.159.149])
  by orvoesa107.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 24 Sep 2024 05:14:20 -0700
X-CSE-ConnectionGUID: ezaBS902TW+zyH0TeuZXLg==
X-CSE-MsgGUID: Wg/TojsiTrmzi8hdAOhKdQ==
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="6.10,254,1719903600";
   d="scan'208";a="71473247"
Received: from ccbilbre-mobl3.amr.corp.intel.com (HELO
 khuang2-desk.gar.corp.intel.com) ([10.124.221.10])
  by orviesa009-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 24 Sep 2024 05:14:16 -0700
From: Kai Huang <kai.huang@intel.com>
To: dave.hansen@intel.com,
	bp@alien8.de,
	tglx@linutronix.de,
	peterz@infradead.org,
	mingo@redhat.com,
	hpa@zytor.com,
	kirill.shutemov@linux.intel.com
Cc: x86@kernel.org,
	linux-kernel@vger.kernel.org,
	pbonzini@redhat.com,
	seanjc@google.com,
	dan.j.williams@intel.com,
	thomas.lendacky@amd.com,
	rick.p.edgecombe@intel.com,
	isaku.yamahata@intel.com,
	ashish.kalra@amd.com,
	bhe@redhat.com,
	nik.borisov@suse.com,
	sagis@google.com,
	Dave Young <dyoung@redhat.com>,
	David Kaplan <david.kaplan@amd.com>
Subject: [PATCH v7 2/5] x86/kexec: do unconditional WBINVD for bare-metal in
 relocate_kernel()
Date: Wed, 25 Sep 2024 00:13:54 +1200
Message-ID: 
 <afd9722e12df95cf3bad49ec48ae4516784395d6.1727179214.git.kai.huang@intel.com>
X-Mailer: git-send-email 2.46.0
In-Reply-To: <cover.1727179214.git.kai.huang@intel.com>
References: <cover.1727179214.git.kai.huang@intel.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Both SME and TDX can leave caches in an incoherent state due to memory
encryption.  During kexec, the caches must be flushed before jumping to
the second kernel to avoid silent memory corruption of the second kernel.

During kexec, the WBINVD in stop_this_cpu() flushes caches for all
remote cpus when they are being stopped.  For SME, the WBINVD in
relocate_kernel() flushes the cache for the last running cpu (which is
executing the kexec).

Similarly, to support kexec for TDX host, after stopping all remote cpus
with cache flushed, the kernel needs to flush cache for the last running
cpu.

Use the existing WBINVD in relocate_kernel() to cover TDX host as well.

However, instead of sprinkling around vendor-specific checks, just do
unconditional WBINVD to cover both SME and TDX.  Kexec is not a fast path
so having one additional WBINVD for platforms w/o SME/TDX is acceptable.

But only do WBINVD for bare-metal because TDX guests and SEV-ES/SEV-SNP
guests would get an unexpected (and unnecessary) exception (#VE or #VC)
which the kernel is unable to handle at this stage.

Note commit 93c1800b3799 ("x86/kexec: Fix bug with call depth tracking")
moved the 'cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT)' call, which used to
be evaluated as an argument of relocate_kernel(), to before
load_segments() by adding a variable 'host_mem_enc_active'.  The reason
was that calling cc_platform_has() after load_segments() caused a fault
and system crash when call depth tracking is active, because
load_segments() resets GS to 0 but call depth tracking uses per-CPU
variables to operate.

Use !cpu_feature_enabled(X86_FEATURE_HYPERVISOR) to check whether the
kernel runs on bare-metal.  cpu_feature_enabled() is always inlined
rather than being a function call, thus it is safe to use it after
load_segments() when call depth tracking is enabled.  Remove the
'host_mem_enc_active' variable and use cpu_feature_enabled() directly as
the argument when calling relocate_kernel().

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: David Kaplan <david.kaplan@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: David Kaplan <david.kaplan@amd.com>
---

v6 -> v7:
 - Add a comment to load_segments() to call out not to make function
   call after it - David Kaplan.
 - Add David's Tested-by.

v5 -> v6:
 - Use cpu_feature_enabled() instead of boot_cpu_has() - Boris
 - Resolve rebase conflict with commit 93c1800b3799 ("x86/kexec: Fix bug
   with call depth tracking")

v4 -> v5:
 - Add Tom's tag

v3 -> v4:
 - Use "exception (#VE or #VC)" for TDX and SEV-ES/SEV-SNP in changelog
   and comments.  (Kirill, Tom)
 - "Save the bare_metal" -> "Save the bare_metal flag" (Tom)
 - Point out "WBINVD is not necessary for TDX and SEV-ES/SEV-SNP guests"
   in the comment.  (Tom)

v2 -> v3:
 - Change to only do WBINVD for bare metal


---
 arch/x86/include/asm/kexec.h         |  2 +-
 arch/x86/kernel/machine_kexec_64.c   | 14 ++++++--------
 arch/x86/kernel/relocate_kernel_64.S | 19 +++++++++++++++----
 3 files changed, 22 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index ae5482a2f0ca..b3429c70847d 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -128,7 +128,7 @@ relocate_kernel(unsigned long indirection_page,
 		unsigned long page_list,
 		unsigned long start_address,
 		unsigned int preserve_context,
-		unsigned int host_mem_enc_active);
+		unsigned int bare_metal);
 #endif
=20
 #define ARCH_HAS_KIMAGE_ARCH
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_k=
exec_64.c
index 9c9ac606893e..6c24b0e4051e 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -322,16 +322,9 @@ void machine_kexec_cleanup(struct kimage *image)
 void machine_kexec(struct kimage *image)
 {
 	unsigned long page_list[PAGES_NR];
-	unsigned int host_mem_enc_active;
 	int save_ftrace_enabled;
 	void *control_page;
=20
-	/*
-	 * This must be done before load_segments() since if call depth tracking
-	 * is used then GS must be valid to make any function calls.
-	 */
-	host_mem_enc_active =3D cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT);
-
 #ifdef CONFIG_KEXEC_JUMP
 	if (image->preserve_context)
 		save_processor_state();
@@ -378,6 +371,11 @@ void machine_kexec(struct kimage *image)
 	 *
 	 * I take advantage of this here by force loading the
 	 * segments, before I zap the gdt with an invalid value.
+	 *
+	 * Note this resets GS to 0.  Don't make any function call after
+	 * here since call depth tracking uses per-cpu variables to
+	 * operate (relocate_kernel() is explicitly ignored by call
+	 * depth tracking).
 	 */
 	load_segments();
 	/*
@@ -392,7 +390,7 @@ void machine_kexec(struct kimage *image)
 				       (unsigned long)page_list,
 				       image->start,
 				       image->preserve_context,
-				       host_mem_enc_active);
+				       !cpu_feature_enabled(X86_FEATURE_HYPERVISOR));
=20
 #ifdef CONFIG_KEXEC_JUMP
 	if (image->preserve_context)
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocat=
e_kernel_64.S
index e9e88c342f75..19821c3fbc46 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -52,7 +52,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
 	 * %rsi page_list
 	 * %rdx start address
 	 * %rcx preserve_context
-	 * %r8  host_mem_enc_active
+	 * %r8  bare_metal
 	 */
=20
 	/* Save the CPU context, used for jumping back */
@@ -80,7 +80,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
 	pushq $0
 	popfq
=20
-	/* Save SME active flag */
+	/* Save the bare_metal flag */
 	movq	%r8, %r12
=20
 	/*
@@ -161,9 +161,20 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
 	movq	%r9, %cr3
=20
 	/*
-	 * If SME is active, there could be old encrypted cache line
+	 * The kernel could leave caches in incoherent state on SME/TDX
+	 * capable platforms.  Just do unconditional WBINVD to avoid
+	 * silent memory corruption to the new kernel for these platforms.
+	 *
+	 * For SME, need to flush cache here before copying the kernel.
+	 * When it is active, there could be old encrypted cache line
 	 * entries that will conflict with the now unencrypted memory
-	 * used by kexec. Flush the caches before copying the kernel.
+	 * used by kexec.
+	 *
+	 * Do WBINVD for bare-metal only to cover both SME and TDX.  It
+	 * isn't necessary to perform a WBINVD in a guest and performing
+	 * one could result in an exception (#VE or #VC) for a TDX or
+	 * SEV-ES/SEV-SNP guest that can crash the guest since, at this
+	 * stage, the kernel has torn down the IDT.
 	 */
 	testq	%r12, %r12
 	jz .Lsme_off
--=20
2.46.0
From nobody Sat Feb  7 15:31:54 2026
Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.15])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id C54F81A7ACD
	for <linux-kernel@vger.kernel.org>; Tue, 24 Sep 2024 12:14:24 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=198.175.65.15
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1727180066; cv=none;
 b=tHdhzhQa9xTaZSgOVrVm36GxtnIlLy+H9ABQQoSoEzlobg/CFt0w+4lQGAbldF05Ug2iqCDFHvSwnjznlZgWBdHPfZ+1PvkWjlXddDVo49fhZneoB3Gf4E60EZVzqTYB4P/hS6Ad4f4Xni4MmMYGTjBZ9+jZCVdgYH9dbViOSuw=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1727180066; c=relaxed/simple;
	bh=i9I7CYjN9Hg7NAX4f/c7zsafdYgH7puGwntScyQFo64=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=lFzfDjxruJ8dV4oZvAU55wMfPvJj9M7H0aRISFxUfGGqiWq8JNp9ap3I8k9Y/Pkf0IgGFG1nh3AJFp7sS4arX0XTX5nWrca9KRwKEhT2OqpvvHsKiv17yvSPwzdG2j+dFAMaUq4zAbw8XEzw6SdXNxs9yaxDMZNiwHCP7ujFG2I=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=intel.com;
 spf=pass smtp.mailfrom=intel.com;
 dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com
 header.b=n4zvlQUA; arc=none smtp.client-ip=198.175.65.15
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=intel.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=intel.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com
 header.b="n4zvlQUA"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1727180065; x=1758716065;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=i9I7CYjN9Hg7NAX4f/c7zsafdYgH7puGwntScyQFo64=;
  b=n4zvlQUABHrQNf0/bOS4jUQdcvdwEO93SgYskkPFriiL6OIfckGCeE3n
   e6rUBd2r8yeR/oEvA1t7FstQQ54dOh7HHGoxPqbnozBhLCdbc3tiC4GQW
   q7aKb0Be+jWfRAkRtaR8tTGXT4bnsC9Ag/2HcTP3b43RyXS5iqM0C41xB
   kXITsphGVWR7kmfub6KSXg56mURf4oGGJ3wYJsny+NsTf/idFelwcK18B
   Q34giMrTzBN33+lOKOykJNrQLawNUbWYXwVLnJhCz2wIWjot8L/qxd6OR
   ioF4R5KACZj7UZ0Ns0m0eK5Z5kEdX+XIIKBUK7mxcbIJHqgHLMndL0p1j
   Q==;
X-CSE-ConnectionGUID: OLgNsQVGR0OeZOViYbHQFw==
X-CSE-MsgGUID: ZDnlo17VQJGcOd/qJYPLwQ==
X-IronPort-AV: E=McAfee;i="6700,10204,11204"; a="29881884"
X-IronPort-AV: E=Sophos;i="6.10,254,1719903600";
   d="scan'208";a="29881884"
Received: from orviesa009.jf.intel.com ([10.64.159.149])
  by orvoesa107.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 24 Sep 2024 05:14:24 -0700
X-CSE-ConnectionGUID: J2w3TNJKQ4im6zjhmWEu6g==
X-CSE-MsgGUID: yNZgZSA+Qwy1j3C2GsqoNw==
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="6.10,254,1719903600";
   d="scan'208";a="71473251"
Received: from ccbilbre-mobl3.amr.corp.intel.com (HELO
 khuang2-desk.gar.corp.intel.com) ([10.124.221.10])
  by orviesa009-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 24 Sep 2024 05:14:21 -0700
From: Kai Huang <kai.huang@intel.com>
To: dave.hansen@intel.com,
	bp@alien8.de,
	tglx@linutronix.de,
	peterz@infradead.org,
	mingo@redhat.com,
	hpa@zytor.com,
	kirill.shutemov@linux.intel.com
Cc: x86@kernel.org,
	linux-kernel@vger.kernel.org,
	pbonzini@redhat.com,
	seanjc@google.com,
	dan.j.williams@intel.com,
	thomas.lendacky@amd.com,
	rick.p.edgecombe@intel.com,
	isaku.yamahata@intel.com,
	ashish.kalra@amd.com,
	bhe@redhat.com,
	nik.borisov@suse.com,
	sagis@google.com
Subject: [PATCH v7 3/5] x86/virt/tdx: Make module initialization state
 immutable in reboot notifier
Date: Wed, 25 Sep 2024 00:13:55 +1200
Message-ID: 
 <3372ac6b77a6fbc98be17fcbf6680ce5497c6251.1727179214.git.kai.huang@intel.com>
X-Mailer: git-send-email 2.46.0
In-Reply-To: <cover.1727179214.git.kai.huang@intel.com>
References: <cover.1727179214.git.kai.huang@intel.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

If the kernel has ever enabled TDX, part of system memory remains TDX
private memory when kexec happens.  E.g., the PAMT (Physical Address
Metadata Table) pages used by the TDX module to track each TDX memory
page's state are never freed once the TDX module is initialized.

In kexec, the kernel will need to convert all TDX private pages back to
normal when the platform has the TDX "partial write machine check"
erratum.  Such conversion will need to be done after stopping all remote
CPUs so that no more TDX activity can possibly happen.

Register a reboot notifier to make the TDX module initialization state
immutable during the preparation phase of kexec, so that the kernel can
later use the module state to determine whether it is possible for the
system to have any TDX private pages.  Otherwise, a remote CPU could be
stopped while it is in the middle of module initialization and the
module state wouldn't be able to reflect this.

Specifically, when the reboot notifier is called, stop further module
initialization if the kernel hasn't enabled TDX yet.  If any other
thread is trying to initialize the TDX module, wait for the ongoing
module initialization to finish.

The reboot notifier is triggered when the kernel goes to reboot, kexec,
halt or shutdown.  In all of these cases, there's no need to allow the
kernel to continue to initialize the TDX module (if not done yet).

Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v6 -> v7:
 - No change

v5 -> v6:
 - No change

v4 -> v5:
 - New patch to split the 'tdx_rebooting' around reboot notifier (Dave).


---
 arch/x86/virt/vmx/tdx/tdx.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 4e2b2e2ac9f9..c33417fe4086 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -27,6 +27,7 @@
 #include <linux/log2.h>
 #include <linux/acpi.h>
 #include <linux/suspend.h>
+#include <linux/reboot.h>
 #include <asm/page.h>
 #include <asm/special_insns.h>
 #include <asm/msr-index.h>
@@ -52,6 +53,8 @@ static DEFINE_MUTEX(tdx_module_lock);
 /* All TDX-usable memory regions.  Protected by mem_hotplug_lock. */
 static LIST_HEAD(tdx_memlist);
=20
+static bool tdx_rebooting;
+
 typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *arg=
s);
=20
 static inline void seamcall_err(u64 fn, u64 err, struct tdx_module_args *a=
rgs)
@@ -1185,6 +1188,9 @@ static int __tdx_enable(void)
 {
 	int ret;
=20
+	if (tdx_rebooting)
+		return -EINVAL;
+
 	ret =3D init_tdx_module();
 	if (ret) {
 		pr_err("module initialization failed (%d)\n", ret);
@@ -1418,6 +1424,21 @@ static struct notifier_block tdx_memory_nb =3D {
 	.notifier_call =3D tdx_memory_notifier,
 };
=20
+static int tdx_reboot_notifier(struct notifier_block *nb, unsigned long mo=
de,
+			       void *unused)
+{
+	/* Wait for ongoing TDX initialization to finish */
+	mutex_lock(&tdx_module_lock);
+	tdx_rebooting =3D true;
+	mutex_unlock(&tdx_module_lock);
+
+	return NOTIFY_OK;
+}
+
+static struct notifier_block tdx_reboot_nb =3D {
+	.notifier_call =3D tdx_reboot_notifier,
+};
+
 static void __init check_tdx_erratum(void)
 {
 	/*
@@ -1472,6 +1493,14 @@ void __init tdx_init(void)
 		return;
 	}
=20
+	err =3D register_reboot_notifier(&tdx_reboot_nb);
+	if (err) {
+		pr_err("initialization failed: register_reboot_notifier() failed (%d)\n",
+				err);
+		unregister_memory_notifier(&tdx_memory_nb);
+		return;
+	}
+
 #if defined(CONFIG_ACPI) && defined(CONFIG_SUSPEND)
 	pr_info("Disable ACPI S3. Turn off TDX in the BIOS to use ACPI S3.\n");
 	acpi_suspend_lowlevel =3D NULL;
--=20
2.46.0
From nobody Sat Feb  7 15:31:54 2026
Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.15])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 036041A7271
	for <linux-kernel@vger.kernel.org>; Tue, 24 Sep 2024 12:14:28 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=198.175.65.15
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1727180070; cv=none;
 b=T/2IHVh3B0m5B0+QEbGU7tE5icr6xOorlBJltSEvvup+r4mNtNFNcdGVSG+fr1inu/w1fL8dhlyAYvTmb2xSDoRRzgrS3D/O/oZHfdbHhPEOb/osuMcP6/eMmZ/XZZ4lMiN76h3aKPVP40QMrEcDySMgW0R6yuBJpMW7YN8NWrs=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1727180070; c=relaxed/simple;
	bh=tL6FecHXDg7D0l6AR0fJYB0wk47IIT3B4Ra0l1pLd74=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=E9SFUUH66OU9jy9j6opSxiwRHdlb3duWh3i4T3HMm82ZYBArdpcu5tyBH/975I3GBL6jHnp/QT6xs3LBnk8bhtNckWFspGEHSjl2yye6tgzdNd6/57O7uIEoW4MPS/a3eOBg/Qrtc6CWDlk1iry5aTpsCSObVXwNhggRtIPfbj0=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=intel.com;
 spf=pass smtp.mailfrom=intel.com;
 dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com
 header.b=YzkrVey6; arc=none smtp.client-ip=198.175.65.15
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=intel.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=intel.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com
 header.b="YzkrVey6"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1727180069; x=1758716069;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=tL6FecHXDg7D0l6AR0fJYB0wk47IIT3B4Ra0l1pLd74=;
  b=YzkrVey6ObCNEweDnliRPZrLpO4tkgrRXZ1omPm4kMozNVZHG8ECWcu7
   gZCUH43TkzVNhHZCX0HhJ50616XjxYZS/+xzGarfvSyqR2f+lmcN5WbRM
   B9Pb8QMwESG5d4EEroysQvVJyQlXjevvNEf79IAjzbIUBsVR1qcP2VPMO
   gvGqHC0ywWBSuCgtfOcr66GcCregbBZTUrxyY3h/geaQWUHoaRFmG6owK
   fjtcFohzPU5raw2qq8c5jeSrc7a3HxkllM6GLiYVm536z4dPK5Wf4HJ8f
   Id1WOWFg8ux1fIpLlnsgkVPemhTvBzDZynaBSPrXMWL3xRu0+Pnx2MJsk
   A==;
X-CSE-ConnectionGUID: G4OecezuTWqZzARRttlJ2g==
X-CSE-MsgGUID: 93buNSo8Tqm/UaLy4b8Kwg==
X-IronPort-AV: E=McAfee;i="6700,10204,11204"; a="29881897"
X-IronPort-AV: E=Sophos;i="6.10,254,1719903600";
   d="scan'208";a="29881897"
Received: from orviesa009.jf.intel.com ([10.64.159.149])
  by orvoesa107.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 24 Sep 2024 05:14:29 -0700
X-CSE-ConnectionGUID: ZmIBk5cRSyuOHMcqCyreDQ==
X-CSE-MsgGUID: tViPq359REOKug3NsyqfTQ==
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="6.10,254,1719903600";
   d="scan'208";a="71473258"
Received: from ccbilbre-mobl3.amr.corp.intel.com (HELO
 khuang2-desk.gar.corp.intel.com) ([10.124.221.10])
  by orviesa009-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 24 Sep 2024 05:14:25 -0700
From: Kai Huang <kai.huang@intel.com>
To: dave.hansen@intel.com,
	bp@alien8.de,
	tglx@linutronix.de,
	peterz@infradead.org,
	mingo@redhat.com,
	hpa@zytor.com,
	kirill.shutemov@linux.intel.com
Cc: x86@kernel.org,
	linux-kernel@vger.kernel.org,
	pbonzini@redhat.com,
	seanjc@google.com,
	dan.j.williams@intel.com,
	thomas.lendacky@amd.com,
	rick.p.edgecombe@intel.com,
	isaku.yamahata@intel.com,
	ashish.kalra@amd.com,
	bhe@redhat.com,
	nik.borisov@suse.com,
	sagis@google.com
Subject: [PATCH v7 4/5] x86/kexec: Reset TDX private memory on platforms with
 TDX erratum
Date: Wed, 25 Sep 2024 00:13:56 +1200
Message-ID: 
 <6960ef6d7ee9398d164bf3997e6009df3e88cb67.1727179214.git.kai.huang@intel.com>
X-Mailer: git-send-email 2.46.0
In-Reply-To: <cover.1727179214.git.kai.huang@intel.com>
References: <cover.1727179214.git.kai.huang@intel.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

TL;DR:

On platforms with the TDX "partial write machine check" erratum, convert
TDX private memory back to normal during kexec, before jumping to the
second kernel, so that the second kernel does not see an unexpected
machine check.

Long version:

The first few generations of TDX hardware have an erratum.  A partial
write to a TDX private memory cacheline will silently "poison" the
line.  Subsequent reads will consume the poison and generate a machine
check.  According to the TDX hardware spec, neither of these things
should have happened.

=3D=3D Background =3D=3D

Virtually all kernel memory access operations happen in full
cachelines.  In practice, writing a "byte" of memory usually reads a 64
byte cacheline of memory, modifies it, then writes the whole line back.
Those operations do not trigger this problem.

This problem is triggered by "partial" writes where a write transaction
of less than a cacheline lands at the memory controller.  The CPU does
these via non-temporal write instructions (like MOVNTI), or through
UC/WC memory mappings.  The issue can also be triggered away from the
CPU by devices doing partial writes via DMA.

=3D=3D Problem =3D=3D

A fast warm reset doesn't reset TDX private memory.  kexec() also boots
into the new kernel directly, without resetting memory.  Thus, if the
old kernel has left any TDX private pages on a platform with this
erratum, the new kernel might get an unexpected machine check.

Note that w/o this erratum any kernel read/write on TDX private memory
should never cause a machine check, thus it's OK for the old kernel to
leave TDX private pages as is.

Also note that only the normal kexec needs to worry about this problem,
not the crash kexec: 1) The kdump kernel only uses the special memory
reserved by the first kernel, and the reserved memory can never be used
by TDX in the first kernel; 2) /proc/vmcore, which reflects the first
(crashed) kernel's memory, is only read.  Reads never "poison" TDX
memory and thus never cause an unexpected machine check (only partial
writes do).

=3D=3D Solution =3D=3D

In short, with this erratum, the kernel needs to explicitly convert all
TDX private pages back to normal (using MOVDIR64B to reset these pages)
to give the new kernel a clean slate after kexec().

The BIOS is also expected to disable fast warm reset as a workaround to
this erratum, thus this implementation doesn't try to reset TDX private
memory for the normal reboot case in the kernel but depends on the BIOS
to enable the workaround.

Reset TDX private pages in machine_kexec() so that: 1) all remote cpus
are stopped with cache flushed and there's no more TDX activity; 2) no
memory reset overhead for the normal reboot case since the BIOS is
expected to turn on the workaround.

There are different types of TDX private pages.  The TDX module itself
uses PAMT (Physical Address Metadata Table) to track each TDX memory
page's state.  TDX guests also have guest private memory and secure-EPT
pages.

It would be ideal to reset all types of TDX private memory once and for
all in machine_kexec(), but there are practical problems in doing so:

1) There's no existing infrastructure to track TDX private pages;
2) It's not feasible to query the TDX module about page type, because
   VMX, which is required to make SEAMCALLs, has already been disabled;
3) Even if it were feasible to query the TDX module, the result may not
   be accurate.  E.g., the remote CPU could be stopped right before the
   MOVDIR64B.

One temporary solution is to blindly convert all memory pages, but that
is problematic too, because not all pages are mapped as writable in the
direct mapping.  It could be done by switching to the identity mapping
created for kexec(), or to a new page table, but the added complexity
is overkill.

Therefore, rather than doing something dramatic, only reset PAMT pages
in machine_kexec().  All the in-kernel TDX users (e.g., KVM) need to
reset the TDX private pages that they manage before machine_kexec() by
registering either a reboot notifier or syscore shutdown ops.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
v6 -> v7:
 - No change

v5 -> v6:
 - No change

v4 -> v5:
 - Remove the TDX-specific notifier, since there's no need to handle
   crash kexec specially.
 - Minor update to changelog and comments.

v3 -> v4:
 - No change

v2 -> v3:
 - No change

v1 -> v2:
 - Remove using reboot notifier to stop TDX module as it doesn't
   cover crash kexec.  Change to use a variable with barrier instead.
   (Rick)
 - Introduce kexec_save_processor_start() to make code better, and
   make the comment around calling site of tdx_reset_memory() more
   concise. (Dave)
 - Mention that the caches of all other cpus have been flushed around
   native_wbinvd() in tdx_reset_memory(). (Dave)
 - Remove the extended alternatives discussion from the comment, but
   leave it in the changelog. Point out what the current code does and
   point out the risk. (Dave)


---
 arch/x86/include/asm/tdx.h         |  2 ++
 arch/x86/kernel/machine_kexec_64.c | 27 +++++++++++++---
 arch/x86/virt/vmx/tdx/tdx.c        | 49 ++++++++++++++++++++++++++++++
 3 files changed, 74 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index eba178996d84..ed3ac9a8a079 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -116,11 +116,13 @@ static inline u64 sc_retry(sc_func_t func, u64 fn,
 int tdx_cpu_enable(void);
 int tdx_enable(void);
 const char *tdx_dump_mce_info(struct mce *m);
+void tdx_reset_memory(void);
 #else
 static inline void tdx_init(void) { }
 static inline int tdx_cpu_enable(void) { return -ENODEV; }
 static inline int tdx_enable(void)  { return -ENODEV; }
 static inline const char *tdx_dump_mce_info(struct mce *m) { return NULL; }
+static inline void tdx_reset_memory(void) { }
 #endif	/* CONFIG_INTEL_TDX_HOST */
=20
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_k=
exec_64.c
index 6c24b0e4051e..83d0d7f3ec69 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -29,6 +29,7 @@
 #include <asm/set_memory.h>
 #include <asm/cpu.h>
 #include <asm/efi.h>
+#include <asm/tdx.h>
=20
 #ifdef CONFIG_ACPI
 /*
@@ -315,6 +316,14 @@ void machine_kexec_cleanup(struct kimage *image)
 	free_transition_pgtable(image);
 }
=20
+static void kexec_save_processor_start(struct kimage *image)
+{
+#ifdef CONFIG_KEXEC_JUMP
+	if (image->preserve_context)
+		save_processor_state();
+#endif
+}
+
 /*
  * Do not allocate memory (or fail in any way) in machine_kexec().
  * We are past the point of no return, committed to rebooting now.
@@ -325,10 +334,20 @@ void machine_kexec(struct kimage *image)
 	int save_ftrace_enabled;
 	void *control_page;
=20
-#ifdef CONFIG_KEXEC_JUMP
-	if (image->preserve_context)
-		save_processor_state();
-#endif
+	kexec_save_processor_start(image);
+
+	/*
+	 * Convert TDX private memory back to normal (when needed) to
+	 * avoid the second kernel potentially seeing unexpected machine
+	 * check.
+	 *
+	 * However skip this when preserve_context is on.  By reaching
+	 * here, TDX (if ever got enabled by the kernel) has survived
+	 * from the suspend when preserve_context is on, and it can
+	 * continue to work after jumping back from the second kernel.
+	 */
+	if (!image->preserve_context)
+		tdx_reset_memory();
=20
 	save_ftrace_enabled =3D __ftrace_enabled_save();
=20
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index c33417fe4086..a69a65f57616 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1518,3 +1518,52 @@ void __init tdx_init(void)
=20
 	check_tdx_erratum();
 }
+
+void tdx_reset_memory(void)
+{
+	if (!boot_cpu_has(X86_FEATURE_TDX_HOST_PLATFORM))
+		return;
+
+	/*
+	 * Kernel read/write to TDX private memory doesn't cause
+	 * machine check on hardware w/o this erratum.
+	 */
+	if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
+		return;
+
+	/*
+	 * Converting TDX private pages back to normal must be done
+	 * after all remote cpus have been stopped so that no more
+	 * TDX activity can happen and caches have been flushed.
+	 */
+	WARN_ON_ONCE(num_online_cpus() !=3D 1);
+
+	/*
+	 * The system can only have TDX private memory after the TDX
+	 * module has been initialized.  tdx_reboot_notifier() has made
+	 * sure @tdx_module_status reflects the module initialization
+	 * status correctly and is immutable by now thus can be read
+	 * w/o holding lock.
+	 */
+	if (tdx_module_status !=3D TDX_MODULE_INITIALIZED)
+		return;
+
+	/*
+	 * All remote cpus have been stopped, and their caches have
+	 * been flushed in stop_this_cpu().  Now flush cache for the
+	 * last running cpu _before_ converting TDX private pages.
+	 */
+	native_wbinvd();
+
+	/*
+	 * It's ideal to cover all types of TDX private pages here, but
+	 * currently there's no unified way to tell whether a given page
+	 * is TDX private page or not.
+	 *
+	 * Only convert PAMT here.  All in-kernel TDX users (e.g., KVM)
+	 * are responsible for converting TDX private pages that are
+	 * managed by them by either registering reboot notifier or
+	 * shutdown syscore ops.
+	 */
+	tdmrs_reset_pamt_all(&tdx_tdmr_list);
+}
--=20
2.46.0
From nobody Sat Feb  7 15:31:54 2026
Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.15])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 353D41A7ACD
	for <linux-kernel@vger.kernel.org>; Tue, 24 Sep 2024 12:14:33 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=198.175.65.15
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1727180074; cv=none;
 b=r5ipdpmVL9786UMfCgpze1sqObQv/Ovwqq8skItifDkGCk32xifmhkGtp/9vUBz1odGcBT6W5Te6RPlsBH4++hn7yfRe/uFXb4BaJWBAyFsUAONiussvTJreUCqoumS5lQAzZhCMCcjOPS3twBiFZZo5tDco+HDm3GeaN1ei2DQ=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1727180074; c=relaxed/simple;
	bh=LIcTyHmfasKpbOmTSceVs68nnCqXeXKIJPkSRF3AgaE=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=cgM5iY5lMLcO+jcoZPmrIBkgEmAHaExNuqY+UtCZ02qtF2UwdxXuMmbw0qLMtxGefTfSAhhZnqZAntijbJfcAb4k32OT4iVg9IbD+bEszyXUcv6PwtXRa6SR6559Xdaokro0jWfkSqp/0P1xJso9uWPh7VuQ3u+y7sIHJrAUeZQ=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=intel.com;
 spf=pass smtp.mailfrom=intel.com;
 dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com
 header.b=EMsYgFdM; arc=none smtp.client-ip=198.175.65.15
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=intel.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=intel.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com
 header.b="EMsYgFdM"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1727180073; x=1758716073;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=LIcTyHmfasKpbOmTSceVs68nnCqXeXKIJPkSRF3AgaE=;
  b=EMsYgFdMfzbEQxDXoPBV41Q2Zn208w0KwgecMQ5gjvyA6jlyfGvYrtzg
   ZBd36WNbIXt+Dy3X9+ZM2qmlumOTSUdjctOfbuMDEbfgBzdo2xJNxm0nW
   QJz/LqEweLuvUOOuN+Z2kFgTsaYEbXR42KvUlvZwRLLM9PaXch9TXmThk
   ULsaqJWUDRGugYlspfEdGq2nwRlOewbxrNBOT9ZJeJCW0POiiYlXxWUX7
   gg6RobADLtiFXJeitTHa53C0NFg9nVjPlIXTuLpk4Ifl8FavUuzVIJ7rk
   pXRMk57WoOEpCYShMYTzk/oA136qd220iu6s1CA6xzvZmu0MMcS+z1492
   g==;
X-CSE-ConnectionGUID: a2+A3A7WRlG1J5VFzZUspQ==
X-CSE-MsgGUID: gp6kqMluRvy+2c/yhWbj2w==
X-IronPort-AV: E=McAfee;i="6700,10204,11204"; a="29881909"
X-IronPort-AV: E=Sophos;i="6.10,254,1719903600";
   d="scan'208";a="29881909"
Received: from orviesa009.jf.intel.com ([10.64.159.149])
  by orvoesa107.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 24 Sep 2024 05:14:33 -0700
X-CSE-ConnectionGUID: AqwIIVxbT2WT3KZSgINwsQ==
X-CSE-MsgGUID: 8EX/RDWoQgG3b7ppw9DUGQ==
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="6.10,254,1719903600";
   d="scan'208";a="71473263"
Received: from ccbilbre-mobl3.amr.corp.intel.com (HELO
 khuang2-desk.gar.corp.intel.com) ([10.124.221.10])
  by orviesa009-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 24 Sep 2024 05:14:29 -0700
From: Kai Huang <kai.huang@intel.com>
To: dave.hansen@intel.com,
	bp@alien8.de,
	tglx@linutronix.de,
	peterz@infradead.org,
	mingo@redhat.com,
	hpa@zytor.com,
	kirill.shutemov@linux.intel.com
Cc: x86@kernel.org,
	linux-kernel@vger.kernel.org,
	pbonzini@redhat.com,
	seanjc@google.com,
	dan.j.williams@intel.com,
	thomas.lendacky@amd.com,
	rick.p.edgecombe@intel.com,
	isaku.yamahata@intel.com,
	ashish.kalra@amd.com,
	bhe@redhat.com,
	nik.borisov@suse.com,
	sagis@google.com
Subject: [PATCH v7 5/5] x86/virt/tdx: Remove the !KEXEC_CORE dependency
Date: Wed, 25 Sep 2024 00:13:57 +1200
Message-ID: 
 <714e950b49fa1ec4e9cad5170cf8dcf80c294798.1727179214.git.kai.huang@intel.com>
X-Mailer: git-send-email 2.46.0
In-Reply-To: <cover.1727179214.git.kai.huang@intel.com>
References: <cover.1727179214.git.kai.huang@intel.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

The TDX host can now work with kexec().  Remove the !KEXEC_CORE dependency.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/Kconfig | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9ffdacaaa725..a9e231873f6a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1975,7 +1975,6 @@ config INTEL_TDX_HOST
 	depends on X86_X2APIC
 	select ARCH_KEEP_MEMBLOCK
 	depends on CONTIG_ALLOC
-	depends on !KEXEC_CORE
 	depends on X86_MCE
 	help
 	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
--=20
2.46.0