From: Kai Huang <kai.huang@intel.com>

Some early TDX-capable platforms have an erratum: A kernel partial
write (a write transaction of less than cacheline lands at memory
controller) to TDX private memory poisons that memory, and a subsequent
read triggers a machine check.

On those platforms, the old kernel must reset TDX private memory before
jumping to the new kernel, otherwise the new kernel may see unexpected
machine check. Currently the kernel doesn't track which page is a TDX
private page. For simplicity just fail kexec/kdump for those platforms.

Leverage the existing machine_kexec_prepare() to fail kexec/kdump by
adding the check of the presence of the TDX erratum (which is only
checked for if the kernel is built with TDX host support). This rejects
kexec/kdump when the kernel is loading the kexec/kdump kernel image.

The alternative is to reject kexec/kdump when the kernel is jumping to
the new kernel. But for kexec this requires adding a new check (e.g.,
arch_kexec_allowed()) in the common code to fail kernel_kexec() at early
stage. Kdump (crash_kexec()) needs similar check, but it's hard to
justify because crash_kexec() is not supposed to abort.

It's feasible to further relax this limitation, i.e., only fail kexec
when TDX is actually enabled by the kernel. But this is still a half
measure compared to resetting TDX private memory so just do the simplest
thing for now.

The impact to userspace is the users will get an error when loading the
kexec/kdump kernel image:

  kexec_load failed: Operation not supported

This might be confusing to the users, thus also print the reason in the
dmesg:

  [..] kexec: Not allowed on platform with tdx_pw_mce bug.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
arch/x86/kernel/machine_kexec_64.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 34c303a92eaf..201137b98fb8 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -347,6 +347,22 @@ int machine_kexec_prepare(struct kimage *image)
 	unsigned long reloc_end = (unsigned long)__relocate_kernel_end;
 	int result;
 
+	/*
+	 * Some early TDX-capable platforms have an erratum.  A kernel
+	 * partial write (a write transaction of less than cacheline
+	 * lands at memory controller) to TDX private memory poisons that
+	 * memory, and a subsequent read triggers a machine check.
+	 *
+	 * On those platforms the old kernel must reset TDX private
+	 * memory before jumping to the new kernel otherwise the new
+	 * kernel may see unexpected machine check.  For simplicity
+	 * just fail kexec/kdump on those platforms.
+	 */
+	if (boot_cpu_has_bug(X86_BUG_TDX_PW_MCE)) {
+		pr_info_once("Not allowed on platform with tdx_pw_mce bug\n");
+		return -EOPNOTSUPP;
+	}
+
 	/* Setup the identity mapped 64bit page table */
 	result = init_pgtable(image, __pa(control_page));
 	if (result)
--
2.51.0

On Mon, Sep 1, 2025 at 9:11 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> From: Kai Huang <kai.huang@intel.com>
>
> Some early TDX-capable platforms have an erratum: A kernel partial
> write (a write transaction of less than cacheline lands at memory
> controller) to TDX private memory poisons that memory, and a subsequent
> read triggers a machine check.
>
> On those platforms, the old kernel must reset TDX private memory before
> jumping to the new kernel, otherwise the new kernel may see unexpected
> machine check. Currently the kernel doesn't track which page is a TDX
> private page. For simplicity just fail kexec/kdump for those platforms.
>
> Leverage the existing machine_kexec_prepare() to fail kexec/kdump by
> adding the check of the presence of the TDX erratum (which is only
> checked for if the kernel is built with TDX host support). This rejects
> kexec/kdump when the kernel is loading the kexec/kdump kernel image.
>
> The alternative is to reject kexec/kdump when the kernel is jumping to
> the new kernel. But for kexec this requires adding a new check (e.g.,
> arch_kexec_allowed()) in the common code to fail kernel_kexec() at early
> stage. Kdump (crash_kexec()) needs similar check, but it's hard to
> justify because crash_kexec() is not supposed to abort.
>
> It's feasible to further relax this limitation, i.e., only fail kexec
> when TDX is actually enabled by the kernel. But this is still a half
> measure compared to resetting TDX private memory so just do the simplest
> thing for now.

Hi Kai,

IIUC, the kernel doesn't donate any of its available memory to the TDX
module if TDX is not actually enabled (i.e. if the "kvm.intel.tdx=y"
kernel command line parameter is missing).

Why is it unsafe to allow kexec/kdump if "kvm.intel.tdx=y" is not
supplied to the kernel?

>
> The impact to userspace is the users will get an error when loading the
> kexec/kdump kernel image:
>
> kexec_load failed: Operation not supported
>
> This might be confusing to the users, thus also print the reason in the
> dmesg:
>

On Sun, 2025-10-26 at 16:33 -0700, Vishal Annapurve wrote:
> > It's feasible to further relax this limitation, i.e., only fail kexec
> > when TDX is actually enabled by the kernel. But this is still a half
> > measure compared to resetting TDX private memory so just do the simplest
> > thing for now.
>
> Hi Kai,

Hi Vishal,

>
> IIUC, kernel doesn't donate any of it's available memory to TDX module
> if TDX is not actually enabled (i.e. if "kvm.intel.tdx=y" kernel
> command line parameter is missing).

Right (for now KVM is the only in-kernel TDX user).

>
> Why is it unsafe to allow kexec/kdump if "kvm.intel.tdx=y" is not
> supplied to the kernel?

It can be relaxed. Please see the above quoted text from the changelog:

> It's feasible to further relax this limitation, i.e., only fail kexec
> when TDX is actually enabled by the kernel. But this is still a half
> measure compared to resetting TDX private memory so just do the simplest
> thing for now.

On Mon, 2025-10-27 at 00:50 +0000, Huang, Kai wrote:
> >
> > IIUC, kernel doesn't donate any of it's available memory to TDX module
> > if TDX is not actually enabled (i.e. if "kvm.intel.tdx=y" kernel
> > command line parameter is missing).
>
> Right (for now KVM is the only in-kernel TDX user).
>
> >
> > Why is it unsafe to allow kexec/kdump if "kvm.intel.tdx=y" is not
> > supplied to the kernel?
>
> It can be relaxed. Please see the above quoted text from the changelog:
>
> > It's feasible to further relax this limitation, i.e., only fail kexec
> > when TDX is actually enabled by the kernel. But this is still a half
> > measure compared to resetting TDX private memory so just do the simplest
> > thing for now.

I think KVM could be re-inserted with different module params? As in, the two
in-tree users could be two separate insertions of the KVM module. That seems
like something that could easily come up in the real world, if a user re-inserts
for the purpose of enabling TDX. I think the above quote was talking about
another way of checking if it's enabled.

On Mon, 2025-10-27 at 16:23 +0000, Edgecombe, Rick P wrote:
> On Mon, 2025-10-27 at 00:50 +0000, Huang, Kai wrote:
> > >
> > > IIUC, kernel doesn't donate any of it's available memory to TDX module
> > > if TDX is not actually enabled (i.e. if "kvm.intel.tdx=y" kernel
> > > command line parameter is missing).
> >
> > Right (for now KVM is the only in-kernel TDX user).
> >
> > >
> > > Why is it unsafe to allow kexec/kdump if "kvm.intel.tdx=y" is not
> > > supplied to the kernel?
> >
> > It can be relaxed. Please see the above quoted text from the changelog:
> >
> > > It's feasible to further relax this limitation, i.e., only fail kexec
> > > when TDX is actually enabled by the kernel. But this is still a half
> > > measure compared to resetting TDX private memory so just do the simplest
> > > thing for now.
>
> I think KVM could be re-inserted with different module params? As in, the two
> in-tree users could be two separate insertions of the KVM module. That seems
> like something that could easily come up in the real world, if a user re-inserts
> for the purpose of enabling TDX. I think the above quote was talking about
> another way of checking if it's enabled.

Yes exactly. We need to look at module status for that.

On Mon, Oct 27, 2025 at 2:28 PM Huang, Kai <kai.huang@intel.com> wrote:
>
> On Mon, 2025-10-27 at 16:23 +0000, Edgecombe, Rick P wrote:
> > On Mon, 2025-10-27 at 00:50 +0000, Huang, Kai wrote:
> > > >
> > > > IIUC, kernel doesn't donate any of it's available memory to TDX module
> > > > if TDX is not actually enabled (i.e. if "kvm.intel.tdx=y" kernel
> > > > command line parameter is missing).
> > >
> > > Right (for now KVM is the only in-kernel TDX user).
> > >
> > > >
> > > > Why is it unsafe to allow kexec/kdump if "kvm.intel.tdx=y" is not
> > > > supplied to the kernel?
> > >
> > > It can be relaxed. Please see the above quoted text from the changelog:
> > >
> > > > It's feasible to further relax this limitation, i.e., only fail kexec
> > > > when TDX is actually enabled by the kernel. But this is still a half
> > > > measure compared to resetting TDX private memory so just do the simplest
> > > > thing for now.
> >
> > I think KVM could be re-inserted with different module params? As in, the two
> > in-tree users could be two separate insertions of the KVM module. That seems
> > like something that could easily come up in the real world, if a user re-inserts
> > for the purpose of enabling TDX. I think the above quote was talking about
> > another way of checking if it's enabled.
>
> Yes exactly. We need to look at module status for that.

So, the right thing to do is to declare the host platform as affected
by PW_MCE_BUG only if TDX module is initialized, does that sound
correct?

On Mon, 2025-10-27 at 17:07 -0700, Vishal Annapurve wrote:
> On Mon, Oct 27, 2025 at 2:28 PM Huang, Kai <kai.huang@intel.com> wrote:
> >
> > On Mon, 2025-10-27 at 16:23 +0000, Edgecombe, Rick P wrote:
> > > On Mon, 2025-10-27 at 00:50 +0000, Huang, Kai wrote:
> > > > >
> > > > > IIUC, kernel doesn't donate any of it's available memory to TDX module
> > > > > if TDX is not actually enabled (i.e. if "kvm.intel.tdx=y" kernel
> > > > > command line parameter is missing).
> > > >
> > > > Right (for now KVM is the only in-kernel TDX user).
> > > >
> > > > >
> > > > > Why is it unsafe to allow kexec/kdump if "kvm.intel.tdx=y" is not
> > > > > supplied to the kernel?
> > > >
> > > > It can be relaxed. Please see the above quoted text from the changelog:
> > > >
> > > > > It's feasible to further relax this limitation, i.e., only fail kexec
> > > > > when TDX is actually enabled by the kernel. But this is still a half
> > > > > measure compared to resetting TDX private memory so just do the simplest
> > > > > thing for now.
> > >
> > > I think KVM could be re-inserted with different module params? As in, the two
> > > in-tree users could be two separate insertions of the KVM module. That seems
> > > like something that could easily come up in the real world, if a user re-inserts
> > > for the purpose of enabling TDX. I think the above quote was talking about
> > > another way of checking if it's enabled.
> >
> > Yes exactly. We need to look at module status for that.
>
> So, the right thing to do is to declare the host platform as affected
> by PW_MCE_BUG only if TDX module is initialized, does that sound
> correct?

I was thinking something like this:

https://lore.kernel.org/lkml/20250416230259.97989-1-kai.huang@intel.com/

On Tue, Oct 28, 2025 at 2:31 AM Huang, Kai <kai.huang@intel.com> wrote:
>
> On Mon, 2025-10-27 at 17:07 -0700, Vishal Annapurve wrote:
> > On Mon, Oct 27, 2025 at 2:28 PM Huang, Kai <kai.huang@intel.com> wrote:
> > >
> > > On Mon, 2025-10-27 at 16:23 +0000, Edgecombe, Rick P wrote:
> > > > On Mon, 2025-10-27 at 00:50 +0000, Huang, Kai wrote:
> > > > > >
> > > > > > IIUC, kernel doesn't donate any of it's available memory to TDX module
> > > > > > if TDX is not actually enabled (i.e. if "kvm.intel.tdx=y" kernel
> > > > > > command line parameter is missing).
> > > > >
> > > > > Right (for now KVM is the only in-kernel TDX user).
> > > > >
> > > > > >
> > > > > > Why is it unsafe to allow kexec/kdump if "kvm.intel.tdx=y" is not
> > > > > > supplied to the kernel?
> > > > >
> > > > > It can be relaxed. Please see the above quoted text from the changelog:
> > > > >
> > > > > > It's feasible to further relax this limitation, i.e., only fail kexec
> > > > > > when TDX is actually enabled by the kernel. But this is still a half
> > > > > > measure compared to resetting TDX private memory so just do the simplest
> > > > > > thing for now.
> > > >
> > > > I think KVM could be re-inserted with different module params? As in, the two
> > > > in-tree users could be two separate insertions of the KVM module. That seems
> > > > like something that could easily come up in the real world, if a user re-inserts
> > > > for the purpose of enabling TDX. I think the above quote was talking about
> > > > another way of checking if it's enabled.
> > >
> > > Yes exactly. We need to look at module status for that.
> >
> > So, the right thing to do is to declare the host platform as affected
> > by PW_MCE_BUG only if TDX module is initialized, does that sound
> > correct?
>
> I was thinking something like this:
>
> https://lore.kernel.org/lkml/20250416230259.97989-1-kai.huang@intel.com/

This seems to be an important thing to make progress on.

IMO, disabling kexec/kdump even if the host doesn't plan to use TDX
functionality but wants to keep the build config enabled is a
regression. I think explicitly doing TDX module initialization[1]
ideally needs something like the above series from Kai and possibly
resetting the PAMT memory during kexec/kdump at least on SPR/EMR CPUs.

Otherwise it's effectively impossible to enable CONFIG_INTEL_TDX_HOST
and have kexec/kdump working on the host even if no confidential
workloads are scheduled on such SPR/EMR hosts.

[1] https://lore.kernel.org/kvm/20251010220403.987927-4-seanjc@google.com/

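The difference between the policy in the patch and the relaxation being asked for can be illustrated with a small userspace model. This is only a sketch, not kernel code: `struct platform_state`, `kexec_prepare_strict()` and `kexec_prepare_relaxed()` are hypothetical names, and how the kernel would actually learn whether the TDX module was initialized (including across KVM module re-insertion, as raised above) is left open.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the two facts the policies look at. */
struct platform_state {
	bool has_tdx_pw_mce_bug;     /* stands in for boot_cpu_has_bug(X86_BUG_TDX_PW_MCE) */
	bool tdx_module_initialized; /* assumption: some way to track this exists */
};

/* Policy in the patch: reject whenever the erratum is present. */
int kexec_prepare_strict(const struct platform_state *p)
{
	return p->has_tdx_pw_mce_bug ? -EOPNOTSUPP : 0;
}

/* Relaxed policy: reject only if the erratum is present AND TDX was set up. */
int kexec_prepare_relaxed(const struct platform_state *p)
{
	if (p->has_tdx_pw_mce_bug && p->tdx_module_initialized)
		return -EOPNOTSUPP;
	return 0;
}
```

On an affected host that never initialized TDX, the strict variant returns -EOPNOTSUPP while the relaxed variant returns 0; the two agree in every other case.
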
On Mon, Sep 1, 2025 at 9:11 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> From: Kai Huang <kai.huang@intel.com>
>
> Some early TDX-capable platforms have an erratum: A kernel partial
> write (a write transaction of less than cacheline lands at memory
> controller) to TDX private memory poisons that memory, and a subsequent
> read triggers a machine check.
>
> On those platforms, the old kernel must reset TDX private memory before
> jumping to the new kernel, otherwise the new kernel may see unexpected
> machine check. Currently the kernel doesn't track which page is a TDX
> private page. For simplicity just fail kexec/kdump for those platforms.

Google has a usecase that needs host kdump support on SPR/EMR
platforms. Disabling kdump disables the host's ability to dump very
critical information on host crashes altogether. Is Intel working on
enabling kdump support for platforms with the erratum?

On 9/29/25 18:38, Vishal Annapurve wrote:
>> On those platforms, the old kernel must reset TDX private memory before
>> jumping to the new kernel, otherwise the new kernel may see unexpected
>> machine check. Currently the kernel doesn't track which page is a TDX
>> private page. For simplicity just fail kexec/kdump for those platforms.
> Google has a usecase that needs host kdump support on SPR/EMR
> platforms. Disabling kdump disables the host's ability to dump very
> critical information on host crashes altogether. Is Intel working on
> enabling kdump support for platforms with the erratum?

Nope.

Any workarounds are going to be slow and probably imperfect. That's not
a great match for kdump. I'm perfectly happy waiting for fixed hardware
from what I've seen.

On Tue, Sep 30, 2025 at 2:32 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 9/29/25 18:38, Vishal Annapurve wrote:
> >> On those platforms, the old kernel must reset TDX private memory before
> >> jumping to the new kernel, otherwise the new kernel may see unexpected
> >> machine check. Currently the kernel doesn't track which page is a TDX
> >> private page. For simplicity just fail kexec/kdump for those platforms.
> > Google has a usecase that needs host kdump support on SPR/EMR
> > platforms. Disabling kdump disables the host's ability to dump very
> > critical information on host crashes altogether. Is Intel working on
> > enabling kdump support for platforms with the erratum?
>
> Nope.
>
> Any workarounds are going to be slow and probably imperfect. That's not

Do we really need to deploy workarounds that are complex and slow to
get kdump working for the majority of the scenarios? Is there any
analysis done for the risk with imperfect and simpler workarounds vs
benefits of kdump functionality?

> a great match for kdump. I'm perfectly happy waiting for fixed hardware
> from what I've seen.

IIUC SPR/EMR - two CPU generations out there are impacted by this
erratum and just disabling kdump functionality IMO is not the best
solution here.

On 9/30/25 19:05, Vishal Annapurve wrote:
...
>> Any workarounds are going to be slow and probably imperfect. That's not
>
> Do we really need to deploy workarounds that are complex and slow to
> get kdump working for the majority of the scenarios? Is there any
> analysis done for the risk with imperfect and simpler workarounds vs
> benefits of kdump functionality?
>
>> a great match for kdump. I'm perfectly happy waiting for fixed hardware
>> from what I've seen.
>
> IIUC SPR/EMR - two CPU generations out there are impacted by this
> erratum and just disabling kdump functionality IMO is not the best
> solution here.

That's an eminently reasonable position. But we're speaking in broad
generalities and I'm unsure what you don't like about the status quo or
how you'd like to see things change.

Care to send along a patch representing the "best solution"? That should
clear things up.

On Wed, Oct 1, 2025 at 7:32 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 9/30/25 19:05, Vishal Annapurve wrote:
> ...
> >> Any workarounds are going to be slow and probably imperfect. That's not
> >
> > Do we really need to deploy workarounds that are complex and slow to
> > get kdump working for the majority of the scenarios? Is there any
> > analysis done for the risk with imperfect and simpler workarounds vs
> > benefits of kdump functionality?
> >
> >> a great match for kdump. I'm perfectly happy waiting for fixed hardware
> >> from what I've seen.
> >
> > IIUC SPR/EMR - two CPU generations out there are impacted by this
> > erratum and just disabling kdump functionality IMO is not the best
> > solution here.
>
> That's an eminently reasonable position. But we're speaking in broad
> generalities and I'm unsure what you don't like about the status quo or
> how you'd like to see things change.

Looks like the decision to disable kdump was taken between [1] -> [2].
"The kernel currently doesn't track which page is TDX private memory.
It's not trivial to reset TDX private memory. For simplicity, this
series simply disables kexec/kdump for such platforms. This will be
enhanced in the future."

A patch [3] from the series[1], describes the issue as:
"This problem is triggered by "partial" writes where a write transaction
of less than cacheline lands at the memory controller. The CPU does
these via non-temporal write instructions (like MOVNTI), or through
UC/WC memory mappings. The issue can also be triggered away from the
CPU by devices doing partial writes via DMA."

And also mentions:
"Also note only the normal kexec needs to worry about this problem, but
not the crash kexec: 1) The kdump kernel only uses the special memory
reserved by the first kernel, and the reserved memory can never be used
by TDX in the first kernel; 2) The /proc/vmcore, which reflects the
first (crashed) kernel's memory, is only for read. The read will never
"poison" TDX memory thus cause unexpected machine check (only partial
write does)."

What was the scenario that led to disabling kdump support altogether
given the above description?

[1] https://lore.kernel.org/lkml/cover.1727179214.git.kai.huang@intel.com/
[2] https://lore.kernel.org/all/cover.1741778537.git.kai.huang@intel.com/
[3] https://lore.kernel.org/lkml/6960ef6d7ee9398d164bf3997e6009df3e88cb67.1727179214.git.kai.huang@intel.com/

>
> Care to send along a patch representing the "best solution"? That should
> clear things up.
>

> On Wed, Oct 1, 2025 at 7:32 AM Dave Hansen <dave.hansen@intel.com>
> wrote:
> >
> > On 9/30/25 19:05, Vishal Annapurve wrote:
> > ...
> > >> Any workarounds are going to be slow and probably imperfect. That's not
> > >
> > > Do we really need to deploy workarounds that are complex and slow to
> > > get kdump working for the majority of the scenarios? Is there any
> > > analysis done for the risk with imperfect and simpler workarounds vs
> > > benefits of kdump functionality?
> > >
> > >> a great match for kdump. I'm perfectly happy waiting for fixed hardware
> > >> from what I've seen.
> > >
> > > IIUC SPR/EMR - two CPU generations out there are impacted by this
> > > erratum and just disabling kdump functionality IMO is not the best
> > > solution here.
> >
> > That's an eminently reasonable position. But we're speaking in broad
> > generalities and I'm unsure what you don't like about the status quo or
> > how you'd like to see things change.
>
> Looks like the decision to disable kdump was taken between [1] -> [2].
> "The kernel currently doesn't track which page is TDX private memory.
> It's not trivial to reset TDX private memory. For simplicity, this
> series simply disables kexec/kdump for such platforms. This will be
> enhanced in the future."
>
> A patch [3] from the series[1], describes the issue as:
> "This problem is triggered by "partial" writes where a write transaction
> of less than cacheline lands at the memory controller. The CPU does
> these via non-temporal write instructions (like MOVNTI), or through
> UC/WC memory mappings. The issue can also be triggered away from the
> CPU by devices doing partial writes via DMA."
>
> And also mentions:
> "Also note only the normal kexec needs to worry about this problem, but
> not the crash kexec: 1) The kdump kernel only uses the special memory
> reserved by the first kernel, and the reserved memory can never be used
> by TDX in the first kernel; 2) The /proc/vmcore, which reflects the
> first (crashed) kernel's memory, is only for read. The read will never
> "poison" TDX memory thus cause unexpected machine check (only partial
> write does)."

While the statement that the read will never poison the memory is correct,
the situation we can theoretically worry about is, in my understanding,
the following:

1. During its execution on a platform with the partial write problem, the
   host OS or another actor executing outside of SEAM mode triggers a
   partial write into a cache line that originally belonged to TDX private
   memory. This is something that the host OS or other entities should not
   do, but it could happen due to host OS bugs, etc.
2. The above causes the specified cache line to be poisoned by the memory
   controller. However, here we assume that no one accesses this cache line
   from the TDX module, TD guests or the host OS for the time being, and
   the problem remains hidden.
3. The host OS crashes due to some other issue, the kdump crash kernel is
   triggered, and kdump starts to read all the memory from the previous
   host kernel to dump the diagnostics info.
4. At some point, the kdump crash kernel reaches the memory with the
   poisoned cache line, consumes the poison, and the #MC is issued for the
   kernel space.

Isn't this the reason for also disabling kdump? Or am I missing something?

Best Regards,
Elena.

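The four steps above can be walked through with a toy userspace model of the erratum. It is a sketch under stated assumptions, not a description of real hardware behaviour beyond what the thread says: all names (`struct mem_model`, `model_write()`, `kdump_scan()`) are hypothetical, and treating a full-cacheline write as resetting the line to normal memory is a simplification made for the model.

```c
#include <stdbool.h>
#include <stddef.h>

#define LINE_SIZE 64 /* bytes per cache line in this model */

enum line_state { LINE_NORMAL, LINE_TDX_PRIVATE, LINE_POISONED };

struct mem_model {
	enum line_state *lines;
	size_t nr_lines;
};

/*
 * Host-side write of 'len' bytes into cache line 'idx'.
 * A partial write to a TDX private line poisons it (steps 1 and 2).
 * Model simplification: a full-line write resets the line to normal.
 */
void model_write(struct mem_model *m, size_t idx, size_t len)
{
	if (len >= LINE_SIZE)
		m->lines[idx] = LINE_NORMAL;
	else if (m->lines[idx] == LINE_TDX_PRIVATE)
		m->lines[idx] = LINE_POISONED;
}

/* A read never poisons anything; it only consumes existing poison. */
bool model_read_raises_mc(const struct mem_model *m, size_t idx)
{
	return m->lines[idx] == LINE_POISONED;
}

/*
 * Read-only pass over old memory, as a crash kernel would do (steps 3, 4).
 * Returns the index of the first line whose read would raise #MC, or -1.
 */
long kdump_scan(const struct mem_model *m)
{
	for (size_t i = 0; i < m->nr_lines; i++)
		if (model_read_raises_mc(m, i))
			return (long)i;
	return -1;
}
```

In this model the scan itself never changes any line state, which matches the "read will never poison" statement, yet it still reports a line that was poisoned by an earlier stray partial write.
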
On 02.10.25 08:59, Reshetova, Elena wrote:
>> On Wed, Oct 1, 2025 at 7:32 AM Dave Hansen <dave.hansen@intel.com>
>> wrote:
>>>
>>> On 9/30/25 19:05, Vishal Annapurve wrote:
>>> ...
>>>>> Any workarounds are going to be slow and probably imperfect. That's not
>>>>
>>>> Do we really need to deploy workarounds that are complex and slow to
>>>> get kdump working for the majority of the scenarios? Is there any
>>>> analysis done for the risk with imperfect and simpler workarounds vs
>>>> benefits of kdump functionality?
>>>>
>>>>> a great match for kdump. I'm perfectly happy waiting for fixed hardware
>>>>> from what I've seen.
>>>>
>>>> IIUC SPR/EMR - two CPU generations out there are impacted by this
>>>> erratum and just disabling kdump functionality IMO is not the best
>>>> solution here.
>>>
>>> That's an eminently reasonable position. But we're speaking in broad
>>> generalities and I'm unsure what you don't like about the status quo or
>>> how you'd like to see things change.
>>
>> Looks like the decision to disable kdump was taken between [1] -> [2].
>> "The kernel currently doesn't track which page is TDX private memory.
>> It's not trivial to reset TDX private memory. For simplicity, this
>> series simply disables kexec/kdump for such platforms. This will be
>> enhanced in the future."
>>
>> A patch [3] from the series[1], describes the issue as:
>> "This problem is triggered by "partial" writes where a write transaction
>> of less than cacheline lands at the memory controller. The CPU does
>> these via non-temporal write instructions (like MOVNTI), or through
>> UC/WC memory mappings. The issue can also be triggered away from the
>> CPU by devices doing partial writes via DMA."
>>
>> And also mentions:
>> "Also note only the normal kexec needs to worry about this problem, but
>> not the crash kexec: 1) The kdump kernel only uses the special memory
>> reserved by the first kernel, and the reserved memory can never be used
>> by TDX in the first kernel; 2) The /proc/vmcore, which reflects the
>> first (crashed) kernel's memory, is only for read. The read will never
>> "poison" TDX memory thus cause unexpected machine check (only partial
>> write does)."
>
> While the statement that the read will never poison the memory is correct,
> the situation we can theoretically worry about is the following in my understanding:
>
> 1. During its execution on platform with partial write problem, host OS or other
> actor executing outside of SEAM mode triggers partial write into a cache line that
> originally belonged to TDX private memory.
> This is smth that host OS or other entities should not do, but it could happen due
> to host OS bugs, etc.
> 2. The above causes the specified cache line to be poisoned by mem controller.
> However, here we assume that no one accesses this cache line from TDX module,
> TD guests or Host OS for the time being and the problem remains hidden.
> 3. Host OS crashes due to some other issue, kdump crash kernel is triggered,
> and kdump starts to read all the memory from the previous host kernel to dump
> the diagnostics info.
> 4. At some point of time, kdump crash kernel reaches the memory with the poisoned
> cache line, consumes poison, and the #MC is issued for the kernel space.
>
> Isn't this the reason for also disabling kdump? Or do I miss smth?

So let's compare the 2 cases with kdump enabled and disabled in your
scenario (crash of the host OS):

kdump enabled: No dump can be produced due to the #MC and system is
rebooted.

kdump disabled: No dump is produced and system is rebooted after crash.

What is the main concern with kdump enabled? I don't see any
disadvantage with enabling it, just the advantage that in many cases
a dump will be written.


Juergen

On 10/2/25 00:46, Juergen Gross wrote:
> So lets compare the 2 cases with kdump enabled and disabled in your
> scenario (crash of the host OS):
>
> kdump enabled: No dump can be produced due to the #MC and system is
> rebooted.
>
> kdump disabled: No dump is produced and system is rebooted after crash.
>
> What is the main concern with kdump enabled? I don't see any
> disadvantage with enabling it, just the advantage that in many cases
> a dump will be written.

The disadvantage is that a kernel bug from long ago results in a machine
check. Machine checks are generally indicative of bad hardware. So the
disadvantage is that someone mistakes the long ago kernel bug for bad
hardware.

There are two ways of looking at this:

1. A theoretically fragile kdump is better than no kdump at all. All of
   the stars would have to align for kdump to _fail_ and we don't think
   that's going to happen often enough to matter.
2. kdump happens after kernel bugs. The machine checks happen because of
   kernel bugs. It's not a big stretch to think that, at scale, kdump is
   going to run into these #MCs on a regular basis.

Does that capture the two perspectives fairly?

On 02.10.25 17:06, Dave Hansen wrote:
> On 10/2/25 00:46, Juergen Gross wrote:
>> So lets compare the 2 cases with kdump enabled and disabled in your
>> scenario (crash of the host OS):
>>
>> kdump enabled: No dump can be produced due to the #MC and system is
>> rebooted.
>>
>> kdump disabled: No dump is produced and system is rebooted after crash.
>>
>> What is the main concern with kdump enabled? I don't see any
>> disadvantage with enabling it, just the advantage that in many cases
>> a dump will be written.
> The disadvantage is that a kernel bug from long ago results in a machine
> check. Machine checks are generally indicative of bad hardware. So the
> disadvantage is that someone mistakes the long ago kernel bug for bad
> hardware.
>
> There are two ways of looking at this:
>
> 1. A theoretically fragile kdump is better than no kdump at all. All of
> the stars would have to align for kdump to _fail_ and we don't think
> that's going to happen often enough to matter.
> 2. kdump happens after kernel bugs. The machine checks happen because of
> kernel bugs. It's not a big stretch to think that, at scale, kdump is
> going to run in to these #MCs on a regular basis.
>
> Does that capture the two perspectives fairly?

Basically yes.

If we can't come to an agreement that kdump should be allowed in
spite of a potential #MC, maybe we could disable kdump only if TDX
guests have been active on the machine before?

Disabling kdump on a distro kernel just because TDX was enabled but
without anyone having used TDX would be quite hard.


Juergen

On 10/7/25 06:31, Jürgen Groß wrote:
> If we can't come to an agreement that kdump should be allowed in
> spite of a potential #MC, maybe we could disable kdump only if TDX
> guests have been active on the machine before?

How would we determine that?

We can't just call the TDX module to see because it might have been
running before but got shut down.

On 08.10.25 17:40, Dave Hansen wrote:
> On 10/7/25 06:31, Jürgen Groß wrote:
>> If we can't come to an agreement that kdump should be allowed in
>> spite of a potential #MC, maybe we could disable kdump only if TDX
>> guests have been active on the machine before?
>
> How would we determine that?
>
> We can't just call the TDX module to see because it might have been
> running before but got shut down.

Ah, okay, I didn't think of that.

Then we could add a kernel boot parameter to let the user opt-in for
kexec being possible in spite of the potential #MC. I think this should
cover it.


Juergen

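The opt-in idea could look roughly like the following userspace sketch. Everything here is an assumption made for illustration: no such boot parameter exists in the patch or in this thread, the token name `tdx_pw_mce_allow_kexec` is made up, and a real kernel implementation would use the kernel's own parameter handling rather than scanning a string like this.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical opt-in token; not an existing kernel parameter. */
#define OPTIN_FLAG "tdx_pw_mce_allow_kexec"

/* Return true if 'flag' appears as a whole space-separated token. */
bool cmdline_has_flag(const char *cmdline, const char *flag)
{
	size_t flen = strlen(flag);
	const char *p = cmdline;

	while (*p) {
		const char *end;

		while (*p == ' ')	/* skip separators */
			p++;
		end = p;
		while (*end && *end != ' ')	/* find end of this token */
			end++;
		if ((size_t)(end - p) == flen && strncmp(p, flag, flen) == 0)
			return true;
		p = end;
	}
	return false;
}

/* Reject kexec on affected platforms unless the user opted in. */
int kexec_prepare_with_optin(bool has_tdx_pw_mce_bug, const char *cmdline)
{
	if (has_tdx_pw_mce_bug && !cmdline_has_flag(cmdline, OPTIN_FLAG))
		return -EOPNOTSUPP;
	return 0;
}
```

With this shape the default stays as strict as the patch, and a user who accepts the risk of a possible #MC in the new kernel passes the token explicitly.
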
On Thu, Oct 2, 2025 at 8:06 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/2/25 00:46, Juergen Gross wrote:
> > So lets compare the 2 cases with kdump enabled and disabled in your
> > scenario (crash of the host OS):
> >
> > kdump enabled: No dump can be produced due to the #MC and system is
> > rebooted.
> >
> > kdump disabled: No dump is produced and system is rebooted after crash.
> >
> > What is the main concern with kdump enabled? I don't see any
> > disadvantage with enabling it, just the advantage that in many cases
> > a dump will be written.
> The disadvantage is that a kernel bug from long ago results in a machine
> check. Machine checks are generally indicative of bad hardware. So the
> disadvantage is that someone mistakes the long ago kernel bug for bad
> hardware.
>
> There are two ways of looking at this:
>
> 1. A theoretically fragile kdump is better than no kdump at all. All of
> the stars would have to align for kdump to _fail_ and we don't think
> that's going to happen often enough to matter.
> 2. kdump happens after kernel bugs. The machine checks happen because of
> kernel bugs. It's not a big stretch to think that, at scale, kdump is
> going to run in to these #MCs on a regular basis.
Looking at Elena's response, I would say it's still *a* big stretch
for kdump to run into these #MCs on a regular basis, as the following
sequence is needed for the problematic scenario:
1) A host OS bug has to corrupt TDX private memory that is part of
kernel memory with a *partial write*.
-> i.e. PAMT tables, SEPT tables, TD VCPU/VM metadata etc.
-> IIUC corruption of guest memory is not a concern as that
belongs to userspace.
2) The TDX module/TD must not consume that poisoned memory in the meantime.
-> i.e. no walk of the metadata memory.
3) The host kernel then has to hit a separate bug that causes an
orthogonal panic.
*partial writes* IIUC need special instructions.
>
> Does that capture the two perspectives fairly?
On Thu, Oct 2, 2025 at 9:09 AM Vishal Annapurve <vannapurve@google.com> wrote:
>
> On Thu, Oct 2, 2025 at 8:06 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > On 10/2/25 00:46, Juergen Gross wrote:
> > > So lets compare the 2 cases with kdump enabled and disabled in your
> > > scenario (crash of the host OS):
> > >
> > > kdump enabled: No dump can be produced due to the #MC and system is
> > > rebooted.
> > >
> > > kdump disabled: No dump is produced and system is rebooted after crash.
> > >
> > > What is the main concern with kdump enabled? I don't see any
> > > disadvantage with enabling it, just the advantage that in many cases
> > > a dump will be written.
> > The disadvantage is that a kernel bug from long ago results in a machine
> > check. Machine checks are generally indicative of bad hardware. So the
> > disadvantage is that someone mistakes the long ago kernel bug for bad
> > hardware.
> >
> > There are two ways of looking at this:
> >
> > 1. A theoretically fragile kdump is better than no kdump at all. All of
> > the stars would have to align for kdump to _fail_ and we don't think
> > that's going to happen often enough to matter.
> > 2. kdump happens after kernel bugs. The machine checks happen because of
> > kernel bugs. It's not a big stretch to think that, at scale, kdump is
> > going to run in to these #MCs on a regular basis.
>
> Looking at Elena's response, I would say it's still *a* big stretch
> for kdump to run into these #MCs on a regular basis as following
> sequence is needed for problematic scenario:
> 1) Host OS bug should corrupt TDX private memory with a *partial
> write*, that is part of kernel memory.
> -> i.e. PAMT tables, SEPT tables, TD VCPU/VM metadata etc.
> -> IIUC corruption of guest memory is not a concern as that
> belongs to userspace.
> 2) TDX Module/TD shouldn't consume that poisoned memory.
> -> i.e. no walk of the metadata memory.
> 3) Host kernel needs to generate a bug that causes an orthogonal panic.
>
> *partial writes* IIUC need special instructions.

Circling back on this topic, I would like to reiterate a few points:
1) Google has been running workloads with the series [1] for ~2 years
now, and we haven't seen any issues with kdump functionality across kernel
bugs, real hardware issues, private memory corruption etc.
2) IMO rather than disabling kdump because of host kernel bugs
potentially corrupting private memory, it would be much more useful to
employ mechanisms like direct map removal to ensure host bugs leading
to private memory corruption are caught much earlier. Disabling kdump
doesn't help the problem here and just makes it worse for a vast
majority of other scenarios. On the other hand, enabling kdump doesn't
make the problem worse than it is.
- Host IOMMU mappings should also ideally be restricted to the
regions that don't overlap with private memory regions.
3) With DPAMT support [2], the possibility of the host corrupting
private memory will reduce for the hosts not running confidential VMs
at all.

[1] https://lore.kernel.org/lkml/cover.1727179214.git.kai.huang@intel.com/
[2] https://lore.kernel.org/kvm/20250918232224.2202592-1-rick.p.edgecombe@intel.com/
On 10/18/25 08:54, Vishal Annapurve wrote:
> Circling bank on this topic, I would like to iterate a few points:
> 1) Google has been running workloads with the series [1] for ~2 years
> now, we haven't seen any issues with kdump functionality across kernel
> bugs, real hardware issues, private memory corruption etc.

Great points and great info!

As a next step, I'd expect someone (at Google) to take this into
consideration and put together a series to have the kernel comprehend
those points.
On Tue, Oct 21, 2025 at 10:08 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/18/25 08:54, Vishal Annapurve wrote:
> > Circling bank on this topic, I would like to iterate a few points:
> > 1) Google has been running workloads with the series [1] for ~2 years
> > now, we haven't seen any issues with kdump functionality across kernel
> > bugs, real hardware issues, private memory corruption etc.
>
> Great points and great info!
>
> As a next step, I'd expect someone (at Google) to take this into
> consideration and put together a series to have the kernel comprehend
> those points.

Then is it safe to say that Intel doesn't consider:
* Adding the support to just reset PAMT memory [1] to this series and
* Modifying the logic in this patch [2] to enable kdump and keep kexec
support disabled in this series

as a viable direction upstream for now until a better solution comes along?

If not, can kdump be made optional as Juergen suggested?

[1] https://lore.kernel.org/lkml/6960ef6d7ee9398d164bf3997e6009df3e88cb67.1727179214.git.kai.huang@intel.com/
[2] https://lore.kernel.org/all/20250901160930.1785244-5-pbonzini@redhat.com/
On Tue, 2025-10-21 at 19:50 -0700, Vishal Annapurve wrote:
> On Tue, Oct 21, 2025 at 10:08 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > On 10/18/25 08:54, Vishal Annapurve wrote:
> > > Circling bank on this topic, I would like to iterate a few points:
> > > 1) Google has been running workloads with the series [1] for ~2 years
> > > now, we haven't seen any issues with kdump functionality across kernel
> > > bugs, real hardware issues, private memory corruption etc.
> >
> > Great points and great info!
> >
> > As a next step, I'd expect someone (at Google) to take this into
> > consideration and put together a series to have the kernel comprehend
> > those points.
>
> Then is it safe to say that Intel doesn't consider:
> * Adding the support to just reset PAMT memory [1] to this series and
You need to reset all TDX private memory including TDX guest private
memory, S-EPT pages etc, and PAMT. Resetting PAMT alone won't be enough,
and is pointless.
When [1] was posted, KVM TDX hadn't landed yet, so the only type of TDX
private memory was PAMT. Even so, there is a big comment there pointing out
that in-kernel users are responsible for resetting any TDX private
memory that they manage:
+ /*
+ * It's ideal to cover all types of TDX private pages here, but
+ * currently there's no unified way to tell whether a given page
+ * is TDX private page or not.
+ *
+ * Only convert PAMT here. All in-kernel TDX users (e.g., KVM)
+ * are responsible for converting TDX private pages that are
+ * managed by them by either registering reboot notifier or
+ * shutdown syscore ops.
+ */
+ tdmrs_reset_pamt_all(&tdx_tdmr_list);
> * Modifying the logic in this patch [2] to enable kdump and keep kexec
> support disabled in this series
Resetting TDX private memory is a complete solution that allows enabling
both kdump and kexec. If we choose to reset TDX private memory, then we can
just revert [2].
>
> as a viable direction upstream for now until a better solution comes along?
The alternative could be to simply modify [2] to allow kdump (but leave
TDX private memory untouched for the new kernel) but not normal kexec. The
risk of doing so has already been covered in this thread AFAICT:
1) If the kdump kernel does a partial write to vmcore, the kdump kernel may
see an unexpected #MCE.
2) As Elena pointed out, if the old kernel has a bug and has somehow already
done a partial write to TDX private memory (which leads to poison), the
consumption of such poison may be deferred to the kdump kernel.
>
> If not, can kdump be made optional as Juergen suggested?
IIUC Juergen suggested:
Then we could add a kernel boot parameter to let the user opt-in
for kexec being possible in spite of the potential #MC.
I don't have an opinion on this, other than that I think the boot parameter
only makes sense if we do the "alternative" mentioned above, i.e., not
resetting TDX private memory.
>
> [1] https://lore.kernel.org/lkml/6960ef6d7ee9398d164bf3997e6009df3e88cb67.1727179214.git.kai.huang@intel.com/
> [2] https://lore.kernel.org/all/20250901160930.1785244-5-pbonzini@redhat.com/
On Wed, Oct 22, 2025 at 2:05 PM Huang, Kai <kai.huang@intel.com> wrote:
>
>
> > * Modifying the logic in this patch [2] to enable kdump and keep kexec
> > support disabled in this series
>
> Resetting TDX private is a complete solution which allows to enable both
> kdump and kexec. If we choose to reset TDX private memory, then we can
> just revert [2].
>
> >
> > as a viable direction upstream for now until a better solution comes along?
>
> The alternative could be to simply modify [2] to allow kdump (but leave
> TDX private memory untouched to the new kernel) but not normal kexec. The
> risk of doing so has already been covered in this thread AFAICT:
>
> 1) If the kdump kernel does partial write to vmcore, the kdump kernel may
> see unexpected #MCE.

Ideally a kdump kernel should not write to vmcore.

> 2) As Elena pointed out, if the old kernel has bug and somehow already
> does partial write to TDX private memory (which leads to poison), the
> consumption of such poison may be deferred to the kdump kernel.

Is this case very different from hardware memory failures leading to
poisoned memory ranges? i.e. kdump solution has an existing scenario
of possible poison consumption during generation of kdump.

Is it okay to advertise kdump functionality to be the best effort and
live with this caveat until a cleaner solution comes along?

>
> >
> > If not, can kdump be made optional as Juergen suggested?
>
> IIUC Juergen suggested:
>
> Then we could add a kernel boot parameter to let the user opt-in
> for kexec being possible in spite of the potential #MC.
>
> I don't have opinion on this, other than that I think the boot parameter
> only makes sense if we do the "alternative" mentioned above, i.e., not
> resetting TDX private memory.
>
> >
> > [1] https://lore.kernel.org/lkml/6960ef6d7ee9398d164bf3997e6009df3e88cb67.1727179214.git.kai.huang@intel.com/
> > [2] https://lore.kernel.org/all/20250901160930.1785244-5-pbonzini@redhat.com/
> On 02.10.25 08:59, Reshetova, Elena wrote:
> >> On Wed, Oct 1, 2025 at 7:32 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >>>
> >>> On 9/30/25 19:05, Vishal Annapurve wrote:
> >>> ...
> >>>>> Any workarounds are going to be slow and probably imperfect. That's not
> >>>>
> >>>> Do we really need to deploy workarounds that are complex and slow to
> >>>> get kdump working for the majority of the scenarios? Is there any
> >>>> analysis done for the risk with imperfect and simpler workarounds vs
> >>>> benefits of kdump functionality?
> >>>>
> >>>>> a great match for kdump. I'm perfectly happy waiting for fixed hardware
> >>>>> from what I've seen.
> >>>>
> >>>> IIUC SPR/EMR - two CPU generations out there are impacted by this
> >>>> erratum and just disabling kdump functionality IMO is not the best
> >>>> solution here.
> >>>
> >>> That's an eminently reasonable position. But we're speaking in broad
> >>> generalities and I'm unsure what you don't like about the status quo or
> >>> how you'd like to see things change.
> >>
> >> Looks like the decision to disable kdump was taken between [1] -> [2].
> >> "The kernel currently doesn't track which page is TDX private memory.
> >> It's not trivial to reset TDX private memory. For simplicity, this
> >> series simply disables kexec/kdump for such platforms. This will be
> >> enhanced in the future."
> >>
> >> A patch [3] from the series[1], describes the issue as:
> >> "This problem is triggered by "partial" writes where a write transaction
> >> of less than cacheline lands at the memory controller. The CPU does
> >> these via non-temporal write instructions (like MOVNTI), or through
> >> UC/WC memory mappings. The issue can also be triggered away from the
> >> CPU by devices doing partial writes via DMA."
> >>
> >> And also mentions:
> >> "Also note only the normal kexec needs to worry about this problem, but
> >> not the crash kexec: 1) The kdump kernel only uses the special memory
> >> reserved by the first kernel, and the reserved memory can never be used
> >> by TDX in the first kernel; 2) The /proc/vmcore, which reflects the
> >> first (crashed) kernel's memory, is only for read. The read will never
> >> "poison" TDX memory thus cause unexpected machine check (only partial
> >> write does)."
> >
> > While the statement that the read will never poison the memory is correct,
> > the situation we can theoretically worry about is the following in my understanding:
> >
> > 1. During its execution on platform with partial write problem, host OS or other
> > actor executing outside of SEAM mode triggers partial write into a cache line that
> > originally belonged to TDX private memory.
> > This is smth that host OS or other entities should not do, but it could happen due
> > to host OS bugs, etc.
> > 2. The above causes the specified cache line to be poisoned by mem controller.
> > However, here we assume that no one accesses this cache line from TDX module,
> > TD guests or Host OS for the time being and the problem remains hidden.
> > 3. Host OS crashes due to some other issue, kdump crash kernel is triggered,
> > and kdump starts to read all the memory from the previous host kernel to dump
> > the diagnostics info.
> > 4. At some point of time, kdump crash kernel reaches the memory with the poisoned
> > cache line, consumes poison, and the #MC is issued for the kernel space.
> >
> > Isn't this the reason for also disabling kdump? Or do I miss smth?
>
> So lets compare the 2 cases with kdump enabled and disabled in your scenario
> (crash of the host OS):
>
> kdump enabled: No dump can be produced due to the #MC and system is
> rebooted.
>
> kdump disabled: No dump is produced and system is rebooted after crash.
>
> What is the main concern with kdump enabled? I don't see any disadvantage with
> enabling it, just the advantage that in many cases a dump will be written.

I am not in a position to judge what should be done about kdump in Linux,
nor am I arguing one way or another. I just wanted to fill the gap and explain
the technical scenario above, which I think was missing from this thread.
Whatever decision is taken by the community should rely on understanding the HW
behaviour, so this is what I tried to explain above.

Best Regards,
Elena.
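The four-step scenario quoted above can be sketched as a toy simulation. This is a simplified model under stated assumptions, not a description of real hardware or kernel code: `struct toy_mem`, `toy_write()`, `toy_read_raises_mc()` and `toy_dump_first_mc()` are hypothetical names, memory is reduced to eight 64-byte cache lines, and clearing poison (for example by resetting TDX private memory) is deliberately not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NLINES    8   /* toy memory: 8 cache lines */
#define LINE_SIZE 64  /* bytes per cache line in this model */

struct toy_mem {
	bool tdx_private[NLINES]; /* line is (or was) TDX private memory */
	bool poisoned[NLINES];    /* line was poisoned by a partial write */
};

/* Steps 1-2: a write shorter than a cache line to a TDX private line poisons it. */
static void toy_write(struct toy_mem *m, size_t line, size_t len)
{
	assert(line < NLINES && len > 0 && len <= LINE_SIZE);
	if (m->tdx_private[line] && len < LINE_SIZE)
		m->poisoned[line] = true;
}

/* A read never poisons memory; it only consumes existing poison (the #MC). */
static bool toy_read_raises_mc(const struct toy_mem *m, size_t line)
{
	assert(line < NLINES);
	return m->poisoned[line];
}

/*
 * Steps 3-4: a kdump-style pass reads every line of the old kernel's memory.
 * Returns the index of the first line whose read would raise #MC, or -1.
 */
static int toy_dump_first_mc(const struct toy_mem *m)
{
	for (size_t i = 0; i < NLINES; i++)
		if (toy_read_raises_mc(m, i))
			return (int)i;
	return -1;
}
```

In this model the poison stays hidden until something reads the line, which is why the dump pass, rather than the original buggy write, is where the machine check surfaces.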
On 10/1/25 10:17, Vishal Annapurve wrote:
> And also mentions:
> "Also note only the normal kexec needs to worry about this problem, but
> not the crash kexec: 1) The kdump kernel only uses the special memory
> reserved by the first kernel, and the reserved memory can never be used
> by TDX in the first kernel; 2) The /proc/vmcore, which reflects the
> first (crashed) kernel's memory, is only for read. The read will never
> "poison" TDX memory thus cause unexpected machine check (only partial
> write does)."
>
> What was the scenario that led to disabling kdump support altogether
> given the above description?

I think it was purely out of convenience so that the disabling could be
three lines of code.

I don't know off the top of my head if there's a simple enough way to
disable kexec but not kdump. When I applied the thing, I was probably
just considering kexec/kdump a monolithic thing and not thinking that
folks would want one but not the other.

Kai, did you have any other motivations?
On Wed, 2025-10-01 at 11:00 -0700, Hansen, Dave wrote:
> On 10/1/25 10:17, Vishal Annapurve wrote:
> > And also mentions:
> > "Also note only the normal kexec needs to worry about this problem, but
> > not the crash kexec: 1) The kdump kernel only uses the special memory
> > reserved by the first kernel, and the reserved memory can never be used
> > by TDX in the first kernel; 2) The /proc/vmcore, which reflects the
> > first (crashed) kernel's memory, is only for read. The read will never
> > "poison" TDX memory thus cause unexpected machine check (only partial
> > write does)."
> >
> > What was the scenario that led to disabling kdump support altogether
> > given the above description?
>
> I think it was purely out of convenience so that the disabling could be
> three lines of code.
>
> I don't know off the top of my head if there's a simple enough way to
> disable kexec but not kdump. When I applied the thing, I was probably
> just considering kexec/kdump a monolithic thing and not thinking that
> folks would want one but not the other.
>
> Kai, did you have any other motivations?
The "/proc/vmcore is only for read" statement is my understanding of how the
kdump kernel uses /proc/vmcore. I used to only disable kexec but allow
kdump to work (something like the diff below [*]), but during the internal
review we decided to just disable both since we cannot be sure that it
holds for all kdump users.
This was raised by Vishal publicly before and was discussed here (in v3):
https://lore.kernel.org/kvm/f8dcbe257b3931aec9e199132b678bd7681b7efa.camel@intel.com/
[*]:
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 15088d14904f..c7af4aa7dd6b 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -356,10 +356,11 @@ int machine_kexec_prepare(struct kimage *image)
 	 * On those platforms the old kernel must reset TDX private
 	 * memory before jumping to the new kernel otherwise the new
 	 * kernel may see unexpected machine check. For simplicity
-	 * just fail kexec/kdump on those platforms.
+	 * just fail kexec on those platforms. Still allow kdump since
+	 * the kdump kernel will only read TDX memory but not write it.
 	 */
-	if (boot_cpu_has_bug(X86_BUG_TDX_PW_MCE)) {
-		pr_info_once("Not allowed on platform with tdx_pw_mce bug\n");
+	if (boot_cpu_has_bug(X86_BUG_TDX_PW_MCE) && image->type != KEXEC_TYPE_CRASH) {
+		pr_info_once("Kexec not allowed on platform with tdx_pw_mce bug\n");
 		return -EOPNOTSUPP;
 	}