Skip clearing a private page if it is marked as poisoned.
The machine check architecture may have the capability to recover from
memory corruption in SEAM non-root (i.e. TDX VM guest) mode. In that
case the page is marked as poisoned, and the TDX Module puts the TDX VM
into a FATAL error state where the only operation permitted is to tear it
down.
During tear down, reclaimed pages are cleared which, in some cases, helps
to avoid integrity violation or TD bit mismatch detection when later being
read using a shared HKID.
However a poisoned page will never be allocated again, so clearing is not
necessary, and in any case poisoned pages should not be touched.
Note that while it is possible that memory corruption arises from integrity
violation which could be cleared by MOVDIR64B, that is not necessarily the
cause of the machine check.
Suggested-by: Kai Huang <kai.huang@intel.com>
Fixes: 8d032b683c299 ("KVM: TDX: create/destroy VM structure")
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
arch/x86/kvm/vmx/tdx.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 457f91b95147..f4263f7a3924 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -282,10 +282,10 @@ static void tdx_clear_page(struct page *page)
void *dest = page_to_virt(page);
unsigned long i;
- /*
- * The page could have been poisoned. MOVDIR64B also clears
- * the poison bit so the kernel can safely use the page again.
- */
+ /* Machine check handler may have poisoned the page */
+ if (PageHWPoison(page))
+ return;
+
for (i = 0; i < PAGE_SIZE; i += 64)
movdir64b(dest + i, zero_page);
/*
--
2.48.1
On Wed, 2025-06-18 at 15:08 +0300, Hunter, Adrian wrote: > Skip clearing a private page if it is marked as poisoned. > > The machine check architecture may have the capability to recover from ^ "to recover" -> "to allow the kernel to recover"? > memory corruption in SEAM non-root (i.e. TDX VM guest) mode. In that > case the page is marked as poisoned, and the TDX Module puts the TDX VM "marked as poisoned" -> "marked as poisoned in the kernel"? Since next half of this sentence immediately talks about TDX module behaviour. > into a FATAL error state where the only operation permitted is to tear it > down. > > During tear down, reclaimed pages are cleared which, in some cases, helps ^ Double writespace in middle of sentence. > to avoid integrity violation or TD bit mismatch detection when later being > read using a shared HKID. > > However a poisoned page will never be allocated again, so clearing is not > necessary, and in any case poisoned pages should not be touched. > > Note that while it is possible that memory corruption arises from integrity > violation which could be cleared by MOVDIR64B, that is not necessarily the > cause of the machine check. > > Suggested-by: Kai Huang <kai.huang@intel.com> > Fixes: 8d032b683c299 ("KVM: TDX: create/destroy VM structure") > Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> With comments from Xiaoyao/Dave fixed, Reviewed-by: Kai Huang <kai.huang@intel.com>
On 6/18/25 05:08, Adrian Hunter wrote: > --- a/arch/x86/kvm/vmx/tdx.c > +++ b/arch/x86/kvm/vmx/tdx.c > @@ -282,10 +282,10 @@ static void tdx_clear_page(struct page *page) > void *dest = page_to_virt(page); > unsigned long i; > > - /* > - * The page could have been poisoned. MOVDIR64B also clears > - * the poison bit so the kernel can safely use the page again. > - */ > + /* Machine check handler may have poisoned the page */ > + if (PageHWPoison(page)) > + return; I think the old comment needs to stay in some form. There are two kinds of poisons here: One from an integrity mismatch and the other because the hardware decided the memory is bad. MOVDIR64B clears the integrity one, but not the hardware one obviously. Could we make that clear in the comment, please?
On Wed, Jun 18, 2025 at 7:58 AM Dave Hansen <dave.hansen@intel.com> wrote: > > On 6/18/25 05:08, Adrian Hunter wrote: > > --- a/arch/x86/kvm/vmx/tdx.c > > +++ b/arch/x86/kvm/vmx/tdx.c > > @@ -282,10 +282,10 @@ static void tdx_clear_page(struct page *page) > > void *dest = page_to_virt(page); > > unsigned long i; > > > > - /* > > - * The page could have been poisoned. MOVDIR64B also clears > > - * the poison bit so the kernel can safely use the page again. > > - */ > > + /* Machine check handler may have poisoned the page */ > > + if (PageHWPoison(page)) > > + return; IIUC, even if movdir64b stores contents on hwpoisoned pages, it's not going to cause any trouble. This check should be (unlikely(PageHWPoison(page)) and even better probably should be omitted altogether if there are no side effects of direct store to hwpoisoned pages. > > I think the old comment needs to stay in some form. > > There are two kinds of poisons here: One from an integrity mismatch and > the other because the hardware decided the memory is bad. MOVDIR64B > clears the integrity one, but not the hardware one obviously. To ensure I understand correctly, Am I correct in saying: movdir64b clearing the integrity poison is just hardware clearing the poison bit, software will still treat that page as poisoned? > > Could we make that clear in the comment, please? > >
On 25/06/2025 17:33, Vishal Annapurve wrote: > On Wed, Jun 18, 2025 at 7:58 AM Dave Hansen <dave.hansen@intel.com> wrote: >> >> On 6/18/25 05:08, Adrian Hunter wrote: >>> --- a/arch/x86/kvm/vmx/tdx.c >>> +++ b/arch/x86/kvm/vmx/tdx.c >>> @@ -282,10 +282,10 @@ static void tdx_clear_page(struct page *page) >>> void *dest = page_to_virt(page); >>> unsigned long i; >>> >>> - /* >>> - * The page could have been poisoned. MOVDIR64B also clears >>> - * the poison bit so the kernel can safely use the page again. >>> - */ >>> + /* Machine check handler may have poisoned the page */ >>> + if (PageHWPoison(page)) >>> + return; > > IIUC, even if movdir64b stores contents on hwpoisoned pages, it's not > going to cause any trouble. No. PageHWPoison(page) means the page should not be touched. It must be freed back to the allocator where it will never be allocated again. > > This check should be (unlikely(PageHWPoison(page)) and even better 'unlikely' would be fine > probably should be omitted altogether if there are no side effects of > direct store to hwpoisoned pages. > >> >> I think the old comment needs to stay in some form. >> >> There are two kinds of poisons here: One from an integrity mismatch and >> the other because the hardware decided the memory is bad. MOVDIR64B >> clears the integrity one, but not the hardware one obviously. > > To ensure I understand correctly, Am I correct in saying: movdir64b > clearing the integrity poison is just hardware clearing the poison > bit, software will still treat that page as poisoned? Typically an integrity violation would have caused a machine check and the machine check handler would have marked the page SetPageHWPoison(page). So we really end up with only 2 cases: 1. page is fine and PageHWPoison(page) is false 2. page may have had an integrity violation or a hardware error (we can't tell which), and PageHWPoison(page) is true
> 2. page may have had an integrity violation or a hardware error > (we can't tell which), and PageHWPoison(page) is true Right. I think the point of avoiding MOVDIR64B to such page is we cannot tell whether it is a hardware error or not. If it is a hardware error, touching it using MOVDIR64B may cause additional #MC which will panic kernel since now the #MC happens in the kernel context.
On 6/25/25 15:32, Huang, Kai wrote: >> 2. page may have had an integrity violation or a hardware error >> (we can't tell which), and PageHWPoison(page) is true > Right. I think the point of avoiding MOVDIR64B to such page is we cannot > tell whether it is a hardware error or not. If it is a hardware error, > touching it using MOVDIR64B may cause additional #MC which will panic kernel > since now the #MC happens in the kernel context. First and foremost, does the code path in question in this case touch userspace pages? Or pages that are only "kernel context" in the first place. Second, if we can't tell the difference between integrity violation or a hardware error and this code needs to clear "integrity violation" poison then won't this change just fundamentally break the erratum workaround in the first place?!?!
On Wed, 2025-06-25 at 15:38 -0700, Dave Hansen wrote: > On 6/25/25 15:32, Huang, Kai wrote: > > > 2. page may have had an integrity violation or a hardware error > > > (we can't tell which), and PageHWPoison(page) is true > > Right. I think the point of avoiding MOVDIR64B to such page is we cannot > > tell whether it is a hardware error or not. If it is a hardware error, > > touching it using MOVDIR64B may cause additional #MC which will panic kernel > > since now the #MC happens in the kernel context. > > First and foremost, does the code path in question in this case touch > userspace pages? Or pages that are only "kernel context" in the first place. The #MC happens when in SEAM non-root mode, i.e., when TDX guest tries to read it using its TDX private keyID, so practically it is a userspace page. But for such #MC it doesn't matter because the machine check handler handles #MC from SEAM non-root separately: do_machine_check(...) { ... } else if (m->mcgstatus & MCG_STATUS_SEAM_NR) { /* * Saved RIP on stack makes it look like the machine check * was taken in the kernel on the instruction following * the entry to SEAM mode. But MCG_STATUS_SEAM_NR indicates * that the machine check was taken inside SEAM non-root * mode. CPU core has already marked that guest as dead. * It is OK for the kernel to resume execution at the * apparent point of the machine check as the fault did * not occur there. Mark the page as poisoned so it won't * be added to free list when the guest is terminated. */ if (mce_usable_address(m)) { struct page *p = pfn_to_online_page(m->addr >> PAGE_SHIFT); if (p) SetPageHWPoison(p); } ... } However if the kernel touch the page again using MOVDIR64B, the further #MC won't have MCG_STATUS_SEAM_NR bit set (because it doesn't happen in SEAM non-root), therefore it will be treated as a normal kernel #MC which will result in kernel panic. > > Second, if we can't tell the difference between integrity violation or a > hardware error and this code needs to clear "integrity violation" poison > then won't this change just fundamentally break the erratum workaround > in the first place?!?! The existing upstream code clears "integrity violation" unconditionally, which IIUC doesn't work if the page is poisoned due to "hardware error". This patch avoids clearing integrity violation to avoid this, only if the page is marked as poisoned. The erratum workaround (using MOVDIR64B to clear) works if the page is not poisoned. I don't think it will break the erratum workaround. The whole assumption is if the page is marked as poisoned, the kernel will never used it again.
> However if the kernel touch the page again using MOVDIR64B, the further #MC > won't have MCG_STATUS_SEAM_NR bit set (because it doesn't happen in SEAM > non-root), therefore it will be treated as a normal kernel #MC which will > result in kernel panic. Intel CPUs signal #MC when an instruction that is trying to consume poison data is about to retire. Stores to memory do not consume poison, so will not signal a #MC. In the MOVDIR64B case an entire cache line is stored in a single atomic write. This will clear the poison state of the cacheline (assuming that the poison was due to an integrity error, memory error injection, I/O error etc. If the DIMM is bad and has stuck bits, then the memory may still fail ECC check on the next read.) Using smaller stores to overwrite the cache line will not clear poison. The cacheline is read from memory to some cache level, the small store updates some bytes in the line, but the poison flag remains. Note that this doesn't trigger #MC because the poison data is not being consumed, it still isn't architecturally visible in some register, memory, or I/O device. You may still see a UCNA signature signaled with CMCI from the memory controller if either case resulted in a speculative prefetch of the poisoned cache line. -Tony
On Thu, 2025-06-26 at 15:31 +0000, Luck, Tony wrote: > > However if the kernel touch the page again using MOVDIR64B, the further #MC > > won't have MCG_STATUS_SEAM_NR bit set (because it doesn't happen in SEAM > > non-root), therefore it will be treated as a normal kernel #MC which will > > result in kernel panic. > > Intel CPUs signal #MC when an instruction that is trying to consume poison data > is about to retire. > > Stores to memory do not consume poison, so will not signal a #MC. > > In the MOVDIR64B case an entire cache line is stored in a single atomic > write. This will clear the poison state of the cacheline (assuming that the > poison was due to an integrity error, memory error injection, I/O error etc. > If the DIMM is bad and has stuck bits, then the memory may still fail ECC > check on the next read.) > > Using smaller stores to overwrite the cache line will not clear poison. The > cacheline is read from memory to some cache level, the small store updates > some bytes in the line, but the poison flag remains. Note that this doesn't > trigger #MC because the poison data is not being consumed, it still isn't > architecturally visible in some register, memory, or I/O device. > > You may still see a UCNA signature signaled with CMCI from the memory > controller if either case resulted in a speculative prefetch of the poisoned > cache line. > > -Tony Thanks for the info. :-) So it seems MOVDIR64B to a bad memory won't necessarily trigger #MC when the written is performed. But IMHO we may should just have a simple policy that when a page is marked as poisoned, it should never be touched again. It's only one page anyway (for one TD) so losing that doesn't seem bad to me. If we want to clear the poisoned page, then perhaps we should mark that page to be not-poisoned again.
On 6/26/25 15:20, Huang, Kai wrote: > But IMHO we may should just have a simple policy that when a page is marked > as poisoned, it should never be touched again. It's only one page anyway > (for one TD) so losing that doesn't seem bad to me. If we want to clear the > poisoned page, then perhaps we should mark that page to be not-poisoned > again. The simplest policy is to do nothing. The kernel only has 29 places that check PageHWPoison(). I'd guess that roughly half of those are the memory-failure.c infrastructure and bare-minimum code to handle poison, like not allowing pages to go back into the allocator. There are something like 5,000 lines of code in the kernel that deal with a literal 'struct page'. 29 checks for ~5,000 sites is pretty minuscule. We obviously don't have a policy that every place that uses 'struct page' needs to check for poison. We also don't even have a policy where writes to or reads from a page check for poison. Why is this TDX code so special that PageHWPoison() needs to be checked. For instance: $ grep -r PageHWPoison arch/x86/ arch/x86/kernel/cpu/mce/core.c: SetPageHWPoison(p); arch/x86/kernel/cpu/mce/core.c: SetPageHWPoison(p); In other words, this would be the *ONLY* arch/x86 site. Why?
On Thu, 2025-06-26 at 15:33 -0700, Hansen, Dave wrote: > On 6/26/25 15:20, Huang, Kai wrote: > > But IMHO we may should just have a simple policy that when a page is marked > > as poisoned, it should never be touched again. It's only one page anyway > > (for one TD) so losing that doesn't seem bad to me. If we want to clear the > > poisoned page, then perhaps we should mark that page to be not-poisoned > > again. > > The simplest policy is to do nothing. > > The kernel only has 29 places that check PageHWPoison(). I'd guess that > roughly half of those are the memory-failure.c infrastructure and > bare-minimum code to handle poison, like not allowing pages to go back > into the allocator. > > There are something like 5,000 lines of code in the kernel that deal > with a literal 'struct page'. 29 checks for ~5,000 sites is pretty > minuscule. We obviously don't have a policy that every place that uses > 'struct page' needs to check for poison. We also don't even have a > policy where writes to or reads from a page check for poison. It was my understanding that if page is marked as poisoned the kernel should not touch that again. I thought the kernel should have already implemented in this way, like not allowing pages to go back to the allocator, and the places that use 'struct page' you mentioned should already know the page is not poisoned. That being said it's just my guess, so my bad. > > Why is this TDX code so special that PageHWPoison() needs to be checked. > For instance: > > $ grep -r PageHWPoison arch/x86/ > arch/x86/kernel/cpu/mce/core.c: SetPageHWPoison(p); > arch/x86/kernel/cpu/mce/core.c: SetPageHWPoison(p); > > In other words, this would be the *ONLY* arch/x86 site. Why? This is the case that I know the kernel could touch poisoned page. And I didn't know writing to hardware error memory is fine, but I thought we should just skip it for safety. But per Tony it should be fine to write to it, so I am fine to not have the PageHWPoison() check here.
On Wed, 2025-06-25 at 19:25 +0300, Adrian Hunter wrote: > > To ensure I understand correctly, Am I correct in saying: movdir64b > > clearing the integrity poison is just hardware clearing the poison > > bit, software will still treat that page as poisoned? > > Typically an integrity violation would have caused a machine check > and the machine check handler would have marked the page > SetPageHWPoison(page). > > So we really end up with only 2 cases: > 1. page is fine and PageHWPoison(page) is false > 2. page may have had an integrity violation or a hardware error > (we can't tell which), and PageHWPoison(page) is true Could an access in userspace take an #MC, and have the kernel set the direct map for the page to NP and kill process? Then another process with access to the gmem fd tries to map it as private, and take a non-integrity related SEAMMODE #MC and ends up here? It does seem safer to just avoid touching it. But I bet there are many cases like this, and we can't check poison everywhere. If guestmemfd turns into more of a persistent allocator, instead of a per-VM thing, it probably needs to do some of the checks that the page allocator does, like poison checks.
On 6/25/25 09:25, Adrian Hunter wrote: >> IIUC, even if movdir64b stores contents on hwpoisoned pages, it's not >> going to cause any trouble. > No. PageHWPoison(page) means the page should not be touched. It must > be freed back to the allocator where it will never be allocated again. What's the end-user-visible effect if the page is touched in this specific function in this specific way (with movdir64b)? In other words, what does this patch do for end users?
On 25/06/2025 19:31, Dave Hansen wrote: > On 6/25/25 09:25, Adrian Hunter wrote: >>> IIUC, even if movdir64b stores contents on hwpoisoned pages, it's not >>> going to cause any trouble. >> No. PageHWPoison(page) means the page should not be touched. It must >> be freed back to the allocator where it will never be allocated again. > > What's the end-user-visible effect if the page is touched in this > specific function in this specific way (with movdir64b)? > > In other words, what does this patch do for end users? We have another patch that clarifies that. It turns out we need to clear pages (MOVDIR64B) only if the platform has the so-called partial-write errata X86_BUG_TDX_PW_MCE. It was decided to deal with that issue separately, but maybe it needs to bundled in the same patch set, now that the discussion has moved in that direction?
On 6/25/25 09:42, Adrian Hunter wrote: > On 25/06/2025 19:31, Dave Hansen wrote: >> On 6/25/25 09:25, Adrian Hunter wrote: >>>> IIUC, even if movdir64b stores contents on hwpoisoned pages, it's not >>>> going to cause any trouble. >>> No. PageHWPoison(page) means the page should not be touched. It must >>> be freed back to the allocator where it will never be allocated again. >> >> What's the end-user-visible effect if the page is touched in this >> specific function in this specific way (with movdir64b)? >> >> In other words, what does this patch do for end users? > > We have another patch that clarifies that. It turns out we need > to clear pages (MOVDIR64B) only if the platform has the so-called > partial-write errata X86_BUG_TDX_PW_MCE. > > It was decided to deal with that issue separately, but maybe it > needs to bundled in the same patch set, now that the discussion > has moved in that direction? It sure sounds like it to me.
On 6/18/2025 8:08 PM, Adrian Hunter wrote: > Skip clearing a private page if it is marked as poisoned. > > The machine check architecture may have the capability to recover from > memory corruption in SEAM non-root (i.e. TDX VM guest) mode. In that > case the page is marked as poisoned, and the TDX Module puts the TDX VM > into a FATAL error state where the only operation permitted is to tear it > down. > > During tear down, reclaimed pages are cleared which, in some cases, helps > to avoid integrity violation or TD bit mismatch detection when later being > read using a shared HKID. > > However a poisoned page will never be allocated again, so clearing is not > necessary, and in any case poisoned pages should not be touched. > > Note that while it is possible that memory corruption arises from integrity > violation which could be cleared by MOVDIR64B, that is not necessarily the > cause of the machine check. > > Suggested-by: Kai Huang <kai.huang@intel.com> > Fixes: 8d032b683c299 ("KVM: TDX: create/destroy VM structure") > Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> > --- > arch/x86/kvm/vmx/tdx.c | 8 ++++---- > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c > index 457f91b95147..f4263f7a3924 100644 > --- a/arch/x86/kvm/vmx/tdx.c > +++ b/arch/x86/kvm/vmx/tdx.c > @@ -282,10 +282,10 @@ static void tdx_clear_page(struct page *page) > void *dest = page_to_virt(page); > unsigned long i; > > - /* > - * The page could have been poisoned. MOVDIR64B also clears > - * the poison bit so the kernel can safely use the page again. > - */ > + /* Machine check handler may have poisoned the page */ This doesn't read correct. The page is not poisoned by the machine check handler. Machine check handler just marks the page as poisoned. With it fixed, Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> > + if (PageHWPoison(page)) > + return; > + > for (i = 0; i < PAGE_SIZE; i += 64) > movdir64b(dest + i, zero_page); > /*
© 2016 - 2025 Red Hat, Inc.