Commit 8a01ec97dc066 ("x86/mce: Mask out non-address bits from machine
check bank") introduced a new #define MCI_ADDR_PHYSADDR for the mask of
valid physical address bits within the machine check bank address register.
This is particularly needed in the case of errors in TDX/SEAM non-root mode
because the reported address contains the TDX KeyID. Refer to TDX and
TME-MK documentation for more information about KeyIDs.
Commit 7911f145de5fe ("x86/mce: Implement recovery for errors in TDX/SEAM
non-root mode") uses the address to mark the affected page as poisoned, but
omits to use the aforementioned mask.
Mask the address with MCI_ADDR_PHYSADDR so that the page can be found.
Fixes: 7911f145de5fe ("x86/mce: Implement recovery for errors in TDX/SEAM non-root mode")
Cc: stable@vger.kernel.org
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
arch/x86/kernel/cpu/mce/core.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index e9b3c5d4a52e..76c4634c6a5f 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1665,7 +1665,8 @@ noinstr void do_machine_check(struct pt_regs *regs)
* be added to free list when the guest is terminated.
*/
if (mce_usable_address(m)) {
- struct page *p = pfn_to_online_page(m->addr >> PAGE_SHIFT);
+ unsigned long pfn = (m->addr & MCI_ADDR_PHYSADDR) >> PAGE_SHIFT;
+ struct page *p = pfn_to_online_page(pfn);
if (p)
SetPageHWPoison(p);
--
2.48.1
On Wed, 2025-06-18 at 15:08 +0300, Hunter, Adrian wrote: > Commit 8a01ec97dc066 ("x86/mce: Mask out non-address bits from machine > check bank") introduced a new #define MCI_ADDR_PHYSADDR for the mask of > valid physical address bits within the machine check bank address register. > > This is particularly needed in the case of errors in TDX/SEAM non-root mode > because the reported address contains the TDX KeyID. > Just wondering, do you know whether this is documented anywhere? If it is, I think it should be helpful if you can refer that in the changelog.
> > Commit 8a01ec97dc066 ("x86/mce: Mask out non-address bits from machine > > check bank") introduced a new #define MCI_ADDR_PHYSADDR for the mask of > > valid physical address bits within the machine check bank address register. > > > > This is particularly needed in the case of errors in TDX/SEAM non-root mode > > because the reported address contains the TDX KeyID. > > > > Just wondering, do you know whether this is documented anywhere? If it is, > I think it should be helpful if you can refer that in the changelog. It's sort of hinted at in the SDM Vol 3B Figure 17-7. IA32_MCi_ADDR MSR with the footnote in the diagram: "Useful bits in this field depend on the address methodology in use when the the register state is saved." Maybe there is something more explicit in documentation for memory encryption? -Tony
On Wed, 2025-06-18 at 23:39 +0000, Luck, Tony wrote: > > > Commit 8a01ec97dc066 ("x86/mce: Mask out non-address bits from machine > > > check bank") introduced a new #define MCI_ADDR_PHYSADDR for the mask of > > > valid physical address bits within the machine check bank address register. > > > > > > This is particularly needed in the case of errors in TDX/SEAM non-root mode > > > because the reported address contains the TDX KeyID. > > > > > > > Just wondering, do you know whether this is documented anywhere? If it is, > > I think it should be helpful if you can refer that in the changelog. > > It's sort of hinted at in the SDM Vol 3B Figure 17-7. IA32_MCi_ADDR MSR > with the footnote in the diagram: > > "Useful bits in this field depend on the address methodology in use when the > the register state is saved." Thanks for the info. > > Maybe there is something more explicit in documentation for memory encryption? I didn't find any. The TDX module base architecture spec (16.2.3. Memory Integrity Error Logging, Machine Checks and Unbreakable Shutdowns) says below: The poison memory address, at a granularity no finer than 32 bytes, is logged in IA32_MCi_ADDRESS MSRs. It doesn't explicitly say KeyID is appended.
> It's sort of hinted at in the SDM Vol 3B Figure 17-7. IA32_MCi_ADDR MSR > with the footnote in the diagram: > > "Useful bits in this field depend on the address methodology in use when the > the register state is saved." > > Maybe there is something more explicit in documentation for memory encryption? Section 5.1 in https://software.intel.com/sites/default/files/managed/a5/16/Multi-Key-Total-Memory-Encryption-Spec.pdf shows how the upper bits of the physical address are used for the :KeyID" -Tony
On Wed, 2025-06-18 at 23:46 +0000, Luck, Tony wrote: > > It's sort of hinted at in the SDM Vol 3B Figure 17-7. IA32_MCi_ADDR MSR > > with the footnote in the diagram: > > > > "Useful bits in this field depend on the address methodology in use when the > > the register state is saved." > > > > Maybe there is something more explicit in documentation for memory encryption? > > > Section 5.1 in > https://software.intel.com/sites/default/files/managed/a5/16/Multi-Key-Total-Memory-Encryption-Spec.pdf > > shows how the upper bits of the physical address are used for the :KeyID" > > -Tony Yeah. So I guess it's somehow implied the KeyID bits, which are "useful bits", are also recorded in IA32_MCi_ADDR.
On 6/18/25 05:08, Adrian Hunter wrote: > --- a/arch/x86/kernel/cpu/mce/core.c > +++ b/arch/x86/kernel/cpu/mce/core.c > @@ -1665,7 +1665,8 @@ noinstr void do_machine_check(struct pt_regs *regs) > * be added to free list when the guest is terminated. > */ > if (mce_usable_address(m)) { > - struct page *p = pfn_to_online_page(m->addr >> PAGE_SHIFT); > + unsigned long pfn = (m->addr & MCI_ADDR_PHYSADDR) >> PAGE_SHIFT; > + struct page *p = pfn_to_online_page(pfn); If ->addr isn't really an address that software can do much with, shouldn't we mask MCI_ADDR_PHYSADDR off up front, like in mce_read_aux()? Maybe we should break it up into address and KeyID _there_.
On 18/06/2025 17:55, Dave Hansen wrote: > On 6/18/25 05:08, Adrian Hunter wrote: >> --- a/arch/x86/kernel/cpu/mce/core.c >> +++ b/arch/x86/kernel/cpu/mce/core.c >> @@ -1665,7 +1665,8 @@ noinstr void do_machine_check(struct pt_regs *regs) >> * be added to free list when the guest is terminated. >> */ >> if (mce_usable_address(m)) { >> - struct page *p = pfn_to_online_page(m->addr >> PAGE_SHIFT); >> + unsigned long pfn = (m->addr & MCI_ADDR_PHYSADDR) >> PAGE_SHIFT; >> + struct page *p = pfn_to_online_page(pfn); > > If ->addr isn't really an address that software can do much with, > shouldn't we mask MCI_ADDR_PHYSADDR off up front, like in mce_read_aux()? Would that mean no one would know if the mce addr had KeyID bits or not? > > Maybe we should break it up into address and KeyID _there_. Could we deal with any tidy-ups in separate patches?
On 19/06/2025 14:57, Adrian Hunter wrote: > On 18/06/2025 17:55, Dave Hansen wrote: >> On 6/18/25 05:08, Adrian Hunter wrote: >>> --- a/arch/x86/kernel/cpu/mce/core.c >>> +++ b/arch/x86/kernel/cpu/mce/core.c >>> @@ -1665,7 +1665,8 @@ noinstr void do_machine_check(struct pt_regs *regs) >>> * be added to free list when the guest is terminated. >>> */ >>> if (mce_usable_address(m)) { >>> - struct page *p = pfn_to_online_page(m->addr >> PAGE_SHIFT); >>> + unsigned long pfn = (m->addr & MCI_ADDR_PHYSADDR) >> PAGE_SHIFT; >>> + struct page *p = pfn_to_online_page(pfn); >> >> If ->addr isn't really an address that software can do much with, >> shouldn't we mask MCI_ADDR_PHYSADDR off up front, like in mce_read_aux()? > > Would that mean no one would know if the mce addr had KeyID bits or not? Current design, to keep the bits in mce addr, is from Tony's patch: x86/mce: Mask out non-address bits from machine check bank https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8a01ec97dc066009dd89e43bfcf55644f2dd6d19 Assuming that is not altered, a tidy-up is still possible like: diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h index 6c77c03139f7..b469b7a7ecfa 100644 --- a/arch/x86/include/asm/mce.h +++ b/arch/x86/include/asm/mce.h @@ -386,4 +386,14 @@ static inline void mce_amd_feature_init(struct cpuinfo_x86 *c) { } unsigned long copy_mc_fragile_handle_tail(char *to, char *from, unsigned len); +static inline unsigned long mce_addr_to_phys(u64 mce_addr) +{ + return mce_addr & MCI_ADDR_PHYSADDR; +} + +static inline unsigned long mce_addr_to_pfn(u64 mce_addr) +{ + return mce_addr_to_phys(mce_addr) >> PAGE_SHIFT; +} + #endif /* _ASM_X86_MCE_H */ diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index 76c4634c6a5f..e9e8c377790f 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -642,7 +642,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val, mce->severity != MCE_DEFERRED_SEVERITY) return NOTIFY_DONE; - pfn = (mce->addr & MCI_ADDR_PHYSADDR) >> PAGE_SHIFT; + pfn = mce_addr_to_pfn(mce->addr); if (!memory_failure(pfn, 0)) { set_mce_nospec(pfn); mce->kflags |= MCE_HANDLED_UC; @@ -1412,7 +1412,7 @@ static void kill_me_maybe(struct callback_head *cb) if (!p->mce_ripv) flags |= MF_MUST_KILL; - pfn = (p->mce_addr & MCI_ADDR_PHYSADDR) >> PAGE_SHIFT; + pfn = mce_addr_to_pfn(p->mce_addr); ret = memory_failure(pfn, flags); if (!ret) { set_mce_nospec(pfn); @@ -1441,7 +1441,7 @@ static void kill_me_never(struct callback_head *cb) p->mce_count = 0; pr_err("Kernel accessed poison in user space at %llx\n", p->mce_addr); - pfn = (p->mce_addr & MCI_ADDR_PHYSADDR) >> PAGE_SHIFT; + pfn = mce_addr_to_pfn(p->mce_addr); if (!memory_failure(pfn, 0)) set_mce_nospec(pfn); } @@ -1665,7 +1665,7 @@ noinstr void do_machine_check(struct pt_regs *regs) * be added to free list when the guest is terminated. */ if (mce_usable_address(m)) { - unsigned long pfn = (m->addr & MCI_ADDR_PHYSADDR) >> PAGE_SHIFT; + unsigned long pfn = mce_addr_to_pfn(m->addr); struct page *p = pfn_to_online_page(pfn); if (p) diff --git a/drivers/cxl/core/mce.c b/drivers/cxl/core/mce.c index ff8d078c6ca1..f3c4d6a5f159 100644 --- a/drivers/cxl/core/mce.c +++ b/drivers/cxl/core/mce.c @@ -24,7 +24,7 @@ static int cxl_handle_mce(struct notifier_block *nb, unsigned long val, if (!endpoint) return NOTIFY_DONE; - spa = mce->addr & MCI_ADDR_PHYSADDR; + spa = mce_addr_to_phys(mce->addr); pfn = spa >> PAGE_SHIFT; if (!pfn_valid(pfn)) diff --git a/drivers/edac/skx_common.c b/drivers/edac/skx_common.c index c9ade45c1a99..83fcec743ea7 100644 --- a/drivers/edac/skx_common.c +++ b/drivers/edac/skx_common.c @@ -732,7 +732,7 @@ int skx_mce_check_error(struct notifier_block *nb, unsigned long val, memset(&res, 0, sizeof(res)); res.mce = mce; - res.addr = mce->addr & MCI_ADDR_PHYSADDR; + res.addr = mce_addr_to_phys(mce->addr); if (!pfn_to_online_page(res.addr >> PAGE_SHIFT) && !arch_is_platform_page(res.addr)) { pr_err("Invalid address 0x%llx in IA32_MC%d_ADDR\n", mce->addr, mce->bank); return NOTIFY_DONE;
> >> If ->addr isn't really an address that software can do much with, > >> shouldn't we mask MCI_ADDR_PHYSADDR off up front, like in mce_read_aux()? > > > > Would that mean no one would know if the mce addr had KeyID bits or not? > > Current design, to keep the bits in mce addr, is from Tony's patch: > > x86/mce: Mask out non-address bits from machine check bank > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8a01ec97dc066009dd89e43bfcf55644f2dd6d19 > > Assuming that is not altered, a tidy-up is still possible like: > > diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h > index 6c77c03139f7..b469b7a7ecfa 100644 > --- a/arch/x86/include/asm/mce.h > +++ b/arch/x86/include/asm/mce.h > @@ -386,4 +386,14 @@ static inline void mce_amd_feature_init(struct cpuinfo_x86 *c) { } > > unsigned long copy_mc_fragile_handle_tail(char *to, char *from, unsigned len); > > +static inline unsigned long mce_addr_to_phys(u64 mce_addr) > +{ > + return mce_addr & MCI_ADDR_PHYSADDR; > +} > + > +static inline unsigned long mce_addr_to_pfn(u64 mce_addr) > +{ > + return mce_addr_to_phys(mce_addr) >> PAGE_SHIFT; > +} > + > #endif /* _ASM_X86_MCE_H */ I like this. Can you write up a patch with a commit message please? -Tony
On 6/27/25 08:23, Adrian Hunter wrote: > On 19/06/2025 14:57, Adrian Hunter wrote: >> On 18/06/2025 17:55, Dave Hansen wrote: >>> On 6/18/25 05:08, Adrian Hunter wrote: >>>> --- a/arch/x86/kernel/cpu/mce/core.c >>>> +++ b/arch/x86/kernel/cpu/mce/core.c >>>> @@ -1665,7 +1665,8 @@ noinstr void do_machine_check(struct pt_regs *regs) >>>> * be added to free list when the guest is terminated. >>>> */ >>>> if (mce_usable_address(m)) { >>>> - struct page *p = pfn_to_online_page(m->addr >> PAGE_SHIFT); >>>> + unsigned long pfn = (m->addr & MCI_ADDR_PHYSADDR) >> PAGE_SHIFT; >>>> + struct page *p = pfn_to_online_page(pfn); >>> If ->addr isn't really an address that software can do much with, >>> shouldn't we mask MCI_ADDR_PHYSADDR off up front, like in mce_read_aux()? >> Would that mean no one would know if the mce addr had KeyID bits or not? Uhhh, just store the KeyID separately. Have mce->addr and mce->keyid Problem solved.
>>> Would that mean no one would know if the mce addr had KeyID bits or not? > > Uhhh, just store the KeyID separately. Have > > mce->addr > and > mce->keyid > > Problem solved. As the old saying goes: "Solve one problem and create three new ones". "struct mce" is in user visible in arch/x86/include/uapi/asm/mce.h (via /dev/mcelog). So while we can add new fields (at the end of the structure) some thought about the implications are needed. We've been sending a combined key+address in the "mce->addr" to user space for a while. Has anyone built infrastructure on top of that? -Tony
On 6/27/25 09:24, Luck, Tony wrote: > We've been sending a combined key+address in the "mce->addr" to > user space for a while. Has anyone built infrastructure on top of that? I'm not sure they can do anything useful with an address that has the KeyID in the first place. The partitioning scheme is in an MSR, so they'd need to be doing silly gymnastics to even decode the address. Userspace can deal with the KeyID not being in the address. It's been the default for ages. So, if we take it back out, I'd expect it fixes more things than it breaks. So, yeah, we should carefully consider it. But it still 100% looks like the right thing to me to detangle the KeyID and physical address in the ABI.
On 27/06/2025 19:33, Dave Hansen wrote: > On 6/27/25 09:24, Luck, Tony wrote: >> We've been sending a combined key+address in the "mce->addr" to >> user space for a while. Has anyone built infrastructure on top of that? > > I'm not sure they can do anything useful with an address that has the > KeyID in the first place. The partitioning scheme is in an MSR, so > they'd need to be doing silly gymnastics to even decode the address. > > Userspace can deal with the KeyID not being in the address. It's been > the default for ages. So, if we take it back out, I'd expect it fixes > more things than it breaks. > > So, yeah, we should carefully consider it. But it still 100% looks like > the right thing to me to detangle the KeyID and physical address in the ABI. Coming back to this after a bit of a break. It feels unlikely to me that any users are expecting KeyID in mce->addr. Looking at user space programs like mcelog and rasdaemon, gives the impression that mce->addr contains only an address. The UAPI header file describes addr as "Bank's MCi_ADDR MSR", but what mce_read_aux() does tends to contradict that, especially for AMD SMCA. But there are also additional places where it seems like MCI_ADDR_PHYSADDR is missing: tdx_dump_mce_info() paddr_is_tdx_private() __seamcall_ret(TDH_PHYMEM_PAGE_RDMD, &args) TDH_PHYMEM_PAGE_RDMD expects KeyID bits to be zero skx_mce_output_error() edac_mc_handle_error() expects page_frame_number, so without KeyID The KeyID is probably only useful for potentially identifying the TD, but given that the TD incurs a FATAL error, that may be obvious anyway. So removing the KeyID from mce->addr looks like the right thing to do. Note AFAICT there are 3 kernel APIs that deal with the MCE address: Device /dev/mcelog which outputs struct mce Tracepoint mce:mce_record which outputs members from struct mce Tracepoint ras:mc_event where the kernel constructs the address from page_frame_number implying that KeyID should not be present I guess it would be sensible to ask what customers think. Vishal, do you know anyone at Google who deals with handling machine check information, and who might have an opinion on this?
On Wed, Jul 30, 2025 at 3:55 AM Adrian Hunter <adrian.hunter@intel.com> wrote: > > On 27/06/2025 19:33, Dave Hansen wrote: > > On 6/27/25 09:24, Luck, Tony wrote: > >> We've been sending a combined key+address in the "mce->addr" to > >> user space for a while. Has anyone built infrastructure on top of that? > > > > I'm not sure they can do anything useful with an address that has the > > KeyID in the first place. The partitioning scheme is in an MSR, so > > they'd need to be doing silly gymnastics to even decode the address. > > > > Userspace can deal with the KeyID not being in the address. It's been > > the default for ages. So, if we take it back out, I'd expect it fixes > > more things than it breaks. > > > > So, yeah, we should carefully consider it. But it still 100% looks like > > the right thing to me to detangle the KeyID and physical address in the ABI. > > Coming back to this after a bit of a break. > > It feels unlikely to me that any users are expecting KeyID in mce->addr. > > Looking at user space programs like mcelog and rasdaemon, gives the > impression that mce->addr contains only an address. > > The UAPI header file describes addr as "Bank's MCi_ADDR MSR", but what > mce_read_aux() does tends to contradict that, especially for AMD > SMCA. > > But there are also additional places where it seems like MCI_ADDR_PHYSADDR > is missing: > > tdx_dump_mce_info() > paddr_is_tdx_private() > __seamcall_ret(TDH_PHYMEM_PAGE_RDMD, &args) > TDH_PHYMEM_PAGE_RDMD expects KeyID bits to be zero > > skx_mce_output_error() > edac_mc_handle_error() > expects page_frame_number, so without KeyID > > The KeyID is probably only useful for potentially identifying the TD, but > given that the TD incurs a FATAL error, that may be obvious anyway. > > So removing the KeyID from mce->addr looks like the right thing to do. > > Note AFAICT there are 3 kernel APIs that deal with the MCE address: > > Device /dev/mcelog which outputs struct mce > Tracepoint mce:mce_record which outputs members from struct mce > Tracepoint ras:mc_event where the kernel constructs the address > from page_frame_number implying that KeyID should not be present > > I guess it would be sensible to ask what customers think. > > Vishal, do you know anyone at Google who deals with handling machine > check information, and who might have an opinion on this? > I think it's safe to assume Google hasn't built any infra in the userspace that needs KeyID bits in the mce address. That being said, Dave's suggestion to "detangle the KeyID and physical address in the ABI" makes sense to me.
On Wed, 2025-07-30 at 13:54 +0300, Adrian Hunter wrote: > But there are also additional places where it seems like MCI_ADDR_PHYSADDR > is missing: > > tdx_dump_mce_info() > paddr_is_tdx_private() > __seamcall_ret(TDH_PHYMEM_PAGE_RDMD, &args) > TDH_PHYMEM_PAGE_RDMD expects KeyID bits to be zero This is only called in mce_panic() path, which basically means the #MC is fatal, e.g., happens in kernel context. The intention of this is to catch any #MC due to kernel bug (i.e., software issue, but not hardware error) which does partial write to TDX private memory and read at a later time, and report a more precise information to the user to point out this could be due to "possible kernel bug". See changelog of 70060463cb2b ("x86/mce: Differentiate real hardware #MCs from TDX erratum ones"). In other words, for this case the address reported via MCI_ADDR_PHYSADDR should not contain any KeyID bits since the kernel always uses keyID 0 to read. I believe the KeyID bits will only be appended to the physical address reported in MCI_ADDR_PHYSADDR when the #MC was triggered from TDX guest, i.e., when the CPU was accessing memory using TDX KeyID. Such #MC is not fatal and won't call into mce_panic(). That being said, for tdx_dump_mce_info(), while explicitly masking out keyID bits in MCI_ADDR_PHYSADDR obviously doesn't hurt (or arguably better in some way), it is not necessary AFAICT.
On 6/18/2025 8:08 PM, Adrian Hunter wrote: > Commit 8a01ec97dc066 ("x86/mce: Mask out non-address bits from machine > check bank") introduced a new #define MCI_ADDR_PHYSADDR for the mask of > valid physical address bits within the machine check bank address register. > > This is particularly needed in the case of errors in TDX/SEAM non-root mode > because the reported address contains the TDX KeyID. Refer to TDX and > TME-MK documentation for more information about KeyIDs. > > Commit 7911f145de5fe ("x86/mce: Implement recovery for errors in TDX/SEAM > non-root mode") uses the address to mark the affected page as poisoned, but > omits to use the aforementioned mask. > > Mask the address with MCI_ADDR_PHYSADDR so that the page can be found. > > Fixes: 7911f145de5fe ("x86/mce: Implement recovery for errors in TDX/SEAM non-root mode") > Cc: stable@vger.kernel.org Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> > Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> > --- > arch/x86/kernel/cpu/mce/core.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c > index e9b3c5d4a52e..76c4634c6a5f 100644 > --- a/arch/x86/kernel/cpu/mce/core.c > +++ b/arch/x86/kernel/cpu/mce/core.c > @@ -1665,7 +1665,8 @@ noinstr void do_machine_check(struct pt_regs *regs) > * be added to free list when the guest is terminated. > */ > if (mce_usable_address(m)) { > - struct page *p = pfn_to_online_page(m->addr >> PAGE_SHIFT); > + unsigned long pfn = (m->addr & MCI_ADDR_PHYSADDR) >> PAGE_SHIFT; > + struct page *p = pfn_to_online_page(pfn); > > if (p) > SetPageHWPoison(p);
© 2016 - 2025 Red Hat, Inc.