From: Ruidong Tian <tianruidong@linux.alibaba.com>
To: dan.j.williams@intel.com, vishal.l.verma@intel.com,
	dave.jiang@intel.com, tony.luck@intel.com, bp@alien8.de,
	linux-cxl@vger.kernel.org, linux-edac@vger.kernel.org,
	linux-kernel@vger.kernel.org, xueshuai@linux.alibaba.com
Cc: Ruidong Tian <tianruidong@linux.alibaba.com>
Subject: [RFC PATCH v2] device/dax: Allow MCE recovery when accessing PFN metadata
Date: Wed, 31 Dec 2025 11:05:26 +0800
Message-Id: <20251231030526.108309-1-tianruidong@linux.alibaba.com>

Both fsdax and devdax modes require significant space to store Page Frame
Number (PFN) metadata (struct page). For a 1TiB namespace, approximately
16GiB (17.18GB) of metadata is needed[0], i.e. one 64-byte struct page per
4KiB page. As namespace sizes scale, hardware memory errors within this
metadata region become increasingly likely.

Currently, the kernel treats any access to corrupted PFN metadata as an
unrecoverable event, leading to an immediate system panic. However, in DAX
scenarios (e.g., CXL-attached memory), the impact of metadata corruption
is logically confined to the physical device backing that specific memory
range.

Instead of a global panic, the kernel can localize the failure. By
allowing the affected DAX memory range to be offlined or the specific
device to be decommissioned, we can limit the blast radius of hardware
errors. This enables other processes to migrate or exit gracefully rather
than being terminated by a system-wide crash.

To that end, dax_set_mapping() now probes the struct page and struct folio
of every page in the faulting range with copy_mc_to_kernel() before
touching them. If the probe fails, dax_set_mapping() returns -EFAULT and
the device-dax PTE/PMD/PUD fault handlers return VM_FAULT_SIGBUS.

Reproduction and testing:

1. Inject an error into the PFN metadata
2. mmap the device and read from it

Before applying this patch, the kernel panics:

  CPU 120: Machine Check Exception: f Bank 1: bd80000000100134
  RIP 10: {dax_set_mapping.isra.0+0xce/0x140}
  TSC ee24b9e2d5 ADDR b213398000 MISC 86 PPIN 6deeb6484732971d
  PROCESSOR 0:a06d1 TIME 1765336050 SOCKET 0 APIC b1 microcode 10003f3
  Run the above through 'mcelog --ascii'
  Machine check: Data load in unrecoverable area of kernel
  Kernel panic - not syncing: Fatal local machine check

After applying this patch, the user application receives SIGBUS and the
system stays alive.

[0]: https://docs.pmem.io/ndctl-user-guide/managing-namespaces#fsdax-and-devdax-capacity-considerations

Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
---
 drivers/dax/dax-private.h | 25 +++++++++++++++++++++++++
 drivers/dax/device.c      | 20 ++++++++++++++++----
 2 files changed, 41 insertions(+), 4 deletions(-)

diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 0867115aeef2..4fd3065ae3ba 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -129,4 +129,29 @@ static inline bool dax_align_valid(unsigned long align)
 	return align == PAGE_SIZE;
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+#ifndef copy_mc_to_kernel
+static inline int dax_test_page_mc(const struct page *page)
+{
+	return 0;
+}
+static inline int dax_test_folio_mc(const struct folio *page)
+{
+	return 0;
+}
+#else
+#include <linux/uaccess.h>
+static inline int dax_test_page_mc(const struct page *page)
+{
+	struct page _p;
+
+	return copy_mc_to_kernel(&_p, page, sizeof(struct page));
+}
+static inline int dax_test_folio_mc(const struct folio *folio)
+{
+	struct folio _f;
+
+	return copy_mc_to_kernel(&_f, folio, sizeof(struct folio));
+}
+#endif
 #endif
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 22999a402e02..cafe802aacb2 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -80,7 +80,7 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
 	return -1;
 }
 
-static void dax_set_mapping(struct vm_fault *vmf, unsigned long pfn,
+static int dax_set_mapping(struct vm_fault *vmf, unsigned long pfn,
 			      unsigned long fault_size)
 {
 	unsigned long i, nr_pages = fault_size / PAGE_SIZE;
@@ -95,6 +95,13 @@ static void dax_set_mapping(struct vm_fault *vmf, unsigned long pfn,
 	pgoff = linear_page_index(vmf->vma,
 			ALIGN_DOWN(vmf->address, fault_size));
 
+	for (i = 0; i < nr_pages; i++) {
+		struct page *p = pfn_to_page(pfn + i);
+
+		if (dax_test_page_mc(p) || dax_test_folio_mc(page_folio(p)))
+			return -EFAULT;
+	}
+
 	for (i = 0; i < nr_pages; i++) {
 		struct folio *folio = pfn_folio(pfn + i);
 
@@ -104,6 +111,8 @@ static void dax_set_mapping(struct vm_fault *vmf, unsigned long pfn,
 		folio->mapping = filp->f_mapping;
 		folio->index = pgoff + i;
 	}
+
+	return 0;
 }
 
 static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
@@ -134,7 +143,8 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
 
 	pfn = PHYS_PFN(phys);
 
-	dax_set_mapping(vmf, pfn, fault_size);
+	if (dax_set_mapping(vmf, pfn, fault_size))
+		return VM_FAULT_SIGBUS;
 
 	return vmf_insert_page_mkwrite(vmf, pfn_to_page(pfn),
 					vmf->flags & FAULT_FLAG_WRITE);
@@ -178,7 +188,8 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 
 	pfn = PHYS_PFN(phys);
 
-	dax_set_mapping(vmf, pfn, fault_size);
+	if (dax_set_mapping(vmf, pfn, fault_size))
+		return VM_FAULT_SIGBUS;
 
 	return vmf_insert_folio_pmd(vmf, page_folio(pfn_to_page(pfn)),
 				    vmf->flags & FAULT_FLAG_WRITE);
@@ -224,7 +235,8 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 
 	pfn = PHYS_PFN(phys);
 
-	dax_set_mapping(vmf, pfn, fault_size);
+	if (dax_set_mapping(vmf, pfn, fault_size))
+		return VM_FAULT_SIGBUS;
 
 	return vmf_insert_folio_pud(vmf, page_folio(pfn_to_page(pfn)),
 				    vmf->flags & FAULT_FLAG_WRITE);
-- 
2.33.1