[v4] mm/memory-failure: add panic option for unrecoverable pages

[PATCH v4 0/3] mm/memory-failure: add panic option for unrecoverable pages

Posted by Breno Leitao 2 months ago

When the memory failure handler encounters an in-use kernel page that it
cannot recover (slab, page tables, kernel stacks, vmalloc, etc.), it
currently logs the error as "Ignored" and continues operation.

This leaves corrupted data accessible to the kernel, which will inevitably
cause either silent data corruption or a delayed crash when the poisoned memory
is next accessed.

This is a common problem on large fleets. We frequently observe multi-bit ECC
errors hitting kernel slab pages, where memory_failure() fails to recover them
and the system crashes later at an unrelated code path, making root cause
analysis unnecessarily difficult.

Here is one specific example from production on an arm64 server: a multi-bit
ECC error hit a dentry cache slab page, memory_failure() failed to recover it
(slab pages are not supported by the hwpoison recovery mechanism), and 67
seconds later d_lookup() accessed the poisoned cache line causing
a synchronous external abort:

    [88690.479680] [Hardware Error]: error_type: 3, multi-bit ECC
    [88690.498473] Memory failure: 0x40272d: unhandlable page.
    [88690.498619] Memory failure: 0x40272d: recovery action for
                   get hwpoison page: Ignored
    ...
    [88757.847126] Internal error: synchronous external abort:
                   0000000096000410 [#1] SMP
    [88758.061075] pc : d_lookup+0x5c/0x220

This series adds a new sysctl vm.panic_on_unrecoverable_memory_failure
(default 0) that, when enabled, panics immediately on unrecoverable
memory failures. This provides a clean crash dump at the time of the
error, which is far more useful for diagnosis than a random crash later
at an unrelated code path.

This also categorizes reserved pages as MF_MSG_KERNEL, and panics on
unknown page types (MF_MSG_UNKNOWN).

Note that dynamically allocated kernel memory (SLAB/SLUB, vmalloc,
kernel stacks, page tables) shares the MF_MSG_GET_HWPOISON return path
with transient refcount races, so it is intentionally excluded from the
panic conditions to avoid false positives.
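
To make the panic conditions above concrete, here is a minimal userspace
sketch of the decision. It is not the code from the patch. The MF_MSG_*
names and panic_on_unrecoverable_mf() come from this cover letter and its
changelog; the enum values, the MF_MSG_OTHER placeholder, the function
signature, and the variable standing in for the sysctl are assumptions made
for illustration.

```c
#include <stdbool.h>

/* Userspace model only: names mirror the thread, values are arbitrary. */
enum mf_action_page_type {
	MF_MSG_KERNEL,
	MF_MSG_KERNEL_HIGH_ORDER,
	MF_MSG_GET_HWPOISON,
	MF_MSG_UNKNOWN,
	MF_MSG_OTHER,	/* hypothetical placeholder for every other type */
};

/* Stand-in for vm.panic_on_unrecoverable_memory_failure (default 0). */
static int sysctl_panic_on_unrecoverable_mf;

/* Assumed signature: returns true when the handler should panic. */
static bool panic_on_unrecoverable_mf(enum mf_action_page_type type)
{
	if (!sysctl_panic_on_unrecoverable_mf)
		return false;

	switch (type) {
	case MF_MSG_KERNEL:
	case MF_MSG_KERNEL_HIGH_ORDER:
	case MF_MSG_UNKNOWN:
		return true;
	default:
		/*
		 * MF_MSG_GET_HWPOISON lands here on purpose: it shares a
		 * return path with transient refcount races.
		 */
		return false;
	}
}
```

With the stand-in left at 0, every type returns false. Set to 1, only the
three listed types return true, and MF_MSG_GET_HWPOISON still does not.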

Signed-off-by: Breno Leitao <leitao@debian.org>
---
Changes in v4:
- Drop CONFIG_BOOTPARAM_MEMORY_FAILURE_PANIC kernel configuration option.
- Split the reserved page classification (MF_MSG_KERNEL) into its own
  patch, separate from the panic mechanism.
- Document why the buddy allocator TOCTOU race (between
  get_hwpoison_page() and is_free_buddy_page()) cannot cause false
  positives: PG_hwpoison is set beforehand and check_new_page() in the
  page allocator rejects hwpoisoned pages.
- Document the narrow LRU isolation race window for MF_MSG_UNKNOWN and
  its mitigation via identify_page_state()'s two-pass design.
- Explicitly document why MF_MSG_GET_HWPOISON is excluded from the
  panic conditions (shared path with transient races and non-reserved
  kernel memory).
- Link to v3: https://patch.msgid.link/20260413-ecc_panic-v3-0-1dcbb2f12bc4@debian.org

Changes in v3:
- Rename is_unrecoverable_memory_failure() to panic_on_unrecoverable_mf()
  as suggested by maintainer.
- Add CONFIG_BOOTPARAM_MEMORY_FAILURE_PANIC kernel configuration option,
  similar to CONFIG_BOOTPARAM_HARDLOCKUP_PANIC.
- Add documentation for the sysctl and CONFIG option.
- Add code comments documenting the panic condition design rationale and
  how the retry mechanism mitigates false positives from buddy allocator
  races.
- Link to v2: https://patch.msgid.link/20260331-ecc_panic-v2-0-9e40d0f64f7a@debian.org

Changes in v2:
- Panic on MF_MSG_KERNEL, MF_MSG_KERNEL_HIGH_ORDER and MF_MSG_UNKNOWN
  instead of MF_MSG_GET_HWPOISON.
- Report MF_MSG_KERNEL for reserved pages when get_hwpoison_page() fails
  instead of MF_MSG_GET_HWPOISON.
- Link to v1: https://patch.msgid.link/20260323-ecc_panic-v1-0-72a1921726c5@debian.org

---
Breno Leitao (3):
      mm/memory-failure: report MF_MSG_KERNEL for reserved pages
      mm/memory-failure: add panic option for unrecoverable pages
      Documentation: document panic_on_unrecoverable_memory_failure sysctl

 Documentation/admin-guide/sysctl/vm.rst | 37 +++++++++++++
 mm/memory-failure.c                     | 92 ++++++++++++++++++++++++++++++++-
 2 files changed, 128 insertions(+), 1 deletion(-)
---
base-commit: e6efabc0afca02efa263aba533f35d90117ab283
change-id: 20260323-ecc_panic-4e473b83087c

Best regards,
--  
Breno Leitao <leitao@debian.org>

Re: [PATCH v4 0/3] mm/memory-failure: add panic option for unrecoverable pages

Posted by Miaohe Lin 1 month, 3 weeks ago

On 2026/4/15 20:54, Breno Leitao wrote:
> When the memory failure handler encounters an in-use kernel page that it
> cannot recover (slab, page tables, kernel stacks, vmalloc, etc.), it
> currently logs the error as "Ignored" and continues operation.
> 
> This leaves corrupted data accessible to the kernel, which will inevitably
> cause either silent data corruption or a delayed crash when the poisoned memory
> is next accessed.
> 
> This is a common problem on large fleets. We frequently observe multi-bit ECC
> errors hitting kernel slab pages, where memory_failure() fails to recover them
> and the system crashes later at an unrelated code path, making root cause
> analysis unnecessarily difficult.
> 
> Here is one specific example from production on an arm64 server: a multi-bit
> ECC error hit a dentry cache slab page, memory_failure() failed to recover it
> (slab pages are not supported by the hwpoison recovery mechanism), and 67
> seconds later d_lookup() accessed the poisoned cache line causing
> a synchronous external abort:
> 
>     [88690.479680] [Hardware Error]: error_type: 3, multi-bit ECC
>     [88690.498473] Memory failure: 0x40272d: unhandlable page.
>     [88690.498619] Memory failure: 0x40272d: recovery action for
>                    get hwpoison page: Ignored
>     ...
>     [88757.847126] Internal error: synchronous external abort:
>                    0000000096000410 [#1] SMP
>     [88758.061075] pc : d_lookup+0x5c/0x220
> 
> This series adds a new sysctl vm.panic_on_unrecoverable_memory_failure
> (default 0) that, when enabled, panics immediately on unrecoverable
> memory failures. This provides a clean crash dump at the time of the
> error, which is far more useful for diagnosis than a random crash later
> at an unrelated code path.
> 
> This also categorizes reserved pages as MF_MSG_KERNEL, and panics on
> unknown page types (MF_MSG_UNKNOWN).
> 
> Note that dynamically allocated kernel memory (SLAB/SLUB, vmalloc,
> kernel stacks, page tables) shares the MF_MSG_GET_HWPOISON return path
> with transient refcount races, so it is intentionally excluded from the
> panic conditions to avoid false positives.
> 
> Signed-off-by: Breno Leitao <leitao@debian.org>

It might be helpful to add some information from [1].

Thanks for your work.

[1] https://lore.kernel.org/all/aeHy3-vQTQYJlGw5@gmail.com/

Re: [PATCH v4 0/3] mm/memory-failure: add panic option for unrecoverable pages

Posted by Jiaqi Yan 2 months ago

Hi Breno,

On Wed, Apr 15, 2026 at 5:55 AM Breno Leitao <leitao@debian.org> wrote:
>
> When the memory failure handler encounters an in-use kernel page that it
> cannot recover (slab, page tables, kernel stacks, vmalloc, etc.), it
> currently logs the error as "Ignored" and continues operation.
>
> This leaves corrupted data accessible to the kernel, which will inevitably
> cause either silent data corruption or a delayed crash when the poisoned memory
> is next accessed.
>
> This is a common problem on large fleets. We frequently observe multi-bit ECC
> errors hitting kernel slab pages, where memory_failure() fails to recover them
> and the system crashes later at an unrelated code path, making root cause
> analysis unnecessarily difficult.
>
> Here is one specific example from production on an arm64 server: a multi-bit
> ECC error hit a dentry cache slab page, memory_failure() failed to recover it
> (slab pages are not supported by the hwpoison recovery mechanism), and 67
> seconds later d_lookup() accessed the poisoned cache line causing
> a synchronous external abort:
>
>     [88690.479680] [Hardware Error]: error_type: 3, multi-bit ECC
>     [88690.498473] Memory failure: 0x40272d: unhandlable page.
>     [88690.498619] Memory failure: 0x40272d: recovery action for
>                    get hwpoison page: Ignored
>     ...
>     [88757.847126] Internal error: synchronous external abort:
>                    0000000096000410 [#1] SMP
>     [88758.061075] pc : d_lookup+0x5c/0x220
>
> This series adds a new sysctl vm.panic_on_unrecoverable_memory_failure
> (default 0) that, when enabled, panics immediately on unrecoverable
> memory failures. This provides a clean crash dump at the time of the

I get the fail-fast part, but I wonder whether the kernel will really be
able to provide a clean crash dump that is useful for diagnosis.

In your example at 88757.847126, the kernel was handling an SEA and,
because we were in kernel context, it eventually had to die(). Neither
your patch nor memory-failure has any role to play there. But at least
SEA handling tried its best to show the kernel code that consumed the
memory error.

So your code should apply to the memory failure handling at
88690.498473, which is likely triggered from APEI GHES for poison
detection (I guess the example is from ARM64). Anything except SEA is
considered not synchronous (by APEI is_hest_sync_notify()). If the kernel
panics there, I guess it will be in a random process context or a
kworker thread? How useful is that for diagnosis? Is it just the exact
time the error was detected (which is already logged by the kernel)?

On x86, for UCNA or SRAO type machine check exceptions, I think that with
your patch the panic would also happen in a random process context or a
kworker thread.

Can you share some clean crash dumps from your testing that show they
are more useful than the crash at SEA? Thanks!

> error, which is far more useful for diagnosis than a random crash later
> at an unrelated code path.
>
> This also categorizes reserved pages as MF_MSG_KERNEL, and panics on
> unknown page types (MF_MSG_UNKNOWN).
>
> Note that dynamically allocated kernel memory (SLAB/SLUB, vmalloc,
> kernel stacks, page tables) shares the MF_MSG_GET_HWPOISON return path
> with transient refcount races, so it is intentionally excluded from the
> panic conditions to avoid false positives.
>
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
> Changes in v4:
> - Drop CONFIG_BOOTPARAM_MEMORY_FAILURE_PANIC kernel configuration option.
> - Split the reserved page classification (MF_MSG_KERNEL) into its own
>   patch, separate from the panic mechanism.
> - Document why the buddy allocator TOCTOU race (between
>   get_hwpoison_page() and is_free_buddy_page()) cannot cause false
>   positives: PG_hwpoison is set beforehand and check_new_page() in the
>   page allocator rejects hwpoisoned pages.
> - Document the narrow LRU isolation race window for MF_MSG_UNKNOWN and
>   its mitigation via identify_page_state()'s two-pass design.
> - Explicitly document why MF_MSG_GET_HWPOISON is excluded from the
>   panic conditions (shared path with transient races and non-reserved
>   kernel memory).
> - Link to v3: https://patch.msgid.link/20260413-ecc_panic-v3-0-1dcbb2f12bc4@debian.org
>
> Changes in v3:
> - Rename is_unrecoverable_memory_failure() to panic_on_unrecoverable_mf()
>   as suggested by maintainer.
> - Add CONFIG_BOOTPARAM_MEMORY_FAILURE_PANIC kernel configuration option,
>   similar to CONFIG_BOOTPARAM_HARDLOCKUP_PANIC.
> - Add documentation for the sysctl and CONFIG option.
> - Add code comments documenting the panic condition design rationale and
>   how the retry mechanism mitigates false positives from buddy allocator
>   races.
> - Link to v2: https://patch.msgid.link/20260331-ecc_panic-v2-0-9e40d0f64f7a@debian.org
>
> Changes in v2:
> - Panic on MF_MSG_KERNEL, MF_MSG_KERNEL_HIGH_ORDER and MF_MSG_UNKNOWN
>   instead of MF_MSG_GET_HWPOISON.
> - Report MF_MSG_KERNEL for reserved pages when get_hwpoison_page() fails
>   instead of MF_MSG_GET_HWPOISON.
> - Link to v1: https://patch.msgid.link/20260323-ecc_panic-v1-0-72a1921726c5@debian.org
>
> ---
> Breno Leitao (3):
>       mm/memory-failure: report MF_MSG_KERNEL for reserved pages
>       mm/memory-failure: add panic option for unrecoverable pages
>       Documentation: document panic_on_unrecoverable_memory_failure sysctl
>
>  Documentation/admin-guide/sysctl/vm.rst | 37 +++++++++++++
>  mm/memory-failure.c                     | 92 ++++++++++++++++++++++++++++++++-
>  2 files changed, 128 insertions(+), 1 deletion(-)
> ---
> base-commit: e6efabc0afca02efa263aba533f35d90117ab283
> change-id: 20260323-ecc_panic-4e473b83087c
>
> Best regards,
> --
> Breno Leitao <leitao@debian.org>
>
>

Re: [PATCH v4 0/3] mm/memory-failure: add panic option for unrecoverable pages

Posted by Breno Leitao 2 months ago

Hi Jiaqi,

On Wed, Apr 15, 2026 at 01:56:35PM -0700, Jiaqi Yan wrote:
> On Wed, Apr 15, 2026 at 5:55 AM Breno Leitao <leitao@debian.org> wrote:
> >
> > When the memory failure handler encounters an in-use kernel page that it
> > cannot recover (slab, page tables, kernel stacks, vmalloc, etc.), it
> > currently logs the error as "Ignored" and continues operation.
> >
> > This leaves corrupted data accessible to the kernel, which will inevitably
> > cause either silent data corruption or a delayed crash when the poisoned memory
> > is next accessed.
> >
> > This is a common problem on large fleets. We frequently observe multi-bit ECC
> > errors hitting kernel slab pages, where memory_failure() fails to recover them
> > and the system crashes later at an unrelated code path, making root cause
> > analysis unnecessarily difficult.
> >
> > Here is one specific example from production on an arm64 server: a multi-bit
> > ECC error hit a dentry cache slab page, memory_failure() failed to recover it
> > (slab pages are not supported by the hwpoison recovery mechanism), and 67
> > seconds later d_lookup() accessed the poisoned cache line causing
> > a synchronous external abort:
> >
> >     [88690.479680] [Hardware Error]: error_type: 3, multi-bit ECC
> >     [88690.498473] Memory failure: 0x40272d: unhandlable page.
> >     [88690.498619] Memory failure: 0x40272d: recovery action for
> >                    get hwpoison page: Ignored
> >     ...
> >     [88757.847126] Internal error: synchronous external abort:
> >                    0000000096000410 [#1] SMP
> >     [88758.061075] pc : d_lookup+0x5c/0x220
> >
> > This series adds a new sysctl vm.panic_on_unrecoverable_memory_failure
> > (default 0) that, when enabled, panics immediately on unrecoverable
> > memory failures. This provides a clean crash dump at the time of the
>
> I get the fail-fast part, but wonder will kernel really be able to
> provide clean crash dump useful for diagnosis?

Yes, the kernel does provide a useful crash dump. With the sysctl enabled,
here's what I observe:

	Kernel panic - not syncing: Memory failure: 0x1: unrecoverable page
	CPU: 40 UID: 0 PID: 682 Comm: bash Tainted: G B  7.0.0-next-20260414-upstream-00004-gcbb3af7bfd3b #93
	Tainted: [B]=BAD_PAGE

	Call Trace:
	 <TASK>
	 vpanic+0x399/0x700
	 panic+0xb4/0xc0
	 action_result+0x278/0x340          ← your new panic call site
	 memory_failure+0x152b/0x1c80


Without the patch (or with the sysctl disabled), you only get:

	Memory failure: 0x1: unhandlable page.
	Memory failure: 0x1: recovery action for reserved kernel page: Ignored

Then the host continues running until it eventually accesses that poisoned
memory, triggering a generic error similar to the d_lookup() case mentioned
above.

> In your example at 88757.847126, kernel was handling SEA and because
> we are under kernel context, eventually has to die(). Apparently not
> only your patch, but also memory-failure has no role to play there.
> But at least SEA handling tried its best to show the kernel code that
> consumed the memory error.
>
> So your code should apply to the memory failure handling at
> 88690.498473, which is likely triggered from APEI GHES for poison
> detection (I guess the example is from ARM64). Anything except SEA is
> considered not synchronous (by APEI is_hest_sync_notify()). If kernel
> panics there, I guess it will be in a random process context or a
> kworker thread? How useful is it for diagnosis? Just the exact time an
> error detected (which is already logged by kernel)?

The kernel panics with a clear stack trace and explicit reason, making it
straightforward to correlate and analyze the failure.

My objective is to have a clean, immediate crash rather than allowing the
system to continue running and potentially crash later (if at all).

Working at a hyperscaler, I regularly see thousands of these "unhandlable
page" messages, followed by later kernel crashes when the corrupted memory
is eventually accessed.

> On X86, for UCNA or SRAO type machine check exceptions, I think with
> your patch the panic would also happen in random process context or
> kworker thread,
>
> Can you share some clean crash dumps from your testing that show they
> are more useful than the crash at SEA? Thanks!

Certainly, here is the complete crash dump from the example above. This
happened on real production hardware:

	[88690.478913] [ T593001] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 784
	[88690.479097] [ T593001] {1}[Hardware Error]: event severity: recoverable
	[88690.479184] [ T593001] {1}[Hardware Error]:  imprecise tstamp: 2026-03-20 13:13:08
	[88690.479282] [ T593001] {1}[Hardware Error]:  Error 0, type: recoverable
	[88690.479359] [ T593001] {1}[Hardware Error]:   section_type: memory error
	[88690.479424] [ T593001] {1}[Hardware Error]:   physical_address: 0x00000040272d5080
	[88690.479503] [ T593001] {1}[Hardware Error]:   physical_address_mask: 0xfffffffffffff000
	[88690.479606] [ T593001] {1}[Hardware Error]:   node:0 card:0 module:1 rank:1 bank:13 device:6 row:64114 column:832 requestor_id:0x0000000000000027 
	[88690.479680] [ T593001] {1}[Hardware Error]:   error_type: 3, multi-bit ECC
	[88690.479754] [ T593001] {1}[Hardware Error]:   DIMM location: not present. DMI handle: 0x000e 
	[88690.479882] [ T593001] EDAC MC0: 1 UE multi-bit ECC on unknown memory (node:0 card:0 module:1 rank:1 bank:13 device:6 row:64114 column:832 requestor_id:0x0000000000000027 DIMM location: not present. DMI handle: 0x000e page:0x40272d offset:0x5080 grain:4096 - APEI location: node:0 card:0 module:1 rank:1 bank:13 device:6 row:64114 column:832 requestor_id:0x0000000000000027 DIMM location: not present. DMI handle: 0x000e)
	[88690.498473] [ T593001] Memory failure: 0x40272d: unhandlable page.
	[88690.498619] [ T593001] Memory failure: 0x40272d: recovery action for get hwpoison page: Ignored
	[88757.847126] [ T640437] Internal error: synchronous external abort: 0000000096000410 [#1]  SMP
	[88757.867131] [ T640437] Modules linked in: ghes_edac(E) act_gact(E) sch_fq(E) tcp_diag(E) inet_diag(E) cls_bpf(E) mlx5_ib(E) sm3_ce(E) sha3_ce(E) sha512_ce(E) ipmi_ssif(E) ipmi_devintf(E) nvidia_cspmu(E) ib_uverbs(E) cppc_cpufreq(E) coresight_etm4x(E) coresight_stm(E) ipmi_msghandler(E) coresight_trbe(E) arm_cspmu_module(E) arm_smmuv3_pmu(E) arm_spe_pmu(E) stm_core(E) coresight_tmc(E) coresight_funnel(E) coresight(E) bpf_preload(E) sch_fq_codel(E) ip_tables(E) ip6_tables(E) vhost_net(E) tun(E) vhost(E) vhost_iotlb(E) tap(E) tls(E) mpls_gso(E) mpls_iptunnel(E) mpls_router(E) fou(E) acpi_power_meter(E) loop(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) autofs4(E) raid0(E) efivarfs(E) dm_crypt(E)
	[88757.991191] [ T640437] CPU: 70 UID: 34133 PID: 640437 Comm: Collection-20 Kdump: loaded Tainted: G   M        E       6.16.1-0_fbk2_0_gf40efc324cc8 #1 NONE 
	[88758.017569] [ T640437] Tainted: [M]=MACHINE_CHECK, [E]=UNSIGNED_MODULE
	[88758.028860] [ T640437] Hardware name: ....
	[88758.046969] [ T640437] pstate: 23401009 (nzCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
	[88758.061075] [ T640437] pc : d_lookup+0x5c/0x220
	[88758.068392] [ T640437] lr : try_lookup_noperm+0x30/0x50
	[88758.077088] [ T640437] sp : ffff800138cafc30
	[88758.083827] [ T640437] x29: ffff800138cafc40 x28: ffff0001dcfe8bc0 x27: 00000000bc0a11f7
	[88758.098321] [ T640437] x26: 00000000000ee00c x25: ffffffffffffffff x24: 0000000000000001
	[88758.112807] [ T640437] x23: ffff003fa14d0000 x22: ffff8000828d3740 x21: ffff800138cafde8
	[88758.127281] [ T640437] x20: ffff0000d0316fc0 x19: ffff800138cafce0 x18: 0001000000000000
	[88758.141753] [ T640437] x17: 0000000000000001 x16: 0000000001ffffff x15: dfc038a300003936
	[88758.156226] [ T640437] x14: 00000000fffffffa x13: ffffffffffffffff x12: ffff0000d0316fc0
	[88758.170695] [ T640437] x11: 61c8864680b583eb x10: 0000000000000039 x9 : ffff800080fcfd68
	[88758.185170] [ T640437] x8 : ffff003fa72d5088 x7 : 0000000000000000 x6 : ffff800138cafd58
	[88758.199645] [ T640437] x5 : ffff0001dcfe8bc0 x4 : ffff80008104a330 x3 : 0000000000000002
	[88758.214111] [ T640437] x2 : ffff800138cafd4d x1 : ffff800138cafce0 x0 : ffff0000d0316fc0
	[88758.228579] [ T640437] Call trace:
	[88758.233565] [ T640437]  d_lookup+0x5c/0x220 (P)
	[88758.240864] [ T640437]  try_lookup_noperm+0x30/0x50
	[88758.248868] [ T640437]  proc_fill_cache+0x54/0x140
	[88758.256696] [ T640437]  proc_readfd_common+0x138/0x1e8
	[88758.265222] [ T640437]  proc_fd_iterate.llvm.7260857650841435759+0x1c/0x30
	[88758.277248] [ T640437]  iterate_dir+0x84/0x228
	[88758.284354] [ T640437]  __arm64_sys_getdents64+0x5c/0x110
	[88758.293383] [ T640437]  invoke_syscall+0x4c/0xd0
	[88758.300843] [ T640437]  do_el0_svc+0x80/0xb8
	[88758.307599] [ T640437]  el0_svc+0x30/0xf0
	[88758.313820] [ T640437]  el0t_64_sync_handler+0x70/0x100
	[88758.322497] [ T640437]  el0t_64_sync+0x17c/0x180
	...
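
For anyone correlating the two events in this dump by hand, the arithmetic
can be sketched as below. This is a standalone illustration, not kernel
code: the 16-bit page shift is an assumption inferred from the EDAC line
(page:0x40272d offset:0x5080), and the helper names are hypothetical.

```c
#include <stdint.h>

/* Assumption: 64 KiB base pages, inferred from the EDAC page/offset split. */
#define ASSUMED_PAGE_SHIFT 16

/* Page frame number as printed in the "Memory failure: 0x..." lines. */
static uint64_t phys_to_pfn(uint64_t phys)
{
	return phys >> ASSUMED_PAGE_SHIFT;
}

/* Byte offset of the error inside that page. */
static uint64_t phys_to_offset(uint64_t phys)
{
	return phys & ((UINT64_C(1) << ASSUMED_PAGE_SHIFT) - 1);
}

/* Gap between two dmesg timestamps, both given in microseconds. */
static uint64_t delay_us(uint64_t detected_us, uint64_t consumed_us)
{
	return consumed_us - detected_us;
}
```

Feeding in physical_address 0x40272d5080 gives PFN 0x40272d and offset
0x5080, matching the log. The timestamps 88690.498619 and 88757.847126 are
67348507 microseconds apart, which is the 67-second gap between detection
and consumption.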

And my clean crash would look like the following:

	[ 1096.480523] Memory failure: 0x2: recovery action for reserved kernel page: Ignored
	[ 1096.480751] Kernel panic - not syncing: Memory failure: 0x2: unrecoverable page
	[ 1096.480760] CPU: 5 UID: 0 PID: 683 Comm: bash Tainted: G    B               7.0.0-next-20260414-upstream-00004-gcbb3af7bfd3b #93 PREEMPTLAZY
	[ 1096.480768] Tainted: [B]=BAD_PAGE
	[ 1096.480774] Call Trace:
	[ 1096.480778]  <TASK>
	[ 1096.480782]  vpanic+0x399/0x700
	[ 1096.480821]  panic+0xb4/0xc0
	[ 1096.480849]  action_result+0x278/0x340
	[ 1096.480857]  memory_failure+0x152b/0x1c80
	[ 1096.480925]  hwpoison_inject+0x3a6/0x3f0 [hwpoison_inject]
	....


Isn't the clean approach way better than the random one?

For testing, I use this simple procedure, in case you want to play with
it:
	# modprobe hwpoison-inject
	# sysctl -w vm.panic_on_unrecoverable_memory_failure=1
	# echo 1 > /sys/kernel/debug/hwpoison/corrupt-pfn


Thanks for the review and good discussion,
--breno

Re: [PATCH v4 0/3] mm/memory-failure: add panic option for unrecoverable pages

Posted by Jiaqi Yan 2 months ago

On Thu, Apr 16, 2026 at 8:32 AM Breno Leitao <leitao@debian.org> wrote:
>
> Hi Jiaqi,
>
> On Wed, Apr 15, 2026 at 01:56:35PM -0700, Jiaqi Yan wrote:
> > On Wed, Apr 15, 2026 at 5:55 AM Breno Leitao <leitao@debian.org> wrote:
> > >
> > > When the memory failure handler encounters an in-use kernel page that it
> > > cannot recover (slab, page tables, kernel stacks, vmalloc, etc.), it
> > > currently logs the error as "Ignored" and continues operation.
> > >
> > > This leaves corrupted data accessible to the kernel, which will inevitably
> > > cause either silent data corruption or a delayed crash when the poisoned memory
> > > is next accessed.
> > >
> > > This is a common problem on large fleets. We frequently observe multi-bit ECC
> > > errors hitting kernel slab pages, where memory_failure() fails to recover them
> > > and the system crashes later at an unrelated code path, making root cause
> > > analysis unnecessarily difficult.
> > >
> > > Here is one specific example from production on an arm64 server: a multi-bit
> > > ECC error hit a dentry cache slab page, memory_failure() failed to recover it
> > > (slab pages are not supported by the hwpoison recovery mechanism), and 67
> > > seconds later d_lookup() accessed the poisoned cache line causing
> > > a synchronous external abort:
> > >
> > >     [88690.479680] [Hardware Error]: error_type: 3, multi-bit ECC
> > >     [88690.498473] Memory failure: 0x40272d: unhandlable page.
> > >     [88690.498619] Memory failure: 0x40272d: recovery action for
> > >                    get hwpoison page: Ignored
> > >     ...
> > >     [88757.847126] Internal error: synchronous external abort:
> > >                    0000000096000410 [#1] SMP
> > >     [88758.061075] pc : d_lookup+0x5c/0x220
> > >
> > > This series adds a new sysctl vm.panic_on_unrecoverable_memory_failure
> > > (default 0) that, when enabled, panics immediately on unrecoverable
> > > memory failures. This provides a clean crash dump at the time of the
> >
> > I get the fail-fast part, but wonder will kernel really be able to
> > provide clean crash dump useful for diagnosis?
>
> Yes, the kernel does provide a useful crash dump. With the sysctl enabled,
> here's what I observe:
>
>         Kernel panic - not syncing: Memory failure: 0x1: unrecoverable page
>         CPU: 40 UID: 0 PID: 682 Comm: bash Tainted: G B  7.0.0-next-20260414-upstream-00004-gcbb3af7bfd3b #93
>         Tainted: [B]=BAD_PAGE
>
>         Call Trace:
>          <TASK>
>          vpanic+0x399/0x700
>          panic+0xb4/0xc0
>          action_result+0x278/0x340          ← your new panic call site
>          memory_failure+0x152b/0x1c80
>
>
> Without the patch (or with the sysctl disabled), you only get:
>
>         Memory failure: 0x1: unhandlable page.
>         Memory failure: 0x1: recovery action for reserved kernel page: Ignored
>
> Then the host continues running until it eventually accesses that poisoned
> memory, triggering a generic error similar to the d_lookup() case mentioned
> above.
>
> > In your example at 88757.847126, kernel was handling SEA and because
> > we are under kernel context, eventually has to die(). Apparently not
> > only your patch, but also memory-failure has no role to play there.
> > But at least SEA handling tried its best to show the kernel code that
> > consumed the memory error.
> >
> > So your code should apply to the memory failure handling at
> > 88690.498473, which is likely triggered from APEI GHES for poison
> > detection (I guess the example is from ARM64). Anything except SEA is
> > considered not synchronous (by APEI is_hest_sync_notify()). If kernel
> > panics there, I guess it will be in a random process context or a
> > kworker thread? How useful is it for diagnosis? Just the exact time an
> > error detected (which is already logged by kernel)?
>
> The kernel panics with a clear stack trace and explicit reason, making it
> straightforward to correlate and analyze the failure.

So we will always get the same stack trace below, right?

          panic+0xb4/0xc0
          action_result+0x278/0x340
          memory_failure+0x152b/0x1c80

IIUC, this stack trace itself doesn't provide any useful information
about the memory error, right? What exactly can we use from the stack
trace? It is just a side-effect that we failed immediately.

You can still correlate the failure with "Memory failure: 0x1: unhandlable
page" and keep running until the actual fatal poison consumption takes
down the system. The drawback is that these will be cascading events that
can be "noisy". What I see is a choice between failing fast and
failing safe.

>
> My objective is to have a clean, immediate crash rather than allowing the
> system to continue running and potentially crash later (if at all).
>
> Working at a hyperscaler, I regularly see thousands of these "unhandlable
> page" messages, followed by later kernel crashes when the corrupted memory
> is eventually accessed.
>
> > On X86, for UCNA or SRAO type machine check exceptions, I think with
> > your patch the panic would also happen in random process context or
> > kworker thread,
> >
> > Can you share some clean crash dumps from your testing that show they
> > are more useful than the crash at SEA? Thanks!
>
> Certainly, here is the complete crash dump from the example above. This
> happened on a real production hardware:
>
>         [88690.478913] [ T593001] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 784
>         [88690.479097] [ T593001] {1}[Hardware Error]: event severity: recoverable
>         [88690.479184] [ T593001] {1}[Hardware Error]:  imprecise tstamp: 2026-03-20 13:13:08
>         [88690.479282] [ T593001] {1}[Hardware Error]:  Error 0, type: recoverable
>         [88690.479359] [ T593001] {1}[Hardware Error]:   section_type: memory error
>         [88690.479424] [ T593001] {1}[Hardware Error]:   physical_address: 0x00000040272d5080
>         [88690.479503] [ T593001] {1}[Hardware Error]:   physical_address_mask: 0xfffffffffffff000
>         [88690.479606] [ T593001] {1}[Hardware Error]:   node:0 card:0 module:1 rank:1 bank:13 device:6 row:64114 column:832 requestor_id:0x0000000000000027
>         [88690.479680] [ T593001] {1}[Hardware Error]:   error_type: 3, multi-bit ECC
>         [88690.479754] [ T593001] {1}[Hardware Error]:   DIMM location: not present. DMI handle: 0x000e
>         [88690.479882] [ T593001] EDAC MC0: 1 UE multi-bit ECC on unknown memory (node:0 card:0 module:1 rank:1 bank:13 device:6 row:64114 column:832 requestor_id:0x0000000000000027 DIMM location: not present. DMI handle: 0x000e page:0x40272d offset:0x5080 grain:4096 - APEI location: node:0 card:0 module:1 rank:1 bank:13 device:6 row:64114 column:832 requestor_id:0x0000000000000027 DIMM location: not present. DMI handle: 0x000e)
>         [88690.498473] [ T593001] Memory failure: 0x40272d: unhandlable page.
>         [88690.498619] [ T593001] Memory failure: 0x40272d: recovery action for get hwpoison page: Ignored
>         [88757.847126] [ T640437] Internal error: synchronous external abort: 0000000096000410 [#1]  SMP
>         [88757.867131] [ T640437] Modules linked in: ghes_edac(E) act_gact(E) sch_fq(E) tcp_diag(E) inet_diag(E) cls_bpf(E) mlx5_ib(E) sm3_ce(E) sha3_ce(E) sha512_ce(E) ipmi_ssif(E) ipmi_devintf(E) nvidia_cspmu(E) ib_uverbs(E) cppc_cpufreq(E) coresight_etm4x(E) coresight_stm(E) ipmi_msghandler(E) coresight_trbe(E) arm_cspmu_module(E) arm_smmuv3_pmu(E) arm_spe_pmu(E) stm_core(E) coresight_tmc(E) coresight_funnel(E) coresight(E) bpf_preload(E) sch_fq_codel(E) ip_tables(E) ip6_tables(E) vhost_net(E) tun(E) vhost(E) vhost_iotlb(E) tap(E) tls(E) mpls_gso(E) mpls_iptunnel(E) mpls_router(E) fou(E) acpi_power_meter(E) loop(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) autofs4(E) raid0(E) efivarfs(E) dm_crypt(E)
>         [88757.991191] [ T640437] CPU: 70 UID: 34133 PID: 640437 Comm: Collection-20 Kdump: loaded Tainted: G   M        E       6.16.1-0_fbk2_0_gf40efc324cc8 #1 NONE
>         [88758.017569] [ T640437] Tainted: [M]=MACHINE_CHECK, [E]=UNSIGNED_MODULE
>         [88758.028860] [ T640437] Hardware name: ....
>         [88758.046969] [ T640437] pstate: 23401009 (nzCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
>         [88758.061075] [ T640437] pc : d_lookup+0x5c/0x220
>         [88758.068392] [ T640437] lr : try_lookup_noperm+0x30/0x50
>         [88758.077088] [ T640437] sp : ffff800138cafc30
>         [88758.083827] [ T640437] x29: ffff800138cafc40 x28: ffff0001dcfe8bc0 x27: 00000000bc0a11f7
>         [88758.098321] [ T640437] x26: 00000000000ee00c x25: ffffffffffffffff x24: 0000000000000001
>         [88758.112807] [ T640437] x23: ffff003fa14d0000 x22: ffff8000828d3740 x21: ffff800138cafde8
>         [88758.127281] [ T640437] x20: ffff0000d0316fc0 x19: ffff800138cafce0 x18: 0001000000000000
>         [88758.141753] [ T640437] x17: 0000000000000001 x16: 0000000001ffffff x15: dfc038a300003936
>         [88758.156226] [ T640437] x14: 00000000fffffffa x13: ffffffffffffffff x12: ffff0000d0316fc0
>         [88758.170695] [ T640437] x11: 61c8864680b583eb x10: 0000000000000039 x9 : ffff800080fcfd68
>         [88758.185170] [ T640437] x8 : ffff003fa72d5088 x7 : 0000000000000000 x6 : ffff800138cafd58
>         [88758.199645] [ T640437] x5 : ffff0001dcfe8bc0 x4 : ffff80008104a330 x3 : 0000000000000002
>         [88758.214111] [ T640437] x2 : ffff800138cafd4d x1 : ffff800138cafce0 x0 : ffff0000d0316fc0
>         [88758.228579] [ T640437] Call trace:
>         [88758.233565] [ T640437]  d_lookup+0x5c/0x220 (P)
>         [88758.240864] [ T640437]  try_lookup_noperm+0x30/0x50
>         [88758.248868] [ T640437]  proc_fill_cache+0x54/0x140
>         [88758.256696] [ T640437]  proc_readfd_common+0x138/0x1e8
>         [88758.265222] [ T640437]  proc_fd_iterate.llvm.7260857650841435759+0x1c/0x30
>         [88758.277248] [ T640437]  iterate_dir+0x84/0x228
>         [88758.284354] [ T640437]  __arm64_sys_getdents64+0x5c/0x110
>         [88758.293383] [ T640437]  invoke_syscall+0x4c/0xd0
>         [88758.300843] [ T640437]  do_el0_svc+0x80/0xb8
>         [88758.307599] [ T640437]  el0_svc+0x30/0xf0
>         [88758.313820] [ T640437]  el0t_64_sync_handler+0x70/0x100
>         [88758.322497] [ T640437]  el0t_64_sync+0x17c/0x180
>         ...
>
> And my clean crash would look like the following:
>
>         [ 1096.480523] Memory failure: 0x2: recovery action for reserved kernel page: Ignored
>         [ 1096.480751] Kernel panic - not syncing: Memory failure: 0x2: unrecoverable page
>         [ 1096.480760] CPU: 5 UID: 0 PID: 683 Comm: bash Tainted: G    B               7.0.0-next-20260414-upstream-00004-gcbb3af7bfd3b #93 PREEMPTLAZY
>         [ 1096.480768] Tainted: [B]=BAD_PAGE
>         [ 1096.480774] Call Trace:
>         [ 1096.480778]  <TASK>
>         [ 1096.480782]  vpanic+0x399/0x700
>         [ 1096.480821]  panic+0xb4/0xc0
>         [ 1096.480849]  action_result+0x278/0x340
>         [ 1096.480857]  memory_failure+0x152b/0x1c80
>         [ 1096.480925]  hwpoison_inject+0x3a6/0x3f0 [hwpoison_inject]
>         ....
>
>
> Isn't the clean approach way better than the random one?

I don't fully agree. In the past, upstream has enhanced many kernel mm
services (e.g. khugepaged, page migration, dump_user_range()) to
recover from memory errors in order to improve system availability,
given that these services or tools can fail safe. Seeing many crashes
pointing to a certain in-kernel service at consumption time helped us
decide which services we should enhance, and which we should
prioritize. Of course not all kernel code can recover from memory
errors, but that doesn't mean knowing which kernel code often caused
crashes isn't useful.

>
> For testing, I use this simple procedure, in case you want to play with
> it:
>         # modprobe hwpoison-inject
>         # sysctl -w vm.panic_on_unrecoverable_memory_failure=1
>         # echo 1 > /sys/kernel/debug/hwpoison/corrupt-pfn
>
>
> Thanks for the review and good discussion,

Anyway, I only have a second opinion on the usefulness of a static
stack trace. This fail-fast option is good to have. Thanks!

> --breno
>

Re: [PATCH v4 0/3] mm/memory-failure: add panic option for unrecoverable pages

Posted by Breno Leitao 1 month, 4 weeks ago

On Thu, Apr 16, 2026 at 09:26:08AM -0700, Jiaqi Yan wrote:

> So we will always get the same stack trace below, right?
> 
>           panic+0xb4/0xc0
>           action_result+0x278/0x340
>           memory_failure+0x152b/0x1c80
> 
> IIUC, this stack trace itself doesn't provide any useful information
> about the memory error, right? What exactly can we use from the stack
> trace? It is just a side-effect that we failed immediately.

We can use it to correlate problems across a fleet of machines. Let me
share how crash dump analysis works in large datacenters.

There are thousands of crashes a day (a conservative estimate), and
different services try to correlate and categorize them into a few
buckets, something like:

	1. New crash — needs investigation
	2. Known issue — fix is being rolled out
	3. Hardware problem — do not spend engineering time on it

When a machine crashes at a random code path like d_lookup() 67 seconds
after the memory error, the automated triage classifies it as a kernel
bug in VFS/dcache and assigns it to the filesystem team for
investigation. Engineers spend time chasing a bug that doesn't exist in
software — it's a hardware problem.

With the immediate panic at memory_failure(), the stack trace is always
recognizable and can be automatically classified as category 3 (hardware
problem). The static stack trace is the feature, not a limitation: it
gives triage automation a stable signature to match on.

The value isn't in what the stack trace and the panic() tell a human
reading one crash; it's in what they tell automated systems processing
thousands of them.
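
As a rough illustration of what "a stable signature to match on" could
look like, here is a minimal sketch. It is not the actual triage tooling:
the bucket names and the classify() helper are hypothetical, and the
regular expression simply mirrors the panic string shown in the example
log earlier in this thread.

```python
import re
from typing import Optional, Tuple

# Panic string format copied from the example log in this thread.
# The bucket names below are hypothetical labels for illustration only.
MF_PANIC = re.compile(
    r"Kernel panic - not syncing: Memory failure: "
    r"(0x[0-9a-fA-F]+): unrecoverable page"
)


def classify(crash_log: str) -> Tuple[str, Optional[str]]:
    """Return (bucket, pfn) for a crash log, matching on the panic string."""
    match = MF_PANIC.search(crash_log)
    if match:
        # Fixed signature: route straight to the hardware bucket.
        return ("hardware", match.group(1))
    # No known signature: a human (or another matcher) has to look at it.
    return ("needs-investigation", None)


fail_fast = (
    "[ 1096.480751] Kernel panic - not syncing: "
    "Memory failure: 0x2: unrecoverable page"
)
delayed = (
    "[88757.847126] [ T640437] Internal error: synchronous external abort: "
    "0000000096000410 [#1]  SMP"
)

print(classify(fail_fast))
print(classify(delayed))
```

The delayed d_lookup() abort carries nothing in its own text that ties it
to the earlier memory error, so a matcher like this cannot bucket it as
hardware without extra context.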

> You can still correlate failure with "Memory failure: 0x1: unhandlable
> page" and keep running until the actual fatal poison consumption takes
> down the system. Drawback is that these will be cascading events that
> can be "noisy". What I see is the choice between failing fast versus
> failing safe.

Correlating the "unhandlable page" log with a later crash is
theoretically possible but breaks down in practice at scale:

- The crash may happen seconds, minutes, or hours later — or never, if
the page isn't accessed again before a reboot.

- The crash happens on a different CPU, different task, different context;
there's no breadcrumb linking it back to the memory error.

- Automated triage systems work on stack traces and panic strings, not
by correlating dmesg lines across time with later crashes.

- The later crash looks completely different depending on the
architecture. On arm64, you get a "synchronous external abort". On
x86, it's a machine check exception. On some platforms, it might be a
generic page fault or a BUG_ON in a subsystem that found inconsistent
data. There is no single signature to match — every architecture and
every consumption path produces a different crash, making automated
correlation essentially impossible.

- Worse, the crash may never happen at all. If the corrupted memory is
read but the corruption doesn't trigger a fault — say, a flipped bit
in a permission field, a size, a pointer that still maps to valid
memory, or a data buffer — the result is silent data corruption with
no crash to correlate against. The system continues operating on wrong
data with no indication anything went wrong.
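
To make the timing problem concrete, here is a small sketch using the two
timestamps from the arm64 example in the cover letter. The
within_window() helper and the window sizes are assumptions for
illustration; nothing like this is part of the series.

```python
import re

# Matches the leading "[seconds.micros]" timestamp that printk emits.
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")


def dmesg_seconds(line: str) -> float:
    """Extract the printk timestamp (in seconds) from one dmesg line."""
    match = TIMESTAMP.match(line.strip())
    if match is None:
        raise ValueError(f"no timestamp in: {line!r}")
    return float(match.group(1))


def within_window(report: str, crash: str, window_s: float) -> bool:
    """Hypothetical rule: link a crash to a report if it follows within window_s."""
    delay = dmesg_seconds(crash) - dmesg_seconds(report)
    return 0.0 <= delay <= window_s


report = "[88690.498473] Memory failure: 0x40272d: unhandlable page."
crash = (
    "[88757.847126] [ T640437] Internal error: synchronous external abort: "
    "0000000096000410 [#1]  SMP"
)

delay = dmesg_seconds(crash) - dmesg_seconds(report)
print(f"delay: {delay:.1f} s")
print(within_window(report, crash, 60.0))    # a one-minute window misses it
print(within_window(report, crash, 3600.0))  # a one-hour window catches it
```

Any fixed window is a guess: too short and the real consumer is missed,
too long and unrelated crashes get attributed to the memory error. And
none of this helps when the crash never happens.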

Also, I wouldn't call continuing with known-corrupted kernel memory
"failing safe" — it's the opposite. The kernel has no mechanism to
fence off a poisoned slab page or page table from future access.
Continuing is failing unsafely with a delayed, unpredictable
consequence.

> > Isn't the clean approach way better than the random one?
> 
> I don't fully agree. In the past, upstream has enhanced many kernel mm
> services (e.g. khugepaged, page migration, dump_user_range()) to
> recover from memory errors in order to improve system availability,
> given that these services or tools can fail safe. Seeing many crashes
> pointing to a certain in-kernel service at consumption time helped us
> decide which services we should enhance, and which we should
> prioritize. Of course not all kernel code can recover from memory
> errors, but that doesn't mean knowing which kernel code often caused
> crashes isn't useful.

That's a fair point — consumption-time crashes have historically been
useful for identifying which kernel services to harden. But I'd argue
this patch doesn't prevent that analysis, it complements it.

The sysctl defaults to off. Operators who want to observe where poison
is consumed — to prioritize which services to enhance — can leave it
disabled and get exactly the behavior they have today.
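
For anyone who wants to check how a given machine is configured, a
minimal sketch follows. It assumes the sysctl from this series is exposed
under /proc/sys like other vm.* knobs; on a kernel without the patch the
file does not exist and the helper returns None. The helper names are
made up for this example.

```python
from pathlib import Path
from typing import Optional


def sysctl_path(name: str, root: str = "/proc/sys") -> Path:
    """Map a dotted sysctl name to its file under /proc/sys."""
    return Path(root).joinpath(*name.split("."))


def read_sysctl(name: str, root: str = "/proc/sys") -> Optional[str]:
    """Return the sysctl value as a string, or None if it cannot be read."""
    try:
        return sysctl_path(name, root).read_text().strip()
    except OSError:
        # Missing on kernels without the patch, or unreadable.
        return None


value = read_sysctl("vm.panic_on_unrecoverable_memory_failure")
if value is None:
    print("sysctl not available on this kernel")
else:
    print(f"panic_on_unrecoverable_memory_failure = {value}")
```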

But for operators running large fleets where the priority is fast
diagnosis and machine replacement rather than kernel hardening research,
the immediate panic is what they need. They already know the memory is
bad, they don't need the kernel to keep running to find out which
subsystem hits it first.

Also, the services you mention — khugepaged, page migration,
dump_user_range() — were enhanced to handle errors in user pages,
where recovery is possible (kill the process, fail the migration). The
pages this patch panics on — reserved pages, unknown page types — are
kernel memory where _no_ recovery mechanism exists or is likely to exist.
There's no service to enhance for those; the only options are crash now
or crash later, given a crucial memory page got lost. 

> Anyway, I only have a second opinion on the usefulness of a static
> stack trace. This fail-fast option is good to have. Thanks!

Thanks for the review! Just to make sure I understand your position correctly —
are you saying you'd like changes to the patch, or is this more of a general
observation about the tradeoff?

--breno

Re: [PATCH v4 0/3] mm/memory-failure: add panic option for unrecoverable pages

Posted by Jiaqi Yan 1 month, 4 weeks ago

On Fri, Apr 17, 2026 at 2:11 AM Breno Leitao <leitao@debian.org> wrote:
>
> On Thu, Apr 16, 2026 at 09:26:08AM -0700, Jiaqi Yan wrote:
>
> > So we will always get the same stack trace below, right?
> >
> >           panic+0xb4/0xc0
> >           action_result+0x278/0x340
> >           memory_failure+0x152b/0x1c80
> >
> > IIUC, this stack trace itself doesn't provide any useful information
> > about the memory error, right? What exactly can we use from the stack
> > trace? It is just a side-effect that we failed immediately.
>
> We can use it to correlate problems across a fleet of machines. Let me
> share how crash dump analysis works in large datacenters.
>
> There are thousands of crashes a day (a conservative estimate), and
> different services try to correlate and categorize them into a few
> buckets, something like:
>
>         1. New crash — needs investigation
>         2. Known issue — fix is being rolled out
>         3. Hardware problem — do not spend engineering time on it
>
> When a machine crashes at a random code path like d_lookup() 67 seconds
> after the memory error, the automated triage classifies it as a kernel
> bug in VFS/dcache and assigns it to the filesystem team for
> investigation. Engineers spend time chasing a bug that doesn't exist in
> software — it's a hardware problem.
>
> With the immediate panic at memory_failure(), the stack trace is always
> recognizable and can be automatically classified as category 3 (hardware
> problem). The static stack trace is the feature, not a limitation: it
> gives triage automation a stable signature to match on.
>
> The value isn't in what the stack trace and the panic() tell a human
> reading one crash; it's in what they tell automated systems processing
> thousands of them.

Yeah, in this setting, a crash dump with a fixed signature totally makes sense.

>
> > You can still correlate failure with "Memory failure: 0x1: unhandlable
> > page" and keep running until the actual fatal poison consumption takes
> > down the system. Drawback is that these will be cascading events that
> > can be "noisy". What I see is the choice between failing fast versus
> > failing safe.
>
> Correlating the "unhandlable page" log with a later crash is
> theoretically possible but breaks down in practice at scale:
>
> - The crash may happen seconds, minutes, or hours later — or never, if
> the page isn't accessed again before a reboot.
>
> - The crash happens on a different CPU, different task, different context;
> there's no breadcrumb linking it back to the memory error.
>
> - Automated triage systems work on stack traces and panic strings, not
> by correlating dmesg lines across time with later crashes.
>
> - The later crash looks completely different depending on the
> architecture. On arm64, you get a "synchronous external abort". On
> x86, it's a machine check exception. On some platforms, it might be a
> generic page fault or a BUG_ON in a subsystem that found inconsistent
> data. There is no single signature to match — every architecture and
> every consumption path produces a different crash, making automated
> correlation essentially impossible.
>
> - Worse, the crash may never happen at all. If the corrupted memory is
> read but the corruption doesn't trigger a fault — say, a flipped bit
> in a permission field, a size, a pointer that still maps to valid
> memory, or a data buffer — the result is silent data corruption with
> no crash to correlate against. The system continues operating on wrong
> data with no indication anything went wrong.
>
> Also, I wouldn't call continuing with known-corrupted kernel memory
> "failing safe" — it's the opposite. The kernel has no mechanism to
> fence off a poisoned slab page or page table from future access.
> Continuing is failing unsafely with a delayed, unpredictable
> consequence.
>
>
> > > Isn't the clean approach way better than the random one?
> >
> > I don't fully agree. In the past, upstream has enhanced many kernel mm
> > services (e.g. khugepaged, page migration, dump_user_range()) to
> > recover from memory errors in order to improve system availability,
> > given that these services or tools can fail safe. Seeing many crashes
> > pointing to a certain in-kernel service at consumption time helped us
> > decide which services we should enhance, and which we should
> > prioritize. Of course not all kernel code can recover from memory
> > errors, but that doesn't mean knowing which kernel code often caused
> > crashes isn't useful.
>
>
> That's a fair point — consumption-time crashes have historically been
> useful for identifying which kernel services to harden. But I'd argue
> this patch doesn't prevent that analysis, it complements it.
>
> The sysctl defaults to off. Operators who want to observe where poison
> is consumed — to prioritize which services to enhance — can leave it
> disabled and get exactly the behavior they have today.
>
> But for operators running large fleets where the priority is fast
> diagnosis and machine replacement rather than kernel hardening research,
> the immediate panic is what they need. They already know the memory is
> bad, they don't need the kernel to keep running to find out which
> subsystem hits it first.
>
> Also, the services you mention — khugepaged, page migration,
> dump_user_range() — were enhanced to handle errors in user pages,
> where recovery is possible (kill the process, fail the migration). The
> pages this patch panics on — reserved pages, unknown page types — are
> kernel memory where _no_ recovery mechanism exists or is likely to exist.

Maybe, but I won't be surprised if one day someone comes up with some idea.

> There's no service to enhance for those; the only options are crash now
> or crash later, given a crucial memory page got lost.
>
> > Anyway, I only have a second opinion on the usefulness of a static
> > stack trace. This fail-fast option is good to have. Thanks!
>
> Thanks for the review! Just to make sure I understand your position correctly —
> are you saying you'd like changes to the patch, or is this more of a general
> observation about the tradeoff?

No change needed. I just hoped to get more clarification from you on
the usefulness of the stack trace, and I do get it now. Thanks!

>
> --breno