mm/ksm: Add recovery mechanism for memory failures

[PATCH RFC 0/1] mm/ksm: Add recovery mechanism for memory failures

Posted by Longlong Xia 4 months ago

From: Longlong Xia <xialonglong@kylinos.cn>

When a hardware memory error occurs on a KSM page, the current
behavior is to kill all processes mapping that page. This can
be overly aggressive when KSM has multiple duplicate pages in
a chain where other duplicates are still healthy.

This patch introduces a recovery mechanism that attempts to migrate
mappings from the failing page to another healthy duplicate within
the same chain before resorting to killing processes.

The recovery process works as follows:
1. When a memory failure is detected on a KSM page, determine whether the
failing node is part of a chain (has duplicates). (Maybe add a dup_head
field to struct stable_node to save the head node, which avoids searching
the whole stable tree; or find the head node some other way.)
2. Search for another healthy duplicate page within the same chain
3. For each process mapping the failing page:
- Update the PTE to point to the healthy duplicate page (maybe reuse
replace_page(), or split replace_page() into smaller functions and use
the common part)
- Migrate the rmap_item to the new stable node
4. If all migrations succeed, remove the failing node from the chain
5. Only kill processes if recovery is impossible or fails
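
To make the intended control flow easier to discuss, here is a minimal
userspace model of the five steps above. It is only a sketch and not the
code from the patch: struct model_dup, struct model_chain and
model_recover() are hypothetical names, the chain is a flat array, and
the PTE update plus rmap_item migration of step 3 is reduced to moving a
counter from one duplicate to another.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_DUPS 8

/* Hypothetical model of one duplicate in a stable-node chain. */
struct model_dup {
	unsigned long pfn;
	bool present;   /* still linked into the chain */
	bool poisoned;  /* hit by a memory failure */
	int nr_rmaps;   /* mappings currently pointing at this page */
};

struct model_chain {
	struct model_dup dups[MODEL_MAX_DUPS];
};

enum model_result { MODEL_RECOVERED, MODEL_MUST_KILL };

/* Step 1: find the failing dup in the chain by PFN. */
static struct model_dup *model_find(struct model_chain *c, unsigned long pfn)
{
	for (size_t i = 0; i < MODEL_MAX_DUPS; i++)
		if (c->dups[i].present && c->dups[i].pfn == pfn)
			return &c->dups[i];
	return NULL;
}

/* Step 2: find any other dup that is still linked and not poisoned. */
static struct model_dup *model_find_healthy(struct model_chain *c,
					    const struct model_dup *failing)
{
	for (size_t i = 0; i < MODEL_MAX_DUPS; i++) {
		struct model_dup *d = &c->dups[i];

		if (d != failing && d->present && !d->poisoned)
			return d;
	}
	return NULL;
}

static enum model_result model_recover(struct model_chain *c,
				       unsigned long bad_pfn)
{
	struct model_dup *failing = model_find(c, bad_pfn);
	struct model_dup *target;

	assert(c != NULL);
	if (!failing)
		return MODEL_MUST_KILL;
	failing->poisoned = true;

	target = model_find_healthy(c, failing);
	if (!target)
		return MODEL_MUST_KILL;	/* step 5: no healthy duplicate left */

	/* Step 3: all mappings move over to the healthy duplicate. */
	target->nr_rmaps += failing->nr_rmaps;
	failing->nr_rmaps = 0;

	/* Step 4: unlink the failing dup from the chain. */
	failing->present = false;
	return MODEL_RECOVERED;
}
```

In this model, a chain of four duplicates survives three successive
failures and only reports MODEL_MUST_KILL on the fourth, when no healthy
duplicate remains. The model always picks the first healthy duplicate it
finds; how the real code should choose the target is open.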

The original idea came from Naoya Horiguchi.
https://lore.kernel.org/all/20230331054243.GB1435482@hori.linux.bs1.fc.nec.co.jp/

I tested it with /sys/kernel/debug/hwpoison/corrupt-pfn in qemu-x86_64.
Here are my test steps and results:

1. Allocate 1024 pages with the same content and enable KSM to merge them.
After merging (each phy_addr is printed only once):
 a. virtual addr = 0x7e4c68a00000  phy_addr = 0x10e802000
 b. virtual addr = 0x7e4c68b2c000  phy_addr = 0x10e902000
 c. virtual addr = 0x7e4c68c26000  phy_addr = 0x10ea02000
 d. virtual addr = 0x7e4c68d20000  phy_addr = 0x10eb02000

2. echo 0x10e802 > /sys/kernel/debug/hwpoison/corrupt-pfn
 a. virtual addr = 0x7e4c68a00000  phy_addr = 0x10eb02000
 b. virtual addr = 0x7e4c68b2c000  phy_addr = 0x10e902000
 c. virtual addr = 0x7e4c68c26000  phy_addr = 0x10ea02000
 d. virtual addr = 0x7e4c68d20000  phy_addr = 0x10eb02000 (shared with a)

3. echo 0x10eb02 > /sys/kernel/debug/hwpoison/corrupt-pfn
 a. virtual addr = 0x7e4c68a00000  phy_addr = 0x10ea02000
 b. virtual addr = 0x7e4c68b2c000  phy_addr = 0x10e902000
 c. virtual addr = 0x7e4c68c26000  phy_addr = 0x10ea02000 (shared with a)
 d. virtual addr = 0x7e4c68c58000  phy_addr = 0x10ea02000 (shared with a)

4. echo 0x10ea02 > /sys/kernel/debug/hwpoison/corrupt-pfn
 a. virtual addr = 0x7e4c68a00000  phy_addr = 0x10e902000
 b. virtual addr = 0x7e4c68a32000  phy_addr = 0x10e902000 (shared with a)
 c. virtual addr = 0x7e4c68a64000  phy_addr = 0x10e902000 (shared with a)
 d. virtual addr = 0x7e4c68a96000  phy_addr = 0x10e902000 (shared with a)

5. echo 0x10e902 > /sys/kernel/debug/hwpoison/corrupt-pfn
MCE: Killing ksm_test:531 due to hardware memory corruption fault at 7e4c68a00000

Kernel log:
Injecting memory failure at pfn 0x10e802
Memory failure: 0x10e802: recovery action for dirty LRU page: Recovered
Injecting memory failure at pfn 0x10eb02
Memory failure: 0x10eb02: recovery action for dirty LRU page: Recovered
Injecting memory failure at pfn 0x10ea02
Memory failure: 0x10ea02: recovery action for dirty LRU page: Recovered
Injecting memory failure at pfn 0x10e902
Memory failure: 0x10e902: recovery action for dirty LRU page: Recovered
MCE: Killing ksm_test:531 due to hardware memory corruption fault at 7e4c68a00000
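
For anyone who wants to reproduce step 1, the setup can be sketched
roughly as below. This is not the actual ksm_test source, which is not
part of this posting; alloc_identical_pages() and virt_to_pfn() are
hypothetical helper names. The sketch assumes Linux with CONFIG_KSM,
that ksmd has been started (echo 1 > /sys/kernel/mm/ksm/run), and that
the program runs as root, since /proc/self/pagemap reports the PFN as
zero to unprivileged readers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Map nr_pages anonymous pages, fill them with the same byte and mark the
 * range mergeable. Returns NULL if the mapping fails. *madv_ret receives
 * the madvise() return value so the caller can tell whether KSM accepted
 * the range.
 */
static void *alloc_identical_pages(size_t nr_pages, int fill, int *madv_ret)
{
	size_t len = nr_pages * (size_t)sysconf(_SC_PAGESIZE);
	void *buf;

	assert(madv_ret != NULL);
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return NULL;

	memset(buf, fill, len);		/* identical content in every page */
	*madv_ret = madvise(buf, len, MADV_MERGEABLE);
	return buf;
}

/*
 * Translate a virtual address to a PFN via /proc/self/pagemap.
 * Returns 0 if the page is not present or the PFN is not visible.
 */
static uint64_t virt_to_pfn(const void *addr)
{
	uintptr_t psz = (uintptr_t)sysconf(_SC_PAGESIZE);
	uint64_t entry = 0;
	off_t off = (off_t)((uintptr_t)addr / psz) * (off_t)sizeof(entry);
	int fd = open("/proc/self/pagemap", O_RDONLY);

	if (fd < 0)
		return 0;
	if (pread(fd, &entry, sizeof(entry), off) != (ssize_t)sizeof(entry))
		entry = 0;
	close(fd);

	if (!(entry & (1ULL << 63)))		/* bit 63: page present */
		return 0;
	return entry & ((1ULL << 55) - 1);	/* bits 0-54: PFN */
}
```

After ksmd has had time to merge the range, calling virt_to_pfn() on
each page and printing every distinct PFN once gives the kind of table
shown in step 1. The value written to corrupt-pfn is that PFN, that is,
the physical address shifted right by the page shift (0x10e802 for
phy_addr 0x10e802000).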

Thanks for your review and comments!

Longlong Xia (1):
  mm/ksm: Add recovery mechanism for memory failures

 mm/ksm.c | 183 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 183 insertions(+)

-- 
2.43.0

Re: [PATCH RFC 0/1] mm/ksm: Add recovery mechanism for memory failures

Posted by David Hildenbrand 4 months ago

On 09.10.25 09:00, Longlong Xia wrote:
> From: Longlong Xia <xialonglong@kylinos.cn>
> 
> When a hardware memory error occurs on a KSM page, the current
> behavior is to kill all processes mapping that page. This can
> be overly aggressive when KSM has multiple duplicate pages in
> a chain where other duplicates are still healthy.
> 
> This patch introduces a recovery mechanism that attempts to migrate
> mappings from the failing page to another healthy duplicate within
> the same chain before resorting to killing processes.

An alternative could be to allocate a new page and effectively migrate
from the old (degraded) page to the new page by copying page content
from one of the healthy duplicates.

That would keep the #mappings per page in the chain balanced.
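
To illustrate the balancing point, a rough userspace model (a sketch
only; struct model_page and model_replace_with_copy() are hypothetical
names, and a real implementation would have to allocate a page and
rewrite the PTEs): the failing entry is replaced by a fresh page whose
content is copied from a healthy duplicate, and the fresh page takes
over exactly the mappings of the failing one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096

/* Hypothetical model of one duplicate page in a chain. */
struct model_page {
	unsigned char data[MODEL_PAGE_SIZE];
	int nr_mappings;
	bool poisoned;
};

/*
 * Replace chain[failing] with a fresh page copied from any healthy
 * duplicate. Returns 0 on success, -1 if no healthy duplicate exists.
 */
static int model_replace_with_copy(struct model_page *chain, size_t n,
				   size_t failing)
{
	struct model_page fresh;

	assert(failing < n);
	chain[failing].poisoned = true;

	for (size_t i = 0; i < n; i++) {
		if (i == failing || chain[i].poisoned)
			continue;

		/* Never read from the poisoned page, only from a healthy one. */
		memcpy(fresh.data, chain[i].data, MODEL_PAGE_SIZE);
		fresh.nr_mappings = chain[failing].nr_mappings;
		fresh.poisoned = false;
		chain[failing] = fresh;	/* fresh page takes the old slot */
		return 0;
	}
	return -1;
}
```

With four duplicates at 256 mappings each, one failure leaves the chain
at 256/256/256/256 here, whereas remapping onto an existing duplicate
would leave one page with 512 mappings.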

> 
> The recovery process works as follows:
> 1. When a memory failure is detected on a KSM page, determine whether the
> failing node is part of a chain (has duplicates). (Maybe add a dup_head
> field to struct stable_node to save the head node, which avoids searching
> the whole stable tree; or find the head node some other way.)
> 2. Search for another healthy duplicate page within the same chain
> 3. For each process mapping the failing page:
> - Update the PTE to point to the healthy duplicate page (maybe reuse
> replace_page(), or split replace_page() into smaller functions and use
> the common part)
> - Migrate the rmap_item to the new stable node
> 4. If all migrations succeed, remove the failing node from the chain
> 5. Only kill processes if recovery is impossible or fails

Does not sound too crazy.

But how realistic do we consider that to be in practice? IIRC, we need
quite a lot of processes deduplicating the same page before we end up
with duplicates in the chain.

So isn't this rather an improvement only for less likely scenarios in 
practice?

-- 
Cheers

David / dhildenb