[RFC 0/7] mm: dual-bitmap page allocator consistency checker

Sasha Levin posted 7 patches 1 month, 3 weeks ago
[RFC 0/7] mm: dual-bitmap page allocator consistency checker
Posted by Sasha Levin 1 month, 3 weeks ago
Existing memory debugging tools - KASAN, KFENCE, page_poisoning - detect
access violations and content corruption, but none of them can detect
silent corruption in the page allocator's own metadata. If a hardware
bit flip corrupts an allocation bitmap, the allocator hands out a page
that is already in use (or fails to hand out a free one), and nothing
in the kernel notices. This series adds a dual-bitmap consistency checker
that maintains the invariant primary == ~secondary across two independently
allocated bitmaps, so that any single-bit corruption in either bitmap is
immediately detectable. The approach is based on NVIDIA safety research.

Field studies consistently show that DRAM errors at scale are far more
common than textbook assumptions suggest, even with ECC. Schroeder et al.
(SIGMETRICS 2009) found 8% of DIMMs experienced errors per year in
Google's fleet; Sridharan and Liberty (SC 2012) reported similar rates
at LANL; Meta's 2021-2022 work documented silent data corruption at
scale, including memory-related faults. The critical property of
allocator metadata corruption is that it doesn't trigger an invalid
memory access - the corrupted data is structurally valid, just wrong.
KASAN instruments accesses, not metadata integrity, so it cannot see
this class of fault.

Functional safety is a discipline distinct from security: it aims
to reduce the risk of hardware and software misbehaving to an
acceptable level. Security hardens against adversaries; safety hardens
against random hardware failures (cosmic rays, cell wear-out, thermal
noise) and systematic software failures (bugs). ISO 26262 (automotive
functional safety) defines four Automotive Safety Integrity Levels,
ASIL A through D. The level is derived from the severity, exposure,
and controllability of the hazard; ASIL-D is the most stringent.
IEC 61508 defines similar levels (SIL-1 through SIL-4) for industrial
systems, and there are equivalent standards for avionics and medical
devices. ISO 26262
requires Freedom From Interference (FFI): a safety element must not
be corrupted by faults in other elements. For an OS kernel, this means
the memory allocator's metadata must either be immune to corruption or
corruption must be detected before it propagates. The dual-bitmap
implements a way to protect from corruption coming from hardware or
software - two complementary representations of page allocation state,
allocated independently via memblock, where any single-bit fault in
either bitmap is immediately detectable. Performance is secondary to
correctness in this context. A safety mechanism must be simple enough
to audit and certify, must fail deterministically (panic, not
log-and-hope), and its correctness matters more than its throughput.
The dual-bitmap adds two atomic bitops per alloc/free, but for
safety-critical deployments this cost is acceptable because the
alternative - undetected corruption propagating silently - violates
the system's safety case. The static key ensures zero cost for kernels
that don't need it.

The natural question is why not use page_ext. The key objection from a
safety perspective is that page_ext stores per-page metadata in memory
that is itself subject to the same hardware faults we're trying to
detect. The dual-bitmap approach works because the two bitmaps are
independent allocations - corruption in one is caught by comparison
with the other. Embedding both in page_ext means a single fault could
corrupt both the tracking data and its redundant copy in the same
allocation region. ISO 26262 recommends this kind of independent
redundancy for protecting against hardware faults, and it also helps
against software faults; co-locating both bitmaps in page_ext violates
this principle. Beyond
the safety argument, there are practical issues: page_ext adds
8-100+ bytes per page depending on enabled features while the
dual-bitmap uses 2 bits per page total, and page_ext initializes
after the buddy allocator while the checker must be active before
memblock_free_all() hands pages to buddy.
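
To put the footprint comparison in concrete terms, here is a small
back-of-the-envelope sketch. It is not code from the series: the helper
names are hypothetical, and the 8-byte figure is just the low end of the
page_ext range quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes for two bitmaps holding one bit per page each (2 bits per page). */
static uint64_t dual_bitmap_bytes(uint64_t ram_bytes, uint64_t page_size)
{
	assert(page_size != 0);
	uint64_t pages = ram_bytes / page_size;
	uint64_t one_bitmap = (pages + 7) / 8;	/* round up to whole bytes */

	return 2 * one_bitmap;
}

/* Bytes for a fixed amount of per-page metadata (hypothetical helper). */
static uint64_t per_page_bytes(uint64_t ram_bytes, uint64_t page_size,
			       uint64_t bytes_per_page)
{
	assert(page_size != 0);
	return (ram_bytes / page_size) * bytes_per_page;
}
```

Assuming 64 GiB of RAM and 4 KiB pages (2^24 pages), the two bitmaps
come to 4 MiB in total, while 8 bytes of per-page metadata come to
128 MiB.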

Sasha Levin (7):
  mm: add generic dual-bitmap consistency primitives
  mm: add page consistency checker header
  mm: add Kconfig options for page consistency checker
  mm: add page consistency checker implementation
  mm/page_alloc: integrate page consistency hooks
  Documentation/mm: add page consistency checker documentation
  mm/page_consistency: add KUnit tests for dual-bitmap primitives

 Documentation/mm/index.rst            |   1 +
 Documentation/mm/page_consistency.rst | 211 +++++++++++++++
 MAINTAINERS                           |  10 +
 include/linux/dual_bitmap.h           | 216 ++++++++++++++++
 include/linux/page_consistency.h      |  84 ++++++
 mm/Kconfig.debug                      |  59 +++++
 mm/Makefile                           |   2 +
 mm/mm_init.c                          |   9 +
 mm/page_alloc.c                       |   4 +
 mm/page_consistency.c                 | 360 ++++++++++++++++++++++++++
 mm/page_consistency_test.c            | 274 ++++++++++++++++++++
 11 files changed, 1230 insertions(+)
 create mode 100644 Documentation/mm/page_consistency.rst
 create mode 100644 include/linux/dual_bitmap.h
 create mode 100644 include/linux/page_consistency.h
 create mode 100644 mm/page_consistency.c
 create mode 100644 mm/page_consistency_test.c

-- 
2.53.0
Re: [RFC 0/7] mm: dual-bitmap page allocator consistency checker
Posted by Vlastimil Babka (SUSE) 1 month, 3 weeks ago
On 4/24/26 16:00, Sasha Levin wrote:
> Existing memory debugging tools - KASAN, KFENCE, page_poisoning - detect
> access violations and content corruption, but none of them can detect
> silent corruption in the page allocator's own metadata. If a hardware
> bit flip corrupts an allocation bitmap, the allocator hands out a page

An allocation what? The page allocator is a buddy allocator, it has no
bitmap to track free/allocated state of pages?
Re: [RFC 0/7] mm: dual-bitmap page allocator consistency checker
Posted by Sasha Levin 1 month, 3 weeks ago
On Fri, Apr 24, 2026 at 05:42:53PM +0200, Vlastimil Babka (SUSE) wrote:
>On 4/24/26 16:00, Sasha Levin wrote:
>> Existing memory debugging tools - KASAN, KFENCE, page_poisoning - detect
>> access violations and content corruption, but none of them can detect
>> silent corruption in the page allocator's own metadata. If a hardware
>> bit flip corrupts an allocation bitmap, the allocator hands out a page
>
>An allocation what? The page allocator is a buddy allocator, it has no
>bitmap to track free/allocated state of pages?

You're right, the cover letter is misleading there. Buddy doesn't use a bitmap:
PageBuddy lives in page_type, the free list is a list, and page->private holds
the order. The dual-bitmap is new metadata the feature adds, maintained from
the alloc/free hooks.

What it actually catches is the same PFN being handed out twice before it's
freed, or freed without having been allocated. Not every kind of buddy
corruption shows up that way, but the common bad ones do. Corruption of the
bitmap itself shows up through the complement invariant.
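
As a rough userspace illustration of those three cases (double
allocation, free without allocation, and corruption of the tracking bits
themselves), something along these lines would do. This is only a
sketch under assumptions: every name here is hypothetical, the bit
updates are not atomic, and it returns an error code where the real
checker would presumably panic.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_PAGES	128	/* hypothetical number of tracked PFNs */
#define WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define NR_WORDS	((NR_PAGES + WORD_BITS - 1) / WORD_BITS)

enum check_result { CHECK_OK, CHECK_DOUBLE_ALLOC, CHECK_BAD_FREE, CHECK_CORRUPT };

static unsigned long primary[NR_WORDS];		/* bit set   => allocated */
static unsigned long secondary[NR_WORDS];	/* bit clear => allocated */

/* Start with every page free: primary all zeroes, secondary all ones. */
static void tracker_init(void)
{
	for (size_t i = 0; i < NR_WORDS; i++) {
		primary[i] = 0UL;
		secondary[i] = ~0UL;
	}
}

static bool test_bit_at(const unsigned long *map, size_t pfn)
{
	return (map[pfn / WORD_BITS] >> (pfn % WORD_BITS)) & 1UL;
}

/* Record a state change in both bitmaps, keeping them complementary. */
static void set_state(size_t pfn, bool allocated)
{
	unsigned long mask = 1UL << (pfn % WORD_BITS);
	size_t w = pfn / WORD_BITS;

	if (allocated) {
		primary[w] |= mask;
		secondary[w] &= ~mask;
	} else {
		primary[w] &= ~mask;
		secondary[w] |= mask;
	}
}

/* Check the invariant and the expected prior state, then update. */
static enum check_result transition(size_t pfn, bool to_allocated)
{
	assert(pfn < NR_PAGES);
	bool p = test_bit_at(primary, pfn);
	bool s = test_bit_at(secondary, pfn);

	if (p == s)			/* complement invariant broken */
		return CHECK_CORRUPT;
	if (p == to_allocated)		/* already in the requested state */
		return to_allocated ? CHECK_DOUBLE_ALLOC : CHECK_BAD_FREE;
	set_state(pfn, to_allocated);
	return CHECK_OK;
}

static enum check_result hook_alloc(size_t pfn) { return transition(pfn, true); }
static enum check_result hook_free(size_t pfn)  { return transition(pfn, false); }
```

After tracker_init(), a second hook_alloc() on the same PFN reports
CHECK_DOUBLE_ALLOC, a second hook_free() reports CHECK_BAD_FREE, and
flipping a single bit in either array makes the next hook on that PFN
report CHECK_CORRUPT.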

I'll fix the wording in v2.

-- 
Thanks,
Sasha
Re: [RFC 0/7] mm: dual-bitmap page allocator consistency checker
Posted by David Hildenbrand (Arm) 1 month, 3 weeks ago
On 4/24/26 18:25, Sasha Levin wrote:
> On Fri, Apr 24, 2026 at 05:42:53PM +0200, Vlastimil Babka (SUSE) wrote:
>> On 4/24/26 16:00, Sasha Levin wrote:
>>> Existing memory debugging tools - KASAN, KFENCE, page_poisoning - detect
>>> access violations and content corruption, but none of them can detect
>>> silent corruption in the page allocator's own metadata. If a hardware
>>> bit flip corrupts an allocation bitmap, the allocator hands out a page
>>
>> An allocation what? The page allocator is a buddy allocator, it has no
>> bitmap to track free/allocated state of pages?
> 
> You're right, the cover letter is misleading there. Buddy doesn't use a bitmap:
> PageBuddy lives in page_type, the free list is a list, and page->private holds
> the order. The dual-bitmap is new metadata the feature adds, maintained from
> the alloc/free hooks.

Given that you have PageBuddy (first "bit"), could we use a second bit in page_ext?

-- 
Cheers,

David
Re: [RFC 0/7] mm: dual-bitmap page allocator consistency checker
Posted by Sasha Levin 1 month, 3 weeks ago
On Sat, Apr 25, 2026 at 07:51:10AM +0200, David Hildenbrand (Arm) wrote:
>On 4/24/26 18:25, Sasha Levin wrote:
>> On Fri, Apr 24, 2026 at 05:42:53PM +0200, Vlastimil Babka (SUSE) wrote:
>>> On 4/24/26 16:00, Sasha Levin wrote:
>>>> Existing memory debugging tools - KASAN, KFENCE, page_poisoning - detect
>>>> access violations and content corruption, but none of them can detect
>>>> silent corruption in the page allocator's own metadata. If a hardware
>>>> bit flip corrupts an allocation bitmap, the allocator hands out a page
>>>
>>> An allocation what? The page allocator is a buddy allocator, it has no
>>> bitmap to track free/allocated state of pages?
>>
>> You're right, the cover letter is misleading there. Buddy doesn't use a bitmap:
>> PageBuddy lives in page_type, the free list is a list, and page->private holds
>> the order. The dual-bitmap is new metadata the feature adds, maintained from
>> the alloc/free hooks.
>
>Given that you have PageBuddy (first "bit"), could we use a second bit in page_ext?

Hmm... That's an interesting idea.

I can see two concerns with something like this:

1. The checker has to be live before memblock_free_all() hands pages to buddy.
I don't think page_ext is fully up that early.

2. page_type encodes buddy, offline, slab tags, etc... and a page that isn't
PageBuddy isn't necessarily allocated through alloc_pages. The invariant gets
case-y.

But let me think about it a bit more.

-- 
Thanks,
Sasha
Re: [RFC 0/7] mm: dual-bitmap page allocator consistency checker
Posted by Matthew Wilcox 1 month, 3 weeks ago
On Fri, Apr 24, 2026 at 10:00:49AM -0400, Sasha Levin wrote:
> corruption must be detected before it propagates. The dual-bitmap
> implements a way to protect from corruption coming from hardware or
> software - two complementary representations of page allocation state,
> allocated independently via memblock, where any single-bit fault in
> either bitmap is immediately detectable. Performance is secondary to
> correctness in this context. A safety mechanism must be simple enough
> to audit and certify, must fail deterministically (panic, not
> log-and-hope), and its correctness matters more than its throughput.
> The dual-bitmap adds two atomic bitops per alloc/free, but for
> safety-critical deployments this cost is acceptable because the
> alternative - undetected corruption propagating silently - violates
> the system's safety case. The static key ensures zero cost for kernels
> that don't need it.

But doubling the storage requirement in order to achieve merely detection
is significantly worse than state-of-the-art in 1950 (when Richard
Hamming invented Hamming codes).  If we used a (7,3) code, we'd have
SECDED at a lower cost.  Of course, there are far better codes available
than that today.
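
For readers unfamiliar with SECDED, the following is a minimal sketch of
the textbook extended Hamming (8,4) code, i.e. Hamming (7,4) plus one
overall parity bit. It is an assumption chosen for illustration, not
necessarily the code meant above, and all function names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum secded_status { SECDED_OK, SECDED_CORRECTED, SECDED_DOUBLE };

/* Parity (0 or 1) of the eight bits in v. */
static unsigned int parity8(uint8_t v)
{
	unsigned int p = 0;

	for (int i = 0; i < 8; i++)
		p ^= (v >> i) & 1u;
	return p;
}

/* XOR of the positions (1..7) whose bit is set; 0 for a valid codeword. */
static unsigned int syndrome(uint8_t cw)
{
	unsigned int s = 0;

	for (unsigned int pos = 1; pos <= 7; pos++)
		if ((cw >> pos) & 1u)
			s ^= pos;
	return s;
}

/* Data at positions 3,5,6,7; checks at 1,2,4; bit 0 is overall parity. */
static uint8_t secded_encode(uint8_t nibble)
{
	assert(nibble < 16);
	unsigned int d1 = nibble & 1u, d2 = (nibble >> 1) & 1u;
	unsigned int d3 = (nibble >> 2) & 1u, d4 = (nibble >> 3) & 1u;
	unsigned int cw = (d1 << 3) | (d2 << 5) | (d3 << 6) | (d4 << 7);

	cw |= ((d1 ^ d2 ^ d4) << 1) | ((d1 ^ d3 ^ d4) << 2) | ((d2 ^ d3 ^ d4) << 4);
	cw |= parity8((uint8_t)cw);	/* make the total parity even */
	return (uint8_t)cw;
}

static enum secded_status secded_decode(uint8_t cw, uint8_t *nibble)
{
	unsigned int s = syndrome(cw);
	enum secded_status st = SECDED_OK;

	if (parity8(cw)) {		/* odd parity: assume one flip, fix it */
		cw ^= (uint8_t)(1u << s);	/* s == 0 is the parity bit itself */
		st = SECDED_CORRECTED;
	} else if (s != 0) {		/* even parity, bad syndrome: two flips */
		return SECDED_DOUBLE;
	}
	*nibble = (uint8_t)(((cw >> 3) & 1u) | (((cw >> 5) & 1u) << 1) |
			    (((cw >> 6) & 1u) << 2) | (((cw >> 7) & 1u) << 3));
	return st;
}
```

At this width the code still spends four check bits on four data bits,
so the storage saving only appears with wider words: the common (72,64)
SECDED layout uses 8 check bits per 64 data bits.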
Re: [RFC 0/7] mm: dual-bitmap page allocator consistency checker
Posted by Sasha Levin 1 month, 3 weeks ago
On Fri, Apr 24, 2026 at 04:34:12PM +0100, Matthew Wilcox wrote:
>On Fri, Apr 24, 2026 at 10:00:49AM -0400, Sasha Levin wrote:
>> corruption must be detected before it propagates. The dual-bitmap
>> implements a way to protect from corruption coming from hardware or
>> software - two complementary representations of page allocation state,
>> allocated independently via memblock, where any single-bit fault in
>> either bitmap is immediately detectable. Performance is secondary to
>> correctness in this context. A safety mechanism must be simple enough
>> to audit and certify, must fail deterministically (panic, not
>> log-and-hope), and its correctness matters more than its throughput.
>> The dual-bitmap adds two atomic bitops per alloc/free, but for
>> safety-critical deployments this cost is acceptable because the
>> alternative - undetected corruption propagating silently - violates
>> the system's safety case. The static key ensures zero cost for kernels
>> that don't need it.
>
>But doubling the storage requirement in order to achieve merely detection
>is significantly worse than state-of-the-art in 1950 (when Richard
>Hamming invented Hamming codes).  If we used a (7,3) code, we'd have
>SECDED at a lower cost.  Of course, there are far better codes available
>than that today.

I agree with the density concern. I had two reasons for going with two
bitmaps anyway:

1. Update cost. On the alloc/free hot path the dual-bitmap update is two
independent test_and_set_bit() calls. A Hamming/SECDED codeword needs a
read-modify-write of the whole word with locking on every state change.

2. Correlated faults. The two copies need to sit in different physical memory
so a multi-bit fault (row, column, bank, row-hammer) can only hit one of them.
See this paper, which has some numbers:
https://dl.acm.org/doi/epdf/10.1145/2786763.2694348
About 21% of DRAM faults span more than one bit, plain SECDED can leave up
to 20 FIT per device of undetected errors from those, and it only helps at
all if data and parity bits are spread across physically separate cells.

Two memblock_alloc'd bitmaps give that separation for free. You could
interleave a code across two independent regions instead, but then
the invariant check stops being a one-line complement check, which is
what I was trying to keep simple for the audit side.
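
For reference, the complement check over a whole range can be sketched
as a word-wise scan like the one below. This is an illustrative
userspace version with a hypothetical function name, not the code from
the series.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the index of the first word where primary != ~secondary,
 * or nwords if the two bitmaps are consistent.
 */
static size_t dual_bitmap_first_mismatch(const unsigned long *primary,
					 const unsigned long *secondary,
					 size_t nwords)
{
	assert(primary && secondary);
	for (size_t i = 0; i < nwords; i++)
		if (primary[i] != ~secondary[i])	/* the one-line check */
			return i;
	return nwords;
}
```

A flip in either array alone breaks the equality for that word. A fault
that flips the same bit position in both arrays preserves the complement
and passes this check, which is why the physical separation of the two
allocations matters.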

-- 
Thanks,
Sasha