mm: reliable 1GB page allocation

[RFC PATCH 00/40] mm: reliable 1GB page allocation

Posted by Rik van Riel 4 days, 7 hours ago

Some workloads see real performance benefits from using 1GB pages,
but allocating 1GB pages has often been limited to hugetlb pages
that were set aside at boot time, or using CMA to keep a fixed
amount of system memory off limits to the kernel.

Neither of those is a great solution, given that modern servers
tend to be large, often run multiple workloads simultaneously,
and each workload wants something else.

To address that issue, this patch series divides memory not just
into 2MB page blocks, but into PUD sized superpageblocks, and
aggressively tries to steer unmovable, reclaimable, and highatomic
allocations into those superpageblocks that have already been
"tainted" by such allocations.

The goal is to leave as many 1GB superpageblocks as possible
used by only movable allocations, so they can be easily
defragmented for either regular PMD sized huge pages, or
for PUD sized huge pages.

Various strategies are used to accomplish this goal:
- unmovable and reclaimable allocations are preferentially
  done from 1GB blocks that have already been "tainted" by
  these allocations
- kernel allocations that can be done as one higher order
  allocation, or a number of smaller allocations (e.g. kvmalloc)
  will fall back to small pages, rather than taint a new
  1GB block
- movable allocations are preferentially done from clean 1GB
  blocks, which have only free and movable memory inside,
  starting with the fullest of these 1GB blocks
- 2MB allocations follow the same strategy
- 1GB allocations start with the emptiest clean 1GB block
- if a 1GB block is mixed, with some movable pageblocks,
  some free pageblocks, and some unmovable/reclaimable pageblocks,
  the system has a free threshold below which only unmovable and
  reclaimable allocations can be done from that 1GB block
- below that threshold, no new movable allocations are allowed
  in that 1GB block, while new unmovable/reclaimable allocations
  are still allowed
- when a 1GB block is below that threshold, use the migration
  code to evacuate enough movable memory from the 1GB block
  to bring free memory in that 1GB block back to the threshold

These strategies together serve to concentrate unmovable and
reclaimable allocations in as few 1GB blocks as possible,
leaving as many 1GB blocks as possible available for movable
allocations.
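
The steering rules above can be sketched as a small userspace model.
This is not code from the series: struct spb, enum alloc_kind,
pick_spb() and the reserve parameter are hypothetical names chosen for
illustration, and packing unmovable allocations into the fullest
tainted block is an assumption (the cover letter only says tainted
blocks are preferred).

```c
#include <stddef.h>

/* Hypothetical model: one entry per 1GB superpageblock, which holds
 * 512 pageblocks of 2MB each. */
enum alloc_kind { ALLOC_UNMOVABLE, ALLOC_MOVABLE, ALLOC_1GB };

struct spb {
    int tainted;               /* holds unmovable/reclaimable memory */
    unsigned free_pageblocks;  /* 0..512 */
};

/* May this kind of allocation use this block at all? */
static int eligible(const struct spb *b, enum alloc_kind kind, unsigned reserve)
{
    if (b->free_pageblocks == 0)
        return 0;
    if (kind == ALLOC_UNMOVABLE)
        return b->tainted;            /* stay inside tainted blocks */
    if (kind == ALLOC_1GB)
        return !b->tainted;           /* only clean blocks are candidates */
    /* Movable: clean blocks always; mixed blocks only while their free
     * memory has not dropped below the threshold. */
    return !b->tainted || b->free_pageblocks >= reserve;
}

/* Is block a a better choice than block b for this kind? */
static int better(const struct spb *a, const struct spb *b, enum alloc_kind kind)
{
    if (kind == ALLOC_MOVABLE && a->tainted != b->tainted)
        return !a->tainted;                              /* clean first */
    if (kind == ALLOC_1GB)
        return a->free_pageblocks > b->free_pageblocks;  /* emptiest */
    return a->free_pageblocks < b->free_pageblocks;      /* fullest */
}

/* Returns the index of the preferred block, or -1 if no block
 * qualifies and the caller has to fall back (smaller pages, reclaim,
 * or tainting a new block). */
int pick_spb(const struct spb *s, size_t n, enum alloc_kind kind, unsigned reserve)
{
    int best = -1;

    for (size_t i = 0; i < n; i++) {
        if (!eligible(&s[i], kind, reserve))
            continue;
        if (best < 0 || better(&s[i], &s[best], kind))
            best = (int)i;
    }
    return best;
}
```

In this model a movable allocation lands in the fullest clean block, a
1GB allocation in the emptiest clean block, and an unmovable allocation
never leaves the tainted blocks unless pick_spb() returns -1.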

That enables both more extensive use of 2MB THPs and mTHPs,
as well as reliable allocation of 1GB pages.

The above strategies also make the core page allocator
more complicated, and slower. In order to avoid that issue,
the series is built on top of Johannes's PCPBuddy series,
which has the goal of reducing how often CPUs need to get
pages from the zone free lists, instead relying on CPUs
giving back pages to each other, based on page block ownership.

TODO:
- compaction "always" succeeds, with a success rate of 99.96% seen
  in traces; this sounds great, but it also results in compaction
  never being throttled, and compaction blowing out everybody's
  PCP through lru_add_drain() calls. This needs some sort of solution.
- replace the superpageblock name with something Matthew and David
  both like
- find more corner cases, and fix them

Based on e1914add2799

Re: [RFC PATCH 00/40] mm: reliable 1GB page allocation

Posted by Usama Arif 2 days, 11 hours ago

On Wed, 20 May 2026 10:59:06 -0400 Rik van Riel <riel@surriel.com> wrote:

> 
> Some workloads see real performance benefits from using 1GB pages,
> but allocating 1GB pages has often been limited to hugetlb pages
> that were set aside at boot time, or using CMA to keep a fixed
> amount of system memory off limits to the kernel.
> 
> Neither of those is a great solution, given that modern servers
> tend to be large, often run multiple workloads simultaneously,
> and each workload wants something else.
> 
> To address that issue, this patch series divides memory not just
> into 2MB page blocks, but into PUD sized superpageblocks, and
> aggressively tries to steer unmovable, reclaimable, and highatomic
> allocations into those superpageblocks that have already been
> "tainted" by such allocations.
> 
> The goal is to leave as many 1GB superpageblocks as possible
> used by only movable allocations, so they can be easily
> defragmented for either regular PMD sized huge pages, or
> for PUD sized huge pages.
> 
> Various strategies are used to accomplish this goal:
> - unmovable and reclaimable allocations are preferentially
>   done from 1GB blocks that have already been "tainted" by
>   these allocations
> - kernel allocations that can be done as one higher order
>   allocation, or a number of smaller allocations (e.g. kvmalloc)
>   will fall back to small pages, rather than taint a new
>   1GB block

Hi Rik!

These comments are based only on the cover letter.

Hopefully I will get to review all the patches. The point above about
kernel allocations falling back to small pages is interesting.

- Will it result in a performance impact, as kernel allocations
won't benefit from higher order allocation?
- Will this impact 2M THP allocation efficiency due to more
fragmentation of kernel memory?


> - movable allocations are preferentially done from clean 1GB
>   blocks, which have only free and movable memory inside,
>   starting with the fullest of these 1GB blocks
> - 2MB allocations follow the same strategy
> - 1GB allocations start with the emptiest clean 1GB block
> - if a 1GB block is mixed, with some movable pageblocks,
>   some free pageblocks, and some unmovable/reclaimable pageblocks,
>   the system has a free threshold below which only unmovable and
>   reclaimable allocations can be done from that 1GB block
> - below that threshold, no new movable allocations are allowed
>   in that 1GB block, while new unmovable/reclaimable allocations
>   are still allowed

By "allowed", do you mean that if movable allocations fail, it will
result in an OOM?



Re: [RFC PATCH 00/40] mm: reliable 1GB page allocation

Posted by Rik van Riel 2 days, 8 hours ago

On Fri, 2026-05-22 at 04:02 -0700, Usama Arif wrote:
> On Wed, 20 May 2026 10:59:06 -0400 Rik van Riel <riel@surriel.com> wrote:
> 
> Hopefully I will get to review all the patches. The point above about
> kernel allocations falling back to small pages is interesting.
> 
> - Will it result in a performance impact, as kernel allocations
> won't benefit from higher order allocation?

It might! We may well need a better solution
here, like spilling over earlier, but limiting
the number of 1GB blocks we can spill over into
simultaneously (partially used for kernel memory).

> - Will this impact 2M THP allocation efficiency due to more
> fragmentation of kernel memory?

THPs come from movable memory. With more 1GB
page blocks not having kernel allocations in
them, THP allocations should be easier.

> 
> 
> > - movable allocations are preferentially done from clean 1GB
> >   blocks, which have only free and movable memory inside,
> >   starting with the fullest of these 1GB blocks
> > - 2MB allocations follow the same strategy
> > - 1GB allocations start with the emptiest clean 1GB block
> > - if a 1GB block is mixed, with some movable pageblocks,
> >   some free pageblocks, and some unmovable/reclaimable pageblocks,
> >   the system has a free threshold below which only unmovable and
> >   reclaimable allocations can be done from that 1GB block
> > - below that threshold, no new movable allocations are allowed
> >   in that 1GB block, while new unmovable/reclaimable allocations
> >   are still allowed
> 
> By "allowed", do you mean that if movable allocations fail, it will
> result in an OOM?

Yes, but by that time the zone free memory should
also be below the low watermark, because there
should only be a few partially occupied tainted
1GB page blocks.

If the zone has a 300MB low watermark, and there
are 50MB tied up in those reserved-for-unmovable
memory areas, it should not cause early OOMs.

I don't know if we can end up in a situation
where somehow the reserved-for-unmovable memory
would add up to more than the zone low watermark.

That would be bad, and if it happened we would
have to add some sort of protection against that.
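
The concern above amounts to comparing two numbers per zone. A minimal
sketch of that comparison follows, assuming the memory unavailable to
movable allocations in each tainted block is its free memory capped at
that block's threshold. struct tainted_spb, unmovable_only_free_mb()
and early_oom_risk() are hypothetical names, not part of the series.

```c
#include <stddef.h>

/* Hypothetical per-block summary; all sizes in MB. */
struct tainted_spb {
    unsigned free_mb;     /* free memory currently in this tainted 1GB block */
    unsigned reserve_mb;  /* threshold below which movable allocations are refused */
};

/* Free memory that movable allocations cannot use: in each tainted
 * block, the free memory up to that block's threshold. */
unsigned long unmovable_only_free_mb(const struct tainted_spb *b, size_t n)
{
    unsigned long total = 0;

    for (size_t i = 0; i < n; i++)
        total += b[i].free_mb < b[i].reserve_mb ? b[i].free_mb
                                                : b[i].reserve_mb;
    return total;
}

/* Nonzero when the reserved-for-unmovable memory reaches the zone low
 * watermark, the situation described above as bad. */
int early_oom_risk(const struct tainted_spb *b, size_t n,
                   unsigned long low_wmark_mb)
{
    return unmovable_only_free_mb(b, n) >= low_wmark_mb;
}
```

With two tainted blocks holding 32MB and 18MB of reserved free memory
(50MB in total) and a 300MB low watermark, early_oom_risk() returns 0,
matching the example numbers above.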

-- 
All Rights Reversed.

[syzbot ci] Re: mm: reliable 1GB page allocation

Posted by syzbot ci 3 days, 15 hours ago

syzbot ci has tested the following series

[v1] mm: reliable 1GB page allocation
https://lore.kernel.org/all/20260520150018.2491267-1-riel@surriel.com
* [RFC PATCH 01/40] mm: page_alloc: replace pageblock_flags bitmap with struct pageblock_data
* [RFC PATCH 02/40] mm: page_alloc: per-cpu pageblock buddy allocator
* [RFC PATCH 03/40] mm: page_alloc: split-path PCP free with local-trylock + remote-llist
* [RFC PATCH 04/40] mm: mm_init: fix zone assignment for pages in unavailable ranges
* [RFC PATCH 05/40] mm: page_alloc: remove watermark boost mechanism
* [RFC PATCH 06/40] mm: page_alloc: async evacuation of stolen movable pageblocks
* [RFC PATCH 07/40] mm: page_alloc: track actual page contents in pageblock flags
* [RFC PATCH 08/40] mm: page_alloc: superpageblock metadata for 1GB anti-fragmentation
* [RFC PATCH 09/40] mm: page_alloc: support superpageblock resize for memory hotplug
* [RFC PATCH 10/40] mm: page_alloc: add superpageblock fullness lists for allocation steering
* [RFC PATCH 11/40] mm: page_alloc: steer pageblock stealing to tainted superpageblocks
* [RFC PATCH 12/40] mm: page_alloc: steer movable allocations to fullest clean superpageblocks
* [RFC PATCH 13/40] mm: page_alloc: extract claim_whole_block from try_to_claim_block
* [RFC PATCH 14/40] mm: page_alloc: add per-superpageblock free lists
* [RFC PATCH 15/40] mm: page_alloc: add background superpageblock defragmentation worker
* [RFC PATCH 16/40] mm: compaction: walk per-superpageblock free lists for migration targets
* [RFC PATCH 17/40] mm: page_alloc: superpageblock-aware contiguous and higher order allocation
* [RFC PATCH 18/40] mm: page_alloc: prevent atomic allocations from tainting clean SPBs
* [RFC PATCH 19/40] mm: page_alloc: aggressively pack non-movable allocs in tainted SPBs on large systems
* [RFC PATCH 20/40] mm: page_alloc: prefer reclaim over tainting clean superpageblocks
* [RFC PATCH 21/40] mm: page_alloc: adopt partial pageblocks from tainted superpageblocks
* [RFC PATCH 22/40] mm: page_alloc: add CONFIG_DEBUG_VM sanity checks for SPB counters
* [RFC PATCH 23/40] mm: page_alloc: targeted evacuation and dynamic reserves for tainted SPBs
* [RFC PATCH 24/40] mm: page_alloc: prevent UNMOVABLE/RECLAIMABLE mixing in pageblocks
* [RFC PATCH 25/40] mm: trigger deferred SPB evac when atomic allocs would taint a clean SPB
* [RFC PATCH 26/40] mm: page_alloc: refuse fragmenting fallback for callers with cheap fallback
* [RFC PATCH 27/40] mm: page_alloc: cross-migratetype buddy borrow within tainted SPBs
* [RFC PATCH 28/40] mm: page_alloc: drive slab shrink from SPB anti-fragmentation pressure
* [RFC PATCH 29/40] mm: page_reporting: walk per-superpageblock free lists
* [RFC PATCH 30/40] mm: show_mem: collect migratetype letters from per-superpageblock lists
* [RFC PATCH 31/40] mm: page_alloc: per-(zone, order, mt) PASS_1 hint cache
* [RFC PATCH 32/40] mm: debug: prevent infinite recursion in dump_page() with CMA
* [RFC PATCH 33/40] PM: hibernate: walk per-superpageblock free lists in mark_free_pages
* [RFC PATCH 34/40] btrfs: allocate eb-attached btree pages as movable
* [RFC PATCH 35/40] mm: page_alloc: refuse best-effort high-order allocs servable at lower orders
* [RFC PATCH 36/40] mm: page_alloc: set ALLOC_NOFRAGMENT on alloc_frozen_pages_nolock_noprof
* [RFC PATCH 37/40] mm: page_alloc: move spb_get_category and spb_tainted_reserve to mmzone.h
* [RFC PATCH 38/40] mm: compaction: skip empty tainted superpageblocks as migration source
* [RFC PATCH 39/40] mm: compaction: respect tainted SPB reserve in destination selection
* [RFC PATCH 40/40] mm: page_alloc: SPB tracepoint instrumentation [DO-NOT-MERGE]

and found the following issue:
WARNING in preempt_count_sub

Full report is available here:
https://ci.syzbot.org/series/8eef71e1-708c-40d0-9f9d-3a1bd637fc80

***

WARNING in preempt_count_sub

tree:      mm-new
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/akpm/mm.git
base:      4b83cbc4c15f09b000cc06f033f64b0824b6dc87
arch:      amd64
compiler:  Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config:    https://ci.syzbot.org/builds/de024c89-d472-4f06-b5be-4f75f2acd1d9/config

------------[ cut here ]------------
DEBUG_LOCKS_WARN_ON(val > preempt_count())
WARNING: kernel/sched/core.c:5883 at preempt_count_sub+0x9e/0x170, CPU#0: kworker/0:0/9
Modules linked in:
CPU: 0 UID: 0 PID: 9 Comm: kworker/0:0 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Workqueue: events drain_vmap_area_work
RIP: 0010:preempt_count_sub+0xa5/0x170
Code: 29 31 90 48 c1 e8 03 0f b6 04 18 84 c0 0f 85 88 00 00 00 83 3d ff 50 9d 0e 00 75 13 48 8d 3d 32 37 a0 0e 48 c7 c6 c0 f8 cb 8b <67> 48 0f b9 3a 90 eb b8 90 e8 ed d6 1d 03 85 c0 74 2f 48 c7 c0 c4
RSP: 0000:ffffc900000e7590 EFLAGS: 00010246
RAX: 0000000000000000 RBX: dffffc0000000000 RCX: ffff8881007d5880
RDX: 0000000000000000 RSI: ffffffff8bcbf8c0 RDI: ffffffff90341000
RBP: ffff888121042dc0 R08: ffffffff903129c3 R09: 1ffffffff2062538
R10: dffffc0000000000 R11: fffffbfff2062539 R12: 0000000000000020
R13: dffffc0000000000 R14: ffff888121042d80 R15: ffff88815fffbae8
FS:  0000000000000000(0000) GS:ffff88818dc7e000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff88823ffff000 CR3: 000000000e74a000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 free_frozen_page_commit+0x737/0x1290
 __free_frozen_pages+0x873/0x1070
 kasan_depopulate_vmalloc_pte+0x6d/0x90
 __apply_to_page_range+0xbdc/0x1420
 __kasan_release_vmalloc+0xa2/0xd0
 purge_vmap_node+0x220/0x960
 __purge_vmap_area_lazy+0x779/0xb40
 drain_vmap_area_work+0x27/0x40
 process_scheduled_works+0xb5d/0x1860
 worker_thread+0xa53/0xfc0
 kthread+0x388/0x470
 ret_from_fork+0x514/0xb70
 ret_from_fork_asm+0x1a/0x30
 </TASK>
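
For readers unfamiliar with this warning: the check quoted in the
report, DEBUG_LOCKS_WARN_ON(val > preempt_count()), fires when more is
subtracted from the preempt count than was previously added. The
following is a userspace model of that check only, with hypothetical
model_* names. It is not kernel code and not a diagnosis of this
report.

```c
/* Stand-ins for the per-CPU preempt count and a warning counter. */
int model_preempt_count;
int model_warnings;

void model_preempt_count_add(int val)
{
    model_preempt_count += val;
}

void model_preempt_count_sub(int val)
{
    /* Mirrors the condition in the report: val > preempt_count(). */
    if (val > model_preempt_count) {
        model_warnings++;
        return;  /* the model skips the subtraction after warning */
    }
    model_preempt_count -= val;
}
```

A balanced add/sub pair leaves model_warnings at 0. A second sub
without a matching add trips the check, which is the usual shape of an
unlock or preempt_enable() reached on a path that never took the
matching lock or preempt_disable().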


***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.

To test a patch for this bug, please reply with `#syz test`
(should be on a separate line).

The patch should be attached to the email.
Note: arguments like custom git repos and branches are not supported.