On Wed, 20 May 2026 10:59:06 -0400 Rik van Riel <riel@surriel.com> wrote:
>
> Some workloads see real performance benefits from using 1GB pages,
> but allocating 1GB pages has often been limited to hugetlb pages
> that were set aside at boot time, or using CMA to keep a fixed
> amount of system memory off limits to the kernel.
>
> Neither of those are great solutions, given that modern servers
> tend to be large, often run multiple workloads simultaneously,
> and each workload wants something else.
>
> To address that issue, this patch series divides memory not just
> into 2MB page blocks, but into PUD sized superpageblocks, and
> aggressively tries to steer unmovable, reclaimable, and highatomic
> allocations into those superpageblocks that have already been
> "tainted" by such allocations.
>
> The goal is to leave as many 1GB superpageblocks as possible
> used by only movable allocations, so they can be easily
> defragmented for either regular PMD sized huge pages, or
> for PUD sized huge pages.
>
> Various strategies are used to accomplish this goal:
> - unmovable and reclaimable allocations are preferentially
> done from 1GB blocks that have already been "tainted" by
> these allocations
> - kernel allocations that can be done as one higher order
> allocation, or a number of smaller allocations (eg. kvmalloc)
> will fall back to small pages, rather than taint a new
> 1GB block
Hi Rik!

These comments are based only on the cover letter; I hope to get to
reviewing all of the patches. The point above about kernel allocations
falling back to small pages is interesting.
- Will it hurt performance, since those kernel allocations won't
  benefit from a higher-order allocation?
- Will it reduce 2M THP allocation efficiency, due to more
  fragmentation of kernel memory?
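
To make the question concrete, here is how I read the fallback rule, as
a rough userspace sketch in C. Everything in it (struct spb,
choose_path(), the enum names, the fields) is made up for illustration
and is not taken from the patches; it only models the decision, not the
real allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one 1GB superpageblock. */
struct spb {
	bool tainted;                /* already holds unmovable/reclaimable pages */
	unsigned long free_pages;    /* free 4K pages in this block */
	unsigned int max_free_order; /* largest contiguous free chunk, as an order */
};

enum alloc_path {
	PATH_HIGH_ORDER_TAINTED,  /* one high-order chunk from a tainted block */
	PATH_SMALL_PAGES_TAINTED, /* many order-0 pages from tainted blocks */
	PATH_TAINT_NEW_BLOCK,     /* last resort: taint a clean block */
};

/*
 * Decide where a kernel allocation of 2^order pages would come from.
 * can_split is true for callers that can live with non-contiguous
 * pages (the kvmalloc-style case from the cover letter).
 */
static enum alloc_path choose_path(const struct spb *blocks, size_t n,
				   unsigned int order, bool can_split)
{
	unsigned long need = 1UL << order;
	unsigned long tainted_free = 0;
	size_t i;

	assert(blocks != NULL);

	for (i = 0; i < n; i++) {
		if (!blocks[i].tainted)
			continue;
		/* A tainted block can satisfy the request in one piece. */
		if (blocks[i].max_free_order >= order)
			return PATH_HIGH_ORDER_TAINTED;
		tainted_free += blocks[i].free_pages;
	}

	/* Prefer scattered small pages over tainting a clean block. */
	if (can_split && tainted_free >= need)
		return PATH_SMALL_PAGES_TAINTED;

	return PATH_TAINT_NEW_BLOCK;
}
```

In this model the PATH_SMALL_PAGES_TAINTED case is the one my two
questions are about: the caller gets its memory, but as order-0 pages.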
> - movable allocations are preferentially done from clean 1GB
> blocks, which have only free and movable memory inside,
> starting with the fullest of these 1GB blocks
> - 2MB allocations follow the same strategy
> - 1GB allocations start with the emptiest clean 1GB block
> - if a 1GB block is mixed, with some movable pageblocks,
> some free pageblocks, and some unmovable/reclaimable pageblocks,
> the system has a free threshold below which only unmovable and
> reclaimable allocations can be done from that 1GB block
> - below that threshold, no new movable allocations are allowed
> in that 1GB block, while new unmovable/reclaimable allocations
> are still allowed
When you say movable allocations are not allowed, do you mean that
a failed movable allocation will result in an OOM?
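
For reference, this is the per-block rule as I understand it from the
description, again as a userspace C sketch. The names, the threshold
value and the page-based units are my assumptions, not something from
the series. The sketch only says whether one mixed 1GB block accepts an
allocation; what happens when every block refuses a movable allocation
is exactly what I am asking about.

```c
#include <stdbool.h>

enum mt_kind { KIND_MOVABLE, KIND_UNMOVABLE, KIND_RECLAIMABLE };

/* Hypothetical per-block state for a mixed (tainted) 1GB block. */
struct mixed_spb {
	unsigned long free_pages;
};

/* Assumed threshold, in 4K pages; the real value is not in the cover letter. */
#define SPB_FREE_THRESHOLD 8192UL

/* Below the threshold only unmovable/reclaimable allocations get in. */
static bool mixed_spb_allows(const struct mixed_spb *b, enum mt_kind kind)
{
	if (kind == KIND_MOVABLE)
		return b->free_pages >= SPB_FREE_THRESHOLD;
	return true;
}

/*
 * Number of movable pages that would have to be migrated out to bring
 * free memory in the block back up to the threshold.
 */
static unsigned long mixed_spb_evacuation_target(const struct mixed_spb *b)
{
	if (b->free_pages >= SPB_FREE_THRESHOLD)
		return 0;
	return SPB_FREE_THRESHOLD - b->free_pages;
}
```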
> - when a 1GB block is below that threshold, use the migration
> code to evacuate enough movable memory from the 1GB block
> to bring free memory in that 1GB block back to the threshold
>
> These strategies together serve to concentrate unmovable and
> reclaimable allocations in as few 1GB blocks as possible,
> leaving as many 1GB blocks as possible available for movable
> allocations.
>
> That enables both more extensive use of 2MB THPs and mTHPs,
> as well as reliable allocation of 1GB pages.
>
> The above strategies also make the core page allocator
> more complicated, and slower. In order to avoid that issue,
> the series is built on top of Johannes's PCPBuddy series,
> which has the goal of reducing how often CPUs need to get
> pages from the zone free lists, instead relying on CPUs
> giving back pages to each other, based on page block ownership.
>
> TODO:
> - compaction "always" succeeds, with a success rate of 99.96% seen
> in traces; this sounds great, but it also results in compaction
> never being throttled, and compaction blowing out everybody's
> PCP through lru_add_drain() calls. This needs some sort of solution.
> - replace the superpageblock name with something Matthew and David
> both like
> - find more corner cases, and fix them
>
> Based on e1914add2799