mm: page_alloc: pcp buddy allocator

[RFC 0/2] mm: page_alloc: pcp buddy allocator

Posted by Johannes Weiner 2 months, 2 weeks ago

Hi,

this is an RFC for making the page allocator scale better with higher
thread counts and larger memory quantities.

In Meta production, we're seeing increasing zone->lock contention that
was traced back to a few different paths. A prominent one is the
userspace allocator, jemalloc. Allocations happen from page faults on
all CPUs running the workload. Frees are cached for reuse, but the
caches are periodically purged back to the kernel from a handful of
purger threads. This breaks affinity between allocations and frees:
Both sides use their own PCPs - one side depletes them, the other one
overfills them. Both sides routinely hit the zone->lock slowpath.

My understanding is that tcmalloc has a similar architecture.

Another contributor to contention is process exits, where large
numbers of pages are freed at once. The current PCP can only reduce
lock time when pages are reused. Reuse is unlikely because it's an
avalanche of free pages on a CPU busy walking page tables. Every time
the PCP overflows, the drain acquires the zone->lock and frees pages
one by one, trying to merge buddies together.

The idea proposed here is this: instead of single pages, make the PCP
grab entire pageblocks, split them outside the zone->lock. That CPU
then takes ownership of the block, and all frees route back to that
PCP instead of the freeing CPU's local one.
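A minimal sketch of the ownership routing described above, not the code from the patches. It assumes 4K pages with order-9 pageblocks (512 pages); the owner table and the helper names (block_owner, claim_block, free_target_cpu) are hypothetical and only illustrate the rule "frees go to the block's owning PCP, else to the local one".

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: order-9 pageblocks (512 pages of 4K), a toy zone of 8 blocks. */
#define PAGEBLOCK_ORDER 9
#define NR_BLOCKS       8
#define NO_OWNER        (-1)

/* Hypothetical per-block owner table; how the RFC stores this is not shown here. */
static int block_owner[NR_BLOCKS];

static void init_owners(void)
{
	for (size_t i = 0; i < NR_BLOCKS; i++)
		block_owner[i] = NO_OWNER;
}

/* Refill path: a CPU that grabs a whole block becomes its owner. */
static void claim_block(unsigned long block, int cpu)
{
	assert(block < NR_BLOCKS);
	block_owner[block] = cpu;
}

/*
 * Free path: route the page to the owning CPU's PCP. Pages from
 * unowned blocks (partial refills) stay on the freeing CPU's PCP.
 */
static int free_target_cpu(unsigned long pfn, int freeing_cpu)
{
	unsigned long block = pfn >> PAGEBLOCK_ORDER;

	assert(block < NR_BLOCKS);
	return block_owner[block] == NO_OWNER ? freeing_cpu : block_owner[block];
}
```

For example, after init_owners() and claim_block(1, 3), a free of pfn 519 on CPU 5 is routed to CPU 3, while a free of pfn 7 (unowned block 0) stays on CPU 5.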

This has several benefits:

1. Right away, it makes for coarser/fewer allocation transactions
   under the zone->lock.

1a. Even if no full free blocks are available (memory pressure or
    small zone), splitting at the PCP level means the PCP can still
    grab chunks larger than the requested order from the zone->lock
    freelists, and dole them out on its own time.

2. The pages free back to where the allocations happen, increasing the
   odds of reuse and reducing the chances of zone->lock slowpaths.

3. The page buddies come back into one place, allowing upfront merging
   under the local pcp->lock. This makes coarser/fewer freeing
   transactions under the zone->lock.
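To illustrate point 3, here is a toy userspace sketch of buddy merging inside a single owned block, the kind of work that could run under the pcp->lock before anything is handed back to the zone. The 8-page block, the free_order array and free_and_merge() are assumptions made for illustration; only the buddy arithmetic (pfn XOR (1 << order)) mirrors what the kernel's buddy allocator does.

```c
#include <assert.h>

/* Toy block of 8 pages; a real pageblock would be much larger. */
#define TOY_MAX_ORDER 3
#define BLOCK_PAGES   (1u << TOY_MAX_ORDER)

/* free_order[pfn]: order of the free chunk headed at pfn, or -1 if none. */
static int free_order[BLOCK_PAGES];

static void block_init(void)
{
	for (unsigned int i = 0; i < BLOCK_PAGES; i++)
		free_order[i] = -1;
}

/* Free a chunk and merge it upward; returns the order it ends up at. */
static int free_and_merge(unsigned int pfn, int order)
{
	assert(pfn < BLOCK_PAGES && (pfn & ((1u << order) - 1)) == 0);

	while (order < TOY_MAX_ORDER) {
		unsigned int buddy = pfn ^ (1u << order);

		/* Merge only with a free buddy of exactly the same order. */
		if (free_order[buddy] != order)
			break;
		free_order[buddy] = -1;
		pfn &= buddy;	/* the merged chunk starts at the lower pfn */
		order++;
	}
	free_order[pfn] = order;
	return order;
}
```

Freeing pages 0 through 7 one at a time ends with a single order-3 chunk, so a later drain would hand one whole block back under the zone->lock instead of eight single pages.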

The big concern is fragmentation. Movable allocations tend to be a mix
of short-lived anon and long-lived file cache pages. By the time the
PCP needs to drain due to thresholds or pressure, the blocks might not
be fully re-assembled yet. To prevent gobbling up and fragmenting ever
more blocks, partial blocks are remembered on drain and their pages
queued last on the zone freelist. When a PCP refills, it first tries
to recover any such fragment blocks.

On small or pressured machines, the PCP degrades to its previous
behavior. If a whole block doesn't fit the pcp->high limit, or a whole
block isn't available, the refill grabs smaller chunks that aren't
marked for ownership. The free side will use the local PCP as before.

I still need to run broader benchmarks, but I've been consistently
seeing a 3-4% reduction in %sys time for simple kernel builds on my
32-way, 32G RAM test machine.

A synthetic test on the same machine that allocates on many CPUs and
frees on just a few sees a consistent 1% increase in throughput.

I would expect those numbers to increase with higher concurrency and
larger memory volumes, but verifying that is TBD.

Sending an RFC to get an early gauge on direction.

Based on 0257f64bdac7fdca30fa3cae0df8b9ecbec7733a.

 include/linux/mmzone.h     |  38 ++-
 include/linux/page-flags.h |   9 +
 mm/debug.c                 |   1 +
 mm/internal.h              |  17 +
 mm/mm_init.c               |  25 +-
 mm/page_alloc.c            | 784 +++++++++++++++++++++++++++++++------------
 mm/sparse.c                |   3 +-
 7 files changed, 622 insertions(+), 255 deletions(-)

Re: [RFC 0/2] mm: page_alloc: pcp buddy allocator

Posted by Zi Yan 2 months, 2 weeks ago

On 3 Apr 2026, at 15:40, Johannes Weiner wrote:

> Hi,
>
> this is an RFC for making the page allocator scale better with higher
> thread counts and larger memory quantities.
>
> In Meta production, we're seeing increasing zone->lock contention that
> was traced back to a few different paths. A prominent one is the
> userspace allocator, jemalloc. Allocations happen from page faults on
> all CPUs running the workload. Frees are cached for reuse, but the
> caches are periodically purged back to the kernel from a handful of
> purger threads. This breaks affinity between allocations and frees:
> Both sides use their own PCPs - one side depletes them, the other one
> overfills them. Both sides routinely hit the zone->lock slowpath.
>
> My understanding is that tcmalloc has a similar architecture.
>
> Another contributor to contention is process exits, where large
> numbers of pages are freed at once. The current PCP can only reduce
> lock time when pages are reused. Reuse is unlikely because it's an
> avalanche of free pages on a CPU busy walking page tables. Every time
> the PCP overflows, the drain acquires the zone->lock and frees pages
> one by one, trying to merge buddies together.

IIUC, zone->lock held time is mostly spent on free page merging.
Have you tried to let PCP do the free page merging before holding
zone->lock and returning free pages to buddy? That is a much smaller
change than what you proposed. This method might not work if
physically contiguous free pages are allocated by separate CPUs,
so that PCP merging cannot be done. But this might be rare?

>
> The idea proposed here is this: instead of single pages, make the PCP
> grab entire pageblocks, split them outside the zone->lock. That CPU
> then takes ownership of the block, and all frees route back to that
> PCP instead of the freeing CPU's local one.

This is basically distributed buddy allocators, right? Instead of
relying on a single zone->lock, PCP locks are used. The worst case
it can face is that physically contiguous free pages are allocated
across all CPUs, so that all CPUs are competing for a single PCP lock.
It seems that you have not hit this. So I wonder if what I proposed
above might work as a simpler approach. Let me know if I miss anything.

I wonder how these distributed buddy allocators would work if anyone
wants to allocate >pageblock free pages, like alloc_contig_range().
Multiple PCP locks need to be taken one by one. Maybe it is better
than taking and dropping zone->lock repeatedly. Have you benchmarked
alloc_contig_range(), like hugetlb allocation?

>
> This has several benefits:
>
> 1. Right away, it makes for coarser/fewer allocation transactions
>    under the zone->lock.
>
> 1a. Even if no full free blocks are available (memory pressure or
>     small zone), splitting at the PCP level means the PCP can still
>     grab chunks larger than the requested order from the zone->lock
>     freelists, and dole them out on its own time.
>
> 2. The pages free back to where the allocations happen, increasing the
>    odds of reuse and reducing the chances of zone->lock slowpaths.
>
> 3. The page buddies come back into one place, allowing upfront merging
>    under the local pcp->lock. This makes coarser/fewer freeing
>    transactions under the zone->lock.

I wonder if we could go more radical by moving the buddy allocator out of
zone->lock completely to PCP lock. If one PCP runs out of free pages,
it can steal another PCP's whole pageblock. I probably should do some
literature investigation on this. Some research must have been done
on this.

>
> The big concern is fragmentation. Movable allocations tend to be a mix
> of short-lived anon and long-lived file cache pages. By the time the
> PCP needs to drain due to thresholds or pressure, the blocks might not
> be fully re-assembled yet. To prevent gobbling up and fragmenting ever
> more blocks, partial blocks are remembered on drain and their pages
> queued last on the zone freelist. When a PCP refills, it first tries
> to recover any such fragment blocks.
>
> On small or pressured machines, the PCP degrades to its previous
> behavior. If a whole block doesn't fit the pcp->high limit, or a whole
> block isn't available, the refill grabs smaller chunks that aren't
> marked for ownership. The free side will use the local PCP as before.
>
> I still need to run broader benchmarks, but I've been consistently
> seeing a 3-4% reduction in %sys time for simple kernel builds on my
> 32-way, 32G RAM test machine.
>
> A synthetic test on the same machine that allocates on many CPUs and
> frees on just a few sees a consistent 1% increase in throughput.
>
> I would expect those numbers to increase with higher concurrency and
> larger memory volumes, but verifying that is TBD.
>
> Sending an RFC to get an early gauge on direction.

Thank you for sending this out. :)

>
> Based on 0257f64bdac7fdca30fa3cae0df8b9ecbec7733a.
>
>  include/linux/mmzone.h     |  38 ++-
>  include/linux/page-flags.h |   9 +
>  mm/debug.c                 |   1 +
>  mm/internal.h              |  17 +
>  mm/mm_init.c               |  25 +-
>  mm/page_alloc.c            | 784 +++++++++++++++++++++++++++++++------------
>  mm/sparse.c                |   3 +-
>  7 files changed, 622 insertions(+), 255 deletions(-)


--
Best Regards,
Yan, Zi

Re: [RFC 0/2] mm: page_alloc: pcp buddy allocator

Posted by Johannes Weiner 2 months, 2 weeks ago

On Fri, Apr 03, 2026 at 10:27:36PM -0400, Zi Yan wrote:
> On 3 Apr 2026, at 15:40, Johannes Weiner wrote:
> > this is an RFC for making the page allocator scale better with higher
> > thread counts and larger memory quantities.
> >
> > In Meta production, we're seeing increasing zone->lock contention that
> > was traced back to a few different paths. A prominent one is the
> > userspace allocator, jemalloc. Allocations happen from page faults on
> > all CPUs running the workload. Frees are cached for reuse, but the
> > caches are periodically purged back to the kernel from a handful of
> > purger threads. This breaks affinity between allocations and frees:
> > Both sides use their own PCPs - one side depletes them, the other one
> > overfills them. Both sides routinely hit the zone->lock slowpath.
> >
> > My understanding is that tcmalloc has a similar architecture.
> >
> > Another contributor to contention is process exits, where large
> > numbers of pages are freed at once. The current PCP can only reduce
> > lock time when pages are reused. Reuse is unlikely because it's an
> > avalanche of free pages on a CPU busy walking page tables. Every time
> > the PCP overflows, the drain acquires the zone->lock and frees pages
> > one by one, trying to merge buddies together.
> 
> IIUC, zone->lock held time is mostly spent on free page merging.
> Have you tried to let PCP do the free page merging before holding
> zone->lock and returning free pages to buddy? That is a much smaller
> change than what you proposed. This method might not work if
> physically contiguous free pages are allocated by separate CPUs,
> so that PCP merging cannot be done. But this might be rare?

On my 32G system, pcp->high_min for zone Normal is 988. That's one
block and a half. The rmqueue_smallest policy means the next CPU will
prefer the remainder of that partial block. So if there is
concurrency, every other block is shared. Not exactly uncommon. The
effect lessens the larger the machine is, of course.

But let's assume it's not an issue. How do you know you can safely
merge with a buddy pfn? You need to establish that it's on that same
PCP's list. Short of *scanning* the list, it seems something like
PagePCPBuddy() and page->pcp_cpu is inevitably needed. But of course a
per-page cpu field is tough to come by.
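A rough sketch of the check this paragraph argues for, assuming the per-page state did exist. struct toy_page and its fields are hypothetical stand-ins for "something like PagePCPBuddy() and page->pcp_cpu"; they are not existing kernel fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-page state; a real struct page has no spare cpu field. */
struct toy_page {
	bool pcp_buddy;	/* page heads a free chunk sitting on some PCP list */
	int order;	/* order of that free chunk */
	int pcp_cpu;	/* CPU whose PCP list holds the chunk */
};

/*
 * A buddy is only safe to merge under this CPU's pcp->lock if it is
 * free on a PCP list, has the same order, and sits on the same CPU's list.
 */
static bool can_merge_on_pcp(const struct toy_page *buddy, int order, int cpu)
{
	assert(buddy != NULL);
	return buddy->pcp_buddy && buddy->order == order && buddy->pcp_cpu == cpu;
}
```

Without the pcp_cpu comparison, a buddy parked on another CPU's list could be merged while that CPU is concurrently handing it out, which is why the answer has to be recorded somewhere rather than guessed.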

So the block ownership is more natural, and then you might as well use
that for affinity routing to increase the odds of merges.

IOW, I'm having a hard time seeing what could be taken away and still
have it work.

> > The idea proposed here is this: instead of single pages, make the PCP
> > grab entire pageblocks, split them outside the zone->lock. That CPU
> > then takes ownership of the block, and all frees route back to that
> > PCP instead of the freeing CPU's local one.
> 
> This is basically distributed buddy allocators, right? Instead of
> relying on a single zone->lock, PCP locks are used. The worst case
> it can face is that physically contiguous free pages are allocated
> across all CPUs, so that all CPUs are competing for a single PCP lock.

The worst case is one CPU allocating for everybody else in the system,
so that all freers route to that PCP.

I've played with microbenchmarks to provoke this, but it looks mostly
neutral over baseline, at least at the scale of this machine.

In this scenario, baseline will have the affinity mismatch problem:
the allocating CPU routinely hits zone->lock to refill, and the
freeing CPUs routinely hit zone->lock to drain and merge.

In the new scheme, they would hit the pcp->lock instead of the
zone->lock. So not necessarily an improvement in lock breaking. BUT
because freers refill the allocator's cache, merging is deferred;
that's a net reduction of work performed under the contended lock.

> It seems that you have not hit this. So I wonder if what I proposed
> above might work as a simpler approach. Let me know if I miss anything.
> 
> I wonder how these distributed buddy allocators would work if anyone
> wants to allocate >pageblock free pages, like alloc_contig_range().
> Multiple PCP locks need to be taken one by one. Maybe it is better
> than taking and dropping zone->lock repeatedly. Have you benchmarked
> alloc_contig_range(), like hugetlb allocation?

I didn't change that aspect.

The PCPs are still the same size, and PCP pages are still skipped by
the isolation code.

IOW it's not a purely distributed buddy allocator. It's still just a
per-cpu cache of limited size. The only thing I'm doing is provide a
mechanism for splitting and pre-merging at the cache level, and
setting up affinity/routing rules to increase the chances of
success. But the impact on alloc_contig should be the same.

> > This has several benefits:
> >
> > 1. Right away, it makes for coarser/fewer allocation transactions
> >    under the zone->lock.
> >
> > 1a. Even if no full free blocks are available (memory pressure or
> >     small zone), splitting at the PCP level means the PCP can still
> >     grab chunks larger than the requested order from the zone->lock
> >     freelists, and dole them out on its own time.
> >
> > 2. The pages free back to where the allocations happen, increasing the
> >    odds of reuse and reducing the chances of zone->lock slowpaths.
> >
> > 3. The page buddies come back into one place, allowing upfront merging
> >    under the local pcp->lock. This makes coarser/fewer freeing
> >    transactions under the zone->lock.
> 
> I wonder if we could go more radical by moving the buddy allocator out of
> zone->lock completely to PCP lock. If one PCP runs out of free pages,
> it can steal another PCP's whole pageblock. I probably should do some
> literature investigation on this. Some research must have been done
> on this.

This is an interesting idea. Make the zone buddy a pure block economy
and remove all buddy code from it. Slowpath allocs and frees would
always be in whole blocks.

You'd have to come up with a natural stealing order. If one CPU needs
something it doesn't have, which CPUs, and in which order, do you look
at for stealing?

I think you'd still have to route back frees to the nominal owner of
the block, or stealing could scatter pages all over the place and we'd
never be able to merge them back up.

I think you'd also need to pull accounting (NR_FREE_PAGES) to the
per-cpu level, and inform compaction/isolation to deal with these
pages, since the majority default is now distributed.

But the scenario where one CPU needs what another one has is an
interesting one. I didn't invent anything new for this for now, but
rather rely on how we have been handling this through the zone
freelists. But I do think it's a little silly: right now, if a CPU
needs something another CPU might have, we ask EVERY CPU in the system
to drain their cache into the shared pool - simultaneously - running
the full buddy merge algorithm on everything that comes in. The CPU
grabs a small handful of these pages, most likely having to split
again. All other CPUs are now cache cold on the next request.

Re: [RFC 0/2] mm: page_alloc: pcp buddy allocator

Posted by Zi Yan 2 months, 2 weeks ago

On 6 Apr 2026, at 11:24, Johannes Weiner wrote:

> On Fri, Apr 03, 2026 at 10:27:36PM -0400, Zi Yan wrote:
>> On 3 Apr 2026, at 15:40, Johannes Weiner wrote:
>>> this is an RFC for making the page allocator scale better with higher
>>> thread counts and larger memory quantities.
>>>
>>> In Meta production, we're seeing increasing zone->lock contention that
>>> was traced back to a few different paths. A prominent one is the
>>> userspace allocator, jemalloc. Allocations happen from page faults on
>>> all CPUs running the workload. Frees are cached for reuse, but the
>>> caches are periodically purged back to the kernel from a handful of
>>> purger threads. This breaks affinity between allocations and frees:
>>> Both sides use their own PCPs - one side depletes them, the other one
>>> overfills them. Both sides routinely hit the zone->lock slowpath.
>>>
>>> My understanding is that tcmalloc has a similar architecture.
>>>
>>> Another contributor to contention is process exits, where large
>>> numbers of pages are freed at once. The current PCP can only reduce
>>> lock time when pages are reused. Reuse is unlikely because it's an
>>> avalanche of free pages on a CPU busy walking page tables. Every time
>>> the PCP overflows, the drain acquires the zone->lock and frees pages
>>> one by one, trying to merge buddies together.
>>
>> IIUC, zone->lock held time is mostly spent on free page merging.
>> Have you tried to let PCP do the free page merging before holding
>> zone->lock and returning free pages to buddy? That is a much smaller
>> change than what you proposed. This method might not work if
>> physically contiguous free pages are allocated by separate CPUs,
>> so that PCP merging cannot be done. But this might be rare?
>
> On my 32G system, pcp->high_min for zone Normal is 988. That's one
> block and a half. The rmqueue_smallest policy means the next CPU will
> prefer the remainder of that partial block. So if there is
> concurrency, every other block is shared. Not exactly uncommon. The
> effect lessens the larger the machine is, of course.
>
> But let's assume it's not an issue. How do you know you can safely
> merge with a buddy pfn? You need to establish that it's on that same
> PCP's list. Short of *scanning* the list, it seems something like
> PagePCPBuddy() and page->pcp_cpu is inevitably needed. But of course a
> per-page cpu field is tough to come by.
>
> So the block ownership is more natural, and then you might as well use
> that for affinity routing to increase the odds of merges.
>
> IOW, I'm having a hard time seeing what could be taken away and still
> have it work.

You are right. I was assuming that pages that can be merged are freed
via the same CPU. That rarely happens.

>
>>> The idea proposed here is this: instead of single pages, make the PCP
>>> grab entire pageblocks, split them outside the zone->lock. That CPU
>>> then takes ownership of the block, and all frees route back to that
>>> PCP instead of the freeing CPU's local one.
>>
>> This is basically distributed buddy allocators, right? Instead of
>> relying on a single zone->lock, PCP locks are used. The worst case
>> it can face is that physically contiguous free pages are allocated
>> across all CPUs, so that all CPUs are competing for a single PCP lock.
>
> The worst case is one CPU allocating for everybody else in the system,
> so that all freers route to that PCP.
>
> I've played with microbenchmarks to provoke this, but it looks mostly
> neutral over baseline, at least at the scale of this machine.
>
> In this scenario, baseline will have the affinity mismatch problem:
> the allocating CPU routinely hits zone->lock to refill, and the
> freeing CPUs routinely hit zone->lock to drain and merge.
>
> In the new scheme, they would hit the pcp->lock instead of the
> zone->lock. So not necessarily an improvement in lock breaking. BUT
> because freers refill the allocator's cache, merging is deferred;
> that's a net reduction of work performed under the contended lock.

This makes sense to me.

>
>> It seems that you have not hit this. So I wonder if what I proposed
>> above might work as a simpler approach. Let me know if I miss anything.
>>
>> I wonder how these distributed buddy allocators would work if anyone
>> wants to allocate >pageblock free pages, like alloc_contig_range().
>> Multiple PCP locks need to be taken one by one. Maybe it is better
>> than taking and dropping zone->lock repeatedly. Have you benchmarked
>> alloc_contig_range(), like hugetlb allocation?
>
> I didn't change that aspect.
>
> The PCPs are still the same size, and PCP pages are still skipped by
> the isolation code.
>
> IOW it's not a purely distributed buddy allocator. It's still just a
> per-cpu cache of limited size. The only thing I'm doing is provide a
> mechanism for splitting and pre-merging at the cache level, and
> setting up affinity/routing rules to increase the chances of
> success. But the impact on alloc_contig should be the same.

Got it. Thanks for the explanation.

>
>>> This has several benefits:
>>>
>>> 1. Right away, it makes for coarser/fewer allocation transactions
>>>    under the zone->lock.
>>>
>>> 1a. Even if no full free blocks are available (memory pressure or
>>>     small zone), splitting at the PCP level means the PCP can still
>>>     grab chunks larger than the requested order from the zone->lock
>>>     freelists, and dole them out on its own time.
>>>
>>> 2. The pages free back to where the allocations happen, increasing the
>>>    odds of reuse and reducing the chances of zone->lock slowpaths.
>>>
>>> 3. The page buddies come back into one place, allowing upfront merging
>>>    under the local pcp->lock. This makes coarser/fewer freeing
>>>    transactions under the zone->lock.
>>
>> I wonder if we could go more radical by moving the buddy allocator out of
>> zone->lock completely to PCP lock. If one PCP runs out of free pages,
>> it can steal another PCP's whole pageblock. I probably should do some
>> literature investigation on this. Some research must have been done
>> on this.
>
> This is an interesting idea. Make the zone buddy a pure block economy
> and remove all buddy code from it. Slowpath allocs and frees would
> always be in whole blocks.
>
> You'd have to come up with a natural stealing order. If one CPU needs
> something it doesn't have, which CPUs, and in which order, do you look
> at for stealing?

One naive idea is to have the zone buddy keep track of the PCP free
lists for stealing.

>
> I think you'd still have to route back frees to the nominal owner of
> the block, or stealing could scatter pages all over the place and we'd
> never be able to merge them back up.

Basically, we want to keep free pages merged as much as possible.
Something like free page compaction across all PCPs.

>
> I think you'd also need to pull accounting (NR_FREE_PAGES) to the
> per-cpu level, and inform compaction/isolation to deal with these
> pages, since the majority default is now distributed.
>
> But the scenario where one CPU needs what another one has is an
> interesting one. I didn't invent anything new for this for now, but
> rather rely on how we have been handling this through the zone
> freelists. But I do think it's a little silly: right now, if a CPU
> needs something another CPU might have, we ask EVERY CPU in the system
> to drain their cache into the shared pool - simultaneously - running
> the full buddy merge algorithm on everything that comes in. The CPU
> grabs a small handful of these pages, most likely having to split
> again. All other CPUs are now cache cold on the next request.

Yes, a better way might be that when a CPU wants something, it can
ask the other CPUs to drain only the minimal amount of free pages.
But I do not have a good idea of how to do that yet.

It sounds to me like your current approach is a good first step towards
a distributed buddy allocator. I will check the code, think about it
more, and ask questions later.

Thank you for the explanation.

Best Regards,
Yan, Zi