Hi,
this is an RFC for making the page allocator scale better with higher
thread counts and larger memory quantities.
In Meta production, we're seeing increasing zone->lock contention that
was traced back to a few different paths. A prominent one is the
userspace allocator, jemalloc. Allocations happen from page faults on
all CPUs running the workload. Frees are cached for reuse, but the
caches are periodically purged back to the kernel from a handful of
purger threads. This breaks affinity between allocations and frees:
Both sides use their own PCPs - one side depletes them, the other one
overfills them. Both sides routinely hit the zone->lock slowpath.
My understanding is that tcmalloc has a similar architecture.
Another contributor to contention is process exits, where large
numbers of pages are freed at once. The current PCP can only reduce
lock time when pages are reused. Reuse is unlikely because it's an
avalanche of free pages on a CPU busy walking page tables. Every time
the PCP overflows, the drain acquires the zone->lock and frees pages
one by one, trying to merge buddies together.
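
To make the cost of that one-by-one drain concrete, here is a minimal userspace sketch of the buddy merge step. The XOR/AND pfn arithmetic is the usual buddy scheme; the free_order[] table, the toy_*() names and the 512-page limit are assumptions made up for illustration, not code from this series.

```c
#include <assert.h>

#define TOY_MAX_ORDER 9                 /* assumed: one 512-page block */
#define TOY_NR_PAGES  (1UL << TOY_MAX_ORDER)

/* free_order[pfn]: order of the free chunk starting at pfn, or -1. */
static int free_order[TOY_NR_PAGES];

static void toy_init(void)
{
	for (unsigned long i = 0; i < TOY_NR_PAGES; i++)
		free_order[i] = -1;
}

/* A chunk and its buddy differ only in bit 'order' of the pfn. */
static unsigned long toy_buddy_pfn(unsigned long pfn, unsigned int order)
{
	return pfn ^ (1UL << order);
}

/*
 * Free one chunk and merge upwards as far as possible.
 * Returns the number of merge steps, i.e. the work that baseline
 * performs under the zone->lock for every freed page.
 */
static unsigned int toy_free(unsigned long pfn, unsigned int order)
{
	unsigned int merges = 0;

	while (order < TOY_MAX_ORDER) {
		unsigned long buddy = toy_buddy_pfn(pfn, order);

		if (free_order[buddy] != (int)order)
			break;          /* buddy in use, or free at another order */
		free_order[buddy] = -1; /* take the buddy off its list */
		pfn &= buddy;           /* merged chunk starts at the lower pfn */
		order++;
		merges++;
	}
	free_order[pfn] = (int)order;
	return merges;
}
```

Freeing pfns 0, 1, 2, 3 in that order takes 0, 1, 0 and 2 merge steps in this model, and leaves a single order-2 chunk at pfn 0.
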
The idea proposed here is this: instead of single pages, make the PCP
grab entire pageblocks, split them outside the zone->lock. That CPU
then takes ownership of the block, and all frees route back to that
PCP instead of the freeing CPU's local one.
This has several benefits:
1. It immediately makes for coarser/fewer allocation transactions
under the zone->lock.
1a. Even if no full free blocks are available (memory pressure or
small zone), splitting at the PCP level means the PCP can still
grab chunks larger than the requested order from the zone->lock
freelists, and dole them out on its own time.
2. The pages free back to where the allocations happen, increasing the
odds of reuse and reducing the chances of zone->lock slowpaths.
3. The page buddies come back into one place, allowing upfront merging
under the local pcp->lock. This makes coarser/fewer freeing
transactions under the zone->lock.
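
The ownership and routing rule can be sketched as a toy userspace model. Everything here is hypothetical: block_owner[], pcp_pages[], the toy_*() helpers and the 512-page block size are stand-ins picked for illustration, and the real series tracks ownership differently.

```c
#include <assert.h>

#define TOY_PAGEBLOCK_ORDER 9   /* assumed: 512 pages per pageblock */
#define TOY_NR_BLOCKS       8
#define TOY_NR_CPUS         4
#define TOY_NO_OWNER        (-1)

static int block_owner[TOY_NR_BLOCKS];          /* owning CPU per block */
static unsigned long pcp_pages[TOY_NR_CPUS];    /* pages cached per CPU */

static void toy_reset(void)
{
	for (int b = 0; b < TOY_NR_BLOCKS; b++)
		block_owner[b] = TOY_NO_OWNER;
	for (int c = 0; c < TOY_NR_CPUS; c++)
		pcp_pages[c] = 0;
}

/* Refill: the CPU pulls a whole block into its cache and owns it. */
static void toy_claim_block(int cpu, unsigned long block)
{
	assert(block < TOY_NR_BLOCKS && block_owner[block] == TOY_NO_OWNER);
	block_owner[block] = cpu;
	pcp_pages[cpu] += 1UL << TOY_PAGEBLOCK_ORDER;
}

/* Allocation always comes out of the local cache in this model. */
static void toy_alloc_page(int cpu)
{
	assert(pcp_pages[cpu] > 0);
	pcp_pages[cpu]--;
}

/* Free: route to the block owner if there is one, else stay local. */
static int toy_free_page(unsigned long pfn, int freeing_cpu)
{
	int owner = block_owner[pfn >> TOY_PAGEBLOCK_ORDER];
	int target = owner == TOY_NO_OWNER ? freeing_cpu : owner;

	pcp_pages[target]++;
	return target;
}
```

With CPU 0 owning block 1, a page from that block freed on CPU 3 lands back in CPU 0's cache, while a page from an unowned block stays on CPU 3 as before.
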
The big concern is fragmentation. Movable allocations tend to be a mix
of short-lived anon and long-lived file cache pages. By the time the
PCP needs to drain due to thresholds or pressure, the blocks might not
be fully re-assembled yet. To prevent gobbling up and fragmenting ever
more blocks, partial blocks are remembered on drain and their pages
queued last on the zone freelist. When a PCP refills, it first tries
to recover any such fragment blocks.
On small or pressured machines, the PCP degrades to its previous
behavior. If a whole block doesn't fit the pcp->high limit, or a whole
block isn't available, the refill grabs smaller chunks that aren't
marked for ownership. The free side will use the local PCP as before.
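
A rough sketch of that refill decision follows. The struct, the function name and the largest-chunk-first scan are assumptions of mine for illustration; the text above only says that smaller, unowned chunks are used when a whole block is unavailable or does not fit.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGEBLOCK_ORDER 9   /* assumed: 512 pages per pageblock */

struct toy_refill {
	unsigned int order;     /* order of the chunk to pull from the zone */
	bool owned;             /* true if the PCP takes block ownership */
};

/*
 * zone_free[o] is the number of free chunks of order o on the zone
 * freelists; it must have TOY_PAGEBLOCK_ORDER + 1 entries.
 */
static struct toy_refill toy_pick_refill(unsigned long pcp_count,
					 unsigned long pcp_high,
					 const unsigned long *zone_free,
					 unsigned int request_order)
{
	struct toy_refill r = { .order = request_order, .owned = false };
	unsigned long room = pcp_high > pcp_count ? pcp_high - pcp_count : 0;

	/* Preferred case: a whole block is available and fits the cache. */
	if (zone_free[TOY_PAGEBLOCK_ORDER] &&
	    (1UL << TOY_PAGEBLOCK_ORDER) <= room) {
		r.order = TOY_PAGEBLOCK_ORDER;
		r.owned = true;
		return r;
	}

	/* Degraded case: largest unowned chunk that is available and fits. */
	for (unsigned int o = TOY_PAGEBLOCK_ORDER; o-- > request_order; ) {
		if (zone_free[o] && (1UL << o) <= room) {
			r.order = o;
			break;
		}
	}

	/* If nothing matched, r still asks for request_order, unowned. */
	return r;
}
```

In the degraded case the chunk is not marked as owned, so frees of its pages go to the freeing CPU's local PCP, which is the previous behavior.
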
I still need to run broader benchmarks, but I've been consistently
seeing a 3-4% reduction in %sys time for simple kernel builds on my
32-way, 32G RAM test machine.
A synthetic test on the same machine that allocates on many CPUs and
frees on just a few sees a consistent 1% increase in throughput.
I would expect those numbers to increase with higher concurrency and
larger memory volumes, but verifying that is TBD.
Sending an RFC to get an early gauge on direction.
Based on 0257f64bdac7fdca30fa3cae0df8b9ecbec7733a.
include/linux/mmzone.h | 38 ++-
include/linux/page-flags.h | 9 +
mm/debug.c | 1 +
mm/internal.h | 17 +
mm/mm_init.c | 25 +-
mm/page_alloc.c | 784 +++++++++++++++++++++++++++++++------------
mm/sparse.c | 3 +-
7 files changed, 622 insertions(+), 255 deletions(-)
On 3 Apr 2026, at 15:40, Johannes Weiner wrote:

> Hi,
>
> this is an RFC for making the page allocator scale better with higher
> thread counts and larger memory quantities.
>
> In Meta production, we're seeing increasing zone->lock contention that
> was traced back to a few different paths. A prominent one is the
> userspace allocator, jemalloc. Allocations happen from page faults on
> all CPUs running the workload. Frees are cached for reuse, but the
> caches are periodically purged back to the kernel from a handful of
> purger threads. This breaks affinity between allocations and frees:
> Both sides use their own PCPs - one side depletes them, the other one
> overfills them. Both sides routinely hit the zone->locked slowpath.
>
> My understanding is that tcmalloc has a similar architecture.
>
> Another contributor to contention is process exits, where large
> numbers of pages are freed at once. The current PCP can only reduce
> lock time when pages are reused. Reuse is unlikely because it's an
> avalanche of free pages on a CPU busy walking page tables. Every time
> the PCP overflows, the drain acquires the zone->lock and frees pages
> one by one, trying to merge buddies together.

IIUC, zone->lock held time is mostly spent on free page merging.
Have you tried to let PCP do the free page merging before holding
zone->lock and returning free pages to buddy? That is a much smaller
change than what you proposed. This method might not work if
physically contiguous free pages are allocated by separate CPUs,
so that PCP merging cannot be done. But this might be rare?

> The idea proposed here is this: instead of single pages, make the PCP
> grab entire pageblocks, split them outside the zone->lock. That CPU
> then takes ownership of the block, and all frees route back to that
> PCP instead of the freeing CPU's local one.

This is basically distributed buddy allocators, right? Instead of
relying on a single zone->lock, PCP locks are used. The worst case
it can face is that physically contiguous free pages are allocated
across all CPUs, so that all CPUs are competing a single PCP lock.

It seems that you have not hit this. So I wonder if what I proposed
above might work as a simpler approach. Let me know if I miss anything.

I wonder how this distributed buddy allocators would work if anyone
wants to allocate >pageblock free pages, like alloc_contig_range().
Multiple PCP locks need to be taken one by one. Maybe it is better
than taking and dropping zone->lock repeatedly. Have you benchmarked
alloc_contig_range(), like hugetlb allocation?

> This has several benefits:
>
> 1. It's right away coarser/fewer allocations transactions under the
> zone->lock.
>
> 1a. Even if no full free blocks are available (memory pressure or
> small zone), with splitting available at the PCP level means the
> PCP can still grab chunks larger than the requested order from the
> zone->lock freelists, and dole them out on its own time.
>
> 2. The pages free back to where the allocations happen, increasing the
> odds of reuse and reducing the chances of zone->lock slowpaths.
>
> 3. The page buddies come back into one place, allowing upfront merging
> under the local pcp->lock. This makes coarser/fewer freeing
> transactions under the zone->lock.

I wonder if we could go more radical by moving buddy allocator out of
zone->lock completely to PCP lock. If one PCP runs out of free pages,
it can steal another PCP's whole pageblock. I probably should do some
literature investigation on this. Some research must have been done
on this.

> The big concern is fragmentation. Movable allocations tend to be a mix
> of short-lived anon and long-lived file cache pages. By the time the
> PCP needs to drain due to thresholds or pressure, the blocks might not
> be fully re-assembled yet. To prevent gobbling up and fragmenting ever
> more blocks, partial blocks are remembered on drain and their pages
> queued last on the zone freelist. When a PCP refills, it first tries
> to recover any such fragment blocks.
>
> On small or pressured machines, the PCP degrades to its previous
> behavior. If a whole block doesn't fit the pcp->high limit, or a whole
> block isn't available, the refill grabs smaller chunks that aren't
> marked for ownership. The free side will use the local PCP as before.
>
> I still need to run broader benchmarks, but I've been consistently
> seeing a 3-4% reduction in %sys time for simple kernel builds on my
> 32-way, 32G RAM test machine.
>
> A synthetic test on the same machine that allocates on many CPUs and
> frees on just a few sees a consistent 1% increase in throughput.
>
> I would expect those numbers to increase with higher concurrency and
> larger memory volumes, but verifying that is TBD.
>
> Sending an RFC to get an early gauge on direction.

Thank you for sending this out. :)

> Based on 0257f64bdac7fdca30fa3cae0df8b9ecbec7733a.
>
> include/linux/mmzone.h | 38 ++-
> include/linux/page-flags.h | 9 +
> mm/debug.c | 1 +
> mm/internal.h | 17 +
> mm/mm_init.c | 25 +-
> mm/page_alloc.c | 784 +++++++++++++++++++++++++++++++------------
> mm/sparse.c | 3 +-
> 7 files changed, 622 insertions(+), 255 deletions(-)

--
Best Regards,
Yan, Zi

On Fri, Apr 03, 2026 at 10:27:36PM -0400, Zi Yan wrote:
> On 3 Apr 2026, at 15:40, Johannes Weiner wrote:
> > this is an RFC for making the page allocator scale better with higher
> > thread counts and larger memory quantities.
> >
> > In Meta production, we're seeing increasing zone->lock contention that
> > was traced back to a few different paths. A prominent one is the
> > userspace allocator, jemalloc. Allocations happen from page faults on
> > all CPUs running the workload. Frees are cached for reuse, but the
> > caches are periodically purged back to the kernel from a handful of
> > purger threads. This breaks affinity between allocations and frees:
> > Both sides use their own PCPs - one side depletes them, the other one
> > overfills them. Both sides routinely hit the zone->locked slowpath.
> >
> > My understanding is that tcmalloc has a similar architecture.
> >
> > Another contributor to contention is process exits, where large
> > numbers of pages are freed at once. The current PCP can only reduce
> > lock time when pages are reused. Reuse is unlikely because it's an
> > avalanche of free pages on a CPU busy walking page tables. Every time
> > the PCP overflows, the drain acquires the zone->lock and frees pages
> > one by one, trying to merge buddies together.
>
> IIUC, zone->lock held time is mostly spent on free page merging.
> Have you tried to let PCP do the free page merging before holding
> zone->lock and returning free pages to buddy? That is a much smaller
> change than what you proposed. This method might not work if
> physically contiguous free pages are allocated by separate CPUs,
> so that PCP merging cannot be done. But this might be rare?

On my 32G system, pcp->high_min for zone Normal is 988. That's one
block and a half. The rmqueue_smallest policy means the next CPU will
prefer the remainder of that partial block. So if there is
concurrency, every other block is shared. Not exactly uncommon. The
effect lessens the larger the machine is, of course.

But let's assume it's not an issue. How do you know you can safely
merge with a buddy pfn? You need to establish that it's on that same
PCP's list. Short of *scanning* the list, it seems something like
PagePCPBuddy() and page->pcp_cpu is inevitably needed. But of course a
per-page cpu field is tough to come by.

So the block ownership is more natural, and then you might as well use
that for affinity routing to increase the odds of merges.

IOW, I'm having a hard time seeing what could be taken away and still
have it work.

> > The idea proposed here is this: instead of single pages, make the PCP
> > grab entire pageblocks, split them outside the zone->lock. That CPU
> > then takes ownership of the block, and all frees route back to that
> > PCP instead of the freeing CPU's local one.
>
> This is basically distributed buddy allocators, right? Instead of
> relying on a single zone->lock, PCP locks are used. The worst case
> it can face is that physically contiguous free pages are allocated
> across all CPUs, so that all CPUs are competing a single PCP lock.

The worst case is one CPU allocating for everybody else in the system,
so that all freers route to that PCP.

I've played with microbenchmarks to provoke this, but it looks mostly
neutral over baseline, at least at the scale of this machine.

In this scenario, baseline will have the affinity mismatch problem:
the allocating CPU routinely hits zone->lock to refill, and the
freeing CPUs routinely hit zone->lock to drain and merge.

In the new scheme, they would hit the pcp->lock instead of the
zone->lock. So not necessarily an improvement in lock breaking. BUT
because freers refill the allocator's cache, merging is deferred;
that's a net reduction of work performed under the contended lock.

> It seems that you have not hit this. So I wonder if what I proposed
> above might work as a simpler approach. Let me know if I miss anything.
>
> I wonder how this distributed buddy allocators would work if anyone
> wants to allocate >pageblock free pages, like alloc_contig_range().
> Multiple PCP locks need to be taken one by one. Maybe it is better
> than taking and dropping zone->lock repeatedly. Have you benchmarked
> alloc_contig_range(), like hugetlb allocation?

I didn't change that aspect.

The PCPs are still the same size, and PCP pages are still skipped by
the isolation code.

IOW it's not a purely distributed buddy allocator. It's still just a
per-cpu cache of limited size. The only thing I'm doing is provide a
mechanism for splitting and pre-merging at the cache level, and
setting up affinity/routing rules to increase the chances of
success. But the impact on alloc_contig should be the same.

> > This has several benefits:
> >
> > 1. It's right away coarser/fewer allocations transactions under the
> > zone->lock.
> >
> > 1a. Even if no full free blocks are available (memory pressure or
> > small zone), with splitting available at the PCP level means the
> > PCP can still grab chunks larger than the requested order from the
> > zone->lock freelists, and dole them out on its own time.
> >
> > 2. The pages free back to where the allocations happen, increasing the
> > odds of reuse and reducing the chances of zone->lock slowpaths.
> >
> > 3. The page buddies come back into one place, allowing upfront merging
> > under the local pcp->lock. This makes coarser/fewer freeing
> > transactions under the zone->lock.
>
> I wonder if we could go more radical by moving buddy allocator out of
> zone->lock completely to PCP lock. If one PCP runs out of free pages,
> it can steal another PCP's whole pageblock. I probably should do some
> literature investigation on this. Some research must have been done
> on this.

This is an interesting idea. Make the zone buddy a pure block economy
and remove all buddy code from it. Slowpath allocs and frees would
always be in whole blocks.

You'd have to come up with a natural stealing order. If one CPU needs
something it doesn't have, which CPUs, and which order, do you look at
for stealing.

I think you'd still have to route back frees to the nominal owner of
the block, or stealing could scatter pages all over the place and we'd
never be able to merge them back up.

I think you'd also need to pull accounting (NR_FREE_PAGES) to the
per-cpu level, and inform compaction/isolation to deal with these
pages, since the majority default is now distributed.

But the scenario where one CPU needs what another one has is an
interesting one. I didn't invent anything new for this for now, but
rather rely on how we have been handling this through the zone
freelists. But I do think it's a little silly: right now, if a CPU
needs something another CPU might have, we ask EVERY CPU in the system
to drain their cache into the shared pool - simultaneously - running
the full buddy merge algorithm on everything that comes in. The CPU
grabs a small handful of these pages, most likely having to split
again. All other CPUs are now cache cold on the next request.

On 6 Apr 2026, at 11:24, Johannes Weiner wrote:

> On Fri, Apr 03, 2026 at 10:27:36PM -0400, Zi Yan wrote:
>> On 3 Apr 2026, at 15:40, Johannes Weiner wrote:
>>> this is an RFC for making the page allocator scale better with higher
>>> thread counts and larger memory quantities.
>>>
>>> In Meta production, we're seeing increasing zone->lock contention that
>>> was traced back to a few different paths. A prominent one is the
>>> userspace allocator, jemalloc. Allocations happen from page faults on
>>> all CPUs running the workload. Frees are cached for reuse, but the
>>> caches are periodically purged back to the kernel from a handful of
>>> purger threads. This breaks affinity between allocations and frees:
>>> Both sides use their own PCPs - one side depletes them, the other one
>>> overfills them. Both sides routinely hit the zone->locked slowpath.
>>>
>>> My understanding is that tcmalloc has a similar architecture.
>>>
>>> Another contributor to contention is process exits, where large
>>> numbers of pages are freed at once. The current PCP can only reduce
>>> lock time when pages are reused. Reuse is unlikely because it's an
>>> avalanche of free pages on a CPU busy walking page tables. Every time
>>> the PCP overflows, the drain acquires the zone->lock and frees pages
>>> one by one, trying to merge buddies together.
>>
>> IIUC, zone->lock held time is mostly spent on free page merging.
>> Have you tried to let PCP do the free page merging before holding
>> zone->lock and returning free pages to buddy? That is a much smaller
>> change than what you proposed. This method might not work if
>> physically contiguous free pages are allocated by separate CPUs,
>> so that PCP merging cannot be done. But this might be rare?
>
> On my 32G system, pcp->high_min for zone Normal is 988. That's one
> block and a half. The rmqueue_smallest policy means the next CPU will
> prefer the remainder of that partial block. So if there is
> concurrency, every other block is shared. Not exactly uncommon. The
> effect lessens the larger the machine is, of course.
>
> But let's assume it's not an issue. How do you know you can safely
> merge with a buddy pfn? You need to establish that it's on that same
> PCP's list. Short of *scanning* the list, it seems something like
> PagePCPBuddy() and page->pcp_cpu is inevitably needed. But of course a
> per-page cpu field is tough to come by.
>
> So the block ownership is more natural, and then you might as well use
> that for affinity routing to increase the odds of merges.
>
> IOW, I'm having a hard time seeing what could be taken away and still
> have it work.

You are right. I was assuming that pages that can be merged are freed
via the same CPU. That rarely happens.

>>> The idea proposed here is this: instead of single pages, make the PCP
>>> grab entire pageblocks, split them outside the zone->lock. That CPU
>>> then takes ownership of the block, and all frees route back to that
>>> PCP instead of the freeing CPU's local one.
>>
>> This is basically distributed buddy allocators, right? Instead of
>> relying on a single zone->lock, PCP locks are used. The worst case
>> it can face is that physically contiguous free pages are allocated
>> across all CPUs, so that all CPUs are competing a single PCP lock.
>
> The worst case is one CPU allocating for everybody else in the system,
> so that all freers route to that PCP.
>
> I've played with microbenchmarks to provoke this, but it looks mostly
> neutral over baseline, at least at the scale of this machine.
>
> In this scenario, baseline will have the affinity mismatch problem:
> the allocating CPU routinely hits zone->lock to refill, and the
> freeing CPUs routinely hit zone->lock to drain and merge.
>
> In the new scheme, they would hit the pcp->lock instead of the
> zone->lock. So not necessarily an improvement in lock breaking. BUT
> because freers refill the allocator's cache, merging is deferred;
> that's a net reduction of work performed under the contended lock.

This makes sense to me.

>> It seems that you have not hit this. So I wonder if what I proposed
>> above might work as a simpler approach. Let me know if I miss anything.
>>
>> I wonder how this distributed buddy allocators would work if anyone
>> wants to allocate >pageblock free pages, like alloc_contig_range().
>> Multiple PCP locks need to be taken one by one. Maybe it is better
>> than taking and dropping zone->lock repeatedly. Have you benchmarked
>> alloc_contig_range(), like hugetlb allocation?
>
> I didn't change that aspect.
>
> The PCPs are still the same size, and PCP pages are still skipped by
> the isolation code.
>
> IOW it's not a purely distributed buddy allocator. It's still just a
> per-cpu cache of limited size. The only thing I'm doing is provide a
> mechanism for splitting and pre-merging at the cache level, and
> setting up affinity/routing rules to increase the chances of
> success. But the impact on alloc_contig should be the same.

Got it. Thanks for the explanation.

>>> This has several benefits:
>>>
>>> 1. It's right away coarser/fewer allocations transactions under the
>>> zone->lock.
>>>
>>> 1a. Even if no full free blocks are available (memory pressure or
>>> small zone), with splitting available at the PCP level means the
>>> PCP can still grab chunks larger than the requested order from the
>>> zone->lock freelists, and dole them out on its own time.
>>>
>>> 2. The pages free back to where the allocations happen, increasing the
>>> odds of reuse and reducing the chances of zone->lock slowpaths.
>>>
>>> 3. The page buddies come back into one place, allowing upfront merging
>>> under the local pcp->lock. This makes coarser/fewer freeing
>>> transactions under the zone->lock.
>>
>> I wonder if we could go more radical by moving buddy allocator out of
>> zone->lock completely to PCP lock. If one PCP runs out of free pages,
>> it can steal another PCP's whole pageblock. I probably should do some
>> literature investigation on this. Some research must have been done
>> on this.
>
> This is an interesting idea. Make the zone buddy a pure block economy
> and remove all buddy code from it. Slowpath allocs and frees would
> always be in whole blocks.
>
> You'd have to come up with a natural stealing order. If one CPU needs
> something it doesn't have, which CPUs, and which order, do you look at
> for stealing.

One naive idea is to make zone buddy keep track of PCP free lists for
stealing.

> I think you'd still have to route back frees to the nominal owner of
> the block, or stealing could scatter pages all over the place and we'd
> never be able to merge them back up.

Basically, we want to keep free pages to be merged as much as possible.
Something like free page compaction across all PCPs.

> I think you'd also need to pull accounting (NR_FREE_PAGES) to the
> per-cpu level, and inform compaction/isolation to deal with these
> pages, since the majority default is now distributed.
>
> But the scenario where one CPU needs what another one has is an
> interesting one. I didn't invent anything new for this for now, but
> rather rely on how we have been handling this through the zone
> freelists. But I do think it's a little silly: right now, if a CPU
> needs something another CPU might have, we ask EVERY CPU in the system
> to drain their cache into the shared pool - simultaneously - running
> the full buddy merge algorithm on everything that comes in. The CPU
> grabs a small handful of these pages, most likely having to split
> again. All other CPUs are now cache cold on the next request.

Yes, a better way might be that when a CPU wants something, it should
be able to ask the other CPUs to drain the minimal amount of free
pages. But I do not have a good idea on how to do that yet.

It sounds to me that your current approach is a good first step towards
distributed buddy allocator. I will check the code and think about it
more and ask questions later.

Thank you for the explanation.

Best Regards,
Yan, Zi