While testing workloads with high sustained memory pressure on large machines
(1TB memory, 316 CPUs), we saw an unexpectedly high number of softlockups.
Further investigation showed that the lock in free_pcppages_bulk was being held
for a long time, even being held while 2k+ pages were being freed [1].

This causes starvation in other processes for both the pcp and zone locks,
which can lead to softlockups that cause the system to stall [2].

[ 4512.591979] rcu: INFO: rcu_sched self-detected stall on CPU
[ 4512.604370] rcu:     20-....: (9312 ticks this GP) idle=a654/1/0x4000000000000000 softirq=309340/309344 fqs=5426
[ 4512.626401] rcu:                hardirqs   softirqs   csw/system
[ 4512.638793] rcu:        number:        0        145            0
[ 4512.651177] rcu:       cputime:       30      10410          174   ==> 10558(ms)
[ 4512.666657] rcu:     (t=21077 jiffies g=783665 q=1242213 ncpus=316)

And here is the trace that accompanies it:

[ 4512.666815] RIP: 0010:free_unref_folios+0x47d/0xd80
[ 4512.666818] Code: 00 00 31 ff 40 80 ce 01 41 88 76 18 e9 a8 fe ff ff 40 84 ff 0f 84 d6 00 00 00 39 f0 0f 4c f0 4c 89 ff 4c 89 f2 e8 13 f2 fe ff <49> f7 87 88 05 00 00 04 00 00 00 0f 84 00 ff ff ff 49 8b 47 20 49
[ 4512.666820] RSP: 0018:ffffc900a62f3878 EFLAGS: 00000206
[ 4512.666822] RAX: 000000000005ae80 RBX: 000000000000087a RCX: 0000000000000001
[ 4512.666824] RDX: 000000000000007d RSI: 0000000000000282 RDI: ffff89404c8ba310
[ 4512.666825] RBP: 0000000000000001 R08: ffff89404c8b9d80 R09: 0000000000000001
[ 4512.666826] R10: 0000000000000010 R11: 00000000000130de R12: ffff89404c8b9d80
[ 4512.666827] R13: ffffea01cf3c0000 R14: ffff893d3ac5aec0 R15: ffff89404c8b9d80
[ 4512.666833]  ? free_unref_folios+0x47d/0xd80
[ 4512.666836]  free_pages_and_swap_cache+0xcd/0x1a0
[ 4512.666847]  tlb_finish_mmu+0x11c/0x350
[ 4512.666850]  vms_clear_ptes+0xf9/0x120
[ 4512.666855]  __mmap_region+0x29a/0xc00
[ 4512.666867]  do_mmap+0x34e/0x910
[ 4512.666873]  vm_mmap_pgoff+0xbb/0x200
[ 4512.666877]  ? hrtimer_interrupt+0x337/0x5c0
[ 4512.666879]  ? sched_clock+0x5/0x10
[ 4512.666882]  ? sched_clock_cpu+0xc/0x170
[ 4512.666885]  ? irqtime_account_irq+0x2b/0xa0
[ 4512.666888]  do_syscall_64+0x68/0x130
[ 4512.666892]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 4512.666896] RIP: 0033:0x7f1afe9257e2

To prevent starvation on both the pcp and zone locks, batch the freeing of
pages using pcp->batch. Because free_pcppages_bulk is called with both the pcp
and zone locks held, relinquishing and reacquiring the locks is only effective
when both of them are broken together. Thus, instead of modifying
free_pcppages_bulk to break both locks, batch the freeing from its callers
instead.

In our fleet, we have seen that performing batched page freeing has led to
significantly lower rates of softlockups, while incurring relatively small
regressions (relative to the workload and relative to the variation).
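To illustrate the idea, here is a minimal user-space sketch. It is not the
kernel code in this series: fake_free_bulk, drain_in_batches, and PCP_BATCH
are made-up stand-ins, and a single pthread mutex stands in for the pcp and
zone locks. It only shows the caller-side pattern the series describes: cap
each bulk free at a batch-sized chunk and drop the lock between chunks so
that waiters can make progress.

/*
 * Illustrative user-space sketch of caller-side batched freeing.
 * The mutex stands in for the pcp/zone locks; fake_free_bulk stands in
 * for free_pcppages_bulk. Names and numbers are hypothetical, not taken
 * from the actual patches.
 */
#include <pthread.h>
#include <stdio.h>

#define PCP_BATCH 63            /* stand-in for pcp->batch */

static pthread_mutex_t pcp_lock = PTHREAD_MUTEX_INITIALIZER;
static int pcp_count = 2500;    /* e.g. the 2k+ pages observed above */

/* Pretend to free 'count' pages; caller must hold pcp_lock. */
static void fake_free_bulk(int count)
{
        pcp_count -= count;
}

/*
 * Instead of one call that frees everything while the lock is held
 * (hold time grows with pcp_count), the caller loops, freeing at most
 * PCP_BATCH pages per lock acquisition, so each hold is bounded and
 * other contenders can slip in between batches.
 */
static void drain_in_batches(void)
{
        for (;;) {
                int todo;

                pthread_mutex_lock(&pcp_lock);
                todo = pcp_count < PCP_BATCH ? pcp_count : PCP_BATCH;
                if (todo)
                        fake_free_bulk(todo);
                pthread_mutex_unlock(&pcp_lock);

                if (!todo)
                        break;  /* nothing left to free */
                /* Lock released here: waiters get a chance to run. */
        }
}

int main(void)
{
        drain_in_batches();
        printf("remaining pages: %d\n", pcp_count);
        return 0;
}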
The following are a few synthetic benchmarks, made on a machine with 250G RAM,
179G swap, and 176 CPUs.

stress-ng --vm 50 --vm-bytes 5G -M -t 100
+----------------------+---------------+----------+
| Metric               | Variation (%) | Delta(%) |
+----------------------+---------------+----------+
| bogo ops             |        0.0120 |  -0.0011 |
| bogo ops/s (real)    |        0.0109 |  -0.0091 |
| bogo ops/s (usr+sys) |        0.5560 |  +0.1049 |
+----------------------+---------------+----------+

stress-ng --vm 10 --vm-bytes 30G -M -t 100
+----------------------+---------------+----------+
| Metric               | Variation (%) | Delta(%) |
+----------------------+---------------+----------+
| bogo ops             |        1.8530 |  +0.4728 |
| bogo ops/s (real)    |        1.8604 |  +0.2029 |
| bogo ops/s (usr+sys) |        1.6054 |  -0.6381 |
+----------------------+---------------+----------+

Patch 1 simplifies the return semantics of decay_pcp_high and
refresh_cpu_vm_stats, which makes the change in patch 3 more semantically
accurate. Patches 2, 3, and 4 each address one caller of free_pcppages_bulk,
and ensure that large values passed to it are batched.

This series is a follow-up to [2], where I attempted to solve the same problem
by relinquishing only the zone lock within free_pcppages_bulk. Because this
approach is different in nature, I decided not to send this as a v2, but as a
separate series altogether.

[1] For instance, during *just* the boot of said large machine, there were
    2092 instances of free_pcppages_bulk being called with count > 1000.
[2] https://lore.kernel.org/all/20250818185804.21044-1-joshua.hahnjy@gmail.com/

Joshua Hahn (4):
  mm/page_alloc/vmstat: Simplify refresh_cpu_vm_stats change detection
  mm/page_alloc: Perform appropriate batching in drain_pages_zone
  mm/page_alloc: Batch page freeing in decay_pcp_high
  mm/page_alloc: Batch page freeing in free_frozen_page_commit

 include/linux/gfp.h |  2 +-
 mm/page_alloc.c     | 65 ++++++++++++++++++++++++++++++++-------------
 mm/vmstat.c         | 26 +++++++++---------
 3 files changed, 61 insertions(+), 32 deletions(-)

base-commit: 097a6c336d0080725c626fda118ecfec448acd0f
--
2.47.3
On Fri, 19 Sep 2025 12:52:18 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

> While testing workloads with high sustained memory pressure on large machines
> (1TB memory, 316 CPUs), we saw an unexpectedly high number of softlockups.
> Further investigation showed that the lock in free_pcppages_bulk was being held
> for a long time, even being held while 2k+ pages were being freed [1].

What problems are caused by this, apart from a warning which can
presumably be suppressed in some fashion?

> This causes starvation in other processes for both the pcp and zone locks,
> which can lead to softlockups that cause the system to stall [2].

[2] doesn't describe such stalls.

>
> ...
>
> In our fleet, we have seen that performing batched page freeing has led to
> significantly lower rates of softlockups, while incurring relatively small
> regressions (relative to the workload and relative to the variation).

"our" == Meta?
Hello Andrew,

Thank you for your review, as always!

On Fri, 19 Sep 2025 13:06:44 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> On Fri, 19 Sep 2025 12:52:18 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
>
> > While testing workloads with high sustained memory pressure on large machines
> > (1TB memory, 316 CPUs), we saw an unexpectedly high number of softlockups.
> > Further investigation showed that the lock in free_pcppages_bulk was being held
> > for a long time, even being held while 2k+ pages were being freed [1].
>
> What problems are caused by this, apart from a warning which can
> presumably be suppressed in some fashion?

There are softlockup panics that we (Meta) saw in the fleet previously.
For some reason I can't get them to reproduce again, but let me try to find
a way to trigger these softlockups again.

> > This causes starvation in other processes for both the pcp and zone locks,
> > which can lead to softlockups that cause the system to stall [2].
>
> [2] doesn't describe such stalls.

You're absolutely right -- I was revising this cover letter a bit and was
going to link the below message separately, but decided to put it in the
message and forgot to remove the footnote. The message below isn't a
softlockup, but let me try and get one to add to the cover letter in a
reply to this.

> >
> > ...
> >
> > In our fleet, we have seen that performing batched page freeing has led to
> > significantly lower rates of softlockups, while incurring relatively small
> > regressions (relative to the workload and relative to the variation).
>
> "our" == Meta?

Yes -- sorry, I think I made this same mistake in the original version as
well. I'll be more careful about this!

Thank you again for your feedback Andrew, I hope you have a great day!
Joshua