While testing workloads with high sustained memory pressure on large machines
(1TB memory, 316 CPUs), we saw an unexpectedly high number of softlockups.
Further investigation showed that the lock in free_pcppages_bulk was being held
for a long time, even being held while 2k+ pages were being freed [1].

This causes starvation in other processes for both the pcp and zone locks,
which can lead to softlockups that cause the system to stall [2].

[ 4512.591979] rcu: INFO: rcu_sched self-detected stall on CPU
[ 4512.604370] rcu:     20-....: (9312 ticks this GP) idle=a654/1/0x4000000000000000 softirq=309340/309344 fqs=5426
[ 4512.626401] rcu:                hardirqs   softirqs   csw/system
[ 4512.638793] rcu:        number:        0        145            0
[ 4512.651177] rcu:       cputime:       30      10410          174   ==> 10558(ms)
[ 4512.666657] rcu:     (t=21077 jiffies g=783665 q=1242213 ncpus=316)

And here is the trace that accompanies it:

[ 4512.666815] RIP: 0010:free_unref_folios+0x47d/0xd80
[ 4512.666818] Code: 00 00 31 ff 40 80 ce 01 41 88 76 18 e9 a8 fe ff ff 40 84 ff 0f 84 d6 00 00 00 39 f0 0f 4c f0 4c 89 ff 4c 89 f2 e8 13 f2 fe ff <49> f7 87 88 05 00 00 04 00 00 00 0f 84 00 ff ff ff 49 8b 47 20 49
[ 4512.666820] RSP: 0018:ffffc900a62f3878 EFLAGS: 00000206
[ 4512.666822] RAX: 000000000005ae80 RBX: 000000000000087a RCX: 0000000000000001
[ 4512.666824] RDX: 000000000000007d RSI: 0000000000000282 RDI: ffff89404c8ba310
[ 4512.666825] RBP: 0000000000000001 R08: ffff89404c8b9d80 R09: 0000000000000001
[ 4512.666826] R10: 0000000000000010 R11: 00000000000130de R12: ffff89404c8b9d80
[ 4512.666827] R13: ffffea01cf3c0000 R14: ffff893d3ac5aec0 R15: ffff89404c8b9d80
[ 4512.666833]  ? free_unref_folios+0x47d/0xd80
[ 4512.666836]  free_pages_and_swap_cache+0xcd/0x1a0
[ 4512.666847]  tlb_finish_mmu+0x11c/0x350
[ 4512.666850]  vms_clear_ptes+0xf9/0x120
[ 4512.666855]  __mmap_region+0x29a/0xc00
[ 4512.666867]  do_mmap+0x34e/0x910
[ 4512.666873]  vm_mmap_pgoff+0xbb/0x200
[ 4512.666877]  ? hrtimer_interrupt+0x337/0x5c0
[ 4512.666879]  ? sched_clock+0x5/0x10
[ 4512.666882]  ? sched_clock_cpu+0xc/0x170
[ 4512.666885]  ? irqtime_account_irq+0x2b/0xa0
[ 4512.666888]  do_syscall_64+0x68/0x130
[ 4512.666892]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 4512.666896] RIP: 0033:0x7f1afe9257e2

To prevent starvation on both the pcp and zone locks, batch the freeing of
pages using pcp->batch. Because free_pcppages_bulk is called with both the pcp
and zone locks held, relinquishing and reacquiring the locks is only effective
when both of them are broken together. Thus, instead of modifying
free_pcppages_bulk to break both locks, batch the freeing from its callers
instead.

In our fleet, we have seen that performing batched page freeing has led to
significantly lower rates of softlockups, while incurring relatively small
regressions (relative to the workload and relative to the variation).
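To illustrate the idea, here is a minimal user-space sketch. It is not the
kernel code in this series: fake_free_bulk, drain_in_batches, and PCP_BATCH
are made-up stand-ins, and a single pthread mutex stands in for the pcp and
zone locks. It only shows the caller-side pattern the series describes: cap
each bulk free at a batch-sized chunk and drop the lock between chunks so
that waiters can make progress.

/*
 * Illustrative user-space sketch of caller-side batched freeing.
 * The mutex stands in for the pcp/zone locks; fake_free_bulk stands in
 * for free_pcppages_bulk. Names and numbers are hypothetical, not taken
 * from the actual patches.
 */
#include <pthread.h>
#include <stdio.h>

#define PCP_BATCH 63            /* stand-in for pcp->batch */

static pthread_mutex_t pcp_lock = PTHREAD_MUTEX_INITIALIZER;
static int pcp_count = 2500;    /* e.g. the 2k+ pages observed above */

/* Pretend to free 'count' pages; caller must hold pcp_lock. */
static void fake_free_bulk(int count)
{
        pcp_count -= count;
}

/*
 * Instead of one call that frees everything while the lock is held
 * (hold time grows with pcp_count), the caller loops, freeing at most
 * PCP_BATCH pages per lock acquisition, so each hold is bounded and
 * other contenders can slip in between batches.
 */
static void drain_in_batches(void)
{
        for (;;) {
                int todo;

                pthread_mutex_lock(&pcp_lock);
                todo = pcp_count < PCP_BATCH ? pcp_count : PCP_BATCH;
                if (todo)
                        fake_free_bulk(todo);
                pthread_mutex_unlock(&pcp_lock);

                if (!todo)
                        break;  /* nothing left to free */
                /* Lock released here: waiters get a chance to run. */
        }
}

int main(void)
{
        drain_in_batches();
        printf("remaining pages: %d\n", pcp_count);
        return 0;
}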
The following are a few synthetic benchmarks, made on a machine with 250G RAM,
179G swap, and 176 CPUs.

stress-ng --vm 50 --vm-bytes 5G -M -t 100
+----------------------+---------------+----------+
| Metric               | Variation (%) | Delta(%) |
+----------------------+---------------+----------+
| bogo ops             |        0.0120 |  -0.0011 |
| bogo ops/s (real)    |        0.0109 |  -0.0091 |
| bogo ops/s (usr+sys) |        0.5560 |  +0.1049 |
+----------------------+---------------+----------+

stress-ng --vm 10 --vm-bytes 30G -M -t 100
+----------------------+---------------+----------+
| Metric               | Variation (%) | Delta(%) |
+----------------------+---------------+----------+
| bogo ops             |        1.8530 |  +0.4728 |
| bogo ops/s (real)    |        1.8604 |  +0.2029 |
| bogo ops/s (usr+sys) |        1.6054 |  -0.6381 |
+----------------------+---------------+----------+

Patch 1 simplifies the return semantics of decay_pcp_high and
refresh_cpu_vm_stats, which makes the change in patch 3 more semantically
accurate. Patches 2, 3, and 4 each address one caller of free_pcppages_bulk,
and ensure that large values passed to it are batched.

This series is a follow-up to [2], where I attempted to solve the same problem
by relinquishing only the zone lock within free_pcppages_bulk. Because this
approach is different in nature, I decided not to send this as a v2, but as a
separate series altogether.

[1] For instance, during *just* the boot of said large machine, there were
    2092 instances of free_pcppages_bulk being called with count > 1000.
[2] https://lore.kernel.org/all/20250818185804.21044-1-joshua.hahnjy@gmail.com/

Joshua Hahn (4):
  mm/page_alloc/vmstat: Simplify refresh_cpu_vm_stats change detection
  mm/page_alloc: Perform appropriate batching in drain_pages_zone
  mm/page_alloc: Batch page freeing in decay_pcp_high
  mm/page_alloc: Batch page freeing in free_frozen_page_commit

 include/linux/gfp.h |  2 +-
 mm/page_alloc.c     | 65 ++++++++++++++++++++++++++++++++-------------
 mm/vmstat.c         | 26 +++++++++---------
 3 files changed, 61 insertions(+), 32 deletions(-)

base-commit: 097a6c336d0080725c626fda118ecfec448acd0f
--
2.47.3
On Fri, 19 Sep 2025 12:52:18 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

> While testing workloads with high sustained memory pressure on large machines
> (1TB memory, 316 CPUs), we saw an unexpectedly high number of softlockups.
> Further investigation showed that the lock in free_pcppages_bulk was being held
> for a long time, even being held while 2k+ pages were being freed [1].

What problems are caused by this, apart from a warning which can
presumably be suppressed in some fashion?

> This causes starvation in other processes for both the pcp and zone locks,
> which can lead to softlockups that cause the system to stall [2].

[2] doesn't describe such stalls.

>
> ...
>
> In our fleet, we have seen that performing batched page freeing has led to
> significantly lower rates of softlockups, while incurring relatively small
> regressions (relative to the workload and relative to the variation).

"our" == Meta?
Hello Andrew,

Thank you for your review, as always!

On Fri, 19 Sep 2025 13:06:44 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> On Fri, 19 Sep 2025 12:52:18 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
>
> > While testing workloads with high sustained memory pressure on large machines
> > (1TB memory, 316 CPUs), we saw an unexpectedly high number of softlockups.
> > Further investigation showed that the lock in free_pcppages_bulk was being held
> > for a long time, even being held while 2k+ pages were being freed [1].
>
> What problems are caused by this, apart from a warning which can
> presumably be suppressed in some fashion?

There are softlockup panics that we (Meta) saw in the fleet previously.
For some reason I can't get them to reproduce again, but let me try to find
a way to trigger these softlockups again.

> > This causes starvation in other processes for both the pcp and zone locks,
> > which can lead to softlockups that cause the system to stall [2].
>
> [2] doesn't describe such stalls.

You're absolutely right -- I was revising this cover letter a bit and was
going to link the below message separately, but decided to put it in the
message and forgot to remove the footnote. The message below isn't a
softlockup, but let me try and get one to add to the cover letter in a
reply to this.

> >
> > ...
> >
> > In our fleet, we have seen that performing batched page freeing has led to
> > significantly lower rates of softlockups, while incurring relatively small
> > regressions (relative to the workload and relative to the variation).
>
> "our" == Meta?

Yes -- sorry, I think I made this same mistake in the original version as
well. I'll be more careful about this!

Thank you again for your feedback Andrew, I hope you have a great day!
Joshua