vmstat_update uses round_jiffies_relative() when re-queuing itself,
which aligns all CPUs' timers to the same second boundary. When many
CPUs have pending PCP pages to drain, they all call decay_pcp_high() ->
free_pcppages_bulk() simultaneously, serializing on zone->lock and
hitting contention.

Introduce vmstat_spread_delay() which distributes each CPU's
vmstat_update evenly across the stat interval instead of aligning them.

This does not increase the number of timer interrupts — each CPU still
fires once per interval. The timers are simply staggered rather than
aligned. Additionally, vmstat_work is DEFERRABLE_WORK, so it does not
wake idle CPUs regardless of scheduling; the spread only affects CPUs
that are already active.

`perf lock contention` shows a 7.5x reduction in zone->lock contention
(872 -> 117 contentions, 199ms -> 81ms total wait) on a 72-CPU aarch64
system under memory pressure.
Tested on a 72-CPU aarch64 system using stress-ng --vm to generate
memory allocation bursts. Lock contention was measured with:

	perf lock contention -a -b -S free_pcppages_bulk

Results with KASAN enabled:

free_pcppages_bulk contention (KASAN):
+--------------+----------+----------+
| Metric       | No fix   | With fix |
+--------------+----------+----------+
| Contentions  | 872      | 117      |
| Total wait   | 199.43ms | 80.76ms  |
| Max wait     | 4.19ms   | 35.76ms  |
+--------------+----------+----------+

Results without KASAN:

free_pcppages_bulk contention (no KASAN):
+--------------+----------+----------+
| Metric       | No fix   | With fix |
+--------------+----------+----------+
| Contentions  | 240      | 133      |
| Total wait   | 34.01ms  | 24.61ms  |
| Max wait     | 965us    | 1.35ms   |
+--------------+----------+----------+
Signed-off-by: Breno Leitao <leitao@debian.org>
---
mm/vmstat.c | 25 ++++++++++++++++++++++++-
1 file changed, 24 insertions(+), 1 deletion(-)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 2370c6fb1fcd..2e94bd765606 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -2032,6 +2032,29 @@ static int vmstat_refresh(const struct ctl_table *table, int write,
 }
 #endif /* CONFIG_PROC_FS */
 
+/*
+ * Return a per-cpu delay that spreads vmstat_update work across the stat
+ * interval. Without this, round_jiffies_relative() aligns every CPU's
+ * timer to the same second boundary, causing a thundering-herd on
+ * zone->lock when multiple CPUs drain PCP pages simultaneously via
+ * decay_pcp_high() -> free_pcppages_bulk().
+ */
+static unsigned long vmstat_spread_delay(void)
+{
+	unsigned long interval = sysctl_stat_interval;
+	unsigned int nr_cpus = num_online_cpus();
+
+	if (nr_cpus <= 1)
+		return round_jiffies_relative(interval);
+
+	/*
+	 * Spread per-cpu vmstat work evenly across the interval. Don't
+	 * use round_jiffies_relative() here -- it would snap every CPU
+	 * back to the same second boundary, defeating the spread.
+	 */
+	return interval + (interval * (smp_processor_id() % nr_cpus)) / nr_cpus;
+}
+
 static void vmstat_update(struct work_struct *w)
 {
 	if (refresh_cpu_vm_stats(true)) {
@@ -2042,7 +2065,7 @@ static void vmstat_update(struct work_struct *w)
 		 */
 		queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
 				      this_cpu_ptr(&vmstat_work),
-				      round_jiffies_relative(sysctl_stat_interval));
+				      vmstat_spread_delay());
 	}
 }
---
base-commit: cf7c3c02fdd0dfccf4d6611714273dcb538af2cb
change-id: 20260401-vmstat-048e0feaf344
Best regards,
--
Breno Leitao <leitao@debian.org>
On 4/1/26 15:57, Breno Leitao wrote:
> vmstat_update uses round_jiffies_relative() when re-queuing itself,
> which aligns all CPUs' timers to the same second boundary. When many
> CPUs have pending PCP pages to drain, they all call decay_pcp_high() ->
> free_pcppages_bulk() simultaneously, serializing on zone->lock and
> hitting contention.
>
> Introduce vmstat_spread_delay() which distributes each CPU's
> vmstat_update evenly across the stat interval instead of aligning them.
>
> This does not increase the number of timer interrupts — each CPU still
> fires once per interval. The timers are simply staggered rather than
> aligned. Additionally, vmstat_work is DEFERRABLE_WORK, so it does not
> wake idle CPUs regardless of scheduling; the spread only affects CPUs
> that are already active
>
> `perf lock contention` shows 7.5x reduction in zone->lock contention
> (872 -> 117 contentions, 199ms -> 81ms total wait) on a 72-CPU aarch64
> system under memory pressure.
>
> Tested on a 72-CPU aarch64 system using stress-ng --vm to generate
> memory allocation bursts. Lock contention was measured with:
>
> perf lock contention -a -b -S free_pcppages_bulk
>
> Results with KASAN enabled:
>
> free_pcppages_bulk contention (KASAN):
> +--------------+----------+----------+
> | Metric | No fix | With fix |
> +--------------+----------+----------+
> | Contentions | 872 | 117 |
> | Total wait | 199.43ms | 80.76ms |
> | Max wait | 4.19ms | 35.76ms |
> +--------------+----------+----------+
>
> Results without KASAN:
>
> free_pcppages_bulk contention (no KASAN):
> +--------------+----------+----------+
> | Metric | No fix | With fix |
> +--------------+----------+----------+
> | Contentions | 240 | 133 |
> | Total wait | 34.01ms | 24.61ms |
> | Max wait | 965us | 1.35ms |
> +--------------+----------+----------+
>
> Signed-off-by: Breno Leitao <leitao@debian.org>
Cool!
I noticed __round_jiffies_relative() exists and the description looks like
it's meant for exactly this use case?
> ---
> mm/vmstat.c | 25 ++++++++++++++++++++++++-
> 1 file changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 2370c6fb1fcd..2e94bd765606 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -2032,6 +2032,29 @@ static int vmstat_refresh(const struct ctl_table *table, int write,
> }
> #endif /* CONFIG_PROC_FS */
>
> +/*
> + * Return a per-cpu delay that spreads vmstat_update work across the stat
> + * interval. Without this, round_jiffies_relative() aligns every CPU's
> + * timer to the same second boundary, causing a thundering-herd on
> + * zone->lock when multiple CPUs drain PCP pages simultaneously via
> + * decay_pcp_high() -> free_pcppages_bulk().
> + */
> +static unsigned long vmstat_spread_delay(void)
> +{
> + unsigned long interval = sysctl_stat_interval;
> + unsigned int nr_cpus = num_online_cpus();
> +
> + if (nr_cpus <= 1)
> + return round_jiffies_relative(interval);
> +
> + /*
> + * Spread per-cpu vmstat work evenly across the interval. Don't
> + * use round_jiffies_relative() here -- it would snap every CPU
> + * back to the same second boundary, defeating the spread.
> + */
> + return interval + (interval * (smp_processor_id() % nr_cpus)) / nr_cpus;
Hm doesn't this mean that lower id cpus will consistently fire in shorter
intervals and higher id in longer intervals? What we want is same interval
but differently offset, no?
> +}
> +
> static void vmstat_update(struct work_struct *w)
> {
> if (refresh_cpu_vm_stats(true)) {
> @@ -2042,7 +2065,7 @@ static void vmstat_update(struct work_struct *w)
> */
> queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
> this_cpu_ptr(&vmstat_work),
> - round_jiffies_relative(sysctl_stat_interval));
> + vmstat_spread_delay());
> }
> }
>
>
> ---
> base-commit: cf7c3c02fdd0dfccf4d6611714273dcb538af2cb
> change-id: 20260401-vmstat-048e0feaf344
>
> Best regards,
> --
> Breno Leitao <leitao@debian.org>
>
On Wed, 01 Apr 2026 06:57:50 -0700 Breno Leitao <leitao@debian.org> wrote:
> vmstat_update uses round_jiffies_relative() when re-queuing itself,
> which aligns all CPUs' timers to the same second boundary. When many
> CPUs have pending PCP pages to drain, they all call decay_pcp_high() ->
> free_pcppages_bulk() simultaneously, serializing on zone->lock and
> hitting contention.
>
> Introduce vmstat_spread_delay() which distributes each CPU's
> vmstat_update evenly across the stat interval instead of aligning them.
>
> This does not increase the number of timer interrupts — each CPU still
> fires once per interval. The timers are simply staggered rather than
> aligned. Additionally, vmstat_work is DEFERRABLE_WORK, so it does not
> wake idle CPUs regardless of scheduling; the spread only affects CPUs
> that are already active
>
> `perf lock contention` shows 7.5x reduction in zone->lock contention
> (872 -> 117 contentions, 199ms -> 81ms total wait) on a 72-CPU aarch64
> system under memory pressure.
>
> Tested on a 72-CPU aarch64 system using stress-ng --vm to generate
> memory allocation bursts. Lock contention was measured with:
>
> perf lock contention -a -b -S free_pcppages_bulk
>
> Results with KASAN enabled:
>
> free_pcppages_bulk contention (KASAN):
> +--------------+----------+----------+
> | Metric | No fix | With fix |
> +--------------+----------+----------+
> | Contentions | 872 | 117 |
> | Total wait | 199.43ms | 80.76ms |
> | Max wait | 4.19ms | 35.76ms |
> +--------------+----------+----------+
>
> Results without KASAN:
>
> free_pcppages_bulk contention (no KASAN):
> +--------------+----------+----------+
> | Metric | No fix | With fix |
> +--------------+----------+----------+
> | Contentions | 240 | 133 |
> | Total wait | 34.01ms | 24.61ms |
> | Max wait | 965us | 1.35ms |
> +--------------+----------+----------+
>
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
> mm/vmstat.c | 25 ++++++++++++++++++++++++-
> 1 file changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 2370c6fb1fcd..2e94bd765606 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -2032,6 +2032,29 @@ static int vmstat_refresh(const struct ctl_table *table, int write,
> }
> #endif /* CONFIG_PROC_FS */
>
> +/*
> + * Return a per-cpu delay that spreads vmstat_update work across the stat
> + * interval. Without this, round_jiffies_relative() aligns every CPU's
> + * timer to the same second boundary, causing a thundering-herd on
> + * zone->lock when multiple CPUs drain PCP pages simultaneously via
> + * decay_pcp_high() -> free_pcppages_bulk().
> + */
> +static unsigned long vmstat_spread_delay(void)
> +{
> + unsigned long interval = sysctl_stat_interval;
> + unsigned int nr_cpus = num_online_cpus();
> +
> + if (nr_cpus <= 1)
> + return round_jiffies_relative(interval);
> +
> + /*
> + * Spread per-cpu vmstat work evenly across the interval. Don't
> + * use round_jiffies_relative() here -- it would snap every CPU
> + * back to the same second boundary, defeating the spread.
> + */
> + return interval + (interval * (smp_processor_id() % nr_cpus)) / nr_cpus;
> +}
> +
> static void vmstat_update(struct work_struct *w)
> {
> if (refresh_cpu_vm_stats(true)) {
> @@ -2042,7 +2065,7 @@ static void vmstat_update(struct work_struct *w)
> */
> queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
> this_cpu_ptr(&vmstat_work),
> - round_jiffies_relative(sysctl_stat_interval));
> + vmstat_spread_delay());
This is awesome! Maybe this needs to be done to vmstat_shepherd() as well?
vmstat_shepherd() still queues work with delay 0 on all CPUs that
need_update() in its for_each_online_cpu() loop:
if (!delayed_work_pending(dw) && need_update(cpu))
queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
So when the shepherd fires, it kicks all dormant CPUs' vmstat workers
simultaneously.
Under sustained memory pressure on a large system, I think the shepherd
fires every sysctl_stat_interval and could re-trigger the same lock
contention?
> }
> }
>
>
> ---
> base-commit: cf7c3c02fdd0dfccf4d6611714273dcb538af2cb
> change-id: 20260401-vmstat-048e0feaf344
>
> Best regards,
> --
> Breno Leitao <leitao@debian.org>
>
>
On Wed, Apr 01, 2026 at 08:23:40AM -0700, Usama Arif wrote:
> On Wed, 01 Apr 2026 06:57:50 -0700 Breno Leitao <leitao@debian.org> wrote:
>
> > vmstat_update uses round_jiffies_relative() when re-queuing itself,
> > which aligns all CPUs' timers to the same second boundary. When many
> > CPUs have pending PCP pages to drain, they all call decay_pcp_high() ->
> > free_pcppages_bulk() simultaneously, serializing on zone->lock and
> > hitting contention.
> >
> > Introduce vmstat_spread_delay() which distributes each CPU's
> > vmstat_update evenly across the stat interval instead of aligning them.
> >
> > This does not increase the number of timer interrupts — each CPU still
> > fires once per interval. The timers are simply staggered rather than
> > aligned. Additionally, vmstat_work is DEFERRABLE_WORK, so it does not
> > wake idle CPUs regardless of scheduling; the spread only affects CPUs
> > that are already active
> >
> > `perf lock contention` shows 7.5x reduction in zone->lock contention
> > (872 -> 117 contentions, 199ms -> 81ms total wait) on a 72-CPU aarch64
> > system under memory pressure.
> >
> > Tested on a 72-CPU aarch64 system using stress-ng --vm to generate
> > memory allocation bursts. Lock contention was measured with:
> >
> > perf lock contention -a -b -S free_pcppages_bulk
> >
> > Results with KASAN enabled:
> >
> > free_pcppages_bulk contention (KASAN):
> > +--------------+----------+----------+
> > | Metric | No fix | With fix |
> > +--------------+----------+----------+
> > | Contentions | 872 | 117 |
> > | Total wait | 199.43ms | 80.76ms |
> > | Max wait | 4.19ms | 35.76ms |
> > +--------------+----------+----------+
> >
> > Results without KASAN:
> >
> > free_pcppages_bulk contention (no KASAN):
> > +--------------+----------+----------+
> > | Metric | No fix | With fix |
> > +--------------+----------+----------+
> > | Contentions | 240 | 133 |
> > | Total wait | 34.01ms | 24.61ms |
> > | Max wait | 965us | 1.35ms |
> > +--------------+----------+----------+
> >
> > Signed-off-by: Breno Leitao <leitao@debian.org>
> > ---
> > mm/vmstat.c | 25 ++++++++++++++++++++++++-
> > 1 file changed, 24 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 2370c6fb1fcd..2e94bd765606 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -2032,6 +2032,29 @@ static int vmstat_refresh(const struct ctl_table *table, int write,
> > }
> > #endif /* CONFIG_PROC_FS */
> >
> > +/*
> > + * Return a per-cpu delay that spreads vmstat_update work across the stat
> > + * interval. Without this, round_jiffies_relative() aligns every CPU's
> > + * timer to the same second boundary, causing a thundering-herd on
> > + * zone->lock when multiple CPUs drain PCP pages simultaneously via
> > + * decay_pcp_high() -> free_pcppages_bulk().
> > + */
> > +static unsigned long vmstat_spread_delay(void)
> > +{
> > + unsigned long interval = sysctl_stat_interval;
> > + unsigned int nr_cpus = num_online_cpus();
> > +
> > + if (nr_cpus <= 1)
> > + return round_jiffies_relative(interval);
> > +
> > + /*
> > + * Spread per-cpu vmstat work evenly across the interval. Don't
> > + * use round_jiffies_relative() here -- it would snap every CPU
> > + * back to the same second boundary, defeating the spread.
> > + */
> > + return interval + (interval * (smp_processor_id() % nr_cpus)) / nr_cpus;
> > +}
> > +
> > static void vmstat_update(struct work_struct *w)
> > {
> > if (refresh_cpu_vm_stats(true)) {
> > @@ -2042,7 +2065,7 @@ static void vmstat_update(struct work_struct *w)
> > */
> > queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
> > this_cpu_ptr(&vmstat_work),
> > - round_jiffies_relative(sysctl_stat_interval));
> > + vmstat_spread_delay());
>
> This is awesome! Maybe this needs to be done to vmstat_shepherd() as well?
>
> vmstat_shepherd() still queues work with delay 0 on all CPUs that
> need_update() in its for_each_online_cpu() loop:
>
> if (!delayed_work_pending(dw) && need_update(cpu))
> queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
>
> So when the shepherd fires, it kicks all dormant CPUs' vmstat workers
> simultaneously.
>
> Under sustained memory pressure on a large system, I think the shepherd
> fires every sysctl_stat_interval and could re-trigger the same lock
> contention?
Good point - incorporating similar spreading logic in vmstat_shepherd()
would indeed address the simultaneous queueing issue you've described.
Should I include this in a v2 of this patch, or would you prefer it as
a separate follow-up patch?
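Concretely, what I have in mind for the shepherd side is roughly the
sketch below (untested, illustrative only; the real vmstat_shepherd()
also re-arms itself, which is omitted here):

```c
/* Untested sketch: stagger the shepherd's kicks across the stat
 * interval instead of queueing every dormant CPU with delay 0.
 * The spread term mirrors vmstat_spread_delay()'s per-cpu offset.
 */
static void vmstat_shepherd(struct work_struct *w)
{
	int cpu;

	cpus_read_lock();
	for_each_online_cpu(cpu) {
		struct delayed_work *dw = &per_cpu(vmstat_work, cpu);

		if (!delayed_work_pending(dw) && need_update(cpu))
			queue_delayed_work_on(cpu, mm_percpu_wq, dw,
				(sysctl_stat_interval * cpu) / num_online_cpus());
	}
	cpus_read_unlock();

	/* re-arming of the shepherd work omitted for brevity */
}
```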
On 01/04/2026 18:43, Breno Leitao wrote:
> On Wed, Apr 01, 2026 at 08:23:40AM -0700, Usama Arif wrote:
>> On Wed, 01 Apr 2026 06:57:50 -0700 Breno Leitao <leitao@debian.org> wrote:
>>
>>> vmstat_update uses round_jiffies_relative() when re-queuing itself,
>>> which aligns all CPUs' timers to the same second boundary. When many
>>> CPUs have pending PCP pages to drain, they all call decay_pcp_high() ->
>>> free_pcppages_bulk() simultaneously, serializing on zone->lock and
>>> hitting contention.
>>>
>>> Introduce vmstat_spread_delay() which distributes each CPU's
>>> vmstat_update evenly across the stat interval instead of aligning them.
>>>
>>> This does not increase the number of timer interrupts — each CPU still
>>> fires once per interval. The timers are simply staggered rather than
>>> aligned. Additionally, vmstat_work is DEFERRABLE_WORK, so it does not
>>> wake idle CPUs regardless of scheduling; the spread only affects CPUs
>>> that are already active
>>>
>>> `perf lock contention` shows 7.5x reduction in zone->lock contention
>>> (872 -> 117 contentions, 199ms -> 81ms total wait) on a 72-CPU aarch64
>>> system under memory pressure.
>>>
>>> Tested on a 72-CPU aarch64 system using stress-ng --vm to generate
>>> memory allocation bursts. Lock contention was measured with:
>>>
>>> perf lock contention -a -b -S free_pcppages_bulk
>>>
>>> Results with KASAN enabled:
>>>
>>> free_pcppages_bulk contention (KASAN):
>>> +--------------+----------+----------+
>>> | Metric | No fix | With fix |
>>> +--------------+----------+----------+
>>> | Contentions | 872 | 117 |
>>> | Total wait | 199.43ms | 80.76ms |
>>> | Max wait | 4.19ms | 35.76ms |
>>> +--------------+----------+----------+
>>>
>>> Results without KASAN:
>>>
>>> free_pcppages_bulk contention (no KASAN):
>>> +--------------+----------+----------+
>>> | Metric | No fix | With fix |
>>> +--------------+----------+----------+
>>> | Contentions | 240 | 133 |
>>> | Total wait | 34.01ms | 24.61ms |
>>> | Max wait | 965us | 1.35ms |
>>> +--------------+----------+----------+
>>>
>>> Signed-off-by: Breno Leitao <leitao@debian.org>
>>> ---
>>> mm/vmstat.c | 25 ++++++++++++++++++++++++-
>>> 1 file changed, 24 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>>> index 2370c6fb1fcd..2e94bd765606 100644
>>> --- a/mm/vmstat.c
>>> +++ b/mm/vmstat.c
>>> @@ -2032,6 +2032,29 @@ static int vmstat_refresh(const struct ctl_table *table, int write,
>>> }
>>> #endif /* CONFIG_PROC_FS */
>>>
>>> +/*
>>> + * Return a per-cpu delay that spreads vmstat_update work across the stat
>>> + * interval. Without this, round_jiffies_relative() aligns every CPU's
>>> + * timer to the same second boundary, causing a thundering-herd on
>>> + * zone->lock when multiple CPUs drain PCP pages simultaneously via
>>> + * decay_pcp_high() -> free_pcppages_bulk().
>>> + */
>>> +static unsigned long vmstat_spread_delay(void)
>>> +{
>>> + unsigned long interval = sysctl_stat_interval;
>>> + unsigned int nr_cpus = num_online_cpus();
>>> +
>>> + if (nr_cpus <= 1)
>>> + return round_jiffies_relative(interval);
>>> +
>>> + /*
>>> + * Spread per-cpu vmstat work evenly across the interval. Don't
>>> + * use round_jiffies_relative() here -- it would snap every CPU
>>> + * back to the same second boundary, defeating the spread.
>>> + */
>>> + return interval + (interval * (smp_processor_id() % nr_cpus)) / nr_cpus;
>>> +}
>>> +
>>> static void vmstat_update(struct work_struct *w)
>>> {
>>> if (refresh_cpu_vm_stats(true)) {
>>> @@ -2042,7 +2065,7 @@ static void vmstat_update(struct work_struct *w)
>>> */
>>> queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
>>> this_cpu_ptr(&vmstat_work),
>>> - round_jiffies_relative(sysctl_stat_interval));
>>> + vmstat_spread_delay());
>>
>> This is awesome! Maybe this needs to be done to vmstat_shepherd() as well?
>>
>> vmstat_shepherd() still queues work with delay 0 on all CPUs that
>> need_update() in its for_each_online_cpu() loop:
>>
>> if (!delayed_work_pending(dw) && need_update(cpu))
>> queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
>>
>> So when the shepherd fires, it kicks all dormant CPUs' vmstat workers
>> simultaneously.
>>
>> Under sustained memory pressure on a large system, I think the shepherd
>> fires every sysctl_stat_interval and could re-trigger the same lock
>> contention?
>
> Good point - incorporating similar spreading logic in vmstat_shepherd()
> would indeed address the simultaneous queueing issue you've described.
>
> Should I include this in a v2 of this patch, or would you prefer it as
> a separate follow-up patch?
I think it can be a separate follow-up patch, but no strong preference.
For this patch:
Acked-by: Usama Arif <usama.arif@linux.dev>
On Wed, Apr 01, 2026 at 04:50:03PM +0100, Usama Arif wrote:
> >> This is awesome! Maybe this needs to be done to vmstat_shepherd() as well?
> >>
> >> vmstat_shepherd() still queues work with delay 0 on all CPUs that
> >> need_update() in its for_each_online_cpu() loop:
> >>
> >> 	if (!delayed_work_pending(dw) && need_update(cpu))
> >> 		queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
> >>
> >> So when the shepherd fires, it kicks all dormant CPUs' vmstat workers
> >> simultaneously.
> >>
> >> Under sustained memory pressure on a large system, I think the shepherd
> >> fires every sysctl_stat_interval and could re-trigger the same lock
> >> contention?
> >
> > Good point - incorporating similar spreading logic in vmstat_shepherd()
> > would indeed address the simultaneous queueing issue you've described.
> >
> > Should I include this in a v2 of this patch, or would you prefer it as
> > a separate follow-up patch?
>
> I think it can be a separate follow-up patch, but no strong preference.

Thanks! I will send a follow-up patch soon.

--breno
On Wed, Apr 01, 2026 at 06:57:50AM -0700, Breno Leitao wrote:
> vmstat_update uses round_jiffies_relative() when re-queuing itself,
> which aligns all CPUs' timers to the same second boundary. When many
> CPUs have pending PCP pages to drain, they all call decay_pcp_high() ->
> free_pcppages_bulk() simultaneously, serializing on zone->lock and
> hitting contention.
>
> Introduce vmstat_spread_delay() which distributes each CPU's
> vmstat_update evenly across the stat interval instead of aligning them.

Nice idea.

> This does not increase the number of timer interrupts — each CPU still
> fires once per interval. The timers are simply staggered rather than
> aligned. Additionally, vmstat_work is DEFERRABLE_WORK, so it does not
> wake idle CPUs regardless of scheduling; the spread only affects CPUs
> that are already active
>
> `perf lock contention` shows 7.5x reduction in zone->lock contention
> (872 -> 117 contentions, 199ms -> 81ms total wait) on a 72-CPU aarch64
> system under memory pressure.

Wow. That's huge improvement.

> Tested on a 72-CPU aarch64 system using stress-ng --vm to generate
> memory allocation bursts. Lock contention was measured with:
>
> 	perf lock contention -a -b -S free_pcppages_bulk
>
> Results with KASAN enabled:
>
> free_pcppages_bulk contention (KASAN):
> +--------------+----------+----------+
> | Metric       | No fix   | With fix |
> +--------------+----------+----------+
> | Contentions  | 872      | 117      |
> | Total wait   | 199.43ms | 80.76ms  |
> | Max wait     | 4.19ms   | 35.76ms  |
> +--------------+----------+----------+
>
> Results without KASAN:
>
> free_pcppages_bulk contention (no KASAN):
> +--------------+----------+----------+
> | Metric       | No fix   | With fix |
> +--------------+----------+----------+
> | Contentions  | 240      | 133      |
> | Total wait   | 34.01ms  | 24.61ms  |
> | Max wait     | 965us    | 1.35ms   |
> +--------------+----------+----------+
>
> Signed-off-by: Breno Leitao <leitao@debian.org>

Acked-by: Kiryl Shutsemau (Meta) <kas@kernel.org>

--
Kiryl Shutsemau / Kirill A. Shutemov
On Wed, Apr 01, 2026 at 06:57:50AM -0700, Breno Leitao wrote:
> free_pcppages_bulk contention (KASAN):
> +--------------+----------+----------+
> | Metric       | No fix   | With fix |
> +--------------+----------+----------+
> | Contentions  | 872      | 117      |
> | Total wait   | 199.43ms | 80.76ms  |
> | Max wait     | 4.19ms   | 35.76ms  |
> +--------------+----------+----------+
>
> Results without KASAN:
>
> free_pcppages_bulk contention (no KASAN):
> +--------------+----------+----------+
> | Metric       | No fix   | With fix |
> +--------------+----------+----------+
> | Contentions  | 240      | 133      |
> | Total wait   | 34.01ms  | 24.61ms  |
> | Max wait     | 965us    | 1.35ms   |
> +--------------+----------+----------+

Sorry, the Max wait time is inverted on both cases.

free_pcppages_bulk contention (KASAN):
+--------------+----------+----------+
| Metric       | No fix   | With fix |
+--------------+----------+----------+
| Contentions  | 872      | 117      |
| Total wait   | 199.43ms | 80.76ms  |
| Max wait     | 35.76ms  | 4.19ms   |
+--------------+----------+----------+

Results without KASAN:

free_pcppages_bulk contention (no KASAN):
+--------------+----------+----------+
| Metric       | No fix   | With fix |
+--------------+----------+----------+
| Contentions  | 240      | 133      |
| Total wait   | 34.01ms  | 24.61ms  |
| Max wait     | 1.35ms   | 965us    |
+--------------+----------+----------+
On Wed, Apr 01, 2026 at 06:57:50AM -0700, Breno Leitao wrote:
> vmstat_update uses round_jiffies_relative() when re-queuing itself,
> which aligns all CPUs' timers to the same second boundary. When many
> CPUs have pending PCP pages to drain, they all call decay_pcp_high() ->
> free_pcppages_bulk() simultaneously, serializing on zone->lock and
> hitting contention.
>
> Introduce vmstat_spread_delay() which distributes each CPU's
> vmstat_update evenly across the stat interval instead of aligning them.
>
> This does not increase the number of timer interrupts — each CPU still
> fires once per interval. The timers are simply staggered rather than
> aligned. Additionally, vmstat_work is DEFERRABLE_WORK, so it does not
> wake idle CPUs regardless of scheduling; the spread only affects CPUs
> that are already active
>
> `perf lock contention` shows 7.5x reduction in zone->lock contention
> (872 -> 117 contentions, 199ms -> 81ms total wait) on a 72-CPU aarch64
> system under memory pressure.
>
> Tested on a 72-CPU aarch64 system using stress-ng --vm to generate
> memory allocation bursts. Lock contention was measured with:
>
> perf lock contention -a -b -S free_pcppages_bulk
>
> Results with KASAN enabled:
>
> free_pcppages_bulk contention (KASAN):
> +--------------+----------+----------+
> | Metric | No fix | With fix |
> +--------------+----------+----------+
> | Contentions | 872 | 117 |
> | Total wait | 199.43ms | 80.76ms |
> | Max wait | 4.19ms | 35.76ms |
> +--------------+----------+----------+
>
> Results without KASAN:
>
> free_pcppages_bulk contention (no KASAN):
> +--------------+----------+----------+
> | Metric | No fix | With fix |
> +--------------+----------+----------+
> | Contentions | 240 | 133 |
> | Total wait | 34.01ms | 24.61ms |
> | Max wait | 965us | 1.35ms |
> +--------------+----------+----------+
>
> Signed-off-by: Breno Leitao <leitao@debian.org>
Nice!
> ---
> mm/vmstat.c | 25 ++++++++++++++++++++++++-
> 1 file changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 2370c6fb1fcd..2e94bd765606 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -2032,6 +2032,29 @@ static int vmstat_refresh(const struct ctl_table *table, int write,
> }
> #endif /* CONFIG_PROC_FS */
>
> +/*
> + * Return a per-cpu delay that spreads vmstat_update work across the stat
> + * interval. Without this, round_jiffies_relative() aligns every CPU's
> + * timer to the same second boundary, causing a thundering-herd on
> + * zone->lock when multiple CPUs drain PCP pages simultaneously via
> + * decay_pcp_high() -> free_pcppages_bulk().
> + */
> +static unsigned long vmstat_spread_delay(void)
> +{
> + unsigned long interval = sysctl_stat_interval;
> + unsigned int nr_cpus = num_online_cpus();
> +
> + if (nr_cpus <= 1)
> + return round_jiffies_relative(interval);
> +
> + /*
> + * Spread per-cpu vmstat work evenly across the interval. Don't
> + * use round_jiffies_relative() here -- it would snap every CPU
> + * back to the same second boundary, defeating the spread.
> + */
> + return interval + (interval * (smp_processor_id() % nr_cpus)) / nr_cpus;
smp_processor_id() <= nr_cpus, so
return interval + interval*cpu/nr_cpus
should be equivalent, no?
Other than that,
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Hello Johannes,
On Wed, Apr 01, 2026 at 10:25:35AM -0400, Johannes Weiner wrote:
> On Wed, Apr 01, 2026 at 06:57:50AM -0700, Breno Leitao wrote:
> > +static unsigned long vmstat_spread_delay(void)
> > +{
> > + unsigned long interval = sysctl_stat_interval;
> > + unsigned int nr_cpus = num_online_cpus();
> > +
> > + if (nr_cpus <= 1)
> > + return round_jiffies_relative(interval);
> > +
> > + /*
> > + * Spread per-cpu vmstat work evenly across the interval. Don't
> > + * use round_jiffies_relative() here -- it would snap every CPU
> > + * back to the same second boundary, defeating the spread.
> > + */
> > + return interval + (interval * (smp_processor_id() % nr_cpus)) / nr_cpus;
>
> smp_processor_id() <= nr_cpus, so
>
> return interval + interval*cpu/nr_cpus
>
> should be equivalent, no?
nr_cpus is the number of online CPUs, while smp_processor_id() is the
CPU id.
If you offline a CPU, then smp_processor_id() might be bigger than
num_online_cpus().
My goal was to linearly shift the timer and avoid creating gaps when
removing certain CPUs.
Thanks for the review,
--breno
On Wed, Apr 01, 2026 at 07:39:28AM -0700, Breno Leitao wrote:
> Hello Johannes,
>
> On Wed, Apr 01, 2026 at 10:25:35AM -0400, Johannes Weiner wrote:
> > On Wed, Apr 01, 2026 at 06:57:50AM -0700, Breno Leitao wrote:
> > > +static unsigned long vmstat_spread_delay(void)
> > > +{
> > > + unsigned long interval = sysctl_stat_interval;
> > > + unsigned int nr_cpus = num_online_cpus();
> > > +
> > > + if (nr_cpus <= 1)
> > > + return round_jiffies_relative(interval);
> > > +
> > > + /*
> > > + * Spread per-cpu vmstat work evenly across the interval. Don't
> > > + * use round_jiffies_relative() here -- it would snap every CPU
> > > + * back to the same second boundary, defeating the spread.
> > > + */
> > > + return interval + (interval * (smp_processor_id() % nr_cpus)) / nr_cpus;
> >
> > smp_processor_id() <= nr_cpus, so
> >
> > return interval + interval*cpu/nr_cpus
> >
> > should be equivalent, no?
>
> nr_cpus is the number of online CPUs, while smp_processor_id() is the
> CPU id.
>
> If you offline a CPU, then smp_processor_id() might be bigger than
> num_online_cpus()
>
> My goal was to linearly shift the timer and avoid creating gaps when
> removing certain CPUs.
Ah makes sense. Plus you'd spill into the next interval otherwise.