[PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval

Posted by Breno Leitao 5 hours ago
vmstat_update uses round_jiffies_relative() when re-queuing itself,
which aligns all CPUs' timers to the same second boundary.  When many
CPUs have pending PCP pages to drain, they all call decay_pcp_high() ->
free_pcppages_bulk() simultaneously, serializing on zone->lock and
hitting contention.

Introduce vmstat_spread_delay() which distributes each CPU's
vmstat_update evenly across the stat interval instead of aligning them.

This does not increase the number of timer interrupts — each CPU still
fires once per interval. The timers are simply staggered rather than
aligned. Additionally, vmstat_work is DEFERRABLE_WORK, so it does not
wake idle CPUs regardless of scheduling; the spread only affects CPUs
that are already active.

`perf lock contention` shows 7.5x reduction in zone->lock contention
(872 -> 117 contentions, 199ms -> 81ms total wait) on a 72-CPU aarch64
system under memory pressure.

Tested on a 72-CPU aarch64 system using stress-ng --vm to generate
memory allocation bursts.  Lock contention was measured with:

  perf lock contention -a -b -S free_pcppages_bulk

Results with KASAN enabled:

  free_pcppages_bulk contention (KASAN):
  +--------------+----------+----------+
  | Metric       | No fix   | With fix |
  +--------------+----------+----------+
  | Contentions  |      872 |      117 |
  | Total wait   | 199.43ms | 80.76ms  |
  | Max wait     |   4.19ms | 35.76ms  |
  +--------------+----------+----------+

Results without KASAN:

  free_pcppages_bulk contention (no KASAN):
  +--------------+----------+----------+
  | Metric       | No fix   | With fix |
  +--------------+----------+----------+
  | Contentions  |      240 |      133 |
  | Total wait   |  34.01ms | 24.61ms  |
  | Max wait     |   965us  |  1.35ms  |
  +--------------+----------+----------+

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 mm/vmstat.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 2370c6fb1fcd..2e94bd765606 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -2032,6 +2032,29 @@ static int vmstat_refresh(const struct ctl_table *table, int write,
 }
 #endif /* CONFIG_PROC_FS */
 
+/*
+ * Return a per-cpu delay that spreads vmstat_update work across the stat
+ * interval.  Without this, round_jiffies_relative() aligns every CPU's
+ * timer to the same second boundary, causing a thundering-herd on
+ * zone->lock when multiple CPUs drain PCP pages simultaneously via
+ * decay_pcp_high() -> free_pcppages_bulk().
+ */
+static unsigned long vmstat_spread_delay(void)
+{
+	unsigned long interval = sysctl_stat_interval;
+	unsigned int nr_cpus = num_online_cpus();
+
+	if (nr_cpus <= 1)
+		return round_jiffies_relative(interval);
+
+	/*
+	 * Spread per-cpu vmstat work evenly across the interval.  Don't
+	 * use round_jiffies_relative() here -- it would snap every CPU
+	 * back to the same second boundary, defeating the spread.
+	 */
+	return interval + (interval * (smp_processor_id() % nr_cpus)) / nr_cpus;
+}
+
 static void vmstat_update(struct work_struct *w)
 {
 	if (refresh_cpu_vm_stats(true)) {
@@ -2042,7 +2065,7 @@ static void vmstat_update(struct work_struct *w)
 		 */
 		queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
 				this_cpu_ptr(&vmstat_work),
-				round_jiffies_relative(sysctl_stat_interval));
+				vmstat_spread_delay());
 	}
 }
 

---
base-commit: cf7c3c02fdd0dfccf4d6611714273dcb538af2cb
change-id: 20260401-vmstat-048e0feaf344

Best regards,
--  
Breno Leitao <leitao@debian.org>

Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval
Posted by Vlastimil Babka (SUSE) an hour ago
On 4/1/26 15:57, Breno Leitao wrote:
> vmstat_update uses round_jiffies_relative() when re-queuing itself,
> which aligns all CPUs' timers to the same second boundary.  When many
> CPUs have pending PCP pages to drain, they all call decay_pcp_high() ->
> free_pcppages_bulk() simultaneously, serializing on zone->lock and
> hitting contention.
> 
> Introduce vmstat_spread_delay() which distributes each CPU's
> vmstat_update evenly across the stat interval instead of aligning them.
> 
> This does not increase the number of timer interrupts — each CPU still
> fires once per interval. The timers are simply staggered rather than
> aligned. Additionally, vmstat_work is DEFERRABLE_WORK, so it does not
> wake idle CPUs regardless of scheduling; the spread only affects CPUs
> that are already active
> 
> `perf lock contention` shows 7.5x reduction in zone->lock contention
> (872 -> 117 contentions, 199ms -> 81ms total wait) on a 72-CPU aarch64
> system under memory pressure.
> 
> Tested on a 72-CPU aarch64 system using stress-ng --vm to generate
> memory allocation bursts.  Lock contention was measured with:
> 
>   perf lock contention -a -b -S free_pcppages_bulk
> 
> Results with KASAN enabled:
> 
>   free_pcppages_bulk contention (KASAN):
>   +--------------+----------+----------+
>   | Metric       | No fix   | With fix |
>   +--------------+----------+----------+
>   | Contentions  |      872 |      117 |
>   | Total wait   | 199.43ms | 80.76ms  |
>   | Max wait     |   4.19ms | 35.76ms  |
>   +--------------+----------+----------+
> 
> Results without KASAN:
> 
>   free_pcppages_bulk contention (no KASAN):
>   +--------------+----------+----------+
>   | Metric       | No fix   | With fix |
>   +--------------+----------+----------+
>   | Contentions  |      240 |      133 |
>   | Total wait   |  34.01ms | 24.61ms  |
>   | Max wait     |   965us  |  1.35ms  |
>   +--------------+----------+----------+
> 
> Signed-off-by: Breno Leitao <leitao@debian.org>

Cool!

I noticed __round_jiffies_relative() exists and the description looks like
it's meant for exactly this use case?

> ---
>  mm/vmstat.c | 25 ++++++++++++++++++++++++-
>  1 file changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 2370c6fb1fcd..2e94bd765606 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -2032,6 +2032,29 @@ static int vmstat_refresh(const struct ctl_table *table, int write,
>  }
>  #endif /* CONFIG_PROC_FS */
>  
> +/*
> + * Return a per-cpu delay that spreads vmstat_update work across the stat
> + * interval.  Without this, round_jiffies_relative() aligns every CPU's
> + * timer to the same second boundary, causing a thundering-herd on
> + * zone->lock when multiple CPUs drain PCP pages simultaneously via
> + * decay_pcp_high() -> free_pcppages_bulk().
> + */
> +static unsigned long vmstat_spread_delay(void)
> +{
> +	unsigned long interval = sysctl_stat_interval;
> +	unsigned int nr_cpus = num_online_cpus();
> +
> +	if (nr_cpus <= 1)
> +		return round_jiffies_relative(interval);
> +
> +	/*
> +	 * Spread per-cpu vmstat work evenly across the interval.  Don't
> +	 * use round_jiffies_relative() here -- it would snap every CPU
> +	 * back to the same second boundary, defeating the spread.
> +	 */
> +	return interval + (interval * (smp_processor_id() % nr_cpus)) / nr_cpus;

Hm doesn't this mean that lower id cpus will consistently fire in shorter
intervals and higher id in longer intervals? What we want is same interval
but differently offset, no?

> +}
> +
>  static void vmstat_update(struct work_struct *w)
>  {
>  	if (refresh_cpu_vm_stats(true)) {
> @@ -2042,7 +2065,7 @@ static void vmstat_update(struct work_struct *w)
>  		 */
>  		queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
>  				this_cpu_ptr(&vmstat_work),
> -				round_jiffies_relative(sysctl_stat_interval));
> +				vmstat_spread_delay());
>  	}
>  }
>  
> 
> ---
> base-commit: cf7c3c02fdd0dfccf4d6611714273dcb538af2cb
> change-id: 20260401-vmstat-048e0feaf344
> 
> Best regards,
> --  
> Breno Leitao <leitao@debian.org>
> 

Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval
Posted by Usama Arif 3 hours ago
On Wed, 01 Apr 2026 06:57:50 -0700 Breno Leitao <leitao@debian.org> wrote:

> vmstat_update uses round_jiffies_relative() when re-queuing itself,
> which aligns all CPUs' timers to the same second boundary.  When many
> CPUs have pending PCP pages to drain, they all call decay_pcp_high() ->
> free_pcppages_bulk() simultaneously, serializing on zone->lock and
> hitting contention.
> 
> Introduce vmstat_spread_delay() which distributes each CPU's
> vmstat_update evenly across the stat interval instead of aligning them.
> 
> This does not increase the number of timer interrupts — each CPU still
> fires once per interval. The timers are simply staggered rather than
> aligned. Additionally, vmstat_work is DEFERRABLE_WORK, so it does not
> wake idle CPUs regardless of scheduling; the spread only affects CPUs
> that are already active
> 
> `perf lock contention` shows 7.5x reduction in zone->lock contention
> (872 -> 117 contentions, 199ms -> 81ms total wait) on a 72-CPU aarch64
> system under memory pressure.
> 
> Tested on a 72-CPU aarch64 system using stress-ng --vm to generate
> memory allocation bursts.  Lock contention was measured with:
> 
>   perf lock contention -a -b -S free_pcppages_bulk
> 
> Results with KASAN enabled:
> 
>   free_pcppages_bulk contention (KASAN):
>   +--------------+----------+----------+
>   | Metric       | No fix   | With fix |
>   +--------------+----------+----------+
>   | Contentions  |      872 |      117 |
>   | Total wait   | 199.43ms | 80.76ms  |
>   | Max wait     |   4.19ms | 35.76ms  |
>   +--------------+----------+----------+
> 
> Results without KASAN:
> 
>   free_pcppages_bulk contention (no KASAN):
>   +--------------+----------+----------+
>   | Metric       | No fix   | With fix |
>   +--------------+----------+----------+
>   | Contentions  |      240 |      133 |
>   | Total wait   |  34.01ms | 24.61ms  |
>   | Max wait     |   965us  |  1.35ms  |
>   +--------------+----------+----------+
> 
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
>  mm/vmstat.c | 25 ++++++++++++++++++++++++-
>  1 file changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 2370c6fb1fcd..2e94bd765606 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -2032,6 +2032,29 @@ static int vmstat_refresh(const struct ctl_table *table, int write,
>  }
>  #endif /* CONFIG_PROC_FS */
>  
> +/*
> + * Return a per-cpu delay that spreads vmstat_update work across the stat
> + * interval.  Without this, round_jiffies_relative() aligns every CPU's
> + * timer to the same second boundary, causing a thundering-herd on
> + * zone->lock when multiple CPUs drain PCP pages simultaneously via
> + * decay_pcp_high() -> free_pcppages_bulk().
> + */
> +static unsigned long vmstat_spread_delay(void)
> +{
> +	unsigned long interval = sysctl_stat_interval;
> +	unsigned int nr_cpus = num_online_cpus();
> +
> +	if (nr_cpus <= 1)
> +		return round_jiffies_relative(interval);
> +
> +	/*
> +	 * Spread per-cpu vmstat work evenly across the interval.  Don't
> +	 * use round_jiffies_relative() here -- it would snap every CPU
> +	 * back to the same second boundary, defeating the spread.
> +	 */
> +	return interval + (interval * (smp_processor_id() % nr_cpus)) / nr_cpus;
> +}
> +
>  static void vmstat_update(struct work_struct *w)
>  {
>  	if (refresh_cpu_vm_stats(true)) {
> @@ -2042,7 +2065,7 @@ static void vmstat_update(struct work_struct *w)
>  		 */
>  		queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
>  				this_cpu_ptr(&vmstat_work),
> -				round_jiffies_relative(sysctl_stat_interval));
> +				vmstat_spread_delay());

This is awesome! Maybe this needs to be done to vmstat_shepherd() as well?

vmstat_shepherd() still queues work with delay 0 on all CPUs that
need_update() in its for_each_online_cpu() loop:

      if (!delayed_work_pending(dw) && need_update(cpu))
          queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);

So when the shepherd fires, it kicks all dormant CPUs' vmstat workers
simultaneously.

Under sustained memory pressure on a large system, I think the shepherd
fires every sysctl_stat_interval and could re-trigger the same lock
contention?
 
>  	}
>  }
>  
> 
> ---
> base-commit: cf7c3c02fdd0dfccf4d6611714273dcb538af2cb
> change-id: 20260401-vmstat-048e0feaf344
> 
> Best regards,
> --  
> Breno Leitao <leitao@debian.org>
> 
> 
Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval
Posted by Breno Leitao 3 hours ago
On Wed, Apr 01, 2026 at 08:23:40AM -0700, Usama Arif wrote:
> On Wed, 01 Apr 2026 06:57:50 -0700 Breno Leitao <leitao@debian.org> wrote:
>
> > vmstat_update uses round_jiffies_relative() when re-queuing itself,
> > which aligns all CPUs' timers to the same second boundary.  When many
> > CPUs have pending PCP pages to drain, they all call decay_pcp_high() ->
> > free_pcppages_bulk() simultaneously, serializing on zone->lock and
> > hitting contention.
> >
> > Introduce vmstat_spread_delay() which distributes each CPU's
> > vmstat_update evenly across the stat interval instead of aligning them.
> >
> > This does not increase the number of timer interrupts — each CPU still
> > fires once per interval. The timers are simply staggered rather than
> > aligned. Additionally, vmstat_work is DEFERRABLE_WORK, so it does not
> > wake idle CPUs regardless of scheduling; the spread only affects CPUs
> > that are already active
> >
> > `perf lock contention` shows 7.5x reduction in zone->lock contention
> > (872 -> 117 contentions, 199ms -> 81ms total wait) on a 72-CPU aarch64
> > system under memory pressure.
> >
> > Tested on a 72-CPU aarch64 system using stress-ng --vm to generate
> > memory allocation bursts.  Lock contention was measured with:
> >
> >   perf lock contention -a -b -S free_pcppages_bulk
> >
> > Results with KASAN enabled:
> >
> >   free_pcppages_bulk contention (KASAN):
> >   +--------------+----------+----------+
> >   | Metric       | No fix   | With fix |
> >   +--------------+----------+----------+
> >   | Contentions  |      872 |      117 |
> >   | Total wait   | 199.43ms | 80.76ms  |
> >   | Max wait     |   4.19ms | 35.76ms  |
> >   +--------------+----------+----------+
> >
> > Results without KASAN:
> >
> >   free_pcppages_bulk contention (no KASAN):
> >   +--------------+----------+----------+
> >   | Metric       | No fix   | With fix |
> >   +--------------+----------+----------+
> >   | Contentions  |      240 |      133 |
> >   | Total wait   |  34.01ms | 24.61ms  |
> >   | Max wait     |   965us  |  1.35ms  |
> >   +--------------+----------+----------+
> >
> > Signed-off-by: Breno Leitao <leitao@debian.org>
> > ---
> >  mm/vmstat.c | 25 ++++++++++++++++++++++++-
> >  1 file changed, 24 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 2370c6fb1fcd..2e94bd765606 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -2032,6 +2032,29 @@ static int vmstat_refresh(const struct ctl_table *table, int write,
> >  }
> >  #endif /* CONFIG_PROC_FS */
> >
> > +/*
> > + * Return a per-cpu delay that spreads vmstat_update work across the stat
> > + * interval.  Without this, round_jiffies_relative() aligns every CPU's
> > + * timer to the same second boundary, causing a thundering-herd on
> > + * zone->lock when multiple CPUs drain PCP pages simultaneously via
> > + * decay_pcp_high() -> free_pcppages_bulk().
> > + */
> > +static unsigned long vmstat_spread_delay(void)
> > +{
> > +	unsigned long interval = sysctl_stat_interval;
> > +	unsigned int nr_cpus = num_online_cpus();
> > +
> > +	if (nr_cpus <= 1)
> > +		return round_jiffies_relative(interval);
> > +
> > +	/*
> > +	 * Spread per-cpu vmstat work evenly across the interval.  Don't
> > +	 * use round_jiffies_relative() here -- it would snap every CPU
> > +	 * back to the same second boundary, defeating the spread.
> > +	 */
> > +	return interval + (interval * (smp_processor_id() % nr_cpus)) / nr_cpus;
> > +}
> > +
> >  static void vmstat_update(struct work_struct *w)
> >  {
> >  	if (refresh_cpu_vm_stats(true)) {
> > @@ -2042,7 +2065,7 @@ static void vmstat_update(struct work_struct *w)
> >  		 */
> >  		queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
> >  				this_cpu_ptr(&vmstat_work),
> > -				round_jiffies_relative(sysctl_stat_interval));
> > +				vmstat_spread_delay());
>
> This is awesome! Maybe this needs to be done to vmstat_shepherd() as well?
>
> vmstat_shepherd() still queues work with delay 0 on all CPUs that
> need_update() in its for_each_online_cpu() loop:
>
>       if (!delayed_work_pending(dw) && need_update(cpu))
>           queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
>
> So when the shepherd fires, it kicks all dormant CPUs' vmstat workers
> simultaneously.
>
> Under sustained memory pressure on a large system, I think the shepherd
> fires every sysctl_stat_interval and could re-trigger the same lock
> contention?

Good point - incorporating similar spreading logic in vmstat_shepherd()
would indeed address the simultaneous queueing issue you've described.

Should I include this in a v2 of this patch, or would you prefer it as
a separate follow-up patch?
Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval
Posted by Usama Arif 3 hours ago

On 01/04/2026 18:43, Breno Leitao wrote:
> On Wed, Apr 01, 2026 at 08:23:40AM -0700, Usama Arif wrote:
>> On Wed, 01 Apr 2026 06:57:50 -0700 Breno Leitao <leitao@debian.org> wrote:
>>
>>> vmstat_update uses round_jiffies_relative() when re-queuing itself,
>>> which aligns all CPUs' timers to the same second boundary.  When many
>>> CPUs have pending PCP pages to drain, they all call decay_pcp_high() ->
>>> free_pcppages_bulk() simultaneously, serializing on zone->lock and
>>> hitting contention.
>>>
>>> Introduce vmstat_spread_delay() which distributes each CPU's
>>> vmstat_update evenly across the stat interval instead of aligning them.
>>>
>>> This does not increase the number of timer interrupts — each CPU still
>>> fires once per interval. The timers are simply staggered rather than
>>> aligned. Additionally, vmstat_work is DEFERRABLE_WORK, so it does not
>>> wake idle CPUs regardless of scheduling; the spread only affects CPUs
>>> that are already active
>>>
>>> `perf lock contention` shows 7.5x reduction in zone->lock contention
>>> (872 -> 117 contentions, 199ms -> 81ms total wait) on a 72-CPU aarch64
>>> system under memory pressure.
>>>
>>> Tested on a 72-CPU aarch64 system using stress-ng --vm to generate
>>> memory allocation bursts.  Lock contention was measured with:
>>>
>>>   perf lock contention -a -b -S free_pcppages_bulk
>>>
>>> Results with KASAN enabled:
>>>
>>>   free_pcppages_bulk contention (KASAN):
>>>   +--------------+----------+----------+
>>>   | Metric       | No fix   | With fix |
>>>   +--------------+----------+----------+
>>>   | Contentions  |      872 |      117 |
>>>   | Total wait   | 199.43ms | 80.76ms  |
>>>   | Max wait     |   4.19ms | 35.76ms  |
>>>   +--------------+----------+----------+
>>>
>>> Results without KASAN:
>>>
>>>   free_pcppages_bulk contention (no KASAN):
>>>   +--------------+----------+----------+
>>>   | Metric       | No fix   | With fix |
>>>   +--------------+----------+----------+
>>>   | Contentions  |      240 |      133 |
>>>   | Total wait   |  34.01ms | 24.61ms  |
>>>   | Max wait     |   965us  |  1.35ms  |
>>>   +--------------+----------+----------+
>>>
>>> Signed-off-by: Breno Leitao <leitao@debian.org>
>>> ---
>>>  mm/vmstat.c | 25 ++++++++++++++++++++++++-
>>>  1 file changed, 24 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>>> index 2370c6fb1fcd..2e94bd765606 100644
>>> --- a/mm/vmstat.c
>>> +++ b/mm/vmstat.c
>>> @@ -2032,6 +2032,29 @@ static int vmstat_refresh(const struct ctl_table *table, int write,
>>>  }
>>>  #endif /* CONFIG_PROC_FS */
>>>
>>> +/*
>>> + * Return a per-cpu delay that spreads vmstat_update work across the stat
>>> + * interval.  Without this, round_jiffies_relative() aligns every CPU's
>>> + * timer to the same second boundary, causing a thundering-herd on
>>> + * zone->lock when multiple CPUs drain PCP pages simultaneously via
>>> + * decay_pcp_high() -> free_pcppages_bulk().
>>> + */
>>> +static unsigned long vmstat_spread_delay(void)
>>> +{
>>> +	unsigned long interval = sysctl_stat_interval;
>>> +	unsigned int nr_cpus = num_online_cpus();
>>> +
>>> +	if (nr_cpus <= 1)
>>> +		return round_jiffies_relative(interval);
>>> +
>>> +	/*
>>> +	 * Spread per-cpu vmstat work evenly across the interval.  Don't
>>> +	 * use round_jiffies_relative() here -- it would snap every CPU
>>> +	 * back to the same second boundary, defeating the spread.
>>> +	 */
>>> +	return interval + (interval * (smp_processor_id() % nr_cpus)) / nr_cpus;
>>> +}
>>> +
>>>  static void vmstat_update(struct work_struct *w)
>>>  {
>>>  	if (refresh_cpu_vm_stats(true)) {
>>> @@ -2042,7 +2065,7 @@ static void vmstat_update(struct work_struct *w)
>>>  		 */
>>>  		queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
>>>  				this_cpu_ptr(&vmstat_work),
>>> -				round_jiffies_relative(sysctl_stat_interval));
>>> +				vmstat_spread_delay());
>>
>> This is awesome! Maybe this needs to be done to vmstat_shepherd() as well?
>>
>> vmstat_shepherd() still queues work with delay 0 on all CPUs that
>> need_update() in its for_each_online_cpu() loop:
>>
>>       if (!delayed_work_pending(dw) && need_update(cpu))
>>           queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
>>
>> So when the shepherd fires, it kicks all dormant CPUs' vmstat workers
>> simultaneously.
>>
>> Under sustained memory pressure on a large system, I think the shepherd
>> fires every sysctl_stat_interval and could re-trigger the same lock
>> contention?
> 
> Good point - incorporating similar spreading logic in vmstat_shepherd()
> would indeed address the simultaneous queueing issue you've described.
> 
> Should I include this in a v2 of this patch, or would you prefer it as
> a separate follow-up patch?

I think it can be a separate follow-up patch, but no strong preference.
For this patch:

Acked-by: Usama Arif <usama.arif@linux.dev>

Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval
Posted by Breno Leitao 3 hours ago
On Wed, Apr 01, 2026 at 04:50:03PM +0100, Usama Arif wrote:
> >>
> >> This is awesome! Maybe this needs to be done to vmstat_shepherd() as well?
> >>
> >> vmstat_shepherd() still queues work with delay 0 on all CPUs that
> >> need_update() in its for_each_online_cpu() loop:
> >>
> >>       if (!delayed_work_pending(dw) && need_update(cpu))
> >>           queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
> >>
> >> So when the shepherd fires, it kicks all dormant CPUs' vmstat workers
> >> simultaneously.
> >>
> >> Under sustained memory pressure on a large system, I think the shepherd
> >> fires every sysctl_stat_interval and could re-trigger the same lock
> >> contention?
> > 
> > Good point - incorporating similar spreading logic in vmstat_shepherd()
> > would indeed address the simultaneous queueing issue you've described.
> > 
> > Should I include this in a v2 of this patch, or would you prefer it as
> > a separate follow-up patch?
> 
> I think it can be a separate follow-up patch, but no strong preference.

Thanks!

I will send a follow-up patch soon.
--breno
Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval
Posted by Kiryl Shutsemau 3 hours ago
On Wed, Apr 01, 2026 at 06:57:50AM -0700, Breno Leitao wrote:
> vmstat_update uses round_jiffies_relative() when re-queuing itself,
> which aligns all CPUs' timers to the same second boundary.  When many
> CPUs have pending PCP pages to drain, they all call decay_pcp_high() ->
> free_pcppages_bulk() simultaneously, serializing on zone->lock and
> hitting contention.
> 
> Introduce vmstat_spread_delay() which distributes each CPU's
> vmstat_update evenly across the stat interval instead of aligning them.

Nice idea.

> This does not increase the number of timer interrupts — each CPU still
> fires once per interval. The timers are simply staggered rather than
> aligned. Additionally, vmstat_work is DEFERRABLE_WORK, so it does not
> wake idle CPUs regardless of scheduling; the spread only affects CPUs
> that are already active
> 
> `perf lock contention` shows 7.5x reduction in zone->lock contention
> (872 -> 117 contentions, 199ms -> 81ms total wait) on a 72-CPU aarch64
> system under memory pressure.

Wow. That's huge improvement.

> 
> Tested on a 72-CPU aarch64 system using stress-ng --vm to generate
> memory allocation bursts.  Lock contention was measured with:
> 
>   perf lock contention -a -b -S free_pcppages_bulk
> 
> Results with KASAN enabled:
> 
>   free_pcppages_bulk contention (KASAN):
>   +--------------+----------+----------+
>   | Metric       | No fix   | With fix |
>   +--------------+----------+----------+
>   | Contentions  |      872 |      117 |
>   | Total wait   | 199.43ms | 80.76ms  |
>   | Max wait     |   4.19ms | 35.76ms  |
>   +--------------+----------+----------+
> 
> Results without KASAN:
> 
>   free_pcppages_bulk contention (no KASAN):
>   +--------------+----------+----------+
>   | Metric       | No fix   | With fix |
>   +--------------+----------+----------+
>   | Contentions  |      240 |      133 |
>   | Total wait   |  34.01ms | 24.61ms  |
>   | Max wait     |   965us  |  1.35ms  |
>   +--------------+----------+----------+
> 
> Signed-off-by: Breno Leitao <leitao@debian.org>

Acked-by: Kiryl Shutsemau (Meta) <kas@kernel.org>

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval
Posted by Breno Leitao 4 hours ago
On Wed, Apr 01, 2026 at 06:57:50AM -0700, Breno Leitao wrote:
>   free_pcppages_bulk contention (KASAN):
>   +--------------+----------+----------+
>   | Metric       | No fix   | With fix |
>   +--------------+----------+----------+
>   | Contentions  |      872 |      117 |
>   | Total wait   | 199.43ms | 80.76ms  |
>   | Max wait     |   4.19ms | 35.76ms  |
>   +--------------+----------+----------+
> 
> Results without KASAN:
> 
>   free_pcppages_bulk contention (no KASAN):
>   +--------------+----------+----------+
>   | Metric       | No fix   | With fix |
>   +--------------+----------+----------+
>   | Contentions  |      240 |      133 |
>   | Total wait   |  34.01ms | 24.61ms  |
>   | Max wait     |   965us  |  1.35ms  |
>   +--------------+----------+----------+

Sorry, the Max wait time is inverted on both cases.

  free_pcppages_bulk contention (KASAN):
  +--------------+----------+----------+
  | Metric       | No fix   | With fix |
  +--------------+----------+----------+
  | Contentions  |      872 |      117 |
  | Total wait   | 199.43ms | 80.76ms  |
  | Max wait     |  35.76ms | 4.19ms   |
  +--------------+----------+----------+

Results without KASAN:

  free_pcppages_bulk contention (no KASAN):
  +--------------+----------+----------+
  | Metric       | No fix   | With fix |
  +--------------+----------+----------+
  | Contentions  |      240 |      133 |
  | Total wait   |  34.01ms | 24.61ms  |
  | Max wait     |   1.35ms |   965us  |
  +--------------+----------+----------+
Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval
Posted by Johannes Weiner 4 hours ago
On Wed, Apr 01, 2026 at 06:57:50AM -0700, Breno Leitao wrote:
> vmstat_update uses round_jiffies_relative() when re-queuing itself,
> which aligns all CPUs' timers to the same second boundary.  When many
> CPUs have pending PCP pages to drain, they all call decay_pcp_high() ->
> free_pcppages_bulk() simultaneously, serializing on zone->lock and
> hitting contention.
> 
> Introduce vmstat_spread_delay() which distributes each CPU's
> vmstat_update evenly across the stat interval instead of aligning them.
> 
> This does not increase the number of timer interrupts — each CPU still
> fires once per interval. The timers are simply staggered rather than
> aligned. Additionally, vmstat_work is DEFERRABLE_WORK, so it does not
> wake idle CPUs regardless of scheduling; the spread only affects CPUs
> that are already active
> 
> `perf lock contention` shows 7.5x reduction in zone->lock contention
> (872 -> 117 contentions, 199ms -> 81ms total wait) on a 72-CPU aarch64
> system under memory pressure.
> 
> Tested on a 72-CPU aarch64 system using stress-ng --vm to generate
> memory allocation bursts.  Lock contention was measured with:
> 
>   perf lock contention -a -b -S free_pcppages_bulk
> 
> Results with KASAN enabled:
> 
>   free_pcppages_bulk contention (KASAN):
>   +--------------+----------+----------+
>   | Metric       | No fix   | With fix |
>   +--------------+----------+----------+
>   | Contentions  |      872 |      117 |
>   | Total wait   | 199.43ms | 80.76ms  |
>   | Max wait     |   4.19ms | 35.76ms  |
>   +--------------+----------+----------+
> 
> Results without KASAN:
> 
>   free_pcppages_bulk contention (no KASAN):
>   +--------------+----------+----------+
>   | Metric       | No fix   | With fix |
>   +--------------+----------+----------+
>   | Contentions  |      240 |      133 |
>   | Total wait   |  34.01ms | 24.61ms  |
>   | Max wait     |   965us  |  1.35ms  |
>   +--------------+----------+----------+
> 
> Signed-off-by: Breno Leitao <leitao@debian.org>

Nice!

> ---
>  mm/vmstat.c | 25 ++++++++++++++++++++++++-
>  1 file changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 2370c6fb1fcd..2e94bd765606 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -2032,6 +2032,29 @@ static int vmstat_refresh(const struct ctl_table *table, int write,
>  }
>  #endif /* CONFIG_PROC_FS */
>  
> +/*
> + * Return a per-cpu delay that spreads vmstat_update work across the stat
> + * interval.  Without this, round_jiffies_relative() aligns every CPU's
> + * timer to the same second boundary, causing a thundering-herd on
> + * zone->lock when multiple CPUs drain PCP pages simultaneously via
> + * decay_pcp_high() -> free_pcppages_bulk().
> + */
> +static unsigned long vmstat_spread_delay(void)
> +{
> +	unsigned long interval = sysctl_stat_interval;
> +	unsigned int nr_cpus = num_online_cpus();
> +
> +	if (nr_cpus <= 1)
> +		return round_jiffies_relative(interval);
> +
> +	/*
> +	 * Spread per-cpu vmstat work evenly across the interval.  Don't
> +	 * use round_jiffies_relative() here -- it would snap every CPU
> +	 * back to the same second boundary, defeating the spread.
> +	 */
> +	return interval + (interval * (smp_processor_id() % nr_cpus)) / nr_cpus;

smp_processor_id() <= nr_cpus, so

	return interval + interval*cpu/nr_cpus

should be equivalent, no?

Other than that,

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval
Posted by Breno Leitao 4 hours ago
Hello Johannes,

On Wed, Apr 01, 2026 at 10:25:35AM -0400, Johannes Weiner wrote:
> On Wed, Apr 01, 2026 at 06:57:50AM -0700, Breno Leitao wrote:
> > +static unsigned long vmstat_spread_delay(void)
> > +{
> > +	unsigned long interval = sysctl_stat_interval;
> > +	unsigned int nr_cpus = num_online_cpus();
> > +
> > +	if (nr_cpus <= 1)
> > +		return round_jiffies_relative(interval);
> > +
> > +	/*
> > +	 * Spread per-cpu vmstat work evenly across the interval.  Don't
> > +	 * use round_jiffies_relative() here -- it would snap every CPU
> > +	 * back to the same second boundary, defeating the spread.
> > +	 */
> > +	return interval + (interval * (smp_processor_id() % nr_cpus)) / nr_cpus;
> 
> smp_processor_id() <= nr_cpus, so
> 
> 	return interval + interval*cpu/nr_cpus
> 
> should be equivalent, no?

nr_cpus is the number of online CPUs, while smp_processor_id() is the
CPU id.

If you offline a CPU, then smp_processor_id() might be bigger than
num_online_cpus().

My goal was to linearly shift the timer and avoid creating gaps when
removing certain CPUs.

Thanks for the review,
--breno
Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval
Posted by Johannes Weiner 4 hours ago
On Wed, Apr 01, 2026 at 07:39:28AM -0700, Breno Leitao wrote:
> Hello Johannes,
> 
> On Wed, Apr 01, 2026 at 10:25:35AM -0400, Johannes Weiner wrote:
> > On Wed, Apr 01, 2026 at 06:57:50AM -0700, Breno Leitao wrote:
> > > +static unsigned long vmstat_spread_delay(void)
> > > +{
> > > +	unsigned long interval = sysctl_stat_interval;
> > > +	unsigned int nr_cpus = num_online_cpus();
> > > +
> > > +	if (nr_cpus <= 1)
> > > +		return round_jiffies_relative(interval);
> > > +
> > > +	/*
> > > +	 * Spread per-cpu vmstat work evenly across the interval.  Don't
> > > +	 * use round_jiffies_relative() here -- it would snap every CPU
> > > +	 * back to the same second boundary, defeating the spread.
> > > +	 */
> > > +	return interval + (interval * (smp_processor_id() % nr_cpus)) / nr_cpus;
> > 
> > smp_processor_id() <= nr_cpus, so
> > 
> > 	return interval + interval*cpu/nr_cpus
> > 
> > should be equivalent, no?
> 
> nr_cpus is the number of online CPUs, while smp_processor_id() is the
> CPU id.
> 
> If you offline a CPU, then smp_processor_id() might be bigger than
> num_online_cpus()
> 
> My goal was to linearly shift the timer and avoid creating gaps when
> removing certain CPUs.

Ah makes sense. Plus you'd spill into the next interval otherwise.