The vmstats flush threshold currently increases linearly with the
number of online CPUs. As the number of CPUs increases over time, it
will become increasingly difficult to meet the threshold and update the
vmstats data in a timely manner. These days, systems with hundreds of
CPUs or even thousands of them are becoming more common.
For example, the test_memcg_sock test of test_memcontrol always fails
when running on an arm64 system with 128 CPUs. This is because the
threshold becomes 64*128 = 8192 update events. With a 4k page size, that
requires changes to 32 MB of memory before a flush is triggered. It will
be even worse with larger page sizes like 64k.
To make the output of memory.stat more accurate, it is better to scale
the threshold up more slowly than linearly with the number of CPUs. The
int_sqrt() function is a good compromise, as suggested by Li Wang [1].
An extra 2 is added to make sure that the threshold is doubled for a
2-core system. The increase will be slower after that.
With the int_sqrt() scale, we can use the possibly larger
num_possible_cpus() instead of num_online_cpus(), which may change at
run time.
Although there is supposed to be a periodic, asynchronous flush of
vmstats every 2 seconds, the actual time lag between successive runs
can vary quite a bit. In fact, I have seen time lags of up to tens of
seconds in some cases. So we cannot rely too heavily on the hope that
there will be an asynchronous vmstats flush every 2 seconds. This may
be something we need to look into.
[1] https://lore.kernel.org/lkml/ab0kAE7mJkEL9kWb@redhat.com/
Suggested-by: Li Wang <liwang@redhat.com>
Signed-off-by: Waiman Long <longman@redhat.com>
---
mm/memcontrol.c | 18 +++++++++++++-----
1 file changed, 13 insertions(+), 5 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 772bac21d155..cc1fc0f5aeea 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -548,20 +548,20 @@ struct memcg_vmstats {
* rstat update tree grow unbounded.
*
* 2) Flush the stats synchronously on reader side only when there are more than
- * (MEMCG_CHARGE_BATCH * nr_cpus) update events. Though this optimization
- * will let stats be out of sync by atmost (MEMCG_CHARGE_BATCH * nr_cpus) but
- * only for 2 seconds due to (1).
+ * (MEMCG_CHARGE_BATCH * int_sqrt(nr_cpus+2)) update events. Though this
+ * optimization will let stats be out of sync by up to that amount. This is
+ * supposed to last for up to 2 seconds due to (1).
*/
static void flush_memcg_stats_dwork(struct work_struct *w);
static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
static u64 flush_last_time;
+static int vmstats_flush_threshold __ro_after_init;
#define FLUSH_TIME (2UL*HZ)
static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats)
{
- return atomic_read(&vmstats->stats_updates) >
- MEMCG_CHARGE_BATCH * num_online_cpus();
+ return atomic_read(&vmstats->stats_updates) > vmstats_flush_threshold;
}
static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val,
@@ -5191,6 +5191,14 @@ int __init mem_cgroup_init(void)
memcg_pn_cachep = KMEM_CACHE(mem_cgroup_per_node,
SLAB_PANIC | SLAB_HWCACHE_ALIGN);
+ /*
+ * Scale up vmstats flush threshold with int_sqrt(nr_cpus+2). The extra
+ * 2 constant is to make sure that the threshold is double for a 2-core
+ * system. After that, it will increase by MEMCG_CHARGE_BATCH when the
+ * number of the CPUs reaches the next (2^n - 2) value.
+ */
+ vmstats_flush_threshold = MEMCG_CHARGE_BATCH *
+ (int_sqrt(num_possible_cpus() + 2));
return 0;
}
--
2.53.0
Hello Waiman and Li.

On Fri, Mar 20, 2026 at 04:42:35PM -0400, Waiman Long <longman@redhat.com> wrote:
> The vmstats flush threshold currently increases linearly with the
> number of online CPUs. As the number of CPUs increases over time, it
> will become increasingly difficult to meet the threshold and update the
> vmstats data in a timely manner. These days, systems with hundreds of
> CPUs or even thousands of them are becoming more common.
>
> For example, the test_memcg_sock test of test_memcontrol always fails
> when running on an arm64 system with 128 CPUs. It is because the
> threshold is now 64*128 = 8192. With 4k page size, it needs changes in
> 32 MB of memory. It will be even worse with larger page size like 64k.
>
> To make the output of memory.stat more correct, it is better to scale
> up the threshold slower than linearly with the number of CPUs. The
> int_sqrt() function is a good compromise as suggested by Li Wang [1].
> An extra 2 is added to make sure that we will double the threshold for
> a 2-core system. The increase will be slower after that.

The explanation seems [1] to just pick a function because log seemed too
slow.

(We should add a BPF hook to calculate the threshold. Haha.)

The threshold has a twofold role: to bound error and to preserve some
performance thanks to laziness, and these two go against each other when
determining the threshold. The reasoning for linear scaling is that
_each_ CPU contributes some updates, so that preserves the laziness.
Whereas error capping would hint at no dependency on nr_cpus.

My idea is that a job associated with a selected memcg doesn't
necessarily run on _all_ CPUs of (such big) machines but effectively
causes updates on J CPUs. (Either they're artificially constrained or
they simply are not-so-parallel jobs.) Hence the threshold should be
based on that J and not the actual nr_cpus.

Now the question is what is the expected (CPU) size of a job, and for
that I'd consider a distribution like:
- 1 job of size nr_cpus,	// you'd overcommit your machine with a bigger job
- 2 jobs of size nr_cpus/2,
- 3 jobs of size nr_cpus/3,
- ...
- nr_cpus jobs of size 1.	// you'd underutilize the machine with fewer

Note this is a quite naïve and arbitrary deliberation of mine, but it
results in something like a Pareto distribution, which is IMO quite
reasonable. With (only) that assumption, I can estimate the average size
of jobs as

	nr_cpus / (log(nr_cpus) + 1)

(it's the natural logarithm from the harmonic series, and the +1 is from
that approximation too; it comes in handy also on UP)

	log(x) = ilog2(x) * log(2)/log(e) ~ ilog2(x) * 0.69
	log(x) ~ 45426 * ilog2(x) / 65536

or

	65536*nr_cpus / (45426 * ilog2(nr_cpus) + 65536)

with kernel functions:

	var1 = 65536*nr_cpus / (45426 * ilog2(nr_cpus) + 65536)
	var2 = DIV_ROUND_UP(65536*nr_cpus, 45426 * ilog2(nr_cpus) + 65536)
	var3 = roundup_pow_of_two(var2)

I hope I don't need to present any more numbers at this moment because
the parameter derivation is backed by solid theory ;-) [*]

> With the int_sqrt() scale, we can use the possibly larger
> num_possible_cpus() instead of num_online_cpus() which may change at
> run time.

Hm, the inverted log turns this into a dilemma whether to support
hotplug or keep performance at threshold comparisons. But it wouldn't be
the first place where static initialization with the possible count is
used.

> Although there is supposed to be a periodic and asynchronous flush of
> vmstats every 2 seconds, the actual time lag between succesive runs
> can actually vary quite a bit. In fact, I have seen time lags of up
> to 10s of seconds in some cases. So we couldn't too rely on the hope
> that there will be an asynchronous vmstats flush every 2 seconds. This
> may be something we need to look into.

Yes, this sounds like a separate issue. I wouldn't mention it in this
commit unless you mean it's particularly related to the large nr_cpus.

> @@ -5191,6 +5191,14 @@ int __init mem_cgroup_init(void)
>
>  	memcg_pn_cachep = KMEM_CACHE(mem_cgroup_per_node,
>  				     SLAB_PANIC | SLAB_HWCACHE_ALIGN);
> +	/*
> +	 * Scale up vmstats flush threshold with int_sqrt(nr_cpus+2). The extra
> +	 * 2 constant is to make sure that the threshold is double for a 2-core
> +	 * system. After that, it will increase by MEMCG_CHARGE_BATCH when the
> +	 * number of the CPUs reaches the next (2^n - 2) value.

When you switched to sqrt, the comment should read n^2, not 2^n.

> +	 */
> +	vmstats_flush_threshold = MEMCG_CHARGE_BATCH *
> +				  (int_sqrt(num_possible_cpus() + 2));
>
>  	return 0;
>  }
> --
> 2.53.0

(I will look at the rest of the series later. It looks interesting.)

[*]
	nr_cpus	var1	var2	var3
	1	1	1	1
	2	1	2	2
	4	1	2	2
	8	2	3	4
	16	4	5	8
	32	7	8	8
	64	12	13	16
	128	21	22	32
	256	39	40	64
	512	70	71	128
	1024	129	130	256
Hi Michal,

> Hello Waiman and Li.
> ...
> The explanation seems [1] to just pick a function because log seemed too
> slow.
>
> (We should add a BPF hook to calculate the threshold. Haha.)
>
> The threshold has a twofold role: to bound error and to preserve some
> performance thanks to laziness, and these two go against each other when
> determining the threshold. The reasoning for linear scaling is that
> _each_ CPU contributes some updates, so that preserves the laziness.
> Whereas error capping would hint at no dependency on nr_cpus.
>
> My idea is that a job associated with a selected memcg doesn't
> necessarily run on _all_ CPUs of (such big) machines but effectively
> causes updates on J CPUs. (Either they're artificially constrained or
> they simply are not-so-parallel jobs.)
> Hence the threshold should be based on that J and not the actual nr_cpus.

I completely agree on this point.

> Now the question is what is the expected (CPU) size of a job, and for
> that I'd consider a distribution like:
> - 1 job of size nr_cpus,	// you'd overcommit your machine with a bigger job
> - 2 jobs of size nr_cpus/2,
> - 3 jobs of size nr_cpus/3,
> - ...
> - nr_cpus jobs of size 1.	// you'd underutilize the machine with fewer
>
> Note this is a quite naïve and arbitrary deliberation of mine, but it
> results in something like a Pareto distribution, which is IMO quite
> reasonable. With (only) that assumption, I can estimate the average size
> of jobs as
>	nr_cpus / (log(nr_cpus) + 1)
> (it's the natural logarithm from the harmonic series, and the +1 is from
> that approximation too; it comes in handy also on UP)
>
>	log(x) = ilog2(x) * log(2)/log(e) ~ ilog2(x) * 0.69
>	log(x) ~ 45426 * ilog2(x) / 65536
>
> or
>	65536*nr_cpus / (45426 * ilog2(nr_cpus) + 65536)
>
> with kernel functions:
>	var1 = 65536*nr_cpus / (45426 * ilog2(nr_cpus) + 65536)
>	var2 = DIV_ROUND_UP(65536*nr_cpus, 45426 * ilog2(nr_cpus) + 65536)
>	var3 = roundup_pow_of_two(var2)
>
> I hope I don't need to present any more numbers at this moment because
> the parameter derivation is backed by solid theory ;-) [*]

It is an elegant method but still not based on the J CPUs. As you
capture, the core tension is that bounding error wants the threshold as
small as possible, while preserving laziness wants it as large as
possible. Any scheme is a compromise between the two. But there are
several practical issues:

The threshold formula is system-wide, while each memcg has its own
counter; they all evaluate against the same MEMCG_CHARGE_BATCH *
f(nr_cpu_ids), with no awareness of how many CPUs are actually active
for that particular memcg. Small tasks with J=2 coexist with large
services where J approaches nr_cpus, yet they all face the same
threshold.

The ln-harmonic formula optimizes for the average J, but the workloads
that most critically need an accurate memory.stat are precisely those
spanning many CPUs, well above average. Moreover, the "average J"
estimate assumes tasks are uniformly distributed across CPUs, which
rarely holds in practice with cpuset constraints, NUMA affinity, and
nested cgroup hierarchies.

And even accepting that estimate, the data shows ln-harmonic still
yields 237MB of error at 2048 CPUs with 64K pages -- still large enough
to cause selftest failures.

In short: the theoretical analysis is sound, but the conclusion
conflates the average case with the worst case. Under the constraint of
a single global threshold, sqrt remains the more robust choice. In the
future, if a J-sensitive per-memcg threshold can be achieved, then your
ln-harmonic method is the most ideal formula.

To compare the three methods (linear, sqrt, ln-harmonic):

4K page size (BATCH=64):

	CPUs	linear	sqrt	ln-var3
	--------------------------------
	1	256KB	256KB	256KB
	2	512KB	512KB	512KB
	4	1MB	512KB	512KB
	8	2MB	768KB	1MB
	16	4MB	1MB	2MB
	32	8MB	1.25MB	2MB
	64	16MB	2MB	4MB
	128	32MB	2.75MB	8MB
	256	64MB	4MB	16MB
	512	128MB	5.5MB	32MB
	1024	256MB	8MB	64MB
	2048	512MB	11.25MB	64MB

64K page size (BATCH=16):

	CPUs	linear	sqrt	ln-var3
	--------------------------------
	1	1MB	1MB	1MB
	2	2MB	2MB	2MB
	4	4MB	2MB	2MB
	8	8MB	3MB	4MB
	16	16MB	4MB	8MB
	32	32MB	5MB	8MB
	64	64MB	8MB	16MB
	128	128MB	11MB	32MB
	256	256MB	16MB	64MB
	512	512MB	22MB	128MB
	1024	1GB	32MB	256MB
	2048	2GB	45MB	256MB

-- 
Regards,
Li Wang
> > with kernel functions:
> > 	var1 = 65536*nr_cpus / (45426 * ilog2(nr_cpus) + 65536)
> > 	var2 = DIV_ROUND_UP(65536*nr_cpus, 45426 * ilog2(nr_cpus) + 65536)
> > 	var3 = roundup_pow_of_two(var2)

Consider a 1024-CPU machine with a cpuset-constrained cgroup using only
2 CPUs. Its unavoidable batching error is just 2MB, yet the global
threshold imposes 256MB (harmonic-mean) or 32MB (sqrt) of additional
error -- 128x and 16x overprovisioning respectively. Both overshoot, but
sqrt stays a bit closer to the ideal.

-- 
Regards,
Li Wang
On Fri, Mar 20, 2026 at 04:42:35PM -0400, Waiman Long wrote:
> The vmstats flush threshold currently increases linearly with the
> number of online CPUs. As the number of CPUs increases over time, it
> will become increasingly difficult to meet the threshold and update the
> vmstats data in a timely manner. These days, systems with hundreds of
> CPUs or even thousands of them are becoming more common.
>
> For example, the test_memcg_sock test of test_memcontrol always fails
> when running on an arm64 system with 128 CPUs. It is because the
> threshold is now 64*128 = 8192. With 4k page size, it needs changes in
> 32 MB of memory. It will be even worse with larger page size like 64k.
>
> To make the output of memory.stat more correct, it is better to scale
> up the threshold slower than linearly with the number of CPUs. The
> int_sqrt() function is a good compromise as suggested by Li Wang [1].
> An extra 2 is added to make sure that we will double the threshold for
> a 2-core system. The increase will be slower after that.
>
> With the int_sqrt() scale, we can use the possibly larger
> num_possible_cpus() instead of num_online_cpus() which may change at
> run time.
>
> Although there is supposed to be a periodic and asynchronous flush of
> vmstats every 2 seconds, the actual time lag between succesive runs
> can actually vary quite a bit. In fact, I have seen time lags of up
> to 10s of seconds in some cases. So we couldn't too rely on the hope
> that there will be an asynchronous vmstats flush every 2 seconds. This
> may be something we need to look into.
>
> [1] https://lore.kernel.org/lkml/ab0kAE7mJkEL9kWb@redhat.com/
>
> Suggested-by: Li Wang <liwang@redhat.com>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
> mm/memcontrol.c | 18 +++++++++++++-----
> 1 file changed, 13 insertions(+), 5 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 772bac21d155..cc1fc0f5aeea 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -548,20 +548,20 @@ struct memcg_vmstats {
> * rstat update tree grow unbounded.
> *
> * 2) Flush the stats synchronously on reader side only when there are more than
> - * (MEMCG_CHARGE_BATCH * nr_cpus) update events. Though this optimization
> - * will let stats be out of sync by atmost (MEMCG_CHARGE_BATCH * nr_cpus) but
> - * only for 2 seconds due to (1).
> + * (MEMCG_CHARGE_BATCH * int_sqrt(nr_cpus+2)) update events. Though this
> + * optimization will let stats be out of sync by up to that amount. This is
> + * supposed to last for up to 2 seconds due to (1).
> */
> static void flush_memcg_stats_dwork(struct work_struct *w);
> static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
> static u64 flush_last_time;
> +static int vmstats_flush_threshold __ro_after_init;
>
> #define FLUSH_TIME (2UL*HZ)
>
> static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats)
> {
> - return atomic_read(&vmstats->stats_updates) >
> - MEMCG_CHARGE_BATCH * num_online_cpus();
> + return atomic_read(&vmstats->stats_updates) > vmstats_flush_threshold;
> }
>
> static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val,
> @@ -5191,6 +5191,14 @@ int __init mem_cgroup_init(void)
>
> memcg_pn_cachep = KMEM_CACHE(mem_cgroup_per_node,
> SLAB_PANIC | SLAB_HWCACHE_ALIGN);
> + /*
> + * Scale up vmstats flush threshold with int_sqrt(nr_cpus+2). The extra
> + * 2 constant is to make sure that the threshold is double for a 2-core
> + * system. After that, it will increase by MEMCG_CHARGE_BATCH when the
> + * number of the CPUs reaches the next (2^n - 2) value.
> + */
> + vmstats_flush_threshold = MEMCG_CHARGE_BATCH *
> + (int_sqrt(num_possible_cpus() + 2));
>
> return 0;
> }
Reviewed-by: Li Wang <liwang@redhat.com>
--
Regards,
Li Wang
On Mon, Mar 23, 2026 at 5:46 AM Li Wang <liwang@redhat.com> wrote:
>
> On Fri, Mar 20, 2026 at 04:42:35PM -0400, Waiman Long wrote:
> > The vmstats flush threshold currently increases linearly with the
> > number of online CPUs. [...]
> >
> > [1] https://lore.kernel.org/lkml/ab0kAE7mJkEL9kWb@redhat.com/
> >
> > Suggested-by: Li Wang <liwang@redhat.com>
> > Signed-off-by: Waiman Long <longman@redhat.com>

What's the motivation for this fix? Is it purely to make tests more
reliable on systems with larger page sizes?

We need some performance tests to make sure we're not flushing too
eagerly with the sqrt scale imo. We need to make sure that when we have
a lot of cgroups and a lot of flushers we don't end up performing worse.
On 3/23/26 8:15 PM, Yosry Ahmed wrote:
> On Mon, Mar 23, 2026 at 5:46 AM Li Wang <liwang@redhat.com> wrote:
>> On Fri, Mar 20, 2026 at 04:42:35PM -0400, Waiman Long wrote:
>>> The vmstats flush threshold currently increases linearly with the
>>> number of online CPUs. [...]
>>>
>>> Suggested-by: Li Wang <liwang@redhat.com>
>>> Signed-off-by: Waiman Long <longman@redhat.com>
>
> What's the motivation for this fix? Is it purely to make tests more
> reliable on systems with larger page sizes?
>
> We need some performance tests to make sure we're not flushing too
> eagerly with the sqrt scale imo. We need to make sure that when we
> have a lot of cgroups and a lot of flushers we don't end up performing
> worse.

I will include some performance data in the next version. Do you have
any suggestions for readily available tests that I can use for this
performance testing purpose?

Cheers,
Longman
On Wed, Mar 25, 2026 at 9:47 AM Waiman Long <longman@redhat.com> wrote:
>
> On 3/23/26 8:15 PM, Yosry Ahmed wrote:
> > On Mon, Mar 23, 2026 at 5:46 AM Li Wang <liwang@redhat.com> wrote:
> >> On Fri, Mar 20, 2026 at 04:42:35PM -0400, Waiman Long wrote:
> >>> The vmstats flush threshold currently increases linearly with the
> >>> number of online CPUs. [...]
> >
> > What's the motivation for this fix? Is it purely to make tests more
> > reliable on systems with larger page sizes?
> >
> > We need some performance tests to make sure we're not flushing too
> > eagerly with the sqrt scale imo. We need to make sure that when we
> > have a lot of cgroups and a lot of flushers we don't end up performing
> > worse.
>
> I will include some performance data in the next version. Do you have
> any suggestion of which readily available tests that I can use for this
> performance testing purpose.

I am not sure what readily available tests can stress this. In the past,
I wrote a synthetic workload that spawns a lot of readers of memory.stat
in userspace, as well as reclaimers to trigger flushing from both the
kernel and userspace, with a large number of cgroups. I don't have that
lying around unfortunately.