[PATCH 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes

Waiman Long posted 7 patches 2 weeks, 3 days ago
There is a newer version of this series
[PATCH 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes
Posted by Waiman Long 2 weeks, 3 days ago
There are a number of test failures when running the test_memcontrol
selftest on a 128-core arm64 system on kernels with 4k/16k/64k page
sizes. This patch series makes some minor changes to the kernel and
the test_memcontrol selftest to address these failures.

The first kernel patch scales the memcg vmstats flush threshold
logarithmically instead of linearly with the total number of CPUs. The
second kernel patch scales down MEMCG_CHARGE_BATCH with increases in
page size. These two patches help to reduce the discrepancies between
the reported usage data and the real ones.

The next five test_memcontrol selftest patches adjust the testing code
to greatly reduce the chance that it will report failure, though some
occasional failures are still possible.

To verify the changes, the test_memcontrol selftest was run 100
times each on a 128-core arm64 system on kernels with 4k/16k/64k
page sizes.  No failure was observed other than some failures of the
test_memcg_reclaim test when running on a 16k page size kernel. The
reclaim_until() call failed because of unexpected over-reclaim of
memory. This will need a further look, but it happens with the 16k page
size kernel only and I don't have a production-ready kernel config file
to use in building this 16k page size kernel. The new test_memcontrol
selftest and kernel were also run on a 96-core x86 system to make sure
there was no regression.

Waiman Long (7):
  memcg: Scale up vmstats flush threshold with log2(nums_possible_cpus)
  memcg: Scale down MEMCG_CHARGE_BATCH with increase in PAGE_SIZE
  selftests: memcg: Iterate pages based on the actual page size
  selftests: memcg: Increase error tolerance in accordance with page
    size
  selftests: memcg: Reduce the expected swap.peak with larger page size
  selftests: memcg: Don't call reclaim_until() if already in target
  selftests: memcg: Treat failure for zeroing sock in test_memcg_sock as
    XFAIL

 include/linux/memcontrol.h                    |  8 +-
 mm/memcontrol.c                               | 17 ++--
 .../cgroup/lib/include/cgroup_util.h          |  1 +
 .../selftests/cgroup/test_memcontrol.c        | 83 +++++++++++++++----
 4 files changed, 87 insertions(+), 22 deletions(-)

-- 
2.53.0
Re: [PATCH 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes
Posted by Andrew Morton 2 weeks, 3 days ago
On Thu, 19 Mar 2026 13:37:45 -0400 Waiman Long <longman@redhat.com> wrote:

> There are a number of test failures when running the test_memcontrol
> selftest on a 128-core arm64 system on kernels with 4k/16k/64k page
> sizes. This patch series makes some minor changes to the kernel and
> the test_memcontrol selftest to address these failures.
> 
> The first kernel patch scales the memcg vmstats flush threshold
> logarithmically instead of linearly with the total number of CPUs. The
> second kernel patch scales down MEMCG_CHARGE_BATCH with increases in
> page size. These two patches help to reduce the discrepancies between
> the reported usage data and the real ones.
> 
> The next five test_memcontrol selftest patches adjust the testing code
> to greatly reduce the chance that it will report failure, though some
> occasional failures are still possible.
> 
> To verify the changes, the test_memcontrol selftest was run 100
> times each on a 128-core arm64 system on kernels with 4k/16k/64k
> page sizes.  No failure was observed other than some failures of the
> test_memcg_reclaim test when running on a 16k page size kernel. The
> reclaim_until() call failed because of unexpected over-reclaim of
> memory. This will need a further look, but it happens with the 16k page
> size kernel only and I don't have a production-ready kernel config file
> to use in building this 16k page size kernel. The new test_memcontrol
> selftest and kernel were also run on a 96-core x86 system to make sure
> there was no regression.

AI reviewbot asks questions:
	https://sashiko.dev/#/patchset/20260319173752.1472864-1-longman%40redhat.com
Re: [PATCH 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes
Posted by Waiman Long 2 weeks, 2 days ago
On 3/19/26 10:43 PM, Andrew Morton wrote:
> On Thu, 19 Mar 2026 13:37:45 -0400 Waiman Long <longman@redhat.com> wrote:
>
>> There are a number of test failures when running the test_memcontrol
>> selftest on a 128-core arm64 system on kernels with 4k/16k/64k page
>> sizes. This patch series makes some minor changes to the kernel and
>> the test_memcontrol selftest to address these failures.
>>
>> The first kernel patch scales the memcg vmstats flush threshold
>> logarithmically instead of linearly with the total number of CPUs. The
>> second kernel patch scales down MEMCG_CHARGE_BATCH with increases in
>> page size. These two patches help to reduce the discrepancies between
>> the reported usage data and the real ones.
>>
>> The next five test_memcontrol selftest patches adjust the testing code
>> to greatly reduce the chance that it will report failure, though some
>> occasional failures are still possible.
>>
>> To verify the changes, the test_memcontrol selftest was run 100
>> times each on a 128-core arm64 system on kernels with 4k/16k/64k
>> page sizes.  No failure was observed other than some failures of the
>> test_memcg_reclaim test when running on a 16k page size kernel. The
>> reclaim_until() call failed because of unexpected over-reclaim of
>> memory. This will need a further look, but it happens with the 16k page
>> size kernel only and I don't have a production-ready kernel config file
>> to use in building this 16k page size kernel. The new test_memcontrol
>> selftest and kernel were also run on a 96-core x86 system to make sure
>> there was no regression.
> AI reviewbot asks questions:
> 	https://sashiko.dev/#/patchset/20260319173752.1472864-1-longman%40redhat.com

 >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 >> --- a/mm/memcontrol.c
 >> +++ b/mm/memcontrol.c
 >> @@ -548,20 +548,20 @@ struct memcg_vmstats {
 >>   *    rstat update tree grow unbounded.
 >>   *
 >>   * 2) Flush the stats synchronously on reader side only when there are more than
 >> - *    (MEMCG_CHARGE_BATCH * nr_cpus) update events. Though this optimization
 >> - *    will let stats be out of sync by atmost (MEMCG_CHARGE_BATCH * nr_cpus) but
 >> - *    only for 2 seconds due to (1).
 >> + *    (MEMCG_CHARGE_BATCH * (ilog2(nr_cpus) + 1)) update events. Though this
 >> + *    optimization will let stats be out of sync by up to that amount but only
 >> + *    for 2 seconds due to (1).
 > Is this description accurate regarding the maximum out of sync amount?
 > Looking at memcg_rstat_updated(), updates are buffered locally on each CPU
 > up to MEMCG_CHARGE_BATCH - 1 before they are added to the global
 > vmstats->stats_updates counter.
 > Because memcg_vmstats_needs_flush() only checks the global counter, could
 > N CPUs each buffer MEMCG_CHARGE_BATCH - 1 updates without triggering a
 > synchronous flush?
 > If so, wouldn't the actual worst-case out-of-sync error be
 > N * (MEMCG_CHARGE_BATCH - 1) + vmstats_flush_threshold, which remains
 > linear with the number of CPUs rather than scaling logarithmically?

Good point, the worst case scenario can indeed be worse than that. I
will update the comment accordingly.

 >> @@ -328,8 +328,14 @@ struct mem_cgroup {
 >>   * size of first charge trial.
 >>   * TODO: maybe necessary to use big numbers in big irons or dynamic based of the
 >>   * workload.
 >> + *
 >> + * There are 3 common base page sizes - 4k, 16k & 64k. In order to limit the
 >> + * amount of memory that can be hidden in each percpu memcg_stock for a given
 >> + * memcg, we scale down MEMCG_CHARGE_BATCH by 2 for 16k and 4 for 64k.
 >>   */
 >> -#define MEMCG_CHARGE_BATCH 64U
 >> +#define MEMCG_CHARGE_BATCH_BASE  64U
 >> +#define MEMCG_CHARGE_BATCH_SHIFT ((PAGE_SHIFT <= 16) ? (PAGE_SHIFT - 12)/2 : 2)
 >> +#define MEMCG_CHARGE_BATCH     (MEMCG_CHARGE_BATCH_BASE >> MEMCG_CHARGE_BATCH_SHIFT)
 > Will this cause false failures in the cgroup selftests on systems with
 > larger page sizes?

Exactly, Li Wang had posted a patch [1] to address this issue. If that
patch is adopted, we will have to update this code again to reduce the
expected MAX_VMSTAT_ERROR accordingly.

[1] https://lore.kernel.org/lkml/20260306071843.149147-1-liwang@redhat.com/

 > In tools/testing/selftests/cgroup/test_kmem.c, MAX_VMSTAT_ERROR is hardcoded
 > assuming a 4KB page size and a 64-page charge batch (256KB per CPU):
 > #define MAX_VMSTAT_ERROR (4096 * 64 * get_nprocs())
 > ...
 >     if (labs(sum - current) < MAX_VMSTAT_ERROR) {
 > ...
 > With this change, a 64KB page system will have a batch size of 16. The
 > theoretical per-CPU error becomes 16 pages * 64KB = 1024KB per CPU, which
 > is 4 times larger than the hardcoded MAX_VMSTAT_ERROR.
 > Does MAX_VMSTAT_ERROR in test_kmem.c need to be updated to compute the error
 > using the runtime system page size and the adjusted batch limit?

 >> diff --git a/tools/testing/selftests/cgroup/lib/include/cgroup_util.h b/tools/testing/selftests/cgroup/lib/include/cgroup_util.h
 >> index 77f386dab5e8..c25228a78b8b 100644
 >> --- a/tools/testing/selftests/cgroup/lib/include/cgroup_util.h
 >> +++ b/tools/testing/selftests/cgroup/lib/include/cgroup_util.h
 >> @@ -6,6 +6,7 @@
 >>  #define PAGE_SIZE 4096
 >>  #endif
 >>
 >> +#define KB(x) (x << 10)
 > This isn't a bug, but could this macro definition risk operator precedence
 > issues if invoked with an expression? Should it be defined with inner
 > parentheses around the argument, like #define KB(x) ((x) << 10)?

So far, only a single integer literal is used for x, so the current
macro is still OK. That said, I agree that adding the inner parentheses
would be safer for future use cases.

 >> @@ -1181,7 +1199,7 @@ static int test_memcg_swap_max_peak(const char *root)
 >>      if (cg_read_long(memcg, "memory.peak") < MB(29))
 >>          goto cleanup;
 >>
 >> -    if (cg_read_long(memcg, "memory.swap.peak") < MB(29))
 >> +    if (cg_read_long(memcg, "memory.swap.peak") < swap_peak)
 >>          goto cleanup;
 >>
 >>      if (cg_run(memcg, alloc_anon_50M_check_swap, (void *)MB(30)))
 >>          goto cleanup;
 > Does leaving memory.high set to 29M alter the semantics of the subsequent
 > alloc_anon_50M_check_swap test?
 > When alloc_anon_50M_check_swap runs, it allocates 50MB and expects
 > memory.current to be close to the 30MB memory.max limit.
 > Because memory.high is still set to 29MB from the earlier operation, the 50MB
 > allocation gets throttled at 29MB and enters synchronous reclaim before
 > reaching the 30MB memory.max limit.
 > Could this inadvertently change the second test from verifying swap behavior
 > under a hard limit to verifying it under a soft limit?

The purpose of setting memory.high to 29M is to slow down the memory
allocation process so that the swapping code can swap out more anonymous
memory before the OOM killer comes in and kills the process. Otherwise,
the actual swap-out value will be even lower with larger page sizes. I
can drop the memory.high setting and lower the threshold further to
avoid expected failures.

 >> @@ -1477,12 +1477,20 @@ static int test_memcg_sock(const char *root)
 >>       * Poll memory.stat for up to 3 seconds (~FLUSH_TIME plus some
 >>       * scheduling slack) and require that the "sock " counter
 >>       * eventually drops to zero.
 >> +     *
 >> +     * The actual run-to-run elapsed time between consecutive runs
 >> +     * of the asynchronous memcg rstat flush may vary quite a bit.
 >> +     * So the 3 seconds wait time may not be enough for the "sock"
 >> +     * counter to go down to 0. Treat it as an XFAIL instead of
 >> +     * a FAIL.
 >>       */
 >>      sock_post = cg_read_key_long_poll(memcg, "memory.stat", "sock ", 0,
 >>                       MEMCG_SOCKSTAT_WAIT_RETRIES,
 >>                       DEFAULT_WAIT_INTERVAL_US);
 >> -    if (sock_post)
 >> +    if (sock_post) {
 >> +        ret = KSFT_XFAIL;
 >>          goto cleanup;
 >> +    }
 > Does this code inadvertently mask actual system or I/O errors?
 > If cg_read_key_long_poll() returns -1 because it failed to read the file or
 > the key is missing, this check will treat it as an expected timeout failure
 > rather than an actual test failure.

I will add a positive value check before setting KSFT_XFAIL.

 > Does marking this condition as KSFT_XFAIL prevent the test from catching
 > genuine socket memory leaks?
 > If a kernel regression causes socket memory to actually leak, the test will
 > time out and report an expected failure, which CI systems might ignore.
 > Would it be more robust to increase the polling timeout to accommodate the
 > maximum latency observed, or manually trigger a synchronous flush, instead
 > of masking the timeout?

We may have to increase the timeout excessively to allow for the
possible variations in the asynchronous vmstats flush delay, and that
may make the test take too long to run. In my own testing, the current
code fails rather frequently without this change.

I do suggest that we look into this issue; the expected failure can be
removed once it is fixed.

Cheers,
Longman

Re: [PATCH 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes
Posted by Waiman Long 2 weeks, 2 days ago
On 3/20/26 11:56 AM, Waiman Long wrote:
> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> --- a/mm/memcontrol.c
> >> +++ b/mm/memcontrol.c
> >> @@ -548,20 +548,20 @@ struct memcg_vmstats {
> >>   *    rstat update tree grow unbounded.
> >>   *
> >>   * 2) Flush the stats synchronously on reader side only when there are more than
> >> - *    (MEMCG_CHARGE_BATCH * nr_cpus) update events. Though this optimization
> >> - *    will let stats be out of sync by atmost (MEMCG_CHARGE_BATCH * nr_cpus) but
> >> - *    only for 2 seconds due to (1).
> >> + *    (MEMCG_CHARGE_BATCH * (ilog2(nr_cpus) + 1)) update events. Though this
> >> + *    optimization will let stats be out of sync by up to that amount but only
> >> + *    for 2 seconds due to (1).
> > Is this description accurate regarding the maximum out of sync amount?
> > Looking at memcg_rstat_updated(), updates are buffered locally on each CPU
> > up to MEMCG_CHARGE_BATCH - 1 before they are added to the global
> > vmstats->stats_updates counter.
> > Because memcg_vmstats_needs_flush() only checks the global counter, could
> > N CPUs each buffer MEMCG_CHARGE_BATCH - 1 updates without triggering a
> > synchronous flush?
> > If so, wouldn't the actual worst-case out-of-sync error be
> > N * (MEMCG_CHARGE_BATCH - 1) + vmstats_flush_threshold, which remains
> > linear with the number of CPUs rather than scaling logarithmically?
>
> Good point, the worst case scenario can indeed be worse than that. I
> will update the comment accordingly.

Looking at the code again, the hidden charges in memcg_stock should only
affect memory.current, not memory.stat. So there is nothing extra to add
to the worst-case situation described in the comment.

Cheers,
Longman