[PATCH 2/7] memcg: Scale down MEMCG_CHARGE_BATCH with increase in PAGE_SIZE

Posted by Waiman Long 2 weeks, 3 days ago
For a system with a 4k page size, each percpu memcg_stock can hide up
to 256 kbytes of memory with the current MEMCG_CHARGE_BATCH value of
64. On a system with a 64k page size, that becomes 4 Mbytes. The
MEMCG_CHARGE_BATCH value also controls how often the memcg vmstat
values are flushed. As a result, the values reported in the various
memory cgroup control files become even less indicative of the actual
memory consumption of a particular memory cgroup as the page size
increases beyond 4k.

This problem can be illustrated by running the test_memcontrol
selftest. With a 4k page size kernel on a 128-core arm64 system, the
test_memcg_current_peak test, which allocates 50M of anonymous memory,
passed. With a 64k page size kernel on the same system, however, the
same test failed because the "anon" entry of the memory.stat file may
report a size of 0 depending on the number of CPUs in the system.

To solve this memory stats inaccuracy, scale down MEMCG_CHARGE_BATCH,
and hence the amount of memory that can be hidden, as the page size
increases. The same user application will likely consume more memory
on a system with a larger page size, and scaling MEMCG_CHARGE_BATCH
down too far would hurt efficiency. A good compromise is therefore to
scale MEMCG_CHARGE_BATCH down by 2 for a 16k page size and by 4 for a
64k page size.

With that change, the test_memcg_current_peak test passed again with
the modified 64k page size kernel.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/memcontrol.h | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 70b685a85bf4..748cfd75d998 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -328,8 +328,14 @@ struct mem_cgroup {
  * size of first charge trial.
  * TODO: maybe necessary to use big numbers in big irons or dynamic based of the
  * workload.
+ *
+ * There are 3 common base page sizes - 4k, 16k & 64k. In order to limit the
+ * amount of memory that can be hidden in each percpu memcg_stock for a given
+ * memcg, we scale down MEMCG_CHARGE_BATCH by 2 for 16k and 4 for 64k.
  */
-#define MEMCG_CHARGE_BATCH 64U
+#define MEMCG_CHARGE_BATCH_BASE  64U
+#define MEMCG_CHARGE_BATCH_SHIFT ((PAGE_SHIFT <= 16) ? (PAGE_SHIFT - 12)/2 : 2)
+#define MEMCG_CHARGE_BATCH	 (MEMCG_CHARGE_BATCH_BASE >> MEMCG_CHARGE_BATCH_SHIFT)
 
 extern struct mem_cgroup *root_mem_cgroup;
 
-- 
2.53.0
Re: [PATCH 2/7] memcg: Scale down MEMCG_CHARGE_BATCH with increase in PAGE_SIZE
Posted by Li Wang 2 weeks, 2 days ago
Waiman Long wrote:

> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -328,8 +328,14 @@ struct mem_cgroup {
>   * size of first charge trial.
>   * TODO: maybe necessary to use big numbers in big irons or dynamic based of the
>   * workload.
> + *
> + * There are 3 common base page sizes - 4k, 16k & 64k. In order to limit the
> + * amount of memory that can be hidden in each percpu memcg_stock for a given
> + * memcg, we scale down MEMCG_CHARGE_BATCH by 2 for 16k and 4 for 64k.
>   */
> -#define MEMCG_CHARGE_BATCH 64U
> +#define MEMCG_CHARGE_BATCH_BASE  64U
> +#define MEMCG_CHARGE_BATCH_SHIFT ((PAGE_SHIFT <= 16) ? (PAGE_SHIFT - 12)/2 : 2)
> +#define MEMCG_CHARGE_BATCH	 (MEMCG_CHARGE_BATCH_BASE >> MEMCG_CHARGE_BATCH_SHIFT)

This is a good complement to the first patch. With this change applied,
here is a chart comparing the three count-threshold scaling methods
(linear, log2, sqrt):

4k page size (BATCH=64):
  
  CPUs    linear    log2     sqrt
  --------------------------------
  1       256KB     256KB    256KB
  8       2MB       1MB      512KB
  128     32MB      2MB      2.75MB
  1024    256MB     2.75MB   8MB
64k page size (BATCH=16):

  CPUs    linear    log2     sqrt
  -------------------------------
  1       1MB       1MB      1MB
  8       8MB       4MB      2MB
  128     128MB     8MB      11MB
  1024    1GB       11MB     32MB


Both log2 and sqrt are huge improvements over linear.

log2 flushes more aggressively on large systems, which gives more accurate
stats but at the cost of more frequent synchronous flushes.

sqrt is more conservative, still a massive reduction from linear but gives
more breathing room on large systems, which may be better for performance.

I will leave this choice to you, Waiman; the data above is for reference.

-- 
Regards,
Li Wang
Re: [PATCH 2/7] memcg: Scale down MEMCG_CHARGE_BATCH with increase in PAGE_SIZE
Posted by Waiman Long 2 weeks, 2 days ago
On 3/20/26 7:26 AM, Li Wang wrote:
> Waiman Long wrote:
>
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -328,8 +328,14 @@ struct mem_cgroup {
>>    * size of first charge trial.
>>    * TODO: maybe necessary to use big numbers in big irons or dynamic based of the
>>    * workload.
>> + *
>> + * There are 3 common base page sizes - 4k, 16k & 64k. In order to limit the
>> + * amount of memory that can be hidden in each percpu memcg_stock for a given
>> + * memcg, we scale down MEMCG_CHARGE_BATCH by 2 for 16k and 4 for 64k.
>>    */
>> -#define MEMCG_CHARGE_BATCH 64U
>> +#define MEMCG_CHARGE_BATCH_BASE  64U
>> +#define MEMCG_CHARGE_BATCH_SHIFT ((PAGE_SHIFT <= 16) ? (PAGE_SHIFT - 12)/2 : 2)
>> +#define MEMCG_CHARGE_BATCH	 (MEMCG_CHARGE_BATCH_BASE >> MEMCG_CHARGE_BATCH_SHIFT)
> This is a good complement to the first patch. With this change applied,
> here is a chart comparing the three count-threshold scaling methods
> (linear, log2, sqrt):
>
> 4k page size (BATCH=64):
>    
>    CPUs    linear    log2     sqrt
>    --------------------------------
>    1       256KB     256KB    256KB
>    8       2MB       1MB      512KB
>    128     32MB      2MB      2.75MB
>    1024    256MB     2.75MB   8MB
> 	
> 64k page size (BATCH=16):
>
>    CPUs    linear    log2     sqrt
>    -------------------------------
>    1       1MB       1MB      1MB
>    8       8MB       4MB      2MB
>    128     128MB     8MB      11MB
>    1024    1GB       11MB     32MB
>
>
> Both log2 and sqrt are huge improvements over linear.
>
> log2 flushes more aggressively on large systems, which gives more accurate
> stats but at the cost of more frequent synchronous flushes.
>
> sqrt is more conservative, still a massive reduction from linear but gives
> more breathing room on large systems, which may be better for performance.
>
> I will leave this choice to you, Waiman; the data above is for reference.
>
I think it is a good idea to use the int_sqrt() function and I will use 
it in the next version.

Cheers,
Longman