per-memcg-per-node kmem accounting

[PATCH 0/8] per-memcg-per-node kmem accounting

Posted by Alexandre Ghiti 1 month ago

This series continues the work started by Joshua [1]. Kernel memory
needs to be accounted on a per-node basis so that we can tell which
physical memory each memcg is actually using.
  
This series takes advantage of the recently introduced per-node
obj_cgroup [2] and ties each obj_cgroup to its NUMA node.
  
The bulk of the series is percpu per-node accounting: percpu  
"precharges" the memcg before we know the actual location of the pages  
it uses, so charging and accounting had to be split. All other kmem 
users (slab, zswap, __memcg_kmem_charge_page) are straightforward 
conversions (zswap support is limited in this series because Joshua 
is working on it in parallel [3]). 
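
To illustrate the charge/account split described above, here is a minimal
user-space sketch in C. It is not the code from this series: struct
memcg_model, model_precharge(), model_account() and MAX_NODES are
hypothetical names chosen for the example. It only shows the ordering: the
limit is charged before page placement is known, and per-node attribution
happens afterwards.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 4 /* hypothetical node count for this sketch */

/* Simplified stand-in for a memcg: one global charge, per-node kmem stats. */
struct memcg_model {
	long limit_pages;
	long charged_pages;
	long kmem_pages[MAX_NODES];
};

/* Step 1: charge against the limit before page placement is known. */
static int model_precharge(struct memcg_model *m, long nr_pages)
{
	if (m->charged_pages + nr_pages > m->limit_pages)
		return -1; /* would exceed the limit, nothing is charged */
	m->charged_pages += nr_pages;
	return 0;
}

/* Step 2: once the pages exist, attribute each one to its backing node. */
static void model_account(struct memcg_model *m, const int *page_nid,
			  size_t nr_pages)
{
	for (size_t i = 0; i < nr_pages; i++) {
		assert(page_nid[i] >= 0 && page_nid[i] < MAX_NODES);
		m->kmem_pages[page_nid[i]]++;
	}
}
```

A caller would invoke model_precharge() once for the whole allocation, then
model_account() with the node id of every page it actually obtained.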
 
Thanks Joshua for your early feedback!
  
[1] https://lore.kernel.org/linux-mm/20260404033844.1892595-1-joshua.hahnjy@gmail.com/  
[2] https://lore.kernel.org/linux-mm/56c04b1c5d54f75ccdc12896df6c1ca35403ecc3.1772711148.git.zhengqi.arch@bytedance.com/  
[3] https://lore.kernel.org/linux-mm/20260311195153.4013476-1-joshua.hahnjy@gmail.com/

Alexandre Ghiti (8):
  mm: memcontrol: propagate NMI slab stats to memcg vmstats
  mm: percpu: charge obj_exts allocation with __GFP_ACCOUNT
  mm: percpu: Split memcg charging and kmem accounting
  mm: memcontrol: track MEMCG_KMEM per NUMA node
  mm: memcontrol: per-node kmem accounting for page charges
  mm: slab: per-node kmem accounting for slab
  mm: percpu: per-node kmem accounting using local credit
  mm: zswap: per-node kmem accounting for zswap/zsmalloc

 include/linux/memcontrol.h |  27 +++++--
 include/linux/mmzone.h     |   1 +
 include/linux/zsmalloc.h   |   2 +
 mm/memcontrol.c            | 150 ++++++++++++++++++++++++++++---------
 mm/percpu-internal.h       |  16 +---
 mm/percpu.c                |  90 ++++++++++++++++++++--
 mm/vmstat.c                |   1 +
 mm/zsmalloc.c              |  11 +++
 mm/zswap.c                 |   9 ++-
 9 files changed, 242 insertions(+), 65 deletions(-)

-- 
2.54.0

Re: [PATCH 0/8] per-memcg-per-node kmem accounting

Posted by Joshua Hahn 3 weeks, 4 days ago

On Mon, 11 May 2026 22:20:35 +0200 Alexandre Ghiti <alex@ghiti.fr> wrote:

> This series pursues the work initiated by Joshua [1]. We need kernel  
> memory to be accounted on a per-node basis in order to be able to  
> know the memcg and physical memory association.  
>   
> This series takes advantage of the recent introduction of per-node  
> obj_cgroup [2] and makes those obj_cgroup tied to their numa node.  
>   
> The bulk of the series is percpu per-node accounting: percpu  
> "precharges" the memcg before we know the actual location of the pages  
> it uses, so charging and accounting had to be split. All other kmem 
> users (slab, zswap, __memcg_kmem_charge_page) are straightforward 
> conversions (zswap support is limited in this series because Joshua 
> is working on it in parallel [3]). 
>  
> Thanks Joshua for your early feedback!

Hello Alex,

Thank you for your work!

Overall, I think the direction makes sense. Pre-overcharging also seems
like the right approach: we would much rather overaccount than underaccount
and later have to breach limits.

I do have some concerns about performance, though. Namely, there are some
expensive operations that I think would benefit from performance
benchmarking with these patches applied (maybe some simple microbenchmarks
that demonstrate kernel allocation overhead could be useful).

From what I can tell, there is some additional performance overhead that has
to do with iterating over num_possible_cpus() x pages_per_alloc, which
doesn't seem trivial to me.

Another concern that I see is the stock credit system. Maybe we could be
bypassing the stock check leading to more time spent doing the atomic
operations.

obj_stock caches a single obj_cgroup, which means that if we split the objcg
to be per-node (in patch 6), then the obj_stock basically gets invalidated
every operation since we iterate over more objcgs (even though we are in
the same logical objcg). Maybe I'm missing something?
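
To make the concern concrete, here is a simplified user-space model in C of
a single-entry stock cache. It is only a sketch under stated assumptions:
struct objcg_model, struct stock_model, stock_consume() and STOCK_BATCH are
hypothetical names, not the kernel's obj_stock implementation, and requests
are assumed to be no larger than STOCK_BATCH. It counts how often the
cached entry has to be replaced or refilled.

```c
#include <assert.h>
#include <stddef.h>

#define STOCK_BATCH 64 /* hypothetical refill size, in arbitrary units */

/* Simplified stand-in for an obj_cgroup; only its identity matters here. */
struct objcg_model {
	int nid;
};

/* Single-entry cache: credit is only usable for the cached objcg. */
struct stock_model {
	struct objcg_model *cached;
	long credit;
	long refills; /* slow-path count, for comparison */
};

/* Consume nr units (nr <= STOCK_BATCH) on behalf of objcg. */
static void stock_consume(struct stock_model *s, struct objcg_model *objcg,
			  long nr)
{
	if (s->cached != objcg || s->credit < nr) {
		/* Slow path: drop the old entry and refill for the new one. */
		s->cached = objcg;
		s->credit = STOCK_BATCH;
		s->refills++;
	}
	s->credit -= nr;
}
```

With a single objcg, ten small requests hit the slow path once. Alternating
between two per-node objcgs makes every request take the slow path, which
is the invalidation pattern described above.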

I haven't taken a deep look at the implementation details, but I wanted to
raise some high-level items that I noticed. Of course, all of these concerns
are theoretical: if you can show that the performance delta is not
noticeable, then none of them matter.

I also want to talk more about the local credit system, but let's see
what the numbers are first.

Thanks again, Alex. And I really like patch 2 because it is a solution to
a problem that I ran into in my percpu tracking series that I couldn't think
of before! Thank you for solving my problem too : -)

Have a great day!
Joshua


Re: [PATCH 0/8] per-memcg-per-node kmem accounting

Posted by Alexandre Ghiti 3 weeks, 2 days ago

Hi Joshua,

On 5/18/26 16:57, Joshua Hahn wrote:
> On Mon, 11 May 2026 22:20:35 +0200 Alexandre Ghiti <alex@ghiti.fr> wrote:
>
>> This series pursues the work initiated by Joshua [1]. We need kernel
>> memory to be accounted on a per-node basis in order to be able to
>> know the memcg and physical memory association.
>>    
>> This series takes advantage of the recent introduction of per-node
>> obj_cgroup [2] and makes those obj_cgroup tied to their numa node.
>>    
>> The bulk of the series is percpu per-node accounting: percpu
>> "precharges" the memcg before we know the actual location of the pages
>> it uses, so charging and accounting had to be split. All other kmem
>> users (slab, zswap, __memcg_kmem_charge_page) are straightforward
>> conversions (zswap support is limited in this series because Joshua
>> is working on it in parallel [3]).
>>   
>> Thanks Joshua for your early feedback!
> Hello Alex,
>
> Thank you for your work!
>
> Overall I think the direction makes sense to me. Pre-overcharging makes sense to
> me as an approach, we would much rather overaccount than underaccount and
> later have to breach limits.
>
> I do have some concerns on performance, though. Namely, I think there are
> some expensive operations that I think would benefit from some performance
> benchmarking with this patch added (maybe some simple microbenchmarks that
> demonstrates kernel allocation overhead could be useful).
>
>  From what I can tell, there is some additional performance overhead that has
> to do with iterating over num_possible_cpus() x pages_per_alloc, which
> doesn't seem trivial to me.


Indeed, let me microbenchmark the overhead on a large system.


>
> Another concern that I see is the stock credit system. Maybe we could be
> bypassing the stock check leading to more time spent doing the atomic
> operations.


I'm not following on this one, which atomic operations do you see that 
could be bypassed?


>
> obj_stock caches a single obj_cgroup, which means that if we split the objcg
> to be per-node (in patch 6), then the obj_stock basically gets invalidated
> every operation since we iterate over more objcgs (even though we are in
> the same logical objcg). Maybe I'm missing something?


The objcg split comes from commit 01b9da291c49 ("mm: memcontrol: convert 
objcg to be per-memcg per-node type") and the problem you describe is 
exactly what Shakeel is trying to fix [1].

But I remember trying a microbenchmark and noticing a +5% regression (on
top of the 67% then...). I'll rebase this series on top of Shakeel's and
re-run.

[1] 
https://lore.kernel.org/linux-mm/20260520053123.2709959-1-shakeel.butt@linux.dev/T/#m127d4969b105c046a2a21e3c79c963771007583d


>
> I haven't taken a deep look at the implementation details but just wanted to
> raise some high level items that I noticed. Of course, all of these concerns
> are just theoretical, if you can show that the performance delta is not
> noticeable then all of my concerns don't matter.
>
> I also want to talk more about the local credit system but let's first see
> what the numbers are first.
>
> Thanks again, Alex. And I really like patch 2 because it is a solution to
> a problem that I ran into in my percpu tracking series that I couldn't think
> of before! Thank you for solving my problem too : -)


Great then, thanks :)

Alex



Re: [PATCH 0/8] per-memcg-per-node kmem accounting

Posted by Joshua Hahn 3 weeks, 1 day ago

On Wed, 20 May 2026 10:39:59 +0200 Alexandre Ghiti <alex@ghiti.fr> wrote:

> Hi Joshua,
> 
> On 5/18/26 16:57, Joshua Hahn wrote:
> > On Mon, 11 May 2026 22:20:35 +0200 Alexandre Ghiti <alex@ghiti.fr> wrote:
> >
> >> This series pursues the work initiated by Joshua [1]. We need kernel
> >> memory to be accounted on a per-node basis in order to be able to
> >> know the memcg and physical memory association.
> >>    
> >> This series takes advantage of the recent introduction of per-node
> >> obj_cgroup [2] and makes those obj_cgroup tied to their numa node.
> >>    
> >> The bulk of the series is percpu per-node accounting: percpu
> >> "precharges" the memcg before we know the actual location of the pages
> >> it uses, so charging and accounting had to be split. All other kmem
> >> users (slab, zswap, __memcg_kmem_charge_page) are straightforward
> >> conversions (zswap support is limited in this series because Joshua
> >> is working on it in parallel [3]).
> >>   
> >> Thanks Joshua for your early feedback!
> > Hello Alex,
> >
> > Thank you for your work!
> >
> > Overall I think the direction makes sense to me. Pre-overcharging makes sense to
> > me as an approach, we would much rather overaccount than underaccount and
> > later have to breach limits.
> >
> > I do have some concerns on performance, though. Namely, I think there are
> > some expensive operations that I think would benefit from some performance
> > benchmarking with this patch added (maybe some simple microbenchmarks that
> > demonstrates kernel allocation overhead could be useful).
> >
> >  From what I can tell, there is some additional performance overhead that has
> > to do with iterating over num_possible_cpus() x pages_per_alloc, which
> > doesn't seem trivial to me.
> 
> Indeed, let me microbenchmark the overhead on a large system.

Hi Alex,

That sounds great to me : -) Looking forward to the numbers!

> > Another concern that I see is the stock credit system. Maybe we could be
> > bypassing the stock check leading to more time spent doing the atomic
> > operations.
> 
> I'm not following on this one, which atomic operations do you see that 
> could be bypassed?

So in my initial scan of patch 7, my concern was that with a nested stock
system (the obj_cgroup stock and the local credit "stock"), we could incur
more work if the two are out of sync: extra work in the stock refill path
in obj_cgroup_precharge, and then more extra work in the loop within
pcpu_memcg_post_alloc_hook (obj_cgroup_account_kmem does the charging
atomically, I think).

What I mean is, I'm not sure what "size" typically is for
pcpu_memcg_post_alloc_hook. But it might be a worthwhile optimization to
precharge all the pages, then for each cpu iterate over the pages to
figure out how many pages are used per nid (doing just math, not actually
doing the atomic adds), and then outside both of these loops iterate
over every nid_objcg once to perform the atomic operation.

Maybe this is needed or not (depending on how big "size" typically is
and whether we go from doing O(1000) atomic adds --> O(10) or some other
big reduction), but I just wanted to toss it out there as something that
could potentially be expensive.
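
A rough user-space sketch of that batching idea, in C with C11 atomics.
This is not the series' code: struct node_counters, account_naive(),
account_batched() and MAX_NODES are hypothetical names, and page_nid[] is an
assumed flat array holding the node id of each (cpu, page) pair. The
atomic_ops field exists only to compare how many atomic adds each variant
issues.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_NODES 4 /* hypothetical node count for this sketch */

/* Hypothetical stand-in for the per-node kmem counters of one memcg. */
struct node_counters {
	atomic_long pages[MAX_NODES];
	long atomic_ops; /* number of atomic adds issued, for comparison */
};

static void counters_init(struct node_counters *c)
{
	for (int nid = 0; nid < MAX_NODES; nid++)
		atomic_init(&c->pages[nid], 0);
	c->atomic_ops = 0;
}

/* Naive variant: one atomic add per (cpu, page) pair. */
static void account_naive(struct node_counters *c, const int *page_nid,
			  size_t nr_cpus, size_t pages_per_cpu)
{
	for (size_t cpu = 0; cpu < nr_cpus; cpu++) {
		for (size_t i = 0; i < pages_per_cpu; i++) {
			int nid = page_nid[cpu * pages_per_cpu + i];

			atomic_fetch_add(&c->pages[nid], 1);
			c->atomic_ops++;
		}
	}
}

/* Batched variant: plain math in the loops, one atomic add per used node. */
static void account_batched(struct node_counters *c, const int *page_nid,
			    size_t nr_cpus, size_t pages_per_cpu)
{
	long local[MAX_NODES] = { 0 };

	for (size_t cpu = 0; cpu < nr_cpus; cpu++)
		for (size_t i = 0; i < pages_per_cpu; i++)
			local[page_nid[cpu * pages_per_cpu + i]]++;

	for (int nid = 0; nid < MAX_NODES; nid++) {
		if (!local[nid])
			continue;
		atomic_fetch_add(&c->pages[nid], local[nid]);
		c->atomic_ops++;
	}
}
```

Both variants end with the same per-node totals. The batched one issues at
most MAX_NODES atomic adds regardless of nr_cpus x pages_per_cpu, at the
cost of a small on-stack array.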

> > obj_stock caches a single obj_cgroup, which means that if we split the objcg
> > to be per-node (in patch 6), then the obj_stock basically gets invalidated
> > every operation since we iterate over more objcgs (even though we are in
> > the same logical objcg). Maybe I'm missing something?
> 
> 
> The objcg split comes from commit 01b9da291c49 ("mm: memcontrol: convert 
> objcg to be per-memcg per-node type") and the problem you describe is 
> exactly what Shakeel is trying to fix [1].

Whoops O_o I completely missed that one. Sorry for flagging it again!

> But I remember trying a microbenchmark and noticed a +5% regression (on 
> top of the 67% then...), I'll rebase this series on top of Shakeel's and 
> re-run.

Sounds like a great idea! Thanks again Alex, have a great day! : -)
Joshua

Re: [PATCH 0/8] per-memcg-per-node kmem accounting

Posted by Alexandre Ghiti 3 weeks, 1 day ago

On 5/21/26 05:46, Joshua Hahn wrote:
> On Wed, 20 May 2026 10:39:59 +0200 Alexandre Ghiti <alex@ghiti.fr> wrote:
>
>> Hi Joshua,
>>
>> On 5/18/26 16:57, Joshua Hahn wrote:
>>> On Mon, 11 May 2026 22:20:35 +0200 Alexandre Ghiti <alex@ghiti.fr> wrote:
>>>
>>>> This series pursues the work initiated by Joshua [1]. We need kernel
>>>> memory to be accounted on a per-node basis in order to be able to
>>>> know the memcg and physical memory association.
>>>>     
>>>> This series takes advantage of the recent introduction of per-node
>>>> obj_cgroup [2] and makes those obj_cgroup tied to their numa node.
>>>>     
>>>> The bulk of the series is percpu per-node accounting: percpu
>>>> "precharges" the memcg before we know the actual location of the pages
>>>> it uses, so charging and accounting had to be split. All other kmem
>>>> users (slab, zswap, __memcg_kmem_charge_page) are straightforward
>>>> conversions (zswap support is limited in this series because Joshua
>>>> is working on it in parallel [3]).
>>>>    
>>>> Thanks Joshua for your early feedback!
>>> Hello Alex,
>>>
>>> Thank you for your work!
>>>
>>> Overall I think the direction makes sense to me. Pre-overcharging makes sense to
>>> me as an approach, we would much rather overaccount than underaccount and
>>> later have to breach limits.
>>>
>>> I do have some concerns on performance, though. Namely, I think there are
>>> some expensive operations that I think would benefit from some performance
>>> benchmarking with this patch added (maybe some simple microbenchmarks that
>>> demonstrates kernel allocation overhead could be useful).
>>>
>>>   From what I can tell, there is some additional performance overhead that has
>>> to do with iterating over num_possible_cpus() x pages_per_alloc, which
>>> doesn't seem trivial to me.
>> Indeed, let me microbenchmark the overhead on a large system.
> Hi Alex,
>
> That sounds great with me : -) Looking forward to the numbers!
>
>>> Another concern that I see is the stock credit system. Maybe we could be
>>> bypassing the stock check leading to more time spent doing the atomic
>>> operations.
>> I'm not following on this one, which atomic operations do you see that
>> could be bypassed?
> So in my initial scan of the patch 7 I had a concern that if we have a nested
> stock system (obj_cgroup stock and local credit "stock"), then we could
> incur more work if these are out of sync; do extra work in the stock refill
> path in obj_cgroup_precharge, and then do extra work on top in the loop
> within the pcpu_memcg_post_alloc_hook (obj_cgroup_account_kmem does the
> charging atomically I think).
>
> So what I mean is, I'm not sure what the "size" is typically for
> pcpu_memcg_post_alloc_hook. But it might be a worthwhile optimization to
> do precharge all the pages, then for each cpu iterate over the pages to
> figure out how many pages are used per nid (doing just math, not actually
> doing the atomic adds), and then outside both of these loops just iterate
> over every nid_objcg once to perform the atomic operation.
>
> Maybe this is needed or not (depending on how big "size" typically is
> and whether we go from doing O(1000) atomic adds --> O(10) or some
> big reduction, but I just wanted to toss it out there as something that
> could potentially be expensive.


I get it, I'll trace the microbenchmarks to see what happens there, 
thanks for the suggestion.

Thanks again,

Alex


>>> obj_stock caches a single obj_cgroup, which means that if we split the objcg
>>> to be per-node (in patch 6), then the obj_stock basically gets invalidated
>>> every operation since we iterate over more objcgs (even though we are in
>>> the same logical objcg). Maybe I'm missing something?
>>
>> The objcg split comes from commit 01b9da291c49 ("mm: memcontrol: convert
>> objcg to be per-memcg per-node type") and the problem you describe is
>> exactly what Shakeel is trying to fix [1].
> Whoops O_o I completely missed that one. Sorry for flagging it again!
>
>> But I remember trying a microbenchmark and noticed a +5% regression (on
>> top of the 67% then...), I'll rebase this series on top of Shakeel's and
>> re-run.
> Sounds like a great idea! Thanks again Alex, have a great day! : -)
> Joshua