This series pursues the work initiated by Joshua [1]. We need kernel
memory to be accounted on a per-node basis in order to be able to
know the memcg and physical memory association.

This series takes advantage of the recent introduction of per-node
obj_cgroup [2] and makes those obj_cgroup tied to their numa node.

The bulk of the series is percpu per-node accounting: percpu
"precharges" the memcg before we know the actual location of the pages
it uses, so charging and accounting had to be split. All other kmem
users (slab, zswap, __memcg_kmem_charge_page) are straightforward
conversions (zswap support is limited in this series because Joshua
is working on it in parallel [3]).

Thanks Joshua for your early feedback!

[1] https://lore.kernel.org/linux-mm/20260404033844.1892595-1-joshua.hahnjy@gmail.com/
[2] https://lore.kernel.org/linux-mm/56c04b1c5d54f75ccdc12896df6c1ca35403ecc3.1772711148.git.zhengqi.arch@bytedance.com/
[3] https://lore.kernel.org/linux-mm/20260311195153.4013476-1-joshua.hahnjy@gmail.com/

Alexandre Ghiti (8):
  mm: memcontrol: propagate NMI slab stats to memcg vmstats
  mm: percpu: charge obj_exts allocation with __GFP_ACCOUNT
  mm: percpu: Split memcg charging and kmem accounting
  mm: memcontrol: track MEMCG_KMEM per NUMA node
  mm: memcontrol: per-node kmem accounting for page charges
  mm: slab: per-node kmem accounting for slab
  mm: percpu: per-node kmem accounting using local credit
  mm: zswap: per-node kmem accounting for zswap/zsmalloc

 include/linux/memcontrol.h |  27 +++++--
 include/linux/mmzone.h     |   1 +
 include/linux/zsmalloc.h   |   2 +
 mm/memcontrol.c            | 150 ++++++++++++++++++++++++++++---------
 mm/percpu-internal.h       |  16 +---
 mm/percpu.c                |  90 ++++++++++++++++++++--
 mm/vmstat.c                |   1 +
 mm/zsmalloc.c              |  11 +++
 mm/zswap.c                 |   9 ++-
 9 files changed, 242 insertions(+), 65 deletions(-)

--
2.54.0
On Mon, 11 May 2026 22:20:35 +0200 Alexandre Ghiti <alex@ghiti.fr> wrote:

> This series pursues the work initiated by Joshua [1]. We need kernel
> memory to be accounted on a per-node basis in order to be able to
> know the memcg and physical memory association.
>
> This series takes advantage of the recent introduction of per-node
> obj_cgroup [2] and makes those obj_cgroup tied to their numa node.
>
> The bulk of the series is percpu per-node accounting: percpu
> "precharges" the memcg before we know the actual location of the pages
> it uses, so charging and accounting had to be split. All other kmem
> users (slab, zswap, __memcg_kmem_charge_page) are straightforward
> conversions (zswap support is limited in this series because Joshua
> is working on it in parallel [3]).
>
> Thanks Joshua for your early feedbacks!

Hello Alex,

Thank you for your work!

Overall I think the direction makes sense to me. Pre-overcharging makes
sense to me as an approach, we would much rather overaccount than
underaccount and later have to breach limits.

I do have some concerns on performance, though. Namely, I think there are
some expensive operations that would benefit from some performance
benchmarking with this patch added (maybe some simple microbenchmarks that
demonstrate kernel allocation overhead could be useful).

From what I can tell, there is some additional performance overhead that has
to do with iterating over num_possible_cpus() x pages_per_alloc, which
doesn't seem trivial to me.

Another concern that I see is the stock credit system. Maybe we could be
bypassing the stock check leading to more time spent doing the atomic
operations.

obj_stock caches a single obj_cgroup, which means that if we split the objcg
to be per-node (in patch 6), then the obj_stock basically gets invalidated
every operation since we iterate over more objcgs (even though we are in
the same logical objcg). Maybe I'm missing something?

I haven't taken a deep look at the implementation details but just wanted to
raise some high level items that I noticed. Of course, all of these concerns
are just theoretical, if you can show that the performance delta is not
noticeable then all of my concerns don't matter.

I also want to talk more about the local credit system but let's first see
what the numbers are.

Thanks again, Alex. And I really like patch 2 because it is a solution to
a problem that I ran into in my percpu tracking series that I couldn't think
of before! Thank you for solving my problem too :-)

Have a great day!
Joshua

> [1] https://lore.kernel.org/linux-mm/20260404033844.1892595-1-joshua.hahnjy@gmail.com/
> [2] https://lore.kernel.org/linux-mm/56c04b1c5d54f75ccdc12896df6c1ca35403ecc3.1772711148.git.zhengqi.arch@bytedance.com/
> [3] https://lore.kernel.org/linux-mm/20260311195153.4013476-1-joshua.hahnjy@gmail.com/
>
> Alexandre Ghiti (8):
>   mm: memcontrol: propagate NMI slab stats to memcg vmstats
>   mm: percpu: charge obj_exts allocation with __GFP_ACCOUNT
>   mm: percpu: Split memcg charging and kmem accounting
>   mm: memcontrol: track MEMCG_KMEM per NUMA node
>   mm: memcontrol: per-node kmem accounting for page charges
>   mm: slab: per-node kmem accounting for slab
>   mm: percpu: per-node kmem accounting using local credit
>   mm: zswap: per-node kmem accounting for zswap/zsmalloc
>
>  include/linux/memcontrol.h |  27 +++++--
>  include/linux/mmzone.h     |   1 +
>  include/linux/zsmalloc.h   |   2 +
>  mm/memcontrol.c            | 150 ++++++++++++++++++++++++++++---------
>  mm/percpu-internal.h       |  16 +---
>  mm/percpu.c                |  90 ++++++++++++++++++++--
>  mm/vmstat.c                |   1 +
>  mm/zsmalloc.c              |  11 +++
>  mm/zswap.c                 |   9 ++-
>  9 files changed, 242 insertions(+), 65 deletions(-)
>
> --
> 2.54.0
>
Hi Joshua,
On 5/18/26 16:57, Joshua Hahn wrote:
> On Mon, 11 May 2026 22:20:35 +0200 Alexandre Ghiti <alex@ghiti.fr> wrote:
>
>> This series pursues the work initiated by Joshua [1]. We need kernel
>> memory to be accounted on a per-node basis in order to be able to
>> know the memcg and physical memory association.
>>
>> This series takes advantage of the recent introduction of per-node
>> obj_cgroup [2] and makes those obj_cgroup tied to their numa node.
>>
>> The bulk of the series is percpu per-node accounting: percpu
>> "precharges" the memcg before we know the actual location of the pages
>> it uses, so charging and accounting had to be split. All other kmem
>> users (slab, zswap, __memcg_kmem_charge_page) are straightforward
>> conversions (zswap support is limited in this series because Joshua
>> is working on it in parallel [3]).
>>
>> Thanks Joshua for your early feedbacks!
> Hello Alex,
>
> Thank you for your work!
>
> Overall I think the direction makes sense to me. Pre-overcharging makes sense to
> me as an approach, we would much rather overaccount than underaccount and
> later have to breach limits.
>
> I do have some concerns on performance, though. Namely, I think there are
> some expensive operations that I think would benefit from some performane
> benchmarking with this patch added (maybe some simple microbenchmarks that
> demonstrates kernel allocation overhead could be useful).
>
> From what I can tell, there is some additional performance overhead that has
> to do with iterating over num_possible_cpus() x pages_per_alloc, which
> doesn't seem trivial to me.
Indeed, let me microbenchmark the overhead on a large system.
>
> Another concern that I see is the stock credit system. Maybe we could be
> bypassing the stock check leading to more time spent doing the atomic
> operations.
I'm not following on this one, which atomic operations do you see that
could be bypassed?
>
> obj_stock caches a single obj_cgroup, which means that if we split the objcg
> to be per-node (in patch 6), then the obj_stock basically gets invalidated
> every operation since we iterate over more objcgs (even though we are in
> the same logical objcg). Maybe I'm missing something?
The objcg split comes from commit 01b9da291c49 ("mm: memcontrol: convert
objcg to be per-memcg per-node type") and the problem you describe is
exactly what Shakeel is trying to fix [1].
But I remember trying a microbenchmark and noticed a +5% regression (on
top of the 67% then...), I'll rebase this series on top of Shakeel's and
re-run.
[1]
https://lore.kernel.org/linux-mm/20260520053123.2709959-1-shakeel.butt@linux.dev/T/#m127d4969b105c046a2a21e3c79c963771007583d
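
To make the single-slot concern concrete, here is a toy userspace model of a
one-entry stock. It is only a sketch: struct toy_objcg, struct toy_stock,
toy_charge() and TOY_REFILL_BYTES are hypothetical names invented for this
illustration, and the code does not reproduce the kernel's obj_stock
implementation. It only shows why alternating between two per-node objcgs
would make a cache that remembers a single objcg miss on every charge.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: names and layout are hypothetical, not the kernel's. */
struct toy_objcg {
	int nid;	/* node this per-node objcg is tied to */
};

struct toy_stock {
	const struct toy_objcg *cached;	/* the single cached objcg */
	long nr_bytes;			/* pre-charged bytes still available */
	long slow_paths;		/* number of drain + refill events */
};

#define TOY_REFILL_BYTES 4096	/* assumed refill granularity for the sketch */

/* Charge @size bytes to @objcg, using the one-slot stock when it matches. */
static void toy_charge(struct toy_stock *stock,
		       const struct toy_objcg *objcg, long size)
{
	assert(size > 0 && size <= TOY_REFILL_BYTES);

	if (stock->cached != objcg || stock->nr_bytes < size) {
		/* Miss: drop whatever was cached and refill for this objcg. */
		stock->cached = objcg;
		stock->nr_bytes = TOY_REFILL_BYTES;
		stock->slow_paths++;
	}
	stock->nr_bytes -= size;
}
```

In this model, eight 64-byte charges against the same objcg take the slow path
once, while eight charges alternating between two per-node objcgs take it
eight times.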
>
> I haven't taken a deep look at the implementation details but just wanted to
> raise some high level items that I noticed. Of course, all of these concerns
> are just theoretical, if you can show that the performance delta is not
> noticable then all of my concerns don't matter.
>
> I also want to talk more about the local credit system but let's first see
> what the numbers are first.
>
> Thanks again, Alex. And I really like patch 2 because it is a solution to
> a problem that I ran into in my percpu tracking series that I couldn't think
> of before! Thank you for solving my problem too : -)
Great then, thanks :)
Alex
>
> Have a great day!
> Joshua
>
>> [1] https://lore.kernel.org/linux-mm/20260404033844.1892595-1-joshua.hahnjy@gmail.com/
>> [2] https://lore.kernel.org/linux-mm/56c04b1c5d54f75ccdc12896df6c1ca35403ecc3.1772711148.git.zhengqi.arch@bytedance.com/
>> [3] https://lore.kernel.org/linux-mm/20260311195153.4013476-1-joshua.hahnjy@gmail.com/
>>
>> Alexandre Ghiti (8):
>> mm: memcontrol: propagate NMI slab stats to memcg vmstats
>> mm: percpu: charge obj_exts allocation with __GFP_ACCOUNT
>> mm: percpu: Split memcg charging and kmem accounting
>> mm: memcontrol: track MEMCG_KMEM per NUMA node
>> mm: memcontrol: per-node kmem accounting for page charges
>> mm: slab: per-node kmem accounting for slab
>> mm: percpu: per-node kmem accounting using local credit
>> mm: zswap: per-node kmem accounting for zswap/zsmalloc
>>
>> include/linux/memcontrol.h | 27 +++++--
>> include/linux/mmzone.h | 1 +
>> include/linux/zsmalloc.h | 2 +
>> mm/memcontrol.c | 150 ++++++++++++++++++++++++++++---------
>> mm/percpu-internal.h | 16 +---
>> mm/percpu.c | 90 ++++++++++++++++++++--
>> mm/vmstat.c | 1 +
>> mm/zsmalloc.c | 11 +++
>> mm/zswap.c | 9 ++-
>> 9 files changed, 242 insertions(+), 65 deletions(-)
>>
>> --
>> 2.54.0
>>
>>
On Wed, 20 May 2026 10:39:59 +0200 Alexandre Ghiti <alex@ghiti.fr> wrote:
> Hi Joshua,
>
> On 5/18/26 16:57, Joshua Hahn wrote:
> > On Mon, 11 May 2026 22:20:35 +0200 Alexandre Ghiti <alex@ghiti.fr> wrote:
> >
> >> This series pursues the work initiated by Joshua [1]. We need kernel
> >> memory to be accounted on a per-node basis in order to be able to
> >> know the memcg and physical memory association.
> >>
> >> This series takes advantage of the recent introduction of per-node
> >> obj_cgroup [2] and makes those obj_cgroup tied to their numa node.
> >>
> >> The bulk of the series is percpu per-node accounting: percpu
> >> "precharges" the memcg before we know the actual location of the pages
> >> it uses, so charging and accounting had to be split. All other kmem
> >> users (slab, zswap, __memcg_kmem_charge_page) are straightforward
> >> conversions (zswap support is limited in this series because Joshua
> >> is working on it in parallel [3]).
> >>
> >> Thanks Joshua for your early feedbacks!
> > Hello Alex,
> >
> > Thank you for your work!
> >
> > Overall I think the direction makes sense to me. Pre-overcharging makes sense to
> > me as an approach, we would much rather overaccount than underaccount and
> > later have to breach limits.
> >
> > I do have some concerns on performance, though. Namely, I think there are
> > some expensive operations that I think would benefit from some performane
> > benchmarking with this patch added (maybe some simple microbenchmarks that
> > demonstrates kernel allocation overhead could be useful).
> >
> > From what I can tell, there is some additional performance overhead that has
> > to do with iterating over num_possible_cpus() x pages_per_alloc, which
> > doesn't seem trivial to me.
>
> Indeed, let me microbenchmark the overhead on a large system.
Hi Alex,
That sounds great to me :-) Looking forward to the numbers!
> > Another concern that I see is the stock credit system. Maybe we could be
> > bypassing the stock check leading to more time spent doing the atomic
> > operations.
>
> I'm not following on this one, which atomic operations do you see that
> could be bypassed?
So in my initial scan of patch 7 I had a concern that if we have a nested
stock system (obj_cgroup stock and local credit "stock"), then we could
incur more work if these are out of sync: extra work in the stock refill
path in obj_cgroup_precharge, and then more work on top in the loop
within pcpu_memcg_post_alloc_hook (obj_cgroup_account_kmem does the
charging atomically, I think).
So what I mean is, I'm not sure what the "size" typically is for
pcpu_memcg_post_alloc_hook. But it might be a worthwhile optimization to
precharge all the pages, then for each cpu iterate over the pages to
figure out how many pages are used per nid (doing just math, not actually
doing the atomic adds), and then outside both of these loops just iterate
over every nid_objcg once to perform the atomic operation.

Maybe this is needed or maybe not, depending on how big "size" typically is
and whether we go from doing O(1000) atomic adds --> O(10) or some other
big reduction, but I just wanted to toss it out there as something that
could potentially be expensive.
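
A rough userspace sketch of that batching idea, under stated assumptions:
MAX_NODES, node_kmem, account_kmem(), account_per_page() and
account_batched() are hypothetical names made up for this illustration, and
C11 atomics stand in for whatever the kernel path actually uses. It is not
code from the series. It only contrasts one atomic add per (cpu, page) with
a plain-arithmetic tally followed by one atomic add per node.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_NODES 4	/* hypothetical node count for the sketch */

/* Hypothetical stand-in for the per-node kmem counters. */
static atomic_long node_kmem[MAX_NODES];
/* Counts atomic updates so the two variants can be compared. */
static long nr_atomic_ops;

static void account_kmem(int nid, long nr_pages)
{
	assert(nid >= 0 && nid < MAX_NODES);
	atomic_fetch_add(&node_kmem[nid], nr_pages);
	nr_atomic_ops++;
}

/* Naive variant: one atomic add for every page of every cpu. */
static void account_per_page(const int *page_nid, size_t nr_cpus,
			     size_t pages_per_cpu)
{
	for (size_t cpu = 0; cpu < nr_cpus; cpu++)
		for (size_t i = 0; i < pages_per_cpu; i++)
			account_kmem(page_nid[cpu * pages_per_cpu + i], 1);
}

/* Batched variant: tally per node first, then one atomic add per node. */
static void account_batched(const int *page_nid, size_t nr_cpus,
			    size_t pages_per_cpu)
{
	long per_node[MAX_NODES] = { 0 };

	for (size_t cpu = 0; cpu < nr_cpus; cpu++) {
		for (size_t i = 0; i < pages_per_cpu; i++) {
			int nid = page_nid[cpu * pages_per_cpu + i];

			assert(nid >= 0 && nid < MAX_NODES);
			per_node[nid]++;	/* plain math, no atomics */
		}
	}

	for (int nid = 0; nid < MAX_NODES; nid++)
		if (per_node[nid])
			account_kmem(nid, per_node[nid]);
}
```

With 4 cpus and 2 pages each spread over two nodes, the naive variant does 8
atomic updates and the batched one does 2. Whether that matters in practice
depends on the typical "size", which is what the microbenchmarks should tell.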
> > obj_stock caches a single obj_cgroup, which means that if we split the objcg
> > to be per-node (in patch 6), then the obj_stock basically gets invalidated
> > every operation since we iterate over more objcgs (even though we are in
> > the same logical objcg). Maybe I'm missing something?
>
>
> The objcg split comes from commit 01b9da291c49 ("mm: memcontrol: convert
> objcg to be per-memcg per-node type") and the problem you describe is
> exactly what Shakeel is trying to fix [1].
Whoops O_o I completely missed that one. Sorry for flagging it again!
> But I remember trying a microbenchmark and noticed a +5% regression (on
> top of the 67% then...), I'll rebase this series on top of Shakeel's and
> re-run.
Sounds like a great idea! Thanks again Alex, have a great day! :-)
Joshua
On 5/21/26 05:46, Joshua Hahn wrote:
> On Wed, 20 May 2026 10:39:59 +0200 Alexandre Ghiti <alex@ghiti.fr> wrote:
>
>> Hi Joshua,
>>
>> On 5/18/26 16:57, Joshua Hahn wrote:
>>> On Mon, 11 May 2026 22:20:35 +0200 Alexandre Ghiti <alex@ghiti.fr> wrote:
>>>
>>>> This series pursues the work initiated by Joshua [1]. We need kernel
>>>> memory to be accounted on a per-node basis in order to be able to
>>>> know the memcg and physical memory association.
>>>>
>>>> This series takes advantage of the recent introduction of per-node
>>>> obj_cgroup [2] and makes those obj_cgroup tied to their numa node.
>>>>
>>>> The bulk of the series is percpu per-node accounting: percpu
>>>> "precharges" the memcg before we know the actual location of the pages
>>>> it uses, so charging and accounting had to be split. All other kmem
>>>> users (slab, zswap, __memcg_kmem_charge_page) are straightforward
>>>> conversions (zswap support is limited in this series because Joshua
>>>> is working on it in parallel [3]).
>>>>
>>>> Thanks Joshua for your early feedbacks!
>>> Hello Alex,
>>>
>>> Thank you for your work!
>>>
>>> Overall I think the direction makes sense to me. Pre-overcharging makes sense to
>>> me as an approach, we would much rather overaccount than underaccount and
>>> later have to breach limits.
>>>
>>> I do have some concerns on performance, though. Namely, I think there are
>>> some expensive operations that I think would benefit from some performane
>>> benchmarking with this patch added (maybe some simple microbenchmarks that
>>> demonstrates kernel allocation overhead could be useful).
>>>
>>> From what I can tell, there is some additional performance overhead that has
>>> to do with iterating over num_possible_cpus() x pages_per_alloc, which
>>> doesn't seem trivial to me.
>> Indeed, let me microbenchmark the overhead on a large system.
> Hi Alex,
>
> That sounds great with me : -) Looking forward to the numbers!
>
>>> Another concern that I see is the stock credit system. Maybe we could be
>>> bypassing the stock check leading to more time spent doing the atomic
>>> operations.
>> I'm not following on this one, which atomic operations do you see that
>> could be bypassed?
> So in my initial scan of the patch 7 I had a concern that if we have a nested
> stock system (obj_cgroup stock and local credit "stock"), then we could
> incur more work if these are out of sync; do extra work in the stock refill
> path in obj_cgroup_precharge, and then do extra work on top in the loop
> within the pcpu_memcg_post_alloc_hook (obj_cgroup_account_kmem does the
> charging atomically I think).
>
> So what I mean is, I'm not sure what the "size" is typically for
> pcpu_memcg_post_alloc_hook. But it might be a worthwhile optimization to
> do precharge all the pages, then for each cpu iterate over the pages to
> figure out how many pages are used per nid (doing just math, not actually
> doing the atomic adds), and then outside both of these loops just iterate
> over every nid_objcg once to perform the atomic operation.
>
> Maybe this is needed or not (depending on how big "size" typically is
> and whether we go from doing O(1000) atomic adds --> O(10) or some
> big reduction, but I just wanted to toss it out there as something that
> could potentially be expensive.
I get it, I'll trace the microbenchmarks to see what happens there,
thanks for the suggestion.
Thanks again,
Alex
>>> obj_stock caches a single obj_cgroup, which means that if we split the objcg
>>> to be per-node (in patch 6), then the obj_stock basically gets invalidated
>>> every operation since we iterate over more objcgs (even though we are in
>>> the same logical objcg). Maybe I'm missing something?
>>
>> The objcg split comes from commit 01b9da291c49 ("mm: memcontrol: convert
>> objcg to be per-memcg per-node type") and the problem you describe is
>> exactly what Shakeel is trying to fix [1].
> Whoops O_o I completely missed that one. Sorry for flagging it again!
>
>> But I remember trying a microbenchmark and noticed a +5% regression (on
>> top of the 67% then...), I'll rebase this series on top of Shakeel's and
>> re-run.
> Sounds like a great idea! Thanks again Alex, have a great day! : -)
> Joshua