include/linux/memcontrol.h | 4 +++- include/linux/mmzone.h | 4 +++- mm/memcontrol.c | 12 ++++++------ mm/percpu-vm.c | 14 ++++++++++++-- mm/percpu.c | 24 ++++++++++++++++++++---- mm/vmstat.c | 1 + 6 files changed, 45 insertions(+), 14 deletions(-)
Convert MEMCG_PERCPU_B from a memcg_stat_item to a memcg_node_stat_item
to give visibility into per-node breakdowns for percpu allocations and
turn it into NR_PERCPU_B.
Because percpu memory is accounted at a sub-PAGE_SIZE level, we must
account node level statistics (accounted in PAGE_SIZE units) and
memcg-lruvec statistics separately. Account node statistics when the pcpu
pages are allocated, and account memcg-lruvec statistics when pcpu
objects are handed out.
To account these separately, expose mod_memcg_lruvec_state to be
used outside of memcontrol.
One functional change is that we do not account the 8 byte objcg
pointer per-memcg-lruvec. Since the objcg membership is tracked
per-memcg and not percpu, there is no appropriate lruvec to charge this
memory to (see pcpu_obj_full_size). Instead of adding additional
mechanisms to detect which lruvec the 8 byte pointer belongs to, let's
just simplify and account the pcpu objects' size.
Limit-checking is still done with the additional 8 bytes.
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
include/linux/memcontrol.h | 4 +++-
include/linux/mmzone.h | 4 +++-
mm/memcontrol.c | 12 ++++++------
mm/percpu-vm.c | 14 ++++++++++++--
mm/percpu.c | 24 ++++++++++++++++++++----
mm/vmstat.c | 1 +
6 files changed, 45 insertions(+), 14 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 086158969529..96dae769c60d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -34,7 +34,6 @@ struct kmem_cache;
enum memcg_stat_item {
MEMCG_SWAP = NR_VM_NODE_STAT_ITEMS,
MEMCG_SOCK,
- MEMCG_PERCPU_B,
MEMCG_KMEM,
MEMCG_ZSWAP_B,
MEMCG_ZSWAPPED,
@@ -909,6 +908,9 @@ struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
struct mem_cgroup *oom_domain);
void mem_cgroup_print_oom_group(struct mem_cgroup *memcg);
+void mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
+ int val);
+
/* idx can be of type enum memcg_stat_item or node_stat_item */
void mod_memcg_state(struct mem_cgroup *memcg,
enum memcg_stat_item idx, int val);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7bd0134c241c..e38d8fe8552b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -328,6 +328,7 @@ enum node_stat_item {
#endif
NR_BALLOON_PAGES,
NR_KERNEL_FILE_PAGES,
+ NR_PERCPU_B,
NR_VM_NODE_STAT_ITEMS
};
@@ -365,7 +366,8 @@ static __always_inline bool vmstat_item_in_bytes(int idx)
* byte-precise.
*/
return (idx == NR_SLAB_RECLAIMABLE_B ||
- idx == NR_SLAB_UNRECLAIMABLE_B);
+ idx == NR_SLAB_UNRECLAIMABLE_B ||
+ idx == NR_PERCPU_B);
}
/*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a47fb68dd65f..b320b6a42696 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -377,6 +377,7 @@ static const unsigned int memcg_node_stat_items[] = {
NR_UNEVICTABLE,
NR_SLAB_RECLAIMABLE_B,
NR_SLAB_UNRECLAIMABLE_B,
+ NR_PERCPU_B,
WORKINGSET_REFAULT_ANON,
WORKINGSET_REFAULT_FILE,
WORKINGSET_ACTIVATE_ANON,
@@ -428,7 +429,6 @@ static const unsigned int memcg_node_stat_items[] = {
static const unsigned int memcg_stat_items[] = {
MEMCG_SWAP,
MEMCG_SOCK,
- MEMCG_PERCPU_B,
MEMCG_KMEM,
MEMCG_ZSWAP_B,
MEMCG_ZSWAPPED,
@@ -920,9 +920,8 @@ static void __mod_memcg_lruvec_state(struct mem_cgroup_per_node *pn,
put_cpu();
}
-static void mod_memcg_lruvec_state(struct lruvec *lruvec,
- enum node_stat_item idx,
- int val)
+void mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
+ int val)
{
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
struct mem_cgroup_per_node *pn;
@@ -936,6 +935,7 @@ static void mod_memcg_lruvec_state(struct lruvec *lruvec,
get_non_dying_memcg_end();
}
+EXPORT_SYMBOL(mod_memcg_lruvec_state);
/**
* mod_lruvec_state - update lruvec memory statistics
@@ -1535,7 +1535,7 @@ static const struct memory_stat memory_stats[] = {
{ "kernel_stack", NR_KERNEL_STACK_KB },
{ "pagetables", NR_PAGETABLE },
{ "sec_pagetables", NR_SECONDARY_PAGETABLE },
- { "percpu", MEMCG_PERCPU_B },
+ { "percpu", NR_PERCPU_B },
{ "sock", MEMCG_SOCK },
{ "vmalloc", NR_VMALLOC },
{ "shmem", NR_SHMEM },
@@ -1597,7 +1597,7 @@ static const struct memory_stat memory_stats[] = {
static int memcg_page_state_unit(int item)
{
switch (item) {
- case MEMCG_PERCPU_B:
+ case NR_PERCPU_B:
case MEMCG_ZSWAP_B:
case NR_SLAB_RECLAIMABLE_B:
case NR_SLAB_UNRECLAIMABLE_B:
diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
index 4f5937090590..e36b639f521d 100644
--- a/mm/percpu-vm.c
+++ b/mm/percpu-vm.c
@@ -55,7 +55,8 @@ static void pcpu_free_pages(struct pcpu_chunk *chunk,
struct page **pages, int page_start, int page_end)
{
unsigned int cpu;
- int i;
+ int nr_pages = page_end - page_start;
+ int i, nid;
for_each_possible_cpu(cpu) {
for (i = page_start; i < page_end; i++) {
@@ -65,6 +66,10 @@ static void pcpu_free_pages(struct pcpu_chunk *chunk,
__free_page(page);
}
}
+
+ for_each_node(nid)
+ mod_node_page_state(NODE_DATA(nid), NR_PERCPU_B,
+ -1L * nr_pages * nr_cpus_node(nid) * PAGE_SIZE);
}
/**
@@ -84,7 +89,8 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
gfp_t gfp)
{
unsigned int cpu, tcpu;
- int i;
+ int nr_pages = page_end - page_start;
+ int i, nid;
gfp |= __GFP_HIGHMEM;
@@ -97,6 +103,10 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
goto err;
}
}
+
+ for_each_node(nid)
+ mod_node_page_state(NODE_DATA(nid), NR_PERCPU_B,
+ nr_pages * nr_cpus_node(nid) * PAGE_SIZE);
return 0;
err:
diff --git a/mm/percpu.c b/mm/percpu.c
index b0676b8054ed..4ad3b9739eb9 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1632,6 +1632,24 @@ static bool pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
return true;
}
+static void pcpu_mod_memcg_lruvec(struct obj_cgroup *objcg, int charge)
+{
+ struct mem_cgroup *memcg;
+ int nid;
+
+ memcg = obj_cgroup_memcg(objcg);
+ for_each_node(nid) {
+ struct lruvec *lruvec;
+ unsigned int nr_cpus = nr_cpus_node(nid);
+
+ if (!nr_cpus)
+ continue;
+
+ lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
+ mod_memcg_lruvec_state(lruvec, NR_PERCPU_B, nr_cpus * charge);
+ }
+}
+
static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
struct pcpu_chunk *chunk, int off,
size_t size)
@@ -1644,8 +1662,7 @@ static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].cgroup = objcg;
rcu_read_lock();
- mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
- pcpu_obj_full_size(size));
+ pcpu_mod_memcg_lruvec(objcg, size);
rcu_read_unlock();
} else {
obj_cgroup_uncharge(objcg, pcpu_obj_full_size(size));
@@ -1667,8 +1684,7 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
obj_cgroup_uncharge(objcg, pcpu_obj_full_size(size));
rcu_read_lock();
- mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
- -pcpu_obj_full_size(size));
+ pcpu_mod_memcg_lruvec(objcg, -size);
rcu_read_unlock();
obj_cgroup_put(objcg);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b33097ab9bc8..d73c3355be71 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1296,6 +1296,7 @@ const char * const vmstat_text[] = {
#endif
[I(NR_BALLOON_PAGES)] = "nr_balloon_pages",
[I(NR_KERNEL_FILE_PAGES)] = "nr_kernel_file_pages",
+ [I(NR_PERCPU_B)] = "nr_percpu",
#undef I
/* system-wide enum vm_stat_item counters */
--
2.52.0
On Fri 27-03-26 12:19:35, Joshua Hahn wrote: > Convert MEMCG_PERCPU_B from a memcg_stat_item to a memcg_node_stat_item > to give visibility into per-node breakdowns for percpu allocations and > turn it into NR_PERCPU_B. Why do we need/want this? -- Michal Hocko SUSE Labs
On Mon, 30 Mar 2026 14:03:29 +0200 Michal Hocko <mhocko@suse.com> wrote: > On Fri 27-03-26 12:19:35, Joshua Hahn wrote: > > Convert MEMCG_PERCPU_B from a memcg_stat_item to a memcg_node_stat_item > > to give visibility into per-node breakdowns for percpu allocations and > > turn it into NR_PERCPU_B. > > Why do we need/want this? Hello Michal, Thank you for reviewing my patch! I hope you are doing well. You're right, I could have done a better job of motivating the patch. My intent with this patch is to give some more visibility into where memory is physically, once you know which memcg it is in. Percpu memory could probably be seen as "trivial" when it comes to figuring out what node it is on, but I'm hoping to make similar transitions to the rest of enum memcg_stat_item as well (you can see my work for the zswap stats in [1]). When all of the memory is moved from being tracked per-memcg to per-lruvec, then the final vision would be able to attribute node placement within each memcg, which can help with diagnosing things like asymmetric node pressure within a memcg, which is currently only partially accurate. Getting per-node breakdowns of percpu memory orthogonal to memcgs also seems like a win to me. While unlikely, I think that we can benefit from some amount of visibility into whether percpu allocations are happening equally across all CPUs. What do you think? Thank you again, I hope you have a great day! Joshua [1] https://lore.kernel.org/all/20260311195153.4013476-1-joshua.hahnjy@gmail.com/
On Mon 30-03-26 07:10:10, Joshua Hahn wrote: > On Mon, 30 Mar 2026 14:03:29 +0200 Michal Hocko <mhocko@suse.com> wrote: > > > On Fri 27-03-26 12:19:35, Joshua Hahn wrote: > > > Convert MEMCG_PERCPU_B from a memcg_stat_item to a memcg_node_stat_item > > > to give visibility into per-node breakdowns for percpu allocations and > > > turn it into NR_PERCPU_B. > > > > Why do we need/want this? > > Hello Michal, > > Thank you for reviewing my patch! I hope you are doing well. > > You're right, I could have done a better job of motivating the patch. > My intent with this patch is to give some more visibility into where > memory is physically, once you know which memcg it is in. Please keep in mind that WHY is very often much more important than HOW in the patch so you should always start with the intention and justification. > Percpu memory could probably be seen as "trivial" when it comes to figuring > out what node it is on, but I'm hoping to make similar transitions to the > rest of enum memcg_stat_item as well (you can see my work for the zswap > stats in [1]). > > When all of the memory is moved from being tracked per-memcg to per-lruvec, > then the final vision would be able to attribute node placement within > each memcg, which can help with diagnosing things like asymmetric node > pressure within a memcg, which is currently only partially accurate. > > Getting per-node breakdowns of percpu memory orthogonal to memcgs also > seems like a win to me. While unlikely, I think that we can benefit from > some amount of visibility into whether percpu allocations are happening > equally across all CPUs. > > What do you think? Thank you again, I hope you have a great day! I think that you should have started with this intended outcome first rather than slicing it in pieces. Why do we want to shift to per-node stats for other/all counters? What is the cost associated comparing to the existing accounting (if any)? 
Please go into details on how do you plan to use the data before we commit into a lot of code churn. TBH I do not see any fundamental reasons why this would be impossible but I am not really sure this is worth the work and I also do not see potential subtle issues that we might stumble over when getting there. So I would appreciate if you could have a look into that deeper and provide us with evaluation on how do you want to achieve your end goal and what can we expect on the way. It is, of course, impossible to see all potential problems without starting implementing the thing but a high level evaluation would be really helpful. > Joshua > > [1] https://lore.kernel.org/all/20260311195153.4013476-1-joshua.hahnjy@gmail.com/ -- Michal Hocko SUSE Labs
On Mon, 30 Mar 2026 16:21:12 +0200 Michal Hocko <mhocko@suse.com> wrote: > On Mon 30-03-26 07:10:10, Joshua Hahn wrote: > > On Mon, 30 Mar 2026 14:03:29 +0200 Michal Hocko <mhocko@suse.com> wrote: > > > > > On Fri 27-03-26 12:19:35, Joshua Hahn wrote: > > > > Convert MEMCG_PERCPU_B from a memcg_stat_item to a memcg_node_stat_item > > > > to give visibility into per-node breakdowns for percpu allocations and > > > > turn it into NR_PERCPU_B. > > > > > > Why do we need/want this? > > > > Hello Michal, > > > > Thank you for reviewing my patch! I hope you are doing well. > > > > You're right, I could have done a better job of motivating the patch. > > My intent with this patch is to give some more visibility into where > > memory is physically, once you know which memcg it is in. > > Please keep in mind that WHY is very often much more important than HOW > in the patch so you should always start with the intention and > justification. > > > Percpu memory could probably be seen as "trivial" when it comes to figuring > > out what node it is on, but I'm hoping to make similar transitions to the > > rest of enum memcg_stat_item as well (you can see my work for the zswap > > stats in [1]). > > > > When all of the memory is moved from being tracked per-memcg to per-lruvec, > > then the final vision would be able to attribute node placement within > > each memcg, which can help with diagnosing things like asymmetric node > > pressure within a memcg, which is currently only partially accurate. > > > > Getting per-node breakdowns of percpu memory orthogonal to memcgs also > > seems like a win to me. While unlikely, I think that we can benefit from > > some amount of visibility into whether percpu allocations are happening > > equally across all CPUs. > > > > What do you think? Thank you again, I hope you have a great day! > > I think that you should have started with this intended outcome first > rather than slicing it in pieces. 
Why do we want to shift to per-node > stats for other/all counters? What is the cost associated comparing to the > existing accounting (if any)? I went and ran a few tests, which seem to show rather negligible performance differences (phew). I wrote a kernel module that does 100k percpu allocations via __alloc_percpu_gfp with GFP_KERNEL | __GFP_ACCOUNT in a cgroup. I then measured how long each allocation takes across two trials, one where I do all 100k allocations and then free all of them at once, and another where I interleave the allocs and frees. Everything below is ns / alloc, and the +/- is the standard deviation across 20 trials. +-------------+----------------+--------------+--------------+ | Test | linus-upstream | patch | diff | +-------------+----------------+--------------+--------------+ | Batched | 6586 +/- 51 | 6595 +/- 35 | +9 (0.13%) | | Interleaved | 1053 +/- 126 | 1085 +/- 113 | +32 (+0.85%) | +-------------+----------------+--------------+--------------+ I'll include this, as well as the additional memory overhead that Yosry suggested to include in a v2. I think we can get more accurate accounting by distributing the obj_cgroup pointer size across the CPUs, so I've gone ahead and made another iteration. Thank you again for your insight, Michal! Joshua
On Mon, Mar 30, 2026 at 7:21 AM Michal Hocko <mhocko@suse.com> wrote: > > On Mon 30-03-26 07:10:10, Joshua Hahn wrote: > > On Mon, 30 Mar 2026 14:03:29 +0200 Michal Hocko <mhocko@suse.com> wrote: > > > > > On Fri 27-03-26 12:19:35, Joshua Hahn wrote: > > > > Convert MEMCG_PERCPU_B from a memcg_stat_item to a memcg_node_stat_item > > > > to give visibility into per-node breakdowns for percpu allocations and > > > > turn it into NR_PERCPU_B. > > > > > > Why do we need/want this? > > > > Hello Michal, > > > > Thank you for reviewing my patch! I hope you are doing well. > > > > You're right, I could have done a better job of motivating the patch. > > My intent with this patch is to give some more visibility into where > > memory is physically, once you know which memcg it is in. > > Please keep in mind that WHY is very often much more important than HOW > in the patch so you should always start with the intention and > justification. > > > Percpu memory could probably be seen as "trivial" when it comes to figuring > > out what node it is on, but I'm hoping to make similar transitions to the > > rest of enum memcg_stat_item as well (you can see my work for the zswap > > stats in [1]). > > > > When all of the memory is moved from being tracked per-memcg to per-lruvec, > > then the final vision would be able to attribute node placement within > > each memcg, which can help with diagnosing things like asymmetric node > > pressure within a memcg, which is currently only partially accurate. > > > > Getting per-node breakdowns of percpu memory orthogonal to memcgs also > > seems like a win to me. While unlikely, I think that we can benefit from > > some amount of visibility into whether percpu allocations are happening > > equally across all CPUs. > > > > What do you think? Thank you again, I hope you have a great day! > > I think that you should have started with this intended outcome first > rather than slicing it in pieces. 
Why do we want to shift to per-node > stats for other/all counters? What is the cost associated comparing to the > existing accounting (if any)? Please go into details on how do you plan > to use the data before we commit into a lot of code churn. > > TBH I do not see any fundamental reasons why this would be impossible > but I am not really sure this is worth the work and I also do not see > potential subtle issues that we might stumble over when getting there. > So I would appreciate if you could have a look into that deeper and > provide us with evaluation on how do you want to achieve your end goal > and what can we expect on the way. It is, of course, impossible to see > all potential problems without starting implementing the thing but a > high level evaluation would be really helpful. You should probably also speak to extra memory overhead to move all these stats from per-memcg to per-lruvec.
On Mon, 30 Mar 2026 11:35:38 -0700 Yosry Ahmed <yosry@kernel.org> wrote: > On Mon, Mar 30, 2026 at 7:21 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Mon 30-03-26 07:10:10, Joshua Hahn wrote: > > > On Mon, 30 Mar 2026 14:03:29 +0200 Michal Hocko <mhocko@suse.com> wrote: > > > > > > > On Fri 27-03-26 12:19:35, Joshua Hahn wrote: > > > > > Convert MEMCG_PERCPU_B from a memcg_stat_item to a memcg_node_stat_item > > > > > to give visibility into per-node breakdowns for percpu allocations and > > > > > turn it into NR_PERCPU_B. > > > > > > > > Why do we need/want this? > > > > > > Hello Michal, > > > > > > Thank you for reviewing my patch! I hope you are doing well. > > > > > > You're right, I could have done a better job of motivating the patch. > > > My intent with this patch is to give some more visibility into where > > > memory is physically, once you know which memcg it is in. > > > > Please keep in mind that WHY is very often much more important than HOW > > in the patch so you should always start with the intention and > > justification. > > > > > Percpu memory could probably be seen as "trivial" when it comes to figuring > > > out what node it is on, but I'm hoping to make similar transitions to the > > > rest of enum memcg_stat_item as well (you can see my work for the zswap > > > stats in [1]). > > > > > > When all of the memory is moved from being tracked per-memcg to per-lruvec, > > > then the final vision would be able to attribute node placement within > > > each memcg, which can help with diagnosing things like asymmetric node > > > pressure within a memcg, which is currently only partially accurate. > > > > > > Getting per-node breakdowns of percpu memory orthogonal to memcgs also > > > seems like a win to me. While unlikely, I think that we can benefit from > > > some amount of visibility into whether percpu allocations are happening > > > equally across all CPUs. > > > > > > What do you think? Thank you again, I hope you have a great day! 
> > > > I think that you should have started with this intended outcome first > > rather than slicing it in pieces. Why do we want to shift to per-node > > stats for other/all counters? What is the cost associated comparing to the > > existing accounting (if any)? Please go into details on how do you plan > > to use the data before we commit into a lot of code churn. > > > > TBH I do not see any fundamental reasons why this would be impossible > > but I am not really sure this is worth the work and I also do not see > > potential subtle issues that we might stumble over when getting there. > > So I would appreciate if you could have a look into that deeper and > > provide us with evaluation on how do you want to achieve your end goal > > and what can we expect on the way. It is, of course, impossible to see > > all potential problems without starting implementing the thing but a > > high level evaluation would be really helpful. > > You should probably also speak to extra memory overhead to move all > these stats from per-memcg to per-lruvec. Hello Yosry, Thank you for your feedback! Here are the things that I can see from my end: - NR_PERCPU_B adds a byte per-node, per-cpu. I think this is manageable. - lruvec_stats_percpu grows by 1 long in 2 arrays (state, state_prev) since NR_MEMCG_NODE_STAT_ITEMS grows by 1 from ~30. This is +16 bytes per cgroup x node x CPU. Even still, I'm not sure this is too concerning, on a host with 300 CPUs across 2 nodes with 100 cgroups (theoretical) we would see a 16 * 300 * 2 * 100 = 937 kB change, less than a MB (and I think this would be considered a big machine). What do you think? Do these numbers look acceptable? Thanks again for your insights, I hope you have a great day :-) Joshua
> > You should probably also speak to extra memory overhead to move all > > these stats from per-memcg to per-lruvec. > > Hello Yosry, > > Thank you for your feedback! > > Here are the things that I cna see from my end: > - NR_PERCPU_B adds a byte per-node, per-cpu. I think this is manageable. > - lruvec_stats_percpu grows by 1 long in 2 arrays (state, state_prev) since > NR_MEMCG_NODE_STAT_ITEMS grows by 1 from ~30. This is +16 bytes per > cgroup x node x CPU. Even still, I'm not sure this is too concerning, > on a host with 300 CPUs across 2 nodes with 100 cgroups (theoretical) > we would see a 16 * 300 * 2 * 100 = 937 kB change, less than a mB (and > I think this would be considered a big machine). > > What do you think? Do these numbers look acceptable? Oh I wasn't trying to say whether this is acceptable or not, just that this is a relevant context that should be included to help people see the tradeoff clearly and make a decision. > > Thanks again for your insights, I hope you have a great day : -) > Joshua >
On Mon, 30 Mar 2026 16:21:12 +0200 Michal Hocko <mhocko@suse.com> wrote: > On Mon 30-03-26 07:10:10, Joshua Hahn wrote: > > On Mon, 30 Mar 2026 14:03:29 +0200 Michal Hocko <mhocko@suse.com> wrote: > > > > > On Fri 27-03-26 12:19:35, Joshua Hahn wrote: > > > > Convert MEMCG_PERCPU_B from a memcg_stat_item to a memcg_node_stat_item > > > > to give visibility into per-node breakdowns for percpu allocations and > > > > turn it into NR_PERCPU_B. > > > > > > Why do we need/want this? > > > > Hello Michal, > > > > Thank you for reviewing my patch! I hope you are doing well. > > > > You're right, I could have done a better job of motivating the patch. > > My intent with this patch is to give some more visibility into where > > memory is physically, once you know which memcg it is in. > > Please keep in mind that WHY is very often much more important than HOW > in the patch so you should always start with the intention and > justification. Ack, I'll keep in mind for the future! > > Percpu memory could probably be seen as "trivial" when it comes to figuring > > out what node it is on, but I'm hoping to make similar transitions to the > > rest of enum memcg_stat_item as well (you can see my work for the zswap > > stats in [1]). > > > > When all of the memory is moved from being tracked per-memcg to per-lruvec, > > then the final vision would be able to attribute node placement within > > each memcg, which can help with diagnosing things like asymmetric node > > pressure within a memcg, which is currently only partially accurate. > > > > Getting per-node breakdowns of percpu memory orthogonal to memcgs also > > seems like a win to me. While unlikely, I think that we can benefit from > > some amount of visibility into whether percpu allocations are happening > > equally across all CPUs. > > > > What do you think? Thank you again, I hope you have a great day! Thank you for the feedback, Michal. 
Let me break down your questions so I can address them one-by-one: > I think that you should have started with this intended outcome first > rather than slicing it in pieces. Why do we want to shift to per-node > stats for other/all counters? What is the cost associated comparing to the Yup, ack here as well. Here is a bit more context on why I stumbled on this in the first place. As you are aware, I'm also working on another series whose goal is to make memory limits tier-aware [2]. While working on this, I realized that memory in the enum memcg_stat_item had no physical association, which meant that identifying (1) which node / tier they were on, and (2) which node / tier the memory should be migrated to was completely invisible. That was the original motivation. Looking deeper I found that this is not even a tier problem but rather just a lack of visibility into node-level statistics for the user. As another example, recently I have seen an example of socket memory landing in CXL, which is really quite strange. (Was it demoted? Was it through a fallback allocation?) It was only visible after there was an OOM and I could use the vmcore to inspect the data manually and figure out the page placement. I was thinking that it would be very nice to have this level of node-level perspective along with the memcg association because IMO something like this has more value in being analyzed at runtime, rather than during a post-mortem with the vmcore, and there is more we can do by understanding what was happening at the system when this strange placement happened. > What is the cost associated comparing to the > existing accounting (if any)? Please go into details on how do you plan > to use the data before we commit into a lot of code churn. For percpu specifically, I think the cost is minimal. Thankfully these changes also have minimal effects on single-NUMA machines as well. But let me get some concrete numbers and get back to you so that I can back these hypotheses up. 
> TBH I do not see any fundamental reasons why this would be impossible > but I am not really sure this is worth the work and I also do not see > potential subtle issues that we might stumble over when getting there. > So I would appreciate if you could have a look into that deeper and > provide us with evaluation on how do you want to achieve your end goal > and what can we expect on the way. It is, of course, impossible to see > all potential problems without starting implementing the thing but a > high level evaluation would be really helpful. Great to hear that you think this is not impossible ;-) Yes, I also definitely see that there can be some subtle issues. One thing I'm trying to be very mindful of is locking semantics, whether we are introducing any new bottlenecks for updates. I'll do some testing and come back with numbers, hopefully that can instill some more confidence with the side effects of these patches. As a note of concern I do believe that socket memory will be tough to track accurately since it uses a different model of memory accounting. I hope that there can be some steps to make it more accurate without introducing overhead in the socket hotpaths, since those are highly performance-sensitive. Another concern is what to do with MEMCG_SWAP, which is not really able to be associated with a node. But swap is unique in that it genuinely does not take up space in memory. So maybe at the end of all of this when there is only MEMCG_SWAP in memcg_stat_item, we can treat it as a single special case. Thank you for your thoughts Michal, I greatly appreciate them. I hope you have a great day! Joshua > > [1] https://lore.kernel.org/all/20260311195153.4013476-1-joshua.hahnjy@gmail.com/ [2] https://lore.kernel.org/all/20260223223830.586018-1-joshua.hahnjy@gmail.com/
© 2016 - 2026 Red Hat, Inc.