From: Joshua Hahn
To: Johannes Weiner, Andrew Morton, Michal Hocko, Yosry Ahmed
Cc: Roman Gushchin, Shakeel Butt, Muchun Song, David Hildenbrand,
	Lorenzo Stoakes, Vlastimil Babka, Dennis Zhou, Tejun Heo,
	Christoph Lameter, cgroups@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, kernel-team@meta.com
Subject: [PATCH v2] mm/percpu, memcontrol: Per-memcg-lruvec percpu accounting
Date: Fri, 3 Apr 2026 20:38:43 -0700
Message-ID: <20260404033844.1892595-1-joshua.hahnjy@gmail.com>

enum memcg_stat_item includes memory that is tracked at a per-memcg
level, but not at a per-node (and per-lruvec) level. Diagnosing memory
pressure for memcgs on multi-NUMA systems can therefore be difficult,
since not all of the memory accounted to a memcg can be traced back to
a node. In scenarios where the NUMA nodes in a memcg are asymmetrically
stressed, this difference can be invisible to the user.

Convert MEMCG_PERCPU_B from a memcg_stat_item to a memcg_node_stat_item
to give visibility into per-node breakdowns of percpu allocations. This
gets us closer to knowing the memcg and physical association of all
memory on the system. Specifically for percpu, this granularity helps
demonstrate footprint differences on systems with asymmetric NUMA
nodes.

Because percpu memory is accounted at a sub-PAGE_SIZE granularity, we
must account node-level statistics (accounted in PAGE_SIZE units) and
memcg-lruvec statistics separately.
Account node statistics when the pcpu pages are allocated, and account
memcg-lruvec statistics when pcpu objects are handed out. To account
these separately, expose mod_memcg_lruvec_state() for use outside of
memcontrol.

The memory overhead of this patch is small: it adds 16 bytes per
cgroup-node-cpu. For an example machine with 200 CPUs split across
2 nodes and 50 cgroups in the system, this is a 312.5 kB increase. Note
that this is the same cost as any other item in memcg_node_stat_item.

The performance impact is also negligible. The results below are from a
kernel module which performs 100k percpu allocations via
__alloc_percpu_gfp() with GFP_KERNEL | __GFP_ACCOUNT in a cgroup,
across 20 trials. "Batched" performs 100k allocations followed by 100k
frees, while "interleaved" alternates allocation --> free -->
allocation.

+-------------+----------------+--------------+--------------+
| Test        | linus-upstream | patch        | diff         |
+-------------+----------------+--------------+--------------+
| Batched     | 6586 +/- 51    | 6595 +/- 35  | +9 (+0.13%)  |
| Interleaved | 1053 +/- 126   | 1085 +/- 113 | +32 (+0.85%) |
+-------------+----------------+--------------+--------------+

One functional change is that there can be a tiny inconsistency between
the size of the allocation used for memcg limit checking and what is
charged to each lruvec, due to dropping fractional charges when
rounding. In practice this difference is very small and always errs on
the side of checking the limit at a higher threshold, so there is no
behavioral change visible to userspace.

Signed-off-by: Joshua Hahn
---
v1 --> v2:
- Updated the commit message to be more explicit about the motivation,
  suggested by Michal Hocko.
- Added performance and memory impact numbers, suggested by Michal
  Hocko and Yosry Ahmed.
- Instead of completely dropping the "extra" overhead of obj_cgroup
  pointers, it is now distributed among the nodes, proportional to the
  number of CPUs each node has.
---
 include/linux/memcontrol.h |  4 +++-
 include/linux/mmzone.h     |  4 +++-
 mm/memcontrol.c            | 12 +++++-----
 mm/percpu-vm.c             | 14 ++++++++++--
 mm/percpu.c                | 45 ++++++++++++++++++++++++++++++++++----
 mm/vmstat.c                |  1 +
 6 files changed, 66 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0861589695298..96dae769c60d6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -34,7 +34,6 @@ struct kmem_cache;
 enum memcg_stat_item {
 	MEMCG_SWAP = NR_VM_NODE_STAT_ITEMS,
 	MEMCG_SOCK,
-	MEMCG_PERCPU_B,
 	MEMCG_KMEM,
 	MEMCG_ZSWAP_B,
 	MEMCG_ZSWAPPED,
@@ -909,6 +908,9 @@ struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
 					    struct mem_cgroup *oom_domain);
 void mem_cgroup_print_oom_group(struct mem_cgroup *memcg);
 
+void mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
+			    int val);
+
 /* idx can be of type enum memcg_stat_item or node_stat_item */
 void mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx,
 		     int val);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7bd0134c241ce..e38d8fe8552b1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -328,6 +328,7 @@ enum node_stat_item {
 #endif
 	NR_BALLOON_PAGES,
 	NR_KERNEL_FILE_PAGES,
+	NR_PERCPU_B,
 	NR_VM_NODE_STAT_ITEMS
 };
 
@@ -365,7 +366,8 @@ static __always_inline bool vmstat_item_in_bytes(int idx)
 	 * byte-precise.
 	 */
 	return (idx == NR_SLAB_RECLAIMABLE_B ||
-		idx == NR_SLAB_UNRECLAIMABLE_B);
+		idx == NR_SLAB_UNRECLAIMABLE_B ||
+		idx == NR_PERCPU_B);
 }
 
 /*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a47fb68dd65f1..b320b6a426966 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -377,6 +377,7 @@ static const unsigned int memcg_node_stat_items[] = {
 	NR_UNEVICTABLE,
 	NR_SLAB_RECLAIMABLE_B,
 	NR_SLAB_UNRECLAIMABLE_B,
+	NR_PERCPU_B,
 	WORKINGSET_REFAULT_ANON,
 	WORKINGSET_REFAULT_FILE,
 	WORKINGSET_ACTIVATE_ANON,
@@ -428,7 +429,6 @@ static const unsigned int memcg_node_stat_items[] = {
 static const unsigned int memcg_stat_items[] = {
 	MEMCG_SWAP,
 	MEMCG_SOCK,
-	MEMCG_PERCPU_B,
 	MEMCG_KMEM,
 	MEMCG_ZSWAP_B,
 	MEMCG_ZSWAPPED,
@@ -920,9 +920,8 @@ static void __mod_memcg_lruvec_state(struct mem_cgroup_per_node *pn,
 	put_cpu();
 }
 
-static void mod_memcg_lruvec_state(struct lruvec *lruvec,
-				   enum node_stat_item idx,
-				   int val)
+void mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
+			    int val)
 {
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	struct mem_cgroup_per_node *pn;
@@ -936,6 +935,7 @@ static void mod_memcg_lruvec_state(struct lruvec *lruvec,
 
 	get_non_dying_memcg_end();
 }
+EXPORT_SYMBOL(mod_memcg_lruvec_state);
 
 /**
  * mod_lruvec_state - update lruvec memory statistics
@@ -1535,7 +1535,7 @@ static const struct memory_stat memory_stats[] = {
 	{ "kernel_stack", NR_KERNEL_STACK_KB },
 	{ "pagetables", NR_PAGETABLE },
 	{ "sec_pagetables", NR_SECONDARY_PAGETABLE },
-	{ "percpu", MEMCG_PERCPU_B },
+	{ "percpu", NR_PERCPU_B },
 	{ "sock", MEMCG_SOCK },
 	{ "vmalloc", NR_VMALLOC },
 	{ "shmem", NR_SHMEM },
@@ -1597,7 +1597,7 @@ static const struct memory_stat memory_stats[] = {
 static int memcg_page_state_unit(int item)
 {
 	switch (item) {
-	case MEMCG_PERCPU_B:
+	case NR_PERCPU_B:
 	case MEMCG_ZSWAP_B:
 	case NR_SLAB_RECLAIMABLE_B:
 	case NR_SLAB_UNRECLAIMABLE_B:
diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
index 4f5937090590d..e36b639f521dd 100644
--- a/mm/percpu-vm.c
+++ b/mm/percpu-vm.c
@@ -55,7 +55,8 @@ static void pcpu_free_pages(struct pcpu_chunk *chunk,
 			    struct page **pages, int page_start, int page_end)
 {
 	unsigned int cpu;
-	int i;
+	int nr_pages = page_end - page_start;
+	int i, nid;
 
 	for_each_possible_cpu(cpu) {
 		for (i = page_start; i < page_end; i++) {
@@ -65,6 +66,10 @@ static void pcpu_free_pages(struct pcpu_chunk *chunk,
 			__free_page(page);
 		}
 	}
+
+	for_each_node(nid)
+		mod_node_page_state(NODE_DATA(nid), NR_PERCPU_B,
+				    -1L * nr_pages * nr_cpus_node(nid) * PAGE_SIZE);
 }
 
 /**
@@ -84,7 +89,8 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
 			   gfp_t gfp)
 {
 	unsigned int cpu, tcpu;
-	int i;
+	int nr_pages = page_end - page_start;
+	int i, nid;
 
 	gfp |= __GFP_HIGHMEM;
 
@@ -97,6 +103,10 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
 			goto err;
 		}
 	}
+
+	for_each_node(nid)
+		mod_node_page_state(NODE_DATA(nid), NR_PERCPU_B,
+				    nr_pages * nr_cpus_node(nid) * PAGE_SIZE);
 	return 0;
 
 err:
diff --git a/mm/percpu.c b/mm/percpu.c
index b0676b8054ed0..51c160deca01a 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1632,6 +1632,45 @@ static bool pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
 	return true;
 }
 
+/*
+ * pcpu_mod_memcg_lruvec - update per-node memcg percpu stats
+ * @objcg: object cgroup to charge
+ * @size: size of pcpu allocation
+ * @sign: 1 for charge, -1 for uncharge
+ *
+ * Charge percpu memory across NUMA nodes proportional to per-node CPU count.
+ * Includes the obj_cgroup pointer overhead (see pcpu_obj_full_size) from the
+ * chunk's obj_exts array, but spreads it proportionally across all nodes to
+ * avoid attributing it to a single node.
+ *
+ * The "extra" size calculation is best-effort but deterministic.
+ * Charges will equal uncharges, although there may be small discrepancies
+ * due to rounding up/down.
+ */
+static void pcpu_mod_memcg_lruvec(struct obj_cgroup *objcg, size_t size,
+				  int sign)
+{
+	struct mem_cgroup *memcg;
+	size_t extra = size / PCPU_MIN_ALLOC_SIZE * sizeof(struct obj_cgroup *);
+	int nid;
+
+	memcg = obj_cgroup_memcg(objcg);
+	for_each_node(nid) {
+		struct lruvec *lruvec;
+		unsigned int nr_cpus = nr_cpus_node(nid);
+		long charge;
+
+		if (!nr_cpus)
+			continue;
+
+		charge = nr_cpus * size +
+			 mult_frac(extra, nr_cpus, num_possible_cpus());
+
+		lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
+		mod_memcg_lruvec_state(lruvec, NR_PERCPU_B, sign * charge);
+	}
+}
+
 static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
 				       struct pcpu_chunk *chunk, int off,
 				       size_t size)
@@ -1644,8 +1683,7 @@ static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
 		chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].cgroup = objcg;
 
 		rcu_read_lock();
-		mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
-				pcpu_obj_full_size(size));
+		pcpu_mod_memcg_lruvec(objcg, size, 1);
 		rcu_read_unlock();
 	} else {
 		obj_cgroup_uncharge(objcg, pcpu_obj_full_size(size));
@@ -1667,8 +1705,7 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
 		obj_cgroup_uncharge(objcg, pcpu_obj_full_size(size));
 
 		rcu_read_lock();
-		mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
-				-pcpu_obj_full_size(size));
+		pcpu_mod_memcg_lruvec(objcg, size, -1);
 		rcu_read_unlock();
 
 		obj_cgroup_put(objcg);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b33097ab9bc81..d73c3355be715 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1296,6 +1296,7 @@ const char * const vmstat_text[] = {
 #endif
 	[I(NR_BALLOON_PAGES)] = "nr_balloon_pages",
 	[I(NR_KERNEL_FILE_PAGES)] = "nr_kernel_file_pages",
+	[I(NR_PERCPU_B)] = "nr_percpu",
 #undef I
 
 	/* system-wide enum vm_stat_item counters */
-- 
2.52.0