Introduce hierarchical per-cpu counters and use them for RSS tracking to
fix the per-mm RSS tracking which has become too inaccurate for OOM
killer purposes on large many-core systems.
The following RSS tracking issues were noted by Sweet Tea Dorminy [1],
which lead to picking the wrong tasks as OOM kill targets:

  Recently, several internal services had an RSS usage regression as part of a
  kernel upgrade. Previously, they were on a pre-6.2 kernel and were able to
  read RSS statistics in a backup watchdog process to monitor and decide if
  they'd overrun their memory budget. Now, however, a representative service
  with five threads, expected to use about a hundred MB of memory, on a 250-cpu
  machine had memory usage tens of megabytes different from the expected amount
  -- this constituted a significant percentage of inaccuracy, causing the
  watchdog to act.

  This was a result of commit f1a7941243c1 ("mm: convert mm's rss stats
  into percpu_counter") [1]. Previously, the memory error was bounded by
  64*nr_threads pages, a very livable megabyte. Now, however, as a result of
  scheduler decisions moving the threads around the CPUs, the memory error could
  be as large as a gigabyte.

  This is a really tremendous inaccuracy for any few-threaded program on a
  large machine and impedes monitoring significantly. These stat counters are
  also used to make OOM killing decisions, so this additional inaccuracy could
  make a big difference in OOM situations -- either resulting in the wrong
  process being killed, or in less memory being returned from an OOM-kill than
  expected.
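
For scale, the bounds quoted above can be worked through numerically. The sketch below is illustrative only: it assumes 4 KiB pages, and it assumes a flat per-CPU counter can hide up to one batch of pages on every CPU, with the batch taken as twice the CPU count. Those assumptions and the helper names are made up for illustration and are not taken from the report.

```c
#include <assert.h>

/* Assumption: 4 KiB pages. Common, but not universal. */
#define ASSUMED_PAGE_SIZE 4096L

/* Pre-6.2 bound quoted above: at most 64 pages of drift per thread. */
static long old_rss_error_bytes(long nr_threads)
{
	assert(nr_threads >= 0);
	return 64L * nr_threads * ASSUMED_PAGE_SIZE;
}

/*
 * Assumption: a flat per-CPU counter may hold up to `batch` pages on
 * each CPU before folding them into the shared total, so the shared
 * total can be off by nr_cpus * batch pages in the worst case.
 */
static long flat_percpu_error_bytes(long nr_cpus, long batch)
{
	assert(nr_cpus >= 0 && batch >= 0);
	return nr_cpus * batch * ASSUMED_PAGE_SIZE;
}
```

With five threads the old bound is 64 * 5 * 4096 = 1,310,720 bytes (about 1.25 MiB). Under the stated assumptions, one flat per-CPU counter on 250 CPUs with a batch of 500 could drift by 250 * 500 * 4096 = 512,000,000 bytes, and RSS is the sum of several such counters, which is consistent with the gigabyte-scale error described.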
The approach proposed here is to replace these with hierarchical
per-CPU counters, which bound the inaccuracy based on the system
topology to O(N*logN).
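
The general idea of bounding drift with a tree of counters can be illustrated with a toy, single-threaded model. This is not the code from lib/percpu_counter_tree.c: the two-level layout, the batch sizes, and every toy_* name are hypothetical, and real per-CPU code would need per-CPU storage and atomic operations.

```c
#include <assert.h>
#include <stdlib.h>

/* All values below are hypothetical, chosen to keep the model small. */
#define TOY_NR_CPUS     8
#define TOY_FANOUT      4	/* CPUs sharing one intermediate node */
#define TOY_NR_GROUPS   (TOY_NR_CPUS / TOY_FANOUT)
#define TOY_LEAF_BATCH  4	/* a leaf carries upward at |value| >= 4 */
#define TOY_GROUP_BATCH 8	/* a group carries upward at |value| >= 8 */

struct toy_counter_tree {
	long leaf[TOY_NR_CPUS];		/* per-CPU residue */
	long group[TOY_NR_GROUPS];	/* intermediate residue */
	long root;			/* approximate sum */
};

/* Add delta on behalf of `cpu`, carrying upward when a batch fills. */
static void toy_tree_add(struct toy_counter_tree *t, int cpu, long delta)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	long *leaf = &t->leaf[cpu];
	long *group = &t->group[cpu / TOY_FANOUT];

	*leaf += delta;
	if (labs(*leaf) < TOY_LEAF_BATCH)
		return;
	*group += *leaf;
	*leaf = 0;
	if (labs(*group) < TOY_GROUP_BATCH)
		return;
	t->root += *group;
	*group = 0;
}

/* Cheap read: only the root is consulted. */
static long toy_tree_approx_sum(const struct toy_counter_tree *t)
{
	return t->root;
}

/* Expensive read: walk every node and add up the residues. */
static long toy_tree_precise_sum(const struct toy_counter_tree *t)
{
	long sum = t->root;

	for (int i = 0; i < TOY_NR_CPUS; i++)
		sum += t->leaf[i];
	for (int i = 0; i < TOY_NR_GROUPS; i++)
		sum += t->group[i];
	return sum;
}

/* Worst case when updates are +1/-1: each node is one short of a carry. */
static long toy_tree_max_error(void)
{
	return TOY_NR_CPUS * (TOY_LEAF_BATCH - 1) +
	       TOY_NR_GROUPS * (TOY_GROUP_BATCH - 1);
}
```

In this toy model the gap between the approximate and precise sums depends only on the number of nodes and their batch sizes, not on how threads migrate between CPUs. The bound computed by toy_tree_max_error() holds for single-unit updates; larger deltas are carried upward as soon as they reach the batch.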
Notable changes for v12:
- Reduce per-CPU counters memory allocation size to sizeof long
  (fixing mixup with sizeof intermediate cache line aligned items).
- Use "long" counters types rather than "int".
- get_mm_counter_sum() returns a precise sum.
- Introduce and use functions to calculate the min/max possible precise
  sum values associated with an approximate sum.
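
To illustrate why min/max helpers are useful, the sketch below derives the range of possible precise sums from an approximate sum and a known error bound, then checks whether two ranges can be ordered without computing a precise sum. The symmetric bound and all toy_* names are assumptions for illustration; the series may compute these limits differently.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical type: inclusive range that must contain the precise sum. */
struct toy_sum_range {
	long min;
	long max;
};

/* Assumption: the precise sum lies within +/- max_error of approx. */
static struct toy_sum_range toy_precise_range(long approx, long max_error)
{
	assert(max_error >= 0);
	return (struct toy_sum_range){
		.min = approx - max_error,
		.max = approx + max_error,
	};
}

/*
 * True when every value in `a` exceeds every value in `b`, i.e. the
 * ordering is already certain and no precise sum is needed to decide.
 */
static bool toy_range_certainly_greater(struct toy_sum_range a,
					struct toy_sum_range b)
{
	return a.min > b.max;
}
```

When the ranges overlap, the approximate sums alone cannot rank the two candidates, and a caller would fall back to the more expensive precise sum for just those candidates.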
I've done moderate testing of this series on a 256-core VM with 128GB
RAM. Figuring out whether this indeed helps solve issues with real-life
workloads will require broader feedback from the community.
This series is based on v6.19-rc4, on top of the following two
preparation series:
https://lore.kernel.org/linux-mm/20251224173358.647691-1-mathieu.desnoyers@efficios.com/T/#t
https://lore.kernel.org/linux-mm/20251224173810.648699-1-mathieu.desnoyers@efficios.com/T/#t
Andrew, this series replaces v11, for testing in mm-new.
Thanks!
Mathieu
Link: https://lore.kernel.org/lkml/20250331223516.7810-2-sweettea-kernel@dorminy.me/ # [1]
To: Andrew Morton <akpm@linux-foundation.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Martin Liu <liumartin@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: christian.koenig@amd.com
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: SeongJae Park <sj@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R . Howlett" <liam.howlett@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-mm@kvack.org
Cc: linux-trace-kernel@vger.kernel.org
Cc: Yu Zhao <yuzhao@google.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Aboorva Devarajan <aboorvad@linux.ibm.com>
Mathieu Desnoyers (3):
  lib: Introduce hierarchical per-cpu counters
  mm: Fix OOM killer inaccuracy on large many-core systems
  mm: Implement precise OOM killer task selection

 fs/proc/base.c                      |   2 +-
 include/linux/mm.h                  |  49 +-
 include/linux/mm_types.h            |  54 ++-
 include/linux/oom.h                 |  11 +-
 include/linux/percpu_counter_tree.h | 344 ++++++++++++++
 include/trace/events/kmem.h         |   2 +-
 init/main.c                         |   2 +
 kernel/fork.c                       |  22 +-
 lib/Makefile                        |   1 +
 lib/percpu_counter_tree.c           | 702 ++++++++++++++++++++++++++++
 mm/oom_kill.c                       |  82 +++-
 11 files changed, 1222 insertions(+), 49 deletions(-)
 create mode 100644 include/linux/percpu_counter_tree.h
 create mode 100644 lib/percpu_counter_tree.c
--
2.39.5
On Sun, 11 Jan 2026 10:02:46 -0500 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> Introduce hierarchical per-cpu counters and use them for RSS tracking to
> fix the per-mm RSS tracking which has become too inaccurate for OOM
> killer purposes on large many-core systems.

Great, thanks.

> Notable changes for v12:
>
> - Reduce per-CPU counters memory allocation size to sizeof long
>   (fixing mixup with sizeof intermediate cache line aligned items).
> - Use "long" counters types rather than "int".
> - get_mm_counter_sum() returns a precise sum.
> - Introduce and use functions to calculate the min/max possible precise
>   sum values associated with an approximate sum.

May I ask, as an early adopter, what is your overall impression of the
Gemini reviewbot?
On 2026-01-11 12:48, Andrew Morton wrote:
> On Sun, 11 Jan 2026 10:02:46 -0500 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
>> Notable changes for v12:
>>
>> - Reduce per-CPU counters memory allocation size to sizeof long
>>   (fixing mixup with sizeof intermediate cache line aligned items).
>> - Use "long" counters types rather than "int".
>> - get_mm_counter_sum() returns a precise sum.
>> - Introduce and use functions to calculate the min/max possible precise
>>   sum values associated with an approximate sum.
>
> May I ask, as an early adopter, what is your overall impression of
> the Gemini reviewbot?

The review comments were all spot-on. This is the level of review I
would expect from a good reviewer who spends a significant amount of
effort digging into the proposed change to make sure the type limits
are OK for the intended purpose stated in the commit message and that
the intent stated in comments match the code.

As a patch author, I find this feedback really useful. Is there
an easy way to get this feedback privately before sending out my
patches ?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:

> On 2026-01-11 12:48, Andrew Morton wrote:
>> On Sun, 11 Jan 2026 10:02:46 -0500 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>>
>>> Notable changes for v12:
>>>
>>> - Reduce per-CPU counters memory allocation size to sizeof long
>>>   (fixing mixup with sizeof intermediate cache line aligned items).
>>> - Use "long" counters types rather than "int".
>>> - get_mm_counter_sum() returns a precise sum.
>>> - Introduce and use functions to calculate the min/max possible precise
>>>   sum values associated with an approximate sum.
>> May I ask, as an early adopter, what is your overall impression of
>> the Gemini reviewbot?
>
> The review comments were all spot-on. This is the level of review I
> would expect from a good reviewer who spends a significant amount of
> effort digging into the proposed change to make sure the type limits
> are OK for the intended purpose stated in the commit message and that
> the intent stated in comments match the code.
>
> As a patch author, I find this feedback really useful. Is there
> an easy way to get this feedback privately before sending out my
> patches ?

If you need to review a limited number of patches, the easiest way is to
use gemini cli/claude code or similar tools with a consumer grade
subscription (most are $20/month these days).

I maintain a pre-configured environment for Gemini:
git@github.com:rgushchin/kengp.git , but it's not hard to hack something
similar for other tools.

Thanks!
On Sun, 11 Jan 2026 13:04:59 -0500 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> As a patch author, I find this feedback really useful. Is there
> an easy way to get this feedback privately before sending out my
> patches ?

Hehe, we all want to look good don't we ;-)

For me. I let people see the good, the bad, and the ugly (probably more than I should!)

-- Steve
On 2026-01-12 10:05, Steven Rostedt wrote:
> On Sun, 11 Jan 2026 13:04:59 -0500
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
>> As a patch author, I find this feedback really useful. Is there
>> an easy way to get this feedback privately before sending out my
>> patches ?
>
> Hehe, we all want to look good don't we ;-)
>
> For me. I let people see the good, the bad, and the ugly (probably more than I should!)

I really don't mind having healthy discussions publicly, but I try to
minimize the amount of disruption my work brings on others.

That being said, I think this AI review can lead to interesting
discussions where others can pitch in, so having it in the open
is worthwhile.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On Mon, 12 Jan 2026 13:36:48 -0500 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> That being said, I think this AI review can lead to interesting
> discussions where others can pitch in, so having it in the open
> is worthwhile.

Agreed. I'm interested in seeing what AI catches.

-- Steve