[v12] mm: Fix OOM killer inaccuracy on large many-core systems

[PATCH v12 0/3] mm: Fix OOM killer inaccuracy on large many-core systems

Posted by Mathieu Desnoyers 3 weeks, 6 days ago

Introduce hierarchical per-cpu counters and use them for RSS tracking to
fix the per-mm RSS tracking which has become too inaccurate for OOM
killer purposes on large many-core systems.

The following RSS tracking issues, noted by Sweet Tea Dorminy [1],
lead to picking the wrong tasks as OOM kill targets:

  Recently, several internal services had an RSS usage regression as part of a
  kernel upgrade. Previously, they were on a pre-6.2 kernel and were able to
  read RSS statistics in a backup watchdog process to monitor and decide if
  they'd overrun their memory budget. Now, however, a representative service
  with five threads, expected to use about a hundred MB of memory, on a 250-cpu
  machine had memory usage tens of megabytes different from the expected amount
  -- this constituted a significant percentage of inaccuracy, causing the
  watchdog to act.

  This was a result of commit f1a7941243c1 ("mm: convert mm's rss stats
  into percpu_counter") [1].  Previously, the memory error was bounded by
  64*nr_threads pages, a very livable megabyte. Now, however, as a result of
  scheduler decisions moving the threads around the CPUs, the memory error could
  be as large as a gigabyte.

  This is a really tremendous inaccuracy for any few-threaded program on a
  large machine and impedes monitoring significantly. These stat counters are
  also used to make OOM killing decisions, so this additional inaccuracy could
  make a big difference in OOM situations -- either resulting in the wrong
  process being killed, or in less memory being returned from an OOM-kill than
  expected.

The approach proposed here is to replace the percpu_counter-based RSS
stats with hierarchical per-cpu counters, which bound the inaccuracy
based on the system topology to O(N*logN).

Notable changes for v12:

- Reduce per-CPU counters memory allocation size to sizeof long
  (fixing mixup with sizeof intermediate cache line aligned items).
- Use "long" counter types rather than "int".
- get_mm_counter_sum() returns a precise sum.
- Introduce and use functions to calculate the min/max possible precise
  sum values associated with an approximate sum.

I've done moderate testing of this series on a 256-core VM with 128GB
RAM. Figuring out whether this indeed helps solve issues with real-life
workloads will require broader feedback from the community.

This series is based on v6.19-rc4, on top of the following two
preparation series:

https://lore.kernel.org/linux-mm/20251224173358.647691-1-mathieu.desnoyers@efficios.com/T/#t
https://lore.kernel.org/linux-mm/20251224173810.648699-1-mathieu.desnoyers@efficios.com/T/#t

Andrew, this series replaces v11, for testing in mm-new.

Thanks!

Mathieu

Link: https://lore.kernel.org/lkml/20250331223516.7810-2-sweettea-kernel@dorminy.me/ # [1]
To: Andrew Morton <akpm@linux-foundation.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Martin Liu <liumartin@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: christian.koenig@amd.com
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: SeongJae Park <sj@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R . Howlett" <liam.howlett@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-mm@kvack.org
Cc: linux-trace-kernel@vger.kernel.org
Cc: Yu Zhao <yuzhao@google.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Aboorva Devarajan <aboorvad@linux.ibm.com>

Mathieu Desnoyers (3):
  lib: Introduce hierarchical per-cpu counters
  mm: Fix OOM killer inaccuracy on large many-core systems
  mm: Implement precise OOM killer task selection

 fs/proc/base.c                      |   2 +-
 include/linux/mm.h                  |  49 +-
 include/linux/mm_types.h            |  54 ++-
 include/linux/oom.h                 |  11 +-
 include/linux/percpu_counter_tree.h | 344 ++++++++++++++
 include/trace/events/kmem.h         |   2 +-
 init/main.c                         |   2 +
 kernel/fork.c                       |  22 +-
 lib/Makefile                        |   1 +
 lib/percpu_counter_tree.c           | 702 ++++++++++++++++++++++++++++
 mm/oom_kill.c                       |  82 +++-
 11 files changed, 1222 insertions(+), 49 deletions(-)
 create mode 100644 include/linux/percpu_counter_tree.h
 create mode 100644 lib/percpu_counter_tree.c

-- 
2.39.5

Re: [PATCH v12 0/3] mm: Fix OOM killer inaccuracy on large many-core systems

Posted by Andrew Morton 3 weeks, 5 days ago

On Sun, 11 Jan 2026 10:02:46 -0500 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> Introduce hierarchical per-cpu counters and use them for RSS tracking to
> fix the per-mm RSS tracking which has become too inaccurate for OOM
> killer purposes on large many-core systems.

Great, thanks.

> Notable changes for v12:
> 
> - Reduce per-CPU counters memory allocation size to sizeof long
>   (fixing mixup with sizeof intermediate cache line aligned items).
> - Use "long" counters types rather than "int".
> - get_mm_counter_sum() returns a precise sum.
> - Introduce and use functions to calculate the min/max possible precise
>   sum values associated with an approximate sum.

May I ask, as an early adopter, what is your overall impression of
the Gemini reviewbot?

Re: [PATCH v12 0/3] mm: Fix OOM killer inaccuracy on large many-core systems

Posted by Mathieu Desnoyers 3 weeks, 5 days ago

On 2026-01-11 12:48, Andrew Morton wrote:
> On Sun, 11 Jan 2026 10:02:46 -0500 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
>> Notable changes for v12:
>>
>> - Reduce per-CPU counters memory allocation size to sizeof long
>>    (fixing mixup with sizeof intermediate cache line aligned items).
>> - Use "long" counters types rather than "int".
>> - get_mm_counter_sum() returns a precise sum.
>> - Introduce and use functions to calculate the min/max possible precise
>>    sum values associated with an approximate sum.
> 
> May I ask, as an early adopter, what is your overall impression of
> the Gemini reviewbot?

The review comments were all spot-on. This is the level of review I
would expect from a good reviewer who spends a significant amount of
effort digging into the proposed change to make sure the type limits
are OK for the intended purpose stated in the commit message and that
the intent stated in the comments matches the code.

As a patch author, I find this feedback really useful. Is there
an easy way to get this feedback privately before sending out my
patches?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

Re: [PATCH v12 0/3] mm: Fix OOM killer inaccuracy on large many-core systems

Posted by Roman Gushchin 3 weeks, 5 days ago

Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:

> On 2026-01-11 12:48, Andrew Morton wrote:
>> On Sun, 11 Jan 2026 10:02:46 -0500 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>> 
>>> Notable changes for v12:
>>>
>>> - Reduce per-CPU counters memory allocation size to sizeof long
>>>    (fixing mixup with sizeof intermediate cache line aligned items).
>>> - Use "long" counters types rather than "int".
>>> - get_mm_counter_sum() returns a precise sum.
>>> - Introduce and use functions to calculate the min/max possible precise
>>>    sum values associated with an approximate sum.
>> May I ask, as an early adopter, what is your overall impression of
>> the Gemini reviewbot?
>
> The review comments were all spot-on. This is the level of review I
> would expect from a good reviewer who spends a significant amount of
> effort digging into the proposed change to make sure the type limits
> are OK for the intended purpose stated in the commit message and that
> the intent stated in comments match the code.
>
> As a patch author, I find this feedback really useful. Is there
> an easy way to get this feedback privately before sending out my
> patches ?

If you need to review a limited number of patches, the easiest way is to
use Gemini CLI, Claude Code, or similar tools with a consumer-grade
subscription (most are $20/month these days).
I maintain a pre-configured environment for Gemini:
git@github.com:rgushchin/kengp.git, but it's not hard to hack something
similar for other tools.

Thanks!

Re: [PATCH v12 0/3] mm: Fix OOM killer inaccuracy on large many-core systems

Posted by Steven Rostedt 3 weeks, 5 days ago

On Sun, 11 Jan 2026 13:04:59 -0500
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> As a patch author, I find this feedback really useful. Is there
> an easy way to get this feedback privately before sending out my
> patches ?

Hehe, we all want to look good, don't we ;-)

For me, I let people see the good, the bad, and the ugly (probably more than I should!)

-- Steve

Re: [PATCH v12 0/3] mm: Fix OOM killer inaccuracy on large many-core systems

Posted by Mathieu Desnoyers 3 weeks, 4 days ago

On 2026-01-12 10:05, Steven Rostedt wrote:
> On Sun, 11 Jan 2026 13:04:59 -0500
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
>> As a patch author, I find this feedback really useful. Is there
>> an easy way to get this feedback privately before sending out my
>> patches ?
> 
> Hehe, we all want to look good don't we ;-)
> 
> For me. I let people see the good, the bad, and the ugly (probably more than I should!)

I really don't mind having healthy discussions publicly, but I try
to minimize the amount of disruption my work brings on others.

That being said, I think this AI review can lead to interesting
discussions where others can pitch in, so having it in the open
is worthwhile.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

Re: [PATCH v12 0/3] mm: Fix OOM killer inaccuracy on large many-core systems

Posted by Steven Rostedt 3 weeks, 4 days ago

On Mon, 12 Jan 2026 13:36:48 -0500
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> That being said, I think this AI review can lead to interesting
> discussions where others can pitch in, so having it in the open
> is worthwhile.

Agreed. I'm interested in seeing what AI catches.

-- Steve