Introduce hierarchical per-cpu counters and use them for RSS tracking to
fix the per-mm RSS tracking which has become too inaccurate for OOM
killer purposes on large many-core systems.
The following RSS tracking issues were noted by Sweet Tea Dorminy [1],
which led to picking the wrong tasks as OOM kill targets:
Recently, several internal services had an RSS usage regression as part of a
kernel upgrade. Previously, they were on a pre-6.2 kernel and were able to
read RSS statistics in a backup watchdog process to monitor and decide if
they'd overrun their memory budget. Now, however, a representative service
with five threads, expected to use about a hundred MB of memory, on a 250-cpu
machine had memory usage tens of megabytes different from the expected amount
-- this constituted a significant percentage of inaccuracy, causing the
watchdog to act.
This was a result of commit f1a7941243c1 ("mm: convert mm's rss stats
into percpu_counter") [1]. Previously, the memory error was bounded by
64*nr_threads pages, a very livable megabyte. Now, however, as a result of
scheduler decisions moving the threads around the CPUs, the memory error could
be as large as a gigabyte.
This is a really tremendous inaccuracy for any few-threaded program on a
large machine and impedes monitoring significantly. These stat counters are
also used to make OOM killing decisions, so this additional inaccuracy could
make a big difference in OOM situations -- either resulting in the wrong
process being killed, or in less memory being returned from an OOM-kill than
expected.
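As a rough back-of-envelope check on the error bounds quoted above, here is a small sketch. It assumes the percpu_counter default batch of max(32, 2 * nr_cpus); the exact batch used by the RSS counters may differ, so treat the numbers as order-of-magnitude only.

```c
#include <assert.h>

#define PAGE_SIZE 4096L

/* Worst-case RSS error (bytes) of the pre-6.2 per-thread split counters:
 * each thread could hold up to 64 pages of uncommitted deltas. */
long old_error_bytes(long nr_threads)
{
	return 64 * nr_threads * PAGE_SIZE;
}

/* Worst-case RSS error (bytes) after commit f1a7941243c1: one batch of
 * uncommitted deltas per CPU. The batch value assumed here is the
 * percpu_counter default of max(32, 2 * nr_cpus). */
long new_error_bytes(long nr_cpus)
{
	long batch = 2 * nr_cpus;

	if (batch < 32)
		batch = 32;
	return nr_cpus * batch * PAGE_SIZE;
}
```

For the 5-thread service on a 250-CPU machine described above, this gives roughly 1.25 MB of worst-case error before versus roughly half a gigabyte after, which is consistent with the report.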
The approach proposed here is to replace this with hierarchical
per-cpu counters, which bound the inaccuracy, based on the system
topology, to O(N*logN) of the number of CPUs.
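For illustration, here is a minimal userspace sketch of the idea. It is not the actual lib/percpu_counter_tree.c API; the names, arity and batch sizes are invented. Each per-CPU leaf carries whole batches up to its parent, so an O(1) read of the root is off by at most the sum of the batches still held in the lower levels.

```c
#include <assert.h>

#define NR_CPUS 8
#define ARITY   4   /* children per intermediate node */
#define BATCH   32  /* max delta magnitude kept at a leaf */

/* One intermediate level plus a root, enough for NR_CPUS = ARITY^2 leaves. */
struct counter_tree {
	long cpu[NR_CPUS];         /* per-CPU leaf counters */
	long mid[NR_CPUS / ARITY]; /* intermediate nodes    */
	long root;                 /* global approximation  */
};

/* Add a delta on behalf of 'cpu', carrying full batches up the tree. */
void tree_add(struct counter_tree *t, int cpu, long delta)
{
	long v = t->cpu[cpu] + delta;
	long carry, m, pbatch = (long)BATCH * ARITY;
	int p = cpu / ARITY;

	if (v > -BATCH && v < BATCH) {
		t->cpu[cpu] = v;
		return;
	}
	t->cpu[cpu] = v % BATCH;      /* keep the sub-batch remainder */
	carry = v - t->cpu[cpu];      /* push whole batches upward    */

	m = t->mid[p] + carry;
	if (m > -pbatch && m < pbatch) {
		t->mid[p] = m;
		return;
	}
	t->mid[p] = m % pbatch;
	t->root += m - t->mid[p];     /* final carry into the root    */
}

/* O(1) approximate read: off by at most the batches held below root. */
long tree_read_approx(const struct counter_tree *t)
{
	return t->root;
}

/* Precise read: sum every level (the slow path). */
long tree_read_precise(const struct counter_tree *t)
{
	long sum = t->root;
	int i;

	for (i = 0; i < NR_CPUS / ARITY; i++)
		sum += t->mid[i];
	for (i = 0; i < NR_CPUS; i++)
		sum += t->cpu[i];
	return sum;
}
```

The point of the structure is that the approximation error at each node is bounded by its level's batch size times its fan-in, which is what ties the overall bound to the topology rather than to scheduler migration patterns.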
Notable change for v10: the new patch 3/3 changes the OOM killer task
selection to a 2-pass algorithm, where the first pass uses the fast
approximation provided by the hierarchical per-cpu counters, and the
second pass does a precise sum for all tasks whose badness values fall
within the accuracy range of the approximation.
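A sketch of what such a 2-pass selection could look like follows. This is illustrative only: the struct fields and helper names are invented, not the series' actual code, and the real badness computation involves more than RSS.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical per-task record: the real series derives badness from
 * the mm's hierarchical counters; these fields are made up. */
struct task_sketch {
	int id;
	long approx;   /* fast approximate badness        */
	long accuracy; /* max error of that approximation */
	long precise;  /* what a full precise sum returns */
};

/* Pass 1: cheap scan for the best guaranteed (lower-bound) badness.
 * Pass 2: precise sums only for tasks whose approximation interval
 * can still reach that bound; everyone else is skipped. */
int pick_victim(const struct task_sketch *tasks, int n)
{
	long best_lo = LONG_MIN, best = LONG_MIN;
	int victim = -1, i;

	for (i = 0; i < n; i++) {
		long lo = tasks[i].approx - tasks[i].accuracy;

		if (lo > best_lo)
			best_lo = lo;
	}

	for (i = 0; i < n; i++) {
		if (tasks[i].approx + tasks[i].accuracy < best_lo)
			continue; /* cannot win even in the best case */
		if (tasks[i].precise > best) {
			best = tasks[i].precise;
			victim = tasks[i].id;
		}
	}
	return victim;
}
```

With many tasks and a tight accuracy interval, the second pass touches only a handful of candidates, so the expensive precise sums stay rare.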
I've done moderate testing of this series on a 256-core VM with 128GB
RAM. Figuring out whether this indeed helps solve issues with real-life
workloads will require broader feedback from the community.
The one request I did not have time to fulfill yet is to port the
tests from the librseq feature branch implementation (userspace) to the
kernel selftests.
This series is based on v6.18.
Andrew, are you interested to try this out in mm-new ?
Thanks,
Mathieu
Link: https://lore.kernel.org/lkml/20250331223516.7810-2-sweettea-kernel@dorminy.me/ # [1]
To: Andrew Morton <akpm@linux-foundation.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Martin Liu <liumartin@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: christian.koenig@amd.com
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: SeongJae Park <sj@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R . Howlett" <liam.howlett@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christian Brauner <brauner@kernel.org>
Mathieu Desnoyers (3):
lib: Introduce hierarchical per-cpu counters
mm: Fix OOM killer inaccuracy on large many-core systems
mm: Implement precise OOM killer task selection
fs/proc/base.c | 2 +-
include/linux/mm.h | 58 ++-
include/linux/mm_types.h | 4 +-
include/linux/oom.h | 12 +-
include/linux/percpu_counter_tree.h | 242 ++++++++++
include/trace/events/kmem.h | 2 +-
init/main.c | 2 +
kernel/fork.c | 24 +-
lib/Makefile | 1 +
lib/percpu_counter_tree.c | 705 ++++++++++++++++++++++++++++
mm/oom_kill.c | 72 ++-
11 files changed, 1089 insertions(+), 35 deletions(-)
create mode 100644 include/linux/percpu_counter_tree.h
create mode 100644 lib/percpu_counter_tree.c
--
2.39.5
On Sat, 13 Dec 2025 13:56:05 -0500 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> [...]
>
> Andrew, are you interested to try this out in mm-new ?
Yes. We have to start somewhere.
As you kind of mention, it's going to be difficult to determine when
this is ready to go upstream. I assume that to really know this will
require detailed and lengthy fleet-wide operation and observation.
What sort of drawbacks do you think people might encounter with this
change?
On 2025-12-14 18:35, Andrew Morton wrote:
> On Sat, 13 Dec 2025 13:56:05 -0500 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
[...]
>>
>> Andrew, are you interested to try this out in mm-new ?
>
> Yes. We have to start somewhere.

Cool !

> As you kind of mention, it's going to be difficult to determine when
> this is ready to go upstream. I assume that to really know this will
> require detailed and lengthy fleet-wide operation and observation.

For that kind of feature, yes, this is my expectation as well.

> What sort of drawbacks do you think people might encounter with this
> change?

Let's see, here are some possible drawbacks to keep an eye out for:

- Taking for instance a machine with a 256 logical CPU topology:
  although allocation of small amounts of memory is typically handled
  with a this_cpu_add_return, large memory allocations will trickle
  the carry up over 3 levels, each of which requires an
  atomic_add_return.

  The upstream implementation would instead go straight for a global
  spinlock, which may or may not be better than 3 atomics.

- 2-pass OOM killer task selection: with a large number of tasks and a
  small number of CPUs, the upstream algorithm would be adequately
  precise, and faster because it does a single iteration pass. So the
  open question here is: do we care about the overhead of OOM killer
  task selection ?

- I understand that some people implement their own OOM killer in
  userspace based on RSS values exposed through /proc. Because those
  RSS values are the precise counts (split-counter sums), there should
  be no difference there compared to the upstream implementation, but
  there would be no performance gain either. It may be interesting
  to eventually expose the counter approximations (and the accuracy
  intervals) to userspace so it could speed up its task selection.
  Not really a drawback, more something to keep in mind as a future
  improvement.

- I took care not to add additional memory allocation to the mm
  allocation/free code because it regresses some benchmarks.
  Still, it's good to keep an eye out for bot reports about those
  regressions.

- The intermediate tree level counters use extra memory. This is
  a tradeoff between compactness and cache locality of the counters.
  I currently used cache-aligned integers (thus favoring cache
  locality and eliminating false sharing), but I have other prototypes
  which use packed bytes for the intermediate levels. For instance, on
  a 256 core machine, we have 37 intermediate level nodes, for a total
  of 2368 bytes (in addition to the 1024 bytes of per-cpu memory for
  the per-cpu counters). If we instead choose the packed bytes
  approach, the 37 intermediate level nodes will use 37 bytes of
  memory, but there will be false sharing across those counters.

  An alternative approach there is to use a strided allocator [1] with
  a byte counter set allocator on top [2]. This way we can benefit
  from the memory savings of byte counters without the false sharing.

Thanks,

Mathieu

[1] https://github.com/compudj/librseq/blob/percpu-counter-byte/src/rseq-mempool.c
[2] https://github.com/compudj/librseq/blob/percpu-counter-byte/src/percpu-counter-tree.c#L190

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On 2025-12-15 09:08, Mathieu Desnoyers wrote:
> [...]
One more point:
- The choice of constants is an educated guess at best and would require
testing/feedback on real-world workloads:
- The batch size (32),
- The n-arity of the counter tree for each power-of-two number of CPUs
(per_nr_cpu_order_config).
- The choice of making the number of intermediate level counter bits
match the n-arity of the counters aggregated into that level is also
arbitrary.
- The precise badness sums limit (16) for OOM killer task selection.
This is perhaps something that could become a sysctl tunable.
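For reference, the tree shape arithmetic mentioned earlier in the thread (37 intermediate nodes, 2368 bytes with cache-aligned nodes on a 256 core machine) can be sketched as follows. The per-level arities {8, 8, 4} are an assumption chosen to reproduce those figures; the actual per_nr_cpu_order_config may differ.

```c
#include <assert.h>

/* Count intermediate nodes above nr_cpus leaves, given the fan-in of
 * each tree level. Assumes nr_cpus divides evenly at every level. */
int tree_nodes(int nr_cpus, const int *arity, int levels)
{
	int nodes = 0, width = nr_cpus, l;

	for (l = 0; l < levels; l++) {
		width /= arity[l];
		nodes += width;
	}
	return nodes;
}
```

With 256 CPUs and arities {8, 8, 4}, the levels hold 32, 4 and 1 nodes, i.e. 37 total; at one 64-byte cache line each, that is the 2368 bytes quoted above.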
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com