Introduce hierarchical per-cpu counters and use them for RSS tracking to
fix the per-mm RSS tracking which has become too inaccurate for OOM
killer purposes on large many-core systems.
The following RSS tracking issues were noted by Sweet Tea Dorminy [1],
which led to picking the wrong tasks as OOM kill targets:
Recently, several internal services had an RSS usage regression as part of a
kernel upgrade. Previously, they were on a pre-6.2 kernel and were able to
read RSS statistics in a backup watchdog process to monitor and decide if
they'd overrun their memory budget. Now, however, a representative service
with five threads, expected to use about a hundred MB of memory, on a 250-cpu
machine had memory usage tens of megabytes different from the expected amount
-- this constituted a significant percentage of inaccuracy, causing the
watchdog to act.
This was a result of commit f1a7941243c1 ("mm: convert mm's rss stats
into percpu_counter") [1]. Previously, the memory error was bounded by
64*nr_threads pages, a very livable megabyte. Now, however, as a result of
scheduler decisions moving the threads around the CPUs, the memory error could
be as large as a gigabyte.
This is a really tremendous inaccuracy for any few-threaded program on a
large machine and impedes monitoring significantly. These stat counters are
also used to make OOM killing decisions, so this additional inaccuracy could
make a big difference in OOM situations -- either resulting in the wrong
process being killed, or in less memory being returned from an OOM-kill than
expected.
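As a rough back-of-envelope check on the error bounds quoted above, here is a small sketch. It assumes the percpu_counter default batch of max(32, 2 * nr_cpus); the exact batch used by the RSS counters may differ, so treat the numbers as order-of-magnitude only.

```c
#include <assert.h>

#define PAGE_SIZE 4096L

/* Worst-case RSS error (bytes) of the pre-6.2 per-thread split counters:
 * each thread could hold up to 64 pages of uncommitted deltas. */
long old_error_bytes(long nr_threads)
{
	return 64 * nr_threads * PAGE_SIZE;
}

/* Worst-case RSS error (bytes) after commit f1a7941243c1: one batch of
 * uncommitted deltas per CPU. The batch value assumed here is the
 * percpu_counter default of max(32, 2 * nr_cpus). */
long new_error_bytes(long nr_cpus)
{
	long batch = 2 * nr_cpus;

	if (batch < 32)
		batch = 32;
	return nr_cpus * batch * PAGE_SIZE;
}
```

For the 5-thread service on a 250-CPU machine described above, this gives roughly 1.25 MB of worst-case error before versus roughly half a gigabyte after, which is consistent with the report.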
The approach proposed here is to replace this with hierarchical
per-cpu counters, which bound the inaccuracy, based on the system
topology, to O(N*logN) of the number of CPUs.
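For illustration, here is a minimal userspace sketch of the idea. It is not the actual lib/percpu_counter_tree.c API; the names, arity and batch sizes are invented. Each per-CPU leaf carries whole batches up to its parent, so an O(1) read of the root is off by at most the sum of the batches still held in the lower levels.

```c
#include <assert.h>

#define NR_CPUS 8
#define ARITY   4   /* children per intermediate node */
#define BATCH   32  /* max delta magnitude kept at a leaf */

/* One intermediate level plus a root, enough for NR_CPUS = ARITY^2 leaves. */
struct counter_tree {
	long cpu[NR_CPUS];         /* per-CPU leaf counters */
	long mid[NR_CPUS / ARITY]; /* intermediate nodes    */
	long root;                 /* global approximation  */
};

/* Add a delta on behalf of 'cpu', carrying full batches up the tree. */
void tree_add(struct counter_tree *t, int cpu, long delta)
{
	long v = t->cpu[cpu] + delta;
	long carry, m, pbatch = (long)BATCH * ARITY;
	int p = cpu / ARITY;

	if (v > -BATCH && v < BATCH) {
		t->cpu[cpu] = v;
		return;
	}
	t->cpu[cpu] = v % BATCH;      /* keep the sub-batch remainder */
	carry = v - t->cpu[cpu];      /* push whole batches upward    */

	m = t->mid[p] + carry;
	if (m > -pbatch && m < pbatch) {
		t->mid[p] = m;
		return;
	}
	t->mid[p] = m % pbatch;
	t->root += m - t->mid[p];     /* final carry into the root    */
}

/* O(1) approximate read: off by at most the batches held below root. */
long tree_read_approx(const struct counter_tree *t)
{
	return t->root;
}

/* Precise read: sum every level (the slow path). */
long tree_read_precise(const struct counter_tree *t)
{
	long sum = t->root;
	int i;

	for (i = 0; i < NR_CPUS / ARITY; i++)
		sum += t->mid[i];
	for (i = 0; i < NR_CPUS; i++)
		sum += t->cpu[i];
	return sum;
}
```

The point of the structure is that the approximation error at each node is bounded by its level's batch size times its fan-in, which is what ties the overall bound to the topology rather than to scheduler migration patterns.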
Notable change for v10: the new patch 3/3 changes the OOM killer task
selection to a 2-pass algorithm, where the first pass uses the fast
approximation provided by the hierarchical per-cpu counters, and the
second pass does a precise sum for all tasks whose badness values fall
within the accuracy range of the approximation.
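A sketch of what such a 2-pass selection could look like follows. This is illustrative only: the struct fields and helper names are invented, not the series' actual code, and the real badness computation involves more than RSS.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical per-task record: the real series derives badness from
 * the mm's hierarchical counters; these fields are made up. */
struct task_sketch {
	int id;
	long approx;   /* fast approximate badness        */
	long accuracy; /* max error of that approximation */
	long precise;  /* what a full precise sum returns */
};

/* Pass 1: cheap scan for the best guaranteed (lower-bound) badness.
 * Pass 2: precise sums only for tasks whose approximation interval
 * can still reach that bound; everyone else is skipped. */
int pick_victim(const struct task_sketch *tasks, int n)
{
	long best_lo = LONG_MIN, best = LONG_MIN;
	int victim = -1, i;

	for (i = 0; i < n; i++) {
		long lo = tasks[i].approx - tasks[i].accuracy;

		if (lo > best_lo)
			best_lo = lo;
	}

	for (i = 0; i < n; i++) {
		if (tasks[i].approx + tasks[i].accuracy < best_lo)
			continue; /* cannot win even in the best case */
		if (tasks[i].precise > best) {
			best = tasks[i].precise;
			victim = tasks[i].id;
		}
	}
	return victim;
}
```

With many tasks and a tight accuracy interval, the second pass touches only a handful of candidates, so the expensive precise sums stay rare.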
I've done moderate testing of this series on a 256-core VM with 128GB
RAM. Figuring out whether this indeed helps solve issues with real-life
workloads will require broader feedback from the community.
The one request I did not have time to fulfill yet is to port the
tests from the librseq feature branch implementation (userspace) to the
kernel selftests.
This series is based on v6.18.
Andrew, are you interested to try this out in mm-new ?
Thanks,
Mathieu
Link: https://lore.kernel.org/lkml/20250331223516.7810-2-sweettea-kernel@dorminy.me/ # [1]
To: Andrew Morton <akpm@linux-foundation.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Martin Liu <liumartin@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: christian.koenig@amd.com
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: SeongJae Park <sj@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R . Howlett" <liam.howlett@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christian Brauner <brauner@kernel.org>
Mathieu Desnoyers (3):
lib: Introduce hierarchical per-cpu counters
mm: Fix OOM killer inaccuracy on large many-core systems
mm: Implement precise OOM killer task selection
fs/proc/base.c | 2 +-
include/linux/mm.h | 58 ++-
include/linux/mm_types.h | 4 +-
include/linux/oom.h | 12 +-
include/linux/percpu_counter_tree.h | 242 ++++++++++
include/trace/events/kmem.h | 2 +-
init/main.c | 2 +
kernel/fork.c | 24 +-
lib/Makefile | 1 +
lib/percpu_counter_tree.c | 705 ++++++++++++++++++++++++++++
mm/oom_kill.c | 72 ++-
11 files changed, 1089 insertions(+), 35 deletions(-)
create mode 100644 include/linux/percpu_counter_tree.h
create mode 100644 lib/percpu_counter_tree.c
--
2.39.5
On Sat, 13 Dec 2025 13:56:05 -0500 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> [...]
>
> Andrew, are you interested to try this out in mm-new ?
Yes. We have to start somewhere.
As you kind of mention, it's going to be difficult to determine when
this is ready to go upstream. I assume that to really know this will
require detailed and lengthy fleet-wide operation and observation.
What sort of drawbacks do you think people might encounter with this
change?
On 2025-12-14 18:35, Andrew Morton wrote:
> On Sat, 13 Dec 2025 13:56:05 -0500 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
[...]
>>
>> Andrew, are you interested to try this out in mm-new ?
>
> Yes. We have to start somewhere.

Cool !

> As you kind of mention, it's going to be difficult to determine when
> this is ready to go upstream. I assume that to really know this will
> require detailed and lengthy fleet-wide operation and observation.

For that kind of feature, yes, this is my expectation as well.

> What sort of drawbacks do you think people might encounter with this
> change?

Let's see, here are some possible drawbacks to keep an eye out for:

- Taking for instance a machine with a 256 logical CPU topology:
  although allocation of small amounts of memory is typically handled
  with a this_cpu_add_return, large memory allocations will trickle
  the carry up over 3 levels, each of which requires an
  atomic_add_return.

  The upstream implementation would instead go straight for a global
  spinlock, which may or may not be better than 3 atomics.

- 2-pass OOM killer task selection: with a large number of tasks and a
  small number of CPUs, the upstream algorithm would be adequately
  precise, and faster because it does a single iteration pass. So the
  open question here is: do we care about the overhead of OOM killer
  task selection ?

- I understand that some people implement their own OOM killer in
  userspace based on RSS values exposed through /proc. Because those
  RSS values are the precise counts (split-counter sums), there should
  be no difference there compared to the upstream implementation, but
  there would be no performance gain either. It may be interesting
  to eventually expose the counter approximations (and the accuracy
  intervals) to userspace so it could speed up its task selection.
  Not really a drawback, more something to keep in mind as a future
  improvement.

- I took care not to add additional memory allocation to the mm
  allocation/free code because it regresses some benchmarks.
  Still, it's good to keep an eye out for bot reports about those
  regressions.

- The intermediate tree level counters use extra memory. This is
  a tradeoff between compactness and cache locality of the counters.
  I currently used cache-aligned integers (thus favoring cache
  locality and eliminating false sharing), but I have other prototypes
  which use packed bytes for the intermediate levels. For instance, on
  a 256 core machine, we have 37 intermediate level nodes, for a total
  of 2368 bytes (in addition to the 1024 bytes of per-cpu memory for
  the per-cpu counters). If we instead choose the packed bytes
  approach, the 37 intermediate level nodes will use 37 bytes of
  memory, but there will be false sharing across those counters.

  An alternative approach there is to use a strided allocator [1] with
  a byte counter set allocator on top [2]. This way we can benefit
  from the memory savings of byte counters without the false sharing.

Thanks,

Mathieu

[1] https://github.com/compudj/librseq/blob/percpu-counter-byte/src/rseq-mempool.c
[2] https://github.com/compudj/librseq/blob/percpu-counter-byte/src/percpu-counter-tree.c#L190

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On 2025-12-15 09:08, Mathieu Desnoyers wrote:
> [...]
One more point:
- The choice of constants is an educated guess at best and would require
testing/feedback on real-world workloads:
- The batch size (32),
- The n-arity of the counter tree for each power-of-two number of CPUs
(per_nr_cpu_order_config).
- The choice of making the number of intermediate level counter bits
match the n-arity of the counters aggregated into that level is also
arbitrary.
- The precise badness sums limit (16) for OOM killer task selection.
This is perhaps something that could become a sysctl tunable.
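For reference, the tree shape arithmetic mentioned earlier in the thread (37 intermediate nodes, 2368 bytes with cache-aligned nodes on a 256 core machine) can be sketched as follows. The per-level arities {8, 8, 4} are an assumption chosen to reproduce those figures; the actual per_nr_cpu_order_config may differ.

```c
#include <assert.h>

/* Count intermediate nodes above nr_cpus leaves, given the fan-in of
 * each tree level. Assumes nr_cpus divides evenly at every level. */
int tree_nodes(int nr_cpus, const int *arity, int levels)
{
	int nodes = 0, width = nr_cpus, l;

	for (l = 0; l < levels; l++) {
		width /= arity[l];
		nodes += width;
	}
	return nodes;
}
```

With 256 CPUs and arities {8, 8, 4}, the levels hold 32, 4 and 1 nodes, i.e. 37 total; at one 64-byte cache line each, that is the 2368 bytes quoted above.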
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com