From: Mathieu Desnoyers
To: Andrew Morton
Cc: linux-kernel@vger.kernel.org, Mathieu Desnoyers, "Paul E. McKenney",
 Steven Rostedt, Masami Hiramatsu, Dennis Zhou, Tejun Heo,
 Christoph Lameter, Martin Liu, David Rientjes, christian.koenig@amd.com,
 Shakeel Butt, SeongJae Park, Michal Hocko, Johannes Weiner,
 Sweet Tea Dorminy, Lorenzo Stoakes, "Liam R. Howlett", Mike Rapoport,
 Suren Baghdasaryan, Vlastimil Babka, Christian Brauner, Wei Yang,
 David Hildenbrand, Miaohe Lin, Al Viro, linux-mm@kvack.org,
 stable@vger.kernel.org, linux-trace-kernel@vger.kernel.org, Yu Zhao,
 Roman Gushchin, Mateusz Guzik, Matthew Wilcox, Baolin Wang,
 Aboorva Devarajan
Subject: [PATCH v1 1/1] mm: Fix OOM killer and proc stats inaccuracy on large many-core systems
Date: Tue, 13 Jan 2026 14:47:34 -0500
Message-Id: <20260113194734.28983-2-mathieu.desnoyers@efficios.com>
X-Mailer: git-send-email 2.39.5
In-Reply-To: <20260113194734.28983-1-mathieu.desnoyers@efficios.com>
References: <20260113194734.28983-1-mathieu.desnoyers@efficios.com>

Use the precise, albeit slower, RSS counter sums for OOM killer task
selection and for proc statistics. The approximated value is too
imprecise on large many-core systems.

The following rss tracking issues were noted by Sweet Tea Dorminy [1];
they can lead to picking the wrong task as the OOM kill target:

Recently, several internal services had an RSS usage regression as part
of a kernel upgrade. Previously, they were on a pre-6.2 kernel and were
able to read RSS statistics in a backup watchdog process to monitor and
decide if they'd overrun their memory budget. Now, however, a
representative service with five threads, expected to use about a
hundred MB of memory, on a 250-cpu machine had memory usage tens of
megabytes different from the expected amount -- this constituted a
significant percentage of inaccuracy, causing the watchdog to act.

This was a result of commit f1a7941243c1 ("mm: convert mm's rss stats
into percpu_counter") [1]. Previously, the memory error was bounded by
64*nr_threads pages, a very livable megabyte. Now, however, as a result
of scheduler decisions moving the threads around the CPUs, the memory
error could be as large as a gigabyte. This is a really tremendous
inaccuracy for any few-threaded program on a large machine and impedes
monitoring significantly. These stat counters are also used to make OOM
killing decisions, so this additional inaccuracy could make a big
difference in OOM situations -- either resulting in the wrong process
being killed, or in less memory being returned from an OOM-kill than
expected.
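To see where the gigabyte figure comes from, here is a back-of-envelope
model (an illustration, not a measurement) of the worst-case drift of a
single percpu_counter. It assumes the kernel's default batch of
max(32, 2 * nr_cpus) from lib/percpu_counter.c, 4 KiB pages, and that
each CPU can hold up to (batch - 1) unflushed pages that the fast
percpu_counter_read() path never sees:

#include <stdio.h>

int main(void)
{
	const long page_size = 4096;	/* assumed 4 KiB pages */
	const long nr_cpus_list[] = { 8, 64, 250, 512 };

	for (unsigned i = 0;
	     i < sizeof(nr_cpus_list) / sizeof(nr_cpus_list[0]); i++) {
		long nr_cpus = nr_cpus_list[i];
		/* Assumed default batch: max(32, 2 * nr_cpus). */
		long batch = 2 * nr_cpus > 32 ? 2 * nr_cpus : 32;
		/* Every CPU one increment short of flushing its delta. */
		long max_error_pages = nr_cpus * (batch - 1);

		printf("%4ld cpus: batch=%4ld, max error = %ld pages (%ld MiB)\n",
		       nr_cpus, batch, max_error_pages,
		       max_error_pages * page_size >> 20);
	}
	return 0;
}

At 250 cpus this yields roughly 487 MiB of possible drift per counter,
and a task has several such counters, which is consistent with the
gigabyte-scale error reported above and with the O(n^2) growth noted in
item 2) of the list that follows.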
Howlett" , Mike Rapoport , Suren Baghdasaryan , Vlastimil Babka , Christian Brauner , Wei Yang , David Hildenbrand , Miaohe Lin , Al Viro , linux-mm@kvack.org, stable@vger.kernel.org, linux-trace-kernel@vger.kernel.org, Yu Zhao , Roman Gushchin , Mateusz Guzik , Matthew Wilcox , Baolin Wang , Aboorva Devarajan Subject: [PATCH v1 1/1] mm: Fix OOM killer and proc stats inaccuracy on large many-core systems Date: Tue, 13 Jan 2026 14:47:34 -0500 Message-Id: <20260113194734.28983-2-mathieu.desnoyers@efficios.com> X-Mailer: git-send-email 2.39.5 In-Reply-To: <20260113194734.28983-1-mathieu.desnoyers@efficios.com> References: <20260113194734.28983-1-mathieu.desnoyers@efficios.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Use the precise, albeit slower, precise RSS counter sums for the OOM killer task selection and proc statistics. The approximated value is too imprecise on large many-core systems. The following rss tracking issues were noted by Sweet Tea Dorminy [1], which lead to picking wrong tasks as OOM kill target: Recently, several internal services had an RSS usage regression as part o= f a kernel upgrade. Previously, they were on a pre-6.2 kernel and were able to read RSS statistics in a backup watchdog process to monitor and decide if they'd overrun their memory budget. Now, however, a representative service with five threads, expected to use about a hundred MB of memory, on a 250= -cpu machine had memory usage tens of megabytes different from the expected am= ount -- this constituted a significant percentage of inaccuracy, causing the watchdog to act. This was a result of commit f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter") [1]. Previously, the memory error was bounded by 64*nr_threads pages, a very livable megabyte. Now, however, as a result of scheduler decisions moving the threads around the CPUs, the memory error = could be as large as a gigabyte. This is a really tremendous inaccuracy for any few-threaded program on a large machine and impedes monitoring significantly. These stat counters a= re also used to make OOM killing decisions, so this additional inaccuracy co= uld make a big difference in OOM situations -- either resulting in the wrong process being killed, or in less memory being returned from an OOM-kill t= han expected. Here is a (possibly incomplete) list of the prior approaches that were used or proposed, along with their downside: 1) Per-thread rss tracking: large error on many-thread processes. 2) Per-CPU counters: up to 12% slower for short-lived processes and 9% increased system time in make test workloads [1]. Moreover, the inaccuracy increases with O(n^2) with the number of CPUs. 3) Per-NUMA-node counters: requires atomics on fast-path (overhead), error is high with systems that have lots of NUMA nodes (32 times the number of NUMA nodes). The simple fix proposed here is to do the precise per-cpu counters sum every time a counter value needs to be read. This applies to the OOM killer task selection, to the /proc statistics, and to the oom mark_victim trace event. Note that commit 82241a83cd15 ("mm: fix the inaccurate memory statistics issue for users") introduced get_mm_counter_sum() for precise proc memory status queries for _some_ proc files. This change renames get_mm_counter_sum() to get_mm_counter(), thus moving the rest of the proc files to the precise sum. 
This change increases the latency of the OOM killer in exchange for a
more precise OOM target task selection: the OOM killer iterates over
all tasks and, for each relevant page type, the precise sum iterates
over all possible CPUs. As a reference, here is the execution time of
the OOM killer before/after the change:

AMD EPYC 9654 96-Core (2 sockets)
Within a KVM guest, configured with 256 logical cpus.

                   | before   | after    |
-------------------|----------|----------|
nr_processes=40    | 0.3 ms   | 0.5 ms   |
nr_processes=10000 | 3.0 ms   | 80.0 ms  |
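The growth with nr_processes is expected. A rough cost model (an
assumption-laden sketch, not a measurement; it assumes oom_badness()
reads the three get_mm_rss() counters plus MM_SWAPENTS for each task)
counts the per-CPU slots touched during one OOM task scan:

#include <stdio.h>

int main(void)
{
	/* Assumed workload parameters, matching the table above. */
	long nr_tasks = 10000;
	long nr_counters = 4;	/* file, anon, shmem (get_mm_rss) + swapents */
	long nr_cpus = 256;

	/* One per-CPU slot is read per counter per possible CPU. */
	printf("per-CPU reads: %ld\n", nr_tasks * nr_counters * nr_cpus);
	/* => 10240000 scattered cache-line reads, consistent with tens of ms. */
	return 0;
}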
Suggested-by: Michal Hocko
Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
Link: https://lore.kernel.org/lkml/20250331223516.7810-2-sweettea-kernel@dorminy.me/ # [1]
Signed-off-by: Mathieu Desnoyers
Cc: Andrew Morton
Cc: "Paul E. McKenney"
Cc: Steven Rostedt
Cc: Masami Hiramatsu
Cc: Mathieu Desnoyers
Cc: Dennis Zhou
Cc: Tejun Heo
Cc: Christoph Lameter
Cc: Martin Liu
Cc: David Rientjes
Cc: christian.koenig@amd.com
Cc: Shakeel Butt
Cc: SeongJae Park
Cc: Michal Hocko
Cc: Johannes Weiner
Cc: Sweet Tea Dorminy
Cc: Lorenzo Stoakes
Cc: "Liam R. Howlett"
Cc: Mike Rapoport
Cc: Suren Baghdasaryan
Cc: Vlastimil Babka
Cc: Christian Brauner
Cc: Wei Yang
Cc: David Hildenbrand
Cc: Miaohe Lin
Cc: Al Viro
Cc: linux-mm@kvack.org
Cc: stable@vger.kernel.org
Cc: linux-trace-kernel@vger.kernel.org
Cc: Yu Zhao
Cc: Roman Gushchin
Cc: Mateusz Guzik
Cc: Matthew Wilcox
Cc: Baolin Wang
Cc: Aboorva Devarajan
---
 fs/proc/task_mmu.c | 14 +++++++-------
 include/linux/mm.h |  5 -----
 2 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 81dfc26bfae8..8ca4fbf53fc5 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -39,9 +39,9 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 	unsigned long text, lib, swap, anon, file, shmem;
 	unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
 
-	anon = get_mm_counter_sum(mm, MM_ANONPAGES);
-	file = get_mm_counter_sum(mm, MM_FILEPAGES);
-	shmem = get_mm_counter_sum(mm, MM_SHMEMPAGES);
+	anon = get_mm_counter(mm, MM_ANONPAGES);
+	file = get_mm_counter(mm, MM_FILEPAGES);
+	shmem = get_mm_counter(mm, MM_SHMEMPAGES);
 
 	/*
 	 * Note: to minimize their overhead, mm maintains hiwater_vm and
@@ -62,7 +62,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 	text = min(text, mm->exec_vm << PAGE_SHIFT);
 	lib = (mm->exec_vm << PAGE_SHIFT) - text;
 
-	swap = get_mm_counter_sum(mm, MM_SWAPENTS);
+	swap = get_mm_counter(mm, MM_SWAPENTS);
 	SEQ_PUT_DEC("VmPeak:\t", hiwater_vm);
 	SEQ_PUT_DEC(" kB\nVmSize:\t", total_vm);
 	SEQ_PUT_DEC(" kB\nVmLck:\t", mm->locked_vm);
@@ -95,12 +95,12 @@ unsigned long task_statm(struct mm_struct *mm,
 			 unsigned long *shared, unsigned long *text,
 			 unsigned long *data, unsigned long *resident)
 {
-	*shared = get_mm_counter_sum(mm, MM_FILEPAGES) +
-			get_mm_counter_sum(mm, MM_SHMEMPAGES);
+	*shared = get_mm_counter(mm, MM_FILEPAGES) +
+			get_mm_counter(mm, MM_SHMEMPAGES);
 	*text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
 								>> PAGE_SHIFT;
 	*data = mm->data_vm + mm->stack_vm;
-	*resident = *shared + get_mm_counter_sum(mm, MM_ANONPAGES);
+	*resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
 	return mm->total_vm;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6f959d8ca4b4..d096bb3593ba 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2847,11 +2847,6 @@ static inline bool get_user_page_fast_only(unsigned long addr,
  * per-process(per-mm_struct) statistics.
  */
 static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
-{
-	return percpu_counter_read_positive(&mm->rss_stat[member]);
-}
-
-static inline unsigned long get_mm_counter_sum(struct mm_struct *mm, int member)
 {
 	return percpu_counter_sum_positive(&mm->rss_stat[member]);
 }
-- 
2.39.5