[RFC PATCH v7 2/2] mm: Fix OOM killer inaccuracy on large many-core systems

Mathieu Desnoyers posted 2 patches 3 months, 1 week ago
There is a newer version of this series
Posted by Mathieu Desnoyers 3 months, 1 week ago
Use hierarchical per-cpu counters for rss tracking to fix the per-mm RSS
tracking which has become too inaccurate for OOM killer purposes on
large many-core systems.

Sweet Tea Dorminy noted the following rss tracking issues [1], which
lead to the wrong tasks being picked as OOM kill targets:

  Recently, several internal services had an RSS usage regression as part of a
  kernel upgrade. Previously, they were on a pre-6.2 kernel and were able to
  read RSS statistics in a backup watchdog process to monitor and decide if
  they'd overrun their memory budget. Now, however, a representative service
  with five threads, expected to use about a hundred MB of memory, on a 250-cpu
  machine had memory usage tens of megabytes different from the expected amount
  -- this constituted a significant percentage of inaccuracy, causing the
  watchdog to act.

  This was a result of f1a7941243c1 ("mm: convert mm's rss stats into
  percpu_counter") [1].  Previously, the memory error was bounded by
  64*nr_threads pages, a very livable megabyte. Now, however, as a result of
  scheduler decisions moving the threads around the CPUs, the memory error could
  be as large as a gigabyte.

  This is a really tremendous inaccuracy for any few-threaded program on a
  large machine and impedes monitoring significantly. These stat counters are
  also used to make OOM killing decisions, so this additional inaccuracy could
  make a big difference in OOM situations -- either resulting in the wrong
  process being killed, or in less memory being returned from an OOM-kill than
  expected.

Here is a (possibly incomplete) list of the prior approaches that were
used or proposed, along with their downsides:

1) Per-thread rss tracking: large error on many-thread processes.

2) Per-CPU counters: up to 12% slower for short-lived processes and 9%
   increased system time in make test workloads [1]. Moreover, the
   inaccuracy grows as O(n^2) with the number of CPUs.

3) Per-NUMA-node counters: requires atomics on fast-path (overhead),
   error is high with systems that have lots of NUMA nodes (32 times
   the number of NUMA nodes).

The approach proposed here is to replace the per-CPU counters with
hierarchical per-cpu counters, which bound the inaccuracy to O(N*logN)
based on the system topology.

commit 82241a83cd15 ("mm: fix the inaccurate memory statistics issue for users")
introduced get_mm_counter_sum() for precise /proc memory status queries.
Implement it with percpu_counter_tree_precise_sum() since it is not a
fast path and precision is preferred over speed.

Link: https://lore.kernel.org/lkml/20250331223516.7810-2-sweettea-kernel@dorminy.me/ # [1]
Link: https://lore.kernel.org/lkml/20250704150226.47980-1-mathieu.desnoyers@efficios.com/
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Martin Liu <liumartin@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: christian.koenig@amd.com
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: SeongJae Park <sj@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Sweet Tea Dorminy <sweettea@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R . Howlett" <liam.howlett@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-mm@kvack.org
Cc: linux-trace-kernel@vger.kernel.org
Cc: Yu Zhao <yuzhao@google.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Aboorva Devarajan <aboorvad@linux.ibm.com>
---
Changes since v6:
- Rebased on v6.18-rc3.
- Implement get_mm_counter_sum as percpu_counter_tree_precise_sum for
  /proc virtual files memory state queries.

Changes since v5:
- Use percpu_counter_tree_approximate_sum_positive.

Changes since v4:
- get_mm_counter needs to return 0 or a positive value.
---
 include/linux/mm.h          | 10 +++++-----
 include/linux/mm_types.h    |  4 ++--
 include/trace/events/kmem.h |  2 +-
 kernel/fork.c               | 32 +++++++++++++++++++++-----------
 4 files changed, 29 insertions(+), 19 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d16b33bacc32..4f8f3118cfd3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2679,33 +2679,33 @@ static inline bool get_user_page_fast_only(unsigned long addr,
  */
 static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
 {
-	return percpu_counter_read_positive(&mm->rss_stat[member]);
+	return percpu_counter_tree_approximate_sum_positive(&mm->rss_stat[member]);
 }
 
 static inline unsigned long get_mm_counter_sum(struct mm_struct *mm, int member)
 {
-	return percpu_counter_sum_positive(&mm->rss_stat[member]);
+	return percpu_counter_tree_precise_sum(&mm->rss_stat[member]);
 }
 
 void mm_trace_rss_stat(struct mm_struct *mm, int member);
 
 static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
 {
-	percpu_counter_add(&mm->rss_stat[member], value);
+	percpu_counter_tree_add(&mm->rss_stat[member], value);
 
 	mm_trace_rss_stat(mm, member);
 }
 
 static inline void inc_mm_counter(struct mm_struct *mm, int member)
 {
-	percpu_counter_inc(&mm->rss_stat[member]);
+	percpu_counter_tree_add(&mm->rss_stat[member], 1);
 
 	mm_trace_rss_stat(mm, member);
 }
 
 static inline void dec_mm_counter(struct mm_struct *mm, int member)
 {
-	percpu_counter_dec(&mm->rss_stat[member]);
+	percpu_counter_tree_add(&mm->rss_stat[member], -1);
 
 	mm_trace_rss_stat(mm, member);
 }
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 90e5790c318f..adb2f227bac7 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -18,7 +18,7 @@
 #include <linux/page-flags-layout.h>
 #include <linux/workqueue.h>
 #include <linux/seqlock.h>
-#include <linux/percpu_counter.h>
+#include <linux/percpu_counter_tree.h>
 #include <linux/types.h>
 #include <linux/bitmap.h>
 
@@ -1119,7 +1119,7 @@ struct mm_struct {
 		unsigned long saved_e_flags;
 #endif
 
-		struct percpu_counter rss_stat[NR_MM_COUNTERS];
+		struct percpu_counter_tree rss_stat[NR_MM_COUNTERS];
 
 		struct linux_binfmt *binfmt;
 
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 7f93e754da5c..91c81c44f884 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -442,7 +442,7 @@ TRACE_EVENT(rss_stat,
 		__entry->mm_id = mm_ptr_to_hash(mm);
 		__entry->curr = !!(current->mm == mm);
 		__entry->member = member;
-		__entry->size = (percpu_counter_sum_positive(&mm->rss_stat[member])
+		__entry->size = (percpu_counter_tree_approximate_sum_positive(&mm->rss_stat[member])
 							    << PAGE_SHIFT);
 	),
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 3da0f08615a9..e3dd00809cf3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -133,6 +133,11 @@
  */
 #define MAX_THREADS FUTEX_TID_MASK
 
+/*
+ * Batch size of rss stat approximation
+ */
+#define RSS_STAT_BATCH_SIZE	32
+
 /*
  * Protected counters by write_lock_irq(&tasklist_lock)
  */
@@ -583,14 +588,12 @@ static void check_mm(struct mm_struct *mm)
 			 "Please make sure 'struct resident_page_types[]' is updated as well");
 
 	for (i = 0; i < NR_MM_COUNTERS; i++) {
-		long x = percpu_counter_sum(&mm->rss_stat[i]);
-
-		if (unlikely(x)) {
-			pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld Comm:%s Pid:%d\n",
-				 mm, resident_page_types[i], x,
+		if (unlikely(percpu_counter_tree_precise_compare_value(&mm->rss_stat[i], 0) != 0))
+			pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%d Comm:%s Pid:%d\n",
+				 mm, resident_page_types[i],
+				 percpu_counter_tree_precise_sum(&mm->rss_stat[i]),
 				 current->comm,
 				 task_pid_nr(current));
-		}
 	}
 
 	if (mm_pgtables_bytes(mm))
@@ -673,6 +676,8 @@ static void cleanup_lazy_tlbs(struct mm_struct *mm)
  */
 void __mmdrop(struct mm_struct *mm)
 {
+	int i;
+
 	BUG_ON(mm == &init_mm);
 	WARN_ON_ONCE(mm == current->mm);
 
@@ -688,8 +693,8 @@ void __mmdrop(struct mm_struct *mm)
 	put_user_ns(mm->user_ns);
 	mm_pasid_drop(mm);
 	mm_destroy_cid(mm);
-	percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
-
+	for (i = 0; i < NR_MM_COUNTERS; i++)
+		percpu_counter_tree_destroy(&mm->rss_stat[i]);
 	free_mm(mm);
 }
 EXPORT_SYMBOL_GPL(__mmdrop);
@@ -1030,6 +1035,8 @@ static void mmap_init_lock(struct mm_struct *mm)
 static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	struct user_namespace *user_ns)
 {
+	int i;
+
 	mt_init_flags(&mm->mm_mt, MM_MT_FLAGS);
 	mt_set_external_lock(&mm->mm_mt, &mm->mmap_lock);
 	atomic_set(&mm->mm_users, 1);
@@ -1083,15 +1090,18 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	if (mm_alloc_cid(mm, p))
 		goto fail_cid;
 
-	if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT,
-				     NR_MM_COUNTERS))
-		goto fail_pcpu;
+	for (i = 0; i < NR_MM_COUNTERS; i++) {
+		if (percpu_counter_tree_init(&mm->rss_stat[i], RSS_STAT_BATCH_SIZE, GFP_KERNEL_ACCOUNT))
+			goto fail_pcpu;
+	}
 
 	mm->user_ns = get_user_ns(user_ns);
 	lru_gen_init_mm(mm);
 	return mm;
 
 fail_pcpu:
+	for (i--; i >= 0; i--)
+		percpu_counter_tree_destroy(&mm->rss_stat[i]);
 	mm_destroy_cid(mm);
 fail_cid:
 	destroy_context(mm);
-- 
2.39.5
Re: [RFC PATCH v7 2/2] mm: Fix OOM killer inaccuracy on large many-core systems
Posted by kernel test robot 3 months ago

Hello,

kernel test robot noticed "BUG:Bad_rss-counter_state_mm:#type:MM_ANONPAGES_val:#Comm:kworker##Pid" on:

commit: 25ae03e80acad812e536694c1a07a3f57784ae23 ("[RFC PATCH v7 2/2] mm: Fix OOM killer inaccuracy on large many-core systems")
url: https://github.com/intel-lab-lkp/linux/commits/Mathieu-Desnoyers/lib-Introduce-hierarchical-per-cpu-counters/20251031-224455
base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/all/20251031144232.15284-3-mathieu.desnoyers@efficios.com/
patch subject: [RFC PATCH v7 2/2] mm: Fix OOM killer inaccuracy on large many-core systems

in testcase: boot

config: x86_64-randconfig-002-20251103
compiler: clang-20
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)


in fact, we observed various BUG:Bad_rss-counter_state_mm issues for this commit
but clean on parent, as below

+------------------------------------------------------------------------+------------+------------+
|                                                                        | 05880dc4af | 25ae03e80a |
+------------------------------------------------------------------------+------------+------------+
| BUG:Bad_rss-counter_state_mm:#type:MM_FILEPAGES_val:#Comm:kworker##Pid | 0          | 10         |
| BUG:Bad_rss-counter_state_mm:#type:MM_ANONPAGES_val:#Comm:kworker##Pid | 0          | 17         |
| BUG:Bad_rss-counter_state_mm:#type:MM_ANONPAGES_val:#Comm:swapper_Pid  | 0          | 2          |
| BUG:Bad_rss-counter_state_mm:#type:MM_ANONPAGES_val:#Comm:modprobe_Pid | 0          | 3          |
| BUG:Bad_rss-counter_state_mm:#type:MM_FILEPAGES_val:#Comm:modprobe_Pid | 0          | 1          |
+------------------------------------------------------------------------+------------+------------+


If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202511061432.4e534796-lkp@intel.com


[   14.858862][   T67] BUG: Bad rss-counter state mm:ffff8881000655c0 type:MM_ANONPAGES val:0 Comm:kworker/u9:0 Pid:67
[   14.894890][   T69] BUG: Bad rss-counter state mm:ffff888100061cc0 type:MM_FILEPAGES val:0 Comm:kworker/u9:0 Pid:69
[   14.896108][   T69] BUG: Bad rss-counter state mm:ffff888100061cc0 type:MM_ANONPAGES val:0 Comm:kworker/u9:0 Pid:69
[   14.918858][   T71] module: module-autoload: duplicate request for module crypto-aes
[   14.919479][   T71] module: module-autoload: duplicate request for module crypto-aes-all
[   14.920801][    T1] krb5: Running aes128-cts-hmac-sha256-128 enc plain<block
[   14.921844][    T1] krb5: Running aes128-cts-hmac-sha256-128 enc plain==block
[   14.922852][    T1] krb5: Running aes128-cts-hmac-sha256-128 enc plain>block
[   14.923843][    T1] krb5: Running aes256-cts-hmac-sha384-192 enc no plain
[   14.939591][    T1] krb5: Running aes256-cts-hmac-sha384-192 enc plain<block
[   14.940614][    T1] krb5: Running aes256-cts-hmac-sha384-192 enc plain==block
[   14.941586][    T1] krb5: Running aes256-cts-hmac-sha384-192 enc plain>block
[   14.942547][    T1] krb5: Running camellia128-cts-cmac enc no plain
[   15.018568][   T85] BUG: Bad rss-counter state mm:ffff888160f81340 type:MM_ANONPAGES val:0 Comm:kworker/u9:0 Pid:85
[   15.054490][   T89] module: module-autoload: duplicate request for module crypto-camellia
[   15.055466][   T89] module: module-autoload: duplicate request for module crypto-camellia-all
[   15.056999][    T1] krb5: Running camellia128-cts-cmac enc 1 plain
[   15.057912][    T1] krb5: Running camellia128-cts-cmac enc 9 plain
[   15.058781][    T1] krb5: Running camellia128-cts-cmac enc 13 plain
[   15.059603][    T1] krb5: Running camellia128-cts-cmac enc 30 plain
[   15.061279][    T1] krb5: Running camellia256-cts-cmac enc no plain
[   15.062207][    T1] krb5: Running camellia256-cts-cmac enc 1 plain
[   15.063150][    T1] krb5: Running camellia256-cts-cmac enc 9 plain
[   15.072917][    T1] krb5: Running camellia256-cts-cmac enc 13 plain
[   15.073896][    T1] krb5: Running camellia256-cts-cmac enc 30 plain
[   15.074834][    T1] krb5: Running aes128-cts-hmac-sha256-128 mic
[   15.075625][    T1] krb5: Running aes256-cts-hmac-sha384-192 mic
[   15.076396][    T1] krb5: Running camellia128-cts-cmac mic abc
[   15.077225][    T1] krb5: Running camellia128-cts-cmac mic ABC
[   15.078052][    T1] krb5: Running camellia256-cts-cmac mic 123
[   15.078853][    T1] krb5: Running camellia256-cts-cmac mic !@#
[   15.079621][    T1] krb5: Selftests succeeded
[   15.080683][    T1] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 248)
[   15.081610][    T1] io scheduler kyber registered
[   15.082527][    T1] test_mul_u64_u64_div_u64: Starting mul_u64_u64_div_u64() test
[   15.083365][    T1] test_mul_u64_u64_div_u64: ERROR: 0x000000000000000b * 0x0000000000000007 +/ 0x0000000000000003
[   15.086382][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: 000000000000001a
[   15.087178][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: 0000000000000019
[   15.088064][    T1] test_mul_u64_u64_div_u64: ERROR: 0x00000000ffffffff * 0x00000000ffffffff +/ 0x0000000000000002
[   15.089105][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: 7fffffff00000001
[   15.089924][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: 7fffffff00000000
[   15.090696][    T1] test_mul_u64_u64_div_u64: ERROR: 0x00000001ffffffff * 0x00000000ffffffff +/ 0x0000000000000002
[   15.091734][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: fffffffe80000001
[   15.092502][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: fffffffe80000000
[   15.093281][    T1] test_mul_u64_u64_div_u64: ERROR: 0x00000001ffffffff * 0x00000001ffffffff +/ 0x0000000000000004
[   15.094337][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: ffffffff00000001
[   15.095172][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: ffffffff00000000
[   15.095953][    T1] test_mul_u64_u64_div_u64: ERROR: 0xffff000000000000 * 0xffff000000000000 +/ 0xffff000000000001
[   15.097175][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: ffff000000000000
[   15.098020][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: fffeffffffffffff
[   15.098837][    T1] test_mul_u64_u64_div_u64: ERROR: 0x3333333333333333 * 0x3333333333333333 +/ 0x5555555555555555
[   15.099924][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: 1eb851eb851eb852
[   15.100721][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: 1eb851eb851eb851
[   15.101542][    T1] test_mul_u64_u64_div_u64: ERROR: 0x7fffffffffffffff * 0x0000000000000002 +/ 0x0000000000000003
[   15.102565][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: 5555555555555555
[   15.103368][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: 5555555555555554
[   15.107134][    T1] test_mul_u64_u64_div_u64: ERROR: 0xffffffffffffffff * 0x0000000000000002 +/ 0x8000000000000000
[   15.108196][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: 0000000000000004
[   15.109049][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: 0000000000000003
[   15.109887][    T1] test_mul_u64_u64_div_u64: ERROR: 0xffffffffffffffff * 0x0000000000000002 +/ 0xc000000000000000
[   15.111017][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: 0000000000000003
[   15.111907][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: 0000000000000002
[   15.112666][    T1] test_mul_u64_u64_div_u64: ERROR: 0xffffffffffffffff * 0x4000000000000004 +/ 0x8000000000000000
[   15.113703][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: 8000000000000008
[   15.114527][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: 8000000000000007
[   15.115424][    T1] test_mul_u64_u64_div_u64: ERROR: 0xffffffffffffffff * 0x4000000000000001 +/ 0x8000000000000000
[   15.116279][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: 8000000000000002
[   15.116882][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: 8000000000000001
[   15.117490][    T1] test_mul_u64_u64_div_u64: ERROR: 0xfffffffffffffffe * 0x8000000000000001 +/ 0xffffffffffffffff
[   15.118363][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: 8000000000000001
[   15.119240][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: 8000000000000000
[   15.119914][    T1] test_mul_u64_u64_div_u64: ERROR: 0xffffffffffffffff * 0x8000000000000001 +/ 0xfffffffffffffffe
[   15.120785][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: 8000000000000002
[   15.121627][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: 8000000000000001
[   15.122503][    T1] test_mul_u64_u64_div_u64: ERROR: 0xffffffffffffffff * 0x8000000000000001 +/ 0xfffffffffffffffd
[   15.123624][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: 8000000000000003
[   15.124521][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: 8000000000000002
[   15.125399][    T1] test_mul_u64_u64_div_u64: ERROR: 0x7fffffffffffffff * 0xffffffffffffffff +/ 0xc000000000000000
[   15.126592][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: aaaaaaaaaaaaaaa9
[   15.127438][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: aaaaaaaaaaaaaaa8
[   15.128411][    T1] test_mul_u64_u64_div_u64: ERROR: 0xffffffffffffffff * 0x7fffffffffffffff +/ 0xa000000000000000
[   15.129565][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: cccccccccccccccb
[   15.130454][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: ccccccccccccccca
[   15.131239][    T1] test_mul_u64_u64_div_u64: ERROR: 0xffffffffffffffff * 0x7fffffffffffffff +/ 0x9000000000000000
[   15.132213][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: e38e38e38e38e38c
[   15.132793][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: e38e38e38e38e38b
[   15.133374][    T1] test_mul_u64_u64_div_u64: ERROR: 0x7fffffffffffffff * 0x7fffffffffffffff +/ 0x5000000000000000
[   15.134101][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: ccccccccccccccca
[   15.134674][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: ccccccccccccccc9
[   15.135235][    T1] test_mul_u64_u64_div_u64: ERROR: 0xe6102d256d7ea3ae * 0x70a77d0be4c31201 +/ 0xd63ec35ab3220357
[   15.135984][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: 78f8bf8cc86c6e19
[   15.136587][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: 78f8bf8cc86c6e18
[   15.137140][    T1] test_mul_u64_u64_div_u64: ERROR: 0xf53bae05cb86c6e1 * 0x3847b32d2f8d32e0 +/ 0xcfd4f55a647f403c
[   15.137964][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: 42687f79d8998d36
[   15.138541][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: 42687f79d8998d35
[   15.139135][    T1] test_mul_u64_u64_div_u64: ERROR: 0x9951c5498f941092 * 0x1f8c8bfdf287a251 +/ 0xa3c8dc5f81ea3fe2
[   15.139884][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: 1d887cb259000920
[   15.140444][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: 1d887cb25900091f
[   15.141025][    T1] test_mul_u64_u64_div_u64: ERROR: 0x374fee9daa1bb2bb * 0x0d0bfbff7b8ae3ef +/ 0xc169337bd42d5179
[   15.141759][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: 03bb2dbaffcbb962
[   15.142324][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: 03bb2dbaffcbb961
[   15.142890][    T1] test_mul_u64_u64_div_u64: ERROR: 0xeac0d03ac10eeaf0 * 0x89be05dfa162ed9b +/ 0x92bb1679a41f0e4b
[   15.143618][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: dc5f5cc9e270d217
[   15.144200][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: dc5f5cc9e270d216
[   15.144767][    T1] test_mul_u64_u64_div_u64: Completed mul_u64_u64_div_u64() test, 56 tests, 23 errors, 61402015 ns
[   15.147067][    T1] gpio_virtuser: Failed to create the debugfs tree: -2
[   15.148313][    T1] gpio_winbond: chip ID at 2e is ffff
[   15.148884][    T1] gpio_winbond: not an our chip
[   15.149345][    T1] gpio_winbond: chip ID at 4e is ffff
[   15.149721][    T1] gpio_winbond: not an our chip
[   15.151343][    T1] IPMI message handler: version 39.2
[   15.151885][    T1] ipmi device interface
[   15.152644][    T1] ipmi_si: IPMI System Interface driver
[   15.153494][    T1] ipmi_si: Unable to find any System Interface(s)


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20251106/202511061432.4e534796-lkp@intel.com



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Re: [RFC PATCH v7 2/2] mm: Fix OOM killer inaccuracy on large many-core systems
Posted by Shakeel Butt 3 months ago
On Thu, Nov 06, 2025 at 02:53:09PM +0800, kernel test robot wrote:
> 
> 
> Hello,
> 
> kernel test robot noticed "BUG:Bad_rss-counter_state_mm:#type:MM_ANONPAGES_val:#Comm:kworker##Pid" on:
> 
> commit: 25ae03e80acad812e536694c1a07a3f57784ae23 ("[RFC PATCH v7 2/2] mm: Fix OOM killer inaccuracy on large many-core systems")
> url: https://github.com/intel-lab-lkp/linux/commits/Mathieu-Desnoyers/lib-Introduce-hierarchical-per-cpu-counters/20251031-224455
> base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
> patch link: https://lore.kernel.org/all/20251031144232.15284-3-mathieu.desnoyers@efficios.com/
> patch subject: [RFC PATCH v7 2/2] mm: Fix OOM killer inaccuracy on large many-core systems
> 
> in testcase: boot
> 
> config: x86_64-randconfig-002-20251103
> compiler: clang-20
> test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
> 
> (please refer to attached dmesg/kmsg for entire log/backtrace)
> 
> 
> in fact, we observed various BUG:Bad_rss-counter_state_mm issues for this commit
> but clean on parent, as below
> 
> +------------------------------------------------------------------------+------------+------------+
> |                                                                        | 05880dc4af | 25ae03e80a |
> +------------------------------------------------------------------------+------------+------------+
> | BUG:Bad_rss-counter_state_mm:#type:MM_FILEPAGES_val:#Comm:kworker##Pid | 0          | 10         |
> | BUG:Bad_rss-counter_state_mm:#type:MM_ANONPAGES_val:#Comm:kworker##Pid | 0          | 17         |
> | BUG:Bad_rss-counter_state_mm:#type:MM_ANONPAGES_val:#Comm:swapper_Pid  | 0          | 2          |
> | BUG:Bad_rss-counter_state_mm:#type:MM_ANONPAGES_val:#Comm:modprobe_Pid | 0          | 3          |
> | BUG:Bad_rss-counter_state_mm:#type:MM_FILEPAGES_val:#Comm:modprobe_Pid | 0          | 1          |
> +------------------------------------------------------------------------+------------+------------+
> 
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <oliver.sang@intel.com>
> | Closes: https://lore.kernel.org/oe-lkp/202511061432.4e534796-lkp@intel.com
> 
> 
> [   14.858862][   T67] BUG: Bad rss-counter state mm:ffff8881000655c0 type:MM_ANONPAGES val:0 Comm:kworker/u9:0 Pid:67
> [   14.894890][   T69] BUG: Bad rss-counter state mm:ffff888100061cc0 type:MM_FILEPAGES val:0 Comm:kworker/u9:0 Pid:69
> [   14.896108][   T69] BUG: Bad rss-counter state mm:ffff888100061cc0 type:MM_ANONPAGES val:0 Comm:kworker/u9:0 Pid:69

Hmm, this shows that percpu_counter_tree_precise_sum() is returning 0 but
percpu_counter_tree_approximate_sum() is off by more than
counter->inaccuracy. I have not dug deeper to find out why, but this needs
to be resolved before considering this series for upstream.
Re: [RFC PATCH v7 2/2] mm: Fix OOM killer inaccuracy on large many-core systems
Posted by Mathieu Desnoyers 3 months ago
On 2025-11-06 19:32, Shakeel Butt wrote:

[...]

>> [   14.858862][   T67] BUG: Bad rss-counter state mm:ffff8881000655c0 type:MM_ANONPAGES val:0 Comm:kworker/u9:0 Pid:67
>> [   14.894890][   T69] BUG: Bad rss-counter state mm:ffff888100061cc0 type:MM_FILEPAGES val:0 Comm:kworker/u9:0 Pid:69
>> [   14.896108][   T69] BUG: Bad rss-counter state mm:ffff888100061cc0 type:MM_ANONPAGES val:0 Comm:kworker/u9:0 Pid:69
> 
> Hmm, this shows that percpu_counter_tree_precise_sum() is returning 0 but
> percpu_counter_tree_approximate_sum() is off by more than
> counter->inaccuracy. I have not dug deeper to find out why, but this needs
> to be resolved before considering this series for upstream.

I notice that those BUGs show up while loading modules at boot in kworker context, e.g.:

[   14.858862][   T67] BUG: Bad rss-counter state mm:ffff8881000655c0 type:MM_ANONPAGES val:0 Comm:kworker/u9:0 Pid:67
[   14.894890][   T69] BUG: Bad rss-counter state mm:ffff888100061cc0 type:MM_FILEPAGES val:0 Comm:kworker/u9:0 Pid:69
[   14.896108][   T69] BUG: Bad rss-counter state mm:ffff888100061cc0 type:MM_ANONPAGES val:0 Comm:kworker/u9:0 Pid:69
[   14.918858][   T71] module: module-autoload: duplicate request for module crypto-aes
[   14.919479][   T71] module: module-autoload: duplicate request for module crypto-aes-all
[   14.920801][    T1] krb5: Running aes128-cts-hmac-sha256-128 enc plain<block
[   14.921844][    T1] krb5: Running aes128-cts-hmac-sha256-128 enc plain==block
[   14.922852][    T1] krb5: Running aes128-cts-hmac-sha256-128 enc plain>block
[   14.923843][    T1] krb5: Running aes256-cts-hmac-sha384-192 enc no plain
[   14.939591][    T1] krb5: Running aes256-cts-hmac-sha384-192 enc plain<block
[   14.940614][    T1] krb5: Running aes256-cts-hmac-sha384-192 enc plain==block
[   14.941586][    T1] krb5: Running aes256-cts-hmac-sha384-192 enc plain>block
[   14.942547][    T1] krb5: Running camellia128-cts-cmac enc no plain
[   15.018568][   T85] BUG: Bad rss-counter state mm:ffff888160f81340 type:MM_ANONPAGES val:0 Comm:kworker/u9:0 Pid:85

I used "module_init" similarly to lib/percpu_counter.c, but I think it
happens too late in the boot sequence:

   module_init(percpu_counter_startup);

module_init maps to __initcall within a built-in compile unit, which
maps to device_initcall(), which happens quite late within the sequence
called from do_initcalls(), called from do_basic_setup().

And even do_basic_setup is documented as:

  * Ok, the machine is now initialized. None of the devices
  * have been touched yet, but the CPU subsystem is up and
  * running, and memory and process management works.

which clearly implies that the mm subsystem is expected to
be ready at that point.

It probably was not an issue for the non-hierarchical percpu
counters because all it was initializing is handling of CPU hotplug,
but the new hierarchical counters initialize the pre-calculated
inaccuracy value which is used to figure out whether the approximate
sum is sufficient to compare values or if the precise sum is needed.

I think this is why we are hitting this BUG.

Now I wonder where I should move this initialization. It requires
"nr_cpu_ids" to be initialized, and pretty much needs to be done
before mms are created. I'm starting to suspect that the module init
code can spawn kworkers that have a mm before the init process runs.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [RFC PATCH v7 2/2] mm: Fix OOM killer inaccuracy on large many-core systems
Posted by Mathieu Desnoyers 3 months ago
On 2025-11-07 09:43, Mathieu Desnoyers wrote:
> On 2025-11-06 19:32, Shakeel Butt wrote:
> 
> [...]
> 
>>> [   14.858862][   T67] BUG: Bad rss-counter state mm:ffff8881000655c0 
>>> type:MM_ANONPAGES val:0 Comm:kworker/u9:0 Pid:67
>>> [   14.894890][   T69] BUG: Bad rss-counter state mm:ffff888100061cc0 
>>> type:MM_FILEPAGES val:0 Comm:kworker/u9:0 Pid:69
>>> [   14.896108][   T69] BUG: Bad rss-counter state mm:ffff888100061cc0 
>>> type:MM_ANONPAGES val:0 Comm:kworker/u9:0 Pid:69
>>
>> Hmm, this shows that percpu_counter_tree_precise_sum() is returning 0 but
>> percpu_counter_tree_approximate_sum() is off by more than
>> counter->inaccuracy. I have not dug deeper to find out why, but this needs
>> to be resolved before considering this series for upstream.
> 
> I notice that those BUGs show up while loading modules at boot in kworker 
> context, e.g.:
> 
> [   14.858862][   T67] BUG: Bad rss-counter state mm:ffff8881000655c0 
> type:MM_ANONPAGES val:0 Comm:kworker/u9:0 Pid:67
> [   14.894890][   T69] BUG: Bad rss-counter state mm:ffff888100061cc0 
> type:MM_FILEPAGES val:0 Comm:kworker/u9:0 Pid:69
> [   14.896108][   T69] BUG: Bad rss-counter state mm:ffff888100061cc0 
> type:MM_ANONPAGES val:0 Comm:kworker/u9:0 Pid:69
> [   14.918858][   T71] module: module-autoload: duplicate request for 
> module crypto-aes
> [   14.919479][   T71] module: module-autoload: duplicate request for 
> module crypto-aes-all
> [   14.920801][    T1] krb5: Running aes128-cts-hmac-sha256-128 enc 
> plain<block
> [   14.921844][    T1] krb5: Running aes128-cts-hmac-sha256-128 enc 
> plain==block
> [   14.922852][    T1] krb5: Running aes128-cts-hmac-sha256-128 enc 
> plain>block
> [   14.923843][    T1] krb5: Running aes256-cts-hmac-sha384-192 enc no 
> plain
> [   14.939591][    T1] krb5: Running aes256-cts-hmac-sha384-192 enc 
> plain<block
> [   14.940614][    T1] krb5: Running aes256-cts-hmac-sha384-192 enc 
> plain==block
> [   14.941586][    T1] krb5: Running aes256-cts-hmac-sha384-192 enc 
> plain>block
> [   14.942547][    T1] krb5: Running camellia128-cts-cmac enc no plain
> [   15.018568][   T85] BUG: Bad rss-counter state mm:ffff888160f81340 
> type:MM_ANONPAGES val:0 Comm:kworker/u9:0 Pid:85
> 
> I used "module_init" similarly to lib/percpu_counter.c, but I think it
> happens too late in the boot sequence:
> 
>    module_init(percpu_counter_startup);
> 
> module_init maps to __initcall within a built-in compile unit, which
> maps to device_initcall(), which happens quite late within the sequence
> called from do_initcalls(), called from do_basic_setup().
> 
> And even do_basic_setup is documented as:
> 
>   * Ok, the machine is now initialized. None of the devices
>   * have been touched yet, but the CPU subsystem is up and
>   * running, and memory and process management works.
> 
> which clearly implies that the mm subsystem is expected to
> be ready at that point.
> 
> It probably was not an issue for the non-hierarchical percpu
> counters because all their init did was set up CPU hotplug handling,
> but the new hierarchical counters initialize the pre-calculated
> inaccuracy value, which is used to figure out whether the approximate
> sum is sufficient to compare values or if the precise sum is needed.
> 
> I think this is why we are hitting this BUG.
> 
> Now I wonder where I should move this initialization. It requires
> "nr_cpu_ids" to be initialized, and pretty much needs to be done
> before mms are created. I'm starting to suspect that the module init
> code can spawn kworkers that have an mm before the init process runs.

At least on x86, nr_cpu_ids appears to be set by set_nr_cpu_ids()
through early_param("possible_cpus", setup_possible_cpus), which is
AFAIU called from parse_early_param(), which happens very early in the
boot sequence.
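
To make the suspected failure mode from the quoted analysis concrete,
here is a small userspace model. It is only a sketch and not the code
from the patch: every model_* name, the struct layout and the comparison
logic are assumptions made for illustration. It shows how a counter
created while the global multiplier is still 0 ends up with a zero
inaccuracy bound, so the comparison trusts a stale approximate sum
instead of falling back to the precise sum.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the global computed once at init time. */
static unsigned int model_inaccuracy_multiplier;	/* 0 until init runs */

/* Hypothetical, simplified counter: not the layout used by the patch. */
struct model_counter {
	int approx_sum;			/* cheap value seen by the fast path */
	int precise_sum;		/* value a full per-CPU walk would return */
	unsigned int inaccuracy;	/* bound snapshotted at counter creation */
};

/* Models the one-time subsystem init that sets the multiplier. */
static void model_subsystem_init(unsigned int multiplier)
{
	model_inaccuracy_multiplier = multiplier;
}

/* A counter created before init gets inaccuracy == 0. */
static void model_counter_init(struct model_counter *c, unsigned int batch_size)
{
	c->approx_sum = 0;
	c->precise_sum = 0;
	c->inaccuracy = batch_size * model_inaccuracy_multiplier;
}

/* Returns -1, 0 or 1 if the counter is below, equal to or above v. */
static int model_compare_value(const struct model_counter *c, int v)
{
	long delta = (long)c->approx_sum - v;

	/* Approximate sum is trusted when it is outside the error bound. */
	if (labs(delta) > (long)c->inaccuracy)
		return delta < 0 ? -1 : 1;

	/* Otherwise fall back to the precise sum. */
	if (c->precise_sum == v)
		return 0;
	return c->precise_sum < v ? -1 : 1;
}
```

In this model, a counter initialized before model_subsystem_init() with
approx_sum = 3 and precise_sum = 0 compares as nonzero against 0, which
is the kind of result that would trigger a "Bad rss-counter state"
report. The same values on a counter created after init (with a
made-up multiplier of 8 and batch size of 32) fall back to the precise
sum and compare equal.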

It would make sense to call an explicit percpu counter tree init
function from start_kernel() between the call to mm_core_init() and the
call to maple_tree_init(). This way it would be initialized right after
mm, but given that the hierarchical counter tree is a lib that can be
used for other purposes than mm accounting, I think it makes sense
to call its init explicitly from start_kernel() rather than bury
it within mm_core_init().
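
The ordering argument can be sketched as a toy userspace model. This is
illustration only: the model_* names are hypothetical, and the initcall
level at which the first mm gets created (MODEL_FIRST_MM_LEVEL) is an
assumption, not something established in this thread. The only facts
relied upon are that initcall levels run in order from 0 to 7 and that
a built-in module_init() runs at the device_initcall level (6).

```c
#include <stdbool.h>

/* Assumption: the first mm is created at some level before device (6). */
#define MODEL_FIRST_MM_LEVEL	4
#define MODEL_DEVICE_LEVEL	6
#define MODEL_LAST_LEVEL	7

static bool model_tree_ready;

/* Stands in for the counter tree init function. */
static void model_tree_subsystem_init(void)
{
	model_tree_ready = true;
}

/*
 * Walks a simplified boot sequence and returns whether the counter tree
 * was already initialized when the first mm was created.
 */
static bool model_boot(bool explicit_init_in_start_kernel)
{
	bool first_mm_saw_ready = false;
	int level;

	model_tree_ready = false;

	/* start_kernel(): mm_core_init() would have run just before this. */
	if (explicit_init_in_start_kernel)
		model_tree_subsystem_init();
	/* maple_tree_init() and the rest of start_kernel() would follow. */

	/* do_initcalls(): levels run strictly in increasing order. */
	for (level = 0; level <= MODEL_LAST_LEVEL; level++) {
		if (level == MODEL_FIRST_MM_LEVEL)
			first_mm_saw_ready = model_tree_ready;
		/* module_init() path: init only happens at device level. */
		if (level == MODEL_DEVICE_LEVEL && !explicit_init_in_start_kernel)
			model_tree_subsystem_init();
	}
	return first_mm_saw_ready;
}
```

With the module_init() style registration, model_boot(false) reports
that the first mm was created before the tree was ready. With the
explicit call from start_kernel(), model_boot(true) reports that it was
ready, regardless of which initcall level creates that first mm.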

Thoughts?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [RFC PATCH v7 2/2] mm: Fix OOM killer inaccuracy on large many-core systems
Posted by Mathieu Desnoyers 3 months ago
On 2025-11-07 10:53, Mathieu Desnoyers wrote:
[...]
> 
> It would make sense to call an explicit percpu counter tree init
> function from start_kernel() between the call to mm_core_init() and the
> call to maple_tree_init(). This way it would be initialized right after
> mm, but given that the hierarchical counter tree is a lib that can be
> used for other purposes than mm accounting, I think it makes sense
> to call its init explicitly from start_kernel() rather than bury
> it within mm_core_init().
See the following diff. If nobody objects, I'll prepare a v8 which
includes it.

diff --git a/include/linux/percpu_counter_tree.h b/include/linux/percpu_counter_tree.h
index 8795e782680a..40fcdd6456b6 100644
--- a/include/linux/percpu_counter_tree.h
+++ b/include/linux/percpu_counter_tree.h
@@ -41,6 +41,7 @@ int percpu_counter_tree_precise_compare(struct percpu_counter_tree *a, struct pe
  int percpu_counter_tree_precise_compare_value(struct percpu_counter_tree *counter, int v);
  void percpu_counter_tree_set(struct percpu_counter_tree *counter, int v);
  unsigned int percpu_counter_tree_inaccuracy(struct percpu_counter_tree *counter);
+int percpu_counter_tree_subsystem_init(void);

  /* Fast paths */

@@ -191,6 +192,12 @@ int percpu_counter_tree_approximate_sum(struct percpu_counter_tree *counter)
  	return percpu_counter_tree_precise_sum(counter);
  }

+static inline
+int percpu_counter_tree_subsystem_init(void)
+{
+	return 0;
+}
+
  #endif	/* CONFIG_SMP */

  static inline
diff --git a/init/main.c b/init/main.c
index 07a3116811c5..204d9f913130 100644
--- a/init/main.c
+++ b/init/main.c
@@ -104,6 +104,7 @@
  #include <linux/pidfs.h>
  #include <linux/ptdump.h>
  #include <linux/time_namespace.h>
+#include <linux/percpu_counter_tree.h>
  #include <net/net_namespace.h>

  #include <asm/io.h>
@@ -969,6 +970,7 @@ void start_kernel(void)
  	sort_main_extable();
  	trap_init();
  	mm_core_init();
+	percpu_counter_tree_subsystem_init();
  	maple_tree_init();
  	poking_init();
  	ftrace_init();
diff --git a/lib/percpu_counter_tree.c b/lib/percpu_counter_tree.c
index 9577d94251d1..05c3db0ce5b1 100644
--- a/lib/percpu_counter_tree.c
+++ b/lib/percpu_counter_tree.c
@@ -379,7 +379,7 @@ static unsigned int __init calculate_inaccuracy_multiplier(void)
  	return inaccuracy;
  }

-static int __init percpu_counter_startup(void)
+int __init percpu_counter_tree_subsystem_init(void)
  {

  	nr_cpus_order = get_count_order(nr_cpu_ids);
@@ -391,4 +391,3 @@ static int __init percpu_counter_startup(void)
  	inaccuracy_multiplier = calculate_inaccuracy_multiplier();
  	return 0;
  }
-module_init(percpu_counter_startup);


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [RFC PATCH v7 2/2] mm: Fix OOM killer inaccuracy on large many-core systems
Posted by Shakeel Butt 3 months ago
On Fri, Nov 07, 2025 at 11:04:01AM -0500, Mathieu Desnoyers wrote:
> On 2025-11-07 10:53, Mathieu Desnoyers wrote:
> [...]
> > 
> > It would make sense to call an explicit percpu counter tree init
> > function from start_kernel() between the call to mm_core_init() and the
> > call to maple_tree_init(). This way it would be initialized right after
> > mm, but given that the hierarchical counter tree is a lib that can be
> > used for other purposes than mm accounting, I think it makes sense
> > to call its init explicitly from start_kernel() rather than bury
> > it within mm_core_init().
> See the following diff. If nobody object, I'll prepare a v8 which
> includes it.

This seems reasonable to me. I see v8 is already posted; I will take a
deeper look.