[PATCH v9 1/2] lib: Introduce hierarchical per-cpu counters

Posted by Mathieu Desnoyers 1 week, 4 days ago
* Motivation

The purpose of this hierarchical split-counter scheme is to:

- Minimize contention when incrementing and decrementing counters,
- Provide fast access to a sum approximation,
- Provide a sum approximation with an acceptable accuracy level when
  scaling to many-core systems,
- Provide approximate and precise comparison of two counters, and
  between a counter and a value.

It aims to fix the per-mm RSS tracking, which has become too
inaccurate for OOM killer purposes on large many-core systems [1].

* Design

The hierarchical per-CPU counters propagate a sum approximation through
an N-way tree. When a level reaches its batch size, the carry is
propagated up the tree, which consists of logN(nr_cpu_ids) levels. The
batch size of each level is the batch size of the prior level multiplied
by that level's N-arity (twice the prior level for a binary tree).

Example propagation diagram with 8 cpus through a binary tree:

Level 0:  0    1    2    3    4    5    6    7
          |   /     |   /     |   /     |   /
          |  /      |  /      |  /      |  /
          | /       | /       | /       | /
Level 1:  0         1         2         3
          |       /           |       /
          |    /              |    /
          | /                 | /
Level 2:  0                   1
          |               /
          |         /
          |   /
Level 3:  0

For a binary tree, the maximum inaccuracy is bounded by:
   batch_size * log2(nr_cpus) * nr_cpus
which grows as O(n*log(n)) with the number of CPUs.
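For example, with a batch size of 32 on an 8-CPU system the bound is
32 * log2(8) * 8 = 768, which matches the ±768 global inaccuracy worked
out in the header comment of lib/percpu_counter_tree.c.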

For an N-way tree, the maximum inaccuracy can be pre-calculated
based on the N-arity of each level and the batch size.
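
* Example usage

A minimal sketch of the intended calling sequence (illustrative only,
not part of the patch; the batch size of 32 is arbitrary and kzalloc()
is just one way for the caller to provide the items storage):

	struct percpu_counter_tree counter;
	struct percpu_counter_tree_level_item *items;
	int ret, v;

	/* Tree items (levels >= 1) are allocated by the caller. */
	items = kzalloc(percpu_counter_tree_items_size(), GFP_KERNEL);
	if (!items)
		return -ENOMEM;
	ret = percpu_counter_tree_init(&counter, items, 32, GFP_KERNEL);
	if (ret) {
		kfree(items);
		return ret;
	}

	percpu_counter_tree_add(&counter, 1);			/* fast path */
	v = percpu_counter_tree_approximate_sum(&counter);	/* fast path */

	percpu_counter_tree_destroy(&counter);
	kfree(items);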

Link: https://lore.kernel.org/lkml/20250331223516.7810-2-sweettea-kernel@dorminy.me/ # [1]
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Martin Liu <liumartin@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: christian.koenig@amd.com
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: SeongJae Park <sj@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R . Howlett" <liam.howlett@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-mm@kvack.org
Cc: linux-trace-kernel@vger.kernel.org
Cc: Yu Zhao <yuzhao@google.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Aboorva Devarajan <aboorvad@linux.ibm.com>
---
Changes since v8:
- Remove migrate guard from the fast path. It does not
  matter through which path the carry is propagated up
  the tree.
- Rebase on top of v6.18-rc6.
- Introduce percpu_counter_tree_init_many and
  percpu_counter_tree_destroy_many APIs.
- Move tree items allocation to the caller.
- Introduce percpu_counter_tree_items_size().
- Move percpu_counter_tree_subsystem_init() call before mm_core_init()
  so percpu_counter_tree_items_size() is initialized before it is used.

Changes since v7:
- Explicitly initialize the subsystem from start_kernel() right
  after mm_core_init() so it is up and running before the creation of
  the first mm at boot.
- Remove the module.h include which is not needed with the explicit
  initialization.
- Only consider levels>0 items for order={0,1} nr_items. No
  functional change except to allocate only the amount of memory
  which is strictly needed.
- Introduce positive precise sum API to handle a scenario where an
  unlucky precise sum iteration would hit negative counter values
  concurrently with counter updates.

Changes since v5:
- Introduce percpu_counter_tree_approximate_sum_positive.
- Introduce !CONFIG_SMP static inlines for UP build.
- Remove percpu_counter_tree_set_bias from the public API and make it
  static.

Changes since v3:
- Add gfp flags to init function.

Changes since v2:
- Introduce N-way tree to reduce tree depth on larger systems.

Changes since v1:
- Remove percpu_counter_tree_precise_sum_unbiased from public header,
  make this function static,
- Introduce precise and approximate comparisons between two counters,
- Reorder the struct percpu_counter_tree fields,
- Introduce approx_sum field, which points to the approximate sum
  for the percpu_counter_tree_approximate_sum() fast path.
---
 include/linux/percpu_counter_tree.h | 239 +++++++++++++++
 init/main.c                         |   2 +
 lib/Makefile                        |   1 +
 lib/percpu_counter_tree.c           | 443 ++++++++++++++++++++++++++++
 4 files changed, 685 insertions(+)
 create mode 100644 include/linux/percpu_counter_tree.h
 create mode 100644 lib/percpu_counter_tree.c

diff --git a/include/linux/percpu_counter_tree.h b/include/linux/percpu_counter_tree.h
new file mode 100644
index 000000000000..1f4938b67730
--- /dev/null
+++ b/include/linux/percpu_counter_tree.h
@@ -0,0 +1,239 @@
+/* SPDX-License-Identifier: GPL-2.0+ OR MIT */
+/* SPDX-FileCopyrightText: 2025 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> */
+
+#ifndef _PERCPU_COUNTER_TREE_H
+#define _PERCPU_COUNTER_TREE_H
+
+#include <linux/preempt.h>
+#include <linux/atomic.h>
+#include <linux/percpu.h>
+
+#ifdef CONFIG_SMP
+
+struct percpu_counter_tree_level_item {
+	atomic_t count;
+} ____cacheline_aligned_in_smp;
+
+struct percpu_counter_tree {
+	/* Fast-path fields. */
+	unsigned int __percpu *level0;
+	unsigned int level0_bit_mask;
+	union {
+		unsigned int *i;
+		atomic_t *a;
+	} approx_sum;
+	int bias;			/* bias for counter_set */
+
+	/* Slow-path fields. */
+	struct percpu_counter_tree_level_item *items;
+	unsigned int batch_size;
+	unsigned int inaccuracy;	/* approximation imprecise within ± inaccuracy */
+};
+
+size_t percpu_counter_tree_items_size(void);
+int percpu_counter_tree_init_many(struct percpu_counter_tree *counters, struct percpu_counter_tree_level_item *items,
+				  unsigned int nr_counters, unsigned int batch_size, gfp_t gfp_flags);
+int percpu_counter_tree_init(struct percpu_counter_tree *counter, struct percpu_counter_tree_level_item *items,
+			     unsigned int batch_size, gfp_t gfp_flags);
+void percpu_counter_tree_destroy_many(struct percpu_counter_tree *counter, unsigned int nr_counters);
+void percpu_counter_tree_destroy(struct percpu_counter_tree *counter);
+void percpu_counter_tree_add_slowpath(struct percpu_counter_tree *counter, int inc);
+int percpu_counter_tree_precise_sum(struct percpu_counter_tree *counter);
+int percpu_counter_tree_approximate_compare(struct percpu_counter_tree *a, struct percpu_counter_tree *b);
+int percpu_counter_tree_approximate_compare_value(struct percpu_counter_tree *counter, int v);
+int percpu_counter_tree_precise_compare(struct percpu_counter_tree *a, struct percpu_counter_tree *b);
+int percpu_counter_tree_precise_compare_value(struct percpu_counter_tree *counter, int v);
+void percpu_counter_tree_set(struct percpu_counter_tree *counter, int v);
+unsigned int percpu_counter_tree_inaccuracy(struct percpu_counter_tree *counter);
+int percpu_counter_tree_subsystem_init(void);
+
+/* Fast paths */
+
+static inline
+int percpu_counter_tree_carry(int orig, int res, int inc, unsigned int bit_mask)
+{
+	if (inc < 0) {
+		inc = -(-inc & ~(bit_mask - 1));
+		/*
+		 * xor bit_mask: underflow.
+		 *
+		 * If inc has bit set, decrement an additional bit if
+		 * there is _no_ bit transition between orig and res.
+		 * Else, inc has bit cleared, decrement an additional
+		 * bit if there is a bit transition between orig and
+		 * res.
+		 */
+		if ((inc ^ orig ^ res) & bit_mask)
+			inc -= bit_mask;
+	} else {
+		inc &= ~(bit_mask - 1);
+		/*
+		 * xor bit_mask: overflow.
+		 *
+		 * If inc has bit set, increment an additional bit if
+		 * there is _no_ bit transition between orig and res.
+		 * Else, inc has bit cleared, increment an additional
+		 * bit if there is a bit transition between orig and
+		 * res.
+		 */
+		if ((inc ^ orig ^ res) & bit_mask)
+			inc += bit_mask;
+	}
+	return inc;
+}
+
+static inline
+void percpu_counter_tree_add(struct percpu_counter_tree *counter, int inc)
+{
+	unsigned int bit_mask = counter->level0_bit_mask, orig, res;
+
+	res = this_cpu_add_return(*counter->level0, inc);
+	orig = res - inc;
+	inc = percpu_counter_tree_carry(orig, res, inc, bit_mask);
+	if (likely(!inc))
+		return;
+	percpu_counter_tree_add_slowpath(counter, inc);
+}
+
+static inline
+int percpu_counter_tree_approximate_sum(struct percpu_counter_tree *counter)
+{
+	unsigned int v;
+
+	if (!counter->level0_bit_mask)
+		v = READ_ONCE(*counter->approx_sum.i);
+	else
+		v = atomic_read(counter->approx_sum.a);
+	return (int) (v + (unsigned int)READ_ONCE(counter->bias));
+}
+
+#else	/* !CONFIG_SMP */
+
+struct percpu_counter_tree_level_item;
+
+struct percpu_counter_tree {
+	atomic_t count;
+};
+
+static inline
+size_t percpu_counter_tree_items_size(void)
+{
+	return 0;
+}
+
+static inline
+int percpu_counter_tree_init_many(struct percpu_counter_tree *counters, struct percpu_counter_tree_level_item *items,
+				  unsigned int nr_counters, unsigned int batch_size, gfp_t gfp_flags)
+{
+	for (unsigned int i = 0; i < nr_counters; i++)
+		atomic_set(&counters[i].count, 0);
+	return 0;
+}
+
+static inline
+int percpu_counter_tree_init(struct percpu_counter_tree *counter, struct percpu_counter_tree_level_item *items,
+			     unsigned int batch_size, gfp_t gfp_flags)
+{
+	return percpu_counter_tree_init_many(counter, items, 1, batch_size, gfp_flags);
+}
+
+static inline
+void percpu_counter_tree_destroy_many(struct percpu_counter_tree *counter, unsigned int nr_counters)
+{
+}
+
+static inline
+void percpu_counter_tree_destroy(struct percpu_counter_tree *counter)
+{
+}
+
+static inline
+int percpu_counter_tree_precise_sum(struct percpu_counter_tree *counter)
+{
+	return atomic_read(&counter->count);
+}
+
+static inline
+int percpu_counter_tree_precise_compare(struct percpu_counter_tree *a, struct percpu_counter_tree *b)
+{
+	int count_a = percpu_counter_tree_precise_sum(a),
+	    count_b = percpu_counter_tree_precise_sum(b);
+
+	if (count_a == count_b)
+		return 0;
+	if (count_a < count_b)
+		return -1;
+	return 1;
+}
+
+static inline
+int percpu_counter_tree_precise_compare_value(struct percpu_counter_tree *counter, int v)
+{
+	int count = percpu_counter_tree_precise_sum(counter);
+
+	if (count == v)
+		return 0;
+	if (count < v)
+		return -1;
+	return 1;
+}
+
+static inline
+int percpu_counter_tree_approximate_compare(struct percpu_counter_tree *a, struct percpu_counter_tree *b)
+{
+	return percpu_counter_tree_precise_compare(a, b);
+}
+
+static inline
+int percpu_counter_tree_approximate_compare_value(struct percpu_counter_tree *counter, int v)
+{
+	return percpu_counter_tree_precise_compare_value(counter, v);
+}
+
+static inline
+void percpu_counter_tree_set(struct percpu_counter_tree *counter, int v)
+{
+	atomic_set(&counter->count, v);
+}
+
+static inline
+unsigned int percpu_counter_tree_inaccuracy(struct percpu_counter_tree *counter)
+{
+	return 0;
+}
+
+static inline
+void percpu_counter_tree_add(struct percpu_counter_tree *counter, int inc)
+{
+	atomic_add(inc, &counter->count);
+}
+
+static inline
+int percpu_counter_tree_approximate_sum(struct percpu_counter_tree *counter)
+{
+	return percpu_counter_tree_precise_sum(counter);
+}
+
+static inline
+int percpu_counter_tree_subsystem_init(void)
+{
+	return 0;
+}
+
+#endif	/* CONFIG_SMP */
+
+static inline
+int percpu_counter_tree_approximate_sum_positive(struct percpu_counter_tree *counter)
+{
+	int v = percpu_counter_tree_approximate_sum(counter);
+	return v > 0 ? v : 0;
+}
+
+static inline
+int percpu_counter_tree_precise_sum_positive(struct percpu_counter_tree *counter)
+{
+	int v = percpu_counter_tree_precise_sum(counter);
+	return v > 0 ? v : 0;
+}
+
+#endif  /* _PERCPU_COUNTER_TREE_H */
diff --git a/init/main.c b/init/main.c
index 07a3116811c5..8487489b9d00 100644
--- a/init/main.c
+++ b/init/main.c
@@ -104,6 +104,7 @@
 #include <linux/pidfs.h>
 #include <linux/ptdump.h>
 #include <linux/time_namespace.h>
+#include <linux/percpu_counter_tree.h>
 #include <net/net_namespace.h>
 
 #include <asm/io.h>
@@ -968,6 +969,7 @@ void start_kernel(void)
 	vfs_caches_init_early();
 	sort_main_extable();
 	trap_init();
+	percpu_counter_tree_subsystem_init();
 	mm_core_init();
 	maple_tree_init();
 	poking_init();
diff --git a/lib/Makefile b/lib/Makefile
index 1ab2c4be3b66..767dc178a55c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -179,6 +179,7 @@ obj-$(CONFIG_TEXTSEARCH_KMP) += ts_kmp.o
 obj-$(CONFIG_TEXTSEARCH_BM) += ts_bm.o
 obj-$(CONFIG_TEXTSEARCH_FSM) += ts_fsm.o
 obj-$(CONFIG_SMP) += percpu_counter.o
+obj-$(CONFIG_SMP) += percpu_counter_tree.o
 obj-$(CONFIG_AUDIT_GENERIC) += audit.o
 obj-$(CONFIG_AUDIT_COMPAT_GENERIC) += compat_audit.o
 
diff --git a/lib/percpu_counter_tree.c b/lib/percpu_counter_tree.c
new file mode 100644
index 000000000000..5056d470bdc2
--- /dev/null
+++ b/lib/percpu_counter_tree.c
@@ -0,0 +1,443 @@
+// SPDX-License-Identifier: GPL-2.0+ OR MIT
+// SPDX-FileCopyrightText: 2025 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+
+/*
+ * Split Counters With Tree Approximation Propagation
+ *
+ * * Propagation diagram when reaching batch size thresholds (± batch size):
+ *
+ * Example diagram for 8 CPUs:
+ *
+ * log2(8) = 3 levels
+ *
+ * At each level, each pair propagates its values to the next level when
+ * reaching the batch size thresholds.
+ *
+ * Counters at levels 0, 1, 2 can be kept on a single byte (±128 range),
+ * although it may be relevant to keep them on 32-bit counters for
+ * simplicity. (complexity vs memory footprint tradeoff)
+ *
+ * Counter at level 3 can be kept on a 32-bit counter.
+ *
+ * Level 0:  0    1    2    3    4    5    6    7
+ *           |   /     |   /     |   /     |   /
+ *           |  /      |  /      |  /      |  /
+ *           | /       | /       | /       | /
+ * Level 1:  0         1         2         3
+ *           |       /           |       /
+ *           |    /              |    /
+ *           | /                 | /
+ * Level 2:  0                   1
+ *           |               /
+ *           |         /
+ *           |   /
+ * Level 3:  0
+ *
+ * * Approximation inaccuracy:
+ *
+ * BATCH(level N): Level N batch size.
+ *
+ * Example for BATCH(level 0) = 32.
+ *
+ * BATCH(level 0) =  32
+ * BATCH(level 1) =  64
+ * BATCH(level 2) = 128
+ * BATCH(level N) = BATCH(level 0) * 2^N
+ *
+ *            per-counter     global
+ *            inaccuracy      inaccuracy
+ * Level 0:   [ -32 ..  +31]  ±256  (8 * 32)
+ * Level 1:   [ -64 ..  +63]  ±256  (4 * 64)
+ * Level 2:   [-128 .. +127]  ±256  (2 * 128)
+ * Total:      ------         ±768  (log2(nr_cpu_ids) * BATCH(level 0) * nr_cpu_ids)
+ *
+ * -----
+ *
+ * Approximate Sum Carry Propagation
+ *
+ * Let's define a number of counter bits for each level, e.g.:
+ *
+ * log2(BATCH(level 0)) = log2(32) = 5
+ *
+ *               nr_bit        value_mask                      range
+ * Level 0:      5 bits        v                             0 ..  +31
+ * Level 1:      1 bit        (v & ~((1UL << 5) - 1))        0 ..  +63
+ * Level 2:      1 bit        (v & ~((1UL << 6) - 1))        0 .. +127
+ * Level 3:     25 bits       (v & ~((1UL << 7) - 1))        0 .. 2^32-1
+ *
+ * Note: Use a full 32-bit per-cpu counter at level 0 to allow precise sum.
+ *
+ * Note: Use cacheline aligned counters at levels above 0 to prevent false sharing.
+ *       If memory footprint is an issue, a specialized allocator could be used
+ *       to eliminate padding.
+ *
+ * Example with expanded values:
+ *
+ * counter_add(counter, inc):
+ *
+ *         if (!inc)
+ *                 return;
+ *
+ *         res = percpu_add_return(counter @ Level 0, inc);
+ *         orig = res - inc;
+ *         if (inc < 0) {
+ *                 inc = -(-inc & ~0b00011111);  // Clear used bits
+ *                 // xor bit 5: underflow
+ *                 if ((inc ^ orig ^ res) & 0b00100000)
+ *                         inc -= 0b00100000;
+ *         } else {
+ *                 inc &= ~0b00011111;           // Clear used bits
+ *                 // xor bit 5: overflow
+ *                 if ((inc ^ orig ^ res) & 0b00100000)
+ *                         inc += 0b00100000;
+ *         }
+ *         if (!inc)
+ *                 return;
+ *
+ *         res = atomic_add_return(counter @ Level 1, inc);
+ *         orig = res - inc;
+ *         if (inc < 0) {
+ *                 inc = -(-inc & ~0b00111111);  // Clear used bits
+ *                 // xor bit 6: underflow
+ *                 if ((inc ^ orig ^ res) & 0b01000000)
+ *                         inc -= 0b01000000;
+ *         } else {
+ *                 inc &= ~0b00111111;           // Clear used bits
+ *                 // xor bit 6: overflow
+ *                 if ((inc ^ orig ^ res) & 0b01000000)
+ *                         inc += 0b01000000;
+ *         }
+ *         if (!inc)
+ *                 return;
+ *
+ *         res = atomic_add_return(counter @ Level 2, inc);
+ *         orig = res - inc;
+ *         if (inc < 0) {
+ *                 inc = -(-inc & ~0b01111111);  // Clear used bits
+ *                 // xor bit 7: underflow
+ *                 if ((inc ^ orig ^ res) & 0b10000000)
+ *                         inc -= 0b10000000;
+ *         } else {
+ *                 inc &= ~0b01111111;           // Clear used bits
+ *                 // xor bit 7: overflow
+ *                 if ((inc ^ orig ^ res) & 0b10000000)
+ *                         inc += 0b10000000;
+ *         }
+ *         if (!inc)
+ *                 return;
+ *
+ *         atomic_add(counter @ Level 3, inc);
+ */
+
+#include <linux/percpu_counter_tree.h>
+#include <linux/cpumask.h>
+#include <linux/percpu.h>
+#include <linux/atomic.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/math.h>
+
+#define MAX_NR_LEVELS 5
+
+struct counter_config {
+	unsigned int nr_items;
+	unsigned char nr_levels;
+	unsigned char n_arity_order[MAX_NR_LEVELS];
+};
+
+/*
+ * nr_items is the number of items in the tree for levels 1 to and
+ * including the final level (approximate sum). It excludes the level 0
+ * per-cpu counters.
+ */
+static const struct counter_config per_nr_cpu_order_config[] = {
+	[0] =	{ .nr_items = 0,	.nr_levels = 0,		.n_arity_order = { 0 } },
+	[1] =	{ .nr_items = 1,	.nr_levels = 1,		.n_arity_order = { 1 } },
+	[2] =	{ .nr_items = 3,	.nr_levels = 2,		.n_arity_order = { 1, 1 } },
+	[3] =	{ .nr_items = 7,	.nr_levels = 3,		.n_arity_order = { 1, 1, 1 } },
+	[4] =	{ .nr_items = 7,	.nr_levels = 3,		.n_arity_order = { 2, 1, 1 } },
+	[5] =	{ .nr_items = 11,	.nr_levels = 3,		.n_arity_order = { 2, 2, 1 } },
+	[6] =	{ .nr_items = 21,	.nr_levels = 3,		.n_arity_order = { 2, 2, 2 } },
+	[7] =	{ .nr_items = 21,	.nr_levels = 3,		.n_arity_order = { 3, 2, 2 } },
+	[8] =	{ .nr_items = 37,	.nr_levels = 3,		.n_arity_order = { 3, 3, 2 } },
+	[9] =	{ .nr_items = 73,	.nr_levels = 3,		.n_arity_order = { 3, 3, 3 } },
+	[10] =	{ .nr_items = 149,	.nr_levels = 4,		.n_arity_order = { 3, 3, 2, 2 } },
+	[11] =	{ .nr_items = 293,	.nr_levels = 4,		.n_arity_order = { 3, 3, 3, 2 } },
+	[12] =	{ .nr_items = 585,	.nr_levels = 4,		.n_arity_order = { 3, 3, 3, 3 } },
+	[13] =	{ .nr_items = 1173,	.nr_levels = 5,		.n_arity_order = { 3, 3, 3, 2, 2 } },
+	[14] =	{ .nr_items = 2341,	.nr_levels = 5,		.n_arity_order = { 3, 3, 3, 3, 2 } },
+	[15] =	{ .nr_items = 4681,	.nr_levels = 5,		.n_arity_order = { 3, 3, 3, 3, 3 } },
+	[16] =	{ .nr_items = 4681,	.nr_levels = 5,		.n_arity_order = { 4, 3, 3, 3, 3 } },
+	[17] =	{ .nr_items = 8777,	.nr_levels = 5,		.n_arity_order = { 4, 4, 3, 3, 3 } },
+	[18] =	{ .nr_items = 17481,	.nr_levels = 5,		.n_arity_order = { 4, 4, 4, 3, 3 } },
+	[19] =	{ .nr_items = 34953,	.nr_levels = 5,		.n_arity_order = { 4, 4, 4, 4, 3 } },
+	[20] =	{ .nr_items = 69905,	.nr_levels = 5,		.n_arity_order = { 4, 4, 4, 4, 4 } },
+};
+
+static const struct counter_config *counter_config;
+static unsigned int nr_cpus_order, inaccuracy_multiplier;
+
+static
+int __percpu_counter_tree_init(struct percpu_counter_tree *counter,
+			       unsigned int batch_size, gfp_t gfp_flags,
+			       unsigned int __percpu *level0,
+			       struct percpu_counter_tree_level_item *items)
+{
+	/* Batch size must be power of 2 */
+	if (!batch_size || (batch_size & (batch_size - 1)))
+		return -EINVAL;
+	counter->batch_size = batch_size;
+	counter->bias = 0;
+	counter->level0 = level0;
+	counter->items = items;
+	if (!nr_cpus_order) {
+		counter->approx_sum.i = per_cpu_ptr(counter->level0, 0);
+		counter->level0_bit_mask = 0;
+	} else {
+		counter->approx_sum.a = &counter->items[counter_config->nr_items - 1].count;
+		counter->level0_bit_mask = 1UL << get_count_order(batch_size);
+	}
+	counter->inaccuracy = batch_size * inaccuracy_multiplier;
+	return 0;
+}
+
+int percpu_counter_tree_init_many(struct percpu_counter_tree *counters, struct percpu_counter_tree_level_item *items,
+				  unsigned int nr_counters, unsigned int batch_size, gfp_t gfp_flags)
+{
+	void __percpu *level0, *level0_iter;
+	size_t counter_size, items_size = 0;
+	void *items_iter;
+	unsigned int i;
+	int ret;
+
+	counter_size = ALIGN(sizeof(*counters), __alignof__(*counters));
+	level0 = __alloc_percpu_gfp(nr_counters * counter_size,
+				    __alignof__(*counters), gfp_flags);
+	if (!level0)
+		return -ENOMEM;
+	if (nr_cpus_order) {
+		items_size = percpu_counter_tree_items_size();
+		memset(items, 0, items_size * nr_counters);
+	}
+	level0_iter = level0;
+	items_iter = items;
+	for (i = 0; i < nr_counters; i++) {
+		ret = __percpu_counter_tree_init(&counters[i], batch_size, gfp_flags, level0_iter, items_iter);
+		if (ret)
+			goto free_level0;
+		level0_iter += counter_size;
+		if (nr_cpus_order)
+			items_iter += items_size;
+	}
+	return 0;
+
+free_level0:
+	free_percpu(level0);
+	return ret;
+}
+
+int percpu_counter_tree_init(struct percpu_counter_tree *counter, struct percpu_counter_tree_level_item *items,
+			     unsigned int batch_size, gfp_t gfp_flags)
+{
+	return percpu_counter_tree_init_many(counter, items, 1, batch_size, gfp_flags);
+}
+
+void percpu_counter_tree_destroy_many(struct percpu_counter_tree *counters, unsigned int nr_counters)
+{
+	free_percpu(counters->level0);
+}
+
+void percpu_counter_tree_destroy(struct percpu_counter_tree *counter)
+{
+	return percpu_counter_tree_destroy_many(counter, 1);
+}
+
+/*
+ * It does not matter through which path the carry propagates up the
+ * tree, therefore there is no need to disable preemption because the
+ * cpu number is only used to favor cache locality.
+ */
+void percpu_counter_tree_add_slowpath(struct percpu_counter_tree *counter, int inc)
+{
+	unsigned int level_items, nr_levels = counter_config->nr_levels,
+		     level, n_arity_order, bit_mask;
+	struct percpu_counter_tree_level_item *item = counter->items;
+	unsigned int cpu = raw_smp_processor_id();
+
+	WARN_ON_ONCE(!nr_cpus_order);	/* Should never be called for 1 cpu. */
+
+	n_arity_order = counter_config->n_arity_order[0];
+	bit_mask = counter->level0_bit_mask << n_arity_order;
+	level_items = 1U << (nr_cpus_order - n_arity_order);
+
+	for (level = 1; level < nr_levels; level++) {
+		atomic_t *count = &item[cpu & (level_items - 1)].count;
+		unsigned int orig, res;
+
+		res = atomic_add_return_relaxed(inc, count);
+		orig = res - inc;
+		inc = percpu_counter_tree_carry(orig, res, inc, bit_mask);
+		if (likely(!inc))
+			return;
+		item += level_items;
+		n_arity_order = counter_config->n_arity_order[level];
+		level_items >>= n_arity_order;
+		bit_mask <<= n_arity_order;
+	}
+	atomic_add(inc, counter->approx_sum.a);
+}
+
+/*
+ * Precise sum. Perform the sum of all per-cpu counters.
+ */
+static int percpu_counter_tree_precise_sum_unbiased(struct percpu_counter_tree *counter)
+{
+	unsigned int sum = 0;
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		sum += *per_cpu_ptr(counter->level0, cpu);
+	return (int) sum;
+}
+
+int percpu_counter_tree_precise_sum(struct percpu_counter_tree *counter)
+{
+	return percpu_counter_tree_precise_sum_unbiased(counter) + READ_ONCE(counter->bias);
+}
+
+/*
+ * Do an approximate comparison of two counters.
+ * Return 0 if counters do not differ by more than the sum of their
+ * respective inaccuracy ranges,
+ * Return -1 if counter @a is less than counter @b,
+ * Return 1 if counter @a is greater than counter @b.
+ */
+int percpu_counter_tree_approximate_compare(struct percpu_counter_tree *a, struct percpu_counter_tree *b)
+{
+	int count_a = percpu_counter_tree_approximate_sum(a),
+	    count_b = percpu_counter_tree_approximate_sum(b);
+
+	if (abs(count_a - count_b) <= (a->inaccuracy + b->inaccuracy))
+		return 0;
+	if (count_a < count_b)
+		return -1;
+	return 1;
+}
+
+/*
+ * Do an approximate comparison of a counter against a given value.
+ * Return 0 if the value is within the inaccuracy range of the counter,
+ * Return -1 if the counter is less than the value,
+ * Return 1 if the counter is greater than the value.
+ */
+int percpu_counter_tree_approximate_compare_value(struct percpu_counter_tree *counter, int v)
+{
+	int count = percpu_counter_tree_approximate_sum(counter);
+
+	if (abs(v - count) <= counter->inaccuracy)
+		return 0;
+	if (count < v)
+		return -1;
+	return 1;
+}
+
+/*
+ * Do a precise comparison of two counters.
+ * Return 0 if the counters are equal,
+ * Return -1 if counter @a is less than counter @b,
+ * Return 1 if counter @a is greater than counter @b.
+ */
+int percpu_counter_tree_precise_compare(struct percpu_counter_tree *a, struct percpu_counter_tree *b)
+{
+	int count_a = percpu_counter_tree_approximate_sum(a),
+	    count_b = percpu_counter_tree_approximate_sum(b);
+
+	if (abs(count_a - count_b) <= (a->inaccuracy + b->inaccuracy)) {
+		if (b->inaccuracy < a->inaccuracy) {
+			count_a = percpu_counter_tree_precise_sum(a);
+			if (abs(count_a - count_b) <= b->inaccuracy)
+				count_b = percpu_counter_tree_precise_sum(b);
+		} else {
+			count_b = percpu_counter_tree_precise_sum(b);
+			if (abs(count_a - count_b) <= a->inaccuracy)
+				count_a = percpu_counter_tree_precise_sum(a);
+		}
+	}
+	if (count_a < count_b)
+		return -1;
+	if (count_a > count_b)
+		return 1;
+	return 0;
+}
+
+/*
+ * Do a precise comparison of a counter against a given value.
+ * Return 0 if the counter is equal to the value,
+ * Return -1 if the counter is less than the value,
+ * Return 1 if the counter is greater than the value.
+ */
+int percpu_counter_tree_precise_compare_value(struct percpu_counter_tree *counter, int v)
+{
+	int count = percpu_counter_tree_approximate_sum(counter);
+
+	if (abs(v - count) <= counter->inaccuracy)
+		count = percpu_counter_tree_precise_sum(counter);
+	if (count < v)
+		return -1;
+	if (count > v)
+		return 1;
+	return 0;
+}
+
+static
+void percpu_counter_tree_set_bias(struct percpu_counter_tree *counter, int bias)
+{
+	WRITE_ONCE(counter->bias, bias);
+}
+
+void percpu_counter_tree_set(struct percpu_counter_tree *counter, int v)
+{
+	percpu_counter_tree_set_bias(counter,
+				     v - percpu_counter_tree_precise_sum_unbiased(counter));
+}
+
+unsigned int percpu_counter_tree_inaccuracy(struct percpu_counter_tree *counter)
+{
+	return counter->inaccuracy;
+}
+
+size_t percpu_counter_tree_items_size(void)
+{
+	if (!nr_cpus_order)
+		return 0;
+	return counter_config->nr_items * sizeof(struct percpu_counter_tree_level_item);
+}
+
+static unsigned int __init calculate_inaccuracy_multiplier(void)
+{
+	unsigned int nr_levels = counter_config->nr_levels, level;
+	unsigned int level_items = 1U << nr_cpus_order;
+	unsigned int inaccuracy = 0, batch_size = 1;
+
+	for (level = 0; level < nr_levels; level++) {
+		unsigned int n_arity_order = counter_config->n_arity_order[level];
+
+		inaccuracy += batch_size * level_items;
+		batch_size <<= n_arity_order;
+		level_items >>= n_arity_order;
+	}
+	return inaccuracy;
+}
+
+int __init percpu_counter_tree_subsystem_init(void)
+{
+
+	nr_cpus_order = get_count_order(nr_cpu_ids);
+	if (WARN_ON_ONCE(nr_cpus_order >= ARRAY_SIZE(per_nr_cpu_order_config))) {
+		printk(KERN_ERR "Unsupported number of CPUs (%u)\n", nr_cpu_ids);
+		return -1;
+	}
+	counter_config = &per_nr_cpu_order_config[nr_cpus_order];
+	inaccuracy_multiplier = calculate_inaccuracy_multiplier();
+	return 0;
+}
-- 
2.39.5

Re: [PATCH v9 1/2] lib: Introduce hierarchical per-cpu counters
Posted by Andrew Morton 1 week, 3 days ago
On Thu, 20 Nov 2025 16:03:53 -0500 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> * Motivation
> 
> The purpose of this hierarchical split-counter scheme is to:
> 
> - Minimize contention when incrementing and decrementing counters,
> - Provide fast access to a sum approximation,
> - Provide a sum approximation with an acceptable accuracy level when
>   scaling to many-core systems.
> - Provide approximate and precise comparison of two counters, and
>   between a counter and a value.
> 
> It aims at fixing the per-mm RSS tracking which has become too
> inaccurate for OOM killer purposes on large many-core systems [1].

Presentation nit: the info at [1] is rather well hidden until one reads
the [2/2] changelog.  You might want to move that material into the
[0/N] - after all, it's the entire point of the patchset.

> * Design
> 
> The hierarchical per-CPU counters propagate a sum approximation through
> a N-way tree. When reaching the batch size, the carry is propagated
> through a binary tree which consists of logN(nr_cpu_ids) levels. The
> batch size for each level is twice the batch size of the prior level.
> 
> Example propagation diagram with 8 cpus through a binary tree:
> 
> Level 0:  0    1    2    3    4    5    6    7
>           |   /     |   /     |   /     |   /
>           |  /      |  /      |  /      |  /
>           | /       | /       | /       | /
> Level 1:  0         1         2         3
>           |       /           |       /
>           |    /              |    /
>           | /                 | /
> Level 2:  0                   1
>           |               /
>           |         /
>           |   /
> Level 3:  0
> 
> For a binary tree, the maximum inaccuracy is bound by:
>    batch_size * log2(nr_cpus) * nr_cpus
> which evolves with O(n*log(n)) as the number of CPUs increases.
> 
> For a N-way tree, the maximum inaccuracy can be pre-calculated
> based on the the N-arity of each level and the batch size.

Looks very neat.

Have you identified other parts of the kernel which could use this?

>  include/linux/percpu_counter_tree.h | 239 +++++++++++++++
>  init/main.c                         |   2 +
>  lib/Makefile                        |   1 +
>  lib/percpu_counter_tree.c           | 443 ++++++++++++++++++++++++++++
>  4 files changed, 685 insertions(+)
>  create mode 100644 include/linux/percpu_counter_tree.h
>  create mode 100644 lib/percpu_counter_tree.c

An in-kernel test suite would be great.  Like lib/*test*.c or
tools/testing/.

> diff --git a/include/linux/percpu_counter_tree.h b/include/linux/percpu_counter_tree.h
> new file mode 100644
> index 000000000000..1f4938b67730
> --- /dev/null
> +++ b/include/linux/percpu_counter_tree.h
> @@ -0,0 +1,239 @@
> +/* SPDX-License-Identifier: GPL-2.0+ OR MIT */
> +/* SPDX-FileCopyrightText: 2025 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> */
> +
> +#ifndef _PERCPU_COUNTER_TREE_H
> +#define _PERCPU_COUNTER_TREE_H
> +
> +#include <linux/preempt.h>
> +#include <linux/atomic.h>
> +#include <linux/percpu.h>
> +
> +#ifdef CONFIG_SMP
> +
> +struct percpu_counter_tree_level_item {
> +	atomic_t count;
> +} ____cacheline_aligned_in_smp;
> +
> +struct percpu_counter_tree {
> +	/* Fast-path fields. */
> +	unsigned int __percpu *level0;
> +	unsigned int level0_bit_mask;
> +	union {
> +		unsigned int *i;
> +		atomic_t *a;
> +	} approx_sum;
> +	int bias;			/* bias for counter_set */
> +
> +	/* Slow-path fields. */
> +	struct percpu_counter_tree_level_item *items;
> +	unsigned int batch_size;
> +	unsigned int inaccuracy;	/* approximation imprecise within ± inaccuracy */
> +};

I find that understanding the data structure leads to understanding the
code, so additional documentation for the various fields would be
helpful.

> +
> +static inline
> +int percpu_counter_tree_carry(int orig, int res, int inc, unsigned int bit_mask)
> +{
> ...
> +}
> +
> +static inline
> +void percpu_counter_tree_add(struct percpu_counter_tree *counter, int inc)
> +{
> ...
> +}
> +
> +static inline
> +int percpu_counter_tree_approximate_sum(struct percpu_counter_tree *counter)
> +{
> ...
> +}

These are pretty large after all the nested inlining is expanded.  Are
you sure that inlining them is the correct call?


> +#else	/* !CONFIG_SMP */
> +
>
> ...
>
> +#include <linux/percpu_counter_tree.h>
> +#include <linux/cpumask.h>
> +#include <linux/percpu.h>
> +#include <linux/atomic.h>
> +#include <linux/errno.h>
> +#include <linux/slab.h>
> +#include <linux/math.h>
> +
> +#define MAX_NR_LEVELS 5
> +
> +struct counter_config {
> +	unsigned int nr_items;
> +	unsigned char nr_levels;
> +	unsigned char n_arity_order[MAX_NR_LEVELS];
> +};
> +
> +/*
> + * nr_items is the number of items in the tree for levels 1 to and
> + * including the final level (approximate sum). It excludes the level 0
> + * per-cpu counters.
> + */

That's referring to counter_config.nr_items?  Comment appears to be
misplaced.

>
> ...
>
> +static
> +int __percpu_counter_tree_init(struct percpu_counter_tree *counter,
> +			       unsigned int batch_size, gfp_t gfp_flags,
> +			       unsigned int __percpu *level0,
> +			       struct percpu_counter_tree_level_item *items)
> +{
> +	/* Batch size must be power of 2 */
> +	if (!batch_size || (batch_size & (batch_size - 1)))
> +		return -EINVAL;

It's a bug, yes?  Worth a WARN?

> +	counter->batch_size = batch_size;
> +	counter->bias = 0;
> +	counter->level0 = level0;
> +	counter->items = items;
> +	if (!nr_cpus_order) {
> +		counter->approx_sum.i = per_cpu_ptr(counter->level0, 0);
> +		counter->level0_bit_mask = 0;
> +	} else {
> +		counter->approx_sum.a = &counter->items[counter_config->nr_items - 1].count;
> +		counter->level0_bit_mask = 1UL << get_count_order(batch_size);
> +	}
> +	counter->inaccuracy = batch_size * inaccuracy_multiplier;
> +	return 0;
> +}
> +
>
> ...
>
> +int percpu_counter_tree_precise_sum(struct percpu_counter_tree *counter)
> +{
> +	return percpu_counter_tree_precise_sum_unbiased(counter) + READ_ONCE(counter->bias);
> +}
> +
> +/*
> + * Do an approximate comparison of two counters.
> + * Return 0 if counters do not differ by more than the sum of their
> + * respective inaccuracy ranges,
> + * Return -1 if counter @a less than counter @b,
> + * Return 1 if counter @a is greater than counter @b.
> + */

It would be nice to kerneldocify the exported API.

Some fairly detailed words explaining the pros and cons of precise vs
approximate would be helpful to people who are using this API.

> +int percpu_counter_tree_approximate_compare(struct percpu_counter_tree *a, struct percpu_counter_tree *b)
> +{
> +	int count_a = percpu_counter_tree_approximate_sum(a),
> +	    count_b = percpu_counter_tree_approximate_sum(b);
> +
> +	if (abs(count_a - count_b) <= (a->inaccuracy + b->inaccuracy))
> +		return 0;
> +	if (count_a < count_b)
> +		return -1;
> +	return 1;
> +}
> +
>
> ...
>
> +static unsigned int __init calculate_inaccuracy_multiplier(void)
> +{
> +	unsigned int nr_levels = counter_config->nr_levels, level;
> +	unsigned int level_items = 1U << nr_cpus_order;
> +	unsigned int inaccuracy = 0, batch_size = 1;
> +
> +	for (level = 0; level < nr_levels; level++) {
> +		unsigned int n_arity_order = counter_config->n_arity_order[level];
> +
> +		inaccuracy += batch_size * level_items;
> +		batch_size <<= n_arity_order;
> +		level_items >>= n_arity_order;
> +	}
> +	return inaccuracy;
> +}
> +
> +int __init percpu_counter_tree_subsystem_init(void)

I'm not sure that the "subsystem_" adds any value.

> +{
> +
> +	nr_cpus_order = get_count_order(nr_cpu_ids);

Stray newline.

> +	if (WARN_ON_ONCE(nr_cpus_order >= ARRAY_SIZE(per_nr_cpu_order_config))) {
> +		printk(KERN_ERR "Unsupported number of CPUs (%u)\n", nr_cpu_ids);
> +		return -1;
> +	}
> +	counter_config = &per_nr_cpu_order_config[nr_cpus_order];
> +	inaccuracy_multiplier = calculate_inaccuracy_multiplier();
> +	return 0;
> +}
Re: [PATCH v9 1/2] lib: Introduce hierarchical per-cpu counters
Posted by Mathieu Desnoyers 1 week, 2 days ago
On 2025-11-21 13:03, Andrew Morton wrote:
> On Thu, 20 Nov 2025 16:03:53 -0500 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>> ...
>> It aims at fixing the per-mm RSS tracking which has become too
>> inaccurate for OOM killer purposes on large many-core systems [1].
> 
> Presentation nit: the info at [1] is rather well hidden until one reads
> the [2/2] changelog.  You might want to move that material into the
> [0/N] - after all, it's the entire point of the patchset.

Will do.

>> The hierarchical per-CPU counters propagate a sum approximation through
>> a N-way tree. When reaching the batch size, the carry is propagated
>> through a binary tree which consists of logN(nr_cpu_ids) levels. The
>> batch size for each level is twice the batch size of the prior level.
>> ...

> Looks very neat.

I'm glad you like the approach :) When I looked at this problem, a
hierarchical tree seemed like a good candidate to solve it in the
general case.

> 
> Have you identified other parts of the kernel which could use this?

I have not surveyed other possible users yet. At first glance, users
of the linux/percpu_counter.h percpu_counter_read_positive() may
be good candidates for the hierarchical tree. Here is a list:

lib/flex_proportions.c
157:		num = percpu_counter_read_positive(&pl->events);
158:		den = percpu_counter_read_positive(&p->events);

fs/file_table.c
88:	return percpu_counter_read_positive(&nr_files);

fs/ext4/balloc.c
627:	free_clusters  = percpu_counter_read_positive(fcc);
628:	dirty_clusters = percpu_counter_read_positive(dcc);

fs/ext4/ialloc.c
448:	freei = percpu_counter_read_positive(&sbi->s_freeinodes_counter);
450:	freec = percpu_counter_read_positive(&sbi->s_freeclusters_counter);
453:	ndirs = percpu_counter_read_positive(&sbi->s_dirs_counter);

fs/ext4/extents_status.c
1682:	nr = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_shk_cnt);
1694:	ret = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_shk_cnt);
1699:	ret = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_shk_cnt);

fs/ext4/inode.c
3091:		percpu_counter_read_positive(&sbi->s_freeclusters_counter);
3093:		percpu_counter_read_positive(&sbi->s_dirtyclusters_counter);

fs/btrfs/space-info.c
1028:	ordered = percpu_counter_read_positive(&fs_info->ordered_bytes) >> 1;
1029:	delalloc = percpu_counter_read_positive(&fs_info->delalloc_bytes);

fs/quota/dquot.c
808:	percpu_counter_read_positive(&dqstats.counter[DQST_FREE_DQUOTS]));

fs/ext2/balloc.c
1162:	free_blocks = percpu_counter_read_positive(&sbi->s_freeblocks_counter);

fs/ext2/ialloc.c
268:	freei = percpu_counter_read_positive(&sbi->s_freeinodes_counter);
270:	free_blocks = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
272:	ndirs = percpu_counter_read_positive(&sbi->s_dirs_counter);

fs/jbd2/journal.c
1263:	count = percpu_counter_read_positive(&journal->j_checkpoint_jh_count);
1268:	count = percpu_counter_read_positive(&journal->j_checkpoint_jh_count);
1287:	count = percpu_counter_read_positive(&journal->j_checkpoint_jh_count);

fs/xfs/xfs_ioctl.c
1149:		.allocino = percpu_counter_read_positive(&mp->m_icount),
1150:		.freeino  = percpu_counter_read_positive(&mp->m_ifree),

fs/xfs/xfs_mount.h
736:	return percpu_counter_read_positive(&mp->m_free[ctr].count);

fs/xfs/libxfs/xfs_ialloc.c
732:	    percpu_counter_read_positive(&args.mp->m_icount) + newlen >
1913:	 * Read rough value of mp->m_icount by percpu_counter_read_positive,
1917:	    percpu_counter_read_positive(&mp->m_icount) + igeo->ialloc_inos

mm/util.c
965:	if (percpu_counter_read_positive(&vm_committed_as) < allowed)

drivers/md/dm-crypt.c
1734:			if (unlikely(percpu_counter_read_positive(&cc->n_allocated_pages) +
2750:	 * Note, percpu_counter_read_positive() may over (and under) estimate
2754:	if (unlikely(percpu_counter_read_positive(&cc->n_allocated_pages) >= dm_crypt_pages_per_client) &&

io_uring/io_uring.c
2502:	return percpu_counter_read_positive(&tctx->inflight);

include/linux/backing-dev.h
71:	return percpu_counter_read_positive(&wb->stat[item]);

include/net/sock.h
1435:	return percpu_counter_read_positive(sk->sk_prot->sockets_allocated);

include/net/dst_ops.h
48:	return percpu_counter_read_positive(&dst->pcpuc_entries);

There are probably other use-cases lurking out there. Anything that
needs to sort or compare resources consumed by objects used concurrently
from many CPUs in order to make decisions is a good candidate.

The fact that the max inaccuracy is bounded means that we can perform
approximate comparisons in a fast path (which is enough if the values
to compare greatly vary), and only have to do the more expensive
"precise" comparison when values are close to each other and we
actually care about their relative order.
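
For instance, a caller could order two counters with something along
these lines (hypothetical helper, not part of the series):

	static int counter_tree_cmp(struct percpu_counter_tree *a,
				    struct percpu_counter_tree *b)
	{
		int cmp = percpu_counter_tree_approximate_compare(a, b);

		/* Far apart: the approximation is enough to order them. */
		if (cmp)
			return cmp;
		/* Too close to call: pay for the precise per-cpu sums. */
		return percpu_counter_tree_precise_compare(a, b);
	}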

I'm thinking memory usage tracking, runtime usage (scheduler?), cgroup
resource accounting.

> 
>>   include/linux/percpu_counter_tree.h | 239 +++++++++++++++
>>   init/main.c                         |   2 +
>>   lib/Makefile                        |   1 +
>>   lib/percpu_counter_tree.c           | 443 ++++++++++++++++++++++++++++
>>   4 files changed, 685 insertions(+)
>>   create mode 100644 include/linux/percpu_counter_tree.h
>>   create mode 100644 lib/percpu_counter_tree.c
> 
> An in-kernel test suite would be great.  Like lib/*test*.c or
> tools/testing/.

I'll keep a note to port the tests from my userspace librseq
percpu counters feature branch to the kernel. I did not do it
initially because I wanted to see if the overall approach was
deemed interesting for the kernel.

>> +struct percpu_counter_tree {
>>...
>> +};
> 
> I find that understanding the data structure leads to understanding the
> code, so additional documentation for the various fields would be
> helpful.

Will do.

> 
>> +
>> +static inline
>> +int percpu_counter_tree_carry(int orig, int res, int inc, unsigned int bit_mask)
>> +{
>> ...
>> +}
>> +
>> +static inline
>> +void percpu_counter_tree_add(struct percpu_counter_tree *counter, int inc)
>> +{
>> ...
>> +}

(x86-64 asm below when expanding the inline functions into a real function)

00000000000006a0 <disasm_percpu_counter_tree_add>:
[...]
  6a4:   8b 4f 08                mov    0x8(%rdi),%ecx
  6a7:   48 8b 17                mov    (%rdi),%rdx
  6aa:   89 f0                   mov    %esi,%eax
  6ac:   65 0f c1 02             xadd   %eax,%gs:(%rdx)
  6b0:   41 89 c8                mov    %ecx,%r8d
  6b3:   8d 14 06                lea    (%rsi,%rax,1),%edx
  6b6:   41 f7 d8                neg    %r8d
  6b9:   85 f6                   test   %esi,%esi
  6bb:   78 18                   js     6d5 <disasm_percpu_counter_tree_add+0x35>
  6bd:   44 21 c6                and    %r8d,%esi
  6c0:   31 f0                   xor    %esi,%eax
  6c2:   31 d0                   xor    %edx,%eax
  6c4:   8d 14 31                lea    (%rcx,%rsi,1),%edx
  6c7:   85 c8                   test   %ecx,%eax
  6c9:   0f 45 f2                cmovne %edx,%esi
  6cc:   85 f6                   test   %esi,%esi
  6ce:   75 1d                   jne    6ed <disasm_percpu_counter_tree_add+0x4d>
  6d0:   e9 00 00 00 00          jmp    6d5 <disasm_percpu_counter_tree_add+0x35>
  6d5:   f7 de                   neg    %esi
  6d7:   44 21 c6                and    %r8d,%esi
  6da:   f7 de                   neg    %esi
  6dc:   31 f0                   xor    %esi,%eax
  6de:   31 d0                   xor    %edx,%eax
  6e0:   89 f2                   mov    %esi,%edx
  6e2:   29 ca                   sub    %ecx,%edx
  6e4:   85 c8                   test   %ecx,%eax
  6e6:   0f 45 f2                cmovne %edx,%esi
  6e9:   85 f6                   test   %esi,%esi
  6eb:   74 e3                   je     6d0 <disasm_percpu_counter_tree_add+0x30>
  6ed:   e9 ae fb ff ff          jmp    2a0 <percpu_counter_tree_add_slowpath>

>> +
>> +static inline
>> +int percpu_counter_tree_approximate_sum(struct percpu_counter_tree *counter)
>> +{
>> ...
>> +}

0000000000000710 <disasm_percpu_counter_tree_approximate_sum>:
[...]
  714:   48 8b 47 10             mov    0x10(%rdi),%rax
  718:   8b 10                   mov    (%rax),%edx
  71a:   8b 47 18                mov    0x18(%rdi),%eax
  71d:   01 d0                   add    %edx,%eax
  71f:   e9 00 00 00 00          jmp    724 <disasm_percpu_counter_tree_approximate_sum+0x14>

> 
> These are pretty large after all the nested inlining is expanded.  Are
> you sure that inlining them is the correct call?

percpu_counter_tree_add: 30 insn, 79 bytes
percpu_counter_tree_approximate_sum: 5 insn, 16 bytes

So I agree that inlining "add" may not be the right call, due to the
relatively large footprint of the bitwise operations used to compute
the carry. It also contains a xadd, which takes ~9 cycles on recent
CPUs, so the added overhead of a function call would not be
significant.

Inlining "approximate_sum" does seem to be the right call: the function
is really small (16 bytes) and only contains loads and an add, which
are faster than a function call.

> 
> 
>> +#else	/* !CONFIG_SMP */
>> +
>>
>> ...
>>
>> +#include <linux/percpu_counter_tree.h>
>> +#include <linux/cpumask.h>
>> +#include <linux/percpu.h>
>> +#include <linux/atomic.h>
>> +#include <linux/errno.h>
>> +#include <linux/slab.h>
>> +#include <linux/math.h>
>> +
>> +#define MAX_NR_LEVELS 5
>> +
>> +struct counter_config {
>> +	unsigned int nr_items;
>> +	unsigned char nr_levels;
>> +	unsigned char n_arity_order[MAX_NR_LEVELS];
>> +};
>> +
>> +/*
>> + * nr_items is the number of items in the tree for levels 1 to and
>> + * including the final level (approximate sum). It excludes the level 0
>> + * per-cpu counters.
>> + */
> 
> That's referring to counter_config.nr_items?  Comment appears to be
> misplaced.

Good point, I'll move it beside the "nr_items" field in counter_config.
I'll also document the other fields.

>>...
>> +	/* Batch size must be power of 2 */
>> +	if (!batch_size || (batch_size & (batch_size - 1)))
>> +		return -EINVAL;
> 
> It's a bug, yes?  Worth a WARN?

I'll add a WARN_ON().

>>...
> It would be nice to kerneldocify the exported API.

OK
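
For the comparison helpers, something along these lines (rough sketch):

	/**
	 * percpu_counter_tree_approximate_compare() - Approximately compare two counters.
	 * @a: First counter.
	 * @b: Second counter.
	 *
	 * Return: 0 if the counters do not differ by more than the sum of
	 * their inaccuracy ranges, -1 if @a is less than @b, 1 if @a is
	 * greater than @b.
	 */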

> 
> Some fairly detailed words explaining the pros and cons of precise vs
> approximate would be helpful to people who are using this API.

Good point.

>> +int __init percpu_counter_tree_subsystem_init(void)
> 
> I'm not sure that the "subsystem_" adds any value.

The symbol "percpu_counter_tree_init()" is already taken
for initialization of individual counter trees, so I added
"subsystem" here do distinguish between the two symbols.
Looking at kernel bootup functions "subsystem_init" seems
to be used at least once:

1108:	acpi_subsystem_init();

> 
>> +{
>> +
>> +	nr_cpus_order = get_count_order(nr_cpu_ids);
> 
> Stray newline.

OK

Thanks for the review!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH v9 1/2] lib: Introduce hierarchical per-cpu counters
Posted by Mathieu Desnoyers 1 week, 2 days ago
On 2025-11-22 12:15, Mathieu Desnoyers wrote:
>... 
> The fact that the max inaccuracy is bounded means that we can perform
> approximate comparisons in a fast path (which is enough if the values
> to compare greatly vary), and only have to do the more expensive
> "precise" comparison when values are close to each other and we
> actually care about their relative order.
> 
> I'm thinking memory usage tracking, runtime usage (scheduler ?), cgroups
> ressource accounting.

Another use-case worth explicitly mentioning is accounting the
resources (memory, CPU, I/O) used by a given entity (process, cgroup)
and enforcing quota validation based on a configured capacity limit.

This is a comparison against a limit which can be much larger than the
current resource usage value. Because the inaccuracy is known, it is
possible to enforce the quota limit with an approximate comparison,
only use a precise sum when the current value is near the limit, and
rate-limit the frequency at which the precise limit validation is
done.
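
For illustration, such a check could be structured as follows (a sketch
against the proposed API; rate limiting of the precise path is
omitted):

	static bool over_quota(struct percpu_counter_tree *usage, int limit)
	{
		int cmp = percpu_counter_tree_approximate_compare_value(usage, limit);

		/* Clearly above or below the limit. */
		if (cmp)
			return cmp > 0;
		/* Within ±inaccuracy of the limit: do the precise check. */
		return percpu_counter_tree_precise_compare_value(usage, limit) > 0;
	}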

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH v9 1/2] lib: Introduce hierarchical per-cpu counters
Posted by Andrew Morton 1 week, 2 days ago
On Sat, 22 Nov 2025 12:15:02 -0500 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> > 
> >>   include/linux/percpu_counter_tree.h | 239 +++++++++++++++
> >>   init/main.c                         |   2 +
> >>   lib/Makefile                        |   1 +
> >>   lib/percpu_counter_tree.c           | 443 ++++++++++++++++++++++++++++
> >>   4 files changed, 685 insertions(+)
> >>   create mode 100644 include/linux/percpu_counter_tree.h
> >>   create mode 100644 lib/percpu_counter_tree.c
> > 
> > An in-kernel test suite would be great.  Like lib/*test*.c or
> > tools/testing/.
> 
> I'll keep a note to port the tests from my userspace librseq
> percpu counters feature branch to the kernel. I did not do it
> initially because I wanted to see if the overall approach was
> deemed interesting for the kernel.

It deems interesting to me - certainly seems useful in addressing some
nasty problems.

If it hasn't been massacred by reviewers, please poke me in a week or
so and I'll look at giving it some exposure in mm.git's mm-new branch
(which isn't included in linux-next).
Re: [PATCH v9 1/2] lib: Introduce hierarchical per-cpu counters
Posted by Christoph Lameter (Ampere) 1 week, 3 days ago
On Fri, 21 Nov 2025, Andrew Morton wrote:

> Have you identified other parts of the kernel which could use this?

We need to compare this with the ZVC counters used for the vm statistics.
Maybe this is better and we can scale counters more easily in the VM.

ZVCs were designed for systems with a large number of NUMA nodes and a
handful of processors in each of those. What we have today is a large
number of processors in one socket (my employer is shooting high on that
level).

Hierarchical counters may be better in that scenario. Maybe we can call
them HVC (Hierarchical VM Counters). ;-)