From: Mathieu Desnoyers
To: Andrew Morton
Cc: linux-kernel@vger.kernel.org, Mathieu Desnoyers, "Paul E. McKenney",
 Steven Rostedt, Masami Hiramatsu, Dennis Zhou, Tejun Heo,
 Christoph Lameter, Martin Liu, David Rientjes, christian.koenig@amd.com,
 Shakeel Butt, SeongJae Park, Michal Hocko, Johannes Weiner,
 Sweet Tea Dorminy, Lorenzo Stoakes, "Liam R. Howlett", Mike Rapoport,
 Suren Baghdasaryan, Vlastimil Babka, Christian Brauner, Wei Yang,
 David Hildenbrand, Miaohe Lin, Al Viro, linux-mm@kvack.org,
 linux-trace-kernel@vger.kernel.org, Yu Zhao, Roman Gushchin,
 Mateusz Guzik, Matthew Wilcox, Baolin Wang, Aboorva Devarajan
Subject: [PATCH v10 1/3] lib: Introduce hierarchical per-cpu counters
Date: Sat, 13 Dec 2025 13:56:06 -0500
Message-Id: <20251213185608.3418096-2-mathieu.desnoyers@efficios.com>
X-Mailer: git-send-email 2.39.5
In-Reply-To: <20251213185608.3418096-1-mathieu.desnoyers@efficios.com>
References: <20251213185608.3418096-1-mathieu.desnoyers@efficios.com>

* Motivation

The purpose of this hierarchical split-counter scheme is to:

- Minimize contention when incrementing and decrementing counters,
- Provide fast access to a sum approximation,
- Provide a sum approximation with an acceptable accuracy level when
  scaling to many-core systems,
- Provide approximate and precise comparison of two counters, and
  between a counter and a value.

It aims at fixing the per-mm RSS tracking which has become too
inaccurate for OOM killer purposes on large many-core systems [1].

* Design

The hierarchical per-CPU counters propagate a sum approximation through
an N-way tree. When the batch size is reached, the carry is propagated
up the tree, which consists of logN(nr_cpu_ids) levels. The batch size
for each level is twice the batch size of the prior level.

Example propagation diagram with 8 CPUs through a binary tree:

Level 0:  0    1    2    3    4    5    6    7
          |   /     |   /     |   /     |   /
          |  /      |  /      |  /      |  /
          | /       | /       | /       | /
Level 1:  0         1         2         3
          |       /           |       /
          |      /            |      /
          |     /             |     /
Level 2:  0                   1
          |                 /
          |                /
          |               /
Level 3:  0

For a binary tree, the maximum inaccuracy is bounded by:

  batch_size * log2(nr_cpus) * nr_cpus

which evolves with O(n*log(n)) as the number of CPUs increases.

For an N-way tree, the maximum inaccuracy can be pre-calculated based
on the N-arity of each level and the batch size.
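As a worked example (using the 8-CPU binary-tree diagram above with a
level-0 batch size of 32), the bound is 32 * log2(8) * 8 = 32 * 3 * 8 =
768, i.e. the approximate sum stays within about +/-768 of the precise
sum; the implementation tracks the exact, slightly asymmetric
under/over limits per tree.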

Link: https://lore.kernel.org/lkml/20250331223516.7810-2-sweettea-kernel@dorminy.me/ # [1]
Signed-off-by: Mathieu Desnoyers
Cc: Andrew Morton
Cc: "Paul E. McKenney"
Cc: Steven Rostedt
Cc: Masami Hiramatsu
Cc: Mathieu Desnoyers
Cc: Dennis Zhou
Cc: Tejun Heo
Cc: Christoph Lameter
Cc: Martin Liu
Cc: David Rientjes
Cc: christian.koenig@amd.com
Cc: Shakeel Butt
Cc: SeongJae Park
Cc: Michal Hocko
Cc: Johannes Weiner
Cc: Sweet Tea Dorminy
Cc: Lorenzo Stoakes
Cc: "Liam R. Howlett"
Cc: Mike Rapoport
Cc: Suren Baghdasaryan
Cc: Vlastimil Babka
Cc: Christian Brauner
Cc: Wei Yang
Cc: David Hildenbrand
Cc: Miaohe Lin
Cc: Al Viro
Cc: linux-mm@kvack.org
Cc: linux-trace-kernel@vger.kernel.org
Cc: Yu Zhao
Cc: Roman Gushchin
Cc: Mateusz Guzik
Cc: Matthew Wilcox
Cc: Baolin Wang
Cc: Aboorva Devarajan
---
Changes since v9:
- Introduce kerneldoc documentation.
- Document structure fields.
- Remove inline from percpu_counter_tree_add().
- Reject batch_size == 1 which makes no sense.
- Fix copy-paste bug in percpu_counter_tree_precise_compare()
  (wrong inequality comparison).
- Rename "inaccuracy" to "accuracy", which makes it easier to document.
- Track accuracy limits more precisely. In two's complement signed
  integers, the range before an n-bit underflow is one unit larger than
  the range before an n-bit overflow. This sums for all the counters
  within the tree. Therefore the "under" vs "over" accuracy is not
  symmetrical.

Changes since v8:
- Remove migrate guard from the fast path. It does not matter through
  which path the carry is propagated up the tree.
- Rebase on top of v6.18-rc6.
- Introduce percpu_counter_tree_init_many and
  percpu_counter_tree_destroy_many APIs.
- Move tree items allocation to the caller.
- Introduce percpu_counter_tree_items_size().
- Move percpu_counter_tree_subsystem_init() call before mm_core_init()
  so percpu_counter_tree_items_size() is initialized before it is used.

Changes since v7:
- Explicitly initialize the subsystem from start_kernel() right after
  mm_core_init() so it is up and running before the creation of the
  first mm at boot.
- Remove the module.h include which is not needed with the explicit
  initialization.
- Only consider levels>0 items for order={0,1} nr_items. No functional
  change except to allocate only the amount of memory which is strictly
  needed.
- Introduce positive precise sum API to handle a scenario where an
  unlucky precise sum iteration would hit negative counter values
  concurrently with counter updates.

Changes since v5:
- Introduce percpu_counter_tree_approximate_sum_positive.
- Introduce !CONFIG_SMP static inlines for UP build.
- Remove percpu_counter_tree_set_bias from the public API and make it
  static.

Changes since v3:
- Add gfp flags to init function.

Changes since v2:
- Introduce N-way tree to reduce tree depth on larger systems.

Changes since v1:
- Remove percpu_counter_tree_precise_sum_unbiased from public header,
  make this function static,
- Introduce precise and approximate comparisons between two counters,
- Reorder the struct percpu_counter_tree fields,
- Introduce approx_sum field, which points to the approximate sum for
  the percpu_counter_tree_approximate_sum() fast path.
---
 include/linux/percpu_counter_tree.h | 242 ++++
 init/main.c                         |   2 +
 lib/Makefile                        |   1 +
 lib/percpu_counter_tree.c           | 705 ++++++++++++++++++++++++++++
 4 files changed, 950 insertions(+)
 create mode 100644 include/linux/percpu_counter_tree.h
 create mode 100644 lib/percpu_counter_tree.c

diff --git a/include/linux/percpu_counter_tree.h b/include/linux/percpu_counter_tree.h
new file mode 100644
index 000000000000..0daf09e08111
--- /dev/null
+++ b/include/linux/percpu_counter_tree.h
@@ -0,0 +1,242 @@
+/* SPDX-License-Identifier: GPL-2.0+ OR MIT */
+/* SPDX-FileCopyrightText: 2025 Mathieu Desnoyers */
+
+#ifndef _PERCPU_COUNTER_TREE_H
+#define _PERCPU_COUNTER_TREE_H
+
+#include
+#include
+#include
+
+#ifdef CONFIG_SMP
+
+struct percpu_counter_tree_level_item {
+	atomic_t count;	/*
+			 * Count the number of carries for this tree item.
+			 * The carry counter is kept at the order of the
+			 * carry accounted for at this tree level.
+			 */
+} ____cacheline_aligned_in_smp;
+
+struct percpu_counter_tree {
+	/* Fast-path fields. */
+	unsigned int __percpu *level0;	/* Pointer to per-CPU split counters (tree level 0). */
+	unsigned int level0_bit_mask;	/* Bit mask to apply to detect carry propagation from tree level 0. */
+	union {
+		unsigned int *i;	/* Approximate sum for single-CPU topology. */
+		atomic_t *a;		/* Approximate sum for SMP topology. */
+	} approx_sum;
+	int bias;			/* Bias to apply to counter precise and approximate values. */
+
+	/* Slow-path fields. */
+	struct percpu_counter_tree_level_item *items;	/* Array of tree items for levels 1 to N. */
+	unsigned int batch_size;	/*
+					 * The batch size is the increment step at level 0 which
+					 * triggers a carry propagation. The batch size is required
+					 * to be greater than 1, and a power of 2.
+					 */
+	/*
+	 * The tree approximate sum is guaranteed to be within this accuracy range:
+	 * (precise_sum - approx_accuracy_range.under) <= approx_sum <= (precise_sum + approx_accuracy_range.over).
+	 * This accuracy is derived from the hardware topology and the tree batch_size.
+	 * The "under" accuracy is larger than the "over" accuracy because the negative range of a
+	 * two's complement signed integer is one unit larger than the positive range. This delta
+	 * is summed for each tree item, which leads to a significantly larger "under" accuracy range
+	 * compared to the "over" accuracy range.
+	 */
+	struct {
+		unsigned int under;
+		unsigned int over;
+	} approx_accuracy_range;
+};
+
+size_t percpu_counter_tree_items_size(void);
+int percpu_counter_tree_init_many(struct percpu_counter_tree *counters, struct percpu_counter_tree_level_item *items,
+		unsigned int nr_counters, unsigned int batch_size, gfp_t gfp_flags);
+int percpu_counter_tree_init(struct percpu_counter_tree *counter, struct percpu_counter_tree_level_item *items,
+		unsigned int batch_size, gfp_t gfp_flags);
+void percpu_counter_tree_destroy_many(struct percpu_counter_tree *counters, unsigned int nr_counters);
+void percpu_counter_tree_destroy(struct percpu_counter_tree *counter);
+void percpu_counter_tree_add(struct percpu_counter_tree *counter, int inc);
+int percpu_counter_tree_precise_sum(struct percpu_counter_tree *counter);
+int percpu_counter_tree_approximate_compare(struct percpu_counter_tree *a, struct percpu_counter_tree *b);
+int percpu_counter_tree_approximate_compare_value(struct percpu_counter_tree *counter, int v);
+int percpu_counter_tree_precise_compare(struct percpu_counter_tree *a, struct percpu_counter_tree *b);
+int percpu_counter_tree_precise_compare_value(struct percpu_counter_tree *counter, int v);
+void percpu_counter_tree_set(struct percpu_counter_tree *counter, int v);
+void percpu_counter_tree_approximate_accuracy_range(struct percpu_counter_tree *counter,
+		unsigned int *under, unsigned int *over);
+int percpu_counter_tree_subsystem_init(void);
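+
+/*
+ * Example usage (illustrative sketch only; assumes a caller-managed
+ * items buffer and elides error handling):
+ *
+ *	struct percpu_counter_tree c;
+ *	struct percpu_counter_tree_level_item *items =
+ *		kmalloc(percpu_counter_tree_items_size(), GFP_KERNEL);
+ *
+ *	percpu_counter_tree_init(&c, items, 32, GFP_KERNEL);
+ *	percpu_counter_tree_add(&c, 1);
+ *	pr_info("approx sum: %d\n", percpu_counter_tree_approximate_sum(&c));
+ *	percpu_counter_tree_destroy(&c);
+ *	kfree(items);
+ */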
+
+/**
+ * percpu_counter_tree_approximate_sum() - Return approximate counter sum.
+ * @counter: The counter to sum.
+ *
+ * Querying the approximate sum is fast, but it is only accurate within
+ * the bounds delimited by percpu_counter_tree_approximate_accuracy_range().
+ * This is meant to be used when speed is preferred over accuracy.
+ *
+ * Return: The current approximate counter sum.
+ */
+static inline
+int percpu_counter_tree_approximate_sum(struct percpu_counter_tree *counter)
+{
+	unsigned int v;
+
+	if (!counter->level0_bit_mask)
+		v = READ_ONCE(*counter->approx_sum.i);
+	else
+		v = atomic_read(counter->approx_sum.a);
+	return (int) (v + (unsigned int)READ_ONCE(counter->bias));
+}
+
+#else	/* !CONFIG_SMP */
+
+struct percpu_counter_tree_level_item;
+
+struct percpu_counter_tree {
+	atomic_t count;
+};
+
+static inline
+size_t percpu_counter_tree_items_size(void)
+{
+	return 0;
+}
+
+static inline
+int percpu_counter_tree_init_many(struct percpu_counter_tree *counters, struct percpu_counter_tree_level_item *items,
+		unsigned int nr_counters, unsigned int batch_size, gfp_t gfp_flags)
+{
+	for (unsigned int i = 0; i < nr_counters; i++)
+		atomic_set(&counters[i].count, 0);
+	return 0;
+}
+
+static inline
+int percpu_counter_tree_init(struct percpu_counter_tree *counter, struct percpu_counter_tree_level_item *items,
+		unsigned int batch_size, gfp_t gfp_flags)
+{
+	return percpu_counter_tree_init_many(counter, items, 1, batch_size, gfp_flags);
+}
+
+static inline
+void percpu_counter_tree_destroy_many(struct percpu_counter_tree *counters, unsigned int nr_counters)
+{
+}
+
+static inline
+void percpu_counter_tree_destroy(struct percpu_counter_tree *counter)
+{
+}
+
+static inline
+int percpu_counter_tree_precise_sum(struct percpu_counter_tree *counter)
+{
+	return atomic_read(&counter->count);
+}
+
+static inline
+int percpu_counter_tree_precise_compare(struct percpu_counter_tree *a, struct percpu_counter_tree *b)
+{
+	int count_a = percpu_counter_tree_precise_sum(a),
+	    count_b = percpu_counter_tree_precise_sum(b);
+
+	if (count_a == count_b)
+		return 0;
+	if (count_a < count_b)
+		return -1;
+	return 1;
+}
+
+static inline
+int percpu_counter_tree_precise_compare_value(struct percpu_counter_tree *counter, int v)
+{
+	int count = percpu_counter_tree_precise_sum(counter);
+
+	if (count == v)
+		return 0;
+	if (count < v)
+		return -1;
+	return 1;
+}
+
+static inline
+int percpu_counter_tree_approximate_compare(struct percpu_counter_tree *a, struct percpu_counter_tree *b)
+{
+	return percpu_counter_tree_precise_compare(a, b);
+}
+
+static inline
+int percpu_counter_tree_approximate_compare_value(struct percpu_counter_tree *counter, int v)
+{
+	return percpu_counter_tree_precise_compare_value(counter, v);
+}
+
+static inline
+void percpu_counter_tree_set(struct percpu_counter_tree *counter, int v)
+{
+	atomic_set(&counter->count, v);
+}
+
+static inline
+void percpu_counter_tree_approximate_accuracy_range(struct percpu_counter_tree *counter,
+		unsigned int *under, unsigned int *over)
+{
+	*under = 0;
+	*over = 0;
+}
+
+static inline
+void percpu_counter_tree_add(struct percpu_counter_tree *counter, int inc)
+{
+	atomic_add(inc, &counter->count);
+}
+
+static inline
+int percpu_counter_tree_approximate_sum(struct percpu_counter_tree *counter)
+{
+	return percpu_counter_tree_precise_sum(counter);
+}
+
+static inline
+int percpu_counter_tree_subsystem_init(void)
+{
+	return 0;
+}
+
+#endif	/* CONFIG_SMP */
+
+/**
+ * percpu_counter_tree_approximate_sum_positive() - Return a positive approximate counter sum.
+ * @counter: The counter to sum.
+ *
+ * Return an approximate counter sum which is guaranteed to be greater
+ * than or equal to 0.
+ *
+ * Return: The current positive approximate counter sum.
+ */
+static inline
+int percpu_counter_tree_approximate_sum_positive(struct percpu_counter_tree *counter)
+{
+	int v = percpu_counter_tree_approximate_sum(counter);
+
+	return v > 0 ? v : 0;
+}
+
+/**
+ * percpu_counter_tree_precise_sum_positive() - Return a positive precise counter sum.
+ * @counter: The counter to sum.
+ *
+ * Return a precise counter sum which is guaranteed to be greater
+ * than or equal to 0.
+ *
+ * Return: The current positive precise counter sum.
+ */
+static inline
+int percpu_counter_tree_precise_sum_positive(struct percpu_counter_tree *counter)
+{
+	int v = percpu_counter_tree_precise_sum(counter);
+
+	return v > 0 ? v : 0;
+}
+
+#endif	/* _PERCPU_COUNTER_TREE_H */
diff --git a/init/main.c b/init/main.c
index 07a3116811c5..8487489b9d00 100644
--- a/init/main.c
+++ b/init/main.c
@@ -104,6 +104,7 @@
 #include
 #include
 #include
+#include
 #include
 
 #include
@@ -968,6 +969,7 @@ void start_kernel(void)
 	vfs_caches_init_early();
 	sort_main_extable();
 	trap_init();
+	percpu_counter_tree_subsystem_init();
 	mm_core_init();
 	maple_tree_init();
 	poking_init();
diff --git a/lib/Makefile b/lib/Makefile
index 1ab2c4be3b66..767dc178a55c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -179,6 +179,7 @@ obj-$(CONFIG_TEXTSEARCH_KMP) += ts_kmp.o
 obj-$(CONFIG_TEXTSEARCH_BM) += ts_bm.o
 obj-$(CONFIG_TEXTSEARCH_FSM) += ts_fsm.o
 obj-$(CONFIG_SMP) += percpu_counter.o
+obj-$(CONFIG_SMP) += percpu_counter_tree.o
 obj-$(CONFIG_AUDIT_GENERIC) += audit.o
 obj-$(CONFIG_AUDIT_COMPAT_GENERIC) += compat_audit.o
 
diff --git a/lib/percpu_counter_tree.c b/lib/percpu_counter_tree.c
new file mode 100644
index 000000000000..9b7ae7df5530
--- /dev/null
+++ b/lib/percpu_counter_tree.c
@@ -0,0 +1,705 @@
+// SPDX-License-Identifier: GPL-2.0+ OR MIT
+// SPDX-FileCopyrightText: 2025 Mathieu Desnoyers
+
+/*
+ * Split Counters With Tree Approximation Propagation
+ *
+ * * Propagation diagram when reaching batch size thresholds (± batch size):
+ *
+ * Example diagram for 8 CPUs:
+ *
+ * log2(8) = 3 levels
+ *
+ * At each level, each pair propagates its values to the next level when
+ * reaching the batch size thresholds.
+ *
+ * Counters at levels 0, 1, 2 can be kept on a single byte ([-128 .. +127] range),
+ * although it may be relevant to keep them on 32-bit counters for
+ * simplicity. (complexity vs memory footprint tradeoff)
+ *
+ * Counter at level 3 can be kept on a 32-bit counter.
+ *
+ * Level 0:  0    1    2    3    4    5    6    7
+ *           |   /     |   /     |   /     |   /
+ *           |  /      |  /      |  /      |  /
+ *           | /       | /       | /       | /
+ * Level 1:  0         1         2         3
+ *           |       /           |       /
+ *           |      /            |      /
+ *           |     /             |     /
+ * Level 2:  0                   1
+ *           |                 /
+ *           |                /
+ *           |               /
+ * Level 3:  0
+ *
+ * * Approximation accuracy:
+ *
+ * BATCH(level N): Level N batch size.
+ *
+ * Example for BATCH(level 0) = 32.
+ *
+ * BATCH(level 0) = 32
+ * BATCH(level 1) = 64
+ * BATCH(level 2) = 128
+ * BATCH(level N) = BATCH(level 0) * 2^N
+ *
+ *             per-counter      global
+ *             accuracy         accuracy
+ * Level 0:    [ -32 ..  +31]   ±256 (8 * 32)
+ * Level 1:    [ -64 ..  +63]   ±256 (4 * 64)
+ * Level 2:    [-128 .. +127]   ±256 (2 * 128)
+ * Total:      ------           ±768 (log2(nr_cpu_ids) * BATCH(level 0) * nr_cpu_ids)
+ *
+ * Note that the global accuracy can be calculated more precisely
+ * by taking into account that the positive accuracy range is
+ * 31 rather than 32.
+ *
+ * -----
+ *
+ * Approximate Sum Carry Propagation
+ *
+ * Let's define a number of counter bits for each level, e.g.:
+ *
+ * log2(BATCH(level 0)) = log2(32) = 5
+ *
+ *             nr_bit    value_mask               range
+ * Level 0:    5 bits    v                        0 .. +31
+ * Level 1:    1 bit     (v & ~((1UL << 5) - 1))  0 .. +63
+ * Level 2:    1 bit     (v & ~((1UL << 6) - 1))  0 .. +127
+ * Level 3:    25 bits   (v & ~((1UL << 7) - 1))  0 .. 2^32-1
+ *
+ * Note: Use a full 32-bit per-cpu counter at level 0 to allow precise sum.
+ *
+ * Note: Use cacheline aligned counters at levels above 0 to prevent false sharing.
+ *       If memory footprint is an issue, a specialized allocator could be used
+ *       to eliminate padding.
+ *
+ * Example with expanded values:
+ *
+ * counter_add(counter, inc):
+ *
+ *	if (!inc)
+ *		return;
+ *
+ *	res = percpu_add_return(counter @ Level 0, inc);
+ *	orig = res - inc;
+ *	if (inc < 0) {
+ *		inc = -(-inc & ~0b00011111);	// Clear used bits
+ *		// xor bit 5: underflow
+ *		if ((inc ^ orig ^ res) & 0b00100000)
+ *			inc -= 0b00100000;
+ *	} else {
+ *		inc &= ~0b00011111;		// Clear used bits
+ *		// xor bit 5: overflow
+ *		if ((inc ^ orig ^ res) & 0b00100000)
+ *			inc += 0b00100000;
+ *	}
+ *	if (!inc)
+ *		return;
+ *
+ *	res = atomic_add_return(counter @ Level 1, inc);
+ *	orig = res - inc;
+ *	if (inc < 0) {
+ *		inc = -(-inc & ~0b00111111);	// Clear used bits
+ *		// xor bit 6: underflow
+ *		if ((inc ^ orig ^ res) & 0b01000000)
+ *			inc -= 0b01000000;
+ *	} else {
+ *		inc &= ~0b00111111;		// Clear used bits
+ *		// xor bit 6: overflow
+ *		if ((inc ^ orig ^ res) & 0b01000000)
+ *			inc += 0b01000000;
+ *	}
+ *	if (!inc)
+ *		return;
+ *
+ *	res = atomic_add_return(counter @ Level 2, inc);
+ *	orig = res - inc;
+ *	if (inc < 0) {
+ *		inc = -(-inc & ~0b01111111);	// Clear used bits
+ *		// xor bit 7: underflow
+ *		if ((inc ^ orig ^ res) & 0b10000000)
+ *			inc -= 0b10000000;
+ *	} else {
+ *		inc &= ~0b01111111;		// Clear used bits
+ *		// xor bit 7: overflow
+ *		if ((inc ^ orig ^ res) & 0b10000000)
+ *			inc += 0b10000000;
+ *	}
+ *	if (!inc)
+ *		return;
+ *
+ *	atomic_add(counter @ Level 3, inc);
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#define MAX_NR_LEVELS 5
+
+/*
+ * The counter configuration is selected at boot time based on the
+ * hardware topology.
+ */
+struct counter_config {
+	unsigned int nr_items;		/*
+					 * nr_items is the number of items in the tree for levels 1
+					 * up to and including the final level (approximate sum).
+					 * It excludes the level 0 per-CPU counters.
+					 */
+	unsigned char nr_levels;	/*
+					 * nr_levels is the number of hierarchical counter tree levels.
+					 * It excludes the final level (approximate sum).
+					 */
+	unsigned char n_arity_order[MAX_NR_LEVELS];	/*
+							 * n-arity of tree nodes for each level from
+							 * 0 to (nr_levels - 1).
+							 */
+};
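+
+/*
+ * Illustrative note: nr_items sums the item counts of all levels above
+ * level 0. For instance, order 9 (up to 512 CPUs) with n_arity_order
+ * { 3, 3, 3 } has 512/8 + 64/8 + 8/8 = 64 + 8 + 1 = 73 items.
+ */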
+
+static const struct counter_config per_nr_cpu_order_config[] = {
+	[0] =	{ .nr_items = 0,	.nr_levels = 0,	.n_arity_order = { 0 } },
+	[1] =	{ .nr_items = 1,	.nr_levels = 1,	.n_arity_order = { 1 } },
+	[2] =	{ .nr_items = 3,	.nr_levels = 2,	.n_arity_order = { 1, 1 } },
+	[3] =	{ .nr_items = 7,	.nr_levels = 3,	.n_arity_order = { 1, 1, 1 } },
+	[4] =	{ .nr_items = 7,	.nr_levels = 3,	.n_arity_order = { 2, 1, 1 } },
+	[5] =	{ .nr_items = 11,	.nr_levels = 3,	.n_arity_order = { 2, 2, 1 } },
+	[6] =	{ .nr_items = 21,	.nr_levels = 3,	.n_arity_order = { 2, 2, 2 } },
+	[7] =	{ .nr_items = 21,	.nr_levels = 3,	.n_arity_order = { 3, 2, 2 } },
+	[8] =	{ .nr_items = 37,	.nr_levels = 3,	.n_arity_order = { 3, 3, 2 } },
+	[9] =	{ .nr_items = 73,	.nr_levels = 3,	.n_arity_order = { 3, 3, 3 } },
+	[10] =	{ .nr_items = 149,	.nr_levels = 4,	.n_arity_order = { 3, 3, 2, 2 } },
+	[11] =	{ .nr_items = 293,	.nr_levels = 4,	.n_arity_order = { 3, 3, 3, 2 } },
+	[12] =	{ .nr_items = 585,	.nr_levels = 4,	.n_arity_order = { 3, 3, 3, 3 } },
+	[13] =	{ .nr_items = 1173,	.nr_levels = 5,	.n_arity_order = { 3, 3, 3, 2, 2 } },
+	[14] =	{ .nr_items = 2341,	.nr_levels = 5,	.n_arity_order = { 3, 3, 3, 3, 2 } },
+	[15] =	{ .nr_items = 4681,	.nr_levels = 5,	.n_arity_order = { 3, 3, 3, 3, 3 } },
+	[16] =	{ .nr_items = 4681,	.nr_levels = 5,	.n_arity_order = { 4, 3, 3, 3, 3 } },
+	[17] =	{ .nr_items = 8777,	.nr_levels = 5,	.n_arity_order = { 4, 4, 3, 3, 3 } },
+	[18] =	{ .nr_items = 17481,	.nr_levels = 5,	.n_arity_order = { 4, 4, 4, 3, 3 } },
+	[19] =	{ .nr_items = 34953,	.nr_levels = 5,	.n_arity_order = { 4, 4, 4, 4, 3 } },
+	[20] =	{ .nr_items = 69905,	.nr_levels = 5,	.n_arity_order = { 4, 4, 4, 4, 4 } },
+};
+
+static const struct counter_config *counter_config;	/* Hierarchical counter configuration for the hardware topology. */
+static unsigned int nr_cpus_order,		/* Order of nr_cpu_ids. */
+		    accuracy_multiplier;	/* Multiplication factor used to derive the accuracy for a given batch size. */
+
+static
+int __percpu_counter_tree_init(struct percpu_counter_tree *counter,
+		unsigned int batch_size, gfp_t gfp_flags,
+		unsigned int __percpu *level0,
+		struct percpu_counter_tree_level_item *items)
+{
+	/* Batch size must be greater than 1, and a power of 2. */
+	if (WARN_ON(batch_size <= 1 || (batch_size & (batch_size - 1))))
+		return -EINVAL;
+	counter->batch_size = batch_size;
+	counter->bias = 0;
+	counter->level0 = level0;
+	counter->items = items;
+	if (!nr_cpus_order) {
+		counter->approx_sum.i = per_cpu_ptr(counter->level0, 0);
+		counter->level0_bit_mask = 0;
+	} else {
+		counter->approx_sum.a = &counter->items[counter_config->nr_items - 1].count;
+		counter->level0_bit_mask = 1UL << get_count_order(batch_size);
+	}
+	/*
+	 * Each tree item signed integer has a negative range which is
+	 * one unit greater than the positive range.
+	 */
+	counter->approx_accuracy_range.under = batch_size * accuracy_multiplier;
+	counter->approx_accuracy_range.over = (batch_size - 1) * accuracy_multiplier;
+	return 0;
+}
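+
+/*
+ * Worked example (derived from the topology table above): a machine
+ * with 8 possible CPUs uses config order 3, for which
+ * accuracy_multiplier is 1*8 + 2*4 + 4*2 = 24. With batch_size == 32
+ * this yields under = 32 * 24 = 768 and over = 31 * 24 = 744.
+ */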
+
+/**
+ * percpu_counter_tree_init_many() - Initialize many per-CPU counter trees.
+ * @counters: An array of @nr_counters counters to initialize.
+ *            Their memory is provided by the caller.
+ * @items: Pointer to the memory area where to store tree items.
+ *         This memory is provided by the caller.
+ *         Its size needs to be at least @nr_counters * percpu_counter_tree_items_size().
+ * @nr_counters: The number of counter trees to initialize.
+ * @batch_size: The batch size is the increment step at level 0 which triggers a
+ *              carry propagation.
+ *              The batch size is required to be greater than 1, and a power of 2.
+ * @gfp_flags: gfp flags to pass to the per-CPU allocator.
+ *
+ * Initialize many per-CPU counter trees using a single per-CPU
+ * allocator invocation for @nr_counters counters.
+ *
+ * Return:
+ * * %0:       Success
+ * * %-EINVAL: Invalid @batch_size argument
+ * * %-ENOMEM: Out of memory
+ */
+int percpu_counter_tree_init_many(struct percpu_counter_tree *counters, struct percpu_counter_tree_level_item *items,
+		unsigned int nr_counters, unsigned int batch_size, gfp_t gfp_flags)
+{
+	void __percpu *level0, *level0_iter;
+	size_t counter_size, items_size = 0;
+	void *items_iter;
+	unsigned int i;
+	int ret;
+
+	counter_size = ALIGN(sizeof(*counters), __alignof__(*counters));
+	level0 = __alloc_percpu_gfp(nr_counters * counter_size,
+			__alignof__(*counters), gfp_flags);
+	if (!level0)
+		return -ENOMEM;
+	if (nr_cpus_order) {
+		items_size = percpu_counter_tree_items_size();
+		memset(items, 0, items_size * nr_counters);
+	}
+	level0_iter = level0;
+	items_iter = items;
+	for (i = 0; i < nr_counters; i++) {
+		ret = __percpu_counter_tree_init(&counters[i], batch_size, gfp_flags, level0_iter, items_iter);
+		if (ret)
+			goto free_level0;
+		level0_iter += counter_size;
+		if (nr_cpus_order)
+			items_iter += items_size;
+	}
+	return 0;
+
+free_level0:
+	free_percpu(level0);
+	return ret;
+}
+
+/**
+ * percpu_counter_tree_init() - Initialize one per-CPU counter tree.
+ * @counter: Counter to initialize.
+ *           Its memory is provided by the caller.
+ * @items: Pointer to the memory area where to store tree items.
+ *         This memory is provided by the caller.
+ *         Its size needs to be at least percpu_counter_tree_items_size().
+ * @batch_size: The batch size is the increment step at level 0 which triggers a
+ *              carry propagation.
+ *              The batch size is required to be greater than 1, and a power of 2.
+ * @gfp_flags: gfp flags to pass to the per-CPU allocator.
+ *
+ * Initialize one per-CPU counter tree.
+ *
+ * Return:
+ * * %0:       Success
+ * * %-EINVAL: Invalid @batch_size argument
+ * * %-ENOMEM: Out of memory
+ */
+int percpu_counter_tree_init(struct percpu_counter_tree *counter, struct percpu_counter_tree_level_item *items,
+		unsigned int batch_size, gfp_t gfp_flags)
+{
+	return percpu_counter_tree_init_many(counter, items, 1, batch_size, gfp_flags);
+}
+
+/**
+ * percpu_counter_tree_destroy_many() - Destroy many per-CPU counter trees.
+ * @counters: Array of counter trees to destroy.
+ * @nr_counters: The number of counter trees to destroy.
+ *
+ * Release internal resources allocated for @nr_counters per-CPU counter trees.
+ */
+void percpu_counter_tree_destroy_many(struct percpu_counter_tree *counters, unsigned int nr_counters)
+{
+	free_percpu(counters->level0);
+}
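+
+/*
+ * Example (sketch, error handling elided, hypothetical NR constant):
+ * lifecycle of a batch of counter trees sharing one per-CPU allocation
+ * and a caller-provided items area.
+ *
+ *	struct percpu_counter_tree c[NR];
+ *	struct percpu_counter_tree_level_item *items =
+ *		kmalloc_array(NR, percpu_counter_tree_items_size(), GFP_KERNEL);
+ *
+ *	percpu_counter_tree_init_many(c, items, NR, 32, GFP_KERNEL);
+ *	...
+ *	percpu_counter_tree_destroy_many(c, NR);
+ *	kfree(items);
+ */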
+
+/**
+ * percpu_counter_tree_destroy() - Destroy one per-CPU counter tree.
+ * @counter: Counter to destroy.
+ *
+ * Release internal resources allocated for one per-CPU counter tree.
+ */
+void percpu_counter_tree_destroy(struct percpu_counter_tree *counter)
+{
+	return percpu_counter_tree_destroy_many(counter, 1);
+}
+
+static
+int percpu_counter_tree_carry(int orig, int res, int inc, unsigned int bit_mask)
+{
+	if (inc < 0) {
+		inc = -(-inc & ~(bit_mask - 1));
+		/*
+		 * xor bit_mask: underflow.
+		 *
+		 * If inc has bit set, decrement an additional bit if
+		 * there is _no_ bit transition between orig and res.
+		 * Else, inc has bit cleared, decrement an additional
+		 * bit if there is a bit transition between orig and
+		 * res.
+		 */
+		if ((inc ^ orig ^ res) & bit_mask)
+			inc -= bit_mask;
+	} else {
+		inc &= ~(bit_mask - 1);
+		/*
+		 * xor bit_mask: overflow.
+		 *
+		 * If inc has bit set, increment an additional bit if
+		 * there is _no_ bit transition between orig and res.
+		 * Else, inc has bit cleared, increment an additional
+		 * bit if there is a bit transition between orig and
+		 * res.
+		 */
+		if ((inc ^ orig ^ res) & bit_mask)
+			inc += bit_mask;
+	}
+	return inc;
+}
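+
+/*
+ * Worked example for bit_mask == 0b100000 (batch_size == 32): with
+ * orig == 30 and inc == 5, res == 35. Clearing the used bits leaves
+ * inc == 0, but bit 5 differs between orig and res
+ * (30 ^ 35 == 0b111101), so a carry of +32 is propagated to the next
+ * level.
+ */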
+
+/*
+ * It does not matter through which path the carry propagates up the
+ * tree, therefore there is no need to disable preemption: the cpu
+ * number is only used to favor cache locality.
+ */
+static
+void percpu_counter_tree_add_slowpath(struct percpu_counter_tree *counter, int inc)
+{
+	unsigned int level_items, nr_levels = counter_config->nr_levels,
+		     level, n_arity_order, bit_mask;
+	struct percpu_counter_tree_level_item *item = counter->items;
+	unsigned int cpu = raw_smp_processor_id();
+
+	WARN_ON_ONCE(!nr_cpus_order);	/* Should never be called for 1 cpu. */
+
+	n_arity_order = counter_config->n_arity_order[0];
+	bit_mask = counter->level0_bit_mask << n_arity_order;
+	level_items = 1U << (nr_cpus_order - n_arity_order);
+
+	for (level = 1; level < nr_levels; level++) {
+		atomic_t *count = &item[cpu & (level_items - 1)].count;
+		unsigned int orig, res;
+
+		res = atomic_add_return_relaxed(inc, count);
+		orig = res - inc;
+		inc = percpu_counter_tree_carry(orig, res, inc, bit_mask);
+		if (likely(!inc))
+			return;
+		item += level_items;
+		n_arity_order = counter_config->n_arity_order[level];
+		level_items >>= n_arity_order;
+		bit_mask <<= n_arity_order;
+	}
+	atomic_add(inc, counter->approx_sum.a);
+}
+
+/**
+ * percpu_counter_tree_add() - Add to a per-CPU counter tree.
+ * @counter: Counter added to.
+ * @inc: Increment value (either positive or negative).
+ *
+ * Add @inc to a per-CPU counter tree. This is a fast path which
+ * typically only increments the per-CPU counters, as long as no carry
+ * greater than or equal to the counter tree batch size needs to be
+ * propagated.
+ */
+void percpu_counter_tree_add(struct percpu_counter_tree *counter, int inc)
+{
+	unsigned int bit_mask = counter->level0_bit_mask, orig, res;
+
+	res = this_cpu_add_return(*counter->level0, inc);
+	orig = res - inc;
+	inc = percpu_counter_tree_carry(orig, res, inc, bit_mask);
+	if (likely(!inc))
+		return;
+	percpu_counter_tree_add_slowpath(counter, inc);
+}
+
+static
+int percpu_counter_tree_precise_sum_unbiased(struct percpu_counter_tree *counter)
+{
+	unsigned int sum = 0;
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		sum += *per_cpu_ptr(counter->level0, cpu);
+	return (int) sum;
+}
+
+/**
+ * percpu_counter_tree_precise_sum() - Return precise counter sum.
+ * @counter: The counter to sum.
+ *
+ * Querying the precise sum is relatively expensive because it needs to
+ * iterate over all possible CPUs.
+ * This is meant to be used when accuracy is preferred over speed.
+ *
+ * Return: The current precise counter sum.
+ */
+int percpu_counter_tree_precise_sum(struct percpu_counter_tree *counter)
+{
+	return percpu_counter_tree_precise_sum_unbiased(counter) + READ_ONCE(counter->bias);
+}
+
+static
+int compare_delta(int delta, unsigned int accuracy_neg, unsigned int accuracy_pos)
+{
+	if (delta >= 0) {
+		if (delta <= accuracy_pos)
+			return 0;
+		else
+			return 1;
+	} else {
+		if (-delta <= accuracy_neg)
+			return 0;
+		else
+			return -1;
+	}
+}
+
+/**
+ * percpu_counter_tree_approximate_compare() - Approximate comparison of two counter trees.
+ * @a: First counter to compare.
+ * @b: Second counter to compare.
+ *
+ * Evaluate an approximate comparison of two counter trees. The
+ * approximate comparison is fast, and its answer is accurate when one
+ * counter is found to be less than or greater than the other. However,
+ * when it returns 0, the respective sums of the counters are only
+ * known to be within their combined accuracy ranges.
+ *
+ * Return:
+ * * %0  - Counters @a and @b do not differ by more than the sum of their
+ *         respective accuracy ranges.
+ * * %-1 - Counter @a is less than counter @b.
+ * * %1  - Counter @a is greater than counter @b.
+ */
+int percpu_counter_tree_approximate_compare(struct percpu_counter_tree *a, struct percpu_counter_tree *b)
+{
+	return compare_delta(percpu_counter_tree_approximate_sum(a) - percpu_counter_tree_approximate_sum(b),
+			a->approx_accuracy_range.over + b->approx_accuracy_range.under,
+			a->approx_accuracy_range.under + b->approx_accuracy_range.over);
+}
+
+/**
+ * percpu_counter_tree_approximate_compare_value() - Approximate comparison of a counter tree against a given value.
+ * @counter: Counter to compare.
+ * @v: Value to compare.
+ *
+ * Evaluate an approximate comparison of a counter tree against a given
+ * value. The approximate comparison is fast, and its answer is accurate
+ * when the value is found to be less than or greater than the counter.
+ * However, when it returns 0, the value is only known to be within the
+ * counter accuracy range.
+ *
+ * Return:
+ * * %0  - The value @v is within the accuracy range of the counter.
+ * * %-1 - The value @v is less than the counter.
+ * * %1  - The value @v is greater than the counter.
+ */
+int percpu_counter_tree_approximate_compare_value(struct percpu_counter_tree *counter, int v)
+{
+	return compare_delta(v - percpu_counter_tree_approximate_sum(counter),
+			counter->approx_accuracy_range.under,
+			counter->approx_accuracy_range.over);
+}
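+
+/*
+ * Example (sketch, hypothetical limit check): take a slow path only
+ * when the counter is clearly above a limit, treating the inconclusive
+ * 0 case as "not above":
+ *
+ *	if (percpu_counter_tree_approximate_compare_value(c, limit) < 0)
+ *		handle_over_limit();
+ */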
+
+/**
+ * percpu_counter_tree_precise_compare() - Precise comparison of two counter trees.
+ * @a: First counter to compare.
+ * @b: Second counter to compare.
+ *
+ * Evaluate a precise comparison of two counter trees.
+ * As an optimization, it uses the approximate counter comparison
+ * to quickly compare counters which are far apart. Only cases where
+ * the counter sums are within the accuracy range require precise
+ * counter sums.
+ *
+ * Return:
+ * * %0  - Counters are equal.
+ * * %-1 - Counter @a is less than counter @b.
+ * * %1  - Counter @a is greater than counter @b.
+ */
+int percpu_counter_tree_precise_compare(struct percpu_counter_tree *a, struct percpu_counter_tree *b)
+{
+	int count_a = percpu_counter_tree_approximate_sum(a),
+	    count_b = percpu_counter_tree_approximate_sum(b);
+	unsigned int accuracy_a, accuracy_b;
+	int delta = count_a - count_b;
+	int res;
+
+	res = compare_delta(delta,
+			a->approx_accuracy_range.over + b->approx_accuracy_range.under,
+			a->approx_accuracy_range.under + b->approx_accuracy_range.over);
+	/* The values are distant enough for an accurate approximate comparison. */
+	if (res)
+		return res;
+
+	/*
+	 * The approximate comparison is within the accuracy range, therefore at least one
+	 * precise sum is needed. Sum the counter which has the largest accuracy range first.
+	 */
+	if (delta >= 0) {
+		accuracy_a = a->approx_accuracy_range.under;
+		accuracy_b = b->approx_accuracy_range.over;
+	} else {
+		accuracy_a = a->approx_accuracy_range.over;
+		accuracy_b = b->approx_accuracy_range.under;
+	}
+	if (accuracy_b < accuracy_a) {
+		count_a = percpu_counter_tree_precise_sum(a);
+		res = compare_delta(count_a - count_b,
+				b->approx_accuracy_range.under,
+				b->approx_accuracy_range.over);
+		if (res)
+			return res;
+		/* Precise sum of second counter is required. */
+		count_b = percpu_counter_tree_precise_sum(b);
+	} else {
+		count_b = percpu_counter_tree_precise_sum(b);
+		res = compare_delta(count_a - count_b,
+				a->approx_accuracy_range.over,
+				a->approx_accuracy_range.under);
+		if (res)
+			return res;
+		/* Precise sum of second counter is required. */
+		count_a = percpu_counter_tree_precise_sum(a);
+	}
+	if (count_a - count_b < 0)
+		return -1;
+	if (count_a - count_b > 0)
+		return 1;
+	return 0;
+}
+
+/**
+ * percpu_counter_tree_precise_compare_value() - Precise comparison of a counter tree against a given value.
+ * @counter: Counter to compare.
+ * @v: Value to compare.
+ *
+ * Evaluate a precise comparison of a counter tree against a given value.
+ * As an optimization, it uses the approximate counter comparison
+ * to quickly identify whether the counter and value are far apart.
+ * Only cases where the value is within the counter accuracy range
+ * require a precise counter sum.
+ *
+ * Return:
+ * * %0  - The value @v is equal to the counter.
+ * * %-1 - The value @v is less than the counter.
+ * * %1  - The value @v is greater than the counter.
+ */
+int percpu_counter_tree_precise_compare_value(struct percpu_counter_tree *counter, int v)
+{
+	int count = percpu_counter_tree_approximate_sum(counter);
+	int res;
+
+	res = compare_delta(v - count,
+			counter->approx_accuracy_range.under,
+			counter->approx_accuracy_range.over);
+	/* The values are distant enough for an accurate approximate comparison. */
+	if (res)
+		return res;
+
+	/* Precise sum is required. */
+	count = percpu_counter_tree_precise_sum(counter);
+	if (v - count < 0)
+		return -1;
+	if (v - count > 0)
+		return 1;
+	return 0;
+}
+
+static
+void percpu_counter_tree_set_bias(struct percpu_counter_tree *counter, int bias)
+{
+	WRITE_ONCE(counter->bias, bias);
+}
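+
+/*
+ * For instance, if the unbiased precise sum of a counter is 41,
+ * percpu_counter_tree_set(counter, 0) below stores a bias of -41, so
+ * both the precise and approximate sums subsequently report values
+ * centered around 0 without touching the per-CPU counters.
+ */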
+
+/**
+ * percpu_counter_tree_set() - Set the counter tree sum to a given value.
+ * @counter: Counter to set.
+ * @v: Value to set.
+ *
+ * Set the counter sum to a given value. It can be useful for instance
+ * to reset the counter sum to 0. Note that even after setting the
+ * counter sum to a given value, the counter sum approximation can
+ * return any value within the accuracy range around that value.
+ */
+void percpu_counter_tree_set(struct percpu_counter_tree *counter, int v)
+{
+	percpu_counter_tree_set_bias(counter,
+			v - percpu_counter_tree_precise_sum_unbiased(counter));
+}
+
+/**
+ * percpu_counter_tree_approximate_accuracy_range() - Query the accuracy range for a counter tree.
+ * @counter: Counter to query.
+ * @under: Pointer where to store the accuracy range below the approximation.
+ * @over: Pointer where to store the accuracy range above the approximation.
+ *
+ * Query the accuracy range limits for the counter.
+ * Because of the two's complement binary representation, the "under" range is
+ * slightly larger than the "over" range.
+ * Those values are derived from the hardware topology and the counter tree
+ * batch size. They are invariant for a given counter tree.
+ * Using this function should typically not be required; see the following
+ * functions instead:
+ * * percpu_counter_tree_approximate_compare(),
+ * * percpu_counter_tree_approximate_compare_value(),
+ * * percpu_counter_tree_precise_compare(),
+ * * percpu_counter_tree_precise_compare_value().
+ */
+void percpu_counter_tree_approximate_accuracy_range(struct percpu_counter_tree *counter,
+		unsigned int *under, unsigned int *over)
+{
+	*under = counter->approx_accuracy_range.under;
+	*over = counter->approx_accuracy_range.over;
+}
+
+/*
+ * percpu_counter_tree_items_size() - Query the size required for counter tree items.
+ *
+ * Query the size of the memory area required to hold the counter tree
+ * items. This depends on the hardware topology and is invariant after
+ * boot.
+ *
+ * Return: Size required to hold the tree items of one counter tree.
+ */
+size_t percpu_counter_tree_items_size(void)
+{
+	if (!nr_cpus_order)
+		return 0;
+	return counter_config->nr_items * sizeof(struct percpu_counter_tree_level_item);
+}
+
+static void __init calculate_accuracy_topology(void)
+{
+	unsigned int nr_levels = counter_config->nr_levels, level;
+	unsigned int level_items = 1U << nr_cpus_order;
+	unsigned int batch_size = 1;
+
+	for (level = 0; level < nr_levels; level++) {
+		unsigned int n_arity_order = counter_config->n_arity_order[level];
+
+		/*
+		 * The accuracy multiplier is derived from a batch size of 1
+		 * to speed up calculating the accuracy at tree initialization.
+		 */
+		accuracy_multiplier += batch_size * level_items;
+		batch_size <<= n_arity_order;
+		level_items >>= n_arity_order;
+	}
+}
+
+int __init percpu_counter_tree_subsystem_init(void)
+{
+	nr_cpus_order = get_count_order(nr_cpu_ids);
+	if (WARN_ON_ONCE(nr_cpus_order >= ARRAY_SIZE(per_nr_cpu_order_config))) {
+		printk(KERN_ERR "Unsupported number of CPUs (%u)\n", nr_cpu_ids);
+		return -1;
+	}
+	counter_config = &per_nr_cpu_order_config[nr_cpus_order];
+	calculate_accuracy_topology();
+	return 0;
+}
-- 
2.39.5