This API provides existence guarantees of objects through Hazard
Pointers [1] (hazptr).
Its main benefit over RCU is that it allows fast reclaim of
HP-protected pointers without needing to wait for a grace period.
This implementation has 8 statically allocated hazard pointer slots per
CPU for the fast path, and relies on an on-stack backup slot, allocated
by the hazard pointer user, as a fallback in case no per-CPU slot is
available.
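A minimal usage sketch for illustration ("struct foo", the "gp" pointer,
do_something() and the update-side serialization are made-up and assumed
to be provided by the caller):

Reader side:

	struct hazptr_ctx ctx;
	struct foo *p;

	p = hazptr_acquire(&ctx, (void * const *) &gp);
	if (p) {
		/* p cannot be reclaimed until hazptr_release(). */
		do_something(p);
		hazptr_release(&ctx, p);
	}

Updater side (assuming a single updater or external serialization):

	struct foo *old;

	old = gp;
	WRITE_ONCE(gp, new_ptr);	/* Store A: unpublish the old pointer. */
	hazptr_synchronize(old);	/* Wait for readers to release it. */
	kfree(old);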
References:
[1]: M. M. Michael, "Hazard pointers: safe memory reclamation for
lock-free objects," in IEEE Transactions on Parallel and
Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004
Link: https://lpc.events/event/19/contributions/2082/
Link: https://lore.kernel.org/lkml/j3scdl5iymjlxavomgc6u5ndg3svhab6ga23dr36o4f5mt333w@7xslvq6b6hmv/
Link: https://lpc.events/event/18/contributions/1731/
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: John Stultz <jstultz@google.com>
Cc: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Zqiang <qiang.zhang1211@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: maged.michael@gmail.com
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
Cc: rcu@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: lkmm@lists.linux.dev
---
Changes since v3:
- Rename hazptr_retire to hazptr_release.
- Remove domains.
- Introduce "backup_slot" within hazptr context structure (on stack)
to handle slot overflow.
- Rename hazptr_try_protect to hazptr_acquire.
- Preallocate 8 per-CPU slots, and rely on caller-provided backup
slots (typically on stack) for out-of-slots situations.
Changes since v2:
- Address Peter Zijlstra's comments.
- Address Paul E. McKenney's comments.
Changes since v0:
- Remove slot variable from hp_dereference_allocate().
---
include/linux/hazptr.h | 182 +++++++++++++++++++++++++++++++++++++++++
init/main.c | 2 +
kernel/Makefile | 2 +-
kernel/hazptr.c | 150 +++++++++++++++++++++++++++++++++
4 files changed, 335 insertions(+), 1 deletion(-)
create mode 100644 include/linux/hazptr.h
create mode 100644 kernel/hazptr.c
diff --git a/include/linux/hazptr.h b/include/linux/hazptr.h
new file mode 100644
index 000000000000..70c066ddb0f5
--- /dev/null
+++ b/include/linux/hazptr.h
@@ -0,0 +1,182 @@
+// SPDX-FileCopyrightText: 2024 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+//
+// SPDX-License-Identifier: LGPL-2.1-or-later
+
+#ifndef _LINUX_HAZPTR_H
+#define _LINUX_HAZPTR_H
+
+/*
+ * hazptr: Hazard Pointers
+ *
+ * This API provides existence guarantees of objects through hazard
+ * pointers.
+ *
+ * Its main benefit over RCU is that it allows fast reclaim of
+ * HP-protected pointers without needing to wait for a grace period.
+ *
+ * References:
+ *
+ * [1]: M. M. Michael, "Hazard pointers: safe memory reclamation for
+ * lock-free objects," in IEEE Transactions on Parallel and
+ * Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004
+ */
+
+#include <linux/percpu.h>
+#include <linux/types.h>
+#include <linux/cleanup.h>
+
+/* 8 slots (each sizeof(void *)) fit in a single cache line. */
+#define NR_HAZPTR_PERCPU_SLOTS 8
+
+/*
+ * Hazard pointer slot.
+ */
+struct hazptr_slot {
+ void *addr;
+};
+
+struct hazptr_backup_slot {
+ struct list_head node;
+ struct hazptr_slot slot;
+ /* CPU requesting the backup slot. */
+ int cpu;
+};
+
+struct hazptr_ctx {
+ struct hazptr_slot *slot;
+ /* Backup slot in case all per-CPU slots are used. */
+ struct hazptr_backup_slot backup_slot;
+};
+
+struct hazptr_percpu_slots {
+ struct hazptr_slot slots[NR_HAZPTR_PERCPU_SLOTS];
+} ____cacheline_aligned;
+
+DECLARE_PER_CPU(struct hazptr_percpu_slots, hazptr_percpu_slots);
+
+/*
+ * hazptr_synchronize: Wait until @addr is released from all slots.
+ *
+ * Wait to observe that each slot contains a value that differs from
+ * @addr before returning.
+ * Should be called from preemptible context.
+ */
+void hazptr_synchronize(void *addr);
+
+/*
+ * hazptr_chain_backup_slot: Chain backup slot into overflow list.
+ *
+ * Initialize the backup slot of @ctx and chain it into the current
+ * CPU's overflow list.
+ */
+struct hazptr_slot *hazptr_chain_backup_slot(struct hazptr_ctx *ctx);
+
+/*
+ * hazptr_unchain_backup_slot: Unchain backup slot from overflow list.
+ */
+void hazptr_unchain_backup_slot(struct hazptr_ctx *ctx);
+
+static inline
+struct hazptr_slot *hazptr_get_free_percpu_slot(void)
+{
+ struct hazptr_percpu_slots *percpu_slots = this_cpu_ptr(&hazptr_percpu_slots);
+ unsigned int idx;
+
+ for (idx = 0; idx < NR_HAZPTR_PERCPU_SLOTS; idx++) {
+ struct hazptr_slot *slot = &percpu_slots->slots[idx];
+
+ if (!READ_ONCE(slot->addr))
+ return slot;
+ }
+ /* All slots are in use. */
+ return NULL;
+}
+
+static inline
+bool hazptr_slot_is_backup(struct hazptr_ctx *ctx, struct hazptr_slot *slot)
+{
+ return slot == &ctx->backup_slot.slot;
+}
+
+/*
+ * hazptr_acquire: Load pointer at address and protect with hazard pointer.
+ *
+ * Load @addr_p, and protect the loaded pointer with hazard pointer.
+ *
+ * Returns a non-NULL protected address if the loaded pointer is non-NULL.
+ * Returns NULL if the loaded pointer is NULL.
+ *
+ * On success the protected hazptr slot is stored in @ctx->slot.
+ */
+static inline
+void *hazptr_acquire(struct hazptr_ctx *ctx, void * const * addr_p)
+{
+ struct hazptr_slot *slot = NULL;
+ void *addr, *addr2;
+
+ /*
+ * Load @addr_p to know which address should be protected.
+ */
+ addr = READ_ONCE(*addr_p);
+ for (;;) {
+ if (!addr)
+ return NULL;
+ guard(preempt)();
+ if (likely(!hazptr_slot_is_backup(ctx, slot))) {
+ slot = hazptr_get_free_percpu_slot();
+ /*
+ * If all the per-CPU slots are already in use, fallback
+ * to the backup slot.
+ */
+ if (unlikely(!slot))
+ slot = hazptr_chain_backup_slot(ctx);
+ }
+ WRITE_ONCE(slot->addr, addr); /* Store B */
+
+ /* Memory ordering: Store B before Load A. */
+ smp_mb();
+
+ /*
+ * Re-load @addr_p after storing it to the hazard pointer slot.
+ */
+ addr2 = READ_ONCE(*addr_p); /* Load A */
+ if (likely(ptr_eq(addr2, addr)))
+ break;
+ /*
+ * If @addr_p content has changed since the first load,
+ * release the hazard pointer and try again.
+ */
+ WRITE_ONCE(slot->addr, NULL);
+ if (!addr2) {
+ if (hazptr_slot_is_backup(ctx, slot))
+ hazptr_unchain_backup_slot(ctx);
+ return NULL;
+ }
+ addr = addr2;
+ }
+ ctx->slot = slot;
+ /*
+ * Use addr2 loaded from the second READ_ONCE() to preserve
+ * address dependency ordering.
+ */
+ return addr2;
+}
+
+/* Release the protected hazard pointer from @slot. */
+static inline
+void hazptr_release(struct hazptr_ctx *ctx, void *addr)
+{
+ struct hazptr_slot *slot;
+
+ if (!addr)
+ return;
+ slot = ctx->slot;
+ WARN_ON_ONCE(slot->addr != addr);
+ smp_store_release(&slot->addr, NULL);
+ if (unlikely(hazptr_slot_is_backup(ctx, slot)))
+ hazptr_unchain_backup_slot(ctx);
+}
+
+void hazptr_init(void);
+
+#endif /* _LINUX_HAZPTR_H */
diff --git a/init/main.c b/init/main.c
index 07a3116811c5..858eaa87bde7 100644
--- a/init/main.c
+++ b/init/main.c
@@ -104,6 +104,7 @@
#include <linux/pidfs.h>
#include <linux/ptdump.h>
#include <linux/time_namespace.h>
+#include <linux/hazptr.h>
#include <net/net_namespace.h>
#include <asm/io.h>
@@ -1002,6 +1003,7 @@ void start_kernel(void)
workqueue_init_early();
rcu_init();
+ hazptr_init();
kvfree_rcu_init();
/* Trace events are available after this */
diff --git a/kernel/Makefile b/kernel/Makefile
index 9fe722305c9b..1178907fe0ea 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -7,7 +7,7 @@ obj-y = fork.o exec_domain.o panic.o \
cpu.o exit.o softirq.o resource.o \
sysctl.o capability.o ptrace.o user.o \
signal.o sys.o umh.o workqueue.o pid.o task_work.o \
- extable.o params.o \
+ extable.o params.o hazptr.o \
kthread.o sys_ni.o nsproxy.o nstree.o nscommon.o \
notifier.o ksysfs.o cred.o reboot.o \
async.o range.o smpboot.o ucount.o regset.o ksyms_common.o
diff --git a/kernel/hazptr.c b/kernel/hazptr.c
new file mode 100644
index 000000000000..2ec288bc1132
--- /dev/null
+++ b/kernel/hazptr.c
@@ -0,0 +1,150 @@
+// SPDX-FileCopyrightText: 2024 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+//
+// SPDX-License-Identifier: LGPL-2.1-or-later
+
+/*
+ * hazptr: Hazard Pointers
+ */
+
+#include <linux/hazptr.h>
+#include <linux/percpu.h>
+#include <linux/spinlock.h>
+#include <linux/list.h>
+#include <linux/export.h>
+
+struct overflow_list {
+ raw_spinlock_t lock; /* Lock protecting overflow list and list generation. */
+ struct list_head head; /* Overflow list head. */
+ uint64_t gen; /* Overflow list generation. */
+};
+
+static DEFINE_PER_CPU(struct overflow_list, percpu_overflow_list);
+
+DEFINE_PER_CPU(struct hazptr_percpu_slots, hazptr_percpu_slots);
+EXPORT_PER_CPU_SYMBOL_GPL(hazptr_percpu_slots);
+
+/*
+ * Perform piecewise iteration on overflow list waiting until "addr" is
+ * not present. Raw spinlock is released and taken between each list
+ * item and busy loop iteration. The overflow list generation is checked
+ * each time the lock is taken to validate that the list has not changed
+ * before resuming iteration or busy wait. If the generation has
+ * changed, retry the entire list traversal.
+ */
+static
+void hazptr_synchronize_overflow_list(struct overflow_list *overflow_list, void *addr)
+{
+ struct hazptr_backup_slot *backup_slot;
+ uint64_t snapshot_gen;
+
+ raw_spin_lock(&overflow_list->lock);
+retry:
+ snapshot_gen = overflow_list->gen;
+ list_for_each_entry(backup_slot, &overflow_list->head, node) {
+ /* Busy-wait if node is found. */
+ while (smp_load_acquire(&backup_slot->slot.addr) == addr) { /* Load B */
+ raw_spin_unlock(&overflow_list->lock);
+ cpu_relax();
+ raw_spin_lock(&overflow_list->lock);
+ if (overflow_list->gen != snapshot_gen)
+ goto retry;
+ }
+ raw_spin_unlock(&overflow_list->lock);
+ /*
+ * Release raw spinlock, validate generation after
+ * re-acquiring the lock.
+ */
+ raw_spin_lock(&overflow_list->lock);
+ if (overflow_list->gen != snapshot_gen)
+ goto retry;
+ }
+ raw_spin_unlock(&overflow_list->lock);
+}
+
+static
+void hazptr_synchronize_cpu_slots(int cpu, void *addr)
+{
+ struct hazptr_percpu_slots *percpu_slots = per_cpu_ptr(&hazptr_percpu_slots, cpu);
+ unsigned int idx;
+
+ for (idx = 0; idx < NR_HAZPTR_PERCPU_SLOTS; idx++) {
+ struct hazptr_slot *slot = &percpu_slots->slots[idx];
+
+ /* Busy-wait if node is found. */
+ smp_cond_load_acquire(&slot->addr, VAL != addr); /* Load B */
+ }
+}
+
+/*
+ * hazptr_synchronize: Wait until @addr is released from all slots.
+ *
+ * Wait to observe that each slot contains a value that differs from
+ * @addr before returning.
+ * Should be called from preemptible context.
+ */
+void hazptr_synchronize(void *addr)
+{
+ int cpu;
+
+ /*
+ * Busy-wait should only be done from preemptible context.
+ */
+ lockdep_assert_preemption_enabled();
+
+ /*
+ * Store A precedes hazptr_scan(): it unpublishes addr (sets it to
+ * NULL or to a different value), and thus hides it from hazard
+ * pointer readers.
+ */
+ if (!addr)
+ return;
+ /* Memory ordering: Store A before Load B. */
+ smp_mb();
+ /* Scan all CPUs slots. */
+ for_each_possible_cpu(cpu) {
+ /* Scan CPU slots. */
+ hazptr_synchronize_cpu_slots(cpu, addr);
+ /* Scan backup slots in percpu overflow list. */
+ hazptr_synchronize_overflow_list(per_cpu_ptr(&percpu_overflow_list, cpu), addr);
+ }
+}
+EXPORT_SYMBOL_GPL(hazptr_synchronize);
+
+struct hazptr_slot *hazptr_chain_backup_slot(struct hazptr_ctx *ctx)
+{
+ struct overflow_list *overflow_list = this_cpu_ptr(&percpu_overflow_list);
+ struct hazptr_slot *slot = &ctx->backup_slot.slot;
+
+ slot->addr = NULL;
+
+ raw_spin_lock(&overflow_list->lock);
+ overflow_list->gen++;
+ list_add(&ctx->backup_slot.node, &overflow_list->head);
+ ctx->backup_slot.cpu = smp_processor_id();
+ raw_spin_unlock(&overflow_list->lock);
+ return slot;
+}
+EXPORT_SYMBOL_GPL(hazptr_chain_backup_slot);
+
+void hazptr_unchain_backup_slot(struct hazptr_ctx *ctx)
+{
+ struct overflow_list *overflow_list = per_cpu_ptr(&percpu_overflow_list, ctx->backup_slot.cpu);
+
+ raw_spin_lock(&overflow_list->lock);
+ overflow_list->gen++;
+ list_del(&ctx->backup_slot.node);
+ raw_spin_unlock(&overflow_list->lock);
+}
+EXPORT_SYMBOL_GPL(hazptr_unchain_backup_slot);
+
+void __init hazptr_init(void)
+{
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ struct overflow_list *overflow_list = per_cpu_ptr(&percpu_overflow_list, cpu);
+
+ raw_spin_lock_init(&overflow_list->lock);
+ INIT_LIST_HEAD(&overflow_list->head);
+ }
+}
--
2.39.5
On 12/18/2025 12:54 PM, Mathieu Desnoyers wrote:
[..]
>>> <mathieu.desnoyers@efficios.com> wrote:
[..]
>>> Here is a revisited version of my Hazard Pointers series. Boqun, Joel,
>>> if you guys have time to try it out with your use-cases it would be
>>> great!
>>>
>>> This new version does the following:
>>>
>>> - It has 8 preallocated hazard pointer slots per CPU (one cache line),
>>> - The hazard pointer user allocates a hazard pointer context variable
>>> (typically on the stack), which contains the pointer to the slot *and*
>>> a backup slot,
>>> - When all the per-CPU slots are in use, fallback to the backup slot.
>>> Chain the backup slot into per-CPU lists, each protected by a raw
>>> spinlock.
>>> - The hazard pointer synchronize does a piecewise iteration on the
>>> per-CPU overflow slots lists, releasing the raw spinlock between
>>> each list item. It uses a 64-bit generation counter to check for
>>> concurrent list changes, and restart the traversal on generation
>>> counter mismatch.
>>> - There is a new CONFIG_PREEMPT_HAZPTR config option. When enabled,
>>> the hazard pointer acquire/release adds and then removes the hazard
>>> pointer context from a per-task linked list. On context switch, the
>>> scheduler migrates the per-CPU slots used by the task to the backup
>>> per-context slots, thus making sure the per-CPU slots are not used
>>> by preempted and blocked tasks.
>>
>> This last point is another reason why I want the slots to be per task instead
>> of per CPU. It becomes very natural because the hazard pointer is always
>> associated with a task only anyway, not with the CPU (at usecase level). By
>> putting the slot in the task struct, we allow these requirements to flow
>> naturally without requiring any locking or list management.. Did I miss
>> something about the use cases?
>
> I start from the hypothesis that scanning all tasks at synchronize
> is likely too slow for a general purpose synchronization algo.
>
> This is why we have RCU (general purpose) tracking hierarchical per-cpu
> state, and specialized RCU flavors (for tracing and other use-cases)
> using iteration on all tasks.
>
I wouldn't call HP "general" personally; it could be (should be?) use-case
specific. That's why I am not opposed to the per-cpu approach, but I am not
convinced that there is no room for a per-task slot approach, which is simpler
in many respects in the code.
If a use case has 1000s of CPUs but few tasks running, then per-task would be
way faster. It would be even cheaper than per-cpu slots on the read side:
1. It is unlikely to overflow, relatively speaking, since plenty of slots are
distributed across tasks, thus avoiding locking.
2. Preemption also doesn't require saving/restoring and locking.
I would look at per-task HP as a more specific HP implementation, if you will,
and per-cpu as leaning more towards "general". There are tradeoffs of course.
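As a rough sketch of the shape I have in mind (purely hypothetical layout,
not from any posted patch):

	#define NR_HAZPTR_PERTASK_SLOTS 8

	struct task_hazptr {
		/* Slots scanned by the update side. */
		void *slots[NR_HAZPTR_PERTASK_SLOTS];
		/* Number of active slots; updaters skip the task when zero. */
		unsigned int nr_active;
	};

	/* Embedded in struct task_struct, e.g. "struct task_hazptr hazptr;". */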
> One of the highlights of hazard pointers over RCU is its ability to
> know about quiescence of a pointer without waiting for a whole grace
> period. Therefore I went down the route of scanning per-CPU state rather
> than per-task.
Sure, but there could be room for both flavors, no?
>> I did some measurements about the task-scanning issue, and it is fast in my
>> testing (~1ms/10000 tasks). Any input from you or anyone on what the typical
>> task count distribution is that we are addressing? I also made a rough
>> prototype, and it appears to be simpler with fewer lines of code because I do
>> not need to handle preemption. It just happens naturally.
>
> It really depends on the access pattern here. If you want to replace
> refcount decrement and test with hazard pointer synchronize, a delay of
> 1ms on each hazptr synchronize is a *lot*.
Yeah, true, but I don't think *that* is a fair comparison. The slow path is
always going to be slower than an atomic refcount no matter which HP approach
we talk about. I think percpu_ref_kill() is a better comparison, which
triggers a GP:
percpu_ref_kill()
- percpu_ref_kill_and_confirm()
...
- call_rcu_hurry(&ref->data->rcu, percpu_ref_switch_to_atomic_rcu)
> Of course perhaps this can be batched with a worker thread, but then
> you are back to using more memory due to delayed reclaim, and thus
> closer to RCU.
If you have millions of objects with per-cpu ref counts, that can go into the
megabytes. With HP, you can negate that. So even with a batched worker thread,
there can be space savings and a fast read side (at the expense of slightly
more overhead on the write side). Sure, the lockdep use case does not benefit,
since Boqun wants to speed up the write side in that case
(synchronize_rcu_expedited), but we probably don't want to discount other
potential HP use cases which want a faster, simpler read side while having
preemptibility, small structures, etc., and don't care about update-side
performance.
>> First of all, we can have a per-task counter that tracks how many hazard
>> pointers are active. If this is zero, then we can simply skip the task instead
>> of wasting cycles scanning all the task's slots.
>> Further, we can have a retire list that reuses a single scan to scan all the
>> objects in the retire list, thus reusing the scan cost. This can also assist
>> in asynchronously implementing object retiring via a dedicated thread perhaps
>> with the tasks RCU infrastructure. We can also make this per-task counter a
>> bitmap to speed up scanning potentially.
>>
>> I am okay with the concept of an overflow list, but if we keep the overflow
>> list at the per-task level instead of the per-CPU level, it is highly unlikely
>> IMO that such an overflow list will be used unless more than, say, eight
>> hazard pointers per task are active at any given time.
>
> The PREEMPT_HAZPTR scheme achieves this as well, with per-CPU scan
> rather than per-task.
Sure, but with additional complexity and locking that the per-task slot
approach does not have at all.
>
>> So its lock contention would be rarer than, say, having a per-CPU overflow
>> list. I would say that contention would be incredibly rare because typically
>> hazard pointers are used by multiple tasks, each of which will have its own
>> unique set of slots. Whereas in a per-CPU overflow approach, we have a higher
>> chance of lock contention, especially when the number of CPUs is low.
>
> Contention on the per-CPU spinlocks would only happen if threads are
> migrated between chaining of their backup slot (either if 8 hazptr
> are in use by the thread, or on context switch), and release. Do
> you expect this type of migration to happen so often as to increase
> contention ?
Is that really true? You also have contention when you exhaust all per-CPU
slots, not just on migration:
+ /*
+ * If all the per-CPU slots are already in use, fallback
+ * to the backup slot.
+ */
+ if (unlikely(!slot))
+ slot = hazptr_chain_backup_slot(ctx);
>
>>
>> Other than the task-scanning performance issue, what am I missing?
>
> Task-scan performance is the key issue. Also you have to set a limit
> of per-task slots. How would you handle the scenario where a user
> needs more slots than the limit ?
Sure, there is a limit, but I wager that this limit is harder to hit than in
the per-cpu case. If you have a smaller number of CPUs, you might be hitting
this all the time with the per-cpu approach. I'd have an overflow node in
task_struct to handle this. Billions of Linux systems have 8-16 CPUs.
>> Another nice benefit of using per-task hazard pointers is that we can also
>> implement sleeping in hazard pointer sections because we will be scanning for
>> sleeping tasks as well.
>
> The PREEMPT_HAZPTR scheme takes care of this as well. Sleeping with a
> hazptr held will trigger the context switch, and the scheduler will
> simply move the hazard pointer from the per-CPU slot to the backup
> slot. End of story.
Yeah, but the per-task approach does not need that at all. In fact the hazard
pointer context structure is smaller.
>> By contrast, the other approaches I have seen with per-CPU hazard pointers
>> forbid sleeping, since after sleeping a task is no longer associated with its
>> CPU. The other approaches also have a higher likelihood of locking due to
>> running out of slots.
>
> Except for PREEMPT_HAZPTR. :)
Ah, you mean because you make a copy of the slots into the task. But why not
just keep it in the task to begin with :-D (your point on task-scan latency is
valid though..).
>> Of course I am missing a use case, but I suspect we can find a per-CPU ref-
>> count use case that benefits from this. I am researching use cases when I get
>> time. I think my next task is to find a solid use case for this
>> before doing further development of a solution..
>
> Let me know what you find, I'm very interested!
Sure :) will do, thanks.
>> By the way, feedback on the scanning patch:
>>
>> Can you consider using a per-CPU counter to track the number of active slots
>> per CPU? That way you can ignore CPU slots for CPUs that are not using hazard
>> pointers. Another idea is to skip idle CPUs as well.
>
> I had this in a userspace prototype: a counter of used per-cpu slots.
> But I finally decided against it because it requires extra instructions
> on the fast path, *and* scanning 8 pointers within a single cache line
> on synchronize is really cheap. So I'd need benchmarks to justify adding
> back the counter.
Interesting, makes sense.
>> Have you also considered any asynchronous use case where maintaining a
>> retired list would assist in RCU-style deferred reclaim of hazard-pointer
>> objects?
>
> This would be a logical next step indeed. I'll have to think
> about this some more.
>
You might be able to re-use some of the RCU async processing machinery for
that, or maybe use timers/workqueues.
Thanks.
On Wed, Dec 17, 2025 at 08:45:30PM -0500, Mathieu Desnoyers wrote:
[...]
> +static inline
> +struct hazptr_slot *hazptr_get_free_percpu_slot(void)
> +{
> + struct hazptr_percpu_slots *percpu_slots = this_cpu_ptr(&hazptr_percpu_slots);
> + unsigned int idx;
> +
> + for (idx = 0; idx < NR_HAZPTR_PERCPU_SLOTS; idx++) {
> + struct hazptr_slot *slot = &percpu_slots->slots[idx];
> +
> + if (!READ_ONCE(slot->addr))
> + return slot;
> + }
> + /* All slots are in use. */
> + return NULL;
> +}
> +
> +static inline
> +bool hazptr_slot_is_backup(struct hazptr_ctx *ctx, struct hazptr_slot *slot)
> +{
> + return slot == &ctx->backup_slot.slot;
> +}
> +
> +/*
> + * hazptr_acquire: Load pointer at address and protect with hazard pointer.
> + *
> + * Load @addr_p, and protect the loaded pointer with hazard pointer.
> + *
> + * Returns a non-NULL protected address if the loaded pointer is non-NULL.
> + * Returns NULL if the loaded pointer is NULL.
> + *
> + * On success the protected hazptr slot is stored in @ctx->slot.
> + */
> +static inline
> +void *hazptr_acquire(struct hazptr_ctx *ctx, void * const * addr_p)
> +{
> + struct hazptr_slot *slot = NULL;
> + void *addr, *addr2;
> +
> + /*
> + * Load @addr_p to know which address should be protected.
> + */
> + addr = READ_ONCE(*addr_p);
> + for (;;) {
> + if (!addr)
> + return NULL;
> + guard(preempt)();
> + if (likely(!hazptr_slot_is_backup(ctx, slot))) {
> + slot = hazptr_get_free_percpu_slot();
I need to continue to share my concerns about this "allocating slot while
protecting" pattern. Here, realistically, we will go over a few of the
per-CPU hazard pointer slots *every time* instead of directly using a
pre-allocated hazard pointer slot. Could you utilize this[1] to see a
comparison of the reader-side performance against RCU/SRCU?
> + /*
> + * If all the per-CPU slots are already in use, fallback
> + * to the backup slot.
> + */
> + if (unlikely(!slot))
> + slot = hazptr_chain_backup_slot(ctx);
> + }
> + WRITE_ONCE(slot->addr, addr); /* Store B */
> +
> + /* Memory ordering: Store B before Load A. */
> + smp_mb();
> +
> + /*
> + * Re-load @addr_p after storing it to the hazard pointer slot.
> + */
> + addr2 = READ_ONCE(*addr_p); /* Load A */
> + if (likely(ptr_eq(addr2, addr)))
> + break;
> + /*
> + * If @addr_p content has changed since the first load,
> + * release the hazard pointer and try again.
> + */
> + WRITE_ONCE(slot->addr, NULL);
> + if (!addr2) {
> + if (hazptr_slot_is_backup(ctx, slot))
> + hazptr_unchain_backup_slot(ctx);
> + return NULL;
> + }
> + addr = addr2;
> + }
> + ctx->slot = slot;
> + /*
> + * Use addr2 loaded from the second READ_ONCE() to preserve
> + * address dependency ordering.
> + */
> + return addr2;
> +}
> +
> +/* Release the protected hazard pointer from @slot. */
> +static inline
> +void hazptr_release(struct hazptr_ctx *ctx, void *addr)
> +{
> + struct hazptr_slot *slot;
> +
> + if (!addr)
> + return;
> + slot = ctx->slot;
> + WARN_ON_ONCE(slot->addr != addr);
> + smp_store_release(&slot->addr, NULL);
> + if (unlikely(hazptr_slot_is_backup(ctx, slot)))
> + hazptr_unchain_backup_slot(ctx);
> +}
> +
> +void hazptr_init(void);
> +
> +#endif /* _LINUX_HAZPTR_H */
> diff --git a/init/main.c b/init/main.c
> index 07a3116811c5..858eaa87bde7 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -104,6 +104,7 @@
> #include <linux/pidfs.h>
> #include <linux/ptdump.h>
> #include <linux/time_namespace.h>
> +#include <linux/hazptr.h>
> #include <net/net_namespace.h>
>
> #include <asm/io.h>
> @@ -1002,6 +1003,7 @@ void start_kernel(void)
> workqueue_init_early();
>
> rcu_init();
> + hazptr_init();
> kvfree_rcu_init();
>
> /* Trace events are available after this */
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 9fe722305c9b..1178907fe0ea 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -7,7 +7,7 @@ obj-y = fork.o exec_domain.o panic.o \
> cpu.o exit.o softirq.o resource.o \
> sysctl.o capability.o ptrace.o user.o \
> signal.o sys.o umh.o workqueue.o pid.o task_work.o \
> - extable.o params.o \
> + extable.o params.o hazptr.o \
> kthread.o sys_ni.o nsproxy.o nstree.o nscommon.o \
> notifier.o ksysfs.o cred.o reboot.o \
> async.o range.o smpboot.o ucount.o regset.o ksyms_common.o
> diff --git a/kernel/hazptr.c b/kernel/hazptr.c
> new file mode 100644
> index 000000000000..2ec288bc1132
> --- /dev/null
> +++ b/kernel/hazptr.c
> @@ -0,0 +1,150 @@
> +// SPDX-FileCopyrightText: 2024 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> +//
> +// SPDX-License-Identifier: LGPL-2.1-or-later
> +
> +/*
> + * hazptr: Hazard Pointers
> + */
> +
> +#include <linux/hazptr.h>
> +#include <linux/percpu.h>
> +#include <linux/spinlock.h>
> +#include <linux/list.h>
> +#include <linux/export.h>
> +
> +struct overflow_list {
> + raw_spinlock_t lock; /* Lock protecting overflow list and list generation. */
> + struct list_head head; /* Overflow list head. */
> + uint64_t gen; /* Overflow list generation. */
> +};
> +
> +static DEFINE_PER_CPU(struct overflow_list, percpu_overflow_list);
> +
> +DEFINE_PER_CPU(struct hazptr_percpu_slots, hazptr_percpu_slots);
> +EXPORT_PER_CPU_SYMBOL_GPL(hazptr_percpu_slots);
> +
> +/*
> + * Perform piecewise iteration on overflow list waiting until "addr" is
> + * not present. Raw spinlock is released and taken between each list
> + * item and busy loop iteration. The overflow list generation is checked
> + * each time the lock is taken to validate that the list has not changed
> + * before resuming iteration or busy wait. If the generation has
> + * changed, retry the entire list traversal.
> + */
> +static
> +void hazptr_synchronize_overflow_list(struct overflow_list *overflow_list, void *addr)
> +{
> + struct hazptr_backup_slot *backup_slot;
> + uint64_t snapshot_gen;
> +
> + raw_spin_lock(&overflow_list->lock);
> +retry:
> + snapshot_gen = overflow_list->gen;
> + list_for_each_entry(backup_slot, &overflow_list->head, node) {
> + /* Busy-wait if node is found. */
> + while (smp_load_acquire(&backup_slot->slot.addr) == addr) { /* Load B */
> + raw_spin_unlock(&overflow_list->lock);
> + cpu_relax();
I think we should prioritize the scan thread solution [2] instead of
busy-waiting in hazard pointer updaters, because when we have multiple
hazard pointer usages we would want to consolidate the scans from the
updater side. If so, the whole ->gen can be avoided.
However, this ->gen idea does seem to resolve another issue for me. I'm
trying to make shazptr critical sections preemptible by using a per-task
backup slot (if you recall, this is your idea from the hallway
discussions we had during LPC 2024), and currently I could not make it
work because of the following sequence:
1. CPU 0 already has one pointer protected.
2. CPU 1 begins the updater scan, and it scans the list of preempted
hazard pointer readers, no reader.
3. CPU 0 does a context switch: it stores the current hazard pointer
value to the current task's ->hazard_slot (let's say the task is task
A), and adds it to the list of preempted hazard pointer readers.
4. CPU 0 clears its percpu hazptr_slots for the next task (B).
5. CPU 1 continues the updater scan, and it scans the percpu slot of
CPU 0, and finds no reader.
In this situation, the updater will miss a reader. But if we add a
generation snapshot at step 2 and a generation increment at step 3, I
think it'll work.
IMO, if we make this work, it's better than the current backup slot
mechanism, because we only need to acquire the lock if a context switch
happens. I will look into the implementation of this, and if I can get
it done, I will send it in my next version of shazptr. I mention it here
just to add this option to the discussion.
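To add a bit more detail, here is a rough sketch of the updater-side check
I have in mind (all names and helpers below are hypothetical, nothing is
from a posted shazptr patch):

	/*
	 * Hypothetical sketch: updater-side check that addr is no longer
	 * protected, using a generation counter that the context-switch
	 * path bumps when it moves a reader's per-CPU slot into the
	 * preempted-readers list.
	 */
	static unsigned long preempted_readers_gen;

	static bool shazptr_addr_quiesced(void *addr)
	{
		unsigned long gen;

	retry:
		gen = smp_load_acquire(&preempted_readers_gen);
		if (preempted_readers_list_contains(addr))	/* step 2 */
			return false;
		if (percpu_slots_contain(addr))			/* step 5 */
			return false;
		/*
		 * A context switch (step 3) may have moved a reader into
		 * the list after it was scanned; the generation bump
		 * catches that.
		 */
		smp_mb();
		if (READ_ONCE(preempted_readers_gen) != gen)
			goto retry;
		return true;
	}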
[1]: https://lore.kernel.org/lkml/20250625031101.12555-3-boqun.feng@gmail.com/
[2]: https://lore.kernel.org/lkml/20250625031101.12555-5-boqun.feng@gmail.com/
Regards,
Boqun
> + raw_spin_lock(&overflow_list->lock);
> + if (overflow_list->gen != snapshot_gen)
> + goto retry;
> + }
> + raw_spin_unlock(&overflow_list->lock);
> + /*
> + * Release raw spinlock, validate generation after
> + * re-acquiring the lock.
> + */
> + raw_spin_lock(&overflow_list->lock);
> + if (overflow_list->gen != snapshot_gen)
> + goto retry;
> + }
> + raw_spin_unlock(&overflow_list->lock);
> +}
> +
> +static
> +void hazptr_synchronize_cpu_slots(int cpu, void *addr)
> +{
> + struct hazptr_percpu_slots *percpu_slots = per_cpu_ptr(&hazptr_percpu_slots, cpu);
> + unsigned int idx;
> +
> + for (idx = 0; idx < NR_HAZPTR_PERCPU_SLOTS; idx++) {
> + struct hazptr_slot *slot = &percpu_slots->slots[idx];
> +
> + /* Busy-wait if node is found. */
> + smp_cond_load_acquire(&slot->addr, VAL != addr); /* Load B */
> + }
> +}
> +
> +/*
> + * hazptr_synchronize: Wait until @addr is released from all slots.
> + *
> + * Wait to observe that each slot contains a value that differs from
> + * @addr before returning.
> + * Should be called from preemptible context.
> + */
> +void hazptr_synchronize(void *addr)
> +{
> + int cpu;
> +
> + /*
> + * Busy-wait should only be done from preemptible context.
> + */
> + lockdep_assert_preemption_enabled();
> +
> + /*
> + * Store A precedes hazptr_scan(): it unpublishes addr (sets it to
> + * NULL or to a different value), and thus hides it from hazard
> + * pointer readers.
> + */
> + if (!addr)
> + return;
> + /* Memory ordering: Store A before Load B. */
> + smp_mb();
> + /* Scan all CPUs slots. */
> + for_each_possible_cpu(cpu) {
> + /* Scan CPU slots. */
> + hazptr_synchronize_cpu_slots(cpu, addr);
> + /* Scan backup slots in percpu overflow list. */
> + hazptr_synchronize_overflow_list(per_cpu_ptr(&percpu_overflow_list, cpu), addr);
> + }
> +}
> +EXPORT_SYMBOL_GPL(hazptr_synchronize);
> +
> +struct hazptr_slot *hazptr_chain_backup_slot(struct hazptr_ctx *ctx)
> +{
> + struct overflow_list *overflow_list = this_cpu_ptr(&percpu_overflow_list);
> + struct hazptr_slot *slot = &ctx->backup_slot.slot;
> +
> + slot->addr = NULL;
> +
> + raw_spin_lock(&overflow_list->lock);
> + overflow_list->gen++;
> + list_add(&ctx->backup_slot.node, &overflow_list->head);
> + ctx->backup_slot.cpu = smp_processor_id();
> + raw_spin_unlock(&overflow_list->lock);
> + return slot;
> +}
> +EXPORT_SYMBOL_GPL(hazptr_chain_backup_slot);
> +
> +void hazptr_unchain_backup_slot(struct hazptr_ctx *ctx)
> +{
> + struct overflow_list *overflow_list = per_cpu_ptr(&percpu_overflow_list, ctx->backup_slot.cpu);
> +
> + raw_spin_lock(&overflow_list->lock);
> + overflow_list->gen++;
> + list_del(&ctx->backup_slot.node);
> + raw_spin_unlock(&overflow_list->lock);
> +}
> +EXPORT_SYMBOL_GPL(hazptr_unchain_backup_slot);
> +
> +void __init hazptr_init(void)
> +{
> + int cpu;
> +
> + for_each_possible_cpu(cpu) {
> + struct overflow_list *overflow_list = per_cpu_ptr(&percpu_overflow_list, cpu);
> +
> + raw_spin_lock_init(&overflow_list->lock);
> + INIT_LIST_HEAD(&overflow_list->head);
> + }
> +}
> --
> 2.39.5
>
On 2025-12-18 03:36, Boqun Feng wrote:
> On Wed, Dec 17, 2025 at 08:45:30PM -0500, Mathieu Desnoyers wrote:
[...]
>> +static inline
>> +void *hazptr_acquire(struct hazptr_ctx *ctx, void * const * addr_p)
>> +{
>> + struct hazptr_slot *slot = NULL;
>> + void *addr, *addr2;
>> +
>> + /*
>> + * Load @addr_p to know which address should be protected.
>> + */
>> + addr = READ_ONCE(*addr_p);
>> + for (;;) {
>> + if (!addr)
>> + return NULL;
>> + guard(preempt)();
>> + if (likely(!hazptr_slot_is_backup(ctx, slot))) {
>> + slot = hazptr_get_free_percpu_slot();
>
> I need to continue share my concerns about this "allocating slot while
> protecting" pattern. Here realistically, we will go over a few of the
> per-CPU hazard pointer slots *every time* instead of directly using a
> pre-allocated hazard pointer slot.
No, that's not the expected fast-path with CONFIG_PREEMPT_HAZPTR=y
(introduced in patch 4/4).
With PREEMPT_HAZPTR, using more than one hazard pointer per CPU will
only happen if there are nested hazard pointer users, which can happen
due to:
- Holding a hazard pointer across function calls, where another hazard
pointer is used.
- Using hazard pointers from interrupt handlers (note: my current code
only does preempt disable, not irq disable, this is something I'd need
to change if we wish to acquire/release hazard pointers from interrupt
handlers). But even that should be a rare event.
So the fast-path has an initial state where there are no hazard pointers
in use on the CPU, which means hazptr_acquire() finds its empty slot at
index 0.
> Could you utilize this[1] to see a
> comparison of the reader-side performance against RCU/SRCU?
Good point ! Let's see.
On a AMD 2x EPYC 9654 96-Core Processor with 192 cores,
hyperthreading disabled,
CONFIG_PREEMPT=y,
CONFIG_PREEMPT_RCU=y,
CONFIG_PREEMPT_HAZPTR=y.
scale_type             ns
-----------------------
hazptr-smp-mb        13.1 <- this implementation
hazptr-barrier       11.5 <- replace smp_mb() on acquire with barrier(), requires IPIs on synchronize.
hazptr-smp-mb-hlist  12.7 <- replace per-task hp context and per-cpu overflow lists by hlist.
rcu                  17.0
srcu                 20.0
srcu-fast             1.5
rcu-tasks             0.0
rcu-trace             1.7
refcnt             1148.0
rwlock             1190.0
rwsem              4199.3
lock              41070.6
lock-irq          46176.3
acqrel                1.1
So only srcu-fast, rcu-tasks, rcu-trace and a plain acqrel
appear to beat hazptr read-side performance.
[...]
>> +/*
>> + * Perform piecewise iteration on overflow list waiting until "addr" is
>> + * not present. Raw spinlock is released and taken between each list
>> + * item and busy loop iteration. The overflow list generation is checked
>> + * each time the lock is taken to validate that the list has not changed
>> + * before resuming iteration or busy wait. If the generation has
>> + * changed, retry the entire list traversal.
>> + */
>> +static
>> +void hazptr_synchronize_overflow_list(struct overflow_list *overflow_list, void *addr)
>> +{
>> + struct hazptr_backup_slot *backup_slot;
>> + uint64_t snapshot_gen;
>> +
>> + raw_spin_lock(&overflow_list->lock);
>> +retry:
>> + snapshot_gen = overflow_list->gen;
>> + list_for_each_entry(backup_slot, &overflow_list->head, node) {
>> + /* Busy-wait if node is found. */
>> + while (smp_load_acquire(&backup_slot->slot.addr) == addr) { /* Load B */
>> + raw_spin_unlock(&overflow_list->lock);
>> + cpu_relax();
>
> I think we should prioritize the scan thread solution [2] instead of
> busy waiting hazrd pointer updaters, because when we have multiple
> hazard pointer usages we would want to consolidate the scans from
> updater side.
I agree that batching scans with a worker thread is a logical next step.
> If so, the whole ->gen can be avoided.
How would it allow removing the generation trick without causing long
raw spinlock latencies ?
>
> However this ->gen idea does seem ot resolve another issue for me, I'm
> trying to make shazptr critical section preemptive by using a per-task
> backup slot (if you recall, this is your idea from the hallway
> discussions we had during LPC 2024),
I honestly did not remember. It's been a whole year! ;-)
> and currently I could not make it
> work because the following sequeue:
>
> 1. CPU 0 already has one pointer protected.
>
> 2. CPU 1 begins the updater scan, and it scans the list of preempted
> hazard pointer readers, no reader.
>
> 3. CPU 0 does a context switch, it stores the current hazard pointer
> value to the current task's ->hazard_slot (let's say the task is task
> A), and add it to the list of preempted hazard pointer readers.
>
> 4. CPU 0 clears its percpu hazptr_slots for the next task (B).
>
> 5. CPU 1 continues the updater scan, and it scans the percpu slot of
> CPU 0, and finds no reader.
>
> in this situation, updater will miss a reader. But if we add a
> generation snapshotting at step 2 and generation increment at step 3, I
> think it'll work.
>
> IMO, if we make this work, it's better than the current backup slot
> mechanism IMO, because we only need to acquire the lock if context
> switch happens.
With PREEMPT_HAZPTR we also only need to acquire the per-cpu overflow
list raw spinlock on context switch (preemption or blocking). The only
other case requiring it is hazptr nested usage (more than 8 active
hazptr) on a thread context + nested irqs.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On Thu, Dec 18, 2025 at 12:35:18PM -0500, Mathieu Desnoyers wrote:
[...]
> > Could you utilize this[1] to see a
> > comparison of the reader-side performance against RCU/SRCU?
>
> Good point ! Let's see.
>
> On a AMD 2x EPYC 9654 96-Core Processor with 192 cores,
> hyperthreading disabled,
> CONFIG_PREEMPT=y,
> CONFIG_PREEMPT_RCU=y,
> CONFIG_PREEMPT_HAZPTR=y.
>
> scale_type             ns
> -----------------------
> hazptr-smp-mb        13.1 <- this implementation
> hazptr-barrier       11.5 <- replace smp_mb() on acquire with barrier(), requires IPIs on synchronize.
> hazptr-smp-mb-hlist  12.7 <- replace per-task hp context and per-cpu overflow lists by hlist.
> rcu                  17.0

Hmm.. now looking back, how is it possible that hazptr is faster than
RCU on the reader-side? Because a grace period was happening and
triggered rcu_read_unlock_special()? This is actually more interesting.

Regards,
Boqun

> srcu                 20.0
> srcu-fast             1.5
> rcu-tasks             0.0
> rcu-trace             1.7
> refcnt             1148.0
> rwlock             1190.0
> rwsem              4199.3
> lock              41070.6
> lock-irq          46176.3
> acqrel                1.1
>
> So only srcu-fast, rcu-tasks, rcu-trace and a plain acqrel
> appear to beat hazptr read-side performance.
>
> [...]
>
[...]
On 2025-12-18 19:43, Boqun Feng wrote:
> On Thu, Dec 18, 2025 at 12:35:18PM -0500, Mathieu Desnoyers wrote:
> [...]
>>> Could you utilize this[1] to see a
>>> comparison of the reader-side performance against RCU/SRCU?
>>
>> Good point ! Let's see.
>>
>> On a AMD 2x EPYC 9654 96-Core Processor with 192 cores,
>> hyperthreading disabled,
>> CONFIG_PREEMPT=y,
>> CONFIG_PREEMPT_RCU=y,
>> CONFIG_PREEMPT_HAZPTR=y.
>>
>> scale_type ns
>> -----------------------
>> hazptr-smp-mb 13.1 <- this implementation
>> hazptr-barrier 11.5 <- replace smp_mb() on acquire with barrier(), requires IPIs on synchronize.
>> hazptr-smp-mb-hlist 12.7 <- replace per-task hp context and per-cpu overflow lists by hlist.
>> rcu 17.0
>
> Hmm.. now looking back, how is it possible that hazptr is faster than
> RCU on the reader-side? Because a grace period was happening and
> triggered rcu_read_unlock_special()? This is actualy more interesting.
So I could be entirely misreading the code, but, we have:
rcu_flavor_sched_clock_irq():
[...]
/* If GP is oldish, ask for help from rcu_read_unlock_special(). */
if (rcu_preempt_depth() > 0 &&
__this_cpu_read(rcu_data.core_needs_qs) &&
__this_cpu_read(rcu_data.cpu_no_qs.b.norm) &&
!t->rcu_read_unlock_special.b.need_qs &&
time_after(jiffies, rcu_state.gp_start + HZ))
t->rcu_read_unlock_special.b.need_qs = true;
which means we set need_qs = true as a result from observing
cpu_no_qs.b.norm == true.
This is sufficient to trigger calls (plural) to rcu_read_unlock_special()
from __rcu_read_unlock.
But then if we look at rcu_preempt_deferred_qs_irqrestore()
which we would expect to clear the rcu_read_unlock_special.b.need_qs
state, we have this:
special = t->rcu_read_unlock_special;
if (!special.s && !rdp->cpu_no_qs.b.exp) {
local_irq_restore(flags);
return;
}
t->rcu_read_unlock_special.s = 0;
which skips over clearing the state unless there is an expedited
grace period required.
So unless I'm missing something, we should _also_ clear that state
when it's invoked after rcu_flavor_sched_clock_irq, so the next
__rcu_read_unlock won't all call into rcu_read_unlock_special().
I'm adding a big warning about sleep deprivation and possibly
misunderstanding the whole thing. What am I missing ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Le Fri, Dec 19, 2025 at 09:22:19AM -0500, Mathieu Desnoyers a écrit :
> On 2025-12-18 19:43, Boqun Feng wrote:
> > On Thu, Dec 18, 2025 at 12:35:18PM -0500, Mathieu Desnoyers wrote:
> > [...]
> > > > Could you utilize this[1] to see a
> > > > comparison of the reader-side performance against RCU/SRCU?
> > >
> > > Good point ! Let's see.
> > >
> > > On a AMD 2x EPYC 9654 96-Core Processor with 192 cores,
> > > hyperthreading disabled,
> > > CONFIG_PREEMPT=y,
> > > CONFIG_PREEMPT_RCU=y,
> > > CONFIG_PREEMPT_HAZPTR=y.
> > >
> > > scale_type ns
> > > -----------------------
> > > hazptr-smp-mb 13.1 <- this implementation
> > > hazptr-barrier 11.5 <- replace smp_mb() on acquire with barrier(), requires IPIs on synchronize.
> > > hazptr-smp-mb-hlist 12.7 <- replace per-task hp context and per-cpu overflow lists by hlist.
> > > rcu 17.0
> >
> > Hmm.. now looking back, how is it possible that hazptr is faster than
> > RCU on the reader-side? Because a grace period was happening and
> > triggered rcu_read_unlock_special()? This is actualy more interesting.
> So I could be entirely misreading the code, but, we have:
>
> rcu_flavor_sched_clock_irq():
> [...]
> /* If GP is oldish, ask for help from rcu_read_unlock_special(). */
> if (rcu_preempt_depth() > 0 &&
> __this_cpu_read(rcu_data.core_needs_qs) &&
> __this_cpu_read(rcu_data.cpu_no_qs.b.norm) &&
> !t->rcu_read_unlock_special.b.need_qs &&
> time_after(jiffies, rcu_state.gp_start + HZ))
> t->rcu_read_unlock_special.b.need_qs = true;
>
> which means we set need_qs = true as a result from observing
> cpu_no_qs.b.norm == true.
>
> This is sufficient to trigger calls (plural) to rcu_read_unlock_special()
> from __rcu_read_unlock.
>
> But then if we look at rcu_preempt_deferred_qs_irqrestore()
> which we would expect to clear the rcu_read_unlock_special.b.need_qs
> state, we have this:
>
> special = t->rcu_read_unlock_special;
> if (!special.s && !rdp->cpu_no_qs.b.exp) {
> local_irq_restore(flags);
> return;
> }
> t->rcu_read_unlock_special.s = 0;
>
> which skips over clearing the state unless there is an expedited
> grace period required.
>
> So unless I'm missing something, we should _also_ clear that state
> when it's invoked after rcu_flavor_sched_clock_irq, so the next
> __rcu_read_unlock won't all call into rcu_read_unlock_special().
>
> I'm adding a big warning about sleep deprivation and possibly
> misunderstanding the whole thing. What am I missing ?
As far as I can tell, this skips clearing the state if the state is
already cleared. Or am I even more sleep deprived than you? :o)
Thanks.
--
Frederic Weisbecker
SUSE Labs
On Thu, Jan 08, 2026 at 05:34:34PM +0100, Frederic Weisbecker wrote:
> Le Fri, Dec 19, 2025 at 09:22:19AM -0500, Mathieu Desnoyers a écrit :
> > On 2025-12-18 19:43, Boqun Feng wrote:
> > > On Thu, Dec 18, 2025 at 12:35:18PM -0500, Mathieu Desnoyers wrote:
> > > [...]
> > > > > Could you utilize this[1] to see a
> > > > > comparison of the reader-side performance against RCU/SRCU?
> > > >
> > > > Good point ! Let's see.
> > > >
> > > > On a AMD 2x EPYC 9654 96-Core Processor with 192 cores,
> > > > hyperthreading disabled,
> > > > CONFIG_PREEMPT=y,
> > > > CONFIG_PREEMPT_RCU=y,
> > > > CONFIG_PREEMPT_HAZPTR=y.
> > > >
> > > > scale_type ns
> > > > -----------------------
> > > > hazptr-smp-mb 13.1 <- this implementation
> > > > hazptr-barrier 11.5 <- replace smp_mb() on acquire with barrier(), requires IPIs on synchronize.
> > > > hazptr-smp-mb-hlist 12.7 <- replace per-task hp context and per-cpu overflow lists by hlist.
> > > > rcu 17.0
> > >
> > > Hmm.. now looking back, how is it possible that hazptr is faster than
> > > RCU on the reader-side? Because a grace period was happening and
> > > triggered rcu_read_unlock_special()? This is actualy more interesting.
> > So I could be entirely misreading the code, but, we have:
> >
> > rcu_flavor_sched_clock_irq():
> > [...]
> > /* If GP is oldish, ask for help from rcu_read_unlock_special(). */
> > if (rcu_preempt_depth() > 0 &&
> > __this_cpu_read(rcu_data.core_needs_qs) &&
> > __this_cpu_read(rcu_data.cpu_no_qs.b.norm) &&
> > !t->rcu_read_unlock_special.b.need_qs &&
> > time_after(jiffies, rcu_state.gp_start + HZ))
> > t->rcu_read_unlock_special.b.need_qs = true;
> >
> > which means we set need_qs = true as a result from observing
> > cpu_no_qs.b.norm == true.
> >
> > This is sufficient to trigger calls (plural) to rcu_read_unlock_special()
> > from __rcu_read_unlock.
> >
> > But then if we look at rcu_preempt_deferred_qs_irqrestore()
> > which we would expect to clear the rcu_read_unlock_special.b.need_qs
> > state, we have this:
> >
> > special = t->rcu_read_unlock_special;
> > if (!special.s && !rdp->cpu_no_qs.b.exp) {
> > local_irq_restore(flags);
> > return;
> > }
> > t->rcu_read_unlock_special.s = 0;
> >
> > which skips over clearing the state unless there is an expedited
> > grace period required.
> >
> > So unless I'm missing something, we should _also_ clear that state
> > when it's invoked after rcu_flavor_sched_clock_irq, so the next
> > __rcu_read_unlock won't all call into rcu_read_unlock_special().
> >
> > I'm adding a big warning about sleep deprivation and possibly
> > misunderstanding the whole thing. What am I missing ?
>
> As far as I can tell, this skips clearing the state if the state is
> already cleared. Or am I even more sleep deprived than you? :o)
Get some sleep! A good night's sleep is one of the best debugging aids
available, even in this brave new world of LLMs. ;-)
Thanx, Paul
On 2026-01-08 11:34, Frederic Weisbecker wrote:
> Le Fri, Dec 19, 2025 at 09:22:19AM -0500, Mathieu Desnoyers a écrit :
>> On 2025-12-18 19:43, Boqun Feng wrote:
>>> On Thu, Dec 18, 2025 at 12:35:18PM -0500, Mathieu Desnoyers wrote:
>>> [...]
>>>>> Could you utilize this[1] to see a
>>>>> comparison of the reader-side performance against RCU/SRCU?
>>>>
>>>> Good point ! Let's see.
>>>>
>>>> On a AMD 2x EPYC 9654 96-Core Processor with 192 cores,
>>>> hyperthreading disabled,
>>>> CONFIG_PREEMPT=y,
>>>> CONFIG_PREEMPT_RCU=y,
>>>> CONFIG_PREEMPT_HAZPTR=y.
>>>>
>>>> scale_type ns
>>>> -----------------------
>>>> hazptr-smp-mb 13.1 <- this implementation
>>>> hazptr-barrier 11.5 <- replace smp_mb() on acquire with barrier(), requires IPIs on synchronize.
>>>> hazptr-smp-mb-hlist 12.7 <- replace per-task hp context and per-cpu overflow lists by hlist.
>>>> rcu 17.0
>>>
>>> Hmm.. now looking back, how is it possible that hazptr is faster than
>>> RCU on the reader-side? Because a grace period was happening and
>>> triggered rcu_read_unlock_special()? This is actualy more interesting.
>> So I could be entirely misreading the code, but, we have:
>>
>> rcu_flavor_sched_clock_irq():
>> [...]
>> /* If GP is oldish, ask for help from rcu_read_unlock_special(). */
>> if (rcu_preempt_depth() > 0 &&
>> __this_cpu_read(rcu_data.core_needs_qs) &&
>> __this_cpu_read(rcu_data.cpu_no_qs.b.norm) &&
>> !t->rcu_read_unlock_special.b.need_qs &&
>> time_after(jiffies, rcu_state.gp_start + HZ))
>> t->rcu_read_unlock_special.b.need_qs = true;
>>
>> which means we set need_qs = true as a result from observing
>> cpu_no_qs.b.norm == true.
>>
>> This is sufficient to trigger calls (plural) to rcu_read_unlock_special()
>> from __rcu_read_unlock.
>>
>> But then if we look at rcu_preempt_deferred_qs_irqrestore()
>> which we would expect to clear the rcu_read_unlock_special.b.need_qs
>> state, we have this:
>>
>> special = t->rcu_read_unlock_special;
>> if (!special.s && !rdp->cpu_no_qs.b.exp) {
>> local_irq_restore(flags);
>> return;
>> }
>> t->rcu_read_unlock_special.s = 0;
>>
>> which skips over clearing the state unless there is an expedited
>> grace period required.
>>
>> So unless I'm missing something, we should _also_ clear that state
>> when it's invoked after rcu_flavor_sched_clock_irq, so the next
>> __rcu_read_unlock won't all call into rcu_read_unlock_special().
>>
>> I'm adding a big warning about sleep deprivation and possibly
>> misunderstanding the whole thing. What am I missing ?
>
> As far as I can tell, this skips clearing the state if the state is
> already cleared. Or am I even more sleep deprived than you? :o)
No, you are right. The (!x && !y) pattern confused me, but the
code is correct. Good thing I've put a warning about sleep
deprivation. ;-)
Sorry for the noise.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On 1/8/2026 11:45 AM, Mathieu Desnoyers wrote:
> On 2026-01-08 11:34, Frederic Weisbecker wrote:
>> Le Fri, Dec 19, 2025 at 09:22:19AM -0500, Mathieu Desnoyers a écrit :
>>> On 2025-12-18 19:43, Boqun Feng wrote:
>>>> On Thu, Dec 18, 2025 at 12:35:18PM -0500, Mathieu Desnoyers wrote:
>>>> [...]
>>>>>> Could you utilize this[1] to see a
>>>>>> comparison of the reader-side performance against RCU/SRCU?
>>>>>
>>>>> Good point ! Let's see.
>>>>>
>>>>> On a AMD 2x EPYC 9654 96-Core Processor with 192 cores,
>>>>> hyperthreading disabled,
>>>>> CONFIG_PREEMPT=y,
>>>>> CONFIG_PREEMPT_RCU=y,
>>>>> CONFIG_PREEMPT_HAZPTR=y.
>>>>>
>>>>> scale_type ns
>>>>> -----------------------
>>>>> hazptr-smp-mb 13.1 <- this implementation
>>>>> hazptr-barrier 11.5 <- replace smp_mb() on acquire with
>>>>> barrier(), requires IPIs on synchronize.
>>>>> hazptr-smp-mb-hlist 12.7 <- replace per-task hp context and per-cpu
>>>>> overflow lists by hlist.
>>>>> rcu 17.0
>>>>
>>>> Hmm.. now looking back, how is it possible that hazptr is faster than
>>>> RCU on the reader-side? Because a grace period was happening and
>>>> triggered rcu_read_unlock_special()? This is actualy more interesting.
>>> So I could be entirely misreading the code, but, we have:
>>>
>>> rcu_flavor_sched_clock_irq():
>>> [...]
>>> /* If GP is oldish, ask for help from rcu_read_unlock_special(). */
>>> if (rcu_preempt_depth() > 0 &&
>>> __this_cpu_read(rcu_data.core_needs_qs) &&
>>> __this_cpu_read(rcu_data.cpu_no_qs.b.norm) &&
>>> !t->rcu_read_unlock_special.b.need_qs &&
>>> time_after(jiffies, rcu_state.gp_start + HZ))
>>> t->rcu_read_unlock_special.b.need_qs = true;
>>>
>>> which means we set need_qs = true as a result from observing
>>> cpu_no_qs.b.norm == true.
>>>
>>> This is sufficient to trigger calls (plural) to rcu_read_unlock_special()
>>> from __rcu_read_unlock.
>>>
>>> But then if we look at rcu_preempt_deferred_qs_irqrestore()
>>> which we would expect to clear the rcu_read_unlock_special.b.need_qs
>>> state, we have this:
>>>
>>> special = t->rcu_read_unlock_special;
>>> if (!special.s && !rdp->cpu_no_qs.b.exp) {
>>> local_irq_restore(flags);
>>> return;
>>> }
>>> t->rcu_read_unlock_special.s = 0;
>>>
>>> which skips over clearing the state unless there is an expedited
>>> grace period required.
>>>
>>> So unless I'm missing something, we should _also_ clear that state
>>> when it's invoked after rcu_flavor_sched_clock_irq, so the next
>>> __rcu_read_unlock won't all call into rcu_read_unlock_special().
>>>
>>> I'm adding a big warning about sleep deprivation and possibly
>>> misunderstanding the whole thing. What am I missing ?
>>
>> As far as I can tell, this skips clearing the state if the state is
>> already cleared. Or am I even more sleep deprived than you? :o)
>
> No, you are right. The (!x && !y) pattern confused me, but the
> code is correct. Good thing I've put a warning about sleep
> deprivation. ;-)
>
> Sorry for the noise.
Right, I think this can happen when, after a rcu_flavor_sched_clock_irq() sets
special.b.need_qs, another upcoming rcu_flavor_sched_clock_irq() races with
the reader's rcu_read_unlock() and interrupts rcu_read_unlock_special() before
it can disable interrupts.
rcu_read_unlock()
-> rcu_read_lock_nesting--;
-> nesting == 0 and special is set.
<interrupted by sched clock>
-> rcu_flavor_sched_clock_irq()
-> rcu_preempt_deferred_qs_irqrestore
-> clear b.special
<interrupt returned>
-> rcu_read_unlock_special()
-> local_irq_save(flags); // too late
-> rcu_preempt_deferred_qs_irqrestore
-> Early return.
thanks,
- Joel
On Thu, Dec 18, 2025 at 12:35:18PM -0500, Mathieu Desnoyers wrote:
> On 2025-12-18 03:36, Boqun Feng wrote:
> > On Wed, Dec 17, 2025 at 08:45:30PM -0500, Mathieu Desnoyers wrote:
> [...]
> > > +static inline
> > > +void *hazptr_acquire(struct hazptr_ctx *ctx, void * const * addr_p)
> > > +{
> > > + struct hazptr_slot *slot = NULL;
> > > + void *addr, *addr2;
> > > +
> > > + /*
> > > + * Load @addr_p to know which address should be protected.
> > > + */
> > > + addr = READ_ONCE(*addr_p);
> > > + for (;;) {
> > > + if (!addr)
> > > + return NULL;
> > > + guard(preempt)();
> > > + if (likely(!hazptr_slot_is_backup(ctx, slot))) {
> > > + slot = hazptr_get_free_percpu_slot();
> >
> > I need to continue share my concerns about this "allocating slot while
> > protecting" pattern. Here realistically, we will go over a few of the
> > per-CPU hazard pointer slots *every time* instead of directly using a
> > pre-allocated hazard pointer slot.
>
> No, that's not the expected fast-path with CONFIG_PREEMPT_HAZPTR=y
> (introduced in patch 4/4).
>
I see, I was missing patch #4, will take a look and reply
accordingly.
> With PREEMPT_HAZPTR, using more than one hazard pointer per CPU will
> only happen if there are nested hazard pointer users, which can happen
> due to:
>
> - Holding a hazard pointer across function calls, where another hazard
> pointer is used.
> - Using hazard pointers from interrupt handlers (note: my current code
> only does preempt disable, not irq disable, this is something I'd need
> to change if we wish to acquire/release hazard pointers from interrupt
> handlers). But even that should be a rare event.
>
> So the fast-path has an initial state where there are no hazard pointers
> in use on the CPU, which means hazptr_acquire() finds its empty slot at
> index 0.
>
> > Could you utilize this[1] to see a
> > comparison of the reader-side performance against RCU/SRCU?
>
> Good point ! Let's see.
>
> On a AMD 2x EPYC 9654 96-Core Processor with 192 cores,
> hyperthreading disabled,
> CONFIG_PREEMPT=y,
> CONFIG_PREEMPT_RCU=y,
> CONFIG_PREEMPT_HAZPTR=y.
>
> scale_type ns
> -----------------------
> hazptr-smp-mb 13.1 <- this implementation
> hazptr-barrier 11.5 <- replace smp_mb() on acquire with barrier(), requires IPIs on synchronize.
> hazptr-smp-mb-hlist 12.7 <- replace per-task hp context and per-cpu overflow lists by hlist.
> rcu 17.0
> srcu 20.0
> srcu-fast 1.5
> rcu-tasks 0.0
> rcu-trace 1.7
> refcnt 1148.0
> rwlock 1190.0
> rwsem 4199.3
> lock 41070.6
> lock-irq 46176.3
> acqrel 1.1
>
> So only srcu-fast, rcu-tasks, rcu-trace and a plain acqrel
> appear to beat hazptr read-side performance.
>
Could you also see the reader-side performance impact when the percpu
hazard pointer slots are used up? I.e. the worst case.
> [...]
>
> > > +/*
> > > + * Perform piecewise iteration on overflow list waiting until "addr" is
> > > + * not present. Raw spinlock is released and taken between each list
> > > + * item and busy loop iteration. The overflow list generation is checked
> > > + * each time the lock is taken to validate that the list has not changed
> > > + * before resuming iteration or busy wait. If the generation has
> > > + * changed, retry the entire list traversal.
> > > + */
> > > +static
> > > +void hazptr_synchronize_overflow_list(struct overflow_list *overflow_list, void *addr)
> > > +{
> > > + struct hazptr_backup_slot *backup_slot;
> > > + uint64_t snapshot_gen;
> > > +
> > > + raw_spin_lock(&overflow_list->lock);
> > > +retry:
> > > + snapshot_gen = overflow_list->gen;
> > > + list_for_each_entry(backup_slot, &overflow_list->head, node) {
> > > + /* Busy-wait if node is found. */
> > > + while (smp_load_acquire(&backup_slot->slot.addr) == addr) { /* Load B */
> > > + raw_spin_unlock(&overflow_list->lock);
> > > + cpu_relax();
> >
> > I think we should prioritize the scan thread solution [2] instead of
> > busy waiting hazrd pointer updaters, because when we have multiple
> > hazard pointer usages we would want to consolidate the scans from
> > updater side.
>
> I agree that batching scans with a worker thread is a logical next step.
>
> > If so, the whole ->gen can be avoided.
>
> How would it allow removing the generation trick without causing long
> raw spinlock latencies ?
>
Because we won't need to busy-wait for the readers to go away, we can
check whether they are still there in the next scan.
so:
list_for_each_entry(backup_slot, &overflow_list->head, node) {
/* Busy-wait if node is found. */
if (smp_load_acquire(&backup_slot->slot.addr) == addr) { /* Load B */
<mark addr as unable to free and move on>
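One possible concrete shape of this idea, as a sketch only (the _sketch name
and the return-value convention are mine, not from the patch), reusing the
overflow_list and hazptr_backup_slot types quoted above; note that it
intentionally does not busy-wait:

static bool hazptr_scan_overflow_list_sketch(struct overflow_list *overflow_list,
                                             void *addr)
{
        struct hazptr_backup_slot *backup_slot;
        bool still_protected = false;

        raw_spin_lock(&overflow_list->lock);
        list_for_each_entry(backup_slot, &overflow_list->head, node) {
                /* No busy-wait: just record that a reader still holds addr. */
                if (smp_load_acquire(&backup_slot->slot.addr) == addr) {
                        still_protected = true;
                        break;
                }
        }
        raw_spin_unlock(&overflow_list->lock);

        /* The scan thread keeps addr queued and re-checks on the next pass. */
        return still_protected;
}

The remaining concern with this simple form, raised in the reply below, is
that the whole list is still walked with the raw spinlock held.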
> >
> > However this ->gen idea does seem to resolve another issue for me, I'm
> > trying to make shazptr critical section preemptive by using a per-task
> > backup slot (if you recall, this is your idea from the hallway
> > discussions we had during LPC 2024),
>
> I honestly did not remember. It's been a whole year! ;-)
>
> > and currently I could not make it
> > work because of the following sequence:
> >
> > 1. CPU 0 already has one pointer protected.
> >
> > 2. CPU 1 begins the updater scan, and it scans the list of preempted
> > hazard pointer readers, no reader.
> >
> > 3. CPU 0 does a context switch, it stores the current hazard pointer
> > value to the current task's ->hazard_slot (let's say the task is task
> > A), and add it to the list of preempted hazard pointer readers.
> >
> > 4. CPU 0 clears its percpu hazptr_slots for the next task (B).
> >
> > 5. CPU 1 continues the updater scan, and it scans the percpu slot of
> > CPU 0, and finds no reader.
> >
> > in this situation, updater will miss a reader. But if we add a
> > generation snapshotting at step 2 and generation increment at step 3, I
> > think it'll work.
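A rough sketch of that generation idea (helper names are invented and memory
ordering is elided; this is not code from any posted patch): the context-switch
path bumps the generation only after publishing the preempted reader, and the
updater retries its scan whenever the generation moved under it.

static atomic64_t preempted_readers_gen = ATOMIC64_INIT(0);

/* Step 3 above: context switch out with an active hazard pointer. */
static void shazptr_switch_out_sketch(struct task_struct *t, void *addr)
{
        WRITE_ONCE(t->hazard_slot, addr);       /* per-task slot, as in step 3 */
        preempted_readers_list_add(t);          /* invented helper */
        atomic64_inc(&preempted_readers_gen);   /* increment after publishing */
}

/* Steps 2 and 5 above: updater-side scan. */
static bool shazptr_scan_sketch(void *addr)
{
        u64 snap;

        do {
                snap = atomic64_read(&preempted_readers_gen);
                if (preempted_readers_list_contains(addr) ||    /* invented helper */
                    percpu_slots_contain(addr))                 /* invented helper */
                        return false;   /* a reader still protects addr */
                /* A context switch raced with the scan: re-check. */
        } while (snap != atomic64_read(&preempted_readers_gen));

        return true;    /* no reader observed: addr can be reclaimed */
}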
> >
> > IMO, if we make this work, it's better than the current backup slot
> > mechanism, because we only need to acquire the lock if a context
> > switch happens.
>
> With PREEMPT_HAZPTR we also only need to acquire the per-cpu overflow
> list raw spinlock on context switch (preemption or blocking). The only
Indeed, pre-allocating the slot on the stack to save the percpu slot on
context switch seems easier and quite smart ;-) Let me take a look.
Regards,
Boqun
> other case requiring it is hazptr nested usage (more than 8 active
> hazptr) on a thread context + nested irqs.
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
On 2025-12-18 15:22, Boqun Feng wrote:
[...]
>>> Could you utilize this[1] to see a
>>> comparison of the reader-side performance against RCU/SRCU?
>>
>> Good point ! Let's see.
>>
>> On a AMD 2x EPYC 9654 96-Core Processor with 192 cores,
>> hyperthreading disabled,
>> CONFIG_PREEMPT=y,
>> CONFIG_PREEMPT_RCU=y,
>> CONFIG_PREEMPT_HAZPTR=y.
>>
>> scale_type ns
>> -----------------------
>> hazptr-smp-mb 13.1 <- this implementation
>> hazptr-barrier 11.5 <- replace smp_mb() on acquire with barrier(), requires IPIs on synchronize.
>> hazptr-smp-mb-hlist 12.7 <- replace per-task hp context and per-cpu overflow lists by hlist.
>> rcu 17.0
>> srcu 20.0
>> srcu-fast 1.5
>> rcu-tasks 0.0
>> rcu-trace 1.7
>> refcnt 1148.0
>> rwlock 1190.0
>> rwsem 4199.3
>> lock 41070.6
>> lock-irq 46176.3
>> acqrel 1.1
>>
>> So only srcu-fast, rcu-tasks, rcu-trace and a plain acqrel
>> appear to beat hazptr read-side performance.
>>
>
> Could you also see the reader-side performance impact when the percpu
> hazard pointer slots are used up? I.e. the worst case.
I've modified the code to populate "(void *)1UL" in the 7 first slots
at bootup, here is the result:
hazptr-smp-mb-7-fail 16.3 ns
So we go from 13.1 ns to 16.3 ns when all slots but one are used.
And if we pre-populate the 8 slots for each cpu, and thus force
fallback to overflow list:
hazptr-smp-mb-8-fail 67.1 ns
>
>> [...]
>>
>>>> +/*
>>>> + * Perform piecewise iteration on overflow list waiting until "addr" is
>>>> + * not present. Raw spinlock is released and taken between each list
>>>> + * item and busy loop iteration. The overflow list generation is checked
>>>> + * each time the lock is taken to validate that the list has not changed
>>>> + * before resuming iteration or busy wait. If the generation has
>>>> + * changed, retry the entire list traversal.
>>>> + */
>>>> +static
>>>> +void hazptr_synchronize_overflow_list(struct overflow_list *overflow_list, void *addr)
>>>> +{
>>>> + struct hazptr_backup_slot *backup_slot;
>>>> + uint64_t snapshot_gen;
>>>> +
>>>> + raw_spin_lock(&overflow_list->lock);
>>>> +retry:
>>>> + snapshot_gen = overflow_list->gen;
>>>> + list_for_each_entry(backup_slot, &overflow_list->head, node) {
>>>> + /* Busy-wait if node is found. */
>>>> + while (smp_load_acquire(&backup_slot->slot.addr) == addr) { /* Load B */
>>>> + raw_spin_unlock(&overflow_list->lock);
>>>> + cpu_relax();
>>>
>>> I think we should prioritize the scan thread solution [2] instead of
>>> busy waiting hazard pointer updaters, because when we have multiple
>>> hazard pointer usages we would want to consolidate the scans from
>>> updater side.
>>
>> I agree that batching scans with a worker thread is a logical next step.
>>
>>> If so, the whole ->gen can be avoided.
>>
>> How would it allow removing the generation trick without causing long
>> raw spinlock latencies ?
>>
>
> Because we won't need to busy-wait for the readers to go away, we can
> check whether they are still there in the next scan.
>
> so:
>
> list_for_each_entry(backup_slot, &overflow_list->head, node) {
> /* Busy-wait if node is found. */
> if (smp_load_acquire(&backup_slot->slot.addr) == addr) { /* Load B */
> <mark addr as unable to free and move on>
But then you still iterate on a possibly large list of overflow nodes,
with a raw spinlock held. That raw spinlock is taken by the scheduler
on context switch. This can cause very long scheduler latency.
So breaking up the iteration into pieces is not just to handle
busy-waiting, but also to make sure we don't increase the
system latency by holding a raw spinlock (taken with rq lock
held) for more than the little time needed to iterate to the next
node.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On Thu, Dec 18, 2025 at 06:36:00PM -0500, Mathieu Desnoyers wrote:
> On 2025-12-18 15:22, Boqun Feng wrote:
> [...]
> > > > Could you utilize this[1] to see a
> > > > comparison of the reader-side performance against RCU/SRCU?
> > >
> > > Good point ! Let's see.
> > >
> > > On a AMD 2x EPYC 9654 96-Core Processor with 192 cores,
> > > hyperthreading disabled,
> > > CONFIG_PREEMPT=y,
> > > CONFIG_PREEMPT_RCU=y,
> > > CONFIG_PREEMPT_HAZPTR=y.
> > >
> > > scale_type ns
> > > -----------------------
> > > hazptr-smp-mb 13.1 <- this implementation
> > > hazptr-barrier 11.5 <- replace smp_mb() on acquire with barrier(), requires IPIs on synchronize.
> > > hazptr-smp-mb-hlist 12.7 <- replace per-task hp context and per-cpu overflow lists by hlist.
> > > rcu 17.0
> > > srcu 20.0
> > > srcu-fast 1.5
> > > rcu-tasks 0.0
> > > rcu-trace 1.7
> > > refcnt 1148.0
> > > rwlock 1190.0
> > > rwsem 4199.3
> > > lock 41070.6
> > > lock-irq 46176.3
> > > acqrel 1.1
> > >
> > > So only srcu-fast, rcu-tasks, rcu-trace and a plain acqrel
> > > appear to beat hazptr read-side performance.
> > >
> >
> > Could you also see the reader-side performance impact when the percpu
> > hazard pointer slots are used up? I.e. the worst case.
>
> I've modified the code to populate "(void *)1UL" in the 7 first slots
> at bootup, here is the result:
>
> hazptr-smp-mb-7-fail 16.3 ns
>
> So we go from 13.1 ns to 16.3 ns when all slots but one are used.
>
> And if we pre-populate the 8 slots for each cpu, and thus force
> fallback to overflow list:
>
> hazptr-smp-mb-8-fail 67.1 ns
>
Thank you! So involving locking seems to hurt performance more than
the per-CPU/per-task operations do. This may suggest that enabling
PREEMPT_HAZPTR by default has acceptable performance.
> >
> > > [...]
> > >
> > > > > +/*
> > > > > + * Perform piecewise iteration on overflow list waiting until "addr" is
> > > > > + * not present. Raw spinlock is released and taken between each list
> > > > > + * item and busy loop iteration. The overflow list generation is checked
> > > > > + * each time the lock is taken to validate that the list has not changed
> > > > > + * before resuming iteration or busy wait. If the generation has
> > > > > + * changed, retry the entire list traversal.
> > > > > + */
> > > > > +static
> > > > > +void hazptr_synchronize_overflow_list(struct overflow_list *overflow_list, void *addr)
> > > > > +{
> > > > > + struct hazptr_backup_slot *backup_slot;
> > > > > + uint64_t snapshot_gen;
> > > > > +
> > > > > + raw_spin_lock(&overflow_list->lock);
> > > > > +retry:
> > > > > + snapshot_gen = overflow_list->gen;
> > > > > + list_for_each_entry(backup_slot, &overflow_list->head, node) {
> > > > > + /* Busy-wait if node is found. */
> > > > > + while (smp_load_acquire(&backup_slot->slot.addr) == addr) { /* Load B */
> > > > > + raw_spin_unlock(&overflow_list->lock);
> > > > > + cpu_relax();
> > > >
> > > > I think we should prioritize the scan thread solution [2] instead of
> > > > busy waiting hazard pointer updaters, because when we have multiple
> > > > hazard pointer usages we would want to consolidate the scans from
> > > > updater side.
> > >
> > > I agree that batching scans with a worker thread is a logical next step.
> > >
> > > > If so, the whole ->gen can be avoided.
> > >
> > > How would it allow removing the generation trick without causing long
> > > raw spinlock latencies ?
> > >
> >
> > Because we won't need to busy-wait for the readers to go away, we can
> > check whether they are still there in the next scan.
> >
> > so:
> >
> > list_for_each_entry(backup_slot, &overflow_list->head, node) {
> > /* Busy-wait if node is found. */
> > if (smp_load_acquire(&backup_slot->slot.addr) == addr) { /* Load B */
> > <mark addr as unable to free and move on>
>
> But then you still iterate on a possibly large list of overflow nodes,
> with a raw spinlock held. That raw spinlock is taken by the scheduler
> on context switch. This can cause very long scheduler latency.
>
That's fair.
> So breaking up the iteration into pieces is not just to handle
> busy-waiting, but also to make sure we don't increase the
> system latency by holding a raw spinlock (taken with rq lock
> held) for more than the little time needed to iterate to the next
> node.
>
I agree that it helps reduce the latency, but I feel like with a scan
thread in the picture (and we don't need to busy-wait), we should use
a forward-progress-guaranteed way in the updater side scan, which means
we may need to explore other solutions for the latency (e.g.
fine-grained locking hashlist for the overflow list) than the generation
counter.
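One possible shape of such a fine-grained overflow structure, as a sketch only
(names and sizing are my assumptions, not a posted design): hash the protected
address to a bucket with its own raw spinlock, so neither the context-switch
path nor the scan holds any single lock across a long walk.

#define HAZPTR_OVERFLOW_HASH_BITS       6

struct hazptr_overflow_bucket {
        raw_spinlock_t lock;
        struct list_head head;
};

/* One bucket per hash value; lock/list initialization at boot omitted. */
static struct hazptr_overflow_bucket
        hazptr_overflow_hash[1 << HAZPTR_OVERFLOW_HASH_BITS];

static struct hazptr_overflow_bucket *hazptr_overflow_bucket(void *addr)
{
        return &hazptr_overflow_hash[hash_ptr(addr, HAZPTR_OVERFLOW_HASH_BITS)];
}

/* Scan side: only the bucket that addr hashes to needs to be walked. */
static bool hazptr_overflow_still_protected(void *addr)
{
        struct hazptr_overflow_bucket *bucket = hazptr_overflow_bucket(addr);
        struct hazptr_backup_slot *backup_slot;
        bool found = false;

        raw_spin_lock(&bucket->lock);
        list_for_each_entry(backup_slot, &bucket->head, node) {
                if (smp_load_acquire(&backup_slot->slot.addr) == addr) {
                        found = true;
                        break;
                }
        }
        raw_spin_unlock(&bucket->lock);
        return found;
}

The context-switch path would hash by the protected address as well, so it
only ever takes a single bucket lock.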
Regards,
Boqun
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
On 2025-12-18 19:25, Boqun Feng wrote:
[...]
>> And if we pre-populate the 8 slots for each cpu, and thus force
>> fallback to overflow list:
>>
>> hazptr-smp-mb-8-fail 67.1 ns
>>
>
> Thank you! So involving locking seems to hurt performance more than
> the per-CPU/per-task operations do. This may suggest that enabling
> PREEMPT_HAZPTR by default has acceptable performance.

Indeed, I can fold it into the hazptr patch and remove the config
option then.

>
>> So breaking up the iteration into pieces is not just to handle
>> busy-waiting, but also to make sure we don't increase the
>> system latency by holding a raw spinlock (taken with rq lock
>> held) for more than the little time needed to iterate to the next
>> node.
>>
>
> I agree that it helps reduce the latency, but I feel like with a scan
> thread in the picture (and we don't need to busy-wait), we should use
> a forward-progress-guaranteed way in the updater side scan, which means
> we may need to explore other solutions for the latency (e.g.
> fine-grained locking hashlist for the overflow list) than the generation
> counter.

Guaranteeing forward progress of synchronize even with a steady stream
of unrelated hazard pointer additions/removals to/from the overflow list
is something we should aim for, with or without a scan thread.

As is, my current generation scheme does not guarantee this. But we can
use the liburcu RCU grace period "parity" concept as inspiration [1] and
introduce a two-list scheme, and have hazptr_synchronize flip the
current "list_add" head while it iterates on the other list. There would
be one generation counter for each of the two lists.

This would be protected by holding a global mutex across
hazptr_synchronize. hazptr_synchronize would need to iterate
on the two lists one after the other, carefully flipping the
current "addition list" head between the two iterations.

So the worst case that can happen in terms of retries caused by
generation counter increments is if list entries are deleted while
the list is being traversed by hazptr_synchronize. Because there
are no possible concurrent additions to that list, the worst case
is that the list becomes empty, which bounds the number of retries
to the number of list elements.

Thoughts ?

Thanks,

Mathieu

[1] https://github.com/urcu/userspace-rcu/blob/master/src/urcu.c#L409

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
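A rough sketch of the two-list flip described in the message above
(illustrative only: the pair structure, the naming and the memory ordering are
assumptions, and the per-list generation counters are assumed to stay inside
the existing struct overflow_list):

struct overflow_list_pair {
        struct overflow_list lists[2];
        unsigned int add_idx;           /* index of the list receiving additions */
};

static DEFINE_MUTEX(hazptr_sync_mutex); /* serializes hazptr_synchronize() */

static void hazptr_synchronize_pair_sketch(struct overflow_list_pair *pair,
                                           void *addr)
{
        unsigned int idx;

        mutex_lock(&hazptr_sync_mutex);

        /* Wait out the list that is currently not receiving additions. */
        idx = 1 - READ_ONCE(pair->add_idx);
        hazptr_synchronize_overflow_list(&pair->lists[idx], addr);

        /* Flip: new additions now go to the list that was just waited out. */
        WRITE_ONCE(pair->add_idx, idx);

        /*
         * Wait out the other list. It no longer receives additions, so its
         * generation can only move because of deletions, which bounds the
         * number of retries by the number of entries present.
         */
        hazptr_synchronize_overflow_list(&pair->lists[1 - idx], addr);

        mutex_unlock(&hazptr_sync_mutex);
}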
On 12/19/2025 10:14 AM, Mathieu Desnoyers wrote:
> On 2025-12-18 19:25, Boqun Feng wrote:
> [...]
>>> And if we pre-populate the 8 slots for each cpu, and thus force
>>> fallback to overflow list:
>>>
>>> hazptr-smp-mb-8-fail 67.1 ns
>>>
>>
>> Thank you! So involving locking seems to hurt performance more than
>> the per-CPU/per-task operations do. This may suggest that enabling
>> PREEMPT_HAZPTR by default has acceptable performance.
>
> Indeed, I can fold it into the hazptr patch and remove the config
> option then.
>
>>
>>> So breaking up the iteration into pieces is not just to handle
>>> busy-waiting, but also to make sure we don't increase the
>>> system latency by holding a raw spinlock (taken with rq lock
>>> held) for more than the little time needed to iterate to the next
>>> node.
>>>
>>
>> I agree that it helps reduce the latency, but I feel like with a scan
>> thread in the picture (and we don't need to busy-wait), we should use
>> a forward-progress-guaranteed way in the updater side scan, which means
>> we may need to explore other solutions for the latency (e.g.
>> fine-grained locking hashlist for the overflow list) than the generation
>> counter.
>
> Guaranteeing forward progress of synchronize even with a steady stream
> of unrelated hazard pointer additions/removals to/from the overflow list
> is something we should aim for, with or without a scan thread.
>
> As is, my current generation scheme does not guarantee this. But we can
> use the liburcu RCU grace period "parity" concept as inspiration [1] and
> introduce a two-list scheme, and have hazptr_synchronize flip the
> current "list_add" head while it iterates on the other list. There would
> be one generation counter for each of the two lists.
>
> This would be protected by holding a global mutex across
> hazptr_synchronize. hazptr_synchronize would need to iterate
> on the two lists one after the other, carefully flipping the
> current "addition list" head between the two iterations.
>
> So the worst case that can happen in terms of retries caused by
> generation counter increments is if list entries are deleted while
> the list is being traversed by hazptr_synchronize. Because there
> are no possible concurrent additions to that list, the worst case
> is that the list becomes empty, which bounds the number of retries
> to the number of list elements.
>
> Thoughts ?

IMHO the overflow case is "special" and should not happen often, otherwise
things are "bad" anyway. I am not sure if this kind of complexity will be worth
it unless we know HP forward-progress is a real problem. Also, since HP acquire
will be short lived, are we that likely to not get past a temporary shortage of
slots?

Perhaps the forward-progress problem should be rephrased to the following?: If a
reader hits an overflow slot, it should probably be able to get a non-overflow
slot soon, even if hazard pointer slots are over-subscribed.

Thanks.
On 2025-12-19 10:42, Joel Fernandes wrote:
>
> IMHO the overflow case is "special" and should not happen often, otherwise
> things are "bad" anyway. I am not sure if this kind of complexity will be worth
> it unless we know HP forward-progress is a real problem. Also, since HP acquire
> will be short lived, are we that likely to not get past a temporary shortage of
> slots?

Given that we have context switch integration which moves the active
per-cpu slots to the overflow list, I can see that even not-so-special
users which get preempted or block regularly with active per-cpu slots
could theoretically end up preventing HP scan progress.

Providing HP scan progress guarantees is IMO an important aspect to
consider if we want to ensure the implementation does not cause subtle
HP scan stalls as its use scales up.

> Perhaps the forward-progress problem should be rephrased to the following?: If a
> reader hits an overflow slot, it should probably be able to get a non-overflow
> slot soon, even if hazard pointer slots are over-subscribed.

Those are two distinct forward progress guarantees, and I think both are
important:

* Forward progress of HP readers,
* Forward progress of HP scan.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On 12/19/2025 5:19 PM, Mathieu Desnoyers wrote:
> On 2025-12-19 10:42, Joel Fernandes wrote:
>>
>> IMHO the overflow case is "special" and should not happen often, otherwise
>> things are "bad" anyway. I am not sure if this kind of complexity will be worth
>> it unless we know HP forward-progress is a real problem. Also, since HP acquire
>> will be short lived, are we that likely to not get past a temporary shortage of
>> slots?
>
> Given that we have context switch integration which moves the active
> per-cpu slots to the overflow list, I can see that even not-so-special
> users which get preempted or block regularly with active per-cpu slots
> could theoretically end up preventing HP scan progress.

Yeah, I see. That's the 'problem' with the preemption design of this patchset.
You always have to move it in and out of the overflow list on preemption even
when there is no overflow AFAICS. My idea (which has its own issues) does not
require that on preemption.

> Providing HP scan progress guarantees is IMO an important aspect to
> consider if we want to ensure the implementation does not cause subtle
> HP scan stalls as its use scales up.

Sure. But if you didn't have to deal with a list in the 'normal' case (not the
over-saturated slots case), then it wouldn't be that big a deal.

>> Perhaps the forward-progress problem should be rephrased to the following?: If a
>> reader hits an overflow slot, it should probably be able to get a non-overflow
>> slot soon, even if hazard pointer slots are over-subscribed.
>
> Those are two distinct forward progress guarantees, and I think both are
> important:
>
> * Forward progress of HP readers,
> * Forward progress of HP scan.

Maybe I am missing something, but AFAICS, if the readers are only using slots
and not locking in normal operation, then the scan also will automatically make
forward progress. So the two are related: forward progress of readers and
forward progress of scanning. It is your preemption design that requires the
locking.

Btw, I thought about the scanning issue with the task-slots idea, and really
synchronize() is supposed to be a slow path, so I am not fully sure it is that
much of an issue - it depends on the use case, but for per-cpu ref and
synchronize_rcu(), both are 'slow' anyway. And for the async case, it is almost
not an issue at all due to the batching/amortization of the task scan cost.

thanks,

- Joel
On Fri, Dec 19, 2025 at 05:39:00PM -0500, Joel Fernandes wrote:
>
>
> On 12/19/2025 5:19 PM, Mathieu Desnoyers wrote:
> > On 2025-12-19 10:42, Joel Fernandes wrote:
> >>
> >> IMHO the overflow case is "special" and should not happen often, otherwise
> >> things are "bad" anyway. I am not sure if this kind of complexity will be worth
> >> it unless we know HP forward-progress is a real problem. Also, since HP acquire
> >> will be short lived, are we that likely to not get past a temporary shortage of
> >> slots?
> >
> > Given that we have context switch integration which moves the active
> > per-cpu slots to the overflow list, I can see that even not-so-special
> > users which get preempted or block regularly with active per-cpu slots
> > could theoretically end up preventing HP scan progress.
>
> Yeah, I see. That's the 'problem' with the preemption design of this patchset.
> You always have to move it in and out of the overflow list on preemption even
> when there is no overflow AFAICS. My idea (which has its own issues) does not
> require that on preemption.
>
> > Providing HP scan progress guarantees is IMO an important aspect to
> > consider if we want to ensure the implementation does not cause subtle
> > HP scan stalls as its use scales up.
>
> Sure. But if you didn't have to deal with a list in the 'normal' case (not the
> over-saturated slots case), then it wouldn't be that big a deal.
>

With PREEMPT_HAZPTR, the overflow list will be used if a task gets
preempted while using hazptrs; without PREEMPT_HAZPTR, the overflow list
will be used if there are more than 8 slots in use on the CPU (most
likely because a task was preempted while holding a hazptr slot).

Therefore, if this hazptr design is to have a few users and we want to
support preemptible hazard pointer read-side sections, sooner or later we
need to resolve this issue of the scan racing with readers (in context
switches) on the overflow list. Surely we don't need to have something
perfect on day 1, but planning a bit ahead won't hurt ;-)

Regards,
Boqun

> >> Perhaps the forward-progress problem should be rephrased to the following?: If a
> >> reader hits an overflow slot, it should probably be able to get a non-overflow
> >> slot soon, even if hazard pointer slots are over-subscribed.
> >
> > Those are two distinct forward progress guarantees, and I think both are
> > important:
> >
> > * Forward progress of HP readers,
> > * Forward progress of HP scan.
> [..]
Hi Boqun,
> On Dec 18, 2025, at 9:07 PM, Boqun Feng <boqun.feng@gmail.com> wrote:
>
> On Thu, Dec 18, 2025 at 06:36:00PM -0500, Mathieu Desnoyers wrote:
>> On 2025-12-18 15:22, Boqun Feng wrote:
>> [...]
>>>>> Could you utilize this[1] to see a
>>>>> comparison of the reader-side performance against RCU/SRCU?
>>>>
>>>> Good point ! Let's see.
>>>>
>>>> On a AMD 2x EPYC 9654 96-Core Processor with 192 cores,
>>>> hyperthreading disabled,
>>>> CONFIG_PREEMPT=y,
>>>> CONFIG_PREEMPT_RCU=y,
>>>> CONFIG_PREEMPT_HAZPTR=y.
>>>>
>>>> scale_type ns
>>>> -----------------------
>>>> hazptr-smp-mb 13.1 <- this implementation
>>>> hazptr-barrier 11.5 <- replace smp_mb() on acquire with barrier(), requires IPIs on synchronize.
>>>> hazptr-smp-mb-hlist 12.7 <- replace per-task hp context and per-cpu overflow lists by hlist.
>>>> rcu 17.0
>>>> srcu 20.0
>>>> srcu-fast 1.5
>>>> rcu-tasks 0.0
>>>> rcu-trace 1.7
>>>> refcnt 1148.0
>>>> rwlock 1190.0
>>>> rwsem 4199.3
>>>> lock 41070.6
>>>> lock-irq 46176.3
>>>> acqrel 1.1
>>>>
>>>> So only srcu-fast, rcu-tasks, rcu-trace and a plain acqrel
>>>> appear to beat hazptr read-side performance.
>>>>
>>>
>>> Could you also see the reader-side performance impact when the percpu
>>> hazard pointer slots are used up? I.e. the worst case.
>>
>> I've modified the code to populate "(void *)1UL" in the 7 first slots
>> at bootup, here is the result:
>>
>> hazptr-smp-mb-7-fail 16.3 ns
>>
>> So we go from 13.1 ns to 16.3 ns when all slots but one are used.
>>
>> And if we pre-populate the 8 slots for each cpu, and thus force
>> fallback to overflow list:
>>
>> hazptr-smp-mb-8-fail 67.1 ns
>>
>
> Thank you! So involving locking seems to hurt performance more than
> the per-CPU/per-task operations do. This may suggest that enabling
> PREEMPT_HAZPTR by default has acceptable performance.
My impression is that we do other locking on preemption anyway, such as the blocked-tasks list for preempted RCU critical sections. So maybe that's okay. As you said, the performance is acceptable.
>>>
>>>> [...]
>>>>
>>>>>> +/*
>>>>>> + * Perform piecewise iteration on overflow list waiting until "addr" is
>>>>>> + * not present. Raw spinlock is released and taken between each list
>>>>>> + * item and busy loop iteration. The overflow list generation is checked
>>>>>> + * each time the lock is taken to validate that the list has not changed
>>>>>> + * before resuming iteration or busy wait. If the generation has
>>>>>> + * changed, retry the entire list traversal.
>>>>>> + */
>>>>>> +static
>>>>>> +void hazptr_synchronize_overflow_list(struct overflow_list *overflow_list, void *addr)
>>>>>> +{
>>>>>> + struct hazptr_backup_slot *backup_slot;
>>>>>> + uint64_t snapshot_gen;
>>>>>> +
>>>>>> + raw_spin_lock(&overflow_list->lock);
>>>>>> +retry:
>>>>>> + snapshot_gen = overflow_list->gen;
>>>>>> + list_for_each_entry(backup_slot, &overflow_list->head, node) {
>>>>>> + /* Busy-wait if node is found. */
>>>>>> + while (smp_load_acquire(&backup_slot->slot.addr) == addr) { /* Load B */
>>>>>> + raw_spin_unlock(&overflow_list->lock);
>>>>>> + cpu_relax();
>>>>>
>>>>> I think we should prioritize the scan thread solution [2] instead of
>>>>> busy waiting hazard pointer updaters, because when we have multiple
>>>>> hazard pointer usages we would want to consolidate the scans from
>>>>> updater side.
Yeah, the scan thread idea also fixes the scan cost issue with per-task slots if we batch. If we implement a separate hazard pointer flavor along those lines, then we should definitely do a worker thread.
>>>> I agree that batching scans with a worker thread is a logical next step.
>>>>
>>>>> If so, the whole ->gen can be avoided.
>>>>
>>>> How would it allow removing the generation trick without causing long
>>>> raw spinlock latencies ?
>>>>
>>>
>>> Because we won't need to busy-wait for the readers to go away, we can
>>> check whether they are still there in the next scan.
>>>
>>> so:
>>>
>>> list_for_each_entry(backup_slot, &overflow_list->head, node) {
>>> /* Busy-wait if node is found. */
>>> if (smp_load_acquire(&backup_slot->slot.addr) == addr) { /* Load B */
>>> <mark addr as unable to free and move on>
>>
>> But then you still iterate on a possibly large list of overflow nodes,
>> with a raw spinlock held. That raw spinlock is taken by the scheduler
>> on context switch. This can cause very long scheduler latency.
>>
>
> That's fair.
What about combining both approaches?
Can we do the generation trick along with worker thread scanning? I feel that should be doable.
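Combining the two could look roughly like the sketch below (names invented;
the hazptr_synchronize() call and its signature are assumptions): updaters
queue retired pointers, and a single worker performs the piecewise,
generation-checked waits on their behalf, so updaters never busy-wait
themselves.

struct hazptr_retired {
        struct llist_node node;
        void *addr;
        void (*free_fn)(void *addr);
};

static LLIST_HEAD(hazptr_retired_list);

static void hazptr_scan_workfn(struct work_struct *work)
{
        struct llist_node *batch = llist_del_all(&hazptr_retired_list);
        struct hazptr_retired *r, *tmp;

        llist_for_each_entry_safe(r, tmp, batch, node) {
                /* Piecewise, generation-checked wait (assumed signature). */
                hazptr_synchronize(r->addr);
                r->free_fn(r->addr);
                kfree(r);
        }
}
static DECLARE_WORK(hazptr_scan_work, hazptr_scan_workfn);

/* Updater side: queue the pointer, kick the worker, and return. */
static void hazptr_retire_sketch(struct hazptr_retired *r)
{
        llist_add(&r->node, &hazptr_retired_list);
        schedule_work(&hazptr_scan_work);
}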
>
>> So breaking up the iteration into pieces is not just to handle
>> busy-waiting, but also to make sure we don't increase the
>> system latency by holding a raw spinlock (taken with rq lock
>> held) for more than the little time needed to iterate to the next
>> node.
>>
>
> I agree that it helps reduce the latency, but I feel like with a scan
> thread in the picture (and we don't need to busy-wait), we should use
> a forward-progress-guaranteed way in the updater side scan, which means
> we may need to explore other solutions for the latency (e.g.
> fine-grained locking hashlist for the overflow list) than the generation
> counter.
Hmm... That only works, I guess, if there is no interference between the fine-grained list being iterated and the list being overflowed into. Otherwise, I think it might run into the same issue, but I could be missing something about the idea.
thanks,
- Joel
>
> Regards,
> Boqun
>
>> Thanks,
>>
>> Mathieu
>>
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> https://www.efficios.com
>