Hi,

Here is a revisited version of my Hazard Pointers series. Boqun, Joel,
if you guys have time to try it out with your use-cases it would be
great!

This new version does the following:

- It has 8 preallocated hazard pointer slots per CPU (one cache line),
- The hazard pointer user allocates a hazard pointer context variable
  (typically on the stack), which contains the pointer to the slot *and*
  a backup slot,
- When all the per-CPU slots are in use, fall back to the backup slot.
  Chain the backup slot into per-CPU lists, each protected by a raw
  spinlock.
- The hazard pointer synchronize does a piecewise iteration on the
  per-CPU overflow slot lists, releasing the raw spinlock between each
  list item. It uses a 64-bit generation counter to check for concurrent
  list changes, and restarts the traversal on generation counter
  mismatch.
- There is a new CONFIG_PREEMPT_HAZPTR config option. When enabled, the
  hazard pointer acquire/release adds and then removes the hazard
  pointer context from a per-task linked list. On context switch, the
  scheduler migrates the per-CPU slots used by the task to the backup
  per-context slots, thus making sure the per-CPU slots are not used by
  preempted and blocked tasks.

It is based on v6.18.1.

Review is very welcome,

Thanks,

Mathieu

Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: John Stultz <jstultz@google.com>
Cc: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Zqiang <qiang.zhang1211@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: maged.michael@gmail.com
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
Cc: rcu@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: lkmm@lists.linux.dev

Mathieu Desnoyers (4):
  compiler.h: Introduce ptr_eq() to preserve address dependency
  Documentation: RCU: Refer to ptr_eq()
  hazptr: Implement Hazard Pointers
  hazptr: Migrate per-CPU slots to backup slot on context switch

 Documentation/RCU/rcu_dereference.rst |  38 +++-
 include/linux/compiler.h              |  63 +++++++
 include/linux/hazptr.h                | 241 ++++++++++++++++++++++++++
 include/linux/sched.h                 |   4 +
 init/init_task.c                      |   3 +
 init/main.c                           |   2 +
 kernel/Kconfig.preempt                |  10 ++
 kernel/Makefile                       |   2 +-
 kernel/fork.c                         |   3 +
 kernel/hazptr.c                       | 150 ++++++++++++++++
 kernel/sched/core.c                   |   2 +
 11 files changed, 512 insertions(+), 6 deletions(-)
 create mode 100644 include/linux/hazptr.h
 create mode 100644 kernel/hazptr.c

-- 
2.39.5
Hi Mathieu,

Thanks for posting this.

> On Dec 17, 2025, at 8:45 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
> Hi,
>
> Here is a revisited version of my Hazard Pointers series. Boqun, Joel,
> if you guys have time to try it out with your use-cases it would be
> great!
>
> This new version does the following:
>
> - It has 8 preallocated hazard pointer slots per CPU (one cache line),
> - The hazard pointer user allocates a hazard pointer context variable
>   (typically on the stack), which contains the pointer to the slot
>   *and* a backup slot,
> - When all the per-CPU slots are in use, fall back to the backup slot.
>   Chain the backup slot into per-CPU lists, each protected by a raw
>   spinlock.
> - The hazard pointer synchronize does a piecewise iteration on the
>   per-CPU overflow slot lists, releasing the raw spinlock between
>   each list item. It uses a 64-bit generation counter to check for
>   concurrent list changes, and restarts the traversal on generation
>   counter mismatch.
> - There is a new CONFIG_PREEMPT_HAZPTR config option. When enabled,
>   the hazard pointer acquire/release adds and then removes the hazard
>   pointer context from a per-task linked list. On context switch, the
>   scheduler migrates the per-CPU slots used by the task to the backup
>   per-context slots, thus making sure the per-CPU slots are not used
>   by preempted and blocked tasks.

This last point is another reason why I want the slots to be per task
instead of per CPU. It becomes very natural, because the hazard pointer
is always associated with a task anyway, not with the CPU (at the
use-case level). By putting the slot in the task struct, we let these
requirements flow naturally without requiring any locking or list
management. Did I miss something about the use cases?

I did some measurements on the task-scanning issue, and it is fast in
my testing (~1ms/10000 tasks). Any input from you or anyone on the
typical task count distribution we are addressing?
I also made a rough prototype, and it appears to be simpler, with fewer
lines of code, because I do not need to handle preemption. It just
happens naturally.

First of all, we can have a per-task counter that tracks how many
hazard pointers are active. If this is zero, then we can simply skip
the task instead of wasting cycles scanning all the task's slots.
Further, we can have a retire list so that a single scan covers all the
objects in the retire list, thus amortizing the scan cost. This can
also assist in implementing object retiring asynchronously via a
dedicated thread, perhaps with the tasks RCU infrastructure. We can
also potentially make this per-task counter a bitmap to speed up
scanning.

I am okay with the concept of an overflow list, but if we keep the
overflow list at the per-task level instead of the per-CPU level, it is
highly unlikely IMO that such an overflow list will be used, unless
more than, say, eight hazard pointers per task are active at any given
time. So its lock contention would be rarer than with a per-CPU
overflow list. I would say that contention would be incredibly rare,
because typically hazard pointers are used by multiple tasks, each of
which has its own unique set of slots. Whereas in a per-CPU overflow
approach, we have a higher chance of lock contention, especially when
the number of CPUs is low.

Other than the task-scanning performance issue, what am I missing?

Another nice benefit of using per-task hazard pointers is that we can
also allow sleeping in hazard pointer sections, because we will be
scanning sleeping tasks as well. By contrast, the other approaches I
have seen with per-CPU hazard pointers forbid sleeping, since after
sleeping a task is no longer associated with its CPU. The other
approaches also have a higher likelihood of locking due to running out
of slots.

Of course I am missing a use case, but I suspect we can find a per-CPU
refcount use case that benefits from this.
I am researching use cases when I get time. I think my next task is to
find a solid use case for this before doing further development of a
solution.

By the way, feedback on the scanning patch: can you consider using a
per-CPU counter to track the number of active slots per CPU? That way
you can skip the slots of CPUs that are not using hazard pointers.
Another idea is to skip idle CPUs as well.

Have you also considered any asynchronous use case where maintaining a
retired list would assist in RCU-style deferred reclaim of
hazard-pointer objects?

thanks,

- Joel

>
> It is based on v6.18.1.
>
> Review is very welcome,
>
> Thanks,
>
> Mathieu
On 2025-12-18 05:33, Joel Fernandes wrote:
> Hi Mathieu,
>
> Thanks for posting this.
>
>> On Dec 17, 2025, at 8:45 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>>
>> Hi,
>>
>> Here is a revisited version of my Hazard Pointers series. Boqun, Joel,
>> if you guys have time to try it out with your use-cases it would be
>> great!
>>
>> This new version does the following:
>>
>> - It has 8 preallocated hazard pointer slots per CPU (one cache line),
>> - The hazard pointer user allocates a hazard pointer context variable
>>   (typically on the stack), which contains the pointer to the slot
>>   *and* a backup slot,
>> - When all the per-CPU slots are in use, fall back to the backup slot.
>>   Chain the backup slot into per-CPU lists, each protected by a raw
>>   spinlock.
>> - The hazard pointer synchronize does a piecewise iteration on the
>>   per-CPU overflow slot lists, releasing the raw spinlock between
>>   each list item. It uses a 64-bit generation counter to check for
>>   concurrent list changes, and restarts the traversal on generation
>>   counter mismatch.
>> - There is a new CONFIG_PREEMPT_HAZPTR config option. When enabled,
>>   the hazard pointer acquire/release adds and then removes the hazard
>>   pointer context from a per-task linked list. On context switch, the
>>   scheduler migrates the per-CPU slots used by the task to the backup
>>   per-context slots, thus making sure the per-CPU slots are not used
>>   by preempted and blocked tasks.
>
> This last point is another reason why I want the slots to be per task
> instead of per CPU. It becomes very natural because the hazard pointer
> is always associated with a task anyway, not with the CPU (at the
> use-case level). By putting the slot in the task struct, we let these
> requirements flow naturally without requiring any locking or list
> management. Did I miss something about the use cases?
I start from the hypothesis that scanning all tasks at synchronize is
likely too slow for a general-purpose synchronization algorithm. This is
why we have RCU (general purpose) tracking hierarchical per-CPU state,
and specialized RCU flavors (for tracing and other use-cases) using
iteration on all tasks. One of the highlights of hazard pointers over
RCU is the ability to know about quiescence of a pointer without waiting
for a whole grace period. Therefore I went down the route of scanning
per-CPU state rather than per-task.

>
> I did some measurements about the task-scanning issue, and it is fast
> in my testing (~1ms/10000 tasks). Any input from you or anyone on what
> the typical task count distribution is that we are addressing? I also
> made a rough prototype, and it appears to be simpler with fewer lines
> of code because I do not need to handle preemption. It just happens
> naturally.

It really depends on the access pattern here. If you want to replace
refcount decrement-and-test with hazard pointer synchronize, a delay of
1ms on each hazptr synchronize is a *lot*. Of course this can perhaps be
batched with a worker thread, but then you are back to using more memory
due to delayed reclaim, and thus closer to RCU.

>
> First of all, we can have a per-task counter that tracks how many
> hazard pointers are active. If this is zero, then we can simply skip
> the task instead of wasting cycles scanning all the task's slots.
> Further, we can have a retire list that reuses a single scan to scan
> all the objects in the retire list, thus reusing the scan cost. This
> can also assist in asynchronously implementing object retiring via a
> dedicated thread, perhaps with the tasks RCU infrastructure. We can
> also make this per-task counter a bitmap to speed up scanning
> potentially.
>
> I am okay with the concept of an overflow list, but if we keep the
> overflow list at the per-task level instead of the per-CPU level, it
> is highly unlikely IMO that such an overflow list will be used unless
> more than, say, eight hazard pointers per task are active at any given
> time.

The PREEMPT_HAZPTR scheme achieves this as well, with per-CPU scan
rather than per-task.

> So its lock contention would be rarer than, say, having a per-CPU
> overflow list. I would say that contention would be incredibly rare
> because typically hazard pointers are used by multiple tasks, each of
> which will have its own unique set of slots. Whereas in a per-CPU
> overflow approach, we have a higher chance of lock contention,
> especially when the number of CPUs is low.

Contention on the per-CPU spinlocks would only happen if threads are
migrated between the chaining of their backup slot (either when 8
hazptrs are in use by the thread, or on context switch) and its
release. Do you expect this type of migration to happen so often as to
increase contention?

>
> Other than the task-scanning performance issue, what am I missing?

Task-scan performance is the key issue. Also, you have to set a limit
on the number of per-task slots. How would you handle the scenario
where a user needs more slots than the limit?

>
> Another nice benefit of using per-task hazard pointers is that we can
> also implement sleeping in hazard pointer sections because we will be
> scanning for sleeping tasks as well.

The PREEMPT_HAZPTR scheme takes care of this as well. Sleeping with a
hazptr held will trigger the context switch, and the scheduler will
simply move the hazard pointer from the per-CPU slot to the backup
slot. End of story.

> By contrast, the other approaches I have seen with per-CPU hazard
> pointers forbid sleeping, since after sleeping a task is no longer
> associated with its CPU. The other approaches also have a higher
> likelihood of locking due to running out of slots.

Except for PREEMPT_HAZPTR. :)

>
> Of course I am missing a use case, but I suspect we can find a per-CPU
> refcount use case that benefits from this. I am researching use cases
> when I get time. I think my next task is to find a solid use case for
> this before doing further development of a solution.

Let me know what you find, I'm very interested!

>
> By the way, feedback on the scanning patch:
>
> Can you consider using a per-CPU counter to track the number of active
> slots per CPU? That way you can ignore CPU slots for CPUs that are not
> using hazard pointers. Another idea is to skip idle CPUs as well.

I had this in a userspace prototype: a counter of used per-CPU slots.
But I finally decided against it because it requires extra instructions
on the fast path, *and* scanning 8 pointers within a single cache line
on synchronize is really cheap. So I'd need benchmarks to justify
adding back the counter.

>
> Have you also considered any asynchronous use case where maintaining a
> retired list would assist in RCU-style deferred reclaim of
> hazard-pointer objects?

This would be a logical next step indeed. I'll have to think about this
some more.

Thanks,

Mathieu

>
> thanks,
>
> - Joel

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com