Hi,

Here is a revisited version of my Hazard Pointers series. Boqun, Joel,
if you guys have time to try it out with your use-cases it would be
great!

This new version does the following:

- It has 8 preallocated hazard pointer slots per CPU (one cache line),
- The hazard pointer user allocates a hazard pointer context variable
  (typically on the stack), which contains the pointer to the slot *and*
  a backup slot,
- When all the per-CPU slots are in use, fall back to the backup slot.
  Chain the backup slot into per-CPU lists, each protected by a raw
  spinlock.
- The hazard pointer synchronize does a piecewise iteration on the
  per-CPU overflow slot lists, releasing the raw spinlock between each
  list item. It uses a 64-bit generation counter to check for concurrent
  list changes, and restarts the traversal on generation counter
  mismatch.
- There is a new CONFIG_PREEMPT_HAZPTR config option. When enabled, the
  hazard pointer acquire/release adds and then removes the hazard
  pointer context from a per-task linked list. On context switch, the
  scheduler migrates the per-CPU slots used by the task to the backup
  per-context slots, thus making sure the per-CPU slots are not used by
  preempted and blocked tasks.

It is based on v6.18.1.

Review is very welcome,

Thanks,

Mathieu

Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: John Stultz <jstultz@google.com>
Cc: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Zqiang <qiang.zhang1211@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: maged.michael@gmail.com
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
Cc: rcu@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: lkmm@lists.linux.dev

Mathieu Desnoyers (4):
  compiler.h: Introduce ptr_eq() to preserve address dependency
  Documentation: RCU: Refer to ptr_eq()
  hazptr: Implement Hazard Pointers
  hazptr: Migrate per-CPU slots to backup slot on context switch

 Documentation/RCU/rcu_dereference.rst |  38 +++-
 include/linux/compiler.h              |  63 +++++++
 include/linux/hazptr.h                | 241 ++++++++++++++++++++++++++
 include/linux/sched.h                 |   4 +
 init/init_task.c                      |   3 +
 init/main.c                           |   2 +
 kernel/Kconfig.preempt                |  10 ++
 kernel/Makefile                       |   2 +-
 kernel/fork.c                         |   3 +
 kernel/hazptr.c                       | 150 ++++++++++++++++
 kernel/sched/core.c                   |   2 +
 11 files changed, 512 insertions(+), 6 deletions(-)
 create mode 100644 include/linux/hazptr.h
 create mode 100644 kernel/hazptr.c

-- 
2.39.5
Hi Mathieu,

Thanks for posting this.

> On Dec 17, 2025, at 8:45 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
> Hi,
>
> Here is a revisited version of my Hazard Pointers series. Boqun, Joel,
> if you guys have time to try it out with your use-cases it would be
> great!
>
> This new version does the following:
>
> - It has 8 preallocated hazard pointer slots per CPU (one cache line),
> - The hazard pointer user allocates a hazard pointer context variable
>   (typically on the stack), which contains the pointer to the slot
>   *and* a backup slot,
> - When all the per-CPU slots are in use, fall back to the backup slot.
>   Chain the backup slot into per-CPU lists, each protected by a raw
>   spinlock.
> - The hazard pointer synchronize does a piecewise iteration on the
>   per-CPU overflow slot lists, releasing the raw spinlock between
>   each list item. It uses a 64-bit generation counter to check for
>   concurrent list changes, and restarts the traversal on generation
>   counter mismatch.
> - There is a new CONFIG_PREEMPT_HAZPTR config option. When enabled,
>   the hazard pointer acquire/release adds and then removes the hazard
>   pointer context from a per-task linked list. On context switch, the
>   scheduler migrates the per-CPU slots used by the task to the backup
>   per-context slots, thus making sure the per-CPU slots are not used
>   by preempted and blocked tasks.

This last point is another reason why I want the slots to be per task
instead of per CPU. It becomes very natural, because the hazard pointer
is always associated with a task anyway, not with the CPU (at the
use-case level). By putting the slot in the task struct, we let these
requirements flow naturally without requiring any locking or list
management. Did I miss something about the use cases?

I did some measurements on the task-scanning issue, and it is fast in
my testing (~1ms/10000 tasks). Any input from you or anyone on the
typical task count distribution we are addressing?
I also made a rough prototype, and it appears to be simpler, with fewer
lines of code, because I do not need to handle preemption. It just
happens naturally.

First of all, we can have a per-task counter that tracks how many
hazard pointers are active. If this is zero, then we can simply skip
the task instead of wasting cycles scanning all the task's slots.
Further, we can have a retire list so that a single scan covers all the
objects in the retire list, thus amortizing the scan cost. This can
also assist in implementing object retiring asynchronously via a
dedicated thread, perhaps with the tasks RCU infrastructure. We can
also potentially make this per-task counter a bitmap to speed up
scanning.

I am okay with the concept of an overflow list, but if we keep the
overflow list at the per-task level instead of the per-CPU level, it is
highly unlikely IMO that such an overflow list will be used, unless
more than, say, eight hazard pointers per task are active at any given
time. So its lock contention would be rarer than with a per-CPU
overflow list. I would say that contention would be incredibly rare,
because typically hazard pointers are used by multiple tasks, each of
which has its own unique set of slots. Whereas in a per-CPU overflow
approach, we have a higher chance of lock contention, especially when
the number of CPUs is low.

Other than the task-scanning performance issue, what am I missing?

Another nice benefit of using per-task hazard pointers is that we can
also allow sleeping in hazard pointer sections, because we will be
scanning sleeping tasks as well. By contrast, the other approaches I
have seen with per-CPU hazard pointers forbid sleeping, since after
sleeping a task is no longer associated with its CPU. The other
approaches also have a higher likelihood of locking due to running out
of slots.

Of course I am missing a use case, but I suspect we can find a per-CPU
refcount use case that benefits from this.
I am researching use cases when I get time. I think my next task is to
find a solid use case for this before doing further development of a
solution.

By the way, feedback on the scanning patch: can you consider using a
per-CPU counter to track the number of active slots per CPU? That way
you can skip the slots of CPUs that are not using hazard pointers.
Another idea is to skip idle CPUs as well.

Have you also considered any asynchronous use case where maintaining a
retired list would assist in RCU-style deferred reclaim of
hazard-pointer objects?

thanks,

- Joel

>
> It is based on v6.18.1.
>
> Review is very welcome,
>
> Thanks,
>
> Mathieu
On 2025-12-18 05:33, Joel Fernandes wrote:
> Hi Mathieu,
>
> Thanks for posting this.
>
>> On Dec 17, 2025, at 8:45 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>>
>> Hi,
>>
>> Here is a revisited version of my Hazard Pointers series. Boqun, Joel,
>> if you guys have time to try it out with your use-cases it would be
>> great!
>>
>> This new version does the following:
>>
>> - It has 8 preallocated hazard pointer slots per CPU (one cache line),
>> - The hazard pointer user allocates a hazard pointer context variable
>>   (typically on the stack), which contains the pointer to the slot
>>   *and* a backup slot,
>> - When all the per-CPU slots are in use, fall back to the backup slot.
>>   Chain the backup slot into per-CPU lists, each protected by a raw
>>   spinlock.
>> - The hazard pointer synchronize does a piecewise iteration on the
>>   per-CPU overflow slot lists, releasing the raw spinlock between
>>   each list item. It uses a 64-bit generation counter to check for
>>   concurrent list changes, and restarts the traversal on generation
>>   counter mismatch.
>> - There is a new CONFIG_PREEMPT_HAZPTR config option. When enabled,
>>   the hazard pointer acquire/release adds and then removes the hazard
>>   pointer context from a per-task linked list. On context switch, the
>>   scheduler migrates the per-CPU slots used by the task to the backup
>>   per-context slots, thus making sure the per-CPU slots are not used
>>   by preempted and blocked tasks.
>
> This last point is another reason why I want the slots to be per task
> instead of per CPU. It becomes very natural because the hazard pointer
> is always associated with a task anyway, not with the CPU (at the
> use-case level). By putting the slot in the task struct, we let these
> requirements flow naturally without requiring any locking or list
> management. Did I miss something about the use cases?
I start from the hypothesis that scanning all tasks at synchronize is
likely too slow for a general-purpose synchronization algorithm. This is
why we have RCU (general purpose) tracking hierarchical per-CPU state,
and specialized RCU flavors (for tracing and other use-cases) using
iteration on all tasks. One of the highlights of hazard pointers over
RCU is the ability to know about quiescence of a pointer without waiting
for a whole grace period. Therefore I went down the route of scanning
per-CPU state rather than per-task.

>
> I did some measurements about the task-scanning issue, and it is fast
> in my testing (~1ms/10000 tasks). Any input from you or anyone on what
> the typical task count distribution is that we are addressing? I also
> made a rough prototype, and it appears to be simpler with fewer lines
> of code because I do not need to handle preemption. It just happens
> naturally.

It really depends on the access pattern here. If you want to replace
refcount decrement-and-test with hazard pointer synchronize, a delay of
1ms on each hazptr synchronize is a *lot*. Of course this can perhaps be
batched with a worker thread, but then you are back to using more memory
due to delayed reclaim, and thus closer to RCU.

>
> First of all, we can have a per-task counter that tracks how many
> hazard pointers are active. If this is zero, then we can simply skip
> the task instead of wasting cycles scanning all the task's slots.
> Further, we can have a retire list that reuses a single scan to scan
> all the objects in the retire list, thus reusing the scan cost. This
> can also assist in asynchronously implementing object retiring via a
> dedicated thread, perhaps with the tasks RCU infrastructure. We can
> also make this per-task counter a bitmap to speed up scanning
> potentially.
>
> I am okay with the concept of an overflow list, but if we keep the
> overflow list at the per-task level instead of the per-CPU level, it
> is highly unlikely IMO that such an overflow list will be used unless
> more than, say, eight hazard pointers per task are active at any given
> time.

The PREEMPT_HAZPTR scheme achieves this as well, with per-CPU scan
rather than per-task.

> So its lock contention would be rarer than, say, having a per-CPU
> overflow list. I would say that contention would be incredibly rare
> because typically hazard pointers are used by multiple tasks, each of
> which will have its own unique set of slots. Whereas in a per-CPU
> overflow approach, we have a higher chance of lock contention,
> especially when the number of CPUs is low.

Contention on the per-CPU spinlocks would only happen if threads are
migrated between the chaining of their backup slot (either when 8
hazptrs are in use by the thread, or on context switch) and its
release. Do you expect this type of migration to happen so often as to
increase contention?

>
> Other than the task-scanning performance issue, what am I missing?

Task-scan performance is the key issue. Also, you have to set a limit
on the number of per-task slots. How would you handle the scenario
where a user needs more slots than the limit?

>
> Another nice benefit of using per-task hazard pointers is that we can
> also implement sleeping in hazard pointer sections because we will be
> scanning for sleeping tasks as well.

The PREEMPT_HAZPTR scheme takes care of this as well. Sleeping with a
hazptr held will trigger the context switch, and the scheduler will
simply move the hazard pointer from the per-CPU slot to the backup
slot. End of story.

> By contrast, the other approaches I have seen with per-CPU hazard
> pointers forbid sleeping, since after sleeping a task is no longer
> associated with its CPU. The other approaches also have a higher
> likelihood of locking due to running out of slots.

Except for PREEMPT_HAZPTR. :)

>
> Of course I am missing a use case, but I suspect we can find a per-CPU
> refcount use case that benefits from this. I am researching use cases
> when I get time. I think my next task is to find a solid use case for
> this before doing further development of a solution.

Let me know what you find, I'm very interested!

>
> By the way, feedback on the scanning patch:
>
> Can you consider using a per-CPU counter to track the number of active
> slots per CPU? That way you can ignore CPU slots for CPUs that are not
> using hazard pointers. Another idea is to skip idle CPUs as well.

I had this in a userspace prototype: a counter of used per-CPU slots.
But I finally decided against it because it requires extra instructions
on the fast path, *and* scanning 8 pointers within a single cache line
on synchronize is really cheap. So I'd need benchmarks to justify
adding back the counter.

>
> Have you also considered any asynchronous use case where maintaining a
> retired list would assist in RCU-style deferred reclaim of
> hazard-pointer objects?

This would be a logical next step indeed. I'll have to think about this
some more.

Thanks,

Mathieu

>
> thanks,
>
> - Joel

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com