include/linux/shazptr.h | 73 +++++++++ kernel/locking/Makefile | 2 +- kernel/locking/lockdep.c | 11 +- kernel/locking/shazptr.c | 318 +++++++++++++++++++++++++++++++++++++++ kernel/rcu/rcuscale.c | 60 +++++++- kernel/rcu/refscale.c | 77 ++++++++++ 6 files changed, 534 insertions(+), 7 deletions(-) create mode 100644 include/linux/shazptr.h create mode 100644 kernel/locking/shazptr.c
Hi, This is the official first version of simple hazard pointers following the RFC: https://lore.kernel.org/lkml/20250414060055.341516-1-boqun.feng@gmail.com/ I rebase it onto v6.16-rc3 and hope to get more feedback this time. Thanks a lot for Breno Leitao to try the RFC out and share the numbers. I did an extra comparison this time, between the shazptr solution and the synchronize_rcu_expedited() solution. In my test, during a 100 times "tc qdisc replace" run: * IPI rate with the shazptr solution: ~14 per second per core. * IPI rate with synchronize_rcu_expedited(): ~140 per second per core. (IPI results were from the 'CAL' line in /proc/interrupt) This shows that while both solutions have the similar speedup, shazptr solution avoids the introduce of high IPI rate compared to synchronize_rcu_expedited(). Feedback is welcome and please let know if there is any concern or suggestion. Thanks! Regards, Boqun -------------------------------------- Please find the old performance below: On my system (a 96-cpu VMs), the results of: time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq are (with lockdep enabled): (without the patchset) real 0m1.039s user 0m0.001s sys 0m0.069s (with the patchset) real 0m0.053s user 0m0.000s sys 0m0.051s i.e. almost 20x speed-up. Other comparisons between RCU and shazptr, the rcuscale results (using default configuration from tools/testing/selftests/rcutorture/bin/kvm.sh): RCU: Average grace-period duration: 7470.02 microseconds Minimum grace-period duration: 3981.6 50th percentile grace-period duration: 6002.73 90th percentile grace-period duration: 7008.93 99th percentile grace-period duration: 10015 Maximum grace-period duration: 142228 shazptr: Average grace-period duration: 0.845825 microseconds Minimum grace-period duration: 0.199 50th percentile grace-period duration: 0.585 90th percentile grace-period duration: 1.656 99th percentile grace-period duration: 3.872 Maximum grace-period duration: 3049.05 shazptr (skip_synchronize_self_scan=1, i.e. always let scan kthread to wakeup): Average grace-period duration: 467.861 microseconds Minimum grace-period duration: 92.913 50th percentile grace-period duration: 440.691 90th percentile grace-period duration: 460.623 99th percentile grace-period duration: 650.068 Maximum grace-period duration: 5775.46 shazptr_wildcard (i.e. readers always use SHAZPTR_WILDCARD): Average grace-period duration: 599.569 microseconds Minimum grace-period duration: 1.432 50th percentile grace-period duration: 582.631 90th percentile grace-period duration: 781.704 99th percentile grace-period duration: 1160.26 Maximum grace-period duration: 6727.53 shazptr_wildcard (skip_synchronize_self_scan=1): Average grace-period duration: 460.466 microseconds Minimum grace-period duration: 303.546 50th percentile grace-period duration: 424.334 90th percentile grace-period duration: 482.637 99th percentile grace-period duration: 600.214 Maximum grace-period duration: 4126.94 Boqun Feng (8): Introduce simple hazard pointers shazptr: Add refscale test shazptr: Add refscale test for wildcard shazptr: Avoid synchronize_shaptr() busy waiting shazptr: Allow skip self scan in synchronize_shaptr() rcuscale: Allow rcu_scale_ops::get_gp_seq to be NULL rcuscale: Add tests for simple hazard pointers locking/lockdep: Use shazptr to protect the key hashlist include/linux/shazptr.h | 73 +++++++++ kernel/locking/Makefile | 2 +- kernel/locking/lockdep.c | 11 +- kernel/locking/shazptr.c | 318 +++++++++++++++++++++++++++++++++++++++ kernel/rcu/rcuscale.c | 60 +++++++- kernel/rcu/refscale.c | 77 ++++++++++ 6 files changed, 534 insertions(+), 7 deletions(-) create mode 100644 include/linux/shazptr.h create mode 100644 kernel/locking/shazptr.c -- 2.39.5 (Apple Git-154)
On 2025-06-24 23:10, Boqun Feng wrote: > Hi, > > This is the official first version of simple hazard pointers following > the RFC: > > https://lore.kernel.org/lkml/20250414060055.341516-1-boqun.feng@gmail.com/ > > I rebase it onto v6.16-rc3 and hope to get more feedback this time. > > Thanks a lot for Breno Leitao to try the RFC out and share the numbers. > > I did an extra comparison this time, between the shazptr solution and > the synchronize_rcu_expedited() solution. In my test, during a 100 times > "tc qdisc replace" run: > > * IPI rate with the shazptr solution: ~14 per second per core. > * IPI rate with synchronize_rcu_expedited(): ~140 per second per core. > > (IPI results were from the 'CAL' line in /proc/interrupt) > > This shows that while both solutions have the similar speedup, shazptr > solution avoids the introduce of high IPI rate compared to > synchronize_rcu_expedited(). > > Feedback is welcome and please let know if there is any concern or > suggestion. Thanks! Hi Boqun, What is unclear to me is what is the delta wrt: https://lore.kernel.org/lkml/20241008135034.1982519-4-mathieu.desnoyers@efficios.com/ and whether this helper against compiler optimizations would still be needed here: https://lore.kernel.org/lkml/20241008135034.1982519-2-mathieu.desnoyers@efficios.com/ Thanks, Mathieu > > Regards, > Boqun > > -------------------------------------- > Please find the old performance below: > > On my system (a 96-cpu VMs), the results of: > > time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq > > are (with lockdep enabled): > > (without the patchset) > real 0m1.039s > user 0m0.001s > sys 0m0.069s > > (with the patchset) > real 0m0.053s > user 0m0.000s > sys 0m0.051s > > i.e. almost 20x speed-up. > > Other comparisons between RCU and shazptr, the rcuscale results (using > default configuration from > tools/testing/selftests/rcutorture/bin/kvm.sh): > > RCU: > > Average grace-period duration: 7470.02 microseconds > Minimum grace-period duration: 3981.6 > 50th percentile grace-period duration: 6002.73 > 90th percentile grace-period duration: 7008.93 > 99th percentile grace-period duration: 10015 > Maximum grace-period duration: 142228 > > shazptr: > > Average grace-period duration: 0.845825 microseconds > Minimum grace-period duration: 0.199 > 50th percentile grace-period duration: 0.585 > 90th percentile grace-period duration: 1.656 > 99th percentile grace-period duration: 3.872 > Maximum grace-period duration: 3049.05 > > shazptr (skip_synchronize_self_scan=1, i.e. always let scan kthread to > wakeup): > > Average grace-period duration: 467.861 microseconds > Minimum grace-period duration: 92.913 > 50th percentile grace-period duration: 440.691 > 90th percentile grace-period duration: 460.623 > 99th percentile grace-period duration: 650.068 > Maximum grace-period duration: 5775.46 > > shazptr_wildcard (i.e. readers always use SHAZPTR_WILDCARD): > > Average grace-period duration: 599.569 microseconds > Minimum grace-period duration: 1.432 > 50th percentile grace-period duration: 582.631 > 90th percentile grace-period duration: 781.704 > 99th percentile grace-period duration: 1160.26 > Maximum grace-period duration: 6727.53 > > shazptr_wildcard (skip_synchronize_self_scan=1): > > Average grace-period duration: 460.466 microseconds > Minimum grace-period duration: 303.546 > 50th percentile grace-period duration: 424.334 > 90th percentile grace-period duration: 482.637 > 99th percentile grace-period duration: 600.214 > Maximum grace-period duration: 4126.94 > > Boqun Feng (8): > Introduce simple hazard pointers > shazptr: Add refscale test > shazptr: Add refscale test for wildcard > shazptr: Avoid synchronize_shaptr() busy waiting > shazptr: Allow skip self scan in synchronize_shaptr() > rcuscale: Allow rcu_scale_ops::get_gp_seq to be NULL > rcuscale: Add tests for simple hazard pointers > locking/lockdep: Use shazptr to protect the key hashlist > > include/linux/shazptr.h | 73 +++++++++ > kernel/locking/Makefile | 2 +- > kernel/locking/lockdep.c | 11 +- > kernel/locking/shazptr.c | 318 +++++++++++++++++++++++++++++++++++++++ > kernel/rcu/rcuscale.c | 60 +++++++- > kernel/rcu/refscale.c | 77 ++++++++++ > 6 files changed, 534 insertions(+), 7 deletions(-) > create mode 100644 include/linux/shazptr.h > create mode 100644 kernel/locking/shazptr.c > -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
On Wed, Jun 25, 2025 at 08:25:52AM -0400, Mathieu Desnoyers wrote: > On 2025-06-24 23:10, Boqun Feng wrote: > > Hi, > > > > This is the official first version of simple hazard pointers following > > the RFC: > > > > https://lore.kernel.org/lkml/20250414060055.341516-1-boqun.feng@gmail.com/ > > > > I rebase it onto v6.16-rc3 and hope to get more feedback this time. > > > > Thanks a lot for Breno Leitao to try the RFC out and share the numbers. > > > > I did an extra comparison this time, between the shazptr solution and > > the synchronize_rcu_expedited() solution. In my test, during a 100 times > > "tc qdisc replace" run: > > > > * IPI rate with the shazptr solution: ~14 per second per core. > > * IPI rate with synchronize_rcu_expedited(): ~140 per second per core. > > > > (IPI results were from the 'CAL' line in /proc/interrupt) > > > > This shows that while both solutions have the similar speedup, shazptr > > solution avoids the introduce of high IPI rate compared to > > synchronize_rcu_expedited(). > > > > Feedback is welcome and please let know if there is any concern or > > suggestion. Thanks! > > Hi Boqun, > > What is unclear to me is what is the delta wrt: > > https://lore.kernel.org/lkml/20241008135034.1982519-4-mathieu.desnoyers@efficios.com/ > shazptr is more close to the general hazptr I proposed earlier: https://lore.kernel.org/lkml/20240917143402.930114-1-boqun.feng@gmail.com/ , it's aimed as a global facility therefore no hazptr_domain is needed, plus it supports non-busy waiting synchronize_shazptr() at the beginning. > and whether this helper against compiler optimizations would still be needed here: > > https://lore.kernel.org/lkml/20241008135034.1982519-2-mathieu.desnoyers@efficios.com/ > For the current user, no, but eventually we are going to need it. Regards, Boqun > Thanks, > > Mathieu > > > > > Regards, > > Boqun > > > > -------------------------------------- > > Please find the old performance below: > > > > On my system (a 96-cpu VMs), the results of: > > > > time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq > > > > are (with lockdep enabled): > > > > (without the patchset) > > real 0m1.039s > > user 0m0.001s > > sys 0m0.069s > > > > (with the patchset) > > real 0m0.053s > > user 0m0.000s > > sys 0m0.051s > > > > i.e. almost 20x speed-up. > > > > Other comparisons between RCU and shazptr, the rcuscale results (using > > default configuration from > > tools/testing/selftests/rcutorture/bin/kvm.sh): > > > > RCU: > > > > Average grace-period duration: 7470.02 microseconds > > Minimum grace-period duration: 3981.6 > > 50th percentile grace-period duration: 6002.73 > > 90th percentile grace-period duration: 7008.93 > > 99th percentile grace-period duration: 10015 > > Maximum grace-period duration: 142228 > > > > shazptr: > > > > Average grace-period duration: 0.845825 microseconds > > Minimum grace-period duration: 0.199 > > 50th percentile grace-period duration: 0.585 > > 90th percentile grace-period duration: 1.656 > > 99th percentile grace-period duration: 3.872 > > Maximum grace-period duration: 3049.05 > > > > shazptr (skip_synchronize_self_scan=1, i.e. always let scan kthread to > > wakeup): > > > > Average grace-period duration: 467.861 microseconds > > Minimum grace-period duration: 92.913 > > 50th percentile grace-period duration: 440.691 > > 90th percentile grace-period duration: 460.623 > > 99th percentile grace-period duration: 650.068 > > Maximum grace-period duration: 5775.46 > > > > shazptr_wildcard (i.e. readers always use SHAZPTR_WILDCARD): > > > > Average grace-period duration: 599.569 microseconds > > Minimum grace-period duration: 1.432 > > 50th percentile grace-period duration: 582.631 > > 90th percentile grace-period duration: 781.704 > > 99th percentile grace-period duration: 1160.26 > > Maximum grace-period duration: 6727.53 > > > > shazptr_wildcard (skip_synchronize_self_scan=1): > > > > Average grace-period duration: 460.466 microseconds > > Minimum grace-period duration: 303.546 > > 50th percentile grace-period duration: 424.334 > > 90th percentile grace-period duration: 482.637 > > 99th percentile grace-period duration: 600.214 > > Maximum grace-period duration: 4126.94 > > > > Boqun Feng (8): > > Introduce simple hazard pointers > > shazptr: Add refscale test > > shazptr: Add refscale test for wildcard > > shazptr: Avoid synchronize_shaptr() busy waiting > > shazptr: Allow skip self scan in synchronize_shaptr() > > rcuscale: Allow rcu_scale_ops::get_gp_seq to be NULL > > rcuscale: Add tests for simple hazard pointers > > locking/lockdep: Use shazptr to protect the key hashlist > > > > include/linux/shazptr.h | 73 +++++++++ > > kernel/locking/Makefile | 2 +- > > kernel/locking/lockdep.c | 11 +- > > kernel/locking/shazptr.c | 318 +++++++++++++++++++++++++++++++++++++++ > > kernel/rcu/rcuscale.c | 60 +++++++- > > kernel/rcu/refscale.c | 77 ++++++++++ > > 6 files changed, 534 insertions(+), 7 deletions(-) > > create mode 100644 include/linux/shazptr.h > > create mode 100644 kernel/locking/shazptr.c > > > > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com
On Tue, Jun 24, 2025 at 08:10:53PM -0700, Boqun Feng wrote: > Hi, > > This is the official first version of simple hazard pointers following > the RFC: Can you please put an explanation of what hazard pointers are prominently into this cover letter?
On Wed, Jun 25, 2025 at 05:05:11AM -0700, Christoph Hellwig wrote: > On Tue, Jun 24, 2025 at 08:10:53PM -0700, Boqun Feng wrote: > > Hi, > > > > This is the official first version of simple hazard pointers following > > the RFC: > > Can you please put an explanation of what hazard pointers are > prominently into this cover letter? > Sure, I will put one for the future version, here is the gist: Hazard pointers provide the similar synchronzation behavior as RCU: readers are cheap, updaters need to wait for existing readers to go before they can free the objects. The difference between hazard pointers and RCU is that instead of waiting for a grace period, which all the readers have to exit the RCU read-side critical sections, the updaters of hazard pointers only need to wait for the readers that are accessing the objects they are about to free. For example, if we have 2 readers accessing different objects and 1 updater is freeing one of them: using RCU: Reader 1 Reader 2 Updater ======== ======== ======= rcu_read_lock(); r = rcu_dereference(a); rcu_read_lock(); r = rcu_dereference(b); synchronize_rcu(); rcu_read_unlock(); rcu_read_unlock(); <synchronize_rcu() returns> free(a); The updater will need to wait for reader 2 to finish before it can free 'a', however when using hazard pointers: Reader 1 Reader 2 Updater ======== ======== ======= g = shazptr_acquire(a); g = shazptr_acqurie(b); synchronize_shazptr(a); shazptr_clear(g); <synchronize_shazptr(a) returns> free(a); shazptr_clear(g); // <- updater doesn't // need to wait for // this. The updater's wait can finish immediately if no one is accessing 'a', in other words it doesn't need to wait for reader 2. This means for a particular workload, hazard pointers may have smaller memory footprint and less updater wait time compared to RCU, while still have the similar performance on the reader side. That being said, it does come with some cost, the readers would need to provide their own hazard pointer slots (allocating memory) in general cases. And in the simple hazard pointer implementation in this series, although readers don't need to provide their own hazard pointer slots, they need to disable the preemption to use the hazard pointer, and the performance would downgrade (to a naive SRCU implementation probably) if they want to protect multiple objects in one read-side critical section. Regards, Boqun
On Wed, Jun 25, 2025 at 07:08:57AM -0700, Boqun Feng wrote: > Sure, I will put one for the future version, here is the gist: Thanks a lot! > The updater's wait can finish immediately if no one is accessing 'a', in > other words it doesn't need to wait for reader 2. So basically it is the RCU concept, but limited to protecting exactly one pointer update per critical section with no ability for the read to e.g. acquire a refcount on the objected pointed to by that pointer?
On Thu, Jun 26, 2025 at 03:16:49AM -0700, Christoph Hellwig wrote: > On Wed, Jun 25, 2025 at 07:08:57AM -0700, Boqun Feng wrote: > > Sure, I will put one for the future version, here is the gist: > > Thanks a lot! > > > The updater's wait can finish immediately if no one is accessing 'a', in > > other words it doesn't need to wait for reader 2. > > So basically it is the RCU concept, but limited to protecting exactly > one pointer update per critical section with no ability for the read > to e.g. acquire a refcount on the objected pointed to by that pointer? For the current simple hazard pointer, yes. But simple hazard pointers is easily to extend so support reading: { gp is a global pointer } Reader Updater ====== ======= g = shazptr_acquire(p): WRITE_ONCE(*this_cpu_ptr(slot), gp); smp_mb(); if (READ_ONCE(gp) == *this_cpu_ptr(slot)) { // still being protected. <can read gp here> to_free = READ_ONCE(gp); WRITE_ONCE(gp, new); synchronize_shazptr(to_free): smp_mb(); // wait on the slot of reader // CPU being 0. READ_ONCE(per_cpu(reader, slot)); } shazptr_clear(g): WRITE_ONCE(*this_cpu_ptr(slot), NULL); // unblock synchronize_shazptr() Usually the shazptr_acqurie() + "pointer comparison"* is called shazptr_try_protect(). I will add a document about this in the next version along with other bits of hazard pointers. [*]: The pointer comparison is more complicated topic, but Mathieu has figured out how to do it correctly: https://lore.kernel.org/lkml/20241008135034.1982519-2-mathieu.desnoyers@efficios.com/ Regards, Boqun
On Thu, Jun 26, 2025 at 08:47:25AM -0700, Boqun Feng wrote: > On Thu, Jun 26, 2025 at 03:16:49AM -0700, Christoph Hellwig wrote: > > On Wed, Jun 25, 2025 at 07:08:57AM -0700, Boqun Feng wrote: > > > Sure, I will put one for the future version, here is the gist: > > > > Thanks a lot! > > > > > The updater's wait can finish immediately if no one is accessing 'a', in > > > other words it doesn't need to wait for reader 2. > > > > So basically it is the RCU concept, but limited to protecting exactly > > one pointer update per critical section with no ability for the read > > to e.g. acquire a refcount on the objected pointed to by that pointer? > > For the current simple hazard pointer, yes. But simple hazard pointers > is easily to extend so support reading: > > { gp is a global pointer } > > Reader Updater > ====== ======= > g = shazptr_acquire(p): > WRITE_ONCE(*this_cpu_ptr(slot), gp); > smp_mb(); > > if (READ_ONCE(gp) == *this_cpu_ptr(slot)) { > // still being protected. > <can read gp here> > to_free = READ_ONCE(gp); > WRITE_ONCE(gp, new); > synchronize_shazptr(to_free): > smp_mb(); > // wait on the slot of reader > // CPU being 0. > READ_ONCE(per_cpu(reader, slot)); > } > > shazptr_clear(g): > WRITE_ONCE(*this_cpu_ptr(slot), NULL); // unblock synchronize_shazptr() > > > Usually the shazptr_acqurie() + "pointer comparison"* is called > shazptr_try_protect(). > > I will add a document about this in the next version along with other > bits of hazard pointers. > > [*]: The pointer comparison is more complicated topic, but Mathieu has > figured out how to do it correctly: > > https://lore.kernel.org/lkml/20241008135034.1982519-2-mathieu.desnoyers@efficios.com/ It might be helpful to add that, at a high level, hazard pointers are a scalable replacement for reference counting. At a similarly high level, RCU is a scalable replacement for reader-writer locking. At lower levels, there is considerable overlap in applicability, so that you can use RCU to replace many reference-counting use cases and hazard pointers to replace many reader-writer-locking use cases.. Plus, as both Mathieu and Boqun pointed out, both RCU and hazard pointers can be combined with other synchronization mechanisms, including each other. Thanx, Paul
On 2025-06-26 06:16, Christoph Hellwig wrote: > On Wed, Jun 25, 2025 at 07:08:57AM -0700, Boqun Feng wrote: >> Sure, I will put one for the future version, here is the gist: > > Thanks a lot! > >> The updater's wait can finish immediately if no one is accessing 'a', in >> other words it doesn't need to wait for reader 2. > > So basically it is the RCU concept, but limited to protecting exactly > one pointer update per critical section with no ability for the read > to e.g. acquire a refcount on the objected pointed to by that pointer? FWIW, hazard pointers can be chained with other existence guarantee mechanisms. I've done prototypes that use hazard pointers chained with a reference counter in the object to implement something similar to smart pointers. Let me know if you are interested in the details. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
© 2016 - 2025 Red Hat, Inc.