Hi,

This RFC is mostly a follow-up on the discussion at:

  https://lore.kernel.org/lkml/20250321-lockdep-v1-1-78b732d195fb@debian.org/

I found that using a hazard pointer variant can speed up
lockdep_unregister_key(). On my system (a 96-CPU VM), the results of:

	time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq

are (without the patchset):

	real	0m1.039s
	user	0m0.001s
	sys	0m0.069s

(with the patchset):

	real	0m0.053s
	user	0m0.000s
	sys	0m0.051s

i.e. almost a 20x speed-up.

For other comparisons between RCU and shazptr, here are the rcuscale
results (using the default configuration from
tools/testing/selftests/rcutorture/bin/kvm.sh):

RCU:

	Average grace-period duration: 7470.02 microseconds
	Minimum grace-period duration: 3981.6
	50th percentile grace-period duration: 6002.73
	90th percentile grace-period duration: 7008.93
	99th percentile grace-period duration: 10015
	Maximum grace-period duration: 142228

shazptr:

	Average grace-period duration: 0.845825 microseconds
	Minimum grace-period duration: 0.199
	50th percentile grace-period duration: 0.585
	90th percentile grace-period duration: 1.656
	99th percentile grace-period duration: 3.872
	Maximum grace-period duration: 3049.05

shazptr (skip_synchronize_self_scan=1, i.e. always let the scan kthread
do the wakeup):

	Average grace-period duration: 467.861 microseconds
	Minimum grace-period duration: 92.913
	50th percentile grace-period duration: 440.691
	90th percentile grace-period duration: 460.623
	99th percentile grace-period duration: 650.068
	Maximum grace-period duration: 5775.46

shazptr_wildcard (i.e. readers always use SHAZPTR_WILDCARD):

	Average grace-period duration: 599.569 microseconds
	Minimum grace-period duration: 1.432
	50th percentile grace-period duration: 582.631
	90th percentile grace-period duration: 781.704
	99th percentile grace-period duration: 1160.26
	Maximum grace-period duration: 6727.53

shazptr_wildcard (skip_synchronize_self_scan=1):

	Average grace-period duration: 460.466 microseconds
	Minimum grace-period duration: 303.546
	50th percentile grace-period duration: 424.334
	90th percentile grace-period duration: 482.637
	99th percentile grace-period duration: 600.214
	Maximum grace-period duration: 4126.94

Overall it looks promising to me, but I would like to see how it
performs in Breno's environment. Also, as Paul always reminds me: buggy
code usually runs faster, so please take a look in case I'm missing
something ;-) Thanks!

The patchset is based on v6.15-rc1.

Boqun Feng (8):
  Introduce simple hazard pointers
  shazptr: Add refscale test
  shazptr: Add refscale test for wildcard
  shazptr: Avoid synchronize_shazptr() busy waiting
  shazptr: Allow skip self scan in synchronize_shazptr()
  rcuscale: Allow rcu_scale_ops::get_gp_seq to be NULL
  rcuscale: Add tests for simple hazard pointers
  locking/lockdep: Use shazptr to protect the key hashlist

 include/linux/shazptr.h  |  73 +++++++++
 kernel/locking/Makefile  |   2 +-
 kernel/locking/lockdep.c |  11 +-
 kernel/locking/shazptr.c | 318 +++++++++++++++++++++++++++++++++++++++
 kernel/rcu/rcuscale.c    |  60 +++++++-
 kernel/rcu/refscale.c    |  77 ++++++++++
 6 files changed, 534 insertions(+), 7 deletions(-)
 create mode 100644 include/linux/shazptr.h
 create mode 100644 kernel/locking/shazptr.c

--
2.47.1
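For readers new to hazard pointers, here is a minimal userspace sketch
of the general protocol the numbers above are exercising. It is
illustrative only, not the patchset's code: every name in it
(my_hazptr_slot, my_acquire, my_release, my_synchronize, MY_WILDCARD,
NSLOTS) is made up for the example, the per-CPU slots are approximated
with a fixed array, and the busy-wait in my_synchronize() is exactly
what the "Avoid synchronize_shazptr() busy waiting" patch hands off to
the scan kthread mentioned above.

/*
 * Illustrative userspace sketch of a simple hazard pointer scheme.
 * Not the patchset's implementation; all names are hypothetical.
 * Build: cc -std=c11 -o sketch sketch.c
 */
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

#define NSLOTS		4		/* stand-in for one slot per CPU */
#define MY_WILDCARD	((void *)1)	/* reader: "I may hold anything" */

static _Atomic(void *) my_hazptr_slot[NSLOTS];

/* Reader: publish the pointer we are about to dereference. */
static void *my_acquire(int slot, _Atomic(void *) *p)
{
	void *ptr = atomic_load(p);

	atomic_store(&my_hazptr_slot[slot], ptr);
	/*
	 * A real implementation needs a full barrier here and must
	 * re-check that *p still equals ptr before dereferencing.
	 */
	return ptr;
}

/* Reader: done with the object; clear the slot. */
static void my_release(int slot)
{
	atomic_store(&my_hazptr_slot[slot], NULL);
}

/*
 * Updater: after unpublishing an object, wait until no slot still
 * advertises it (or the wildcard).  This version spins; per the
 * patch list above, synchronize_shazptr() avoids busy waiting by
 * deferring the scan to a kthread.
 */
static void my_synchronize(void *ptr)
{
	for (int i = 0; i < NSLOTS; i++) {
		void *val;

		while ((val = atomic_load(&my_hazptr_slot[i])) == ptr ||
		       val == MY_WILDCARD)
			;	/* slot i still blocks us; spin */
	}
}

int main(void)
{
	static int obj = 42;
	_Atomic(void *) shared = &obj;

	int *r = my_acquire(0, &shared);	/* reader on "CPU" 0 */
	printf("reader sees %d\n", *r);
	my_release(0);

	atomic_store(&shared, NULL);		/* updater unpublishes... */
	my_synchronize(&obj);			/* ...then waits for readers */
	return 0;
}

The wildcard lets a reader advertise "whatever I may be holding" at the
cost of blocking every concurrent synchronizer, which is what the
shazptr_wildcard numbers above measure.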
Hi Boqun,

On Sun, Apr 13, 2025 at 11:00:47PM -0700, Boqun Feng wrote:

> Overall it looks promising to me, but I would like to see how it
> performs in Breno's environment. Also, as Paul always reminds me:
> buggy code usually runs faster, so please take a look in case I'm
> missing something ;-) Thanks!

Thanks for the patchset. I've confirmed that the wins are large in my
environment, but at the same magnitude as synchronize_rcu_expedited().

Here are the numbers I got:

6.15-rc1 (upstream):

	# time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1234: mq
	real	0m3.986s
	user	0m0.001s
	sys	0m0.093s

Your patchset on top of 6.15-rc1:

	# time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1234: mq
	real	0m0.072s
	user	0m0.001s
	sys	0m0.070s

My original proposal of using synchronize_rcu_expedited() [1]:

	# time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1234: mq
	real	0m0.074s
	user	0m0.001s
	sys	0m0.061s

Link: https://lore.kernel.org/all/20250321-lockdep-v1-1-78b732d195fb@debian.org/ [1]

Thanks for working on it,
--breno
On Wed, Apr 16, 2025 at 07:14:04AM -0700, Breno Leitao wrote:
> Hi Boqun,
>
> On Sun, Apr 13, 2025 at 11:00:47PM -0700, Boqun Feng wrote:
>
> > Overall it looks promising to me, but I would like to see how it
> > performs in Breno's environment. Also, as Paul always reminds me:
> > buggy code usually runs faster, so please take a look in case I'm
> > missing something ;-) Thanks!
>
> Thanks for the patchset. I've confirmed that the wins are large in my
> environment, but at the same magnitude as synchronize_rcu_expedited().
>
> Here are the numbers I got:
>
> 6.15-rc1 (upstream):
>
> 	# time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1234: mq
> 	real	0m3.986s
> 	user	0m0.001s
> 	sys	0m0.093s
>
> Your patchset on top of 6.15-rc1:
>
> 	# time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1234: mq
> 	real	0m0.072s
> 	user	0m0.001s
> 	sys	0m0.070s
>
> My original proposal of using synchronize_rcu_expedited() [1]:
>
> 	# time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1234: mq
> 	real	0m0.074s
> 	user	0m0.001s
> 	sys	0m0.061s
>
> Link: https://lore.kernel.org/all/20250321-lockdep-v1-1-78b732d195fb@debian.org/ [1]

Could you please also do the test of the first scenario with a regular
synchronize_rcu(), but switched to its faster variant:

	echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp

and run the test, if you have time.

Thank you!

--
Vlad Rezki
Hello Vlad,

On Wed, Apr 16, 2025 at 05:04:31PM +0200, Uladzislau Rezki wrote:
> On Wed, Apr 16, 2025 at 07:14:04AM -0700, Breno Leitao wrote:
> > Hi Boqun,
> >
> > On Sun, Apr 13, 2025 at 11:00:47PM -0700, Boqun Feng wrote:
> >
> > > Overall it looks promising to me, but I would like to see how it
> > > performs in Breno's environment. Also, as Paul always reminds me:
> > > buggy code usually runs faster, so please take a look in case I'm
> > > missing something ;-) Thanks!
> >
> > Thanks for the patchset. I've confirmed that the wins are large in my
> > environment, but at the same magnitude as synchronize_rcu_expedited().
> >
> > Here are the numbers I got:
> >
> > 6.15-rc1 (upstream):
> >
> > 	# time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1234: mq
> > 	real	0m3.986s
> > 	user	0m0.001s
> > 	sys	0m0.093s
> >
> > Your patchset on top of 6.15-rc1:
> >
> > 	# time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1234: mq
> > 	real	0m0.072s
> > 	user	0m0.001s
> > 	sys	0m0.070s
> >
> > My original proposal of using synchronize_rcu_expedited() [1]:
> >
> > 	# time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1234: mq
> > 	real	0m0.074s
> > 	user	0m0.001s
> > 	sys	0m0.061s
> >
> > Link: https://lore.kernel.org/all/20250321-lockdep-v1-1-78b732d195fb@debian.org/ [1]
>
> Could you please also do the test of the first scenario with a regular
> synchronize_rcu(), but switched to its faster variant:
>
> 	echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
>
> and run the test, if you have time.

Of course, I am more than interested in this topic. This is what I ran:

# /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq; time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1234: mq
real	0m4.150s
user	0m0.001s
sys	0m0.076s

[root@host2 ~]# echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
[root@host2 ~]# /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq; time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1234: mq
real	0m4.225s
user	0m0.000s
sys	0m0.106s

[root@host2 ~]# cat /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
1
[root@host2 ~]# echo 0 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
[root@host2 ~]# /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq; time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1234: mq
real	0m4.152s
user	0m0.001s
sys	0m0.099s

It seems it made very little difference?

Thanks
--breno
Hello, Breno!

> Hello Vlad,
>
> On Wed, Apr 16, 2025 at 05:04:31PM +0200, Uladzislau Rezki wrote:
> > On Wed, Apr 16, 2025 at 07:14:04AM -0700, Breno Leitao wrote:
> > > Hi Boqun,
> > >
> > > On Sun, Apr 13, 2025 at 11:00:47PM -0700, Boqun Feng wrote:
> > >
> > > > Overall it looks promising to me, but I would like to see how it
> > > > performs in Breno's environment. Also, as Paul always reminds me:
> > > > buggy code usually runs faster, so please take a look in case I'm
> > > > missing something ;-) Thanks!
> > >
> > > Thanks for the patchset. I've confirmed that the wins are large in
> > > my environment, but at the same magnitude as
> > > synchronize_rcu_expedited().
> > >
> > > Here are the numbers I got:
> > >
> > > 6.15-rc1 (upstream):
> > >
> > > 	# time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1234: mq
> > > 	real	0m3.986s
> > > 	user	0m0.001s
> > > 	sys	0m0.093s
> > >
> > > Your patchset on top of 6.15-rc1:
> > >
> > > 	# time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1234: mq
> > > 	real	0m0.072s
> > > 	user	0m0.001s
> > > 	sys	0m0.070s
> > >
> > > My original proposal of using synchronize_rcu_expedited() [1]:
> > >
> > > 	# time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1234: mq
> > > 	real	0m0.074s
> > > 	user	0m0.001s
> > > 	sys	0m0.061s
> > >
> > > Link: https://lore.kernel.org/all/20250321-lockdep-v1-1-78b732d195fb@debian.org/ [1]
> >
> > Could you please also do the test of the first scenario with a regular
> > synchronize_rcu(), but switched to its faster variant:
> >
> > 	echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
> >
> > and run the test, if you have time.
>
> Of course, I am more than interested in this topic. This is what I ran:
>
> # /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq; time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1234: mq
> real	0m4.150s
> user	0m0.001s
> sys	0m0.076s
>
> [root@host2 ~]# echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
> [root@host2 ~]# /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq; time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1234: mq
> real	0m4.225s
> user	0m0.000s
> sys	0m0.106s
>
> [root@host2 ~]# cat /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
> 1
> [root@host2 ~]# echo 0 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
> [root@host2 ~]# /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq; time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1234: mq
> real	0m4.152s
> user	0m0.001s
> sys	0m0.099s
>
> It seems it made very little difference?
>
Yep, no difference. In your case you just need to speed up grace-period
completion itself, which is why the expedited version really improves
your case. It also means you do not have a lot of callbacks that could
delay a normal GP, since rcu_normal_wake_from_gp only speeds up the
post-GP wakeup path and cannot help when the grace period itself is the
bottleneck.

Thank you!

--
Uladzislau Rezki