I picked up PeterZ's futex patch from
https://lore.kernel.org/all/20250624190118.GB1490279@noisy.programming.kicks-ass.net/

and I am posting it here now so it can be staged for v6.17.

This survived a few days on my machine and the compile robot reported
that it passes its tests.

v1…v2 https://lore.kernel.org/all/20250707143623.70325-1-bigeasy@linutronix.de
- Removed the IMMUTABLE bits
- There was a race if the application exits while the RCU callback is
  pending. Plugged with mmget()/mmput_async().

Changes since its initial posting:
- A patch description has been added
- The testsuite is "fixed" slightly differently and has been split out
- futex_mm_init() is fixed up.
- The guard(preempt) has been replaced with guard(rcu) since there is
  no reason to disable preemption.

Since it was not yet released, should we rip out the IMMUTABLE bits and
just stick with GET/SET slots?

Peter Zijlstra (1):
  futex: Use RCU-based per-CPU reference counting instead of rcuref_t

Sebastian Andrzej Siewior (5):
  selftests/futex: Adapt the private hash test to RCU related changes
  futex: Make futex_private_hash_get() static
  futex: Remove support for IMMUTABLE
  selftests/futex: Remove support for IMMUTABLE
  perf bench futex: Remove support for IMMUTABLE

 include/linux/futex.h                         |  16 +-
 include/linux/mm_types.h                      |   5 +
 include/linux/sched/mm.h                      |   2 +-
 include/uapi/linux/prctl.h                    |   2 -
 init/Kconfig                                  |   4 -
 kernel/fork.c                                 |   8 +-
 kernel/futex/core.c                           | 281 ++++++++++++++----
 kernel/futex/futex.h                          |   2 -
 tools/include/uapi/linux/prctl.h              |   2 -
 tools/perf/bench/futex-hash.c                 |   1 -
 tools/perf/bench/futex-lock-pi.c              |   1 -
 tools/perf/bench/futex-requeue.c              |   1 -
 tools/perf/bench/futex-wake-parallel.c        |   1 -
 tools/perf/bench/futex-wake.c                 |   1 -
 tools/perf/bench/futex.c                      |  21 +-
 tools/perf/bench/futex.h                      |   1 -
 .../trace/beauty/include/uapi/linux/prctl.h   |   2 -
 .../futex/functional/futex_priv_hash.c        | 113 +++----
 18 files changed, 315 insertions(+), 149 deletions(-)

--
2.50.0
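[Editorial note: the exit race mentioned in the v1…v2 changelog has roughly the following shape. This is only an illustrative sketch; futex_ref_rcu() and the mm->futex_rcu head are hypothetical names, not necessarily the identifiers used in kernel/futex/core.c.]

	/*
	 * Sketch of the mmget()/mmput_async() fix: pin the mm before arming
	 * the RCU callback so it cannot go away if the task exits while the
	 * callback is still pending.
	 */
	static void futex_ref_rcu(struct rcu_head *head)
	{
		struct mm_struct *mm = container_of(head, struct mm_struct, futex_rcu);

		/* ... switch the private-hash reference counting mode ... */

		/*
		 * Drop the reference taken before call_rcu(). mmput_async() is
		 * used because this runs from RCU (softirq) context, where the
		 * final mmput() must not happen synchronously.
		 */
		mmput_async(mm);
	}

	static void futex_ref_arm(struct mm_struct *mm)
	{
		mmget(mm);
		call_rcu(&mm->futex_rcu, futex_ref_rcu);
	}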
On 7/10/25 16:30, Sebastian Andrzej Siewior wrote:
> I picked up PeterZ's futex patch from
> https://lore.kernel.org/all/20250624190118.GB1490279@noisy.programming.kicks-ass.net/
>
> and I am posting it here now so it can be staged for v6.17.
>
> This survived a few days on my machine and the compile robot reported
> that it passes its tests.
>
> v1…v2 https://lore.kernel.org/all/20250707143623.70325-1-bigeasy@linutronix.de
> - Removed the IMMUTABLE bits
> - There was a race if the application exits while the RCU callback is
>   pending. Plugged with mmget()/mmput_async().
>
> Changes since its initial posting:
> - A patch description has been added
> - The testsuite is "fixed" slightly differently and has been split out
> - futex_mm_init() is fixed up.
> - The guard(preempt) has been replaced with guard(rcu) since there is
>   no reason to disable preemption.
>
> Since it was not yet released, should we rip out the IMMUTABLE bits and
> just stick with GET/SET slots?
>
> Peter Zijlstra (1):
>   futex: Use RCU-based per-CPU reference counting instead of rcuref_t
>
> Sebastian Andrzej Siewior (5):
>   selftests/futex: Adapt the private hash test to RCU related changes
>   futex: Make futex_private_hash_get() static
>   futex: Remove support for IMMUTABLE
>   selftests/futex: Remove support for IMMUTABLE
>   perf bench futex: Remove support for IMMUTABLE
>

Hi. Sorry for not stumbling upon this earlier. Saw these now.

Since perf bench had shown a significant regression last time around, and
for which the immutable option was added, I gave perf futex a try again.

Below are the results. Ran on a 5 core LPAR (VM) on power; perf was compiled
from tools/perf.

===========
baseline:
===========
tip/master at
commit 8784fb5fa2e0042fe3b1632d4876e1037b695f56 (HEAD)
Author: Borislav Petkov (AMD) <bp@alien8.de>

./perf bench futex hash
Averaged 1559643 operations/sec (+- 0.09%), total secs = 10
Futex hashing: global hash

schbench -t 64 -r 5 -i 5
current rps: 2629.85

schbench -L -m 4 -M auto -t 64 -n 0 -r 5 -i 5
current rps: 1538674.22

=================
baseline + series
=================
./perf bench futex hash
Averaged 306403 operations/sec (+- 0.29%), total secs = 10    <<< around 1/5th of baseline.
Futex hashing: auto resized to 256 buckets                    <<< maybe resize doesn't happen fast?

./perf bench futex hash -b 512                                <<< gave 512 buckets
Averaged 1412543 operations/sec (+- 0.14%), total secs = 10   <<< much better numbers, still off by 8-10%.
Futex hashing: 512 hash buckets

(512 is the number of buckets that baseline would have used; increased the
buckets to 8192 for a trial)

./perf bench futex hash -b 8192
Averaged 1441627 operations/sec (+- 0.14%), total secs = 10
Futex hashing: 8192 hash buckets

schbench -t 64 -r 5 -i 5
current rps: 2656.85                                          <<< schbench seems good.
schbench -L -m 4 -M auto -t 64 -n 0 -r 5 -i 5
current rps: 1539273.79
On 2025-07-15 21:29:34 [+0530], Shrikanth Hegde wrote:
> Hi. Sorry for not stumbling upon this earlier. Saw these now.
>
> Since perf bench had shown a significant regression last time around, and
> for which the immutable option was added, I gave perf futex a try again.
>
> Below are the results. Ran on a 5 core LPAR (VM) on power; perf was compiled
> from tools/perf.

Thank you.

If you use perf-bench with -b then the buckets are applied
"immediately". It mostly works also with auto scaling. The problem is
that perf creates the threads and immediately afterwards it starts the
test. While the RCU callback kicks in shortly after, no transition
happens until the test completes / the threads terminate. The reason is
that several private-hash references are in use because some threads
are always in the futex() syscall.

It would require something like commit
a255b78d14324 ("selftests/futex: Adapt the private hash test to RCU related changes")
to make this transition happen before the test starts.

Your schbench seems not affected?

If you use -b, is it better than or equal compared to the immutable
option? This isn't quite clear.

Sebastian
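[Editorial note: the selftest change referenced above essentially waits for the private-hash transition before timing anything. A minimal userspace sketch of that idea, assuming the PR_FUTEX_HASH / PR_FUTEX_HASH_GET_SLOTS prctl interface from the uapi prctl.h touched by this series; the exact return-value semantics are an assumption here.]

	#include <stdio.h>
	#include <unistd.h>
	#include <sys/prctl.h>
	#include <linux/prctl.h>

	/*
	 * Illustrative only: after the worker threads have been created, poll
	 * the per-process futex hash size until the resize/transition has taken
	 * effect, so the benchmark does not measure the pre-transition window.
	 */
	static void wait_for_private_hash(void)
	{
	#ifdef PR_FUTEX_HASH
		for (int i = 0; i < 100; i++) {
			int slots = prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS, 0, 0, 0);

			if (slots > 0)		/* assumption: >0 means a private hash is in place */
				return;
			usleep(10 * 1000);	/* give the RCU grace period a chance to elapse */
		}
		fprintf(stderr, "private hash transition did not happen in time\n");
	#endif
	}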
On 7/15/25 22:01, Sebastian Andrzej Siewior wrote:
> On 2025-07-15 21:29:34 [+0530], Shrikanth Hegde wrote:
>> Hi. Sorry for not stumbling upon this earlier. Saw these now.
>>
>> Since perf bench had shown a significant regression last time around, and
>> for which the immutable option was added, I gave perf futex a try again.
>>
>> Below are the results. Ran on a 5 core LPAR (VM) on power; perf was compiled
>> from tools/perf.
>
> Thank you.
>
> If you use perf-bench with -b then the buckets are applied
> "immediately". It mostly works also with auto scaling. The problem is
> that perf creates the threads and immediately afterwards it starts the
> test. While the RCU callback kicks in shortly after, no transition
> happens until the test completes / the threads terminate. The reason is
> that several private-hash references are in use because some threads
> are always in the futex() syscall.
>
> It would require something like commit
> a255b78d14324 ("selftests/futex: Adapt the private hash test to RCU related changes")
> to make this transition happen before the test starts.
>
> Your schbench seems not affected?

Yes. schbench shows similar numbers.

> If you use -b, is it better than or equal compared to the immutable
> option? This isn't quite clear.

I did try again by going to baseline and removing BROKEN, then ran the below,
which gives us the immutable numbers.

./perf bench futex hash -Ib512
Averaged 1536035 operations/sec (+- 0.11%), total secs = 10
Futex hashing: 512 hash buckets (immutable)

So, with the -b 512 option, it is around 8-10% less compared to immutable.
On Tue, Jul 15, 2025 at 10:34:24PM +0530, Shrikanth Hegde wrote:

> I did try again by going to baseline and removing BROKEN, then ran the below,
> which gives us the immutable numbers.
> ./perf bench futex hash -Ib512
> Averaged 1536035 operations/sec (+- 0.11%), total secs = 10
> Futex hashing: 512 hash buckets (immutable)
>
> So, with the -b 512 option, it is around 8-10% less compared to immutable.

Urgh, can you run perf on that and tell me if this is due to
this_cpu_{inc,dec}() doing local_irq_disable() or the smp_load_acquire()
doing LWSYNC ?

Anyway, I think we can improve both. Does the below help?

---
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index d9bb5567af0c..8c41d050bd1f 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1680,10 +1680,10 @@ static bool futex_ref_get(struct futex_private_hash *fph)
 {
 	struct mm_struct *mm = fph->mm;
 
-	guard(rcu)();
+	guard(preempt)();
 
-	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
-		this_cpu_inc(*mm->futex_ref);
+	if (READ_ONCE(fph->state) == FR_PERCPU) {
+		__this_cpu_inc(*mm->futex_ref);
 		return true;
 	}
 
@@ -1694,10 +1694,10 @@ static bool futex_ref_put(struct futex_private_hash *fph)
 {
 	struct mm_struct *mm = fph->mm;
 
-	guard(rcu)();
+	guard(preempt)();
 
-	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
-		this_cpu_dec(*mm->futex_ref);
+	if (READ_ONCE(fph->state) == FR_PERCPU) {
+		__this_cpu_dec(*mm->futex_ref);
 		return false;
 	}
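[Editorial note: why the patch above can help on non-x86: with the generic per-cpu fallbacks, this_cpu_inc() masks interrupts around each update, while __this_cpu_inc() relies on the caller having already disabled preemption. A simplified sketch of the generic fallbacks (not the exact macros from include/linux/percpu-defs.h / asm-generic/percpu.h):]

	/* Roughly what this_cpu_inc() expands to on architectures without
	 * native atomic per-cpu ops: irq off/on around every increment. */
	#define sketch_this_cpu_inc(pcp)			\
	do {							\
		unsigned long __flags;				\
		raw_local_irq_save(__flags);			\
		raw_cpu_add(pcp, 1);				\
		raw_local_irq_restore(__flags);			\
	} while (0)

	/* __this_cpu_inc() skips the irq dance entirely; it is only safe
	 * because the caller pins the CPU, here via guard(preempt)() in
	 * futex_ref_get()/futex_ref_put(). */
	#define sketch___this_cpu_inc(pcp)	raw_cpu_add(pcp, 1)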
On 7/16/25 19:59, Peter Zijlstra wrote:
> On Tue, Jul 15, 2025 at 10:34:24PM +0530, Shrikanth Hegde wrote:
>
>> I did try again by going to baseline and removing BROKEN, then ran the below,
>> which gives us the immutable numbers.
>> ./perf bench futex hash -Ib512
>> Averaged 1536035 operations/sec (+- 0.11%), total secs = 10
>> Futex hashing: 512 hash buckets (immutable)
>>
>> So, with the -b 512 option, it is around 8-10% less compared to immutable.
>
> Urgh, can you run perf on that and tell me if this is due to
> this_cpu_{inc,dec}() doing local_irq_disable() or the smp_load_acquire()
> doing LWSYNC ?

It seems to be due to RCU and irq enable. Both perf records were collected with -b512.

base_futex_immutable_b512 - perf record collected with baseline + remove BROKEN + ./perf bench futex hash -Ib512
per_cpu_futex_hash_b_512  - baseline + series + ./perf bench futex hash -b512

perf diff base_futex_immutable_b512 per_cpu_futex_hash_b_512

# Event 'cycles'
#
# Baseline  Delta Abs  Shared Object     Symbol
# ........  .........  ................  ....................................................
#
    21.62%     -2.26%  [kernel.vmlinux]  [k] futex_get_value_locked
     0.16%     +2.01%  [kernel.vmlinux]  [k] __rcu_read_unlock
     1.35%     +1.63%  [kernel.vmlinux]  [k] arch_local_irq_restore.part.0
               +1.48%  [kernel.vmlinux]  [k] futex_private_hash_put
               +1.16%  [kernel.vmlinux]  [k] futex_ref_get
    10.41%     -0.78%  [kernel.vmlinux]  [k] system_call_vectored_common
     1.24%     +0.72%  perf              [.] workerfn
     5.32%     -0.66%  [kernel.vmlinux]  [k] futex_q_lock
     2.48%     -0.43%  [kernel.vmlinux]  [k] futex_wait
     2.47%     -0.40%  [kernel.vmlinux]  [k] _raw_spin_lock
     2.98%     -0.35%  [kernel.vmlinux]  [k] futex_q_unlock
     2.42%     -0.34%  [kernel.vmlinux]  [k] __futex_wait
     5.47%     -0.32%  libc.so.6         [.] syscall
     4.03%     -0.32%  [kernel.vmlinux]  [k] memcpy_power7
     0.16%     +0.22%  [kernel.vmlinux]  [k] arch_local_irq_restore
     5.93%     -0.18%  [kernel.vmlinux]  [k] futex_hash
     1.72%     -0.17%  [kernel.vmlinux]  [k] sys_futex

> Anyway, I think we can improve both. Does the below help?
>
> ---
> diff --git a/kernel/futex/core.c b/kernel/futex/core.c
> index d9bb5567af0c..8c41d050bd1f 100644
> --- a/kernel/futex/core.c
> +++ b/kernel/futex/core.c
> @@ -1680,10 +1680,10 @@ static bool futex_ref_get(struct futex_private_hash *fph)
>  {
>  	struct mm_struct *mm = fph->mm;
> 
> -	guard(rcu)();
> +	guard(preempt)();
> 
> -	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
> -		this_cpu_inc(*mm->futex_ref);
> +	if (READ_ONCE(fph->state) == FR_PERCPU) {
> +		__this_cpu_inc(*mm->futex_ref);
>  		return true;
>  	}
> 
> @@ -1694,10 +1694,10 @@ static bool futex_ref_put(struct futex_private_hash *fph)
>  {
>  	struct mm_struct *mm = fph->mm;
> 
> -	guard(rcu)();
> +	guard(preempt)();
> 
> -	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
> -		this_cpu_dec(*mm->futex_ref);
> +	if (READ_ONCE(fph->state) == FR_PERCPU) {
> +		__this_cpu_dec(*mm->futex_ref);
>  		return false;
>  	}

Yes. It helps. It improves the "-b 512" numbers by at least 5%.

baseline + series:
Averaged 1412543 operations/sec (+- 0.14%), total secs = 10
Futex hashing: 512 hash buckets

baseline + series + above patch:
Averaged 1482733 operations/sec (+- 0.26%), total secs = 10   <<< 5% improvement
Futex hashing: 512 hash buckets

Now we are within 4-5% of baseline/immutable.

baseline:
commit 8784fb5fa2e0042fe3b1632d4876e1037b695f56 (HEAD)
./perf bench futex hash
Averaged 1559643 operations/sec (+- 0.09%), total secs = 10
Futex hashing: global hash