I picked up PeterZ's futex patch from
https://lore.kernel.org/all/20250624190118.GB1490279@noisy.programming.kicks-ass.net/
and I am posting it here now so it can be staged for v6.17.
This survived a few days on my machine and the compile robot reported
that it passes its tests.
v1…v2 https://lore.kernel.org/all/20250707143623.70325-1-bigeasy@linutronix.de
- Removed the IMMUTABLE bits
- There was a race if the application exits while the RCU callback is
  pending. Closed with mmget()/mmput_async().
Changes since its initial posting:
- A patch description has been added
- The testsuite is "fixed" slightly differently and has been split out
- futex_mm_init() is fixed up.
- The guard(preempt) has been replaced with guard(rcu) since there is
no reason to disable preemption.
Since it was not yet released, should we rip out the IMMUTABLE bits and
just stick with GET/SET slots?
Peter Zijlstra (1):
futex: Use RCU-based per-CPU reference counting instead of rcuref_t
Sebastian Andrzej Siewior (5):
selftests/futex: Adapt the private hash test to RCU related changes
futex: Make futex_private_hash_get() static
futex: Remove support for IMMUTABLE
selftests/futex: Remove support for IMMUTABLE
perf bench futex: Remove support for IMMUTABLE
include/linux/futex.h | 16 +-
include/linux/mm_types.h | 5 +
include/linux/sched/mm.h | 2 +-
include/uapi/linux/prctl.h | 2 -
init/Kconfig | 4 -
kernel/fork.c | 8 +-
kernel/futex/core.c | 281 ++++++++++++++----
kernel/futex/futex.h | 2 -
tools/include/uapi/linux/prctl.h | 2 -
tools/perf/bench/futex-hash.c | 1 -
tools/perf/bench/futex-lock-pi.c | 1 -
tools/perf/bench/futex-requeue.c | 1 -
tools/perf/bench/futex-wake-parallel.c | 1 -
tools/perf/bench/futex-wake.c | 1 -
tools/perf/bench/futex.c | 21 +-
tools/perf/bench/futex.h | 1 -
.../trace/beauty/include/uapi/linux/prctl.h | 2 -
.../futex/functional/futex_priv_hash.c | 113 +++----
18 files changed, 315 insertions(+), 149 deletions(-)
--
2.50.0
On 7/10/25 16:30, Sebastian Andrzej Siewior wrote:
> I picked up PeterZ's futex patch from
> https://lore.kernel.org/all/20250624190118.GB1490279@noisy.programming.kicks-ass.net/
> and I am posting it here now so it can be staged for v6.17.
[...]
> Since it was not yet released, should we rip out the IMMUTABLE bits and
> just stick with GET/SET slots?

Hi. Sorry for not stumbling upon this earlier. Saw these now.

Since perf bench had shown a significant regression last time around, and
for which the immutable option was added, I gave perf futex a try again.

Below are the results. Ran on a 5 core LPAR(VM) on power. perf was compiled
from tools/perf.

===========
baseline:
===========
tip/master at commit 8784fb5fa2e0042fe3b1632d4876e1037b695f56 (HEAD)
Author: Borislav Petkov (AMD) <bp@alien8.de>

./perf bench futex hash
Averaged 1559643 operations/sec (+- 0.09%), total secs = 10
Futex hashing: global hash

schbench -t 64 -r 5 -i 5
current rps: 2629.85

schbench -L -m 4 -M auto -t 64 -n 0 -r 5 -i 5
current rps: 1538674.22

=================
baseline + series
=================
./perf bench futex hash
Averaged 306403 operations/sec (+- 0.29%), total secs = 10   <<< around 1/5th of baseline.
Futex hashing: auto resized to 256 buckets                   <<< maybe resize doesn't happen fast?

./perf bench futex hash -b 512                               <<< Gave 512 buckets,
Averaged 1412543 operations/sec (+- 0.14%), total secs = 10  <<< much better numbers, still off by 8-10%.
Futex hashing: 512 hash buckets

(512 is the number of buckets that baseline would have used; increased the
buckets to 8192 for trial)

./perf bench futex hash -b 8192
Averaged 1441627 operations/sec (+- 0.14%), total secs = 10
Futex hashing: 8192 hash buckets

schbench -t 64 -r 5 -i 5
current rps: 2656.85                                         <<< schbench seems good.
schbench -L -m 4 -M auto -t 64 -n 0 -r 5 -i 5 current rps: 1539273.79
On 2025-07-15 21:29:34 [+0530], Shrikanth Hegde wrote:
> Hi. Sorry for not stumbling upon this earlier. Saw these now.
>
> Since perf bench had shown a significant regression last time around, and
> for which the immutable option was added, I gave perf futex a try again.
>
> Below are the results. Ran on a 5 core LPAR(VM) on power. perf was compiled from tools/perf.
Thank you.
If you use perf-bench with -b then the buckets are applied
"immediately". It mostly also works with auto scaling. The problem is
that perf creates the threads and starts the test immediately afterwards.
While the RCU callback kicks in shortly after, no transition happens
until the test completes / the threads terminate. The reason is that
several private-hash references are in use because some threads are
always in the futex() syscall.
It would require something like commit
a255b78d14324 ("selftests/futex: Adapt the private hash test to RCU related changes")
to have this transition before the test starts.
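For reference, a minimal userspace sketch of forcing such a transition before
the worker threads are created (assuming the PR_FUTEX_HASH prctl() interface
from this series; the constants should be checked against the uapi headers of
the running kernel, and 512 is just an example value):

#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_FUTEX_HASH
#define PR_FUTEX_HASH		78
#define PR_FUTEX_HASH_SET_SLOTS	1
#define PR_FUTEX_HASH_GET_SLOTS	2
#endif

int main(void)
{
	/* Ask for a fixed private hash before any threads exist, so the
	 * benchmark never starts out on the global hash. */
	if (prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, 512, 0))
		perror("PR_FUTEX_HASH_SET_SLOTS");

	/* Report the bucket count that is now in effect. */
	printf("slots: %d\n", (int)prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS, 0, 0));

	/* ... create worker threads and run the futex loop here ... */
	return 0;
}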
Your schbench seems not affected?
If you use -b, is it better than or equal to the immutable
option? This isn't quite clear.
Sebastian
On 7/15/25 22:01, Sebastian Andrzej Siewior wrote:
> On 2025-07-15 21:29:34 [+0530], Shrikanth Hegde wrote:
>> Hi. Sorry for not stumbling upon this earlier. Saw these now.
>>
>> Since perf bench had shown a significant regression last time around, and
>> for which the immutable option was added, I gave perf futex a try again.
>>
>> Below are the results. Ran on a 5 core LPAR(VM) on power. perf was compiled from tools/perf.
> Thank you.
>
> If you use perf-bench with -b then the buckets are applied
> "immediately". It mostly also works with auto scaling. The problem is
> that perf creates the threads and starts the test immediately afterwards.
> While the RCU callback kicks in shortly after, no transition happens
> until the test completes / the threads terminate. The reason is that
> several private-hash references are in use because some threads are
> always in the futex() syscall.
>
> It would require something like commit
> a255b78d14324 ("selftests/futex: Adapt the private hash test to RCU related changes")
>
> to have this transition before the test starts.
> Your schbench seems not affected?
Yes. schbench shows similar number.
>
> If you use -b, is it better than or equal to the immutable
> option? This isn't quite clear.
> Sebastian
I did try again by going back to baseline, removing BROKEN and running the command below, which gives us the immutable numbers.
./perf bench futex hash -Ib512
Averaged 1536035 operations/sec (+- 0.11%), total secs = 10
Futex hashing: 512 hash buckets (immutable)
So, with the -b 512 option, it is around 8-10% slower than immutable.
On Tue, Jul 15, 2025 at 10:34:24PM +0530, Shrikanth Hegde wrote:
> I did try again by going back to baseline, removing BROKEN and running the command below, which gives us the immutable numbers.
> ./perf bench futex hash -Ib512
> Averaged 1536035 operations/sec (+- 0.11%), total secs = 10
> Futex hashing: 512 hash buckets (immutable)
>
> So, with the -b 512 option, it is around 8-10% slower than immutable.
Urgh, can you run perf on that and tell me if this is due to
this_cpu_{inc,dec}() doing local_irq_disable() or the smp_load_acquire()
doing LWSYNC ?
Anyway, I think we can improve both. Does the below help?
---
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index d9bb5567af0c..8c41d050bd1f 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1680,10 +1680,10 @@ static bool futex_ref_get(struct futex_private_hash *fph)
{
struct mm_struct *mm = fph->mm;
- guard(rcu)();
+ guard(preempt)();
- if (smp_load_acquire(&fph->state) == FR_PERCPU) {
- this_cpu_inc(*mm->futex_ref);
+ if (READ_ONCE(fph->state) == FR_PERCPU) {
+ __this_cpu_inc(*mm->futex_ref);
return true;
}
@@ -1694,10 +1694,10 @@ static bool futex_ref_put(struct futex_private_hash *fph)
{
struct mm_struct *mm = fph->mm;
- guard(rcu)();
+ guard(preempt)();
- if (smp_load_acquire(&fph->state) == FR_PERCPU) {
- this_cpu_dec(*mm->futex_ref);
+ if (READ_ONCE(fph->state) == FR_PERCPU) {
+ __this_cpu_dec(*mm->futex_ref);
return false;
}
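For readability, this is roughly what the two fast paths look like with the
hunks above applied (a reconstruction from the diff, not the full functions;
the FR_ATOMIC slow-path tails are left out):

static bool futex_ref_get(struct futex_private_hash *fph)
{
	struct mm_struct *mm = fph->mm;

	/* Disabling preemption blocks the RCU grace period just like
	 * rcu_read_lock() would. */
	guard(preempt)();

	if (READ_ONCE(fph->state) == FR_PERCPU) {
		/* Safe as __this_cpu_*(): preemption is off and the
		 * counter is only touched from task context. */
		__this_cpu_inc(*mm->futex_ref);
		return true;
	}

	/* ... FR_ATOMIC slow path, unchanged and not shown ... */
}

static bool futex_ref_put(struct futex_private_hash *fph)
{
	struct mm_struct *mm = fph->mm;

	guard(preempt)();

	if (READ_ONCE(fph->state) == FR_PERCPU) {
		__this_cpu_dec(*mm->futex_ref);
		return false;
	}

	/* ... FR_ATOMIC slow path, unchanged and not shown ... */
}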
On 7/16/25 19:59, Peter Zijlstra wrote:
> On Tue, Jul 15, 2025 at 10:34:24PM +0530, Shrikanth Hegde wrote:
>
>> I did try again by going back to baseline, removing BROKEN and running the command below, which gives us the immutable numbers.
>> ./perf bench futex hash -Ib512
>> Averaged 1536035 operations/sec (+- 0.11%), total secs = 10
>> Futex hashing: 512 hash buckets (immutable)
>>
>> So, with the -b 512 option, it is around 8-10% slower than immutable.
>
> Urgh, can you run perf on that and tell me if this is due to
> this_cpu_{inc,dec}() doing local_irq_disable() or the smp_load_acquire()
> doing LWSYNC ?
It seems to be due to the RCU read unlock and the IRQ restore.
Both perf records were collected with -b 512.
base_futex_immutable_b512 - perf record collected with baseline + remove BROKEN + ./perf bench futex hash -Ib512
per_cpu_futex_hash_b_512 - baseline + series + ./perf bench futex hash -b512
perf diff base_futex_immutable_b512 per_cpu_futex_hash_b_512
# Event 'cycles'
#
# Baseline Delta Abs Shared Object Symbol
# ........ ......... .......................... ....................................................
#
21.62% -2.26% [kernel.vmlinux] [k] futex_get_value_locked
0.16% +2.01% [kernel.vmlinux] [k] __rcu_read_unlock
1.35% +1.63% [kernel.vmlinux] [k] arch_local_irq_restore.part.0
+1.48% [kernel.vmlinux] [k] futex_private_hash_put
+1.16% [kernel.vmlinux] [k] futex_ref_get
10.41% -0.78% [kernel.vmlinux] [k] system_call_vectored_common
1.24% +0.72% perf [.] workerfn
5.32% -0.66% [kernel.vmlinux] [k] futex_q_lock
2.48% -0.43% [kernel.vmlinux] [k] futex_wait
2.47% -0.40% [kernel.vmlinux] [k] _raw_spin_lock
2.98% -0.35% [kernel.vmlinux] [k] futex_q_unlock
2.42% -0.34% [kernel.vmlinux] [k] __futex_wait
5.47% -0.32% libc.so.6 [.] syscall
4.03% -0.32% [kernel.vmlinux] [k] memcpy_power7
0.16% +0.22% [kernel.vmlinux] [k] arch_local_irq_restore
5.93% -0.18% [kernel.vmlinux] [k] futex_hash
1.72% -0.17% [kernel.vmlinux] [k] sys_futex
>
> Anyway, I think we can improve both. Does the below help?
>
>
> ---
> diff --git a/kernel/futex/core.c b/kernel/futex/core.c
> index d9bb5567af0c..8c41d050bd1f 100644
> --- a/kernel/futex/core.c
> +++ b/kernel/futex/core.c
> @@ -1680,10 +1680,10 @@ static bool futex_ref_get(struct futex_private_hash *fph)
> {
> struct mm_struct *mm = fph->mm;
>
> - guard(rcu)();
> + guard(preempt)();
>
> - if (smp_load_acquire(&fph->state) == FR_PERCPU) {
> - this_cpu_inc(*mm->futex_ref);
> + if (READ_ONCE(fph->state) == FR_PERCPU) {
> + __this_cpu_inc(*mm->futex_ref);
> return true;
> }
>
> @@ -1694,10 +1694,10 @@ static bool futex_ref_put(struct futex_private_hash *fph)
> {
> struct mm_struct *mm = fph->mm;
>
> - guard(rcu)();
> + guard(preempt)();
>
> - if (smp_load_acquire(&fph->state) == FR_PERCPU) {
> - this_cpu_dec(*mm->futex_ref);
> + if (READ_ONCE(fph->state) == FR_PERCPU) {
> + __this_cpu_dec(*mm->futex_ref);
> return false;
> }
>
Yes. It helps. It improves the "-b 512" numbers by at least 5%.
baseline + series:
Averaged 1412543 operations/sec (+- 0.14%), total secs = 10
Futex hashing: 512 hash buckets
baseline + series+ above_patch:
Averaged 1482733 operations/sec (+- 0.26%), total secs = 10 <<< 5% improvement
Futex hashing: 512 hash buckets
Now we are within 4-5% of baseline/immutable.
baseline:
commit 8784fb5fa2e0042fe3b1632d4876e1037b695f56 (HEAD)
./perf bench futex hash
Averaged 1559643 operations/sec (+- 0.09%), total secs = 10
Futex hashing: global hash
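For reference, the gaps quoted in this thread work out as follows from the
numbers above (rough arithmetic, ops/sec):

	-b 512, series only, vs immutable:      1 - 1412543 / 1536035 ~= 8.0%
	-b 512, series + patch, vs immutable:   1 - 1482733 / 1536035 ~= 3.5%
	-b 512, series + patch, vs global hash: 1 - 1482733 / 1559643 ~= 4.9%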
On Wed, Jul 16, 2025 at 11:51:46PM +0530, Shrikanth Hegde wrote:
> > Anyway, I think we can improve both. Does the below help?
> >
> >
> > ---
> > diff --git a/kernel/futex/core.c b/kernel/futex/core.c
> > index d9bb5567af0c..8c41d050bd1f 100644
> > --- a/kernel/futex/core.c
> > +++ b/kernel/futex/core.c
> > @@ -1680,10 +1680,10 @@ static bool futex_ref_get(struct futex_private_hash *fph)
> > {
> > struct mm_struct *mm = fph->mm;
> > - guard(rcu)();
> > + guard(preempt)();
> > - if (smp_load_acquire(&fph->state) == FR_PERCPU) {
> > - this_cpu_inc(*mm->futex_ref);
> > + if (READ_ONCE(fph->state) == FR_PERCPU) {
> > + __this_cpu_inc(*mm->futex_ref);
> > return true;
> > }
> > @@ -1694,10 +1694,10 @@ static bool futex_ref_put(struct futex_private_hash *fph)
> > {
> > struct mm_struct *mm = fph->mm;
> > - guard(rcu)();
> > + guard(preempt)();
> > - if (smp_load_acquire(&fph->state) == FR_PERCPU) {
> > - this_cpu_dec(*mm->futex_ref);
> > + if (READ_ONCE(fph->state) == FR_PERCPU) {
> > + __this_cpu_dec(*mm->futex_ref);
> > return false;
> > }
>
> Yes. It helps. It improves the "-b 512" numbers by at least 5%.
While talking with Sebastian about this work, I realized this patch was
never committed. So I've written it up like so, and will commit to
tip/locking/urgent soonish.
---
Subject: futex: Optimize per-cpu reference counting
From: Peter Zijlstra <peterz@infradead.org>
Date: Wed, 16 Jul 2025 16:29:46 +0200
Shrikanth noted that the per-cpu reference counter was still some 10%
slower than the old immutable option (which removes the reference
counting entirely).
Further optimize the per-cpu reference counter by:
- switching from RCU to preempt;
- using __this_cpu_*() since we now have preempt disabled;
- switching from smp_load_acquire() to READ_ONCE().
This is all safe because disabling preemption inhibits the RCU grace
period exactly like rcu_read_lock().
Having preemption disabled allows using __this_cpu_*() provided the
only access to the variable is in task context -- which is the case
here.
Furthermore, since we know changing fph->state to FR_ATOMIC demands a
full RCU grace period we can rely on the implied smp_mb() from that to
replace the acquire barrier().
This is very similar to the percpu_down_read_internal() fast-path.
The reason this is significant for PowerPC is that it uses the generic
this_cpu_*() implementation which relies on local_irq_disable() (the
x86 implementation relies on it being a single memop instruction to be
IRQ-safe). Switching to preempt_disable() and __this_cpu*() avoids
this IRQ state swizzling. Also, PowerPC needs LWSYNC for the ACQUIRE
barrier; not having to use explicit barriers saves a bunch.
Combined this reduces the performance gap by half, down to some 5%.
Reported-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/futex/core.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1680,10 +1680,10 @@ static bool futex_ref_get(struct futex_p
{
struct mm_struct *mm = fph->mm;
- guard(rcu)();
+ guard(preempt)();
- if (smp_load_acquire(&fph->state) == FR_PERCPU) {
- this_cpu_inc(*mm->futex_ref);
+ if (READ_ONCE(fph->state) == FR_PERCPU) {
+ __this_cpu_inc(*mm->futex_ref);
return true;
}
@@ -1694,10 +1694,10 @@ static bool futex_ref_put(struct futex_p
{
struct mm_struct *mm = fph->mm;
- guard(rcu)();
+ guard(preempt)();
- if (smp_load_acquire(&fph->state) == FR_PERCPU) {
- this_cpu_dec(*mm->futex_ref);
+ if (READ_ONCE(fph->state) == FR_PERCPU) {
+ __this_cpu_dec(*mm->futex_ref);
return false;
}
On 2025-11-06 10:29:29 [+0100], Peter Zijlstra wrote:
> Subject: futex: Optimize per-cpu reference counting
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Wed, 16 Jul 2025 16:29:46 +0200
[...]
> Having preemption disabled allows using __this_cpu_*() provided the
> only access to the variable is in task context -- which is the case
> here.

Right. Read and Write from softirq happen after the user transitioned
to atomics.

> Furthermore, since we know changing fph->state to FR_ATOMIC demands a
> full RCU grace period we can rely on the implied smp_mb() from that to
> replace the acquire barrier().

That is the only part I struggle with, but having a smp_mb() after a
grace period sounds reasonable.

[...]
> Combined this reduces the performance gap by half, down to some 5%.

Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Sebastian
On Thu, Nov 06, 2025 at 12:09:07PM +0100, Sebastian Andrzej Siewior wrote:
> On 2025-11-06 10:29:29 [+0100], Peter Zijlstra wrote:
> > Furthermore, since we know changing fph->state to FR_ATOMIC demands a
> > full RCU grace period we can rely on the implied smp_mb() from that to
> > replace the acquire barrier().
>
> That is the only part I struggle with, but having a smp_mb() after a
> grace period sounds reasonable.

IIRC the argument goes something like so:

A grace-period (for rcu-sched, which is implied by regular rcu) implies
that every task has done at least one voluntary context switch.

A context switch implies a full barrier.

Therefore observing a state change separated by a grace-period implies
an smp_mb().
On Thu, Nov 06, 2025 at 12:23:39PM +0100, Peter Zijlstra wrote:
> IIRC the argument goes something like so:
>
> A grace-period (for rcu-sched, which is implied by regular rcu) implies
> that every task has done at least one voluntary context switch.

Agreed, except for: s/voluntary context switch/context switch/

It is Tasks RCU that pays attention only to voluntary context switches.

> A context switch implies a full barrier.
>
> Therefore observing a state change separated by a grace-period implies
> an smp_mb().

Just to be pedantic, for any given CPU and any given grace period, it is
the case that:

1. That CPU will have executed a full barrier between any code executed
   on any CPU that happens before the beginning of that grace period and
   any RCU read-side critical section on that CPU that extends beyond the
   end of that grace period, and

2. That CPU will have executed a full barrier between any RCU read-side
   critical section on that CPU that extends before the beginning of that
   grace period and any code executed on any CPU that happens after the
   end of that grace period.

An RCU read-side critical section is: (1) any region of code protected by
rcu_read_lock() and friends, and (2) any region of code where preemption
is disabled that does not contain a call to schedule().

							Thanx, Paul
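To make the ordering argument concrete, a schematic of the pattern being
discussed (a condensed sketch, not the literal kernel code; atomic_path_get()
stands in for the FR_ATOMIC slow path and switch_to_atomic() condenses the
writer side):

/* Reader fast path: runs with preemption disabled, which is an RCU
 * read-side critical section per the definition above. */
static bool ref_get(struct futex_private_hash *fph)
{
	guard(preempt)();

	if (READ_ONCE(fph->state) == FR_PERCPU) {
		/* A reader that still sees FR_PERCPU here entered its
		 * critical section before the writer's grace period
		 * completed, so the writer is guaranteed to wait for it. */
		__this_cpu_inc(*fph->mm->futex_ref);
		return true;
	}
	return atomic_path_get(fph);	/* stand-in for the slow path */
}

/* Writer side, condensed: flip to atomic mode, then wait out every
 * reader that might still be on the per-cpu counters. */
static void switch_to_atomic(struct futex_private_hash *fph)
{
	WRITE_ONCE(fph->state, FR_ATOMIC);
	synchronize_rcu();	/* full barriers w.r.t. all preempt-disabled
				 * readers; this replaces the acquire */
	/* ... fold the per-cpu counts into the atomic counter ... */
}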
The following commit has been merged into the locking/urgent branch of tip:
Commit-ID: 4cb5ac2626b5704ed712ac1d46b9d89fdfc12c5d
Gitweb: https://git.kernel.org/tip/4cb5ac2626b5704ed712ac1d46b9d89fdfc12c5d
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 16 Jul 2025 16:29:46 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 06 Nov 2025 12:30:54 +01:00
futex: Optimize per-cpu reference counting
Shrikanth noted that the per-cpu reference counter was still some 10%
slower than the old immutable option (which removes the reference
counting entirely).
Further optimize the per-cpu reference counter by:
- switching from RCU to preempt;
- using __this_cpu_*() since we now have preempt disabled;
- switching from smp_load_acquire() to READ_ONCE().
This is all safe because disabling preemption inhibits the RCU grace
period exactly like rcu_read_lock().
Having preemption disabled allows using __this_cpu_*() provided the
only access to the variable is in task context -- which is the case
here.
Furthermore, since we know changing fph->state to FR_ATOMIC demands a
full RCU grace period we can rely on the implied smp_mb() from that to
replace the acquire barrier().
This is very similar to the percpu_down_read_internal() fast-path.
The reason this is significant for PowerPC is that it uses the generic
this_cpu_*() implementation which relies on local_irq_disable() (the
x86 implementation relies on it being a single memop instruction to be
IRQ-safe). Switching to preempt_disable() and __this_cpu*() avoids
this IRQ state swizzling. Also, PowerPC needs LWSYNC for the ACQUIRE
barrier; not having to use explicit barriers saves a bunch.
Combined this reduces the performance gap by half, down to some 5%.
Fixes: 760e6f7befba ("futex: Remove support for IMMUTABLE")
Reported-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20251106092929.GR4067720@noisy.programming.kicks-ass.net
---
kernel/futex/core.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 125804f..2e77a6e 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1680,10 +1680,10 @@ static bool futex_ref_get(struct futex_private_hash *fph)
{
struct mm_struct *mm = fph->mm;
- guard(rcu)();
+ guard(preempt)();
- if (smp_load_acquire(&fph->state) == FR_PERCPU) {
- this_cpu_inc(*mm->futex_ref);
+ if (READ_ONCE(fph->state) == FR_PERCPU) {
+ __this_cpu_inc(*mm->futex_ref);
return true;
}
@@ -1694,10 +1694,10 @@ static bool futex_ref_put(struct futex_private_hash *fph)
{
struct mm_struct *mm = fph->mm;
- guard(rcu)();
+ guard(preempt)();
- if (smp_load_acquire(&fph->state) == FR_PERCPU) {
- this_cpu_dec(*mm->futex_ref);
+ if (READ_ONCE(fph->state) == FR_PERCPU) {
+ __this_cpu_dec(*mm->futex_ref);
return false;
}
On 11/6/25 5:10 PM, tip-bot2 for Peter Zijlstra wrote:
> The following commit has been merged into the locking/urgent branch of tip:
>
> Commit-ID:     4cb5ac2626b5704ed712ac1d46b9d89fdfc12c5d
> Gitweb:        https://git.kernel.org/tip/4cb5ac2626b5704ed712ac1d46b9d89fdfc12c5d
> Author:        Peter Zijlstra <peterz@infradead.org>
> AuthorDate:    Wed, 16 Jul 2025 16:29:46 +02:00
> Committer:     Peter Zijlstra <peterz@infradead.org>
> CommitterDate: Thu, 06 Nov 2025 12:30:54 +01:00
>
> futex: Optimize per-cpu reference counting
>
> Shrikanth noted that the per-cpu reference counter was still some 10%
> slower than the old immutable option (which removes the reference
> counting entirely).

Thanks for picking it up.