[PATCH v2 0/6] futex: Use RCU-based per-CPU reference counting
Posted by Sebastian Andrzej Siewior 7 months ago
I picked up PeterZ's futex patch from
    https://lore.kernel.org/all/20250624190118.GB1490279@noisy.programming.kicks-ass.net/

and I am posting it here now so it can be staged for v6.17.

This survived a few days on my machine, and the compile robot reported
that it passes its tests.

v1…v2 https://lore.kernel.org/all/20250707143623.70325-1-bigeasy@linutronix.de
 - Removed the IMMUTABLE bits
 - There was a race if the application exits while the RCU callback is
   pending. Plugged with mmget()/mmput_async().

Changes since its initial posting:
- A patch description has been added
- The test suite is "fixed" slightly differently and has been split out
- futex_mm_init() is fixed up.
- The guard(preempt) has been replaced with guard(rcu) since there is
  no reason to disable preemption.

Since it has not been released yet, should we rip out the IMMUTABLE bits
and just stick with the GET/SET slots?
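
For reference, a minimal userspace sketch of those GET/SET slot prctl()s
(the PR_FUTEX_HASH constant values below are assumptions based on the
current uapi header and guarded in case older headers lack them):

#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_FUTEX_HASH
#define PR_FUTEX_HASH			78
#define PR_FUTEX_HASH_SET_SLOTS	1
#define PR_FUTEX_HASH_GET_SLOTS	2
#endif

int main(void)
{
	/* Ask for a process-private futex hash with 512 buckets. */
	if (prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, 512, 0) < 0)
		perror("PR_FUTEX_HASH_SET_SLOTS");

	/* Query how many buckets the private hash currently has. */
	printf("private hash slots: %d\n",
	       prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS, 0, 0));
	return 0;
}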

Peter Zijlstra (1):
  futex: Use RCU-based per-CPU reference counting instead of rcuref_t

Sebastian Andrzej Siewior (5):
  selftests/futex: Adapt the private hash test to RCU related changes
  futex: Make futex_private_hash_get() static
  futex: Remove support for IMMUTABLE
  selftests/futex: Remove support for IMMUTABLE
  perf bench futex: Remove support for IMMUTABLE

 include/linux/futex.h                         |  16 +-
 include/linux/mm_types.h                      |   5 +
 include/linux/sched/mm.h                      |   2 +-
 include/uapi/linux/prctl.h                    |   2 -
 init/Kconfig                                  |   4 -
 kernel/fork.c                                 |   8 +-
 kernel/futex/core.c                           | 281 ++++++++++++++----
 kernel/futex/futex.h                          |   2 -
 tools/include/uapi/linux/prctl.h              |   2 -
 tools/perf/bench/futex-hash.c                 |   1 -
 tools/perf/bench/futex-lock-pi.c              |   1 -
 tools/perf/bench/futex-requeue.c              |   1 -
 tools/perf/bench/futex-wake-parallel.c        |   1 -
 tools/perf/bench/futex-wake.c                 |   1 -
 tools/perf/bench/futex.c                      |  21 +-
 tools/perf/bench/futex.h                      |   1 -
 .../trace/beauty/include/uapi/linux/prctl.h   |   2 -
 .../futex/functional/futex_priv_hash.c        | 113 +++----
 18 files changed, 315 insertions(+), 149 deletions(-)

-- 
2.50.0
Re: [PATCH v2 0/6] futex: Use RCU-based per-CPU reference counting
Posted by Shrikanth Hegde 6 months, 3 weeks ago

On 7/10/25 16:30, Sebastian Andrzej Siewior wrote:
> I picked up PeterZ futex patch from
>      https://lore.kernel.org/all/20250624190118.GB1490279@noisy.programming.kicks-ass.net/
> 
> and I am posting it here it now so it can be staged for v6.17.
> 
> This survived a few days on my machine and compile robot reported that
> is passes its tests.
> 
> v1…v2 https://lore.kernel.org/all/20250707143623.70325-1-bigeasy@linutronix.de
>   - Removed the IMMUTABLE bits
>   - There was a race if the application exits while the RCU callback is
>     pending. Stuffed with mmget()/ mmput_async().
> 
> Changes since its initial posting:
> - A patch description has been added
> - The testuite is "fixed" slightly different and has been split out
> - futex_mm_init() is fixed up.
> - The guard(preempt) has been replaced with guard(rcu) since there is
>    no reason to disable preemption.
> 
> Since it was not yet released, should we rip out the IMMUTABLE bits and
> just stick with GET/SET slots?
> 
> Peter Zijlstra (1):
>    futex: Use RCU-based per-CPU reference counting instead of rcuref_t
> 
> Sebastian Andrzej Siewior (5):
>    selftests/futex: Adapt the private hash test to RCU related changes
>    futex: Make futex_private_hash_get() static
>    futex: Remove support for IMMUTABLE
>    selftests/futex: Remove support for IMMUTABLE
>    perf bench futex: Remove support for IMMUTABLE
> 
>   include/linux/futex.h                         |  16 +-
>   include/linux/mm_types.h                      |   5 +
>   include/linux/sched/mm.h                      |   2 +-
>   include/uapi/linux/prctl.h                    |   2 -
>   init/Kconfig                                  |   4 -
>   kernel/fork.c                                 |   8 +-
>   kernel/futex/core.c                           | 281 ++++++++++++++----
>   kernel/futex/futex.h                          |   2 -
>   tools/include/uapi/linux/prctl.h              |   2 -
>   tools/perf/bench/futex-hash.c                 |   1 -
>   tools/perf/bench/futex-lock-pi.c              |   1 -
>   tools/perf/bench/futex-requeue.c              |   1 -
>   tools/perf/bench/futex-wake-parallel.c        |   1 -
>   tools/perf/bench/futex-wake.c                 |   1 -
>   tools/perf/bench/futex.c                      |  21 +-
>   tools/perf/bench/futex.h                      |   1 -
>   .../trace/beauty/include/uapi/linux/prctl.h   |   2 -
>   .../futex/functional/futex_priv_hash.c        | 113 +++----
>   18 files changed, 315 insertions(+), 149 deletions(-)
> 

Hi. Sorry for not stumbling upon this earlier; I only saw these now.

Since perf bench had shown a significant regression last time around,
for which the immutable option was added, I gave perf futex a try again.

Below are the results. Run on a 5-core LPAR (VM) on Power; perf was compiled from tools/perf.

===========
baseline:
===========
tip/master at
commit 8784fb5fa2e0042fe3b1632d4876e1037b695f56 (HEAD)
Author: Borislav Petkov (AMD) <bp@alien8.de>

./perf bench futex hash
Averaged 1559643 operations/sec (+- 0.09%), total secs = 10
Futex hashing: global hash

schbench -t 64 -r 5 -i 5
current rps: 2629.85

schbench -L -m 4 -M auto -t 64 -n 0 -r 5 -i 5
current rps: 1538674.22

=================
baseline + series
=================

./perf bench futex hash
Averaged 306403 operations/sec (+- 0.29%), total secs = 10    <<<  around 1/5th of baseline.
Futex hashing: auto resized to 256 buckets                    <<<  maybe the resize doesn't happen fast enough?


./perf bench futex hash -b 512                                <<< Gave 512 buckets,
Averaged 1412543 operations/sec (+- 0.14%), total secs = 10   <<< much better numbers, still off by 8-10%.
Futex hashing: 512 hash buckets

(512 is the number of buckets the baseline would have used; increased the buckets to 8192 as a trial)

./perf bench futex hash -b 8192
Averaged 1441627 operations/sec (+- 0.14%), total secs = 10
Futex hashing: 8192 hash buckets


schbench -t 64 -r 5 -i 5
current rps: 2656.85                                          <<< schbench seems good.

schbench -L -m 4 -M auto -t 64 -n 0 -r 5 -i 5
current rps: 1539273.79
Re: [PATCH v2 0/6] futex: Use RCU-based per-CPU reference counting
Posted by Sebastian Andrzej Siewior 6 months, 3 weeks ago
On 2025-07-15 21:29:34 [+0530], Shrikanth Hegde wrote:
> Hi. Sorry for not stumble upon this earlier. Saw these now.
> 
> Since perf bench had shown a significant regression last time around, and
> for which immutable option was added, gave perf futex a try again.
> 
> Below are the results: Ran on 5 core LPAR(VM) on power. perf was compiled from tools/perf.
Thank you.

If you use perf-bench with -b then the buckets are applied
"immediately". It mostly also works with auto scaling. The problem is
that perf creates the threads and starts the test immediately
afterwards. While RCU kicks in shortly after, no transition happens
until the test completes and the threads terminate. The reason is that
several private-hash references are always in use because some threads
are always in the futex() syscall.

It would require something like commit
    a255b78d14324 ("selftests/futex: Adapt the private hash test to RCU related changes")

to make this transition happen before the test starts.
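
Roughly, the benchmark would have to wait until the resize has actually
taken effect before the measured phase begins, e.g. after the worker
threads are created but before they enter the futex() loop. A
hypothetical sketch (not the actual selftest change; it assumes
PR_FUTEX_HASH_GET_SLOTS reports the hash currently in use):

#include <sys/prctl.h>
#include <unistd.h>

#ifndef PR_FUTEX_HASH
#define PR_FUTEX_HASH			78
#define PR_FUTEX_HASH_GET_SLOTS	2
#endif

/*
 * Poll until the private hash reports the expected size, i.e. until the
 * RCU-driven transition to the resized hash has actually happened.
 */
static int wait_for_hash_slots(int expected, int retries)
{
	while (retries-- > 0) {
		if (prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS, 0, 0) == expected)
			return 0;
		usleep(10 * 1000);	/* give the RCU callback a chance to run */
	}
	return -1;
}
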
Your schbench seems unaffected?

If you use -b, is it better than or equal to the immutable option?
This isn't quite clear to me.

Sebastian
Re: [PATCH v2 0/6] futex: Use RCU-based per-CPU reference counting
Posted by Shrikanth Hegde 6 months, 3 weeks ago

On 7/15/25 22:01, Sebastian Andrzej Siewior wrote:
> On 2025-07-15 21:29:34 [+0530], Shrikanth Hegde wrote:
>> Hi. Sorry for not stumble upon this earlier. Saw these now.
>>
>> Since perf bench had shown a significant regression last time around, and
>> for which immutable option was added, gave perf futex a try again.
>>
>> Below are the results: Ran on 5 core LPAR(VM) on power. perf was compiled from tools/perf.
> Thank you.
> 
> If you use perf-bench with -b then the buckets are applied
> "immediately". It mostly works also with auto scaling. The problem is
> that perf creates the threads and immediately after it starts the test.
> While the RCU kicks in shortly after there is no transition happening
> until after all the test completes/ the threads terminate. The reason is
> that several private-hash references are in use because a some threads
> are always in the futex() syscall.
> 
> It would require something like commit
>      a255b78d14324 ("selftests/futex: Adapt the private hash test to RCU related changes")
> 
> to have this transition before the test starts.
> Your schbench seems not affected?

Yes. schbench shows similar numbers.

> 
> If you use -b, is it better than or equal compared to the immutable
> option? This isn't quite clear.


I did try again by going back to baseline, removing BROKEN, and running the below, which gives us the immutable numbers.
./perf bench futex hash -Ib512
Averaged 1536035 operations/sec (+- 0.11%), total secs = 10
Futex hashing: 512 hash buckets (immutable)

So, with the -b 512 option, it is around 8-10% slower compared to immutable.
Re: [PATCH v2 0/6] futex: Use RCU-based per-CPU reference counting
Posted by Peter Zijlstra 6 months, 3 weeks ago
On Tue, Jul 15, 2025 at 10:34:24PM +0530, Shrikanth Hegde wrote:

> I did try again by going to baseline, removed BROKEN and ran below. Which gives us immutable numbers.
> ./perf bench futex hash -Ib512
> Averaged 1536035 operations/sec (+- 0.11%), total secs = 10
> Futex hashing: 512 hash buckets (immutable)
> 
> So, with -b 512 option, it is around 8-10% less compared to immutable.

Urgh, can you run perf on that and tell me if this is due to
this_cpu_{inc,dec}() doing local_irq_disable() or the smp_load_acquire()
doing LWSYNC?

Anyway, I think we can improve both. Does the below help?


---
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index d9bb5567af0c..8c41d050bd1f 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1680,10 +1680,10 @@ static bool futex_ref_get(struct futex_private_hash *fph)
 {
 	struct mm_struct *mm = fph->mm;
 
-	guard(rcu)();
+	guard(preempt)();
 
-	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
-		this_cpu_inc(*mm->futex_ref);
+	if (READ_ONCE(fph->state) == FR_PERCPU) {
+		__this_cpu_inc(*mm->futex_ref);
 		return true;
 	}
 
@@ -1694,10 +1694,10 @@ static bool futex_ref_put(struct futex_private_hash *fph)
 {
 	struct mm_struct *mm = fph->mm;
 
-	guard(rcu)();
+	guard(preempt)();
 
-	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
-		this_cpu_dec(*mm->futex_ref);
+	if (READ_ONCE(fph->state) == FR_PERCPU) {
+		__this_cpu_dec(*mm->futex_ref);
 		return false;
 	}
Re: [PATCH v2 0/6] futex: Use RCU-based per-CPU reference counting
Posted by Shrikanth Hegde 6 months, 3 weeks ago

On 7/16/25 19:59, Peter Zijlstra wrote:
> On Tue, Jul 15, 2025 at 10:34:24PM +0530, Shrikanth Hegde wrote:
> 
>> I did try again by going to baseline, removed BROKEN and ran below. Which gives us immutable numbers.
>> ./perf bench futex hash -Ib512
>> Averaged 1536035 operations/sec (+- 0.11%), total secs = 10
>> Futex hashing: 512 hash buckets (immutable)
>>
>> So, with -b 512 option, it is around 8-10% less compared to immutable.
> 
> Urgh, can you run perf on that and tell me if this is due to
> this_cpu_{inc,dec}() doing local_irq_disable() or the smp_load_acquire()
> doing LWSYNC ?

It seems to be due to RCU and the IRQ enable/disable.
Both perf records were collected with -b512.


base_futex_immutable_b512 - perf record collected with baseline + remove BROKEN + ./perf bench futex hash -Ib512
per_cpu_futex_hash_b_512 - baseline + series + ./perf bench futex hash -b512


perf diff base_futex_immutable_b512 per_cpu_futex_hash_b_512
# Event 'cycles'
#
# Baseline  Delta Abs  Shared Object               Symbol
# ........  .........  ..........................  ....................................................
#
     21.62%     -2.26%  [kernel.vmlinux]            [k] futex_get_value_locked
      0.16%     +2.01%  [kernel.vmlinux]            [k] __rcu_read_unlock
      1.35%     +1.63%  [kernel.vmlinux]            [k] arch_local_irq_restore.part.0
                +1.48%  [kernel.vmlinux]            [k] futex_private_hash_put
                +1.16%  [kernel.vmlinux]            [k] futex_ref_get
     10.41%     -0.78%  [kernel.vmlinux]            [k] system_call_vectored_common
      1.24%     +0.72%  perf                        [.] workerfn
      5.32%     -0.66%  [kernel.vmlinux]            [k] futex_q_lock
      2.48%     -0.43%  [kernel.vmlinux]            [k] futex_wait
      2.47%     -0.40%  [kernel.vmlinux]            [k] _raw_spin_lock
      2.98%     -0.35%  [kernel.vmlinux]            [k] futex_q_unlock
      2.42%     -0.34%  [kernel.vmlinux]            [k] __futex_wait
      5.47%     -0.32%  libc.so.6                   [.] syscall
      4.03%     -0.32%  [kernel.vmlinux]            [k] memcpy_power7
      0.16%     +0.22%  [kernel.vmlinux]            [k] arch_local_irq_restore
      5.93%     -0.18%  [kernel.vmlinux]            [k] futex_hash
      1.72%     -0.17%  [kernel.vmlinux]            [k] sys_futex


> 
> Anyway, I think we can improve both. Does the below help?
> 
> 
> ---
> diff --git a/kernel/futex/core.c b/kernel/futex/core.c
> index d9bb5567af0c..8c41d050bd1f 100644
> --- a/kernel/futex/core.c
> +++ b/kernel/futex/core.c
> @@ -1680,10 +1680,10 @@ static bool futex_ref_get(struct futex_private_hash *fph)
>   {
>   	struct mm_struct *mm = fph->mm;
>   
> -	guard(rcu)();
> +	guard(preempt)();
>   
> -	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
> -		this_cpu_inc(*mm->futex_ref);
> +	if (READ_ONCE(fph->state) == FR_PERCPU) {
> +		__this_cpu_inc(*mm->futex_ref);
>   		return true;
>   	}
>   
> @@ -1694,10 +1694,10 @@ static bool futex_ref_put(struct futex_private_hash *fph)
>   {
>   	struct mm_struct *mm = fph->mm;
>   
> -	guard(rcu)();
> +	guard(preempt)();
>   
> -	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
> -		this_cpu_dec(*mm->futex_ref);
> +	if (READ_ONCE(fph->state) == FR_PERCPU) {
> +		__this_cpu_dec(*mm->futex_ref);
>   		return false;
>   	}
>   

Yes, it helps. It improves the "-b 512" numbers by at least 5%.

baseline + series:
Averaged 1412543 operations/sec (+- 0.14%), total secs = 10
Futex hashing: 512 hash buckets


baseline + series + above_patch:
Averaged 1482733 operations/sec (+- 0.26%), total secs = 10   <<< 5% improvement
Futex hashing: 512 hash buckets


Now we are within 4-5% of the baseline/immutable numbers.
baseline:
commit 8784fb5fa2e0042fe3b1632d4876e1037b695f56 (HEAD)

./perf bench futex hash
Averaged 1559643 operations/sec (+- 0.09%), total secs = 10
Futex hashing: global hash
Re: [PATCH v2 0/6] futex: Use RCU-based per-CPU reference counting
Posted by Peter Zijlstra 3 months ago
On Wed, Jul 16, 2025 at 11:51:46PM +0530, Shrikanth Hegde wrote:

> > Anyway, I think we can improve both. Does the below help?
> > 
> > 
> > ---
> > diff --git a/kernel/futex/core.c b/kernel/futex/core.c
> > index d9bb5567af0c..8c41d050bd1f 100644
> > --- a/kernel/futex/core.c
> > +++ b/kernel/futex/core.c
> > @@ -1680,10 +1680,10 @@ static bool futex_ref_get(struct futex_private_hash *fph)
> >   {
> >   	struct mm_struct *mm = fph->mm;
> > -	guard(rcu)();
> > +	guard(preempt)();
> > -	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
> > -		this_cpu_inc(*mm->futex_ref);
> > +	if (READ_ONCE(fph->state) == FR_PERCPU) {
> > +		__this_cpu_inc(*mm->futex_ref);
> >   		return true;
> >   	}
> > @@ -1694,10 +1694,10 @@ static bool futex_ref_put(struct futex_private_hash *fph)
> >   {
> >   	struct mm_struct *mm = fph->mm;
> > -	guard(rcu)();
> > +	guard(preempt)();
> > -	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
> > -		this_cpu_dec(*mm->futex_ref);
> > +	if (READ_ONCE(fph->state) == FR_PERCPU) {
> > +		__this_cpu_dec(*mm->futex_ref);
> >   		return false;
> >   	}
> 
> Yes. It helps. It improves "-b 512" numbers by at-least 5%.

While talking with Sebastian about this work, I realized this patch was
never committed. So I've written it up like so, and will commit it to
tip/locking/urgent soonish.

---
Subject: futex: Optimize per-cpu reference counting
From: Peter Zijlstra <peterz@infradead.org>
Date: Wed, 16 Jul 2025 16:29:46 +0200

Shrikanth noted that the per-cpu reference counter was still some 10%
slower than the old immutable option (which removes the reference
counting entirely).

Further optimize the per-cpu reference counter by:

 - switching from RCU to preempt;
 - using __this_cpu_*() since we now have preempt disabled;
 - switching from smp_load_acquire() to READ_ONCE().

This is all safe because disabling preemption inhibits the RCU grace
period exactly like rcu_read_lock().

Having preemption disabled allows using __this_cpu_*() provided the
only access to the variable is in task context -- which is the case
here.

Furthermore, since we know changing fph->state to FR_ATOMIC demands a
full RCU grace period we can rely on the implied smp_mb() from that to
replace the acquire barrier().

This is very similar to the percpu_down_read_internal() fast-path.

The reason this is significant for PowerPC is that it uses the generic
this_cpu_*() implementation which relies on local_irq_disable() (the
x86 implementation relies on it being a single memop instruction to be
IRQ-safe). Switching to preempt_disable() and __this_cpu*() avoids
this IRQ state swizzling. Also, PowerPC needs LWSYNC for the ACQUIRE
barrier; not having to use explicit barriers saves a bunch.

Combined this reduces the performance gap by half, down to some 5%.

Reported-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/futex/core.c |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1680,10 +1680,10 @@ static bool futex_ref_get(struct futex_p
 {
 	struct mm_struct *mm = fph->mm;
 
-	guard(rcu)();
+	guard(preempt)();
 
-	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
-		this_cpu_inc(*mm->futex_ref);
+	if (READ_ONCE(fph->state) == FR_PERCPU) {
+		__this_cpu_inc(*mm->futex_ref);
 		return true;
 	}
 
@@ -1694,10 +1694,10 @@ static bool futex_ref_put(struct futex_p
 {
 	struct mm_struct *mm = fph->mm;
 
-	guard(rcu)();
+	guard(preempt)();
 
-	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
-		this_cpu_dec(*mm->futex_ref);
+	if (READ_ONCE(fph->state) == FR_PERCPU) {
+		__this_cpu_dec(*mm->futex_ref);
 		return false;
 	}
Re: [PATCH v2 0/6] futex: Use RCU-based per-CPU reference counting
Posted by Sebastian Andrzej Siewior 3 months ago
On 2025-11-06 10:29:29 [+0100], Peter Zijlstra wrote:
> Subject: futex: Optimize per-cpu reference counting
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Wed, 16 Jul 2025 16:29:46 +0200
> 
> Shrikanth noted that the per-cpu reference counter was still some 10%
> slower than the old immutable option (which removes the reference
> counting entirely).
> 
> Further optimize the per-cpu reference counter by:
> 
>  - switching from RCU to preempt;
>  - using __this_cpu_*() since we now have preempt disabled;
>  - switching from smp_load_acquire() to READ_ONCE().
> 
> This is all safe because disabling preemption inhibits the RCU grace
> period exactly like rcu_read_lock().
> 
> Having preemption disabled allows using __this_cpu_*() provided the
> only access to the variable is in task context -- which is the case
> here.

Right. Reads and writes from softirq happen only after the user has
transitioned to atomics.

> Furthermore, since we know changing fph->state to FR_ATOMIC demands a
> full RCU grace period we can rely on the implied smp_mb() from that to
> replace the acquire barrier().

That is the only part I struggle with, but having an smp_mb() after a
grace period sounds reasonable.

> This is very similar to the percpu_down_read_internal() fast-path.
>
> The reason this is significant for PowerPC is that it uses the generic
> this_cpu_*() implementation which relies on local_irq_disable() (the
> x86 implementation relies on it being a single memop instruction to be
> IRQ-safe). Switching to preempt_disable() and __this_cpu*() avoids
> this IRQ state swizzling. Also, PowerPC needs LWSYNC for the ACQUIRE
> barrier, not having to use explicit barriers safes a bunch.
> 
> Combined this reduces the performance gap by half, down to some 5%.

Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Sebastian
Re: [PATCH v2 0/6] futex: Use RCU-based per-CPU reference counting
Posted by Peter Zijlstra 3 months ago
On Thu, Nov 06, 2025 at 12:09:07PM +0100, Sebastian Andrzej Siewior wrote:
> On 2025-11-06 10:29:29 [+0100], Peter Zijlstra wrote:
> > Subject: futex: Optimize per-cpu reference counting
> > From: Peter Zijlstra <peterz@infradead.org>
> > Date: Wed, 16 Jul 2025 16:29:46 +0200
> > 
> > Shrikanth noted that the per-cpu reference counter was still some 10%
> > slower than the old immutable option (which removes the reference
> > counting entirely).
> > 
> > Further optimize the per-cpu reference counter by:
> > 
> >  - switching from RCU to preempt;
> >  - using __this_cpu_*() since we now have preempt disabled;
> >  - switching from smp_load_acquire() to READ_ONCE().
> > 
> > This is all safe because disabling preemption inhibits the RCU grace
> > period exactly like rcu_read_lock().
> > 
> > Having preemption disabled allows using __this_cpu_*() provided the
> > only access to the variable is in task context -- which is the case
> > here.
> 
> Right. Read and Write from softirq happens after the user transitioned
> to atomics.
> 
> > Furthermore, since we know changing fph->state to FR_ATOMIC demands a
> > full RCU grace period we can rely on the implied smp_mb() from that to
> > replace the acquire barrier().
> 
> That is the only part I struggle with but having a smp_mb() after a
> grace period sounds reasonable.

IIRC the argument goes something like so:

A grace-period (for rcu-sched, which is implied by regular rcu)
implies that every task has done at least one voluntary context switch.
A context switch implies a full barrier.

Therefore observing a state change separated by a grace-period implies
an smp_mb().
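
As an illustration of how the writer side can lean on that (pseudocode
only; the helper name and the atomic_ref field are made up, while
fph->state, FR_ATOMIC/FR_PERCPU and mm->futex_ref are from the patch):

/* Illustrative pseudocode, not the actual futex code. */
static void futex_ref_switch_to_atomic(struct futex_private_hash *fph)
{
	int cpu;

	WRITE_ONCE(fph->state, FR_ATOMIC);

	/*
	 * After a full grace period every CPU has gone through a context
	 * switch (a full barrier), so any reader that ran with preemption
	 * disabled and still saw FR_PERCPU has finished its
	 * __this_cpu_inc()/__this_cpu_dec() and that update is visible here.
	 */
	synchronize_rcu();

	/* Only now fold the per-CPU counts into a single atomic count. */
	for_each_possible_cpu(cpu)
		atomic_long_add(*per_cpu_ptr(fph->mm->futex_ref, cpu),
				&fph->atomic_ref);	/* made-up field */
}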

> > This is very similar to the percpu_down_read_internal() fast-path.
> >
> > The reason this is significant for PowerPC is that it uses the generic
> > this_cpu_*() implementation which relies on local_irq_disable() (the
> > x86 implementation relies on it being a single memop instruction to be
> > IRQ-safe). Switching to preempt_disable() and __this_cpu*() avoids
> > this IRQ state swizzling. Also, PowerPC needs LWSYNC for the ACQUIRE
> > barrier, not having to use explicit barriers safes a bunch.
> > 
> > Combined this reduces the performance gap by half, down to some 5%.
> 
> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> 
> Sebastian
Re: [PATCH v2 0/6] futex: Use RCU-based per-CPU reference counting
Posted by Paul E. McKenney 3 months ago
On Thu, Nov 06, 2025 at 12:23:39PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 06, 2025 at 12:09:07PM +0100, Sebastian Andrzej Siewior wrote:
> > On 2025-11-06 10:29:29 [+0100], Peter Zijlstra wrote:
> > > Subject: futex: Optimize per-cpu reference counting
> > > From: Peter Zijlstra <peterz@infradead.org>
> > > Date: Wed, 16 Jul 2025 16:29:46 +0200
> > > 
> > > Shrikanth noted that the per-cpu reference counter was still some 10%
> > > slower than the old immutable option (which removes the reference
> > > counting entirely).
> > > 
> > > Further optimize the per-cpu reference counter by:
> > > 
> > >  - switching from RCU to preempt;
> > >  - using __this_cpu_*() since we now have preempt disabled;
> > >  - switching from smp_load_acquire() to READ_ONCE().
> > > 
> > > This is all safe because disabling preemption inhibits the RCU grace
> > > period exactly like rcu_read_lock().
> > > 
> > > Having preemption disabled allows using __this_cpu_*() provided the
> > > only access to the variable is in task context -- which is the case
> > > here.
> > 
> > Right. Read and Write from softirq happens after the user transitioned
> > to atomics.
> > 
> > > Furthermore, since we know changing fph->state to FR_ATOMIC demands a
> > > full RCU grace period we can rely on the implied smp_mb() from that to
> > > replace the acquire barrier().
> > 
> > That is the only part I struggle with but having a smp_mb() after a
> > grace period sounds reasonable.
> 
> IIRC the argument goes something like so:
> 
> A grace-period (for rcu-sched, which is implied by regular rcu)
> implies that every task has done at least one voluntary context switch.

Agreed, except for: s/voluntary context switch/context switch/

It is Tasks RCU that pays attention only to voluntary context switches.

> A context switch implies a full barrier.
> 
> Therefore observing a state change separated by a grace-period implies
> an smp_mb().

Just to be pedantic, for any given CPU and any given grace period,
it is the case that:

1.	That CPU will have executed a full barrier between any code
	executed on any CPU that happens before the beginning of that
	grace period and any RCU read-side critical section on that CPU
	that extends beyond the end of that grace period, and

2.	That CPU will have executed a full barrier between any RCU
	read-side critical section on that CPU that extends before the
	beginning of that grace period and any code executed on any CPU
	that happens after the end of that grace period.

An RCU read-side critical section is: (1) any region of code protected
by rcu_read_lock() and friends, and (2) any region of code where
preemption is disabled that does not contain a call to schedule().

							Thanx, Paul

> > > This is very similar to the percpu_down_read_internal() fast-path.
> > >
> > > The reason this is significant for PowerPC is that it uses the generic
> > > this_cpu_*() implementation which relies on local_irq_disable() (the
> > > x86 implementation relies on it being a single memop instruction to be
> > > IRQ-safe). Switching to preempt_disable() and __this_cpu*() avoids
> > > this IRQ state swizzling. Also, PowerPC needs LWSYNC for the ACQUIRE
> > > barrier, not having to use explicit barriers safes a bunch.
> > > 
> > > Combined this reduces the performance gap by half, down to some 5%.
> > 
> > Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > 
> > Sebastian
[tip: locking/urgent] futex: Optimize per-cpu reference counting
Posted by tip-bot2 for Peter Zijlstra 3 months ago
The following commit has been merged into the locking/urgent branch of tip:

Commit-ID:     4cb5ac2626b5704ed712ac1d46b9d89fdfc12c5d
Gitweb:        https://git.kernel.org/tip/4cb5ac2626b5704ed712ac1d46b9d89fdfc12c5d
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Wed, 16 Jul 2025 16:29:46 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 06 Nov 2025 12:30:54 +01:00

futex: Optimize per-cpu reference counting

Shrikanth noted that the per-cpu reference counter was still some 10%
slower than the old immutable option (which removes the reference
counting entirely).

Further optimize the per-cpu reference counter by:

 - switching from RCU to preempt;
 - using __this_cpu_*() since we now have preempt disabled;
 - switching from smp_load_acquire() to READ_ONCE().

This is all safe because disabling preemption inhibits the RCU grace
period exactly like rcu_read_lock().

Having preemption disabled allows using __this_cpu_*() provided the
only access to the variable is in task context -- which is the case
here.

Furthermore, since we know changing fph->state to FR_ATOMIC demands a
full RCU grace period we can rely on the implied smp_mb() from that to
replace the acquire barrier().

This is very similar to the percpu_down_read_internal() fast-path.

The reason this is significant for PowerPC is that it uses the generic
this_cpu_*() implementation which relies on local_irq_disable() (the
x86 implementation relies on it being a single memop instruction to be
IRQ-safe). Switching to preempt_disable() and __this_cpu*() avoids
this IRQ state swizzling. Also, PowerPC needs LWSYNC for the ACQUIRE
barrier; not having to use explicit barriers saves a bunch.

Combined this reduces the performance gap by half, down to some 5%.

Fixes: 760e6f7befba ("futex: Remove support for IMMUTABLE")
Reported-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20251106092929.GR4067720@noisy.programming.kicks-ass.net
---
 kernel/futex/core.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 125804f..2e77a6e 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1680,10 +1680,10 @@ static bool futex_ref_get(struct futex_private_hash *fph)
 {
 	struct mm_struct *mm = fph->mm;
 
-	guard(rcu)();
+	guard(preempt)();
 
-	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
-		this_cpu_inc(*mm->futex_ref);
+	if (READ_ONCE(fph->state) == FR_PERCPU) {
+		__this_cpu_inc(*mm->futex_ref);
 		return true;
 	}
 
@@ -1694,10 +1694,10 @@ static bool futex_ref_put(struct futex_private_hash *fph)
 {
 	struct mm_struct *mm = fph->mm;
 
-	guard(rcu)();
+	guard(preempt)();
 
-	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
-		this_cpu_dec(*mm->futex_ref);
+	if (READ_ONCE(fph->state) == FR_PERCPU) {
+		__this_cpu_dec(*mm->futex_ref);
 		return false;
 	}
 
Re: [tip: locking/urgent] futex: Optimize per-cpu reference counting
Posted by Shrikanth Hegde 3 months ago

On 11/6/25 5:10 PM, tip-bot2 for Peter Zijlstra wrote:
> The following commit has been merged into the locking/urgent branch of tip:
> 
> Commit-ID:     4cb5ac2626b5704ed712ac1d46b9d89fdfc12c5d
> Gitweb:        https://git.kernel.org/tip/4cb5ac2626b5704ed712ac1d46b9d89fdfc12c5d
> Author:        Peter Zijlstra <peterz@infradead.org>
> AuthorDate:    Wed, 16 Jul 2025 16:29:46 +02:00
> Committer:     Peter Zijlstra <peterz@infradead.org>
> CommitterDate: Thu, 06 Nov 2025 12:30:54 +01:00
> 
> futex: Optimize per-cpu reference counting
> 
> Shrikanth noted that the per-cpu reference counter was still some 10%
> slower than the old immutable option (which removes the reference
> counting entirely).
> 

Thanks for picking it up.