[PATCH v2 0/6] futex: Use RCU-based per-CPU reference counting
Posted by Sebastian Andrzej Siewior 7 months ago
I picked up PeterZ's futex patch from
    https://lore.kernel.org/all/20250624190118.GB1490279@noisy.programming.kicks-ass.net/

and I am posting it here now so it can be staged for v6.17.

This survived a few days on my machine, and the compile robot reported
that it passes its tests.

v1…v2 https://lore.kernel.org/all/20250707143623.70325-1-bigeasy@linutronix.de
 - Removed the IMMUTABLE bits
 - There was a race if the application exits while the RCU callback is
   pending. Plugged with mmget()/mmput_async().

Changes since its initial posting:
- A patch description has been added
- The test suite is "fixed" slightly differently and has been split out
- futex_mm_init() is fixed up.
- The guard(preempt) has been replaced with guard(rcu) since there is
  no reason to disable preemption.

Since it has not been released yet, should we rip out the IMMUTABLE bits
and just stick with the GET/SET slots?
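
For reference, a minimal userspace sketch of those GET/SET slot prctl()s
(the PR_FUTEX_HASH constant values below are assumptions based on the
current uapi header and guarded in case older headers lack them):

#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_FUTEX_HASH
#define PR_FUTEX_HASH			78
#define PR_FUTEX_HASH_SET_SLOTS	1
#define PR_FUTEX_HASH_GET_SLOTS	2
#endif

int main(void)
{
	/* Ask for a process-private futex hash with 512 buckets. */
	if (prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, 512, 0) < 0)
		perror("PR_FUTEX_HASH_SET_SLOTS");

	/* Query how many buckets the private hash currently has. */
	printf("private hash slots: %d\n",
	       prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS, 0, 0));
	return 0;
}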

Peter Zijlstra (1):
  futex: Use RCU-based per-CPU reference counting instead of rcuref_t

Sebastian Andrzej Siewior (5):
  selftests/futex: Adapt the private hash test to RCU related changes
  futex: Make futex_private_hash_get() static
  futex: Remove support for IMMUTABLE
  selftests/futex: Remove support for IMMUTABLE
  perf bench futex: Remove support for IMMUTABLE

 include/linux/futex.h                         |  16 +-
 include/linux/mm_types.h                      |   5 +
 include/linux/sched/mm.h                      |   2 +-
 include/uapi/linux/prctl.h                    |   2 -
 init/Kconfig                                  |   4 -
 kernel/fork.c                                 |   8 +-
 kernel/futex/core.c                           | 281 ++++++++++++++----
 kernel/futex/futex.h                          |   2 -
 tools/include/uapi/linux/prctl.h              |   2 -
 tools/perf/bench/futex-hash.c                 |   1 -
 tools/perf/bench/futex-lock-pi.c              |   1 -
 tools/perf/bench/futex-requeue.c              |   1 -
 tools/perf/bench/futex-wake-parallel.c        |   1 -
 tools/perf/bench/futex-wake.c                 |   1 -
 tools/perf/bench/futex.c                      |  21 +-
 tools/perf/bench/futex.h                      |   1 -
 .../trace/beauty/include/uapi/linux/prctl.h   |   2 -
 .../futex/functional/futex_priv_hash.c        | 113 +++----
 18 files changed, 315 insertions(+), 149 deletions(-)

-- 
2.50.0
Re: [PATCH v2 0/6] futex: Use RCU-based per-CPU reference counting
Posted by Shrikanth Hegde 6 months, 3 weeks ago

On 7/10/25 16:30, Sebastian Andrzej Siewior wrote:
> I picked up PeterZ futex patch from
>      https://lore.kernel.org/all/20250624190118.GB1490279@noisy.programming.kicks-ass.net/
> 
> and I am posting it here it now so it can be staged for v6.17.
> 
> This survived a few days on my machine and compile robot reported that
> is passes its tests.
> 
> v1…v2 https://lore.kernel.org/all/20250707143623.70325-1-bigeasy@linutronix.de
>   - Removed the IMMUTABLE bits
>   - There was a race if the application exits while the RCU callback is
>     pending. Stuffed with mmget()/ mmput_async().
> 
> Changes since its initial posting:
> - A patch description has been added
> - The testuite is "fixed" slightly different and has been split out
> - futex_mm_init() is fixed up.
> - The guard(preempt) has been replaced with guard(rcu) since there is
>    no reason to disable preemption.
> 
> Since it was not yet released, should we rip out the IMMUTABLE bits and
> just stick with GET/SET slots?
> 
> Peter Zijlstra (1):
>    futex: Use RCU-based per-CPU reference counting instead of rcuref_t
> 
> Sebastian Andrzej Siewior (5):
>    selftests/futex: Adapt the private hash test to RCU related changes
>    futex: Make futex_private_hash_get() static
>    futex: Remove support for IMMUTABLE
>    selftests/futex: Remove support for IMMUTABLE
>    perf bench futex: Remove support for IMMUTABLE
> 
>   include/linux/futex.h                         |  16 +-
>   include/linux/mm_types.h                      |   5 +
>   include/linux/sched/mm.h                      |   2 +-
>   include/uapi/linux/prctl.h                    |   2 -
>   init/Kconfig                                  |   4 -
>   kernel/fork.c                                 |   8 +-
>   kernel/futex/core.c                           | 281 ++++++++++++++----
>   kernel/futex/futex.h                          |   2 -
>   tools/include/uapi/linux/prctl.h              |   2 -
>   tools/perf/bench/futex-hash.c                 |   1 -
>   tools/perf/bench/futex-lock-pi.c              |   1 -
>   tools/perf/bench/futex-requeue.c              |   1 -
>   tools/perf/bench/futex-wake-parallel.c        |   1 -
>   tools/perf/bench/futex-wake.c                 |   1 -
>   tools/perf/bench/futex.c                      |  21 +-
>   tools/perf/bench/futex.h                      |   1 -
>   .../trace/beauty/include/uapi/linux/prctl.h   |   2 -
>   .../futex/functional/futex_priv_hash.c        | 113 +++----
>   18 files changed, 315 insertions(+), 149 deletions(-)
> 

Hi. Sorry for not stumbling upon this earlier; I only saw these now.

Since perf bench had shown a significant regression last time around,
for which the immutable option was added, I gave perf futex a try again.

Below are the results. Run on a 5-core LPAR (VM) on Power; perf was compiled from tools/perf.

===========
baseline:
===========
tip/master at
commit 8784fb5fa2e0042fe3b1632d4876e1037b695f56 (HEAD)
Author: Borislav Petkov (AMD) <bp@alien8.de>

./perf bench futex hash
Averaged 1559643 operations/sec (+- 0.09%), total secs = 10
Futex hashing: global hash

schbench -t 64 -r 5 -i 5
current rps: 2629.85

schbench -L -m 4 -M auto -t 64 -n 0 -r 5 -i 5
current rps: 1538674.22

=================
baseline + series
=================

./perf bench futex hash
Averaged 306403 operations/sec (+- 0.29%), total secs = 10    <<<  around 1/5th of baseline.
Futex hashing: auto resized to 256 buckets                    <<<  maybe the resize doesn't happen fast enough?


./perf bench futex hash -b 512                                <<< Gave 512 buckets,
Averaged 1412543 operations/sec (+- 0.14%), total secs = 10   <<< much better numbers, still off by 8-10%.
Futex hashing: 512 hash buckets

(512 is the number of buckets the baseline would have used; increased the buckets to 8192 as a trial)

./perf bench futex hash -b 8192
Averaged 1441627 operations/sec (+- 0.14%), total secs = 10
Futex hashing: 8192 hash buckets


schbench -t 64 -r 5 -i 5
current rps: 2656.85                                          <<< schbench seems good.

schbench -L -m 4 -M auto -t 64 -n 0 -r 5 -i 5
current rps: 1539273.79
Re: [PATCH v2 0/6] futex: Use RCU-based per-CPU reference counting
Posted by Sebastian Andrzej Siewior 6 months, 3 weeks ago
On 2025-07-15 21:29:34 [+0530], Shrikanth Hegde wrote:
> Hi. Sorry for not stumble upon this earlier. Saw these now.
> 
> Since perf bench had shown a significant regression last time around, and
> for which immutable option was added, gave perf futex a try again.
> 
> Below are the results: Ran on 5 core LPAR(VM) on power. perf was compiled from tools/perf.
Thank you.

If you use perf-bench with -b then the buckets are applied
"immediately". It mostly also works with auto scaling. The problem is
that perf creates the threads and starts the test immediately
afterwards. While RCU kicks in shortly after, no transition happens
until the test completes and the threads terminate. The reason is that
several private-hash references are always in use because some threads
are always in the futex() syscall.

It would require something like commit
    a255b78d14324 ("selftests/futex: Adapt the private hash test to RCU related changes")

to make this transition happen before the test starts.
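
Roughly, the benchmark would have to wait until the resize has actually
taken effect before the measured phase begins, e.g. after the worker
threads are created but before they enter the futex() loop. A
hypothetical sketch (not the actual selftest change; it assumes
PR_FUTEX_HASH_GET_SLOTS reports the hash currently in use):

#include <sys/prctl.h>
#include <unistd.h>

#ifndef PR_FUTEX_HASH
#define PR_FUTEX_HASH			78
#define PR_FUTEX_HASH_GET_SLOTS	2
#endif

/*
 * Poll until the private hash reports the expected size, i.e. until the
 * RCU-driven transition to the resized hash has actually happened.
 */
static int wait_for_hash_slots(int expected, int retries)
{
	while (retries-- > 0) {
		if (prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS, 0, 0) == expected)
			return 0;
		usleep(10 * 1000);	/* give the RCU callback a chance to run */
	}
	return -1;
}
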
Your schbench seems unaffected?

If you use -b, is it better than or equal to the immutable option?
This isn't quite clear to me.

Sebastian
Re: [PATCH v2 0/6] futex: Use RCU-based per-CPU reference counting
Posted by Shrikanth Hegde 6 months, 3 weeks ago

On 7/15/25 22:01, Sebastian Andrzej Siewior wrote:
> On 2025-07-15 21:29:34 [+0530], Shrikanth Hegde wrote:
>> Hi. Sorry for not stumble upon this earlier. Saw these now.
>>
>> Since perf bench had shown a significant regression last time around, and
>> for which immutable option was added, gave perf futex a try again.
>>
>> Below are the results: Ran on 5 core LPAR(VM) on power. perf was compiled from tools/perf.
> Thank you.
> 
> If you use perf-bench with -b then the buckets are applied
> "immediately". It mostly works also with auto scaling. The problem is
> that perf creates the threads and immediately after it starts the test.
> While the RCU kicks in shortly after there is no transition happening
> until after all the test completes/ the threads terminate. The reason is
> that several private-hash references are in use because a some threads
> are always in the futex() syscall.
> 
> It would require something like commit
>      a255b78d14324 ("selftests/futex: Adapt the private hash test to RCU related changes")
> 
> to have this transition before the test starts.
> Your schbench seems not affected?

Yes. schbench shows similar numbers.

> 
> If you use -b, is it better than or equal compared to the immutable
> option? This isn't quite clear.


I did try again by going back to baseline, removing BROKEN, and running the below, which gives us the immutable numbers.
./perf bench futex hash -Ib512
Averaged 1536035 operations/sec (+- 0.11%), total secs = 10
Futex hashing: 512 hash buckets (immutable)

So, with the -b 512 option, it is around 8-10% slower compared to immutable.
Re: [PATCH v2 0/6] futex: Use RCU-based per-CPU reference counting
Posted by Peter Zijlstra 6 months, 3 weeks ago
On Tue, Jul 15, 2025 at 10:34:24PM +0530, Shrikanth Hegde wrote:

> I did try again by going to baseline, removed BROKEN and ran below. Which gives us immutable numbers.
> ./perf bench futex hash -Ib512
> Averaged 1536035 operations/sec (+- 0.11%), total secs = 10
> Futex hashing: 512 hash buckets (immutable)
> 
> So, with -b 512 option, it is around 8-10% less compared to immutable.

Urgh, can you run perf on that and tell me if this is due to
this_cpu_{inc,dec}() doing local_irq_disable() or the smp_load_acquire()
doing LWSYNC?

Anyway, I think we can improve both. Does the below help?


---
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index d9bb5567af0c..8c41d050bd1f 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1680,10 +1680,10 @@ static bool futex_ref_get(struct futex_private_hash *fph)
 {
 	struct mm_struct *mm = fph->mm;
 
-	guard(rcu)();
+	guard(preempt)();
 
-	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
-		this_cpu_inc(*mm->futex_ref);
+	if (READ_ONCE(fph->state) == FR_PERCPU) {
+		__this_cpu_inc(*mm->futex_ref);
 		return true;
 	}
 
@@ -1694,10 +1694,10 @@ static bool futex_ref_put(struct futex_private_hash *fph)
 {
 	struct mm_struct *mm = fph->mm;
 
-	guard(rcu)();
+	guard(preempt)();
 
-	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
-		this_cpu_dec(*mm->futex_ref);
+	if (READ_ONCE(fph->state) == FR_PERCPU) {
+		__this_cpu_dec(*mm->futex_ref);
 		return false;
 	}
Re: [PATCH v2 0/6] futex: Use RCU-based per-CPU reference counting
Posted by Shrikanth Hegde 6 months, 3 weeks ago

On 7/16/25 19:59, Peter Zijlstra wrote:
> On Tue, Jul 15, 2025 at 10:34:24PM +0530, Shrikanth Hegde wrote:
> 
>> I did try again by going to baseline, removed BROKEN and ran below. Which gives us immutable numbers.
>> ./perf bench futex hash -Ib512
>> Averaged 1536035 operations/sec (+- 0.11%), total secs = 10
>> Futex hashing: 512 hash buckets (immutable)
>>
>> So, with -b 512 option, it is around 8-10% less compared to immutable.
> 
> Urgh, can you run perf on that and tell me if this is due to
> this_cpu_{inc,dec}() doing local_irq_disable() or the smp_load_acquire()
> doing LWSYNC ?

It seems to be due to RCU and the IRQ enable/disable.
Both perf records were collected with -b512.


base_futex_immutable_b512 - perf record collected with baseline + remove BROKEN + ./perf bench futex hash -Ib512
per_cpu_futex_hash_b_512 - baseline + series + ./perf bench futex hash -b512


perf diff base_futex_immutable_b512 per_cpu_futex_hash_b_512
# Event 'cycles'
#
# Baseline  Delta Abs  Shared Object               Symbol
# ........  .........  ..........................  ....................................................
#
     21.62%     -2.26%  [kernel.vmlinux]            [k] futex_get_value_locked
      0.16%     +2.01%  [kernel.vmlinux]            [k] __rcu_read_unlock
      1.35%     +1.63%  [kernel.vmlinux]            [k] arch_local_irq_restore.part.0
                +1.48%  [kernel.vmlinux]            [k] futex_private_hash_put
                +1.16%  [kernel.vmlinux]            [k] futex_ref_get
     10.41%     -0.78%  [kernel.vmlinux]            [k] system_call_vectored_common
      1.24%     +0.72%  perf                        [.] workerfn
      5.32%     -0.66%  [kernel.vmlinux]            [k] futex_q_lock
      2.48%     -0.43%  [kernel.vmlinux]            [k] futex_wait
      2.47%     -0.40%  [kernel.vmlinux]            [k] _raw_spin_lock
      2.98%     -0.35%  [kernel.vmlinux]            [k] futex_q_unlock
      2.42%     -0.34%  [kernel.vmlinux]            [k] __futex_wait
      5.47%     -0.32%  libc.so.6                   [.] syscall
      4.03%     -0.32%  [kernel.vmlinux]            [k] memcpy_power7
      0.16%     +0.22%  [kernel.vmlinux]            [k] arch_local_irq_restore
      5.93%     -0.18%  [kernel.vmlinux]            [k] futex_hash
      1.72%     -0.17%  [kernel.vmlinux]            [k] sys_futex


> 
> Anyway, I think we can improve both. Does the below help?
> 
> 
> ---
> diff --git a/kernel/futex/core.c b/kernel/futex/core.c
> index d9bb5567af0c..8c41d050bd1f 100644
> --- a/kernel/futex/core.c
> +++ b/kernel/futex/core.c
> @@ -1680,10 +1680,10 @@ static bool futex_ref_get(struct futex_private_hash *fph)
>   {
>   	struct mm_struct *mm = fph->mm;
>   
> -	guard(rcu)();
> +	guard(preempt)();
>   
> -	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
> -		this_cpu_inc(*mm->futex_ref);
> +	if (READ_ONCE(fph->state) == FR_PERCPU) {
> +		__this_cpu_inc(*mm->futex_ref);
>   		return true;
>   	}
>   
> @@ -1694,10 +1694,10 @@ static bool futex_ref_put(struct futex_private_hash *fph)
>   {
>   	struct mm_struct *mm = fph->mm;
>   
> -	guard(rcu)();
> +	guard(preempt)();
>   
> -	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
> -		this_cpu_dec(*mm->futex_ref);
> +	if (READ_ONCE(fph->state) == FR_PERCPU) {
> +		__this_cpu_dec(*mm->futex_ref);
>   		return false;
>   	}
>   

Yes, it helps. It improves the "-b 512" numbers by at least 5%.

baseline + series:
Averaged 1412543 operations/sec (+- 0.14%), total secs = 10
Futex hashing: 512 hash buckets


baseline + series + above_patch:
Averaged 1482733 operations/sec (+- 0.26%), total secs = 10   <<< 5% improvement
Futex hashing: 512 hash buckets


Now we are within 4-5% of the baseline/immutable numbers.
baseline:
commit 8784fb5fa2e0042fe3b1632d4876e1037b695f56 (HEAD)

./perf bench futex hash
Averaged 1559643 operations/sec (+- 0.09%), total secs = 10
Futex hashing: global hash
Re: [PATCH v2 0/6] futex: Use RCU-based per-CPU reference counting
Posted by Peter Zijlstra 3 months ago
On Wed, Jul 16, 2025 at 11:51:46PM +0530, Shrikanth Hegde wrote:

> > Anyway, I think we can improve both. Does the below help?
> > 
> > 
> > ---
> > diff --git a/kernel/futex/core.c b/kernel/futex/core.c
> > index d9bb5567af0c..8c41d050bd1f 100644
> > --- a/kernel/futex/core.c
> > +++ b/kernel/futex/core.c
> > @@ -1680,10 +1680,10 @@ static bool futex_ref_get(struct futex_private_hash *fph)
> >   {
> >   	struct mm_struct *mm = fph->mm;
> > -	guard(rcu)();
> > +	guard(preempt)();
> > -	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
> > -		this_cpu_inc(*mm->futex_ref);
> > +	if (READ_ONCE(fph->state) == FR_PERCPU) {
> > +		__this_cpu_inc(*mm->futex_ref);
> >   		return true;
> >   	}
> > @@ -1694,10 +1694,10 @@ static bool futex_ref_put(struct futex_private_hash *fph)
> >   {
> >   	struct mm_struct *mm = fph->mm;
> > -	guard(rcu)();
> > +	guard(preempt)();
> > -	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
> > -		this_cpu_dec(*mm->futex_ref);
> > +	if (READ_ONCE(fph->state) == FR_PERCPU) {
> > +		__this_cpu_dec(*mm->futex_ref);
> >   		return false;
> >   	}
> 
> Yes. It helps. It improves "-b 512" numbers by at-least 5%.

While talking with Sebastian about this work, I realized this patch was
never committed. So I've written it up like so, and will commit it to
tip/locking/urgent soonish.

---
Subject: futex: Optimize per-cpu reference counting
From: Peter Zijlstra <peterz@infradead.org>
Date: Wed, 16 Jul 2025 16:29:46 +0200

Shrikanth noted that the per-cpu reference counter was still some 10%
slower than the old immutable option (which removes the reference
counting entirely).

Further optimize the per-cpu reference counter by:

 - switching from RCU to preempt;
 - using __this_cpu_*() since we now have preempt disabled;
 - switching from smp_load_acquire() to READ_ONCE().

This is all safe because disabling preemption inhibits the RCU grace
period exactly like rcu_read_lock().

Having preemption disabled allows using __this_cpu_*() provided the
only access to the variable is in task context -- which is the case
here.

Furthermore, since we know changing fph->state to FR_ATOMIC demands a
full RCU grace period we can rely on the implied smp_mb() from that to
replace the acquire barrier().

This is very similar to the percpu_down_read_internal() fast-path.

The reason this is significant for PowerPC is that it uses the generic
this_cpu_*() implementation which relies on local_irq_disable() (the
x86 implementation relies on it being a single memop instruction to be
IRQ-safe). Switching to preempt_disable() and __this_cpu*() avoids
this IRQ state swizzling. Also, PowerPC needs LWSYNC for the ACQUIRE
barrier; not having to use explicit barriers saves a bunch.

Combined this reduces the performance gap by half, down to some 5%.

Reported-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/futex/core.c |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1680,10 +1680,10 @@ static bool futex_ref_get(struct futex_p
 {
 	struct mm_struct *mm = fph->mm;
 
-	guard(rcu)();
+	guard(preempt)();
 
-	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
-		this_cpu_inc(*mm->futex_ref);
+	if (READ_ONCE(fph->state) == FR_PERCPU) {
+		__this_cpu_inc(*mm->futex_ref);
 		return true;
 	}
 
@@ -1694,10 +1694,10 @@ static bool futex_ref_put(struct futex_p
 {
 	struct mm_struct *mm = fph->mm;
 
-	guard(rcu)();
+	guard(preempt)();
 
-	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
-		this_cpu_dec(*mm->futex_ref);
+	if (READ_ONCE(fph->state) == FR_PERCPU) {
+		__this_cpu_dec(*mm->futex_ref);
 		return false;
 	}
Re: [PATCH v2 0/6] futex: Use RCU-based per-CPU reference counting
Posted by Sebastian Andrzej Siewior 3 months ago
On 2025-11-06 10:29:29 [+0100], Peter Zijlstra wrote:
> Subject: futex: Optimize per-cpu reference counting
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Wed, 16 Jul 2025 16:29:46 +0200
> 
> Shrikanth noted that the per-cpu reference counter was still some 10%
> slower than the old immutable option (which removes the reference
> counting entirely).
> 
> Further optimize the per-cpu reference counter by:
> 
>  - switching from RCU to preempt;
>  - using __this_cpu_*() since we now have preempt disabled;
>  - switching from smp_load_acquire() to READ_ONCE().
> 
> This is all safe because disabling preemption inhibits the RCU grace
> period exactly like rcu_read_lock().
> 
> Having preemption disabled allows using __this_cpu_*() provided the
> only access to the variable is in task context -- which is the case
> here.

Right. Reads and writes from softirq happen only after the user has
transitioned to atomics.

> Furthermore, since we know changing fph->state to FR_ATOMIC demands a
> full RCU grace period we can rely on the implied smp_mb() from that to
> replace the acquire barrier().

That is the only part I struggle with, but having an smp_mb() after a
grace period sounds reasonable.

> This is very similar to the percpu_down_read_internal() fast-path.
>
> The reason this is significant for PowerPC is that it uses the generic
> this_cpu_*() implementation which relies on local_irq_disable() (the
> x86 implementation relies on it being a single memop instruction to be
> IRQ-safe). Switching to preempt_disable() and __this_cpu*() avoids
> this IRQ state swizzling. Also, PowerPC needs LWSYNC for the ACQUIRE
> barrier, not having to use explicit barriers safes a bunch.
> 
> Combined this reduces the performance gap by half, down to some 5%.

Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Sebastian
Re: [PATCH v2 0/6] futex: Use RCU-based per-CPU reference counting
Posted by Peter Zijlstra 3 months ago
On Thu, Nov 06, 2025 at 12:09:07PM +0100, Sebastian Andrzej Siewior wrote:
> On 2025-11-06 10:29:29 [+0100], Peter Zijlstra wrote:
> > Subject: futex: Optimize per-cpu reference counting
> > From: Peter Zijlstra <peterz@infradead.org>
> > Date: Wed, 16 Jul 2025 16:29:46 +0200
> > 
> > Shrikanth noted that the per-cpu reference counter was still some 10%
> > slower than the old immutable option (which removes the reference
> > counting entirely).
> > 
> > Further optimize the per-cpu reference counter by:
> > 
> >  - switching from RCU to preempt;
> >  - using __this_cpu_*() since we now have preempt disabled;
> >  - switching from smp_load_acquire() to READ_ONCE().
> > 
> > This is all safe because disabling preemption inhibits the RCU grace
> > period exactly like rcu_read_lock().
> > 
> > Having preemption disabled allows using __this_cpu_*() provided the
> > only access to the variable is in task context -- which is the case
> > here.
> 
> Right. Read and Write from softirq happens after the user transitioned
> to atomics.
> 
> > Furthermore, since we know changing fph->state to FR_ATOMIC demands a
> > full RCU grace period we can rely on the implied smp_mb() from that to
> > replace the acquire barrier().
> 
> That is the only part I struggle with but having a smp_mb() after a
> grace period sounds reasonable.

IIRC the argument goes something like so:

A grace-period (for rcu-sched, which is implied by regular rcu)
implies that every task has done at least one voluntary context switch.
A context switch implies a full barrier.

Therefore observing a state change separated by a grace-period implies
an smp_mb().
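
As an illustration of how the writer side can lean on that (pseudocode
only; the helper name and the atomic_ref field are made up, while
fph->state, FR_ATOMIC/FR_PERCPU and mm->futex_ref are from the patch):

/* Illustrative pseudocode, not the actual futex code. */
static void futex_ref_switch_to_atomic(struct futex_private_hash *fph)
{
	int cpu;

	WRITE_ONCE(fph->state, FR_ATOMIC);

	/*
	 * After a full grace period every CPU has gone through a context
	 * switch (a full barrier), so any reader that ran with preemption
	 * disabled and still saw FR_PERCPU has finished its
	 * __this_cpu_inc()/__this_cpu_dec() and that update is visible here.
	 */
	synchronize_rcu();

	/* Only now fold the per-CPU counts into a single atomic count. */
	for_each_possible_cpu(cpu)
		atomic_long_add(*per_cpu_ptr(fph->mm->futex_ref, cpu),
				&fph->atomic_ref);	/* made-up field */
}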

> > This is very similar to the percpu_down_read_internal() fast-path.
> >
> > The reason this is significant for PowerPC is that it uses the generic
> > this_cpu_*() implementation which relies on local_irq_disable() (the
> > x86 implementation relies on it being a single memop instruction to be
> > IRQ-safe). Switching to preempt_disable() and __this_cpu*() avoids
> > this IRQ state swizzling. Also, PowerPC needs LWSYNC for the ACQUIRE
> > barrier, not having to use explicit barriers safes a bunch.
> > 
> > Combined this reduces the performance gap by half, down to some 5%.
> 
> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> 
> Sebastian
Re: [PATCH v2 0/6] futex: Use RCU-based per-CPU reference counting
Posted by Paul E. McKenney 3 months ago
On Thu, Nov 06, 2025 at 12:23:39PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 06, 2025 at 12:09:07PM +0100, Sebastian Andrzej Siewior wrote:
> > On 2025-11-06 10:29:29 [+0100], Peter Zijlstra wrote:
> > > Subject: futex: Optimize per-cpu reference counting
> > > From: Peter Zijlstra <peterz@infradead.org>
> > > Date: Wed, 16 Jul 2025 16:29:46 +0200
> > > 
> > > Shrikanth noted that the per-cpu reference counter was still some 10%
> > > slower than the old immutable option (which removes the reference
> > > counting entirely).
> > > 
> > > Further optimize the per-cpu reference counter by:
> > > 
> > >  - switching from RCU to preempt;
> > >  - using __this_cpu_*() since we now have preempt disabled;
> > >  - switching from smp_load_acquire() to READ_ONCE().
> > > 
> > > This is all safe because disabling preemption inhibits the RCU grace
> > > period exactly like rcu_read_lock().
> > > 
> > > Having preemption disabled allows using __this_cpu_*() provided the
> > > only access to the variable is in task context -- which is the case
> > > here.
> > 
> > Right. Read and Write from softirq happens after the user transitioned
> > to atomics.
> > 
> > > Furthermore, since we know changing fph->state to FR_ATOMIC demands a
> > > full RCU grace period we can rely on the implied smp_mb() from that to
> > > replace the acquire barrier().
> > 
> > That is the only part I struggle with but having a smp_mb() after a
> > grace period sounds reasonable.
> 
> IIRC the argument goes something like so:
> 
> A grace-period (for rcu-sched, which is implied by regular rcu)
> implies that every task has done at least one voluntary context switch.

Agreed, except for: s/voluntary context switch/context switch/

It is Tasks RCU that pays attention only to voluntary context switches.

> A context switch implies a full barrier.
> 
> Therefore observing a state change separated by a grace-period implies
> an smp_mb().

Just to be pedantic, for any given CPU and any given grace period,
it is the case that:

1.	That CPU will have executed a full barrier between any code
	executed on any CPU that happens before the beginning of that
	grace period and any RCU read-side critical section on that CPU
	that extends beyond the end of that grace period, and

2.	That CPU will have executed a full barrier between any RCU
	read-side critical section on that CPU that extends before the
	beginning of that grace period and any code executed on any CPU
	that happens after the end of that grace period.

An RCU read-side critical section is: (1) any region of code protected
by rcu_read_lock() and friends, and (2) any region of code where
preemption is disabled that does not contain a call to schedule().

							Thanx, Paul

> > > This is very similar to the percpu_down_read_internal() fast-path.
> > >
> > > The reason this is significant for PowerPC is that it uses the generic
> > > this_cpu_*() implementation which relies on local_irq_disable() (the
> > > x86 implementation relies on it being a single memop instruction to be
> > > IRQ-safe). Switching to preempt_disable() and __this_cpu*() avoids
> > > this IRQ state swizzling. Also, PowerPC needs LWSYNC for the ACQUIRE
> > > barrier, not having to use explicit barriers safes a bunch.
> > > 
> > > Combined this reduces the performance gap by half, down to some 5%.
> > 
> > Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > 
> > Sebastian
[tip: locking/urgent] futex: Optimize per-cpu reference counting
Posted by tip-bot2 for Peter Zijlstra 3 months ago
The following commit has been merged into the locking/urgent branch of tip:

Commit-ID:     4cb5ac2626b5704ed712ac1d46b9d89fdfc12c5d
Gitweb:        https://git.kernel.org/tip/4cb5ac2626b5704ed712ac1d46b9d89fdfc12c5d
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Wed, 16 Jul 2025 16:29:46 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 06 Nov 2025 12:30:54 +01:00

futex: Optimize per-cpu reference counting

Shrikanth noted that the per-cpu reference counter was still some 10%
slower than the old immutable option (which removes the reference
counting entirely).

Further optimize the per-cpu reference counter by:

 - switching from RCU to preempt;
 - using __this_cpu_*() since we now have preempt disabled;
 - switching from smp_load_acquire() to READ_ONCE().

This is all safe because disabling preemption inhibits the RCU grace
period exactly like rcu_read_lock().

Having preemption disabled allows using __this_cpu_*() provided the
only access to the variable is in task context -- which is the case
here.

Furthermore, since we know changing fph->state to FR_ATOMIC demands a
full RCU grace period we can rely on the implied smp_mb() from that to
replace the acquire barrier().

This is very similar to the percpu_down_read_internal() fast-path.

The reason this is significant for PowerPC is that it uses the generic
this_cpu_*() implementation which relies on local_irq_disable() (the
x86 implementation relies on it being a single memop instruction to be
IRQ-safe). Switching to preempt_disable() and __this_cpu*() avoids
this IRQ state swizzling. Also, PowerPC needs LWSYNC for the ACQUIRE
barrier; not having to use explicit barriers saves a bunch.

Combined this reduces the performance gap by half, down to some 5%.

Fixes: 760e6f7befba ("futex: Remove support for IMMUTABLE")
Reported-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20251106092929.GR4067720@noisy.programming.kicks-ass.net
---
 kernel/futex/core.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 125804f..2e77a6e 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1680,10 +1680,10 @@ static bool futex_ref_get(struct futex_private_hash *fph)
 {
 	struct mm_struct *mm = fph->mm;
 
-	guard(rcu)();
+	guard(preempt)();
 
-	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
-		this_cpu_inc(*mm->futex_ref);
+	if (READ_ONCE(fph->state) == FR_PERCPU) {
+		__this_cpu_inc(*mm->futex_ref);
 		return true;
 	}
 
@@ -1694,10 +1694,10 @@ static bool futex_ref_put(struct futex_private_hash *fph)
 {
 	struct mm_struct *mm = fph->mm;
 
-	guard(rcu)();
+	guard(preempt)();
 
-	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
-		this_cpu_dec(*mm->futex_ref);
+	if (READ_ONCE(fph->state) == FR_PERCPU) {
+		__this_cpu_dec(*mm->futex_ref);
 		return false;
 	}
 
Re: [tip: locking/urgent] futex: Optimize per-cpu reference counting
Posted by Shrikanth Hegde 3 months ago

On 11/6/25 5:10 PM, tip-bot2 for Peter Zijlstra wrote:
> The following commit has been merged into the locking/urgent branch of tip:
> 
> Commit-ID:     4cb5ac2626b5704ed712ac1d46b9d89fdfc12c5d
> Gitweb:        https://git.kernel.org/tip/4cb5ac2626b5704ed712ac1d46b9d89fdfc12c5d
> Author:        Peter Zijlstra <peterz@infradead.org>
> AuthorDate:    Wed, 16 Jul 2025 16:29:46 +02:00
> Committer:     Peter Zijlstra <peterz@infradead.org>
> CommitterDate: Thu, 06 Nov 2025 12:30:54 +01:00
> 
> futex: Optimize per-cpu reference counting
> 
> Shrikanth noted that the per-cpu reference counter was still some 10%
> slower than the old immutable option (which removes the reference
> counting entirely).
> 

Thanks for picking it up.