[PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL

Sebastian Andrzej Siewior posted 21 patches 9 months, 1 week ago
Hi,

this is a follow up on
        https://lore.kernel.org/ZwVOMgBMxrw7BU9A@jlelli-thinkpadt14gen4.remote.csb

and adds support for task local futex_hash_bucket.

This is the local hash map series, based on v9, extended with PeterZ's
FUTEX2_NUMA and FUTEX2_MPOL plus a few fixes on top.

The complete tree is at
	https://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/staging.git/log/?h=futex_local_v10
	https://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/staging.git futex_local_v10

v9…v10: https://lore.kernel.org/all/20250225170914.289358-1-bigeasy@linutronix.de/
  - The rcuref_read() check in __futex_hash_private() has been replaced
    with rcuref_is_dead() which is added as part of the series.

  - The local hash support depended on !CONFIG_BASE_SMALL, which has been
    replaced with CONFIG_FUTEX_PRIVATE_HASH. This is defined as
    "!BASE_SMALL && MMU" because, as part of the rework,
    futex_key::private::mm is used, which is not set on !CONFIG_MMU builds.

  - Added CONFIG_FUTEX_MPOL to build on !NUMA configs.

  - Replaced direct access of mm_struct::futex_phash with an RCU
    accessor.
 
  - futex_hash_allocate() for !CONFIG_FUTEX_PRIVATE_HASH returns an
    error. This does not affect fork() but is noticed by
    PR_FUTEX_HASH_SET_SLOTS.

  - futex_init() ensures the computed hashsize is not less than 4 after
    the divide by num_possible_nodes().

  - futex_init() now prints info about the number of used hash table
    entries (in the global hash), the occupied memory and the allocation
    method. This output vanished after the removal of
    alloc_large_system_hash().

  - There is again a WARN_ON in the futex_hash_free() path if the task
    failed to free all references (that would be a leak).

  - vmalloc_huge_node_noprof():
    - Replaced __vmalloc_node_range() with __vmalloc_node_range_noprof()
      to skip the alloc_hooks() layer which is already part of
      vmalloc_huge_node().
    - Added vmalloc_huge_node_noprof() for !MMU.

v8…v9 https://lore.kernel.org/all/20250203135935.440018-1-bigeasy@linutronix.de
  - Rebased on top of PeterZ's futex_class.
  - A few patches vanished due to the class rework.
  - struct futex_hash_bucket now has a pointer to futex_private_hash
    instead of a slot number.
  - CONFIG_BASE_SMALL now removes support for the "futex local hash"
    instead of restricting it to 2 slots.
  - The number of threads, used to determine the number of slots, is
    capped at num_online_cpus().

Peter Zijlstra (11):
  futex: Move futex_queue() into futex_wait_setup()
  futex: Pull futex_hash() out of futex_q_lock()
  futex: Create hb scopes
  futex: Create futex_hash() get/put class
  futex: s/hb_p/fph/
  futex: Remove superfluous state
  futex: Untangle and naming
  futex: Rework SET_SLOTS
  mm: Add vmalloc_huge_node()
  futex: Implement FUTEX2_NUMA
  futex: Implement FUTEX2_MPOL

Sebastian Andrzej Siewior (10):
  rcuref: Provide rcuref_is_dead().
  futex: Create helper function to initialize a hash slot.
  futex: Add basic infrastructure for local task local hash.
  futex: Hash only the address for private futexes.
  futex: Allow automatic allocation of process wide futex hash.
  futex: Decrease the waiter count before the unlock operation.
  futex: Introduce futex_q_lockptr_lock().
  futex: Acquire a hash reference in futex_wait_multiple_setup().
  futex: Allow to re-allocate the private local hash.
  futex: Resize local futex hash table based on number of threads.

 include/linux/futex.h      |  34 +-
 include/linux/mm_types.h   |   7 +-
 include/linux/mmap_lock.h  |   4 +
 include/linux/rcuref.h     |  22 +-
 include/linux/vmalloc.h    |   3 +
 include/uapi/linux/futex.h |  10 +-
 include/uapi/linux/prctl.h |   5 +
 init/Kconfig               |  10 +
 io_uring/futex.c           |   4 +-
 kernel/fork.c              |  24 ++
 kernel/futex/core.c        | 746 +++++++++++++++++++++++++++++++++----
 kernel/futex/futex.h       |  81 +++-
 kernel/futex/pi.c          | 300 ++++++++-------
 kernel/futex/requeue.c     | 480 ++++++++++++------------
 kernel/futex/waitwake.c    | 203 +++++-----
 kernel/sys.c               |   4 +
 mm/nommu.c                 |   5 +
 mm/vmalloc.c               |   7 +
 18 files changed, 1396 insertions(+), 553 deletions(-)

-- 
2.47.2
Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
Posted by Shrikanth Hegde 9 months ago

On 3/12/25 20:46, Sebastian Andrzej Siewior wrote:
> Hi,
> 
> this is a follow up on
>          https://lore.kernel.org/ZwVOMgBMxrw7BU9A@jlelli-thinkpadt14gen4.remote.csb
> 
> and adds support for task local futex_hash_bucket.
> 
> This is the local hash map series based on v9 extended with PeterZ
> FUTEX2_NUMA and FUTEX2_MPOL plus a few fixes on top.
> 
> The complete tree is at
> 	https://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/staging.git/log/?h=futex_local_v10
> 	https://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/staging.git futex_local_v10
> 

Hi Sebastian. Thanks for working on this (along with bringing back FUTEX2 NUMA) which
might help large systems with many futexes.

I tried this on one of our systems (single NUMA node, 80 CPUs) and I see a
significant reduction in futex/hash. Maybe I am missing some config or doing
something stupid w.r.t. benchmarking. I am trying to understand this stuff.

I ran "perf bench futex all" as is. No change has been made to perf.
=========================================
Without patch: at 6575d1b4a6ef3336608127c704b612bc5e7b0fdc
# Running futex/hash benchmark...
Run summary [PID 45758]: 80 threads, each operating on 1024 [private] futexes for 10 secs.
Averaged 1556023 operations/sec (+- 0.08%), total secs = 10   <<--- 1.5M

=========================================
With the Series: I had to make PR_FUTEX_HASH=78 since 77 is used for TIMERs.

# Running futex/hash benchmark...
Run summary [PID 8644]: 80 threads, each operating on 1024 [private] futexes for 10 secs.
Averaged 150382 operations/sec (+- 0.42%), total secs = 10   <<-- 0.15M, close to 10x down.

=========================================

I did a git bisect based on the futex/hash numbers. It narrowed it down to this one.
first bad commit: [5dc017a816766be47ffabe97b7e5f75919756e5c] futex: Allow automatic allocation of process wide futex hash.

Is this expected given the complexity of hash function change?


Also, is there a benchmark that could be run to evaluate FUTEX2_NUMA, I would like to
try it on multi-NUMA system to see the benefit.
Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
Posted by Sebastian Andrzej Siewior 8 months, 2 weeks ago
On 2025-03-18 18:54:22 [+0530], Shrikanth Hegde wrote:
> I ran "perf bench futex all" as is. No change has been made to perf.
…

I just posted v11
	https://lore.kernel.org/all/20250407155742.968816-1-bigeasy@linutronix.de/

this series extends "perf bench futex" with:
- a -b switch to specify the number of buckets to use. The default is
  auto-scale, 0 is the global hash, anything else is the number of buckets.
- The used buckets are displayed after the run.
- a -I switch which freezes the used buckets so the get/put can be
  avoided. This brings the invocations/sec back to where it was.

If you use "all" instead of "hash" then the arguments are skipped.

I did not wire up the MPOL part. IMHO, in order to make sense, the memory
allocation should be based on the NUMA node and then the OP itself could be
performed on that NUMA node.

Sebastian
Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
Posted by Sebastian Andrzej Siewior 8 months, 3 weeks ago
On 2025-03-18 18:54:22 [+0530], Shrikanth Hegde wrote:
> I tried this in one of our systems(Single NUMA, 80 CPUs), I see significant reduction in futex/hash.
> Maybe i am missing some config or doing something stupid w.r.t to benchmarking.
> I am trying to understand this stuff.
> 
> I ran "perf bench futex all" as is. No change has been made to perf.
> =========================================
> Without patch: at 6575d1b4a6ef3336608127c704b612bc5e7b0fdc
> # Running futex/hash benchmark...
> Run summary [PID 45758]: 80 threads, each operating on 1024 [private] futexes for 10 secs.
> Averaged 1556023 operations/sec (+- 0.08%), total secs = 10   <<--- 1.5M
> 
> =========================================
> With the Series: I had to make PR_FUTEX_HASH=78 since 77 is used for TIMERs.
> 
> # Running futex/hash benchmark...
> Run summary [PID 8644]: 80 threads, each operating on 1024 [private] futexes for 10 secs.
> Averaged 150382 operations/sec (+- 0.42%), total secs = 10   <<-- 0.15M, close to 10x down.
> 
> =========================================
> 
> Did try a git bisect based on the futex/hash numbers. It narrowed it to this one.
> first bad commit: [5dc017a816766be47ffabe97b7e5f75919756e5c] futex: Allow automatic allocation of process wide futex hash.
> 
> Is this expected given the complexity of hash function change?

So with 80 CPUs/threads you should end up with roundup_pow_of_two(80 *
4) = 512 buckets. Before the series you should have had
roundup_pow_of_two(80 * 256) = 32768 buckets. This is also printed at
boot.
_Now_ you have fewer buckets, so a hash collision is more likely to
happen. To get back to the old numbers you would have to increase the
number of buckets and you would get the same results. I benchmarked a
few things at
	https://lore.kernel.org/all/20241101110810.R3AnEqdu@linutronix.de/

This looks like the series makes it worse. But then those buckets are
per-task so you won't collide with a different task. This in turn should
relax the situation as a whole because different tasks can't block each
other. If two threads block on the same bucket then they might use the
same `uaddr'. 

The benchmark measures how many hash operations can be performed per
second. This means hash + lock + unlock. In reality you would also
queue, wait and wake. It is not very use-case driven.
The only thing that it measures is hash quality in terms of distribution
and the time spent to perform the hashing operation. If you want to
improve any of the two then this is the micro benchmark for it.

> Also, is there a benchmark that could be run to evaluate FUTEX2_NUMA, I would like to
> try it on multi-NUMA system to see the benefit.

Let me try to add that up to the test tool.

Sebastian
Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
Posted by Shrikanth Hegde 8 months, 3 weeks ago
Hi Sebastian.

On 3/18/25 18:54, Shrikanth Hegde wrote:
> 
>> The complete tree is at
>>     https://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/ 
>> staging.git/log/?h=futex_local_v10
>>     https://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/ 
>> staging.git futex_local_v10
>>
> 
> Hi Sebastian. Thanks for working on this (along with bringing back 
> FUTEX2 NUMA) which
> might help large systems with many futexes.
> 
> I tried this in one of our systems(Single NUMA, 80 CPUs), I see 
> significant reduction in futex/hash.
> Maybe i am missing some config or doing something stupid w.r.t to 
> benchmarking.
> I am trying to understand this stuff.
> 
> I ran "perf bench futex all" as is. No change has been made to perf.
> =========================================
> Without patch: at 6575d1b4a6ef3336608127c704b612bc5e7b0fdc
> # Running futex/hash benchmark...
> Run summary [PID 45758]: 80 threads, each operating on 1024 [private] 
> futexes for 10 secs.
> Averaged 1556023 operations/sec (+- 0.08%), total secs = 10   <<--- 1.5M
> 
> =========================================
> With the Series: I had to make PR_FUTEX_HASH=78 since 77 is used for 
> TIMERs.
> 
> # Running futex/hash benchmark...
> Run summary [PID 8644]: 80 threads, each operating on 1024 [private] 
> futexes for 10 secs.
> Averaged 150382 operations/sec (+- 0.42%), total secs = 10   <<-- 0.15M, 
> close to 10x down.
> 
> =========================================
> 
> Did try a git bisect based on the futex/hash numbers. It narrowed it to 
> this one.
> first bad commit: [5dc017a816766be47ffabe97b7e5f75919756e5c] futex: 
> Allow automatic allocation of process wide futex hash.
> 
> Is this expected given the complexity of hash function change?

So, I did some more benchmarking using the same perf futex hash.
I see that perf creates N threads, binds each thread to a CPU and then
calls futex_wait such that it never blocks. It always returns EWOULDBLOCK,
so only futex_hash is exercised.

Numbers with different threads. (private futexes)
threads		baseline		with series    (ratio)
1		3386265			3266560		0.96	
10		1972069			 821565		0.41
40		1580497			 277900		0.17
80		1555482			 150450		0.096


With Shared Futex: (-s option)
Threads		baseline		with series    (ratio)
80		590144			 585067		0.99

After looking into the code, and after some hacking, I could get the
performance back with the change below. It is likely not functionally
correct. The reasoning behind it:

1. perf report showed significant time in futex_private_hash_put().
    So I removed the rcuref usage for `users'. That brought some improvement,
    from 150k to 300k. Is there a better way to do this `users' protection?

2. Since the number of buckets is smaller by default, this causes hb
    collisions. This was seen as time in queued_spin_lock_slowpath(). I
    increased the hash bucket count to what it was before the series. That
    brought the numbers back to 1.5M. This could be achieved with prctl in
    perf/bench/futex-hash.c I guess.

Note: Just increasing the hash bucket count without point 1 didn't matter much.

-------
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 363a7692909d..7d01bf8caa13 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -65,7 +65,7 @@ static struct {
  #define futex_queues	(__futex_data.queues)
  
  struct futex_private_hash {
-	rcuref_t	users;
+	int	users;
  	unsigned int	hash_mask;
  	struct rcu_head	rcu;
  	void		*mm;
@@ -200,7 +200,7 @@ static bool __futex_pivot_hash(struct mm_struct *mm,
  	fph = rcu_dereference_protected(mm->futex_phash,
  					lockdep_is_held(&mm->futex_hash_lock));
  	if (fph) {
-		if (!rcuref_is_dead(&fph->users)) {
+		if (!(fph->users)) {
  			mm->futex_phash_new = new;
  			return false;
  		}
@@ -247,7 +247,7 @@ struct futex_private_hash *futex_private_hash(void)
  		if (!fph)
  			return NULL;
  
-		if (rcuref_get(&fph->users))
+		if ((fph->users))
  			return fph;
  	}
  	futex_pivot_hash(mm);
@@ -256,7 +256,7 @@ struct futex_private_hash *futex_private_hash(void)
  
  bool futex_private_hash_get(struct futex_private_hash *fph)
  {
-	return rcuref_get(&fph->users);
+	return !!(fph->users);
  }
  
  void futex_private_hash_put(struct futex_private_hash *fph)
@@ -265,7 +265,7 @@ void futex_private_hash_put(struct futex_private_hash *fph)
  	 * Ignore the result; the DEAD state is picked up
  	 * when rcuref_get() starts failing via rcuref_is_dead().
  	 */
-	if (rcuref_put(&fph->users))
+	if ((fph->users))
  		wake_up_var(fph->mm);
  }
  
@@ -1509,7 +1509,7 @@ void futex_hash_free(struct mm_struct *mm)
  	kvfree(mm->futex_phash_new);
  	fph = rcu_dereference_raw(mm->futex_phash);
  	if (fph) {
-		WARN_ON_ONCE(rcuref_read(&fph->users) > 1);
+		WARN_ON_ONCE((fph->users) > 1);
  		kvfree(fph);
  	}
  }
@@ -1524,7 +1524,7 @@ static bool futex_pivot_pending(struct mm_struct *mm)
  		return false;
  
  	fph = rcu_dereference(mm->futex_phash);
-	return !rcuref_read(&fph->users);
+	return !!(fph->users);
  }
  
  static bool futex_hash_less(struct futex_private_hash *a,
@@ -1576,7 +1576,7 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
  	if (!fph)
  		return -ENOMEM;
  
-	rcuref_init(&fph->users, 1);
+	fph->users = 1;
  	fph->hash_mask = hash_slots ? hash_slots - 1 : 0;
  	fph->custom = custom;
  	fph->mm = mm;
@@ -1671,6 +1671,8 @@ int futex_hash_allocate_default(void)
  	if (current_buckets >= buckets)
  		return 0;
  
+	buckets = 32768;
+
  	return futex_hash_allocate(buckets, false);
  }
  
@@ -1732,6 +1734,8 @@ static int __init futex_init(void)
  	hashsize = max(4, hashsize);
  	hashsize = roundup_pow_of_two(hashsize);
  #endif
+	hashsize = 32768;
+
  	futex_hashshift = ilog2(hashsize);
  	size = sizeof(struct futex_hash_bucket) * hashsize;
  	order = get_order(size);

Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
Posted by Sebastian Andrzej Siewior 8 months, 3 weeks ago
On 2025-03-26 00:34:23 [+0530], Shrikanth Hegde wrote:
> Hi Sebastian.
Hi Shrikanth,

> So, did some more bench-marking using the same perf futex hash.
> I see that perf creates N threads and binds each thread to a CPU and then
> calls futex_wait such that it never blocks. It always returns EWOULDBLOCK.
> only futex_hash is exercised.

It also does spin_lock() + unlock on the hash bucket. Without the
locking, you would have constant numbers.

> Numbers with different threads. (private futexes)
> threads	baseline		with series    (ratio)
> 1		3386265			3266560		0.96	
> 10		1972069			 821565		0.41
> 40		1580497			 277900		0.17
> 80		1555482			 150450		0.096
> 
> 
> With Shared Futex: (-s option)
> Threads	baseline		with series    (ratio)
> 80		590144			 585067		0.99

The shared numbers are equal since the code path there is unchanged.

> After looking into code, and after some hacking, could get the
> performance back with below change. this is likely functionally not correct.
> the reason for below change is,
> 
> 1. perf report showed significant time in futex_private_hash_put.
>    so removed rcu usage for users. that brought some improvements.
>    from 150k to 300k. Is there a better way to do this users protection?

This is likely from the atomic dec operation itself. Then there is also
the preemption counter operation. The inc should also be visible but
might be inlined into the hash operation.
It is _just_ removing the atomic inc/dec that doubled the "throughput",
but you don't gain anything on the regular path.
Anyway. To avoid the atomic part we would need a per-CPU counter
instead of a global one and a more expensive slow path for the resize,
since you have to sum up all the per-CPU counters and so on. Not sure it
is worth it.

> 2. Since number of buckets would be less by default, this would cause hb
>    collision. This was seen by queued_spin_lock_slowpath. Increased the hash
>    bucket size what was before the series. That brought the numbers back to
>    1.5M. This could be achieved with prctl in perf/bench/futex-hash.c i guess.

Yes. The idea is to avoid a resize at runtime and setting to something
you know best. You can also use it now to disable the private hash and
stick with the global.

> Note: Just increasing the hash bucket size without point 1, didn't matter much.

Sebastian
Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
Posted by Shrikanth Hegde 8 months, 3 weeks ago

On 3/26/25 15:01, Sebastian Andrzej Siewior wrote:
> On 2025-03-26 00:34:23 [+0530], Shrikanth Hegde wrote:
>> Hi Sebastian.
> Hi Shrikanth,
> 

Hi.

>> So, did some more bench-marking using the same perf futex hash.
>> I see that perf creates N threads and binds each thread to a CPU and then
>> calls futex_wait such that it never blocks. It always returns EWOULDBLOCK.
>> only futex_hash is exercised.
> 
> It also does spin_lock() + unlock on the hash bucket. Without the
> locking, you would have constant numbers.
> 
Thanks for explanations.

Plus, the way perf does it, all the SMT threads end up busy, and the 1-thread
case probably gets the benefit of SMT folding. So anything after 40 threads,
the numbers don't change with the baseline.

>> Numbers with different threads. (private futexes)
>> threads	baseline		with series    (ratio)
>> 1		3386265			3266560		0.96	
>> 10		1972069			 821565		0.41
>> 40		1580497			 277900		0.17
>> 80		1555482			 150450		0.096
>>
>>
>> With Shared Futex: (-s option)
>> Threads	baseline		with series    (ratio)
>> 80		590144			 585067		0.99
> 
> The shared numbers are equal since the code path there is unchanged.
> 
>> After looking into code, and after some hacking, could get the
>> performance back with below change. this is likely functionally not correct.
>> the reason for below change is,
>>
>> 1. perf report showed significant time in futex_private_hash_put.
>>     so removed rcu usage for users. that brought some improvements.
>>     from 150k to 300k. Is there a better way to do this users protection?
> 
> This is likely from the atomic dec operation itself. Then there is also
> the preemption counter operation. The inc should be also visible but
> might be inlined into the hash operation.
> This is _just_ the atomic inc/ dec that doubled the "throughput" but you
> don't have anything from the regular path.
> Anyway. To avoid the atomic part we would need to have a per-CPU counter
> instead of a global one and a more expensive slow path for the resize
> since you have to sum up all the per-CPU counters and so on. Not sure it
> is worth it.
> 

A resize would happen when one does the prctl, right? Or can it happen
automatically too?

fph is going to be on the thread leader's CPU, and using atomics on
fph->users would likely cause cacheline bouncing, no?

Not sure if this happens only due to this benchmark, which doesn't actually
block. Maybe in real-life use cases this doesn't matter.

>> 2. Since number of buckets would be less by default, this would cause hb
>>     collision. This was seen by queued_spin_lock_slowpath. Increased the hash
>>     bucket size what was before the series. That brought the numbers back to
>>     1.5M. This could be achieved with prctl in perf/bench/futex-hash.c i guess.
> 
> Yes. The idea is to avoid a resize at runtime and setting to something
> you know best. You can also use it now to disable the private hash and
> stick with the global.

yes. SET_SLOTS would take care of it.

> 
>> Note: Just increasing the hash bucket size without point 1, didn't matter much.
> 
> Sebastian
Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
Posted by Sebastian Andrzej Siewior 8 months, 3 weeks ago
On 2025-03-26 18:24:37 [+0530], Shrikanth Hegde wrote:
> > Anyway. To avoid the atomic part we would need to have a per-CPU counter
> > instead of a global one and a more expensive slow path for the resize
> > since you have to sum up all the per-CPU counters and so on. Not sure it
> > is worth it.
> > 
> 
> resize would happen when one does prctl right? or
> it can happen automatically too?

Once prctl has been used, resizing happens only via prctl. Without prctl it
will start with 16 buckets once the first thread is created (so you have
two threads in total).
After that it will only increase the number of buckets if
buckets < 4 * threads. See futex_hash_allocate_default().

> fph is going to be on thread leader's CPU and using atomics to do
> fph->users would likely cause cacheline bouncing no?

Yes, this can happen. And since the user can resize even after using
prctl, we can't avoid the inc/dec even if we switch to custom mode.

> Not sure if this happens only due to this benchmark which doesn't actually block.
> Maybe the real life use-case this doesn't matter.

That is what I assume. You go into the kernel if the futex is occupied.
If multiple threads do this at once then the cacheline bouncing is
unfortunate.

Sebastian
Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
Posted by Davidlohr Bueso 9 months ago
On Tue, 18 Mar 2025, Shrikanth Hegde wrote:

>Also, is there a benchmark that could be run to evaluate FUTEX2_NUMA, I would like to
>try it on multi-NUMA system to see the benefit.

It would be good to integrate futex2 into 'perf bench futex'.
Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
Posted by Sebastian Andrzej Siewior 9 months, 1 week ago
On 2025-03-12 16:16:13 [+0100], To linux-kernel@vger.kernel.org wrote:
> The complete tree is at
> 	https://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/staging.git/log/?h=futex_local_v10
> 	https://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/staging.git futex_local_v10
> 
> v9…v10: https://lore.kernel.org/all/20250225170914.289358-1-bigeasy@linutronix.de/
The exact diff vs peterz/locking/futex:

diff --git a/include/linux/futex.h b/include/linux/futex.h
index 0cdd5882e89c1..19c37afa0432a 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -82,12 +82,7 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 	      u32 __user *uaddr2, u32 val2, u32 val3);
 int futex_hash_prctl(unsigned long arg2, unsigned long arg3);
 
-#ifdef CONFIG_BASE_SMALL
-static inline int futex_hash_allocate_default(void) { return 0; }
-static inline void futex_hash_free(struct mm_struct *mm) { }
-static inline void futex_mm_init(struct mm_struct *mm) { }
-#else /* !CONFIG_BASE_SMALL */
-
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
 int futex_hash_allocate_default(void);
 void futex_hash_free(struct mm_struct *mm);
 
@@ -97,7 +92,11 @@ static inline void futex_mm_init(struct mm_struct *mm)
 	mutex_init(&mm->futex_hash_lock);
 }
 
-#endif /* CONFIG_BASE_SMALL */
+#else /* !CONFIG_FUTEX_PRIVATE_HASH */
+static inline int futex_hash_allocate_default(void) { return 0; }
+static inline void futex_hash_free(struct mm_struct *mm) { }
+static inline void futex_mm_init(struct mm_struct *mm) { }
+#endif /* CONFIG_FUTEX_PRIVATE_HASH */
 
 #else /* !CONFIG_FUTEX */
 static inline void futex_init_task(struct task_struct *tsk) { }
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 9399ee7d40201..e0e8adbe66bdd 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -938,7 +938,7 @@ struct mm_struct {
 		 */
 		seqcount_t mm_lock_seq;
 #endif
-#if defined(CONFIG_FUTEX) && !defined(CONFIG_BASE_SMALL)
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
 		struct mutex			futex_hash_lock;
 		struct futex_private_hash	__rcu *futex_phash;
 		struct futex_private_hash	*futex_phash_new;
diff --git a/include/linux/rcuref.h b/include/linux/rcuref.h
index 6322d8c1c6b42..2fb2af6d98249 100644
--- a/include/linux/rcuref.h
+++ b/include/linux/rcuref.h
@@ -30,7 +30,11 @@ static inline void rcuref_init(rcuref_t *ref, unsigned int cnt)
  * rcuref_read - Read the number of held reference counts of a rcuref
  * @ref:	Pointer to the reference count
  *
- * Return: The number of held references (0 ... N)
+ * Return: The number of held references (0 ... N). The value 0 does not
+ * indicate that it is safe to schedule the object, protected by this reference
+ * counter, for deconstruction.
+ * If you want to know if the reference counter has been marked DEAD (as
+ * signaled by rcuref_put()) please use rcuref_is_dead().
  */
 static inline unsigned int rcuref_read(rcuref_t *ref)
 {
@@ -40,6 +44,22 @@ static inline unsigned int rcuref_read(rcuref_t *ref)
 	return c >= RCUREF_RELEASED ? 0 : c + 1;
 }
 
+/**
+ * rcuref_is_dead -	Check if the rcuref has been already marked dead
+ * @ref:		Pointer to the reference count
+ *
+ * Return: True if the object has been marked DEAD. This signals that a previous
+ * invocation of rcuref_put() returned true on this reference counter meaning
+ * the protected object can safely be scheduled for deconstruction.
+ * Otherwise, returns false.
+ */
+static inline bool rcuref_is_dead(rcuref_t *ref)
+{
+	unsigned int c = atomic_read(&ref->refcnt);
+
+	return (c >= RCUREF_RELEASED) && (c < RCUREF_NOREF);
+}
+
 extern __must_check bool rcuref_get_slowpath(rcuref_t *ref);
 
 /**
diff --git a/init/Kconfig b/init/Kconfig
index a0ea04c177842..a4502a9077e03 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1683,6 +1683,16 @@ config FUTEX_PI
 	depends on FUTEX && RT_MUTEXES
 	default y
 
+config FUTEX_PRIVATE_HASH
+	bool
+	depends on FUTEX && !BASE_SMALL && MMU
+	default y
+
+config FUTEX_MPOL
+	bool
+	depends on FUTEX && NUMA
+	default y
+
 config EPOLL
 	bool "Enable eventpoll support" if EXPERT
 	default y
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 976a487bf3ad5..65523f3cfe32e 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -136,7 +136,7 @@ static inline bool futex_key_is_private(union futex_key *key)
 static struct futex_hash_bucket *
 __futex_hash(union futex_key *key, struct futex_private_hash *fph);
 
-#ifndef CONFIG_BASE_SMALL
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
 static struct futex_hash_bucket *
 __futex_hash_private(union futex_key *key, struct futex_private_hash *fph)
 {
@@ -196,12 +196,12 @@ static bool __futex_pivot_hash(struct mm_struct *mm,
 {
 	struct futex_private_hash *fph;
 
-	lockdep_assert_held(&mm->futex_hash_lock);
 	WARN_ON_ONCE(mm->futex_phash_new);
 
-	fph = mm->futex_phash;
+	fph = rcu_dereference_protected(mm->futex_phash,
+					lockdep_is_held(&mm->futex_hash_lock));
 	if (fph) {
-		if (rcuref_read(&fph->users) != 0) {
+		if (!rcuref_is_dead(&fph->users)) {
 			mm->futex_phash_new = new;
 			return false;
 		}
@@ -262,6 +262,10 @@ bool futex_private_hash_get(struct futex_private_hash *fph)
 
 void futex_private_hash_put(struct futex_private_hash *fph)
 {
+	/*
+	 * Ignore the result; the DEAD state is picked up
+	 * when rcuref_get() starts failing via rcuref_is_dead().
+	 */
 	if (rcuref_put(&fph->users))
 		wake_up_var(fph->mm);
 }
@@ -301,7 +305,7 @@ void futex_hash_put(struct futex_hash_bucket *hb)
 	futex_private_hash_put(fph);
 }
 
-#else
+#else /* !CONFIG_FUTEX_PRIVATE_HASH */
 
 static inline struct futex_hash_bucket *
 __futex_hash_private(union futex_key *key, struct futex_private_hash *fph)
@@ -314,8 +318,9 @@ struct futex_hash_bucket *futex_hash(union futex_key *key)
 	return __futex_hash(key, NULL);
 }
 
-#endif /* CONFIG_BASE_SMALL */
+#endif /* CONFIG_FUTEX_PRIVATE_HASH */
 
+#ifdef CONFIG_FUTEX_MPOL
 static int __futex_key_to_node(struct mm_struct *mm, unsigned long addr)
 {
 	struct vm_area_struct *vma = vma_lookup(mm, addr);
@@ -325,7 +330,7 @@ static int __futex_key_to_node(struct mm_struct *mm, unsigned long addr)
 	if (!vma)
 		return FUTEX_NO_NODE;
 
-	mpol = vma->vm_policy;
+	mpol = vma_policy(vma);
 	if (!mpol)
 		return FUTEX_NO_NODE;
 
@@ -373,6 +378,14 @@ static int futex_mpol(struct mm_struct *mm, unsigned long addr)
 	guard(mmap_read_lock)(mm);
 	return __futex_key_to_node(mm, addr);
 }
+#else /* !CONFIG_FUTEX_MPOL */
+
+static int futex_mpol(struct mm_struct *mm, unsigned long addr)
+{
+	return FUTEX_NO_NODE;
+}
+
+#endif /* CONFIG_FUTEX_MPOL */
 
 /**
  * futex_hash - Return the hash bucket in the global hash
@@ -420,7 +433,6 @@ __futex_hash(union futex_key *key, struct futex_private_hash *fph)
 	return &futex_queues[node][hash & futex_hashmask];
 }
 
-
 /**
  * futex_setup_timer - set up the sleeping hrtimer.
  * @time:	ptr to the given timeout value
@@ -932,9 +944,6 @@ int futex_unqueue(struct futex_q *q)
 
 void futex_q_lockptr_lock(struct futex_q *q)
 {
-#if 0
-	struct futex_hash_bucket *hb;
-#endif
 	spinlock_t *lock_ptr;
 
 	/*
@@ -949,18 +958,6 @@ void futex_q_lockptr_lock(struct futex_q *q)
 		spin_unlock(lock_ptr);
 		goto retry;
 	}
-#if 0
-	hb = container_of(lock_ptr, struct futex_hash_bucket, lock);
-	/*
-	 * The caller needs to either hold a reference on the hash (to ensure
-	 * that the hash is not resized) _or_ be enqueued on the hash. This
-	 * ensures that futex_q::lock_ptr is updated while moved to the new
-	 * hash during resize.
-	 * Once the hash bucket is locked the resize operation, which might be
-	 * in progress, will block on the lock.
-	 */
-	return hb;
-#endif
 }
 
 /*
@@ -1497,7 +1494,7 @@ void futex_exit_release(struct task_struct *tsk)
 static void futex_hash_bucket_init(struct futex_hash_bucket *fhb,
 				   struct futex_private_hash *fph)
 {
-#ifndef CONFIG_BASE_SMALL
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
 	fhb->priv = fph;
 #endif
 	atomic_set(&fhb->waiters, 0);
@@ -1505,21 +1502,30 @@ static void futex_hash_bucket_init(struct futex_hash_bucket *fhb,
 	spin_lock_init(&fhb->lock);
 }
 
-#ifndef CONFIG_BASE_SMALL
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
 void futex_hash_free(struct mm_struct *mm)
 {
+	struct futex_private_hash *fph;
+
 	kvfree(mm->futex_phash_new);
-	kvfree(mm->futex_phash);
+	fph = rcu_dereference_raw(mm->futex_phash);
+	if (fph) {
+		WARN_ON_ONCE(rcuref_read(&fph->users) > 1);
+		kvfree(fph);
+	}
 }
 
 static bool futex_pivot_pending(struct mm_struct *mm)
 {
+	struct futex_private_hash *fph;
+
 	guard(rcu)();
 
 	if (!mm->futex_phash_new)
 		return false;
 
-	return !rcuref_read(&mm->futex_phash->users);
+	fph = rcu_dereference(mm->futex_phash);
+	return !rcuref_read(&fph->users);
 }
 
 static bool futex_hash_less(struct futex_private_hash *a,
@@ -1560,7 +1566,7 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
 	 */
 	scoped_guard (rcu) {
 		fph = rcu_dereference(mm->futex_phash);
-		if (fph && !mm->futex_phash->hash_mask) {
+		if (fph && !fph->hash_mask) {
 			if (custom)
 				return -EBUSY;
 			return 0;
@@ -1591,7 +1597,8 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
 		struct futex_private_hash *free __free(kvfree) = NULL;
 		struct futex_private_hash *cur, *new;
 
-		cur = mm->futex_phash;
+		cur = rcu_dereference_protected(mm->futex_phash,
+						lockdep_is_held(&mm->futex_hash_lock));
 		new = mm->futex_phash_new;
 		mm->futex_phash_new = NULL;
 
@@ -1602,7 +1609,7 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
 				 * allocated a replacement hash, drop the initial
 				 * reference on the existing hash.
 				 */
-				futex_private_hash_put(mm->futex_phash);
+				futex_private_hash_put(cur);
 			}
 
 			if (new) {
@@ -1683,7 +1690,7 @@ static int futex_hash_get_slots(void)
 
 static int futex_hash_allocate(unsigned int hash_slots, bool custom)
 {
-	return 0;
+	return -EINVAL;
 }
 
 static int futex_hash_get_slots(void)
@@ -1723,6 +1730,7 @@ static int __init futex_init(void)
 #else
 	hashsize = 256 * num_possible_cpus();
 	hashsize /= num_possible_nodes();
+	hashsize = max(4, hashsize);
 	hashsize = roundup_pow_of_two(hashsize);
 #endif
 	futex_hashshift = ilog2(hashsize);
@@ -1740,12 +1748,15 @@ static int __init futex_init(void)
 		BUG_ON(!table);
 
 		for (i = 0; i < hashsize; i++)
-			futex_hash_bucket_init(&table[i], 0);
+			futex_hash_bucket_init(&table[i], NULL);
 
 		futex_queues[n] = table;
 	}
 
 	futex_hashmask = hashsize - 1;
+	pr_info("futex hash table entries: %lu (%lu bytes on %d NUMA nodes, total %lu KiB, %s).\n",
+		hashsize, size, num_possible_nodes(), size * num_possible_nodes() / 1024,
+		order > MAX_PAGE_ORDER ? "vmalloc" : "linear");
 	return 0;
 }
 core_initcall(futex_init);
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 40f06523a3565..52e9c0c4b6c87 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -223,14 +223,15 @@ futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
 
 extern struct futex_hash_bucket *futex_hash(union futex_key *key);
 
-#ifndef CONFIG_BASE_SMALL
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
 extern void futex_hash_get(struct futex_hash_bucket *hb);
 extern void futex_hash_put(struct futex_hash_bucket *hb);
 
 extern struct futex_private_hash *futex_private_hash(void);
 extern bool futex_private_hash_get(struct futex_private_hash *fph);
 extern void futex_private_hash_put(struct futex_private_hash *fph);
-#else
+
+#else /* !CONFIG_FUTEX_PRIVATE_HASH */
 static inline void futex_hash_get(struct futex_hash_bucket *hb) { }
 static inline void futex_hash_put(struct futex_hash_bucket *hb) { }
 
diff --git a/mm/nommu.c b/mm/nommu.c
index baa79abdaf037..d04e601a8f4d7 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -209,6 +209,11 @@ EXPORT_SYMBOL(vmalloc_noprof);
 
 void *vmalloc_huge_noprof(unsigned long size, gfp_t gfp_mask) __weak __alias(__vmalloc_noprof);
 
+void *vmalloc_huge_node_noprof(unsigned long size, gfp_t gfp_mask, int node)
+{
+	return vmalloc_huge_noprof(size, gfp_mask);
+}
+
 /*
  *	vzalloc - allocate virtually contiguous memory with zero fill
  *
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 39fe43183a64f..69247b46413ca 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3968,9 +3968,9 @@ EXPORT_SYMBOL_GPL(vmalloc_huge_noprof);
 
 void *vmalloc_huge_node_noprof(unsigned long size, gfp_t gfp_mask, int node)
 {
-	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
-				    gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
-				    node, __builtin_return_address(0));
+	return __vmalloc_node_range_noprof(size, 1, VMALLOC_START, VMALLOC_END,
+					   gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
+					   node, __builtin_return_address(0));
 }
 
 /**


Sebastian
Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
Posted by Peter Zijlstra 9 months, 1 week ago
On Wed, Mar 12, 2025 at 04:18:48PM +0100, Sebastian Andrzej Siewior wrote:

> @@ -1591,7 +1597,8 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
>  		struct futex_private_hash *free __free(kvfree) = NULL;
>  		struct futex_private_hash *cur, *new;
>  
> -		cur = mm->futex_phash;
> +		cur = rcu_dereference_protected(mm->futex_phash,
> +						lockdep_is_held(&mm->futex_hash_lock));
>  		new = mm->futex_phash_new;
>  		mm->futex_phash_new = NULL;
>  

Same thing again, this makes no sense.
Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
Posted by Sebastian Andrzej Siewior 9 months, 1 week ago
On 2025-03-14 11:58:56 [+0100], Peter Zijlstra wrote:
> On Wed, Mar 12, 2025 at 04:18:48PM +0100, Sebastian Andrzej Siewior wrote:
> 
> > @@ -1591,7 +1597,8 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
> >  		struct futex_private_hash *free __free(kvfree) = NULL;
> >  		struct futex_private_hash *cur, *new;
> >  
> > -		cur = mm->futex_phash;
> > +		cur = rcu_dereference_protected(mm->futex_phash,
> > +						lockdep_is_held(&mm->futex_hash_lock));
> >  		new = mm->futex_phash_new;
> >  		mm->futex_phash_new = NULL;
> >  
> 
> Same thing again, this makes no sense.

With "mm->futex_phash" sparse complains about direct RCU access. This
makes it obvious that you can access it; it won't change as long as you
have the lock.

Sebastian
Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
Posted by Peter Zijlstra 9 months, 1 week ago
On Fri, Mar 14, 2025 at 12:28:08PM +0100, Sebastian Andrzej Siewior wrote:
> On 2025-03-14 11:58:56 [+0100], Peter Zijlstra wrote:
> > On Wed, Mar 12, 2025 at 04:18:48PM +0100, Sebastian Andrzej Siewior wrote:
> > 
> > > @@ -1591,7 +1597,8 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
> > >  		struct futex_private_hash *free __free(kvfree) = NULL;
> > >  		struct futex_private_hash *cur, *new;
> > >  
> > > -		cur = mm->futex_phash;
> > > +		cur = rcu_dereference_protected(mm->futex_phash,
> > > +						lockdep_is_held(&mm->futex_hash_lock));
> > >  		new = mm->futex_phash_new;
> > >  		mm->futex_phash_new = NULL;
> > >  
> > 
> > Same thing again, this makes no sense.
> 
> With "mm->futex_phash" sparse complains about direct RCU access.

Yeah, but sparse is stupid.

> This makes it obvious that you can access it, it won't change as long
> as you have the lock.

It's just plain confusing. rcu_dereference() says you care about the
load being single copy atomic and the data dependency, we don't.

If we just want to shut up sparse, can't we write it like:

	cur = unrcu_pointer(mm->futex_phash);

?
Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
Posted by Sebastian Andrzej Siewior 9 months, 1 week ago
On 2025-03-14 12:41:02 [+0100], Peter Zijlstra wrote:
> On Fri, Mar 14, 2025 at 12:28:08PM +0100, Sebastian Andrzej Siewior wrote:
> > On 2025-03-14 11:58:56 [+0100], Peter Zijlstra wrote:
> > > On Wed, Mar 12, 2025 at 04:18:48PM +0100, Sebastian Andrzej Siewior wrote:
> > > 
> > > > @@ -1591,7 +1597,8 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
> > > >  		struct futex_private_hash *free __free(kvfree) = NULL;
> > > >  		struct futex_private_hash *cur, *new;
> > > >  
> > > > -		cur = mm->futex_phash;
> > > > +		cur = rcu_dereference_protected(mm->futex_phash,
> > > > +						lockdep_is_held(&mm->futex_hash_lock));
> > > >  		new = mm->futex_phash_new;
> > > >  		mm->futex_phash_new = NULL;
> > > >  
> > > 
> > > Same thing again, this makes no sense.
> > 
> > With "mm->futex_phash" sparse complains about direct RCU access.
> 
> Yeah, but sparse is stupid.

I thought we liked sparse.

> > This makes it obvious that you can access it, it won't change as long
> > as you have the lock.
> 
> It's just plain confusing. rcu_dereference() says you care about the
> load being single copy atomic and the data dependency, we don't.
> 
> > If we just want to shut up sparse, can't we write it like:
> 
> 	cur = unrcu_pointer(mm->futex_phash);
> 
> ?

But isn't rcu_dereference_protected() doing exactly this? It only
verifies that lockdep_is_held() thingy and it performs a plain read, no
READ_ONCE() or anything. And the reader understands why it is safe to
access the pointer as-is.

Sebastian
Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
Posted by Peter Zijlstra 9 months, 1 week ago
On Fri, Mar 14, 2025 at 01:00:57PM +0100, Sebastian Andrzej Siewior wrote:
> On 2025-03-14 12:41:02 [+0100], Peter Zijlstra wrote:
> > On Fri, Mar 14, 2025 at 12:28:08PM +0100, Sebastian Andrzej Siewior wrote:
> > > On 2025-03-14 11:58:56 [+0100], Peter Zijlstra wrote:
> > > > On Wed, Mar 12, 2025 at 04:18:48PM +0100, Sebastian Andrzej Siewior wrote:
> > > > 
> > > > > @@ -1591,7 +1597,8 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
> > > > >  		struct futex_private_hash *free __free(kvfree) = NULL;
> > > > >  		struct futex_private_hash *cur, *new;
> > > > >  
> > > > > -		cur = mm->futex_phash;
> > > > > +		cur = rcu_dereference_protected(mm->futex_phash,
> > > > > +						lockdep_is_held(&mm->futex_hash_lock));
> > > > >  		new = mm->futex_phash_new;
> > > > >  		mm->futex_phash_new = NULL;
> > > > >  
> > > > 
> > > > Same thing again, this makes no sense.
> > > 
> > > With "mm->futex_phash" sparse complains about direct RCU access.
> > 
> > Yeah, but sparse is stupid.
> 
> I thought we liked sparse.

I always ignore it, too much noise.

> > > This makes it obvious that you can access it, it won't change as long
> > > as you have the lock.
> > 
> > It's just plain confusing. rcu_dereference() says you care about the
> > load being single copy atomic and the data dependency, we don't.
> > 
> > > If we just want to shut up sparse, can't we write it like:
> > 
> > 	cur = unrcu_pointer(mm->futex_phash);
> > 
> > ?
> 
> But isn't rcu_dereference_protected() doing exactly this? It only
> verifies that lockdep_is_held() thingy and it performs a plain read, no
> READ_ONCE() or anything. And the reader understands why it is safe to
> access the pointer as-is.

Urgh, so we have a rcu_dereference_*() function that does not in fact
imply rcu_dereference()? WTF kind of insane naming is that?

But yes, it appears you are correct :-(
Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
Posted by Paul E. McKenney 9 months, 1 week ago
On Fri, Mar 14, 2025 at 01:30:58PM +0100, Peter Zijlstra wrote:
> On Fri, Mar 14, 2025 at 01:00:57PM +0100, Sebastian Andrzej Siewior wrote:
> > On 2025-03-14 12:41:02 [+0100], Peter Zijlstra wrote:
> > > On Fri, Mar 14, 2025 at 12:28:08PM +0100, Sebastian Andrzej Siewior wrote:
> > > > On 2025-03-14 11:58:56 [+0100], Peter Zijlstra wrote:
> > > > > On Wed, Mar 12, 2025 at 04:18:48PM +0100, Sebastian Andrzej Siewior wrote:

[ . . . ]

> > > > This makes it obvious that you can access it, it won't change as long
> > > > as you have the lock.
> > > 
> > > It's just plain confusing. rcu_dereference() says you care about the
> > > load being single copy atomic and the data dependency, we don't.
> > > 
> > > If we just want to shut up sparse, can't we write it like:
> > > 
> > > 	cur = unrcu_pointer(mm->futex_phash);
> > > 
> > > ?
> > 
> > But isn't rcu_dereference_protected() doing exactly this? It only
> > verifies that lockdep_is_held() thingy and it performs a plain read, no
> > READ_ONCE() or anything. And the reader understands why it is safe to
> > access the pointer as-is.
> 
> Urgh, so we have a rcu_dereference_*() function that does not in fact
> imply rcu_dereference()? WTF kind of insane naming is that?

My kind of insane naming!  ;-)

The rationale is that "_protected" means "protected from updates".

							Thanx, Paul

------------------------------------------------------------------------

/**
 * rcu_dereference_protected() - fetch RCU pointer when updates prevented
 * @p: The pointer to read, prior to dereferencing
 * @c: The conditions under which the dereference will take place
 *
 * Return the value of the specified RCU-protected pointer, but omit
 * the READ_ONCE().  This is useful in cases where update-side locks
 * prevent the value of the pointer from changing.  Please note that this
 * primitive does *not* prevent the compiler from repeating this reference
 * or combining it with other references, so it should not be used without
 * protection of appropriate locks.
 *
 * This function is only for update-side use.  Using this function
 * when protected only by rcu_read_lock() will result in infrequent
 * but very ugly failures.
 */
Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
Posted by Sebastian Andrzej Siewior 9 months, 1 week ago
On 2025-03-14 13:30:58 [+0100], Peter Zijlstra wrote:
> > > Yeah, but sparse is stupid.
> > 
> > I thought we liked sparse.
> 
> I always ignore it, too much noise.

What do you suggest? Cleaning up that noise, or moving the RCU checking
to other tooling which is restricted to RCU only?
Knowing you, you already have a plan in your cupboard but not the time :)

Sebastian
Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
Posted by Peter Zijlstra 9 months, 1 week ago
On Fri, Mar 14, 2025 at 02:30:39PM +0100, Sebastian Andrzej Siewior wrote:
> On 2025-03-14 13:30:58 [+0100], Peter Zijlstra wrote:
> > > > Yeah, but sparse is stupid.
> > > 
> > > I thought we liked sparse.
> > 
> > I always ignore it, too much noise.
> 
> What do you suggest? Cleaning up that noise, or moving the RCU checking
> to other tooling which is restricted to RCU only?
> Knowing you, you already have a plan in your cupboard but not the time :)

No real plan. It's been so long since I ran sparse that I can't even
remember the shape of the noise, just that there was a _lot_ of it.

But IIRC the toolchains are starting to introduce address spaces:

  https://lkml.kernel.org/r/20250127160709.80604-1-ubizjak@gmail.com

so perhaps the __rcu thing can go that way.
Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
Posted by Peter Zijlstra 9 months, 1 week ago
On Wed, Mar 12, 2025 at 04:18:48PM +0100, Sebastian Andrzej Siewior wrote:
> @@ -196,12 +196,12 @@ static bool __futex_pivot_hash(struct mm_struct *mm,
>  {
>  	struct futex_private_hash *fph;
>  
> -	lockdep_assert_held(&mm->futex_hash_lock);
>  	WARN_ON_ONCE(mm->futex_phash_new);
>  
> -	fph = mm->futex_phash;
> +	fph = rcu_dereference_protected(mm->futex_phash,
> +					lockdep_is_held(&mm->futex_hash_lock));

I are confused... this makes no sense. Why ?!

We only ever write that variable while holding this lock, we hold the
lock, we don't need RCU to read the variable.