Hi,
this is a follow up on
https://lore.kernel.org/ZwVOMgBMxrw7BU9A@jlelli-thinkpadt14gen4.remote.csb
and adds support for task local futex_hash_bucket.
This is the local hash map series based on v9, extended with PeterZ's
FUTEX2_NUMA and FUTEX2_MPOL plus a few fixes on top.
The complete tree is at
https://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/staging.git/log/?h=futex_local_v10
https://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/staging.git futex_local_v10
v9…v10: https://lore.kernel.org/all/20250225170914.289358-1-bigeasy@linutronix.de/
- The rcuref_read() check in __futex_hash_private() has been replaced
with rcuref_is_dead() which is added as part of the series.
- The local hash support depended on !CONFIG_BASE_SMALL, which has been
replaced with CONFIG_FUTEX_PRIVATE_HASH. This is defined as
"!BASE_SMALL && MMU" because, as part of the rework,
futex_key::private::mm is used, which is not set on !CONFIG_MMU builds.
- Added CONFIG_FUTEX_MPOL to build on !NUMA configs.
- Replaced direct access of mm_struct::futex_phash with an RCU
accessor.
- futex_hash_allocate() for !CONFIG_FUTEX_PRIVATE_HASH returns an
error. This does not affect fork() but is noticed by
PR_FUTEX_HASH_SET_SLOTS (see the usage sketch after this list).
- futex_init() ensures the computed hashsize is not less than 4 after
the divide by num_possible_nodes().
- futex_init() now prints info about the number of hash table entries
used (in the global hash), the memory occupied and the allocation
method. This output vanished after the removal of
alloc_large_system_hash().
- There is again a WARN_ON in the futex_hash_free() path if the task
failed to drop all references (that would be a leak).
- vmalloc_huge_node_noprof():
- Replaced __vmalloc_node_range() with __vmalloc_node_range_noprof()
to skip the alloc_hooks() layer, which is already part of
vmalloc_huge_node().
- Added vmalloc_huge_node_noprof() for !MMU.
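
For reference, the PR_FUTEX_HASH_SET_SLOTS interface mentioned above
would be used from userspace roughly like the sketch below. This is a
sketch only: the constant values are assumptions based on this thread
(Shrikanth notes further down that PR_FUTEX_HASH ended up as 78 since
77 was taken by timers); check the final uapi/linux/prctl.h before
relying on them.

/*
 * Sketch of resizing the process-private futex hash via prctl().
 * The PR_FUTEX_HASH constants are assumptions; see the note above.
 */
#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_FUTEX_HASH
#define PR_FUTEX_HASH			78
#define PR_FUTEX_HASH_SET_SLOTS		1
#define PR_FUTEX_HASH_GET_SLOTS		2
#endif

int main(void)
{
	/* Pin the private hash to 512 buckets; 0 would select the global hash. */
	if (prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, 512, 0, 0) < 0)
		perror("PR_FUTEX_HASH_SET_SLOTS");

	printf("slots: %ld\n", (long)prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS, 0, 0, 0));
	return 0;
}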
v8…v9: https://lore.kernel.org/all/20250203135935.440018-1-bigeasy@linutronix.de
- Rebased on top of PeterZ's futex_class.
- A few patches vanished due to the class rework.
- struct futex_hash_bucket now has a pointer to futex_private_hash
instead of a slot number.
- CONFIG_BASE_SMALL now removes support for the "futex local hash"
instead of restricting it to 2 slots.
- The number of threads, used to determine the number of slots, is
capped at num_online_cpus().
Peter Zijlstra (11):
futex: Move futex_queue() into futex_wait_setup()
futex: Pull futex_hash() out of futex_q_lock()
futex: Create hb scopes
futex: Create futex_hash() get/put class
futex: s/hb_p/fph/
futex: Remove superfluous state
futex: Untangle and naming
futex: Rework SET_SLOTS
mm: Add vmalloc_huge_node()
futex: Implement FUTEX2_NUMA
futex: Implement FUTEX2_MPOL
Sebastian Andrzej Siewior (10):
rcuref: Provide rcuref_is_dead().
futex: Create helper function to initialize a hash slot.
futex: Add basic infrastructure for local task local hash.
futex: Hash only the address for private futexes.
futex: Allow automatic allocation of process wide futex hash.
futex: Decrease the waiter count before the unlock operation.
futex: Introduce futex_q_lockptr_lock().
futex: Acquire a hash reference in futex_wait_multiple_setup().
futex: Allow to re-allocate the private local hash.
futex: Resize local futex hash table based on number of threads.
include/linux/futex.h | 34 +-
include/linux/mm_types.h | 7 +-
include/linux/mmap_lock.h | 4 +
include/linux/rcuref.h | 22 +-
include/linux/vmalloc.h | 3 +
include/uapi/linux/futex.h | 10 +-
include/uapi/linux/prctl.h | 5 +
init/Kconfig | 10 +
io_uring/futex.c | 4 +-
kernel/fork.c | 24 ++
kernel/futex/core.c | 746 +++++++++++++++++++++++++++++++++----
kernel/futex/futex.h | 81 +++-
kernel/futex/pi.c | 300 ++++++++-------
kernel/futex/requeue.c | 480 ++++++++++++------------
kernel/futex/waitwake.c | 203 +++++-----
kernel/sys.c | 4 +
mm/nommu.c | 5 +
mm/vmalloc.c | 7 +
18 files changed, 1396 insertions(+), 553 deletions(-)
--
2.47.2
On 3/12/25 20:46, Sebastian Andrzej Siewior wrote:
> Hi,
>
> this is a follow up on
> https://lore.kernel.org/ZwVOMgBMxrw7BU9A@jlelli-thinkpadt14gen4.remote.csb
>
> and adds support for task local futex_hash_bucket.
>
> This is the local hash map series based on v9, extended with PeterZ's
> FUTEX2_NUMA and FUTEX2_MPOL plus a few fixes on top.
>
> The complete tree is at
> https://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/staging.git/log/?h=futex_local_v10
> https://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/staging.git futex_local_v10

Hi Sebastian. Thanks for working on this (along with bringing back
FUTEX2 NUMA), which might help large systems with many futexes.

I tried this on one of our systems (single NUMA node, 80 CPUs) and I
see a significant reduction in futex/hash. Maybe I am missing some
config or doing something stupid w.r.t. benchmarking; I am trying to
understand this stuff.

I ran "perf bench futex all" as is. No change has been made to perf.
=========================================
Without patch: at 6575d1b4a6ef3336608127c704b612bc5e7b0fdc

# Running futex/hash benchmark...
Run summary [PID 45758]: 80 threads, each operating on 1024 [private] futexes for 10 secs.
Averaged 1556023 operations/sec (+- 0.08%), total secs = 10  <<--- 1.5M
=========================================
With the series: I had to make PR_FUTEX_HASH=78 since 77 is used for TIMERs.

# Running futex/hash benchmark...
Run summary [PID 8644]: 80 threads, each operating on 1024 [private] futexes for 10 secs.
Averaged 150382 operations/sec (+- 0.42%), total secs = 10  <<-- 0.15M, close to 10x down.
=========================================

Did try a git bisect based on the futex/hash numbers. It narrowed it to this one:
first bad commit: [5dc017a816766be47ffabe97b7e5f75919756e5c] futex: Allow automatic allocation of process wide futex hash.

Is this expected given the complexity of the hash function change?

Also, is there a benchmark that could be run to evaluate FUTEX2_NUMA? I
would like to try it on a multi-NUMA system to see the benefit.
On 2025-03-18 18:54:22 [+0530], Shrikanth Hegde wrote:
> I ran "perf bench futex all" as is. No change has been made to perf.
…

I just posted v11:
  https://lore.kernel.org/all/20250407155742.968816-1-bigeasy@linutronix.de/

This series extends "perf bench futex" with:
- a -b switch to specify the number of buckets to use. Default is
  auto-scale, 0 is the global hash, everything else is the number of
  buckets.
- The used buckets are displayed after the run.
- a -I switch that freezes the used buckets so the get/put can be
  avoided. This brings the invocations/sec back to where it was.

If you use "all" instead of "hash" then the arguments are skipped.

I did not wire up the MPOL part. IMHO, in order to make sense, the
memory allocation should be based on the NUMA node and then the OP
itself could be based on the NUMA node.

Sebastian
On 2025-03-18 18:54:22 [+0530], Shrikanth Hegde wrote:
> I tried this in one of our systems (single NUMA, 80 CPUs), I see
> significant reduction in futex/hash.
> Maybe I am missing some config or doing something stupid w.r.t.
> benchmarking. I am trying to understand this stuff.
>
> I ran "perf bench futex all" as is. No change has been made to perf.
> =========================================
> Without patch: at 6575d1b4a6ef3336608127c704b612bc5e7b0fdc
> # Running futex/hash benchmark...
> Run summary [PID 45758]: 80 threads, each operating on 1024 [private] futexes for 10 secs.
> Averaged 1556023 operations/sec (+- 0.08%), total secs = 10  <<--- 1.5M
>
> =========================================
> With the series: I had to make PR_FUTEX_HASH=78 since 77 is used for TIMERs.
>
> # Running futex/hash benchmark...
> Run summary [PID 8644]: 80 threads, each operating on 1024 [private] futexes for 10 secs.
> Averaged 150382 operations/sec (+- 0.42%), total secs = 10  <<-- 0.15M, close to 10x down.
>
> =========================================
>
> Did try a git bisect based on the futex/hash numbers. It narrowed it to this one.
> first bad commit: [5dc017a816766be47ffabe97b7e5f75919756e5c] futex:
> Allow automatic allocation of process wide futex hash.
>
> Is this expected given the complexity of hash function change?

So with 80 CPUs/threads you should end up with
roundup_pow_of_two(80 * 4) = 512 buckets. Before the series you had
roundup_pow_of_two(80 * 256) = 32768 buckets. This is also printed at
boot. _Now_ you have fewer buckets, so a hash collision is more likely
to happen. If you increase the number of buckets to the old value, you
get the old numbers back. I benchmarked a few things at
  https://lore.kernel.org/all/20241101110810.R3AnEqdu@linutronix.de/

This makes it look like the series makes things worse. But those
buckets are per-task, so you won't collide with a different task. This
in turn should relax the situation as a whole because different tasks
can't block each other. If two threads block on the same bucket then
they might use the same `uaddr'.

The benchmark measures how many hash operations can be performed per
second. This means hash + lock + unlock. In reality you would also
queue, wait and wake. It is not very use-case driven. The only things
it measures are hash quality in terms of distribution and the time
spent to perform the hashing operation. If you want to improve either
of the two then this is the micro benchmark for it.

> Also, is there a benchmark that could be run to evaluate FUTEX2_NUMA, I would like to
> try it on multi-NUMA system to see the benefit.

Let me try to add that to the test tool.

Sebastian
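
[To make the arithmetic above concrete, a tiny standalone sketch; the
per-thread factor (4) and the old per-CPU factor (256) are the values
quoted in this thread.]

#include <stdio.h>

/* Userspace stand-in for the kernel's roundup_pow_of_two(). */
static unsigned long roundup_pow_of_two(unsigned long n)
{
	unsigned long p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

int main(void)
{
	unsigned long cpus = 80;

	/* private hash, this series: 512 buckets */
	printf("private: %lu\n", roundup_pow_of_two(cpus * 4));
	/* global hash, before the series: 32768 buckets */
	printf("global:  %lu\n", roundup_pow_of_two(cpus * 256));
	return 0;
}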
Hi Sebastian.
On 3/18/25 18:54, Shrikanth Hegde wrote:
>
>> The complete tree is at
>> https://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/
>> staging.git/log/?h=futex_local_v10
>> https://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/
>> staging.git futex_local_v10
>>
>
> Hi Sebastian. Thanks for working on this (along with bringing back
> FUTEX2 NUMA) which
> might help large systems with many futexes.
>
> I tried this in one of our systems(Single NUMA, 80 CPUs), I see
> significant reduction in futex/hash.
> Maybe i am missing some config or doing something stupid w.r.t to
> benchmarking.
> I am trying to understand this stuff.
>
> I ran "perf bench futex all" as is. No change has been made to perf.
> =========================================
> Without patch: at 6575d1b4a6ef3336608127c704b612bc5e7b0fdc
> # Running futex/hash benchmark...
> Run summary [PID 45758]: 80 threads, each operating on 1024 [private]
> futexes for 10 secs.
> Averaged 1556023 operations/sec (+- 0.08%), total secs = 10 <<--- 1.5M
>
> =========================================
> With the Series: I had to make PR_FUTEX_HASH=78 since 77 is used for
> TIMERs.
>
> # Running futex/hash benchmark...
> Run summary [PID 8644]: 80 threads, each operating on 1024 [private]
> futexes for 10 secs.
> Averaged 150382 operations/sec (+- 0.42%), total secs = 10 <<-- 0.15M,
> close to 10x down.
>
> =========================================
>
> Did try a git bisect based on the futex/hash numbers. It narrowed it to
> this one.
> first bad commit: [5dc017a816766be47ffabe97b7e5f75919756e5c] futex:
> Allow automatic allocation of process wide futex hash.
>
> Is this expected given the complexity of hash function change?
So, I did some more benchmarking using the same perf futex hash
benchmark. I see that perf creates N threads, binds each thread to a
CPU and then calls futex_wait such that it never blocks. It always
returns EWOULDBLOCK; only futex_hash is exercised.
Numbers with different threads (private futexes):
threads   baseline   with series   (ratio)
1          3386265       3266560    0.96
10         1972069        821565    0.41
40         1580497        277900    0.17
80         1555482        150450    0.096

With shared futexes (-s option):
threads   baseline   with series   (ratio)
80          590144        585067    0.99
After looking into the code, and after some hacking, I could get the
performance back with the change below. It is likely not functionally
correct. The reasons for the change are:

1. perf report showed significant time in futex_private_hash_put, so I
removed the rcuref usage for users. That brought some improvement:
from 150k to 300k. Is there a better way to do this users protection?

2. Since the number of buckets is lower by default, this causes hb
collisions. This was seen as time in queued_spin_lock_slowpath. I
increased the hash bucket count to what it was before the series,
which brought the numbers back to 1.5M. This could be achieved with
prctl in perf/bench/futex-hash.c, I guess.

Note: Just increasing the hash bucket count without point 1 didn't
matter much.
-------
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 363a7692909d..7d01bf8caa13 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -65,7 +65,7 @@ static struct {
#define futex_queues (__futex_data.queues)
struct futex_private_hash {
- rcuref_t users;
+ int users;
unsigned int hash_mask;
struct rcu_head rcu;
void *mm;
@@ -200,7 +200,7 @@ static bool __futex_pivot_hash(struct mm_struct *mm,
fph = rcu_dereference_protected(mm->futex_phash,
lockdep_is_held(&mm->futex_hash_lock));
if (fph) {
- if (!rcuref_is_dead(&fph->users)) {
+ if (!(fph->users)) {
mm->futex_phash_new = new;
return false;
}
@@ -247,7 +247,7 @@ struct futex_private_hash *futex_private_hash(void)
if (!fph)
return NULL;
- if (rcuref_get(&fph->users))
+ if ((fph->users))
return fph;
}
futex_pivot_hash(mm);
@@ -256,7 +256,7 @@ struct futex_private_hash *futex_private_hash(void)
bool futex_private_hash_get(struct futex_private_hash *fph)
{
- return rcuref_get(&fph->users);
+ return !!(fph->users);
}
void futex_private_hash_put(struct futex_private_hash *fph)
@@ -265,7 +265,7 @@ void futex_private_hash_put(struct futex_private_hash *fph)
* Ignore the result; the DEAD state is picked up
* when rcuref_get() starts failing via rcuref_is_dead().
*/
- if (rcuref_put(&fph->users))
+ if ((fph->users))
wake_up_var(fph->mm);
}
@@ -1509,7 +1509,7 @@ void futex_hash_free(struct mm_struct *mm)
kvfree(mm->futex_phash_new);
fph = rcu_dereference_raw(mm->futex_phash);
if (fph) {
- WARN_ON_ONCE(rcuref_read(&fph->users) > 1);
+ WARN_ON_ONCE((fph->users) > 1);
kvfree(fph);
}
}
@@ -1524,7 +1524,7 @@ static bool futex_pivot_pending(struct mm_struct *mm)
return false;
fph = rcu_dereference(mm->futex_phash);
- return !rcuref_read(&fph->users);
+ return !!(fph->users);
}
static bool futex_hash_less(struct futex_private_hash *a,
@@ -1576,7 +1576,7 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
if (!fph)
return -ENOMEM;
- rcuref_init(&fph->users, 1);
+ fph->users = 1;
fph->hash_mask = hash_slots ? hash_slots - 1 : 0;
fph->custom = custom;
fph->mm = mm;
@@ -1671,6 +1671,8 @@ int futex_hash_allocate_default(void)
if (current_buckets >= buckets)
return 0;
+ buckets = 32768;
+
return futex_hash_allocate(buckets, false);
}
@@ -1732,6 +1734,8 @@ static int __init futex_init(void)
hashsize = max(4, hashsize);
hashsize = roundup_pow_of_two(hashsize);
#endif
+ hashsize = 32768;
+
futex_hashshift = ilog2(hashsize);
size = sizeof(struct futex_hash_bucket) * hashsize;
order = get_order(size);
On 2025-03-26 00:34:23 [+0530], Shrikanth Hegde wrote:
> Hi Sebastian.
Hi Shrikanth,

> So, I did some more benchmarking using the same perf futex hash
> benchmark. I see that perf creates N threads, binds each thread to a
> CPU and then calls futex_wait such that it never blocks. It always
> returns EWOULDBLOCK; only futex_hash is exercised.

It also does spin_lock() + unlock on the hash bucket. Without the
locking, you would have constant numbers.

> Numbers with different threads (private futexes):
> threads   baseline   with series   (ratio)
> 1          3386265       3266560    0.96
> 10         1972069        821565    0.41
> 40         1580497        277900    0.17
> 80         1555482        150450    0.096
>
> With shared futexes (-s option):
> threads   baseline   with series   (ratio)
> 80          590144        585067    0.99

The shared numbers are equal since the code path there is unchanged.

> After looking into the code, and after some hacking, I could get the
> performance back with the change below. It is likely not functionally
> correct. The reasons for the change are:
>
> 1. perf report showed significant time in futex_private_hash_put, so I
> removed the rcuref usage for users. That brought some improvement:
> from 150k to 300k. Is there a better way to do this users protection?

This is likely from the atomic dec operation itself. Then there is also
the preemption counter operation. The inc should also be visible but
might be inlined into the hash operation. This is _just_ the atomic
inc/dec that doubled the "throughput", but you don't have anything from
the regular path.

Anyway. To avoid the atomic part we would need a per-CPU counter
instead of a global one, and a more expensive slow path for the resize
since you have to sum up all the per-CPU counters and so on. Not sure
it is worth it.

> 2. Since the number of buckets is lower by default, this causes hb
> collisions. This was seen as time in queued_spin_lock_slowpath. I
> increased the hash bucket count to what it was before the series,
> which brought the numbers back to 1.5M. This could be achieved with
> prctl in perf/bench/futex-hash.c, I guess.

Yes. The idea is to avoid a resize at runtime and set it to something
you know best. You can also use it now to disable the private hash and
stick with the global one.

> Note: Just increasing the hash bucket count without point 1 didn't
> matter much.

Sebastian
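
[As an aside for readers: the per-CPU counter scheme Sebastian
describes above is essentially the pattern the kernel's existing
percpu_ref provides. The sketch below is hypothetical only, NOT what
the series does (it keeps rcuref_t); it merely illustrates the
fast-path/slow-path trade-off being discussed.]

#include <linux/percpu-refcount.h>

struct futex_private_hash_pcpu {
	struct percpu_ref users;	/* per-CPU while in fast mode */
	/* ... hash_mask, rcu head, buckets would follow here ... */
};

/* Invoked once all references are gone after fph_retire(). */
static void fph_release(struct percpu_ref *ref)
{
	/* free the hash via RCU here */
}

static int fph_init(struct futex_private_hash_pcpu *fph)
{
	return percpu_ref_init(&fph->users, fph_release, 0, GFP_KERNEL);
}

/* Fast path: no shared-cacheline atomics while in per-CPU mode. */
static bool fph_get(struct futex_private_hash_pcpu *fph)
{
	return percpu_ref_tryget(&fph->users);
}

static void fph_put(struct futex_private_hash_pcpu *fph)
{
	percpu_ref_put(&fph->users);
}

/*
 * Resize slow path: degrade to atomic mode and sum the per-CPU
 * counters; fph_release() runs once the count drops to zero.
 */
static void fph_retire(struct futex_private_hash_pcpu *fph)
{
	percpu_ref_kill(&fph->users);
}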
On 3/26/25 15:01, Sebastian Andrzej Siewior wrote:
> On 2025-03-26 00:34:23 [+0530], Shrikanth Hegde wrote:
>> Hi Sebastian.
> Hi Shrikanth,

Hi.

>> So, I did some more benchmarking using the same perf futex hash
>> benchmark. I see that perf creates N threads, binds each thread to a
>> CPU and then calls futex_wait such that it never blocks. It always
>> returns EWOULDBLOCK; only futex_hash is exercised.
>
> It also does spin_lock() + unlock on the hash bucket. Without the
> locking, you would have constant numbers.

Thanks for the explanations. Plus, the way perf is doing it, it would
cause all the SMT threads to be up, and the 1-thread case probably gets
the benefit of SMT folding. So anything after 40 threads, the numbers
don't change from baseline.

>> Numbers with different threads (private futexes):
>> threads   baseline   with series   (ratio)
>> 1          3386265       3266560    0.96
>> 10         1972069        821565    0.41
>> 40         1580497        277900    0.17
>> 80         1555482        150450    0.096
>>
>> With shared futexes (-s option):
>> threads   baseline   with series   (ratio)
>> 80          590144        585067    0.99
>
> The shared numbers are equal since the code path there is unchanged.
>
>> 1. perf report showed significant time in futex_private_hash_put, so I
>> removed the rcuref usage for users. That brought some improvement:
>> from 150k to 300k. Is there a better way to do this users protection?
>
> This is likely from the atomic dec operation itself. Then there is also
> the preemption counter operation. The inc should also be visible but
> might be inlined into the hash operation. This is _just_ the atomic
> inc/dec that doubled the "throughput", but you don't have anything from
> the regular path.
>
> Anyway. To avoid the atomic part we would need a per-CPU counter
> instead of a global one, and a more expensive slow path for the resize
> since you have to sum up all the per-CPU counters and so on. Not sure
> it is worth it.

A resize would happen when one does prctl, right? Or can it happen
automatically too?

fph is going to be on the thread leader's CPU, and using atomics on
fph->users would likely cause cacheline bouncing, no? Not sure if this
happens only due to this benchmark, which doesn't actually block. Maybe
in real-life use cases this doesn't matter.

>> 2. Since the number of buckets is lower by default, this causes hb
>> collisions. This was seen as time in queued_spin_lock_slowpath. I
>> increased the hash bucket count to what it was before the series,
>> which brought the numbers back to 1.5M. This could be achieved with
>> prctl in perf/bench/futex-hash.c, I guess.
>
> Yes. The idea is to avoid a resize at runtime and set it to something
> you know best. You can also use it now to disable the private hash and
> stick with the global one.

Yes. SET_SLOTS would take care of it.

>> Note: Just increasing the hash bucket count without point 1 didn't
>> matter much.
>
> Sebastian
On 2025-03-26 18:24:37 [+0530], Shrikanth Hegde wrote:
> > Anyway. To avoid the atomic part we would need a per-CPU counter
> > instead of a global one, and a more expensive slow path for the resize
> > since you have to sum up all the per-CPU counters and so on. Not sure
> > it is worth it.
>
> A resize would happen when one does prctl, right? Or can it happen
> automatically too?

Once prctl has been used, resizing happens only via prctl. Without
prctl it will start with 16 buckets once the first thread is created
(so you have two threads in total). After that it will only increase
the buckets once the thread count grows so that the current buckets are
fewer than 4 * threads. See futex_hash_allocate_default().

> fph is going to be on the thread leader's CPU, and using atomics on
> fph->users would likely cause cacheline bouncing, no?

Yes, this can happen. And since the user can resize even after using
prctl, we can't avoid the inc/dec even if we switch to custom mode.

> Not sure if this happens only due to this benchmark, which doesn't
> actually block. Maybe in real-life use cases this doesn't matter.

That is what I assume. You only go into the kernel if the futex is
occupied. If multiple threads do this at once then the cacheline
bouncing is unfortunate.

Sebastian
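
[In other words, the default sizing policy reads roughly like the
sketch below. This is a paraphrase of futex_hash_allocate_default() as
described in this thread; the real function also handles locking and
the custom/prctl case, so treat it as illustrative only.]

#include <linux/log2.h>
#include <linux/minmax.h>

static unsigned long fph_default_slots(unsigned long threads,
				       unsigned long online_cpus)
{
	/* Threads beyond the online CPU count do not buy more slots. */
	unsigned long n = min(threads, online_cpus);

	/*
	 * 16 slots once the first thread is created, otherwise four
	 * slots per thread rounded up to a power of two. A resize only
	 * happens when this exceeds the current allocation.
	 */
	return max(16UL, roundup_pow_of_two(4 * n));
}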
On Tue, 18 Mar 2025, Shrikanth Hegde wrote:
> Also, is there a benchmark that could be run to evaluate FUTEX2_NUMA, I would like to
> try it on multi-NUMA system to see the benefit.

It would be good to integrate futex2 into 'perf bench futex'.
On 2025-03-12 16:16:13 [+0100], To linux-kernel@vger.kernel.org wrote:
> The complete tree is at
> https://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/staging.git/log/?h=futex_local_v10
> https://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/staging.git futex_local_v10
>
> v9…v10: https://lore.kernel.org/all/20250225170914.289358-1-bigeasy@linutronix.de/
The exact diff vs peterz/locking/futex:
diff --git a/include/linux/futex.h b/include/linux/futex.h
index 0cdd5882e89c1..19c37afa0432a 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -82,12 +82,7 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
u32 __user *uaddr2, u32 val2, u32 val3);
int futex_hash_prctl(unsigned long arg2, unsigned long arg3);
-#ifdef CONFIG_BASE_SMALL
-static inline int futex_hash_allocate_default(void) { return 0; }
-static inline void futex_hash_free(struct mm_struct *mm) { }
-static inline void futex_mm_init(struct mm_struct *mm) { }
-#else /* !CONFIG_BASE_SMALL */
-
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
int futex_hash_allocate_default(void);
void futex_hash_free(struct mm_struct *mm);
@@ -97,7 +92,11 @@ static inline void futex_mm_init(struct mm_struct *mm)
mutex_init(&mm->futex_hash_lock);
}
-#endif /* CONFIG_BASE_SMALL */
+#else /* !CONFIG_FUTEX_PRIVATE_HASH */
+static inline int futex_hash_allocate_default(void) { return 0; }
+static inline void futex_hash_free(struct mm_struct *mm) { }
+static inline void futex_mm_init(struct mm_struct *mm) { }
+#endif /* CONFIG_FUTEX_PRIVATE_HASH */
#else /* !CONFIG_FUTEX */
static inline void futex_init_task(struct task_struct *tsk) { }
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 9399ee7d40201..e0e8adbe66bdd 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -938,7 +938,7 @@ struct mm_struct {
*/
seqcount_t mm_lock_seq;
#endif
-#if defined(CONFIG_FUTEX) && !defined(CONFIG_BASE_SMALL)
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
struct mutex futex_hash_lock;
struct futex_private_hash __rcu *futex_phash;
struct futex_private_hash *futex_phash_new;
diff --git a/include/linux/rcuref.h b/include/linux/rcuref.h
index 6322d8c1c6b42..2fb2af6d98249 100644
--- a/include/linux/rcuref.h
+++ b/include/linux/rcuref.h
@@ -30,7 +30,11 @@ static inline void rcuref_init(rcuref_t *ref, unsigned int cnt)
* rcuref_read - Read the number of held reference counts of a rcuref
* @ref: Pointer to the reference count
*
- * Return: The number of held references (0 ... N)
+ * Return: The number of held references (0 ... N). The value 0 does not
+ * indicate that it is safe to schedule the object, protected by this reference
+ * counter, for deconstruction.
+ * If you want to know if the reference counter has been marked DEAD (as
+ * signaled by rcuref_put()) please use rcuread_is_dead().
*/
static inline unsigned int rcuref_read(rcuref_t *ref)
{
@@ -40,6 +44,22 @@ static inline unsigned int rcuref_read(rcuref_t *ref)
return c >= RCUREF_RELEASED ? 0 : c + 1;
}
+/**
+ * rcuref_is_dead - Check if the rcuref has been already marked dead
+ * @ref: Pointer to the reference count
+ *
+ * Return: True if the object has been marked DEAD. This signals that a previous
+ * invocation of rcuref_put() returned true on this reference counter meaning
+ * the protected object can safely be scheduled for deconstruction.
+ * Otherwise, returns false.
+ */
+static inline bool rcuref_is_dead(rcuref_t *ref)
+{
+ unsigned int c = atomic_read(&ref->refcnt);
+
+ return (c >= RCUREF_RELEASED) && (c < RCUREF_NOREF);
+}
+
extern __must_check bool rcuref_get_slowpath(rcuref_t *ref);
/**
diff --git a/init/Kconfig b/init/Kconfig
index a0ea04c177842..a4502a9077e03 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1683,6 +1683,16 @@ config FUTEX_PI
depends on FUTEX && RT_MUTEXES
default y
+config FUTEX_PRIVATE_HASH
+ bool
+ depends on FUTEX && !BASE_SMALL && MMU
+ default y
+
+config FUTEX_MPOL
+ bool
+ depends on FUTEX && NUMA
+ default y
+
config EPOLL
bool "Enable eventpoll support" if EXPERT
default y
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 976a487bf3ad5..65523f3cfe32e 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -136,7 +136,7 @@ static inline bool futex_key_is_private(union futex_key *key)
static struct futex_hash_bucket *
__futex_hash(union futex_key *key, struct futex_private_hash *fph);
-#ifndef CONFIG_BASE_SMALL
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
static struct futex_hash_bucket *
__futex_hash_private(union futex_key *key, struct futex_private_hash *fph)
{
@@ -196,12 +196,12 @@ static bool __futex_pivot_hash(struct mm_struct *mm,
{
struct futex_private_hash *fph;
- lockdep_assert_held(&mm->futex_hash_lock);
WARN_ON_ONCE(mm->futex_phash_new);
- fph = mm->futex_phash;
+ fph = rcu_dereference_protected(mm->futex_phash,
+ lockdep_is_held(&mm->futex_hash_lock));
if (fph) {
- if (rcuref_read(&fph->users) != 0) {
+ if (!rcuref_is_dead(&fph->users)) {
mm->futex_phash_new = new;
return false;
}
@@ -262,6 +262,10 @@ bool futex_private_hash_get(struct futex_private_hash *fph)
void futex_private_hash_put(struct futex_private_hash *fph)
{
+ /*
+ * Ignore the result; the DEAD state is picked up
+ * when rcuref_get() starts failing via rcuref_is_dead().
+ */
if (rcuref_put(&fph->users))
wake_up_var(fph->mm);
}
@@ -301,7 +305,7 @@ void futex_hash_put(struct futex_hash_bucket *hb)
futex_private_hash_put(fph);
}
-#else
+#else /* !CONFIG_FUTEX_PRIVATE_HASH */
static inline struct futex_hash_bucket *
__futex_hash_private(union futex_key *key, struct futex_private_hash *fph)
@@ -314,8 +318,9 @@ struct futex_hash_bucket *futex_hash(union futex_key *key)
return __futex_hash(key, NULL);
}
-#endif /* CONFIG_BASE_SMALL */
+#endif /* CONFIG_FUTEX_PRIVATE_HASH */
+#ifdef CONFIG_FUTEX_MPOL
static int __futex_key_to_node(struct mm_struct *mm, unsigned long addr)
{
struct vm_area_struct *vma = vma_lookup(mm, addr);
@@ -325,7 +330,7 @@ static int __futex_key_to_node(struct mm_struct *mm, unsigned long addr)
if (!vma)
return FUTEX_NO_NODE;
- mpol = vma->vm_policy;
+ mpol = vma_policy(vma);
if (!mpol)
return FUTEX_NO_NODE;
@@ -373,6 +378,14 @@ static int futex_mpol(struct mm_struct *mm, unsigned long addr)
guard(mmap_read_lock)(mm);
return __futex_key_to_node(mm, addr);
}
+#else /* !CONFIG_FUTEX_MPOL */
+
+static int futex_mpol(struct mm_struct *mm, unsigned long addr)
+{
+ return FUTEX_NO_NODE;
+}
+
+#endif /* CONFIG_FUTEX_MPOL */
/**
* futex_hash - Return the hash bucket in the global hash
@@ -420,7 +433,6 @@ __futex_hash(union futex_key *key, struct futex_private_hash *fph)
return &futex_queues[node][hash & futex_hashmask];
}
-
/**
* futex_setup_timer - set up the sleeping hrtimer.
* @time: ptr to the given timeout value
@@ -932,9 +944,6 @@ int futex_unqueue(struct futex_q *q)
void futex_q_lockptr_lock(struct futex_q *q)
{
-#if 0
- struct futex_hash_bucket *hb;
-#endif
spinlock_t *lock_ptr;
/*
@@ -949,18 +958,6 @@ void futex_q_lockptr_lock(struct futex_q *q)
spin_unlock(lock_ptr);
goto retry;
}
-#if 0
- hb = container_of(lock_ptr, struct futex_hash_bucket, lock);
- /*
- * The caller needs to either hold a reference on the hash (to ensure
- * that the hash is not resized) _or_ be enqueued on the hash. This
- * ensures that futex_q::lock_ptr is updated while moved to the new
- * hash during resize.
- * Once the hash bucket is locked the resize operation, which might be
- * in progress, will block on the lock.
- */
- return hb;
-#endif
}
/*
@@ -1497,7 +1494,7 @@ void futex_exit_release(struct task_struct *tsk)
static void futex_hash_bucket_init(struct futex_hash_bucket *fhb,
struct futex_private_hash *fph)
{
-#ifndef CONFIG_BASE_SMALL
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
fhb->priv = fph;
#endif
atomic_set(&fhb->waiters, 0);
@@ -1505,21 +1502,30 @@ static void futex_hash_bucket_init(struct futex_hash_bucket *fhb,
spin_lock_init(&fhb->lock);
}
-#ifndef CONFIG_BASE_SMALL
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
void futex_hash_free(struct mm_struct *mm)
{
+ struct futex_private_hash *fph;
+
kvfree(mm->futex_phash_new);
- kvfree(mm->futex_phash);
+ fph = rcu_dereference_raw(mm->futex_phash);
+ if (fph) {
+ WARN_ON_ONCE(rcuref_read(&fph->users) > 1);
+ kvfree(fph);
+ }
}
static bool futex_pivot_pending(struct mm_struct *mm)
{
+ struct futex_private_hash *fph;
+
guard(rcu)();
if (!mm->futex_phash_new)
return false;
- return !rcuref_read(&mm->futex_phash->users);
+ fph = rcu_dereference(mm->futex_phash);
+ return !rcuref_read(&fph->users);
}
static bool futex_hash_less(struct futex_private_hash *a,
@@ -1560,7 +1566,7 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
*/
scoped_guard (rcu) {
fph = rcu_dereference(mm->futex_phash);
- if (fph && !mm->futex_phash->hash_mask) {
+ if (fph && !fph->hash_mask) {
if (custom)
return -EBUSY;
return 0;
@@ -1591,7 +1597,8 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
struct futex_private_hash *free __free(kvfree) = NULL;
struct futex_private_hash *cur, *new;
- cur = mm->futex_phash;
+ cur = rcu_dereference_protected(mm->futex_phash,
+ lockdep_is_held(&mm->futex_hash_lock));
new = mm->futex_phash_new;
mm->futex_phash_new = NULL;
@@ -1602,7 +1609,7 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
* allocated a replacement hash, drop the initial
* reference on the existing hash.
*/
- futex_private_hash_put(mm->futex_phash);
+ futex_private_hash_put(cur);
}
if (new) {
@@ -1683,7 +1690,7 @@ static int futex_hash_get_slots(void)
static int futex_hash_allocate(unsigned int hash_slots, bool custom)
{
- return 0;
+ return -EINVAL;
}
static int futex_hash_get_slots(void)
@@ -1723,6 +1730,7 @@ static int __init futex_init(void)
#else
hashsize = 256 * num_possible_cpus();
hashsize /= num_possible_nodes();
+ hashsize = max(4, hashsize);
hashsize = roundup_pow_of_two(hashsize);
#endif
futex_hashshift = ilog2(hashsize);
@@ -1740,12 +1748,15 @@ static int __init futex_init(void)
BUG_ON(!table);
for (i = 0; i < hashsize; i++)
- futex_hash_bucket_init(&table[i], 0);
+ futex_hash_bucket_init(&table[i], NULL);
futex_queues[n] = table;
}
futex_hashmask = hashsize - 1;
+ pr_info("futex hash table entries: %lu (%lu bytes on %d NUMA nodes, total %lu KiB, %s).\n",
+ hashsize, size, num_possible_nodes(), size * num_possible_nodes() / 1024,
+ order > MAX_PAGE_ORDER ? "vmalloc" : "linear");
return 0;
}
core_initcall(futex_init);
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 40f06523a3565..52e9c0c4b6c87 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -223,14 +223,15 @@ futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
extern struct futex_hash_bucket *futex_hash(union futex_key *key);
-#ifndef CONFIG_BASE_SMALL
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
extern void futex_hash_get(struct futex_hash_bucket *hb);
extern void futex_hash_put(struct futex_hash_bucket *hb);
extern struct futex_private_hash *futex_private_hash(void);
extern bool futex_private_hash_get(struct futex_private_hash *fph);
extern void futex_private_hash_put(struct futex_private_hash *fph);
-#else
+
+#else /* !CONFIG_FUTEX_PRIVATE_HASH */
static inline void futex_hash_get(struct futex_hash_bucket *hb) { }
static inline void futex_hash_put(struct futex_hash_bucket *hb) { }
diff --git a/mm/nommu.c b/mm/nommu.c
index baa79abdaf037..d04e601a8f4d7 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -209,6 +209,11 @@ EXPORT_SYMBOL(vmalloc_noprof);
void *vmalloc_huge_noprof(unsigned long size, gfp_t gfp_mask) __weak __alias(__vmalloc_noprof);
+void *vmalloc_huge_node_noprof(unsigned long size, gfp_t gfp_mask, int node)
+{
+ return vmalloc_huge_noprof(size, gfp_mask);
+}
+
/*
* vzalloc - allocate virtually contiguous memory with zero fill
*
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 39fe43183a64f..69247b46413ca 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3968,9 +3968,9 @@ EXPORT_SYMBOL_GPL(vmalloc_huge_noprof);
void *vmalloc_huge_node_noprof(unsigned long size, gfp_t gfp_mask, int node)
{
- return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
- gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
- node, __builtin_return_address(0));
+ return __vmalloc_node_range_noprof(size, 1, VMALLOC_START, VMALLOC_END,
+ gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
+ node, __builtin_return_address(0));
}
/**
Sebastian
On Wed, Mar 12, 2025 at 04:18:48PM +0100, Sebastian Andrzej Siewior wrote:
> @@ -1591,7 +1597,8 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
> 		struct futex_private_hash *free __free(kvfree) = NULL;
> 		struct futex_private_hash *cur, *new;
>
> -		cur = mm->futex_phash;
> +		cur = rcu_dereference_protected(mm->futex_phash,
> +						lockdep_is_held(&mm->futex_hash_lock));
> 		new = mm->futex_phash_new;
> 		mm->futex_phash_new = NULL;
>

Same thing again, this makes no sense.
On 2025-03-14 11:58:56 [+0100], Peter Zijlstra wrote:
> On Wed, Mar 12, 2025 at 04:18:48PM +0100, Sebastian Andrzej Siewior wrote:
>
> > @@ -1591,7 +1597,8 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
> > 		struct futex_private_hash *free __free(kvfree) = NULL;
> > 		struct futex_private_hash *cur, *new;
> >
> > -		cur = mm->futex_phash;
> > +		cur = rcu_dereference_protected(mm->futex_phash,
> > +						lockdep_is_held(&mm->futex_hash_lock));
> > 		new = mm->futex_phash_new;
> > 		mm->futex_phash_new = NULL;
> >
>
> Same thing again, this makes no sense.

With "mm->futex_phash" sparse complains about direct RCU access. This
makes it obvious that you can access it, and that it won't change as
long as you have the lock.

Sebastian
On Fri, Mar 14, 2025 at 12:28:08PM +0100, Sebastian Andrzej Siewior wrote:
> On 2025-03-14 11:58:56 [+0100], Peter Zijlstra wrote:
> > On Wed, Mar 12, 2025 at 04:18:48PM +0100, Sebastian Andrzej Siewior wrote:
> >
> > > @@ -1591,7 +1597,8 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
[ . . . ]
> >
> > Same thing again, this makes no sense.
>
> With "mm->futex_phash" sparse complains about direct RCU access.

Yeah, but sparse is stupid.

> This makes it obvious that you can access it, it won't change as long
> as you have the lock.

It's just plain confusing. rcu_dereference() says you care about the
load being single-copy atomic and the data dependency; we don't.

If we just want to shut up sparse, can't we write it like:

	cur = unrcu_pointer(mm->futex_phash);

?
On 2025-03-14 12:41:02 [+0100], Peter Zijlstra wrote:
> On Fri, Mar 14, 2025 at 12:28:08PM +0100, Sebastian Andrzej Siewior wrote:
> > On 2025-03-14 11:58:56 [+0100], Peter Zijlstra wrote:
> > > On Wed, Mar 12, 2025 at 04:18:48PM +0100, Sebastian Andrzej Siewior wrote:
[ . . . ]
> > With "mm->futex_phash" sparse complains about direct RCU access.
>
> Yeah, but sparse is stupid.

I thought we liked sparse.

> > This makes it obvious that you can access it, it won't change as long
> > as you have the lock.
>
> It's just plain confusing. rcu_dereference() says you care about the
> load being single-copy atomic and the data dependency; we don't.
>
> If we just want to shut up sparse, can't we write it like:
>
> 	cur = unrcu_pointer(mm->futex_phash);
>
> ?

But isn't rcu_dereference_protected() doing exactly this? It only
verifies that lockdep_is_held() thingy and performs a plain read, no
READ_ONCE() or anything. And the reader understands why it is safe to
access the pointer as-is.

Sebastian
On Fri, Mar 14, 2025 at 01:00:57PM +0100, Sebastian Andrzej Siewior wrote:
> On 2025-03-14 12:41:02 [+0100], Peter Zijlstra wrote:
> > On Fri, Mar 14, 2025 at 12:28:08PM +0100, Sebastian Andrzej Siewior wrote:
> > > On 2025-03-14 11:58:56 [+0100], Peter Zijlstra wrote:
> > > > On Wed, Mar 12, 2025 at 04:18:48PM +0100, Sebastian Andrzej Siewior wrote:
[ . . . ]
> > > With "mm->futex_phash" sparse complains about direct RCU access.
> >
> > Yeah, but sparse is stupid.
>
> I thought we liked sparse.

I always ignore it, too much noise.

> > > This makes it obvious that you can access it, it won't change as long
> > > as you have the lock.
> >
> > It's just plain confusing. rcu_dereference() says you care about the
> > load being single-copy atomic and the data dependency; we don't.
> >
> > If we just want to shut up sparse, can't we write it like:
> >
> > 	cur = unrcu_pointer(mm->futex_phash);
> >
> > ?
>
> But isn't rcu_dereference_protected() doing exactly this? It only
> verifies that lockdep_is_held() thingy and performs a plain read, no
> READ_ONCE() or anything. And the reader understands why it is safe to
> access the pointer as-is.

Urgh, so we have a rcu_dereference_*() function that does not in fact
imply rcu_dereference()? WTF kind of insane naming is that?

But yes, it appears you are correct :-(
On Fri, Mar 14, 2025 at 01:30:58PM +0100, Peter Zijlstra wrote:
> On Fri, Mar 14, 2025 at 01:00:57PM +0100, Sebastian Andrzej Siewior wrote:
> > On 2025-03-14 12:41:02 [+0100], Peter Zijlstra wrote:
> > > On Fri, Mar 14, 2025 at 12:28:08PM +0100, Sebastian Andrzej Siewior wrote:
> > > > On 2025-03-14 11:58:56 [+0100], Peter Zijlstra wrote:
> > > > > On Wed, Mar 12, 2025 at 04:18:48PM +0100, Sebastian Andrzej Siewior wrote:
[ . . . ]
> > > This makes it obvious that you can access it, it won't change as long
> > > as you have the lock.
> >
> > It's just plain confusing. rcu_dereference() says you care about the
> > load being single-copy atomic and the data dependency; we don't.
> >
> > If we just want to shut up sparse, can't we write it like:
> >
> > 	cur = unrcu_pointer(mm->futex_phash);
> >
> > ?
>
> But isn't rcu_dereference_protected() doing exactly this? It only
> verifies that lockdep_is_held() thingy and performs a plain read, no
> READ_ONCE() or anything. And the reader understands why it is safe to
> access the pointer as-is.
>
> Urgh, so we have a rcu_dereference_*() function that does not in fact
> imply rcu_dereference()? WTF kind of insane naming is that?

My kind of insane naming! ;-) The rationale is that "_protected" means
"protected from updates".

							Thanx, Paul

------------------------------------------------------------------------

/**
 * rcu_dereference_protected() - fetch RCU pointer when updates prevented
 * @p: The pointer to read, prior to dereferencing
 * @c: The conditions under which the dereference will take place
 *
 * Return the value of the specified RCU-protected pointer, but omit
 * the READ_ONCE(). This is useful in cases where update-side locks
 * prevent the value of the pointer from changing. Please note that this
 * primitive does *not* prevent the compiler from repeating this reference
 * or combining it with other references, so it should not be used without
 * protection of appropriate locks.
 *
 * This function is only for update-side use. Using this function
 * when protected only by rcu_read_lock() will result in infrequent
 * but very ugly failures.
 */
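
[To summarize the three accessors being discussed, a schematic sketch.
It uses the standard RCU API and the mm_struct::futex_phash field added
by this series; context requirements are reduced to comments, so treat
it as illustrative only.]

#include <linux/rcupdate.h>
#include <linux/mm_types.h>

/*
 * Reader side: marked load (READ_ONCE()) plus dependency ordering.
 * The caller must be inside an RCU read-side critical section.
 */
static struct futex_private_hash *fph_reader(struct mm_struct *mm)
{
	return rcu_dereference(mm->futex_phash);
}

/*
 * Update side: plain load, no READ_ONCE(); lockdep verifies that the
 * write-side lock is held, and sparse's __rcu check is satisfied.
 */
static struct futex_private_hash *fph_updater(struct mm_struct *mm)
{
	return rcu_dereference_protected(mm->futex_phash,
				lockdep_is_held(&mm->futex_hash_lock));
}

/* Strips the __rcu annotation only: no check, no marked load. */
static struct futex_private_hash *fph_unchecked(struct mm_struct *mm)
{
	return unrcu_pointer(mm->futex_phash);
}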
On 2025-03-14 13:30:58 [+0100], Peter Zijlstra wrote:
> > > Yeah, but sparse is stupid.
> >
> > I thought we liked sparse.
>
> I always ignore it, too much noise.

What do you suggest? Cleaning up that noise, or moving the RCU checking
towards other tooling which is restricted to RCU only? Knowing you, you
already have a plan in your cupboard but not the time :)

Sebastian
On Fri, Mar 14, 2025 at 02:30:39PM +0100, Sebastian Andrzej Siewior wrote:
> On 2025-03-14 13:30:58 [+0100], Peter Zijlstra wrote:
> > > > Yeah, but sparse is stupid.
> > >
> > > I thought we liked sparse.
> >
> > I always ignore it, too much noise.
>
> What do you suggest? Cleaning up that noise, or moving the RCU checking
> towards other tooling which is restricted to RCU only?
> Knowing you, you already have a plan in your cupboard but not the time :)

No real plan. It's been so long since I ran sparse that I can't even
remember the shape of the noise, just that there was a _lot_ of it.

But IIRC the toolchains are starting to introduce address spaces:

  https://lkml.kernel.org/r/20250127160709.80604-1-ubizjak@gmail.com

so perhaps the __rcu thing can go that way.
On Wed, Mar 12, 2025 at 04:18:48PM +0100, Sebastian Andrzej Siewior wrote:
> @@ -196,12 +196,12 @@ static bool __futex_pivot_hash(struct mm_struct *mm,
> {
> struct futex_private_hash *fph;
>
> - lockdep_assert_held(&mm->futex_hash_lock);
> WARN_ON_ONCE(mm->futex_phash_new);
>
> - fph = mm->futex_phash;
> + fph = rcu_dereference_protected(mm->futex_phash,
> + lockdep_is_held(&mm->futex_hash_lock));
I are confused... this makes no sense. Why ?!
We only ever write that variable while holding this lock, we hold the
lock, we don't need RCU to read the variable.