[PATCH v12 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL

Sebastian Andrzej Siewior posted 21 patches 8 months ago
Posted by Sebastian Andrzej Siewior 8 months ago
This is a follow-up on
        https://lore.kernel.org/ZwVOMgBMxrw7BU9A@jlelli-thinkpadt14gen4.remote.csb

and adds support for task local futex_hash_bucket.

This is the local hash map series with PeterZ' FUTEX2_NUMA and
FUTEX2_MPOL. It has now gone through some testing with the selftests…

The complete tree is at
        https://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/staging.git/log/?h=futex_local_v12
        https://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/staging.git futex_local_v12

v11…v12: https://lore.kernel.org/all/20250407155742.968816-1-bigeasy@linutronix.de
  - Moved futex_hash_put() in futex_lock_pi() before
    rt_mutex_pre_schedule() for obvious reasons.
  
  - Use __GFP_NOWARN while allocating the local hash to suppress warnings
    about failures, especially if huge values were used and vmalloc
    refuses.

  - The "immutable" mode is its own patch. The basic infrastructure patch
    enforces a "0" for prctl()'s arg4. The "immutable mode" allows only 0
    (disabled) or 1 (enabled) as argument.
    The "perf bench" futex benchmark adds "bucket" and "immutable" support.
  
  - The position of the node member after the uaddr was computed in units
    of u32. Added a cast to (void *) to get the math right.
  
  - Added FUTEX2_MPOL to FUTEX2_VALID_MASK assuming that we want to expose
    it. However the mpol seems not to work here but it is likely that my
    setup is proper.
  
  - If the user specified FUTEX_NO_NODE as node then the node is updated
    to a valid node number. The node value is only written back to the
    user if it has been changed.
    While this avoids the unnecessary write back if the user supplied
    a valid node number, the whole interface is slightly racy: if
    FUTEX_NO_NODE is supplied and two futex_wait() invocations run in
    parallel, then the first invocation can set node to 0 and the second
    to 1. The following callers will stick to node 1 but the first one
    will remain waiting on the wrong node.
  
  - Added selftests for private hash and the NUMA bits.
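
The node offset and the conditional write back described above can be
modelled in userspace. This is only a minimal sketch, not the kernel code:
resolve_node() and its fallback_node parameter are hypothetical stand-ins
for the futex_mpol()/numa_node_id() lookup, FUTEX_NO_NODE is assumed to be
-1, a 32-bit futex word directly followed by a 32-bit node word is assumed,
and a char pointer replaces the kernel's (void *) arithmetic. The range
check on the node number is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FUTEX_NO_NODE (-1)	/* assumed value of the "no node" marker */

/*
 * Locate the node word that follows the futex word. "size" covers both
 * words, so the node word starts size / 2 BYTES after uaddr. Without the
 * byte-pointer cast, "uaddr + size / 2" would advance by size / 2 u32
 * elements, which is four times too far.
 */
static uint32_t *node_addr(uint32_t *uaddr, size_t size)
{
	return (uint32_t *)((char *)uaddr + size / 2);
}

/*
 * Hypothetical model of the node resolution. Returns the node to use and
 * writes it back only if the user-supplied value had to be replaced.
 */
static int resolve_node(uint32_t *uaddr, size_t size, int fallback_node,
			bool *wrote_back)
{
	uint32_t *naddr = node_addr(uaddr, size);
	int node = (int)*naddr;
	bool node_updated = false;

	if (node == FUTEX_NO_NODE) {
		node = fallback_node;
		node_updated = true;
	}
	/* A valid user-supplied node is left untouched in memory. */
	if (node_updated)
		*naddr = (uint32_t)node;

	*wrote_back = node_updated;
	return node;
}
```

Calling resolve_node() on a { value, FUTEX_NO_NODE } pair stores the
fallback node in the second word. A second call on the same pair finds a
valid node and performs no write.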

v10…v11: https://lore.kernel.org/all/20250312151634.2183278-1-bigeasy@linutronix.de
  - PeterZ' fixups and changes to the local hash series have been folded
    into the earlier patches so that things are not added and then
    renamed, or changed in functionality, later in the series.

  - vmalloc_huge() has been implemented on top of vmalloc_huge_node()
    and the NOMMU bits have been adjusted. akpm asked for this.

  - wake_up_var() has been removed from __futex_pivot_hash(). It is
    enough to wake the userspace waiter after the final put so it can
    perform the resize itself.

  - Changed the logic in futex_pivot_pending() so it does not block for
    the user. It waits for __futex_pivot_hash() which follows the logic
    in __futex_pivot_hash().

  - Updated kernel doc for __futex_hash().

  - Patches 17+ are new:
    - Wire up PR_FUTEX_HASH_SET_SLOTS in "perf bench futex"
    - Add "immutable" mode to PR_FUTEX_HASH_SET_SLOTS to avoid resizing
      the local hash any further. This avoids rcuref usage which is
      noticeable in "perf bench futex hash"


Peter Zijlstra (8):
  mm: Add vmalloc_huge_node()
  futex: Move futex_queue() into futex_wait_setup()
  futex: Pull futex_hash() out of futex_q_lock()
  futex: Create hb scopes
  futex: Create futex_hash() get/put class
  futex: Create private_hash() get/put class
  futex: Implement FUTEX2_NUMA
  futex: Implement FUTEX2_MPOL

Sebastian Andrzej Siewior (13):
  rcuref: Provide rcuref_is_dead()
  futex: Acquire a hash reference in futex_wait_multiple_setup()
  futex: Decrease the waiter count before the unlock operation
  futex: Introduce futex_q_lockptr_lock()
  futex: Create helper function to initialize a hash slot
  futex: Add basic infrastructure for local task local hash
  futex: Allow automatic allocation of process wide futex hash
  futex: Allow to resize the private local hash
  futex: Allow to make the private hash immutable
  tools headers: Synchronize prctl.h ABI header
  tools/perf: Allow to select the number of hash buckets
  selftests/futex: Add futex_priv_hash
  selftests/futex: Add futex_numa_mpol

 include/linux/futex.h                         |  36 +-
 include/linux/mm_types.h                      |   7 +-
 include/linux/mmap_lock.h                     |   4 +
 include/linux/rcuref.h                        |  22 +-
 include/linux/vmalloc.h                       |   9 +-
 include/uapi/linux/futex.h                    |  10 +-
 include/uapi/linux/prctl.h                    |   6 +
 init/Kconfig                                  |  10 +
 io_uring/futex.c                              |   4 +-
 kernel/fork.c                                 |  24 +
 kernel/futex/core.c                           | 802 ++++++++++++++++--
 kernel/futex/futex.h                          |  73 +-
 kernel/futex/pi.c                             | 306 ++++---
 kernel/futex/requeue.c                        | 480 +++++------
 kernel/futex/waitwake.c                       | 201 +++--
 kernel/sys.c                                  |   4 +
 mm/nommu.c                                    |  18 +-
 mm/vmalloc.c                                  |  11 +-
 tools/include/uapi/linux/prctl.h              |  44 +-
 tools/perf/bench/Build                        |   1 +
 tools/perf/bench/futex-hash.c                 |   7 +
 tools/perf/bench/futex-lock-pi.c              |   5 +
 tools/perf/bench/futex-requeue.c              |   6 +
 tools/perf/bench/futex-wake-parallel.c        |   9 +-
 tools/perf/bench/futex-wake.c                 |   4 +
 tools/perf/bench/futex.c                      |  65 ++
 tools/perf/bench/futex.h                      |   5 +
 .../selftests/futex/functional/.gitignore     |   6 +-
 .../selftests/futex/functional/Makefile       |   4 +-
 .../futex/functional/futex_numa_mpol.c        | 232 +++++
 .../futex/functional/futex_priv_hash.c        | 315 +++++++
 .../testing/selftests/futex/functional/run.sh |   7 +
 .../selftests/futex/include/futex2test.h      |  34 +
 33 files changed, 2199 insertions(+), 572 deletions(-)
 create mode 100644 tools/perf/bench/futex.c
 create mode 100644 tools/testing/selftests/futex/functional/futex_numa_mpol.c
 create mode 100644 tools/testing/selftests/futex/functional/futex_priv_hash.c

-- 
2.49.0
Re: [PATCH v12 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
Posted by Sebastian Andrzej Siewior 8 months ago
On 2025-04-16 18:29:00 [+0200], To linux-kernel@vger.kernel.org wrote:
> v11…v12: https://lore.kernel.org/all/20250407155742.968816-1-bigeasy@linutronix.de

A diff excluding the tools/testing/ changes:

diff --git a/include/linux/futex.h b/include/linux/futex.h
index 96c7229856d97..eccc99751bd94 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -109,7 +109,7 @@ static inline long do_futex(u32 __user *uaddr, int op, u32 val,
 {
 	return -EINVAL;
 }
-static inline int futex_hash_prctl(unsigned long arg2, unsigned long arg3)
+static inline int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4)
 {
 	return -EINVAL;
 }
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 44bb9eeb0a9c1..ee1d7182ce0c0 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -551,6 +551,7 @@ int get_futex_key(u32 __user *uaddr, unsigned int flags, union futex_key *key,
 	struct folio *folio;
 	struct address_space *mapping;
 	int node, err, size, ro = 0;
+	bool node_updated = false;
 	bool fshared;
 
 	fshared = flags & FLAGS_SHARED;
@@ -575,24 +576,29 @@ int get_futex_key(u32 __user *uaddr, unsigned int flags, union futex_key *key,
 	node = FUTEX_NO_NODE;
 
 	if (flags & FLAGS_NUMA) {
-		u32 __user *naddr = uaddr + size / 2;
+		u32 __user *naddr = (void *)uaddr + size / 2;
 
 		if (futex_get_value(&node, naddr))
 			return -EFAULT;
 
-		if (node >= MAX_NUMNODES || !node_possible(node))
+		if (node != FUTEX_NO_NODE &&
+		    (node >= MAX_NUMNODES || !node_possible(node)))
 			return -EINVAL;
 	}
 
-	if (node == FUTEX_NO_NODE && (flags & FLAGS_MPOL))
+	if (node == FUTEX_NO_NODE && (flags & FLAGS_MPOL)) {
 		node = futex_mpol(mm, address);
+		node_updated = true;
+	}
 
 	if (flags & FLAGS_NUMA) {
-		u32 __user *naddr = uaddr + size / 2;
+		u32 __user *naddr = (void *)uaddr + size / 2;
 
-		if (node == FUTEX_NO_NODE)
+		if (node == FUTEX_NO_NODE) {
 			node = numa_node_id();
-		if (futex_put_value(node, naddr))
+			node_updated = true;
+		}
+		if (node_updated && futex_put_value(node, naddr))
 			return -EFAULT;
 	}
 
@@ -1573,6 +1579,8 @@ static int futex_hash_allocate(unsigned int hash_slots, unsigned int immutable,
 
 	if (hash_slots && (hash_slots == 1 || !is_power_of_2(hash_slots)))
 		return -EINVAL;
+	if (immutable > 2)
+		return -EINVAL;
 
 	/*
 	 * Once we've disabled the global hash there is no way back.
@@ -1586,7 +1594,7 @@ static int futex_hash_allocate(unsigned int hash_slots, unsigned int immutable,
 		}
 	}
 
-	fph = kvzalloc(struct_size(fph, queues, hash_slots), GFP_KERNEL_ACCOUNT);
+	fph = kvzalloc(struct_size(fph, queues, hash_slots), GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
 	if (!fph)
 		return -ENOMEM;
 
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 004e4dbee4f93..069fc2a83080d 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -55,7 +55,7 @@ static inline unsigned int futex_to_flags(unsigned int op)
 	return flags;
 }
 
-#define FUTEX2_VALID_MASK (FUTEX2_SIZE_MASK | FUTEX2_NUMA | FUTEX2_PRIVATE)
+#define FUTEX2_VALID_MASK (FUTEX2_SIZE_MASK | FUTEX2_NUMA | FUTEX2_MPOL | FUTEX2_PRIVATE)
 
 /* FUTEX2_ to FLAGS_ */
 static inline unsigned int futex2_to_flags(unsigned int flags2)
diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index 356e52c17d3c5..dacb2330f1fbc 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -993,6 +993,16 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
 			goto no_block;
 		}
 
+		/*
+		 * Caution; releasing @hb in-scope. The hb->lock is still locked
+		 * while the reference is dropped. The reference can not be dropped
+		 * after the unlock because if a user initiated resize is in progress
+		 * then we might need to wake him. This can not be done after the
+		 * rt_mutex_pre_schedule() invocation. The hb will remain valid because
+		 * the thread, performing resize, will block on hb->lock during
+		 * the requeue.
+		 */
+		futex_hash_put(no_free_ptr(hb));
 		/*
 		 * Must be done before we enqueue the waiter, here is unfortunately
 		 * under the hb lock, but that *should* work because it does nothing.
@@ -1016,10 +1026,6 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
 		 */
 		raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock);
 		spin_unlock(q.lock_ptr);
-		/*
-		 * Caution; releasing @hb in-scope.
-		 */
-		futex_hash_put(no_free_ptr(hb));
 		/*
 		 * __rt_mutex_start_proxy_lock() unconditionally enqueues the @rt_waiter
 		 * such that futex_unlock_pi() is guaranteed to observe the waiter when
diff --git a/tools/perf/bench/futex.c b/tools/perf/bench/futex.c
index bed3b6e46d109..02ae6c52ba881 100644
--- a/tools/perf/bench/futex.c
+++ b/tools/perf/bench/futex.c
@@ -31,20 +31,25 @@ void futex_print_nbuckets(struct bench_futex_parameters *params)
 	if (params->nbuckets >= 0) {
 		if (ret != params->nbuckets) {
 			if (ret < 0) {
-				printf("Can't query number of buckets: %d/%m\n", ret);
+				printf("Can't query number of buckets: %m\n");
 				err(EXIT_FAILURE, "prctl(PR_FUTEX_HASH)");
 			}
 			printf("Requested number of hash buckets does not currently used.\n");
 			printf("Requested: %d in usage: %d\n", params->nbuckets, ret);
 			err(EXIT_FAILURE, "prctl(PR_FUTEX_HASH)");
 		}
-		ret = prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_IMMUTABLE);
-		if (params->nbuckets == 0)
+		if (params->nbuckets == 0) {
 			ret = asprintf(&futex_hash_mode, "Futex hashing: global hash");
-		else
+		} else {
+			ret = prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_IMMUTABLE);
+			if (ret < 0) {
+				printf("Can't check if the hash is immutable: %m\n");
+				err(EXIT_FAILURE, "prctl(PR_FUTEX_HASH)");
+			}
 			ret = asprintf(&futex_hash_mode, "Futex hashing: %d hash buckets %s",
 				       params->nbuckets,
 				       ret == 1 ? "(immutable)" : "");
+		}
 	} else {
 		if (ret <= 0) {
 			ret = asprintf(&futex_hash_mode, "Futex hashing: global hash");


Sebastian
Re: [PATCH v12 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
Posted by Peter Zijlstra 7 months, 2 weeks ago
On Wed, Apr 16, 2025 at 06:31:42PM +0200, Sebastian Andrzej Siewior wrote:
> On 2025-04-16 18:29:00 [+0200], To linux-kernel@vger.kernel.org wrote:
> > v11…v12: https://lore.kernel.org/all/20250407155742.968816-1-bigeasy@linutronix.de

I made a few changes (mostly the stuff I mailed about) and pushed out to
queue/locking/futex.
Re: [PATCH v12 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
Posted by Peter Zijlstra 7 months, 2 weeks ago
On Fri, May 02, 2025 at 09:48:07PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 16, 2025 at 06:31:42PM +0200, Sebastian Andrzej Siewior wrote:
> > On 2025-04-16 18:29:00 [+0200], To linux-kernel@vger.kernel.org wrote:
> > > v11…v12: https://lore.kernel.org/all/20250407155742.968816-1-bigeasy@linutronix.de
> 
> I made a few changes (mostly the stuff I mailed about) and pushed out to
> queue/locking/futex.

And again, with hopefully less build errors included :-)
Re: [PATCH v12 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
Posted by Sebastian Andrzej Siewior 7 months, 2 weeks ago
On 2025-05-03 12:09:05 [+0200], Peter Zijlstra wrote:
> On Fri, May 02, 2025 at 09:48:07PM +0200, Peter Zijlstra wrote:
> > On Wed, Apr 16, 2025 at 06:31:42PM +0200, Sebastian Andrzej Siewior wrote:
> > > On 2025-04-16 18:29:00 [+0200], To linux-kernel@vger.kernel.org wrote:
> > > > v11…v12: https://lore.kernel.org/all/20250407155742.968816-1-bigeasy@linutronix.de
> > 
> > I made a few changes (mostly the stuff I mailed about) and pushed out to
> > queue/locking/futex.
> 
> And again, with hopefully less build errors included :-)

Okay. I guess the NUMA part where the nodeid is written back to userland
if 0 was supplied is not an issue. I was worried that if you fire
multiple threads which end up in the sys_futex_wait() at the same time,
waiting on the same addr on two nodes and the "current" nodeid is used
then the variable might be written back twice with two node ids. The
mpol interface should report always the same one.

Sebastian
Re: [PATCH v12 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
Posted by Peter Zijlstra 7 months, 2 weeks ago
On Mon, May 05, 2025 at 09:30:36AM +0200, Sebastian Andrzej Siewior wrote:
> On 2025-05-03 12:09:05 [+0200], Peter Zijlstra wrote:
> > On Fri, May 02, 2025 at 09:48:07PM +0200, Peter Zijlstra wrote:
> > > On Wed, Apr 16, 2025 at 06:31:42PM +0200, Sebastian Andrzej Siewior wrote:
> > > > On 2025-04-16 18:29:00 [+0200], To linux-kernel@vger.kernel.org wrote:
> > > > > v11…v12: https://lore.kernel.org/all/20250407155742.968816-1-bigeasy@linutronix.de
> > > 
> > > I made a few changes (mostly the stuff I mailed about) and pushed out to
> > > queue/locking/futex.
> > 
> > And again, with hopefully less build errors included :-)
> 
> Okay. I guess the NUMA part where the nodeid is written back to userland
> if 0 was supplied is not an issue. I was worried that if you fire
> multiple threads which end up in the sys_futex_wait() at the same time,
> waiting on the same addr on two nodes and the "current" nodeid is used
> then the variable might be written back twice with two node ids. The
> mpol interface should report always the same one.

Well, if you do stupid things, you get to keep the pieces or something
along those lines. Same as when userspace goes scribble the node value
while another thread is waiting and all that.

Even with the unconditional write back you're going to have a problem
with concurrent wait on the same futex.
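
The concurrent-wait problem discussed here can be replayed as a
deterministic interleaving. This is a rough single-threaded sketch with
hypothetical names (struct waiter, read_node(), resolve_and_write()),
assuming FUTEX_NO_NODE is -1 and that each waiter falls back to the node
of the CPU it runs on. It is a model of the discussion, not kernel code.

```c
#define FUTEX_NO_NODE (-1)	/* assumed value of the "no node" marker */

/* Hypothetical per-waiter state for the model. */
struct waiter {
	int cpu_node;	/* node of the CPU the waiter runs on */
	int seen;	/* node value read from the futex's node word */
	int queued_on;	/* node whose hash the waiter ends up queued on */
};

/* Step 1: read the user-visible node word. */
static void read_node(struct waiter *w, const int *node_word)
{
	w->seen = *node_word;
}

/* Step 2: replace FUTEX_NO_NODE with the local node, write it back, queue. */
static void resolve_and_write(struct waiter *w, int *node_word)
{
	int node = w->seen;

	if (node == FUTEX_NO_NODE) {
		node = w->cpu_node;
		*node_word = node;
	}
	w->queued_on = node;
}
```

If waiter a runs on node 0, waiter b on node 1, and both read the node word
before either writes it, the word ends up as 1 while a is queued on node 0.
A later caller that reads node 1 would look in the wrong place for a.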
Re: [PATCH v12 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
Posted by Sebastian Andrzej Siewior 7 months, 1 week ago
On 2025-05-06 09:36:11 [+0200], Peter Zijlstra wrote:
> Well, if you do stupid things, you get to keep the pieces or something
> along those lines. Same as when userspace goes scribble the node value
> while another thread is waiting and all that.
> 
> Even with the unconditional write back you're going to have a problem
> with concurrent wait on the same futex.

We could add a global lock for the write back case to ensure there is
only one at a time. However let me document the current behaviour of the
new pieces and tick it off ;)

Sebastian