[PATCH 1/3] selftests/futex: Adapt the private hash test to RCU related changes

Posted by Sebastian Andrzej Siewior 3 months ago
The auto scaling on thread creation used to automatically assign the new
hash because the private hash was unused and could be replaced right
away.

With the upcoming change to wait for a RCU grace period before the hash
can be assigned, the test will always fail.

If the reported number of hash buckets is not updated after an
auto scaling event, block on an already acquired lock with a timeout. The
timeout gives the grace period time to elapse, and blocking on a locked
pthread_mutex_t ensures that glibc calls into the kernel with a futex
operation, which assigns the new private hash if one is available.
This is retried every 100ms, up to 2 seconds in total.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 .../futex/functional/futex_priv_hash.c        | 42 ++++++++++++++++++-
 1 file changed, 41 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/futex/functional/futex_priv_hash.c b/tools/testing/selftests/futex/functional/futex_priv_hash.c
index 24a92dc94eb86..625e3be4129c3 100644
--- a/tools/testing/selftests/futex/functional/futex_priv_hash.c
+++ b/tools/testing/selftests/futex/functional/futex_priv_hash.c
@@ -111,6 +111,30 @@ static void join_max_threads(void)
 	}
 }
 
+#define SEC_IN_NSEC	1000000000
+#define MSEC_IN_NSEC	1000000
+
+static void futex_dummy_op(void)
+{
+	pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
+	struct timespec timeout;
+	int ret;
+
+	pthread_mutex_lock(&lock);
+	clock_gettime(CLOCK_REALTIME, &timeout);
+	timeout.tv_nsec += 100 * MSEC_IN_NSEC;
+	if (timeout.tv_nsec >= SEC_IN_NSEC) {
+		timeout.tv_nsec -= SEC_IN_NSEC;
+		timeout.tv_sec++;
+	}
+	ret = pthread_mutex_timedlock(&lock, &timeout);
+	if (ret == 0)
+		ksft_exit_fail_msg("Successfully locked an already locked mutex.\n");
+
+	if (ret != ETIMEDOUT)
+		ksft_exit_fail_msg("pthread_mutex_timedlock() did not time out: %d.\n", ret);
+}
+
 static void usage(char *prog)
 {
 	printf("Usage: %s\n", prog);
@@ -129,7 +153,7 @@ int main(int argc, char *argv[])
 	int futex_slots1, futex_slotsn, online_cpus;
 	pthread_mutexattr_t mutex_attr_pi;
 	int use_global_hash = 0;
-	int ret;
+	int ret, retry = 20;
 	int c;
 
 	while ((c = getopt(argc, argv, "cghv:")) != -1) {
@@ -208,8 +232,24 @@ int main(int argc, char *argv[])
 	 */
 	ksft_print_msg("Online CPUs: %d\n", online_cpus);
 	if (online_cpus > 16) {
+retry_getslots:
 		futex_slotsn = futex_hash_slots_get();
 		if (futex_slotsn < 0 || futex_slots1 == futex_slotsn) {
+			retry--;
+			/*
+			 * Auto scaling on thread creation can be slightly delayed
+			 * because it waits for a RCU grace period twice. The new
+			 * private hash is assigned upon the first futex operation
+			 * after the grace period.
+			 * To cover all this for testing purposes the function
+			 * below acquires a lock and then acquires it again with a
+			 * 100ms timeout, which must time out. This ensures we
+			 * sleep for 100ms and issue a futex operation.
+			 */
+			if (retry > 0) {
+				futex_dummy_op();
+				goto retry_getslots;
+			}
 			ksft_print_msg("Expected increase of hash buckets but got: %d -> %d\n",
 				       futex_slots1, futex_slotsn);
 			ksft_exit_fail_msg(test_msg_auto_inc);
-- 
2.50.0
Re: [PATCH 1/3] selftests/futex: Adapt the private hash test to RCU related changes
Posted by Peter Zijlstra 3 months ago
On Mon, Jul 07, 2025 at 04:36:21PM +0200, Sebastian Andrzej Siewior wrote:
> The auto scaling on thread creation used to automatically assign the new
> hash because the private hash was unused and could be replaced right
> away.
> 
> With the upcoming change to wait for a RCU grace period before the hash
> can be assigned, the test will always fail.
> 
> If the reported number of hash buckets is not updated after an
> auto scaling event, block on an already acquired lock with a timeout. The
> timeout gives the grace period time to elapse, and blocking on a locked
> pthread_mutex_t ensures that glibc calls into the kernel with a futex
> operation, which assigns the new private hash if one is available.
> This is retried every 100ms, up to 2 seconds in total.

So the auto scaling thing is 'broken' in that if you do a 'final'
pthread_create() it will try and stage this new hash. If for whatever
reason the refcount isn't '0' -- and this can already happen today due
to a concurrent futex operation -- nothing will re-try the rehash.

This RCU business made this all but a certainty, but the race was there.

I briefly wondered if we should have a workqueue re-try the rehash.
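
Roughly, such a workqueue retry could look like the sketch below. This is
only an illustration of the idea, not the actual futex code:
futex_try_pivot_pending() and the futex_rehash_work member of mm_struct are
made-up names for whatever would re-check the pending hash and its refcount.

#include <linux/workqueue.h>

/*
 * Illustration only: have a delayed work item retry the rehash when the
 * final pthread_create() could not install the new hash because the
 * reference count of the current one was not zero.  The work would be
 * armed from the pivot path; both helpers below are hypothetical.
 */
static void futex_rehash_workfn(struct work_struct *work)
{
	struct mm_struct *mm = container_of(to_delayed_work(work),
					    struct mm_struct,
					    futex_rehash_work);

	/* Try the pivot again; if the hash is still busy, re-arm. */
	if (!futex_try_pivot_pending(mm))
		schedule_delayed_work(&mm->futex_rehash_work,
				      msecs_to_jiffies(100));
}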
Re: [PATCH 1/3] selftests/futex: Adapt the private hash test to RCU related changes
Posted by Sebastian Andrzej Siewior 3 months ago
On 2025-07-08 09:41:24 [+0200], Peter Zijlstra wrote:
> On Mon, Jul 07, 2025 at 04:36:21PM +0200, Sebastian Andrzej Siewior wrote:
> > The auto scaling on thread creation used to automatically assign the new
> > hash because the private hash was unused and could be replaced right
> > away.
> > 
> > With the upcoming change to wait for a RCU grace period before the hash
> > can be assigned, the test will always fail.
> > 
> > If the reported number of hash buckets is not updated after an
> > auto scaling event, block on an already acquired lock with a timeout. The
> > timeout gives the grace period time to elapse, and blocking on a locked
> > pthread_mutex_t ensures that glibc calls into the kernel with a futex
> > operation, which assigns the new private hash if one is available.
> > This is retried every 100ms, up to 2 seconds in total.
> 
> So the auto scaling thing is 'broken' in that if you do a 'final'
> pthread_create() it will try and stage this new hash. If for whatever
> reason the refcount isn't '0' -- and this can already happen today due
> to a concurrent futex operation -- nothing will re-try the rehash.

Sure it was there but not in the way the test case was set up. I *think*
it is okay because in real life the new hash will be put to use unless
you terminate shortly after, at which point you don't need it.

> This RCU business made this all but a certainty, but the race was there.
>
> I briefly wondered if we should have a workqueue re-try the rehash.

I don't think we need to worry about it.

Sebastian
Re: [PATCH 1/3] selftests/futex: Adapt the private hash test to RCU related changes
Posted by Sebastian Andrzej Siewior 2 months, 4 weeks ago
On 2025-07-08 10:01:42 [+0200], To Peter Zijlstra wrote:
> > So the auto scaling thing is 'broken' in that if you do a 'final'
> > pthread_create() it will try and stage this new hash. If for whatever
> > reason the refcount isn't '0' -- and this can already happen today due
> > to a concurrent futex operation -- nothing will re-try the rehash.
> 
> Sure it was there but not in the way the test case was set up. I *think*
> it is okay because in real life the new hash will be put to use unless
> you terminate shortly after, at which point you don't need it.
> 
> > This RCU business made this all but a certainty, but the race was there.
> >
> > I briefly wondered if we should have a workqueue re-try the rehash.
> 
> I don't think we need to worry about it.

Just to replace the *think* arguments with some technical foundation: I
added a few trace printks to see how it is going and this came out:

|    zstd-2308004 [012] ..... 01.857262: __futex_pivot_hash: mm ffff88854c3f9b80 fph ffff88854ed3f000 (16)
first thread / first assignment
|    zstd-2308004 [012] ..... 01.858722: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff88854a795000
|    zstd-2308004 [012] ..... 01.858760: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff88854a795000
|    zstd-2308004 [012] ..... 01.858792: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff88854a795000
|    zstd-2308004 [012] ..... 01.858830: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff88854a795000
|    zstd-2308004 [012] ..... 01.858865: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff888782280000
|    zstd-2308004 [012] ..... 01.858913: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff888782280000
|    zstd-2308004 [012] ..... 01.858963: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff888782280000
|    zstd-2308004 [012] ..... 01.859019: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff888782280000
|    zstd-2308004 [012] ..... 01.859057: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff888782280000
|    zstd-2308004 [012] ..... 01.859092: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff888782280000
|    zstd-2308004 [012] ..... 01.859125: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff888782280000
|    zstd-2308004 [012] ..... 01.859160: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff888782280000
|    zstd-2308004 [012] ..... 01.859202: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff8885b65c8000

One thread after another is created, so each creation ends up allocating
a new futex_private_hash because the current mask is smaller than what we
want to have. We then acquire mm_struct::futex_hash_lock, notice that
mm_struct::futex_phash_new is already set (pending in the trace) and
free the newly allocated one.
As an optimization we could check mm_struct::futex_phash_new under the
lock and skip the allocation if the already allocated fph is okay. We
will acquire the lock for assignment anyway, so we could skip the
allocation in this case. As you can see, the pending address changed a
few times because as the number of threads increased, the size also
increased three times.
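
In rough pseudo-code the idea would be the following. This is not the
actual futex_hash_allocate() code: futex_hash_allocate_and_stage(), the
hash_mask comparison and the mutex type of futex_hash_lock are only
stand-ins for illustration.

/*
 * Sketch of the proposed optimization: peek at the pending hash under
 * futex_hash_lock first and skip the allocation when a big enough hash
 * is already staged.  All names below are assumed.
 */
static int futex_hash_allocate_checked(struct mm_struct *mm,
				       unsigned int buckets)
{
	bool pending_ok;

	mutex_lock(&mm->futex_hash_lock);
	/* Is a big enough hash already pending?  Then nothing to do. */
	pending_ok = mm->futex_phash_new &&
		     mm->futex_phash_new->hash_mask + 1 >= buckets;
	mutex_unlock(&mm->futex_hash_lock);

	if (pending_ok)
		return 0;

	/* Otherwise allocate and stage the new hash as before. */
	return futex_hash_allocate_and_stage(mm, buckets);
}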

|  <idle>-0       [012] ..s1. 01.890423: futex_ref_rcu: mm ffff88854c3f9b80 [0] ffff88854ed3f000
|  <idle>-0       [012] ..s1. 01.926421: futex_ref_rcu: mm ffff88854c3f9b80 [1] ffff88854ed3f000
the state transition, two times because setting up the threads was so
quick.

|    zstd-2308007 [031] ..... 02.724004: __futex_pivot_hash: mm ffff88854c3f9b80 fph ffff8885b65c8000 (128)
almost a second later the first futex op led to the assignment of the hash.

|    zstd-2308004 [012] ..... 41.162795: futex_hash_free: mm ffff88854c3f9b80 fph ffff8885b65c8000 new 0000000000000000
almost 40 secs later, the application is done.

This is an example of "let me set up the threads, I need them for a while".
A different one:

|         lz4-2309026 [008] ..... 2.654404: __futex_pivot_hash: mm ffff888664977380 fph ffff8885b2000800 (16)
first assignment.

|         lz4-2309026 [008] ..... 2.654597: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff88854d0bc000
|         lz4-2309026 [008] ..... 2.654641: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff88854d0bc000
|         lz4-2309026 [008] ..... 2.654690: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff88854d0bc000
|         lz4-2309026 [008] ..... 2.654732: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff88854d0bc000
|         lz4-2309026 [008] ..... 2.654774: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff888615e3e000
|         lz4-2309026 [008] ..... 2.654812: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff888615e3e000
|         lz4-2309026 [008] ..... 2.654847: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff888615e3e000
|         lz4-2309026 [008] ..... 2.654884: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff888615e3e000
|         lz4-2309026 [008] ..... 2.654919: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff888615e3e000
|         lz4-2309026 [008] ..... 2.654962: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff888615e3e000
|         lz4-2309026 [008] ..... 2.655002: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff888615e3e000
|         lz4-2309026 [008] ..... 2.655039: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff888615e3e000
|         lz4-2309026 [008] ..... 2.655202: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff8885ac4b4000
|         lz4-2309026 [008] ..... 2.655250: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff8885ac4b4000
|         lz4-2309026 [008] ..... 2.655288: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff8885ac4b4000
|         lz4-2309026 [008] ..... 2.655329: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff8885ac4b4000
|         lz4-2309026 [008] ..... 2.655370: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff8885ac4b4000
|         lz4-2309026 [008] ..... 2.655411: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff8885ac4b4000
|         lz4-2309026 [008] ..... 2.655449: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff8885ac4b4000
|         lz4-2309026 [008] ..... 2.655488: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff8885ac4b4000
|         lz4-2309026 [008] ..... 2.655525: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff8885ac4b4000
|         lz4-2309026 [008] ..... 2.655562: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff8885ac4b4000
|         lz4-2309026 [008] ..... 2.655603: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff8885ac4b4000
a bunch of threads is created.

|      <idle>-0       [008] ..s1. 2.689306: futex_ref_rcu: mm ffff888664977380 [0] ffff8885b2000800
|      <idle>-0       [008] ..s1. 2.725324: futex_ref_rcu: mm ffff888664977380 [1] ffff8885b2000800
state transition over

|kworker/8:38-2307372 [008] ..... 2.727104: futex_hash_free: mm ffff888664977380 fph ffff8885b2000800 new ffff8885ac4b4000

The hash is released by the kworker due to mmput_async(). It frees the
current hash and the new one which was never assigned.
I have a few examples similar to the lz4 one where the new hash was not
assigned, either because the application terminated early or because it
did not invoke any futex opcodes after that.

Therefore it does not make sense to mobilize a kworker just to assign the
new hash as soon as it becomes possible. Avoiding the alloc/free dance,
however, would make sense.

Sebastian