From: Peter Zijlstra <peterz@infradead.org>
The use of rcuref_t for reference counting introduces a performance bottleneck
when accessed concurrently by multiple threads during futex operations.
Replace rcuref_t with specially crafted per-CPU reference counters. The
lifetime logic remains the same.
The newly allocated private hash starts in FR_PERCPU state. In this state, each
futex operation that requires the private hash uses a per-CPU counter (an
unsigned int) for incrementing or decrementing the reference count.
When the private hash is about to be replaced, the per-CPU counters are
migrated to an atomic_t counter, mm_struct::futex_atomic.
The migration process:
- Waiting for one RCU grace period to ensure all users observe the
current private hash. This can be skipped if a grace period elapsed
since the private hash was assigned.
- futex_private_hash::state is set to FR_ATOMIC, forcing all users to
use mm_struct::futex_atomic for reference counting.
- After an RCU grace period, all users are guaranteed to be using the
atomic counter. The per-CPU counters can now be summed up and added to
the atomic_t counter. If the resulting count is zero, the hash can be
safely replaced. Otherwise, active users still hold a valid reference.
- Once the atomic reference count drops to zero, the next futex
operation will switch to the new private hash.
call_rcu_hurry() is used to speed up the transition, which otherwise might be
delayed with RCU_LAZY. There is nothing wrong with using call_rcu(). The
side effects would be that on auto scaling the new hash is used later
and the SET_SLOTS prctl() will block longer.
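
For orientation, the get/put fast path described above is condensed below from
the futex_ref_get()/futex_ref_put() helpers added later in this patch; the
immutable-hash shortcut in futex_private_hash_get()/_put() and the slow-path
fold of the per-CPU counts into mm_struct::futex_atomic (see
__futex_ref_atomic_end()) are left out:

	static bool futex_ref_get(struct futex_private_hash *fph)
	{
		struct mm_struct *mm = fph->mm;

		guard(rcu)();

		/* Common case: count in the per-CPU counter, no shared cacheline. */
		if (smp_load_acquire(&fph->state) == FR_PERCPU) {
			this_cpu_inc(*mm->futex_ref);
			return true;
		}

		/* Hash is being replaced: fall back to the mm-wide atomic counter. */
		return atomic_long_inc_not_zero(&mm->futex_atomic);
	}

	static bool futex_ref_put(struct futex_private_hash *fph)
	{
		struct mm_struct *mm = fph->mm;

		guard(rcu)();

		/* Per-CPU mode: this can never be the final reference. */
		if (smp_load_acquire(&fph->state) == FR_PERCPU) {
			this_cpu_dec(*mm->futex_ref);
			return false;
		}

		/* True on the final put, once the per-CPU counts have been folded in. */
		return atomic_long_dec_and_test(&mm->futex_atomic);
	}
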
[bigeasy: commit description + mm get/ put_async]
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: linux-mm@kvack.org
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/futex.h | 16 +--
include/linux/mm_types.h | 5 +
include/linux/sched/mm.h | 2 +-
init/Kconfig | 4 -
kernel/fork.c | 8 +-
kernel/futex/core.c | 243 ++++++++++++++++++++++++++++++++++++---
6 files changed, 243 insertions(+), 35 deletions(-)
diff --git a/include/linux/futex.h b/include/linux/futex.h
index b37193653e6b5..9e9750f049805 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -85,18 +85,12 @@ int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4)
#ifdef CONFIG_FUTEX_PRIVATE_HASH
int futex_hash_allocate_default(void);
void futex_hash_free(struct mm_struct *mm);
-
-static inline void futex_mm_init(struct mm_struct *mm)
-{
- RCU_INIT_POINTER(mm->futex_phash, NULL);
- mm->futex_phash_new = NULL;
- mutex_init(&mm->futex_hash_lock);
-}
+int futex_mm_init(struct mm_struct *mm);
#else /* !CONFIG_FUTEX_PRIVATE_HASH */
static inline int futex_hash_allocate_default(void) { return 0; }
-static inline void futex_hash_free(struct mm_struct *mm) { }
-static inline void futex_mm_init(struct mm_struct *mm) { }
+static inline int futex_hash_free(struct mm_struct *mm) { return 0; }
+static inline int futex_mm_init(struct mm_struct *mm) { return 0; }
#endif /* CONFIG_FUTEX_PRIVATE_HASH */
#else /* !CONFIG_FUTEX */
@@ -118,8 +112,8 @@ static inline int futex_hash_allocate_default(void)
{
return 0;
}
-static inline void futex_hash_free(struct mm_struct *mm) { }
-static inline void futex_mm_init(struct mm_struct *mm) { }
+static inline int futex_hash_free(struct mm_struct *mm) { return 0; }
+static inline int futex_mm_init(struct mm_struct *mm) { return 0; }
#endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d6b91e8a66d6d..0f0662157066a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1070,6 +1070,11 @@ struct mm_struct {
struct mutex futex_hash_lock;
struct futex_private_hash __rcu *futex_phash;
struct futex_private_hash *futex_phash_new;
+ /* futex-ref */
+ unsigned long futex_batches;
+ struct rcu_head futex_rcu;
+ atomic_long_t futex_atomic;
+ unsigned int __percpu *futex_ref;
#endif
unsigned long hiwater_rss; /* High-watermark of RSS usage */
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index b13474825130f..2201da0afecc5 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -140,7 +140,7 @@ static inline bool mmget_not_zero(struct mm_struct *mm)
/* mmput gets rid of the mappings and all user-space */
extern void mmput(struct mm_struct *);
-#ifdef CONFIG_MMU
+#if defined(CONFIG_MMU) || defined(CONFIG_FUTEX_PRIVATE_HASH)
/* same as above but performs the slow path from the async context. Can
* be called from the atomic context as well
*/
diff --git a/init/Kconfig b/init/Kconfig
index 666783eb50abd..af4c2f0854554 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1716,13 +1716,9 @@ config FUTEX_PI
depends on FUTEX && RT_MUTEXES
default y
-#
-# marked broken for performance reasons; gives us one more cycle to sort things out.
-#
config FUTEX_PRIVATE_HASH
bool
depends on FUTEX && !BASE_SMALL && MMU
- depends on BROKEN
default y
config FUTEX_MPOL
diff --git a/kernel/fork.c b/kernel/fork.c
index 1ee8eb11f38ba..0b885dcbde9af 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1046,7 +1046,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
RCU_INIT_POINTER(mm->exe_file, NULL);
mmu_notifier_subscriptions_init(mm);
init_tlb_flush_pending(mm);
- futex_mm_init(mm);
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !defined(CONFIG_SPLIT_PMD_PTLOCKS)
mm->pmd_huge_pte = NULL;
#endif
@@ -1061,6 +1060,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
mm->def_flags = 0;
}
+ if (futex_mm_init(mm))
+ goto fail_mm_init;
+
if (mm_alloc_pgd(mm))
goto fail_nopgd;
@@ -1090,6 +1092,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
fail_noid:
mm_free_pgd(mm);
fail_nopgd:
+ futex_hash_free(mm);
+fail_mm_init:
free_mm(mm);
return NULL;
}
@@ -1145,7 +1149,7 @@ void mmput(struct mm_struct *mm)
}
EXPORT_SYMBOL_GPL(mmput);
-#ifdef CONFIG_MMU
+#if defined(CONFIG_MMU) || defined(CONFIG_FUTEX_PRIVATE_HASH)
static void mmput_async_fn(struct work_struct *work)
{
struct mm_struct *mm = container_of(work, struct mm_struct,
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 90d53fb0ee9e1..1dcb4c8a2585d 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -42,7 +42,6 @@
#include <linux/fault-inject.h>
#include <linux/slab.h>
#include <linux/prctl.h>
-#include <linux/rcuref.h>
#include <linux/mempolicy.h>
#include <linux/mmap_lock.h>
@@ -65,7 +64,7 @@ static struct {
#define futex_queues (__futex_data.queues)
struct futex_private_hash {
- rcuref_t users;
+ int state;
unsigned int hash_mask;
struct rcu_head rcu;
void *mm;
@@ -129,6 +128,12 @@ static struct futex_hash_bucket *
__futex_hash(union futex_key *key, struct futex_private_hash *fph);
#ifdef CONFIG_FUTEX_PRIVATE_HASH
+static bool futex_ref_get(struct futex_private_hash *fph);
+static bool futex_ref_put(struct futex_private_hash *fph);
+static bool futex_ref_is_dead(struct futex_private_hash *fph);
+
+enum { FR_PERCPU = 0, FR_ATOMIC };
+
static inline bool futex_key_is_private(union futex_key *key)
{
/*
@@ -142,15 +147,14 @@ bool futex_private_hash_get(struct futex_private_hash *fph)
{
if (fph->immutable)
return true;
- return rcuref_get(&fph->users);
+ return futex_ref_get(fph);
}
void futex_private_hash_put(struct futex_private_hash *fph)
{
- /* Ignore return value, last put is verified via rcuref_is_dead() */
if (fph->immutable)
return;
- if (rcuref_put(&fph->users))
+ if (futex_ref_put(fph))
wake_up_var(fph->mm);
}
@@ -243,14 +247,18 @@ static bool __futex_pivot_hash(struct mm_struct *mm,
fph = rcu_dereference_protected(mm->futex_phash,
lockdep_is_held(&mm->futex_hash_lock));
if (fph) {
- if (!rcuref_is_dead(&fph->users)) {
+ if (!futex_ref_is_dead(fph)) {
mm->futex_phash_new = new;
return false;
}
futex_rehash_private(fph, new);
}
- rcu_assign_pointer(mm->futex_phash, new);
+ new->state = FR_PERCPU;
+ scoped_guard(rcu) {
+ mm->futex_batches = get_state_synchronize_rcu();
+ rcu_assign_pointer(mm->futex_phash, new);
+ }
kvfree_rcu(fph, rcu);
return true;
}
@@ -289,9 +297,7 @@ struct futex_private_hash *futex_private_hash(void)
if (!fph)
return NULL;
- if (fph->immutable)
- return fph;
- if (rcuref_get(&fph->users))
+ if (futex_private_hash_get(fph))
return fph;
}
futex_pivot_hash(mm);
@@ -1527,16 +1533,219 @@ static void futex_hash_bucket_init(struct futex_hash_bucket *fhb,
#define FH_IMMUTABLE 0x02
#ifdef CONFIG_FUTEX_PRIVATE_HASH
+
+/*
+ * futex-ref
+ *
+ * Heavily inspired by percpu-rwsem/percpu-refcount; not reusing any of that
+ * code because it just doesn't fit right.
+ *
+ * Dual counter, per-cpu / atomic approach like percpu-refcount, except it
+ * re-initializes the state automatically, such that the fph swizzle is also a
+ * transition back to per-cpu.
+ */
+
+static void futex_ref_rcu(struct rcu_head *head);
+
+static void __futex_ref_atomic_begin(struct futex_private_hash *fph)
+{
+ struct mm_struct *mm = fph->mm;
+
+ /*
+ * The counter we're about to switch to must have fully switched;
+ * otherwise it would be impossible for it to have reported success
+ * from futex_ref_is_dead().
+ */
+ WARN_ON_ONCE(atomic_long_read(&mm->futex_atomic) != 0);
+
+ /*
+ * Set the atomic to the bias value such that futex_ref_{get,put}()
+ * will never observe 0. Will be fixed up in __futex_ref_atomic_end()
+ * when folding in the percpu count.
+ */
+ atomic_long_set(&mm->futex_atomic, LONG_MAX);
+ smp_store_release(&fph->state, FR_ATOMIC);
+
+ call_rcu_hurry(&mm->futex_rcu, futex_ref_rcu);
+}
+
+static void __futex_ref_atomic_end(struct futex_private_hash *fph)
+{
+ struct mm_struct *mm = fph->mm;
+ unsigned int count = 0;
+ long ret;
+ int cpu;
+
+ /*
+ * Per __futex_ref_atomic_begin() the state of the fph must be ATOMIC
+ * and per this RCU callback, everybody must now observe this state and
+ * use the atomic variable.
+ */
+ WARN_ON_ONCE(fph->state != FR_ATOMIC);
+
+ /*
+ * Therefore the per-cpu counter is now stable, sum and reset.
+ */
+ for_each_possible_cpu(cpu) {
+ unsigned int *ptr = per_cpu_ptr(mm->futex_ref, cpu);
+ count += *ptr;
+ *ptr = 0;
+ }
+
+ /*
+ * Re-init for the next cycle.
+ */
+ this_cpu_inc(*mm->futex_ref); /* 0 -> 1 */
+
+ /*
+ * Add actual count, subtract bias and initial refcount.
+ *
+ * The moment this atomic operation happens, futex_ref_is_dead() can
+ * become true.
+ */
+ ret = atomic_long_add_return(count - LONG_MAX - 1, &mm->futex_atomic);
+ if (!ret)
+ wake_up_var(mm);
+
+ WARN_ON_ONCE(ret < 0);
+ mmput_async(mm);
+}
+
+static void futex_ref_rcu(struct rcu_head *head)
+{
+ struct mm_struct *mm = container_of(head, struct mm_struct, futex_rcu);
+ struct futex_private_hash *fph = rcu_dereference_raw(mm->futex_phash);
+
+ if (fph->state == FR_PERCPU) {
+ /*
+ * Per this extra grace-period, everybody must now observe
+ * fph as the current fph and no previously observed fph's
+ * are in-flight.
+ *
+ * Notably, nobody will now rely on the atomic
+ * futex_ref_is_dead() state anymore so we can begin the
+ * migration of the per-cpu counter into the atomic.
+ */
+ __futex_ref_atomic_begin(fph);
+ return;
+ }
+
+ __futex_ref_atomic_end(fph);
+}
+
+/*
+ * Drop the initial refcount and transition to atomics.
+ */
+static void futex_ref_drop(struct futex_private_hash *fph)
+{
+ struct mm_struct *mm = fph->mm;
+
+ /*
+ * Can only transition the current fph;
+ */
+ WARN_ON_ONCE(rcu_dereference_raw(mm->futex_phash) != fph);
+ /*
+ * We enqueue at least one RCU callback. Ensure mm stays if the task
+ * exits before the transition is completed.
+ */
+ mmget(mm);
+
+ /*
+ * In order to avoid the following scenario:
+ *
+ * futex_hash() __futex_pivot_hash()
+ * guard(rcu); guard(mm->futex_hash_lock);
+ * fph = mm->futex_phash;
+ * rcu_assign_pointer(&mm->futex_phash, new);
+ * futex_hash_allocate()
+ * futex_ref_drop()
+ * fph->state = FR_ATOMIC;
+ * atomic_set(, BIAS);
+ *
+ * futex_private_hash_get(fph); // OOPS
+ *
+ * Where an old fph (which is FR_ATOMIC) and should fail on
+ * inc_not_zero, will succeed because a new transition is started and
+ * the atomic is bias'ed away from 0.
+ *
+ * There must be at least one full grace-period between publishing a
+ * new fph and trying to replace it.
+ */
+ if (poll_state_synchronize_rcu(mm->futex_batches)) {
+ /*
+ * There was a grace-period, we can begin now.
+ */
+ __futex_ref_atomic_begin(fph);
+ return;
+ }
+
+ call_rcu_hurry(&mm->futex_rcu, futex_ref_rcu);
+}
+
+static bool futex_ref_get(struct futex_private_hash *fph)
+{
+ struct mm_struct *mm = fph->mm;
+
+ guard(rcu)();
+
+ if (smp_load_acquire(&fph->state) == FR_PERCPU) {
+ this_cpu_inc(*mm->futex_ref);
+ return true;
+ }
+
+ return atomic_long_inc_not_zero(&mm->futex_atomic);
+}
+
+static bool futex_ref_put(struct futex_private_hash *fph)
+{
+ struct mm_struct *mm = fph->mm;
+
+ guard(rcu)();
+
+ if (smp_load_acquire(&fph->state) == FR_PERCPU) {
+ this_cpu_dec(*mm->futex_ref);
+ return false;
+ }
+
+ return atomic_long_dec_and_test(&mm->futex_atomic);
+}
+
+static bool futex_ref_is_dead(struct futex_private_hash *fph)
+{
+ struct mm_struct *mm = fph->mm;
+
+ guard(rcu)();
+
+ if (smp_load_acquire(&fph->state) == FR_PERCPU)
+ return false;
+
+ return atomic_long_read(&mm->futex_atomic) == 0;
+}
+
+int futex_mm_init(struct mm_struct *mm)
+{
+ mutex_init(&mm->futex_hash_lock);
+ RCU_INIT_POINTER(mm->futex_phash, NULL);
+ mm->futex_phash_new = NULL;
+ /* futex-ref */
+ atomic_long_set(&mm->futex_atomic, 0);
+ mm->futex_batches = get_state_synchronize_rcu();
+ mm->futex_ref = alloc_percpu(unsigned int);
+ if (!mm->futex_ref)
+ return -ENOMEM;
+ this_cpu_inc(*mm->futex_ref); /* 0 -> 1 */
+ return 0;
+}
+
void futex_hash_free(struct mm_struct *mm)
{
struct futex_private_hash *fph;
+ free_percpu(mm->futex_ref);
kvfree(mm->futex_phash_new);
fph = rcu_dereference_raw(mm->futex_phash);
- if (fph) {
- WARN_ON_ONCE(rcuref_read(&fph->users) > 1);
+ if (fph)
kvfree(fph);
- }
}
static bool futex_pivot_pending(struct mm_struct *mm)
@@ -1549,7 +1758,7 @@ static bool futex_pivot_pending(struct mm_struct *mm)
return true;
fph = rcu_dereference(mm->futex_phash);
- return rcuref_is_dead(&fph->users);
+ return futex_ref_is_dead(fph);
}
static bool futex_hash_less(struct futex_private_hash *a,
@@ -1598,11 +1807,11 @@ static int futex_hash_allocate(unsigned int hash_slots, unsigned int flags)
}
}
- fph = kvzalloc(struct_size(fph, queues, hash_slots), GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
+ fph = kvzalloc(struct_size(fph, queues, hash_slots),
+ GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
if (!fph)
return -ENOMEM;
- rcuref_init(&fph->users, 1);
fph->hash_mask = hash_slots ? hash_slots - 1 : 0;
fph->custom = custom;
fph->immutable = !!(flags & FH_IMMUTABLE);
@@ -1645,7 +1854,7 @@ static int futex_hash_allocate(unsigned int hash_slots, unsigned int flags)
* allocated a replacement hash, drop the initial
* reference on the existing hash.
*/
- futex_private_hash_put(cur);
+ futex_ref_drop(cur);
}
if (new) {
--
2.50.0
On Thu, 2025-07-10 at 13:00 +0200, Sebastian Andrzej Siewior wrote:
> From: Peter Zijlstra <peterz@infradead.org>
>
> The use of rcuref_t for reference counting introduces a performance bottleneck
> when accessed concurrently by multiple threads during futex operations.
>
> Replace rcuref_t with special crafted per-CPU reference counters. The
> lifetime logic remains the same.
[...]
> [bigeasy: commit description + mm get/ put_async]

kmemleak complains about a new memleak with this commit:

[ 680.179004][ T101] kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)

$ cat /sys/kernel/debug/kmemleak
unreferenced object (percpu) 0xc22ec0eface8 (size 4):
  comm "swapper/0", pid 1, jiffies 4294893115
  hex dump (first 4 bytes on cpu 7):
    01 00 00 00                                      ....
  backtrace (crc b8bc6765):
    kmemleak_alloc_percpu+0x48/0xb8
    pcpu_alloc_noprof+0x6ac/0xb68
    futex_mm_init+0x60/0xe0
    mm_init+0x1e8/0x3c0
    mm_alloc+0x5c/0x78
    init_args+0x74/0x4b0
    debug_vm_pgtable+0x60/0x2d8
    do_one_initcall+0x128/0x3e0
    do_initcall_level+0xb4/0xe8
    do_initcalls+0x60/0xb0
    do_basic_setup+0x28/0x40
    kernel_init_freeable+0x158/0x1f8
    kernel_init+0x2c/0x1e0
    ret_from_fork+0x10/0x20

And futex_mm_init+0x60/0xe0 resolves to

    mm->futex_ref = alloc_percpu(unsigned int);

in futex_mm_init().

Reverting this commit (and patches 3 and 4 in this series due to context),
makes kmemleak happy again.

Cheers,
Andre'
On Wed, Jul 30 2025 at 13:20, André Draszik wrote:
> kmemleak complains about a new memleak with this commit:
>
> [ 680.179004][ T101] kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
>
> $ cat /sys/kernel/debug/kmemleak
> unreferenced object (percpu) 0xc22ec0eface8 (size 4):
>   comm "swapper/0", pid 1, jiffies 4294893115
>   hex dump (first 4 bytes on cpu 7):
>     01 00 00 00                                      ....
>   backtrace (crc b8bc6765):
>     kmemleak_alloc_percpu+0x48/0xb8
>     pcpu_alloc_noprof+0x6ac/0xb68
>     futex_mm_init+0x60/0xe0
>     mm_init+0x1e8/0x3c0
>     mm_alloc+0x5c/0x78
>     init_args+0x74/0x4b0
>     debug_vm_pgtable+0x60/0x2d8
>
> Reverting this commit (and patches 3 and 4 in this series due to context),
> makes kmemleak happy again.

Unsurprisingly ...

debug_vm_pgtable() allocates it via mm_alloc() -> mm_init() and then
after the selftest it invokes mmdrop(), which does not free it, as it is
only freed in __mmput().

The patch below should fix it.

Thanks,

        tglx
---
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -686,6 +686,7 @@ void __mmdrop(struct mm_struct *mm)
 	mm_pasid_drop(mm);
 	mm_destroy_cid(mm);
 	percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
+	futex_hash_free(mm);
 	free_mm(mm);
 }
@@ -1133,7 +1134,6 @@ static inline void __mmput(struct mm_str
 	if (mm->binfmt)
 		module_put(mm->binfmt->module);
 	lru_gen_del_mm(mm);
-	futex_hash_free(mm);
 	mmdrop(mm);
 }
On Wed, 2025-07-30 at 21:44 +0200, Thomas Gleixner wrote:
> On Wed, Jul 30 2025 at 13:20, André Draszik wrote:
> > kmemleak complains about a new memleak with this commit:
[...]
> The patch below should fix it.

It does. Thanks Thomas!

A.
The following commit has been merged into the locking/urgent branch of tip:
Commit-ID: e703b7e247503b8bf87b62c02a4392749b09eca8
Gitweb: https://git.kernel.org/tip/e703b7e247503b8bf87b62c02a4392749b09eca8
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Wed, 30 Jul 2025 21:44:55 +02:00
Committer: Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Sat, 02 Aug 2025 15:11:52 +02:00
futex: Move futex cleanup to __mmdrop()
Futex hash allocations are done in mm_init() and the cleanup happens in
__mmput(). That works most of the time, but there are mm instances which
are instantiated via mm_alloc() and freed via mmdrop(), which causes the
futex hash to be leaked.
Move the cleanup to __mmdrop().
Fixes: 56180dd20c19 ("futex: Use RCU-based per-CPU reference counting instead of rcuref_t")
Reported-by: André Draszik <andre.draszik@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: André Draszik <andre.draszik@linaro.org>
Link: https://lore.kernel.org/all/87ldo5ihu0.ffs@tglx
Closes: https://lore.kernel.org/all/0c8cc83bb73abf080faf584f319008b67d0931db.camel@linaro.org
---
kernel/fork.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index f82b77e..1b0535e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -686,6 +686,7 @@ void __mmdrop(struct mm_struct *mm)
mm_pasid_drop(mm);
mm_destroy_cid(mm);
percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
+ futex_hash_free(mm);
free_mm(mm);
}
@@ -1133,7 +1134,6 @@ static inline void __mmput(struct mm_struct *mm)
if (mm->binfmt)
module_put(mm->binfmt->module);
lru_gen_del_mm(mm);
- futex_hash_free(mm);
mmdrop(mm);
}
On Sat, Aug 02, 2025 at 01:22:10PM +0000, tip-bot2 for Thomas Gleixner wrote: > The following commit has been merged into the locking/urgent branch of tip: > > Commit-ID: e703b7e247503b8bf87b62c02a4392749b09eca8 > Gitweb: https://git.kernel.org/tip/e703b7e247503b8bf87b62c02a4392749b09eca8 > Author: Thomas Gleixner <tglx@linutronix.de> > AuthorDate: Wed, 30 Jul 2025 21:44:55 +02:00 > Committer: Thomas Gleixner <tglx@linutronix.de> > CommitterDate: Sat, 02 Aug 2025 15:11:52 +02:00 > > futex: Move futex cleanup to __mmdrop() > > Futex hash allocations are done in mm_init() and the cleanup happens in > __mmput(). That works most of the time, but there are mm instances which > are instantiated via mm_alloc() and freed via mmdrop(), which causes the > futex hash to be leaked. > > Move the cleanup to __mmdrop(). > > Fixes: 56180dd20c19 ("futex: Use RCU-based per-CPU reference counting instead of rcuref_t") > Reported-by: André Draszik <andre.draszik@linaro.org> > Signed-off-by: Thomas Gleixner <tglx@linutronix.de> > Tested-by: André Draszik <andre.draszik@linaro.org> > Link: https://lore.kernel.org/all/87ldo5ihu0.ffs@tglx > Closes: https://lore.kernel.org/all/0c8cc83bb73abf080faf584f319008b67d0931db.camel@linaro.org Thomas, it seems this change caused the bug being reported here: https://lore.kernel.org/all/20250818131902.5039-1-hdanton@sina.com/
The following commit has been merged into the locking/futex branch of tip:
Commit-ID: 56180dd20c19e5b0fa34822997a9ac66b517e7b3
Gitweb: https://git.kernel.org/tip/56180dd20c19e5b0fa34822997a9ac66b517e7b3
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 10 Jul 2025 13:00:07 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 11 Jul 2025 16:02:00 +02:00
futex: Use RCU-based per-CPU reference counting instead of rcuref_t
The use of rcuref_t for reference counting introduces a performance bottleneck
when accessed concurrently by multiple threads during futex operations.
Replace rcuref_t with specially crafted per-CPU reference counters. The
lifetime logic remains the same.
The newly allocated private hash starts in FR_PERCPU state. In this state, each
futex operation that requires the private hash uses a per-CPU counter (an
unsigned int) for incrementing or decrementing the reference count.
When the private hash is about to be replaced, the per-CPU counters are
migrated to an atomic_t counter, mm_struct::futex_atomic.
The migration process:
- Waiting for one RCU grace period to ensure all users observe the
current private hash. This can be skipped if a grace period elapsed
since the private hash was assigned.
- futex_private_hash::state is set to FR_ATOMIC, forcing all users to
use mm_struct::futex_atomic for reference counting.
- After an RCU grace period, all users are guaranteed to be using the
atomic counter. The per-CPU counters can now be summed up and added to
the atomic_t counter. If the resulting count is zero, the hash can be
safely replaced. Otherwise, active users still hold a valid reference.
- Once the atomic reference count drops to zero, the next futex
operation will switch to the new private hash.
call_rcu_hurry() is used to speed up the transition, which otherwise might be
delayed with RCU_LAZY. There is nothing wrong with using call_rcu(). The
side effects would be that on auto scaling the new hash is used later
and the SET_SLOTS prctl() will block longer.
[bigeasy: commit description + mm get/ put_async]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250710110011.384614-3-bigeasy@linutronix.de
---
include/linux/futex.h | 16 +---
include/linux/mm_types.h | 5 +-
include/linux/sched/mm.h | 2 +-
init/Kconfig | 4 +-
kernel/fork.c | 8 +-
kernel/futex/core.c | 243 +++++++++++++++++++++++++++++++++++---
6 files changed, 243 insertions(+), 35 deletions(-)
diff --git a/include/linux/futex.h b/include/linux/futex.h
index b371936..9e9750f 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -85,18 +85,12 @@ int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4)
#ifdef CONFIG_FUTEX_PRIVATE_HASH
int futex_hash_allocate_default(void);
void futex_hash_free(struct mm_struct *mm);
-
-static inline void futex_mm_init(struct mm_struct *mm)
-{
- RCU_INIT_POINTER(mm->futex_phash, NULL);
- mm->futex_phash_new = NULL;
- mutex_init(&mm->futex_hash_lock);
-}
+int futex_mm_init(struct mm_struct *mm);
#else /* !CONFIG_FUTEX_PRIVATE_HASH */
static inline int futex_hash_allocate_default(void) { return 0; }
-static inline void futex_hash_free(struct mm_struct *mm) { }
-static inline void futex_mm_init(struct mm_struct *mm) { }
+static inline int futex_hash_free(struct mm_struct *mm) { return 0; }
+static inline int futex_mm_init(struct mm_struct *mm) { return 0; }
#endif /* CONFIG_FUTEX_PRIVATE_HASH */
#else /* !CONFIG_FUTEX */
@@ -118,8 +112,8 @@ static inline int futex_hash_allocate_default(void)
{
return 0;
}
-static inline void futex_hash_free(struct mm_struct *mm) { }
-static inline void futex_mm_init(struct mm_struct *mm) { }
+static inline int futex_hash_free(struct mm_struct *mm) { return 0; }
+static inline int futex_mm_init(struct mm_struct *mm) { return 0; }
#endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d6b91e8..0f06621 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1070,6 +1070,11 @@ struct mm_struct {
struct mutex futex_hash_lock;
struct futex_private_hash __rcu *futex_phash;
struct futex_private_hash *futex_phash_new;
+ /* futex-ref */
+ unsigned long futex_batches;
+ struct rcu_head futex_rcu;
+ atomic_long_t futex_atomic;
+ unsigned int __percpu *futex_ref;
#endif
unsigned long hiwater_rss; /* High-watermark of RSS usage */
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index b134748..2201da0 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -140,7 +140,7 @@ static inline bool mmget_not_zero(struct mm_struct *mm)
/* mmput gets rid of the mappings and all user-space */
extern void mmput(struct mm_struct *);
-#ifdef CONFIG_MMU
+#if defined(CONFIG_MMU) || defined(CONFIG_FUTEX_PRIVATE_HASH)
/* same as above but performs the slow path from the async context. Can
* be called from the atomic context as well
*/
diff --git a/init/Kconfig b/init/Kconfig
index 666783e..af4c2f0 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1716,13 +1716,9 @@ config FUTEX_PI
depends on FUTEX && RT_MUTEXES
default y
-#
-# marked broken for performance reasons; gives us one more cycle to sort things out.
-#
config FUTEX_PRIVATE_HASH
bool
depends on FUTEX && !BASE_SMALL && MMU
- depends on BROKEN
default y
config FUTEX_MPOL
diff --git a/kernel/fork.c b/kernel/fork.c
index 1ee8eb1..0b885dc 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1046,7 +1046,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
RCU_INIT_POINTER(mm->exe_file, NULL);
mmu_notifier_subscriptions_init(mm);
init_tlb_flush_pending(mm);
- futex_mm_init(mm);
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !defined(CONFIG_SPLIT_PMD_PTLOCKS)
mm->pmd_huge_pte = NULL;
#endif
@@ -1061,6 +1060,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
mm->def_flags = 0;
}
+ if (futex_mm_init(mm))
+ goto fail_mm_init;
+
if (mm_alloc_pgd(mm))
goto fail_nopgd;
@@ -1090,6 +1092,8 @@ fail_nocontext:
fail_noid:
mm_free_pgd(mm);
fail_nopgd:
+ futex_hash_free(mm);
+fail_mm_init:
free_mm(mm);
return NULL;
}
@@ -1145,7 +1149,7 @@ void mmput(struct mm_struct *mm)
}
EXPORT_SYMBOL_GPL(mmput);
-#ifdef CONFIG_MMU
+#if defined(CONFIG_MMU) || defined(CONFIG_FUTEX_PRIVATE_HASH)
static void mmput_async_fn(struct work_struct *work)
{
struct mm_struct *mm = container_of(work, struct mm_struct,
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 90d53fb..1dcb4c8 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -42,7 +42,6 @@
#include <linux/fault-inject.h>
#include <linux/slab.h>
#include <linux/prctl.h>
-#include <linux/rcuref.h>
#include <linux/mempolicy.h>
#include <linux/mmap_lock.h>
@@ -65,7 +64,7 @@ static struct {
#define futex_queues (__futex_data.queues)
struct futex_private_hash {
- rcuref_t users;
+ int state;
unsigned int hash_mask;
struct rcu_head rcu;
void *mm;
@@ -129,6 +128,12 @@ static struct futex_hash_bucket *
__futex_hash(union futex_key *key, struct futex_private_hash *fph);
#ifdef CONFIG_FUTEX_PRIVATE_HASH
+static bool futex_ref_get(struct futex_private_hash *fph);
+static bool futex_ref_put(struct futex_private_hash *fph);
+static bool futex_ref_is_dead(struct futex_private_hash *fph);
+
+enum { FR_PERCPU = 0, FR_ATOMIC };
+
static inline bool futex_key_is_private(union futex_key *key)
{
/*
@@ -142,15 +147,14 @@ bool futex_private_hash_get(struct futex_private_hash *fph)
{
if (fph->immutable)
return true;
- return rcuref_get(&fph->users);
+ return futex_ref_get(fph);
}
void futex_private_hash_put(struct futex_private_hash *fph)
{
- /* Ignore return value, last put is verified via rcuref_is_dead() */
if (fph->immutable)
return;
- if (rcuref_put(&fph->users))
+ if (futex_ref_put(fph))
wake_up_var(fph->mm);
}
@@ -243,14 +247,18 @@ static bool __futex_pivot_hash(struct mm_struct *mm,
fph = rcu_dereference_protected(mm->futex_phash,
lockdep_is_held(&mm->futex_hash_lock));
if (fph) {
- if (!rcuref_is_dead(&fph->users)) {
+ if (!futex_ref_is_dead(fph)) {
mm->futex_phash_new = new;
return false;
}
futex_rehash_private(fph, new);
}
- rcu_assign_pointer(mm->futex_phash, new);
+ new->state = FR_PERCPU;
+ scoped_guard(rcu) {
+ mm->futex_batches = get_state_synchronize_rcu();
+ rcu_assign_pointer(mm->futex_phash, new);
+ }
kvfree_rcu(fph, rcu);
return true;
}
@@ -289,9 +297,7 @@ again:
if (!fph)
return NULL;
- if (fph->immutable)
- return fph;
- if (rcuref_get(&fph->users))
+ if (futex_private_hash_get(fph))
return fph;
}
futex_pivot_hash(mm);
@@ -1527,16 +1533,219 @@ static void futex_hash_bucket_init(struct futex_hash_bucket *fhb,
#define FH_IMMUTABLE 0x02
#ifdef CONFIG_FUTEX_PRIVATE_HASH
+
+/*
+ * futex-ref
+ *
+ * Heavily inspired by percpu-rwsem/percpu-refcount; not reusing any of that
+ * code because it just doesn't fit right.
+ *
+ * Dual counter, per-cpu / atomic approach like percpu-refcount, except it
+ * re-initializes the state automatically, such that the fph swizzle is also a
+ * transition back to per-cpu.
+ */
+
+static void futex_ref_rcu(struct rcu_head *head);
+
+static void __futex_ref_atomic_begin(struct futex_private_hash *fph)
+{
+ struct mm_struct *mm = fph->mm;
+
+ /*
+ * The counter we're about to switch to must have fully switched;
+ * otherwise it would be impossible for it to have reported success
+ * from futex_ref_is_dead().
+ */
+ WARN_ON_ONCE(atomic_long_read(&mm->futex_atomic) != 0);
+
+ /*
+ * Set the atomic to the bias value such that futex_ref_{get,put}()
+ * will never observe 0. Will be fixed up in __futex_ref_atomic_end()
+ * when folding in the percpu count.
+ */
+ atomic_long_set(&mm->futex_atomic, LONG_MAX);
+ smp_store_release(&fph->state, FR_ATOMIC);
+
+ call_rcu_hurry(&mm->futex_rcu, futex_ref_rcu);
+}
+
+static void __futex_ref_atomic_end(struct futex_private_hash *fph)
+{
+ struct mm_struct *mm = fph->mm;
+ unsigned int count = 0;
+ long ret;
+ int cpu;
+
+ /*
+ * Per __futex_ref_atomic_begin() the state of the fph must be ATOMIC
+ * and per this RCU callback, everybody must now observe this state and
+ * use the atomic variable.
+ */
+ WARN_ON_ONCE(fph->state != FR_ATOMIC);
+
+ /*
+ * Therefore the per-cpu counter is now stable, sum and reset.
+ */
+ for_each_possible_cpu(cpu) {
+ unsigned int *ptr = per_cpu_ptr(mm->futex_ref, cpu);
+ count += *ptr;
+ *ptr = 0;
+ }
+
+ /*
+ * Re-init for the next cycle.
+ */
+ this_cpu_inc(*mm->futex_ref); /* 0 -> 1 */
+
+ /*
+ * Add actual count, subtract bias and initial refcount.
+ *
+ * The moment this atomic operation happens, futex_ref_is_dead() can
+ * become true.
+ */
+ ret = atomic_long_add_return(count - LONG_MAX - 1, &mm->futex_atomic);
+ if (!ret)
+ wake_up_var(mm);
+
+ WARN_ON_ONCE(ret < 0);
+ mmput_async(mm);
+}
+
+static void futex_ref_rcu(struct rcu_head *head)
+{
+ struct mm_struct *mm = container_of(head, struct mm_struct, futex_rcu);
+ struct futex_private_hash *fph = rcu_dereference_raw(mm->futex_phash);
+
+ if (fph->state == FR_PERCPU) {
+ /*
+ * Per this extra grace-period, everybody must now observe
+ * fph as the current fph and no previously observed fph's
+ * are in-flight.
+ *
+ * Notably, nobody will now rely on the atomic
+ * futex_ref_is_dead() state anymore so we can begin the
+ * migration of the per-cpu counter into the atomic.
+ */
+ __futex_ref_atomic_begin(fph);
+ return;
+ }
+
+ __futex_ref_atomic_end(fph);
+}
+
+/*
+ * Drop the initial refcount and transition to atomics.
+ */
+static void futex_ref_drop(struct futex_private_hash *fph)
+{
+ struct mm_struct *mm = fph->mm;
+
+ /*
+ * Can only transition the current fph;
+ */
+ WARN_ON_ONCE(rcu_dereference_raw(mm->futex_phash) != fph);
+ /*
+ * We enqueue at least one RCU callback. Ensure mm stays if the task
+ * exits before the transition is completed.
+ */
+ mmget(mm);
+
+ /*
+ * In order to avoid the following scenario:
+ *
+ * futex_hash() __futex_pivot_hash()
+ * guard(rcu); guard(mm->futex_hash_lock);
+ * fph = mm->futex_phash;
+ * rcu_assign_pointer(&mm->futex_phash, new);
+ * futex_hash_allocate()
+ * futex_ref_drop()
+ * fph->state = FR_ATOMIC;
+ * atomic_set(, BIAS);
+ *
+ * futex_private_hash_get(fph); // OOPS
+ *
+ * Where an old fph (which is FR_ATOMIC) and should fail on
+ * inc_not_zero, will succeed because a new transition is started and
+ * the atomic is bias'ed away from 0.
+ *
+ * There must be at least one full grace-period between publishing a
+ * new fph and trying to replace it.
+ */
+ if (poll_state_synchronize_rcu(mm->futex_batches)) {
+ /*
+ * There was a grace-period, we can begin now.
+ */
+ __futex_ref_atomic_begin(fph);
+ return;
+ }
+
+ call_rcu_hurry(&mm->futex_rcu, futex_ref_rcu);
+}
+
+static bool futex_ref_get(struct futex_private_hash *fph)
+{
+ struct mm_struct *mm = fph->mm;
+
+ guard(rcu)();
+
+ if (smp_load_acquire(&fph->state) == FR_PERCPU) {
+ this_cpu_inc(*mm->futex_ref);
+ return true;
+ }
+
+ return atomic_long_inc_not_zero(&mm->futex_atomic);
+}
+
+static bool futex_ref_put(struct futex_private_hash *fph)
+{
+ struct mm_struct *mm = fph->mm;
+
+ guard(rcu)();
+
+ if (smp_load_acquire(&fph->state) == FR_PERCPU) {
+ this_cpu_dec(*mm->futex_ref);
+ return false;
+ }
+
+ return atomic_long_dec_and_test(&mm->futex_atomic);
+}
+
+static bool futex_ref_is_dead(struct futex_private_hash *fph)
+{
+ struct mm_struct *mm = fph->mm;
+
+ guard(rcu)();
+
+ if (smp_load_acquire(&fph->state) == FR_PERCPU)
+ return false;
+
+ return atomic_long_read(&mm->futex_atomic) == 0;
+}
+
+int futex_mm_init(struct mm_struct *mm)
+{
+ mutex_init(&mm->futex_hash_lock);
+ RCU_INIT_POINTER(mm->futex_phash, NULL);
+ mm->futex_phash_new = NULL;
+ /* futex-ref */
+ atomic_long_set(&mm->futex_atomic, 0);
+ mm->futex_batches = get_state_synchronize_rcu();
+ mm->futex_ref = alloc_percpu(unsigned int);
+ if (!mm->futex_ref)
+ return -ENOMEM;
+ this_cpu_inc(*mm->futex_ref); /* 0 -> 1 */
+ return 0;
+}
+
void futex_hash_free(struct mm_struct *mm)
{
struct futex_private_hash *fph;
+ free_percpu(mm->futex_ref);
kvfree(mm->futex_phash_new);
fph = rcu_dereference_raw(mm->futex_phash);
- if (fph) {
- WARN_ON_ONCE(rcuref_read(&fph->users) > 1);
+ if (fph)
kvfree(fph);
- }
}
static bool futex_pivot_pending(struct mm_struct *mm)
@@ -1549,7 +1758,7 @@ static bool futex_pivot_pending(struct mm_struct *mm)
return true;
fph = rcu_dereference(mm->futex_phash);
- return rcuref_is_dead(&fph->users);
+ return futex_ref_is_dead(fph);
}
static bool futex_hash_less(struct futex_private_hash *a,
@@ -1598,11 +1807,11 @@ static int futex_hash_allocate(unsigned int hash_slots, unsigned int flags)
}
}
- fph = kvzalloc(struct_size(fph, queues, hash_slots), GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
+ fph = kvzalloc(struct_size(fph, queues, hash_slots),
+ GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
if (!fph)
return -ENOMEM;
- rcuref_init(&fph->users, 1);
fph->hash_mask = hash_slots ? hash_slots - 1 : 0;
fph->custom = custom;
fph->immutable = !!(flags & FH_IMMUTABLE);
@@ -1645,7 +1854,7 @@ again:
* allocated a replacement hash, drop the initial
* reference on the existing hash.
*/
- futex_private_hash_put(cur);
+ futex_ref_drop(cur);
}
if (new) {
On Fri, Jul 11, 2025, tip-bot2 for Peter Zijlstra wrote:
> The following commit has been merged into the locking/futex branch of tip:
>
> Commit-ID: 56180dd20c19e5b0fa34822997a9ac66b517e7b3
> Gitweb: https://git.kernel.org/tip/56180dd20c19e5b0fa34822997a9ac66b517e7b3
> Author: Peter Zijlstra <peterz@infradead.org>
> AuthorDate: Thu, 10 Jul 2025 13:00:07 +02:00
> Committer: Peter Zijlstra <peterz@infradead.org>
> CommitterDate: Fri, 11 Jul 2025 16:02:00 +02:00
>
> futex: Use RCU-based per-CPU reference counting instead of rcuref_t
>
> The use of rcuref_t for reference counting introduces a performance bottleneck
> when accessed concurrently by multiple threads during futex operations.
[...]
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: https://lore.kernel.org/r/20250710110011.384614-3-bigeasy@linutronix.de
> ---

This is causing explosions on my test systems, in code that doesn't obviously
have anything to do with futex.

The most common symptom is a #GP on this code in try_to_wake_up():

	/* Link @node into the waitqueue. */
	WRITE_ONCE(prev->next, node);

although on systems with 5-level paging I _think_ it just manifests as hard
hangs (I assume because prev->next is corrupted, but is still canonical with
LA57? But that's a wild guess).

The failure always occurs when userspace writes /sys/module/kvm/parameters/nx_huge_pages,
but I don't think there's anything KVM specific about the issue. Simply writing
the param doesn't explode, the problem only arises when I'm running tests in
parallel (but then failure is almost immediate), so presumably there's a task
migration angle or something?

Manually disabling CONFIG_FUTEX_PRIVATE_HASH makes the problem go away, and
running with CONFIG_FUTEX_PRIVATE_HASH=y prior to this rework is also fine. So
it appears that the problem is specifically in the new code.

I can provide more info as needed next week.
Oops: general protection fault, probably for non-canonical address 0xff0e899fa1566052: 0000 [#1] SMP
CPU: 51 UID: 0 PID: 53807 Comm: tee Tainted: G S O 6.17.0-smp--38183c31756a-next #826 NONE
Tainted: [S]=CPU_OUT_OF_SPEC, [O]=OOT_MODULE
Hardware name: Google LLC Indus/Indus_QC_03, BIOS 30.110.0 09/13/2024
RIP: 0010:queued_spin_lock_slowpath+0x123/0x250
Code: d8 c1 e8 10 66 87 47 02 66 85 c0 74 40 0f b7 c0 89 c2 83 e2 03 c1 e2 04 83 e0 fc 48 c7 c6 f8 ff ff ff 48 8b 84 46 a0 e2 a3 a0 <48> 89 8c 02 c0 da 47 a2 83 79 08 00 75 08 f3 90 83 79 08 00 74 f8
RSP: 0018:ffffbf55cffe7cf8 EFLAGS: 00010006
RAX: ff0e899fff0e8562 RBX: 0000000000d00000 RCX: ffffa39b40aefac0
RDX: 0000000000000030 RSI: fffffffffffffff8 RDI: ffffa39d0592e68c
RBP: 0000000000d00000 R08: 00000000ffffff80 R09: 0000000400000000
R10: ffffa36cce4fe401 R11: 0000000000000800 R12: 0000000000000003
R13: 0000000000000000 R14: ffffa39d0592e68c R15: ffffa39b9e672000
FS:  00007f233b2e9740(0000) GS:ffffa39b9e672000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f233b39fda0 CR3: 00000004d031f002 CR4: 00000000007726f0
PKRU: 55555554
Call Trace:
 <TASK>
 _raw_spin_lock_irqsave+0x50/0x60
 try_to_wake_up+0x4f/0x5d0
 set_nx_huge_pages+0xe4/0x1c0 [kvm]
 param_attr_store+0x89/0xf0
 module_attr_store+0x1e/0x30
 kernfs_fop_write_iter+0xe4/0x160
 vfs_write+0x2cb/0x420
 ksys_write+0x7f/0xf0
 do_syscall_64+0x6f/0x1f0
 ? arch_exit_to_user_mode_prepare+0x9/0x50
 entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7f233b4178b3
Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 48 8b 05 99 91 07 00 83 38 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 4d c3 55 48 89 e5 41 57 41 56 53 50 48 89 d3
RSP: 002b:00007ffffbdbf3b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f233b4178b3
RDX: 0000000000000002 RSI: 00007ffffbdbf4a0 RDI: 0000000000000005
RBP: 00007ffffbdbf3e0 R08: 0000000000001004 R09: 0000000000000000
R10: 00000000000001b6 R11: 0000000000000246 R12: 00007ffffbdbf4a0
R13: 0000000000000002 R14: 00000000226ff3d0 R15: 0000000000000002
 </TASK>
Modules linked in: kvm_intel kvm irqbypass vfat fat i2c_mux_pca954x i2c_mux spidev cdc_acm xhci_pci xhci_hcd gq(O) sha3_generic
gsmi: Log Shutdown Reason 0x03
---[ end trace 0000000000000000 ]---
RIP: 0010:queued_spin_lock_slowpath+0x123/0x250
Code: d8 c1 e8 10 66 87 47 02 66 85 c0 74 40 0f b7 c0 89 c2 83 e2 03 c1 e2 04 83 e0 fc 48 c7 c6 f8 ff ff ff 48 8b 84 46 a0 e2 a3 a0 <48> 89 8c 02 c0 da 47 a2 83 79 08 00 75 08 f3 90 83 79 08 00 74 f8
RSP: 0018:ffffbf55cffe7cf8 EFLAGS: 00010006
RAX: ff0e899fff0e8562 RBX: 0000000000d00000 RCX: ffffa39b40aefac0
RDX: 0000000000000030 RSI: fffffffffffffff8 RDI: ffffa39d0592e68c
RBP: 0000000000d00000 R08: 00000000ffffff80 R09: 0000000400000000
R10: ffffa36cce4fe401 R11: 0000000000000800 R12: 0000000000000003
R13: 0000000000000000 R14: ffffa39d0592e68c R15: ffffa39b9e672000
FS:  00007f233b2e9740(0000) GS:ffffa39b9e672000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f233b39fda0 CR3: 00000004d031f002 CR4: 00000000007726f0
PKRU: 55555554
Kernel panic - not syncing: Fatal exception
Kernel Offset: 0x1e600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
gsmi: Log Shutdown Reason 0x02
Rebooting in 10 seconds..
On Fri, Aug 15, 2025, Sean Christopherson wrote:
> On Fri, Jul 11, 2025, tip-bot2 for Peter Zijlstra wrote:
> > The following commit has been merged into the locking/futex branch of tip:
> >
> > Commit-ID: 56180dd20c19e5b0fa34822997a9ac66b517e7b3
> >
> > futex: Use RCU-based per-CPU reference counting instead of rcuref_t
[...]
> > Link: https://lore.kernel.org/r/20250710110011.384614-3-bigeasy@linutronix.de
> > ---
>
> This is causing explosions on my test systems, in code that doesn't obviously
> have anything to do with futex.

Closing the loop, this turned out to be a KVM bug[*]. Why the futex changes
exposed the bug and caused explosions, I have no idea, but nothing suggests
that this patch is buggy.

[*] https://lore.kernel.org/all/20250825160406.ZVcVPStz@linutronix.de

> The most common symptom is a #GP on this code in try_to_wake_up():
>
> 	/* Link @node into the waitqueue. */
> 	WRITE_ONCE(prev->next, node);
>
> although on systems with 5-level paging I _think_ it just manifests as hard
> hangs (I assume because prev->next is corrupted, but is still canonical with
> LA57? But that's a wild guess).
>
> The failure always occurs when userspace writes /sys/module/kvm/parameters/nx_huge_pages,
> but I don't think there's anything KVM specific about the issue. Simply writing
> the param doesn't explode, the problem only arises when I'm running tests in
> parallel (but then failure is almost immediate), so presumably there's a task
> migration angle or something?
>
> Manually disabling CONFIG_FUTEX_PRIVATE_HASH makes the problem go away, and
> running with CONFIG_FUTEX_PRIVATE_HASH=y prior to this rework is also fine. So
> it appears that the problem is specifically in the new code.
>
> I can provide more info as needed next week.
>
> Oops: general protection fault, probably for non-canonical address 0xff0e899fa1566052: 0000 [#1] SMP
...
> Call Trace:
>  <TASK>
>  _raw_spin_lock_irqsave+0x50/0x60
>  try_to_wake_up+0x4f/0x5d0
>  set_nx_huge_pages+0xe4/0x1c0 [kvm]
>  param_attr_store+0x89/0xf0
>  module_attr_store+0x1e/0x30
>  kernfs_fop_write_iter+0xe4/0x160
>  vfs_write+0x2cb/0x420
>  ksys_write+0x7f/0xf0
>  do_syscall_64+0x6f/0x1f0
>  ? arch_exit_to_user_mode_prepare+0x9/0x50
>  entry_SYSCALL_64_after_hwframe+0x4b/0x53