futex: avoid false sharing between hb->chain and the bucket lock

[PATCH RFC] futex: avoid false sharing between hb->chain and the bucket lock
Posted by Breno Leitao 2 days, 11 hours ago
struct futex_hash_bucket packs (atomic_t waiters, spinlock_t lock,
struct plist_head chain, struct futex_private_hash *priv) into a
single ____cacheline_aligned_in_smp 64-byte block. Three distinct
access patterns hit that line:

  1. Lockless atomic_read(&hb->waiters) via futex_hb_waiters_pending()
     on the fast path before taking the lock.
  2. spin_lock(&hb->lock) contenders writing the lock word.
  3. The lock holder modifying chain.{next,prev} on every futex_wake,
     futex_q_unlock, plist_add, __futex_unqueue.

This was first noticed on a Meta cache (ucache) production workload.
perf c2c on a busy 176-core AMD EPYC 9D64 ranked this exact cacheline
as the #1 HITM source: 129 Local + 31 Remote HITM in 5 seconds, hit by
156 distinct CPUs.

The contention is not specific to that workload, though. Our very own
"perf bench futex" hash exercises the same buckets and shows the same
false sharing, so the rest of this changelog quantifies the fix with
perf bench futex.

Move chain to its own cacheline so:
  - The lock holder's chain updates no longer invalidate the line that
    lockless futex_hb_waiters_pending() readers and spinning lock
    contenders keep re-reading.
  - Cross-CCD lock handoffs ship only the (waiters, lock) line; the
    next holder reads chain from its own L2/L3 instead of fetching
    chain entries together with the lock byte.
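
As a rough user-space illustration of the layout change (not the kernel
header itself), the sketch below mirrors the bucket with stand-in types.
The mock_* names, the 64-byte CACHELINE constant, and the 4-byte
atomic/spinlock sizes are assumptions for a 64-bit, non-debug build;
line_of() is a hypothetical helper. It relies on the GCC/Clang aligned
attribute, which is what ____cacheline_aligned_in_smp expands to on SMP.

```c
#include <assert.h>
#include <stddef.h>   /* offsetof, size_t */
#include <stdint.h>

#define CACHELINE 64  /* assumed cacheline size in bytes */

/* User-space stand-ins for the kernel types (sizes are assumptions). */
typedef struct { int counter; } mock_atomic_t;            /* 4 bytes */
typedef struct { uint32_t val; } mock_spinlock_t;         /* 4 bytes */
struct mock_list_head { struct mock_list_head *next, *prev; };
struct mock_plist_head { struct mock_list_head node_list; };

/* Before: waiters, lock, chain and priv all share cacheline 0. */
struct bucket_before {
	mock_atomic_t waiters;
	mock_spinlock_t lock;
	struct mock_plist_head chain;
	void *priv;
} __attribute__((aligned(CACHELINE)));

/* After: chain (and priv behind it) start on cacheline 1. */
struct bucket_after {
	mock_atomic_t waiters;
	mock_spinlock_t lock;
	struct mock_plist_head chain __attribute__((aligned(CACHELINE)));
	void *priv;
} __attribute__((aligned(CACHELINE)));

/* Index of the cacheline, relative to the bucket start, holding an offset. */
static inline size_t line_of(size_t offset)
{
	return offset / CACHELINE;
}

/* Old layout: lock at 0x4, chain.{next,prev} at 0x8/0x10 (LP64 only). */
_Static_assert(offsetof(struct bucket_before, lock) == 0x4, "lock");
_Static_assert(offsetof(struct bucket_before, chain.node_list.next) == 0x8, "next");
_Static_assert(offsetof(struct bucket_before, chain.node_list.prev) == 0x10, "prev");
```

With these definitions, line_of(offsetof(struct bucket_after, lock)) and
line_of(offsetof(struct bucket_after, chain)) differ, while the same two
calls on struct bucket_before return the same line.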

This improves "perf bench futex hash" on a 176-core AMD EPYC 9D64 by
about 16% on average (higher is better):

                   baseline    +fix       delta
  average      1,394,938   1,616,781    +15.9 %
  median       1,430,012   1,617,072    +13.1 %
  min          1,214,488   1,501,741    +23.7 %
  max          1,488,167   1,730,734    +16.3 %

The distributions do not overlap: the slowest +fix run (1.50 M) is
faster than even the fastest baseline run (1.49 M).
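
For anyone who wants to re-derive the delta column or the non-overlap
claim from the raw numbers, a minimal sketch follows. delta_tenths() and
ranges_disjoint() are hypothetical helper names, not part of perf; the
rounding to one decimal place is an assumption about how the table was
produced.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Relative change from base to fix, in tenths of a percent, rounded to
 * nearest. Assumes base > 0 and fix >= base (an improvement).
 */
static inline uint64_t delta_tenths(uint64_t base, uint64_t fix)
{
	return ((fix - base) * 1000 + base / 2) / base;
}

/*
 * True when the slowest +fix run is still faster than the fastest
 * baseline run, i.e. the two sets of results do not overlap.
 */
static inline int ranges_disjoint(uint64_t base_max, uint64_t fix_min)
{
	return fix_min > base_max;
}
```

Feeding it the average row, delta_tenths(1394938, 1616781) gives 159,
i.e. +15.9 %, and ranges_disjoint(1488167, 1501741) is true.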

This improves wake up latency as well:

perf bench futex wake -s (broadcast wakeup latency, lower is better):
  baseline:   0.300 / 0.329 / 0.266 ms   (avg 0.298)
  +fix:       0.292 / 0.253 / 0.270 ms   (avg 0.272, -9 %)

Cost: one extra cacheline (56 B padding) per bucket. Would it be
acceptable?
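
To make the cost concrete, the padding and the per-bucket growth can be
worked out by hand. The sketch below does that arithmetic; the field
sizes (4/4/16/8 bytes) assume a 64-bit, non-debug build, and
align_up(), chain_padding(), bucket_size() and extra_bytes() are
hypothetical helpers used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE 64u  /* assumed cacheline size in bytes */

/* Assumed field sizes: waiters, lock, chain, priv. */
enum { SZ_WAITERS = 4, SZ_LOCK = 4, SZ_CHAIN = 16, SZ_PRIV = 8 };

/* Round offset up to a multiple of align (align must be a power of two). */
static inline size_t align_up(size_t offset, size_t align)
{
	return (offset + align - 1) & ~(align - 1);
}

/* Padding inserted in front of chain once it is cacheline-aligned. */
static inline size_t chain_padding(void)
{
	size_t end_of_lock = SZ_WAITERS + SZ_LOCK;

	return align_up(end_of_lock, CACHELINE) - end_of_lock;
}

/* Bucket size with chain packed (0) or cacheline-aligned (1). */
static inline size_t bucket_size(int chain_aligned)
{
	size_t off = SZ_WAITERS + SZ_LOCK;

	if (chain_aligned)
		off = align_up(off, CACHELINE);
	off += SZ_CHAIN + SZ_PRIV;
	/* The struct itself is cacheline-aligned, so round the tail up. */
	return align_up(off, CACHELINE);
}

/* Extra memory for a hash table of nbuckets buckets. */
static inline size_t extra_bytes(size_t nbuckets)
{
	return nbuckets * (bucket_size(1) - bucket_size(0));
}
```

Under these assumptions the 56 B of padding sit between lock and chain,
and each bucket grows by one full cacheline, so a table's overhead
scales linearly with its bucket count.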

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 kernel/futex/futex.h | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 79ef2c709c81..4981dcf465a9 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -142,7 +142,16 @@ static inline bool should_fail_futex(bool fshared)
 struct futex_hash_bucket {
 	atomic_t waiters;
 	spinlock_t lock;
-	struct plist_head chain;
+	/*
+	 * Keep the plist_head chain on its own cacheline. Lockless
+	 * futex_hb_waiters_pending() readers and lock contenders touch
+	 * the (waiters, lock) line; the lock holder modifies chain on
+	 * every wake/queue. perf c2c on a busy 176-core AMD host showed
+	 * this bucket cacheline as the #1 HITM source (129 Lcl + 31 Rmt
+	 * in 5s), hit by 156 distinct CPUs at offset 0x4 (lock) and
+	 * 0x8/0x10 (chain.{next,prev}).
+	 */
+	struct plist_head chain ____cacheline_aligned_in_smp;
 	struct futex_private_hash *priv;
 } ____cacheline_aligned_in_smp;
 

---
base-commit: b99ae45861eccff1e1d8c7b05a13650be805d437
change-id: 20260605-futex-c5478d627985

Best regards,
-- 
Breno Leitao <leitao@debian.org>