futex: fix NUMA node publication race causing missed wakeups

[PATCH] futex: fix NUMA node publication race causing missed wakeups

Posted by Chengfeng Ye 1 month, 1 week ago

get_futex_key() publishes the FUTEX2_NUMA node side word in userspace.
The publication path used a non-atomic read/compute/write sequence, so
concurrent callers could overwrite each other during initialization.

This race can make concurrent operations on the same futex derive
different node values while the NUMA hint is being initialized,
resulting in inconsistent futex keying between wait and wake sides.
In practice this can lead to missed wakeups; at user level, missed
wakeups can manifest as threads waiting indefinitely
(application-level deadlock/hang).

PoC description (see Link below):
  - two threads repeatedly exercising FUTEX2_NUMA wait/wake on the
    same futex,
  - waiter and waker pinned to CPUs from different NUMA nodes,
  - waker continuously issuing wake calls while waiter performs
    10-second timed waits.

PoC output on unpatched kernel (wake signal missed and waiter timed out):
  - observed on Linux v7.0-rc2 running in qemu-system-x86_64 with
    4 vCPUs
  Using CPU 0 (waiter) and CPU 2 (waker) from different NUMA nodes
  [TRIGGER EVENT #1] iter=38 timed out (futex.node=1)
  [TRIGGER EVENT #2] iter=85 timed out (futex.node=1)
  [TRIGGER EVENT #3] iter=95 timed out (futex.node=1)

Fix by making node-hint publication publish-once via atomic cmpxchg on
naddr (FUTEX_NO_NODE -> computed node), retrying transient -EAGAIN,
and adopting/validating the winner value on contention.

Fixes: c042c505210d ("futex: Implement FUTEX2_MPOL")
Link: https://gist.github.com/Ychame/d4a5e95401a471f4211a751734b5d164
Signed-off-by: Chengfeng Ye <dg573847474@gmail.com>
---
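For illustration only, and not part of the patch: the publish-once idea
described above can be sketched in userspace with C11 atomics. The names
publish_node_once() and NO_NODE are hypothetical stand-ins (NO_NODE plays
the role of FUTEX_NO_NODE); the kernel side uses
futex_cmpxchg_value_locked() on a user address and additionally has to
handle faults and -EAGAIN, which this sketch does not model.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NO_NODE UINT32_MAX	/* assumption: stand-in for FUTEX_NO_NODE */

/*
 * Store `mine` only if the word still holds NO_NODE. If another caller
 * got there first, leave the word alone and adopt that caller's value.
 * Either way, every caller ends up returning the same node.
 */
static uint32_t publish_node_once(_Atomic uint32_t *word, uint32_t mine)
{
	uint32_t expected = NO_NODE;

	assert(mine != NO_NODE);	/* publishing "no node" makes no sense */

	if (atomic_compare_exchange_strong(word, &expected, mine))
		return mine;		/* we won: our value is published */

	return expected;		/* we lost: CAS stored the winner here */
}
```

With a word that starts at NO_NODE, a first call proposing node 0 and a
second call proposing node 1 both return 0, so both sides hash on the same
node. A plain load/compute/store sequence gives no such guarantee.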
 kernel/futex/core.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index cf7e610eac42..d45612b36e30 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -596,13 +596,29 @@ int get_futex_key(u32 __user *uaddr, unsigned int flags, union futex_key *key,
 
 	if (flags & FLAGS_NUMA) {
 		u32 __user *naddr = (void *)uaddr + size / 2;
+		u32 old_node;
 
 		if (node == FUTEX_NO_NODE) {
 			node = numa_node_id();
 			node_updated = true;
 		}
-		if (node_updated && put_user_inline(node, naddr))
-			return -EFAULT;
+		if (node_updated) {
+retry_numa_node:
+			err = futex_cmpxchg_value_locked(&old_node, naddr,
+							 FUTEX_NO_NODE, (u32)node);
+			if (err == -EAGAIN) {
+				cond_resched();
+				goto retry_numa_node;
+			}
+			if (err)
+				return err;
+			if (old_node != FUTEX_NO_NODE) {
+				node = old_node;
+				if ((unsigned int)node >= MAX_NUMNODES ||
+				    !node_possible(node))
+					return -EINVAL;
+			}
+		}
 	}
 
 	key->both.node = node;
-- 
2.25.1

Re: [PATCH] futex: fix NUMA node publication race causing missed wakeups

Posted by Sebastian Andrzej Siewior 4 weeks ago

On 2026-03-03 03:01:00 [+0000], Chengfeng Ye wrote:
> get_futex_key() publishes the FUTEX2_NUMA node side word in userspace.
> The publication path used a non-atomic read/compute/write sequence, so
> concurrent callers could overwrite each other during initialization.
> 
> This race can make concurrent operations on the same futex derive
> different node values while the NUMA hint is being initialized,
> resulting in inconsistent futex keying between wait and wake sides.
> In practice this can lead to missed wakeups; at user level, missed
> wakeups can manifest as threads waiting indefinitely
> (application-level deadlock/hang).
> 
> PoC description (see Link below):
>   - two threads repeatedly exercising FUTEX2_NUMA wait/wake on the
>     same futex,
>   - waiter and waker pinned to CPUs from different NUMA nodes,
>   - waker continuously issuing wake calls while waiter performs
>     10-second timed waits.
> 
> PoC output on unpatched kernel (wake signal missed and waiter timed out):
>   - observed on Linux v7.0-rc2 running in qemu-system-x86_64 with
>     4 vCPUs
>   Using CPU 0 (waiter) and CPU 2 (waker) from different NUMA nodes
>   [TRIGGER EVENT #1] iter=38 timed out (futex.node=1)
>   [TRIGGER EVENT #2] iter=85 timed out (futex.node=1)
>   [TRIGGER EVENT #3] iter=95 timed out (futex.node=1)
> 
> Fix by making node-hint publication publish-once via atomic cmpxchg on
> naddr (FUTEX_NO_NODE -> computed node), retrying transient -EAGAIN,
> and adopting/validating the winner value on contention.
> 
> Fixes: c042c505210d ("futex: Implement FUTEX2_MPOL")
> Link: https://gist.github.com/Ychame/d4a5e95401a471f4211a751734b5d164
> Signed-off-by: Chengfeng Ye <dg573847474@gmail.com>

I did point out this scenario, and I was told that it should not be done
this way. Initialize once and be done with it; plus, with mpol the value
should be consistent.

I intended to document this and started with the new futex syscalls but
didn't get very far. But the whole PR_FUTEX_HASH thingy is in \o/.

Sebastian

Re: [PATCH] futex: fix NUMA node publication race causing missed wakeups

Posted by Peter Zijlstra 4 weeks ago

On Thu, Mar 12, 2026 at 10:37:09AM +0100, Sebastian Andrzej Siewior wrote:
> On 2026-03-03 03:01:00 [+0000], Chengfeng Ye wrote:
> > get_futex_key() publishes the FUTEX2_NUMA node side word in userspace.
> > The publication path used a non-atomic read/compute/write sequence, so
> > concurrent callers could overwrite each other during initialization.
> > 
> > This race can make concurrent operations on the same futex derive
> > different node values while the NUMA hint is being initialized,
> > resulting in inconsistent futex keying between wait and wake sides.
> > In practice this can lead to missed wakeups; at user level, missed
> > wakeups can manifest as threads waiting indefinitely
> > (application-level deadlock/hang).
> > 
> > PoC description (see Link below):
> >   - two threads repeatedly exercising FUTEX2_NUMA wait/wake on the
> >     same futex,
> >   - waiter and waker pinned to CPUs from different NUMA nodes,
> >   - waker continuously issuing wake calls while waiter performs
> >     10-second timed waits.
> > 
> > PoC output on unpatched kernel (wake signal missed and waiter timed out):
> >   - observed on Linux v7.0-rc2 running in qemu-system-x86_64 with
> >     4 vCPUs
> >   Using CPU 0 (waiter) and CPU 2 (waker) from different NUMA nodes
> >   [TRIGGER EVENT #1] iter=38 timed out (futex.node=1)
> >   [TRIGGER EVENT #2] iter=85 timed out (futex.node=1)
> >   [TRIGGER EVENT #3] iter=95 timed out (futex.node=1)
> > 
> > Fix by making node-hint publication publish-once via atomic cmpxchg on
> > naddr (FUTEX_NO_NODE -> computed node), retrying transient -EAGAIN,
> > and adopting/validating the winner value on contention.
> > 
> > Fixes: c042c505210d ("futex: Implement FUTEX2_MPOL")
> > Link: https://gist.github.com/Ychame/d4a5e95401a471f4211a751734b5d164
> > Signed-off-by: Chengfeng Ye <dg573847474@gmail.com>
> 
> I did point out this scenario, and I was told that it should not be done
> this way. Initialize once and be done with it; plus, with mpol the value
> should be consistent.

Right, see tools/testing/selftests/futex/functional/futex_numa.c, that
has a very simple numa lock implementation you can crib from.

You can only clear the node word when you clear the waiter bit (i.e.,
there are no more waiters left) and it must be done atomically such that
any concurrent lock operation will DTRT.

Specifically, futex_numa_32 requires a 64-bit cmpxchg.
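
A rough userspace sketch of that rule, under stated assumptions: the
union layout (futex value followed by the node word), the F_LOCKED and
F_WAITERS bits, NO_NODE and numa_unlock() are all hypothetical names made
up for illustration. This is not the selftest code; it only shows the
value and the node word being changed by one 64-bit compare-exchange.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NO_NODE   UINT32_MAX	/* assumption: stand-in for FUTEX_NO_NODE */
#define F_LOCKED  0x1u		/* hypothetical lock bit */
#define F_WAITERS 0x2u		/* hypothetical waiter bit */

/* Futex value followed by its node word, viewed as one 64-bit unit. */
union numa_futex {
	uint64_t full;
	struct {
		uint32_t val;
		uint32_t node;
	};
};

static_assert(sizeof(union numa_futex) == 8, "val + node must be 64 bits");

/*
 * Drop the lock. Returns true if waiters remain, in which case the
 * caller is expected to issue a wake. The node word is reset only when
 * no waiter bit is set, and always together with the value.
 */
static bool numa_unlock(_Atomic uint64_t *word)
{
	union numa_futex old, new;

	old.full = atomic_load(word);
	for (;;) {
		new = old;
		if (old.val & F_WAITERS) {
			/* Waiters are queued under this node: keep it. */
			new.val = old.val & ~F_LOCKED;
		} else {
			/* Nobody is queued: value and node reset as one. */
			new.val = 0;
			new.node = NO_NODE;
		}
		/* On failure, old.full is refreshed and we recompute. */
		if (atomic_compare_exchange_weak(word, &old.full, new.full))
			return old.val & F_WAITERS;
	}
}
```

Because both halves change in a single compare-exchange, a concurrent
locker sees either the old value with the old node or the reset value
with NO_NODE, never a mix of the two.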