From nobody Tue Apr 7 02:33:54 2026
Date: Mon, 16 Mar 2026 18:12:56 +0100
Message-ID: <20260316164951.004423818@kernel.org>
User-Agent: quilt/0.68
From: Thomas Gleixner
To: LKML
Cc: Mathieu Desnoyers, André Almeida, Sebastian Andrzej Siewior,
 Carlos O'Donell, Peter Zijlstra, Florian Weimer, Rich Felker,
 Torvald Riegel, Darren Hart, Ingo Molnar, Davidlohr Bueso,
 Arnd Bergmann, "Liam R. Howlett"
Subject: [patch 1/8] futex: Move futex task related data into a struct
References: <20260316162316.356674433@kernel.org>

Having all these members in task_struct along with the required
#ifdeffery is annoying, does not allow efficient initializing of the
data with memset() and makes extending it tedious.

Move it into a data structure and fix up all usage sites.

Signed-off-by: Thomas Gleixner
Reviewed-by: Mathieu Desnoyers
---
 Documentation/locking/robust-futexes.rst |    8 ++--
 include/linux/futex.h                    |   12 ++----
 include/linux/futex_types.h              |   34 +++++++++++++++++++
 include/linux/sched.h                    |   16 ++-------
 kernel/exit.c                            |    4 +-
 kernel/futex/core.c                      |   55 +++++++++++++++-------------
 kernel/futex/pi.c                        |   26 +++++++-------
 kernel/futex/syscalls.c                  |   23 ++++--------
 8 files changed, 97 insertions(+), 81 deletions(-)

--- a/Documentation/locking/robust-futexes.rst
+++ b/Documentation/locking/robust-futexes.rst
@@ -94,7 +94,7 @@ time, the kernel checks this user-space
 locks to be cleaned up?

 In the common case, at do_exit() time, there is no list registered, so
-the cost of robust futexes is just a simple current->robust_list != NULL
+the cost of robust futexes is just a current->futex.robust_list != NULL
 comparison. If the thread has registered a list, then normally the list
 is empty.
 If the thread/process crashed or terminated in some incorrect way then
 the list might be non-empty: in this case the kernel carefully
@@ -178,9 +178,9 @@ The patch adds two new syscalls: one to
 size_t __user *len_ptr);

 List registration is very fast: the pointer is simply stored in
-current->robust_list. [Note that in the future, if robust futexes become
-widespread, we could extend sys_clone() to register a robust-list head
-for new threads, without the need of another syscall.]
+current->futex.robust_list. [Note that in the future, if robust futexes
+become widespread, we could extend sys_clone() to register a robust-list
+head for new threads, without the need of another syscall.]

 So there is virtually zero overhead for tasks not using robust futexes,
 and even for robust futex users, there is only one extra syscall per
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -64,14 +64,10 @@ enum {

 static inline void futex_init_task(struct task_struct *tsk)
 {
-	tsk->robust_list = NULL;
-#ifdef CONFIG_COMPAT
-	tsk->compat_robust_list = NULL;
-#endif
-	INIT_LIST_HEAD(&tsk->pi_state_list);
-	tsk->pi_state_cache = NULL;
-	tsk->futex_state = FUTEX_STATE_OK;
-	mutex_init(&tsk->futex_exit_mutex);
+	memset(&tsk->futex, 0, sizeof(tsk->futex));
+	INIT_LIST_HEAD(&tsk->futex.pi_state_list);
+	tsk->futex.state = FUTEX_STATE_OK;
+	mutex_init(&tsk->futex.exit_mutex);
 }

 void futex_exit_recursive(struct task_struct *tsk);
--- /dev/null
+++ b/include/linux/futex_types.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_FUTEX_TYPES_H
+#define _LINUX_FUTEX_TYPES_H
+
+#ifdef CONFIG_FUTEX
+#include
+#include
+
+struct compat_robust_list_head;
+struct futex_pi_state;
+struct robust_list_head;
+
+/**
+ * struct futex_ctrl - Futex related per task data
+ * @robust_list:	User space registered robust list pointer
+ * @compat_robust_list:	User space registered robust list pointer for compat tasks
+ * @exit_mutex:		Mutex for serializing
 exit
+ * @state:		Futex handling state to handle exit races correctly
+ */
+struct futex_ctrl {
+	struct robust_list_head __user *robust_list;
+#ifdef CONFIG_COMPAT
+	struct compat_robust_list_head __user *compat_robust_list;
+#endif
+	struct list_head pi_state_list;
+	struct futex_pi_state *pi_state_cache;
+	struct mutex exit_mutex;
+	unsigned int state;
+};
+#else
+struct futex_ctrl { };
+#endif /* !CONFIG_FUTEX */
+
+#endif /* _LINUX_FUTEX_TYPES_H */
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -16,6 +16,7 @@
 #include

 #include
+#include
 #include
 #include
 #include
@@ -64,7 +65,6 @@ struct bpf_net_context;
 struct capture_control;
 struct cfs_rq;
 struct fs_struct;
-struct futex_pi_state;
 struct io_context;
 struct io_uring_task;
 struct mempolicy;
@@ -76,7 +76,6 @@ struct pid_namespace;
 struct pipe_inode_info;
 struct rcu_node;
 struct reclaim_state;
-struct robust_list_head;
 struct root_domain;
 struct rq;
 struct sched_attr;
@@ -1329,16 +1328,9 @@ struct task_struct {
 	u32 closid;
 	u32 rmid;
 #endif
-#ifdef CONFIG_FUTEX
-	struct robust_list_head __user *robust_list;
-#ifdef CONFIG_COMPAT
-	struct compat_robust_list_head __user *compat_robust_list;
-#endif
-	struct list_head pi_state_list;
-	struct futex_pi_state *pi_state_cache;
-	struct mutex futex_exit_mutex;
-	unsigned int futex_state;
-#endif
+
+	struct futex_ctrl futex;
+
 #ifdef CONFIG_PERF_EVENTS
 	u8 perf_recursion[PERF_NR_CONTEXTS];
 	struct perf_event_context *perf_event_ctxp;
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -989,8 +989,8 @@ void __noreturn do_exit(long code)
 	proc_exit_connector(tsk);
 	mpol_put_task_policy(tsk);
 #ifdef CONFIG_FUTEX
-	if (unlikely(current->pi_state_cache))
-		kfree(current->pi_state_cache);
+	if (unlikely(current->futex.pi_state_cache))
+		kfree(current->futex.pi_state_cache);
 #endif
 	/*
 	 * Make sure we are holding no locks:
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -32,18 +32,19 @@
 * "But they come in a choice of three flavours!"
 */
 #include
-#include
-#include
 #include
-#include
+#include
 #include
-#include
+#include
 #include
-#include
-#include
-#include
 #include
 #include
+#include
+#include
+#include
+#include
+#include
+#include

 #include "futex.h"
 #include "../locking/rtmutex_common.h"
@@ -829,7 +830,7 @@ void wait_for_owner_exiting(int ret, str
 	if (WARN_ON_ONCE(ret == -EBUSY && !exiting))
 		return;

-	mutex_lock(&exiting->futex_exit_mutex);
+	mutex_lock(&exiting->futex.exit_mutex);
 	/*
 	 * No point in doing state checking here. If the waiter got here
 	 * while the task was in exec()->exec_futex_release() then it can
@@ -838,7 +839,7 @@ void wait_for_owner_exiting(int ret, str
 	 * already. Highly unlikely and not a problem. Just one more round
 	 * through the futex maze.
 	 */
-	mutex_unlock(&exiting->futex_exit_mutex);
+	mutex_unlock(&exiting->futex.exit_mutex);

 	put_task_struct(exiting);
 }
@@ -1048,7 +1049,7 @@ static int handle_futex_death(u32 __user
 	 *
 	 * In both cases the following conditions are met:
 	 *
-	 * 1) task->robust_list->list_op_pending != NULL
+	 * 1) task->futex.robust_list->list_op_pending != NULL
 	 *    @pending_op == true
 	 * 2) The owner part of user space futex value == 0
 	 * 3) Regular futex: @pi == false
@@ -1153,7 +1154,7 @@ static inline int fetch_robust_entry(str
 */
 static void exit_robust_list(struct task_struct *curr)
 {
-	struct robust_list_head __user *head = curr->robust_list;
+	struct robust_list_head __user *head = curr->futex.robust_list;
 	struct robust_list __user *entry, *next_entry, *pending;
 	unsigned int limit = ROBUST_LIST_LIMIT, pi, pip;
 	unsigned int next_pi;
@@ -1247,7 +1248,7 @@ compat_fetch_robust_entry(compat_uptr_t
 */
 static void compat_exit_robust_list(struct task_struct *curr)
 {
-	struct compat_robust_list_head __user *head = curr->compat_robust_list;
+	struct compat_robust_list_head __user *head = curr->futex.compat_robust_list;
 	struct robust_list __user *entry, *next_entry, *pending;
 	unsigned int limit = ROBUST_LIST_LIMIT,
 pi, pip;
 	unsigned int next_pi;
@@ -1323,7 +1324,7 @@ static void compat_exit_robust_list(stru
 */
 static void exit_pi_state_list(struct task_struct *curr)
 {
-	struct list_head *next, *head = &curr->pi_state_list;
+	struct list_head *next, *head = &curr->futex.pi_state_list;
 	struct futex_pi_state *pi_state;
 	union futex_key key = FUTEX_KEY_INIT;

@@ -1407,19 +1408,19 @@ static inline void exit_pi_state_list(st

 static void futex_cleanup(struct task_struct *tsk)
 {
-	if (unlikely(tsk->robust_list)) {
+	if (unlikely(tsk->futex.robust_list)) {
 		exit_robust_list(tsk);
-		tsk->robust_list = NULL;
+		tsk->futex.robust_list = NULL;
 	}

 #ifdef CONFIG_COMPAT
-	if (unlikely(tsk->compat_robust_list)) {
+	if (unlikely(tsk->futex.compat_robust_list)) {
 		compat_exit_robust_list(tsk);
-		tsk->compat_robust_list = NULL;
+		tsk->futex.compat_robust_list = NULL;
 	}
 #endif

-	if (unlikely(!list_empty(&tsk->pi_state_list)))
+	if (unlikely(!list_empty(&tsk->futex.pi_state_list)))
 		exit_pi_state_list(tsk);
 }

@@ -1442,10 +1443,10 @@ static void futex_cleanup(struct task_st
 */
 void futex_exit_recursive(struct task_struct *tsk)
 {
-	/* If the state is FUTEX_STATE_EXITING then futex_exit_mutex is held */
-	if (tsk->futex_state == FUTEX_STATE_EXITING)
-		mutex_unlock(&tsk->futex_exit_mutex);
-	tsk->futex_state = FUTEX_STATE_DEAD;
+	/* If the state is FUTEX_STATE_EXITING then futex.exit_mutex is held */
+	if (tsk->futex.state == FUTEX_STATE_EXITING)
+		mutex_unlock(&tsk->futex.exit_mutex);
+	tsk->futex.state = FUTEX_STATE_DEAD;
 }

 static void futex_cleanup_begin(struct task_struct *tsk)
@@ -1453,10 +1454,10 @@ static void futex_cleanup_begin(struct t
 	/*
 	 * Prevent various race issues against a concurrent incoming waiter
 	 * including live locks by forcing the waiter to block on
-	 * tsk->futex_exit_mutex when it observes FUTEX_STATE_EXITING in
+	 * tsk->futex.exit_mutex when it observes FUTEX_STATE_EXITING in
 	 * attach_to_pi_owner().
 	 */
-	mutex_lock(&tsk->futex_exit_mutex);
+	mutex_lock(&tsk->futex.exit_mutex);

 	/*
 	 * Switch the state to FUTEX_STATE_EXITING under tsk->pi_lock.
@@ -1470,7 +1471,7 @@ static void futex_cleanup_begin(struct t
 	 * be observed in exit_pi_state_list().
 	 */
 	raw_spin_lock_irq(&tsk->pi_lock);
-	tsk->futex_state = FUTEX_STATE_EXITING;
+	tsk->futex.state = FUTEX_STATE_EXITING;
 	raw_spin_unlock_irq(&tsk->pi_lock);
 }

@@ -1480,12 +1481,12 @@ static void futex_cleanup_end(struct tas
 	 * Lockless store. The only side effect is that an observer might
 	 * take another loop until it becomes visible.
 	 */
-	tsk->futex_state = state;
+	tsk->futex.state = state;
 	/*
 	 * Drop the exit protection. This unblocks waiters which observed
 	 * FUTEX_STATE_EXITING to reevaluate the state.
 	 */
-	mutex_unlock(&tsk->futex_exit_mutex);
+	mutex_unlock(&tsk->futex.exit_mutex);
 }

 void futex_exec_release(struct task_struct *tsk)
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -14,7 +14,7 @@ int refill_pi_state_cache(void)
 {
 	struct futex_pi_state *pi_state;

-	if (likely(current->pi_state_cache))
+	if (likely(current->futex.pi_state_cache))
 		return 0;

 	pi_state = kzalloc_obj(*pi_state);
@@ -28,17 +28,17 @@ int refill_pi_state_cache(void)
 	refcount_set(&pi_state->refcount, 1);
 	pi_state->key = FUTEX_KEY_INIT;

-	current->pi_state_cache = pi_state;
+	current->futex.pi_state_cache = pi_state;

 	return 0;
 }

 static struct futex_pi_state *alloc_pi_state(void)
 {
-	struct futex_pi_state *pi_state = current->pi_state_cache;
+	struct futex_pi_state *pi_state = current->futex.pi_state_cache;

 	WARN_ON(!pi_state);
-	current->pi_state_cache = NULL;
+	current->futex.pi_state_cache = NULL;

 	return pi_state;
 }

@@ -60,7 +60,7 @@ static void pi_state_update_owner(struct
 	if (new_owner) {
 		raw_spin_lock(&new_owner->pi_lock);
 		WARN_ON(!list_empty(&pi_state->list));
-		list_add(&pi_state->list, &new_owner->pi_state_list);
+		list_add(&pi_state->list, &new_owner->futex.pi_state_list);
 		pi_state->owner = new_owner;
 		raw_spin_unlock(&new_owner->pi_lock);
 	}
@@ -96,7 +96,7 @@ void put_pi_state(struct futex_pi_state
 		raw_spin_unlock_irqrestore(&pi_state->pi_mutex.wait_lock, flags);
 	}

-	if (current->pi_state_cache) {
+	if (current->futex.pi_state_cache) {
 		kfree(pi_state);
 	} else {
 		/*
@@ -106,7 +106,7 @@ void put_pi_state(struct futex_pi_state
 		 */
 		pi_state->owner = NULL;
 		refcount_set(&pi_state->refcount, 1);
-		current->pi_state_cache = pi_state;
+		current->futex.pi_state_cache = pi_state;
 	}
 }

@@ -179,7 +179,7 @@ void put_pi_state(struct futex_pi_state
 *
 * p->pi_lock:
 *
- *	p->pi_state_list -> pi_state->list, relation
+ *	p->futex.pi_state_list -> pi_state->list, relation
 *	pi_mutex->owner -> pi_state->owner, relation
 *
 * pi_state->refcount:
@@ -327,7 +327,7 @@ static int handle_exit_race(u32 __user
 	 * If the futex exit state is not yet FUTEX_STATE_DEAD, tell the
 	 * caller that the alleged owner is busy.
 	 */
-	if (tsk && tsk->futex_state != FUTEX_STATE_DEAD)
+	if (tsk && tsk->futex.state != FUTEX_STATE_DEAD)
 		return -EBUSY;

@@ -346,8 +346,8 @@ static int handle_exit_race(u32 __user
 	 *	*uaddr = 0xC0000000;	   tsk = get_task(PID);
 	 * }				   if (!tsk->flags & PF_EXITING) {
 	 *  ...				     attach();
-	 *  tsk->futex_state =		   } else {
-	 *	FUTEX_STATE_DEAD;	     if (tsk->futex_state !=
+	 *  tsk->futex.state =		   } else {
+	 *	FUTEX_STATE_DEAD;	     if (tsk->futex.state !=
 	 *					FUTEX_STATE_DEAD)
 	 *				       return -EAGAIN;
 	 *				     return -ESRCH; <--- FAIL
@@ -395,7 +395,7 @@ static void __attach_to_pi_owner(struct
 	pi_state->key = *key;

 	WARN_ON(!list_empty(&pi_state->list));
-	list_add(&pi_state->list, &p->pi_state_list);
+	list_add(&pi_state->list, &p->futex.pi_state_list);
 	/*
 	 * Assignment without holding pi_state->pi_mutex.wait_lock is safe
 	 * because there is no concurrency as the object is not published yet.
@@ -439,7 +439,7 @@ static int attach_to_pi_owner(u32 __user
 	 * in futex_exit_release(), we do this protected by p->pi_lock:
 	 */
 	raw_spin_lock_irq(&p->pi_lock);
-	if (unlikely(p->futex_state != FUTEX_STATE_OK)) {
+	if (unlikely(p->futex.state != FUTEX_STATE_OK)) {
 		/*
 		 * The task is on the way out. When the futex state is
 		 * FUTEX_STATE_DEAD, we know that the task has finished
--- a/kernel/futex/syscalls.c
+++ b/kernel/futex/syscalls.c
@@ -25,17 +25,13 @@
 * @head: pointer to the list-head
 * @len: length of the list-head, as userspace expects
 */
-SYSCALL_DEFINE2(set_robust_list, struct robust_list_head __user *, head,
-		size_t, len)
+SYSCALL_DEFINE2(set_robust_list, struct robust_list_head __user *, head, size_t, len)
 {
-	/*
-	 * The kernel knows only one size for now:
-	 */
+	/* The kernel knows only one size for now. */
 	if (unlikely(len != sizeof(*head)))
 		return -EINVAL;

-	current->robust_list = head;
-
+	current->futex.robust_list = head;
 	return 0;
 }

@@ -43,9 +39,9 @@ static inline void __user *futex_task_ro
 {
 #ifdef CONFIG_COMPAT
 	if (compat)
-		return p->compat_robust_list;
+		return p->futex.compat_robust_list;
 #endif
-	return p->robust_list;
+	return p->futex.robust_list;
 }

 static void __user *futex_get_robust_list_common(int pid, bool compat)
@@ -467,15 +463,13 @@ SYSCALL_DEFINE4(futex_requeue,
 }

 #ifdef CONFIG_COMPAT
-COMPAT_SYSCALL_DEFINE2(set_robust_list,
-		struct compat_robust_list_head __user *, head,
-		compat_size_t, len)
+COMPAT_SYSCALL_DEFINE2(set_robust_list, struct compat_robust_list_head __user *, head,
+		       compat_size_t, len)
 {
 	if (unlikely(len != sizeof(*head)))
 		return -EINVAL;

-	current->compat_robust_list = head;
-
+	current->futex.compat_robust_list = head;
 	return 0;
 }

@@ -515,4 +509,3 @@ SYSCALL_DEFINE6(futex_time32, u32 __user
 	return do_futex(uaddr, op, val, tp, uaddr2, (unsigned long)utime, val3);
 }
 #endif /* CONFIG_COMPAT_32BIT_TIME */
-
From nobody Tue Apr 7 02:33:54 2026
Date: Mon, 16 Mar 2026 18:13:02 +0100
Message-ID: <20260316164951.073076616@kernel.org>
User-Agent: quilt/0.68
From: Thomas Gleixner
To: LKML
Cc: Mathieu Desnoyers, André Almeida, Sebastian Andrzej Siewior,
 Carlos O'Donell, Peter Zijlstra, Florian Weimer, Rich Felker,
 Torvald Riegel, Darren Hart, Ingo Molnar, Davidlohr Bueso,
 Arnd Bergmann, "Liam R. Howlett"
Subject: [patch 2/8] futex: Move futex related mm_struct data into a struct
References: <20260316162316.356674433@kernel.org>

Having all these members in mm_struct along with the required
#ifdeffery is annoying, does not allow efficient initializing of the
data with memset() and makes extending it tedious.

Move it into a data structure and fix up all usage sites.

Signed-off-by: Thomas Gleixner
Reviewed-by: Mathieu Desnoyers
---
 include/linux/futex_types.h |   22 +++++++
 include/linux/mm_types.h    |   11 ---
 kernel/futex/core.c         |  123 ++++++++++++++++++++-------------------
 3 files changed, 80 insertions(+), 76 deletions(-)

--- a/include/linux/futex_types.h
+++ b/include/linux/futex_types.h
@@ -31,4 +31,26 @@ struct futex_ctrl {
 struct futex_ctrl { };
 #endif /* !CONFIG_FUTEX */

+/**
+ * struct futex_mm_data - Futex related per MM data
+ * @phash_lock:		Mutex to protect the private hash operations
+ * @phash:		RCU managed pointer to the private hash
+ * @phash_new:		Pointer to a newly allocated private hash
+ * @phash_batches:	Batch state for RCU synchronization
+ * @phash_rcu:		RCU head for call_rcu()
+ * @phash_atomic:	Aggregate value for @phash_ref
+ * @phash_ref:		Per CPU reference counter for a private hash
+ */
+struct futex_mm_data {
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+	struct mutex phash_lock;
+	struct futex_private_hash __rcu *phash;
+	struct
 futex_private_hash *phash_new;
+	unsigned long phash_batches;
+	struct rcu_head phash_rcu;
+	atomic_long_t phash_atomic;
+	unsigned int __percpu *phash_ref;
+#endif
+};
+
 #endif /* _LINUX_FUTEX_TYPES_H */
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1221,16 +1221,7 @@ struct mm_struct {
 	 */
 	seqcount_t mm_lock_seq;
 #endif
-#ifdef CONFIG_FUTEX_PRIVATE_HASH
-	struct mutex futex_hash_lock;
-	struct futex_private_hash __rcu *futex_phash;
-	struct futex_private_hash *futex_phash_new;
-	/* futex-ref */
-	unsigned long futex_batches;
-	struct rcu_head futex_rcu;
-	atomic_long_t futex_atomic;
-	unsigned int __percpu *futex_ref;
-#endif
+	struct futex_mm_data futex;

 	unsigned long hiwater_rss;	/* High-watermark of RSS usage */
 	unsigned long hiwater_vm;	/* High-water virtual memory usage */
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -188,13 +188,13 @@ static struct futex_hash_bucket *
 		return NULL;

 	if (!fph)
-		fph = rcu_dereference(key->private.mm->futex_phash);
+		fph = rcu_dereference(key->private.mm->futex.phash);
 	if (!fph || !fph->hash_mask)
 		return NULL;

-	hash = jhash2((void *)&key->private.address,
-		      sizeof(key->private.address) / 4,
+	hash = jhash2((void *)&key->private.address, sizeof(key->private.address) / 4,
 		      key->both.offset);
+
 	return &fph->queues[hash & fph->hash_mask];
 }

@@ -238,13 +238,12 @@ static bool __futex_pivot_hash(struct mm
 {
 	struct futex_private_hash *fph;

-	WARN_ON_ONCE(mm->futex_phash_new);
+	WARN_ON_ONCE(mm->futex.phash_new);

-	fph = rcu_dereference_protected(mm->futex_phash,
-					lockdep_is_held(&mm->futex_hash_lock));
+	fph = rcu_dereference_protected(mm->futex.phash, lockdep_is_held(&mm->futex.phash_lock));
 	if (fph) {
 		if (!futex_ref_is_dead(fph)) {
-			mm->futex_phash_new = new;
+			mm->futex.phash_new = new;
 			return false;
 		}

@@ -252,8 +251,8 @@ static bool __futex_pivot_hash(struct mm
 	}
 	new->state = FR_PERCPU;
 	scoped_guard(rcu) {
-		mm->futex_batches =
 get_state_synchronize_rcu();
-		rcu_assign_pointer(mm->futex_phash, new);
+		mm->futex.phash_batches = get_state_synchronize_rcu();
+		rcu_assign_pointer(mm->futex.phash, new);
 	}
 	kvfree_rcu(fph, rcu);
 	return true;
 }

@@ -261,12 +260,12 @@ static bool __futex_pivot_hash(struct mm

 static void futex_pivot_hash(struct mm_struct *mm)
 {
-	scoped_guard(mutex, &mm->futex_hash_lock) {
+	scoped_guard(mutex, &mm->futex.phash_lock) {
 		struct futex_private_hash *fph;

-		fph = mm->futex_phash_new;
+		fph = mm->futex.phash_new;
 		if (fph) {
-			mm->futex_phash_new = NULL;
+			mm->futex.phash_new = NULL;
 			__futex_pivot_hash(mm, fph);
 		}
 	}

@@ -289,7 +288,7 @@ struct futex_private_hash *futex_private
 	scoped_guard(rcu) {
 		struct futex_private_hash *fph;

-		fph = rcu_dereference(mm->futex_phash);
+		fph = rcu_dereference(mm->futex.phash);
 		if (!fph)
 			return NULL;

@@ -412,8 +411,7 @@ static int futex_mpol(struct mm_struct *
 * private hash) is returned if existing. Otherwise a hash bucket from the
 * global hash is returned.
 */
-static struct futex_hash_bucket *
-__futex_hash(union futex_key *key, struct futex_private_hash *fph)
+static struct futex_hash_bucket *__futex_hash(union futex_key *key, struct futex_private_hash *fph)
 {
 	int node = key->both.node;
 	u32 hash;
@@ -426,8 +424,7 @@ static struct futex_hash_bucket *
 		return hb;
 	}

-	hash = jhash2((u32 *)key,
-		      offsetof(typeof(*key), both.offset) / sizeof(u32),
+	hash = jhash2((u32 *)key, offsetof(typeof(*key), both.offset) / sizeof(u32),
 		      key->both.offset);

 	if (node == FUTEX_NO_NODE) {
@@ -442,8 +439,7 @@ static struct futex_hash_bucket *
 		 */
 		node = (hash >> futex_hashshift) % nr_node_ids;
 		if (!node_possible(node)) {
-			node = find_next_bit_wrap(node_possible_map.bits,
-						  nr_node_ids, node);
+			node = find_next_bit_wrap(node_possible_map.bits, nr_node_ids, node);
 		}
 	}

@@ -460,9 +456,8 @@ static struct futex_hash_bucket *
 * Return: Initialized hrtimer_sleeper structure or NULL if no timeout
 *	   value given
 */
-struct hrtimer_sleeper *
-futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
-		  int flags, u64 range_ns)
+struct hrtimer_sleeper *futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
+					  int flags, u64 range_ns)
 {
 	if (!time)
 		return NULL;

@@ -1551,17 +1546,17 @@ static void __futex_ref_atomic_begin(str
 	 * otherwise it would be impossible for it to have reported success
 	 * from futex_ref_is_dead().
 	 */
-	WARN_ON_ONCE(atomic_long_read(&mm->futex_atomic) != 0);
+	WARN_ON_ONCE(atomic_long_read(&mm->futex.phash_atomic) != 0);

 	/*
 	 * Set the atomic to the bias value such that futex_ref_{get,put}()
 	 * will never observe 0. Will be fixed up in __futex_ref_atomic_end()
 	 * when folding in the percpu count.
 	 */
-	atomic_long_set(&mm->futex_atomic, LONG_MAX);
+	atomic_long_set(&mm->futex.phash_atomic, LONG_MAX);
 	smp_store_release(&fph->state, FR_ATOMIC);

-	call_rcu_hurry(&mm->futex_rcu, futex_ref_rcu);
+	call_rcu_hurry(&mm->futex.phash_rcu, futex_ref_rcu);
 }

 static void __futex_ref_atomic_end(struct futex_private_hash *fph)
@@ -1582,7 +1577,7 @@ static void __futex_ref_atomic_end(struc
 	 * Therefore the per-cpu counter is now stable, sum and reset.
 	 */
 	for_each_possible_cpu(cpu) {
-		unsigned int *ptr = per_cpu_ptr(mm->futex_ref, cpu);
+		unsigned int *ptr = per_cpu_ptr(mm->futex.phash_ref, cpu);
 		count += *ptr;
 		*ptr = 0;
 	}

@@ -1590,7 +1585,7 @@ static void __futex_ref_atomic_end(struc
 	/*
 	 * Re-init for the next cycle.
 	 */
-	this_cpu_inc(*mm->futex_ref);	/* 0 -> 1 */
+	this_cpu_inc(*mm->futex.phash_ref);	/* 0 -> 1 */

 	/*
 	 * Add actual count, subtract bias and initial refcount.
@@ -1598,7 +1593,7 @@ static void __futex_ref_atomic_end(struc
 	 * The moment this atomic operation happens, futex_ref_is_dead() can
 	 * become true.
 	 */
-	ret = atomic_long_add_return(count - LONG_MAX - 1, &mm->futex_atomic);
+	ret = atomic_long_add_return(count - LONG_MAX - 1, &mm->futex.phash_atomic);
 	if (!ret)
 		wake_up_var(mm);

@@ -1608,8 +1603,8 @@ static void __futex_ref_atomic_end(struc

 static void futex_ref_rcu(struct rcu_head *head)
 {
-	struct mm_struct *mm = container_of(head, struct mm_struct, futex_rcu);
-	struct futex_private_hash *fph = rcu_dereference_raw(mm->futex_phash);
+	struct mm_struct *mm = container_of(head, struct mm_struct, futex.phash_rcu);
+	struct futex_private_hash *fph = rcu_dereference_raw(mm->futex.phash);

 	if (fph->state == FR_PERCPU) {
 		/*
@@ -1638,7 +1633,7 @@ static void futex_ref_drop(struct futex_
 	/*
 	 * Can only transition the current fph;
 	 */
-	WARN_ON_ONCE(rcu_dereference_raw(mm->futex_phash) != fph);
+	WARN_ON_ONCE(rcu_dereference_raw(mm->futex.phash) != fph);
 	/*
 	 * We enqueue at least one RCU callback.
 Ensure mm stays if the task
 	 * exits before the transition is completed.
@@ -1650,8 +1645,8 @@ static void futex_ref_drop(struct futex_
 	 *
 	 * futex_hash()			__futex_pivot_hash()
 	 *   guard(rcu);		  guard(mm->futex_hash_lock);
-	 *   fph = mm->futex_phash;
-	 *				  rcu_assign_pointer(&mm->futex_phash, new);
+	 *   fph = mm->futex.phash;
+	 *				  rcu_assign_pointer(&mm->futex.phash, new);
 	 *				  futex_hash_allocate()
 	 *				    futex_ref_drop()
 	 *				      fph->state = FR_ATOMIC;
@@ -1666,7 +1661,7 @@ static void futex_ref_drop(struct futex_
 	 * There must be at least one full grace-period between publishing a
 	 * new fph and trying to replace it.
 	 */
-	if (poll_state_synchronize_rcu(mm->futex_batches)) {
+	if (poll_state_synchronize_rcu(mm->futex.phash_batches)) {
 		/*
 		 * There was a grace-period, we can begin now.
 		 */
@@ -1674,7 +1669,7 @@ static void futex_ref_drop(struct futex_
 		return;
 	}

-	call_rcu_hurry(&mm->futex_rcu, futex_ref_rcu);
+	call_rcu_hurry(&mm->futex.phash_rcu, futex_ref_rcu);
 }

 static bool futex_ref_get(struct futex_private_hash *fph)
@@ -1684,11 +1679,11 @@ static bool futex_ref_get(struct futex_p
 	guard(preempt)();

 	if (READ_ONCE(fph->state) == FR_PERCPU) {
-		__this_cpu_inc(*mm->futex_ref);
+		__this_cpu_inc(*mm->futex.phash_ref);
 		return true;
 	}

-	return atomic_long_inc_not_zero(&mm->futex_atomic);
+	return atomic_long_inc_not_zero(&mm->futex.phash_atomic);
 }

 static bool futex_ref_put(struct futex_private_hash *fph)
@@ -1698,11 +1693,11 @@ static bool futex_ref_put(struct futex_p
 	guard(preempt)();

 	if (READ_ONCE(fph->state) == FR_PERCPU) {
-		__this_cpu_dec(*mm->futex_ref);
+		__this_cpu_dec(*mm->futex.phash_ref);
 		return false;
 	}

-	return atomic_long_dec_and_test(&mm->futex_atomic);
+	return atomic_long_dec_and_test(&mm->futex.phash_atomic);
 }

 static bool futex_ref_is_dead(struct futex_private_hash *fph)
@@ -1714,18 +1709,14 @@ static bool futex_ref_is_dead(struct fut
 	if (smp_load_acquire(&fph->state) == FR_PERCPU)
 		return false;

-	return
 atomic_long_read(&mm->futex_atomic) == 0;
+	return atomic_long_read(&mm->futex.phash_atomic) == 0;
 }

 int futex_mm_init(struct mm_struct *mm)
 {
-	mutex_init(&mm->futex_hash_lock);
-	RCU_INIT_POINTER(mm->futex_phash, NULL);
-	mm->futex_phash_new = NULL;
-	/* futex-ref */
-	mm->futex_ref = NULL;
-	atomic_long_set(&mm->futex_atomic, 0);
-	mm->futex_batches = get_state_synchronize_rcu();
+	memset(&mm->futex, 0, sizeof(mm->futex));
+	mutex_init(&mm->futex.phash_lock);
+	mm->futex.phash_batches = get_state_synchronize_rcu();
 	return 0;
 }

@@ -1733,9 +1724,9 @@ void futex_hash_free(struct mm_struct *m
 {
 	struct futex_private_hash *fph;

-	free_percpu(mm->futex_ref);
-	kvfree(mm->futex_phash_new);
-	fph = rcu_dereference_raw(mm->futex_phash);
+	free_percpu(mm->futex.phash_ref);
+	kvfree(mm->futex.phash_new);
+	fph = rcu_dereference_raw(mm->futex.phash);
 	if (fph)
 		kvfree(fph);
 }

@@ -1746,10 +1737,10 @@ static bool futex_pivot_pending(struct m

 	guard(rcu)();

-	if (!mm->futex_phash_new)
+	if (!mm->futex.phash_new)
 		return true;

-	fph = rcu_dereference(mm->futex_phash);
+	fph = rcu_dereference(mm->futex.phash);
 	return futex_ref_is_dead(fph);
 }

@@ -1791,7 +1782,7 @@ static int futex_hash_allocate(unsigned
 	 * Once we've disabled the global hash there is no way back.
 	 */
 	scoped_guard(rcu) {
-		fph = rcu_dereference(mm->futex_phash);
+		fph = rcu_dereference(mm->futex.phash);
 		if (fph && !fph->hash_mask) {
 			if (custom)
 				return -EBUSY;
@@ -1799,15 +1790,15 @@
 		}
 	}

-	if (!mm->futex_ref) {
+	if (!mm->futex.phash_ref) {
 		/*
 		 * This will always be allocated by the first thread and
 		 * therefore requires no locking.
*/ - mm->futex_ref =3D alloc_percpu(unsigned int); - if (!mm->futex_ref) + mm->futex.phash_ref =3D alloc_percpu(unsigned int); + if (!mm->futex.phash_ref) return -ENOMEM; - this_cpu_inc(*mm->futex_ref); /* 0 -> 1 */ + this_cpu_inc(*mm->futex.phash_ref); /* 0 -> 1 */ } =20 fph =3D kvzalloc(struct_size(fph, queues, hash_slots), @@ -1830,14 +1821,14 @@ static int futex_hash_allocate(unsigned wait_var_event(mm, futex_pivot_pending(mm)); } =20 - scoped_guard(mutex, &mm->futex_hash_lock) { + scoped_guard(mutex, &mm->futex.phash_lock) { struct futex_private_hash *free __free(kvfree) =3D NULL; struct futex_private_hash *cur, *new; =20 - cur =3D rcu_dereference_protected(mm->futex_phash, - lockdep_is_held(&mm->futex_hash_lock)); - new =3D mm->futex_phash_new; - mm->futex_phash_new =3D NULL; + cur =3D rcu_dereference_protected(mm->futex.phash, + lockdep_is_held(&mm->futex.phash_lock)); + new =3D mm->futex.phash_new; + mm->futex.phash_new =3D NULL; =20 if (fph) { if (cur && !cur->hash_mask) { @@ -1847,7 +1838,7 @@ static int futex_hash_allocate(unsigned * the second one returns here. */ free =3D fph; - mm->futex_phash_new =3D new; + mm->futex.phash_new =3D new; return -EBUSY; } if (cur && !new) { @@ -1877,7 +1868,7 @@ static int futex_hash_allocate(unsigned =20 if (new) { /* - * Will set mm->futex_phash_new on failure; + * Will set mm->futex.phash_new on failure; * futex_private_hash_get() will try again. 
*/ if (!__futex_pivot_hash(mm, new) && custom) @@ -1900,7 +1891,7 @@ int futex_hash_allocate_default(void) get_nr_threads(current), num_online_cpus()); =20 - fph =3D rcu_dereference(current->mm->futex_phash); + fph =3D rcu_dereference(current->mm->futex.phash); if (fph) { if (fph->custom) return 0; @@ -1927,7 +1918,7 @@ static int futex_hash_get_slots(void) struct futex_private_hash *fph; =20 guard(rcu)(); - fph =3D rcu_dereference(current->mm->futex_phash); + fph =3D rcu_dereference(current->mm->futex.phash); if (fph && fph->hash_mask) return fph->hash_mask + 1; return 0; From nobody Tue Apr 7 02:33:54 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D99211F8AC5 for ; Mon, 16 Mar 2026 17:13:11 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1773681191; cv=none; b=Bwlyw0u86sh0zZ+lfxlEEGCoEG2SdG1VqOCNhPIOa8vQ+l/apY/r2IjodIxzl69RKS3OK36RRX8/zh+3GeiYQXFnRnVi5G4fiIyQdIZyPZ7ttVR9q6vEs9CkJh82WdLxffli67Ov2zQ6PMLIbkl7iLThTGr9/xohu9ZxHwYXJY0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1773681191; c=relaxed/simple; bh=0Bq4q6JEc8CXmF5E+eLm61dqJMwU3RJtb7rzPGnG4AY=; h=Date:Message-ID:From:To:Cc:Subject:References:MIME-Version: Content-Type; b=nJEaRuptZZzqd9jTuP8ht/rYh6elMgupLLXrrpVuzHcC3lMhwISBD1K0euX+u1eJowCvnNmV09BQEMNMYPQPB/qOiYGFMGqvZQR8DBMpRZL8PF1xbiGnRDdn5E2rH/+JMlkVfRJfvr1oUhvHqbNR52oKw0AWbiPB8jDE57zAqAI= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=blHaHkG7; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org 
header.b="blHaHkG7" Received: by smtp.kernel.org (Postfix) with ESMTPSA id B01F1C19421; Mon, 16 Mar 2026 17:13:10 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1773681191; bh=0Bq4q6JEc8CXmF5E+eLm61dqJMwU3RJtb7rzPGnG4AY=; h=Date:From:To:Cc:Subject:References:From; b=blHaHkG79Xw0SavuhqHK6199e8NOcBKW5Y75D+1GL7UZHDIWFZDs/hUeWwFh88twX kLcu/Qc595ipqs+hrOKLjeOGv6WV8wsVDHek+gxwZJx37mr0p0KE6Rqm9nPGrvri5Y TM73Wje252381rRx135UIBRG7DHkHCd+AyndN2lc5lLf+aKaBuT1w2h9z2Pdpvpn9P TdGprrm2xxNt+z1bMrNAFDSS5CcuThgjWigMyPBPhb9dKW+esIGqaU0xU5t5s//j4L Ra6kHxnKxV9xSMBvEd8hGIAf/0FrCRcOtLA8LInxOePDNEmnCyMEe/3nZIz5+2xf2t ntT+mR8qyXdmQ== Date: Mon, 16 Mar 2026 18:13:07 +0100 Message-ID: <20260316164951.141212347@kernel.org> User-Agent: quilt/0.68 From: Thomas Gleixner To: LKML Cc: Mathieu Desnoyers , =?UTF-8?q?Andr=C3=A9=20Almeida?= , Sebastian Andrzej Siewior , Carlos O'Donell , Peter Zijlstra , Florian Weimer , Rich Felker , Torvald Riegel , Darren Hart , Ingo Molnar , Davidlohr Bueso , Arnd Bergmann , "Liam R . Howlett" Subject: [patch 3/8] futex: Provide UABI defines for robust list entry modifiers References: <20260316162316.356674433@kernel.org> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" The marker for PI futexes in the robust list is a hardcoded 0x1 which lacks any sensible form of documentation. Provide proper defines for the bit and the mask and fix up the usage sites. 
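The encoding the defines describe can be sketched in user space. This is a hypothetical model of the kernel's fetch_robust_entry() logic from the patch below, not kernel code: a robust list entry word carries the real pointer with the PI marker stuffed into the otherwise-unused low bit.

```c
#include <stdint.h>

/* Mirrors the new UAPI defines added by this patch. */
#define FUTEX_ROBUST_MOD_PI	(0x1UL)
#define FUTEX_ROBUST_MOD_MASK	(FUTEX_ROBUST_MOD_PI)

/*
 * Split a robust list entry word into the real entry pointer and its
 * modifier bits, as fetch_robust_entry() does after get_user().
 */
static void decode_robust_entry(uintptr_t uentry, uintptr_t *entry,
				unsigned int *mod)
{
	*entry = uentry & ~FUTEX_ROBUST_MOD_MASK;
	*mod = uentry & FUTEX_ROBUST_MOD_MASK;
}
```

This works because robust list entries are naturally aligned u32/u64 pointers, so the low bit is always free for the marker.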
Signed-off-by: Thomas Gleixner
Reviewed-by: Mathieu Desnoyers
---
 include/uapi/linux/futex.h |  4 +++
 kernel/futex/core.c        | 53 ++++++++++++++++++++++----------------------
 2 files changed, 29 insertions(+), 28 deletions(-)

--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -177,6 +177,10 @@ struct robust_list_head {
 */
 #define ROBUST_LIST_LIMIT	2048

+/* Modifiers for robust_list_head::list_op_pending */
+#define FUTEX_ROBUST_MOD_PI	(0x1UL)
+#define FUTEX_ROBUST_MOD_MASK	(FUTEX_ROBUST_MOD_PI)
+
 /*
 * bitset with all bits set for the FUTEX_xxx_BITSET OPs to request a
 * match of any bit.
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1009,8 +1009,9 @@ void futex_unqueue_pi(struct futex_q *q)
 * dying task, and do notification if so:
 */
 static int handle_futex_death(u32 __user *uaddr, struct task_struct *curr,
-			      bool pi, bool pending_op)
+			      unsigned int mod, bool pending_op)
 {
+	bool pi = !!(mod & FUTEX_ROBUST_MOD_PI);
	u32 uval, nval, mval;
	pid_t owner;
	int err;
@@ -1128,21 +1129,21 @@ static int handle_futex_death(u32 __user
 */
 static inline int fetch_robust_entry(struct robust_list __user **entry,
				     struct robust_list __user * __user *head,
-				     unsigned int *pi)
+				     unsigned int *mod)
 {
	unsigned long uentry;

	if (get_user(uentry, (unsigned long __user *)head))
		return -EFAULT;

-	*entry = (void __user *)(uentry & ~1UL);
-	*pi = uentry & 1;
+	*entry = (void __user *)(uentry & ~FUTEX_ROBUST_MOD_MASK);
+	*mod = uentry & FUTEX_ROBUST_MOD_MASK;

	return 0;
 }

 /*
- * Walk curr->robust_list (very carefully, it's a userspace list!)
+ * Walk curr->futex.robust_list (very carefully, it's a userspace list!)
 * and mark any locks found there dead, and notify any waiters.
 *
 * We silently return on any sign of list-walking problem.
@@ -1150,9 +1151,8 @@ static inline int fetch_robust_entry(str
 static void exit_robust_list(struct task_struct *curr)
 {
	struct robust_list_head __user *head = curr->futex.robust_list;
+	unsigned int limit = ROBUST_LIST_LIMIT, cur_mod, next_mod, pend_mod;
	struct robust_list __user *entry, *next_entry, *pending;
-	unsigned int limit = ROBUST_LIST_LIMIT, pi, pip;
-	unsigned int next_pi;
	unsigned long futex_offset;
	int rc;

@@ -1160,7 +1160,7 @@ static void exit_robust_list(struct task
	 * Fetch the list head (which was registered earlier, via
	 * sys_set_robust_list()):
	 */
-	if (fetch_robust_entry(&entry, &head->list.next, &pi))
+	if (fetch_robust_entry(&entry, &head->list.next, &cur_mod))
		return;
	/*
	 * Fetch the relative futex offset:
@@ -1171,7 +1171,7 @@ static void exit_robust_list(struct task
	 * Fetch any possibly pending lock-add first, and handle it
	 * if it exists:
	 */
-	if (fetch_robust_entry(&pending, &head->list_op_pending, &pip))
+	if (fetch_robust_entry(&pending, &head->list_op_pending, &pend_mod))
		return;

	next_entry = NULL;	/* avoid warning with gcc */
@@ -1180,20 +1180,20 @@ static void exit_robust_list(struct task
		 * Fetch the next entry in the list before calling
		 * handle_futex_death:
		 */
-		rc = fetch_robust_entry(&next_entry, &entry->next, &next_pi);
+		rc = fetch_robust_entry(&next_entry, &entry->next, &next_mod);
		/*
		 * A pending lock might already be on the list, so
		 * don't process it twice:
		 */
		if (entry != pending) {
			if (handle_futex_death((void __user *)entry + futex_offset,
-					       curr, pi, HANDLE_DEATH_LIST))
+					       curr, cur_mod, HANDLE_DEATH_LIST))
				return;
		}
		if (rc)
			return;
		entry = next_entry;
-		pi = next_pi;
+		cur_mod = next_mod;
		/*
		 * Avoid excessively long or circular lists:
		 */
@@ -1205,7 +1205,7 @@ static void exit_robust_list(struct task

	if (pending) {
		handle_futex_death((void __user *)pending + futex_offset,
-				   curr, pip, HANDLE_DEATH_PENDING);
+				   curr, pend_mod, HANDLE_DEATH_PENDING);
	}
 }

@@ -1224,29 +1224,28 @@ static void __user *futex_uaddr(struct r
 */
 static inline int
 compat_fetch_robust_entry(compat_uptr_t *uentry, struct robust_list __user **entry,
-			  compat_uptr_t __user *head, unsigned int *pi)
+			  compat_uptr_t __user *head, unsigned int *pflags)
 {
	if (get_user(*uentry, head))
		return -EFAULT;

-	*entry = compat_ptr((*uentry) & ~1);
-	*pi = (unsigned int)(*uentry) & 1;
+	*entry = compat_ptr((*uentry) & ~FUTEX_ROBUST_MOD_MASK);
+	*pflags = (unsigned int)(*uentry) & FUTEX_ROBUST_MOD_MASK;

	return 0;
 }

 /*
- * Walk curr->robust_list (very carefully, it's a userspace list!)
+ * Walk curr->futex.robust_list (very carefully, it's a userspace list!)
 * and mark any locks found there dead, and notify any waiters.
 *
 * We silently return on any sign of list-walking problem.
 */
 static void compat_exit_robust_list(struct task_struct *curr)
 {
-	struct compat_robust_list_head __user *head = curr->futex.compat_robust_list;
+	struct compat_robust_list_head __user *head = current->futex.compat_robust_list;
+	unsigned int limit = ROBUST_LIST_LIMIT, cur_mod, next_mod, pend_mod;
	struct robust_list __user *entry, *next_entry, *pending;
-	unsigned int limit = ROBUST_LIST_LIMIT, pi, pip;
-	unsigned int next_pi;
	compat_uptr_t uentry, next_uentry, upending;
	compat_long_t futex_offset;
	int rc;
@@ -1255,7 +1254,7 @@ static void compat_exit_robust_list(stru
	 * Fetch the list head (which was registered earlier, via
	 * sys_set_robust_list()):
	 */
-	if (compat_fetch_robust_entry(&uentry, &entry, &head->list.next, &pi))
+	if (compat_fetch_robust_entry(&uentry, &entry, &head->list.next, &cur_mod))
		return;
	/*
	 * Fetch the relative futex offset:
@@ -1266,8 +1265,7 @@ static void compat_exit_robust_list(stru
	 * Fetch any possibly pending lock-add first, and handle it
	 * if it exists:
	 */
-	if (compat_fetch_robust_entry(&upending, &pending,
-				      &head->list_op_pending, &pip))
+	if (compat_fetch_robust_entry(&upending, &pending, &head->list_op_pending, &pend_mod))
		return;

	next_entry = NULL;	/* avoid warning with gcc */
@@ -1277,7 +1275,7 @@ static void compat_exit_robust_list(stru
		 * handle_futex_death:
		 */
		rc = compat_fetch_robust_entry(&next_uentry, &next_entry,
-			(compat_uptr_t __user *)&entry->next, &next_pi);
+			(compat_uptr_t __user *)&entry->next, &next_mod);
		/*
		 * A pending lock might already be on the list, so
		 * don't process it twice:
@@ -1285,15 +1283,14 @@ static void compat_exit_robust_list(stru
		if (entry != pending) {
			void __user *uaddr = futex_uaddr(entry, futex_offset);

-			if (handle_futex_death(uaddr, curr, pi,
-					       HANDLE_DEATH_LIST))
+			if (handle_futex_death(uaddr, curr, cur_mod, HANDLE_DEATH_LIST))
				return;
		}
		if (rc)
			return;
		uentry = next_uentry;
		entry = next_entry;
-		pi = next_pi;
+		cur_mod = next_mod;
		/*
		 * Avoid excessively long or circular lists:
		 */
@@ -1305,7 +1302,7 @@ static void compat_exit_robust_list(stru
	if (pending) {
		void __user *uaddr = futex_uaddr(pending, futex_offset);

-		handle_futex_death(uaddr, curr, pip, HANDLE_DEATH_PENDING);
+		handle_futex_death(uaddr, curr, pend_mod, HANDLE_DEATH_PENDING);
	}
 }
 #endif

From nobody Tue Apr  7 02:33:54 2026
Date: Mon, 16 Mar 2026 18:13:13 +0100
Message-ID: <20260316164951.209959583@kernel.org>
From: Thomas Gleixner
To: LKML
Cc: Mathieu Desnoyers, André Almeida, Sebastian Andrzej Siewior, Carlos O'Donell, Peter Zijlstra, Florian Weimer, Rich Felker, Torvald Riegel, Darren Hart, Ingo Molnar, Davidlohr Bueso, Arnd Bergmann, "Liam R. Howlett"
Subject: [patch 4/8] futex: Add support for unlocking robust futexes
References: <20260316162316.356674433@kernel.org>

Unlocking robust non-PI futexes happens in user space with the following
sequence:

	1) robust_list_set_op_pending(mutex);
	2) robust_list_remove(mutex);
	   lval = 0;
	3) atomic_xchg(lock, lval);
	4) if (lval & WAITERS)
	5)	sys_futex(WAKE, ...);
	6) robust_list_clear_op_pending();

That opens a window between #3 and #6 where the mutex could be acquired
by some other task, which observes that it is the last user and:

  A) unmaps the mutex memory

  B) maps a different file, which ends up covering the same address

When the original task exits before reaching #6, the kernel robust list
handling observes the pending op entry and tries to fix up user space.
If the newly mapped data contains the TID of the exiting thread at the
address of the mutex/futex, the kernel will set the owner died bit in
that memory, thereby corrupting unrelated data.

PI futexes have a similar problem, both for the non-contended user space
unlock and the in-kernel unlock:

	1) robust_list_set_op_pending(mutex);
	2) robust_list_remove(mutex);
	   lval = gettid();
	3) if (!atomic_try_cmpxchg(lock, lval, 0))
	4)	sys_futex(UNLOCK_PI, ...);
	5) robust_list_clear_op_pending();

Address the first part of the problem, where the futexes have waiters
and need to enter the kernel anyway. Add a new FUTEX_UNLOCK_ROBUST flag,
which is valid for the sys_futex() FUTEX_UNLOCK_PI, FUTEX_WAKE and
FUTEX_WAKE_BITSET operations. This deliberately omits FUTEX_WAKE_OP as
it is unclear whether it is needed, and there is no usage of it in glibc
to investigate either. For the futex2 syscall family this needs to be
implemented with a new syscall.
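The racy non-PI unlock sequence above can be modeled in plain user-space C. This is an illustrative sketch, not glibc code: the robust_list_* helpers are hypothetical stand-ins reduced to pending-op bookkeeping, and the actual sys_futex() call is elided.

```c
#include <stdatomic.h>
#include <stddef.h>

#define WAITERS 0x80000000u

/* Hypothetical stand-ins for the user-space robust list bookkeeping. */
_Atomic(unsigned int) *op_pending;

static void robust_list_set_op_pending(_Atomic(unsigned int) *lock)
{
	op_pending = lock;
}

static void robust_list_remove(_Atomic(unsigned int) *lock)
{
	(void)lock;	/* unlink from the per-thread robust list */
}

static void robust_list_clear_op_pending(void)
{
	op_pending = NULL;
}

/* Sketch of the racy non-PI unlock path described in the changelog. */
static void robust_mutex_unlock(_Atomic(unsigned int) *lock)
{
	robust_list_set_op_pending(lock);		/* 1 */
	robust_list_remove(lock);			/* 2 */
	unsigned int lval = atomic_exchange(lock, 0);	/* 3: window opens */
	if (lval & WAITERS) {
		/* 4/5: sys_futex(lock, FUTEX_WAKE, ...) would go here */
	}
	robust_list_clear_op_pending();			/* 6: window closes */
}
```

If the thread dies between steps 3 and 6, exit-time robust list handling still sees the pending op and may scribble on whatever now backs that address, which is exactly the corruption the changelog describes.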
The sys_futex() case [ab]uses the @uaddr2 argument to hand the pointer
to the kernel. This argument is only evaluated when the
FUTEX_UNLOCK_ROBUST bit is set and is therefore backward compatible.

The pointer has a modifier to indicate that it points to a u32 and not
to a u64. This is required for several reasons:

 1) sys_futex() has no compat variant

 2) Gaming emulators use both 64-bit and compat 32-bit robust lists in
    the same 64-bit application

 3) Having the pointer handed in spares the evaluation of the registered
    robust lists to figure out whether the futex address matches the
    registered [compat_]robust_list_head::list_op_pending pointer.

As a consequence, 32-bit applications have to set this bit
unconditionally so they can run unmodified on a 64-bit kernel in compat
mode. 32-bit kernels return an error code when the bit is not set.
64-bit kernels will happily clear the full 64 bits if user space fails
to set it.

In case of FUTEX_UNLOCK_PI this clears the robust list pending op when
the unlock succeeded. In case of errors, the user space value is still
locked by the caller and therefore the above cannot happen.

In case of FUTEX_WAKE* this does the unlock of the futex in the kernel
and clears the robust list pending op when the unlock was successful.
If not, the user space value is still locked and user space has to deal
with the returned error. That means the unlocking of non-PI robust
futexes has to use the same try_cmpxchg() unlock scheme as PI futexes.

If clearing the pending list op fails (fault), the kernel clears the
registered robust list pointer if it matches, to prevent exit() from
trying to handle invalid data. That's a valid paranoid decision because
the robust list head usually sits in the TLS, and if the TLS is no
longer accessible the chance of fixing up the resulting mess is very
close to zero.

The problem of non-contended unlocks still exists and will be addressed
separately.
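The size-modifier handling on @uaddr2 can be sketched as a user-space model of the kernel's pointer decoding, with names taken from the patch below (mask_pop_addr()) but plain uintptr_t substituted for void __user *:

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the new UAPI defines added by this patch. */
#define FUTEX_ROBUST_UNLOCK_MOD_32BIT	(0x1UL)
#define FUTEX_ROBUST_UNLOCK_MOD_MASK	(FUTEX_ROBUST_UNLOCK_MOD_32BIT)

/*
 * Strip the size modifier from the pending-op pointer and report whether
 * user space declared it to point to a u32. Models mask_pop_addr().
 */
static bool mask_pop_addr(uintptr_t *pop)
{
	uintptr_t addr = *pop;

	*pop = addr & ~FUTEX_ROBUST_UNLOCK_MOD_MASK;
	return !!(addr & FUTEX_ROBUST_UNLOCK_MOD_32BIT);
}
```

The low bit is again free for the modifier because the pending-op pointer is at least 4-byte aligned; the kernel then routes the stripped pointer to either the native or the compat clear function.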
Signed-off-by: Thomas Gleixner
Reviewed-by: Mathieu Desnoyers
---
 include/uapi/linux/futex.h | 20 ++++++++++++++
 io_uring/futex.c           |  2 -
 kernel/futex/core.c        | 61 ++++++++++++++++++++++++++++++++++++++++++---
 kernel/futex/futex.h       | 11 ++++++--
 kernel/futex/pi.c          | 15 +++++++++--
 kernel/futex/syscalls.c    | 13 +++++++--
 kernel/futex/waitwake.c    | 27 ++++++++++++++++++-
 7 files changed, 136 insertions(+), 13 deletions(-)

--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -25,7 +25,8 @@

 #define FUTEX_PRIVATE_FLAG	128
 #define FUTEX_CLOCK_REALTIME	256
-#define FUTEX_CMD_MASK		~(FUTEX_PRIVATE_FLAG | FUTEX_CLOCK_REALTIME)
+#define FUTEX_UNLOCK_ROBUST	512
+#define FUTEX_CMD_MASK		~(FUTEX_PRIVATE_FLAG | FUTEX_CLOCK_REALTIME | FUTEX_UNLOCK_ROBUST)

 #define FUTEX_WAIT_PRIVATE	(FUTEX_WAIT | FUTEX_PRIVATE_FLAG)
 #define FUTEX_WAKE_PRIVATE	(FUTEX_WAKE | FUTEX_PRIVATE_FLAG)
@@ -182,6 +183,23 @@ struct robust_list_head {
 #define FUTEX_ROBUST_MOD_MASK	(FUTEX_ROBUST_MOD_PI)

 /*
+ * Modifier for FUTEX_UNLOCK_ROBUST uaddr2. Required to distinguish the storage
+ * size for the robust_list_head::list_op_pending. This solves two problems:
+ *
+ * 1) COMPAT tasks
+ *
+ * 2) The mixed mode magic gaming use case which has both 32-bit and 64-bit
+ *    robust lists. Oh well....
+ *
+ * Long story short: 32-bit userspace must set this bit unconditionally to
+ * ensure that it can run on a 64-bit kernel in compat mode. If user space
+ * screws that up a 64-bit kernel will happily clear the full 64 bits. 32-bit
+ * kernels return an error code if the bit is not set.
+ */
+#define FUTEX_ROBUST_UNLOCK_MOD_32BIT	(0x1UL)
+#define FUTEX_ROBUST_UNLOCK_MOD_MASK	(FUTEX_ROBUST_UNLOCK_MOD_32BIT)
+
+/*
 * bitset with all bits set for the FUTEX_xxx_BITSET OPs to request a
 * match of any bit.
 */
--- a/io_uring/futex.c
+++ b/io_uring/futex.c
@@ -325,7 +325,7 @@ int io_futex_wake(struct io_kiocb *req,
	 * Strict flags - ensure that waking 0 futexes yields a 0 result.
	 * See commit 43adf8449510 ("futex: FLAGS_STRICT") for details.
	 */
-	ret = futex_wake(iof->uaddr, FLAGS_STRICT | iof->futex_flags,
+	ret = futex_wake(iof->uaddr, FLAGS_STRICT | iof->futex_flags, NULL,
			 iof->futex_val, iof->futex_mask);
	if (ret < 0)
		req_set_fail(req);
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1063,7 +1063,7 @@ static int handle_futex_death(u32 __user
	owner = uval & FUTEX_TID_MASK;

	if (pending_op && !pi && !owner) {
-		futex_wake(uaddr, FLAGS_SIZE_32 | FLAGS_SHARED, 1,
+		futex_wake(uaddr, FLAGS_SIZE_32 | FLAGS_SHARED, NULL, 1,
			   FUTEX_BITSET_MATCH_ANY);
		return 0;
	}
@@ -1117,7 +1117,7 @@ static int handle_futex_death(u32 __user
	 * PI futexes happens in exit_pi_state():
	 */
	if (!pi && (uval & FUTEX_WAITERS)) {
-		futex_wake(uaddr, FLAGS_SIZE_32 | FLAGS_SHARED, 1,
+		futex_wake(uaddr, FLAGS_SIZE_32 | FLAGS_SHARED, NULL, 1,
			   FUTEX_BITSET_MATCH_ANY);
	}

@@ -1209,6 +1209,27 @@ static void exit_robust_list(struct task
	}
 }

+static bool robust_list_clear_pending(unsigned long __user *pop)
+{
+	struct robust_list_head __user *head = current->futex.robust_list;
+
+	if (!put_user(0UL, pop))
+		return true;
+
+	/*
+	 * Just give up. The robust list head is usually part of TLS, so the
+	 * chance that this gets resolved is close to zero.
+	 *
+	 * If @pop is the robust_list_head::list_op_pending pointer then
+	 * clear the robust list head pointer to prevent further damage when
+	 * the task exits. Better a few stale futexes than corrupted memory.
+	 * But that's mostly an academic exercise.
+	 */
+	if (pop == (unsigned long __user *)&head->list_op_pending)
+		current->futex.robust_list = NULL;
+	return false;
+}
+
 #ifdef CONFIG_COMPAT
 static void __user *futex_uaddr(struct robust_list __user *entry,
				compat_long_t futex_offset)
@@ -1305,6 +1326,21 @@ static void compat_exit_robust_list(stru
		handle_futex_death(uaddr, curr, pend_mod, HANDLE_DEATH_PENDING);
	}
 }
+
+static bool compat_robust_list_clear_pending(u32 __user *pop)
+{
+	struct compat_robust_list_head __user *head = current->futex.compat_robust_list;
+
+	if (!put_user(0U, pop))
+		return true;
+
+	/* See comment in robust_list_clear_pending(). */
+	if (pop == &head->list_op_pending)
+		current->futex.compat_robust_list = NULL;
+	return false;
+}
+#else
+static bool compat_robust_list_clear_pending(u32 __user *pop) { return false; }
 #endif

 #ifdef CONFIG_FUTEX_PI
@@ -1398,6 +1434,27 @@ static void exit_pi_state_list(struct ta
 static inline void exit_pi_state_list(struct task_struct *curr) { }
 #endif

+static inline bool mask_pop_addr(void __user **pop)
+{
+	unsigned long addr = (unsigned long)*pop;
+
+	*pop = (void __user *)(addr & ~FUTEX_ROBUST_UNLOCK_MOD_MASK);
+	return !!(addr & FUTEX_ROBUST_UNLOCK_MOD_32BIT);
+}
+
+bool futex_robust_list_clear_pending(void __user *pop)
+{
+	bool size32bit = mask_pop_addr(&pop);
+
+	if (!IS_ENABLED(CONFIG_64BIT) && !size32bit)
+		return false;
+
+	if (IS_ENABLED(CONFIG_64BIT) && size32bit)
+		return compat_robust_list_clear_pending(pop);
+
+	return robust_list_clear_pending(pop);
+}
+
 static void futex_cleanup(struct task_struct *tsk)
 {
	if (unlikely(tsk->futex.robust_list)) {
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -40,6 +40,7 @@
 #define FLAGS_NUMA		0x0080
 #define FLAGS_STRICT		0x0100
 #define FLAGS_MPOL		0x0200
+#define FLAGS_UNLOCK_ROBUST	0x0400

 /* FUTEX_ to FLAGS_ */
 static inline unsigned int futex_to_flags(unsigned int op)
@@ -52,6 +53,9 @@ static inline unsigned int futex_to_flag
	if (op & FUTEX_CLOCK_REALTIME)
		flags |= FLAGS_CLOCKRT;

+	if (op & FUTEX_UNLOCK_ROBUST)
+		flags |= FLAGS_UNLOCK_ROBUST;
+
	return flags;
 }

@@ -438,13 +442,16 @@ extern int futex_unqueue_multiple(struct
 extern int futex_wait_multiple(struct futex_vector *vs, unsigned int count,
			       struct hrtimer_sleeper *to);

-extern int futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset);
+extern int futex_wake(u32 __user *uaddr, unsigned int flags, void __user *pop,
+		      int nr_wake, u32 bitset);

 extern int futex_wake_op(u32 __user *uaddr1, unsigned int flags,
			 u32 __user *uaddr2, int nr_wake, int nr_wake2, int op);

-extern int futex_unlock_pi(u32 __user *uaddr, unsigned int flags);
+extern int futex_unlock_pi(u32 __user *uaddr, unsigned int flags, void __user *pop);

 extern int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int trylock);

+bool futex_robust_list_clear_pending(void __user *pop);
+
 #endif /* _FUTEX_H */
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -1129,7 +1129,7 @@ int futex_lock_pi(u32 __user *uaddr, uns
 * This is the in-kernel slowpath: we look up the PI state (if any),
 * and do the rt-mutex unlock.
 */
-int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
+static int __futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 {
	u32 curval, uval, vpid = task_pid_vnr(current);
	union futex_key key = FUTEX_KEY_INIT;
@@ -1138,7 +1138,6 @@ int futex_unlock_pi(u32 __user *uaddr, u

	if (!IS_ENABLED(CONFIG_FUTEX_PI))
		return -ENOSYS;
-
 retry:
	if (get_user(uval, uaddr))
		return -EFAULT;
@@ -1292,3 +1291,15 @@ int futex_unlock_pi(u32 __user *uaddr, u
	return ret;
 }

+int futex_unlock_pi(u32 __user *uaddr, unsigned int flags, void __user *pop)
+{
+	int ret = __futex_unlock_pi(uaddr, flags);
+
+	if (ret || !(flags & FLAGS_UNLOCK_ROBUST))
+		return ret;
+
+	if (!futex_robust_list_clear_pending(pop))
+		return -EFAULT;
+
+	return 0;
+}
--- a/kernel/futex/syscalls.c
+++ b/kernel/futex/syscalls.c
@@ -118,6 +118,13 @@ long do_futex(u32 __user *uaddr, int op,
		return -ENOSYS;
	}

+	if (flags & FLAGS_UNLOCK_ROBUST) {
+		if (cmd != FUTEX_WAKE &&
+		    cmd != FUTEX_WAKE_BITSET &&
+		    cmd != FUTEX_UNLOCK_PI)
+			return -ENOSYS;
+	}
+
	switch (cmd) {
	case FUTEX_WAIT:
		val3 = FUTEX_BITSET_MATCH_ANY;
@@ -128,7 +135,7 @@ long do_futex(u32 __user *uaddr, int op,
		val3 = FUTEX_BITSET_MATCH_ANY;
		fallthrough;
	case FUTEX_WAKE_BITSET:
-		return futex_wake(uaddr, flags, val, val3);
+		return futex_wake(uaddr, flags, uaddr2, val, val3);
	case FUTEX_REQUEUE:
		return futex_requeue(uaddr, flags, uaddr2, flags, val, val2, NULL, 0);
	case FUTEX_CMP_REQUEUE:
@@ -141,7 +148,7 @@ long do_futex(u32 __user *uaddr, int op,
	case FUTEX_LOCK_PI2:
		return futex_lock_pi(uaddr, flags, timeout, 0);
	case FUTEX_UNLOCK_PI:
-		return futex_unlock_pi(uaddr, flags);
+		return futex_unlock_pi(uaddr, flags, uaddr2);
	case FUTEX_TRYLOCK_PI:
		return futex_lock_pi(uaddr, flags, NULL, 1);
	case FUTEX_WAIT_REQUEUE_PI:
@@ -375,7 +382,7 @@ SYSCALL_DEFINE4(futex_wake,
	if (!futex_validate_input(flags, mask))
		return -EINVAL;

-	return futex_wake(uaddr, FLAGS_STRICT | flags, nr, mask);
+	return futex_wake(uaddr, FLAGS_STRICT | flags, NULL, nr, mask);
 }

 /*
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -150,12 +150,32 @@ void futex_wake_mark(struct wake_q_head
 }

 /*
+ * If requested, clear the robust list pending op and unlock the futex
+ */
+static bool futex_robust_unlock(u32 __user *uaddr, unsigned int flags, void __user *pop)
+{
+	if (!(flags & FLAGS_UNLOCK_ROBUST))
+		return true;
+
+	/* First unlock the futex. */
+	if (put_user(0U, uaddr))
+		return false;
+
+	/*
+	 * Clear the pending list op now. If that fails, then the task is in
+	 * deeper trouble as the robust list head is usually part of TLS. The
+	 * chance of survival is close to zero.
+	 */
+	return futex_robust_list_clear_pending(pop);
+}
+
+/*
 * Wake up waiters matching bitset queued on this futex (uaddr).
 */
-int futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
+int futex_wake(u32 __user *uaddr, unsigned int flags, void __user *pop, int nr_wake, u32 bitset)
 {
-	struct futex_q *this, *next;
	union futex_key key = FUTEX_KEY_INIT;
+	struct futex_q *this, *next;
	DEFINE_WAKE_Q(wake_q);
	int ret;

@@ -166,6 +186,9 @@ int futex_wake(u32 __user *uaddr, unsign
	if (unlikely(ret != 0))
		return ret;

+	if (!futex_robust_unlock(uaddr, flags, pop))
+		return -EFAULT;
+
	if ((flags & FLAGS_STRICT) && !nr_wake)
		return 0;

From nobody Tue Apr  7 02:33:54 2026
Date: Mon, 16 Mar 2026 18:13:18 +0100
Message-ID: <20260316164951.277660913@kernel.org>
From: Thomas Gleixner
To: LKML
Cc: Mathieu Desnoyers, André Almeida, Sebastian Andrzej Siewior, Carlos O'Donell, Peter Zijlstra, Florian Weimer, Rich Felker, Torvald Riegel, Darren Hart, Ingo Molnar, Davidlohr Bueso, Arnd Bergmann, "Liam R. Howlett"
Subject: [patch 5/8] futex: Add robust futex unlock IP range
References: <20260316162316.356674433@kernel.org>

There will be a VDSO function to unlock robust futexes in user space.
The unlock sequence is racy vs. clearing the list_op_pending pointer in
the task's robust list head. To plug this race the kernel needs to know
the instruction window.

As the VDSO is per MM, the addresses are stored in mm_struct::futex.
Architectures which implement support for this have to update these
addresses when the VDSO is (re)mapped.

Arguably this could be resolved by chasing mm->context->vdso->image, but
that's architecture specific and requires touching quite a few cache
lines. Having it in mm::futex reduces the cache line impact and avoids
having yet another set of architecture specific functionality.
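How such an IP window would be consumed can be sketched as follows. This is an assumption, not part of this patch: the range check itself lands in a later patch, and the half-open interval convention is guessed. Only the field names come from the struct added below.

```c
#include <stdbool.h>

/* Model of the new mm_struct::futex fields added by this patch. */
struct futex_mm_data {
	unsigned long unlock_cs_start_ip;
	unsigned long unlock_cs_success_ip;
	unsigned long unlock_cs_end_ip;
};

/*
 * Assumed consumer: decide whether a task was interrupted inside the
 * VDSO unlock critical section. The success IP is deliberately not
 * checked here; per the kernel-doc below only architecture code cares
 * about it and the core code evaluates just the start/end range.
 */
static bool ip_in_unlock_cs(const struct futex_mm_data *f, unsigned long ip)
{
	return ip >= f->unlock_cs_start_ip && ip < f->unlock_cs_end_ip;
}
```

The point of stashing the range in mm::futex is that this check touches one already-hot cache line instead of chasing mm->context->vdso->image.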
Signed-off-by: Thomas Gleixner Reviewed-by: Mathieu Desnoyers --- include/linux/futex_types.h | 32 +++++++++++++++++++++++++------- include/linux/mm_types.h | 1 + init/Kconfig | 6 ++++++ 3 files changed, 32 insertions(+), 7 deletions(-) --- a/include/linux/futex_types.h +++ b/include/linux/futex_types.h @@ -33,13 +33,26 @@ struct futex_ctrl { }; =20 /** * struct futex_mm_data - Futex related per MM data - * @phash_lock: Mutex to protect the private hash operations - * @phash: RCU managed pointer to the private hash - * @phash_new: Pointer to a newly allocated private hash - * @phash_batches: Batch state for RCU synchronization - * @phash_rcu: RCU head for call_rcu() - * @phash_atomic: Aggregate value for @phash_ref - * @phash_ref: Per CPU reference counter for a private hash + * @phash_lock: Mutex to protect the private hash operations + * @phash: RCU managed pointer to the private hash + * @phash_new: Pointer to a newly allocated private hash + * @phash_batches: Batch state for RCU synchronization + * @phash_rcu: RCU head for call_rcu() + * @phash_atomic: Aggregate value for @phash_ref + * @phash_ref: Per CPU reference counter for a private hash + * + * @unlock_cs_start_ip: The start IP of the robust futex unlock critical = section + * + * @unlock_cs_success_ip: The IP of the robust futex unlock critical secti= on which + * indicates that the unlock (cmpxchg) was successful + * Required to handle the compat size insanity for mixed mode + * game emulators. + * + * Not evaluated by the core code as that only + * evaluates the start/end range. Can therefore be 0 if the + * architecture does not care. 
+ *
+ * @unlock_cs_end_ip:	The end IP of the robust futex unlock critical section
  */
 struct futex_mm_data {
 #ifdef CONFIG_FUTEX_PRIVATE_HASH
@@ -51,6 +64,11 @@ struct futex_mm_data {
 	atomic_long_t		phash_atomic;
 	unsigned int __percpu	*phash_ref;
 #endif
+#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
+	unsigned long		unlock_cs_start_ip;
+	unsigned long		unlock_cs_success_ip;
+	unsigned long		unlock_cs_end_ip;
+#endif
 };

 #endif /* _LINUX_FUTEX_TYPES_H */
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -22,6 +22,7 @@
 #include
 #include
 #include
+#include

 #include

--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1822,6 +1822,12 @@ config FUTEX_MPOL
 	depends on FUTEX && NUMA
 	default y

+config HAVE_FUTEX_ROBUST_UNLOCK
+	bool
+
+config FUTEX_ROBUST_UNLOCK
+	def_bool FUTEX && HAVE_GENERIC_VDSO && GENERIC_IRQ_ENTRY && RSEQ && HAVE_FUTEX_ROBUST_UNLOCK
+
 config EPOLL
 	bool "Enable eventpoll support" if EXPERT
 	default y

From nobody Tue Apr 7 02:33:54 2026
Date: Mon, 16 Mar 2026 18:13:24 +0100 Message-ID: <20260316164951.345973752@kernel.org> From: Thomas Gleixner To: LKML Cc: Mathieu Desnoyers , André Almeida , Sebastian Andrzej Siewior , Carlos O'Donell , Peter Zijlstra , Florian Weimer , Rich Felker , Torvald Riegel , Darren Hart , Ingo Molnar , Davidlohr Bueso , Arnd Bergmann , "Liam R .
Howlett" Subject: [patch 6/8] futex: Provide infrastructure to plug the non contended robust futex unlock race References: <20260316162316.356674433@kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org

When the FUTEX_ROBUST_UNLOCK mechanism is used for unlocking (PI-)futexes,
then the unlock sequence in user space looks like this:

 1) robust_list_set_op_pending(mutex);
 2) robust_list_remove(mutex);

    lval = gettid();
 3) if (atomic_try_cmpxchg(&mutex->lock, lval, 0))
 4)     robust_list_clear_op_pending();
    else
 5)     sys_futex(OP | FUTEX_ROBUST_UNLOCK, ....);

That still leaves a minimal race window between #3 and #4 where the mutex
could be acquired by some other task, which observes that it is the last
user and:

 1) unmaps the mutex memory
 2) maps a different file, which ends up covering the same address

When the original task then exits before reaching #4, the kernel robust
list handling observes the pending op entry and tries to fix up user
space. In case the newly mapped data contains the TID of the exiting
thread at the address of the mutex/futex, the kernel will set the owner
died bit in that memory and therefore corrupt unrelated data.

On x86 this boils down to this simplified assembly sequence:

	mov	%esi,%eax		// Load TID into EAX
	xor	%ecx,%ecx		// Set ECX to 0
  #3	lock cmpxchg %ecx,(%rdi)	// Try the TID -> 0 transition
  .Lstart:
	jnz	.Lend
  #4	movq	$0x0,(%rdx)		// Clear list_op_pending
  .Lend:

If the cmpxchg() succeeds and the task is interrupted before it can clear
list_op_pending in the robust list head (#4), and the task then crashes in
a signal handler or gets killed, it ends up in do_exit() and subsequently
in the robust list handling, which then might run into the unmap/map issue
described above. This is only relevant when user space was interrupted and
a signal is pending.
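The five-step sequence above can be modeled in plain C. This is a hedged
sketch for illustration only: the robust list handling is reduced to a
single list_op_pending pointer, and all names (robust_try_unlock,
list_op_pending) are placeholders, not the real glibc or kernel
identifiers:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder for robust_list_head::list_op_pending */
static uint32_t *list_op_pending;

static bool robust_try_unlock(uint32_t *lock, uint32_t tid)
{
	uint32_t expected = tid;

	list_op_pending = lock;			/* #1 set op pending */
	/* #2 list removal elided in this sketch */
	if (__atomic_compare_exchange_n(lock, &expected, 0, false,
					__ATOMIC_RELEASE, __ATOMIC_RELAXED)) {
		/* Race window: interrupt + exit between the cmpxchg and
		 * the next store leaves a stale pointer behind, which is
		 * exactly what the kernel fixup has to clear. */
		list_op_pending = NULL;		/* #4 clear op pending */
		return true;
	}
	/* #5 caller falls back to sys_futex(OP | FUTEX_ROBUST_UNLOCK) */
	return false;
}
```

On a failed cmpxchg the pending op pointer is deliberately left set; the
fallback syscall clears it together with the waiter wakeup.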
The fix-up has to be done before signal delivery is attempted because:

 1) The signal might be fatal, so get_signal() ends up in do_exit()

 2) The signal handler might crash or the task is killed before returning
    from the handler. At that point the instruction pointer in pt_regs is
    no longer the instruction pointer of the initially interrupted unlock
    sequence.

The right place to handle this is in __exit_to_user_mode_loop() before
invoking arch_do_signal_or_restart() as this obviously covers both
scenarios.

As this is only relevant when the task was interrupted in user space, this
is tied to RSEQ and the generic entry code, as RSEQ keeps track of user
space interrupts unconditionally even if the task does not have a RSEQ
region installed. That makes the decision very lightweight:

	if (current->rseq.user_irq && within(regs, unlock_ip_range))
		futex_fixup_robust_unlock(regs);

futex_fixup_robust_unlock() then invokes an architecture specific function
which evaluates the register content to decide whether the pending op
pointer in the robust list head needs to be cleared. Assuming the above
unlock sequence, on x86 this results in the trivial evaluation of the zero
flag:

	return regs->eflags & X86_EFLAGS_ZF;

Other architectures might need to do more complex evaluations due to LLSC,
but the approach is valid in general. In case COMPAT is enabled the
decision function is a bit more complex, but that's an implementation
detail.

The handling code also requires an architecture specific function to
retrieve the pending op pointer so it can perform the clearing.

The unlock sequence is going to be placed in the VDSO so that the kernel
can keep everything synchronized. The resulting code sequence for user
space is:

	if (__vdso_futex_robust_try_unlock(lock, tid, &pending_op) != tid)
		err = sys_futex($OP | FUTEX_ROBUST_UNLOCK, ....);

Both the VDSO unlock and the kernel side unlock ensure that the pending_op
pointer is always cleared when the lock becomes unlocked.
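The lightweight exit-path decision can be sketched in isolation. This is a
minimal model, not kernel code: the mm and rseq state are reduced to plain
values and the names (unlock_ip_range, needs_robust_unlock_fixup) are
illustrative only:

```c
#include <assert.h>
#include <stdbool.h>

struct unlock_ip_range {
	unsigned long start_ip;	/* inclusive */
	unsigned long end_ip;	/* exclusive */
};

static bool within(unsigned long ip, const struct unlock_ip_range *r)
{
	/* Half-open range check, mirroring futex_within_robust_unlock() */
	return ip >= r->start_ip && ip < r->end_ip;
}

static bool needs_robust_unlock_fixup(bool user_irq, unsigned long ip,
				      const struct unlock_ip_range *r)
{
	/* Cheap first check: only interrupts from user space qualify,
	 * which avoids touching mm state on every signal delivery. */
	return user_irq && within(ip, r);
}
```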
The pending op pointer has the same modifier requirements as the @uaddr2
argument of sys_futex(FUTEX_ROBUST_UNLOCK) for the very same reasons. That
means VDSO implementations need to support the variable size case for the
pending op pointer as well if COMPAT is enabled.

Signed-off-by: Thomas Gleixner
---
 include/linux/futex.h | 31 ++++++++++++++++++++++++++++++-
 include/vdso/futex.h  | 44 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/entry/common.c |  9 ++++++---
 kernel/futex/core.c   | 13 +++++++++++++
 4 files changed, 93 insertions(+), 4 deletions(-)

--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -110,7 +110,36 @@ static inline int futex_hash_allocate_de
 }
 static inline int futex_hash_free(struct mm_struct *mm) { return 0; }
 static inline int futex_mm_init(struct mm_struct *mm) { return 0; }
+#endif /* !CONFIG_FUTEX */

-#endif
+#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
+#include
+
+void __futex_fixup_robust_unlock(struct pt_regs *regs);
+
+static inline bool futex_within_robust_unlock(struct pt_regs *regs)
+{
+	unsigned long ip = instruction_pointer(regs);
+
+	return ip >= current->mm->futex.unlock_cs_start_ip &&
+	       ip < current->mm->futex.unlock_cs_end_ip;
+}
+
+static inline void futex_fixup_robust_unlock(struct pt_regs *regs)
+{
+	/*
+	 * Avoid dereferencing current->mm if not returning from interrupt.
+	 * current->rseq.event is going to be used anyway in the exit to user
+	 * code, so bringing it in is not a big deal.
+	 */
+	if (!current->rseq.event.user_irq)
+		return;
+
+	if (unlikely(futex_within_robust_unlock(regs)))
+		__futex_fixup_robust_unlock(regs);
+}
+#else /* CONFIG_FUTEX_ROBUST_UNLOCK */
+static inline void futex_fixup_robust_unlock(struct pt_regs *regs) {}
+#endif /* !CONFIG_FUTEX_ROBUST_UNLOCK */

 #endif
--- /dev/null
+++ b/include/vdso/futex.h
@@ -0,0 +1,44 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _VDSO_FUTEX_H
+#define _VDSO_FUTEX_H
+
+#include
+
+struct robust_list;
+
+/**
+ * __vdso_futex_robust_try_unlock - Try to unlock an uncontended robust futex
+ * @lock:	Pointer to the futex lock object
+ * @tid:	The TID of the calling task
+ * @op:		Pointer to the task's robust_list_head::list_op_pending
+ *
+ * Return: The content of *@lock. On success this is the same as @tid.
+ *
+ * The function implements:
+ *	if (atomic_try_cmpxchg(lock, &tid, 0))
+ *		*op = NULL;
+ *	return tid;
+ *
+ * There is a race between a successful unlock and clearing the pending op
+ * pointer in the robust list head. If the calling task is interrupted in the
+ * race window and has to handle a (fatal) signal on return to user space then
+ * the kernel handles the clearing of @op before attempting to deliver the
+ * signal. That ensures that a task cannot exit with a potentially invalid
+ * pending op pointer.
+ *
+ * User space uses it in the following way:
+ *
+ *	if (__vdso_futex_robust_try_unlock(lock, tid, &pending_op) != tid)
+ *		err = sys_futex($OP | FUTEX_ROBUST_UNLOCK, ....);
+ *
+ * If the unlock attempt fails due to the FUTEX_WAITERS bit set in the lock,
+ * then the syscall does the unlock, clears the pending op pointer and wakes
+ * the requested number of waiters.
+ *
+ * The @op pointer is intentionally void. It has the same requirements as the
+ * @uaddr2 argument for sys_futex(FUTEX_ROBUST_UNLOCK) operations. See the
+ * modifier and the related documentation in include/uapi/linux/futex.h
+ */
+uint32_t __vdso_futex_robust_try_unlock(uint32_t *lock, uint32_t tid, void *op);
+
+#endif
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -1,11 +1,12 @@
 // SPDX-License-Identifier: GPL-2.0

-#include
-#include
+#include
 #include
+#include
 #include
 #include
 #include
+#include
 #include

 /* Workaround to allow gradual conversion of architecture code */
@@ -60,8 +61,10 @@ static __always_inline unsigned long __e
 	if (ti_work & _TIF_PATCH_PENDING)
 		klp_update_patch_state(current);

-	if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
+	if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL)) {
+		futex_fixup_robust_unlock(regs);
 		arch_do_signal_or_restart(regs);
+	}

 	if (ti_work & _TIF_NOTIFY_RESUME)
 		resume_user_mode_work(regs);
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1455,6 +1455,19 @@ bool futex_robust_list_clear_pending(voi
 	return robust_list_clear_pending(pop);
 }

+#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
+void __futex_fixup_robust_unlock(struct pt_regs *regs)
+{
+	void __user *pop;
+
+	if (!arch_futex_needs_robust_unlock_fixup(regs))
+		return;
+
+	pop = arch_futex_robust_unlock_get_pop(regs);
+	futex_robust_list_clear_pending(pop);
+}
+#endif /* CONFIG_FUTEX_ROBUST_UNLOCK */
+
 static void futex_cleanup(struct task_struct *tsk)
 {
	if (unlikely(tsk->futex.robust_list)) {

From nobody Tue Apr 7 02:33:54 2026
Date: Mon, 16 Mar 2026 18:13:29 +0100 Message-ID: <20260316164951.413709497@kernel.org> From: Thomas Gleixner To: LKML Cc: Mathieu Desnoyers , André Almeida , Sebastian Andrzej Siewior , Carlos O'Donell , Peter Zijlstra , Florian Weimer , Rich Felker , Torvald Riegel , Darren Hart , Ingo Molnar , Davidlohr Bueso , Arnd Bergmann , "Liam R .
Howlett" Subject: [patch 7/8] x86/vdso: Prepare for robust futex unlock support References: <20260316162316.356674433@kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org

There will be a VDSO function to unlock non-contended robust futexes in
user space. The unlock sequence is racy vs. clearing the list_op_pending
pointer in the task's robust list head. To plug this race the kernel needs
to know the instruction window so it can clear the pointer when the task
is interrupted within that race window.

Add the symbols to the vdso2c generator and use them in the VDSO VMA code
to update the critical section addresses in mm_struct::futex when the VDSO
is (re)mapped.

Signed-off-by: Thomas Gleixner
---
 arch/x86/entry/vdso/vma.c   | 20 ++++++++++++++++++++
 arch/x86/include/asm/vdso.h |  3 +++
 arch/x86/tools/vdso2c.c     | 17 ++++++++++-------
 3 files changed, 33 insertions(+), 7 deletions(-)

--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -73,6 +73,23 @@ static void vdso_fix_landing(const struc
 	regs->ip = new_vma->vm_start + ipoffset;
 }

+#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
+static void vdso_futex_robust_unlock_update_ips(void)
+{
+	const struct vdso_image *image = current->mm->context.vdso_image;
+	unsigned long vdso = (unsigned long) current->mm->context.vdso;
+
+	current->mm->futex.unlock_cs_start_ip =
+		vdso + image->sym___vdso_futex_robust_try_unlock_cs_start;
+	current->mm->futex.unlock_cs_success_ip =
+		vdso + image->sym___vdso_futex_robust_try_unlock_cs_success;
+	current->mm->futex.unlock_cs_end_ip =
+		vdso + image->sym___vdso_futex_robust_try_unlock_cs_end;
+}
+#else
+static inline void vdso_futex_robust_unlock_update_ips(void) { }
+#endif
+
 static int vdso_mremap(const struct vm_special_mapping *sm,
 		       struct vm_area_struct *new_vma)
 {
@@ -80,6 +97,7 @@ static int vdso_mremap(const struct vm_s

 	vdso_fix_landing(image, new_vma);
 	current->mm->context.vdso = (void __user *)new_vma->vm_start;
+	vdso_futex_robust_unlock_update_ips();

 	return 0;
 }
@@ -189,6 +207,8 @@ static int map_vdso(const struct vdso_im
 	current->mm->context.vdso = (void __user *)text_start;
 	current->mm->context.vdso_image = image;

+	vdso_futex_robust_unlock_update_ips();
+
 up_fail:
 	mmap_write_unlock(mm);
 	return ret;
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -25,6 +25,9 @@ struct vdso_image {
 	long sym_int80_landing_pad;
 	long sym_vdso32_sigreturn_landing_pad;
 	long sym_vdso32_rt_sigreturn_landing_pad;
+	long sym___vdso_futex_robust_try_unlock_cs_start;
+	long sym___vdso_futex_robust_try_unlock_cs_success;
+	long sym___vdso_futex_robust_try_unlock_cs_end;
 };

 extern const struct vdso_image vdso64_image;
--- a/arch/x86/tools/vdso2c.c
+++ b/arch/x86/tools/vdso2c.c
@@ -75,13 +75,16 @@ struct vdso_sym {
 };

 struct vdso_sym required_syms[] = {
-	{"VDSO32_NOTE_MASK", true},
-	{"__kernel_vsyscall", true},
-	{"__kernel_sigreturn", true},
-	{"__kernel_rt_sigreturn", true},
-	{"int80_landing_pad", true},
-	{"vdso32_rt_sigreturn_landing_pad", true},
-	{"vdso32_sigreturn_landing_pad", true},
+	{"VDSO32_NOTE_MASK",				true},
+	{"__kernel_vsyscall",				true},
+	{"__kernel_sigreturn",				true},
+	{"__kernel_rt_sigreturn",			true},
+	{"int80_landing_pad",				true},
+	{"vdso32_rt_sigreturn_landing_pad",		true},
+	{"vdso32_sigreturn_landing_pad",		true},
+	{"__vdso_futex_robust_try_unlock_cs_start",	true},
+	{"__vdso_futex_robust_try_unlock_cs_success",	true},
+	{"__vdso_futex_robust_try_unlock_cs_end",	true},
 };

 __attribute__((format(printf, 1, 2))) __attribute__((noreturn))

From nobody Tue Apr 7 02:33:54 2026
Date: Mon, 16 Mar 2026 18:13:34 +0100 Message-ID: <20260316164951.484640267@kernel.org> From: Thomas Gleixner To: LKML Cc: Mathieu Desnoyers , André Almeida , Sebastian
Andrzej Siewior , Carlos O'Donell , Peter Zijlstra , Florian Weimer , Rich Felker , Torvald Riegel , Darren Hart , Ingo Molnar , Davidlohr Bueso , Arnd Bergmann , "Liam R . Howlett" Subject: [patch 8/8] x86/vdso: Implement __vdso_futex_robust_try_unlock() References: <20260316162316.356674433@kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org

When the FUTEX_ROBUST_UNLOCK mechanism is used for unlocking (PI-)futexes,
then the unlock sequence in user space looks like this:

 1) robust_list_set_op_pending(mutex);
 2) robust_list_remove(mutex);

    lval = gettid();
 3) if (atomic_try_cmpxchg(&mutex->lock, lval, 0))
 4)     robust_list_clear_op_pending();
    else
 5)     sys_futex(OP | FUTEX_ROBUST_UNLOCK, ....);

That still leaves a minimal race window between #3 and #4 where the mutex
could be acquired by some other task which observes that it is the last
user and:

 1) unmaps the mutex memory
 2) maps a different file, which ends up covering the same address

When the original task then exits before reaching #4, the kernel robust
list handling observes the pending op entry and tries to fix up user
space. In case the newly mapped data contains the TID of the exiting
thread at the address of the mutex/futex, the kernel will set the owner
died bit in that memory and therefore corrupt unrelated data.

Provide a VDSO function which exposes the critical section window in the
VDSO symbol table. The resulting addresses are updated in the task's mm
when the VDSO is (re)mapped. The core code detects when a task was
interrupted within the critical section and is about to deliver a signal.
It then invokes an architecture specific function which determines whether
the pending op pointer has to be cleared or not.
The assembly sequence for the non COMPAT case is:

	mov	%esi,%eax		// Load TID into EAX
	xor	%ecx,%ecx		// Set ECX to 0
	lock cmpxchg %ecx,(%rdi)	// Try the TID -> 0 transition
  .Lstart:
	jnz	.Lend
	movq	$0x0,(%rdx)		// Clear list_op_pending
  .Lend:
	ret

So the decision can be simply based on the ZF state in regs->flags.

If COMPAT is enabled then the try_unlock() function needs to take the size
bit in the OP pointer into account, which makes it slightly more complex:

	mov	%esi,%eax		// Load TID into EAX
	mov	%rdx,%rsi		// Get the op pointer
	xor	%ecx,%ecx		// Set ECX to 0
	and	$0xfffffffffffffffe,%rsi // Clear the size bit
	lock cmpxchg %ecx,(%rdi)	// Try the TID -> 0 transition
  .Lstart:
	jnz	.Lend
  .Lsuccess:
	testl	$0x1,(%rdx)		// Test the size bit
	jz	.Lop64			// Not set: 64-bit
	movl	$0x0,(%rsi)		// Clear 32-bit
	jmp	.Lend
  .Lop64:
	movq	$0x0,(%rsi)		// Clear 64-bit
  .Lend:
	ret

The decision function has to check whether regs->ip is in the success
portion as the size bit test obviously modifies ZF too. If it is before
.Lsuccess then ZF contains the cmpxchg() result. If it is at or after
.Lsuccess then the pointer has to be cleared. The original pointer with
the size bit is preserved in RDX so the fixup can utilize the existing
clearing mechanism, which is used by sys_futex().

Arguably this could be avoided by providing separate functions and making
the IP range for the quick check in the exit to user path cover the whole
text section which contains the two functions. But that's not a win at all
because:

 1) User space needs to handle the two variants instead of just relying
    on a bit which can be saved in the mutex at initialization time.

 2) The fixup decision function then has to evaluate which code path is
    used. That just adds more symbols and range checking for no real
    value.

The unlock function is inspired by an idea from Mathieu Desnoyers.
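The compat decision rule described above can be captured in a few lines of
C. This is an illustrative model only (the function name and arguments are
hypothetical, not the in-kernel API): once the instruction pointer is at
or past .Lsuccess the size-bit test has clobbered ZF, so the pointer must
be cleared unconditionally; before that point ZF still holds the cmpxchg()
result:

```c
#include <assert.h>
#include <stdbool.h>

static bool compat_needs_fixup(unsigned long ip, unsigned long success_ip,
			       bool zf)
{
	if (ip >= success_ip)
		return true;	/* cmpxchg won, clearing was in progress */
	return zf;		/* at the jnz: ZF mirrors cmpxchg success */
}
```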
Signed-off-by: Thomas Gleixner
Link: https://lore.kernel.org/20260311185409.1988269-1-mathieu.desnoyers@efficios.com
---
 arch/x86/Kconfig                         |  1
 arch/x86/entry/vdso/common/vfutex.c      | 72 +++++++++++++++++++++++++++++
 arch/x86/entry/vdso/vdso32/Makefile      |  5 +-
 arch/x86/entry/vdso/vdso32/vdso32.lds.S  |  6 ++
 arch/x86/entry/vdso/vdso32/vfutex.c      |  1
 arch/x86/entry/vdso/vdso64/Makefile      |  7 +--
 arch/x86/entry/vdso/vdso64/vdso64.lds.S  |  6 ++
 arch/x86/entry/vdso/vdso64/vdsox32.lds.S |  6 ++
 arch/x86/entry/vdso/vdso64/vfutex.c      |  1
 arch/x86/include/asm/futex_robust.h      | 44 ++++++++++++++++++
 10 files changed, 144 insertions(+), 5 deletions(-)

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -237,6 +237,7 @@ config X86
 	select HAVE_EFFICIENT_UNALIGNED_ACCESS
 	select HAVE_EISA if X86_32
 	select HAVE_EXIT_THREAD
+	select HAVE_FUTEX_ROBUST_UNLOCK
 	select HAVE_GENERIC_TIF_BITS
 	select HAVE_GUP_FAST
 	select HAVE_FENTRY if X86_64 || DYNAMIC_FTRACE
--- /dev/null
+++ b/arch/x86/entry/vdso/common/vfutex.c
@@ -0,0 +1,72 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include
+
+/*
+ * Compat enabled kernels have to take the size bit into account to support
+ * the mixed size use case of gaming emulators. Contrary to the kernel robust
+ * unlock mechanism, all of this does not test for the 32-bit modifier in
+ * 32-bit VDSOs and in compat disabled kernels. User space can keep the pieces.
+ */
+#if defined(CONFIG_X86_64) && !defined(BUILD_VDSO32_64)
+
+#ifdef CONFIG_COMPAT
+
+# define ASM_CLEAR_PTR						\
+	"	testl	$1, (%[pop])				\n" \
+	"	jz	.Lop64					\n" \
+	"	movl	$0, (%[pad])				\n" \
+	"	jmp	__vdso_futex_robust_try_unlock_cs_end	\n" \
+	".Lop64:						\n" \
+	"	movq	$0, (%[pad])				\n"
+
+# define ASM_PAD_CONSTRAINT	,[pad] "S" (((unsigned long)pop) & ~0x1UL)
+
+#else /* CONFIG_COMPAT */
+
+# define ASM_CLEAR_PTR						\
+	"	movq	$0, (%[pop])				\n"
+
+# define ASM_PAD_CONSTRAINT
+
+#endif /* !CONFIG_COMPAT */
+
+#else /* CONFIG_X86_64 && !BUILD_VDSO32_64 */
+
+# define ASM_CLEAR_PTR						\
+	"	movl	$0, (%[pad])				\n"
+
+# define ASM_PAD_CONSTRAINT	,[pad] "S" (((unsigned long)pop) & ~0x1UL)
+
+#endif /* !CONFIG_X86_64 || BUILD_VDSO32_64 */
+
+uint32_t __vdso_futex_robust_try_unlock(uint32_t *lock, uint32_t tid, void *pop)
+{
+	asm volatile (
+		".global __vdso_futex_robust_try_unlock_cs_start	\n"
+		".global __vdso_futex_robust_try_unlock_cs_success	\n"
+		".global __vdso_futex_robust_try_unlock_cs_end		\n"
+		"							\n"
+		"	lock cmpxchgl	%[val], (%[ptr])		\n"
+		"							\n"
+		"__vdso_futex_robust_try_unlock_cs_start:		\n"
+		"							\n"
+		"	jnz	__vdso_futex_robust_try_unlock_cs_end	\n"
+		"							\n"
+		"__vdso_futex_robust_try_unlock_cs_success:		\n"
+		"							\n"
+		ASM_CLEAR_PTR
+		"							\n"
+		"__vdso_futex_robust_try_unlock_cs_end:			\n"
+		: [tid] "+a" (tid)
+		: [ptr] "D" (lock),
+		  [pop] "d" (pop),
+		  [val] "r" (0)
+		  ASM_PAD_CONSTRAINT
+		: "memory"
+	);
+
+	return tid;
+}
+
+uint32_t futex_robust_try_unlock(uint32_t *, uint32_t, void **)
+	__attribute__((weak, alias("__vdso_futex_robust_try_unlock")));
--- a/arch/x86/entry/vdso/vdso32/Makefile
+++ b/arch/x86/entry/vdso/vdso32/Makefile
@@ -7,8 +7,9 @@
 vdsos-y := 32

 # Files to link into the vDSO:
-vobjs-y := note.o vclock_gettime.o vgetcpu.o
-vobjs-y += system_call.o sigreturn.o
+vobjs-y := note.o vclock_gettime.o vgetcpu.o
+vobjs-y += system_call.o sigreturn.o
+vobjs-$(CONFIG_FUTEX_ROBUST_UNLOCK) += vfutex.o

 # Compilation flags
 flags-y := -DBUILD_VDSO32 -m32 -mregparm=0
--- a/arch/x86/entry/vdso/vdso32/vdso32.lds.S
+++ b/arch/x86/entry/vdso/vdso32/vdso32.lds.S
@@ -30,6 +30,12 @@ VERSION
 		__vdso_clock_gettime64;
 		__vdso_clock_getres_time64;
 		__vdso_getcpu;
+#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
+		__vdso_futex_robust_try_unlock;
+		__vdso_futex_robust_try_unlock_cs_start;
+		__vdso_futex_robust_try_unlock_cs_success;
+		__vdso_futex_robust_try_unlock_cs_end;
+#endif
 	};

 	LINUX_2.5 {
--- /dev/null
+++ b/arch/x86/entry/vdso/vdso32/vfutex.c
@@ -0,0 +1 @@
+#include "common/vfutex.c"
--- a/arch/x86/entry/vdso/vdso64/Makefile
+++ b/arch/x86/entry/vdso/vdso64/Makefile
@@ -8,9 +8,10 @@ vdsos-y := 64
 vdsos-$(CONFIG_X86_X32_ABI) += x32

 # Files to link into the vDSO:
-vobjs-y := note.o vclock_gettime.o vgetcpu.o
-vobjs-y += vgetrandom.o vgetrandom-chacha.o
-vobjs-$(CONFIG_X86_SGX) += vsgx.o
+vobjs-y := note.o vclock_gettime.o vgetcpu.o
+vobjs-y += vgetrandom.o vgetrandom-chacha.o
+vobjs-$(CONFIG_X86_SGX) += vsgx.o
+vobjs-$(CONFIG_FUTEX_ROBUST_UNLOCK) += vfutex.o

 # Compilation flags
 flags-y := -DBUILD_VDSO64 -m64 -mcmodel=small
--- a/arch/x86/entry/vdso/vdso64/vdso64.lds.S
+++ b/arch/x86/entry/vdso/vdso64/vdso64.lds.S
@@ -32,6 +32,12 @@ VERSION {
 #endif
 		getrandom;
 		__vdso_getrandom;
+#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
+		__vdso_futex_robust_try_unlock;
+		__vdso_futex_robust_try_unlock_cs_start;
+		__vdso_futex_robust_try_unlock_cs_success;
+		__vdso_futex_robust_try_unlock_cs_end;
+#endif
 	local: *;
 	};
 }
--- a/arch/x86/entry/vdso/vdso64/vdsox32.lds.S
+++ b/arch/x86/entry/vdso/vdso64/vdsox32.lds.S
@@ -22,6 +22,12 @@ VERSION {
 		__vdso_getcpu;
 		__vdso_time;
 		__vdso_clock_getres;
+#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
+		__vdso_futex_robust_try_unlock;
+		__vdso_futex_robust_try_unlock_cs_start;
+		__vdso_futex_robust_try_unlock_cs_success;
+		__vdso_futex_robust_try_unlock_cs_end;
+#endif
 	local: *;
 	};
 }
--- /dev/null
+++ b/arch/x86/entry/vdso/vdso64/vfutex.c
@@ -0,0 +1 @@
+#include "common/vfutex.c"
--- /dev/null
+++ b/arch/x86/include/asm/futex_robust.h
@@ -0,0 +1,44 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_FUTEX_ROBUST_H
+#define _ASM_X86_FUTEX_ROBUST_H
+
+#include
+
+static __always_inline bool x86_futex_needs_robust_unlock_fixup(struct pt_regs *regs)
+{
+	/*
+	 * This is tricky in the compat case as it has to take the size check
+	 * into account. See the ASM magic in the VDSO vfutex code. If compat is
+	 * disabled or this is a 32-bit kernel then ZF is authoritative no
+	 * matter what.
+	 */
+	if (!IS_ENABLED(CONFIG_X86_64) || !IS_ENABLED(CONFIG_IA32_EMULATION))
+		return !!(regs->flags & X86_EFLAGS_ZF);
+
+	/*
+	 * For the compat case, the core code already established that regs->ip
+	 * is >= cs_start and < cs_end. Now check whether it is at the
+	 * conditional jump which checks the cmpxchg() or if it succeeded and
+	 * does the size check, which obviously modifies ZF too.
+	 */
+	if (regs->ip >= current->mm->futex.unlock_cs_success_ip)
+		return true;
+	/*
+	 * It's at the jnz right after the cmpxchg(). ZF tells whether this
+	 * succeeded or not.
+	 */
+	return !!(regs->flags & X86_EFLAGS_ZF);
+}
+
+#define arch_futex_needs_robust_unlock_fixup(regs)	\
+	x86_futex_needs_robust_unlock_fixup(regs)
+
+static __always_inline void __user *x86_futex_robust_unlock_get_pop(struct pt_regs *regs)
+{
+	return (void __user *)regs->dx;
+}
+
+#define arch_futex_robust_unlock_get_pop(regs)	\
+	x86_futex_robust_unlock_get_pop(regs)
+
+#endif /* _ASM_X86_FUTEX_ROBUST_H */