From nobody Mon Jun 8 08:30:47 2026
Date: Wed, 03 Jun 2026 14:24:53 -0000
From: "tip-bot2 for Thomas Gleixner"
Sender: tip-bot2@linutronix.de
Reply-to: linux-kernel@vger.kernel.org
To: linux-tip-commits@vger.kernel.org
Subject: [tip: locking/core] futex: Provide infrastructure to plug the non contended robust futex unlock race
Cc: Thomas Gleixner, "Peter Zijlstra (Intel)", andrealmeid@igalia.com, x86@kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20260602090535.773669210@kernel.org>
References: <20260602090535.773669210@kernel.org>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Message-ID: <178049669339.710.10754146774541450376.tip-bot2@tip-bot2>
Content-Type: text/plain; charset="utf-8"

The following commit has been merged into the locking/core branch of tip:

Commit-ID:     7010c39d8fc5063af69ee63f905e592e046f8e5d
Gitweb:        https://git.kernel.org/tip/7010c39d8fc5063af69ee63f905e592e046f8e5d
Author:        Thomas Gleixner
AuthorDate:    Tue, 02 Jun 2026 11:10:04 +02:00
Committer:     Peter Zijlstra
CommitterDate: Wed, 03 Jun 2026 11:38:52 +02:00

futex: Provide infrastructure to plug the non contended robust futex unlock race

When the FUTEX_ROBUST_UNLOCK mechanism is used for unlocking (PI-)futexes,
then the unlock sequence in user space looks like this:

   1)  robust_list_set_op_pending(mutex);
   2)  robust_list_remove(mutex);

       lval = gettid();
   3)  if (atomic_try_cmpxchg(&mutex->lock, lval, 0))
   4)          robust_list_clear_op_pending();
       else
   5)          sys_futex(OP | FUTEX_ROBUST_UNLOCK, ....);

That still leaves a minimal race window between #3 and #4 where the mutex
could be acquired by some other task, which observes that it is the last
user and:

   1) unmaps the mutex memory
   2) maps a different file, which ends up covering the same address

If the original task then exits before reaching #4, the kernel robust list
handling observes the pending op entry and tries to fix up user space.

If the newly mapped data contains the TID of the exiting thread at the
address of the mutex/futex, the kernel will set the owner died bit in that
memory and therefore corrupt unrelated data.

On X86 this boils down to this simplified assembly sequence:

        mov             %esi,%eax       // Load TID into EAX
        xor             %ecx,%ecx       // Set ECX to 0
   #3   lock cmpxchg    %ecx,(%rdi)     // Try the TID -> 0 transition
   .Lstart:
        jnz             .Lend
   #4   movq            %rcx,(%rdx)     // Clear list_op_pending
   .Lend:

If the cmpxchg() succeeds and the task is interrupted before it can clear
list_op_pending in the robust list head (#4), and the task then crashes in
a signal handler or gets killed, it ends up in do_exit() and subsequently
in the robust list handling, which then might run into the unmap/map issue
described above.
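
For illustration only, the user space sequence above can be modelled in
plain C11. This is a sketch, not the glibc or VDSO implementation: the
demo_* names and both structs are hypothetical stand-ins, step 2 (list
removal) and step 5 (the syscall) are left out, and the comment marks where
the race window between #3 and #4 sits.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-thread robust list head. */
struct demo_robust_head {
	void *list_op_pending;
};

/* Hypothetical stand-in for a robust mutex; only the lock word matters. */
struct demo_mutex {
	_Atomic uint32_t lock;
};

/*
 * Models steps 1, 3 and 4 of the unlock sequence. Returns true when the
 * uncontended TID -> 0 transition succeeded and false when the caller
 * would have to fall back to the futex syscall (step 5, not modelled).
 */
bool demo_try_unlock(struct demo_mutex *m, uint32_t tid,
		     struct demo_robust_head *head)
{
	uint32_t lval = tid;

	assert(m && head);

	/* 1) Announce the pending operation. */
	head->list_op_pending = m;

	/* 3) Try the TID -> 0 transition. */
	if (!atomic_compare_exchange_strong(&m->lock, &lval, 0))
		return false;

	/*
	 * Race window: the lock word is already 0, but list_op_pending
	 * still points at the mutex until the store below has happened.
	 */

	/* 4) Clear the pending operation. */
	head->list_op_pending = NULL;
	return true;
}
```

A task that dies inside the marked window leaves list_op_pending pointing
at memory that another task may already have unmapped and reused, which is
the case the commit addresses.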

This is only relevant when user space was interrupted and a signal is
pending. The fix-up has to be done before signal delivery is attempted
because:

   1) The signal might be fatal so get_signal() ends up in do_exit()

   2) The signal handler might crash or the task is killed before returning
      from the handler. At that point the instruction pointer in pt_regs is
      no longer the instruction pointer of the initially interrupted unlock
      sequence.

The right place to handle this is in __exit_to_user_mode_loop() before
invoking arch_do_signal_or_restart() as this obviously covers both
scenarios.

As this is only relevant when the task was interrupted in user space, this
is tied to RSEQ and the generic entry code as RSEQ keeps track of user
space interrupts unconditionally even if the task does not have a RSEQ
region installed. That makes the decision very lightweight:

   if (current->rseq.user_irq && within(regs, csr->unlock_ip_range))
           futex_fixup_robust_unlock(regs, csr);

futex_fixup_robust_unlock() then invokes an architecture-specific function
to return the pending op pointer or NULL. The function evaluates the
register content to decide whether the pending ops pointer in the robust
list head needs to be cleared.

Assuming the above unlock sequence, then on x86 this decision is the
trivial evaluation of the zero flag:

   return regs->eflags & X86_EFLAGS_ZF ? regs->dx : NULL;

Other architectures might need to do more complex evaluations due to LLSC,
but the approach is valid in general. The size of the pointer is determined
from the matching range struct, which covers both 32-bit and 64-bit builds
including COMPAT.

The unlock sequence is going to be placed in the VDSO so that the kernel
can keep everything synchronized, especially the register usage.
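
The decision described above can be modelled outside the kernel as a rough
sketch. Everything named demo_* or DEMO_* is hypothetical: the register
struct is a three-field stand-in for pt_regs, the range struct mirrors only
the start_ip/len fields, and the zero flag value 0x40 is bit 6 of the x86
flags register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Bit 6 of the x86 flags register (zero flag). */
#define DEMO_EFLAGS_ZF	0x40UL

/* Hypothetical, minimal stand-in for the interrupted register state. */
struct demo_regs {
	unsigned long ip;	/* interrupted instruction pointer */
	unsigned long flags;	/* flags register at interruption */
	unsigned long dx;	/* holds the pending op pointer address */
};

/* Hypothetical stand-in for one unlock critical section range. */
struct demo_cs_range {
	unsigned long start_ip;
	unsigned int len;
};

/* True if the interrupted IP lies in [start_ip, start_ip + len). */
bool demo_within(const struct demo_regs *regs, const struct demo_cs_range *csr)
{
	return regs->ip >= csr->start_ip && regs->ip < csr->start_ip + csr->len;
}

/*
 * Returns the address of the pending op slot that has to be cleared, or 0
 * if no fix-up is needed: either the task was not interrupted in user
 * space, the IP is outside the range, or the cmpxchg failed (ZF clear).
 */
unsigned long demo_fixup_target(bool user_irq, const struct demo_regs *regs,
				const struct demo_cs_range *csr)
{
	assert(regs && csr);

	if (!user_irq || !demo_within(regs, csr))
		return 0;

	return (regs->flags & DEMO_EFLAGS_ZF) ? regs->dx : 0;
}
```

The cheap user_irq test comes first so that the common case returns before
any range comparison is done, which matches the ordering the commit message
gives for the real check.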

The resulting code sequence for user space is:

   if (__vdso_futex_robust_list$SZ_try_unlock(lock, tid, &pending_op) != tid)
           err = sys_futex($OP | FUTEX_ROBUST_UNLOCK,....);

Both the VDSO unlock and the kernel side unlock ensure that the pending_op
pointer is always cleared when the lock becomes unlocked.

Signed-off-by: Thomas Gleixner
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: André Almeida
Link: https://patch.msgid.link/20260602090535.773669210@kernel.org
---
 include/linux/futex.h | 39 +++++++++++++++++++++++++++++++-
 include/vdso/futex.h  | 52 ++++++++++++++++++++++++++++++++++++++++++-
 kernel/entry/common.c |  9 ++++---
 kernel/futex/core.c   | 18 +++++++++++++++-
 4 files changed, 114 insertions(+), 4 deletions(-)
 create mode 100644 include/vdso/futex.h

diff --git a/include/linux/futex.h b/include/linux/futex.h
index cb2a182..51f4ccd 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -105,7 +105,41 @@ static inline int futex_hash_free(struct mm_struct *mm) { return 0; }
 #endif /* !CONFIG_FUTEX */
 
 #ifdef CONFIG_FUTEX_ROBUST_UNLOCK
+#include
+
 void futex_reset_cs_ranges(struct futex_mm_data *fd);
+void __futex_fixup_robust_unlock(struct pt_regs *regs, struct futex_unlock_cs_range *csr);
+
+static inline bool futex_within_robust_unlock(struct pt_regs *regs,
+					      struct futex_unlock_cs_range *csr)
+{
+	unsigned long ip = instruction_pointer(regs);
+
+	return ip >= csr->start_ip && ip < csr->start_ip + csr->len;
+}
+
+static inline void futex_fixup_robust_unlock(struct pt_regs *regs)
+{
+	struct futex_unlock_cs_range *csr;
+
+	/*
+	 * Avoid dereferencing current->mm if not returning from interrupt.
+	 * current->rseq.event is going to be used subsequently, so bringing the
+	 * cache line in is not a big deal.
+	 */
+	if (!current->rseq.event.user_irq)
+		return;
+
+	csr = current->mm->futex.unlock.cs_ranges;
+
+	/* The loop is optimized out for !COMPAT */
+	for (int r = 0; r < FUTEX_ROBUST_MAX_CS_RANGES; r++, csr++) {
+		if (unlikely(futex_within_robust_unlock(regs, csr))) {
+			__futex_fixup_robust_unlock(regs, csr);
+			return;
+		}
+	}
+}
 
 static inline void futex_set_vdso_cs_range(struct futex_mm_data *fd, unsigned int idx,
 					   unsigned long start, unsigned long end, bool sz32)
@@ -114,7 +148,10 @@ static inline void futex_set_vdso_cs_range(struct futex_mm_data *fd, unsigned in
 	fd->unlock.cs_ranges[idx].len = end - start;
 	fd->unlock.cs_ranges[idx].pop_size32 = sz32;
 }
-#endif /* CONFIG_FUTEX_ROBUST_UNLOCK */
+#else /* CONFIG_FUTEX_ROBUST_UNLOCK */
+static inline void futex_fixup_robust_unlock(struct pt_regs *regs) { }
+#endif /* !CONFIG_FUTEX_ROBUST_UNLOCK */
+
 
 #if defined(CONFIG_FUTEX_PRIVATE_HASH) || defined(CONFIG_FUTEX_ROBUST_UNLOCK)
 void futex_mm_init(struct mm_struct *mm);
diff --git a/include/vdso/futex.h b/include/vdso/futex.h
new file mode 100644
index 0000000..3cd175e
--- /dev/null
+++ b/include/vdso/futex.h
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _VDSO_FUTEX_H
+#define _VDSO_FUTEX_H
+
+#include
+
+/**
+ * __vdso_futex_robust_list64_try_unlock - Try to unlock an uncontended robust futex
+ *					    with a 64-bit pending op pointer
+ * @lock:	Pointer to the futex lock object
+ * @tid:	The TID of the calling task
+ * @pop:	Pointer to the task's robust_list_head::list_pending_op
+ *
+ * Return: The content of *@lock. On success this is the same as @tid.
+ *
+ * The function implements:
+ *	if (atomic_try_cmpxchg(lock, &tid, 0))
+ *		*op = NULL;
+ *	return tid;
+ *
+ * There is a race between a successful unlock and clearing the pending op
+ * pointer in the robust list head. If the calling task is interrupted in the
+ * race window and has to handle a (fatal) signal on return to user space then
+ * the kernel handles the clearing of @pending_op before attempting to deliver
+ * the signal. That ensures that a task cannot exit with a potentially invalid
+ * pending op pointer.
+ *
+ * User space uses it in the following way:
+ *
+ * if (__vdso_futex_robust_list64_try_unlock(lock, tid, &pending_op) != tid)
+ *	err = sys_futex($OP | FUTEX_ROBUST_UNLOCK,....);
+ *
+ * If the unlock attempt fails due to the FUTEX_WAITERS bit set in the lock,
+ * then the syscall does the unlock, clears the pending op pointer and wakes the
+ * requested number of waiters.
+ */
+__u32 __vdso_futex_robust_list64_try_unlock(__u32 *lock, __u32 tid, __u64 *pop);
+
+/**
+ * __vdso_futex_robust_list32_try_unlock - Try to unlock an uncontended robust futex
+ *					    with a 32-bit pending op pointer
+ * @lock:	Pointer to the futex lock object
+ * @tid:	The TID of the calling task
+ * @pop:	Pointer to the task's robust_list_head::list_pending_op
+ *
+ * Return: The content of *@lock. On success this is the same as @tid.
+ *
+ * Same as __vdso_futex_robust_list64_try_unlock() just with a 32-bit @pop pointer.
+ */
+__u32 __vdso_futex_robust_list32_try_unlock(__u32 *lock, __u32 tid, __u32 *pop);
+
+#endif
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 19d2244..e3d381f 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -1,11 +1,12 @@
 // SPDX-License-Identifier: GPL-2.0
 
-#include
-#include
+#include
 #include
+#include
 #include
 #include
 #include
+#include
 #include
 
 /* Workaround to allow gradual conversion of architecture code */
@@ -60,8 +61,10 @@ static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_regs *re
 		if (ti_work & _TIF_PATCH_PENDING)
 			klp_update_patch_state(current);
 
-		if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
+		if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL)) {
+			futex_fixup_robust_unlock(regs);
 			arch_do_signal_or_restart(regs);
+		}
 
 		if (ti_work & _TIF_NOTIFY_RESUME)
 			resume_user_mode_work(regs);
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index aad6e50..6ea4a97 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -46,6 +46,8 @@
 #include
 #include
 
+#include
+
 #include "futex.h"
 #include "../locking/rtmutex_common.h"
 
@@ -1446,6 +1448,22 @@ bool futex_robust_list_clear_pending(void __user *pop, unsigned int flags)
 	return robust_list_clear_pending(pop);
 }
 
+#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
+void __futex_fixup_robust_unlock(struct pt_regs *regs, struct futex_unlock_cs_range *csr)
+{
+	/*
+	 * arch_futex_robust_unlock_get_pop() returns the list pending op pointer from
+	 * @regs if the try_cmpxchg() succeeded.
+	 */
+	void __user *pop = arch_futex_robust_unlock_get_pop(regs);
+
+	if (!pop)
+		return;
+
+	futex_robust_list_clear_pending(pop, csr->pop_size32 ? FLAGS_ROBUST_LIST32 : 0);
+}
+#endif /* CONFIG_FUTEX_ROBUST_UNLOCK */
+
 static void futex_cleanup(struct task_struct *tsk)
 {
 	if (unlikely(tsk->futex.robust_list)) {
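
As a rough user space model of how the patch picks the matching range and
then clears a 32-bit or 64-bit pending op slot, consider the sketch below.
All demo_* and DEMO_* names are hypothetical. In particular
DEMO_MAX_CS_RANGES = 2 is an assumption for illustration (the value of
FUTEX_ROBUST_MAX_CS_RANGES is not shown in this patch), and a plain
memcpy() stands in for the kernel's user space access.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed number of ranges, e.g. one native and one compat sequence. */
#define DEMO_MAX_CS_RANGES	2

/* Hypothetical mirror of the fields the patch reads from a range. */
struct demo_cs_range {
	unsigned long start_ip;
	unsigned int len;
	bool pop_size32;	/* pending op pointer is 32 bits wide */
};

/* Returns the range containing @ip, or NULL if no range matches. */
const struct demo_cs_range *demo_find_range(const struct demo_cs_range *ranges,
					    unsigned long ip)
{
	for (int r = 0; r < DEMO_MAX_CS_RANGES; r++) {
		const struct demo_cs_range *csr = &ranges[r];

		if (ip >= csr->start_ip && ip < csr->start_ip + csr->len)
			return csr;
	}
	return NULL;
}

/* Clears the pending op slot with the width given by the matched range. */
void demo_clear_pending(void *pop, bool size32)
{
	assert(pop);

	if (size32) {
		uint32_t zero = 0;

		memcpy(pop, &zero, sizeof(zero));
	} else {
		uint64_t zero = 0;

		memcpy(pop, &zero, sizeof(zero));
	}
}
```

Taking the width from the matched range rather than from the task means a
32-bit unlock sequence only ever writes four bytes, so data adjacent to a
compat pending op slot is left untouched.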