From: Mathieu Desnoyers
To: André Almeida
Cc: linux-kernel@vger.kernel.org, Mathieu Desnoyers, Carlos O'Donell,
	Sebastian Andrzej Siewior, Peter Zijlstra, Florian Weimer,
	Rich Felker, Torvald Riegel, Darren Hart, Thomas Gleixner,
	Ingo Molnar, Davidlohr Bueso, Arnd Bergmann, Liam R. Howlett
Subject: [RFC PATCH v3] futex: Introduce __vdso_robust_futex_unlock_u32 and
 __vdso_robust_pi_futex_try_unlock_u32
Date: Thu, 12 Mar 2026 14:18:50 -0400
Message-Id: <20260312181850.2156985-1-mathieu.desnoyers@efficios.com>
X-Mailer: git-send-email 2.39.5
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Fix a long-standing data corruption race condition with robust
futexes, as pointed out here:

  "File corruption race condition in robust mutex unlocking"
  https://sourceware.org/bugzilla/show_bug.cgi?id=14485

The __vdso_robust_futex_unlock_u32 vDSO unlocks the robust futex by
exchanging the content of *uaddr with val, with store-release
semantics. If the futex has waiters, it sets bit 1 of
*robust_list_head->list_op_pending; otherwise it clears
*robust_list_head->list_op_pending. Those operations are within a code
region known to the kernel, making them safe with respect to
asynchronous program termination, either from thread context or from a
nested signal handler.
Expected use of this vDSO:

	if ((__vdso_robust_futex_unlock_u32((u32 *) &mutex->__data.__lock, 0, robust_list_head)
	     & FUTEX_WAITERS) != 0)
		futex_wake((u32 *) &mutex->__data.__lock, 1, private);
	WRITE_ONCE(robust_list_head->list_op_pending, 0);

Also introduce __vdso_robust_pi_futex_try_unlock_u32 to fix a similar
unlock race with robust PI futexes.

The __vdso_robust_pi_futex_try_unlock_u32 vDSO tries to perform a
compare-and-exchange with release semantics to set the expected *uaddr
content to val. If the futex has waiters, the operation fails and
userspace needs to call futex_unlock_pi(). Before exiting the critical
section, if the cmpxchg fails, it sets bit 1 of
*robust_list_head->list_op_pending. If the cmpxchg succeeds, it clears
*robust_list_head->list_op_pending. Those operations are within a code
region known to the kernel, making them safe with respect to
asynchronous program termination, either from thread context or from a
nested signal handler.

Expected use of this vDSO:

	int l = atomic_load_relaxed(&mutex->__data.__lock);
	do {
		if (((l & FUTEX_WAITERS) != 0) || (l != READ_ONCE(pd->tid))) {
			futex_unlock_pi((unsigned int *) &mutex->__data.__lock, private);
			break;
		}
	} while (!__vdso_robust_pi_futex_try_unlock_u32(&mutex->__data.__lock,
							&l, 0, robust_list_head));
	WRITE_ONCE(robust_list_head->list_op_pending, 0);

The approach taken by these vDSOs is to extend the x86 vDSO exception
table to track the relevant ip ranges. The four kernel execution paths
impacted by this change are:

1) exit_robust_list/compat_exit_robust_list (process exit)
2) setup_rt_frame (signal delivery)
3) futex_wake
4) futex_unlock_pi

Bit 1 of the robust_list_head->list_op_pending pointer is used to flag
whether there is a pending wakeup or futex_unlock_pi action
(FUTEX_UADDR_NEED_ACTION). This allows extending the "need action"
state beyond the vDSO and lets the caller issue the futex_wake and
futex_unlock_pi system calls.
This "need action" flag is cleared by the caller when zeroing
robust_list_head->list_op_pending.

futex_wake now clears robust_list_head->list_op_pending to close the
race between the call to futex_wake and the clearing of
robust_list_head->list_op_pending by the application. This prevents
multiple calls to futex_wake in case a crash happens within that
window.

futex_unlock_pi now clears robust_list_head->list_op_pending to close
the race between the call to futex_unlock_pi and the clearing of
robust_list_head->list_op_pending by the application. This prevents
multiple calls to futex_unlock_pi in case a crash happens within that
window.

[ This patch is lightly compile-tested on x86-64 only, and is submitted
for feedback. It implements the vDSO for x86-32 and x86-64. It is based
on v7.0-rc3. ]

Link: https://lore.kernel.org/lkml/20260220202620.139584-1-andrealmeid@igalia.com/
Signed-off-by: Mathieu Desnoyers
Cc: André Almeida
Cc: Carlos O'Donell
Cc: Sebastian Andrzej Siewior
Cc: Peter Zijlstra
Cc: Florian Weimer
Cc: Rich Felker
Cc: Torvald Riegel
Cc: Darren Hart
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Davidlohr Bueso
Cc: Arnd Bergmann
Cc: Liam R. Howlett
---
Changes since v2:
- Pass robust_list_head as vdso argument.
- Add "val" parameter to each vdso.
- Add _u32 suffix to each vdso.
- Introduce ARCH_HAS_VDSO_FUTEX to provide a futex_vdso_exception stub
  when not implemented by the architecture.
- Wire up x86 vdso32 vfutex.o.

Changes since v1:
- Remove unlock_store_done leftover code from handle_futex_death.
- Handle robust PI futexes.
---
 arch/Kconfig                        |   3 +
 arch/x86/Kconfig                    |   1 +
 arch/x86/entry/vdso/common/vfutex.c |  88 +++++++++++++
 arch/x86/entry/vdso/extable.c       |  59 ++++++++-
 arch/x86/entry/vdso/extable.h       |  37 ++++--
 arch/x86/entry/vdso/vdso32/Makefile |   1 +
 arch/x86/entry/vdso/vdso32/vfutex.c |   1 +
 arch/x86/entry/vdso/vdso64/Makefile |   1 +
 arch/x86/entry/vdso/vdso64/vfutex.c |   1 +
 arch/x86/entry/vdso/vdso64/vsgx.S   |   2 +-
 arch/x86/include/asm/vdso.h         |   3 +
 arch/x86/kernel/signal.c            |   4 +
 include/linux/futex.h               |   1 +
 include/vdso/futex.h                |  74 +++++++++++
 kernel/futex/core.c                 | 188 ++++++++++++++++++++++++----
 kernel/futex/futex.h                |   2 +
 kernel/futex/pi.c                   |   3 +
 kernel/futex/waitwake.c             |   3 +
 18 files changed, 439 insertions(+), 33 deletions(-)
 create mode 100644 arch/x86/entry/vdso/common/vfutex.c
 create mode 100644 arch/x86/entry/vdso/vdso32/vfutex.c
 create mode 100644 arch/x86/entry/vdso/vdso64/vfutex.c
 create mode 100644 include/vdso/futex.h

diff --git a/arch/Kconfig b/arch/Kconfig
index 102ddbd4298e..4f3e1be29af1 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1670,6 +1670,9 @@ config ARCH_HAS_VDSO_ARCH_DATA
 config ARCH_HAS_VDSO_TIME_DATA
 	bool
 
+config ARCH_HAS_VDSO_FUTEX
+	bool
+
 config HAVE_STATIC_CALL
 	bool
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e2df1b147184..957d5d9209a1 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -111,6 +111,7 @@ config X86
 	select ARCH_HAS_SYSCALL_WRAPPER
 	select ARCH_HAS_UBSAN
 	select ARCH_HAS_DEBUG_WX
+	select ARCH_HAS_VDSO_FUTEX
 	select ARCH_HAS_ZONE_DMA_SET if EXPERT
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
 	select ARCH_HAVE_EXTRA_ELF_NOTES
diff --git a/arch/x86/entry/vdso/common/vfutex.c b/arch/x86/entry/vdso/common/vfutex.c
new file mode 100644
index 000000000000..cc6bcd735755
--- /dev/null
+++ b/arch/x86/entry/vdso/common/vfutex.c
@@ -0,0 +1,88 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2026 Mathieu Desnoyers
+ */
+#include
+#include
+#include
+#include "extable.h"
+
+#ifdef CONFIG_X86_64
+# define ASM_PTR_BIT_SET	"btsq "
+# define ASM_PTR_SET		"movq "
+#else
+# define ASM_PTR_BIT_SET	"btsl "
+# define ASM_PTR_SET		"movl "
+#endif
+
+u32 __vdso_robust_futex_unlock_u32(u32 *uaddr, u32 val, struct robust_list_head *robust_list_head)
+{
+	/*
+	 * Within the ip range identified by the futex exception table,
+	 * the register "eax" contains the value loaded by xchg. This is
+	 * expected by futex_vdso_exception() to check whether waiters
+	 * need to be woken up. This register state is transferred to
+	 * bit 1 (NEED_ACTION) of *op_pending_addr before the ip range
+	 * ends.
+	 */
+	asm volatile (
+		_ASM_VDSO_EXTABLE_FUTEX_HANDLE(1f, 3f)
+		/* Exchange uaddr (store-release). */
+		"xchg %[uaddr], %[val]\n\t"
+		"1:\n\t"
+		/* Test if FUTEX_WAITERS (0x80000000) is set. */
+		"test %[val], %[val]\n\t"
+		"js 2f\n\t"
+		/* Clear *op_pending_addr if there are no waiters. */
+		ASM_PTR_SET "$0, %[op_pending_addr]\n\t"
+		"jmp 3f\n\t"
+		"2:\n\t"
+		/* Set bit 1 (NEED_ACTION) in *op_pending_addr. */
+		ASM_PTR_BIT_SET "$1, %[op_pending_addr]\n\t"
+		"3:\n\t"
+		: [val] "+a" (val),
+		  [uaddr] "+m" (*uaddr)
+		: [op_pending_addr] "m" (robust_list_head->list_op_pending)
+		: "memory"
+	);
+	return val;
+}
+
+u32 robust_futex_unlock_u32(u32 *, u32, struct robust_list_head *)
+	__attribute__((weak, alias("__vdso_robust_futex_unlock_u32")));
+
+int __vdso_robust_pi_futex_try_unlock_u32(u32 *uaddr, u32 *expected, u32 val, struct robust_list_head *robust_list_head)
+{
+	u32 orig, expect = *expected;
+
+	orig = expect;
+	/*
+	 * The ZF is set/cleared by cmpxchg and expected to stay
+	 * invariant for the rest of the code region.
+	 */
+	asm volatile (
+		_ASM_VDSO_EXTABLE_PI_FUTEX_HANDLE(1f, 3f)
+		/* Compare-and-exchange uaddr (store-release). Set/clear the ZF. */
+		"lock; cmpxchg %[val], %[uaddr]\n\t"
+		"1:\n\t"
+		/* Check whether cmpxchg fails. */
+		"jnz 2f\n\t"
+		/* Clear *op_pending_addr. */
+		ASM_PTR_SET "$0, %[op_pending_addr]\n\t"
+		"jmp 3f\n\t"
+		"2:\n\t"
+		/* Set bit 1 (NEED_ACTION) in *op_pending_addr. */
+		ASM_PTR_BIT_SET "$1, %[op_pending_addr]\n\t"
+		"3:\n\t"
+		: [expect] "+a" (expect),
+		  [uaddr] "+m" (*uaddr)
+		: [op_pending_addr] "m" (robust_list_head->list_op_pending),
+		  [val] "r" (val)
+		: "memory"
+	);
+	*expected = expect;
+	return expect == orig;
+}
+
+int robust_pi_futex_try_unlock_u32(u32 *, u32 *, u32, struct robust_list_head *)
+	__attribute__((weak, alias("__vdso_robust_pi_futex_try_unlock_u32")));
diff --git a/arch/x86/entry/vdso/extable.c b/arch/x86/entry/vdso/extable.c
index afcf5b65beef..90a31ffb9c6d 100644
--- a/arch/x86/entry/vdso/extable.c
+++ b/arch/x86/entry/vdso/extable.c
@@ -1,12 +1,27 @@
 // SPDX-License-Identifier: GPL-2.0
 #include
 #include
+#include
 #include
 #include
 #include
 
+enum vdso_extable_entry_type {
+	VDSO_EXTABLE_ENTRY_FIXUP = 0,
+	VDSO_EXTABLE_ENTRY_FUTEX = 1,
+	VDSO_EXTABLE_ENTRY_PI_FUTEX = 2,
+};
+
 struct vdso_exception_table_entry {
-	int insn, fixup;
+	int type;	/* enum vdso_extable_entry_type */
+	union {
+		struct {
+			int insn, fixup_insn;
+		} fixup;
+		struct {
+			int start, end;
+		} futex;
+	};
 };
 
 bool fixup_vdso_exception(struct pt_regs *regs, int trapnr,
@@ -33,8 +48,10 @@ bool fixup_vdso_exception(struct pt_regs *regs, int trapnr,
 	extable = image->extable;
 
 	for (i = 0; i < nr_entries; i++) {
-		if (regs->ip == base + extable[i].insn) {
-			regs->ip = base + extable[i].fixup;
+		if (extable[i].type != VDSO_EXTABLE_ENTRY_FIXUP)
+			continue;
+		if (regs->ip == base + extable[i].fixup.insn) {
+			regs->ip = base + extable[i].fixup.fixup_insn;
 			regs->di = trapnr;
 			regs->si = error_code;
 			regs->dx = fault_addr;
@@ -44,3 +61,39 @@ bool fixup_vdso_exception(struct pt_regs *regs, int trapnr,
 
 	return false;
 }
+
+void futex_vdso_exception(struct pt_regs *regs,
+			  bool *_in_futex_vdso,
+			  bool *_need_action)
+{
+	const struct vdso_image *image = current->mm->context.vdso_image;
+	const struct vdso_exception_table_entry *extable;
+	bool in_futex_vdso = false, need_action = false;
+	unsigned int nr_entries, i;
+	unsigned long base;
+
+	if (!current->mm->context.vdso)
+		goto end;
+
+	base = (unsigned long)current->mm->context.vdso + image->extable_base;
+	nr_entries = image->extable_len / (sizeof(*extable));
+	extable = image->extable;
+
+	for (i = 0; i < nr_entries; i++) {
+		if (extable[i].type != VDSO_EXTABLE_ENTRY_FUTEX &&
+		    extable[i].type != VDSO_EXTABLE_ENTRY_PI_FUTEX)
+			continue;
+		if (regs->ip >= base + extable[i].futex.start &&
+		    regs->ip < base + extable[i].futex.end) {
+			in_futex_vdso = true;
+			if (extable[i].type == VDSO_EXTABLE_ENTRY_FUTEX)
+				need_action = (regs->ax & FUTEX_WAITERS);
+			else
+				need_action = !(regs->flags & X86_EFLAGS_ZF);
+			break;
+		}
+	}
+end:
+	*_in_futex_vdso = in_futex_vdso;
+	*_need_action = need_action;
+}
diff --git a/arch/x86/entry/vdso/extable.h b/arch/x86/entry/vdso/extable.h
index baba612b832c..5dfbde724065 100644
--- a/arch/x86/entry/vdso/extable.h
+++ b/arch/x86/entry/vdso/extable.h
@@ -8,21 +8,44 @@
  * exception table, not each individual entry.
  */
 #ifdef __ASSEMBLER__
-#define _ASM_VDSO_EXTABLE_HANDLE(from, to)	\
-	ASM_VDSO_EXTABLE_HANDLE from to
+#define _ASM_VDSO_EXTABLE_FIXUP_HANDLE(from, to)	\
+	ASM_VDSO_EXTABLE_FIXUP_HANDLE from to
 
-.macro ASM_VDSO_EXTABLE_HANDLE from:req to:req
+.macro ASM_VDSO_EXTABLE_FIXUP_HANDLE from:req to:req
 	.pushsection __ex_table, "a"
+	.long 0	/* type: fixup */
 	.long (\from) - __ex_table
 	.long (\to) - __ex_table
 	.popsection
 .endm
 #else
-#define _ASM_VDSO_EXTABLE_HANDLE(from, to)	\
-	".pushsection __ex_table, \"a\"\n"	\
-	".long (" #from ") - __ex_table\n"	\
-	".long (" #to ") - __ex_table\n"	\
+#define _ASM_VDSO_EXTABLE_FIXUP_HANDLE(from, to)	\
+	".pushsection __ex_table, \"a\"\n"		\
+	".long 0\n" /* type: fixup */			\
+	".long (" #from ") - __ex_table\n"		\
+	".long (" #to ") - __ex_table\n"		\
 	".popsection\n"
+
+/*
+ * Identify robust futex unlock critical section.
+ */
+#define _ASM_VDSO_EXTABLE_FUTEX_HANDLE(start, end)	\
+	".pushsection __ex_table, \"a\"\n"		\
+	".long 1\n" /* type: futex */			\
+	".long (" #start ") - __ex_table\n"		\
+	".long (" #end ") - __ex_table\n"		\
+	".popsection\n"
+
+/*
+ * Identify robust PI futex unlock critical section.
+ */
+#define _ASM_VDSO_EXTABLE_PI_FUTEX_HANDLE(start, end)	\
+	".pushsection __ex_table, \"a\"\n"		\
+	".long 2\n" /* type: pi_futex */		\
+	".long (" #start ") - __ex_table\n"		\
+	".long (" #end ") - __ex_table\n"		\
+	".popsection\n"
+
 #endif
 
 #endif /* __VDSO_EXTABLE_H */
diff --git a/arch/x86/entry/vdso/vdso32/Makefile b/arch/x86/entry/vdso/vdso32/Makefile
index add6afb484ba..acf4f990be98 100644
--- a/arch/x86/entry/vdso/vdso32/Makefile
+++ b/arch/x86/entry/vdso/vdso32/Makefile
@@ -9,6 +9,7 @@ vdsos-y := 32
 # Files to link into the vDSO:
 vobjs-y := note.o vclock_gettime.o vgetcpu.o
 vobjs-y += system_call.o sigreturn.o
+vobjs-y += vfutex.o
 
 # Compilation flags
 flags-y := -DBUILD_VDSO32 -m32 -mregparm=0
diff --git a/arch/x86/entry/vdso/vdso32/vfutex.c b/arch/x86/entry/vdso/vdso32/vfutex.c
new file mode 100644
index 000000000000..940a6ee30026
--- /dev/null
+++ b/arch/x86/entry/vdso/vdso32/vfutex.c
@@ -0,0 +1 @@
+#include "common/vfutex.c"
diff --git a/arch/x86/entry/vdso/vdso64/Makefile b/arch/x86/entry/vdso/vdso64/Makefile
index bfffaf1aeecc..df53c2d0037d 100644
--- a/arch/x86/entry/vdso/vdso64/Makefile
+++ b/arch/x86/entry/vdso/vdso64/Makefile
@@ -10,6 +10,7 @@ vdsos-$(CONFIG_X86_X32_ABI) += x32
 # Files to link into the vDSO:
 vobjs-y := note.o vclock_gettime.o vgetcpu.o
 vobjs-y += vgetrandom.o vgetrandom-chacha.o
+vobjs-y += vfutex.o
 vobjs-$(CONFIG_X86_SGX) += vsgx.o
 
 # Compilation flags
diff --git a/arch/x86/entry/vdso/vdso64/vfutex.c b/arch/x86/entry/vdso/vdso64/vfutex.c
new file mode 100644
index 000000000000..940a6ee30026
--- /dev/null
+++ b/arch/x86/entry/vdso/vdso64/vfutex.c
@@ -0,0 +1 @@
+#include "common/vfutex.c"
diff --git a/arch/x86/entry/vdso/vdso64/vsgx.S b/arch/x86/entry/vdso/vdso64/vsgx.S
index 37a3d4c02366..0ea5a1ebd455 100644
--- a/arch/x86/entry/vdso/vdso64/vsgx.S
+++ b/arch/x86/entry/vdso/vdso64/vsgx.S
@@ -145,6 +145,6 @@ SYM_FUNC_START(__vdso_sgx_enter_enclave)
 
 .cfi_endproc
 
-_ASM_VDSO_EXTABLE_HANDLE(.Lenclu_eenter_eresume, .Lhandle_exception)
+_ASM_VDSO_EXTABLE_FIXUP_HANDLE(.Lenclu_eenter_eresume, .Lhandle_exception)
 
 SYM_FUNC_END(__vdso_sgx_enter_enclave)
diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index e8afbe9faa5b..9ac7af34cdc4 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -38,6 +38,9 @@ extern int map_vdso_once(const struct vdso_image *image, unsigned long addr);
 extern bool fixup_vdso_exception(struct pt_regs *regs, int trapnr,
 				 unsigned long error_code,
 				 unsigned long fault_addr);
+extern void futex_vdso_exception(struct pt_regs *regs,
+				 bool *in_futex_vdso,
+				 bool *need_action);
 #endif /* __ASSEMBLER__ */
 
 #endif /* _ASM_X86_VDSO_H */
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 2404233336ab..c2e4db89f16d 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -28,6 +28,7 @@
 #include
 #include
 #include
+#include
 
 #include
 #include
@@ -235,6 +236,9 @@ unsigned long get_sigframe_size(void)
 static int
 setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs)
 {
+	/* Handle futex robust list fixup. */
+	futex_signal_deliver(ksig, regs);
+
 	/* Perform fixup for the pre-signal frame. */
 	rseq_signal_deliver(ksig, regs);
 
diff --git a/include/linux/futex.h b/include/linux/futex.h
index 9e9750f04980..6c274c79e176 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -81,6 +81,7 @@ void futex_exec_release(struct task_struct *tsk);
 long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 	      u32 __user *uaddr2, u32 val2, u32 val3);
 int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4);
+void futex_signal_deliver(struct ksignal *ksig, struct pt_regs *regs);
 
 #ifdef CONFIG_FUTEX_PRIVATE_HASH
 int futex_hash_allocate_default(void);
diff --git a/include/vdso/futex.h b/include/vdso/futex.h
new file mode 100644
index 000000000000..bc7ff4534bee
--- /dev/null
+++ b/include/vdso/futex.h
@@ -0,0 +1,74 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2026 Mathieu Desnoyers
+ */
+
+#ifndef _VDSO_FUTEX_H
+#define _VDSO_FUTEX_H
+
+#include
+#include
+
+/**
+ * __vdso_robust_futex_unlock_u32 - Architecture-specific vDSO implementation of robust futex unlock.
+ * @uaddr: Lock address (points to a 32-bit unsigned integer type).
+ * @val: New value to set in *@uaddr.
+ * @robust_list_head: The thread-specific robust list that has been registered with set_robust_list.
+ *
+ * This vDSO unlocks the robust futex by exchanging the content of
+ * *@uaddr with @val with a store-release semantic. If the futex has
+ * waiters, it sets bit 1 of *@robust_list_head->list_op_pending, else
+ * it clears *@robust_list_head->list_op_pending. Those operations are
+ * within a code region known by the kernel, making them safe with
+ * respect to asynchronous program termination either from thread
+ * context or from a nested signal handler.
+ *
+ * Returns: The old value present at *@uaddr.
+ *
+ * Expected use of this vDSO:
+ *
+ * robust_list_head is the thread-specific robust list that has been
+ * registered with set_robust_list.
+ *
+ * if ((__vdso_robust_futex_unlock_u32((u32 *) &mutex->__data.__lock, 0, robust_list_head)
+ *      & FUTEX_WAITERS) != 0)
+ *         futex_wake((u32 *) &mutex->__data.__lock, 1, private);
+ * WRITE_ONCE(robust_list_head->list_op_pending, 0);
+ */
+extern u32 __vdso_robust_futex_unlock_u32(u32 *uaddr, u32 val, struct robust_list_head *robust_list_head);
+
+/**
+ * __vdso_robust_pi_futex_try_unlock_u32 - Architecture-specific vDSO implementation of robust PI futex unlock.
+ * @uaddr: Lock address (points to a 32-bit unsigned integer type).
+ * @expected: Expected value (in), value loaded by compare-and-exchange (out).
+ * @val: New value to set in *@uaddr if *@uaddr matches *@expected.
+ * @robust_list_head: The thread-specific robust list that has been registered with set_robust_list.
+ *
+ * This vDSO tries to perform a compare-and-exchange with release
+ * semantic to set the expected *@uaddr content to @val. If the futex
+ * has waiters, it fails, and userspace needs to call futex_unlock_pi().
+ * Before exiting the critical section, if the cmpxchg fails, it sets
+ * bit 1 of *@robust_list_head->list_op_pending. If the cmpxchg
+ * succeeds, it clears *@robust_list_head->list_op_pending. Those
+ * operations are within a code region known by the kernel, making them
+ * safe with respect to asynchronous program termination either from
+ * thread context or from a nested signal handler.
+ *
+ * Returns: Zero if the operation fails to release the lock, non-zero on success.
+ *
+ * Expected use of this vDSO:
+ *
+ * int l = atomic_load_relaxed(&mutex->__data.__lock);
+ * do {
+ *         if (((l & FUTEX_WAITERS) != 0) || (l != READ_ONCE(pd->tid))) {
+ *                 futex_unlock_pi((unsigned int *) &mutex->__data.__lock, private);
+ *                 break;
+ *         }
+ * } while (!__vdso_robust_pi_futex_try_unlock_u32(&mutex->__data.__lock,
+ *                                                 &l, 0, robust_list_head));
+ * WRITE_ONCE(robust_list_head->list_op_pending, 0);
+ */
+int __vdso_robust_pi_futex_try_unlock_u32(u32 *uaddr, u32 *expected, u32 val, struct robust_list_head *robust_list_head);
+
+#endif /* _VDSO_FUTEX_H */
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index cf7e610eac42..28bcbe6156ee 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -48,6 +48,10 @@
 #include "futex.h"
 #include "../locking/rtmutex_common.h"
 
+#define FUTEX_UADDR_PI		(1UL << 0)
+#define FUTEX_UADDR_NEED_ACTION	(1UL << 1)
+#define FUTEX_UADDR_MASK	(~(FUTEX_UADDR_PI | FUTEX_UADDR_NEED_ACTION))
+
 /*
  * The base of the bucket array and its size are always used together
  * (after initialization only in futex_hash()), so ensure that they
@@ -1004,6 +1008,118 @@ void futex_unqueue_pi(struct futex_q *q)
 	q->pi_state = NULL;
 }
 
+#ifndef CONFIG_ARCH_HAS_VDSO_FUTEX
+static void futex_vdso_exception(struct pt_regs *regs, bool *in_futex_vdso, bool *need_action)
+{
+	*in_futex_vdso = false;
+	*need_action = false;
+}
+#endif
+
+/*
+ * Transfer the need action state from vDSO stack to the
+ * FUTEX_UADDR_NEED_ACTION list_op_pending bit so it's observed if the
+ * program is terminated while executing the signal handler.
+ */
+static void signal_delivery_fixup_robust_list(struct task_struct *curr, struct pt_regs *regs)
+{
+	struct robust_list_head __user *head = curr->robust_list;
+	bool in_futex_vdso, need_action;
+	unsigned long pending;
+
+	if (!head)
+		return;
+	futex_vdso_exception(regs, &in_futex_vdso, &need_action);
+	if (!in_futex_vdso)
+		return;
+
+	if (need_action) {
+		if (get_user(pending, (unsigned long __user *)&head->list_op_pending))
+			goto fault;
+		pending |= FUTEX_UADDR_NEED_ACTION;
+		if (put_user(pending, (unsigned long __user *)&head->list_op_pending))
+			goto fault;
+	} else {
+		if (put_user(0UL, (unsigned long __user *)&head->list_op_pending))
+			goto fault;
+	}
+	return;
+fault:
+	force_sig(SIGSEGV);
+}
+
+#ifdef CONFIG_COMPAT
+static void compat_signal_delivery_fixup_robust_list(struct task_struct *curr, struct pt_regs *regs)
+{
+	struct compat_robust_list_head __user *head = curr->compat_robust_list;
+	bool in_futex_vdso, need_action;
+	unsigned int pending;
+
+	if (!head)
+		return;
+	futex_vdso_exception(regs, &in_futex_vdso, &need_action);
+	if (!in_futex_vdso)
+		return;
+	if (need_action) {
+		if (get_user(pending, (compat_uptr_t __user *)&head->list_op_pending))
+			goto fault;
+		pending |= FUTEX_UADDR_NEED_ACTION;
+		if (put_user(pending, (compat_uptr_t __user *)&head->list_op_pending))
+			goto fault;
+	} else {
+		if (put_user(0U, (compat_uptr_t __user *)&head->list_op_pending))
+			goto fault;
+	}
+	return;
+fault:
+	force_sig(SIGSEGV);
+}
+#endif
+
+void futex_signal_deliver(struct ksignal *ksig, struct pt_regs *regs)
+{
+	struct task_struct *tsk = current;
+
+	if (unlikely(tsk->robust_list))
+		signal_delivery_fixup_robust_list(tsk, regs);
+#ifdef CONFIG_COMPAT
+	if (unlikely(tsk->compat_robust_list))
+		compat_signal_delivery_fixup_robust_list(tsk, regs);
+#endif
+}
+
+static void do_clear_robust_list_pending_op(struct task_struct *curr)
+{
+	struct robust_list_head __user *head = curr->robust_list;
+
+	if (!head)
+		return;
+	if (put_user(0UL, (unsigned long __user *)&head->list_op_pending))
+		force_sig(SIGSEGV);
+}
+
+#ifdef CONFIG_COMPAT
+static void do_compat_clear_robust_list_pending_op(struct task_struct *curr)
+{
+	struct compat_robust_list_head __user *head = curr->compat_robust_list;
+
+	if (!head)
+		return;
+	if (put_user(0U, (unsigned int __user *)&head->list_op_pending))
+		force_sig(SIGSEGV);
+}
+#endif
+
+void clear_robust_list_pending_op(struct task_struct *curr)
+{
+	if (unlikely(curr->robust_list))
+		do_clear_robust_list_pending_op(curr);
+#ifdef CONFIG_COMPAT
+	if (unlikely(curr->compat_robust_list))
+		do_compat_clear_robust_list_pending_op(curr);
+#endif
+}
+
 /* Constants for the pending_op argument of handle_futex_death */
 #define HANDLE_DEATH_PENDING	true
 #define HANDLE_DEATH_LIST	false
@@ -1013,12 +1129,34 @@ void futex_unqueue_pi(struct futex_q *q)
  * dying task, and do notification if so:
  */
 static int handle_futex_death(u32 __user *uaddr, struct task_struct *curr,
-			      bool pi, bool pending_op)
+			      bool pi, bool pending_op, bool need_action)
 {
 	u32 uval, nval, mval;
 	pid_t owner;
 	int err;
 
+	/*
+	 * Process dies after the store unlocking futex, before clearing
+	 * the pending ops. Perform the required action if needed.
+	 * For non-PI futex, the action is to wake up the waiter.
+	 * For PI futex, the action is to call robust_unlock_pi.
+	 * Prevent storing to the futex after it was unlocked.
+	 */
+	if (pending_op) {
+		bool in_futex_vdso, vdso_need_action;
+
+		futex_vdso_exception(task_pt_regs(curr), &in_futex_vdso, &vdso_need_action);
+		if (need_action || vdso_need_action) {
+			if (pi)
+				futex_unlock_pi(uaddr, FLAGS_SIZE_32 | FLAGS_SHARED);
+			else
+				futex_wake(uaddr, FLAGS_SIZE_32 | FLAGS_SHARED, 1,
+					   FUTEX_BITSET_MATCH_ANY);
+		}
+		if (need_action || in_futex_vdso)
+			return 0;
+	}
+
 	/* Futex address must be 32bit aligned */
 	if ((((unsigned long)uaddr) % sizeof(*uaddr)) != 0)
 		return -1;
@@ -1128,19 +1266,23 @@ static int handle_futex_death(u32 __user *uaddr, struct task_struct *curr,
 }
 
 /*
- * Fetch a robust-list pointer. Bit 0 signals PI futexes:
+ * Fetch a robust-list pointer. Bit 0 signals PI futexes, bit 1 signals
+ * need action:
  */
 static inline int fetch_robust_entry(struct robust_list __user **entry,
 				     struct robust_list __user * __user *head,
-				     unsigned int *pi)
+				     unsigned int *pi,
+				     unsigned int *need_action)
 {
 	unsigned long uentry;
 
 	if (get_user(uentry, (unsigned long __user *)head))
 		return -EFAULT;
 
-	*entry = (void __user *)(uentry & ~1UL);
-	*pi = uentry & 1;
+	*entry = (void __user *)(uentry & FUTEX_UADDR_MASK);
+	*pi = uentry & FUTEX_UADDR_PI;
+	if (need_action)
+		*need_action = uentry & FUTEX_UADDR_NEED_ACTION;
 
 	return 0;
 }
@@ -1155,7 +1297,7 @@ static void exit_robust_list(struct task_struct *curr)
 {
 	struct robust_list_head __user *head = curr->robust_list;
 	struct robust_list __user *entry, *next_entry, *pending;
-	unsigned int limit = ROBUST_LIST_LIMIT, pi, pip;
+	unsigned int limit = ROBUST_LIST_LIMIT, pi, pip, need_action;
 	unsigned int next_pi;
 	unsigned long futex_offset;
 	int rc;
@@ -1164,7 +1306,7 @@ static void exit_robust_list(struct task_struct *curr)
 	 * Fetch the list head (which was registered earlier, via
 	 * sys_set_robust_list()):
 	 */
-	if (fetch_robust_entry(&entry, &head->list.next, &pi))
+	if (fetch_robust_entry(&entry, &head->list.next, &pi, NULL))
 		return;
 	/*
 	 * Fetch the relative futex offset:
@@ -1175,7 +1317,7 @@ static void exit_robust_list(struct task_struct *curr)
 	 * Fetch any possibly pending lock-add first, and handle it
 	 * if it exists:
 	 */
-	if (fetch_robust_entry(&pending, &head->list_op_pending, &pip))
+	if (fetch_robust_entry(&pending, &head->list_op_pending, &pip, &need_action))
 		return;
 
 	next_entry = NULL;	/* avoid warning with gcc */
@@ -1184,14 +1326,14 @@ static void exit_robust_list(struct task_struct *curr)
 		 * Fetch the next entry in the list before calling
 		 * handle_futex_death:
 		 */
-		rc = fetch_robust_entry(&next_entry, &entry->next, &next_pi);
+		rc = fetch_robust_entry(&next_entry, &entry->next, &next_pi, NULL);
 		/*
 		 * A pending lock might already be on the list, so
 		 * don't process it twice:
 		 */
 		if (entry != pending) {
 			if (handle_futex_death((void __user *)entry + futex_offset,
-					       curr, pi, HANDLE_DEATH_LIST))
+					       curr, pi, HANDLE_DEATH_LIST, false))
 				return;
 		}
 		if (rc)
@@ -1209,7 +1351,7 @@ static void exit_robust_list(struct task_struct *curr)
 
 	if (pending) {
 		handle_futex_death((void __user *)pending + futex_offset,
-				   curr, pip, HANDLE_DEATH_PENDING);
+				   curr, pip, HANDLE_DEATH_PENDING, need_action);
 	}
 }
 
@@ -1224,17 +1366,20 @@ static void __user *futex_uaddr(struct robust_list __user *entry,
 }
 
 /*
- * Fetch a robust-list pointer. Bit 0 signals PI futexes:
+ * Fetch a robust-list pointer. Bit 0 signals PI futexes, bit 1 signals
+ * need action:
  */
 static inline int
 compat_fetch_robust_entry(compat_uptr_t *uentry, struct robust_list __user **entry,
-			  compat_uptr_t __user *head, unsigned int *pi)
+			  compat_uptr_t __user *head, unsigned int *pi, unsigned int *need_action)
 {
 	if (get_user(*uentry, head))
 		return -EFAULT;
 
-	*entry = compat_ptr((*uentry) & ~1);
-	*pi = (unsigned int)(*uentry) & 1;
+	*entry = compat_ptr((*uentry) & FUTEX_UADDR_MASK);
+	*pi = (unsigned int)(*uentry) & FUTEX_UADDR_PI;
+	if (need_action)
+		*need_action = (unsigned int)(*uentry) & FUTEX_UADDR_NEED_ACTION;
 
 	return 0;
 }
@@ -1249,7 +1394,7 @@ static void compat_exit_robust_list(struct task_struct *curr)
 {
 	struct compat_robust_list_head __user *head = curr->compat_robust_list;
 	struct robust_list __user *entry, *next_entry, *pending;
-	unsigned int limit = ROBUST_LIST_LIMIT, pi, pip;
+	unsigned int limit = ROBUST_LIST_LIMIT, pi, pip, need_action;
 	unsigned int next_pi;
 	compat_uptr_t uentry, next_uentry, upending;
 	compat_long_t futex_offset;
	int rc;
@@ -1259,7 +1404,7 @@ static void compat_exit_robust_list(struct task_struct *curr)
 	 * Fetch the list head (which was registered earlier, via
 	 * sys_set_robust_list()):
 	 */
-	if (compat_fetch_robust_entry(&uentry, &entry, &head->list.next, &pi))
+	if (compat_fetch_robust_entry(&uentry, &entry, &head->list.next, &pi, NULL))
 		return;
 	/*
 	 * Fetch the relative futex offset:
@@ -1271,7 +1416,7 @@ static void compat_exit_robust_list(struct task_struct *curr)
 	 * if it exists:
 	 */
 	if (compat_fetch_robust_entry(&upending, &pending,
-				      &head->list_op_pending, &pip))
+				      &head->list_op_pending, &pip, &need_action))
 		return;
 
 	next_entry = NULL;	/* avoid warning with gcc */
@@ -1281,7 +1426,7 @@ static void compat_exit_robust_list(struct task_struct *curr)
 		 * handle_futex_death:
 		 */
 		rc = compat_fetch_robust_entry(&next_uentry, &next_entry,
-			(compat_uptr_t __user *)&entry->next, &next_pi);
+			(compat_uptr_t __user *)&entry->next, &next_pi, NULL);
 		/*
 		 * A pending lock might already be on the list, so
 		 * dont process it twice:
@@ -1289,8 +1434,7 @@ static void compat_exit_robust_list(struct task_struct *curr)
 		if (entry != pending) {
 			void __user *uaddr = futex_uaddr(entry, futex_offset);
 
-			if (handle_futex_death(uaddr, curr, pi,
-					       HANDLE_DEATH_LIST))
+			if (handle_futex_death(uaddr, curr, pi, HANDLE_DEATH_LIST, false))
 				return;
 		}
 		if (rc)
@@ -1309,7 +1453,7 @@ static void compat_exit_robust_list(struct task_struct *curr)
 	if (pending) {
 		void __user *uaddr = futex_uaddr(pending, futex_offset);
 
-		handle_futex_death(uaddr, curr, pip, HANDLE_DEATH_PENDING);
+		handle_futex_death(uaddr, curr, pip, HANDLE_DEATH_PENDING, need_action);
 	}
 }
 #endif
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 30c2afa03889..f64ed00463ca 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -396,6 +396,8 @@ double_unlock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
 	spin_unlock(&hb2->lock);
 }
 
+extern void clear_robust_list_pending_op(struct task_struct *curr);
+
 /* syscalls */
 
 extern int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags, u32
diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index bc1f7e83a37e..3b889dfbcdd5 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -1148,6 +1148,9 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 	if ((uval & FUTEX_TID_MASK) != vpid)
 		return -EPERM;
 
+	/* Clear the pending_op_list. */
+	clear_robust_list_pending_op(current);
+
 	ret = get_futex_key(uaddr, flags, &key, FUTEX_WRITE);
 	if (ret)
 		return ret;
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index 1c2dd03f11ec..7752ed8c6dc1 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -162,6 +162,9 @@ int futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 	if (!bitset)
 		return -EINVAL;
 
+	/* Clear the pending_op_list. */
+	clear_robust_list_pending_op(current);
+
 	ret = get_futex_key(uaddr, flags, &key, FUTEX_READ);
 	if (unlikely(ret != 0))
 		return ret;
-- 
2.39.5