Date: Thu, 02 Apr 2026 17:21:00 +0200
Message-ID: <20260402151131.876492985@kernel.org>
User-Agent: quilt/0.68
From: Thomas Gleixner
To: LKML
Cc: Mathieu Desnoyers, André Almeida, Sebastian Andrzej Siewior,
    Carlos O'Donell, Peter Zijlstra, Florian Weimer, Rich Felker,
    Torvald Riegel, Darren Hart, Ingo Molnar, Davidlohr Bueso,
    Arnd Bergmann, "Liam R. Howlett", Uros Bizjak, Thomas Weißschuh
Subject: [patch V4 00/14] futex: Address the robust futex unlock race for real
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

This is a follow-up to v3, which can be found here:

   https://lore.kernel.org/20260330114212.927686587@kernel.org

The v1 cover letter contains a detailed analysis of the underlying
problem:

   https://lore.kernel.org/20260316162316.356674433@kernel.org

TLDR: The robust futex unlock mechanism is racy with respect to the
clearing of the robust_list_head::list_op_pending pointer because the
unlock and the clearing of the pointer are not atomic. The race window
is between the unlock and the clearing of the pending op pointer. If
the task is forced to exit within this window, exit will access a
potentially invalid pending op pointer when cleaning up the robust
list. That happens if another task manages to unmap the object
containing the lock before the cleanup, which results in a
use-after-free (UAF). In the worst case this UAF can lead to memory
corruption when unrelated content has been mapped to the same address
by the time the access happens.

User space can't solve this problem without help from the kernel.
This series provides the kernel-side infrastructure to help it along:

  1) A combined unlock, pointer clearing and wake-up operation for the
     contended case

  2) VDSO-based unlock and pointer clearing helpers, with a fix-up
     function in the kernel for the case where user space was
     interrupted within the critical section

Both ensure that the pointer clearing happens _before_ a task exits and
the kernel cleans up the robust list during the exit procedure.

Changes since v3:

  - s/TOS/TSO/ :)

  - Added a barrier() into unsafe_atomic_store_release_user() for the
    TSO case. The barrier is not required for the futex unlock use
    case, but it's harmless and ensures that the function can be
    safely used in other contexts.

  - Fixed up the FUTEX op defines

  - Prevented a build failure when neither FUTEX_PRIVATE_HASH nor
    FUTEX_ROBUST_UNLOCK is enabled

  - Fixed a few typos in the documentation

  - Picked up the latest version of André's selftests and replaced the
    vDSO function lookup

The delta patch against the previous version is below. The series
applies on v7.0-rc3 and is also available via git:

  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git locking-futex-v4

Open issues:

  - The ptrace-based validation test. Sebastian has a working variant,
    which needs to be integrated properly into the test suite.

Thanks,

	tglx

---
diff --git a/Documentation/locking/robust-futex-ABI.rst b/Documentation/locking/robust-futex-ABI.rst
index 0faec175fc26..5e6a0665b8ba 100644
--- a/Documentation/locking/robust-futex-ABI.rst
+++ b/Documentation/locking/robust-futex-ABI.rst
@@ -190,7 +190,7 @@ Robust release is racy
 ----------------------
 
 The removal of a robust futex from the list is racy when doing it solely in
-userspace. Quoting Thomas Gleixer for the explanation:
+userspace. Quoting Thomas Gleixner for the explanation:
 
 The robust futex unlock mechanism is racy in respect to the clearing of the
 robust_list_head::list_op_pending pointer because unlock and clearing the
@@ -202,11 +202,11 @@ userspace. Quoting Thomas Gleixer for the explanation:
 worst case this UAF can lead to memory corruption when unrelated content
 has been mapped to the same address by the time the access happens.
 
-A full in dept analysis can be read at
+A full in-depth analysis can be read at
 https://lore.kernel.org/lkml/20260316162316.356674433@kernel.org/
 
 To overcome that, the kernel needs to participate in the lock release operation.
-This ensures that the release happens "atomically" in the regard of releasing
+This ensures that the release happens "atomically" with regard to releasing
 the lock and removing the address from ``list_op_pending``. If the release is
 interrupted by a signal, the kernel will also verify if it interrupted the
 release operation.
diff --git a/arch/Kconfig b/arch/Kconfig
index c3579449571c..8940fe236394 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -403,8 +403,8 @@ config ARCH_32BIT_OFF_T
 config ARCH_32BIT_USTAT_F_TINODE
 	bool
 
-# Selected by architectures with Total Store Order (TOS)
-config ARCH_MEMORY_ORDER_TOS
+# Selected by architectures with Total Store Order (TSO)
+config ARCH_MEMORY_ORDER_TSO
 	bool
 
 config HAVE_ASM_MODVERSIONS
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c9b1075a0694..7016aba407e9 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -114,7 +114,7 @@ config X86
 	select ARCH_HAS_ZONE_DMA_SET if EXPERT
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
 	select ARCH_HAVE_EXTRA_ELF_NOTES
-	select ARCH_MEMORY_ORDER_TOS
+	select ARCH_MEMORY_ORDER_TSO
 	select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
 	select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI
 	select ARCH_MIGHT_HAVE_PC_PARPORT
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7522782d8164..ec38e3f342bc 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -20,9 +20,9 @@
 #include
 #include
 #include
+#include
 #include
 #include
-#include
 
 #include
 
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index ac1d9ce1f1ec..1764d13c41c1 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -647,8 +647,10 @@ static inline void user_access_restore(unsigned long flags) { }
 #ifndef unsafe_atomic_store_release_user
 # define unsafe_atomic_store_release_user(val, uptr, elbl)	\
 	do {							\
-		if (!IS_ENABLED(CONFIG_ARCH_MEMORY_ORDER_TOS))	\
+		if (!IS_ENABLED(CONFIG_ARCH_MEMORY_ORDER_TSO))	\
 			smp_mb();				\
+		else						\
+			barrier();				\
 		unsafe_put_user(val, uptr, elbl);		\
 	} while (0)
 #endif
diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
index 9a0f564f1737..aaf86a6b75cc 100644
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -56,11 +56,11 @@
 #define FUTEX_UNLOCK_PI_LIST32_PRIVATE	(FUTEX_UNLOCK_PI_LIST32 | FUTEX_PRIVATE_FLAG)
 
 #define FUTEX_UNLOCK_WAKE_LIST64	(FUTEX_WAKE | FUTEX_UNLOCK_ROBUST)
-#define FUTEX_UNLOCK_WAKE_LIST64_PRIVATE (FUTEX_UNLOCK_LIST64 | FUTEX_PRIVATE_FLAG)
+#define FUTEX_UNLOCK_WAKE_LIST64_PRIVATE (FUTEX_UNLOCK_WAKE_LIST64 | FUTEX_PRIVATE_FLAG)
 
 #define FUTEX_UNLOCK_WAKE_LIST32	(FUTEX_WAKE | FUTEX_UNLOCK_ROBUST | \
					 FUTEX_ROBUST_LIST32)
-#define FUTEX_UNLOCK_WAKE_LIST32_PRIVATE (FUTEX_UNLOCK_LIST32 | FUTEX_PRIVATE_FLAG)
+#define FUTEX_UNLOCK_WAKE_LIST32_PRIVATE (FUTEX_UNLOCK_WAKE_LIST32 | FUTEX_PRIVATE_FLAG)
 
 #define FUTEX_UNLOCK_BITSET_LIST64	(FUTEX_WAKE_BITSET | FUTEX_UNLOCK_ROBUST)
 #define FUTEX_UNLOCK_BITSET_LIST64_PRIVATE (FUTEX_UNLOCK_BITSET_LIST64 | FUTEX_PRIVATE_FLAG)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index ce47d02f1ea2..0d5af8f738f3 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1931,7 +1931,7 @@ static int futex_hash_allocate(unsigned int hash_slots, unsigned int flags)
 
 	if (new) {
 		/*
-		 * Will set mm->futex.phash_new on failure;
+		 * Will set mm->futex.phash.new_hash on failure;
		 * futex_private_hash_get() will try again.
		 */
		if (!__futex_pivot_hash(mm, new) && custom)
@@ -2019,11 +2019,13 @@ static void futex_robust_unlock_init_mm(struct futex_mm_data *fd)
 static inline void futex_robust_unlock_init_mm(struct futex_mm_data *fd) { }
 #endif /* !CONFIG_FUTEX_ROBUST_UNLOCK */
 
+#if defined(CONFIG_FUTEX_PRIVATE_HASH) || defined(CONFIG_FUTEX_ROBUST_UNLOCK)
 void futex_mm_init(struct mm_struct *mm)
 {
	futex_hash_init_mm(&mm->futex);
	futex_robust_unlock_init_mm(&mm->futex);
 }
+#endif
 
 int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4)
 {
diff --git a/tools/testing/selftests/futex/functional/robust_list.c b/tools/testing/selftests/futex/functional/robust_list.c
index 62f21f8d89a6..43059f6dbc40 100644
--- a/tools/testing/selftests/futex/functional/robust_list.c
+++ b/tools/testing/selftests/futex/functional/robust_list.c
@@ -31,6 +31,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -44,6 +45,10 @@
 
 #define SLEEP_US 100
 
+#if UINTPTR_MAX == 0xffffffffffffffff
+# define BUILD_64
+#endif
+
 static pthread_barrier_t barrier, barrier2;
 
 static int set_robust_list(struct robust_list_head *head, size_t len)
@@ -564,28 +569,20 @@ TEST(test_circular_list)
  */
 
 /*
- * Auxiliary code for loading the vDSO functions
+ * Auxiliary code for binding the vDSO functions
  */
-#define VDSO_SIZE 0x4000
-
-void *get_vdso_func_addr(const char *str)
+static void *get_vdso_func_addr(const char *function)
 {
-	void *vdso_base = (void *) getauxval(AT_SYSINFO_EHDR), *addr;
-	Dl_info info;
+	const char *vdso_names[] = {
+		"linux-vdso.so.1", "linux-gate.so.1", "linux-vdso32.so.1", "linux-vdso64.so.1",
+	};
 
-	if (!vdso_base) {
-		perror("Error to get AT_SYSINFO_EHDR");
-		return NULL;
-	}
+	for (int i = 0; i < ARRAY_SIZE(vdso_names); i++) {
+		void *vdso = dlopen(vdso_names[i], RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD);
 
-	for (addr = vdso_base; addr < vdso_base + VDSO_SIZE; addr += sizeof(addr)) {
-		if (dladdr(addr, &info) == 0 || !info.dli_sname)
-			continue;
-
-		if (!strcmp(info.dli_sname, str))
-			return info.dli_saddr;
+		if (vdso)
+			return dlsym(vdso, function);
 	}
-
 	return NULL;
 }
 
@@ -611,9 +608,6 @@ FIXTURE_VARIANT(vdso_unlock)
 FIXTURE_SETUP(vdso_unlock)
 {
	self->vdso = get_vdso_func_addr(variant->func_name);
-
-	if (!self->vdso)
-		ksft_test_result_skip("%s not found\n", variant->func_name);
 }
 
 FIXTURE_TEARDOWN(vdso_unlock) {}
@@ -640,10 +634,15 @@ TEST_F(vdso_unlock, test_robust_try_unlock_uncontended)
 	struct lock_struct lock = { .futex = 0 };
	_Atomic(unsigned int) *futex = &lock.futex;
	struct robust_list_head head;
-	uint64_t exp = (uint64_t) NULL;
+	uintptr_t exp = (uintptr_t) NULL;
	pid_t tid = gettid();
	int ret;
 
+	if (!self->vdso) {
+		ksft_test_result_skip("%s not found\n", variant->func_name);
+		return;
+	}
+
	*futex = tid;
 
	ret = set_list(&head);
@@ -659,11 +658,11 @@ TEST_F(vdso_unlock, test_robust_try_unlock_uncontended)
 
	/* Check only the lower 32 bits for the 32-bit entry point */
	if (variant->is_32) {
-		exp = (uint64_t)(unsigned long)&lock.list;
+		exp = (uintptr_t)(unsigned long)&lock.list;
		exp &= ~0xFFFFFFFFULL;
	}
 
-	ASSERT_EQ((uint64_t)(unsigned long)head.list_op_pending, exp);
+	ASSERT_EQ((uintptr_t)(unsigned long)head.list_op_pending, exp);
 }
 
 /*
@@ -679,6 +678,11 @@ TEST_F(vdso_unlock, test_robust_try_unlock_contended)
	pid_t tid = gettid();
	int ret;
 
+	if (!self->vdso) {
+		ksft_test_result_skip("%s not found\n", variant->func_name);
+		return;
+	}
+
	*futex = tid | FUTEX_WAITERS;
 
	ret = set_list(&head);
@@ -724,6 +728,24 @@ FIXTURE_VARIANT_ADD(futex_op, unlock_pi)
	.val3 = 0,
 };
 
+FIXTURE_VARIANT_ADD(futex_op, wake32)
+{
+	.op = FUTEX_WAKE | FUTEX_ROBUST_LIST32,
+	.val3 = 0,
+};
+
+FIXTURE_VARIANT_ADD(futex_op, wake_bitset32)
+{
+	.op = FUTEX_WAKE_BITSET | FUTEX_ROBUST_LIST32,
+	.val3 = FUTEX_BITSET_MATCH_ANY,
+};
+
+FIXTURE_VARIANT_ADD(futex_op, unlock_pi32)
+{
+	.op = FUTEX_UNLOCK_PI | FUTEX_ROBUST_LIST32,
+	.val3 = 0,
+};
+
 /*
  * The syscall should return the number of tasks waken (for this test, 0), clear the futex word and
  * clear list_op_pending
@@ -732,10 +754,18 @@ TEST_F(futex_op, test_futex_robust_unlock)
 {
	struct lock_struct lock = { .futex = 0 };
	_Atomic(unsigned int) *futex = &lock.futex;
+	uintptr_t exp = (uintptr_t) NULL;
	struct robust_list_head head;
	pid_t tid = gettid();
	int ret;
 
+#ifndef BUILD_64
+	if (!(variant->op & FUTEX_ROBUST_LIST32)) {
+		ksft_test_result_skip("Not supported for 32 bit build\n");
+		return;
+	}
+#endif
+
	*futex = tid | FUTEX_WAITERS;
 
	ret = set_list(&head);
@@ -749,7 +779,13 @@ TEST_F(futex_op, test_futex_robust_unlock)
 
	ASSERT_EQ(ret, 0);
	ASSERT_EQ(*futex, 0);
-	ASSERT_EQ(head.list_op_pending, NULL);
+
+	if (variant->op & FUTEX_ROBUST_LIST32) {
+		exp = (uint64_t)(unsigned long)&lock.list;
+		exp &= ~0xFFFFFFFFULL;
+	}
+
+	ASSERT_EQ((uintptr_t)(unsigned long)head.list_op_pending, exp);
 }
 
 TEST_HARNESS_MAIN
diff --git a/tools/testing/selftests/futex/include/futextest.h b/tools/testing/selftests/futex/include/futextest.h
index f4d880b8e795..df33f31d6994 100644
--- a/tools/testing/selftests/futex/include/futextest.h
+++ b/tools/testing/selftests/futex/include/futextest.h
@@ -41,6 +41,9 @@ typedef volatile u_int32_t futex_t;
 #ifndef FUTEX_ROBUST_UNLOCK
 #define FUTEX_ROBUST_UNLOCK	512
 #endif
+#ifndef FUTEX_ROBUST_LIST32
+#define FUTEX_ROBUST_LIST32	1024
+#endif
 #ifndef FUTEX_WAIT_REQUEUE_PI_PRIVATE
 #define FUTEX_WAIT_REQUEUE_PI_PRIVATE	(FUTEX_WAIT_REQUEUE_PI | \
					 FUTEX_PRIVATE_FLAG)