This is a follow-up to v2, which can be found here:
https://lore.kernel.org/20260319225224.853416463@kernel.org
The v1 cover letter contains a detailed analysis of the underlying
problem:
https://lore.kernel.org/20260316162316.356674433@kernel.org
TLDR:
The robust futex unlock mechanism is racy with respect to the clearing of the
robust_list_head::list_op_pending pointer because unlock and clearing the
pointer are not atomic. The race window is between the unlock and clearing
the pending op pointer. If the task is forced to exit in this window, exit
will access a potentially invalid pending op pointer when cleaning up the
robust list. That happens if another task manages to unmap the object
containing the lock before the cleanup, which results in an UAF. In the
worst case this UAF can lead to memory corruption when unrelated content
has been mapped to the same address by the time the access happens.
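The racy sequence can be sketched in userspace terms as follows. This is a minimal illustrative model, not the glibc implementation; the names (`robust_head`, `racy_unlock`) are hypothetical, and only the ordering of the three steps matters:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical minimal stand-in for struct robust_list_head */
struct robust_head {
	uintptr_t list_op_pending;	/* address of the lock being released */
};

static struct robust_head head;
static _Atomic uint32_t futex_word;

/* Classic userspace unlock: the window between (2) and (3) is the race */
static int racy_unlock(_Atomic uint32_t *lock, uint32_t tid)
{
	uint32_t expected = tid;

	head.list_op_pending = (uintptr_t)lock;		/* (1) announce the op */
	if (!atomic_compare_exchange_strong(lock, &expected, 0))
		return -1;				/* contended: futex() needed */
	/*
	 * (2) The lock is free, but list_op_pending still points at it.
	 * If the task is forced to exit right here and another task unmaps
	 * the object, the kernel's exit cleanup walks a stale pointer.
	 */
	head.list_op_pending = 0;			/* (3) clear the pending op */
	return 0;
}
```

Steps (2) and (3) are two separate stores, and no userspace-only ordering can make them atomic against a forced exit in between.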
User space can't solve this problem without help from the kernel. This
series provides the kernel side infrastructure to help it along:
1) Combined unlock, pointer clearing, wake-up for the contended case
2) VDSO based unlock and pointer clearing helpers with a fix-up function
in the kernel when user space was interrupted within the critical
section.
Both ensure that the pointer clearing happens _before_ a task exits and the
kernel cleans up the robust list during the exit procedure.
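The intended userspace usage of the two mechanisms can be sketched as below. The stand-in functions are hypothetical stubs (no real vDSO lookup or syscall is made); they only model the protocol: try the vDSO fast path first, and fall back to the `FUTEX_ROBUST_UNLOCK` syscall variant when waiters exist:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdbool.h>

#define FUTEX_WAITERS 0x80000000u

static uintptr_t list_op_pending;
static _Atomic uint32_t lock_word;
static unsigned int robust_unlock_syscalls;	/* counts fallback syscalls */

/*
 * Stand-in for __vdso_futex_robust_list64_try_unlock(): succeeds only in
 * the uncontended case, releasing the lock and clearing the pending op
 * within the kernel-fixed-up critical section.
 */
static bool vdso_try_unlock(_Atomic uint32_t *lock, uint32_t tid)
{
	uint32_t expected = tid;

	if (!atomic_compare_exchange_strong(lock, &expected, 0))
		return false;		/* waiters present, or not the owner */
	list_op_pending = 0;
	return true;
}

/* Stand-in for futex(uaddr, FUTEX_WAKE | FUTEX_ROBUST_UNLOCK, 1, ...) */
static void futex_robust_unlock_wake(_Atomic uint32_t *lock)
{
	atomic_store(lock, 0);		/* kernel sets the futex word to zero, */
	list_op_pending = 0;		/* clears list_op_pending, then wakes   */
	robust_unlock_syscalls++;
}

static void robust_unlock(_Atomic uint32_t *lock, uint32_t tid)
{
	list_op_pending = (uintptr_t)lock;
	if (!vdso_try_unlock(lock, tid))
		futex_robust_unlock_wake(lock);
}
```

Either path ends with the futex word zeroed and `list_op_pending` cleared before the unlocking task can reach exit cleanup.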
Changes since v2:
- Retain the critical section ranges on fork() - Sebastian
- Simplify the region update and provide generic helpers for that to
avoid copy and pasta in the architecture VDSO/VMA code.
- Consolidate the naming: __vdso_futex_robust_list64_try_unlock() and
__vdso_futex_robust_list32_try_unlock() as there is no need to make it
different for the 32-bit VDSO, which only supports the list32 variant.
- Use 'r' constraint in the ASM template - Uros
- Rename ARCH_STORE_IMPLIES_RELEASE to ARCH_MEMORY_ORDER_TOS - Peter
- Save space in the ranges array by using start_ip + len instead of
start_ip/end_ip - Sebastian
- Reduce number of ranges to 1 if COMPAT is disabled - Peter
- Separate the private hash and unlock data into their own structs to
  make fork/exec handling simpler
- Make futex_mm_init() void as it cannot fail
- Invalidate critical section ranges by setting start_ip to ~0UL so that
  the quick check in the signal path bails out on the first compare;
  with 0 it always has to evaluate both conditions.
- Picked up the documentation and selftest patches from André
- Addressed various review comments
- Picked up tags as appropriate
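The ~0UL invalidation choice above can be illustrated with the range check used in the signal path. A small standalone sketch mirroring the in-kernel predicate (the struct and helper names follow the series, but this is an illustration, not the kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors futex_unlock_cs_range: start_ip inclusive, start_ip + len exclusive */
struct cs_range {
	unsigned long start_ip;
	unsigned int len;
};

static bool within_range(unsigned long ip, const struct cs_range *csr)
{
	/*
	 * With start_ip == ~0UL the first compare already fails for any
	 * real instruction pointer, so the check drops out immediately.
	 * With start_ip == 0 the first compare is always true and the
	 * second compare must be evaluated as well.
	 */
	return ip >= csr->start_ip && ip < csr->start_ip + csr->len;
}
```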
Thanks to everyone for feedback and discussion!
The delta patch against the previous version is below.
The series applies on v7.0-rc3 and is also available via git:
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git locking-futex-v3
Opens: ptrace-based validation test. Sebastian has a working prototype.
Thanks,
tglx
---
diff --git a/Documentation/locking/robust-futex-ABI.rst b/Documentation/locking/robust-futex-ABI.rst
index f24904f1c16f..0faec175fc26 100644
--- a/Documentation/locking/robust-futex-ABI.rst
+++ b/Documentation/locking/robust-futex-ABI.rst
@@ -153,6 +153,9 @@ On removal:
3) release the futex lock, and
4) clear the 'lock_op_pending' word.
+Please note that removing a robust futex from the list purely in userspace
+is racy. Refer to the next section for details and how to avoid it.
+
On exit, the kernel will consider the address stored in
'list_op_pending' and the address of each 'lock word' found by walking
the list starting at 'head'. For each such address, if the bottom 30
@@ -182,3 +185,44 @@ any point:
When the kernel sees a list entry whose 'lock word' doesn't have the
current threads TID in the lower 30 bits, it does nothing with that
entry, and goes on to the next entry.
+
+Robust release is racy
+----------------------
+
+The removal of a robust futex from the list is racy when doing it solely in
+userspace. Quoting Thomas Gleixner for the explanation:
+
+ The robust futex unlock mechanism is racy in respect to the clearing of the
+ robust_list_head::list_op_pending pointer because unlock and clearing the
+ pointer are not atomic. The race window is between the unlock and clearing
+ the pending op pointer. If the task is forced to exit in this window, exit
+ will access a potentially invalid pending op pointer when cleaning up the
+ robust list. That happens if another task manages to unmap the object
+ containing the lock before the cleanup, which results in an UAF. In the
+ worst case this UAF can lead to memory corruption when unrelated content
+ has been mapped to the same address by the time the access happens.
+
+A full in-depth analysis can be read at
+https://lore.kernel.org/lkml/20260316162316.356674433@kernel.org/
+
+To overcome that, the kernel needs to participate in the lock release operation.
+This ensures that releasing the lock and clearing the address from
+``list_op_pending`` happen "atomically". If the release is interrupted by a
+signal, the kernel also verifies whether it interrupted the release operation
+and fixes it up.
+
+For the contended unlock case, where other threads are waiting for the lock
+release, there's the ``FUTEX_ROBUST_UNLOCK`` flag for the ``futex()`` system
+call, which must be combined with one of the following operations:
+``FUTEX_WAKE``, ``FUTEX_WAKE_BITSET`` or ``FUTEX_UNLOCK_PI``. The kernel
+releases the lock (sets the futex word to zero) and clears the
+``list_op_pending`` field, then proceeds with the normal wake path.
+
+For the non-contended path, there's still a race between checking the futex
+word and clearing the ``list_op_pending`` field. To solve this without a full
+system call, userspace should call the virtual syscall
+``__vdso_futex_robust_listXX_try_unlock()`` (where XX is either 32 or 64,
+depending on the size of the pointer). If the vDSO call succeeds, it has
+released the lock and cleared ``list_op_pending``. If it fails, there are
+waiters for this lock and a call to the ``futex()`` syscall with
+``FUTEX_ROBUST_UNLOCK`` is needed.
diff --git a/arch/Kconfig b/arch/Kconfig
index 0c1e6cc101ff..c3579449571c 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -403,8 +403,8 @@ config ARCH_32BIT_OFF_T
config ARCH_32BIT_USTAT_F_TINODE
bool
-# Selected by architectures when plain stores have release semantics
-config ARCH_STORE_IMPLIES_RELEASE
+# Selected by architectures with Total Store Order (TSO)
+config ARCH_MEMORY_ORDER_TOS
bool
config HAVE_ASM_MODVERSIONS
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e9437efae787..c9b1075a0694 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -114,12 +114,12 @@ config X86
select ARCH_HAS_ZONE_DMA_SET if EXPERT
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_HAVE_EXTRA_ELF_NOTES
+ select ARCH_MEMORY_ORDER_TOS
select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI
select ARCH_MIGHT_HAVE_PC_PARPORT
select ARCH_MIGHT_HAVE_PC_SERIO
select ARCH_STACKWALK
- select ARCH_STORE_IMPLIES_RELEASE
select ARCH_SUPPORTS_ACPI
select ARCH_SUPPORTS_ATOMIC_RMW
select ARCH_SUPPORTS_DEBUG_PAGEALLOC
diff --git a/arch/x86/entry/vdso/common/vfutex.c b/arch/x86/entry/vdso/common/vfutex.c
index 8df8fd6c759d..dba54745b355 100644
--- a/arch/x86/entry/vdso/common/vfutex.c
+++ b/arch/x86/entry/vdso/common/vfutex.c
@@ -25,32 +25,32 @@
#define __stringify_1(x...) #x
#define __stringify(x...) __stringify_1(x)
-#define LABEL(name, which) __stringify(name##_futex_try_unlock_cs_##which:)
+#define LABEL(prefix, which) __stringify(prefix##_try_unlock_cs_##which:)
-#define JNZ_END(name) "jnz " __stringify(name) "_futex_try_unlock_cs_end\n"
+#define JNZ_END(prefix) "jnz " __stringify(prefix) "_try_unlock_cs_end\n"
#define CLEAR_POPQ "movq %[zero], %a[pop]\n"
#define CLEAR_POPL "movl %k[zero], %a[pop]\n"
-#define futex_robust_try_unlock(name, clear_pop, __lock, __tid, __pop) \
+#define futex_robust_try_unlock(prefix, clear_pop, __lock, __tid, __pop) \
({ \
asm volatile ( \
" \n" \
" lock cmpxchgl %k[zero], %a[lock] \n" \
" \n" \
- LABEL(name, start) \
+ LABEL(prefix, start) \
" \n" \
- JNZ_END(name) \
+ JNZ_END(prefix) \
" \n" \
- LABEL(name, success) \
+ LABEL(prefix, success) \
" \n" \
clear_pop \
" \n" \
- LABEL(name, end) \
+ LABEL(prefix, end) \
: [tid] "+&a" (__tid) \
: [lock] "D" (__lock), \
[pop] "d" (__pop), \
- [zero] "S" (0UL) \
+ [zero] "r" (0UL) \
: "memory" \
); \
__tid; \
@@ -59,18 +59,13 @@
#ifdef __x86_64__
__u32 __vdso_futex_robust_list64_try_unlock(__u32 *lock, __u32 tid, __u64 *pop)
{
- return futex_robust_try_unlock(x86_64, CLEAR_POPQ, lock, tid, pop);
+ return futex_robust_try_unlock(__futex_list64, CLEAR_POPQ, lock, tid, pop);
}
+#endif /* __x86_64__ */
-#ifdef CONFIG_COMPAT
+#if defined(CONFIG_X86_32) || defined(CONFIG_COMPAT)
__u32 __vdso_futex_robust_list32_try_unlock(__u32 *lock, __u32 tid, __u32 *pop)
{
- return futex_robust_try_unlock(x86_64_compat, CLEAR_POPL, lock, tid, pop);
+ return futex_robust_try_unlock(__futex_list32, CLEAR_POPL, lock, tid, pop);
}
-#endif /* CONFIG_COMPAT */
-#else /* __x86_64__ */
-__u32 __vdso_futex_robust_list32_try_unlock(__u32 *lock, __u32 tid, __u32 *pop)
-{
- return futex_robust_try_unlock(x86_32, CLEAR_POPL, lock, tid, pop);
-}
-#endif /* !__x86_64__ */
+#endif /* CONFIG_X86_32 || CONFIG_COMPAT */
diff --git a/arch/x86/entry/vdso/vdso64/vdso64.lds.S b/arch/x86/entry/vdso/vdso64/vdso64.lds.S
index 11dae35358a2..4a72122da81b 100644
--- a/arch/x86/entry/vdso/vdso64/vdso64.lds.S
+++ b/arch/x86/entry/vdso/vdso64/vdso64.lds.S
@@ -32,9 +32,12 @@ VERSION {
#endif
getrandom;
__vdso_getrandom;
+
#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
__vdso_futex_robust_list64_try_unlock;
+#ifdef CONFIG_COMPAT
__vdso_futex_robust_list32_try_unlock;
+#endif
#endif
local: *;
};
diff --git a/arch/x86/entry/vdso/vdso64/vdsox32.lds.S b/arch/x86/entry/vdso/vdso64/vdsox32.lds.S
index 0e844af63304..b917dc69f62f 100644
--- a/arch/x86/entry/vdso/vdso64/vdsox32.lds.S
+++ b/arch/x86/entry/vdso/vdso64/vdsox32.lds.S
@@ -22,9 +22,12 @@ VERSION {
__vdso_getcpu;
__vdso_time;
__vdso_clock_getres;
+
#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
__vdso_futex_robust_list64_try_unlock;
+#ifdef CONFIG_COMPAT
__vdso_futex_robust_list32_try_unlock;
+#endif
#endif
local: *;
};
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index ad87818d42a0..357e18db0c7a 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -6,6 +6,7 @@
*/
#include <linux/mm.h>
#include <linux/err.h>
+#include <linux/futex.h>
#include <linux/sched.h>
#include <linux/sched/task_stack.h>
#include <linux/slab.h>
@@ -79,26 +80,19 @@ static void vdso_futex_robust_unlock_update_ips(void)
const struct vdso_image *image = current->mm->context.vdso_image;
unsigned long vdso = (unsigned long) current->mm->context.vdso;
struct futex_mm_data *fd = &current->mm->futex;
- struct futex_unlock_cs_range *csr = fd->unlock_cs_ranges;
+ unsigned int idx = 0;
+
+ futex_reset_cs_ranges(fd);
- fd->unlock_cs_num_ranges = 0;
#ifdef CONFIG_X86_64
- if (image->sym_x86_64_futex_try_unlock_cs_start) {
- csr->start_ip = vdso + image->sym_x86_64_futex_try_unlock_cs_start;
- csr->end_ip = vdso + image->sym_x86_64_futex_try_unlock_cs_end;
- csr->pop_size32 = 0;
- csr++;
- fd->unlock_cs_num_ranges++;
- }
+ futex_set_vdso_cs_range(fd, idx, vdso, image->sym___futex_list64_try_unlock_cs_start,
+ image->sym___futex_list64_try_unlock_cs_end, false);
+ idx++;
#endif /* CONFIG_X86_64 */
#if defined(CONFIG_X86_32) || defined(CONFIG_COMPAT)
- if (image->sym_x86_32_futex_try_unlock_cs_start) {
- csr->start_ip = vdso + image->sym_x86_32_futex_try_unlock_cs_start;
- csr->end_ip = vdso + image->sym_x86_32_futex_try_unlock_cs_end;
- csr->pop_size32 = 1;
- fd->unlock_cs_num_ranges++;
- }
+ futex_set_vdso_cs_range(fd, idx, vdso, image->sym___futex_list32_try_unlock_cs_start,
+ image->sym___futex_list32_try_unlock_cs_end, true);
#endif /* CONFIG_X86_32 || CONFIG_COMPAT */
}
#else
diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index b96a6f04d677..68cf5cdd84b4 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -25,12 +25,10 @@ struct vdso_image {
long sym_int80_landing_pad;
long sym_vdso32_sigreturn_landing_pad;
long sym_vdso32_rt_sigreturn_landing_pad;
- long sym_x86_64_futex_try_unlock_cs_start;
- long sym_x86_64_futex_try_unlock_cs_end;
- long sym_x86_64_compat_futex_try_unlock_cs_start;
- long sym_x86_64_compat_futex_try_unlock_cs_end;
- long sym_x86_32_futex_try_unlock_cs_start;
- long sym_x86_32_futex_try_unlock_cs_end;
+ long sym___futex_list64_try_unlock_cs_start;
+ long sym___futex_list64_try_unlock_cs_end;
+ long sym___futex_list32_try_unlock_cs_start;
+ long sym___futex_list32_try_unlock_cs_end;
};
extern const struct vdso_image vdso64_image;
diff --git a/arch/x86/tools/vdso2c.c b/arch/x86/tools/vdso2c.c
index 2d01e511ca8a..921576b6a5f5 100644
--- a/arch/x86/tools/vdso2c.c
+++ b/arch/x86/tools/vdso2c.c
@@ -82,12 +82,10 @@ struct vdso_sym required_syms[] = {
{"int80_landing_pad", true},
{"vdso32_rt_sigreturn_landing_pad", true},
{"vdso32_sigreturn_landing_pad", true},
- {"x86_64_futex_try_unlock_cs_start", true},
- {"x86_64_futex_try_unlock_cs_end", true},
- {"x86_64_compat_futex_try_unlock_cs_start", true},
- {"x86_64_compat_futex_try_unlock_cs_end", true},
- {"x86_32_futex_try_unlock_cs_start", true},
- {"x86_32_futex_try_unlock_cs_end", true},
+ {"__futex_list64_try_unlock_cs_start", true},
+ {"__futex_list64_try_unlock_cs_end", true},
+ {"__futex_list32_try_unlock_cs_start", true},
+ {"__futex_list32_try_unlock_cs_end", true},
};
__attribute__((format(printf, 1, 2))) __attribute__((noreturn))
diff --git a/include/linux/futex.h b/include/linux/futex.h
index 8e3d46737b03..33524dfb3fe4 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -81,22 +81,18 @@ int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4)
#ifdef CONFIG_FUTEX_PRIVATE_HASH
int futex_hash_allocate_default(void);
void futex_hash_free(struct mm_struct *mm);
-int futex_mm_init(struct mm_struct *mm);
-
-#else /* !CONFIG_FUTEX_PRIVATE_HASH */
+#else /* CONFIG_FUTEX_PRIVATE_HASH */
static inline int futex_hash_allocate_default(void) { return 0; }
static inline int futex_hash_free(struct mm_struct *mm) { return 0; }
-static inline int futex_mm_init(struct mm_struct *mm) { return 0; }
-#endif /* CONFIG_FUTEX_PRIVATE_HASH */
+#endif /* !CONFIG_FUTEX_PRIVATE_HASH */
-#else /* !CONFIG_FUTEX */
+#else /* CONFIG_FUTEX */
static inline void futex_init_task(struct task_struct *tsk) { }
static inline void futex_exit_recursive(struct task_struct *tsk) { }
static inline void futex_exit_release(struct task_struct *tsk) { }
static inline void futex_exec_release(struct task_struct *tsk) { }
-static inline long do_futex(u32 __user *uaddr, int op, u32 val,
- ktime_t *timeout, u32 __user *uaddr2,
- u32 val2, u32 val3)
+static inline long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
+ u32 __user *uaddr2, u32 val2, u32 val3)
{
return -EINVAL;
}
@@ -104,17 +100,14 @@ static inline int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsig
{
return -EINVAL;
}
-static inline int futex_hash_allocate_default(void)
-{
- return 0;
-}
+static inline int futex_hash_allocate_default(void) { return 0; }
static inline int futex_hash_free(struct mm_struct *mm) { return 0; }
-static inline int futex_mm_init(struct mm_struct *mm) { return 0; }
#endif /* !CONFIG_FUTEX */
#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
#include <asm/futex_robust.h>
+void futex_reset_cs_ranges(struct futex_mm_data *fd);
void __futex_fixup_robust_unlock(struct pt_regs *regs, struct futex_unlock_cs_range *csr);
static inline bool futex_within_robust_unlock(struct pt_regs *regs,
@@ -122,7 +115,7 @@ static inline bool futex_within_robust_unlock(struct pt_regs *regs,
{
unsigned long ip = instruction_pointer(regs);
- return ip >= csr->start_ip && ip < csr->end_ip;
+ return ip >= csr->start_ip && ip < csr->start_ip + csr->len;
}
static inline void futex_fixup_robust_unlock(struct pt_regs *regs)
@@ -131,26 +124,40 @@ static inline void futex_fixup_robust_unlock(struct pt_regs *regs)
/*
* Avoid dereferencing current->mm if not returning from interrupt.
- * current->rseq.event is going to be used anyway in the exit to user
- * code, so bringing it in is not a big deal.
+ * current->rseq.event is going to be used subsequently, so bringing the
+ * cache line in is not a big deal.
*/
if (!current->rseq.event.user_irq)
return;
- csr = current->mm->futex.unlock_cs_ranges;
- if (unlikely(futex_within_robust_unlock(regs, csr))) {
- __futex_fixup_robust_unlock(regs, csr);
- return;
- }
+ csr = current->mm->futex.unlock.cs_ranges;
- /* Multi sized robust lists are only supported with CONFIG_COMPAT */
- if (IS_ENABLED(CONFIG_COMPAT) && current->mm->futex.unlock_cs_num_ranges == 2) {
- if (unlikely(futex_within_robust_unlock(regs, ++csr)))
+ /* The loop is optimized out for !COMPAT */
+ for (int r = 0; r < FUTEX_ROBUST_MAX_CS_RANGES; r++, csr++) {
+ if (unlikely(futex_within_robust_unlock(regs, csr))) {
__futex_fixup_robust_unlock(regs, csr);
+ return;
+ }
}
}
+
+static inline void futex_set_vdso_cs_range(struct futex_mm_data *fd, unsigned int idx,
+ unsigned long vdso, unsigned long start,
+ unsigned long end, bool sz32)
+{
+ fd->unlock.cs_ranges[idx].start_ip = vdso + start;
+ fd->unlock.cs_ranges[idx].len = end - start;
+ fd->unlock.cs_ranges[idx].pop_size32 = sz32;
+}
#else /* CONFIG_FUTEX_ROBUST_UNLOCK */
-static inline void futex_fixup_robust_unlock(struct pt_regs *regs) {}
+static inline void futex_fixup_robust_unlock(struct pt_regs *regs) { }
#endif /* !CONFIG_FUTEX_ROBUST_UNLOCK */
+
+#if defined(CONFIG_FUTEX_PRIVATE_HASH) || defined(CONFIG_FUTEX_ROBUST_UNLOCK)
+void futex_mm_init(struct mm_struct *mm);
+#else
+static inline void futex_mm_init(struct mm_struct *mm) { }
#endif
+
+#endif /* _LINUX_FUTEX_H */
diff --git a/include/linux/futex_types.h b/include/linux/futex_types.h
index 90e24a10ed08..288666fb37b6 100644
--- a/include/linux/futex_types.h
+++ b/include/linux/futex_types.h
@@ -30,51 +30,66 @@ struct futex_sched_data {
unsigned int state;
};
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+/**
+ * struct futex_mm_phash - Futex private hash related per MM data
+ * @lock: Mutex to protect the private hash operations
+ * @hash: RCU managed pointer to the private hash
+ * @hash_new: Pointer to a newly allocated private hash
+ * @batches: Batch state for RCU synchronization
+ * @rcu: RCU head for call_rcu()
+ * @atomic: Aggregate value for @hash_ref
+ * @ref: Per CPU reference counter for a private hash
+ */
+struct futex_mm_phash {
+ struct mutex lock;
+ struct futex_private_hash __rcu *hash;
+ struct futex_private_hash *hash_new;
+ unsigned long batches;
+ struct rcu_head rcu;
+ atomic_long_t atomic;
+ unsigned int __percpu *ref;
+};
+#else /* CONFIG_FUTEX_PRIVATE_HASH */
+struct futex_mm_phash { };
+#endif /* !CONFIG_FUTEX_PRIVATE_HASH */
+
+#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
/**
* struct futex_unlock_cs_range - Range for the VDSO unlock critical section
* @start_ip: The start IP of the robust futex unlock critical section (inclusive)
- * @end_ip: The end IP of the robust futex unlock critical section (exclusive)
+ * @len: The length of the robust futex unlock critical section
* @pop_size32: Pending OP pointer size indicator. 0 == 64-bit, 1 == 32-bit
*/
struct futex_unlock_cs_range {
unsigned long start_ip;
- unsigned long end_ip;
+ unsigned int len;
unsigned int pop_size32;
};
-#define FUTEX_ROBUST_MAX_CS_RANGES 2
+#define FUTEX_ROBUST_MAX_CS_RANGES (1 + IS_ENABLED(CONFIG_COMPAT))
+
+/**
+ * struct futex_unlock_cs_ranges - Futex unlock VDSO critical sections
+ * @cs_ranges: Array of critical section ranges
+ */
+struct futex_unlock_cs_ranges {
+ struct futex_unlock_cs_range cs_ranges[FUTEX_ROBUST_MAX_CS_RANGES];
+};
+#else /* CONFIG_FUTEX_ROBUST_UNLOCK */
+struct futex_unlock_cs_ranges { };
+#endif /* !CONFIG_FUTEX_ROBUST_UNLOCK */
/**
* struct futex_mm_data - Futex related per MM data
- * @phash_lock: Mutex to protect the private hash operations
- * @phash: RCU managed pointer to the private hash
- * @phash_new: Pointer to a newly allocated private hash
- * @phash_batches: Batch state for RCU synchronization
- * @phash_rcu: RCU head for call_rcu()
- * @phash_atomic: Aggregate value for @phash_ref
- * @phash_ref: Per CPU reference counter for a private hash
- *
- * @unlock_cs_num_ranges: The number of critical section ranges for VDSO assisted unlock
- * of robust futexes.
- * @unlock_cs_ranges: The critical section ranges for VDSO assisted unlock
+ * @phash: Futex private hash related data
+ * @unlock: Futex unlock VDSO critical sections
*/
struct futex_mm_data {
-#ifdef CONFIG_FUTEX_PRIVATE_HASH
- struct mutex phash_lock;
- struct futex_private_hash __rcu *phash;
- struct futex_private_hash *phash_new;
- unsigned long phash_batches;
- struct rcu_head phash_rcu;
- atomic_long_t phash_atomic;
- unsigned int __percpu *phash_ref;
-#endif
-#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
- unsigned int unlock_cs_num_ranges;
- struct futex_unlock_cs_range unlock_cs_ranges[FUTEX_ROBUST_MAX_CS_RANGES];
-#endif
+ struct futex_mm_phash phash;
+ struct futex_unlock_cs_ranges unlock;
};
-
-#else
+#else /* CONFIG_FUTEX */
struct futex_sched_data { };
struct futex_mm_data { };
#endif /* !CONFIG_FUTEX */
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index bc41d619f9a3..ac1d9ce1f1ec 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -645,11 +645,11 @@ static inline void user_access_restore(unsigned long flags) { }
#endif
#ifndef unsafe_atomic_store_release_user
-# define unsafe_atomic_store_release_user(val, uptr, elbl) \
- do { \
- if (!IS_ENABLED(CONFIG_ARCH_STORE_IMPLIES_RELEASE)) \
- smp_mb(); \
- unsafe_put_user(val, uptr, elbl); \
+# define unsafe_atomic_store_release_user(val, uptr, elbl) \
+ do { \
+ if (!IS_ENABLED(CONFIG_ARCH_MEMORY_ORDER_TOS)) \
+ smp_mb(); \
+ unsafe_put_user(val, uptr, elbl); \
} while (0)
#endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 65113a304518..726c9427d811 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1097,6 +1097,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
#endif
mm_init_uprobes_state(mm);
hugetlb_count_init(mm);
+ futex_mm_init(mm);
mm_flags_clear_all(mm);
if (current->mm) {
@@ -1109,11 +1110,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
mm->def_flags = 0;
}
- if (futex_mm_init(mm))
- goto fail_mm_init;
-
if (mm_alloc_pgd(mm))
- goto fail_nopgd;
+ goto fail_mm_init;
if (mm_alloc_id(mm))
goto fail_noid;
@@ -1140,8 +1138,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
mm_free_id(mm);
fail_noid:
mm_free_pgd(mm);
-fail_nopgd:
- futex_hash_free(mm);
fail_mm_init:
free_mm(mm);
return NULL;
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 6a9c04471c44..ce47d02f1ea2 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -190,7 +190,7 @@ __futex_hash_private(union futex_key *key, struct futex_private_hash *fph)
return NULL;
if (!fph)
- fph = rcu_dereference(key->private.mm->futex.phash);
+ fph = rcu_dereference(key->private.mm->futex.phash.hash);
if (!fph || !fph->hash_mask)
return NULL;
@@ -235,17 +235,17 @@ static void futex_rehash_private(struct futex_private_hash *old,
}
}
-static bool __futex_pivot_hash(struct mm_struct *mm,
- struct futex_private_hash *new)
+static bool __futex_pivot_hash(struct mm_struct *mm, struct futex_private_hash *new)
{
+ struct futex_mm_phash *mmph = &mm->futex.phash;
struct futex_private_hash *fph;
- WARN_ON_ONCE(mm->futex.phash_new);
+ WARN_ON_ONCE(mmph->hash_new);
- fph = rcu_dereference_protected(mm->futex.phash, lockdep_is_held(&mm->futex.phash_lock));
+ fph = rcu_dereference_protected(mmph->hash, lockdep_is_held(&mmph->lock));
if (fph) {
if (!futex_ref_is_dead(fph)) {
- mm->futex.phash_new = new;
+ mmph->hash_new = new;
return false;
}
@@ -253,8 +253,8 @@ static bool __futex_pivot_hash(struct mm_struct *mm,
}
new->state = FR_PERCPU;
scoped_guard(rcu) {
- mm->futex.phash_batches = get_state_synchronize_rcu();
- rcu_assign_pointer(mm->futex.phash, new);
+ mmph->batches = get_state_synchronize_rcu();
+ rcu_assign_pointer(mmph->hash, new);
}
kvfree_rcu(fph, rcu);
return true;
@@ -262,12 +262,12 @@ static bool __futex_pivot_hash(struct mm_struct *mm,
static void futex_pivot_hash(struct mm_struct *mm)
{
- scoped_guard(mutex, &mm->futex.phash_lock) {
+ scoped_guard(mutex, &mm->futex.phash.lock) {
struct futex_private_hash *fph;
- fph = mm->futex.phash_new;
+ fph = mm->futex.phash.hash_new;
if (fph) {
- mm->futex.phash_new = NULL;
+ mm->futex.phash.hash_new = NULL;
__futex_pivot_hash(mm, fph);
}
}
@@ -290,7 +290,7 @@ struct futex_private_hash *futex_private_hash(void)
scoped_guard(rcu) {
struct futex_private_hash *fph;
- fph = rcu_dereference(mm->futex.phash);
+ fph = rcu_dereference(mm->futex.phash.hash);
if (!fph)
return NULL;
@@ -1452,12 +1452,16 @@ bool futex_robust_list_clear_pending(void __user *pop, unsigned int flags)
#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
void __futex_fixup_robust_unlock(struct pt_regs *regs, struct futex_unlock_cs_range *csr)
{
+ /*
+ * arch_futex_robust_unlock_get_pop() returns the list pending op pointer from
+ * @regs if the try_cmpxchg() succeeded.
+ */
void __user *pop = arch_futex_robust_unlock_get_pop(regs);
if (!pop)
return;
- futex_robust_list_clear_pending(pop, csr->cs_pop_size32 ? FLAGS_ROBUST_LIST32 : 0);
+ futex_robust_list_clear_pending(pop, csr->pop_size32 ? FLAGS_ROBUST_LIST32 : 0);
}
#endif /* CONFIG_FUTEX_ROBUST_UNLOCK */
@@ -1606,17 +1610,17 @@ static void __futex_ref_atomic_begin(struct futex_private_hash *fph)
* otherwise it would be impossible for it to have reported success
* from futex_ref_is_dead().
*/
- WARN_ON_ONCE(atomic_long_read(&mm->futex.phash_atomic) != 0);
+ WARN_ON_ONCE(atomic_long_read(&mm->futex.phash.atomic) != 0);
/*
* Set the atomic to the bias value such that futex_ref_{get,put}()
* will never observe 0. Will be fixed up in __futex_ref_atomic_end()
* when folding in the percpu count.
*/
- atomic_long_set(&mm->futex.phash_atomic, LONG_MAX);
+ atomic_long_set(&mm->futex.phash.atomic, LONG_MAX);
smp_store_release(&fph->state, FR_ATOMIC);
- call_rcu_hurry(&mm->futex.phash_rcu, futex_ref_rcu);
+ call_rcu_hurry(&mm->futex.phash.rcu, futex_ref_rcu);
}
static void __futex_ref_atomic_end(struct futex_private_hash *fph)
@@ -1637,7 +1641,7 @@ static void __futex_ref_atomic_end(struct futex_private_hash *fph)
* Therefore the per-cpu counter is now stable, sum and reset.
*/
for_each_possible_cpu(cpu) {
- unsigned int *ptr = per_cpu_ptr(mm->futex.phash_ref, cpu);
+ unsigned int *ptr = per_cpu_ptr(mm->futex.phash.ref, cpu);
count += *ptr;
*ptr = 0;
}
@@ -1645,7 +1649,7 @@ static void __futex_ref_atomic_end(struct futex_private_hash *fph)
/*
* Re-init for the next cycle.
*/
- this_cpu_inc(*mm->futex.phash_ref); /* 0 -> 1 */
+ this_cpu_inc(*mm->futex.phash.ref); /* 0 -> 1 */
/*
* Add actual count, subtract bias and initial refcount.
@@ -1653,7 +1657,7 @@ static void __futex_ref_atomic_end(struct futex_private_hash *fph)
* The moment this atomic operation happens, futex_ref_is_dead() can
* become true.
*/
- ret = atomic_long_add_return(count - LONG_MAX - 1, &mm->futex.phash_atomic);
+ ret = atomic_long_add_return(count - LONG_MAX - 1, &mm->futex.phash.atomic);
if (!ret)
wake_up_var(mm);
@@ -1663,8 +1667,8 @@ static void __futex_ref_atomic_end(struct futex_private_hash *fph)
static void futex_ref_rcu(struct rcu_head *head)
{
- struct mm_struct *mm = container_of(head, struct mm_struct, futex.phash_rcu);
- struct futex_private_hash *fph = rcu_dereference_raw(mm->futex.phash);
+ struct mm_struct *mm = container_of(head, struct mm_struct, futex.phash.rcu);
+ struct futex_private_hash *fph = rcu_dereference_raw(mm->futex.phash.hash);
if (fph->state == FR_PERCPU) {
/*
@@ -1693,7 +1697,7 @@ static void futex_ref_drop(struct futex_private_hash *fph)
/*
* Can only transition the current fph;
*/
- WARN_ON_ONCE(rcu_dereference_raw(mm->futex.phash) != fph);
+ WARN_ON_ONCE(rcu_dereference_raw(mm->futex.phash.hash) != fph);
/*
* We enqueue at least one RCU callback. Ensure mm stays if the task
* exits before the transition is completed.
@@ -1704,9 +1708,9 @@ static void futex_ref_drop(struct futex_private_hash *fph)
* In order to avoid the following scenario:
*
* futex_hash() __futex_pivot_hash()
- * guard(rcu); guard(mm->futex_hash_lock);
- * fph = mm->futex.phash;
- * rcu_assign_pointer(&mm->futex.phash, new);
+ * guard(rcu); guard(mm->futex.phash.lock);
+ * fph = mm->futex.phash.hash;
+ * rcu_assign_pointer(&mm->futex.phash.hash, new);
* futex_hash_allocate()
* futex_ref_drop()
* fph->state = FR_ATOMIC;
@@ -1721,7 +1725,7 @@ static void futex_ref_drop(struct futex_private_hash *fph)
* There must be at least one full grace-period between publishing a
* new fph and trying to replace it.
*/
- if (poll_state_synchronize_rcu(mm->futex.phash_batches)) {
+ if (poll_state_synchronize_rcu(mm->futex.phash.batches)) {
/*
* There was a grace-period, we can begin now.
*/
@@ -1729,7 +1733,7 @@ static void futex_ref_drop(struct futex_private_hash *fph)
return;
}
- call_rcu_hurry(&mm->futex.phash_rcu, futex_ref_rcu);
+ call_rcu_hurry(&mm->futex.phash.rcu, futex_ref_rcu);
}
static bool futex_ref_get(struct futex_private_hash *fph)
@@ -1739,11 +1743,11 @@ static bool futex_ref_get(struct futex_private_hash *fph)
guard(preempt)();
if (READ_ONCE(fph->state) == FR_PERCPU) {
- __this_cpu_inc(*mm->futex.phash_ref);
+ __this_cpu_inc(*mm->futex.phash.ref);
return true;
}
- return atomic_long_inc_not_zero(&mm->futex.phash_atomic);
+ return atomic_long_inc_not_zero(&mm->futex.phash.atomic);
}
static bool futex_ref_put(struct futex_private_hash *fph)
@@ -1753,11 +1757,11 @@ static bool futex_ref_put(struct futex_private_hash *fph)
guard(preempt)();
if (READ_ONCE(fph->state) == FR_PERCPU) {
- __this_cpu_dec(*mm->futex.phash_ref);
+ __this_cpu_dec(*mm->futex.phash.ref);
return false;
}
- return atomic_long_dec_and_test(&mm->futex.phash_atomic);
+ return atomic_long_dec_and_test(&mm->futex.phash.atomic);
}
static bool futex_ref_is_dead(struct futex_private_hash *fph)
@@ -1769,24 +1773,23 @@ static bool futex_ref_is_dead(struct futex_private_hash *fph)
if (smp_load_acquire(&fph->state) == FR_PERCPU)
return false;
- return atomic_long_read(&mm->futex.phash_atomic) == 0;
+ return atomic_long_read(&mm->futex.phash.atomic) == 0;
}
-int futex_mm_init(struct mm_struct *mm)
+static void futex_hash_init_mm(struct futex_mm_data *fd)
{
- memset(&mm->futex, 0, sizeof(mm->futex));
- mutex_init(&mm->futex.phash_lock);
- mm->futex.phash_batches = get_state_synchronize_rcu();
- return 0;
+ memset(&fd->phash, 0, sizeof(fd->phash));
+ mutex_init(&fd->phash.lock);
+ fd->phash.batches = get_state_synchronize_rcu();
}
void futex_hash_free(struct mm_struct *mm)
{
struct futex_private_hash *fph;
- free_percpu(mm->futex.phash_ref);
- kvfree(mm->futex.phash_new);
- fph = rcu_dereference_raw(mm->futex.phash);
+ free_percpu(mm->futex.phash.ref);
+ kvfree(mm->futex.phash.hash_new);
+ fph = rcu_dereference_raw(mm->futex.phash.hash);
if (fph)
kvfree(fph);
}
@@ -1797,10 +1800,10 @@ static bool futex_pivot_pending(struct mm_struct *mm)
guard(rcu)();
- if (!mm->futex.phash_new)
+ if (!mm->futex.phash.hash_new)
return true;
- fph = rcu_dereference(mm->futex.phash);
+ fph = rcu_dereference(mm->futex.phash.hash);
return futex_ref_is_dead(fph);
}
@@ -1842,7 +1845,7 @@ static int futex_hash_allocate(unsigned int hash_slots, unsigned int flags)
* Once we've disabled the global hash there is no way back.
*/
scoped_guard(rcu) {
- fph = rcu_dereference(mm->futex.phash);
+ fph = rcu_dereference(mm->futex.phash.hash);
if (fph && !fph->hash_mask) {
if (custom)
return -EBUSY;
@@ -1850,15 +1853,15 @@ static int futex_hash_allocate(unsigned int hash_slots, unsigned int flags)
}
}
- if (!mm->futex.phash_ref) {
+ if (!mm->futex.phash.ref) {
/*
* This will always be allocated by the first thread and
* therefore requires no locking.
*/
- mm->futex.phash_ref = alloc_percpu(unsigned int);
- if (!mm->futex.phash_ref)
+ mm->futex.phash.ref = alloc_percpu(unsigned int);
+ if (!mm->futex.phash.ref)
return -ENOMEM;
- this_cpu_inc(*mm->futex.phash_ref); /* 0 -> 1 */
+ this_cpu_inc(*mm->futex.phash.ref); /* 0 -> 1 */
}
fph = kvzalloc(struct_size(fph, queues, hash_slots),
@@ -1881,14 +1884,14 @@ static int futex_hash_allocate(unsigned int hash_slots, unsigned int flags)
wait_var_event(mm, futex_pivot_pending(mm));
}
- scoped_guard(mutex, &mm->futex.phash_lock) {
+ scoped_guard(mutex, &mm->futex.phash.lock) {
struct futex_private_hash *free __free(kvfree) = NULL;
struct futex_private_hash *cur, *new;
- cur = rcu_dereference_protected(mm->futex.phash,
- lockdep_is_held(&mm->futex.phash_lock));
- new = mm->futex.phash_new;
- mm->futex.phash_new = NULL;
+ cur = rcu_dereference_protected(mm->futex.phash.hash,
+ lockdep_is_held(&mm->futex.phash.lock));
+ new = mm->futex.phash.hash_new;
+ mm->futex.phash.hash_new = NULL;
if (fph) {
if (cur && !cur->hash_mask) {
@@ -1898,7 +1901,7 @@ static int futex_hash_allocate(unsigned int hash_slots, unsigned int flags)
* the second one returns here.
*/
free = fph;
- mm->futex.phash_new = new;
+ mm->futex.phash.hash_new = new;
return -EBUSY;
}
if (cur && !new) {
@@ -1947,11 +1950,9 @@ int futex_hash_allocate_default(void)
return 0;
scoped_guard(rcu) {
- threads = min_t(unsigned int,
- get_nr_threads(current),
- num_online_cpus());
+ threads = min_t(unsigned int, get_nr_threads(current), num_online_cpus());
- fph = rcu_dereference(current->mm->futex.phash);
+ fph = rcu_dereference(current->mm->futex.phash.hash);
if (fph) {
if (fph->custom)
return 0;
@@ -1978,25 +1979,51 @@ static int futex_hash_get_slots(void)
struct futex_private_hash *fph;
guard(rcu)();
- fph = rcu_dereference(current->mm->futex.phash);
+ fph = rcu_dereference(current->mm->futex.phash.hash);
if (fph && fph->hash_mask)
return fph->hash_mask + 1;
return 0;
}
+#else /* CONFIG_FUTEX_PRIVATE_HASH */
+static inline int futex_hash_allocate(unsigned int hslots, unsigned int flags) { return -EINVAL; }
+static inline int futex_hash_get_slots(void) { return 0; }
+static inline void futex_hash_init_mm(struct futex_mm_data *fd) { }
+#endif /* !CONFIG_FUTEX_PRIVATE_HASH */
-#else
+#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
+static void futex_invalidate_cs_ranges(struct futex_mm_data *fd)
+{
+ /*
+ * Invalidate start_ip so that the quick check fails for ip >= start_ip
+ * if VDSO is not mapped or the second slot is not available for compat
+ * tasks as they use VDSO32 which does not provide the 64-bit pointer
+ * variant.
+ */
+ for (int i = 0; i < FUTEX_ROBUST_MAX_CS_RANGES; i++)
+ fd->unlock.cs_ranges[i].start_ip = ~0UL;
+}
-static int futex_hash_allocate(unsigned int hash_slots, unsigned int flags)
+void futex_reset_cs_ranges(struct futex_mm_data *fd)
{
- return -EINVAL;
+ memset(fd->unlock.cs_ranges, 0, sizeof(fd->unlock.cs_ranges));
+ futex_invalidate_cs_ranges(fd);
}
-static int futex_hash_get_slots(void)
+static void futex_robust_unlock_init_mm(struct futex_mm_data *fd)
{
- return 0;
+ /* mm_dup() preserves the range, mm_alloc() clears it */
+ if (!fd->unlock.cs_ranges[0].start_ip)
+ futex_invalidate_cs_ranges(fd);
}
+#else /* CONFIG_FUTEX_ROBUST_UNLOCK */
+static inline void futex_robust_unlock_init_mm(struct futex_mm_data *fd) { }
+#endif /* !CONFIG_FUTEX_ROBUST_UNLOCK */
-#endif
+void futex_mm_init(struct mm_struct *mm)
+{
+ futex_hash_init_mm(&mm->futex);
+ futex_robust_unlock_init_mm(&mm->futex);
+}
int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4)
{
diff --git a/tools/testing/selftests/futex/functional/robust_list.c b/tools/testing/selftests/futex/functional/robust_list.c
index e7d1254e18ca..62f21f8d89a6 100644
--- a/tools/testing/selftests/futex/functional/robust_list.c
+++ b/tools/testing/selftests/futex/functional/robust_list.c
@@ -27,12 +27,14 @@
#include "futextest.h"
#include "../../kselftest_harness.h"
+#include <dlfcn.h>
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
+#include <sys/auxv.h>
#include <sys/mman.h>
#include <sys/wait.h>
@@ -54,6 +56,12 @@ static int get_robust_list(int pid, struct robust_list_head **head, size_t *len_
return syscall(SYS_get_robust_list, pid, head, len_ptr);
}
+static int sys_futex_robust_unlock(_Atomic(uint32_t) *uaddr, unsigned int op, int val,
+ void *list_op_pending, unsigned int val3)
+{
+ return syscall(SYS_futex, uaddr, op, val, NULL, list_op_pending, val3, 0);
+}
+
/*
* Basic lock struct, contains just the futex word and the robust list element
* Real implementations have also a *prev to easily walk in the list
@@ -549,4 +557,199 @@ TEST(test_circular_list)
ksft_test_result_pass("%s\n", __func__);
}
+/*
+ * Below are tests for the fix of the robust release race condition. Please read the following
+ * thread to learn more about the issue in the first place and why the following functions fix it:
+ * https://lore.kernel.org/lkml/20260316162316.356674433@kernel.org/
+ */
+
+/*
+ * Auxiliary code for loading the vDSO functions
+ */
+#define VDSO_SIZE 0x4000
+
+void *get_vdso_func_addr(const char *str)
+{
+ void *vdso_base = (void *) getauxval(AT_SYSINFO_EHDR), *addr;
+ Dl_info info;
+
+ if (!vdso_base) {
+ perror("Failed to get AT_SYSINFO_EHDR");
+ return NULL;
+ }
+
+ for (addr = vdso_base; addr < vdso_base + VDSO_SIZE; addr += sizeof(addr)) {
+ if (dladdr(addr, &info) == 0 || !info.dli_sname)
+ continue;
+
+ if (!strcmp(info.dli_sname, str))
+ return info.dli_saddr;
+ }
+
+ return NULL;
+}
+
+/*
+ * These are the real vDSO function signatures:
+ *
+ * __vdso_futex_robust_list64_try_unlock(__u32 *lock, __u32 tid, __u64 *pop)
+ * __vdso_futex_robust_list32_try_unlock(__u32 *lock, __u32 tid, __u32 *pop)
+ *
+ * So for the generic entry point we need to use a void pointer as the last argument
+ */
+FIXTURE(vdso_unlock)
+{
+ uint32_t (*vdso)(_Atomic(uint32_t) *lock, uint32_t tid, void *pop);
+};
+
+FIXTURE_VARIANT(vdso_unlock)
+{
+ bool is_32;
+ char func_name[];
+};
+
+FIXTURE_SETUP(vdso_unlock)
+{
+ self->vdso = get_vdso_func_addr(variant->func_name);
+
+ if (!self->vdso)
+ ksft_test_result_skip("%s not found\n", variant->func_name);
+}
+
+FIXTURE_TEARDOWN(vdso_unlock) {}
+
+FIXTURE_VARIANT_ADD(vdso_unlock, 32)
+{
+ .func_name = "__vdso_futex_robust_list32_try_unlock",
+ .is_32 = true,
+};
+
+FIXTURE_VARIANT_ADD(vdso_unlock, 64)
+{
+ .func_name = "__vdso_futex_robust_list64_try_unlock",
+ .is_32 = false,
+};
+
+/*
+ * Test the vDSO robust_listXX_try_unlock() for the uncontended case. The virtual syscall should
+ * return the thread ID of the lock owner, the lock word must be 0 and the list_op_pending should
+ * be NULL.
+ */
+TEST_F(vdso_unlock, test_robust_try_unlock_uncontended)
+{
+ struct lock_struct lock = { .futex = 0 };
+ _Atomic(unsigned int) *futex = &lock.futex;
+ struct robust_list_head head;
+ uint64_t exp = (uint64_t) NULL;
+ pid_t tid = gettid();
+ int ret;
+
+ *futex = tid;
+
+ ret = set_list(&head);
+ if (ret)
+ ksft_test_result_fail("set_robust_list error\n");
+
+ head.list_op_pending = &lock.list;
+
+ ret = self->vdso(futex, tid, &head.list_op_pending);
+
+ ASSERT_EQ(ret, tid);
+ ASSERT_EQ(*futex, 0);
+
+ /* The 32-bit entry point clears only the lower 32 bits of the pointer */
+ if (variant->is_32) {
+ exp = (uint64_t)(unsigned long)&lock.list;
+ exp &= ~0xFFFFFFFFULL;
+ }
+
+ ASSERT_EQ((uint64_t)(unsigned long)head.list_op_pending, exp);
+}
+
+/*
+ * If the lock is contended, the operation fails. The return value is the value found at the
+ * futex word (tid | FUTEX_WAITERS), the futex word is not modified and the list_op_pending is
+ * not cleared.
+ */
+TEST_F(vdso_unlock, test_robust_try_unlock_contended)
+{
+ struct lock_struct lock = { .futex = 0 };
+ _Atomic(unsigned int) *futex = &lock.futex;
+ struct robust_list_head head;
+ pid_t tid = gettid();
+ int ret;
+
+ *futex = tid | FUTEX_WAITERS;
+
+ ret = set_list(&head);
+ if (ret)
+ ksft_test_result_fail("set_robust_list error\n");
+
+ head.list_op_pending = &lock.list;
+
+ ret = self->vdso(futex, tid, &head.list_op_pending);
+
+ ASSERT_EQ(ret, tid | FUTEX_WAITERS);
+ ASSERT_EQ(*futex, tid | FUTEX_WAITERS);
+ ASSERT_EQ(head.list_op_pending, &lock.list);
+}
+
+FIXTURE(futex_op) {};
+
+FIXTURE_VARIANT(futex_op)
+{
+ unsigned int op;
+ unsigned int val3;
+};
+
+FIXTURE_SETUP(futex_op) {}
+
+FIXTURE_TEARDOWN(futex_op) {}
+
+FIXTURE_VARIANT_ADD(futex_op, wake)
+{
+ .op = FUTEX_WAKE,
+ .val3 = 0,
+};
+
+FIXTURE_VARIANT_ADD(futex_op, wake_bitset)
+{
+ .op = FUTEX_WAKE_BITSET,
+ .val3 = FUTEX_BITSET_MATCH_ANY,
+};
+
+FIXTURE_VARIANT_ADD(futex_op, unlock_pi)
+{
+ .op = FUTEX_UNLOCK_PI,
+ .val3 = 0,
+};
+
+/*
+ * The syscall should return the number of tasks woken (for this test, 0), clear the futex word and
+ * clear list_op_pending
+ */
+TEST_F(futex_op, test_futex_robust_unlock)
+{
+ struct lock_struct lock = { .futex = 0 };
+ _Atomic(unsigned int) *futex = &lock.futex;
+ struct robust_list_head head;
+ pid_t tid = gettid();
+ int ret;
+
+ *futex = tid | FUTEX_WAITERS;
+
+ ret = set_list(&head);
+ if (ret)
+ ksft_test_result_fail("set_robust_list error\n");
+
+ head.list_op_pending = &lock.list;
+
+ ret = sys_futex_robust_unlock(futex, FUTEX_ROBUST_UNLOCK | variant->op, tid,
+ &head.list_op_pending, variant->val3);
+
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(*futex, 0);
+ ASSERT_EQ(head.list_op_pending, NULL);
+}
+
TEST_HARNESS_MAIN
diff --git a/tools/testing/selftests/futex/include/futextest.h b/tools/testing/selftests/futex/include/futextest.h
index 3d48e9789d9f..f4d880b8e795 100644
--- a/tools/testing/selftests/futex/include/futextest.h
+++ b/tools/testing/selftests/futex/include/futextest.h
@@ -38,6 +38,9 @@ typedef volatile u_int32_t futex_t;
#ifndef FUTEX_CMP_REQUEUE_PI
#define FUTEX_CMP_REQUEUE_PI 12
#endif
+#ifndef FUTEX_ROBUST_UNLOCK
+#define FUTEX_ROBUST_UNLOCK 512
+#endif
#ifndef FUTEX_WAIT_REQUEUE_PI_PRIVATE
#define FUTEX_WAIT_REQUEUE_PI_PRIVATE (FUTEX_WAIT_REQUEUE_PI | \
FUTEX_PRIVATE_FLAG)
On Mon, Mar 30, 2026 at 02:01:58PM +0200, Thomas Gleixner wrote:
> This is a follow up to v2 which can be found here:
>
>   https://lore.kernel.org/20260319225224.853416463@kernel.org
>
> The v1 cover letter contains a detailed analysis of the underlying
> problem:
>
>   https://lore.kernel.org/20260316162316.356674433@kernel.org
>
> TLDR:
>
> The robust futex unlock mechanism is racy in respect to the clearing of the
> robust_list_head::list_op_pending pointer because unlock and clearing the
> pointer are not atomic. The race window is between the unlock and clearing
> the pending op pointer. If the task is forced to exit in this window, exit
> will access a potentially invalid pending op pointer when cleaning up the
> robust list. That happens if another task manages to unmap the object
> containing the lock before the cleanup, which results in an UAF. In the
> worst case this UAF can lead to memory corruption when unrelated content
> has been mapped to the same address by the time the access happens.
>
> User space can't solve this problem without help from the kernel. This
> series provides the kernel side infrastructure to help it along:
>
>  1) Combined unlock, pointer clearing, wake-up for the contended case
>
>  2) VDSO based unlock and pointer clearing helpers with a fix-up function
>     in the kernel when user space was interrupted within the critical
>     section.

I see the vdso bits in this series are specific to x86. Do other
architectures need something here?

I might be missing some context; I'm not sure whether that's not
necessary or just not implemented by this series, and so I'm not sure
whether arm64 folk and other need to go dig into this.

[...]

> Changes since v2:
>
> - Rename ARCH_STORE_IMPLIES_RELEASE to ARCH_MEMORY_ORDER_TOS - Peter

I believe that should be s/TOS/TSO/, since the standard terminology is
Total Store Order (TSO).

Mark.
Resending with linux-arch in CC and a bit more context.

On Mon, Mar 30 2026 at 14:45, Mark Rutland wrote:
> On Mon, Mar 30, 2026 at 02:01:58PM +0200, Thomas Gleixner wrote:
>> User space can't solve this problem without help from the kernel. This
>> series provides the kernel side infrastructure to help it along:
>>
>>  1) Combined unlock, pointer clearing, wake-up for the contended case
>>
>>  2) VDSO based unlock and pointer clearing helpers with a fix-up function
>>     in the kernel when user space was interrupted within the critical
>>     section.

There is an effort to solve the long standing issue of robust futexes
versus unlock and clearing the list op pointer not being atomic. See:

   https://lore.kernel.org/20260316162316.356674433@kernel.org

TLDR:

The robust futex unlock mechanism is racy in respect to the clearing of
the robust_list_head::list_op_pending pointer because unlock and clearing
the pointer are not atomic. The race window is between the unlock and
clearing the pending op pointer. If the task is forced to exit in this
window, exit will access a potentially invalid pending op pointer when
cleaning up the robust list. That happens if another task manages to
unmap the object containing the lock before the cleanup, which results
in an UAF. In the worst case this UAF can lead to memory corruption when
unrelated content has been mapped to the same address by the time the
access happens.

The contended unlock case will be solved with an extension to the
futex() syscall so that the kernel handles the "atomicity".

The uncontended case where user space unlocks the futex needs VDSO
support, so that the kernel can clear the list op pending pointer when
user space gets interrupted between unlock and clearing.

Below is how this works on x86. The series in progress covers only x86,
so this is a heads up for what's coming to architectures which support
VDSO. If anyone sees a problem with it, then please raise your concerns
now.

> I see the vdso bits in this series are specific to x86. Do other
> architectures need something here?

Yes.

> I might be missing some context; I'm not sure whether that's not
> necessary or just not implemented by this series, and so I'm not sure
> whether arm64 folk and other need to go dig into this.

The VDSO functions __vdso_futex_robust_list64_try_unlock() and
__vdso_futex_robust_list32_try_unlock() are architecture specific.

The scheme in x86 ASM is:

	mov	%esi,%eax		// Load TID into EAX
	xor	%ecx,%ecx		// Set ECX to 0
	lock cmpxchg %ecx,(%rdi)	// Try the TID -> 0 transition
.Lstart:
	jnz	.Lend
	movq	%rcx,(%rdx)		// Clear list_op_pending
.Lend:
	ret

.Lstart is the start of the critical section, .Lend the end. These two
addresses need to be retrieved from the VDSO when the VDSO is mapped to
user space and stored in mm::futex::unlock::cs_ranges[]. See patch 11/14.

If the cmpxchg was successful, then the pending pointer has to be
cleared when user space was interrupted before reaching .Lend.

So .Lstart has to be immediately after the instruction which did the try
compare exchange and the architecture needs to have their ASM variant
and the helper function which tells the generic code whether the pointer
has to be cleared or not. On x86 that is:

	return regs->flags & X86_EFLAGS_ZF ? (void __user *)regs->dx : NULL;

as the result of CMPXCHG is in the Zero Flag and the pointer is in
[ER]DX. The former is defined by the ISA, the latter is enforced by the
ASM constraints and has to be kept in sync between the VDSO ASM and the
evaluation helper.

Thanks,

	tglx
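For readers who want the semantics without the x86 ASM: the fast path tglx describes can be modeled in plain C roughly as below. This is an illustrative sketch using C11 atomics, not the actual VDSO code; the function name and the FUTEX_WAITERS constant usage are mine.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Rough model of the __vdso_futex_robust_list*_try_unlock() semantics.
 * Returns the value found in the futex word. If the word contained
 * exactly the TID (uncontended), the word is now 0 and *pop has been
 * cleared. Otherwise (e.g. FUTEX_WAITERS set) nothing is modified and
 * the caller has to fall back to the FUTEX_ROBUST_UNLOCK syscall.
 */
static uint32_t robust_try_unlock_model(_Atomic uint32_t *lock,
					uint32_t tid, void **pop)
{
	uint32_t expected = tid;

	if (atomic_compare_exchange_strong(lock, &expected, 0)) {
		/*
		 * Critical section: being interrupted and forced to
		 * exit between the exchange above and this store is
		 * exactly the window the kernel fixup repairs.
		 */
		*pop = NULL;
	}
	/* On failure, expected holds the value found in the lock word */
	return expected;
}
```

The uncontended case thus never enters the kernel; only the critical section between the successful exchange and the pointer store needs kernel assistance.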
Hey Mark,

On 30/03/2026 10:45, Mark Rutland wrote:
> On Mon, Mar 30, 2026 at 02:01:58PM +0200, Thomas Gleixner wrote:
>>
>> 2) VDSO based unlock and pointer clearing helpers with a fix-up function
>>    in the kernel when user space was interrupted within the critical
>>    section.
>
> I see the vdso bits in this series are specific to x86. Do other
> architectures need something here?
>
> I might be missing some context; I'm not sure whether that's not
> necessary or just not implemented by this series, and so I'm not sure
> whether arm64 folk and other need to go dig into this.
>

If you haven't done it yet, I would like to take a shot at implementing
the arm64 version of the vDSO.
On Tue, Mar 31, 2026 at 09:59:17AM -0300, André Almeida wrote:
> Hey Mark,

Hi André,

> On 30/03/2026 10:45, Mark Rutland wrote:
> > On Mon, Mar 30, 2026 at 02:01:58PM +0200, Thomas Gleixner wrote:
> > >
> > > 2) VDSO based unlock and pointer clearing helpers with a fix-up function
> > >    in the kernel when user space was interrupted within the critical
> > >    section.
> >
> > I see the vdso bits in this series are specific to x86. Do other
> > architectures need something here?
> >
> > I might be missing some context; I'm not sure whether that's not
> > necessary or just not implemented by this series, and so I'm not sure
> > whether arm64 folk and other need to go dig into this.
>
> If you haven't done it yet, I would like to give a shot to implement the
> arm64 version of the vDSO.

Oh, please do have a go! I was just trying to figure out whether someone
needed to do this, rather than asserting that I would.

I'm happy to review/test patches for this.

Mark.
On 2026-03-31 09:59:17 [-0300], André Almeida wrote:
>
> If you haven't done it yet, I would like to give a shot to implement the
> arm64 version of the vDSO.

I have a test halfway ready which uses the debug symbols to check the
section where the fixup must happen. If you keep the same symbol names
then it should work.

Sebastian
On Mon, Mar 30 2026 at 14:45, Mark Rutland wrote:
> On Mon, Mar 30, 2026 at 02:01:58PM +0200, Thomas Gleixner wrote:
>> User space can't solve this problem without help from the kernel. This
>> series provides the kernel side infrastructure to help it along:
>>
>>  1) Combined unlock, pointer clearing, wake-up for the contended case
>>
>>  2) VDSO based unlock and pointer clearing helpers with a fix-up function
>>     in the kernel when user space was interrupted within the critical
>>     section.
>
> I see the vdso bits in this series are specific to x86. Do other
> architectures need something here?

Yes.

> I might be missing some context; I'm not sure whether that's not
> necessary or just not implemented by this series, and so I'm not sure
> whether arm64 folk and other need to go dig into this.

The VDSO functions __vdso_futex_robust_list64_try_unlock() and
__vdso_futex_robust_list32_try_unlock() are architecture specific.

The scheme in x86 ASM is:

	mov	%esi,%eax		// Load TID into EAX
	xor	%ecx,%ecx		// Set ECX to 0
	lock cmpxchg %ecx,(%rdi)	// Try the TID -> 0 transition
.Lstart:
	jnz	.Lend
	movq	%rcx,(%rdx)		// Clear list_op_pending
.Lend:
	ret

.Lstart is the start of the critical section, .Lend the end. These two
addresses need to be retrieved from the VDSO when the VDSO is mapped to
user space and stored in mm::futex::unlock::cs_ranges[]. See patch 11/14.

If the cmpxchg was successful, then the pending pointer has to be
cleared when user space was interrupted before reaching .Lend.

So .Lstart has to be immediately after the instruction which did the try
compare exchange and the architecture needs to have their ASM variant
and the helper function which tells the generic code whether the pointer
has to be cleared or not. On x86 that is:

	return regs->flags & X86_EFLAGS_ZF ? (void __user *)regs->dx : NULL;

as the result of CMPXCHG is in the Zero Flag and the pointer is in
[ER]DX. The former is defined by the ISA, the latter is enforced by the
ASM constraints and has to be kept in sync between the VDSO ASM and the
evaluation helper.

Do you see a problem with that on AARGH64?

Thanks,

	tglx
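The generic fixup decision behind the cs_ranges[] lookup can be sketched in C. Note this is a model for illustration only: the struct and function names are hypothetical, and per the cover letter the ranges are stored as start_ip + len, with unused slots invalidated via start_ip = ~0UL.

```c
#include <stdbool.h>

/* Hypothetical shape of one critical section range (start_ip + len) */
struct futex_cs_range {
	unsigned long start_ip;
	unsigned long len;
};

/*
 * The pending pointer only has to be fixed up when the interrupted
 * user space instruction pointer lies inside [start_ip, start_ip + len).
 * Using unsigned wraparound, a single compare covers both bounds, and
 * an invalidated slot (start_ip = ~0UL) always fails the check because
 * ip - start_ip wraps to a value that cannot be below a small len.
 */
static bool ip_in_cs_range(const struct futex_cs_range *r, unsigned long ip)
{
	return ip - r->start_ip < r->len;
}
```

Only when this check matches does the architecture helper (the X86_EFLAGS_ZF test above) get consulted to decide whether the exchange had already succeeded and the pointer store still needs to be performed.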
On Mon, Mar 30, 2026 at 09:36:44PM +0200, Thomas Gleixner wrote:
> On Mon, Mar 30 2026 at 14:45, Mark Rutland wrote:
> > On Mon, Mar 30, 2026 at 02:01:58PM +0200, Thomas Gleixner wrote:
> >> User space can't solve this problem without help from the kernel. This
> >> series provides the kernel side infrastructure to help it along:
> >>
> >>  1) Combined unlock, pointer clearing, wake-up for the contended case
> >>
> >>  2) VDSO based unlock and pointer clearing helpers with a fix-up function
> >>     in the kernel when user space was interrupted within the critical
> >>     section.
> >
> > I see the vdso bits in this series are specific to x86. Do other
> > architectures need something here?
>
> Yes.

Cool. Thanks for confirming!

> > I might be missing some context; I'm not sure whether that's not
> > necessary or just not implemented by this series, and so I'm not sure
> > whether arm64 folk and other need to go dig into this.
>
> The VDSO functions __vdso_futex_robust_list64_try_unlock() and
> __vdso_futex_robust_list32_try_unlock() are architecture specific.
>
> The scheme in x86 ASM is:
>
> 	mov	%esi,%eax		// Load TID into EAX
> 	xor	%ecx,%ecx		// Set ECX to 0
> 	lock cmpxchg %ecx,(%rdi)	// Try the TID -> 0 transition
> .Lstart:
> 	jnz	.Lend
> 	movq	%rcx,(%rdx)		// Clear list_op_pending
> .Lend:
> 	ret
>
> .Lstart is the start of the critical section, .Lend the end. These two
> addresses need to be retrieved from the VDSO when the VDSO is mapped to
> user space and stored in mm::futex:unlock::cs_ranges[]. See patch 11/14.
>
> If the cmpxchg was successful, then the pending pointer has to be
> cleared when user space was interrupted before reaching .Lend.
>
> So .Lstart has to be immediately after the instruction which did the try
> compare exchange and the architecture needs to have their ASM variant
> and the helper function which tells the generic code whether the pointer
> has to be cleared or not.
>
> On x86 that is:
>
> 	return regs->flags & X86_EFLAGS_ZF ? (void __user *)regs->dx : NULL;
>
> as the result of CMPXCHG is in the Zero Flag and the pointer is in
> [ER]DX. The former is defined by the ISA, the latter is enforced by the
> ASM constraints and has to be kept in sync between the VDSO ASM and the
> evaluation helper.
>
> Do you see a problem with that on AARGH64?

I think something like that should be possible.

The one pain point I see immediately is how to work with LL/SC and LSE
atomics. For LL/SC atomics, we need to retry even if the value hasn't
changed; I think that just makes the critical section and helper a bit
more complicated. LL/SC and LSE atomics indicate success differently,
but again that's probably just some additional logic in the helper
function, and special care when laying out the asm for patching.

Mark.
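Mark's retry concern maps directly onto C11's weak compare-exchange, which, like a single LL/SC attempt, is allowed to fail spuriously even when the value matched. A sketch of what an LL/SC-style fast path has to look like (illustrative only, not proposed VDSO code; the function name is made up):

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * On architectures without a single-instruction CAS, one LL/SC attempt
 * (modeled here by atomic_compare_exchange_weak()) may fail even though
 * the lock word held the expected TID. The fast path therefore needs a
 * retry loop, which widens the instruction range the kernel fixup has
 * to cover compared to the single-instruction x86 cmpxchg.
 */
static uint32_t try_unlock_llsc_style(_Atomic uint32_t *lock, uint32_t tid)
{
	uint32_t expected;

	do {
		expected = tid;
		if (atomic_compare_exchange_weak(lock, &expected, 0))
			return tid;	/* TID -> 0 transition succeeded */
		/* Retry only on spurious failure, not on real contention */
	} while (expected == tid);

	return expected;	/* Found value, e.g. tid | FUTEX_WAITERS */
}
```

Success and failure also surface differently (store-exclusive status register vs. comparison of the loaded value), which is the extra logic Mark mentions for the architecture evaluation helper.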
On Mon, Mar 30, 2026 at 02:45:59PM +0100, Mark Rutland wrote:
> On Mon, Mar 30, 2026 at 02:01:58PM +0200, Thomas Gleixner wrote:
> > This is a follow up to v2 which can be found here:
> >
> >   https://lore.kernel.org/20260319225224.853416463@kernel.org
> >
> > The v1 cover letter contains a detailed analysis of the underlying
> > problem:
> >
> >   https://lore.kernel.org/20260316162316.356674433@kernel.org
> >
> > TLDR:
> >
> > The robust futex unlock mechanism is racy in respect to the clearing of the
> > robust_list_head::list_op_pending pointer because unlock and clearing the
> > pointer are not atomic. The race window is between the unlock and clearing
> > the pending op pointer. If the task is forced to exit in this window, exit
> > will access a potentially invalid pending op pointer when cleaning up the
> > robust list. That happens if another task manages to unmap the object
> > containing the lock before the cleanup, which results in an UAF. In the
> > worst case this UAF can lead to memory corruption when unrelated content
> > has been mapped to the same address by the time the access happens.
> >
> > User space can't solve this problem without help from the kernel. This
> > series provides the kernel side infrastructure to help it along:
> >
> >  1) Combined unlock, pointer clearing, wake-up for the contended case
> >
> >  2) VDSO based unlock and pointer clearing helpers with a fix-up function
> >     in the kernel when user space was interrupted within the critical
> >     section.
>
> I see the vdso bits in this series are specific to x86. Do other
> architectures need something here?
>
> I might be missing some context; I'm not sure whether that's not
> necessary or just not implemented by this series, and so I'm not sure
> whether arm64 folk and other need to go dig into this.

ARM64 (and all the other archs) need the relevant VDSO bits as well.

> > Changes since v2:
> >
> > - Rename ARCH_STORE_IMPLIES_RELEASE to ARCH_MEMORY_ORDER_TOS - Peter
>
> I believe that should be s/TOS/TSO/, since the standard terminology is
> Total Store Order (TSO).

Indeed!
Having all these members in task_struct along with the required #ifdeffery
is annoying, does not allow efficient initializing of the data with
memset() and makes extending it tedious.
Move it into a data structure and fix up all usage sites.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: André Almeida <andrealmeid@igalia.com>
---
V2: Rename the struct and add the missing kernel doc - Andre
---
Documentation/locking/robust-futexes.rst | 8 ++--
include/linux/futex.h | 12 ++----
include/linux/futex_types.h | 36 ++++++++++++++++++++
include/linux/sched.h | 16 ++-------
kernel/exit.c | 4 +-
kernel/futex/core.c | 55 +++++++++++++++----------------
kernel/futex/pi.c | 26 +++++++-------
kernel/futex/syscalls.c | 23 ++++--------
8 files changed, 99 insertions(+), 81 deletions(-)
--- a/Documentation/locking/robust-futexes.rst
+++ b/Documentation/locking/robust-futexes.rst
@@ -94,7 +94,7 @@ time, the kernel checks this user-space
locks to be cleaned up?
In the common case, at do_exit() time, there is no list registered, so
-the cost of robust futexes is just a simple current->robust_list != NULL
+the cost of robust futexes is just a current->futex.robust_list != NULL
comparison. If the thread has registered a list, then normally the list
is empty. If the thread/process crashed or terminated in some incorrect
way then the list might be non-empty: in this case the kernel carefully
@@ -178,9 +178,9 @@ The patch adds two new syscalls: one to
size_t __user *len_ptr);
List registration is very fast: the pointer is simply stored in
-current->robust_list. [Note that in the future, if robust futexes become
-widespread, we could extend sys_clone() to register a robust-list head
-for new threads, without the need of another syscall.]
+current->futex.robust_list. [Note that in the future, if robust futexes
+become widespread, we could extend sys_clone() to register a robust-list
+head for new threads, without the need of another syscall.]
So there is virtually zero overhead for tasks not using robust futexes,
and even for robust futex users, there is only one extra syscall per
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -64,14 +64,10 @@ enum {
static inline void futex_init_task(struct task_struct *tsk)
{
- tsk->robust_list = NULL;
-#ifdef CONFIG_COMPAT
- tsk->compat_robust_list = NULL;
-#endif
- INIT_LIST_HEAD(&tsk->pi_state_list);
- tsk->pi_state_cache = NULL;
- tsk->futex_state = FUTEX_STATE_OK;
- mutex_init(&tsk->futex_exit_mutex);
+ memset(&tsk->futex, 0, sizeof(tsk->futex));
+ INIT_LIST_HEAD(&tsk->futex.pi_state_list);
+ tsk->futex.state = FUTEX_STATE_OK;
+ mutex_init(&tsk->futex.exit_mutex);
}
void futex_exit_recursive(struct task_struct *tsk);
--- /dev/null
+++ b/include/linux/futex_types.h
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_FUTEX_TYPES_H
+#define _LINUX_FUTEX_TYPES_H
+
+#ifdef CONFIG_FUTEX
+#include <linux/mutex_types.h>
+#include <linux/types.h>
+
+struct compat_robust_list_head;
+struct futex_pi_state;
+struct robust_list_head;
+
+/**
+ * struct futex_sched_data - Futex related per task data
+ * @robust_list: User space registered robust list pointer
+ * @compat_robust_list: User space registered robust list pointer for compat tasks
+ * @pi_state_list: List head for Priority Inheritance (PI) state management
+ * @pi_state_cache: Pointer to cache one PI state object per task
+ * @exit_mutex: Mutex for serializing exit
+ * @state: Futex handling state to handle exit races correctly
+ */
+struct futex_sched_data {
+ struct robust_list_head __user *robust_list;
+#ifdef CONFIG_COMPAT
+ struct compat_robust_list_head __user *compat_robust_list;
+#endif
+ struct list_head pi_state_list;
+ struct futex_pi_state *pi_state_cache;
+ struct mutex exit_mutex;
+ unsigned int state;
+};
+#else
+struct futex_sched_data { };
+#endif /* !CONFIG_FUTEX */
+
+#endif /* _LINUX_FUTEX_TYPES_H */
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -16,6 +16,7 @@
#include <linux/cpumask_types.h>
#include <linux/cache.h>
+#include <linux/futex_types.h>
#include <linux/irqflags_types.h>
#include <linux/smp_types.h>
#include <linux/pid_types.h>
@@ -64,7 +65,6 @@ struct bpf_net_context;
struct capture_control;
struct cfs_rq;
struct fs_struct;
-struct futex_pi_state;
struct io_context;
struct io_uring_task;
struct mempolicy;
@@ -76,7 +76,6 @@ struct pid_namespace;
struct pipe_inode_info;
struct rcu_node;
struct reclaim_state;
-struct robust_list_head;
struct root_domain;
struct rq;
struct sched_attr;
@@ -1329,16 +1328,9 @@ struct task_struct {
u32 closid;
u32 rmid;
#endif
-#ifdef CONFIG_FUTEX
- struct robust_list_head __user *robust_list;
-#ifdef CONFIG_COMPAT
- struct compat_robust_list_head __user *compat_robust_list;
-#endif
- struct list_head pi_state_list;
- struct futex_pi_state *pi_state_cache;
- struct mutex futex_exit_mutex;
- unsigned int futex_state;
-#endif
+
+ struct futex_sched_data futex;
+
#ifdef CONFIG_PERF_EVENTS
u8 perf_recursion[PERF_NR_CONTEXTS];
struct perf_event_context *perf_event_ctxp;
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -989,8 +989,8 @@ void __noreturn do_exit(long code)
proc_exit_connector(tsk);
mpol_put_task_policy(tsk);
#ifdef CONFIG_FUTEX
- if (unlikely(current->pi_state_cache))
- kfree(current->pi_state_cache);
+ if (unlikely(current->futex.pi_state_cache))
+ kfree(current->futex.pi_state_cache);
#endif
/*
* Make sure we are holding no locks:
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -32,18 +32,19 @@
* "But they come in a choice of three flavours!"
*/
#include <linux/compat.h>
-#include <linux/jhash.h>
-#include <linux/pagemap.h>
#include <linux/debugfs.h>
-#include <linux/plist.h>
+#include <linux/fault-inject.h>
#include <linux/gfp.h>
-#include <linux/vmalloc.h>
+#include <linux/jhash.h>
#include <linux/memblock.h>
-#include <linux/fault-inject.h>
-#include <linux/slab.h>
-#include <linux/prctl.h>
#include <linux/mempolicy.h>
#include <linux/mmap_lock.h>
+#include <linux/pagemap.h>
+#include <linux/plist.h>
+#include <linux/prctl.h>
+#include <linux/rseq.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
#include "futex.h"
#include "../locking/rtmutex_common.h"
@@ -829,7 +830,7 @@ void wait_for_owner_exiting(int ret, str
if (WARN_ON_ONCE(ret == -EBUSY && !exiting))
return;
- mutex_lock(&exiting->futex_exit_mutex);
+ mutex_lock(&exiting->futex.exit_mutex);
/*
* No point in doing state checking here. If the waiter got here
* while the task was in exec()->exec_futex_release() then it can
@@ -838,7 +839,7 @@ void wait_for_owner_exiting(int ret, str
* already. Highly unlikely and not a problem. Just one more round
* through the futex maze.
*/
- mutex_unlock(&exiting->futex_exit_mutex);
+ mutex_unlock(&exiting->futex.exit_mutex);
put_task_struct(exiting);
}
@@ -1048,7 +1049,7 @@ static int handle_futex_death(u32 __user
*
* In both cases the following conditions are met:
*
- * 1) task->robust_list->list_op_pending != NULL
+ * 1) task->futex.robust_list->list_op_pending != NULL
* @pending_op == true
* 2) The owner part of user space futex value == 0
* 3) Regular futex: @pi == false
@@ -1153,7 +1154,7 @@ static inline int fetch_robust_entry(str
*/
static void exit_robust_list(struct task_struct *curr)
{
- struct robust_list_head __user *head = curr->robust_list;
+ struct robust_list_head __user *head = curr->futex.robust_list;
struct robust_list __user *entry, *next_entry, *pending;
unsigned int limit = ROBUST_LIST_LIMIT, pi, pip;
unsigned int next_pi;
@@ -1247,7 +1248,7 @@ compat_fetch_robust_entry(compat_uptr_t
*/
static void compat_exit_robust_list(struct task_struct *curr)
{
- struct compat_robust_list_head __user *head = curr->compat_robust_list;
+ struct compat_robust_list_head __user *head = curr->futex.compat_robust_list;
struct robust_list __user *entry, *next_entry, *pending;
unsigned int limit = ROBUST_LIST_LIMIT, pi, pip;
unsigned int next_pi;
@@ -1323,7 +1324,7 @@ static void compat_exit_robust_list(stru
*/
static void exit_pi_state_list(struct task_struct *curr)
{
- struct list_head *next, *head = &curr->pi_state_list;
+ struct list_head *next, *head = &curr->futex.pi_state_list;
struct futex_pi_state *pi_state;
union futex_key key = FUTEX_KEY_INIT;
@@ -1407,19 +1408,19 @@ static inline void exit_pi_state_list(st
static void futex_cleanup(struct task_struct *tsk)
{
- if (unlikely(tsk->robust_list)) {
+ if (unlikely(tsk->futex.robust_list)) {
exit_robust_list(tsk);
- tsk->robust_list = NULL;
+ tsk->futex.robust_list = NULL;
}
#ifdef CONFIG_COMPAT
- if (unlikely(tsk->compat_robust_list)) {
+ if (unlikely(tsk->futex.compat_robust_list)) {
compat_exit_robust_list(tsk);
- tsk->compat_robust_list = NULL;
+ tsk->futex.compat_robust_list = NULL;
}
#endif
- if (unlikely(!list_empty(&tsk->pi_state_list)))
+ if (unlikely(!list_empty(&tsk->futex.pi_state_list)))
exit_pi_state_list(tsk);
}
@@ -1442,10 +1443,10 @@ static void futex_cleanup(struct task_st
*/
void futex_exit_recursive(struct task_struct *tsk)
{
- /* If the state is FUTEX_STATE_EXITING then futex_exit_mutex is held */
- if (tsk->futex_state == FUTEX_STATE_EXITING)
- mutex_unlock(&tsk->futex_exit_mutex);
- tsk->futex_state = FUTEX_STATE_DEAD;
+ /* If the state is FUTEX_STATE_EXITING then futex.exit_mutex is held */
+ if (tsk->futex.state == FUTEX_STATE_EXITING)
+ mutex_unlock(&tsk->futex.exit_mutex);
+ tsk->futex.state = FUTEX_STATE_DEAD;
}
static void futex_cleanup_begin(struct task_struct *tsk)
@@ -1453,10 +1454,10 @@ static void futex_cleanup_begin(struct t
/*
* Prevent various race issues against a concurrent incoming waiter
* including live locks by forcing the waiter to block on
- * tsk->futex_exit_mutex when it observes FUTEX_STATE_EXITING in
+ * tsk->futex.exit_mutex when it observes FUTEX_STATE_EXITING in
* attach_to_pi_owner().
*/
- mutex_lock(&tsk->futex_exit_mutex);
+ mutex_lock(&tsk->futex.exit_mutex);
/*
* Switch the state to FUTEX_STATE_EXITING under tsk->pi_lock.
@@ -1470,7 +1471,7 @@ static void futex_cleanup_begin(struct t
* be observed in exit_pi_state_list().
*/
raw_spin_lock_irq(&tsk->pi_lock);
- tsk->futex_state = FUTEX_STATE_EXITING;
+ tsk->futex.state = FUTEX_STATE_EXITING;
raw_spin_unlock_irq(&tsk->pi_lock);
}
@@ -1480,12 +1481,12 @@ static void futex_cleanup_end(struct tas
* Lockless store. The only side effect is that an observer might
* take another loop until it becomes visible.
*/
- tsk->futex_state = state;
+ tsk->futex.state = state;
/*
* Drop the exit protection. This unblocks waiters which observed
* FUTEX_STATE_EXITING to reevaluate the state.
*/
- mutex_unlock(&tsk->futex_exit_mutex);
+ mutex_unlock(&tsk->futex.exit_mutex);
}
void futex_exec_release(struct task_struct *tsk)
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -14,7 +14,7 @@ int refill_pi_state_cache(void)
{
struct futex_pi_state *pi_state;
- if (likely(current->pi_state_cache))
+ if (likely(current->futex.pi_state_cache))
return 0;
pi_state = kzalloc_obj(*pi_state);
@@ -28,17 +28,17 @@ int refill_pi_state_cache(void)
refcount_set(&pi_state->refcount, 1);
pi_state->key = FUTEX_KEY_INIT;
- current->pi_state_cache = pi_state;
+ current->futex.pi_state_cache = pi_state;
return 0;
}
static struct futex_pi_state *alloc_pi_state(void)
{
- struct futex_pi_state *pi_state = current->pi_state_cache;
+ struct futex_pi_state *pi_state = current->futex.pi_state_cache;
WARN_ON(!pi_state);
- current->pi_state_cache = NULL;
+ current->futex.pi_state_cache = NULL;
return pi_state;
}
@@ -60,7 +60,7 @@ static void pi_state_update_owner(struct
if (new_owner) {
raw_spin_lock(&new_owner->pi_lock);
WARN_ON(!list_empty(&pi_state->list));
- list_add(&pi_state->list, &new_owner->pi_state_list);
+ list_add(&pi_state->list, &new_owner->futex.pi_state_list);
pi_state->owner = new_owner;
raw_spin_unlock(&new_owner->pi_lock);
}
@@ -96,7 +96,7 @@ void put_pi_state(struct futex_pi_state
raw_spin_unlock_irqrestore(&pi_state->pi_mutex.wait_lock, flags);
}
- if (current->pi_state_cache) {
+ if (current->futex.pi_state_cache) {
kfree(pi_state);
} else {
/*
@@ -106,7 +106,7 @@ void put_pi_state(struct futex_pi_state
*/
pi_state->owner = NULL;
refcount_set(&pi_state->refcount, 1);
- current->pi_state_cache = pi_state;
+ current->futex.pi_state_cache = pi_state;
}
}
@@ -179,7 +179,7 @@ void put_pi_state(struct futex_pi_state
*
* p->pi_lock:
*
- * p->pi_state_list -> pi_state->list, relation
+ * p->futex.pi_state_list -> pi_state->list, relation
* pi_mutex->owner -> pi_state->owner, relation
*
* pi_state->refcount:
@@ -327,7 +327,7 @@ static int handle_exit_race(u32 __user *
* If the futex exit state is not yet FUTEX_STATE_DEAD, tell the
* caller that the alleged owner is busy.
*/
- if (tsk && tsk->futex_state != FUTEX_STATE_DEAD)
+ if (tsk && tsk->futex.state != FUTEX_STATE_DEAD)
return -EBUSY;
/*
@@ -346,8 +346,8 @@ static int handle_exit_race(u32 __user *
* *uaddr = 0xC0000000; tsk = get_task(PID);
* } if (!tsk->flags & PF_EXITING) {
* ... attach();
- * tsk->futex_state = } else {
- * FUTEX_STATE_DEAD; if (tsk->futex_state !=
+ * tsk->futex.state = } else {
+ * FUTEX_STATE_DEAD; if (tsk->futex.state !=
* FUTEX_STATE_DEAD)
* return -EAGAIN;
* return -ESRCH; <--- FAIL
@@ -395,7 +395,7 @@ static void __attach_to_pi_owner(struct
pi_state->key = *key;
WARN_ON(!list_empty(&pi_state->list));
- list_add(&pi_state->list, &p->pi_state_list);
+ list_add(&pi_state->list, &p->futex.pi_state_list);
/*
* Assignment without holding pi_state->pi_mutex.wait_lock is safe
* because there is no concurrency as the object is not published yet.
@@ -439,7 +439,7 @@ static int attach_to_pi_owner(u32 __user
* in futex_exit_release(), we do this protected by p->pi_lock:
*/
raw_spin_lock_irq(&p->pi_lock);
- if (unlikely(p->futex_state != FUTEX_STATE_OK)) {
+ if (unlikely(p->futex.state != FUTEX_STATE_OK)) {
/*
* The task is on the way out. When the futex state is
* FUTEX_STATE_DEAD, we know that the task has finished
--- a/kernel/futex/syscalls.c
+++ b/kernel/futex/syscalls.c
@@ -25,17 +25,13 @@
* @head: pointer to the list-head
* @len: length of the list-head, as userspace expects
*/
-SYSCALL_DEFINE2(set_robust_list, struct robust_list_head __user *, head,
- size_t, len)
+SYSCALL_DEFINE2(set_robust_list, struct robust_list_head __user *, head, size_t, len)
{
- /*
- * The kernel knows only one size for now:
- */
+ /* The kernel knows only one size for now. */
if (unlikely(len != sizeof(*head)))
return -EINVAL;
- current->robust_list = head;
-
+ current->futex.robust_list = head;
return 0;
}
@@ -43,9 +39,9 @@ static inline void __user *futex_task_ro
{
#ifdef CONFIG_COMPAT
if (compat)
- return p->compat_robust_list;
+ return p->futex.compat_robust_list;
#endif
- return p->robust_list;
+ return p->futex.robust_list;
}
static void __user *futex_get_robust_list_common(int pid, bool compat)
@@ -467,15 +463,13 @@ SYSCALL_DEFINE4(futex_requeue,
}
#ifdef CONFIG_COMPAT
-COMPAT_SYSCALL_DEFINE2(set_robust_list,
- struct compat_robust_list_head __user *, head,
- compat_size_t, len)
+COMPAT_SYSCALL_DEFINE2(set_robust_list, struct compat_robust_list_head __user *, head,
+ compat_size_t, len)
{
if (unlikely(len != sizeof(*head)))
return -EINVAL;
- current->compat_robust_list = head;
-
+ current->futex.compat_robust_list = head;
return 0;
}
@@ -515,4 +509,3 @@ SYSCALL_DEFINE6(futex_time32, u32 __user
return do_futex(uaddr, op, val, tp, uaddr2, (unsigned long)utime, val3);
}
#endif /* CONFIG_COMPAT_32BIT_TIME */
-
Nothing fails there. Mop up the leftovers of the early version of this,
which did an allocation.
While at it clean up the stubs and the #ifdef comments to make the header
file readable.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
include/linux/futex.h | 28 +++++++++++-----------------
kernel/fork.c | 8 ++------
kernel/futex/core.c | 3 +--
3 files changed, 14 insertions(+), 25 deletions(-)
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -81,22 +81,20 @@ int futex_hash_prctl(unsigned long arg2,
#ifdef CONFIG_FUTEX_PRIVATE_HASH
int futex_hash_allocate_default(void);
void futex_hash_free(struct mm_struct *mm);
-int futex_mm_init(struct mm_struct *mm);
-
-#else /* !CONFIG_FUTEX_PRIVATE_HASH */
+void futex_mm_init(struct mm_struct *mm);
+#else /* CONFIG_FUTEX_PRIVATE_HASH */
static inline int futex_hash_allocate_default(void) { return 0; }
static inline int futex_hash_free(struct mm_struct *mm) { return 0; }
-static inline int futex_mm_init(struct mm_struct *mm) { return 0; }
-#endif /* CONFIG_FUTEX_PRIVATE_HASH */
+static inline void futex_mm_init(struct mm_struct *mm) { }
+#endif /* !CONFIG_FUTEX_PRIVATE_HASH */
-#else /* !CONFIG_FUTEX */
+#else /* CONFIG_FUTEX */
static inline void futex_init_task(struct task_struct *tsk) { }
static inline void futex_exit_recursive(struct task_struct *tsk) { }
static inline void futex_exit_release(struct task_struct *tsk) { }
static inline void futex_exec_release(struct task_struct *tsk) { }
-static inline long do_futex(u32 __user *uaddr, int op, u32 val,
- ktime_t *timeout, u32 __user *uaddr2,
- u32 val2, u32 val3)
+static inline long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
+ u32 __user *uaddr2, u32 val2, u32 val3)
{
return -EINVAL;
}
@@ -104,13 +102,9 @@ static inline int futex_hash_prctl(unsig
{
return -EINVAL;
}
-static inline int futex_hash_allocate_default(void)
-{
- return 0;
-}
+static inline int futex_hash_allocate_default(void) { return 0; }
static inline int futex_hash_free(struct mm_struct *mm) { return 0; }
-static inline int futex_mm_init(struct mm_struct *mm) { return 0; }
-
-#endif
+static inline void futex_mm_init(struct mm_struct *mm) { }
+#endif /* !CONFIG_FUTEX */
-#endif
+#endif /* _LINUX_FUTEX_H */
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1097,6 +1097,7 @@ static struct mm_struct *mm_init(struct
#endif
mm_init_uprobes_state(mm);
hugetlb_count_init(mm);
+ futex_mm_init(mm);
mm_flags_clear_all(mm);
if (current->mm) {
@@ -1109,11 +1110,8 @@ static struct mm_struct *mm_init(struct
mm->def_flags = 0;
}
- if (futex_mm_init(mm))
- goto fail_mm_init;
-
if (mm_alloc_pgd(mm))
- goto fail_nopgd;
+ goto fail_mm_init;
if (mm_alloc_id(mm))
goto fail_noid;
@@ -1140,8 +1138,6 @@ static struct mm_struct *mm_init(struct
mm_free_id(mm);
fail_noid:
mm_free_pgd(mm);
-fail_nopgd:
- futex_hash_free(mm);
fail_mm_init:
free_mm(mm);
return NULL;
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1717,7 +1717,7 @@ static bool futex_ref_is_dead(struct fut
return atomic_long_read(&mm->futex_atomic) == 0;
}
-int futex_mm_init(struct mm_struct *mm)
+void futex_mm_init(struct mm_struct *mm)
{
mutex_init(&mm->futex_hash_lock);
RCU_INIT_POINTER(mm->futex_phash, NULL);
@@ -1726,7 +1726,6 @@ int futex_mm_init(struct mm_struct *mm)
mm->futex_ref = NULL;
atomic_long_set(&mm->futex_atomic, 0);
mm->futex_batches = get_state_synchronize_rcu();
- return 0;
}
void futex_hash_free(struct mm_struct *mm)
Having all these members in mm_struct along with the required #ifdeffery is
annoying. It does not allow efficient initialization of the data with
memset() and makes extending it tedious.
Move it into a data structure and fix up all usage sites.
The extra struct for the private hash is intentional to make integration of
other conditional mechanisms easier in terms of initialization and separation.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
V3: Split out the private hash data
V2: Use an empty stub struct as for the others - Mathieu
---
include/linux/futex_types.h | 35 +++++++++++
include/linux/mm_types.h | 11 ---
kernel/futex/core.c | 133 ++++++++++++++++++++------------------------
3 files changed, 96 insertions(+), 83 deletions(-)
--- a/include/linux/futex_types.h
+++ b/include/linux/futex_types.h
@@ -29,8 +29,41 @@ struct futex_sched_data {
struct mutex exit_mutex;
unsigned int state;
};
-#else
+
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+/**
+ * struct futex_mm_phash - Futex private hash related per MM data
+ * @lock: Mutex to protect the private hash operations
+ * @hash: RCU managed pointer to the private hash
+ * @hash_new: Pointer to a newly allocated private hash
+ * @batches: Batch state for RCU synchronization
+ * @rcu: RCU head for call_rcu()
+ * @atomic: Aggregate value for @hash_ref
+ * @ref: Per CPU reference counter for a private hash
+ */
+struct futex_mm_phash {
+ struct mutex lock;
+ struct futex_private_hash __rcu *hash;
+ struct futex_private_hash *hash_new;
+ unsigned long batches;
+ struct rcu_head rcu;
+ atomic_long_t atomic;
+ unsigned int __percpu *ref;
+};
+#else /* CONFIG_FUTEX_PRIVATE_HASH */
+struct futex_mm_phash { };
+#endif /* !CONFIG_FUTEX_PRIVATE_HASH */
+
+/**
+ * struct futex_mm_data - Futex related per MM data
+ * @phash: Futex private hash related data
+ */
+struct futex_mm_data {
+ struct futex_mm_phash phash;
+};
+#else /* CONFIG_FUTEX */
struct futex_sched_data { };
+struct futex_mm_data { };
#endif /* !CONFIG_FUTEX */
#endif /* _LINUX_FUTEX_TYPES_H */
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1221,16 +1221,7 @@ struct mm_struct {
*/
seqcount_t mm_lock_seq;
#endif
-#ifdef CONFIG_FUTEX_PRIVATE_HASH
- struct mutex futex_hash_lock;
- struct futex_private_hash __rcu *futex_phash;
- struct futex_private_hash *futex_phash_new;
- /* futex-ref */
- unsigned long futex_batches;
- struct rcu_head futex_rcu;
- atomic_long_t futex_atomic;
- unsigned int __percpu *futex_ref;
-#endif
+ struct futex_mm_data futex;
unsigned long hiwater_rss; /* High-watermark of RSS usage */
unsigned long hiwater_vm; /* High-water virtual memory usage */
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -188,13 +188,13 @@ static struct futex_hash_bucket *
return NULL;
if (!fph)
- fph = rcu_dereference(key->private.mm->futex_phash);
+ fph = rcu_dereference(key->private.mm->futex.phash.hash);
if (!fph || !fph->hash_mask)
return NULL;
- hash = jhash2((void *)&key->private.address,
- sizeof(key->private.address) / 4,
+ hash = jhash2((void *)&key->private.address, sizeof(key->private.address) / 4,
key->both.offset);
+
return &fph->queues[hash & fph->hash_mask];
}
@@ -233,18 +233,17 @@ static void futex_rehash_private(struct
}
}
-static bool __futex_pivot_hash(struct mm_struct *mm,
- struct futex_private_hash *new)
+static bool __futex_pivot_hash(struct mm_struct *mm, struct futex_private_hash *new)
{
+ struct futex_mm_phash *mmph = &mm->futex.phash;
struct futex_private_hash *fph;
- WARN_ON_ONCE(mm->futex_phash_new);
+ WARN_ON_ONCE(mmph->hash_new);
- fph = rcu_dereference_protected(mm->futex_phash,
- lockdep_is_held(&mm->futex_hash_lock));
+ fph = rcu_dereference_protected(mmph->hash, lockdep_is_held(&mmph->lock));
if (fph) {
if (!futex_ref_is_dead(fph)) {
- mm->futex_phash_new = new;
+ mmph->hash_new = new;
return false;
}
@@ -252,8 +251,8 @@ static bool __futex_pivot_hash(struct mm
}
new->state = FR_PERCPU;
scoped_guard(rcu) {
- mm->futex_batches = get_state_synchronize_rcu();
- rcu_assign_pointer(mm->futex_phash, new);
+ mmph->batches = get_state_synchronize_rcu();
+ rcu_assign_pointer(mmph->hash, new);
}
kvfree_rcu(fph, rcu);
return true;
@@ -261,12 +260,12 @@ static bool __futex_pivot_hash(struct mm
static void futex_pivot_hash(struct mm_struct *mm)
{
- scoped_guard(mutex, &mm->futex_hash_lock) {
+ scoped_guard(mutex, &mm->futex.phash.lock) {
struct futex_private_hash *fph;
- fph = mm->futex_phash_new;
+ fph = mm->futex.phash.hash_new;
if (fph) {
- mm->futex_phash_new = NULL;
+ mm->futex.phash.hash_new = NULL;
__futex_pivot_hash(mm, fph);
}
}
@@ -289,7 +288,7 @@ struct futex_private_hash *futex_private
scoped_guard(rcu) {
struct futex_private_hash *fph;
- fph = rcu_dereference(mm->futex_phash);
+ fph = rcu_dereference(mm->futex.phash.hash);
if (!fph)
return NULL;
@@ -412,8 +411,7 @@ static int futex_mpol(struct mm_struct *
* private hash) is returned if existing. Otherwise a hash bucket from the
* global hash is returned.
*/
-static struct futex_hash_bucket *
-__futex_hash(union futex_key *key, struct futex_private_hash *fph)
+static struct futex_hash_bucket *__futex_hash(union futex_key *key, struct futex_private_hash *fph)
{
int node = key->both.node;
u32 hash;
@@ -426,8 +424,7 @@ static struct futex_hash_bucket *
return hb;
}
- hash = jhash2((u32 *)key,
- offsetof(typeof(*key), both.offset) / sizeof(u32),
+ hash = jhash2((u32 *)key, offsetof(typeof(*key), both.offset) / sizeof(u32),
key->both.offset);
if (node == FUTEX_NO_NODE) {
@@ -442,8 +439,7 @@ static struct futex_hash_bucket *
*/
node = (hash >> futex_hashshift) % nr_node_ids;
if (!node_possible(node)) {
- node = find_next_bit_wrap(node_possible_map.bits,
- nr_node_ids, node);
+ node = find_next_bit_wrap(node_possible_map.bits, nr_node_ids, node);
}
}
@@ -460,9 +456,8 @@ static struct futex_hash_bucket *
* Return: Initialized hrtimer_sleeper structure or NULL if no timeout
* value given
*/
-struct hrtimer_sleeper *
-futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
- int flags, u64 range_ns)
+struct hrtimer_sleeper *futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
+ int flags, u64 range_ns)
{
if (!time)
return NULL;
@@ -1551,17 +1546,17 @@ static void __futex_ref_atomic_begin(str
* otherwise it would be impossible for it to have reported success
* from futex_ref_is_dead().
*/
- WARN_ON_ONCE(atomic_long_read(&mm->futex_atomic) != 0);
+ WARN_ON_ONCE(atomic_long_read(&mm->futex.phash.atomic) != 0);
/*
* Set the atomic to the bias value such that futex_ref_{get,put}()
* will never observe 0. Will be fixed up in __futex_ref_atomic_end()
* when folding in the percpu count.
*/
- atomic_long_set(&mm->futex_atomic, LONG_MAX);
+ atomic_long_set(&mm->futex.phash.atomic, LONG_MAX);
smp_store_release(&fph->state, FR_ATOMIC);
- call_rcu_hurry(&mm->futex_rcu, futex_ref_rcu);
+ call_rcu_hurry(&mm->futex.phash.rcu, futex_ref_rcu);
}
static void __futex_ref_atomic_end(struct futex_private_hash *fph)
@@ -1582,7 +1577,7 @@ static void __futex_ref_atomic_end(struc
* Therefore the per-cpu counter is now stable, sum and reset.
*/
for_each_possible_cpu(cpu) {
- unsigned int *ptr = per_cpu_ptr(mm->futex_ref, cpu);
+ unsigned int *ptr = per_cpu_ptr(mm->futex.phash.ref, cpu);
count += *ptr;
*ptr = 0;
}
@@ -1590,7 +1585,7 @@ static void __futex_ref_atomic_end(struc
/*
* Re-init for the next cycle.
*/
- this_cpu_inc(*mm->futex_ref); /* 0 -> 1 */
+ this_cpu_inc(*mm->futex.phash.ref); /* 0 -> 1 */
/*
* Add actual count, subtract bias and initial refcount.
@@ -1598,7 +1593,7 @@ static void __futex_ref_atomic_end(struc
* The moment this atomic operation happens, futex_ref_is_dead() can
* become true.
*/
- ret = atomic_long_add_return(count - LONG_MAX - 1, &mm->futex_atomic);
+ ret = atomic_long_add_return(count - LONG_MAX - 1, &mm->futex.phash.atomic);
if (!ret)
wake_up_var(mm);
@@ -1608,8 +1603,8 @@ static void __futex_ref_atomic_end(struc
static void futex_ref_rcu(struct rcu_head *head)
{
- struct mm_struct *mm = container_of(head, struct mm_struct, futex_rcu);
- struct futex_private_hash *fph = rcu_dereference_raw(mm->futex_phash);
+ struct mm_struct *mm = container_of(head, struct mm_struct, futex.phash.rcu);
+ struct futex_private_hash *fph = rcu_dereference_raw(mm->futex.phash.hash);
if (fph->state == FR_PERCPU) {
/*
@@ -1638,7 +1633,7 @@ static void futex_ref_drop(struct futex_
/*
* Can only transition the current fph;
*/
- WARN_ON_ONCE(rcu_dereference_raw(mm->futex_phash) != fph);
+ WARN_ON_ONCE(rcu_dereference_raw(mm->futex.phash.hash) != fph);
/*
* We enqueue at least one RCU callback. Ensure mm stays if the task
* exits before the transition is completed.
@@ -1649,9 +1644,9 @@ static void futex_ref_drop(struct futex_
* In order to avoid the following scenario:
*
* futex_hash() __futex_pivot_hash()
- * guard(rcu); guard(mm->futex_hash_lock);
- * fph = mm->futex_phash;
- * rcu_assign_pointer(&mm->futex_phash, new);
+ * guard(rcu); guard(mm->futex.phash.lock);
+ * fph = mm->futex.phash.hash;
+ * rcu_assign_pointer(&mm->futex.phash.hash, new);
* futex_hash_allocate()
* futex_ref_drop()
* fph->state = FR_ATOMIC;
@@ -1666,7 +1661,7 @@ static void futex_ref_drop(struct futex_
* There must be at least one full grace-period between publishing a
* new fph and trying to replace it.
*/
- if (poll_state_synchronize_rcu(mm->futex_batches)) {
+ if (poll_state_synchronize_rcu(mm->futex.phash.batches)) {
/*
* There was a grace-period, we can begin now.
*/
@@ -1674,7 +1669,7 @@ static void futex_ref_drop(struct futex_
return;
}
- call_rcu_hurry(&mm->futex_rcu, futex_ref_rcu);
+ call_rcu_hurry(&mm->futex.phash.rcu, futex_ref_rcu);
}
static bool futex_ref_get(struct futex_private_hash *fph)
@@ -1684,11 +1679,11 @@ static bool futex_ref_get(struct futex_p
guard(preempt)();
if (READ_ONCE(fph->state) == FR_PERCPU) {
- __this_cpu_inc(*mm->futex_ref);
+ __this_cpu_inc(*mm->futex.phash.ref);
return true;
}
- return atomic_long_inc_not_zero(&mm->futex_atomic);
+ return atomic_long_inc_not_zero(&mm->futex.phash.atomic);
}
static bool futex_ref_put(struct futex_private_hash *fph)
@@ -1698,11 +1693,11 @@ static bool futex_ref_put(struct futex_p
guard(preempt)();
if (READ_ONCE(fph->state) == FR_PERCPU) {
- __this_cpu_dec(*mm->futex_ref);
+ __this_cpu_dec(*mm->futex.phash.ref);
return false;
}
- return atomic_long_dec_and_test(&mm->futex_atomic);
+ return atomic_long_dec_and_test(&mm->futex.phash.atomic);
}
static bool futex_ref_is_dead(struct futex_private_hash *fph)
@@ -1714,27 +1709,23 @@ static bool futex_ref_is_dead(struct fut
if (smp_load_acquire(&fph->state) == FR_PERCPU)
return false;
- return atomic_long_read(&mm->futex_atomic) == 0;
+ return atomic_long_read(&mm->futex.phash.atomic) == 0;
}
void futex_mm_init(struct mm_struct *mm)
{
- mutex_init(&mm->futex_hash_lock);
- RCU_INIT_POINTER(mm->futex_phash, NULL);
- mm->futex_phash_new = NULL;
- /* futex-ref */
- mm->futex_ref = NULL;
- atomic_long_set(&mm->futex_atomic, 0);
- mm->futex_batches = get_state_synchronize_rcu();
+ memset(&mm->futex, 0, sizeof(mm->futex));
+ mutex_init(&mm->futex.phash.lock);
+ mm->futex.phash.batches = get_state_synchronize_rcu();
}
void futex_hash_free(struct mm_struct *mm)
{
struct futex_private_hash *fph;
- free_percpu(mm->futex_ref);
- kvfree(mm->futex_phash_new);
- fph = rcu_dereference_raw(mm->futex_phash);
+ free_percpu(mm->futex.phash.ref);
+ kvfree(mm->futex.phash.hash_new);
+ fph = rcu_dereference_raw(mm->futex.phash.hash);
if (fph)
kvfree(fph);
}
@@ -1745,10 +1736,10 @@ static bool futex_pivot_pending(struct m
guard(rcu)();
- if (!mm->futex_phash_new)
+ if (!mm->futex.phash.hash_new)
return true;
- fph = rcu_dereference(mm->futex_phash);
+ fph = rcu_dereference(mm->futex.phash.hash);
return futex_ref_is_dead(fph);
}
@@ -1790,7 +1781,7 @@ static int futex_hash_allocate(unsigned
* Once we've disabled the global hash there is no way back.
*/
scoped_guard(rcu) {
- fph = rcu_dereference(mm->futex_phash);
+ fph = rcu_dereference(mm->futex.phash.hash);
if (fph && !fph->hash_mask) {
if (custom)
return -EBUSY;
@@ -1798,15 +1789,15 @@ static int futex_hash_allocate(unsigned
}
}
- if (!mm->futex_ref) {
+ if (!mm->futex.phash.ref) {
/*
* This will always be allocated by the first thread and
* therefore requires no locking.
*/
- mm->futex_ref = alloc_percpu(unsigned int);
- if (!mm->futex_ref)
+ mm->futex.phash.ref = alloc_percpu(unsigned int);
+ if (!mm->futex.phash.ref)
return -ENOMEM;
- this_cpu_inc(*mm->futex_ref); /* 0 -> 1 */
+ this_cpu_inc(*mm->futex.phash.ref); /* 0 -> 1 */
}
fph = kvzalloc(struct_size(fph, queues, hash_slots),
@@ -1829,14 +1820,14 @@ static int futex_hash_allocate(unsigned
wait_var_event(mm, futex_pivot_pending(mm));
}
- scoped_guard(mutex, &mm->futex_hash_lock) {
+ scoped_guard(mutex, &mm->futex.phash.lock) {
struct futex_private_hash *free __free(kvfree) = NULL;
struct futex_private_hash *cur, *new;
- cur = rcu_dereference_protected(mm->futex_phash,
- lockdep_is_held(&mm->futex_hash_lock));
- new = mm->futex_phash_new;
- mm->futex_phash_new = NULL;
+ cur = rcu_dereference_protected(mm->futex.phash.hash,
+ lockdep_is_held(&mm->futex.phash.lock));
+ new = mm->futex.phash.hash_new;
+ mm->futex.phash.hash_new = NULL;
if (fph) {
if (cur && !cur->hash_mask) {
@@ -1846,7 +1837,7 @@ static int futex_hash_allocate(unsigned
* the second one returns here.
*/
free = fph;
- mm->futex_phash_new = new;
+ mm->futex.phash.hash_new = new;
return -EBUSY;
}
if (cur && !new) {
@@ -1876,7 +1867,7 @@ static int futex_hash_allocate(unsigned
if (new) {
/*
- * Will set mm->futex_phash_new on failure;
+	 * Will set mm->futex.phash.hash_new on failure;
* futex_private_hash_get() will try again.
*/
if (!__futex_pivot_hash(mm, new) && custom)
@@ -1895,11 +1886,9 @@ int futex_hash_allocate_default(void)
return 0;
scoped_guard(rcu) {
- threads = min_t(unsigned int,
- get_nr_threads(current),
- num_online_cpus());
+ threads = min_t(unsigned int, get_nr_threads(current), num_online_cpus());
- fph = rcu_dereference(current->mm->futex_phash);
+ fph = rcu_dereference(current->mm->futex.phash.hash);
if (fph) {
if (fph->custom)
return 0;
@@ -1926,7 +1915,7 @@ static int futex_hash_get_slots(void)
struct futex_private_hash *fph;
guard(rcu)();
- fph = rcu_dereference(current->mm->futex_phash);
+ fph = rcu_dereference(current->mm->futex.phash.hash);
if (fph && fph->hash_mask)
return fph->hash_mask + 1;
return 0;
Hello Thomas,
On Mon, Mar 30, 2026 at 5:05 PM Thomas Gleixner <tglx@kernel.org> wrote:
>
> +#ifdef CONFIG_FUTEX_PRIVATE_HASH
> +/**
> + * struct futex_mm_phash - Futex private hash related per MM data
> + * @lock: Mutex to protect the private hash operations
> + * @hash: RCU managed pointer to the private hash
> + * @hash_new: Pointer to a newly allocated private hash
> + * @batches: Batch state for RCU synchronization
> + * @rcu: RCU head for call_rcu()
> + * @atomic: Aggregate value for @hash_ref
> + * @ref: Per CPU reference counter for a private hash
> + */
> +struct futex_mm_phash {
> + struct mutex lock;
> + struct futex_private_hash __rcu *hash;
> + struct futex_private_hash *hash_new;
> + unsigned long batches;
> + struct rcu_head rcu;
> + atomic_long_t atomic;
> + unsigned int __percpu *ref;
> +};
> +#else /* CONFIG_FUTEX_ROBUST_UNLOCK */
The comment here should be CONFIG_FUTEX_PRIVATE_HASH
> +struct futex_mm_phash { };
> +#endif /* !CONFIG_FUTEX_ROBUST_UNLOCK */
And the same.
> +/**
> + * struct futex_mm_data - Futex related per MM data
> + * @phash: Futex private hash related data
> + */
> +struct futex_mm_data {
> + struct futex_mm_phash phash;
> +};
> +#else /* CONFIG_FUTEX */
> struct futex_sched_data { };
> +struct futex_mm_data { };
> #endif /* !CONFIG_FUTEX */
>
> #endif /* _LINUX_FUTEX_TYPES_H */
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1221,16 +1221,7 @@ struct mm_struct {
> */
> seqcount_t mm_lock_seq;
> #endif
> -#ifdef CONFIG_FUTEX_PRIVATE_HASH
> - struct mutex futex_hash_lock;
> - struct futex_private_hash __rcu *futex_phash;
> - struct futex_private_hash *futex_phash_new;
> - /* futex-ref */
> - unsigned long futex_batches;
> - struct rcu_head futex_rcu;
> - atomic_long_t futex_atomic;
> - unsigned int __percpu *futex_ref;
> -#endif
> + struct futex_mm_data futex;
>
> unsigned long hiwater_rss; /* High-watermark of RSS usage */
> unsigned long hiwater_vm; /* High-water virtual memory usage */
> --- a/kernel/futex/core.c
> +++ b/kernel/futex/core.c
> @@ -188,13 +188,13 @@ static struct futex_hash_bucket *
> return NULL;
>
> if (!fph)
> - fph = rcu_dereference(key->private.mm->futex_phash);
> + fph = rcu_dereference(key->private.mm->futex.phash.hash);
> if (!fph || !fph->hash_mask)
> return NULL;
>
> - hash = jhash2((void *)&key->private.address,
> - sizeof(key->private.address) / 4,
> + hash = jhash2((void *)&key->private.address, sizeof(key->private.address) / 4,
> key->both.offset);
> +
> return &fph->queues[hash & fph->hash_mask];
> }
>
> @@ -233,18 +233,17 @@ static void futex_rehash_private(struct
> }
> }
>
> -static bool __futex_pivot_hash(struct mm_struct *mm,
> - struct futex_private_hash *new)
> +static bool __futex_pivot_hash(struct mm_struct *mm, struct futex_private_hash *new)
> {
> + struct futex_mm_phash *mmph = &mm->futex.phash;
> struct futex_private_hash *fph;
>
> - WARN_ON_ONCE(mm->futex_phash_new);
> + WARN_ON_ONCE(mmph->hash_new);
>
> - fph = rcu_dereference_protected(mm->futex_phash,
> - lockdep_is_held(&mm->futex_hash_lock));
> + fph = rcu_dereference_protected(mmph->hash, lockdep_is_held(&mmph->lock));
> if (fph) {
> if (!futex_ref_is_dead(fph)) {
> - mm->futex_phash_new = new;
> + mmph->hash_new = new;
> return false;
> }
>
> @@ -252,8 +251,8 @@ static bool __futex_pivot_hash(struct mm
> }
> new->state = FR_PERCPU;
> scoped_guard(rcu) {
> - mm->futex_batches = get_state_synchronize_rcu();
> - rcu_assign_pointer(mm->futex_phash, new);
> + mmph->batches = get_state_synchronize_rcu();
> + rcu_assign_pointer(mmph->hash, new);
> }
> kvfree_rcu(fph, rcu);
> return true;
> @@ -261,12 +260,12 @@ static bool __futex_pivot_hash(struct mm
>
> static void futex_pivot_hash(struct mm_struct *mm)
> {
> - scoped_guard(mutex, &mm->futex_hash_lock) {
> + scoped_guard(mutex, &mm->futex.phash.lock) {
> struct futex_private_hash *fph;
>
> - fph = mm->futex_phash_new;
> + fph = mm->futex.phash.hash_new;
> if (fph) {
> - mm->futex_phash_new = NULL;
> + mm->futex.phash.hash_new = NULL;
> __futex_pivot_hash(mm, fph);
> }
> }
> @@ -289,7 +288,7 @@ struct futex_private_hash *futex_private
> scoped_guard(rcu) {
> struct futex_private_hash *fph;
>
> - fph = rcu_dereference(mm->futex_phash);
> + fph = rcu_dereference(mm->futex.phash.hash);
> if (!fph)
> return NULL;
>
> @@ -412,8 +411,7 @@ static int futex_mpol(struct mm_struct *
> * private hash) is returned if existing. Otherwise a hash bucket from the
> * global hash is returned.
> */
> -static struct futex_hash_bucket *
> -__futex_hash(union futex_key *key, struct futex_private_hash *fph)
> +static struct futex_hash_bucket *__futex_hash(union futex_key *key, struct futex_private_hash *fph)
> {
> int node = key->both.node;
> u32 hash;
> @@ -426,8 +424,7 @@ static struct futex_hash_bucket *
> return hb;
> }
>
> - hash = jhash2((u32 *)key,
> - offsetof(typeof(*key), both.offset) / sizeof(u32),
> + hash = jhash2((u32 *)key, offsetof(typeof(*key), both.offset) / sizeof(u32),
> key->both.offset);
>
> if (node == FUTEX_NO_NODE) {
> @@ -442,8 +439,7 @@ static struct futex_hash_bucket *
> */
> node = (hash >> futex_hashshift) % nr_node_ids;
> if (!node_possible(node)) {
> - node = find_next_bit_wrap(node_possible_map.bits,
> - nr_node_ids, node);
> + node = find_next_bit_wrap(node_possible_map.bits, nr_node_ids, node);
> }
> }
>
> @@ -460,9 +456,8 @@ static struct futex_hash_bucket *
> * Return: Initialized hrtimer_sleeper structure or NULL if no timeout
> * value given
> */
> -struct hrtimer_sleeper *
> -futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
> - int flags, u64 range_ns)
> +struct hrtimer_sleeper *futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
> + int flags, u64 range_ns)
> {
> if (!time)
> return NULL;
> @@ -1551,17 +1546,17 @@ static void __futex_ref_atomic_begin(str
> * otherwise it would be impossible for it to have reported success
> * from futex_ref_is_dead().
> */
> - WARN_ON_ONCE(atomic_long_read(&mm->futex_atomic) != 0);
> + WARN_ON_ONCE(atomic_long_read(&mm->futex.phash.atomic) != 0);
>
> /*
> * Set the atomic to the bias value such that futex_ref_{get,put}()
> * will never observe 0. Will be fixed up in __futex_ref_atomic_end()
> * when folding in the percpu count.
> */
> - atomic_long_set(&mm->futex_atomic, LONG_MAX);
> + atomic_long_set(&mm->futex.phash.atomic, LONG_MAX);
> smp_store_release(&fph->state, FR_ATOMIC);
>
> - call_rcu_hurry(&mm->futex_rcu, futex_ref_rcu);
> + call_rcu_hurry(&mm->futex.phash.rcu, futex_ref_rcu);
> }
>
> static void __futex_ref_atomic_end(struct futex_private_hash *fph)
> @@ -1582,7 +1577,7 @@ static void __futex_ref_atomic_end(struc
> * Therefore the per-cpu counter is now stable, sum and reset.
> */
> for_each_possible_cpu(cpu) {
> - unsigned int *ptr = per_cpu_ptr(mm->futex_ref, cpu);
> + unsigned int *ptr = per_cpu_ptr(mm->futex.phash.ref, cpu);
> count += *ptr;
> *ptr = 0;
> }
> @@ -1590,7 +1585,7 @@ static void __futex_ref_atomic_end(struc
> /*
> * Re-init for the next cycle.
> */
> - this_cpu_inc(*mm->futex_ref); /* 0 -> 1 */
> + this_cpu_inc(*mm->futex.phash.ref); /* 0 -> 1 */
>
> /*
> * Add actual count, subtract bias and initial refcount.
> @@ -1598,7 +1593,7 @@ static void __futex_ref_atomic_end(struc
> * The moment this atomic operation happens, futex_ref_is_dead() can
> * become true.
> */
> - ret = atomic_long_add_return(count - LONG_MAX - 1, &mm->futex_atomic);
> + ret = atomic_long_add_return(count - LONG_MAX - 1, &mm->futex.phash.atomic);
> if (!ret)
> wake_up_var(mm);
>
> @@ -1608,8 +1603,8 @@ static void __futex_ref_atomic_end(struc
>
> static void futex_ref_rcu(struct rcu_head *head)
> {
> - struct mm_struct *mm = container_of(head, struct mm_struct, futex_rcu);
> - struct futex_private_hash *fph = rcu_dereference_raw(mm->futex_phash);
> + struct mm_struct *mm = container_of(head, struct mm_struct, futex.phash.rcu);
> + struct futex_private_hash *fph = rcu_dereference_raw(mm->futex.phash.hash);
>
> if (fph->state == FR_PERCPU) {
> /*
> @@ -1638,7 +1633,7 @@ static void futex_ref_drop(struct futex_
> /*
> * Can only transition the current fph;
> */
> - WARN_ON_ONCE(rcu_dereference_raw(mm->futex_phash) != fph);
> + WARN_ON_ONCE(rcu_dereference_raw(mm->futex.phash.hash) != fph);
> /*
> * We enqueue at least one RCU callback. Ensure mm stays if the task
> * exits before the transition is completed.
> @@ -1649,9 +1644,9 @@ static void futex_ref_drop(struct futex_
> * In order to avoid the following scenario:
> *
> * futex_hash() __futex_pivot_hash()
> - * guard(rcu); guard(mm->futex_hash_lock);
> - * fph = mm->futex_phash;
> - * rcu_assign_pointer(&mm->futex_phash, new);
> + * guard(rcu); guard(mm->futex.phash.lock);
> + * fph = mm->futex.phash.hash;
> + * rcu_assign_pointer(&mm->futex.phash.hash, new);
> * futex_hash_allocate()
> * futex_ref_drop()
> * fph->state = FR_ATOMIC;
> @@ -1666,7 +1661,7 @@ static void futex_ref_drop(struct futex_
> * There must be at least one full grace-period between publishing a
> * new fph and trying to replace it.
> */
> - if (poll_state_synchronize_rcu(mm->futex_batches)) {
> + if (poll_state_synchronize_rcu(mm->futex.phash.batches)) {
> /*
> * There was a grace-period, we can begin now.
> */
> @@ -1674,7 +1669,7 @@ static void futex_ref_drop(struct futex_
> return;
> }
>
> - call_rcu_hurry(&mm->futex_rcu, futex_ref_rcu);
> + call_rcu_hurry(&mm->futex.phash.rcu, futex_ref_rcu);
> }
>
> static bool futex_ref_get(struct futex_private_hash *fph)
> @@ -1684,11 +1679,11 @@ static bool futex_ref_get(struct futex_p
> guard(preempt)();
>
> if (READ_ONCE(fph->state) == FR_PERCPU) {
> - __this_cpu_inc(*mm->futex_ref);
> + __this_cpu_inc(*mm->futex.phash.ref);
> return true;
> }
>
> - return atomic_long_inc_not_zero(&mm->futex_atomic);
> + return atomic_long_inc_not_zero(&mm->futex.phash.atomic);
> }
>
> static bool futex_ref_put(struct futex_private_hash *fph)
> @@ -1698,11 +1693,11 @@ static bool futex_ref_put(struct futex_p
> guard(preempt)();
>
> if (READ_ONCE(fph->state) == FR_PERCPU) {
> - __this_cpu_dec(*mm->futex_ref);
> + __this_cpu_dec(*mm->futex.phash.ref);
> return false;
> }
>
> - return atomic_long_dec_and_test(&mm->futex_atomic);
> + return atomic_long_dec_and_test(&mm->futex.phash.atomic);
> }
>
> static bool futex_ref_is_dead(struct futex_private_hash *fph)
> @@ -1714,27 +1709,23 @@ static bool futex_ref_is_dead(struct fut
> if (smp_load_acquire(&fph->state) == FR_PERCPU)
> return false;
>
> - return atomic_long_read(&mm->futex_atomic) == 0;
> + return atomic_long_read(&mm->futex.phash.atomic) == 0;
> }
>
> void futex_mm_init(struct mm_struct *mm)
> {
> - mutex_init(&mm->futex_hash_lock);
> - RCU_INIT_POINTER(mm->futex_phash, NULL);
> - mm->futex_phash_new = NULL;
> - /* futex-ref */
> - mm->futex_ref = NULL;
> - atomic_long_set(&mm->futex_atomic, 0);
> - mm->futex_batches = get_state_synchronize_rcu();
> + memset(&mm->futex, 0, sizeof(mm->futex));
> + mutex_init(&mm->futex.phash.lock);
> + mm->futex.phash.batches = get_state_synchronize_rcu();
> }
>
> void futex_hash_free(struct mm_struct *mm)
> {
> struct futex_private_hash *fph;
>
> - free_percpu(mm->futex_ref);
> - kvfree(mm->futex_phash_new);
> - fph = rcu_dereference_raw(mm->futex_phash);
> + free_percpu(mm->futex.phash.ref);
> + kvfree(mm->futex.phash.hash_new);
> + fph = rcu_dereference_raw(mm->futex.phash.hash);
> if (fph)
> kvfree(fph);
> }
> @@ -1745,10 +1736,10 @@ static bool futex_pivot_pending(struct m
>
> guard(rcu)();
>
> - if (!mm->futex_phash_new)
> + if (!mm->futex.phash.hash_new)
> return true;
>
> - fph = rcu_dereference(mm->futex_phash);
> + fph = rcu_dereference(mm->futex.phash.hash);
> return futex_ref_is_dead(fph);
> }
>
> @@ -1790,7 +1781,7 @@ static int futex_hash_allocate(unsigned
> * Once we've disabled the global hash there is no way back.
> */
> scoped_guard(rcu) {
> - fph = rcu_dereference(mm->futex_phash);
> + fph = rcu_dereference(mm->futex.phash.hash);
> if (fph && !fph->hash_mask) {
> if (custom)
> return -EBUSY;
> @@ -1798,15 +1789,15 @@ static int futex_hash_allocate(unsigned
> }
> }
>
> - if (!mm->futex_ref) {
> + if (!mm->futex.phash.ref) {
> /*
> * This will always be allocated by the first thread and
> * therefore requires no locking.
> */
> - mm->futex_ref = alloc_percpu(unsigned int);
> - if (!mm->futex_ref)
> + mm->futex.phash.ref = alloc_percpu(unsigned int);
> + if (!mm->futex.phash.ref)
> return -ENOMEM;
> - this_cpu_inc(*mm->futex_ref); /* 0 -> 1 */
> + this_cpu_inc(*mm->futex.phash.ref); /* 0 -> 1 */
> }
>
> fph = kvzalloc(struct_size(fph, queues, hash_slots),
> @@ -1829,14 +1820,14 @@ static int futex_hash_allocate(unsigned
> wait_var_event(mm, futex_pivot_pending(mm));
> }
>
> - scoped_guard(mutex, &mm->futex_hash_lock) {
> + scoped_guard(mutex, &mm->futex.phash.lock) {
> struct futex_private_hash *free __free(kvfree) = NULL;
> struct futex_private_hash *cur, *new;
>
> - cur = rcu_dereference_protected(mm->futex_phash,
> - lockdep_is_held(&mm->futex_hash_lock));
> - new = mm->futex_phash_new;
> - mm->futex_phash_new = NULL;
> + cur = rcu_dereference_protected(mm->futex.phash.hash,
> + lockdep_is_held(&mm->futex.phash.lock));
> + new = mm->futex.phash.hash_new;
> + mm->futex.phash.hash_new = NULL;
>
> if (fph) {
> if (cur && !cur->hash_mask) {
> @@ -1846,7 +1837,7 @@ static int futex_hash_allocate(unsigned
> * the second one returns here.
> */
> free = fph;
> - mm->futex_phash_new = new;
> + mm->futex.phash.hash_new = new;
> return -EBUSY;
> }
> if (cur && !new) {
> @@ -1876,7 +1867,7 @@ static int futex_hash_allocate(unsigned
>
> if (new) {
> /*
> - * Will set mm->futex_phash_new on failure;
> + * Will set mm->futex.phash_new on failure;
Shouldn't that be mm->futex.phash.hash_new?
The marker for PI futexes in the robust list is a hardcoded 0x1 which lacks
any sensible form of documentation.
Provide proper defines for the bit and the mask and fix up the usage
sites. Thereby convert the boolean pi argument into a modifier argument,
which allows new modifier bits to be trivially added and conveyed.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: André Almeida <andrealmeid@igalia.com>
---
V2: Explain the code shuffling - Andre
---
include/uapi/linux/futex.h | 4 +++
kernel/futex/core.c | 53 +++++++++++++++++++++------------------------
2 files changed, 29 insertions(+), 28 deletions(-)
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -177,6 +177,10 @@ struct robust_list_head {
*/
#define ROBUST_LIST_LIMIT 2048
+/* Modifiers for robust_list_head::list_op_pending */
+#define FUTEX_ROBUST_MOD_PI (0x1UL)
+#define FUTEX_ROBUST_MOD_MASK (FUTEX_ROBUST_MOD_PI)
+
/*
* bitset with all bits set for the FUTEX_xxx_BITSET OPs to request a
* match of any bit.
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1009,8 +1009,9 @@ void futex_unqueue_pi(struct futex_q *q)
* dying task, and do notification if so:
*/
static int handle_futex_death(u32 __user *uaddr, struct task_struct *curr,
- bool pi, bool pending_op)
+ unsigned int mod, bool pending_op)
{
+ bool pi = !!(mod & FUTEX_ROBUST_MOD_PI);
u32 uval, nval, mval;
pid_t owner;
int err;
@@ -1128,21 +1129,21 @@ static int handle_futex_death(u32 __user
*/
static inline int fetch_robust_entry(struct robust_list __user **entry,
struct robust_list __user * __user *head,
- unsigned int *pi)
+ unsigned int *mod)
{
unsigned long uentry;
if (get_user(uentry, (unsigned long __user *)head))
return -EFAULT;
- *entry = (void __user *)(uentry & ~1UL);
- *pi = uentry & 1;
+ *entry = (void __user *)(uentry & ~FUTEX_ROBUST_MOD_MASK);
+ *mod = uentry & FUTEX_ROBUST_MOD_MASK;
return 0;
}
/*
- * Walk curr->robust_list (very carefully, it's a userspace list!)
+ * Walk curr->futex.robust_list (very carefully, it's a userspace list!)
* and mark any locks found there dead, and notify any waiters.
*
* We silently return on any sign of list-walking problem.
@@ -1150,9 +1151,8 @@ static inline int fetch_robust_entry(str
static void exit_robust_list(struct task_struct *curr)
{
struct robust_list_head __user *head = curr->futex.robust_list;
+ unsigned int limit = ROBUST_LIST_LIMIT, cur_mod, next_mod, pend_mod;
struct robust_list __user *entry, *next_entry, *pending;
- unsigned int limit = ROBUST_LIST_LIMIT, pi, pip;
- unsigned int next_pi;
unsigned long futex_offset;
int rc;
@@ -1160,7 +1160,7 @@ static void exit_robust_list(struct task
* Fetch the list head (which was registered earlier, via
* sys_set_robust_list()):
*/
- if (fetch_robust_entry(&entry, &head->list.next, &pi))
+ if (fetch_robust_entry(&entry, &head->list.next, &cur_mod))
return;
/*
* Fetch the relative futex offset:
@@ -1171,7 +1171,7 @@ static void exit_robust_list(struct task
* Fetch any possibly pending lock-add first, and handle it
* if it exists:
*/
- if (fetch_robust_entry(&pending, &head->list_op_pending, &pip))
+ if (fetch_robust_entry(&pending, &head->list_op_pending, &pend_mod))
return;
next_entry = NULL; /* avoid warning with gcc */
@@ -1180,20 +1180,20 @@ static void exit_robust_list(struct task
* Fetch the next entry in the list before calling
* handle_futex_death:
*/
- rc = fetch_robust_entry(&next_entry, &entry->next, &next_pi);
+ rc = fetch_robust_entry(&next_entry, &entry->next, &next_mod);
/*
* A pending lock might already be on the list, so
* don't process it twice:
*/
if (entry != pending) {
if (handle_futex_death((void __user *)entry + futex_offset,
- curr, pi, HANDLE_DEATH_LIST))
+ curr, cur_mod, HANDLE_DEATH_LIST))
return;
}
if (rc)
return;
entry = next_entry;
- pi = next_pi;
+ cur_mod = next_mod;
/*
* Avoid excessively long or circular lists:
*/
@@ -1205,7 +1205,7 @@ static void exit_robust_list(struct task
if (pending) {
handle_futex_death((void __user *)pending + futex_offset,
- curr, pip, HANDLE_DEATH_PENDING);
+ curr, pend_mod, HANDLE_DEATH_PENDING);
}
}
@@ -1224,29 +1224,28 @@ static void __user *futex_uaddr(struct r
*/
static inline int
compat_fetch_robust_entry(compat_uptr_t *uentry, struct robust_list __user **entry,
- compat_uptr_t __user *head, unsigned int *pi)
+ compat_uptr_t __user *head, unsigned int *pflags)
{
if (get_user(*uentry, head))
return -EFAULT;
- *entry = compat_ptr((*uentry) & ~1);
- *pi = (unsigned int)(*uentry) & 1;
+ *entry = compat_ptr((*uentry) & ~FUTEX_ROBUST_MOD_MASK);
+ *pflags = (unsigned int)(*uentry) & FUTEX_ROBUST_MOD_MASK;
return 0;
}
/*
- * Walk curr->robust_list (very carefully, it's a userspace list!)
+ * Walk curr->futex.robust_list (very carefully, it's a userspace list!)
* and mark any locks found there dead, and notify any waiters.
*
* We silently return on any sign of list-walking problem.
*/
static void compat_exit_robust_list(struct task_struct *curr)
{
- struct compat_robust_list_head __user *head = curr->futex.compat_robust_list;
+ struct compat_robust_list_head __user *head = current->futex.compat_robust_list;
+ unsigned int limit = ROBUST_LIST_LIMIT, cur_mod, next_mod, pend_mod;
struct robust_list __user *entry, *next_entry, *pending;
- unsigned int limit = ROBUST_LIST_LIMIT, pi, pip;
- unsigned int next_pi;
compat_uptr_t uentry, next_uentry, upending;
compat_long_t futex_offset;
int rc;
@@ -1255,7 +1254,7 @@ static void compat_exit_robust_list(stru
* Fetch the list head (which was registered earlier, via
* sys_set_robust_list()):
*/
- if (compat_fetch_robust_entry(&uentry, &entry, &head->list.next, &pi))
+ if (compat_fetch_robust_entry(&uentry, &entry, &head->list.next, &cur_mod))
return;
/*
* Fetch the relative futex offset:
@@ -1266,8 +1265,7 @@ static void compat_exit_robust_list(stru
* Fetch any possibly pending lock-add first, and handle it
* if it exists:
*/
- if (compat_fetch_robust_entry(&upending, &pending,
- &head->list_op_pending, &pip))
+ if (compat_fetch_robust_entry(&upending, &pending, &head->list_op_pending, &pend_mod))
return;
next_entry = NULL; /* avoid warning with gcc */
@@ -1277,7 +1275,7 @@ static void compat_exit_robust_list(stru
* handle_futex_death:
*/
rc = compat_fetch_robust_entry(&next_uentry, &next_entry,
- (compat_uptr_t __user *)&entry->next, &next_pi);
+ (compat_uptr_t __user *)&entry->next, &next_mod);
/*
* A pending lock might already be on the list, so
* dont process it twice:
@@ -1285,15 +1283,14 @@ static void compat_exit_robust_list(stru
if (entry != pending) {
void __user *uaddr = futex_uaddr(entry, futex_offset);
- if (handle_futex_death(uaddr, curr, pi,
- HANDLE_DEATH_LIST))
+ if (handle_futex_death(uaddr, curr, cur_mod, HANDLE_DEATH_LIST))
return;
}
if (rc)
return;
uentry = next_uentry;
entry = next_entry;
- pi = next_pi;
+ cur_mod = next_mod;
/*
* Avoid excessively long or circular lists:
*/
@@ -1305,7 +1302,7 @@ static void compat_exit_robust_list(stru
if (pending) {
void __user *uaddr = futex_uaddr(pending, futex_offset);
- handle_futex_death(uaddr, curr, pip, HANDLE_DEATH_PENDING);
+ handle_futex_death(uaddr, curr, pend_mod, HANDLE_DEATH_PENDING);
}
}
#endif
The upcoming support for unlocking robust futexes in the kernel requires
store release semantics. Syscalls do not imply memory ordering on all
architectures so the unlock operation requires a barrier.
This barrier can be avoided when stores imply release like on x86.
Provide a generic version with a smp_mb() before the unsafe_put_user(),
which can be overridden by architectures.
Provide also a ARCH_MEMORY_ORDER_TOS Kconfig option, which can be selected
by architectures with Total Store Order (TSO), where store implies release,
so that the smp_mb() in the generic implementation can be avoided.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: André Almeida <andrealmeid@igalia.com>
---
V3: Rename to CONFIG_ARCH_MEMORY_ORDER_TSO - Peter
V2: New patch
---
arch/Kconfig | 4 ++++
include/linux/uaccess.h | 9 +++++++++
2 files changed, 13 insertions(+)
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -403,6 +403,10 @@ config ARCH_32BIT_OFF_T
config ARCH_32BIT_USTAT_F_TINODE
bool
+# Selected by architectures with Total Store Order (TOS)
+config ARCH_MEMORY_ORDER_TOS
+ bool
+
config HAVE_ASM_MODVERSIONS
bool
help
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -644,6 +644,15 @@ static inline void user_access_restore(u
#define user_read_access_end user_access_end
#endif
+#ifndef unsafe_atomic_store_release_user
+# define unsafe_atomic_store_release_user(val, uptr, elbl) \
+ do { \
+ if (!IS_ENABLED(CONFIG_ARCH_MEMORY_ORDER_TOS)) \
+ smp_mb(); \
+ unsafe_put_user(val, uptr, elbl); \
+ } while (0)
+#endif
+
/* Define RW variant so the below _mode macro expansion works */
#define masked_user_rw_access_begin(u) masked_user_access_begin(u)
#define user_rw_access_begin(u, s) user_access_begin(u, s)
On Mon, Mar 30, 2026 at 02:02:25PM +0200, Thomas Gleixner wrote:
> The upcoming support for unlocking robust futexes in the kernel requires
> store release semantics. Syscalls do not imply memory ordering on all
> architectures so the unlock operation requires a barrier.
>
> This barrier can be avoided when stores imply release like on x86.
>
> Provide a generic version with a smp_mb() before the unsafe_put_user(),
> which can be overridden by architectures.
>
> Provide also a ARCH_MEMORY_ORDER_TOS Kconfig option, which can be selected
> by architectures with Total Store Order (TSO), where store implies release,
> so that the smp_mb() in the generic implementation can be avoided.
>
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> Reviewed-by: André Almeida <andrealmeid@igalia.com>
> ---
> V3: Rename to CONFIG_ARCH_MEMORY_ORDER_TSO - Peter
Looks like this was missed, or applied incompletely? There are a number
of uses of "TOS" rather than "TSO", above, below, and in subsequent
patches.
Mark.
> V2: New patch
> ---
> arch/Kconfig | 4 ++++
> include/linux/uaccess.h | 9 +++++++++
> 2 files changed, 13 insertions(+)
>
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -403,6 +403,10 @@ config ARCH_32BIT_OFF_T
> config ARCH_32BIT_USTAT_F_TINODE
> bool
>
> +# Selected by architectures with Total Store Order (TOS)
> +config ARCH_MEMORY_ORDER_TOS
> + bool
> +
> config HAVE_ASM_MODVERSIONS
> bool
> help
> --- a/include/linux/uaccess.h
> +++ b/include/linux/uaccess.h
> @@ -644,6 +644,15 @@ static inline void user_access_restore(u
> #define user_read_access_end user_access_end
> #endif
>
> +#ifndef unsafe_atomic_store_release_user
> +# define unsafe_atomic_store_release_user(val, uptr, elbl) \
> + do { \
> + if (!IS_ENABLED(CONFIG_ARCH_MEMORY_ORDER_TOS)) \
> + smp_mb(); \
> + unsafe_put_user(val, uptr, elbl); \
> + } while (0)
> +#endif
> +
> /* Define RW variant so the below _mode macro expansion works */
> #define masked_user_rw_access_begin(u) masked_user_access_begin(u)
> #define user_rw_access_begin(u, s) user_access_begin(u, s)
>
The generic unsafe_atomic_store_release_user() implementation does:
    if (!IS_ENABLED(CONFIG_ARCH_MEMORY_ORDER_TOS))
        smp_mb();
    unsafe_put_user();
As x86 implements Total Store Order (TOS) which means stores imply release,
select ARCH_MEMORY_ORDER_TOS to avoid the unnecessary smp_mb().
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: André Almeida <andrealmeid@igalia.com>
---
V3: Rename to TOS - Peter
V2: New patch
---
arch/x86/Kconfig | 1 +
1 file changed, 1 insertion(+)
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -114,6 +114,7 @@ config X86
select ARCH_HAS_ZONE_DMA_SET if EXPERT
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_HAVE_EXTRA_ELF_NOTES
+ select ARCH_MEMORY_ORDER_TOS
select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI
select ARCH_MIGHT_HAVE_PC_PARPORT
On Mon, Mar 30, 2026 at 02:02:31PM +0200, Thomas Gleixner wrote:
> The generic unsafe_atomic_store_release_user() implementation does:
>
>     if (!IS_ENABLED(CONFIG_ARCH_MEMORY_ORDER_TOS))
>         smp_mb();
>     unsafe_put_user();
>
> As x86 implements Total Store Order (TOS) which means stores imply release,
> select ARCH_MEMORY_ORDER_TOS to avoid the unnecessary smp_mb().
>
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> Reviewed-by: André Almeida <andrealmeid@igalia.com>
> ---
> V3: Rename to TOS - Peter

As on the last patch, shouldn't that be TSO?

Mark

> V2: New patch
> ---
> arch/x86/Kconfig | 1 +
> 1 file changed, 1 insertion(+)
>
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -114,6 +114,7 @@ config X86
> select ARCH_HAS_ZONE_DMA_SET if EXPERT
> select ARCH_HAVE_NMI_SAFE_CMPXCHG
> select ARCH_HAVE_EXTRA_ELF_NOTES
> + select ARCH_MEMORY_ORDER_TOS
> select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
> select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI
> select ARCH_MIGHT_HAVE_PC_PARPORT
>
On Mon, Mar 30 2026 at 14:34, Mark Rutland wrote:
> On Mon, Mar 30, 2026 at 02:02:31PM +0200, Thomas Gleixner wrote:
>> The generic unsafe_atomic_store_release_user() implementation does:
>>
>>     if (!IS_ENABLED(CONFIG_ARCH_MEMORY_ORDER_TOS))
>>         smp_mb();
>>     unsafe_put_user();
>>
>> As x86 implements Total Store Order (TOS) which means stores imply release,
>> select ARCH_MEMORY_ORDER_TOS to avoid the unnecessary smp_mb().
>>
>> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
>> Reviewed-by: André Almeida <andrealmeid@igalia.com>
>> ---
>> V3: Rename to TOS - Peter
>
> As on the last patch, shouldn't that be TSO?

Duh. Yes. My dyslexia seems to get worse.
Make the operand defines tabular for readability sake.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: André Almeida <andrealmeid@igalia.com>
---
V2: New patch
---
include/uapi/linux/futex.h | 27 +++++++++++++--------------
1 file changed, 13 insertions(+), 14 deletions(-)
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -25,23 +25,22 @@
#define FUTEX_PRIVATE_FLAG 128
#define FUTEX_CLOCK_REALTIME 256
-#define FUTEX_CMD_MASK ~(FUTEX_PRIVATE_FLAG | FUTEX_CLOCK_REALTIME)
-#define FUTEX_WAIT_PRIVATE (FUTEX_WAIT | FUTEX_PRIVATE_FLAG)
-#define FUTEX_WAKE_PRIVATE (FUTEX_WAKE | FUTEX_PRIVATE_FLAG)
-#define FUTEX_REQUEUE_PRIVATE (FUTEX_REQUEUE | FUTEX_PRIVATE_FLAG)
-#define FUTEX_CMP_REQUEUE_PRIVATE (FUTEX_CMP_REQUEUE | FUTEX_PRIVATE_FLAG)
-#define FUTEX_WAKE_OP_PRIVATE (FUTEX_WAKE_OP | FUTEX_PRIVATE_FLAG)
-#define FUTEX_LOCK_PI_PRIVATE (FUTEX_LOCK_PI | FUTEX_PRIVATE_FLAG)
-#define FUTEX_LOCK_PI2_PRIVATE (FUTEX_LOCK_PI2 | FUTEX_PRIVATE_FLAG)
-#define FUTEX_UNLOCK_PI_PRIVATE (FUTEX_UNLOCK_PI | FUTEX_PRIVATE_FLAG)
-#define FUTEX_TRYLOCK_PI_PRIVATE (FUTEX_TRYLOCK_PI | FUTEX_PRIVATE_FLAG)
+#define FUTEX_CMD_MASK ~(FUTEX_PRIVATE_FLAG | FUTEX_CLOCK_REALTIME)
+
+#define FUTEX_WAIT_PRIVATE (FUTEX_WAIT | FUTEX_PRIVATE_FLAG)
+#define FUTEX_WAKE_PRIVATE (FUTEX_WAKE | FUTEX_PRIVATE_FLAG)
+#define FUTEX_REQUEUE_PRIVATE (FUTEX_REQUEUE | FUTEX_PRIVATE_FLAG)
+#define FUTEX_CMP_REQUEUE_PRIVATE (FUTEX_CMP_REQUEUE | FUTEX_PRIVATE_FLAG)
+#define FUTEX_WAKE_OP_PRIVATE (FUTEX_WAKE_OP | FUTEX_PRIVATE_FLAG)
+#define FUTEX_LOCK_PI_PRIVATE (FUTEX_LOCK_PI | FUTEX_PRIVATE_FLAG)
+#define FUTEX_LOCK_PI2_PRIVATE (FUTEX_LOCK_PI2 | FUTEX_PRIVATE_FLAG)
+#define FUTEX_UNLOCK_PI_PRIVATE (FUTEX_UNLOCK_PI | FUTEX_PRIVATE_FLAG)
+#define FUTEX_TRYLOCK_PI_PRIVATE (FUTEX_TRYLOCK_PI | FUTEX_PRIVATE_FLAG)
#define FUTEX_WAIT_BITSET_PRIVATE (FUTEX_WAIT_BITSET | FUTEX_PRIVATE_FLAG)
#define FUTEX_WAKE_BITSET_PRIVATE (FUTEX_WAKE_BITSET | FUTEX_PRIVATE_FLAG)
-#define FUTEX_WAIT_REQUEUE_PI_PRIVATE (FUTEX_WAIT_REQUEUE_PI | \
- FUTEX_PRIVATE_FLAG)
-#define FUTEX_CMP_REQUEUE_PI_PRIVATE (FUTEX_CMP_REQUEUE_PI | \
- FUTEX_PRIVATE_FLAG)
+#define FUTEX_WAIT_REQUEUE_PI_PRIVATE (FUTEX_WAIT_REQUEUE_PI | FUTEX_PRIVATE_FLAG)
+#define FUTEX_CMP_REQUEUE_PI_PRIVATE (FUTEX_CMP_REQUEUE_PI | FUTEX_PRIVATE_FLAG)
/*
* Flags for futex2 syscalls.
Unlocking robust non-PI futexes happens in user space with the following
sequence:
1) robust_list_set_op_pending(mutex);
2) robust_list_remove(mutex);
lval = 0;
3) lval = atomic_xchg(lock, lval);
4) if (lval & WAITERS)
5) sys_futex(WAKE,....);
6) robust_list_clear_op_pending();
That opens a window between #3 and #6 where the mutex could be acquired by
some other task which observes that it is the last user and:
A) unmaps the mutex memory
B) maps a different file, which ends up covering the same address
When the original task exits before reaching #6 then the kernel robust list
handling observes the pending op entry and tries to fix up user space.
In case the newly mapped data contains the TID of the exiting thread at
the address of the mutex/futex, the kernel will set the owner died bit in
that memory, thereby corrupting unrelated data.
PI futexes have a similar problem, both for the non-contended user space
unlock and the in-kernel unlock:
1) robust_list_set_op_pending(mutex);
2) robust_list_remove(mutex);
lval = gettid();
3) if (!atomic_try_cmpxchg(lock, lval, 0))
4) sys_futex(UNLOCK_PI,....);
5) robust_list_clear_op_pending();
Address the first part of the problem where the futexes have waiters and
need to enter the kernel anyway. Add a new FUTEX_UNLOCK_ROBUST flag, which
is valid for the sys_futex() FUTEX_UNLOCK_PI, FUTEX_WAKE, FUTEX_WAKE_BITSET
operations.
This deliberately omits FUTEX_WAKE_OP from this treatment, as it's unclear
whether this is needed and there is no usage of it in glibc to investigate
either.
For the futex2 syscall family this needs to be implemented with a new
syscall.
The sys_futex() case [ab]uses the @uaddr2 argument to hand the pointer to
robust_list_head::list_op_pending into the kernel. This argument is only
evaluated when the FUTEX_UNLOCK_ROBUST bit is set and is therefore backward
compatible.
This is an explicit argument to avoid the lookup of the robust list pointer
and retrieving the pending op pointer from there. User space has the
pointer already available so it can just put it into the @uaddr2
argument. Aside of that this allows the usage of multiple robust lists in
the future without any changes to the internal functions as they just operate
on the provided pointer.
This requires a second flag, FUTEX_ROBUST_LIST32, which indicates that the
robust list pointer points to a u32 and not to a u64. This is required
for two reasons:
1) sys_futex() has no compat variant
2) The gaming emulators use both 64-bit and compat 32-bit robust
lists in the same 64-bit application
As a consequence 32-bit applications have to set this flag unconditionally
so they can run on a 64-bit kernel in compat mode unmodified. 32-bit
kernels return an error code when the flag is not set. 64-bit kernels will
happily clear the full 64 bits if user space fails to set it.
In case of FUTEX_UNLOCK_PI this clears the robust list pending op when the
unlock succeeded. In case of errors, the user space value is still locked
by the caller and therefore the above cannot happen.
In case of FUTEX_WAKE* this does the unlock of the futex in the kernel and
clears the robust list pending op when the unlock was successful. If not,
the user space value is still locked and user space has to deal with the
returned error. That means that the unlocking of non-PI robust futexes has
to use the same try_cmpxchg() unlock scheme as PI futexes.
If the clearing of the pending list op fails (fault) then the kernel clears
the registered robust list pointer if it matches, to prevent exit() from
trying to handle invalid data. That's a valid paranoid decision because
the robust list head usually sits in the TLS, and if the TLS is no longer
accessible then the chance of fixing up the resulting mess is very close
to zero.
The problem of non-contended unlocks still exists and will be addressed
separately.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: André Almeida <andrealmeid@igalia.com>
---
V3: Expand changelog to explain @uaddr2 - Andre
V2: Use store release for unlock - Andre, Peter
Use a separate FLAG for 32bit lists - Florian
Add command defines
---
include/uapi/linux/futex.h | 29 +++++++++++++++++++++++-
io_uring/futex.c | 2 -
kernel/futex/core.c | 53 +++++++++++++++++++++++++++++++++++++++++++--
kernel/futex/futex.h | 15 +++++++++++-
kernel/futex/pi.c | 15 +++++++++++-
kernel/futex/syscalls.c | 13 ++++++++---
kernel/futex/waitwake.c | 30 +++++++++++++++++++++++--
7 files changed, 144 insertions(+), 13 deletions(-)
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -25,8 +25,11 @@
#define FUTEX_PRIVATE_FLAG 128
#define FUTEX_CLOCK_REALTIME 256
+#define FUTEX_UNLOCK_ROBUST 512
+#define FUTEX_ROBUST_LIST32 1024
-#define FUTEX_CMD_MASK ~(FUTEX_PRIVATE_FLAG | FUTEX_CLOCK_REALTIME)
+#define FUTEX_CMD_MASK ~(FUTEX_PRIVATE_FLAG | FUTEX_CLOCK_REALTIME | \
+ FUTEX_UNLOCK_ROBUST | FUTEX_ROBUST_LIST32)
#define FUTEX_WAIT_PRIVATE (FUTEX_WAIT | FUTEX_PRIVATE_FLAG)
#define FUTEX_WAKE_PRIVATE (FUTEX_WAKE | FUTEX_PRIVATE_FLAG)
@@ -43,6 +46,30 @@
#define FUTEX_CMP_REQUEUE_PI_PRIVATE (FUTEX_CMP_REQUEUE_PI | FUTEX_PRIVATE_FLAG)
/*
+ * Operations to unlock a futex, clear the robust list pending op pointer and
+ * wake waiters.
+ */
+#define FUTEX_UNLOCK_PI_LIST64 (FUTEX_UNLOCK_PI | FUTEX_UNLOCK_ROBUST)
+#define FUTEX_UNLOCK_PI_LIST64_PRIVATE (FUTEX_UNLOCK_PI_LIST64 | FUTEX_PRIVATE_FLAG)
+#define FUTEX_UNLOCK_PI_LIST32 (FUTEX_UNLOCK_PI | FUTEX_UNLOCK_ROBUST | \
+ FUTEX_ROBUST_LIST32)
+#define FUTEX_UNLOCK_PI_LIST32_PRIVATE (FUTEX_UNLOCK_PI_LIST32 | FUTEX_PRIVATE_FLAG)
+
+#define FUTEX_UNLOCK_WAKE_LIST64 (FUTEX_WAKE | FUTEX_UNLOCK_ROBUST)
+#define FUTEX_UNLOCK_WAKE_LIST64_PRIVATE (FUTEX_UNLOCK_WAKE_LIST64 | FUTEX_PRIVATE_FLAG)
+
+#define FUTEX_UNLOCK_WAKE_LIST32 (FUTEX_WAKE | FUTEX_UNLOCK_ROBUST | \
+ FUTEX_ROBUST_LIST32)
+#define FUTEX_UNLOCK_WAKE_LIST32_PRIVATE (FUTEX_UNLOCK_WAKE_LIST32 | FUTEX_PRIVATE_FLAG)
+
+#define FUTEX_UNLOCK_BITSET_LIST64 (FUTEX_WAKE_BITSET | FUTEX_UNLOCK_ROBUST)
+#define FUTEX_UNLOCK_BITSET_LIST64_PRIVATE (FUTEX_UNLOCK_BITSET_LIST64 | FUTEX_PRIVATE_FLAG)
+
+#define FUTEX_UNLOCK_BITSET_LIST32 (FUTEX_WAKE_BITSET | FUTEX_UNLOCK_ROBUST | \
+ FUTEX_ROBUST_LIST32)
+#define FUTEX_UNLOCK_BITSET_LIST32_PRIVATE (FUTEX_UNLOCK_BITSET_LIST32 | FUTEX_PRIVATE_FLAG)
+
+/*
* Flags for futex2 syscalls.
*
* NOTE: these are not pure flags, they can also be seen as:
--- a/io_uring/futex.c
+++ b/io_uring/futex.c
@@ -325,7 +325,7 @@ int io_futex_wake(struct io_kiocb *req,
* Strict flags - ensure that waking 0 futexes yields a 0 result.
* See commit 43adf8449510 ("futex: FLAGS_STRICT") for details.
*/
- ret = futex_wake(iof->uaddr, FLAGS_STRICT | iof->futex_flags,
+ ret = futex_wake(iof->uaddr, FLAGS_STRICT | iof->futex_flags, NULL,
iof->futex_val, iof->futex_mask);
if (ret < 0)
req_set_fail(req);
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1063,7 +1063,7 @@ static int handle_futex_death(u32 __user
owner = uval & FUTEX_TID_MASK;
if (pending_op && !pi && !owner) {
- futex_wake(uaddr, FLAGS_SIZE_32 | FLAGS_SHARED, 1,
+ futex_wake(uaddr, FLAGS_SIZE_32 | FLAGS_SHARED, NULL, 1,
FUTEX_BITSET_MATCH_ANY);
return 0;
}
@@ -1117,7 +1117,7 @@ static int handle_futex_death(u32 __user
* PI futexes happens in exit_pi_state():
*/
if (!pi && (uval & FUTEX_WAITERS)) {
- futex_wake(uaddr, FLAGS_SIZE_32 | FLAGS_SHARED, 1,
+ futex_wake(uaddr, FLAGS_SIZE_32 | FLAGS_SHARED, NULL, 1,
FUTEX_BITSET_MATCH_ANY);
}
@@ -1209,6 +1209,27 @@ static void exit_robust_list(struct task
}
}
+static bool robust_list_clear_pending(unsigned long __user *pop)
+{
+ struct robust_list_head __user *head = current->futex.robust_list;
+
+ if (!put_user(0UL, pop))
+ return true;
+
+ /*
+ * Just give up. The robust list head is usually part of TLS, so the
+ * chance that this gets resolved is close to zero.
+ *
+ * If @pop is the robust_list_head::list_op_pending pointer then
+ * clear the robust list head pointer to prevent further damage when the
+ * task exits. Better a few stale futexes than corrupted memory. But
+ * that's mostly an academic exercise.
+ */
+ if (pop == (unsigned long __user *)&head->list_op_pending)
+ current->futex.robust_list = NULL;
+ return false;
+}
+
#ifdef CONFIG_COMPAT
static void __user *futex_uaddr(struct robust_list __user *entry,
compat_long_t futex_offset)
@@ -1305,6 +1326,21 @@ static void compat_exit_robust_list(stru
handle_futex_death(uaddr, curr, pend_mod, HANDLE_DEATH_PENDING);
}
}
+
+static bool compat_robust_list_clear_pending(u32 __user *pop)
+{
+ struct compat_robust_list_head __user *head = current->futex.compat_robust_list;
+
+ if (!put_user(0U, pop))
+ return true;
+
+ /* See comment in robust_list_clear_pending(). */
+ if (pop == &head->list_op_pending)
+ current->futex.compat_robust_list = NULL;
+ return false;
+}
+#else
+static bool compat_robust_list_clear_pending(u32 __user *pop) { return false; }
#endif
#ifdef CONFIG_FUTEX_PI
@@ -1398,6 +1434,19 @@ static void exit_pi_state_list(struct ta
static inline void exit_pi_state_list(struct task_struct *curr) { }
#endif
+bool futex_robust_list_clear_pending(void __user *pop, unsigned int flags)
+{
+ bool size32bit = !!(flags & FLAGS_ROBUST_LIST32);
+
+ if (!IS_ENABLED(CONFIG_64BIT) && !size32bit)
+ return false;
+
+ if (IS_ENABLED(CONFIG_64BIT) && size32bit)
+ return compat_robust_list_clear_pending(pop);
+
+ return robust_list_clear_pending(pop);
+}
+
static void futex_cleanup(struct task_struct *tsk)
{
if (unlikely(tsk->futex.robust_list)) {
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -40,6 +40,8 @@
#define FLAGS_NUMA 0x0080
#define FLAGS_STRICT 0x0100
#define FLAGS_MPOL 0x0200
+#define FLAGS_UNLOCK_ROBUST 0x0400
+#define FLAGS_ROBUST_LIST32 0x0800
/* FUTEX_ to FLAGS_ */
static inline unsigned int futex_to_flags(unsigned int op)
@@ -52,6 +54,12 @@ static inline unsigned int futex_to_flag
if (op & FUTEX_CLOCK_REALTIME)
flags |= FLAGS_CLOCKRT;
+ if (op & FUTEX_UNLOCK_ROBUST)
+ flags |= FLAGS_UNLOCK_ROBUST;
+
+ if (op & FUTEX_ROBUST_LIST32)
+ flags |= FLAGS_ROBUST_LIST32;
+
return flags;
}
@@ -438,13 +446,16 @@ extern int futex_unqueue_multiple(struct
extern int futex_wait_multiple(struct futex_vector *vs, unsigned int count,
struct hrtimer_sleeper *to);
-extern int futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset);
+extern int futex_wake(u32 __user *uaddr, unsigned int flags, void __user *pop,
+ int nr_wake, u32 bitset);
extern int futex_wake_op(u32 __user *uaddr1, unsigned int flags,
u32 __user *uaddr2, int nr_wake, int nr_wake2, int op);
-extern int futex_unlock_pi(u32 __user *uaddr, unsigned int flags);
+extern int futex_unlock_pi(u32 __user *uaddr, unsigned int flags, void __user *pop);
extern int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int trylock);
+bool futex_robust_list_clear_pending(void __user *pop, unsigned int flags);
+
#endif /* _FUTEX_H */
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -1129,7 +1129,7 @@ int futex_lock_pi(u32 __user *uaddr, uns
* This is the in-kernel slowpath: we look up the PI state (if any),
* and do the rt-mutex unlock.
*/
-int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
+static int __futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
{
u32 curval, uval, vpid = task_pid_vnr(current);
union futex_key key = FUTEX_KEY_INIT;
@@ -1138,7 +1138,6 @@ int futex_unlock_pi(u32 __user *uaddr, u
if (!IS_ENABLED(CONFIG_FUTEX_PI))
return -ENOSYS;
-
retry:
if (get_user(uval, uaddr))
return -EFAULT;
@@ -1292,3 +1291,15 @@ int futex_unlock_pi(u32 __user *uaddr, u
return ret;
}
+int futex_unlock_pi(u32 __user *uaddr, unsigned int flags, void __user *pop)
+{
+ int ret = __futex_unlock_pi(uaddr, flags);
+
+ if (ret || !(flags & FLAGS_UNLOCK_ROBUST))
+ return ret;
+
+ if (!futex_robust_list_clear_pending(pop, flags))
+ return -EFAULT;
+
+ return 0;
+}
--- a/kernel/futex/syscalls.c
+++ b/kernel/futex/syscalls.c
@@ -118,6 +118,13 @@ long do_futex(u32 __user *uaddr, int op,
return -ENOSYS;
}
+ if (flags & FLAGS_UNLOCK_ROBUST) {
+ if (cmd != FUTEX_WAKE &&
+ cmd != FUTEX_WAKE_BITSET &&
+ cmd != FUTEX_UNLOCK_PI)
+ return -ENOSYS;
+ }
+
switch (cmd) {
case FUTEX_WAIT:
val3 = FUTEX_BITSET_MATCH_ANY;
@@ -128,7 +135,7 @@ long do_futex(u32 __user *uaddr, int op,
val3 = FUTEX_BITSET_MATCH_ANY;
fallthrough;
case FUTEX_WAKE_BITSET:
- return futex_wake(uaddr, flags, val, val3);
+ return futex_wake(uaddr, flags, uaddr2, val, val3);
case FUTEX_REQUEUE:
return futex_requeue(uaddr, flags, uaddr2, flags, val, val2, NULL, 0);
case FUTEX_CMP_REQUEUE:
@@ -141,7 +148,7 @@ long do_futex(u32 __user *uaddr, int op,
case FUTEX_LOCK_PI2:
return futex_lock_pi(uaddr, flags, timeout, 0);
case FUTEX_UNLOCK_PI:
- return futex_unlock_pi(uaddr, flags);
+ return futex_unlock_pi(uaddr, flags, uaddr2);
case FUTEX_TRYLOCK_PI:
return futex_lock_pi(uaddr, flags, NULL, 1);
case FUTEX_WAIT_REQUEUE_PI:
@@ -375,7 +382,7 @@ SYSCALL_DEFINE4(futex_wake,
if (!futex_validate_input(flags, mask))
return -EINVAL;
- return futex_wake(uaddr, FLAGS_STRICT | flags, nr, mask);
+ return futex_wake(uaddr, FLAGS_STRICT | flags, NULL, nr, mask);
}
/*
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -150,12 +150,35 @@ void futex_wake_mark(struct wake_q_head
}
/*
+ * If requested, clear the robust list pending op and unlock the futex
+ */
+static bool futex_robust_unlock(u32 __user *uaddr, unsigned int flags, void __user *pop)
+{
+ if (!(flags & FLAGS_UNLOCK_ROBUST))
+ return true;
+
+ /* First unlock the futex, which requires release semantics. */
+ scoped_user_write_access(uaddr, efault)
+ unsafe_atomic_store_release_user(0, uaddr, efault);
+
+ /*
+ * Clear the pending list op now. If that fails, then the task is in
+ * deeper trouble as the robust list head is usually part of the TLS.
+ * The chance of survival is close to zero.
+ */
+ return futex_robust_list_clear_pending(pop, flags);
+
+efault:
+ return false;
+}
+
+/*
* Wake up waiters matching bitset queued on this futex (uaddr).
*/
-int futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
+int futex_wake(u32 __user *uaddr, unsigned int flags, void __user *pop, int nr_wake, u32 bitset)
{
- struct futex_q *this, *next;
union futex_key key = FUTEX_KEY_INIT;
+ struct futex_q *this, *next;
DEFINE_WAKE_Q(wake_q);
int ret;
@@ -166,6 +189,9 @@ int futex_wake(u32 __user *uaddr, unsign
if (unlikely(ret != 0))
return ret;
+ if (!futex_robust_unlock(uaddr, flags, pop))
+ return -EFAULT;
+
if ((flags & FLAGS_STRICT) && !nr_wake)
return 0;
There will be a VDSO function to unlock robust futexes in user space. The
unlock sequence is racy vs. clearing the list_op_pending pointer in the
task's robust list head. To plug this race the kernel needs to know the
instruction window. As the VDSO is per MM the addresses are stored in
mm_struct::futex.
Architectures which implement support for this have to update these
addresses when the VDSO is (re)mapped and indicate the pending op pointer
size which matches the IP range.
Arguably this could be resolved by chasing mm->context->vdso->image, but
that's architecture specific and requires touching quite a few cache
lines. Having it in mm::futex reduces the cache line impact and avoids
having yet another set of architecture specific functionality.
To support multi-size robust list applications (gaming) this provides two
ranges when COMPAT is enabled.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
V3: Make the number of ranges depend on COMPAT - Peter
V2: Store ranges in a struct with size information and allow up to two ranges.
---
include/linux/futex.h | 22 ++++++++++++++++++---
include/linux/futex_types.h | 28 ++++++++++++++++++++++++++
include/linux/mm_types.h | 1
init/Kconfig | 6 +++++
kernel/futex/core.c | 46 ++++++++++++++++++++++++++++++++++----------
5 files changed, 90 insertions(+), 13 deletions(-)
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -81,11 +81,9 @@ int futex_hash_prctl(unsigned long arg2,
#ifdef CONFIG_FUTEX_PRIVATE_HASH
int futex_hash_allocate_default(void);
void futex_hash_free(struct mm_struct *mm);
-void futex_mm_init(struct mm_struct *mm);
#else /* CONFIG_FUTEX_PRIVATE_HASH */
static inline int futex_hash_allocate_default(void) { return 0; }
static inline int futex_hash_free(struct mm_struct *mm) { return 0; }
-static inline void futex_mm_init(struct mm_struct *mm) { }
#endif /* !CONFIG_FUTEX_PRIVATE_HASH */
#else /* CONFIG_FUTEX */
@@ -104,7 +102,25 @@ static inline int futex_hash_prctl(unsig
}
static inline int futex_hash_allocate_default(void) { return 0; }
static inline int futex_hash_free(struct mm_struct *mm) { return 0; }
-static inline void futex_mm_init(struct mm_struct *mm) { }
#endif /* !CONFIG_FUTEX */
+#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
+void futex_reset_cs_ranges(struct futex_mm_data *fd);
+
+static inline void futex_set_vdso_cs_range(struct futex_mm_data *fd, unsigned int idx,
+ unsigned long vdso, unsigned long start,
+ unsigned long end, bool sz32)
+{
+ fd->unlock.cs_ranges[idx].start_ip = vdso + start;
+ fd->unlock.cs_ranges[idx].len = end - start;
+ fd->unlock.cs_ranges[idx].pop_size32 = sz32;
+}
+#endif /* CONFIG_FUTEX_ROBUST_UNLOCK */
+
+#if defined(CONFIG_FUTEX_PRIVATE_HASH) || defined(CONFIG_FUTEX_ROBUST_UNLOCK)
+void futex_mm_init(struct mm_struct *mm);
+#else
+static inline void futex_mm_init(struct mm_struct *mm) { }
+#endif
+
#endif /* _LINUX_FUTEX_H */
--- a/include/linux/futex_types.h
+++ b/include/linux/futex_types.h
@@ -54,12 +54,40 @@ struct futex_mm_phash {
struct futex_mm_phash { };
#endif /* !CONFIG_FUTEX_PRIVATE_HASH */
+#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
+/**
+ * struct futex_unlock_cs_range - Range for the VDSO unlock critical section
+ * @start_ip: The start IP of the robust futex unlock critical section (inclusive)
+ * @len: The length of the robust futex unlock critical section
+ * @pop_size32: Pending OP pointer size indicator. 0 == 64-bit, 1 == 32-bit
+ */
+struct futex_unlock_cs_range {
+ unsigned long start_ip;
+ unsigned int len;
+ unsigned int pop_size32;
+};
+
+#define FUTEX_ROBUST_MAX_CS_RANGES (1 + IS_ENABLED(CONFIG_COMPAT))
+
+/**
+ * struct futex_unlock_cs_ranges - Futex unlock VDSO critical sections
+ * @cs_ranges: Array of critical section ranges
+ */
+struct futex_unlock_cs_ranges {
+ struct futex_unlock_cs_range cs_ranges[FUTEX_ROBUST_MAX_CS_RANGES];
+};
+#else /* CONFIG_FUTEX_ROBUST_UNLOCK */
+struct futex_unlock_cs_ranges { };
+#endif /* !CONFIG_FUTEX_ROBUST_UNLOCK */
+
/**
* struct futex_mm_data - Futex related per MM data
* @phash: Futex private hash related data
+ * @unlock: Futex unlock VDSO critical sections
*/
struct futex_mm_data {
struct futex_mm_phash phash;
+ struct futex_unlock_cs_ranges unlock;
};
#else /* CONFIG_FUTEX */
struct futex_sched_data { };
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -22,6 +22,7 @@
#include <linux/types.h>
#include <linux/rseq_types.h>
#include <linux/bitmap.h>
+#include <linux/futex_types.h>
#include <asm/mmu.h>
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1822,6 +1822,12 @@ config FUTEX_MPOL
depends on FUTEX && NUMA
default y
+config HAVE_FUTEX_ROBUST_UNLOCK
+ bool
+
+config FUTEX_ROBUST_UNLOCK
+ def_bool FUTEX && HAVE_GENERIC_VDSO && GENERIC_IRQ_ENTRY && RSEQ && HAVE_FUTEX_ROBUST_UNLOCK
+
config EPOLL
bool "Enable eventpoll support" if EXPERT
default y
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1758,11 +1758,11 @@ static bool futex_ref_is_dead(struct fut
return atomic_long_read(&mm->futex.phash.atomic) == 0;
}
-void futex_mm_init(struct mm_struct *mm)
+static void futex_hash_init_mm(struct futex_mm_data *fd)
{
- memset(&mm->futex, 0, sizeof(mm->futex));
- mutex_init(&mm->futex.phash.lock);
- mm->futex.phash.batches = get_state_synchronize_rcu();
+ memset(&fd->phash, 0, sizeof(fd->phash));
+ mutex_init(&fd->phash.lock);
+ fd->phash.batches = get_state_synchronize_rcu();
}
void futex_hash_free(struct mm_struct *mm)
@@ -1966,20 +1966,46 @@ static int futex_hash_get_slots(void)
return fph->hash_mask + 1;
return 0;
}
+#else /* CONFIG_FUTEX_PRIVATE_HASH */
+static inline int futex_hash_allocate(unsigned int hslots, unsigned int flags) { return -EINVAL; }
+static inline int futex_hash_get_slots(void) { return 0; }
+static inline void futex_hash_init_mm(struct futex_mm_data *fd) { }
+#endif /* !CONFIG_FUTEX_PRIVATE_HASH */
-#else
+#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
+static void futex_invalidate_cs_ranges(struct futex_mm_data *fd)
+{
+ /*
+ * Invalidate start_ip so that the quick check fails for ip >= start_ip
+ * if VDSO is not mapped or the second slot is not available for compat
+ * tasks as they use VDSO32 which does not provide the 64-bit pointer
+ * variant.
+ */
+ for (int i = 0; i < FUTEX_ROBUST_MAX_CS_RANGES; i++)
+ fd->unlock.cs_ranges[i].start_ip = ~0UL;
+}
-static int futex_hash_allocate(unsigned int hash_slots, unsigned int flags)
+void futex_reset_cs_ranges(struct futex_mm_data *fd)
{
- return -EINVAL;
+ memset(fd->unlock.cs_ranges, 0, sizeof(fd->unlock.cs_ranges));
+ futex_invalidate_cs_ranges(fd);
}
-static int futex_hash_get_slots(void)
+static void futex_robust_unlock_init_mm(struct futex_mm_data *fd)
{
- return 0;
+ /* mm_dup() preserves the range, mm_alloc() clears it */
+ if (!fd->unlock.cs_ranges[0].start_ip)
+ futex_invalidate_cs_ranges(fd);
}
+#else /* CONFIG_FUTEX_ROBUST_UNLOCK */
+static inline void futex_robust_unlock_init_mm(struct futex_mm_data *fd) { }
+#endif /* !CONFIG_FUTEX_ROBUST_UNLOCK */
-#endif
+void futex_mm_init(struct mm_struct *mm)
+{
+ futex_hash_init_mm(&mm->futex);
+ futex_robust_unlock_init_mm(&mm->futex);
+}
int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4)
{
When the FUTEX_ROBUST_UNLOCK mechanism is used for unlocking (PI-)futexes,
then the unlock sequence in user space looks like this:
1) robust_list_set_op_pending(mutex);
2) robust_list_remove(mutex);
lval = gettid();
3) if (atomic_try_cmpxchg(&mutex->lock, lval, 0))
4) robust_list_clear_op_pending();
else
5) sys_futex(OP | FUTEX_ROBUST_UNLOCK, ....);
That still leaves a minimal race window between #3 and #4 where the mutex
could be acquired by some other task, which observes that it is the last
user and:
1) unmaps the mutex memory
2) maps a different file, which ends up covering the same address
If the original task then exits before reaching #5, the kernel robust
list handling observes the pending op entry and tries to fix up user space.
In case the newly mapped data contains the TID of the exiting thread
at the address of the mutex/futex the kernel will set the owner died bit in
that memory and therefore corrupt unrelated data.
On X86 this boils down to this simplified assembly sequence:
mov %esi,%eax // Load TID into EAX
xor %ecx,%ecx // Set ECX to 0
#3 lock cmpxchg %ecx,(%rdi) // Try the TID -> 0 transition
.Lstart:
jnz .Lend
#4 movq %rcx,(%rdx) // Clear list_op_pending
.Lend:
If the cmpxchg() succeeds and the task is interrupted before it can clear
list_op_pending in the robust list head (#4) and the task crashes in a
signal handler or gets killed then it ends up in do_exit() and subsequently
in the robust list handling, which then might run into the unmap/map issue
described above.
This is only relevant when user space was interrupted and a signal is
pending. The fix-up has to be done before signal delivery is attempted
because:
1) The signal might be fatal so get_signal() ends up in do_exit()
2) The signal handler might crash or the task is killed before returning
from the handler. At that point the instruction pointer in pt_regs is
no longer the instruction pointer of the initially interrupted unlock
sequence.
The right place to handle this is in __exit_to_user_mode_loop() before
invoking arch_do_signal_or_restart() as this obviously covers both
scenarios.
As this is only relevant when the task was interrupted in user space, this
is tied to RSEQ and the generic entry code as RSEQ keeps track of user
space interrupts unconditionally even if the task does not have a RSEQ
region installed. That makes the decision very lightweight:
if (current->rseq.user_irq && within(regs, csr->unlock_ip_range))
futex_fixup_robust_unlock(regs, csr);
futex_fixup_robust_unlock() then invokes an architecture specific function
to return the pending op pointer or NULL. The function evaluates the
register content to decide whether the pending ops pointer in the robust
list head needs to be cleared.
Assuming the above unlock sequence, then on x86 this decision is the
trivial evaluation of the zero flag:
return regs->eflags & X86_EFLAGS_ZF ? regs->dx : NULL;
Other architectures might need to do more complex evaluations due to LLSC,
but the approach is valid in general. The size of the pointer is determined
from the matching range struct, which covers both 32-bit and 64-bit builds
including COMPAT.
The unlock sequence is going to be placed in the VDSO so that the kernel
can keep everything synchronized, especially the register usage. The
resulting code sequence for user space is:
if (__vdso_futex_robust_list$SZ_try_unlock(lock, tid, &pending_op) != tid)
err = sys_futex($OP | FUTEX_ROBUST_UNLOCK,....);
Both the VDSO unlock and the kernel side unlock ensure that the pending_op
pointer is always cleared when the lock becomes unlocked.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
V3: Fixup conversion leftover which was lost on the devel machine
V2: Convert to the struct range storage and simplify the fixup logic
---
include/linux/futex.h | 39 ++++++++++++++++++++++++++++++++++++-
include/vdso/futex.h | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++
kernel/entry/common.c | 9 +++++---
kernel/futex/core.c | 18 +++++++++++++++++
4 files changed, 114 insertions(+), 4 deletions(-)
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -105,7 +105,41 @@ static inline int futex_hash_free(struct
#endif /* !CONFIG_FUTEX */
#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
+#include <asm/futex_robust.h>
+
void futex_reset_cs_ranges(struct futex_mm_data *fd);
+void __futex_fixup_robust_unlock(struct pt_regs *regs, struct futex_unlock_cs_range *csr);
+
+static inline bool futex_within_robust_unlock(struct pt_regs *regs,
+ struct futex_unlock_cs_range *csr)
+{
+ unsigned long ip = instruction_pointer(regs);
+
+ return ip >= csr->start_ip && ip < csr->start_ip + csr->len;
+}
+
+static inline void futex_fixup_robust_unlock(struct pt_regs *regs)
+{
+ struct futex_unlock_cs_range *csr;
+
+ /*
+ * Avoid dereferencing current->mm if not returning from interrupt.
+ * current->rseq.event is going to be used subsequently, so bringing the
+ * cache line in is not a big deal.
+ */
+ if (!current->rseq.event.user_irq)
+ return;
+
+ csr = current->mm->futex.unlock.cs_ranges;
+
+ /* The loop is optimized out for !COMPAT */
+ for (int r = 0; r < FUTEX_ROBUST_MAX_CS_RANGES; r++, csr++) {
+ if (unlikely(futex_within_robust_unlock(regs, csr))) {
+ __futex_fixup_robust_unlock(regs, csr);
+ return;
+ }
+ }
+}
static inline void futex_set_vdso_cs_range(struct futex_mm_data *fd, unsigned int idx,
unsigned long vdso, unsigned long start,
@@ -115,7 +149,10 @@ static inline void futex_set_vdso_cs_ran
fd->unlock.cs_ranges[idx].len = end - start;
fd->unlock.cs_ranges[idx].pop_size32 = sz32;
}
-#endif /* CONFIG_FUTEX_ROBUST_UNLOCK */
+#else /* CONFIG_FUTEX_ROBUST_UNLOCK */
+static inline void futex_fixup_robust_unlock(struct pt_regs *regs) { }
+#endif /* !CONFIG_FUTEX_ROBUST_UNLOCK */
+
#if defined(CONFIG_FUTEX_PRIVATE_HASH) || defined(CONFIG_FUTEX_ROBUST_UNLOCK)
void futex_mm_init(struct mm_struct *mm);
--- /dev/null
+++ b/include/vdso/futex.h
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _VDSO_FUTEX_H
+#define _VDSO_FUTEX_H
+
+#include <uapi/linux/types.h>
+
+/**
+ * __vdso_futex_robust_list64_try_unlock - Try to unlock an uncontended robust futex
+ * with a 64-bit pending op pointer
+ * @lock: Pointer to the futex lock object
+ * @tid: The TID of the calling task
+ * @pop: Pointer to the task's robust_list_head::list_op_pending
+ *
+ * Return: The content of *@lock. On success this is the same as @tid.
+ *
+ * The function implements:
+ * if (atomic_try_cmpxchg(lock, &tid, 0))
+ * *pop = NULL;
+ * return tid;
+ *
+ * There is a race between a successful unlock and clearing the pending op
+ * pointer in the robust list head. If the calling task is interrupted in the
+ * race window and has to handle a (fatal) signal on return to user space then
+ * the kernel handles the clearing of @pop before attempting to deliver
+ * the signal. That ensures that a task cannot exit with a potentially invalid
+ * pending op pointer.
+ *
+ * User space uses it in the following way:
+ *
+ * if (__vdso_futex_robust_list64_try_unlock(lock, tid, &pending_op) != tid)
+ * err = sys_futex($OP | FUTEX_ROBUST_UNLOCK,....);
+ *
+ * If the unlock attempt fails due to the FUTEX_WAITERS bit set in the lock,
+ * then the syscall does the unlock, clears the pending op pointer and wakes the
+ * requested number of waiters.
+ */
+__u32 __vdso_futex_robust_list64_try_unlock(__u32 *lock, __u32 tid, __u64 *pop);
+
+/**
+ * __vdso_futex_robust_list32_try_unlock - Try to unlock an uncontended robust futex
+ * with a 32-bit pending op pointer
+ * @lock: Pointer to the futex lock object
+ * @tid: The TID of the calling task
+ * @pop: Pointer to the task's robust_list_head::list_op_pending
+ *
+ * Return: The content of *@lock. On success this is the same as @tid.
+ *
+ * Same as __vdso_futex_robust_list64_try_unlock() just with a 32-bit @pop pointer.
+ */
+__u32 __vdso_futex_robust_list32_try_unlock(__u32 *lock, __u32 tid, __u32 *pop);
+
+#endif
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -1,11 +1,12 @@
// SPDX-License-Identifier: GPL-2.0
-#include <linux/irq-entry-common.h>
-#include <linux/resume_user_mode.h>
+#include <linux/futex.h>
#include <linux/highmem.h>
+#include <linux/irq-entry-common.h>
#include <linux/jump_label.h>
#include <linux/kmsan.h>
#include <linux/livepatch.h>
+#include <linux/resume_user_mode.h>
#include <linux/tick.h>
/* Workaround to allow gradual conversion of architecture code */
@@ -60,8 +61,10 @@ static __always_inline unsigned long __e
if (ti_work & _TIF_PATCH_PENDING)
klp_update_patch_state(current);
- if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
+ if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL)) {
+ futex_fixup_robust_unlock(regs);
arch_do_signal_or_restart(regs);
+ }
if (ti_work & _TIF_NOTIFY_RESUME)
resume_user_mode_work(regs);
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -46,6 +46,8 @@
#include <linux/slab.h>
#include <linux/vmalloc.h>
+#include <vdso/futex.h>
+
#include "futex.h"
#include "../locking/rtmutex_common.h"
@@ -1447,6 +1449,22 @@ bool futex_robust_list_clear_pending(voi
return robust_list_clear_pending(pop);
}
+#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
+void __futex_fixup_robust_unlock(struct pt_regs *regs, struct futex_unlock_cs_range *csr)
+{
+ /*
+ * arch_futex_robust_unlock_get_pop() returns the list pending op pointer from
+ * @regs if the try_cmpxchg() succeeded.
+ */
+ void __user *pop = arch_futex_robust_unlock_get_pop(regs);
+
+ if (!pop)
+ return;
+
+ futex_robust_list_clear_pending(pop, csr->pop_size32 ? FLAGS_ROBUST_LIST32 : 0);
+}
+#endif /* CONFIG_FUTEX_ROBUST_UNLOCK */
+
static void futex_cleanup(struct task_struct *tsk)
{
if (unlikely(tsk->futex.robust_list)) {
There will be a VDSO function to unlock non-contended robust futexes in
user space. The unlock sequence is racy vs. clearing the list_op_pending
pointer in the task's robust list head. To plug this race the kernel needs
to know the critical section window so it can clear the pointer when the
task is interrupted within that race window. The window is determined by
labels in the inline assembly.
Add these symbols to the vdso2c generator and use them in the VDSO VMA code
to update the critical section addresses in mm_struct::futex on (re)map().
The symbols are not exported to user space, but available in the debug
version of the vDSO.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
V3: Rename the symbols once more
V2: Rename the symbols
---
arch/x86/entry/vdso/vma.c | 29 +++++++++++++++++++++++++++++
arch/x86/include/asm/vdso.h | 4 ++++
arch/x86/tools/vdso2c.c | 18 +++++++++++-------
3 files changed, 44 insertions(+), 7 deletions(-)
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -6,6 +6,7 @@
*/
#include <linux/mm.h>
#include <linux/err.h>
+#include <linux/futex.h>
#include <linux/sched.h>
#include <linux/sched/task_stack.h>
#include <linux/slab.h>
@@ -73,6 +74,31 @@ static void vdso_fix_landing(const struc
regs->ip = new_vma->vm_start + ipoffset;
}
+#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
+static void vdso_futex_robust_unlock_update_ips(void)
+{
+ const struct vdso_image *image = current->mm->context.vdso_image;
+ unsigned long vdso = (unsigned long) current->mm->context.vdso;
+ struct futex_mm_data *fd = &current->mm->futex;
+ unsigned int idx = 0;
+
+ futex_reset_cs_ranges(fd);
+
+#ifdef CONFIG_X86_64
+ futex_set_vdso_cs_range(fd, idx, vdso, image->sym___futex_list64_try_unlock_cs_start,
+ image->sym___futex_list64_try_unlock_cs_end, false);
+ idx++;
+#endif /* CONFIG_X86_64 */
+
+#if defined(CONFIG_X86_32) || defined(CONFIG_COMPAT)
+ futex_set_vdso_cs_range(fd, idx, vdso, image->sym___futex_list32_try_unlock_cs_start,
+ image->sym___futex_list32_try_unlock_cs_end, true);
+#endif /* CONFIG_X86_32 || CONFIG_COMPAT */
+}
+#else
+static inline void vdso_futex_robust_unlock_update_ips(void) { }
+#endif
+
static int vdso_mremap(const struct vm_special_mapping *sm,
struct vm_area_struct *new_vma)
{
@@ -80,6 +106,7 @@ static int vdso_mremap(const struct vm_s
vdso_fix_landing(image, new_vma);
current->mm->context.vdso = (void __user *)new_vma->vm_start;
+ vdso_futex_robust_unlock_update_ips();
return 0;
}
@@ -189,6 +216,8 @@ static int map_vdso(const struct vdso_im
current->mm->context.vdso = (void __user *)text_start;
current->mm->context.vdso_image = image;
+ vdso_futex_robust_unlock_update_ips();
+
up_fail:
mmap_write_unlock(mm);
return ret;
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -25,6 +25,10 @@ struct vdso_image {
long sym_int80_landing_pad;
long sym_vdso32_sigreturn_landing_pad;
long sym_vdso32_rt_sigreturn_landing_pad;
+ long sym___futex_list64_try_unlock_cs_start;
+ long sym___futex_list64_try_unlock_cs_end;
+ long sym___futex_list32_try_unlock_cs_start;
+ long sym___futex_list32_try_unlock_cs_end;
};
extern const struct vdso_image vdso64_image;
--- a/arch/x86/tools/vdso2c.c
+++ b/arch/x86/tools/vdso2c.c
@@ -75,13 +75,17 @@ struct vdso_sym {
};
struct vdso_sym required_syms[] = {
- {"VDSO32_NOTE_MASK", true},
- {"__kernel_vsyscall", true},
- {"__kernel_sigreturn", true},
- {"__kernel_rt_sigreturn", true},
- {"int80_landing_pad", true},
- {"vdso32_rt_sigreturn_landing_pad", true},
- {"vdso32_sigreturn_landing_pad", true},
+ {"VDSO32_NOTE_MASK", true},
+ {"__kernel_vsyscall", true},
+ {"__kernel_sigreturn", true},
+ {"__kernel_rt_sigreturn", true},
+ {"int80_landing_pad", true},
+ {"vdso32_rt_sigreturn_landing_pad", true},
+ {"vdso32_sigreturn_landing_pad", true},
+ {"__futex_list64_try_unlock_cs_start", true},
+ {"__futex_list64_try_unlock_cs_end", true},
+ {"__futex_list32_try_unlock_cs_start", true},
+ {"__futex_list32_try_unlock_cs_end", true},
};
__attribute__((format(printf, 1, 2))) __attribute__((noreturn))
When the FUTEX_ROBUST_UNLOCK mechanism is used for unlocking (PI-)futexes,
then the unlock sequence in userspace looks like this:
1) robust_list_set_op_pending(mutex);
2) robust_list_remove(mutex);
lval = gettid();
3) if (atomic_try_cmpxchg(&mutex->lock, lval, 0))
4) robust_list_clear_op_pending();
else
5) sys_futex(OP | FUTEX_ROBUST_UNLOCK, ....);
That still leaves a minimal race window between #3 and #4 where the mutex
could be acquired by some other task which observes that it is the last
user and:
1) unmaps the mutex memory
2) maps a different file, which ends up covering the same address
If the original task then exits before reaching #5, the kernel robust
list handling observes the pending op entry and tries to fix up user space.
In case the newly mapped data contains the TID of the exiting thread
at the address of the mutex/futex the kernel will set the owner died bit in
that memory and therefore corrupt unrelated data.
Provide a VDSO function which exposes the critical section window in the
VDSO symbol table. The resulting addresses are updated in the task's mm
when the VDSO is (re)map()'ed.
The core code detects when a task was interrupted within the critical
section and is about to deliver a signal. It then invokes an architecture
specific function which determines whether the pending op pointer has to be
cleared or not. The unlock assembly sequence on 64-bit is:
mov %esi,%eax // Load TID into EAX
xor %ecx,%ecx // Set ECX to 0
lock cmpxchg %ecx,(%rdi) // Try the TID -> 0 transition
.Lstart:
jnz .Lend
movq %rcx,(%rdx) // Clear list_op_pending
.Lend:
ret
So the decision can be simply based on the ZF state in regs->flags. The
pending op pointer is always in DX independent of the build mode
(32/64-bit) to make the pending op pointer retrieval uniform. The size of
the pointer is stored in the matching critical section range struct and
the core code retrieves it from there. So the pointer retrieval function
does not have to care. It is bit-size independent:
return regs->flags & X86_EFLAGS_ZF ? regs->dx : NULL;
There are two entry points to handle the different robust list pending op
pointer size:
__vdso_futex_robust_list64_try_unlock()
__vdso_futex_robust_list32_try_unlock()
The 32-bit VDSO provides only __vdso_futex_robust_list32_try_unlock().
The 64-bit VDSO always provides __vdso_futex_robust_list64_try_unlock() and,
when COMPAT is enabled, also the list32 variant, which is required to
support multi-size robust list pointers used by gaming emulators.
The unlock function is inspired by an idea from Mathieu Desnoyers.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Acked-by: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/20260311185409.1988269-1-mathieu.desnoyers@efficios.com
---
V3: Use 'r' for the zero register - Uros
V2: Provide different entry points - Florian
Use __u32 and __x86_64__ - Thomas
Use private labels - Thomas
Optimize assembly - Uros
Split the functions up now that ranges are supported in the core and
document the actual assembly.
---
arch/x86/Kconfig | 1
arch/x86/entry/vdso/common/vfutex.c | 71 +++++++++++++++++++++++++++++++
arch/x86/entry/vdso/vdso32/Makefile | 5 +-
arch/x86/entry/vdso/vdso32/vdso32.lds.S | 3 +
arch/x86/entry/vdso/vdso32/vfutex.c | 1
arch/x86/entry/vdso/vdso64/Makefile | 7 +--
arch/x86/entry/vdso/vdso64/vdso64.lds.S | 7 +++
arch/x86/entry/vdso/vdso64/vdsox32.lds.S | 7 +++
arch/x86/entry/vdso/vdso64/vfutex.c | 1
arch/x86/include/asm/futex_robust.h | 19 ++++++++
10 files changed, 117 insertions(+), 5 deletions(-)
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -238,6 +238,7 @@ config X86
select HAVE_EFFICIENT_UNALIGNED_ACCESS
select HAVE_EISA if X86_32
select HAVE_EXIT_THREAD
+ select HAVE_FUTEX_ROBUST_UNLOCK
select HAVE_GENERIC_TIF_BITS
select HAVE_GUP_FAST
select HAVE_FENTRY if X86_64 || DYNAMIC_FTRACE
--- /dev/null
+++ b/arch/x86/entry/vdso/common/vfutex.c
@@ -0,0 +1,71 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <vdso/futex.h>
+
+/*
+ * Assembly template for the try unlock functions. The basic functionality is:
+ *
+ * mov %esi, %eax Move the TID into EAX
+ * xor %ecx, %ecx Clear ECX
+ * lock cmpxchgl %ecx, (%rdi) Attempt the TID -> 0 transition
+ * .Lcs_start: Start of the critical section
+ * jnz .Lcs_end If cmpxchgl failed jump to the end
+ * .Lcs_success: Start of the success section
+ * movq %rcx, (%rdx) Set the pending op pointer to 0
+ * .Lcs_end: End of the critical section
+ *
+ * .Lcs_start and .Lcs_end establish the critical section range. .Lcs_success is
+ * technically not required, but there for illustration, debugging and testing.
+ *
+ * When CONFIG_COMPAT is enabled then the 64-bit VDSO provides two functions.
+ * One for the regular 64-bit sized pending operation pointer and one for a
+ * 32-bit sized pointer to support gaming emulators.
+ *
+ * The 32-bit VDSO provides only the one for 32-bit sized pointers.
+ */
+#define __stringify_1(x...) #x
+#define __stringify(x...) __stringify_1(x)
+
+#define LABEL(prefix, which) __stringify(prefix##_try_unlock_cs_##which:)
+
+#define JNZ_END(prefix) "jnz " __stringify(prefix) "_try_unlock_cs_end\n"
+
+#define CLEAR_POPQ "movq %[zero], %a[pop]\n"
+#define CLEAR_POPL "movl %k[zero], %a[pop]\n"
+
+#define futex_robust_try_unlock(prefix, clear_pop, __lock, __tid, __pop) \
+({ \
+ asm volatile ( \
+ " \n" \
+ " lock cmpxchgl %k[zero], %a[lock] \n" \
+ " \n" \
+ LABEL(prefix, start) \
+ " \n" \
+ JNZ_END(prefix) \
+ " \n" \
+ LABEL(prefix, success) \
+ " \n" \
+ clear_pop \
+ " \n" \
+ LABEL(prefix, end) \
+ : [tid] "+&a" (__tid) \
+ : [lock] "D" (__lock), \
+ [pop] "d" (__pop), \
+ [zero] "r" (0UL) \
+ : "memory" \
+ ); \
+ __tid; \
+})
+
+#ifdef __x86_64__
+__u32 __vdso_futex_robust_list64_try_unlock(__u32 *lock, __u32 tid, __u64 *pop)
+{
+ return futex_robust_try_unlock(__futex_list64, CLEAR_POPQ, lock, tid, pop);
+}
+#endif /* __x86_64__ */
+
+#if defined(CONFIG_X86_32) || defined(CONFIG_COMPAT)
+__u32 __vdso_futex_robust_list32_try_unlock(__u32 *lock, __u32 tid, __u32 *pop)
+{
+ return futex_robust_try_unlock(__futex_list32, CLEAR_POPL, lock, tid, pop);
+}
+#endif /* CONFIG_X86_32 || CONFIG_COMPAT */
--- a/arch/x86/entry/vdso/vdso32/Makefile
+++ b/arch/x86/entry/vdso/vdso32/Makefile
@@ -7,8 +7,9 @@
vdsos-y := 32
# Files to link into the vDSO:
-vobjs-y := note.o vclock_gettime.o vgetcpu.o
-vobjs-y += system_call.o sigreturn.o
+vobjs-y := note.o vclock_gettime.o vgetcpu.o
+vobjs-y += system_call.o sigreturn.o
+vobjs-$(CONFIG_FUTEX_ROBUST_UNLOCK) += vfutex.o
# Compilation flags
flags-y := -DBUILD_VDSO32 -m32 -mregparm=0
--- a/arch/x86/entry/vdso/vdso32/vdso32.lds.S
+++ b/arch/x86/entry/vdso/vdso32/vdso32.lds.S
@@ -30,6 +30,9 @@ VERSION
__vdso_clock_gettime64;
__vdso_clock_getres_time64;
__vdso_getcpu;
+#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
+ __vdso_futex_robust_list32_try_unlock;
+#endif
};
LINUX_2.5 {
--- /dev/null
+++ b/arch/x86/entry/vdso/vdso32/vfutex.c
@@ -0,0 +1 @@
+#include "common/vfutex.c"
--- a/arch/x86/entry/vdso/vdso64/Makefile
+++ b/arch/x86/entry/vdso/vdso64/Makefile
@@ -8,9 +8,10 @@ vdsos-y := 64
vdsos-$(CONFIG_X86_X32_ABI) += x32
# Files to link into the vDSO:
-vobjs-y := note.o vclock_gettime.o vgetcpu.o
-vobjs-y += vgetrandom.o vgetrandom-chacha.o
-vobjs-$(CONFIG_X86_SGX) += vsgx.o
+vobjs-y := note.o vclock_gettime.o vgetcpu.o
+vobjs-y += vgetrandom.o vgetrandom-chacha.o
+vobjs-$(CONFIG_X86_SGX) += vsgx.o
+vobjs-$(CONFIG_FUTEX_ROBUST_UNLOCK) += vfutex.o
# Compilation flags
flags-y := -DBUILD_VDSO64 -m64 -mcmodel=small
--- a/arch/x86/entry/vdso/vdso64/vdso64.lds.S
+++ b/arch/x86/entry/vdso/vdso64/vdso64.lds.S
@@ -32,6 +32,13 @@ VERSION {
#endif
getrandom;
__vdso_getrandom;
+
+#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
+ __vdso_futex_robust_list64_try_unlock;
+#ifdef CONFIG_COMPAT
+ __vdso_futex_robust_list32_try_unlock;
+#endif
+#endif
local: *;
};
}
--- a/arch/x86/entry/vdso/vdso64/vdsox32.lds.S
+++ b/arch/x86/entry/vdso/vdso64/vdsox32.lds.S
@@ -22,6 +22,13 @@ VERSION {
__vdso_getcpu;
__vdso_time;
__vdso_clock_getres;
+
+#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
+ __vdso_futex_robust_list64_try_unlock;
+#ifdef CONFIG_COMPAT
+ __vdso_futex_robust_list32_try_unlock;
+#endif
+#endif
local: *;
};
}
--- /dev/null
+++ b/arch/x86/entry/vdso/vdso64/vfutex.c
@@ -0,0 +1 @@
+#include "common/vfutex.c"
--- /dev/null
+++ b/arch/x86/include/asm/futex_robust.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_FUTEX_ROBUST_H
+#define _ASM_X86_FUTEX_ROBUST_H
+
+#include <asm/ptrace.h>
+
+static __always_inline void __user *x86_futex_robust_unlock_get_pop(struct pt_regs *regs)
+{
+ /*
+ * If ZF is set then the cmpxchg succeeded and the pending op pointer
+ * needs to be cleared.
+ */
+ return regs->flags & X86_EFLAGS_ZF ? (void __user *)regs->dx : NULL;
+}
+
+#define arch_futex_robust_unlock_get_pop(regs) \
+ x86_futex_robust_unlock_get_pop(regs)
+
+#endif /* _ASM_X86_FUTEX_ROBUST_H */
From: André Almeida <andrealmeid@igalia.com>
Add a note to the documentation giving a brief explanation of why doing a
robust futex release in userspace is racy and what should be done to avoid
it, and provide links for further reading.
Signed-off-by: André Almeida <andrealmeid@igalia.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260326-tonyk-vdso_test-v1-1-30a6f78c8bc3@igalia.com
---
Documentation/locking/robust-futex-ABI.rst | 44 +++++++++++++++++++++++++++++
1 file changed, 44 insertions(+)
--- a/Documentation/locking/robust-futex-ABI.rst
+++ b/Documentation/locking/robust-futex-ABI.rst
@@ -153,6 +153,9 @@ manipulating this list), the user code m
3) release the futex lock, and
4) clear the 'lock_op_pending' word.
+Please note that the removal of a robust futex purely in userspace is
+racy. Refer to the section below for details and how to avoid it.
+
On exit, the kernel will consider the address stored in
'list_op_pending' and the address of each 'lock word' found by walking
the list starting at 'head'. For each such address, if the bottom 30
@@ -182,3 +185,44 @@ The kernel exit code will silently stop
When the kernel sees a list entry whose 'lock word' doesn't have the
current threads TID in the lower 30 bits, it does nothing with that
entry, and goes on to the next entry.
+
+Robust release is racy
+----------------------
+
+The removal of a robust futex from the list is racy when done solely in
+userspace. Quoting Thomas Gleixner for the explanation:
+
+ The robust futex unlock mechanism is racy in respect to the clearing of the
+ robust_list_head::list_op_pending pointer because unlock and clearing the
+ pointer are not atomic. The race window is between the unlock and clearing
+ the pending op pointer. If the task is forced to exit in this window, exit
+ will access a potentially invalid pending op pointer when cleaning up the
+ robust list. That happens if another task manages to unmap the object
+ containing the lock before the cleanup, which results in an UAF. In the
+ worst case this UAF can lead to memory corruption when unrelated content
+ has been mapped to the same address by the time the access happens.
+
+A full in-depth analysis can be read at
+https://lore.kernel.org/lkml/20260316162316.356674433@kernel.org/
+
+To overcome this, the kernel needs to participate in the lock release
+operation. This ensures that releasing the lock and removing the address from
+``list_op_pending`` happen "atomically". If the release is interrupted by a
+signal, the kernel also verifies whether it interrupted the release operation
+and fixes up the state if required.
+
+For the contended unlock case, where other threads are waiting for the lock
+release, there is the ``FUTEX_ROBUST_UNLOCK`` flag for the ``futex()`` system
+call, which must be combined with one of the following operations:
+``FUTEX_WAKE``, ``FUTEX_WAKE_BITSET`` or ``FUTEX_UNLOCK_PI``. The kernel
+releases the lock (sets the futex word to zero), clears the
+``list_op_pending`` field and then proceeds with the normal wake path.
+
+For the non-contended path, there is still a race between checking the futex
+word and clearing the ``list_op_pending`` field. To solve this without
+requiring a full system call, userspace should invoke the virtual syscall
+``__vdso_futex_robust_listXX_try_unlock()`` (where XX is either 32 or 64,
+depending on the size of the pointer). If the vDSO call succeeds, it released
+the lock and cleared ``list_op_pending``. If it fails, there are waiters on
+this lock and a ``futex()`` syscall with ``FUTEX_ROBUST_UNLOCK`` is required.
From: André Almeida <andrealmeid@igalia.com>
Add tests for __vdso_futex_robust_listXX_try_unlock() and for the futex()
op FUTEX_ROBUST_UNLOCK.
Test the contended and uncontended cases for the vDSO functions and all
op combinations for FUTEX_ROBUST_UNLOCK.
Signed-off-by: André Almeida <andrealmeid@igalia.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260326-tonyk-vdso_test-v1-2-30a6f78c8bc3@igalia.com
---
tools/testing/selftests/futex/functional/robust_list.c | 203 +++++++++++++++++
tools/testing/selftests/futex/include/futextest.h | 3
2 files changed, 206 insertions(+)
--- a/tools/testing/selftests/futex/functional/robust_list.c
+++ b/tools/testing/selftests/futex/functional/robust_list.c
@@ -27,12 +27,14 @@
#include "futextest.h"
#include "../../kselftest_harness.h"
+#include <dlfcn.h>
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
+#include <sys/auxv.h>
#include <sys/mman.h>
#include <sys/wait.h>
@@ -54,6 +56,12 @@ static int get_robust_list(int pid, stru
return syscall(SYS_get_robust_list, pid, head, len_ptr);
}
+static int sys_futex_robust_unlock(_Atomic(uint32_t) *uaddr, unsigned int op, int val,
+ void *list_op_pending, unsigned int val3)
+{
+ return syscall(SYS_futex, uaddr, op, val, NULL, list_op_pending, val3, 0);
+}
+
/*
* Basic lock struct, contains just the futex word and the robust list element
* Real implementations have also a *prev to easily walk in the list
@@ -549,4 +557,199 @@ TEST(test_circular_list)
ksft_test_result_pass("%s\n", __func__);
}
+/*
+ * Below are tests for the fix of the robust release race condition. Please read the following
+ * thread to learn more about the issue and why these functions fix it:
+ * https://lore.kernel.org/lkml/20260316162316.356674433@kernel.org/
+ */
+
+/*
+ * Auxiliary code for loading the vDSO functions
+ */
+#define VDSO_SIZE 0x4000
+
+void *get_vdso_func_addr(const char *str)
+{
+ void *vdso_base = (void *) getauxval(AT_SYSINFO_EHDR), *addr;
+ Dl_info info;
+
+ if (!vdso_base) {
+ perror("Error to get AT_SYSINFO_EHDR");
+ return NULL;
+ }
+
+ for (addr = vdso_base; addr < vdso_base + VDSO_SIZE; addr += sizeof(addr)) {
+ if (dladdr(addr, &info) == 0 || !info.dli_sname)
+ continue;
+
+ if (!strcmp(info.dli_sname, str))
+ return info.dli_saddr;
+ }
+
+ return NULL;
+}
+
+/*
+ * These are the real vDSO function signatures:
+ *
+ * __vdso_futex_robust_list64_try_unlock(__u32 *lock, __u32 tid, __u64 *pop)
+ * __vdso_futex_robust_list32_try_unlock(__u32 *lock, __u32 tid, __u32 *pop)
+ *
+ * So for the generic entry point we need to use a void pointer as the last argument
+ */
+FIXTURE(vdso_unlock)
+{
+ uint32_t (*vdso)(_Atomic(uint32_t) *lock, uint32_t tid, void *pop);
+};
+
+FIXTURE_VARIANT(vdso_unlock)
+{
+ bool is_32;
+ char func_name[];
+};
+
+FIXTURE_SETUP(vdso_unlock)
+{
+ self->vdso = get_vdso_func_addr(variant->func_name);
+
+ if (!self->vdso)
+ ksft_test_result_skip("%s not found\n", variant->func_name);
+}
+
+FIXTURE_TEARDOWN(vdso_unlock) {}
+
+FIXTURE_VARIANT_ADD(vdso_unlock, 32)
+{
+ .func_name = "__vdso_futex_robust_list32_try_unlock",
+ .is_32 = true,
+};
+
+FIXTURE_VARIANT_ADD(vdso_unlock, 64)
+{
+ .func_name = "__vdso_futex_robust_list64_try_unlock",
+ .is_32 = false,
+};
+
+/*
+ * Test the vDSO robust_listXX_try_unlock() for the uncontended case. The virtual syscall should
+ * return the thread ID of the lock owner, the lock word must be 0 and the list_op_pending should
+ * be NULL.
+ */
+TEST_F(vdso_unlock, test_robust_try_unlock_uncontended)
+{
+ struct lock_struct lock = { .futex = 0 };
+ _Atomic(unsigned int) *futex = &lock.futex;
+ struct robust_list_head head;
+ uint64_t exp = (uint64_t) NULL;
+ pid_t tid = gettid();
+ int ret;
+
+ *futex = tid;
+
+ ret = set_list(&head);
+ if (ret)
+ ksft_test_result_fail("set_robust_list error\n");
+
+ head.list_op_pending = &lock.list;
+
+ ret = self->vdso(futex, tid, &head.list_op_pending);
+
+ ASSERT_EQ(ret, tid);
+ ASSERT_EQ(*futex, 0);
+
+ /* The 32-bit entry point clears only the lower 32 bits of the pointer; the upper bits remain */
+ if (variant->is_32) {
+ exp = (uint64_t)(unsigned long)&lock.list;
+ exp &= ~0xFFFFFFFFULL;
+ }
+
+ ASSERT_EQ((uint64_t)(unsigned long)head.list_op_pending, exp);
+}
+
+/*
+ * If the lock is contended, the operation fails. The return value is the value found at the
+ * futex word (tid | FUTEX_WAITERS), the futex word is not modified and the list_op_pending is
+ * not cleared.
+ */
+TEST_F(vdso_unlock, test_robust_try_unlock_contended)
+{
+ struct lock_struct lock = { .futex = 0 };
+ _Atomic(unsigned int) *futex = &lock.futex;
+ struct robust_list_head head;
+ pid_t tid = gettid();
+ int ret;
+
+ *futex = tid | FUTEX_WAITERS;
+
+ ret = set_list(&head);
+ if (ret)
+ ksft_test_result_fail("set_robust_list error\n");
+
+ head.list_op_pending = &lock.list;
+
+ ret = self->vdso(futex, tid, &head.list_op_pending);
+
+ ASSERT_EQ(ret, tid | FUTEX_WAITERS);
+ ASSERT_EQ(*futex, tid | FUTEX_WAITERS);
+ ASSERT_EQ(head.list_op_pending, &lock.list);
+}
+
+FIXTURE(futex_op) {};
+
+FIXTURE_VARIANT(futex_op)
+{
+ unsigned int op;
+ unsigned int val3;
+};
+
+FIXTURE_SETUP(futex_op) {}
+
+FIXTURE_TEARDOWN(futex_op) {}
+
+FIXTURE_VARIANT_ADD(futex_op, wake)
+{
+ .op = FUTEX_WAKE,
+ .val3 = 0,
+};
+
+FIXTURE_VARIANT_ADD(futex_op, wake_bitset)
+{
+ .op = FUTEX_WAKE_BITSET,
+ .val3 = FUTEX_BITSET_MATCH_ANY,
+};
+
+FIXTURE_VARIANT_ADD(futex_op, unlock_pi)
+{
+ .op = FUTEX_UNLOCK_PI,
+ .val3 = 0,
+};
+
+/*
+ * The syscall should return the number of tasks woken (for this test, 0), clear the futex word and
+ * clear list_op_pending.
+ */
+TEST_F(futex_op, test_futex_robust_unlock)
+{
+ struct lock_struct lock = { .futex = 0 };
+ _Atomic(unsigned int) *futex = &lock.futex;
+ struct robust_list_head head;
+ pid_t tid = gettid();
+ int ret;
+
+ *futex = tid | FUTEX_WAITERS;
+
+ ret = set_list(&head);
+ if (ret)
+ ksft_test_result_fail("set_robust_list error\n");
+
+ head.list_op_pending = &lock.list;
+
+ ret = sys_futex_robust_unlock(futex, FUTEX_ROBUST_UNLOCK | variant->op, tid,
+ &head.list_op_pending, variant->val3);
+
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(*futex, 0);
+ ASSERT_EQ(head.list_op_pending, NULL);
+}
+
TEST_HARNESS_MAIN
--- a/tools/testing/selftests/futex/include/futextest.h
+++ b/tools/testing/selftests/futex/include/futextest.h
@@ -38,6 +38,9 @@ typedef volatile u_int32_t futex_t;
#ifndef FUTEX_CMP_REQUEUE_PI
#define FUTEX_CMP_REQUEUE_PI 12
#endif
+#ifndef FUTEX_ROBUST_UNLOCK
+#define FUTEX_ROBUST_UNLOCK 512
+#endif
#ifndef FUTEX_WAIT_REQUEUE_PI_PRIVATE
#define FUTEX_WAIT_REQUEUE_PI_PRIVATE (FUTEX_WAIT_REQUEUE_PI | \
FUTEX_PRIVATE_FLAG)