When the FUTEX_ROBUST_UNLOCK mechanism is used for unlocking (PI-)futexes,
then the unlock sequence in userspace looks like this:
1) robust_list_set_op_pending(mutex);
2) robust_list_remove(mutex);
lval = gettid();
3) if (atomic_try_cmpxchg(&mutex->lock, lval, 0))
4) robust_list_clear_op_pending();
else
5) sys_futex(OP,...FUTEX_ROBUST_UNLOCK);
That still leaves a minimal race window between #3 and #4 where the mutex
could be acquired by some other task which observes that it is the last
user and:
1) unmaps the mutex memory
2) maps a different file, which ends up covering the same address
When then the original task exits before reaching #6 then the kernel robust
list handling observes the pending op entry and tries to fix up user space.
In case that the newly mapped data contains the TID of the exiting thread
at the address of the mutex/futex the kernel will set the owner died bit in
that memory and therefore corrupt unrelated data.
Provide a VDSO function which exposes the critical section window in the
VDSO symbol table. The resulting addresses are updated in the task's mm
when the VDSO is (re)map()'ed.
The core code detects when a task was interrupted within the critical
section and is about to deliver a signal. It then invokes an architecture
specific function which determines whether the pending op pointer has to be
cleared or not. The assembly sequence for the non COMPAT case is:
mov %esi,%eax // Load TID into EAX
xor %ecx,%ecx // Set ECX to 0
lock cmpxchg %ecx,(%rdi) // Try the TID -> 0 transition
.Lstart:
jnz .Lend
movq $0x0,(%rdx) // Clear list_op_pending
.Lend:
ret
So the decision can be simply based on the ZF state in regs->flags.
If COMPAT is enabled then the try_unlock() function needs to take the size
bit in the OP pointer into account, which makes it slightly more complex:
mov %esi,%eax // Load TID into EAX
mov %rdx,%rsi // Get the op pointer
xor %ecx,%ecx // Set ECX to 0
and $0xfffffffffffffffe,%rsi // Clear the size bit
lock cmpxchg %ecx,(%rdi) // Try the TID -> 0 transition
.Lstart:
jnz .Lend
.Lsuccess:
testl $0x1,(%rdx) // Test the size bit
jz .Lop64 // Not set: 64-bit
movl $0x0,(%rsi) // Clear 32-bit
jmp .Lend
.Lop64:
movq $0x0,(%rsi) // Clear 64-bit
.Lend:
ret
The decision function has to check whether regs->ip is in the success
portion as the size bit test obviously modifies ZF too. If it is before
.Lsuccess then ZF contains the cmpxchg() result. If it's at of after
.Lsuccess then the pointer has to be cleared.
The original pointer with the size bit is preserved in RDX so the fixup can
utilize the existing clearing mechanism, which is used by sys_futex().
Arguably this could be avoided by providing separate functions and making
the IP range for the quick check in the exit to user path cover the whole
text section which contains the two functions. But that's not a win at all
because:
1) User space needs to handle the two variants instead of just
relying on a bit which can be saved in the mutex at
initialization time.
2) The fixup decision function has then to evaluate which code path is
used. That just adds more symbols and range checking for no real
value.
The unlock function is inspired by an idea from Mathieu Desnoyers.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://lore.kernel.org/20260311185409.1988269-1-mathieu.desnoyers@efficios.com
---
arch/x86/Kconfig | 1
arch/x86/entry/vdso/common/vfutex.c | 72 +++++++++++++++++++++++++++++++
arch/x86/entry/vdso/vdso32/Makefile | 5 +-
arch/x86/entry/vdso/vdso32/vdso32.lds.S | 6 ++
arch/x86/entry/vdso/vdso32/vfutex.c | 1
arch/x86/entry/vdso/vdso64/Makefile | 7 +--
arch/x86/entry/vdso/vdso64/vdso64.lds.S | 6 ++
arch/x86/entry/vdso/vdso64/vdsox32.lds.S | 6 ++
arch/x86/entry/vdso/vdso64/vfutex.c | 1
arch/x86/include/asm/futex_robust.h | 44 ++++++++++++++++++
10 files changed, 144 insertions(+), 5 deletions(-)
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -237,6 +237,7 @@ config X86
select HAVE_EFFICIENT_UNALIGNED_ACCESS
select HAVE_EISA if X86_32
select HAVE_EXIT_THREAD
+ select HAVE_FUTEX_ROBUST_UNLOCK
select HAVE_GENERIC_TIF_BITS
select HAVE_GUP_FAST
select HAVE_FENTRY if X86_64 || DYNAMIC_FTRACE
--- /dev/null
+++ b/arch/x86/entry/vdso/common/vfutex.c
@@ -0,0 +1,72 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <vdso/futex.h>
+
+/*
+ * Compat enabled kernels have to take the size bit into account to support the
+ * mixed size use case of gaming emulators. Contrary to the kernel robust unlock
+ * mechanism all of this does not test for the 32-bit modifier in 32-bit VDSOs
+ * and in compat disabled kernels. User space can keep the pieces.
+ */
+#if defined(CONFIG_X86_64) && !defined(BUILD_VDSO32_64)
+
+#ifdef CONFIG_COMPAT
+
+# define ASM_CLEAR_PTR \
+ " testl $1, (%[pop]) \n" \
+ " jz .Lop64 \n" \
+ " movl $0, (%[pad]) \n" \
+ " jmp __vdso_futex_robust_try_unlock_cs_end \n" \
+ ".Lop64: \n" \
+ " movq $0, (%[pad]) \n"
+
+# define ASM_PAD_CONSTRAINT ,[pad] "S" (((unsigned long)pop) & ~0x1UL)
+
+#else /* CONFIG_COMPAT */
+
+# define ASM_CLEAR_PTR \
+ " movq $0, (%[pop]) \n"
+
+# define ASM_PAD_CONSTRAINT
+
+#endif /* !CONFIG_COMPAT */
+
+#else /* CONFIG_X86_64 && !BUILD_VDSO32_64 */
+
+# define ASM_CLEAR_PTR \
+ " movl $0, (%[pad]) \n"
+
+# define ASM_PAD_CONSTRAINT ,[pad] "S" (((unsigned long)pop) & ~0x1UL)
+
+#endif /* !CONFIG_X86_64 || BUILD_VDSO32_64 */
+
+uint32_t __vdso_futex_robust_try_unlock(uint32_t *lock, uint32_t tid, void *pop)
+{
+ asm volatile (
+ ".global __vdso_futex_robust_try_unlock_cs_start \n"
+ ".global __vdso_futex_robust_try_unlock_cs_success \n"
+ ".global __vdso_futex_robust_try_unlock_cs_end \n"
+ " \n"
+ " lock cmpxchgl %[val], (%[ptr]) \n"
+ " \n"
+ "__vdso_futex_robust_try_unlock_cs_start: \n"
+ " \n"
+ " jnz __vdso_futex_robust_try_unlock_cs_end \n"
+ " \n"
+ "__vdso_futex_robust_try_unlock_cs_success: \n"
+ " \n"
+ ASM_CLEAR_PTR
+ " \n"
+ "__vdso_futex_robust_try_unlock_cs_end: \n"
+ : [tid] "+a" (tid)
+ : [ptr] "D" (lock),
+ [pop] "d" (pop),
+ [val] "r" (0)
+ ASM_PAD_CONSTRAINT
+ : "memory"
+ );
+
+ return tid;
+}
+
+uint32_t futex_robust_try_unlock(uint32_t *, uint32_t, void **)
+ __attribute__((weak, alias("__vdso_futex_robust_try_unlock")));
--- a/arch/x86/entry/vdso/vdso32/Makefile
+++ b/arch/x86/entry/vdso/vdso32/Makefile
@@ -7,8 +7,9 @@
vdsos-y := 32
# Files to link into the vDSO:
-vobjs-y := note.o vclock_gettime.o vgetcpu.o
-vobjs-y += system_call.o sigreturn.o
+vobjs-y := note.o vclock_gettime.o vgetcpu.o
+vobjs-y += system_call.o sigreturn.o
+vobjs-$(CONFIG_FUTEX_ROBUST_UNLOCK) += vfutex.o
# Compilation flags
flags-y := -DBUILD_VDSO32 -m32 -mregparm=0
--- a/arch/x86/entry/vdso/vdso32/vdso32.lds.S
+++ b/arch/x86/entry/vdso/vdso32/vdso32.lds.S
@@ -30,6 +30,12 @@ VERSION
__vdso_clock_gettime64;
__vdso_clock_getres_time64;
__vdso_getcpu;
+#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
+ __vdso_futex_robust_try_unlock;
+ __vdso_futex_robust_try_unlock_cs_start;
+ __vdso_futex_robust_try_unlock_cs_success;
+ __vdso_futex_robust_try_unlock_cs_end;
+#endif
};
LINUX_2.5 {
--- /dev/null
+++ b/arch/x86/entry/vdso/vdso32/vfutex.c
@@ -0,0 +1 @@
+#include "common/vfutex.c"
--- a/arch/x86/entry/vdso/vdso64/Makefile
+++ b/arch/x86/entry/vdso/vdso64/Makefile
@@ -8,9 +8,10 @@ vdsos-y := 64
vdsos-$(CONFIG_X86_X32_ABI) += x32
# Files to link into the vDSO:
-vobjs-y := note.o vclock_gettime.o vgetcpu.o
-vobjs-y += vgetrandom.o vgetrandom-chacha.o
-vobjs-$(CONFIG_X86_SGX) += vsgx.o
+vobjs-y := note.o vclock_gettime.o vgetcpu.o
+vobjs-y += vgetrandom.o vgetrandom-chacha.o
+vobjs-$(CONFIG_X86_SGX) += vsgx.o
+vobjs-$(CONFIG_FUTEX_ROBUST_UNLOCK) += vfutex.o
# Compilation flags
flags-y := -DBUILD_VDSO64 -m64 -mcmodel=small
--- a/arch/x86/entry/vdso/vdso64/vdso64.lds.S
+++ b/arch/x86/entry/vdso/vdso64/vdso64.lds.S
@@ -32,6 +32,12 @@ VERSION {
#endif
getrandom;
__vdso_getrandom;
+#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
+ __vdso_futex_robust_try_unlock;
+ __vdso_futex_robust_try_unlock_cs_start;
+ __vdso_futex_robust_try_unlock_cs_success;
+ __vdso_futex_robust_try_unlock_cs_end;
+#endif
local: *;
};
}
--- a/arch/x86/entry/vdso/vdso64/vdsox32.lds.S
+++ b/arch/x86/entry/vdso/vdso64/vdsox32.lds.S
@@ -22,6 +22,12 @@ VERSION {
__vdso_getcpu;
__vdso_time;
__vdso_clock_getres;
+#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
+ __vdso_futex_robust_try_unlock;
+ __vdso_futex_robust_try_unlock_cs_start;
+ __vdso_futex_robust_try_unlock_cs_success;
+ __vdso_futex_robust_try_unlock_cs_end;
+#endif
local: *;
};
}
--- /dev/null
+++ b/arch/x86/entry/vdso/vdso64/vfutex.c
@@ -0,0 +1 @@
+#include "common/vfutex.c"
--- /dev/null
+++ b/arch/x86/include/asm/futex_robust.h
@@ -0,0 +1,44 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_FUTEX_ROBUST_H
+#define _ASM_X86_FUTEX_ROBUST_H
+
+#include <asm/ptrace.h>
+
+static __always_inline bool x86_futex_needs_robust_unlock_fixup(struct pt_regs *regs)
+{
+ /*
+ * This is tricky in the compat case as it has to take the size check
+ * into account. See the ASM magic in the VDSO vfutex code. If compat is
+ * disabled or this is a 32-bit kernel then ZF is authoritive no matter
+ * what.
+ */
+ if (!IS_ENABLED(CONFIG_X86_64) || !IS_ENABLED(CONFIG_IA32_EMULATION))
+ return !!(regs->flags & X86_EFLAGS_ZF);
+
+ /*
+ * For the compat case, the core code already established that regs->ip
+ * is >= cs_start and < cs_end. Now check whether it is at the
+ * conditional jump which checks the cmpxchg() or if it succeeded and
+ * does the size check, which obviously modifies ZF too.
+ */
+ if (regs->ip >= current->mm->futex.unlock_cs_success_ip)
+ return true;
+ /*
+ * It's at the jnz right after the cmpxchg(). ZF tells whether this
+ * succeeded or not.
+ */
+ return !!(regs->flags & X86_EFLAGS_ZF);
+}
+
+#define arch_futex_needs_robust_unlock_fixup(regs) \
+ x86_futex_needs_robust_unlock_fixup(regs)
+
+static __always_inline void __user *x86_futex_robust_unlock_get_pop(struct pt_regs *regs)
+{
+ return (void __user *)regs->dx;
+}
+
+#define arch_futex_robust_unlock_get_pop(regs) \
+ x86_futex_robust_unlock_get_pop(regs)
+
+#endif /* _ASM_X86_FUTEX_ROBUST_H */
On 3/16/26 18:13, Thomas Gleixner wrote:
> When the FUTEX_ROBUST_UNLOCK mechanism is used for unlocking (PI-)futexes,
> then the unlock sequence in userspace looks like this:
>
> 1) robust_list_set_op_pending(mutex);
> 2) robust_list_remove(mutex);
>
> lval = gettid();
> 3) if (atomic_try_cmpxchg(&mutex->lock, lval, 0))
> 4) robust_list_clear_op_pending();
> else
> 5) sys_futex(OP,...FUTEX_ROBUST_UNLOCK);
>
> That still leaves a minimal race window between #3 and #4 where the mutex
> could be acquired by some other task which observes that it is the last
> user and:
>
> 1) unmaps the mutex memory
> 2) maps a different file, which ends up covering the same address
>
> When then the original task exits before reaching #6 then the kernel robust
> list handling observes the pending op entry and tries to fix up user space.
>
> In case that the newly mapped data contains the TID of the exiting thread
> at the address of the mutex/futex the kernel will set the owner died bit in
> that memory and therefore corrupt unrelated data.
>
> Provide a VDSO function which exposes the critical section window in the
> VDSO symbol table. The resulting addresses are updated in the task's mm
> when the VDSO is (re)map()'ed.
>
> The core code detects when a task was interrupted within the critical
> section and is about to deliver a signal. It then invokes an architecture
> specific function which determines whether the pending op pointer has to be
> cleared or not. The assembly sequence for the non COMPAT case is:
>
> mov %esi,%eax // Load TID into EAX
> xor %ecx,%ecx // Set ECX to 0
> lock cmpxchg %ecx,(%rdi) // Try the TID -> 0 transition
> .Lstart:
> jnz .Lend
> movq $0x0,(%rdx) // Clear list_op_pending
> .Lend:
> ret
>
> So the decision can be simply based on the ZF state in regs->flags.
>
> If COMPAT is enabled then the try_unlock() function needs to take the size
> bit in the OP pointer into account, which makes it slightly more complex:
>
> mov %esi,%eax // Load TID into EAX
> mov %rdx,%rsi // Get the op pointer
> xor %ecx,%ecx // Set ECX to 0
> and $0xfffffffffffffffe,%rsi // Clear the size bit
> lock cmpxchg %ecx,(%rdi) // Try the TID -> 0 transition
> .Lstart:
> jnz .Lend
> .Lsuccess:
> testl $0x1,(%rdx) // Test the size bit
> jz .Lop64 // Not set: 64-bit
> movl $0x0,(%rsi) // Clear 32-bit
> jmp .Lend
> .Lop64:
> movq $0x0,(%rsi) // Clear 64-bit
> .Lend:
> ret
>
> The decision function has to check whether regs->ip is in the success
> portion as the size bit test obviously modifies ZF too. If it is before
> .Lsuccess then ZF contains the cmpxchg() result. If it's at of after
> .Lsuccess then the pointer has to be cleared.
>
> The original pointer with the size bit is preserved in RDX so the fixup can
> utilize the existing clearing mechanism, which is used by sys_futex().
>
> Arguably this could be avoided by providing separate functions and making
> the IP range for the quick check in the exit to user path cover the whole
> text section which contains the two functions. But that's not a win at all
> because:
>
> 1) User space needs to handle the two variants instead of just
> relying on a bit which can be saved in the mutex at
> initialization time.
>
> 2) The fixup decision function has then to evaluate which code path is
> used. That just adds more symbols and range checking for no real
> value.
>
> The unlock function is inspired by an idea from Mathieu Desnoyers.
>
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> Link: https://lore.kernel.org/20260311185409.1988269-1-mathieu.desnoyers@efficios.com
> ---
> arch/x86/Kconfig | 1
> arch/x86/entry/vdso/common/vfutex.c | 72 +++++++++++++++++++++++++++++++
> arch/x86/entry/vdso/vdso32/Makefile | 5 +-
> arch/x86/entry/vdso/vdso32/vdso32.lds.S | 6 ++
> arch/x86/entry/vdso/vdso32/vfutex.c | 1
> arch/x86/entry/vdso/vdso64/Makefile | 7 +--
> arch/x86/entry/vdso/vdso64/vdso64.lds.S | 6 ++
> arch/x86/entry/vdso/vdso64/vdsox32.lds.S | 6 ++
> arch/x86/entry/vdso/vdso64/vfutex.c | 1
> arch/x86/include/asm/futex_robust.h | 44 ++++++++++++++++++
> 10 files changed, 144 insertions(+), 5 deletions(-)
>
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -237,6 +237,7 @@ config X86
> select HAVE_EFFICIENT_UNALIGNED_ACCESS
> select HAVE_EISA if X86_32
> select HAVE_EXIT_THREAD
> + select HAVE_FUTEX_ROBUST_UNLOCK
> select HAVE_GENERIC_TIF_BITS
> select HAVE_GUP_FAST
> select HAVE_FENTRY if X86_64 || DYNAMIC_FTRACE
> --- /dev/null
> +++ b/arch/x86/entry/vdso/common/vfutex.c
> @@ -0,0 +1,72 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +#include <vdso/futex.h>
> +
> +/*
> + * Compat enabled kernels have to take the size bit into account to support the
> + * mixed size use case of gaming emulators. Contrary to the kernel robust unlock
> + * mechanism all of this does not test for the 32-bit modifier in 32-bit VDSOs
> + * and in compat disabled kernels. User space can keep the pieces.
> + */
> +#if defined(CONFIG_X86_64) && !defined(BUILD_VDSO32_64)
> +
> +#ifdef CONFIG_COMPAT
The following asm template can be substantially improved.
> +# define ASM_CLEAR_PTR \
> + " testl $1, (%[pop]) \n" \
Please use byte-wide instruction, TESTB with address operand modifier,
"%a[pop]" instead of "(%[pop])":
testb $1, %a[pop]
> + " jz .Lop64 \n" \
> + " movl $0, (%[pad]) \n" \
Here you can reuse zero-valued operand "val" and use address operand
modifier. Please note %k modifier.
movl %k[val], %a[pad]
> + " jmp __vdso_futex_robust_try_unlock_cs_end \n" \
> + ".Lop64: \n" \
> + " movq $0, (%[pad]) \n"
Again, zero-valued operand "val" and address op modifier can be used here:
movq %[val], %a[pad]
> +
> +# define ASM_PAD_CONSTRAINT ,[pad] "S" (((unsigned long)pop) & ~0x1UL)
> +
> +#else /* CONFIG_COMPAT */
> +
> +# define ASM_CLEAR_PTR \
> + " movq $0, (%[pop]) \n"
movq %[val], %a[pop]
> +
> +# define ASM_PAD_CONSTRAINT
> +
> +#endif /* !CONFIG_COMPAT */
> +
> +#else /* CONFIG_X86_64 && !BUILD_VDSO32_64 */
> +
> +# define ASM_CLEAR_PTR \
> + " movl $0, (%[pad]) \n"
movl %[val], %a[pad]
> +
> +# define ASM_PAD_CONSTRAINT ,[pad] "S" (((unsigned long)pop) & ~0x1UL)
> +
> +#endif /* !CONFIG_X86_64 || BUILD_VDSO32_64 */
> +
> +uint32_t __vdso_futex_robust_try_unlock(uint32_t *lock, uint32_t tid, void *pop)
> +{
> + asm volatile (
> + ".global __vdso_futex_robust_try_unlock_cs_start \n"
> + ".global __vdso_futex_robust_try_unlock_cs_success \n"
> + ".global __vdso_futex_robust_try_unlock_cs_end \n"
> + " \n"
> + " lock cmpxchgl %[val], (%[ptr]) \n"
> + " \n"
> + "__vdso_futex_robust_try_unlock_cs_start: \n"
> + " \n"
> + " jnz __vdso_futex_robust_try_unlock_cs_end \n"
> + " \n"
> + "__vdso_futex_robust_try_unlock_cs_success: \n"
> + " \n"
> + ASM_CLEAR_PTR
> + " \n"
> + "__vdso_futex_robust_try_unlock_cs_end: \n"
> + : [tid] "+a" (tid)
You need earlyclobber here "+&a", because not all input arguemnts are
read before this argument is written.
> + : [ptr] "D" (lock),
> + [pop] "d" (pop),
> + [val] "r" (0)
[val] "r" (0UL), so the correct register width will be used. I'd name
this operand [zero], because 0 lives here, and it will be reused in
several places.
Uros.
> + ASM_PAD_CONSTRAINT
> + : "memory"
> + );
> +
> + return tid;
> +}
> +
> +uint32_t futex_robust_try_unlock(uint32_t *, uint32_t, void **)
> + __attribute__((weak, alias("__vdso_futex_robust_try_unlock")));
> --- a/arch/x86/entry/vdso/vdso32/Makefile
> +++ b/arch/x86/entry/vdso/vdso32/Makefile
> @@ -7,8 +7,9 @@
> vdsos-y := 32
>
> # Files to link into the vDSO:
> -vobjs-y := note.o vclock_gettime.o vgetcpu.o
> -vobjs-y += system_call.o sigreturn.o
> +vobjs-y := note.o vclock_gettime.o vgetcpu.o
> +vobjs-y += system_call.o sigreturn.o
> +vobjs-$(CONFIG_FUTEX_ROBUST_UNLOCK) += vfutex.o
>
> # Compilation flags
> flags-y := -DBUILD_VDSO32 -m32 -mregparm=0
> --- a/arch/x86/entry/vdso/vdso32/vdso32.lds.S
> +++ b/arch/x86/entry/vdso/vdso32/vdso32.lds.S
> @@ -30,6 +30,12 @@ VERSION
> __vdso_clock_gettime64;
> __vdso_clock_getres_time64;
> __vdso_getcpu;
> +#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
> + __vdso_futex_robust_try_unlock;
> + __vdso_futex_robust_try_unlock_cs_start;
> + __vdso_futex_robust_try_unlock_cs_success;
> + __vdso_futex_robust_try_unlock_cs_end;
> +#endif
> };
>
> LINUX_2.5 {
> --- /dev/null
> +++ b/arch/x86/entry/vdso/vdso32/vfutex.c
> @@ -0,0 +1 @@
> +#include "common/vfutex.c"
> --- a/arch/x86/entry/vdso/vdso64/Makefile
> +++ b/arch/x86/entry/vdso/vdso64/Makefile
> @@ -8,9 +8,10 @@ vdsos-y := 64
> vdsos-$(CONFIG_X86_X32_ABI) += x32
>
> # Files to link into the vDSO:
> -vobjs-y := note.o vclock_gettime.o vgetcpu.o
> -vobjs-y += vgetrandom.o vgetrandom-chacha.o
> -vobjs-$(CONFIG_X86_SGX) += vsgx.o
> +vobjs-y := note.o vclock_gettime.o vgetcpu.o
> +vobjs-y += vgetrandom.o vgetrandom-chacha.o
> +vobjs-$(CONFIG_X86_SGX) += vsgx.o
> +vobjs-$(CONFIG_FUTEX_ROBUST_UNLOCK) += vfutex.o
>
> # Compilation flags
> flags-y := -DBUILD_VDSO64 -m64 -mcmodel=small
> --- a/arch/x86/entry/vdso/vdso64/vdso64.lds.S
> +++ b/arch/x86/entry/vdso/vdso64/vdso64.lds.S
> @@ -32,6 +32,12 @@ VERSION {
> #endif
> getrandom;
> __vdso_getrandom;
> +#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
> + __vdso_futex_robust_try_unlock;
> + __vdso_futex_robust_try_unlock_cs_start;
> + __vdso_futex_robust_try_unlock_cs_success;
> + __vdso_futex_robust_try_unlock_cs_end;
> +#endif
> local: *;
> };
> }
> --- a/arch/x86/entry/vdso/vdso64/vdsox32.lds.S
> +++ b/arch/x86/entry/vdso/vdso64/vdsox32.lds.S
> @@ -22,6 +22,12 @@ VERSION {
> __vdso_getcpu;
> __vdso_time;
> __vdso_clock_getres;
> +#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
> + __vdso_futex_robust_try_unlock;
> + __vdso_futex_robust_try_unlock_cs_start;
> + __vdso_futex_robust_try_unlock_cs_success;
> + __vdso_futex_robust_try_unlock_cs_end;
> +#endif
> local: *;
> };
> }
> --- /dev/null
> +++ b/arch/x86/entry/vdso/vdso64/vfutex.c
> @@ -0,0 +1 @@
> +#include "common/vfutex.c"
> --- /dev/null
> +++ b/arch/x86/include/asm/futex_robust.h
> @@ -0,0 +1,44 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_X86_FUTEX_ROBUST_H
> +#define _ASM_X86_FUTEX_ROBUST_H
> +
> +#include <asm/ptrace.h>
> +
> +static __always_inline bool x86_futex_needs_robust_unlock_fixup(struct pt_regs *regs)
> +{
> + /*
> + * This is tricky in the compat case as it has to take the size check
> + * into account. See the ASM magic in the VDSO vfutex code. If compat is
> + * disabled or this is a 32-bit kernel then ZF is authoritive no matter
> + * what.
> + */
> + if (!IS_ENABLED(CONFIG_X86_64) || !IS_ENABLED(CONFIG_IA32_EMULATION))
> + return !!(regs->flags & X86_EFLAGS_ZF);
> +
> + /*
> + * For the compat case, the core code already established that regs->ip
> + * is >= cs_start and < cs_end. Now check whether it is at the
> + * conditional jump which checks the cmpxchg() or if it succeeded and
> + * does the size check, which obviously modifies ZF too.
> + */
> + if (regs->ip >= current->mm->futex.unlock_cs_success_ip)
> + return true;
> + /*
> + * It's at the jnz right after the cmpxchg(). ZF tells whether this
> + * succeeded or not.
> + */
> + return !!(regs->flags & X86_EFLAGS_ZF);
> +}
> +
> +#define arch_futex_needs_robust_unlock_fixup(regs) \
> + x86_futex_needs_robust_unlock_fixup(regs)
> +
> +static __always_inline void __user *x86_futex_robust_unlock_get_pop(struct pt_regs *regs)
> +{
> + return (void __user *)regs->dx;
> +}
> +
> +#define arch_futex_robust_unlock_get_pop(regs) \
> + x86_futex_robust_unlock_get_pop(regs)
> +
> +#endif /* _ASM_X86_FUTEX_ROBUST_H */
>
On Tue, Mar 17 2026 at 16:33, Uros Bizjak wrote:
> On 3/16/26 18:13, Thomas Gleixner wrote:
>> +#ifdef CONFIG_COMPAT
>
> The following asm template can be substantially improved.
In theory.
>> +# define ASM_CLEAR_PTR \
>> + " testl $1, (%[pop]) \n" \
>
> Please use byte-wide instruction, TESTB with address operand modifier,
> "%a[pop]" instead of "(%[pop])":
>
> testb $1, %a[pop]
New fangled %a syntax seems to work. Didn't know about that. Though I'm
not convinced that this is an improvement. At least not for someone who
is used to read/write plain old school ASM for several decades.
>> + " jz .Lop64 \n" \
>> + " movl $0, (%[pad]) \n" \
>
> Here you can reuse zero-valued operand "val" and use address operand
> modifier. Please note %k modifier.
>
> movl %k[val], %a[pad]
...
> movq %[val], %a[pad]
But this one does not and _cannot_ work.
Error: incorrect register `%rcx' used with `l' suffix
Which is obvious because of the initalization:
[val] "r" (0UL)
which makes it explicitely R$$. If you change the initialization back to
[val] "r" (0) // or (0U)
the failure unsurprisingly becomes
Error: incorrect register `%ecx' used with `q' suffix
So much for the theory....
<SNIP>
>> + ASM_PAD_CONSTRAINT
tons of useless quoted text
>> +#endif /* _ASM_X86_FUTEX_ROBUST_H */
</SNIP>
Can you please trim your replies properly?
Thanks,
tglx
On Wed, Mar 18, 2026 at 9:22 AM Thomas Gleixner <tglx@kernel.org> wrote:
>
> On Tue, Mar 17 2026 at 16:33, Uros Bizjak wrote:
> > On 3/16/26 18:13, Thomas Gleixner wrote:
> >> +#ifdef CONFIG_COMPAT
> >
> > The following asm template can be substantially improved.
>
> In theory.
>
> >> +# define ASM_CLEAR_PTR \
> >> + " testl $1, (%[pop]) \n" \
> >
> > Please use byte-wide instruction, TESTB with address operand modifier,
> > "%a[pop]" instead of "(%[pop])":
> >
> > testb $1, %a[pop]
>
> New fangled %a syntax seems to work. Didn't know about that. Though I'm
> not convinced that this is an improvement. At least not for someone who
> is used to read/write plain old school ASM for several decades.
>
> >> + " jz .Lop64 \n" \
> >> + " movl $0, (%[pad]) \n" \
> >
> > Here you can reuse zero-valued operand "val" and use address operand
> > modifier. Please note %k modifier.
> >
> > movl %k[val], %a[pad]
> ...
> > movq %[val], %a[pad]
>
> But this one does not and _cannot_ work.
>
> Error: incorrect register `%rcx' used with `l' suffix
Ah, I missed:
" lock cmpxchgl %[val], (%[ptr]) \n"
Please use %k[val] here and it will work. %k forces %ecx.
> Which is obvious because of the initalization:
>
> [val] "r" (0UL)
You have to force 0 to %r$$ for x86_64 when movq is used. The
resulting code (xorl %ecx, %ecx) is the same, but the compiler knows
what type of value is created here.
Uros.
* Thomas Gleixner: > Arguably this could be avoided by providing separate functions and making > the IP range for the quick check in the exit to user path cover the whole > text section which contains the two functions. But that's not a win at all > because: > > 1) User space needs to handle the two variants instead of just > relying on a bit which can be saved in the mutex at > initialization time. I'm pretty sure that on the user-space side, we wouldn't have cross-word-size operations (e.g., 64-bit code working on both 64-bit and 32-bit robust mutexes). Certainly not within libcs. The other point about complexity is of course still valid. Thanks, Florian
On Tue, Mar 17 2026 at 09:28, Florian Weimer wrote:
> * Thomas Gleixner:
>
>> Arguably this could be avoided by providing separate functions and making
>> the IP range for the quick check in the exit to user path cover the whole
>> text section which contains the two functions. But that's not a win at all
>> because:
>>
>> 1) User space needs to handle the two variants instead of just
>> relying on a bit which can be saved in the mutex at
>> initialization time.
>
> I'm pretty sure that on the user-space side, we wouldn't have
> cross-word-size operations (e.g., 64-bit code working on both 64-bit and
> 32-bit robust mutexes). Certainly not within libcs. The other point
> about complexity is of course still valid.
Right, I know that no libc implementation supports such an insanity, but
the kernel unfortunately allows to do so and it's used in the wild :(
So we have to deal with it somehow and the size modifier was the most
straight forward solution I could come up with. I'm all ears if someone
has a better idea.
That said, do you see any issue from libc size versus extending the
WAKE/UNLOCK_PI functionality with that UNLOCK_ROBUST functionality?
I did some basic performance tests in the meanwhile with an open coded
mutex implementation. I can't observe any significant difference between
doing the unlock in user space or letting the kernel do it, but that
needs of course more scrunity.
Thanks,
tglx
* Thomas Gleixner: > On Tue, Mar 17 2026 at 09:28, Florian Weimer wrote: >> * Thomas Gleixner: >> >>> Arguably this could be avoided by providing separate functions and making >>> the IP range for the quick check in the exit to user path cover the whole >>> text section which contains the two functions. But that's not a win at all >>> because: >>> >>> 1) User space needs to handle the two variants instead of just >>> relying on a bit which can be saved in the mutex at >>> initialization time. >> >> I'm pretty sure that on the user-space side, we wouldn't have >> cross-word-size operations (e.g., 64-bit code working on both 64-bit and >> 32-bit robust mutexes). Certainly not within libcs. The other point >> about complexity is of course still valid. > > Right, I know that no libc implementation supports such an insanity, but > the kernel unfortunately allows to do so and it's used in the wild :( > > So we have to deal with it somehow and the size modifier was the most > straight forward solution I could come up with. I'm all ears if someone > has a better idea. Maybe a separate futex op? And the vDSO would have the futex call, mangle uaddr2 as required for the shared code section that handles both ops? As far as I can tell at this point, the current proposal should work. We'd probably start with using the syscall-based unlock. Thanks, Florian
On Tue, Mar 17 2026 at 11:37, Florian Weimer wrote:
> * Thomas Gleixner:
>> Right, I know that no libc implementation supports such an insanity, but
>> the kernel unfortunately allows to do so and it's used in the wild :(
>>
>> So we have to deal with it somehow and the size modifier was the most
>> straight forward solution I could come up with. I'm all ears if someone
>> has a better idea.
>
> Maybe a separate futex op? And the vDSO would have the futex call,
> mangle uaddr2 as required for the shared code section that handles both
> ops?
>
> As far as I can tell at this point, the current proposal should work.
> We'd probably start with using the syscall-based unlock.
Something like the below compiled but untested delta diff which includes
also the other unrelated feedback fixups?
Thanks,
tglx
---
diff --git a/arch/x86/entry/vdso/common/vfutex.c b/arch/x86/entry/vdso/common/vfutex.c
index 19d8ef130b63..491ed141622d 100644
--- a/arch/x86/entry/vdso/common/vfutex.c
+++ b/arch/x86/entry/vdso/common/vfutex.c
@@ -1,72 +1,218 @@
// SPDX-License-Identifier: GPL-2.0-only
#include <vdso/futex.h>
+/*
+ * Assembly template for the try unlock functions. The basic functionality for
+ * 64-bit is:
+ *
+ * At the call site:
+ * mov &lock, %rdi Store the lock pointer in RDI
+ * mov &pop, %rdx Store the pending op pointer in RDX
+ * mov TID, %esi Store the thread's TID in ESI
+ *
+ * 64-bit unlock function:
+ * mov esi, %eax Move the TID into EAX
+ * xor %ecx, %ecx Clear ECX
+ * lock_cmpxchgl %ecx, (%rdi) Attempt the TID -> 0 transition
+ * .Lcs_start: Start of the critical section
+ * jnz .Lcs_end If cmpxchl failed jump to the end
+ * .Lcs_success: Start of the success section
+ * movq $0, (%rdx) Set the pending op pointer to 0
+ * .Lcs_end: End of the critical section
+ *
+ * For COMPAT enabled 64-bit kernels this is a bit more complex because the size
+ * of the @pop pointer has to be determined in the success section:
+ *
+ * At the 64-bit call site:
+ * mov &lock, %rdi Store the lock pointer in RDI
+ * mov &pop, %rdx Store the pending op pointer in RDX
+ * mov TID, %esi Store the thread's TID in ESI
+ *
+ * At the 32-bit call site:
+ * mov &lock, %edi Store the lock pointer in EDI
+ * mov &pop, %edx Store the pending op pointer in EDX
+ * mov TID, %esi Store the thread's TID in ESI
+ *
+ * The 32-bit entry point:
+ * or $0x1, %edx Mark the op pointer 32-bit
+ *
+ * Common unlock function:
+ * mov esi, %eax Move the TID into EAX
+ * xor %ecx, %ecx Clear ECX
+ * mov %rdx, %rsi Store the op pointer in RSI
+ * and ~0x1, %rsi Clear the size bit in RSI
+ * lock_cmpxchgl %ecx, (%rdi) Attempt the TID -> 0 transition
+ * .Lcs_start: Start of the critical section
+ * jnz .Lcs_end If cmpxchl failed jump to the end
+ * .Lcs_success: Start of the success section
+ * test $0x1, %rdx Test the 32-bit size bit in the original pointer
+ * jz .Lop64 If not set, clear 64-bit
+ * movl $0, (%rsi) Set the 32-bit pending op pointer to 0
+ * jmp .Lcs_end Leave the critical section
+ * .Lop64: movq $0, (%rsi) Set the 64-bit pending op pointer to 0
+ * .Lcs_end: End of the critical section
+ *
+ * The 32-bit VDSO needs to set the 32-bit size bit as well to keep the code
+ * compatible for the kernel side fixup function, but it does not require the
+ * size evaluation in the success path.
+ *
+ * At the 32-bit call site:
+ * mov &lock, %edi Store the lock pointer in EDI
+ * mov &pop, %edx Store the pending op pointer in EDX
+ * mov TID, %esi Store the thread's TID in ESI
+ *
+ * The 32-bit entry point does:
+ * or $0x1, %edx Mark the op pointer 32-bit
+ *
+ * 32-bit unlock function:
+ * mov esi, %eax Move the TID into EAX
+ * xor %ecx, %ecx Clear ECX
+ * mov %edx, %esi Store the op pointer in ESI
+ * and ~0x1, %esi Clear the size bit in ESI
+ * lock_cmpxchgl %ecx, (%edi) Attempt the TID -> 0 transition
+ * .Lcs_start: Start of the critical section
+ * jnz .Lcs_end If cmpxchl failed jump to the end
+ * .Lcs_success: Start of the success section
+ * movl $0, (%esi) Set the 32-bit pending op pointer to 0
+ * .Lcs_end: End of the critical section
+ *
+ * The pointer modification makes sure that the unlock function can determine
+ * the pending op pointer size correctly and clear either 32 or 64 bit.
+ *
+ * The intermediate storage of the unmangled pointer (bit 0 cleared) in [ER]SI
+ * makes sure that the store hits the right address.
+ *
+ * The mangled pointer (bit 0 set for 32-bit) stays in [ER]DX so that the kernel
+ * side fixup function can determine the storage size correctly and always
+ * retrieve regs->rdx without any extra knowledge of the actual code path taken
+ * or checking the compat mode of the task.
+ *
+ * The .Lcs_success label is technically not required for a pure 64-bit and the
+ * 32-bit VDSO but is kept there for simplicity. In those cases the ZF flag in
+ * regs->eflags is authoritative for the whole critical section and no further
+ * evaluation is required.
+ *
+ * In the 64-bit compat case the .Lcs_success label is required because the
+ * pointer size check modifies the ZF flag, which means it is only valid for the
+ * case where .Lcs_start <= regs->ip < L.cs_success, which is obviously the
+ * same as l.cs_start == regs->ip for x86.
+ *
+ * That's still a valuable distinction for clarity to keep the ASM template the
+ * same for all case. This is also a template for other architectures which
+ * might have different requirements even for the non COMPAT case.
+ *
+ * That means in the 64-bit compat case the decision to do the fixup is:
+ *
+ * if (regs->ip >= .Lcs_start && regs->ip < L.cs_success)
+ * return (regs->eflags & ZF);
+ * return regs->ip < .Lcs_end;
+ *
+ * As the initial critical section check in the return to user space code
+ * already established that:
+ *
+ * .Lcs_start <= regs->ip < L.cs_end
+ *
+ * that decision can be simplified to:
+ *
+ * return regs->ip >= L.cs_success || regs->eflags & ZF;
+ *
+ */
+#define robust_try_unlock_asm(__tid, __lock, __pop) \
+ asm volatile ( \
+ ".global __kernel_futex_robust_try_unlock_cs_start \n" \
+ ".global __kernel_futex_robust_try_unlock_cs_success \n" \
+ ".global __kernel_futex_robust_try_unlock_cs_end \n" \
+ " \n" \
+ " lock cmpxchgl %[val], (%[ptr]) \n" \
+ " \n" \
+ "__kernel_futex_robust_try_unlock_cs_start: \n" \
+ " \n" \
+ " jnz __kernel_futex_robust_try_unlock_cs_end \n" \
+ " \n" \
+ "__kernel_futex_robust_try_unlock_cs_success: \n" \
+ " \n" \
+ ASM_CLEAR_PTR \
+ " \n" \
+ "__kernel_futex_robust_try_unlock_cs_end: \n" \
+ : [tid] "+a" (__tid) \
+ : [ptr] "D" (__lock), \
+ [pop] "d" (__pop), \
+ [val] "r" (0) \
+ ASM_PAD_CONSTRAINT(__pop) \
+ : "memory" \
+ )
+
/*
* Compat enabled kernels have to take the size bit into account to support the
* mixed size use case of gaming emulators. Contrary to the kernel robust unlock
* mechanism all of this does not test for the 32-bit modifier in 32-bit VDSOs
* and in compat disabled kernels. User space can keep the pieces.
*/
-#if defined(CONFIG_X86_64) && !defined(BUILD_VDSO32_64)
-
+#ifdef __x86_64__
#ifdef CONFIG_COMPAT
# define ASM_CLEAR_PTR \
" testl $1, (%[pop]) \n" \
" jz .Lop64 \n" \
" movl $0, (%[pad]) \n" \
- " jmp __vdso_futex_robust_try_unlock_cs_end \n" \
+ " jmp __kernel_futex_robust_try_unlock_cs_end \n" \
".Lop64: \n" \
" movq $0, (%[pad]) \n"
-# define ASM_PAD_CONSTRAINT ,[pad] "S" (((unsigned long)pop) & ~0x1UL)
+# define ASM_PAD_CONSTRAINT(__pop) ,[pad] "S" (((unsigned long)__pop) & ~0x1UL)
+
+__u32 noinline __vdso_futex_robust_try_unlock_64(__u32 *lock, __u32 tid, __u64 *pop)
+{
+ robust_try_unlock_asm(lock, tid, pop);
+ return tid;
+}
+
+__u32 noinline __vdso_futex_robust_try_unlock_32(__u32 *lock, __u32 tid, __u32 *pop)
+{
+ __u64 pop_addr = ((u64) pop) | FUTEX_ROBUST_UNLOCK_MOD_32BIT;
+
+ return __vdso_futex_robust_try_unlock_64(lock, tid, (__u64 *)pop_addr);
+}
+
+__u32 futex_robust_try_unlock_64(__u32 *, __u32, __u64 *)
+ __attribute__((weak, alias("__vdso_futex_robust_try_unlock_64")));
+
+__u32 futex_robust_try_unlock_32(__u32 *, __u32, __u32 *)
+ __attribute__((weak, alias("__vdso_futex_robust_try_unlock_32")));
#else /* CONFIG_COMPAT */
# define ASM_CLEAR_PTR \
" movq $0, (%[pop]) \n"
-# define ASM_PAD_CONSTRAINT
+# define ASM_PAD_CONSTRAINT(__pop)
+
+__u32 noinline __vdso_futex_robust_try_unlock_64(__u32 *lock, __u32 tid, __u64 *pop)
+{
+ robust_try_unlock_asm(lock, tid, pop);
+ return tid;
+}
+
+__u32 futex_robust_try_unlock_64(__u32 *, __u32, __u64 *)
+ __attribute__((weak, alias("__vdso_futex_robust_try_unlock_64")));
#endif /* !CONFIG_COMPAT */
-#else /* CONFIG_X86_64 && !BUILD_VDSO32_64 */
+#else /* __x86_64__ */
# define ASM_CLEAR_PTR \
" movl $0, (%[pad]) \n"
-# define ASM_PAD_CONSTRAINT ,[pad] "S" (((unsigned long)pop) & ~0x1UL)
-
-#endif /* !CONFIG_X86_64 || BUILD_VDSO32_64 */
+# define ASM_PAD_CONSTRAINT(__pop) ,[pad] "S" (((unsigned long)__pop) & ~0x1UL)
-uint32_t __vdso_futex_robust_try_unlock(uint32_t *lock, uint32_t tid, void *pop)
+__u32 noinline __vdso_futex_robust_try_unlock_32(__u32 *lock, __u32 tid, __u32 *pop)
{
- asm volatile (
- ".global __vdso_futex_robust_try_unlock_cs_start \n"
- ".global __vdso_futex_robust_try_unlock_cs_success \n"
- ".global __vdso_futex_robust_try_unlock_cs_end \n"
- " \n"
- " lock cmpxchgl %[val], (%[ptr]) \n"
- " \n"
- "__vdso_futex_robust_try_unlock_cs_start: \n"
- " \n"
- " jnz __vdso_futex_robust_try_unlock_cs_end \n"
- " \n"
- "__vdso_futex_robust_try_unlock_cs_success: \n"
- " \n"
- ASM_CLEAR_PTR
- " \n"
- "__vdso_futex_robust_try_unlock_cs_end: \n"
- : [tid] "+a" (tid)
- : [ptr] "D" (lock),
- [pop] "d" (pop),
- [val] "r" (0)
- ASM_PAD_CONSTRAINT
- : "memory"
- );
+ __u32 pop_addr = ((u32) pop) | FUTEX_ROBUST_UNLOCK_MOD_32BIT;
+ robust_try_unlock_asm(lock, tid, (__u32 *)pop_addr);
return tid;
}
-uint32_t futex_robust_try_unlock(uint32_t *, uint32_t, void **)
- __attribute__((weak, alias("__vdso_futex_robust_try_unlock")));
+__u32 futex_robust_try_unlock_32(__u32 *, __u32, __u32 *)
+ __attribute__((weak, alias("__vdso_futex_robust_try_unlock_32")));
+#endif /* !__x86_64__ */
diff --git a/arch/x86/entry/vdso/vdso32/vdso32.lds.S b/arch/x86/entry/vdso/vdso32/vdso32.lds.S
index b027d2f98bd0..cb7b8de8009c 100644
--- a/arch/x86/entry/vdso/vdso32/vdso32.lds.S
+++ b/arch/x86/entry/vdso/vdso32/vdso32.lds.S
@@ -31,10 +31,10 @@ VERSION
__vdso_clock_getres_time64;
__vdso_getcpu;
#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
- __vdso_futex_robust_try_unlock;
- __vdso_futex_robust_try_unlock_cs_start;
- __vdso_futex_robust_try_unlock_cs_success;
- __vdso_futex_robust_try_unlock_cs_end;
+ __vdso_futex_robust_try_unlock_32;
+ __kernel_futex_robust_try_unlock_cs_start;
+ __kernel_futex_robust_try_unlock_cs_success;
+ __kernel_futex_robust_try_unlock_cs_end;
#endif
};
diff --git a/arch/x86/entry/vdso/vdso64/vdso64.lds.S b/arch/x86/entry/vdso/vdso64/vdso64.lds.S
index e5c0ca9664e1..6dd36ae2ab79 100644
--- a/arch/x86/entry/vdso/vdso64/vdso64.lds.S
+++ b/arch/x86/entry/vdso/vdso64/vdso64.lds.S
@@ -33,10 +33,11 @@ VERSION {
getrandom;
__vdso_getrandom;
#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
- __vdso_futex_robust_try_unlock;
- __vdso_futex_robust_try_unlock_cs_start;
- __vdso_futex_robust_try_unlock_cs_success;
- __vdso_futex_robust_try_unlock_cs_end;
+ __vdso_futex_robust_try_unlock_64;
+ __vdso_futex_robust_try_unlock_32;
+ __kernel_futex_robust_try_unlock_cs_start;
+ __kernel_futex_robust_try_unlock_cs_success;
+ __kernel_futex_robust_try_unlock_cs_end;
#endif
local: *;
};
diff --git a/arch/x86/entry/vdso/vdso64/vdsox32.lds.S b/arch/x86/entry/vdso/vdso64/vdsox32.lds.S
index 4409d97e7ef6..a456f184c937 100644
--- a/arch/x86/entry/vdso/vdso64/vdsox32.lds.S
+++ b/arch/x86/entry/vdso/vdso64/vdsox32.lds.S
@@ -23,7 +23,8 @@ VERSION {
__vdso_time;
__vdso_clock_getres;
#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
- __vdso_futex_robust_try_unlock;
+ __vdso_futex_robust_try_unlock_64;
+ __vdso_futex_robust_try_unlock_32;
__vdso_futex_robust_try_unlock_cs_start;
__vdso_futex_robust_try_unlock_cs_success;
__vdso_futex_robust_try_unlock_cs_end;
diff --git a/include/linux/futex_types.h b/include/linux/futex_types.h
index 223f469789c5..a96293050bf4 100644
--- a/include/linux/futex_types.h
+++ b/include/linux/futex_types.h
@@ -11,13 +11,15 @@ struct futex_pi_state;
struct robust_list_head;
/**
- * struct futex_ctrl - Futex related per task data
+ * struct futex_sched_data - Futex related per task data
* @robust_list: User space registered robust list pointer
* @compat_robust_list: User space registered robust list pointer for compat tasks
+ * @pi_state_list: List head for Priority Inheritance (PI) state management
+ * @pi_state_cache: Pointer to cache one PI state object per task
* @exit_mutex: Mutex for serializing exit
* @state: Futex handling state to handle exit races correctly
*/
-struct futex_ctrl {
+struct futex_sched_data {
struct robust_list_head __user *robust_list;
#ifdef CONFIG_COMPAT
struct compat_robust_list_head __user *compat_robust_list;
@@ -27,9 +29,6 @@ struct futex_ctrl {
struct mutex exit_mutex;
unsigned int state;
};
-#else
-struct futex_ctrl { };
-#endif /* !CONFIG_FUTEX */
/**
* struct futex_mm_data - Futex related per MM data
@@ -71,4 +70,9 @@ struct futex_mm_data {
#endif
};
+#else
+struct futex_sched_data { };
+struct futex_mm_data { };
+#endif /* !CONFIG_FUTEX */
+
#endif /* _LINUX_FUTEX_TYPES_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 266d4859e322..a5d5c0ec3c64 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1329,7 +1329,7 @@ struct task_struct {
u32 rmid;
#endif
- struct futex_ctrl futex;
+ struct futex_sched_data futex;
#ifdef CONFIG_PERF_EVENTS
u8 perf_recursion[PERF_NR_CONTEXTS];
diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
index ab9d89748595..e447eaea63f4 100644
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -26,6 +26,7 @@
#define FUTEX_PRIVATE_FLAG 128
#define FUTEX_CLOCK_REALTIME 256
#define FUTEX_UNLOCK_ROBUST 512
+#define FUTEX_ROBUST_LIST32 1024
#define FUTEX_CMD_MASK ~(FUTEX_PRIVATE_FLAG | FUTEX_CLOCK_REALTIME | FUTEX_UNLOCK_ROBUST)
#define FUTEX_WAIT_PRIVATE (FUTEX_WAIT | FUTEX_PRIVATE_FLAG)
@@ -182,23 +183,6 @@ struct robust_list_head {
#define FUTEX_ROBUST_MOD_PI (0x1UL)
#define FUTEX_ROBUST_MOD_MASK (FUTEX_ROBUST_MOD_PI)
-/*
- * Modifier for FUTEX_ROBUST_UNLOCK uaddr2. Required to distinguish the storage
- * size for the robust_list_head::list_pending_op. This solves two problems:
- *
- * 1) COMPAT tasks
- *
- * 2) The mixed mode magic gaming use case which has both 32-bit and 64-bit
- * robust lists. Oh well....
- *
- * Long story short: 32-bit userspace must set this bit unconditionally to
- * ensure that it can run on a 64-bit kernel in compat mode. If user space
- * screws that up a 64-bit kernel will happily clear the full 64-bits. 32-bit
- * kernels return an error code if the bit is not set.
- */
-#define FUTEX_ROBUST_UNLOCK_MOD_32BIT (0x1UL)
-#define FUTEX_ROBUST_UNLOCK_MOD_MASK (FUTEX_ROBUST_UNLOCK_MOD_32BIT)
-
/*
* bitset with all bits set for the FUTEX_xxx_BITSET OPs to request a
* match of any bit.
diff --git a/include/vdso/futex.h b/include/vdso/futex.h
index 8061bfcb6b92..a768c00b0ada 100644
--- a/include/vdso/futex.h
+++ b/include/vdso/futex.h
@@ -2,12 +2,11 @@
#ifndef _VDSO_FUTEX_H
#define _VDSO_FUTEX_H
-#include <linux/types.h>
-
-struct robust_list;
+#include <uapi/linux/types.h>
/**
- * __vdso_futex_robust_try_unlock - Try to unlock an uncontended robust futex
+ * __vdso_futex_robust_try_unlock_64 - Try to unlock an uncontended robust futex
+ * with a 64-bit op pointer
* @lock: Pointer to the futex lock object
* @tid: The TID of the calling task
* @op: Pointer to the task's robust_list_head::list_pending_op
@@ -39,6 +38,23 @@ struct robust_list;
* @uaddr2 argument for sys_futex(FUTEX_ROBUST_UNLOCK) operations. See the
* modifier and the related documentation in include/uapi/linux/futex.h
*/
-uint32_t __vdso_futex_robust_try_unlock(uint32_t *lock, uint32_t tid, void *op);
+__u32 __vdso_futex_robust_try_unlock_64(__u32 *lock, __u32 tid, __u64 *op);
+
+/**
+ * __vdso_futex_robust_try_unlock_32 - Try to unlock an uncontended robust futex
+ * with a 32-bit op pointer
+ * @lock: Pointer to the futex lock object
+ * @tid: The TID of the calling task
+ * @op: Pointer to the task's robust_list_head::list_pending_op
+ *
+ * Return: The content of *@lock. On success this is the same as @tid.
+ *
+ * Same as __vdso_futex_robust_try_unlock_64() just with a 32-bit @op pointer.
+ */
+__u32 __vdso_futex_robust_try_unlock_32(__u32 *lock, __u32 tid, __u32 *op);
+
+/* Modifier to convey the size of the op pointer */
+#define FUTEX_ROBUST_UNLOCK_MOD_32BIT (0x1UL)
+#define FUTEX_ROBUST_UNLOCK_MOD_MASK (FUTEX_ROBUST_UNLOCK_MOD_32BIT)
#endif
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 7957edd46b89..39041cf94522 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -46,6 +46,8 @@
#include <linux/slab.h>
#include <linux/vmalloc.h>
+#include <vdso/futex.h>
+
#include "futex.h"
#include "../locking/rtmutex_common.h"
@@ -1434,17 +1436,9 @@ static void exit_pi_state_list(struct task_struct *curr)
static inline void exit_pi_state_list(struct task_struct *curr) { }
#endif
-static inline bool mask_pop_addr(void __user **pop)
-{
- unsigned long addr = (unsigned long)*pop;
-
- *pop = (void __user *) (addr & ~FUTEX_ROBUST_UNLOCK_MOD_MASK);
- return !!(addr & FUTEX_ROBUST_UNLOCK_MOD_32BIT);
-}
-
-bool futex_robust_list_clear_pending(void __user *pop)
+bool futex_robust_list_clear_pending(void __user *pop, unsigned int flags)
{
- bool size32bit = mask_pop_addr(&pop);
+ bool size32bit = !!(flags & FLAGS_ROBUST_LIST32);
if (!IS_ENABLED(CONFIG_64BIT) && !size32bit)
return false;
@@ -1456,15 +1450,28 @@ bool futex_robust_list_clear_pending(void __user *pop)
}
#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
+static inline bool mask_pop_addr(void __user **pop)
+{
+ unsigned long addr = (unsigned long)*pop;
+
+ *pop = (void __user *) (addr & ~FUTEX_ROBUST_UNLOCK_MOD_MASK);
+ return !!(addr & FUTEX_ROBUST_UNLOCK_MOD_32BIT);
+}
+
void __futex_fixup_robust_unlock(struct pt_regs *regs)
{
+ unsigned int flags = 0;
void __user *pop;
if (!arch_futex_needs_robust_unlock_fixup(regs))
return;
pop = arch_futex_robust_unlock_get_pop(regs);
- futex_robust_list_clear_pending(pop);
+
+ if (mask_pop_addr(&pop))
+ flags = FUTEX_ROBUST_UNLOCK_MOD_32BIT;
+
+ futex_robust_list_clear_pending(pop, flags);
}
#endif /* CONFIG_FUTEX_ROBUST_UNLOCK */
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index b1aaa90f1779..31a5bae8b470 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -41,6 +41,7 @@
#define FLAGS_STRICT 0x0100
#define FLAGS_MPOL 0x0200
#define FLAGS_UNLOCK_ROBUST 0x0400
+#define FLAGS_ROBUST_LIST32 0x0800
/* FUTEX_ to FLAGS_ */
static inline unsigned int futex_to_flags(unsigned int op)
@@ -56,6 +57,9 @@ static inline unsigned int futex_to_flags(unsigned int op)
if (op & FUTEX_UNLOCK_ROBUST)
flags |= FLAGS_UNLOCK_ROBUST;
+ if (op & FUTEX_ROBUST_LIST32)
+ flags |= FLAGS_ROBUST_LIST32;
+
return flags;
}
@@ -452,6 +456,6 @@ extern int futex_unlock_pi(u32 __user *uaddr, unsigned int flags, void __user *p
extern int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int trylock);
-bool futex_robust_list_clear_pending(void __user *pop);
+bool futex_robust_list_clear_pending(void __user *pop, unsigned int flags);
#endif /* _FUTEX_H */
diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index b8c76b6242e4..05ca360a7a30 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -1298,7 +1298,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags, void __user *pop)
if (ret || !(flags & FLAGS_UNLOCK_ROBUST))
return ret;
- if (!futex_robust_list_clear_pending(pop))
+ if (!futex_robust_list_clear_pending(pop, flags))
return -EFAULT;
return 0;
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index 45effcf42961..233f38b1f52e 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -166,7 +166,7 @@ static bool futex_robust_unlock(u32 __user *uaddr, unsigned int flags, void __us
* deeper trouble as the robust list head is usually part of TLS. The
* chance of survival is close to zero.
*/
- return futex_robust_list_clear_pending(pop);
+ return futex_robust_list_clear_pending(pop, flags);
}
/*
On Tue, Mar 17 2026 at 23:32, Thomas Gleixner wrote:
> On Tue, Mar 17 2026 at 11:37, Florian Weimer wrote:
> Something like the below compiled but untested delta diff which includes
> also the other unrelated feedback fixups?
The more I look into it, the more I regret that we allowed mixed mode in
the first place. Which way I turn it around the code becomes more
horrible than it really should be.
If I understand it correctly then the only real world use case is the
x86 emulator for ARM64. That made me think about possible options to
keep the insanity restricted.
1) Delegate the problem to the ARM64 people :)
2) Make that mixed size mode depend on a config option
3) Require that such a use case issues a prctl to switch into that
special case mode.
or a combination of those.
Andre?
Em 18/03/2026 19:08, Thomas Gleixner escreveu: > > 2) Make that mixed size mode depend on a config option > > 3) Require that such a use case issues a prctl to switch into that > special case mode. > > or a combination of those. > > Andre? > Those two last options works for me, if it helps to make the code more readable. However, I think that QEMU might be interested in those features as well :) I'm going to ping them
On Wed, Mar 18 2026 at 23:05, André Almeida wrote:
> Em 18/03/2026 19:08, Thomas Gleixner escreveu:
>>
>> 2) Make that mixed size mode depend on a config option
>>
>> 3) Require that such a use case issues a prctl to switch into that
>> special case mode.
>>
>> or a combination of those.
>>
>> Andre?
>>
>
> Those two last options works for me, if it helps to make the code more
> readable. However, I think that QEMU might be interested in those
> features as well :) I'm going to ping them
I already came up with something. It makes the fixup range larger as it
has to cover two functions and then pick the right one.
So the range check becomes:
if (likely(!ip_within(regs, mm->futex.cs_start, mm->futex.cs_end)))
return;
if (likely(!mm->futex.cs_multi)) {
fixup(regs, NULL);
return;
}
csr = mm->futex.cs_ranges;
for (range = 0; range < mm->futex.cs_multi; range++, csr++) {
if (ip_within(regs, csr->cs_start, csr->cs_end)) {
fixup(regs, csr);
return;
}
}
Or something daft like that.
That makes the multi CS range check generic and still optimizes for the
single entry case. The ASM functions become minimal w/o extra pointer
size conditionals.
Thanks,
tglx
On Wed, Mar 18, 2026 at 11:08:26PM +0100, Thomas Gleixner wrote: > On Tue, Mar 17 2026 at 23:32, Thomas Gleixner wrote: > > On Tue, Mar 17 2026 at 11:37, Florian Weimer wrote: > > Something like the below compiled but untested delta diff which includes > > also the other unrelated feedback fixups? > > The more I look into it, the more I regret that we allowed mixed mode in > the first place. Which way I turn it around the code becomes more > horrible than it really should be. > > If I understand it correctly then the only real world use case is the > x86 emulator for ARM64. That made me think about possible options to > keep the insanity restricted. > > 1) Delegate the problem to the ARM64 people :) > > 2) Make that mixed size mode depend on a config option > > 3) Require that such a use case issues a prctl to switch into that > special case mode. > > or a combination of those. > > Andre? Well, he has this patch set that creates multiple lists, which would nicely solve things for FEX I recon.
On Mon, Mar 16, 2026 at 06:13:34PM +0100, Thomas Gleixner wrote:
(...)
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -237,6 +237,7 @@ config X86
> select HAVE_EFFICIENT_UNALIGNED_ACCESS
> select HAVE_EISA if X86_32
> select HAVE_EXIT_THREAD
> + select HAVE_FUTEX_ROBUST_UNLOCK
> select HAVE_GENERIC_TIF_BITS
> select HAVE_GUP_FAST
> select HAVE_FENTRY if X86_64 || DYNAMIC_FTRACE
> --- /dev/null
> +++ b/arch/x86/entry/vdso/common/vfutex.c
> @@ -0,0 +1,72 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +#include <vdso/futex.h>
> +
> +/*
> + * Compat enabled kernels have to take the size bit into account to support the
> + * mixed size use case of gaming emulators. Contrary to the kernel robust unlock
> + * mechanism all of this does not test for the 32-bit modifier in 32-bit VDSOs
> + * and in compat disabled kernels. User space can keep the pieces.
> + */
> +#if defined(CONFIG_X86_64) && !defined(BUILD_VDSO32_64)
#ifndef __x86_64__ ?
> +
> +#ifdef CONFIG_COMPAT
> +
> +# define ASM_CLEAR_PTR \
> + " testl $1, (%[pop]) \n" \
> + " jz .Lop64 \n" \
> + " movl $0, (%[pad]) \n" \
> + " jmp __vdso_futex_robust_try_unlock_cs_end \n" \
> + ".Lop64: \n" \
> + " movq $0, (%[pad]) \n"
> +
> +# define ASM_PAD_CONSTRAINT ,[pad] "S" (((unsigned long)pop) & ~0x1UL)
> +
> +#else /* CONFIG_COMPAT */
> +
> +# define ASM_CLEAR_PTR \
> + " movq $0, (%[pop]) \n"
> +
> +# define ASM_PAD_CONSTRAINT
> +
> +#endif /* !CONFIG_COMPAT */
> +
> +#else /* CONFIG_X86_64 && !BUILD_VDSO32_64 */
> +
> +# define ASM_CLEAR_PTR \
> + " movl $0, (%[pad]) \n"
> +
> +# define ASM_PAD_CONSTRAINT ,[pad] "S" (((unsigned long)pop) & ~0x1UL)
> +
> +#endif /* !CONFIG_X86_64 || BUILD_VDSO32_64 */
> +
> +uint32_t __vdso_futex_robust_try_unlock(uint32_t *lock, uint32_t tid, void *pop)
While uint32_t is originally a userspace type, in the kernel it also pulls in
other internal types which are problematic in the vDSO. __u32 from
uapi/linux/types.h avoid this issue.
(...)
> --- a/arch/x86/entry/vdso/vdso64/vdsox32.lds.S
> +++ b/arch/x86/entry/vdso/vdso64/vdsox32.lds.S
> @@ -22,6 +22,12 @@ VERSION {
> __vdso_getcpu;
> __vdso_time;
> __vdso_clock_getres;
> +#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
> + __vdso_futex_robust_try_unlock;
> + __vdso_futex_robust_try_unlock_cs_start;
> + __vdso_futex_robust_try_unlock_cs_success;
> + __vdso_futex_robust_try_unlock_cs_end;
These three symbols are not meant to be used from outside the vDSO
implementation, so they don't need to be exported by the linkerscripts.
> +#endif
> local: *;
> };
> }
(...)
On Tue, Mar 17 2026 at 08:25, Thomas Weißschuh wrote:
> On Mon, Mar 16, 2026 at 06:13:34PM +0100, Thomas Gleixner wrote:
>> +/*
>> + * Compat enabled kernels have to take the size bit into account to support the
>> + * mixed size use case of gaming emulators. Contrary to the kernel robust unlock
>> + * mechanism all of this does not test for the 32-bit modifier in 32-bit VDSOs
>> + * and in compat disabled kernels. User space can keep the pieces.
>> + */
>> +#if defined(CONFIG_X86_64) && !defined(BUILD_VDSO32_64)
>
> #ifndef __x86_64__ ?
#ifdef :)
Just had to double check and convince myself that __x86_64__ is set when
building for the X86_X32 ABI. Seems to work.
>> +uint32_t __vdso_futex_robust_try_unlock(uint32_t *lock, uint32_t tid, void *pop)
>
> While uint32_t is originally a userspace type, in the kernel it also pulls in
> other internal types which are problematic in the vDSO. __u32 from
> uapi/linux/types.h avoid this issue.
Sure.
>> --- a/arch/x86/entry/vdso/vdso64/vdsox32.lds.S
>> +++ b/arch/x86/entry/vdso/vdso64/vdsox32.lds.S
>> @@ -22,6 +22,12 @@ VERSION {
>> __vdso_getcpu;
>> __vdso_time;
>> __vdso_clock_getres;
>> +#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
>> + __vdso_futex_robust_try_unlock;
>
>> + __vdso_futex_robust_try_unlock_cs_start;
>> + __vdso_futex_robust_try_unlock_cs_success;
>> + __vdso_futex_robust_try_unlock_cs_end;
>
> These three symbols are not meant to be used from outside the vDSO
> implementation, so they don't need to be exported by the linkerscripts.
You are right in principle and I had it differently in the first place
until I realized that exposing them is useful for debugging and
validation purposes.
As I pointed out in the cover letter you can use GDB to actually verify
the fixup magic. That makes it obviously possible to write a user space
selftest without requiring to decode the internals of the VDSO.
Due to my pretty limited userspace DSO knowledge that was the best I
came up with. If you have a better idea, please let me know.
Thanks,
tglx
On Tue, Mar 17, 2026 at 10:51:47AM +0100, Thomas Gleixner wrote:
> On Tue, Mar 17 2026 at 08:25, Thomas Weißschuh wrote:
> > On Mon, Mar 16, 2026 at 06:13:34PM +0100, Thomas Gleixner wrote:
> >> +/*
> >> + * Compat enabled kernels have to take the size bit into account to support the
> >> + * mixed size use case of gaming emulators. Contrary to the kernel robust unlock
> >> + * mechanism all of this does not test for the 32-bit modifier in 32-bit VDSOs
> >> + * and in compat disabled kernels. User space can keep the pieces.
> >> + */
> >> +#if defined(CONFIG_X86_64) && !defined(BUILD_VDSO32_64)
> >
> > #ifndef __x86_64__ ?
>
> #ifdef :)
Indeed :-)
> Just had to double check and convince myself that __x86_64__ is set when
> building for the X86_X32 ABI. Seems to work.
Afaik this is mandated by the x32 ABI. Together with __ILP32__.
In any case it doesn't matter as the x32 vDSO is not actually built but
instead is a copy of the x86_64 one with its elf type patched around.
(...)
> >> --- a/arch/x86/entry/vdso/vdso64/vdsox32.lds.S
> >> +++ b/arch/x86/entry/vdso/vdso64/vdsox32.lds.S
> >> @@ -22,6 +22,12 @@ VERSION {
> >> __vdso_getcpu;
> >> __vdso_time;
> >> __vdso_clock_getres;
> >> +#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
> >> + __vdso_futex_robust_try_unlock;
> >
> >> + __vdso_futex_robust_try_unlock_cs_start;
> >> + __vdso_futex_robust_try_unlock_cs_success;
> >> + __vdso_futex_robust_try_unlock_cs_end;
> >
> > These three symbols are not meant to be used from outside the vDSO
> > implementation, so they don't need to be exported by the linkerscripts.
>
> You are right in principle and I had it differently in the first place
> until I realized that exposing them is useful for debugging and
> validation purposes.
>
> As I pointed out in the cover letter you can use GDB to actually verify
> the fixup magic. That makes it obviously possible to write a user space
> selftest without requiring to decode the internals of the VDSO.
>
> Due to my pretty limited userspace DSO knowledge that was the best I
> came up with. If you have a better idea, please let me know.
I would have expected GDB to be able to use the separate vDSO debugging
symbols to find these symbols. So far I was not able to make it work,
but I blame my limited GDB knowledge.
Or move the symbols into a dedicated version to make clear that this is
not a stable interface.
Thomas
On Tue, Mar 17 2026 at 12:17, Thomas Weißschuh wrote: >> Due to my pretty limited userspace DSO knowledge that was the best I >> came up with. If you have a better idea, please let me know. > > I would have expected GDB to be able to use the separate vDSO debugging > symbols to find these symbols. So far I was not able to make it work, > but I blame my limited GDB knowledge. I got it "working" by manually loading vdso64.so.dbg at the right offset, which only took about 10 attempts to get it right. Then you can use actual local symbols. vdso2c picks them up correctly too.
On Wed, Mar 18, 2026 at 05:17:38PM +0100, Thomas Gleixner wrote: > On Tue, Mar 17 2026 at 12:17, Thomas Weißschuh wrote: > >> Due to my pretty limited userspace DSO knowledge that was the best I > >> came up with. If you have a better idea, please let me know. > > > > I would have expected GDB to be able to use the separate vDSO debugging > > symbols to find these symbols. So far I was not able to make it work, > > but I blame my limited GDB knowledge. > > I got it "working" by manually loading vdso64.so.dbg at the right > offset, which only took about 10 attempts to get it right. Then you can > use actual local symbols. > > vdso2c picks them up correctly too. What also works is to have GDB look up the debug symbols through their debug ids. At this point the load address of the vDSO is already known. $ make vdso_install INSTALL_MOD_PATH=$SOME_DIRECTORY $ gdb -ex "set debug-file-directory $SOME_DIRECTORY/lib/modules/$(uname -r)/vdso" $BINARY Depending on the distribution the vDSO from the kernel package might already be set up to be found automatically. Maybe we could add a helper to scripts/gdb/ which uses $(vdso-install-y) to either populate a debug-file-directory automatically or hook into the GDB lookup process to avoid these manual steps. Thomas
On 2026-03-19 08:41:47 [+0100], Thomas Weißschuh wrote: > > vdso2c picks them up correctly too. > > What also works is to have GDB look up the debug symbols through their > debug ids. At this point the load address of the vDSO is already known. > > $ make vdso_install INSTALL_MOD_PATH=$SOME_DIRECTORY > $ gdb -ex "set debug-file-directory $SOME_DIRECTORY/lib/modules/$(uname -r)/vdso" $BINARY > > Depending on the distribution the vDSO from the kernel package might already > be set up to be found automatically. > > Maybe we could add a helper to scripts/gdb/ which uses $(vdso-install-y) > to either populate a debug-file-directory automatically or hook into the GDB > lookup process to avoid these manual steps. Is this a complete vdso.so as mapped in process or just the debug symbols or both? Looking at my Debian thingy this seems to be there as of https://packages.debian.org/sid/amd64/linux-image-6.19.8+deb14-amd64-dbg/filelist | /usr/lib/debug/lib/modules/6.19.8+deb14-amd64/vdso/vdso32.so | /usr/lib/debug/lib/modules/6.19.8+deb14-amd64/vdso/vdso64.so | /usr/lib/debug/lib/modules/6.19.8+deb14-amd64/vdso/vdsox32.so | /usr/lib/debug/lib/modules/6.19.8+deb14-amd64/vmlinux or do we talk about other things? Usually there is -dbgsym with the stripped out debug symbols under /usr/lib/debug/.build-id/ but the kernel seems different. > Thomas Sebastian
On Thu, Mar 19, 2026 at 11:36:04AM +0100, Sebastian Andrzej Siewior wrote: > On 2026-03-19 08:41:47 [+0100], Thomas Weißschuh wrote: > > > vdso2c picks them up correctly too. > > > > What also works is to have GDB look up the debug symbols through their > > debug ids. At this point the load address of the vDSO is already known. > > > > $ make vdso_install INSTALL_MOD_PATH=$SOME_DIRECTORY > > $ gdb -ex "set debug-file-directory $SOME_DIRECTORY/lib/modules/$(uname -r)/vdso" $BINARY > > > > Depending on the distribution the vDSO from the kernel package might already > > be set up to be found automatically. > > > > Maybe we could add a helper to scripts/gdb/ which uses $(vdso-install-y) > > to either populate a debug-file-directory automatically or hook into the GDB > > lookup process to avoid these manual steps. > > Is this a complete vdso.so as mapped in process or just the debug > symbols or both? $(vdso-install-y) references full vDSO images including executable code and debug symbols. The one mapped into userspace is stripped. > Looking at my Debian thingy this seems to be there as of > https://packages.debian.org/sid/amd64/linux-image-6.19.8+deb14-amd64-dbg/filelist > > | /usr/lib/debug/lib/modules/6.19.8+deb14-amd64/vdso/vdso32.so > | /usr/lib/debug/lib/modules/6.19.8+deb14-amd64/vdso/vdso64.so > | /usr/lib/debug/lib/modules/6.19.8+deb14-amd64/vdso/vdsox32.so > | /usr/lib/debug/lib/modules/6.19.8+deb14-amd64/vmlinux > > or do we talk about other things? Usually there is -dbgsym with the > stripped out debug symbols under /usr/lib/debug/.build-id/ but the > kernel seems different. $ dpkg -L linux-image-6.12.74+deb13+1-amd64-dbg | grep -e vdso/vdso -e /usr/lib/debug/.build-id/ /usr/lib/debug/lib/modules/6.12.74+deb13+1-amd64/vdso/vdso32.so /usr/lib/debug/lib/modules/6.12.74+deb13+1-amd64/vdso/vdso64.so /usr/lib/debug/lib/modules/6.12.74+deb13+1-amd64/vdso/vdsox32.so /usr/lib/debug/.build-id/4a/bb1230e4abe0e2d856e1a304b392831ab7a8e1.debug /usr/lib/debug/.build-id/5f/a3a3ed11e017bcc765ade0997821383a7d4df8.debug /usr/lib/debug/.build-id/6e/d4f6c60913c24e158bbdfd680b8a1c1b07d8a4.debug $ readlink /usr/lib/debug/.build-id/4a/bb1230e4abe0e2d856e1a304b392831ab7a8e1.debug ../../lib/modules/6.12.74+deb13+1-amd64/vdso/vdso32.so This looks as expected to me. It doesn't help for development kernels, though. Also kbuild 'make deb-pkg' does *not* package these, see also [0]. One thing to note is that Debian does not follow the 'make vdso_install' layout. Not that this would be a requirement or anything. [0] https://lore.kernel.org/lkml/20260318-kbuild-pacman-vdso-install-v1-1-48ceb31c0e80@weissschuh.net/ Thomas
On 2026-03-19 11:49:37 [+0100], Thomas Weißschuh wrote: > $ readlink /usr/lib/debug/.build-id/4a/bb1230e4abe0e2d856e1a304b392831ab7a8e1.debug > ../../lib/modules/6.12.74+deb13+1-amd64/vdso/vdso32.so > > This looks as expected to me. It doesn't help for development kernels, though. Good to know. > Also kbuild 'make deb-pkg' does *not* package these, see also [0]. This is not used by Debian kernel building package but by people building a Debian package. If it is useful maybe add it. > One thing to note is that Debian does not follow the 'make vdso_install' layout. > Not that this would be a requirement or anything. Okay. Well, if it needs to change or anything I know who to poke. > Thomas Sebastian
* Thomas Weißschuh: > On Wed, Mar 18, 2026 at 05:17:38PM +0100, Thomas Gleixner wrote: >> On Tue, Mar 17 2026 at 12:17, Thomas Weißschuh wrote: >> >> Due to my pretty limited userspace DSO knowledge that was the best I >> >> came up with. If you have a better idea, please let me know. >> > >> > I would have expected GDB to be able to use the separate vDSO debugging >> > symbols to find these symbols. So far I was not able to make it work, >> > but I blame my limited GDB knowledge. >> >> I got it "working" by manually loading vdso64.so.dbg at the right >> offset, which only took about 10 attempts to get it right. Then you can >> use actual local symbols. >> >> vdso2c picks them up correctly too. > > What also works is to have GDB look up the debug symbols through their > debug ids. At this point the load address of the vDSO is already known. > > $ make vdso_install INSTALL_MOD_PATH=$SOME_DIRECTORY > $ gdb -ex "set debug-file-directory $SOME_DIRECTORY/lib/modules/$(uname -r)/vdso" $BINARY > > Depending on the distribution the vDSO from the kernel package might already > be set up to be found automatically. > > Maybe we could add a helper to scripts/gdb/ which uses $(vdso-install-y) > to either populate a debug-file-directory automatically or hook into the GDB > lookup process to avoid these manual steps. If the the /lib/modules/$(uname -r)/vdso/vdsoNN.so path is standard, we can use it in the link map, at the expense of an additional system call during process startup. It isn't just a performance cost. We've seen seccomp filters that kill the process on uname calls. Thanks, Florian
On Thu, Mar 19, 2026 at 09:53:53AM +0100, Florian Weimer wrote: > If the the /lib/modules/$(uname -r)/vdso/vdsoNN.so path is standard, we > can use it in the link map, at the expense of an additional system call > during process startup. It isn't just a performance cost. We've seen > seccomp filters that kill the process on uname calls. I've always though it would be a good idea to expose vdsoNN.so somewhere in our virtual filesystem, either /proc or /sys or whatever.
On Thu, Mar 19 2026 at 10:08, Peter Zijlstra wrote: > On Thu, Mar 19, 2026 at 09:53:53AM +0100, Florian Weimer wrote: >> If the the /lib/modules/$(uname -r)/vdso/vdsoNN.so path is standard, we >> can use it in the link map, at the expense of an additional system call >> during process startup. It isn't just a performance cost. We've seen >> seccomp filters that kill the process on uname calls. > > I've always though it would be a good idea to expose vdsoNN.so somewhere > in our virtual filesystem, either /proc or /sys or whatever. Yes, exposing this through a virtual fs would be really sensible. It's not a lot of data. VDSO debug info is ~40K extra per vdsoNN.so and vdsoNN.so.dbg compresses to ~25k with zstd. Converting that whole thing to a binary blob and bake it into the kernel is nor rocket science. Thanks, tglx
On Thu, Mar 19, 2026 at 09:53:53AM +0100, Florian Weimer wrote: > * Thomas Weißschuh: > > > On Wed, Mar 18, 2026 at 05:17:38PM +0100, Thomas Gleixner wrote: > >> On Tue, Mar 17 2026 at 12:17, Thomas Weißschuh wrote: > >> >> Due to my pretty limited userspace DSO knowledge that was the best I > >> >> came up with. If you have a better idea, please let me know. > >> > > >> > I would have expected GDB to be able to use the separate vDSO debugging > >> > symbols to find these symbols. So far I was not able to make it work, > >> > but I blame my limited GDB knowledge. > >> > >> I got it "working" by manually loading vdso64.so.dbg at the right > >> offset, which only took about 10 attempts to get it right. Then you can > >> use actual local symbols. > >> > >> vdso2c picks them up correctly too. > > > > What also works is to have GDB look up the debug symbols through their > > debug ids. At this point the load address of the vDSO is already known. > > > > $ make vdso_install INSTALL_MOD_PATH=$SOME_DIRECTORY > > $ gdb -ex "set debug-file-directory $SOME_DIRECTORY/lib/modules/$(uname -r)/vdso" $BINARY > > > > Depending on the distribution the vDSO from the kernel package might already > > be set up to be found automatically. > > > > Maybe we could add a helper to scripts/gdb/ which uses $(vdso-install-y) > > to either populate a debug-file-directory automatically or hook into the GDB > > lookup process to avoid these manual steps. > > If the the /lib/modules/$(uname -r)/vdso/vdsoNN.so path is standard, we > can use it in the link map, at the expense of an additional system call > during process startup. It isn't just a performance cost. We've seen > seccomp filters that kill the process on uname calls. It's not a standard, only what 'make vdso_install' ends up doing and some package managers are copying. Also the file is not guaranteed to be there, as it is only part of the kernel debug packages, if packaged at all. So I don't think it makes sense to integrate it into low-level system components. For debugging, the distros can already make the file available through /usr/lib/debug/ so it works out-of-the box. (As done by Debian) Thomas
On 2026-03-16 13:13, Thomas Gleixner wrote:
> When the FUTEX_ROBUST_UNLOCK mechanism is used for unlocking (PI-)futexes,
> then the unlock sequence in userspace looks like this:
>
> 1) robust_list_set_op_pending(mutex);
> 2) robust_list_remove(mutex);
>
> lval = gettid();
> 3) if (atomic_try_cmpxchg(&mutex->lock, lval, 0))
> 4) robust_list_clear_op_pending();
> else
> 5) sys_futex(OP,...FUTEX_ROBUST_UNLOCK);
[...]
>
> When then the original task exits before reaching #6 then the kernel robust
> list handling observes the pending op entry and tries to fix up user space.
There is no #6.
[...]
> +
> +uint32_t __vdso_futex_robust_try_unlock(uint32_t *lock, uint32_t tid, void *pop)
I'm not sure I see the link between "list_pending_op" and @pop ?
> +{
> + asm volatile (
> + ".global __vdso_futex_robust_try_unlock_cs_start \n"
> + ".global __vdso_futex_robust_try_unlock_cs_success \n"
> + ".global __vdso_futex_robust_try_unlock_cs_end \n"
Those .global are fragile: they depend on making sure the compiler does
not emit those symbols more than once per compile unit (due to
optimizations).
I understand that you want to skip the "iteration on a section" within
the kernel fast path, and I agree with that intent, but I think we can
achieve that goal with more robustness by:
- emitting those as triplets within a dedicated section,
- validating that the section only contains a single triplet within the
vdso2c script.
This would fail immediately in the vsdo2c script if the compiler does
not do as expected rather than silently fail to cover part of the
emitted code range.
[...]
> + : [tid] "+a" (tid)
Could use a few comments:
- "tid" sits in eax.
> + : [ptr] "D" (lock),
- "lock" sits in edi/rdi.
> + [pop] "d" (pop),
This constraint puts the unmasked "pop" pointer into edx/rdx.
> + [val] "r" (0)
> + ASM_PAD_CONSTRAINT
The masked "pop" pointer sits in esi/rsi.
[...]
> + * disabled or this is a 32-bit kernel then ZF is authoritive no matter
authoritative
> +
> +static __always_inline void __user *x86_futex_robust_unlock_get_pop(struct pt_regs *regs)
> +{
> + return (void __user *)regs->dx;
When userspace is compat 32-bit, with a 64-bit kernel, are we sure that
the 32 upper bits are cleared ? If not can we rely on
compat_robust_list_clear_pending to ignore those top bits in
put_user(0U, pop) ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On Mon, Mar 16 2026 at 15:19, Mathieu Desnoyers wrote:
> On 2026-03-16 13:13, Thomas Gleixner wrote:
>> +{
>> + asm volatile (
>> + ".global __vdso_futex_robust_try_unlock_cs_start \n"
>> + ".global __vdso_futex_robust_try_unlock_cs_success \n"
>> + ".global __vdso_futex_robust_try_unlock_cs_end \n"
>
> Those .global are fragile: they depend on making sure the compiler does
> not emit those symbols more than once per compile unit (due to
> optimizations).
This is a single global function which contains the unique ASM code with
those unique symbols. The unique compilation unit containing it ends up
in the VDSO "library".
Q: Which optimizations would cause the compiler to emit them more than
once?
A: None.
If that happens then the compiler is seriously broken and the resulting
VDSO trainwreck is the least of your worries.
That would be equivalent to a single global C function in a unique
compilation unit being emitted more than once.
So what is fragile about that?
Thanks,
tglx
On Mon, Mar 16 2026 at 15:19, Mathieu Desnoyers wrote:
>> +uint32_t __vdso_futex_robust_try_unlock(uint32_t *lock, uint32_t tid, void *pop)
>
> I'm not sure I see the link between "list_pending_op" and @pop ?
What so hard to understand about that? The function prototype in
include/vdso/futex.h is extensively documented.
> [...]
>> + : [tid] "+a" (tid)
>
> Could use a few comments:
>
> - "tid" sits in eax.
If someone needs a comment to understand the constraint, then that
persone definitely should not touch the code. I'm all for documentation
and comments, but documenting the obvious is not useful at all. See:
https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#comment-style
I'm amazed that you complain about these obvious details and not about
the actual lack of a general comment which explains the actual inner
workings of that ASM maze. That would be actually useful for a
unexperienced reader. Interesting preference.
Thanks,
tglx
On Mon, Mar 16 2026 at 15:19, Mathieu Desnoyers wrote:
> On 2026-03-16 13:13, Thomas Gleixner wrote:
>> +
>> +static __always_inline void __user *x86_futex_robust_unlock_get_pop(struct pt_regs *regs)
>> +{
>> + return (void __user *)regs->dx;
>
> When userspace is compat 32-bit, with a 64-bit kernel, are we sure that
> the 32 upper bits are cleared ? If not can we rely on
> compat_robust_list_clear_pending to ignore those top bits in
> put_user(0U, pop) ?
Which compat version are you talking about?
1) A 32-bit application which truly runs as compat
2) A 64-bit application which uses both variants and invokes the
64-bit VDSO from a 32-bit program segment
#1 is inherently safe. The 32-bit application uses the compat 32-bit VDSO
which only accesses the lower half of the registers. So the mov $ptr,
%edx results in zero extending the 32-bit value. From the SDM:
"32-bit operands generate a 32-bit result, zero-extended to a
64-bit result in the destination general-purpose register."
The exception/interrupt entry switches into 64-bit mode, but due to
the above the upper 32 bit are 0. So it's safe to just blindly use
regs->dx.
Otherwise it would be pretty impossible to run 32-bit user space on a
64-bit kernel.
#2 can really be assumed to be safe as there must be some magic
translation in the emulation code which handles the different calling
conventions.
That's not any different when 32-bit code which runs in the context
of a 64-bit application invokes a syscall or a library function.
If that goes wrong, then it's not a kernel problem because the
application explicitely tells the kernel to corrupt it's own memory.
The golden rule of UNIX applies here as always:
Do what user space asked for unless it results in a boundary
violation which can't be achieved by user space itself.
IOW, let user space shoot itself into the foot when it desires to
do so.
Thanks,
tglx
On 2026-03-16 17:02, Thomas Gleixner wrote:
> On Mon, Mar 16 2026 at 15:19, Mathieu Desnoyers wrote:
>> On 2026-03-16 13:13, Thomas Gleixner wrote:
>>> +
>>> +static __always_inline void __user *x86_futex_robust_unlock_get_pop(struct pt_regs *regs)
>>> +{
>>> + return (void __user *)regs->dx;
>>
>> When userspace is compat 32-bit, with a 64-bit kernel, are we sure that
>> the 32 upper bits are cleared ? If not can we rely on
>> compat_robust_list_clear_pending to ignore those top bits in
>> put_user(0U, pop) ?
>
> Which compat version are you talking about?
>
> 1) A 32-bit application which truly runs as compat
>
> 2) A 64-bit application which uses both variants and invokes the
> 64-bit VDSO from a 32-bit program segment
>
> #1 is inherently safe. The 32-bit application uses the compat 32-bit VDSO
> which only accesses the lower half of the registers. So the mov $ptr,
> %edx results in zero extending the 32-bit value. From the SDM:
>
> "32-bit operands generate a 32-bit result, zero-extended to a
> 64-bit result in the destination general-purpose register."
Ah, very well, this is the important piece I was missing.
>
> The exception/interrupt entry switches into 64-bit mode, but due to
> the above the upper 32 bit are 0. So it's safe to just blindly use
> regs->dx.
OK.
[...]
> #2 can really be assumed to be safe as there must be some magic
> translation in the emulation code which handles the different calling
> conventions.
[...]
Sounds good,
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
© 2016 - 2026 Red Hat, Inc.