[PATCH v2 0/3] futex: Create set_robust_list2
Posted by André Almeida 3 weeks, 2 days ago
This patchset adds a new set_robust_list2() syscall. The current syscall
can't be extended to cover the following use case, so a new one is
needed. This new syscall allows users to set multiple robust lists per
process and to have either 32-bit or 64-bit pointers in the list.

* Use case

FEX-Emu[1] is an application that runs x86 and x86-64 binaries on an
AArch64 Linux host. One of the tasks of FEX-Emu is to translate syscalls
from one platform to another. The existing set_robust_list() can't be
easily translated because of two limitations:

1) x86 apps can have robust lists with 32-bit pointers. For an x86-64
   kernel this is not a problem, because of the compat entry point. But
   there's no such compat entry point for AArch64, so the kernel would
   do the pointer arithmetic wrongly. It is also unviable for userspace
   to track every addition/removal to the robust list and keep a 64-bit
   version of it somewhere else to feed to the kernel. Thus, the new
   interface has an option for telling the kernel whether the list is
   filled with 32-bit or 64-bit pointers.

2) Apps can set just one robust list (in theory, x86-64 apps can set
   two if they also use the compat entry point). That means that when
   an x86 app asks FEX-Emu to call set_robust_list(), FEX has two
   options: overwrite its own robust list pointer and make the app
   robust, or ignore the app's robust list and keep the emulator
   robust. The new interface allows multiple robust lists per
   application, solving this.

* Interface

This is the proposed interface:

	long set_robust_list2(void *head, int index, unsigned int flags)

`head` is the head of the userspace struct robust_list_head, just as in
the old set_robust_list(). It needs to be a void pointer since it can
point to either a normal robust_list_head or a compat_robust_list_head.

`flags` can be used for defining the list type:

	enum robust_list_type {
		ROBUST_LIST_32BIT,
		ROBUST_LIST_64BIT,
	};

`index` is the index in the kernel's internal linked list of robust
lists (the naming starts to get confusing, I reckon). If `index == -1`,
that means the user wants to set a new robust list, and the kernel will
append it at the end of the internal list, assign a new index and
return this index to the user. If `index >= 0`, that means the user
wants to re-set the `*head` of an already existing list (similarly to
what happens when you call set_robust_list() twice with different
`*head` values).

If `index` is out of range, or it points to a non-existing robust_list, or if
the internal list is full, an error is returned.
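
To illustrate, a minimal usage sketch (assuming headers built from this
series; the raw syscall(2) calls and the error handling are just a
sketch, not a definitive wrapper):

	#include <linux/futex.h>	/* struct robust_list_head,
					 * ROBUST_LIST_64BIT (proposed) */
	#include <sys/syscall.h>	/* __NR_set_robust_list2 (proposed) */
	#include <unistd.h>

	int register_robust_list2(void)
	{
		static struct robust_list_head head;
		int idx;

		/* index == -1: append a new list; the kernel assigns
		 * and returns its index */
		idx = syscall(__NR_set_robust_list2, &head, -1,
			      ROBUST_LIST_64BIT);
		if (idx < 0)
			return -1;

		/* index >= 0: re-set *head of the already existing list */
		return syscall(__NR_set_robust_list2, &head, idx,
			       ROBUST_LIST_64BIT);
	}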

* Implementation

The implementation re-uses as much of the existing robust list
infrastructure as possible. The new task_struct member `struct
list_head robust_list2` is just a linked list where new lists are
appended as the user requests more of them; at futex_cleanup() time,
the kernel walks through this internal list, feeding each robust_list
to exit_robust_list().
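
A rough sketch of the bookkeeping described above (the entry type and
the exit_robust_list2() helper are illustrative names invented here,
not necessarily what the patches use):

	struct robust_list2_entry {
		struct list_head node;		/* links into tsk->robust_list2 */
		void __user *head;		/* userspace robust_list_head */
		enum robust_list_type type;	/* 32-bit or 64-bit pointers */
	};

	/* from futex_cleanup(), on task exit: */
	struct robust_list2_entry *cur;

	list_for_each_entry(cur, &tsk->robust_list2, node)
		/* hand each head to the robust-list walker, using the
		 * 32-bit (compat) variant when cur->type says so */
		exit_robust_list2(tsk, cur->head, cur->type);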

This implementation supports up to 10 lists per task (defined as
ROBUST_LISTS_PER_TASK), but that's an arbitrary number for this RFC.
For the use case described above, 4 should be enough; I'm not sure what
the limit should be.

It doesn't support list removal (should it?). It also doesn't have a
proper get_robust_list2() yet, but I can add one in the next revision.
We could also have a generic robust_list() syscall that can be used to
set/get lists and be controlled by flags.
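
One possible shape for such a combined entry point (purely speculative;
the cmd values are invented here):

	/* set: returns the list index, or a negative error */
	long robust_list2(unsigned int cmd,	/* e.g. SET or GET */
			  void *head, int index, unsigned int flags);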

The new interface has an `unsigned int flags` argument, making it
extensible for future use cases as well.

* Testing

I will provide a selftest similar to the one I proposed for the current
interface here:
https://lore.kernel.org/lkml/20241010011142.905297-1-andrealmeid@igalia.com/

Also, FEX-Emu added support for this interface to validate it:
https://github.com/FEX-Emu/FEX/pull/3966

Feedback is very welcome!

Thanks,
	André

[1] https://github.com/FEX-Emu/FEX

Changelog:
- Added a patch to properly deal with exit_robust_list() in 64bit vs 32bit
- Wired-up syscall for all archs
- Added more of the cover letter to the commit message
v1: https://lore.kernel.org/lkml/20241024145735.162090-1-andrealmeid@igalia.com/

André Almeida (3):
  futex: Use explicit sizes for compat_exit_robust_list
  futex: Create set_robust_list2
  futex: Wire up set_robust_list2 syscall

 arch/alpha/kernel/syscalls/syscall.tbl      |   1 +
 arch/arm/tools/syscall.tbl                  |   1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |   1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |   1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |   1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |   1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |   1 +
 arch/s390/kernel/syscalls/syscall.tbl       |   1 +
 arch/sh/kernel/syscalls/syscall.tbl         |   1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |   1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |   1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |   1 +
 include/linux/compat.h                      |  12 +-
 include/linux/futex.h                       |  12 ++
 include/linux/sched.h                       |   3 +-
 include/uapi/asm-generic/unistd.h           |   5 +-
 include/uapi/linux/futex.h                  |  24 ++++
 init/init_task.c                            |   3 +
 kernel/futex/core.c                         | 116 +++++++++++++++++---
 kernel/futex/futex.h                        |   3 +
 kernel/futex/syscalls.c                     |  40 ++++++-
 kernel/sys_ni.c                             |   1 +
 scripts/syscall.tbl                         |   1 +
 26 files changed, 203 insertions(+), 32 deletions(-)

-- 
2.47.0

Re: [PATCH v2 0/3] futex: Create set_robust_list2
Posted by Florian Weimer 3 weeks ago
* André Almeida:

> 1) x86 apps can have robust lists with 32-bit pointers. For an x86-64
>    kernel this is not a problem, because of the compat entry point. But
>    there's no such compat entry point for AArch64, so the kernel would
>    do the pointer arithmetic wrongly. It is also unviable for userspace
>    to track every addition/removal to the robust list and keep a 64-bit
>    version of it somewhere else to feed to the kernel. Thus, the new
>    interface has an option for telling the kernel whether the list is
>    filled with 32-bit or 64-bit pointers.

The size is typically different for 32-bit and 64-bit mode (12 vs 24
bytes).  Why isn't this enough to disambiguate?
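
For illustration, a size-based dispatch in the kernel might look like
this (sketch only; `len` would be a new syscall argument):

	/* disambiguate by structure size instead of a type flag */
	switch (len) {
	case 12:	/* sizeof(struct compat_robust_list_head) */
		type = ROBUST_LIST_32BIT;
		break;
	case 24:	/* sizeof(struct robust_list_head) on 64-bit */
		type = ROBUST_LIST_64BIT;
		break;
	default:
		return -EINVAL;
	}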

> 2) Apps can set just one robust list (in theory, x86-64 apps can set
>    two if they also use the compat entry point). That means that when
>    an x86 app asks FEX-Emu to call set_robust_list(), FEX has two
>    options: overwrite its own robust list pointer and make the app
>    robust, or ignore the app's robust list and keep the emulator
>    robust. The new interface allows multiple robust lists per
>    application, solving this.

Can't you avoid mixing emulated and general userspace code on the same
thread?  On emulator threads, you have full control over the TCB.

QEMU hints towards further problems (in linux-user/syscall.c):

    case TARGET_NR_set_robust_list:
    case TARGET_NR_get_robust_list:
        /* The ABI for supporting robust futexes has userspace pass
         * the kernel a pointer to a linked list which is updated by
         * userspace after the syscall; the list is walked by the kernel
         * when the thread exits. Since the linked list in QEMU guest
         * memory isn't a valid linked list for the host and we have
         * no way to reliably intercept the thread-death event, we can't
         * support these. Silently return ENOSYS so that guest userspace
         * falls back to a non-robust futex implementation (which should
         * be OK except in the corner case of the guest crashing while
         * holding a mutex that is shared with another process via
         * shared memory).
         */
        return -TARGET_ENOSYS;

The glibc implementation is not really prepared for this
(__ASSUME_SET_ROBUST_LIST is defined for most architectures).  But a
couple of years ago, we had a bunch of kernels that regressed robust
list support on POWER, and I think we found out only when we tested an
unrelated glibc update and saw unexpected glibc test suite failures …

Thanks,
Florian
Re: [PATCH v2 0/3] futex: Create set_robust_list2
Posted by André Almeida 2 weeks, 5 days ago
Hi Florian,

On 02/11/2024 18:58, Florian Weimer wrote:
> * André Almeida:
> 
>> 1) x86 apps can have robust lists with 32-bit pointers. For an x86-64
>>    kernel this is not a problem, because of the compat entry point. But
>>    there's no such compat entry point for AArch64, so the kernel would
>>    do the pointer arithmetic wrongly. It is also unviable for userspace
>>    to track every addition/removal to the robust list and keep a 64-bit
>>    version of it somewhere else to feed to the kernel. Thus, the new
>>    interface has an option for telling the kernel whether the list is
>>    filled with 32-bit or 64-bit pointers.
> 
> The size is typically different for 32-bit and 64-bit mode (12 vs 24
> bytes).  Why isn't this enough to disambiguate?
> 

Right, so the idea would be to use `size_t len` from the syscall 
arguments for that?

>> 2) Apps can set just one robust list (in theory, x86-64 apps can set
>>    two if they also use the compat entry point). That means that when
>>    an x86 app asks FEX-Emu to call set_robust_list(), FEX has two
>>    options: overwrite its own robust list pointer and make the app
>>    robust, or ignore the app's robust list and keep the emulator
>>    robust. The new interface allows multiple robust lists per
>>    application, solving this.
> 
> Can't you avoid mixing emulated and general userspace code on the same
> thread?  On emulator threads, you have full control over the TCB.
> 

FEX can't avoid that because it doesn't do full-system emulation, it
just does instruction translation. FEX doesn't have full control over
the TCB; that's still all glibc, or whatever other dynamic linker is
used.

Re: [PATCH v2 0/3] futex: Create set_robust_list2
Posted by Peter Zijlstra 2 weeks, 6 days ago
On Sat, Nov 02, 2024 at 10:58:42PM +0100, Florian Weimer wrote:

> QEMU hints towards further problems (in linux-user/syscall.c):
> 
>     case TARGET_NR_set_robust_list:
>     case TARGET_NR_get_robust_list:
>         /* The ABI for supporting robust futexes has userspace pass
>          * the kernel a pointer to a linked list which is updated by
>          * userspace after the syscall; the list is walked by the kernel
>          * when the thread exits. Since the linked list in QEMU guest
>          * memory isn't a valid linked list for the host and we have
>          * no way to reliably intercept the thread-death event, we can't
>          * support these. Silently return ENOSYS so that guest userspace
>          * falls back to a non-robust futex implementation (which should
>          * be OK except in the corner case of the guest crashing while
>          * holding a mutex that is shared with another process via
>          * shared memory).
>          */
>         return -TARGET_ENOSYS;

I don't think we can sanely fix that. Can't QEMU track the robust thing
itself and use waitpid() to discover the thread is gone and fudge things
from there?
Re: [PATCH v2 0/3] futex: Create set_robust_list2
Posted by Florian Weimer 2 weeks, 6 days ago
* Peter Zijlstra:

> On Sat, Nov 02, 2024 at 10:58:42PM +0100, Florian Weimer wrote:
>
>> QEMU hints towards further problems (in linux-user/syscall.c):
>> 
>>     case TARGET_NR_set_robust_list:
>>     case TARGET_NR_get_robust_list:
>>         /* The ABI for supporting robust futexes has userspace pass
>>          * the kernel a pointer to a linked list which is updated by
>>          * userspace after the syscall; the list is walked by the kernel
>>          * when the thread exits. Since the linked list in QEMU guest
>>          * memory isn't a valid linked list for the host and we have
>>          * no way to reliably intercept the thread-death event, we can't
>>          * support these. Silently return ENOSYS so that guest userspace
>>          * falls back to a non-robust futex implementation (which should
>>          * be OK except in the corner case of the guest crashing while
>>          * holding a mutex that is shared with another process via
>>          * shared memory).
>>          */
>>         return -TARGET_ENOSYS;
>
> I don't think we can sanely fix that. Can't QEMU track the robust thing
> itself and use waitpid() to discover the thread is gone and fudge things
> from there?

There are race conditions with munmap, I think, and they probably get a
lot worse if QEMU does that.

See Rich Felker's bug report:

| The corruption is performed by the kernel when it walks the robust
| list. The basic situation is the same as in PR #13690, except that
| here there's actually a potential write to the memory rather than just
| a read.
| 
| The sequence of events leading to corruption goes like this:
| 
| 1. Thread A unlocks the process-shared, robust mutex and is preempted
|    after the mutex is removed from the robust list and atomically
|    unlocked, but before it's removed from the list_op_pending field of
|    the robust list header.
| 
| 2. Thread B locks the mutex, and, knowing by program logic that it's
|    the last user of the mutex, unlocks and unmaps it, allocates/maps
|    something else that gets assigned the same address as the shared mutex
|    mapping, and then exits.
| 
| 3. The kernel destroys the process, which involves walking each
|   thread's robust list and processing each thread's list_op_pending
|   field of the robust list header. Since thread A has a list_op_pending
|   pointing at the address previously occupied by the mutex, the kernel
|   obliviously "unlocks the mutex" by writing a 0 to the address and
|   futex-waking it. However, the kernel has instead overwritten part of
|   whatever mapping thread A created. If this is private memory it
|   (probably) doesn't matter since the process is ending anyway (but are
|   there race conditions where this can be seen?). If this is shared
|   memory or a shared file mapping, however, the kernel corrupts it.
| 
| I suspect the race is difficult to hit since thread A has to get
| preempted at exactly the wrong time AND thread B has to do a fair
| amount of work without thread A getting scheduled again. So I'm not
| sure how much luck we'd have getting a test case.


<https://sourceware.org/bugzilla/show_bug.cgi?id=14485#c3>

We also have a silent unlocking failure because userspace does not know
about ROBUST_LIST_LIMIT:

  Bug 19089 - Robust mutexes do not take ROBUST_LIST_LIMIT into account
  <https://sourceware.org/bugzilla/show_bug.cgi?id=19089>

(I think we may have discussed this one before, and you may have
suggested to just hard-code 2048 in userspace because the constant is
not expected to change.)
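
For reference, the kernel-side cap in question, which userspace cannot
query (from kernel/futex/core.c; the comment is paraphrased):

	/* cap on robust-list walking, so a broken or malicious list
	 * can't trap the kernel in an endless loop at exit time */
	#define ROBUST_LIST_LIMIT	2048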

So the in-mutex linked list has quite a few problems even outside of
emulation. 8-(

Thanks,
Florian
Re: [PATCH v2 0/3] futex: Create set_robust_list2
Posted by Peter Zijlstra 2 weeks, 5 days ago
On Mon, Nov 04, 2024 at 01:36:43PM +0100, Florian Weimer wrote:
> * Peter Zijlstra:
> 
> > On Sat, Nov 02, 2024 at 10:58:42PM +0100, Florian Weimer wrote:
> >
> >> QEMU hints towards further problems (in linux-user/syscall.c):
> >> 
> >>     case TARGET_NR_set_robust_list:
> >>     case TARGET_NR_get_robust_list:
> >>         /* The ABI for supporting robust futexes has userspace pass
> >>          * the kernel a pointer to a linked list which is updated by
> >>          * userspace after the syscall; the list is walked by the kernel
> >>          * when the thread exits. Since the linked list in QEMU guest
> >>          * memory isn't a valid linked list for the host and we have
> >>          * no way to reliably intercept the thread-death event, we can't
> >>          * support these. Silently return ENOSYS so that guest userspace
> >>          * falls back to a non-robust futex implementation (which should
> >>          * be OK except in the corner case of the guest crashing while
> >>          * holding a mutex that is shared with another process via
> >>          * shared memory).
> >>          */
> >>         return -TARGET_ENOSYS;
> >
> > I don't think we can sanely fix that. Can't QEMU track the robust thing
> > itself and use waitpid() to discover the thread is gone and fudge things
> > from there?
> 
> There are race conditions with munmap, I think, and they probably get a
> lot worse if QEMU does that.
> 
> See Rich Felker's bug report:
> 
> | The corruption is performed by the kernel when it walks the robust
> | list. The basic situation is the same as in PR #13690, except that
> | here there's actually a potential write to the memory rather than just
> | a read.
> | 
> | The sequence of events leading to corruption goes like this:
> | 
> | 1. Thread A unlocks the process-shared, robust mutex and is preempted
> |    after the mutex is removed from the robust list and atomically
> |    unlocked, but before it's removed from the list_op_pending field of
> |    the robust list header.
> | 
> | 2. Thread B locks the mutex, and, knowing by program logic that it's
> |    the last user of the mutex, unlocks and unmaps it, allocates/maps
> |    something else that gets assigned the same address as the shared mutex
> |    mapping, and then exits.
> | 
> | 3. The kernel destroys the process, which involves walking each
> |   thread's robust list and processing each thread's list_op_pending
> |   field of the robust list header. Since thread A has a list_op_pending
> |   pointing at the address previously occupied by the mutex, the kernel
> |   obliviously "unlocks the mutex" by writing a 0 to the address and
> |   futex-waking it. However, the kernel has instead overwritten part of
> |   whatever mapping thread A created. If this is private memory it
> |   (probably) doesn't matter since the process is ending anyway (but are
> |   there race conditions where this can be seen?). If this is shared
> |   memory or a shared file mapping, however, the kernel corrupts it.
> | 
> | I suspect the race is difficult to hit since thread A has to get
> | preempted at exactly the wrong time AND thread B has to do a fair
> | amount of work without thread A getting scheduled again. So I'm not
> | sure how much luck we'd have getting a test case.
> 
> 
> <https://sourceware.org/bugzilla/show_bug.cgi?id=14485#c3>

So I've only managed to conjure up two horrible solutions for this:

 - put the robust futex operations under user-space RCU, and mandate a
   matching synchronize_rcu() before any munmap() calls.

 - add a robust-barrier syscall that waits until all list_op_pending are
   either NULL or changed since invocation. And mandate this call before
   munmap().

Neither is particularly pretty, I admit, but at least they should work.
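
A sketch of how the second option might look from userspace (the
barrier syscall is hypothetical; nothing like it exists today):

	/* hypothetical: returns once every thread's list_op_pending is
	 * NULL or has changed since this call was made */
	robust_barrier();

	/* only now is it safe to unmap memory that may still appear in
	 * another thread's list_op_pending */
	munmap(addr, len);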

But doing this and mandating the alignment thing should at least make
this qemu thing workable, no?

> We also have a silent unlocking failure because userspace does not know
> about ROBUST_LIST_LIMIT:
> 
>   Bug 19089 - Robust mutexes do not take ROBUST_LIST_LIMIT into account
>   <https://sourceware.org/bugzilla/show_bug.cgi?id=19089>
> 
> (I think we may have discussed this one before, and you may have
> suggested to just hard-code 2048 in userspace because the constant is
> not expected to change.)
> 
> So the in-mutex linked list has quite a few problems even outside of
> emulation. 8-(

It's futex, of course it's a pain in the arse :-)

And yeah, no better ideas on that limit for now...
Re: [PATCH v2 0/3] futex: Create set_robust_list2
Posted by Peter Zijlstra 2 weeks, 6 days ago
On Mon, Nov 04, 2024 at 12:32:40PM +0100, Peter Zijlstra wrote:
> On Sat, Nov 02, 2024 at 10:58:42PM +0100, Florian Weimer wrote:
> 
> > QEMU hints towards further problems (in linux-user/syscall.c):
> > 
> >     case TARGET_NR_set_robust_list:
> >     case TARGET_NR_get_robust_list:
> >         /* The ABI for supporting robust futexes has userspace pass
> >          * the kernel a pointer to a linked list which is updated by
> >          * userspace after the syscall; the list is walked by the kernel
> >          * when the thread exits. Since the linked list in QEMU guest
> >          * memory isn't a valid linked list for the host and we have
> >          * no way to reliably intercept the thread-death event, we can't
> >          * support these. Silently return ENOSYS so that guest userspace
> >          * falls back to a non-robust futex implementation (which should
> >          * be OK except in the corner case of the guest crashing while
> >          * holding a mutex that is shared with another process via
> >          * shared memory).
> >          */
> >         return -TARGET_ENOSYS;
> 
> I don't think we can sanely fix that. Can't QEMU track the robust thing
> itself and use waitpid() to discover the thread is gone and fudge things
> from there?

Hmm, what if we mandate 'natural' alignment of the structure such that
it is always inside a single page? Then QEMU can do the translation
here and hand the kernel the 'real' address.
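
For illustration, a guest-side constraint like this would be enough (a
sketch; 32 is just a power of two >= sizeof(struct robust_list_head),
which is 24 bytes on 64-bit):

	/* a naturally aligned head can never straddle a page boundary,
	 * so one guest->host translation covers the whole structure */
	static struct robust_list_head head
		__attribute__((aligned(32)));

	_Static_assert(sizeof(struct robust_list_head) <= 32,
		       "alignment must cover the whole struct");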