[PATCH 0/5] linux-user: Support extended clone(CLONE_VM)

Josh Kunz posted 5 patches 3 years, 10 months ago
Test docker-mingw@fedora passed
Test checkpatch passed
Test asan passed
Test docker-quick@centos7 failed
Test FreeBSD passed
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20200612014606.147691-1-jkz@google.com
Maintainers: Laurent Vivier <laurent@vivier.eu>, Riku Voipio <riku.voipio@iki.fi>, "Alex Bennée" <alex.bennee@linaro.org>
linux-user/Makefile.objs            |   2 +-
linux-user/clone.c                  | 565 ++++++++++++++++++++++++++++
linux-user/clone.h                  |  27 ++
linux-user/fd-trans-tbl.c           |  13 +
linux-user/fd-trans-type.h          |  17 +
linux-user/fd-trans.c               |   3 -
linux-user/fd-trans.h               |  75 ++--
linux-user/main.c                   |   1 +
linux-user/qemu.h                   |  49 +++
linux-user/signal.c                 |  84 ++++-
linux-user/syscall.c                | 452 ++++++++++++----------
tests/tcg/multiarch/Makefile.target |   3 +
tests/tcg/multiarch/linux-test.c    | 227 ++++++++++-
13 files changed, 1264 insertions(+), 254 deletions(-)
create mode 100644 linux-user/clone.c
create mode 100644 linux-user/clone.h
create mode 100644 linux-user/fd-trans-tbl.c
create mode 100644 linux-user/fd-trans-type.h
[PATCH 0/5] linux-user: Support extended clone(CLONE_VM)
Posted by Josh Kunz 3 years, 10 months ago
This patch series implements extended support for the `clone` system
call. As best I can tell, any option combination including `CLONE_VM`
should be supported with the addition of this patch series. The
implementation is described in greater detail in the patches themselves.

Testing:

  * All targets built on x86_64.
  * `make check` and `make check-tcg` are passing. Additional tests have
    been added to `linux-test.c` to validate clone behavior.

Caveats:

  * This series touches, but does not fix, several bits of code that are
    racey (namely the sigact table and the fd trans table).
  * `exit_group` does not perform the appropriate cleanup for non-thread
    children created with `CLONE_VM`. CPUs for such children are never
    cleaned up. The correct implementation of exit-group is non-trivial
    (since it also needs to track/handle cleanup for threads in the
    clone'd child process). Also, I don't fully understand the
    interaction between QOM<->linux-user. My naive implementation based
    on the current implementation `exit(2)` was regularly crashing. If
    maintainers have suggestions for better ways to handle exit_group,
    they would be greatly appreciated. 
  * execve does not clean up the CPUs of clone'd children, for the same
    reasons as `exit_group`.

Josh Kunz (5):
  linux-user: Refactor do_fork to use new `qemu_clone`
  linux-user: Make fd_trans task-specific.
  linux-user: Make sigact_table part of the task state.
  linux-user: Support CLONE_VM and extended clone options
  linux-user: Add PDEATHSIG test for clone process hierarchy.

 linux-user/Makefile.objs            |   2 +-
 linux-user/clone.c                  | 565 ++++++++++++++++++++++++++++
 linux-user/clone.h                  |  27 ++
 linux-user/fd-trans-tbl.c           |  13 +
 linux-user/fd-trans-type.h          |  17 +
 linux-user/fd-trans.c               |   3 -
 linux-user/fd-trans.h               |  75 ++--
 linux-user/main.c                   |   1 +
 linux-user/qemu.h                   |  49 +++
 linux-user/signal.c                 |  84 ++++-
 linux-user/syscall.c                | 452 ++++++++++++----------
 tests/tcg/multiarch/Makefile.target |   3 +
 tests/tcg/multiarch/linux-test.c    | 227 ++++++++++-
 13 files changed, 1264 insertions(+), 254 deletions(-)
 create mode 100644 linux-user/clone.c
 create mode 100644 linux-user/clone.h
 create mode 100644 linux-user/fd-trans-tbl.c
 create mode 100644 linux-user/fd-trans-type.h

-- 
2.27.0.290.gba653c62da-goog


Re: [PATCH 0/5] linux-user: Support extended clone(CLONE_VM)
Posted by Alex Bennée 3 years, 10 months ago
Josh Kunz <jkz@google.com> writes:

> This patch series implements extended support for the `clone` system
> call. As best I can tell, any option combination including `CLONE_VM`
> should be supported with the addition of this patch series. The
> implementation is described in greater detail in the patches themselves.
>
> Testing:
>
>   * All targets built on x86_64.
>   * `make check` and `make check-tcg` are passing. Additional tests have
>     been added to `linux-test.c` to validate clone behavior.
>
> Caveats:
>
>   * This series touches, but does not fix, several bits of code that are
>     racey (namely the sigact table and the fd trans table).
>   * `exit_group` does not perform the appropriate cleanup for non-thread
>     children created with `CLONE_VM`. CPUs for such children are never
>     cleaned up. The correct implementation of exit-group is non-trivial
>     (since it also needs to track/handle cleanup for threads in the
>     clone'd child process). Also, I don't fully understand the
>     interaction between QOM<->linux-user.

When the QOM object gets unrefed for the final time it should cause a
bunch of clean-up in the common vCPU management code where things like
plugin cleanup are done.

This was recently touched in 1f81ce90e31ef338ee53a0cea02344237bc470cc
where I removed linux-user messing around with the active cpu list and
left it to the core code to deal with. Previously it wasn't being
properly unrealized.

>     My naive implementation based
>     on the current implementation `exit(2)` was regularly crashing. If
>     maintainers have suggestions for better ways to handle exit_group,
>     they would be greatly appreciated. 
>   * execve does not clean up the CPUs of clone'd children, for the same
>     reasons as `exit_group`.
>
> Josh Kunz (5):
>   linux-user: Refactor do_fork to use new `qemu_clone`
>   linux-user: Make fd_trans task-specific.
>   linux-user: Make sigact_table part of the task state.
>   linux-user: Support CLONE_VM and extended clone options
>   linux-user: Add PDEATHSIG test for clone process hierarchy.
>
>  linux-user/Makefile.objs            |   2 +-
>  linux-user/clone.c                  | 565 ++++++++++++++++++++++++++++
>  linux-user/clone.h                  |  27 ++
>  linux-user/fd-trans-tbl.c           |  13 +
>  linux-user/fd-trans-type.h          |  17 +
>  linux-user/fd-trans.c               |   3 -
>  linux-user/fd-trans.h               |  75 ++--
>  linux-user/main.c                   |   1 +
>  linux-user/qemu.h                   |  49 +++
>  linux-user/signal.c                 |  84 ++++-
>  linux-user/syscall.c                | 452 ++++++++++++----------
>  tests/tcg/multiarch/Makefile.target |   3 +
>  tests/tcg/multiarch/linux-test.c    | 227 ++++++++++-
>  13 files changed, 1264 insertions(+), 254 deletions(-)
>  create mode 100644 linux-user/clone.c
>  create mode 100644 linux-user/clone.h
>  create mode 100644 linux-user/fd-trans-tbl.c
>  create mode 100644 linux-user/fd-trans-type.h


-- 
Alex Bennée

Re: [PATCH 0/5] linux-user: Support extended clone(CLONE_VM)
Posted by no-reply@patchew.org 3 years, 10 months ago
Patchew URL: https://patchew.org/QEMU/20200612014606.147691-1-jkz@google.com/



Hi,

This series failed the docker-quick@centos7 build test. Please find the testing commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#!/bin/bash
make docker-image-centos7 V=1 NETWORK=1
time make docker-test-quick@centos7 SHOW_ENV=1 J=14 NETWORK=1
=== TEST SCRIPT END ===

+++ /tmp/qemu-test/build/tests/qemu-iotests/041.out.bad 2020-06-12 03:41:14.015076859 +0000
@@ -1,5 +1,30 @@
-........................................................................................................
+WARNING:qemu.machine:qemu received signal 9: /tmp/qemu-test/build/tests/qemu-iotests/../../x86_64-softmmu/qemu-system-x86_64 -display none -vga none -chardev socket,id=mon,path=/tmp/tmp.YHsXXgjzyL/qemu-22509-monitor.sock -mon chardev=mon,mode=control -qtest unix:path=/tmp/tmp.YHsXXgjzyL/qemu-22509-qtest.sock -accel qtest -nodefaults -display none -accel qtest -drive if=virtio,id=drive0,file=/tmp/qemu-test/test.img,format=qcow2,cache=writeback,aio=threads,node-name=top,backing.node-name=base -drive if=ide,id=drive1,media=cdrom
+......................................E.................................................................
+======================================================================
+ERROR: test_pause (__main__.TestSingleBlockdev)
+----------------------------------------------------------------------
+Traceback (most recent call last):
+  File "041", line 108, in test_pause
---
 Ran 104 tests
 
-OK
+FAILED (errors=1)
  TEST    iotest-qcow2: 042
  TEST    iotest-qcow2: 043
  TEST    iotest-qcow2: 046
---
Not run: 259
Failures: 041
Failed 1 of 119 iotests
make: *** [check-tests/check-block.sh] Error 1
Traceback (most recent call last):
  File "./tests/docker/docker.py", line 665, in <module>
    sys.exit(main())
---
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', '--label', 'com.qemu.instance.uuid=0b953bb4c5534a2f9f54bbb077c4a060', '-u', '1003', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 'TARGET_LIST=', '-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 'SHOW_ENV=1', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', '/home/patchew2/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', '/var/tmp/patchew-tester-tmp-fy1xwr_z/src/docker-src.2020-06-11-23.31.14.29203:/var/tmp/qemu:z,ro', 'qemu:centos7', '/var/tmp/qemu/run', 'test-quick']' returned non-zero exit status 2.
filter=--filter=label=com.qemu.instance.uuid=0b953bb4c5534a2f9f54bbb077c4a060
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-fy1xwr_z/src'
make: *** [docker-run-test-quick@centos7] Error 2

real    17m15.344s
user    0m8.554s


The full log is available at
http://patchew.org/logs/20200612014606.147691-1-jkz@google.com/testing.docker-quick@centos7/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com
Re: [PATCH 0/5] linux-user: Support extended clone(CLONE_VM)
Posted by Alex Bennée 3 years, 10 months ago
Josh Kunz <jkz@google.com> writes:

> This patch series implements extended support for the `clone` system
> call. As best I can tell, any option combination including `CLONE_VM`
> should be supported with the addition of this patch series. The
> implementation is described in greater detail in the patches themselves.
>
> Testing:
>
>   * All targets built on x86_64.
>   * `make check` and `make check-tcg` are passing. Additional tests have
>     been added to `linux-test.c` to validate clone behavior.

Not for me - which probably means you don't have docker/podman setup or
cross compilers for all the various guests we have. See
tests/tcg/configure.sh

>
> Caveats:
>
>   * This series touches, but does not fix, several bits of code that are
>     racey (namely the sigact table and the fd trans table).
>   * `exit_group` does not perform the appropriate cleanup for non-thread
>     children created with `CLONE_VM`. CPUs for such children are never
>     cleaned up. The correct implementation of exit-group is non-trivial
>     (since it also needs to track/handle cleanup for threads in the
>     clone'd child process). Also, I don't fully understand the
>     interaction between QOM<->linux-user. My naive implementation based
>     on the current implementation `exit(2)` was regularly crashing. If
>     maintainers have suggestions for better ways to handle exit_group,
>     they would be greatly appreciated. 
>   * execve does not clean up the CPUs of clone'd children, for the same
>     reasons as `exit_group`.

Apart from "a more perfect emulation" is there a particular use case
served by the extra functionality? AIUI up until this point we've
basically supported glibc's use of clone() which has generally been
enough. I'm assuming you've come across stuff that needs this more fine
grained support?

>
> Josh Kunz (5):
>   linux-user: Refactor do_fork to use new `qemu_clone`
>   linux-user: Make fd_trans task-specific.
>   linux-user: Make sigact_table part of the task state.
>   linux-user: Support CLONE_VM and extended clone options
>   linux-user: Add PDEATHSIG test for clone process hierarchy.
>
>  linux-user/Makefile.objs            |   2 +-
>  linux-user/clone.c                  | 565 ++++++++++++++++++++++++++++
>  linux-user/clone.h                  |  27 ++
>  linux-user/fd-trans-tbl.c           |  13 +
>  linux-user/fd-trans-type.h          |  17 +
>  linux-user/fd-trans.c               |   3 -
>  linux-user/fd-trans.h               |  75 ++--
>  linux-user/main.c                   |   1 +
>  linux-user/qemu.h                   |  49 +++
>  linux-user/signal.c                 |  84 ++++-
>  linux-user/syscall.c                | 452 ++++++++++++----------
>  tests/tcg/multiarch/Makefile.target |   3 +
>  tests/tcg/multiarch/linux-test.c    | 227 ++++++++++-
>  13 files changed, 1264 insertions(+), 254 deletions(-)
>  create mode 100644 linux-user/clone.c
>  create mode 100644 linux-user/clone.h
>  create mode 100644 linux-user/fd-trans-tbl.c
>  create mode 100644 linux-user/fd-trans-type.h


-- 
Alex Bennée

Re: [PATCH 0/5] linux-user: Support extended clone(CLONE_VM)
Posted by Peter Maydell 3 years, 10 months ago
On Tue, 16 Jun 2020 at 17:08, Alex Bennée <alex.bennee@linaro.org> wrote:
> Apart from "a more perfect emulation" is there a particular use case
> served by the extra functionality? AIUI up until this point we've
> basically supported glibc's use of clone() which has generally been
> enough. I'm assuming you've come across stuff that needs this more fine
> grained support?

There are definitely cases we don't handle that cause problems;
notably https://bugs.launchpad.net/qemu/+bug/1673976 reports
that newer glibc implement posix_spawn() using CLONE_VM|CLONE_VFORK
which we don't handle correctly (though it is now just "we don't
report failures correctly" rather than "guest asserts").
The problem has always been that glibc implicitly assumes it
knows what the process's threads are like, ie that it is the
only thing doing any clone()s. (The comment in syscall.c mentions
it "breaking mutexes" though I forget what I had in mind when
I wrote that comment.) I haven't looked at these patches,
but the risk of being clever is that we end up implicitly
depending on details of glibc's internal implementation in a
potentially fragile way.

I forget whether QEMU can build against musl libc, but if we do
then that might be an interesting test of whether we have
accidental dependencies on the libc internals.

thanks
-- PMM

Re: [PATCH 0/5] linux-user: Support extended clone(CLONE_VM)
Posted by Josh Kunz 3 years, 10 months ago
On Tue, Jun 16, 2020 at 9:32 AM Peter Maydell <peter.maydell@linaro.org>
wrote:
>
> On Tue, 16 Jun 2020 at 17:08, Alex Bennée <alex.bennee@linaro.org> wrote:
> > Apart from "a more perfect emulation" is there a particular use case
> > served by the extra functionality? AIUI up until this point we've
> > basically supported glibc's use of clone() which has generally been
> > enough. I'm assuming you've come across stuff that needs this more fine
> > grained support?
>
> There are definitely cases we don't handle that cause problems;
> notably https://bugs.launchpad.net/qemu/+bug/1673976 reports
> that newer glibc implement posix_spawn() using CLONE_VM|CLONE_VFORK
> which we don't handle correctly (though it is now just "we don't
> report failures correctly" rather than "guest asserts").

This originally came up for us at Google in multi-threaded guest binaries
using TCMalloc (https://github.com/google/tcmalloc). TCMalloc does not have
any special `at_fork` handling, so guest processes using TCMalloc spawn
subprocesses using a custom bit of code based on `clone(CLONE_VM)` (notably
*not* vfork()).

We've also been using this patch to work around similar issues in QEMU
itself when creating subprocesses with fork()/vfork(). Since QEMU (and
GLib) rely on locks to emulate multi-threaded guests that share virtual
memory, QEMU itself is also vulnerable to deadlocks when a guest forks.
Without this patch we've run into deadlocks with TCG region trees, and
GLib's `g_malloc`, there are likely others as well, since we could still
reproduce the deadlocks after fixing these specific problems.

The issues caused by using fork() in multi-threaded guests are pretty
tricky to debug. They are fundamentally data races (was another thread in
the critical section or not?), and they usually manifest as deadlocks,
which show up as timeouts or hangs to users. I suspect this issue happens
frequently in the wild, but at a low enough rate/user that nobody bothered
fixing it/reporting it yet. Use of `vfork()` with `CLONE_VM` is common as
mentioned by Alex. For example it is the only way to spawn subprocesses in
Go on most platforms:
https://github.com/golang/go/blob/master/src/syscall/exec_linux.go#L218

I tried to come up with a good reproducer for these issues, but I haven't
been able to make one. The cases we have at Google that trigger this issue
reliably are big and they contain lots of code I can't share. When
compiling QEMU itself with TCMalloc, I can pretty reliably reproduce a
deadlock with a program that just calls vfork(), but that's somewhat
synthetic.

> The problem has always been that glibc implicitly assumes it
> knows what the process's threads are like, ie that it is the
> only thing doing any clone()s. (The comment in syscall.c mentions
> it "breaking mutexes" though I forget what I had in mind when
> I wrote that comment.) I haven't looked at these patches,
> but the risk of being clever is that we end up implicitly
> depending on details of glibc's internal implementation in a
> potentially fragile way.
>
>
> I forget whether QEMU can build against musl libc, but if we do
> then that might be an interesting test of whether we have
> accidental dependencies on the libc internals.

I agree it would be interesting to test against musl. I'm pretty sure it
would work (this patch only relies on POSIX APIs + Platform ABIs for TLS),
but it would be interesting to confirm.

--
Josh Kunz