On Fri, Oct 11, 2019 at 08:30:02AM +0000, Jan Glauber wrote:
> On Fri, Oct 11, 2019 at 10:18:18AM +0200, Paolo Bonzini wrote:
> > On 11/10/19 08:05, Jan Glauber wrote:
> > > On Wed, Oct 09, 2019 at 11:15:04AM +0200, Paolo Bonzini wrote:
> > >>> ...but if I bump notify_me size to uint64_t the issue goes away.
> > >>
> > >> Ouch. :) Is this with or without my patch(es)?
> >
> > You didn't answer this question.
>
> Oh, sorry... I did but the mail probably didn't make it out.
> I have both of your changes applied (as I think they make sense).
>
> > >> Also, what if you just add a dummy uint32_t after notify_me?
> > >
> > > With the dummy the testcase also runs fine for 500 iterations.
> >
> > You might be lucky and causing list_lock to be in another cache line.
> > What if you add __attribute__((aligned(16)) to notify_me (and keep the
> > dummy)?
>
> Good point. I'll try to force both into the same cacheline.
On the Hi1620, this still hangs in the first iteration:
diff --git a/include/block/aio.h b/include/block/aio.h
index 6b0d52f732..00e56a5412 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -82,7 +82,7 @@ struct AioContext {
* Instead, the aio_poll calls include both the prepare and the
* dispatch phase, hence a simple counter is enough for them.
*/
- uint32_t notify_me;
+ __attribute__((aligned(16))) uint64_t notify_me;
/* A lock to protect between QEMUBH and AioHandler adders and deleter,
* and to ensure that no callbacks are removed while we're walking and
diff --git a/util/async.c b/util/async.c
index ca83e32c7f..024c4c567d 100644
--- a/util/async.c
+++ b/util/async.c
@@ -242,7 +242,7 @@ aio_ctx_check(GSource *source)
aio_notify_accept(ctx);
for (bh = ctx->first_bh; bh; bh = bh->next) {
- if (bh->scheduled) {
+ if (atomic_mb_read(&bh->scheduled)) {
return true;
}
}
@@ -342,12 +342,12 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx)
void aio_notify(AioContext *ctx)
{
- /* Write e.g. bh->scheduled before reading ctx->notify_me. Pairs
- * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
+ /* Using atomic_mb_read ensures that e.g. bh->scheduled is written before
+ * ctx->notify_me is read. Pairs with atomic_or in aio_ctx_prepare or
+ * atomic_add in aio_poll.
*/
- smp_mb();
- if (ctx->notify_me) {
- event_notifier_set(&ctx->notifier);
+ if (atomic_mb_read(&ctx->notify_me)) {
+ event_notifier_set(&ctx->notifier);
atomic_mb_set(&ctx->notified, true);
}
}
--
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256
Title:
qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
converting images
Status in kunpeng920:
New
Status in QEMU:
In Progress
Status in qemu package in Ubuntu:
In Progress
Status in qemu source package in Bionic:
New
Status in qemu source package in Disco:
New
Status in qemu source package in Eoan:
In Progress
Status in qemu source package in FF-Series:
New
Bug description:
Command:
qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2
Hangs indefinitely approximately 30% of the runs.
----
Workaround:
qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2
Run "qemu-img convert" with "a single coroutine" to avoid this issue.
----
(gdb) thread 1
...
(gdb) bt
#0 0x0000ffffbf1ad81c in __GI_ppoll
#1 0x0000aaaaaabcf73c in ppoll
#2 qemu_poll_ns
#3 0x0000aaaaaabd0764 in os_host_main_loop_wait
#4 main_loop_wait
...
(gdb) thread 2
...
(gdb) bt
#0 syscall ()
#1 0x0000aaaaaabd41cc in qemu_futex_wait
#2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
#3 0x0000aaaaaabed05c in call_rcu_thread
#4 0x0000aaaaaabd34c8 in qemu_thread_start
#5 0x0000ffffbf25c880 in start_thread
#6 0x0000ffffbf1b6b9c in thread_start ()
(gdb) thread 3
...
(gdb) bt
#0 0x0000ffffbf11aa20 in __GI___sigtimedwait
#1 0x0000ffffbf2671b4 in __sigwait
#2 0x0000aaaaaabd1ddc in sigwait_compat
#3 0x0000aaaaaabd34c8 in qemu_thread_start
#4 0x0000ffffbf25c880 in start_thread
#5 0x0000ffffbf1b6b9c in thread_start
----
(gdb) run
Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
./disk01.ext4.qcow2 ./output.qcow2
[New Thread 0xffffbec5ad90 (LWP 72839)]
[New Thread 0xffffbe459d90 (LWP 72840)]
[New Thread 0xffffbdb57d90 (LWP 72841)]
[New Thread 0xffffacac9d90 (LWP 72859)]
[New Thread 0xffffa7ffed90 (LWP 72860)]
[New Thread 0xffffa77fdd90 (LWP 72861)]
[New Thread 0xffffa6ffcd90 (LWP 72862)]
[New Thread 0xffffa67fbd90 (LWP 72863)]
[New Thread 0xffffa5ffad90 (LWP 72864)]
[Thread 0xffffa5ffad90 (LWP 72864) exited]
[Thread 0xffffa6ffcd90 (LWP 72862) exited]
[Thread 0xffffa77fdd90 (LWP 72861) exited]
[Thread 0xffffbdb57d90 (LWP 72841) exited]
[Thread 0xffffa67fbd90 (LWP 72863) exited]
[Thread 0xffffacac9d90 (LWP 72859) exited]
[Thread 0xffffa7ffed90 (LWP 72860) exited]
<HUNG w/ 3 threads in the stack trace showed before>
"""
All the tasks left are blocked in a system call, so no task left to call
qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
thread #1 (doing poll() in a pipe with thread #2).
Those 7 threads exit before disk conversion is complete (sometimes in
the beginning, sometimes at the end).
----
[ Original Description ]
On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
frequently hangs (~50% of the time) with this command:
qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2
Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
qcow2->qcow2 conversion happens to be something uvtool does every time
it fetches images.
Once hung, attaching gdb gives the following backtrace:
(gdb) bt
#0 0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
at ../sysdeps/unix/sysv/linux/ppoll.c:39
#1 0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
__fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
#2 qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
timeout=timeout@entry=-1) at util/qemu-timer.c:322
#3 0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
at util/main-loop.c:233
#4 main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
#5 0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
#6 img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
#7 0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975
Reproduced w/ latest QEMU git (@ 53744e0a182)
To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions
Patchew URL: https://patchew.org/QEMU/20191011175536.GB25464@xps13.dannf/ Hi, This series seems to have some coding style problems. See output below for more information: Subject: [Bug 1805256] Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues Type: series Message-id: 20191011175536.GB25464@xps13.dannf === TEST SCRIPT BEGIN === #!/bin/bash git rev-parse base > /dev/null || exit 0 git config --local diff.renamelimit 0 git config --local diff.renames True git config --local diff.algorithm histogram ./scripts/checkpatch.pl --mailback base.. === TEST SCRIPT END === Switched to a new branch 'test' d3707c3 qemu_futex_wait() lockups in ARM64: 2 possible issues === OUTPUT BEGIN === WARNING: Block comments use a leading /* on a separate line #66: FILE: util/async.c:345: + /* Using atomic_mb_read ensures that e.g. bh->scheduled is written before ERROR: code indent should never use tabs #74: FILE: util/async.c:350: +^Ievent_notifier_set(&ctx->notifier);$ ERROR: Missing Signed-off-by: line(s) total: 2 errors, 1 warnings, 33 lines checked Commit d3707c31b5da (qemu_futex_wait() lockups in ARM64: 2 possible issues) has style problems, please review. If any of these errors are false positives report them to the maintainer, see CHECKPATCH in MAINTAINERS. === OUTPUT END === Test command exited with code: 1 The full log is available at http://patchew.org/logs/20191011175536.GB25464@xps13.dannf/testing.checkpatch/?type=message. --- Email generated automatically by Patchew [https://patchew.org/]. Please send your feedback to patchew-devel@redhat.com
Patchew URL: https://patchew.org/QEMU/20191011175536.GB25464@xps13.dannf/ Hi, This series failed the docker-mingw@fedora build test. Please find the testing commands and their output below. If you have Docker installed, you can probably reproduce it locally. === TEST SCRIPT BEGIN === #! /bin/bash export ARCH=x86_64 make docker-image-fedora V=1 NETWORK=1 time make docker-test-mingw@fedora J=14 NETWORK=1 === TEST SCRIPT END === In file included from /tmp/qemu-test/src/include/qemu/osdep.h:51, from /tmp/qemu-test/src/util/async.c:26: /tmp/qemu-test/src/util/async.c: In function 'aio_notify': /tmp/qemu-test/src/include/qemu/compiler.h:86:36: error: static assertion failed: "not expecting: sizeof(*&ctx->notify_me) > ATOMIC_REG_SIZE" #define QEMU_BUILD_BUG_MSG(x, msg) _Static_assert(!(x), msg) ^~~~~~~~~~~~~~ /tmp/qemu-test/src/include/qemu/compiler.h:94:30: note: in expansion of macro 'QEMU_BUILD_BUG_MSG' --- /tmp/qemu-test/src/util/async.c:349:9: note: in expansion of macro 'atomic_mb_read' if (atomic_mb_read(&ctx->notify_me)) { ^~~~~~~~~~~~~~ make: *** [/tmp/qemu-test/src/rules.mak:69: util/async.o] Error 1 make: *** Waiting for unfinished jobs.... Traceback (most recent call last): File "./tests/docker/docker.py", line 662, in <module> --- raise CalledProcessError(retcode, cmd) subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', '--label', 'com.qemu.instance.uuid=b90449ea684e4a14ad6b34f4f5dcb0c9', '-u', '1001', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 'TARGET_LIST=', '-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 'SHOW_ENV=', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', '/home/patchew/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', '/var/tmp/patchew-tester-tmp-0y2eb54d/src/docker-src.2019-10-11-20.12.57.30254:/var/tmp/qemu:z,ro', 'qemu:fedora', '/var/tmp/qemu/run', 'test-mingw']' returned non-zero exit status 2. filter=--filter=label=com.qemu.instance.uuid=b90449ea684e4a14ad6b34f4f5dcb0c9 make[1]: *** [docker-run] Error 1 make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-0y2eb54d/src' make: *** [docker-run-test-mingw@fedora] Error 2 real 11m27.735s user 0m7.532s The full log is available at http://patchew.org/logs/20191011175536.GB25464@xps13.dannf/testing.docker-mingw@fedora/?type=message. --- Email generated automatically by Patchew [https://patchew.org/]. Please send your feedback to patchew-devel@redhat.com
© 2016 - 2024 Red Hat, Inc.