RE: [PATCH v4] blockjob: Fix crash with IOthread when block commit after snapshot

仇大玉 posted 1 patch 3 years, 2 months ago
Failed in applying to current master (apply log)
There is a newer version of this series
blockjob.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
RE: [PATCH v4] blockjob: Fix crash with IOthread when block commit after snapshot
Posted by 仇大玉 3 years, 2 months ago
Any comments?

It's really a bug and can cause the qemu to segmentfault.

Thanks,
Michael

-----Original Message-----
From: 仇大玉 
Sent: 2021年1月28日 13:16
To: qemu-block@nongnu.org; qemu-devel@nongnu.org
Cc: kwolf@redhat.com; mreitz@redhat.com; jsnow@redhat.com; 08005325@163.com
Subject: RE: [PATCH v4] blockjob: Fix crash with IOthread when block commit after snapshot

Any comments?

-----Original Message-----
From: 08005325@163.com <08005325@163.com> 
Sent: 2021年1月28日 9:31
To: kwolf@redhat.com; mreitz@redhat.com; jsnow@redhat.com
Cc: qemu-block@nongnu.org; qemu-devel@nongnu.org; 仇大玉 <qiudayu@huayun.com>
Subject: [PATCH v4] blockjob: Fix crash with IOthread when block commit after snapshot

From: Michael Qiu <qiudayu@huayun.com>

v4: rebase to latest code

v3: reformat the commit log, remove duplicate content

v2: modify the coredump backtrace within commit log with the newest
    qemu with master branch

Currently, if guest has workloads, IO thread will acquire aio_context lock before do io_submit, it leads to segmentfault when do block commit after snapshot. Just like below:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f7c7d91f700 (LWP 99907)] 0x00005576d0f65aab in bdrv_mirror_top_pwritev at ../block/mirror.c:1437
1437    ../block/mirror.c: No such file or directory.
(gdb) p s->job
$17 = (MirrorBlockJob *) 0x0
(gdb) p s->stop
$18 = false

(gdb) bt

Switch to qemu main thread:
/lib/../lib64/libpthread.so.0
/lib/../lib64/libpthread.so.0
../util/qemu-thread-posix.c:79
qapi/qapi-commands-block-core.c:346
../qapi/qmp-dispatch.c:110
/lib/../lib64/libglib-2.0.so.0

In IO thread when do bdrv_mirror_top_pwritev, the job is NULL, and stop field is false, this means the MirrorBDSOpaque "s" object has not been initialized yet, and this object is initialized by block_job_create(), but the initialize process is stuck in acquiring the lock.

The rootcause is that qemu do release/acquire when hold the lock, at the same time, IO thread get the lock after release stage, and the crash occured.

Actually, in this situation, job->job.aio_context will not equal to qemu_get_aio_context(), and will be the same as bs->aio_context, thus, no need to release the lock, becasue bdrv_root_attach_child() will not change the context.

This patch fix this issue.

Signed-off-by: Michael Qiu <qiudayu@huayun.com>
---
 blockjob.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/blockjob.c b/blockjob.c
index 98ac8af982..51a09f3b60 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -214,13 +214,15 @@ int block_job_add_bdrv(BlockJob *job, const char *name, BlockDriverState *bs,
     BdrvChild *c;
 
     bdrv_ref(bs);
-    if (job->job.aio_context != qemu_get_aio_context()) {
+    if (bdrv_get_aio_context(bs) != job->job.aio_context &&
+        job->job.aio_context != qemu_get_aio_context()) {
         aio_context_release(job->job.aio_context);
     }
     c = bdrv_root_attach_child(bs, name, &child_job, 0,
                                job->job.aio_context, perm, shared_perm, job,
                                errp);
-    if (job->job.aio_context != qemu_get_aio_context()) {
+    if (bdrv_get_aio_context(bs) != job->job.aio_context &&
+        job->job.aio_context != qemu_get_aio_context()) {
         aio_context_acquire(job->job.aio_context);
     }
     if (c == NULL) {
--
2.22.0