qemu_rbd_completion_cb() schedules the request completion code
(qemu_rbd_finish_bh()) to run in the BDS’s AioContext, assuming that
this is the same thread in which qemu_rbd_start_co() runs.
To explain, this is how both latter functions interact:
In qemu_rbd_start_co():
while (!task.complete)
qemu_coroutine_yield();
In qemu_rbd_finish_bh():
task->complete = true;
aio_co_wake(task->co); // task->co is qemu_rbd_start_co()
For this interaction to work reliably, both must run in the same thread
so that qemu_rbd_finish_bh() can only run once the coroutine yields.
Otherwise, finish_bh() may run before start_co() checks task.complete,
which will result in the latter seeing .complete as true immediately and
skipping the yield altogether, even though finish_bh() still wakes it.
With multiqueue, the BDS’s AioContext is not necessarily the thread
start_co() runs in, and so finish_bh() may be scheduled to run in a
different thread than start_co(). With the right timing, this will
cause the problems described above; waking a non-yielding coroutine is
not good, as can be reproduced by putting e.g. a usleep(100000) above
the while loop in start_co() (and using multiqueue), giving finish_bh()
a much better chance at exiting before start_co() can yield.
So instead of scheduling finish_bh() in the BDS’s AioContext, schedule
finish_bh() in task->co’s AioContext.
In addition, we can get rid of task.complete altogether because we will
get woken exactly once, when the task is indeed complete, no need to
check.
(We could go further and drop the BH, running aio_co_wake() directly in
qemu_rbd_completion_cb() because we are allowed to do that even if the
coroutine isn’t yet yielding and we’re in a different thread – but the
doc comment on qemu_rbd_completion_cb() says to be careful, so I decided
not to go so far here.)
Buglink: https://issues.redhat.com/browse/RHEL-67115
Reported-by: Junyao Zhao <junzhao@redhat.com>
Cc: qemu-stable@nongnu.org
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/rbd.c | 12 ++++--------
1 file changed, 4 insertions(+), 8 deletions(-)
diff --git a/block/rbd.c b/block/rbd.c
index 3611dc81cf..2a70b5a983 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -110,9 +110,7 @@ typedef struct BDRVRBDState {
} BDRVRBDState;
typedef struct RBDTask {
- BlockDriverState *bs;
Coroutine *co;
- bool complete;
int64_t ret;
} RBDTask;
@@ -1309,7 +1307,6 @@ static int qemu_rbd_resize(BlockDriverState *bs, uint64_t size)
static void qemu_rbd_finish_bh(void *opaque)
{
RBDTask *task = opaque;
- task->complete = true;
aio_co_wake(task->co);
}
@@ -1326,7 +1323,7 @@ static void qemu_rbd_completion_cb(rbd_completion_t c, RBDTask *task)
{
task->ret = rbd_aio_get_return_value(c);
rbd_aio_release(c);
- aio_bh_schedule_oneshot(bdrv_get_aio_context(task->bs),
+ aio_bh_schedule_oneshot(qemu_coroutine_get_aio_context(task->co),
qemu_rbd_finish_bh, task);
}
@@ -1338,7 +1335,7 @@ static int coroutine_fn qemu_rbd_start_co(BlockDriverState *bs,
RBDAIOCmd cmd)
{
BDRVRBDState *s = bs->opaque;
- RBDTask task = { .bs = bs, .co = qemu_coroutine_self() };
+ RBDTask task = { .co = qemu_coroutine_self() };
rbd_completion_t c;
int r;
@@ -1401,9 +1398,8 @@ static int coroutine_fn qemu_rbd_start_co(BlockDriverState *bs,
return r;
}
- while (!task.complete) {
- qemu_coroutine_yield();
- }
+ /* Expect exactly a single wake from qemu_rbd_finish_bh() */
+ qemu_coroutine_yield();
if (task.ret < 0) {
error_report("rbd request failed: cmd %d offset %" PRIu64 " bytes %"
--
2.51.1