[PATCH 06/16] gluster: Do not move coroutine into BDS context

Hanna Czenczek posted 16 patches 2 weeks, 3 days ago
Maintainers: Paolo Bonzini <pbonzini@redhat.com>, "Alex Bennée" <alex.bennee@linaro.org>, Kevin Wolf <kwolf@redhat.com>, Hanna Reitz <hreitz@redhat.com>, Stefan Hajnoczi <stefanha@redhat.com>, Fam Zheng <fam@euphon.net>, Ronnie Sahlberg <ronniesahlberg@gmail.com>, Peter Lieven <pl@dlhnet.de>, "Philippe Mathieu-Daudé" <philmd@linaro.org>, Ilya Dryomov <idryomov@gmail.com>, "Richard W.M. Jones" <rjones@redhat.com>, Stefan Weil <sw@weilnetz.de>
[PATCH 06/16] gluster: Do not move coroutine into BDS context
Posted by Hanna Czenczek 2 weeks, 3 days ago
The request coroutine may not run in the BDS AioContext.  We should wake
it in its own context, not move it.

With that, we can remove GlusterAIOCB.aio_context.

Also add a comment why aio_co_schedule() is safe to use in this way.
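The same pattern exists outside QEMU: a completion callback running on a thread the event loop does not own must hand the wake-up over to the loop that owns the waiter, rather than resuming the waiter itself or migrating it elsewhere.  As a purely illustrative analogy (this is not QEMU code; all names here are invented), a minimal sketch in Python's asyncio:

```python
import asyncio
import threading

def completion_from_foreign_thread(loop, fut, ret):
    # Analogue of gluster_finish_aiocb(): this runs on a thread the event
    # loop knows nothing about, so we cannot resume the waiter directly.
    # We hand the wake-up to the loop that owns it, much like
    # aio_co_schedule() targeting the coroutine's own AioContext.
    loop.call_soon_threadsafe(fut.set_result, ret)

async def request():
    loop = asyncio.get_running_loop()
    fut = loop.create_future()
    # Simulate the library invoking our completion callback from its own
    # (non-QEMU) thread.
    threading.Thread(
        target=completion_from_foreign_thread, args=(loop, fut, 0)
    ).start()
    # The waiter yields exactly once and is woken in its own loop; it is
    # never moved to a different loop or thread.
    return await fut

print(asyncio.run(request()))  # prints 0
```

Here call_soon_threadsafe() plays roughly the role of aio_co_schedule(): it is the one safe way to poke an event loop from a thread that loop does not run in.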

**Note:** Due to a lack of a gluster set-up, I have not tested this
commit.  It seemed safe enough to send anyway, just maybe not to
qemu-stable.  To be clear, I don’t know of any user-visible bugs that
would arise from the state without this patch; the request coroutine is
moved into the main BDS AioContext, which feels wrong, but I’m not sure
whether it can actually produce hard bugs.

I’ll leave it to others’ opinions whether to keep or drop this patch
under these circumstances.

Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 block/gluster.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/block/gluster.c b/block/gluster.c
index 89abd40f31..4fb25b2c6d 100644
--- a/block/gluster.c
+++ b/block/gluster.c
@@ -56,7 +56,6 @@ typedef struct GlusterAIOCB {
     int64_t size;
     int ret;
     Coroutine *coroutine;
-    AioContext *aio_context;
 } GlusterAIOCB;
 
 typedef struct BDRVGlusterState {
@@ -743,7 +742,17 @@ static void gluster_finish_aiocb(struct glfs_fd *fd, ssize_t ret,
         acb->ret = -EIO; /* Partial read/write - fail it */
     }
 
-    aio_co_schedule(acb->aio_context, acb->coroutine);
+    /*
+     * Safe to call: The coroutine will yield exactly once awaiting this
+     * scheduling, and the context is its own context, so it will be scheduled
+     * once it does yield.
+     *
+     * (aio_co_wake() would call qemu_get_current_aio_context() to check whether
+     * we are in the same context, but we are not in a qemu thread, so we cannot
+     * do that.  Use aio_co_schedule() directly.)
+     */
+    aio_co_schedule(qemu_coroutine_get_aio_context(acb->coroutine),
+                    acb->coroutine);
 }
 
 static void qemu_gluster_parse_flags(int bdrv_flags, int *open_flags)
@@ -1006,7 +1015,6 @@ static coroutine_fn int qemu_gluster_co_pwrite_zeroes(BlockDriverState *bs,
     acb.size = bytes;
     acb.ret = 0;
     acb.coroutine = qemu_coroutine_self();
-    acb.aio_context = bdrv_get_aio_context(bs);
 
     ret = glfs_zerofill_async(s->fd, offset, bytes, gluster_finish_aiocb, &acb);
     if (ret < 0) {
@@ -1184,7 +1192,6 @@ static coroutine_fn int qemu_gluster_co_rw(BlockDriverState *bs,
     acb.size = size;
     acb.ret = 0;
     acb.coroutine = qemu_coroutine_self();
-    acb.aio_context = bdrv_get_aio_context(bs);
 
     if (write) {
         ret = glfs_pwritev_async(s->fd, qiov->iov, qiov->niov, offset, 0,
@@ -1251,7 +1258,6 @@ static coroutine_fn int qemu_gluster_co_flush_to_disk(BlockDriverState *bs)
     acb.size = 0;
     acb.ret = 0;
     acb.coroutine = qemu_coroutine_self();
-    acb.aio_context = bdrv_get_aio_context(bs);
 
     ret = glfs_fsync_async(s->fd, gluster_finish_aiocb, &acb);
     if (ret < 0) {
@@ -1299,7 +1305,6 @@ static coroutine_fn int qemu_gluster_co_pdiscard(BlockDriverState *bs,
     acb.size = 0;
     acb.ret = 0;
     acb.coroutine = qemu_coroutine_self();
-    acb.aio_context = bdrv_get_aio_context(bs);
 
     ret = glfs_discard_async(s->fd, offset, bytes, gluster_finish_aiocb, &acb);
     if (ret < 0) {
-- 
2.51.0


Re: [PATCH 06/16] gluster: Do not move coroutine into BDS context
Posted by Kevin Wolf 2 weeks, 2 days ago
On 28.10.2025 at 17:33, Hanna Czenczek wrote:
> The request coroutine may not run in the BDS AioContext.  We should wake
> it in its own context, not move it.
> 
> With that, we can remove GlusterAIOCB.aio_context.
> 
> Also add a comment why aio_co_schedule() is safe to use in this way.
> 
> **Note:** Due to a lack of a gluster set-up, I have not tested this
> commit.  It seemed safe enough to send anyway, just maybe not to
> qemu-stable.  To be clear, I don’t know of any user-visible bugs that
> would arise from the state without this patch; the request coroutine is
> moved into the main BDS AioContext, which feels wrong, but I’m not sure
> whether it can actually produce hard bugs.

Doesn't moving into a different AioContext mean that everything down to
the AIO callback in the guest device runs in a different thread? That's
definitely a big no-no for virtio-scsi, and I think also for virtio-blk.

> I’ll leave it to others’ opinions whether to keep or drop this patch
> under these circumstances.
> 
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>

It looks trivially correct (famous last words...) and the bugs not
unlikely to be hit, so I'd say keep it.

I have no idea if the gluster library is actually thread safe, but if it
isn't, that breaks before gluster_finish_aiocb(). After reentering the
coroutine, the library isn't called any more.

Kevin


Re: [PATCH 06/16] gluster: Do not move coroutine into BDS context
Posted by Hanna Czenczek 2 weeks ago
On 29.10.25 18:10, Kevin Wolf wrote:
> On 28.10.2025 at 17:33, Hanna Czenczek wrote:
>> The request coroutine may not run in the BDS AioContext.  We should wake
>> it in its own context, not move it.
>>
>> With that, we can remove GlusterAIOCB.aio_context.
>>
>> Also add a comment why aio_co_schedule() is safe to use in this way.
>>
>> **Note:** Due to a lack of a gluster set-up, I have not tested this
>> commit.  It seemed safe enough to send anyway, just maybe not to
>> qemu-stable.  To be clear, I don’t know of any user-visible bugs that
>> would arise from the state without this patch; the request coroutine is
>> moved into the main BDS AioContext, which feels wrong, but I’m not sure
>> whether it can actually produce hard bugs.
> Doesn't moving into a different AioContext mean that everything down to
> the AIO callback in the guest device runs in a different thread? That's
> definitely a big no-no for virtio-scsi, and I think also for virtio-blk.

It does, yes.

>> I’ll leave it to others’ opinions whether to keep or drop this patch
>> under these circumstances.
>>
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> It looks trivially correct (famous last words...) and the bugs not
> unlikely to be hit, so I'd say keep it.
>
> I have no idea if the gluster library is actually thread safe, but if it
> isn't, that breaks before gluster_finish_aiocb(). After reentering the
> coroutine, the library isn't called any more.

OK.

Hanna