When an NBD block driver state is moved from one aio_context to another
(e.g. when doing a drain in a migration thread),
nbd_client_attach_aio_context_bh is executed, which enters the
connection coroutine.

However, the assumption that ->connection_co is always present here
turns out to be incorrect: the connection may have encountered an error
other than -EIO in the underlying transport, decided to quit rather
than keep trying to reconnect, and therefore terminated the connection
coroutine. As a result, an attempt to reassign the client in this state
(NBD_CLIENT_QUIT) to a different aio_context leads to a null pointer
dereference:
at /build/qemu-6MF7tq/qemu-5.0.1/util/qemu-coroutine.c:109
opaque=0x561805ed4c00) at /build/qemu-6MF7tq/qemu-5.0.1/block/nbd.c:164
at /build/qemu-6MF7tq/qemu-5.0.1/util/aio-wait.c:55
at /build/qemu-6MF7tq/qemu-5.0.1/util/async.c:136
at /build/qemu-6MF7tq/qemu-5.0.1/util/async.c:164
blocking=blocking@entry=true)
at /build/qemu-6MF7tq/qemu-5.0.1/util/aio-posix.c:650
cb=<optimized out>, opaque=<optimized out>)
at /build/qemu-6MF7tq/qemu-5.0.1/util/aio-wait.c:71
bs=0x561805ed4c00) at /build/qemu-6MF7tq/qemu-5.0.1/block.c:6172
new_context=new_context@entry=0x5618056c7580,
ignore=ignore@entry=0x7f60e1e63780)
at /build/qemu-6MF7tq/qemu-5.0.1/block.c:6237
bs=bs@entry=0x561805ed4c00, ctx=0x5618056c7580,
ignore_child=<optimized out>, errp=<optimized out>)
at /build/qemu-6MF7tq/qemu-5.0.1/block.c:6332
new_context=0x5618056c7580, update_root_node=update_root_node@entry=true,
errp=errp@entry=0x0)
at /build/qemu-6MF7tq/qemu-5.0.1/block/block-backend.c:1989
new_context=<optimized out>, errp=errp@entry=0x0)
at /build/qemu-6MF7tq/qemu-5.0.1/block/block-backend.c:2010
at /build/qemu-6MF7tq/qemu-5.0.1/hw/block/dataplane/virtio-blk.c:292
at /build/qemu-6MF7tq/qemu-5.0.1/hw/virtio/virtio-bus.c:245
running=0, state=<optimized out>)
at /build/qemu-6MF7tq/qemu-5.0.1/hw/virtio/virtio.c:3220
state=state@entry=RUN_STATE_FINISH_MIGRATE)
at /build/qemu-6MF7tq/qemu-5.0.1/softmmu/vl.c:1275
send_stop=<optimized out>) at /build/qemu-6MF7tq/qemu-5.0.1/cpus.c:1032
at /build/qemu-6MF7tq/qemu-5.0.1/migration/migration.c:2914
at /build/qemu-6MF7tq/qemu-5.0.1/migration/migration.c:3275
at /build/qemu-6MF7tq/qemu-5.0.1/migration/migration.c:3439
at /build/qemu-6MF7tq/qemu-5.0.1/util/qemu-thread-posix.c:519
from /lib/x86_64-linux-gnu/libpthread.so.0
Fix it by checking that the connection coroutine is non-null before
trying to enter it. If it is null, there is nothing to enter, as the
connection is presumably going down anyway.
Signed-off-by: Roman Kagan <rvkagan@yandex-team.ru>
---
block/nbd.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/block/nbd.c b/block/nbd.c
index bcd6641e90..b3cbbeb4b0 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -250,13 +250,15 @@ static void nbd_client_attach_aio_context_bh(void *opaque)
BlockDriverState *bs = opaque;
BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
- /*
- * The node is still drained, so we know the coroutine has yielded in
- * nbd_read_eof(), the only place where bs->in_flight can reach 0, or it is
- * entered for the first time. Both places are safe for entering the
- * coroutine.
- */
- qemu_aio_coroutine_enter(bs->aio_context, s->connection_co);
+ if (s->connection_co) {
+ /*
+ * The node is still drained, so we know the coroutine has yielded in
+ * nbd_read_eof(), the only place where bs->in_flight can reach 0, or
+ * it is entered for the first time. Both places are safe for entering
+ * the coroutine.
+ */
+ qemu_aio_coroutine_enter(bs->aio_context, s->connection_co);
+ }
bdrv_dec_in_flight(bs);
}
--
2.29.2
28.01.2021 23:14, Roman Kagan wrote:
> When an NBD block driver state is moved from one aio_context to another
> (e.g. when doing a drain in a migration thread),
> nbd_client_attach_aio_context_bh is executed that enters the connection
> coroutine.
[...]
> Signed-off-by: Roman Kagan <rvkagan@yandex-team.ru>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
--
Best regards,
Vladimir