[RFC] block-backend: prevent dangling BDS pointer in blk_drain()

Stefan Hajnoczi posted 1 patch 2 years, 4 months ago
Test checkpatch passed
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20211209142304.381253-1-stefanha@redhat.com
Maintainers: Hanna Reitz <hreitz@redhat.com>, Kevin Wolf <kwolf@redhat.com>
block/block-backend.c | 2 ++
1 file changed, 2 insertions(+)
[RFC] block-backend: prevent dangling BDS pointer in blk_drain()
Posted by Stefan Hajnoczi 2 years, 4 months ago
The BlockBackend root child can change during bdrv_drained_begin() when
aio_poll() is invoked. In fact the BlockDriverState can reach refcnt 0
and blk_drain() is left with a dangling BDS pointer.

One example is scsi_device_purge_requests(), which calls blk_drain() to
wait for in-flight requests to cancel. If the backup blockjob is active,
then the BlockBackend root child is a temporary filter BDS owned by the
blockjob. The blockjob can complete during bdrv_drained_begin() and the
last reference to the BDS is released when the temporary filter node is
removed. This results in a use-after-free when blk_drain() calls
bdrv_drained_end(bs) on the dangling pointer.

The general problem is that a function and its callers must not assume
that bs is still valid across aio_poll(). Explicitly hold a reference to
bs in blk_drain() to avoid the dangling pointer.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
I found that BDS nodes are sometimes deleted with bs->quiesce_counter >
0 (at least when running "make check"), so it is currently not possible
to put the bdrv_ref/unref() calls in bdrv_do_drained_begin() and
bdrv_do_drained_end() because they will be unbalanced. That would have
been a more general solution than only fixing blk_drain().

Any suggestions for a better fix?

I think it's likely that more "dangling pointer across aio_poll()"
problems exist :(.

Here is the (hacky) reproducer:

  build/qemu-system-x86_64 \
     -name 'avocado-vt-vm1'  \
     -sandbox on  \
     -machine q35,memory-backend=mem-machine_mem \
     -device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
     -device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0  \
     -nodefaults \
     -device VGA,bus=pcie.0,addr=0x2 \
     -m 1024 \
     -object memory-backend-ram,size=1024M,id=mem-machine_mem  \
     -smp 10,maxcpus=10,cores=5,threads=1,dies=1,sockets=2  \
     -cpu 'Cascadelake-Server-noTSX',+kvm_pv_unhalt \
     -chardev socket,wait=off,server=on,id=qmp_id_qmpmonitor1,path=/tmp/qmp.sock  \
     -mon chardev=qmp_id_qmpmonitor1,mode=control \
     -chardev socket,wait=off,server=on,id=qmp_id_catch_monitor,path=/tmp/catch_monitor.sock  \
     -mon chardev=qmp_id_catch_monitor,mode=control \
     -device pvpanic,ioport=0x505,id=idgKHYrQ \
     -chardev socket,wait=off,server=on,id=chardev_serial0,path=/tmp/serial.sock \
     -device isa-serial,id=serial0,chardev=chardev_serial0  \
     -chardev socket,id=seabioslog_id_20211110-012521-TNCkxDmn,path=/tmp/seabios.sock,server=on,wait=off \
     -device isa-debugcon,chardev=seabioslog_id_20211110-012521-TNCkxDmn,iobase=0x402 \
     -device pcie-root-port,id=pcie-root-port-2,port=0x2,addr=0x1.0x2,bus=pcie.0,chassis=3 \
     -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pcie-root-port-2,addr=0x0 \
     -blockdev node-name=file_image1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=test.img,cache.direct=on,cache.no-flush=off \
     -blockdev node-name=drive_image1,driver=raw,read-only=off,cache.direct=on,cache.no-flush=off,file=file_image1 \
     -device scsi-hd,id=image1,drive=drive_image1,write-cache=on \
     -blockdev node-name=file_src1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=sr1.qcow2,cache.direct=on,cache.no-flush=off \
     -blockdev node-name=drive_src1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_src1 \
     -device scsi-hd,id=src1,drive=drive_src1,write-cache=on \
     -device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
     -device virtio-net-pci,mac=9a:11:64:b0:5d:a8,id=idxnEEYY,netdev=idBjpylo,bus=pcie-root-port-3,addr=0x0  \
     -netdev user,id=idBjpylo  \
     -vnc :0  \
     -rtc base=utc,clock=host,driftfix=slew  \
     -boot menu=off,order=cdn,once=c,strict=off \
     -enable-kvm \
     -device pcie-root-port,id=pcie_extra_root_port_0,multifunction=on,bus=pcie.0,addr=0x3,chassis=5 &

  sleep 8 # delay for VM startup and socket creation

  nc -U /tmp/qmp.sock <<EOF
  {'execute': 'qmp_capabilities'}
  {'execute': 'blockdev-create', 'arguments': {'options': {'driver': 'file', 'filename': 'dst1.qcow2', 'size': 209715200}, 'job-id': 'file_dst1'}, 'id': 'Fk1bF3FV'}
  EOF
  sleep 1 # wait for blockdev-create completion
  nc -U /tmp/qmp.sock <<EOF
  {'execute': 'qmp_capabilities'}
  {'execute': 'job-dismiss', 'arguments': {'id': 'file_dst1'}, 'id': '13R5TDSj'}
  {'execute': 'blockdev-add', 'arguments': {'node-name': 'file_dst1', 'driver': 'file', 'filename': 'dst1.qcow2', 'aio': 'threads', 'auto-read-only': true, 'discard': 'unmap'}, 'id': 'VIzrN0zy'}
  {'execute': 'blockdev-create', 'arguments': {'options': {'driver': 'qcow2', 'file': 'file_dst1', 'size': 209715200}, 'job-id': 'drive_dst1'}, 'id': 'YX8t8hBs'}
  EOF
  sleep 1 # wait for blockdev-create completion
  nc -U /tmp/qmp.sock <<EOF
  {'execute': 'qmp_capabilities'}
  {'execute': 'job-dismiss', 'arguments': {'id': 'drive_dst1'}, 'id': 'OTZwYb7J'}
  {'execute': 'blockdev-add', 'arguments': {'node-name': 'drive_dst1', 'driver': 'qcow2', 'file': 'file_dst1', 'read-only': false}, 'id': 'QHyUxtql'}
  {'execute': 'system_reset', 'id': 'OREutgnz'}
  {'execute': 'blockdev-backup', 'arguments': {'device': 'drive_src1', 'target': 'drive_dst1', 'job-id': 'drive_src1_qnFF', 'sync': 'full', 'speed': 0, 'compress': false, 'auto-finalize': true, 'auto-dismiss': true, 'on-source-error': 'report', 'on-target-error': 'report'}, 'id': 'WbDARa8c'}
  EOF
---
 block/block-backend.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/block-backend.c b/block/block-backend.c
index 12ef80ea17..5608c0451b 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1705,6 +1705,7 @@ void blk_drain(BlockBackend *blk)
     BlockDriverState *bs = blk_bs(blk);
 
     if (bs) {
+        bdrv_ref(bs);
         bdrv_drained_begin(bs);
     }
 
@@ -1714,6 +1715,7 @@ void blk_drain(BlockBackend *blk)
 
     if (bs) {
         bdrv_drained_end(bs);
+        bdrv_unref(bs);
     }
 }
 
-- 
2.33.1


Re: [RFC] block-backend: prevent dangling BDS pointer in blk_drain()
Posted by Kevin Wolf 2 years, 4 months ago
On 09.12.2021 at 15:23, Stefan Hajnoczi wrote:
> The BlockBackend root child can change during bdrv_drained_begin() when
> aio_poll() is invoked. In fact the BlockDriverState can reach refcnt 0
> and blk_drain() is left with a dangling BDS pointer.
> 
> One example is scsi_device_purge_requests(), which calls blk_drain() to
> wait for in-flight requests to cancel. If the backup blockjob is active,
> then the BlockBackend root child is a temporary filter BDS owned by the
> blockjob. The blockjob can complete during bdrv_drained_begin() and the
> last reference to the BDS is released when the temporary filter node is
> removed. This results in a use-after-free when blk_drain() calls
> bdrv_drained_end(bs) on the dangling pointer.
> 
> The general problem is that a function and its callers must not assume
> that bs is still valid across aio_poll(). Explicitly hold a reference to
> bs in blk_drain() to avoid the dangling pointer.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
> I found that BDS nodes are sometimes deleted with bs->quiesce_counter >
> 0 (at least when running "make check"), so it is currently not possible
> to put the bdrv_ref/unref() calls in bdrv_do_drained_begin() and
> bdrv_do_drained_end() because they will be unbalanced. That would have
> been a more general solution than only fixing blk_drain().

They are not supposed to end up unbalanced because detaching a child
calls bdrv_unapply_subtree_drain(). In fact, I think test-bdrv-drain
tests a few scenarios like this.

Do you have more details about the case that failed for you?

Kevin


Re: [RFC] block-backend: prevent dangling BDS pointer in blk_drain()
Posted by Stefan Hajnoczi 2 years, 4 months ago
On Fri, Dec 10, 2021 at 03:00:38PM +0100, Kevin Wolf wrote:
> On 09.12.2021 at 15:23, Stefan Hajnoczi wrote:
> > The BlockBackend root child can change during bdrv_drained_begin() when
> > aio_poll() is invoked. In fact the BlockDriverState can reach refcnt 0
> > and blk_drain() is left with a dangling BDS pointer.
> > 
> > One example is scsi_device_purge_requests(), which calls blk_drain() to
> > wait for in-flight requests to cancel. If the backup blockjob is active,
> > then the BlockBackend root child is a temporary filter BDS owned by the
> > blockjob. The blockjob can complete during bdrv_drained_begin() and the
> > last reference to the BDS is released when the temporary filter node is
> > removed. This results in a use-after-free when blk_drain() calls
> > bdrv_drained_end(bs) on the dangling pointer.
> > 
> > The general problem is that a function and its callers must not assume
> > that bs is still valid across aio_poll(). Explicitly hold a reference to
> > bs in blk_drain() to avoid the dangling pointer.
> > 
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> > I found that BDS nodes are sometimes deleted with bs->quiesce_counter >
> > 0 (at least when running "make check"), so it is currently not possible
> > to put the bdrv_ref/unref() calls in bdrv_do_drained_begin() and
> > bdrv_do_drained_end() because they will be unbalanced. That would have
> > been a more general solution than only fixing blk_drain().
> 
> They are not supposed to end up unbalanced because detaching a child
> calls bdrv_unapply_subtree_drain(). In fact, I think test-bdrv-drain
> tests a few scenarios like this.
> 
> Do you have more details about the case that failed for you?

You're right. I had put the assert(bs->quiesce_counter == 0) statement
too early in bdrv_delete(). Moving it after bdrv_close() shows that the
counter reaches 0 thanks to the bdrv_drain_all_end_quiesce() call in
bdrv_close().

Stefan
Re: [RFC] block-backend: prevent dangling BDS pointer in blk_drain()
Posted by Hanna Reitz 2 years, 4 months ago
On 09.12.21 15:23, Stefan Hajnoczi wrote:
> The BlockBackend root child can change during bdrv_drained_begin() when
> aio_poll() is invoked. In fact the BlockDriverState can reach refcnt 0
> and blk_drain() is left with a dangling BDS pointer.
>
> One example is scsi_device_purge_requests(), which calls blk_drain() to
> wait for in-flight requests to cancel. If the backup blockjob is active,
> then the BlockBackend root child is a temporary filter BDS owned by the
> blockjob. The blockjob can complete during bdrv_drained_begin() and the
> last reference to the BDS is released when the temporary filter node is
> removed. This results in a use-after-free when blk_drain() calls
> bdrv_drained_end(bs) on the dangling pointer.
>
> The general problem is that a function and its callers must not assume
> that bs is still valid across aio_poll(). Explicitly hold a reference to
> bs in blk_drain() to avoid the dangling pointer.
>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
> I found that BDS nodes are sometimes deleted with bs->quiesce_counter >
> 0 (at least when running "make check"), so it is currently not possible
> to put the bdrv_ref/unref() calls in bdrv_do_drained_begin() and
> bdrv_do_drained_end() because they will be unbalanced. That would have
> been a more general solution than only fixing blk_drain().

Deleting nodes that have a `quiesce_counter > 0` doesn’t seem wrong to 
me – deleting only depends on strong references, and so I’d expect that 
anything that increases the quiesce_counter also has a strong reference 
to the node if the former wants the latter to stay around.

I suppose we could make it so that both the quiesce_counter and the 
refcnt need to be 0 before a BDS is deleted (and then deletion can 
happen both from bdrv_unref() and drained_end), but I don’t know whether 
that’s really necessary.  I’d rather leave it to the caller to ensure 
they keep a strong reference throughout the drain.

The question is, how often do we have a situation like this, where we 
take a weak reference for draining, because we assume there’s a strong 
reference backing us up (namely the one through blk->root), but that 
strong reference then can go away due to draining...

> Any suggestions for a better fix?

The fix makes sense to me.

One alternative that comes to my mind is to instead re-fetch `bs = 
blk_bs(blk);` after the AIO_WAIT_WHILE() loop.  But that might be wrong, 
because if the node attached to the BB changed (i.e. isn’t `bs`, and 
isn’t `NULL`), then we’d end the drain on the wrong node.

So I think your fix is the right one.

Hanna

> I think it's likely that more "dangling pointer across aio_poll()"
> problems exist :(.
>
> Here is the (hacky) reproducer:
>
>    build/qemu-system-x86_64 \
>       -name 'avocado-vt-vm1'  \
>       -sandbox on  \
>       -machine q35,memory-backend=mem-machine_mem \
>       -device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
>       -device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0  \
>       -nodefaults \
>       -device VGA,bus=pcie.0,addr=0x2 \
>       -m 1024 \
>       -object memory-backend-ram,size=1024M,id=mem-machine_mem  \
>       -smp 10,maxcpus=10,cores=5,threads=1,dies=1,sockets=2  \
>       -cpu 'Cascadelake-Server-noTSX',+kvm_pv_unhalt \
>       -chardev socket,wait=off,server=on,id=qmp_id_qmpmonitor1,path=/tmp/qmp.sock  \
>       -mon chardev=qmp_id_qmpmonitor1,mode=control \
>       -chardev socket,wait=off,server=on,id=qmp_id_catch_monitor,path=/tmp/catch_monitor.sock  \
>       -mon chardev=qmp_id_catch_monitor,mode=control \
>       -device pvpanic,ioport=0x505,id=idgKHYrQ \
>       -chardev socket,wait=off,server=on,id=chardev_serial0,path=/tmp/serial.sock \
>       -device isa-serial,id=serial0,chardev=chardev_serial0  \
>       -chardev socket,id=seabioslog_id_20211110-012521-TNCkxDmn,path=/tmp/seabios.sock,server=on,wait=off \
>       -device isa-debugcon,chardev=seabioslog_id_20211110-012521-TNCkxDmn,iobase=0x402 \
>       -device pcie-root-port,id=pcie-root-port-2,port=0x2,addr=0x1.0x2,bus=pcie.0,chassis=3 \
>       -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pcie-root-port-2,addr=0x0 \
>       -blockdev node-name=file_image1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=test.img,cache.direct=on,cache.no-flush=off \
>       -blockdev node-name=drive_image1,driver=raw,read-only=off,cache.direct=on,cache.no-flush=off,file=file_image1 \
>       -device scsi-hd,id=image1,drive=drive_image1,write-cache=on \
>       -blockdev node-name=file_src1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=sr1.qcow2,cache.direct=on,cache.no-flush=off \
>       -blockdev node-name=drive_src1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_src1 \
>       -device scsi-hd,id=src1,drive=drive_src1,write-cache=on \
>       -device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
>       -device virtio-net-pci,mac=9a:11:64:b0:5d:a8,id=idxnEEYY,netdev=idBjpylo,bus=pcie-root-port-3,addr=0x0  \
>       -netdev user,id=idBjpylo  \
>       -vnc :0  \
>       -rtc base=utc,clock=host,driftfix=slew  \
>       -boot menu=off,order=cdn,once=c,strict=off \
>       -enable-kvm \
>       -device pcie-root-port,id=pcie_extra_root_port_0,multifunction=on,bus=pcie.0,addr=0x3,chassis=5 &
>
>    sleep 8 # delay for VM startup and socket creation
>
>    nc -U /tmp/qmp.sock <<EOF
>    {'execute': 'qmp_capabilities'}
>    {'execute': 'blockdev-create', 'arguments': {'options': {'driver': 'file', 'filename': 'dst1.qcow2', 'size': 209715200}, 'job-id': 'file_dst1'}, 'id': 'Fk1bF3FV'}
>    EOF
>    sleep 1 # wait for blockdev-create completion
>    nc -U /tmp/qmp.sock <<EOF
>    {'execute': 'qmp_capabilities'}
>    {'execute': 'job-dismiss', 'arguments': {'id': 'file_dst1'}, 'id': '13R5TDSj'}
>    {'execute': 'blockdev-add', 'arguments': {'node-name': 'file_dst1', 'driver': 'file', 'filename': 'dst1.qcow2', 'aio': 'threads', 'auto-read-only': true, 'discard': 'unmap'}, 'id': 'VIzrN0zy'}
>    {'execute': 'blockdev-create', 'arguments': {'options': {'driver': 'qcow2', 'file': 'file_dst1', 'size': 209715200}, 'job-id': 'drive_dst1'}, 'id': 'YX8t8hBs'}
>    EOF
>    sleep 1 # wait for blockdev-create completion
>    nc -U /tmp/qmp.sock <<EOF
>    {'execute': 'qmp_capabilities'}
>    {'execute': 'job-dismiss', 'arguments': {'id': 'drive_dst1'}, 'id': 'OTZwYb7J'}
>    {'execute': 'blockdev-add', 'arguments': {'node-name': 'drive_dst1', 'driver': 'qcow2', 'file': 'file_dst1', 'read-only': false}, 'id': 'QHyUxtql'}
>    {'execute': 'system_reset', 'id': 'OREutgnz'}
>    {'execute': 'blockdev-backup', 'arguments': {'device': 'drive_src1', 'target': 'drive_dst1', 'job-id': 'drive_src1_qnFF', 'sync': 'full', 'speed': 0, 'compress': false, 'auto-finalize': true, 'auto-dismiss': true, 'on-source-error': 'report', 'on-target-error': 'report'}, 'id': 'WbDARa8c'}
>    EOF
> ---
>   block/block-backend.c | 2 ++
>   1 file changed, 2 insertions(+)
>
> diff --git a/block/block-backend.c b/block/block-backend.c
> index 12ef80ea17..5608c0451b 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -1705,6 +1705,7 @@ void blk_drain(BlockBackend *blk)
>       BlockDriverState *bs = blk_bs(blk);
>   
>       if (bs) {
> +        bdrv_ref(bs);
>           bdrv_drained_begin(bs);
>       }
>   
> @@ -1714,6 +1715,7 @@ void blk_drain(BlockBackend *blk)
>   
>       if (bs) {
>           bdrv_drained_end(bs);
> +        bdrv_unref(bs);
>       }
>   }
>   


Re: [RFC] block-backend: prevent dangling BDS pointer in blk_drain()
Posted by Stefan Hajnoczi 2 years, 4 months ago
On Thu, Dec 09, 2021 at 04:45:13PM +0100, Hanna Reitz wrote:
> On 09.12.21 15:23, Stefan Hajnoczi wrote:
> > The BlockBackend root child can change during bdrv_drained_begin() when
> > aio_poll() is invoked. In fact the BlockDriverState can reach refcnt 0
> > and blk_drain() is left with a dangling BDS pointer.
> > 
> > One example is scsi_device_purge_requests(), which calls blk_drain() to
> > wait for in-flight requests to cancel. If the backup blockjob is active,
> > then the BlockBackend root child is a temporary filter BDS owned by the
> > blockjob. The blockjob can complete during bdrv_drained_begin() and the
> > last reference to the BDS is released when the temporary filter node is
> > removed. This results in a use-after-free when blk_drain() calls
> > bdrv_drained_end(bs) on the dangling pointer.
> > 
> > The general problem is that a function and its callers must not assume
> > that bs is still valid across aio_poll(). Explicitly hold a reference to
> > bs in blk_drain() to avoid the dangling pointer.
> > 
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> > I found that BDS nodes are sometimes deleted with bs->quiesce_counter >
> > 0 (at least when running "make check"), so it is currently not possible
> > to put the bdrv_ref/unref() calls in bdrv_do_drained_begin() and
> > bdrv_do_drained_end() because they will be unbalanced. That would have
> > been a more general solution than only fixing blk_drain().
> 
> Deleting nodes that have a `quiesce_counter > 0` doesn’t seem wrong to me –
> deleting only depends on strong references, and so I’d expect that anything
> that increases the quiesce_counter also has a strong reference to the node
> if the former wants the latter to stay around.
> 
> I suppose we could make it so that both the quiesce_counter and the refcnt
> need to be 0 before a BDS is deleted (and then deletion can happen both from
> bdrv_unref() and drained_end), but I don’t know whether that’s really
> necessary.  I’d rather leave it to the caller to ensure they keep a strong
> reference throughout the drain.
> 
> The question is, how often do we have a situation like this, where we take a
> weak reference for draining, because we assume there’s a strong reference
> backing us up (namely the one through blk->root), but that strong reference
> then can go away due to draining...
> 
> > Any suggestions for a better fix?
> 
> The fix makes sense to me.

Okay. My concern was that this is a whole class of bugs and my patch
only fixes blk_drain(). I have audited the code some more in the
meantime.

bdrv_insert_node() may be unsafe in the case where bs is a temporary
filter node that is unref'd during bdrv_drained_begin():

  BlockDriverState *bdrv_insert_node(BlockDriverState *bs, QDict *options,
                                     int flags, Error **errp)
  {
      ERRP_GUARD();
      int ret;
      BlockDriverState *new_node_bs = NULL;
      const char *drvname, *node_name;
      BlockDriver *drv;
  
      drvname = qdict_get_try_str(options, "driver");
      if (!drvname) {
          error_setg(errp, "driver is not specified");
          goto fail;
      }
  
      drv = bdrv_find_format(drvname);
      if (!drv) {
          error_setg(errp, "Unknown driver: '%s'", drvname);
          goto fail;
      }
  
      node_name = qdict_get_try_str(options, "node-name");
  
      new_node_bs = bdrv_new_open_driver_opts(drv, node_name, options, flags,
                                              errp);
      options = NULL; /* bdrv_new_open_driver() eats options */
      if (!new_node_bs) {
          error_prepend(errp, "Could not create node: ");
          goto fail;
      }
  
      bdrv_drained_begin(bs);
      ^^^^^^^^^^^^^^^^^^^^^^^ <--- bs can be dangling pointer
      ret = bdrv_replace_node(bs, new_node_bs, errp);
      bdrv_drained_end(bs);

The fix isn't as simple as blk_drain() because we don't want to insert
the new node before the now-deleted node. I think the correct way to
insert a node is against BdrvChild, not BlockDriverState. That way we
can be sure the new node will be inserted into a graph that is reachable
via BdrvChild (e.g. BlockBackend) instead of a detached BDS.

bdrv_set_aio_context_ignore() and blk_io_limits_disable() need to ref bs
like blk_drain() in this patch.

There are some other bdrv_drained_begin() calls that I'm assuming are
safe because they happen during creation/deletion, so either we hold
strong references there or nothing else knows about our BDS yet.

Do you agree with extending this patch series to cover the functions I
mentioned above?

> One alternative that comes to my mind is to instead re-fetch `bs =
> blk_bs(blk);` after the AIO_WAIT_WHILE() loop.  But that might be wrong,
> because if the node attached to the BB changed (i.e. isn’t `bs`, and isn’t
> `NULL`), then we’d end the drain on the wrong node.

Yes.

Stefan
Re: [RFC] block-backend: prevent dangling BDS pointer in blk_drain()
Posted by Vladimir Sementsov-Ogievskiy 2 years, 4 months ago
09.12.2021 19:32, Stefan Hajnoczi wrote:
> On Thu, Dec 09, 2021 at 04:45:13PM +0100, Hanna Reitz wrote:
>> On 09.12.21 15:23, Stefan Hajnoczi wrote:
>>> The BlockBackend root child can change during bdrv_drained_begin() when
>>> aio_poll() is invoked. In fact the BlockDriverState can reach refcnt 0
>>> and blk_drain() is left with a dangling BDS pointer.
>>>
>>> One example is scsi_device_purge_requests(), which calls blk_drain() to
>>> wait for in-flight requests to cancel. If the backup blockjob is active,
>>> then the BlockBackend root child is a temporary filter BDS owned by the
>>> blockjob. The blockjob can complete during bdrv_drained_begin() and the
>>> last reference to the BDS is released when the temporary filter node is
>>> removed. This results in a use-after-free when blk_drain() calls
>>> bdrv_drained_end(bs) on the dangling pointer.
>>>
>>> The general problem is that a function and its callers must not assume
>>> that bs is still valid across aio_poll(). Explicitly hold a reference to
>>> bs in blk_drain() to avoid the dangling pointer.
>>>
>>> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
>>> ---
>>> I found that BDS nodes are sometimes deleted with bs->quiesce_counter >
>>> 0 (at least when running "make check"), so it is currently not possible
>>> to put the bdrv_ref/unref() calls in bdrv_do_drained_begin() and
>>> bdrv_do_drained_end() because they will be unbalanced. That would have
>>> been a more general solution than only fixing blk_drain().
>>
>> Deleting nodes that have a `quiesce_counter > 0` doesn’t seem wrong to me –
>> deleting only depends on strong references, and so I’d expect that anything
>> that increases the quiesce_counter also has a strong reference to the node
>> if the former wants the latter to stay around.
>>
>> I suppose we could make it so that both the quiesce_counter and the refcnt
>> need to be 0 before a BDS is deleted (and then deletion can happen both from
>> bdrv_unref() and drained_end), but I don’t know whether that’s really
>> necessary.  I’d rather leave it to the caller to ensure they keep a strong
>> reference throughout the drain.
>>
>> The question is, how often do we have a situation like this, where we take a
>> weak reference for draining, because we assume there’s a strong reference
>> backing us up (namely the one through blk->root), but that strong reference
>> then can go away due to draining...
>>
>>> Any suggestions for a better fix?
>>
>> The fix makes sense to me.
> 
> Okay. My concern was that this is a whole class of bugs and my patch
> only fixes blk_drain(). I have audited the code some more in the
> meantime.
> 
> bdrv_insert_node() may be unsafe in the case where bs is a temporary
> filter node that is unref'd during bdrv_drained_begin():
> 
>    BlockDriverState *bdrv_insert_node(BlockDriverState *bs, QDict *options,
>                                       int flags, Error **errp)
>    {
>        ERRP_GUARD();
>        int ret;
>        BlockDriverState *new_node_bs = NULL;
>        const char *drvname, *node_name;
>        BlockDriver *drv;
>    
>        drvname = qdict_get_try_str(options, "driver");
>        if (!drvname) {
>            error_setg(errp, "driver is not specified");
>            goto fail;
>        }
>    
>        drv = bdrv_find_format(drvname);
>        if (!drv) {
>            error_setg(errp, "Unknown driver: '%s'", drvname);
>            goto fail;
>        }
>    
>        node_name = qdict_get_try_str(options, "node-name");
>    
>        new_node_bs = bdrv_new_open_driver_opts(drv, node_name, options, flags,
>                                                errp);
>        options = NULL; /* bdrv_new_open_driver() eats options */
>        if (!new_node_bs) {
>            error_prepend(errp, "Could not create node: ");
>            goto fail;
>        }
>    
>        bdrv_drained_begin(bs);
>        ^^^^^^^^^^^^^^^^^^^^^^^ <--- bs can be dangling pointer
>        ret = bdrv_replace_node(bs, new_node_bs, errp);
>        bdrv_drained_end(bs);
> 
> The fix isn't as simple as blk_drain() because we don't want to insert
> the new node before the now-deleted node. I think the correct way to
> insert a node is against BdrvChild, not BlockDriverState. That way we
> can be sure the new node will be inserted into a graph that is reachable
> via BdrvChild (e.g. BlockBackend) instead of a detached BDS.
> 
> bdrv_set_aio_context_ignore() and blk_io_limits_disable() need to ref bs
> like blk_drain() in this patch.
> 
> There are some other bdrv_drained_begin() calls that I'm assuming are
> safe because they are during creation/deletion so I think we have strong
> references there or nothing else knows about our BDS yet.
> 
> Do you agree with extending this patch series to cover the functions I
> mentioned above?

I'm not sure.

First, we can't support arbitrary graph changes in the middle of another graph-changing operation.

Actually, while one graph-changing operation is in progress, any other graph-changing operation should be forbidden and made to wait. By adding strong references everywhere we can possibly avoid the crashes, but what about the logic? If several graph-changing operations run simultaneously, the result is completely unpredictable, and that's not what the user wants.

The problem is known. For example, it could lead to an iotest 030 failure (probably it no longer can, after Hanna's changes), but I don't think we'll be safe until we have some kind of global mutex for graph-changing operations. For example, here is my old attempt: https://lists.nongnu.org/archive/html/qemu-devel/2020-11/msg05290.html .

So, until we have a good solution for this, it's probably better to do only the necessary small fixes, like your original patch.


Second, bs may actually disappear at the first yield point, which happens even earlier than bdrv_drained_begin(): in bdrv_new_open_driver_opts(). So if we fix something, we'd better declare that the caller of bdrv_insert_node() is responsible for ensuring bs does not disappear during the call, and then audit the callers.

-- 
Best regards,
Vladimir

Re: [RFC] block-backend: prevent dangling BDS pointer in blk_drain()
Posted by Stefan Hajnoczi 2 years, 4 months ago
On Thu, Dec 09, 2021 at 07:51:02PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 09.12.2021 19:32, Stefan Hajnoczi wrote:
> > On Thu, Dec 09, 2021 at 04:45:13PM +0100, Hanna Reitz wrote:
> > > On 09.12.21 15:23, Stefan Hajnoczi wrote:
> > > > The BlockBackend root child can change during bdrv_drained_begin() when
> > > > aio_poll() is invoked. In fact the BlockDriverState can reach refcnt 0
> > > > and blk_drain() is left with a dangling BDS pointer.
> > > > 
> > > > One example is scsi_device_purge_requests(), which calls blk_drain() to
> > > > wait for in-flight requests to cancel. If the backup blockjob is active,
> > > > then the BlockBackend root child is a temporary filter BDS owned by the
> > > > blockjob. The blockjob can complete during bdrv_drained_begin() and the
> > > > last reference to the BDS is released when the temporary filter node is
> > > > removed. This results in a use-after-free when blk_drain() calls
> > > > bdrv_drained_end(bs) on the dangling pointer.
> > > > 
> > > > The general problem is that a function and its callers must not assume
> > > > that bs is still valid across aio_poll(). Explicitly hold a reference to
> > > > bs in blk_drain() to avoid the dangling pointer.
> > > > 
> > > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > ---
> > > > I found that BDS nodes are sometimes deleted with bs->quiesce_counter >
> > > > 0 (at least when running "make check"), so it is currently not possible
> > > > to put the bdrv_ref/unref() calls in bdrv_do_drained_begin() and
> > > > bdrv_do_drained_end() because they will be unbalanced. That would have
> > > > been a more general solution than only fixing blk_drain().
> > > 
> > > Deleting nodes that have a `quiesce_counter > 0` doesn’t seem wrong to me –
> > > deleting only depends on strong references, and so I’d expect that anything
> > > that increases the quiesce_counter also has a strong reference to the node
> > > if the former wants the latter to stay around.
> > > 
> > > I suppose we could make it so that both the quiesce_counter and the refcnt
> > > need to be 0 before a BDS is deleted (and then deletion can happen both from
> > > bdrv_unref() and drained_end), but I don’t know whether that’s really
> > > necessary.  I’d rather leave it to the caller to ensure they keep a strong
> > > reference throughout the drain.
> > > 
> > > The question is, how often do we have a situation like this, where we take a
> > > weak reference for draining, because we assume there’s a strong reference
> > > backing us up (namely the one through blk->root), but that strong reference
> > > then can go away due to draining...
> > > 
> > > > Any suggestions for a better fix?
> > > 
> > > The fix makes sense to me.
> > 
> > Okay. My concern was that this is a whole class of bugs and my patch
> > only fixes blk_drain(). I have audited the code some more in the
> > meantime.
> > 
> > bdrv_insert_node() may be unsafe in the case where bs is a temporary
> > filter node that is unref'd during bdrv_drained_begin():
> > 
> >    BlockDriverState *bdrv_insert_node(BlockDriverState *bs, QDict *options,
> >                                       int flags, Error **errp)
> >    {
> >        ERRP_GUARD();
> >        int ret;
> >        BlockDriverState *new_node_bs = NULL;
> >        const char *drvname, *node_name;
> >        BlockDriver *drv;
> >        drvname = qdict_get_try_str(options, "driver");
> >        if (!drvname) {
> >            error_setg(errp, "driver is not specified");
> >            goto fail;
> >        }
> >        drv = bdrv_find_format(drvname);
> >        if (!drv) {
> >            error_setg(errp, "Unknown driver: '%s'", drvname);
> >            goto fail;
> >        }
> >        node_name = qdict_get_try_str(options, "node-name");
> >        new_node_bs = bdrv_new_open_driver_opts(drv, node_name, options, flags,
> >                                                errp);
> >        options = NULL; /* bdrv_new_open_driver() eats options */
> >        if (!new_node_bs) {
> >            error_prepend(errp, "Could not create node: ");
> >            goto fail;
> >        }
> >        bdrv_drained_begin(bs);
> >        ^^^^^^^^^^^^^^^^^^^^^^^ <--- bs can be dangling pointer
> >        ret = bdrv_replace_node(bs, new_node_bs, errp);
> >        bdrv_drained_end(bs);
> > 
> > The fix isn't as simple as blk_drain() because we don't want to insert
> > the new node before the now-deleted node. I think the correct way to
> > insert a node is against BdrvChild, not BlockDriverState. That way we
> > can be sure the new node will be inserted into a graph that is reachable
> > via BdrvChild (e.g. BlockBackend) instead of a detached BDS.
> > 
> > bdrv_set_aio_context_ignore() and blk_io_limits_disable() need to ref bs
> > like blk_drain() in this patch.
> > 
> > There are some other bdrv_drained_begin() calls that I'm assuming are
> > safe because they are during creation/deletion so I think we have strong
> > references there or nothing else knows about our BDS yet.
> > 
> > Do you agree with extending this patch series to cover the functions I
> > mentioned above?
> 
> I'm not sure.
> 
> First, we can't support "any graph change" during some other graph-changing operation.
> 
> Actually, while one specific graph-changing operation is in progress, we should forbid any other graph-changing operations; they should wait. Possibly, by adding strong references everywhere, we can avoid the crashes. But what about the logic? If we do several graph-changing operations simultaneously, the result is absolutely unpredictable, and that's not what the user wants.
> 
> The problem is known. For example, it may lead to an 030 iotest failure. Probably now it can't, after Hanna's changes.. But I don't think we'll be safe until we have a kind of global mutex for graph-changing operations. For example, here is my old attempt: https://lists.nongnu.org/archive/html/qemu-devel/2020-11/msg05290.html .
> 
> So, probably, until we have a good solution for this, it's better to do only the necessary small fixes, like your original patch..
> 
> 
> Second, bs may actually disappear at the first yield point, which happens earlier than bdrv_drained_begin() - in bdrv_new_open_driver_opts(). So, if we fix something, we'd better declare that the caller of bdrv_insert_node() is responsible for ensuring bs does not disappear during the function call. And check the callers.

bdrv_insert_node() is tricky. I will leave it alone and focus on
bdrv_set_aio_context_ignore() and blk_io_limits_disable() because I
think they have exactly the same problem as blk_drain().

Stefan
Re: [RFC] block-backend: prevent dangling BDS pointer in blk_drain()
Posted by Vladimir Sementsov-Ogievskiy 2 years, 4 months ago
09.12.2021 18:45, Hanna Reitz wrote:
> On 09.12.21 15:23, Stefan Hajnoczi wrote:
>> The BlockBackend root child can change during bdrv_drained_begin() when
>> aio_poll() is invoked. In fact the BlockDriverState can reach refcnt 0
>> and blk_drain() is left with a dangling BDS pointer.
>>
>> One example is scsi_device_purge_requests(), which calls blk_drain() to
>> wait for in-flight requests to cancel. If the backup blockjob is active,
>> then the BlockBackend root child is a temporary filter BDS owned by the
>> blockjob. The blockjob can complete during bdrv_drained_begin() and the
>> last reference to the BDS is released when the temporary filter node is
>> removed. This results in a use-after-free when blk_drain() calls
>> bdrv_drained_end(bs) on the dangling pointer.
>>
>> The general problem is that a function and its callers must not assume
>> that bs is still valid across aio_poll(). Explicitly hold a reference to
>> bs in blk_drain() to avoid the dangling pointer.
>>
>> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
>> ---
>> I found that BDS nodes are sometimes deleted with bs->quiesce_counter >
>> 0 (at least when running "make check"), so it is currently not possible
>> to put the bdrv_ref/unref() calls in bdrv_do_drained_begin() and
>> bdrv_do_drained_end() because they will be unbalanced. That would have
>> been a more general solution than only fixing blk_drain().
> 
> Deleting nodes that have a `quiesce_counter > 0` doesn’t seem wrong to me – deleting only depends on strong references, and so I’d expect that anything that increases the quiesce_counter also has a strong reference to the node if the former wants the latter to stay around.
> 
> I suppose we could make it so that both the quiesce_counter and the refcnt need to be 0 before a BDS is deleted (and then deletion can happen both from bdrv_unref() and drained_end), but I don’t know whether that’s really necessary.  I’d rather leave it to the caller to ensure they keep a strong reference throughout the drain.

Agreed. Better to keep the refcount behavior obvious.

> 
> The question is, how often do we have a situation like this, where we take a weak reference for draining, because we assume there’s a strong reference backing us up (namely the one through blk->root), but that strong reference then can go away due to draining...
> 
>> Any suggestions for a better fix?
> 
> The fix makes sense to me.
> 
> One alternative that comes to my mind is to instead re-fetch `bs = blk_bs(blk);` after the AIO_WAIT_WHILE() loop.  But that might be wrong, because if the node attached to the BB changed (i.e. isn’t `bs`, and isn’t `NULL`), then we’d end the drain on the wrong node.
> 
> So I think your fix is the right one.
> 

I agree.

I wonder how many of the code paths that do take a strong reference are actually prepared for the fact that the block graph may change, and that this bs may end up in some other place, with changed permissions, children and parents :/

Is modifying the graph during a drain safe? Hmm, we probably always do graph modifications inside a drained section on purpose :) As I understand it, all the quiesce_counter logic exists to support exactly this. And the only behavior that seems correct is to finish the drain on the same node where it was started.

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

-- 
Best regards,
Vladimir