dma_blk_cb() only takes the AioContext lock around ->io_func(). That
means the rest of dma_blk_cb() is not protected. In particular, the
DMAAIOCB field accesses happen outside the lock.
There is a race when the main loop thread holds the AioContext lock and
invokes scsi_device_purge_requests() -> bdrv_aio_cancel() ->
dma_aio_cancel() while an IOThread executes dma_blk_cb(). The dbs->acb
field determines how cancellation proceeds. If dma_aio_cancel() sees
dbs->acb == NULL while dma_blk_cb() is still running, the request can be
completed twice (-ECANCELED and the actual return value).
The following assertion can occur with virtio-scsi when an IOThread is
used:
../hw/scsi/scsi-disk.c:368: scsi_dma_complete: Assertion `r->req.aiocb != NULL' failed.
Fix the race by holding the AioContext across dma_blk_cb(). Now
dma_aio_cancel() under the AioContext lock will not see
inconsistent/intermediate states.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
hw/scsi/scsi-disk.c | 4 +---
softmmu/dma-helpers.c | 12 +++++++-----
2 files changed, 8 insertions(+), 8 deletions(-)
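(Illustration only, not part of the patch: a minimal standalone pthread
model of the race described above. The mutex stands in for the AioContext
lock, and names such as acb, iothread_fn and main_loop_fn are made up for
readability; this is not QEMU code.)

  /*
   * race-model.c: toy model of the dma_blk_cb() vs. dma_aio_cancel() race.
   * Build: gcc -pthread -o race-model race-model.c
   */
  #include <pthread.h>
  #include <stdio.h>
  #include <unistd.h>

  static pthread_mutex_t aio_context_lock = PTHREAD_MUTEX_INITIALIZER;
  static int in_flight;              /* dummy object so acb starts non-NULL */
  static void *acb = &in_flight;     /* NULL once the callback has started  */
  static int completions;            /* how many times the request completed */

  /* "IOThread": models dma_blk_cb() before the fix -- the fields are
   * deliberately touched without holding the lock, which is the bug. */
  static void *iothread_fn(void *opaque)
  {
      acb = NULL;                    /* dbs->acb = NULL, outside the lock */
      usleep(1000);                  /* widen the race window */
      completions++;                 /* dma_complete(dbs, ret) */
      return NULL;
  }

  /* "Main loop": models dma_aio_cancel() running under the lock. */
  static void *main_loop_fn(void *opaque)
  {
      pthread_mutex_lock(&aio_context_lock);
      if (!acb) {
          completions++;             /* complete the request with -ECANCELED */
      }
      pthread_mutex_unlock(&aio_context_lock);
      return NULL;
  }

  int main(void)
  {
      pthread_t io, ml;
      pthread_create(&io, NULL, iothread_fn, NULL);
      pthread_create(&ml, NULL, main_loop_fn, NULL);
      pthread_join(io, NULL);
      pthread_join(ml, NULL);
      printf("request completed %d time(s)\n", completions);
      return completions == 1 ? 0 : 1;   /* 2 == double completion (the bug) */
  }

Taking the lock around the whole callback -- what this patch does for
dma_blk_cb() -- closes the window in which the canceller can observe
acb == NULL before the completion has actually run.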
diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 115584f8b9..97c9b1c8cd 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -354,13 +354,12 @@ done:
scsi_req_unref(&r->req);
}
+/* Called with AioContext lock held */
static void scsi_dma_complete(void *opaque, int ret)
{
SCSIDiskReq *r = (SCSIDiskReq *)opaque;
SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
- aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
-
assert(r->req.aiocb != NULL);
r->req.aiocb = NULL;
@@ -370,7 +369,6 @@ static void scsi_dma_complete(void *opaque, int ret)
block_acct_done(blk_get_stats(s->qdev.conf.blk), &r->acct);
}
scsi_dma_complete_noio(r, ret);
- aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
}
static void scsi_read_complete_noio(SCSIDiskReq *r, int ret)
diff --git a/softmmu/dma-helpers.c b/softmmu/dma-helpers.c
index 7820fec54c..2463964805 100644
--- a/softmmu/dma-helpers.c
+++ b/softmmu/dma-helpers.c
@@ -113,17 +113,19 @@ static void dma_complete(DMAAIOCB *dbs, int ret)
static void dma_blk_cb(void *opaque, int ret)
{
DMAAIOCB *dbs = (DMAAIOCB *)opaque;
+ AioContext *ctx = dbs->ctx;
dma_addr_t cur_addr, cur_len;
void *mem;
trace_dma_blk_cb(dbs, ret);
+ aio_context_acquire(ctx);
dbs->acb = NULL;
dbs->offset += dbs->iov.size;
if (dbs->sg_cur_index == dbs->sg->nsg || ret < 0) {
dma_complete(dbs, ret);
- return;
+ goto out;
}
dma_blk_unmap(dbs);
@@ -164,9 +166,9 @@ static void dma_blk_cb(void *opaque, int ret)
if (dbs->iov.size == 0) {
trace_dma_map_wait(dbs);
- dbs->bh = aio_bh_new(dbs->ctx, reschedule_dma, dbs);
+ dbs->bh = aio_bh_new(ctx, reschedule_dma, dbs);
cpu_register_map_client(dbs->bh);
- return;
+ goto out;
}
if (!QEMU_IS_ALIGNED(dbs->iov.size, dbs->align)) {
@@ -174,11 +176,11 @@ static void dma_blk_cb(void *opaque, int ret)
QEMU_ALIGN_DOWN(dbs->iov.size, dbs->align));
}
- aio_context_acquire(dbs->ctx);
dbs->acb = dbs->io_func(dbs->offset, &dbs->iov,
dma_blk_cb, dbs, dbs->io_func_opaque);
- aio_context_release(dbs->ctx);
assert(dbs->acb);
+out:
+ aio_context_release(ctx);
}
static void dma_aio_cancel(BlockAIOCB *acb)
--
2.39.1
On 10.02.2023 15:32, Stefan Hajnoczi wrote:
> dma_blk_cb() only takes the AioContext lock around ->io_func(). That
> means the rest of dma_blk_cb() is not protected. In particular, the
> DMAAIOCB field accesses happen outside the lock.
>
> There is a race when the main loop thread holds the AioContext lock and
> invokes scsi_device_purge_requests() -> bdrv_aio_cancel() ->
> dma_aio_cancel() while an IOThread executes dma_blk_cb(). The dbs->acb
> field determines how cancellation proceeds. If dma_aio_cancel() see
> dbs->acb == NULL while dma_blk_cb() is still running, the request can be
> completed twice (-ECANCELED and the actual return value).
>
> The following assertion can occur with virtio-scsi when an IOThread is
> used:
>
> ../hw/scsi/scsi-disk.c:368: scsi_dma_complete: Assertion `r->req.aiocb != NULL' failed.
>
> Fix the race by holding the AioContext across dma_blk_cb(). Now
> dma_aio_cancel() under the AioContext lock will not see
> inconsistent/intermediate states.
>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Two things that seem to need attention in the review:
> diff --git a/softmmu/dma-helpers.c b/softmmu/dma-helpers.c
> index 7820fec54c..2463964805 100644
> --- a/softmmu/dma-helpers.c
> +++ b/softmmu/dma-helpers.c
> @@ -113,17 +113,19 @@ static void dma_complete(DMAAIOCB *dbs, int ret)
> static void dma_blk_cb(void *opaque, int ret)
> {
> DMAAIOCB *dbs = (DMAAIOCB *)opaque;
> + AioContext *ctx = dbs->ctx;
> dma_addr_t cur_addr, cur_len;
> void *mem;
>
> trace_dma_blk_cb(dbs, ret);
>
> + aio_context_acquire(ctx);
During the first iteration, the caller may already hold the AioContext
lock. In the case of scsi-disk, it does. Locking a second time is fine
in principle because it's a recursive lock, but we have to be careful
not to call AIO_WAIT_WHILE() indirectly then because it could deadlock.
Except for the dbs->common.cb (see below) I don't see any functions that
would be problematic in this respect. In fact, the one I would be most
worried about is dbs->io_func, but it was already locked before.
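To make that concrete, here is a toy standalone model (plain pthreads, not
QEMU code) of the problematic pattern; the only assumption borrowed from
AIO_WAIT_WHILE() is that it releases the AioContext lock a single time
around each poll iteration:

  /* aio-wait-model.c: why waiting under a recursively held lock hangs.
   * Build: gcc -pthread -o aio-wait-model aio-wait-model.c */
  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>
  #include <unistd.h>

  static pthread_mutex_t ctx_lock;      /* stands in for the AioContext lock */
  static atomic_bool done;              /* the condition being waited for    */

  /* The thread that must make the condition true -- it needs the lock. */
  static void *iothread_fn(void *opaque)
  {
      for (int i = 0; i < 100; i++) {
          if (pthread_mutex_trylock(&ctx_lock) == 0) {
              atomic_store(&done, true);
              pthread_mutex_unlock(&ctx_lock);
              return NULL;
          }
          usleep(10 * 1000);
      }
      printf("iothread: never acquired the lock -> real code would hang\n");
      return NULL;
  }

  int main(void)
  {
      pthread_mutexattr_t attr;
      pthread_mutexattr_init(&attr);
      pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
      pthread_mutex_init(&ctx_lock, &attr);

      pthread_mutex_lock(&ctx_lock);    /* caller already holds the lock */
      pthread_mutex_lock(&ctx_lock);    /* the callback takes it again   */

      pthread_t io;
      pthread_create(&io, NULL, iothread_fn, NULL);

      /* Waiter: drops the lock once per iteration, but one recursion
       * level stays held, so 'done' can never become true. */
      for (int i = 0; i < 100 && !atomic_load(&done); i++) {
          pthread_mutex_unlock(&ctx_lock);
          usleep(10 * 1000);            /* stand-in for aio_poll() */
          pthread_mutex_lock(&ctx_lock);
      }

      printf("done=%d (stays 0: one lock level was never released)\n",
             atomic_load(&done));
      pthread_join(io, NULL);
      pthread_mutex_unlock(&ctx_lock);
      pthread_mutex_unlock(&ctx_lock);
      return 0;
  }

Had the waiter taken the lock only once, the single unlock per iteration
would let the other thread in and the loop would terminate; with the
recursive second acquisition it never does.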
> dbs->acb = NULL;
> dbs->offset += dbs->iov.size;
>
> if (dbs->sg_cur_index == dbs->sg->nsg || ret < 0) {
> dma_complete(dbs, ret);
All request callbacks hold the AioContext lock now when they didn't
before. I wonder if it's worth documenting the locking rules for
dma_blk_io() in a comment. Could be a separate patch, though.
You remove the locking in scsi_dma_complete() to compensate. All the
other callers come from IDE and nvme, which don't take the lock
internally. Taking the main AioContext lock once is fine, so this looks
good.
If it is possible that we already complete on the first iteration, this
could however also be affected by the case above so that the AioContext
is locked twice. In this case, the invoked callback must not call
AIO_WAIT_WHILE() and we would need to audit IDE and nvme.
Is it possible? In other words, can dbs->sg->nsg be 0? If not, can we
assert and document it?
> - return;
> + goto out;
> }
> dma_blk_unmap(dbs);
>
> @@ -164,9 +166,9 @@ static void dma_blk_cb(void *opaque, int ret)
>
> if (dbs->iov.size == 0) {
> trace_dma_map_wait(dbs);
> - dbs->bh = aio_bh_new(dbs->ctx, reschedule_dma, dbs);
> + dbs->bh = aio_bh_new(ctx, reschedule_dma, dbs);
Does copying dbs->ctx to a local variable imply that it may change
during the function? I didn't think so, but if it may, then why is
calling aio_bh_new() with the old value right?
> cpu_register_map_client(dbs->bh);
> - return;
> + goto out;
> }
>
> if (!QEMU_IS_ALIGNED(dbs->iov.size, dbs->align)) {
> @@ -174,11 +176,11 @@ static void dma_blk_cb(void *opaque, int ret)
> QEMU_ALIGN_DOWN(dbs->iov.size, dbs->align));
> }
>
> - aio_context_acquire(dbs->ctx);
> dbs->acb = dbs->io_func(dbs->offset, &dbs->iov,
> dma_blk_cb, dbs, dbs->io_func_opaque);
> - aio_context_release(dbs->ctx);
> assert(dbs->acb);
> +out:
> + aio_context_release(ctx);
> }
Kevin
On Thu, Feb 16, 2023 at 04:27:42PM +0100, Kevin Wolf wrote:
> On 10.02.2023 15:32, Stefan Hajnoczi wrote:
> > dma_blk_cb() only takes the AioContext lock around ->io_func(). That
> > means the rest of dma_blk_cb() is not protected. In particular, the
> > DMAAIOCB field accesses happen outside the lock.
> >
> > There is a race when the main loop thread holds the AioContext lock and
> > invokes scsi_device_purge_requests() -> bdrv_aio_cancel() ->
> > dma_aio_cancel() while an IOThread executes dma_blk_cb(). The dbs->acb
> > field determines how cancellation proceeds. If dma_aio_cancel() see
> > dbs->acb == NULL while dma_blk_cb() is still running, the request can be
> > completed twice (-ECANCELED and the actual return value).
> >
> > The following assertion can occur with virtio-scsi when an IOThread is
> > used:
> >
> > ../hw/scsi/scsi-disk.c:368: scsi_dma_complete: Assertion `r->req.aiocb != NULL' failed.
> >
> > Fix the race by holding the AioContext across dma_blk_cb(). Now
> > dma_aio_cancel() under the AioContext lock will not see
> > inconsistent/intermediate states.
> >
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
>
> Two things that seem to need attention in the review:
>
> > diff --git a/softmmu/dma-helpers.c b/softmmu/dma-helpers.c
> > index 7820fec54c..2463964805 100644
> > --- a/softmmu/dma-helpers.c
> > +++ b/softmmu/dma-helpers.c
> > @@ -113,17 +113,19 @@ static void dma_complete(DMAAIOCB *dbs, int ret)
> > static void dma_blk_cb(void *opaque, int ret)
> > {
> > DMAAIOCB *dbs = (DMAAIOCB *)opaque;
> > + AioContext *ctx = dbs->ctx;
> > dma_addr_t cur_addr, cur_len;
> > void *mem;
> >
> > trace_dma_blk_cb(dbs, ret);
> >
> > + aio_context_acquire(ctx);
>
> During the first iteration, the caller may already hold the AioContext
> lock. In the case of scsi-disk, it does. Locking a second time is fine
> in principle because it's a recursive lock, but we have to be careful
> not to call AIO_WAIT_WHILE() indirectly then because it could deadlock.
>
> Except for the dbs->common.cb (see below) I don't see any functions that
> would be problematic in this respect. In fact, the one I would be most
> worried about is dbs->io_func, but it was already locked before.
>
> > dbs->acb = NULL;
> > dbs->offset += dbs->iov.size;
> >
> > if (dbs->sg_cur_index == dbs->sg->nsg || ret < 0) {
> > dma_complete(dbs, ret);
>
> All request callbacks hold the AioContext lock now when they didn't
> before. I wonder if it's worth documenting the locking rules for
> dma_blk_io() in a comment. Could be a separate patch, though.
>
> You remove the locking in scsi_dma_complete() to compensate. All the
> other callers come from IDE and nvme, which don't take the lock
> internally. Taking the main AioContext lock once is fine, so this looks
> good.
>
> If it is possible that we already complete on the first iteration, this
> could however also be affected by the case above so that the AioContext
> is locked twice. In this case, the invoked callback must not call
> AIO_WAIT_WHILE() and we would need to audit IDE and nvme.
>
> Is it possible? In other words, can dbs->sg->nsg be 0? If not, can we
> assert and document it?
In the nsg == 0 case there's another problem: the completion callback
function is invoked and the AIOCB is unref'd before dma_blk_io() returns
the stale AIOCB pointer.
That would lead to problems later because the pattern is typically:
r->aiocb = dma_blk_io(...);
...
use r and r->aiocb later
So I don't think nsg = 0 makes sense.
Functions I looked at invoke dma_blk_io() only when there are still
bytes to transfer. I think we're safe but I'll admit I'm not 100% sure.
> > - return;
> > + goto out;
> > }
> > dma_blk_unmap(dbs);
> >
> > @@ -164,9 +166,9 @@ static void dma_blk_cb(void *opaque, int ret)
> >
> > if (dbs->iov.size == 0) {
> > trace_dma_map_wait(dbs);
> > - dbs->bh = aio_bh_new(dbs->ctx, reschedule_dma, dbs);
> > + dbs->bh = aio_bh_new(ctx, reschedule_dma, dbs);
>
> Does copying dbs->ctx to a local variable imply that it may change
> during the function? I didn't think so, but if it may, then why is
> calling aio_bh_new() with the old value right?
I changed this line for consistency, not to change behavior or fix a bug.
Regarding AioContext changes, they can't happen because no function that
changes the AioContext is called between this line and the earlier
aio_context_acquire().
(Having to worry about AioContext changes is a pain though and I look
forward to when we can get rid of this lock.)
Stefan
On Fri, Feb 10, 2023 at 09:32:37AM -0500, Stefan Hajnoczi wrote:
> dma_blk_cb() only takes the AioContext lock around ->io_func(). That
> means the rest of dma_blk_cb() is not protected. In particular, the
> DMAAIOCB field accesses happen outside the lock.
>
> There is a race when the main loop thread holds the AioContext lock and
> invokes scsi_device_purge_requests() -> bdrv_aio_cancel() ->
> dma_aio_cancel() while an IOThread executes dma_blk_cb(). The dbs->acb
> field determines how cancellation proceeds. If dma_aio_cancel() see

"sees" or "can see"

> dbs->acb == NULL while dma_blk_cb() is still running, the request can be
> completed twice (-ECANCELED and the actual return value).
>
> The following assertion can occur with virtio-scsi when an IOThread is
> used:
>
> ../hw/scsi/scsi-disk.c:368: scsi_dma_complete: Assertion `r->req.aiocb != NULL' failed.
>
> Fix the race by holding the AioContext across dma_blk_cb(). Now
> dma_aio_cancel() under the AioContext lock will not see
> inconsistent/intermediate states.
>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  hw/scsi/scsi-disk.c   | 4 +---
>  softmmu/dma-helpers.c | 12 +++++++-----
>  2 files changed, 8 insertions(+), 8 deletions(-)

Widening the scope protected by the lock makes sense, insofar as we
still need the lock.

Reviewed-by: Eric Blake <eblake@redhat.com>

--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org