The following changes since commit c6a5fc2ac76c5ab709896ee1b0edd33685a67ed1:

  decodetree: Add --output-null for meson testing (2023-05-31 19:56:42 -0700)

are available in the Git repository at:

  https://gitlab.com/stefanha/qemu.git tags/block-pull-request

for you to fetch changes up to 98b126f5e3228a346c774e569e26689943b401dd:

  qapi: add '@fdset' feature for BlockdevOptionsVirtioBlkVhostVdpa (2023-06-01 11:08:21 -0400)

----------------------------------------------------------------
Pull request

- Stefano Garzarella's blkio block driver 'fd' parameter
- My thread-local blk_io_plug() series

----------------------------------------------------------------

Stefan Hajnoczi (6):
  block: add blk_io_plug_call() API
  block/nvme: convert to blk_io_plug_call() API
  block/blkio: convert to blk_io_plug_call() API
  block/io_uring: convert to blk_io_plug_call() API
  block/linux-aio: convert to blk_io_plug_call() API
  block: remove bdrv_co_io_plug() API

Stefano Garzarella (2):
  block/blkio: use qemu_open() to support fd passing for virtio-blk
  qapi: add '@fdset' feature for BlockdevOptionsVirtioBlkVhostVdpa

 MAINTAINERS                       |   1 +
 qapi/block-core.json              |   6 ++
 meson.build                       |   4 +
 include/block/block-io.h          |   3 -
 include/block/block_int-common.h  |  11 ---
 include/block/raw-aio.h           |  14 ---
 include/sysemu/block-backend-io.h |  13 +--
 block/blkio.c                     |  96 ++++++++++++------
 block/block-backend.c             |  22 -----
 block/file-posix.c                |  38 -------
 block/io.c                        |  37 -------
 block/io_uring.c                  |  44 ++++-----
 block/linux-aio.c                 |  41 +++-----
 block/nvme.c                      |  44 +++------
 block/plug.c                      | 159 ++++++++++++++++++++++++++++++
 hw/block/dataplane/xen-block.c    |   8 +-
 hw/block/virtio-blk.c             |   4 +-
 hw/scsi/virtio-scsi.c             |   6 +-
 block/meson.build                 |   1 +
 block/trace-events                |   6 +-
 20 files changed, 293 insertions(+), 265 deletions(-)
 create mode 100644 block/plug.c

-- 
2.40.1
Deleted patch

QEMU currently only has ASCII Kconfig files but Linux actually uses
UTF-8. Explicitly specify the encoding and that we're doing text file
I/O.

It's unclear whether or not QEMU will ever need Unicode in its Kconfig
files. If we start using the help text then it will become an issue
sooner or later. Make this change now for consistency with Linux
Kconfig.

Reported-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-id: 20200521153616.307100-1-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 scripts/minikconf.py | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/scripts/minikconf.py b/scripts/minikconf.py
index XXXXXXX..XXXXXXX 100755
--- a/scripts/minikconf.py
+++ b/scripts/minikconf.py
@@ -XXX,XX +XXX,XX @@ class KconfigParser:
         if incl_abs_fname in self.data.previously_included:
             return
         try:
-            fp = open(incl_abs_fname, 'r')
+            fp = open(incl_abs_fname, 'rt', encoding='utf-8')
         except IOError as e:
             raise KconfigParserError(self,
                 '%s: %s' % (e.strerror, include))
@@ -XXX,XX +XXX,XX @@ if __name__ == '__main__':
             parser.do_assignment(name, value == 'y')
             external_vars.add(name[7:])
         else:
-            fp = open(arg, 'r')
+            fp = open(arg, 'rt', encoding='utf-8')
             parser.parse_file(fp)
             fp.close()

@@ -XXX,XX +XXX,XX @@ if __name__ == '__main__':
         if key not in external_vars and config[key]:
             print ('CONFIG_%s=y' % key)

-    deps = open(argv[2], 'w')
+    deps = open(argv[2], 'wt', encoding='utf-8')
     for fname in data.previously_included:
         print ('%s: %s' % (argv[1], fname), file=deps)
     deps.close()
-- 
2.26.2
Introduce a new API for thread-local blk_io_plug() that does not
traverse the block graph. The goal is to make blk_io_plug() multi-queue
friendly.

Instead of having block drivers track whether or not we're in a plugged
section, provide an API that allows them to defer a function call until
we're unplugged: blk_io_plug_call(fn, opaque). If blk_io_plug_call() is
called multiple times with the same fn/opaque pair, then fn() is only
called once at the end of the function - resulting in batching.

This patch introduces the API and changes blk_io_plug()/blk_io_unplug().
blk_io_plug()/blk_io_unplug() no longer require a BlockBackend argument
because the plug state is now thread-local.

Later patches convert block drivers to blk_io_plug_call() and then we
can finally remove .bdrv_co_io_plug() once all block drivers have been
converted.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 20230530180959.1108766-2-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 MAINTAINERS                       |   1 +
 include/sysemu/block-backend-io.h |  13 +--
 block/block-backend.c             |  22 -----
 block/plug.c                      | 159 ++++++++++++++++++++++++++++++
 hw/block/dataplane/xen-block.c    |   8 +-
 hw/block/virtio-blk.c             |   4 +-
 hw/scsi/virtio-scsi.c             |   6 +-
 block/meson.build                 |   1 +
 8 files changed, 173 insertions(+), 41 deletions(-)
 create mode 100644 block/plug.c

diff --git a/MAINTAINERS b/MAINTAINERS
index XXXXXXX..XXXXXXX 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -XXX,XX +XXX,XX @@ F: util/aio-*.c
 F: util/aio-*.h
 F: util/fdmon-*.c
 F: block/io.c
+F: block/plug.c
 F: migration/block*
 F: include/block/aio.h
 F: include/block/aio-wait.h
diff --git a/include/sysemu/block-backend-io.h b/include/sysemu/block-backend-io.h
index XXXXXXX..XXXXXXX 100644
--- a/include/sysemu/block-backend-io.h
+++ b/include/sysemu/block-backend-io.h
@@ -XXX,XX +XXX,XX @@ void blk_iostatus_set_err(BlockBackend *blk, int error);
 int blk_get_max_iov(BlockBackend *blk);
 int blk_get_max_hw_iov(BlockBackend *blk);

-/*
- * blk_io_plug/unplug are thread-local operations. This means that multiple
- * IOThreads can simultaneously call plug/unplug, but the caller must ensure
- * that each unplug() is called in the same IOThread of the matching plug().
- */
-void coroutine_fn blk_co_io_plug(BlockBackend *blk);
-void co_wrapper blk_io_plug(BlockBackend *blk);
-
-void coroutine_fn blk_co_io_unplug(BlockBackend *blk);
-void co_wrapper blk_io_unplug(BlockBackend *blk);
+void blk_io_plug(void);
+void blk_io_unplug(void);
+void blk_io_plug_call(void (*fn)(void *), void *opaque);

 AioContext *blk_get_aio_context(BlockBackend *blk);
 BlockAcctStats *blk_get_stats(BlockBackend *blk);
diff --git a/block/block-backend.c b/block/block-backend.c
index XXXXXXX..XXXXXXX 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -XXX,XX +XXX,XX @@ void blk_add_insert_bs_notifier(BlockBackend *blk, Notifier *notify)
     notifier_list_add(&blk->insert_bs_notifiers, notify);
 }

-void coroutine_fn blk_co_io_plug(BlockBackend *blk)
-{
-    BlockDriverState *bs = blk_bs(blk);
-    IO_CODE();
-    GRAPH_RDLOCK_GUARD();
-
-    if (bs) {
-        bdrv_co_io_plug(bs);
-    }
-}
-
-void coroutine_fn blk_co_io_unplug(BlockBackend *blk)
-{
-    BlockDriverState *bs = blk_bs(blk);
-    IO_CODE();
-    GRAPH_RDLOCK_GUARD();
-
-    if (bs) {
-        bdrv_co_io_unplug(bs);
-    }
-}
-
 BlockAcctStats *blk_get_stats(BlockBackend *blk)
 {
     IO_CODE();
diff --git a/block/plug.c b/block/plug.c
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/block/plug.c
@@ -XXX,XX +XXX,XX @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Block I/O plugging
+ *
+ * Copyright Red Hat.
+ *
+ * This API defers a function call within a blk_io_plug()/blk_io_unplug()
+ * section, allowing multiple calls to batch up. This is a performance
+ * optimization that is used in the block layer to submit several I/O requests
+ * at once instead of individually:
+ *
+ *   blk_io_plug(); <-- start of plugged region
+ *   ...
+ *   blk_io_plug_call(my_func, my_obj); <-- deferred my_func(my_obj) call
+ *   blk_io_plug_call(my_func, my_obj); <-- another
+ *   blk_io_plug_call(my_func, my_obj); <-- another
+ *   ...
+ *   blk_io_unplug(); <-- end of plugged region, my_func(my_obj) is called once
+ *
+ * This code is actually generic and not tied to the block layer. If another
+ * subsystem needs this functionality, it could be renamed.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/coroutine-tls.h"
+#include "qemu/notify.h"
+#include "qemu/thread.h"
+#include "sysemu/block-backend.h"
+
+/* A function call that has been deferred until unplug() */
+typedef struct {
+    void (*fn)(void *);
+    void *opaque;
+} UnplugFn;
+
+/* Per-thread state */
+typedef struct {
+    unsigned count;       /* how many times has plug() been called? */
+    GArray *unplug_fns;   /* functions to call at unplug time */
+} Plug;
+
+/* Use get_ptr_plug() to fetch this thread-local value */
+QEMU_DEFINE_STATIC_CO_TLS(Plug, plug);
+
+/* Called at thread cleanup time */
+static void blk_io_plug_atexit(Notifier *n, void *value)
+{
+    Plug *plug = get_ptr_plug();
+    g_array_free(plug->unplug_fns, TRUE);
+}
+
+/* This won't involve coroutines, so use __thread */
+static __thread Notifier blk_io_plug_atexit_notifier;
+
+/**
+ * blk_io_plug_call:
+ * @fn: a function pointer to be invoked
+ * @opaque: a user-defined argument to @fn()
+ *
+ * Call @fn(@opaque) immediately if not within a blk_io_plug()/blk_io_unplug()
+ * section.
+ *
+ * Otherwise defer the call until the end of the outermost
+ * blk_io_plug()/blk_io_unplug() section in this thread. If the same
+ * @fn/@opaque pair has already been deferred, it will only be called once upon
+ * blk_io_unplug() so that accumulated calls are batched into a single call.
+ *
+ * The caller must ensure that @opaque is not freed before @fn() is invoked.
+ */
+void blk_io_plug_call(void (*fn)(void *), void *opaque)
+{
+    Plug *plug = get_ptr_plug();
+
+    /* Call immediately if we're not plugged */
+    if (plug->count == 0) {
+        fn(opaque);
+        return;
+    }
+
+    GArray *array = plug->unplug_fns;
+    if (!array) {
+        array = g_array_new(FALSE, FALSE, sizeof(UnplugFn));
+        plug->unplug_fns = array;
+        blk_io_plug_atexit_notifier.notify = blk_io_plug_atexit;
+        qemu_thread_atexit_add(&blk_io_plug_atexit_notifier);
+    }
+
+    UnplugFn *fns = (UnplugFn *)array->data;
+    UnplugFn new_fn = {
+        .fn = fn,
+        .opaque = opaque,
+    };
+
+    /*
+     * There won't be many, so do a linear search. If this becomes a bottleneck
+     * then a binary search (glib 2.62+) or different data structure could be
+     * used.
+     */
+    for (guint i = 0; i < array->len; i++) {
+        if (memcmp(&fns[i], &new_fn, sizeof(new_fn)) == 0) {
+            return; /* already exists */
+        }
+    }
+
+    g_array_append_val(array, new_fn);
+}
+
+/**
+ * blk_io_plug: Defer blk_io_plug_call() functions until blk_io_unplug()
+ *
+ * blk_io_plug/unplug are thread-local operations. This means that multiple
+ * threads can simultaneously call plug/unplug, but the caller must ensure that
+ * each unplug() is called in the same thread of the matching plug().
+ *
+ * Nesting is supported. blk_io_plug_call() functions are only called at the
+ * outermost blk_io_unplug().
+ */
+void blk_io_plug(void)
+{
+    Plug *plug = get_ptr_plug();
+
+    assert(plug->count < UINT32_MAX);
+
+    plug->count++;
+}
+
+/**
+ * blk_io_unplug: Run any pending blk_io_plug_call() functions
+ *
+ * There must have been a matching blk_io_plug() call in the same thread prior
+ * to this blk_io_unplug() call.
+ */
+void blk_io_unplug(void)
+{
+    Plug *plug = get_ptr_plug();
+
+    assert(plug->count > 0);
+
+    if (--plug->count > 0) {
+        return;
+    }
+
+    GArray *array = plug->unplug_fns;
+    if (!array) {
+        return;
+    }
+
+    UnplugFn *fns = (UnplugFn *)array->data;
+
+    for (guint i = 0; i < array->len; i++) {
+        fns[i].fn(fns[i].opaque);
+    }
+
+    /*
+     * This resets the array without freeing memory so that appending is cheap
+     * in the future.
+     */
+    g_array_set_size(array, 0);
+}
diff --git a/hw/block/dataplane/xen-block.c b/hw/block/dataplane/xen-block.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/block/dataplane/xen-block.c
+++ b/hw/block/dataplane/xen-block.c
@@ -XXX,XX +XXX,XX @@ static bool xen_block_handle_requests(XenBlockDataPlane *dataplane)
      * is below us.
      */
     if (inflight_atstart > IO_PLUG_THRESHOLD) {
-        blk_io_plug(dataplane->blk);
+        blk_io_plug();
     }
     while (rc != rp) {
         /* pull request from ring */
@@ -XXX,XX +XXX,XX @@ static bool xen_block_handle_requests(XenBlockDataPlane *dataplane)

         if (inflight_atstart > IO_PLUG_THRESHOLD &&
             batched >= inflight_atstart) {
-            blk_io_unplug(dataplane->blk);
+            blk_io_unplug();
         }
         xen_block_do_aio(request);
         if (inflight_atstart > IO_PLUG_THRESHOLD) {
             if (batched >= inflight_atstart) {
-                blk_io_plug(dataplane->blk);
+                blk_io_plug();
                 batched = 0;
             } else {
                 batched++;
@@ -XXX,XX +XXX,XX @@ static bool xen_block_handle_requests(XenBlockDataPlane *dataplane)
             }
         }
     }
     if (inflight_atstart > IO_PLUG_THRESHOLD) {
-        blk_io_unplug(dataplane->blk);
+        blk_io_unplug();
     }

     return done_something;
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -XXX,XX +XXX,XX @@ void virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq)
     bool suppress_notifications = virtio_queue_get_notification(vq);

     aio_context_acquire(blk_get_aio_context(s->blk));
-    blk_io_plug(s->blk);
+    blk_io_plug();

     do {
         if (suppress_notifications) {
@@ -XXX,XX +XXX,XX @@ void virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq)
         virtio_blk_submit_multireq(s, &mrb);
     }

-    blk_io_unplug(s->blk);
+    blk_io_unplug();
     aio_context_release(blk_get_aio_context(s->blk));
 }

diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -XXX,XX +XXX,XX @@ static int virtio_scsi_handle_cmd_req_prepare(VirtIOSCSI *s, VirtIOSCSIReq *req)
         return -ENOBUFS;
     }
     scsi_req_ref(req->sreq);
-    blk_io_plug(d->conf.blk);
+    blk_io_plug();
     object_unref(OBJECT(d));
     return 0;
 }
@@ -XXX,XX +XXX,XX @@ static void virtio_scsi_handle_cmd_req_submit(VirtIOSCSI *s, VirtIOSCSIReq *req)
     if (scsi_req_enqueue(sreq)) {
         scsi_req_continue(sreq);
     }
-    blk_io_unplug(sreq->dev->conf.blk);
+    blk_io_unplug();
     scsi_req_unref(sreq);
 }

@@ -XXX,XX +XXX,XX @@ static void virtio_scsi_handle_cmd_vq(VirtIOSCSI *s, VirtQueue *vq)
     while (!QTAILQ_EMPTY(&reqs)) {
         req = QTAILQ_FIRST(&reqs);
         QTAILQ_REMOVE(&reqs, req, next);
-        blk_io_unplug(req->sreq->dev->conf.blk);
+        blk_io_unplug();
         scsi_req_unref(req->sreq);
         virtqueue_detach_element(req->vq, &req->elem, 0);
         virtio_scsi_free_req(req);
diff --git a/block/meson.build b/block/meson.build
index XXXXXXX..XXXXXXX 100644
--- a/block/meson.build
+++ b/block/meson.build
@@ -XXX,XX +XXX,XX @@ block_ss.add(files(
   'mirror.c',
   'nbd.c',
   'null.c',
+  'plug.c',
   'qapi.c',
   'qcow2-bitmap.c',
   'qcow2-cache.c',
-- 
2.40.1
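The contract described above (call immediately when unplugged, deduplicate
deferred fn/opaque pairs, run everything at the outermost unplug) can be
modelled outside QEMU in a few lines of standalone C. The toy_* names below
are invented for illustration and a fixed-size thread-local table stands in
for QEMU's GArray; this is a sketch of the semantics, not the implementation:

#include <assert.h>
#include <stdio.h>

typedef struct {
    void (*fn)(void *);
    void *opaque;
} DeferredCall;

static _Thread_local unsigned plug_count;
static _Thread_local DeferredCall deferred[16];
static _Thread_local unsigned num_deferred;

static void toy_plug(void)
{
    plug_count++;
}

static void toy_plug_call(void (*fn)(void *), void *opaque)
{
    if (plug_count == 0) {
        fn(opaque); /* not plugged: invoke immediately */
        return;
    }
    for (unsigned i = 0; i < num_deferred; i++) {
        if (deferred[i].fn == fn && deferred[i].opaque == opaque) {
            return; /* same fn/opaque pair already deferred: batch */
        }
    }
    assert(num_deferred < 16);
    deferred[num_deferred].fn = fn;
    deferred[num_deferred].opaque = opaque;
    num_deferred++;
}

static void toy_unplug(void)
{
    assert(plug_count > 0);
    if (--plug_count > 0) {
        return; /* only the outermost unplug runs the deferred calls */
    }
    for (unsigned i = 0; i < num_deferred; i++) {
        deferred[i].fn(deferred[i].opaque);
    }
    num_deferred = 0;
}

static void submit_queue(void *opaque)
{
    printf("submitting queue %s\n", (char *)opaque);
}

int main(void)
{
    static char q0[] = "q0", q1[] = "q1";

    toy_plug();
    toy_plug_call(submit_queue, q0);
    toy_plug_call(submit_queue, q0); /* deduplicated */
    toy_plug_call(submit_queue, q1);
    toy_unplug(); /* prints q0 once, then q1 */
    return 0;
}

Built with any C11 compiler, this prints "submitting queue q0" once followed
by "submitting queue q1", showing how repeated calls with the same fn/opaque
pair collapse into a single submission at unplug time.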
Stop using the .bdrv_co_io_plug() API because it is not multi-queue
block layer friendly. Use the new blk_io_plug_call() API to batch I/O
submission instead.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 20230530180959.1108766-3-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/nvme.c       | 44 ++++++++++++--------------------------------
 block/trace-events |  1 -
 2 files changed, 12 insertions(+), 33 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index XXXXXXX..XXXXXXX 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -XXX,XX +XXX,XX @@
 #include "qemu/vfio-helpers.h"
 #include "block/block-io.h"
 #include "block/block_int.h"
+#include "sysemu/block-backend.h"
 #include "sysemu/replay.h"
 #include "trace.h"

@@ -XXX,XX +XXX,XX @@ struct BDRVNVMeState {
     int blkshift;

     uint64_t max_transfer;
-    bool plugged;

     bool supports_write_zeroes;
     bool supports_discard;
@@ -XXX,XX +XXX,XX @@ static void nvme_kick(NVMeQueuePair *q)
 {
     BDRVNVMeState *s = q->s;

-    if (s->plugged || !q->need_kick) {
+    if (!q->need_kick) {
         return;
     }
     trace_nvme_kick(s, q->index);
@@ -XXX,XX +XXX,XX @@ static bool nvme_process_completion(NVMeQueuePair *q)
     NvmeCqe *c;

     trace_nvme_process_completion(s, q->index, q->inflight);
-    if (s->plugged) {
-        trace_nvme_process_completion_queue_plugged(s, q->index);
-        return false;
-    }

     /*
      * Support re-entrancy when a request cb() function invokes aio_poll().
@@ -XXX,XX +XXX,XX @@ static void nvme_trace_command(const NvmeCmd *cmd)
     }
 }

+static void nvme_unplug_fn(void *opaque)
+{
+    NVMeQueuePair *q = opaque;
+
+    QEMU_LOCK_GUARD(&q->lock);
+    nvme_kick(q);
+    nvme_process_completion(q);
+}
+
 static void nvme_submit_command(NVMeQueuePair *q, NVMeRequest *req,
                                 NvmeCmd *cmd, BlockCompletionFunc cb,
                                 void *opaque)
@@ -XXX,XX +XXX,XX @@ static void nvme_submit_command(NVMeQueuePair *q, NVMeRequest *req,
            q->sq.tail * NVME_SQ_ENTRY_BYTES, cmd, sizeof(*cmd));
     q->sq.tail = (q->sq.tail + 1) % NVME_QUEUE_SIZE;
     q->need_kick++;
-    nvme_kick(q);
-    nvme_process_completion(q);
+    blk_io_plug_call(nvme_unplug_fn, q);
     qemu_mutex_unlock(&q->lock);
 }

@@ -XXX,XX +XXX,XX @@ static void nvme_attach_aio_context(BlockDriverState *bs,
     }
 }

-static void coroutine_fn nvme_co_io_plug(BlockDriverState *bs)
-{
-    BDRVNVMeState *s = bs->opaque;
-    assert(!s->plugged);
-    s->plugged = true;
-}
-
-static void coroutine_fn nvme_co_io_unplug(BlockDriverState *bs)
-{
-    BDRVNVMeState *s = bs->opaque;
-    assert(s->plugged);
-    s->plugged = false;
-    for (unsigned i = INDEX_IO(0); i < s->queue_count; i++) {
-        NVMeQueuePair *q = s->queues[i];
-        qemu_mutex_lock(&q->lock);
-        nvme_kick(q);
-        nvme_process_completion(q);
-        qemu_mutex_unlock(&q->lock);
-    }
-}
-
 static bool nvme_register_buf(BlockDriverState *bs, void *host, size_t size,
                               Error **errp)
 {
     int i;
@@ -XXX,XX +XXX,XX @@ static BlockDriver bdrv_nvme = {
     .bdrv_detach_aio_context  = nvme_detach_aio_context,
     .bdrv_attach_aio_context  = nvme_attach_aio_context,

-    .bdrv_co_io_plug          = nvme_co_io_plug,
-    .bdrv_co_io_unplug        = nvme_co_io_unplug,
-
     .bdrv_register_buf        = nvme_register_buf,
     .bdrv_unregister_buf      = nvme_unregister_buf,
 };
diff --git a/block/trace-events b/block/trace-events
index XXXXXXX..XXXXXXX 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -XXX,XX +XXX,XX @@ nvme_kick(void *s, unsigned q_index) "s %p q #%u"
 nvme_dma_flush_queue_wait(void *s) "s %p"
 nvme_error(int cmd_specific, int sq_head, int sqid, int cid, int status) "cmd_specific %d sq_head %d sqid %d cid %d status 0x%x"
 nvme_process_completion(void *s, unsigned q_index, int inflight) "s %p q #%u inflight %d"
-nvme_process_completion_queue_plugged(void *s, unsigned q_index) "s %p q #%u"
 nvme_complete_command(void *s, unsigned q_index, int cid) "s %p q #%u cid %d"
 nvme_submit_command(void *s, unsigned q_index, int cid) "s %p q #%u cid %d"
 nvme_submit_command_raw(int c0, int c1, int c2, int c3, int c4, int c5, int c6, int c7) "%02x %02x %02x %02x %02x %02x %02x %02x"
-- 
2.40.1
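The effect of the conversion above is that several nvme_submit_command()
calls inside one plugged section lead to a single deferred nvme_unplug_fn()
invocation, i.e. one doorbell kick per batch instead of one per request. A
minimal standalone model of that batching effect, with invented toy_* names
(not the QEMU implementation):

#include <stdio.h>

/*
 * Toy model: submissions only increment need_kick; a single deferred
 * "kick" writes the doorbell once for the whole batch.
 */
typedef struct {
    int need_kick;       /* requests enqueued since the last kick */
    int doorbell_writes; /* number of (modelled) MMIO doorbell writes */
} ToyQueue;

static void toy_kick(ToyQueue *q)
{
    if (!q->need_kick) {
        return;
    }
    q->doorbell_writes++; /* one write covers all pending requests */
    q->need_kick = 0;
}

static void toy_submit(ToyQueue *q)
{
    q->need_kick++;
    /* the real code defers the kick: blk_io_plug_call(nvme_unplug_fn, q) */
}

int main(void)
{
    ToyQueue q = { 0, 0 };

    toy_submit(&q); /* plugged section: three submissions... */
    toy_submit(&q);
    toy_submit(&q);
    toy_kick(&q);   /* ...one kick, standing in for nvme_unplug_fn() */

    printf("3 submissions -> %d doorbell write(s)\n", q.doorbell_writes);
    return 0;
}

Here toy_kick() stands in for the deferred unplug function: need_kick
accumulates across submissions and a single doorbell write flushes the batch.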
1
There are three issues with the current NVMeRequest->busy field:
1
Stop using the .bdrv_co_io_plug() API because it is not multi-queue
2
1. The busy field is accidentally accessed outside q->lock when request
2
block layer friendly. Use the new blk_io_plug_call() API to batch I/O
3
submission fails.
3
submission instead.
4
2. Waiters on free_req_queue are not woken when a request is returned
5
early due to submission failure.
6
2. Finding a free request involves scanning all requests. This makes
7
request submission O(n^2).
8
9
Switch to an O(1) freelist that is always accessed under the lock.
10
11
Also differentiate between NVME_QUEUE_SIZE, the actual SQ/CQ size, and
12
NVME_NUM_REQS, the number of usable requests. This makes the code
13
simpler than using NVME_QUEUE_SIZE everywhere and having to keep in mind
14
that one slot is reserved.
15
4
16
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
5
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
17
Reviewed-by: Sergio Lopez <slp@redhat.com>
6
Reviewed-by: Eric Blake <eblake@redhat.com>
18
Message-id: 20200617132201.1832152-5-stefanha@redhat.com
7
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
8
Acked-by: Kevin Wolf <kwolf@redhat.com>
9
Message-id: 20230530180959.1108766-4-stefanha@redhat.com
19
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
10
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
20
---
11
---
21
block/nvme.c | 81 ++++++++++++++++++++++++++++++++++------------------
12
block/blkio.c | 43 ++++++++++++++++++++++++-------------------
22
1 file changed, 54 insertions(+), 27 deletions(-)
13
1 file changed, 24 insertions(+), 19 deletions(-)
23
14
24
diff --git a/block/nvme.c b/block/nvme.c
15
diff --git a/block/blkio.c b/block/blkio.c
25
index XXXXXXX..XXXXXXX 100644
16
index XXXXXXX..XXXXXXX 100644
26
--- a/block/nvme.c
17
--- a/block/blkio.c
27
+++ b/block/nvme.c
18
+++ b/block/blkio.c
28
@@ -XXX,XX +XXX,XX @@
19
@@ -XXX,XX +XXX,XX @@
29
#define NVME_QUEUE_SIZE 128
20
#include "qemu/error-report.h"
30
#define NVME_BAR_SIZE 8192
21
#include "qapi/qmp/qdict.h"
31
22
#include "qemu/module.h"
23
+#include "sysemu/block-backend.h"
24
#include "exec/memory.h" /* for ram_block_discard_disable() */
25
26
#include "block/block-io.h"
27
@@ -XXX,XX +XXX,XX @@ static void blkio_detach_aio_context(BlockDriverState *bs)
28
NULL, NULL, NULL);
29
}
30
31
-/* Call with s->blkio_lock held to submit I/O after enqueuing a new request */
32
-static void blkio_submit_io(BlockDriverState *bs)
32
+/*
33
+/*
33
+ * We have to leave one slot empty as that is the full queue case where
34
+ * Called by blk_io_unplug() or immediately if not plugged. Called without
34
+ * head == tail + 1.
35
+ * blkio_lock.
35
+ */
36
+ */
36
+#define NVME_NUM_REQS (NVME_QUEUE_SIZE - 1)
37
+static void blkio_unplug_fn(void *opaque)
38
{
39
- if (qatomic_read(&bs->io_plugged) == 0) {
40
- BDRVBlkioState *s = bs->opaque;
41
+ BDRVBlkioState *s = opaque;
42
43
+ WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
44
blkioq_do_io(s->blkioq, NULL, 0, 0, NULL);
45
}
46
}
47
48
+/*
49
+ * Schedule I/O submission after enqueuing a new request. Called without
50
+ * blkio_lock.
51
+ */
52
+static void blkio_submit_io(BlockDriverState *bs)
53
+{
54
+ BDRVBlkioState *s = bs->opaque;
37
+
55
+
38
typedef struct {
56
+ blk_io_plug_call(blkio_unplug_fn, s);
39
int32_t head, tail;
40
uint8_t *queue;
41
@@ -XXX,XX +XXX,XX @@ typedef struct {
42
int cid;
43
void *prp_list_page;
44
uint64_t prp_list_iova;
45
- bool busy;
46
+ int free_req_next; /* q->reqs[] index of next free req */
47
} NVMeRequest;
48
49
typedef struct {
50
@@ -XXX,XX +XXX,XX @@ typedef struct {
51
/* Fields protected by @lock */
52
NVMeQueue sq, cq;
53
int cq_phase;
54
- NVMeRequest reqs[NVME_QUEUE_SIZE];
55
+ int free_req_head;
56
+ NVMeRequest reqs[NVME_NUM_REQS];
57
bool busy;
58
int need_kick;
59
int inflight;
60
@@ -XXX,XX +XXX,XX @@ static NVMeQueuePair *nvme_create_queue_pair(BlockDriverState *bs,
61
qemu_mutex_init(&q->lock);
62
q->index = idx;
63
qemu_co_queue_init(&q->free_req_queue);
64
- q->prp_list_pages = qemu_blockalign0(bs, s->page_size * NVME_QUEUE_SIZE);
65
+ q->prp_list_pages = qemu_blockalign0(bs, s->page_size * NVME_NUM_REQS);
66
r = qemu_vfio_dma_map(s->vfio, q->prp_list_pages,
67
- s->page_size * NVME_QUEUE_SIZE,
68
+ s->page_size * NVME_NUM_REQS,
69
false, &prp_list_iova);
70
if (r) {
71
goto fail;
72
}
73
- for (i = 0; i < NVME_QUEUE_SIZE; i++) {
74
+ q->free_req_head = -1;
75
+ for (i = 0; i < NVME_NUM_REQS; i++) {
76
NVMeRequest *req = &q->reqs[i];
77
req->cid = i + 1;
78
+ req->free_req_next = q->free_req_head;
79
+ q->free_req_head = i;
80
req->prp_list_page = q->prp_list_pages + i * s->page_size;
81
req->prp_list_iova = prp_list_iova + i * s->page_size;
82
}
83
+
84
nvme_init_queue(bs, &q->sq, size, NVME_SQ_ENTRY_BYTES, &local_err);
85
if (local_err) {
86
error_propagate(errp, local_err);
87
@@ -XXX,XX +XXX,XX @@ static void nvme_kick(BDRVNVMeState *s, NVMeQueuePair *q)
88
*/
89
static NVMeRequest *nvme_get_free_req(NVMeQueuePair *q)
90
{
91
- int i;
92
- NVMeRequest *req = NULL;
93
+ NVMeRequest *req;
94
95
qemu_mutex_lock(&q->lock);
96
- while (q->inflight + q->need_kick > NVME_QUEUE_SIZE - 2) {
97
- /* We have to leave one slot empty as that is the full queue case (head
98
- * == tail + 1). */
99
+
100
+ while (q->free_req_head == -1) {
101
if (qemu_in_coroutine()) {
102
trace_nvme_free_req_queue_wait(q);
103
qemu_co_queue_wait(&q->free_req_queue, &q->lock);
104
@@ -XXX,XX +XXX,XX @@ static NVMeRequest *nvme_get_free_req(NVMeQueuePair *q)
105
return NULL;
106
}
107
}
108
- for (i = 0; i < NVME_QUEUE_SIZE; i++) {
109
- if (!q->reqs[i].busy) {
110
- q->reqs[i].busy = true;
111
- req = &q->reqs[i];
112
- break;
113
- }
114
- }
115
- /* We have checked inflight and need_kick while holding q->lock, so one
116
- * free req must be available. */
117
- assert(req);
118
+
119
+ req = &q->reqs[q->free_req_head];
120
+ q->free_req_head = req->free_req_next;
121
+ req->free_req_next = -1;
122
+
123
qemu_mutex_unlock(&q->lock);
124
return req;
125
}
126
127
+/* With q->lock */
128
+static void nvme_put_free_req_locked(NVMeQueuePair *q, NVMeRequest *req)
129
+{
130
+ req->free_req_next = q->free_req_head;
131
+ q->free_req_head = req - q->reqs;
132
+}
57
+}
133
+
58
+
134
+/* With q->lock */
59
static int coroutine_fn
135
+static void nvme_wake_free_req_locked(BDRVNVMeState *s, NVMeQueuePair *q)
60
blkio_co_pdiscard(BlockDriverState *bs, int64_t offset, int64_t bytes)
136
+{
137
+ if (!qemu_co_queue_empty(&q->free_req_queue)) {
138
+ replay_bh_schedule_oneshot_event(s->aio_context,
139
+ nvme_free_req_queue_cb, q);
140
+ }
141
+}
142
+
143
+/* Insert a request in the freelist and wake waiters */
144
+static void nvme_put_free_req_and_wake(BDRVNVMeState *s, NVMeQueuePair *q,
145
+ NVMeRequest *req)
146
+{
147
+ qemu_mutex_lock(&q->lock);
148
+ nvme_put_free_req_locked(q, req);
149
+ nvme_wake_free_req_locked(s, q);
150
+ qemu_mutex_unlock(&q->lock);
151
+}
152
+
153
static inline int nvme_translate_error(const NvmeCqe *c)
154
{
61
{
155
uint16_t status = (le16_to_cpu(c->status) >> 1) & 0xFF;
62
@@ -XXX,XX +XXX,XX @@ blkio_co_pdiscard(BlockDriverState *bs, int64_t offset, int64_t bytes)
156
@@ -XXX,XX +XXX,XX @@ static bool nvme_process_completion(BDRVNVMeState *s, NVMeQueuePair *q)
63
157
req = *preq;
64
WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
158
assert(req.cid == cid);
65
blkioq_discard(s->blkioq, offset, bytes, &cod, 0);
159
assert(req.cb);
66
- blkio_submit_io(bs);
160
- preq->busy = false;
161
+ nvme_put_free_req_locked(q, preq);
162
preq->cb = preq->opaque = NULL;
163
qemu_mutex_unlock(&q->lock);
164
req.cb(req.opaque, ret);
165
@@ -XXX,XX +XXX,XX @@ static bool nvme_process_completion(BDRVNVMeState *s, NVMeQueuePair *q)
166
/* Notify the device so it can post more completions. */
167
smp_mb_release();
168
*q->cq.doorbell = cpu_to_le32(q->cq.head);
169
- if (!qemu_co_queue_empty(&q->free_req_queue)) {
170
- replay_bh_schedule_oneshot_event(s->aio_context,
171
- nvme_free_req_queue_cb, q);
172
- }
173
+ nvme_wake_free_req_locked(s, q);
174
}
67
}
175
q->busy = false;
68
176
return progress;
69
+ blkio_submit_io(bs);
177
@@ -XXX,XX +XXX,XX @@ static coroutine_fn int nvme_co_prw_aligned(BlockDriverState *bs,
70
qemu_coroutine_yield();
178
r = nvme_cmd_map_qiov(bs, &cmd, req, qiov);
71
return cod.ret;
179
qemu_co_mutex_unlock(&s->dma_map_lock);
72
}
180
if (r) {
73
@@ -XXX,XX +XXX,XX @@ blkio_co_preadv(BlockDriverState *bs, int64_t offset, int64_t bytes,
181
- req->busy = false;
74
182
+ nvme_put_free_req_and_wake(s, ioq, req);
75
WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
183
return r;
76
blkioq_readv(s->blkioq, offset, iov, iovcnt, &cod, 0);
77
- blkio_submit_io(bs);
184
}
78
}
185
nvme_submit_command(s, ioq, req, &cmd, nvme_rw_cb, &data);
79
186
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn nvme_co_pdiscard(BlockDriverState *bs,
80
+ blkio_submit_io(bs);
187
qemu_co_mutex_unlock(&s->dma_map_lock);
81
qemu_coroutine_yield();
188
82
189
if (ret) {
83
if (use_bounce_buffer) {
190
- req->busy = false;
84
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn blkio_co_pwritev(BlockDriverState *bs, int64_t offset,
191
+ nvme_put_free_req_and_wake(s, ioq, req);
85
192
goto out;
86
WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
87
blkioq_writev(s->blkioq, offset, iov, iovcnt, &cod, blkio_flags);
88
- blkio_submit_io(bs);
193
}
89
}
194
90
91
+ blkio_submit_io(bs);
92
qemu_coroutine_yield();
93
94
if (use_bounce_buffer) {
95
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn blkio_co_flush(BlockDriverState *bs)
96
97
WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
98
blkioq_flush(s->blkioq, &cod, 0);
99
- blkio_submit_io(bs);
100
}
101
102
+ blkio_submit_io(bs);
103
qemu_coroutine_yield();
104
return cod.ret;
105
}
106
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn blkio_co_pwrite_zeroes(BlockDriverState *bs,
107
108
WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
109
blkioq_write_zeroes(s->blkioq, offset, bytes, &cod, blkio_flags);
110
- blkio_submit_io(bs);
111
}
112
113
+ blkio_submit_io(bs);
114
qemu_coroutine_yield();
115
return cod.ret;
116
}
117
118
-static void coroutine_fn blkio_co_io_unplug(BlockDriverState *bs)
119
-{
120
- BDRVBlkioState *s = bs->opaque;
121
-
122
- WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
123
- blkio_submit_io(bs);
124
- }
125
-}
126
-
127
typedef enum {
128
BMRR_OK,
129
BMRR_SKIP,
130
@@ -XXX,XX +XXX,XX @@ static void blkio_refresh_limits(BlockDriverState *bs, Error **errp)
131
.bdrv_co_pwritev = blkio_co_pwritev, \
132
.bdrv_co_flush_to_disk = blkio_co_flush, \
133
.bdrv_co_pwrite_zeroes = blkio_co_pwrite_zeroes, \
134
- .bdrv_co_io_unplug = blkio_co_io_unplug, \
135
.bdrv_refresh_limits = blkio_refresh_limits, \
136
.bdrv_register_buf = blkio_register_buf, \
137
.bdrv_unregister_buf = blkio_unregister_buf, \
195
--
138
--
196
2.26.2
139
2.40.1
197
diff view generated by jsdifflib
1
Passing around both BDRVNVMeState and NVMeQueuePair is unwieldy. Reduce
1
Stop using the .bdrv_co_io_plug() API because it is not multi-queue
2
the number of function arguments by keeping the BDRVNVMeState pointer in
2
block layer friendly. Use the new blk_io_plug_call() API to batch I/O
3
NVMeQueuePair. This will come in handly when a BH is introduced in a
3
submission instead.
4
later patch and only one argument can be passed to it.
5
4
6
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
5
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
7
Reviewed-by: Sergio Lopez <slp@redhat.com>
6
Reviewed-by: Eric Blake <eblake@redhat.com>
8
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
7
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
9
Message-id: 20200617132201.1832152-7-stefanha@redhat.com
8
Acked-by: Kevin Wolf <kwolf@redhat.com>
9
Message-id: 20230530180959.1108766-5-stefanha@redhat.com
10
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
10
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
11
---
11
---
12
block/nvme.c | 70 ++++++++++++++++++++++++++++------------------------
12
include/block/raw-aio.h | 7 -------
13
1 file changed, 38 insertions(+), 32 deletions(-)
13
block/file-posix.c | 10 ----------
14
block/io_uring.c | 44 ++++++++++++++++-------------------------
15
block/trace-events | 5 ++---
16
4 files changed, 19 insertions(+), 47 deletions(-)
14
17
15
diff --git a/block/nvme.c b/block/nvme.c
18
diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
16
index XXXXXXX..XXXXXXX 100644
19
index XXXXXXX..XXXXXXX 100644
17
--- a/block/nvme.c
20
--- a/include/block/raw-aio.h
18
+++ b/block/nvme.c
21
+++ b/include/block/raw-aio.h
22
@@ -XXX,XX +XXX,XX @@ int coroutine_fn luring_co_submit(BlockDriverState *bs, int fd, uint64_t offset,
23
QEMUIOVector *qiov, int type);
24
void luring_detach_aio_context(LuringState *s, AioContext *old_context);
25
void luring_attach_aio_context(LuringState *s, AioContext *new_context);
26
-
27
-/*
28
- * luring_io_plug/unplug work in the thread's current AioContext, therefore the
29
- * caller must ensure that they are paired in the same IOThread.
30
- */
31
-void luring_io_plug(void);
32
-void luring_io_unplug(void);
33
#endif
34
35
#ifdef _WIN32
36
diff --git a/block/file-posix.c b/block/file-posix.c
37
index XXXXXXX..XXXXXXX 100644
38
--- a/block/file-posix.c
39
+++ b/block/file-posix.c
40
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn raw_co_io_plug(BlockDriverState *bs)
41
laio_io_plug();
42
}
43
#endif
44
-#ifdef CONFIG_LINUX_IO_URING
45
- if (s->use_linux_io_uring) {
46
- luring_io_plug();
47
- }
48
-#endif
49
}
50
51
static void coroutine_fn raw_co_io_unplug(BlockDriverState *bs)
52
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn raw_co_io_unplug(BlockDriverState *bs)
53
laio_io_unplug(s->aio_max_batch);
54
}
55
#endif
56
-#ifdef CONFIG_LINUX_IO_URING
57
- if (s->use_linux_io_uring) {
58
- luring_io_unplug();
59
- }
60
-#endif
61
}
62
63
static int coroutine_fn raw_co_flush_to_disk(BlockDriverState *bs)
64
diff --git a/block/io_uring.c b/block/io_uring.c
65
index XXXXXXX..XXXXXXX 100644
66
--- a/block/io_uring.c
67
+++ b/block/io_uring.c
19
@@ -XXX,XX +XXX,XX @@
68
@@ -XXX,XX +XXX,XX @@
20
*/
69
#include "block/raw-aio.h"
21
#define NVME_NUM_REQS (NVME_QUEUE_SIZE - 1)
70
#include "qemu/coroutine.h"
22
71
#include "qapi/error.h"
23
+typedef struct BDRVNVMeState BDRVNVMeState;
72
+#include "sysemu/block-backend.h"
24
+
73
#include "trace.h"
25
typedef struct {
74
26
int32_t head, tail;
75
/* Only used for assertions. */
27
uint8_t *queue;
76
@@ -XXX,XX +XXX,XX @@ typedef struct LuringAIOCB {
28
@@ -XXX,XX +XXX,XX @@ typedef struct {
77
} LuringAIOCB;
29
typedef struct {
78
30
QemuMutex lock;
79
typedef struct LuringQueue {
31
80
- int plugged;
32
+ /* Read from I/O code path, initialized under BQL */
81
unsigned int in_queue;
33
+ BDRVNVMeState *s;
82
unsigned int in_flight;
34
+ int index;
83
bool blocked;
35
+
84
@@ -XXX,XX +XXX,XX @@ static void luring_process_completions_and_submit(LuringState *s)
36
/* Fields protected by BQL */
85
{
37
- int index;
86
luring_process_completions(s);
38
uint8_t *prp_list_pages;
87
39
88
- if (!s->io_q.plugged && s->io_q.in_queue > 0) {
40
/* Fields protected by @lock */
89
+ if (s->io_q.in_queue > 0) {
41
@@ -XXX,XX +XXX,XX @@ typedef volatile struct {
90
ioq_submit(s);
42
43
QEMU_BUILD_BUG_ON(offsetof(NVMeRegs, doorbells) != 0x1000);
44
45
-typedef struct {
46
+struct BDRVNVMeState {
47
AioContext *aio_context;
48
QEMUVFIOState *vfio;
49
NVMeRegs *regs;
50
@@ -XXX,XX +XXX,XX @@ typedef struct {
51
52
/* PCI address (required for nvme_refresh_filename()) */
53
char *device;
54
-} BDRVNVMeState;
55
+};
56
57
#define NVME_BLOCK_OPT_DEVICE "device"
58
#define NVME_BLOCK_OPT_NAMESPACE "namespace"
59
@@ -XXX,XX +XXX,XX @@ static void nvme_init_queue(BlockDriverState *bs, NVMeQueue *q,
60
}
91
}
61
}
92
}
62
93
@@ -XXX,XX +XXX,XX @@ static void qemu_luring_poll_ready(void *opaque)
63
-static void nvme_free_queue_pair(BlockDriverState *bs, NVMeQueuePair *q)
94
static void ioq_init(LuringQueue *io_q)
64
+static void nvme_free_queue_pair(NVMeQueuePair *q)
65
{
95
{
66
qemu_vfree(q->prp_list_pages);
96
QSIMPLEQ_INIT(&io_q->submit_queue);
67
qemu_vfree(q->sq.queue);
97
- io_q->plugged = 0;
68
@@ -XXX,XX +XXX,XX @@ static NVMeQueuePair *nvme_create_queue_pair(BlockDriverState *bs,
98
io_q->in_queue = 0;
69
uint64_t prp_list_iova;
99
io_q->in_flight = 0;
70
100
io_q->blocked = false;
71
qemu_mutex_init(&q->lock);
72
+ q->s = s;
73
q->index = idx;
74
qemu_co_queue_init(&q->free_req_queue);
75
q->prp_list_pages = qemu_blockalign0(bs, s->page_size * NVME_NUM_REQS);
76
@@ -XXX,XX +XXX,XX @@ static NVMeQueuePair *nvme_create_queue_pair(BlockDriverState *bs,
77
78
return q;
79
fail:
80
- nvme_free_queue_pair(bs, q);
81
+ nvme_free_queue_pair(q);
82
return NULL;
83
}
101
}
84
102
85
/* With q->lock */
103
-void luring_io_plug(void)
86
-static void nvme_kick(BDRVNVMeState *s, NVMeQueuePair *q)
104
+static void luring_unplug_fn(void *opaque)
87
+static void nvme_kick(NVMeQueuePair *q)
88
{
105
{
89
+ BDRVNVMeState *s = q->s;
106
- AioContext *ctx = qemu_get_current_aio_context();
90
+
107
- LuringState *s = aio_get_linux_io_uring(ctx);
91
if (s->plugged || !q->need_kick) {
108
- trace_luring_io_plug(s);
92
return;
109
- s->io_q.plugged++;
93
}
110
-}
94
@@ -XXX,XX +XXX,XX @@ static void nvme_put_free_req_locked(NVMeQueuePair *q, NVMeRequest *req)
111
-
95
}
112
-void luring_io_unplug(void)
96
113
-{
97
/* With q->lock */
114
- AioContext *ctx = qemu_get_current_aio_context();
98
-static void nvme_wake_free_req_locked(BDRVNVMeState *s, NVMeQueuePair *q)
115
- LuringState *s = aio_get_linux_io_uring(ctx);
99
+static void nvme_wake_free_req_locked(NVMeQueuePair *q)
116
- assert(s->io_q.plugged);
100
{
117
- trace_luring_io_unplug(s, s->io_q.blocked, s->io_q.plugged,
101
if (!qemu_co_queue_empty(&q->free_req_queue)) {
118
- s->io_q.in_queue, s->io_q.in_flight);
102
- replay_bh_schedule_oneshot_event(s->aio_context,
119
- if (--s->io_q.plugged == 0 &&
103
+ replay_bh_schedule_oneshot_event(q->s->aio_context,
120
- !s->io_q.blocked && s->io_q.in_queue > 0) {
104
nvme_free_req_queue_cb, q);
121
+ LuringState *s = opaque;
122
+ trace_luring_unplug_fn(s, s->io_q.blocked, s->io_q.in_queue,
123
+ s->io_q.in_flight);
124
+ if (!s->io_q.blocked && s->io_q.in_queue > 0) {
125
ioq_submit(s);
105
}
126
}
106
}
127
}
107
128
@@ -XXX,XX +XXX,XX @@ static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s,
108
/* Insert a request in the freelist and wake waiters */
129
109
-static void nvme_put_free_req_and_wake(BDRVNVMeState *s, NVMeQueuePair *q,
130
QSIMPLEQ_INSERT_TAIL(&s->io_q.submit_queue, luringcb, next);
110
- NVMeRequest *req)
131
s->io_q.in_queue++;
111
+static void nvme_put_free_req_and_wake(NVMeQueuePair *q, NVMeRequest *req)
132
- trace_luring_do_submit(s, s->io_q.blocked, s->io_q.plugged,
112
{
133
- s->io_q.in_queue, s->io_q.in_flight);
113
qemu_mutex_lock(&q->lock);
134
- if (!s->io_q.blocked &&
114
nvme_put_free_req_locked(q, req);
135
- (!s->io_q.plugged ||
115
- nvme_wake_free_req_locked(s, q);
136
- s->io_q.in_flight + s->io_q.in_queue >= MAX_ENTRIES)) {
116
+ nvme_wake_free_req_locked(q);
137
- ret = ioq_submit(s);
117
qemu_mutex_unlock(&q->lock);
138
- trace_luring_do_submit_done(s, ret);
139
- return ret;
140
+ trace_luring_do_submit(s, s->io_q.blocked, s->io_q.in_queue,
141
+ s->io_q.in_flight);
142
+ if (!s->io_q.blocked) {
143
+ if (s->io_q.in_flight + s->io_q.in_queue >= MAX_ENTRIES) {
144
+ ret = ioq_submit(s);
145
+ trace_luring_do_submit_done(s, ret);
146
+ return ret;
147
+ }
148
+
149
+ blk_io_plug_call(luring_unplug_fn, s);
150
}
151
return 0;
118
}
152
}
119
153
diff --git a/block/trace-events b/block/trace-events
120
@@ -XXX,XX +XXX,XX @@ static inline int nvme_translate_error(const NvmeCqe *c)
154
index XXXXXXX..XXXXXXX 100644
121
}
155
--- a/block/trace-events
122
156
+++ b/block/trace-events
123
/* With q->lock */
157
@@ -XXX,XX +XXX,XX @@ file_paio_submit(void *acb, void *opaque, int64_t offset, int count, int type) "
124
-static bool nvme_process_completion(BDRVNVMeState *s, NVMeQueuePair *q)
158
# io_uring.c
125
+static bool nvme_process_completion(NVMeQueuePair *q)
159
luring_init_state(void *s, size_t size) "s %p size %zu"
126
{
160
luring_cleanup_state(void *s) "%p freed"
127
+ BDRVNVMeState *s = q->s;
161
-luring_io_plug(void *s) "LuringState %p plug"
128
bool progress = false;
162
-luring_io_unplug(void *s, int blocked, int plugged, int queued, int inflight) "LuringState %p blocked %d plugged %d queued %d inflight %d"
129
NVMeRequest *preq;
163
-luring_do_submit(void *s, int blocked, int plugged, int queued, int inflight) "LuringState %p blocked %d plugged %d queued %d inflight %d"
130
NVMeRequest req;
164
+luring_unplug_fn(void *s, int blocked, int queued, int inflight) "LuringState %p blocked %d queued %d inflight %d"
131
@@ -XXX,XX +XXX,XX @@ static bool nvme_process_completion(BDRVNVMeState *s, NVMeQueuePair *q)
165
+luring_do_submit(void *s, int blocked, int queued, int inflight) "LuringState %p blocked %d queued %d inflight %d"
132
/* Notify the device so it can post more completions. */
166
luring_do_submit_done(void *s, int ret) "LuringState %p submitted to kernel %d"
133
smp_mb_release();
167
luring_co_submit(void *bs, void *s, void *luringcb, int fd, uint64_t offset, size_t nbytes, int type) "bs %p s %p luringcb %p fd %d offset %" PRId64 " nbytes %zd type %d"
134
*q->cq.doorbell = cpu_to_le32(q->cq.head);
168
luring_process_completion(void *s, void *aiocb, int ret) "LuringState %p luringcb %p ret %d"
135
- nvme_wake_free_req_locked(s, q);
136
+ nvme_wake_free_req_locked(q);
137
}
138
q->busy = false;
139
return progress;
140
@@ -XXX,XX +XXX,XX @@ static void nvme_trace_command(const NvmeCmd *cmd)
141
}
142
}
143
144
-static void nvme_submit_command(BDRVNVMeState *s, NVMeQueuePair *q,
145
- NVMeRequest *req,
146
+static void nvme_submit_command(NVMeQueuePair *q, NVMeRequest *req,
147
NvmeCmd *cmd, BlockCompletionFunc cb,
148
void *opaque)
149
{
150
@@ -XXX,XX +XXX,XX @@ static void nvme_submit_command(BDRVNVMeState *s, NVMeQueuePair *q,
151
req->opaque = opaque;
152
cmd->cid = cpu_to_le32(req->cid);
153
154
- trace_nvme_submit_command(s, q->index, req->cid);
155
+ trace_nvme_submit_command(q->s, q->index, req->cid);
156
nvme_trace_command(cmd);
157
qemu_mutex_lock(&q->lock);
158
memcpy((uint8_t *)q->sq.queue +
159
q->sq.tail * NVME_SQ_ENTRY_BYTES, cmd, sizeof(*cmd));
160
q->sq.tail = (q->sq.tail + 1) % NVME_QUEUE_SIZE;
161
q->need_kick++;
162
- nvme_kick(s, q);
163
- nvme_process_completion(s, q);
164
+ nvme_kick(q);
165
+ nvme_process_completion(q);
166
qemu_mutex_unlock(&q->lock);
167
}
168
169
@@ -XXX,XX +XXX,XX @@ static int nvme_cmd_sync(BlockDriverState *bs, NVMeQueuePair *q,
                          NvmeCmd *cmd)
 {
     NVMeRequest *req;
-    BDRVNVMeState *s = bs->opaque;
     int ret = -EINPROGRESS;
     req = nvme_get_free_req(q);
     if (!req) {
         return -EBUSY;
     }
-    nvme_submit_command(s, q, req, cmd, nvme_cmd_sync_cb, &ret);
+    nvme_submit_command(q, req, cmd, nvme_cmd_sync_cb, &ret);

     BDRV_POLL_WHILE(bs, ret == -EINPROGRESS);
     return ret;
@@ -XXX,XX +XXX,XX @@ static bool nvme_poll_queues(BDRVNVMeState *s)
         }

         qemu_mutex_lock(&q->lock);
-        while (nvme_process_completion(s, q)) {
+        while (nvme_process_completion(q)) {
             /* Keep polling */
             progress = true;
         }
@@ -XXX,XX +XXX,XX @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error **errp)
     };
     if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
         error_setg(errp, "Failed to create io queue [%d]", n);
-        nvme_free_queue_pair(bs, q);
+        nvme_free_queue_pair(q);
         return false;
     }
     cmd = (NvmeCmd) {
@@ -XXX,XX +XXX,XX @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error **errp)
     };
     if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
         error_setg(errp, "Failed to create io queue [%d]", n);
-        nvme_free_queue_pair(bs, q);
+        nvme_free_queue_pair(q);
         return false;
     }
     s->queues = g_renew(NVMeQueuePair *, s->queues, n + 1);
@@ -XXX,XX +XXX,XX @@ static void nvme_close(BlockDriverState *bs)
     BDRVNVMeState *s = bs->opaque;

     for (i = 0; i < s->nr_queues; ++i) {
-        nvme_free_queue_pair(bs, s->queues[i]);
+        nvme_free_queue_pair(s->queues[i]);
     }
     g_free(s->queues);
     aio_set_event_notifier(bdrv_get_aio_context(bs), &s->irq_notifier,
@@ -XXX,XX +XXX,XX @@ static coroutine_fn int nvme_co_prw_aligned(BlockDriverState *bs,
     r = nvme_cmd_map_qiov(bs, &cmd, req, qiov);
     qemu_co_mutex_unlock(&s->dma_map_lock);
     if (r) {
-        nvme_put_free_req_and_wake(s, ioq, req);
+        nvme_put_free_req_and_wake(ioq, req);
         return r;
     }
-    nvme_submit_command(s, ioq, req, &cmd, nvme_rw_cb, &data);
+    nvme_submit_command(ioq, req, &cmd, nvme_rw_cb, &data);

     data.co = qemu_coroutine_self();
     while (data.ret == -EINPROGRESS) {
@@ -XXX,XX +XXX,XX @@ static coroutine_fn int nvme_co_flush(BlockDriverState *bs)
     assert(s->nr_queues > 1);
     req = nvme_get_free_req(ioq);
     assert(req);
-    nvme_submit_command(s, ioq, req, &cmd, nvme_rw_cb, &data);
+    nvme_submit_command(ioq, req, &cmd, nvme_rw_cb, &data);

     data.co = qemu_coroutine_self();
     if (data.ret == -EINPROGRESS) {
@@ -XXX,XX +XXX,XX @@ static coroutine_fn int nvme_co_pwrite_zeroes(BlockDriverState *bs,
     req = nvme_get_free_req(ioq);
     assert(req);

-    nvme_submit_command(s, ioq, req, &cmd, nvme_rw_cb, &data);
+    nvme_submit_command(ioq, req, &cmd, nvme_rw_cb, &data);

     data.co = qemu_coroutine_self();
     while (data.ret == -EINPROGRESS) {
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn nvme_co_pdiscard(BlockDriverState *bs,
     qemu_co_mutex_unlock(&s->dma_map_lock);

     if (ret) {
-        nvme_put_free_req_and_wake(s, ioq, req);
+        nvme_put_free_req_and_wake(ioq, req);
         goto out;
     }

     trace_nvme_dsm(s, offset, bytes);

-    nvme_submit_command(s, ioq, req, &cmd, nvme_rw_cb, &data);
+    nvme_submit_command(ioq, req, &cmd, nvme_rw_cb, &data);

     data.co = qemu_coroutine_self();
     while (data.ret == -EINPROGRESS) {
@@ -XXX,XX +XXX,XX @@ static void nvme_aio_unplug(BlockDriverState *bs)
     for (i = 1; i < s->nr_queues; i++) {
         NVMeQueuePair *q = s->queues[i];
         qemu_mutex_lock(&q->lock);
-        nvme_kick(s, q);
-        nvme_process_completion(s, q);
+        nvme_kick(q);
+        nvme_process_completion(q);
         qemu_mutex_unlock(&q->lock);
     }
 }
--
2.26.2
--
2.40.1
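
These hunks drop the BDRVNVMeState parameter from the queue-pair helpers. That only works because an earlier patch in the series (not visible in this excerpt) presumably makes the queue pair carry a pointer back to the driver state. A minimal standalone sketch of that back-pointer pattern, with invented names rather than the real QEMU structs:

    /* Sketch only: invented names, not the real QEMU structs. */
    #include <stdio.h>

    typedef struct BDRVNVMeState BDRVNVMeState;

    typedef struct {
        BDRVNVMeState *s;   /* back-pointer, set when the pair is created */
        int index;
    } NVMeQueuePair;

    struct BDRVNVMeState {
        NVMeQueuePair *queues[2];
    };

    /* before: nvme_kick(BDRVNVMeState *s, NVMeQueuePair *q) */
    static void nvme_kick(NVMeQueuePair *q)
    {
        printf("kick queue %d of state %p\n", q->index, (void *)q->s);
    }

    int main(void)
    {
        BDRVNVMeState s = {0};
        NVMeQueuePair q = { .s = &s, .index = 1 };

        s.queues[1] = &q;
        nvme_kick(&q);      /* no separate state argument needed */
        return 0;
    }

The payoff is visible in the hunks above: helpers such as nvme_kick() and nvme_process_completion() take a single argument, which also makes them usable from contexts where only the queue pair is at hand.
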
Existing users access free_req_queue under q->lock. Document this.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Sergio Lopez <slp@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20200617132201.1832152-6-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/nvme.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/nvme.c b/block/nvme.c
index XXXXXXX..XXXXXXX 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -XXX,XX +XXX,XX @@ typedef struct {
 } NVMeRequest;

 typedef struct {
-    CoQueue free_req_queue;
     QemuMutex lock;

     /* Fields protected by BQL */
@@ -XXX,XX +XXX,XX @@ typedef struct {
     uint8_t *prp_list_pages;

     /* Fields protected by @lock */
+    CoQueue free_req_queue;
     NVMeQueue sq, cq;
     int cq_phase;
     int free_req_head;
--
2.26.2

Stop using the .bdrv_co_io_plug() API because it is not multi-queue
block layer friendly. Use the new blk_io_plug_call() API to batch I/O
submission instead.

Note that a dev_max_batch check is dropped in laio_io_unplug() because
the semantics of unplug_fn() are different from .bdrv_co_io_unplug():
1. unplug_fn() is only called when the last blk_io_unplug() call occurs,
   not every time blk_io_unplug() is called.
2. unplug_fn() is per-thread, not per-BlockDriverState, so there is no
   way to get per-BlockDriverState fields like dev_max_batch.

Therefore this condition cannot be moved to laio_unplug_fn(). It is not
obvious that this condition affects performance in practice, so I am
removing it instead of trying to come up with a more complex mechanism
to preserve the condition.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Message-id: 20230530180959.1108766-6-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/block/raw-aio.h |  7 -------
 block/file-posix.c      | 28 ----------------------------
 block/linux-aio.c       | 41 +++++++++++------------------------------
 3 files changed, 11 insertions(+), 65 deletions(-)

diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/raw-aio.h
+++ b/include/block/raw-aio.h
@@ -XXX,XX +XXX,XX @@ int coroutine_fn laio_co_submit(int fd, uint64_t offset, QEMUIOVector *qiov,

 void laio_detach_aio_context(LinuxAioState *s, AioContext *old_context);
 void laio_attach_aio_context(LinuxAioState *s, AioContext *new_context);
-
-/*
- * laio_io_plug/unplug work in the thread's current AioContext, therefore the
- * caller must ensure that they are paired in the same IOThread.
- */
-void laio_io_plug(void);
-void laio_io_unplug(uint64_t dev_max_batch);
 #endif
 /* io_uring.c - Linux io_uring implementation */
 #ifdef CONFIG_LINUX_IO_URING
diff --git a/block/file-posix.c b/block/file-posix.c
index XXXXXXX..XXXXXXX 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn raw_co_pwritev(BlockDriverState *bs, int64_t offset,
     return raw_co_prw(bs, offset, bytes, qiov, QEMU_AIO_WRITE);
 }

-static void coroutine_fn raw_co_io_plug(BlockDriverState *bs)
-{
-    BDRVRawState __attribute__((unused)) *s = bs->opaque;
-#ifdef CONFIG_LINUX_AIO
-    if (s->use_linux_aio) {
-        laio_io_plug();
-    }
-#endif
-}
-
-static void coroutine_fn raw_co_io_unplug(BlockDriverState *bs)
-{
-    BDRVRawState __attribute__((unused)) *s = bs->opaque;
-#ifdef CONFIG_LINUX_AIO
-    if (s->use_linux_aio) {
-        laio_io_unplug(s->aio_max_batch);
-    }
-#endif
-}
-
 static int coroutine_fn raw_co_flush_to_disk(BlockDriverState *bs)
 {
     BDRVRawState *s = bs->opaque;
@@ -XXX,XX +XXX,XX @@ BlockDriver bdrv_file = {
     .bdrv_co_copy_range_from = raw_co_copy_range_from,
     .bdrv_co_copy_range_to = raw_co_copy_range_to,
     .bdrv_refresh_limits = raw_refresh_limits,
-    .bdrv_co_io_plug = raw_co_io_plug,
-    .bdrv_co_io_unplug = raw_co_io_unplug,
     .bdrv_attach_aio_context = raw_aio_attach_aio_context,

     .bdrv_co_truncate = raw_co_truncate,
@@ -XXX,XX +XXX,XX @@ static BlockDriver bdrv_host_device = {
     .bdrv_co_copy_range_from = raw_co_copy_range_from,
     .bdrv_co_copy_range_to = raw_co_copy_range_to,
     .bdrv_refresh_limits = raw_refresh_limits,
-    .bdrv_co_io_plug = raw_co_io_plug,
-    .bdrv_co_io_unplug = raw_co_io_unplug,
     .bdrv_attach_aio_context = raw_aio_attach_aio_context,

     .bdrv_co_truncate = raw_co_truncate,
@@ -XXX,XX +XXX,XX @@ static BlockDriver bdrv_host_cdrom = {
     .bdrv_co_pwritev = raw_co_pwritev,
     .bdrv_co_flush_to_disk = raw_co_flush_to_disk,
     .bdrv_refresh_limits = cdrom_refresh_limits,
-    .bdrv_co_io_plug = raw_co_io_plug,
-    .bdrv_co_io_unplug = raw_co_io_unplug,
     .bdrv_attach_aio_context = raw_aio_attach_aio_context,

     .bdrv_co_truncate = raw_co_truncate,
@@ -XXX,XX +XXX,XX @@ static BlockDriver bdrv_host_cdrom = {
     .bdrv_co_pwritev = raw_co_pwritev,
     .bdrv_co_flush_to_disk = raw_co_flush_to_disk,
     .bdrv_refresh_limits = cdrom_refresh_limits,
-    .bdrv_co_io_plug = raw_co_io_plug,
-    .bdrv_co_io_unplug = raw_co_io_unplug,
     .bdrv_attach_aio_context = raw_aio_attach_aio_context,

     .bdrv_co_truncate = raw_co_truncate,
diff --git a/block/linux-aio.c b/block/linux-aio.c
index XXXXXXX..XXXXXXX 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -XXX,XX +XXX,XX @@
 #include "qemu/event_notifier.h"
 #include "qemu/coroutine.h"
 #include "qapi/error.h"
+#include "sysemu/block-backend.h"

 /* Only used for assertions. */
 #include "qemu/coroutine_int.h"
@@ -XXX,XX +XXX,XX @@ struct qemu_laiocb {
 };

 typedef struct {
-    int plugged;
     unsigned int in_queue;
     unsigned int in_flight;
     bool blocked;
@@ -XXX,XX +XXX,XX @@ static void qemu_laio_process_completions_and_submit(LinuxAioState *s)
 {
     qemu_laio_process_completions(s);

-    if (!s->io_q.plugged && !QSIMPLEQ_EMPTY(&s->io_q.pending)) {
+    if (!QSIMPLEQ_EMPTY(&s->io_q.pending)) {
         ioq_submit(s);
     }
 }
@@ -XXX,XX +XXX,XX @@ static void qemu_laio_poll_ready(EventNotifier *opaque)
 static void ioq_init(LaioQueue *io_q)
 {
     QSIMPLEQ_INIT(&io_q->pending);
-    io_q->plugged = 0;
     io_q->in_queue = 0;
     io_q->in_flight = 0;
     io_q->blocked = false;
@@ -XXX,XX +XXX,XX @@ static uint64_t laio_max_batch(LinuxAioState *s, uint64_t dev_max_batch)
     return max_batch;
 }

-void laio_io_plug(void)
+static void laio_unplug_fn(void *opaque)
 {
-    AioContext *ctx = qemu_get_current_aio_context();
-    LinuxAioState *s = aio_get_linux_aio(ctx);
+    LinuxAioState *s = opaque;

-    s->io_q.plugged++;
-}
-
-void laio_io_unplug(uint64_t dev_max_batch)
-{
-    AioContext *ctx = qemu_get_current_aio_context();
-    LinuxAioState *s = aio_get_linux_aio(ctx);
-
-    assert(s->io_q.plugged);
-    s->io_q.plugged--;
-
-    /*
-     * Why max batch checking is performed here:
-     * Another BDS may have queued requests with a higher dev_max_batch and
-     * therefore in_queue could now exceed our dev_max_batch. Re-check the max
-     * batch so we can honor our device's dev_max_batch.
-     */
-    if (s->io_q.in_queue >= laio_max_batch(s, dev_max_batch) ||
-        (!s->io_q.plugged &&
-         !s->io_q.blocked && !QSIMPLEQ_EMPTY(&s->io_q.pending))) {
+    if (!s->io_q.blocked && !QSIMPLEQ_EMPTY(&s->io_q.pending)) {
         ioq_submit(s);
     }
 }
@@ -XXX,XX +XXX,XX @@ static int laio_do_submit(int fd, struct qemu_laiocb *laiocb, off_t offset,

     QSIMPLEQ_INSERT_TAIL(&s->io_q.pending, laiocb, next);
     s->io_q.in_queue++;
-    if (!s->io_q.blocked &&
-        (!s->io_q.plugged ||
-         s->io_q.in_queue >= laio_max_batch(s, dev_max_batch))) {
-        ioq_submit(s);
+    if (!s->io_q.blocked) {
+        if (s->io_q.in_queue >= laio_max_batch(s, dev_max_batch)) {
+            ioq_submit(s);
+        } else {
+            blk_io_plug_call(laio_unplug_fn, s);
+        }
     }

     return 0;
--
2.40.1
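
To see the unplug_fn() semantics described above in isolation, here is a self-contained toy model of thread-local deferred-call batching. The names and data structures are invented; the toy only mirrors the two properties the commit message calls out: the callback runs once, when the outermost unplug happens, and all state is per-thread:

    /* Toy model of thread-local plug/unplug batching; not QEMU code. */
    #include <stdio.h>

    #define MAX_CALLS 4

    typedef void (*UnplugFn)(void *opaque);

    /* one deferred-call list per thread */
    static _Thread_local struct {
        int depth;                 /* nesting level of plug sections */
        int ncalls;
        UnplugFn fns[MAX_CALLS];
        void *opaques[MAX_CALLS];
    } plug_state;

    static void io_plug(void) { plug_state.depth++; }

    /* defer fn(opaque); coalesce duplicates so fn runs once per batch */
    static void io_plug_call(UnplugFn fn, void *opaque)
    {
        if (plug_state.depth == 0) {
            fn(opaque);            /* not plugged: run immediately */
            return;
        }
        for (int i = 0; i < plug_state.ncalls; i++) {
            if (plug_state.fns[i] == fn && plug_state.opaques[i] == opaque) {
                return;            /* already queued for this batch */
            }
        }
        plug_state.fns[plug_state.ncalls] = fn;
        plug_state.opaques[plug_state.ncalls++] = opaque;
    }

    /* only the outermost unplug flushes the deferred callbacks */
    static void io_unplug(void)
    {
        if (--plug_state.depth > 0) {
            return;
        }
        for (int i = 0; i < plug_state.ncalls; i++) {
            plug_state.fns[i](plug_state.opaques[i]);
        }
        plug_state.ncalls = 0;
    }

    static void flush_queue(void *opaque)
    {
        printf("submitting batched requests for %s\n", (const char *)opaque);
    }

    int main(void)
    {
        io_plug();
        io_plug_call(flush_queue, "queue0");   /* queued */
        io_plug_call(flush_queue, "queue0");   /* coalesced */
        io_unplug();                           /* flush_queue() runs once */
        return 0;
    }

Coalescing duplicate registrations is what lets many laio_do_submit() calls within one plugged section result in a single laio_unplug_fn() invocation.
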
nvme_process_completion() explicitly checks cid so the assertion that
follows is always true:

    if (cid == 0 || cid > NVME_QUEUE_SIZE) {
        ...
        continue;
    }
    assert(cid <= NVME_QUEUE_SIZE);

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Sergio Lopez <slp@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20200617132201.1832152-3-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/nvme.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/block/nvme.c b/block/nvme.c
index XXXXXXX..XXXXXXX 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -XXX,XX +XXX,XX @@ static bool nvme_process_completion(BDRVNVMeState *s, NVMeQueuePair *q)
                    cid);
             continue;
         }
-        assert(cid <= NVME_QUEUE_SIZE);
         trace_nvme_complete_command(s, q->index, cid);
         preq = &q->reqs[cid - 1];
         req = *preq;
--
2.26.2

No block driver implements .bdrv_co_io_plug() anymore. Get rid of the
function pointers.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 20230530180959.1108766-7-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/block/block-io.h         |  3 ---
 include/block/block_int-common.h | 11 ----------
 block/io.c                       | 37 --------------------------------
 3 files changed, 51 deletions(-)

diff --git a/include/block/block-io.h b/include/block/block-io.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -XXX,XX +XXX,XX @@ void coroutine_fn bdrv_co_leave(BlockDriverState *bs, AioContext *old_ctx);

 AioContext *child_of_bds_get_parent_aio_context(BdrvChild *c);

-void coroutine_fn GRAPH_RDLOCK bdrv_co_io_plug(BlockDriverState *bs);
-void coroutine_fn GRAPH_RDLOCK bdrv_co_io_unplug(BlockDriverState *bs);
-
 bool coroutine_fn GRAPH_RDLOCK
 bdrv_co_can_store_new_dirty_bitmap(BlockDriverState *bs, const char *name,
                                    uint32_t granularity, Error **errp);
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -XXX,XX +XXX,XX @@ struct BlockDriver {
     void coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_co_debug_event)(
         BlockDriverState *bs, BlkdebugEvent event);

-    /* io queue for linux-aio */
-    void coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_co_io_plug)(BlockDriverState *bs);
-    void coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_co_io_unplug)(
-        BlockDriverState *bs);
-
     bool (*bdrv_supports_persistent_dirty_bitmap)(BlockDriverState *bs);

     bool coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_co_can_store_new_dirty_bitmap)(
@@ -XXX,XX +XXX,XX @@ struct BlockDriverState {
     unsigned int in_flight;
     unsigned int serialising_in_flight;

-    /*
-     * counter for nested bdrv_io_plug.
-     * Accessed with atomic ops.
-     */
-    unsigned io_plugged;
-
     /* do we need to tell the quest if we have a volatile write cache? */
     int enable_write_cache;

diff --git a/block/io.c b/block/io.c
index XXXXXXX..XXXXXXX 100644
--- a/block/io.c
+++ b/block/io.c
@@ -XXX,XX +XXX,XX @@ void *qemu_try_blockalign0(BlockDriverState *bs, size_t size)
     return mem;
 }

-void coroutine_fn bdrv_co_io_plug(BlockDriverState *bs)
-{
-    BdrvChild *child;
-    IO_CODE();
-    assert_bdrv_graph_readable();
-
-    QLIST_FOREACH(child, &bs->children, next) {
-        bdrv_co_io_plug(child->bs);
-    }
-
-    if (qatomic_fetch_inc(&bs->io_plugged) == 0) {
-        BlockDriver *drv = bs->drv;
-        if (drv && drv->bdrv_co_io_plug) {
-            drv->bdrv_co_io_plug(bs);
-        }
-    }
-}
-
-void coroutine_fn bdrv_co_io_unplug(BlockDriverState *bs)
-{
-    BdrvChild *child;
-    IO_CODE();
-    assert_bdrv_graph_readable();
-
-    assert(bs->io_plugged);
-    if (qatomic_fetch_dec(&bs->io_plugged) == 1) {
-        BlockDriver *drv = bs->drv;
-        if (drv && drv->bdrv_co_io_unplug) {
-            drv->bdrv_co_io_unplug(bs);
-        }
-    }
-
-    QLIST_FOREACH(child, &bs->children, next) {
-        bdrv_co_io_unplug(child->bs);
-    }
-}
-
 /* Helper that undoes bdrv_register_buf() when it fails partway through */
 static void GRAPH_RDLOCK
 bdrv_register_buf_rollback(BlockDriverState *bs, void *host, size_t size,
--
2.40.1
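
The io.c hunk above deletes the per-BlockDriverState nesting counter that made bdrv_co_io_plug()/bdrv_co_io_unplug() cheap to nest: only the 0-to-1 and 1-to-0 transitions reached the driver. A single-threaded toy model of that retired pattern (invented names, plain C, atomics omitted):

    /* Toy model of the removed nesting-counter plug pattern. */
    #include <assert.h>
    #include <stdio.h>

    static unsigned io_plugged;

    static void drv_plug(void)   { printf("driver plug\n"); }
    static void drv_unplug(void) { printf("driver unplug\n"); }

    static void io_plug(void)
    {
        if (io_plugged++ == 0) {   /* 0 -> 1 transition only */
            drv_plug();
        }
    }

    static void io_unplug(void)
    {
        assert(io_plugged);
        if (--io_plugged == 0) {   /* 1 -> 0 transition only */
            drv_unplug();
        }
    }

    int main(void)
    {
        io_plug();
        io_plug();      /* nested: no driver call */
        io_unplug();    /* nested: no driver call */
        io_unplug();    /* driver unplug happens here */
        return 0;
    }

The blk_io_plug_call() API introduced earlier in this series keeps the same nesting behavior but moves the state from each BlockDriverState into the calling thread.
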
From: Daniele Buono <dbuono@linux.vnet.ibm.com>

LLVM's SafeStack instrumentation does not yet support programs that make
use of the APIs in ucontext.h.
With the current implementation of coroutine-ucontext, the resulting
binary is incorrect, with different coroutines sharing the same unsafe
stack and producing undefined behavior at runtime.

This fix allocates an additional unsafe stack area for each coroutine,
and sets the new unsafe stack pointer before calling swapcontext() in
qemu_coroutine_new.
This is the only place where the pointer needs to be manually updated,
since sigsetjmp/siglongjmp are already instrumented by LLVM to properly
support SafeStack.
The additional stack is then freed in qemu_coroutine_delete.

Signed-off-by: Daniele Buono <dbuono@linux.vnet.ibm.com>
Message-id: 20200529205122.714-2-dbuono@linux.vnet.ibm.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/qemu/coroutine_int.h |  5 +++++
 util/coroutine-ucontext.c    | 28 ++++++++++++++++++++++++++++
 2 files changed, 33 insertions(+)

diff --git a/include/qemu/coroutine_int.h b/include/qemu/coroutine_int.h
index XXXXXXX..XXXXXXX 100644
--- a/include/qemu/coroutine_int.h
+++ b/include/qemu/coroutine_int.h
@@ -XXX,XX +XXX,XX @@
 #include "qemu/queue.h"
 #include "qemu/coroutine.h"

+#ifdef CONFIG_SAFESTACK
+/* Pointer to the unsafe stack, defined by the compiler */
+extern __thread void *__safestack_unsafe_stack_ptr;
+#endif
+
 #define COROUTINE_STACK_SIZE (1 << 20)

 typedef enum {
diff --git a/util/coroutine-ucontext.c b/util/coroutine-ucontext.c
index XXXXXXX..XXXXXXX 100644
--- a/util/coroutine-ucontext.c
+++ b/util/coroutine-ucontext.c
@@ -XXX,XX +XXX,XX @@ typedef struct {
     Coroutine base;
     void *stack;
     size_t stack_size;
+#ifdef CONFIG_SAFESTACK
+    /* Need an unsafe stack for each coroutine */
+    void *unsafe_stack;
+    size_t unsafe_stack_size;
+#endif
     sigjmp_buf env;

     void *tsan_co_fiber;
@@ -XXX,XX +XXX,XX @@ Coroutine *qemu_coroutine_new(void)
     co = g_malloc0(sizeof(*co));
     co->stack_size = COROUTINE_STACK_SIZE;
     co->stack = qemu_alloc_stack(&co->stack_size);
+#ifdef CONFIG_SAFESTACK
+    co->unsafe_stack_size = COROUTINE_STACK_SIZE;
+    co->unsafe_stack = qemu_alloc_stack(&co->unsafe_stack_size);
+#endif
     co->base.entry_arg = &old_env; /* stash away our jmp_buf */

     uc.uc_link = &old_uc;
@@ -XXX,XX +XXX,XX @@ Coroutine *qemu_coroutine_new(void)
                           COROUTINE_YIELD,
                           &fake_stack_save,
                           co->stack, co->stack_size, co->tsan_co_fiber);
+
+#ifdef CONFIG_SAFESTACK
+        /*
+         * Before we swap the context, set the new unsafe stack.
+         * The unsafe stack grows just like the normal stack, so start from
+         * the last usable location of the memory area.
+         * NOTE: we don't have to re-set the usp afterwards because we are
+         * coming back to this context through a siglongjmp.
+         * The compiler already wrapped the corresponding sigsetjmp call with
+         * code that saves the usp on the (safe) stack before the call, and
+         * restores it right after (which is where we return with siglongjmp).
+         */
+        void *usp = co->unsafe_stack + co->unsafe_stack_size;
+        __safestack_unsafe_stack_ptr = usp;
+#endif
+
         swapcontext(&old_uc, &uc);
     }

@@ -XXX,XX +XXX,XX @@ void qemu_coroutine_delete(Coroutine *co_)
 #endif

     qemu_free_stack(co->stack, co->stack_size);
+#ifdef CONFIG_SAFESTACK
+    qemu_free_stack(co->unsafe_stack, co->unsafe_stack_size);
+#endif
     g_free(co);
 }

--
2.26.2

From: Stefano Garzarella <sgarzare@redhat.com>

Some virtio-blk drivers (e.g. virtio-blk-vhost-vdpa) support fd
passing. Let's expose this to the user, so the management layer
can pass the file descriptor of an already opened path.

If the libblkio virtio-blk driver supports fd passing, let's always
use qemu_open() to open the `path`, so we can handle fd passing
from the management layer through the "/dev/fdset/N" special path.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Message-id: 20230530071941.8954-2-sgarzare@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/blkio.c | 53 ++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 44 insertions(+), 9 deletions(-)

diff --git a/block/blkio.c b/block/blkio.c
index XXXXXXX..XXXXXXX 100644
--- a/block/blkio.c
+++ b/block/blkio.c
@@ -XXX,XX +XXX,XX @@ static int blkio_virtio_blk_common_open(BlockDriverState *bs,
 {
     const char *path = qdict_get_try_str(options, "path");
     BDRVBlkioState *s = bs->opaque;
-    int ret;
+    bool fd_supported = false;
+    int fd, ret;

     if (!path) {
         error_setg(errp, "missing 'path' option");
         return -EINVAL;
     }

-    ret = blkio_set_str(s->blkio, "path", path);
-    qdict_del(options, "path");
-    if (ret < 0) {
-        error_setg_errno(errp, -ret, "failed to set path: %s",
-                         blkio_get_error_msg());
-        return ret;
-    }
-
     if (!(flags & BDRV_O_NOCACHE)) {
         error_setg(errp, "cache.direct=off is not supported");
         return -EINVAL;
     }
+
+    if (blkio_get_int(s->blkio, "fd", &fd) == 0) {
+        fd_supported = true;
+    }
+
+    /*
+     * If the libblkio driver supports fd passing, let's always use qemu_open()
+     * to open the `path`, so we can handle fd passing from the management
+     * layer through the "/dev/fdset/N" special path.
+     */
+    if (fd_supported) {
+        int open_flags;
+
+        if (flags & BDRV_O_RDWR) {
+            open_flags = O_RDWR;
+        } else {
+            open_flags = O_RDONLY;
+        }
+
+        fd = qemu_open(path, open_flags, errp);
+        if (fd < 0) {
+            return -EINVAL;
+        }
+
+        ret = blkio_set_int(s->blkio, "fd", fd);
+        if (ret < 0) {
+            error_setg_errno(errp, -ret, "failed to set fd: %s",
+                             blkio_get_error_msg());
+            qemu_close(fd);
+            return ret;
+        }
+    } else {
+        ret = blkio_set_str(s->blkio, "path", path);
+        if (ret < 0) {
+            error_setg_errno(errp, -ret, "failed to set path: %s",
+                             blkio_get_error_msg());
+            return ret;
+        }
+    }
+
+    qdict_del(options, "path");
+
     return 0;
 }

--
2.40.1
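
Some background on why the ucontext patch above allocates one unsafe stack per coroutine: under SafeStack, address-taken buffers and other locals that the compiler cannot prove safe are moved off the register-addressed safe stack onto the unsafe stack. An illustrative example of such a local, unrelated to QEMU; building it with clang -fsanitize=safe-stack places buf on the unsafe stack:

    #include <stdio.h>
    #include <string.h>

    /* fill() receives the address of its caller's buffer, so the
     * compiler cannot prove accesses to buf stay in bounds and must
     * place buf on the unsafe stack */
    static void fill(char *p, size_t n)
    {
        memset(p, 'x', n - 1);
        p[n - 1] = '\0';
    }

    int main(void)
    {
        char buf[64];           /* address-taken: unsafe-stack resident */

        fill(buf, sizeof(buf));
        printf("%s\n", buf);
        return 0;
    }

If two coroutines shared one unsafe stack, locals like buf belonging to different coroutines would overlap, which is exactly the undefined behavior the commit message describes.
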
From: Daniele Buono <dbuono@linux.vnet.ibm.com>

The current implementation of LLVM's SafeStack is not compatible with
code that uses an alternate stack created with sigaltstack().
Since coroutine-sigaltstack relies on sigaltstack(), it is not
compatible with SafeStack. The resulting binary is incorrect, with
different coroutines sharing the same unsafe stack and producing
undefined behavior at runtime.

In the future LLVM may provide a SafeStack implementation compatible with
sigaltstack(). In the meantime, if SafeStack is desired, the coroutine
implementation from coroutine-ucontext should be used.
As a safety check, add a check in coroutine-sigaltstack that raises a
preprocessor #error if SafeStack is enabled and we are trying to
use coroutine-sigaltstack to implement coroutines.

Signed-off-by: Daniele Buono <dbuono@linux.vnet.ibm.com>
Message-id: 20200529205122.714-3-dbuono@linux.vnet.ibm.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 util/coroutine-sigaltstack.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/util/coroutine-sigaltstack.c b/util/coroutine-sigaltstack.c
index XXXXXXX..XXXXXXX 100644
--- a/util/coroutine-sigaltstack.c
+++ b/util/coroutine-sigaltstack.c
@@ -XXX,XX +XXX,XX @@
 #include "qemu-common.h"
 #include "qemu/coroutine_int.h"

+#ifdef CONFIG_SAFESTACK
+#error "SafeStack is not compatible with code run in alternate signal stacks"
+#endif
+
 typedef struct {
     Coroutine base;
     void *stack;
--
2.26.2

From: Daniele Buono <dbuono@linux.vnet.ibm.com>

This patch adds a flag to enable/disable the SafeStack instrumentation
provided by LLVM.

On enable, make sure that the compiler supports the flags, and that we
are using the proper coroutine implementation (coroutine-ucontext).
On disable, explicitly disable the option if it was enabled by default.

While SafeStack is supported only on Linux, NetBSD, FreeBSD and macOS,
we are not checking for the O.S. since this is already done by LLVM.

Signed-off-by: Daniele Buono <dbuono@linux.vnet.ibm.com>
Message-id: 20200529205122.714-4-dbuono@linux.vnet.ibm.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 configure | 73 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)

diff --git a/configure b/configure
index XXXXXXX..XXXXXXX 100755
--- a/configure
+++ b/configure
@@ -XXX,XX +XXX,XX @@ audio_win_int=""
 libs_qga=""
 debug_info="yes"
 stack_protector=""
+safe_stack=""
 use_containers="yes"
 gdb_bin=$(command -v "gdb-multiarch" || command -v "gdb")

@@ -XXX,XX +XXX,XX @@ for opt do
   ;;
   --disable-stack-protector) stack_protector="no"
   ;;
+  --enable-safe-stack) safe_stack="yes"
+  ;;
+  --disable-safe-stack) safe_stack="no"
+  ;;
   --disable-curses) curses="no"
   ;;
   --enable-curses) curses="yes"
@@ -XXX,XX +XXX,XX @@ disabled with --disable-FEATURE, default is enabled if available:
   debug-tcg       TCG debugging (default is disabled)
   debug-info      debugging information
   sparse          sparse checker
+  safe-stack      SafeStack Stack Smash Protection. Depends on
+                  clang/llvm >= 3.7 and requires coroutine backend ucontext.

   gnutls          GNUTLS cryptography support
   nettle          nettle cryptography support
@@ -XXX,XX +XXX,XX @@ if test "$debug_stack_usage" = "yes"; then
   fi
 fi

+##################################################
+# SafeStack
+
+
+if test "$safe_stack" = "yes"; then
+cat > $TMPC << EOF
+int main(int argc, char *argv[])
+{
+#if ! __has_feature(safe_stack)
+#error SafeStack Disabled
+#endif
+    return 0;
+}
+EOF
+  flag="-fsanitize=safe-stack"
+  # Check that safe-stack is supported and enabled.
+  if compile_prog "-Werror $flag" "$flag"; then
+    # Flag needed both at compilation and at linking
+    QEMU_CFLAGS="$QEMU_CFLAGS $flag"
+    QEMU_LDFLAGS="$QEMU_LDFLAGS $flag"
+  else
+    error_exit "SafeStack not supported by your compiler"
+  fi
+  if test "$coroutine" != "ucontext"; then
+    error_exit "SafeStack is only supported by the coroutine backend ucontext"
+  fi
+else
+cat > $TMPC << EOF
+int main(int argc, char *argv[])
+{
+#if defined(__has_feature)
+#if __has_feature(safe_stack)
+#error SafeStack Enabled
+#endif
+#endif
+    return 0;
+}
+EOF
+if test "$safe_stack" = "no"; then
+  # Make sure that safe-stack is disabled
+  if ! compile_prog "-Werror" ""; then
+    # SafeStack was already enabled, try to explicitly remove the feature
+    flag="-fno-sanitize=safe-stack"
+    if ! compile_prog "-Werror $flag" "$flag"; then
+      error_exit "Configure cannot disable SafeStack"
+    fi
+    QEMU_CFLAGS="$QEMU_CFLAGS $flag"
+    QEMU_LDFLAGS="$QEMU_LDFLAGS $flag"
+  fi
+else # "$safe_stack" = ""
+  # Set safe_stack to yes or no based on pre-existing flags
+  if compile_prog "-Werror" ""; then
+    safe_stack="no"
+  else
+    safe_stack="yes"
+    if test "$coroutine" != "ucontext"; then
+      error_exit "SafeStack is only supported by the coroutine backend ucontext"
+    fi
+  fi
+fi
+fi

 ##########################################
 # check if we have open_by_handle_at
@@ -XXX,XX +XXX,XX @@ echo "sparse enabled    $sparse"
 echo "strip binaries    $strip_opt"
 echo "profiler          $profiler"
 echo "static build      $static"
+echo "safe stack        $safe_stack"
 if test "$darwin" = "yes" ; then
     echo "Cocoa support     $cocoa"
 fi
@@ -XXX,XX +XXX,XX @@ if test "$ccache_cpp2" = "yes"; then
   echo "export CCACHE_CPP2=y" >> $config_host_mak
 fi

+if test "$safe_stack" = "yes"; then
+  echo "CONFIG_SAFESTACK=y" >> $config_host_mak
+fi
+
 # If we're using a separate build tree, set it up now.
 # DIRS are directories which we simply mkdir in the build tree;
 # LINKS are things to symlink back into the source tree
--
2.26.2
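
The two probes that configure compiles can also be tried by hand. A standalone version of the same __has_feature detection logic, offered as an illustration and assuming nothing beyond a C compiler; it reports whether SafeStack instrumentation is active under the current build flags:

    #include <stdio.h>

    int main(void)
    {
    #if defined(__has_feature)
    # if __has_feature(safe_stack)
        /* true when built with clang -fsanitize=safe-stack */
        printf("SafeStack instrumentation is active\n");
        return 0;
    # endif
    #endif
        printf("SafeStack instrumentation is not active\n");
        return 1;
    }
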
From: Daniele Buono <dbuono@linux.vnet.ibm.com>

SafeStack is a stack protection technique implemented in llvm. It is
enabled with a -fsanitize flag.
iotests are currently disabled when any -fsanitize option is used,
because such options tend to produce additional warnings and false
positives.

While common -fsanitize options are used to verify the code and not
added in production, SafeStack's main use is in production environments
to protect against stack smashing.

Since SafeStack does not print any warning or false positive, enable
iotests when SafeStack is the only -fsanitize option used.
This is likely going to be a production binary and we want to make sure
it works correctly.

Signed-off-by: Daniele Buono <dbuono@linux.vnet.ibm.com>
Message-id: 20200529205122.714-5-dbuono@linux.vnet.ibm.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tests/check-block.sh | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/tests/check-block.sh b/tests/check-block.sh
index XXXXXXX..XXXXXXX 100755
--- a/tests/check-block.sh
+++ b/tests/check-block.sh
@@ -XXX,XX +XXX,XX @@ if grep -q "CONFIG_GPROF=y" config-host.mak 2>/dev/null ; then
     exit 0
 fi

-if grep -q "CFLAGS.*-fsanitize" config-host.mak 2>/dev/null ; then
+# Disable tests with any sanitizer except for SafeStack
+CFLAGS=$( grep "CFLAGS.*-fsanitize" config-host.mak 2>/dev/null )
+SANITIZE_FLAGS=""
+#Remove all occurrencies of -fsanitize=safe-stack
+for i in ${CFLAGS}; do
+        if [ "${i}" != "-fsanitize=safe-stack" ]; then
+                SANITIZE_FLAGS="${SANITIZE_FLAGS} ${i}"
+        fi
+done
+if echo ${SANITIZE_FLAGS} | grep -q "\-fsanitize" 2>/dev/null; then
+    # Have a sanitize flag that is not allowed, stop
     echo "Sanitizers are enabled ==> Not running the qemu-iotests."
     exit 0
 fi
--
2.26.2

A lot of CPU time is spent simply locking/unlocking q->lock during
polling. Check for completion outside the lock to make q->lock disappear
from the profile.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Sergio Lopez <slp@redhat.com>
Message-id: 20200617132201.1832152-2-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/nvme.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/block/nvme.c b/block/nvme.c
index XXXXXXX..XXXXXXX 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -XXX,XX +XXX,XX @@ static bool nvme_poll_queues(BDRVNVMeState *s)

     for (i = 0; i < s->nr_queues; i++) {
         NVMeQueuePair *q = s->queues[i];
+        const size_t cqe_offset = q->cq.head * NVME_CQ_ENTRY_BYTES;
+        NvmeCqe *cqe = (NvmeCqe *)&q->cq.queue[cqe_offset];
+
+        /*
+         * Do an early check for completions. q->lock isn't needed because
+         * nvme_process_completion() only runs in the event loop thread and
+         * cannot race with itself.
+         */
+        if ((le16_to_cpu(cqe->status) & 0x1) == q->cq_phase) {
+            continue;
+        }
+
         qemu_mutex_lock(&q->lock);
         while (nvme_process_completion(s, q)) {
             /* Keep polling */
--
2.26.2

From: Stefano Garzarella <sgarzare@redhat.com>

The virtio-blk-vhost-vdpa driver in libblkio 1.3.0 supports fd
passing through the new 'fd' property.

Now that we use qemu_open() on '@path' when the virtio-blk driver
supports fd passing, let's announce it.
In this way, the management layer can pass the file descriptor of an
already opened vhost-vdpa character device. This is useful especially
when the device can only be accessed with certain privileges.

Add the '@fdset' feature only when the virtio-blk-vhost-vdpa driver
in libblkio supports it.

Suggested-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Message-id: 20230530071941.8954-3-sgarzare@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 qapi/block-core.json | 6 ++++++
 meson.build          | 4 ++++
 2 files changed, 10 insertions(+)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index XXXXXXX..XXXXXXX 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -XXX,XX +XXX,XX @@
 #
 # @path: path to the vhost-vdpa character device.
 #
+# Features:
+# @fdset: Member @path supports the special "/dev/fdset/N" path
+#     (since 8.1)
+#
 # Since: 7.2
 ##
 { 'struct': 'BlockdevOptionsVirtioBlkVhostVdpa',
   'data': { 'path': 'str' },
+  'features': [ { 'name' :'fdset',
+                  'if': 'CONFIG_BLKIO_VHOST_VDPA_FD' } ],
   'if': 'CONFIG_BLKIO' }

 ##
diff --git a/meson.build b/meson.build
index XXXXXXX..XXXXXXX 100644
--- a/meson.build
+++ b/meson.build
@@ -XXX,XX +XXX,XX @@ config_host_data.set('CONFIG_LZO', lzo.found())
 config_host_data.set('CONFIG_MPATH', mpathpersist.found())
 config_host_data.set('CONFIG_MPATH_NEW_API', mpathpersist_new_api)
 config_host_data.set('CONFIG_BLKIO', blkio.found())
+if blkio.found()
+  config_host_data.set('CONFIG_BLKIO_VHOST_VDPA_FD',
+                       blkio.version().version_compare('>=1.3.0'))
+endif
 config_host_data.set('CONFIG_CURL', curl.found())
 config_host_data.set('CONFIG_CURSES', curses.found())
 config_host_data.set('CONFIG_GBM', gbm.found())
--
2.40.1
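
The early check added in the nvme patch above hinges on the NVMe completion queue phase tag. A self-contained toy model of the convention, with invented types and little-endian handling omitted: an entry whose phase bit still equals the queue's recorded phase has not been rewritten by the device, and the recorded phase flips whenever the head wraps, so stale entries are never re-consumed:

    /* Toy model of the completion-queue phase tag; not QEMU code. */
    #include <stdint.h>
    #include <stdio.h>

    #define CQ_SIZE 4

    typedef struct { uint16_t status; } Cqe;  /* bit 0 = phase tag */

    typedef struct {
        Cqe queue[CQ_SIZE];
        unsigned head;
        int phase;          /* phase of already-consumed entries */
    } CompletionQueue;

    /* lock-free peek, mirroring the early check above: a phase tag
     * equal to cq->phase means the device has written nothing new */
    static int cq_entry_pending(CompletionQueue *cq)
    {
        return (cq->queue[cq->head].status & 0x1) != cq->phase;
    }

    static void cq_pop(CompletionQueue *cq)
    {
        cq->head = (cq->head + 1) % CQ_SIZE;
        if (cq->head == 0) {
            cq->phase = !cq->phase;   /* wrapped: old entries are stale */
        }
    }

    int main(void)
    {
        CompletionQueue cq = { .phase = 0 };

        cq.queue[0].status = 0x1;                        /* device posts */
        printf("pending: %d\n", cq_entry_pending(&cq));  /* 1 */
        cq_pop(&cq);
        printf("pending: %d\n", cq_entry_pending(&cq));  /* 0 */
        return 0;
    }

Because the peek reads a single entry that only the device writes and only this thread consumes, it is safe without taking q->lock, which is exactly what makes the lock disappear from the CPU profile.
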