The following changes since commit c6a5fc2ac76c5ab709896ee1b0edd33685a67ed1:

  decodetree: Add --output-null for meson testing (2023-05-31 19:56:42 -0700)

are available in the Git repository at:

  https://gitlab.com/stefanha/qemu.git tags/block-pull-request

for you to fetch changes up to 98b126f5e3228a346c774e569e26689943b401dd:

  qapi: add '@fdset' feature for BlockdevOptionsVirtioBlkVhostVdpa (2023-06-01 11:08:21 -0400)

----------------------------------------------------------------
Pull request

- Stefano Garzarella's blkio block driver 'fd' parameter
- My thread-local blk_io_plug() series

----------------------------------------------------------------

Stefan Hajnoczi (6):
  block: add blk_io_plug_call() API
  block/nvme: convert to blk_io_plug_call() API
  block/blkio: convert to blk_io_plug_call() API
  block/io_uring: convert to blk_io_plug_call() API
  block/linux-aio: convert to blk_io_plug_call() API
  block: remove bdrv_co_io_plug() API

Stefano Garzarella (2):
  block/blkio: use qemu_open() to support fd passing for virtio-blk
  qapi: add '@fdset' feature for BlockdevOptionsVirtioBlkVhostVdpa

 MAINTAINERS                       |   1 +
 qapi/block-core.json              |   6 ++
 meson.build                       |   4 +
 include/block/block-io.h          |   3 -
 include/block/block_int-common.h  |  11 ---
 include/block/raw-aio.h           |  14 ---
 include/sysemu/block-backend-io.h |  13 +--
 block/blkio.c                     |  96 ++++++++++++------
 block/block-backend.c             |  22 -----
 block/file-posix.c                |  38 -------
 block/io.c                        |  37 -------
 block/io_uring.c                  |  44 ++++-----
 block/linux-aio.c                 |  41 +++-----
 block/nvme.c                      |  44 +++------
 block/plug.c                      | 159 ++++++++++++++++++++++++++++++
 hw/block/dataplane/xen-block.c    |   8 +-
 hw/block/virtio-blk.c             |   4 +-
 hw/scsi/virtio-scsi.c             |   6 +-
 block/meson.build                 |   1 +
 block/trace-events                |   6 +-
 20 files changed, 293 insertions(+), 265 deletions(-)
 create mode 100644 block/plug.c

--
2.40.1

block: add blk_io_plug_call() API

Introduce a new API for thread-local blk_io_plug() that does not
traverse the block graph. The goal is to make blk_io_plug() multi-queue
friendly.

Instead of having block drivers track whether or not we're in a plugged
section, provide an API that allows them to defer a function call until
we're unplugged: blk_io_plug_call(fn, opaque). If blk_io_plug_call() is
called multiple times with the same fn/opaque pair, then fn() is only
called once at the end of the function - resulting in batching.

This patch introduces the API and changes blk_io_plug()/blk_io_unplug().
blk_io_plug()/blk_io_unplug() no longer require a BlockBackend argument
because the plug state is now thread-local.

Later patches convert block drivers to blk_io_plug_call() and then we
can finally remove .bdrv_co_io_plug() once all block drivers have been
converted.
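
To illustrate the intended semantics, here is a minimal standalone model.
It is not part of this patch: it collapses the thread-local state into
globals and uses a fixed-size array, purely so the API contract is easy
to see (the real implementation is in block/plug.c below):

    /* plug_demo.c - toy model of the blk_io_plug_call() contract */
    #include <assert.h>
    #include <stdio.h>

    typedef struct {
        void (*fn)(void *);
        void *opaque;
    } Call;

    static unsigned plug_count;  /* nesting depth; thread-local in the real API */
    static Call deferred[16];
    static unsigned num_deferred;

    static void plug_call(void (*fn)(void *), void *opaque)
    {
        if (plug_count == 0) {
            fn(opaque);                 /* not plugged: call immediately */
            return;
        }
        for (unsigned i = 0; i < num_deferred; i++) {
            if (deferred[i].fn == fn && deferred[i].opaque == opaque) {
                return;                 /* same fn/opaque pair: batched */
            }
        }
        assert(num_deferred < 16);
        deferred[num_deferred++] = (Call){ fn, opaque };
    }

    static void unplug(void)
    {
        assert(plug_count > 0);
        if (--plug_count > 0) {
            return;                     /* only the outermost unplug runs fns */
        }
        for (unsigned i = 0; i < num_deferred; i++) {
            deferred[i].fn(deferred[i].opaque);
        }
        num_deferred = 0;
    }

    static void submit(void *opaque)
    {
        printf("submit(%s)\n", (char *)opaque);
    }

    int main(void)
    {
        plug_count++;            /* blk_io_plug() */
        plug_call(submit, "q0");
        plug_call(submit, "q0"); /* deduplicated against the previous call */
        unplug();                /* blk_io_unplug(): prints "submit(q0)" once */
        return 0;
    }
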
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 20230530180959.1108766-2-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 MAINTAINERS                       |   1 +
 include/sysemu/block-backend-io.h |  13 +--
 block/block-backend.c             |  22 -----
 block/plug.c                      | 159 ++++++++++++++++++++++++++++++
 hw/block/dataplane/xen-block.c    |   8 +-
 hw/block/virtio-blk.c             |   4 +-
 hw/scsi/virtio-scsi.c             |   6 +-
 block/meson.build                 |   1 +
 8 files changed, 173 insertions(+), 41 deletions(-)
 create mode 100644 block/plug.c

diff --git a/MAINTAINERS b/MAINTAINERS
index XXXXXXX..XXXXXXX 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -XXX,XX +XXX,XX @@ F: util/aio-*.c
 F: util/aio-*.h
 F: util/fdmon-*.c
 F: block/io.c
+F: block/plug.c
 F: migration/block*
 F: include/block/aio.h
 F: include/block/aio-wait.h
diff --git a/include/sysemu/block-backend-io.h b/include/sysemu/block-backend-io.h
index XXXXXXX..XXXXXXX 100644
--- a/include/sysemu/block-backend-io.h
+++ b/include/sysemu/block-backend-io.h
@@ -XXX,XX +XXX,XX @@ void blk_iostatus_set_err(BlockBackend *blk, int error);
 int blk_get_max_iov(BlockBackend *blk);
 int blk_get_max_hw_iov(BlockBackend *blk);
 
-/*
- * blk_io_plug/unplug are thread-local operations. This means that multiple
- * IOThreads can simultaneously call plug/unplug, but the caller must ensure
- * that each unplug() is called in the same IOThread of the matching plug().
- */
-void coroutine_fn blk_co_io_plug(BlockBackend *blk);
-void co_wrapper blk_io_plug(BlockBackend *blk);
-
-void coroutine_fn blk_co_io_unplug(BlockBackend *blk);
-void co_wrapper blk_io_unplug(BlockBackend *blk);
+void blk_io_plug(void);
+void blk_io_unplug(void);
+void blk_io_plug_call(void (*fn)(void *), void *opaque);
 
 AioContext *blk_get_aio_context(BlockBackend *blk);
 BlockAcctStats *blk_get_stats(BlockBackend *blk);
diff --git a/block/block-backend.c b/block/block-backend.c
index XXXXXXX..XXXXXXX 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -XXX,XX +XXX,XX @@ void blk_add_insert_bs_notifier(BlockBackend *blk, Notifier *notify)
     notifier_list_add(&blk->insert_bs_notifiers, notify);
 }
 
-void coroutine_fn blk_co_io_plug(BlockBackend *blk)
-{
-    BlockDriverState *bs = blk_bs(blk);
-    IO_CODE();
-    GRAPH_RDLOCK_GUARD();
-
-    if (bs) {
-        bdrv_co_io_plug(bs);
-    }
-}
-
-void coroutine_fn blk_co_io_unplug(BlockBackend *blk)
-{
-    BlockDriverState *bs = blk_bs(blk);
-    IO_CODE();
-    GRAPH_RDLOCK_GUARD();
-
-    if (bs) {
-        bdrv_co_io_unplug(bs);
-    }
-}
-
 BlockAcctStats *blk_get_stats(BlockBackend *blk)
 {
     IO_CODE();
diff --git a/block/plug.c b/block/plug.c
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/block/plug.c
@@ -XXX,XX +XXX,XX @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Block I/O plugging
+ *
+ * Copyright Red Hat.
+ *
+ * This API defers a function call within a blk_io_plug()/blk_io_unplug()
+ * section, allowing multiple calls to batch up. This is a performance
+ * optimization that is used in the block layer to submit several I/O requests
+ * at once instead of individually:
+ *
+ *   blk_io_plug(); <-- start of plugged region
+ *   ...
+ *   blk_io_plug_call(my_func, my_obj); <-- deferred my_func(my_obj) call
+ *   blk_io_plug_call(my_func, my_obj); <-- another
+ *   blk_io_plug_call(my_func, my_obj); <-- another
+ *   ...
+ *   blk_io_unplug(); <-- end of plugged region, my_func(my_obj) is called once
+ *
+ * This code is actually generic and not tied to the block layer. If another
+ * subsystem needs this functionality, it could be renamed.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/coroutine-tls.h"
+#include "qemu/notify.h"
+#include "qemu/thread.h"
+#include "sysemu/block-backend.h"
+
+/* A function call that has been deferred until unplug() */
+typedef struct {
+    void (*fn)(void *);
+    void *opaque;
+} UnplugFn;
+
+/* Per-thread state */
+typedef struct {
+    unsigned count;       /* how many times has plug() been called? */
+    GArray *unplug_fns;   /* functions to call at unplug time */
+} Plug;
+
+/* Use get_ptr_plug() to fetch this thread-local value */
+QEMU_DEFINE_STATIC_CO_TLS(Plug, plug);
+
+/* Called at thread cleanup time */
+static void blk_io_plug_atexit(Notifier *n, void *value)
+{
+    Plug *plug = get_ptr_plug();
+    g_array_free(plug->unplug_fns, TRUE);
+}
+
+/* This won't involve coroutines, so use __thread */
+static __thread Notifier blk_io_plug_atexit_notifier;
+
+/**
+ * blk_io_plug_call:
+ * @fn: a function pointer to be invoked
+ * @opaque: a user-defined argument to @fn()
+ *
+ * Call @fn(@opaque) immediately if not within a blk_io_plug()/blk_io_unplug()
+ * section.
+ *
+ * Otherwise defer the call until the end of the outermost
+ * blk_io_plug()/blk_io_unplug() section in this thread. If the same
+ * @fn/@opaque pair has already been deferred, it will only be called once upon
+ * blk_io_unplug() so that accumulated calls are batched into a single call.
+ *
+ * The caller must ensure that @opaque is not freed before @fn() is invoked.
+ */
+void blk_io_plug_call(void (*fn)(void *), void *opaque)
+{
+    Plug *plug = get_ptr_plug();
+
+    /* Call immediately if we're not plugged */
+    if (plug->count == 0) {
+        fn(opaque);
+        return;
+    }
+
+    GArray *array = plug->unplug_fns;
+    if (!array) {
+        array = g_array_new(FALSE, FALSE, sizeof(UnplugFn));
+        plug->unplug_fns = array;
+        blk_io_plug_atexit_notifier.notify = blk_io_plug_atexit;
+        qemu_thread_atexit_add(&blk_io_plug_atexit_notifier);
+    }
+
+    UnplugFn *fns = (UnplugFn *)array->data;
+    UnplugFn new_fn = {
+        .fn = fn,
+        .opaque = opaque,
+    };
+
+    /*
+     * There won't be many, so do a linear search. If this becomes a bottleneck
+     * then a binary search (glib 2.62+) or different data structure could be
+     * used.
+     */
+    for (guint i = 0; i < array->len; i++) {
+        if (memcmp(&fns[i], &new_fn, sizeof(new_fn)) == 0) {
+            return; /* already exists */
+        }
+    }
+
+    g_array_append_val(array, new_fn);
+}
+
+/**
+ * blk_io_plug: Defer blk_io_plug_call() functions until blk_io_unplug()
+ *
+ * blk_io_plug/unplug are thread-local operations. This means that multiple
+ * threads can simultaneously call plug/unplug, but the caller must ensure that
+ * each unplug() is called in the same thread of the matching plug().
+ *
+ * Nesting is supported. blk_io_plug_call() functions are only called at the
+ * outermost blk_io_unplug().
+ */
+void blk_io_plug(void)
+{
+    Plug *plug = get_ptr_plug();
+
+    assert(plug->count < UINT32_MAX);
+
+    plug->count++;
+}
+
+/**
+ * blk_io_unplug: Run any pending blk_io_plug_call() functions
+ *
+ * There must have been a matching blk_io_plug() call in the same thread prior
+ * to this blk_io_unplug() call.
+ */
+void blk_io_unplug(void)
+{
+    Plug *plug = get_ptr_plug();
+
+    assert(plug->count > 0);
+
+    if (--plug->count > 0) {
+        return;
+    }
+
+    GArray *array = plug->unplug_fns;
+    if (!array) {
+        return;
+    }
+
+    UnplugFn *fns = (UnplugFn *)array->data;
+
+    for (guint i = 0; i < array->len; i++) {
+        fns[i].fn(fns[i].opaque);
+    }
+
+    /*
+     * This resets the array without freeing memory so that appending is cheap
+     * in the future.
+     */
+    g_array_set_size(array, 0);
+}
diff --git a/hw/block/dataplane/xen-block.c b/hw/block/dataplane/xen-block.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/block/dataplane/xen-block.c
+++ b/hw/block/dataplane/xen-block.c
@@ -XXX,XX +XXX,XX @@ static bool xen_block_handle_requests(XenBlockDataPlane *dataplane)
      * is below us.
      */
     if (inflight_atstart > IO_PLUG_THRESHOLD) {
-        blk_io_plug(dataplane->blk);
+        blk_io_plug();
     }
     while (rc != rp) {
         /* pull request from ring */
@@ -XXX,XX +XXX,XX @@ static bool xen_block_handle_requests(XenBlockDataPlane *dataplane)
 
         if (inflight_atstart > IO_PLUG_THRESHOLD &&
             batched >= inflight_atstart) {
-            blk_io_unplug(dataplane->blk);
+            blk_io_unplug();
         }
         xen_block_do_aio(request);
         if (inflight_atstart > IO_PLUG_THRESHOLD) {
             if (batched >= inflight_atstart) {
-                blk_io_plug(dataplane->blk);
+                blk_io_plug();
                 batched = 0;
             } else {
                 batched++;
@@ -XXX,XX +XXX,XX @@ static bool xen_block_handle_requests(XenBlockDataPlane *dataplane)
         }
     }
     if (inflight_atstart > IO_PLUG_THRESHOLD) {
-        blk_io_unplug(dataplane->blk);
+        blk_io_unplug();
     }
 
     return done_something;
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -XXX,XX +XXX,XX @@ void virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq)
     bool suppress_notifications = virtio_queue_get_notification(vq);
 
     aio_context_acquire(blk_get_aio_context(s->blk));
-    blk_io_plug(s->blk);
+    blk_io_plug();
 
     do {
         if (suppress_notifications) {
@@ -XXX,XX +XXX,XX @@ void virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq)
         virtio_blk_submit_multireq(s, &mrb);
     }
 
-    blk_io_unplug(s->blk);
+    blk_io_unplug();
     aio_context_release(blk_get_aio_context(s->blk));
 }
 
diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -XXX,XX +XXX,XX @@ static int virtio_scsi_handle_cmd_req_prepare(VirtIOSCSI *s, VirtIOSCSIReq *req)
         return -ENOBUFS;
     }
     scsi_req_ref(req->sreq);
-    blk_io_plug(d->conf.blk);
+    blk_io_plug();
     object_unref(OBJECT(d));
     return 0;
 }
 
@@ -XXX,XX +XXX,XX @@ static void virtio_scsi_handle_cmd_req_submit(VirtIOSCSI *s, VirtIOSCSIReq *req)
     if (scsi_req_enqueue(sreq)) {
         scsi_req_continue(sreq);
     }
-    blk_io_unplug(sreq->dev->conf.blk);
+    blk_io_unplug();
     scsi_req_unref(sreq);
 }
 
@@ -XXX,XX +XXX,XX @@ static void virtio_scsi_handle_cmd_vq(VirtIOSCSI *s, VirtQueue *vq)
     while (!QTAILQ_EMPTY(&reqs)) {
         req = QTAILQ_FIRST(&reqs);
         QTAILQ_REMOVE(&reqs, req, next);
-        blk_io_unplug(req->sreq->dev->conf.blk);
+        blk_io_unplug();
         scsi_req_unref(req->sreq);
         virtqueue_detach_element(req->vq, &req->elem, 0);
         virtio_scsi_free_req(req);
diff --git a/block/meson.build b/block/meson.build
index XXXXXXX..XXXXXXX 100644
--- a/block/meson.build
+++ b/block/meson.build
@@ -XXX,XX +XXX,XX @@ block_ss.add(files(
   'mirror.c',
   'nbd.c',
   'null.c',
+  'plug.c',
   'qapi.c',
   'qcow2-bitmap.c',
   'qcow2-cache.c',
--
2.40.1

block/nvme: convert to blk_io_plug_call() API

Stop using the .bdrv_co_io_plug() API because it is not multi-queue
block layer friendly. Use the new blk_io_plug_call() API to batch I/O
submission instead.
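
In outline, the conversion moves the doorbell kick and completion
processing out of the submission path and into a deferred callback
(condensed from the diff below; see nvme_unplug_fn()):

    /* before: kick the queue on every submission */
    q->need_kick++;
    nvme_kick(q);
    nvme_process_completion(q);

    /* after: defer a single kick per queue pair to the outermost
     * blk_io_unplug() in this thread */
    q->need_kick++;
    blk_io_plug_call(nvme_unplug_fn, q);
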
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 20230530180959.1108766-3-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/nvme.c       | 44 ++++++++++++--------------------------------
 block/trace-events |  1 -
 2 files changed, 12 insertions(+), 33 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index XXXXXXX..XXXXXXX 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -XXX,XX +XXX,XX @@
 #include "qemu/vfio-helpers.h"
 #include "block/block-io.h"
 #include "block/block_int.h"
+#include "sysemu/block-backend.h"
 #include "sysemu/replay.h"
 #include "trace.h"
 
@@ -XXX,XX +XXX,XX @@ struct BDRVNVMeState {
     int blkshift;
 
     uint64_t max_transfer;
-    bool plugged;
 
     bool supports_write_zeroes;
     bool supports_discard;
@@ -XXX,XX +XXX,XX @@ static void nvme_kick(NVMeQueuePair *q)
 {
     BDRVNVMeState *s = q->s;
 
-    if (s->plugged || !q->need_kick) {
+    if (!q->need_kick) {
         return;
     }
     trace_nvme_kick(s, q->index);
@@ -XXX,XX +XXX,XX @@ static bool nvme_process_completion(NVMeQueuePair *q)
     NvmeCqe *c;
 
     trace_nvme_process_completion(s, q->index, q->inflight);
-    if (s->plugged) {
-        trace_nvme_process_completion_queue_plugged(s, q->index);
-        return false;
-    }
 
     /*
      * Support re-entrancy when a request cb() function invokes aio_poll().
@@ -XXX,XX +XXX,XX @@ static void nvme_trace_command(const NvmeCmd *cmd)
     }
 }
 
+static void nvme_unplug_fn(void *opaque)
+{
+    NVMeQueuePair *q = opaque;
+
+    QEMU_LOCK_GUARD(&q->lock);
+    nvme_kick(q);
+    nvme_process_completion(q);
+}
+
 static void nvme_submit_command(NVMeQueuePair *q, NVMeRequest *req,
                                 NvmeCmd *cmd, BlockCompletionFunc cb,
                                 void *opaque)
@@ -XXX,XX +XXX,XX @@ static void nvme_submit_command(NVMeQueuePair *q, NVMeRequest *req,
            q->sq.tail * NVME_SQ_ENTRY_BYTES, cmd, sizeof(*cmd));
     q->sq.tail = (q->sq.tail + 1) % NVME_QUEUE_SIZE;
     q->need_kick++;
-    nvme_kick(q);
-    nvme_process_completion(q);
+    blk_io_plug_call(nvme_unplug_fn, q);
     qemu_mutex_unlock(&q->lock);
 }
 
@@ -XXX,XX +XXX,XX @@ static void nvme_attach_aio_context(BlockDriverState *bs,
     }
 }
 
-static void coroutine_fn nvme_co_io_plug(BlockDriverState *bs)
-{
-    BDRVNVMeState *s = bs->opaque;
-    assert(!s->plugged);
-    s->plugged = true;
-}
-
-static void coroutine_fn nvme_co_io_unplug(BlockDriverState *bs)
-{
-    BDRVNVMeState *s = bs->opaque;
-    assert(s->plugged);
-    s->plugged = false;
-    for (unsigned i = INDEX_IO(0); i < s->queue_count; i++) {
-        NVMeQueuePair *q = s->queues[i];
-        qemu_mutex_lock(&q->lock);
-        nvme_kick(q);
-        nvme_process_completion(q);
-        qemu_mutex_unlock(&q->lock);
-    }
-}
-
 static bool nvme_register_buf(BlockDriverState *bs, void *host, size_t size,
                               Error **errp)
 {
@@ -XXX,XX +XXX,XX @@ static BlockDriver bdrv_nvme = {
     .bdrv_detach_aio_context = nvme_detach_aio_context,
     .bdrv_attach_aio_context = nvme_attach_aio_context,
 
-    .bdrv_co_io_plug = nvme_co_io_plug,
-    .bdrv_co_io_unplug = nvme_co_io_unplug,
-
     .bdrv_register_buf = nvme_register_buf,
     .bdrv_unregister_buf = nvme_unregister_buf,
 };
diff --git a/block/trace-events b/block/trace-events
index XXXXXXX..XXXXXXX 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -XXX,XX +XXX,XX @@ nvme_kick(void *s, unsigned q_index) "s %p q #%u"
 nvme_dma_flush_queue_wait(void *s) "s %p"
 nvme_error(int cmd_specific, int sq_head, int sqid, int cid, int status) "cmd_specific %d sq_head %d sqid %d cid %d status 0x%x"
 nvme_process_completion(void *s, unsigned q_index, int inflight) "s %p q #%u inflight %d"
-nvme_process_completion_queue_plugged(void *s, unsigned q_index) "s %p q #%u"
 nvme_complete_command(void *s, unsigned q_index, int cid) "s %p q #%u cid %d"
 nvme_submit_command(void *s, unsigned q_index, int cid) "s %p q #%u cid %d"
 nvme_submit_command_raw(int c0, int c1, int c2, int c3, int c4, int c5, int c6, int c7) "%02x %02x %02x %02x %02x %02x %02x %02x"
--
2.40.1

block/blkio: convert to blk_io_plug_call() API

Stop using the .bdrv_co_io_plug() API because it is not multi-queue
block layer friendly. Use the new blk_io_plug_call() API to batch I/O
submission instead.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 20230530180959.1108766-4-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/blkio.c | 43 ++++++++++++++++++++++++-------------------
 1 file changed, 24 insertions(+), 19 deletions(-)

diff --git a/block/blkio.c b/block/blkio.c
index XXXXXXX..XXXXXXX 100644
--- a/block/blkio.c
+++ b/block/blkio.c
@@ -XXX,XX +XXX,XX @@
 #include "qemu/error-report.h"
 #include "qapi/qmp/qdict.h"
 #include "qemu/module.h"
+#include "sysemu/block-backend.h"
 #include "exec/memory.h" /* for ram_block_discard_disable() */
 
 #include "block/block-io.h"
@@ -XXX,XX +XXX,XX @@ static void blkio_detach_aio_context(BlockDriverState *bs)
                        NULL, NULL, NULL);
 }
 
-/* Call with s->blkio_lock held to submit I/O after enqueuing a new request */
-static void blkio_submit_io(BlockDriverState *bs)
+/*
+ * Called by blk_io_unplug() or immediately if not plugged. Called without
+ * blkio_lock.
+ */
+static void blkio_unplug_fn(void *opaque)
 {
-    if (qatomic_read(&bs->io_plugged) == 0) {
-        BDRVBlkioState *s = bs->opaque;
+    BDRVBlkioState *s = opaque;
 
+    WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
         blkioq_do_io(s->blkioq, NULL, 0, 0, NULL);
     }
 }
 
+/*
+ * Schedule I/O submission after enqueuing a new request. Called without
+ * blkio_lock.
+ */
+static void blkio_submit_io(BlockDriverState *bs)
+{
+    BDRVBlkioState *s = bs->opaque;
+
+    blk_io_plug_call(blkio_unplug_fn, s);
+}
+
 static int coroutine_fn
 blkio_co_pdiscard(BlockDriverState *bs, int64_t offset, int64_t bytes)
 {
@@ -XXX,XX +XXX,XX @@ blkio_co_pdiscard(BlockDriverState *bs, int64_t offset, int64_t bytes)
 
     WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
         blkioq_discard(s->blkioq, offset, bytes, &cod, 0);
-        blkio_submit_io(bs);
     }
 
+    blkio_submit_io(bs);
     qemu_coroutine_yield();
     return cod.ret;
 }
@@ -XXX,XX +XXX,XX @@ blkio_co_preadv(BlockDriverState *bs, int64_t offset, int64_t bytes,
 
     WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
         blkioq_readv(s->blkioq, offset, iov, iovcnt, &cod, 0);
-        blkio_submit_io(bs);
     }
 
+    blkio_submit_io(bs);
     qemu_coroutine_yield();
 
     if (use_bounce_buffer) {
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn blkio_co_pwritev(BlockDriverState *bs, int64_t offset,
 
     WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
         blkioq_writev(s->blkioq, offset, iov, iovcnt, &cod, blkio_flags);
-        blkio_submit_io(bs);
     }
 
+    blkio_submit_io(bs);
     qemu_coroutine_yield();
 
     if (use_bounce_buffer) {
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn blkio_co_flush(BlockDriverState *bs)
 
     WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
         blkioq_flush(s->blkioq, &cod, 0);
-        blkio_submit_io(bs);
     }
 
+    blkio_submit_io(bs);
     qemu_coroutine_yield();
     return cod.ret;
 }
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn blkio_co_pwrite_zeroes(BlockDriverState *bs,
 
     WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
         blkioq_write_zeroes(s->blkioq, offset, bytes, &cod, blkio_flags);
-        blkio_submit_io(bs);
     }
 
+    blkio_submit_io(bs);
     qemu_coroutine_yield();
     return cod.ret;
 }
 
-static void coroutine_fn blkio_co_io_unplug(BlockDriverState *bs)
-{
-    BDRVBlkioState *s = bs->opaque;
-
-    WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
-        blkio_submit_io(bs);
-    }
-}
-
 typedef enum {
     BMRR_OK,
     BMRR_SKIP,
@@ -XXX,XX +XXX,XX @@ static void blkio_refresh_limits(BlockDriverState *bs, Error **errp)
     .bdrv_co_pwritev = blkio_co_pwritev, \
     .bdrv_co_flush_to_disk = blkio_co_flush, \
     .bdrv_co_pwrite_zeroes = blkio_co_pwrite_zeroes, \
-    .bdrv_co_io_unplug = blkio_co_io_unplug, \
     .bdrv_refresh_limits = blkio_refresh_limits, \
     .bdrv_register_buf = blkio_register_buf, \
     .bdrv_unregister_buf = blkio_unregister_buf, \
--
2.40.1

block/io_uring: convert to blk_io_plug_call() API

Stop using the .bdrv_co_io_plug() API because it is not multi-queue
block layer friendly. Use the new blk_io_plug_call() API to batch I/O
submission instead.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 20230530180959.1108766-5-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/block/raw-aio.h |  7 -------
 block/file-posix.c      | 10 ----------
 block/io_uring.c        | 44 ++++++++++++++++-------------------------
 block/trace-events      |  5 ++---
 4 files changed, 19 insertions(+), 47 deletions(-)

diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/raw-aio.h
+++ b/include/block/raw-aio.h
@@ -XXX,XX +XXX,XX @@ int coroutine_fn luring_co_submit(BlockDriverState *bs, int fd, uint64_t offset,
                                   QEMUIOVector *qiov, int type);
 void luring_detach_aio_context(LuringState *s, AioContext *old_context);
 void luring_attach_aio_context(LuringState *s, AioContext *new_context);
-
-/*
- * luring_io_plug/unplug work in the thread's current AioContext, therefore the
- * caller must ensure that they are paired in the same IOThread.
- */
-void luring_io_plug(void);
-void luring_io_unplug(void);
 #endif
 
 #ifdef _WIN32
diff --git a/block/file-posix.c b/block/file-posix.c
index XXXXXXX..XXXXXXX 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn raw_co_io_plug(BlockDriverState *bs)
         laio_io_plug();
     }
 #endif
-#ifdef CONFIG_LINUX_IO_URING
-    if (s->use_linux_io_uring) {
-        luring_io_plug();
-    }
-#endif
 }
 
 static void coroutine_fn raw_co_io_unplug(BlockDriverState *bs)
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn raw_co_io_unplug(BlockDriverState *bs)
         laio_io_unplug(s->aio_max_batch);
     }
 #endif
-#ifdef CONFIG_LINUX_IO_URING
-    if (s->use_linux_io_uring) {
-        luring_io_unplug();
-    }
-#endif
 }
 
 static int coroutine_fn raw_co_flush_to_disk(BlockDriverState *bs)
diff --git a/block/io_uring.c b/block/io_uring.c
index XXXXXXX..XXXXXXX 100644
--- a/block/io_uring.c
+++ b/block/io_uring.c
@@ -XXX,XX +XXX,XX @@
 #include "block/raw-aio.h"
 #include "qemu/coroutine.h"
 #include "qapi/error.h"
+#include "sysemu/block-backend.h"
 #include "trace.h"
 
 /* Only used for assertions. */
@@ -XXX,XX +XXX,XX @@ typedef struct LuringAIOCB {
 } LuringAIOCB;
 
 typedef struct LuringQueue {
-    int plugged;
     unsigned int in_queue;
     unsigned int in_flight;
     bool blocked;
@@ -XXX,XX +XXX,XX @@ static void luring_process_completions_and_submit(LuringState *s)
 {
     luring_process_completions(s);
 
-    if (!s->io_q.plugged && s->io_q.in_queue > 0) {
+    if (s->io_q.in_queue > 0) {
         ioq_submit(s);
     }
 }
@@ -XXX,XX +XXX,XX @@ static void qemu_luring_poll_ready(void *opaque)
 static void ioq_init(LuringQueue *io_q)
 {
     QSIMPLEQ_INIT(&io_q->submit_queue);
-    io_q->plugged = 0;
     io_q->in_queue = 0;
     io_q->in_flight = 0;
     io_q->blocked = false;
 }
 
-void luring_io_plug(void)
+static void luring_unplug_fn(void *opaque)
 {
-    AioContext *ctx = qemu_get_current_aio_context();
-    LuringState *s = aio_get_linux_io_uring(ctx);
-    trace_luring_io_plug(s);
-    s->io_q.plugged++;
-}
-
-void luring_io_unplug(void)
-{
-    AioContext *ctx = qemu_get_current_aio_context();
-    LuringState *s = aio_get_linux_io_uring(ctx);
-    assert(s->io_q.plugged);
-    trace_luring_io_unplug(s, s->io_q.blocked, s->io_q.plugged,
-                           s->io_q.in_queue, s->io_q.in_flight);
-    if (--s->io_q.plugged == 0 &&
-        !s->io_q.blocked && s->io_q.in_queue > 0) {
+    LuringState *s = opaque;
+    trace_luring_unplug_fn(s, s->io_q.blocked, s->io_q.in_queue,
+                           s->io_q.in_flight);
+    if (!s->io_q.blocked && s->io_q.in_queue > 0) {
         ioq_submit(s);
     }
 }
@@ -XXX,XX +XXX,XX @@ static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s,
 
     QSIMPLEQ_INSERT_TAIL(&s->io_q.submit_queue, luringcb, next);
     s->io_q.in_queue++;
-    trace_luring_do_submit(s, s->io_q.blocked, s->io_q.plugged,
-                           s->io_q.in_queue, s->io_q.in_flight);
-    if (!s->io_q.blocked &&
-        (!s->io_q.plugged ||
-         s->io_q.in_flight + s->io_q.in_queue >= MAX_ENTRIES)) {
-        ret = ioq_submit(s);
-        trace_luring_do_submit_done(s, ret);
-        return ret;
+    trace_luring_do_submit(s, s->io_q.blocked, s->io_q.in_queue,
+                           s->io_q.in_flight);
+    if (!s->io_q.blocked) {
+        if (s->io_q.in_flight + s->io_q.in_queue >= MAX_ENTRIES) {
+            ret = ioq_submit(s);
+            trace_luring_do_submit_done(s, ret);
+            return ret;
+        }
+
+        blk_io_plug_call(luring_unplug_fn, s);
     }
     return 0;
 }
diff --git a/block/trace-events b/block/trace-events
index XXXXXXX..XXXXXXX 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -XXX,XX +XXX,XX @@ file_paio_submit(void *acb, void *opaque, int64_t offset, int count, int type) "
 # io_uring.c
 luring_init_state(void *s, size_t size) "s %p size %zu"
 luring_cleanup_state(void *s) "%p freed"
-luring_io_plug(void *s) "LuringState %p plug"
-luring_io_unplug(void *s, int blocked, int plugged, int queued, int inflight) "LuringState %p blocked %d plugged %d queued %d inflight %d"
-luring_do_submit(void *s, int blocked, int plugged, int queued, int inflight) "LuringState %p blocked %d plugged %d queued %d inflight %d"
+luring_unplug_fn(void *s, int blocked, int queued, int inflight) "LuringState %p blocked %d queued %d inflight %d"
+luring_do_submit(void *s, int blocked, int queued, int inflight) "LuringState %p blocked %d queued %d inflight %d"
 luring_do_submit_done(void *s, int ret) "LuringState %p submitted to kernel %d"
 luring_co_submit(void *bs, void *s, void *luringcb, int fd, uint64_t offset, size_t nbytes, int type) "bs %p s %p luringcb %p fd %d offset %" PRId64 " nbytes %zd type %d"
 luring_process_completion(void *s, void *aiocb, int ret) "LuringState %p luringcb %p ret %d"
--
2.40.1

block/linux-aio: convert to blk_io_plug_call() API

Stop using the .bdrv_co_io_plug() API because it is not multi-queue
block layer friendly. Use the new blk_io_plug_call() API to batch I/O
submission instead.

Note that a dev_max_batch check is dropped in laio_io_unplug() because
the semantics of unplug_fn() are different from .bdrv_co_unplug():
1. unplug_fn() is only called when the last blk_io_unplug() call occurs,
   not every time blk_io_unplug() is called.
2. unplug_fn() is per-thread, not per-BlockDriverState, so there is no
   way to get per-BlockDriverState fields like dev_max_batch.

Therefore this condition cannot be moved to laio_unplug_fn(). It is not
obvious that this condition affects performance in practice, so I am
removing it instead of trying to come up with a more complex mechanism
to preserve the condition. The resulting submission logic is sketched
below.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
18
Reviewed-by: Eric Blake <eblake@redhat.com>
19
Acked-by: Kevin Wolf <kwolf@redhat.com>
20
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
21
Message-id: 20230530180959.1108766-6-stefanha@redhat.com
22
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
23
---
24
include/block/raw-aio.h | 7 -------
25
block/file-posix.c | 28 ----------------------------
26
block/linux-aio.c | 41 +++++++++++------------------------------
27
3 files changed, 11 insertions(+), 65 deletions(-)
28
29
diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
30
index XXXXXXX..XXXXXXX 100644
31
--- a/include/block/raw-aio.h
32
+++ b/include/block/raw-aio.h
33
@@ -XXX,XX +XXX,XX @@ int coroutine_fn laio_co_submit(int fd, uint64_t offset, QEMUIOVector *qiov,
34
35
void laio_detach_aio_context(LinuxAioState *s, AioContext *old_context);
36
void laio_attach_aio_context(LinuxAioState *s, AioContext *new_context);
37
-
38
-/*
39
- * laio_io_plug/unplug work in the thread's current AioContext, therefore the
40
- * caller must ensure that they are paired in the same IOThread.
41
- */
42
-void laio_io_plug(void);
43
-void laio_io_unplug(uint64_t dev_max_batch);
44
#endif
45
/* io_uring.c - Linux io_uring implementation */
46
#ifdef CONFIG_LINUX_IO_URING
47
diff --git a/block/file-posix.c b/block/file-posix.c
48
index XXXXXXX..XXXXXXX 100644
49
--- a/block/file-posix.c
50
+++ b/block/file-posix.c
51
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn raw_co_pwritev(BlockDriverState *bs, int64_t offset,
52
return raw_co_prw(bs, offset, bytes, qiov, QEMU_AIO_WRITE);
53
}
54
55
-static void coroutine_fn raw_co_io_plug(BlockDriverState *bs)
56
-{
57
- BDRVRawState __attribute__((unused)) *s = bs->opaque;
58
-#ifdef CONFIG_LINUX_AIO
59
- if (s->use_linux_aio) {
60
- laio_io_plug();
61
- }
62
-#endif
63
-}
64
-
65
-static void coroutine_fn raw_co_io_unplug(BlockDriverState *bs)
66
-{
67
- BDRVRawState __attribute__((unused)) *s = bs->opaque;
68
-#ifdef CONFIG_LINUX_AIO
69
- if (s->use_linux_aio) {
70
- laio_io_unplug(s->aio_max_batch);
71
- }
72
-#endif
73
-}
74
-
75
static int coroutine_fn raw_co_flush_to_disk(BlockDriverState *bs)
76
{
77
BDRVRawState *s = bs->opaque;
78
@@ -XXX,XX +XXX,XX @@ BlockDriver bdrv_file = {
79
.bdrv_co_copy_range_from = raw_co_copy_range_from,
80
.bdrv_co_copy_range_to = raw_co_copy_range_to,
81
.bdrv_refresh_limits = raw_refresh_limits,
82
- .bdrv_co_io_plug = raw_co_io_plug,
83
- .bdrv_co_io_unplug = raw_co_io_unplug,
84
.bdrv_attach_aio_context = raw_aio_attach_aio_context,
85
86
.bdrv_co_truncate = raw_co_truncate,
87
@@ -XXX,XX +XXX,XX @@ static BlockDriver bdrv_host_device = {
88
.bdrv_co_copy_range_from = raw_co_copy_range_from,
89
.bdrv_co_copy_range_to = raw_co_copy_range_to,
90
.bdrv_refresh_limits = raw_refresh_limits,
91
- .bdrv_co_io_plug = raw_co_io_plug,
92
- .bdrv_co_io_unplug = raw_co_io_unplug,
93
.bdrv_attach_aio_context = raw_aio_attach_aio_context,
94
95
.bdrv_co_truncate = raw_co_truncate,
96
@@ -XXX,XX +XXX,XX @@ static BlockDriver bdrv_host_cdrom = {
97
.bdrv_co_pwritev = raw_co_pwritev,
98
.bdrv_co_flush_to_disk = raw_co_flush_to_disk,
99
.bdrv_refresh_limits = cdrom_refresh_limits,
100
- .bdrv_co_io_plug = raw_co_io_plug,
101
- .bdrv_co_io_unplug = raw_co_io_unplug,
102
.bdrv_attach_aio_context = raw_aio_attach_aio_context,
103
104
.bdrv_co_truncate = raw_co_truncate,
105
@@ -XXX,XX +XXX,XX @@ static BlockDriver bdrv_host_cdrom = {
106
.bdrv_co_pwritev = raw_co_pwritev,
107
.bdrv_co_flush_to_disk = raw_co_flush_to_disk,
108
.bdrv_refresh_limits = cdrom_refresh_limits,
109
- .bdrv_co_io_plug = raw_co_io_plug,
110
- .bdrv_co_io_unplug = raw_co_io_unplug,
111
.bdrv_attach_aio_context = raw_aio_attach_aio_context,
112
113
.bdrv_co_truncate = raw_co_truncate,
114
diff --git a/block/linux-aio.c b/block/linux-aio.c
115
index XXXXXXX..XXXXXXX 100644
116
--- a/block/linux-aio.c
117
+++ b/block/linux-aio.c
118
@@ -XXX,XX +XXX,XX @@
119
#include "qemu/event_notifier.h"
120
#include "qemu/coroutine.h"
121
#include "qapi/error.h"
122
+#include "sysemu/block-backend.h"
123
124
/* Only used for assertions. */
125
#include "qemu/coroutine_int.h"
126
@@ -XXX,XX +XXX,XX @@ struct qemu_laiocb {
127
};
128
129
typedef struct {
130
- int plugged;
131
unsigned int in_queue;
132
unsigned int in_flight;
133
bool blocked;
134
@@ -XXX,XX +XXX,XX @@ static void qemu_laio_process_completions_and_submit(LinuxAioState *s)
135
{
136
qemu_laio_process_completions(s);
137
138
- if (!s->io_q.plugged && !QSIMPLEQ_EMPTY(&s->io_q.pending)) {
139
+ if (!QSIMPLEQ_EMPTY(&s->io_q.pending)) {
140
ioq_submit(s);
141
}
142
}
143
@@ -XXX,XX +XXX,XX @@ static void qemu_laio_poll_ready(EventNotifier *opaque)
144
static void ioq_init(LaioQueue *io_q)
145
{
146
QSIMPLEQ_INIT(&io_q->pending);
147
- io_q->plugged = 0;
148
io_q->in_queue = 0;
149
io_q->in_flight = 0;
150
io_q->blocked = false;
151
@@ -XXX,XX +XXX,XX @@ static uint64_t laio_max_batch(LinuxAioState *s, uint64_t dev_max_batch)
152
return max_batch;
153
}
154
155
-void laio_io_plug(void)
156
+static void laio_unplug_fn(void *opaque)
157
{
158
- AioContext *ctx = qemu_get_current_aio_context();
159
- LinuxAioState *s = aio_get_linux_aio(ctx);
160
+ LinuxAioState *s = opaque;
161
162
- s->io_q.plugged++;
163
-}
164
-
165
-void laio_io_unplug(uint64_t dev_max_batch)
166
-{
167
- AioContext *ctx = qemu_get_current_aio_context();
168
- LinuxAioState *s = aio_get_linux_aio(ctx);
169
-
170
- assert(s->io_q.plugged);
171
- s->io_q.plugged--;
172
-
173
- /*
174
- * Why max batch checking is performed here:
175
- * Another BDS may have queued requests with a higher dev_max_batch and
176
- * therefore in_queue could now exceed our dev_max_batch. Re-check the max
177
- * batch so we can honor our device's dev_max_batch.
178
- */
179
- if (s->io_q.in_queue >= laio_max_batch(s, dev_max_batch) ||
180
- (!s->io_q.plugged &&
181
- !s->io_q.blocked && !QSIMPLEQ_EMPTY(&s->io_q.pending))) {
182
+ if (!s->io_q.blocked && !QSIMPLEQ_EMPTY(&s->io_q.pending)) {
183
ioq_submit(s);
184
}
185
}
186
@@ -XXX,XX +XXX,XX @@ static int laio_do_submit(int fd, struct qemu_laiocb *laiocb, off_t offset,
187
188
QSIMPLEQ_INSERT_TAIL(&s->io_q.pending, laiocb, next);
189
s->io_q.in_queue++;
190
- if (!s->io_q.blocked &&
191
- (!s->io_q.plugged ||
192
- s->io_q.in_queue >= laio_max_batch(s, dev_max_batch))) {
193
- ioq_submit(s);
194
+ if (!s->io_q.blocked) {
195
+ if (s->io_q.in_queue >= laio_max_batch(s, dev_max_batch)) {
196
+ ioq_submit(s);
197
+ } else {
198
+ blk_io_plug_call(laio_unplug_fn, s);
199
+ }
200
}
201
202
return 0;
203
--
204
2.40.1
diff view generated by jsdifflib
From: David Edmondson <david.edmondson@oracle.com>

When taking the slow path for mutex acquisition, set the coroutine
value in the CoWaitRecord in push_waiter(), rather than both there and
in the caller.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: David Edmondson <david.edmondson@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 20210325112941.365238-4-pbonzini@redhat.com
Message-Id: <20210309144015.557477-4-david.edmondson@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 util/qemu-coroutine-lock.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/util/qemu-coroutine-lock.c b/util/qemu-coroutine-lock.c
index XXXXXXX..XXXXXXX 100644
--- a/util/qemu-coroutine-lock.c
+++ b/util/qemu-coroutine-lock.c
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn qemu_co_mutex_lock_slowpath(AioContext *ctx,
     unsigned old_handoff;
 
     trace_qemu_co_mutex_lock_entry(mutex, self);
-    w.co = self;
     push_waiter(mutex, &w);
 
     /* This is the "Responsibility Hand-Off" protocol; a lock() picks from
--
2.30.2

No block driver implements .bdrv_co_io_plug() anymore. Get rid of the
function pointers.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 20230530180959.1108766-7-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/block/block-io.h         |  3 ---
 include/block/block_int-common.h | 11 ----------
 block/io.c                       | 37 --------------------------------
 3 files changed, 51 deletions(-)

diff --git a/include/block/block-io.h b/include/block/block-io.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -XXX,XX +XXX,XX @@ void coroutine_fn bdrv_co_leave(BlockDriverState *bs, AioContext *old_ctx);
 
 AioContext *child_of_bds_get_parent_aio_context(BdrvChild *c);
 
-void coroutine_fn GRAPH_RDLOCK bdrv_co_io_plug(BlockDriverState *bs);
-void coroutine_fn GRAPH_RDLOCK bdrv_co_io_unplug(BlockDriverState *bs);
-
 bool coroutine_fn GRAPH_RDLOCK
 bdrv_co_can_store_new_dirty_bitmap(BlockDriverState *bs, const char *name,
                                    uint32_t granularity, Error **errp);
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -XXX,XX +XXX,XX @@ struct BlockDriver {
     void coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_co_debug_event)(
         BlockDriverState *bs, BlkdebugEvent event);
 
-    /* io queue for linux-aio */
-    void coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_co_io_plug)(BlockDriverState *bs);
-    void coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_co_io_unplug)(
-        BlockDriverState *bs);
-
     bool (*bdrv_supports_persistent_dirty_bitmap)(BlockDriverState *bs);
 
     bool coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_co_can_store_new_dirty_bitmap)(
@@ -XXX,XX +XXX,XX @@ struct BlockDriverState {
     unsigned int in_flight;
     unsigned int serialising_in_flight;
 
-    /*
-     * counter for nested bdrv_io_plug.
-     * Accessed with atomic ops.
-     */
-    unsigned io_plugged;
-
     /* do we need to tell the quest if we have a volatile write cache? */
     int enable_write_cache;
 
diff --git a/block/io.c b/block/io.c
index XXXXXXX..XXXXXXX 100644
--- a/block/io.c
+++ b/block/io.c
@@ -XXX,XX +XXX,XX @@ void *qemu_try_blockalign0(BlockDriverState *bs, size_t size)
     return mem;
 }
 
-void coroutine_fn bdrv_co_io_plug(BlockDriverState *bs)
-{
-    BdrvChild *child;
-    IO_CODE();
-    assert_bdrv_graph_readable();
-
-    QLIST_FOREACH(child, &bs->children, next) {
-        bdrv_co_io_plug(child->bs);
-    }
-
-    if (qatomic_fetch_inc(&bs->io_plugged) == 0) {
-        BlockDriver *drv = bs->drv;
-        if (drv && drv->bdrv_co_io_plug) {
-            drv->bdrv_co_io_plug(bs);
-        }
-    }
-}
-
-void coroutine_fn bdrv_co_io_unplug(BlockDriverState *bs)
-{
-    BdrvChild *child;
-    IO_CODE();
-    assert_bdrv_graph_readable();
-
-    assert(bs->io_plugged);
-    if (qatomic_fetch_dec(&bs->io_plugged) == 1) {
-        BlockDriver *drv = bs->drv;
-        if (drv && drv->bdrv_co_io_unplug) {
-            drv->bdrv_co_io_unplug(bs);
-        }
-    }
-
-    QLIST_FOREACH(child, &bs->children, next) {
-        bdrv_co_io_unplug(child->bs);
-    }
-}
-
 /* Helper that undoes bdrv_register_buf() when it fails partway through */
 static void GRAPH_RDLOCK
 bdrv_register_buf_rollback(BlockDriverState *bs, void *host, size_t size,
--
2.40.1
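With the per-driver plug hooks gone, plugging is purely a bracket around
request submission on the current thread. A sketch of the caller side
under that assumption; the Request type and process_request() are
hypothetical stand-ins for a device model's request handling:

#include "sysemu/block-backend-io.h"   /* blk_io_plug(), blk_io_unplug() */

typedef struct Request Request;           /* hypothetical request type */
extern void process_request(Request *r);  /* hypothetical handler; it may
                                              defer work internally with
                                              blk_io_plug_call() */

static void handle_batch(Request **reqs, int num_reqs)
{
    blk_io_plug();      /* open a plug section for the current thread */

    for (int i = 0; i < num_reqs; i++) {
        process_request(reqs[i]);
    }

    blk_io_unplug();    /* deferred unplug functions run here, once each */
}

Since the plug state is per-thread rather than per-BlockDriverState, no
graph traversal or atomic io_plugged counter is needed; that is exactly
what the deletions above buy.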
From: David Edmondson <david.edmondson@oracle.com>

If a new bitmap entry is allocated, requiring the entire block to be
written, avoid leaking the buffer allocated for the block should
the write fail.

Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: David Edmondson <david.edmondson@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Max Reitz <mreitz@redhat.com>
Message-id: 20210325112941.365238-2-pbonzini@redhat.com
Message-Id: <20210309144015.557477-2-david.edmondson@oracle.com>
Acked-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/vdi.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/vdi.c b/block/vdi.c
index XXXXXXX..XXXXXXX 100644
--- a/block/vdi.c
+++ b/block/vdi.c
@@ -XXX,XX +XXX,XX @@ nonallocating_write:
 
     logout("finished data write\n");
     if (ret < 0) {
+        g_free(block);
         return ret;
     }
--
2.30.2

From: Stefano Garzarella <sgarzare@redhat.com>

Some virtio-blk drivers (e.g. virtio-blk-vhost-vdpa) support fd
passing. Let's expose this to the user, so the management layer
can pass the file descriptor of an already opened path.

If the libblkio virtio-blk driver supports fd passing, let's always
use qemu_open() to open the `path`, so we can handle fd passing
from the management layer through the "/dev/fdset/N" special path.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Message-id: 20230530071941.8954-2-sgarzare@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/blkio.c | 53 ++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 44 insertions(+), 9 deletions(-)

diff --git a/block/blkio.c b/block/blkio.c
index XXXXXXX..XXXXXXX 100644
--- a/block/blkio.c
+++ b/block/blkio.c
@@ -XXX,XX +XXX,XX @@ static int blkio_virtio_blk_common_open(BlockDriverState *bs,
 {
     const char *path = qdict_get_try_str(options, "path");
     BDRVBlkioState *s = bs->opaque;
-    int ret;
+    bool fd_supported = false;
+    int fd, ret;
 
     if (!path) {
         error_setg(errp, "missing 'path' option");
         return -EINVAL;
     }
 
-    ret = blkio_set_str(s->blkio, "path", path);
-    qdict_del(options, "path");
-    if (ret < 0) {
-        error_setg_errno(errp, -ret, "failed to set path: %s",
-                         blkio_get_error_msg());
-        return ret;
-    }
-
     if (!(flags & BDRV_O_NOCACHE)) {
         error_setg(errp, "cache.direct=off is not supported");
         return -EINVAL;
     }
+
+    if (blkio_get_int(s->blkio, "fd", &fd) == 0) {
+        fd_supported = true;
+    }
+
+    /*
+     * If the libblkio driver supports fd passing, let's always use qemu_open()
+     * to open the `path`, so we can handle fd passing from the management
+     * layer through the "/dev/fdset/N" special path.
+     */
+    if (fd_supported) {
+        int open_flags;
+
+        if (flags & BDRV_O_RDWR) {
+            open_flags = O_RDWR;
+        } else {
+            open_flags = O_RDONLY;
+        }
+
+        fd = qemu_open(path, open_flags, errp);
+        if (fd < 0) {
+            return -EINVAL;
+        }
+
+        ret = blkio_set_int(s->blkio, "fd", fd);
+        if (ret < 0) {
+            error_setg_errno(errp, -ret, "failed to set fd: %s",
+                             blkio_get_error_msg());
+            qemu_close(fd);
+            return ret;
+        }
+    } else {
+        ret = blkio_set_str(s->blkio, "path", path);
+        if (ret < 0) {
+            error_setg_errno(errp, -ret, "failed to set path: %s",
+                             blkio_get_error_msg());
+            return ret;
+        }
+    }
+
+    qdict_del(options, "path");
+
     return 0;
 }
 
--
2.40.1
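For illustration, one plausible QMP session exercising this: the
management layer opens the vhost-vdpa character device itself, hands the
descriptor to QEMU over the QMP socket (via SCM_RIGHTS) with add-fd, and
then points @path at the resulting fd set. The fdset-id and node-name
values below are made up, and cache.direct=on is required because
cache.direct=off is rejected above:

{ "execute": "add-fd", "arguments": { "fdset-id": 1 } }

{ "execute": "blockdev-add",
  "arguments": {
      "driver": "virtio-blk-vhost-vdpa",
      "node-name": "vdpa0",
      "path": "/dev/fdset/1",
      "cache": { "direct": true }
  }
}

At open time, qemu_open() resolves "/dev/fdset/1" to the descriptor
registered with add-fd, so QEMU never needs the privileges to open the
character device itself.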
From: David Edmondson <david.edmondson@oracle.com>

Given that the block size is read from the header of the VDI file, a
wide variety of sizes might be seen. Rather than re-using a block-sized
memory region when writing the VDI header, allocate an appropriately
sized buffer.

Signed-off-by: David Edmondson <david.edmondson@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Max Reitz <mreitz@redhat.com>
Message-id: 20210325112941.365238-3-pbonzini@redhat.com
Message-Id: <20210309144015.557477-3-david.edmondson@oracle.com>
Acked-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/vdi.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/block/vdi.c b/block/vdi.c
index XXXXXXX..XXXXXXX 100644
--- a/block/vdi.c
+++ b/block/vdi.c
@@ -XXX,XX +XXX,XX @@ nonallocating_write:
 
     if (block) {
         /* One or more new blocks were allocated. */
-        VdiHeader *header = (VdiHeader *) block;
+        VdiHeader *header;
         uint8_t *base;
         uint64_t offset;
         uint32_t n_sectors;
 
+        g_free(block);
+        header = g_malloc(sizeof(*header));
+
         logout("now writing modified header\n");
         assert(VDI_IS_ALLOCATED(bmap_first));
         *header = s->header;
         vdi_header_to_le(header);
-        ret = bdrv_pwrite(bs->file, 0, block, sizeof(VdiHeader));
-        g_free(block);
-        block = NULL;
+        ret = bdrv_pwrite(bs->file, 0, header, sizeof(*header));
+        g_free(header);
 
         if (ret < 0) {
             return ret;
--
2.30.2

From: Stefano Garzarella <sgarzare@redhat.com>

The virtio-blk-vhost-vdpa driver in libblkio 1.3.0 supports fd
passing through the new 'fd' property.

Since we now use qemu_open() on '@path' when the virtio-blk driver
supports fd passing, let's announce it.
In this way, the management layer can pass the file descriptor of an
already opened vhost-vdpa character device. This is useful especially
when the device can only be accessed with certain privileges.

Add the '@fdset' feature only when the virtio-blk-vhost-vdpa driver
in libblkio supports it.

Suggested-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Message-id: 20230530071941.8954-3-sgarzare@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 qapi/block-core.json | 6 ++++++
 meson.build          | 4 ++++
 2 files changed, 10 insertions(+)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index XXXXXXX..XXXXXXX 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -XXX,XX +XXX,XX @@
 #
 # @path: path to the vhost-vdpa character device.
 #
+# Features:
+# @fdset: Member @path supports the special "/dev/fdset/N" path
+#     (since 8.1)
+#
 # Since: 7.2
 ##
 { 'struct': 'BlockdevOptionsVirtioBlkVhostVdpa',
   'data': { 'path': 'str' },
+  'features': [ { 'name' :'fdset',
+                  'if': 'CONFIG_BLKIO_VHOST_VDPA_FD' } ],
   'if': 'CONFIG_BLKIO' }
 
 ##
diff --git a/meson.build b/meson.build
index XXXXXXX..XXXXXXX 100644
--- a/meson.build
+++ b/meson.build
@@ -XXX,XX +XXX,XX @@ config_host_data.set('CONFIG_LZO', lzo.found())
 config_host_data.set('CONFIG_MPATH', mpathpersist.found())
 config_host_data.set('CONFIG_MPATH_NEW_API', mpathpersist_new_api)
 config_host_data.set('CONFIG_BLKIO', blkio.found())
+if blkio.found()
+  config_host_data.set('CONFIG_BLKIO_VHOST_VDPA_FD',
+                       blkio.version().version_compare('>=1.3.0'))
+endif
 config_host_data.set('CONFIG_CURL', curl.found())
 config_host_data.set('CONFIG_CURSES', curses.found())
 config_host_data.set('CONFIG_GBM', gbm.found())
--
2.40.1
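Since the feature is compiled in only when the libblkio version is new
enough, management software should probe for it rather than assume it;
a hypothetical probe via QAPI introspection:

{ "execute": "query-qmp-schema" }

In the reply, the schema entry describing
BlockdevOptionsVirtioBlkVhostVdpa lists "fdset" in its "features" array
only when QEMU was built against libblkio >= 1.3.0; if the feature is
absent, the tool should fall back to passing a plain device path.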