The following changes since commit 8844bb8d896595ee1d25d21c770e6e6f29803097:

  Merge tag 'or1k-pull-request-20230513' of https://github.com/stffrdhrn/qemu into staging (2023-05-13 11:23:14 +0100)

are available in the Git repository at:

  https://gitlab.com/stefanha/qemu.git tags/block-pull-request

for you to fetch changes up to 01562fee5f3ad4506d57dbcf4b1903b565eceec7:

  docs/zoned-storage:add zoned emulation use case (2023-05-15 08:19:04 -0400)

----------------------------------------------------------------
Pull request

This pull request contains Sam Li's zoned storage support in the QEMU block
layer and virtio-blk emulation.

v2:
- Sam fixed the CI failures. CI passes for me now. [Richard]

----------------------------------------------------------------

Sam Li (16):
  block/block-common: add zoned device structs
  block/file-posix: introduce helper functions for sysfs attributes
  block/block-backend: add block layer APIs resembling Linux
    ZonedBlockDevice ioctls
  block/raw-format: add zone operations to pass through requests
  block: add zoned BlockDriver check to block layer
  iotests: test new zone operations
  block: add some trace events for new block layer APIs
  docs/zoned-storage: add zoned device documentation
  file-posix: add tracking of the zone write pointers
  block: introduce zone append write for zoned devices
  qemu-iotests: test zone append operation
  block: add some trace events for zone append
  virtio-blk: add zoned storage emulation for zoned devices
  block: add accounting for zone append operation
  virtio-blk: add some trace events for zoned emulation
  docs/zoned-storage:add zoned emulation use case

 docs/devel/index-api.rst               |   1 +
 docs/devel/zoned-storage.rst           |  62 +++
 qapi/block-core.json                   |  68 ++-
 qapi/block.json                        |   4 +
 meson.build                            |   5 +
 include/block/accounting.h             |   1 +
 include/block/block-common.h           |  57 ++
 include/block/block-io.h               |  13 +
 include/block/block_int-common.h       |  37 ++
 include/block/raw-aio.h                |   8 +-
 include/sysemu/block-backend-io.h      |  27 +
 block.c                                |  19 +
 block/block-backend.c                  | 198 +++++++
 block/file-posix.c                     | 692 +++++++++++++++++++++++--
 block/io.c                             |  68 +++
 block/io_uring.c                       |   4 +
 block/linux-aio.c                      |   3 +
 block/qapi-sysemu.c                    |  11 +
 block/qapi.c                           |  18 +
 block/raw-format.c                     |  26 +
 hw/block/virtio-blk-common.c           |   2 +
 hw/block/virtio-blk.c                  | 405 +++++++++++++++
 hw/virtio/virtio-qmp.c                 |   2 +
 qemu-io-cmds.c                         | 224 ++++++++
 block/trace-events                     |   4 +
 docs/system/qemu-block-drivers.rst.inc |   6 +
 hw/block/trace-events                  |   7 +
 tests/qemu-iotests/227.out             |  18 +
 tests/qemu-iotests/tests/zoned         | 105 ++++
 tests/qemu-iotests/tests/zoned.out     |  69 +++
 30 files changed, 2106 insertions(+), 58 deletions(-)
 create mode 100644 docs/devel/zoned-storage.rst
 create mode 100755 tests/qemu-iotests/tests/zoned
 create mode 100644 tests/qemu-iotests/tests/zoned.out

--
2.40.1

From: Sam Li <faithilikerun@gmail.com>

Signed-off-by: Sam Li <faithilikerun@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20230508045533.175575-2-faithilikerun@gmail.com
Message-id: 20230324090605.28361-2-faithilikerun@gmail.com
[Adjust commit message prefix as suggested by Philippe Mathieu-Daudé
<philmd@linaro.org>.
--Stefan]
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/block/block-common.h | 43 ++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/include/block/block-common.h b/include/block/block-common.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/block-common.h
+++ b/include/block/block-common.h
@@ -XXX,XX +XXX,XX @@ typedef struct BlockDriver BlockDriver;
 typedef struct BdrvChild BdrvChild;
 typedef struct BdrvChildClass BdrvChildClass;
 
+typedef enum BlockZoneOp {
+    BLK_ZO_OPEN,
+    BLK_ZO_CLOSE,
+    BLK_ZO_FINISH,
+    BLK_ZO_RESET,
+} BlockZoneOp;
+
+typedef enum BlockZoneModel {
+    BLK_Z_NONE = 0x0, /* Regular block device */
+    BLK_Z_HM = 0x1, /* Host-managed zoned block device */
+    BLK_Z_HA = 0x2, /* Host-aware zoned block device */
+} BlockZoneModel;
+
+typedef enum BlockZoneState {
+    BLK_ZS_NOT_WP = 0x0,
+    BLK_ZS_EMPTY = 0x1,
+    BLK_ZS_IOPEN = 0x2,
+    BLK_ZS_EOPEN = 0x3,
+    BLK_ZS_CLOSED = 0x4,
+    BLK_ZS_RDONLY = 0xD,
+    BLK_ZS_FULL = 0xE,
+    BLK_ZS_OFFLINE = 0xF,
+} BlockZoneState;
+
+typedef enum BlockZoneType {
+    BLK_ZT_CONV = 0x1, /* Conventional random writes supported */
+    BLK_ZT_SWR = 0x2, /* Sequential writes required */
+    BLK_ZT_SWP = 0x3, /* Sequential writes preferred */
+} BlockZoneType;
+
+/*
+ * Zone descriptor data structure.
+ * Provides information on a zone with all position and size values in bytes.
+ */
+typedef struct BlockZoneDescriptor {
+    uint64_t start;
+    uint64_t length;
+    uint64_t cap;
+    uint64_t wp;
+    BlockZoneType type;
+    BlockZoneState state;
+} BlockZoneDescriptor;
+
 typedef struct BlockDriverInfo {
     /* in bytes, 0 if irrelevant */
     int cluster_size;
--
2.40.1

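A note on how these structs are meant to be consumed: a zone report is simply
an array of BlockZoneDescriptor entries. The following is a minimal sketch of
a consumer, assuming the header added by this patch; it is illustrative only
and not code from the series:

    /* Sketch: walk a BlockZoneDescriptor array filled in by a zone report.
     * Assumes the types from include/block/block-common.h above. */
    #include <inttypes.h>
    #include <stdio.h>

    static void summarize_zones(const BlockZoneDescriptor *zones,
                                unsigned int nr_zones)
    {
        for (unsigned int i = 0; i < nr_zones; i++) {
            const BlockZoneDescriptor *z = &zones[i];
            /* All positions and sizes are byte values, per the struct
             * comment. wp - start is the amount written into the zone
             * (not meaningful for BLK_ZS_NOT_WP zones). */
            printf("zone %u: start=%" PRIu64 " cap=%" PRIu64
                   " written=%" PRIu64 " type=%d state=%d\n",
                   i, z->start, z->cap, z->wp - z->start,
                   (int)z->type, (int)z->state);
        }
    }

The hexadecimal values of BlockZoneState and BlockZoneType mirror the zone
condition and zone type codes used by Linux and the ZBC/ZNS specifications,
which keeps the conversion from kernel report structures, added later in this
series, a direct mapping.
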
From: Sam Li <faithilikerun@gmail.com>

Use get_sysfs_str_val() to get the string value of the device's
zoned model. Then get_sysfs_zoned_model() can convert it to QEMU's
BlockZoneModel type.

Use get_sysfs_long_val() to get the long value of a zoned device
attribute.

Signed-off-by: Sam Li <faithilikerun@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20230508045533.175575-3-faithilikerun@gmail.com
Message-id: 20230324090605.28361-3-faithilikerun@gmail.com
[Adjust commit message prefix as suggested by Philippe Mathieu-Daudé
<philmd@linaro.org>.
--Stefan]
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/block/block_int-common.h |   3 +
 block/file-posix.c               | 135 ++++++++++++++++++++++---------
 2 files changed, 100 insertions(+), 38 deletions(-)

diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -XXX,XX +XXX,XX @@ typedef struct BlockLimits {
      * an explicit monitor command to load the disk inside the guest).
      */
     bool has_variable_length;
+
+    /* device zone model */
+    BlockZoneModel zoned;
 } BlockLimits;
 
 typedef struct BdrvOpBlocker BdrvOpBlocker;
diff --git a/block/file-posix.c b/block/file-posix.c
index XXXXXXX..XXXXXXX 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -XXX,XX +XXX,XX @@ static int hdev_get_max_hw_transfer(int fd, struct stat *st)
 #endif
 }
 
-static int hdev_get_max_segments(int fd, struct stat *st)
+/*
+ * Get a sysfs attribute value as character string.
+ */
+#ifdef CONFIG_LINUX
+static int get_sysfs_str_val(struct stat *st, const char *attribute,
+                             char **val) {
+    g_autofree char *sysfspath = NULL;
+    int ret;
+    size_t len;
+
+    if (!S_ISBLK(st->st_mode)) {
+        return -ENOTSUP;
+    }
+
+    sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/%s",
+                                major(st->st_rdev), minor(st->st_rdev),
+                                attribute);
+    ret = g_file_get_contents(sysfspath, val, &len, NULL);
+    if (ret == -1) {
+        return -ENOENT;
+    }
+
+    /* The file is ended with '\n' */
+    char *p;
+    p = *val;
+    if (*(p + len - 1) == '\n') {
+        *(p + len - 1) = '\0';
+    }
+    return ret;
+}
+#endif
+
+static int get_sysfs_zoned_model(struct stat *st, BlockZoneModel *zoned)
 {
+    g_autofree char *val = NULL;
+    int ret;
+
+    ret = get_sysfs_str_val(st, "zoned", &val);
+    if (ret < 0) {
+        return ret;
+    }
+
+    if (strcmp(val, "host-managed") == 0) {
+        *zoned = BLK_Z_HM;
+    } else if (strcmp(val, "host-aware") == 0) {
+        *zoned = BLK_Z_HA;
+    } else if (strcmp(val, "none") == 0) {
+        *zoned = BLK_Z_NONE;
+    } else {
+        return -ENOTSUP;
+    }
+    return 0;
+}
+
+/*
+ * Get a sysfs attribute value as a long integer.
+ */
 #ifdef CONFIG_LINUX
-    char buf[32];
+static long get_sysfs_long_val(struct stat *st, const char *attribute)
+{
+    g_autofree char *str = NULL;
     const char *end;
-    char *sysfspath = NULL;
+    long val;
+    int ret;
+
+    ret = get_sysfs_str_val(st, attribute, &str);
+    if (ret < 0) {
+        return ret;
+    }
+
+    /* The file is ended with '\n', pass 'end' to accept that. */
+    ret = qemu_strtol(str, &end, 10, &val);
+    if (ret == 0 && end && *end == '\0') {
+        ret = val;
+    }
+    return ret;
+}
+#endif
+
+static int hdev_get_max_segments(int fd, struct stat *st)
+{
+#ifdef CONFIG_LINUX
     int ret;
-    int sysfd = -1;
-    long max_segments;
 
     if (S_ISCHR(st->st_mode)) {
         if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) {
@@ -XXX,XX +XXX,XX @@ static int hdev_get_max_segments(int fd, struct stat *st)
         }
         return -ENOTSUP;
     }
-
-    if (!S_ISBLK(st->st_mode)) {
-        return -ENOTSUP;
-    }
-
-    sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/max_segments",
-                                major(st->st_rdev), minor(st->st_rdev));
-    sysfd = open(sysfspath, O_RDONLY);
-    if (sysfd == -1) {
-        ret = -errno;
-        goto out;
-    }
-    ret = RETRY_ON_EINTR(read(sysfd, buf, sizeof(buf) - 1));
-    if (ret < 0) {
-        ret = -errno;
-        goto out;
-    } else if (ret == 0) {
-        ret = -EIO;
-        goto out;
-    }
-    buf[ret] = 0;
-    /* The file is ended with '\n', pass 'end' to accept that. */
-    ret = qemu_strtol(buf, &end, 10, &max_segments);
-    if (ret == 0 && end && *end == '\n') {
-        ret = max_segments;
-    }
-
-out:
-    if (sysfd != -1) {
-        close(sysfd);
-    }
-    g_free(sysfspath);
-    return ret;
+    return get_sysfs_long_val(st, "max_segments");
 #else
     return -ENOTSUP;
 #endif
 }
 
+static void raw_refresh_zoned_limits(BlockDriverState *bs, struct stat *st,
+                                     Error **errp)
+{
+    BlockZoneModel zoned;
+    int ret;
+
+    bs->bl.zoned = BLK_Z_NONE;
+
+    ret = get_sysfs_zoned_model(st, &zoned);
+    if (ret < 0 || zoned == BLK_Z_NONE) {
+        return;
+    }
+    bs->bl.zoned = zoned;
+}
+
 static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
 {
     BDRVRawState *s = bs->opaque;
@@ -XXX,XX +XXX,XX @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
             bs->bl.max_hw_iov = ret;
         }
     }
+
+    raw_refresh_zoned_limits(bs, &st, errp);
 }
 
 static int check_for_dasd(int fd)
--
2.40.1

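The sysfs layout that these helpers parse is a stable kernel interface
(/sys/dev/block/<major>:<minor>/queue/<attribute>), so the lookup is easy to
reproduce outside QEMU. Here is a standalone sketch of the same idea,
simplified and hypothetical; the in-tree helpers use glib and QEMU's error
conventions instead:

    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <sys/sysmacros.h>

    /* Read /sys/dev/block/<maj>:<min>/queue/<attr> for a block device node,
     * stripping the trailing newline, like get_sysfs_str_val() does. */
    static int read_queue_attr(const char *dev, const char *attr,
                               char *buf, size_t len)
    {
        struct stat st;
        char path[256];
        FILE *f;

        if (stat(dev, &st) < 0 || !S_ISBLK(st.st_mode)) {
            return -1;
        }
        snprintf(path, sizeof(path), "/sys/dev/block/%u:%u/queue/%s",
                 major(st.st_rdev), minor(st.st_rdev), attr);
        f = fopen(path, "r");
        if (!f) {
            return -1;
        }
        if (!fgets(buf, (int)len, f)) {
            fclose(f);
            return -1;
        }
        fclose(f);
        buf[strcspn(buf, "\n")] = '\0';
        return 0;
    }

    int main(void)
    {
        char model[32];

        /* "/dev/nullb0" is a placeholder; a null_blk device created with
         * zoned=1 reports "host-managed" here, which the patch maps to
         * BLK_Z_HM. */
        if (read_queue_attr("/dev/nullb0", "zoned", model, sizeof(model)) == 0) {
            printf("zoned model: %s\n", model);
        }
        return 0;
    }

Numeric attributes such as max_segments come from the same directory, which is
what get_sysfs_long_val() generalizes; later patches in this series reuse it
for chunk_sectors, nr_zones, and the zone limits.
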
From: Sam Li <faithilikerun@gmail.com>

Add a zoned device option to the host_device BlockDriver. It will be presented
only for zoned host block devices. By adding zone management operations to the
host_block_device BlockDriver, users can use the new block layer APIs
including Report Zone and the zone management operations
(open, close, finish, reset, reset_all).

Qemu-io uses the new APIs to perform zoned storage commands on the device:
zone_report(zrp), zone_open(zo), zone_close(zc), zone_reset(zrs),
zone_finish(zf).

For example, to test zone_report, use the following command:
$ ./build/qemu-io --image-opts -n driver=host_device,filename=/dev/nullb0
-c "zrp offset nr_zones"

Signed-off-by: Sam Li <faithilikerun@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20230508045533.175575-4-faithilikerun@gmail.com
Message-id: 20230324090605.28361-4-faithilikerun@gmail.com
[Adjust commit message prefix as suggested by Philippe Mathieu-Daudé
<philmd@linaro.org> and remove spurious ret = -errno in
raw_co_zone_mgmt().
--Stefan]
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 meson.build                       |   5 +
 include/block/block-io.h          |   9 +
 include/block/block_int-common.h  |  21 ++
 include/block/raw-aio.h           |   6 +-
 include/sysemu/block-backend-io.h |  18 ++
 block/block-backend.c             | 137 +++++++++++++
 block/file-posix.c                | 313 +++++++++++++++++++++++++++++-
 block/io.c                        |  41 ++++
 qemu-io-cmds.c                    | 149 ++++++++++++++
 9 files changed, 696 insertions(+), 3 deletions(-)

diff --git a/meson.build b/meson.build
index XXXXXXX..XXXXXXX 100644
--- a/meson.build
+++ b/meson.build
@@ -XXX,XX +XXX,XX @@ if rdma.found()
 endif
 
 # has_header_symbol
+config_host_data.set('CONFIG_BLKZONED',
+                     cc.has_header_symbol('linux/blkzoned.h', 'BLKOPENZONE'))
 config_host_data.set('CONFIG_EPOLL_CREATE1',
                      cc.has_header_symbol('sys/epoll.h', 'epoll_create1'))
 config_host_data.set('CONFIG_FALLOCATE_PUNCH_HOLE',
@@ -XXX,XX +XXX,XX @@ config_host_data.set('HAVE_SIGEV_NOTIFY_THREAD_ID',
 config_host_data.set('HAVE_STRUCT_STAT_ST_ATIM',
                      cc.has_member('struct stat', 'st_atim',
                                    prefix: '#include <sys/stat.h>'))
+config_host_data.set('HAVE_BLK_ZONE_REP_CAPACITY',
+                     cc.has_member('struct blk_zone', 'capacity',
+                                   prefix: '#include <linux/blkzoned.h>'))
 
 # has_type
 config_host_data.set('CONFIG_IOVEC',
diff --git a/include/block/block-io.h b/include/block/block-io.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -XXX,XX +XXX,XX @@ int coroutine_fn GRAPH_RDLOCK bdrv_co_flush(BlockDriverState *bs);
 int coroutine_fn GRAPH_RDLOCK bdrv_co_pdiscard(BdrvChild *child, int64_t offset,
                                                int64_t bytes);
 
+/* Report zone information of zone block device. */
+int coroutine_fn GRAPH_RDLOCK bdrv_co_zone_report(BlockDriverState *bs,
+                                                  int64_t offset,
+                                                  unsigned int *nr_zones,
+                                                  BlockZoneDescriptor *zones);
+int coroutine_fn GRAPH_RDLOCK bdrv_co_zone_mgmt(BlockDriverState *bs,
+                                                BlockZoneOp op,
+                                                int64_t offset, int64_t len);
+
 bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs);
 int bdrv_block_status(BlockDriverState *bs, int64_t offset,
                       int64_t bytes, int64_t *pnum, int64_t *map,
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -XXX,XX +XXX,XX @@ struct BlockDriver {
     int coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_co_load_vmstate)(
         BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos);
 
+    int coroutine_fn (*bdrv_co_zone_report)(BlockDriverState *bs,
+            int64_t offset, unsigned int *nr_zones,
+            BlockZoneDescriptor *zones);
+    int coroutine_fn (*bdrv_co_zone_mgmt)(BlockDriverState *bs, BlockZoneOp op,
+            int64_t offset, int64_t len);
+
     /* removable device specific */
     bool coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_co_is_inserted)(
         BlockDriverState *bs);
@@ -XXX,XX +XXX,XX @@ typedef struct BlockLimits {
 
     /* device zone model */
     BlockZoneModel zoned;
+
+    /* zone size expressed in bytes */
+    uint32_t zone_size;
+
+    /* total number of zones */
+    uint32_t nr_zones;
+
+    /* maximum sectors of a zone append write operation */
+    uint32_t max_append_sectors;
+
+    /* maximum number of open zones */
+    uint32_t max_open_zones;
+
+    /* maximum number of active zones */
+    uint32_t max_active_zones;
 } BlockLimits;
 
 typedef struct BdrvOpBlocker BdrvOpBlocker;
diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/raw-aio.h
+++ b/include/block/raw-aio.h
@@ -XXX,XX +XXX,XX @@
 #define QEMU_AIO_WRITE_ZEROES 0x0020
 #define QEMU_AIO_COPY_RANGE   0x0040
 #define QEMU_AIO_TRUNCATE     0x0080
+#define QEMU_AIO_ZONE_REPORT  0x0100
+#define QEMU_AIO_ZONE_MGMT    0x0200
 #define QEMU_AIO_TYPE_MASK \
         (QEMU_AIO_READ | \
          QEMU_AIO_WRITE | \
@@ -XXX,XX +XXX,XX @@
          QEMU_AIO_DISCARD | \
          QEMU_AIO_WRITE_ZEROES | \
          QEMU_AIO_COPY_RANGE | \
-         QEMU_AIO_TRUNCATE)
+         QEMU_AIO_TRUNCATE | \
+         QEMU_AIO_ZONE_REPORT | \
+         QEMU_AIO_ZONE_MGMT)
 
 /* AIO flags */
 #define QEMU_AIO_MISALIGNED 0x1000
diff --git a/include/sysemu/block-backend-io.h b/include/sysemu/block-backend-io.h
index XXXXXXX..XXXXXXX 100644
--- a/include/sysemu/block-backend-io.h
+++ b/include/sysemu/block-backend-io.h
@@ -XXX,XX +XXX,XX @@ BlockAIOCB *blk_aio_pwritev(BlockBackend *blk, int64_t offset,
                             BlockCompletionFunc *cb, void *opaque);
 BlockAIOCB *blk_aio_flush(BlockBackend *blk,
                           BlockCompletionFunc *cb, void *opaque);
+BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t offset,
+                                unsigned int *nr_zones,
+                                BlockZoneDescriptor *zones,
+                                BlockCompletionFunc *cb, void *opaque);
+BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
+                              int64_t offset, int64_t len,
+                              BlockCompletionFunc *cb, void *opaque);
 BlockAIOCB *blk_aio_pdiscard(BlockBackend *blk, int64_t offset, int64_t bytes,
                              BlockCompletionFunc *cb, void *opaque);
 void blk_aio_cancel_async(BlockAIOCB *acb);
@@ -XXX,XX +XXX,XX @@ int co_wrapper_mixed blk_pwrite_zeroes(BlockBackend *blk, int64_t offset,
 int coroutine_fn blk_co_pwrite_zeroes(BlockBackend *blk, int64_t offset,
                                       int64_t bytes, BdrvRequestFlags flags);
 
+int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
+                                    unsigned int *nr_zones,
+                                    BlockZoneDescriptor *zones);
+int co_wrapper_mixed blk_zone_report(BlockBackend *blk, int64_t offset,
+                                     unsigned int *nr_zones,
+                                     BlockZoneDescriptor *zones);
+int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
+                                  int64_t offset, int64_t len);
+int co_wrapper_mixed blk_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
+                                   int64_t offset, int64_t len);
+
 int co_wrapper_mixed blk_pdiscard(BlockBackend *blk, int64_t offset,
                                   int64_t bytes);
 int coroutine_fn blk_co_pdiscard(BlockBackend *blk, int64_t offset,
diff --git a/block/block-backend.c b/block/block-backend.c
index XXXXXXX..XXXXXXX 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -XXX,XX +XXX,XX @@ int coroutine_fn blk_co_flush(BlockBackend *blk)
     return ret;
 }
 
+static void coroutine_fn blk_aio_zone_report_entry(void *opaque)
+{
+    BlkAioEmAIOCB *acb = opaque;
+    BlkRwCo *rwco = &acb->rwco;
+
+    rwco->ret = blk_co_zone_report(rwco->blk, rwco->offset,
+                                   (unsigned int*)(uintptr_t)acb->bytes,
+                                   rwco->iobuf);
+    blk_aio_complete(acb);
+}
+
+BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t offset,
+                                unsigned int *nr_zones,
+                                BlockZoneDescriptor *zones,
+                                BlockCompletionFunc *cb, void *opaque)
+{
+    BlkAioEmAIOCB *acb;
+    Coroutine *co;
+    IO_CODE();
+
+    blk_inc_in_flight(blk);
+    acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque);
+    acb->rwco = (BlkRwCo) {
+        .blk = blk,
+        .offset = offset,
+        .iobuf = zones,
+        .ret = NOT_DONE,
+    };
+    acb->bytes = (int64_t)(uintptr_t)nr_zones,
+    acb->has_returned = false;
+
+    co = qemu_coroutine_create(blk_aio_zone_report_entry, acb);
+    aio_co_enter(blk_get_aio_context(blk), co);
+
+    acb->has_returned = true;
+    if (acb->rwco.ret != NOT_DONE) {
+        replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
+                                         blk_aio_complete_bh, acb);
+    }
+
+    return &acb->common;
+}
+
+static void coroutine_fn blk_aio_zone_mgmt_entry(void *opaque)
+{
+    BlkAioEmAIOCB *acb = opaque;
+    BlkRwCo *rwco = &acb->rwco;
+
+    rwco->ret = blk_co_zone_mgmt(rwco->blk,
+                                 (BlockZoneOp)(uintptr_t)rwco->iobuf,
+                                 rwco->offset, acb->bytes);
+    blk_aio_complete(acb);
+}
+
+BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
+                              int64_t offset, int64_t len,
+                              BlockCompletionFunc *cb, void *opaque) {
+    BlkAioEmAIOCB *acb;
+    Coroutine *co;
+    IO_CODE();
+
+    blk_inc_in_flight(blk);
+    acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque);
+    acb->rwco = (BlkRwCo) {
+        .blk = blk,
+        .offset = offset,
+        .iobuf = (void *)(uintptr_t)op,
+        .ret = NOT_DONE,
+    };
+    acb->bytes = len;
+    acb->has_returned = false;
+
+    co = qemu_coroutine_create(blk_aio_zone_mgmt_entry, acb);
+    aio_co_enter(blk_get_aio_context(blk), co);
+
+    acb->has_returned = true;
+    if (acb->rwco.ret != NOT_DONE) {
+        replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
+                                         blk_aio_complete_bh, acb);
+    }
+
+    return &acb->common;
+}
+
+/*
+ * Send a zone_report command.
+ * offset is a byte offset from the start of the device. No alignment
+ * required for offset.
+ * nr_zones represents IN maximum and OUT actual.
+ */
+int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
+                                    unsigned int *nr_zones,
+                                    BlockZoneDescriptor *zones)
+{
+    int ret;
+    IO_CODE();
+
+    blk_inc_in_flight(blk); /* increase before waiting */
+    blk_wait_while_drained(blk);
+    GRAPH_RDLOCK_GUARD();
+    if (!blk_is_available(blk)) {
+        blk_dec_in_flight(blk);
+        return -ENOMEDIUM;
+    }
+    ret = bdrv_co_zone_report(blk_bs(blk), offset, nr_zones, zones);
+    blk_dec_in_flight(blk);
+    return ret;
+}
+
+/*
+ * Send a zone_management command.
+ * op is the zone operation;
+ * offset is the byte offset from the start of the zoned device;
+ * len is the maximum number of bytes the command should operate on. It
+ * should be aligned with the device zone size.
+ */
+int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
+                                  int64_t offset, int64_t len)
+{
+    int ret;
+    IO_CODE();
+
+    blk_inc_in_flight(blk);
+    blk_wait_while_drained(blk);
+    GRAPH_RDLOCK_GUARD();
+
+    ret = blk_check_byte_request(blk, offset, len);
+    if (ret < 0) {
+        blk_dec_in_flight(blk);
+        return ret;
+    }
+
+    ret = bdrv_co_zone_mgmt(blk_bs(blk), op, offset, len);
+    blk_dec_in_flight(blk);
+    return ret;
+}
+
 void blk_drain(BlockBackend *blk)
 {
     BlockDriverState *bs = blk_bs(blk);
diff --git a/block/file-posix.c b/block/file-posix.c
index XXXXXXX..XXXXXXX 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -XXX,XX +XXX,XX @@
 #include <sys/param.h>
 #include <sys/syscall.h>
 #include <sys/vfs.h>
+#if defined(CONFIG_BLKZONED)
+#include <linux/blkzoned.h>
+#endif
 #include <linux/cdrom.h>
 #include <linux/fd.h>
 #include <linux/fs.h>
@@ -XXX,XX +XXX,XX @@ typedef struct RawPosixAIOData {
             PreallocMode prealloc;
             Error **errp;
         } truncate;
+        struct {
+            unsigned int *nr_zones;
+            BlockZoneDescriptor *zones;
+        } zone_report;
+        struct {
+            unsigned long op;
+        } zone_mgmt;
     };
 } RawPosixAIOData;
 
@@ -XXX,XX +XXX,XX @@ static int get_sysfs_str_val(struct stat *st, const char *attribute,
 }
 #endif
 
+#if defined(CONFIG_BLKZONED)
 static int get_sysfs_zoned_model(struct stat *st, BlockZoneModel *zoned)
 {
     g_autofree char *val = NULL;
@@ -XXX,XX +XXX,XX @@ static int get_sysfs_zoned_model(struct stat *st, BlockZoneModel *zoned)
     }
     return 0;
 }
+#endif /* defined(CONFIG_BLKZONED) */
 
 /*
  * Get a sysfs attribute value as a long integer.
@@ -XXX,XX +XXX,XX @@ static int hdev_get_max_segments(int fd, struct stat *st)
 #endif
 }
 
+#if defined(CONFIG_BLKZONED)
 static void raw_refresh_zoned_limits(BlockDriverState *bs, struct stat *st,
                                      Error **errp)
 {
@@ -XXX,XX +XXX,XX @@ static void raw_refresh_zoned_limits(BlockDriverState *bs, struct stat *st,
         return;
     }
     bs->bl.zoned = zoned;
+
+    ret = get_sysfs_long_val(st, "max_open_zones");
+    if (ret >= 0) {
+        bs->bl.max_open_zones = ret;
+    }
+
+    ret = get_sysfs_long_val(st, "max_active_zones");
+    if (ret >= 0) {
+        bs->bl.max_active_zones = ret;
+    }
+
+    /*
+     * The zoned device must at least have zone size and nr_zones fields.
+     */
+    ret = get_sysfs_long_val(st, "chunk_sectors");
+    if (ret < 0) {
+        error_setg_errno(errp, -ret, "Unable to read chunk_sectors "
+                                     "sysfs attribute");
+        return;
+    } else if (!ret) {
+        error_setg(errp, "Read 0 from chunk_sectors sysfs attribute");
+        return;
+    }
+    bs->bl.zone_size = ret << BDRV_SECTOR_BITS;
+
+    ret = get_sysfs_long_val(st, "nr_zones");
+    if (ret < 0) {
+        error_setg_errno(errp, -ret, "Unable to read nr_zones "
+                                     "sysfs attribute");
+        return;
+    } else if (!ret) {
+        error_setg(errp, "Read 0 from nr_zones sysfs attribute");
+        return;
+    }
+    bs->bl.nr_zones = ret;
+
+    ret = get_sysfs_long_val(st, "zone_append_max_bytes");
+    if (ret > 0) {
+        bs->bl.max_append_sectors = ret >> BDRV_SECTOR_BITS;
+    }
 }
+#else /* !defined(CONFIG_BLKZONED) */
+static void raw_refresh_zoned_limits(BlockDriverState *bs, struct stat *st,
+                                     Error **errp)
+{
+    bs->bl.zoned = BLK_Z_NONE;
+}
+#endif /* !defined(CONFIG_BLKZONED) */
 
 static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
 {
@@ -XXX,XX +XXX,XX @@ static int hdev_probe_blocksizes(BlockDriverState *bs, BlockSizes *bsz)
     BDRVRawState *s = bs->opaque;
     int ret;
 
-    /* If DASD, get blocksizes */
+    /* If DASD or zoned devices, get blocksizes */
     if (check_for_dasd(s->fd) < 0) {
-        return -ENOTSUP;
+        /* zoned devices are not DASD */
+        if (bs->bl.zoned == BLK_Z_NONE) {
+            return -ENOTSUP;
+        }
     }
     ret = probe_logical_blocksize(s->fd, &bsz->log);
     if (ret < 0) {
@@ -XXX,XX +XXX,XX @@ static off_t copy_file_range(int in_fd, off_t *in_off, int out_fd,
 }
 #endif
 
+/*
+ * parse_zone - Fill a zone descriptor
+ */
+#if defined(CONFIG_BLKZONED)
+static inline int parse_zone(struct BlockZoneDescriptor *zone,
+                             const struct blk_zone *blkz) {
+    zone->start = blkz->start << BDRV_SECTOR_BITS;
+    zone->length = blkz->len << BDRV_SECTOR_BITS;
+    zone->wp = blkz->wp << BDRV_SECTOR_BITS;
+
+#ifdef HAVE_BLK_ZONE_REP_CAPACITY
+    zone->cap = blkz->capacity << BDRV_SECTOR_BITS;
+#else
+    zone->cap = blkz->len << BDRV_SECTOR_BITS;
+#endif
+
+    switch (blkz->type) {
+    case BLK_ZONE_TYPE_SEQWRITE_REQ:
+        zone->type = BLK_ZT_SWR;
+        break;
+    case BLK_ZONE_TYPE_SEQWRITE_PREF:
+        zone->type = BLK_ZT_SWP;
+        break;
+    case BLK_ZONE_TYPE_CONVENTIONAL:
+        zone->type = BLK_ZT_CONV;
+        break;
+    default:
+        error_report("Unsupported zone type: 0x%x", blkz->type);
+        return -ENOTSUP;
+    }
+
+    switch (blkz->cond) {
+    case BLK_ZONE_COND_NOT_WP:
+        zone->state = BLK_ZS_NOT_WP;
+        break;
+    case BLK_ZONE_COND_EMPTY:
+        zone->state = BLK_ZS_EMPTY;
+        break;
+    case BLK_ZONE_COND_IMP_OPEN:
+        zone->state = BLK_ZS_IOPEN;
+        break;
+    case BLK_ZONE_COND_EXP_OPEN:
+        zone->state = BLK_ZS_EOPEN;
+        break;
+    case BLK_ZONE_COND_CLOSED:
+        zone->state = BLK_ZS_CLOSED;
+        break;
+    case BLK_ZONE_COND_READONLY:
+        zone->state = BLK_ZS_RDONLY;
+        break;
+    case BLK_ZONE_COND_FULL:
+        zone->state = BLK_ZS_FULL;
+        break;
+    case BLK_ZONE_COND_OFFLINE:
+        zone->state = BLK_ZS_OFFLINE;
+        break;
+    default:
+        error_report("Unsupported zone state: 0x%x", blkz->cond);
+        return -ENOTSUP;
+    }
+    return 0;
+}
+#endif
+
+#if defined(CONFIG_BLKZONED)
+static int handle_aiocb_zone_report(void *opaque)
+{
+    RawPosixAIOData *aiocb = opaque;
+    int fd = aiocb->aio_fildes;
+    unsigned int *nr_zones = aiocb->zone_report.nr_zones;
+    BlockZoneDescriptor *zones = aiocb->zone_report.zones;
+    /* zoned block devices use 512-byte sectors */
+    uint64_t sector = aiocb->aio_offset / 512;
+
+    struct blk_zone *blkz;
+    size_t rep_size;
+    unsigned int nrz;
+    int ret;
+    unsigned int n = 0, i = 0;
+
+    nrz = *nr_zones;
+    rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct blk_zone);
+    g_autofree struct blk_zone_report *rep = NULL;
+    rep = g_malloc(rep_size);
+
+    blkz = (struct blk_zone *)(rep + 1);
+    while (n < nrz) {
+        memset(rep, 0, rep_size);
+        rep->sector = sector;
+        rep->nr_zones = nrz - n;
+
+        do {
+            ret = ioctl(fd, BLKREPORTZONE, rep);
+        } while (ret != 0 && errno == EINTR);
+        if (ret != 0) {
+            error_report("%d: ioctl BLKREPORTZONE at %" PRId64 " failed %d",
+                         fd, sector, errno);
+            return -errno;
+        }
+
+        if (!rep->nr_zones) {
+            break;
+        }
+
+        for (i = 0; i < rep->nr_zones; i++, n++) {
+            ret = parse_zone(&zones[n], &blkz[i]);
+            if (ret != 0) {
+                return ret;
+            }
+
+            /* The next report should start after the last zone reported */
+            sector = blkz[i].start + blkz[i].len;
+        }
+    }
+
+    *nr_zones = n;
+    return 0;
+}
+#endif
+
+#if defined(CONFIG_BLKZONED)
+static int handle_aiocb_zone_mgmt(void *opaque)
+{
+    RawPosixAIOData *aiocb = opaque;
+    int fd = aiocb->aio_fildes;
+    uint64_t sector = aiocb->aio_offset / 512;
+    int64_t nr_sectors = aiocb->aio_nbytes / 512;
+    struct blk_zone_range range;
+    int ret;
+
+    /* Execute the operation */
+    range.sector = sector;
+    range.nr_sectors = nr_sectors;
+    do {
+        ret = ioctl(fd, aiocb->zone_mgmt.op, &range);
+    } while (ret != 0 && errno == EINTR);
+
+    return ret;
+}
+#endif
+
 static int handle_aiocb_copy_range(void *opaque)
 {
     RawPosixAIOData *aiocb = opaque;
@@ -XXX,XX +XXX,XX @@ static void raw_account_discard(BDRVRawState *s, uint64_t nbytes, int ret)
     }
 }
 
+/*
+ * zone report - Get a zone block device's information in the form
+ * of an array of zone descriptors.
+ * zones is an array of zone descriptors to hold zone information on reply;
+ * offset can be any byte within the entire size of the device;
+ * nr_zones is the maximum number of zones the command should operate on.
+ */
+#if defined(CONFIG_BLKZONED)
+static int coroutine_fn raw_co_zone_report(BlockDriverState *bs, int64_t offset,
+                                           unsigned int *nr_zones,
+                                           BlockZoneDescriptor *zones) {
+    BDRVRawState *s = bs->opaque;
+    RawPosixAIOData acb = (RawPosixAIOData) {
+        .bs = bs,
+        .aio_fildes = s->fd,
+        .aio_type = QEMU_AIO_ZONE_REPORT,
+        .aio_offset = offset,
+        .zone_report = {
+            .nr_zones = nr_zones,
+            .zones = zones,
+        },
+    };
+
+    return raw_thread_pool_submit(handle_aiocb_zone_report, &acb);
+}
+#endif
+
+/*
+ * zone management operations - Execute an operation on a zone
+ */
+#if defined(CONFIG_BLKZONED)
+static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
+                                         int64_t offset, int64_t len) {
+    BDRVRawState *s = bs->opaque;
+    RawPosixAIOData acb;
+    int64_t zone_size, zone_size_mask;
+    const char *op_name;
+    unsigned long zo;
+    int ret;
+    int64_t capacity = bs->total_sectors << BDRV_SECTOR_BITS;
+
+    zone_size = bs->bl.zone_size;
+    zone_size_mask = zone_size - 1;
+    if (offset & zone_size_mask) {
+        error_report("sector offset %" PRId64 " is not aligned to zone size "
+                     "%" PRId64 "", offset / 512, zone_size / 512);
+        return -EINVAL;
+    }
+
+    if (((offset + len) < capacity && len & zone_size_mask) ||
+        offset + len > capacity) {
+        error_report("number of sectors %" PRId64 " is not aligned to zone size"
+                     " %" PRId64 "", len / 512, zone_size / 512);
+        return -EINVAL;
+    }
+
+    switch (op) {
+    case BLK_ZO_OPEN:
+        op_name = "BLKOPENZONE";
+        zo = BLKOPENZONE;
+        break;
+    case BLK_ZO_CLOSE:
+        op_name = "BLKCLOSEZONE";
+        zo = BLKCLOSEZONE;
+        break;
+    case BLK_ZO_FINISH:
+        op_name = "BLKFINISHZONE";
+        zo = BLKFINISHZONE;
+        break;
+    case BLK_ZO_RESET:
+        op_name = "BLKRESETZONE";
+        zo = BLKRESETZONE;
+        break;
+    default:
+        error_report("Unsupported zone op: 0x%x", op);
+        return -ENOTSUP;
+    }
+
+    acb = (RawPosixAIOData) {
+        .bs = bs,
+        .aio_fildes = s->fd,
+        .aio_type = QEMU_AIO_ZONE_MGMT,
+        .aio_offset = offset,
+        .aio_nbytes = len,
+        .zone_mgmt = {
+            .op = zo,
+        },
+    };
+
+    ret = raw_thread_pool_submit(handle_aiocb_zone_mgmt, &acb);
+    if (ret != 0) {
+        error_report("ioctl %s failed %d", op_name, ret);
+    }
+
+    return ret;
+}
+#endif
+
 static coroutine_fn int
 raw_do_pdiscard(BlockDriverState *bs, int64_t offset, int64_t bytes,
                 bool blkdev)
@@ -XXX,XX +XXX,XX @@ static BlockDriver bdrv_host_device = {
 #ifdef __linux__
     .bdrv_co_ioctl = hdev_co_ioctl,
 #endif
+
+    /* zoned device */
+#if defined(CONFIG_BLKZONED)
+    /* zone management operations */
+    .bdrv_co_zone_report = raw_co_zone_report,
+    .bdrv_co_zone_mgmt = raw_co_zone_mgmt,
+#endif
 };
 
 #if defined(__linux__) || defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
diff --git a/block/io.c b/block/io.c
index XXXXXXX..XXXXXXX 100644
--- a/block/io.c
+++ b/block/io.c
@@ -XXX,XX +XXX,XX @@ out:
     return co.ret;
 }
 
+int coroutine_fn bdrv_co_zone_report(BlockDriverState *bs, int64_t offset,
+                                     unsigned int *nr_zones,
+                                     BlockZoneDescriptor *zones)
+{
+    BlockDriver *drv = bs->drv;
+    CoroutineIOCompletion co = {
+        .coroutine = qemu_coroutine_self(),
+    };
+    IO_CODE();
+
+    bdrv_inc_in_flight(bs);
+    if (!drv || !drv->bdrv_co_zone_report || bs->bl.zoned == BLK_Z_NONE) {
+        co.ret = -ENOTSUP;
+        goto out;
+    }
+    co.ret = drv->bdrv_co_zone_report(bs, offset, nr_zones, zones);
+out:
+    bdrv_dec_in_flight(bs);
+    return co.ret;
+}
+
+int coroutine_fn bdrv_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
+                                   int64_t offset, int64_t len)
+{
+    BlockDriver *drv = bs->drv;
+    CoroutineIOCompletion co = {
+        .coroutine = qemu_coroutine_self(),
+    };
+    IO_CODE();
+
+    bdrv_inc_in_flight(bs);
+    if (!drv || !drv->bdrv_co_zone_mgmt || bs->bl.zoned == BLK_Z_NONE) {
+        co.ret = -ENOTSUP;
+        goto out;
+    }
+    co.ret = drv->bdrv_co_zone_mgmt(bs, op, offset, len);
+out:
+    bdrv_dec_in_flight(bs);
+    return co.ret;
+}
+
 void *qemu_blockalign(BlockDriverState *bs, size_t size)
262
{
771
{
263
- aio_set_fd_handler(bdrv_get_aio_context(bs),
772
IO_CODE();
264
- nbd_get_client_session(bs)->sioc->fd,
773
diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
265
- false, NULL, NULL, NULL, NULL);
266
+ NBDClientSession *client = nbd_get_client_session(bs);
267
+ qio_channel_detach_aio_context(QIO_CHANNEL(client->sioc));
268
}
269
270
void nbd_client_attach_aio_context(BlockDriverState *bs,
271
AioContext *new_context)
272
{
273
- aio_set_fd_handler(new_context, nbd_get_client_session(bs)->sioc->fd,
274
- false, nbd_reply_ready, NULL, NULL, bs);
275
+ NBDClientSession *client = nbd_get_client_session(bs);
276
+ qio_channel_attach_aio_context(QIO_CHANNEL(client->sioc), new_context);
277
+ aio_co_schedule(new_context, client->read_reply_co);
278
}
279
280
void nbd_client_close(BlockDriverState *bs)
281
@@ -XXX,XX +XXX,XX @@ int nbd_client_init(BlockDriverState *bs,
282
/* Now that we're connected, set the socket to be non-blocking and
283
* kick the reply mechanism. */
284
qio_channel_set_blocking(QIO_CHANNEL(sioc), false, NULL);
285
-
286
+ client->read_reply_co = qemu_coroutine_create(nbd_read_reply_entry, client);
287
nbd_client_attach_aio_context(bs, bdrv_get_aio_context(bs));
288
289
logout("Established connection with NBD server\n");
290
diff --git a/nbd/client.c b/nbd/client.c
291
index XXXXXXX..XXXXXXX 100644
774
index XXXXXXX..XXXXXXX 100644
292
--- a/nbd/client.c
775
--- a/qemu-io-cmds.c
293
+++ b/nbd/client.c
776
+++ b/qemu-io-cmds.c
294
@@ -XXX,XX +XXX,XX @@ ssize_t nbd_receive_reply(QIOChannel *ioc, NBDReply *reply)
777
@@ -XXX,XX +XXX,XX @@ static const cmdinfo_t flush_cmd = {
295
ssize_t ret;
778
.oneline = "flush all in-core file state to disk",
296
779
};
297
ret = read_sync(ioc, buf, sizeof(buf));
780
298
- if (ret < 0) {
781
+static inline int64_t tosector(int64_t bytes)
299
+ if (ret <= 0) {
782
+{
300
return ret;
783
+ return bytes >> BDRV_SECTOR_BITS;
301
}
784
+}
302
785
+
303
diff --git a/nbd/common.c b/nbd/common.c
786
+static int zone_report_f(BlockBackend *blk, int argc, char **argv)
304
index XXXXXXX..XXXXXXX 100644
787
+{
305
--- a/nbd/common.c
788
+ int ret;
306
+++ b/nbd/common.c
789
+ int64_t offset;
307
@@ -XXX,XX +XXX,XX @@ ssize_t nbd_wr_syncv(QIOChannel *ioc,
790
+ unsigned int nr_zones;
308
}
791
+
309
if (len == QIO_CHANNEL_ERR_BLOCK) {
792
+ ++optind;
310
if (qemu_in_coroutine()) {
793
+ offset = cvtnum(argv[optind]);
311
- /* XXX figure out if we can create a variant on
794
+ ++optind;
312
- * qio_channel_yield() that works with AIO contexts
795
+ nr_zones = cvtnum(argv[optind]);
313
- * and consider using that in this branch */
796
+
314
- qemu_coroutine_yield();
797
+ g_autofree BlockZoneDescriptor *zones = NULL;
315
- } else if (done) {
798
+ zones = g_new(BlockZoneDescriptor, nr_zones);
316
- /* XXX this is needed by nbd_reply_ready. */
799
+ ret = blk_zone_report(blk, offset, &nr_zones, zones);
317
- qio_channel_wait(ioc,
800
+ if (ret < 0) {
318
- do_read ? G_IO_IN : G_IO_OUT);
801
+ printf("zone report failed: %s\n", strerror(-ret));
319
+ qio_channel_yield(ioc, do_read ? G_IO_IN : G_IO_OUT);
802
+ } else {
320
} else {
803
+ for (int i = 0; i < nr_zones; ++i) {
321
return -EAGAIN;
804
+ printf("start: 0x%" PRIx64 ", len 0x%" PRIx64 ", "
322
}
805
+ "cap"" 0x%" PRIx64 ", wptr 0x%" PRIx64 ", "
323
diff --git a/nbd/server.c b/nbd/server.c
806
+ "zcond:%u, [type: %u]\n",
324
index XXXXXXX..XXXXXXX 100644
807
+ tosector(zones[i].start), tosector(zones[i].length),
325
--- a/nbd/server.c
808
+ tosector(zones[i].cap), tosector(zones[i].wp),
326
+++ b/nbd/server.c
809
+ zones[i].state, zones[i].type);
327
@@ -XXX,XX +XXX,XX @@ struct NBDClient {
328
CoMutex send_lock;
329
Coroutine *send_coroutine;
330
331
- bool can_read;
332
-
333
QTAILQ_ENTRY(NBDClient) next;
334
int nb_requests;
335
bool closing;
336
@@ -XXX,XX +XXX,XX @@ struct NBDClient {
337
338
/* That's all folks */
339
340
-static void nbd_set_handlers(NBDClient *client);
341
-static void nbd_unset_handlers(NBDClient *client);
342
-static void nbd_update_can_read(NBDClient *client);
343
+static void nbd_client_receive_next_request(NBDClient *client);
344
345
static gboolean nbd_negotiate_continue(QIOChannel *ioc,
346
GIOCondition condition,
347
@@ -XXX,XX +XXX,XX @@ void nbd_client_put(NBDClient *client)
348
*/
349
assert(client->closing);
350
351
- nbd_unset_handlers(client);
352
+ qio_channel_detach_aio_context(client->ioc);
353
object_unref(OBJECT(client->sioc));
354
object_unref(OBJECT(client->ioc));
355
if (client->tlscreds) {
356
@@ -XXX,XX +XXX,XX @@ static NBDRequestData *nbd_request_get(NBDClient *client)
357
358
assert(client->nb_requests <= MAX_NBD_REQUESTS - 1);
359
client->nb_requests++;
360
- nbd_update_can_read(client);
361
362
req = g_new0(NBDRequestData, 1);
363
nbd_client_get(client);
364
@@ -XXX,XX +XXX,XX @@ static void nbd_request_put(NBDRequestData *req)
365
g_free(req);
366
367
client->nb_requests--;
368
- nbd_update_can_read(client);
369
+ nbd_client_receive_next_request(client);
370
+
371
nbd_client_put(client);
372
}
373
374
@@ -XXX,XX +XXX,XX @@ static void blk_aio_attached(AioContext *ctx, void *opaque)
375
exp->ctx = ctx;
376
377
QTAILQ_FOREACH(client, &exp->clients, next) {
378
- nbd_set_handlers(client);
379
+ qio_channel_attach_aio_context(client->ioc, ctx);
380
+ if (client->recv_coroutine) {
381
+ aio_co_schedule(ctx, client->recv_coroutine);
382
+ }
810
+ }
383
+ if (client->send_coroutine) {
811
+ }
384
+ aio_co_schedule(ctx, client->send_coroutine);
812
+ return ret;
385
+ }
813
+}
386
}
814
+
387
}
815
+static const cmdinfo_t zone_report_cmd = {
388
816
+ .name = "zone_report",
389
@@ -XXX,XX +XXX,XX @@ static void blk_aio_detach(void *opaque)
817
+ .altname = "zrp",
390
TRACE("Export %s: Detaching clients from AIO context %p\n", exp->name, exp->ctx);
818
+ .cfunc = zone_report_f,
391
819
+ .argmin = 2,
392
QTAILQ_FOREACH(client, &exp->clients, next) {
820
+ .argmax = 2,
393
- nbd_unset_handlers(client);
821
+ .args = "offset number",
394
+ qio_channel_detach_aio_context(client->ioc);
822
+ .oneline = "report zone information",
395
}
823
+};
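
For illustration, once these commands are registered the new operations can
be exercised from qemu-io (hypothetical session; /dev/nullb0 is a zoned
null_blk device, set up exactly as in the iotest added later in this series):

    $ qemu-io --image-opts -n driver=host_device,filename=/dev/nullb0 \
          -c "zrp 0 1" -c "zo 0 268435456" -c "zrp 0 1"
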
396
824
+
397
exp->ctx = NULL;
825
+static int zone_open_f(BlockBackend *blk, int argc, char **argv)
398
@@ -XXX,XX +XXX,XX @@ static ssize_t nbd_co_send_reply(NBDRequestData *req, NBDReply *reply,
826
+{
399
g_assert(qemu_in_coroutine());
827
+ int ret;
400
qemu_co_mutex_lock(&client->send_lock);
828
+ int64_t offset, len;
401
client->send_coroutine = qemu_coroutine_self();
829
+ ++optind;
402
- nbd_set_handlers(client);
830
+ offset = cvtnum(argv[optind]);
403
831
+ ++optind;
404
if (!len) {
832
+ len = cvtnum(argv[optind]);
405
rc = nbd_send_reply(client->ioc, reply);
833
+ ret = blk_zone_mgmt(blk, BLK_ZO_OPEN, offset, len);
406
@@ -XXX,XX +XXX,XX @@ static ssize_t nbd_co_send_reply(NBDRequestData *req, NBDReply *reply,
834
+ if (ret < 0) {
407
}
835
+ printf("zone open failed: %s\n", strerror(-ret));
408
836
+ }
409
client->send_coroutine = NULL;
837
+ return ret;
410
- nbd_set_handlers(client);
838
+}
411
qemu_co_mutex_unlock(&client->send_lock);
839
+
412
return rc;
840
+static const cmdinfo_t zone_open_cmd = {
413
}
841
+ .name = "zone_open",
414
@@ -XXX,XX +XXX,XX @@ static ssize_t nbd_co_receive_request(NBDRequestData *req,
842
+ .altname = "zo",
415
ssize_t rc;
843
+ .cfunc = zone_open_f,
416
844
+ .argmin = 2,
417
g_assert(qemu_in_coroutine());
845
+ .argmax = 2,
418
- client->recv_coroutine = qemu_coroutine_self();
846
+ .args = "offset len",
419
- nbd_update_can_read(client);
847
+ .oneline = "explicit open a range of zones in zone block device",
420
-
848
+};
421
+ assert(client->recv_coroutine == qemu_coroutine_self());
849
+
422
rc = nbd_receive_request(client->ioc, request);
850
+static int zone_close_f(BlockBackend *blk, int argc, char **argv)
423
if (rc < 0) {
851
+{
424
if (rc != -EAGAIN) {
852
+ int ret;
425
@@ -XXX,XX +XXX,XX @@ static ssize_t nbd_co_receive_request(NBDRequestData *req,
853
+ int64_t offset, len;
426
854
+ ++optind;
427
out:
855
+ offset = cvtnum(argv[optind]);
428
client->recv_coroutine = NULL;
856
+ ++optind;
429
- nbd_update_can_read(client);
857
+ len = cvtnum(argv[optind]);
430
+ nbd_client_receive_next_request(client);
858
+ ret = blk_zone_mgmt(blk, BLK_ZO_CLOSE, offset, len);
431
859
+ if (ret < 0) {
432
return rc;
860
+ printf("zone close failed: %s\n", strerror(-ret));
433
}
861
+ }
434
862
+ return ret;
435
-static void nbd_trip(void *opaque)
863
+}
436
+/* Owns a reference to the NBDClient passed as opaque. */
864
+
437
+static coroutine_fn void nbd_trip(void *opaque)
865
+static const cmdinfo_t zone_close_cmd = {
438
{
866
+ .name = "zone_close",
439
NBDClient *client = opaque;
867
+ .altname = "zc",
440
NBDExport *exp = client->exp;
868
+ .cfunc = zone_close_f,
441
NBDRequestData *req;
869
+ .argmin = 2,
442
- NBDRequest request;
870
+ .argmax = 2,
443
+ NBDRequest request = { 0 }; /* GCC thinks it can be used uninitialized */
871
+ .args = "offset len",
444
NBDReply reply;
872
+ .oneline = "close a range of zones in zone block device",
445
ssize_t ret;
873
+};
446
int flags;
874
+
447
875
+static int zone_finish_f(BlockBackend *blk, int argc, char **argv)
448
TRACE("Reading request.");
876
+{
449
if (client->closing) {
877
+ int ret;
450
+ nbd_client_put(client);
878
+ int64_t offset, len;
451
return;
879
+ ++optind;
452
}
880
+ offset = cvtnum(argv[optind]);
453
881
+ ++optind;
454
@@ -XXX,XX +XXX,XX @@ static void nbd_trip(void *opaque)
882
+ len = cvtnum(argv[optind]);
455
883
+ ret = blk_zone_mgmt(blk, BLK_ZO_FINISH, offset, len);
456
done:
884
+ if (ret < 0) {
457
nbd_request_put(req);
885
+ printf("zone finish failed: %s\n", strerror(-ret));
458
+ nbd_client_put(client);
886
+ }
459
return;
887
+ return ret;
460
888
+}
461
out:
889
+
462
nbd_request_put(req);
890
+static const cmdinfo_t zone_finish_cmd = {
463
client_close(client);
891
+ .name = "zone_finish",
464
+ nbd_client_put(client);
892
+ .altname = "zf",
465
}
893
+ .cfunc = zone_finish_f,
466
894
+ .argmin = 2,
467
-static void nbd_read(void *opaque)
895
+ .argmax = 2,
468
+static void nbd_client_receive_next_request(NBDClient *client)
896
+ .args = "offset len",
469
{
897
+ .oneline = "finish a range of zones in zone block device",
470
- NBDClient *client = opaque;
898
+};
471
-
899
+
472
- if (client->recv_coroutine) {
900
+static int zone_reset_f(BlockBackend *blk, int argc, char **argv)
473
- qemu_coroutine_enter(client->recv_coroutine);
901
+{
474
- } else {
902
+ int ret;
475
- qemu_coroutine_enter(qemu_coroutine_create(nbd_trip, client));
903
+ int64_t offset, len;
476
- }
904
+ ++optind;
477
-}
905
+ offset = cvtnum(argv[optind]);
478
-
906
+ ++optind;
479
-static void nbd_restart_write(void *opaque)
907
+ len = cvtnum(argv[optind]);
480
-{
908
+ ret = blk_zone_mgmt(blk, BLK_ZO_RESET, offset, len);
481
- NBDClient *client = opaque;
909
+ if (ret < 0) {
482
-
910
+ printf("zone reset failed: %s\n", strerror(-ret));
483
- qemu_coroutine_enter(client->send_coroutine);
911
+ }
484
-}
912
+ return ret;
485
-
913
+}
486
-static void nbd_set_handlers(NBDClient *client)
914
+
487
-{
915
+static const cmdinfo_t zone_reset_cmd = {
488
- if (client->exp && client->exp->ctx) {
916
+ .name = "zone_reset",
489
- aio_set_fd_handler(client->exp->ctx, client->sioc->fd, true,
917
+ .altname = "zrs",
490
- client->can_read ? nbd_read : NULL,
918
+ .cfunc = zone_reset_f,
491
- client->send_coroutine ? nbd_restart_write : NULL,
919
+ .argmin = 2,
492
- NULL, client);
920
+ .argmax = 2,
493
- }
921
+ .args = "offset len",
494
-}
922
+ .oneline = "reset a zone write pointer in zone block device",
495
-
923
+};
496
-static void nbd_unset_handlers(NBDClient *client)
924
+
497
-{
925
static int truncate_f(BlockBackend *blk, int argc, char **argv);
498
- if (client->exp && client->exp->ctx) {
926
static const cmdinfo_t truncate_cmd = {
499
- aio_set_fd_handler(client->exp->ctx, client->sioc->fd, true, NULL,
927
.name = "truncate",
500
- NULL, NULL, NULL);
928
@@ -XXX,XX +XXX,XX @@ static void __attribute((constructor)) init_qemuio_commands(void)
501
- }
929
qemuio_add_command(&aio_write_cmd);
502
-}
930
qemuio_add_command(&aio_flush_cmd);
503
-
931
qemuio_add_command(&flush_cmd);
504
-static void nbd_update_can_read(NBDClient *client)
932
+ qemuio_add_command(&zone_report_cmd);
505
-{
933
+ qemuio_add_command(&zone_open_cmd);
506
- bool can_read = client->recv_coroutine ||
934
+ qemuio_add_command(&zone_close_cmd);
507
- client->nb_requests < MAX_NBD_REQUESTS;
935
+ qemuio_add_command(&zone_finish_cmd);
508
-
936
+ qemuio_add_command(&zone_reset_cmd);
509
- if (can_read != client->can_read) {
937
qemuio_add_command(&truncate_cmd);
510
- client->can_read = can_read;
938
qemuio_add_command(&length_cmd);
511
- nbd_set_handlers(client);
939
qemuio_add_command(&info_cmd);
512
-
513
- /* There is no need to invoke aio_notify(), since aio_set_fd_handler()
514
- * in nbd_set_handlers() will have taken care of that */
515
+ if (!client->recv_coroutine && client->nb_requests < MAX_NBD_REQUESTS) {
516
+ nbd_client_get(client);
517
+ client->recv_coroutine = qemu_coroutine_create(nbd_trip, client);
518
+ aio_co_schedule(client->exp->ctx, client->recv_coroutine);
519
}
520
}
521
522
@@ -XXX,XX +XXX,XX @@ static coroutine_fn void nbd_co_client_start(void *opaque)
523
goto out;
524
}
525
qemu_co_mutex_init(&client->send_lock);
526
- nbd_set_handlers(client);
527
528
if (exp) {
529
QTAILQ_INSERT_TAIL(&exp->clients, client, next);
530
}
531
+
532
+ nbd_client_receive_next_request(client);
533
+
534
out:
535
g_free(data);
536
}
537
@@ -XXX,XX +XXX,XX @@ void nbd_client_new(NBDExport *exp,
538
object_ref(OBJECT(client->sioc));
539
client->ioc = QIO_CHANNEL(sioc);
540
object_ref(OBJECT(client->ioc));
541
- client->can_read = true;
542
client->close = close_fn;
543
544
data->client = client;
545
--
2.9.3

--
2.40.1
From: Paolo Bonzini <pbonzini@redhat.com>
1
From: Sam Li <faithilikerun@gmail.com>
2
2
3
This uses the lock-free mutex described in the paper '"Blocking without
3
raw-format driver usually sits on top of file-posix driver. It needs to
4
Locking", or LFTHREADS: A lock-free thread library' by Gidenstam and
4
pass through requests of zone commands.
5
Papatriantafilou. The same technique is used in OSv, and in fact
6
the code is essentially a conversion to C of OSv's code.
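
As a sketch (not part of the patch), the point of the conversion is that a
CoMutex can now be shared by coroutines running in different AioContexts;
the fast path is a single atomic fetch-and-increment:

    static CoMutex comutex;
    static int shared;

    static void coroutine_fn worker(void *opaque)
    {
        qemu_co_mutex_lock(&comutex);   /* uncontended: one atomic inc */
        shared++;                       /* critical section */
        qemu_co_mutex_unlock(&comutex); /* may hand off to a waiter */
    }

This mirrors what the new tests/test-aio-multithread.c testcase does.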
7
5
8
[Added missing coroutine_fn in tests/test-aio-multithread.c.
6
Signed-off-by: Sam Li <faithilikerun@gmail.com>
7
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
8
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
9
Reviewed-by: Hannes Reinecke <hare@suse.de>
10
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
11
Acked-by: Kevin Wolf <kwolf@redhat.com>
12
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
13
Message-id: 20230508045533.175575-5-faithilikerun@gmail.com
14
Message-id: 20230324090605.28361-5-faithilikerun@gmail.com
15
[Adjust commit message prefix as suggested by Philippe Mathieu-Daudé
16
<philmd@linaro.org>.
9
--Stefan]
17
--Stefan]
10
11
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
12
Reviewed-by: Fam Zheng <famz@redhat.com>
13
Message-id: 20170213181244.16297-2-pbonzini@redhat.com
14
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
18
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
15
---
19
---
16
include/qemu/coroutine.h | 17 ++++-
20
block/raw-format.c | 17 +++++++++++++++++
17
tests/test-aio-multithread.c | 86 ++++++++++++++++++++++++
21
1 file changed, 17 insertions(+)
18
util/qemu-coroutine-lock.c | 155 ++++++++++++++++++++++++++++++++++++++++---
19
util/trace-events | 1 +
20
4 files changed, 246 insertions(+), 13 deletions(-)
21
22
22
diff --git a/include/qemu/coroutine.h b/include/qemu/coroutine.h
23
diff --git a/block/raw-format.c b/block/raw-format.c
23
index XXXXXXX..XXXXXXX 100644
24
index XXXXXXX..XXXXXXX 100644
24
--- a/include/qemu/coroutine.h
25
--- a/block/raw-format.c
25
+++ b/include/qemu/coroutine.h
26
+++ b/block/raw-format.c
26
@@ -XXX,XX +XXX,XX @@ bool qemu_co_queue_empty(CoQueue *queue);
27
@@ -XXX,XX +XXX,XX @@ raw_co_pdiscard(BlockDriverState *bs, int64_t offset, int64_t bytes)
27
/**
28
return bdrv_co_pdiscard(bs->file, offset, bytes);
28
* Provides a mutex that can be used to synchronise coroutines
29
*/
30
+struct CoWaitRecord;
31
typedef struct CoMutex {
32
- bool locked;
33
+ /* Count of pending lockers; 0 for a free mutex, 1 for an
34
+ * uncontended mutex.
35
+ */
36
+ unsigned locked;
37
+
38
+ /* A queue of waiters. Elements are added atomically in front of
39
+ * from_push. to_pop is only populated, and popped from, by whoever
40
+ * is in charge of the next wakeup. This can be an unlocker or,
41
+ * through the handoff protocol, a locker that is about to go to sleep.
42
+ */
43
+ QSLIST_HEAD(, CoWaitRecord) from_push, to_pop;
44
+
45
+ unsigned handoff, sequence;
46
+
47
Coroutine *holder;
48
- CoQueue queue;
49
} CoMutex;
50
51
/**
52
diff --git a/tests/test-aio-multithread.c b/tests/test-aio-multithread.c
53
index XXXXXXX..XXXXXXX 100644
54
--- a/tests/test-aio-multithread.c
55
+++ b/tests/test-aio-multithread.c
56
@@ -XXX,XX +XXX,XX @@ static void test_multi_co_schedule_10(void)
57
test_multi_co_schedule(10);
58
}
29
}
59
30
60
+/* CoMutex thread-safety. */
31
+static int coroutine_fn GRAPH_RDLOCK
61
+
32
+raw_co_zone_report(BlockDriverState *bs, int64_t offset,
62
+static uint32_t atomic_counter;
33
+ unsigned int *nr_zones,
63
+static uint32_t running;
34
+ BlockZoneDescriptor *zones)
64
+static uint32_t counter;
65
+static CoMutex comutex;
66
+
67
+static void coroutine_fn test_multi_co_mutex_entry(void *opaque)
68
+{
35
+{
69
+ while (!atomic_mb_read(&now_stopping)) {
36
+ return bdrv_co_zone_report(bs->file->bs, offset, nr_zones, zones);
70
+ qemu_co_mutex_lock(&comutex);
71
+ counter++;
72
+ qemu_co_mutex_unlock(&comutex);
73
+
74
+ /* Increase atomic_counter *after* releasing the mutex. Otherwise
75
+ * there is a chance (it happens about 1 in 3 runs) that the iothread
76
+ * exits before the coroutine is woken up, causing a spurious
77
+ * assertion failure.
78
+ */
79
+ atomic_inc(&atomic_counter);
80
+ }
81
+ atomic_dec(&running);
82
+}
37
+}
83
+
38
+
84
+static void test_multi_co_mutex(int threads, int seconds)
39
+static int coroutine_fn GRAPH_RDLOCK
40
+raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
41
+ int64_t offset, int64_t len)
85
+{
42
+{
86
+ int i;
43
+ return bdrv_co_zone_mgmt(bs->file->bs, op, offset, len);
87
+
88
+ qemu_co_mutex_init(&comutex);
89
+ counter = 0;
90
+ atomic_counter = 0;
91
+ now_stopping = false;
92
+
93
+ create_aio_contexts();
94
+ assert(threads <= NUM_CONTEXTS);
95
+ running = threads;
96
+ for (i = 0; i < threads; i++) {
97
+ Coroutine *co1 = qemu_coroutine_create(test_multi_co_mutex_entry, NULL);
98
+ aio_co_schedule(ctx[i], co1);
99
+ }
100
+
101
+ g_usleep(seconds * 1000000);
102
+
103
+ atomic_mb_set(&now_stopping, true);
104
+ while (running > 0) {
105
+ g_usleep(100000);
106
+ }
107
+
108
+ join_aio_contexts();
109
+ g_test_message("%d iterations/second\n", counter / seconds);
110
+ g_assert_cmpint(counter, ==, atomic_counter);
111
+}
44
+}
112
+
45
+
113
+/* Testing with NUM_CONTEXTS threads focuses on the queue. The mutex however
46
static int64_t coroutine_fn GRAPH_RDLOCK
114
+ * is too contended (and the threads spend too much time in aio_poll)
47
raw_co_getlength(BlockDriverState *bs)
115
+ * to actually stress the handoff protocol.
116
+ */
117
+static void test_multi_co_mutex_1(void)
118
+{
119
+ test_multi_co_mutex(NUM_CONTEXTS, 1);
120
+}
121
+
122
+static void test_multi_co_mutex_10(void)
123
+{
124
+ test_multi_co_mutex(NUM_CONTEXTS, 10);
125
+}
126
+
127
+/* Testing with fewer threads stresses the handoff protocol too. Still, the
128
+ * case where the locker _can_ pick up a handoff is very rare, happening
129
+ * about 10 times in 1 million, so increase the runtime a bit compared to
130
+ * other "quick" testcases that only run for 1 second.
131
+ */
132
+static void test_multi_co_mutex_2_3(void)
133
+{
134
+ test_multi_co_mutex(2, 3);
135
+}
136
+
137
+static void test_multi_co_mutex_2_30(void)
138
+{
139
+ test_multi_co_mutex(2, 30);
140
+}
141
+
142
/* End of tests. */
143
144
int main(int argc, char **argv)
145
@@ -XXX,XX +XXX,XX @@ int main(int argc, char **argv)
146
g_test_add_func("/aio/multi/lifecycle", test_lifecycle);
147
if (g_test_quick()) {
148
g_test_add_func("/aio/multi/schedule", test_multi_co_schedule_1);
149
+ g_test_add_func("/aio/multi/mutex/contended", test_multi_co_mutex_1);
150
+ g_test_add_func("/aio/multi/mutex/handoff", test_multi_co_mutex_2_3);
151
} else {
152
g_test_add_func("/aio/multi/schedule", test_multi_co_schedule_10);
153
+ g_test_add_func("/aio/multi/mutex/contended", test_multi_co_mutex_10);
154
+ g_test_add_func("/aio/multi/mutex/handoff", test_multi_co_mutex_2_30);
155
}
156
return g_test_run();
157
}
158
diff --git a/util/qemu-coroutine-lock.c b/util/qemu-coroutine-lock.c
159
index XXXXXXX..XXXXXXX 100644
160
--- a/util/qemu-coroutine-lock.c
161
+++ b/util/qemu-coroutine-lock.c
162
@@ -XXX,XX +XXX,XX @@
163
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
164
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
165
* THE SOFTWARE.
166
+ *
167
+ * The lock-free mutex implementation is based on OSv
168
+ * (core/lfmutex.cc, include/lockfree/mutex.hh).
169
+ * Copyright (C) 2013 Cloudius Systems, Ltd.
170
*/
171
172
#include "qemu/osdep.h"
173
@@ -XXX,XX +XXX,XX @@ bool qemu_co_queue_empty(CoQueue *queue)
174
return QSIMPLEQ_FIRST(&queue->entries) == NULL;
175
}
176
177
+/* The wait records are handled with a multiple-producer, single-consumer
178
+ * lock-free queue. There cannot be two concurrent pop_waiter() calls
179
+ * because pop_waiter() can only be called while mutex->handoff is zero.
180
+ * This can happen in three cases:
181
+ * - in qemu_co_mutex_unlock, before the hand-off protocol has started.
182
+ * In this case, qemu_co_mutex_lock will see mutex->handoff == 0 and
183
+ * not take part in the handoff.
184
+ * - in qemu_co_mutex_lock, if it steals the hand-off responsibility from
185
+ * qemu_co_mutex_unlock. In this case, qemu_co_mutex_unlock will fail
186
+ * the cmpxchg (it will see either 0 or the next sequence value) and
187
+ * exit. The next hand-off cannot begin until qemu_co_mutex_lock has
188
+ * woken up someone.
189
+ * - in qemu_co_mutex_unlock, if it takes the hand-off token itself.
190
+ * In this case another iteration starts with mutex->handoff == 0;
191
+ * a concurrent qemu_co_mutex_lock will fail the cmpxchg, and
192
+ * qemu_co_mutex_unlock will go back to case (1).
193
+ *
194
+ * The following functions manage this queue.
195
+ */
196
+typedef struct CoWaitRecord {
197
+ Coroutine *co;
198
+ QSLIST_ENTRY(CoWaitRecord) next;
199
+} CoWaitRecord;
200
+
201
+static void push_waiter(CoMutex *mutex, CoWaitRecord *w)
202
+{
203
+ w->co = qemu_coroutine_self();
204
+ QSLIST_INSERT_HEAD_ATOMIC(&mutex->from_push, w, next);
205
+}
206
+
207
+static void move_waiters(CoMutex *mutex)
208
+{
209
+ QSLIST_HEAD(, CoWaitRecord) reversed;
210
+ QSLIST_MOVE_ATOMIC(&reversed, &mutex->from_push);
211
+ while (!QSLIST_EMPTY(&reversed)) {
212
+ CoWaitRecord *w = QSLIST_FIRST(&reversed);
213
+ QSLIST_REMOVE_HEAD(&reversed, next);
214
+ QSLIST_INSERT_HEAD(&mutex->to_pop, w, next);
215
+ }
216
+}
217
+
218
+static CoWaitRecord *pop_waiter(CoMutex *mutex)
219
+{
220
+ CoWaitRecord *w;
221
+
222
+ if (QSLIST_EMPTY(&mutex->to_pop)) {
223
+ move_waiters(mutex);
224
+ if (QSLIST_EMPTY(&mutex->to_pop)) {
225
+ return NULL;
226
+ }
227
+ }
228
+ w = QSLIST_FIRST(&mutex->to_pop);
229
+ QSLIST_REMOVE_HEAD(&mutex->to_pop, next);
230
+ return w;
231
+}
232
+
233
+static bool has_waiters(CoMutex *mutex)
234
+{
235
+ return QSLIST_EMPTY(&mutex->to_pop) || QSLIST_EMPTY(&mutex->from_push);
236
+}
237
+
238
void qemu_co_mutex_init(CoMutex *mutex)
239
{
48
{
240
memset(mutex, 0, sizeof(*mutex));
49
@@ -XXX,XX +XXX,XX @@ BlockDriver bdrv_raw = {
241
- qemu_co_queue_init(&mutex->queue);
50
.bdrv_co_pwritev = &raw_co_pwritev,
242
}
51
.bdrv_co_pwrite_zeroes = &raw_co_pwrite_zeroes,
243
52
.bdrv_co_pdiscard = &raw_co_pdiscard,
244
-void coroutine_fn qemu_co_mutex_lock(CoMutex *mutex)
53
+ .bdrv_co_zone_report = &raw_co_zone_report,
245
+static void coroutine_fn qemu_co_mutex_lock_slowpath(CoMutex *mutex)
54
+ .bdrv_co_zone_mgmt = &raw_co_zone_mgmt,
246
{
55
.bdrv_co_block_status = &raw_co_block_status,
247
Coroutine *self = qemu_coroutine_self();
56
.bdrv_co_copy_range_from = &raw_co_copy_range_from,
248
+ CoWaitRecord w;
57
.bdrv_co_copy_range_to = &raw_co_copy_range_to,
249
+ unsigned old_handoff;
250
251
trace_qemu_co_mutex_lock_entry(mutex, self);
252
+ w.co = self;
253
+ push_waiter(mutex, &w);
254
255
- while (mutex->locked) {
256
- qemu_co_queue_wait(&mutex->queue);
257
+ /* This is the "Responsibility Hand-Off" protocol; a lock() picks from
258
+ * a concurrent unlock() the responsibility of waking somebody up.
259
+ */
260
+ old_handoff = atomic_mb_read(&mutex->handoff);
261
+ if (old_handoff &&
262
+ has_waiters(mutex) &&
263
+ atomic_cmpxchg(&mutex->handoff, old_handoff, 0) == old_handoff) {
264
+ /* There can be no concurrent pops, because there can be only
265
+ * one active handoff at a time.
266
+ */
267
+ CoWaitRecord *to_wake = pop_waiter(mutex);
268
+ Coroutine *co = to_wake->co;
269
+ if (co == self) {
270
+ /* We got the lock ourselves! */
271
+ assert(to_wake == &w);
272
+ return;
273
+ }
274
+
275
+ aio_co_wake(co);
276
}
277
278
- mutex->locked = true;
279
- mutex->holder = self;
280
- self->locks_held++;
281
-
282
+ qemu_coroutine_yield();
283
trace_qemu_co_mutex_lock_return(mutex, self);
284
}
285
286
+void coroutine_fn qemu_co_mutex_lock(CoMutex *mutex)
287
+{
288
+ Coroutine *self = qemu_coroutine_self();
289
+
290
+ if (atomic_fetch_inc(&mutex->locked) == 0) {
291
+ /* Uncontended. */
292
+ trace_qemu_co_mutex_lock_uncontended(mutex, self);
293
+ } else {
294
+ qemu_co_mutex_lock_slowpath(mutex);
295
+ }
296
+ mutex->holder = self;
297
+ self->locks_held++;
298
+}
299
+
300
void coroutine_fn qemu_co_mutex_unlock(CoMutex *mutex)
301
{
302
Coroutine *self = qemu_coroutine_self();
303
304
trace_qemu_co_mutex_unlock_entry(mutex, self);
305
306
- assert(mutex->locked == true);
307
+ assert(mutex->locked);
308
assert(mutex->holder == self);
309
assert(qemu_in_coroutine());
310
311
- mutex->locked = false;
312
mutex->holder = NULL;
313
self->locks_held--;
314
- qemu_co_queue_next(&mutex->queue);
315
+ if (atomic_fetch_dec(&mutex->locked) == 1) {
316
+ /* No waiting qemu_co_mutex_lock(). Pfew, that was easy! */
317
+ return;
318
+ }
319
+
320
+ for (;;) {
321
+ CoWaitRecord *to_wake = pop_waiter(mutex);
322
+ unsigned our_handoff;
323
+
324
+ if (to_wake) {
325
+ Coroutine *co = to_wake->co;
326
+ aio_co_wake(co);
327
+ break;
328
+ }
329
+
330
+ /* Some concurrent lock() is in progress (we know this because
331
+ * mutex->locked was >1) but it hasn't yet put itself on the wait
332
+ * queue. Pick a sequence number for the handoff protocol (not 0).
333
+ */
334
+ if (++mutex->sequence == 0) {
335
+ mutex->sequence = 1;
336
+ }
337
+
338
+ our_handoff = mutex->sequence;
339
+ atomic_mb_set(&mutex->handoff, our_handoff);
340
+ if (!has_waiters(mutex)) {
341
+ /* The concurrent lock has not added itself yet, so it
342
+ * will be able to pick our handoff.
343
+ */
344
+ break;
345
+ }
346
+
347
+ /* Try to do the handoff protocol ourselves; if somebody else has
348
+ * already taken it, however, we're done and they're responsible.
349
+ */
350
+ if (atomic_cmpxchg(&mutex->handoff, our_handoff, 0) != our_handoff) {
351
+ break;
352
+ }
353
+ }
354
355
trace_qemu_co_mutex_unlock_return(mutex, self);
356
}
357
diff --git a/util/trace-events b/util/trace-events
358
index XXXXXXX..XXXXXXX 100644
359
--- a/util/trace-events
360
+++ b/util/trace-events
361
@@ -XXX,XX +XXX,XX @@ qemu_coroutine_terminate(void *co) "self %p"
362
363
# util/qemu-coroutine-lock.c
364
qemu_co_queue_run_restart(void *co) "co %p"
365
+qemu_co_mutex_lock_uncontended(void *mutex, void *self) "mutex %p self %p"
366
qemu_co_mutex_lock_entry(void *mutex, void *self) "mutex %p self %p"
367
qemu_co_mutex_lock_return(void *mutex, void *self) "mutex %p self %p"
368
qemu_co_mutex_unlock_entry(void *mutex, void *self) "mutex %p self %p"
369
--
2.9.3

--
2.40.1
From: Paolo Bonzini <pbonzini@redhat.com>
1
From: Sam Li <faithilikerun@gmail.com>
2
2
3
Pull the increment/decrement pair out of aio_bh_poll and into the
3
Putting zoned/non-zoned BlockDrivers on top of each other is not
4
callers.
4
allowed.
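
In other words, the calling convention changes as follows (sketch; the hunks
below apply it to aio_dispatch and aio_poll):

    qemu_lockcnt_inc(&ctx->list_lock);
    aio_bh_poll(ctx);                 /* no longer inc/decs by itself */
    aio_dispatch_handlers(ctx);
    qemu_lockcnt_dec(&ctx->list_lock);
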
5
5
6
Signed-off-by: Sam Li <faithilikerun@gmail.com>
6
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
7
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
7
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8
Reviewed-by: Hannes Reinecke <hare@suse.de>
8
Reviewed-by: Fam Zheng <famz@redhat.com>
9
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
9
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
10
Acked-by: Kevin Wolf <kwolf@redhat.com>
10
Message-id: 20170213135235.12274-18-pbonzini@redhat.com
11
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
12
Message-id: 20230508045533.175575-6-faithilikerun@gmail.com
13
Message-id: 20230324090605.28361-6-faithilikerun@gmail.com
14
[Adjust commit message prefix as suggested by Philippe Mathieu-Daudé
15
<philmd@linaro.org> and clarify that the check is about zoned
16
BlockDrivers.
17
--Stefan]
11
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
18
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
12
---
19
---
13
util/aio-posix.c | 8 +++-----
20
include/block/block_int-common.h | 5 +++++
14
util/aio-win32.c | 8 ++++----
21
block.c | 19 +++++++++++++++++++
15
util/async.c | 12 ++++++------
22
block/file-posix.c | 12 ++++++++++++
16
3 files changed, 13 insertions(+), 15 deletions(-)
23
block/raw-format.c | 1 +
24
4 files changed, 37 insertions(+)
17
25
18
diff --git a/util/aio-posix.c b/util/aio-posix.c
26
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
19
index XXXXXXX..XXXXXXX 100644
27
index XXXXXXX..XXXXXXX 100644
20
--- a/util/aio-posix.c
28
--- a/include/block/block_int-common.h
21
+++ b/util/aio-posix.c
29
+++ b/include/block/block_int-common.h
22
@@ -XXX,XX +XXX,XX @@ static bool aio_dispatch_handlers(AioContext *ctx)
30
@@ -XXX,XX +XXX,XX @@ struct BlockDriver {
23
31
*/
24
void aio_dispatch(AioContext *ctx)
32
bool is_format;
25
{
33
26
+ qemu_lockcnt_inc(&ctx->list_lock);
34
+ /*
27
aio_bh_poll(ctx);
35
+ * Set to true if the BlockDriver supports zoned children.
28
-
36
+ */
29
- qemu_lockcnt_inc(&ctx->list_lock);
37
+ bool supports_zoned_children;
30
aio_dispatch_handlers(ctx);
38
+
31
qemu_lockcnt_dec(&ctx->list_lock);
39
/*
32
40
* Drivers not implementing bdrv_parse_filename nor bdrv_open should have
33
@@ -XXX,XX +XXX,XX @@ bool aio_poll(AioContext *ctx, bool blocking)
41
* this field set to true, except ones that are defined only by their
42
diff --git a/block.c b/block.c
43
index XXXXXXX..XXXXXXX 100644
44
--- a/block.c
45
+++ b/block.c
46
@@ -XXX,XX +XXX,XX @@ void bdrv_add_child(BlockDriverState *parent_bs, BlockDriverState *child_bs,
47
return;
34
}
48
}
35
49
36
npfd = 0;
50
+ /*
37
- qemu_lockcnt_dec(&ctx->list_lock);
51
+ * Non-zoned block drivers do not follow zoned storage constraints
38
52
+ * (i.e. sequential writes to zones). Refuse mixing zoned and non-zoned
39
progress |= aio_bh_poll(ctx);
53
+ * drivers in a graph.
40
54
+ */
41
if (ret > 0) {
55
+ if (!parent_bs->drv->supports_zoned_children &&
42
- qemu_lockcnt_inc(&ctx->list_lock);
56
+ child_bs->bl.zoned == BLK_Z_HM) {
43
progress |= aio_dispatch_handlers(ctx);
57
+ /*
44
- qemu_lockcnt_dec(&ctx->list_lock);
58
+ * The host-aware model allows zoned storage constraints and random
45
}
59
+ * write. Allow mixing host-aware and non-zoned drivers. Using
46
60
+ * host-aware device as a regular device.
47
+ qemu_lockcnt_dec(&ctx->list_lock);
61
+ */
62
+ error_setg(errp, "Cannot add a %s child to a %s parent",
63
+ child_bs->bl.zoned == BLK_Z_HM ? "zoned" : "non-zoned",
64
+ parent_bs->drv->supports_zoned_children ?
65
+ "support zoned children" : "not support zoned children");
66
+ return;
67
+ }
48
+
68
+
49
progress |= timerlistgroup_run_timers(&ctx->tlg);
69
if (!QLIST_EMPTY(&child_bs->parents)) {
50
70
error_setg(errp, "The node %s already has a parent",
51
return progress;
71
child_bs->node_name);
52
diff --git a/util/aio-win32.c b/util/aio-win32.c
72
diff --git a/block/file-posix.c b/block/file-posix.c
53
index XXXXXXX..XXXXXXX 100644
73
index XXXXXXX..XXXXXXX 100644
54
--- a/util/aio-win32.c
74
--- a/block/file-posix.c
55
+++ b/util/aio-win32.c
75
+++ b/block/file-posix.c
56
@@ -XXX,XX +XXX,XX @@ static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
76
@@ -XXX,XX +XXX,XX @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
57
bool progress = false;
77
goto fail;
58
AioHandler *tmp;
59
60
- qemu_lockcnt_inc(&ctx->list_lock);
61
-
62
/*
63
* We have to walk very carefully in case aio_set_fd_handler is
64
* called while we're walking.
65
@@ -XXX,XX +XXX,XX @@ static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
66
}
78
}
67
}
79
}
68
80
+#ifdef CONFIG_BLKZONED
69
- qemu_lockcnt_dec(&ctx->list_lock);
81
+ /*
70
return progress;
82
+ * The kernel page cache does not reliably work for writes to SWR zones
71
}
83
+ * of zoned block device because it can not guarantee the order of writes.
72
84
+ */
73
void aio_dispatch(AioContext *ctx)
85
+ if ((bs->bl.zoned != BLK_Z_NONE) &&
74
{
86
+ (!(s->open_flags & O_DIRECT))) {
75
+ qemu_lockcnt_inc(&ctx->list_lock);
87
+ error_setg(errp, "The driver supports zoned devices, and it requires "
76
aio_bh_poll(ctx);
88
+ "cache.direct=on, which was not specified.");
77
aio_dispatch_handlers(ctx, INVALID_HANDLE_VALUE);
89
+ return -EINVAL; /* No host kernel page cache */
78
+ qemu_lockcnt_dec(&ctx->list_lock);
90
+ }
79
timerlistgroup_run_timers(&ctx->tlg);
91
+#endif
80
}
92
81
93
if (S_ISBLK(st.st_mode)) {
82
@@ -XXX,XX +XXX,XX @@ bool aio_poll(AioContext *ctx, bool blocking)
94
#ifdef __linux__
83
}
95
diff --git a/block/raw-format.c b/block/raw-format.c
84
}
85
86
- qemu_lockcnt_dec(&ctx->list_lock);
87
first = true;
88
89
/* ctx->notifier is always registered. */
90
@@ -XXX,XX +XXX,XX @@ bool aio_poll(AioContext *ctx, bool blocking)
91
progress |= aio_dispatch_handlers(ctx, event);
92
} while (count > 0);
93
94
+ qemu_lockcnt_dec(&ctx->list_lock);
95
+
96
progress |= timerlistgroup_run_timers(&ctx->tlg);
97
return progress;
98
}
99
diff --git a/util/async.c b/util/async.c
100
index XXXXXXX..XXXXXXX 100644
96
index XXXXXXX..XXXXXXX 100644
101
--- a/util/async.c
97
--- a/block/raw-format.c
102
+++ b/util/async.c
98
+++ b/block/raw-format.c
103
@@ -XXX,XX +XXX,XX @@ void aio_bh_call(QEMUBH *bh)
99
@@ -XXX,XX +XXX,XX @@ static void raw_child_perm(BlockDriverState *bs, BdrvChild *c,
104
bh->cb(bh->opaque);
100
BlockDriver bdrv_raw = {
105
}
101
.format_name = "raw",
106
102
.instance_size = sizeof(BDRVRawState),
107
-/* Multiple occurrences of aio_bh_poll cannot be called concurrently */
103
+ .supports_zoned_children = true,
108
+/* Multiple occurrences of aio_bh_poll cannot be called concurrently.
104
.bdrv_probe = &raw_probe,
109
+ * The count in ctx->list_lock is incremented before the call, and is
105
.bdrv_reopen_prepare = &raw_reopen_prepare,
110
+ * not affected by the call.
106
.bdrv_reopen_commit = &raw_reopen_commit,
111
+ */
112
int aio_bh_poll(AioContext *ctx)
113
{
114
QEMUBH *bh, **bhp, *next;
115
int ret;
116
bool deleted = false;
117
118
- qemu_lockcnt_inc(&ctx->list_lock);
119
-
120
ret = 0;
121
for (bh = atomic_rcu_read(&ctx->first_bh); bh; bh = next) {
122
next = atomic_rcu_read(&bh->next);
123
@@ -XXX,XX +XXX,XX @@ int aio_bh_poll(AioContext *ctx)
124
125
/* remove deleted bhs */
126
if (!deleted) {
127
- qemu_lockcnt_dec(&ctx->list_lock);
128
return ret;
129
}
130
131
- if (qemu_lockcnt_dec_and_lock(&ctx->list_lock)) {
132
+ if (qemu_lockcnt_dec_if_lock(&ctx->list_lock)) {
133
bhp = &ctx->first_bh;
134
while (*bhp) {
135
bh = *bhp;
136
@@ -XXX,XX +XXX,XX @@ int aio_bh_poll(AioContext *ctx)
137
bhp = &bh->next;
138
}
139
}
140
- qemu_lockcnt_unlock(&ctx->list_lock);
141
+ qemu_lockcnt_inc_and_unlock(&ctx->list_lock);
142
}
143
return ret;
144
}
145
--
2.9.3

--
2.40.1
From: Paolo Bonzini <pbonzini@redhat.com>
1
From: Sam Li <faithilikerun@gmail.com>
2
2
3
aio_co_wake provides the infrastructure to start a coroutine on a "home"
3
The new block layer APIs of zoned block devices can be tested by:
4
AioContext. It will be used by CoMutex and CoQueue, so that coroutines
4
$ tests/qemu-iotests/check zoned
5
don't jump from one context to another when they go to sleep on a
5
Run each zone operation on a newly created null_blk device
6
mutex or waitqueue. However, it can also be used as a more efficient
6
and see whether it outputs the same zone information.
7
alternative to one-shot bottom halves, and saves the effort of tracking
8
which AioContext a coroutine is running on.
9
7
10
aio_co_schedule is the part of aio_co_wake that starts a coroutine
8
Signed-off-by: Sam Li <faithilikerun@gmail.com>
11
on a remote AioContext, but it is also useful to implement e.g.
9
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
12
bdrv_set_aio_context callbacks.
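
For example (hypothetical completion callback, illustrating the intended
use; MyRequest and its fields are made up for this sketch):

    typedef struct {
        Coroutine *co;   /* coroutine waiting for this request */
        int ret;
    } MyRequest;

    static void request_done(void *opaque, int ret)
    {
        MyRequest *req = opaque;
        req->ret = ret;
        aio_co_wake(req->co);  /* resume on the coroutine's home context */
    }
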
10
Acked-by: Kevin Wolf <kwolf@redhat.com>
13
11
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
14
The implementation of aio_co_schedule is based on a lock-free
12
Message-id: 20230508045533.175575-7-faithilikerun@gmail.com
15
multiple-producer, single-consumer queue. The multiple producers use
13
Message-id: 20230324090605.28361-7-faithilikerun@gmail.com
16
cmpxchg to add to a LIFO stack. The consumer (a per-AioContext bottom
14
[Adjust commit message prefix as suggested by Philippe Mathieu-Daudé
17
half) grabs all items added so far, inverts the list to make it FIFO,
15
<philmd@linaro.org>.
18
and goes through it one item at a time until it's empty. The data
16
--Stefan]
19
structure was inspired by OSv, which uses it in the very code we'll
20
"port" to QEMU for the thread-safe CoMutex.
21
22
Most of the new code is really tests.
23
24
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
25
Reviewed-by: Fam Zheng <famz@redhat.com>
26
Message-id: 20170213135235.12274-3-pbonzini@redhat.com
27
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
17
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
28
---
18
---
29
tests/Makefile.include | 8 +-
19
tests/qemu-iotests/tests/zoned | 89 ++++++++++++++++++++++++++++++
30
include/block/aio.h | 32 +++++++
20
tests/qemu-iotests/tests/zoned.out | 53 ++++++++++++++++++
31
include/qemu/coroutine_int.h | 11 ++-
21
2 files changed, 142 insertions(+)
32
tests/iothread.h | 25 +++++
22
create mode 100755 tests/qemu-iotests/tests/zoned
33
tests/iothread.c | 91 ++++++++++++++++++
23
create mode 100644 tests/qemu-iotests/tests/zoned.out
34
tests/test-aio-multithread.c | 213 +++++++++++++++++++++++++++++++++++++++++++
35
util/async.c | 65 +++++++++++++
36
util/qemu-coroutine.c | 8 ++
37
util/trace-events | 4 +
38
9 files changed, 453 insertions(+), 4 deletions(-)
39
create mode 100644 tests/iothread.h
40
create mode 100644 tests/iothread.c
41
create mode 100644 tests/test-aio-multithread.c
42
24
43
diff --git a/tests/Makefile.include b/tests/Makefile.include
25
diff --git a/tests/qemu-iotests/tests/zoned b/tests/qemu-iotests/tests/zoned
44
index XXXXXXX..XXXXXXX 100644
26
new file mode 100755
45
--- a/tests/Makefile.include
27
index XXXXXXX..XXXXXXX
46
+++ b/tests/Makefile.include
28
--- /dev/null
47
@@ -XXX,XX +XXX,XX @@ check-unit-y += tests/test-aio$(EXESUF)
29
+++ b/tests/qemu-iotests/tests/zoned
48
gcov-files-test-aio-y = util/async.c util/qemu-timer.o
30
@@ -XXX,XX +XXX,XX @@
49
gcov-files-test-aio-$(CONFIG_WIN32) += util/aio-win32.c
31
+#!/usr/bin/env bash
50
gcov-files-test-aio-$(CONFIG_POSIX) += util/aio-posix.c
32
+#
51
+check-unit-y += tests/test-aio-multithread$(EXESUF)
33
+# Test zone management operations.
52
+gcov-files-test-aio-multithread-y = $(gcov-files-test-aio-y)
34
+#
53
+gcov-files-test-aio-multithread-y += util/qemu-coroutine.c tests/iothread.c
54
check-unit-y += tests/test-throttle$(EXESUF)
55
-gcov-files-test-aio-$(CONFIG_WIN32) = aio-win32.c
56
-gcov-files-test-aio-$(CONFIG_POSIX) = aio-posix.c
57
check-unit-y += tests/test-thread-pool$(EXESUF)
58
gcov-files-test-thread-pool-y = thread-pool.c
59
gcov-files-test-hbitmap-y = util/hbitmap.c
60
@@ -XXX,XX +XXX,XX @@ test-qapi-obj-y = tests/test-qapi-visit.o tests/test-qapi-types.o \
61
    $(test-qom-obj-y)
62
test-crypto-obj-y = $(crypto-obj-y) $(test-qom-obj-y)
63
test-io-obj-y = $(io-obj-y) $(test-crypto-obj-y)
64
-test-block-obj-y = $(block-obj-y) $(test-io-obj-y)
65
+test-block-obj-y = $(block-obj-y) $(test-io-obj-y) tests/iothread.o
66
67
tests/check-qint$(EXESUF): tests/check-qint.o $(test-util-obj-y)
68
tests/check-qstring$(EXESUF): tests/check-qstring.o $(test-util-obj-y)
69
@@ -XXX,XX +XXX,XX @@ tests/check-qom-proplist$(EXESUF): tests/check-qom-proplist.o $(test-qom-obj-y)
70
tests/test-char$(EXESUF): tests/test-char.o $(test-util-obj-y) $(qtest-obj-y) $(test-io-obj-y) $(chardev-obj-y)
71
tests/test-coroutine$(EXESUF): tests/test-coroutine.o $(test-block-obj-y)
72
tests/test-aio$(EXESUF): tests/test-aio.o $(test-block-obj-y)
73
+tests/test-aio-multithread$(EXESUF): tests/test-aio-multithread.o $(test-block-obj-y)
74
tests/test-throttle$(EXESUF): tests/test-throttle.o $(test-block-obj-y)
75
tests/test-blockjob$(EXESUF): tests/test-blockjob.o $(test-block-obj-y) $(test-util-obj-y)
76
tests/test-blockjob-txn$(EXESUF): tests/test-blockjob-txn.o $(test-block-obj-y) $(test-util-obj-y)
77
diff --git a/include/block/aio.h b/include/block/aio.h
78
index XXXXXXX..XXXXXXX 100644
79
--- a/include/block/aio.h
80
+++ b/include/block/aio.h
81
@@ -XXX,XX +XXX,XX @@ typedef void QEMUBHFunc(void *opaque);
82
typedef bool AioPollFn(void *opaque);
83
typedef void IOHandler(void *opaque);
84
85
+struct Coroutine;
86
struct ThreadPool;
87
struct LinuxAioState;
88
89
@@ -XXX,XX +XXX,XX @@ struct AioContext {
90
bool notified;
91
EventNotifier notifier;
92
93
+ QSLIST_HEAD(, Coroutine) scheduled_coroutines;
94
+ QEMUBH *co_schedule_bh;
95
+
35
+
96
/* Thread pool for performing work and receiving completion callbacks.
36
+seq="$(basename $0)"
97
* Has its own locking.
37
+echo "QA output created by $seq"
98
*/
38
+status=1 # failure is the default!
99
@@ -XXX,XX +XXX,XX @@ static inline bool aio_node_check(AioContext *ctx, bool is_external)
100
}
101
102
/**
103
+ * aio_co_schedule:
104
+ * @ctx: the aio context
105
+ * @co: the coroutine
106
+ *
107
+ * Start a coroutine on a remote AioContext.
108
+ *
109
+ * The coroutine must not be entered by anyone else while aio_co_schedule()
110
+ * is active. In addition the coroutine must have yielded unless ctx
111
+ * is the context in which the coroutine is running (i.e. the value of
112
+ * qemu_get_current_aio_context() from the coroutine itself).
113
+ */
114
+void aio_co_schedule(AioContext *ctx, struct Coroutine *co);
115
+
39
+
116
+/**
40
+_cleanup()
117
+ * aio_co_wake:
41
+{
118
+ * @co: the coroutine
42
+ _cleanup_test_img
119
+ *
43
+ sudo -n rmmod null_blk
120
+ * Restart a coroutine on the AioContext where it was running last, thus
44
+}
121
+ * preventing coroutines from jumping from one context to another when they
45
+trap "_cleanup; exit \$status" 0 1 2 3 15
122
+ * go to sleep.
123
+ *
124
+ * aio_co_wake may be executed either in coroutine or non-coroutine
125
+ * context. The coroutine must not be entered by anyone else while
126
+ * aio_co_wake() is active.
127
+ */
128
+void aio_co_wake(struct Coroutine *co);
129
+
46
+
130
+/**
47
+# get standard environment, filters and checks
131
* Return the AioContext whose event loop runs in the current thread.
48
+. ../common.rc
132
*
49
+. ../common.filter
133
* If called from an IOThread this will be the IOThread's AioContext. If
50
+. ../common.qemu
134
diff --git a/include/qemu/coroutine_int.h b/include/qemu/coroutine_int.h
135
index XXXXXXX..XXXXXXX 100644
136
--- a/include/qemu/coroutine_int.h
137
+++ b/include/qemu/coroutine_int.h
138
@@ -XXX,XX +XXX,XX @@ struct Coroutine {
139
CoroutineEntry *entry;
140
void *entry_arg;
141
Coroutine *caller;
142
+
51
+
143
+ /* Only used when the coroutine has terminated. */
52
+# This test only runs on Linux hosts with raw image files.
144
QSLIST_ENTRY(Coroutine) pool_next;
53
+_supported_fmt raw
54
+_supported_proto file
55
+_supported_os Linux
145
+
56
+
146
size_t locks_held;
57
+sudo -n true || \
147
58
+ _notrun 'Password-less sudo required'
148
- /* Coroutines that should be woken up when we yield or terminate */
149
+ /* Coroutines that should be woken up when we yield or terminate.
150
+ * Only used when the coroutine is running.
151
+ */
152
QSIMPLEQ_HEAD(, Coroutine) co_queue_wakeup;
153
+
59
+
154
+ /* Only used when the coroutine has yielded. */
60
+IMG="--image-opts -n driver=host_device,filename=/dev/nullb0"
155
+ AioContext *ctx;
61
+QEMU_IO_OPTIONS=$QEMU_IO_OPTIONS_NO_FMT
156
QSIMPLEQ_ENTRY(Coroutine) co_queue_next;
62
+
157
+ QSLIST_ENTRY(Coroutine) co_scheduled_next;
63
+echo "Testing a null_blk device:"
158
};
64
+echo "case 1: if the operations work"
159
65
+sudo -n modprobe null_blk nr_devices=1 zoned=1
160
Coroutine *qemu_coroutine_new(void);
66
+sudo -n chmod 0666 /dev/nullb0
161
diff --git a/tests/iothread.h b/tests/iothread.h
67
+
68
+echo "(1) report the first zone:"
69
+$QEMU_IO $IMG -c "zrp 0 1"
70
+echo
71
+echo "report the first 10 zones"
72
+$QEMU_IO $IMG -c "zrp 0 10"
73
+echo
74
+echo "report the last zone:"
75
+$QEMU_IO $IMG -c "zrp 0x3e70000000 2" # 0x3e70000000 / 512 = 0x1f380000
76
+echo
77
+echo
78
+echo "(2) opening the first zone"
79
+$QEMU_IO $IMG -c "zo 0 268435456" # 268435456 / 512 = 524288
80
+echo "report after:"
81
+$QEMU_IO $IMG -c "zrp 0 1"
82
+echo
83
+echo "opening the second zone"
84
+$QEMU_IO $IMG -c "zo 268435456 268435456" #
85
+echo "report after:"
86
+$QEMU_IO $IMG -c "zrp 268435456 1"
87
+echo
88
+echo "opening the last zone"
89
+$QEMU_IO $IMG -c "zo 0x3e70000000 268435456"
90
+echo "report after:"
91
+$QEMU_IO $IMG -c "zrp 0x3e70000000 2"
92
+echo
93
+echo
94
+echo "(3) closing the first zone"
95
+$QEMU_IO $IMG -c "zc 0 268435456"
96
+echo "report after:"
97
+$QEMU_IO $IMG -c "zrp 0 1"
98
+echo
99
+echo "closing the last zone"
100
+$QEMU_IO $IMG -c "zc 0x3e70000000 268435456"
101
+echo "report after:"
102
+$QEMU_IO $IMG -c "zrp 0x3e70000000 2"
103
+echo
104
+echo
105
+echo "(4) finishing the second zone"
106
+$QEMU_IO $IMG -c "zf 268435456 268435456"
107
+echo "After finishing a zone:"
108
+$QEMU_IO $IMG -c "zrp 268435456 1"
109
+echo
110
+echo
111
+echo "(5) resetting the second zone"
112
+$QEMU_IO $IMG -c "zrs 268435456 268435456"
113
+echo "After resetting a zone:"
114
+$QEMU_IO $IMG -c "zrp 268435456 1"
115
+
116
+# success, all done
117
+echo "*** done"
118
+rm -f $seq.full
119
+status=0
120
diff --git a/tests/qemu-iotests/tests/zoned.out b/tests/qemu-iotests/tests/zoned.out
162
new file mode 100644
121
new file mode 100644
163
index XXXXXXX..XXXXXXX
122
index XXXXXXX..XXXXXXX
164
--- /dev/null
123
--- /dev/null
165
+++ b/tests/iothread.h
124
+++ b/tests/qemu-iotests/tests/zoned.out
166
@@ -XXX,XX +XXX,XX @@
125
@@ -XXX,XX +XXX,XX @@
167
+/*
126
+QA output created by zoned
168
+ * Event loop thread implementation for unit tests
127
+Testing a null_blk device:
169
+ *
128
+case 1: if the operations work
170
+ * Copyright Red Hat Inc., 2013, 2016
129
+(1) report the first zone:
171
+ *
130
+start: 0x0, len 0x80000, cap 0x80000, wptr 0x0, zcond:1, [type: 2]
172
+ * Authors:
173
+ * Stefan Hajnoczi <stefanha@redhat.com>
174
+ * Paolo Bonzini <pbonzini@redhat.com>
175
+ *
176
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
177
+ * See the COPYING file in the top-level directory.
178
+ */
179
+#ifndef TEST_IOTHREAD_H
180
+#define TEST_IOTHREAD_H
181
+
131
+
182
+#include "block/aio.h"
132
+report the first 10 zones
183
+#include "qemu/thread.h"
133
+start: 0x0, len 0x80000, cap 0x80000, wptr 0x0, zcond:1, [type: 2]
134
+start: 0x80000, len 0x80000, cap 0x80000, wptr 0x80000, zcond:1, [type: 2]
135
+start: 0x100000, len 0x80000, cap 0x80000, wptr 0x100000, zcond:1, [type: 2]
136
+start: 0x180000, len 0x80000, cap 0x80000, wptr 0x180000, zcond:1, [type: 2]
137
+start: 0x200000, len 0x80000, cap 0x80000, wptr 0x200000, zcond:1, [type: 2]
138
+start: 0x280000, len 0x80000, cap 0x80000, wptr 0x280000, zcond:1, [type: 2]
139
+start: 0x300000, len 0x80000, cap 0x80000, wptr 0x300000, zcond:1, [type: 2]
140
+start: 0x380000, len 0x80000, cap 0x80000, wptr 0x380000, zcond:1, [type: 2]
141
+start: 0x400000, len 0x80000, cap 0x80000, wptr 0x400000, zcond:1, [type: 2]
142
+start: 0x480000, len 0x80000, cap 0x80000, wptr 0x480000, zcond:1, [type: 2]
184
+
143
+
185
+typedef struct IOThread IOThread;
144
+report the last zone:
186
+
145
+start: 0x1f380000, len 0x80000, cap 0x80000, wptr 0x1f380000, zcond:1, [type: 2]
187
+IOThread *iothread_new(void);
188
+void iothread_join(IOThread *iothread);
189
+AioContext *iothread_get_aio_context(IOThread *iothread);
190
+
191
+#endif
192
diff --git a/tests/iothread.c b/tests/iothread.c
193
new file mode 100644
194
index XXXXXXX..XXXXXXX
195
--- /dev/null
196
+++ b/tests/iothread.c
197
@@ -XXX,XX +XXX,XX @@
198
+/*
199
+ * Event loop thread implementation for unit tests
200
+ *
201
+ * Copyright Red Hat Inc., 2013, 2016
+ *
+ * Authors:
+ *  Stefan Hajnoczi <stefanha@redhat.com>
+ *  Paolo Bonzini <pbonzini@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "block/aio.h"
+#include "qemu/main-loop.h"
+#include "qemu/rcu.h"
+#include "iothread.h"
+
+struct IOThread {
+    AioContext *ctx;
+
+    QemuThread thread;
+    QemuMutex init_done_lock;
+    QemuCond init_done_cond;    /* is thread initialization done? */
+    bool stopping;
+};
+
+static __thread IOThread *my_iothread;
+
+AioContext *qemu_get_current_aio_context(void)
+{
+    return my_iothread ? my_iothread->ctx : qemu_get_aio_context();
+}
+
+static void *iothread_run(void *opaque)
+{
+    IOThread *iothread = opaque;
+
+    rcu_register_thread();
+
+    my_iothread = iothread;
+    qemu_mutex_lock(&iothread->init_done_lock);
+    iothread->ctx = aio_context_new(&error_abort);
+    qemu_cond_signal(&iothread->init_done_cond);
+    qemu_mutex_unlock(&iothread->init_done_lock);
+
+    while (!atomic_read(&iothread->stopping)) {
+        aio_poll(iothread->ctx, true);
+    }
+
+    rcu_unregister_thread();
+    return NULL;
+}
+
+void iothread_join(IOThread *iothread)
+{
+    iothread->stopping = true;
+    aio_notify(iothread->ctx);
+    qemu_thread_join(&iothread->thread);
+    qemu_cond_destroy(&iothread->init_done_cond);
+    qemu_mutex_destroy(&iothread->init_done_lock);
+    aio_context_unref(iothread->ctx);
+    g_free(iothread);
+}
+
+IOThread *iothread_new(void)
+{
+    IOThread *iothread = g_new0(IOThread, 1);
+
+    qemu_mutex_init(&iothread->init_done_lock);
+    qemu_cond_init(&iothread->init_done_cond);
+    qemu_thread_create(&iothread->thread, NULL, iothread_run,
+                       iothread, QEMU_THREAD_JOINABLE);
+
+    /* Wait for initialization to complete */
+    qemu_mutex_lock(&iothread->init_done_lock);
+    while (iothread->ctx == NULL) {
+        qemu_cond_wait(&iothread->init_done_cond,
+                       &iothread->init_done_lock);
+    }
+    qemu_mutex_unlock(&iothread->init_done_lock);
+    return iothread;
+}
+
+AioContext *iothread_get_aio_context(IOThread *iothread)
+{
+    return iothread->ctx;
+}
diff --git a/tests/test-aio-multithread.c b/tests/test-aio-multithread.c
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/tests/test-aio-multithread.c
@@ -XXX,XX +XXX,XX @@
+/*
+ * AioContext multithreading tests
+ *
+ * Copyright Red Hat, Inc. 2016
+ *
+ * Authors:
+ *  Paolo Bonzini <pbonzini@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include <glib.h>
+#include "block/aio.h"
+#include "qapi/error.h"
+#include "qemu/coroutine.h"
+#include "qemu/thread.h"
+#include "qemu/error-report.h"
+#include "iothread.h"
+
+/* AioContext management */
+
+#define NUM_CONTEXTS 5
+
+static IOThread *threads[NUM_CONTEXTS];
+static AioContext *ctx[NUM_CONTEXTS];
+static __thread int id = -1;
+
+static QemuEvent done_event;
+
+/* Run a function synchronously on a remote iothread. */
+
+typedef struct CtxRunData {
+    QEMUBHFunc *cb;
+    void *arg;
+} CtxRunData;
+
+static void ctx_run_bh_cb(void *opaque)
+{
+    CtxRunData *data = opaque;
+
+    data->cb(data->arg);
+    qemu_event_set(&done_event);
+}
+
+static void ctx_run(int i, QEMUBHFunc *cb, void *opaque)
+{
+    CtxRunData data = {
+        .cb = cb,
+        .arg = opaque
+    };
+
+    qemu_event_reset(&done_event);
+    aio_bh_schedule_oneshot(ctx[i], ctx_run_bh_cb, &data);
+    qemu_event_wait(&done_event);
+}
+
+/* Starting the iothreads. */
+
+static void set_id_cb(void *opaque)
+{
+    int *i = opaque;
+
+    id = *i;
+}
+
+static void create_aio_contexts(void)
+{
+    int i;
+
+    for (i = 0; i < NUM_CONTEXTS; i++) {
+        threads[i] = iothread_new();
+        ctx[i] = iothread_get_aio_context(threads[i]);
+    }
+
+    qemu_event_init(&done_event, false);
+    for (i = 0; i < NUM_CONTEXTS; i++) {
+        ctx_run(i, set_id_cb, &i);
+    }
+}
+
+/* Stopping the iothreads. */
+
+static void join_aio_contexts(void)
+{
+    int i;
+
+    for (i = 0; i < NUM_CONTEXTS; i++) {
+        aio_context_ref(ctx[i]);
+    }
+    for (i = 0; i < NUM_CONTEXTS; i++) {
+        iothread_join(threads[i]);
+    }
+    for (i = 0; i < NUM_CONTEXTS; i++) {
+        aio_context_unref(ctx[i]);
+    }
+    qemu_event_destroy(&done_event);
+}
+
+/* Basic test for the stuff above. */
+
+static void test_lifecycle(void)
+{
+    create_aio_contexts();
+    join_aio_contexts();
+}
+
+/* aio_co_schedule test. */
+
+static Coroutine *to_schedule[NUM_CONTEXTS];
+
+static bool now_stopping;
+
+static int count_retry;
+static int count_here;
+static int count_other;
+
+static bool schedule_next(int n)
+{
+    Coroutine *co;
+
+    co = atomic_xchg(&to_schedule[n], NULL);
+    if (!co) {
+        atomic_inc(&count_retry);
+        return false;
+    }
+
+    if (n == id) {
+        atomic_inc(&count_here);
+    } else {
+        atomic_inc(&count_other);
+    }
+
+    aio_co_schedule(ctx[n], co);
+    return true;
+}
+
+static void finish_cb(void *opaque)
+{
+    schedule_next(id);
+}
+
+static coroutine_fn void test_multi_co_schedule_entry(void *opaque)
+{
+    g_assert(to_schedule[id] == NULL);
+    atomic_mb_set(&to_schedule[id], qemu_coroutine_self());
+
+    while (!atomic_mb_read(&now_stopping)) {
+        int n;
+
+        n = g_test_rand_int_range(0, NUM_CONTEXTS);
+        schedule_next(n);
+        qemu_coroutine_yield();
+
+        g_assert(to_schedule[id] == NULL);
+        atomic_mb_set(&to_schedule[id], qemu_coroutine_self());
+    }
+}
+
+static void test_multi_co_schedule(int seconds)
+{
+    int i;
+
+    count_here = count_other = count_retry = 0;
+    now_stopping = false;
+
+    create_aio_contexts();
+    for (i = 0; i < NUM_CONTEXTS; i++) {
+        Coroutine *co1 = qemu_coroutine_create(test_multi_co_schedule_entry, NULL);
+        aio_co_schedule(ctx[i], co1);
+    }
+
+    g_usleep(seconds * 1000000);
+
+    atomic_mb_set(&now_stopping, true);
+    for (i = 0; i < NUM_CONTEXTS; i++) {
+        ctx_run(i, finish_cb, NULL);
+        to_schedule[i] = NULL;
+    }
+
+    join_aio_contexts();
+    g_test_message("scheduled %d, queued %d, retry %d, total %d\n",
+                   count_other, count_here, count_retry,
+                   count_here + count_other + count_retry);
+}
+
+static void test_multi_co_schedule_1(void)
+{
+    test_multi_co_schedule(1);
+}
+
+static void test_multi_co_schedule_10(void)
+{
+    test_multi_co_schedule(10);
+}
+
+/* End of tests. */
+
+int main(int argc, char **argv)
+{
+    init_clocks();
+
+    g_test_init(&argc, &argv, NULL);
+    g_test_add_func("/aio/multi/lifecycle", test_lifecycle);
+    if (g_test_quick()) {
+        g_test_add_func("/aio/multi/schedule", test_multi_co_schedule_1);
+    } else {
+        g_test_add_func("/aio/multi/schedule", test_multi_co_schedule_10);
+    }
+    return g_test_run();
+}
diff --git a/util/async.c b/util/async.c
index XXXXXXX..XXXXXXX 100644
--- a/util/async.c
+++ b/util/async.c
@@ -XXX,XX +XXX,XX @@
 #include "qemu/main-loop.h"
 #include "qemu/atomic.h"
 #include "block/raw-aio.h"
+#include "qemu/coroutine_int.h"
+#include "trace.h"

 /***********************************************************/
 /* bottom halves (can be seen as timers which expire ASAP) */
@@ -XXX,XX +XXX,XX @@ aio_ctx_finalize(GSource *source)
 }
 #endif

+    assert(QSLIST_EMPTY(&ctx->scheduled_coroutines));
+    qemu_bh_delete(ctx->co_schedule_bh);
+
     qemu_lockcnt_lock(&ctx->list_lock);
     assert(!qemu_lockcnt_count(&ctx->list_lock));
     while (ctx->first_bh) {
@@ -XXX,XX +XXX,XX @@ static bool event_notifier_poll(void *opaque)
     return atomic_read(&ctx->notified);
 }

+static void co_schedule_bh_cb(void *opaque)
+{
+    AioContext *ctx = opaque;
+    QSLIST_HEAD(, Coroutine) straight, reversed;
+
+    QSLIST_MOVE_ATOMIC(&reversed, &ctx->scheduled_coroutines);
+    QSLIST_INIT(&straight);
+
+    while (!QSLIST_EMPTY(&reversed)) {
+        Coroutine *co = QSLIST_FIRST(&reversed);
+        QSLIST_REMOVE_HEAD(&reversed, co_scheduled_next);
+        QSLIST_INSERT_HEAD(&straight, co, co_scheduled_next);
+    }
+
+    while (!QSLIST_EMPTY(&straight)) {
+        Coroutine *co = QSLIST_FIRST(&straight);
+        QSLIST_REMOVE_HEAD(&straight, co_scheduled_next);
+        trace_aio_co_schedule_bh_cb(ctx, co);
+        qemu_coroutine_enter(co);
+    }
+}
+
 AioContext *aio_context_new(Error **errp)
 {
     int ret;
@@ -XXX,XX +XXX,XX @@ AioContext *aio_context_new(Error **errp)
     }
     g_source_set_can_recurse(&ctx->source, true);
     qemu_lockcnt_init(&ctx->list_lock);
+
+    ctx->co_schedule_bh = aio_bh_new(ctx, co_schedule_bh_cb, ctx);
+    QSLIST_INIT(&ctx->scheduled_coroutines);
+
     aio_set_event_notifier(ctx, &ctx->notifier,
                            false,
                            (EventNotifierHandler *)
@@ -XXX,XX +XXX,XX @@ fail:
     return NULL;
 }

+void aio_co_schedule(AioContext *ctx, Coroutine *co)
+{
+    trace_aio_co_schedule(ctx, co);
+    QSLIST_INSERT_HEAD_ATOMIC(&ctx->scheduled_coroutines,
+                              co, co_scheduled_next);
+    qemu_bh_schedule(ctx->co_schedule_bh);
+}
+
+void aio_co_wake(struct Coroutine *co)
+{
+    AioContext *ctx;
+
+    /* Read coroutine before co->ctx.  Matches smp_wmb in
+     * qemu_coroutine_enter.
+     */
+    smp_read_barrier_depends();
+    ctx = atomic_read(&co->ctx);
+
+    if (ctx != qemu_get_current_aio_context()) {
+        aio_co_schedule(ctx, co);
+        return;
+    }
+
+    if (qemu_in_coroutine()) {
+        Coroutine *self = qemu_coroutine_self();
+        assert(self != co);
+        QSIMPLEQ_INSERT_TAIL(&self->co_queue_wakeup, co, co_queue_next);
+    } else {
+        aio_context_acquire(ctx);
+        qemu_coroutine_enter(co);
+        aio_context_release(ctx);
+    }
+}
+
 void aio_context_ref(AioContext *ctx)
 {
     g_source_ref(&ctx->source);
diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
index XXXXXXX..XXXXXXX 100644
--- a/util/qemu-coroutine.c
+++ b/util/qemu-coroutine.c
@@ -XXX,XX +XXX,XX @@
 #include "qemu/atomic.h"
 #include "qemu/coroutine.h"
 #include "qemu/coroutine_int.h"
+#include "block/aio.h"

 enum {
     POOL_BATCH_SIZE = 64,
@@ -XXX,XX +XXX,XX @@ void qemu_coroutine_enter(Coroutine *co)
     }

     co->caller = self;
+    co->ctx = qemu_get_current_aio_context();
+
+    /* Store co->ctx before anything that stores co.  Matches
+     * barrier in aio_co_wake.
+     */
+    smp_wmb();
+
     ret = qemu_coroutine_switch(self, co, COROUTINE_ENTER);

     qemu_co_queue_run_restart(co);
diff --git a/util/trace-events b/util/trace-events
index XXXXXXX..XXXXXXX 100644
--- a/util/trace-events
+++ b/util/trace-events
@@ -XXX,XX +XXX,XX @@ run_poll_handlers_end(void *ctx, bool progress) "ctx %p progress %d"
 poll_shrink(void *ctx, int64_t old, int64_t new) "ctx %p old %"PRId64" new %"PRId64
 poll_grow(void *ctx, int64_t old, int64_t new) "ctx %p old %"PRId64" new %"PRId64

+# util/async.c
+aio_co_schedule(void *ctx, void *co) "ctx %p co %p"
+aio_co_schedule_bh_cb(void *ctx, void *co) "ctx %p co %p"
+
 # util/thread-pool.c
 thread_pool_submit(void *pool, void *req, void *opaque) "pool %p req %p opaque %p"
 thread_pool_complete(void *pool, void *req, void *opaque, int ret) "pool %p req %p opaque %p ret %d"
--
2.9.3

The corresponding hunk in the v2 zoned-storage series carries the expected
output of the new zone-operation iotest:

+(2) opening the first zone
+report after:
+start: 0x0, len 0x80000, cap 0x80000, wptr 0x0, zcond:3, [type: 2]
+
+opening the second zone
+report after:
+start: 0x80000, len 0x80000, cap 0x80000, wptr 0x80000, zcond:3, [type: 2]
+
+opening the last zone
+report after:
+start: 0x1f380000, len 0x80000, cap 0x80000, wptr 0x1f380000, zcond:3, [type: 2]
+
+(3) closing the first zone
+report after:
+start: 0x0, len 0x80000, cap 0x80000, wptr 0x0, zcond:1, [type: 2]
+
+closing the last zone
+report after:
+start: 0x1f380000, len 0x80000, cap 0x80000, wptr 0x1f380000, zcond:1, [type: 2]
+
+(4) finishing the second zone
+After finishing a zone:
+start: 0x80000, len 0x80000, cap 0x80000, wptr 0x100000, zcond:14, [type: 2]
+
+(5) resetting the second zone
+After resetting a zone:
+start: 0x80000, len 0x80000, cap 0x80000, wptr 0x80000, zcond:1, [type: 2]
+
+*** done
--
2.40.1
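
As a usage note on the APIs added in util/async.c above, here is a minimal
sketch of waking work in another AioContext. The names "my_iothread_ctx" and
"my_worker" are hypothetical; only qemu_coroutine_create(), aio_co_schedule()
and qemu_coroutine_yield() come from the tree.

    static void coroutine_fn my_worker(void *opaque)
    {
        /* ... do some work, then wait for the next wakeup ... */
        qemu_coroutine_yield();
        /* Execution resumes in whichever AioContext scheduled us next. */
    }

    static void start_worker(AioContext *my_iothread_ctx)
    {
        Coroutine *co = qemu_coroutine_create(my_worker, NULL);

        /* Safe from any thread: the coroutine is pushed onto the context's
         * scheduled_coroutines list and entered later by co_schedule_bh_cb()
         * in that context's thread. A coroutine must not be scheduled twice
         * before it has run.
         */
        aio_co_schedule(my_iothread_ctx, co);
    }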
From: Paolo Bonzini <pbonzini@redhat.com>

qcow2_create2 calls this.  Do not run a nested event loop, as that
breaks when aio_co_wake tries to queue the coroutine on the co_queue_wakeup
list of the currently running one.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20170213135235.12274-4-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/block-backend.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index XXXXXXX..XXXXXXX 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -XXX,XX +XXX,XX @@ static int blk_prw(BlockBackend *blk, int64_t offset, uint8_t *buf,
 {
     QEMUIOVector qiov;
     struct iovec iov;
-    Coroutine *co;
     BlkRwCo rwco;

     iov = (struct iovec) {
@@ -XXX,XX +XXX,XX @@ static int blk_prw(BlockBackend *blk, int64_t offset, uint8_t *buf,
         .ret = NOT_DONE,
     };

-    co = qemu_coroutine_create(co_entry, &rwco);
-    qemu_coroutine_enter(co);
-    BDRV_POLL_WHILE(blk_bs(blk), rwco.ret == NOT_DONE);
+    if (qemu_in_coroutine()) {
+        /* Fast-path if already in coroutine context */
+        co_entry(&rwco);
+    } else {
+        Coroutine *co = qemu_coroutine_create(co_entry, &rwco);
+        qemu_coroutine_enter(co);
+        BDRV_POLL_WHILE(blk_bs(blk), rwco.ret == NOT_DONE);
+    }

     return rwco.ret;
 }
--
2.9.3

From: Sam Li <faithilikerun@gmail.com>

Signed-off-by: Sam Li <faithilikerun@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20230508045533.175575-8-faithilikerun@gmail.com
Message-id: 20230324090605.28361-8-faithilikerun@gmail.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/file-posix.c | 3 +++
 block/trace-events | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/block/file-posix.c b/block/file-posix.c
index XXXXXXX..XXXXXXX 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn raw_co_zone_report(BlockDriverState *bs, int64_t offset,
         },
     };

+    trace_zbd_zone_report(bs, *nr_zones, offset >> BDRV_SECTOR_BITS);
     return raw_thread_pool_submit(handle_aiocb_zone_report, &acb);
 }
 #endif
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
         },
     };

+    trace_zbd_zone_mgmt(bs, op_name, offset >> BDRV_SECTOR_BITS,
+                        len >> BDRV_SECTOR_BITS);
     ret = raw_thread_pool_submit(handle_aiocb_zone_mgmt, &acb);
     if (ret != 0) {
         error_report("ioctl %s failed %d", op_name, ret);
diff --git a/block/trace-events b/block/trace-events
index XXXXXXX..XXXXXXX 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -XXX,XX +XXX,XX @@ file_FindEjectableOpticalMedia(const char *media) "Matching using %s"
 file_setup_cdrom(const char *partition) "Using %s as optical disc"
 file_hdev_is_sg(int type, int version) "SG device found: type=%d, version=%d"
 file_flush_fdatasync_failed(int err) "errno %d"
+zbd_zone_report(void *bs, unsigned int nr_zones, int64_t sector) "bs %p report %d zones starting at sector offset 0x%" PRIx64 ""
+zbd_zone_mgmt(void *bs, const char *op_name, int64_t sector, int64_t len) "bs %p %s starts at sector offset 0x%" PRIx64 " over a range of 0x%" PRIx64 " sectors"

 # ssh.c
 sftp_error(const char *op, const char *ssh_err, int ssh_err_code, int sftp_err_code) "%s failed: %s (libssh error code: %d, sftp error code: %d)"
--
2.40.1
49
diff view generated by jsdifflib
1
From: Paolo Bonzini <pbonzini@redhat.com>
1
From: Sam Li <faithilikerun@gmail.com>
2
2
3
AioContext is fairly self contained, the only dependency is QEMUTimer but
3
Add the documentation about the zoned device support to virtio-blk
4
that in turn doesn't need anything else. So move them out of block-obj-y
4
emulation.
5
to avoid introducing a dependency from io/ to block-obj-y.
6
5
7
main-loop and its dependency iohandler also need to be moved, because
6
Signed-off-by: Sam Li <faithilikerun@gmail.com>
8
later in this series io/ will call iohandler_get_aio_context.
7
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
9
8
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
10
[Changed copyright "the QEMU team" to "other QEMU contributors" as
9
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
11
suggested by Daniel Berrange and agreed by Paolo.
10
Acked-by: Kevin Wolf <kwolf@redhat.com>
11
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
12
Message-id: 20230508045533.175575-9-faithilikerun@gmail.com
13
Message-id: 20230324090605.28361-9-faithilikerun@gmail.com
14
[Add index-api.rst to fix "zoned-storage.rst:document isn't included in
15
any toctree" error and fix pre-formatted code syntax.
12
--Stefan]
16
--Stefan]
13
14
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
15
Reviewed-by: Fam Zheng <famz@redhat.com>
16
Message-id: 20170213135235.12274-2-pbonzini@redhat.com
17
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
17
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
18
---
18
---
19
Makefile.objs | 4 ---
19
docs/devel/index-api.rst | 1 +
20
stubs/Makefile.objs | 1 +
20
docs/devel/zoned-storage.rst | 43 ++++++++++++++++++++++++++
21
tests/Makefile.include | 11 ++++----
21
docs/system/qemu-block-drivers.rst.inc | 6 ++++
22
util/Makefile.objs | 6 +++-
22
3 files changed, 50 insertions(+)
23
block/io.c | 29 -------------------
23
create mode 100644 docs/devel/zoned-storage.rst
24
stubs/linux-aio.c | 32 +++++++++++++++++++++
25
stubs/set-fd-handler.c | 11 --------
26
aio-posix.c => util/aio-posix.c | 2 +-
27
aio-win32.c => util/aio-win32.c | 0
28
util/aiocb.c | 55 +++++++++++++++++++++++++++++++++++++
29
async.c => util/async.c | 3 +-
30
iohandler.c => util/iohandler.c | 0
31
main-loop.c => util/main-loop.c | 0
32
qemu-timer.c => util/qemu-timer.c | 0
33
thread-pool.c => util/thread-pool.c | 2 +-
34
trace-events | 11 --------
35
util/trace-events | 11 ++++++++
36
17 files changed, 114 insertions(+), 64 deletions(-)
37
create mode 100644 stubs/linux-aio.c
38
rename aio-posix.c => util/aio-posix.c (99%)
39
rename aio-win32.c => util/aio-win32.c (100%)
40
create mode 100644 util/aiocb.c
41
rename async.c => util/async.c (99%)
42
rename iohandler.c => util/iohandler.c (100%)
43
rename main-loop.c => util/main-loop.c (100%)
44
rename qemu-timer.c => util/qemu-timer.c (100%)
45
rename thread-pool.c => util/thread-pool.c (99%)
46
24
47
diff --git a/Makefile.objs b/Makefile.objs
25
diff --git a/docs/devel/index-api.rst b/docs/devel/index-api.rst
48
index XXXXXXX..XXXXXXX 100644
26
index XXXXXXX..XXXXXXX 100644
49
--- a/Makefile.objs
27
--- a/docs/devel/index-api.rst
50
+++ b/Makefile.objs
28
+++ b/docs/devel/index-api.rst
51
@@ -XXX,XX +XXX,XX @@ chardev-obj-y = chardev/
29
@@ -XXX,XX +XXX,XX @@ generated from in-code annotations to function prototypes.
52
#######################################################################
30
memory
53
# block-obj-y is code used by both qemu system emulation and qemu-img
31
modules
54
32
ui
55
-block-obj-y = async.o thread-pool.o
33
+ zoned-storage
56
block-obj-y += nbd/
34
diff --git a/docs/devel/zoned-storage.rst b/docs/devel/zoned-storage.rst
57
block-obj-y += block.o blockjob.o
58
-block-obj-y += main-loop.o iohandler.o qemu-timer.o
59
-block-obj-$(CONFIG_POSIX) += aio-posix.o
60
-block-obj-$(CONFIG_WIN32) += aio-win32.o
61
block-obj-y += block/
62
block-obj-y += qemu-io-cmds.o
63
block-obj-$(CONFIG_REPLICATION) += replication.o
64
diff --git a/stubs/Makefile.objs b/stubs/Makefile.objs
65
index XXXXXXX..XXXXXXX 100644
66
--- a/stubs/Makefile.objs
67
+++ b/stubs/Makefile.objs
68
@@ -XXX,XX +XXX,XX @@ stub-obj-y += get-vm-name.o
69
stub-obj-y += iothread.o
70
stub-obj-y += iothread-lock.o
71
stub-obj-y += is-daemonized.o
72
+stub-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
73
stub-obj-y += machine-init-done.o
74
stub-obj-y += migr-blocker.o
75
stub-obj-y += monitor.o
76
diff --git a/tests/Makefile.include b/tests/Makefile.include
77
index XXXXXXX..XXXXXXX 100644
78
--- a/tests/Makefile.include
79
+++ b/tests/Makefile.include
80
@@ -XXX,XX +XXX,XX @@ check-unit-y += tests/test-visitor-serialization$(EXESUF)
81
check-unit-y += tests/test-iov$(EXESUF)
82
gcov-files-test-iov-y = util/iov.c
83
check-unit-y += tests/test-aio$(EXESUF)
84
+gcov-files-test-aio-y = util/async.c util/qemu-timer.o
85
+gcov-files-test-aio-$(CONFIG_WIN32) += util/aio-win32.c
86
+gcov-files-test-aio-$(CONFIG_POSIX) += util/aio-posix.c
87
check-unit-y += tests/test-throttle$(EXESUF)
88
gcov-files-test-aio-$(CONFIG_WIN32) = aio-win32.c
89
gcov-files-test-aio-$(CONFIG_POSIX) = aio-posix.c
90
@@ -XXX,XX +XXX,XX @@ tests/check-qjson$(EXESUF): tests/check-qjson.o $(test-util-obj-y)
91
tests/check-qom-interface$(EXESUF): tests/check-qom-interface.o $(test-qom-obj-y)
92
tests/check-qom-proplist$(EXESUF): tests/check-qom-proplist.o $(test-qom-obj-y)
93
94
-tests/test-char$(EXESUF): tests/test-char.o qemu-timer.o \
95
-    $(test-util-obj-y) $(qtest-obj-y) $(test-block-obj-y) $(chardev-obj-y)
96
+tests/test-char$(EXESUF): tests/test-char.o $(test-util-obj-y) $(qtest-obj-y) $(test-io-obj-y) $(chardev-obj-y)
97
tests/test-coroutine$(EXESUF): tests/test-coroutine.o $(test-block-obj-y)
98
tests/test-aio$(EXESUF): tests/test-aio.o $(test-block-obj-y)
99
tests/test-throttle$(EXESUF): tests/test-throttle.o $(test-block-obj-y)
100
@@ -XXX,XX +XXX,XX @@ tests/test-vmstate$(EXESUF): tests/test-vmstate.o \
101
    migration/vmstate.o migration/qemu-file.o \
102
migration/qemu-file-channel.o migration/qjson.o \
103
    $(test-io-obj-y)
104
-tests/test-timed-average$(EXESUF): tests/test-timed-average.o qemu-timer.o \
105
-    $(test-util-obj-y)
106
+tests/test-timed-average$(EXESUF): tests/test-timed-average.o $(test-util-obj-y)
107
tests/test-base64$(EXESUF): tests/test-base64.o \
108
    libqemuutil.a libqemustub.a
109
tests/ptimer-test$(EXESUF): tests/ptimer-test.o tests/ptimer-test-stubs.o hw/core/ptimer.o libqemustub.a
110
@@ -XXX,XX +XXX,XX @@ tests/usb-hcd-ehci-test$(EXESUF): tests/usb-hcd-ehci-test.o $(libqos-usb-obj-y)
111
tests/usb-hcd-xhci-test$(EXESUF): tests/usb-hcd-xhci-test.o $(libqos-usb-obj-y)
112
tests/pc-cpu-test$(EXESUF): tests/pc-cpu-test.o
113
tests/postcopy-test$(EXESUF): tests/postcopy-test.o
114
-tests/vhost-user-test$(EXESUF): tests/vhost-user-test.o qemu-timer.o \
115
+tests/vhost-user-test$(EXESUF): tests/vhost-user-test.o $(test-util-obj-y) \
116
    $(qtest-obj-y) $(test-io-obj-y) $(libqos-virtio-obj-y) $(libqos-pc-obj-y) \
117
    $(chardev-obj-y)
118
tests/qemu-iotests/socket_scm_helper$(EXESUF): tests/qemu-iotests/socket_scm_helper.o
119
diff --git a/util/Makefile.objs b/util/Makefile.objs
120
index XXXXXXX..XXXXXXX 100644
121
--- a/util/Makefile.objs
122
+++ b/util/Makefile.objs
123
@@ -XXX,XX +XXX,XX @@
124
util-obj-y = osdep.o cutils.o unicode.o qemu-timer-common.o
125
util-obj-y += bufferiszero.o
126
util-obj-y += lockcnt.o
127
+util-obj-y += aiocb.o async.o thread-pool.o qemu-timer.o
128
+util-obj-y += main-loop.o iohandler.o
129
+util-obj-$(CONFIG_POSIX) += aio-posix.o
130
util-obj-$(CONFIG_POSIX) += compatfd.o
131
util-obj-$(CONFIG_POSIX) += event_notifier-posix.o
132
util-obj-$(CONFIG_POSIX) += mmap-alloc.o
133
util-obj-$(CONFIG_POSIX) += oslib-posix.o
134
util-obj-$(CONFIG_POSIX) += qemu-openpty.o
135
util-obj-$(CONFIG_POSIX) += qemu-thread-posix.o
136
-util-obj-$(CONFIG_WIN32) += event_notifier-win32.o
137
util-obj-$(CONFIG_POSIX) += memfd.o
138
+util-obj-$(CONFIG_WIN32) += aio-win32.o
139
+util-obj-$(CONFIG_WIN32) += event_notifier-win32.o
140
util-obj-$(CONFIG_WIN32) += oslib-win32.o
141
util-obj-$(CONFIG_WIN32) += qemu-thread-win32.o
142
util-obj-y += envlist.o path.o module.o
143
diff --git a/block/io.c b/block/io.c
144
index XXXXXXX..XXXXXXX 100644
145
--- a/block/io.c
146
+++ b/block/io.c
147
@@ -XXX,XX +XXX,XX @@ BlockAIOCB *bdrv_aio_flush(BlockDriverState *bs,
148
return &acb->common;
149
}
150
151
-void *qemu_aio_get(const AIOCBInfo *aiocb_info, BlockDriverState *bs,
152
- BlockCompletionFunc *cb, void *opaque)
153
-{
154
- BlockAIOCB *acb;
155
-
156
- acb = g_malloc(aiocb_info->aiocb_size);
157
- acb->aiocb_info = aiocb_info;
158
- acb->bs = bs;
159
- acb->cb = cb;
160
- acb->opaque = opaque;
161
- acb->refcnt = 1;
162
- return acb;
163
-}
164
-
165
-void qemu_aio_ref(void *p)
166
-{
167
- BlockAIOCB *acb = p;
168
- acb->refcnt++;
169
-}
170
-
171
-void qemu_aio_unref(void *p)
172
-{
173
- BlockAIOCB *acb = p;
174
- assert(acb->refcnt > 0);
175
- if (--acb->refcnt == 0) {
176
- g_free(acb);
177
- }
178
-}
179
-
180
/**************************************************************/
181
/* Coroutine block device emulation */
182
183
diff --git a/stubs/linux-aio.c b/stubs/linux-aio.c
184
new file mode 100644
35
new file mode 100644
185
index XXXXXXX..XXXXXXX
36
index XXXXXXX..XXXXXXX
186
--- /dev/null
37
--- /dev/null
187
+++ b/stubs/linux-aio.c
38
+++ b/docs/devel/zoned-storage.rst
188
@@ -XXX,XX +XXX,XX @@
39
@@ -XXX,XX +XXX,XX @@
189
+/*
40
+=============
190
+ * Linux native AIO support.
41
+zoned-storage
191
+ *
42
+=============
192
+ * Copyright (C) 2009 IBM, Corp.
193
+ * Copyright (C) 2009 Red Hat, Inc.
194
+ *
195
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
196
+ * See the COPYING file in the top-level directory.
197
+ */
198
+#include "qemu/osdep.h"
199
+#include "block/aio.h"
200
+#include "block/raw-aio.h"
201
+
43
+
202
+void laio_detach_aio_context(LinuxAioState *s, AioContext *old_context)
44
+Zoned Block Devices (ZBDs) divide the LBA space into block regions called zones
203
+{
45
+that are larger than the LBA size. They can only allow sequential writes, which
204
+ abort();
46
+can reduce write amplification in SSDs, and potentially lead to higher
205
+}
47
+throughput and increased capacity. More details about ZBDs can be found at:
206
+
48
+
207
+void laio_attach_aio_context(LinuxAioState *s, AioContext *new_context)
49
+https://zonedstorage.io/docs/introduction/zoned-storage
208
+{
209
+ abort();
210
+}
211
+
50
+
212
+LinuxAioState *laio_init(void)
51
+1. Block layer APIs for zoned storage
213
+{
52
+-------------------------------------
214
+ abort();
53
+QEMU block layer supports three zoned storage models:
215
+}
54
+- BLK_Z_HM: The host-managed zoned model only allows sequential writes access
55
+to zones. It supports ZBD-specific I/O commands that can be used by a host to
56
+manage the zones of a device.
57
+- BLK_Z_HA: The host-aware zoned model allows random write operations in
58
+zones, making it backward compatible with regular block devices.
59
+- BLK_Z_NONE: The non-zoned model has no zones support. It includes both
60
+regular and drive-managed ZBD devices. ZBD-specific I/O commands are not
61
+supported.
216
+
62
+
217
+void laio_cleanup(LinuxAioState *s)
63
+The block device information resides inside BlockDriverState. QEMU uses
218
+{
64
+BlockLimits struct(BlockDriverState::bl) that is continuously accessed by the
219
+ abort();
65
+block layer while processing I/O requests. A BlockBackend has a root pointer to
220
+}
66
+a BlockDriverState graph(for example, raw format on top of file-posix). The
221
diff --git a/stubs/set-fd-handler.c b/stubs/set-fd-handler.c
67
+zoned storage information can be propagated from the leaf BlockDriverState all
68
+the way up to the BlockBackend. If the zoned storage model in file-posix is
69
+set to BLK_Z_HM, then block drivers will declare support for zoned host device.
70
+
71
+The block layer APIs support commands needed for zoned storage devices,
72
+including report zones, four zone operations, and zone append.
73
+
74
+2. Emulating zoned storage controllers
75
+--------------------------------------
76
+When the BlockBackend's BlockLimits model reports a zoned storage device, users
77
+like the virtio-blk emulation or the qemu-io-cmds.c utility can use block layer
78
+APIs for zoned storage emulation or testing.
79
+
80
+For example, to test zone_report on a null_blk device using qemu-io is::
81
+
82
+ $ path/to/qemu-io --image-opts -n driver=host_device,filename=/dev/nullb0 -c "zrp offset nr_zones"
83
diff --git a/docs/system/qemu-block-drivers.rst.inc b/docs/system/qemu-block-drivers.rst.inc
222
index XXXXXXX..XXXXXXX 100644
84
index XXXXXXX..XXXXXXX 100644
223
--- a/stubs/set-fd-handler.c
85
--- a/docs/system/qemu-block-drivers.rst.inc
224
+++ b/stubs/set-fd-handler.c
86
+++ b/docs/system/qemu-block-drivers.rst.inc
225
@@ -XXX,XX +XXX,XX @@ void qemu_set_fd_handler(int fd,
87
@@ -XXX,XX +XXX,XX @@ Hard disks
226
{
88
you may corrupt your host data (use the ``-snapshot`` command
227
abort();
89
line option or modify the device permissions accordingly).
228
}
90
229
-
91
+Zoned block devices
230
-void aio_set_fd_handler(AioContext *ctx,
92
+ Zoned block devices can be passed through to the guest if the emulated storage
231
- int fd,
93
+ controller supports zoned storage. Use ``--blockdev host_device,
232
- bool is_external,
94
+ node-name=drive0,filename=/dev/nullb0,cache.direct=on`` to pass through
233
- IOHandler *io_read,
95
+ ``/dev/nullb0`` as ``drive0``.
234
- IOHandler *io_write,
235
- AioPollFn *io_poll,
236
- void *opaque)
237
-{
238
- abort();
239
-}
240
diff --git a/aio-posix.c b/util/aio-posix.c
241
similarity index 99%
242
rename from aio-posix.c
243
rename to util/aio-posix.c
244
index XXXXXXX..XXXXXXX 100644
245
--- a/aio-posix.c
246
+++ b/util/aio-posix.c
247
@@ -XXX,XX +XXX,XX @@
248
#include "qemu/rcu_queue.h"
249
#include "qemu/sockets.h"
250
#include "qemu/cutils.h"
251
-#include "trace-root.h"
252
+#include "trace.h"
253
#ifdef CONFIG_EPOLL_CREATE1
254
#include <sys/epoll.h>
255
#endif
256
diff --git a/aio-win32.c b/util/aio-win32.c
257
similarity index 100%
258
rename from aio-win32.c
259
rename to util/aio-win32.c
260
diff --git a/util/aiocb.c b/util/aiocb.c
261
new file mode 100644
262
index XXXXXXX..XXXXXXX
263
--- /dev/null
264
+++ b/util/aiocb.c
265
@@ -XXX,XX +XXX,XX @@
266
+/*
267
+ * BlockAIOCB allocation
268
+ *
269
+ * Copyright (c) 2003-2017 Fabrice Bellard and other QEMU contributors
270
+ *
271
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
272
+ * of this software and associated documentation files (the "Software"), to deal
273
+ * in the Software without restriction, including without limitation the rights
274
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
275
+ * copies of the Software, and to permit persons to whom the Software is
276
+ * furnished to do so, subject to the following conditions:
277
+ *
278
+ * The above copyright notice and this permission notice shall be included in
279
+ * all copies or substantial portions of the Software.
280
+ *
281
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
282
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
283
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
284
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
285
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
286
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
287
+ * THE SOFTWARE.
288
+ */
289
+
96
+
290
+#include "qemu/osdep.h"
97
Windows
291
+#include "block/aio.h"
98
^^^^^^^
292
+
99
293
+void *qemu_aio_get(const AIOCBInfo *aiocb_info, BlockDriverState *bs,
294
+ BlockCompletionFunc *cb, void *opaque)
295
+{
296
+ BlockAIOCB *acb;
297
+
298
+ acb = g_malloc(aiocb_info->aiocb_size);
299
+ acb->aiocb_info = aiocb_info;
300
+ acb->bs = bs;
301
+ acb->cb = cb;
302
+ acb->opaque = opaque;
303
+ acb->refcnt = 1;
304
+ return acb;
305
+}
306
+
307
+void qemu_aio_ref(void *p)
308
+{
309
+ BlockAIOCB *acb = p;
310
+ acb->refcnt++;
311
+}
312
+
313
+void qemu_aio_unref(void *p)
314
+{
315
+ BlockAIOCB *acb = p;
316
+ assert(acb->refcnt > 0);
317
+ if (--acb->refcnt == 0) {
318
+ g_free(acb);
319
+ }
320
+}
321
diff --git a/async.c b/util/async.c
322
similarity index 99%
323
rename from async.c
324
rename to util/async.c
325
index XXXXXXX..XXXXXXX 100644
326
--- a/async.c
327
+++ b/util/async.c
328
@@ -XXX,XX +XXX,XX @@
329
/*
330
- * QEMU System Emulator
331
+ * Data plane event loop
332
*
333
* Copyright (c) 2003-2008 Fabrice Bellard
334
+ * Copyright (c) 2009-2017 QEMU contributors
335
*
336
* Permission is hereby granted, free of charge, to any person obtaining a copy
337
* of this software and associated documentation files (the "Software"), to deal
338
diff --git a/iohandler.c b/util/iohandler.c
339
similarity index 100%
340
rename from iohandler.c
341
rename to util/iohandler.c
342
diff --git a/main-loop.c b/util/main-loop.c
343
similarity index 100%
344
rename from main-loop.c
345
rename to util/main-loop.c
346
diff --git a/qemu-timer.c b/util/qemu-timer.c
347
similarity index 100%
348
rename from qemu-timer.c
349
rename to util/qemu-timer.c
350
diff --git a/thread-pool.c b/util/thread-pool.c
351
similarity index 99%
352
rename from thread-pool.c
353
rename to util/thread-pool.c
354
index XXXXXXX..XXXXXXX 100644
355
--- a/thread-pool.c
356
+++ b/util/thread-pool.c
357
@@ -XXX,XX +XXX,XX @@
358
#include "qemu/queue.h"
359
#include "qemu/thread.h"
360
#include "qemu/coroutine.h"
361
-#include "trace-root.h"
362
+#include "trace.h"
363
#include "block/thread-pool.h"
364
#include "qemu/main-loop.h"
365
366
diff --git a/trace-events b/trace-events
367
index XXXXXXX..XXXXXXX 100644
368
--- a/trace-events
369
+++ b/trace-events
370
@@ -XXX,XX +XXX,XX @@
371
#
372
# The <format-string> should be a sprintf()-compatible format string.
373
374
-# aio-posix.c
375
-run_poll_handlers_begin(void *ctx, int64_t max_ns) "ctx %p max_ns %"PRId64
376
-run_poll_handlers_end(void *ctx, bool progress) "ctx %p progress %d"
377
-poll_shrink(void *ctx, int64_t old, int64_t new) "ctx %p old %"PRId64" new %"PRId64
378
-poll_grow(void *ctx, int64_t old, int64_t new) "ctx %p old %"PRId64" new %"PRId64
379
-
380
-# thread-pool.c
381
-thread_pool_submit(void *pool, void *req, void *opaque) "pool %p req %p opaque %p"
382
-thread_pool_complete(void *pool, void *req, void *opaque, int ret) "pool %p req %p opaque %p ret %d"
383
-thread_pool_cancel(void *req, void *opaque) "req %p opaque %p"
384
-
385
# ioport.c
386
cpu_in(unsigned int addr, char size, unsigned int val) "addr %#x(%c) value %u"
387
cpu_out(unsigned int addr, char size, unsigned int val) "addr %#x(%c) value %u"
388
diff --git a/util/trace-events b/util/trace-events
389
index XXXXXXX..XXXXXXX 100644
390
--- a/util/trace-events
391
+++ b/util/trace-events
392
@@ -XXX,XX +XXX,XX @@
393
# See docs/tracing.txt for syntax documentation.
394
395
+# util/aio-posix.c
396
+run_poll_handlers_begin(void *ctx, int64_t max_ns) "ctx %p max_ns %"PRId64
397
+run_poll_handlers_end(void *ctx, bool progress) "ctx %p progress %d"
398
+poll_shrink(void *ctx, int64_t old, int64_t new) "ctx %p old %"PRId64" new %"PRId64
399
+poll_grow(void *ctx, int64_t old, int64_t new) "ctx %p old %"PRId64" new %"PRId64
400
+
401
+# util/thread-pool.c
402
+thread_pool_submit(void *pool, void *req, void *opaque) "pool %p req %p opaque %p"
403
+thread_pool_complete(void *pool, void *req, void *opaque, int ret) "pool %p req %p opaque %p ret %d"
404
+thread_pool_cancel(void *req, void *opaque) "req %p opaque %p"
405
+
406
# util/buffer.c
407
buffer_resize(const char *buf, size_t olen, size_t len) "%s: old %zd, new %zd"
408
buffer_move_empty(const char *buf, size_t len, const char *from) "%s: %zd bytes from %s"
409
--
100
--
410
2.9.3
101
2.40.1
411
412
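
As a side note on section 2 of docs/devel/zoned-storage.rst above: an
emulated controller can gate its zoned commands on the BlockLimits model.
A minimal sketch, assuming the BlockZoneModel field added to BlockLimits by
this series is reachable as bs->bl.zoned (the helper name is hypothetical):

    static bool my_device_supports_zones(BlockBackend *blk)
    {
        BlockDriverState *bs = blk_bs(blk);

        /* BLK_Z_HM/BLK_Z_HA devices accept zone commands; BLK_Z_NONE
         * behaves like a regular block device.
         */
        return bs && bs->bl.zoned != BLK_Z_NONE;
    }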
Deleted patch
From: Paolo Bonzini <pbonzini@redhat.com>

Once the thread pool starts using aio_co_wake, it will also need
qemu_get_current_aio_context().  Make test-thread-pool create
an AioContext with qemu_init_main_loop, so that stubs/iothread.c
and tests/iothread.c can provide the rest.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20170213135235.12274-5-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tests/test-thread-pool.c | 12 +++---------
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/tests/test-thread-pool.c b/tests/test-thread-pool.c
index XXXXXXX..XXXXXXX 100644
--- a/tests/test-thread-pool.c
+++ b/tests/test-thread-pool.c
@@ -XXX,XX +XXX,XX @@
 #include "qapi/error.h"
 #include "qemu/timer.h"
 #include "qemu/error-report.h"
+#include "qemu/main-loop.h"

 static AioContext *ctx;
 static ThreadPool *pool;
@@ -XXX,XX +XXX,XX @@ static void test_cancel_async(void)
 int main(int argc, char **argv)
 {
     int ret;
-    Error *local_error = NULL;

-    init_clocks();
-
-    ctx = aio_context_new(&local_error);
-    if (!ctx) {
-        error_reportf_err(local_error, "Failed to create AIO Context: ");
-        exit(1);
-    }
+    qemu_init_main_loop(&error_abort);
+    ctx = qemu_get_current_aio_context();
     pool = aio_get_thread_pool(ctx);

     g_test_init(&argc, &argv, NULL);
@@ -XXX,XX +XXX,XX @@ int main(int argc, char **argv)

     ret = g_test_run();

-    aio_context_unref(ctx);
     return ret;
 }
--
2.9.3
Deleted patch
From: Paolo Bonzini <pbonzini@redhat.com>

This is in preparation for making qio_channel_yield work on
AioContexts other than the main one.

Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20170213135235.12274-6-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/io/channel.h | 25 +++++++++++++++++++++++++
 io/channel-command.c | 13 +++++++++++++
 io/channel-file.c | 11 +++++++++++
 io/channel-socket.c | 16 +++++++++++-----
 io/channel-tls.c | 12 ++++++++++++
 io/channel-watch.c | 6 ++++++
 io/channel.c | 11 +++++++++++
 7 files changed, 89 insertions(+), 5 deletions(-)

diff --git a/include/io/channel.h b/include/io/channel.h
index XXXXXXX..XXXXXXX 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -XXX,XX +XXX,XX @@

 #include "qemu-common.h"
 #include "qom/object.h"
+#include "block/aio.h"

 #define TYPE_QIO_CHANNEL "qio-channel"
 #define QIO_CHANNEL(obj) \
@@ -XXX,XX +XXX,XX @@ struct QIOChannelClass {
                      off_t offset,
                      int whence,
                      Error **errp);
+    void (*io_set_aio_fd_handler)(QIOChannel *ioc,
+                                  AioContext *ctx,
+                                  IOHandler *io_read,
+                                  IOHandler *io_write,
+                                  void *opaque);
 };

 /* General I/O handling functions */
@@ -XXX,XX +XXX,XX @@ void qio_channel_yield(QIOChannel *ioc,
 void qio_channel_wait(QIOChannel *ioc,
                       GIOCondition condition);

+/**
+ * qio_channel_set_aio_fd_handler:
+ * @ioc: the channel object
+ * @ctx: the AioContext to set the handlers on
+ * @io_read: the read handler
+ * @io_write: the write handler
+ * @opaque: the opaque value passed to the handler
+ *
+ * This is used internally by qio_channel_yield().  It can
+ * be used by channel implementations to forward the handlers
+ * to another channel (e.g. from #QIOChannelTLS to the
+ * underlying socket).
+ */
+void qio_channel_set_aio_fd_handler(QIOChannel *ioc,
+                                    AioContext *ctx,
+                                    IOHandler *io_read,
+                                    IOHandler *io_write,
+                                    void *opaque);
+
 #endif /* QIO_CHANNEL_H */
diff --git a/io/channel-command.c b/io/channel-command.c
index XXXXXXX..XXXXXXX 100644
--- a/io/channel-command.c
+++ b/io/channel-command.c
@@ -XXX,XX +XXX,XX @@ static int qio_channel_command_close(QIOChannel *ioc,
 }


+static void qio_channel_command_set_aio_fd_handler(QIOChannel *ioc,
+                                                   AioContext *ctx,
+                                                   IOHandler *io_read,
+                                                   IOHandler *io_write,
+                                                   void *opaque)
+{
+    QIOChannelCommand *cioc = QIO_CHANNEL_COMMAND(ioc);
+    aio_set_fd_handler(ctx, cioc->readfd, false, io_read, NULL, NULL, opaque);
+    aio_set_fd_handler(ctx, cioc->writefd, false, NULL, io_write, NULL, opaque);
+}
+
+
 static GSource *qio_channel_command_create_watch(QIOChannel *ioc,
                                                  GIOCondition condition)
 {
@@ -XXX,XX +XXX,XX @@ static void qio_channel_command_class_init(ObjectClass *klass,
     ioc_klass->io_set_blocking = qio_channel_command_set_blocking;
     ioc_klass->io_close = qio_channel_command_close;
     ioc_klass->io_create_watch = qio_channel_command_create_watch;
+    ioc_klass->io_set_aio_fd_handler = qio_channel_command_set_aio_fd_handler;
 }

 static const TypeInfo qio_channel_command_info = {
diff --git a/io/channel-file.c b/io/channel-file.c
index XXXXXXX..XXXXXXX 100644
--- a/io/channel-file.c
+++ b/io/channel-file.c
@@ -XXX,XX +XXX,XX @@ static int qio_channel_file_close(QIOChannel *ioc,
 }


+static void qio_channel_file_set_aio_fd_handler(QIOChannel *ioc,
+                                                AioContext *ctx,
+                                                IOHandler *io_read,
+                                                IOHandler *io_write,
+                                                void *opaque)
+{
+    QIOChannelFile *fioc = QIO_CHANNEL_FILE(ioc);
+    aio_set_fd_handler(ctx, fioc->fd, false, io_read, io_write, NULL, opaque);
+}
+
 static GSource *qio_channel_file_create_watch(QIOChannel *ioc,
                                               GIOCondition condition)
 {
@@ -XXX,XX +XXX,XX @@ static void qio_channel_file_class_init(ObjectClass *klass,
     ioc_klass->io_seek = qio_channel_file_seek;
     ioc_klass->io_close = qio_channel_file_close;
     ioc_klass->io_create_watch = qio_channel_file_create_watch;
+    ioc_klass->io_set_aio_fd_handler = qio_channel_file_set_aio_fd_handler;
 }

 static const TypeInfo qio_channel_file_info = {
diff --git a/io/channel-socket.c b/io/channel-socket.c
index XXXXXXX..XXXXXXX 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -XXX,XX +XXX,XX @@ qio_channel_socket_set_blocking(QIOChannel *ioc,
         qemu_set_block(sioc->fd);
     } else {
         qemu_set_nonblock(sioc->fd);
-#ifdef WIN32
-        WSAEventSelect(sioc->fd, ioc->event,
-                       FD_READ | FD_ACCEPT | FD_CLOSE |
-                       FD_CONNECT | FD_WRITE | FD_OOB);
-#endif
     }
     return 0;
 }
@@ -XXX,XX +XXX,XX @@ qio_channel_socket_shutdown(QIOChannel *ioc,
     return 0;
 }

+static void qio_channel_socket_set_aio_fd_handler(QIOChannel *ioc,
+                                                  AioContext *ctx,
+                                                  IOHandler *io_read,
+                                                  IOHandler *io_write,
+                                                  void *opaque)
+{
+    QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
+    aio_set_fd_handler(ctx, sioc->fd, false, io_read, io_write, NULL, opaque);
+}
+
 static GSource *qio_channel_socket_create_watch(QIOChannel *ioc,
                                                 GIOCondition condition)
 {
@@ -XXX,XX +XXX,XX @@ static void qio_channel_socket_class_init(ObjectClass *klass,
     ioc_klass->io_set_cork = qio_channel_socket_set_cork;
     ioc_klass->io_set_delay = qio_channel_socket_set_delay;
     ioc_klass->io_create_watch = qio_channel_socket_create_watch;
+    ioc_klass->io_set_aio_fd_handler = qio_channel_socket_set_aio_fd_handler;
 }

 static const TypeInfo qio_channel_socket_info = {
diff --git a/io/channel-tls.c b/io/channel-tls.c
index XXXXXXX..XXXXXXX 100644
--- a/io/channel-tls.c
+++ b/io/channel-tls.c
@@ -XXX,XX +XXX,XX @@ static int qio_channel_tls_close(QIOChannel *ioc,
     return qio_channel_close(tioc->master, errp);
 }

+static void qio_channel_tls_set_aio_fd_handler(QIOChannel *ioc,
+                                               AioContext *ctx,
+                                               IOHandler *io_read,
+                                               IOHandler *io_write,
+                                               void *opaque)
+{
+    QIOChannelTLS *tioc = QIO_CHANNEL_TLS(ioc);
+
+    qio_channel_set_aio_fd_handler(tioc->master, ctx, io_read, io_write, opaque);
+}
+
 static GSource *qio_channel_tls_create_watch(QIOChannel *ioc,
                                              GIOCondition condition)
 {
@@ -XXX,XX +XXX,XX @@ static void qio_channel_tls_class_init(ObjectClass *klass,
     ioc_klass->io_close = qio_channel_tls_close;
     ioc_klass->io_shutdown = qio_channel_tls_shutdown;
     ioc_klass->io_create_watch = qio_channel_tls_create_watch;
+    ioc_klass->io_set_aio_fd_handler = qio_channel_tls_set_aio_fd_handler;
 }

 static const TypeInfo qio_channel_tls_info = {
diff --git a/io/channel-watch.c b/io/channel-watch.c
index XXXXXXX..XXXXXXX 100644
--- a/io/channel-watch.c
+++ b/io/channel-watch.c
@@ -XXX,XX +XXX,XX @@ GSource *qio_channel_create_socket_watch(QIOChannel *ioc,
     GSource *source;
     QIOChannelSocketSource *ssource;

+#ifdef WIN32
+    WSAEventSelect(socket, ioc->event,
+                   FD_READ | FD_ACCEPT | FD_CLOSE |
+                   FD_CONNECT | FD_WRITE | FD_OOB);
+#endif
+
     source = g_source_new(&qio_channel_socket_source_funcs,
                           sizeof(QIOChannelSocketSource));
     ssource = (QIOChannelSocketSource *)source;
diff --git a/io/channel.c b/io/channel.c
index XXXXXXX..XXXXXXX 100644
--- a/io/channel.c
+++ b/io/channel.c
@@ -XXX,XX +XXX,XX @@ GSource *qio_channel_create_watch(QIOChannel *ioc,
 }


+void qio_channel_set_aio_fd_handler(QIOChannel *ioc,
+                                    AioContext *ctx,
+                                    IOHandler *io_read,
+                                    IOHandler *io_write,
+                                    void *opaque)
+{
+    QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);
+
+    klass->io_set_aio_fd_handler(ioc, ctx, io_read, io_write, opaque);
+}
+
 guint qio_channel_add_watch(QIOChannel *ioc,
                             GIOCondition condition,
                             QIOChannelFunc func,
--
2.9.3
From: Paolo Bonzini <pbonzini@redhat.com>

Add two implementations of the same benchmark as the previous patch,
but using pthreads.  One uses a normal QemuMutex, the other is Linux
only and implements a fair mutex based on MCS locks and futexes.
This shows that the slower performance of the 5-thread case is due to
the fairness of CoMutex, rather than to coroutines.  If fairness does
not matter, as is the case with two threads, CoMutex can actually be
faster than pthreads.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20170213181244.16297-4-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tests/test-aio-multithread.c | 164 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 164 insertions(+)

diff --git a/tests/test-aio-multithread.c b/tests/test-aio-multithread.c
index XXXXXXX..XXXXXXX 100644
--- a/tests/test-aio-multithread.c
+++ b/tests/test-aio-multithread.c
@@ -XXX,XX +XXX,XX @@ static void test_multi_co_mutex_2_30(void)
     test_multi_co_mutex(2, 30);
 }

+/* Same test with fair mutexes, for performance comparison. */
+
+#ifdef CONFIG_LINUX
+#include "qemu/futex.h"
+
+/* The nodes for the mutex reside in this structure (on which we try to avoid
+ * false sharing).  The head of the mutex is in the "mutex_head" variable.
+ */
+static struct {
+    int next, locked;
+    int padding[14];
+} nodes[NUM_CONTEXTS] __attribute__((__aligned__(64)));
+
+static int mutex_head = -1;
+
+static void mcs_mutex_lock(void)
+{
+    int prev;
+
+    nodes[id].next = -1;
+    nodes[id].locked = 1;
+    prev = atomic_xchg(&mutex_head, id);
+    if (prev != -1) {
+        atomic_set(&nodes[prev].next, id);
+        qemu_futex_wait(&nodes[id].locked, 1);
+    }
+}
+
+static void mcs_mutex_unlock(void)
+{
+    int next;
+    if (nodes[id].next == -1) {
+        if (atomic_read(&mutex_head) == id &&
+            atomic_cmpxchg(&mutex_head, id, -1) == id) {
+            /* Last item in the list, exit.  */
+            return;
+        }
+        while (atomic_read(&nodes[id].next) == -1) {
+            /* mcs_mutex_lock did the xchg, but has not updated
+             * nodes[prev].next yet.
+             */
+        }
+    }
+
+    /* Wake up the next in line.  */
+    next = nodes[id].next;
+    nodes[next].locked = 0;
+    qemu_futex_wake(&nodes[next].locked, 1);
+}
+
+static void test_multi_fair_mutex_entry(void *opaque)
+{
+    while (!atomic_mb_read(&now_stopping)) {
+        mcs_mutex_lock();
+        counter++;
+        mcs_mutex_unlock();
+        atomic_inc(&atomic_counter);
+    }
+    atomic_dec(&running);
+}
+
+static void test_multi_fair_mutex(int threads, int seconds)
+{
+    int i;
+
+    assert(mutex_head == -1);
+    counter = 0;
+    atomic_counter = 0;
+    now_stopping = false;
+
+    create_aio_contexts();
+    assert(threads <= NUM_CONTEXTS);
+    running = threads;
+    for (i = 0; i < threads; i++) {
+        Coroutine *co1 = qemu_coroutine_create(test_multi_fair_mutex_entry, NULL);
+        aio_co_schedule(ctx[i], co1);
+    }
+
+    g_usleep(seconds * 1000000);
+
+    atomic_mb_set(&now_stopping, true);
+    while (running > 0) {
+        g_usleep(100000);
+    }
+
+    join_aio_contexts();
+    g_test_message("%d iterations/second\n", counter / seconds);
+    g_assert_cmpint(counter, ==, atomic_counter);
+}
+
+static void test_multi_fair_mutex_1(void)
+{
+    test_multi_fair_mutex(NUM_CONTEXTS, 1);
+}
+
+static void test_multi_fair_mutex_10(void)
+{
+    test_multi_fair_mutex(NUM_CONTEXTS, 10);
+}
+#endif
+
+/* Same test with pthread mutexes, for performance comparison and
+ * portability. */
+
+static QemuMutex mutex;
+
+static void test_multi_mutex_entry(void *opaque)
+{
+    while (!atomic_mb_read(&now_stopping)) {
+        qemu_mutex_lock(&mutex);
+        counter++;
+        qemu_mutex_unlock(&mutex);
+        atomic_inc(&atomic_counter);
+    }
+    atomic_dec(&running);
+}
+
+static void test_multi_mutex(int threads, int seconds)
+{
+    int i;
+
+    qemu_mutex_init(&mutex);
+    counter = 0;
+    atomic_counter = 0;
+    now_stopping = false;
+
+    create_aio_contexts();
+    assert(threads <= NUM_CONTEXTS);
+    running = threads;
+    for (i = 0; i < threads; i++) {
+        Coroutine *co1 = qemu_coroutine_create(test_multi_mutex_entry, NULL);
+        aio_co_schedule(ctx[i], co1);
+    }
+
+    g_usleep(seconds * 1000000);
+
+    atomic_mb_set(&now_stopping, true);
+    while (running > 0) {
+        g_usleep(100000);
+    }
+
+    join_aio_contexts();
+    g_test_message("%d iterations/second\n", counter / seconds);
+    g_assert_cmpint(counter, ==, atomic_counter);
+}
+
+static void test_multi_mutex_1(void)
+{
+    test_multi_mutex(NUM_CONTEXTS, 1);
+}
+
+static void test_multi_mutex_10(void)
+{
+    test_multi_mutex(NUM_CONTEXTS, 10);
+}
+
 /* End of tests. */

 int main(int argc, char **argv)
@@ -XXX,XX +XXX,XX @@ int main(int argc, char **argv)
         g_test_add_func("/aio/multi/schedule", test_multi_co_schedule_1);
         g_test_add_func("/aio/multi/mutex/contended", test_multi_co_mutex_1);
         g_test_add_func("/aio/multi/mutex/handoff", test_multi_co_mutex_2_3);
+#ifdef CONFIG_LINUX
+        g_test_add_func("/aio/multi/mutex/mcs", test_multi_fair_mutex_1);
+#endif
+        g_test_add_func("/aio/multi/mutex/pthread", test_multi_mutex_1);
     } else {
         g_test_add_func("/aio/multi/schedule", test_multi_co_schedule_10);
         g_test_add_func("/aio/multi/mutex/contended", test_multi_co_mutex_10);
         g_test_add_func("/aio/multi/mutex/handoff", test_multi_co_mutex_2_30);
+#ifdef CONFIG_LINUX
+        g_test_add_func("/aio/multi/mutex/mcs", test_multi_fair_mutex_10);
+#endif
+        g_test_add_func("/aio/multi/mutex/pthread", test_multi_mutex_10);
     }

     return g_test_run();
 }
--
2.9.3

From: Sam Li <faithilikerun@gmail.com>

Since Linux doesn't have a user API to issue zone append operations to
zoned devices from user space, the file-posix driver is modified to add
zone append emulation using regular writes. To do this, the file-posix
driver tracks the write pointer (wp) location of every zone of the device
in an array of uint64_t. The most significant bit of each wp entry
indicates whether the zone is a conventional zone.

A zone's wp can change due to the following operations:
- zone reset: the wp moves to the start offset of that zone
- zone finish: the wp moves to the end location of that zone
- write to a zone
- zone append

Signed-off-by: Sam Li <faithilikerun@gmail.com>
Message-id: 20230508051510.177850-2-faithilikerun@gmail.com
[Fix errno propagation from handle_aiocb_zone_mgmt()
--Stefan]
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/block/block-common.h | 14 +++
 include/block/block_int-common.h | 5 +
 block/file-posix.c | 178 ++++++++++++++++++++++++++++++-
 3 files changed, 193 insertions(+), 4 deletions(-)

diff --git a/include/block/block-common.h b/include/block/block-common.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/block-common.h
+++ b/include/block/block-common.h
@@ -XXX,XX +XXX,XX @@ typedef struct BlockZoneDescriptor {
     BlockZoneState state;
 } BlockZoneDescriptor;

+/*
+ * Track write pointers of a zone in bytes.
+ */
+typedef struct BlockZoneWps {
+    CoMutex colock;
+    uint64_t wp[];
+} BlockZoneWps;
+
 typedef struct BlockDriverInfo {
     /* in bytes, 0 if irrelevant */
     int cluster_size;
@@ -XXX,XX +XXX,XX @@ typedef enum {
 #define BDRV_SECTOR_BITS   9
 #define BDRV_SECTOR_SIZE   (1ULL << BDRV_SECTOR_BITS)

+/*
+ * Get the first most significant bit of wp. If it is zero, then
+ * the zone type is SWR.
+ */
+#define BDRV_ZT_IS_CONV(wp)    (wp & (1ULL << 63))
+
 #define BDRV_REQUEST_MAX_SECTORS MIN_CONST(SIZE_MAX >> BDRV_SECTOR_BITS, \
                                            INT_MAX >> BDRV_SECTOR_BITS)
 #define BDRV_REQUEST_MAX_BYTES (BDRV_REQUEST_MAX_SECTORS << BDRV_SECTOR_BITS)
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -XXX,XX +XXX,XX @@ typedef struct BlockLimits {

     /* maximum number of active zones */
     uint32_t max_active_zones;
+
+    uint32_t write_granularity;
 } BlockLimits;

 typedef struct BdrvOpBlocker BdrvOpBlocker;
@@ -XXX,XX +XXX,XX @@ struct BlockDriverState {
     CoMutex bsc_modify_lock;
     /* Always non-NULL, but must only be dereferenced under an RCU read guard */
     BdrvBlockStatusCache *block_status_cache;
+
+    /* array of write pointers' location of each zone in the zoned device. */
+    BlockZoneWps *wps;
 };

 struct BlockBackendRootState {
diff --git a/block/file-posix.c b/block/file-posix.c
index XXXXXXX..XXXXXXX 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -XXX,XX +XXX,XX @@ static int hdev_get_max_segments(int fd, struct stat *st)
 }

 #if defined(CONFIG_BLKZONED)
+/*
+ * If the reset_all flag is true, then the wps of zones whose state is
+ * not readonly or offline should all be reset to the start sector.
+ * Else, take the real wp of the device.
+ */
+static int get_zones_wp(BlockDriverState *bs, int fd, int64_t offset,
+                        unsigned int nrz, bool reset_all)
+{
+    struct blk_zone *blkz;
+    size_t rep_size;
+    uint64_t sector = offset >> BDRV_SECTOR_BITS;
+    BlockZoneWps *wps = bs->wps;
+    unsigned int j = offset / bs->bl.zone_size;
+    unsigned int n = 0, i = 0;
+    int ret;
+    rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct blk_zone);
+    g_autofree struct blk_zone_report *rep = NULL;
+
+    rep = g_malloc(rep_size);
+    blkz = (struct blk_zone *)(rep + 1);
+    while (n < nrz) {
+        memset(rep, 0, rep_size);
+        rep->sector = sector;
+        rep->nr_zones = nrz - n;
+
+        do {
+            ret = ioctl(fd, BLKREPORTZONE, rep);
+        } while (ret != 0 && errno == EINTR);
+        if (ret != 0) {
+            error_report("%d: ioctl BLKREPORTZONE at %" PRId64 " failed %d",
+                         fd, offset, errno);
+            return -errno;
+        }
+
+        if (!rep->nr_zones) {
+            break;
+        }
+
+        for (i = 0; i < rep->nr_zones; ++i, ++n, ++j) {
+            /*
+             * The wp tracking cares only about sequential writes required and
+             * sequential write preferred zones so that the wp can advance to
+             * the right location.
+             * Use the most significant bit of the wp location to indicate the
+             * zone type: 0 for SWR/SWP zones and 1 for conventional zones.
+             */
+            if (blkz[i].type == BLK_ZONE_TYPE_CONVENTIONAL) {
+                wps->wp[j] |= 1ULL << 63;
+            } else {
+                switch(blkz[i].cond) {
+                case BLK_ZONE_COND_FULL:
+                case BLK_ZONE_COND_READONLY:
+                    /* Zone not writable */
+                    wps->wp[j] = (blkz[i].start + blkz[i].len) << BDRV_SECTOR_BITS;
+                    break;
+                case BLK_ZONE_COND_OFFLINE:
+                    /* Zone not writable nor readable */
+                    wps->wp[j] = (blkz[i].start) << BDRV_SECTOR_BITS;
+                    break;
+                default:
+                    if (reset_all) {
+                        wps->wp[j] = blkz[i].start << BDRV_SECTOR_BITS;
+                    } else {
+                        wps->wp[j] = blkz[i].wp << BDRV_SECTOR_BITS;
+                    }
+                    break;
+                }
+            }
+        }
+        sector = blkz[i - 1].start + blkz[i - 1].len;
+    }
+
+    return 0;
+}
+
+static void update_zones_wp(BlockDriverState *bs, int fd, int64_t offset,
+                            unsigned int nrz)
+{
+    if (get_zones_wp(bs, fd, offset, nrz, 0) < 0) {
+        error_report("update zone wp failed");
+    }
+}
+
 static void raw_refresh_zoned_limits(BlockDriverState *bs, struct stat *st,
                                      Error **errp)
 {
+    BDRVRawState *s = bs->opaque;
     BlockZoneModel zoned;
     int ret;

@@ -XXX,XX +XXX,XX @@ static void raw_refresh_zoned_limits(BlockDriverState *bs, struct stat *st,
     if (ret > 0) {
         bs->bl.max_append_sectors = ret >> BDRV_SECTOR_BITS;
     }
+
+    ret = get_sysfs_long_val(st, "physical_block_size");
+    if (ret >= 0) {
+        bs->bl.write_granularity = ret;
+    }
+
+    /* The refresh_limits() function can be called multiple times. */
+    g_free(bs->wps);
+    bs->wps = g_malloc(sizeof(BlockZoneWps) +
+                       sizeof(int64_t) * bs->bl.nr_zones);
+    ret = get_zones_wp(bs, s->fd, 0, bs->bl.nr_zones, 0);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret, "report wps failed");
+        bs->wps = NULL;
+        return;
+    }
+    qemu_co_mutex_init(&bs->wps->colock);
 }
 #else /* !defined(CONFIG_BLKZONED) */
 static void raw_refresh_zoned_limits(BlockDriverState *bs, struct stat *st,
@@ -XXX,XX +XXX,XX @@ static int handle_aiocb_zone_mgmt(void *opaque)
         ret = ioctl(fd, aiocb->zone_mgmt.op, &range);
     } while (ret != 0 && errno == EINTR);

-    return ret;
+    return ret < 0 ? -errno : ret;
 }
 #endif

@@ -XXX,XX +XXX,XX @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, uint64_t offset,
 {
     BDRVRawState *s = bs->opaque;
     RawPosixAIOData acb;
+    int ret;

     if (fd_open(bs) < 0)
         return -EIO;
+#if defined(CONFIG_BLKZONED)
+    if (type & QEMU_AIO_WRITE && bs->wps) {
+        qemu_co_mutex_lock(&bs->wps->colock);
+    }
+#endif

     /*
      * When using O_DIRECT, the request must be aligned to be able to use
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, uint64_t offset,
 #ifdef CONFIG_LINUX_IO_URING
     } else if (s->use_linux_io_uring) {
         assert(qiov->size == bytes);
-        return luring_co_submit(bs, s->fd, offset, qiov, type);
+        ret = luring_co_submit(bs, s->fd, offset, qiov, type);
+        goto out;
 #endif
 #ifdef CONFIG_LINUX_AIO
     } else if (s->use_linux_aio) {
         assert(qiov->size == bytes);
-        return laio_co_submit(s->fd, offset, qiov, type, s->aio_max_batch);
+        ret = laio_co_submit(s->fd, offset, qiov, type,
+                             s->aio_max_batch);
+        goto out;
 #endif
     }

@@ -XXX,XX +XXX,XX @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, uint64_t offset,
     };

     assert(qiov->size == bytes);
-    return raw_thread_pool_submit(handle_aiocb_rw, &acb);
+    ret = raw_thread_pool_submit(handle_aiocb_rw, &acb);
+    goto out; /* Avoid the compiler err of unused label */
+
+out:
+#if defined(CONFIG_BLKZONED)
+{
+    BlockZoneWps *wps = bs->wps;
+    if (ret == 0) {
+        if (type & QEMU_AIO_WRITE && wps && bs->bl.zone_size) {
+            uint64_t *wp = &wps->wp[offset / bs->bl.zone_size];
+            if (!BDRV_ZT_IS_CONV(*wp)) {
+                /* Advance the wp if needed */
+                if (offset + bytes > *wp) {
+                    *wp = offset + bytes;
+                }
+            }
+        }
+    } else {
+        if (type & QEMU_AIO_WRITE) {
+            update_zones_wp(bs, s->fd, 0, 1);
+        }
+    }
+
+    if (type & QEMU_AIO_WRITE && wps) {
+        qemu_co_mutex_unlock(&wps->colock);
+    }
+}
+#endif
+    return ret;
 }

 static int coroutine_fn raw_co_preadv(BlockDriverState *bs, int64_t offset,
@@ -XXX,XX +XXX,XX @@ static void raw_close(BlockDriverState *bs)
     BDRVRawState *s = bs->opaque;

     if (s->fd >= 0) {
+#if defined(CONFIG_BLKZONED)
+        g_free(bs->wps);
+#endif
         qemu_close(s->fd);
         s->fd = -1;
     }
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
     const char *op_name;
     unsigned long zo;
     int ret;
+    BlockZoneWps *wps = bs->wps;
     int64_t capacity = bs->total_sectors << BDRV_SECTOR_BITS;

     zone_size = bs->bl.zone_size;
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
         return -EINVAL;
     }

+    uint32_t i = offset / bs->bl.zone_size;
+    uint32_t nrz = len / bs->bl.zone_size;
+    uint64_t *wp = &wps->wp[i];
+    if (BDRV_ZT_IS_CONV(*wp) && len != capacity) {
+        error_report("zone mgmt operations are not allowed for conventional zones");
+        return -EIO;
+    }
+
     switch (op) {
     case BLK_ZO_OPEN:
         op_name = "BLKOPENZONE";
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
                         len >> BDRV_SECTOR_BITS);
     ret = raw_thread_pool_submit(handle_aiocb_zone_mgmt, &acb);
     if (ret != 0) {
+        update_zones_wp(bs, s->fd, offset, i);
322
error_report("ioctl %s failed %d", op_name, ret);
323
+ return ret;
324
+ }
325
+
326
+ if (zo == BLKRESETZONE && len == capacity) {
327
+ ret = get_zones_wp(bs, s->fd, 0, bs->bl.nr_zones, 1);
328
+ if (ret < 0) {
329
+ error_report("reporting single wp failed");
330
+ return ret;
331
+ }
332
+ } else if (zo == BLKRESETZONE) {
333
+ for (unsigned int j = 0; j < nrz; ++j) {
334
+ wp[j] = offset + j * zone_size;
335
+ }
336
+ } else if (zo == BLKFINISHZONE) {
337
+ for (unsigned int j = 0; j < nrz; ++j) {
338
+ /* The zoned device allows the last zone to be smaller than the
339
+ * zone size. */
340
+ wp[j] = MIN(offset + (j + 1) * zone_size, offset + len);
341
+ }
342
}
343
344
return ret;
205
--
345
--
206
2.9.3
346
2.40.1
207
208
1
From: Paolo Bonzini <pbonzini@redhat.com>
1
From: Sam Li <faithilikerun@gmail.com>
2
2
3
A zone append command is a write operation that specifies the first
4
logical block of a zone as the write position. When writing to a zoned
5
block device using zone append, the byte offset of the call may point at
6
any position within the zone to which the data is being appended. Upon
7
completion the device will respond with the position where the data has
8
been written in the zone.
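
A minimal sketch of these semantics (a toy model with a hypothetical zone struct, not the QEMU implementation):

#include <stdint.h>
#include <string.h>

struct zone {
    uint64_t start;            /* first byte of the zone */
    uint64_t wp;               /* current write pointer */
    uint8_t  data[1 << 20];    /* zone payload */
};

/* Append len bytes: the device, not the caller, picks the write position. */
static uint64_t zone_append(struct zone *z, const void *buf, uint64_t len)
{
    uint64_t pos = z->wp;                        /* data lands at the wp */
    memcpy(&z->data[pos - z->start], buf, len);
    z->wp += len;                                /* wp advances past it */
    return pos;                                  /* reported on completion */
}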
9
10
Signed-off-by: Sam Li <faithilikerun@gmail.com>
11
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
3
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
12
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
4
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
13
Message-id: 20230508051510.177850-3-faithilikerun@gmail.com
5
Reviewed-by: Fam Zheng <famz@redhat.com>
6
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
7
Message-id: 20170213135235.12274-15-pbonzini@redhat.com
8
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
14
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
9
---
15
---
10
block/archipelago.c | 3 +++
16
include/block/block-io.h | 4 ++
11
block/blkreplay.c | 2 +-
17
include/block/block_int-common.h | 3 ++
12
block/block-backend.c | 6 ++++++
18
include/block/raw-aio.h | 4 +-
13
block/curl.c | 26 ++++++++++++++++++--------
19
include/sysemu/block-backend-io.h | 9 +++++
14
block/gluster.c | 9 +--------
20
block/block-backend.c | 61 +++++++++++++++++++++++++++++++
15
block/io.c | 6 +++++-
21
block/file-posix.c | 58 +++++++++++++++++++++++++----
16
block/iscsi.c | 6 +++++-
22
block/io.c | 27 ++++++++++++++
17
block/linux-aio.c | 15 +++++++++------
23
block/io_uring.c | 4 ++
18
block/nfs.c | 3 ++-
24
block/linux-aio.c | 3 ++
19
block/null.c | 4 ++++
25
block/raw-format.c | 8 ++++
20
block/qed.c | 3 +++
26
10 files changed, 173 insertions(+), 8 deletions(-)
21
block/rbd.c | 4 ++++
22
dma-helpers.c | 2 ++
23
hw/block/virtio-blk.c | 2 ++
24
hw/scsi/scsi-bus.c | 2 ++
25
util/async.c | 4 ++--
26
util/thread-pool.c | 2 ++
27
17 files changed, 71 insertions(+), 28 deletions(-)
28
27
29
diff --git a/block/archipelago.c b/block/archipelago.c
28
diff --git a/include/block/block-io.h b/include/block/block-io.h
30
index XXXXXXX..XXXXXXX 100644
29
index XXXXXXX..XXXXXXX 100644
31
--- a/block/archipelago.c
30
--- a/include/block/block-io.h
32
+++ b/block/archipelago.c
31
+++ b/include/block/block-io.h
33
@@ -XXX,XX +XXX,XX @@ static void qemu_archipelago_complete_aio(void *opaque)
32
@@ -XXX,XX +XXX,XX @@ int coroutine_fn GRAPH_RDLOCK bdrv_co_zone_report(BlockDriverState *bs,
34
{
33
int coroutine_fn GRAPH_RDLOCK bdrv_co_zone_mgmt(BlockDriverState *bs,
35
AIORequestData *reqdata = (AIORequestData *) opaque;
34
BlockZoneOp op,
36
ArchipelagoAIOCB *aio_cb = (ArchipelagoAIOCB *) reqdata->aio_cb;
35
int64_t offset, int64_t len);
37
+ AioContext *ctx = bdrv_get_aio_context(aio_cb->common.bs);
36
+int coroutine_fn GRAPH_RDLOCK bdrv_co_zone_append(BlockDriverState *bs,
38
37
+ int64_t *offset,
39
+ aio_context_acquire(ctx);
38
+ QEMUIOVector *qiov,
40
aio_cb->common.cb(aio_cb->common.opaque, aio_cb->ret);
39
+ BdrvRequestFlags flags);
41
+ aio_context_release(ctx);
40
42
aio_cb->status = 0;
41
bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs);
43
42
int bdrv_block_status(BlockDriverState *bs, int64_t offset,
44
qemu_aio_unref(aio_cb);
43
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
45
diff --git a/block/blkreplay.c b/block/blkreplay.c
44
index XXXXXXX..XXXXXXX 100644
46
index XXXXXXX..XXXXXXX 100755
45
--- a/include/block/block_int-common.h
47
--- a/block/blkreplay.c
46
+++ b/include/block/block_int-common.h
48
+++ b/block/blkreplay.c
47
@@ -XXX,XX +XXX,XX @@ struct BlockDriver {
49
@@ -XXX,XX +XXX,XX @@ static int64_t blkreplay_getlength(BlockDriverState *bs)
48
BlockZoneDescriptor *zones);
50
static void blkreplay_bh_cb(void *opaque)
49
int coroutine_fn (*bdrv_co_zone_mgmt)(BlockDriverState *bs, BlockZoneOp op,
51
{
50
int64_t offset, int64_t len);
52
Request *req = opaque;
51
+ int coroutine_fn (*bdrv_co_zone_append)(BlockDriverState *bs,
53
- qemu_coroutine_enter(req->co);
52
+ int64_t *offset, QEMUIOVector *qiov,
54
+ aio_co_wake(req->co);
53
+ BdrvRequestFlags flags);
55
qemu_bh_delete(req->bh);
54
56
g_free(req);
55
/* removable device specific */
57
}
56
bool coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_co_is_inserted)(
57
diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
58
index XXXXXXX..XXXXXXX 100644
59
--- a/include/block/raw-aio.h
60
+++ b/include/block/raw-aio.h
61
@@ -XXX,XX +XXX,XX @@
62
#define QEMU_AIO_TRUNCATE 0x0080
63
#define QEMU_AIO_ZONE_REPORT 0x0100
64
#define QEMU_AIO_ZONE_MGMT 0x0200
65
+#define QEMU_AIO_ZONE_APPEND 0x0400
66
#define QEMU_AIO_TYPE_MASK \
67
(QEMU_AIO_READ | \
68
QEMU_AIO_WRITE | \
69
@@ -XXX,XX +XXX,XX @@
70
QEMU_AIO_COPY_RANGE | \
71
QEMU_AIO_TRUNCATE | \
72
QEMU_AIO_ZONE_REPORT | \
73
- QEMU_AIO_ZONE_MGMT)
74
+ QEMU_AIO_ZONE_MGMT | \
75
+ QEMU_AIO_ZONE_APPEND)
76
77
/* AIO flags */
78
#define QEMU_AIO_MISALIGNED 0x1000
79
diff --git a/include/sysemu/block-backend-io.h b/include/sysemu/block-backend-io.h
80
index XXXXXXX..XXXXXXX 100644
81
--- a/include/sysemu/block-backend-io.h
82
+++ b/include/sysemu/block-backend-io.h
83
@@ -XXX,XX +XXX,XX @@ BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t offset,
84
BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
85
int64_t offset, int64_t len,
86
BlockCompletionFunc *cb, void *opaque);
87
+BlockAIOCB *blk_aio_zone_append(BlockBackend *blk, int64_t *offset,
88
+ QEMUIOVector *qiov, BdrvRequestFlags flags,
89
+ BlockCompletionFunc *cb, void *opaque);
90
BlockAIOCB *blk_aio_pdiscard(BlockBackend *blk, int64_t offset, int64_t bytes,
91
BlockCompletionFunc *cb, void *opaque);
92
void blk_aio_cancel_async(BlockAIOCB *acb);
93
@@ -XXX,XX +XXX,XX @@ int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
94
int64_t offset, int64_t len);
95
int co_wrapper_mixed blk_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
96
int64_t offset, int64_t len);
97
+int coroutine_fn blk_co_zone_append(BlockBackend *blk, int64_t *offset,
98
+ QEMUIOVector *qiov,
99
+ BdrvRequestFlags flags);
100
+int co_wrapper_mixed blk_zone_append(BlockBackend *blk, int64_t *offset,
101
+ QEMUIOVector *qiov,
102
+ BdrvRequestFlags flags);
103
104
int co_wrapper_mixed blk_pdiscard(BlockBackend *blk, int64_t offset,
105
int64_t bytes);
58
diff --git a/block/block-backend.c b/block/block-backend.c
106
diff --git a/block/block-backend.c b/block/block-backend.c
59
index XXXXXXX..XXXXXXX 100644
107
index XXXXXXX..XXXXXXX 100644
60
--- a/block/block-backend.c
108
--- a/block/block-backend.c
61
+++ b/block/block-backend.c
109
+++ b/block/block-backend.c
62
@@ -XXX,XX +XXX,XX @@ int blk_make_zero(BlockBackend *blk, BdrvRequestFlags flags)
110
@@ -XXX,XX +XXX,XX @@ BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
63
static void error_callback_bh(void *opaque)
111
return &acb->common;
112
}
113
114
+static void coroutine_fn blk_aio_zone_append_entry(void *opaque)
115
+{
116
+ BlkAioEmAIOCB *acb = opaque;
117
+ BlkRwCo *rwco = &acb->rwco;
118
+
119
+ rwco->ret = blk_co_zone_append(rwco->blk, (int64_t *)(uintptr_t)acb->bytes,
120
+ rwco->iobuf, rwco->flags);
121
+ blk_aio_complete(acb);
122
+}
123
+
124
+BlockAIOCB *blk_aio_zone_append(BlockBackend *blk, int64_t *offset,
125
+ QEMUIOVector *qiov, BdrvRequestFlags flags,
126
+ BlockCompletionFunc *cb, void *opaque) {
127
+ BlkAioEmAIOCB *acb;
128
+ Coroutine *co;
129
+ IO_CODE();
130
+
131
+ blk_inc_in_flight(blk);
132
+ acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque);
133
+ acb->rwco = (BlkRwCo) {
134
+ .blk = blk,
135
+ .ret = NOT_DONE,
136
+ .flags = flags,
137
+ .iobuf = qiov,
138
+ };
139
+ acb->bytes = (int64_t)(uintptr_t)offset;
140
+ acb->has_returned = false;
141
+
142
+ co = qemu_coroutine_create(blk_aio_zone_append_entry, acb);
143
+ aio_co_enter(blk_get_aio_context(blk), co);
144
+ acb->has_returned = true;
145
+ if (acb->rwco.ret != NOT_DONE) {
146
+ replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
147
+ blk_aio_complete_bh, acb);
148
+ }
149
+
150
+ return &acb->common;
151
+}
152
+
153
/*
154
* Send a zone_report command.
155
* offset is a byte offset from the start of the device. No alignment
156
@@ -XXX,XX +XXX,XX @@ int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
157
return ret;
158
}
159
160
+/*
161
+ * Send a zone_append command.
162
+ */
163
+int coroutine_fn blk_co_zone_append(BlockBackend *blk, int64_t *offset,
164
+ QEMUIOVector *qiov, BdrvRequestFlags flags)
165
+{
166
+ int ret;
167
+ IO_CODE();
168
+
169
+ blk_inc_in_flight(blk);
170
+ blk_wait_while_drained(blk);
171
+ GRAPH_RDLOCK_GUARD();
172
+ if (!blk_is_available(blk)) {
173
+ blk_dec_in_flight(blk);
174
+ return -ENOMEDIUM;
175
+ }
176
+
177
+ ret = bdrv_co_zone_append(blk_bs(blk), offset, qiov, flags);
178
+ blk_dec_in_flight(blk);
179
+ return ret;
180
+}
181
+
182
void blk_drain(BlockBackend *blk)
64
{
183
{
65
struct BlockBackendAIOCB *acb = opaque;
184
BlockDriverState *bs = blk_bs(blk);
66
+ AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
185
diff --git a/block/file-posix.c b/block/file-posix.c
67
186
index XXXXXXX..XXXXXXX 100644
68
bdrv_dec_in_flight(acb->common.bs);
187
--- a/block/file-posix.c
69
+ aio_context_acquire(ctx);
188
+++ b/block/file-posix.c
70
acb->common.cb(acb->common.opaque, acb->ret);
189
@@ -XXX,XX +XXX,XX @@ typedef struct BDRVRawState {
71
+ aio_context_release(ctx);
190
bool has_write_zeroes:1;
72
qemu_aio_unref(acb);
191
bool use_linux_aio:1;
73
}
192
bool use_linux_io_uring:1;
74
193
+ int64_t *offset; /* offset of zone append operation */
75
@@ -XXX,XX +XXX,XX @@ static void blk_aio_complete(BlkAioEmAIOCB *acb)
194
int page_cache_inconsistent; /* errno from fdatasync failure */
76
static void blk_aio_complete_bh(void *opaque)
195
bool has_fallocate;
196
bool needs_alignment;
197
@@ -XXX,XX +XXX,XX @@ static ssize_t handle_aiocb_rw_vector(RawPosixAIOData *aiocb)
198
ssize_t len;
199
200
len = RETRY_ON_EINTR(
201
- (aiocb->aio_type & QEMU_AIO_WRITE) ?
202
+ (aiocb->aio_type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) ?
203
qemu_pwritev(aiocb->aio_fildes,
204
aiocb->io.iov,
205
aiocb->io.niov,
206
@@ -XXX,XX +XXX,XX @@ static ssize_t handle_aiocb_rw_linear(RawPosixAIOData *aiocb, char *buf)
207
ssize_t len;
208
209
while (offset < aiocb->aio_nbytes) {
210
- if (aiocb->aio_type & QEMU_AIO_WRITE) {
211
+ if (aiocb->aio_type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) {
212
len = pwrite(aiocb->aio_fildes,
213
(const char *)buf + offset,
214
aiocb->aio_nbytes - offset,
215
@@ -XXX,XX +XXX,XX @@ static int handle_aiocb_rw(void *opaque)
216
}
217
218
nbytes = handle_aiocb_rw_linear(aiocb, buf);
219
- if (!(aiocb->aio_type & QEMU_AIO_WRITE)) {
220
+ if (!(aiocb->aio_type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND))) {
221
char *p = buf;
222
size_t count = aiocb->aio_nbytes, copy;
223
int i;
224
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, uint64_t offset,
225
if (fd_open(bs) < 0)
226
return -EIO;
227
#if defined(CONFIG_BLKZONED)
228
- if (type & QEMU_AIO_WRITE && bs->wps) {
229
+ if ((type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) && bs->wps) {
230
qemu_co_mutex_lock(&bs->wps->colock);
231
+ if (type & QEMU_AIO_ZONE_APPEND && bs->bl.zone_size) {
232
+ int index = offset / bs->bl.zone_size;
233
+ offset = bs->wps->wp[index];
234
+ }
235
}
236
#endif
237
238
@@ -XXX,XX +XXX,XX @@ out:
77
{
239
{
78
BlkAioEmAIOCB *acb = opaque;
240
BlockZoneWps *wps = bs->wps;
79
+ AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
241
if (ret == 0) {
80
242
- if (type & QEMU_AIO_WRITE && wps && bs->bl.zone_size) {
81
assert(acb->has_returned);
243
+ if ((type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND))
82
+ aio_context_acquire(ctx);
244
+ && wps && bs->bl.zone_size) {
83
blk_aio_complete(acb);
245
uint64_t *wp = &wps->wp[offset / bs->bl.zone_size];
84
+ aio_context_release(ctx);
246
if (!BDRV_ZT_IS_CONV(*wp)) {
85
}
247
+ if (type & QEMU_AIO_ZONE_APPEND) {
86
248
+ *s->offset = *wp;
87
static BlockAIOCB *blk_aio_prwv(BlockBackend *blk, int64_t offset, int bytes,
249
+ }
88
diff --git a/block/curl.c b/block/curl.c
250
/* Advance the wp if needed */
89
index XXXXXXX..XXXXXXX 100644
251
if (offset + bytes > *wp) {
90
--- a/block/curl.c
252
*wp = offset + bytes;
91
+++ b/block/curl.c
253
@@ -XXX,XX +XXX,XX @@ out:
92
@@ -XXX,XX +XXX,XX @@ static void curl_readv_bh_cb(void *p)
254
}
93
{
255
}
94
CURLState *state;
256
} else {
95
int running;
257
- if (type & QEMU_AIO_WRITE) {
96
+ int ret = -EINPROGRESS;
258
+ if (type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) {
97
259
update_zones_wp(bs, s->fd, 0, 1);
98
CURLAIOCB *acb = p;
260
}
99
- BDRVCURLState *s = acb->common.bs->opaque;
100
+ BlockDriverState *bs = acb->common.bs;
101
+ BDRVCURLState *s = bs->opaque;
102
+ AioContext *ctx = bdrv_get_aio_context(bs);
103
104
size_t start = acb->sector_num * BDRV_SECTOR_SIZE;
105
size_t end;
106
107
+ aio_context_acquire(ctx);
108
+
109
// In case we have the requested data already (e.g. read-ahead),
110
// we can just call the callback and be done.
111
switch (curl_find_buf(s, start, acb->nb_sectors * BDRV_SECTOR_SIZE, acb)) {
112
@@ -XXX,XX +XXX,XX @@ static void curl_readv_bh_cb(void *p)
113
qemu_aio_unref(acb);
114
// fall through
115
case FIND_RET_WAIT:
116
- return;
117
+ goto out;
118
default:
119
break;
120
}
261
}
121
@@ -XXX,XX +XXX,XX @@ static void curl_readv_bh_cb(void *p)
262
122
// No cache found, so let's start a new request
263
- if (type & QEMU_AIO_WRITE && wps) {
123
state = curl_init_state(acb->common.bs, s);
264
+ if ((type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) && wps) {
124
if (!state) {
265
qemu_co_mutex_unlock(&wps->colock);
125
- acb->common.cb(acb->common.opaque, -EIO);
126
- qemu_aio_unref(acb);
127
- return;
128
+ ret = -EIO;
129
+ goto out;
130
}
266
}
131
267
}
132
acb->start = 0;
268
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
133
@@ -XXX,XX +XXX,XX @@ static void curl_readv_bh_cb(void *p)
269
}
134
state->orig_buf = g_try_malloc(state->buf_len);
270
#endif
135
if (state->buf_len && state->orig_buf == NULL) {
271
136
curl_clean_state(state);
272
+#if defined(CONFIG_BLKZONED)
137
- acb->common.cb(acb->common.opaque, -ENOMEM);
273
+static int coroutine_fn raw_co_zone_append(BlockDriverState *bs,
138
- qemu_aio_unref(acb);
274
+ int64_t *offset,
139
- return;
275
+ QEMUIOVector *qiov,
140
+ ret = -ENOMEM;
276
+ BdrvRequestFlags flags) {
141
+ goto out;
277
+ assert(flags == 0);
142
}
278
+ int64_t zone_size_mask = bs->bl.zone_size - 1;
143
state->acb[0] = acb;
279
+ int64_t iov_len = 0;
144
280
+ int64_t len = 0;
145
@@ -XXX,XX +XXX,XX @@ static void curl_readv_bh_cb(void *p)
281
+ BDRVRawState *s = bs->opaque;
146
282
+ s->offset = offset;
147
/* Tell curl it needs to kick things off */
283
+
148
curl_multi_socket_action(s->multi, CURL_SOCKET_TIMEOUT, 0, &running);
284
+ if (*offset & zone_size_mask) {
149
+
285
+ error_report("sector offset %" PRId64 " is not aligned to zone size "
150
+out:
286
+ "%" PRId32 "", *offset / 512, bs->bl.zone_size / 512);
151
+ if (ret != -EINPROGRESS) {
287
+ return -EINVAL;
152
+ acb->common.cb(acb->common.opaque, ret);
288
+ }
153
+ qemu_aio_unref(acb);
289
+
154
+ }
290
+ int64_t wg = bs->bl.write_granularity;
155
+ aio_context_release(ctx);
291
+ int64_t wg_mask = wg - 1;
156
}
292
+ for (int i = 0; i < qiov->niov; i++) {
157
293
+ iov_len = qiov->iov[i].iov_len;
158
static BlockAIOCB *curl_aio_readv(BlockDriverState *bs,
294
+ if (iov_len & wg_mask) {
159
diff --git a/block/gluster.c b/block/gluster.c
295
+ error_report("len of IOVector[%d] %" PRId64 " is not aligned to "
160
index XXXXXXX..XXXXXXX 100644
296
+ "block size %" PRId64 "", i, iov_len, wg);
161
--- a/block/gluster.c
297
+ return -EINVAL;
162
+++ b/block/gluster.c
298
+ }
163
@@ -XXX,XX +XXX,XX @@ static struct glfs *qemu_gluster_init(BlockdevOptionsGluster *gconf,
299
+ len += iov_len;
164
return qemu_gluster_glfs_init(gconf, errp);
300
+ }
165
}
301
+
166
302
+ return raw_co_prw(bs, *offset, len, qiov, QEMU_AIO_ZONE_APPEND);
167
-static void qemu_gluster_complete_aio(void *opaque)
303
+}
168
-{
304
+#endif
169
- GlusterAIOCB *acb = (GlusterAIOCB *)opaque;
305
+
170
-
306
static coroutine_fn int
171
- qemu_coroutine_enter(acb->coroutine);
307
raw_do_pdiscard(BlockDriverState *bs, int64_t offset, int64_t bytes,
172
-}
308
bool blkdev)
173
-
309
@@ -XXX,XX +XXX,XX @@ static BlockDriver bdrv_host_device = {
174
/*
310
/* zone management operations */
175
* AIO callback routine called from GlusterFS thread.
311
.bdrv_co_zone_report = raw_co_zone_report,
176
*/
312
.bdrv_co_zone_mgmt = raw_co_zone_mgmt,
177
@@ -XXX,XX +XXX,XX @@ static void gluster_finish_aiocb(struct glfs_fd *fd, ssize_t ret, void *arg)
313
+ .bdrv_co_zone_append = raw_co_zone_append,
178
acb->ret = -EIO; /* Partial read/write - fail it */
314
#endif
179
}
315
};
180
316
181
- aio_bh_schedule_oneshot(acb->aio_context, qemu_gluster_complete_aio, acb);
182
+ aio_co_schedule(acb->aio_context, acb->coroutine);
183
}
184
185
static void qemu_gluster_parse_flags(int bdrv_flags, int *open_flags)
186
diff --git a/block/io.c b/block/io.c
317
diff --git a/block/io.c b/block/io.c
187
index XXXXXXX..XXXXXXX 100644
318
index XXXXXXX..XXXXXXX 100644
188
--- a/block/io.c
319
--- a/block/io.c
189
+++ b/block/io.c
320
+++ b/block/io.c
190
@@ -XXX,XX +XXX,XX @@ static void bdrv_co_drain_bh_cb(void *opaque)
321
@@ -XXX,XX +XXX,XX @@ out:
191
bdrv_dec_in_flight(bs);
322
return co.ret;
192
bdrv_drained_begin(bs);
323
}
193
data->done = true;
324
194
- qemu_coroutine_enter(co);
325
+int coroutine_fn bdrv_co_zone_append(BlockDriverState *bs, int64_t *offset,
195
+ aio_co_wake(co);
326
+ QEMUIOVector *qiov,
196
}
327
+ BdrvRequestFlags flags)
197
328
+{
198
static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs)
329
+ int ret;
199
@@ -XXX,XX +XXX,XX @@ static void bdrv_co_complete(BlockAIOCBCoroutine *acb)
330
+ BlockDriver *drv = bs->drv;
200
static void bdrv_co_em_bh(void *opaque)
331
+ CoroutineIOCompletion co = {
332
+ .coroutine = qemu_coroutine_self(),
333
+ };
334
+ IO_CODE();
335
+
336
+ ret = bdrv_check_qiov_request(*offset, qiov->size, qiov, 0, NULL);
337
+ if (ret < 0) {
338
+ return ret;
339
+ }
340
+
341
+ bdrv_inc_in_flight(bs);
342
+ if (!drv || !drv->bdrv_co_zone_append || bs->bl.zoned == BLK_Z_NONE) {
343
+ co.ret = -ENOTSUP;
344
+ goto out;
345
+ }
346
+ co.ret = drv->bdrv_co_zone_append(bs, offset, qiov, flags);
347
+out:
348
+ bdrv_dec_in_flight(bs);
349
+ return co.ret;
350
+}
351
+
352
void *qemu_blockalign(BlockDriverState *bs, size_t size)
201
{
353
{
202
BlockAIOCBCoroutine *acb = opaque;
354
IO_CODE();
203
+ BlockDriverState *bs = acb->common.bs;
355
diff --git a/block/io_uring.c b/block/io_uring.c
204
+ AioContext *ctx = bdrv_get_aio_context(bs);
356
index XXXXXXX..XXXXXXX 100644
205
357
--- a/block/io_uring.c
206
assert(!acb->need_bh);
358
+++ b/block/io_uring.c
207
+ aio_context_acquire(ctx);
359
@@ -XXX,XX +XXX,XX @@ static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s,
208
bdrv_co_complete(acb);
360
io_uring_prep_writev(sqes, fd, luringcb->qiov->iov,
209
+ aio_context_release(ctx);
361
luringcb->qiov->niov, offset);
210
}
362
break;
211
363
+ case QEMU_AIO_ZONE_APPEND:
212
static void bdrv_co_maybe_schedule_bh(BlockAIOCBCoroutine *acb)
364
+ io_uring_prep_writev(sqes, fd, luringcb->qiov->iov,
213
diff --git a/block/iscsi.c b/block/iscsi.c
365
+ luringcb->qiov->niov, offset);
214
index XXXXXXX..XXXXXXX 100644
366
+ break;
215
--- a/block/iscsi.c
367
case QEMU_AIO_READ:
216
+++ b/block/iscsi.c
368
io_uring_prep_readv(sqes, fd, luringcb->qiov->iov,
217
@@ -XXX,XX +XXX,XX @@ static void
369
luringcb->qiov->niov, offset);
218
iscsi_bh_cb(void *p)
219
{
220
IscsiAIOCB *acb = p;
221
+ AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
222
223
qemu_bh_delete(acb->bh);
224
225
g_free(acb->buf);
226
acb->buf = NULL;
227
228
+ aio_context_acquire(ctx);
229
acb->common.cb(acb->common.opaque, acb->status);
230
+ aio_context_release(ctx);
231
232
if (acb->task != NULL) {
233
scsi_free_scsi_task(acb->task);
234
@@ -XXX,XX +XXX,XX @@ iscsi_schedule_bh(IscsiAIOCB *acb)
235
static void iscsi_co_generic_bh_cb(void *opaque)
236
{
237
struct IscsiTask *iTask = opaque;
238
+
239
iTask->complete = 1;
240
- qemu_coroutine_enter(iTask->co);
241
+ aio_co_wake(iTask->co);
242
}
243
244
static void iscsi_retry_timer_expired(void *opaque)
245
diff --git a/block/linux-aio.c b/block/linux-aio.c
370
diff --git a/block/linux-aio.c b/block/linux-aio.c
246
index XXXXXXX..XXXXXXX 100644
371
index XXXXXXX..XXXXXXX 100644
247
--- a/block/linux-aio.c
372
--- a/block/linux-aio.c
248
+++ b/block/linux-aio.c
373
+++ b/block/linux-aio.c
249
@@ -XXX,XX +XXX,XX @@ struct LinuxAioState {
374
@@ -XXX,XX +XXX,XX @@ static int laio_do_submit(int fd, struct qemu_laiocb *laiocb, off_t offset,
250
io_context_t ctx;
375
case QEMU_AIO_WRITE:
251
EventNotifier e;
376
io_prep_pwritev(iocbs, fd, qiov->iov, qiov->niov, offset);
252
377
break;
253
- /* io queue for submit at batch */
378
+ case QEMU_AIO_ZONE_APPEND:
254
+ /* io queue for submit at batch. Protected by AioContext lock. */
379
+ io_prep_pwritev(iocbs, fd, qiov->iov, qiov->niov, offset);
255
LaioQueue io_q;
380
+ break;
256
381
case QEMU_AIO_READ:
257
- /* I/O completion processing */
382
io_prep_preadv(iocbs, fd, qiov->iov, qiov->niov, offset);
258
+ /* I/O completion processing. Only runs in I/O thread. */
383
break;
259
QEMUBH *completion_bh;
384
diff --git a/block/raw-format.c b/block/raw-format.c
260
int event_idx;
385
index XXXXXXX..XXXXXXX 100644
261
int event_max;
386
--- a/block/raw-format.c
262
@@ -XXX,XX +XXX,XX @@ static inline ssize_t io_event_ret(struct io_event *ev)
387
+++ b/block/raw-format.c
263
*/
388
@@ -XXX,XX +XXX,XX @@ raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
264
static void qemu_laio_process_completion(struct qemu_laiocb *laiocb)
389
return bdrv_co_zone_mgmt(bs->file->bs, op, offset, len);
390
}
391
392
+static int coroutine_fn GRAPH_RDLOCK
393
+raw_co_zone_append(BlockDriverState *bs, int64_t *offset, QEMUIOVector *qiov,
394
+ BdrvRequestFlags flags)
395
+{
396
+ return bdrv_co_zone_append(bs->file->bs, offset, qiov, flags);
397
+}
398
+
399
static int64_t coroutine_fn GRAPH_RDLOCK
400
raw_co_getlength(BlockDriverState *bs)
265
{
401
{
266
+ LinuxAioState *s = laiocb->ctx;
402
@@ -XXX,XX +XXX,XX @@ BlockDriver bdrv_raw = {
267
int ret;
403
.bdrv_co_pdiscard = &raw_co_pdiscard,
268
404
.bdrv_co_zone_report = &raw_co_zone_report,
269
ret = laiocb->ret;
405
.bdrv_co_zone_mgmt = &raw_co_zone_mgmt,
270
@@ -XXX,XX +XXX,XX @@ static void qemu_laio_process_completion(struct qemu_laiocb *laiocb)
406
+ .bdrv_co_zone_append = &raw_co_zone_append,
271
}
407
.bdrv_co_block_status = &raw_co_block_status,
272
408
.bdrv_co_copy_range_from = &raw_co_copy_range_from,
273
laiocb->ret = ret;
409
.bdrv_co_copy_range_to = &raw_co_copy_range_to,
274
+ aio_context_acquire(s->aio_context);
275
if (laiocb->co) {
276
/* If the coroutine is already entered it must be in ioq_submit() and
277
* will notice laio->ret has been filled in when it eventually runs
278
@@ -XXX,XX +XXX,XX @@ static void qemu_laio_process_completion(struct qemu_laiocb *laiocb)
279
laiocb->common.cb(laiocb->common.opaque, ret);
280
qemu_aio_unref(laiocb);
281
}
282
+ aio_context_release(s->aio_context);
283
}
284
285
/**
286
@@ -XXX,XX +XXX,XX @@ static void qemu_laio_process_completions(LinuxAioState *s)
287
static void qemu_laio_process_completions_and_submit(LinuxAioState *s)
288
{
289
qemu_laio_process_completions(s);
290
+
291
+ aio_context_acquire(s->aio_context);
292
if (!s->io_q.plugged && !QSIMPLEQ_EMPTY(&s->io_q.pending)) {
293
ioq_submit(s);
294
}
295
+ aio_context_release(s->aio_context);
296
}
297
298
static void qemu_laio_completion_bh(void *opaque)
299
@@ -XXX,XX +XXX,XX @@ static void qemu_laio_completion_cb(EventNotifier *e)
300
LinuxAioState *s = container_of(e, LinuxAioState, e);
301
302
if (event_notifier_test_and_clear(&s->e)) {
303
- aio_context_acquire(s->aio_context);
304
qemu_laio_process_completions_and_submit(s);
305
- aio_context_release(s->aio_context);
306
}
307
}
308
309
@@ -XXX,XX +XXX,XX @@ static bool qemu_laio_poll_cb(void *opaque)
310
return false;
311
}
312
313
- aio_context_acquire(s->aio_context);
314
qemu_laio_process_completions_and_submit(s);
315
- aio_context_release(s->aio_context);
316
return true;
317
}
318
319
@@ -XXX,XX +XXX,XX @@ void laio_detach_aio_context(LinuxAioState *s, AioContext *old_context)
320
{
321
aio_set_event_notifier(old_context, &s->e, false, NULL, NULL);
322
qemu_bh_delete(s->completion_bh);
323
+ s->aio_context = NULL;
324
}
325
326
void laio_attach_aio_context(LinuxAioState *s, AioContext *new_context)
327
diff --git a/block/nfs.c b/block/nfs.c
328
index XXXXXXX..XXXXXXX 100644
329
--- a/block/nfs.c
330
+++ b/block/nfs.c
331
@@ -XXX,XX +XXX,XX @@ static void nfs_co_init_task(BlockDriverState *bs, NFSRPC *task)
332
static void nfs_co_generic_bh_cb(void *opaque)
333
{
334
NFSRPC *task = opaque;
335
+
336
task->complete = 1;
337
- qemu_coroutine_enter(task->co);
338
+ aio_co_wake(task->co);
339
}
340
341
static void
342
diff --git a/block/null.c b/block/null.c
343
index XXXXXXX..XXXXXXX 100644
344
--- a/block/null.c
345
+++ b/block/null.c
346
@@ -XXX,XX +XXX,XX @@ static const AIOCBInfo null_aiocb_info = {
347
static void null_bh_cb(void *opaque)
348
{
349
NullAIOCB *acb = opaque;
350
+ AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
351
+
352
+ aio_context_acquire(ctx);
353
acb->common.cb(acb->common.opaque, 0);
354
+ aio_context_release(ctx);
355
qemu_aio_unref(acb);
356
}
357
358
diff --git a/block/qed.c b/block/qed.c
359
index XXXXXXX..XXXXXXX 100644
360
--- a/block/qed.c
361
+++ b/block/qed.c
362
@@ -XXX,XX +XXX,XX @@ static void qed_update_l2_table(BDRVQEDState *s, QEDTable *table, int index,
363
static void qed_aio_complete_bh(void *opaque)
364
{
365
QEDAIOCB *acb = opaque;
366
+ BDRVQEDState *s = acb_to_s(acb);
367
BlockCompletionFunc *cb = acb->common.cb;
368
void *user_opaque = acb->common.opaque;
369
int ret = acb->bh_ret;
370
@@ -XXX,XX +XXX,XX @@ static void qed_aio_complete_bh(void *opaque)
371
qemu_aio_unref(acb);
372
373
/* Invoke callback */
374
+ qed_acquire(s);
375
cb(user_opaque, ret);
376
+ qed_release(s);
377
}
378
379
static void qed_aio_complete(QEDAIOCB *acb, int ret)
380
diff --git a/block/rbd.c b/block/rbd.c
381
index XXXXXXX..XXXXXXX 100644
382
--- a/block/rbd.c
383
+++ b/block/rbd.c
384
@@ -XXX,XX +XXX,XX @@ shutdown:
385
static void qemu_rbd_complete_aio(RADOSCB *rcb)
386
{
387
RBDAIOCB *acb = rcb->acb;
388
+ AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
389
int64_t r;
390
391
r = rcb->ret;
392
@@ -XXX,XX +XXX,XX @@ static void qemu_rbd_complete_aio(RADOSCB *rcb)
393
qemu_iovec_from_buf(acb->qiov, 0, acb->bounce, acb->qiov->size);
394
}
395
qemu_vfree(acb->bounce);
396
+
397
+ aio_context_acquire(ctx);
398
acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
399
+ aio_context_release(ctx);
400
401
qemu_aio_unref(acb);
402
}
403
diff --git a/dma-helpers.c b/dma-helpers.c
404
index XXXXXXX..XXXXXXX 100644
405
--- a/dma-helpers.c
406
+++ b/dma-helpers.c
407
@@ -XXX,XX +XXX,XX @@ static void dma_blk_cb(void *opaque, int ret)
408
QEMU_ALIGN_DOWN(dbs->iov.size, dbs->align));
409
}
410
411
+ aio_context_acquire(dbs->ctx);
412
dbs->acb = dbs->io_func(dbs->offset, &dbs->iov,
413
dma_blk_cb, dbs, dbs->io_func_opaque);
414
+ aio_context_release(dbs->ctx);
415
assert(dbs->acb);
416
}
417
418
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
419
index XXXXXXX..XXXXXXX 100644
420
--- a/hw/block/virtio-blk.c
421
+++ b/hw/block/virtio-blk.c
422
@@ -XXX,XX +XXX,XX @@ static void virtio_blk_dma_restart_bh(void *opaque)
423
424
s->rq = NULL;
425
426
+ aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
427
while (req) {
428
VirtIOBlockReq *next = req->next;
429
if (virtio_blk_handle_request(req, &mrb)) {
430
@@ -XXX,XX +XXX,XX @@ static void virtio_blk_dma_restart_bh(void *opaque)
431
if (mrb.num_reqs) {
432
virtio_blk_submit_multireq(s->blk, &mrb);
433
}
434
+ aio_context_release(blk_get_aio_context(s->conf.conf.blk));
435
}
436
437
static void virtio_blk_dma_restart_cb(void *opaque, int running,
438
diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
439
index XXXXXXX..XXXXXXX 100644
440
--- a/hw/scsi/scsi-bus.c
441
+++ b/hw/scsi/scsi-bus.c
442
@@ -XXX,XX +XXX,XX @@ static void scsi_dma_restart_bh(void *opaque)
443
qemu_bh_delete(s->bh);
444
s->bh = NULL;
445
446
+ aio_context_acquire(blk_get_aio_context(s->conf.blk));
447
QTAILQ_FOREACH_SAFE(req, &s->requests, next, next) {
448
scsi_req_ref(req);
449
if (req->retry) {
450
@@ -XXX,XX +XXX,XX @@ static void scsi_dma_restart_bh(void *opaque)
451
}
452
scsi_req_unref(req);
453
}
454
+ aio_context_release(blk_get_aio_context(s->conf.blk));
455
}
456
457
void scsi_req_retry(SCSIRequest *req)
458
diff --git a/util/async.c b/util/async.c
459
index XXXXXXX..XXXXXXX 100644
460
--- a/util/async.c
461
+++ b/util/async.c
462
@@ -XXX,XX +XXX,XX @@ int aio_bh_poll(AioContext *ctx)
463
ret = 1;
464
}
465
bh->idle = 0;
466
- aio_context_acquire(ctx);
467
aio_bh_call(bh);
468
- aio_context_release(ctx);
469
}
470
if (bh->deleted) {
471
deleted = true;
472
@@ -XXX,XX +XXX,XX @@ static void co_schedule_bh_cb(void *opaque)
473
Coroutine *co = QSLIST_FIRST(&straight);
474
QSLIST_REMOVE_HEAD(&straight, co_scheduled_next);
475
trace_aio_co_schedule_bh_cb(ctx, co);
476
+ aio_context_acquire(ctx);
477
qemu_coroutine_enter(co);
478
+ aio_context_release(ctx);
479
}
480
}
481
482
diff --git a/util/thread-pool.c b/util/thread-pool.c
483
index XXXXXXX..XXXXXXX 100644
484
--- a/util/thread-pool.c
485
+++ b/util/thread-pool.c
486
@@ -XXX,XX +XXX,XX @@ static void thread_pool_completion_bh(void *opaque)
487
ThreadPool *pool = opaque;
488
ThreadPoolElement *elem, *next;
489
490
+ aio_context_acquire(pool->ctx);
491
restart:
492
QLIST_FOREACH_SAFE(elem, &pool->head, all, next) {
493
if (elem->state != THREAD_DONE) {
494
@@ -XXX,XX +XXX,XX @@ restart:
495
qemu_aio_unref(elem);
496
}
497
}
498
+ aio_context_release(pool->ctx);
499
}
500
501
static void thread_pool_cancel(BlockAIOCB *acb)
502
--
410
--
503
2.9.3
411
2.40.1
504
505
1
From: Paolo Bonzini <pbonzini@redhat.com>
1
From: Sam Li <faithilikerun@gmail.com>
2
2
3
Support separate coroutines for reading and writing, and place the
3
The patch tests zone append writes by reporting the zone wp after
4
read/write handlers on the AioContext that the QIOChannel is registered
4
the completion of the call. "zap -p" option can print the sector
5
with.
5
offset value after completion, which should be the start sector
6
where the append write begins.
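
Two illustrative sketches for the patches described above; the function names come from the hunks in this series, everything else is assumed.

First, the QIOChannel move protocol documented in the include/io/channel.h hunk below, condensed into its call sequence (assumes the coroutine co has yielded in qio_channel_yield(), and that old_ctx/new_ctx are the source and destination AioContexts):

aio_context_acquire(old_ctx);
qio_channel_detach_aio_context(ioc);      /* stop I/O handlers on old_ctx */
aio_context_release(old_ctx);

aio_context_acquire(new_ctx);
qio_channel_attach_aio_context(ioc, new_ctx);
aio_co_schedule(new_ctx, co);             /* resume the coroutine there */
aio_context_release(new_ctx);

Second, what "zap -p" reports, seen from a caller of the coroutine API (a sketch assuming coroutine context and that blk and qiov are already set up):

int64_t offset = 0;   /* may point anywhere inside the target zone */
int ret = blk_co_zone_append(blk, &offset, &qiov, 0);
if (ret == 0) {
    /* offset was rewritten to the position the device chose;
     * "zap -p" prints it as a 512-byte sector number */
    printf("append sector: 0x%" PRIx64 "\n", offset >> BDRV_SECTOR_BITS);
}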
6
7
7
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
8
Signed-off-by: Sam Li <faithilikerun@gmail.com>
8
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
9
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
9
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
10
Message-id: 20230508051510.177850-4-faithilikerun@gmail.com
10
Reviewed-by: Fam Zheng <famz@redhat.com>
11
Message-id: 20170213135235.12274-7-pbonzini@redhat.com
12
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
11
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
13
---
12
---
14
include/io/channel.h | 47 ++++++++++++++++++++++++++--
13
qemu-io-cmds.c | 75 ++++++++++++++++++++++++++++++
15
io/channel.c | 86 +++++++++++++++++++++++++++++++++++++++-------------
14
tests/qemu-iotests/tests/zoned | 16 +++++++
16
2 files changed, 109 insertions(+), 24 deletions(-)
15
tests/qemu-iotests/tests/zoned.out | 16 +++++++
16
3 files changed, 107 insertions(+)
17
17
18
diff --git a/include/io/channel.h b/include/io/channel.h
18
diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
19
index XXXXXXX..XXXXXXX 100644
19
index XXXXXXX..XXXXXXX 100644
20
--- a/include/io/channel.h
20
--- a/qemu-io-cmds.c
21
+++ b/include/io/channel.h
21
+++ b/qemu-io-cmds.c
22
@@ -XXX,XX +XXX,XX @@
22
@@ -XXX,XX +XXX,XX @@ static const cmdinfo_t zone_reset_cmd = {
23
23
.oneline = "reset a zone write pointer in zone block device",
24
#include "qemu-common.h"
24
};
25
#include "qom/object.h"
25
26
+#include "qemu/coroutine.h"
26
+static int do_aio_zone_append(BlockBackend *blk, QEMUIOVector *qiov,
27
#include "block/aio.h"
27
+ int64_t *offset, int flags, int *total)
28
28
+{
29
#define TYPE_QIO_CHANNEL "qio-channel"
29
+ int async_ret = NOT_DONE;
30
@@ -XXX,XX +XXX,XX @@ struct QIOChannel {
31
Object parent;
32
unsigned int features; /* bitmask of QIOChannelFeatures */
33
char *name;
34
+ AioContext *ctx;
35
+ Coroutine *read_coroutine;
36
+ Coroutine *write_coroutine;
37
#ifdef _WIN32
38
HANDLE event; /* For use with GSource on Win32 */
39
#endif
40
@@ -XXX,XX +XXX,XX @@ guint qio_channel_add_watch(QIOChannel *ioc,
41
42
43
/**
44
+ * qio_channel_attach_aio_context:
45
+ * @ioc: the channel object
46
+ * @ctx: the #AioContext to set the handlers on
47
+ *
48
+ * Request that qio_channel_yield() sets I/O handlers on
49
+ * the given #AioContext. If @ctx is %NULL, qio_channel_yield()
50
+ * uses QEMU's main thread event loop.
51
+ *
52
+ * You can move a #QIOChannel from one #AioContext to another even if
53
+ * I/O handlers are set for a coroutine. However, #QIOChannel provides
54
+ * no synchronization between the calls to qio_channel_yield() and
55
+ * qio_channel_attach_aio_context().
56
+ *
57
+ * Therefore you should first call qio_channel_detach_aio_context()
58
+ * to ensure that the coroutine is not entered concurrently. Then,
59
+ * while the coroutine has yielded, call qio_channel_attach_aio_context(),
60
+ * and then aio_co_schedule() to place the coroutine on the new
61
+ * #AioContext. The calls to qio_channel_detach_aio_context()
62
+ * and qio_channel_attach_aio_context() should be protected with
63
+ * aio_context_acquire() and aio_context_release().
64
+ */
65
+void qio_channel_attach_aio_context(QIOChannel *ioc,
66
+ AioContext *ctx);
67
+
30
+
68
+/**
31
+ blk_aio_zone_append(blk, offset, qiov, flags, aio_rw_done, &async_ret);
69
+ * qio_channel_detach_aio_context:
32
+ while (async_ret == NOT_DONE) {
70
+ * @ioc: the channel object
33
+ main_loop_wait(false);
71
+ *
72
+ * Disable any I/O handlers set by qio_channel_yield(). With the
73
+ * help of aio_co_schedule(), this allows moving a coroutine that was
74
+ * paused by qio_channel_yield() to another context.
75
+ */
76
+void qio_channel_detach_aio_context(QIOChannel *ioc);
77
+
78
+/**
79
* qio_channel_yield:
80
* @ioc: the channel object
81
* @condition: the I/O condition to wait for
82
*
83
- * Yields execution from the current coroutine until
84
- * the condition indicated by @condition becomes
85
- * available.
86
+ * Yields execution from the current coroutine until the condition
87
+ * indicated by @condition becomes available. @condition must
88
+ * be either %G_IO_IN or %G_IO_OUT; it cannot contain both. In
89
+ * addition, no two coroutine can be waiting on the same condition
90
+ * and channel at the same time.
91
*
92
* This must only be called from coroutine context
93
*/
94
diff --git a/io/channel.c b/io/channel.c
95
index XXXXXXX..XXXXXXX 100644
96
--- a/io/channel.c
97
+++ b/io/channel.c
98
@@ -XXX,XX +XXX,XX @@
99
#include "qemu/osdep.h"
100
#include "io/channel.h"
101
#include "qapi/error.h"
102
-#include "qemu/coroutine.h"
103
+#include "qemu/main-loop.h"
104
105
bool qio_channel_has_feature(QIOChannel *ioc,
106
QIOChannelFeature feature)
107
@@ -XXX,XX +XXX,XX @@ off_t qio_channel_io_seek(QIOChannel *ioc,
108
}
109
110
111
-typedef struct QIOChannelYieldData QIOChannelYieldData;
112
-struct QIOChannelYieldData {
113
- QIOChannel *ioc;
114
- Coroutine *co;
115
-};
116
+static void qio_channel_set_aio_fd_handlers(QIOChannel *ioc);
117
118
+static void qio_channel_restart_read(void *opaque)
119
+{
120
+ QIOChannel *ioc = opaque;
121
+ Coroutine *co = ioc->read_coroutine;
122
+
123
+ ioc->read_coroutine = NULL;
124
+ qio_channel_set_aio_fd_handlers(ioc);
125
+ aio_co_wake(co);
126
+}
127
128
-static gboolean qio_channel_yield_enter(QIOChannel *ioc,
129
- GIOCondition condition,
130
- gpointer opaque)
131
+static void qio_channel_restart_write(void *opaque)
132
{
133
- QIOChannelYieldData *data = opaque;
134
- qemu_coroutine_enter(data->co);
135
- return FALSE;
136
+ QIOChannel *ioc = opaque;
137
+ Coroutine *co = ioc->write_coroutine;
138
+
139
+ ioc->write_coroutine = NULL;
140
+ qio_channel_set_aio_fd_handlers(ioc);
141
+ aio_co_wake(co);
142
}
143
144
+static void qio_channel_set_aio_fd_handlers(QIOChannel *ioc)
145
+{
146
+ IOHandler *rd_handler = NULL, *wr_handler = NULL;
147
+ AioContext *ctx;
148
+
149
+ if (ioc->read_coroutine) {
150
+ rd_handler = qio_channel_restart_read;
151
+ }
152
+ if (ioc->write_coroutine) {
153
+ wr_handler = qio_channel_restart_write;
154
+ }
34
+ }
155
+
35
+
156
+ ctx = ioc->ctx ? ioc->ctx : iohandler_get_aio_context();
36
+ *total = qiov->size;
157
+ qio_channel_set_aio_fd_handler(ioc, ctx, rd_handler, wr_handler, ioc);
37
+ return async_ret < 0 ? async_ret : 1;
158
+}
38
+}
159
+
39
+
160
+void qio_channel_attach_aio_context(QIOChannel *ioc,
40
+static int zone_append_f(BlockBackend *blk, int argc, char **argv)
161
+ AioContext *ctx)
162
+{
41
+{
163
+ AioContext *old_ctx;
42
+ int ret;
164
+ if (ioc->ctx == ctx) {
43
+ bool pflag = false;
165
+ return;
44
+ int flags = 0;
45
+ int total = 0;
46
+ int64_t offset;
47
+ char *buf;
48
+ int c, nr_iov;
49
+ int pattern = 0xcd;
50
+ QEMUIOVector qiov;
51
+
52
+ if (optind > argc - 3) {
53
+ return -EINVAL;
166
+ }
54
+ }
167
+
55
+
168
+ old_ctx = ioc->ctx ? ioc->ctx : iohandler_get_aio_context();
56
+ if ((c = getopt(argc, argv, "p")) != -1) {
169
+ qio_channel_set_aio_fd_handler(ioc, old_ctx, NULL, NULL, NULL);
57
+ pflag = true;
170
+ ioc->ctx = ctx;
58
+ }
171
+ qio_channel_set_aio_fd_handlers(ioc);
59
+
60
+ offset = cvtnum(argv[optind]);
61
+ if (offset < 0) {
62
+ print_cvtnum_err(offset, argv[optind]);
63
+ return offset;
64
+ }
65
+ optind++;
66
+ nr_iov = argc - optind;
67
+ buf = create_iovec(blk, &qiov, &argv[optind], nr_iov, pattern,
68
+ flags & BDRV_REQ_REGISTERED_BUF);
69
+ if (buf == NULL) {
70
+ return -EINVAL;
71
+ }
72
+ ret = do_aio_zone_append(blk, &qiov, &offset, flags, &total);
73
+ if (ret < 0) {
74
+ printf("zone append failed: %s\n", strerror(-ret));
75
+ goto out;
76
+ }
77
+
78
+ if (pflag) {
79
+ printf("After zap done, the append sector is 0x%" PRIx64 "\n",
80
+ tosector(offset));
81
+ }
82
+
83
+out:
84
+ qemu_io_free(blk, buf, qiov.size,
85
+ flags & BDRV_REQ_REGISTERED_BUF);
86
+ qemu_iovec_destroy(&qiov);
87
+ return ret;
172
+}
88
+}
173
+
89
+
174
+void qio_channel_detach_aio_context(QIOChannel *ioc)
90
+static const cmdinfo_t zone_append_cmd = {
175
+{
91
+ .name = "zone_append",
176
+ ioc->read_coroutine = NULL;
92
+ .altname = "zap",
177
+ ioc->write_coroutine = NULL;
93
+ .cfunc = zone_append_f,
178
+ qio_channel_set_aio_fd_handlers(ioc);
94
+ .argmin = 3,
179
+ ioc->ctx = NULL;
95
+ .argmax = 4,
180
+}
96
+ .args = "offset len [len..]",
181
97
+ .oneline = "append write a number of bytes at a specified offset",
182
void coroutine_fn qio_channel_yield(QIOChannel *ioc,
98
+};
183
GIOCondition condition)
99
+
184
{
100
static int truncate_f(BlockBackend *blk, int argc, char **argv);
185
- QIOChannelYieldData data;
101
static const cmdinfo_t truncate_cmd = {
186
-
102
.name = "truncate",
187
assert(qemu_in_coroutine());
103
@@ -XXX,XX +XXX,XX @@ static void __attribute((constructor)) init_qemuio_commands(void)
188
- data.ioc = ioc;
104
qemuio_add_command(&zone_close_cmd);
189
- data.co = qemu_coroutine_self();
105
qemuio_add_command(&zone_finish_cmd);
190
- qio_channel_add_watch(ioc,
106
qemuio_add_command(&zone_reset_cmd);
191
- condition,
107
+ qemuio_add_command(&zone_append_cmd);
192
- qio_channel_yield_enter,
108
qemuio_add_command(&truncate_cmd);
193
- &data,
109
qemuio_add_command(&length_cmd);
194
- NULL);
110
qemuio_add_command(&info_cmd);
195
+ if (condition == G_IO_IN) {
111
diff --git a/tests/qemu-iotests/tests/zoned b/tests/qemu-iotests/tests/zoned
196
+ assert(!ioc->read_coroutine);
112
index XXXXXXX..XXXXXXX 100755
197
+ ioc->read_coroutine = qemu_coroutine_self();
113
--- a/tests/qemu-iotests/tests/zoned
198
+ } else if (condition == G_IO_OUT) {
114
+++ b/tests/qemu-iotests/tests/zoned
199
+ assert(!ioc->write_coroutine);
115
@@ -XXX,XX +XXX,XX @@ echo "(5) resetting the second zone"
200
+ ioc->write_coroutine = qemu_coroutine_self();
116
$QEMU_IO $IMG -c "zrs 268435456 268435456"
201
+ } else {
117
echo "After resetting a zone:"
202
+ abort();
118
$QEMU_IO $IMG -c "zrp 268435456 1"
203
+ }
119
+echo
204
+ qio_channel_set_aio_fd_handlers(ioc);
120
+echo
205
qemu_coroutine_yield();
121
+echo "(6) append write" # the physical block size of the device is 4096
206
}
122
+$QEMU_IO $IMG -c "zrp 0 1"
207
123
+$QEMU_IO $IMG -c "zap -p 0 0x1000 0x2000"
124
+echo "After appending the first zone firstly:"
125
+$QEMU_IO $IMG -c "zrp 0 1"
126
+$QEMU_IO $IMG -c "zap -p 0 0x1000 0x2000"
127
+echo "After appending the first zone secondly:"
128
+$QEMU_IO $IMG -c "zrp 0 1"
129
+$QEMU_IO $IMG -c "zap -p 268435456 0x1000 0x2000"
130
+echo "After appending the second zone firstly:"
131
+$QEMU_IO $IMG -c "zrp 268435456 1"
132
+$QEMU_IO $IMG -c "zap -p 268435456 0x1000 0x2000"
133
+echo "After appending the second zone secondly:"
134
+$QEMU_IO $IMG -c "zrp 268435456 1"
135
136
# success, all done
137
echo "*** done"
138
diff --git a/tests/qemu-iotests/tests/zoned.out b/tests/qemu-iotests/tests/zoned.out
139
index XXXXXXX..XXXXXXX 100644
140
--- a/tests/qemu-iotests/tests/zoned.out
141
+++ b/tests/qemu-iotests/tests/zoned.out
142
@@ -XXX,XX +XXX,XX @@ start: 0x80000, len 0x80000, cap 0x80000, wptr 0x100000, zcond:14, [type: 2]
143
(5) resetting the second zone
144
After resetting a zone:
145
start: 0x80000, len 0x80000, cap 0x80000, wptr 0x80000, zcond:1, [type: 2]
146
+
147
+
148
+(6) append write
149
+start: 0x0, len 0x80000, cap 0x80000, wptr 0x0, zcond:1, [type: 2]
150
+After zap done, the append sector is 0x0
151
+After appending the first zone firstly:
152
+start: 0x0, len 0x80000, cap 0x80000, wptr 0x18, zcond:2, [type: 2]
153
+After zap done, the append sector is 0x18
154
+After appending the first zone secondly:
155
+start: 0x0, len 0x80000, cap 0x80000, wptr 0x30, zcond:2, [type: 2]
156
+After zap done, the append sector is 0x80000
157
+After appending the second zone firstly:
158
+start: 0x80000, len 0x80000, cap 0x80000, wptr 0x80018, zcond:2, [type: 2]
159
+After zap done, the append sector is 0x80018
160
+After appending the second zone secondly:
161
+start: 0x80000, len 0x80000, cap 0x80000, wptr 0x80030, zcond:2, [type: 2]
162
*** done
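
For reference, each zap request in this test carries two buffers of 0x1000 and 0x2000 bytes, i.e. 0x3000 bytes = 0x3000 / 512 = 0x18 sectors. That is why the write pointer advances 0x0 -> 0x18 -> 0x30 in the first zone and 0x80000 -> 0x80018 -> 0x80030 in the second, and why each reported append sector equals the zone's write pointer just before that append.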
208
--
163
--
209
2.9.3
164
2.40.1
210
211
Deleted patch
1
From: Paolo Bonzini <pbonzini@redhat.com>
2
1
3
As a small step towards the introduction of multiqueue, we want
4
coroutines to remain on the same AioContext that started them,
5
unless they are moved explicitly with e.g. aio_co_schedule. This patch
6
prevents coroutines from switching AioContext when they use a CoMutex.
7
For now it does not make much of a difference, because the CoMutex
8
is not thread-safe and the AioContext itself is used to protect the
9
CoMutex from concurrent access. However, this is going to change.
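
A minimal usage sketch of the behaviour this gives (assumes coroutine context and that lock was initialized with qemu_co_mutex_init(); not part of the patch):

static CoMutex lock;

static void coroutine_fn critical_section(void *opaque)
{
    /* If contended, this coroutine is queued and later woken with
     * aio_co_wake(), so it resumes on its own AioContext. */
    qemu_co_mutex_lock(&lock);
    /* ... critical section ... */
    qemu_co_mutex_unlock(&lock);   /* wakes the next waiter on *its* context */
}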
10
11
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
12
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
13
Reviewed-by: Fam Zheng <famz@redhat.com>
14
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
15
Message-id: 20170213135235.12274-9-pbonzini@redhat.com
16
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
17
---
18
util/qemu-coroutine-lock.c | 5 ++---
19
util/trace-events | 1 -
20
2 files changed, 2 insertions(+), 4 deletions(-)
21
22
diff --git a/util/qemu-coroutine-lock.c b/util/qemu-coroutine-lock.c
23
index XXXXXXX..XXXXXXX 100644
24
--- a/util/qemu-coroutine-lock.c
25
+++ b/util/qemu-coroutine-lock.c
26
@@ -XXX,XX +XXX,XX @@
27
#include "qemu/coroutine.h"
28
#include "qemu/coroutine_int.h"
29
#include "qemu/queue.h"
30
+#include "block/aio.h"
31
#include "trace.h"
32
33
void qemu_co_queue_init(CoQueue *queue)
34
@@ -XXX,XX +XXX,XX @@ void qemu_co_queue_run_restart(Coroutine *co)
35
36
static bool qemu_co_queue_do_restart(CoQueue *queue, bool single)
37
{
38
- Coroutine *self = qemu_coroutine_self();
39
Coroutine *next;
40
41
if (QSIMPLEQ_EMPTY(&queue->entries)) {
42
@@ -XXX,XX +XXX,XX @@ static bool qemu_co_queue_do_restart(CoQueue *queue, bool single)
43
44
while ((next = QSIMPLEQ_FIRST(&queue->entries)) != NULL) {
45
QSIMPLEQ_REMOVE_HEAD(&queue->entries, co_queue_next);
46
- QSIMPLEQ_INSERT_TAIL(&self->co_queue_wakeup, next, co_queue_next);
47
- trace_qemu_co_queue_next(next);
48
+ aio_co_wake(next);
49
if (single) {
50
break;
51
}
52
diff --git a/util/trace-events b/util/trace-events
53
index XXXXXXX..XXXXXXX 100644
54
--- a/util/trace-events
55
+++ b/util/trace-events
56
@@ -XXX,XX +XXX,XX @@ qemu_coroutine_terminate(void *co) "self %p"
57
58
# util/qemu-coroutine-lock.c
59
qemu_co_queue_run_restart(void *co) "co %p"
60
-qemu_co_queue_next(void *nxt) "next %p"
61
qemu_co_mutex_lock_entry(void *mutex, void *self) "mutex %p self %p"
62
qemu_co_mutex_lock_return(void *mutex, void *self) "mutex %p self %p"
63
qemu_co_mutex_unlock_entry(void *mutex, void *self) "mutex %p self %p"
64
--
65
2.9.3
66
67
Deleted patch
1
From: Paolo Bonzini <pbonzini@redhat.com>
2
1
3
Keep the coroutine on the same AioContext. Without this change,
4
there would be a race between yielding the coroutine and reentering it.
5
While the race cannot happen now, because the code only runs from a single
6
AioContext, this will change with multiqueue support in the block layer.
7
8
While making this change, replace the custom bottom half with aio_co_schedule.
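
In code, the replacement pattern is just two lines (condensed from the hunk below):

aio_co_schedule(qemu_get_current_aio_context(), qemu_coroutine_self());
qemu_coroutine_yield();   /* resumes here when the scheduled entry runs */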
9
10
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
11
Reviewed-by: Fam Zheng <famz@redhat.com>
12
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
13
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
14
Message-id: 20170213135235.12274-10-pbonzini@redhat.com
15
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
16
---
17
block/blkdebug.c | 9 +--------
18
1 file changed, 1 insertion(+), 8 deletions(-)
19
20
diff --git a/block/blkdebug.c b/block/blkdebug.c
21
index XXXXXXX..XXXXXXX 100644
22
--- a/block/blkdebug.c
23
+++ b/block/blkdebug.c
24
@@ -XXX,XX +XXX,XX @@ out:
25
return ret;
26
}
27
28
-static void error_callback_bh(void *opaque)
29
-{
30
- Coroutine *co = opaque;
31
- qemu_coroutine_enter(co);
32
-}
33
-
34
static int inject_error(BlockDriverState *bs, BlkdebugRule *rule)
35
{
36
BDRVBlkdebugState *s = bs->opaque;
37
@@ -XXX,XX +XXX,XX @@ static int inject_error(BlockDriverState *bs, BlkdebugRule *rule)
38
}
39
40
if (!immediately) {
41
- aio_bh_schedule_oneshot(bdrv_get_aio_context(bs), error_callback_bh,
42
- qemu_coroutine_self());
43
+ aio_co_schedule(qemu_get_current_aio_context(), qemu_coroutine_self());
44
qemu_coroutine_yield();
45
}
46
47
--
48
2.9.3
49
50
Deleted patch
1
From: Paolo Bonzini <pbonzini@redhat.com>
2
1
3
qed_aio_start_io and qed_aio_next_io will not have to acquire/release
4
the AioContext, while qed_aio_next_io_cb will. Split the functionality
5
and gain a little type-safety in the process.
6
7
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
8
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
9
Reviewed-by: Fam Zheng <famz@redhat.com>
10
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
11
Message-id: 20170213135235.12274-11-pbonzini@redhat.com
12
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
13
---
14
block/qed.c | 39 +++++++++++++++++++++++++--------------
15
1 file changed, 25 insertions(+), 14 deletions(-)
16
17
diff --git a/block/qed.c b/block/qed.c
18
index XXXXXXX..XXXXXXX 100644
19
--- a/block/qed.c
20
+++ b/block/qed.c
21
@@ -XXX,XX +XXX,XX @@ static CachedL2Table *qed_new_l2_table(BDRVQEDState *s)
22
return l2_table;
23
}
24
25
-static void qed_aio_next_io(void *opaque, int ret);
26
+static void qed_aio_next_io(QEDAIOCB *acb, int ret);
27
+
28
+static void qed_aio_start_io(QEDAIOCB *acb)
29
+{
30
+ qed_aio_next_io(acb, 0);
31
+}
32
+
33
+static void qed_aio_next_io_cb(void *opaque, int ret)
34
+{
35
+ QEDAIOCB *acb = opaque;
36
+
37
+ qed_aio_next_io(acb, ret);
38
+}
39
40
static void qed_plug_allocating_write_reqs(BDRVQEDState *s)
41
{
42
@@ -XXX,XX +XXX,XX @@ static void qed_unplug_allocating_write_reqs(BDRVQEDState *s)
43
44
acb = QSIMPLEQ_FIRST(&s->allocating_write_reqs);
45
if (acb) {
46
- qed_aio_next_io(acb, 0);
47
+ qed_aio_start_io(acb);
48
}
49
}
50
51
@@ -XXX,XX +XXX,XX @@ static void qed_aio_complete(QEDAIOCB *acb, int ret)
52
QSIMPLEQ_REMOVE_HEAD(&s->allocating_write_reqs, next);
53
acb = QSIMPLEQ_FIRST(&s->allocating_write_reqs);
54
if (acb) {
55
- qed_aio_next_io(acb, 0);
56
+ qed_aio_start_io(acb);
57
} else if (s->header.features & QED_F_NEED_CHECK) {
58
qed_start_need_check_timer(s);
59
}
60
@@ -XXX,XX +XXX,XX @@ static void qed_commit_l2_update(void *opaque, int ret)
61
acb->request.l2_table = qed_find_l2_cache_entry(&s->l2_cache, l2_offset);
62
assert(acb->request.l2_table != NULL);
63
64
- qed_aio_next_io(opaque, ret);
65
+ qed_aio_next_io(acb, ret);
66
}
67
68
/**
69
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_l2_update(QEDAIOCB *acb, int ret, uint64_t offset)
70
if (need_alloc) {
71
/* Write out the whole new L2 table */
72
qed_write_l2_table(s, &acb->request, 0, s->table_nelems, true,
73
- qed_aio_write_l1_update, acb);
74
+ qed_aio_write_l1_update, acb);
75
} else {
76
/* Write out only the updated part of the L2 table */
77
qed_write_l2_table(s, &acb->request, index, acb->cur_nclusters, false,
78
- qed_aio_next_io, acb);
79
+ qed_aio_next_io_cb, acb);
80
}
81
return;
82
83
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_main(void *opaque, int ret)
84
}
85
86
if (acb->find_cluster_ret == QED_CLUSTER_FOUND) {
87
- next_fn = qed_aio_next_io;
88
+ next_fn = qed_aio_next_io_cb;
89
} else {
90
if (s->bs->backing) {
91
next_fn = qed_aio_write_flush_before_l2_update;
92
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
93
if (acb->flags & QED_AIOCB_ZERO) {
94
/* Skip ahead if the clusters are already zero */
95
if (acb->find_cluster_ret == QED_CLUSTER_ZERO) {
96
- qed_aio_next_io(acb, 0);
97
+ qed_aio_start_io(acb);
98
return;
99
}
100
101
@@ -XXX,XX +XXX,XX @@ static void qed_aio_read_data(void *opaque, int ret,
102
/* Handle zero cluster and backing file reads */
103
if (ret == QED_CLUSTER_ZERO) {
104
qemu_iovec_memset(&acb->cur_qiov, 0, 0, acb->cur_qiov.size);
105
- qed_aio_next_io(acb, 0);
106
+ qed_aio_start_io(acb);
107
return;
108
} else if (ret != QED_CLUSTER_FOUND) {
109
qed_read_backing_file(s, acb->cur_pos, &acb->cur_qiov,
110
- &acb->backing_qiov, qed_aio_next_io, acb);
111
+ &acb->backing_qiov, qed_aio_next_io_cb, acb);
112
return;
113
}
114
115
BLKDBG_EVENT(bs->file, BLKDBG_READ_AIO);
116
bdrv_aio_readv(bs->file, offset / BDRV_SECTOR_SIZE,
117
&acb->cur_qiov, acb->cur_qiov.size / BDRV_SECTOR_SIZE,
118
- qed_aio_next_io, acb);
119
+ qed_aio_next_io_cb, acb);
120
return;
121
122
err:
123
@@ -XXX,XX +XXX,XX @@ err:
124
/**
125
* Begin next I/O or complete the request
126
*/
127
-static void qed_aio_next_io(void *opaque, int ret)
128
+static void qed_aio_next_io(QEDAIOCB *acb, int ret)
129
{
130
- QEDAIOCB *acb = opaque;
131
BDRVQEDState *s = acb_to_s(acb);
132
QEDFindClusterFunc *io_fn = (acb->flags & QED_AIOCB_WRITE) ?
133
qed_aio_write_data : qed_aio_read_data;
134
@@ -XXX,XX +XXX,XX @@ static BlockAIOCB *qed_aio_setup(BlockDriverState *bs,
135
qemu_iovec_init(&acb->cur_qiov, qiov->niov);
136
137
/* Start request */
138
- qed_aio_next_io(acb, 0);
139
+ qed_aio_start_io(acb);
140
return &acb->common;
141
}
142
143
--
144
2.9.3
145
146
From: Paolo Bonzini <pbonzini@redhat.com>

The AioContext data structures are now protected by list_lock and/or
they are walked with FOREACH_RCU primitives. There is no need anymore
to acquire the AioContext for the entire duration of aio_dispatch.
Instead, just acquire it before and after invoking the callbacks.
The next step is then to push it further down.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
Message-id: 20170213135235.12274-12-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 util/aio-posix.c | 25 +++++++++++--------------
 util/aio-win32.c | 15 +++++++--------
 util/async.c     |  2 ++
 3 files changed, 20 insertions(+), 22 deletions(-)

diff --git a/util/aio-posix.c b/util/aio-posix.c
index XXXXXXX..XXXXXXX 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -XXX,XX +XXX,XX @@ static bool aio_dispatch_handlers(AioContext *ctx)
             (revents & (G_IO_IN | G_IO_HUP | G_IO_ERR)) &&
             aio_node_check(ctx, node->is_external) &&
             node->io_read) {
+            aio_context_acquire(ctx);
             node->io_read(node->opaque);
+            aio_context_release(ctx);

             /* aio_notify() does not count as progress */
             if (node->opaque != &ctx->notifier) {
@@ -XXX,XX +XXX,XX @@ static bool aio_dispatch_handlers(AioContext *ctx)
             (revents & (G_IO_OUT | G_IO_ERR)) &&
             aio_node_check(ctx, node->is_external) &&
             node->io_write) {
+            aio_context_acquire(ctx);
             node->io_write(node->opaque);
+            aio_context_release(ctx);
             progress = true;
         }

@@ -XXX,XX +XXX,XX @@ bool aio_dispatch(AioContext *ctx, bool dispatch_fds)
     }

     /* Run our timers */
+    aio_context_acquire(ctx);
     progress |= timerlistgroup_run_timers(&ctx->tlg);
+    aio_context_release(ctx);

     return progress;
 }
@@ -XXX,XX +XXX,XX @@ bool aio_poll(AioContext *ctx, bool blocking)
     int64_t timeout;
     int64_t start = 0;

-    aio_context_acquire(ctx);
-    progress = false;
-
     /* aio_notify can avoid the expensive event_notifier_set if
      * everything (file descriptors, bottom halves, timers) will
      * be re-evaluated before the next blocking poll(). This is
@@ -XXX,XX +XXX,XX @@ bool aio_poll(AioContext *ctx, bool blocking)
         start = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
     }

-    if (try_poll_mode(ctx, blocking)) {
-        progress = true;
-    } else {
+    aio_context_acquire(ctx);
+    progress = try_poll_mode(ctx, blocking);
+    aio_context_release(ctx);
+
+    if (!progress) {
         assert(npfd == 0);

         /* fill pollfds */
@@ -XXX,XX +XXX,XX @@ bool aio_poll(AioContext *ctx, bool blocking)
         timeout = blocking ? aio_compute_timeout(ctx) : 0;

         /* wait until next event */
-        if (timeout) {
-            aio_context_release(ctx);
-        }
         if (aio_epoll_check_poll(ctx, pollfds, npfd, timeout)) {
             AioHandler epoll_handler;

@@ -XXX,XX +XXX,XX @@ bool aio_poll(AioContext *ctx, bool blocking)
         } else {
             ret = qemu_poll_ns(pollfds, npfd, timeout);
         }
-        if (timeout) {
-            aio_context_acquire(ctx);
-        }
     }

     if (blocking) {
@@ -XXX,XX +XXX,XX @@ bool aio_poll(AioContext *ctx, bool blocking)
         progress = true;
     }

-    aio_context_release(ctx);
-
     return progress;
 }

diff --git a/util/aio-win32.c b/util/aio-win32.c
index XXXXXXX..XXXXXXX 100644
--- a/util/aio-win32.c
+++ b/util/aio-win32.c
@@ -XXX,XX +XXX,XX @@ static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
             (revents || event_notifier_get_handle(node->e) == event) &&
             node->io_notify) {
             node->pfd.revents = 0;
+            aio_context_acquire(ctx);
             node->io_notify(node->e);
+            aio_context_release(ctx);

             /* aio_notify() does not count as progress */
             if (node->e != &ctx->notifier) {
@@ -XXX,XX +XXX,XX @@ static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
             (node->io_read || node->io_write)) {
             node->pfd.revents = 0;
             if ((revents & G_IO_IN) && node->io_read) {
+                aio_context_acquire(ctx);
                 node->io_read(node->opaque);
+                aio_context_release(ctx);
                 progress = true;
             }
             if ((revents & G_IO_OUT) && node->io_write) {
+                aio_context_acquire(ctx);
                 node->io_write(node->opaque);
+                aio_context_release(ctx);
                 progress = true;
             }

@@ -XXX,XX +XXX,XX @@ bool aio_poll(AioContext *ctx, bool blocking)
     int count;
     int timeout;

-    aio_context_acquire(ctx);
     progress = false;

     /* aio_notify can avoid the expensive event_notifier_set if
@@ -XXX,XX +XXX,XX @@ bool aio_poll(AioContext *ctx, bool blocking)

         timeout = blocking && !have_select_revents
             ? qemu_timeout_ns_to_ms(aio_compute_timeout(ctx)) : 0;
-        if (timeout) {
-            aio_context_release(ctx);
-        }
         ret = WaitForMultipleObjects(count, events, FALSE, timeout);
         if (blocking) {
             assert(first);
             atomic_sub(&ctx->notify_me, 2);
         }
-        if (timeout) {
-            aio_context_acquire(ctx);
-        }

         if (first) {
             aio_notify_accept(ctx);
@@ -XXX,XX +XXX,XX @@ bool aio_poll(AioContext *ctx, bool blocking)
         progress |= aio_dispatch_handlers(ctx, event);
     } while (count > 0);

+    aio_context_acquire(ctx);
     progress |= timerlistgroup_run_timers(&ctx->tlg);
-
     aio_context_release(ctx);
     return progress;
 }

diff --git a/util/async.c b/util/async.c
index XXXXXXX..XXXXXXX 100644
--- a/util/async.c
+++ b/util/async.c
@@ -XXX,XX +XXX,XX @@ int aio_bh_poll(AioContext *ctx)
                 ret = 1;
             }
             bh->idle = 0;
+            aio_context_acquire(ctx);
             aio_bh_call(bh);
+            aio_context_release(ctx);
         }
         if (bh->deleted) {
             deleted = true;
--
2.9.3

From: Sam Li <faithilikerun@gmail.com>

Signed-off-by: Sam Li <faithilikerun@gmail.com>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20230508051510.177850-5-faithilikerun@gmail.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/file-posix.c | 3 +++
 block/trace-events | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/block/file-posix.c b/block/file-posix.c
index XXXXXXX..XXXXXXX 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -XXX,XX +XXX,XX @@ out:
     if (!BDRV_ZT_IS_CONV(*wp)) {
         if (type & QEMU_AIO_ZONE_APPEND) {
             *s->offset = *wp;
+            trace_zbd_zone_append_complete(bs, *s->offset
+                >> BDRV_SECTOR_BITS);
         }
         /* Advance the wp if needed */
         if (offset + bytes > *wp) {
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn raw_co_zone_append(BlockDriverState *bs,
         len += iov_len;
     }

+    trace_zbd_zone_append(bs, *offset >> BDRV_SECTOR_BITS);
     return raw_co_prw(bs, *offset, len, qiov, QEMU_AIO_ZONE_APPEND);
 }
 #endif
diff --git a/block/trace-events b/block/trace-events
index XXXXXXX..XXXXXXX 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -XXX,XX +XXX,XX @@ file_hdev_is_sg(int type, int version) "SG device found: type=%d, version=%d"
 file_flush_fdatasync_failed(int err) "errno %d"
 zbd_zone_report(void *bs, unsigned int nr_zones, int64_t sector) "bs %p report %d zones starting at sector offset 0x%" PRIx64 ""
 zbd_zone_mgmt(void *bs, const char *op_name, int64_t sector, int64_t len) "bs %p %s starts at sector offset 0x%" PRIx64 " over a range of 0x%" PRIx64 " sectors"
+zbd_zone_append(void *bs, int64_t sector) "bs %p append at sector offset 0x%" PRIx64 ""
+zbd_zone_append_complete(void *bs, int64_t sector) "bs %p returns append sector 0x%" PRIx64 ""

 # ssh.c
 sftp_error(const char *op, const char *ssh_err, int ssh_err_code, int sftp_err_code) "%s failed: %s (libssh error code: %d, sftp error code: %d)"
--
2.40.1
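The net effect of the aio-posix, aio-win32 and async hunks above is a single locking rule: the AioContext lock is no longer held across a whole dispatch pass, only around each individual callback. An illustrative sketch of the bracket pattern (not verbatim QEMU code, but using the same API and field names as the hunks):

    /* Sketch: each handler invocation is bracketed individually, so the
     * lock is free while poll() sleeps and between callbacks. */
    static void dispatch_read_handler(AioContext *ctx, AioHandler *node)
    {
        if (node->io_read) {
            aio_context_acquire(ctx);      /* protect callback-visible state */
            node->io_read(node->opaque);
            aio_context_release(ctx);      /* dropped again before the next node */
        }
    }

Keeping the critical section to exactly one callback is what lets the later patches in the series remove the acquire/release pairs from the dispatch loop entirely and push them into the callbacks themselves.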
From: Paolo Bonzini <pbonzini@redhat.com>

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
Message-id: 20170213135235.12274-13-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/qed.h                 |  3 +++
 block/curl.c                |  2 ++
 block/io.c                  |  5 +++++
 block/iscsi.c               |  8 ++++++--
 block/null.c                |  4 ++++
 block/qed.c                 | 12 ++++++++++++
 block/throttle-groups.c     |  2 ++
 util/aio-posix.c            |  2 --
 util/aio-win32.c            |  2 --
 util/qemu-coroutine-sleep.c |  2 +-
 10 files changed, 35 insertions(+), 7 deletions(-)

diff --git a/block/qed.h b/block/qed.h
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.h
+++ b/block/qed.h
@@ -XXX,XX +XXX,XX @@ enum {
 */
 typedef void QEDFindClusterFunc(void *opaque, int ret, uint64_t offset, size_t len);

+void qed_acquire(BDRVQEDState *s);
+void qed_release(BDRVQEDState *s);
+
 /**
  * Generic callback for chaining async callbacks
  */
diff --git a/block/curl.c b/block/curl.c
index XXXXXXX..XXXXXXX 100644
--- a/block/curl.c
+++ b/block/curl.c
@@ -XXX,XX +XXX,XX @@ static void curl_multi_timeout_do(void *arg)
         return;
     }

+    aio_context_acquire(s->aio_context);
     curl_multi_socket_action(s->multi, CURL_SOCKET_TIMEOUT, 0, &running);

     curl_multi_check_completion(s);
+    aio_context_release(s->aio_context);
 #else
     abort();
 #endif
diff --git a/block/io.c b/block/io.c
index XXXXXXX..XXXXXXX 100644
--- a/block/io.c
+++ b/block/io.c
@@ -XXX,XX +XXX,XX @@ void bdrv_aio_cancel(BlockAIOCB *acb)
     if (acb->aiocb_info->get_aio_context) {
         aio_poll(acb->aiocb_info->get_aio_context(acb), true);
     } else if (acb->bs) {
+        /* qemu_aio_ref and qemu_aio_unref are not thread-safe, so
+         * assert that we're not using an I/O thread.  Thread-safe
+         * code should use bdrv_aio_cancel_async exclusively.
+         */
+        assert(bdrv_get_aio_context(acb->bs) == qemu_get_aio_context());
         aio_poll(bdrv_get_aio_context(acb->bs), true);
     } else {
         abort();
diff --git a/block/iscsi.c b/block/iscsi.c
index XXXXXXX..XXXXXXX 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -XXX,XX +XXX,XX @@ static void iscsi_retry_timer_expired(void *opaque)
     struct IscsiTask *iTask = opaque;
     iTask->complete = 1;
     if (iTask->co) {
-        qemu_coroutine_enter(iTask->co);
+        aio_co_wake(iTask->co);
     }
 }

@@ -XXX,XX +XXX,XX @@ static void iscsi_nop_timed_event(void *opaque)
 {
     IscsiLun *iscsilun = opaque;

+    aio_context_acquire(iscsilun->aio_context);
     if (iscsi_get_nops_in_flight(iscsilun->iscsi) >= MAX_NOP_FAILURES) {
         error_report("iSCSI: NOP timeout. Reconnecting...");
         iscsilun->request_timed_out = true;
     } else if (iscsi_nop_out_async(iscsilun->iscsi, NULL, NULL, 0, NULL) != 0) {
         error_report("iSCSI: failed to sent NOP-Out. Disabling NOP messages.");
-        return;
+        goto out;
     }

     timer_mod(iscsilun->nop_timer, qemu_clock_get_ms(QEMU_CLOCK_REALTIME) + NOP_INTERVAL);
     iscsi_set_events(iscsilun);
+
+out:
+    aio_context_release(iscsilun->aio_context);
 }

 static void iscsi_readcapacity_sync(IscsiLun *iscsilun, Error **errp)
diff --git a/block/null.c b/block/null.c
index XXXXXXX..XXXXXXX 100644
--- a/block/null.c
+++ b/block/null.c
@@ -XXX,XX +XXX,XX @@ static void null_bh_cb(void *opaque)
 static void null_timer_cb(void *opaque)
 {
     NullAIOCB *acb = opaque;
+    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
+
+    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, 0);
+    aio_context_release(ctx);
     timer_deinit(&acb->timer);
     qemu_aio_unref(acb);
 }
diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@ static void qed_need_check_timer_cb(void *opaque)

     trace_qed_need_check_timer_cb(s);

+    qed_acquire(s);
     qed_plug_allocating_write_reqs(s);

     /* Ensure writes are on disk before clearing flag */
     bdrv_aio_flush(s->bs->file->bs, qed_clear_need_check, s);
+    qed_release(s);
+}
+
+void qed_acquire(BDRVQEDState *s)
+{
+    aio_context_acquire(bdrv_get_aio_context(s->bs));
+}
+
+void qed_release(BDRVQEDState *s)
+{
+    aio_context_release(bdrv_get_aio_context(s->bs));
 }

 static void qed_start_need_check_timer(BDRVQEDState *s)
diff --git a/block/throttle-groups.c b/block/throttle-groups.c
index XXXXXXX..XXXXXXX 100644
--- a/block/throttle-groups.c
+++ b/block/throttle-groups.c
@@ -XXX,XX +XXX,XX @@ static void timer_cb(BlockBackend *blk, bool is_write)
     qemu_mutex_unlock(&tg->lock);

     /* Run the request that was waiting for this timer */
+    aio_context_acquire(blk_get_aio_context(blk));
     empty_queue = !qemu_co_enter_next(&blkp->throttled_reqs[is_write]);
+    aio_context_release(blk_get_aio_context(blk));

     /* If the request queue was empty then we have to take care of
      * scheduling the next one */
diff --git a/util/aio-posix.c b/util/aio-posix.c
index XXXXXXX..XXXXXXX 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -XXX,XX +XXX,XX @@ bool aio_dispatch(AioContext *ctx, bool dispatch_fds)
     }

     /* Run our timers */
-    aio_context_acquire(ctx);
     progress |= timerlistgroup_run_timers(&ctx->tlg);
-    aio_context_release(ctx);

     return progress;
 }
diff --git a/util/aio-win32.c b/util/aio-win32.c
index XXXXXXX..XXXXXXX 100644
--- a/util/aio-win32.c
+++ b/util/aio-win32.c
@@ -XXX,XX +XXX,XX @@ bool aio_poll(AioContext *ctx, bool blocking)
         progress |= aio_dispatch_handlers(ctx, event);
     } while (count > 0);

-    aio_context_acquire(ctx);
     progress |= timerlistgroup_run_timers(&ctx->tlg);
-    aio_context_release(ctx);
     return progress;
 }

diff --git a/util/qemu-coroutine-sleep.c b/util/qemu-coroutine-sleep.c
index XXXXXXX..XXXXXXX 100644
--- a/util/qemu-coroutine-sleep.c
+++ b/util/qemu-coroutine-sleep.c
@@ -XXX,XX +XXX,XX @@ static void co_sleep_cb(void *opaque)
 {
     CoSleepCB *sleep_cb = opaque;

-    qemu_coroutine_enter(sleep_cb->co);
+    aio_co_wake(sleep_cb->co);
 }

 void coroutine_fn co_aio_sleep_ns(AioContext *ctx, QEMUClockType type,
--
2.9.3
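A recurring conversion in this patch is replacing qemu_coroutine_enter() with aio_co_wake() inside timer callbacks. An illustrative-only sketch of the pattern (the callback name is hypothetical; aio_co_wake() is the real API used in the hunks above):

    /* Hypothetical timer callback following the pattern above: wake a
     * sleeping coroutine with aio_co_wake() instead of entering it
     * directly, so it resumes on the AioContext it was running on,
     * scheduling a bottom half if the caller is in a different context. */
    static void my_timer_cb(void *opaque)
    {
        Coroutine *co = opaque;

        aio_co_wake(co);    /* safe regardless of which thread fires the timer */
    }

This matters because, once timers can run without the AioContext lock held for the whole dispatch, directly entering a coroutine from the "wrong" thread would no longer be safe.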
From: Paolo Bonzini <pbonzini@redhat.com>

This covers both file descriptor callbacks and polling callbacks,
since they execute related code.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
Message-id: 20170213135235.12274-14-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/curl.c          | 16 +++++++++++++---
 block/iscsi.c         |  4 ++++
 block/linux-aio.c     |  4 ++++
 block/nfs.c           |  6 ++++++
 block/sheepdog.c      | 29 +++++++++++++++--------------
 block/ssh.c           | 29 +++++++++--------------------
 block/win32-aio.c     | 10 ++++++----
 hw/block/virtio-blk.c |  5 ++++-
 hw/scsi/virtio-scsi.c |  7 +++++++
 util/aio-posix.c      |  7 -------
 util/aio-win32.c      |  6 ------
 11 files changed, 68 insertions(+), 55 deletions(-)

diff --git a/block/curl.c b/block/curl.c
index XXXXXXX..XXXXXXX 100644
--- a/block/curl.c
+++ b/block/curl.c
@@ -XXX,XX +XXX,XX @@ static void curl_multi_check_completion(BDRVCURLState *s)
         }
     }

-static void curl_multi_do(void *arg)
+static void curl_multi_do_locked(CURLState *s)
 {
-    CURLState *s = (CURLState *)arg;
     CURLSocket *socket, *next_socket;
     int running;
     int r;
@@ -XXX,XX +XXX,XX @@ static void curl_multi_do(void *arg)
     }
 }

+static void curl_multi_do(void *arg)
+{
+    CURLState *s = (CURLState *)arg;
+
+    aio_context_acquire(s->s->aio_context);
+    curl_multi_do_locked(s);
+    aio_context_release(s->s->aio_context);
+}
+
 static void curl_multi_read(void *arg)
 {
     CURLState *s = (CURLState *)arg;

-    curl_multi_do(arg);
+    aio_context_acquire(s->s->aio_context);
+    curl_multi_do_locked(s);
     curl_multi_check_completion(s->s);
+    aio_context_release(s->s->aio_context);
 }

 static void curl_multi_timeout_do(void *arg)
diff --git a/block/iscsi.c b/block/iscsi.c
index XXXXXXX..XXXXXXX 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -XXX,XX +XXX,XX @@ iscsi_process_read(void *arg)
     IscsiLun *iscsilun = arg;
     struct iscsi_context *iscsi = iscsilun->iscsi;

+    aio_context_acquire(iscsilun->aio_context);
     iscsi_service(iscsi, POLLIN);
     iscsi_set_events(iscsilun);
+    aio_context_release(iscsilun->aio_context);
 }

 static void
@@ -XXX,XX +XXX,XX @@ iscsi_process_write(void *arg)
     IscsiLun *iscsilun = arg;
     struct iscsi_context *iscsi = iscsilun->iscsi;

+    aio_context_acquire(iscsilun->aio_context);
     iscsi_service(iscsi, POLLOUT);
     iscsi_set_events(iscsilun);
+    aio_context_release(iscsilun->aio_context);
 }

 static int64_t sector_lun2qemu(int64_t sector, IscsiLun *iscsilun)
diff --git a/block/linux-aio.c b/block/linux-aio.c
index XXXXXXX..XXXXXXX 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -XXX,XX +XXX,XX @@ static void qemu_laio_completion_cb(EventNotifier *e)
     LinuxAioState *s = container_of(e, LinuxAioState, e);

     if (event_notifier_test_and_clear(&s->e)) {
+        aio_context_acquire(s->aio_context);
         qemu_laio_process_completions_and_submit(s);
+        aio_context_release(s->aio_context);
     }
 }

@@ -XXX,XX +XXX,XX @@ static bool qemu_laio_poll_cb(void *opaque)
         return false;
     }

+    aio_context_acquire(s->aio_context);
     qemu_laio_process_completions_and_submit(s);
+    aio_context_release(s->aio_context);
     return true;
 }

diff --git a/block/nfs.c b/block/nfs.c
index XXXXXXX..XXXXXXX 100644
--- a/block/nfs.c
+++ b/block/nfs.c
@@ -XXX,XX +XXX,XX @@ static void nfs_set_events(NFSClient *client)
 static void nfs_process_read(void *arg)
 {
     NFSClient *client = arg;
+
+    aio_context_acquire(client->aio_context);
     nfs_service(client->context, POLLIN);
     nfs_set_events(client);
+    aio_context_release(client->aio_context);
 }

 static void nfs_process_write(void *arg)
 {
     NFSClient *client = arg;
+
+    aio_context_acquire(client->aio_context);
     nfs_service(client->context, POLLOUT);
     nfs_set_events(client);
+    aio_context_release(client->aio_context);
 }

 static void nfs_co_init_task(BlockDriverState *bs, NFSRPC *task)
diff --git a/block/sheepdog.c b/block/sheepdog.c
index XXXXXXX..XXXXXXX 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -XXX,XX +XXX,XX @@ static coroutine_fn int send_co_req(int sockfd, SheepdogReq *hdr, void *data,
     return ret;
 }

-static void restart_co_req(void *opaque)
-{
-    Coroutine *co = opaque;
-
-    qemu_coroutine_enter(co);
-}
-
 typedef struct SheepdogReqCo {
     int sockfd;
     BlockDriverState *bs;
@@ -XXX,XX +XXX,XX @@ typedef struct SheepdogReqCo {
     unsigned int *rlen;
     int ret;
     bool finished;
+    Coroutine *co;
 } SheepdogReqCo;

+static void restart_co_req(void *opaque)
+{
+    SheepdogReqCo *srco = opaque;
+
+    aio_co_wake(srco->co);
+}
+
 static coroutine_fn void do_co_req(void *opaque)
 {
     int ret;
-    Coroutine *co;
     SheepdogReqCo *srco = opaque;
     int sockfd = srco->sockfd;
     SheepdogReq *hdr = srco->hdr;
@@ -XXX,XX +XXX,XX @@ static coroutine_fn void do_co_req(void *opaque)
     unsigned int *wlen = srco->wlen;
     unsigned int *rlen = srco->rlen;

-    co = qemu_coroutine_self();
+    srco->co = qemu_coroutine_self();
     aio_set_fd_handler(srco->aio_context, sockfd, false,
-                       NULL, restart_co_req, NULL, co);
+                       NULL, restart_co_req, NULL, srco);

     ret = send_co_req(sockfd, hdr, data, wlen);
     if (ret < 0) {
@@ -XXX,XX +XXX,XX @@ static coroutine_fn void do_co_req(void *opaque)
     }

     aio_set_fd_handler(srco->aio_context, sockfd, false,
-                       restart_co_req, NULL, NULL, co);
+                       restart_co_req, NULL, NULL, srco);

     ret = qemu_co_recv(sockfd, hdr, sizeof(*hdr));
     if (ret != sizeof(*hdr)) {
@@ -XXX,XX +XXX,XX @@ out:
     aio_set_fd_handler(srco->aio_context, sockfd, false,
                        NULL, NULL, NULL, NULL);

+    srco->co = NULL;
     srco->ret = ret;
     srco->finished = true;
     if (srco->bs) {
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn aio_read_response(void *opaque)
          * We've finished all requests which belong to the AIOCB, so
          * we can switch back to sd_co_readv/writev now.
          */
-        qemu_coroutine_enter(acb->coroutine);
+        aio_co_wake(acb->coroutine);
     }

     return;
@@ -XXX,XX +XXX,XX @@ static void co_read_response(void *opaque)
         s->co_recv = qemu_coroutine_create(aio_read_response, opaque);
     }

-    qemu_coroutine_enter(s->co_recv);
+    aio_co_wake(s->co_recv);
 }

 static void co_write_request(void *opaque)
 {
     BDRVSheepdogState *s = opaque;

-    qemu_coroutine_enter(s->co_send);
+    aio_co_wake(s->co_send);
 }

 /*
diff --git a/block/ssh.c b/block/ssh.c
index XXXXXXX..XXXXXXX 100644
--- a/block/ssh.c
+++ b/block/ssh.c
@@ -XXX,XX +XXX,XX @@ static void restart_coroutine(void *opaque)

     DPRINTF("co=%p", co);

-    qemu_coroutine_enter(co);
+    aio_co_wake(co);
 }

-static coroutine_fn void set_fd_handler(BDRVSSHState *s, BlockDriverState *bs)
+/* A non-blocking call returned EAGAIN, so yield, ensuring the
+ * handlers are set up so that we'll be rescheduled when there is an
+ * interesting event on the socket.
+ */
+static coroutine_fn void co_yield(BDRVSSHState *s, BlockDriverState *bs)
 {
     int r;
     IOHandler *rd_handler = NULL, *wr_handler = NULL;
@@ -XXX,XX +XXX,XX @@ static coroutine_fn void set_fd_handler(BDRVSSHState *s, BlockDriverState *bs)

     aio_set_fd_handler(bdrv_get_aio_context(bs), s->sock,
                        false, rd_handler, wr_handler, NULL, co);
-}
-
-static coroutine_fn void clear_fd_handler(BDRVSSHState *s,
-                                          BlockDriverState *bs)
-{
-    DPRINTF("s->sock=%d", s->sock);
-    aio_set_fd_handler(bdrv_get_aio_context(bs), s->sock,
-                       false, NULL, NULL, NULL, NULL);
-}
-
-/* A non-blocking call returned EAGAIN, so yield, ensuring the
- * handlers are set up so that we'll be rescheduled when there is an
- * interesting event on the socket.
- */
-static coroutine_fn void co_yield(BDRVSSHState *s, BlockDriverState *bs)
-{
-    set_fd_handler(s, bs);
     qemu_coroutine_yield();
-    clear_fd_handler(s, bs);
+    DPRINTF("s->sock=%d - back", s->sock);
+    aio_set_fd_handler(bdrv_get_aio_context(bs), s->sock, false,
+                       NULL, NULL, NULL, NULL);
 }

 /* SFTP has a function `libssh2_sftp_seek64' which seeks to a position
diff --git a/block/win32-aio.c b/block/win32-aio.c
index XXXXXXX..XXXXXXX 100644
--- a/block/win32-aio.c
+++ b/block/win32-aio.c
@@ -XXX,XX +XXX,XX @@ struct QEMUWin32AIOState {
     HANDLE hIOCP;
     EventNotifier e;
     int count;
-    bool is_aio_context_attached;
+    AioContext *aio_ctx;
 };

 typedef struct QEMUWin32AIOCB {
@@ -XXX,XX +XXX,XX @@ static void win32_aio_process_completion(QEMUWin32AIOState *s,
     }


+    aio_context_acquire(s->aio_ctx);
     waiocb->common.cb(waiocb->common.opaque, ret);
+    aio_context_release(s->aio_ctx);
     qemu_aio_unref(waiocb);
 }

@@ -XXX,XX +XXX,XX @@ void win32_aio_detach_aio_context(QEMUWin32AIOState *aio,
                                   AioContext *old_context)
 {
     aio_set_event_notifier(old_context, &aio->e, false, NULL, NULL);
-    aio->is_aio_context_attached = false;
+    aio->aio_ctx = NULL;
 }

 void win32_aio_attach_aio_context(QEMUWin32AIOState *aio,
                                   AioContext *new_context)
 {
-    aio->is_aio_context_attached = true;
+    aio->aio_ctx = new_context;
     aio_set_event_notifier(new_context, &aio->e, false,
                            win32_aio_completion_cb, NULL);
 }
@@ -XXX,XX +XXX,XX @@ out_free_state:

 void win32_aio_cleanup(QEMUWin32AIOState *aio)
 {
-    assert(!aio->is_aio_context_attached);
+    assert(!aio->aio_ctx);
     CloseHandle(aio->hIOCP);
     event_notifier_cleanup(&aio->e);
     g_free(aio);
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -XXX,XX +XXX,XX @@ static void virtio_blk_ioctl_complete(void *opaque, int status)
 {
     VirtIOBlockIoctlReq *ioctl_req = opaque;
     VirtIOBlockReq *req = ioctl_req->req;
-    VirtIODevice *vdev = VIRTIO_DEVICE(req->dev);
+    VirtIOBlock *s = req->dev;
+    VirtIODevice *vdev = VIRTIO_DEVICE(s);
     struct virtio_scsi_inhdr *scsi;
     struct sg_io_hdr *hdr;

@@ -XXX,XX +XXX,XX @@ bool virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq)
     MultiReqBuffer mrb = {};
     bool progress = false;

+    aio_context_acquire(blk_get_aio_context(s->blk));
     blk_io_plug(s->blk);

     do {
@@ -XXX,XX +XXX,XX @@ bool virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq)
     }

     blk_io_unplug(s->blk);
+    aio_context_release(blk_get_aio_context(s->blk));
     return progress;
 }

diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -XXX,XX +XXX,XX @@ bool virtio_scsi_handle_ctrl_vq(VirtIOSCSI *s, VirtQueue *vq)
     VirtIOSCSIReq *req;
     bool progress = false;

+    virtio_scsi_acquire(s);
     while ((req = virtio_scsi_pop_req(s, vq))) {
         progress = true;
         virtio_scsi_handle_ctrl_req(s, req);
     }
+    virtio_scsi_release(s);
     return progress;
 }

@@ -XXX,XX +XXX,XX @@ bool virtio_scsi_handle_cmd_vq(VirtIOSCSI *s, VirtQueue *vq)

     QTAILQ_HEAD(, VirtIOSCSIReq) reqs = QTAILQ_HEAD_INITIALIZER(reqs);

+    virtio_scsi_acquire(s);
     do {
         virtio_queue_set_notification(vq, 0);

@@ -XXX,XX +XXX,XX @@ bool virtio_scsi_handle_cmd_vq(VirtIOSCSI *s, VirtQueue *vq)
     QTAILQ_FOREACH_SAFE(req, &reqs, next, next) {
         virtio_scsi_handle_cmd_req_submit(s, req);
     }
+    virtio_scsi_release(s);
     return progress;
 }

@@ -XXX,XX +XXX,XX @@ out:

 bool virtio_scsi_handle_event_vq(VirtIOSCSI *s, VirtQueue *vq)
 {
+    virtio_scsi_acquire(s);
     if (s->events_dropped) {
         virtio_scsi_push_event(s, NULL, VIRTIO_SCSI_T_NO_EVENT, 0);
+        virtio_scsi_release(s);
         return true;
     }
+    virtio_scsi_release(s);
     return false;
 }

diff --git a/util/aio-posix.c b/util/aio-posix.c
index XXXXXXX..XXXXXXX 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -XXX,XX +XXX,XX @@ static bool aio_dispatch_handlers(AioContext *ctx)
             (revents & (G_IO_IN | G_IO_HUP | G_IO_ERR)) &&
             aio_node_check(ctx, node->is_external) &&
             node->io_read) {
-            aio_context_acquire(ctx);
             node->io_read(node->opaque);
-            aio_context_release(ctx);

             /* aio_notify() does not count as progress */
             if (node->opaque != &ctx->notifier) {
@@ -XXX,XX +XXX,XX @@ static bool aio_dispatch_handlers(AioContext *ctx)
             (revents & (G_IO_OUT | G_IO_ERR)) &&
             aio_node_check(ctx, node->is_external) &&
             node->io_write) {
-            aio_context_acquire(ctx);
             node->io_write(node->opaque);
-            aio_context_release(ctx);
             progress = true;
         }

@@ -XXX,XX +XXX,XX @@ bool aio_poll(AioContext *ctx, bool blocking)
         start = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
     }

-    aio_context_acquire(ctx);
     progress = try_poll_mode(ctx, blocking);
-    aio_context_release(ctx);
-
     if (!progress) {
         assert(npfd == 0);

diff --git a/util/aio-win32.c b/util/aio-win32.c
index XXXXXXX..XXXXXXX 100644
--- a/util/aio-win32.c
+++ b/util/aio-win32.c
@@ -XXX,XX +XXX,XX @@ static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
             (revents || event_notifier_get_handle(node->e) == event) &&
             node->io_notify) {
             node->pfd.revents = 0;
-            aio_context_acquire(ctx);
             node->io_notify(node->e);
-            aio_context_release(ctx);

             /* aio_notify() does not count as progress */
             if (node->e != &ctx->notifier) {
@@ -XXX,XX +XXX,XX @@ static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
             (node->io_read || node->io_write)) {
             node->pfd.revents = 0;
             if ((revents & G_IO_IN) && node->io_read) {
-                aio_context_acquire(ctx);
                 node->io_read(node->opaque);
-                aio_context_release(ctx);
                 progress = true;
             }
             if ((revents & G_IO_OUT) && node->io_write) {
-                aio_context_acquire(ctx);
                 node->io_write(node->opaque);
-                aio_context_release(ctx);
                 progress = true;
             }

--
2.9.3

From: Sam Li <faithilikerun@gmail.com>

This patch extends virtio-blk emulation to handle zoned device commands
by calling the new block layer APIs to perform zoned device I/O on
behalf of the guest. It supports Report Zone, four zone operations (open,
close, finish, reset), and Append Zone.

The VIRTIO_BLK_F_ZONED feature bit will only be set if the host does
support zoned block devices. It will not be set for regular block
devices (conventional zones).

The guest OS can use blktests and fio to test those commands on zoned
devices. Furthermore, using zonefs to test zone append writes is also
supported.

Signed-off-by: Sam Li <faithilikerun@gmail.com>
Message-id: 20230508051916.178322-2-faithilikerun@gmail.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 hw/block/virtio-blk-common.c |   2 +
 hw/block/virtio-blk.c        | 389 +++++++++++++++++++++++++++++++++++
 hw/virtio/virtio-qmp.c       |   2 +
 3 files changed, 393 insertions(+)

diff --git a/hw/block/virtio-blk-common.c b/hw/block/virtio-blk-common.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/block/virtio-blk-common.c
+++ b/hw/block/virtio-blk-common.c
@@ -XXX,XX +XXX,XX @@ static const VirtIOFeature feature_sizes[] = {
      .end = endof(struct virtio_blk_config, discard_sector_alignment)},
     {.flags = 1ULL << VIRTIO_BLK_F_WRITE_ZEROES,
      .end = endof(struct virtio_blk_config, write_zeroes_may_unmap)},
+    {.flags = 1ULL << VIRTIO_BLK_F_ZONED,
+     .end = endof(struct virtio_blk_config, zoned)},
     {}
 };

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -XXX,XX +XXX,XX @@
 #include "qemu/module.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
+#include "block/block_int.h"
 #include "trace.h"
 #include "hw/block/block.h"
 #include "hw/qdev-properties.h"
@@ -XXX,XX +XXX,XX @@ err:
     return err_status;
 }

+typedef struct ZoneCmdData {
+    VirtIOBlockReq *req;
+    struct iovec *in_iov;
+    unsigned in_num;
+    union {
+        struct {
+            unsigned int nr_zones;
+            BlockZoneDescriptor *zones;
+        } zone_report_data;
+        struct {
+            int64_t offset;
+        } zone_append_data;
+    };
+} ZoneCmdData;
+
+/*
+ * check zoned_request: error checking before issuing requests. If all checks
+ * passed, return true.
+ * append: true if only zone append requests issued.
+ */
+static bool check_zoned_request(VirtIOBlock *s, int64_t offset, int64_t len,
+                                bool append, uint8_t *status) {
+    BlockDriverState *bs = blk_bs(s->blk);
+    int index;
+
+    if (!virtio_has_feature(s->host_features, VIRTIO_BLK_F_ZONED)) {
+        *status = VIRTIO_BLK_S_UNSUPP;
+        return false;
+    }
+
+    if (offset < 0 || len < 0 || len > (bs->total_sectors << BDRV_SECTOR_BITS)
+        || offset > (bs->total_sectors << BDRV_SECTOR_BITS) - len) {
+        *status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+        return false;
+    }
+
+    if (append) {
+        if (bs->bl.write_granularity) {
+            if ((offset % bs->bl.write_granularity) != 0) {
+                *status = VIRTIO_BLK_S_ZONE_UNALIGNED_WP;
+                return false;
+            }
+        }
+
+        index = offset / bs->bl.zone_size;
+        if (BDRV_ZT_IS_CONV(bs->wps->wp[index])) {
+            *status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+            return false;
+        }
+
+        if (len / 512 > bs->bl.max_append_sectors) {
+            if (bs->bl.max_append_sectors == 0) {
+                *status = VIRTIO_BLK_S_UNSUPP;
+            } else {
+                *status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+            }
+            return false;
+        }
+    }
+    return true;
+}
+
+static void virtio_blk_zone_report_complete(void *opaque, int ret)
+{
+    ZoneCmdData *data = opaque;
+    VirtIOBlockReq *req = data->req;
+    VirtIOBlock *s = req->dev;
+    VirtIODevice *vdev = VIRTIO_DEVICE(req->dev);
+    struct iovec *in_iov = data->in_iov;
+    unsigned in_num = data->in_num;
+    int64_t zrp_size, n, j = 0;
+    int64_t nz = data->zone_report_data.nr_zones;
+    int8_t err_status = VIRTIO_BLK_S_OK;
+
+    if (ret) {
+        err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+        goto out;
+    }
+
+    struct virtio_blk_zone_report zrp_hdr = (struct virtio_blk_zone_report) {
+        .nr_zones = cpu_to_le64(nz),
+    };
+    zrp_size = sizeof(struct virtio_blk_zone_report)
+               + sizeof(struct virtio_blk_zone_descriptor) * nz;
+    n = iov_from_buf(in_iov, in_num, 0, &zrp_hdr, sizeof(zrp_hdr));
+    if (n != sizeof(zrp_hdr)) {
+        virtio_error(vdev, "Driver provided input buffer that is too small!");
+        err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+        goto out;
+    }
+
+    for (size_t i = sizeof(zrp_hdr); i < zrp_size;
+         i += sizeof(struct virtio_blk_zone_descriptor), ++j) {
+        struct virtio_blk_zone_descriptor desc =
+            (struct virtio_blk_zone_descriptor) {
+                .z_start = cpu_to_le64(data->zone_report_data.zones[j].start
+                                       >> BDRV_SECTOR_BITS),
+                .z_cap = cpu_to_le64(data->zone_report_data.zones[j].cap
+                                     >> BDRV_SECTOR_BITS),
+                .z_wp = cpu_to_le64(data->zone_report_data.zones[j].wp
+                                    >> BDRV_SECTOR_BITS),
+            };
+
+        switch (data->zone_report_data.zones[j].type) {
+        case BLK_ZT_CONV:
+            desc.z_type = VIRTIO_BLK_ZT_CONV;
+            break;
+        case BLK_ZT_SWR:
+            desc.z_type = VIRTIO_BLK_ZT_SWR;
+            break;
+        case BLK_ZT_SWP:
+            desc.z_type = VIRTIO_BLK_ZT_SWP;
+            break;
+        default:
+            g_assert_not_reached();
+        }
+
+        switch (data->zone_report_data.zones[j].state) {
+        case BLK_ZS_RDONLY:
+            desc.z_state = VIRTIO_BLK_ZS_RDONLY;
+            break;
+        case BLK_ZS_OFFLINE:
+            desc.z_state = VIRTIO_BLK_ZS_OFFLINE;
+            break;
+        case BLK_ZS_EMPTY:
+            desc.z_state = VIRTIO_BLK_ZS_EMPTY;
+            break;
+        case BLK_ZS_CLOSED:
+            desc.z_state = VIRTIO_BLK_ZS_CLOSED;
+            break;
+        case BLK_ZS_FULL:
+            desc.z_state = VIRTIO_BLK_ZS_FULL;
+            break;
+        case BLK_ZS_EOPEN:
+            desc.z_state = VIRTIO_BLK_ZS_EOPEN;
+            break;
+        case BLK_ZS_IOPEN:
+            desc.z_state = VIRTIO_BLK_ZS_IOPEN;
+            break;
+        case BLK_ZS_NOT_WP:
+            desc.z_state = VIRTIO_BLK_ZS_NOT_WP;
+            break;
+        default:
+            g_assert_not_reached();
+        }
+
+        /* TODO: it takes O(n^2) time complexity. Optimizations required. */
+        n = iov_from_buf(in_iov, in_num, i, &desc, sizeof(desc));
+        if (n != sizeof(desc)) {
+            virtio_error(vdev, "Driver provided input buffer "
+                               "for descriptors that is too small!");
+            err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+        }
+    }
+
+out:
+    aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
+    virtio_blk_req_complete(req, err_status);
+    virtio_blk_free_request(req);
+    aio_context_release(blk_get_aio_context(s->conf.conf.blk));
+    g_free(data->zone_report_data.zones);
+    g_free(data);
+}
+
+static void virtio_blk_handle_zone_report(VirtIOBlockReq *req,
+                                          struct iovec *in_iov,
+                                          unsigned in_num)
+{
+    VirtIOBlock *s = req->dev;
+    VirtIODevice *vdev = VIRTIO_DEVICE(s);
+    unsigned int nr_zones;
+    ZoneCmdData *data;
+    int64_t zone_size, offset;
+    uint8_t err_status;
+
+    if (req->in_len < sizeof(struct virtio_blk_inhdr) +
+                      sizeof(struct virtio_blk_zone_report) +
+                      sizeof(struct virtio_blk_zone_descriptor)) {
+        virtio_error(vdev, "in buffer too small for zone report");
+        return;
+    }
+
+    /* start byte offset of the zone report */
+    offset = virtio_ldq_p(vdev, &req->out.sector) << BDRV_SECTOR_BITS;
+    if (!check_zoned_request(s, offset, 0, false, &err_status)) {
+        goto out;
+    }
+    nr_zones = (req->in_len - sizeof(struct virtio_blk_inhdr) -
+                sizeof(struct virtio_blk_zone_report)) /
+               sizeof(struct virtio_blk_zone_descriptor);
+
+    zone_size = sizeof(BlockZoneDescriptor) * nr_zones;
+    data = g_malloc(sizeof(ZoneCmdData));
+    data->req = req;
+    data->in_iov = in_iov;
+    data->in_num = in_num;
+    data->zone_report_data.nr_zones = nr_zones;
+    data->zone_report_data.zones = g_malloc(zone_size),
+
+    blk_aio_zone_report(s->blk, offset, &data->zone_report_data.nr_zones,
+                        data->zone_report_data.zones,
+                        virtio_blk_zone_report_complete, data);
+    return;
+out:
+    virtio_blk_req_complete(req, err_status);
+    virtio_blk_free_request(req);
+}
+
+static void virtio_blk_zone_mgmt_complete(void *opaque, int ret)
+{
+    VirtIOBlockReq *req = opaque;
+    VirtIOBlock *s = req->dev;
+    int8_t err_status = VIRTIO_BLK_S_OK;
+
+    if (ret) {
+        err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+    }
+
+    aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
+    virtio_blk_req_complete(req, err_status);
+    virtio_blk_free_request(req);
+    aio_context_release(blk_get_aio_context(s->conf.conf.blk));
+}
+
+static int virtio_blk_handle_zone_mgmt(VirtIOBlockReq *req, BlockZoneOp op)
+{
+    VirtIOBlock *s = req->dev;
+    VirtIODevice *vdev = VIRTIO_DEVICE(s);
+    BlockDriverState *bs = blk_bs(s->blk);
+    int64_t offset = virtio_ldq_p(vdev, &req->out.sector) << BDRV_SECTOR_BITS;
+    uint64_t len;
+    uint64_t capacity = bs->total_sectors << BDRV_SECTOR_BITS;
+    uint8_t err_status = VIRTIO_BLK_S_OK;
+
+    uint32_t type = virtio_ldl_p(vdev, &req->out.type);
+    if (type == VIRTIO_BLK_T_ZONE_RESET_ALL) {
+        /* Entire drive capacity */
+        offset = 0;
+        len = capacity;
+    } else {
+        if (bs->bl.zone_size > capacity - offset) {
+            /* The zoned device allows the last smaller zone. */
+            len = capacity - bs->bl.zone_size * (bs->bl.nr_zones - 1);
+        } else {
+            len = bs->bl.zone_size;
+        }
+    }
+
+    if (!check_zoned_request(s, offset, len, false, &err_status)) {
+        goto out;
+    }
+
+    blk_aio_zone_mgmt(s->blk, op, offset, len,
+                      virtio_blk_zone_mgmt_complete, req);
+
+    return 0;
+out:
+    virtio_blk_req_complete(req, err_status);
+    virtio_blk_free_request(req);
+    return err_status;
+}
+
+static void virtio_blk_zone_append_complete(void *opaque, int ret)
+{
+    ZoneCmdData *data = opaque;
+    VirtIOBlockReq *req = data->req;
+    VirtIOBlock *s = req->dev;
+    VirtIODevice *vdev = VIRTIO_DEVICE(req->dev);
+    int64_t append_sector, n;
+    uint8_t err_status = VIRTIO_BLK_S_OK;
+
+    if (ret) {
+        err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+        goto out;
+    }
+
+    virtio_stq_p(vdev, &append_sector,
+                 data->zone_append_data.offset >> BDRV_SECTOR_BITS);
+    n = iov_from_buf(data->in_iov, data->in_num, 0, &append_sector,
+                     sizeof(append_sector));
+    if (n != sizeof(append_sector)) {
+        virtio_error(vdev, "Driver provided input buffer less than size of "
+                           "append_sector");
+        err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+        goto out;
+    }
+
+out:
+    aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
+    virtio_blk_req_complete(req, err_status);
+    virtio_blk_free_request(req);
+    aio_context_release(blk_get_aio_context(s->conf.conf.blk));
+    g_free(data);
+}
+
+static int virtio_blk_handle_zone_append(VirtIOBlockReq *req,
+                                         struct iovec *out_iov,
+                                         struct iovec *in_iov,
+                                         uint64_t out_num,
+                                         unsigned in_num) {
+    VirtIOBlock *s = req->dev;
+    VirtIODevice *vdev = VIRTIO_DEVICE(s);
+    uint8_t err_status = VIRTIO_BLK_S_OK;
+
+    int64_t offset = virtio_ldq_p(vdev, &req->out.sector) << BDRV_SECTOR_BITS;
+    int64_t len = iov_size(out_iov, out_num);
+
+    if (!check_zoned_request(s, offset, len, true, &err_status)) {
+        goto out;
+    }
+
+    ZoneCmdData *data = g_malloc(sizeof(ZoneCmdData));
+    data->req = req;
+    data->in_iov = in_iov;
+    data->in_num = in_num;
+    data->zone_append_data.offset = offset;
+    qemu_iovec_init_external(&req->qiov, out_iov, out_num);
+    blk_aio_zone_append(s->blk, &data->zone_append_data.offset, &req->qiov, 0,
+                        virtio_blk_zone_append_complete, data);
+    return 0;
+
+out:
+    aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
+    virtio_blk_req_complete(req, err_status);
+    virtio_blk_free_request(req);
+    aio_context_release(blk_get_aio_context(s->conf.conf.blk));
+    return err_status;
+}
+
 static int virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb)
 {
     uint32_t type;
@@ -XXX,XX +XXX,XX @@ static int virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb)
     case VIRTIO_BLK_T_FLUSH:
         virtio_blk_handle_flush(req, mrb);
         break;
+    case VIRTIO_BLK_T_ZONE_REPORT:
+        virtio_blk_handle_zone_report(req, in_iov, in_num);
+        break;
+    case VIRTIO_BLK_T_ZONE_OPEN:
+        virtio_blk_handle_zone_mgmt(req, BLK_ZO_OPEN);
+        break;
+    case VIRTIO_BLK_T_ZONE_CLOSE:
+        virtio_blk_handle_zone_mgmt(req, BLK_ZO_CLOSE);
+        break;
+    case VIRTIO_BLK_T_ZONE_FINISH:
+        virtio_blk_handle_zone_mgmt(req, BLK_ZO_FINISH);
+        break;
+    case VIRTIO_BLK_T_ZONE_RESET:
+        virtio_blk_handle_zone_mgmt(req, BLK_ZO_RESET);
+        break;
+    case VIRTIO_BLK_T_ZONE_RESET_ALL:
+        virtio_blk_handle_zone_mgmt(req, BLK_ZO_RESET);
+        break;
     case VIRTIO_BLK_T_SCSI_CMD:
         virtio_blk_handle_scsi(req);
         break;
@@ -XXX,XX +XXX,XX @@ static int virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb)
         virtio_blk_free_request(req);
         break;
     }
+    case VIRTIO_BLK_T_ZONE_APPEND & ~VIRTIO_BLK_T_OUT:
+        /*
+         * Passing out_iov/out_num and in_iov/in_num is not safe
+         * to access req->elem.out_sg directly because it may be
+         * modified by virtio_blk_handle_request().
+         */
+        virtio_blk_handle_zone_append(req, out_iov, in_iov, out_num, in_num);
+        break;
     /*
      * VIRTIO_BLK_T_DISCARD and VIRTIO_BLK_T_WRITE_ZEROES are defined with
      * VIRTIO_BLK_T_OUT flag set. We masked this flag in the switch statement,
@@ -XXX,XX +XXX,XX @@ static void virtio_blk_update_config(VirtIODevice *vdev, uint8_t *config)
 {
     VirtIOBlock *s = VIRTIO_BLK(vdev);
     BlockConf *conf = &s->conf.conf;
+    BlockDriverState *bs = blk_bs(s->blk);
     struct virtio_blk_config blkcfg;
     uint64_t capacity;
     int64_t length;
@@ -XXX,XX +XXX,XX @@ static void virtio_blk_update_config(VirtIODevice *vdev, uint8_t *config)
         blkcfg.write_zeroes_may_unmap = 1;
         virtio_stl_p(vdev, &blkcfg.max_write_zeroes_seg, 1);
     }
+    if (bs->bl.zoned != BLK_Z_NONE) {
+        switch (bs->bl.zoned) {
+        case BLK_Z_HM:
+            blkcfg.zoned.model = VIRTIO_BLK_Z_HM;
+            break;
+        case BLK_Z_HA:
+            blkcfg.zoned.model = VIRTIO_BLK_Z_HA;
+            break;
+        default:
+            g_assert_not_reached();
+        }
+
+        virtio_stl_p(vdev, &blkcfg.zoned.zone_sectors,
+                     bs->bl.zone_size / 512);
+        virtio_stl_p(vdev, &blkcfg.zoned.max_active_zones,
+                     bs->bl.max_active_zones);
+        virtio_stl_p(vdev, &blkcfg.zoned.max_open_zones,
+                     bs->bl.max_open_zones);
+        virtio_stl_p(vdev, &blkcfg.zoned.write_granularity, blk_size);
+        virtio_stl_p(vdev, &blkcfg.zoned.max_append_sectors,
+                     bs->bl.max_append_sectors);
+    } else {
+        blkcfg.zoned.model = VIRTIO_BLK_Z_NONE;
+    }
     memcpy(config, &blkcfg, s->config_size);
 }

@@ -XXX,XX +XXX,XX @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
         return;
     }

+    BlockDriverState *bs = blk_bs(conf->conf.blk);
+    if (bs->bl.zoned != BLK_Z_NONE) {
+        virtio_add_feature(&s->host_features, VIRTIO_BLK_F_ZONED);
+        if (bs->bl.zoned == BLK_Z_HM) {
+            virtio_clear_feature(&s->host_features, VIRTIO_BLK_F_DISCARD);
+        }
+    }
+
     if (virtio_has_feature(s->host_features, VIRTIO_BLK_F_DISCARD) &&
         (!conf->max_discard_sectors ||
          conf->max_discard_sectors > BDRV_REQUEST_MAX_SECTORS)) {
diff --git a/hw/virtio/virtio-qmp.c b/hw/virtio/virtio-qmp.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/virtio/virtio-qmp.c
+++ b/hw/virtio/virtio-qmp.c
@@ -XXX,XX +XXX,XX @@ static const qmp_virtio_feature_map_t virtio_blk_feature_map[] = {
                   "VIRTIO_BLK_F_DISCARD: Discard command supported"),
     FEATURE_ENTRY(VIRTIO_BLK_F_WRITE_ZEROES, \
                   "VIRTIO_BLK_F_WRITE_ZEROES: Write zeroes command supported"),
+    FEATURE_ENTRY(VIRTIO_BLK_F_ZONED, \
+                  "VIRTIO_BLK_F_ZONED: Zoned block devices"),
 #ifndef VIRTIO_BLK_NO_LEGACY
     FEATURE_ENTRY(VIRTIO_BLK_F_BARRIER, \
                   "VIRTIO_BLK_F_BARRIER: Request barriers supported"),
--
2.40.1
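The sizing logic in virtio_blk_handle_zone_report() above is worth spelling out: the number of zone descriptors the device may return is derived entirely from the driver-provided writable buffer. A simplified, illustrative sketch of that computation (the helper name zone_report_capacity is hypothetical; the struct names are the ones used in the patch and assume QEMU's virtio-blk headers):

    /* How many virtio_blk_zone_descriptor entries fit in a request whose
     * device-writable area is in_len bytes?  Mirrors the guard and the
     * nr_zones computation in virtio_blk_handle_zone_report(). */
    static unsigned int zone_report_capacity(size_t in_len)
    {
        size_t fixed = sizeof(struct virtio_blk_inhdr) +
                       sizeof(struct virtio_blk_zone_report);

        if (in_len < fixed + sizeof(struct virtio_blk_zone_descriptor)) {
            return 0;   /* too small for the header plus even one descriptor */
        }
        return (in_len - fixed) / sizeof(struct virtio_blk_zone_descriptor);
    }

If the result is zero the request is rejected up front; otherwise that count is passed to blk_aio_zone_report(), which may shrink it further based on how many zones actually exist past the requested offset.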
1
From: Paolo Bonzini <pbonzini@redhat.com>
1
From: Sam Li <faithilikerun@gmail.com>
2
2
3
Running a very small critical section on pthread_mutex_t and CoMutex
3
Taking account of the new zone append write operation for zoned devices,
4
shows that pthread_mutex_t is much faster because it doesn't actually
4
BLOCK_ACCT_ZONE_APPEND enum is introduced as other I/O request type (read,
5
go to sleep. What happens is that the critical section is shorter
5
write, flush).
6
than the latency of entering the kernel and thus FUTEX_WAIT always
7
fails. With CoMutex there is no such latency but you still want to
8
avoid wait and wakeup. So introduce it artificially.
9
6
10
This only works with one waiters; because CoMutex is fair, it will
7
Signed-off-by: Sam Li <faithilikerun@gmail.com>
11
always have more waits and wakeups than a pthread_mutex_t.
8
Message-id: 20230508051916.178322-3-faithilikerun@gmail.com
12
13
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
14
Reviewed-by: Fam Zheng <famz@redhat.com>
15
Message-id: 20170213181244.16297-3-pbonzini@redhat.com
16
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
9
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
17
---
10
---
18
include/qemu/coroutine.h | 5 +++++
11
qapi/block-core.json | 68 ++++++++++++++++++++++++++++++++------
19
util/qemu-coroutine-lock.c | 51 ++++++++++++++++++++++++++++++++++++++++------
12
qapi/block.json | 4 +++
20
util/qemu-coroutine.c | 2 +-
13
include/block/accounting.h | 1 +
21
3 files changed, 51 insertions(+), 7 deletions(-)
14
block/qapi-sysemu.c | 11 ++++++
15
block/qapi.c | 18 ++++++++++
16
hw/block/virtio-blk.c | 4 +++
17
tests/qemu-iotests/227.out | 18 ++++++++++
18
7 files changed, 113 insertions(+), 11 deletions(-)
22
19
23
diff --git a/include/qemu/coroutine.h b/include/qemu/coroutine.h
20
diff --git a/qapi/block-core.json b/qapi/block-core.json
24
index XXXXXXX..XXXXXXX 100644
21
index XXXXXXX..XXXXXXX 100644
25
--- a/include/qemu/coroutine.h
22
--- a/qapi/block-core.json
26
+++ b/include/qemu/coroutine.h
23
+++ b/qapi/block-core.json
27
@@ -XXX,XX +XXX,XX @@ typedef struct CoMutex {
24
@@ -XXX,XX +XXX,XX @@
28
*/
25
# @min_wr_latency_ns: Minimum latency of write operations in the
29
unsigned locked;
26
# defined interval, in nanoseconds.
30
27
#
31
+ /* Context that is holding the lock. Useful to avoid spinning
28
+# @min_zone_append_latency_ns: Minimum latency of zone append operations
32
+ * when two coroutines on the same AioContext try to get the lock. :)
29
+# in the defined interval, in nanoseconds
33
+ */
30
+# (since 8.1)
34
+ AioContext *ctx;
31
+#
35
+
32
# @min_flush_latency_ns: Minimum latency of flush operations in the
36
/* A queue of waiters. Elements are added atomically in front of
33
# defined interval, in nanoseconds.
37
* from_push. to_pop is only populated, and popped from, by whoever
34
#
38
* is in charge of the next wakeup. This can be an unlocker or,
35
@@ -XXX,XX +XXX,XX @@
39
diff --git a/util/qemu-coroutine-lock.c b/util/qemu-coroutine-lock.c
36
# @max_wr_latency_ns: Maximum latency of write operations in the
40
index XXXXXXX..XXXXXXX 100644
37
# defined interval, in nanoseconds.
41
--- a/util/qemu-coroutine-lock.c
38
#
42
+++ b/util/qemu-coroutine-lock.c
39
+# @max_zone_append_latency_ns: Maximum latency of zone append operations
43
@@ -XXX,XX +XXX,XX @@
40
+# in the defined interval, in nanoseconds
44
#include "qemu-common.h"
41
+# (since 8.1)
45
#include "qemu/coroutine.h"
42
+#
46
#include "qemu/coroutine_int.h"
43
# @max_flush_latency_ns: Maximum latency of flush operations in the
47
+#include "qemu/processor.h"
44
# defined interval, in nanoseconds.
48
#include "qemu/queue.h"
45
#
49
#include "block/aio.h"
46
@@ -XXX,XX +XXX,XX @@
50
#include "trace.h"
47
# @avg_wr_latency_ns: Average latency of write operations in the
51
@@ -XXX,XX +XXX,XX @@ void qemu_co_mutex_init(CoMutex *mutex)
48
# defined interval, in nanoseconds.
52
memset(mutex, 0, sizeof(*mutex));
49
#
53
}
50
+# @avg_zone_append_latency_ns: Average latency of zone append operations
54
51
+# in the defined interval, in nanoseconds
55
-static void coroutine_fn qemu_co_mutex_lock_slowpath(CoMutex *mutex)
52
+# (since 8.1)
56
+static void coroutine_fn qemu_co_mutex_wake(CoMutex *mutex, Coroutine *co)
53
+#
57
+{
54
# @avg_flush_latency_ns: Average latency of flush operations in the
58
+ /* Read co before co->ctx; pairs with smp_wmb() in
55
# defined interval, in nanoseconds.
59
+ * qemu_coroutine_enter().
56
#
60
+ */
57
@@ -XXX,XX +XXX,XX @@
61
+ smp_read_barrier_depends();
58
# @avg_wr_queue_depth: Average number of pending write operations in
62
+ mutex->ctx = co->ctx;
59
# the defined interval.
63
+ aio_co_wake(co);
60
#
64
+}
61
+# @avg_zone_append_queue_depth: Average number of pending zone append
65
+
62
+# operations in the defined interval
66
+static void coroutine_fn qemu_co_mutex_lock_slowpath(AioContext *ctx,
63
+# (since 8.1).
67
+ CoMutex *mutex)
64
+#
65
# Since: 2.5
66
##
67
{ 'struct': 'BlockDeviceTimedStats',
68
'data': { 'interval_length': 'int', 'min_rd_latency_ns': 'int',
69
'max_rd_latency_ns': 'int', 'avg_rd_latency_ns': 'int',
70
'min_wr_latency_ns': 'int', 'max_wr_latency_ns': 'int',
71
- 'avg_wr_latency_ns': 'int', 'min_flush_latency_ns': 'int',
72
- 'max_flush_latency_ns': 'int', 'avg_flush_latency_ns': 'int',
73
- 'avg_rd_queue_depth': 'number', 'avg_wr_queue_depth': 'number' } }
74
+ 'avg_wr_latency_ns': 'int', 'min_zone_append_latency_ns': 'int',
75
+ 'max_zone_append_latency_ns': 'int',
76
+ 'avg_zone_append_latency_ns': 'int',
77
+ 'min_flush_latency_ns': 'int', 'max_flush_latency_ns': 'int',
78
+ 'avg_flush_latency_ns': 'int', 'avg_rd_queue_depth': 'number',
79
+ 'avg_wr_queue_depth': 'number',
80
+ 'avg_zone_append_queue_depth': 'number' } }
81
82
##
83
# @BlockDeviceStats:
84
@@ -XXX,XX +XXX,XX @@
85
#
86
# @wr_bytes: The number of bytes written by the device.
87
#
88
+# @zone_append_bytes: The number of bytes appended by the zoned devices
89
+# (since 8.1)
90
+#
91
# @unmap_bytes: The number of bytes unmapped by the device (Since 4.2)
92
#
93
# @rd_operations: The number of read operations performed by the
94
@@ -XXX,XX +XXX,XX @@
95
# @wr_operations: The number of write operations performed by the
96
# device.
97
#
98
+# @zone_append_operations: The number of zone append operations performed
99
+# by the zoned devices (since 8.1)
100
+#
101
# @flush_operations: The number of cache flush operations performed by
102
# the device (since 0.15)
103
#
104
@@ -XXX,XX +XXX,XX @@
105
# @wr_total_time_ns: Total time spent on writes in nanoseconds (since
106
# 0.15).
107
#
108
+# @zone_append_total_time_ns: Total time spent on zone append writes
109
+# in nanoseconds (since 8.1)
110
+#
111
# @flush_total_time_ns: Total time spent on cache flushes in
112
# nanoseconds (since 0.15).
113
#
114
@@ -XXX,XX +XXX,XX @@
115
# @wr_merged: Number of write requests that have been merged into
116
# another request (Since 2.3).
117
#
118
+# @zone_append_merged: Number of zone append requests that have been merged
119
+# into another request (since 8.1)
120
+#
121
# @unmap_merged: Number of unmap requests that have been merged into
122
# another request (Since 4.2)
123
#
124
@@ -XXX,XX +XXX,XX @@
125
# @failed_wr_operations: The number of failed write operations
126
# performed by the device (Since 2.5)
127
#
128
+# @failed_zone_append_operations: The number of failed zone append write
129
+# operations performed by the zoned devices
130
+# (since 8.1)
131
+#
132
# @failed_flush_operations: The number of failed flush operations
133
# performed by the device (Since 2.5)
134
#
135
@@ -XXX,XX +XXX,XX @@
136
# @invalid_wr_operations: The number of invalid write operations
137
# performed by the device (Since 2.5)
138
#
139
+# @invalid_zone_append_operations: The number of invalid zone append operations
+# performed by the zoned device (since 8.1)
+#
# @invalid_flush_operations: The number of invalid flush operations
# performed by the device (Since 2.5)
#
@@ -XXX,XX +XXX,XX @@
#
# @wr_latency_histogram: @BlockLatencyHistogramInfo. (Since 4.0)
#
+# @zone_append_latency_histogram: @BlockLatencyHistogramInfo. (since 8.1)
+#
# @flush_latency_histogram: @BlockLatencyHistogramInfo. (Since 4.0)
#
# Since: 0.14
##
{ 'struct': 'BlockDeviceStats',
- 'data': {'rd_bytes': 'int', 'wr_bytes': 'int', 'unmap_bytes' : 'int',
- 'rd_operations': 'int', 'wr_operations': 'int',
+ 'data': {'rd_bytes': 'int', 'wr_bytes': 'int', 'zone_append_bytes': 'int',
+ 'unmap_bytes' : 'int', 'rd_operations': 'int',
+ 'wr_operations': 'int', 'zone_append_operations': 'int',
'flush_operations': 'int', 'unmap_operations': 'int',
'rd_total_time_ns': 'int', 'wr_total_time_ns': 'int',
- 'flush_total_time_ns': 'int', 'unmap_total_time_ns': 'int',
- 'wr_highest_offset': 'int',
- 'rd_merged': 'int', 'wr_merged': 'int', 'unmap_merged': 'int',
- '*idle_time_ns': 'int',
+ 'zone_append_total_time_ns': 'int', 'flush_total_time_ns': 'int',
+ 'unmap_total_time_ns': 'int', 'wr_highest_offset': 'int',
+ 'rd_merged': 'int', 'wr_merged': 'int', 'zone_append_merged': 'int',
+ 'unmap_merged': 'int', '*idle_time_ns': 'int',
'failed_rd_operations': 'int', 'failed_wr_operations': 'int',
- 'failed_flush_operations': 'int', 'failed_unmap_operations': 'int',
- 'invalid_rd_operations': 'int', 'invalid_wr_operations': 'int',
+ 'failed_zone_append_operations': 'int',
+ 'failed_flush_operations': 'int',
+ 'failed_unmap_operations': 'int', 'invalid_rd_operations': 'int',
+ 'invalid_wr_operations': 'int',
+ 'invalid_zone_append_operations': 'int',
'invalid_flush_operations': 'int', 'invalid_unmap_operations': 'int',
'account_invalid': 'bool', 'account_failed': 'bool',
'timed_stats': ['BlockDeviceTimedStats'],
'*rd_latency_histogram': 'BlockLatencyHistogramInfo',
'*wr_latency_histogram': 'BlockLatencyHistogramInfo',
+ '*zone_append_latency_histogram': 'BlockLatencyHistogramInfo',
'*flush_latency_histogram': 'BlockLatencyHistogramInfo' } }

##
diff --git a/qapi/block.json b/qapi/block.json
index XXXXXXX..XXXXXXX 100644
--- a/qapi/block.json
+++ b/qapi/block.json
@@ -XXX,XX +XXX,XX @@
# @boundaries-write: list of interval boundary values for write
# latency histogram.
#
+# @boundaries-zap: list of interval boundary values for zone append write
+# latency histogram.
+#
# @boundaries-flush: list of interval boundary values for flush
# latency histogram.
#
@@ -XXX,XX +XXX,XX @@
'*boundaries': ['uint64'],
'*boundaries-read': ['uint64'],
'*boundaries-write': ['uint64'],
+ '*boundaries-zap': ['uint64'],
'*boundaries-flush': ['uint64'] },
'allow-preconfig': true }
diff --git a/include/block/accounting.h b/include/block/accounting.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/accounting.h
+++ b/include/block/accounting.h
@@ -XXX,XX +XXX,XX @@ enum BlockAcctType {
BLOCK_ACCT_READ,
BLOCK_ACCT_WRITE,
BLOCK_ACCT_FLUSH,
+ BLOCK_ACCT_ZONE_APPEND,
BLOCK_ACCT_UNMAP,
BLOCK_MAX_IOTYPE,
};
diff --git a/block/qapi-sysemu.c b/block/qapi-sysemu.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qapi-sysemu.c
+++ b/block/qapi-sysemu.c
@@ -XXX,XX +XXX,XX @@ void qmp_block_latency_histogram_set(
bool has_boundaries, uint64List *boundaries,
bool has_boundaries_read, uint64List *boundaries_read,
bool has_boundaries_write, uint64List *boundaries_write,
+ bool has_boundaries_append, uint64List *boundaries_append,
bool has_boundaries_flush, uint64List *boundaries_flush,
Error **errp)
{
@@ -XXX,XX +XXX,XX @@ void qmp_block_latency_histogram_set(
}
}

+ if (has_boundaries || has_boundaries_append) {
+ ret = block_latency_histogram_set(
+ stats, BLOCK_ACCT_ZONE_APPEND,
+ has_boundaries_append ? boundaries_append : boundaries);
+ if (ret) {
+ error_setg(errp, "Device '%s' set append write boundaries fail", id);
+ return;
+ }
+ }
+
if (has_boundaries || has_boundaries_flush) {
ret = block_latency_histogram_set(
stats, BLOCK_ACCT_FLUSH,
diff --git a/block/qapi.c b/block/qapi.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -XXX,XX +XXX,XX @@ static void bdrv_query_blk_stats(BlockDeviceStats *ds, BlockBackend *blk)

ds->rd_bytes = stats->nr_bytes[BLOCK_ACCT_READ];
ds->wr_bytes = stats->nr_bytes[BLOCK_ACCT_WRITE];
+ ds->zone_append_bytes = stats->nr_bytes[BLOCK_ACCT_ZONE_APPEND];
ds->unmap_bytes = stats->nr_bytes[BLOCK_ACCT_UNMAP];
ds->rd_operations = stats->nr_ops[BLOCK_ACCT_READ];
ds->wr_operations = stats->nr_ops[BLOCK_ACCT_WRITE];
+ ds->zone_append_operations = stats->nr_ops[BLOCK_ACCT_ZONE_APPEND];
ds->unmap_operations = stats->nr_ops[BLOCK_ACCT_UNMAP];

ds->failed_rd_operations = stats->failed_ops[BLOCK_ACCT_READ];
ds->failed_wr_operations = stats->failed_ops[BLOCK_ACCT_WRITE];
+ ds->failed_zone_append_operations =
+ stats->failed_ops[BLOCK_ACCT_ZONE_APPEND];
ds->failed_flush_operations = stats->failed_ops[BLOCK_ACCT_FLUSH];
ds->failed_unmap_operations = stats->failed_ops[BLOCK_ACCT_UNMAP];

ds->invalid_rd_operations = stats->invalid_ops[BLOCK_ACCT_READ];
ds->invalid_wr_operations = stats->invalid_ops[BLOCK_ACCT_WRITE];
+ ds->invalid_zone_append_operations =
+ stats->invalid_ops[BLOCK_ACCT_ZONE_APPEND];
ds->invalid_flush_operations =
stats->invalid_ops[BLOCK_ACCT_FLUSH];
ds->invalid_unmap_operations = stats->invalid_ops[BLOCK_ACCT_UNMAP];

ds->rd_merged = stats->merged[BLOCK_ACCT_READ];
ds->wr_merged = stats->merged[BLOCK_ACCT_WRITE];
+ ds->zone_append_merged = stats->merged[BLOCK_ACCT_ZONE_APPEND];
ds->unmap_merged = stats->merged[BLOCK_ACCT_UNMAP];
ds->flush_operations = stats->nr_ops[BLOCK_ACCT_FLUSH];
ds->wr_total_time_ns = stats->total_time_ns[BLOCK_ACCT_WRITE];
+ ds->zone_append_total_time_ns =
+ stats->total_time_ns[BLOCK_ACCT_ZONE_APPEND];
ds->rd_total_time_ns = stats->total_time_ns[BLOCK_ACCT_READ];
ds->flush_total_time_ns = stats->total_time_ns[BLOCK_ACCT_FLUSH];
ds->unmap_total_time_ns = stats->total_time_ns[BLOCK_ACCT_UNMAP];
@@ -XXX,XX +XXX,XX @@ static void bdrv_query_blk_stats(BlockDeviceStats *ds, BlockBackend *blk)

TimedAverage *rd = &ts->latency[BLOCK_ACCT_READ];
TimedAverage *wr = &ts->latency[BLOCK_ACCT_WRITE];
+ TimedAverage *zap = &ts->latency[BLOCK_ACCT_ZONE_APPEND];
TimedAverage *fl = &ts->latency[BLOCK_ACCT_FLUSH];

dev_stats->interval_length = ts->interval_length;
@@ -XXX,XX +XXX,XX @@ static void bdrv_query_blk_stats(BlockDeviceStats *ds, BlockBackend *blk)
dev_stats->max_wr_latency_ns = timed_average_max(wr);
dev_stats->avg_wr_latency_ns = timed_average_avg(wr);

+ dev_stats->min_zone_append_latency_ns = timed_average_min(zap);
+ dev_stats->max_zone_append_latency_ns = timed_average_max(zap);
+ dev_stats->avg_zone_append_latency_ns = timed_average_avg(zap);
+
dev_stats->min_flush_latency_ns = timed_average_min(fl);
dev_stats->max_flush_latency_ns = timed_average_max(fl);
dev_stats->avg_flush_latency_ns = timed_average_avg(fl);
@@ -XXX,XX +XXX,XX @@ static void bdrv_query_blk_stats(BlockDeviceStats *ds, BlockBackend *blk)
block_acct_queue_depth(ts, BLOCK_ACCT_READ);
dev_stats->avg_wr_queue_depth =
block_acct_queue_depth(ts, BLOCK_ACCT_WRITE);
+ dev_stats->avg_zone_append_queue_depth =
+ block_acct_queue_depth(ts, BLOCK_ACCT_ZONE_APPEND);

QAPI_LIST_PREPEND(ds->timed_stats, dev_stats);
}
@@ -XXX,XX +XXX,XX @@ static void bdrv_query_blk_stats(BlockDeviceStats *ds, BlockBackend *blk)
= bdrv_latency_histogram_stats(&hgram[BLOCK_ACCT_READ]);
ds->wr_latency_histogram
= bdrv_latency_histogram_stats(&hgram[BLOCK_ACCT_WRITE]);
+ ds->zone_append_latency_histogram
+ = bdrv_latency_histogram_stats(&hgram[BLOCK_ACCT_ZONE_APPEND]);
ds->flush_latency_histogram
= bdrv_latency_histogram_stats(&hgram[BLOCK_ACCT_FLUSH]);
}
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -XXX,XX +XXX,XX @@ static int virtio_blk_handle_zone_append(VirtIOBlockReq *req,
data->in_num = in_num;
data->zone_append_data.offset = offset;
qemu_iovec_init_external(&req->qiov, out_iov, out_num);
+
+ block_acct_start(blk_get_stats(s->blk), &req->acct, len,
+ BLOCK_ACCT_ZONE_APPEND);
+
blk_aio_zone_append(s->blk, &data->zone_append_data.offset, &req->qiov, 0,
virtio_blk_zone_append_complete, data);
return 0;
diff --git a/tests/qemu-iotests/227.out b/tests/qemu-iotests/227.out
index XXXXXXX..XXXXXXX 100644
--- a/tests/qemu-iotests/227.out
+++ b/tests/qemu-iotests/227.out
@@ -XXX,XX +XXX,XX @@ Testing: -drive driver=null-co,read-zeroes=on,if=virtio
"stats": {
"unmap_operations": 0,
"unmap_merged": 0,
+ "failed_zone_append_operations": 0,
"flush_total_time_ns": 0,
"wr_highest_offset": 0,
"wr_total_time_ns": 0,
@@ -XXX,XX +XXX,XX @@ Testing: -drive driver=null-co,read-zeroes=on,if=virtio
"timed_stats": [
],
"failed_unmap_operations": 0,
+ "zone_append_merged": 0,
"failed_flush_operations": 0,
"account_invalid": true,
"rd_total_time_ns": 0,
@@ -XXX,XX +XXX,XX @@ Testing: -drive driver=null-co,read-zeroes=on,if=virtio
"unmap_total_time_ns": 0,
"invalid_flush_operations": 0,
"account_failed": true,
+ "zone_append_total_time_ns": 0,
+ "zone_append_operations": 0,
"rd_operations": 0,
+ "zone_append_bytes": 0,
+ "invalid_zone_append_operations": 0,
"invalid_wr_operations": 0,
"invalid_rd_operations": 0
},
@@ -XXX,XX +XXX,XX @@ Testing: -drive driver=null-co,if=none
"stats": {
"unmap_operations": 0,
"unmap_merged": 0,
+ "failed_zone_append_operations": 0,
"flush_total_time_ns": 0,
"wr_highest_offset": 0,
"wr_total_time_ns": 0,
@@ -XXX,XX +XXX,XX @@ Testing: -drive driver=null-co,if=none
"timed_stats": [
],
"failed_unmap_operations": 0,
+ "zone_append_merged": 0,
"failed_flush_operations": 0,
"account_invalid": true,
"rd_total_time_ns": 0,
@@ -XXX,XX +XXX,XX @@ Testing: -drive driver=null-co,if=none
"unmap_total_time_ns": 0,
"invalid_flush_operations": 0,
"account_failed": true,
+ "zone_append_total_time_ns": 0,
+ "zone_append_operations": 0,
"rd_operations": 0,
+ "zone_append_bytes": 0,
+ "invalid_zone_append_operations": 0,
"invalid_wr_operations": 0,
"invalid_rd_operations": 0
},
@@ -XXX,XX +XXX,XX @@ Testing: -blockdev driver=null-co,read-zeroes=on,node-name=null -device virtio-b
"stats": {
"unmap_operations": 0,
"unmap_merged": 0,
+ "failed_zone_append_operations": 0,
"flush_total_time_ns": 0,
"wr_highest_offset": 0,
"wr_total_time_ns": 0,
@@ -XXX,XX +XXX,XX @@ Testing: -blockdev driver=null-co,read-zeroes=on,node-name=null -device virtio-b
"timed_stats": [
],
"failed_unmap_operations": 0,
+ "zone_append_merged": 0,
"failed_flush_operations": 0,
"account_invalid": true,
"rd_total_time_ns": 0,
@@ -XXX,XX +XXX,XX @@ Testing: -blockdev driver=null-co,read-zeroes=on,node-name=null -device virtio-b
"unmap_total_time_ns": 0,
"invalid_flush_operations": 0,
"account_failed": true,
+ "zone_append_total_time_ns": 0,
+ "zone_append_operations": 0,
"rd_operations": 0,
+ "zone_append_bytes": 0,
+ "invalid_zone_append_operations": 0,
"invalid_wr_operations": 0,
"invalid_rd_operations": 0
},
--
2.40.1
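The new counters ride along in the existing QMP interfaces rather than needing new commands. A rough sketch of a client session against a build with this series applied (the device id "drive0" and the boundary values are invented for the example, and the query-blockstats reply is abbreviated to the zone append fields)::

    -> { "execute": "block-latency-histogram-set",
         "arguments": { "id": "drive0",
                        "boundaries-zap": [ 1000000, 10000000, 100000000 ] } }
    <- { "return": {} }

    -> { "execute": "query-blockstats" }
    <- { "return": [ { "device": "drive0",
                       "stats": { "zone_append_operations": 0,
                                  "zone_append_bytes": 0,
                                  "zone_append_merged": 0,
                                  "zone_append_total_time_ns": 0,
                                  "failed_zone_append_operations": 0,
                                  "invalid_zone_append_operations": 0 } } ] }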
{
Coroutine *self = qemu_coroutine_self();
CoWaitRecord w;
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn qemu_co_mutex_lock_slowpath(CoMutex *mutex)
if (co == self) {
/* We got the lock ourselves! */
assert(to_wake == &w);
+ mutex->ctx = ctx;
return;
}

- aio_co_wake(co);
+ qemu_co_mutex_wake(mutex, co);
}

qemu_coroutine_yield();
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn qemu_co_mutex_lock_slowpath(CoMutex *mutex)

void coroutine_fn qemu_co_mutex_lock(CoMutex *mutex)
{
+ AioContext *ctx = qemu_get_current_aio_context();
Coroutine *self = qemu_coroutine_self();
+ int waiters, i;

- if (atomic_fetch_inc(&mutex->locked) == 0) {
+ /* Running a very small critical section on pthread_mutex_t and CoMutex
+ * shows that pthread_mutex_t is much faster because it doesn't actually
+ * go to sleep. What happens is that the critical section is shorter
+ * than the latency of entering the kernel and thus FUTEX_WAIT always
+ * fails. With CoMutex there is no such latency but you still want to
+ * avoid wait and wakeup. So introduce it artificially.
+ */
+ i = 0;
+retry_fast_path:
+ waiters = atomic_cmpxchg(&mutex->locked, 0, 1);
+ if (waiters != 0) {
+ while (waiters == 1 && ++i < 1000) {
+ if (atomic_read(&mutex->ctx) == ctx) {
+ break;
+ }
+ if (atomic_read(&mutex->locked) == 0) {
+ goto retry_fast_path;
+ }
+ cpu_relax();
+ }
+ waiters = atomic_fetch_inc(&mutex->locked);
+ }
+
+ if (waiters == 0) {
/* Uncontended. */
trace_qemu_co_mutex_lock_uncontended(mutex, self);
+ mutex->ctx = ctx;
} else {
- qemu_co_mutex_lock_slowpath(mutex);
+ qemu_co_mutex_lock_slowpath(ctx, mutex);
}
mutex->holder = self;
self->locks_held++;
@@ -XXX,XX +XXX,XX @@ void coroutine_fn qemu_co_mutex_unlock(CoMutex *mutex)
assert(mutex->holder == self);
assert(qemu_in_coroutine());

+ mutex->ctx = NULL;
mutex->holder = NULL;
self->locks_held--;
if (atomic_fetch_dec(&mutex->locked) == 1) {
@@ -XXX,XX +XXX,XX @@ void coroutine_fn qemu_co_mutex_unlock(CoMutex *mutex)
unsigned our_handoff;

if (to_wake) {
- Coroutine *co = to_wake->co;
- aio_co_wake(co);
+ qemu_co_mutex_wake(mutex, to_wake->co);
break;
}

diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
index XXXXXXX..XXXXXXX 100644
--- a/util/qemu-coroutine.c
+++ b/util/qemu-coroutine.c
@@ -XXX,XX +XXX,XX @@ void qemu_coroutine_enter(Coroutine *co)
co->ctx = qemu_get_current_aio_context();

/* Store co->ctx before anything that stores co. Matches
- * barrier in aio_co_wake.
+ * barrier in aio_co_wake and qemu_co_mutex_wake.
*/
smp_wmb();

--
2.9.3
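A minimal usage sketch (not taken from the patch itself) of what the thread-safe CoMutex buys: coroutines running in different AioContexts can now serialize on one lock, and the loser of a race yields as a coroutine instead of blocking its event loop thread. The names shared_lock, counter and worker_co are invented for illustration::

    /* Sketch only: assumes qemu_co_mutex_init(&shared_lock) ran at setup. */
    static CoMutex shared_lock;
    static int counter;

    static void coroutine_fn worker_co(void *opaque)
    {
        qemu_co_mutex_lock(&shared_lock);    /* fast path: one atomic cmpxchg */
        counter++;                           /* critical section */
        qemu_co_mutex_unlock(&shared_lock);  /* hands off to a waiter and wakes
                                              * it in its own AioContext via
                                              * qemu_co_mutex_wake() */
    }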
From: Paolo Bonzini <pbonzini@redhat.com>

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
Message-id: 20170213135235.12274-16-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
block/archipelago.c | 3 ---
block/block-backend.c | 7 -------
block/curl.c | 2 +-
block/io.c | 6 +-----
block/iscsi.c | 3 ---
block/linux-aio.c | 5 +----
block/mirror.c | 12 +++++++++---
block/null.c | 8 --------
block/qed-cluster.c | 2 ++
block/qed-table.c | 12 ++++++++++--
block/qed.c | 4 ++--
block/rbd.c | 4 ----
block/win32-aio.c | 3 ---
hw/block/virtio-blk.c | 12 +++++++++++-
hw/scsi/scsi-disk.c | 15 +++++++++++++++
hw/scsi/scsi-generic.c | 20 +++++++++++++++++---
util/thread-pool.c | 4 +++-
17 files changed, 72 insertions(+), 50 deletions(-)

diff --git a/block/archipelago.c b/block/archipelago.c
index XXXXXXX..XXXXXXX 100644
--- a/block/archipelago.c
+++ b/block/archipelago.c
@@ -XXX,XX +XXX,XX @@ static void qemu_archipelago_complete_aio(void *opaque)
{
AIORequestData *reqdata = (AIORequestData *) opaque;
ArchipelagoAIOCB *aio_cb = (ArchipelagoAIOCB *) reqdata->aio_cb;
- AioContext *ctx = bdrv_get_aio_context(aio_cb->common.bs);

- aio_context_acquire(ctx);
aio_cb->common.cb(aio_cb->common.opaque, aio_cb->ret);
- aio_context_release(ctx);
aio_cb->status = 0;

qemu_aio_unref(aio_cb);
diff --git a/block/block-backend.c b/block/block-backend.c
index XXXXXXX..XXXXXXX 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -XXX,XX +XXX,XX @@ int blk_make_zero(BlockBackend *blk, BdrvRequestFlags flags)
static void error_callback_bh(void *opaque)
{
struct BlockBackendAIOCB *acb = opaque;
- AioContext *ctx = bdrv_get_aio_context(acb->common.bs);

bdrv_dec_in_flight(acb->common.bs);
- aio_context_acquire(ctx);
acb->common.cb(acb->common.opaque, acb->ret);
- aio_context_release(ctx);
qemu_aio_unref(acb);
}

@@ -XXX,XX +XXX,XX @@ static void blk_aio_complete(BlkAioEmAIOCB *acb)
static void blk_aio_complete_bh(void *opaque)
{
BlkAioEmAIOCB *acb = opaque;
- AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
-
assert(acb->has_returned);
- aio_context_acquire(ctx);
blk_aio_complete(acb);
- aio_context_release(ctx);
}

static BlockAIOCB *blk_aio_prwv(BlockBackend *blk, int64_t offset, int bytes,
diff --git a/block/curl.c b/block/curl.c
index XXXXXXX..XXXXXXX 100644
--- a/block/curl.c
+++ b/block/curl.c
@@ -XXX,XX +XXX,XX @@ static void curl_readv_bh_cb(void *p)
curl_multi_socket_action(s->multi, CURL_SOCKET_TIMEOUT, 0, &running);

out:
+ aio_context_release(ctx);
if (ret != -EINPROGRESS) {
acb->common.cb(acb->common.opaque, ret);
qemu_aio_unref(acb);
}
- aio_context_release(ctx);
}

static BlockAIOCB *curl_aio_readv(BlockDriverState *bs,
diff --git a/block/io.c b/block/io.c
index XXXXXXX..XXXXXXX 100644
--- a/block/io.c
+++ b/block/io.c
@@ -XXX,XX +XXX,XX @@ static void bdrv_co_io_em_complete(void *opaque, int ret)
CoroutineIOCompletion *co = opaque;

co->ret = ret;
- qemu_coroutine_enter(co->coroutine);
+ aio_co_wake(co->coroutine);
}

static int coroutine_fn bdrv_driver_preadv(BlockDriverState *bs,
@@ -XXX,XX +XXX,XX @@ static void bdrv_co_complete(BlockAIOCBCoroutine *acb)
static void bdrv_co_em_bh(void *opaque)
{
BlockAIOCBCoroutine *acb = opaque;
- BlockDriverState *bs = acb->common.bs;
- AioContext *ctx = bdrv_get_aio_context(bs);

assert(!acb->need_bh);
- aio_context_acquire(ctx);
bdrv_co_complete(acb);
- aio_context_release(ctx);
}

static void bdrv_co_maybe_schedule_bh(BlockAIOCBCoroutine *acb)
diff --git a/block/iscsi.c b/block/iscsi.c
index XXXXXXX..XXXXXXX 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -XXX,XX +XXX,XX @@ static void
iscsi_bh_cb(void *p)
{
IscsiAIOCB *acb = p;
- AioContext *ctx = bdrv_get_aio_context(acb->common.bs);

qemu_bh_delete(acb->bh);

g_free(acb->buf);
acb->buf = NULL;

- aio_context_acquire(ctx);
acb->common.cb(acb->common.opaque, acb->status);
- aio_context_release(ctx);

if (acb->task != NULL) {
scsi_free_scsi_task(acb->task);
diff --git a/block/linux-aio.c b/block/linux-aio.c
index XXXXXXX..XXXXXXX 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -XXX,XX +XXX,XX @@ static inline ssize_t io_event_ret(struct io_event *ev)
*/
static void qemu_laio_process_completion(struct qemu_laiocb *laiocb)
{
- LinuxAioState *s = laiocb->ctx;
int ret;

ret = laiocb->ret;
@@ -XXX,XX +XXX,XX @@ static void qemu_laio_process_completion(struct qemu_laiocb *laiocb)
}

laiocb->ret = ret;
- aio_context_acquire(s->aio_context);
if (laiocb->co) {
/* If the coroutine is already entered it must be in ioq_submit() and
* will notice laio->ret has been filled in when it eventually runs
@@ -XXX,XX +XXX,XX @@ static void qemu_laio_process_completion(struct qemu_laiocb *laiocb)
* that!
*/
if (!qemu_coroutine_entered(laiocb->co)) {
- qemu_coroutine_enter(laiocb->co);
+ aio_co_wake(laiocb->co);
}
} else {
laiocb->common.cb(laiocb->common.opaque, ret);
qemu_aio_unref(laiocb);
}
- aio_context_release(s->aio_context);
}

/**
diff --git a/block/mirror.c b/block/mirror.c
index XXXXXXX..XXXXXXX 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -XXX,XX +XXX,XX @@ static void mirror_write_complete(void *opaque, int ret)
{
MirrorOp *op = opaque;
MirrorBlockJob *s = op->s;
+
+ aio_context_acquire(blk_get_aio_context(s->common.blk));
if (ret < 0) {
BlockErrorAction action;

@@ -XXX,XX +XXX,XX @@ static void mirror_write_complete(void *opaque, int ret)
}
}
mirror_iteration_done(op, ret);
+ aio_context_release(blk_get_aio_context(s->common.blk));
}

static void mirror_read_complete(void *opaque, int ret)
{
MirrorOp *op = opaque;
MirrorBlockJob *s = op->s;
+
+ aio_context_acquire(blk_get_aio_context(s->common.blk));
if (ret < 0) {
BlockErrorAction action;

@@ -XXX,XX +XXX,XX @@ static void mirror_read_complete(void *opaque, int ret)
}

mirror_iteration_done(op, ret);
- return;
+ } else {
+ blk_aio_pwritev(s->target, op->sector_num * BDRV_SECTOR_SIZE, &op->qiov,
+ 0, mirror_write_complete, op);
}
- blk_aio_pwritev(s->target, op->sector_num * BDRV_SECTOR_SIZE, &op->qiov,
- 0, mirror_write_complete, op);
+ aio_context_release(blk_get_aio_context(s->common.blk));
}

static inline void mirror_clip_sectors(MirrorBlockJob *s,
diff --git a/block/null.c b/block/null.c
index XXXXXXX..XXXXXXX 100644
--- a/block/null.c
+++ b/block/null.c
@@ -XXX,XX +XXX,XX @@ static const AIOCBInfo null_aiocb_info = {
static void null_bh_cb(void *opaque)
{
NullAIOCB *acb = opaque;
- AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
-
- aio_context_acquire(ctx);
acb->common.cb(acb->common.opaque, 0);
- aio_context_release(ctx);
qemu_aio_unref(acb);
}

static void null_timer_cb(void *opaque)
{
NullAIOCB *acb = opaque;
- AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
-
- aio_context_acquire(ctx);
acb->common.cb(acb->common.opaque, 0);
- aio_context_release(ctx);
timer_deinit(&acb->timer);
qemu_aio_unref(acb);
}
diff --git a/block/qed-cluster.c b/block/qed-cluster.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed-cluster.c
+++ b/block/qed-cluster.c
@@ -XXX,XX +XXX,XX @@ static void qed_find_cluster_cb(void *opaque, int ret)
unsigned int index;
unsigned int n;

+ qed_acquire(s);
if (ret) {
goto out;
}
@@ -XXX,XX +XXX,XX @@ static void qed_find_cluster_cb(void *opaque, int ret)

out:
find_cluster_cb->cb(find_cluster_cb->opaque, ret, offset, len);
+ qed_release(s);
g_free(find_cluster_cb);
}

diff --git a/block/qed-table.c b/block/qed-table.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed-table.c
+++ b/block/qed-table.c
@@ -XXX,XX +XXX,XX @@ static void qed_read_table_cb(void *opaque, int ret)
{
QEDReadTableCB *read_table_cb = opaque;
QEDTable *table = read_table_cb->table;
+ BDRVQEDState *s = read_table_cb->s;
int noffsets = read_table_cb->qiov.size / sizeof(uint64_t);
int i;

@@ -XXX,XX +XXX,XX @@ static void qed_read_table_cb(void *opaque, int ret)
}

/* Byteswap offsets */
+ qed_acquire(s);
for (i = 0; i < noffsets; i++) {
table->offsets[i] = le64_to_cpu(table->offsets[i]);
}
+ qed_release(s);

out:
/* Completion */
- trace_qed_read_table_cb(read_table_cb->s, read_table_cb->table, ret);
+ trace_qed_read_table_cb(s, read_table_cb->table, ret);
gencb_complete(&read_table_cb->gencb, ret);
}

@@ -XXX,XX +XXX,XX @@ typedef struct {
static void qed_write_table_cb(void *opaque, int ret)
{
QEDWriteTableCB *write_table_cb = opaque;
+ BDRVQEDState *s = write_table_cb->s;

- trace_qed_write_table_cb(write_table_cb->s,
+ trace_qed_write_table_cb(s,
write_table_cb->orig_table,
write_table_cb->flush,
ret);
@@ -XXX,XX +XXX,XX @@ static void qed_write_table_cb(void *opaque, int ret)
if (write_table_cb->flush) {
/* We still need to flush first */
write_table_cb->flush = false;
+ qed_acquire(s);
bdrv_aio_flush(write_table_cb->s->bs, qed_write_table_cb,
write_table_cb);
+ qed_release(s);
return;
}

@@ -XXX,XX +XXX,XX @@ static void qed_read_l2_table_cb(void *opaque, int ret)
CachedL2Table *l2_table = request->l2_table;
uint64_t l2_offset = read_l2_table_cb->l2_offset;

+ qed_acquire(s);
if (ret) {
/* can't trust loaded L2 table anymore */
qed_unref_l2_cache_entry(l2_table);
@@ -XXX,XX +XXX,XX @@ static void qed_read_l2_table_cb(void *opaque, int ret)
request->l2_table = qed_find_l2_cache_entry(&s->l2_cache, l2_offset);
assert(request->l2_table != NULL);
}
+ qed_release(s);

gencb_complete(&read_l2_table_cb->gencb, ret);
}
diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@ static void qed_is_allocated_cb(void *opaque, int ret, uint64_t offset, size_t l
}

if (cb->co) {
- qemu_coroutine_enter(cb->co);
+ aio_co_wake(cb->co);
}
}

@@ -XXX,XX +XXX,XX @@ static void coroutine_fn qed_co_pwrite_zeroes_cb(void *opaque, int ret)
cb->done = true;
cb->ret = ret;
if (cb->co) {
- qemu_coroutine_enter(cb->co);
+ aio_co_wake(cb->co);
}
}

diff --git a/block/rbd.c b/block/rbd.c
index XXXXXXX..XXXXXXX 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -XXX,XX +XXX,XX @@ shutdown:
static void qemu_rbd_complete_aio(RADOSCB *rcb)
{
RBDAIOCB *acb = rcb->acb;
- AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
int64_t r;

r = rcb->ret;
@@ -XXX,XX +XXX,XX @@ static void qemu_rbd_complete_aio(RADOSCB *rcb)
qemu_iovec_from_buf(acb->qiov, 0, acb->bounce, acb->qiov->size);
}
qemu_vfree(acb->bounce);
-
- aio_context_acquire(ctx);
acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
- aio_context_release(ctx);

qemu_aio_unref(acb);
}
diff --git a/block/win32-aio.c b/block/win32-aio.c
index XXXXXXX..XXXXXXX 100644
--- a/block/win32-aio.c
+++ b/block/win32-aio.c
@@ -XXX,XX +XXX,XX @@ static void win32_aio_process_completion(QEMUWin32AIOState *s,
qemu_vfree(waiocb->buf);
}

-
- aio_context_acquire(s->aio_ctx);
waiocb->common.cb(waiocb->common.opaque, ret);
- aio_context_release(s->aio_ctx);
qemu_aio_unref(waiocb);
}

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -XXX,XX +XXX,XX @@ static int virtio_blk_handle_rw_error(VirtIOBlockReq *req, int error,
static void virtio_blk_rw_complete(void *opaque, int ret)
{
VirtIOBlockReq *next = opaque;
+ VirtIOBlock *s = next->dev;

+ aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
while (next) {
VirtIOBlockReq *req = next;
next = req->mr_next;
@@ -XXX,XX +XXX,XX @@ static void virtio_blk_rw_complete(void *opaque, int ret)
block_acct_done(blk_get_stats(req->dev->blk), &req->acct);
virtio_blk_free_request(req);
}
+ aio_context_release(blk_get_aio_context(s->conf.conf.blk));
}

static void virtio_blk_flush_complete(void *opaque, int ret)
{
VirtIOBlockReq *req = opaque;
+ VirtIOBlock *s = req->dev;

+ aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
if (ret) {
if (virtio_blk_handle_rw_error(req, -ret, 0)) {
- return;
+ goto out;
}
}

virtio_blk_req_complete(req, VIRTIO_BLK_S_OK);
block_acct_done(blk_get_stats(req->dev->blk), &req->acct);
virtio_blk_free_request(req);
+
+out:
+ aio_context_release(blk_get_aio_context(s->conf.conf.blk));
}

#ifdef __linux__
@@ -XXX,XX +XXX,XX @@ static void virtio_blk_ioctl_complete(void *opaque, int status)
virtio_stl_p(vdev, &scsi->data_len, hdr->dxfer_len);

out:
+ aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
virtio_blk_req_complete(req, status);
virtio_blk_free_request(req);
+ aio_context_release(blk_get_aio_context(s->conf.conf.blk));
g_free(ioctl_req);
}

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -XXX,XX +XXX,XX @@ static void scsi_aio_complete(void *opaque, int ret)

assert(r->req.aiocb != NULL);
r->req.aiocb = NULL;
+ aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
if (scsi_disk_req_check_error(r, ret, true)) {
goto done;
}
@@ -XXX,XX +XXX,XX @@ static void scsi_aio_complete(void *opaque, int ret)
scsi_req_complete(&r->req, GOOD);

done:
+ aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
scsi_req_unref(&r->req);
}

@@ -XXX,XX +XXX,XX @@ static void scsi_dma_complete(void *opaque, int ret)
assert(r->req.aiocb != NULL);
r->req.aiocb = NULL;

+ aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
if (ret < 0) {
block_acct_failed(blk_get_stats(s->qdev.conf.blk), &r->acct);
} else {
block_acct_done(blk_get_stats(s->qdev.conf.blk), &r->acct);
}
scsi_dma_complete_noio(r, ret);
+ aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
}

static void scsi_read_complete(void * opaque, int ret)
@@ -XXX,XX +XXX,XX @@ static void scsi_read_complete(void * opaque, int ret)

assert(r->req.aiocb != NULL);
r->req.aiocb = NULL;
+ aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
if (scsi_disk_req_check_error(r, ret, true)) {
goto done;
}
@@ -XXX,XX +XXX,XX @@ static void scsi_read_complete(void * opaque, int ret)

done:
scsi_req_unref(&r->req);
+ aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
}

/* Actually issue a read to the block device. */
@@ -XXX,XX +XXX,XX @@ static void scsi_do_read_cb(void *opaque, int ret)
assert (r->req.aiocb != NULL);
r->req.aiocb = NULL;

+ aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
if (ret < 0) {
block_acct_failed(blk_get_stats(s->qdev.conf.blk), &r->acct);
} else {
block_acct_done(blk_get_stats(s->qdev.conf.blk), &r->acct);
}
scsi_do_read(opaque, ret);
+ aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
}

/* Read more data from scsi device into buffer. */
@@ -XXX,XX +XXX,XX @@ static void scsi_write_complete(void * opaque, int ret)
assert (r->req.aiocb != NULL);
r->req.aiocb = NULL;

+ aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
if (ret < 0) {
block_acct_failed(blk_get_stats(s->qdev.conf.blk), &r->acct);
} else {
block_acct_done(blk_get_stats(s->qdev.conf.blk), &r->acct);
}
scsi_write_complete_noio(r, ret);
+ aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
}

static void scsi_write_data(SCSIRequest *req)
@@ -XXX,XX +XXX,XX @@ static void scsi_unmap_complete(void *opaque, int ret)
{
UnmapCBData *data = opaque;
SCSIDiskReq *r = data->r;
+ SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);

assert(r->req.aiocb != NULL);
r->req.aiocb = NULL;

+ aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
scsi_unmap_complete_noio(data, ret);
+ aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
}

static void scsi_disk_emulate_unmap(SCSIDiskReq *r, uint8_t *inbuf)
@@ -XXX,XX +XXX,XX @@ static void scsi_write_same_complete(void *opaque, int ret)

assert(r->req.aiocb != NULL);
r->req.aiocb = NULL;
+ aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
if (scsi_disk_req_check_error(r, ret, true)) {
goto done;
}
@@ -XXX,XX +XXX,XX @@ done:
scsi_req_unref(&r->req);
qemu_vfree(data->iov.iov_base);
g_free(data);
+ aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
}

static void scsi_disk_emulate_write_same(SCSIDiskReq *r, uint8_t *inbuf)
diff --git a/hw/scsi/scsi-generic.c b/hw/scsi/scsi-generic.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/scsi/scsi-generic.c
+++ b/hw/scsi/scsi-generic.c
@@ -XXX,XX +XXX,XX @@ done:
static void scsi_command_complete(void *opaque, int ret)
{
SCSIGenericReq *r = (SCSIGenericReq *)opaque;
+ SCSIDevice *s = r->req.dev;

assert(r->req.aiocb != NULL);
r->req.aiocb = NULL;
+
+ aio_context_acquire(blk_get_aio_context(s->conf.blk));
scsi_command_complete_noio(r, ret);
+ aio_context_release(blk_get_aio_context(s->conf.blk));
}

static int execute_command(BlockBackend *blk,
@@ -XXX,XX +XXX,XX @@ static void scsi_read_complete(void * opaque, int ret)
assert(r->req.aiocb != NULL);
r->req.aiocb = NULL;

+ aio_context_acquire(blk_get_aio_context(s->conf.blk));
+
if (ret || r->req.io_canceled) {
scsi_command_complete_noio(r, ret);
- return;
+ goto done;
}

len = r->io_header.dxfer_len - r->io_header.resid;
@@ -XXX,XX +XXX,XX @@ static void scsi_read_complete(void * opaque, int ret)
r->len = -1;
if (len == 0) {
scsi_command_complete_noio(r, 0);
- return;
+ goto done;
}

/* Snoop READ CAPACITY output to set the blocksize. */
@@ -XXX,XX +XXX,XX @@ static void scsi_read_complete(void * opaque, int ret)
}
scsi_req_data(&r->req, len);
scsi_req_unref(&r->req);
+
+done:
+ aio_context_release(blk_get_aio_context(s->conf.blk));
}

/* Read more data from scsi device into buffer. */
@@ -XXX,XX +XXX,XX @@ static void scsi_write_complete(void * opaque, int ret)
assert(r->req.aiocb != NULL);
r->req.aiocb = NULL;

+ aio_context_acquire(blk_get_aio_context(s->conf.blk));
+
if (ret || r->req.io_canceled) {
scsi_command_complete_noio(r, ret);
- return;
+ goto done;
}

if (r->req.cmd.buf[0] == MODE_SELECT && r->req.cmd.buf[4] == 12 &&
@@ -XXX,XX +XXX,XX @@ static void scsi_write_complete(void * opaque, int ret)
}

scsi_command_complete_noio(r, ret);
+
+done:
+ aio_context_release(blk_get_aio_context(s->conf.blk));
}

/* Write data to a scsi device. Returns nonzero on failure.
diff --git a/util/thread-pool.c b/util/thread-pool.c
index XXXXXXX..XXXXXXX 100644
--- a/util/thread-pool.c
+++ b/util/thread-pool.c
@@ -XXX,XX +XXX,XX @@ restart:
*/
qemu_bh_schedule(pool->completion_bh);

+ aio_context_release(pool->ctx);
elem->common.cb(elem->common.opaque, elem->ret);
+ aio_context_acquire(pool->ctx);
qemu_aio_unref(elem);
goto restart;
} else {
@@ -XXX,XX +XXX,XX @@ static void thread_pool_co_cb(void *opaque, int ret)
ThreadPoolCo *co = opaque;

co->ret = ret;
- qemu_coroutine_enter(co->co);
+ aio_co_wake(co->co);
}

int coroutine_fn thread_pool_submit_co(ThreadPool *pool, ThreadPoolFunc *func,
--
2.9.3
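A recurring conversion in the patch above is qemu_coroutine_enter() becoming aio_co_wake() in completion callbacks. A hedged sketch of the resulting pattern, with the MyRequest/my_complete names invented for illustration::

    /* The completion callback may fire in any thread's AioContext, so
     * instead of entering the waiting coroutine directly it calls
     * aio_co_wake(), which resumes the coroutine in the AioContext it
     * was last running in, taking that context's lock as needed.
     */
    typedef struct MyRequest {
        Coroutine *co;   /* coroutine waiting for this request */
        int ret;         /* filled in before the wakeup */
    } MyRequest;

    static void my_complete(void *opaque, int ret)
    {
        MyRequest *req = opaque;

        req->ret = ret;
        aio_co_wake(req->co);   /* safe from any thread, unlike
                                 * qemu_coroutine_enter() */
    }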
From: Sam Li <faithilikerun@gmail.com>

Signed-off-by: Sam Li <faithilikerun@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20230508051916.178322-4-faithilikerun@gmail.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
hw/block/virtio-blk.c | 12 ++++++++++++
hw/block/trace-events | 7 +++++++
2 files changed, 19 insertions(+)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -XXX,XX +XXX,XX @@ static void virtio_blk_zone_report_complete(void *opaque, int ret)
int64_t nz = data->zone_report_data.nr_zones;
int8_t err_status = VIRTIO_BLK_S_OK;

+ trace_virtio_blk_zone_report_complete(vdev, req, nz, ret);
if (ret) {
err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
goto out;
@@ -XXX,XX +XXX,XX @@ static void virtio_blk_handle_zone_report(VirtIOBlockReq *req,
nr_zones = (req->in_len - sizeof(struct virtio_blk_inhdr) -
sizeof(struct virtio_blk_zone_report)) /
sizeof(struct virtio_blk_zone_descriptor);
+ trace_virtio_blk_handle_zone_report(vdev, req,
+ offset >> BDRV_SECTOR_BITS, nr_zones);

zone_size = sizeof(BlockZoneDescriptor) * nr_zones;
data = g_malloc(sizeof(ZoneCmdData));
@@ -XXX,XX +XXX,XX @@ static void virtio_blk_zone_mgmt_complete(void *opaque, int ret)
{
VirtIOBlockReq *req = opaque;
VirtIOBlock *s = req->dev;
+ VirtIODevice *vdev = VIRTIO_DEVICE(s);
int8_t err_status = VIRTIO_BLK_S_OK;
+ trace_virtio_blk_zone_mgmt_complete(vdev, req,ret);

if (ret) {
err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
@@ -XXX,XX +XXX,XX @@ static int virtio_blk_handle_zone_mgmt(VirtIOBlockReq *req, BlockZoneOp op)
/* Entire drive capacity */
offset = 0;
len = capacity;
+ trace_virtio_blk_handle_zone_reset_all(vdev, req, 0,
+ bs->total_sectors);
} else {
if (bs->bl.zone_size > capacity - offset) {
/* The zoned device allows the last smaller zone. */
@@ -XXX,XX +XXX,XX @@ static int virtio_blk_handle_zone_mgmt(VirtIOBlockReq *req, BlockZoneOp op)
} else {
len = bs->bl.zone_size;
}
+ trace_virtio_blk_handle_zone_mgmt(vdev, req, op,
+ offset >> BDRV_SECTOR_BITS,
+ len >> BDRV_SECTOR_BITS);
}

if (!check_zoned_request(s, offset, len, false, &err_status)) {
@@ -XXX,XX +XXX,XX @@ static void virtio_blk_zone_append_complete(void *opaque, int ret)
err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
goto out;
}
+ trace_virtio_blk_zone_append_complete(vdev, req, append_sector, ret);

out:
aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
@@ -XXX,XX +XXX,XX @@ static int virtio_blk_handle_zone_append(VirtIOBlockReq *req,
int64_t offset = virtio_ldq_p(vdev, &req->out.sector) << BDRV_SECTOR_BITS;
int64_t len = iov_size(out_iov, out_num);

+ trace_virtio_blk_handle_zone_append(vdev, req, offset >> BDRV_SECTOR_BITS);
if (!check_zoned_request(s, offset, len, true, &err_status)) {
goto out;
}
diff --git a/hw/block/trace-events b/hw/block/trace-events
index XXXXXXX..XXXXXXX 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -XXX,XX +XXX,XX @@ pflash_write_unknown(const char *name, uint8_t cmd) "%s: unknown command 0x%02x"
# virtio-blk.c
virtio_blk_req_complete(void *vdev, void *req, int status) "vdev %p req %p status %d"
virtio_blk_rw_complete(void *vdev, void *req, int ret) "vdev %p req %p ret %d"
+virtio_blk_zone_report_complete(void *vdev, void *req, unsigned int nr_zones, int ret) "vdev %p req %p nr_zones %u ret %d"
+virtio_blk_zone_mgmt_complete(void *vdev, void *req, int ret) "vdev %p req %p ret %d"
+virtio_blk_zone_append_complete(void *vdev, void *req, int64_t sector, int ret) "vdev %p req %p, append sector 0x%" PRIx64 " ret %d"
virtio_blk_handle_write(void *vdev, void *req, uint64_t sector, size_t nsectors) "vdev %p req %p sector %"PRIu64" nsectors %zu"
virtio_blk_handle_read(void *vdev, void *req, uint64_t sector, size_t nsectors) "vdev %p req %p sector %"PRIu64" nsectors %zu"
virtio_blk_submit_multireq(void *vdev, void *mrb, int start, int num_reqs, uint64_t offset, size_t size, bool is_write) "vdev %p mrb %p start %d num_reqs %d offset %"PRIu64" size %zu is_write %d"
+virtio_blk_handle_zone_report(void *vdev, void *req, int64_t sector, unsigned int nr_zones) "vdev %p req %p sector 0x%" PRIx64 " nr_zones %u"
+virtio_blk_handle_zone_mgmt(void *vdev, void *req, uint8_t op, int64_t sector, int64_t len) "vdev %p req %p op 0x%x sector 0x%" PRIx64 " len 0x%" PRIx64 ""
+virtio_blk_handle_zone_reset_all(void *vdev, void *req, int64_t sector, int64_t len) "vdev %p req %p sector 0x%" PRIx64 " cap 0x%" PRIx64 ""
+virtio_blk_handle_zone_append(void *vdev, void *req, int64_t sector) "vdev %p req %p, append sector 0x%" PRIx64 ""

# hd-geometry.c
hd_geometry_lchs_guess(void *blk, int cyls, int heads, int secs) "blk %p LCHS %d %d %d"
--
2.40.1
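These trace points can be turned on from the command line with QEMU's -trace option, which accepts glob patterns. A rough sketch of such an invocation, assuming a build whose trace backend supports runtime enabling (for example the "log" backend) and reusing the zoned null_blk device setup from the documentation patch later in this series::

    $ qemu-system-x86_64 \
        -trace 'virtio_blk_handle_zone_*' \
        -trace 'virtio_blk_zone_*_complete' \
        -blockdev node-name=drive0,driver=host_device,filename=/dev/nullb0,cache.direct=on \
        -device virtio-blk-pci,drive=drive0 ...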
Deleted patch
From: Paolo Bonzini <pbonzini@redhat.com>

This patch prepares for the removal of unnecessary lockcnt inc/dec pairs.
Extract the dispatching loop for file descriptor handlers into a new
function aio_dispatch_handlers, and then inline aio_dispatch into
aio_poll.

aio_dispatch can now become void.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
Message-id: 20170213135235.12274-17-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
include/block/aio.h | 6 +-----
util/aio-posix.c | 44 ++++++++++++++------------------------------
util/aio-win32.c | 13 ++++---------
util/async.c | 2 +-
4 files changed, 20 insertions(+), 45 deletions(-)

diff --git a/include/block/aio.h b/include/block/aio.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -XXX,XX +XXX,XX @@ bool aio_pending(AioContext *ctx);
/* Dispatch any pending callbacks from the GSource attached to the AioContext.
*
* This is used internally in the implementation of the GSource.
- *
- * @dispatch_fds: true to process fds, false to skip them
- * (can be used as an optimization by callers that know there
- * are no fds ready)
*/
-bool aio_dispatch(AioContext *ctx, bool dispatch_fds);
+void aio_dispatch(AioContext *ctx);

/* Progress in completing AIO work to occur. This can issue new pending
* aio as a result of executing I/O completion or bh callbacks.
diff --git a/util/aio-posix.c b/util/aio-posix.c
index XXXXXXX..XXXXXXX 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -XXX,XX +XXX,XX @@ static bool aio_dispatch_handlers(AioContext *ctx)
AioHandler *node, *tmp;
bool progress = false;

- /*
- * We have to walk very carefully in case aio_set_fd_handler is
- * called while we're walking.
- */
- qemu_lockcnt_inc(&ctx->list_lock);
-
QLIST_FOREACH_SAFE_RCU(node, &ctx->aio_handlers, node, tmp) {
int revents;

@@ -XXX,XX +XXX,XX @@ static bool aio_dispatch_handlers(AioContext *ctx)
}
}

- qemu_lockcnt_dec(&ctx->list_lock);
return progress;
}

-/*
- * Note that dispatch_fds == false has the side-effect of post-poning the
- * freeing of deleted handlers.
- */
-bool aio_dispatch(AioContext *ctx, bool dispatch_fds)
+void aio_dispatch(AioContext *ctx)
{
- bool progress;
+ aio_bh_poll(ctx);

- /*
- * If there are callbacks left that have been queued, we need to call them.
- * Do not call select in this case, because it is possible that the caller
- * does not need a complete flush (as is the case for aio_poll loops).
- */
- progress = aio_bh_poll(ctx);
+ qemu_lockcnt_inc(&ctx->list_lock);
+ aio_dispatch_handlers(ctx);
+ qemu_lockcnt_dec(&ctx->list_lock);

- if (dispatch_fds) {
- progress |= aio_dispatch_handlers(ctx);
- }
-
- /* Run our timers */
- progress |= timerlistgroup_run_timers(&ctx->tlg);
-
- return progress;
+ timerlistgroup_run_timers(&ctx->tlg);
}

/* These thread-local variables are used only in a small part of aio_poll
@@ -XXX,XX +XXX,XX @@ bool aio_poll(AioContext *ctx, bool blocking)
npfd = 0;
qemu_lockcnt_dec(&ctx->list_lock);

- /* Run dispatch even if there were no readable fds to run timers */
- if (aio_dispatch(ctx, ret > 0)) {
- progress = true;
+ progress |= aio_bh_poll(ctx);
+
+ if (ret > 0) {
+ qemu_lockcnt_inc(&ctx->list_lock);
+ progress |= aio_dispatch_handlers(ctx);
+ qemu_lockcnt_dec(&ctx->list_lock);
}

+ progress |= timerlistgroup_run_timers(&ctx->tlg);
+
return progress;
}

diff --git a/util/aio-win32.c b/util/aio-win32.c
index XXXXXXX..XXXXXXX 100644
--- a/util/aio-win32.c
+++ b/util/aio-win32.c
@@ -XXX,XX +XXX,XX @@ static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
return progress;
}

-bool aio_dispatch(AioContext *ctx, bool dispatch_fds)
+void aio_dispatch(AioContext *ctx)
{
- bool progress;
-
- progress = aio_bh_poll(ctx);
- if (dispatch_fds) {
- progress |= aio_dispatch_handlers(ctx, INVALID_HANDLE_VALUE);
- }
- progress |= timerlistgroup_run_timers(&ctx->tlg);
- return progress;
+ aio_bh_poll(ctx);
+ aio_dispatch_handlers(ctx, INVALID_HANDLE_VALUE);
+ timerlistgroup_run_timers(&ctx->tlg);
}

bool aio_poll(AioContext *ctx, bool blocking)
diff --git a/util/async.c b/util/async.c
index XXXXXXX..XXXXXXX 100644
--- a/util/async.c
+++ b/util/async.c
@@ -XXX,XX +XXX,XX @@ aio_ctx_dispatch(GSource *source,
AioContext *ctx = (AioContext *) source;

assert(callback == NULL);
- aio_dispatch(ctx, true);
+ aio_dispatch(ctx);
return true;
}

--
2.9.3
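For context, a hedged sketch of how aio_poll() is typically consumed after this change: callers spin until a completion callback flips a flag, while aio_poll() internally runs bottom halves, ready fd handlers (via aio_dispatch_handlers()) and expired timers. This mirrors the polling loops used in the block layer; the done flag and function names are invented for the example::

    static bool done;

    static void my_cb(void *opaque, int ret)
    {
        done = true;    /* runs from the dispatch phase of aio_poll() */
    }

    static void wait_until_done(AioContext *ctx)
    {
        while (!done) {
            aio_poll(ctx, true);   /* blocking poll, then dispatch */
        }
    }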
Deleted patch
From: Paolo Bonzini <pbonzini@redhat.com>

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
Message-id: 20170213135235.12274-19-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
include/block/block_int.h | 64 +++++++++++++++++++++++++-----------------
include/sysemu/block-backend.h | 14 ++++++---
2 files changed, 49 insertions(+), 29 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -XXX,XX +XXX,XX @@ struct BdrvChild {
* copied as well.
*/
struct BlockDriverState {
- int64_t total_sectors; /* if we are reading a disk image, give its
- size in sectors */
+ /* Protected by big QEMU lock or read-only after opening. No special
+ * locking needed during I/O...
+ */
int open_flags; /* flags used to open the file, re-used for re-open */
bool read_only; /* if true, the media is read only */
bool encrypted; /* if true, the media is encrypted */
@@ -XXX,XX +XXX,XX @@ struct BlockDriverState {
bool sg; /* if true, the device is a /dev/sg* */
bool probed; /* if true, format was probed rather than specified */

- int copy_on_read; /* if nonzero, copy read backing sectors into image.
- note this is a reference count */
-
- CoQueue flush_queue; /* Serializing flush queue */
- bool active_flush_req; /* Flush request in flight? */
- unsigned int write_gen; /* Current data generation */
- unsigned int flushed_gen; /* Flushed write generation */
-
BlockDriver *drv; /* NULL means no media */
void *opaque;

@@ -XXX,XX +XXX,XX @@ struct BlockDriverState {
BdrvChild *backing;
BdrvChild *file;

- /* Callback before write request is processed */
- NotifierWithReturnList before_write_notifiers;
-
- /* number of in-flight requests; overall and serialising */
- unsigned int in_flight;
- unsigned int serialising_in_flight;
-
- bool wakeup;
-
- /* Offset after the highest byte written to */
- uint64_t wr_highest_offset;
-
/* I/O Limits */
BlockLimits bl;

@@ -XXX,XX +XXX,XX @@ struct BlockDriverState {
QTAILQ_ENTRY(BlockDriverState) bs_list;
/* element of the list of monitor-owned BDS */
QTAILQ_ENTRY(BlockDriverState) monitor_list;
- QLIST_HEAD(, BdrvDirtyBitmap) dirty_bitmaps;
int refcnt;

- QLIST_HEAD(, BdrvTrackedRequest) tracked_requests;
-
/* operation blockers */
QLIST_HEAD(, BdrvOpBlocker) op_blockers[BLOCK_OP_TYPE_MAX];

@@ -XXX,XX +XXX,XX @@ struct BlockDriverState {
/* The error object in use for blocking operations on backing_hd */
Error *backing_blocker;

+ /* Protected by AioContext lock */
+
+ /* If true, copy read backing sectors into image. Can be >1 if more
+ * than one client has requested copy-on-read.
+ */
+ int copy_on_read;
+
+ /* If we are reading a disk image, give its size in sectors.
+ * Generally read-only; it is written to by load_vmstate and save_vmstate,
+ * but the block layer is quiescent during those.
+ */
+ int64_t total_sectors;
+
+ /* Callback before write request is processed */
+ NotifierWithReturnList before_write_notifiers;
+
+ /* number of in-flight requests; overall and serialising */
+ unsigned int in_flight;
+ unsigned int serialising_in_flight;
+
+ bool wakeup;
+
+ /* Offset after the highest byte written to */
+ uint64_t wr_highest_offset;
+
/* threshold limit for writes, in bytes. "High water mark". */
uint64_t write_threshold_offset;
NotifierWithReturn write_threshold_notifier;
@@ -XXX,XX +XXX,XX @@ struct BlockDriverState {
/* counter for nested bdrv_io_plug */
unsigned io_plugged;

+ QLIST_HEAD(, BdrvTrackedRequest) tracked_requests;
+ CoQueue flush_queue; /* Serializing flush queue */
+ bool active_flush_req; /* Flush request in flight? */
+ unsigned int write_gen; /* Current data generation */
+ unsigned int flushed_gen; /* Flushed write generation */
+
+ QLIST_HEAD(, BdrvDirtyBitmap) dirty_bitmaps;
+
+ /* do we need to tell the quest if we have a volatile write cache? */
+ int enable_write_cache;
+
int quiesce_counter;
};

diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
index XXXXXXX..XXXXXXX 100644
--- a/include/sysemu/block-backend.h
+++ b/include/sysemu/block-backend.h
@@ -XXX,XX +XXX,XX @@ typedef struct BlockDevOps {
* fields that must be public. This is in particular for QLIST_ENTRY() and
* friends so that BlockBackends can be kept in lists outside block-backend.c */
typedef struct BlockBackendPublic {
- /* I/O throttling.
- * throttle_state tells us if this BlockBackend has I/O limits configured.
- * io_limits_disabled tells us if they are currently being enforced */
+ /* I/O throttling has its own locking, but also some fields are
+ * protected by the AioContext lock.
+ */
+
+ /* Protected by AioContext lock. */
CoQueue throttled_reqs[2];
+
+ /* Nonzero if the I/O limits are currently being ignored; generally
+ * it is zero. */
unsigned int io_limits_disabled;

/* The following fields are protected by the ThrottleGroup lock.
- * See the ThrottleGroup documentation for details. */
+ * See the ThrottleGroup documentation for details.
+ * throttle_state tells us if I/O limits are configured. */
ThrottleState *throttle_state;
ThrottleTimers throttle_timers;
unsigned pending_reqs[2];
--
2.9.3
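A hedged sketch of the rule the new comments encode: fields listed in the "Protected by AioContext lock" section may only be touched while holding the BDS's AioContext lock, or from code already running in that context. The function name below is invented for illustration::

    static void example_bump_copy_on_read(BlockDriverState *bs)
    {
        AioContext *ctx = bdrv_get_aio_context(bs);

        aio_context_acquire(ctx);
        bs->copy_on_read++;        /* an AioContext-protected field */
        aio_context_release(ctx);
    }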
From: Paolo Bonzini <pbonzini@redhat.com>

This will avoid forward references in the next patch. It is also
more logical because CoQueue is no longer the basic primitive.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20170213181244.16297-5-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
include/qemu/coroutine.h | 89 ++++++++++++++++++++++++------------------------
1 file changed, 44 insertions(+), 45 deletions(-)

diff --git a/include/qemu/coroutine.h b/include/qemu/coroutine.h
index XXXXXXX..XXXXXXX 100644
--- a/include/qemu/coroutine.h
+++ b/include/qemu/coroutine.h
@@ -XXX,XX +XXX,XX @@ bool qemu_in_coroutine(void);
*/
bool qemu_coroutine_entered(Coroutine *co);

-
-/**
- * CoQueues are a mechanism to queue coroutines in order to continue executing
- * them later. They provide the fundamental primitives on which coroutine locks
- * are built.
- */
-typedef struct CoQueue {
- QSIMPLEQ_HEAD(, Coroutine) entries;
-} CoQueue;
-
-/**
- * Initialise a CoQueue. This must be called before any other operation is used
- * on the CoQueue.
- */
-void qemu_co_queue_init(CoQueue *queue);
-
-/**
- * Adds the current coroutine to the CoQueue and transfers control to the
- * caller of the coroutine.
- */
-void coroutine_fn qemu_co_queue_wait(CoQueue *queue);
-
-/**
- * Restarts the next coroutine in the CoQueue and removes it from the queue.
- *
- * Returns true if a coroutine was restarted, false if the queue is empty.
- */
-bool coroutine_fn qemu_co_queue_next(CoQueue *queue);
-
-/**
- * Restarts all coroutines in the CoQueue and leaves the queue empty.
- */
-void coroutine_fn qemu_co_queue_restart_all(CoQueue *queue);
-
-/**
- * Enter the next coroutine in the queue
- */
-bool qemu_co_enter_next(CoQueue *queue);
-
-/**
- * Checks if the CoQueue is empty.
- */
-bool qemu_co_queue_empty(CoQueue *queue);
-
-
/**
* Provides a mutex that can be used to synchronise coroutines
*/
@@ -XXX,XX +XXX,XX @@ void coroutine_fn qemu_co_mutex_lock(CoMutex *mutex);
*/
void coroutine_fn qemu_co_mutex_unlock(CoMutex *mutex);

+
+/**
+ * CoQueues are a mechanism to queue coroutines in order to continue executing
+ * them later.
+ */
+typedef struct CoQueue {
+ QSIMPLEQ_HEAD(, Coroutine) entries;
+} CoQueue;
+
+/**
+ * Initialise a CoQueue. This must be called before any other operation is used
+ * on the CoQueue.
+ */
+void qemu_co_queue_init(CoQueue *queue);
+
+/**
+ * Adds the current coroutine to the CoQueue and transfers control to the
+ * caller of the coroutine.
+ */
+void coroutine_fn qemu_co_queue_wait(CoQueue *queue);
+
+/**
+ * Restarts the next coroutine in the CoQueue and removes it from the queue.
+ *
+ * Returns true if a coroutine was restarted, false if the queue is empty.
+ */
+bool coroutine_fn qemu_co_queue_next(CoQueue *queue);
+
+/**
+ * Restarts all coroutines in the CoQueue and leaves the queue empty.
+ */
+void coroutine_fn qemu_co_queue_restart_all(CoQueue *queue);
+
+/**
+ * Enter the next coroutine in the queue
+ */
+bool qemu_co_enter_next(CoQueue *queue);
+
+/**
+ * Checks if the CoQueue is empty.
+ */
+bool qemu_co_queue_empty(CoQueue *queue);
+
+
typedef struct CoRwlock {
bool writer;
int reader;
--
2.9.3

From: Sam Li <faithilikerun@gmail.com>

Add documentation with an example of using the virtio-blk driver
to pass zoned block devices through to the guest.

Signed-off-by: Sam Li <faithilikerun@gmail.com>
Message-id: 20230508051916.178322-5-faithilikerun@gmail.com
[Fix pre-formatted code syntax
--Stefan]
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
docs/devel/zoned-storage.rst | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)

diff --git a/docs/devel/zoned-storage.rst b/docs/devel/zoned-storage.rst
index XXXXXXX..XXXXXXX 100644
--- a/docs/devel/zoned-storage.rst
+++ b/docs/devel/zoned-storage.rst
@@ -XXX,XX +XXX,XX @@ APIs for zoned storage emulation or testing.
For example, to test zone_report on a null_blk device using qemu-io is::

$ path/to/qemu-io --image-opts -n driver=host_device,filename=/dev/nullb0 -c "zrp offset nr_zones"
+
+To expose the host's zoned block device through virtio-blk, the command line
+can be (includes the -device parameter)::
+
+ -blockdev node-name=drive0,driver=host_device,filename=/dev/nullb0,cache.direct=on \
+ -device virtio-blk-pci,drive=drive0
+
+Or only use the -drive parameter::
+
+ -drive driver=host_device,file=/dev/nullb0,if=virtio,cache.direct=on
+
+Additionally, QEMU has several ways of supporting zoned storage, including:
+(1) Using virtio-scsi: --device scsi-block allows for the passing through of
+SCSI ZBC devices, enabling the attachment of ZBC or ZAC HDDs to QEMU.
+(2) PCI device pass-through: While NVMe ZNS emulation is available for testing
+purposes, it cannot yet pass through a zoned device from the host. To pass on
+the NVMe ZNS device to the guest, use VFIO PCI to pass the entire NVMe PCI
+adapter through to the guest. Likewise, an HDD HBA can be passed through to
+QEMU, along with all HDDs attached to the HBA.
--
2.40.1
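The documentation examples assume a zoned /dev/nullb0 already exists on the host. One way to create such a test device is via Linux's null_blk driver; this is a hedged sketch, with parameter names as described in the kernel's null_blk documentation and arbitrary sizes (zone_size is in MiB)::

    # modprobe null_blk nr_devices=1 zoned=1 zone_size=64 memory_backed=1
    # cat /sys/block/nullb0/queue/zoned
    host-managed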