The following changes since commit ac793156f650ae2d77834932d72224175ee69086:

  Merge remote-tracking branch 'remotes/pmaydell/tags/pull-target-arm-20201020-1' into staging (2020-10-20 21:11:35 +0100)

are available in the Git repository at:

  https://gitlab.com/stefanha/qemu.git tags/block-pull-request

for you to fetch changes up to 32a3fd65e7e3551337fd26bfc0e2f899d70c028c:

  iotests: add commit top->base cases to 274 (2020-10-22 09:55:39 +0100)

----------------------------------------------------------------
Pull request

v2:
 * Fix format string issues on 32-bit hosts [Peter]
 * Fix qemu-nbd.c CONFIG_POSIX ifdef issue [Eric]
 * Fix missing eventfd.h header on macOS [Peter]
 * Drop unreliable vhost-user-blk test (will send a new patch when ready) [Peter]

This pull request contains the vhost-user-blk server by Coiby Xu along with my
additions, block/nvme.c alignment and hardware error statistics by Philippe
Mathieu-Daudé, and bdrv_co_block_status_above() fixes by Vladimir
Sementsov-Ogievskiy.

----------------------------------------------------------------

Coiby Xu (6):
  libvhost-user: Allow vu_message_read to be replaced
  libvhost-user: remove watch for kick_fd when de-initialize vu-dev
  util/vhost-user-server: generic vhost user server
  block: move logical block size check function to a common utility
    function
  block/export: vhost-user block device backend server
  MAINTAINERS: Add vhost-user block device backend server maintainer

Philippe Mathieu-Daudé (1):
  block/nvme: Add driver statistics for access alignment and hw errors

Stefan Hajnoczi (16):
  util/vhost-user-server: s/fileds/fields/ typo fix
  util/vhost-user-server: drop unnecessary QOM cast
  util/vhost-user-server: drop unnecessary watch deletion
  block/export: consolidate request structs into VuBlockReq
  util/vhost-user-server: drop unused DevicePanicNotifier
  util/vhost-user-server: fix memory leak in vu_message_read()
  util/vhost-user-server: check EOF when reading payload
  util/vhost-user-server: rework vu_client_trip() coroutine lifecycle
  block/export: report flush errors
  block/export: convert vhost-user-blk server to block export API
  util/vhost-user-server: move header to include/
  util/vhost-user-server: use static library in meson.build
  qemu-storage-daemon: avoid compiling blockdev_ss twice
  block: move block exports to libblockdev
  block/export: add iothread and fixed-iothread options
  block/export: add vhost-user-blk multi-queue support

Vladimir Sementsov-Ogievskiy (5):
  block/io: fix bdrv_co_block_status_above
  block/io: bdrv_common_block_status_above: support include_base
  block/io: bdrv_common_block_status_above: support bs == base
  block/io: fix bdrv_is_allocated_above
  iotests: add commit top->base cases to 274

 MAINTAINERS                                |   9 +
 qapi/block-core.json                       |  24 +-
 qapi/block-export.json                     |  36 +-
 block/coroutines.h                         |   2 +
 block/export/vhost-user-blk-server.h       |  19 +
 contrib/libvhost-user/libvhost-user.h      |  21 +
 include/qemu/vhost-user-server.h           |  65 +++
 util/block-helpers.h                       |  19 +
 block/export/export.c                      |  37 +-
 block/export/vhost-user-blk-server.c       | 431 ++++++++++++++++++++
 block/io.c                                 | 132 +++---
 block/nvme.c                               |  27 ++
 block/qcow2.c                              |  16 +-
 contrib/libvhost-user/libvhost-user-glib.c |   2 +-
 contrib/libvhost-user/libvhost-user.c      |  15 +-
 hw/core/qdev-properties-system.c           |  31 +-
 nbd/server.c                               |   2 -
 qemu-nbd.c                                 |  21 +-
 softmmu/vl.c                               |   4 +
 stubs/blk-exp-close-all.c                  |   7 +
 tests/vhost-user-bridge.c                  |   2 +
 tools/virtiofsd/fuse_virtio.c              |   4 +-
 util/block-helpers.c                       |  46 +++
 util/vhost-user-server.c                   | 446 +++++++++++++++++++++
 block/export/meson.build                   |   3 +-
 contrib/libvhost-user/meson.build          |   1 +
 meson.build                                |  22 +-
 nbd/meson.build                            |   2 +
 storage-daemon/meson.build                 |   3 +-
 stubs/meson.build                          |   1 +
 tests/qemu-iotests/274                     |  20 +
 tests/qemu-iotests/274.out                 |  68 ++++
 util/meson.build                           |   4 +
 33 files changed, 1420 insertions(+), 122 deletions(-)
 create mode 100644 block/export/vhost-user-blk-server.h
 create mode 100644 include/qemu/vhost-user-server.h
 create mode 100644 util/block-helpers.h
 create mode 100644 block/export/vhost-user-blk-server.c
 create mode 100644 stubs/blk-exp-close-all.c
 create mode 100644 util/block-helpers.c
 create mode 100644 util/vhost-user-server.c

--
2.26.2
From: Philippe Mathieu-Daudé <philmd@redhat.com>

Keep statistics of some hardware errors, and number of
aligned/unaligned I/O accesses.

QMP example booting a full RHEL 8.3 aarch64 guest:

{ "execute": "query-blockstats" }
{
    "return": [
        {
            "device": "",
            "node-name": "drive0",
            "stats": {
                "flush_total_time_ns": 6026948,
                "wr_highest_offset": 3383991230464,
                "wr_total_time_ns": 807450995,
                "failed_wr_operations": 0,
                "failed_rd_operations": 0,
                "wr_merged": 3,
                "wr_bytes": 50133504,
                "failed_unmap_operations": 0,
                "failed_flush_operations": 0,
                "account_invalid": false,
                "rd_total_time_ns": 1846979900,
                "flush_operations": 130,
                "wr_operations": 659,
                "rd_merged": 1192,
                "rd_bytes": 218244096,
                "account_failed": false,
                "idle_time_ns": 2678641497,
                "rd_operations": 7406,
            },
            "driver-specific": {
                "driver": "nvme",
                "completion-errors": 0,
                "unaligned-accesses": 2959,
                "aligned-accesses": 4477
            },
            "qdev": "/machine/peripheral-anon/device[0]/virtio-backend"
        }
    ]
}

Suggested-by: Stefan Hajnoczi <stefanha@gmail.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Message-id: 20201001162939.1567915-1-philmd@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 qapi/block-core.json | 24 +++++++++++++++++++++-
 block/nvme.c         | 27 +++++++++++++++++++++++++++
 2 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index XXXXXXX..XXXXXXX 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -XXX,XX +XXX,XX @@
     'discard-nb-failed': 'uint64',
     'discard-bytes-ok': 'uint64' } }

+##
+# @BlockStatsSpecificNvme:
+#
+# NVMe driver statistics
+#
+# @completion-errors: The number of completion errors.
+#
+# @aligned-accesses: The number of aligned accesses performed by
+#                    the driver.
+#
+# @unaligned-accesses: The number of unaligned accesses performed by
+#                      the driver.
+#
+# Since: 5.2
+##
+{ 'struct': 'BlockStatsSpecificNvme',
+  'data': {
+    'completion-errors': 'uint64',
+    'aligned-accesses': 'uint64',
+    'unaligned-accesses': 'uint64' } }
+
 ##
 # @BlockStatsSpecific:
 #
@@ -XXX,XX +XXX,XX @@
   'discriminator': 'driver',
   'data': {
       'file': 'BlockStatsSpecificFile',
-      'host_device': 'BlockStatsSpecificFile' } }
+      'host_device': 'BlockStatsSpecificFile',
+      'nvme': 'BlockStatsSpecificNvme' } }

 ##
 # @BlockStats:
diff --git a/block/nvme.c b/block/nvme.c
index XXXXXXX..XXXXXXX 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -XXX,XX +XXX,XX @@ struct BDRVNVMeState {

     /* PCI address (required for nvme_refresh_filename()) */
     char *device;
+
+    struct {
+        uint64_t completion_errors;
+        uint64_t aligned_accesses;
+        uint64_t unaligned_accesses;
+    } stats;
 };

 #define NVME_BLOCK_OPT_DEVICE "device"
@@ -XXX,XX +XXX,XX @@ static bool nvme_process_completion(NVMeQueuePair *q)
             break;
         }
         ret = nvme_translate_error(c);
+        if (ret) {
+            s->stats.completion_errors++;
+        }
         q->cq.head = (q->cq.head + 1) % NVME_QUEUE_SIZE;
         if (!q->cq.head) {
             q->cq_phase = !q->cq_phase;
@@ -XXX,XX +XXX,XX @@ static int nvme_co_prw(BlockDriverState *bs, uint64_t offset, uint64_t bytes,
     assert(QEMU_IS_ALIGNED(bytes, s->page_size));
     assert(bytes <= s->max_transfer);
     if (nvme_qiov_aligned(bs, qiov)) {
+        s->stats.aligned_accesses++;
         return nvme_co_prw_aligned(bs, offset, bytes, qiov, is_write, flags);
     }
+    s->stats.unaligned_accesses++;
     trace_nvme_prw_buffered(s, offset, bytes, qiov->niov, is_write);
     buf = qemu_try_memalign(s->page_size, bytes);

@@ -XXX,XX +XXX,XX @@ static void nvme_unregister_buf(BlockDriverState *bs, void *host)
     qemu_vfio_dma_unmap(s->vfio, host);
 }

+static BlockStatsSpecific *nvme_get_specific_stats(BlockDriverState *bs)
+{
+    BlockStatsSpecific *stats = g_new(BlockStatsSpecific, 1);
+    BDRVNVMeState *s = bs->opaque;
+
+    stats->driver = BLOCKDEV_DRIVER_NVME;
+    stats->u.nvme = (BlockStatsSpecificNvme) {
+        .completion_errors = s->stats.completion_errors,
+        .aligned_accesses = s->stats.aligned_accesses,
+        .unaligned_accesses = s->stats.unaligned_accesses,
+    };
+
+    return stats;
+}
+
 static const char *const nvme_strong_runtime_opts[] = {
     NVME_BLOCK_OPT_DEVICE,
     NVME_BLOCK_OPT_NAMESPACE,
@@ -XXX,XX +XXX,XX @@ static BlockDriver bdrv_nvme = {
     .bdrv_refresh_filename    = nvme_refresh_filename,
     .bdrv_refresh_limits      = nvme_refresh_limits,
     .strong_runtime_opts      = nvme_strong_runtime_opts,
+    .bdrv_get_specific_stats  = nvme_get_specific_stats,

     .bdrv_detach_aio_context  = nvme_detach_aio_context,
     .bdrv_attach_aio_context  = nvme_attach_aio_context,
--
2.26.2
From: Coiby Xu <coiby.xu@gmail.com>

Allow vu_message_read to be replaced by one which will make use of the
QIOChannel functions. Thus reading a vhost-user message won't stall the
guest. For the slave channel, we still use the default vu_message_read.

Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Signed-off-by: Coiby Xu <coiby.xu@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20200918080912.321299-2-coiby.xu@gmail.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 contrib/libvhost-user/libvhost-user.h      | 21 +++++++++++++++++++++
 contrib/libvhost-user/libvhost-user-glib.c |  2 +-
 contrib/libvhost-user/libvhost-user.c      | 14 +++++++-------
 tests/vhost-user-bridge.c                  |  2 ++
 tools/virtiofsd/fuse_virtio.c              |  4 ++--
 5 files changed, 33 insertions(+), 10 deletions(-)

diff --git a/contrib/libvhost-user/libvhost-user.h b/contrib/libvhost-user/libvhost-user.h
index XXXXXXX..XXXXXXX 100644
--- a/contrib/libvhost-user/libvhost-user.h
+++ b/contrib/libvhost-user/libvhost-user.h
@@ -XXX,XX +XXX,XX @@
  */
 #define VHOST_USER_MAX_RAM_SLOTS 32

+#define VHOST_USER_HDR_SIZE offsetof(VhostUserMsg, payload.u64)
+
 typedef enum VhostSetConfigType {
     VHOST_SET_CONFIG_TYPE_MASTER = 0,
     VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
@@ -XXX,XX +XXX,XX @@ typedef uint64_t (*vu_get_features_cb) (VuDev *dev);
 typedef void (*vu_set_features_cb) (VuDev *dev, uint64_t features);
 typedef int (*vu_process_msg_cb) (VuDev *dev, VhostUserMsg *vmsg,
                                   int *do_reply);
+typedef bool (*vu_read_msg_cb) (VuDev *dev, int sock, VhostUserMsg *vmsg);
 typedef void (*vu_queue_set_started_cb) (VuDev *dev, int qidx, bool started);
 typedef bool (*vu_queue_is_processed_in_order_cb) (VuDev *dev, int qidx);
 typedef int (*vu_get_config_cb) (VuDev *dev, uint8_t *config, uint32_t len);
@@ -XXX,XX +XXX,XX @@ struct VuDev {
     bool broken;
     uint16_t max_queues;

+    /* @read_msg: custom method to read vhost-user message
+     *
+     * Read data from vhost_user socket fd and fill up
+     * the passed VhostUserMsg *vmsg struct.
+     *
+     * If reading fails, it should close the received set of file
+     * descriptors as socket message's auxiliary data.
+     *
+     * For the details, please refer to vu_message_read in libvhost-user.c
+     * which will be used by default if not custom method is provided when
+     * calling vu_init
+     *
+     * Returns: true if vhost-user message successfully received,
+     *          otherwise return false.
+     *
+     */
+    vu_read_msg_cb read_msg;
     /* @set_watch: add or update the given fd to the watch set,
      * call cb when condition is met */
     vu_set_watch_cb set_watch;
@@ -XXX,XX +XXX,XX @@ bool vu_init(VuDev *dev,
              uint16_t max_queues,
              int socket,
              vu_panic_cb panic,
+             vu_read_msg_cb read_msg,
              vu_set_watch_cb set_watch,
              vu_remove_watch_cb remove_watch,
              const VuDevIface *iface);
diff --git a/contrib/libvhost-user/libvhost-user-glib.c b/contrib/libvhost-user/libvhost-user-glib.c
index XXXXXXX..XXXXXXX 100644
--- a/contrib/libvhost-user/libvhost-user-glib.c
+++ b/contrib/libvhost-user/libvhost-user-glib.c
@@ -XXX,XX +XXX,XX @@ vug_init(VugDev *dev, uint16_t max_queues, int socket,
     g_assert(dev);
     g_assert(iface);

-    if (!vu_init(&dev->parent, max_queues, socket, panic, set_watch,
+    if (!vu_init(&dev->parent, max_queues, socket, panic, NULL, set_watch,
                  remove_watch, iface)) {
         return false;
     }
diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
index XXXXXXX..XXXXXXX 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -XXX,XX +XXX,XX @@
 /* The version of inflight buffer */
 #define INFLIGHT_VERSION 1

-#define VHOST_USER_HDR_SIZE offsetof(VhostUserMsg, payload.u64)
-
 /* The version of the protocol we support */
 #define VHOST_USER_VERSION 1
 #define LIBVHOST_USER_DEBUG 0
@@ -XXX,XX +XXX,XX @@ have_userfault(void)
 }

 static bool
-vu_message_read(VuDev *dev, int conn_fd, VhostUserMsg *vmsg)
+vu_message_read_default(VuDev *dev, int conn_fd, VhostUserMsg *vmsg)
 {
     char control[CMSG_SPACE(VHOST_MEMORY_BASELINE_NREGIONS * sizeof(int))] = {};
     struct iovec iov = {
@@ -XXX,XX +XXX,XX @@ vu_process_message_reply(VuDev *dev, const VhostUserMsg *vmsg)
         goto out;
     }

-    if (!vu_message_read(dev, dev->slave_fd, &msg_reply)) {
+    if (!vu_message_read_default(dev, dev->slave_fd, &msg_reply)) {
         goto out;
     }

@@ -XXX,XX +XXX,XX @@ vu_set_mem_table_exec_postcopy(VuDev *dev, VhostUserMsg *vmsg)
     /* Wait for QEMU to confirm that it's registered the handler for the
      * faults.
      */
-    if (!vu_message_read(dev, dev->sock, vmsg) ||
+    if (!dev->read_msg(dev, dev->sock, vmsg) ||
         vmsg->size != sizeof(vmsg->payload.u64) ||
         vmsg->payload.u64 != 0) {
         vu_panic(dev, "failed to receive valid ack for postcopy set-mem-table");
@@ -XXX,XX +XXX,XX @@ vu_dispatch(VuDev *dev)
     int reply_requested;
     bool need_reply, success = false;

-    if (!vu_message_read(dev, dev->sock, &vmsg)) {
+    if (!dev->read_msg(dev, dev->sock, &vmsg)) {
         goto end;
     }

@@ -XXX,XX +XXX,XX @@ vu_init(VuDev *dev,
         uint16_t max_queues,
         int socket,
         vu_panic_cb panic,
+        vu_read_msg_cb read_msg,
         vu_set_watch_cb set_watch,
         vu_remove_watch_cb remove_watch,
         const VuDevIface *iface)
@@ -XXX,XX +XXX,XX @@ vu_init(VuDev *dev,

     dev->sock = socket;
     dev->panic = panic;
+    dev->read_msg = read_msg ? read_msg : vu_message_read_default;
     dev->set_watch = set_watch;
     dev->remove_watch = remove_watch;
     dev->iface = iface;
@@ -XXX,XX +XXX,XX @@ static void _vu_queue_notify(VuDev *dev, VuVirtq *vq, bool sync)

         vu_message_write(dev, dev->slave_fd, &vmsg);
         if (ack) {
-            vu_message_read(dev, dev->slave_fd, &vmsg);
+            vu_message_read_default(dev, dev->slave_fd, &vmsg);
         }
         return;
     }
diff --git a/tests/vhost-user-bridge.c b/tests/vhost-user-bridge.c
index XXXXXXX..XXXXXXX 100644
--- a/tests/vhost-user-bridge.c
+++ b/tests/vhost-user-bridge.c
@@ -XXX,XX +XXX,XX @@ vubr_accept_cb(int sock, void *ctx)
                  VHOST_USER_BRIDGE_MAX_QUEUES,
                  conn_fd,
                  vubr_panic,
+                 NULL,
                  vubr_set_watch,
                  vubr_remove_watch,
                  &vuiface)) {
@@ -XXX,XX +XXX,XX @@ vubr_new(const char *path, bool client)
                  VHOST_USER_BRIDGE_MAX_QUEUES,
                  dev->sock,
                  vubr_panic,
+                 NULL,
                  vubr_set_watch,
                  vubr_remove_watch,
                  &vuiface)) {
diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
index XXXXXXX..XXXXXXX 100644
--- a/tools/virtiofsd/fuse_virtio.c
+++ b/tools/virtiofsd/fuse_virtio.c
@@ -XXX,XX +XXX,XX @@ int virtio_session_mount(struct fuse_session *se)
     se->vu_socketfd = data_sock;
     se->virtio_dev->se = se;
     pthread_rwlock_init(&se->virtio_dev->vu_dispatch_rwlock, NULL);
-    vu_init(&se->virtio_dev->dev, 2, se->vu_socketfd, fv_panic, fv_set_watch,
-            fv_remove_watch, &fv_iface);
+    vu_init(&se->virtio_dev->dev, 2, se->vu_socketfd, fv_panic, NULL,
+            fv_set_watch, fv_remove_watch, &fv_iface);

     return 0;
 }
--
2.26.2
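[Editor's illustration, not part of the series: a minimal sketch of how a
caller might plug its own read callback into the new vu_init() parameter
introduced above, using the VHOST_USER_HDR_SIZE definition the patch moves
into libvhost-user.h. The names my_read_msg, my_panic, my_set_watch,
my_remove_watch and my_iface are hypothetical, and the sketch skips the
SCM_RIGHTS file-descriptor handling that a real callback must perform as
described in the @read_msg documentation.]

#include <sys/socket.h>
#include "contrib/libvhost-user/libvhost-user.h"

/* Blocking reader: fetch the fixed-size header, then the payload. */
static bool my_read_msg(VuDev *dev, int sock, VhostUserMsg *vmsg)
{
    ssize_t n = recv(sock, vmsg, VHOST_USER_HDR_SIZE, MSG_WAITALL);

    if (n != (ssize_t)VHOST_USER_HDR_SIZE) {
        return false;
    }
    if (vmsg->size > sizeof(vmsg->payload)) {
        return false; /* malformed: payload larger than the union */
    }
    vmsg->fd_num = 0; /* this sketch does not accept passed fds */
    if (vmsg->size) {
        n = recv(sock, &vmsg->payload, vmsg->size, MSG_WAITALL);
        if (n != (ssize_t)vmsg->size) {
            return false;
        }
    }
    return true;
}

static bool my_setup(VuDev *dev, int socket_fd)
{
    /* Passing NULL for read_msg would keep vu_message_read_default. */
    return vu_init(dev, 1, socket_fd, my_panic, my_read_msg,
                   my_set_watch, my_remove_watch, &my_iface);
}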
From: Coiby Xu <coiby.xu@gmail.com>

When the client is running in gdb and the quit command is run in gdb,
QEMU will still dispatch the event, which causes a segmentation fault in
the callback function.

Signed-off-by: Coiby Xu <coiby.xu@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-id: 20200918080912.321299-3-coiby.xu@gmail.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 contrib/libvhost-user/libvhost-user.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
index XXXXXXX..XXXXXXX 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -XXX,XX +XXX,XX @@ vu_deinit(VuDev *dev)
         }

         if (vq->kick_fd != -1) {
+            dev->remove_watch(dev, vq->kick_fd);
             close(vq->kick_fd);
             vq->kick_fd = -1;
         }
--
2.26.2
From: Coiby Xu <coiby.xu@gmail.com>

Sharing QEMU devices via the vhost-user protocol.

Only one vhost-user client can connect to the server at a time.

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Coiby Xu <coiby.xu@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-id: 20200918080912.321299-4-coiby.xu@gmail.com
[Fixed size_t %lu -> %zu format string compiler error.
--Stefan]
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 util/vhost-user-server.h |  65 ++++++
 util/vhost-user-server.c | 428 +++++++++++++++++++++++++++++++++
 util/meson.build         |   1 +
 3 files changed, 494 insertions(+)
 create mode 100644 util/vhost-user-server.h
 create mode 100644 util/vhost-user-server.c

diff --git a/util/vhost-user-server.h b/util/vhost-user-server.h
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/util/vhost-user-server.h
@@ -XXX,XX +XXX,XX @@
+/*
+ * Sharing QEMU devices via vhost-user protocol
+ *
+ * Copyright (c) Coiby Xu <coiby.xu@gmail.com>.
+ * Copyright (c) 2020 Red Hat, Inc.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later. See the COPYING file in the top-level directory.
+ */
+
+#ifndef VHOST_USER_SERVER_H
+#define VHOST_USER_SERVER_H
+
+#include "contrib/libvhost-user/libvhost-user.h"
+#include "io/channel-socket.h"
+#include "io/channel-file.h"
+#include "io/net-listener.h"
+#include "qemu/error-report.h"
+#include "qapi/error.h"
+#include "standard-headers/linux/virtio_blk.h"
+
+typedef struct VuFdWatch {
+    VuDev *vu_dev;
+    int fd; /*kick fd*/
+    void *pvt;
+    vu_watch_cb cb;
+    bool processing;
+    QTAILQ_ENTRY(VuFdWatch) next;
+} VuFdWatch;
+
+typedef struct VuServer VuServer;
+typedef void DevicePanicNotifierFn(VuServer *server);
+
+struct VuServer {
+    QIONetListener *listener;
+    AioContext *ctx;
+    DevicePanicNotifierFn *device_panic_notifier;
+    int max_queues;
+    const VuDevIface *vu_iface;
+    VuDev vu_dev;
+    QIOChannel *ioc; /* The I/O channel with the client */
+    QIOChannelSocket *sioc; /* The underlying data channel with the client */
+    /* IOChannel for fd provided via VHOST_USER_SET_SLAVE_REQ_FD */
+    QIOChannel *ioc_slave;
+    QIOChannelSocket *sioc_slave;
+    Coroutine *co_trip; /* coroutine for processing VhostUserMsg */
+    QTAILQ_HEAD(, VuFdWatch) vu_fd_watches;
+    /* restart coroutine co_trip if AIOContext is changed */
+    bool aio_context_changed;
+    bool processing_msg;
+};
+
+bool vhost_user_server_start(VuServer *server,
+                             SocketAddress *unix_socket,
+                             AioContext *ctx,
+                             uint16_t max_queues,
+                             DevicePanicNotifierFn *device_panic_notifier,
+                             const VuDevIface *vu_iface,
+                             Error **errp);
+
+void vhost_user_server_stop(VuServer *server);
+
+void vhost_user_server_set_aio_context(VuServer *server, AioContext *ctx);
+
+#endif /* VHOST_USER_SERVER_H */
diff --git a/util/vhost-user-server.c b/util/vhost-user-server.c
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/util/vhost-user-server.c
@@ -XXX,XX +XXX,XX @@
+/*
+ * Sharing QEMU devices via vhost-user protocol
+ *
+ * Copyright (c) Coiby Xu <coiby.xu@gmail.com>.
+ * Copyright (c) 2020 Red Hat, Inc.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later. See the COPYING file in the top-level directory.
+ */
+#include "qemu/osdep.h"
+#include "qemu/main-loop.h"
+#include "vhost-user-server.h"
+
+static void vmsg_close_fds(VhostUserMsg *vmsg)
+{
+    int i;
+    for (i = 0; i < vmsg->fd_num; i++) {
+        close(vmsg->fds[i]);
+    }
+}
+
+static void vmsg_unblock_fds(VhostUserMsg *vmsg)
+{
+    int i;
+    for (i = 0; i < vmsg->fd_num; i++) {
+        qemu_set_nonblock(vmsg->fds[i]);
+    }
+}
+
+static void vu_accept(QIONetListener *listener, QIOChannelSocket *sioc,
+                      gpointer opaque);
+
+static void close_client(VuServer *server)
+{
+    /*
+     * Before closing the client
+     *
+     * 1. Let vu_client_trip stop processing new vhost-user msg
+     *
+     * 2. remove kick_handler
+     *
+     * 3. wait for the kick handler to be finished
+     *
+     * 4. wait for the current vhost-user msg to be finished processing
+     */
+
+    QIOChannelSocket *sioc = server->sioc;
+    /* When this is set vu_client_trip will stop new processing vhost-user message */
+    server->sioc = NULL;
+
+    VuFdWatch *vu_fd_watch, *next;
+    QTAILQ_FOREACH_SAFE(vu_fd_watch, &server->vu_fd_watches, next, next) {
+        aio_set_fd_handler(server->ioc->ctx, vu_fd_watch->fd, true, NULL,
+                           NULL, NULL, NULL);
+    }
+
+    while (!QTAILQ_EMPTY(&server->vu_fd_watches)) {
+        QTAILQ_FOREACH_SAFE(vu_fd_watch, &server->vu_fd_watches, next, next) {
+            if (!vu_fd_watch->processing) {
+                QTAILQ_REMOVE(&server->vu_fd_watches, vu_fd_watch, next);
+                g_free(vu_fd_watch);
+            }
+        }
+    }
+
+    while (server->processing_msg) {
+        if (server->ioc->read_coroutine) {
+            server->ioc->read_coroutine = NULL;
+            qio_channel_set_aio_fd_handler(server->ioc, server->ioc->ctx, NULL,
+                                           NULL, server->ioc);
+            server->processing_msg = false;
+        }
+    }
+
+    vu_deinit(&server->vu_dev);
+    object_unref(OBJECT(sioc));
+    object_unref(OBJECT(server->ioc));
+}
+
+static void panic_cb(VuDev *vu_dev, const char *buf)
+{
+    VuServer *server = container_of(vu_dev, VuServer, vu_dev);
+
+    /* avoid while loop in close_client */
+    server->processing_msg = false;
+
+    if (buf) {
+        error_report("vu_panic: %s", buf);
+    }
+
+    if (server->sioc) {
+        close_client(server);
+    }
+
+    if (server->device_panic_notifier) {
+        server->device_panic_notifier(server);
+    }
+
+    /*
+     * Set the callback function for network listener so another
+     * vhost-user client can connect to this server
+     */
+    qio_net_listener_set_client_func(server->listener,
+                                     vu_accept,
+                                     server,
+                                     NULL);
+}
+
+static bool coroutine_fn
+vu_message_read(VuDev *vu_dev, int conn_fd, VhostUserMsg *vmsg)
+{
+    struct iovec iov = {
+        .iov_base = (char *)vmsg,
+        .iov_len = VHOST_USER_HDR_SIZE,
+    };
+    int rc, read_bytes = 0;
+    Error *local_err = NULL;
+    /*
+     * Store fds/nfds returned from qio_channel_readv_full into
+     * temporary variables.
+     *
+     * VhostUserMsg is a packed structure, gcc will complain about passing
+     * pointer to a packed structure member if we pass &VhostUserMsg.fd_num
+     * and &VhostUserMsg.fds directly when calling qio_channel_readv_full,
+     * thus two temporary variables nfds and fds are used here.
+     */
+    size_t nfds = 0, nfds_t = 0;
+    const size_t max_fds = G_N_ELEMENTS(vmsg->fds);
+    int *fds_t = NULL;
+    VuServer *server = container_of(vu_dev, VuServer, vu_dev);
+    QIOChannel *ioc = server->ioc;
+
+    if (!ioc) {
+        error_report_err(local_err);
+        goto fail;
+    }
+
+    assert(qemu_in_coroutine());
+    do {
+        /*
+         * qio_channel_readv_full may have short reads, keeping calling it
+         * until getting VHOST_USER_HDR_SIZE or 0 bytes in total
+         */
+        rc = qio_channel_readv_full(ioc, &iov, 1, &fds_t, &nfds_t, &local_err);
+        if (rc < 0) {
+            if (rc == QIO_CHANNEL_ERR_BLOCK) {
+                qio_channel_yield(ioc, G_IO_IN);
+                continue;
+            } else {
+                error_report_err(local_err);
+                return false;
+            }
+        }
+        read_bytes += rc;
+        if (nfds_t > 0) {
+            if (nfds + nfds_t > max_fds) {
+                error_report("A maximum of %zu fds are allowed, "
+                             "however got %zu fds now",
+                             max_fds, nfds + nfds_t);
+                goto fail;
+            }
+            memcpy(vmsg->fds + nfds, fds_t,
+                   nfds_t * sizeof(vmsg->fds[0]));
+            nfds += nfds_t;
+            g_free(fds_t);
+        }
+        if (read_bytes == VHOST_USER_HDR_SIZE || rc == 0) {
+            break;
+        }
+        iov.iov_base = (char *)vmsg + read_bytes;
+        iov.iov_len = VHOST_USER_HDR_SIZE - read_bytes;
+    } while (true);
+
+    vmsg->fd_num = nfds;
+    /* qio_channel_readv_full will make socket fds blocking, unblock them */
+    vmsg_unblock_fds(vmsg);
+    if (vmsg->size > sizeof(vmsg->payload)) {
+        error_report("Error: too big message request: %d, "
+                     "size: vmsg->size: %u, "
+                     "while sizeof(vmsg->payload) = %zu",
+                     vmsg->request, vmsg->size, sizeof(vmsg->payload));
+        goto fail;
+    }
+
+    struct iovec iov_payload = {
+        .iov_base = (char *)&vmsg->payload,
+        .iov_len = vmsg->size,
+    };
+    if (vmsg->size) {
+        rc = qio_channel_readv_all_eof(ioc, &iov_payload, 1, &local_err);
+        if (rc == -1) {
+            error_report_err(local_err);
+            goto fail;
+        }
+    }
+
+    return true;
+
+fail:
+    vmsg_close_fds(vmsg);
+
+    return false;
+}
+
+
+static void vu_client_start(VuServer *server);
+static coroutine_fn void vu_client_trip(void *opaque)
+{
+    VuServer *server = opaque;
+
+    while (!server->aio_context_changed && server->sioc) {
+        server->processing_msg = true;
+        vu_dispatch(&server->vu_dev);
+        server->processing_msg = false;
+    }
+
+    if (server->aio_context_changed && server->sioc) {
+        server->aio_context_changed = false;
+        vu_client_start(server);
+    }
+}
+
+static void vu_client_start(VuServer *server)
+{
+    server->co_trip = qemu_coroutine_create(vu_client_trip, server);
+    aio_co_enter(server->ctx, server->co_trip);
+}
+
+/*
+ * a wrapper for vu_kick_cb
+ *
+ * since aio_dispatch can only pass one user data pointer to the
+ * callback function, pack VuDev and pvt into a struct. Then unpack it
+ * and pass them to vu_kick_cb
+ */
+static void kick_handler(void *opaque)
+{
+    VuFdWatch *vu_fd_watch = opaque;
+    vu_fd_watch->processing = true;
+    vu_fd_watch->cb(vu_fd_watch->vu_dev, 0, vu_fd_watch->pvt);
+    vu_fd_watch->processing = false;
+}
+
+
+static VuFdWatch *find_vu_fd_watch(VuServer *server, int fd)
+{
+
+    VuFdWatch *vu_fd_watch, *next;
+    QTAILQ_FOREACH_SAFE(vu_fd_watch, &server->vu_fd_watches, next, next) {
+        if (vu_fd_watch->fd == fd) {
+            return vu_fd_watch;
+        }
+    }
+    return NULL;
+}
+
+static void
+set_watch(VuDev *vu_dev, int fd, int vu_evt,
+          vu_watch_cb cb, void *pvt)
+{
+
+    VuServer *server = container_of(vu_dev, VuServer, vu_dev);
+    g_assert(vu_dev);
+    g_assert(fd >= 0);
+    g_assert(cb);
+
+    VuFdWatch *vu_fd_watch = find_vu_fd_watch(server, fd);
+
+    if (!vu_fd_watch) {
+        VuFdWatch *vu_fd_watch = g_new0(VuFdWatch, 1);
+
+        QTAILQ_INSERT_TAIL(&server->vu_fd_watches, vu_fd_watch, next);
+
+        vu_fd_watch->fd = fd;
+        vu_fd_watch->cb = cb;
+        qemu_set_nonblock(fd);
+        aio_set_fd_handler(server->ioc->ctx, fd, true, kick_handler,
+                           NULL, NULL, vu_fd_watch);
+        vu_fd_watch->vu_dev = vu_dev;
+        vu_fd_watch->pvt = pvt;
+    }
+}
+
+
+static void remove_watch(VuDev *vu_dev, int fd)
+{
+    VuServer *server;
+    g_assert(vu_dev);
+    g_assert(fd >= 0);
+
+    server = container_of(vu_dev, VuServer, vu_dev);
+
+    VuFdWatch *vu_fd_watch = find_vu_fd_watch(server, fd);
+
+    if (!vu_fd_watch) {
+        return;
+    }
+    aio_set_fd_handler(server->ioc->ctx, fd, true, NULL, NULL, NULL, NULL);
+
+    QTAILQ_REMOVE(&server->vu_fd_watches, vu_fd_watch, next);
+    g_free(vu_fd_watch);
+}
+
+
+static void vu_accept(QIONetListener *listener, QIOChannelSocket *sioc,
+                      gpointer opaque)
+{
+    VuServer *server = opaque;
+
+    if (server->sioc) {
+        warn_report("Only one vhost-user client is allowed to "
+                    "connect the server one time");
+        return;
+    }
+
+    if (!vu_init(&server->vu_dev, server->max_queues, sioc->fd, panic_cb,
+                 vu_message_read, set_watch, remove_watch, server->vu_iface)) {
+        error_report("Failed to initialize libvhost-user");
+        return;
+    }
+
+    /*
+     * Unset the callback function for network listener to make another
+     * vhost-user client keeping waiting until this client disconnects
+     */
+    qio_net_listener_set_client_func(server->listener,
+                                     NULL,
+                                     NULL,
+                                     NULL);
+    server->sioc = sioc;
+    /*
+     * Increase the object reference, so sioc will not freed by
+     * qio_net_listener_channel_func which will call object_unref(OBJECT(sioc))
+     */
+    object_ref(OBJECT(server->sioc));
+    qio_channel_set_name(QIO_CHANNEL(sioc), "vhost-user client");
+    server->ioc = QIO_CHANNEL(sioc);
+    object_ref(OBJECT(server->ioc));
+    qio_channel_attach_aio_context(server->ioc, server->ctx);
+    qio_channel_set_blocking(QIO_CHANNEL(server->sioc), false, NULL);
+    vu_client_start(server);
+}
+
+
+void vhost_user_server_stop(VuServer *server)
+{
+    if (server->sioc) {
+        close_client(server);
+    }
+
+    if (server->listener) {
+        qio_net_listener_disconnect(server->listener);
+        object_unref(OBJECT(server->listener));
+    }
+
+}
+
+void vhost_user_server_set_aio_context(VuServer *server, AioContext *ctx)
+{
+    VuFdWatch *vu_fd_watch, *next;
+    void *opaque = NULL;
+    IOHandler *io_read = NULL;
+    bool attach;
+
+    server->ctx = ctx ? ctx : qemu_get_aio_context();
+
+    if (!server->sioc) {
+        /* not yet serving any client*/
+        return;
+    }
+
+    if (ctx) {
+        qio_channel_attach_aio_context(server->ioc, ctx);
+        server->aio_context_changed = true;
+        io_read = kick_handler;
+        attach = true;
+    } else {
+        qio_channel_detach_aio_context(server->ioc);
+        /* server->ioc->ctx keeps the old AioConext */
+        ctx = server->ioc->ctx;
+        attach = false;
+    }
+
+    QTAILQ_FOREACH_SAFE(vu_fd_watch, &server->vu_fd_watches, next, next) {
+        if (vu_fd_watch->cb) {
+            opaque = attach ? vu_fd_watch : NULL;
+            aio_set_fd_handler(ctx, vu_fd_watch->fd, true,
+                               io_read, NULL, NULL,
+                               opaque);
+        }
+    }
+}
+
+
+bool vhost_user_server_start(VuServer *server,
+                             SocketAddress *socket_addr,
+                             AioContext *ctx,
+                             uint16_t max_queues,
+                             DevicePanicNotifierFn *device_panic_notifier,
+                             const VuDevIface *vu_iface,
+                             Error **errp)
+{
+    QIONetListener *listener = qio_net_listener_new();
+    if (qio_net_listener_open_sync(listener, socket_addr, 1,
+                                   errp) < 0) {
+        object_unref(OBJECT(listener));
+        return false;
+    }
+
+    /* zero out unspecified fileds */
+    *server = (VuServer) {
+        .listener = listener,
+        .vu_iface = vu_iface,
+        .max_queues = max_queues,
+        .ctx = ctx,
+        .device_panic_notifier = device_panic_notifier,
+    };
+
+    qio_net_listener_set_name(server->listener, "vhost-user-backend-listener");
+
+    qio_net_listener_set_client_func(server->listener,
+                                     vu_accept,
+                                     server,
+                                     NULL);
+
+    QTAILQ_INIT(&server->vu_fd_watches);
+    return true;
+}
diff --git a/util/meson.build b/util/meson.build
index XXXXXXX..XXXXXXX 100644
--- a/util/meson.build
+++ b/util/meson.build
@@ -XXX,XX +XXX,XX @@ if have_block
   util_ss.add(files('main-loop.c'))
   util_ss.add(files('nvdimm-utils.c'))
   util_ss.add(files('qemu-coroutine.c', 'qemu-coroutine-lock.c', 'qemu-coroutine-io.c'))
+  util_ss.add(when: 'CONFIG_LINUX', if_true: files('vhost-user-server.c'))
   util_ss.add(files('qemu-coroutine-sleep.c'))
   util_ss.add(files('qemu-co-shared-resource.c'))
   util_ss.add(files('thread-pool.c', 'qemu-timer.c'))
--
2.26.2
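[Editor's illustration, not part of the series: a hypothetical user of the
new VuServer API above, to show how the pieces fit together. my_server,
my_iface and my_export_start are invented names, and the VuDevIface
callbacks are assumed to be implemented elsewhere.]

#include "qemu/osdep.h"
#include "qemu/main-loop.h"
#include "util/vhost-user-server.h"

static VuServer my_server;
static const VuDevIface my_iface = {
    /* .get_features, .process_msg, .queue_set_started, ... */
};

/* Listen on a UNIX socket and serve one vhost-user client at a time in
 * the main loop's AioContext. */
static bool my_export_start(SocketAddress *unix_socket, Error **errp)
{
    return vhost_user_server_start(&my_server, unix_socket,
                                   qemu_get_aio_context(),
                                   1 /* max_queues */,
                                   NULL /* device_panic_notifier */,
                                   &my_iface, errp);
}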
1
From: Paolo Bonzini <pbonzini@redhat.com>
1
From: Coiby Xu <coiby.xu@gmail.com>
2
2
3
AioContext is fairly self contained, the only dependency is QEMUTimer but
3
Move the constants from hw/core/qdev-properties.c to
4
that in turn doesn't need anything else. So move them out of block-obj-y
4
util/block-helpers.h so that knowledge of the min/max values is
5
to avoid introducing a dependency from io/ to block-obj-y.
6
5
7
main-loop and its dependency iohandler also need to be moved, because
6
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
8
later in this series io/ will call iohandler_get_aio_context.
7
Signed-off-by: Coiby Xu <coiby.xu@gmail.com>
9
8
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
10
[Changed copyright "the QEMU team" to "other QEMU contributors" as
9
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
11
suggested by Daniel Berrange and agreed by Paolo.
10
Acked-by: Eduardo Habkost <ehabkost@redhat.com>
12
--Stefan]
11
Message-id: 20200918080912.321299-5-coiby.xu@gmail.com
13
14
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
15
Reviewed-by: Fam Zheng <famz@redhat.com>
16
Message-id: 20170213135235.12274-2-pbonzini@redhat.com
17
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
12
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
18
---
13
---
19
Makefile.objs | 4 ---
14
util/block-helpers.h | 19 +++++++++++++
20
stubs/Makefile.objs | 1 +
15
hw/core/qdev-properties-system.c | 31 ++++-----------------
21
tests/Makefile.include | 11 ++++----
16
util/block-helpers.c | 46 ++++++++++++++++++++++++++++++++
22
util/Makefile.objs | 6 +++-
17
util/meson.build | 1 +
23
block/io.c | 29 -------------------
18
4 files changed, 71 insertions(+), 26 deletions(-)
24
stubs/linux-aio.c | 32 +++++++++++++++++++++
19
create mode 100644 util/block-helpers.h
25
stubs/set-fd-handler.c | 11 --------
20
create mode 100644 util/block-helpers.c
26
aio-posix.c => util/aio-posix.c | 2 +-
27
aio-win32.c => util/aio-win32.c | 0
28
util/aiocb.c | 55 +++++++++++++++++++++++++++++++++++++
29
async.c => util/async.c | 3 +-
30
iohandler.c => util/iohandler.c | 0
31
main-loop.c => util/main-loop.c | 0
32
qemu-timer.c => util/qemu-timer.c | 0
33
thread-pool.c => util/thread-pool.c | 2 +-
34
trace-events | 11 --------
35
util/trace-events | 11 ++++++++
36
17 files changed, 114 insertions(+), 64 deletions(-)
37
create mode 100644 stubs/linux-aio.c
38
rename aio-posix.c => util/aio-posix.c (99%)
39
rename aio-win32.c => util/aio-win32.c (100%)
40
create mode 100644 util/aiocb.c
41
rename async.c => util/async.c (99%)
42
rename iohandler.c => util/iohandler.c (100%)
43
rename main-loop.c => util/main-loop.c (100%)
44
rename qemu-timer.c => util/qemu-timer.c (100%)
45
rename thread-pool.c => util/thread-pool.c (99%)
46
21
47
diff --git a/Makefile.objs b/Makefile.objs
22
diff --git a/util/block-helpers.h b/util/block-helpers.h
48
index XXXXXXX..XXXXXXX 100644
49
--- a/Makefile.objs
50
+++ b/Makefile.objs
51
@@ -XXX,XX +XXX,XX @@ chardev-obj-y = chardev/
52
#######################################################################
53
# block-obj-y is code used by both qemu system emulation and qemu-img
54
55
-block-obj-y = async.o thread-pool.o
56
block-obj-y += nbd/
57
block-obj-y += block.o blockjob.o
58
-block-obj-y += main-loop.o iohandler.o qemu-timer.o
59
-block-obj-$(CONFIG_POSIX) += aio-posix.o
60
-block-obj-$(CONFIG_WIN32) += aio-win32.o
61
block-obj-y += block/
62
block-obj-y += qemu-io-cmds.o
63
block-obj-$(CONFIG_REPLICATION) += replication.o
64
diff --git a/stubs/Makefile.objs b/stubs/Makefile.objs
65
index XXXXXXX..XXXXXXX 100644
66
--- a/stubs/Makefile.objs
67
+++ b/stubs/Makefile.objs
68
@@ -XXX,XX +XXX,XX @@ stub-obj-y += get-vm-name.o
69
stub-obj-y += iothread.o
70
stub-obj-y += iothread-lock.o
71
stub-obj-y += is-daemonized.o
72
+stub-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
73
stub-obj-y += machine-init-done.o
74
stub-obj-y += migr-blocker.o
75
stub-obj-y += monitor.o
76
diff --git a/tests/Makefile.include b/tests/Makefile.include
77
index XXXXXXX..XXXXXXX 100644
78
--- a/tests/Makefile.include
79
+++ b/tests/Makefile.include
80
@@ -XXX,XX +XXX,XX @@ check-unit-y += tests/test-visitor-serialization$(EXESUF)
81
check-unit-y += tests/test-iov$(EXESUF)
82
gcov-files-test-iov-y = util/iov.c
83
check-unit-y += tests/test-aio$(EXESUF)
84
+gcov-files-test-aio-y = util/async.c util/qemu-timer.o
85
+gcov-files-test-aio-$(CONFIG_WIN32) += util/aio-win32.c
86
+gcov-files-test-aio-$(CONFIG_POSIX) += util/aio-posix.c
87
check-unit-y += tests/test-throttle$(EXESUF)
88
gcov-files-test-aio-$(CONFIG_WIN32) = aio-win32.c
89
gcov-files-test-aio-$(CONFIG_POSIX) = aio-posix.c
90
@@ -XXX,XX +XXX,XX @@ tests/check-qjson$(EXESUF): tests/check-qjson.o $(test-util-obj-y)
91
tests/check-qom-interface$(EXESUF): tests/check-qom-interface.o $(test-qom-obj-y)
92
tests/check-qom-proplist$(EXESUF): tests/check-qom-proplist.o $(test-qom-obj-y)
93
94
-tests/test-char$(EXESUF): tests/test-char.o qemu-timer.o \
95
-    $(test-util-obj-y) $(qtest-obj-y) $(test-block-obj-y) $(chardev-obj-y)
96
+tests/test-char$(EXESUF): tests/test-char.o $(test-util-obj-y) $(qtest-obj-y) $(test-io-obj-y) $(chardev-obj-y)
97
tests/test-coroutine$(EXESUF): tests/test-coroutine.o $(test-block-obj-y)
98
tests/test-aio$(EXESUF): tests/test-aio.o $(test-block-obj-y)
99
tests/test-throttle$(EXESUF): tests/test-throttle.o $(test-block-obj-y)
100
@@ -XXX,XX +XXX,XX @@ tests/test-vmstate$(EXESUF): tests/test-vmstate.o \
101
    migration/vmstate.o migration/qemu-file.o \
102
migration/qemu-file-channel.o migration/qjson.o \
103
    $(test-io-obj-y)
104
-tests/test-timed-average$(EXESUF): tests/test-timed-average.o qemu-timer.o \
105
-    $(test-util-obj-y)
106
+tests/test-timed-average$(EXESUF): tests/test-timed-average.o $(test-util-obj-y)
107
tests/test-base64$(EXESUF): tests/test-base64.o \
108
    libqemuutil.a libqemustub.a
109
tests/ptimer-test$(EXESUF): tests/ptimer-test.o tests/ptimer-test-stubs.o hw/core/ptimer.o libqemustub.a
110
@@ -XXX,XX +XXX,XX @@ tests/usb-hcd-ehci-test$(EXESUF): tests/usb-hcd-ehci-test.o $(libqos-usb-obj-y)
111
tests/usb-hcd-xhci-test$(EXESUF): tests/usb-hcd-xhci-test.o $(libqos-usb-obj-y)
112
tests/pc-cpu-test$(EXESUF): tests/pc-cpu-test.o
113
tests/postcopy-test$(EXESUF): tests/postcopy-test.o
114
-tests/vhost-user-test$(EXESUF): tests/vhost-user-test.o qemu-timer.o \
115
+tests/vhost-user-test$(EXESUF): tests/vhost-user-test.o $(test-util-obj-y) \
116
    $(qtest-obj-y) $(test-io-obj-y) $(libqos-virtio-obj-y) $(libqos-pc-obj-y) \
117
    $(chardev-obj-y)
118
tests/qemu-iotests/socket_scm_helper$(EXESUF): tests/qemu-iotests/socket_scm_helper.o
119
diff --git a/util/Makefile.objs b/util/Makefile.objs
120
index XXXXXXX..XXXXXXX 100644
121
--- a/util/Makefile.objs
122
+++ b/util/Makefile.objs
123
@@ -XXX,XX +XXX,XX @@
124
util-obj-y = osdep.o cutils.o unicode.o qemu-timer-common.o
125
util-obj-y += bufferiszero.o
126
util-obj-y += lockcnt.o
127
+util-obj-y += aiocb.o async.o thread-pool.o qemu-timer.o
128
+util-obj-y += main-loop.o iohandler.o
129
+util-obj-$(CONFIG_POSIX) += aio-posix.o
130
util-obj-$(CONFIG_POSIX) += compatfd.o
131
util-obj-$(CONFIG_POSIX) += event_notifier-posix.o
132
util-obj-$(CONFIG_POSIX) += mmap-alloc.o
133
util-obj-$(CONFIG_POSIX) += oslib-posix.o
134
util-obj-$(CONFIG_POSIX) += qemu-openpty.o
135
util-obj-$(CONFIG_POSIX) += qemu-thread-posix.o
136
-util-obj-$(CONFIG_WIN32) += event_notifier-win32.o
137
util-obj-$(CONFIG_POSIX) += memfd.o
138
+util-obj-$(CONFIG_WIN32) += aio-win32.o
139
+util-obj-$(CONFIG_WIN32) += event_notifier-win32.o
140
util-obj-$(CONFIG_WIN32) += oslib-win32.o
141
util-obj-$(CONFIG_WIN32) += qemu-thread-win32.o
142
util-obj-y += envlist.o path.o module.o
143
diff --git a/block/io.c b/block/io.c
144
index XXXXXXX..XXXXXXX 100644
145
--- a/block/io.c
146
+++ b/block/io.c
147
@@ -XXX,XX +XXX,XX @@ BlockAIOCB *bdrv_aio_flush(BlockDriverState *bs,
148
return &acb->common;
149
}
150
151
-void *qemu_aio_get(const AIOCBInfo *aiocb_info, BlockDriverState *bs,
152
- BlockCompletionFunc *cb, void *opaque)
153
-{
154
- BlockAIOCB *acb;
155
-
156
- acb = g_malloc(aiocb_info->aiocb_size);
157
- acb->aiocb_info = aiocb_info;
158
- acb->bs = bs;
159
- acb->cb = cb;
160
- acb->opaque = opaque;
161
- acb->refcnt = 1;
162
- return acb;
163
-}
164
-
165
-void qemu_aio_ref(void *p)
166
-{
167
- BlockAIOCB *acb = p;
168
- acb->refcnt++;
169
-}
170
-
171
-void qemu_aio_unref(void *p)
172
-{
173
- BlockAIOCB *acb = p;
174
- assert(acb->refcnt > 0);
175
- if (--acb->refcnt == 0) {
176
- g_free(acb);
177
- }
178
-}
179
-
180
/**************************************************************/
181
/* Coroutine block device emulation */
182
183
diff --git a/stubs/linux-aio.c b/stubs/linux-aio.c
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/stubs/linux-aio.c
@@ -XXX,XX +XXX,XX @@
+/*
+ * Linux native AIO support.
+ *
+ * Copyright (C) 2009 IBM, Corp.
+ * Copyright (C) 2009 Red Hat, Inc.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+#include "qemu/osdep.h"
+#include "block/aio.h"
+#include "block/raw-aio.h"
+
+void laio_detach_aio_context(LinuxAioState *s, AioContext *old_context)
+{
+    abort();
+}
+
+void laio_attach_aio_context(LinuxAioState *s, AioContext *new_context)
+{
+    abort();
+}
+
+LinuxAioState *laio_init(void)
+{
+    abort();
+}
+
+void laio_cleanup(LinuxAioState *s)
+{
+    abort();
+}
diff --git a/util/block-helpers.h b/util/block-helpers.h
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/util/block-helpers.h
@@ -XXX,XX +XXX,XX @@
+#ifndef BLOCK_HELPERS_H
+#define BLOCK_HELPERS_H
+
+#include "qemu/units.h"
+
+/* lower limit is sector size */
+#define MIN_BLOCK_SIZE          INT64_C(512)
+#define MIN_BLOCK_SIZE_STR      "512 B"
+/*
+ * upper limit is arbitrary, 2 MiB looks sufficient for all sensible uses, and
+ * matches qcow2 cluster size limit
+ */
+#define MAX_BLOCK_SIZE          (2 * MiB)
+#define MAX_BLOCK_SIZE_STR      "2 MiB"
+
+void check_block_size(const char *id, const char *name, int64_t value,
+                      Error **errp);
+
+#endif /* BLOCK_HELPERS_H */
diff --git a/hw/core/qdev-properties-system.c b/hw/core/qdev-properties-system.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/core/qdev-properties-system.c
+++ b/hw/core/qdev-properties-system.c
@@ -XXX,XX +XXX,XX @@
 #include "sysemu/blockdev.h"
 #include "net/net.h"
 #include "hw/pci/pci.h"
+#include "util/block-helpers.h"

 static bool check_prop_still_unset(DeviceState *dev, const char *name,
                                    const void *old_val, const char *new_val,
@@ -XXX,XX +XXX,XX @@ const PropertyInfo qdev_prop_losttickpolicy = {

 /* --- blocksize --- */

-/* lower limit is sector size */
-#define MIN_BLOCK_SIZE          512
-#define MIN_BLOCK_SIZE_STR      "512 B"
-/*
- * upper limit is arbitrary, 2 MiB looks sufficient for all sensible uses, and
- * matches qcow2 cluster size limit
- */
-#define MAX_BLOCK_SIZE          (2 * MiB)
-#define MAX_BLOCK_SIZE_STR      "2 MiB"
-
 static void set_blocksize(Object *obj, Visitor *v, const char *name,
                           void *opaque, Error **errp)
 {
@@ -XXX,XX +XXX,XX @@ static void set_blocksize(Object *obj, Visitor *v, const char *name,
     Property *prop = opaque;
     uint32_t *ptr = qdev_get_prop_ptr(dev, prop);
     uint64_t value;
+    Error *local_err = NULL;

     if (dev->realized) {
         qdev_prop_set_after_realize(dev, name, errp);
@@ -XXX,XX +XXX,XX @@ static void set_blocksize(Object *obj, Visitor *v, const char *name,
     if (!visit_type_size(v, name, &value, errp)) {
         return;
     }
-    /* value of 0 means "unset" */
-    if (value && (value < MIN_BLOCK_SIZE || value > MAX_BLOCK_SIZE)) {
-        error_setg(errp,
-                   "Property %s.%s doesn't take value %" PRIu64
-                   " (minimum: " MIN_BLOCK_SIZE_STR
-                   ", maximum: " MAX_BLOCK_SIZE_STR ")",
-                   dev->id ? : "", name, value);
+    check_block_size(dev->id ? : "", name, value, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
         return;
     }
-
-    /* We rely on power-of-2 blocksizes for bitmasks */
-    if ((value & (value - 1)) != 0) {
-        error_setg(errp,
-                   "Property %s.%s doesn't take value '%" PRId64 "', "
-                   "it's not a power of 2", dev->id ?: "", name, (int64_t)value);
-        return;
-    }
-
     *ptr = value;
 }

diff --git a/util/block-helpers.c b/util/block-helpers.c
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/util/block-helpers.c
@@ -XXX,XX +XXX,XX @@
+/*
+ * Block utility functions
+ *
+ * Copyright IBM, Corp. 2011
+ * Copyright (c) 2020 Coiby Xu <coiby.xu@gmail.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "qapi/qmp/qerror.h"
+#include "block-helpers.h"
+
+/**
+ * check_block_size:
+ * @id: The unique ID of the object
+ * @name: The name of the property being validated
+ * @value: The block size in bytes
+ * @errp: A pointer to an area to store an error
+ *
+ * This function checks that the block size meets the following conditions:
+ * 1. At least MIN_BLOCK_SIZE
+ * 2. No larger than MAX_BLOCK_SIZE
+ * 3. A power of 2
+ */
+void check_block_size(const char *id, const char *name, int64_t value,
+                      Error **errp)
+{
+    /* value of 0 means "unset" */
+    if (value && (value < MIN_BLOCK_SIZE || value > MAX_BLOCK_SIZE)) {
+        error_setg(errp, QERR_PROPERTY_VALUE_OUT_OF_RANGE,
+                   id, name, value, MIN_BLOCK_SIZE, MAX_BLOCK_SIZE);
+        return;
+    }
+
+    /* We rely on power-of-2 blocksizes for bitmasks */
+    if ((value & (value - 1)) != 0) {
+        error_setg(errp,
+                   "Property %s.%s doesn't take value '%" PRId64
+                   "', it's not a power of 2",
+                   id, name, value);
+        return;
+    }
+}
diff --git a/stubs/set-fd-handler.c b/stubs/set-fd-handler.c
index XXXXXXX..XXXXXXX 100644
--- a/stubs/set-fd-handler.c
+++ b/stubs/set-fd-handler.c
@@ -XXX,XX +XXX,XX @@ void qemu_set_fd_handler(int fd,
 {
     abort();
 }
-
-void aio_set_fd_handler(AioContext *ctx,
-                        int fd,
-                        bool is_external,
-                        IOHandler *io_read,
-                        IOHandler *io_write,
-                        AioPollFn *io_poll,
-                        void *opaque)
-{
-    abort();
-}
diff --git a/aio-posix.c b/util/aio-posix.c
similarity index 99%
rename from aio-posix.c
rename to util/aio-posix.c
index XXXXXXX..XXXXXXX 100644
--- a/aio-posix.c
+++ b/util/aio-posix.c
@@ -XXX,XX +XXX,XX @@
 #include "qemu/rcu_queue.h"
 #include "qemu/sockets.h"
 #include "qemu/cutils.h"
-#include "trace-root.h"
+#include "trace.h"
 #ifdef CONFIG_EPOLL_CREATE1
 #include <sys/epoll.h>
 #endif
diff --git a/aio-win32.c b/util/aio-win32.c
similarity index 100%
rename from aio-win32.c
rename to util/aio-win32.c
diff --git a/util/aiocb.c b/util/aiocb.c
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/util/aiocb.c
@@ -XXX,XX +XXX,XX @@
+/*
+ * BlockAIOCB allocation
+ *
+ * Copyright (c) 2003-2017 Fabrice Bellard and other QEMU contributors
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+#include "qemu/osdep.h"
+#include "block/aio.h"
+
+void *qemu_aio_get(const AIOCBInfo *aiocb_info, BlockDriverState *bs,
+                   BlockCompletionFunc *cb, void *opaque)
+{
+    BlockAIOCB *acb;
+
+    acb = g_malloc(aiocb_info->aiocb_size);
+    acb->aiocb_info = aiocb_info;
+    acb->bs = bs;
+    acb->cb = cb;
+    acb->opaque = opaque;
+    acb->refcnt = 1;
+    return acb;
+}
+
+void qemu_aio_ref(void *p)
+{
+    BlockAIOCB *acb = p;
+    acb->refcnt++;
+}
+
+void qemu_aio_unref(void *p)
+{
+    BlockAIOCB *acb = p;
+    assert(acb->refcnt > 0);
+    if (--acb->refcnt == 0) {
+        g_free(acb);
+    }
+}
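For context, a hedged sketch of how a property setter consumes the check_block_size() helper factored out earlier in this view; the device type and field names here are invented for illustration and follow the qdev set_blocksize() pattern:

/* Illustrative caller of check_block_size(); MyDev is hypothetical. */
static void demo_set_logical_block_size(MyDev *dev, int64_t value,
                                        Error **errp)
{
    Error *local_err = NULL;

    check_block_size("mydev0" /* id */, "logical-block-size", value,
                     &local_err);
    if (local_err) {
        error_propagate(errp, local_err);
        return;
    }
    /* 0 means "unset"; otherwise a power of 2 between 512 B and 2 MiB. */
    dev->blk_size = value;
}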
diff --git a/async.c b/util/async.c
similarity index 99%
rename from async.c
rename to util/async.c
index XXXXXXX..XXXXXXX 100644
--- a/async.c
+++ b/util/async.c
@@ -XXX,XX +XXX,XX @@
 /*
- * QEMU System Emulator
+ * Data plane event loop
 *
 * Copyright (c) 2003-2008 Fabrice Bellard
+ * Copyright (c) 2009-2017 QEMU contributors
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
diff --git a/util/meson.build b/util/meson.build
index XXXXXXX..XXXXXXX 100644
--- a/util/meson.build
+++ b/util/meson.build
@@ -XXX,XX +XXX,XX @@ if have_block
   util_ss.add(files('nvdimm-utils.c'))
   util_ss.add(files('qemu-coroutine.c', 'qemu-coroutine-lock.c', 'qemu-coroutine-io.c'))
   util_ss.add(when: 'CONFIG_LINUX', if_true: files('vhost-user-server.c'))
+  util_ss.add(files('block-helpers.c'))
   util_ss.add(files('qemu-coroutine-sleep.c'))
   util_ss.add(files('qemu-co-shared-resource.c'))
   util_ss.add(files('thread-pool.c', 'qemu-timer.c'))
diff --git a/iohandler.c b/util/iohandler.c
similarity index 100%
rename from iohandler.c
rename to util/iohandler.c
diff --git a/main-loop.c b/util/main-loop.c
similarity index 100%
rename from main-loop.c
rename to util/main-loop.c
diff --git a/qemu-timer.c b/util/qemu-timer.c
similarity index 100%
rename from qemu-timer.c
rename to util/qemu-timer.c
diff --git a/thread-pool.c b/util/thread-pool.c
similarity index 99%
rename from thread-pool.c
rename to util/thread-pool.c
index XXXXXXX..XXXXXXX 100644
--- a/thread-pool.c
+++ b/util/thread-pool.c
@@ -XXX,XX +XXX,XX @@
 #include "qemu/queue.h"
 #include "qemu/thread.h"
 #include "qemu/coroutine.h"
-#include "trace-root.h"
+#include "trace.h"
 #include "block/thread-pool.h"
 #include "qemu/main-loop.h"

diff --git a/trace-events b/trace-events
index XXXXXXX..XXXXXXX 100644
--- a/trace-events
+++ b/trace-events
@@ -XXX,XX +XXX,XX @@
 #
 # The <format-string> should be a sprintf()-compatible format string.

-# aio-posix.c
-run_poll_handlers_begin(void *ctx, int64_t max_ns) "ctx %p max_ns %"PRId64
-run_poll_handlers_end(void *ctx, bool progress) "ctx %p progress %d"
-poll_shrink(void *ctx, int64_t old, int64_t new) "ctx %p old %"PRId64" new %"PRId64
-poll_grow(void *ctx, int64_t old, int64_t new) "ctx %p old %"PRId64" new %"PRId64
-
-# thread-pool.c
-thread_pool_submit(void *pool, void *req, void *opaque) "pool %p req %p opaque %p"
-thread_pool_complete(void *pool, void *req, void *opaque, int ret) "pool %p req %p opaque %p ret %d"
-thread_pool_cancel(void *req, void *opaque) "req %p opaque %p"
-
 # ioport.c
 cpu_in(unsigned int addr, char size, unsigned int val) "addr %#x(%c) value %u"
 cpu_out(unsigned int addr, char size, unsigned int val) "addr %#x(%c) value %u"
diff --git a/util/trace-events b/util/trace-events
index XXXXXXX..XXXXXXX 100644
--- a/util/trace-events
+++ b/util/trace-events
@@ -XXX,XX +XXX,XX @@
 # See docs/tracing.txt for syntax documentation.

+# util/aio-posix.c
+run_poll_handlers_begin(void *ctx, int64_t max_ns) "ctx %p max_ns %"PRId64
+run_poll_handlers_end(void *ctx, bool progress) "ctx %p progress %d"
+poll_shrink(void *ctx, int64_t old, int64_t new) "ctx %p old %"PRId64" new %"PRId64
+poll_grow(void *ctx, int64_t old, int64_t new) "ctx %p old %"PRId64" new %"PRId64
+
+# util/thread-pool.c
+thread_pool_submit(void *pool, void *req, void *opaque) "pool %p req %p opaque %p"
+thread_pool_complete(void *pool, void *req, void *opaque, int ret) "pool %p req %p opaque %p ret %d"
+thread_pool_cancel(void *req, void *opaque) "req %p opaque %p"
+
 # util/buffer.c
 buffer_resize(const char *buf, size_t olen, size_t len) "%s: old %zd, new %zd"
 buffer_move_empty(const char *buf, size_t len, const char *from) "%s: %zd bytes from %s"
--
2.9.3

--
2.26.2
diff view generated by jsdifflib
From: Paolo Bonzini <pbonzini@redhat.com>

aio_co_wake provides the infrastructure to start a coroutine on a "home"
AioContext. It will be used by CoMutex and CoQueue, so that coroutines
don't jump from one context to another when they go to sleep on a
mutex or waitqueue. However, it can also be used as a more efficient
alternative to one-shot bottom halves, and saves the effort of tracking
which AioContext a coroutine is running on.

aio_co_schedule is the part of aio_co_wake that starts a coroutine
on a remote AioContext, but it is also useful to implement e.g.
bdrv_set_aio_context callbacks.

The implementation of aio_co_schedule is based on a lock-free
multiple-producer, single-consumer queue. The multiple producers use
cmpxchg to add to a LIFO stack. The consumer (a per-AioContext bottom
half) grabs all items added so far, inverts the list to make it FIFO,
and goes through it one item at a time until it's empty. The data
structure was inspired by OSv, which uses it in the very code we'll
"port" to QEMU for the thread-safe CoMutex.

Most of the new code is really tests.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20170213135235.12274-3-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tests/Makefile.include       |   8 +-
 include/block/aio.h          |  32 +++++++
 include/qemu/coroutine_int.h |  11 ++-
 tests/iothread.h             |  25 +++++
 tests/iothread.c             |  91 ++++++++++++++++++
 tests/test-aio-multithread.c | 213 +++++++++++++++++++++++++++++++++++++++
 util/async.c                 |  65 +++++++++++++
 util/qemu-coroutine.c        |   8 ++
 util/trace-events            |   4 +
 9 files changed, 453 insertions(+), 4 deletions(-)
 create mode 100644 tests/iothread.h
 create mode 100644 tests/iothread.c
 create mode 100644 tests/test-aio-multithread.c

From: Coiby Xu <coiby.xu@gmail.com>

By making use of libvhost-user, a block drive can be shared with
the connected vhost-user client. Only one client can connect to the
server at a time.

Since vhost-user-server needs a block drive to be created first, delay
the creation of this object.

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Coiby Xu <coiby.xu@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-id: 20200918080912.321299-6-coiby.xu@gmail.com
[Shorten "vhost_user_blk_server" string to "vhost_user_blk" to avoid the
following compiler warning:
../block/export/vhost-user-blk-server.c:178:50: error: ‘%s’ directive output truncated writing 21 bytes into a region of size 20 [-Werror=format-truncation=]
and fix "Invalid size %ld ..." ssize_t format string arguments for
32-bit hosts.
--Stefan]
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/export/vhost-user-blk-server.h |  36 ++
 block/export/vhost-user-blk-server.c | 661 +++++++++++++++++++++++++++
 softmmu/vl.c                         |   4 +
 block/meson.build                    |   1 +
 4 files changed, 702 insertions(+)
 create mode 100644 block/export/vhost-user-blk-server.h
 create mode 100644 block/export/vhost-user-blk-server.c
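As an aside for readers, the lock-free scheme described in the first message can be sketched outside QEMU in a few lines of C11. This is an illustrative model only; the names and the node type are invented here, not taken from the patch:

#include <stdatomic.h>
#include <stddef.h>

/* Stand-in for Coroutine; only the intrusive link matters. */
struct node {
    struct node *next;
};

static _Atomic(struct node *) sched_head;   /* shared LIFO stack */

/* Producers (any thread): push with a cmpxchg loop. */
static void mpsc_push(struct node *n)
{
    struct node *old = atomic_load(&sched_head);
    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak(&sched_head, &old, n));
}

/* Single consumer (the bottom half): take everything at once and
 * reverse the LIFO chain so items run in FIFO order. */
static struct node *mpsc_take_all_fifo(void)
{
    struct node *lifo = atomic_exchange(&sched_head, NULL);
    struct node *fifo = NULL;

    while (lifo) {
        struct node *next = lifo->next;
        lifo->next = fifo;
        fifo = lifo;
        lifo = next;
    }
    return fifo;
}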
diff --git a/tests/Makefile.include b/tests/Makefile.include
32
diff --git a/block/export/vhost-user-blk-server.h b/block/export/vhost-user-blk-server.h
44
index XXXXXXX..XXXXXXX 100644
45
--- a/tests/Makefile.include
46
+++ b/tests/Makefile.include
47
@@ -XXX,XX +XXX,XX @@ check-unit-y += tests/test-aio$(EXESUF)
48
gcov-files-test-aio-y = util/async.c util/qemu-timer.o
49
gcov-files-test-aio-$(CONFIG_WIN32) += util/aio-win32.c
50
gcov-files-test-aio-$(CONFIG_POSIX) += util/aio-posix.c
51
+check-unit-y += tests/test-aio-multithread$(EXESUF)
52
+gcov-files-test-aio-multithread-y = $(gcov-files-test-aio-y)
53
+gcov-files-test-aio-multithread-y += util/qemu-coroutine.c tests/iothread.c
54
check-unit-y += tests/test-throttle$(EXESUF)
55
-gcov-files-test-aio-$(CONFIG_WIN32) = aio-win32.c
56
-gcov-files-test-aio-$(CONFIG_POSIX) = aio-posix.c
57
check-unit-y += tests/test-thread-pool$(EXESUF)
58
gcov-files-test-thread-pool-y = thread-pool.c
59
gcov-files-test-hbitmap-y = util/hbitmap.c
60
@@ -XXX,XX +XXX,XX @@ test-qapi-obj-y = tests/test-qapi-visit.o tests/test-qapi-types.o \
61
    $(test-qom-obj-y)
62
test-crypto-obj-y = $(crypto-obj-y) $(test-qom-obj-y)
63
test-io-obj-y = $(io-obj-y) $(test-crypto-obj-y)
64
-test-block-obj-y = $(block-obj-y) $(test-io-obj-y)
65
+test-block-obj-y = $(block-obj-y) $(test-io-obj-y) tests/iothread.o
66
67
tests/check-qint$(EXESUF): tests/check-qint.o $(test-util-obj-y)
68
tests/check-qstring$(EXESUF): tests/check-qstring.o $(test-util-obj-y)
69
@@ -XXX,XX +XXX,XX @@ tests/check-qom-proplist$(EXESUF): tests/check-qom-proplist.o $(test-qom-obj-y)
70
tests/test-char$(EXESUF): tests/test-char.o $(test-util-obj-y) $(qtest-obj-y) $(test-io-obj-y) $(chardev-obj-y)
71
tests/test-coroutine$(EXESUF): tests/test-coroutine.o $(test-block-obj-y)
72
tests/test-aio$(EXESUF): tests/test-aio.o $(test-block-obj-y)
73
+tests/test-aio-multithread$(EXESUF): tests/test-aio-multithread.o $(test-block-obj-y)
74
tests/test-throttle$(EXESUF): tests/test-throttle.o $(test-block-obj-y)
75
tests/test-blockjob$(EXESUF): tests/test-blockjob.o $(test-block-obj-y) $(test-util-obj-y)
76
tests/test-blockjob-txn$(EXESUF): tests/test-blockjob-txn.o $(test-block-obj-y) $(test-util-obj-y)
77
diff --git a/include/block/aio.h b/include/block/aio.h
78
index XXXXXXX..XXXXXXX 100644
79
--- a/include/block/aio.h
80
+++ b/include/block/aio.h
81
@@ -XXX,XX +XXX,XX @@ typedef void QEMUBHFunc(void *opaque);
82
typedef bool AioPollFn(void *opaque);
83
typedef void IOHandler(void *opaque);
84
85
+struct Coroutine;
86
struct ThreadPool;
87
struct LinuxAioState;
88
89
@@ -XXX,XX +XXX,XX @@ struct AioContext {
90
bool notified;
91
EventNotifier notifier;
92
93
+ QSLIST_HEAD(, Coroutine) scheduled_coroutines;
94
+ QEMUBH *co_schedule_bh;
95
+
96
/* Thread pool for performing work and receiving completion callbacks.
97
* Has its own locking.
98
*/
99
@@ -XXX,XX +XXX,XX @@ static inline bool aio_node_check(AioContext *ctx, bool is_external)
100
}
101
102
/**
103
+ * aio_co_schedule:
104
+ * @ctx: the aio context
105
+ * @co: the coroutine
106
+ *
107
+ * Start a coroutine on a remote AioContext.
108
+ *
109
+ * The coroutine must not be entered by anyone else while aio_co_schedule()
110
+ * is active. In addition the coroutine must have yielded unless ctx
111
+ * is the context in which the coroutine is running (i.e. the value of
112
+ * qemu_get_current_aio_context() from the coroutine itself).
113
+ */
114
+void aio_co_schedule(AioContext *ctx, struct Coroutine *co);
115
+
116
+/**
117
+ * aio_co_wake:
118
+ * @co: the coroutine
119
+ *
120
+ * Restart a coroutine on the AioContext where it was running last, thus
121
+ * preventing coroutines from jumping from one context to another when they
122
+ * go to sleep.
123
+ *
124
+ * aio_co_wake may be executed either in coroutine or non-coroutine
125
+ * context. The coroutine must not be entered by anyone else while
126
+ * aio_co_wake() is active.
127
+ */
128
+void aio_co_wake(struct Coroutine *co);
129
+
130
+/**
131
* Return the AioContext whose event loop runs in the current thread.
132
*
133
* If called from an IOThread this will be the IOThread's AioContext. If
134
diff --git a/include/qemu/coroutine_int.h b/include/qemu/coroutine_int.h
135
index XXXXXXX..XXXXXXX 100644
136
--- a/include/qemu/coroutine_int.h
137
+++ b/include/qemu/coroutine_int.h
138
@@ -XXX,XX +XXX,XX @@ struct Coroutine {
139
CoroutineEntry *entry;
140
void *entry_arg;
141
Coroutine *caller;
142
+
143
+ /* Only used when the coroutine has terminated. */
144
QSLIST_ENTRY(Coroutine) pool_next;
145
+
146
size_t locks_held;
147
148
- /* Coroutines that should be woken up when we yield or terminate */
149
+ /* Coroutines that should be woken up when we yield or terminate.
150
+ * Only used when the coroutine is running.
151
+ */
152
QSIMPLEQ_HEAD(, Coroutine) co_queue_wakeup;
153
+
154
+ /* Only used when the coroutine has yielded. */
155
+ AioContext *ctx;
156
QSIMPLEQ_ENTRY(Coroutine) co_queue_next;
157
+ QSLIST_ENTRY(Coroutine) co_scheduled_next;
158
};
159
160
Coroutine *qemu_coroutine_new(void);
161
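To make the aio_co_schedule()/aio_co_wake() API above concrete, here is a hedged sketch of the intended wake-up pattern. It is not part of the patch, the type and function names are invented, and proper synchronization around data_ready is elided for brevity:

/* A coroutine parks itself; any thread later restarts it on its home
 * AioContext with aio_co_wake().  Illustrative only. */
typedef struct {
    Coroutine *co;      /* set by the coroutine before yielding */
    bool data_ready;
} DemoWaiter;

static void coroutine_fn demo_wait_for_data(void *opaque)
{
    DemoWaiter *w = opaque;

    w->co = qemu_coroutine_self();
    while (!w->data_ready) {
        qemu_coroutine_yield();   /* resumed by aio_co_wake() below */
    }
    /* ...consume the data on the coroutine's home AioContext... */
}

static void demo_notify(DemoWaiter *w)
{
    w->data_ready = true;
    aio_co_wake(w->co);   /* reschedules w->co where it last ran */
}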
diff --git a/tests/iothread.h b/tests/iothread.h
162
new file mode 100644
33
new file mode 100644
163
index XXXXXXX..XXXXXXX
34
index XXXXXXX..XXXXXXX
164
--- /dev/null
35
--- /dev/null
165
+++ b/tests/iothread.h
36
+++ b/block/export/vhost-user-blk-server.h
166
@@ -XXX,XX +XXX,XX @@
37
@@ -XXX,XX +XXX,XX @@
167
+/*
38
+/*
168
+ * Event loop thread implementation for unit tests
39
+ * Sharing QEMU block devices via vhost-user protocol
169
+ *
40
+ *
170
+ * Copyright Red Hat Inc., 2013, 2016
41
+ * Copyright (c) Coiby Xu <coiby.xu@gmail.com>.
42
+ * Copyright (c) 2020 Red Hat, Inc.
171
+ *
43
+ *
172
+ * Authors:
44
+ * This work is licensed under the terms of the GNU GPL, version 2 or
173
+ * Stefan Hajnoczi <stefanha@redhat.com>
45
+ * later. See the COPYING file in the top-level directory.
174
+ * Paolo Bonzini <pbonzini@redhat.com>
175
+ *
176
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
177
+ * See the COPYING file in the top-level directory.
178
+ */
46
+ */
179
+#ifndef TEST_IOTHREAD_H
47
+
180
+#define TEST_IOTHREAD_H
48
+#ifndef VHOST_USER_BLK_SERVER_H
181
+
49
+#define VHOST_USER_BLK_SERVER_H
182
+#include "block/aio.h"
50
+#include "util/vhost-user-server.h"
183
+#include "qemu/thread.h"
51
+
184
+
52
+typedef struct VuBlockDev VuBlockDev;
185
+typedef struct IOThread IOThread;
53
+#define TYPE_VHOST_USER_BLK_SERVER "vhost-user-blk-server"
186
+
54
+#define VHOST_USER_BLK_SERVER(obj) \
187
+IOThread *iothread_new(void);
55
+ OBJECT_CHECK(VuBlockDev, obj, TYPE_VHOST_USER_BLK_SERVER)
188
+void iothread_join(IOThread *iothread);
56
+
189
+AioContext *iothread_get_aio_context(IOThread *iothread);
57
+/* vhost user block device */
190
+
58
+struct VuBlockDev {
191
+#endif
59
+ Object parent_obj;
192
diff --git a/tests/iothread.c b/tests/iothread.c
60
+ char *node_name;
61
+ SocketAddress *addr;
62
+ AioContext *ctx;
63
+ VuServer vu_server;
64
+ bool running;
65
+ uint32_t blk_size;
66
+ BlockBackend *backend;
67
+ QIOChannelSocket *sioc;
68
+ QTAILQ_ENTRY(VuBlockDev) next;
69
+ struct virtio_blk_config blkcfg;
70
+ bool writable;
71
+};
72
+
73
+#endif /* VHOST_USER_BLK_SERVER_H */
74
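As a rough illustration of how the user-creatable object declared above might be instantiated from C (in practice it is created with "-object vhost-user-blk-server,..."): this snippet is hypothetical, the node name, id and socket path are invented, and error handling is abbreviated. Note that object_new_with_props() also runs the UserCreatable complete() hook, which is what starts the server in this patch:

#include "qemu/osdep.h"
#include "qapi/error.h"
#include "qom/object_interfaces.h"

static void demo_create_vu_blk_server(Error **errp)
{
    Object *obj;

    /* Creates the object, sets its properties and completes it. */
    obj = object_new_with_props(TYPE_VHOST_USER_BLK_SERVER,
                                object_get_objects_root(), "vu-disk0", errp,
                                "node-name", "disk0",
                                "unix-socket", "/tmp/vu-blk.sock",
                                "writable", "on",
                                NULL);
    if (obj) {
        object_unref(obj);   /* the objects root keeps its own reference */
    }
}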
diff --git a/block/export/vhost-user-blk-server.c b/block/export/vhost-user-blk-server.c
193
new file mode 100644
75
new file mode 100644
194
index XXXXXXX..XXXXXXX
76
index XXXXXXX..XXXXXXX
195
--- /dev/null
77
--- /dev/null
196
+++ b/tests/iothread.c
78
+++ b/block/export/vhost-user-blk-server.c
197
@@ -XXX,XX +XXX,XX @@
79
@@ -XXX,XX +XXX,XX @@
198
+/*
80
+/*
199
+ * Event loop thread implementation for unit tests
81
+ * Sharing QEMU block devices via vhost-user protocol
200
+ *
82
+ *
201
+ * Copyright Red Hat Inc., 2013, 2016
83
+ * Parts of the code based on nbd/server.c.
202
+ *
84
+ *
203
+ * Authors:
85
+ * Copyright (c) Coiby Xu <coiby.xu@gmail.com>.
204
+ * Stefan Hajnoczi <stefanha@redhat.com>
86
+ * Copyright (c) 2020 Red Hat, Inc.
205
+ * Paolo Bonzini <pbonzini@redhat.com>
206
+ *
87
+ *
207
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
88
+ * This work is licensed under the terms of the GNU GPL, version 2 or
208
+ * See the COPYING file in the top-level directory.
89
+ * later. See the COPYING file in the top-level directory.
90
+ */
91
+#include "qemu/osdep.h"
92
+#include "block/block.h"
93
+#include "vhost-user-blk-server.h"
94
+#include "qapi/error.h"
95
+#include "qom/object_interfaces.h"
96
+#include "sysemu/block-backend.h"
97
+#include "util/block-helpers.h"
98
+
99
+enum {
100
+ VHOST_USER_BLK_MAX_QUEUES = 1,
101
+};
102
+struct virtio_blk_inhdr {
103
+ unsigned char status;
104
+};
105
+
106
+typedef struct VuBlockReq {
107
+ VuVirtqElement *elem;
108
+ int64_t sector_num;
109
+ size_t size;
110
+ struct virtio_blk_inhdr *in;
111
+ struct virtio_blk_outhdr out;
112
+ VuServer *server;
113
+ struct VuVirtq *vq;
114
+} VuBlockReq;
115
+
116
+static void vu_block_req_complete(VuBlockReq *req)
117
+{
118
+ VuDev *vu_dev = &req->server->vu_dev;
119
+
120
+ /* IO size with 1 extra status byte */
121
+ vu_queue_push(vu_dev, req->vq, req->elem, req->size + 1);
122
+ vu_queue_notify(vu_dev, req->vq);
123
+
124
+ if (req->elem) {
125
+ free(req->elem);
126
+ }
127
+
128
+ g_free(req);
129
+}
130
+
131
+static VuBlockDev *get_vu_block_device_by_server(VuServer *server)
132
+{
133
+ return container_of(server, VuBlockDev, vu_server);
134
+}
135
+
136
+static int coroutine_fn
137
+vu_block_discard_write_zeroes(VuBlockReq *req, struct iovec *iov,
138
+ uint32_t iovcnt, uint32_t type)
139
+{
140
+ struct virtio_blk_discard_write_zeroes desc;
141
+ ssize_t size = iov_to_buf(iov, iovcnt, 0, &desc, sizeof(desc));
142
+ if (unlikely(size != sizeof(desc))) {
143
+ error_report("Invalid size %zd, expect %zu", size, sizeof(desc));
144
+ return -EINVAL;
145
+ }
146
+
147
+ VuBlockDev *vdev_blk = get_vu_block_device_by_server(req->server);
148
+ uint64_t range[2] = { le64_to_cpu(desc.sector) << 9,
149
+ le32_to_cpu(desc.num_sectors) << 9 };
150
+ if (type == VIRTIO_BLK_T_DISCARD) {
151
+ if (blk_co_pdiscard(vdev_blk->backend, range[0], range[1]) == 0) {
152
+ return 0;
153
+ }
154
+ } else if (type == VIRTIO_BLK_T_WRITE_ZEROES) {
155
+ if (blk_co_pwrite_zeroes(vdev_blk->backend,
156
+ range[0], range[1], 0) == 0) {
157
+ return 0;
158
+ }
159
+ }
160
+
161
+ return -EINVAL;
162
+}
163
+
164
+static void coroutine_fn vu_block_flush(VuBlockReq *req)
165
+{
166
+ VuBlockDev *vdev_blk = get_vu_block_device_by_server(req->server);
167
+ BlockBackend *backend = vdev_blk->backend;
168
+ blk_co_flush(backend);
169
+}
170
+
171
+struct req_data {
172
+ VuServer *server;
173
+ VuVirtq *vq;
174
+ VuVirtqElement *elem;
175
+};
176
+
177
+static void coroutine_fn vu_block_virtio_process_req(void *opaque)
178
+{
179
+ struct req_data *data = opaque;
180
+ VuServer *server = data->server;
181
+ VuVirtq *vq = data->vq;
182
+ VuVirtqElement *elem = data->elem;
183
+ uint32_t type;
184
+ VuBlockReq *req;
185
+
186
+ VuBlockDev *vdev_blk = get_vu_block_device_by_server(server);
187
+ BlockBackend *backend = vdev_blk->backend;
188
+
189
+ struct iovec *in_iov = elem->in_sg;
190
+ struct iovec *out_iov = elem->out_sg;
191
+ unsigned in_num = elem->in_num;
192
+ unsigned out_num = elem->out_num;
193
+ /* refer to hw/block/virtio_blk.c */
194
+ if (elem->out_num < 1 || elem->in_num < 1) {
195
+ error_report("virtio-blk request missing headers");
196
+ free(elem);
197
+ return;
198
+ }
199
+
200
+ req = g_new0(VuBlockReq, 1);
201
+ req->server = server;
202
+ req->vq = vq;
203
+ req->elem = elem;
204
+
205
+ if (unlikely(iov_to_buf(out_iov, out_num, 0, &req->out,
206
+ sizeof(req->out)) != sizeof(req->out))) {
207
+ error_report("virtio-blk request outhdr too short");
208
+ goto err;
209
+ }
210
+
211
+ iov_discard_front(&out_iov, &out_num, sizeof(req->out));
212
+
213
+ if (in_iov[in_num - 1].iov_len < sizeof(struct virtio_blk_inhdr)) {
214
+ error_report("virtio-blk request inhdr too short");
215
+ goto err;
216
+ }
217
+
218
+ /* We always touch the last byte, so just see how big in_iov is. */
219
+ req->in = (void *)in_iov[in_num - 1].iov_base
220
+ + in_iov[in_num - 1].iov_len
221
+ - sizeof(struct virtio_blk_inhdr);
222
+ iov_discard_back(in_iov, &in_num, sizeof(struct virtio_blk_inhdr));
223
+
224
+ type = le32_to_cpu(req->out.type);
225
+ switch (type & ~VIRTIO_BLK_T_BARRIER) {
226
+ case VIRTIO_BLK_T_IN:
227
+ case VIRTIO_BLK_T_OUT: {
228
+ ssize_t ret = 0;
229
+ bool is_write = type & VIRTIO_BLK_T_OUT;
230
+ req->sector_num = le64_to_cpu(req->out.sector);
231
+
232
+ int64_t offset = req->sector_num * vdev_blk->blk_size;
233
+ QEMUIOVector qiov;
234
+ if (is_write) {
235
+ qemu_iovec_init_external(&qiov, out_iov, out_num);
236
+ ret = blk_co_pwritev(backend, offset, qiov.size,
237
+ &qiov, 0);
238
+ } else {
239
+ qemu_iovec_init_external(&qiov, in_iov, in_num);
240
+ ret = blk_co_preadv(backend, offset, qiov.size,
241
+ &qiov, 0);
242
+ }
243
+ if (ret >= 0) {
244
+ req->in->status = VIRTIO_BLK_S_OK;
245
+ } else {
246
+ req->in->status = VIRTIO_BLK_S_IOERR;
247
+ }
248
+ break;
249
+ }
250
+ case VIRTIO_BLK_T_FLUSH:
251
+ vu_block_flush(req);
252
+ req->in->status = VIRTIO_BLK_S_OK;
253
+ break;
254
+ case VIRTIO_BLK_T_GET_ID: {
255
+ size_t size = MIN(iov_size(&elem->in_sg[0], in_num),
256
+ VIRTIO_BLK_ID_BYTES);
257
+ snprintf(elem->in_sg[0].iov_base, size, "%s", "vhost_user_blk");
258
+ req->in->status = VIRTIO_BLK_S_OK;
259
+ req->size = elem->in_sg[0].iov_len;
260
+ break;
261
+ }
262
+ case VIRTIO_BLK_T_DISCARD:
263
+ case VIRTIO_BLK_T_WRITE_ZEROES: {
264
+ int rc;
265
+ rc = vu_block_discard_write_zeroes(req, &elem->out_sg[1],
266
+ out_num, type);
267
+ if (rc == 0) {
268
+ req->in->status = VIRTIO_BLK_S_OK;
269
+ } else {
270
+ req->in->status = VIRTIO_BLK_S_IOERR;
271
+ }
272
+ break;
273
+ }
274
+ default:
275
+ req->in->status = VIRTIO_BLK_S_UNSUPP;
276
+ break;
277
+ }
278
+
279
+ vu_block_req_complete(req);
280
+ return;
281
+
282
+err:
283
+ free(elem);
284
+ g_free(req);
285
+ return;
286
+}
287
+
288
+static void vu_block_process_vq(VuDev *vu_dev, int idx)
289
+{
290
+ VuServer *server;
291
+ VuVirtq *vq;
292
+ struct req_data *req_data;
293
+
294
+ server = container_of(vu_dev, VuServer, vu_dev);
295
+ assert(server);
296
+
297
+ vq = vu_get_queue(vu_dev, idx);
298
+ assert(vq);
299
+ VuVirtqElement *elem;
300
+ while (1) {
301
+ elem = vu_queue_pop(vu_dev, vq, sizeof(VuVirtqElement) +
302
+ sizeof(VuBlockReq));
303
+ if (elem) {
304
+ req_data = g_new0(struct req_data, 1);
305
+ req_data->server = server;
306
+ req_data->vq = vq;
307
+ req_data->elem = elem;
308
+ Coroutine *co = qemu_coroutine_create(vu_block_virtio_process_req,
309
+ req_data);
310
+ aio_co_enter(server->ioc->ctx, co);
311
+ } else {
312
+ break;
313
+ }
314
+ }
315
+}
316
+
317
+static void vu_block_queue_set_started(VuDev *vu_dev, int idx, bool started)
318
+{
319
+ VuVirtq *vq;
320
+
321
+ assert(vu_dev);
322
+
323
+ vq = vu_get_queue(vu_dev, idx);
324
+ vu_set_queue_handler(vu_dev, vq, started ? vu_block_process_vq : NULL);
325
+}
326
+
327
+static uint64_t vu_block_get_features(VuDev *dev)
328
+{
329
+ uint64_t features;
330
+ VuServer *server = container_of(dev, VuServer, vu_dev);
331
+ VuBlockDev *vdev_blk = get_vu_block_device_by_server(server);
332
+ features = 1ull << VIRTIO_BLK_F_SIZE_MAX |
333
+ 1ull << VIRTIO_BLK_F_SEG_MAX |
334
+ 1ull << VIRTIO_BLK_F_TOPOLOGY |
335
+ 1ull << VIRTIO_BLK_F_BLK_SIZE |
336
+ 1ull << VIRTIO_BLK_F_FLUSH |
337
+ 1ull << VIRTIO_BLK_F_DISCARD |
338
+ 1ull << VIRTIO_BLK_F_WRITE_ZEROES |
339
+ 1ull << VIRTIO_BLK_F_CONFIG_WCE |
340
+ 1ull << VIRTIO_F_VERSION_1 |
341
+ 1ull << VIRTIO_RING_F_INDIRECT_DESC |
342
+ 1ull << VIRTIO_RING_F_EVENT_IDX |
343
+ 1ull << VHOST_USER_F_PROTOCOL_FEATURES;
344
+
345
+ if (!vdev_blk->writable) {
346
+ features |= 1ull << VIRTIO_BLK_F_RO;
347
+ }
348
+
349
+ return features;
350
+}
351
+
352
+static uint64_t vu_block_get_protocol_features(VuDev *dev)
353
+{
354
+ return 1ull << VHOST_USER_PROTOCOL_F_CONFIG |
355
+ 1ull << VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD;
356
+}
357
+
358
+static int
359
+vu_block_get_config(VuDev *vu_dev, uint8_t *config, uint32_t len)
360
+{
361
+ VuServer *server = container_of(vu_dev, VuServer, vu_dev);
362
+ VuBlockDev *vdev_blk = get_vu_block_device_by_server(server);
363
+ memcpy(config, &vdev_blk->blkcfg, len);
364
+
365
+ return 0;
366
+}
367
+
368
+static int
369
+vu_block_set_config(VuDev *vu_dev, const uint8_t *data,
370
+ uint32_t offset, uint32_t size, uint32_t flags)
371
+{
372
+ VuServer *server = container_of(vu_dev, VuServer, vu_dev);
373
+ VuBlockDev *vdev_blk = get_vu_block_device_by_server(server);
374
+ uint8_t wce;
375
+
376
+ /* don't support live migration */
377
+ if (flags != VHOST_SET_CONFIG_TYPE_MASTER) {
378
+ return -EINVAL;
379
+ }
380
+
381
+ if (offset != offsetof(struct virtio_blk_config, wce) ||
382
+ size != 1) {
383
+ return -EINVAL;
384
+ }
385
+
386
+ wce = *data;
387
+ vdev_blk->blkcfg.wce = wce;
388
+ blk_set_enable_write_cache(vdev_blk->backend, wce);
389
+ return 0;
390
+}
391
+
392
+/*
393
+ * When the client disconnects, it sends a VHOST_USER_NONE request
+ * and vu_process_message will simply call exit, which causes the VM
+ * to exit abruptly.
+ * To avoid this issue, process the VHOST_USER_NONE request ahead
+ * of vu_process_message.
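+ * (Returning true from the .process_msg callback below tells
+ * libvhost-user that the message was fully handled here, and
+ * dev->panic() marks the connection for teardown instead of
+ * exit() killing the whole process.)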
209
+ *
398
+ *
210
+ */
399
+ */
211
+
400
+static int vu_block_process_msg(VuDev *dev, VhostUserMsg *vmsg, int *do_reply)
212
+#include "qemu/osdep.h"
401
+{
213
+#include "qapi/error.h"
402
+ if (vmsg->request == VHOST_USER_NONE) {
214
+#include "block/aio.h"
403
+ dev->panic(dev, "disconnect");
215
+#include "qemu/main-loop.h"
404
+ return true;
216
+#include "qemu/rcu.h"
405
+ }
217
+#include "iothread.h"
406
+ return false;
218
+
407
+}
219
+struct IOThread {
408
+
409
+static const VuDevIface vu_block_iface = {
410
+ .get_features = vu_block_get_features,
411
+ .queue_set_started = vu_block_queue_set_started,
412
+ .get_protocol_features = vu_block_get_protocol_features,
413
+ .get_config = vu_block_get_config,
414
+ .set_config = vu_block_set_config,
415
+ .process_msg = vu_block_process_msg,
416
+};
417
+
418
+static void blk_aio_attached(AioContext *ctx, void *opaque)
419
+{
420
+ VuBlockDev *vub_dev = opaque;
421
+ aio_context_acquire(ctx);
422
+ vhost_user_server_set_aio_context(&vub_dev->vu_server, ctx);
423
+ aio_context_release(ctx);
424
+}
425
+
426
+static void blk_aio_detach(void *opaque)
427
+{
428
+ VuBlockDev *vub_dev = opaque;
429
+ AioContext *ctx = vub_dev->vu_server.ctx;
430
+ aio_context_acquire(ctx);
431
+ vhost_user_server_set_aio_context(&vub_dev->vu_server, NULL);
432
+ aio_context_release(ctx);
433
+}
434
+
435
+static void
436
+vu_block_initialize_config(BlockDriverState *bs,
437
+ struct virtio_blk_config *config, uint32_t blk_size)
438
+{
439
+ config->capacity = bdrv_getlength(bs) >> BDRV_SECTOR_BITS;
440
+ config->blk_size = blk_size;
441
+ config->size_max = 0;
442
+ config->seg_max = 128 - 2;
443
+ config->min_io_size = 1;
444
+ config->opt_io_size = 1;
445
+ config->num_queues = VHOST_USER_BLK_MAX_QUEUES;
446
+ config->max_discard_sectors = 32768;
447
+ config->max_discard_seg = 1;
448
+ config->discard_sector_alignment = config->blk_size >> 9;
449
+ config->max_write_zeroes_sectors = 32768;
450
+ config->max_write_zeroes_seg = 1;
451
+}
452
+
453
+static VuBlockDev *vu_block_init(VuBlockDev *vu_block_device, Error **errp)
454
+{
455
+
456
+ BlockBackend *blk;
457
+ Error *local_error = NULL;
458
+ const char *node_name = vu_block_device->node_name;
459
+ bool writable = vu_block_device->writable;
460
+ uint64_t perm = BLK_PERM_CONSISTENT_READ;
461
+ int ret;
462
+
220
+ AioContext *ctx;
463
+ AioContext *ctx;
221
+
464
+
222
+ QemuThread thread;
465
+ BlockDriverState *bs = bdrv_lookup_bs(node_name, node_name, &local_error);
223
+ QemuMutex init_done_lock;
466
+
224
+ QemuCond init_done_cond; /* is thread initialization done? */
467
+ if (!bs) {
225
+ bool stopping;
468
+ error_propagate(errp, local_error);
469
+ return NULL;
470
+ }
471
+
472
+ if (bdrv_is_read_only(bs)) {
473
+ writable = false;
474
+ }
475
+
476
+ if (writable) {
477
+ perm |= BLK_PERM_WRITE;
478
+ }
479
+
480
+ ctx = bdrv_get_aio_context(bs);
481
+ aio_context_acquire(ctx);
482
+ bdrv_invalidate_cache(bs, NULL);
483
+ aio_context_release(ctx);
484
+
485
+ /*
486
+ * Don't allow resize while the vhost user server is running,
487
+ * otherwise we don't care what happens with the node.
488
+ */
489
+ blk = blk_new(bdrv_get_aio_context(bs), perm,
490
+ BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED |
491
+ BLK_PERM_WRITE | BLK_PERM_GRAPH_MOD);
492
+ ret = blk_insert_bs(blk, bs, errp);
493
+
494
+ if (ret < 0) {
495
+ goto fail;
496
+ }
497
+
498
+ blk_set_enable_write_cache(blk, false);
499
+
500
+ blk_set_allow_aio_context_change(blk, true);
501
+
502
+ vu_block_device->blkcfg.wce = 0;
503
+ vu_block_device->backend = blk;
504
+ if (!vu_block_device->blk_size) {
505
+ vu_block_device->blk_size = BDRV_SECTOR_SIZE;
506
+ }
507
+ vu_block_device->blkcfg.blk_size = vu_block_device->blk_size;
508
+ blk_set_guest_block_size(blk, vu_block_device->blk_size);
509
+ vu_block_initialize_config(bs, &vu_block_device->blkcfg,
510
+ vu_block_device->blk_size);
511
+ return vu_block_device;
512
+
513
+fail:
514
+ blk_unref(blk);
515
+ return NULL;
516
+}
517
+
518
+static void vu_block_deinit(VuBlockDev *vu_block_device)
519
+{
520
+ if (vu_block_device->backend) {
521
+ blk_remove_aio_context_notifier(vu_block_device->backend, blk_aio_attached,
522
+ blk_aio_detach, vu_block_device);
523
+ }
524
+
525
+ blk_unref(vu_block_device->backend);
526
+}
527
+
528
+static void vhost_user_blk_server_stop(VuBlockDev *vu_block_device)
529
+{
530
+ vhost_user_server_stop(&vu_block_device->vu_server);
531
+ vu_block_deinit(vu_block_device);
532
+}
533
+
534
+static void vhost_user_blk_server_start(VuBlockDev *vu_block_device,
535
+ Error **errp)
536
+{
537
+ AioContext *ctx;
538
+ SocketAddress *addr = vu_block_device->addr;
539
+
540
+ if (!vu_block_init(vu_block_device, errp)) {
541
+ return;
542
+ }
543
+
544
+ ctx = bdrv_get_aio_context(blk_bs(vu_block_device->backend));
545
+
546
+ if (!vhost_user_server_start(&vu_block_device->vu_server, addr, ctx,
547
+ VHOST_USER_BLK_MAX_QUEUES,
548
+ NULL, &vu_block_iface,
549
+ errp)) {
550
+ goto error;
551
+ }
552
+
553
+ blk_add_aio_context_notifier(vu_block_device->backend, blk_aio_attached,
554
+ blk_aio_detach, vu_block_device);
555
+ vu_block_device->running = true;
556
+ return;
557
+
558
+ error:
559
+ vu_block_deinit(vu_block_device);
560
+}
561
+
562
+static bool vu_prop_modifiable(VuBlockDev *vus, Error **errp)
563
+{
564
+ if (vus->running) {
565
+ error_setg(errp, "The property can't be modified "
566
+ "while the server is running");
567
+ return false;
568
+ }
569
+ return true;
570
+}
571
+
572
+static void vu_set_node_name(Object *obj, const char *value, Error **errp)
573
+{
574
+ VuBlockDev *vus = VHOST_USER_BLK_SERVER(obj);
575
+
576
+ if (!vu_prop_modifiable(vus, errp)) {
577
+ return;
578
+ }
579
+
580
+ if (vus->node_name) {
581
+ g_free(vus->node_name);
582
+ }
583
+
584
+ vus->node_name = g_strdup(value);
585
+}
586
+
587
+static char *vu_get_node_name(Object *obj, Error **errp)
588
+{
589
+ VuBlockDev *vus = VHOST_USER_BLK_SERVER(obj);
590
+ return g_strdup(vus->node_name);
591
+}
592
+
593
+static void free_socket_addr(SocketAddress *addr)
594
+{
595
+ g_free(addr->u.q_unix.path);
596
+ g_free(addr);
597
+}
598
+
599
+static void vu_set_unix_socket(Object *obj, const char *value,
600
+ Error **errp)
601
+{
602
+ VuBlockDev *vus = VHOST_USER_BLK_SERVER(obj);
603
+
604
+ if (!vu_prop_modifiable(vus, errp)) {
605
+ return;
606
+ }
607
+
608
+ if (vus->addr) {
609
+ free_socket_addr(vus->addr);
610
+ }
611
+
612
+ SocketAddress *addr = g_new0(SocketAddress, 1);
613
+ addr->type = SOCKET_ADDRESS_TYPE_UNIX;
614
+ addr->u.q_unix.path = g_strdup(value);
615
+ vus->addr = addr;
616
+}
617
+
618
+static char *vu_get_unix_socket(Object *obj, Error **errp)
619
+{
620
+ VuBlockDev *vus = VHOST_USER_BLK_SERVER(obj);
621
+ return g_strdup(vus->addr->u.q_unix.path);
622
+}
623
+
624
+static bool vu_get_block_writable(Object *obj, Error **errp)
625
+{
626
+ VuBlockDev *vus = VHOST_USER_BLK_SERVER(obj);
627
+ return vus->writable;
628
+}
629
+
630
+static void vu_set_block_writable(Object *obj, bool value, Error **errp)
631
+{
632
+ VuBlockDev *vus = VHOST_USER_BLK_SERVER(obj);
633
+
634
+ if (!vu_prop_modifiable(vus, errp)) {
635
+ return;
636
+ }
637
+
638
+ vus->writable = value;
639
+}
640
+
641
+static void vu_get_blk_size(Object *obj, Visitor *v, const char *name,
642
+ void *opaque, Error **errp)
643
+{
644
+ VuBlockDev *vus = VHOST_USER_BLK_SERVER(obj);
645
+ uint32_t value = vus->blk_size;
646
+
647
+ visit_type_uint32(v, name, &value, errp);
648
+}
649
+
650
+static void vu_set_blk_size(Object *obj, Visitor *v, const char *name,
651
+ void *opaque, Error **errp)
652
+{
653
+ VuBlockDev *vus = VHOST_USER_BLK_SERVER(obj);
654
+
655
+ Error *local_err = NULL;
656
+ uint32_t value;
657
+
658
+ if (!vu_prop_modifiable(vus, errp)) {
659
+ return;
660
+ }
661
+
662
+ visit_type_uint32(v, name, &value, &local_err);
663
+ if (local_err) {
664
+ goto out;
665
+ }
666
+
667
+ check_block_size(object_get_typename(obj), name, value, &local_err);
668
+ if (local_err) {
669
+ goto out;
670
+ }
671
+
672
+ vus->blk_size = value;
673
+
674
+out:
675
+ error_propagate(errp, local_err);
676
+}
677
+
678
+static void vhost_user_blk_server_instance_finalize(Object *obj)
679
+{
680
+ VuBlockDev *vub = VHOST_USER_BLK_SERVER(obj);
681
+
682
+ vhost_user_blk_server_stop(vub);
683
+
684
+ /*
685
+ * Unlike object_property_add_str, object_class_property_add_str
686
+ * doesn't have a release method. Thus manual memory freeing is
687
+ * needed.
688
+ */
689
+ free_socket_addr(vub->addr);
690
+ g_free(vub->node_name);
691
+}
692
+
693
+static void vhost_user_blk_server_complete(UserCreatable *obj, Error **errp)
694
+{
695
+ VuBlockDev *vub = VHOST_USER_BLK_SERVER(obj);
696
+
697
+ vhost_user_blk_server_start(vub, errp);
698
+}
699
+
700
+static void vhost_user_blk_server_class_init(ObjectClass *klass,
701
+ void *class_data)
702
+{
703
+ UserCreatableClass *ucc = USER_CREATABLE_CLASS(klass);
704
+ ucc->complete = vhost_user_blk_server_complete;
705
+
706
+ object_class_property_add_bool(klass, "writable",
707
+ vu_get_block_writable,
708
+ vu_set_block_writable);
709
+
710
+ object_class_property_add_str(klass, "node-name",
711
+ vu_get_node_name,
712
+ vu_set_node_name);
713
+
714
+ object_class_property_add_str(klass, "unix-socket",
715
+ vu_get_unix_socket,
716
+ vu_set_unix_socket);
717
+
718
+ object_class_property_add(klass, "logical-block-size", "uint32",
719
+ vu_get_blk_size, vu_set_blk_size,
720
+ NULL, NULL);
721
+}
722
+
723
+static const TypeInfo vhost_user_blk_server_info = {
724
+ .name = TYPE_VHOST_USER_BLK_SERVER,
725
+ .parent = TYPE_OBJECT,
726
+ .instance_size = sizeof(VuBlockDev),
727
+ .instance_finalize = vhost_user_blk_server_instance_finalize,
728
+ .class_init = vhost_user_blk_server_class_init,
729
+ .interfaces = (InterfaceInfo[]) {
730
+ {TYPE_USER_CREATABLE},
731
+ {}
732
+ },
226
+};
733
+};
227
+
734
+
228
+static __thread IOThread *my_iothread;
735
+static void vhost_user_blk_server_register_types(void)
229
+
736
+{
230
+AioContext *qemu_get_current_aio_context(void)
737
+ type_register_static(&vhost_user_blk_server_info);
231
+{
738
+}
232
+ return my_iothread ? my_iothread->ctx : qemu_get_aio_context();
739
+
233
+}
740
+type_init(vhost_user_blk_server_register_types)
234
+
741
diff --git a/softmmu/vl.c b/softmmu/vl.c
235
+static void *iothread_run(void *opaque)
236
+{
237
+ IOThread *iothread = opaque;
238
+
239
+ rcu_register_thread();
240
+
241
+ my_iothread = iothread;
242
+ qemu_mutex_lock(&iothread->init_done_lock);
243
+ iothread->ctx = aio_context_new(&error_abort);
244
+ qemu_cond_signal(&iothread->init_done_cond);
245
+ qemu_mutex_unlock(&iothread->init_done_lock);
246
+
247
+ while (!atomic_read(&iothread->stopping)) {
248
+ aio_poll(iothread->ctx, true);
249
+ }
250
+
251
+ rcu_unregister_thread();
252
+ return NULL;
253
+}
254
+
255
+void iothread_join(IOThread *iothread)
256
+{
257
+ iothread->stopping = true;
258
+ aio_notify(iothread->ctx);
259
+ qemu_thread_join(&iothread->thread);
260
+ qemu_cond_destroy(&iothread->init_done_cond);
261
+ qemu_mutex_destroy(&iothread->init_done_lock);
262
+ aio_context_unref(iothread->ctx);
263
+ g_free(iothread);
264
+}
265
+
266
+IOThread *iothread_new(void)
267
+{
268
+ IOThread *iothread = g_new0(IOThread, 1);
269
+
270
+ qemu_mutex_init(&iothread->init_done_lock);
271
+ qemu_cond_init(&iothread->init_done_cond);
272
+ qemu_thread_create(&iothread->thread, NULL, iothread_run,
273
+ iothread, QEMU_THREAD_JOINABLE);
274
+
275
+ /* Wait for initialization to complete */
276
+ qemu_mutex_lock(&iothread->init_done_lock);
277
+ while (iothread->ctx == NULL) {
278
+ qemu_cond_wait(&iothread->init_done_cond,
279
+ &iothread->init_done_lock);
280
+ }
281
+ qemu_mutex_unlock(&iothread->init_done_lock);
282
+ return iothread;
283
+}
284
+
285
+AioContext *iothread_get_aio_context(IOThread *iothread)
286
+{
287
+ return iothread->ctx;
288
+}
289
diff --git a/tests/test-aio-multithread.c b/tests/test-aio-multithread.c
290
new file mode 100644
291
index XXXXXXX..XXXXXXX
292
--- /dev/null
293
+++ b/tests/test-aio-multithread.c
294
@@ -XXX,XX +XXX,XX @@
295
+/*
296
+ * AioContext multithreading tests
297
+ *
298
+ * Copyright Red Hat, Inc. 2016
299
+ *
300
+ * Authors:
301
+ * Paolo Bonzini <pbonzini@redhat.com>
302
+ *
303
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
304
+ * See the COPYING.LIB file in the top-level directory.
305
+ */
306
+
307
+#include "qemu/osdep.h"
308
+#include <glib.h>
309
+#include "block/aio.h"
310
+#include "qapi/error.h"
311
+#include "qemu/coroutine.h"
312
+#include "qemu/thread.h"
313
+#include "qemu/error-report.h"
314
+#include "iothread.h"
315
+
316
+/* AioContext management */
317
+
318
+#define NUM_CONTEXTS 5
319
+
320
+static IOThread *threads[NUM_CONTEXTS];
321
+static AioContext *ctx[NUM_CONTEXTS];
322
+static __thread int id = -1;
323
+
324
+static QemuEvent done_event;
325
+
326
+/* Run a function synchronously on a remote iothread. */
327
+
328
+typedef struct CtxRunData {
329
+ QEMUBHFunc *cb;
330
+ void *arg;
331
+} CtxRunData;
332
+
333
+static void ctx_run_bh_cb(void *opaque)
334
+{
335
+ CtxRunData *data = opaque;
336
+
337
+ data->cb(data->arg);
338
+ qemu_event_set(&done_event);
339
+}
340
+
341
+static void ctx_run(int i, QEMUBHFunc *cb, void *opaque)
342
+{
343
+ CtxRunData data = {
344
+ .cb = cb,
345
+ .arg = opaque
346
+ };
347
+
348
+ qemu_event_reset(&done_event);
349
+ aio_bh_schedule_oneshot(ctx[i], ctx_run_bh_cb, &data);
350
+ qemu_event_wait(&done_event);
351
+}
352
+
353
+/* Starting the iothreads. */
354
+
355
+static void set_id_cb(void *opaque)
356
+{
357
+ int *i = opaque;
358
+
359
+ id = *i;
360
+}
361
+
362
+static void create_aio_contexts(void)
363
+{
364
+ int i;
365
+
366
+ for (i = 0; i < NUM_CONTEXTS; i++) {
367
+ threads[i] = iothread_new();
368
+ ctx[i] = iothread_get_aio_context(threads[i]);
369
+ }
370
+
371
+ qemu_event_init(&done_event, false);
372
+ for (i = 0; i < NUM_CONTEXTS; i++) {
373
+ ctx_run(i, set_id_cb, &i);
374
+ }
375
+}
376
+
377
+/* Stopping the iothreads. */
378
+
379
+static void join_aio_contexts(void)
380
+{
381
+ int i;
382
+
383
+ for (i = 0; i < NUM_CONTEXTS; i++) {
384
+ aio_context_ref(ctx[i]);
385
+ }
386
+ for (i = 0; i < NUM_CONTEXTS; i++) {
387
+ iothread_join(threads[i]);
388
+ }
389
+ for (i = 0; i < NUM_CONTEXTS; i++) {
390
+ aio_context_unref(ctx[i]);
391
+ }
392
+ qemu_event_destroy(&done_event);
393
+}
394
+
395
+/* Basic test for the stuff above. */
396
+
397
+static void test_lifecycle(void)
398
+{
399
+ create_aio_contexts();
400
+ join_aio_contexts();
401
+}
402
+
403
+/* aio_co_schedule test. */
404
+
405
+static Coroutine *to_schedule[NUM_CONTEXTS];
406
+
407
+static bool now_stopping;
408
+
409
+static int count_retry;
410
+static int count_here;
411
+static int count_other;
412
+
413
+static bool schedule_next(int n)
414
+{
415
+ Coroutine *co;
416
+
417
+ co = atomic_xchg(&to_schedule[n], NULL);
418
+ if (!co) {
419
+ atomic_inc(&count_retry);
420
+ return false;
421
+ }
422
+
423
+ if (n == id) {
424
+ atomic_inc(&count_here);
425
+ } else {
426
+ atomic_inc(&count_other);
427
+ }
428
+
429
+ aio_co_schedule(ctx[n], co);
430
+ return true;
431
+}
432
+
433
+static void finish_cb(void *opaque)
434
+{
435
+ schedule_next(id);
436
+}
437
+
438
+static coroutine_fn void test_multi_co_schedule_entry(void *opaque)
439
+{
440
+ g_assert(to_schedule[id] == NULL);
441
+ atomic_mb_set(&to_schedule[id], qemu_coroutine_self());
442
+
443
+ while (!atomic_mb_read(&now_stopping)) {
444
+ int n;
445
+
446
+ n = g_test_rand_int_range(0, NUM_CONTEXTS);
447
+ schedule_next(n);
448
+ qemu_coroutine_yield();
449
+
450
+ g_assert(to_schedule[id] == NULL);
451
+ atomic_mb_set(&to_schedule[id], qemu_coroutine_self());
452
+ }
453
+}
454
+
455
+
456
+static void test_multi_co_schedule(int seconds)
457
+{
458
+ int i;
459
+
460
+ count_here = count_other = count_retry = 0;
461
+ now_stopping = false;
462
+
463
+ create_aio_contexts();
464
+ for (i = 0; i < NUM_CONTEXTS; i++) {
465
+ Coroutine *co1 = qemu_coroutine_create(test_multi_co_schedule_entry, NULL);
466
+ aio_co_schedule(ctx[i], co1);
467
+ }
468
+
469
+ g_usleep(seconds * 1000000);
470
+
471
+ atomic_mb_set(&now_stopping, true);
472
+ for (i = 0; i < NUM_CONTEXTS; i++) {
473
+ ctx_run(i, finish_cb, NULL);
474
+ to_schedule[i] = NULL;
475
+ }
476
+
477
+ join_aio_contexts();
478
+ g_test_message("scheduled %d, queued %d, retry %d, total %d\n",
479
+ count_other, count_here, count_retry,
480
+ count_here + count_other + count_retry);
481
+}
482
+
483
+static void test_multi_co_schedule_1(void)
484
+{
485
+ test_multi_co_schedule(1);
486
+}
487
+
488
+static void test_multi_co_schedule_10(void)
489
+{
490
+ test_multi_co_schedule(10);
491
+}
492
+
493
+/* End of tests. */
494
+
495
+int main(int argc, char **argv)
496
+{
497
+ init_clocks();
498
+
499
+ g_test_init(&argc, &argv, NULL);
500
+ g_test_add_func("/aio/multi/lifecycle", test_lifecycle);
501
+ if (g_test_quick()) {
502
+ g_test_add_func("/aio/multi/schedule", test_multi_co_schedule_1);
503
+ } else {
504
+ g_test_add_func("/aio/multi/schedule", test_multi_co_schedule_10);
505
+ }
506
+ return g_test_run();
507
+}
508
diff --git a/util/async.c b/util/async.c
509
index XXXXXXX..XXXXXXX 100644
742
index XXXXXXX..XXXXXXX 100644
510
--- a/util/async.c
743
--- a/softmmu/vl.c
511
+++ b/util/async.c
744
+++ b/softmmu/vl.c
512
@@ -XXX,XX +XXX,XX @@
745
@@ -XXX,XX +XXX,XX @@ static bool object_create_initial(const char *type, QemuOpts *opts)
513
#include "qemu/main-loop.h"
514
#include "qemu/atomic.h"
515
#include "block/raw-aio.h"
516
+#include "qemu/coroutine_int.h"
517
+#include "trace.h"
518
519
/***********************************************************/
520
/* bottom halves (can be seen as timers which expire ASAP) */
521
@@ -XXX,XX +XXX,XX @@ aio_ctx_finalize(GSource *source)
522
}
746
}
523
#endif
747
#endif
524
748
525
+ assert(QSLIST_EMPTY(&ctx->scheduled_coroutines));
749
+ /* Reason: vhost-user-blk-server property "node-name" */
526
+ qemu_bh_delete(ctx->co_schedule_bh);
750
+ if (g_str_equal(type, "vhost-user-blk-server")) {
527
+
751
+ return false;
528
qemu_lockcnt_lock(&ctx->list_lock);
752
+ }
529
assert(!qemu_lockcnt_count(&ctx->list_lock));
753
/*
530
while (ctx->first_bh) {
754
* Reason: filter-* property "netdev" etc.
531
@@ -XXX,XX +XXX,XX @@ static bool event_notifier_poll(void *opaque)
755
*/
532
return atomic_read(&ctx->notified);
756
diff --git a/block/meson.build b/block/meson.build
533
}
534
535
+static void co_schedule_bh_cb(void *opaque)
536
+{
537
+ AioContext *ctx = opaque;
538
+ QSLIST_HEAD(, Coroutine) straight, reversed;
539
+
540
+ QSLIST_MOVE_ATOMIC(&reversed, &ctx->scheduled_coroutines);
541
+ QSLIST_INIT(&straight);
542
+
543
+ while (!QSLIST_EMPTY(&reversed)) {
544
+ Coroutine *co = QSLIST_FIRST(&reversed);
545
+ QSLIST_REMOVE_HEAD(&reversed, co_scheduled_next);
546
+ QSLIST_INSERT_HEAD(&straight, co, co_scheduled_next);
547
+ }
548
+
549
+ while (!QSLIST_EMPTY(&straight)) {
550
+ Coroutine *co = QSLIST_FIRST(&straight);
551
+ QSLIST_REMOVE_HEAD(&straight, co_scheduled_next);
552
+ trace_aio_co_schedule_bh_cb(ctx, co);
553
+ qemu_coroutine_enter(co);
554
+ }
555
+}
556
+
557
AioContext *aio_context_new(Error **errp)
558
{
559
int ret;
560
@@ -XXX,XX +XXX,XX @@ AioContext *aio_context_new(Error **errp)
561
}
562
g_source_set_can_recurse(&ctx->source, true);
563
qemu_lockcnt_init(&ctx->list_lock);
564
+
565
+ ctx->co_schedule_bh = aio_bh_new(ctx, co_schedule_bh_cb, ctx);
566
+ QSLIST_INIT(&ctx->scheduled_coroutines);
567
+
568
aio_set_event_notifier(ctx, &ctx->notifier,
569
false,
570
(EventNotifierHandler *)
571
@@ -XXX,XX +XXX,XX @@ fail:
572
return NULL;
573
}
574
575
+void aio_co_schedule(AioContext *ctx, Coroutine *co)
576
+{
577
+ trace_aio_co_schedule(ctx, co);
578
+ QSLIST_INSERT_HEAD_ATOMIC(&ctx->scheduled_coroutines,
579
+ co, co_scheduled_next);
580
+ qemu_bh_schedule(ctx->co_schedule_bh);
581
+}
582
+
583
+void aio_co_wake(struct Coroutine *co)
584
+{
585
+ AioContext *ctx;
586
+
587
+ /* Read coroutine before co->ctx. Matches smp_wmb in
588
+ * qemu_coroutine_enter.
589
+ */
590
+ smp_read_barrier_depends();
591
+ ctx = atomic_read(&co->ctx);
592
+
593
+ if (ctx != qemu_get_current_aio_context()) {
594
+ aio_co_schedule(ctx, co);
595
+ return;
596
+ }
597
+
598
+ if (qemu_in_coroutine()) {
599
+ Coroutine *self = qemu_coroutine_self();
600
+ assert(self != co);
601
+ QSIMPLEQ_INSERT_TAIL(&self->co_queue_wakeup, co, co_queue_next);
602
+ } else {
603
+ aio_context_acquire(ctx);
604
+ qemu_coroutine_enter(co);
605
+ aio_context_release(ctx);
606
+ }
607
+}
608
+
609
void aio_context_ref(AioContext *ctx)
610
{
611
g_source_ref(&ctx->source);
612
diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
613
index XXXXXXX..XXXXXXX 100644
757
index XXXXXXX..XXXXXXX 100644
614
--- a/util/qemu-coroutine.c
758
--- a/block/meson.build
615
+++ b/util/qemu-coroutine.c
759
+++ b/block/meson.build
616
@@ -XXX,XX +XXX,XX @@
760
@@ -XXX,XX +XXX,XX @@ block_ss.add(when: 'CONFIG_WIN32', if_true: files('file-win32.c', 'win32-aio.c')
617
#include "qemu/atomic.h"
761
block_ss.add(when: 'CONFIG_POSIX', if_true: [files('file-posix.c'), coref, iokit])
618
#include "qemu/coroutine.h"
762
block_ss.add(when: 'CONFIG_LIBISCSI', if_true: files('iscsi-opts.c'))
619
#include "qemu/coroutine_int.h"
763
block_ss.add(when: 'CONFIG_LINUX', if_true: files('nvme.c'))
620
+#include "block/aio.h"
764
+block_ss.add(when: 'CONFIG_LINUX', if_true: files('export/vhost-user-blk-server.c', '../contrib/libvhost-user/libvhost-user.c'))
621
765
block_ss.add(when: 'CONFIG_REPLICATION', if_true: files('replication.c'))
622
enum {
766
block_ss.add(when: 'CONFIG_SHEEPDOG', if_true: files('sheepdog.c'))
623
POOL_BATCH_SIZE = 64,
767
block_ss.add(when: ['CONFIG_LINUX_AIO', libaio], if_true: files('linux-aio.c'))
624
@@ -XXX,XX +XXX,XX @@ void qemu_coroutine_enter(Coroutine *co)
625
}
626
627
co->caller = self;
628
+ co->ctx = qemu_get_current_aio_context();
629
+
630
+ /* Store co->ctx before anything that stores co. Matches
631
+ * barrier in aio_co_wake.
632
+ */
633
+ smp_wmb();
634
+
635
ret = qemu_coroutine_switch(self, co, COROUTINE_ENTER);
636
637
qemu_co_queue_run_restart(co);
638
diff --git a/util/trace-events b/util/trace-events
639
index XXXXXXX..XXXXXXX 100644
640
--- a/util/trace-events
641
+++ b/util/trace-events
642
@@ -XXX,XX +XXX,XX @@ run_poll_handlers_end(void *ctx, bool progress) "ctx %p progress %d"
643
poll_shrink(void *ctx, int64_t old, int64_t new) "ctx %p old %"PRId64" new %"PRId64
644
poll_grow(void *ctx, int64_t old, int64_t new) "ctx %p old %"PRId64" new %"PRId64
645
646
+# util/async.c
647
+aio_co_schedule(void *ctx, void *co) "ctx %p co %p"
648
+aio_co_schedule_bh_cb(void *ctx, void *co) "ctx %p co %p"
649
+
650
# util/thread-pool.c
651
thread_pool_submit(void *pool, void *req, void *opaque) "pool %p req %p opaque %p"
652
thread_pool_complete(void *pool, void *req, void *opaque, int ret) "pool %p req %p opaque %p ret %d"
653
--
768
--
654
2.9.3
769
2.26.2
655
770
656
diff view generated by jsdifflib
New patch

From: Coiby Xu <coiby.xu@gmail.com>

Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Coiby Xu <coiby.xu@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-id: 20200918080912.321299-8-coiby.xu@gmail.com
[Removed reference to vhost-user-blk-test.c, it will be sent in a
separate pull request.
--Stefan]
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 MAINTAINERS | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index XXXXXXX..XXXXXXX 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -XXX,XX +XXX,XX @@ L: qemu-block@nongnu.org
 S: Supported
 F: tests/image-fuzzer/

+Vhost-user block device backend server
+M: Coiby Xu <Coiby.Xu@gmail.com>
+S: Maintained
+F: block/export/vhost-user-blk-server.c
+F: util/vhost-user-server.c
+F: tests/qtest/libqos/vhost-user-blk.c
+
 Replication
 M: Wen Congyang <wencongyang2@huawei.com>
 M: Xie Changlong <xiechanglong.d@gmail.com>
--
2.26.2
diff view generated by jsdifflib
From: Paolo Bonzini <pbonzini@redhat.com>

The AioContext data structures are now protected by list_lock and/or
they are walked with FOREACH_RCU primitives. There is no need anymore
to acquire the AioContext for the entire duration of aio_dispatch.
Instead, just acquire it before and after invoking the callbacks.
The next step is then to push it further down.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
Message-id: 20170213135235.12274-12-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 util/aio-posix.c | 25 +++++++++++--------------
 util/aio-win32.c | 15 +++++++--------
 util/async.c     |  2 ++
 3 files changed, 20 insertions(+), 22 deletions(-)

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20200924151549.913737-3-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 util/vhost-user-server.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
21
diff --git a/util/aio-posix.c b/util/aio-posix.c
8
diff --git a/util/vhost-user-server.c b/util/vhost-user-server.c
22
index XXXXXXX..XXXXXXX 100644
9
index XXXXXXX..XXXXXXX 100644
23
--- a/util/aio-posix.c
10
--- a/util/vhost-user-server.c
24
+++ b/util/aio-posix.c
11
+++ b/util/vhost-user-server.c
25
@@ -XXX,XX +XXX,XX @@ static bool aio_dispatch_handlers(AioContext *ctx)
12
@@ -XXX,XX +XXX,XX @@ bool vhost_user_server_start(VuServer *server,
26
(revents & (G_IO_IN | G_IO_HUP | G_IO_ERR)) &&
13
return false;
27
aio_node_check(ctx, node->is_external) &&
28
node->io_read) {
29
+ aio_context_acquire(ctx);
30
node->io_read(node->opaque);
31
+ aio_context_release(ctx);
32
33
/* aio_notify() does not count as progress */
34
if (node->opaque != &ctx->notifier) {
35
@@ -XXX,XX +XXX,XX @@ static bool aio_dispatch_handlers(AioContext *ctx)
36
(revents & (G_IO_OUT | G_IO_ERR)) &&
37
aio_node_check(ctx, node->is_external) &&
38
node->io_write) {
39
+ aio_context_acquire(ctx);
40
node->io_write(node->opaque);
41
+ aio_context_release(ctx);
42
progress = true;
43
}
44
45
@@ -XXX,XX +XXX,XX @@ bool aio_dispatch(AioContext *ctx, bool dispatch_fds)
46
}
14
}
47
15
48
/* Run our timers */
16
- /* zero out unspecified fileds */
49
+ aio_context_acquire(ctx);
17
+ /* zero out unspecified fields */
50
progress |= timerlistgroup_run_timers(&ctx->tlg);
18
*server = (VuServer) {
51
+ aio_context_release(ctx);
19
.listener = listener,
52
20
.vu_iface = vu_iface,
53
return progress;
54
}
55
@@ -XXX,XX +XXX,XX @@ bool aio_poll(AioContext *ctx, bool blocking)
56
int64_t timeout;
57
int64_t start = 0;
58
59
- aio_context_acquire(ctx);
60
- progress = false;
61
-
62
/* aio_notify can avoid the expensive event_notifier_set if
63
* everything (file descriptors, bottom halves, timers) will
64
* be re-evaluated before the next blocking poll(). This is
65
@@ -XXX,XX +XXX,XX @@ bool aio_poll(AioContext *ctx, bool blocking)
66
start = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
67
}
68
69
- if (try_poll_mode(ctx, blocking)) {
70
- progress = true;
71
- } else {
72
+ aio_context_acquire(ctx);
73
+ progress = try_poll_mode(ctx, blocking);
74
+ aio_context_release(ctx);
75
+
76
+ if (!progress) {
77
assert(npfd == 0);
78
79
/* fill pollfds */
80
@@ -XXX,XX +XXX,XX @@ bool aio_poll(AioContext *ctx, bool blocking)
81
timeout = blocking ? aio_compute_timeout(ctx) : 0;
82
83
/* wait until next event */
84
- if (timeout) {
85
- aio_context_release(ctx);
86
- }
87
if (aio_epoll_check_poll(ctx, pollfds, npfd, timeout)) {
88
AioHandler epoll_handler;
89
90
@@ -XXX,XX +XXX,XX @@ bool aio_poll(AioContext *ctx, bool blocking)
91
} else {
92
ret = qemu_poll_ns(pollfds, npfd, timeout);
93
}
94
- if (timeout) {
95
- aio_context_acquire(ctx);
96
- }
97
}
98
99
if (blocking) {
100
@@ -XXX,XX +XXX,XX @@ bool aio_poll(AioContext *ctx, bool blocking)
101
progress = true;
102
}
103
104
- aio_context_release(ctx);
105
-
106
return progress;
107
}
108
109
diff --git a/util/aio-win32.c b/util/aio-win32.c
110
index XXXXXXX..XXXXXXX 100644
111
--- a/util/aio-win32.c
112
+++ b/util/aio-win32.c
113
@@ -XXX,XX +XXX,XX @@ static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
114
(revents || event_notifier_get_handle(node->e) == event) &&
115
node->io_notify) {
116
node->pfd.revents = 0;
117
+ aio_context_acquire(ctx);
118
node->io_notify(node->e);
119
+ aio_context_release(ctx);
120
121
/* aio_notify() does not count as progress */
122
if (node->e != &ctx->notifier) {
123
@@ -XXX,XX +XXX,XX @@ static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
124
(node->io_read || node->io_write)) {
125
node->pfd.revents = 0;
126
if ((revents & G_IO_IN) && node->io_read) {
127
+ aio_context_acquire(ctx);
128
node->io_read(node->opaque);
129
+ aio_context_release(ctx);
130
progress = true;
131
}
132
if ((revents & G_IO_OUT) && node->io_write) {
133
+ aio_context_acquire(ctx);
134
node->io_write(node->opaque);
135
+ aio_context_release(ctx);
136
progress = true;
137
}
138
139
@@ -XXX,XX +XXX,XX @@ bool aio_poll(AioContext *ctx, bool blocking)
140
int count;
141
int timeout;
142
143
- aio_context_acquire(ctx);
144
progress = false;
145
146
/* aio_notify can avoid the expensive event_notifier_set if
147
@@ -XXX,XX +XXX,XX @@ bool aio_poll(AioContext *ctx, bool blocking)
148
149
timeout = blocking && !have_select_revents
150
? qemu_timeout_ns_to_ms(aio_compute_timeout(ctx)) : 0;
151
- if (timeout) {
152
- aio_context_release(ctx);
153
- }
154
ret = WaitForMultipleObjects(count, events, FALSE, timeout);
155
if (blocking) {
156
assert(first);
157
atomic_sub(&ctx->notify_me, 2);
158
}
159
- if (timeout) {
160
- aio_context_acquire(ctx);
161
- }
162
163
if (first) {
164
aio_notify_accept(ctx);
165
@@ -XXX,XX +XXX,XX @@ bool aio_poll(AioContext *ctx, bool blocking)
166
progress |= aio_dispatch_handlers(ctx, event);
167
} while (count > 0);
168
169
+ aio_context_acquire(ctx);
170
progress |= timerlistgroup_run_timers(&ctx->tlg);
171
-
172
aio_context_release(ctx);
173
return progress;
174
}
175
diff --git a/util/async.c b/util/async.c
176
index XXXXXXX..XXXXXXX 100644
177
--- a/util/async.c
178
+++ b/util/async.c
179
@@ -XXX,XX +XXX,XX @@ int aio_bh_poll(AioContext *ctx)
180
ret = 1;
181
}
182
bh->idle = 0;
183
+ aio_context_acquire(ctx);
184
aio_bh_call(bh);
185
+ aio_context_release(ctx);
186
}
187
if (bh->deleted) {
188
deleted = true;
189
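The push-down described above shrinks the locked region from the whole
dispatch loop to each individual callback, so the lock is free while the
loop polls or walks the handler list.  A minimal sketch of that shape,
using a plain pthread mutex for the AioContext lock and invented handler
types (this is not the QEMU API):

    #include <pthread.h>
    #include <stddef.h>

    typedef struct { pthread_mutex_t lock; } Ctx;   /* AioContext stand-in */

    typedef struct Handler {
        void (*io_read)(void *opaque);
        void *opaque;
        struct Handler *next;
    } Handler;

    /* Acquire only around the callback, as in aio_dispatch_handlers()
     * after this patch; the walk itself runs unlocked (RCU in QEMU). */
    void dispatch(Ctx *ctx, Handler *list)
    {
        for (Handler *h = list; h; h = h->next) {
            if (h->io_read) {
                pthread_mutex_lock(&ctx->lock);     /* aio_context_acquire */
                h->io_read(h->opaque);
                pthread_mutex_unlock(&ctx->lock);   /* aio_context_release */
            }
        }
    }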
From: Paolo Bonzini <pbonzini@redhat.com>

Running a very small critical section on pthread_mutex_t and CoMutex
shows that pthread_mutex_t is much faster because it doesn't actually
go to sleep.  What happens is that the critical section is shorter
than the latency of entering the kernel and thus FUTEX_WAIT always
fails.  With CoMutex there is no such latency but you still want to
avoid wait and wakeup.  So introduce it artificially.

This only works with one waiter; because CoMutex is fair, it will
always have more waits and wakeups than a pthread_mutex_t.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20170213181244.16297-3-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/qemu/coroutine.h   |  5 +++++
 util/qemu-coroutine-lock.c | 51 ++++++++++++++++++++++++++++++++++++++++------
 util/qemu-coroutine.c      |  2 +-
 3 files changed, 51 insertions(+), 7 deletions(-)

We already have access to the value with the correct type (ioc and sioc
are the same QIOChannel).

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20200924151549.913737-4-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 util/vhost-user-server.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/qemu/coroutine.h b/include/qemu/coroutine.h
11
diff --git a/util/vhost-user-server.c b/util/vhost-user-server.c
24
index XXXXXXX..XXXXXXX 100644
12
index XXXXXXX..XXXXXXX 100644
25
--- a/include/qemu/coroutine.h
13
--- a/util/vhost-user-server.c
26
+++ b/include/qemu/coroutine.h
14
+++ b/util/vhost-user-server.c
27
@@ -XXX,XX +XXX,XX @@ typedef struct CoMutex {
15
@@ -XXX,XX +XXX,XX @@ static void vu_accept(QIONetListener *listener, QIOChannelSocket *sioc,
28
*/
16
server->ioc = QIO_CHANNEL(sioc);
29
unsigned locked;
17
object_ref(OBJECT(server->ioc));
30
18
qio_channel_attach_aio_context(server->ioc, server->ctx);
31
+ /* Context that is holding the lock. Useful to avoid spinning
19
- qio_channel_set_blocking(QIO_CHANNEL(server->sioc), false, NULL);
32
+ * when two coroutines on the same AioContext try to get the lock. :)
20
+ qio_channel_set_blocking(server->ioc, false, NULL);
33
+ */
21
vu_client_start(server);
34
+ AioContext *ctx;
35
+
36
/* A queue of waiters. Elements are added atomically in front of
37
* from_push. to_pop is only populated, and popped from, by whoever
38
* is in charge of the next wakeup. This can be an unlocker or,
39
diff --git a/util/qemu-coroutine-lock.c b/util/qemu-coroutine-lock.c
40
index XXXXXXX..XXXXXXX 100644
41
--- a/util/qemu-coroutine-lock.c
42
+++ b/util/qemu-coroutine-lock.c
43
@@ -XXX,XX +XXX,XX @@
44
#include "qemu-common.h"
45
#include "qemu/coroutine.h"
46
#include "qemu/coroutine_int.h"
47
+#include "qemu/processor.h"
48
#include "qemu/queue.h"
49
#include "block/aio.h"
50
#include "trace.h"
51
@@ -XXX,XX +XXX,XX @@ void qemu_co_mutex_init(CoMutex *mutex)
52
memset(mutex, 0, sizeof(*mutex));
53
}
22
}
54
23
55
-static void coroutine_fn qemu_co_mutex_lock_slowpath(CoMutex *mutex)
56
+static void coroutine_fn qemu_co_mutex_wake(CoMutex *mutex, Coroutine *co)
57
+{
58
+ /* Read co before co->ctx; pairs with smp_wmb() in
59
+ * qemu_coroutine_enter().
60
+ */
61
+ smp_read_barrier_depends();
62
+ mutex->ctx = co->ctx;
63
+ aio_co_wake(co);
64
+}
65
+
66
+static void coroutine_fn qemu_co_mutex_lock_slowpath(AioContext *ctx,
67
+ CoMutex *mutex)
68
{
69
Coroutine *self = qemu_coroutine_self();
70
CoWaitRecord w;
71
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn qemu_co_mutex_lock_slowpath(CoMutex *mutex)
72
if (co == self) {
73
/* We got the lock ourselves! */
74
assert(to_wake == &w);
75
+ mutex->ctx = ctx;
76
return;
77
}
78
79
- aio_co_wake(co);
80
+ qemu_co_mutex_wake(mutex, co);
81
}
82
83
qemu_coroutine_yield();
84
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn qemu_co_mutex_lock_slowpath(CoMutex *mutex)
85
86
void coroutine_fn qemu_co_mutex_lock(CoMutex *mutex)
87
{
88
+ AioContext *ctx = qemu_get_current_aio_context();
89
Coroutine *self = qemu_coroutine_self();
90
+ int waiters, i;
91
92
- if (atomic_fetch_inc(&mutex->locked) == 0) {
93
+ /* Running a very small critical section on pthread_mutex_t and CoMutex
94
+ * shows that pthread_mutex_t is much faster because it doesn't actually
95
+ * go to sleep. What happens is that the critical section is shorter
96
+ * than the latency of entering the kernel and thus FUTEX_WAIT always
97
+ * fails. With CoMutex there is no such latency but you still want to
98
+ * avoid wait and wakeup. So introduce it artificially.
99
+ */
100
+ i = 0;
101
+retry_fast_path:
102
+ waiters = atomic_cmpxchg(&mutex->locked, 0, 1);
103
+ if (waiters != 0) {
104
+ while (waiters == 1 && ++i < 1000) {
105
+ if (atomic_read(&mutex->ctx) == ctx) {
106
+ break;
107
+ }
108
+ if (atomic_read(&mutex->locked) == 0) {
109
+ goto retry_fast_path;
110
+ }
111
+ cpu_relax();
112
+ }
113
+ waiters = atomic_fetch_inc(&mutex->locked);
114
+ }
115
+
116
+ if (waiters == 0) {
117
/* Uncontended. */
118
trace_qemu_co_mutex_lock_uncontended(mutex, self);
119
+ mutex->ctx = ctx;
120
} else {
121
- qemu_co_mutex_lock_slowpath(mutex);
122
+ qemu_co_mutex_lock_slowpath(ctx, mutex);
123
}
124
mutex->holder = self;
125
self->locks_held++;
126
@@ -XXX,XX +XXX,XX @@ void coroutine_fn qemu_co_mutex_unlock(CoMutex *mutex)
127
assert(mutex->holder == self);
128
assert(qemu_in_coroutine());
129
130
+ mutex->ctx = NULL;
131
mutex->holder = NULL;
132
self->locks_held--;
133
if (atomic_fetch_dec(&mutex->locked) == 1) {
134
@@ -XXX,XX +XXX,XX @@ void coroutine_fn qemu_co_mutex_unlock(CoMutex *mutex)
135
unsigned our_handoff;
136
137
if (to_wake) {
138
- Coroutine *co = to_wake->co;
139
- aio_co_wake(co);
140
+ qemu_co_mutex_wake(mutex, to_wake->co);
141
break;
142
}
143
144
diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
145
index XXXXXXX..XXXXXXX 100644
146
--- a/util/qemu-coroutine.c
147
+++ b/util/qemu-coroutine.c
148
@@ -XXX,XX +XXX,XX @@ void qemu_coroutine_enter(Coroutine *co)
149
co->ctx = qemu_get_current_aio_context();
150
151
/* Store co->ctx before anything that stores co. Matches
152
- * barrier in aio_co_wake.
153
+ * barrier in aio_co_wake and qemu_co_mutex_wake.
154
*/
155
smp_wmb();
156
157
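The artificial spin this patch introduces can be sketched with C11
atomics: try a compare-and-swap, and if exactly one holder owns the lock,
spin a bounded number of times before giving up and queueing on the slow
path (elided here; returning false tells the caller to take it).  All
names are illustrative stand-ins, not the CoMutex API:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <sched.h>

    typedef struct {
        atomic_uint locked;          /* 0 free; 1 held; >1 held with waiters */
        _Atomic(void *) holder_ctx;  /* context of the current holder */
    } SpinMutex;

    /* Returns true if the lock was taken on the fast path; false means
     * the caller must queue itself on the slow path (not shown). */
    bool mutex_trylock_spinning(SpinMutex *m, void *my_ctx)
    {
        unsigned seen;
        int i = 0;

    retry_fast_path:
        seen = 0;
        if (atomic_compare_exchange_strong(&m->locked, &seen, 1)) {
            atomic_store(&m->holder_ctx, my_ctx);   /* uncontended */
            return true;
        }

        /* Spin only while there is exactly one holder and no other
         * waiter; once someone queues, fairness demands we queue too. */
        while (seen == 1 && ++i < 1000) {
            if (atomic_load(&m->holder_ctx) == my_ctx) {
                break;   /* holder runs on our context: spinning can't help */
            }
            if (atomic_load(&m->locked) == 0) {
                goto retry_fast_path;   /* released while we were spinning */
            }
            sched_yield();              /* portable stand-in for cpu_relax() */
            seen = atomic_load(&m->locked);
        }
        return false;                   /* fall back to the queueing slow path */
    }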
From: Paolo Bonzini <pbonzini@redhat.com>

Pull the increment/decrement pair out of aio_bh_poll and into the
callers.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
Message-id: 20170213135235.12274-18-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 util/aio-posix.c |  8 +++-----
 util/aio-win32.c |  8 ++++----
 util/async.c     | 12 ++++++------
 3 files changed, 13 insertions(+), 15 deletions(-)

Explicitly deleting watches is not necessary since libvhost-user calls
remove_watch() during vu_deinit().  Add an assertion to check this
though.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20200924151549.913737-5-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 util/vhost-user-server.c | 19 ++++---------------
 1 file changed, 4 insertions(+), 15 deletions(-)

diff --git a/util/aio-posix.c b/util/aio-posix.c
12
diff --git a/util/vhost-user-server.c b/util/vhost-user-server.c
19
index XXXXXXX..XXXXXXX 100644
13
index XXXXXXX..XXXXXXX 100644
20
--- a/util/aio-posix.c
14
--- a/util/vhost-user-server.c
21
+++ b/util/aio-posix.c
15
+++ b/util/vhost-user-server.c
22
@@ -XXX,XX +XXX,XX @@ static bool aio_dispatch_handlers(AioContext *ctx)
16
@@ -XXX,XX +XXX,XX @@ static void close_client(VuServer *server)
23
17
/* When this is set vu_client_trip will stop new processing vhost-user message */
24
void aio_dispatch(AioContext *ctx)
18
server->sioc = NULL;
25
{
19
26
+ qemu_lockcnt_inc(&ctx->list_lock);
20
- VuFdWatch *vu_fd_watch, *next;
27
aio_bh_poll(ctx);
21
- QTAILQ_FOREACH_SAFE(vu_fd_watch, &server->vu_fd_watches, next, next) {
22
- aio_set_fd_handler(server->ioc->ctx, vu_fd_watch->fd, true, NULL,
23
- NULL, NULL, NULL);
24
- }
28
-
25
-
29
- qemu_lockcnt_inc(&ctx->list_lock);
26
- while (!QTAILQ_EMPTY(&server->vu_fd_watches)) {
30
aio_dispatch_handlers(ctx);
27
- QTAILQ_FOREACH_SAFE(vu_fd_watch, &server->vu_fd_watches, next, next) {
31
qemu_lockcnt_dec(&ctx->list_lock);
28
- if (!vu_fd_watch->processing) {
32
29
- QTAILQ_REMOVE(&server->vu_fd_watches, vu_fd_watch, next);
33
@@ -XXX,XX +XXX,XX @@ bool aio_poll(AioContext *ctx, bool blocking)
30
- g_free(vu_fd_watch);
31
- }
32
- }
33
- }
34
-
35
while (server->processing_msg) {
36
if (server->ioc->read_coroutine) {
37
server->ioc->read_coroutine = NULL;
38
@@ -XXX,XX +XXX,XX @@ static void close_client(VuServer *server)
34
}
39
}
35
40
36
npfd = 0;
41
vu_deinit(&server->vu_dev);
37
- qemu_lockcnt_dec(&ctx->list_lock);
38
39
progress |= aio_bh_poll(ctx);
40
41
if (ret > 0) {
42
- qemu_lockcnt_inc(&ctx->list_lock);
43
progress |= aio_dispatch_handlers(ctx);
44
- qemu_lockcnt_dec(&ctx->list_lock);
45
}
46
47
+ qemu_lockcnt_dec(&ctx->list_lock);
48
+
42
+
49
progress |= timerlistgroup_run_timers(&ctx->tlg);
43
+ /* vu_deinit() should have called remove_watch() */
50
44
+ assert(QTAILQ_EMPTY(&server->vu_fd_watches));
51
return progress;
52
diff --git a/util/aio-win32.c b/util/aio-win32.c
53
index XXXXXXX..XXXXXXX 100644
54
--- a/util/aio-win32.c
55
+++ b/util/aio-win32.c
56
@@ -XXX,XX +XXX,XX @@ static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
57
bool progress = false;
58
AioHandler *tmp;
59
60
- qemu_lockcnt_inc(&ctx->list_lock);
61
-
62
/*
63
* We have to walk very carefully in case aio_set_fd_handler is
64
* called while we're walking.
65
@@ -XXX,XX +XXX,XX @@ static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
66
}
67
}
68
69
- qemu_lockcnt_dec(&ctx->list_lock);
70
return progress;
71
}
72
73
void aio_dispatch(AioContext *ctx)
74
{
75
+ qemu_lockcnt_inc(&ctx->list_lock);
76
aio_bh_poll(ctx);
77
aio_dispatch_handlers(ctx, INVALID_HANDLE_VALUE);
78
+ qemu_lockcnt_dec(&ctx->list_lock);
79
timerlistgroup_run_timers(&ctx->tlg);
80
}
81
82
@@ -XXX,XX +XXX,XX @@ bool aio_poll(AioContext *ctx, bool blocking)
83
}
84
}
85
86
- qemu_lockcnt_dec(&ctx->list_lock);
87
first = true;
88
89
/* ctx->notifier is always registered. */
90
@@ -XXX,XX +XXX,XX @@ bool aio_poll(AioContext *ctx, bool blocking)
91
progress |= aio_dispatch_handlers(ctx, event);
92
} while (count > 0);
93
94
+ qemu_lockcnt_dec(&ctx->list_lock);
95
+
45
+
96
progress |= timerlistgroup_run_timers(&ctx->tlg);
46
object_unref(OBJECT(sioc));
97
return progress;
47
object_unref(OBJECT(server->ioc));
98
}
99
diff --git a/util/async.c b/util/async.c
100
index XXXXXXX..XXXXXXX 100644
101
--- a/util/async.c
102
+++ b/util/async.c
103
@@ -XXX,XX +XXX,XX @@ void aio_bh_call(QEMUBH *bh)
104
bh->cb(bh->opaque);
105
}
106
107
-/* Multiple occurrences of aio_bh_poll cannot be called concurrently */
108
+/* Multiple occurrences of aio_bh_poll cannot be called concurrently.
109
+ * The count in ctx->list_lock is incremented before the call, and is
110
+ * not affected by the call.
111
+ */
112
int aio_bh_poll(AioContext *ctx)
113
{
114
QEMUBH *bh, **bhp, *next;
115
int ret;
116
bool deleted = false;
117
118
- qemu_lockcnt_inc(&ctx->list_lock);
119
-
120
ret = 0;
121
for (bh = atomic_rcu_read(&ctx->first_bh); bh; bh = next) {
122
next = atomic_rcu_read(&bh->next);
123
@@ -XXX,XX +XXX,XX @@ int aio_bh_poll(AioContext *ctx)
124
125
/* remove deleted bhs */
126
if (!deleted) {
127
- qemu_lockcnt_dec(&ctx->list_lock);
128
return ret;
129
}
130
131
- if (qemu_lockcnt_dec_and_lock(&ctx->list_lock)) {
132
+ if (qemu_lockcnt_dec_if_lock(&ctx->list_lock)) {
133
bhp = &ctx->first_bh;
134
while (*bhp) {
135
bh = *bhp;
136
@@ -XXX,XX +XXX,XX @@ int aio_bh_poll(AioContext *ctx)
137
bhp = &bh->next;
138
}
139
}
140
- qemu_lockcnt_unlock(&ctx->list_lock);
141
+ qemu_lockcnt_inc_and_unlock(&ctx->list_lock);
142
}
143
return ret;
144
}
48
}
145
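The inc/dec pair being hoisted here implements a reader count: walkers
count themselves in and out, and nodes deleted during a walk may only be
reclaimed once the count drops to zero, so a single pair can now cover
both aio_bh_poll() and the fd-handler dispatch.  A toy C11 model of just
that counting contract (the real qemu_lockcnt also embeds a mutex for the
reclaim path):

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct { atomic_uint visitors; } LockCnt;

    void visit_begin(LockCnt *lc) { atomic_fetch_add(&lc->visitors, 1); }
    void visit_end(LockCnt *lc)   { atomic_fetch_sub(&lc->visitors, 1); }

    /* Reclaimers free deleted nodes only when no walk is in progress. */
    bool can_reclaim(LockCnt *lc) { return atomic_load(&lc->visitors) == 0; }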
From: Paolo Bonzini <pbonzini@redhat.com>

This patch prepares for the removal of unnecessary lockcnt inc/dec pairs.
Extract the dispatching loop for file descriptor handlers into a new
function aio_dispatch_handlers, and then inline aio_dispatch into
aio_poll.

aio_dispatch can now become void.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
Message-id: 20170213135235.12274-17-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/block/aio.h |  6 +-----
 util/aio-posix.c    | 44 ++++++++++++++------------------------------
 util/aio-win32.c    | 13 ++++---------
 util/async.c        |  2 +-
 4 files changed, 20 insertions(+), 45 deletions(-)

Only one struct is needed per request.  Drop req_data and the separate
VuBlockReq instance.  Instead let vu_queue_pop() allocate everything at
once.

This fixes the req_data memory leak in vu_block_virtio_process_req().

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20200924151549.913737-6-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/export/vhost-user-blk-server.c | 68 +++++++++-------------
 1 file changed, 21 insertions(+), 47 deletions(-)

diff --git a/include/block/aio.h b/include/block/aio.h
14
diff --git a/block/export/vhost-user-blk-server.c b/block/export/vhost-user-blk-server.c
24
index XXXXXXX..XXXXXXX 100644
15
index XXXXXXX..XXXXXXX 100644
25
--- a/include/block/aio.h
16
--- a/block/export/vhost-user-blk-server.c
26
+++ b/include/block/aio.h
17
+++ b/block/export/vhost-user-blk-server.c
27
@@ -XXX,XX +XXX,XX @@ bool aio_pending(AioContext *ctx);
18
@@ -XXX,XX +XXX,XX @@ struct virtio_blk_inhdr {
28
/* Dispatch any pending callbacks from the GSource attached to the AioContext.
19
};
29
*
20
30
* This is used internally in the implementation of the GSource.
21
typedef struct VuBlockReq {
31
- *
22
- VuVirtqElement *elem;
32
- * @dispatch_fds: true to process fds, false to skip them
23
+ VuVirtqElement elem;
33
- * (can be used as an optimization by callers that know there
24
int64_t sector_num;
34
- * are no fds ready)
25
size_t size;
35
*/
26
struct virtio_blk_inhdr *in;
36
-bool aio_dispatch(AioContext *ctx, bool dispatch_fds);
27
@@ -XXX,XX +XXX,XX @@ static void vu_block_req_complete(VuBlockReq *req)
37
+void aio_dispatch(AioContext *ctx);
28
VuDev *vu_dev = &req->server->vu_dev;
38
29
39
/* Progress in completing AIO work to occur. This can issue new pending
30
/* IO size with 1 extra status byte */
40
* aio as a result of executing I/O completion or bh callbacks.
31
- vu_queue_push(vu_dev, req->vq, req->elem, req->size + 1);
41
diff --git a/util/aio-posix.c b/util/aio-posix.c
32
+ vu_queue_push(vu_dev, req->vq, &req->elem, req->size + 1);
42
index XXXXXXX..XXXXXXX 100644
33
vu_queue_notify(vu_dev, req->vq);
43
--- a/util/aio-posix.c
34
44
+++ b/util/aio-posix.c
35
- if (req->elem) {
45
@@ -XXX,XX +XXX,XX @@ static bool aio_dispatch_handlers(AioContext *ctx)
36
- free(req->elem);
46
AioHandler *node, *tmp;
47
bool progress = false;
48
49
- /*
50
- * We have to walk very carefully in case aio_set_fd_handler is
51
- * called while we're walking.
52
- */
53
- qemu_lockcnt_inc(&ctx->list_lock);
54
-
55
QLIST_FOREACH_SAFE_RCU(node, &ctx->aio_handlers, node, tmp) {
56
int revents;
57
58
@@ -XXX,XX +XXX,XX @@ static bool aio_dispatch_handlers(AioContext *ctx)
59
}
60
}
61
62
- qemu_lockcnt_dec(&ctx->list_lock);
63
return progress;
64
}
65
66
-/*
67
- * Note that dispatch_fds == false has the side-effect of post-poning the
68
- * freeing of deleted handlers.
69
- */
70
-bool aio_dispatch(AioContext *ctx, bool dispatch_fds)
71
+void aio_dispatch(AioContext *ctx)
72
{
73
- bool progress;
74
+ aio_bh_poll(ctx);
75
76
- /*
77
- * If there are callbacks left that have been queued, we need to call them.
78
- * Do not call select in this case, because it is possible that the caller
79
- * does not need a complete flush (as is the case for aio_poll loops).
80
- */
81
- progress = aio_bh_poll(ctx);
82
+ qemu_lockcnt_inc(&ctx->list_lock);
83
+ aio_dispatch_handlers(ctx);
84
+ qemu_lockcnt_dec(&ctx->list_lock);
85
86
- if (dispatch_fds) {
87
- progress |= aio_dispatch_handlers(ctx);
88
- }
37
- }
89
-
38
-
90
- /* Run our timers */
39
- g_free(req);
91
- progress |= timerlistgroup_run_timers(&ctx->tlg);
40
+ free(req);
41
}
42
43
static VuBlockDev *get_vu_block_device_by_server(VuServer *server)
44
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn vu_block_flush(VuBlockReq *req)
45
blk_co_flush(backend);
46
}
47
48
-struct req_data {
49
- VuServer *server;
50
- VuVirtq *vq;
51
- VuVirtqElement *elem;
52
-};
92
-
53
-
93
- return progress;
54
static void coroutine_fn vu_block_virtio_process_req(void *opaque)
94
+ timerlistgroup_run_timers(&ctx->tlg);
55
{
56
- struct req_data *data = opaque;
57
- VuServer *server = data->server;
58
- VuVirtq *vq = data->vq;
59
- VuVirtqElement *elem = data->elem;
60
+ VuBlockReq *req = opaque;
61
+ VuServer *server = req->server;
62
+ VuVirtqElement *elem = &req->elem;
63
uint32_t type;
64
- VuBlockReq *req;
65
66
VuBlockDev *vdev_blk = get_vu_block_device_by_server(server);
67
BlockBackend *backend = vdev_blk->backend;
68
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn vu_block_virtio_process_req(void *opaque)
69
struct iovec *out_iov = elem->out_sg;
70
unsigned in_num = elem->in_num;
71
unsigned out_num = elem->out_num;
72
+
73
/* refer to hw/block/virtio_blk.c */
74
if (elem->out_num < 1 || elem->in_num < 1) {
75
error_report("virtio-blk request missing headers");
76
- free(elem);
77
- return;
78
+ goto err;
79
}
80
81
- req = g_new0(VuBlockReq, 1);
82
- req->server = server;
83
- req->vq = vq;
84
- req->elem = elem;
85
-
86
if (unlikely(iov_to_buf(out_iov, out_num, 0, &req->out,
87
sizeof(req->out)) != sizeof(req->out))) {
88
error_report("virtio-blk request outhdr too short");
89
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn vu_block_virtio_process_req(void *opaque)
90
91
err:
92
free(elem);
93
- g_free(req);
94
- return;
95
}
95
}
96
96
97
/* These thread-local variables are used only in a small part of aio_poll
97
static void vu_block_process_vq(VuDev *vu_dev, int idx)
98
@@ -XXX,XX +XXX,XX @@ bool aio_poll(AioContext *ctx, bool blocking)
98
{
99
npfd = 0;
99
- VuServer *server;
100
qemu_lockcnt_dec(&ctx->list_lock);
100
- VuVirtq *vq;
101
101
- struct req_data *req_data;
102
- /* Run dispatch even if there were no readable fds to run timers */
102
+ VuServer *server = container_of(vu_dev, VuServer, vu_dev);
103
- if (aio_dispatch(ctx, ret > 0)) {
103
+ VuVirtq *vq = vu_get_queue(vu_dev, idx);
104
- progress = true;
104
105
+ progress |= aio_bh_poll(ctx);
105
- server = container_of(vu_dev, VuServer, vu_dev);
106
- assert(server);
107
-
108
- vq = vu_get_queue(vu_dev, idx);
109
- assert(vq);
110
- VuVirtqElement *elem;
111
while (1) {
112
- elem = vu_queue_pop(vu_dev, vq, sizeof(VuVirtqElement) +
113
- sizeof(VuBlockReq));
114
- if (elem) {
115
- req_data = g_new0(struct req_data, 1);
116
- req_data->server = server;
117
- req_data->vq = vq;
118
- req_data->elem = elem;
119
- Coroutine *co = qemu_coroutine_create(vu_block_virtio_process_req,
120
- req_data);
121
- aio_co_enter(server->ioc->ctx, co);
122
- } else {
123
+ VuBlockReq *req;
106
+
124
+
107
+ if (ret > 0) {
125
+ req = vu_queue_pop(vu_dev, vq, sizeof(VuBlockReq));
108
+ qemu_lockcnt_inc(&ctx->list_lock);
126
+ if (!req) {
109
+ progress |= aio_dispatch_handlers(ctx);
127
break;
110
+ qemu_lockcnt_dec(&ctx->list_lock);
128
}
129
+
130
+ req->server = server;
131
+ req->vq = vq;
132
+
133
+ Coroutine *co =
134
+ qemu_coroutine_create(vu_block_virtio_process_req, req);
135
+ qemu_coroutine_enter(co);
111
}
136
}
112
113
+ progress |= timerlistgroup_run_timers(&ctx->tlg);
114
+
115
return progress;
116
}
137
}
117
138
118
diff --git a/util/aio-win32.c b/util/aio-win32.c
119
index XXXXXXX..XXXXXXX 100644
120
--- a/util/aio-win32.c
121
+++ b/util/aio-win32.c
122
@@ -XXX,XX +XXX,XX @@ static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
123
return progress;
124
}
125
126
-bool aio_dispatch(AioContext *ctx, bool dispatch_fds)
127
+void aio_dispatch(AioContext *ctx)
128
{
129
- bool progress;
130
-
131
- progress = aio_bh_poll(ctx);
132
- if (dispatch_fds) {
133
- progress |= aio_dispatch_handlers(ctx, INVALID_HANDLE_VALUE);
134
- }
135
- progress |= timerlistgroup_run_timers(&ctx->tlg);
136
- return progress;
137
+ aio_bh_poll(ctx);
138
+ aio_dispatch_handlers(ctx, INVALID_HANDLE_VALUE);
139
+ timerlistgroup_run_timers(&ctx->tlg);
140
}
141
142
bool aio_poll(AioContext *ctx, bool blocking)
143
diff --git a/util/async.c b/util/async.c
144
index XXXXXXX..XXXXXXX 100644
145
--- a/util/async.c
146
+++ b/util/async.c
147
@@ -XXX,XX +XXX,XX @@ aio_ctx_dispatch(GSource *source,
148
AioContext *ctx = (AioContext *) source;
149
150
assert(callback == NULL);
151
- aio_dispatch(ctx, true);
152
+ aio_dispatch(ctx);
153
return true;
154
}
155
156
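The allocation scheme described above -- let vu_queue_pop() hand back one
buffer whose head is the caller's request struct with the element
embedded -- removes the separately allocated element pointer that was
being leaked.  A simplified, self-contained model with stand-in types
(the real VuVirtqElement carries trailing scatter-gather arrays):

    #include <stdlib.h>

    typedef struct { unsigned in_num, out_num; } Element;

    typedef struct {
        Element elem;        /* embedded, first field as in VuBlockReq */
        long sector_num;
        size_t size;
    } BlockReq;

    /* One allocation covers request + element, so there is exactly one
     * pointer to track and a single free() -- no separate elem to leak. */
    void *queue_pop(size_t sz)
    {
        return calloc(1, sz);
    }

    int main(void)
    {
        BlockReq *req = queue_pop(sizeof(*req));
        if (req) {
            req->elem.in_num = 1;   /* element data lives inside req */
            free(req);
        }
        return 0;
    }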
From: Paolo Bonzini <pbonzini@redhat.com>

qcow2_create2 calls this.  Do not run a nested event loop, as that
breaks when aio_co_wake tries to queue the coroutine on the co_queue_wakeup
list of the currently running one.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20170213135235.12274-4-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/block-backend.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

The device panic notifier callback is not used.  Drop it.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20200924151549.913737-7-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 util/vhost-user-server.h            |  3 ---
 block/export/vhost-user-blk-server.c |  3 +--
 util/vhost-user-server.c            |  6 ------
 3 files changed, 1 insertion(+), 11 deletions(-)

15
11
16
diff --git a/block/block-backend.c b/block/block-backend.c
12
diff --git a/util/vhost-user-server.h b/util/vhost-user-server.h
17
index XXXXXXX..XXXXXXX 100644
13
index XXXXXXX..XXXXXXX 100644
18
--- a/block/block-backend.c
14
--- a/util/vhost-user-server.h
19
+++ b/block/block-backend.c
15
+++ b/util/vhost-user-server.h
20
@@ -XXX,XX +XXX,XX @@ static int blk_prw(BlockBackend *blk, int64_t offset, uint8_t *buf,
16
@@ -XXX,XX +XXX,XX @@ typedef struct VuFdWatch {
17
} VuFdWatch;
18
19
typedef struct VuServer VuServer;
20
-typedef void DevicePanicNotifierFn(VuServer *server);
21
22
struct VuServer {
23
QIONetListener *listener;
24
AioContext *ctx;
25
- DevicePanicNotifierFn *device_panic_notifier;
26
int max_queues;
27
const VuDevIface *vu_iface;
28
VuDev vu_dev;
29
@@ -XXX,XX +XXX,XX @@ bool vhost_user_server_start(VuServer *server,
30
SocketAddress *unix_socket,
31
AioContext *ctx,
32
uint16_t max_queues,
33
- DevicePanicNotifierFn *device_panic_notifier,
34
const VuDevIface *vu_iface,
35
Error **errp);
36
37
diff --git a/block/export/vhost-user-blk-server.c b/block/export/vhost-user-blk-server.c
38
index XXXXXXX..XXXXXXX 100644
39
--- a/block/export/vhost-user-blk-server.c
40
+++ b/block/export/vhost-user-blk-server.c
41
@@ -XXX,XX +XXX,XX @@ static void vhost_user_blk_server_start(VuBlockDev *vu_block_device,
42
ctx = bdrv_get_aio_context(blk_bs(vu_block_device->backend));
43
44
if (!vhost_user_server_start(&vu_block_device->vu_server, addr, ctx,
45
- VHOST_USER_BLK_MAX_QUEUES,
46
- NULL, &vu_block_iface,
47
+ VHOST_USER_BLK_MAX_QUEUES, &vu_block_iface,
48
errp)) {
49
goto error;
50
}
51
diff --git a/util/vhost-user-server.c b/util/vhost-user-server.c
52
index XXXXXXX..XXXXXXX 100644
53
--- a/util/vhost-user-server.c
54
+++ b/util/vhost-user-server.c
55
@@ -XXX,XX +XXX,XX @@ static void panic_cb(VuDev *vu_dev, const char *buf)
56
close_client(server);
57
}
58
59
- if (server->device_panic_notifier) {
60
- server->device_panic_notifier(server);
61
- }
62
-
63
/*
64
* Set the callback function for network listener so another
65
* vhost-user client can connect to this server
66
@@ -XXX,XX +XXX,XX @@ bool vhost_user_server_start(VuServer *server,
67
SocketAddress *socket_addr,
68
AioContext *ctx,
69
uint16_t max_queues,
70
- DevicePanicNotifierFn *device_panic_notifier,
71
const VuDevIface *vu_iface,
72
Error **errp)
21
{
73
{
22
QEMUIOVector qiov;
74
@@ -XXX,XX +XXX,XX @@ bool vhost_user_server_start(VuServer *server,
23
struct iovec iov;
75
.vu_iface = vu_iface,
24
- Coroutine *co;
76
.max_queues = max_queues,
25
BlkRwCo rwco;
77
.ctx = ctx,
26
78
- .device_panic_notifier = device_panic_notifier,
27
iov = (struct iovec) {
28
@@ -XXX,XX +XXX,XX @@ static int blk_prw(BlockBackend *blk, int64_t offset, uint8_t *buf,
29
.ret = NOT_DONE,
30
};
79
};
31
80
32
- co = qemu_coroutine_create(co_entry, &rwco);
81
qio_net_listener_set_name(server->listener, "vhost-user-backend-listener");
33
- qemu_coroutine_enter(co);
34
- BDRV_POLL_WHILE(blk_bs(blk), rwco.ret == NOT_DONE);
35
+ if (qemu_in_coroutine()) {
36
+ /* Fast-path if already in coroutine context */
37
+ co_entry(&rwco);
38
+ } else {
39
+ Coroutine *co = qemu_coroutine_create(co_entry, &rwco);
40
+ qemu_coroutine_enter(co);
41
+ BDRV_POLL_WHILE(blk_bs(blk), rwco.ret == NOT_DONE);
42
+ }
43
44
return rwco.ret;
45
}
46
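The blk_prw() change above is the classic "fast path if already in
coroutine context" pattern: call the coroutine entry point directly when
inside a coroutine, and only spawn a coroutine plus poll loop otherwise.
A compilable toy model -- qemu_in_coroutine() and BDRV_POLL_WHILE() are
replaced by a thread-local flag and a direct call:

    #include <stdbool.h>
    #include <stdio.h>

    static _Thread_local bool in_coroutine;   /* qemu_in_coroutine() stand-in */

    static void co_entry(void *opaque)
    {
        *(int *)opaque = 0;                   /* the actual I/O, stubbed */
    }

    static void prw(void *rwco)
    {
        if (in_coroutine) {
            co_entry(rwco);   /* fast path: no new coroutine, no nested loop */
        } else {
            /* Slow path: QEMU creates a coroutine and polls with
             * BDRV_POLL_WHILE() until rwco.ret changes; modelled here
             * as entering "coroutine context" and calling in directly. */
            in_coroutine = true;
            co_entry(rwco);
            in_coroutine = false;
        }
    }

    int main(void)
    {
        int ret = -1;
        prw(&ret);
        printf("ret=%d\n", ret);
        return 0;
    }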
fds[] is leaked when qio_channel_readv_full() fails.

Use vmsg->fds[] instead of keeping a local fds[] array.  Then we can
reuse goto fail to clean up fds.  vmsg->fd_num must be zeroed before the
loop to make this safe.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20200924151549.913737-8-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 util/vhost-user-server.c | 50 ++++++++++++++++++----------------------
 1 file changed, 23 insertions(+), 27 deletions(-)

diff --git a/util/vhost-user-server.c b/util/vhost-user-server.c
index XXXXXXX..XXXXXXX 100644
--- a/util/vhost-user-server.c
+++ b/util/vhost-user-server.c
@@ -XXX,XX +XXX,XX @@ vu_message_read(VuDev *vu_dev, int conn_fd, VhostUserMsg *vmsg)
     };
     int rc, read_bytes = 0;
     Error *local_err = NULL;
-    /*
-     * Store fds/nfds returned from qio_channel_readv_full into
-     * temporary variables.
-     *
-     * VhostUserMsg is a packed structure, gcc will complain about passing
-     * pointer to a packed structure member if we pass &VhostUserMsg.fd_num
-     * and &VhostUserMsg.fds directly when calling qio_channel_readv_full,
-     * thus two temporary variables nfds and fds are used here.
-     */
-    size_t nfds = 0, nfds_t = 0;
     const size_t max_fds = G_N_ELEMENTS(vmsg->fds);
-    int *fds_t = NULL;
     VuServer *server = container_of(vu_dev, VuServer, vu_dev);
     QIOChannel *ioc = server->ioc;

+    vmsg->fd_num = 0;
     if (!ioc) {
         error_report_err(local_err);
         goto fail;
@@ -XXX,XX +XXX,XX @@ vu_message_read(VuDev *vu_dev, int conn_fd, VhostUserMsg *vmsg)

     assert(qemu_in_coroutine());
     do {
+        size_t nfds = 0;
+        int *fds = NULL;
+
         /*
          * qio_channel_readv_full may have short reads, keeping calling it
          * until getting VHOST_USER_HDR_SIZE or 0 bytes in total
          */
-        rc = qio_channel_readv_full(ioc, &iov, 1, &fds_t, &nfds_t, &local_err);
+        rc = qio_channel_readv_full(ioc, &iov, 1, &fds, &nfds, &local_err);
         if (rc < 0) {
             if (rc == QIO_CHANNEL_ERR_BLOCK) {
+                assert(local_err == NULL);
                 qio_channel_yield(ioc, G_IO_IN);
                 continue;
             } else {
                 error_report_err(local_err);
-                return false;
+                goto fail;
             }
         }
-        read_bytes += rc;
-        if (nfds_t > 0) {
-            if (nfds + nfds_t > max_fds) {
+
+        if (nfds > 0) {
+            if (vmsg->fd_num + nfds > max_fds) {
                 error_report("A maximum of %zu fds are allowed, "
                              "however got %zu fds now",
-                             max_fds, nfds + nfds_t);
+                             max_fds, vmsg->fd_num + nfds);
+                g_free(fds);
                 goto fail;
             }
-            memcpy(vmsg->fds + nfds, fds_t,
-                   nfds_t *sizeof(vmsg->fds[0]));
-            nfds += nfds_t;
-            g_free(fds_t);
+            memcpy(vmsg->fds + vmsg->fd_num, fds, nfds * sizeof(vmsg->fds[0]));
+            vmsg->fd_num += nfds;
+            g_free(fds);
         }
-        if (read_bytes == VHOST_USER_HDR_SIZE || rc == 0) {
-            break;
+
+        if (rc == 0) { /* socket closed */
+            goto fail;
         }
-        iov.iov_base = (char *)vmsg + read_bytes;
-        iov.iov_len = VHOST_USER_HDR_SIZE - read_bytes;
-    } while (true);

-    vmsg->fd_num = nfds;
+        iov.iov_base += rc;
+        iov.iov_len -= rc;
+        read_bytes += rc;
+    } while (read_bytes != VHOST_USER_HDR_SIZE);
+
     /* qio_channel_readv_full will make socket fds blocking, unblock them */
     vmsg_unblock_fds(vmsg);
     if (vmsg->size > sizeof(vmsg->payload)) {
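The rewritten read loop above advances the iovec by whatever each call
returned and treats EOF mid-header as an error rather than a normal exit.
A self-contained sketch of the same accumulation logic, with a
memory-backed stub standing in for qio_channel_readv_full():

    #include <stdio.h>
    #include <string.h>

    #define HDR_SIZE 12

    /* Stub transport: returns short reads from a static buffer, 0 at EOF. */
    static size_t src_off;
    static const char src[] = "0123456789AB";

    static long read_some(char *buf, size_t len)
    {
        size_t avail = HDR_SIZE - src_off;
        size_t n = len < 3 ? len : 3;            /* force short reads */
        if (n > avail) {
            n = avail;
        }
        memcpy(buf, src + src_off, n);
        src_off += n;
        return (long)n;
    }

    static int read_header(char *hdr)
    {
        char *p = hdr;                /* mirrors iov.iov_base */
        size_t want = HDR_SIZE;       /* mirrors iov.iov_len */

        while (want > 0) {
            long rc = read_some(p, want);
            if (rc <= 0) {
                return -1;            /* error or unexpected EOF: goto fail */
            }
            p += rc;
            want -= rc;
        }
        return 0;
    }

    int main(void)
    {
        char hdr[HDR_SIZE];
        printf("header read: %s\n", read_header(hdr) == 0 ? "ok" : "failed");
        return 0;
    }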
From: Paolo Bonzini <pbonzini@redhat.com>

This adds a CoMutex around the existing CoQueue.  Because the write-side
can just take the CoMutex, the old "writer" field is no longer necessary.
Instead of removing it altogether, count the number of pending writers
during a read-side critical section and forbid further readers from
entering.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20170213181244.16297-7-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/qemu/coroutine.h   |  3 ++-
 util/qemu-coroutine-lock.c | 35 ++++++++++++++++++++++++-----------
 2 files changed, 26 insertions(+), 12 deletions(-)

Unexpected EOF is an error that must be reported.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20200924151549.913737-9-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 util/vhost-user-server.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/qemu/coroutine.h b/include/qemu/coroutine.h
10
diff --git a/util/vhost-user-server.c b/util/vhost-user-server.c
19
index XXXXXXX..XXXXXXX 100644
11
index XXXXXXX..XXXXXXX 100644
20
--- a/include/qemu/coroutine.h
12
--- a/util/vhost-user-server.c
21
+++ b/include/qemu/coroutine.h
13
+++ b/util/vhost-user-server.c
22
@@ -XXX,XX +XXX,XX @@ bool qemu_co_queue_empty(CoQueue *queue);
14
@@ -XXX,XX +XXX,XX @@ vu_message_read(VuDev *vu_dev, int conn_fd, VhostUserMsg *vmsg)
23
15
};
24
16
if (vmsg->size) {
25
typedef struct CoRwlock {
17
rc = qio_channel_readv_all_eof(ioc, &iov_payload, 1, &local_err);
26
- bool writer;
18
- if (rc == -1) {
27
+ int pending_writer;
19
- error_report_err(local_err);
28
int reader;
20
+ if (rc != 1) {
29
+ CoMutex mutex;
21
+ if (local_err) {
30
CoQueue queue;
22
+ error_report_err(local_err);
31
} CoRwlock;
23
+ }
32
24
goto fail;
33
diff --git a/util/qemu-coroutine-lock.c b/util/qemu-coroutine-lock.c
34
index XXXXXXX..XXXXXXX 100644
35
--- a/util/qemu-coroutine-lock.c
36
+++ b/util/qemu-coroutine-lock.c
37
@@ -XXX,XX +XXX,XX @@ void qemu_co_rwlock_init(CoRwlock *lock)
38
{
39
memset(lock, 0, sizeof(*lock));
40
qemu_co_queue_init(&lock->queue);
41
+ qemu_co_mutex_init(&lock->mutex);
42
}
43
44
void qemu_co_rwlock_rdlock(CoRwlock *lock)
45
{
46
Coroutine *self = qemu_coroutine_self();
47
48
- while (lock->writer) {
49
- qemu_co_queue_wait(&lock->queue, NULL);
50
+ qemu_co_mutex_lock(&lock->mutex);
51
+ /* For fairness, wait if a writer is in line. */
52
+ while (lock->pending_writer) {
53
+ qemu_co_queue_wait(&lock->queue, &lock->mutex);
54
}
55
lock->reader++;
56
+ qemu_co_mutex_unlock(&lock->mutex);
57
+
58
+ /* The rest of the read-side critical section is run without the mutex. */
59
self->locks_held++;
60
}
61
62
@@ -XXX,XX +XXX,XX @@ void qemu_co_rwlock_unlock(CoRwlock *lock)
63
Coroutine *self = qemu_coroutine_self();
64
65
assert(qemu_in_coroutine());
66
- if (lock->writer) {
67
- lock->writer = false;
68
+ if (!lock->reader) {
69
+ /* The critical section started in qemu_co_rwlock_wrlock. */
70
qemu_co_queue_restart_all(&lock->queue);
71
} else {
72
+ self->locks_held--;
73
+
74
+ qemu_co_mutex_lock(&lock->mutex);
75
lock->reader--;
76
assert(lock->reader >= 0);
77
/* Wakeup only one waiting writer */
78
@@ -XXX,XX +XXX,XX @@ void qemu_co_rwlock_unlock(CoRwlock *lock)
79
qemu_co_queue_next(&lock->queue);
80
}
25
}
81
}
26
}
82
- self->locks_held--;
83
+ qemu_co_mutex_unlock(&lock->mutex);
84
}
85
86
void qemu_co_rwlock_wrlock(CoRwlock *lock)
87
{
88
- Coroutine *self = qemu_coroutine_self();
89
-
90
- while (lock->writer || lock->reader) {
91
- qemu_co_queue_wait(&lock->queue, NULL);
92
+ qemu_co_mutex_lock(&lock->mutex);
93
+ lock->pending_writer++;
94
+ while (lock->reader) {
95
+ qemu_co_queue_wait(&lock->queue, &lock->mutex);
96
}
97
- lock->writer = true;
98
- self->locks_held++;
99
+ lock->pending_writer--;
100
+
101
+ /* The rest of the write-side critical section is run with
102
+ * the mutex taken, so that lock->reader remains zero.
103
+ * There is no need to update self->locks_held.
104
+ */
105
}
106
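The fairness scheme in the CoRwlock rework above translates directly to
pthreads: a mutex protects reader/pending_writer, readers yield to queued
writers, and the write side keeps the mutex for its whole critical
section so the reader count stays zero.  A minimal standalone model, with
a condition variable in place of the CoQueue:

    #include <pthread.h>

    typedef struct {
        pthread_mutex_t mutex;
        pthread_cond_t queue;
        int reader;
        int pending_writer;
    } RwLock;

    void rd_lock(RwLock *l)
    {
        pthread_mutex_lock(&l->mutex);
        while (l->pending_writer) {          /* let queued writers go first */
            pthread_cond_wait(&l->queue, &l->mutex);
        }
        l->reader++;
        pthread_mutex_unlock(&l->mutex);     /* read section runs unlocked */
    }

    void rd_unlock(RwLock *l)
    {
        pthread_mutex_lock(&l->mutex);
        if (--l->reader == 0) {
            pthread_cond_broadcast(&l->queue);   /* wake a waiting writer */
        }
        pthread_mutex_unlock(&l->mutex);
    }

    void wr_lock(RwLock *l)
    {
        pthread_mutex_lock(&l->mutex);
        l->pending_writer++;
        while (l->reader) {
            pthread_cond_wait(&l->queue, &l->mutex);
        }
        l->pending_writer--;
        /* write section runs WITH the mutex held, as in the CoMutex version */
    }

    void wr_unlock(RwLock *l)
    {
        pthread_cond_broadcast(&l->queue);
        pthread_mutex_unlock(&l->mutex);
    }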
From: Paolo Bonzini <pbonzini@redhat.com>

This covers both file descriptor callbacks and polling callbacks,
since they execute related code.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
Message-id: 20170213135235.12274-14-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/curl.c          | 16 +++++++++++++---
 block/iscsi.c         |  4 ++++
 block/linux-aio.c     |  4 ++++
 block/nfs.c           |  6 ++++++
 block/sheepdog.c      | 29 +++++++++++++++--------------
 block/ssh.c           | 29 +++++++++--------------------
 block/win32-aio.c     | 10 ++++++----
 hw/block/virtio-blk.c |  5 ++++-
 hw/scsi/virtio-scsi.c |  7 +++++++
 util/aio-posix.c      |  7 -------
 util/aio-win32.c      |  6 ------
 11 files changed, 68 insertions(+), 55 deletions(-)

The vu_client_trip() coroutine is leaked during AioContext switching.  It
is also unsafe to destroy the vu_dev in panic_cb() since its callers
still access it in some cases.

Rework the lifecycle to solve these safety issues.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20200924151549.913737-10-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 util/vhost-user-server.h            |  29 ++--
 block/export/vhost-user-blk-server.c |   9 +-
 util/vhost-user-server.c            | 245 +++++++++++++++------------
 3 files changed, 155 insertions(+), 128 deletions(-)

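A sketch of the attach/detach shape that this reworked lifecycle
converges on: detaching unregisters every kick fd from the old
AioContext, attaching registers them with the new one and reschedules the
connection coroutine.  Types and set_fd_handler() are simplified
stand-ins for the QEMU APIs, stubbed so the unit compiles:

    #include <stddef.h>

    typedef struct Watch { int fd; struct Watch *next; } Watch;
    typedef struct { void *ctx; Watch *watches; } Server;

    /* Stub for aio_set_fd_handler(): cb == NULL means "stop monitoring". */
    static void set_fd_handler(void *ctx, int fd, void (*cb)(int))
    {
        (void)ctx; (void)fd; (void)cb;
    }

    static void kick_cb(int fd) { (void)fd; /* process the virtqueue */ }

    void attach_ctx(Server *s, void *new_ctx)
    {
        Watch *w;

        s->ctx = new_ctx;
        for (w = s->watches; w; w = w->next) {
            set_fd_handler(new_ctx, w->fd, kick_cb);  /* resume kick handling */
        }
        /* here QEMU also does aio_co_schedule(new_ctx, s->co_trip) */
    }

    void detach_ctx(Server *s)
    {
        Watch *w;

        for (w = s->watches; w; w = w->next) {
            set_fd_handler(s->ctx, w->fd, NULL);      /* stop kick handling */
        }
        s->ctx = NULL;
    }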
26
diff --git a/block/curl.c b/block/curl.c
16
diff --git a/util/vhost-user-server.h b/util/vhost-user-server.h
27
index XXXXXXX..XXXXXXX 100644
17
index XXXXXXX..XXXXXXX 100644
28
--- a/block/curl.c
18
--- a/util/vhost-user-server.h
29
+++ b/block/curl.c
19
+++ b/util/vhost-user-server.h
30
@@ -XXX,XX +XXX,XX @@ static void curl_multi_check_completion(BDRVCURLState *s)
20
@@ -XXX,XX +XXX,XX @@
31
}
21
#include "qapi/error.h"
32
}
22
#include "standard-headers/linux/virtio_blk.h"
33
23
34
-static void curl_multi_do(void *arg)
24
+/* A kick fd that we monitor on behalf of libvhost-user */
35
+static void curl_multi_do_locked(CURLState *s)
25
typedef struct VuFdWatch {
36
{
26
VuDev *vu_dev;
37
- CURLState *s = (CURLState *)arg;
27
int fd; /*kick fd*/
38
CURLSocket *socket, *next_socket;
28
void *pvt;
39
int running;
29
vu_watch_cb cb;
40
int r;
30
- bool processing;
41
@@ -XXX,XX +XXX,XX @@ static void curl_multi_do(void *arg)
31
QTAILQ_ENTRY(VuFdWatch) next;
42
}
32
} VuFdWatch;
43
}
33
44
34
-typedef struct VuServer VuServer;
45
+static void curl_multi_do(void *arg)
35
-
36
-struct VuServer {
37
+/**
38
+ * VuServer:
39
+ * A vhost-user server instance with user-defined VuDevIface callbacks.
40
+ * Vhost-user device backends can be implemented using VuServer. VuDevIface
41
+ * callbacks and virtqueue kicks run in the given AioContext.
42
+ */
43
+typedef struct {
44
QIONetListener *listener;
45
+ QEMUBH *restart_listener_bh;
46
AioContext *ctx;
47
int max_queues;
48
const VuDevIface *vu_iface;
49
+
50
+ /* Protected by ctx lock */
51
VuDev vu_dev;
52
QIOChannel *ioc; /* The I/O channel with the client */
53
QIOChannelSocket *sioc; /* The underlying data channel with the client */
54
- /* IOChannel for fd provided via VHOST_USER_SET_SLAVE_REQ_FD */
55
- QIOChannel *ioc_slave;
56
- QIOChannelSocket *sioc_slave;
57
- Coroutine *co_trip; /* coroutine for processing VhostUserMsg */
58
QTAILQ_HEAD(, VuFdWatch) vu_fd_watches;
59
- /* restart coroutine co_trip if AIOContext is changed */
60
- bool aio_context_changed;
61
- bool processing_msg;
62
-};
63
+
64
+ Coroutine *co_trip; /* coroutine for processing VhostUserMsg */
65
+} VuServer;
66
67
bool vhost_user_server_start(VuServer *server,
68
SocketAddress *unix_socket,
69
@@ -XXX,XX +XXX,XX @@ bool vhost_user_server_start(VuServer *server,
70
71
void vhost_user_server_stop(VuServer *server);
72
73
-void vhost_user_server_set_aio_context(VuServer *server, AioContext *ctx);
74
+void vhost_user_server_attach_aio_context(VuServer *server, AioContext *ctx);
75
+void vhost_user_server_detach_aio_context(VuServer *server);
76
77
#endif /* VHOST_USER_SERVER_H */
78
diff --git a/block/export/vhost-user-blk-server.c b/block/export/vhost-user-blk-server.c
79
index XXXXXXX..XXXXXXX 100644
80
--- a/block/export/vhost-user-blk-server.c
81
+++ b/block/export/vhost-user-blk-server.c
82
@@ -XXX,XX +XXX,XX @@ static const VuDevIface vu_block_iface = {
83
static void blk_aio_attached(AioContext *ctx, void *opaque)
84
{
85
VuBlockDev *vub_dev = opaque;
86
- aio_context_acquire(ctx);
87
- vhost_user_server_set_aio_context(&vub_dev->vu_server, ctx);
88
- aio_context_release(ctx);
89
+ vhost_user_server_attach_aio_context(&vub_dev->vu_server, ctx);
90
}
91
92
static void blk_aio_detach(void *opaque)
93
{
94
VuBlockDev *vub_dev = opaque;
95
- AioContext *ctx = vub_dev->vu_server.ctx;
96
- aio_context_acquire(ctx);
97
- vhost_user_server_set_aio_context(&vub_dev->vu_server, NULL);
98
- aio_context_release(ctx);
99
+ vhost_user_server_detach_aio_context(&vub_dev->vu_server);
100
}
101
102
static void
103
diff --git a/util/vhost-user-server.c b/util/vhost-user-server.c
104
index XXXXXXX..XXXXXXX 100644
105
--- a/util/vhost-user-server.c
106
+++ b/util/vhost-user-server.c
107
@@ -XXX,XX +XXX,XX @@
108
*/
109
#include "qemu/osdep.h"
110
#include "qemu/main-loop.h"
111
+#include "block/aio-wait.h"
112
#include "vhost-user-server.h"
113
114
+/*
115
+ * Theory of operation:
116
+ *
117
+ * VuServer is started and stopped by vhost_user_server_start() and
118
+ * vhost_user_server_stop() from the main loop thread. Starting the server
119
+ * opens a vhost-user UNIX domain socket and listens for incoming connections.
120
+ * Only one connection is allowed at a time.
121
+ *
122
+ * The connection is handled by the vu_client_trip() coroutine in the
123
+ * VuServer->ctx AioContext. The coroutine consists of a vu_dispatch() loop
124
+ * where libvhost-user calls vu_message_read() to receive the next vhost-user
125
+ * protocol messages over the UNIX domain socket.
126
+ *
127
+ * When virtqueues are set up libvhost-user calls set_watch() to monitor kick
128
+ * fds. These fds are also handled in the VuServer->ctx AioContext.
129
+ *
130
+ * Both vu_client_trip() and kick fd monitoring can be stopped by shutting down
131
+ * the socket connection. Shutting down the socket connection causes
132
+ * vu_message_read() to fail since no more data can be received from the socket.
133
+ * After vu_dispatch() fails, vu_client_trip() calls vu_deinit() to stop
134
+ * libvhost-user before terminating the coroutine. vu_deinit() calls
135
+ * remove_watch() to stop monitoring kick fds and this stops virtqueue
136
+ * processing.
137
+ *
138
+ * When vu_client_trip() has finished cleaning up it schedules a BH in the main
139
+ * loop thread to accept the next client connection.
140
+ *
141
+ * When libvhost-user detects an error it calls panic_cb() and sets the
142
+ * dev->broken flag. Both vu_client_trip() and kick fd processing stop when
143
+ * the dev->broken flag is set.
144
+ *
145
+ * It is possible to switch AioContexts using
146
+ * vhost_user_server_detach_aio_context() and
147
+ * vhost_user_server_attach_aio_context(). They stop monitoring fds in the old
148
+ * AioContext and resume monitoring in the new AioContext. The vu_client_trip()
149
+ * coroutine remains in a yielded state during the switch. This is made
150
+ * possible by QIOChannel's support for spurious coroutine re-entry in
151
+ * qio_channel_yield(). The coroutine will restart I/O when re-entered from the
152
+ * new AioContext.
153
+ */
154
+
155
static void vmsg_close_fds(VhostUserMsg *vmsg)
156
{
157
int i;
158
@@ -XXX,XX +XXX,XX @@ static void vmsg_unblock_fds(VhostUserMsg *vmsg)
159
}
160
}
161
162
-static void vu_accept(QIONetListener *listener, QIOChannelSocket *sioc,
163
- gpointer opaque);
164
-
165
-static void close_client(VuServer *server)
166
-{
167
- /*
168
- * Before closing the client
169
- *
170
- * 1. Let vu_client_trip stop processing new vhost-user msg
171
- *
172
- * 2. remove kick_handler
173
- *
174
- * 3. wait for the kick handler to be finished
175
- *
176
- * 4. wait for the current vhost-user msg to be finished processing
177
- */
178
-
179
- QIOChannelSocket *sioc = server->sioc;
180
- /* When this is set vu_client_trip will stop new processing vhost-user message */
181
- server->sioc = NULL;
182
-
183
- while (server->processing_msg) {
184
- if (server->ioc->read_coroutine) {
185
- server->ioc->read_coroutine = NULL;
186
- qio_channel_set_aio_fd_handler(server->ioc, server->ioc->ctx, NULL,
187
- NULL, server->ioc);
188
- server->processing_msg = false;
189
- }
190
- }
191
-
192
- vu_deinit(&server->vu_dev);
193
-
194
- /* vu_deinit() should have called remove_watch() */
195
- assert(QTAILQ_EMPTY(&server->vu_fd_watches));
196
-
197
- object_unref(OBJECT(sioc));
198
- object_unref(OBJECT(server->ioc));
199
-}
200
-
201
static void panic_cb(VuDev *vu_dev, const char *buf)
202
{
203
- VuServer *server = container_of(vu_dev, VuServer, vu_dev);
204
-
205
- /* avoid while loop in close_client */
206
- server->processing_msg = false;
207
-
208
- if (buf) {
209
- error_report("vu_panic: %s", buf);
210
- }
211
-
212
- if (server->sioc) {
213
- close_client(server);
214
- }
215
-
216
- /*
217
- * Set the callback function for network listener so another
218
- * vhost-user client can connect to this server
219
- */
220
- qio_net_listener_set_client_func(server->listener,
221
- vu_accept,
222
- server,
223
- NULL);
224
+ error_report("vu_panic: %s", buf);
225
}
226
227
static bool coroutine_fn
228
@@ -XXX,XX +XXX,XX @@ fail:
229
return false;
230
}
231
232
-
233
-static void vu_client_start(VuServer *server);
234
static coroutine_fn void vu_client_trip(void *opaque)
235
{
236
VuServer *server = opaque;
237
+ VuDev *vu_dev = &server->vu_dev;
238
239
- while (!server->aio_context_changed && server->sioc) {
240
- server->processing_msg = true;
241
- vu_dispatch(&server->vu_dev);
242
- server->processing_msg = false;
243
+ while (!vu_dev->broken && vu_dispatch(vu_dev)) {
244
+ /* Keep running */
245
}
246
247
- if (server->aio_context_changed && server->sioc) {
248
- server->aio_context_changed = false;
249
- vu_client_start(server);
250
- }
251
-}
252
+ vu_deinit(vu_dev);
253
+
254
+ /* vu_deinit() should have called remove_watch() */
255
+ assert(QTAILQ_EMPTY(&server->vu_fd_watches));
256
+
257
+ object_unref(OBJECT(server->sioc));
258
+ server->sioc = NULL;
259
260
-static void vu_client_start(VuServer *server)
261
-{
262
- server->co_trip = qemu_coroutine_create(vu_client_trip, server);
263
- aio_co_enter(server->ctx, server->co_trip);
264
+ object_unref(OBJECT(server->ioc));
265
+ server->ioc = NULL;
266
+
267
+ server->co_trip = NULL;
268
+ if (server->restart_listener_bh) {
269
+ qemu_bh_schedule(server->restart_listener_bh);
270
+ }
271
+ aio_wait_kick();
272
}
273
274
/*
275
@@ -XXX,XX +XXX,XX @@ static void vu_client_start(VuServer *server)
276
static void kick_handler(void *opaque)
277
{
278
VuFdWatch *vu_fd_watch = opaque;
279
- vu_fd_watch->processing = true;
280
- vu_fd_watch->cb(vu_fd_watch->vu_dev, 0, vu_fd_watch->pvt);
281
- vu_fd_watch->processing = false;
282
+ VuDev *vu_dev = vu_fd_watch->vu_dev;
283
+
284
+ vu_fd_watch->cb(vu_dev, 0, vu_fd_watch->pvt);
285
+
286
+ /* Stop vu_client_trip() if an error occurred in vu_fd_watch->cb() */
287
+ if (vu_dev->broken) {
288
+ VuServer *server = container_of(vu_dev, VuServer, vu_dev);
289
+
290
+ qio_channel_shutdown(server->ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
291
+ }
292
}
293
294
-
295
static VuFdWatch *find_vu_fd_watch(VuServer *server, int fd)
296
{
297
298
@@ -XXX,XX +XXX,XX @@ static void vu_accept(QIONetListener *listener, QIOChannelSocket *sioc,
299
qio_channel_set_name(QIO_CHANNEL(sioc), "vhost-user client");
300
server->ioc = QIO_CHANNEL(sioc);
301
object_ref(OBJECT(server->ioc));
302
- qio_channel_attach_aio_context(server->ioc, server->ctx);
303
+
304
+ /* TODO vu_message_write() spins if non-blocking! */
305
qio_channel_set_blocking(server->ioc, false, NULL);
306
- vu_client_start(server);
307
+
308
+ server->co_trip = qemu_coroutine_create(vu_client_trip, server);
309
+
310
+ aio_context_acquire(server->ctx);
311
+ vhost_user_server_attach_aio_context(server, server->ctx);
312
+ aio_context_release(server->ctx);
313
}
314
315
-
316
void vhost_user_server_stop(VuServer *server)
317
{
318
+ aio_context_acquire(server->ctx);
319
+
320
+ qemu_bh_delete(server->restart_listener_bh);
321
+ server->restart_listener_bh = NULL;
322
+
323
if (server->sioc) {
324
- close_client(server);
325
+ VuFdWatch *vu_fd_watch;
326
+
327
+ QTAILQ_FOREACH(vu_fd_watch, &server->vu_fd_watches, next) {
328
+ aio_set_fd_handler(server->ctx, vu_fd_watch->fd, true,
329
+ NULL, NULL, NULL, vu_fd_watch);
330
+ }
331
+
332
+ qio_channel_shutdown(server->ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
333
+
334
+ AIO_WAIT_WHILE(server->ctx, server->co_trip);
335
}
336
337
+ aio_context_release(server->ctx);
338
+
339
if (server->listener) {
340
qio_net_listener_disconnect(server->listener);
341
object_unref(OBJECT(server->listener));
342
}
343
+}
344
+
345
+/*
346
+ * Allow the next client to connect to the server. Called from a BH in the main
347
+ * loop.
348
+ */
349
+{
+    CURLState *s = (CURLState *)arg;
+
+    aio_context_acquire(s->s->aio_context);
+    curl_multi_do_locked(s);
+    aio_context_release(s->s->aio_context);
+}
+
 static void curl_multi_read(void *arg)
 {
     CURLState *s = (CURLState *)arg;
 
-    curl_multi_do(arg);
+    aio_context_acquire(s->s->aio_context);
+    curl_multi_do_locked(s);
     curl_multi_check_completion(s->s);
+    aio_context_release(s->s->aio_context);
 }
 
 static void curl_multi_timeout_do(void *arg)
diff --git a/block/iscsi.c b/block/iscsi.c
index XXXXXXX..XXXXXXX 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -XXX,XX +XXX,XX @@ iscsi_process_read(void *arg)
     IscsiLun *iscsilun = arg;
     struct iscsi_context *iscsi = iscsilun->iscsi;
 
+    aio_context_acquire(iscsilun->aio_context);
     iscsi_service(iscsi, POLLIN);
     iscsi_set_events(iscsilun);
+    aio_context_release(iscsilun->aio_context);
 }
 
 static void
@@ -XXX,XX +XXX,XX @@ iscsi_process_write(void *arg)
     IscsiLun *iscsilun = arg;
     struct iscsi_context *iscsi = iscsilun->iscsi;
 
+    aio_context_acquire(iscsilun->aio_context);
     iscsi_service(iscsi, POLLOUT);
     iscsi_set_events(iscsilun);
+    aio_context_release(iscsilun->aio_context);
 }
 
 static int64_t sector_lun2qemu(int64_t sector, IscsiLun *iscsilun)
diff --git a/block/linux-aio.c b/block/linux-aio.c
index XXXXXXX..XXXXXXX 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -XXX,XX +XXX,XX @@ static void qemu_laio_completion_cb(EventNotifier *e)
     LinuxAioState *s = container_of(e, LinuxAioState, e);
 
     if (event_notifier_test_and_clear(&s->e)) {
+        aio_context_acquire(s->aio_context);
         qemu_laio_process_completions_and_submit(s);
+        aio_context_release(s->aio_context);
     }
 }
 
@@ -XXX,XX +XXX,XX @@ static bool qemu_laio_poll_cb(void *opaque)
         return false;
     }
 
+    aio_context_acquire(s->aio_context);
     qemu_laio_process_completions_and_submit(s);
+    aio_context_release(s->aio_context);
     return true;
 }
 
diff --git a/block/nfs.c b/block/nfs.c
index XXXXXXX..XXXXXXX 100644
--- a/block/nfs.c
+++ b/block/nfs.c
@@ -XXX,XX +XXX,XX @@ static void nfs_set_events(NFSClient *client)
 static void nfs_process_read(void *arg)
 {
     NFSClient *client = arg;
+
+    aio_context_acquire(client->aio_context);
     nfs_service(client->context, POLLIN);
     nfs_set_events(client);
+    aio_context_release(client->aio_context);
 }
 
 static void nfs_process_write(void *arg)
 {
     NFSClient *client = arg;
+
+    aio_context_acquire(client->aio_context);
     nfs_service(client->context, POLLOUT);
     nfs_set_events(client);
+    aio_context_release(client->aio_context);
 }
 
 static void nfs_co_init_task(BlockDriverState *bs, NFSRPC *task)
diff --git a/block/sheepdog.c b/block/sheepdog.c
index XXXXXXX..XXXXXXX 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -XXX,XX +XXX,XX @@ static coroutine_fn int send_co_req(int sockfd, SheepdogReq *hdr, void *data,
     return ret;
 }
 
-static void restart_co_req(void *opaque)
-{
-    Coroutine *co = opaque;
-
-    qemu_coroutine_enter(co);
-}
-
 typedef struct SheepdogReqCo {
     int sockfd;
     BlockDriverState *bs;
@@ -XXX,XX +XXX,XX @@ typedef struct SheepdogReqCo {
     unsigned int *rlen;
     int ret;
     bool finished;
+    Coroutine *co;
 } SheepdogReqCo;
 
+static void restart_co_req(void *opaque)
+{
+    SheepdogReqCo *srco = opaque;
+
+    aio_co_wake(srco->co);
+}
+
 static coroutine_fn void do_co_req(void *opaque)
 {
     int ret;
-    Coroutine *co;
     SheepdogReqCo *srco = opaque;
     int sockfd = srco->sockfd;
     SheepdogReq *hdr = srco->hdr;
@@ -XXX,XX +XXX,XX @@ static coroutine_fn void do_co_req(void *opaque)
     unsigned int *wlen = srco->wlen;
     unsigned int *rlen = srco->rlen;
 
-    co = qemu_coroutine_self();
+    srco->co = qemu_coroutine_self();
     aio_set_fd_handler(srco->aio_context, sockfd, false,
-                       NULL, restart_co_req, NULL, co);
+                       NULL, restart_co_req, NULL, srco);
 
     ret = send_co_req(sockfd, hdr, data, wlen);
     if (ret < 0) {
@@ -XXX,XX +XXX,XX @@ static coroutine_fn void do_co_req(void *opaque)
     }
 
     aio_set_fd_handler(srco->aio_context, sockfd, false,
-                       restart_co_req, NULL, NULL, co);
+                       restart_co_req, NULL, NULL, srco);
 
     ret = qemu_co_recv(sockfd, hdr, sizeof(*hdr));
     if (ret != sizeof(*hdr)) {
@@ -XXX,XX +XXX,XX @@ out:
     aio_set_fd_handler(srco->aio_context, sockfd, false,
                        NULL, NULL, NULL, NULL);
 
+    srco->co = NULL;
     srco->ret = ret;
     srco->finished = true;
     if (srco->bs) {
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn aio_read_response(void *opaque)
          * We've finished all requests which belong to the AIOCB, so
          * we can switch back to sd_co_readv/writev now.
          */
-        qemu_coroutine_enter(acb->coroutine);
+        aio_co_wake(acb->coroutine);
     }
 
     return;
@@ -XXX,XX +XXX,XX @@ static void co_read_response(void *opaque)
         s->co_recv = qemu_coroutine_create(aio_read_response, opaque);
     }
 
-    qemu_coroutine_enter(s->co_recv);
+    aio_co_wake(s->co_recv);
 }
 
 static void co_write_request(void *opaque)
 {
     BDRVSheepdogState *s = opaque;
 
-    qemu_coroutine_enter(s->co_send);
+    aio_co_wake(s->co_send);
 }
 
 /*
diff --git a/block/ssh.c b/block/ssh.c
index XXXXXXX..XXXXXXX 100644
--- a/block/ssh.c
+++ b/block/ssh.c
@@ -XXX,XX +XXX,XX @@ static void restart_coroutine(void *opaque)
 
     DPRINTF("co=%p", co);
 
-    qemu_coroutine_enter(co);
+    aio_co_wake(co);
 }
 
-static coroutine_fn void set_fd_handler(BDRVSSHState *s, BlockDriverState *bs)
+/* A non-blocking call returned EAGAIN, so yield, ensuring the
+ * handlers are set up so that we'll be rescheduled when there is an
+ * interesting event on the socket.
+ */
+static coroutine_fn void co_yield(BDRVSSHState *s, BlockDriverState *bs)
 {
     int r;
     IOHandler *rd_handler = NULL, *wr_handler = NULL;
@@ -XXX,XX +XXX,XX @@ static coroutine_fn void set_fd_handler(BDRVSSHState *s, BlockDriverState *bs)
 
     aio_set_fd_handler(bdrv_get_aio_context(bs), s->sock,
                        false, rd_handler, wr_handler, NULL, co);
-}
-
-static coroutine_fn void clear_fd_handler(BDRVSSHState *s,
-                                          BlockDriverState *bs)
-{
-    DPRINTF("s->sock=%d", s->sock);
-    aio_set_fd_handler(bdrv_get_aio_context(bs), s->sock,
-                       false, NULL, NULL, NULL, NULL);
-}
-
-/* A non-blocking call returned EAGAIN, so yield, ensuring the
- * handlers are set up so that we'll be rescheduled when there is an
- * interesting event on the socket.
- */
-static coroutine_fn void co_yield(BDRVSSHState *s, BlockDriverState *bs)
-{
-    set_fd_handler(s, bs);
     qemu_coroutine_yield();
-    clear_fd_handler(s, bs);
+    DPRINTF("s->sock=%d - back", s->sock);
+    aio_set_fd_handler(bdrv_get_aio_context(bs), s->sock, false,
+                       NULL, NULL, NULL, NULL);
 }
 
 /* SFTP has a function `libssh2_sftp_seek64' which seeks to a position
diff --git a/block/win32-aio.c b/block/win32-aio.c
index XXXXXXX..XXXXXXX 100644
--- a/block/win32-aio.c
+++ b/block/win32-aio.c
@@ -XXX,XX +XXX,XX @@ struct QEMUWin32AIOState {
     HANDLE hIOCP;
     EventNotifier e;
     int count;
-    bool is_aio_context_attached;
+    AioContext *aio_ctx;
 };
 
 typedef struct QEMUWin32AIOCB {
@@ -XXX,XX +XXX,XX @@ static void win32_aio_process_completion(QEMUWin32AIOState *s,
     }
 
 
+    aio_context_acquire(s->aio_ctx);
     waiocb->common.cb(waiocb->common.opaque, ret);
+    aio_context_release(s->aio_ctx);
     qemu_aio_unref(waiocb);
 }
 
@@ -XXX,XX +XXX,XX @@ void win32_aio_detach_aio_context(QEMUWin32AIOState *aio,
                                   AioContext *old_context)
 {
     aio_set_event_notifier(old_context, &aio->e, false, NULL, NULL);
-    aio->is_aio_context_attached = false;
+    aio->aio_ctx = NULL;
 }
 
 void win32_aio_attach_aio_context(QEMUWin32AIOState *aio,
                                   AioContext *new_context)
 {
-    aio->is_aio_context_attached = true;
+    aio->aio_ctx = new_context;
     aio_set_event_notifier(new_context, &aio->e, false,
                            win32_aio_completion_cb, NULL);
 }
@@ -XXX,XX +XXX,XX @@ out_free_state:
 
 void win32_aio_cleanup(QEMUWin32AIOState *aio)
 {
-    assert(!aio->is_aio_context_attached);
+    assert(!aio->aio_ctx);
     CloseHandle(aio->hIOCP);
     event_notifier_cleanup(&aio->e);
     g_free(aio);
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -XXX,XX +XXX,XX @@ static void virtio_blk_ioctl_complete(void *opaque, int status)
 {
     VirtIOBlockIoctlReq *ioctl_req = opaque;
     VirtIOBlockReq *req = ioctl_req->req;
-    VirtIODevice *vdev = VIRTIO_DEVICE(req->dev);
+    VirtIOBlock *s = req->dev;
+    VirtIODevice *vdev = VIRTIO_DEVICE(s);
     struct virtio_scsi_inhdr *scsi;
     struct sg_io_hdr *hdr;
 
@@ -XXX,XX +XXX,XX @@ bool virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq)
     MultiReqBuffer mrb = {};
     bool progress = false;
 
+    aio_context_acquire(blk_get_aio_context(s->blk));
     blk_io_plug(s->blk);
 
     do {
@@ -XXX,XX +XXX,XX @@ bool virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq)
     }
 
     blk_io_unplug(s->blk);
+    aio_context_release(blk_get_aio_context(s->blk));
     return progress;
 }
 
diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -XXX,XX +XXX,XX @@ bool virtio_scsi_handle_ctrl_vq(VirtIOSCSI *s, VirtQueue *vq)
     VirtIOSCSIReq *req;
     bool progress = false;
 
+    virtio_scsi_acquire(s);
     while ((req = virtio_scsi_pop_req(s, vq))) {
         progress = true;
         virtio_scsi_handle_ctrl_req(s, req);
     }
+    virtio_scsi_release(s);
     return progress;
 }
 
@@ -XXX,XX +XXX,XX @@ bool virtio_scsi_handle_cmd_vq(VirtIOSCSI *s, VirtQueue *vq)
 
     QTAILQ_HEAD(, VirtIOSCSIReq) reqs = QTAILQ_HEAD_INITIALIZER(reqs);
 
+    virtio_scsi_acquire(s);
     do {
         virtio_queue_set_notification(vq, 0);
 
@@ -XXX,XX +XXX,XX @@ bool virtio_scsi_handle_cmd_vq(VirtIOSCSI *s, VirtQueue *vq)
     QTAILQ_FOREACH_SAFE(req, &reqs, next, next) {
         virtio_scsi_handle_cmd_req_submit(s, req);
     }
+    virtio_scsi_release(s);
     return progress;
 }
 
@@ -XXX,XX +XXX,XX @@ out:
 
 bool virtio_scsi_handle_event_vq(VirtIOSCSI *s, VirtQueue *vq)
 {
+    virtio_scsi_acquire(s);
     if (s->events_dropped) {
         virtio_scsi_push_event(s, NULL, VIRTIO_SCSI_T_NO_EVENT, 0);
+        virtio_scsi_release(s);
         return true;
     }
+    virtio_scsi_release(s);
     return false;
 }
 
diff --git a/util/aio-posix.c b/util/aio-posix.c
index XXXXXXX..XXXXXXX 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -XXX,XX +XXX,XX @@ static bool aio_dispatch_handlers(AioContext *ctx)
             (revents & (G_IO_IN | G_IO_HUP | G_IO_ERR)) &&
             aio_node_check(ctx, node->is_external) &&
             node->io_read) {
-            aio_context_acquire(ctx);
             node->io_read(node->opaque);
-            aio_context_release(ctx);
 
             /* aio_notify() does not count as progress */
             if (node->opaque != &ctx->notifier) {
@@ -XXX,XX +XXX,XX @@ static bool aio_dispatch_handlers(AioContext *ctx)
             (revents & (G_IO_OUT | G_IO_ERR)) &&
             aio_node_check(ctx, node->is_external) &&
             node->io_write) {
-            aio_context_acquire(ctx);
             node->io_write(node->opaque);
-            aio_context_release(ctx);
             progress = true;
         }
 
@@ -XXX,XX +XXX,XX @@ bool aio_poll(AioContext *ctx, bool blocking)
         start = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
     }
 
-    aio_context_acquire(ctx);
     progress = try_poll_mode(ctx, blocking);
-    aio_context_release(ctx);
-
     if (!progress) {
         assert(npfd == 0);
 
diff --git a/util/aio-win32.c b/util/aio-win32.c
index XXXXXXX..XXXXXXX 100644
--- a/util/aio-win32.c
+++ b/util/aio-win32.c
@@ -XXX,XX +XXX,XX @@ static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
             (revents || event_notifier_get_handle(node->e) == event) &&
             node->io_notify) {
             node->pfd.revents = 0;
-            aio_context_acquire(ctx);
             node->io_notify(node->e);
-            aio_context_release(ctx);
 
             /* aio_notify() does not count as progress */
             if (node->e != &ctx->notifier) {
@@ -XXX,XX +XXX,XX @@ static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
             (node->io_read || node->io_write)) {
             node->pfd.revents = 0;
             if ((revents & G_IO_IN) && node->io_read) {
-                aio_context_acquire(ctx);
                 node->io_read(node->opaque);
-                aio_context_release(ctx);
                 progress = true;
             }
             if ((revents & G_IO_OUT) && node->io_write) {
-                aio_context_acquire(ctx);
                 node->io_write(node->opaque);
-                aio_context_release(ctx);
                 progress = true;
             }
 
-- 
2.9.3
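
The hunks above all apply one convention, so a condensed sketch may help. This is an illustration only, not code from the series; MyState, my_state_service() and my_process_read() are hypothetical stand-ins for a driver's state and processing functions:

    /* Sketch of the callback-side locking convention: once
     * aio_dispatch()/aio_poll() stop wrapping handlers in
     * aio_context_acquire()/release(), every file-descriptor callback
     * that touches shared driver state must take the lock itself.
     */
    #include "qemu/osdep.h"
    #include "block/aio.h"

    typedef struct MyState {
        AioContext *aio_context;    /* context this driver is bound to */
        /* ... driver fields ... */
    } MyState;

    static void my_state_service(MyState *s);  /* driver-specific work */

    static void my_process_read(void *opaque)
    {
        MyState *s = opaque;

        aio_context_acquire(s->aio_context);  /* formerly taken by aio_poll() */
        my_state_service(s);                  /* may complete requests */
        aio_context_release(s->aio_context);
    }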
+static void restart_listener_bh(void *opaque)
+{
+    VuServer *server = opaque;
+
+    qio_net_listener_set_client_func(server->listener, vu_accept, server,
+                                     NULL);
 }
 
-void vhost_user_server_set_aio_context(VuServer *server, AioContext *ctx)
+/* Called with ctx acquired */
+void vhost_user_server_attach_aio_context(VuServer *server, AioContext *ctx)
 {
-    VuFdWatch *vu_fd_watch, *next;
-    void *opaque = NULL;
-    IOHandler *io_read = NULL;
-    bool attach;
+    VuFdWatch *vu_fd_watch;
 
-    server->ctx = ctx ? ctx : qemu_get_aio_context();
+    server->ctx = ctx;
 
     if (!server->sioc) {
-        /* not yet serving any client*/
         return;
     }
 
-    if (ctx) {
-        qio_channel_attach_aio_context(server->ioc, ctx);
-        server->aio_context_changed = true;
-        io_read = kick_handler;
-        attach = true;
-    } else {
+    qio_channel_attach_aio_context(server->ioc, ctx);
+
+    QTAILQ_FOREACH(vu_fd_watch, &server->vu_fd_watches, next) {
+        aio_set_fd_handler(ctx, vu_fd_watch->fd, true, kick_handler, NULL,
+                           NULL, vu_fd_watch);
+    }
+
+    aio_co_schedule(ctx, server->co_trip);
+}
+
+/* Called with server->ctx acquired */
+void vhost_user_server_detach_aio_context(VuServer *server)
+{
+    if (server->sioc) {
+        VuFdWatch *vu_fd_watch;
+
+        QTAILQ_FOREACH(vu_fd_watch, &server->vu_fd_watches, next) {
+            aio_set_fd_handler(server->ctx, vu_fd_watch->fd, true,
+                               NULL, NULL, NULL, vu_fd_watch);
+        }
+
         qio_channel_detach_aio_context(server->ioc);
-        /* server->ioc->ctx keeps the old AioConext */
-        ctx = server->ioc->ctx;
-        attach = false;
     }
 
-    QTAILQ_FOREACH_SAFE(vu_fd_watch, &server->vu_fd_watches, next, next) {
-        if (vu_fd_watch->cb) {
-            opaque = attach ? vu_fd_watch : NULL;
-            aio_set_fd_handler(ctx, vu_fd_watch->fd, true,
-                               io_read, NULL, NULL,
-                               opaque);
-        }
-    }
+    server->ctx = NULL;
 }
 
-
 bool vhost_user_server_start(VuServer *server,
                              SocketAddress *socket_addr,
                              AioContext *ctx,
@@ -XXX,XX +XXX,XX @@ bool vhost_user_server_start(VuServer *server,
                              const VuDevIface *vu_iface,
                              Error **errp)
 {
+    QEMUBH *bh;
     QIONetListener *listener = qio_net_listener_new();
     if (qio_net_listener_open_sync(listener, socket_addr, 1,
                                    errp) < 0) {
@@ -XXX,XX +XXX,XX @@ bool vhost_user_server_start(VuServer *server,
         return false;
     }
 
+    bh = qemu_bh_new(restart_listener_bh, server);
+
     /* zero out unspecified fields */
     *server = (VuServer) {
         .listener            = listener,
+        .restart_listener_bh = bh,
         .vu_iface            = vu_iface,
         .max_queues          = max_queues,
         .ctx                 = ctx,
-- 
2.26.2
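
These two entry points are designed to be driven from a BlockBackend AioContext-change notifier. A sketch of the wiring, under the assumption that the export owns both the BlockBackend and the VuServer (the callback names mirror the vhost-user-blk export later in this series; treat this as an outline, not series code):

    /* Fires when the BlockBackend moves to a new AioContext, e.g. when an
     * IOThread is assigned to the device.
     */
    static void blk_aio_attached(AioContext *ctx, void *opaque)
    {
        VuServer *server = opaque;

        vhost_user_server_attach_aio_context(server, ctx);
    }

    static void blk_aio_detach(void *opaque)
    {
        VuServer *server = opaque;

        vhost_user_server_detach_aio_context(server);
    }

    /* Registration, done once at export creation time: */
    static void export_setup(BlockBackend *blk, VuServer *server)
    {
        blk_add_aio_context_notifier(blk, blk_aio_attached, blk_aio_detach,
                                     server);
    }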
From: Paolo Bonzini <pbonzini@redhat.com>

This is in preparation for making qio_channel_yield work on
AioContexts other than the main one.

Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20170213135235.12274-6-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/io/channel.h | 25 +++++++++++++++++++++++++
 io/channel-command.c | 13 +++++++++++++
 io/channel-file.c    | 11 +++++++++++
 io/channel-socket.c  | 16 +++++++++++-----
 io/channel-tls.c     | 12 ++++++++++++
 io/channel-watch.c   |  6 ++++++
 io/channel.c         | 11 +++++++++++
 7 files changed, 89 insertions(+), 5 deletions(-)

diff --git a/include/io/channel.h b/include/io/channel.h
index XXXXXXX..XXXXXXX 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -XXX,XX +XXX,XX @@
 
 #include "qemu-common.h"
 #include "qom/object.h"
+#include "block/aio.h"
 
 #define TYPE_QIO_CHANNEL "qio-channel"
 #define QIO_CHANNEL(obj) \
@@ -XXX,XX +XXX,XX @@ struct QIOChannelClass {
                        off_t offset,
                        int whence,
                        Error **errp);
+    void (*io_set_aio_fd_handler)(QIOChannel *ioc,
+                                  AioContext *ctx,
+                                  IOHandler *io_read,
+                                  IOHandler *io_write,
+                                  void *opaque);
 };
 
 /* General I/O handling functions */
@@ -XXX,XX +XXX,XX @@ void qio_channel_yield(QIOChannel *ioc,
 void qio_channel_wait(QIOChannel *ioc,
                       GIOCondition condition);
 
+/**
+ * qio_channel_set_aio_fd_handler:
+ * @ioc: the channel object
+ * @ctx: the AioContext to set the handlers on
+ * @io_read: the read handler
+ * @io_write: the write handler
+ * @opaque: the opaque value passed to the handler
+ *
+ * This is used internally by qio_channel_yield(). It can
+ * be used by channel implementations to forward the handlers
+ * to another channel (e.g. from #QIOChannelTLS to the
+ * underlying socket).
+ */
+void qio_channel_set_aio_fd_handler(QIOChannel *ioc,
+                                    AioContext *ctx,
+                                    IOHandler *io_read,
+                                    IOHandler *io_write,
+                                    void *opaque);
+
 #endif /* QIO_CHANNEL_H */
diff --git a/io/channel-command.c b/io/channel-command.c
index XXXXXXX..XXXXXXX 100644
--- a/io/channel-command.c
+++ b/io/channel-command.c
@@ -XXX,XX +XXX,XX @@ static int qio_channel_command_close(QIOChannel *ioc,
 }
 
 
+static void qio_channel_command_set_aio_fd_handler(QIOChannel *ioc,
+                                                   AioContext *ctx,
+                                                   IOHandler *io_read,
+                                                   IOHandler *io_write,
+                                                   void *opaque)
+{
+    QIOChannelCommand *cioc = QIO_CHANNEL_COMMAND(ioc);
+    aio_set_fd_handler(ctx, cioc->readfd, false, io_read, NULL, NULL, opaque);
+    aio_set_fd_handler(ctx, cioc->writefd, false, NULL, io_write, NULL, opaque);
+}
+
+
 static GSource *qio_channel_command_create_watch(QIOChannel *ioc,
                                                  GIOCondition condition)
 {
@@ -XXX,XX +XXX,XX @@ static void qio_channel_command_class_init(ObjectClass *klass,
     ioc_klass->io_set_blocking = qio_channel_command_set_blocking;
     ioc_klass->io_close = qio_channel_command_close;
     ioc_klass->io_create_watch = qio_channel_command_create_watch;
+    ioc_klass->io_set_aio_fd_handler = qio_channel_command_set_aio_fd_handler;
 }
 
 static const TypeInfo qio_channel_command_info = {
diff --git a/io/channel-file.c b/io/channel-file.c
index XXXXXXX..XXXXXXX 100644
--- a/io/channel-file.c
+++ b/io/channel-file.c
@@ -XXX,XX +XXX,XX @@ static int qio_channel_file_close(QIOChannel *ioc,
 }
 
 
+static void qio_channel_file_set_aio_fd_handler(QIOChannel *ioc,
+                                                AioContext *ctx,
+                                                IOHandler *io_read,
+                                                IOHandler *io_write,
+                                                void *opaque)
+{
+    QIOChannelFile *fioc = QIO_CHANNEL_FILE(ioc);
+    aio_set_fd_handler(ctx, fioc->fd, false, io_read, io_write, NULL, opaque);
+}
+
 static GSource *qio_channel_file_create_watch(QIOChannel *ioc,
                                               GIOCondition condition)
 {
@@ -XXX,XX +XXX,XX @@ static void qio_channel_file_class_init(ObjectClass *klass,
     ioc_klass->io_seek = qio_channel_file_seek;
     ioc_klass->io_close = qio_channel_file_close;
     ioc_klass->io_create_watch = qio_channel_file_create_watch;
+    ioc_klass->io_set_aio_fd_handler = qio_channel_file_set_aio_fd_handler;
 }
 
 static const TypeInfo qio_channel_file_info = {
diff --git a/io/channel-socket.c b/io/channel-socket.c
index XXXXXXX..XXXXXXX 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -XXX,XX +XXX,XX @@ qio_channel_socket_set_blocking(QIOChannel *ioc,
         qemu_set_block(sioc->fd);
     } else {
         qemu_set_nonblock(sioc->fd);
-#ifdef WIN32
-        WSAEventSelect(sioc->fd, ioc->event,
-                       FD_READ | FD_ACCEPT | FD_CLOSE |
-                       FD_CONNECT | FD_WRITE | FD_OOB);
-#endif
     }
     return 0;
 }
@@ -XXX,XX +XXX,XX @@ qio_channel_socket_shutdown(QIOChannel *ioc,
     return 0;
 }
 
+static void qio_channel_socket_set_aio_fd_handler(QIOChannel *ioc,
+                                                  AioContext *ctx,
+                                                  IOHandler *io_read,
+                                                  IOHandler *io_write,
+                                                  void *opaque)
+{
+    QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
+    aio_set_fd_handler(ctx, sioc->fd, false, io_read, io_write, NULL, opaque);
+}
+
 static GSource *qio_channel_socket_create_watch(QIOChannel *ioc,
                                                 GIOCondition condition)
 {
@@ -XXX,XX +XXX,XX @@ static void qio_channel_socket_class_init(ObjectClass *klass,
     ioc_klass->io_set_cork = qio_channel_socket_set_cork;
     ioc_klass->io_set_delay = qio_channel_socket_set_delay;
     ioc_klass->io_create_watch = qio_channel_socket_create_watch;
+    ioc_klass->io_set_aio_fd_handler = qio_channel_socket_set_aio_fd_handler;
 }
 
 static const TypeInfo qio_channel_socket_info = {
diff --git a/io/channel-tls.c b/io/channel-tls.c
index XXXXXXX..XXXXXXX 100644
--- a/io/channel-tls.c
+++ b/io/channel-tls.c
@@ -XXX,XX +XXX,XX @@ static int qio_channel_tls_close(QIOChannel *ioc,
     return qio_channel_close(tioc->master, errp);
 }
 
+static void qio_channel_tls_set_aio_fd_handler(QIOChannel *ioc,
+                                               AioContext *ctx,
+                                               IOHandler *io_read,
+                                               IOHandler *io_write,
+                                               void *opaque)
+{
+    QIOChannelTLS *tioc = QIO_CHANNEL_TLS(ioc);
+
+    qio_channel_set_aio_fd_handler(tioc->master, ctx, io_read, io_write, opaque);
+}
+
 static GSource *qio_channel_tls_create_watch(QIOChannel *ioc,
                                              GIOCondition condition)
 {
@@ -XXX,XX +XXX,XX @@ static void qio_channel_tls_class_init(ObjectClass *klass,
     ioc_klass->io_close = qio_channel_tls_close;
     ioc_klass->io_shutdown = qio_channel_tls_shutdown;
     ioc_klass->io_create_watch = qio_channel_tls_create_watch;
+    ioc_klass->io_set_aio_fd_handler = qio_channel_tls_set_aio_fd_handler;
 }
 
 static const TypeInfo qio_channel_tls_info = {
diff --git a/io/channel-watch.c b/io/channel-watch.c
index XXXXXXX..XXXXXXX 100644
--- a/io/channel-watch.c
+++ b/io/channel-watch.c
@@ -XXX,XX +XXX,XX @@ GSource *qio_channel_create_socket_watch(QIOChannel *ioc,
     GSource *source;
     QIOChannelSocketSource *ssource;
 
+#ifdef WIN32
+    WSAEventSelect(socket, ioc->event,
+                   FD_READ | FD_ACCEPT | FD_CLOSE |
+                   FD_CONNECT | FD_WRITE | FD_OOB);
+#endif
+
     source = g_source_new(&qio_channel_socket_source_funcs,
                           sizeof(QIOChannelSocketSource));
     ssource = (QIOChannelSocketSource *)source;
diff --git a/io/channel.c b/io/channel.c
index XXXXXXX..XXXXXXX 100644
--- a/io/channel.c
+++ b/io/channel.c
@@ -XXX,XX +XXX,XX @@ GSource *qio_channel_create_watch(QIOChannel *ioc,
 }
 
 
+void qio_channel_set_aio_fd_handler(QIOChannel *ioc,
+                                    AioContext *ctx,
+                                    IOHandler *io_read,
+                                    IOHandler *io_write,
+                                    void *opaque)
+{
+    QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);
+
+    klass->io_set_aio_fd_handler(ioc, ctx, io_read, io_write, opaque);
+}
+
 guint qio_channel_add_watch(QIOChannel *ioc,
                             GIOCondition condition,
                             QIOChannelFunc func,
-- 
2.9.3
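
A channel implementation only has to translate the two handlers onto its file descriptor(s), as the command/file/socket implementations above do. For illustration, here is what the hook could look like for a hypothetical single-fd channel type; QIOChannelPipe, its cast macro and its fd field are invented for this sketch and are not part of the series:

    static void qio_channel_pipe_set_aio_fd_handler(QIOChannel *ioc,
                                                    AioContext *ctx,
                                                    IOHandler *io_read,
                                                    IOHandler *io_write,
                                                    void *opaque)
    {
        QIOChannelPipe *pioc = QIO_CHANNEL_PIPE(ioc);

        /* One fd carries both directions, so both handlers go on it.
         * The "false" argument marks the fd as internal, matching the
         * implementations in this patch.
         */
        aio_set_fd_handler(ctx, pioc->fd, false, io_read, io_write, NULL,
                           opaque);
    }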
Propagate the flush return value since errors are possible.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20200924151549.913737-11-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/export/vhost-user-blk-server.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/block/export/vhost-user-blk-server.c b/block/export/vhost-user-blk-server.c
index XXXXXXX..XXXXXXX 100644
--- a/block/export/vhost-user-blk-server.c
+++ b/block/export/vhost-user-blk-server.c
@@ -XXX,XX +XXX,XX @@ vu_block_discard_write_zeroes(VuBlockReq *req, struct iovec *iov,
     return -EINVAL;
 }
 
-static void coroutine_fn vu_block_flush(VuBlockReq *req)
+static int coroutine_fn vu_block_flush(VuBlockReq *req)
 {
     VuBlockDev *vdev_blk = get_vu_block_device_by_server(req->server);
     BlockBackend *backend = vdev_blk->backend;
-    blk_co_flush(backend);
+    return blk_co_flush(backend);
 }
 
 static void coroutine_fn vu_block_virtio_process_req(void *opaque)
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn vu_block_virtio_process_req(void *opaque)
         break;
     }
     case VIRTIO_BLK_T_FLUSH:
-        vu_block_flush(req);
-        req->in->status = VIRTIO_BLK_S_OK;
+        if (vu_block_flush(req) == 0) {
+            req->in->status = VIRTIO_BLK_S_OK;
+        } else {
+            req->in->status = VIRTIO_BLK_S_IOERR;
+        }
         break;
     case VIRTIO_BLK_T_GET_ID: {
         size_t size = MIN(iov_size(&elem->in_sg[0], in_num),
-- 
2.26.2
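
The general shape of the fix, shown as a standalone sketch rather than series code (handle_flush is a hypothetical helper): a coroutine helper returns 0 or a negative errno, and the request handler maps that onto the virtio status byte instead of assuming success.

    static void coroutine_fn handle_flush(VuBlockReq *req, BlockBackend *blk)
    {
        int ret = blk_co_flush(blk);    /* 0 on success, -errno on failure */

        req->in->status = (ret == 0) ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR;
    }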
From: Paolo Bonzini <pbonzini@redhat.com>

This uses the lock-free mutex described in the paper '"Blocking without
Locking", or LFTHREADS: A lock-free thread library' by Gidenstam and
Papatriantafilou. The same technique is used in OSv, and in fact
the code is essentially a conversion to C of OSv's code.

[Added missing coroutine_fn in tests/test-aio-multithread.c.
--Stefan]

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20170213181244.16297-2-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/qemu/coroutine.h     |  17 ++++-
 tests/test-aio-multithread.c |  86 ++++++++++++++++++++++++
 util/qemu-coroutine-lock.c   | 155 ++++++++++++++++++++++++++++++++++++++++---
 util/trace-events            |   1 +
 4 files changed, 246 insertions(+), 13 deletions(-)

diff --git a/include/qemu/coroutine.h b/include/qemu/coroutine.h
index XXXXXXX..XXXXXXX 100644
--- a/include/qemu/coroutine.h
+++ b/include/qemu/coroutine.h
@@ -XXX,XX +XXX,XX @@ bool qemu_co_queue_empty(CoQueue *queue);
 /**
  * Provides a mutex that can be used to synchronise coroutines
  */
+struct CoWaitRecord;
 typedef struct CoMutex {
-    bool locked;
+    /* Count of pending lockers; 0 for a free mutex, 1 for an
+     * uncontended mutex.
+     */
+    unsigned locked;
+
+    /* A queue of waiters.  Elements are added atomically in front of
+     * from_push.  to_pop is only populated, and popped from, by whoever
+     * is in charge of the next wakeup.  This can be an unlocker or,
+     * through the handoff protocol, a locker that is about to go to sleep.
+     */
+    QSLIST_HEAD(, CoWaitRecord) from_push, to_pop;
+
+    unsigned handoff, sequence;
+
     Coroutine *holder;
-    CoQueue queue;
 } CoMutex;
 
 /**
diff --git a/tests/test-aio-multithread.c b/tests/test-aio-multithread.c
index XXXXXXX..XXXXXXX 100644
--- a/tests/test-aio-multithread.c
+++ b/tests/test-aio-multithread.c
@@ -XXX,XX +XXX,XX @@ static void test_multi_co_schedule_10(void)
     test_multi_co_schedule(10);
 }
 
+/* CoMutex thread-safety.  */
+
+static uint32_t atomic_counter;
+static uint32_t running;
+static uint32_t counter;
+static CoMutex comutex;
+
+static void coroutine_fn test_multi_co_mutex_entry(void *opaque)
+{
+    while (!atomic_mb_read(&now_stopping)) {
+        qemu_co_mutex_lock(&comutex);
+        counter++;
+        qemu_co_mutex_unlock(&comutex);
+
+        /* Increase atomic_counter *after* releasing the mutex.  Otherwise
+         * there is a chance (it happens about 1 in 3 runs) that the iothread
+         * exits before the coroutine is woken up, causing a spurious
+         * assertion failure.
+         */
+        atomic_inc(&atomic_counter);
+    }
+    atomic_dec(&running);
+}
+
+static void test_multi_co_mutex(int threads, int seconds)
+{
+    int i;
+
+    qemu_co_mutex_init(&comutex);
+    counter = 0;
+    atomic_counter = 0;
+    now_stopping = false;
+
+    create_aio_contexts();
+    assert(threads <= NUM_CONTEXTS);
+    running = threads;
+    for (i = 0; i < threads; i++) {
+        Coroutine *co1 = qemu_coroutine_create(test_multi_co_mutex_entry, NULL);
+        aio_co_schedule(ctx[i], co1);
+    }
+
+    g_usleep(seconds * 1000000);
+
+    atomic_mb_set(&now_stopping, true);
+    while (running > 0) {
+        g_usleep(100000);
+    }
+
+    join_aio_contexts();
+    g_test_message("%d iterations/second\n", counter / seconds);
+    g_assert_cmpint(counter, ==, atomic_counter);
+}
+
+/* Testing with NUM_CONTEXTS threads focuses on the queue.  The mutex however
+ * is too contended (and the threads spend too much time in aio_poll)
+ * to actually stress the handoff protocol.
+ */
+static void test_multi_co_mutex_1(void)
+{
+    test_multi_co_mutex(NUM_CONTEXTS, 1);
+}
+
+static void test_multi_co_mutex_10(void)
+{
+    test_multi_co_mutex(NUM_CONTEXTS, 10);
+}
+
+/* Testing with fewer threads stresses the handoff protocol too.  Still, the
+ * case where the locker _can_ pick up a handoff is very rare, happening
+ * about 10 times in 1 million, so increase the runtime a bit compared to
+ * other "quick" testcases that only run for 1 second.
+ */
+static void test_multi_co_mutex_2_3(void)
+{
+    test_multi_co_mutex(2, 3);
+}
+
+static void test_multi_co_mutex_2_30(void)
+{
+    test_multi_co_mutex(2, 30);
+}
+
 /* End of tests. */
 
 int main(int argc, char **argv)
@@ -XXX,XX +XXX,XX @@ int main(int argc, char **argv)
     g_test_add_func("/aio/multi/lifecycle", test_lifecycle);
     if (g_test_quick()) {
         g_test_add_func("/aio/multi/schedule", test_multi_co_schedule_1);
+        g_test_add_func("/aio/multi/mutex/contended", test_multi_co_mutex_1);
+        g_test_add_func("/aio/multi/mutex/handoff", test_multi_co_mutex_2_3);
     } else {
         g_test_add_func("/aio/multi/schedule", test_multi_co_schedule_10);
+        g_test_add_func("/aio/multi/mutex/contended", test_multi_co_mutex_10);
+        g_test_add_func("/aio/multi/mutex/handoff", test_multi_co_mutex_2_30);
     }
     return g_test_run();
 }
diff --git a/util/qemu-coroutine-lock.c b/util/qemu-coroutine-lock.c
index XXXXXXX..XXXXXXX 100644
--- a/util/qemu-coroutine-lock.c
+++ b/util/qemu-coroutine-lock.c
@@ -XXX,XX +XXX,XX @@
  * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
  * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
  * THE SOFTWARE.
+ *
+ * The lock-free mutex implementation is based on OSv
+ * (core/lfmutex.cc, include/lockfree/mutex.hh).
+ * Copyright (C) 2013 Cloudius Systems, Ltd.
  */
 
 #include "qemu/osdep.h"
@@ -XXX,XX +XXX,XX @@ bool qemu_co_queue_empty(CoQueue *queue)
     return QSIMPLEQ_FIRST(&queue->entries) == NULL;
 }
 
+/* The wait records are handled with a multiple-producer, single-consumer
+ * lock-free queue.  There cannot be two concurrent pop_waiter() calls
+ * because pop_waiter() can only be called while mutex->handoff is zero.
+ * This can happen in three cases:
+ * - in qemu_co_mutex_unlock, before the hand-off protocol has started.
+ *   In this case, qemu_co_mutex_lock will see mutex->handoff == 0 and
+ *   not take part in the handoff.
+ * - in qemu_co_mutex_lock, if it steals the hand-off responsibility from
+ *   qemu_co_mutex_unlock.  In this case, qemu_co_mutex_unlock will fail
+ *   the cmpxchg (it will see either 0 or the next sequence value) and
+ *   exit.  The next hand-off cannot begin until qemu_co_mutex_lock has
+ *   woken up someone.
+ * - in qemu_co_mutex_unlock, if it takes the hand-off token itself.
+ *   In this case another iteration starts with mutex->handoff == 0;
+ *   a concurrent qemu_co_mutex_lock will fail the cmpxchg, and
+ *   qemu_co_mutex_unlock will go back to case (1).
+ *
+ * The following functions manage this queue.
+ */
+typedef struct CoWaitRecord {
+    Coroutine *co;
+    QSLIST_ENTRY(CoWaitRecord) next;
+} CoWaitRecord;
+
+static void push_waiter(CoMutex *mutex, CoWaitRecord *w)
+{
+    w->co = qemu_coroutine_self();
+    QSLIST_INSERT_HEAD_ATOMIC(&mutex->from_push, w, next);
+}
+
+static void move_waiters(CoMutex *mutex)
+{
+    QSLIST_HEAD(, CoWaitRecord) reversed;
+    QSLIST_MOVE_ATOMIC(&reversed, &mutex->from_push);
+    while (!QSLIST_EMPTY(&reversed)) {
+        CoWaitRecord *w = QSLIST_FIRST(&reversed);
+        QSLIST_REMOVE_HEAD(&reversed, next);
+        QSLIST_INSERT_HEAD(&mutex->to_pop, w, next);
+    }
+}
+
+static CoWaitRecord *pop_waiter(CoMutex *mutex)
+{
+    CoWaitRecord *w;
+
+    if (QSLIST_EMPTY(&mutex->to_pop)) {
+        move_waiters(mutex);
+        if (QSLIST_EMPTY(&mutex->to_pop)) {
+            return NULL;
+        }
+    }
+    w = QSLIST_FIRST(&mutex->to_pop);
+    QSLIST_REMOVE_HEAD(&mutex->to_pop, next);
+    return w;
+}
+
+static bool has_waiters(CoMutex *mutex)
+{
+    return QSLIST_EMPTY(&mutex->to_pop) || QSLIST_EMPTY(&mutex->from_push);
+}
+
 void qemu_co_mutex_init(CoMutex *mutex)
 {
     memset(mutex, 0, sizeof(*mutex));
-    qemu_co_queue_init(&mutex->queue);
 }
 
-void coroutine_fn qemu_co_mutex_lock(CoMutex *mutex)
+static void coroutine_fn qemu_co_mutex_lock_slowpath(CoMutex *mutex)
 {
     Coroutine *self = qemu_coroutine_self();
+    CoWaitRecord w;
+    unsigned old_handoff;
 
     trace_qemu_co_mutex_lock_entry(mutex, self);
+    w.co = self;
+    push_waiter(mutex, &w);
 
-    while (mutex->locked) {
-        qemu_co_queue_wait(&mutex->queue);
+    /* This is the "Responsibility Hand-Off" protocol; a lock() picks from
+     * a concurrent unlock() the responsibility of waking somebody up.
+     */
+    old_handoff = atomic_mb_read(&mutex->handoff);
+    if (old_handoff &&
+        has_waiters(mutex) &&
+        atomic_cmpxchg(&mutex->handoff, old_handoff, 0) == old_handoff) {
+        /* There can be no concurrent pops, because there can be only
+         * one active handoff at a time.
+         */
+        CoWaitRecord *to_wake = pop_waiter(mutex);
+        Coroutine *co = to_wake->co;
+        if (co == self) {
+            /* We got the lock ourselves! */
+            assert(to_wake == &w);
+            return;
+        }
+
+        aio_co_wake(co);
     }
 
-    mutex->locked = true;
-    mutex->holder = self;
-    self->locks_held++;
-
+    qemu_coroutine_yield();
     trace_qemu_co_mutex_lock_return(mutex, self);
 }
 
+void coroutine_fn qemu_co_mutex_lock(CoMutex *mutex)
+{
+    Coroutine *self = qemu_coroutine_self();
+
+    if (atomic_fetch_inc(&mutex->locked) == 0) {
+        /* Uncontended. */
+        trace_qemu_co_mutex_lock_uncontended(mutex, self);
+    } else {
+        qemu_co_mutex_lock_slowpath(mutex);
+    }
+    mutex->holder = self;
+    self->locks_held++;
+}
+
 void coroutine_fn qemu_co_mutex_unlock(CoMutex *mutex)
 {
     Coroutine *self = qemu_coroutine_self();
 
     trace_qemu_co_mutex_unlock_entry(mutex, self);
 
-    assert(mutex->locked == true);
+    assert(mutex->locked);
     assert(mutex->holder == self);
     assert(qemu_in_coroutine());
 
-    mutex->locked = false;
     mutex->holder = NULL;
     self->locks_held--;
-    qemu_co_queue_next(&mutex->queue);
+    if (atomic_fetch_dec(&mutex->locked) == 1) {
+        /* No waiting qemu_co_mutex_lock().  Pfew, that was easy! */
+        return;
+    }
+
+    for (;;) {
+        CoWaitRecord *to_wake = pop_waiter(mutex);
+        unsigned our_handoff;
+
+        if (to_wake) {
+            Coroutine *co = to_wake->co;
+            aio_co_wake(co);
+            break;
+        }
+
+        /* Some concurrent lock() is in progress (we know this because
+         * mutex->locked was >1) but it hasn't yet put itself on the wait
+         * queue.  Pick a sequence number for the handoff protocol (not 0).
+         */
+        if (++mutex->sequence == 0) {
+            mutex->sequence = 1;
+        }
+
+        our_handoff = mutex->sequence;
+        atomic_mb_set(&mutex->handoff, our_handoff);
+        if (!has_waiters(mutex)) {
+            /* The concurrent lock has not added itself yet, so it
+             * will be able to pick our handoff.
+             */
+            break;
+        }
+
+        /* Try to do the handoff protocol ourselves; if somebody else has
+         * already taken it, however, we're done and they're responsible.
+         */
+        if (atomic_cmpxchg(&mutex->handoff, our_handoff, 0) != our_handoff) {
+            break;
+        }
+    }
 
     trace_qemu_co_mutex_unlock_return(mutex, self);
 }
diff --git a/util/trace-events b/util/trace-events
index XXXXXXX..XXXXXXX 100644
--- a/util/trace-events

Use the new QAPI block exports API instead of defining our own QOM
objects.

This is a large change because the lifecycle of VuBlockDev needs to
follow BlockExportDriver. QOM properties are replaced by QAPI options
objects.

VuBlockDev is renamed VuBlkExport and contains a BlockExport field.
Several fields can be dropped since BlockExport already has equivalents.

The file names and meson build integration will be adjusted in a future
patch. libvhost-user should probably be built as a static library that
is linked into QEMU instead of as a .c file that results in duplicate
compilation.

The new command-line syntax is:

  $ qemu-storage-daemon \
      --blockdev file,node-name=drive0,filename=test.img \
      --export vhost-user-blk,node-name=drive0,id=export0,unix-socket=/tmp/vhost-user-blk.sock

Note that unix-socket is optional because we may wish to accept chardevs
too in the future.

Markus noted that supported address families are not explicit in the
QAPI schema. It is unlikely that support for more address families will
be added since file descriptor passing is required and few address
families support it. If a new address family needs to be added, then the
QAPI 'features' syntax can be used to advertize them.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Message-id: 20200924151549.913737-12-stefanha@redhat.com
[Skip test on big-endian host architectures because this device doesn't
support them yet (as already mentioned in a code comment).
--Stefan]
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 qapi/block-export.json               |  21 +-
 block/export/vhost-user-blk-server.h |  23 +-
 block/export/export.c                |   6 +
 block/export/vhost-user-blk-server.c | 452 +++++++--------------------
 util/vhost-user-server.c             |  10 +-
 block/export/meson.build             |   1 +
 block/meson.build                    |   1 -
 7 files changed, 156 insertions(+), 358 deletions(-)

diff --git a/qapi/block-export.json b/qapi/block-export.json
index XXXXXXX..XXXXXXX 100644
--- a/qapi/block-export.json
+++ b/qapi/block-export.json
@@ -XXX,XX +XXX,XX @@
   'data': { '*name': 'str', '*description': 'str',
             '*bitmap': 'str' } }
 
+##
+# @BlockExportOptionsVhostUserBlk:
+#
+# A vhost-user-blk block export.
+#
+# @addr: The vhost-user socket on which to listen. Both 'unix' and 'fd'
+#        SocketAddress types are supported. Passed fds must be UNIX domain
+#        sockets.
+# @logical-block-size: Logical block size in bytes. Defaults to 512 bytes.
+#
+# Since: 5.2
+##
+{ 'struct': 'BlockExportOptionsVhostUserBlk',
+  'data': { 'addr': 'SocketAddress', '*logical-block-size': 'size' } }
+
 ##
 # @NbdServerAddOptions:
 #
@@ -XXX,XX +XXX,XX @@
 # An enumeration of block export types
 #
 # @nbd: NBD export
+# @vhost-user-blk: vhost-user-blk export (since 5.2)
 #
 # Since: 4.2
 ##
 { 'enum': 'BlockExportType',
-  'data': [ 'nbd' ] }
+  'data': [ 'nbd', 'vhost-user-blk' ] }
 
 ##
 # @BlockExportOptions:
@@ -XXX,XX +XXX,XX @@
             '*writethrough': 'bool' },
   'discriminator': 'type',
   'data': {
-      'nbd': 'BlockExportOptionsNbd'
+      'nbd': 'BlockExportOptionsNbd',
+      'vhost-user-blk': 'BlockExportOptionsVhostUserBlk'
    } }
 
 ##
diff --git a/block/export/vhost-user-blk-server.h b/block/export/vhost-user-blk-server.h
index XXXXXXX..XXXXXXX 100644
--- a/block/export/vhost-user-blk-server.h
+++ b/block/export/vhost-user-blk-server.h
@@ -XXX,XX +XXX,XX @@
 
 #ifndef VHOST_USER_BLK_SERVER_H
 #define VHOST_USER_BLK_SERVER_H
-#include "util/vhost-user-server.h"
 
-typedef struct VuBlockDev VuBlockDev;
-#define TYPE_VHOST_USER_BLK_SERVER "vhost-user-blk-server"
-#define VHOST_USER_BLK_SERVER(obj) \
-    OBJECT_CHECK(VuBlockDev, obj, TYPE_VHOST_USER_BLK_SERVER)
+#include "block/export.h"
 
-/* vhost user block device */
-struct VuBlockDev {
-    Object parent_obj;
-    char *node_name;
-    SocketAddress *addr;
-    AioContext *ctx;
-    VuServer vu_server;
-    bool running;
-    uint32_t blk_size;
-    BlockBackend *backend;
-    QIOChannelSocket *sioc;
-    QTAILQ_ENTRY(VuBlockDev) next;
-    struct virtio_blk_config blkcfg;
-    bool writable;
-};
+/* For block/export/export.c */
+extern const BlockExportDriver blk_exp_vhost_user_blk;
 
 #endif /* VHOST_USER_BLK_SERVER_H */
diff --git a/block/export/export.c b/block/export/export.c
index XXXXXXX..XXXXXXX 100644
--- a/block/export/export.c
+++ b/block/export/export.c
@@ -XXX,XX +XXX,XX @@
 #include "sysemu/block-backend.h"
 #include "block/export.h"
 #include "block/nbd.h"
+#if CONFIG_LINUX
+#include "block/export/vhost-user-blk-server.h"
+#endif
 #include "qapi/error.h"
 #include "qapi/qapi-commands-block-export.h"
 #include "qapi/qapi-events-block-export.h"
@@ -XXX,XX +XXX,XX @@
 
 static const BlockExportDriver *blk_exp_drivers[] = {
     &blk_exp_nbd,
+#if CONFIG_LINUX
+    &blk_exp_vhost_user_blk,
+#endif
 };
 
 /* Only accessed from the main thread */
diff --git a/block/export/vhost-user-blk-server.c b/block/export/vhost-user-blk-server.c
index XXXXXXX..XXXXXXX 100644
--- a/block/export/vhost-user-blk-server.c
+++ b/block/export/vhost-user-blk-server.c
@@ -XXX,XX +XXX,XX @@
  */
 #include "qemu/osdep.h"
 #include "block/block.h"
+#include "contrib/libvhost-user/libvhost-user.h"
+#include "standard-headers/linux/virtio_blk.h"
+#include "util/vhost-user-server.h"
 #include "vhost-user-blk-server.h"
 #include "qapi/error.h"
 #include "qom/object_interfaces.h"
@@ -XXX,XX +XXX,XX @@ struct virtio_blk_inhdr {
     unsigned char status;
 };
 
-typedef struct VuBlockReq {
+typedef struct VuBlkReq {
     VuVirtqElement elem;
     int64_t sector_num;
     size_t size;
@@ -XXX,XX +XXX,XX @@ typedef struct VuBlockReq {
     struct virtio_blk_outhdr out;
     VuServer *server;
     struct VuVirtq *vq;
-} VuBlockReq;
+} VuBlkReq;
 
-static void vu_block_req_complete(VuBlockReq *req)
+/* vhost user block device */
+typedef struct {
+    BlockExport export;
+    VuServer vu_server;
+    uint32_t blk_size;
+    QIOChannelSocket *sioc;
+    struct virtio_blk_config blkcfg;
+    bool writable;
+} VuBlkExport;
+
+static void vu_blk_req_complete(VuBlkReq *req)
 {
     VuDev *vu_dev = &req->server->vu_dev;
 
@@ -XXX,XX +XXX,XX @@ static void vu_block_req_complete(VuBlockReq *req)
     free(req);
 }
 
-static VuBlockDev *get_vu_block_device_by_server(VuServer *server)
-{
-    return container_of(server, VuBlockDev, vu_server);
-}
-
 static int coroutine_fn
-vu_block_discard_write_zeroes(VuBlockReq *req, struct iovec *iov,
-                              uint32_t iovcnt, uint32_t type)
+vu_blk_discard_write_zeroes(BlockBackend *blk, struct iovec *iov,
+                            uint32_t iovcnt, uint32_t type)
 {
     struct virtio_blk_discard_write_zeroes desc;
     ssize_t size = iov_to_buf(iov, iovcnt, 0, &desc, sizeof(desc));
@@ -XXX,XX +XXX,XX @@ vu_block_discard_write_zeroes(VuBlockReq *req, struct iovec *iov,
         return -EINVAL;
     }
 
-    VuBlockDev *vdev_blk = get_vu_block_device_by_server(req->server);
     uint64_t range[2] = { le64_to_cpu(desc.sector) << 9,
                           le32_to_cpu(desc.num_sectors) << 9 };
     if (type == VIRTIO_BLK_T_DISCARD) {
-        if (blk_co_pdiscard(vdev_blk->backend, range[0], range[1]) == 0) {
+        if (blk_co_pdiscard(blk, range[0], range[1]) == 0) {
             return 0;
         }
     } else if (type == VIRTIO_BLK_T_WRITE_ZEROES) {
-        if (blk_co_pwrite_zeroes(vdev_blk->backend,
-                                 range[0], range[1], 0) == 0) {
+        if (blk_co_pwrite_zeroes(blk, range[0], range[1], 0) == 0) {
             return 0;
         }
     }
@@ -XXX,XX +XXX,XX @@ vu_block_discard_write_zeroes(VuBlockReq *req, struct iovec *iov,
     return -EINVAL;
 }
 
-static int coroutine_fn vu_block_flush(VuBlockReq *req)
+static void coroutine_fn vu_blk_virtio_process_req(void *opaque)
 {
-    VuBlockDev *vdev_blk = get_vu_block_device_by_server(req->server);
-    BlockBackend *backend = vdev_blk->backend;
-    return blk_co_flush(backend);
-}
-
-static void coroutine_fn vu_block_virtio_process_req(void *opaque)
-{
-    VuBlockReq *req = opaque;
+    VuBlkReq *req = opaque;
     VuServer *server = req->server;
     VuVirtqElement *elem = &req->elem;
     uint32_t type;
 
-    VuBlockDev *vdev_blk = get_vu_block_device_by_server(server);
-    BlockBackend *backend = vdev_blk->backend;
+    VuBlkExport *vexp = container_of(server, VuBlkExport, vu_server);
+    BlockBackend *blk = vexp->export.blk;
 
     struct iovec *in_iov = elem->in_sg;
     struct iovec *out_iov = elem->out_sg;
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn vu_block_virtio_process_req(void *opaque)
         bool is_write = type & VIRTIO_BLK_T_OUT;
         req->sector_num = le64_to_cpu(req->out.sector);
 
-        int64_t offset = req->sector_num * vdev_blk->blk_size;
+        if (is_write && !vexp->writable) {
+            req->in->status = VIRTIO_BLK_S_IOERR;
+            break;
+        }
+
+        int64_t offset = req->sector_num * vexp->blk_size;
         QEMUIOVector qiov;
         if (is_write) {
             qemu_iovec_init_external(&qiov, out_iov, out_num);
-            ret = blk_co_pwritev(backend, offset, qiov.size,
-                                 &qiov, 0);
+            ret = blk_co_pwritev(blk, offset, qiov.size, &qiov, 0);
         } else {
             qemu_iovec_init_external(&qiov, in_iov, in_num);
-            ret = blk_co_preadv(backend, offset, qiov.size,
-                                &qiov, 0);
+            ret = blk_co_preadv(blk, offset, qiov.size, &qiov, 0);
         }
         if (ret >= 0) {
             req->in->status = VIRTIO_BLK_S_OK;
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn vu_block_virtio_process_req(void *opaque)
         break;
     }
     case VIRTIO_BLK_T_FLUSH:
-        if (vu_block_flush(req) == 0) {
+        if (blk_co_flush(blk) == 0) {
             req->in->status = VIRTIO_BLK_S_OK;
         } else {
             req->in->status = VIRTIO_BLK_S_IOERR;
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn vu_block_virtio_process_req(void *opaque)
     case VIRTIO_BLK_T_DISCARD:
     case VIRTIO_BLK_T_WRITE_ZEROES: {
         int rc;
-        rc = vu_block_discard_write_zeroes(req, &elem->out_sg[1],
-                                           out_num, type);
+
+        if (!vexp->writable) {
+            req->in->status = VIRTIO_BLK_S_IOERR;
+            break;
+        }
+
+        rc = vu_blk_discard_write_zeroes(blk, &elem->out_sg[1], out_num, type);
         if (rc == 0) {
             req->in->status = VIRTIO_BLK_S_OK;
         } else {
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn vu_block_virtio_process_req(void *opaque)
         break;
     }
 
-    vu_block_req_complete(req);
+    vu_blk_req_complete(req);
     return;
 
 err:
-    free(elem);
+    free(req);
 }
 
-static void vu_block_process_vq(VuDev *vu_dev, int idx)
+static void vu_blk_process_vq(VuDev *vu_dev, int idx)
 {
     VuServer *server = container_of(vu_dev, VuServer, vu_dev);
     VuVirtq *vq = vu_get_queue(vu_dev, idx);
 
     while (1) {
-        VuBlockReq *req;
+        VuBlkReq *req;
 
-        req = vu_queue_pop(vu_dev, vq, sizeof(VuBlockReq));
+        req = vu_queue_pop(vu_dev, vq, sizeof(VuBlkReq));
         if (!req) {
             break;
         }
@@ -XXX,XX +XXX,XX @@ static void vu_block_process_vq(VuDev *vu_dev, int idx)
         req->vq = vq;
 
         Coroutine *co =
-            qemu_coroutine_create(vu_block_virtio_process_req, req);
+            qemu_coroutine_create(vu_blk_virtio_process_req, req);
         qemu_coroutine_enter(co);
     }
 }
 
-static void vu_block_queue_set_started(VuDev *vu_dev, int idx, bool started)
+static void vu_blk_queue_set_started(VuDev *vu_dev, int idx, bool started)
 {
     VuVirtq *vq;
 
     assert(vu_dev);
 
     vq = vu_get_queue(vu_dev, idx);
-    vu_set_queue_handler(vu_dev, vq, started ? vu_block_process_vq : NULL);
+    vu_set_queue_handler(vu_dev, vq, started ? vu_blk_process_vq : NULL);
 }
 
-static uint64_t vu_block_get_features(VuDev *dev)
+static uint64_t vu_blk_get_features(VuDev *dev)
 {
     uint64_t features;
     VuServer *server = container_of(dev, VuServer, vu_dev);
-    VuBlockDev *vdev_blk = get_vu_block_device_by_server(server);
+    VuBlkExport *vexp = container_of(server, VuBlkExport, vu_server);
     features = 1ull << VIRTIO_BLK_F_SIZE_MAX |
                1ull << VIRTIO_BLK_F_SEG_MAX |
                1ull << VIRTIO_BLK_F_TOPOLOGY |
@@ -XXX,XX +XXX,XX @@ static uint64_t vu_block_get_features(VuDev *dev)
                1ull << VIRTIO_RING_F_EVENT_IDX |
                1ull << VHOST_USER_F_PROTOCOL_FEATURES;
 
-    if (!vdev_blk->writable) {
+    if (!vexp->writable) {
         features |= 1ull << VIRTIO_BLK_F_RO;
     }
 
     return features;
 }
 
-static uint64_t vu_block_get_protocol_features(VuDev *dev)
+static uint64_t vu_blk_get_protocol_features(VuDev *dev)
 {
     return 1ull << VHOST_USER_PROTOCOL_F_CONFIG |
            1ull << VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD;
 }
 
 static int
-vu_block_get_config(VuDev *vu_dev, uint8_t *config, uint32_t len)
+vu_blk_get_config(VuDev *vu_dev, uint8_t *config, uint32_t len)
 {
+    /* TODO blkcfg must be little-endian for VIRTIO 1.0 */
     VuServer *server = container_of(vu_dev, VuServer, vu_dev);
-    VuBlockDev *vdev_blk = get_vu_block_device_by_server(server);
-    memcpy(config, &vdev_blk->blkcfg, len);
-
+    VuBlkExport *vexp = container_of(server, VuBlkExport, vu_server);
+    memcpy(config, &vexp->blkcfg, len);
     return 0;
 }
 
 static int
-vu_block_set_config(VuDev *vu_dev, const uint8_t *data,
+vu_blk_set_config(VuDev *vu_dev, const uint8_t *data,
                     uint32_t offset, uint32_t size, uint32_t flags)
 {
     VuServer *server = container_of(vu_dev, VuServer, vu_dev);
-    VuBlockDev *vdev_blk = get_vu_block_device_by_server(server);
+    VuBlkExport *vexp = container_of(server, VuBlkExport, vu_server);
     uint8_t wce;
 
     /* don't support live migration */
@@ -XXX,XX +XXX,XX @@ vu_block_set_config(VuDev *vu_dev, const uint8_t *data,
     }
 
     wce = *data;
-    vdev_blk->blkcfg.wce = wce;
-    blk_set_enable_write_cache(vdev_blk->backend, wce);
+    vexp->blkcfg.wce = wce;
+    blk_set_enable_write_cache(vexp->export.blk, wce);
     return 0;
 }
 
@@ -XXX,XX +XXX,XX @@ vu_block_set_config(VuDev *vu_dev, const uint8_t *data,
  * of vu_process_message.
  *
  */
-static int vu_block_process_msg(VuDev *dev, VhostUserMsg *vmsg, int *do_reply)
+static int vu_blk_process_msg(VuDev *dev, VhostUserMsg *vmsg, int *do_reply)
 {
     if (vmsg->request == VHOST_USER_NONE) {
         dev->panic(dev, "disconnect");
@@ -XXX,XX +XXX,XX @@ static int vu_block_process_msg(VuDev *dev, VhostUserMsg *vmsg, int *do_reply)
     return false;
 }
 
-static const VuDevIface vu_block_iface = {
-    .get_features          = vu_block_get_features,
-    .queue_set_started     = vu_block_queue_set_started,
-    .get_protocol_features = vu_block_get_protocol_features,
-    .get_config            = vu_block_get_config,
-    .set_config            = vu_block_set_config,
-    .process_msg           = vu_block_process_msg,
+static const VuDevIface vu_blk_iface = {
+    .get_features          = vu_blk_get_features,
+    .queue_set_started     = vu_blk_queue_set_started,
+    .get_protocol_features = vu_blk_get_protocol_features,
+    .get_config            = vu_blk_get_config,
+    .set_config            = vu_blk_set_config,
+    .process_msg           = vu_blk_process_msg,
 };
 
 static void blk_aio_attached(AioContext *ctx, void *opaque)
 {
-    VuBlockDev *vub_dev = opaque;
-    vhost_user_server_attach_aio_context(&vub_dev->vu_server, ctx);
+    VuBlkExport *vexp = opaque;
+    vhost_user_server_attach_aio_context(&vexp->vu_server, ctx);
 }
 
 static void blk_aio_detach(void *opaque)
 {
-    VuBlockDev *vub_dev = opaque;
-    vhost_user_server_detach_aio_context(&vub_dev->vu_server);
+    VuBlkExport *vexp = opaque;
+    vhost_user_server_detach_aio_context(&vexp->vu_server);
 }
 
 static void
-vu_block_initialize_config(BlockDriverState *bs,
+vu_blk_initialize_config(BlockDriverState *bs,
                            struct virtio_blk_config *config, uint32_t blk_size)
 {
     config->capacity = bdrv_getlength(bs) >> BDRV_SECTOR_BITS;
@@ -XXX,XX +XXX,XX @@ vu_block_initialize_config(BlockDriverState *bs,
     config->max_write_zeroes_seg = 1;
 }
 
-static VuBlockDev *vu_block_init(VuBlockDev *vu_block_device, Error **errp)
+static void vu_blk_exp_request_shutdown(BlockExport *exp)
 {
+    VuBlkExport *vexp = container_of(exp, VuBlkExport, export);
 
-    BlockBackend *blk;
-    Error *local_error = NULL;
-    const char *node_name = vu_block_device->node_name;
-    bool writable = vu_block_device->writable;
-    uint64_t perm = BLK_PERM_CONSISTENT_READ;
-    int ret;
-
-    AioContext *ctx;
-
-    BlockDriverState *bs = bdrv_lookup_bs(node_name, node_name, &local_error);
-
-    if (!bs) {
-        error_propagate(errp, local_error);
-        return NULL;
-    }
-
-    if (bdrv_is_read_only(bs)) {
-        writable = false;
-    }
-
-    if (writable) {
-        perm |= BLK_PERM_WRITE;
-    }
-
-    ctx = bdrv_get_aio_context(bs);
-    aio_context_acquire(ctx);
-    bdrv_invalidate_cache(bs, NULL);
-    aio_context_release(ctx);
-
-    /*
-     * Don't allow resize while the vhost user server is running,
-     * otherwise we don't care what happens with the node.
-     */
-    blk = blk_new(bdrv_get_aio_context(bs), perm,
-                  BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED |
-                  BLK_PERM_WRITE | BLK_PERM_GRAPH_MOD);
-    ret = blk_insert_bs(blk, bs, errp);
-
-    if (ret < 0) {
-        goto fail;
-    }
-
-    blk_set_enable_write_cache(blk, false);
-
-    blk_set_allow_aio_context_change(blk, true);
-
-    vu_block_device->blkcfg.wce = 0;
-    vu_block_device->backend = blk;
-    if (!vu_block_device->blk_size) {
-        vu_block_device->blk_size = BDRV_SECTOR_SIZE;
-    }
-    vu_block_device->blkcfg.blk_size = vu_block_device->blk_size;
-    blk_set_guest_block_size(blk, vu_block_device->blk_size);
-    vu_block_initialize_config(bs, &vu_block_device->blkcfg,
-                               vu_block_device->blk_size);
-    return vu_block_device;
-
-fail:
-    blk_unref(blk);
-    return NULL;
-}
-
-static void vu_block_deinit(VuBlockDev *vu_block_device)
-{
-    if (vu_block_device->backend) {
-        blk_remove_aio_context_notifier(vu_block_device->backend, blk_aio_attached,
-                                        blk_aio_detach, vu_block_device);
-    }
-
-    blk_unref(vu_block_device->backend);
-}
-
-static void vhost_user_blk_server_stop(VuBlockDev *vu_block_device)
-{
-    vhost_user_server_stop(&vu_block_device->vu_server);
-    vu_block_deinit(vu_block_device);
-}
-
-static void vhost_user_blk_server_start(VuBlockDev *vu_block_device,
-                                        Error **errp)
-{
-    AioContext *ctx;
-    SocketAddress *addr = vu_block_device->addr;
-
-    if (!vu_block_init(vu_block_device, errp)) {
-        return;
-    }
-
-    ctx = bdrv_get_aio_context(blk_bs(vu_block_device->backend));
-
-    if (!vhost_user_server_start(&vu_block_device->vu_server, addr, ctx,
-                                 VHOST_USER_BLK_MAX_QUEUES, &vu_block_iface,
-                                 errp)) {
-        goto error;
-    }
-
-    blk_add_aio_context_notifier(vu_block_device->backend, blk_aio_attached,
-                                 blk_aio_detach, vu_block_device);
-    vu_block_device->running = true;
-    return;
-
- error:
-    vu_block_deinit(vu_block_device);
-}
-
-static bool vu_prop_modifiable(VuBlockDev *vus, Error **errp)
-{
-    if (vus->running) {
-        error_setg(errp, "The property can't be modified "
-                   "while the server is running");
-        return false;
-    }
-    return true;
-}
-
-static void vu_set_node_name(Object *obj, const char *value, Error **errp)
-{
-    VuBlockDev *vus = VHOST_USER_BLK_SERVER(obj);
-
-    if (!vu_prop_modifiable(vus, errp)) {
-        return;
-    }
-
-    if (vus->node_name) {
-        g_free(vus->node_name);
-    }
-
-    vus->node_name = g_strdup(value);
-}
-
-static char *vu_get_node_name(Object *obj, Error **errp)
-{
-    VuBlockDev *vus = VHOST_USER_BLK_SERVER(obj);
-    return g_strdup(vus->node_name);
-}
-
-static void free_socket_addr(SocketAddress *addr)
-{
-    g_free(addr->u.q_unix.path);
-    g_free(addr);
-}
-
-static void vu_set_unix_socket(Object *obj, const char *value,
-                               Error **errp)
-{
-    VuBlockDev *vus = VHOST_USER_BLK_SERVER(obj);
-
-    if (!vu_prop_modifiable(vus, errp)) {
-        return;
-    }
-
-    if (vus->addr) {
-        free_socket_addr(vus->addr);
-    }
-
-    SocketAddress *addr = g_new0(SocketAddress, 1);
-    addr->type = SOCKET_ADDRESS_TYPE_UNIX;
-    addr->u.q_unix.path = g_strdup(value);
-    vus->addr = addr;
+    vhost_user_server_stop(&vexp->vu_server);
 }
 
-static char *vu_get_unix_socket(Object *obj, Error **errp)
+static int vu_blk_exp_create(BlockExport *exp, BlockExportOptions *opts,
+                             Error **errp)
 {
-    VuBlockDev *vus = VHOST_USER_BLK_SERVER(obj);
-    return g_strdup(vus->addr->u.q_unix.path);
-}
-
-static bool vu_get_block_writable(Object *obj, Error **errp)
-{
-    VuBlockDev *vus = VHOST_USER_BLK_SERVER(obj);
-    return vus->writable;
-}
-
-static void vu_set_block_writable(Object *obj, bool value, Error **errp)
-{
-    VuBlockDev *vus = VHOST_USER_BLK_SERVER(obj);
-
-    if (!vu_prop_modifiable(vus, errp)) {
-        return;
-    }
-
-    vus->writable = value;
-}
-
-static void vu_get_blk_size(Object *obj, Visitor *v, const char *name,
-                            void *opaque, Error **errp)
-{
-    VuBlockDev *vus = VHOST_USER_BLK_SERVER(obj);
-    uint32_t value = vus->blk_size;
-
-    visit_type_uint32(v, name, &value, errp);
-}
-
-static void vu_set_blk_size(Object *obj, Visitor *v, const char *name,
-                            void *opaque, Error **errp)
-{
-    VuBlockDev *vus = VHOST_USER_BLK_SERVER(obj);
-
+    VuBlkExport *vexp = container_of(exp, VuBlkExport, export);
+    BlockExportOptionsVhostUserBlk *vu_opts = &opts->u.vhost_user_blk;
     Error *local_err = NULL;
-    uint32_t value;
+    uint64_t logical_block_size;
 
-    if (!vu_prop_modifiable(vus, errp)) {
-        return;
-    }
+    vexp->writable = opts->writable;
+    vexp->blkcfg.wce = 0;
 
-    visit_type_uint32(v, name, &value, &local_err);
-    if (local_err) {
-        goto out;
+    if (vu_opts->has_logical_block_size) {
+        logical_block_size = vu_opts->logical_block_size;
+    } else {
+        logical_block_size = BDRV_SECTOR_SIZE;
     }
-
-    check_block_size(object_get_typename(obj), name, value, &local_err);
+    check_block_size(exp->id, "logical-block-size", logical_block_size,
+                     &local_err);
     if (local_err) {
-        goto out;
+        error_propagate(errp, local_err);
+        return -EINVAL;
     }
 
+    vexp->blk_size = logical_block_size;
+    blk_set_guest_block_size(exp->blk, logical_block_size);
+    vu_blk_initialize_config(blk_bs(exp->blk), &vexp->blkcfg,
+                             logical_block_size);
+
+    blk_set_allow_aio_context_change(exp->blk, true);
+    blk_add_aio_context_notifier(exp->blk, blk_aio_attached, blk_aio_detach,
+                                 vexp);
+
+    if (!vhost_user_server_start(&vexp->vu_server, vu_opts->addr, exp->ctx,
+                                 VHOST_USER_BLK_MAX_QUEUES, &vu_blk_iface,
+                                 errp)) {
+        blk_remove_aio_context_notifier(exp->blk, blk_aio_attached,
+                                        blk_aio_detach, vexp);
+        return -EADDRNOTAVAIL;
     }
 
-    vus->blk_size = value;
-
-out:
-    error_propagate(errp, local_err);
-}
-
-static void vhost_user_blk_server_instance_finalize(Object *obj)
-{
-    VuBlockDev *vub = VHOST_USER_BLK_SERVER(obj);
-
-    vhost_user_blk_server_stop(vub);
-
-    /*
-     * Unlike object_property_add_str, object_class_property_add_str
-     * doesn't have a release method. Thus manual memory freeing is
-     * needed.
-     */
-    free_socket_addr(vub->addr);
-    g_free(vub->node_name);
-}
-
-static void vhost_user_blk_server_complete(UserCreatable *obj, Error **errp)
-{
-    VuBlockDev *vub = VHOST_USER_BLK_SERVER(obj);
-
-    vhost_user_blk_server_start(vub, errp);
+    return 0;
 }
 
-static void vhost_user_blk_server_class_init(ObjectClass *klass,
-                                             void *class_data)
+static void vu_blk_exp_delete(BlockExport *exp)
 {
-    UserCreatableClass *ucc = USER_CREATABLE_CLASS(klass);
-    ucc->complete = vhost_user_blk_server_complete;
-
-    object_class_property_add_bool(klass, "writable",
-                                   vu_get_block_writable,
-                                   vu_set_block_writable);
-
-    object_class_property_add_str(klass, "node-name",
-                                  vu_get_node_name,
-                                  vu_set_node_name);
-
-    object_class_property_add_str(klass, "unix-socket",
-                                  vu_get_unix_socket,
-                                  vu_set_unix_socket);
+    VuBlkExport *vexp = container_of(exp, VuBlkExport, export);
 
-    object_class_property_add(klass, "logical-block-size", "uint32",
-                              vu_get_blk_size, vu_set_blk_size,
-                              NULL, NULL);
+    blk_remove_aio_context_notifier(exp->blk, blk_aio_attached, blk_aio_detach,
+                                    vexp);
 }
 
-static const TypeInfo vhost_user_blk_server_info = {
-    .name = TYPE_VHOST_USER_BLK_SERVER,
-    .parent = TYPE_OBJECT,
-    .instance_size = sizeof(VuBlockDev),
-    .instance_finalize = vhost_user_blk_server_instance_finalize,
-    .class_init = vhost_user_blk_server_class_init,
-    .interfaces = (InterfaceInfo[]) {
-        {TYPE_USER_CREATABLE},
-        {}
-    },
+const BlockExportDriver blk_exp_vhost_user_blk = {
+    .type               = BLOCK_EXPORT_TYPE_VHOST_USER_BLK,
+    .instance_size      = sizeof(VuBlkExport),
+    .create             = vu_blk_exp_create,
+    .delete             = vu_blk_exp_delete,
+    .request_shutdown   = vu_blk_exp_request_shutdown,
 };
-
-static void vhost_user_blk_server_register_types(void)
-{
-    type_register_static(&vhost_user_blk_server_info);
-}
-
-type_init(vhost_user_blk_server_register_types)
diff --git a/util/vhost-user-server.c b/util/vhost-user-server.c
index XXXXXXX..XXXXXXX 100644
819
--- a/util/vhost-user-server.c
360
+++ b/util/trace-events
820
+++ b/util/vhost-user-server.c
361
@@ -XXX,XX +XXX,XX @@ qemu_coroutine_terminate(void *co) "self %p"
821
@@ -XXX,XX +XXX,XX @@ bool vhost_user_server_start(VuServer *server,
362
822
Error **errp)
363
# util/qemu-coroutine-lock.c
823
{
364
qemu_co_queue_run_restart(void *co) "co %p"
824
QEMUBH *bh;
365
+qemu_co_mutex_lock_uncontended(void *mutex, void *self) "mutex %p self %p"
825
- QIONetListener *listener = qio_net_listener_new();
366
qemu_co_mutex_lock_entry(void *mutex, void *self) "mutex %p self %p"
826
+ QIONetListener *listener;
367
qemu_co_mutex_lock_return(void *mutex, void *self) "mutex %p self %p"
827
+
368
qemu_co_mutex_unlock_entry(void *mutex, void *self) "mutex %p self %p"
828
+ if (socket_addr->type != SOCKET_ADDRESS_TYPE_UNIX &&
829
+ socket_addr->type != SOCKET_ADDRESS_TYPE_FD) {
830
+ error_setg(errp, "Only socket address types 'unix' and 'fd' are supported");
831
+ return false;
832
+ }
833
+
834
+ listener = qio_net_listener_new();
835
if (qio_net_listener_open_sync(listener, socket_addr, 1,
836
errp) < 0) {
837
object_unref(OBJECT(listener));
838
diff --git a/block/export/meson.build b/block/export/meson.build
839
index XXXXXXX..XXXXXXX 100644
840
--- a/block/export/meson.build
841
+++ b/block/export/meson.build
842
@@ -1 +1,2 @@
843
block_ss.add(files('export.c'))
844
+block_ss.add(when: 'CONFIG_LINUX', if_true: files('vhost-user-blk-server.c', '../../contrib/libvhost-user/libvhost-user.c'))
845
diff --git a/block/meson.build b/block/meson.build
846
index XXXXXXX..XXXXXXX 100644
847
--- a/block/meson.build
848
+++ b/block/meson.build
849
@@ -XXX,XX +XXX,XX @@ block_ss.add(when: 'CONFIG_WIN32', if_true: files('file-win32.c', 'win32-aio.c')
850
block_ss.add(when: 'CONFIG_POSIX', if_true: [files('file-posix.c'), coref, iokit])
851
block_ss.add(when: 'CONFIG_LIBISCSI', if_true: files('iscsi-opts.c'))
852
block_ss.add(when: 'CONFIG_LINUX', if_true: files('nvme.c'))
853
-block_ss.add(when: 'CONFIG_LINUX', if_true: files('export/vhost-user-blk-server.c', '../contrib/libvhost-user/libvhost-user.c'))
854
block_ss.add(when: 'CONFIG_REPLICATION', if_true: files('replication.c'))
855
block_ss.add(when: 'CONFIG_SHEEPDOG', if_true: files('sheepdog.c'))
856
block_ss.add(when: ['CONFIG_LINUX_AIO', libaio], if_true: files('linux-aio.c'))
369
--
857
--
370
2.9.3
858
2.26.2
371
859
372
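For reference, the BlockExportDriver declared above is selected by export
type when an export is created. A minimal sketch of that dispatch, following
blk_exp_find_driver() in block/export/export.c (a simplified restatement for
illustration, not a hunk from this series):

    /* Sketch: how blk_exp_vhost_user_blk is looked up by export type. */
    static const BlockExportDriver *blk_exp_find_driver(BlockExportType type)
    {
        int i;

        for (i = 0; i < ARRAY_SIZE(blk_exp_drivers); i++) {
            if (blk_exp_drivers[i]->type == type) {
                return blk_exp_drivers[i];
            }
        }
        return NULL;
    }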
1
From: Paolo Bonzini <pbonzini@redhat.com>
1
Headers used by other subsystems are located in include/. Also add the
2
vhost-user-server and vhost-user-blk-server headers to MAINTAINERS.
2
3
3
This will avoid forward references in the next patch. It is also
4
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
4
more logical because CoQueue is no longer the basic primitive.
5
Message-id: 20200924151549.913737-13-stefanha@redhat.com
5
6
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
7
Reviewed-by: Fam Zheng <famz@redhat.com>
8
Message-id: 20170213181244.16297-5-pbonzini@redhat.com
9
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
6
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
10
---
7
---
11
include/qemu/coroutine.h | 89 ++++++++++++++++++++++++------------------------
8
MAINTAINERS | 4 +++-
12
1 file changed, 44 insertions(+), 45 deletions(-)
9
{util => include/qemu}/vhost-user-server.h | 0
10
block/export/vhost-user-blk-server.c | 2 +-
11
util/vhost-user-server.c | 2 +-
12
4 files changed, 5 insertions(+), 3 deletions(-)
13
rename {util => include/qemu}/vhost-user-server.h (100%)
13
14
14
diff --git a/include/qemu/coroutine.h b/include/qemu/coroutine.h
15
diff --git a/MAINTAINERS b/MAINTAINERS
15
index XXXXXXX..XXXXXXX 100644
16
index XXXXXXX..XXXXXXX 100644
16
--- a/include/qemu/coroutine.h
17
--- a/MAINTAINERS
17
+++ b/include/qemu/coroutine.h
18
+++ b/MAINTAINERS
18
@@ -XXX,XX +XXX,XX @@ bool qemu_in_coroutine(void);
19
@@ -XXX,XX +XXX,XX @@ Vhost-user block device backend server
20
M: Coiby Xu <Coiby.Xu@gmail.com>
21
S: Maintained
22
F: block/export/vhost-user-blk-server.c
23
-F: util/vhost-user-server.c
24
+F: block/export/vhost-user-blk-server.h
25
+F: include/qemu/vhost-user-server.h
26
F: tests/qtest/libqos/vhost-user-blk.c
27
+F: util/vhost-user-server.c
28
29
Replication
30
M: Wen Congyang <wencongyang2@huawei.com>
31
diff --git a/util/vhost-user-server.h b/include/qemu/vhost-user-server.h
32
similarity index 100%
33
rename from util/vhost-user-server.h
34
rename to include/qemu/vhost-user-server.h
35
diff --git a/block/export/vhost-user-blk-server.c b/block/export/vhost-user-blk-server.c
36
index XXXXXXX..XXXXXXX 100644
37
--- a/block/export/vhost-user-blk-server.c
38
+++ b/block/export/vhost-user-blk-server.c
39
@@ -XXX,XX +XXX,XX @@
40
#include "block/block.h"
41
#include "contrib/libvhost-user/libvhost-user.h"
42
#include "standard-headers/linux/virtio_blk.h"
43
-#include "util/vhost-user-server.h"
44
+#include "qemu/vhost-user-server.h"
45
#include "vhost-user-blk-server.h"
46
#include "qapi/error.h"
47
#include "qom/object_interfaces.h"
48
diff --git a/util/vhost-user-server.c b/util/vhost-user-server.c
49
index XXXXXXX..XXXXXXX 100644
50
--- a/util/vhost-user-server.c
51
+++ b/util/vhost-user-server.c
52
@@ -XXX,XX +XXX,XX @@
19
*/
53
*/
20
bool qemu_coroutine_entered(Coroutine *co);
54
#include "qemu/osdep.h"
21
55
#include "qemu/main-loop.h"
22
-
56
+#include "qemu/vhost-user-server.h"
23
-/**
57
#include "block/aio-wait.h"
24
- * CoQueues are a mechanism to queue coroutines in order to continue executing
58
-#include "vhost-user-server.h"
25
- * them later. They provide the fundamental primitives on which coroutine locks
59
26
- * are built.
60
/*
27
- */
61
* Theory of operation:
28
-typedef struct CoQueue {
29
- QSIMPLEQ_HEAD(, Coroutine) entries;
30
-} CoQueue;
31
-
32
-/**
33
- * Initialise a CoQueue. This must be called before any other operation is used
34
- * on the CoQueue.
35
- */
36
-void qemu_co_queue_init(CoQueue *queue);
37
-
38
-/**
39
- * Adds the current coroutine to the CoQueue and transfers control to the
40
- * caller of the coroutine.
41
- */
42
-void coroutine_fn qemu_co_queue_wait(CoQueue *queue);
43
-
44
-/**
45
- * Restarts the next coroutine in the CoQueue and removes it from the queue.
46
- *
47
- * Returns true if a coroutine was restarted, false if the queue is empty.
48
- */
49
-bool coroutine_fn qemu_co_queue_next(CoQueue *queue);
50
-
51
-/**
52
- * Restarts all coroutines in the CoQueue and leaves the queue empty.
53
- */
54
-void coroutine_fn qemu_co_queue_restart_all(CoQueue *queue);
55
-
56
-/**
57
- * Enter the next coroutine in the queue
58
- */
59
-bool qemu_co_enter_next(CoQueue *queue);
60
-
61
-/**
62
- * Checks if the CoQueue is empty.
63
- */
64
-bool qemu_co_queue_empty(CoQueue *queue);
65
-
66
-
67
/**
68
* Provides a mutex that can be used to synchronise coroutines
69
*/
70
@@ -XXX,XX +XXX,XX @@ void coroutine_fn qemu_co_mutex_lock(CoMutex *mutex);
71
*/
72
void coroutine_fn qemu_co_mutex_unlock(CoMutex *mutex);
73
74
+
75
+/**
76
+ * CoQueues are a mechanism to queue coroutines in order to continue executing
77
+ * them later.
78
+ */
79
+typedef struct CoQueue {
80
+ QSIMPLEQ_HEAD(, Coroutine) entries;
81
+} CoQueue;
82
+
83
+/**
84
+ * Initialise a CoQueue. This must be called before any other operation is used
85
+ * on the CoQueue.
86
+ */
87
+void qemu_co_queue_init(CoQueue *queue);
88
+
89
+/**
90
+ * Adds the current coroutine to the CoQueue and transfers control to the
91
+ * caller of the coroutine.
92
+ */
93
+void coroutine_fn qemu_co_queue_wait(CoQueue *queue);
94
+
95
+/**
96
+ * Restarts the next coroutine in the CoQueue and removes it from the queue.
97
+ *
98
+ * Returns true if a coroutine was restarted, false if the queue is empty.
99
+ */
100
+bool coroutine_fn qemu_co_queue_next(CoQueue *queue);
101
+
102
+/**
103
+ * Restarts all coroutines in the CoQueue and leaves the queue empty.
104
+ */
105
+void coroutine_fn qemu_co_queue_restart_all(CoQueue *queue);
106
+
107
+/**
108
+ * Enter the next coroutine in the queue
109
+ */
110
+bool qemu_co_enter_next(CoQueue *queue);
111
+
112
+/**
113
+ * Checks if the CoQueue is empty.
114
+ */
115
+bool qemu_co_queue_empty(CoQueue *queue);
116
+
117
+
118
typedef struct CoRwlock {
119
bool writer;
120
int reader;
121
--
62
--
122
2.9.3
63
2.26.2
123
64
124
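To make the reordered documentation concrete, here is a hypothetical
waiter/waker pair built on the CoQueue API shown above (an illustrative
sketch, not code from this series; note that at this point in the series
qemu_co_queue_wait() still takes only the queue):

    typedef struct {
        CoQueue waiters;
        int free_slots;
    } SlotPool;

    static void slot_pool_init(SlotPool *pool, int n)
    {
        qemu_co_queue_init(&pool->waiters);
        pool->free_slots = n;
    }

    static void coroutine_fn slot_pool_acquire(SlotPool *pool)
    {
        while (pool->free_slots == 0) {
            /* Suspend until slot_pool_release() restarts us. */
            qemu_co_queue_wait(&pool->waiters);
        }
        pool->free_slots--;
    }

    static void coroutine_fn slot_pool_release(SlotPool *pool)
    {
        pool->free_slots++;
        qemu_co_queue_next(&pool->waiters); /* wake one waiter, if any */
    }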
1
From: Paolo Bonzini <pbonzini@redhat.com>
1
Don't compile contrib/libvhost-user/libvhost-user.c again. Instead build
2
the static library once and then reuse it throughout QEMU.
2
3
3
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
4
Also switch from CONFIG_LINUX to CONFIG_VHOST_USER, which is what the
4
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
5
vhost-user tools (vhost-user-gpu, etc.) do.
5
Reviewed-by: Fam Zheng <famz@redhat.com>
6
6
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
7
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
7
Message-id: 20170213135235.12274-19-pbonzini@redhat.com
8
Message-id: 20200924151549.913737-14-stefanha@redhat.com
9
[Added CONFIG_LINUX again because libvhost-user doesn't build on macOS.
10
--Stefan]
8
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
11
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
9
---
12
---
10
include/block/block_int.h | 64 +++++++++++++++++++++++++-----------------
13
block/export/export.c | 8 ++++----
11
include/sysemu/block-backend.h | 14 ++++++---
14
block/export/meson.build | 2 +-
12
2 files changed, 49 insertions(+), 29 deletions(-)
15
contrib/libvhost-user/meson.build | 1 +
16
meson.build | 6 +++++-
17
util/meson.build | 4 +++-
18
5 files changed, 14 insertions(+), 7 deletions(-)
13
19
14
diff --git a/include/block/block_int.h b/include/block/block_int.h
20
diff --git a/block/export/export.c b/block/export/export.c
15
index XXXXXXX..XXXXXXX 100644
21
index XXXXXXX..XXXXXXX 100644
16
--- a/include/block/block_int.h
22
--- a/block/export/export.c
17
+++ b/include/block/block_int.h
23
+++ b/block/export/export.c
18
@@ -XXX,XX +XXX,XX @@ struct BdrvChild {
24
@@ -XXX,XX +XXX,XX @@
19
* copied as well.
25
#include "sysemu/block-backend.h"
20
*/
26
#include "block/export.h"
21
struct BlockDriverState {
27
#include "block/nbd.h"
22
- int64_t total_sectors; /* if we are reading a disk image, give its
28
-#if CONFIG_LINUX
23
- size in sectors */
29
-#include "block/export/vhost-user-blk-server.h"
24
+ /* Protected by big QEMU lock or read-only after opening. No special
30
-#endif
25
+ * locking needed during I/O...
31
#include "qapi/error.h"
26
+ */
32
#include "qapi/qapi-commands-block-export.h"
27
int open_flags; /* flags used to open the file, re-used for re-open */
33
#include "qapi/qapi-events-block-export.h"
28
bool read_only; /* if true, the media is read only */
34
#include "qemu/id.h"
29
bool encrypted; /* if true, the media is encrypted */
35
+#ifdef CONFIG_VHOST_USER
30
@@ -XXX,XX +XXX,XX @@ struct BlockDriverState {
36
+#include "vhost-user-blk-server.h"
31
bool sg; /* if true, the device is a /dev/sg* */
37
+#endif
32
bool probed; /* if true, format was probed rather than specified */
38
33
39
static const BlockExportDriver *blk_exp_drivers[] = {
34
- int copy_on_read; /* if nonzero, copy read backing sectors into image.
40
&blk_exp_nbd,
35
- note this is a reference count */
41
-#if CONFIG_LINUX
36
-
42
+#ifdef CONFIG_VHOST_USER
37
- CoQueue flush_queue; /* Serializing flush queue */
43
&blk_exp_vhost_user_blk,
38
- bool active_flush_req; /* Flush request in flight? */
44
#endif
39
- unsigned int write_gen; /* Current data generation */
45
};
40
- unsigned int flushed_gen; /* Flushed write generation */
46
diff --git a/block/export/meson.build b/block/export/meson.build
41
-
47
index XXXXXXX..XXXXXXX 100644
42
BlockDriver *drv; /* NULL means no media */
48
--- a/block/export/meson.build
43
void *opaque;
49
+++ b/block/export/meson.build
44
50
@@ -XXX,XX +XXX,XX @@
45
@@ -XXX,XX +XXX,XX @@ struct BlockDriverState {
51
block_ss.add(files('export.c'))
46
BdrvChild *backing;
52
-block_ss.add(when: 'CONFIG_LINUX', if_true: files('vhost-user-blk-server.c', '../../contrib/libvhost-user/libvhost-user.c'))
47
BdrvChild *file;
53
+block_ss.add(when: ['CONFIG_LINUX', 'CONFIG_VHOST_USER'], if_true: files('vhost-user-blk-server.c'))
48
54
diff --git a/contrib/libvhost-user/meson.build b/contrib/libvhost-user/meson.build
49
- /* Callback before write request is processed */
55
index XXXXXXX..XXXXXXX 100644
50
- NotifierWithReturnList before_write_notifiers;
56
--- a/contrib/libvhost-user/meson.build
51
-
57
+++ b/contrib/libvhost-user/meson.build
52
- /* number of in-flight requests; overall and serialising */
58
@@ -XXX,XX +XXX,XX @@
53
- unsigned int in_flight;
59
libvhost_user = static_library('vhost-user',
54
- unsigned int serialising_in_flight;
60
files('libvhost-user.c', 'libvhost-user-glib.c'),
55
-
61
build_by_default: false)
56
- bool wakeup;
62
+vhost_user = declare_dependency(link_with: libvhost_user)
57
-
63
diff --git a/meson.build b/meson.build
58
- /* Offset after the highest byte written to */
64
index XXXXXXX..XXXXXXX 100644
59
- uint64_t wr_highest_offset;
65
--- a/meson.build
60
-
66
+++ b/meson.build
61
/* I/O Limits */
67
@@ -XXX,XX +XXX,XX @@ trace_events_subdirs += [
62
BlockLimits bl;
68
'util',
63
69
]
64
@@ -XXX,XX +XXX,XX @@ struct BlockDriverState {
70
65
QTAILQ_ENTRY(BlockDriverState) bs_list;
71
+vhost_user = not_found
66
/* element of the list of monitor-owned BDS */
72
+if 'CONFIG_VHOST_USER' in config_host
67
QTAILQ_ENTRY(BlockDriverState) monitor_list;
73
+ subdir('contrib/libvhost-user')
68
- QLIST_HEAD(, BdrvDirtyBitmap) dirty_bitmaps;
74
+endif
69
int refcnt;
70
71
- QLIST_HEAD(, BdrvTrackedRequest) tracked_requests;
72
-
73
/* operation blockers */
74
QLIST_HEAD(, BdrvOpBlocker) op_blockers[BLOCK_OP_TYPE_MAX];
75
76
@@ -XXX,XX +XXX,XX @@ struct BlockDriverState {
77
/* The error object in use for blocking operations on backing_hd */
78
Error *backing_blocker;
79
80
+ /* Protected by AioContext lock */
81
+
75
+
82
+ /* If true, copy read backing sectors into image. Can be >1 if more
76
subdir('qapi')
83
+ * than one client has requested copy-on-read.
77
subdir('qobject')
84
+ */
78
subdir('stubs')
85
+ int copy_on_read;
79
@@ -XXX,XX +XXX,XX @@ if have_tools
86
+
80
install: true)
87
+ /* If we are reading a disk image, give its size in sectors.
81
88
+ * Generally read-only; it is written to by load_vmstate and save_vmstate,
82
if 'CONFIG_VHOST_USER' in config_host
89
+ * but the block layer is quiescent during those.
83
- subdir('contrib/libvhost-user')
90
+ */
84
subdir('contrib/vhost-user-blk')
91
+ int64_t total_sectors;
85
subdir('contrib/vhost-user-gpu')
92
+
86
subdir('contrib/vhost-user-input')
93
+ /* Callback before write request is processed */
87
diff --git a/util/meson.build b/util/meson.build
94
+ NotifierWithReturnList before_write_notifiers;
95
+
96
+ /* number of in-flight requests; overall and serialising */
97
+ unsigned int in_flight;
98
+ unsigned int serialising_in_flight;
99
+
100
+ bool wakeup;
101
+
102
+ /* Offset after the highest byte written to */
103
+ uint64_t wr_highest_offset;
104
+
105
/* threshold limit for writes, in bytes. "High water mark". */
106
uint64_t write_threshold_offset;
107
NotifierWithReturn write_threshold_notifier;
108
@@ -XXX,XX +XXX,XX @@ struct BlockDriverState {
109
/* counter for nested bdrv_io_plug */
110
unsigned io_plugged;
111
112
+ QLIST_HEAD(, BdrvTrackedRequest) tracked_requests;
113
+ CoQueue flush_queue; /* Serializing flush queue */
114
+ bool active_flush_req; /* Flush request in flight? */
115
+ unsigned int write_gen; /* Current data generation */
116
+ unsigned int flushed_gen; /* Flushed write generation */
117
+
118
+ QLIST_HEAD(, BdrvDirtyBitmap) dirty_bitmaps;
119
+
120
+ /* do we need to tell the guest if we have a volatile write cache? */
121
+ int enable_write_cache;
122
+
123
int quiesce_counter;
124
};
125
126
diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
127
index XXXXXXX..XXXXXXX 100644
88
index XXXXXXX..XXXXXXX 100644
128
--- a/include/sysemu/block-backend.h
89
--- a/util/meson.build
129
+++ b/include/sysemu/block-backend.h
90
+++ b/util/meson.build
130
@@ -XXX,XX +XXX,XX @@ typedef struct BlockDevOps {
91
@@ -XXX,XX +XXX,XX @@ if have_block
131
* fields that must be public. This is in particular for QLIST_ENTRY() and
92
util_ss.add(files('main-loop.c'))
132
* friends so that BlockBackends can be kept in lists outside block-backend.c */
93
util_ss.add(files('nvdimm-utils.c'))
133
typedef struct BlockBackendPublic {
94
util_ss.add(files('qemu-coroutine.c', 'qemu-coroutine-lock.c', 'qemu-coroutine-io.c'))
134
- /* I/O throttling.
95
- util_ss.add(when: 'CONFIG_LINUX', if_true: files('vhost-user-server.c'))
135
- * throttle_state tells us if this BlockBackend has I/O limits configured.
96
+ util_ss.add(when: ['CONFIG_LINUX', 'CONFIG_VHOST_USER'], if_true: [
136
- * io_limits_disabled tells us if they are currently being enforced */
97
+ files('vhost-user-server.c'), vhost_user
137
+ /* I/O throttling has its own locking, but also some fields are
98
+ ])
138
+ * protected by the AioContext lock.
99
util_ss.add(files('block-helpers.c'))
139
+ */
100
util_ss.add(files('qemu-coroutine-sleep.c'))
140
+
101
util_ss.add(files('qemu-co-shared-resource.c'))
141
+ /* Protected by AioContext lock. */
142
CoQueue throttled_reqs[2];
143
+
144
+ /* Nonzero if the I/O limits are currently being ignored; generally
145
+ * it is zero. */
146
unsigned int io_limits_disabled;
147
148
/* The following fields are protected by the ThrottleGroup lock.
149
- * See the ThrottleGroup documentation for details. */
150
+ * See the ThrottleGroup documentation for details.
151
+ * throttle_state tells us if I/O limits are configured. */
152
ThrottleState *throttle_state;
153
ThrottleTimers throttle_timers;
154
unsigned pending_reqs[2];
155
--
102
--
156
2.9.3
103
2.26.2
157
104
158
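As an illustration of the locking rule documented above, code running outside
the BlockDriverState's home AioContext would bracket accesses to the
protected fields like this (hypothetical helper, not part of the patch):

    static bool flush_in_flight(BlockDriverState *bs)
    {
        AioContext *ctx = bdrv_get_aio_context(bs);
        bool busy;

        aio_context_acquire(ctx);
        busy = bs->active_flush_req; /* protected by the AioContext lock */
        aio_context_release(ctx);

        return busy;
    }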
1
From: Paolo Bonzini <pbonzini@redhat.com>
1
Introduce libblkdev.fa to avoid recompiling blockdev_ss twice.
2
2
3
qed_aio_start_io and qed_aio_next_io will not have to acquire/release
3
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
4
the AioContext, while qed_aio_next_io_cb will. Split the functionality
4
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
5
and gain a little type-safety in the process.
5
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
6
6
Message-id: 20200929125516.186715-3-stefanha@redhat.com
7
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
8
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
9
Reviewed-by: Fam Zheng <famz@redhat.com>
10
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
11
Message-id: 20170213135235.12274-11-pbonzini@redhat.com
12
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
7
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
13
---
8
---
14
block/qed.c | 39 +++++++++++++++++++++++++--------------
9
meson.build | 12 ++++++++++--
15
1 file changed, 25 insertions(+), 14 deletions(-)
10
storage-daemon/meson.build | 3 +--
11
2 files changed, 11 insertions(+), 4 deletions(-)
16
12
17
diff --git a/block/qed.c b/block/qed.c
13
diff --git a/meson.build b/meson.build
18
index XXXXXXX..XXXXXXX 100644
14
index XXXXXXX..XXXXXXX 100644
19
--- a/block/qed.c
15
--- a/meson.build
20
+++ b/block/qed.c
16
+++ b/meson.build
21
@@ -XXX,XX +XXX,XX @@ static CachedL2Table *qed_new_l2_table(BDRVQEDState *s)
17
@@ -XXX,XX +XXX,XX @@ blockdev_ss.add(files(
22
return l2_table;
18
# os-win32.c does not
23
}
19
blockdev_ss.add(when: 'CONFIG_POSIX', if_true: files('os-posix.c'))
24
20
softmmu_ss.add(when: 'CONFIG_WIN32', if_true: [files('os-win32.c')])
25
-static void qed_aio_next_io(void *opaque, int ret);
21
-softmmu_ss.add_all(blockdev_ss)
26
+static void qed_aio_next_io(QEDAIOCB *acb, int ret);
22
23
common_ss.add(files('cpus-common.c'))
24
25
@@ -XXX,XX +XXX,XX @@ block = declare_dependency(link_whole: [libblock],
26
link_args: '@block.syms',
27
dependencies: [crypto, io])
28
29
+blockdev_ss = blockdev_ss.apply(config_host, strict: false)
30
+libblockdev = static_library('blockdev', blockdev_ss.sources() + genh,
31
+ dependencies: blockdev_ss.dependencies(),
32
+ name_suffix: 'fa',
33
+ build_by_default: false)
27
+
34
+
28
+static void qed_aio_start_io(QEDAIOCB *acb)
35
+blockdev = declare_dependency(link_whole: [libblockdev],
29
+{
36
+ dependencies: [block])
30
+ qed_aio_next_io(acb, 0);
31
+}
32
+
37
+
33
+static void qed_aio_next_io_cb(void *opaque, int ret)
38
qmp_ss = qmp_ss.apply(config_host, strict: false)
34
+{
39
libqmp = static_library('qmp', qmp_ss.sources() + genh,
35
+ QEDAIOCB *acb = opaque;
40
dependencies: qmp_ss.dependencies(),
36
+
41
@@ -XXX,XX +XXX,XX @@ foreach m : block_mods + softmmu_mods
37
+ qed_aio_next_io(acb, ret);
42
install_dir: config_host['qemu_moddir'])
38
+}
43
endforeach
39
44
40
static void qed_plug_allocating_write_reqs(BDRVQEDState *s)
45
-softmmu_ss.add(authz, block, chardev, crypto, io, qmp)
41
{
46
+softmmu_ss.add(authz, blockdev, chardev, crypto, io, qmp)
42
@@ -XXX,XX +XXX,XX @@ static void qed_unplug_allocating_write_reqs(BDRVQEDState *s)
47
common_ss.add(qom, qemuutil)
43
48
44
acb = QSIMPLEQ_FIRST(&s->allocating_write_reqs);
49
common_ss.add_all(when: 'CONFIG_SOFTMMU', if_true: [softmmu_ss])
45
if (acb) {
50
diff --git a/storage-daemon/meson.build b/storage-daemon/meson.build
46
- qed_aio_next_io(acb, 0);
51
index XXXXXXX..XXXXXXX 100644
47
+ qed_aio_start_io(acb);
52
--- a/storage-daemon/meson.build
48
}
53
+++ b/storage-daemon/meson.build
49
}
54
@@ -XXX,XX +XXX,XX @@
50
55
qsd_ss = ss.source_set()
51
@@ -XXX,XX +XXX,XX @@ static void qed_aio_complete(QEDAIOCB *acb, int ret)
56
qsd_ss.add(files('qemu-storage-daemon.c'))
52
QSIMPLEQ_REMOVE_HEAD(&s->allocating_write_reqs, next);
57
-qsd_ss.add(block, chardev, qmp, qom, qemuutil)
53
acb = QSIMPLEQ_FIRST(&s->allocating_write_reqs);
58
-qsd_ss.add_all(blockdev_ss)
54
if (acb) {
59
+qsd_ss.add(blockdev, chardev, qmp, qom, qemuutil)
55
- qed_aio_next_io(acb, 0);
60
56
+ qed_aio_start_io(acb);
61
subdir('qapi')
57
} else if (s->header.features & QED_F_NEED_CHECK) {
58
qed_start_need_check_timer(s);
59
}
60
@@ -XXX,XX +XXX,XX @@ static void qed_commit_l2_update(void *opaque, int ret)
61
acb->request.l2_table = qed_find_l2_cache_entry(&s->l2_cache, l2_offset);
62
assert(acb->request.l2_table != NULL);
63
64
- qed_aio_next_io(opaque, ret);
65
+ qed_aio_next_io(acb, ret);
66
}
67
68
/**
69
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_l2_update(QEDAIOCB *acb, int ret, uint64_t offset)
70
if (need_alloc) {
71
/* Write out the whole new L2 table */
72
qed_write_l2_table(s, &acb->request, 0, s->table_nelems, true,
73
- qed_aio_write_l1_update, acb);
74
+ qed_aio_write_l1_update, acb);
75
} else {
76
/* Write out only the updated part of the L2 table */
77
qed_write_l2_table(s, &acb->request, index, acb->cur_nclusters, false,
78
- qed_aio_next_io, acb);
79
+ qed_aio_next_io_cb, acb);
80
}
81
return;
82
83
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_main(void *opaque, int ret)
84
}
85
86
if (acb->find_cluster_ret == QED_CLUSTER_FOUND) {
87
- next_fn = qed_aio_next_io;
88
+ next_fn = qed_aio_next_io_cb;
89
} else {
90
if (s->bs->backing) {
91
next_fn = qed_aio_write_flush_before_l2_update;
92
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
93
if (acb->flags & QED_AIOCB_ZERO) {
94
/* Skip ahead if the clusters are already zero */
95
if (acb->find_cluster_ret == QED_CLUSTER_ZERO) {
96
- qed_aio_next_io(acb, 0);
97
+ qed_aio_start_io(acb);
98
return;
99
}
100
101
@@ -XXX,XX +XXX,XX @@ static void qed_aio_read_data(void *opaque, int ret,
102
/* Handle zero cluster and backing file reads */
103
if (ret == QED_CLUSTER_ZERO) {
104
qemu_iovec_memset(&acb->cur_qiov, 0, 0, acb->cur_qiov.size);
105
- qed_aio_next_io(acb, 0);
106
+ qed_aio_start_io(acb);
107
return;
108
} else if (ret != QED_CLUSTER_FOUND) {
109
qed_read_backing_file(s, acb->cur_pos, &acb->cur_qiov,
110
- &acb->backing_qiov, qed_aio_next_io, acb);
111
+ &acb->backing_qiov, qed_aio_next_io_cb, acb);
112
return;
113
}
114
115
BLKDBG_EVENT(bs->file, BLKDBG_READ_AIO);
116
bdrv_aio_readv(bs->file, offset / BDRV_SECTOR_SIZE,
117
&acb->cur_qiov, acb->cur_qiov.size / BDRV_SECTOR_SIZE,
118
- qed_aio_next_io, acb);
119
+ qed_aio_next_io_cb, acb);
120
return;
121
122
err:
123
@@ -XXX,XX +XXX,XX @@ err:
124
/**
125
* Begin next I/O or complete the request
126
*/
127
-static void qed_aio_next_io(void *opaque, int ret)
128
+static void qed_aio_next_io(QEDAIOCB *acb, int ret)
129
{
130
- QEDAIOCB *acb = opaque;
131
BDRVQEDState *s = acb_to_s(acb);
132
QEDFindClusterFunc *io_fn = (acb->flags & QED_AIOCB_WRITE) ?
133
qed_aio_write_data : qed_aio_read_data;
134
@@ -XXX,XX +XXX,XX @@ static BlockAIOCB *qed_aio_setup(BlockDriverState *bs,
135
qemu_iovec_init(&acb->cur_qiov, qiov->niov);
136
137
/* Start request */
138
- qed_aio_next_io(acb, 0);
139
+ qed_aio_start_io(acb);
140
return &acb->common;
141
}
142
62
143
--
63
--
144
2.9.3
64
2.26.2
145
65
146
1
From: Paolo Bonzini <pbonzini@redhat.com>
1
Block exports are used by softmmu, qemu-storage-daemon, and qemu-nbd.
2
They are not used by other programs and are not otherwise needed in
3
libblock.
2
4
3
Once the thread pool starts using aio_co_wake, it will also need
5
Undo the recent move of blockdev-nbd.c from blockdev_ss into block_ss.
4
qemu_get_current_aio_context(). Make test-thread-pool create
6
Since bdrv_close_all() (libblock) calls blk_exp_close_all()
5
an AioContext with qemu_init_main_loop, so that stubs/iothread.c
7
(libblockdev), a stub function is required.
6
and tests/iothread.c can provide the rest.
7
8
8
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
9
Make qemu-nbd.c use signal handling utility functions instead of
9
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
10
duplicating the code. This helps because os-posix.c is in libblockdev
10
Reviewed-by: Fam Zheng <famz@redhat.com>
11
and it depends on a qemu_system_killed() symbol that qemu-nbd.c lacks.
11
Message-id: 20170213135235.12274-5-pbonzini@redhat.com
12
Once we use the signal handling utility functions we also end up
13
providing the necessary symbol.
14
15
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
16
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
17
Reviewed-by: Eric Blake <eblake@redhat.com>
18
Message-id: 20200929125516.186715-4-stefanha@redhat.com
19
[Fixed s/ndb/nbd/ typo in commit description as suggested by Eric Blake
20
--Stefan]
12
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
21
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
13
---
22
---
14
tests/test-thread-pool.c | 12 +++---------
23
qemu-nbd.c | 21 ++++++++-------------
15
1 file changed, 3 insertions(+), 9 deletions(-)
24
stubs/blk-exp-close-all.c | 7 +++++++
25
block/export/meson.build | 4 ++--
26
meson.build | 4 ++--
27
nbd/meson.build | 2 ++
28
stubs/meson.build | 1 +
29
6 files changed, 22 insertions(+), 17 deletions(-)
30
create mode 100644 stubs/blk-exp-close-all.c
16
31
17
diff --git a/tests/test-thread-pool.c b/tests/test-thread-pool.c
32
diff --git a/qemu-nbd.c b/qemu-nbd.c
18
index XXXXXXX..XXXXXXX 100644
33
index XXXXXXX..XXXXXXX 100644
19
--- a/tests/test-thread-pool.c
34
--- a/qemu-nbd.c
20
+++ b/tests/test-thread-pool.c
35
+++ b/qemu-nbd.c
21
@@ -XXX,XX +XXX,XX @@
36
@@ -XXX,XX +XXX,XX @@
22
#include "qapi/error.h"
37
#include "qapi/error.h"
23
#include "qemu/timer.h"
38
#include "qemu/cutils.h"
24
#include "qemu/error-report.h"
39
#include "sysemu/block-backend.h"
25
+#include "qemu/main-loop.h"
40
+#include "sysemu/runstate.h" /* for qemu_system_killed() prototype */
26
41
#include "block/block_int.h"
27
static AioContext *ctx;
42
#include "block/nbd.h"
28
static ThreadPool *pool;
43
#include "qemu/main-loop.h"
29
@@ -XXX,XX +XXX,XX @@ static void test_cancel_async(void)
44
@@ -XXX,XX +XXX,XX @@ QEMU_COPYRIGHT "\n"
30
int main(int argc, char **argv)
45
}
46
47
#ifdef CONFIG_POSIX
48
-static void termsig_handler(int signum)
49
+/*
50
+ * The client thread uses SIGTERM to interrupt the server. A signal
51
+ * handler ensures that "qemu-nbd -v -c" exits with a nice status code.
52
+ */
53
+void qemu_system_killed(int signum, pid_t pid)
31
{
54
{
32
int ret;
55
qatomic_cmpxchg(&state, RUNNING, TERMINATE);
33
- Error *local_error = NULL;
56
qemu_notify_event();
34
57
@@ -XXX,XX +XXX,XX @@ int main(int argc, char **argv)
35
- init_clocks();
58
BlockExportOptions *export_opts;
59
60
#ifdef CONFIG_POSIX
61
- /*
62
- * Exit gracefully on various signals, which includes SIGTERM used
63
- * by 'qemu-nbd -v -c'.
64
- */
65
- struct sigaction sa_sigterm;
66
- memset(&sa_sigterm, 0, sizeof(sa_sigterm));
67
- sa_sigterm.sa_handler = termsig_handler;
68
- sigaction(SIGTERM, &sa_sigterm, NULL);
69
- sigaction(SIGINT, &sa_sigterm, NULL);
70
- sigaction(SIGHUP, &sa_sigterm, NULL);
36
-
71
-
37
- ctx = aio_context_new(&local_error);
72
- signal(SIGPIPE, SIG_IGN);
38
- if (!ctx) {
73
+ os_setup_early_signal_handling();
39
- error_reportf_err(local_error, "Failed to create AIO Context: ");
74
+ os_setup_signal_handling();
40
- exit(1);
75
#endif
41
- }
76
42
+ qemu_init_main_loop(&error_abort);
77
socket_init();
43
+ ctx = qemu_get_current_aio_context();
78
diff --git a/stubs/blk-exp-close-all.c b/stubs/blk-exp-close-all.c
44
pool = aio_get_thread_pool(ctx);
79
new file mode 100644
45
80
index XXXXXXX..XXXXXXX
46
g_test_init(&argc, &argv, NULL);
81
--- /dev/null
47
@@ -XXX,XX +XXX,XX @@ int main(int argc, char **argv)
82
+++ b/stubs/blk-exp-close-all.c
48
83
@@ -XXX,XX +XXX,XX @@
49
ret = g_test_run();
84
+#include "qemu/osdep.h"
50
85
+#include "block/export.h"
51
- aio_context_unref(ctx);
86
+
52
return ret;
87
+/* Only used in programs that support block exports (libblockdev.fa) */
53
}
88
+void blk_exp_close_all(void)
89
+{
90
+}
91
diff --git a/block/export/meson.build b/block/export/meson.build
92
index XXXXXXX..XXXXXXX 100644
93
--- a/block/export/meson.build
94
+++ b/block/export/meson.build
95
@@ -XXX,XX +XXX,XX @@
96
-block_ss.add(files('export.c'))
97
-block_ss.add(when: ['CONFIG_LINUX', 'CONFIG_VHOST_USER'], if_true: files('vhost-user-blk-server.c'))
98
+blockdev_ss.add(files('export.c'))
99
+blockdev_ss.add(when: ['CONFIG_LINUX', 'CONFIG_VHOST_USER'], if_true: files('vhost-user-blk-server.c'))
100
diff --git a/meson.build b/meson.build
101
index XXXXXXX..XXXXXXX 100644
102
--- a/meson.build
103
+++ b/meson.build
104
@@ -XXX,XX +XXX,XX @@ subdir('dump')
105
106
block_ss.add(files(
107
'block.c',
108
- 'blockdev-nbd.c',
109
'blockjob.c',
110
'job.c',
111
'qemu-io-cmds.c',
112
@@ -XXX,XX +XXX,XX @@ subdir('block')
113
114
blockdev_ss.add(files(
115
'blockdev.c',
116
+ 'blockdev-nbd.c',
117
'iothread.c',
118
'job-qmp.c',
119
))
120
@@ -XXX,XX +XXX,XX @@ if have_tools
121
qemu_io = executable('qemu-io', files('qemu-io.c'),
122
dependencies: [block, qemuutil], install: true)
123
qemu_nbd = executable('qemu-nbd', files('qemu-nbd.c'),
124
- dependencies: [block, qemuutil], install: true)
125
+ dependencies: [blockdev, qemuutil], install: true)
126
127
subdir('storage-daemon')
128
subdir('contrib/rdmacm-mux')
129
diff --git a/nbd/meson.build b/nbd/meson.build
130
index XXXXXXX..XXXXXXX 100644
131
--- a/nbd/meson.build
132
+++ b/nbd/meson.build
133
@@ -XXX,XX +XXX,XX @@
134
block_ss.add(files(
135
'client.c',
136
'common.c',
137
+))
138
+blockdev_ss.add(files(
139
'server.c',
140
))
141
diff --git a/stubs/meson.build b/stubs/meson.build
142
index XXXXXXX..XXXXXXX 100644
143
--- a/stubs/meson.build
144
+++ b/stubs/meson.build
145
@@ -XXX,XX +XXX,XX @@
146
stub_ss.add(files('arch_type.c'))
147
stub_ss.add(files('bdrv-next-monitor-owned.c'))
148
stub_ss.add(files('blk-commit-all.c'))
149
+stub_ss.add(files('blk-exp-close-all.c'))
150
stub_ss.add(files('blockdev-close-all-bdrv-states.c'))
151
stub_ss.add(files('change-state-handler.c'))
152
stub_ss.add(files('cmos.c'))
54
--
153
--
55
2.9.3
154
2.26.2
56
155
57
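For context, the signal handling utility functions mentioned above work
roughly as follows (paraphrased from os-posix.c as of this series; the exact
code may differ): os_setup_signal_handling() installs a SIGINT/SIGHUP/SIGTERM
handler that forwards to the program's qemu_system_killed() implementation.

    /* Rough paraphrase of the os-posix.c handler this commit relies on. */
    static void termsig_handler(int signal, siginfo_t *info, void *c)
    {
        qemu_system_killed(info->si_signo, info->si_pid);
    }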
1
From: Paolo Bonzini <pbonzini@redhat.com>
1
Make it possible to specify the iothread where the export will run. By
2
default the block node can be moved to other AioContexts later and the
3
export will follow. The fixed-iothread option forces strict behavior
4
that prevents changing AioContext while the export is active. See the
5
QAPI docs for details.
2
6
3
In the client, read the reply headers from a coroutine, switching the
7
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
4
read side between the "read header" coroutine and the I/O coroutine that
8
Message-id: 20200929125516.186715-5-stefanha@redhat.com
5
reads the body of the reply.
9
[Fix stray '#' character in block-export.json and add missing "(since:
6
10
5.2)" as suggested by Eric Blake.
7
In the server, if the server can read more requests it will create a new
11
--Stefan]
8
"read request" coroutine as soon as a request has been read. Otherwise,
9
the new coroutine is created in nbd_request_put.
10
11
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
12
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
13
Reviewed-by: Fam Zheng <famz@redhat.com>
14
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
15
Message-id: 20170213135235.12274-8-pbonzini@redhat.com
16
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
12
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
17
---
13
---
18
block/nbd-client.h | 2 +-
14
qapi/block-export.json | 11 ++++++++++
19
block/nbd-client.c | 117 ++++++++++++++++++++++++-----------------------------
15
block/export/export.c | 31 +++++++++++++++++++++++++++-
20
nbd/client.c | 2 +-
16
block/export/vhost-user-blk-server.c | 5 ++++-
21
nbd/common.c | 9 +----
17
nbd/server.c | 2 --
22
nbd/server.c | 94 +++++++++++++-----------------------------
18
4 files changed, 45 insertions(+), 4 deletions(-)
23
5 files changed, 83 insertions(+), 141 deletions(-)
24
19
25
diff --git a/block/nbd-client.h b/block/nbd-client.h
20
diff --git a/qapi/block-export.json b/qapi/block-export.json
26
index XXXXXXX..XXXXXXX 100644
21
index XXXXXXX..XXXXXXX 100644
27
--- a/block/nbd-client.h
22
--- a/qapi/block-export.json
28
+++ b/block/nbd-client.h
23
+++ b/qapi/block-export.json
29
@@ -XXX,XX +XXX,XX @@ typedef struct NBDClientSession {
24
@@ -XXX,XX +XXX,XX @@
30
25
# export before completion is signalled. (since: 5.2;
31
CoMutex send_mutex;
26
# default: false)
32
CoQueue free_sema;
27
#
33
- Coroutine *send_coroutine;
28
+# @iothread: The name of the iothread object where the export will run. The
34
+ Coroutine *read_reply_co;
29
+# default is to use the thread currently associated with the
35
int in_flight;
30
+# block node. (since: 5.2)
36
31
+#
37
Coroutine *recv_coroutine[MAX_NBD_REQUESTS];
32
+# @fixed-iothread: True prevents the block node from being moved to another
38
diff --git a/block/nbd-client.c b/block/nbd-client.c
33
+# thread while the export is active. If true and @iothread is
34
+# given, export creation fails if the block node cannot be
35
+# moved to the iothread. The default is false. (since: 5.2)
36
+#
37
# Since: 4.2
38
##
39
{ 'union': 'BlockExportOptions',
40
'base': { 'type': 'BlockExportType',
41
'id': 'str',
42
+     '*fixed-iothread': 'bool',
43
+     '*iothread': 'str',
44
'node-name': 'str',
45
'*writable': 'bool',
46
'*writethrough': 'bool' },
47
diff --git a/block/export/export.c b/block/export/export.c
39
index XXXXXXX..XXXXXXX 100644
48
index XXXXXXX..XXXXXXX 100644
40
--- a/block/nbd-client.c
49
--- a/block/export/export.c
41
+++ b/block/nbd-client.c
50
+++ b/block/export/export.c
42
@@ -XXX,XX +XXX,XX @@
51
@@ -XXX,XX +XXX,XX @@
43
#define HANDLE_TO_INDEX(bs, handle) ((handle) ^ ((uint64_t)(intptr_t)bs))
52
44
#define INDEX_TO_HANDLE(bs, index) ((index) ^ ((uint64_t)(intptr_t)bs))
53
#include "block/block.h"
45
54
#include "sysemu/block-backend.h"
46
-static void nbd_recv_coroutines_enter_all(NBDClientSession *s)
55
+#include "sysemu/iothread.h"
47
+static void nbd_recv_coroutines_enter_all(BlockDriverState *bs)
56
#include "block/export.h"
57
#include "block/nbd.h"
58
#include "qapi/error.h"
59
@@ -XXX,XX +XXX,XX @@ static const BlockExportDriver *blk_exp_find_driver(BlockExportType type)
60
61
BlockExport *blk_exp_add(BlockExportOptions *export, Error **errp)
48
{
62
{
49
+ NBDClientSession *s = nbd_get_client_session(bs);
63
+ bool fixed_iothread = export->has_fixed_iothread && export->fixed_iothread;
50
int i;
64
const BlockExportDriver *drv;
51
65
BlockExport *exp = NULL;
52
for (i = 0; i < MAX_NBD_REQUESTS; i++) {
66
BlockDriverState *bs;
53
@@ -XXX,XX +XXX,XX @@ static void nbd_recv_coroutines_enter_all(NBDClientSession *s)
67
- BlockBackend *blk;
54
qemu_coroutine_enter(s->recv_coroutine[i]);
68
+ BlockBackend *blk = NULL;
55
}
69
AioContext *ctx;
70
uint64_t perm;
71
int ret;
72
@@ -XXX,XX +XXX,XX @@ BlockExport *blk_exp_add(BlockExportOptions *export, Error **errp)
73
ctx = bdrv_get_aio_context(bs);
74
aio_context_acquire(ctx);
75
76
+ if (export->has_iothread) {
77
+ IOThread *iothread;
78
+ AioContext *new_ctx;
79
+
80
+ iothread = iothread_by_id(export->iothread);
81
+ if (!iothread) {
82
+ error_setg(errp, "iothread \"%s\" not found", export->iothread);
83
+ goto fail;
84
+ }
85
+
86
+ new_ctx = iothread_get_aio_context(iothread);
87
+
88
+ ret = bdrv_try_set_aio_context(bs, new_ctx, errp);
89
+ if (ret == 0) {
90
+ aio_context_release(ctx);
91
+ aio_context_acquire(new_ctx);
92
+ ctx = new_ctx;
93
+ } else if (fixed_iothread) {
94
+ goto fail;
95
+ }
96
+ }
97
+
98
/*
99
* Block exports are used for non-shared storage migration. Make sure
100
* that BDRV_O_INACTIVE is cleared and the image is ready for write
101
@@ -XXX,XX +XXX,XX @@ BlockExport *blk_exp_add(BlockExportOptions *export, Error **errp)
56
}
102
}
57
+ BDRV_POLL_WHILE(bs, s->read_reply_co);
103
104
blk = blk_new(ctx, perm, BLK_PERM_ALL);
105
+
106
+ if (!fixed_iothread) {
107
+ blk_set_allow_aio_context_change(blk, true);
108
+ }
109
+
110
ret = blk_insert_bs(blk, bs, errp);
111
if (ret < 0) {
112
goto fail;
113
diff --git a/block/export/vhost-user-blk-server.c b/block/export/vhost-user-blk-server.c
114
index XXXXXXX..XXXXXXX 100644
115
--- a/block/export/vhost-user-blk-server.c
116
+++ b/block/export/vhost-user-blk-server.c
117
@@ -XXX,XX +XXX,XX @@ static const VuDevIface vu_blk_iface = {
118
static void blk_aio_attached(AioContext *ctx, void *opaque)
119
{
120
VuBlkExport *vexp = opaque;
121
+
122
+ vexp->export.ctx = ctx;
123
vhost_user_server_attach_aio_context(&vexp->vu_server, ctx);
58
}
124
}
59
125
60
static void nbd_teardown_connection(BlockDriverState *bs)
126
static void blk_aio_detach(void *opaque)
61
@@ -XXX,XX +XXX,XX @@ static void nbd_teardown_connection(BlockDriverState *bs)
127
{
62
qio_channel_shutdown(client->ioc,
128
VuBlkExport *vexp = opaque;
63
QIO_CHANNEL_SHUTDOWN_BOTH,
129
+
64
NULL);
130
vhost_user_server_detach_aio_context(&vexp->vu_server);
65
- nbd_recv_coroutines_enter_all(client);
131
+ vexp->export.ctx = NULL;
66
+ nbd_recv_coroutines_enter_all(bs);
67
68
nbd_client_detach_aio_context(bs);
69
object_unref(OBJECT(client->sioc));
70
@@ -XXX,XX +XXX,XX @@ static void nbd_teardown_connection(BlockDriverState *bs)
71
client->ioc = NULL;
72
}
132
}
73
133
74
-static void nbd_reply_ready(void *opaque)
134
static void
75
+static coroutine_fn void nbd_read_reply_entry(void *opaque)
135
@@ -XXX,XX +XXX,XX @@ static int vu_blk_exp_create(BlockExport *exp, BlockExportOptions *opts,
76
{
136
vu_blk_initialize_config(blk_bs(exp->blk), &vexp->blkcfg,
77
- BlockDriverState *bs = opaque;
137
logical_block_size);
78
- NBDClientSession *s = nbd_get_client_session(bs);
138
79
+ NBDClientSession *s = opaque;
139
- blk_set_allow_aio_context_change(exp->blk, true);
80
uint64_t i;
140
blk_add_aio_context_notifier(exp->blk, blk_aio_attached, blk_aio_detach,
81
int ret;
141
vexp);
82
142
83
- if (!s->ioc) { /* Already closed */
84
- return;
85
- }
86
-
87
- if (s->reply.handle == 0) {
88
- /* No reply already in flight. Fetch a header. It is possible
89
- * that another thread has done the same thing in parallel, so
90
- * the socket is not readable anymore.
91
- */
92
+ for (;;) {
93
+ assert(s->reply.handle == 0);
94
ret = nbd_receive_reply(s->ioc, &s->reply);
95
- if (ret == -EAGAIN) {
96
- return;
97
- }
98
if (ret < 0) {
99
- s->reply.handle = 0;
100
- goto fail;
101
+ break;
102
}
103
- }
104
105
- /* There's no need for a mutex on the receive side, because the
106
- * handler acts as a synchronization point and ensures that only
107
- * one coroutine is called until the reply finishes. */
108
- i = HANDLE_TO_INDEX(s, s->reply.handle);
109
- if (i >= MAX_NBD_REQUESTS) {
110
- goto fail;
111
- }
112
+ /* There's no need for a mutex on the receive side, because the
113
+ * handler acts as a synchronization point and ensures that only
114
+ * one coroutine is called until the reply finishes.
115
+ */
116
+ i = HANDLE_TO_INDEX(s, s->reply.handle);
117
+ if (i >= MAX_NBD_REQUESTS || !s->recv_coroutine[i]) {
118
+ break;
119
+ }
120
121
- if (s->recv_coroutine[i]) {
122
- qemu_coroutine_enter(s->recv_coroutine[i]);
123
- return;
124
+ /* We're woken up by the recv_coroutine itself. Note that there
125
+ * is no race between yielding and reentering read_reply_co. This
126
+ * is because:
127
+ *
128
+ * - if recv_coroutine[i] runs on the same AioContext, it is only
129
+ * entered after we yield
130
+ *
131
+ * - if recv_coroutine[i] runs on a different AioContext, reentering
132
+ * read_reply_co happens through a bottom half, which can only
133
+ * run after we yield.
134
+ */
135
+ aio_co_wake(s->recv_coroutine[i]);
136
+ qemu_coroutine_yield();
137
}
138
-
139
-fail:
140
- nbd_teardown_connection(bs);
141
-}
142
-
143
-static void nbd_restart_write(void *opaque)
144
-{
145
- BlockDriverState *bs = opaque;
146
-
147
- qemu_coroutine_enter(nbd_get_client_session(bs)->send_coroutine);
148
+ s->read_reply_co = NULL;
149
}
150
151
static int nbd_co_send_request(BlockDriverState *bs,
152
@@ -XXX,XX +XXX,XX @@ static int nbd_co_send_request(BlockDriverState *bs,
153
QEMUIOVector *qiov)
154
{
155
NBDClientSession *s = nbd_get_client_session(bs);
156
- AioContext *aio_context;
157
int rc, ret, i;
158
159
qemu_co_mutex_lock(&s->send_mutex);
160
@@ -XXX,XX +XXX,XX @@ static int nbd_co_send_request(BlockDriverState *bs,
161
return -EPIPE;
162
}
163
164
- s->send_coroutine = qemu_coroutine_self();
165
- aio_context = bdrv_get_aio_context(bs);
166
-
167
- aio_set_fd_handler(aio_context, s->sioc->fd, false,
168
- nbd_reply_ready, nbd_restart_write, NULL, bs);
169
if (qiov) {
170
qio_channel_set_cork(s->ioc, true);
171
rc = nbd_send_request(s->ioc, request);
172
@@ -XXX,XX +XXX,XX @@ static int nbd_co_send_request(BlockDriverState *bs,
173
} else {
174
rc = nbd_send_request(s->ioc, request);
175
}
176
- aio_set_fd_handler(aio_context, s->sioc->fd, false,
177
- nbd_reply_ready, NULL, NULL, bs);
178
- s->send_coroutine = NULL;
179
qemu_co_mutex_unlock(&s->send_mutex);
180
return rc;
181
}
182
@@ -XXX,XX +XXX,XX @@ static void nbd_co_receive_reply(NBDClientSession *s,
183
{
184
int ret;
185
186
- /* Wait until we're woken up by the read handler. TODO: perhaps
187
- * peek at the next reply and avoid yielding if it's ours? */
188
+ /* Wait until we're woken up by nbd_read_reply_entry. */
189
qemu_coroutine_yield();
190
*reply = s->reply;
191
if (reply->handle != request->handle ||
192
@@ -XXX,XX +XXX,XX @@ static void nbd_coroutine_start(NBDClientSession *s,
193
/* s->recv_coroutine[i] is set as soon as we get the send_lock. */
194
}
195
196
-static void nbd_coroutine_end(NBDClientSession *s,
197
+static void nbd_coroutine_end(BlockDriverState *bs,
198
NBDRequest *request)
199
{
200
+ NBDClientSession *s = nbd_get_client_session(bs);
201
int i = HANDLE_TO_INDEX(s, request->handle);
202
+
203
s->recv_coroutine[i] = NULL;
204
- if (s->in_flight-- == MAX_NBD_REQUESTS) {
205
- qemu_co_queue_next(&s->free_sema);
206
+ s->in_flight--;
207
+ qemu_co_queue_next(&s->free_sema);
208
+
209
+ /* Kick the read_reply_co to get the next reply. */
210
+ if (s->read_reply_co) {
211
+ aio_co_wake(s->read_reply_co);
212
}
213
}
214
215
@@ -XXX,XX +XXX,XX @@ int nbd_client_co_preadv(BlockDriverState *bs, uint64_t offset,
216
} else {
217
nbd_co_receive_reply(client, &request, &reply, qiov);
218
}
219
- nbd_coroutine_end(client, &request);
220
+ nbd_coroutine_end(bs, &request);
221
return -reply.error;
222
}
223
224
@@ -XXX,XX +XXX,XX @@ int nbd_client_co_pwritev(BlockDriverState *bs, uint64_t offset,
225
} else {
226
nbd_co_receive_reply(client, &request, &reply, NULL);
227
}
228
- nbd_coroutine_end(client, &request);
229
+ nbd_coroutine_end(bs, &request);
230
return -reply.error;
231
}
232
233
@@ -XXX,XX +XXX,XX @@ int nbd_client_co_pwrite_zeroes(BlockDriverState *bs, int64_t offset,
234
} else {
235
nbd_co_receive_reply(client, &request, &reply, NULL);
236
}
237
- nbd_coroutine_end(client, &request);
238
+ nbd_coroutine_end(bs, &request);
239
return -reply.error;
240
}
241
242
@@ -XXX,XX +XXX,XX @@ int nbd_client_co_flush(BlockDriverState *bs)
243
} else {
244
nbd_co_receive_reply(client, &request, &reply, NULL);
245
}
246
- nbd_coroutine_end(client, &request);
247
+ nbd_coroutine_end(bs, &request);
248
return -reply.error;
249
}
250
251
@@ -XXX,XX +XXX,XX @@ int nbd_client_co_pdiscard(BlockDriverState *bs, int64_t offset, int count)
252
} else {
253
nbd_co_receive_reply(client, &request, &reply, NULL);
254
}
255
- nbd_coroutine_end(client, &request);
256
+ nbd_coroutine_end(bs, &request);
257
return -reply.error;
258
259
}
260
261
void nbd_client_detach_aio_context(BlockDriverState *bs)
262
{
263
- aio_set_fd_handler(bdrv_get_aio_context(bs),
264
- nbd_get_client_session(bs)->sioc->fd,
265
- false, NULL, NULL, NULL, NULL);
266
+ NBDClientSession *client = nbd_get_client_session(bs);
267
+ qio_channel_detach_aio_context(QIO_CHANNEL(client->sioc));
268
}
269
270
void nbd_client_attach_aio_context(BlockDriverState *bs,
271
AioContext *new_context)
272
{
273
- aio_set_fd_handler(new_context, nbd_get_client_session(bs)->sioc->fd,
274
- false, nbd_reply_ready, NULL, NULL, bs);
275
+ NBDClientSession *client = nbd_get_client_session(bs);
276
+ qio_channel_attach_aio_context(QIO_CHANNEL(client->sioc), new_context);
277
+ aio_co_schedule(new_context, client->read_reply_co);
278
}
279
280
void nbd_client_close(BlockDriverState *bs)
281
@@ -XXX,XX +XXX,XX @@ int nbd_client_init(BlockDriverState *bs,
282
/* Now that we're connected, set the socket to be non-blocking and
283
* kick the reply mechanism. */
284
qio_channel_set_blocking(QIO_CHANNEL(sioc), false, NULL);
285
-
286
+ client->read_reply_co = qemu_coroutine_create(nbd_read_reply_entry, client);
287
nbd_client_attach_aio_context(bs, bdrv_get_aio_context(bs));
288
289
logout("Established connection with NBD server\n");
290
diff --git a/nbd/client.c b/nbd/client.c
291
index XXXXXXX..XXXXXXX 100644
292
--- a/nbd/client.c
293
+++ b/nbd/client.c
294
@@ -XXX,XX +XXX,XX @@ ssize_t nbd_receive_reply(QIOChannel *ioc, NBDReply *reply)
295
ssize_t ret;
296
297
ret = read_sync(ioc, buf, sizeof(buf));
298
- if (ret < 0) {
299
+ if (ret <= 0) {
300
return ret;
301
}
302
303
diff --git a/nbd/common.c b/nbd/common.c
304
index XXXXXXX..XXXXXXX 100644
305
--- a/nbd/common.c
306
+++ b/nbd/common.c
307
@@ -XXX,XX +XXX,XX @@ ssize_t nbd_wr_syncv(QIOChannel *ioc,
308
}
309
if (len == QIO_CHANNEL_ERR_BLOCK) {
310
if (qemu_in_coroutine()) {
311
- /* XXX figure out if we can create a variant on
312
- * qio_channel_yield() that works with AIO contexts
313
- * and consider using that in this branch */
314
- qemu_coroutine_yield();
315
- } else if (done) {
316
- /* XXX this is needed by nbd_reply_ready. */
317
- qio_channel_wait(ioc,
318
- do_read ? G_IO_IN : G_IO_OUT);
319
+ qio_channel_yield(ioc, do_read ? G_IO_IN : G_IO_OUT);
320
} else {
321
return -EAGAIN;
322
}
323
diff --git a/nbd/server.c b/nbd/server.c
143
diff --git a/nbd/server.c b/nbd/server.c
324
index XXXXXXX..XXXXXXX 100644
144
index XXXXXXX..XXXXXXX 100644
325
--- a/nbd/server.c
145
--- a/nbd/server.c
326
+++ b/nbd/server.c
146
+++ b/nbd/server.c
327
@@ -XXX,XX +XXX,XX @@ struct NBDClient {
147
@@ -XXX,XX +XXX,XX @@ static int nbd_export_create(BlockExport *blk_exp, BlockExportOptions *exp_args,
328
CoMutex send_lock;
148
return ret;
329
Coroutine *send_coroutine;
149
}
330
150
331
- bool can_read;
151
- blk_set_allow_aio_context_change(blk, true);
332
-
152
-
333
QTAILQ_ENTRY(NBDClient) next;
153
QTAILQ_INIT(&exp->clients);
334
int nb_requests;
154
exp->name = g_strdup(arg->name);
335
bool closing;
155
exp->description = g_strdup(arg->description);
336
@@ -XXX,XX +XXX,XX @@ struct NBDClient {
337
338
/* That's all folks */
339
340
-static void nbd_set_handlers(NBDClient *client);
341
-static void nbd_unset_handlers(NBDClient *client);
342
-static void nbd_update_can_read(NBDClient *client);
343
+static void nbd_client_receive_next_request(NBDClient *client);
344
345
static gboolean nbd_negotiate_continue(QIOChannel *ioc,
346
GIOCondition condition,
347
@@ -XXX,XX +XXX,XX @@ void nbd_client_put(NBDClient *client)
348
*/
349
assert(client->closing);
350
351
- nbd_unset_handlers(client);
352
+ qio_channel_detach_aio_context(client->ioc);
353
object_unref(OBJECT(client->sioc));
354
object_unref(OBJECT(client->ioc));
355
if (client->tlscreds) {
356
@@ -XXX,XX +XXX,XX @@ static NBDRequestData *nbd_request_get(NBDClient *client)
357
358
assert(client->nb_requests <= MAX_NBD_REQUESTS - 1);
359
client->nb_requests++;
360
- nbd_update_can_read(client);
361
362
req = g_new0(NBDRequestData, 1);
363
nbd_client_get(client);
364
@@ -XXX,XX +XXX,XX @@ static void nbd_request_put(NBDRequestData *req)
365
g_free(req);
366
367
client->nb_requests--;
368
-    nbd_update_can_read(client);
+    nbd_client_receive_next_request(client);
+
     nbd_client_put(client);
 }

@@ -XXX,XX +XXX,XX @@ static void blk_aio_attached(AioContext *ctx, void *opaque)
     exp->ctx = ctx;

     QTAILQ_FOREACH(client, &exp->clients, next) {
-        nbd_set_handlers(client);
+        qio_channel_attach_aio_context(client->ioc, ctx);
+        if (client->recv_coroutine) {
+            aio_co_schedule(ctx, client->recv_coroutine);
+        }
+        if (client->send_coroutine) {
+            aio_co_schedule(ctx, client->send_coroutine);
+        }
     }
 }

@@ -XXX,XX +XXX,XX @@ static void blk_aio_detach(void *opaque)
     TRACE("Export %s: Detaching clients from AIO context %p\n", exp->name, exp->ctx);

     QTAILQ_FOREACH(client, &exp->clients, next) {
-        nbd_unset_handlers(client);
+        qio_channel_detach_aio_context(client->ioc);
     }

     exp->ctx = NULL;

@@ -XXX,XX +XXX,XX @@ static ssize_t nbd_co_send_reply(NBDRequestData *req, NBDReply *reply,
     g_assert(qemu_in_coroutine());
     qemu_co_mutex_lock(&client->send_lock);
     client->send_coroutine = qemu_coroutine_self();
-    nbd_set_handlers(client);

     if (!len) {
         rc = nbd_send_reply(client->ioc, reply);
@@ -XXX,XX +XXX,XX @@ static ssize_t nbd_co_send_reply(NBDRequestData *req, NBDReply *reply,
     }

     client->send_coroutine = NULL;
-    nbd_set_handlers(client);
     qemu_co_mutex_unlock(&client->send_lock);
     return rc;
 }

@@ -XXX,XX +XXX,XX @@ static ssize_t nbd_co_receive_request(NBDRequestData *req,
     ssize_t rc;

     g_assert(qemu_in_coroutine());
-    client->recv_coroutine = qemu_coroutine_self();
-    nbd_update_can_read(client);
-
+    assert(client->recv_coroutine == qemu_coroutine_self());
     rc = nbd_receive_request(client->ioc, request);
     if (rc < 0) {
         if (rc != -EAGAIN) {
@@ -XXX,XX +XXX,XX @@ static ssize_t nbd_co_receive_request(NBDRequestData *req,

 out:
     client->recv_coroutine = NULL;
-    nbd_update_can_read(client);
+    nbd_client_receive_next_request(client);

     return rc;
 }

-static void nbd_trip(void *opaque)
+/* Owns a reference to the NBDClient passed as opaque. */
+static coroutine_fn void nbd_trip(void *opaque)
 {
     NBDClient *client = opaque;
     NBDExport *exp = client->exp;
     NBDRequestData *req;
-    NBDRequest request;
+    NBDRequest request = { 0 };    /* GCC thinks it can be used uninitialized */
     NBDReply reply;
     ssize_t ret;
     int flags;

     TRACE("Reading request.");
     if (client->closing) {
+        nbd_client_put(client);
         return;
     }

@@ -XXX,XX +XXX,XX @@ static void nbd_trip(void *opaque)

 done:
     nbd_request_put(req);
+    nbd_client_put(client);
     return;

 out:
     nbd_request_put(req);
     client_close(client);
+    nbd_client_put(client);
 }

-static void nbd_read(void *opaque)
+static void nbd_client_receive_next_request(NBDClient *client)
 {
-    NBDClient *client = opaque;
-
-    if (client->recv_coroutine) {
-        qemu_coroutine_enter(client->recv_coroutine);
-    } else {
-        qemu_coroutine_enter(qemu_coroutine_create(nbd_trip, client));
-    }
-}
-
-static void nbd_restart_write(void *opaque)
-{
-    NBDClient *client = opaque;
-
-    qemu_coroutine_enter(client->send_coroutine);
-}
-
-static void nbd_set_handlers(NBDClient *client)
-{
-    if (client->exp && client->exp->ctx) {
-        aio_set_fd_handler(client->exp->ctx, client->sioc->fd, true,
-                           client->can_read ? nbd_read : NULL,
-                           client->send_coroutine ? nbd_restart_write : NULL,
-                           NULL, client);
-    }
-}
-
-static void nbd_unset_handlers(NBDClient *client)
-{
-    if (client->exp && client->exp->ctx) {
-        aio_set_fd_handler(client->exp->ctx, client->sioc->fd, true, NULL,
-                           NULL, NULL, NULL);
-    }
-}
-
-static void nbd_update_can_read(NBDClient *client)
-{
-    bool can_read = client->recv_coroutine ||
-                    client->nb_requests < MAX_NBD_REQUESTS;
-
-    if (can_read != client->can_read) {
-        client->can_read = can_read;
-        nbd_set_handlers(client);
-
-        /* There is no need to invoke aio_notify(), since aio_set_fd_handler()
-         * in nbd_set_handlers() will have taken care of that */
+    if (!client->recv_coroutine && client->nb_requests < MAX_NBD_REQUESTS) {
+        nbd_client_get(client);
+        client->recv_coroutine = qemu_coroutine_create(nbd_trip, client);
+        aio_co_schedule(client->exp->ctx, client->recv_coroutine);
     }
 }

@@ -XXX,XX +XXX,XX @@ static coroutine_fn void nbd_co_client_start(void *opaque)
         goto out;
     }
     qemu_co_mutex_init(&client->send_lock);
-    nbd_set_handlers(client);

     if (exp) {
         QTAILQ_INSERT_TAIL(&exp->clients, client, next);
     }
+
+    nbd_client_receive_next_request(client);
+
 out:
     g_free(data);
 }
@@ -XXX,XX +XXX,XX @@ void nbd_client_new(NBDExport *exp,
     object_ref(OBJECT(client->sioc));
     client->ioc = QIO_CHANNEL(sioc);
     object_ref(OBJECT(client->ioc));
-    client->can_read = true;
     client->close = close_fn;

     data->client = client;
--
2.9.3
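The new request loop above boils down to one idiom: create the coroutine, then hand it to the right event loop with aio_co_schedule() instead of juggling fd handlers. A minimal sketch of that idiom in isolation — my_request_handler() and my_server_kick() are illustrative names, not part of the patch:

    #include "qemu/osdep.h"
    #include "qemu/coroutine.h"
    #include "block/aio.h"

    /* Coroutine body: runs inside @ctx once the scheduled bottom half fires. */
    static coroutine_fn void my_request_handler(void *opaque)
    {
        /* read one request, process it, possibly reschedule itself */
    }

    static void my_server_kick(AioContext *ctx, void *opaque)
    {
        Coroutine *co = qemu_coroutine_create(my_request_handler, opaque);

        /*
         * Safe from any thread: the coroutine is entered via a bottom half
         * in @ctx, so no can_read bookkeeping or aio_set_fd_handler()
         * calls are needed, just like nbd_client_receive_next_request().
         */
        aio_co_schedule(ctx, co);
    }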
From: Paolo Bonzini <pbonzini@redhat.com>

Support separate coroutines for reading and writing, and place the
read/write handlers on the AioContext that the QIOChannel is registered
with.

Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20170213135235.12274-7-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/io/channel.h | 47 ++++++++++++++++++++++++++--
 io/channel.c         | 86 +++++++++++++++++++++++++++++++++++++++-------------
 2 files changed, 109 insertions(+), 24 deletions(-)

diff --git a/include/io/channel.h b/include/io/channel.h
index XXXXXXX..XXXXXXX 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -XXX,XX +XXX,XX @@

 #include "qemu-common.h"
 #include "qom/object.h"
+#include "qemu/coroutine.h"
 #include "block/aio.h"

 #define TYPE_QIO_CHANNEL "qio-channel"
@@ -XXX,XX +XXX,XX @@ struct QIOChannel {
     Object parent;
     unsigned int features; /* bitmask of QIOChannelFeatures */
     char *name;
+    AioContext *ctx;
+    Coroutine *read_coroutine;
+    Coroutine *write_coroutine;
 #ifdef _WIN32
     HANDLE event; /* For use with GSource on Win32 */
 #endif
@@ -XXX,XX +XXX,XX @@ guint qio_channel_add_watch(QIOChannel *ioc,


 /**
+ * qio_channel_attach_aio_context:
+ * @ioc: the channel object
+ * @ctx: the #AioContext to set the handlers on
+ *
+ * Request that qio_channel_yield() sets I/O handlers on
+ * the given #AioContext. If @ctx is %NULL, qio_channel_yield()
+ * uses QEMU's main thread event loop.
+ *
+ * You can move a #QIOChannel from one #AioContext to another even if
+ * I/O handlers are set for a coroutine. However, #QIOChannel provides
+ * no synchronization between the calls to qio_channel_yield() and
+ * qio_channel_attach_aio_context().
+ *
+ * Therefore you should first call qio_channel_detach_aio_context()
+ * to ensure that the coroutine is not entered concurrently. Then,
+ * while the coroutine has yielded, call qio_channel_attach_aio_context(),
+ * and then aio_co_schedule() to place the coroutine on the new
+ * #AioContext. The calls to qio_channel_detach_aio_context()
+ * and qio_channel_attach_aio_context() should be protected with
+ * aio_context_acquire() and aio_context_release().
+ */
+void qio_channel_attach_aio_context(QIOChannel *ioc,
+                                    AioContext *ctx);
+
+/**
+ * qio_channel_detach_aio_context:
+ * @ioc: the channel object
+ *
+ * Disable any I/O handlers set by qio_channel_yield(). With the
+ * help of aio_co_schedule(), this allows moving a coroutine that was
+ * paused by qio_channel_yield() to another context.
+ */
+void qio_channel_detach_aio_context(QIOChannel *ioc);
+
+/**
 * qio_channel_yield:
 * @ioc: the channel object
 * @condition: the I/O condition to wait for
 *
- * Yields execution from the current coroutine until
- * the condition indicated by @condition becomes
- * available.
+ * Yields execution from the current coroutine until the condition
+ * indicated by @condition becomes available. @condition must
+ * be either %G_IO_IN or %G_IO_OUT; it cannot contain both. In
+ * addition, no two coroutine can be waiting on the same condition
+ * and channel at the same time.
 *
 * This must only be called from coroutine context
 */
diff --git a/io/channel.c b/io/channel.c
index XXXXXXX..XXXXXXX 100644
--- a/io/channel.c
+++ b/io/channel.c
@@ -XXX,XX +XXX,XX @@
 #include "qemu/osdep.h"
 #include "io/channel.h"
 #include "qapi/error.h"
-#include "qemu/coroutine.h"
+#include "qemu/main-loop.h"

 bool qio_channel_has_feature(QIOChannel *ioc,
                              QIOChannelFeature feature)
@@ -XXX,XX +XXX,XX @@ off_t qio_channel_io_seek(QIOChannel *ioc,
 }


-typedef struct QIOChannelYieldData QIOChannelYieldData;
-struct QIOChannelYieldData {
-    QIOChannel *ioc;
-    Coroutine *co;
-};
+static void qio_channel_set_aio_fd_handlers(QIOChannel *ioc);

+static void qio_channel_restart_read(void *opaque)
+{
+    QIOChannel *ioc = opaque;
+    Coroutine *co = ioc->read_coroutine;
+
+    ioc->read_coroutine = NULL;
+    qio_channel_set_aio_fd_handlers(ioc);
+    aio_co_wake(co);
+}

-static gboolean qio_channel_yield_enter(QIOChannel *ioc,
-                                        GIOCondition condition,
-                                        gpointer opaque)
+static void qio_channel_restart_write(void *opaque)
 {
-    QIOChannelYieldData *data = opaque;
-    qemu_coroutine_enter(data->co);
-    return FALSE;
+    QIOChannel *ioc = opaque;
+    Coroutine *co = ioc->write_coroutine;
+
+    ioc->write_coroutine = NULL;
+    qio_channel_set_aio_fd_handlers(ioc);
+    aio_co_wake(co);
 }

+static void qio_channel_set_aio_fd_handlers(QIOChannel *ioc)
+{
+    IOHandler *rd_handler = NULL, *wr_handler = NULL;
+    AioContext *ctx;
+
+    if (ioc->read_coroutine) {
+        rd_handler = qio_channel_restart_read;
+    }
+    if (ioc->write_coroutine) {
+        wr_handler = qio_channel_restart_write;
+    }
+
+    ctx = ioc->ctx ? ioc->ctx : iohandler_get_aio_context();
+    qio_channel_set_aio_fd_handler(ioc, ctx, rd_handler, wr_handler, ioc);
+}
+
+void qio_channel_attach_aio_context(QIOChannel *ioc,
+                                    AioContext *ctx)
+{
+    AioContext *old_ctx;
+    if (ioc->ctx == ctx) {
+        return;
+    }
+
+    old_ctx = ioc->ctx ? ioc->ctx : iohandler_get_aio_context();
+    qio_channel_set_aio_fd_handler(ioc, old_ctx, NULL, NULL, NULL);
+    ioc->ctx = ctx;
+    qio_channel_set_aio_fd_handlers(ioc);
+}
+
+void qio_channel_detach_aio_context(QIOChannel *ioc)
+{
+    ioc->read_coroutine = NULL;
+    ioc->write_coroutine = NULL;
+    qio_channel_set_aio_fd_handlers(ioc);
+    ioc->ctx = NULL;
+}

 void coroutine_fn qio_channel_yield(QIOChannel *ioc,
                                     GIOCondition condition)
 {
-    QIOChannelYieldData data;
-
     assert(qemu_in_coroutine());
-    data.ioc = ioc;
-    data.co = qemu_coroutine_self();
-    qio_channel_add_watch(ioc,
-                          condition,
-                          qio_channel_yield_enter,
-                          &data,
-                          NULL);
+    if (condition == G_IO_IN) {
+        assert(!ioc->read_coroutine);
+        ioc->read_coroutine = qemu_coroutine_self();
+    } else if (condition == G_IO_OUT) {
+        assert(!ioc->write_coroutine);
+        ioc->write_coroutine = qemu_coroutine_self();
+    } else {
+        abort();
+    }
+    qio_channel_set_aio_fd_handlers(ioc);
     qemu_coroutine_yield();
 }

--
2.9.3

Allow the number of queues to be configured using --export
vhost-user-blk,num-queues=N. This setting should match the QEMU --device
vhost-user-blk-pci,num-queues=N setting but QEMU vhost-user-blk.c lowers
its own value if the vhost-user-blk backend offers fewer queues than
QEMU.

The vhost-user-blk-server.c code is already capable of multi-queue. All
virtqueue processing runs in the same AioContext. No new locking is
needed.

Add the num-queues=N option and set the VIRTIO_BLK_F_MQ feature bit.
Note that the feature bit only announces the presence of the num_queues
configuration space field. It does not promise that there is more than 1
virtqueue, so we can set it unconditionally.

I tested multi-queue by running a random read fio test with numjobs=4 on
an -smp 4 guest. After the benchmark finished the guest /proc/interrupts
file showed activity on all 4 virtio-blk MSI-X. The /sys/block/vda/mq/
directory shows that Linux blk-mq has 4 queues configured.

An automated test is included in the next commit.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Message-id: 20201001144604.559733-2-stefanha@redhat.com
[Fixed accidental tab characters as suggested by Markus Armbruster
--Stefan]
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 qapi/block-export.json               | 10 +++++++---
 block/export/vhost-user-blk-server.c | 24 ++++++++++++++++++------
 2 files changed, 25 insertions(+), 9 deletions(-)

diff --git a/qapi/block-export.json b/qapi/block-export.json
index XXXXXXX..XXXXXXX 100644
--- a/qapi/block-export.json
+++ b/qapi/block-export.json
@@ -XXX,XX +XXX,XX @@
 # SocketAddress types are supported. Passed fds must be UNIX domain
 # sockets.
 # @logical-block-size: Logical block size in bytes. Defaults to 512 bytes.
+# @num-queues: Number of request virtqueues. Must be greater than 0. Defaults
+#              to 1.
 #
 # Since: 5.2
 ##
 { 'struct': 'BlockExportOptionsVhostUserBlk',
-  'data': { 'addr': 'SocketAddress', '*logical-block-size': 'size' } }
+  'data': { 'addr': 'SocketAddress',
+            '*logical-block-size': 'size',
+            '*num-queues': 'uint16'} }

 ##
 # @NbdServerAddOptions:
@@ -XXX,XX +XXX,XX @@
 { 'union': 'BlockExportOptions',
   'base': { 'type': 'BlockExportType',
             'id': 'str',
-            '*fixed-iothread': 'bool',
-            '*iothread': 'str',
+            '*fixed-iothread': 'bool',
+            '*iothread': 'str',
             'node-name': 'str',
             '*writable': 'bool',
             '*writethrough': 'bool' },
diff --git a/block/export/vhost-user-blk-server.c b/block/export/vhost-user-blk-server.c
index XXXXXXX..XXXXXXX 100644
--- a/block/export/vhost-user-blk-server.c
+++ b/block/export/vhost-user-blk-server.c
@@ -XXX,XX +XXX,XX @@
 #include "util/block-helpers.h"

 enum {
-    VHOST_USER_BLK_MAX_QUEUES = 1,
+    VHOST_USER_BLK_NUM_QUEUES_DEFAULT = 1,
 };
 struct virtio_blk_inhdr {
     unsigned char status;
@@ -XXX,XX +XXX,XX @@ static uint64_t vu_blk_get_features(VuDev *dev)
            1ull << VIRTIO_BLK_F_DISCARD |
            1ull << VIRTIO_BLK_F_WRITE_ZEROES |
            1ull << VIRTIO_BLK_F_CONFIG_WCE |
+           1ull << VIRTIO_BLK_F_MQ |
            1ull << VIRTIO_F_VERSION_1 |
            1ull << VIRTIO_RING_F_INDIRECT_DESC |
            1ull << VIRTIO_RING_F_EVENT_IDX |
@@ -XXX,XX +XXX,XX @@ static void blk_aio_detach(void *opaque)

 static void
 vu_blk_initialize_config(BlockDriverState *bs,
-                         struct virtio_blk_config *config, uint32_t blk_size)
+                         struct virtio_blk_config *config,
+                         uint32_t blk_size,
+                         uint16_t num_queues)
 {
     config->capacity = bdrv_getlength(bs) >> BDRV_SECTOR_BITS;
     config->blk_size = blk_size;
@@ -XXX,XX +XXX,XX @@ vu_blk_initialize_config(BlockDriverState *bs,
     config->seg_max = 128 - 2;
     config->min_io_size = 1;
     config->opt_io_size = 1;
-    config->num_queues = VHOST_USER_BLK_MAX_QUEUES;
+    config->num_queues = num_queues;
     config->max_discard_sectors = 32768;
     config->max_discard_seg = 1;
     config->discard_sector_alignment = config->blk_size >> 9;
@@ -XXX,XX +XXX,XX @@ static int vu_blk_exp_create(BlockExport *exp, BlockExportOptions *opts,
     BlockExportOptionsVhostUserBlk *vu_opts = &opts->u.vhost_user_blk;
     Error *local_err = NULL;
     uint64_t logical_block_size;
+    uint16_t num_queues = VHOST_USER_BLK_NUM_QUEUES_DEFAULT;

     vexp->writable = opts->writable;
     vexp->blkcfg.wce = 0;
@@ -XXX,XX +XXX,XX @@ static int vu_blk_exp_create(BlockExport *exp, BlockExportOptions *opts,
     }
     vexp->blk_size = logical_block_size;
     blk_set_guest_block_size(exp->blk, logical_block_size);
+
+    if (vu_opts->has_num_queues) {
+        num_queues = vu_opts->num_queues;
+    }
+    if (num_queues == 0) {
+        error_setg(errp, "num-queues must be greater than 0");
+        return -EINVAL;
+    }
+
     vu_blk_initialize_config(blk_bs(exp->blk), &vexp->blkcfg,
-                             logical_block_size);
+                             logical_block_size, num_queues);

     blk_add_aio_context_notifier(exp->blk, blk_aio_attached, blk_aio_detach,
                                  vexp);

     if (!vhost_user_server_start(&vexp->vu_server, vu_opts->addr, exp->ctx,
-                                 VHOST_USER_BLK_MAX_QUEUES, &vu_blk_iface,
-                                 errp)) {
+                                 num_queues, &vu_blk_iface, errp)) {
         blk_remove_aio_context_notifier(exp->blk, blk_aio_attached,
                                         blk_aio_detach, vexp);
         return -EADDRNOTAVAIL;
--
2.26.2
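To see how the reworked qio_channel_yield() above is meant to be used, here is a minimal sketch of a coroutine read loop — my_channel_read_all() is an illustrative name, not part of either series:

    #include "qemu/osdep.h"
    #include "io/channel.h"

    /* Runs in coroutine context; blocks the coroutine, not the thread. */
    static coroutine_fn ssize_t my_channel_read_all(QIOChannel *ioc,
                                                    char *buf, size_t len)
    {
        size_t done = 0;

        while (done < len) {
            ssize_t n = qio_channel_read(ioc, buf + done, len - done, NULL);
            if (n == QIO_CHANNEL_ERR_BLOCK) {
                /*
                 * Registers the G_IO_IN handler on the channel's AioContext
                 * and yields; qio_channel_restart_read() wakes us up.
                 */
                qio_channel_yield(ioc, G_IO_IN);
                continue;
            }
            if (n <= 0) {
                return n;
            }
            done += n;
        }
        return done;
    }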
From: Paolo Bonzini <pbonzini@redhat.com>

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
Message-id: 20170213135235.12274-15-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/archipelago.c   |  3 +++
 block/blkreplay.c     |  2 +-
 block/block-backend.c |  6 ++++++
 block/curl.c          | 26 ++++++++++++++++++--------
 block/gluster.c       |  9 +--------
 block/io.c            |  6 +++++-
 block/iscsi.c         |  6 +++++-
 block/linux-aio.c     | 15 +++++++++------
 block/nfs.c           |  3 ++-
 block/null.c          |  4 ++++
 block/qed.c           |  3 +++
 block/rbd.c           |  4 ++++
 dma-helpers.c         |  2 ++
 hw/block/virtio-blk.c |  2 ++
 hw/scsi/scsi-bus.c    |  2 ++
 util/async.c          |  4 ++--
 util/thread-pool.c    |  2 ++
 17 files changed, 71 insertions(+), 28 deletions(-)

diff --git a/block/archipelago.c b/block/archipelago.c
index XXXXXXX..XXXXXXX 100644
--- a/block/archipelago.c
+++ b/block/archipelago.c
@@ -XXX,XX +XXX,XX @@ static void qemu_archipelago_complete_aio(void *opaque)
 {
     AIORequestData *reqdata = (AIORequestData *) opaque;
     ArchipelagoAIOCB *aio_cb = (ArchipelagoAIOCB *) reqdata->aio_cb;
+    AioContext *ctx = bdrv_get_aio_context(aio_cb->common.bs);

+    aio_context_acquire(ctx);
     aio_cb->common.cb(aio_cb->common.opaque, aio_cb->ret);
+    aio_context_release(ctx);
     aio_cb->status = 0;

     qemu_aio_unref(aio_cb);
diff --git a/block/blkreplay.c b/block/blkreplay.c
index XXXXXXX..XXXXXXX 100755
--- a/block/blkreplay.c
+++ b/block/blkreplay.c
@@ -XXX,XX +XXX,XX @@ static int64_t blkreplay_getlength(BlockDriverState *bs)
 static void blkreplay_bh_cb(void *opaque)
 {
     Request *req = opaque;
-    qemu_coroutine_enter(req->co);
+    aio_co_wake(req->co);
     qemu_bh_delete(req->bh);
     g_free(req);
 }
diff --git a/block/block-backend.c b/block/block-backend.c
index XXXXXXX..XXXXXXX 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -XXX,XX +XXX,XX @@ int blk_make_zero(BlockBackend *blk, BdrvRequestFlags flags)
 static void error_callback_bh(void *opaque)
 {
     struct BlockBackendAIOCB *acb = opaque;
+    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);

     bdrv_dec_in_flight(acb->common.bs);
+    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, acb->ret);
+    aio_context_release(ctx);
     qemu_aio_unref(acb);
 }

@@ -XXX,XX +XXX,XX @@ static void blk_aio_complete(BlkAioEmAIOCB *acb)
 static void blk_aio_complete_bh(void *opaque)
 {
     BlkAioEmAIOCB *acb = opaque;
+    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);

     assert(acb->has_returned);
+    aio_context_acquire(ctx);
     blk_aio_complete(acb);
+    aio_context_release(ctx);
 }

 static BlockAIOCB *blk_aio_prwv(BlockBackend *blk, int64_t offset, int bytes,
diff --git a/block/curl.c b/block/curl.c
index XXXXXXX..XXXXXXX 100644
--- a/block/curl.c
+++ b/block/curl.c
@@ -XXX,XX +XXX,XX @@ static void curl_readv_bh_cb(void *p)
 {
     CURLState *state;
     int running;
+    int ret = -EINPROGRESS;

     CURLAIOCB *acb = p;
-    BDRVCURLState *s = acb->common.bs->opaque;
+    BlockDriverState *bs = acb->common.bs;
+    BDRVCURLState *s = bs->opaque;
+    AioContext *ctx = bdrv_get_aio_context(bs);

     size_t start = acb->sector_num * BDRV_SECTOR_SIZE;
     size_t end;

+    aio_context_acquire(ctx);
+
     // In case we have the requested data already (e.g. read-ahead),
     // we can just call the callback and be done.
     switch (curl_find_buf(s, start, acb->nb_sectors * BDRV_SECTOR_SIZE, acb)) {
@@ -XXX,XX +XXX,XX @@ static void curl_readv_bh_cb(void *p)
         qemu_aio_unref(acb);
         // fall through
     case FIND_RET_WAIT:
-        return;
+        goto out;
     default:
         break;
     }
@@ -XXX,XX +XXX,XX @@ static void curl_readv_bh_cb(void *p)
     // No cache found, so let's start a new request
     state = curl_init_state(acb->common.bs, s);
     if (!state) {
-        acb->common.cb(acb->common.opaque, -EIO);
-        qemu_aio_unref(acb);
-        return;
+        ret = -EIO;
+        goto out;
     }

     acb->start = 0;
@@ -XXX,XX +XXX,XX @@ static void curl_readv_bh_cb(void *p)
     state->orig_buf = g_try_malloc(state->buf_len);
     if (state->buf_len && state->orig_buf == NULL) {
         curl_clean_state(state);
-        acb->common.cb(acb->common.opaque, -ENOMEM);
-        qemu_aio_unref(acb);
-        return;
+        ret = -ENOMEM;
+        goto out;
     }
     state->acb[0] = acb;

@@ -XXX,XX +XXX,XX @@ static void curl_readv_bh_cb(void *p)

     /* Tell curl it needs to kick things off */
     curl_multi_socket_action(s->multi, CURL_SOCKET_TIMEOUT, 0, &running);
+
+out:
+    if (ret != -EINPROGRESS) {
+        acb->common.cb(acb->common.opaque, ret);
+        qemu_aio_unref(acb);
+    }
+    aio_context_release(ctx);
 }

 static BlockAIOCB *curl_aio_readv(BlockDriverState *bs,
diff --git a/block/gluster.c b/block/gluster.c
index XXXXXXX..XXXXXXX 100644
--- a/block/gluster.c
+++ b/block/gluster.c
@@ -XXX,XX +XXX,XX @@ static struct glfs *qemu_gluster_init(BlockdevOptionsGluster *gconf,
     return qemu_gluster_glfs_init(gconf, errp);
 }

-static void qemu_gluster_complete_aio(void *opaque)
-{
-    GlusterAIOCB *acb = (GlusterAIOCB *)opaque;
-
-    qemu_coroutine_enter(acb->coroutine);
-}
-
 /*
 * AIO callback routine called from GlusterFS thread.
 */
@@ -XXX,XX +XXX,XX @@ static void gluster_finish_aiocb(struct glfs_fd *fd, ssize_t ret, void *arg)
         acb->ret = -EIO; /* Partial read/write - fail it */
     }

-    aio_bh_schedule_oneshot(acb->aio_context, qemu_gluster_complete_aio, acb);
+    aio_co_schedule(acb->aio_context, acb->coroutine);
 }

 static void qemu_gluster_parse_flags(int bdrv_flags, int *open_flags)
diff --git a/block/io.c b/block/io.c
index XXXXXXX..XXXXXXX 100644
--- a/block/io.c
+++ b/block/io.c
@@ -XXX,XX +XXX,XX @@ static void bdrv_co_drain_bh_cb(void *opaque)
     bdrv_dec_in_flight(bs);
     bdrv_drained_begin(bs);
     data->done = true;
-    qemu_coroutine_enter(co);
+    aio_co_wake(co);
 }

 static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs)
@@ -XXX,XX +XXX,XX @@ static void bdrv_co_complete(BlockAIOCBCoroutine *acb)
 static void bdrv_co_em_bh(void *opaque)
 {
     BlockAIOCBCoroutine *acb = opaque;
+    BlockDriverState *bs = acb->common.bs;
+    AioContext *ctx = bdrv_get_aio_context(bs);

     assert(!acb->need_bh);
+    aio_context_acquire(ctx);
     bdrv_co_complete(acb);
+    aio_context_release(ctx);
 }

 static void bdrv_co_maybe_schedule_bh(BlockAIOCBCoroutine *acb)
diff --git a/block/iscsi.c b/block/iscsi.c
index XXXXXXX..XXXXXXX 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -XXX,XX +XXX,XX @@ static void
 iscsi_bh_cb(void *p)
 {
     IscsiAIOCB *acb = p;
+    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);

     qemu_bh_delete(acb->bh);

     g_free(acb->buf);
     acb->buf = NULL;

+    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, acb->status);
+    aio_context_release(ctx);

     if (acb->task != NULL) {
         scsi_free_scsi_task(acb->task);
@@ -XXX,XX +XXX,XX @@ iscsi_schedule_bh(IscsiAIOCB *acb)
 static void iscsi_co_generic_bh_cb(void *opaque)
 {
     struct IscsiTask *iTask = opaque;
+
     iTask->complete = 1;
-    qemu_coroutine_enter(iTask->co);
+    aio_co_wake(iTask->co);
 }

 static void iscsi_retry_timer_expired(void *opaque)
diff --git a/block/linux-aio.c b/block/linux-aio.c
index XXXXXXX..XXXXXXX 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -XXX,XX +XXX,XX @@ struct LinuxAioState {
     io_context_t ctx;
     EventNotifier e;

-    /* io queue for submit at batch */
+    /* io queue for submit at batch. Protected by AioContext lock. */
     LaioQueue io_q;

-    /* I/O completion processing */
+    /* I/O completion processing. Only runs in I/O thread. */
     QEMUBH *completion_bh;
     int event_idx;
     int event_max;
@@ -XXX,XX +XXX,XX @@ static inline ssize_t io_event_ret(struct io_event *ev)
 */
 static void qemu_laio_process_completion(struct qemu_laiocb *laiocb)
 {
+    LinuxAioState *s = laiocb->ctx;
     int ret;

     ret = laiocb->ret;
@@ -XXX,XX +XXX,XX @@ static void qemu_laio_process_completion(struct qemu_laiocb *laiocb)
     }

     laiocb->ret = ret;
+    aio_context_acquire(s->aio_context);
     if (laiocb->co) {
         /* If the coroutine is already entered it must be in ioq_submit() and
          * will notice laio->ret has been filled in when it eventually runs
@@ -XXX,XX +XXX,XX @@ static void qemu_laio_process_completion(struct qemu_laiocb *laiocb)
         laiocb->common.cb(laiocb->common.opaque, ret);
         qemu_aio_unref(laiocb);
     }
+    aio_context_release(s->aio_context);
 }

 /**
@@ -XXX,XX +XXX,XX @@ static void qemu_laio_process_completions(LinuxAioState *s)
 static void qemu_laio_process_completions_and_submit(LinuxAioState *s)
 {
     qemu_laio_process_completions(s);
+
+    aio_context_acquire(s->aio_context);
     if (!s->io_q.plugged && !QSIMPLEQ_EMPTY(&s->io_q.pending)) {
         ioq_submit(s);
     }
+    aio_context_release(s->aio_context);
 }

 static void qemu_laio_completion_bh(void *opaque)
@@ -XXX,XX +XXX,XX @@ static void qemu_laio_completion_cb(EventNotifier *e)
     LinuxAioState *s = container_of(e, LinuxAioState, e);

     if (event_notifier_test_and_clear(&s->e)) {
-        aio_context_acquire(s->aio_context);
         qemu_laio_process_completions_and_submit(s);
-        aio_context_release(s->aio_context);
     }
 }

@@ -XXX,XX +XXX,XX @@ static bool qemu_laio_poll_cb(void *opaque)
         return false;
     }

-    aio_context_acquire(s->aio_context);
     qemu_laio_process_completions_and_submit(s);
-    aio_context_release(s->aio_context);
     return true;
 }

@@ -XXX,XX +XXX,XX @@ void laio_detach_aio_context(LinuxAioState *s, AioContext *old_context)
 {
     aio_set_event_notifier(old_context, &s->e, false, NULL, NULL);
     qemu_bh_delete(s->completion_bh);
+    s->aio_context = NULL;
 }

 void laio_attach_aio_context(LinuxAioState *s, AioContext *new_context)
diff --git a/block/nfs.c b/block/nfs.c
index XXXXXXX..XXXXXXX 100644
--- a/block/nfs.c
+++ b/block/nfs.c
@@ -XXX,XX +XXX,XX @@ static void nfs_co_init_task(BlockDriverState *bs, NFSRPC *task)
 static void nfs_co_generic_bh_cb(void *opaque)
 {
     NFSRPC *task = opaque;
+
     task->complete = 1;
-    qemu_coroutine_enter(task->co);
+    aio_co_wake(task->co);
 }

 static void
diff --git a/block/null.c b/block/null.c
index XXXXXXX..XXXXXXX 100644
--- a/block/null.c
+++ b/block/null.c
@@ -XXX,XX +XXX,XX @@ static const AIOCBInfo null_aiocb_info = {
 static void null_bh_cb(void *opaque)
 {
     NullAIOCB *acb = opaque;
+    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
+
+    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, 0);
+    aio_context_release(ctx);
     qemu_aio_unref(acb);
 }

diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@ static void qed_update_l2_table(BDRVQEDState *s, QEDTable *table, int index,
 static void qed_aio_complete_bh(void *opaque)
 {
     QEDAIOCB *acb = opaque;
+    BDRVQEDState *s = acb_to_s(acb);
     BlockCompletionFunc *cb = acb->common.cb;
     void *user_opaque = acb->common.opaque;
     int ret = acb->bh_ret;
@@ -XXX,XX +XXX,XX @@ static void qed_aio_complete_bh(void *opaque)
     qemu_aio_unref(acb);

     /* Invoke callback */
+    qed_acquire(s);
     cb(user_opaque, ret);
+    qed_release(s);
 }

 static void qed_aio_complete(QEDAIOCB *acb, int ret)
diff --git a/block/rbd.c b/block/rbd.c
index XXXXXXX..XXXXXXX 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -XXX,XX +XXX,XX @@ shutdown:
 static void qemu_rbd_complete_aio(RADOSCB *rcb)
 {
     RBDAIOCB *acb = rcb->acb;
+    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
     int64_t r;

     r = rcb->ret;
@@ -XXX,XX +XXX,XX @@ static void qemu_rbd_complete_aio(RADOSCB *rcb)
         qemu_iovec_from_buf(acb->qiov, 0, acb->bounce, acb->qiov->size);
     }
     qemu_vfree(acb->bounce);
+
+    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
+    aio_context_release(ctx);

     qemu_aio_unref(acb);
 }
diff --git a/dma-helpers.c b/dma-helpers.c
index XXXXXXX..XXXXXXX 100644
--- a/dma-helpers.c
+++ b/dma-helpers.c
@@ -XXX,XX +XXX,XX @@ static void dma_blk_cb(void *opaque, int ret)
                                 QEMU_ALIGN_DOWN(dbs->iov.size, dbs->align));
     }

+    aio_context_acquire(dbs->ctx);
     dbs->acb = dbs->io_func(dbs->offset, &dbs->iov,
                             dma_blk_cb, dbs, dbs->io_func_opaque);
+    aio_context_release(dbs->ctx);
     assert(dbs->acb);
 }

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -XXX,XX +XXX,XX @@ static void virtio_blk_dma_restart_bh(void *opaque)

     s->rq = NULL;

+    aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
     while (req) {
         VirtIOBlockReq *next = req->next;
         if (virtio_blk_handle_request(req, &mrb)) {
@@ -XXX,XX +XXX,XX @@ static void virtio_blk_dma_restart_bh(void *opaque)
     if (mrb.num_reqs) {
         virtio_blk_submit_multireq(s->blk, &mrb);
     }
+    aio_context_release(blk_get_aio_context(s->conf.conf.blk));
 }

 static void virtio_blk_dma_restart_cb(void *opaque, int running,
diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -XXX,XX +XXX,XX @@ static void scsi_dma_restart_bh(void *opaque)
     qemu_bh_delete(s->bh);
     s->bh = NULL;

+    aio_context_acquire(blk_get_aio_context(s->conf.blk));
     QTAILQ_FOREACH_SAFE(req, &s->requests, next, next) {
         scsi_req_ref(req);
         if (req->retry) {
@@ -XXX,XX +XXX,XX @@ static void scsi_dma_restart_bh(void *opaque)
         }
         scsi_req_unref(req);
     }
+    aio_context_release(blk_get_aio_context(s->conf.blk));
 }

 void scsi_req_retry(SCSIRequest *req)
diff --git a/util/async.c b/util/async.c
index XXXXXXX..XXXXXXX 100644
--- a/util/async.c
+++ b/util/async.c
@@ -XXX,XX +XXX,XX @@ int aio_bh_poll(AioContext *ctx)
                 ret = 1;
             }
             bh->idle = 0;
-            aio_context_acquire(ctx);
             aio_bh_call(bh);
-            aio_context_release(ctx);
         }
         if (bh->deleted) {
             deleted = true;
@@ -XXX,XX +XXX,XX @@ static void co_schedule_bh_cb(void *opaque)
         Coroutine *co = QSLIST_FIRST(&straight);
         QSLIST_REMOVE_HEAD(&straight, co_scheduled_next);
         trace_aio_co_schedule_bh_cb(ctx, co);
+        aio_context_acquire(ctx);
         qemu_coroutine_enter(co);
+        aio_context_release(ctx);
     }
 }

diff --git a/util/thread-pool.c b/util/thread-pool.c
index XXXXXXX..XXXXXXX 100644
--- a/util/thread-pool.c
+++ b/util/thread-pool.c
@@ -XXX,XX +XXX,XX @@ static void thread_pool_completion_bh(void *opaque)
     ThreadPool *pool = opaque;
     ThreadPoolElement *elem, *next;

+    aio_context_acquire(pool->ctx);
restart:
     QLIST_FOREACH_SAFE(elem, &pool->head, all, next) {
         if (elem->state != THREAD_DONE) {
@@ -XXX,XX +XXX,XX @@ restart:
             qemu_aio_unref(elem);
         }
     }
+    aio_context_release(pool->ctx);
 }

 static void thread_pool_cancel(BlockAIOCB *acb)
--
2.9.3

From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

bdrv_co_block_status_above has several design problems with handling
short backing files:

1. With want_zero=true, it may return ret with BDRV_BLOCK_ZERO but
without the BDRV_BLOCK_ALLOCATED flag, when the short backing file
that produces these after-EOF zeros is actually inside the requested
backing sequence.

2. With want_zero=false, it may return pnum=0 prior to actual EOF,
because of EOF of a short backing file.

Fix these things, making the logic about short backing files clearer.

With fixed bdrv_block_status_above we also have to improve is_zero in
qcow2 code, otherwise iotest 154 will fail, because with this patch we
stop merging zeros of different types (produced by regions unallocated
in the whole backing chain vs produced by short backing files).

Note also that this patch leaves for another day the general problem
around block-status: misuse of BDRV_BLOCK_ALLOCATED as is-fs-allocated
vs go-to-backing.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 20200924194003.22080-2-vsementsov@virtuozzo.com
[Fix s/comes/come/ as suggested by Eric Blake
--Stefan]
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/io.c    | 68 ++++++++++++++++++++++++++++++++++++++++-----------
 block/qcow2.c | 16 ++++++++++--
 2 files changed, 68 insertions(+), 16 deletions(-)

diff --git a/block/io.c b/block/io.c
index XXXXXXX..XXXXXXX 100644
--- a/block/io.c
+++ b/block/io.c
@@ -XXX,XX +XXX,XX @@ bdrv_co_common_block_status_above(BlockDriverState *bs,
                                   int64_t *map,
                                   BlockDriverState **file)
 {
+    int ret;
     BlockDriverState *p;
-    int ret = 0;
-    bool first = true;
+    int64_t eof = 0;

     assert(bs != base);
-    for (p = bs; p != base; p = bdrv_filter_or_cow_bs(p)) {
+
+    ret = bdrv_co_block_status(bs, want_zero, offset, bytes, pnum, map, file);
+    if (ret < 0 || *pnum == 0 || ret & BDRV_BLOCK_ALLOCATED) {
+        return ret;
+    }
+
+    if (ret & BDRV_BLOCK_EOF) {
+        eof = offset + *pnum;
+    }
+
+    assert(*pnum <= bytes);
+    bytes = *pnum;
+
+    for (p = bdrv_filter_or_cow_bs(bs); p != base;
+         p = bdrv_filter_or_cow_bs(p))
+    {
         ret = bdrv_co_block_status(p, want_zero, offset, bytes, pnum, map,
                                    file);
         if (ret < 0) {
-            break;
+            return ret;
         }
-        if (ret & BDRV_BLOCK_ZERO && ret & BDRV_BLOCK_EOF && !first) {
+        if (*pnum == 0) {
             /*
-             * Reading beyond the end of the file continues to read
-             * zeroes, but we can only widen the result to the
-             * unallocated length we learned from an earlier
-             * iteration.
+             * The top layer deferred to this layer, and because this layer is
+             * short, any zeroes that we synthesize beyond EOF behave as if they
+             * were allocated at this layer.
+             *
+             * We don't include BDRV_BLOCK_EOF into ret, as upper layer may be
+             * larger. We'll add BDRV_BLOCK_EOF if needed at function end, see
+             * below.
              */
+            assert(ret & BDRV_BLOCK_EOF);
             *pnum = bytes;
+            if (file) {
+                *file = p;
+            }
+            ret = BDRV_BLOCK_ZERO | BDRV_BLOCK_ALLOCATED;
+            break;
         }
-        if (ret & (BDRV_BLOCK_ZERO | BDRV_BLOCK_DATA)) {
+        if (ret & BDRV_BLOCK_ALLOCATED) {
+            /*
+             * We've found the node and the status, we must break.
+             *
+             * Drop BDRV_BLOCK_EOF, as it's not for upper layer, which may be
+             * larger. We'll add BDRV_BLOCK_EOF if needed at function end, see
+             * below.
+             */
+            ret &= ~BDRV_BLOCK_EOF;
             break;
         }
-        /* [offset, pnum] unallocated on this layer, which could be only
-         * the first part of [offset, bytes].  */
-        bytes = MIN(bytes, *pnum);
-        first = false;
+
+        /*
+         * OK, [offset, offset + *pnum) region is unallocated on this layer,
+         * let's continue the diving.
+         */
+        assert(*pnum <= bytes);
+        bytes = *pnum;
+    }
+
+    if (offset + *pnum == eof) {
+        ret |= BDRV_BLOCK_EOF;
     }
+
     return ret;
 }

diff --git a/block/qcow2.c b/block/qcow2.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -XXX,XX +XXX,XX @@ static bool is_zero(BlockDriverState *bs, int64_t offset, int64_t bytes)
     if (!bytes) {
         return true;
     }
-    res = bdrv_block_status_above(bs, NULL, offset, bytes, &nr, NULL, NULL);
-    return res >= 0 && (res & BDRV_BLOCK_ZERO) && nr == bytes;
+
+    /*
+     * bdrv_block_status_above doesn't merge different types of zeros, for
+     * example, zeros which come from the region which is unallocated in
+     * the whole backing chain, and zeros which come because of a short
+     * backing file. So, we need a loop.
+     */
+    do {
+        res = bdrv_block_status_above(bs, NULL, offset, bytes, &nr, NULL, NULL);
+        offset += nr;
+        bytes -= nr;
+    } while (res >= 0 && (res & BDRV_BLOCK_ZERO) && nr && bytes);
+
+    return res >= 0 && (res & BDRV_BLOCK_ZERO) && bytes == 0;
 }

 static coroutine_fn int qcow2_co_pwrite_zeroes(BlockDriverState *bs,
--
2.26.2
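A caller consumes these flags by walking the image in pnum-sized steps. A hedged sketch of such a walk using the public bdrv_block_status_above() signature from the diff — dump_block_status() is illustrative, not part of either series:

    #include "qemu/osdep.h"
    #include "block/block.h"

    /* Print the allocation map of @bs above @base over the first @len bytes. */
    static void dump_block_status(BlockDriverState *bs, BlockDriverState *base,
                                  int64_t len)
    {
        int64_t offset = 0;

        while (offset < len) {
            int64_t pnum;
            int ret = bdrv_block_status_above(bs, base, offset, len - offset,
                                              &pnum, NULL, NULL);
            if (ret < 0) {
                return;
            }
            /* pnum bytes starting at offset share the returned status. */
            printf("%" PRId64 "+%" PRId64 ":%s%s%s\n", offset, pnum,
                   ret & BDRV_BLOCK_ALLOCATED ? " allocated" : "",
                   ret & BDRV_BLOCK_ZERO ? " zero" : "",
                   ret & BDRV_BLOCK_EOF ? " eof" : "");
            offset += pnum;
        }
    }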
1
From: Paolo Bonzini <pbonzini@redhat.com>
1
From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2
2
3
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
3
In order to reuse bdrv_common_block_status_above in
4
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4
bdrv_is_allocated_above, let's support include_base parameter.
5
Reviewed-by: Fam Zheng <famz@redhat.com>
5
6
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
6
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
7
Message-id: 20170213135235.12274-16-pbonzini@redhat.com
7
Reviewed-by: Alberto Garcia <berto@igalia.com>
8
Reviewed-by: Eric Blake <eblake@redhat.com>
9
Message-id: 20200924194003.22080-3-vsementsov@virtuozzo.com
8
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
10
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
9
---
11
---
10
block/archipelago.c | 3 ---
12
block/coroutines.h | 2 ++
11
block/block-backend.c | 7 -------
13
block/io.c | 21 ++++++++++++++-------
12
block/curl.c | 2 +-
14
2 files changed, 16 insertions(+), 7 deletions(-)
13
block/io.c | 6 +-----
14
block/iscsi.c | 3 ---
15
block/linux-aio.c | 5 +----
16
block/mirror.c | 12 +++++++++---
17
block/null.c | 8 --------
18
block/qed-cluster.c | 2 ++
19
block/qed-table.c | 12 ++++++++++--
20
block/qed.c | 4 ++--
21
block/rbd.c | 4 ----
22
block/win32-aio.c | 3 ---
23
hw/block/virtio-blk.c | 12 +++++++++++-
24
hw/scsi/scsi-disk.c | 15 +++++++++++++++
25
hw/scsi/scsi-generic.c | 20 +++++++++++++++++---
26
util/thread-pool.c | 4 +++-
27
17 files changed, 72 insertions(+), 50 deletions(-)
28
15
29
diff --git a/block/archipelago.c b/block/archipelago.c
16
diff --git a/block/coroutines.h b/block/coroutines.h
30
index XXXXXXX..XXXXXXX 100644
17
index XXXXXXX..XXXXXXX 100644
31
--- a/block/archipelago.c
18
--- a/block/coroutines.h
32
+++ b/block/archipelago.c
19
+++ b/block/coroutines.h
33
@@ -XXX,XX +XXX,XX @@ static void qemu_archipelago_complete_aio(void *opaque)
20
@@ -XXX,XX +XXX,XX @@ bdrv_pwritev(BdrvChild *child, int64_t offset, unsigned int bytes,
34
{
21
int coroutine_fn
35
AIORequestData *reqdata = (AIORequestData *) opaque;
22
bdrv_co_common_block_status_above(BlockDriverState *bs,
36
ArchipelagoAIOCB *aio_cb = (ArchipelagoAIOCB *) reqdata->aio_cb;
23
BlockDriverState *base,
37
- AioContext *ctx = bdrv_get_aio_context(aio_cb->common.bs);
24
+ bool include_base,
38
25
bool want_zero,
39
- aio_context_acquire(ctx);
26
int64_t offset,
40
aio_cb->common.cb(aio_cb->common.opaque, aio_cb->ret);
27
int64_t bytes,
41
- aio_context_release(ctx);
28
@@ -XXX,XX +XXX,XX @@ bdrv_co_common_block_status_above(BlockDriverState *bs,
42
aio_cb->status = 0;
29
int generated_co_wrapper
43
30
bdrv_common_block_status_above(BlockDriverState *bs,
44
qemu_aio_unref(aio_cb);
31
BlockDriverState *base,
45
diff --git a/block/block-backend.c b/block/block-backend.c
32
+ bool include_base,
46
index XXXXXXX..XXXXXXX 100644
33
bool want_zero,
47
--- a/block/block-backend.c
34
int64_t offset,
48
+++ b/block/block-backend.c
35
int64_t bytes,
49
@@ -XXX,XX +XXX,XX @@ int blk_make_zero(BlockBackend *blk, BdrvRequestFlags flags)
50
static void error_callback_bh(void *opaque)
51
{
52
struct BlockBackendAIOCB *acb = opaque;
53
- AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
54
55
bdrv_dec_in_flight(acb->common.bs);
56
- aio_context_acquire(ctx);
57
acb->common.cb(acb->common.opaque, acb->ret);
58
- aio_context_release(ctx);
59
qemu_aio_unref(acb);
60
}
61
62
@@ -XXX,XX +XXX,XX @@ static void blk_aio_complete(BlkAioEmAIOCB *acb)
63
static void blk_aio_complete_bh(void *opaque)
64
{
65
BlkAioEmAIOCB *acb = opaque;
66
- AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
67
-
68
assert(acb->has_returned);
69
- aio_context_acquire(ctx);
70
blk_aio_complete(acb);
71
- aio_context_release(ctx);
72
}
73
74
static BlockAIOCB *blk_aio_prwv(BlockBackend *blk, int64_t offset, int bytes,
75
diff --git a/block/curl.c b/block/curl.c
76
index XXXXXXX..XXXXXXX 100644
77
--- a/block/curl.c
78
+++ b/block/curl.c
79
@@ -XXX,XX +XXX,XX @@ static void curl_readv_bh_cb(void *p)
80
curl_multi_socket_action(s->multi, CURL_SOCKET_TIMEOUT, 0, &running);
81
82
out:
83
+ aio_context_release(ctx);
84
if (ret != -EINPROGRESS) {
85
acb->common.cb(acb->common.opaque, ret);
86
qemu_aio_unref(acb);
87
}
88
- aio_context_release(ctx);
89
}
90
91
static BlockAIOCB *curl_aio_readv(BlockDriverState *bs,
92
diff --git a/block/io.c b/block/io.c
36
diff --git a/block/io.c b/block/io.c
93
index XXXXXXX..XXXXXXX 100644
37
index XXXXXXX..XXXXXXX 100644
94
--- a/block/io.c
38
--- a/block/io.c
95
+++ b/block/io.c
39
+++ b/block/io.c
96
@@ -XXX,XX +XXX,XX @@ static void bdrv_co_io_em_complete(void *opaque, int ret)
40
@@ -XXX,XX +XXX,XX @@ early_out:
97
CoroutineIOCompletion *co = opaque;
41
int coroutine_fn
98
42
bdrv_co_common_block_status_above(BlockDriverState *bs,
99
co->ret = ret;
43
BlockDriverState *base,
100
- qemu_coroutine_enter(co->coroutine);
44
+ bool include_base,
101
+ aio_co_wake(co->coroutine);
45
bool want_zero,
46
int64_t offset,
47
int64_t bytes,
48
@@ -XXX,XX +XXX,XX @@ bdrv_co_common_block_status_above(BlockDriverState *bs,
49
BlockDriverState *p;
50
int64_t eof = 0;
51
52
- assert(bs != base);
53
+ assert(include_base || bs != base);
54
+ assert(!include_base || base); /* Can't include NULL base */
55
56
ret = bdrv_co_block_status(bs, want_zero, offset, bytes, pnum, map, file);
57
- if (ret < 0 || *pnum == 0 || ret & BDRV_BLOCK_ALLOCATED) {
58
+ if (ret < 0 || *pnum == 0 || ret & BDRV_BLOCK_ALLOCATED || bs == base) {
59
return ret;
60
}
61
62
@@ -XXX,XX +XXX,XX @@ bdrv_co_common_block_status_above(BlockDriverState *bs,
63
assert(*pnum <= bytes);
64
bytes = *pnum;
65
66
- for (p = bdrv_filter_or_cow_bs(bs); p != base;
67
+ for (p = bdrv_filter_or_cow_bs(bs); include_base || p != base;
68
p = bdrv_filter_or_cow_bs(p))
69
{
70
ret = bdrv_co_block_status(p, want_zero, offset, bytes, pnum, map,
71
@@ -XXX,XX +XXX,XX @@ bdrv_co_common_block_status_above(BlockDriverState *bs,
72
break;
73
}
74
75
+ if (p == base) {
76
+ assert(include_base);
77
+ break;
78
+ }
79
+
80
/*
81
* OK, [offset, offset + *pnum) region is unallocated on this layer,
82
* let's continue the diving.
83
@@ -XXX,XX +XXX,XX @@ int bdrv_block_status_above(BlockDriverState *bs, BlockDriverState *base,
84
int64_t offset, int64_t bytes, int64_t *pnum,
85
int64_t *map, BlockDriverState **file)
86
{
87
- return bdrv_common_block_status_above(bs, base, true, offset, bytes,
88
+ return bdrv_common_block_status_above(bs, base, false, true, offset, bytes,
89
pnum, map, file);
102
}
90
}
103
91
104
static int coroutine_fn bdrv_driver_preadv(BlockDriverState *bs,
92
@@ -XXX,XX +XXX,XX @@ int coroutine_fn bdrv_is_allocated(BlockDriverState *bs, int64_t offset,
105
@@ -XXX,XX +XXX,XX @@ static void bdrv_co_complete(BlockAIOCBCoroutine *acb)
106
static void bdrv_co_em_bh(void *opaque)
107
{
108
BlockAIOCBCoroutine *acb = opaque;
109
- BlockDriverState *bs = acb->common.bs;
110
- AioContext *ctx = bdrv_get_aio_context(bs);
111
112
assert(!acb->need_bh);
113
- aio_context_acquire(ctx);
114
bdrv_co_complete(acb);
115
- aio_context_release(ctx);
116
}
117
118
static void bdrv_co_maybe_schedule_bh(BlockAIOCBCoroutine *acb)
119
diff --git a/block/iscsi.c b/block/iscsi.c
120
index XXXXXXX..XXXXXXX 100644
121
--- a/block/iscsi.c
122
+++ b/block/iscsi.c
123
@@ -XXX,XX +XXX,XX @@ static void
124
iscsi_bh_cb(void *p)
125
{
126
IscsiAIOCB *acb = p;
127
- AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
128
129
qemu_bh_delete(acb->bh);
130
131
g_free(acb->buf);
132
acb->buf = NULL;
133
134
- aio_context_acquire(ctx);
135
acb->common.cb(acb->common.opaque, acb->status);
136
- aio_context_release(ctx);
137
138
if (acb->task != NULL) {
139
scsi_free_scsi_task(acb->task);
140
diff --git a/block/linux-aio.c b/block/linux-aio.c
141
index XXXXXXX..XXXXXXX 100644
142
--- a/block/linux-aio.c
143
+++ b/block/linux-aio.c
144
@@ -XXX,XX +XXX,XX @@ static inline ssize_t io_event_ret(struct io_event *ev)
145
*/
146
static void qemu_laio_process_completion(struct qemu_laiocb *laiocb)
147
{
148
- LinuxAioState *s = laiocb->ctx;
149
int ret;
93
int ret;
150
94
int64_t dummy;
151
ret = laiocb->ret;
95
152
@@ -XXX,XX +XXX,XX @@ static void qemu_laio_process_completion(struct qemu_laiocb *laiocb)
96
- ret = bdrv_common_block_status_above(bs, bdrv_filter_or_cow_bs(bs), false,
97
- offset, bytes, pnum ? pnum : &dummy,
98
- NULL, NULL);
99
+ ret = bdrv_common_block_status_above(bs, bs, true, false, offset,
100
+ bytes, pnum ? pnum : &dummy, NULL,
101
+ NULL);
102
if (ret < 0) {
103
return ret;
153
}
104
}
154
155
laiocb->ret = ret;
156
- aio_context_acquire(s->aio_context);
157
if (laiocb->co) {
158
/* If the coroutine is already entered it must be in ioq_submit() and
159
* will notice laio->ret has been filled in when it eventually runs
160
@@ -XXX,XX +XXX,XX @@ static void qemu_laio_process_completion(struct qemu_laiocb *laiocb)
161
* that!
162
*/
163
if (!qemu_coroutine_entered(laiocb->co)) {
164
- qemu_coroutine_enter(laiocb->co);
165
+ aio_co_wake(laiocb->co);
166
}
167
} else {
168
laiocb->common.cb(laiocb->common.opaque, ret);
169
qemu_aio_unref(laiocb);
170
}
171
- aio_context_release(s->aio_context);
172
}
173
174
/**
175
diff --git a/block/mirror.c b/block/mirror.c
176
index XXXXXXX..XXXXXXX 100644
177
--- a/block/mirror.c
178
+++ b/block/mirror.c
179
@@ -XXX,XX +XXX,XX @@ static void mirror_write_complete(void *opaque, int ret)
180
{
181
MirrorOp *op = opaque;
182
MirrorBlockJob *s = op->s;
183
+
184
+ aio_context_acquire(blk_get_aio_context(s->common.blk));
185
if (ret < 0) {
186
BlockErrorAction action;
187
188
@@ -XXX,XX +XXX,XX @@ static void mirror_write_complete(void *opaque, int ret)
189
}
190
}
191
mirror_iteration_done(op, ret);
192
+ aio_context_release(blk_get_aio_context(s->common.blk));
193
}
194
195
static void mirror_read_complete(void *opaque, int ret)
196
{
197
MirrorOp *op = opaque;
198
MirrorBlockJob *s = op->s;
199
+
200
+ aio_context_acquire(blk_get_aio_context(s->common.blk));
201
if (ret < 0) {
202
BlockErrorAction action;
203
204
@@ -XXX,XX +XXX,XX @@ static void mirror_read_complete(void *opaque, int ret)
205
}
206
207
mirror_iteration_done(op, ret);
208
- return;
209
+ } else {
210
+ blk_aio_pwritev(s->target, op->sector_num * BDRV_SECTOR_SIZE, &op->qiov,
211
+ 0, mirror_write_complete, op);
212
}
213
- blk_aio_pwritev(s->target, op->sector_num * BDRV_SECTOR_SIZE, &op->qiov,
214
- 0, mirror_write_complete, op);
215
+ aio_context_release(blk_get_aio_context(s->common.blk));
216
}
217
218
static inline void mirror_clip_sectors(MirrorBlockJob *s,
219
diff --git a/block/null.c b/block/null.c
220
index XXXXXXX..XXXXXXX 100644
221
--- a/block/null.c
222
+++ b/block/null.c
223
@@ -XXX,XX +XXX,XX @@ static const AIOCBInfo null_aiocb_info = {
224
static void null_bh_cb(void *opaque)
225
{
226
NullAIOCB *acb = opaque;
227
- AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
228
-
229
- aio_context_acquire(ctx);
230
acb->common.cb(acb->common.opaque, 0);
231
- aio_context_release(ctx);
232
qemu_aio_unref(acb);
233
}
234
235
static void null_timer_cb(void *opaque)
236
{
237
NullAIOCB *acb = opaque;
238
- AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
239
-
240
- aio_context_acquire(ctx);
241
acb->common.cb(acb->common.opaque, 0);
242
- aio_context_release(ctx);
243
timer_deinit(&acb->timer);
244
qemu_aio_unref(acb);
245
}
246
diff --git a/block/qed-cluster.c b/block/qed-cluster.c
247
index XXXXXXX..XXXXXXX 100644
248
--- a/block/qed-cluster.c
249
+++ b/block/qed-cluster.c
250
@@ -XXX,XX +XXX,XX @@ static void qed_find_cluster_cb(void *opaque, int ret)
251
unsigned int index;
252
unsigned int n;
253
254
+ qed_acquire(s);
255
if (ret) {
256
goto out;
257
}
258
@@ -XXX,XX +XXX,XX @@ static void qed_find_cluster_cb(void *opaque, int ret)
259
260
out:
261
find_cluster_cb->cb(find_cluster_cb->opaque, ret, offset, len);
262
+ qed_release(s);
263
g_free(find_cluster_cb);
264
}
265
266
diff --git a/block/qed-table.c b/block/qed-table.c
267
index XXXXXXX..XXXXXXX 100644
268
--- a/block/qed-table.c
269
+++ b/block/qed-table.c
270
@@ -XXX,XX +XXX,XX @@ static void qed_read_table_cb(void *opaque, int ret)
271
{
272
QEDReadTableCB *read_table_cb = opaque;
273
QEDTable *table = read_table_cb->table;
274
+ BDRVQEDState *s = read_table_cb->s;
275
int noffsets = read_table_cb->qiov.size / sizeof(uint64_t);
276
int i;
277
278
@@ -XXX,XX +XXX,XX @@ static void qed_read_table_cb(void *opaque, int ret)
279
}
280
281
/* Byteswap offsets */
282
+ qed_acquire(s);
283
for (i = 0; i < noffsets; i++) {
284
table->offsets[i] = le64_to_cpu(table->offsets[i]);
285
}
286
+ qed_release(s);
287
288
out:
289
/* Completion */
290
- trace_qed_read_table_cb(read_table_cb->s, read_table_cb->table, ret);
291
+ trace_qed_read_table_cb(s, read_table_cb->table, ret);
292
gencb_complete(&read_table_cb->gencb, ret);
293
}
294
295
@@ -XXX,XX +XXX,XX @@ typedef struct {
296
static void qed_write_table_cb(void *opaque, int ret)
297
{
298
QEDWriteTableCB *write_table_cb = opaque;
299
+ BDRVQEDState *s = write_table_cb->s;
300
301
- trace_qed_write_table_cb(write_table_cb->s,
302
+ trace_qed_write_table_cb(s,
303
write_table_cb->orig_table,
304
write_table_cb->flush,
305
ret);
306
@@ -XXX,XX +XXX,XX @@ static void qed_write_table_cb(void *opaque, int ret)
307
if (write_table_cb->flush) {
308
/* We still need to flush first */
309
write_table_cb->flush = false;
310
+ qed_acquire(s);
311
bdrv_aio_flush(write_table_cb->s->bs, qed_write_table_cb,
312
write_table_cb);
313
+ qed_release(s);
314
return;
315
}
316
317
@@ -XXX,XX +XXX,XX @@ static void qed_read_l2_table_cb(void *opaque, int ret)
318
CachedL2Table *l2_table = request->l2_table;
319
uint64_t l2_offset = read_l2_table_cb->l2_offset;
320
321
+ qed_acquire(s);
322
if (ret) {
323
/* can't trust loaded L2 table anymore */
324
qed_unref_l2_cache_entry(l2_table);
325
@@ -XXX,XX +XXX,XX @@ static void qed_read_l2_table_cb(void *opaque, int ret)
326
request->l2_table = qed_find_l2_cache_entry(&s->l2_cache, l2_offset);
327
assert(request->l2_table != NULL);
328
}
329
+ qed_release(s);
330
331
gencb_complete(&read_l2_table_cb->gencb, ret);
332
}
333
diff --git a/block/qed.c b/block/qed.c
334
index XXXXXXX..XXXXXXX 100644
335
--- a/block/qed.c
336
+++ b/block/qed.c
337
@@ -XXX,XX +XXX,XX @@ static void qed_is_allocated_cb(void *opaque, int ret, uint64_t offset, size_t l
338
}
339
340
if (cb->co) {
341
- qemu_coroutine_enter(cb->co);
342
+ aio_co_wake(cb->co);
343
}
344
}
345
346
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn qed_co_pwrite_zeroes_cb(void *opaque, int ret)
347
cb->done = true;
348
cb->ret = ret;
349
if (cb->co) {
350
- qemu_coroutine_enter(cb->co);
351
+ aio_co_wake(cb->co);
352
}
353
}
354
355
diff --git a/block/rbd.c b/block/rbd.c
356
index XXXXXXX..XXXXXXX 100644
357
--- a/block/rbd.c
358
+++ b/block/rbd.c
359
@@ -XXX,XX +XXX,XX @@ shutdown:
360
static void qemu_rbd_complete_aio(RADOSCB *rcb)
361
{
362
RBDAIOCB *acb = rcb->acb;
363
- AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
364
int64_t r;
365
366
r = rcb->ret;
367
@@ -XXX,XX +XXX,XX @@ static void qemu_rbd_complete_aio(RADOSCB *rcb)
368
qemu_iovec_from_buf(acb->qiov, 0, acb->bounce, acb->qiov->size);
369
}
370
qemu_vfree(acb->bounce);
371
-
372
- aio_context_acquire(ctx);
373
acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
374
- aio_context_release(ctx);
375
376
qemu_aio_unref(acb);
377
}
378
diff --git a/block/win32-aio.c b/block/win32-aio.c
379
index XXXXXXX..XXXXXXX 100644
380
--- a/block/win32-aio.c
381
+++ b/block/win32-aio.c
382
@@ -XXX,XX +XXX,XX @@ static void win32_aio_process_completion(QEMUWin32AIOState *s,
383
qemu_vfree(waiocb->buf);
384
}
385
386
-
387
- aio_context_acquire(s->aio_ctx);
388
waiocb->common.cb(waiocb->common.opaque, ret);
389
- aio_context_release(s->aio_ctx);
390
qemu_aio_unref(waiocb);
391
}
392
393
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
394
index XXXXXXX..XXXXXXX 100644
395
--- a/hw/block/virtio-blk.c
396
+++ b/hw/block/virtio-blk.c
397
@@ -XXX,XX +XXX,XX @@ static int virtio_blk_handle_rw_error(VirtIOBlockReq *req, int error,
398
static void virtio_blk_rw_complete(void *opaque, int ret)
399
{
400
VirtIOBlockReq *next = opaque;
401
+ VirtIOBlock *s = next->dev;
402
403
+ aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
404
while (next) {
405
VirtIOBlockReq *req = next;
406
next = req->mr_next;
407
@@ -XXX,XX +XXX,XX @@ static void virtio_blk_rw_complete(void *opaque, int ret)
408
block_acct_done(blk_get_stats(req->dev->blk), &req->acct);
409
virtio_blk_free_request(req);
410
}
411
+ aio_context_release(blk_get_aio_context(s->conf.conf.blk));
412
}
413
414
static void virtio_blk_flush_complete(void *opaque, int ret)
415
{
416
VirtIOBlockReq *req = opaque;
417
+ VirtIOBlock *s = req->dev;
418
419
+ aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
420
if (ret) {
421
if (virtio_blk_handle_rw_error(req, -ret, 0)) {
422
- return;
423
+ goto out;
424
}
425
}
426
427
virtio_blk_req_complete(req, VIRTIO_BLK_S_OK);
428
block_acct_done(blk_get_stats(req->dev->blk), &req->acct);
429
virtio_blk_free_request(req);
430
+
431
+out:
432
+ aio_context_release(blk_get_aio_context(s->conf.conf.blk));
433
}
434
435
#ifdef __linux__
436
@@ -XXX,XX +XXX,XX @@ static void virtio_blk_ioctl_complete(void *opaque, int status)
437
virtio_stl_p(vdev, &scsi->data_len, hdr->dxfer_len);
438
439
out:
440
+ aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
441
virtio_blk_req_complete(req, status);
442
virtio_blk_free_request(req);
443
+ aio_context_release(blk_get_aio_context(s->conf.conf.blk));
444
g_free(ioctl_req);
445
}
446
447
diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
448
index XXXXXXX..XXXXXXX 100644
449
--- a/hw/scsi/scsi-disk.c
450
+++ b/hw/scsi/scsi-disk.c
451
@@ -XXX,XX +XXX,XX @@ static void scsi_aio_complete(void *opaque, int ret)
452
453
assert(r->req.aiocb != NULL);
454
r->req.aiocb = NULL;
455
+ aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
456
         if (scsi_disk_req_check_error(r, ret, true)) {
             goto done;
         }
@@ -XXX,XX +XXX,XX @@ static void scsi_aio_complete(void *opaque, int ret)
     scsi_req_complete(&r->req, GOOD);
 
 done:
+    aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
     scsi_req_unref(&r->req);
 }
 
@@ -XXX,XX +XXX,XX @@ static void scsi_dma_complete(void *opaque, int ret)
     assert(r->req.aiocb != NULL);
     r->req.aiocb = NULL;
 
+    aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
     if (ret < 0) {
         block_acct_failed(blk_get_stats(s->qdev.conf.blk), &r->acct);
     } else {
         block_acct_done(blk_get_stats(s->qdev.conf.blk), &r->acct);
     }
     scsi_dma_complete_noio(r, ret);
+    aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
 }
 
 static void scsi_read_complete(void * opaque, int ret)
@@ -XXX,XX +XXX,XX @@ static void scsi_read_complete(void * opaque, int ret)
 
     assert(r->req.aiocb != NULL);
     r->req.aiocb = NULL;
+    aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
     if (scsi_disk_req_check_error(r, ret, true)) {
         goto done;
     }
@@ -XXX,XX +XXX,XX @@ static void scsi_read_complete(void * opaque, int ret)
 
 done:
     scsi_req_unref(&r->req);
+    aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
 }
 
 /* Actually issue a read to the block device.  */
@@ -XXX,XX +XXX,XX @@ static void scsi_do_read_cb(void *opaque, int ret)
     assert (r->req.aiocb != NULL);
     r->req.aiocb = NULL;
 
+    aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
     if (ret < 0) {
         block_acct_failed(blk_get_stats(s->qdev.conf.blk), &r->acct);
     } else {
         block_acct_done(blk_get_stats(s->qdev.conf.blk), &r->acct);
     }
     scsi_do_read(opaque, ret);
+    aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
 }
 
 /* Read more data from scsi device into buffer.  */
@@ -XXX,XX +XXX,XX @@ static void scsi_write_complete(void * opaque, int ret)
     assert (r->req.aiocb != NULL);
     r->req.aiocb = NULL;
 
+    aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
     if (ret < 0) {
         block_acct_failed(blk_get_stats(s->qdev.conf.blk), &r->acct);
     } else {
         block_acct_done(blk_get_stats(s->qdev.conf.blk), &r->acct);
     }
     scsi_write_complete_noio(r, ret);
+    aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
 }
 
 static void scsi_write_data(SCSIRequest *req)
@@ -XXX,XX +XXX,XX @@ static void scsi_unmap_complete(void *opaque, int ret)
 {
     UnmapCBData *data = opaque;
     SCSIDiskReq *r = data->r;
+    SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
 
     assert(r->req.aiocb != NULL);
     r->req.aiocb = NULL;
 
+    aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
     scsi_unmap_complete_noio(data, ret);
+    aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
 }
 
 static void scsi_disk_emulate_unmap(SCSIDiskReq *r, uint8_t *inbuf)
@@ -XXX,XX +XXX,XX @@ static void scsi_write_same_complete(void *opaque, int ret)
 
     assert(r->req.aiocb != NULL);
     r->req.aiocb = NULL;
+    aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
     if (scsi_disk_req_check_error(r, ret, true)) {
         goto done;
     }
@@ -XXX,XX +XXX,XX @@ done:
     scsi_req_unref(&r->req);
     qemu_vfree(data->iov.iov_base);
     g_free(data);
+    aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
 }
 
 static void scsi_disk_emulate_write_same(SCSIDiskReq *r, uint8_t *inbuf)
diff --git a/hw/scsi/scsi-generic.c b/hw/scsi/scsi-generic.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/scsi/scsi-generic.c
+++ b/hw/scsi/scsi-generic.c
@@ -XXX,XX +XXX,XX @@ done:
 static void scsi_command_complete(void *opaque, int ret)
 {
     SCSIGenericReq *r = (SCSIGenericReq *)opaque;
+    SCSIDevice *s = r->req.dev;
 
     assert(r->req.aiocb != NULL);
     r->req.aiocb = NULL;
+
+    aio_context_acquire(blk_get_aio_context(s->conf.blk));
     scsi_command_complete_noio(r, ret);
+    aio_context_release(blk_get_aio_context(s->conf.blk));
 }
 
 static int execute_command(BlockBackend *blk,
@@ -XXX,XX +XXX,XX @@ static void scsi_read_complete(void * opaque, int ret)
     assert(r->req.aiocb != NULL);
     r->req.aiocb = NULL;
 
+    aio_context_acquire(blk_get_aio_context(s->conf.blk));
+
     if (ret || r->req.io_canceled) {
         scsi_command_complete_noio(r, ret);
-        return;
+        goto done;
     }
 
     len = r->io_header.dxfer_len - r->io_header.resid;
@@ -XXX,XX +XXX,XX @@ static void scsi_read_complete(void * opaque, int ret)
     r->len = -1;
     if (len == 0) {
         scsi_command_complete_noio(r, 0);
-        return;
+        goto done;
     }
 
     /* Snoop READ CAPACITY output to set the blocksize.  */
@@ -XXX,XX +XXX,XX @@ static void scsi_read_complete(void * opaque, int ret)
     }
     scsi_req_data(&r->req, len);
     scsi_req_unref(&r->req);
+
+done:
+    aio_context_release(blk_get_aio_context(s->conf.blk));
 }
 
 /* Read more data from scsi device into buffer.  */
@@ -XXX,XX +XXX,XX @@ static void scsi_write_complete(void * opaque, int ret)
     assert(r->req.aiocb != NULL);
     r->req.aiocb = NULL;
 
+    aio_context_acquire(blk_get_aio_context(s->conf.blk));
+
    if (ret || r->req.io_canceled) {
        scsi_command_complete_noio(r, ret);
-        return;
+        goto done;
    }
 
    if (r->req.cmd.buf[0] == MODE_SELECT && r->req.cmd.buf[4] == 12 &&
@@ -XXX,XX +XXX,XX @@ static void scsi_write_complete(void * opaque, int ret)
    }
 
    scsi_command_complete_noio(r, ret);
+
+done:
+    aio_context_release(blk_get_aio_context(s->conf.blk));
 }
 
 /* Write data to a scsi device.  Returns nonzero on failure.
diff --git a/util/thread-pool.c b/util/thread-pool.c
index XXXXXXX..XXXXXXX 100644
--- a/util/thread-pool.c
+++ b/util/thread-pool.c
@@ -XXX,XX +XXX,XX @@ restart:
          */
         qemu_bh_schedule(pool->completion_bh);
 
+        aio_context_release(pool->ctx);
         elem->common.cb(elem->common.opaque, elem->ret);
+        aio_context_acquire(pool->ctx);
         qemu_aio_unref(elem);
         goto restart;
     } else {
@@ -XXX,XX +XXX,XX @@ static void thread_pool_co_cb(void *opaque, int ret)
     ThreadPoolCo *co = opaque;
 
     co->ret = ret;
-    qemu_coroutine_enter(co->co);
+    aio_co_wake(co->co);
 }
 
 int coroutine_fn thread_pool_submit_co(ThreadPool *pool, ThreadPoolFunc *func,
-- 
2.9.3
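
Every completion callback converted above follows the same shape: take the
BlockBackend's AioContext before touching accounting or request state, and
drop it before returning, because after this series the callback may fire
outside the main loop thread. A minimal sketch of that shape, assuming the
QEMU AioContext and accounting APIs as used in the hunks above; the request
type and function names are hypothetical stand-ins:

    #include "qemu/osdep.h"
    #include "sysemu/block-backend.h"
    #include "block/accounting.h"

    /* Hypothetical request type, for illustration only. */
    typedef struct MyReq {
        BlockBackend *blk;
        BlockAcctCookie acct;
    } MyReq;

    static void my_complete_noio(MyReq *r, int ret);  /* hypothetical */

    static void my_complete(void *opaque, int ret)
    {
        MyReq *r = opaque;
        AioContext *ctx = blk_get_aio_context(r->blk);

        aio_context_acquire(ctx);
        if (ret < 0) {
            block_acct_failed(blk_get_stats(r->blk), &r->acct);
        } else {
            block_acct_done(blk_get_stats(r->blk), &r->acct);
        }
        my_complete_noio(r, ret);   /* may complete or resubmit the request */
        aio_context_release(ctx);   /* dropped before returning, as above */
    }

Note the matching inversion in util/thread-pool.c: the pool already holds its
context, so it releases it around elem->common.cb() precisely so that
callbacks written like this sketch can acquire it without deadlocking.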
From: Paolo Bonzini <pbonzini@redhat.com>

All that CoQueue needs in order to become thread-safe is help
from an external mutex.  Add this to the API.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20170213181244.16297-6-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/qemu/coroutine.h   |  8 +++++---
 block/backup.c             |  2 +-
 block/io.c                 |  4 ++--
 block/nbd-client.c         |  2 +-
 block/qcow2-cluster.c      |  4 +---
 block/sheepdog.c           |  2 +-
 block/throttle-groups.c    |  2 +-
 hw/9pfs/9p.c               |  2 +-
 util/qemu-coroutine-lock.c | 24 +++++++++++++++++++++---
 9 files changed, 34 insertions(+), 16 deletions(-)

diff --git a/include/qemu/coroutine.h b/include/qemu/coroutine.h
index XXXXXXX..XXXXXXX 100644
--- a/include/qemu/coroutine.h
+++ b/include/qemu/coroutine.h
@@ -XXX,XX +XXX,XX @@ void coroutine_fn qemu_co_mutex_unlock(CoMutex *mutex);
 
 /**
  * CoQueues are a mechanism to queue coroutines in order to continue executing
- * them later.
+ * them later. They are similar to condition variables, but they need help
+ * from an external mutex in order to maintain thread-safety.
  */
 typedef struct CoQueue {
     QSIMPLEQ_HEAD(, Coroutine) entries;
@@ -XXX,XX +XXX,XX @@ void qemu_co_queue_init(CoQueue *queue);
 
 /**
  * Adds the current coroutine to the CoQueue and transfers control to the
- * caller of the coroutine.
+ * caller of the coroutine.  The mutex is unlocked during the wait and
+ * locked again afterwards.
  */
-void coroutine_fn qemu_co_queue_wait(CoQueue *queue);
+void coroutine_fn qemu_co_queue_wait(CoQueue *queue, CoMutex *mutex);
 
 /**
  * Restarts the next coroutine in the CoQueue and removes it from the queue.
diff --git a/block/backup.c b/block/backup.c
index XXXXXXX..XXXXXXX 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn wait_for_overlapping_requests(BackupBlockJob *job,
         retry = false;
         QLIST_FOREACH(req, &job->inflight_reqs, list) {
             if (end > req->start && start < req->end) {
-                qemu_co_queue_wait(&req->wait_queue);
+                qemu_co_queue_wait(&req->wait_queue, NULL);
                 retry = true;
                 break;
             }
diff --git a/block/io.c b/block/io.c
index XXXXXXX..XXXXXXX 100644
--- a/block/io.c
+++ b/block/io.c
@@ -XXX,XX +XXX,XX @@ static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
                      * (instead of producing a deadlock in the former case). */
                     if (!req->waiting_for) {
                         self->waiting_for = req;
-                        qemu_co_queue_wait(&req->wait_queue);
+                        qemu_co_queue_wait(&req->wait_queue, NULL);
                         self->waiting_for = NULL;
                         retry = true;
                         waited = true;
@@ -XXX,XX +XXX,XX @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
 
     /* Wait until any previous flushes are completed */
     while (bs->active_flush_req) {
-        qemu_co_queue_wait(&bs->flush_queue);
+        qemu_co_queue_wait(&bs->flush_queue, NULL);
     }
 
     bs->active_flush_req = true;
diff --git a/block/nbd-client.c b/block/nbd-client.c
index XXXXXXX..XXXXXXX 100644
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -XXX,XX +XXX,XX @@ static void nbd_coroutine_start(NBDClientSession *s,
     /* Poor man semaphore.  The free_sema is locked when no other request
      * can be accepted, and unlocked after receiving one reply.  */
     if (s->in_flight == MAX_NBD_REQUESTS) {
-        qemu_co_queue_wait(&s->free_sema);
+        qemu_co_queue_wait(&s->free_sema, NULL);
         assert(s->in_flight < MAX_NBD_REQUESTS);
     }
     s->in_flight++;
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -XXX,XX +XXX,XX @@ static int handle_dependencies(BlockDriverState *bs, uint64_t guest_offset,
             if (bytes == 0) {
                 /* Wait for the dependency to complete. We need to recheck
                  * the free/allocated clusters when we continue. */
-                qemu_co_mutex_unlock(&s->lock);
-                qemu_co_queue_wait(&old_alloc->dependent_requests);
-                qemu_co_mutex_lock(&s->lock);
+                qemu_co_queue_wait(&old_alloc->dependent_requests, &s->lock);
                 return -EAGAIN;
             }
         }
diff --git a/block/sheepdog.c b/block/sheepdog.c
index XXXXXXX..XXXXXXX 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -XXX,XX +XXX,XX @@ static void wait_for_overlapping_aiocb(BDRVSheepdogState *s, SheepdogAIOCB *acb)
 retry:
     QLIST_FOREACH(cb, &s->inflight_aiocb_head, aiocb_siblings) {
         if (AIOCBOverlapping(acb, cb)) {
-            qemu_co_queue_wait(&s->overlapping_queue);
+            qemu_co_queue_wait(&s->overlapping_queue, NULL);
             goto retry;
         }
     }
diff --git a/block/throttle-groups.c b/block/throttle-groups.c
index XXXXXXX..XXXXXXX 100644
--- a/block/throttle-groups.c
+++ b/block/throttle-groups.c
@@ -XXX,XX +XXX,XX @@ void coroutine_fn throttle_group_co_io_limits_intercept(BlockBackend *blk,
     if (must_wait || blkp->pending_reqs[is_write]) {
         blkp->pending_reqs[is_write]++;
         qemu_mutex_unlock(&tg->lock);
-        qemu_co_queue_wait(&blkp->throttled_reqs[is_write]);
+        qemu_co_queue_wait(&blkp->throttled_reqs[is_write], NULL);
         qemu_mutex_lock(&tg->lock);
         blkp->pending_reqs[is_write]--;
     }
diff --git a/hw/9pfs/9p.c b/hw/9pfs/9p.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/9pfs/9p.c
+++ b/hw/9pfs/9p.c
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn v9fs_flush(void *opaque)
     /*
      * Wait for pdu to complete.
      */
-    qemu_co_queue_wait(&cancel_pdu->complete);
+    qemu_co_queue_wait(&cancel_pdu->complete, NULL);
     cancel_pdu->cancelled = 0;
     pdu_free(cancel_pdu);
 }
diff --git a/util/qemu-coroutine-lock.c b/util/qemu-coroutine-lock.c
index XXXXXXX..XXXXXXX 100644
--- a/util/qemu-coroutine-lock.c
+++ b/util/qemu-coroutine-lock.c
@@ -XXX,XX +XXX,XX @@ void qemu_co_queue_init(CoQueue *queue)
     QSIMPLEQ_INIT(&queue->entries);
 }
 
-void coroutine_fn qemu_co_queue_wait(CoQueue *queue)
+void coroutine_fn qemu_co_queue_wait(CoQueue *queue, CoMutex *mutex)
 {
     Coroutine *self = qemu_coroutine_self();
     QSIMPLEQ_INSERT_TAIL(&queue->entries, self, co_queue_next);
+
+    if (mutex) {
+        qemu_co_mutex_unlock(mutex);
+    }
+
+    /* There is no race condition here.  Other threads will call
+     * aio_co_schedule on our AioContext, which can reenter this
+     * coroutine but only after this yield and after the main loop
+     * has gone through the next iteration.
+     */
     qemu_coroutine_yield();
     assert(qemu_in_coroutine());
+
+    /* TODO: OSv implements wait morphing here, where the wakeup
+     * primitive automatically places the woken coroutine on the
+     * mutex's queue.  This avoids the thundering herd effect.
+     */
+    if (mutex) {
+        qemu_co_mutex_lock(mutex);
+    }
 }
 
 /**
@@ -XXX,XX +XXX,XX @@ void qemu_co_rwlock_rdlock(CoRwlock *lock)
     Coroutine *self = qemu_coroutine_self();
 
     while (lock->writer) {
-        qemu_co_queue_wait(&lock->queue);
+        qemu_co_queue_wait(&lock->queue, NULL);
     }
     lock->reader++;
     self->locks_held++;
@@ -XXX,XX +XXX,XX @@ void qemu_co_rwlock_wrlock(CoRwlock *lock)
     Coroutine *self = qemu_coroutine_self();
 
     while (lock->writer || lock->reader) {
-        qemu_co_queue_wait(&lock->queue);
+        qemu_co_queue_wait(&lock->queue, NULL);
     }
     lock->writer = true;
     self->locks_held++;
-- 
2.9.3

From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

We are going to reuse bdrv_common_block_status_above in
bdrv_is_allocated_above. bdrv_is_allocated_above may be called with
include_base == false and still bs == base (for ex. from img_rebase()).

So, support this corner case.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20200924194003.22080-4-vsementsov@virtuozzo.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/io.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/block/io.c b/block/io.c
index XXXXXXX..XXXXXXX 100644
--- a/block/io.c
+++ b/block/io.c
@@ -XXX,XX +XXX,XX @@ bdrv_co_common_block_status_above(BlockDriverState *bs,
     BlockDriverState *p;
     int64_t eof = 0;
 
-    assert(include_base || bs != base);
     assert(!include_base || base); /* Can't include NULL base */
 
+    if (!include_base && bs == base) {
+        *pnum = bytes;
+        return 0;
+    }
+
     ret = bdrv_co_block_status(bs, want_zero, offset, bytes, pnum, map, file);
     if (ret < 0 || *pnum == 0 || ret & BDRV_BLOCK_ALLOCATED || bs == base) {
         return ret;
-- 
2.26.2
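
The two-argument qemu_co_queue_wait() introduced in the CoQueue patch above
behaves like a condition-variable wait: the mutex is dropped for the duration
of the sleep and re-taken before the function returns. A minimal sketch of
the intended usage, assuming a hypothetical MyState with a CoMutex/CoQueue
pair and an is_ready flag:

    #include "qemu/osdep.h"
    #include "qemu/coroutine.h"

    /* Hypothetical state protected by a CoMutex, for illustration only. */
    typedef struct MyState {
        CoMutex lock;
        CoQueue queue;
        bool is_ready;
    } MyState;

    /* Waiter: sleep until the predicate holds, with no window where the
     * state can change between the check and the sleep. */
    static void coroutine_fn wait_until_ready(MyState *s)
    {
        qemu_co_mutex_lock(&s->lock);
        while (!s->is_ready) {
            /* Unlocks s->lock, yields, re-locks s->lock on wakeup. */
            qemu_co_queue_wait(&s->queue, &s->lock);
        }
        /* ... use the protected state; s->lock is held here ... */
        qemu_co_mutex_unlock(&s->lock);
    }

    /* Waker: update the state, then restart one queued coroutine. */
    static void coroutine_fn make_ready(MyState *s)
    {
        qemu_co_mutex_lock(&s->lock);
        s->is_ready = true;
        qemu_co_queue_next(&s->queue);
        qemu_co_mutex_unlock(&s->lock);
    }

Passing NULL for the mutex, as the bulk of the mechanical conversions above
do, keeps the old behaviour for callers that still rely on the AioContext
lock instead.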
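On the bs == base corner case handled in the block-status patch above: taken
together with the bdrv_is_allocated_above() rewrite in the next patch, a
caller such as img_rebase() may pass the same node as both top and base. A
hedged illustration of the expected outcome (the wrapper function is
hypothetical; the bdrv_is_allocated_above() signature is the one visible in
the diffs):

    /* Illustration only: query the empty half-open chain (base, top]
     * where top == base. */
    static int probe_empty_chain(BlockDriverState *bs, int64_t offset,
                                 int64_t bytes)
    {
        int64_t pnum;
        int ret = bdrv_is_allocated_above(bs, bs, false, offset, bytes, &pnum);

        /* Expected with the fix: ret == 0 and pnum == bytes.  The chain
         * between bs and itself contains no nodes, so nothing in the range
         * can be allocated there; without this patch, routing the query
         * through bdrv_common_block_status_above would have tripped the
         * removed assertion. */
        return ret;
    }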
From: Paolo Bonzini <pbonzini@redhat.com>

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
Message-id: 20170213135235.12274-13-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/qed.h                 |  3 +++
 block/curl.c                |  2 ++
 block/io.c                  |  5 +++++
 block/iscsi.c               |  8 ++++++--
 block/null.c                |  4 ++++
 block/qed.c                 | 12 ++++++++++++
 block/throttle-groups.c     |  2 ++
 util/aio-posix.c            |  2 --
 util/aio-win32.c            |  2 --
 util/qemu-coroutine-sleep.c |  2 +-
 10 files changed, 35 insertions(+), 7 deletions(-)

diff --git a/block/qed.h b/block/qed.h
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.h
+++ b/block/qed.h
@@ -XXX,XX +XXX,XX @@ enum {
  */
 typedef void QEDFindClusterFunc(void *opaque, int ret, uint64_t offset, size_t len);
 
+void qed_acquire(BDRVQEDState *s);
+void qed_release(BDRVQEDState *s);
+
 /**
  * Generic callback for chaining async callbacks
  */
diff --git a/block/curl.c b/block/curl.c
index XXXXXXX..XXXXXXX 100644
--- a/block/curl.c
+++ b/block/curl.c
@@ -XXX,XX +XXX,XX @@ static void curl_multi_timeout_do(void *arg)
         return;
     }
 
+    aio_context_acquire(s->aio_context);
     curl_multi_socket_action(s->multi, CURL_SOCKET_TIMEOUT, 0, &running);
 
     curl_multi_check_completion(s);
+    aio_context_release(s->aio_context);
 #else
     abort();
 #endif
diff --git a/block/io.c b/block/io.c
index XXXXXXX..XXXXXXX 100644
--- a/block/io.c
+++ b/block/io.c
@@ -XXX,XX +XXX,XX @@ void bdrv_aio_cancel(BlockAIOCB *acb)
     if (acb->aiocb_info->get_aio_context) {
         aio_poll(acb->aiocb_info->get_aio_context(acb), true);
     } else if (acb->bs) {
+        /* qemu_aio_ref and qemu_aio_unref are not thread-safe, so
+         * assert that we're not using an I/O thread.  Thread-safe
+         * code should use bdrv_aio_cancel_async exclusively.
+         */
+        assert(bdrv_get_aio_context(acb->bs) == qemu_get_aio_context());
         aio_poll(bdrv_get_aio_context(acb->bs), true);
     } else {
         abort();
diff --git a/block/iscsi.c b/block/iscsi.c
index XXXXXXX..XXXXXXX 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -XXX,XX +XXX,XX @@ static void iscsi_retry_timer_expired(void *opaque)
     struct IscsiTask *iTask = opaque;
     iTask->complete = 1;
     if (iTask->co) {
-        qemu_coroutine_enter(iTask->co);
+        aio_co_wake(iTask->co);
     }
 }
 
@@ -XXX,XX +XXX,XX @@ static void iscsi_nop_timed_event(void *opaque)
 {
     IscsiLun *iscsilun = opaque;
 
+    aio_context_acquire(iscsilun->aio_context);
     if (iscsi_get_nops_in_flight(iscsilun->iscsi) >= MAX_NOP_FAILURES) {
         error_report("iSCSI: NOP timeout. Reconnecting...");
         iscsilun->request_timed_out = true;
     } else if (iscsi_nop_out_async(iscsilun->iscsi, NULL, NULL, 0, NULL) != 0) {
         error_report("iSCSI: failed to sent NOP-Out. Disabling NOP messages.");
-        return;
+        goto out;
     }
 
     timer_mod(iscsilun->nop_timer, qemu_clock_get_ms(QEMU_CLOCK_REALTIME) + NOP_INTERVAL);
     iscsi_set_events(iscsilun);
+
+out:
+    aio_context_release(iscsilun->aio_context);
 }
 
 static void iscsi_readcapacity_sync(IscsiLun *iscsilun, Error **errp)
diff --git a/block/null.c b/block/null.c
index XXXXXXX..XXXXXXX 100644
--- a/block/null.c
+++ b/block/null.c
@@ -XXX,XX +XXX,XX @@ static void null_bh_cb(void *opaque)
 static void null_timer_cb(void *opaque)
 {
     NullAIOCB *acb = opaque;
+    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
+
+    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, 0);
+    aio_context_release(ctx);
     timer_deinit(&acb->timer);
     qemu_aio_unref(acb);
 }
diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@ static void qed_need_check_timer_cb(void *opaque)
 
     trace_qed_need_check_timer_cb(s);
 
+    qed_acquire(s);
     qed_plug_allocating_write_reqs(s);
 
     /* Ensure writes are on disk before clearing flag */
     bdrv_aio_flush(s->bs->file->bs, qed_clear_need_check, s);
+    qed_release(s);
+}
+
+void qed_acquire(BDRVQEDState *s)
+{
+    aio_context_acquire(bdrv_get_aio_context(s->bs));
+}
+
+void qed_release(BDRVQEDState *s)
+{
+    aio_context_release(bdrv_get_aio_context(s->bs));
 }
 
 static void qed_start_need_check_timer(BDRVQEDState *s)
diff --git a/block/throttle-groups.c b/block/throttle-groups.c
index XXXXXXX..XXXXXXX 100644
--- a/block/throttle-groups.c
+++ b/block/throttle-groups.c
@@ -XXX,XX +XXX,XX @@ static void timer_cb(BlockBackend *blk, bool is_write)
     qemu_mutex_unlock(&tg->lock);
 
     /* Run the request that was waiting for this timer */
+    aio_context_acquire(blk_get_aio_context(blk));
     empty_queue = !qemu_co_enter_next(&blkp->throttled_reqs[is_write]);
+    aio_context_release(blk_get_aio_context(blk));
 
     /* If the request queue was empty then we have to take care of
      * scheduling the next one */
diff --git a/util/aio-posix.c b/util/aio-posix.c
index XXXXXXX..XXXXXXX 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -XXX,XX +XXX,XX @@ bool aio_dispatch(AioContext *ctx, bool dispatch_fds)
     }
 
     /* Run our timers */
-    aio_context_acquire(ctx);
     progress |= timerlistgroup_run_timers(&ctx->tlg);
-    aio_context_release(ctx);
 
     return progress;
 }
diff --git a/util/aio-win32.c b/util/aio-win32.c
index XXXXXXX..XXXXXXX 100644
--- a/util/aio-win32.c
+++ b/util/aio-win32.c
@@ -XXX,XX +XXX,XX @@ bool aio_poll(AioContext *ctx, bool blocking)
         progress |= aio_dispatch_handlers(ctx, event);
     } while (count > 0);
 
-    aio_context_acquire(ctx);
     progress |= timerlistgroup_run_timers(&ctx->tlg);
-    aio_context_release(ctx);
     return progress;
 }
 
diff --git a/util/qemu-coroutine-sleep.c b/util/qemu-coroutine-sleep.c
index XXXXXXX..XXXXXXX 100644
--- a/util/qemu-coroutine-sleep.c
+++ b/util/qemu-coroutine-sleep.c
@@ -XXX,XX +XXX,XX @@ static void co_sleep_cb(void *opaque)
 {
     CoSleepCB *sleep_cb = opaque;
 
-    qemu_coroutine_enter(sleep_cb->co);
+    aio_co_wake(sleep_cb->co);
 }
 
 void coroutine_fn co_aio_sleep_ns(AioContext *ctx, QEMUClockType type,
-- 
2.9.3

From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

bdrv_is_allocated_above wrongly handles short backing files: it reports
after-EOF space as UNALLOCATED which is wrong, as on read the data is
generated on the level of short backing file (if all overlays have
unallocated areas at that place).

Reusing bdrv_common_block_status_above fixes the issue and unifies code
path.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20200924194003.22080-5-vsementsov@virtuozzo.com
[Fix s/has/have/ as suggested by Eric Blake.  Fix s/area/areas/.
--Stefan]
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/io.c | 43 +++++--------------------------------------
 1 file changed, 5 insertions(+), 38 deletions(-)

diff --git a/block/io.c b/block/io.c
index XXXXXXX..XXXXXXX 100644
--- a/block/io.c
+++ b/block/io.c
@@ -XXX,XX +XXX,XX @@ int coroutine_fn bdrv_is_allocated(BlockDriverState *bs, int64_t offset,
  * at 'offset + *pnum' may return the same allocation status (in other
  * words, the result is not necessarily the maximum possible range);
  * but 'pnum' will only be 0 when end of file is reached.
- *
  */
 int bdrv_is_allocated_above(BlockDriverState *top,
                             BlockDriverState *base,
                             bool include_base, int64_t offset,
                             int64_t bytes, int64_t *pnum)
 {
-    BlockDriverState *intermediate;
-    int ret;
-    int64_t n = bytes;
-
-    assert(base || !include_base);
-
-    intermediate = top;
-    while (include_base || intermediate != base) {
-        int64_t pnum_inter;
-        int64_t size_inter;
-
-        assert(intermediate);
-        ret = bdrv_is_allocated(intermediate, offset, bytes, &pnum_inter);
-        if (ret < 0) {
-            return ret;
-        }
-        if (ret) {
-            *pnum = pnum_inter;
-            return 1;
-        }
-
-        size_inter = bdrv_getlength(intermediate);
-        if (size_inter < 0) {
-            return size_inter;
-        }
-        if (n > pnum_inter &&
-            (intermediate == top || offset + pnum_inter < size_inter)) {
-            n = pnum_inter;
-        }
-
-        if (intermediate == base) {
-            break;
-        }
-
-        intermediate = bdrv_filter_or_cow_bs(intermediate);
+    int ret = bdrv_common_block_status_above(top, base, include_base, false,
+                                             offset, bytes, pnum, NULL, NULL);
+    if (ret < 0) {
+        return ret;
     }
 
-    *pnum = n;
-    return 0;
+    return !!(ret & BDRV_BLOCK_ALLOCATED);
 }
-- 
2.26.2
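
With bdrv_is_allocated_above() above now delegating to
bdrv_common_block_status_above(), a caller reads the unified result as
sketched below (a hypothetical helper; the after-EOF remark restates the
behavioural fix from the commit message):

    /* Illustration only. */
    static void check_chain_allocation(BlockDriverState *top,
                                       BlockDriverState *base,
                                       int64_t offset, int64_t bytes)
    {
        int64_t pnum;
        int ret = bdrv_is_allocated_above(top, base, true, offset, bytes,
                                          &pnum);

        if (ret < 0) {
            /* error from the block layer */
        } else if (ret) {
            /* The first pnum bytes at offset are allocated somewhere in
             * the [base, top] chain.  Space past the EOF of a short
             * backing file now counts here, since reads of that space are
             * generated at that level. */
        } else {
            /* The first pnum bytes at offset are allocated nowhere in
             * the chain. */
        }
    }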
From: Paolo Bonzini <pbonzini@redhat.com>

As a small step towards the introduction of multiqueue, we want
coroutines to remain on the same AioContext that started them,
unless they are moved explicitly with e.g. aio_co_schedule.  This patch
avoids that coroutines switch AioContext when they use a CoMutex.
For now it does not make much of a difference, because the CoMutex
is not thread-safe and the AioContext itself is used to protect the
CoMutex from concurrent access.  However, this is going to change.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
Message-id: 20170213135235.12274-9-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 util/qemu-coroutine-lock.c | 5 ++---
 util/trace-events          | 1 -
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/util/qemu-coroutine-lock.c b/util/qemu-coroutine-lock.c
index XXXXXXX..XXXXXXX 100644
--- a/util/qemu-coroutine-lock.c
+++ b/util/qemu-coroutine-lock.c
@@ -XXX,XX +XXX,XX @@
 #include "qemu/coroutine.h"
 #include "qemu/coroutine_int.h"
 #include "qemu/queue.h"
+#include "block/aio.h"
 #include "trace.h"
 
 void qemu_co_queue_init(CoQueue *queue)
@@ -XXX,XX +XXX,XX @@ void qemu_co_queue_run_restart(Coroutine *co)
 
 static bool qemu_co_queue_do_restart(CoQueue *queue, bool single)
 {
-    Coroutine *self = qemu_coroutine_self();
     Coroutine *next;
 
     if (QSIMPLEQ_EMPTY(&queue->entries)) {
@@ -XXX,XX +XXX,XX @@ static bool qemu_co_queue_do_restart(CoQueue *queue, bool single)
 
     while ((next = QSIMPLEQ_FIRST(&queue->entries)) != NULL) {
         QSIMPLEQ_REMOVE_HEAD(&queue->entries, co_queue_next);
-        QSIMPLEQ_INSERT_TAIL(&self->co_queue_wakeup, next, co_queue_next);
-        trace_qemu_co_queue_next(next);
+        aio_co_wake(next);
         if (single) {
             break;
         }
diff --git a/util/trace-events b/util/trace-events
index XXXXXXX..XXXXXXX 100644
--- a/util/trace-events
+++ b/util/trace-events
@@ -XXX,XX +XXX,XX @@ qemu_coroutine_terminate(void *co) "self %p"
 
 # util/qemu-coroutine-lock.c
 qemu_co_queue_run_restart(void *co) "co %p"
-qemu_co_queue_next(void *nxt) "next %p"
 qemu_co_mutex_lock_entry(void *mutex, void *self) "mutex %p self %p"
 qemu_co_mutex_lock_return(void *mutex, void *self) "mutex %p self %p"
 qemu_co_mutex_unlock_entry(void *mutex, void *self) "mutex %p self %p"
-- 
2.9.3

From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

These cases are fixed by previous patches around block_status and
is_allocated.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20200924194003.22080-6-vsementsov@virtuozzo.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tests/qemu-iotests/274     | 20 +++++++++++
 tests/qemu-iotests/274.out | 68 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 88 insertions(+)

diff --git a/tests/qemu-iotests/274 b/tests/qemu-iotests/274
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/274
+++ b/tests/qemu-iotests/274
@@ -XXX,XX +XXX,XX @@ with iotests.FilePath('base') as base, \
     iotests.qemu_io_log('-c', 'read -P 1 0 %d' % size_short, mid)
     iotests.qemu_io_log('-c', 'read -P 0 %d %d' % (size_short, size_diff), mid)
 
+    iotests.log('=== Testing qemu-img commit (top -> base) ===')
+
+    create_chain()
+    iotests.qemu_img_log('commit', '-b', base, top)
+    iotests.img_info_log(base)
+    iotests.qemu_io_log('-c', 'read -P 1 0 %d' % size_short, base)
+    iotests.qemu_io_log('-c', 'read -P 0 %d %d' % (size_short, size_diff), base)
+
+    iotests.log('=== Testing QMP active commit (top -> base) ===')
+
+    create_chain()
+    with create_vm() as vm:
+        vm.launch()
+        vm.qmp_log('block-commit', device='top', base_node='base',
+                   job_id='job0', auto_dismiss=False)
+        vm.run_job('job0', wait=5)
+
+    iotests.img_info_log(mid)
+    iotests.qemu_io_log('-c', 'read -P 1 0 %d' % size_short, base)
+    iotests.qemu_io_log('-c', 'read -P 0 %d %d' % (size_short, size_diff), base)
 
     iotests.log('== Resize tests ==')
 
diff --git a/tests/qemu-iotests/274.out b/tests/qemu-iotests/274.out
index XXXXXXX..XXXXXXX 100644
--- a/tests/qemu-iotests/274.out
+++ b/tests/qemu-iotests/274.out
@@ -XXX,XX +XXX,XX @@ read 1048576/1048576 bytes at offset 0
 read 1048576/1048576 bytes at offset 1048576
 1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
+=== Testing qemu-img commit (top -> base) ===
+Formatting 'TEST_DIR/PID-base', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=2097152 lazy_refcounts=off refcount_bits=16
+
+Formatting 'TEST_DIR/PID-mid', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=1048576 backing_file=TEST_DIR/PID-base backing_fmt=qcow2 lazy_refcounts=off refcount_bits=16
+
+Formatting 'TEST_DIR/PID-top', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=2097152 backing_file=TEST_DIR/PID-mid backing_fmt=qcow2 lazy_refcounts=off refcount_bits=16
+
+wrote 2097152/2097152 bytes at offset 0
+2 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+Image committed.
+
+image: TEST_IMG
+file format: IMGFMT
+virtual size: 2 MiB (2097152 bytes)
+cluster_size: 65536
+Format specific information:
+    compat: 1.1
+    compression type: zlib
+    lazy refcounts: false
+    refcount bits: 16
+    corrupt: false
+    extended l2: false
+
+read 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+read 1048576/1048576 bytes at offset 1048576
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+=== Testing QMP active commit (top -> base) ===
+Formatting 'TEST_DIR/PID-base', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=2097152 lazy_refcounts=off refcount_bits=16
+
+Formatting 'TEST_DIR/PID-mid', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=1048576 backing_file=TEST_DIR/PID-base backing_fmt=qcow2 lazy_refcounts=off refcount_bits=16
+
+Formatting 'TEST_DIR/PID-top', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=2097152 backing_file=TEST_DIR/PID-mid backing_fmt=qcow2 lazy_refcounts=off refcount_bits=16
+
+wrote 2097152/2097152 bytes at offset 0
+2 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+{"execute": "block-commit", "arguments": {"auto-dismiss": false, "base-node": "base", "device": "top", "job-id": "job0"}}
+{"return": {}}
+{"execute": "job-complete", "arguments": {"id": "job0"}}
+{"return": {}}
+{"data": {"device": "job0", "len": 1048576, "offset": 1048576, "speed": 0, "type": "commit"}, "event": "BLOCK_JOB_READY", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
+{"data": {"device": "job0", "len": 1048576, "offset": 1048576, "speed": 0, "type": "commit"}, "event": "BLOCK_JOB_COMPLETED", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
+{"execute": "job-dismiss", "arguments": {"id": "job0"}}
+{"return": {}}
+image: TEST_IMG
+file format: IMGFMT
+virtual size: 1 MiB (1048576 bytes)
+cluster_size: 65536
+backing file: TEST_DIR/PID-base
+backing file format: IMGFMT
+Format specific information:
+    compat: 1.1
+    compression type: zlib
+    lazy refcounts: false
+    refcount bits: 16
+    corrupt: false
+    extended l2: false
+
+read 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+read 1048576/1048576 bytes at offset 1048576
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
 == Resize tests ==
 === preallocation=off ===
 Formatting 'TEST_DIR/PID-base', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=6442450944 lazy_refcounts=off refcount_bits=16
-- 
2.26.2
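
The aio_co_wake() substitution in the coroutine-lock patch above is the same
conversion applied earlier in the series to thread_pool_co_cb() and
co_sleep_cb(): rather than entering the coroutine directly in whatever
context the caller happens to run in, the coroutine is scheduled back onto
the AioContext it was running on. A minimal sketch of a wakeup callback
written in this style (MyWaiter and my_wakeup_cb are hypothetical):

    #include "qemu/osdep.h"
    #include "block/aio.h"

    /* Hypothetical: records the coroutine parked in qemu_coroutine_yield(). */
    typedef struct MyWaiter {
        Coroutine *co;
        int ret;
    } MyWaiter;

    static void my_wakeup_cb(void *opaque, int ret)
    {
        MyWaiter *w = opaque;

        w->ret = ret;
        /* Runs w->co on its own AioContext.  Unlike qemu_coroutine_enter(),
         * this is safe even when the callback fires in another thread. */
        aio_co_wake(w->co);
    }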