The following changes since commit 9a7beaad3dbba982f7a461d676b55a5c3851d312:

  Merge remote-tracking branch 'remotes/alistair/tags/pull-riscv-to-apply-20210304' into staging (2021-03-05 10:47:46 +0000)

are available in the Git repository at:

  git://repo.or.cz/qemu/kevin.git tags/for-upstream

for you to fetch changes up to 67bedc3aed5c455b629c2cb5f523b536c46adff9:

  docs: qsd: Explain --export nbd,name=... default (2021-03-05 17:09:46 +0100)

----------------------------------------------------------------
Block layer patches

- qemu-storage-daemon: add --pidfile option
- qemu-storage-daemon: CLI error messages include the option name now
- vhost-user-blk export: Misc fixes, added test cases
- docs: Improvements for qemu-storage-daemon documentation
- parallels: load bitmap extension
- backup-top: Don't crash on post-finalize accesses
- iotests improvements

----------------------------------------------------------------
Alberto Garcia (1):
      iotests: Drop deprecated 'props' from object-add

Coiby Xu (1):
      test: new qTest case to test the vhost-user-blk-server

Eric Blake (1):
      iotests: Fix up python style in 300

Kevin Wolf (1):
      docs: qsd: Explain --export nbd,name=... default

Max Reitz (3):
      backup: Remove nodes from job in .clean()
      backup-top: Refuse I/O in inactive state
      iotests/283: Check that finalize drops backup-top

Paolo Bonzini (2):
      storage-daemon: report unexpected arguments on the fly
      storage-daemon: include current command line option in the errors

Stefan Hajnoczi (14):
      qemu-storage-daemon: add --pidfile option
      docs: show how to spawn qemu-storage-daemon with fd passing
      docs: replace insecure /tmp examples in qsd docs
      vhost-user-blk: fix blkcfg->num_queues endianness
      libqtest: add qtest_socket_server()
      libqtest: add qtest_kill_qemu()
      libqtest: add qtest_remove_abrt_handler()
      tests/qtest: add multi-queue test case to vhost-user-blk-test
      block/export: fix blk_size double byteswap
      block/export: use VIRTIO_BLK_SECTOR_BITS
      block/export: fix vhost-user-blk export sector number calculation
      block/export: port virtio-blk discard/write zeroes input validation
      vhost-user-blk-test: test discard/write zeroes invalid inputs
      block/export: port virtio-blk read/write range check

Stefano Garzarella (1):
      blockjob: report a better error message

Vladimir Sementsov-Ogievskiy (7):
      qcow2-bitmap: make bytes_covered_by_bitmap_cluster() public
      parallels.txt: fix bitmap L1 table description
      block/parallels: BDRVParallelsState: add cluster_size field
      parallels: support bitmap extension for read-only mode
      iotests.py: add unarchive_sample_image() helper
      iotests: add parallels-read-bitmap test
      MAINTAINERS: update parallels block driver

 docs/interop/parallels.txt                          |  28 +-
 docs/tools/qemu-storage-daemon.rst                  |  68 +-
 block/parallels.h                                   |   7 +-
 include/block/dirty-bitmap.h                        |   2 +
 tests/qtest/libqos/libqtest.h                       |  37 +
 tests/qtest/libqos/vhost-user-blk.h                 |  48 +
 block/backup-top.c                                  |  10 +
 block/backup.c                                      |   1 +
 block/dirty-bitmap.c                                |  13 +
 block/export/vhost-user-blk-server.c                | 150 +++-
 block/parallels-ext.c                               | 300 +++++++
 block/parallels.c                                   |  26 +-
 block/qcow2-bitmap.c                                |  16 +-
 blockjob.c                                          |  10 +-
 hw/block/vhost-user-blk.c                           |   7 +-
 storage-daemon/qemu-storage-daemon.c                |  56 +-
 tests/qtest/libqos/vhost-user-blk.c                 | 130 +++
 tests/qtest/libqtest.c                              |  82 +-
 tests/qtest/vhost-user-blk-test.c                   | 983 +++++++++++++++++++++
 tests/qemu-iotests/iotests.py                       |  10 +
 MAINTAINERS                                         |   5 +
 block/meson.build                                   |   3 +-
 tests/qemu-iotests/087                              |   8 +-
 tests/qemu-iotests/184                              |  18 +-
 tests/qemu-iotests/218                              |   2 +-
 tests/qemu-iotests/235                              |   2 +-
 tests/qemu-iotests/245                              |   4 +-
 tests/qemu-iotests/258                              |   6 +-
 tests/qemu-iotests/258.out                          |   4 +-
 tests/qemu-iotests/283                              |  53 ++
 tests/qemu-iotests/283.out                          |  15 +
 tests/qemu-iotests/295                              |   2 +-
 tests/qemu-iotests/296                              |   2 +-
 tests/qemu-iotests/300                              |  10 +-
 .../sample_images/parallels-with-bitmap.bz2         | Bin 0 -> 203 bytes
 .../sample_images/parallels-with-bitmap.sh          |  51 ++
 tests/qemu-iotests/tests/parallels-read-bitmap      |  55 ++
 tests/qemu-iotests/tests/parallels-read-bitmap.out  |   6 +
 tests/qtest/libqos/meson.build                      |   1 +
 tests/qtest/meson.build                             |   4 +
 40 files changed, 2098 insertions(+), 137 deletions(-)
 create mode 100644 tests/qtest/libqos/vhost-user-blk.h
 create mode 100644 block/parallels-ext.c
 create mode 100644 tests/qtest/libqos/vhost-user-blk.c
 create mode 100644 tests/qtest/vhost-user-blk-test.c
 create mode 100644 tests/qemu-iotests/sample_images/parallels-with-bitmap.bz2
 create mode 100755 tests/qemu-iotests/sample_images/parallels-with-bitmap.sh
 create mode 100755 tests/qemu-iotests/tests/parallels-read-bitmap
 create mode 100644 tests/qemu-iotests/tests/parallels-read-bitmap.out
diff view generated by jsdifflib
Deleted patch

commit_complete() can't assume that after its block_job_completed() the
job is actually immediately freed; someone else may still be holding
references. In this case, the op blockers on the intermediate nodes make
the graph reconfiguration in the completion code fail.

Call block_job_remove_all_bdrv() manually so that we know for sure that
any blockers on intermediate nodes are given up.

Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
 block/commit.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/block/commit.c b/block/commit.c
index XXXXXXX..XXXXXXX 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -XXX,XX +XXX,XX @@ static void commit_complete(BlockJob *job, void *opaque)
     }
     g_free(s->backing_file_str);
     blk_unref(s->top);
+
+    /* If there is more than one reference to the job (e.g. if called from
+     * block_job_finish_sync()), block_job_completed() won't free it and
+     * therefore the blockers on the intermediate nodes remain. This would
+     * cause bdrv_set_backing_hd() to fail. */
+    block_job_remove_all_bdrv(job);
+
     block_job_completed(&s->common, ret);
     g_free(data);

-- 
1.8.3.1
38
diff view generated by jsdifflib
1
From: Alberto Garcia <berto@igalia.com>

Signed-off-by: Alberto Garcia <berto@igalia.com>
Message-Id: <20210222115737.2993-1-berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 tests/qemu-iotests/087     |  8 ++------
 tests/qemu-iotests/184     | 18 ++++++------------
 tests/qemu-iotests/218     |  2 +-
 tests/qemu-iotests/235     |  2 +-
 tests/qemu-iotests/245     |  4 ++--
 tests/qemu-iotests/258     |  6 +++---
 tests/qemu-iotests/258.out |  4 ++--
 tests/qemu-iotests/295     |  2 +-
 tests/qemu-iotests/296     |  2 +-
 9 files changed, 19 insertions(+), 29 deletions(-)

diff --git a/tests/qemu-iotests/087 b/tests/qemu-iotests/087
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/087
+++ b/tests/qemu-iotests/087
@@ -XXX,XX +XXX,XX @@ run_qemu <<EOF
      "arguments": {
          "qom-type": "secret",
          "id": "sec0",
-         "props": {
-             "data": "123456"
-         }
+         "data": "123456"
      }
 }
 { "execute": "blockdev-add",
@@ -XXX,XX +XXX,XX @@ run_qemu <<EOF
      "arguments": {
          "qom-type": "secret",
          "id": "sec0",
-         "props": {
-             "data": "123456"
-         }
+         "data": "123456"
      }
 }
 { "execute": "blockdev-add",
diff --git a/tests/qemu-iotests/184 b/tests/qemu-iotests/184
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/184
+++ b/tests/qemu-iotests/184
@@ -XXX,XX +XXX,XX @@ run_qemu <<EOF
      "arguments": {
          "qom-type": "throttle-group",
          "id": "group0",
-         "props": {
-             "limits" : {
-                 "iops-total": 1000
-             }
+         "limits" : {
+             "iops-total": 1000
          }
      }
 }
@@ -XXX,XX +XXX,XX @@ run_qemu <<EOF
      "arguments": {
          "qom-type": "throttle-group",
          "id": "group0",
-         "props" : {
-             "limits": {
-                 "iops-total": 1000
-             }
+         "limits": {
+             "iops-total": 1000
          }
      }
 }
@@ -XXX,XX +XXX,XX @@ run_qemu <<EOF
      "arguments": {
          "qom-type": "throttle-group",
          "id": "group0",
-         "props" : {
-             "limits": {
-                 "iops-total": 1000
-             }
+         "limits": {
+             "iops-total": 1000
          }
      }
 }
diff --git a/tests/qemu-iotests/218 b/tests/qemu-iotests/218
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/218
+++ b/tests/qemu-iotests/218
@@ -XXX,XX +XXX,XX @@ with iotests.VM() as vm, \
     vm.launch()

     ret = vm.qmp('object-add', qom_type='throttle-group', id='tg',
-                 props={'x-bps-read': 4096})
+                 limits={'bps-read': 4096})
     assert ret['return'] == {}

     ret = vm.qmp('blockdev-add',
diff --git a/tests/qemu-iotests/235 b/tests/qemu-iotests/235
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/235
+++ b/tests/qemu-iotests/235
@@ -XXX,XX +XXX,XX @@ vm.add_args('-drive', 'id=src,file=' + disk)
 vm.launch()

 log(vm.qmp('object-add', qom_type='throttle-group', id='tg0',
-           props={ 'x-bps-total': size }))
+           limits={'bps-total': size}))

 log(vm.qmp('blockdev-add',
            **{ 'node-name': 'target',
diff --git a/tests/qemu-iotests/245 b/tests/qemu-iotests/245
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/245
+++ b/tests/qemu-iotests/245
@@ -XXX,XX +XXX,XX @@ class TestBlockdevReopen(iotests.QMPTestCase):
         ###### throttle ######
         ######################
         opts = { 'qom-type': 'throttle-group', 'id': 'group0',
-                 'props': { 'limits': { 'iops-total': 1000 } } }
+                 'limits': { 'iops-total': 1000 } }
         result = self.vm.qmp('object-add', conv_keys = False, **opts)
         self.assert_qmp(result, 'return', {})

         opts = { 'qom-type': 'throttle-group', 'id': 'group1',
-                 'props': { 'limits': { 'iops-total': 2000 } } }
+                 'limits': { 'iops-total': 2000 } }
         result = self.vm.qmp('object-add', conv_keys = False, **opts)
         self.assert_qmp(result, 'return', {})

diff --git a/tests/qemu-iotests/258 b/tests/qemu-iotests/258
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/258
+++ b/tests/qemu-iotests/258
@@ -XXX,XX +XXX,XX @@ def test_concurrent_finish(write_to_stream_node):
     vm.qmp_log('object-add',
                qom_type='throttle-group',
                id='tg',
-               props={
-                   'x-iops-write': 1,
-                   'x-iops-write-max': 1
+               limits={
+                   'iops-write': 1,
+                   'iops-write-max': 1
                })

     vm.qmp_log('blockdev-add',
diff --git a/tests/qemu-iotests/258.out b/tests/qemu-iotests/258.out
index XXXXXXX..XXXXXXX 100644
--- a/tests/qemu-iotests/258.out
+++ b/tests/qemu-iotests/258.out
@@ -XXX,XX +XXX,XX @@ Running tests:

 === Commit and stream finish concurrently (letting stream write) ===

-{"execute": "object-add", "arguments": {"id": "tg", "props": {"x-iops-write": 1, "x-iops-write-max": 1}, "qom-type": "throttle-group"}}
+{"execute": "object-add", "arguments": {"id": "tg", "limits": {"iops-write": 1, "iops-write-max": 1}, "qom-type": "throttle-group"}}
 {"return": {}}
 {"execute": "blockdev-add", "arguments": {"backing": {"backing": {"backing": {"backing": {"driver": "raw", "file": {"driver": "file", "filename": "TEST_DIR/PID-node0.img"}, "node-name": "node0"}, "driver": "IMGFMT", "file": {"driver": "file", "filename": "TEST_DIR/PID-node1.img"}, "node-name": "node1"}, "driver": "IMGFMT", "file": {"driver": "file", "filename": "TEST_DIR/PID-node2.img"}, "node-name": "node2"}, "driver": "IMGFMT", "file": {"driver": "file", "filename": "TEST_DIR/PID-node3.img"}, "node-name": "node3"}, "driver": "IMGFMT", "file": {"driver": "throttle", "file": {"driver": "file", "filename": "TEST_DIR/PID-node4.img"}, "throttle-group": "tg"}, "node-name": "node4"}}
 {"return": {}}
@@ -XXX,XX +XXX,XX @@ Running tests:

 === Commit and stream finish concurrently (letting commit write) ===

-{"execute": "object-add", "arguments": {"id": "tg", "props": {"x-iops-write": 1, "x-iops-write-max": 1}, "qom-type": "throttle-group"}}
+{"execute": "object-add", "arguments": {"id": "tg", "limits": {"iops-write": 1, "iops-write-max": 1}, "qom-type": "throttle-group"}}
 {"return": {}}
 {"execute": "blockdev-add", "arguments": {"backing": {"backing": {"backing": {"backing": {"driver": "raw", "file": {"driver": "throttle", "file": {"driver": "file", "filename": "TEST_DIR/PID-node0.img"}, "throttle-group": "tg"}, "node-name": "node0"}, "driver": "IMGFMT", "file": {"driver": "file", "filename": "TEST_DIR/PID-node1.img"}, "node-name": "node1"}, "driver": "IMGFMT", "file": {"driver": "file", "filename": "TEST_DIR/PID-node2.img"}, "node-name": "node2"}, "driver": "IMGFMT", "file": {"driver": "file", "filename": "TEST_DIR/PID-node3.img"}, "node-name": "node3"}, "driver": "IMGFMT", "file": {"driver": "file", "filename": "TEST_DIR/PID-node4.img"}, "node-name": "node4"}}
 {"return": {}}
diff --git a/tests/qemu-iotests/295 b/tests/qemu-iotests/295
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/295
+++ b/tests/qemu-iotests/295
@@ -XXX,XX +XXX,XX @@ class Secret:

     def to_qmp_object(self):
         return { "qom_type" : "secret", "id": self.id(),
-                 "props": { "data": self.secret() } }
+                 "data": self.secret() }

 ################################################################################
 class EncryptionSetupTestCase(iotests.QMPTestCase):
diff --git a/tests/qemu-iotests/296 b/tests/qemu-iotests/296
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/296
+++ b/tests/qemu-iotests/296
@@ -XXX,XX +XXX,XX @@ class Secret:

     def to_qmp_object(self):
         return { "qom_type" : "secret", "id": self.id(),
-                 "props": { "data": self.secret() } }
+                 "data": self.secret() }

 ################################################################################

-- 
2.29.2
51
199
52
200
diff view generated by jsdifflib
1
From: Max Reitz <mreitz@redhat.com>

The block job holds a reference to the backup-top node (because it is
passed as the main job BDS to block_job_create()). Therefore,
bdrv_backup_top_drop() cannot delete the backup-top node (replacing it
by its child does not affect the job parent, because that has
.stay_at_node set). That is a problem, because all of its I/O functions
assume the BlockCopyState (s->bcs) to be valid and that it has a
filtered child; but after bdrv_backup_top_drop(), neither of those
things is true.

It does not make sense to add new parents to backup-top after
backup_clean(), so we should detach it from the job before
bdrv_backup_top_drop(). Because there is no function to do that for a
single node, just detach all of the job's nodes -- the job does not do
anything past backup_clean() anyway.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210219153348.41861-2-mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/backup.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/backup.c b/block/backup.c
index XXXXXXX..XXXXXXX 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -XXX,XX +XXX,XX @@ static void backup_abort(Job *job)
 static void backup_clean(Job *job)
 {
     BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
+    block_job_remove_all_bdrv(&s->common);
     bdrv_backup_top_drop(s->backup_top);
 }

-- 
2.29.2
39
44
40
diff view generated by jsdifflib
1
From: Max Reitz <mreitz@redhat.com>

When the backup-top node transitions from active to inactive in
bdrv_backup_top_drop(), the BlockCopyState is freed and the filtered
child is removed, so the node effectively becomes unusable.

However, no one told its I/O functions this, so they will happily
continue accessing bs->backing and s->bcs. Prevent that by aborting
early when s->active is false.

(After the preceding patch, the node should be gone after
bdrv_backup_top_drop(), so this should largely be a theoretical problem.
But still, better to be safe than sorry, and also I think it just makes
sense to check s->active in the I/O functions.)

Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210219153348.41861-3-mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/backup-top.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/block/backup-top.c b/block/backup-top.c
index XXXXXXX..XXXXXXX 100644
--- a/block/backup-top.c
+++ b/block/backup-top.c
@@ -XXX,XX +XXX,XX @@ static coroutine_fn int backup_top_co_preadv(
         BlockDriverState *bs, uint64_t offset, uint64_t bytes,
         QEMUIOVector *qiov, int flags)
 {
+    BDRVBackupTopState *s = bs->opaque;
+
+    if (!s->active) {
+        return -EIO;
+    }
+
     return bdrv_co_preadv(bs->backing, offset, bytes, qiov, flags);
 }

@@ -XXX,XX +XXX,XX @@ static coroutine_fn int backup_top_cbw(BlockDriverState *bs, uint64_t offset,
     BDRVBackupTopState *s = bs->opaque;
     uint64_t off, end;

+    if (!s->active) {
+        return -EIO;
+    }
+
     if (flags & BDRV_REQ_WRITE_UNCHANGED) {
         return 0;
     }

-- 
2.29.2
53
89
54
diff view generated by jsdifflib
1
From: Max Reitz <mreitz@redhat.com>
1
From: Max Reitz <mreitz@redhat.com>
2
2
3
The bs->exact_filename field may not be sufficient to store the full
3
Without any of HEAD^ or HEAD^^ applied, qemu will most likely crash on
4
blkdebug node filename. In this case, we should not generate a filename
4
the qemu-io invocation, for a variety of immediate reasons. The
5
at all instead of an unusable one.
5
underlying problem is generally a use-after-free access into
6
backup-top's BlockCopyState.
6
7
7
Cc: qemu-stable@nongnu.org
8
With only HEAD^ applied, qemu-io will run into an EIO (which is not
8
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
9
capture by the output, but you can see that the qemu-io invocation will
10
be accepted (i.e., qemu-io will run) in contrast to the reference
11
output, where the node name cannot be found), and qemu will then crash
12
in query-named-block-nodes: bdrv_get_allocated_file_size() detects
13
backup-top to be a filter and passes the request through to its child.
14
However, after bdrv_backup_top_drop(), that child is NULL, so the
15
recursive call crashes.
16
17
With HEAD^^ applied, this test should pass.
18
9
Signed-off-by: Max Reitz <mreitz@redhat.com>
19
Signed-off-by: Max Reitz <mreitz@redhat.com>
10
Message-id: 20170613172006.19685-2-mreitz@redhat.com
20
Message-Id: <20210219153348.41861-4-mreitz@redhat.com>
11
Reviewed-by: Alberto Garcia <berto@igalia.com>
21
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
12
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
13
Signed-off-by: Max Reitz <mreitz@redhat.com>
14
---
22
---
15
block/blkdebug.c | 10 +++++++---
23
tests/qemu-iotests/283 | 53 ++++++++++++++++++++++++++++++++++++++
16
1 file changed, 7 insertions(+), 3 deletions(-)
24
tests/qemu-iotests/283.out | 15 +++++++++++
25
2 files changed, 68 insertions(+)
17
26
18
diff --git a/block/blkdebug.c b/block/blkdebug.c
27
diff --git a/tests/qemu-iotests/283 b/tests/qemu-iotests/283
28
index XXXXXXX..XXXXXXX 100755
29
--- a/tests/qemu-iotests/283
30
+++ b/tests/qemu-iotests/283
31
@@ -XXX,XX +XXX,XX @@ vm.qmp_log('blockdev-add', **{
32
vm.qmp_log('blockdev-backup', sync='full', device='source', target='target')
33
34
vm.shutdown()
35
+
36
+
37
+print('\n=== backup-top should be gone after job-finalize ===\n')
38
+
39
+# Check that the backup-top node is gone after job-finalize.
40
+#
41
+# During finalization, the node becomes inactive and can no longer
42
+# function. If it is still present, new parents might be attached, and
43
+# there would be no meaningful way to handle their I/O requests.
44
+
45
+vm = iotests.VM()
46
+vm.launch()
47
+
48
+vm.qmp_log('blockdev-add', **{
49
+ 'node-name': 'source',
50
+ 'driver': 'null-co',
51
+})
52
+
53
+vm.qmp_log('blockdev-add', **{
54
+ 'node-name': 'target',
55
+ 'driver': 'null-co',
56
+})
57
+
58
+vm.qmp_log('blockdev-backup',
59
+ job_id='backup',
60
+ device='source',
61
+ target='target',
62
+ sync='full',
63
+ filter_node_name='backup-filter',
64
+ auto_finalize=False,
65
+ auto_dismiss=False)
66
+
67
+vm.event_wait('BLOCK_JOB_PENDING', 5.0)
68
+
69
+# The backup-top filter should still be present prior to finalization
70
+assert vm.node_info('backup-filter') is not None
71
+
72
+vm.qmp_log('job-finalize', id='backup')
73
+vm.event_wait('BLOCK_JOB_COMPLETED', 5.0)
74
+
75
+# The filter should be gone now. Check that by trying to access it
76
+# with qemu-io (which will most likely crash qemu if it is still
77
+# there.).
78
+vm.qmp_log('human-monitor-command',
79
+ command_line='qemu-io backup-filter "write 0 1M"')
80
+
81
+# (Also, do an explicit check.)
82
+assert vm.node_info('backup-filter') is None
83
+
84
+vm.qmp_log('job-dismiss', id='backup')
85
+vm.event_wait('JOB_STATUS_CHANGE', 5.0, {'data': {'status': 'null'}})
86
+
87
+vm.shutdown()
88
diff --git a/tests/qemu-iotests/283.out b/tests/qemu-iotests/283.out
19
index XXXXXXX..XXXXXXX 100644
89
index XXXXXXX..XXXXXXX 100644
20
--- a/block/blkdebug.c
90
--- a/tests/qemu-iotests/283.out
21
+++ b/block/blkdebug.c
91
+++ b/tests/qemu-iotests/283.out
22
@@ -XXX,XX +XXX,XX @@ static void blkdebug_refresh_filename(BlockDriverState *bs, QDict *options)
92
@@ -XXX,XX +XXX,XX @@
23
}
93
{"return": {}}
24
94
{"execute": "blockdev-backup", "arguments": {"device": "source", "sync": "full", "target": "target"}}
25
if (!force_json && bs->file->bs->exact_filename[0]) {
95
{"error": {"class": "GenericError", "desc": "Cannot set permissions for backup-top filter: Conflicts with use by other as 'image', which uses 'write' on base"}}
26
- snprintf(bs->exact_filename, sizeof(bs->exact_filename),
96
+
27
- "blkdebug:%s:%s", s->config_file ?: "",
97
+=== backup-top should be gone after job-finalize ===
28
- bs->file->bs->exact_filename);
98
+
29
+ int ret = snprintf(bs->exact_filename, sizeof(bs->exact_filename),
99
+{"execute": "blockdev-add", "arguments": {"driver": "null-co", "node-name": "source"}}
30
+ "blkdebug:%s:%s", s->config_file ?: "",
100
+{"return": {}}
31
+ bs->file->bs->exact_filename);
101
+{"execute": "blockdev-add", "arguments": {"driver": "null-co", "node-name": "target"}}
32
+ if (ret >= sizeof(bs->exact_filename)) {
102
+{"return": {}}
33
+ /* An overflow makes the filename unusable, so do not report any */
103
+{"execute": "blockdev-backup", "arguments": {"auto-dismiss": false, "auto-finalize": false, "device": "source", "filter-node-name": "backup-filter", "job-id": "backup", "sync": "full", "target": "target"}}
34
+ bs->exact_filename[0] = 0;
104
+{"return": {}}
35
+ }
105
+{"execute": "job-finalize", "arguments": {"id": "backup"}}
36
}
106
+{"return": {}}
37
107
+{"execute": "human-monitor-command", "arguments": {"command-line": "qemu-io backup-filter \"write 0 1M\""}}
38
opts = qdict_new();
108
+{"return": "Error: Cannot find device= nor node_name=backup-filter\r\n"}
109
+{"execute": "job-dismiss", "arguments": {"id": "backup"}}
110
+{"return": {}}
39
--
111
--
40
1.8.3.1
112
2.29.2
41
113
42
114
diff view generated by jsdifflib
1
All functions that are marked coroutine_fn can directly call the
1
From: Eric Blake <eblake@redhat.com>
2
bdrv_co_* version of functions instead of going through the wrapper.
3
2
3
Break some long lines, and relax our type hints to be more generic to
4
any JSON, in order to more easily permit the additional JSON depth now
5
possible in migration parameters. Detected by iotest 297.
6
7
Fixes: ca4bfec41d56
8
(qemu-iotests: 300: Add test case for modifying persistence of bitmap)
9
Reported-by: Kevin Wolf <kwolf@redhat.com>
10
Signed-off-by: Eric Blake <eblake@redhat.com>
11
Message-Id: <20210215220518.1745469-1-eblake@redhat.com>
12
Reviewed-by: John Snow <jsnow@redhat.com>
13
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
4
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
14
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
5
Reviewed-by: Manos Pitsidianakis <el13635@mail.ntua.gr>
6
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
7
---
15
---
8
block/qed.c | 16 +++++++++-------
16
tests/qemu-iotests/300 | 10 ++++++----
9
1 file changed, 9 insertions(+), 7 deletions(-)
17
1 file changed, 6 insertions(+), 4 deletions(-)
10
18
11
diff --git a/block/qed.c b/block/qed.c
19
diff --git a/tests/qemu-iotests/300 b/tests/qemu-iotests/300
12
index XXXXXXX..XXXXXXX 100644
20
index XXXXXXX..XXXXXXX 100755
13
--- a/block/qed.c
21
--- a/tests/qemu-iotests/300
14
+++ b/block/qed.c
22
+++ b/tests/qemu-iotests/300
15
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn qed_write_header(BDRVQEDState *s)
23
@@ -XXX,XX +XXX,XX @@
16
};
24
import os
17
qemu_iovec_init_external(&qiov, &iov, 1);
25
import random
18
26
import re
19
- ret = bdrv_preadv(s->bs->file, 0, &qiov);
27
-from typing import Dict, List, Optional, Union
20
+ ret = bdrv_co_preadv(s->bs->file, 0, qiov.size, &qiov, 0);
28
+from typing import Dict, List, Optional
21
if (ret < 0) {
29
22
goto out;
30
import iotests
23
}
31
24
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn qed_write_header(BDRVQEDState *s)
32
@@ -XXX,XX +XXX,XX @@ import iotests
25
/* Update header */
33
# pylint: disable=wrong-import-order
26
qed_header_cpu_to_le(&s->header, (QEDHeader *) buf);
34
import qemu
27
35
28
- ret = bdrv_pwritev(s->bs->file, 0, &qiov);
36
-BlockBitmapMapping = List[Dict[str, Union[str, List[Dict[str, str]]]]]
29
+ ret = bdrv_co_pwritev(s->bs->file, 0, qiov.size, &qiov, 0);
37
+BlockBitmapMapping = List[Dict[str, object]]
30
if (ret < 0) {
38
31
goto out;
39
mig_sock = os.path.join(iotests.sock_dir, 'mig_sock')
32
}
40
33
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn qed_read_backing_file(BDRVQEDState *s, uint64_t pos,
41
@@ -XXX,XX +XXX,XX @@ class TestCrossAliasMigration(TestDirtyBitmapMigration):
34
qemu_iovec_concat(*backing_qiov, qiov, 0, size);
42
35
43
class TestAliasTransformMigration(TestDirtyBitmapMigration):
36
BLKDBG_EVENT(s->bs->file, BLKDBG_READ_BACKING_AIO);
44
"""
37
- ret = bdrv_preadv(s->bs->backing, pos, *backing_qiov);
45
- Tests the 'transform' option which modifies bitmap persistence on migration.
38
+ ret = bdrv_co_preadv(s->bs->backing, pos, size, *backing_qiov, 0);
46
+ Tests the 'transform' option which modifies bitmap persistence on
39
if (ret < 0) {
47
+ migration.
40
return ret;
48
"""
41
}
49
42
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn qed_copy_from_backing_file(BDRVQEDState *s,
50
src_node_name = 'node-a'
43
}
51
@@ -XXX,XX +XXX,XX @@ class TestAliasTransformMigration(TestDirtyBitmapMigration):
44
52
bitmaps = self.vm_b.query_bitmaps()
45
BLKDBG_EVENT(s->bs->file, BLKDBG_COW_WRITE);
53
46
- ret = bdrv_pwritev(s->bs->file, offset, &qiov);
54
for node in bitmaps:
47
+ ret = bdrv_co_pwritev(s->bs->file, offset, qiov.size, &qiov, 0);
55
- bitmaps[node] = sorted(((bmap['name'], bmap['persistent']) for bmap in bitmaps[node]))
48
if (ret < 0) {
56
+ bitmaps[node] = sorted(((bmap['name'], bmap['persistent'])
49
goto out;
57
+ for bmap in bitmaps[node]))
50
}
58
51
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn qed_aio_write_main(QEDAIOCB *acb)
59
self.assertEqual(bitmaps,
52
trace_qed_aio_write_main(s, acb, 0, offset, acb->cur_qiov.size);
60
{'node-a': [('bmap-a', True), ('bmap-b', False)],
53
54
BLKDBG_EVENT(s->bs->file, BLKDBG_WRITE_AIO);
55
- ret = bdrv_pwritev(s->bs->file, offset, &acb->cur_qiov);
56
+ ret = bdrv_co_pwritev(s->bs->file, offset, acb->cur_qiov.size,
57
+ &acb->cur_qiov, 0);
58
if (ret < 0) {
59
return ret;
60
}
61
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn qed_aio_write_main(QEDAIOCB *acb)
62
* region. The solution is to flush after writing a new data
63
* cluster and before updating the L2 table.
64
*/
65
- ret = bdrv_flush(s->bs->file->bs);
66
+ ret = bdrv_co_flush(s->bs->file->bs);
67
if (ret < 0) {
68
return ret;
69
}
70
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn qed_aio_read_data(void *opaque, int ret,
71
}
72
73
BLKDBG_EVENT(bs->file, BLKDBG_READ_AIO);
74
- ret = bdrv_preadv(bs->file, offset, &acb->cur_qiov);
75
+ ret = bdrv_co_preadv(bs->file, offset, acb->cur_qiov.size,
76
+ &acb->cur_qiov, 0);
77
if (ret < 0) {
78
return ret;
79
}
80
--
61
--
81
1.8.3.1
62
2.29.2
82
63
83
64
diff view generated by jsdifflib
1
From: "sochin.jiang" <sochin.jiang@huawei.com>
1
From: Stefano Garzarella <sgarzare@redhat.com>
2
2
3
img_commit could fall into an infinite loop calling run_block_job() if
3
When a block job fails, we report strerror(-job->job.ret) error
4
its blockjob fails on any I/O error, fix this already known problem.
4
message, also if the job set an error object.
5
Let's report a better error message using error_get_pretty(job->job.err).
5
6
6
Signed-off-by: sochin.jiang <sochin.jiang@huawei.com>
7
If an error object was not set, strerror(-job->ret) is used as fallback,
7
Message-id: 1497509253-28941-1-git-send-email-sochin.jiang@huawei.com
8
as explained in include/qemu/job.h:
8
Signed-off-by: Max Reitz <mreitz@redhat.com>
9
10
typedef struct Job {
11
...
12
/**
13
* Error object for a failed job.
14
* If job->ret is nonzero and an error object was not set, it will be set
15
* to strerror(-job->ret) during job_completed.
16
*/
17
Error *err;
18
}
19
20
In block_job_query() there can be a transient where 'job.err' is not set
21
by a scheduled bottom half. In that case we use strerror(-job->ret) as it
22
was before.
23
24
Suggested-by: Kevin Wolf <kwolf@redhat.com>
25
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
26
Message-Id: <20210225103633.76746-1-sgarzare@redhat.com>
27
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
9
---
28
---
10
blockjob.c | 4 ++--
29
blockjob.c | 10 +++++++---
11
include/block/blockjob.h | 18 ++++++++++++++++++
30
1 file changed, 7 insertions(+), 3 deletions(-)
12
qemu-img.c | 20 +++++++++++++-------
13
3 files changed, 33 insertions(+), 9 deletions(-)
14
31
15
diff --git a/blockjob.c b/blockjob.c
32
diff --git a/blockjob.c b/blockjob.c
16
index XXXXXXX..XXXXXXX 100644
33
index XXXXXXX..XXXXXXX 100644
17
--- a/blockjob.c
34
--- a/blockjob.c
18
+++ b/blockjob.c
35
+++ b/blockjob.c
19
@@ -XXX,XX +XXX,XX @@ static void block_job_resume(BlockJob *job)
36
@@ -XXX,XX +XXX,XX @@ BlockJobInfo *block_job_query(BlockJob *job, Error **errp)
20
block_job_enter(job);
37
info->status = job->job.status;
38
info->auto_finalize = job->job.auto_finalize;
39
info->auto_dismiss = job->job.auto_dismiss;
40
- info->has_error = job->job.ret != 0;
41
- info->error = job->job.ret ? g_strdup(strerror(-job->job.ret)) : NULL;
42
+ if (job->job.ret) {
43
+ info->has_error = true;
44
+ info->error = job->job.err ?
45
+ g_strdup(error_get_pretty(job->job.err)) :
46
+ g_strdup(strerror(-job->job.ret));
47
+ }
48
return info;
21
}
49
}
22
50
23
-static void block_job_ref(BlockJob *job)
51
@@ -XXX,XX +XXX,XX @@ static void block_job_event_completed(Notifier *n, void *opaque)
24
+void block_job_ref(BlockJob *job)
52
}
25
{
53
26
++job->refcnt;
54
if (job->job.ret < 0) {
27
}
55
- msg = strerror(-job->job.ret);
28
@@ -XXX,XX +XXX,XX @@ static void block_job_attached_aio_context(AioContext *new_context,
56
+ msg = error_get_pretty(job->job.err);
29
void *opaque);
57
}
30
static void block_job_detach_aio_context(void *opaque);
58
31
59
qapi_event_send_block_job_completed(job_type(&job->job),
32
-static void block_job_unref(BlockJob *job)
33
+void block_job_unref(BlockJob *job)
34
{
35
if (--job->refcnt == 0) {
36
BlockDriverState *bs = blk_bs(job->blk);
37
diff --git a/include/block/blockjob.h b/include/block/blockjob.h
38
index XXXXXXX..XXXXXXX 100644
39
--- a/include/block/blockjob.h
40
+++ b/include/block/blockjob.h
41
@@ -XXX,XX +XXX,XX @@ void block_job_iostatus_reset(BlockJob *job);
42
BlockJobTxn *block_job_txn_new(void);
43
44
/**
45
+ * block_job_ref:
46
+ *
47
+ * Add a reference to BlockJob refcnt, it will be decreased with
48
+ * block_job_unref, and then be freed if it comes to be the last
49
+ * reference.
50
+ */
51
+void block_job_ref(BlockJob *job);
52
+
53
+/**
54
+ * block_job_unref:
55
+ *
56
+ * Release a reference that was previously acquired with block_job_ref
57
+ * or block_job_create. If it's the last reference to the object, it will be
58
+ * freed.
59
+ */
60
+void block_job_unref(BlockJob *job);
61
+
62
+/**
63
* block_job_txn_unref:
64
*
65
* Release a reference that was previously acquired with block_job_txn_add_job
66
diff --git a/qemu-img.c b/qemu-img.c
67
index XXXXXXX..XXXXXXX 100644
68
--- a/qemu-img.c
69
+++ b/qemu-img.c
70
@@ -XXX,XX +XXX,XX @@ static void common_block_job_cb(void *opaque, int ret)
71
static void run_block_job(BlockJob *job, Error **errp)
72
{
73
AioContext *aio_context = blk_get_aio_context(job->blk);
74
+ int ret = 0;
75
76
- /* FIXME In error cases, the job simply goes away and we access a dangling
77
- * pointer below. */
78
aio_context_acquire(aio_context);
79
+ block_job_ref(job);
80
do {
81
aio_poll(aio_context, true);
82
qemu_progress_print(job->len ?
83
((float)job->offset / job->len * 100.f) : 0.0f, 0);
84
- } while (!job->ready);
85
+ } while (!job->ready && !job->completed);
86
87
- block_job_complete_sync(job, errp);
88
+ if (!job->completed) {
89
+ ret = block_job_complete_sync(job, errp);
90
+ } else {
91
+ ret = job->ret;
92
+ }
93
+ block_job_unref(job);
94
aio_context_release(aio_context);
95
96
- /* A block job may finish instantaneously without publishing any progress,
97
- * so just signal completion here */
98
- qemu_progress_print(100.f, 0);
99
+ /* publish completion progress only when success */
100
+ if (!ret) {
101
+ qemu_progress_print(100.f, 0);
102
+ }
103
}
104
105
static int img_commit(int argc, char **argv)
106
--
60
--
107
1.8.3.1
61
2.29.2
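As a side note, the reporting rule described in the commit message (prefer a
descriptive error object when one exists, otherwise fall back to
strerror(-ret)) can be sketched outside of QEMU as a small standalone C
program. The struct and helper below are made up for the illustration and are
not QEMU APIs:

    /* Minimal sketch: prefer a rich error string, fall back to strerror(). */
    #include <stdio.h>
    #include <string.h>

    struct job_result {
        int ret;          /* 0 on success, negative errno on failure */
        const char *err;  /* optional human-readable error, may be NULL */
    };

    static const char *job_error_message(const struct job_result *r)
    {
        if (r->ret == 0) {
            return NULL;                     /* success: nothing to report */
        }
        return r->err ? r->err : strerror(-r->ret);
    }

    int main(void)
    {
        struct job_result with_msg = { -5, "Failed to get shared \"write\" lock" };
        struct job_result without_msg = { -5, NULL };

        printf("%s\n", job_error_message(&with_msg));    /* rich message */
        printf("%s\n", job_error_message(&without_msg)); /* strerror fallback */
        return 0;
    }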
108
62
109
63
diff view generated by jsdifflib
1
From: Paolo Bonzini <pbonzini@redhat.com>

If the first character of optstring is '-', then each nonoption argv
element is handled as if it were the argument of an option with character
code 1. This removes the reordering of the argv array, and enables usage
of loc_set_cmdline to provide better error messages.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210301152844.291799-2-pbonzini@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 storage-daemon/qemu-storage-daemon.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/storage-daemon/qemu-storage-daemon.c b/storage-daemon/qemu-storage-daemon.c
index XXXXXXX..XXXXXXX 100644
--- a/storage-daemon/qemu-storage-daemon.c
+++ b/storage-daemon/qemu-storage-daemon.c
@@ -XXX,XX +XXX,XX @@ static void process_options(int argc, char *argv[])
      * they are given on the command lines. This means that things must be
      * defined first before they can be referenced in another option.
      */
-    while ((c = getopt_long(argc, argv, "hT:V", long_options, NULL)) != -1) {
+    while ((c = getopt_long(argc, argv, "-hT:V", long_options, NULL)) != -1) {
         switch (c) {
         case '?':
             exit(EXIT_FAILURE);
@@ -XXX,XX +XXX,XX @@ static void process_options(int argc, char *argv[])
             qobject_unref(args);
             break;
         }
+        case 1:
+            error_report("Unexpected argument: %s", optarg);
+            exit(EXIT_FAILURE);
         default:
             g_assert_not_reached();
         }
     }
-    if (optind != argc) {
-        error_report("Unexpected argument: %s", argv[optind]);
-        exit(EXIT_FAILURE);
-    }
 }

 int main(int argc, char *argv[])

-- 
2.29.2
48
2.29.2
227
49
228
50
diff view generated by jsdifflib
1
From: Paolo Bonzini <pbonzini@redhat.com>

Use the location management facilities that the emulator uses, so that
the current command line option appears in the error message.

Before:

  $ storage-daemon/qemu-storage-daemon --nbd key..=
  qemu-storage-daemon: Invalid parameter 'key..'

After:

  $ storage-daemon/qemu-storage-daemon --nbd key..=
  qemu-storage-daemon: --nbd key..=: Invalid parameter 'key..'

Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210301152844.291799-3-pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 storage-daemon/qemu-storage-daemon.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)
24
23
25
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
24
diff --git a/storage-daemon/qemu-storage-daemon.c b/storage-daemon/qemu-storage-daemon.c
26
index XXXXXXX..XXXXXXX 100644
25
index XXXXXXX..XXXXXXX 100644
27
--- a/hw/block/nvme.c
26
--- a/storage-daemon/qemu-storage-daemon.c
28
+++ b/hw/block/nvme.c
27
+++ b/storage-daemon/qemu-storage-daemon.c
29
@@ -XXX,XX +XXX,XX @@
28
@@ -XXX,XX +XXX,XX @@ static void init_qmp_commands(void)
30
* cmb_size_mb=<cmb_size_mb[optional]>
29
qmp_marshal_qmp_capabilities, QCO_ALLOW_PRECONFIG);
31
*
32
* Note cmb_size_mb denotes size of CMB in MB. CMB is assumed to be at
33
- * offset 0 in BAR2 and supports SQS only for now.
34
+ * offset 0 in BAR2 and supports only WDS, RDS and SQS for now.
35
*/
36
37
#include "qemu/osdep.h"
38
@@ -XXX,XX +XXX,XX @@ static void nvme_isr_notify(NvmeCtrl *n, NvmeCQueue *cq)
39
}
40
}
30
}
41
31
42
-static uint16_t nvme_map_prp(QEMUSGList *qsg, uint64_t prp1, uint64_t prp2,
32
+static int getopt_set_loc(int argc, char **argv, const char *optstring,
43
- uint32_t len, NvmeCtrl *n)
33
+ const struct option *longopts)
44
+static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1,
34
+{
45
+ uint64_t prp2, uint32_t len, NvmeCtrl *n)
35
+ int c, save_index;
36
+
37
+ optarg = NULL;
38
+ save_index = optind;
39
+ c = getopt_long(argc, argv, optstring, longopts, NULL);
40
+ if (optarg) {
41
+ loc_set_cmdline(argv, save_index, MAX(1, optind - save_index));
42
+ }
43
+ return c;
44
+}
45
+
46
static void process_options(int argc, char *argv[])
46
{
47
{
47
hwaddr trans_len = n->page_size - (prp1 % n->page_size);
48
int c;
48
trans_len = MIN(len, trans_len);
49
@@ -XXX,XX +XXX,XX @@ static void process_options(int argc, char *argv[])
49
@@ -XXX,XX +XXX,XX @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, uint64_t prp1, uint64_t prp2,
50
* they are given on the command lines. This means that things must be
50
51
* defined first before they can be referenced in another option.
51
if (!prp1) {
52
*/
52
return NVME_INVALID_FIELD | NVME_DNR;
53
- while ((c = getopt_long(argc, argv, "-hT:V", long_options, NULL)) != -1) {
53
+ } else if (n->cmbsz && prp1 >= n->ctrl_mem.addr &&
54
+ while ((c = getopt_set_loc(argc, argv, "-hT:V", long_options)) != -1) {
54
+ prp1 < n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size)) {
55
switch (c) {
55
+ qsg->nsg = 0;
56
case '?':
56
+ qemu_iovec_init(iov, num_prps);
57
exit(EXIT_FAILURE);
57
+ qemu_iovec_add(iov, (void *)&n->cmbuf[prp1 - n->ctrl_mem.addr], trans_len);
58
@@ -XXX,XX +XXX,XX @@ static void process_options(int argc, char *argv[])
58
+ } else {
59
break;
59
+ pci_dma_sglist_init(qsg, &n->parent_obj, num_prps);
60
+ qemu_sglist_add(qsg, prp1, trans_len);
61
}
62
-
63
- pci_dma_sglist_init(qsg, &n->parent_obj, num_prps);
64
- qemu_sglist_add(qsg, prp1, trans_len);
65
len -= trans_len;
66
if (len) {
67
if (!prp2) {
68
@@ -XXX,XX +XXX,XX @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, uint64_t prp1, uint64_t prp2,
69
70
nents = (len + n->page_size - 1) >> n->page_bits;
71
prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
72
- pci_dma_read(&n->parent_obj, prp2, (void *)prp_list, prp_trans);
73
+ nvme_addr_read(n, prp2, (void *)prp_list, prp_trans);
74
while (len != 0) {
75
uint64_t prp_ent = le64_to_cpu(prp_list[i]);
76
77
@@ -XXX,XX +XXX,XX @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, uint64_t prp1, uint64_t prp2,
78
i = 0;
79
nents = (len + n->page_size - 1) >> n->page_bits;
80
prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
81
- pci_dma_read(&n->parent_obj, prp_ent, (void *)prp_list,
82
+ nvme_addr_read(n, prp_ent, (void *)prp_list,
83
prp_trans);
84
prp_ent = le64_to_cpu(prp_list[i]);
85
}
86
@@ -XXX,XX +XXX,XX @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, uint64_t prp1, uint64_t prp2,
87
}
88
89
trans_len = MIN(len, n->page_size);
90
- qemu_sglist_add(qsg, prp_ent, trans_len);
91
+ if (qsg->nsg){
92
+ qemu_sglist_add(qsg, prp_ent, trans_len);
93
+ } else {
94
+ qemu_iovec_add(iov, (void *)&n->cmbuf[prp_ent - n->ctrl_mem.addr], trans_len);
95
+ }
96
len -= trans_len;
97
i++;
98
}
60
}
99
@@ -XXX,XX +XXX,XX @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, uint64_t prp1, uint64_t prp2,
61
case 1:
100
if (prp2 & (n->page_size - 1)) {
62
- error_report("Unexpected argument: %s", optarg);
101
goto unmap;
63
+ error_report("Unexpected argument");
102
}
64
exit(EXIT_FAILURE);
103
- qemu_sglist_add(qsg, prp2, len);
65
default:
104
+ if (qsg->nsg) {
66
g_assert_not_reached();
105
+ qemu_sglist_add(qsg, prp2, len);
106
+ } else {
107
+ qemu_iovec_add(iov, (void *)&n->cmbuf[prp2 - n->ctrl_mem.addr], trans_len);
108
+ }
109
}
67
}
110
}
68
}
111
return NVME_SUCCESS;
69
+ loc_set_none();
112
@@ -XXX,XX +XXX,XX @@ static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
113
uint64_t prp1, uint64_t prp2)
114
{
115
QEMUSGList qsg;
116
+ QEMUIOVector iov;
117
+ uint16_t status = NVME_SUCCESS;
118
119
- if (nvme_map_prp(&qsg, prp1, prp2, len, n)) {
120
+ if (nvme_map_prp(&qsg, &iov, prp1, prp2, len, n)) {
121
return NVME_INVALID_FIELD | NVME_DNR;
122
}
123
- if (dma_buf_read(ptr, len, &qsg)) {
124
+ if (qsg.nsg > 0) {
125
+ if (dma_buf_read(ptr, len, &qsg)) {
126
+ status = NVME_INVALID_FIELD | NVME_DNR;
127
+ }
128
qemu_sglist_destroy(&qsg);
129
- return NVME_INVALID_FIELD | NVME_DNR;
130
+ } else {
131
+ if (qemu_iovec_to_buf(&iov, 0, ptr, len) != len) {
132
+ status = NVME_INVALID_FIELD | NVME_DNR;
133
+ }
134
+ qemu_iovec_destroy(&iov);
135
}
136
- qemu_sglist_destroy(&qsg);
137
- return NVME_SUCCESS;
138
+ return status;
139
}
70
}
140
71
141
static void nvme_post_cqes(void *opaque)
72
int main(int argc, char *argv[])
142
@@ -XXX,XX +XXX,XX @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
143
return NVME_LBA_RANGE | NVME_DNR;
144
}
145
146
- if (nvme_map_prp(&req->qsg, prp1, prp2, data_size, n)) {
147
+ if (nvme_map_prp(&req->qsg, &req->iov, prp1, prp2, data_size, n)) {
148
block_acct_invalid(blk_get_stats(n->conf.blk), acct);
149
return NVME_INVALID_FIELD | NVME_DNR;
150
}
151
152
- assert((nlb << data_shift) == req->qsg.size);
153
-
154
- req->has_sg = true;
155
dma_acct_start(n->conf.blk, &req->acct, &req->qsg, acct);
156
- req->aiocb = is_write ?
157
- dma_blk_write(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
158
- nvme_rw_cb, req) :
159
- dma_blk_read(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
160
- nvme_rw_cb, req);
161
+ if (req->qsg.nsg > 0) {
162
+ req->has_sg = true;
163
+ req->aiocb = is_write ?
164
+ dma_blk_write(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
165
+ nvme_rw_cb, req) :
166
+ dma_blk_read(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
167
+ nvme_rw_cb, req);
168
+ } else {
169
+ req->has_sg = false;
170
+ req->aiocb = is_write ?
171
+ blk_aio_pwritev(n->conf.blk, data_offset, &req->iov, 0, nvme_rw_cb,
172
+ req) :
173
+ blk_aio_preadv(n->conf.blk, data_offset, &req->iov, 0, nvme_rw_cb,
174
+ req);
175
+ }
176
177
return NVME_NO_COMPLETE;
178
}
179
@@ -XXX,XX +XXX,XX @@ static int nvme_init(PCIDevice *pci_dev)
180
NVME_CMBSZ_SET_SQS(n->bar.cmbsz, 1);
181
NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 0);
182
NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 0);
183
- NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 0);
184
- NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 0);
185
+ NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 1);
186
+ NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 1);
187
NVME_CMBSZ_SET_SZU(n->bar.cmbsz, 2); /* MBs */
188
NVME_CMBSZ_SET_SZ(n->bar.cmbsz, n->cmb_size_mb);
189
190
+ n->cmbloc = n->bar.cmbloc;
191
+ n->cmbsz = n->bar.cmbsz;
192
+
193
n->cmbuf = g_malloc0(NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
194
memory_region_init_io(&n->ctrl_mem, OBJECT(n), &nvme_cmb_ops, n,
195
"nvme-cmb", NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
196
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
197
index XXXXXXX..XXXXXXX 100644
198
--- a/hw/block/nvme.h
199
+++ b/hw/block/nvme.h
200
@@ -XXX,XX +XXX,XX @@ typedef struct NvmeRequest {
201
NvmeCqe cqe;
202
BlockAcctCookie acct;
203
QEMUSGList qsg;
204
+ QEMUIOVector iov;
205
QTAILQ_ENTRY(NvmeRequest)entry;
206
} NvmeRequest;
207
208
--
73
--
209
1.8.3.1
74
2.29.2
210
75
211
76
1
Don't recurse into qed_aio_next_io() and qed_aio_complete() here, but
1
From: Stefan Hajnoczi <stefanha@redhat.com>
2
just return an error code and let the caller handle it.
3
2
4
While refactoring qed_aio_write_alloc() to accommodate the change,
3
Daemons often have a --pidfile option where the pid is written to a file
5
qed_aio_write_zero_cluster() ended up with a single line, so I chose to
4
so that scripts can stop the daemon by sending a signal.
6
inline that line and remove the function completely.
7
5
6
The pid file also acts as a lock to prevent multiple instances of the
7
daemon from launching for a given pid file.
8
9
QEMU, qemu-nbd, qemu-ga, virtiofsd, and qemu-pr-helper all support the
10
--pidfile option. Add it to qemu-storage-daemon too.
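As a usage sketch (only --pidfile comes from this patch; the blockdev, NBD
server and export options are illustrative):

    $ qemu-storage-daemon \
        --blockdev driver=file,node-name=file0,filename=disk.img \
        --nbd-server addr.type=unix,addr.path=/var/run/qsd-nbd.sock \
        --export type=nbd,id=export0,node-name=file0,writable=on \
        --pidfile /var/run/qsd.pid &
    $ kill -SIGTERM "$(cat /var/run/qsd.pid)"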
11
12
Reported-by: Richard W.M. Jones <rjones@redhat.com>
13
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
14
Message-Id: <20210302142746.170535-1-stefanha@redhat.com>
15
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
8
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
16
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
9
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
10
---
17
---
11
block/qed.c | 58 +++++++++++++++++++++-------------------------------------
18
docs/tools/qemu-storage-daemon.rst | 14 +++++++++++
12
1 file changed, 21 insertions(+), 37 deletions(-)
19
storage-daemon/qemu-storage-daemon.c | 36 ++++++++++++++++++++++++++++
20
2 files changed, 50 insertions(+)
13
21
14
diff --git a/block/qed.c b/block/qed.c
22
diff --git a/docs/tools/qemu-storage-daemon.rst b/docs/tools/qemu-storage-daemon.rst
15
index XXXXXXX..XXXXXXX 100644
23
index XXXXXXX..XXXXXXX 100644
16
--- a/block/qed.c
24
--- a/docs/tools/qemu-storage-daemon.rst
17
+++ b/block/qed.c
25
+++ b/docs/tools/qemu-storage-daemon.rst
18
@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_main(QEDAIOCB *acb)
26
@@ -XXX,XX +XXX,XX @@ Standard options:
19
/**
27
List object properties with ``<type>,help``. See the :manpage:`qemu(1)`
20
* Populate untouched regions of new data cluster
28
manual page for a description of the object properties.
21
*/
29
22
-static void qed_aio_write_cow(void *opaque, int ret)
30
+.. option:: --pidfile PATH
23
+static int qed_aio_write_cow(QEDAIOCB *acb)
24
{
25
- QEDAIOCB *acb = opaque;
26
BDRVQEDState *s = acb_to_s(acb);
27
uint64_t start, len, offset;
28
+ int ret;
29
30
/* Populate front untouched region of new data cluster */
31
start = qed_start_of_cluster(s, acb->cur_pos);
32
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_cow(void *opaque, int ret)
33
34
trace_qed_aio_write_prefill(s, acb, start, len, acb->cur_cluster);
35
ret = qed_copy_from_backing_file(s, start, len, acb->cur_cluster);
36
- if (ret) {
37
- qed_aio_complete(acb, ret);
38
- return;
39
+ if (ret < 0) {
40
+ return ret;
41
}
42
43
/* Populate back untouched region of new data cluster */
44
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_cow(void *opaque, int ret)
45
46
trace_qed_aio_write_postfill(s, acb, start, len, offset);
47
ret = qed_copy_from_backing_file(s, start, len, offset);
48
- if (ret) {
49
- qed_aio_complete(acb, ret);
50
- return;
51
- }
52
-
53
- ret = qed_aio_write_main(acb);
54
if (ret < 0) {
55
- qed_aio_complete(acb, ret);
56
- return;
57
+ return ret;
58
}
59
- qed_aio_next_io(acb, 0);
60
+
31
+
61
+ return qed_aio_write_main(acb);
32
+ is the path to a file where the daemon writes its pid. This allows scripts to
33
+ stop the daemon by sending a signal::
34
+
35
+ $ kill -SIGTERM $(<path/to/qsd.pid)
36
+
37
+ A file lock is applied to the file so only one instance of the daemon can run
38
+ with a given pid file path. The daemon unlinks its pid file when terminating.
39
+
40
+ The pid file is written after chardevs, exports, and NBD servers have been
41
+ created but before accepting connections. The daemon has started successfully
42
+ when the pid file is written and clients may begin connecting.
43
+
44
Examples
45
--------
46
Launch the daemon with QMP monitor socket ``qmp.sock`` so clients can execute
47
diff --git a/storage-daemon/qemu-storage-daemon.c b/storage-daemon/qemu-storage-daemon.c
48
index XXXXXXX..XXXXXXX 100644
49
--- a/storage-daemon/qemu-storage-daemon.c
50
+++ b/storage-daemon/qemu-storage-daemon.c
51
@@ -XXX,XX +XXX,XX @@
52
#include "sysemu/runstate.h"
53
#include "trace/control.h"
54
55
+static const char *pid_file;
56
static volatile bool exit_requested = false;
57
58
void qemu_system_killed(int signal, pid_t pid)
59
@@ -XXX,XX +XXX,XX @@ static void help(void)
60
" See the qemu(1) man page for documentation of the\n"
61
" objects that can be added.\n"
62
"\n"
63
+" --pidfile <path> write process ID to a file after startup\n"
64
+"\n"
65
QEMU_HELP_BOTTOM "\n",
66
error_get_progname());
62
}
67
}
63
68
@@ -XXX,XX +XXX,XX @@ enum {
64
/**
69
OPTION_MONITOR,
65
@@ -XXX,XX +XXX,XX @@ static bool qed_should_set_need_check(BDRVQEDState *s)
70
OPTION_NBD_SERVER,
66
return !(s->header.features & QED_F_NEED_CHECK);
71
OPTION_OBJECT,
72
+ OPTION_PIDFILE,
73
};
74
75
extern QemuOptsList qemu_chardev_opts;
76
@@ -XXX,XX +XXX,XX @@ static void process_options(int argc, char *argv[])
77
{"monitor", required_argument, NULL, OPTION_MONITOR},
78
{"nbd-server", required_argument, NULL, OPTION_NBD_SERVER},
79
{"object", required_argument, NULL, OPTION_OBJECT},
80
+ {"pidfile", required_argument, NULL, OPTION_PIDFILE},
81
{"trace", required_argument, NULL, 'T'},
82
{"version", no_argument, NULL, 'V'},
83
{0, 0, 0, 0}
84
@@ -XXX,XX +XXX,XX @@ static void process_options(int argc, char *argv[])
85
qobject_unref(args);
86
break;
87
}
88
+ case OPTION_PIDFILE:
89
+ pid_file = optarg;
90
+ break;
91
case 1:
92
error_report("Unexpected argument");
93
exit(EXIT_FAILURE);
94
@@ -XXX,XX +XXX,XX @@ static void process_options(int argc, char *argv[])
95
loc_set_none();
67
}
96
}
68
97
69
-static void qed_aio_write_zero_cluster(void *opaque, int ret)
98
+static void pid_file_cleanup(void)
70
-{
99
+{
71
- QEDAIOCB *acb = opaque;
100
+ unlink(pid_file);
72
-
101
+}
73
- if (ret) {
102
+
74
- qed_aio_complete(acb, ret);
103
+static void pid_file_init(void)
75
- return;
104
+{
76
- }
105
+ Error *err = NULL;
77
-
106
+
78
- ret = qed_aio_write_l2_update(acb, 1);
107
+ if (!pid_file) {
79
- if (ret < 0) {
108
+ return;
80
- qed_aio_complete(acb, ret);
81
- return;
82
- }
83
- qed_aio_next_io(acb, 0);
84
-}
85
-
86
/**
87
* Write new data cluster
88
*
89
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_zero_cluster(void *opaque, int ret)
90
static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
91
{
92
BDRVQEDState *s = acb_to_s(acb);
93
- BlockCompletionFunc *cb;
94
int ret;
95
96
/* Cancel timer when the first allocating request comes in */
97
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
98
qed_aio_start_io(acb);
99
return;
100
}
101
-
102
- cb = qed_aio_write_zero_cluster;
103
} else {
104
- cb = qed_aio_write_cow;
105
acb->cur_cluster = qed_alloc_clusters(s, acb->cur_nclusters);
106
}
107
108
if (qed_should_set_need_check(s)) {
109
s->header.features |= QED_F_NEED_CHECK;
110
ret = qed_write_header(s);
111
- cb(acb, ret);
112
+ if (ret < 0) {
113
+ qed_aio_complete(acb, ret);
114
+ return;
115
+ }
116
+ }
109
+ }
117
+
110
+
118
+ if (acb->flags & QED_AIOCB_ZERO) {
111
+ if (!qemu_write_pidfile(pid_file, &err)) {
119
+ ret = qed_aio_write_l2_update(acb, 1);
112
+ error_reportf_err(err, "cannot create PID file: ");
120
} else {
113
+ exit(EXIT_FAILURE);
121
- cb(acb, 0);
114
+ }
122
+ ret = qed_aio_write_cow(acb);
115
+
116
+ atexit(pid_file_cleanup);
117
+}
118
+
119
int main(int argc, char *argv[])
120
{
121
#ifdef CONFIG_POSIX
122
@@ -XXX,XX +XXX,XX @@ int main(int argc, char *argv[])
123
qemu_init_main_loop(&error_fatal);
124
process_options(argc, argv);
125
126
+ /*
127
+ * Write the pid file after creating chardevs, exports, and NBD servers but
128
+ * before accepting connections. This ordering is documented. Do not change
129
+ * it.
130
+ */
131
+ pid_file_init();
132
+
133
while (!exit_requested) {
134
main_loop_wait(false);
123
}
135
}
124
+ if (ret < 0) {
125
+ qed_aio_complete(acb, ret);
126
+ return;
127
+ }
128
+ qed_aio_next_io(acb, 0);
129
}
130
131
/**
132
--
136
--
133
1.8.3.1
137
2.29.2
134
138
135
139
diff view generated by jsdifflib
1
From: Stefan Hajnoczi <stefanha@redhat.com>
1
From: Stefan Hajnoczi <stefanha@redhat.com>
2
2
3
Old kvm.ko versions only supported a tiny number of ioeventfds so
3
The QMP monitor, NBD server, and vhost-user-blk export all support file
4
virtio-pci avoids ioeventfds when kvm_has_many_ioeventfds() returns 0.
4
descriptor passing. This is a useful technique because it allows the
5
parent process to spawn and wait for qemu-storage-daemon without busy
6
waiting, which may delay startup due to arbitrary sleep() calls.
5
7
6
Do not check kvm_has_many_ioeventfds() when KVM is disabled since it
8
This Python example is inspired by the test case written for libnbd by
7
always returns 0. Since commit 8c56c1a592b5092d91da8d8943c17777d6462a6f
9
Richard W.M. Jones <rjones@redhat.com>:
8
("memory: emulate ioeventfd") it has been possible to use ioeventfds in
10
https://gitlab.com/nbdkit/libnbd/-/commit/89113f484effb0e6c322314ba75c1cbe07a04543
9
qtest or TCG mode.
10
11
11
This patch makes -device virtio-blk-pci,iothread=iothread0 work even
12
Thanks to Daniel P. Berrangé <berrange@redhat.com> for suggestions on
12
when KVM is disabled.
13
how to get this working. Now let's document it!
13
14
14
I have tested that virtio-blk-pci works under TCG both with and without
15
Reported-by: Richard W.M. Jones <rjones@redhat.com>
15
iothread.
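A minimal command line of the kind exercised here (the disk image name is
illustrative):

    $ qemu-system-x86_64 -machine accel=tcg \
        -object iothread,id=iothread0 \
        -drive if=none,id=drive0,file=disk.img,format=raw \
        -device virtio-blk-pci,drive=drive0,iothread=iothread0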
16
Cc: Kevin Wolf <kwolf@redhat.com>
16
17
Cc: Daniel P. Berrangé <berrange@redhat.com>
17
Cc: Michael S. Tsirkin <mst@redhat.com>
18
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
18
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
19
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
19
Message-Id: <20210301172728.135331-2-stefanha@redhat.com>
20
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
21
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
20
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
22
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
21
---
23
---
22
hw/virtio/virtio-pci.c | 2 +-
24
docs/tools/qemu-storage-daemon.rst | 42 ++++++++++++++++++++++++++++--
23
1 file changed, 1 insertion(+), 1 deletion(-)
25
1 file changed, 40 insertions(+), 2 deletions(-)
24
26
25
diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
27
diff --git a/docs/tools/qemu-storage-daemon.rst b/docs/tools/qemu-storage-daemon.rst
26
index XXXXXXX..XXXXXXX 100644
28
index XXXXXXX..XXXXXXX 100644
27
--- a/hw/virtio/virtio-pci.c
29
--- a/docs/tools/qemu-storage-daemon.rst
28
+++ b/hw/virtio/virtio-pci.c
30
+++ b/docs/tools/qemu-storage-daemon.rst
29
@@ -XXX,XX +XXX,XX @@ static void virtio_pci_realize(PCIDevice *pci_dev, Error **errp)
31
@@ -XXX,XX +XXX,XX @@ Standard options:
30
bool pcie_port = pci_bus_is_express(pci_dev->bus) &&
32
31
!pci_bus_is_root(pci_dev->bus);
33
.. option:: --nbd-server addr.type=inet,addr.host=<host>,addr.port=<port>[,tls-creds=<id>][,tls-authz=<id>][,max-connections=<n>]
32
34
--nbd-server addr.type=unix,addr.path=<path>[,tls-creds=<id>][,tls-authz=<id>][,max-connections=<n>]
33
- if (!kvm_has_many_ioeventfds()) {
35
+ --nbd-server addr.type=fd,addr.str=<fd>[,tls-creds=<id>][,tls-authz=<id>][,max-connections=<n>]
34
+ if (kvm_enabled() && !kvm_has_many_ioeventfds()) {
36
35
proxy->flags &= ~VIRTIO_PCI_FLAG_USE_IOEVENTFD;
37
is a server for NBD exports. Both TCP and UNIX domain sockets are supported.
36
}
38
- TLS encryption can be configured using ``--object`` tls-creds-* and authz-*
37
39
- secrets (see below).
40
+ A listen socket can be provided via file descriptor passing (see Examples
41
+ below). TLS encryption can be configured using ``--object`` tls-creds-* and
42
+ authz-* secrets (see below).
43
44
To configure an NBD server on UNIX domain socket path ``/tmp/nbd.sock``::
45
46
@@ -XXX,XX +XXX,XX @@ QMP commands::
47
--chardev socket,path=qmp.sock,server=on,wait=off,id=char1 \
48
--monitor chardev=char1
49
50
+Launch the daemon from Python with a QMP monitor socket using file descriptor
51
+passing so there is no need to busy wait for the QMP monitor to become
52
+available::
53
+
54
+ #!/usr/bin/env python3
55
+ import subprocess
56
+ import socket
57
+
58
+ sock_path = '/var/run/qmp.sock'
59
+
60
+ with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as listen_sock:
61
+ listen_sock.bind(sock_path)
62
+ listen_sock.listen()
63
+
64
+ fd = listen_sock.fileno()
65
+
66
+ subprocess.Popen(
67
+ ['qemu-storage-daemon',
68
+ '--chardev', f'socket,fd={fd},server=on,id=char1',
69
+ '--monitor', 'chardev=char1'],
70
+ pass_fds=[fd],
71
+ )
72
+
73
+ # listen_sock was automatically closed when leaving the 'with' statement
74
+ # body. If the daemon process terminated early then the following connect()
75
+ # will fail with "Connection refused" because no process has the listen
76
+ # socket open anymore. Launch errors can be detected this way.
77
+
78
+ qmp_sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
79
+ qmp_sock.connect(sock_path)
80
+ ...QMP interaction...
81
+
82
+The same socket spawning approach also works with the ``--nbd-server
83
+addr.type=fd,addr.str=<fd>`` and ``--export
84
+type=vhost-user-blk,addr.type=fd,addr.str=<fd>`` options.
85
+
86
Export raw image file ``disk.img`` over NBD UNIX domain socket ``nbd.sock``::
87
88
$ qemu-storage-daemon \
38
--
89
--
39
1.8.3.1
90
2.29.2
40
91
41
92
1
From: Stefan Hajnoczi <stefanha@redhat.com>
1
From: Stefan Hajnoczi <stefanha@redhat.com>
2
2
3
AioContext was designed to allow nested acquire/release calls. It uses
3
World-writable directories have security issues. Avoid showing them in
4
a recursive mutex so callers don't need to worry about nesting...or so
4
the documentation since someone might accidentally use them in
5
we thought.
5
situations where they are insecure.
6
6
7
BDRV_POLL_WHILE() is used to wait for block I/O requests. It releases
7
There tend to be 3 security problems:
8
the AioContext temporarily around aio_poll(). This gives IOThreads a
8
1. Denial of service. An adversary may be able to create the file
9
chance to acquire the AioContext to process I/O completions.
9
beforehand, consume all space/inodes, etc to sabotage us.
10
2. Impersonation. An adversary may be able to create a listen socket and
11
accept incoming connections that were meant for us.
12
3. Unauthenticated client access. An adversary may be able to connect to
13
us if we did not set the uid/gid and permissions correctly.
10
14
11
It turns out that recursive locking and BDRV_POLL_WHILE() don't mix.
15
These can be prevented or mitigated with private /tmp, carefully setting
12
BDRV_POLL_WHILE() only releases the AioContext once, so the IOThread
16
the umask, etc but that requires special action and does not apply to
13
will not be able to acquire the AioContext if it was acquired
17
all situations. Just avoid using /tmp in examples.
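For example, a per-user runtime directory that is not world-writable can be
used instead (the path and mode here are only an illustration):

    $ install -d -m 0700 "$XDG_RUNTIME_DIR/qsd"
    $ qemu-storage-daemon ... \
        --nbd-server addr.type=unix,addr.path="$XDG_RUNTIME_DIR/qsd/nbd.sock"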
14
multiple times.
15
18
16
Instead of trying to release AioContext n times in BDRV_POLL_WHILE(),
19
Reported-by: Richard W.M. Jones <rjones@redhat.com>
17
this patch simply avoids nested locking in save_vmstate(). It's the
20
Reported-by: Daniel P. Berrangé <berrange@redhat.com>
18
simplest fix and we should step back to consider the big picture with
19
all the recent changes to block layer threading.
20
21
This patch is the final fix to solve 'savevm' hanging with -object
22
iothread.
23
24
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
21
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
25
Reviewed-by: Eric Blake <eblake@redhat.com>
22
Message-Id: <20210301172728.135331-3-stefanha@redhat.com>
26
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
23
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
24
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
27
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
25
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
28
---
26
---
29
migration/savevm.c | 12 +++++++++++-
27
docs/tools/qemu-storage-daemon.rst | 7 ++++---
30
1 file changed, 11 insertions(+), 1 deletion(-)
28
1 file changed, 4 insertions(+), 3 deletions(-)
31
29
32
diff --git a/migration/savevm.c b/migration/savevm.c
30
diff --git a/docs/tools/qemu-storage-daemon.rst b/docs/tools/qemu-storage-daemon.rst
33
index XXXXXXX..XXXXXXX 100644
31
index XXXXXXX..XXXXXXX 100644
34
--- a/migration/savevm.c
32
--- a/docs/tools/qemu-storage-daemon.rst
35
+++ b/migration/savevm.c
33
+++ b/docs/tools/qemu-storage-daemon.rst
36
@@ -XXX,XX +XXX,XX @@ int save_snapshot(const char *name, Error **errp)
34
@@ -XXX,XX +XXX,XX @@ Standard options:
37
goto the_end;
35
a description of character device properties. A common character device
38
}
36
definition configures a UNIX domain socket::
39
37
40
+ /* The bdrv_all_create_snapshot() call that follows acquires the AioContext
38
- --chardev socket,id=char1,path=/tmp/qmp.sock,server=on,wait=off
41
+ * for itself. BDRV_POLL_WHILE() does not support nested locking because
39
+ --chardev socket,id=char1,path=/var/run/qsd-qmp.sock,server=on,wait=off
42
+ * it only releases the lock once. Therefore synchronous I/O will deadlock
40
43
+ * unless we release the AioContext before bdrv_all_create_snapshot().
41
.. option:: --export [type=]nbd,id=<id>,node-name=<node-name>[,name=<export-name>][,writable=on|off][,bitmap=<name>]
44
+ */
42
--export [type=]vhost-user-blk,id=<id>,node-name=<node-name>,addr.type=unix,addr.path=<socket-path>[,writable=on|off][,logical-block-size=<block-size>][,num-queues=<num-queues>]
45
+ aio_context_release(aio_context);
43
@@ -XXX,XX +XXX,XX @@ Standard options:
46
+ aio_context = NULL;
44
below). TLS encryption can be configured using ``--object`` tls-creds-* and
47
+
45
authz-* secrets (see below).
48
ret = bdrv_all_create_snapshot(sn, bs, vm_state_size, &bs);
46
49
if (ret < 0) {
47
- To configure an NBD server on UNIX domain socket path ``/tmp/nbd.sock``::
50
error_setg(errp, "Error while creating snapshot on '%s'",
48
+ To configure an NBD server on UNIX domain socket path
51
@@ -XXX,XX +XXX,XX @@ int save_snapshot(const char *name, Error **errp)
49
+ ``/var/run/qsd-nbd.sock``::
52
ret = 0;
50
53
51
- --nbd-server addr.type=unix,addr.path=/tmp/nbd.sock
54
the_end:
52
+ --nbd-server addr.type=unix,addr.path=/var/run/qsd-nbd.sock
55
- aio_context_release(aio_context);
53
56
+ if (aio_context) {
54
.. option:: --object help
57
+ aio_context_release(aio_context);
55
--object <type>,help
58
+ }
59
if (saved_vm_running) {
60
vm_start();
61
}
62
--
56
--
63
1.8.3.1
57
2.29.2
64
58
65
59
1
All callers pass ret = 0, so we can just remove it.
1
From: Stefan Hajnoczi <stefanha@redhat.com>
2
2
3
Treat the num_queues field as virtio-endian. On big-endian hosts the
4
vhost-user-blk num_queues field was in the wrong endianness.
5
6
Move the blkcfg.num_queues store operation from realize to
7
vhost_user_blk_update_config() so feature negotiation has finished and
8
we know the endianness of the device. VIRTIO 1.0 devices are
9
little-endian, but in case someone wants to use legacy VIRTIO we support
10
all endianness cases.
11
12
Cc: qemu-stable@nongnu.org
13
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
14
Reviewed-by: Raphael Norwitz <raphael.norwitz@nutanix.com>
15
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
16
Message-Id: <20210223144653.811468-2-stefanha@redhat.com>
3
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
17
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
4
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
5
---
18
---
6
block/qed.c | 17 ++++++-----------
19
hw/block/vhost-user-blk.c | 7 +++----
7
1 file changed, 6 insertions(+), 11 deletions(-)
20
1 file changed, 3 insertions(+), 4 deletions(-)
8
21
9
diff --git a/block/qed.c b/block/qed.c
22
diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
10
index XXXXXXX..XXXXXXX 100644
23
index XXXXXXX..XXXXXXX 100644
11
--- a/block/qed.c
24
--- a/hw/block/vhost-user-blk.c
12
+++ b/block/qed.c
25
+++ b/hw/block/vhost-user-blk.c
13
@@ -XXX,XX +XXX,XX @@ static CachedL2Table *qed_new_l2_table(BDRVQEDState *s)
26
@@ -XXX,XX +XXX,XX @@ static void vhost_user_blk_update_config(VirtIODevice *vdev, uint8_t *config)
14
return l2_table;
27
{
28
VHostUserBlk *s = VHOST_USER_BLK(vdev);
29
30
+ /* Our num_queues overrides the device backend */
31
+ virtio_stw_p(vdev, &s->blkcfg.num_queues, s->num_queues);
32
+
33
memcpy(config, &s->blkcfg, sizeof(struct virtio_blk_config));
15
}
34
}
16
35
17
-static void qed_aio_next_io(QEDAIOCB *acb, int ret);
36
@@ -XXX,XX +XXX,XX @@ reconnect:
18
+static void qed_aio_next_io(QEDAIOCB *acb);
37
goto reconnect;
19
20
static void qed_aio_start_io(QEDAIOCB *acb)
21
{
22
- qed_aio_next_io(acb, 0);
23
+ qed_aio_next_io(acb);
24
}
25
26
static void qed_plug_allocating_write_reqs(BDRVQEDState *s)
27
@@ -XXX,XX +XXX,XX @@ static int qed_aio_read_data(void *opaque, int ret, uint64_t offset, size_t len)
28
/**
29
* Begin next I/O or complete the request
30
*/
31
-static void qed_aio_next_io(QEDAIOCB *acb, int ret)
32
+static void qed_aio_next_io(QEDAIOCB *acb)
33
{
34
BDRVQEDState *s = acb_to_s(acb);
35
uint64_t offset;
36
size_t len;
37
+ int ret;
38
39
- trace_qed_aio_next_io(s, acb, ret, acb->cur_pos + acb->cur_qiov.size);
40
+ trace_qed_aio_next_io(s, acb, 0, acb->cur_pos + acb->cur_qiov.size);
41
42
if (acb->backing_qiov) {
43
qemu_iovec_destroy(acb->backing_qiov);
44
@@ -XXX,XX +XXX,XX @@ static void qed_aio_next_io(QEDAIOCB *acb, int ret)
45
acb->backing_qiov = NULL;
46
}
38
}
47
39
48
- /* Handle I/O error */
40
- if (s->blkcfg.num_queues != s->num_queues) {
49
- if (ret) {
41
- s->blkcfg.num_queues = s->num_queues;
50
- qed_aio_complete(acb, ret);
51
- return;
52
- }
42
- }
53
-
43
-
54
acb->qiov_offset += acb->cur_qiov.size;
44
return;
55
acb->cur_pos += acb->cur_qiov.size;
45
56
qemu_iovec_reset(&acb->cur_qiov);
46
virtio_err:
57
@@ -XXX,XX +XXX,XX @@ static void qed_aio_next_io(QEDAIOCB *acb, int ret)
58
}
59
return;
60
}
61
- qed_aio_next_io(acb, 0);
62
+ qed_aio_next_io(acb);
63
}
64
65
static BlockAIOCB *qed_aio_setup(BlockDriverState *bs,
66
--
47
--
67
1.8.3.1
48
2.29.2
68
49
69
50
1
From: Stefan Hajnoczi <stefanha@redhat.com>
1
From: Stefan Hajnoczi <stefanha@redhat.com>
2
2
3
Perform the savevm/loadvm test with both iothread on and off. This
3
Add an API that returns a new UNIX domain socket in the listen state.
4
covers the recently found savevm/loadvm hang when iothread is enabled.
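The updated test can be run from the source tree in the usual way (qcow2 is
just the common format choice):

    $ cd tests/qemu-iotests
    $ ./check -qcow2 068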
4
The code for this was already there but only used internally in
5
init_socket().
6
7
This new API will be used by vhost-user-blk-test.
5
8
6
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
9
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
10
Reviewed-by: Thomas Huth <thuth@redhat.com>
11
Reviewed-by: Wainer dos Santos Moschetta <wainersm@redhat.com>
12
Message-Id: <20210223144653.811468-3-stefanha@redhat.com>
7
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
13
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
8
---
14
---
9
tests/qemu-iotests/068 | 23 ++++++++++++++---------
15
tests/qtest/libqos/libqtest.h | 8 +++++++
10
tests/qemu-iotests/068.out | 11 ++++++++++-
16
tests/qtest/libqtest.c | 40 ++++++++++++++++++++---------------
11
2 files changed, 24 insertions(+), 10 deletions(-)
17
2 files changed, 31 insertions(+), 17 deletions(-)
12
18
13
diff --git a/tests/qemu-iotests/068 b/tests/qemu-iotests/068
19
diff --git a/tests/qtest/libqos/libqtest.h b/tests/qtest/libqos/libqtest.h
14
index XXXXXXX..XXXXXXX 100755
20
index XXXXXXX..XXXXXXX 100644
15
--- a/tests/qemu-iotests/068
21
--- a/tests/qtest/libqos/libqtest.h
16
+++ b/tests/qemu-iotests/068
22
+++ b/tests/qtest/libqos/libqtest.h
17
@@ -XXX,XX +XXX,XX @@ _supported_os Linux
23
@@ -XXX,XX +XXX,XX @@ void qtest_qmp_send(QTestState *s, const char *fmt, ...)
18
IMGOPTS="compat=1.1"
24
void qtest_qmp_send_raw(QTestState *s, const char *fmt, ...)
19
IMG_SIZE=128K
25
GCC_FMT_ATTR(2, 3);
20
26
21
-echo
27
+/**
22
-echo "=== Saving and reloading a VM state to/from a qcow2 image ==="
28
+ * qtest_socket_server:
23
-echo
29
+ * @socket_path: the UNIX domain socket path
24
-_make_test_img $IMG_SIZE
30
+ *
31
+ * Create and return a listen socket file descriptor, or abort on failure.
32
+ */
33
+int qtest_socket_server(const char *socket_path);
34
+
35
/**
36
* qtest_vqmp_fds:
37
* @s: #QTestState instance to operate on.
38
diff --git a/tests/qtest/libqtest.c b/tests/qtest/libqtest.c
39
index XXXXXXX..XXXXXXX 100644
40
--- a/tests/qtest/libqtest.c
41
+++ b/tests/qtest/libqtest.c
42
@@ -XXX,XX +XXX,XX @@ static void qtest_client_set_rx_handler(QTestState *s, QTestRecvFn recv);
43
44
static int init_socket(const char *socket_path)
45
{
46
- struct sockaddr_un addr;
47
- int sock;
48
- int ret;
25
-
49
-
26
case "$QEMU_DEFAULT_MACHINE" in
50
- sock = socket(PF_UNIX, SOCK_STREAM, 0);
27
s390-ccw-virtio)
51
- g_assert_cmpint(sock, !=, -1);
28
platform_parm="-no-shutdown"
52
-
29
@@ -XXX,XX +XXX,XX @@ _qemu()
53
- addr.sun_family = AF_UNIX;
30
_filter_qemu | _filter_hmp
54
- snprintf(addr.sun_path, sizeof(addr.sun_path), "%s", socket_path);
55
+ int sock = qtest_socket_server(socket_path);
56
qemu_set_cloexec(sock);
57
-
58
- do {
59
- ret = bind(sock, (struct sockaddr *)&addr, sizeof(addr));
60
- } while (ret == -1 && errno == EINTR);
61
- g_assert_cmpint(ret, !=, -1);
62
- ret = listen(sock, 1);
63
- g_assert_cmpint(ret, !=, -1);
64
-
65
return sock;
31
}
66
}
32
67
33
-# Give qemu some time to boot before saving the VM state
68
@@ -XXX,XX +XXX,XX @@ QDict *qtest_qmp_receive_dict(QTestState *s)
34
-bash -c 'sleep 1; echo -e "savevm 0\nquit"' | _qemu
69
return qmp_fd_receive(s->qmp_fd);
35
-# Now try to continue from that VM state (this should just work)
70
}
36
-echo quit | _qemu -loadvm 0
71
37
+for extra_args in \
72
+int qtest_socket_server(const char *socket_path)
38
+ "" \
73
+{
39
+ "-object iothread,id=iothread0 -set device.hba0.iothread=iothread0"; do
74
+ struct sockaddr_un addr;
40
+ echo
75
+ int sock;
41
+ echo "=== Saving and reloading a VM state to/from a qcow2 image ($extra_args) ==="
76
+ int ret;
42
+ echo
43
+
77
+
44
+ _make_test_img $IMG_SIZE
78
+ sock = socket(PF_UNIX, SOCK_STREAM, 0);
79
+ g_assert_cmpint(sock, !=, -1);
45
+
80
+
46
+ # Give qemu some time to boot before saving the VM state
81
+ addr.sun_family = AF_UNIX;
47
+ bash -c 'sleep 1; echo -e "savevm 0\nquit"' | _qemu $extra_args
82
+ snprintf(addr.sun_path, sizeof(addr.sun_path), "%s", socket_path);
48
+ # Now try to continue from that VM state (this should just work)
49
+ echo quit | _qemu $extra_args -loadvm 0
50
+done
51
52
# success, all done
53
echo "*** done"
54
diff --git a/tests/qemu-iotests/068.out b/tests/qemu-iotests/068.out
55
index XXXXXXX..XXXXXXX 100644
56
--- a/tests/qemu-iotests/068.out
57
+++ b/tests/qemu-iotests/068.out
58
@@ -XXX,XX +XXX,XX @@
59
QA output created by 068
60
61
-=== Saving and reloading a VM state to/from a qcow2 image ===
62
+=== Saving and reloading a VM state to/from a qcow2 image () ===
63
+
83
+
64
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=131072
84
+ do {
65
+QEMU X.Y.Z monitor - type 'help' for more information
85
+ ret = bind(sock, (struct sockaddr *)&addr, sizeof(addr));
66
+(qemu) savevm 0
86
+ } while (ret == -1 && errno == EINTR);
67
+(qemu) quit
87
+ g_assert_cmpint(ret, !=, -1);
68
+QEMU X.Y.Z monitor - type 'help' for more information
88
+ ret = listen(sock, 1);
69
+(qemu) quit
89
+ g_assert_cmpint(ret, !=, -1);
70
+
90
+
71
+=== Saving and reloading a VM state to/from a qcow2 image (-object iothread,id=iothread0 -set device.hba0.iothread=iothread0) ===
91
+ return sock;
72
92
+}
73
Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=131072
93
+
74
QEMU X.Y.Z monitor - type 'help' for more information
94
/**
95
* Allow users to send a message without waiting for the reply,
96
* in the case that they choose to discard all replies up until
75
--
97
--
76
1.8.3.1
98
2.29.2
77
99
78
100
1
From: Alberto Garcia <berto@igalia.com>
1
From: Stefan Hajnoczi <stefanha@redhat.com>
2
2
3
Instead of calling perform_cow() twice with a different COW region
3
Tests that manage multiple processes may wish to kill QEMU before
4
each time, call it just once and make perform_cow() handle both
4
destroying the QTestState. Expose a function to do that.
5
regions.
6
5
7
This patch simply moves code around. The next one will do the actual
6
The vhost-user-blk-test testcase will need this.
8
reordering of the COW operations.
9
7
10
Signed-off-by: Alberto Garcia <berto@igalia.com>
8
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
11
Reviewed-by: Eric Blake <eblake@redhat.com>
9
Reviewed-by: Wainer dos Santos Moschetta <wainersm@redhat.com>
12
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
10
Message-Id: <20210223144653.811468-4-stefanha@redhat.com>
13
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
11
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
14
---
12
---
15
block/qcow2-cluster.c | 36 ++++++++++++++++++++++--------------
13
tests/qtest/libqos/libqtest.h | 11 +++++++++++
16
1 file changed, 22 insertions(+), 14 deletions(-)
14
tests/qtest/libqtest.c | 7 ++++---
15
2 files changed, 15 insertions(+), 3 deletions(-)
17
16
18
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
17
diff --git a/tests/qtest/libqos/libqtest.h b/tests/qtest/libqos/libqtest.h
19
index XXXXXXX..XXXXXXX 100644
18
index XXXXXXX..XXXXXXX 100644
20
--- a/block/qcow2-cluster.c
19
--- a/tests/qtest/libqos/libqtest.h
21
+++ b/block/qcow2-cluster.c
20
+++ b/tests/qtest/libqos/libqtest.h
22
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn do_perform_cow(BlockDriverState *bs,
21
@@ -XXX,XX +XXX,XX @@ QTestState *qtest_init_without_qmp_handshake(const char *extra_args);
23
struct iovec iov;
22
*/
24
int ret;
23
QTestState *qtest_init_with_serial(const char *extra_args, int *sock_fd);
25
24
26
+ if (bytes == 0) {
25
+/**
27
+ return 0;
26
+ * qtest_kill_qemu:
28
+ }
27
+ * @s: #QTestState instance to operate on.
28
+ *
29
+ * Kill the QEMU process and wait for it to terminate. It is safe to call this
30
+ * function multiple times. Normally qtest_quit() is used instead because it
31
+ * also frees QTestState. Use qtest_kill_qemu() when you just want to kill QEMU
32
+ * and qtest_quit() will be called later.
33
+ */
34
+void qtest_kill_qemu(QTestState *s);
29
+
35
+
30
iov.iov_len = bytes;
36
/**
31
iov.iov_base = qemu_try_blockalign(bs, iov.iov_len);
37
* qtest_quit:
32
if (iov.iov_base == NULL) {
38
* @s: #QTestState instance to operate on.
33
@@ -XXX,XX +XXX,XX @@ uint64_t qcow2_alloc_compressed_cluster_offset(BlockDriverState *bs,
39
diff --git a/tests/qtest/libqtest.c b/tests/qtest/libqtest.c
34
return cluster_offset;
40
index XXXXXXX..XXXXXXX 100644
41
--- a/tests/qtest/libqtest.c
42
+++ b/tests/qtest/libqtest.c
43
@@ -XXX,XX +XXX,XX @@ void qtest_set_expected_status(QTestState *s, int status)
44
s->expected_status = status;
35
}
45
}
36
46
37
-static int perform_cow(BlockDriverState *bs, QCowL2Meta *m, Qcow2COWRegion *r)
47
-static void kill_qemu(QTestState *s)
38
+static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
48
+void qtest_kill_qemu(QTestState *s)
39
{
49
{
40
BDRVQcow2State *s = bs->opaque;
50
pid_t pid = s->qemu_pid;
41
+ Qcow2COWRegion *start = &m->cow_start;
51
int wstatus;
42
+ Qcow2COWRegion *end = &m->cow_end;
52
@@ -XXX,XX +XXX,XX @@ static void kill_qemu(QTestState *s)
43
int ret;
53
kill(pid, SIGTERM);
44
54
TFR(pid = waitpid(s->qemu_pid, &s->wstatus, 0));
45
- if (r->nb_bytes == 0) {
55
assert(pid == s->qemu_pid);
46
+ if (start->nb_bytes == 0 && end->nb_bytes == 0) {
56
+ s->qemu_pid = -1;
47
return 0;
48
}
57
}
49
58
50
qemu_co_mutex_unlock(&s->lock);
51
- ret = do_perform_cow(bs, m->offset, m->alloc_offset, r->offset, r->nb_bytes);
52
- qemu_co_mutex_lock(&s->lock);
53
-
54
+ ret = do_perform_cow(bs, m->offset, m->alloc_offset,
55
+ start->offset, start->nb_bytes);
56
if (ret < 0) {
57
- return ret;
58
+ goto fail;
59
}
60
61
+ ret = do_perform_cow(bs, m->offset, m->alloc_offset,
62
+ end->offset, end->nb_bytes);
63
+
64
+fail:
65
+ qemu_co_mutex_lock(&s->lock);
66
+
67
/*
59
/*
68
* Before we update the L2 table to actually point to the new cluster, we
60
@@ -XXX,XX +XXX,XX @@ static void kill_qemu(QTestState *s)
69
* need to be sure that the refcounts have been increased and COW was
61
70
* handled.
62
static void kill_qemu_hook_func(void *s)
71
*/
63
{
72
- qcow2_cache_depends_on_flush(s->l2_table_cache);
64
- kill_qemu(s);
73
+ if (ret == 0) {
65
+ qtest_kill_qemu(s);
74
+ qcow2_cache_depends_on_flush(s->l2_table_cache);
75
+ }
76
77
- return 0;
78
+ return ret;
79
}
66
}
80
67
81
int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m)
68
static void sigabrt_handler(int signo)
82
@@ -XXX,XX +XXX,XX @@ int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m)
69
@@ -XXX,XX +XXX,XX @@ void qtest_quit(QTestState *s)
83
}
70
/* Uninstall SIGABRT handler on last instance */
84
71
cleanup_sigabrt_handler();
85
/* copy content of unmodified sectors */
72
86
- ret = perform_cow(bs, m, &m->cow_start);
73
- kill_qemu(s);
87
- if (ret < 0) {
74
+ qtest_kill_qemu(s);
88
- goto err;
75
close(s->fd);
89
- }
76
close(s->qmp_fd);
90
-
77
g_string_free(s->rx, true);
91
- ret = perform_cow(bs, m, &m->cow_end);
92
+ ret = perform_cow(bs, m);
93
if (ret < 0) {
94
goto err;
95
}
96
--
78
--
97
1.8.3.1
79
2.29.2
98
80
99
81
1
From: Stefan Hajnoczi <stefanha@redhat.com>
1
From: Stefan Hajnoczi <stefanha@redhat.com>
2
2
3
Avoid duplicating the QEMU command-line.
3
Add a function to remove previously-added abrt handler functions.
4
5
Now that a symmetric pair of add/remove functions exists we can also
6
balance the SIGABRT handler installation. The signal handler was
7
installed each time qtest_add_abrt_handler() was called. Now it is
8
installed when the abrt handler list becomes non-empty and removed again
9
when the list becomes empty.
10
11
The qtest_remove_abrt_handler() function will be used by
12
vhost-user-blk-test.
4
13
5
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
14
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
15
Reviewed-by: Wainer dos Santos Moschetta <wainersm@redhat.com>
16
Message-Id: <20210223144653.811468-5-stefanha@redhat.com>
6
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
17
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
7
---
18
---
8
tests/qemu-iotests/068 | 15 +++++++++------
19
tests/qtest/libqos/libqtest.h | 18 ++++++++++++++++++
9
1 file changed, 9 insertions(+), 6 deletions(-)
20
tests/qtest/libqtest.c | 35 +++++++++++++++++++++++++++++------
21
2 files changed, 47 insertions(+), 6 deletions(-)
10
22
11
diff --git a/tests/qemu-iotests/068 b/tests/qemu-iotests/068
23
diff --git a/tests/qtest/libqos/libqtest.h b/tests/qtest/libqos/libqtest.h
12
index XXXXXXX..XXXXXXX 100755
24
index XXXXXXX..XXXXXXX 100644
13
--- a/tests/qemu-iotests/068
25
--- a/tests/qtest/libqos/libqtest.h
14
+++ b/tests/qemu-iotests/068
26
+++ b/tests/qtest/libqos/libqtest.h
15
@@ -XXX,XX +XXX,XX @@ case "$QEMU_DEFAULT_MACHINE" in
27
@@ -XXX,XX +XXX,XX @@ void qtest_add_data_func_full(const char *str, void *data,
16
;;
28
g_free(path); \
17
esac
29
} while (0)
18
30
19
-# Give qemu some time to boot before saving the VM state
31
+/**
20
-bash -c 'sleep 1; echo -e "savevm 0\nquit"' |\
32
+ * qtest_add_abrt_handler:
21
- $QEMU $platform_parm -nographic -monitor stdio -serial none -hda "$TEST_IMG" |\
33
+ * @fn: Handler function
22
+_qemu()
34
+ * @data: Argument that is passed to the handler
35
+ *
36
+ * Add a handler function that is invoked on SIGABRT. This can be used to
37
+ * terminate processes and perform other cleanup. The handler can be removed
38
+ * with qtest_remove_abrt_handler().
39
+ */
40
void qtest_add_abrt_handler(GHookFunc fn, const void *data);
41
42
+/**
43
+ * qtest_remove_abrt_handler:
44
+ * @data: Argument previously passed to qtest_add_abrt_handler()
45
+ *
46
+ * Remove an abrt handler that was previously added with
47
+ * qtest_add_abrt_handler().
48
+ */
49
+void qtest_remove_abrt_handler(void *data);
50
+
51
/**
52
* qtest_qmp_assert_success:
53
* @qts: QTestState instance to operate on
54
diff --git a/tests/qtest/libqtest.c b/tests/qtest/libqtest.c
55
index XXXXXXX..XXXXXXX 100644
56
--- a/tests/qtest/libqtest.c
57
+++ b/tests/qtest/libqtest.c
58
@@ -XXX,XX +XXX,XX @@ static void cleanup_sigabrt_handler(void)
59
sigaction(SIGABRT, &sigact_old, NULL);
60
}
61
62
+static bool hook_list_is_empty(GHookList *hook_list)
23
+{
63
+{
24
+ $QEMU $platform_parm -nographic -monitor stdio -serial none -hda "$TEST_IMG" \
64
+ GHook *hook = g_hook_first_valid(hook_list, TRUE);
25
+ "$@" |\
65
+
26
_filter_qemu | _filter_hmp
66
+ if (!hook) {
67
+ return false;
68
+ }
69
+
70
+ g_hook_unref(hook_list, hook);
71
+ return true;
27
+}
72
+}
28
+
73
+
29
+# Give qemu some time to boot before saving the VM state
74
void qtest_add_abrt_handler(GHookFunc fn, const void *data)
30
+bash -c 'sleep 1; echo -e "savevm 0\nquit"' | _qemu
75
{
31
# Now try to continue from that VM state (this should just work)
76
GHook *hook;
32
-echo quit |\
77
33
- $QEMU $platform_parm -nographic -monitor stdio -serial none -hda "$TEST_IMG" -loadvm 0 |\
78
- /* Only install SIGABRT handler once */
34
- _filter_qemu | _filter_hmp
79
if (!abrt_hooks.is_setup) {
35
+echo quit | _qemu -loadvm 0
80
g_hook_list_init(&abrt_hooks, sizeof(GHook));
36
81
}
37
# success, all done
82
- setup_sigabrt_handler();
38
echo "*** done"
83
+
84
+ /* Only install SIGABRT handler once */
85
+ if (hook_list_is_empty(&abrt_hooks)) {
86
+ setup_sigabrt_handler();
87
+ }
88
89
hook = g_hook_alloc(&abrt_hooks);
90
hook->func = fn;
91
@@ -XXX,XX +XXX,XX @@ void qtest_add_abrt_handler(GHookFunc fn, const void *data)
92
g_hook_prepend(&abrt_hooks, hook);
93
}
94
95
+void qtest_remove_abrt_handler(void *data)
96
+{
97
+ GHook *hook = g_hook_find_data(&abrt_hooks, TRUE, data);
98
+ g_hook_destroy_link(&abrt_hooks, hook);
99
+
100
+ /* Uninstall SIGABRT handler on last instance */
101
+ if (hook_list_is_empty(&abrt_hooks)) {
102
+ cleanup_sigabrt_handler();
103
+ }
104
+}
105
+
106
static const char *qtest_qemu_binary(void)
107
{
108
const char *qemu_bin;
109
@@ -XXX,XX +XXX,XX @@ QTestState *qtest_init_with_serial(const char *extra_args, int *sock_fd)
110
111
void qtest_quit(QTestState *s)
112
{
113
- g_hook_destroy_link(&abrt_hooks, g_hook_find_data(&abrt_hooks, TRUE, s));
114
-
115
- /* Uninstall SIGABRT handler on last instance */
116
- cleanup_sigabrt_handler();
117
+ qtest_remove_abrt_handler(s);
118
119
qtest_kill_qemu(s);
120
close(s->fd);
39
--
121
--
40
1.8.3.1
122
2.29.2
41
123
42
124
1
Now that we stay in coroutine context for the whole request when doing
1
From: Coiby Xu <coiby.xu@gmail.com>
2
reads or writes, we can add coroutine_fn annotations to many functions
3
that can do I/O or yield directly.
4
2
3
This test case has the same tests as tests/virtio-blk-test.c except for
4
the tests that use block_resize. Since the vhost-user-blk export serves only
5
a single client connection, two exports are started by qemu-storage-daemon for the
6
hotplug test.
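The new case is registered with the qtest suite, so it runs together with the
other qtests; a sketch, assuming an x86_64 build tree (the direct invocation
and its environment variables follow the usual qtest conventions and may
differ for your setup):

    $ make check-qtest-x86_64
    $ QTEST_QEMU_BINARY=./qemu-system-x86_64 \
      QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon \
      ./tests/qtest/vhost-user-blk-test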
7
8
Suggested-by: Thomas Huth <thuth@redhat.com>
9
Signed-off-by: Coiby Xu <coiby.xu@gmail.com>
10
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
11
Message-Id: <20210223144653.811468-6-stefanha@redhat.com>
5
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
12
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
6
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
7
---
13
---
8
block/qed-cluster.c | 5 +++--
14
tests/qtest/libqos/vhost-user-blk.h | 48 ++
9
block/qed.c | 44 ++++++++++++++++++++++++--------------------
15
tests/qtest/libqos/vhost-user-blk.c | 130 +++++
10
block/qed.h | 5 +++--
16
tests/qtest/vhost-user-blk-test.c | 788 ++++++++++++++++++++++++++++
11
3 files changed, 30 insertions(+), 24 deletions(-)
17
MAINTAINERS | 2 +
18
tests/qtest/libqos/meson.build | 1 +
19
tests/qtest/meson.build | 4 +
20
6 files changed, 973 insertions(+)
21
create mode 100644 tests/qtest/libqos/vhost-user-blk.h
22
create mode 100644 tests/qtest/libqos/vhost-user-blk.c
23
create mode 100644 tests/qtest/vhost-user-blk-test.c
12
24
13
diff --git a/block/qed-cluster.c b/block/qed-cluster.c
25
diff --git a/tests/qtest/libqos/vhost-user-blk.h b/tests/qtest/libqos/vhost-user-blk.h
26
new file mode 100644
27
index XXXXXXX..XXXXXXX
28
--- /dev/null
29
+++ b/tests/qtest/libqos/vhost-user-blk.h
30
@@ -XXX,XX +XXX,XX @@
31
+/*
32
+ * libqos driver framework
33
+ *
34
+ * Based on tests/qtest/libqos/virtio-blk.c
35
+ *
36
+ * Copyright (c) 2020 Coiby Xu <coiby.xu@gmail.com>
37
+ *
38
+ * Copyright (c) 2018 Emanuele Giuseppe Esposito <e.emanuelegiuseppe@gmail.com>
39
+ *
40
+ * This library is free software; you can redistribute it and/or
41
+ * modify it under the terms of the GNU Lesser General Public
42
+ * License version 2 as published by the Free Software Foundation.
43
+ *
44
+ * This library is distributed in the hope that it will be useful,
45
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
46
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
47
+ * Lesser General Public License for more details.
48
+ *
49
+ * You should have received a copy of the GNU Lesser General Public
50
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>
51
+ */
52
+
53
+#ifndef TESTS_LIBQOS_VHOST_USER_BLK_H
54
+#define TESTS_LIBQOS_VHOST_USER_BLK_H
55
+
56
+#include "qgraph.h"
57
+#include "virtio.h"
58
+#include "virtio-pci.h"
59
+
60
+typedef struct QVhostUserBlk QVhostUserBlk;
61
+typedef struct QVhostUserBlkPCI QVhostUserBlkPCI;
62
+typedef struct QVhostUserBlkDevice QVhostUserBlkDevice;
63
+
64
+struct QVhostUserBlk {
65
+ QVirtioDevice *vdev;
66
+};
67
+
68
+struct QVhostUserBlkPCI {
69
+ QVirtioPCIDevice pci_vdev;
70
+ QVhostUserBlk blk;
71
+};
72
+
73
+struct QVhostUserBlkDevice {
74
+ QOSGraphObject obj;
75
+ QVhostUserBlk blk;
76
+};
77
+
78
+#endif
79
diff --git a/tests/qtest/libqos/vhost-user-blk.c b/tests/qtest/libqos/vhost-user-blk.c
80
new file mode 100644
81
index XXXXXXX..XXXXXXX
82
--- /dev/null
83
+++ b/tests/qtest/libqos/vhost-user-blk.c
84
@@ -XXX,XX +XXX,XX @@
85
+/*
86
+ * libqos driver framework
87
+ *
88
+ * Based on tests/qtest/libqos/virtio-blk.c
89
+ *
90
+ * Copyright (c) 2020 Coiby Xu <coiby.xu@gmail.com>
91
+ *
92
+ * Copyright (c) 2018 Emanuele Giuseppe Esposito <e.emanuelegiuseppe@gmail.com>
93
+ *
94
+ * This library is free software; you can redistribute it and/or
95
+ * modify it under the terms of the GNU Lesser General Public
96
+ * License version 2.1 as published by the Free Software Foundation.
97
+ *
98
+ * This library is distributed in the hope that it will be useful,
99
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
100
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
101
+ * Lesser General Public License for more details.
102
+ *
103
+ * You should have received a copy of the GNU Lesser General Public
104
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>
105
+ */
106
+
107
+#include "qemu/osdep.h"
108
+#include "libqtest.h"
109
+#include "qemu/module.h"
110
+#include "standard-headers/linux/virtio_blk.h"
111
+#include "vhost-user-blk.h"
112
+
113
+#define PCI_SLOT 0x04
114
+#define PCI_FN 0x00
115
+
116
+/* virtio-blk-device */
117
+static void *qvhost_user_blk_get_driver(QVhostUserBlk *v_blk,
118
+ const char *interface)
119
+{
120
+ if (!g_strcmp0(interface, "vhost-user-blk")) {
121
+ return v_blk;
122
+ }
123
+ if (!g_strcmp0(interface, "virtio")) {
124
+ return v_blk->vdev;
125
+ }
126
+
127
+ fprintf(stderr, "%s not present in vhost-user-blk-device\n", interface);
128
+ g_assert_not_reached();
129
+}
130
+
131
+static void *qvhost_user_blk_device_get_driver(void *object,
132
+ const char *interface)
133
+{
134
+ QVhostUserBlkDevice *v_blk = object;
135
+ return qvhost_user_blk_get_driver(&v_blk->blk, interface);
136
+}
137
+
138
+static void *vhost_user_blk_device_create(void *virtio_dev,
139
+ QGuestAllocator *t_alloc,
140
+ void *addr)
141
+{
142
+ QVhostUserBlkDevice *vhost_user_blk = g_new0(QVhostUserBlkDevice, 1);
143
+ QVhostUserBlk *interface = &vhost_user_blk->blk;
144
+
145
+ interface->vdev = virtio_dev;
146
+
147
+ vhost_user_blk->obj.get_driver = qvhost_user_blk_device_get_driver;
148
+
149
+ return &vhost_user_blk->obj;
150
+}
151
+
152
+/* virtio-blk-pci */
153
+static void *qvhost_user_blk_pci_get_driver(void *object, const char *interface)
154
+{
155
+ QVhostUserBlkPCI *v_blk = object;
156
+ if (!g_strcmp0(interface, "pci-device")) {
157
+ return v_blk->pci_vdev.pdev;
158
+ }
159
+ return qvhost_user_blk_get_driver(&v_blk->blk, interface);
160
+}
161
+
162
+static void *vhost_user_blk_pci_create(void *pci_bus, QGuestAllocator *t_alloc,
163
+ void *addr)
164
+{
165
+ QVhostUserBlkPCI *vhost_user_blk = g_new0(QVhostUserBlkPCI, 1);
166
+ QVhostUserBlk *interface = &vhost_user_blk->blk;
167
+ QOSGraphObject *obj = &vhost_user_blk->pci_vdev.obj;
168
+
169
+ virtio_pci_init(&vhost_user_blk->pci_vdev, pci_bus, addr);
170
+ interface->vdev = &vhost_user_blk->pci_vdev.vdev;
171
+
172
+ g_assert_cmphex(interface->vdev->device_type, ==, VIRTIO_ID_BLOCK);
173
+
174
+ obj->get_driver = qvhost_user_blk_pci_get_driver;
175
+
176
+ return obj;
177
+}
178
+
179
+static void vhost_user_blk_register_nodes(void)
180
+{
181
+ /*
182
+ * FIXME: every test using these two nodes needs to setup a
183
+ * -drive,id=drive0 otherwise QEMU is not going to start.
184
+ * Therefore, we do not include "produces" edge for virtio
185
+ * and pci-device yet.
186
+ */
187
+
188
+ char *arg = g_strdup_printf("id=drv0,chardev=char1,addr=%x.%x",
189
+ PCI_SLOT, PCI_FN);
190
+
191
+ QPCIAddress addr = {
192
+ .devfn = QPCI_DEVFN(PCI_SLOT, PCI_FN),
193
+ };
194
+
195
+ QOSGraphEdgeOptions opts = { };
196
+
197
+ /* virtio-blk-device */
198
+ /** opts.extra_device_opts = "drive=drive0"; */
199
+ qos_node_create_driver("vhost-user-blk-device",
200
+ vhost_user_blk_device_create);
201
+ qos_node_consumes("vhost-user-blk-device", "virtio-bus", &opts);
202
+ qos_node_produces("vhost-user-blk-device", "vhost-user-blk");
203
+
204
+ /* virtio-blk-pci */
205
+ opts.extra_device_opts = arg;
206
+ add_qpci_address(&opts, &addr);
207
+ qos_node_create_driver("vhost-user-blk-pci", vhost_user_blk_pci_create);
208
+ qos_node_consumes("vhost-user-blk-pci", "pci-bus", &opts);
209
+ qos_node_produces("vhost-user-blk-pci", "vhost-user-blk");
210
+
211
+ g_free(arg);
212
+}
213
+
214
+libqos_init(vhost_user_blk_register_nodes);
215
diff --git a/tests/qtest/vhost-user-blk-test.c b/tests/qtest/vhost-user-blk-test.c
216
new file mode 100644
217
index XXXXXXX..XXXXXXX
218
--- /dev/null
219
+++ b/tests/qtest/vhost-user-blk-test.c
220
@@ -XXX,XX +XXX,XX @@
221
+/*
222
+ * QTest testcase for Vhost-user Block Device
223
+ *
224
+ * Based on tests/qtest//virtio-blk-test.c
225
+
226
+ * Copyright (c) 2014 SUSE LINUX Products GmbH
227
+ * Copyright (c) 2014 Marc Marí
228
+ * Copyright (c) 2020 Coiby Xu
229
+ *
230
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
231
+ * See the COPYING file in the top-level directory.
232
+ */
233
+
234
+#include "qemu/osdep.h"
235
+#include "libqtest-single.h"
236
+#include "qemu/bswap.h"
237
+#include "qemu/module.h"
238
+#include "standard-headers/linux/virtio_blk.h"
239
+#include "standard-headers/linux/virtio_pci.h"
240
+#include "libqos/qgraph.h"
241
+#include "libqos/vhost-user-blk.h"
242
+#include "libqos/libqos-pc.h"
243
+
244
+#define TEST_IMAGE_SIZE (64 * 1024 * 1024)
245
+#define QVIRTIO_BLK_TIMEOUT_US (30 * 1000 * 1000)
246
+#define PCI_SLOT_HP 0x06
247
+
248
+typedef struct {
249
+ pid_t pid;
250
+} QemuStorageDaemonState;
251
+
252
+typedef struct QVirtioBlkReq {
253
+ uint32_t type;
254
+ uint32_t ioprio;
255
+ uint64_t sector;
256
+ char *data;
257
+ uint8_t status;
258
+} QVirtioBlkReq;
259
+
260
+#ifdef HOST_WORDS_BIGENDIAN
261
+static const bool host_is_big_endian = true;
262
+#else
263
+static const bool host_is_big_endian; /* false */
264
+#endif
265
+
266
+static inline void virtio_blk_fix_request(QVirtioDevice *d, QVirtioBlkReq *req)
267
+{
268
+ if (qvirtio_is_big_endian(d) != host_is_big_endian) {
269
+ req->type = bswap32(req->type);
270
+ req->ioprio = bswap32(req->ioprio);
271
+ req->sector = bswap64(req->sector);
272
+ }
273
+}
274
+
275
+static inline void virtio_blk_fix_dwz_hdr(QVirtioDevice *d,
276
+ struct virtio_blk_discard_write_zeroes *dwz_hdr)
277
+{
278
+ if (qvirtio_is_big_endian(d) != host_is_big_endian) {
279
+ dwz_hdr->sector = bswap64(dwz_hdr->sector);
280
+ dwz_hdr->num_sectors = bswap32(dwz_hdr->num_sectors);
281
+ dwz_hdr->flags = bswap32(dwz_hdr->flags);
282
+ }
283
+}
284
+
285
+static uint64_t virtio_blk_request(QGuestAllocator *alloc, QVirtioDevice *d,
286
+ QVirtioBlkReq *req, uint64_t data_size)
287
+{
288
+ uint64_t addr;
289
+ uint8_t status = 0xFF;
290
+ QTestState *qts = global_qtest;
291
+
292
+ switch (req->type) {
293
+ case VIRTIO_BLK_T_IN:
294
+ case VIRTIO_BLK_T_OUT:
295
+ g_assert_cmpuint(data_size % 512, ==, 0);
296
+ break;
297
+ case VIRTIO_BLK_T_DISCARD:
298
+ case VIRTIO_BLK_T_WRITE_ZEROES:
299
+ g_assert_cmpuint(data_size %
300
+ sizeof(struct virtio_blk_discard_write_zeroes), ==, 0);
301
+ break;
302
+ default:
303
+ g_assert_cmpuint(data_size, ==, 0);
304
+ }
305
+
306
+ addr = guest_alloc(alloc, sizeof(*req) + data_size);
307
+
308
+ virtio_blk_fix_request(d, req);
309
+
310
+ qtest_memwrite(qts, addr, req, 16);
311
+ qtest_memwrite(qts, addr + 16, req->data, data_size);
312
+ qtest_memwrite(qts, addr + 16 + data_size, &status, sizeof(status));
313
+
314
+ return addr;
315
+}
316
+
317
+/* Returns the request virtqueue so the caller can perform further tests */
318
+static QVirtQueue *test_basic(QVirtioDevice *dev, QGuestAllocator *alloc)
319
+{
320
+ QVirtioBlkReq req;
321
+ uint64_t req_addr;
322
+ uint64_t capacity;
323
+ uint64_t features;
324
+ uint32_t free_head;
325
+ uint8_t status;
326
+ char *data;
327
+ QTestState *qts = global_qtest;
328
+ QVirtQueue *vq;
329
+
330
+ features = qvirtio_get_features(dev);
331
+ features = features & ~(QVIRTIO_F_BAD_FEATURE |
332
+ (1u << VIRTIO_RING_F_INDIRECT_DESC) |
333
+ (1u << VIRTIO_RING_F_EVENT_IDX) |
334
+ (1u << VIRTIO_BLK_F_SCSI));
335
+ qvirtio_set_features(dev, features);
336
+
337
+ capacity = qvirtio_config_readq(dev, 0);
338
+ g_assert_cmpint(capacity, ==, TEST_IMAGE_SIZE / 512);
339
+
340
+ vq = qvirtqueue_setup(dev, alloc, 0);
341
+
342
+ qvirtio_set_driver_ok(dev);
343
+
344
+ /* Write and read with 3 descriptor layout */
345
+ /* Write request */
346
+ req.type = VIRTIO_BLK_T_OUT;
347
+ req.ioprio = 1;
348
+ req.sector = 0;
349
+ req.data = g_malloc0(512);
350
+ strcpy(req.data, "TEST");
351
+
352
+ req_addr = virtio_blk_request(alloc, dev, &req, 512);
353
+
354
+ g_free(req.data);
355
+
356
+ free_head = qvirtqueue_add(qts, vq, req_addr, 16, false, true);
357
+ qvirtqueue_add(qts, vq, req_addr + 16, 512, false, true);
358
+ qvirtqueue_add(qts, vq, req_addr + 528, 1, true, false);
359
+
360
+ qvirtqueue_kick(qts, dev, vq, free_head);
361
+
362
+ qvirtio_wait_used_elem(qts, dev, vq, free_head, NULL,
363
+ QVIRTIO_BLK_TIMEOUT_US);
364
+ status = readb(req_addr + 528);
365
+ g_assert_cmpint(status, ==, 0);
366
+
367
+ guest_free(alloc, req_addr);
368
+
369
+ /* Read request */
370
+ req.type = VIRTIO_BLK_T_IN;
371
+ req.ioprio = 1;
372
+ req.sector = 0;
373
+ req.data = g_malloc0(512);
374
+
375
+ req_addr = virtio_blk_request(alloc, dev, &req, 512);
376
+
377
+ g_free(req.data);
378
+
379
+ free_head = qvirtqueue_add(qts, vq, req_addr, 16, false, true);
380
+ qvirtqueue_add(qts, vq, req_addr + 16, 512, true, true);
381
+ qvirtqueue_add(qts, vq, req_addr + 528, 1, true, false);
382
+
383
+ qvirtqueue_kick(qts, dev, vq, free_head);
384
+
385
+ qvirtio_wait_used_elem(qts, dev, vq, free_head, NULL,
386
+ QVIRTIO_BLK_TIMEOUT_US);
387
+ status = readb(req_addr + 528);
388
+ g_assert_cmpint(status, ==, 0);
389
+
390
+ data = g_malloc0(512);
391
+ qtest_memread(qts, req_addr + 16, data, 512);
392
+ g_assert_cmpstr(data, ==, "TEST");
393
+ g_free(data);
394
+
395
+ guest_free(alloc, req_addr);
396
+
397
+ if (features & (1u << VIRTIO_BLK_F_WRITE_ZEROES)) {
398
+ struct virtio_blk_discard_write_zeroes dwz_hdr;
399
+ void *expected;
400
+
401
+ /*
402
+ * WRITE_ZEROES request on the same sector as the previous test, where
403
+ * we wrote "TEST".
404
+ */
405
+ req.type = VIRTIO_BLK_T_WRITE_ZEROES;
406
+ req.data = (char *) &dwz_hdr;
407
+ dwz_hdr.sector = 0;
408
+ dwz_hdr.num_sectors = 1;
409
+ dwz_hdr.flags = 0;
410
+
411
+ virtio_blk_fix_dwz_hdr(dev, &dwz_hdr);
412
+
413
+ req_addr = virtio_blk_request(alloc, dev, &req, sizeof(dwz_hdr));
414
+
415
+ free_head = qvirtqueue_add(qts, vq, req_addr, 16, false, true);
416
+ qvirtqueue_add(qts, vq, req_addr + 16, sizeof(dwz_hdr), false, true);
417
+ qvirtqueue_add(qts, vq, req_addr + 16 + sizeof(dwz_hdr), 1, true,
418
+ false);
419
+
420
+ qvirtqueue_kick(qts, dev, vq, free_head);
421
+
422
+ qvirtio_wait_used_elem(qts, dev, vq, free_head, NULL,
423
+ QVIRTIO_BLK_TIMEOUT_US);
424
+ status = readb(req_addr + 16 + sizeof(dwz_hdr));
425
+ g_assert_cmpint(status, ==, 0);
426
+
427
+ guest_free(alloc, req_addr);
428
+
429
+ /* Read request to check if the sector contains all zeroes */
430
+ req.type = VIRTIO_BLK_T_IN;
431
+ req.ioprio = 1;
432
+ req.sector = 0;
433
+ req.data = g_malloc0(512);
434
+
435
+ req_addr = virtio_blk_request(alloc, dev, &req, 512);
436
+
437
+ g_free(req.data);
438
+
439
+ free_head = qvirtqueue_add(qts, vq, req_addr, 16, false, true);
440
+ qvirtqueue_add(qts, vq, req_addr + 16, 512, true, true);
441
+ qvirtqueue_add(qts, vq, req_addr + 528, 1, true, false);
442
+
443
+ qvirtqueue_kick(qts, dev, vq, free_head);
444
+
445
+ qvirtio_wait_used_elem(qts, dev, vq, free_head, NULL,
446
+ QVIRTIO_BLK_TIMEOUT_US);
447
+ status = readb(req_addr + 528);
448
+ g_assert_cmpint(status, ==, 0);
449
+
450
+ data = g_malloc(512);
451
+ expected = g_malloc0(512);
452
+ qtest_memread(qts, req_addr + 16, data, 512);
453
+ g_assert_cmpmem(data, 512, expected, 512);
454
+ g_free(expected);
455
+ g_free(data);
456
+
457
+ guest_free(alloc, req_addr);
458
+ }
459
+
460
+ if (features & (1u << VIRTIO_BLK_F_DISCARD)) {
461
+ struct virtio_blk_discard_write_zeroes dwz_hdr;
462
+
463
+ req.type = VIRTIO_BLK_T_DISCARD;
464
+ req.data = (char *) &dwz_hdr;
465
+ dwz_hdr.sector = 0;
466
+ dwz_hdr.num_sectors = 1;
467
+ dwz_hdr.flags = 0;
468
+
469
+ virtio_blk_fix_dwz_hdr(dev, &dwz_hdr);
470
+
471
+ req_addr = virtio_blk_request(alloc, dev, &req, sizeof(dwz_hdr));
472
+
473
+ free_head = qvirtqueue_add(qts, vq, req_addr, 16, false, true);
474
+ qvirtqueue_add(qts, vq, req_addr + 16, sizeof(dwz_hdr), false, true);
475
+ qvirtqueue_add(qts, vq, req_addr + 16 + sizeof(dwz_hdr),
476
+ 1, true, false);
477
+
478
+ qvirtqueue_kick(qts, dev, vq, free_head);
479
+
480
+ qvirtio_wait_used_elem(qts, dev, vq, free_head, NULL,
481
+ QVIRTIO_BLK_TIMEOUT_US);
482
+ status = readb(req_addr + 16 + sizeof(dwz_hdr));
483
+ g_assert_cmpint(status, ==, 0);
484
+
485
+ guest_free(alloc, req_addr);
486
+ }
487
+
488
+ if (features & (1u << VIRTIO_F_ANY_LAYOUT)) {
489
+ /* Write and read with 2 descriptor layout */
490
+ /* Write request */
491
+ req.type = VIRTIO_BLK_T_OUT;
492
+ req.ioprio = 1;
493
+ req.sector = 1;
494
+ req.data = g_malloc0(512);
495
+ strcpy(req.data, "TEST");
496
+
497
+ req_addr = virtio_blk_request(alloc, dev, &req, 512);
498
+
499
+ g_free(req.data);
500
+
501
+ free_head = qvirtqueue_add(qts, vq, req_addr, 528, false, true);
502
+ qvirtqueue_add(qts, vq, req_addr + 528, 1, true, false);
503
+ qvirtqueue_kick(qts, dev, vq, free_head);
504
+
505
+ qvirtio_wait_used_elem(qts, dev, vq, free_head, NULL,
506
+ QVIRTIO_BLK_TIMEOUT_US);
507
+ status = readb(req_addr + 528);
508
+ g_assert_cmpint(status, ==, 0);
509
+
510
+ guest_free(alloc, req_addr);
511
+
512
+ /* Read request */
513
+ req.type = VIRTIO_BLK_T_IN;
514
+ req.ioprio = 1;
515
+ req.sector = 1;
516
+ req.data = g_malloc0(512);
517
+
518
+ req_addr = virtio_blk_request(alloc, dev, &req, 512);
519
+
520
+ g_free(req.data);
521
+
522
+ free_head = qvirtqueue_add(qts, vq, req_addr, 16, false, true);
523
+ qvirtqueue_add(qts, vq, req_addr + 16, 513, true, false);
524
+
525
+ qvirtqueue_kick(qts, dev, vq, free_head);
526
+
527
+ qvirtio_wait_used_elem(qts, dev, vq, free_head, NULL,
528
+ QVIRTIO_BLK_TIMEOUT_US);
529
+ status = readb(req_addr + 528);
530
+ g_assert_cmpint(status, ==, 0);
531
+
532
+ data = g_malloc0(512);
533
+ qtest_memread(qts, req_addr + 16, data, 512);
534
+ g_assert_cmpstr(data, ==, "TEST");
535
+ g_free(data);
536
+
537
+ guest_free(alloc, req_addr);
538
+ }
539
+
540
+ return vq;
541
+}
542
+
543
+static void basic(void *obj, void *data, QGuestAllocator *t_alloc)
544
+{
545
+ QVhostUserBlk *blk_if = obj;
546
+ QVirtQueue *vq;
547
+
548
+ vq = test_basic(blk_if->vdev, t_alloc);
549
+ qvirtqueue_cleanup(blk_if->vdev->bus, vq, t_alloc);
550
+
551
+}
552
+
553
+static void indirect(void *obj, void *u_data, QGuestAllocator *t_alloc)
554
+{
555
+ QVirtQueue *vq;
556
+ QVhostUserBlk *blk_if = obj;
557
+ QVirtioDevice *dev = blk_if->vdev;
558
+ QVirtioBlkReq req;
559
+ QVRingIndirectDesc *indirect;
560
+ uint64_t req_addr;
561
+ uint64_t capacity;
562
+ uint64_t features;
563
+ uint32_t free_head;
564
+ uint8_t status;
565
+ char *data;
566
+ QTestState *qts = global_qtest;
567
+
568
+ features = qvirtio_get_features(dev);
569
+ g_assert_cmphex(features & (1u << VIRTIO_RING_F_INDIRECT_DESC), !=, 0);
570
+ features = features & ~(QVIRTIO_F_BAD_FEATURE |
571
+ (1u << VIRTIO_RING_F_EVENT_IDX) |
572
+ (1u << VIRTIO_BLK_F_SCSI));
573
+ qvirtio_set_features(dev, features);
574
+
575
+ capacity = qvirtio_config_readq(dev, 0);
576
+ g_assert_cmpint(capacity, ==, TEST_IMAGE_SIZE / 512);
577
+
578
+ vq = qvirtqueue_setup(dev, t_alloc, 0);
579
+ qvirtio_set_driver_ok(dev);
580
+
581
+ /* Write request */
582
+ req.type = VIRTIO_BLK_T_OUT;
583
+ req.ioprio = 1;
584
+ req.sector = 0;
585
+ req.data = g_malloc0(512);
586
+ strcpy(req.data, "TEST");
587
+
588
+ req_addr = virtio_blk_request(t_alloc, dev, &req, 512);
589
+
590
+ g_free(req.data);
591
+
592
+ indirect = qvring_indirect_desc_setup(qts, dev, t_alloc, 2);
593
+ qvring_indirect_desc_add(dev, qts, indirect, req_addr, 528, false);
594
+ qvring_indirect_desc_add(dev, qts, indirect, req_addr + 528, 1, true);
595
+ free_head = qvirtqueue_add_indirect(qts, vq, indirect);
596
+ qvirtqueue_kick(qts, dev, vq, free_head);
597
+
598
+ qvirtio_wait_used_elem(qts, dev, vq, free_head, NULL,
599
+ QVIRTIO_BLK_TIMEOUT_US);
600
+ status = readb(req_addr + 528);
601
+ g_assert_cmpint(status, ==, 0);
602
+
603
+ g_free(indirect);
604
+ guest_free(t_alloc, req_addr);
605
+
606
+ /* Read request */
607
+ req.type = VIRTIO_BLK_T_IN;
608
+ req.ioprio = 1;
609
+ req.sector = 0;
610
+ req.data = g_malloc0(512);
611
+ strcpy(req.data, "TEST");
612
+
613
+ req_addr = virtio_blk_request(t_alloc, dev, &req, 512);
614
+
615
+ g_free(req.data);
616
+
617
+ indirect = qvring_indirect_desc_setup(qts, dev, t_alloc, 2);
618
+ qvring_indirect_desc_add(dev, qts, indirect, req_addr, 16, false);
619
+ qvring_indirect_desc_add(dev, qts, indirect, req_addr + 16, 513, true);
620
+ free_head = qvirtqueue_add_indirect(qts, vq, indirect);
621
+ qvirtqueue_kick(qts, dev, vq, free_head);
622
+
623
+ qvirtio_wait_used_elem(qts, dev, vq, free_head, NULL,
624
+ QVIRTIO_BLK_TIMEOUT_US);
625
+ status = readb(req_addr + 528);
626
+ g_assert_cmpint(status, ==, 0);
627
+
628
+ data = g_malloc0(512);
629
+ qtest_memread(qts, req_addr + 16, data, 512);
630
+ g_assert_cmpstr(data, ==, "TEST");
631
+ g_free(data);
632
+
633
+ g_free(indirect);
634
+ guest_free(t_alloc, req_addr);
635
+ qvirtqueue_cleanup(dev->bus, vq, t_alloc);
636
+}
637
+
638
+static void idx(void *obj, void *u_data, QGuestAllocator *t_alloc)
639
+{
640
+ QVirtQueue *vq;
641
+ QVhostUserBlkPCI *blk = obj;
642
+ QVirtioPCIDevice *pdev = &blk->pci_vdev;
643
+ QVirtioDevice *dev = &pdev->vdev;
644
+ QVirtioBlkReq req;
645
+ uint64_t req_addr;
646
+ uint64_t capacity;
647
+ uint64_t features;
648
+ uint32_t free_head;
649
+ uint32_t write_head;
650
+ uint32_t desc_idx;
651
+ uint8_t status;
652
+ char *data;
653
+ QOSGraphObject *blk_object = obj;
654
+ QPCIDevice *pci_dev = blk_object->get_driver(blk_object, "pci-device");
655
+ QTestState *qts = global_qtest;
656
+
657
+ if (qpci_check_buggy_msi(pci_dev)) {
658
+ return;
659
+ }
660
+
661
+ qpci_msix_enable(pdev->pdev);
662
+ qvirtio_pci_set_msix_configuration_vector(pdev, t_alloc, 0);
663
+
664
+ features = qvirtio_get_features(dev);
665
+ features = features & ~(QVIRTIO_F_BAD_FEATURE |
666
+ (1u << VIRTIO_RING_F_INDIRECT_DESC) |
667
+ (1u << VIRTIO_F_NOTIFY_ON_EMPTY) |
668
+ (1u << VIRTIO_BLK_F_SCSI));
669
+ qvirtio_set_features(dev, features);
670
+
671
+ capacity = qvirtio_config_readq(dev, 0);
672
+ g_assert_cmpint(capacity, ==, TEST_IMAGE_SIZE / 512);
673
+
674
+ vq = qvirtqueue_setup(dev, t_alloc, 0);
675
+ qvirtqueue_pci_msix_setup(pdev, (QVirtQueuePCI *)vq, t_alloc, 1);
676
+
677
+ qvirtio_set_driver_ok(dev);
678
+
679
+ /* Write request */
680
+ req.type = VIRTIO_BLK_T_OUT;
681
+ req.ioprio = 1;
682
+ req.sector = 0;
683
+ req.data = g_malloc0(512);
684
+ strcpy(req.data, "TEST");
685
+
686
+ req_addr = virtio_blk_request(t_alloc, dev, &req, 512);
687
+
688
+ g_free(req.data);
689
+
690
+ free_head = qvirtqueue_add(qts, vq, req_addr, 16, false, true);
691
+ qvirtqueue_add(qts, vq, req_addr + 16, 512, false, true);
692
+ qvirtqueue_add(qts, vq, req_addr + 528, 1, true, false);
693
+ qvirtqueue_kick(qts, dev, vq, free_head);
694
+
695
+ qvirtio_wait_used_elem(qts, dev, vq, free_head, NULL,
696
+ QVIRTIO_BLK_TIMEOUT_US);
697
+
698
+ /* Write request */
699
+ req.type = VIRTIO_BLK_T_OUT;
700
+ req.ioprio = 1;
701
+ req.sector = 1;
702
+ req.data = g_malloc0(512);
703
+ strcpy(req.data, "TEST");
704
+
705
+ req_addr = virtio_blk_request(t_alloc, dev, &req, 512);
706
+
707
+ g_free(req.data);
708
+
709
+ /* Notify after processing the third request */
710
+ qvirtqueue_set_used_event(qts, vq, 2);
711
+ free_head = qvirtqueue_add(qts, vq, req_addr, 16, false, true);
712
+ qvirtqueue_add(qts, vq, req_addr + 16, 512, false, true);
713
+ qvirtqueue_add(qts, vq, req_addr + 528, 1, true, false);
714
+ qvirtqueue_kick(qts, dev, vq, free_head);
715
+ write_head = free_head;
716
+
717
+ /* No notification expected */
718
+ status = qvirtio_wait_status_byte_no_isr(qts, dev,
719
+ vq, req_addr + 528,
720
+ QVIRTIO_BLK_TIMEOUT_US);
721
+ g_assert_cmpint(status, ==, 0);
722
+
723
+ guest_free(t_alloc, req_addr);
724
+
725
+ /* Read request */
726
+ req.type = VIRTIO_BLK_T_IN;
727
+ req.ioprio = 1;
728
+ req.sector = 1;
729
+ req.data = g_malloc0(512);
730
+
731
+ req_addr = virtio_blk_request(t_alloc, dev, &req, 512);
732
+
733
+ g_free(req.data);
734
+
735
+ free_head = qvirtqueue_add(qts, vq, req_addr, 16, false, true);
736
+ qvirtqueue_add(qts, vq, req_addr + 16, 512, true, true);
737
+ qvirtqueue_add(qts, vq, req_addr + 528, 1, true, false);
738
+
739
+ qvirtqueue_kick(qts, dev, vq, free_head);
740
+
741
+ /* We get just one notification for both requests */
742
+ qvirtio_wait_used_elem(qts, dev, vq, write_head, NULL,
743
+ QVIRTIO_BLK_TIMEOUT_US);
744
+ g_assert(qvirtqueue_get_buf(qts, vq, &desc_idx, NULL));
745
+ g_assert_cmpint(desc_idx, ==, free_head);
746
+
747
+ status = readb(req_addr + 528);
748
+ g_assert_cmpint(status, ==, 0);
749
+
750
+ data = g_malloc0(512);
751
+ qtest_memread(qts, req_addr + 16, data, 512);
752
+ g_assert_cmpstr(data, ==, "TEST");
753
+ g_free(data);
754
+
755
+ guest_free(t_alloc, req_addr);
756
+
757
+ /* End test */
758
+ qpci_msix_disable(pdev->pdev);
759
+
760
+ qvirtqueue_cleanup(dev->bus, vq, t_alloc);
761
+}
762
+
763
+static void pci_hotplug(void *obj, void *data, QGuestAllocator *t_alloc)
764
+{
765
+ QVirtioPCIDevice *dev1 = obj;
766
+ QVirtioPCIDevice *dev;
767
+ QTestState *qts = dev1->pdev->bus->qts;
768
+
769
+ /* plug secondary disk */
770
+ qtest_qmp_device_add(qts, "vhost-user-blk-pci", "drv1",
771
+ "{'addr': %s, 'chardev': 'char2'}",
772
+ stringify(PCI_SLOT_HP) ".0");
773
+
774
+ dev = virtio_pci_new(dev1->pdev->bus,
775
+ &(QPCIAddress) { .devfn = QPCI_DEVFN(PCI_SLOT_HP, 0)
776
+ });
777
+ g_assert_nonnull(dev);
778
+ g_assert_cmpint(dev->vdev.device_type, ==, VIRTIO_ID_BLOCK);
779
+ qvirtio_pci_device_disable(dev);
780
+ qos_object_destroy((QOSGraphObject *)dev);
781
+
782
+ /* unplug secondary disk */
783
+ qpci_unplug_acpi_device_test(qts, "drv1", PCI_SLOT_HP);
784
+}
785
+
786
+/*
787
+ * Check that setting the vring addr on a non-existent virtqueue does
788
+ * not crash.
789
+ */
790
+static void test_nonexistent_virtqueue(void *obj, void *data,
791
+ QGuestAllocator *t_alloc)
792
+{
793
+ QVhostUserBlkPCI *blk = obj;
794
+ QVirtioPCIDevice *pdev = &blk->pci_vdev;
795
+ QPCIBar bar0;
796
+ QPCIDevice *dev;
797
+
798
+ dev = qpci_device_find(pdev->pdev->bus, QPCI_DEVFN(4, 0));
799
+ g_assert(dev != NULL);
800
+ qpci_device_enable(dev);
801
+
802
+ bar0 = qpci_iomap(dev, 0, NULL);
803
+
804
+ qpci_io_writeb(dev, bar0, VIRTIO_PCI_QUEUE_SEL, 2);
805
+ qpci_io_writel(dev, bar0, VIRTIO_PCI_QUEUE_PFN, 1);
806
+
807
+ g_free(dev);
808
+}
809
+
810
+static const char *qtest_qemu_storage_daemon_binary(void)
811
+{
812
+ const char *qemu_storage_daemon_bin;
813
+
814
+ qemu_storage_daemon_bin = getenv("QTEST_QEMU_STORAGE_DAEMON_BINARY");
815
+ if (!qemu_storage_daemon_bin) {
816
+ fprintf(stderr, "Environment variable "
817
+ "QTEST_QEMU_STORAGE_DAEMON_BINARY required\n");
818
+ exit(0);
819
+ }
820
+
821
+ return qemu_storage_daemon_bin;
822
+}
823
+
824
+/* g_test_queue_destroy() cleanup function for files */
825
+static void destroy_file(void *path)
826
+{
827
+ unlink(path);
828
+ g_free(path);
829
+ qos_invalidate_command_line();
830
+}
831
+
832
+static char *drive_create(void)
833
+{
834
+ int fd, ret;
835
+ /** vhost-user-blk won't recognize a drive located in /tmp */
836
+ char *t_path = g_strdup("qtest.XXXXXX");
837
+
838
+ /** Create a temporary raw image */
839
+ fd = mkstemp(t_path);
840
+ g_assert_cmpint(fd, >=, 0);
841
+ ret = ftruncate(fd, TEST_IMAGE_SIZE);
842
+ g_assert_cmpint(ret, ==, 0);
843
+ close(fd);
844
+
845
+ g_test_queue_destroy(destroy_file, t_path);
846
+ return t_path;
847
+}
848
+
849
+static char *create_listen_socket(int *fd)
850
+{
851
+ int tmp_fd;
852
+ char *path;
853
+
854
+ /* No race because our pid makes the path unique */
855
+ path = g_strdup_printf("/tmp/qtest-%d-sock.XXXXXX", getpid());
856
+ tmp_fd = mkstemp(path);
857
+ g_assert_cmpint(tmp_fd, >=, 0);
858
+ close(tmp_fd);
859
+ unlink(path);
860
+
861
+ *fd = qtest_socket_server(path);
862
+ g_test_queue_destroy(destroy_file, path);
863
+ return path;
864
+}
865
+
866
+/*
867
+ * g_test_queue_destroy() and qtest_add_abrt_handler() cleanup function for
868
+ * qemu-storage-daemon.
869
+ */
870
+static void quit_storage_daemon(void *data)
871
+{
872
+ QemuStorageDaemonState *qsd = data;
873
+ int wstatus;
874
+ pid_t pid;
875
+
876
+ /*
877
+ * If we were invoked as a g_test_queue_destroy() cleanup function we need
878
+ * to remove the abrt handler to avoid being called again if the code below
879
+ * aborts. Also, we must not leave the abrt handler installed after
880
+ * cleanup.
881
+ */
882
+ qtest_remove_abrt_handler(data);
883
+
884
+ /* Before quitting storage-daemon, quit qemu to avoid dubious messages */
885
+ qtest_kill_qemu(global_qtest);
886
+
887
+ kill(qsd->pid, SIGTERM);
888
+ pid = waitpid(qsd->pid, &wstatus, 0);
889
+ g_assert_cmpint(pid, ==, qsd->pid);
890
+ if (!WIFEXITED(wstatus)) {
891
+ fprintf(stderr, "%s: expected qemu-storage-daemon to exit\n",
892
+ __func__);
893
+ abort();
894
+ }
895
+ if (WEXITSTATUS(wstatus) != 0) {
896
+ fprintf(stderr, "%s: expected qemu-storage-daemon to exit "
897
+ "successfully, got %d\n",
898
+ __func__, WEXITSTATUS(wstatus));
899
+ abort();
900
+ }
901
+
902
+ g_free(data);
903
+}
904
+
905
+static void start_vhost_user_blk(GString *cmd_line, int vus_instances)
906
+{
907
+ const char *vhost_user_blk_bin = qtest_qemu_storage_daemon_binary();
908
+ int i;
909
+ gchar *img_path;
910
+ GString *storage_daemon_command = g_string_new(NULL);
911
+ QemuStorageDaemonState *qsd;
912
+
913
+ g_string_append_printf(storage_daemon_command,
914
+ "exec %s ",
915
+ vhost_user_blk_bin);
916
+
917
+ g_string_append_printf(cmd_line,
918
+ " -object memory-backend-memfd,id=mem,size=256M,share=on "
919
+ " -M memory-backend=mem -m 256M ");
920
+
921
+ for (i = 0; i < vus_instances; i++) {
922
+ int fd;
923
+ char *sock_path = create_listen_socket(&fd);
924
+
925
+ /* create image file */
926
+ img_path = drive_create();
927
+ g_string_append_printf(storage_daemon_command,
928
+ "--blockdev driver=file,node-name=disk%d,filename=%s "
929
+ "--export type=vhost-user-blk,id=disk%d,addr.type=unix,addr.path=%s,"
930
+ "node-name=disk%i,writable=on ",
931
+ i, img_path, i, sock_path, i);
932
+
933
+ g_string_append_printf(cmd_line, "-chardev socket,id=char%d,path=%s ",
934
+ i + 1, sock_path);
935
+ }
936
+
937
+ g_test_message("starting vhost-user backend: %s",
938
+ storage_daemon_command->str);
939
+ pid_t pid = fork();
940
+ if (pid == 0) {
941
+ /*
942
+ * Close standard file descriptors so tap-driver.pl pipe detects when
943
+ * our parent terminates.
944
+ */
945
+ close(0);
946
+ close(1);
947
+ open("/dev/null", O_RDONLY);
948
+ open("/dev/null", O_WRONLY);
949
+
950
+ execlp("/bin/sh", "sh", "-c", storage_daemon_command->str, NULL);
951
+ exit(1);
952
+ }
953
+ g_string_free(storage_daemon_command, true);
954
+
955
+ qsd = g_new(QemuStorageDaemonState, 1);
956
+ qsd->pid = pid;
957
+
958
+ /* Make sure qemu-storage-daemon is stopped */
959
+ qtest_add_abrt_handler(quit_storage_daemon, qsd);
960
+ g_test_queue_destroy(quit_storage_daemon, qsd);
961
+}
962
+
963
+static void *vhost_user_blk_test_setup(GString *cmd_line, void *arg)
964
+{
965
+ start_vhost_user_blk(cmd_line, 1);
966
+ return arg;
967
+}
968
+
969
+/*
970
+ * Setup for hotplug.
971
+ *
972
+ * Since the vhost-user server only serves one vhost-user client at a time,
973
+ * another export is needed for the hotplugged device.
974
+ *
975
+ */
976
+static void *vhost_user_blk_hotplug_test_setup(GString *cmd_line, void *arg)
977
+{
978
+ /* "-chardev socket,id=char2" is used for pci_hotplug*/
979
+ start_vhost_user_blk(cmd_line, 2);
980
+ return arg;
981
+}
982
+
983
+static void register_vhost_user_blk_test(void)
984
+{
985
+ QOSGraphTestOptions opts = {
986
+ .before = vhost_user_blk_test_setup,
987
+ };
988
+
989
+ /*
990
+ * tests for vhost-user-blk and vhost-user-blk-pci
991
+ * The tests are borrowed from tests/virtio-blk-test.c, but tests that
992
+ * rely on block_resize don't work for vhost-user-blk:
993
+ * the vhost-user-blk device doesn't have -drive, so the following
994
+ * tests are omitted:
995
+ * - config
996
+ * - resize
997
+ */
998
+ qos_add_test("basic", "vhost-user-blk", basic, &opts);
999
+ qos_add_test("indirect", "vhost-user-blk", indirect, &opts);
1000
+ qos_add_test("idx", "vhost-user-blk-pci", idx, &opts);
1001
+ qos_add_test("nxvirtq", "vhost-user-blk-pci",
1002
+ test_nonexistent_virtqueue, &opts);
1003
+
1004
+ opts.before = vhost_user_blk_hotplug_test_setup;
1005
+ qos_add_test("hotplug", "vhost-user-blk-pci", pci_hotplug, &opts);
1006
+}
1007
+
1008
+libqos_init(register_vhost_user_blk_test);
1009
diff --git a/MAINTAINERS b/MAINTAINERS
14
index XXXXXXX..XXXXXXX 100644
1010
index XXXXXXX..XXXXXXX 100644
15
--- a/block/qed-cluster.c
1011
--- a/MAINTAINERS
16
+++ b/block/qed-cluster.c
1012
+++ b/MAINTAINERS
17
@@ -XXX,XX +XXX,XX @@ static unsigned int qed_count_contiguous_clusters(BDRVQEDState *s,
1013
@@ -XXX,XX +XXX,XX @@ F: block/export/vhost-user-blk-server.c
18
* On failure QED_CLUSTER_L2 or QED_CLUSTER_L1 is returned for missing L2 or L1
1014
F: block/export/vhost-user-blk-server.h
19
* table offset, respectively. len is number of contiguous unallocated bytes.
1015
F: include/qemu/vhost-user-server.h
20
*/
1016
F: tests/qtest/libqos/vhost-user-blk.c
21
-int qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
1017
+F: tests/qtest/libqos/vhost-user-blk.h
22
- size_t *len, uint64_t *img_offset)
1018
+F: tests/qtest/vhost-user-blk-test.c
23
+int coroutine_fn qed_find_cluster(BDRVQEDState *s, QEDRequest *request,
1019
F: util/vhost-user-server.c
24
+ uint64_t pos, size_t *len,
1020
25
+ uint64_t *img_offset)
1021
FUSE block device exports
26
{
1022
diff --git a/tests/qtest/libqos/meson.build b/tests/qtest/libqos/meson.build
27
uint64_t l2_offset;
28
uint64_t offset = 0;
29
diff --git a/block/qed.c b/block/qed.c
30
index XXXXXXX..XXXXXXX 100644
1023
index XXXXXXX..XXXXXXX 100644
31
--- a/block/qed.c
1024
--- a/tests/qtest/libqos/meson.build
32
+++ b/block/qed.c
1025
+++ b/tests/qtest/libqos/meson.build
33
@@ -XXX,XX +XXX,XX @@ int qed_write_header_sync(BDRVQEDState *s)
1026
@@ -XXX,XX +XXX,XX @@ libqos_srcs = files('../libqtest.c',
34
* This function only updates known header fields in-place and does not affect
1027
'virtio-9p.c',
35
* extra data after the QED header.
1028
'virtio-balloon.c',
36
*/
1029
'virtio-blk.c',
37
-static int qed_write_header(BDRVQEDState *s)
1030
+ 'vhost-user-blk.c',
38
+static int coroutine_fn qed_write_header(BDRVQEDState *s)
1031
'virtio-mmio.c',
39
{
1032
'virtio-net.c',
40
/* We must write full sectors for O_DIRECT but cannot necessarily generate
1033
'virtio-pci.c',
41
* the data following the header if an unrecognized compat feature is
1034
diff --git a/tests/qtest/meson.build b/tests/qtest/meson.build
42
@@ -XXX,XX +XXX,XX @@ static void qed_unplug_allocating_write_reqs(BDRVQEDState *s)
43
qemu_co_enter_next(&s->allocating_write_reqs);
44
}
45
46
-static void qed_need_check_timer_entry(void *opaque)
47
+static void coroutine_fn qed_need_check_timer_entry(void *opaque)
48
{
49
BDRVQEDState *s = opaque;
50
int ret;
51
@@ -XXX,XX +XXX,XX @@ static BDRVQEDState *acb_to_s(QEDAIOCB *acb)
52
* This function reads qiov->size bytes starting at pos from the backing file.
53
* If there is no backing file then zeroes are read.
54
*/
55
-static int qed_read_backing_file(BDRVQEDState *s, uint64_t pos,
56
- QEMUIOVector *qiov,
57
- QEMUIOVector **backing_qiov)
58
+static int coroutine_fn qed_read_backing_file(BDRVQEDState *s, uint64_t pos,
59
+ QEMUIOVector *qiov,
60
+ QEMUIOVector **backing_qiov)
61
{
62
uint64_t backing_length = 0;
63
size_t size;
64
@@ -XXX,XX +XXX,XX @@ static int qed_read_backing_file(BDRVQEDState *s, uint64_t pos,
65
* @len: Number of bytes
66
* @offset: Byte offset in image file
67
*/
68
-static int qed_copy_from_backing_file(BDRVQEDState *s, uint64_t pos,
69
- uint64_t len, uint64_t offset)
70
+static int coroutine_fn qed_copy_from_backing_file(BDRVQEDState *s,
71
+ uint64_t pos, uint64_t len,
72
+ uint64_t offset)
73
{
74
QEMUIOVector qiov;
75
QEMUIOVector *backing_qiov = NULL;
76
@@ -XXX,XX +XXX,XX @@ out:
77
* The cluster offset may be an allocated byte offset in the image file, the
78
* zero cluster marker, or the unallocated cluster marker.
79
*/
80
-static void qed_update_l2_table(BDRVQEDState *s, QEDTable *table, int index,
81
- unsigned int n, uint64_t cluster)
82
+static void coroutine_fn qed_update_l2_table(BDRVQEDState *s, QEDTable *table,
83
+ int index, unsigned int n,
84
+ uint64_t cluster)
85
{
86
int i;
87
for (i = index; i < index + n; i++) {
88
@@ -XXX,XX +XXX,XX @@ static void qed_update_l2_table(BDRVQEDState *s, QEDTable *table, int index,
89
}
90
}
91
92
-static void qed_aio_complete(QEDAIOCB *acb)
93
+static void coroutine_fn qed_aio_complete(QEDAIOCB *acb)
94
{
95
BDRVQEDState *s = acb_to_s(acb);
96
97
@@ -XXX,XX +XXX,XX @@ static void qed_aio_complete(QEDAIOCB *acb)
98
/**
99
* Update L1 table with new L2 table offset and write it out
100
*/
101
-static int qed_aio_write_l1_update(QEDAIOCB *acb)
102
+static int coroutine_fn qed_aio_write_l1_update(QEDAIOCB *acb)
103
{
104
BDRVQEDState *s = acb_to_s(acb);
105
CachedL2Table *l2_table = acb->request.l2_table;
106
@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_l1_update(QEDAIOCB *acb)
107
/**
108
* Update L2 table with new cluster offsets and write them out
109
*/
110
-static int qed_aio_write_l2_update(QEDAIOCB *acb, uint64_t offset)
111
+static int coroutine_fn qed_aio_write_l2_update(QEDAIOCB *acb, uint64_t offset)
112
{
113
BDRVQEDState *s = acb_to_s(acb);
114
bool need_alloc = acb->find_cluster_ret == QED_CLUSTER_L1;
115
@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_l2_update(QEDAIOCB *acb, uint64_t offset)
116
/**
117
* Write data to the image file
118
*/
119
-static int qed_aio_write_main(QEDAIOCB *acb)
120
+static int coroutine_fn qed_aio_write_main(QEDAIOCB *acb)
121
{
122
BDRVQEDState *s = acb_to_s(acb);
123
uint64_t offset = acb->cur_cluster +
124
@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_main(QEDAIOCB *acb)
125
/**
126
* Populate untouched regions of new data cluster
127
*/
128
-static int qed_aio_write_cow(QEDAIOCB *acb)
129
+static int coroutine_fn qed_aio_write_cow(QEDAIOCB *acb)
130
{
131
BDRVQEDState *s = acb_to_s(acb);
132
uint64_t start, len, offset;
133
@@ -XXX,XX +XXX,XX @@ static bool qed_should_set_need_check(BDRVQEDState *s)
134
*
135
* This path is taken when writing to previously unallocated clusters.
136
*/
137
-static int qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
138
+static int coroutine_fn qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
139
{
140
BDRVQEDState *s = acb_to_s(acb);
141
int ret;
142
@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
143
*
144
* This path is taken when writing to already allocated clusters.
145
*/
146
-static int qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset, size_t len)
147
+static int coroutine_fn qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset,
148
+ size_t len)
149
{
150
/* Allocate buffer for zero writes */
151
if (acb->flags & QED_AIOCB_ZERO) {
152
@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset, size_t len)
153
* @offset: Cluster offset in bytes
154
* @len: Length in bytes
155
*/
156
-static int qed_aio_write_data(void *opaque, int ret,
157
- uint64_t offset, size_t len)
158
+static int coroutine_fn qed_aio_write_data(void *opaque, int ret,
159
+ uint64_t offset, size_t len)
160
{
161
QEDAIOCB *acb = opaque;
162
163
@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_data(void *opaque, int ret,
164
* @offset: Cluster offset in bytes
165
* @len: Length in bytes
166
*/
167
-static int qed_aio_read_data(void *opaque, int ret, uint64_t offset, size_t len)
168
+static int coroutine_fn qed_aio_read_data(void *opaque, int ret,
169
+ uint64_t offset, size_t len)
170
{
171
QEDAIOCB *acb = opaque;
172
BDRVQEDState *s = acb_to_s(acb);
173
@@ -XXX,XX +XXX,XX @@ static int qed_aio_read_data(void *opaque, int ret, uint64_t offset, size_t len)
174
/**
175
* Begin next I/O or complete the request
176
*/
177
-static int qed_aio_next_io(QEDAIOCB *acb)
178
+static int coroutine_fn qed_aio_next_io(QEDAIOCB *acb)
179
{
180
BDRVQEDState *s = acb_to_s(acb);
181
uint64_t offset;
182
diff --git a/block/qed.h b/block/qed.h
183
index XXXXXXX..XXXXXXX 100644
1035
index XXXXXXX..XXXXXXX 100644
184
--- a/block/qed.h
1036
--- a/tests/qtest/meson.build
185
+++ b/block/qed.h
1037
+++ b/tests/qtest/meson.build
186
@@ -XXX,XX +XXX,XX @@ int qed_write_l2_table_sync(BDRVQEDState *s, QEDRequest *request,
1038
@@ -XXX,XX +XXX,XX @@ if have_virtfs
187
/**
1039
qos_test_ss.add(files('virtio-9p-test.c'))
188
* Cluster functions
1040
endif
189
*/
1041
qos_test_ss.add(when: 'CONFIG_VHOST_USER', if_true: files('vhost-user-test.c'))
190
-int qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
1042
+if have_vhost_user_blk_server
191
- size_t *len, uint64_t *img_offset);
1043
+ qos_test_ss.add(files('vhost-user-blk-test.c'))
192
+int coroutine_fn qed_find_cluster(BDRVQEDState *s, QEDRequest *request,
1044
+endif
193
+ uint64_t pos, size_t *len,
1045
194
+ uint64_t *img_offset);
1046
tpmemu_files = ['tpm-emu.c', 'tpm-util.c', 'tpm-tests.c']
195
1047
196
/**
1048
@@ -XXX,XX +XXX,XX @@ foreach dir : target_dirs
197
* Consistency check
1049
endif
1050
qtest_env.set('G_TEST_DBUS_DAEMON', meson.source_root() / 'tests/dbus-vmstate-daemon.sh')
1051
qtest_env.set('QTEST_QEMU_BINARY', './qemu-system-' + target_base)
1052
+ qtest_env.set('QTEST_QEMU_STORAGE_DAEMON_BINARY', './storage-daemon/qemu-storage-daemon')
1053
1054
foreach test : target_qtests
1055
# Executables are shared across targets, declare them only the first time we
198
--
1056
--
199
1.8.3.1
1057
2.29.2
200
1058
201
1059
1
Don't recurse into qed_aio_next_io() and qed_aio_complete() here, but
1
From: Stefan Hajnoczi <stefanha@redhat.com>
2
just return an error code and let the caller handle it.
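
The dispatcher then ends up with roughly this shape (condensed sketch of the
change below; -EINPROGRESS marks a request that was queued behind an existing
allocating write):

    ret = qed_aio_write_alloc(acb, len);
    if (ret < 0) {
        if (ret != -EINPROGRESS) {
            qed_aio_complete(acb, ret);  /* only the dispatcher completes */
        }
        return;                          /* queued request resumes later */
    }
    qed_aio_next_io(acb, 0);
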
3
2
3
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
4
Message-Id: <20210223144653.811468-7-stefanha@redhat.com>
4
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
5
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
5
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
6
---
6
---
7
block/qed.c | 43 ++++++++++++++++++++-----------------------
7
tests/qtest/vhost-user-blk-test.c | 81 +++++++++++++++++++++++++++++--
8
1 file changed, 20 insertions(+), 23 deletions(-)
8
1 file changed, 76 insertions(+), 5 deletions(-)
9
9
10
diff --git a/block/qed.c b/block/qed.c
10
diff --git a/tests/qtest/vhost-user-blk-test.c b/tests/qtest/vhost-user-blk-test.c
11
index XXXXXXX..XXXXXXX 100644
11
index XXXXXXX..XXXXXXX 100644
12
--- a/block/qed.c
12
--- a/tests/qtest/vhost-user-blk-test.c
13
+++ b/block/qed.c
13
+++ b/tests/qtest/vhost-user-blk-test.c
14
@@ -XXX,XX +XXX,XX @@ static bool qed_should_set_need_check(BDRVQEDState *s)
14
@@ -XXX,XX +XXX,XX @@ static void pci_hotplug(void *obj, void *data, QGuestAllocator *t_alloc)
15
*
15
qpci_unplug_acpi_device_test(qts, "drv1", PCI_SLOT_HP);
16
* This path is taken when writing to previously unallocated clusters.
16
}
17
*/
17
18
-static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
18
+static void multiqueue(void *obj, void *data, QGuestAllocator *t_alloc)
19
+static int qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
19
+{
20
+ QVirtioPCIDevice *pdev1 = obj;
21
+ QVirtioDevice *dev1 = &pdev1->vdev;
22
+ QVirtioPCIDevice *pdev8;
23
+ QVirtioDevice *dev8;
24
+ QTestState *qts = pdev1->pdev->bus->qts;
25
+ uint64_t features;
26
+ uint16_t num_queues;
27
+
28
+ /*
29
+ * The primary device has 1 queue and VIRTIO_BLK_F_MQ is not enabled. The
30
+ * VIRTIO specification allows VIRTIO_BLK_F_MQ to be enabled when there is
31
+ * only 1 virtqueue, but --device vhost-user-blk-pci doesn't do this (which
32
+ * is also spec-compliant).
33
+ */
34
+ features = qvirtio_get_features(dev1);
35
+ g_assert_cmpint(features & (1u << VIRTIO_BLK_F_MQ), ==, 0);
36
+ features = features & ~(QVIRTIO_F_BAD_FEATURE |
37
+ (1u << VIRTIO_RING_F_INDIRECT_DESC) |
38
+ (1u << VIRTIO_F_NOTIFY_ON_EMPTY) |
39
+ (1u << VIRTIO_BLK_F_SCSI));
40
+ qvirtio_set_features(dev1, features);
41
+
42
+ /* Hotplug a secondary device with 8 queues */
43
+ qtest_qmp_device_add(qts, "vhost-user-blk-pci", "drv1",
44
+ "{'addr': %s, 'chardev': 'char2', 'num-queues': 8}",
45
+ stringify(PCI_SLOT_HP) ".0");
46
+
47
+ pdev8 = virtio_pci_new(pdev1->pdev->bus,
48
+ &(QPCIAddress) {
49
+ .devfn = QPCI_DEVFN(PCI_SLOT_HP, 0)
50
+ });
51
+ g_assert_nonnull(pdev8);
52
+ g_assert_cmpint(pdev8->vdev.device_type, ==, VIRTIO_ID_BLOCK);
53
+
54
+ qos_object_start_hw(&pdev8->obj);
55
+
56
+ dev8 = &pdev8->vdev;
57
+ features = qvirtio_get_features(dev8);
58
+ g_assert_cmpint(features & (1u << VIRTIO_BLK_F_MQ),
59
+ ==,
60
+ (1u << VIRTIO_BLK_F_MQ));
61
+ features = features & ~(QVIRTIO_F_BAD_FEATURE |
62
+ (1u << VIRTIO_RING_F_INDIRECT_DESC) |
63
+ (1u << VIRTIO_F_NOTIFY_ON_EMPTY) |
64
+ (1u << VIRTIO_BLK_F_SCSI) |
65
+ (1u << VIRTIO_BLK_F_MQ));
66
+ qvirtio_set_features(dev8, features);
67
+
68
+ num_queues = qvirtio_config_readw(dev8,
69
+ offsetof(struct virtio_blk_config, num_queues));
70
+ g_assert_cmpint(num_queues, ==, 8);
71
+
72
+ qvirtio_pci_device_disable(pdev8);
73
+ qos_object_destroy(&pdev8->obj);
74
+
75
+ /* unplug secondary disk */
76
+ qpci_unplug_acpi_device_test(qts, "drv1", PCI_SLOT_HP);
77
+}
78
+
79
/*
80
* Check that setting the vring addr on a non-existent virtqueue does
81
* not crash.
82
@@ -XXX,XX +XXX,XX @@ static void quit_storage_daemon(void *data)
83
g_free(data);
84
}
85
86
-static void start_vhost_user_blk(GString *cmd_line, int vus_instances)
87
+static void start_vhost_user_blk(GString *cmd_line, int vus_instances,
88
+ int num_queues)
20
{
89
{
21
BDRVQEDState *s = acb_to_s(acb);
90
const char *vhost_user_blk_bin = qtest_qemu_storage_daemon_binary();
22
int ret;
91
int i;
23
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
92
@@ -XXX,XX +XXX,XX @@ static void start_vhost_user_blk(GString *cmd_line, int vus_instances)
24
}
93
g_string_append_printf(storage_daemon_command,
25
if (acb != QSIMPLEQ_FIRST(&s->allocating_write_reqs) ||
94
"--blockdev driver=file,node-name=disk%d,filename=%s "
26
s->allocating_write_reqs_plugged) {
95
"--export type=vhost-user-blk,id=disk%d,addr.type=unix,addr.path=%s,"
27
- return; /* wait for existing request to finish */
96
- "node-name=disk%i,writable=on ",
28
+ return -EINPROGRESS; /* wait for existing request to finish */
97
- i, img_path, i, sock_path, i);
29
}
98
+ "node-name=disk%i,writable=on,num-queues=%d ",
30
99
+ i, img_path, i, sock_path, i, num_queues);
31
acb->cur_nclusters = qed_bytes_to_clusters(s,
100
32
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
101
g_string_append_printf(cmd_line, "-chardev socket,id=char%d,path=%s ",
33
if (acb->flags & QED_AIOCB_ZERO) {
102
i + 1, sock_path);
34
/* Skip ahead if the clusters are already zero */
103
@@ -XXX,XX +XXX,XX @@ static void start_vhost_user_blk(GString *cmd_line, int vus_instances)
35
if (acb->find_cluster_ret == QED_CLUSTER_ZERO) {
104
36
- qed_aio_start_io(acb);
105
static void *vhost_user_blk_test_setup(GString *cmd_line, void *arg)
37
- return;
106
{
38
+ return 0;
107
- start_vhost_user_blk(cmd_line, 1);
39
}
108
+ start_vhost_user_blk(cmd_line, 1, 1);
40
} else {
109
return arg;
41
acb->cur_cluster = qed_alloc_clusters(s, acb->cur_nclusters);
42
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
43
s->header.features |= QED_F_NEED_CHECK;
44
ret = qed_write_header(s);
45
if (ret < 0) {
46
- qed_aio_complete(acb, ret);
47
- return;
48
+ return ret;
49
}
50
}
51
52
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
53
ret = qed_aio_write_cow(acb);
54
}
55
if (ret < 0) {
56
- qed_aio_complete(acb, ret);
57
- return;
58
+ return ret;
59
}
60
- qed_aio_next_io(acb, 0);
61
+ return 0;
62
}
110
}
63
111
64
/**
112
@@ -XXX,XX +XXX,XX @@ static void *vhost_user_blk_test_setup(GString *cmd_line, void *arg)
65
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
113
static void *vhost_user_blk_hotplug_test_setup(GString *cmd_line, void *arg)
66
*
67
* This path is taken when writing to already allocated clusters.
68
*/
69
-static void qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset, size_t len)
70
+static int qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset, size_t len)
71
{
114
{
72
- int ret;
115
/* "-chardev socket,id=char2" is used for pci_hotplug*/
73
-
116
- start_vhost_user_blk(cmd_line, 2);
74
/* Allocate buffer for zero writes */
117
+ start_vhost_user_blk(cmd_line, 2, 1);
75
if (acb->flags & QED_AIOCB_ZERO) {
118
+ return arg;
76
struct iovec *iov = acb->qiov->iov;
119
+}
77
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset, size_t len)
120
+
78
if (!iov->iov_base) {
121
+static void *vhost_user_blk_multiqueue_test_setup(GString *cmd_line, void *arg)
79
iov->iov_base = qemu_try_blockalign(acb->common.bs, iov->iov_len);
122
+{
80
if (iov->iov_base == NULL) {
123
+ start_vhost_user_blk(cmd_line, 2, 8);
81
- qed_aio_complete(acb, -ENOMEM);
124
return arg;
82
- return;
83
+ return -ENOMEM;
84
}
85
memset(iov->iov_base, 0, iov->iov_len);
86
}
87
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset, size_t len)
88
qemu_iovec_concat(&acb->cur_qiov, acb->qiov, acb->qiov_offset, len);
89
90
/* Do the actual write */
91
- ret = qed_aio_write_main(acb);
92
- if (ret < 0) {
93
- qed_aio_complete(acb, ret);
94
- return;
95
- }
96
- qed_aio_next_io(acb, 0);
97
+ return qed_aio_write_main(acb);
98
}
125
}
99
126
100
/**
127
@@ -XXX,XX +XXX,XX @@ static void register_vhost_user_blk_test(void)
101
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_data(void *opaque, int ret,
128
102
129
opts.before = vhost_user_blk_hotplug_test_setup;
103
switch (ret) {
130
qos_add_test("hotplug", "vhost-user-blk-pci", pci_hotplug, &opts);
104
case QED_CLUSTER_FOUND:
105
- qed_aio_write_inplace(acb, offset, len);
106
+ ret = qed_aio_write_inplace(acb, offset, len);
107
break;
108
109
case QED_CLUSTER_L2:
110
case QED_CLUSTER_L1:
111
case QED_CLUSTER_ZERO:
112
- qed_aio_write_alloc(acb, len);
113
+ ret = qed_aio_write_alloc(acb, len);
114
break;
115
116
default:
117
- qed_aio_complete(acb, ret);
118
+ assert(ret < 0);
119
break;
120
}
121
+
131
+
122
+ if (ret < 0) {
132
+ opts.before = vhost_user_blk_multiqueue_test_setup;
123
+ if (ret != -EINPROGRESS) {
133
+ qos_add_test("multiqueue", "vhost-user-blk-pci", multiqueue, &opts);
124
+ qed_aio_complete(acb, ret);
125
+ }
126
+ return;
127
+ }
128
+ qed_aio_next_io(acb, 0);
129
}
134
}
130
135
131
/**
136
libqos_init(register_vhost_user_blk_test);
132
--
137
--
133
1.8.3.1
138
2.29.2
134
139
135
140
1
From: Stefan Hajnoczi <stefanha@redhat.com>
1
From: Stefan Hajnoczi <stefanha@redhat.com>
2
2
3
Calling aio_poll() directly may have been fine previously, but this is
3
The config->blk_size field is little-endian. Use the native-endian
4
the future, man! The difference between an aio_poll() loop and
4
blk_size variable to avoid double byteswapping.
5
BDRV_POLL_WHILE() is that BDRV_POLL_WHILE() releases the AioContext
6
around aio_poll().
7
5
8
This allows the IOThread to run fd handlers or BHs to complete the
6
Fixes: 11f60f7eaee2630dd6fa0c3a8c49f792e46c4cf1 ("block/export: make vhost-user-blk config space little-endian")
9
request. Failure to release the AioContext causes deadlocks.
10
11
Using BDRV_POLL_WHILE() partially fixes a 'savevm' hang with -object
12
iothread.
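
For comparison, the two waiting styles look roughly like this (illustrative
fragment only; data.ret follows the code being patched):

    /* Holding the AioContext lock for the whole wait can block an IOThread
     * that needs the lock to complete the request:
     */
    while (data.ret == -EINPROGRESS) {
        aio_poll(bdrv_get_aio_context(bs), true);
    }

    /* BDRV_POLL_WHILE() releases the AioContext around each aio_poll(), so
     * the IOThread can run its fd handlers and BHs:
     */
    BDRV_POLL_WHILE(bs, data.ret == -EINPROGRESS);
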
13
14
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
7
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
15
Reviewed-by: Eric Blake <eblake@redhat.com>
8
Message-Id: <20210223144653.811468-8-stefanha@redhat.com>
16
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
17
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
9
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
18
---
10
---
19
block/io.c | 4 +---
11
block/export/vhost-user-blk-server.c | 2 +-
20
1 file changed, 1 insertion(+), 3 deletions(-)
12
1 file changed, 1 insertion(+), 1 deletion(-)
21
13
22
diff --git a/block/io.c b/block/io.c
14
diff --git a/block/export/vhost-user-blk-server.c b/block/export/vhost-user-blk-server.c
23
index XXXXXXX..XXXXXXX 100644
15
index XXXXXXX..XXXXXXX 100644
24
--- a/block/io.c
16
--- a/block/export/vhost-user-blk-server.c
25
+++ b/block/io.c
17
+++ b/block/export/vhost-user-blk-server.c
26
@@ -XXX,XX +XXX,XX @@ bdrv_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
18
@@ -XXX,XX +XXX,XX @@ vu_blk_initialize_config(BlockDriverState *bs,
27
Coroutine *co = qemu_coroutine_create(bdrv_co_rw_vmstate_entry, &data);
19
config->num_queues = cpu_to_le16(num_queues);
28
20
config->max_discard_sectors = cpu_to_le32(32768);
29
bdrv_coroutine_enter(bs, co);
21
config->max_discard_seg = cpu_to_le32(1);
30
- while (data.ret == -EINPROGRESS) {
22
- config->discard_sector_alignment = cpu_to_le32(config->blk_size >> 9);
31
- aio_poll(bdrv_get_aio_context(bs), true);
23
+ config->discard_sector_alignment = cpu_to_le32(blk_size >> 9);
32
- }
24
config->max_write_zeroes_sectors = cpu_to_le32(32768);
33
+ BDRV_POLL_WHILE(bs, data.ret == -EINPROGRESS);
25
config->max_write_zeroes_seg = cpu_to_le32(1);
34
return data.ret;
35
}
36
}
26
}
37
--
27
--
38
1.8.3.1
28
2.29.2
39
29
40
30
1
From: Stefan Hajnoczi <stefanha@redhat.com>
1
From: Stefan Hajnoczi <stefanha@redhat.com>
2
2
3
blk/bdrv_drain_all() only takes effect for a single instant and then
3
Use VIRTIO_BLK_SECTOR_BITS and VIRTIO_BLK_SECTOR_SIZE when dealing with
4
resumes block jobs, guest devices, and other external clients like the
4
virtio-blk sector numbers. Although the values happen to be the same as
5
NBD server. This can be handy when performing a synchronous drain
5
BDRV_SECTOR_BITS and BDRV_SECTOR_SIZE, they are conceptually different.
6
before terminating the program, for example.
6
This makes it clearer when we are dealing with virtio-blk sector units.
7
7
8
Monitor commands usually need to quiesce I/O across an entire code
8
Use VIRTIO_BLK_SECTOR_BITS in vu_blk_initialize_config(). Later patches
9
region so blk/bdrv_drain_all() is not suitable. They must use
9
will use the new constants in the virtqueue request processing code
10
bdrv_drain_all_begin/end() to mark the region. This prevents new I/O
10
path.
11
requests from slipping in or worse - block jobs completing and modifying
12
the graph.
13
11
14
I audited other blk/bdrv_drain_all() callers but did not find anything
12
Suggested-by: Max Reitz <mreitz@redhat.com>
15
that needs a similar fix. This patch fixes the savevm/loadvm commands.
16
Although I haven't encountered a real-world issue, this makes the code
17
safer.
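
The resulting pattern for such commands is (minimal sketch; do_quiesced_work()
is a placeholder for e.g. writing or loading the snapshot):

    bdrv_drain_all_begin();      /* no new I/O, block jobs stay quiesced */

    ret = do_quiesced_work();    /* hypothetical: operate on stable disks */

    bdrv_drain_all_end();        /* resume jobs, guest devices, NBD clients */
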
18
19
Suggested-by: Kevin Wolf <kwolf@redhat.com>
20
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
13
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
21
Reviewed-by: Eric Blake <eblake@redhat.com>
14
Message-Id: <20210223144653.811468-9-stefanha@redhat.com>
22
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
15
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
23
---
16
---
24
migration/savevm.c | 18 +++++++++++++++---
17
block/export/vhost-user-blk-server.c | 15 ++++++++++++---
25
1 file changed, 15 insertions(+), 3 deletions(-)
18
1 file changed, 12 insertions(+), 3 deletions(-)
26
19
27
diff --git a/migration/savevm.c b/migration/savevm.c
20
diff --git a/block/export/vhost-user-blk-server.c b/block/export/vhost-user-blk-server.c
28
index XXXXXXX..XXXXXXX 100644
21
index XXXXXXX..XXXXXXX 100644
29
--- a/migration/savevm.c
22
--- a/block/export/vhost-user-blk-server.c
30
+++ b/migration/savevm.c
23
+++ b/block/export/vhost-user-blk-server.c
31
@@ -XXX,XX +XXX,XX @@ int save_snapshot(const char *name, Error **errp)
24
@@ -XXX,XX +XXX,XX @@
25
#include "sysemu/block-backend.h"
26
#include "util/block-helpers.h"
27
28
+/*
29
+ * Sector units are 512 bytes regardless of the
30
+ * virtio_blk_config->blk_size value.
31
+ */
32
+#define VIRTIO_BLK_SECTOR_BITS 9
33
+#define VIRTIO_BLK_SECTOR_SIZE (1ull << VIRTIO_BLK_SECTOR_BITS)
34
+
35
enum {
36
VHOST_USER_BLK_NUM_QUEUES_DEFAULT = 1,
37
};
38
@@ -XXX,XX +XXX,XX @@ vu_blk_initialize_config(BlockDriverState *bs,
39
uint32_t blk_size,
40
uint16_t num_queues)
41
{
42
- config->capacity = cpu_to_le64(bdrv_getlength(bs) >> BDRV_SECTOR_BITS);
43
+ config->capacity =
44
+ cpu_to_le64(bdrv_getlength(bs) >> VIRTIO_BLK_SECTOR_BITS);
45
config->blk_size = cpu_to_le32(blk_size);
46
config->size_max = cpu_to_le32(0);
47
config->seg_max = cpu_to_le32(128 - 2);
48
@@ -XXX,XX +XXX,XX @@ vu_blk_initialize_config(BlockDriverState *bs,
49
config->num_queues = cpu_to_le16(num_queues);
50
config->max_discard_sectors = cpu_to_le32(32768);
51
config->max_discard_seg = cpu_to_le32(1);
52
- config->discard_sector_alignment = cpu_to_le32(blk_size >> 9);
53
+ config->discard_sector_alignment =
54
+ cpu_to_le32(blk_size >> VIRTIO_BLK_SECTOR_BITS);
55
config->max_write_zeroes_sectors = cpu_to_le32(32768);
56
config->max_write_zeroes_seg = cpu_to_le32(1);
57
}
58
@@ -XXX,XX +XXX,XX @@ static int vu_blk_exp_create(BlockExport *exp, BlockExportOptions *opts,
59
if (vu_opts->has_logical_block_size) {
60
logical_block_size = vu_opts->logical_block_size;
61
} else {
62
- logical_block_size = BDRV_SECTOR_SIZE;
63
+ logical_block_size = VIRTIO_BLK_SECTOR_SIZE;
32
}
64
}
33
vm_stop(RUN_STATE_SAVE_VM);
65
check_block_size(exp->id, "logical-block-size", logical_block_size,
34
66
&local_err);
35
+ bdrv_drain_all_begin();
36
+
37
aio_context_acquire(aio_context);
38
39
memset(sn, 0, sizeof(*sn));
40
@@ -XXX,XX +XXX,XX @@ int save_snapshot(const char *name, Error **errp)
41
if (aio_context) {
42
aio_context_release(aio_context);
43
}
44
+
45
+ bdrv_drain_all_end();
46
+
47
if (saved_vm_running) {
48
vm_start();
49
}
50
@@ -XXX,XX +XXX,XX @@ int load_snapshot(const char *name, Error **errp)
51
}
52
53
/* Flush all IO requests so they don't interfere with the new state. */
54
- bdrv_drain_all();
55
+ bdrv_drain_all_begin();
56
57
ret = bdrv_all_goto_snapshot(name, &bs);
58
if (ret < 0) {
59
error_setg(errp, "Error %d while activating snapshot '%s' on '%s'",
60
ret, name, bdrv_get_device_name(bs));
61
- return ret;
62
+ goto err_drain;
63
}
64
65
/* restore the VM state */
66
f = qemu_fopen_bdrv(bs_vm_state, 0);
67
if (!f) {
68
error_setg(errp, "Could not open VM state file");
69
- return -EINVAL;
70
+ ret = -EINVAL;
71
+ goto err_drain;
72
}
73
74
qemu_system_reset(SHUTDOWN_CAUSE_NONE);
75
@@ -XXX,XX +XXX,XX @@ int load_snapshot(const char *name, Error **errp)
76
ret = qemu_loadvm_state(f);
77
aio_context_release(aio_context);
78
79
+ bdrv_drain_all_end();
80
+
81
migration_incoming_state_destroy();
82
if (ret < 0) {
83
error_setg(errp, "Error %d while loading VM state", ret);
84
@@ -XXX,XX +XXX,XX @@ int load_snapshot(const char *name, Error **errp)
85
}
86
87
return 0;
88
+
89
+err_drain:
90
+ bdrv_drain_all_end();
91
+ return ret;
92
}
93
94
void vmstate_register_ram(MemoryRegion *mr, DeviceState *dev)
95
--
67
--
96
1.8.3.1
68
2.29.2
97
69
98
70
1
From: Stefan Hajnoczi <stefanha@redhat.com>
1
From: Stefan Hajnoczi <stefanha@redhat.com>
2
2
3
migration_incoming_state_destroy() uses qemu_fclose() on the vmstate
3
The driver is supposed to honor the blk_size field but the protocol
4
file. Make sure to call it inside an AioContext acquire/release region.
4
still uses 512-byte sector numbers. It is incorrect to multiply
5
req->sector_num by blk_size.
5
6
6
This fixes an 'qemu: qemu_mutex_unlock: Operation not permitted' abort
7
VIRTIO 1.1 5.2.5 Device Initialization says:
7
in loadvm.
8
8
9
This patch closes the vmstate file before ending the drained region.
9
blk_size can be read to determine the optimal sector size for the
10
Previously we closed the vmstate file after ending the drained region.
10
driver to use. This does not affect the units used in the protocol
11
The order does not matter.
11
(always 512 bytes), but awareness of the correct value can affect
12
performance.
12
13
14
Fixes: 3578389bcf76c824a5d82e6586a6f0c71e56f2aa ("block/export: vhost-user block device backend server")
13
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
15
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
16
Message-Id: <20210223144653.811468-10-stefanha@redhat.com>
14
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
17
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
15
---
18
---
16
migration/savevm.c | 2 +-
19
block/export/vhost-user-blk-server.c | 2 +-
17
1 file changed, 1 insertion(+), 1 deletion(-)
20
1 file changed, 1 insertion(+), 1 deletion(-)
18
21
19
diff --git a/migration/savevm.c b/migration/savevm.c
22
diff --git a/block/export/vhost-user-blk-server.c b/block/export/vhost-user-blk-server.c
20
index XXXXXXX..XXXXXXX 100644
23
index XXXXXXX..XXXXXXX 100644
21
--- a/migration/savevm.c
24
--- a/block/export/vhost-user-blk-server.c
22
+++ b/migration/savevm.c
25
+++ b/block/export/vhost-user-blk-server.c
23
@@ -XXX,XX +XXX,XX @@ int load_snapshot(const char *name, Error **errp)
26
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn vu_blk_virtio_process_req(void *opaque)
24
27
break;
25
aio_context_acquire(aio_context);
28
}
26
ret = qemu_loadvm_state(f);
29
27
+ migration_incoming_state_destroy();
30
- int64_t offset = req->sector_num * vexp->blk_size;
28
aio_context_release(aio_context);
31
+ int64_t offset = req->sector_num << VIRTIO_BLK_SECTOR_BITS;
29
32
QEMUIOVector qiov;
30
bdrv_drain_all_end();
33
if (is_write) {
31
34
qemu_iovec_init_external(&qiov, out_iov, out_num);
32
- migration_incoming_state_destroy();
33
if (ret < 0) {
34
error_setg(errp, "Error %d while loading VM state", ret);
35
return ret;
36
--
35
--
37
1.8.3.1
36
2.29.2
38
37
39
38
1
Most of the qed code is now synchronous and matches the coroutine model.
1
From: Stefan Hajnoczi <stefanha@redhat.com>
2
One notable exception is the serialisation between requests which can
3
still schedule a callback. Before we can replace this with coroutine
4
locks, let's convert the driver's external interfaces to the coroutine
5
versions.
6
2
7
We need to be careful to handle both requests that call the completion
3
Validate discard/write zeroes the same way we do for virtio-blk. Some of
8
callback directly from the calling coroutine (i.e. fully synchronous
4
these checks are mandated by the VIRTIO specification, others are
9
code) and requests that involve some callback, so that we need to yield
5
internal to QEMU.
10
and wait for the completion callback coming from outside the coroutine.
11
6
7
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
8
Message-Id: <20210223144653.811468-11-stefanha@redhat.com>
12
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
9
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
13
Reviewed-by: Manos Pitsidianakis <el13635@mail.ntua.gr>
14
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
15
---
10
---
16
block/qed.c | 97 ++++++++++++++++++++++++++-----------------------------------
11
block/export/vhost-user-blk-server.c | 116 +++++++++++++++++++++------
17
1 file changed, 42 insertions(+), 55 deletions(-)
12
1 file changed, 93 insertions(+), 23 deletions(-)
18
13
19
diff --git a/block/qed.c b/block/qed.c
14
diff --git a/block/export/vhost-user-blk-server.c b/block/export/vhost-user-blk-server.c
20
index XXXXXXX..XXXXXXX 100644
15
index XXXXXXX..XXXXXXX 100644
21
--- a/block/qed.c
16
--- a/block/export/vhost-user-blk-server.c
22
+++ b/block/qed.c
17
+++ b/block/export/vhost-user-blk-server.c
23
@@ -XXX,XX +XXX,XX @@ static void qed_aio_next_io(QEDAIOCB *acb)
18
@@ -XXX,XX +XXX,XX @@
24
}
19
20
enum {
21
VHOST_USER_BLK_NUM_QUEUES_DEFAULT = 1,
22
+ VHOST_USER_BLK_MAX_DISCARD_SECTORS = 32768,
23
+ VHOST_USER_BLK_MAX_WRITE_ZEROES_SECTORS = 32768,
24
};
25
struct virtio_blk_inhdr {
26
unsigned char status;
27
@@ -XXX,XX +XXX,XX @@ static void vu_blk_req_complete(VuBlkReq *req)
28
free(req);
25
}
29
}
26
30
27
-static BlockAIOCB *qed_aio_setup(BlockDriverState *bs,
31
+static bool vu_blk_sect_range_ok(VuBlkExport *vexp, uint64_t sector,
28
- int64_t sector_num,
32
+ size_t size)
29
- QEMUIOVector *qiov, int nb_sectors,
33
+{
30
- BlockCompletionFunc *cb,
34
+ uint64_t nb_sectors = size >> BDRV_SECTOR_BITS;
31
- void *opaque, int flags)
35
+ uint64_t total_sectors;
32
+typedef struct QEDRequestCo {
33
+ Coroutine *co;
34
+ bool done;
35
+ int ret;
36
+} QEDRequestCo;
37
+
36
+
38
+static void qed_co_request_cb(void *opaque, int ret)
37
+ if (nb_sectors > BDRV_REQUEST_MAX_SECTORS) {
39
{
38
+ return false;
40
- QEDAIOCB *acb = qemu_aio_get(&qed_aiocb_info, bs, cb, opaque);
39
+ }
41
+ QEDRequestCo *co = opaque;
40
+ if ((sector << VIRTIO_BLK_SECTOR_BITS) % vexp->blk_size) {
42
41
+ return false;
43
- trace_qed_aio_setup(bs->opaque, acb, sector_num, nb_sectors,
42
+ }
44
- opaque, flags);
43
+ blk_get_geometry(vexp->export.blk, &total_sectors);
45
+ co->done = true;
44
+ if (sector > total_sectors || nb_sectors > total_sectors - sector) {
46
+ co->ret = ret;
45
+ return false;
47
+ qemu_coroutine_enter_if_inactive(co->co);
46
+ }
47
+ return true;
48
+}
48
+}
49
+
49
+
50
+static int coroutine_fn qed_co_request(BlockDriverState *bs, int64_t sector_num,
50
static int coroutine_fn
51
+ QEMUIOVector *qiov, int nb_sectors,
51
-vu_blk_discard_write_zeroes(BlockBackend *blk, struct iovec *iov,
52
+ int flags)
52
+vu_blk_discard_write_zeroes(VuBlkExport *vexp, struct iovec *iov,
53
+{
53
uint32_t iovcnt, uint32_t type)
54
+ QEDRequestCo co = {
54
{
55
+ .co = qemu_coroutine_self(),
55
+ BlockBackend *blk = vexp->export.blk;
56
+ .done = false,
56
struct virtio_blk_discard_write_zeroes desc;
57
+ };
57
- ssize_t size = iov_to_buf(iov, iovcnt, 0, &desc, sizeof(desc));
58
+ QEDAIOCB *acb = qemu_aio_get(&qed_aiocb_info, bs, qed_co_request_cb, &co);
58
+ ssize_t size;
59
+ uint64_t sector;
60
+ uint32_t num_sectors;
61
+ uint32_t max_sectors;
62
+ uint32_t flags;
63
+ int bytes;
59
+
64
+
60
+ trace_qed_aio_setup(bs->opaque, acb, sector_num, nb_sectors, &co, flags);
65
+ /* Only one desc is currently supported */
61
66
+ if (unlikely(iov_size(iov, iovcnt) > sizeof(desc))) {
62
acb->flags = flags;
67
+ return VIRTIO_BLK_S_UNSUPP;
63
acb->qiov = qiov;
64
@@ -XXX,XX +XXX,XX @@ static BlockAIOCB *qed_aio_setup(BlockDriverState *bs,
65
66
/* Start request */
67
qed_aio_start_io(acb);
68
- return &acb->common;
69
-}
70
71
-static BlockAIOCB *bdrv_qed_aio_readv(BlockDriverState *bs,
72
- int64_t sector_num,
73
- QEMUIOVector *qiov, int nb_sectors,
74
- BlockCompletionFunc *cb,
75
- void *opaque)
76
-{
77
- return qed_aio_setup(bs, sector_num, qiov, nb_sectors, cb, opaque, 0);
78
+ if (!co.done) {
79
+ qemu_coroutine_yield();
80
+ }
68
+ }
81
+
69
+
82
+ return co.ret;
70
+ size = iov_to_buf(iov, iovcnt, 0, &desc, sizeof(desc));
71
if (unlikely(size != sizeof(desc))) {
72
- error_report("Invalid size %zd, expect %zu", size, sizeof(desc));
73
- return -EINVAL;
74
+ error_report("Invalid size %zd, expected %zu", size, sizeof(desc));
75
+ return VIRTIO_BLK_S_IOERR;
76
}
77
78
- uint64_t range[2] = { le64_to_cpu(desc.sector) << 9,
79
- le32_to_cpu(desc.num_sectors) << 9 };
80
- if (type == VIRTIO_BLK_T_DISCARD) {
81
- if (blk_co_pdiscard(blk, range[0], range[1]) == 0) {
82
- return 0;
83
+ sector = le64_to_cpu(desc.sector);
84
+ num_sectors = le32_to_cpu(desc.num_sectors);
85
+ flags = le32_to_cpu(desc.flags);
86
+ max_sectors = (type == VIRTIO_BLK_T_WRITE_ZEROES) ?
87
+ VHOST_USER_BLK_MAX_WRITE_ZEROES_SECTORS :
88
+ VHOST_USER_BLK_MAX_DISCARD_SECTORS;
89
+
90
+ /* This check ensures that 'bytes' fits in an int */
91
+ if (unlikely(num_sectors > max_sectors)) {
92
+ return VIRTIO_BLK_S_IOERR;
93
+ }
94
+
95
+ bytes = num_sectors << VIRTIO_BLK_SECTOR_BITS;
96
+
97
+ if (unlikely(!vu_blk_sect_range_ok(vexp, sector, bytes))) {
98
+ return VIRTIO_BLK_S_IOERR;
99
+ }
100
+
101
+ /*
102
+ * The device MUST set the status byte to VIRTIO_BLK_S_UNSUPP for discard
103
+ * and write zeroes commands if any unknown flag is set.
104
+ */
105
+ if (unlikely(flags & ~VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP)) {
106
+ return VIRTIO_BLK_S_UNSUPP;
107
+ }
108
+
109
+ if (type == VIRTIO_BLK_T_WRITE_ZEROES) {
110
+ int blk_flags = 0;
111
+
112
+ if (flags & VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP) {
113
+ blk_flags |= BDRV_REQ_MAY_UNMAP;
114
+ }
115
+
116
+ if (blk_co_pwrite_zeroes(blk, sector << VIRTIO_BLK_SECTOR_BITS,
117
+ bytes, blk_flags) == 0) {
118
+ return VIRTIO_BLK_S_OK;
119
}
120
- } else if (type == VIRTIO_BLK_T_WRITE_ZEROES) {
121
- if (blk_co_pwrite_zeroes(blk, range[0], range[1], 0) == 0) {
122
- return 0;
123
+ } else if (type == VIRTIO_BLK_T_DISCARD) {
124
+ /*
125
+ * The device MUST set the status byte to VIRTIO_BLK_S_UNSUPP for
126
+ * discard commands if the unmap flag is set.
127
+ */
128
+ if (unlikely(flags & VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP)) {
129
+ return VIRTIO_BLK_S_UNSUPP;
130
+ }
131
+
132
+ if (blk_co_pdiscard(blk, sector << VIRTIO_BLK_SECTOR_BITS,
133
+ bytes) == 0) {
134
+ return VIRTIO_BLK_S_OK;
135
}
136
}
137
138
- return -EINVAL;
139
+ return VIRTIO_BLK_S_IOERR;
83
}
140
}
84
141
85
-static BlockAIOCB *bdrv_qed_aio_writev(BlockDriverState *bs,
142
static void coroutine_fn vu_blk_virtio_process_req(void *opaque)
86
- int64_t sector_num,
143
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn vu_blk_virtio_process_req(void *opaque)
87
- QEMUIOVector *qiov, int nb_sectors,
144
}
88
- BlockCompletionFunc *cb,
145
case VIRTIO_BLK_T_DISCARD:
89
- void *opaque)
146
case VIRTIO_BLK_T_WRITE_ZEROES: {
90
+static int coroutine_fn bdrv_qed_co_readv(BlockDriverState *bs,
147
- int rc;
91
+ int64_t sector_num, int nb_sectors,
148
-
92
+ QEMUIOVector *qiov)
149
if (!vexp->writable) {
93
{
150
req->in->status = VIRTIO_BLK_S_IOERR;
94
- return qed_aio_setup(bs, sector_num, qiov, nb_sectors, cb,
151
break;
95
- opaque, QED_AIOCB_WRITE);
152
}
96
+ return qed_co_request(bs, sector_num, qiov, nb_sectors, 0);
153
154
- rc = vu_blk_discard_write_zeroes(blk, &elem->out_sg[1], out_num, type);
155
- if (rc == 0) {
156
- req->in->status = VIRTIO_BLK_S_OK;
157
- } else {
158
- req->in->status = VIRTIO_BLK_S_IOERR;
159
- }
160
+ req->in->status = vu_blk_discard_write_zeroes(vexp, out_iov, out_num,
161
+ type);
162
break;
163
}
164
default:
165
@@ -XXX,XX +XXX,XX @@ vu_blk_initialize_config(BlockDriverState *bs,
166
config->min_io_size = cpu_to_le16(1);
167
config->opt_io_size = cpu_to_le32(1);
168
config->num_queues = cpu_to_le16(num_queues);
169
- config->max_discard_sectors = cpu_to_le32(32768);
170
+ config->max_discard_sectors =
171
+ cpu_to_le32(VHOST_USER_BLK_MAX_DISCARD_SECTORS);
172
config->max_discard_seg = cpu_to_le32(1);
173
config->discard_sector_alignment =
174
cpu_to_le32(blk_size >> VIRTIO_BLK_SECTOR_BITS);
175
- config->max_write_zeroes_sectors = cpu_to_le32(32768);
176
+ config->max_write_zeroes_sectors
177
+ = cpu_to_le32(VHOST_USER_BLK_MAX_WRITE_ZEROES_SECTORS);
178
config->max_write_zeroes_seg = cpu_to_le32(1);
97
}
179
}
98
180
99
-typedef struct {
100
- Coroutine *co;
101
- int ret;
102
- bool done;
103
-} QEDWriteZeroesCB;
104
-
105
-static void coroutine_fn qed_co_pwrite_zeroes_cb(void *opaque, int ret)
106
+static int coroutine_fn bdrv_qed_co_writev(BlockDriverState *bs,
107
+ int64_t sector_num, int nb_sectors,
108
+ QEMUIOVector *qiov)
109
{
110
- QEDWriteZeroesCB *cb = opaque;
111
-
112
- cb->done = true;
113
- cb->ret = ret;
114
- if (cb->co) {
115
- aio_co_wake(cb->co);
116
- }
117
+ return qed_co_request(bs, sector_num, qiov, nb_sectors, QED_AIOCB_WRITE);
118
}
119
120
static int coroutine_fn bdrv_qed_co_pwrite_zeroes(BlockDriverState *bs,
121
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn bdrv_qed_co_pwrite_zeroes(BlockDriverState *bs,
122
int count,
123
BdrvRequestFlags flags)
124
{
125
- BlockAIOCB *blockacb;
126
BDRVQEDState *s = bs->opaque;
127
- QEDWriteZeroesCB cb = { .done = false };
128
QEMUIOVector qiov;
129
struct iovec iov;
130
131
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn bdrv_qed_co_pwrite_zeroes(BlockDriverState *bs,
132
iov.iov_len = count;
133
134
qemu_iovec_init_external(&qiov, &iov, 1);
135
- blockacb = qed_aio_setup(bs, offset >> BDRV_SECTOR_BITS, &qiov,
136
- count >> BDRV_SECTOR_BITS,
137
- qed_co_pwrite_zeroes_cb, &cb,
138
- QED_AIOCB_WRITE | QED_AIOCB_ZERO);
139
- if (!blockacb) {
140
- return -EIO;
141
- }
142
- if (!cb.done) {
143
- cb.co = qemu_coroutine_self();
144
- qemu_coroutine_yield();
145
- }
146
- assert(cb.done);
147
- return cb.ret;
148
+ return qed_co_request(bs, offset >> BDRV_SECTOR_BITS, &qiov,
149
+ count >> BDRV_SECTOR_BITS,
150
+ QED_AIOCB_WRITE | QED_AIOCB_ZERO);
151
}
152
153
static int bdrv_qed_truncate(BlockDriverState *bs, int64_t offset, Error **errp)
154
@@ -XXX,XX +XXX,XX @@ static BlockDriver bdrv_qed = {
155
.bdrv_create = bdrv_qed_create,
156
.bdrv_has_zero_init = bdrv_has_zero_init_1,
157
.bdrv_co_get_block_status = bdrv_qed_co_get_block_status,
158
- .bdrv_aio_readv = bdrv_qed_aio_readv,
159
- .bdrv_aio_writev = bdrv_qed_aio_writev,
160
+ .bdrv_co_readv = bdrv_qed_co_readv,
161
+ .bdrv_co_writev = bdrv_qed_co_writev,
162
.bdrv_co_pwrite_zeroes = bdrv_qed_co_pwrite_zeroes,
163
.bdrv_truncate = bdrv_qed_truncate,
164
.bdrv_getlength = bdrv_qed_getlength,
165
--
1.8.3.1

--
2.29.2

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/qed-cluster.c | 94 ++++++++++++++++++-----------------------------------
 block/qed-table.c   | 15 +++------
 block/qed.h         |  3 +-
 3 files changed, 36 insertions(+), 76 deletions(-)

From: Stefan Hajnoczi <stefanha@redhat.com>

Exercise input validation code paths in
block/export/vhost-user-blk-server.c.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20210223144653.811468-12-stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 tests/qtest/vhost-user-blk-test.c | 124 ++++++++++++++++++++++++++++++
 1 file changed, 124 insertions(+)
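For readers who want the request layout the new test drives, here is a rough
standalone sketch (not QEMU or qtest code) of the discard/write-zeroes payload
it submits. The struct mirrors the virtio-blk descriptor from the virtio
specification; the values correspond to two of the invalid cases exercised by
the test (oversized num_sectors, reserved flag bits set). Endianness handling
is omitted for brevity.

    /* Standalone sketch of the discard/write-zeroes payload used by the test. */
    #include <stdint.h>
    #include <stdio.h>

    struct dwz_payload {            /* mirrors struct virtio_blk_discard_write_zeroes */
        uint64_t sector;            /* first sector of the range */
        uint32_t num_sectors;       /* number of 512-byte sectors */
        uint32_t flags;             /* bit 0 = unmap, all other bits reserved */
    };

    int main(void)
    {
        /* num_sectors above max_write_zeroes_sectors: device must report an error */
        struct dwz_payload too_big = { .sector = 0, .num_sectors = 0xffffffff, .flags = 0 };

        /* reserved flag bits set: device must answer VIRTIO_BLK_S_UNSUPP */
        struct dwz_payload bad_flags = { .sector = 0, .num_sectors = 1, .flags = ~1u };

        printf("too_big: %u sectors, bad_flags: 0x%x\n",
               too_big.num_sectors, bad_flags.flags);
        return 0;
    }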
10
diff --git a/block/qed-cluster.c b/block/qed-cluster.c
13
diff --git a/tests/qtest/vhost-user-blk-test.c b/tests/qtest/vhost-user-blk-test.c
11
index XXXXXXX..XXXXXXX 100644
14
index XXXXXXX..XXXXXXX 100644
12
--- a/block/qed-cluster.c
15
--- a/tests/qtest/vhost-user-blk-test.c
13
+++ b/block/qed-cluster.c
16
+++ b/tests/qtest/vhost-user-blk-test.c
14
@@ -XXX,XX +XXX,XX @@ static unsigned int qed_count_contiguous_clusters(BDRVQEDState *s,
17
@@ -XXX,XX +XXX,XX @@ static uint64_t virtio_blk_request(QGuestAllocator *alloc, QVirtioDevice *d,
15
return i - index;
18
return addr;
16
}
19
}
17
20
18
-typedef struct {
21
+static void test_invalid_discard_write_zeroes(QVirtioDevice *dev,
19
- BDRVQEDState *s;
22
+ QGuestAllocator *alloc,
20
- uint64_t pos;
23
+ QTestState *qts,
21
- size_t len;
24
+ QVirtQueue *vq,
22
-
25
+ uint32_t type)
23
- QEDRequest *request;
26
+{
24
-
27
+ QVirtioBlkReq req;
25
- /* User callback */
28
+ struct virtio_blk_discard_write_zeroes dwz_hdr;
26
- QEDFindClusterFunc *cb;
29
+ struct virtio_blk_discard_write_zeroes dwz_hdr2[2];
27
- void *opaque;
30
+ uint64_t req_addr;
28
-} QEDFindClusterCB;
31
+ uint32_t free_head;
29
-
32
+ uint8_t status;
30
-static void qed_find_cluster_cb(void *opaque, int ret)
33
+
31
-{
34
+ /* More than one dwz is not supported */
32
- QEDFindClusterCB *find_cluster_cb = opaque;
35
+ req.type = type;
33
- BDRVQEDState *s = find_cluster_cb->s;
36
+ req.data = (char *) dwz_hdr2;
34
- QEDRequest *request = find_cluster_cb->request;
37
+ dwz_hdr2[0].sector = 0;
35
- uint64_t offset = 0;
38
+ dwz_hdr2[0].num_sectors = 1;
36
- size_t len = 0;
39
+ dwz_hdr2[0].flags = 0;
37
- unsigned int index;
40
+ dwz_hdr2[1].sector = 1;
38
- unsigned int n;
41
+ dwz_hdr2[1].num_sectors = 1;
39
-
42
+ dwz_hdr2[1].flags = 0;
40
- qed_acquire(s);
43
+
41
- if (ret) {
44
+ virtio_blk_fix_dwz_hdr(dev, &dwz_hdr2[0]);
42
- goto out;
45
+ virtio_blk_fix_dwz_hdr(dev, &dwz_hdr2[1]);
43
- }
46
+
44
-
47
+ req_addr = virtio_blk_request(alloc, dev, &req, sizeof(dwz_hdr2));
45
- index = qed_l2_index(s, find_cluster_cb->pos);
48
+
46
- n = qed_bytes_to_clusters(s,
49
+ free_head = qvirtqueue_add(qts, vq, req_addr, 16, false, true);
47
- qed_offset_into_cluster(s, find_cluster_cb->pos) +
50
+ qvirtqueue_add(qts, vq, req_addr + 16, sizeof(dwz_hdr2), false, true);
48
- find_cluster_cb->len);
51
+ qvirtqueue_add(qts, vq, req_addr + 16 + sizeof(dwz_hdr2), 1, true,
49
- n = qed_count_contiguous_clusters(s, request->l2_table->table,
52
+ false);
50
- index, n, &offset);
53
+
51
-
54
+ qvirtqueue_kick(qts, dev, vq, free_head);
52
- if (qed_offset_is_unalloc_cluster(offset)) {
55
+
53
- ret = QED_CLUSTER_L2;
56
+ qvirtio_wait_used_elem(qts, dev, vq, free_head, NULL,
54
- } else if (qed_offset_is_zero_cluster(offset)) {
57
+ QVIRTIO_BLK_TIMEOUT_US);
55
- ret = QED_CLUSTER_ZERO;
58
+ status = readb(req_addr + 16 + sizeof(dwz_hdr2));
56
- } else if (qed_check_cluster_offset(s, offset)) {
59
+ g_assert_cmpint(status, ==, VIRTIO_BLK_S_UNSUPP);
57
- ret = QED_CLUSTER_FOUND;
60
+
58
- } else {
61
+ guest_free(alloc, req_addr);
59
- ret = -EINVAL;
62
+
60
- }
63
+ /* num_sectors must be less than config->max_write_zeroes_sectors */
61
-
64
+ req.type = type;
62
- len = MIN(find_cluster_cb->len, n * s->header.cluster_size -
65
+ req.data = (char *) &dwz_hdr;
63
- qed_offset_into_cluster(s, find_cluster_cb->pos));
66
+ dwz_hdr.sector = 0;
64
-
67
+ dwz_hdr.num_sectors = 0xffffffff;
65
-out:
68
+ dwz_hdr.flags = 0;
66
- find_cluster_cb->cb(find_cluster_cb->opaque, ret, offset, len);
69
+
67
- qed_release(s);
70
+ virtio_blk_fix_dwz_hdr(dev, &dwz_hdr);
68
- g_free(find_cluster_cb);
71
+
69
-}
72
+ req_addr = virtio_blk_request(alloc, dev, &req, sizeof(dwz_hdr));
70
-
73
+
71
/**
74
+ free_head = qvirtqueue_add(qts, vq, req_addr, 16, false, true);
72
* Find the offset of a data cluster
75
+ qvirtqueue_add(qts, vq, req_addr + 16, sizeof(dwz_hdr), false, true);
73
*
76
+ qvirtqueue_add(qts, vq, req_addr + 16 + sizeof(dwz_hdr), 1, true,
74
@@ -XXX,XX +XXX,XX @@ out:
77
+ false);
75
void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
78
+
76
size_t len, QEDFindClusterFunc *cb, void *opaque)
79
+ qvirtqueue_kick(qts, dev, vq, free_head);
80
+
81
+ qvirtio_wait_used_elem(qts, dev, vq, free_head, NULL,
82
+ QVIRTIO_BLK_TIMEOUT_US);
83
+ status = readb(req_addr + 16 + sizeof(dwz_hdr));
84
+ g_assert_cmpint(status, ==, VIRTIO_BLK_S_IOERR);
85
+
86
+ guest_free(alloc, req_addr);
87
+
88
+ /* sector must be less than the device capacity */
89
+ req.type = type;
90
+ req.data = (char *) &dwz_hdr;
91
+ dwz_hdr.sector = TEST_IMAGE_SIZE / 512 + 1;
92
+ dwz_hdr.num_sectors = 1;
93
+ dwz_hdr.flags = 0;
94
+
95
+ virtio_blk_fix_dwz_hdr(dev, &dwz_hdr);
96
+
97
+ req_addr = virtio_blk_request(alloc, dev, &req, sizeof(dwz_hdr));
98
+
99
+ free_head = qvirtqueue_add(qts, vq, req_addr, 16, false, true);
100
+ qvirtqueue_add(qts, vq, req_addr + 16, sizeof(dwz_hdr), false, true);
101
+ qvirtqueue_add(qts, vq, req_addr + 16 + sizeof(dwz_hdr), 1, true,
102
+ false);
103
+
104
+ qvirtqueue_kick(qts, dev, vq, free_head);
105
+
106
+ qvirtio_wait_used_elem(qts, dev, vq, free_head, NULL,
107
+ QVIRTIO_BLK_TIMEOUT_US);
108
+ status = readb(req_addr + 16 + sizeof(dwz_hdr));
109
+ g_assert_cmpint(status, ==, VIRTIO_BLK_S_IOERR);
110
+
111
+ guest_free(alloc, req_addr);
112
+
113
+ /* reserved flag bits must be zero */
114
+ req.type = type;
115
+ req.data = (char *) &dwz_hdr;
116
+ dwz_hdr.sector = 0;
117
+ dwz_hdr.num_sectors = 1;
118
+ dwz_hdr.flags = ~VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP;
119
+
120
+ virtio_blk_fix_dwz_hdr(dev, &dwz_hdr);
121
+
122
+ req_addr = virtio_blk_request(alloc, dev, &req, sizeof(dwz_hdr));
123
+
124
+ free_head = qvirtqueue_add(qts, vq, req_addr, 16, false, true);
125
+ qvirtqueue_add(qts, vq, req_addr + 16, sizeof(dwz_hdr), false, true);
126
+ qvirtqueue_add(qts, vq, req_addr + 16 + sizeof(dwz_hdr), 1, true,
127
+ false);
128
+
129
+ qvirtqueue_kick(qts, dev, vq, free_head);
130
+
131
+ qvirtio_wait_used_elem(qts, dev, vq, free_head, NULL,
132
+ QVIRTIO_BLK_TIMEOUT_US);
133
+ status = readb(req_addr + 16 + sizeof(dwz_hdr));
134
+ g_assert_cmpint(status, ==, VIRTIO_BLK_S_UNSUPP);
135
+
136
+ guest_free(alloc, req_addr);
137
+}
138
+
139
/* Returns the request virtqueue so the caller can perform further tests */
140
static QVirtQueue *test_basic(QVirtioDevice *dev, QGuestAllocator *alloc)
77
{
141
{
78
- QEDFindClusterCB *find_cluster_cb;
142
@@ -XXX,XX +XXX,XX @@ static QVirtQueue *test_basic(QVirtioDevice *dev, QGuestAllocator *alloc)
79
uint64_t l2_offset;
143
g_free(data);
80
+ uint64_t offset = 0;
144
81
+ unsigned int index;
145
guest_free(alloc, req_addr);
82
+ unsigned int n;
146
+
83
+ int ret;
147
+ test_invalid_discard_write_zeroes(dev, alloc, qts, vq,
84
148
+ VIRTIO_BLK_T_WRITE_ZEROES);
85
/* Limit length to L2 boundary. Requests are broken up at the L2 boundary
86
* so that a request acts on one L2 table at a time.
87
@@ -XXX,XX +XXX,XX @@ void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
88
return;
89
}
149
}
90
150
91
- find_cluster_cb = g_malloc(sizeof(*find_cluster_cb));
151
if (features & (1u << VIRTIO_BLK_F_DISCARD)) {
92
- find_cluster_cb->s = s;
152
@@ -XXX,XX +XXX,XX @@ static QVirtQueue *test_basic(QVirtioDevice *dev, QGuestAllocator *alloc)
93
- find_cluster_cb->pos = pos;
153
g_assert_cmpint(status, ==, 0);
94
- find_cluster_cb->len = len;
154
95
- find_cluster_cb->cb = cb;
155
guest_free(alloc, req_addr);
96
- find_cluster_cb->opaque = opaque;
97
- find_cluster_cb->request = request;
98
+ ret = qed_read_l2_table(s, request, l2_offset);
99
+ qed_acquire(s);
100
+ if (ret) {
101
+ goto out;
102
+ }
103
+
156
+
104
+ index = qed_l2_index(s, pos);
157
+ test_invalid_discard_write_zeroes(dev, alloc, qts, vq,
105
+ n = qed_bytes_to_clusters(s,
158
+ VIRTIO_BLK_T_DISCARD);
106
+ qed_offset_into_cluster(s, pos) + len);
107
+ n = qed_count_contiguous_clusters(s, request->l2_table->table,
108
+ index, n, &offset);
109
+
110
+ if (qed_offset_is_unalloc_cluster(offset)) {
111
+ ret = QED_CLUSTER_L2;
112
+ } else if (qed_offset_is_zero_cluster(offset)) {
113
+ ret = QED_CLUSTER_ZERO;
114
+ } else if (qed_check_cluster_offset(s, offset)) {
115
+ ret = QED_CLUSTER_FOUND;
116
+ } else {
117
+ ret = -EINVAL;
118
+ }
119
+
120
+ len = MIN(len,
121
+ n * s->header.cluster_size - qed_offset_into_cluster(s, pos));
122
123
- qed_read_l2_table(s, request, l2_offset,
124
- qed_find_cluster_cb, find_cluster_cb);
125
+out:
126
+ cb(opaque, ret, offset, len);
127
+ qed_release(s);
128
}
129
diff --git a/block/qed-table.c b/block/qed-table.c
130
index XXXXXXX..XXXXXXX 100644
131
--- a/block/qed-table.c
132
+++ b/block/qed-table.c
133
@@ -XXX,XX +XXX,XX @@ int qed_write_l1_table_sync(BDRVQEDState *s, unsigned int index,
134
return ret;
135
}
136
137
-void qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset,
138
- BlockCompletionFunc *cb, void *opaque)
139
+int qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset)
140
{
141
int ret;
142
143
@@ -XXX,XX +XXX,XX @@ void qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset,
144
/* Check for cached L2 entry */
145
request->l2_table = qed_find_l2_cache_entry(&s->l2_cache, offset);
146
if (request->l2_table) {
147
- cb(opaque, 0);
148
- return;
149
+ return 0;
150
}
159
}
151
160
152
request->l2_table = qed_alloc_l2_cache_entry(&s->l2_cache);
161
if (features & (1u << VIRTIO_F_ANY_LAYOUT)) {
153
@@ -XXX,XX +XXX,XX @@ void qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset,
154
}
155
qed_release(s);
156
157
- cb(opaque, ret);
158
+ return ret;
159
}
160
161
int qed_read_l2_table_sync(BDRVQEDState *s, QEDRequest *request, uint64_t offset)
162
{
163
- int ret = -EINPROGRESS;
164
-
165
- qed_read_l2_table(s, request, offset, qed_sync_cb, &ret);
166
- BDRV_POLL_WHILE(s->bs, ret == -EINPROGRESS);
167
-
168
- return ret;
169
+ return qed_read_l2_table(s, request, offset);
170
}
171
172
void qed_write_l2_table(BDRVQEDState *s, QEDRequest *request,
173
diff --git a/block/qed.h b/block/qed.h
174
index XXXXXXX..XXXXXXX 100644
175
--- a/block/qed.h
176
+++ b/block/qed.h
177
@@ -XXX,XX +XXX,XX @@ int qed_write_l1_table_sync(BDRVQEDState *s, unsigned int index,
178
unsigned int n);
179
int qed_read_l2_table_sync(BDRVQEDState *s, QEDRequest *request,
180
uint64_t offset);
181
-void qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset,
182
- BlockCompletionFunc *cb, void *opaque);
183
+int qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset);
184
void qed_write_l2_table(BDRVQEDState *s, QEDRequest *request,
185
unsigned int index, unsigned int n, bool flush,
186
BlockCompletionFunc *cb, void *opaque);
187
--
1.8.3.1

--
2.29.2

From: Alberto Garcia <berto@igalia.com>

If the guest tries to write data that results in the allocation of a
new cluster, instead of writing the guest data first and then the data
from the COW regions, write everything together using one single I/O
operation.

This can improve the write performance by 25% or more, depending on
several factors such as the media type, the cluster size and the I/O
request size.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2-cluster.c | 40 ++++++++++++++++++++++++--------
 block/qcow2.c         | 64 +++++++++++++++++++++++++++++++++++++++++++--------
 block/qcow2.h         |  7 ++++++
 3 files changed, 91 insertions(+), 20 deletions(-)

From: Stefan Hajnoczi <stefanha@redhat.com>

Check that the sector number and byte count are valid.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20210223144653.811468-13-stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/export/vhost-user-blk-server.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)
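As a rough illustration of the check this patch adds (a standalone sketch, not
the QEMU code; the names, the sector cap and the example geometry below are
made up for the example), the validation boils down to rejecting oversized,
misaligned and out-of-range requests without risking integer overflow:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    #define SECTOR_BITS 9
    #define MAX_REQUEST_SECTORS (1u << 15)   /* illustrative cap on one request */

    static bool sect_range_ok(uint64_t sector, size_t bytes,
                              uint32_t blk_size, uint64_t capacity_sectors)
    {
        uint64_t nb_sectors = bytes >> SECTOR_BITS;

        if (nb_sectors > MAX_REQUEST_SECTORS) {
            return false;                    /* request too large */
        }
        if ((sector << SECTOR_BITS) % blk_size) {
            return false;                    /* not aligned to the logical block size */
        }
        /* written this way so that sector + nb_sectors cannot overflow */
        if (sector > capacity_sectors || nb_sectors > capacity_sectors - sector) {
            return false;                    /* extends beyond the end of the device */
        }
        return true;
    }

    int main(void)
    {
        /* 1 MiB request at sector 0 of a 10 MiB device with 512-byte blocks */
        printf("%d\n", sect_range_ok(0, 1 << 20, 512, 20480));
        return 0;
    }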
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
12
diff --git a/block/export/vhost-user-blk-server.c b/block/export/vhost-user-blk-server.c
22
index XXXXXXX..XXXXXXX 100644
13
index XXXXXXX..XXXXXXX 100644
23
--- a/block/qcow2-cluster.c
14
--- a/block/export/vhost-user-blk-server.c
24
+++ b/block/qcow2-cluster.c
15
+++ b/block/export/vhost-user-blk-server.c
25
@@ -XXX,XX +XXX,XX @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
16
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn vu_blk_virtio_process_req(void *opaque)
26
assert(start->nb_bytes <= UINT_MAX - end->nb_bytes);
17
switch (type & ~VIRTIO_BLK_T_BARRIER) {
27
assert(start->nb_bytes + end->nb_bytes <= UINT_MAX - data_bytes);
18
case VIRTIO_BLK_T_IN:
28
assert(start->offset + start->nb_bytes <= end->offset);
19
case VIRTIO_BLK_T_OUT: {
29
+ assert(!m->data_qiov || m->data_qiov->size == data_bytes);
20
+ QEMUIOVector qiov;
30
21
+ int64_t offset;
31
if (start->nb_bytes == 0 && end->nb_bytes == 0) {
22
ssize_t ret = 0;
32
return 0;
23
bool is_write = type & VIRTIO_BLK_T_OUT;
33
@@ -XXX,XX +XXX,XX @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
24
req->sector_num = le64_to_cpu(req->out.sector);
34
/* The part of the buffer where the end region is located */
25
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn vu_blk_virtio_process_req(void *opaque)
35
end_buffer = start_buffer + buffer_size - end->nb_bytes;
26
break;
36
37
- qemu_iovec_init(&qiov, 1);
38
+ qemu_iovec_init(&qiov, 2 + (m->data_qiov ? m->data_qiov->niov : 0));
39
40
qemu_co_mutex_unlock(&s->lock);
41
/* First we read the existing data from both COW regions. We
42
@@ -XXX,XX +XXX,XX @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
43
}
27
}
44
}
28
45
29
- int64_t offset = req->sector_num << VIRTIO_BLK_SECTOR_BITS;
46
- /* And now we can write everything */
30
- QEMUIOVector qiov;
47
- qemu_iovec_reset(&qiov);
31
if (is_write) {
48
- qemu_iovec_add(&qiov, start_buffer, start->nb_bytes);
32
qemu_iovec_init_external(&qiov, out_iov, out_num);
49
- ret = do_perform_cow_write(bs, m->alloc_offset, start->offset, &qiov);
33
- ret = blk_co_pwritev(blk, offset, qiov.size, &qiov, 0);
50
- if (ret < 0) {
34
} else {
51
- goto fail;
35
qemu_iovec_init_external(&qiov, in_iov, in_num);
52
+ /* And now we can write everything. If we have the guest data we
53
+ * can write everything in one single operation */
54
+ if (m->data_qiov) {
55
+ qemu_iovec_reset(&qiov);
56
+ if (start->nb_bytes) {
57
+ qemu_iovec_add(&qiov, start_buffer, start->nb_bytes);
58
+ }
59
+ qemu_iovec_concat(&qiov, m->data_qiov, 0, data_bytes);
60
+ if (end->nb_bytes) {
61
+ qemu_iovec_add(&qiov, end_buffer, end->nb_bytes);
62
+ }
63
+ /* NOTE: we have a write_aio blkdebug event here followed by
64
+ * a cow_write one in do_perform_cow_write(), but there's only
65
+ * one single I/O operation */
66
+ BLKDBG_EVENT(bs->file, BLKDBG_WRITE_AIO);
67
+ ret = do_perform_cow_write(bs, m->alloc_offset, start->offset, &qiov);
68
+ } else {
69
+ /* If there's no guest data then write both COW regions separately */
70
+ qemu_iovec_reset(&qiov);
71
+ qemu_iovec_add(&qiov, start_buffer, start->nb_bytes);
72
+ ret = do_perform_cow_write(bs, m->alloc_offset, start->offset, &qiov);
73
+ if (ret < 0) {
74
+ goto fail;
75
+ }
36
+ }
76
+
37
+
77
+ qemu_iovec_reset(&qiov);
38
+ if (unlikely(!vu_blk_sect_range_ok(vexp,
78
+ qemu_iovec_add(&qiov, end_buffer, end->nb_bytes);
39
+ req->sector_num,
79
+ ret = do_perform_cow_write(bs, m->alloc_offset, end->offset, &qiov);
40
+ qiov.size))) {
80
}
41
+ req->in->status = VIRTIO_BLK_S_IOERR;
81
42
+ break;
82
- qemu_iovec_reset(&qiov);
83
- qemu_iovec_add(&qiov, end_buffer, end->nb_bytes);
84
- ret = do_perform_cow_write(bs, m->alloc_offset, end->offset, &qiov);
85
fail:
86
qemu_co_mutex_lock(&s->lock);
87
88
diff --git a/block/qcow2.c b/block/qcow2.c
89
index XXXXXXX..XXXXXXX 100644
90
--- a/block/qcow2.c
91
+++ b/block/qcow2.c
92
@@ -XXX,XX +XXX,XX @@ fail:
93
return ret;
94
}
95
96
+/* Check if it's possible to merge a write request with the writing of
97
+ * the data from the COW regions */
98
+static bool merge_cow(uint64_t offset, unsigned bytes,
99
+ QEMUIOVector *hd_qiov, QCowL2Meta *l2meta)
100
+{
101
+ QCowL2Meta *m;
102
+
103
+ for (m = l2meta; m != NULL; m = m->next) {
104
+ /* If both COW regions are empty then there's nothing to merge */
105
+ if (m->cow_start.nb_bytes == 0 && m->cow_end.nb_bytes == 0) {
106
+ continue;
107
+ }
43
+ }
108
+
44
+
109
+ /* The data (middle) region must be immediately after the
45
+ offset = req->sector_num << VIRTIO_BLK_SECTOR_BITS;
110
+ * start region */
111
+ if (l2meta_cow_start(m) + m->cow_start.nb_bytes != offset) {
112
+ continue;
113
+ }
114
+
46
+
115
+ /* The end region must be immediately after the data (middle)
47
+ if (is_write) {
116
+ * region */
48
+ ret = blk_co_pwritev(blk, offset, qiov.size, &qiov, 0);
117
+ if (m->offset + m->cow_end.offset != offset + bytes) {
49
+ } else {
118
+ continue;
50
ret = blk_co_preadv(blk, offset, qiov.size, &qiov, 0);
119
+ }
120
+
121
+ /* Make sure that adding both COW regions to the QEMUIOVector
122
+ * does not exceed IOV_MAX */
123
+ if (hd_qiov->niov > IOV_MAX - 2) {
124
+ continue;
125
+ }
126
+
127
+ m->data_qiov = hd_qiov;
128
+ return true;
129
+ }
130
+
131
+ return false;
132
+}
133
+
134
static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
135
uint64_t bytes, QEMUIOVector *qiov,
136
int flags)
137
@@ -XXX,XX +XXX,XX @@ static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
138
goto fail;
139
}
51
}
140
52
if (ret >= 0) {
141
- qemu_co_mutex_unlock(&s->lock);
142
- BLKDBG_EVENT(bs->file, BLKDBG_WRITE_AIO);
143
- trace_qcow2_writev_data(qemu_coroutine_self(),
144
- cluster_offset + offset_in_cluster);
145
- ret = bdrv_co_pwritev(bs->file,
146
- cluster_offset + offset_in_cluster,
147
- cur_bytes, &hd_qiov, 0);
148
- qemu_co_mutex_lock(&s->lock);
149
- if (ret < 0) {
150
- goto fail;
151
+ /* If we need to do COW, check if it's possible to merge the
152
+ * writing of the guest data together with that of the COW regions.
153
+ * If it's not possible (or not necessary) then write the
154
+ * guest data now. */
155
+ if (!merge_cow(offset, cur_bytes, &hd_qiov, l2meta)) {
156
+ qemu_co_mutex_unlock(&s->lock);
157
+ BLKDBG_EVENT(bs->file, BLKDBG_WRITE_AIO);
158
+ trace_qcow2_writev_data(qemu_coroutine_self(),
159
+ cluster_offset + offset_in_cluster);
160
+ ret = bdrv_co_pwritev(bs->file,
161
+ cluster_offset + offset_in_cluster,
162
+ cur_bytes, &hd_qiov, 0);
163
+ qemu_co_mutex_lock(&s->lock);
164
+ if (ret < 0) {
165
+ goto fail;
166
+ }
167
}
168
169
while (l2meta != NULL) {
170
diff --git a/block/qcow2.h b/block/qcow2.h
171
index XXXXXXX..XXXXXXX 100644
172
--- a/block/qcow2.h
173
+++ b/block/qcow2.h
174
@@ -XXX,XX +XXX,XX @@ typedef struct QCowL2Meta
175
*/
176
Qcow2COWRegion cow_end;
177
178
+ /**
179
+ * The I/O vector with the data from the actual guest write request.
180
+ * If non-NULL, this is meant to be merged together with the data
181
+ * from @cow_start and @cow_end into one single write operation.
182
+ */
183
+ QEMUIOVector *data_qiov;
184
+
185
/** Pointer to next L2Meta of the same write request */
186
struct QCowL2Meta *next;
187
188
--
1.8.3.1

--
2.29.2

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/qed.c | 32 ++++++++++++--------------------
 1 file changed, 12 insertions(+), 20 deletions(-)

From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

Rename bytes_covered_by_bitmap_cluster() to
bdrv_dirty_bitmap_serialization_coverage() and make it public.
It is needed because we are going to share it with bitmap loading in
the parallels format.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Message-Id: <20210224104707.88430-2-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 include/block/dirty-bitmap.h |  2 ++
 block/dirty-bitmap.c         | 13 +++++++++++++
 block/qcow2-bitmap.c         | 16 ++--------------
 3 files changed, 17 insertions(+), 14 deletions(-)
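For reference, the helper being made public computes how much of the disk one
chunk of serialized bitmap data covers: each serialized byte holds 8 bits and
each bit covers one granularity-sized piece of the disk. A minimal standalone
illustration of that arithmetic (the 64 KiB figures are only an example):

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Coverage of one serialized chunk:
     * granularity * serialized_chunk_size * 8 bytes of disk. */
    static uint64_t serialization_coverage(uint64_t serialized_chunk_size,
                                           uint64_t granularity)
    {
        return granularity * (serialized_chunk_size << 3);
    }

    int main(void)
    {
        /* example: 64 KiB bitmap clusters and 64 KiB bitmap granularity */
        uint64_t limit = serialization_coverage(64 * 1024, 64 * 1024);
        printf("one 64 KiB bitmap cluster covers %" PRIu64 " bytes (32 GiB)\n", limit);
        return 0;
    }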
diff --git a/block/qed.c b/block/qed.c
19
diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h
8
index XXXXXXX..XXXXXXX 100644
20
index XXXXXXX..XXXXXXX 100644
9
--- a/block/qed.c
21
--- a/include/block/dirty-bitmap.h
10
+++ b/block/qed.c
22
+++ b/include/block/dirty-bitmap.h
11
@@ -XXX,XX +XXX,XX @@ int qed_write_header_sync(BDRVQEDState *s)
23
@@ -XXX,XX +XXX,XX @@ void bdrv_dirty_iter_free(BdrvDirtyBitmapIter *iter);
12
* This function only updates known header fields in-place and does not affect
24
uint64_t bdrv_dirty_bitmap_serialization_size(const BdrvDirtyBitmap *bitmap,
13
* extra data after the QED header.
25
uint64_t offset, uint64_t bytes);
14
*/
26
uint64_t bdrv_dirty_bitmap_serialization_align(const BdrvDirtyBitmap *bitmap);
15
-static void qed_write_header(BDRVQEDState *s, BlockCompletionFunc cb,
27
+uint64_t bdrv_dirty_bitmap_serialization_coverage(int serialized_chunk_size,
16
- void *opaque)
28
+ const BdrvDirtyBitmap *bitmap);
17
+static int qed_write_header(BDRVQEDState *s)
29
void bdrv_dirty_bitmap_serialize_part(const BdrvDirtyBitmap *bitmap,
18
{
30
uint8_t *buf, uint64_t offset,
19
/* We must write full sectors for O_DIRECT but cannot necessarily generate
31
uint64_t bytes);
20
* the data following the header if an unrecognized compat feature is
32
diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
21
@@ -XXX,XX +XXX,XX @@ static void qed_write_header(BDRVQEDState *s, BlockCompletionFunc cb,
33
index XXXXXXX..XXXXXXX 100644
22
ret = 0;
34
--- a/block/dirty-bitmap.c
23
out:
35
+++ b/block/dirty-bitmap.c
24
qemu_vfree(buf);
36
@@ -XXX,XX +XXX,XX @@ uint64_t bdrv_dirty_bitmap_serialization_align(const BdrvDirtyBitmap *bitmap)
25
- cb(opaque, ret);
37
return hbitmap_serialization_align(bitmap->bitmap);
26
+ return ret;
27
}
38
}
28
39
29
static uint64_t qed_max_image_size(uint32_t cluster_size, uint32_t table_size)
40
+/* Return the disk size covered by a chunk of serialized bitmap data. */
30
@@ -XXX,XX +XXX,XX @@ static void qed_unplug_allocating_write_reqs(BDRVQEDState *s)
41
+uint64_t bdrv_dirty_bitmap_serialization_coverage(int serialized_chunk_size,
31
}
42
+ const BdrvDirtyBitmap *bitmap)
43
+{
44
+ uint64_t granularity = bdrv_dirty_bitmap_granularity(bitmap);
45
+ uint64_t limit = granularity * (serialized_chunk_size << 3);
46
+
47
+ assert(QEMU_IS_ALIGNED(limit,
48
+ bdrv_dirty_bitmap_serialization_align(bitmap)));
49
+ return limit;
50
+}
51
+
52
+
53
void bdrv_dirty_bitmap_serialize_part(const BdrvDirtyBitmap *bitmap,
54
uint8_t *buf, uint64_t offset,
55
uint64_t bytes)
56
diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c
57
index XXXXXXX..XXXXXXX 100644
58
--- a/block/qcow2-bitmap.c
59
+++ b/block/qcow2-bitmap.c
60
@@ -XXX,XX +XXX,XX @@ static int free_bitmap_clusters(BlockDriverState *bs, Qcow2BitmapTable *tb)
61
return 0;
32
}
62
}
33
63
34
-static void qed_finish_clear_need_check(void *opaque, int ret)
64
-/* Return the disk size covered by a single qcow2 cluster of bitmap data. */
65
-static uint64_t bytes_covered_by_bitmap_cluster(const BDRVQcow2State *s,
66
- const BdrvDirtyBitmap *bitmap)
35
-{
67
-{
36
- /* Do nothing */
68
- uint64_t granularity = bdrv_dirty_bitmap_granularity(bitmap);
69
- uint64_t limit = granularity * (s->cluster_size << 3);
70
-
71
- assert(QEMU_IS_ALIGNED(limit,
72
- bdrv_dirty_bitmap_serialization_align(bitmap)));
73
- return limit;
37
-}
74
-}
38
-
75
-
39
-static void qed_flush_after_clear_need_check(void *opaque, int ret)
76
/* load_bitmap_data
40
-{
77
* @bitmap_table entries must satisfy specification constraints.
41
- BDRVQEDState *s = opaque;
78
* @bitmap must be cleared */
42
-
79
@@ -XXX,XX +XXX,XX @@ static int load_bitmap_data(BlockDriverState *bs,
43
- bdrv_aio_flush(s->bs, qed_finish_clear_need_check, s);
44
-
45
- /* No need to wait until flush completes */
46
- qed_unplug_allocating_write_reqs(s);
47
-}
48
-
49
static void qed_clear_need_check(void *opaque, int ret)
50
{
51
BDRVQEDState *s = opaque;
52
@@ -XXX,XX +XXX,XX @@ static void qed_clear_need_check(void *opaque, int ret)
53
}
80
}
54
81
55
s->header.features &= ~QED_F_NEED_CHECK;
82
buf = g_malloc(s->cluster_size);
56
- qed_write_header(s, qed_flush_after_clear_need_check, s);
83
- limit = bytes_covered_by_bitmap_cluster(s, bitmap);
57
+ ret = qed_write_header(s);
84
+ limit = bdrv_dirty_bitmap_serialization_coverage(s->cluster_size, bitmap);
58
+ (void) ret;
85
for (i = 0, offset = 0; i < tab_size; ++i, offset += limit) {
59
+
86
uint64_t count = MIN(bm_size - offset, limit);
60
+ qed_unplug_allocating_write_reqs(s);
87
uint64_t entry = bitmap_table[i];
61
+
88
@@ -XXX,XX +XXX,XX @@ static uint64_t *store_bitmap_data(BlockDriverState *bs,
62
+ ret = bdrv_flush(s->bs);
63
+ (void) ret;
64
}
65
66
static void qed_need_check_timer_cb(void *opaque)
67
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
68
{
69
BDRVQEDState *s = acb_to_s(acb);
70
BlockCompletionFunc *cb;
71
+ int ret;
72
73
/* Cancel timer when the first allocating request comes in */
74
if (QSIMPLEQ_EMPTY(&s->allocating_write_reqs)) {
75
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
76
77
if (qed_should_set_need_check(s)) {
78
s->header.features |= QED_F_NEED_CHECK;
79
- qed_write_header(s, cb, acb);
80
+ ret = qed_write_header(s);
81
+ cb(acb, ret);
82
} else {
83
cb(acb, 0);
84
}
89
}
90
91
buf = g_malloc(s->cluster_size);
92
- limit = bytes_covered_by_bitmap_cluster(s, bitmap);
93
+ limit = bdrv_dirty_bitmap_serialization_coverage(s->cluster_size, bitmap);
94
assert(DIV_ROUND_UP(bm_size, limit) == tb_size);
95
96
offset = 0;
85
--
1.8.3.1

--
2.29.2

From: Alberto Garcia <berto@igalia.com>

Qcow2COWRegion has two attributes:

- The offset of the COW region from the start of the first cluster
  touched by the I/O request. Since it's always going to be positive
  and the maximum request size is at most INT_MAX, we can use a
  regular unsigned int to store this offset.

- The size of the COW region in bytes. This is guaranteed to be >= 0,
  so we should use an unsigned type instead.

On x86_64 this reduces the size of Qcow2COWRegion from 16 to 8 bytes.
It will also help keep some assertions simpler now that we know that
there are no negative numbers.

The prototype of do_perform_cow() is also updated to reflect these
changes.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2-cluster.c | 4 ++--
 block/qcow2.h         | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

Actually, the L1 table entry offset is in 512-byte sectors. Fix the spec.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210224104707.88430-3-vsementsov@virtuozzo.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 docs/interop/parallels.txt | 28 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)
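To make the corrected mapping concrete, here is a standalone sketch (not the
parallels driver itself) of turning a byte offset within the bitmap data into
an offset in the image file, following the rules described above; cluster_size
is whatever the image uses, for example 64 KiB:

    #include <stdint.h>

    #define SECTOR_SIZE 512

    /* Returns the image-file offset holding the given byte of bitmap data,
     * or -1 when that bitmap cluster is not stored (entry 0 = all bits zero,
     * entry 1 = all bits one). Other entries are offsets in 512-byte sectors. */
    int64_t bitmap_byte_to_image_offset(const uint64_t *l1_table,
                                        uint64_t bitmap_offset,
                                        uint32_t cluster_size)
    {
        uint64_t entry = l1_table[bitmap_offset / cluster_size];

        if (entry == 0 || entry == 1) {
            return -1;
        }
        return (int64_t)(entry * SECTOR_SIZE + bitmap_offset % cluster_size);
    }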
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
13
diff --git a/docs/interop/parallels.txt b/docs/interop/parallels.txt
30
index XXXXXXX..XXXXXXX 100644
14
index XXXXXXX..XXXXXXX 100644
31
--- a/block/qcow2-cluster.c
15
--- a/docs/interop/parallels.txt
32
+++ b/block/qcow2-cluster.c
16
+++ b/docs/interop/parallels.txt
33
@@ -XXX,XX +XXX,XX @@ int qcow2_encrypt_sectors(BDRVQcow2State *s, int64_t sector_num,
17
@@ -XXX,XX +XXX,XX @@ of its data area are:
34
static int coroutine_fn do_perform_cow(BlockDriverState *bs,
18
28 - 31: l1_size
35
uint64_t src_cluster_offset,
19
The number of entries in the L1 table of the bitmap.
36
uint64_t cluster_offset,
20
37
- int offset_in_cluster,
21
- variable: l1_table (8 * l1_size bytes)
38
- int bytes)
22
- L1 offset table (in bytes)
39
+ unsigned offset_in_cluster,
23
+ variable: L1 offset table (l1_table), size: 8 * l1_size bytes
40
+ unsigned bytes)
24
41
{
25
-A dirty bitmap is stored using a one-level structure for the mapping to host
42
BDRVQcow2State *s = bs->opaque;
26
-clusters - an L1 table.
43
QEMUIOVector qiov;
27
+The dirty bitmap described by this feature extension is stored in a set of
44
diff --git a/block/qcow2.h b/block/qcow2.h
28
+clusters inside the Parallels image file. The offsets of these clusters are
45
index XXXXXXX..XXXXXXX 100644
29
+saved in the L1 offset table specified by the feature extension. Each L1 table
46
--- a/block/qcow2.h
30
+entry is a 64 bit integer as described below:
47
+++ b/block/qcow2.h
31
48
@@ -XXX,XX +XXX,XX @@ typedef struct Qcow2COWRegion {
32
-Given an offset in bytes into the bitmap data, the offset in bytes into the
49
* Offset of the COW region in bytes from the start of the first cluster
33
-image file can be obtained as follows:
50
* touched by the request.
34
+Given an offset in bytes into the bitmap data, corresponding L1 entry is
51
*/
35
52
- uint64_t offset;
36
- offset = l1_table[offset / cluster_size] + (offset % cluster_size)
53
+ unsigned offset;
37
+ l1_table[offset / cluster_size]
54
38
55
/** Number of bytes to copy */
39
-If an L1 table entry is 0, the corresponding cluster of the bitmap is assumed
56
- int nb_bytes;
40
-to be zero.
57
+ unsigned nb_bytes;
41
+If an L1 table entry is 0, all bits in the corresponding cluster of the bitmap
58
} Qcow2COWRegion;
42
+are assumed to be 0.
59
43
60
/**
44
-If an L1 table entry is 1, the corresponding cluster of the bitmap is assumed
45
-to have all bits set.
46
+If an L1 table entry is 1, all bits in the corresponding cluster of the bitmap
47
+are assumed to be 1.
48
49
-If an L1 table entry is not 0 or 1, it allocates a cluster from the data area.
50
+If an L1 table entry is not 0 or 1, it contains the corresponding cluster
51
+offset (in 512b sectors). Given an offset in bytes into the bitmap data the
52
+offset in bytes into the image file can be obtained as follows:
53
+
54
+ offset = l1_table[offset / cluster_size] * 512 + (offset % cluster_size)
61
--
1.8.3.1

--
2.29.2

From: Alberto Garcia <berto@igalia.com>

Instead of passing a single buffer pointer to do_perform_cow_write(),
pass a QEMUIOVector. This will allow us to merge the write requests
for the COW regions and the actual data into a single one.

Although do_perform_cow_read() does not strictly need to change its
API, we're doing it here as well for consistency.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2-cluster.c | 51 ++++++++++++++++++++++++---------------------------
 1 file changed, 24 insertions(+), 27 deletions(-)

From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

We are going to use it in more places; calculating
"s->tracks << BDRV_SECTOR_BITS" doesn't look good.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210224104707.88430-4-vsementsov@virtuozzo.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/parallels.h | 1 +
 block/parallels.c | 8 ++++----
 2 files changed, 5 insertions(+), 4 deletions(-)
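As a rough illustration of why the qcow2 patch above switches from a
(buffer, size) pair to an I/O vector: with a vectored interface the caller can
hand several discontiguous buffers to one request instead of copying them into
a bounce buffer first. A minimal standalone sketch using the plain POSIX
equivalent, writev() (QEMU's QEMUIOVector plays the same role inside the block
layer); the file name is made up:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/uio.h>
    #include <unistd.h>

    int main(void)
    {
        char head[512] = "head", data[4096] = "data", tail[512] = "tail";
        struct iovec iov[3] = {
            { .iov_base = head, .iov_len = sizeof(head) },   /* e.g. COW start region */
            { .iov_base = data, .iov_len = sizeof(data) },   /* guest data */
            { .iov_base = tail, .iov_len = sizeof(tail) },   /* e.g. COW end region */
        };

        int fd = open("example.img", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        /* one system call, one contiguous range in the file, no bounce buffer */
        if (writev(fd, iov, 3) < 0) {
            perror("writev");
        }
        close(fd);
        return 0;
    }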
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
15
diff --git a/block/parallels.h b/block/parallels.h
18
index XXXXXXX..XXXXXXX 100644
16
index XXXXXXX..XXXXXXX 100644
19
--- a/block/qcow2-cluster.c
17
--- a/block/parallels.h
20
+++ b/block/qcow2-cluster.c
18
+++ b/block/parallels.h
21
@@ -XXX,XX +XXX,XX @@ int qcow2_encrypt_sectors(BDRVQcow2State *s, int64_t sector_num,
19
@@ -XXX,XX +XXX,XX @@ typedef struct BDRVParallelsState {
22
static int coroutine_fn do_perform_cow_read(BlockDriverState *bs,
20
ParallelsPreallocMode prealloc_mode;
23
uint64_t src_cluster_offset,
21
24
unsigned offset_in_cluster,
22
unsigned int tracks;
25
- uint8_t *buffer,
23
+ unsigned int cluster_size;
26
- unsigned bytes)
24
27
+ QEMUIOVector *qiov)
25
unsigned int off_multiplier;
28
{
26
Error *migration_blocker;
29
- QEMUIOVector qiov;
27
diff --git a/block/parallels.c b/block/parallels.c
30
- struct iovec iov = { .iov_base = buffer, .iov_len = bytes };
28
index XXXXXXX..XXXXXXX 100644
29
--- a/block/parallels.c
30
+++ b/block/parallels.c
31
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn parallels_co_check(BlockDriverState *bs,
31
int ret;
32
int ret;
32
33
uint32_t i;
33
- if (bytes == 0) {
34
bool flush_bat = false;
34
+ if (qiov->size == 0) {
35
- int cluster_size = s->tracks << BDRV_SECTOR_BITS;
35
return 0;
36
37
size = bdrv_getlength(bs->file->bs);
38
if (size < 0) {
39
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn parallels_co_check(BlockDriverState *bs,
40
high_off = off;
41
}
42
43
- if (prev_off != 0 && (prev_off + cluster_size) != off) {
44
+ if (prev_off != 0 && (prev_off + s->cluster_size) != off) {
45
res->bfi.fragmented_clusters++;
46
}
47
prev_off = off;
48
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn parallels_co_check(BlockDriverState *bs,
49
}
36
}
50
}
37
51
38
- qemu_iovec_init_external(&qiov, &iov, 1);
52
- res->image_end_offset = high_off + cluster_size;
39
-
53
+ res->image_end_offset = high_off + s->cluster_size;
40
BLKDBG_EVENT(bs->file, BLKDBG_COW_READ);
54
if (size > res->image_end_offset) {
41
55
int64_t count;
42
if (!bs->drv) {
56
- count = DIV_ROUND_UP(size - res->image_end_offset, cluster_size);
43
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn do_perform_cow_read(BlockDriverState *bs,
57
+ count = DIV_ROUND_UP(size - res->image_end_offset, s->cluster_size);
44
* which can lead to deadlock when block layer copy-on-read is enabled.
58
fprintf(stderr, "%s space leaked at the end of the image %" PRId64 "\n",
45
*/
59
fix & BDRV_FIX_LEAKS ? "Repairing" : "ERROR",
46
ret = bs->drv->bdrv_co_preadv(bs, src_cluster_offset + offset_in_cluster,
60
size - res->image_end_offset);
47
- bytes, &qiov, 0);
61
@@ -XXX,XX +XXX,XX @@ static int parallels_open(BlockDriverState *bs, QDict *options, int flags,
48
+ qiov->size, qiov, 0);
62
ret = -EFBIG;
49
if (ret < 0) {
50
return ret;
51
}
52
@@ -XXX,XX +XXX,XX @@ static bool coroutine_fn do_perform_cow_encrypt(BlockDriverState *bs,
53
static int coroutine_fn do_perform_cow_write(BlockDriverState *bs,
54
uint64_t cluster_offset,
55
unsigned offset_in_cluster,
56
- uint8_t *buffer,
57
- unsigned bytes)
58
+ QEMUIOVector *qiov)
59
{
60
- QEMUIOVector qiov;
61
- struct iovec iov = { .iov_base = buffer, .iov_len = bytes };
62
int ret;
63
64
- if (bytes == 0) {
65
+ if (qiov->size == 0) {
66
return 0;
67
}
68
69
- qemu_iovec_init_external(&qiov, &iov, 1);
70
-
71
ret = qcow2_pre_write_overlap_check(bs, 0,
72
- cluster_offset + offset_in_cluster, bytes);
73
+ cluster_offset + offset_in_cluster, qiov->size);
74
if (ret < 0) {
75
return ret;
76
}
77
78
BLKDBG_EVENT(bs->file, BLKDBG_COW_WRITE);
79
ret = bdrv_co_pwritev(bs->file, cluster_offset + offset_in_cluster,
80
- bytes, &qiov, 0);
81
+ qiov->size, qiov, 0);
82
if (ret < 0) {
83
return ret;
84
}
85
@@ -XXX,XX +XXX,XX @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
86
unsigned data_bytes = end->offset - (start->offset + start->nb_bytes);
87
bool merge_reads;
88
uint8_t *start_buffer, *end_buffer;
89
+ QEMUIOVector qiov;
90
int ret;
91
92
assert(start->nb_bytes <= UINT_MAX - end->nb_bytes);
93
@@ -XXX,XX +XXX,XX @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
94
/* The part of the buffer where the end region is located */
95
end_buffer = start_buffer + buffer_size - end->nb_bytes;
96
97
+ qemu_iovec_init(&qiov, 1);
98
+
99
qemu_co_mutex_unlock(&s->lock);
100
/* First we read the existing data from both COW regions. We
101
* either read the whole region in one go, or the start and end
102
* regions separately. */
103
if (merge_reads) {
104
- ret = do_perform_cow_read(bs, m->offset, start->offset,
105
- start_buffer, buffer_size);
106
+ qemu_iovec_add(&qiov, start_buffer, buffer_size);
107
+ ret = do_perform_cow_read(bs, m->offset, start->offset, &qiov);
108
} else {
109
- ret = do_perform_cow_read(bs, m->offset, start->offset,
110
- start_buffer, start->nb_bytes);
111
+ qemu_iovec_add(&qiov, start_buffer, start->nb_bytes);
112
+ ret = do_perform_cow_read(bs, m->offset, start->offset, &qiov);
113
if (ret < 0) {
114
goto fail;
115
}
116
117
- ret = do_perform_cow_read(bs, m->offset, end->offset,
118
- end_buffer, end->nb_bytes);
119
+ qemu_iovec_reset(&qiov);
120
+ qemu_iovec_add(&qiov, end_buffer, end->nb_bytes);
121
+ ret = do_perform_cow_read(bs, m->offset, end->offset, &qiov);
122
}
123
if (ret < 0) {
124
goto fail;
125
@@ -XXX,XX +XXX,XX @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
126
}
127
128
/* And now we can write everything */
129
- ret = do_perform_cow_write(bs, m->alloc_offset, start->offset,
130
- start_buffer, start->nb_bytes);
131
+ qemu_iovec_reset(&qiov);
132
+ qemu_iovec_add(&qiov, start_buffer, start->nb_bytes);
133
+ ret = do_perform_cow_write(bs, m->alloc_offset, start->offset, &qiov);
134
if (ret < 0) {
135
goto fail;
63
goto fail;
136
}
64
}
137
65
+ s->cluster_size = s->tracks << BDRV_SECTOR_BITS;
138
- ret = do_perform_cow_write(bs, m->alloc_offset, end->offset,
66
139
- end_buffer, end->nb_bytes);
67
s->bat_size = le32_to_cpu(ph.bat_entries);
140
+ qemu_iovec_reset(&qiov);
68
if (s->bat_size > INT_MAX / sizeof(uint32_t)) {
141
+ qemu_iovec_add(&qiov, end_buffer, end->nb_bytes);
142
+ ret = do_perform_cow_write(bs, m->alloc_offset, end->offset, &qiov);
143
fail:
144
qemu_co_mutex_lock(&s->lock);
145
146
@@ -XXX,XX +XXX,XX @@ fail:
147
}
148
149
qemu_vfree(start_buffer);
150
+ qemu_iovec_destroy(&qiov);
151
return ret;
152
}
153
154
--
1.8.3.1

--
2.29.2

From: Alberto Garcia <berto@igalia.com>

This patch splits do_perform_cow() into three separate functions to
read, encrypt and write the COW regions.

perform_cow() can now read both regions first, then encrypt them and
finally write them to disk. The memory allocation is also done in
this function now, using one single buffer large enough to hold both
regions.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2-cluster.c | 117 +++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 87 insertions(+), 30 deletions(-)

From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210224104707.88430-5-vsementsov@virtuozzo.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/parallels.h     |   6 +-
 block/parallels-ext.c | 300 ++++++++++++++++++++++++++++++++++++++++++
 block/parallels.c     |  18 +++
 block/meson.build     |   3 +-
 4 files changed, 325 insertions(+), 2 deletions(-)
 create mode 100644 block/parallels-ext.c
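The new block/parallels-ext.c code locates the dirty bitmap by walking a chain
of feature headers inside the format-extension cluster. A rough standalone
sketch of that walk (not the driver code): the magic values come from the
patch, while the 8-byte alignment of each feature's data area and the omitted
little-endian byte swapping are assumptions made for this sketch.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define END_OF_FEATURES_MAGIC 0x0ULL
    #define DIRTY_BITMAP_MAGIC    0x20385FAE252CB34AULL

    struct feature_header {          /* mirrors ParallelsFeatureHeader */
        uint64_t magic;
        uint64_t flags;
        uint32_t data_size;
        uint32_t unused;
    } __attribute__((packed));

    void walk_features(const uint8_t *pos, size_t remaining)
    {
        while (remaining >= sizeof(struct feature_header)) {
            struct feature_header fh;
            memcpy(&fh, pos, sizeof(fh));        /* headers may be unaligned */

            if (fh.magic == END_OF_FEATURES_MAGIC) {
                break;
            }
            if (fh.magic == DIRTY_BITMAP_MAGIC) {
                printf("dirty bitmap feature, %u bytes of data\n", fh.data_size);
            }

            /* skip the header plus its data area (assumed 8-byte aligned) */
            size_t step = sizeof(fh) + (((size_t)fh.data_size + 7) & ~(size_t)7);
            if (step > remaining) {
                break;                           /* corrupted extension */
            }
            pos += step;
            remaining -= step;
        }
    }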
17
14
18
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
15
diff --git a/block/parallels.h b/block/parallels.h
19
index XXXXXXX..XXXXXXX 100644
16
index XXXXXXX..XXXXXXX 100644
20
--- a/block/qcow2-cluster.c
17
--- a/block/parallels.h
21
+++ b/block/qcow2-cluster.c
18
+++ b/block/parallels.h
22
@@ -XXX,XX +XXX,XX @@ int qcow2_encrypt_sectors(BDRVQcow2State *s, int64_t sector_num,
19
@@ -XXX,XX +XXX,XX @@ typedef struct ParallelsHeader {
23
return 0;
20
uint64_t nb_sectors;
24
}
21
uint32_t inuse;
25
22
uint32_t data_off;
26
-static int coroutine_fn do_perform_cow(BlockDriverState *bs,
23
- char padding[12];
27
- uint64_t src_cluster_offset,
24
+ uint32_t flags;
28
- uint64_t cluster_offset,
25
+ uint64_t ext_off;
29
- unsigned offset_in_cluster,
26
} QEMU_PACKED ParallelsHeader;
30
- unsigned bytes)
27
31
+static int coroutine_fn do_perform_cow_read(BlockDriverState *bs,
28
typedef enum ParallelsPreallocMode {
32
+ uint64_t src_cluster_offset,
29
@@ -XXX,XX +XXX,XX @@ typedef struct BDRVParallelsState {
33
+ unsigned offset_in_cluster,
30
Error *migration_blocker;
34
+ uint8_t *buffer,
31
} BDRVParallelsState;
35
+ unsigned bytes)
32
36
{
33
+int parallels_read_format_extension(BlockDriverState *bs,
37
- BDRVQcow2State *s = bs->opaque;
34
+ int64_t ext_off, Error **errp);
38
QEMUIOVector qiov;
35
+
39
- struct iovec iov;
36
#endif
40
+ struct iovec iov = { .iov_base = buffer, .iov_len = bytes };
37
diff --git a/block/parallels-ext.c b/block/parallels-ext.c
41
int ret;
38
new file mode 100644
42
39
index XXXXXXX..XXXXXXX
43
if (bytes == 0) {
40
--- /dev/null
44
return 0;
41
+++ b/block/parallels-ext.c
45
}
42
@@ -XXX,XX +XXX,XX @@
46
43
+/*
47
- iov.iov_len = bytes;
44
+ * Support of Parallels Format Extension. It's a part of Parallels format
48
- iov.iov_base = qemu_try_blockalign(bs, iov.iov_len);
45
+ * driver.
49
- if (iov.iov_base == NULL) {
46
+ *
50
- return -ENOMEM;
47
+ * Copyright (c) 2021 Virtuozzo International GmbH
51
- }
48
+ *
52
-
49
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
53
qemu_iovec_init_external(&qiov, &iov, 1);
50
+ * of this software and associated documentation files (the "Software"), to deal
54
51
+ * in the Software without restriction, including without limitation the rights
55
BLKDBG_EVENT(bs->file, BLKDBG_COW_READ);
52
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
56
53
+ * copies of the Software, and to permit persons to whom the Software is
57
if (!bs->drv) {
54
+ * furnished to do so, subject to the following conditions:
58
- ret = -ENOMEDIUM;
55
+ *
59
- goto out;
56
+ * The above copyright notice and this permission notice shall be included in
60
+ return -ENOMEDIUM;
57
+ * all copies or substantial portions of the Software.
61
}
58
+ *
62
59
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
63
/* Call .bdrv_co_readv() directly instead of using the public block-layer
60
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
64
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn do_perform_cow(BlockDriverState *bs,
61
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
65
ret = bs->drv->bdrv_co_preadv(bs, src_cluster_offset + offset_in_cluster,
62
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
66
bytes, &qiov, 0);
63
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
67
if (ret < 0) {
64
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
68
- goto out;
65
+ * THE SOFTWARE.
69
+ return ret;
66
+ */
70
}
67
+
71
68
+#include "qemu/osdep.h"
72
- if (bs->encrypted) {
69
+#include "qapi/error.h"
73
+ return 0;
70
+#include "block/block_int.h"
71
+#include "parallels.h"
72
+#include "crypto/hash.h"
73
+#include "qemu/uuid.h"
74
+
75
+#define PARALLELS_FORMAT_EXTENSION_MAGIC 0xAB234CEF23DCEA87ULL
76
+
77
+#define PARALLELS_END_OF_FEATURES_MAGIC 0x0ULL
78
+#define PARALLELS_DIRTY_BITMAP_FEATURE_MAGIC 0x20385FAE252CB34AULL
79
+
80
+typedef struct ParallelsFormatExtensionHeader {
81
+ uint64_t magic; /* PARALLELS_FORMAT_EXTENSION_MAGIC */
82
+ uint8_t check_sum[16];
83
+} QEMU_PACKED ParallelsFormatExtensionHeader;
84
+
85
+typedef struct ParallelsFeatureHeader {
86
+ uint64_t magic;
87
+ uint64_t flags;
88
+ uint32_t data_size;
89
+ uint32_t _unused;
90
+} QEMU_PACKED ParallelsFeatureHeader;
91
+
92
+typedef struct ParallelsDirtyBitmapFeature {
93
+ uint64_t size;
94
+ uint8_t id[16];
95
+ uint32_t granularity;
96
+ uint32_t l1_size;
97
+ /* L1 table follows */
98
+} QEMU_PACKED ParallelsDirtyBitmapFeature;
99
+
100
+/* Given L1 table read bitmap data from the image and populate @bitmap */
101
+static int parallels_load_bitmap_data(BlockDriverState *bs,
102
+ const uint64_t *l1_table,
103
+ uint32_t l1_size,
104
+ BdrvDirtyBitmap *bitmap,
105
+ Error **errp)
106
+{
107
+ BDRVParallelsState *s = bs->opaque;
108
+ int ret = 0;
109
+ uint64_t offset, limit;
110
+ uint64_t bm_size = bdrv_dirty_bitmap_size(bitmap);
111
+ uint8_t *buf = NULL;
112
+ uint64_t i, tab_size =
113
+ DIV_ROUND_UP(bdrv_dirty_bitmap_serialization_size(bitmap, 0, bm_size),
114
+ s->cluster_size);
115
+
116
+ if (tab_size != l1_size) {
117
+ error_setg(errp, "Bitmap table size %" PRIu32 " does not correspond "
118
+ "to bitmap size and cluster size. Expected %" PRIu64,
119
+ l1_size, tab_size);
120
+ return -EINVAL;
121
+ }
122
+
123
+ buf = qemu_blockalign(bs, s->cluster_size);
124
+ limit = bdrv_dirty_bitmap_serialization_coverage(s->cluster_size, bitmap);
125
+ for (i = 0, offset = 0; i < tab_size; ++i, offset += limit) {
126
+ uint64_t count = MIN(bm_size - offset, limit);
127
+ uint64_t entry = l1_table[i];
128
+
129
+ if (entry == 0) {
130
+ /* No need to deserialize zeros because @bitmap is cleared. */
131
+ continue;
132
+ }
133
+
134
+ if (entry == 1) {
135
+ bdrv_dirty_bitmap_deserialize_ones(bitmap, offset, count, false);
136
+ } else {
137
+ ret = bdrv_pread(bs->file, entry << BDRV_SECTOR_BITS, buf,
138
+ s->cluster_size);
139
+ if (ret < 0) {
140
+ error_setg_errno(errp, -ret,
141
+ "Failed to read bitmap data cluster");
142
+ goto finish;
143
+ }
144
+ bdrv_dirty_bitmap_deserialize_part(bitmap, buf, offset, count,
145
+ false);
146
+ }
147
+ }
148
+ ret = 0;
149
+
150
+ bdrv_dirty_bitmap_deserialize_finish(bitmap);
151
+
152
+finish:
153
+ qemu_vfree(buf);
154
+
155
+ return ret;
74
+}
156
+}
75
+
157
+
76
+static bool coroutine_fn do_perform_cow_encrypt(BlockDriverState *bs,
158
+/*
77
+ uint64_t src_cluster_offset,
159
+ * @data buffer (of @data_size size) is the Dirty bitmaps feature which
78
+ unsigned offset_in_cluster,
160
+ * consists of ParallelsDirtyBitmapFeature followed by L1 table.
79
+ uint8_t *buffer,
161
+ */
80
+ unsigned bytes)
162
+static BdrvDirtyBitmap *parallels_load_bitmap(BlockDriverState *bs,
163
+ uint8_t *data,
164
+ size_t data_size,
165
+ Error **errp)
81
+{
166
+{
82
+ if (bytes && bs->encrypted) {
167
+ int ret;
83
+ BDRVQcow2State *s = bs->opaque;
168
+ ParallelsDirtyBitmapFeature bf;
84
int64_t sector = (src_cluster_offset + offset_in_cluster)
169
+ g_autofree uint64_t *l1_table = NULL;
85
>> BDRV_SECTOR_BITS;
170
+ BdrvDirtyBitmap *bitmap;
86
assert(s->cipher);
171
+ QemuUUID uuid;
87
assert((offset_in_cluster & ~BDRV_SECTOR_MASK) == 0);
172
+ char uuidstr[UUID_FMT_LEN + 1];
88
assert((bytes & ~BDRV_SECTOR_MASK) == 0);
173
+ int i;
89
- if (qcow2_encrypt_sectors(s, sector, iov.iov_base, iov.iov_base,
174
+
90
+ if (qcow2_encrypt_sectors(s, sector, buffer, buffer,
175
+ if (data_size < sizeof(bf)) {
91
bytes >> BDRV_SECTOR_BITS, true, NULL) < 0) {
176
+ error_setg(errp, "Too small Bitmap Feature area in Parallels Format "
92
- ret = -EIO;
177
+ "Extension: %zu bytes, expected at least %zu bytes",
93
- goto out;
178
+ data_size, sizeof(bf));
94
+ return false;
179
+ return NULL;
95
}
180
+ }
96
}
181
+ memcpy(&bf, data, sizeof(bf));
97
+ return true;
182
+ bf.size = le64_to_cpu(bf.size);
183
+ bf.granularity = le32_to_cpu(bf.granularity) << BDRV_SECTOR_BITS;
184
+ bf.l1_size = le32_to_cpu(bf.l1_size);
185
+ data += sizeof(bf);
186
+ data_size -= sizeof(bf);
187
+
188
+ if (bf.size != bs->total_sectors) {
189
+ error_setg(errp, "Bitmap size (in sectors) %" PRId64 " differs from "
190
+ "disk size in sectors %" PRId64, bf.size, bs->total_sectors);
191
+ return NULL;
192
+ }
193
+
194
+ if (bf.l1_size * sizeof(uint64_t) > data_size) {
195
+ error_setg(errp, "Bitmaps feature corrupted: l1 table exceeds "
196
+ "extension data_size");
197
+ return NULL;
198
+ }
199
+
200
+ memcpy(&uuid, bf.id, sizeof(uuid));
201
+ qemu_uuid_unparse(&uuid, uuidstr);
202
+ bitmap = bdrv_create_dirty_bitmap(bs, bf.granularity, uuidstr, errp);
203
+ if (!bitmap) {
204
+ return NULL;
205
+ }
206
+
207
+ l1_table = g_new(uint64_t, bf.l1_size);
208
+ for (i = 0; i < bf.l1_size; i++, data += sizeof(uint64_t)) {
209
+ l1_table[i] = ldq_le_p(data);
210
+ }
211
+
212
+ ret = parallels_load_bitmap_data(bs, l1_table, bf.l1_size, bitmap, errp);
213
+ if (ret < 0) {
214
+ bdrv_release_dirty_bitmap(bitmap);
215
+ return NULL;
216
+ }
217
+
218
+ /* We support format extension only for RO parallels images. */
219
+ assert(!(bs->open_flags & BDRV_O_RDWR));
220
+ bdrv_dirty_bitmap_set_readonly(bitmap, true);
221
+
222
+ return bitmap;
98
+}
223
+}
99
+
224
+
100
+static int coroutine_fn do_perform_cow_write(BlockDriverState *bs,
225
+static int parallels_parse_format_extension(BlockDriverState *bs,
101
+ uint64_t cluster_offset,
226
+ uint8_t *ext_cluster, Error **errp)
102
+ unsigned offset_in_cluster,
103
+ uint8_t *buffer,
104
+ unsigned bytes)
105
+{
227
+{
106
+ QEMUIOVector qiov;
228
+ BDRVParallelsState *s = bs->opaque;
107
+ struct iovec iov = { .iov_base = buffer, .iov_len = bytes };
108
+ int ret;
229
+ int ret;
109
+
230
+ int remaining = s->cluster_size;
110
+ if (bytes == 0) {
231
+ uint8_t *pos = ext_cluster;
111
+ return 0;
232
+ ParallelsFormatExtensionHeader eh;
112
+ }
233
+ g_autofree uint8_t *hash = NULL;
113
+
234
+ size_t hash_len = 0;
114
+ qemu_iovec_init_external(&qiov, &iov, 1);
235
+ GSList *bitmaps = NULL, *el;
115
236
+
116
ret = qcow2_pre_write_overlap_check(bs, 0,
237
+ memcpy(&eh, pos, sizeof(eh));
117
cluster_offset + offset_in_cluster, bytes);
238
+ eh.magic = le64_to_cpu(eh.magic);
118
if (ret < 0) {
239
+ pos += sizeof(eh);
119
- goto out;
240
+ remaining -= sizeof(eh);
120
+ return ret;
241
+
121
}
242
+ if (eh.magic != PARALLELS_FORMAT_EXTENSION_MAGIC) {
122
243
+ error_setg(errp, "Wrong parallels Format Extension magic: 0x%" PRIx64
123
BLKDBG_EVENT(bs->file, BLKDBG_COW_WRITE);
244
+ ", expected: 0x%llx", eh.magic,
124
ret = bdrv_co_pwritev(bs->file, cluster_offset + offset_in_cluster,
245
+ PARALLELS_FORMAT_EXTENSION_MAGIC);
125
bytes, &qiov, 0);
246
+ goto fail;
126
if (ret < 0) {
247
+ }
127
- goto out;
248
+
128
+ return ret;
249
+ ret = qcrypto_hash_bytes(QCRYPTO_HASH_ALG_MD5, (char *)pos, remaining,
129
}
250
+ &hash, &hash_len, errp);
130
131
- ret = 0;
132
-out:
133
- qemu_vfree(iov.iov_base);
134
- return ret;
135
+ return 0;
136
}
137
138
139
@@ -XXX,XX +XXX,XX @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
140
BDRVQcow2State *s = bs->opaque;
141
Qcow2COWRegion *start = &m->cow_start;
142
Qcow2COWRegion *end = &m->cow_end;
143
+ unsigned buffer_size;
144
+ uint8_t *start_buffer, *end_buffer;
145
int ret;
146
147
+ assert(start->nb_bytes <= UINT_MAX - end->nb_bytes);
148
+
149
if (start->nb_bytes == 0 && end->nb_bytes == 0) {
150
return 0;
151
}
152
153
+ /* Reserve a buffer large enough to store the data from both the
154
+ * start and end COW regions. Add some padding in the middle if
155
+ * necessary to make sure that the end region is optimally aligned */
156
+ buffer_size = QEMU_ALIGN_UP(start->nb_bytes, bdrv_opt_mem_align(bs)) +
157
+ end->nb_bytes;
158
+ start_buffer = qemu_try_blockalign(bs, buffer_size);
159
+ if (start_buffer == NULL) {
160
+ return -ENOMEM;
161
+ }
162
+ /* The part of the buffer where the end region is located */
163
+ end_buffer = start_buffer + buffer_size - end->nb_bytes;
164
+
165
qemu_co_mutex_unlock(&s->lock);
166
- ret = do_perform_cow(bs, m->offset, m->alloc_offset,
167
- start->offset, start->nb_bytes);
168
+ /* First we read the existing data from both COW regions */
169
+ ret = do_perform_cow_read(bs, m->offset, start->offset,
170
+ start_buffer, start->nb_bytes);
171
if (ret < 0) {
172
goto fail;
173
}
174
175
- ret = do_perform_cow(bs, m->offset, m->alloc_offset,
176
- end->offset, end->nb_bytes);
177
+ ret = do_perform_cow_read(bs, m->offset, end->offset,
178
+ end_buffer, end->nb_bytes);
179
+ if (ret < 0) {
251
+ if (ret < 0) {
180
+ goto fail;
252
+ goto fail;
181
+ }
253
+ }
182
+
254
+
183
+ /* Encrypt the data if necessary before writing it */
255
+ if (hash_len != sizeof(eh.check_sum) ||
184
+ if (bs->encrypted) {
256
+ memcmp(hash, eh.check_sum, sizeof(eh.check_sum)) != 0) {
185
+ if (!do_perform_cow_encrypt(bs, m->offset, start->offset,
257
+ error_setg(errp, "Wrong checksum in Format Extension header. Format "
186
+ start_buffer, start->nb_bytes) ||
258
+ "extension is corrupted.");
187
+ !do_perform_cow_encrypt(bs, m->offset, end->offset,
259
+ goto fail;
188
+ end_buffer, end->nb_bytes)) {
260
+ }
189
+ ret = -EIO;
261
+
262
+ while (true) {
263
+ ParallelsFeatureHeader fh;
264
+ BdrvDirtyBitmap *bitmap;
265
+
266
+ if (remaining < sizeof(fh)) {
267
+ error_setg(errp, "Can not read feature header, as remaining bytes "
268
+ "(%d) in Format Extension is less than Feature header "
269
+ "size (%zu)", remaining, sizeof(fh));
190
+ goto fail;
270
+ goto fail;
191
+ }
271
+ }
192
+ }
272
+
193
+
273
+ memcpy(&fh, pos, sizeof(fh));
194
+ /* And now we can write everything */
274
+ pos += sizeof(fh);
195
+ ret = do_perform_cow_write(bs, m->alloc_offset, start->offset,
275
+ remaining -= sizeof(fh);
196
+ start_buffer, start->nb_bytes);
276
+
277
+ fh.magic = le64_to_cpu(fh.magic);
278
+ fh.flags = le64_to_cpu(fh.flags);
279
+ fh.data_size = le32_to_cpu(fh.data_size);
280
+
281
+ if (fh.flags) {
282
+ error_setg(errp, "Flags for extension feature are unsupported");
283
+ goto fail;
284
+ }
285
+
286
+ if (fh.data_size > remaining) {
287
+ error_setg(errp, "Feature data_size exceedes Format Extension "
288
+ "cluster");
289
+ goto fail;
290
+ }
291
+
292
+ switch (fh.magic) {
293
+ case PARALLELS_END_OF_FEATURES_MAGIC:
294
+ return 0;
295
+
296
+ case PARALLELS_DIRTY_BITMAP_FEATURE_MAGIC:
297
+ bitmap = parallels_load_bitmap(bs, pos, fh.data_size, errp);
298
+ if (!bitmap) {
299
+ goto fail;
300
+ }
301
+ bitmaps = g_slist_append(bitmaps, bitmap);
302
+ break;
303
+
304
+ default:
305
+ error_setg(errp, "Unknown feature: 0x%" PRIu64, fh.magic);
306
+ goto fail;
307
+ }
308
+
309
+ pos = ext_cluster + QEMU_ALIGN_UP(pos + fh.data_size - ext_cluster, 8);
310
+ }
311
+
312
+fail:
313
+ for (el = bitmaps; el; el = el->next) {
314
+ bdrv_release_dirty_bitmap(el->data);
315
+ }
316
+ g_slist_free(bitmaps);
317
+
318
+ return -EINVAL;
319
+}
320
+
321
+int parallels_read_format_extension(BlockDriverState *bs,
322
+ int64_t ext_off, Error **errp)
323
+{
324
+ BDRVParallelsState *s = bs->opaque;
325
+ int ret;
326
+ uint8_t *ext_cluster = qemu_blockalign(bs, s->cluster_size);
327
+
328
+ assert(ext_off > 0);
329
+
330
+ ret = bdrv_pread(bs->file, ext_off, ext_cluster, s->cluster_size);
197
+ if (ret < 0) {
331
+ if (ret < 0) {
198
+ goto fail;
332
+ error_setg_errno(errp, -ret, "Failed to read Format Extension cluster");
199
+ }
333
+ goto out;
200
334
+ }
201
+ ret = do_perform_cow_write(bs, m->alloc_offset, end->offset,
335
+
202
+ end_buffer, end->nb_bytes);
336
+ ret = parallels_parse_format_extension(bs, ext_cluster, errp);
203
fail:
337
+
204
qemu_co_mutex_lock(&s->lock);
338
+out:
205
339
+ qemu_vfree(ext_cluster);
206
@@ -XXX,XX +XXX,XX @@ fail:
340
+
207
qcow2_cache_depends_on_flush(s->l2_table_cache);
341
+ return ret;
342
+}
343
diff --git a/block/parallels.c b/block/parallels.c
344
index XXXXXXX..XXXXXXX 100644
345
--- a/block/parallels.c
346
+++ b/block/parallels.c
347
@@ -XXX,XX +XXX,XX @@
348
*/
349
350
#include "qemu/osdep.h"
351
+#include "qemu/error-report.h"
352
#include "qapi/error.h"
353
#include "block/block_int.h"
354
#include "block/qdict.h"
355
@@ -XXX,XX +XXX,XX @@ static int parallels_open(BlockDriverState *bs, QDict *options, int flags,
356
goto fail_options;
208
}
357
}
209
358
210
+ qemu_vfree(start_buffer);
359
+ if (ph.ext_off) {
211
return ret;
360
+ if (flags & BDRV_O_RDWR) {
212
}
361
+ /*
213
362
+ * It's unsafe to open image RW if there is an extension (as we
363
+ * don't support it). But the parallels driver in QEMU historically
364
+ * ignores the extension, so print a warning and carry on.
365
+ */
366
+ warn_report("Format Extension ignored in RW mode");
367
+ } else {
368
+ ret = parallels_read_format_extension(
369
+ bs, le64_to_cpu(ph.ext_off) << BDRV_SECTOR_BITS, errp);
370
+ if (ret < 0) {
371
+ goto fail;
372
+ }
373
+ }
374
+ }
375
+
376
if ((flags & BDRV_O_RDWR) && !(flags & BDRV_O_INACTIVE)) {
377
s->header->inuse = cpu_to_le32(HEADER_INUSE_MAGIC);
378
ret = parallels_update_header(bs);
379
diff --git a/block/meson.build b/block/meson.build
380
index XXXXXXX..XXXXXXX 100644
381
--- a/block/meson.build
382
+++ b/block/meson.build
383
@@ -XXX,XX +XXX,XX @@ block_ss.add(when: 'CONFIG_QED', if_true: files(
384
'qed-table.c',
385
'qed.c',
386
))
387
-block_ss.add(when: [libxml2, 'CONFIG_PARALLELS'], if_true: files('parallels.c'))
388
+block_ss.add(when: [libxml2, 'CONFIG_PARALLELS'],
389
+ if_true: files('parallels.c', 'parallels-ext.c'))
390
block_ss.add(when: 'CONFIG_WIN32', if_true: files('file-win32.c', 'win32-aio.c'))
391
block_ss.add(when: 'CONFIG_POSIX', if_true: [files('file-posix.c'), coref, iokit])
392
block_ss.add(when: libiscsi, if_true: files('iscsi-opts.c'))
214
--
393
--
215
1.8.3.1
394
2.29.2
216
395
217
396
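
As a reading aid for the parallels-ext.c hunks above: parallels_read_format_extension() loads one cluster that begins with a checksummed extension header and is followed by a sequence of feature headers, each with its own data area and 8-byte alignment between features. The sketch below reconstructs the packed, little-endian on-disk layout purely from the fields that parallels_parse_format_extension() and parallels_load_bitmap() read; the exact field order and the trailing padding field are assumptions for illustration, not taken from the Parallels specification.

    /* Illustrative layout only; in QEMU these structures are declared packed
     * and all integer fields are stored little-endian on disk. */
    #include <stdint.h>

    typedef struct ParallelsFormatExtensionHeader {
        uint64_t magic;          /* PARALLELS_FORMAT_EXTENSION_MAGIC */
        uint8_t  check_sum[16];  /* MD5 over the rest of the cluster */
    } ParallelsFormatExtensionHeader;

    typedef struct ParallelsFeatureHeader {
        uint64_t magic;          /* feature type, or END_OF_FEATURES to stop */
        uint64_t flags;          /* must be 0; unknown flags are rejected */
        uint32_t data_size;      /* bytes of feature data that follow */
        uint32_t unused;         /* assumed padding to an 8-byte multiple */
    } ParallelsFeatureHeader;

    typedef struct ParallelsDirtyBitmapFeature {
        uint64_t size;           /* disk size in sectors, must match the image */
        uint8_t  id[16];         /* UUID; its string form becomes the bitmap name */
        uint32_t granularity;    /* in sectors; shifted to bytes by the loader */
        uint32_t l1_size;        /* number of uint64_t L1 entries that follow */
        /* L1 table of bitmap-data cluster offsets follows */
    } ParallelsDirtyBitmapFeature;

The walk stops at a feature header whose magic is PARALLELS_END_OF_FEATURES_MAGIC, which is why the parsing loop above returns 0 for that case.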
1
From: Alberto Garcia <berto@igalia.com>
1
From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2
2
3
There used to be throttle_timers_{detach,attach}_aio_context() calls
3
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
4
in bdrv_set_aio_context(), but since 7ca7f0f6db1fedd28d490795d778cf239
4
Message-Id: <20210224104707.88430-6-vsementsov@virtuozzo.com>
5
they are now in blk_set_aio_context().
5
Reviewed-by: Denis V. Lunev <den@openvz.org>
6
7
Signed-off-by: Alberto Garcia <berto@igalia.com>
8
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
9
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
6
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
10
---
7
---
11
block/throttle-groups.c | 2 +-
8
tests/qemu-iotests/iotests.py | 10 ++++++++++
12
1 file changed, 1 insertion(+), 1 deletion(-)
9
1 file changed, 10 insertions(+)
13
10
14
diff --git a/block/throttle-groups.c b/block/throttle-groups.c
11
diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
15
index XXXXXXX..XXXXXXX 100644
12
index XXXXXXX..XXXXXXX 100644
16
--- a/block/throttle-groups.c
13
--- a/tests/qemu-iotests/iotests.py
17
+++ b/block/throttle-groups.c
14
+++ b/tests/qemu-iotests/iotests.py
18
@@ -XXX,XX +XXX,XX @@
15
@@ -XXX,XX +XXX,XX @@
19
* Again, all this is handled internally and is mostly transparent to
16
#
20
* the outside. The 'throttle_timers' field however has an additional
17
21
* constraint because it may be temporarily invalid (see for example
18
import atexit
22
- * bdrv_set_aio_context()). Therefore in this file a thread will
19
+import bz2
23
+ * blk_set_aio_context()). Therefore in this file a thread will
20
from collections import OrderedDict
24
* access some other BlockBackend's timers only after verifying that
21
import faulthandler
25
* that BlockBackend has throttled requests in the queue.
22
import io
26
*/
23
@@ -XXX,XX +XXX,XX @@
24
import logging
25
import os
26
import re
27
+import shutil
28
import signal
29
import struct
30
import subprocess
31
@@ -XXX,XX +XXX,XX @@
32
os.environ.get('IMGKEYSECRET', '')
33
luks_default_key_secret_opt = 'key-secret=keysec0'
34
35
+sample_img_dir = os.environ['SAMPLE_IMG_DIR']
36
+
37
+
38
+def unarchive_sample_image(sample, fname):
39
+ sample_fname = os.path.join(sample_img_dir, sample + '.bz2')
40
+ with bz2.open(sample_fname) as f_in, open(fname, 'wb') as f_out:
41
+ shutil.copyfileobj(f_in, f_out)
42
+
43
44
def qemu_tool_pipe_and_status(tool: str, args: Sequence[str],
45
connect_stderr: bool = True) -> Tuple[str, int]:
27
--
46
--
28
1.8.3.1
47
2.29.2
29
48
30
49
1
When qemu exits, all running jobs should be cancelled successfully.
1
From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2
This adds a test covering all types of block jobs that currently
3
exist in qemu.
4
2
3
Test support for reading bitmap from parallels image format.
4
parallels-with-bitmap.bz2 is generated on Virtuozzo by
5
parallels-with-bitmap.sh
6
7
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
8
Message-Id: <20210224104707.88430-7-vsementsov@virtuozzo.com>
9
Reviewed-by: Denis V. Lunev <den@openvz.org>
5
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
10
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
6
Reviewed-by: Eric Blake <eblake@redhat.com>
7
---
11
---
8
tests/qemu-iotests/185 | 206 +++++++++++++++++++++++++++++++++++++++++++++
12
.../sample_images/parallels-with-bitmap.bz2 | Bin 0 -> 203 bytes
9
tests/qemu-iotests/185.out | 59 +++++++++++++
13
.../sample_images/parallels-with-bitmap.sh | 51 ++++++++++++++++
10
tests/qemu-iotests/group | 1 +
14
.../qemu-iotests/tests/parallels-read-bitmap | 55 ++++++++++++++++++
11
3 files changed, 266 insertions(+)
15
.../tests/parallels-read-bitmap.out | 6 ++
12
create mode 100755 tests/qemu-iotests/185
16
4 files changed, 112 insertions(+)
13
create mode 100644 tests/qemu-iotests/185.out
17
create mode 100644 tests/qemu-iotests/sample_images/parallels-with-bitmap.bz2
18
create mode 100755 tests/qemu-iotests/sample_images/parallels-with-bitmap.sh
19
create mode 100755 tests/qemu-iotests/tests/parallels-read-bitmap
20
create mode 100644 tests/qemu-iotests/tests/parallels-read-bitmap.out
14
21
15
diff --git a/tests/qemu-iotests/185 b/tests/qemu-iotests/185
22
diff --git a/tests/qemu-iotests/sample_images/parallels-with-bitmap.bz2 b/tests/qemu-iotests/sample_images/parallels-with-bitmap.bz2
23
new file mode 100644
24
index XXXXXXX..XXXXXXX
25
GIT binary patch
26
literal 203
27
zcmV;+05tzXT4*^jL0KkKS@=;0bpT+Hf7|^?Km<xfFyKQJ7=Y^F-%vt;00~Ysa6|-=
28
zk&7Szk`SoS002EkfMftPG<ipnsiCK}K_sNmm}me3FiZr%Oaf_u5F8kD;mB_~cxD-r
29
z5P$(X{&Tq5C`<xK02D?NNdN+t$~z$m00O|zFh^ynq*yaCtkn+NZzWom<#OEoF`?zb
30
zv(i3x^K~wt!aLPcRBP+PckUsIh6*LgjYSh0`}#7hMC9NR5D)+W0d&8Mxgwk>NPH-R
31
Fx`3oHQ9u9y
32
33
literal 0
34
HcmV?d00001
35
36
diff --git a/tests/qemu-iotests/sample_images/parallels-with-bitmap.sh b/tests/qemu-iotests/sample_images/parallels-with-bitmap.sh
16
new file mode 100755
37
new file mode 100755
17
index XXXXXXX..XXXXXXX
38
index XXXXXXX..XXXXXXX
18
--- /dev/null
39
--- /dev/null
19
+++ b/tests/qemu-iotests/185
40
+++ b/tests/qemu-iotests/sample_images/parallels-with-bitmap.sh
20
@@ -XXX,XX +XXX,XX @@
41
@@ -XXX,XX +XXX,XX @@
21
+#!/bin/bash
42
+#!/bin/bash
22
+#
43
+#
23
+# Test exiting qemu while jobs are still running
44
+# Test parallels load bitmap
24
+#
45
+#
25
+# Copyright (C) 2017 Red Hat, Inc.
46
+# Copyright (c) 2021 Virtuozzo International GmbH.
26
+#
47
+#
27
+# This program is free software; you can redistribute it and/or modify
48
+# This program is free software; you can redistribute it and/or modify
28
+# it under the terms of the GNU General Public License as published by
49
+# it under the terms of the GNU General Public License as published by
29
+# the Free Software Foundation; either version 2 of the License, or
50
+# the Free Software Foundation; either version 2 of the License, or
30
+# (at your option) any later version.
51
+# (at your option) any later version.
...
...
36
+#
57
+#
37
+# You should have received a copy of the GNU General Public License
58
+# You should have received a copy of the GNU General Public License
38
+# along with this program. If not, see <http://www.gnu.org/licenses/>.
59
+# along with this program. If not, see <http://www.gnu.org/licenses/>.
39
+#
60
+#
40
+
61
+
41
+# creator
62
+CT=parallels-with-bitmap-ct
42
+owner=kwolf@redhat.com
63
+DIR=$PWD/parallels-with-bitmap-dir
64
+IMG=$DIR/root.hds
65
+XML=$DIR/DiskDescriptor.xml
66
+TARGET=parallels-with-bitmap.bz2
43
+
67
+
44
+seq=`basename $0`
68
+rm -rf $DIR
45
+echo "QA output created by $seq"
46
+
69
+
47
+here=`pwd`
70
+prlctl create $CT --vmtype ct
48
+status=1 # failure is the default!
71
+prlctl set $CT --device-add hdd --image $DIR --recreate --size 2G
49
+
72
+
50
+MIG_SOCKET="${TEST_DIR}/migrate"
73
+# cleanup the image
74
+qemu-img create -f parallels $IMG 64G
51
+
75
+
52
+_cleanup()
76
+# create bitmap
53
+{
77
+prlctl backup $CT
54
+ rm -f "${TEST_IMG}.mid"
55
+ rm -f "${TEST_IMG}.copy"
56
+ _cleanup_test_img
57
+ _cleanup_qemu
58
+}
59
+trap "_cleanup; exit \$status" 0 1 2 3 15
60
+
78
+
61
+# get standard environment, filters and checks
79
+prlctl set $CT --device-del hdd1
62
+. ./common.rc
80
+prlctl destroy $CT
63
+. ./common.filter
64
+. ./common.qemu
65
+
81
+
66
+_supported_fmt qcow2
82
+dev=$(ploop mount $XML | sed -n 's/^Adding delta dev=\(\/dev\/ploop[0-9]\+\).*/\1/p')
67
+_supported_proto file
83
+dd if=/dev/zero of=$dev bs=64K seek=5 count=2 oflag=direct
68
+_supported_os Linux
84
+dd if=/dev/zero of=$dev bs=64K seek=30 count=1 oflag=direct
85
+dd if=/dev/zero of=$dev bs=64K seek=10 count=3 oflag=direct
86
+ploop umount $XML # bitmap name will be in the output
69
+
87
+
70
+size=64M
88
+bzip2 -z $IMG
71
+TEST_IMG="${TEST_IMG}.base" _make_test_img $size
72
+
89
+
73
+echo
90
+mv $IMG.bz2 $TARGET
74
+echo === Starting VM ===
75
+echo
76
+
91
+
77
+qemu_comm_method="qmp"
92
+rm -rf $DIR
93
diff --git a/tests/qemu-iotests/tests/parallels-read-bitmap b/tests/qemu-iotests/tests/parallels-read-bitmap
94
new file mode 100755
95
index XXXXXXX..XXXXXXX
96
--- /dev/null
97
+++ b/tests/qemu-iotests/tests/parallels-read-bitmap
98
@@ -XXX,XX +XXX,XX @@
99
+#!/usr/bin/env python3
100
+#
101
+# Test parallels load bitmap
102
+#
103
+# Copyright (c) 2021 Virtuozzo International GmbH.
104
+#
105
+# This program is free software; you can redistribute it and/or modify
106
+# it under the terms of the GNU General Public License as published by
107
+# the Free Software Foundation; either version 2 of the License, or
108
+# (at your option) any later version.
109
+#
110
+# This program is distributed in the hope that it will be useful,
111
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
112
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
113
+# GNU General Public License for more details.
114
+#
115
+# You should have received a copy of the GNU General Public License
116
+# along with this program. If not, see <http://www.gnu.org/licenses/>.
117
+#
78
+
118
+
79
+_launch_qemu \
119
+import json
80
+ -drive file="${TEST_IMG}.base",cache=$CACHEMODE,driver=$IMGFMT,id=disk
120
+import iotests
81
+h=$QEMU_HANDLE
121
+from iotests import qemu_nbd_popen, qemu_img_pipe, log, file_path
82
+_send_qemu_cmd $h "{ 'execute': 'qmp_capabilities' }" 'return'
83
+
122
+
84
+echo
123
+iotests.script_initialize(supported_fmts=['parallels'])
85
+echo === Creating backing chain ===
86
+echo
87
+
124
+
88
+_send_qemu_cmd $h \
125
+nbd_sock = file_path('nbd-sock', base_dir=iotests.sock_dir)
89
+ "{ 'execute': 'blockdev-snapshot-sync',
126
+disk = iotests.file_path('disk')
90
+ 'arguments': { 'device': 'disk',
127
+bitmap = 'e4f2eed0-37fe-4539-b50b-85d2e7fd235f'
91
+ 'snapshot-file': '$TEST_IMG.mid',
128
+nbd_opts = f'driver=nbd,server.type=unix,server.path={nbd_sock}' \
92
+ 'format': '$IMGFMT',
129
+ f',x-dirty-bitmap=qemu:dirty-bitmap:{bitmap}'
93
+ 'mode': 'absolute-paths' } }" \
94
+ "return"
95
+
130
+
96
+_send_qemu_cmd $h \
97
+ "{ 'execute': 'human-monitor-command',
98
+ 'arguments': { 'command-line':
99
+ 'qemu-io disk \"write 0 4M\"' } }" \
100
+ "return"
101
+
131
+
102
+_send_qemu_cmd $h \
132
+iotests.unarchive_sample_image('parallels-with-bitmap', disk)
103
+ "{ 'execute': 'blockdev-snapshot-sync',
104
+ 'arguments': { 'device': 'disk',
105
+ 'snapshot-file': '$TEST_IMG',
106
+ 'format': '$IMGFMT',
107
+ 'mode': 'absolute-paths' } }" \
108
+ "return"
109
+
133
+
110
+echo
111
+echo === Start commit job and exit qemu ===
112
+echo
113
+
134
+
114
+# Note that the reference output intentionally includes the 'offset' field in
135
+with qemu_nbd_popen('--read-only', f'--socket={nbd_sock}',
115
+# BLOCK_JOB_CANCELLED events for all of the following block jobs. They are
136
+ f'--bitmap={bitmap}', '-f', iotests.imgfmt, disk):
116
+# predictable and any change in the offsets would hint at a bug in the job
137
+ out = qemu_img_pipe('map', '--output=json', '--image-opts', nbd_opts)
117
+# throttling code.
138
+ chunks = json.loads(out)
118
+#
139
+ cluster = 64 * 1024
119
+# In order to achieve these predictable offsets, all of the following tests
120
+# use speed=65536. Each job will perform exactly one iteration before it has
121
+# to sleep at least for a second, which is plenty of time for the 'quit' QMP
122
+# command to be received (after receiving the command, the rest runs
123
+# synchronously, so jobs can arbitrarily continue or complete).
124
+#
125
+# The buffer size for commit and streaming is 512k (waiting for 8 seconds after
126
+# the first request), for active commit and mirror it's large enough to cover
127
+# the full 4M, and for backup it's the qcow2 cluster size, which we know is
128
+# 64k. As all of these are at least as large as the speed, we are sure that the
129
+# offset doesn't advance after the first iteration before qemu exits.
130
+
140
+
131
+_send_qemu_cmd $h \
141
+ log('dirty clusters (cluster size is 64K):')
132
+ "{ 'execute': 'block-commit',
142
+ for c in chunks:
133
+ 'arguments': { 'device': 'disk',
143
+ assert c['start'] % cluster == 0
134
+ 'base':'$TEST_IMG.base',
144
+ assert c['length'] % cluster == 0
135
+ 'top': '$TEST_IMG.mid',
145
+ if c['data']:
136
+ 'speed': 65536 } }" \
146
+ continue
137
+ "return"
138
+
147
+
139
+_send_qemu_cmd $h "{ 'execute': 'quit' }" "return"
148
+ a = c['start'] // cluster
140
+wait=1 _cleanup_qemu
149
+ b = (c['start'] + c['length']) // cluster
141
+
150
+ if b - a > 1:
142
+echo
151
+ log(f'{a}-{b-1}')
143
+echo === Start active commit job and exit qemu ===
152
+ else:
144
+echo
153
+ log(a)
145
+
154
diff --git a/tests/qemu-iotests/tests/parallels-read-bitmap.out b/tests/qemu-iotests/tests/parallels-read-bitmap.out
146
+_launch_qemu \
147
+ -drive file="${TEST_IMG}",cache=$CACHEMODE,driver=$IMGFMT,id=disk
148
+h=$QEMU_HANDLE
149
+_send_qemu_cmd $h "{ 'execute': 'qmp_capabilities' }" 'return'
150
+
151
+_send_qemu_cmd $h \
152
+ "{ 'execute': 'block-commit',
153
+ 'arguments': { 'device': 'disk',
154
+ 'base':'$TEST_IMG.base',
155
+ 'speed': 65536 } }" \
156
+ "return"
157
+
158
+_send_qemu_cmd $h "{ 'execute': 'quit' }" "return"
159
+wait=1 _cleanup_qemu
160
+
161
+echo
162
+echo === Start mirror job and exit qemu ===
163
+echo
164
+
165
+_launch_qemu \
166
+ -drive file="${TEST_IMG}",cache=$CACHEMODE,driver=$IMGFMT,id=disk
167
+h=$QEMU_HANDLE
168
+_send_qemu_cmd $h "{ 'execute': 'qmp_capabilities' }" 'return'
169
+
170
+_send_qemu_cmd $h \
171
+ "{ 'execute': 'drive-mirror',
172
+ 'arguments': { 'device': 'disk',
173
+ 'target': '$TEST_IMG.copy',
174
+ 'format': '$IMGFMT',
175
+ 'sync': 'full',
176
+ 'speed': 65536 } }" \
177
+ "return"
178
+
179
+_send_qemu_cmd $h "{ 'execute': 'quit' }" "return"
180
+wait=1 _cleanup_qemu
181
+
182
+echo
183
+echo === Start backup job and exit qemu ===
184
+echo
185
+
186
+_launch_qemu \
187
+ -drive file="${TEST_IMG}",cache=$CACHEMODE,driver=$IMGFMT,id=disk
188
+h=$QEMU_HANDLE
189
+_send_qemu_cmd $h "{ 'execute': 'qmp_capabilities' }" 'return'
190
+
191
+_send_qemu_cmd $h \
192
+ "{ 'execute': 'drive-backup',
193
+ 'arguments': { 'device': 'disk',
194
+ 'target': '$TEST_IMG.copy',
195
+ 'format': '$IMGFMT',
196
+ 'sync': 'full',
197
+ 'speed': 65536 } }" \
198
+ "return"
199
+
200
+_send_qemu_cmd $h "{ 'execute': 'quit' }" "return"
201
+wait=1 _cleanup_qemu
202
+
203
+echo
204
+echo === Start streaming job and exit qemu ===
205
+echo
206
+
207
+_launch_qemu \
208
+ -drive file="${TEST_IMG}",cache=$CACHEMODE,driver=$IMGFMT,id=disk
209
+h=$QEMU_HANDLE
210
+_send_qemu_cmd $h "{ 'execute': 'qmp_capabilities' }" 'return'
211
+
212
+_send_qemu_cmd $h \
213
+ "{ 'execute': 'block-stream',
214
+ 'arguments': { 'device': 'disk',
215
+ 'speed': 65536 } }" \
216
+ "return"
217
+
218
+_send_qemu_cmd $h "{ 'execute': 'quit' }" "return"
219
+wait=1 _cleanup_qemu
220
+
221
+_check_test_img
222
+
223
+# success, all done
224
+echo "*** done"
225
+rm -f $seq.full
226
+status=0
227
diff --git a/tests/qemu-iotests/185.out b/tests/qemu-iotests/185.out
228
new file mode 100644
155
new file mode 100644
229
index XXXXXXX..XXXXXXX
156
index XXXXXXX..XXXXXXX
230
--- /dev/null
157
--- /dev/null
231
+++ b/tests/qemu-iotests/185.out
158
+++ b/tests/qemu-iotests/tests/parallels-read-bitmap.out
232
@@ -XXX,XX +XXX,XX @@
159
@@ -XXX,XX +XXX,XX @@
233
+QA output created by 185
160
+Start NBD server
234
+Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=67108864
161
+dirty clusters (cluster size is 64K):
235
+
162
+5-6
236
+=== Starting VM ===
163
+10-12
237
+
164
+30
238
+{"return": {}}
165
+Kill NBD server
239
+
240
+=== Creating backing chain ===
241
+
242
+Formatting 'TEST_DIR/t.qcow2.mid', fmt=qcow2 size=67108864 backing_file=TEST_DIR/t.qcow2.base backing_fmt=qcow2 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
243
+{"return": {}}
244
+wrote 4194304/4194304 bytes at offset 0
245
+4 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
246
+{"return": ""}
247
+Formatting 'TEST_DIR/t.qcow2', fmt=qcow2 size=67108864 backing_file=TEST_DIR/t.qcow2.mid backing_fmt=qcow2 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
248
+{"return": {}}
249
+
250
+=== Start commit job and exit qemu ===
251
+
252
+{"return": {}}
253
+{"return": {}}
254
+{"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false}}
255
+{"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "disk", "len": 67108864, "offset": 524288, "speed": 65536, "type": "commit"}}
256
+
257
+=== Start active commit job and exit qemu ===
258
+
259
+{"return": {}}
260
+{"return": {}}
261
+{"return": {}}
262
+{"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false}}
263
+{"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "disk", "len": 4194304, "offset": 4194304, "speed": 65536, "type": "commit"}}
264
+
265
+=== Start mirror job and exit qemu ===
266
+
267
+{"return": {}}
268
+Formatting 'TEST_DIR/t.qcow2.copy', fmt=qcow2 size=67108864 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
269
+{"return": {}}
270
+{"return": {}}
271
+{"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false}}
272
+{"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "disk", "len": 4194304, "offset": 4194304, "speed": 65536, "type": "mirror"}}
273
+
274
+=== Start backup job and exit qemu ===
275
+
276
+{"return": {}}
277
+Formatting 'TEST_DIR/t.qcow2.copy', fmt=qcow2 size=67108864 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
278
+{"return": {}}
279
+{"return": {}}
280
+{"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false}}
281
+{"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "disk", "len": 67108864, "offset": 65536, "speed": 65536, "type": "backup"}}
282
+
283
+=== Start streaming job and exit qemu ===
284
+
285
+{"return": {}}
286
+{"return": {}}
287
+{"return": {}}
288
+{"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false}}
289
+{"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "disk", "len": 67108864, "offset": 524288, "speed": 65536, "type": "stream"}}
290
+No errors were found on the image.
291
+*** done
292
diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
293
index XXXXXXX..XXXXXXX 100644
294
--- a/tests/qemu-iotests/group
295
+++ b/tests/qemu-iotests/group
296
@@ -XXX,XX +XXX,XX @@
297
181 rw auto migration
298
182 rw auto quick
299
183 rw auto migration
300
+185 rw auto
301
--
166
--
302
1.8.3.1
167
2.29.2
303
168
304
169
1
After _cleanup_qemu(), test cases should be able to start the next qemu
1
From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2
process and call _cleanup_qemu() for that one as well. For this to work
3
cleanly, we need to improve the cleanup so that the second invocation
4
doesn't try to kill the qemu instances from the first invocation a
5
second time (which would result in error messages).
6
2
3
Add new parallels-ext.c and myself as co-maintainer.
4
5
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
6
Message-Id: <20210304095151.19358-1-vsementsov@virtuozzo.com>
7
Reviewed-by: Denis V. Lunev <den@openvz.org>
7
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
8
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
8
Reviewed-by: Eric Blake <eblake@redhat.com>
9
Reviewed-by: Max Reitz <mreitz@redhat.com>
10
---
9
---
11
tests/qemu-iotests/common.qemu | 3 +++
10
MAINTAINERS | 3 +++
12
1 file changed, 3 insertions(+)
11
1 file changed, 3 insertions(+)
13
12
14
diff --git a/tests/qemu-iotests/common.qemu b/tests/qemu-iotests/common.qemu
13
diff --git a/MAINTAINERS b/MAINTAINERS
15
index XXXXXXX..XXXXXXX 100644
14
index XXXXXXX..XXXXXXX 100644
16
--- a/tests/qemu-iotests/common.qemu
15
--- a/MAINTAINERS
17
+++ b/tests/qemu-iotests/common.qemu
16
+++ b/MAINTAINERS
18
@@ -XXX,XX +XXX,XX @@ function _cleanup_qemu()
17
@@ -XXX,XX +XXX,XX @@ F: block/dmg.c
19
rm -f "${QEMU_FIFO_IN}_${i}" "${QEMU_FIFO_OUT}_${i}"
18
parallels
20
eval "exec ${QEMU_IN[$i]}<&-" # close file descriptors
19
M: Stefan Hajnoczi <stefanha@redhat.com>
21
eval "exec ${QEMU_OUT[$i]}<&-"
20
M: Denis V. Lunev <den@openvz.org>
22
+
21
+M: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
23
+ unset QEMU_IN[$i]
22
L: qemu-block@nongnu.org
24
+ unset QEMU_OUT[$i]
23
S: Supported
25
done
24
F: block/parallels.c
26
}
25
+F: block/parallels-ext.c
26
F: docs/interop/parallels.txt
27
+T: git https://src.openvz.org/scm/~vsementsov/qemu.git parallels
28
29
qed
30
M: Stefan Hajnoczi <stefanha@redhat.com>
27
--
31
--
28
1.8.3.1
32
2.29.2
29
33
30
34
Deleted patch
1
This adds documentation for the -blockdev options that apply to all
2
nodes independent of the block driver used.
3
1
4
All options that are shared by -blockdev and -drive are now explained in
5
the section for -blockdev. The documentation of -drive mentions that all
6
-blockdev options are accepted as well.
7
8
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
9
Reviewed-by: Eric Blake <eblake@redhat.com>
10
Reviewed-by: Max Reitz <mreitz@redhat.com>
11
---
12
qemu-options.hx | 108 +++++++++++++++++++++++++++++++++++++++++---------------
13
1 file changed, 79 insertions(+), 29 deletions(-)
14
15
diff --git a/qemu-options.hx b/qemu-options.hx
16
index XXXXXXX..XXXXXXX 100644
17
--- a/qemu-options.hx
18
+++ b/qemu-options.hx
19
@@ -XXX,XX +XXX,XX @@ DEF("blockdev", HAS_ARG, QEMU_OPTION_blockdev,
20
" [,read-only=on|off][,detect-zeroes=on|off|unmap]\n"
21
" [,driver specific parameters...]\n"
22
" configure a block backend\n", QEMU_ARCH_ALL)
23
+STEXI
24
+@item -blockdev @var{option}[,@var{option}[,@var{option}[,...]]]
25
+@findex -blockdev
26
+
27
+Define a new block driver node.
28
+
29
+@table @option
30
+@item Valid options for any block driver node:
31
+
32
+@table @code
33
+@item driver
34
+Specifies the block driver to use for the given node.
35
+@item node-name
36
+This defines the name of the block driver node by which it will be referenced
37
+later. The name must be unique, i.e. it must not match the name of a different
38
+block driver node, or (if you use @option{-drive} as well) the ID of a drive.
39
+
40
+If no node name is specified, it is automatically generated. The generated node
41
+name is not intended to be predictable and changes between QEMU invocations.
42
+For the top level, an explicit node name must be specified.
43
+@item read-only
44
+Open the node read-only. Guest write attempts will fail.
45
+@item cache.direct
46
+The host page cache can be avoided with @option{cache.direct=on}. This will
47
+attempt to do disk IO directly to the guest's memory. QEMU may still perform an
48
+internal copy of the data.
49
+@item cache.no-flush
50
+In case you don't care about data integrity over host failures, you can use
51
+@option{cache.no-flush=on}. This option tells QEMU that it never needs to write
52
+any data to the disk but can instead keep things in cache. If anything goes
53
+wrong, like your host losing power, the disk storage getting disconnected
54
+accidentally, etc. your image will most probably be rendered unusable.
55
+@item discard=@var{discard}
56
+@var{discard} is one of "ignore" (or "off") or "unmap" (or "on") and controls
57
+whether @code{discard} (also known as @code{trim} or @code{unmap}) requests are
58
+ignored or passed to the filesystem. Some machine types may not support
59
+discard requests.
60
+@item detect-zeroes=@var{detect-zeroes}
61
+@var{detect-zeroes} is "off", "on" or "unmap" and enables the automatic
62
+conversion of plain zero writes by the OS to driver specific optimized
63
+zero write commands. You may even choose "unmap" if @var{discard} is set
64
+to "unmap" to allow a zero write to be converted to an @code{unmap} operation.
65
+@end table
66
+
67
+@end table
68
+
69
+ETEXI
70
71
DEF("drive", HAS_ARG, QEMU_OPTION_drive,
72
"-drive [file=file][,if=type][,bus=n][,unit=m][,media=d][,index=i]\n"
73
@@ -XXX,XX +XXX,XX @@ STEXI
74
@item -drive @var{option}[,@var{option}[,@var{option}[,...]]]
75
@findex -drive
76
77
-Define a new drive. Valid options are:
78
+Define a new drive. This includes creating a block driver node (the backend) as
79
+well as a guest device, and is mostly a shortcut for defining the corresponding
80
+@option{-blockdev} and @option{-device} options.
81
+
82
+@option{-drive} accepts all options that are accepted by @option{-blockdev}. In
83
+addition, it knows the following options:
84
85
@table @option
86
@item file=@var{file}
87
@@ -XXX,XX +XXX,XX @@ These options have the same definition as they have in @option{-hdachs}.
88
@var{snapshot} is "on" or "off" and controls snapshot mode for the given drive
89
(see @option{-snapshot}).
90
@item cache=@var{cache}
91
-@var{cache} is "none", "writeback", "unsafe", "directsync" or "writethrough" and controls how the host cache is used to access block data.
92
+@var{cache} is "none", "writeback", "unsafe", "directsync" or "writethrough"
93
+and controls how the host cache is used to access block data. This is a
94
+shortcut that sets the @option{cache.direct} and @option{cache.no-flush}
95
+options (as in @option{-blockdev}), and additionally @option{cache.writeback},
96
+which provides a default for the @option{write-cache} option of block guest
97
+devices (as in @option{-device}). The modes correspond to the following
98
+settings:
99
+
100
+@c Our texi2pod.pl script doesn't support @multitable, so fall back to using
101
+@c plain ASCII art (well, UTF-8 art really). This looks okay both in the manpage
102
+@c and the HTML output.
103
+@example
104
+@ │ cache.writeback cache.direct cache.no-flush
105
+─────────────┼─────────────────────────────────────────────────
106
+writeback │ on off off
107
+none │ on on off
108
+writethrough │ off off off
109
+directsync │ off on off
110
+unsafe │ on off on
111
+@end example
112
+
113
+The default mode is @option{cache=writeback}.
114
+
115
@item aio=@var{aio}
116
@var{aio} is "threads", or "native" and selects between pthread based disk I/O and native Linux AIO.
117
-@item discard=@var{discard}
118
-@var{discard} is one of "ignore" (or "off") or "unmap" (or "on") and controls whether @dfn{discard} (also known as @dfn{trim} or @dfn{unmap}) requests are ignored or passed to the filesystem. Some machine types may not support discard requests.
119
@item format=@var{format}
120
Specify which disk @var{format} will be used rather than detecting
121
the format. Can be used to specify format=raw to avoid interpreting
122
@@ -XXX,XX +XXX,XX @@ Specify which @var{action} to take on write and read errors. Valid actions are:
123
"report" (report the error to the guest), "enospc" (pause QEMU only if the
124
host disk is full; report the error to the guest otherwise).
125
The default setting is @option{werror=enospc} and @option{rerror=report}.
126
-@item readonly
127
-Open drive @option{file} as read-only. Guest write attempts will fail.
128
@item copy-on-read=@var{copy-on-read}
129
@var{copy-on-read} is "on" or "off" and enables whether to copy read backing
130
file sectors into the image file.
131
-@item detect-zeroes=@var{detect-zeroes}
132
-@var{detect-zeroes} is "off", "on" or "unmap" and enables the automatic
133
-conversion of plain zero writes by the OS to driver specific optimized
134
-zero write commands. You may even choose "unmap" if @var{discard} is set
135
-to "unmap" to allow a zero write to be converted to an UNMAP operation.
136
@item bps=@var{b},bps_rd=@var{r},bps_wr=@var{w}
137
Specify bandwidth throttling limits in bytes per second, either for all request
138
types or for reads or writes only. Small values can lead to timeouts or hangs
139
@@ -XXX,XX +XXX,XX @@ prevent guests from circumventing throttling limits by using many small disks
140
instead of a single larger disk.
141
@end table
142
143
-By default, the @option{cache=writeback} mode is used. It will report data
144
+By default, the @option{cache.writeback=on} mode is used. It will report data
145
writes as completed as soon as the data is present in the host page cache.
146
This is safe as long as your guest OS makes sure to correctly flush disk caches
147
where needed. If your guest OS does not handle volatile disk write caches
148
correctly and your host crashes or loses power, then the guest may experience
149
data corruption.
150
151
-For such guests, you should consider using @option{cache=writethrough}. This
152
+For such guests, you should consider using @option{cache.writeback=off}. This
153
means that the host page cache will be used to read and write data, but write
154
notification will be sent to the guest only after QEMU has made sure to flush
155
each write to the disk. Be aware that this has a major impact on performance.
156
157
-The host page cache can be avoided entirely with @option{cache=none}. This will
158
-attempt to do disk IO directly to the guest's memory. QEMU may still perform
159
-an internal copy of the data. Note that this is considered a writeback mode and
160
-the guest OS must handle the disk write cache correctly in order to avoid data
161
-corruption on host crashes.
162
-
163
-The host page cache can be avoided while only sending write notifications to
164
-the guest when the data has been flushed to the disk using
165
-@option{cache=directsync}.
166
-
167
-In case you don't care about data integrity over host failures, use
168
-@option{cache=unsafe}. This option tells QEMU that it never needs to write any
169
-data to the disk but can instead keep things in cache. If anything goes wrong,
170
-like your host losing power, the disk storage getting disconnected accidentally,
171
-etc. your image will most probably be rendered unusable. When using
172
-the @option{-snapshot} option, unsafe caching is always used.
173
+When using the @option{-snapshot} option, unsafe caching is always used.
174
175
Copy-on-read avoids accessing the same backing file sectors repeatedly and is
176
useful when the backing file is over a slow network. By default copy-on-read
177
--
178
1.8.3.1
179
180
Deleted patch
1
This documents the driver-specific options for the raw, qcow2 and file
2
block drivers for the man page. For everything else, we refer to the
3
QAPI documentation.
4
1
5
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
6
Reviewed-by: Eric Blake <eblake@redhat.com>
7
Reviewed-by: Max Reitz <mreitz@redhat.com>
8
---
9
qemu-options.hx | 115 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
10
1 file changed, 114 insertions(+), 1 deletion(-)
11
12
diff --git a/qemu-options.hx b/qemu-options.hx
13
index XXXXXXX..XXXXXXX 100644
14
--- a/qemu-options.hx
15
+++ b/qemu-options.hx
16
@@ -XXX,XX +XXX,XX @@ STEXI
17
@item -blockdev @var{option}[,@var{option}[,@var{option}[,...]]]
18
@findex -blockdev
19
20
-Define a new block driver node.
21
+Define a new block driver node. Some of the options apply to all block drivers,
22
+other options are only accepted for a specific block driver. See below for a
23
+list of generic options and options for the most common block drivers.
24
+
25
+Options that expect a reference to another node (e.g. @code{file}) can be
26
+given in two ways. Either you specify the node name of an already existing node
27
+(file=@var{node-name}), or you define a new node inline, adding options
28
+for the referenced node after a dot (file.filename=@var{path},file.aio=native).
29
+
30
+A block driver node created with @option{-blockdev} can be used for a guest
31
+device by specifying its node name for the @code{drive} property in a
32
+@option{-device} argument that defines a block device.
33
34
@table @option
35
@item Valid options for any block driver node:
36
@@ -XXX,XX +XXX,XX @@ zero write commands. You may even choose "unmap" if @var{discard} is set
37
to "unmap" to allow a zero write to be converted to an @code{unmap} operation.
38
@end table
39
40
+@item Driver-specific options for @code{file}
41
+
42
+This is the protocol-level block driver for accessing regular files.
43
+
44
+@table @code
45
+@item filename
46
+The path to the image file in the local filesystem
47
+@item aio
48
+Specifies the AIO backend (threads/native, default: threads)
49
+@end table
50
+Example:
51
+@example
52
+-blockdev driver=file,node-name=disk,filename=disk.img
53
+@end example
54
+
55
+@item Driver-specific options for @code{raw}
56
+
57
+This is the image format block driver for raw images. It is usually
58
+stacked on top of a protocol level block driver such as @code{file}.
59
+
60
+@table @code
61
+@item file
62
+Reference to or definition of the data source block driver node
63
+(e.g. a @code{file} driver node)
64
+@end table
65
+Example 1:
66
+@example
67
+-blockdev driver=file,node-name=disk_file,filename=disk.img
68
+-blockdev driver=raw,node-name=disk,file=disk_file
69
+@end example
70
+Example 2:
71
+@example
72
+-blockdev driver=raw,node-name=disk,file.driver=file,file.filename=disk.img
73
+@end example
74
+
75
+@item Driver-specific options for @code{qcow2}
76
+
77
+This is the image format block driver for qcow2 images. It is usually
78
+stacked on top of a protocol level block driver such as @code{file}.
79
+
80
+@table @code
81
+@item file
82
+Reference to or definition of the data source block driver node
83
+(e.g. a @code{file} driver node)
84
+
85
+@item backing
86
+Reference to or definition of the backing file block device (default is taken
87
+from the image file). It is allowed to pass an empty string here in order to
88
+disable the default backing file.
89
+
90
+@item lazy-refcounts
91
+Whether to enable the lazy refcounts feature (on/off; default is taken from the
92
+image file)
93
+
94
+@item cache-size
95
+The maximum total size of the L2 table and refcount block caches in bytes
96
+(default: 1048576 bytes or 8 clusters, whichever is larger)
97
+
98
+@item l2-cache-size
99
+The maximum size of the L2 table cache in bytes
100
+(default: 4/5 of the total cache size)
101
+
102
+@item refcount-cache-size
103
+The maximum size of the refcount block cache in bytes
104
+(default: 1/5 of the total cache size)
105
+
106
+@item cache-clean-interval
107
+Clean unused entries in the L2 and refcount caches. The interval is in seconds.
108
+The default value is 0 and it disables this feature.
109
+
110
+@item pass-discard-request
111
+Whether discard requests to the qcow2 device should be forwarded to the data
112
+source (on/off; default: on if discard=unmap is specified, off otherwise)
113
+
114
+@item pass-discard-snapshot
115
+Whether discard requests for the data source should be issued when a snapshot
116
+operation (e.g. deleting a snapshot) frees clusters in the qcow2 file (on/off;
117
+default: on)
118
+
119
+@item pass-discard-other
120
+Whether discard requests for the data source should be issued on other
121
+occasions where a cluster gets freed (on/off; default: off)
122
+
123
+@item overlap-check
124
+Which overlap checks to perform for writes to the image
125
+(none/constant/cached/all; default: cached). For details or finer
126
+granularity control refer to the QAPI documentation of @code{blockdev-add}.
127
+@end table
128
+
129
+Example 1:
130
+@example
131
+-blockdev driver=file,node-name=my_file,filename=/tmp/disk.qcow2
132
+-blockdev driver=qcow2,node-name=hda,file=my_file,overlap-check=none,cache-size=16777216
133
+@end example
134
+Example 2:
135
+@example
136
+-blockdev driver=qcow2,node-name=disk,file.driver=http,file.filename=http://example.com/image.qcow2
137
+@end example
138
+
139
+@item Driver-specific options for other drivers
140
+Please refer to the QAPI documentation of the @code{blockdev-add} QMP command.
141
+
142
@end table
143
144
ETEXI
145
--
146
1.8.3.1
147
148
Deleted patch
1
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
3
---
4
block/qed-cluster.c | 39 ++++++++++++++++++++++-----------------
5
block/qed.c | 24 +++++++++++-------------
6
block/qed.h | 4 ++--
7
3 files changed, 35 insertions(+), 32 deletions(-)
8
1
9
diff --git a/block/qed-cluster.c b/block/qed-cluster.c
10
index XXXXXXX..XXXXXXX 100644
11
--- a/block/qed-cluster.c
12
+++ b/block/qed-cluster.c
13
@@ -XXX,XX +XXX,XX @@ static unsigned int qed_count_contiguous_clusters(BDRVQEDState *s,
14
* @s: QED state
15
* @request: L2 cache entry
16
* @pos: Byte position in device
17
- * @len: Number of bytes
18
- * @cb: Completion function
19
- * @opaque: User data for completion function
20
+ * @len: Number of bytes (may be shortened on return)
21
+ * @img_offset: Contains offset in the image file on success
22
*
23
* This function translates a position in the block device to an offset in the
24
- * image file. It invokes the cb completion callback to report back the
25
- * translated offset or unallocated range in the image file.
26
+ * image file. The translated offset or unallocated range in the image file is
27
+ * reported back in *img_offset and *len.
28
*
29
* If the L2 table exists, request->l2_table points to the L2 table cache entry
30
* and the caller must free the reference when they are finished. The cache
31
* entry is exposed in this way to avoid callers having to read the L2 table
32
* again later during request processing. If request->l2_table is non-NULL it
33
* will be unreferenced before taking on the new cache entry.
34
+ *
35
+ * On success QED_CLUSTER_FOUND is returned and img_offset/len are a contiguous
36
+ * range in the image file.
37
+ *
38
+ * On failure QED_CLUSTER_L2 or QED_CLUSTER_L1 is returned for missing L2 or L1
39
+ * table offset, respectively. len is number of contiguous unallocated bytes.
40
*/
41
-void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
42
- size_t len, QEDFindClusterFunc *cb, void *opaque)
43
+int qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
44
+ size_t *len, uint64_t *img_offset)
45
{
46
uint64_t l2_offset;
47
uint64_t offset = 0;
48
@@ -XXX,XX +XXX,XX @@ void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
49
/* Limit length to L2 boundary. Requests are broken up at the L2 boundary
50
* so that a request acts on one L2 table at a time.
51
*/
52
- len = MIN(len, (((pos >> s->l1_shift) + 1) << s->l1_shift) - pos);
53
+ *len = MIN(*len, (((pos >> s->l1_shift) + 1) << s->l1_shift) - pos);
54
55
l2_offset = s->l1_table->offsets[qed_l1_index(s, pos)];
56
if (qed_offset_is_unalloc_cluster(l2_offset)) {
57
- cb(opaque, QED_CLUSTER_L1, 0, len);
58
- return;
59
+ *img_offset = 0;
60
+ return QED_CLUSTER_L1;
61
}
62
if (!qed_check_table_offset(s, l2_offset)) {
63
- cb(opaque, -EINVAL, 0, 0);
64
- return;
65
+ *img_offset = *len = 0;
66
+ return -EINVAL;
67
}
68
69
ret = qed_read_l2_table(s, request, l2_offset);
70
@@ -XXX,XX +XXX,XX @@ void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
71
}
72
73
index = qed_l2_index(s, pos);
74
- n = qed_bytes_to_clusters(s,
75
- qed_offset_into_cluster(s, pos) + len);
76
+ n = qed_bytes_to_clusters(s, qed_offset_into_cluster(s, pos) + *len);
77
n = qed_count_contiguous_clusters(s, request->l2_table->table,
78
index, n, &offset);
79
80
@@ -XXX,XX +XXX,XX @@ void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
81
ret = -EINVAL;
82
}
83
84
- len = MIN(len,
85
- n * s->header.cluster_size - qed_offset_into_cluster(s, pos));
86
+ *len = MIN(*len,
87
+ n * s->header.cluster_size - qed_offset_into_cluster(s, pos));
88
89
out:
90
- cb(opaque, ret, offset, len);
91
+ *img_offset = offset;
92
qed_release(s);
93
+ return ret;
94
}
95
diff --git a/block/qed.c b/block/qed.c
96
index XXXXXXX..XXXXXXX 100644
97
--- a/block/qed.c
98
+++ b/block/qed.c
99
@@ -XXX,XX +XXX,XX @@ static int64_t coroutine_fn bdrv_qed_co_get_block_status(BlockDriverState *bs,
100
.file = file,
101
};
102
QEDRequest request = { .l2_table = NULL };
103
+ uint64_t offset;
104
+ int ret;
105
106
- qed_find_cluster(s, &request, cb.pos, len, qed_is_allocated_cb, &cb);
107
+ ret = qed_find_cluster(s, &request, cb.pos, &len, &offset);
108
+ qed_is_allocated_cb(&cb, ret, offset, len);
109
110
- /* Now sleep if the callback wasn't invoked immediately */
111
- while (cb.status == BDRV_BLOCK_OFFSET_MASK) {
112
- cb.co = qemu_coroutine_self();
113
- qemu_coroutine_yield();
114
- }
115
+ /* The callback was invoked immediately */
116
+ assert(cb.status != BDRV_BLOCK_OFFSET_MASK);
117
118
qed_unref_l2_cache_entry(request.l2_table);
119
120
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset, size_t len)
121
* or -errno
122
* @offset: Cluster offset in bytes
123
* @len: Length in bytes
124
- *
125
- * Callback from qed_find_cluster().
126
*/
127
static void qed_aio_write_data(void *opaque, int ret,
128
uint64_t offset, size_t len)
129
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_data(void *opaque, int ret,
130
* or -errno
131
* @offset: Cluster offset in bytes
132
* @len: Length in bytes
133
- *
134
- * Callback from qed_find_cluster().
135
*/
136
static void qed_aio_read_data(void *opaque, int ret,
137
uint64_t offset, size_t len)
138
@@ -XXX,XX +XXX,XX @@ static void qed_aio_next_io(QEDAIOCB *acb, int ret)
139
BDRVQEDState *s = acb_to_s(acb);
140
QEDFindClusterFunc *io_fn = (acb->flags & QED_AIOCB_WRITE) ?
141
qed_aio_write_data : qed_aio_read_data;
142
+ uint64_t offset;
143
+ size_t len;
144
145
trace_qed_aio_next_io(s, acb, ret, acb->cur_pos + acb->cur_qiov.size);
146
147
@@ -XXX,XX +XXX,XX @@ static void qed_aio_next_io(QEDAIOCB *acb, int ret)
148
}
149
150
/* Find next cluster and start I/O */
151
- qed_find_cluster(s, &acb->request,
152
- acb->cur_pos, acb->end_pos - acb->cur_pos,
153
- io_fn, acb);
154
+ len = acb->end_pos - acb->cur_pos;
155
+ ret = qed_find_cluster(s, &acb->request, acb->cur_pos, &len, &offset);
156
+ io_fn(acb, ret, offset, len);
157
}
158
159
static BlockAIOCB *qed_aio_setup(BlockDriverState *bs,
160
diff --git a/block/qed.h b/block/qed.h
161
index XXXXXXX..XXXXXXX 100644
162
--- a/block/qed.h
163
+++ b/block/qed.h
164
@@ -XXX,XX +XXX,XX @@ int qed_write_l2_table_sync(BDRVQEDState *s, QEDRequest *request,
165
/**
166
* Cluster functions
167
*/
168
-void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
169
- size_t len, QEDFindClusterFunc *cb, void *opaque);
170
+int qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
171
+ size_t *len, uint64_t *img_offset);
172
173
/**
174
* Consistency check
175
--
176
1.8.3.1
177
178
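
For the qed_find_cluster() conversion above, the new synchronous contract (status return value, *len shortened in place, image offset reported through *img_offset) is easiest to see from the caller's side. Below is a minimal sketch following the pattern of the two call sites converted in the patch; the wrapper function name is made up for illustration.

    /* Hypothetical caller of the now-synchronous qed_find_cluster();
     * the declarations come from block/qed.h in the QEMU tree. */
    #include "qed.h"

    static int example_map_range(BDRVQEDState *s, QEDRequest *request,
                                 uint64_t pos, size_t len)
    {
        uint64_t img_offset;
        int ret = qed_find_cluster(s, request, pos, &len, &img_offset);

        if (ret == QED_CLUSTER_FOUND) {
            /* len was clamped to a contiguous allocated range at img_offset */
        } else if (ret == QED_CLUSTER_L1 || ret == QED_CLUSTER_L2) {
            /* missing L1/L2 entry: len covers a contiguous unallocated range */
        }
        /* negative values are errors, e.g. -EINVAL for a bad table offset */
        return ret;
    }

Compared with the old callback style, the caller now invokes the completion logic directly with the returned values instead of passing a QEDFindClusterFunc, and, as before, it remains responsible for releasing the cached L2 table reference left in request->l2_table.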
Deleted patch
1
With this change, qed_aio_write_prefill() and qed_aio_write_postfill()
2
collapse into a single function. This is reflected by a rename of the
3
combined function to qed_aio_write_cow().
4
1
5
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
6
Reviewed-by: Eric Blake <eblake@redhat.com>
7
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
8
---
9
block/qed.c | 57 +++++++++++++++++++++++----------------------------------
10
1 file changed, 23 insertions(+), 34 deletions(-)
11
12
diff --git a/block/qed.c b/block/qed.c
13
index XXXXXXX..XXXXXXX 100644
14
--- a/block/qed.c
15
+++ b/block/qed.c
16
@@ -XXX,XX +XXX,XX @@ static int qed_read_backing_file(BDRVQEDState *s, uint64_t pos,
17
* @pos: Byte position in device
18
* @len: Number of bytes
19
* @offset: Byte offset in image file
20
- * @cb: Completion function
21
- * @opaque: User data for completion function
22
*/
23
-static void qed_copy_from_backing_file(BDRVQEDState *s, uint64_t pos,
24
- uint64_t len, uint64_t offset,
25
- BlockCompletionFunc *cb,
26
- void *opaque)
27
+static int qed_copy_from_backing_file(BDRVQEDState *s, uint64_t pos,
28
+ uint64_t len, uint64_t offset)
29
{
30
QEMUIOVector qiov;
31
QEMUIOVector *backing_qiov = NULL;
32
@@ -XXX,XX +XXX,XX @@ static void qed_copy_from_backing_file(BDRVQEDState *s, uint64_t pos,
33
34
/* Skip copy entirely if there is no work to do */
35
if (len == 0) {
36
- cb(opaque, 0);
37
- return;
38
+ return 0;
39
}
40
41
iov = (struct iovec) {
42
@@ -XXX,XX +XXX,XX @@ static void qed_copy_from_backing_file(BDRVQEDState *s, uint64_t pos,
43
ret = 0;
44
out:
45
qemu_vfree(iov.iov_base);
46
- cb(opaque, ret);
47
+ return ret;
48
}
49
50
/**
51
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_main(void *opaque, int ret)
52
}
53
54
/**
55
- * Populate back untouched region of new data cluster
56
+ * Populate untouched regions of new data cluster
57
*/
58
-static void qed_aio_write_postfill(void *opaque, int ret)
59
+static void qed_aio_write_cow(void *opaque, int ret)
60
{
61
QEDAIOCB *acb = opaque;
62
BDRVQEDState *s = acb_to_s(acb);
63
- uint64_t start = acb->cur_pos + acb->cur_qiov.size;
64
- uint64_t len =
65
- qed_start_of_cluster(s, start + s->header.cluster_size - 1) - start;
66
- uint64_t offset = acb->cur_cluster +
67
- qed_offset_into_cluster(s, acb->cur_pos) +
68
- acb->cur_qiov.size;
69
+ uint64_t start, len, offset;
70
+
71
+ /* Populate front untouched region of new data cluster */
72
+ start = qed_start_of_cluster(s, acb->cur_pos);
73
+ len = qed_offset_into_cluster(s, acb->cur_pos);
74
75
+ trace_qed_aio_write_prefill(s, acb, start, len, acb->cur_cluster);
76
+ ret = qed_copy_from_backing_file(s, start, len, acb->cur_cluster);
77
if (ret) {
78
qed_aio_complete(acb, ret);
79
return;
80
}
81
82
- trace_qed_aio_write_postfill(s, acb, start, len, offset);
83
- qed_copy_from_backing_file(s, start, len, offset,
84
- qed_aio_write_main, acb);
85
-}
86
+ /* Populate back untouched region of new data cluster */
87
+ start = acb->cur_pos + acb->cur_qiov.size;
88
+ len = qed_start_of_cluster(s, start + s->header.cluster_size - 1) - start;
89
+ offset = acb->cur_cluster +
90
+ qed_offset_into_cluster(s, acb->cur_pos) +
91
+ acb->cur_qiov.size;
92
93
-/**
94
- * Populate front untouched region of new data cluster
95
- */
96
-static void qed_aio_write_prefill(void *opaque, int ret)
97
-{
98
- QEDAIOCB *acb = opaque;
99
- BDRVQEDState *s = acb_to_s(acb);
100
- uint64_t start = qed_start_of_cluster(s, acb->cur_pos);
101
- uint64_t len = qed_offset_into_cluster(s, acb->cur_pos);
102
+ trace_qed_aio_write_postfill(s, acb, start, len, offset);
103
+ ret = qed_copy_from_backing_file(s, start, len, offset);
104
105
- trace_qed_aio_write_prefill(s, acb, start, len, acb->cur_cluster);
106
- qed_copy_from_backing_file(s, start, len, acb->cur_cluster,
107
- qed_aio_write_postfill, acb);
108
+ qed_aio_write_main(acb, ret);
109
}
110
111
/**
112
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
113
114
cb = qed_aio_write_zero_cluster;
115
} else {
116
- cb = qed_aio_write_prefill;
117
+ cb = qed_aio_write_cow;
118
acb->cur_cluster = qed_alloc_clusters(s, acb->cur_nclusters);
119
}
120
121
--
122
1.8.3.1
123
124
Deleted patch
1
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
3
---
4
block/qed-table.c | 47 ++++++++++++-----------------------------------
5
block/qed.c | 12 +++++++-----
6
block/qed.h | 8 +++-----
7
3 files changed, 22 insertions(+), 45 deletions(-)
8
1
9
diff --git a/block/qed-table.c b/block/qed-table.c
10
index XXXXXXX..XXXXXXX 100644
11
--- a/block/qed-table.c
12
+++ b/block/qed-table.c
13
@@ -XXX,XX +XXX,XX @@ out:
14
* @index: Index of first element
15
* @n: Number of elements
16
* @flush: Whether or not to sync to disk
17
- * @cb: Completion function
18
- * @opaque: Argument for completion function
19
*/
20
-static void qed_write_table(BDRVQEDState *s, uint64_t offset, QEDTable *table,
21
- unsigned int index, unsigned int n, bool flush,
22
- BlockCompletionFunc *cb, void *opaque)
23
+static int qed_write_table(BDRVQEDState *s, uint64_t offset, QEDTable *table,
24
+ unsigned int index, unsigned int n, bool flush)
25
{
26
unsigned int sector_mask = BDRV_SECTOR_SIZE / sizeof(uint64_t) - 1;
27
unsigned int start, end, i;
28
@@ -XXX,XX +XXX,XX @@ static void qed_write_table(BDRVQEDState *s, uint64_t offset, QEDTable *table,
29
ret = 0;
30
out:
31
qemu_vfree(new_table);
32
- cb(opaque, ret);
33
-}
34
-
35
-/**
36
- * Propagate return value from async callback
37
- */
38
-static void qed_sync_cb(void *opaque, int ret)
39
-{
40
- *(int *)opaque = ret;
41
+ return ret;
42
}
43
44
int qed_read_l1_table_sync(BDRVQEDState *s)
45
@@ -XXX,XX +XXX,XX @@ int qed_read_l1_table_sync(BDRVQEDState *s)
46
return qed_read_table(s, s->header.l1_table_offset, s->l1_table);
47
}
48
49
-void qed_write_l1_table(BDRVQEDState *s, unsigned int index, unsigned int n,
50
- BlockCompletionFunc *cb, void *opaque)
51
+int qed_write_l1_table(BDRVQEDState *s, unsigned int index, unsigned int n)
52
{
53
BLKDBG_EVENT(s->bs->file, BLKDBG_L1_UPDATE);
54
- qed_write_table(s, s->header.l1_table_offset,
55
- s->l1_table, index, n, false, cb, opaque);
56
+ return qed_write_table(s, s->header.l1_table_offset,
57
+ s->l1_table, index, n, false);
58
}
59
60
int qed_write_l1_table_sync(BDRVQEDState *s, unsigned int index,
61
unsigned int n)
62
{
63
- int ret = -EINPROGRESS;
64
-
65
- qed_write_l1_table(s, index, n, qed_sync_cb, &ret);
66
- BDRV_POLL_WHILE(s->bs, ret == -EINPROGRESS);
67
-
68
- return ret;
69
+ return qed_write_l1_table(s, index, n);
70
}
71
72
int qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset)
73
@@ -XXX,XX +XXX,XX @@ int qed_read_l2_table_sync(BDRVQEDState *s, QEDRequest *request, uint64_t offset
74
return qed_read_l2_table(s, request, offset);
75
}
76
77
-void qed_write_l2_table(BDRVQEDState *s, QEDRequest *request,
78
- unsigned int index, unsigned int n, bool flush,
79
- BlockCompletionFunc *cb, void *opaque)
80
+int qed_write_l2_table(BDRVQEDState *s, QEDRequest *request,
81
+ unsigned int index, unsigned int n, bool flush)
82
{
83
BLKDBG_EVENT(s->bs->file, BLKDBG_L2_UPDATE);
84
- qed_write_table(s, request->l2_table->offset,
85
- request->l2_table->table, index, n, flush, cb, opaque);
86
+ return qed_write_table(s, request->l2_table->offset,
87
+ request->l2_table->table, index, n, flush);
88
}
89
90
int qed_write_l2_table_sync(BDRVQEDState *s, QEDRequest *request,
91
unsigned int index, unsigned int n, bool flush)
92
{
93
- int ret = -EINPROGRESS;
94
-
95
- qed_write_l2_table(s, request, index, n, flush, qed_sync_cb, &ret);
96
- BDRV_POLL_WHILE(s->bs, ret == -EINPROGRESS);
97
-
98
- return ret;
99
+ return qed_write_l2_table(s, request, index, n, flush);
100
}
101
diff --git a/block/qed.c b/block/qed.c
102
index XXXXXXX..XXXXXXX 100644
103
--- a/block/qed.c
104
+++ b/block/qed.c
105
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_l1_update(void *opaque, int ret)
106
index = qed_l1_index(s, acb->cur_pos);
107
s->l1_table->offsets[index] = acb->request.l2_table->offset;
108
109
- qed_write_l1_table(s, index, 1, qed_commit_l2_update, acb);
110
+ ret = qed_write_l1_table(s, index, 1);
111
+ qed_commit_l2_update(acb, ret);
112
}
113
114
/**
115
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_l2_update(QEDAIOCB *acb, int ret, uint64_t offset)
116
117
if (need_alloc) {
118
/* Write out the whole new L2 table */
119
- qed_write_l2_table(s, &acb->request, 0, s->table_nelems, true,
120
- qed_aio_write_l1_update, acb);
121
+ ret = qed_write_l2_table(s, &acb->request, 0, s->table_nelems, true);
122
+ qed_aio_write_l1_update(acb, ret);
123
} else {
124
/* Write out only the updated part of the L2 table */
125
- qed_write_l2_table(s, &acb->request, index, acb->cur_nclusters, false,
126
- qed_aio_next_io_cb, acb);
127
+ ret = qed_write_l2_table(s, &acb->request, index, acb->cur_nclusters,
128
+ false);
129
+ qed_aio_next_io(acb, ret);
130
}
131
return;
132
133
diff --git a/block/qed.h b/block/qed.h
134
index XXXXXXX..XXXXXXX 100644
135
--- a/block/qed.h
136
+++ b/block/qed.h
137
@@ -XXX,XX +XXX,XX @@ void qed_commit_l2_cache_entry(L2TableCache *l2_cache, CachedL2Table *l2_table);
138
* Table I/O functions
139
*/
140
int qed_read_l1_table_sync(BDRVQEDState *s);
141
-void qed_write_l1_table(BDRVQEDState *s, unsigned int index, unsigned int n,
142
- BlockCompletionFunc *cb, void *opaque);
143
+int qed_write_l1_table(BDRVQEDState *s, unsigned int index, unsigned int n);
144
int qed_write_l1_table_sync(BDRVQEDState *s, unsigned int index,
145
unsigned int n);
146
int qed_read_l2_table_sync(BDRVQEDState *s, QEDRequest *request,
147
uint64_t offset);
148
int qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset);
149
-void qed_write_l2_table(BDRVQEDState *s, QEDRequest *request,
150
- unsigned int index, unsigned int n, bool flush,
151
- BlockCompletionFunc *cb, void *opaque);
152
+int qed_write_l2_table(BDRVQEDState *s, QEDRequest *request,
153
+ unsigned int index, unsigned int n, bool flush);
154
int qed_write_l2_table_sync(BDRVQEDState *s, QEDRequest *request,
155
unsigned int index, unsigned int n, bool flush);
156
157
--
158
1.8.3.1
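The conversion above follows one pattern throughout: a function that used to report its status through a completion callback now simply returns the status, so the synchronous wrapper that smuggled the result out via qed_sync_cb() and polled for it collapses into a direct call. A standalone sketch of that shape, with made-up names and an invented error value; no QEMU APIs or event loop are involved.

#include <stdio.h>

/* Toy stand-ins, not QEMU APIs */
typedef void CompletionFunc(void *opaque, int ret);

/* Before: the worker reports its status through a completion callback. */
static void write_table_async(int fail, CompletionFunc *cb, void *opaque)
{
    int ret = fail ? -5 : 0;        /* pretend -EIO on failure */
    cb(opaque, ret);                /* caller continues inside the callback */
}

/* The old synchronous wrapper had to smuggle the result out of the callback. */
static void sync_cb(void *opaque, int ret)
{
    *(int *)opaque = ret;
}

static int write_table_sync_old(int fail)
{
    int ret = -1;                   /* sentinel, like -EINPROGRESS in the patch */
    write_table_async(fail, sync_cb, &ret);
    /* the real code polled the event loop here until ret changed */
    return ret;
}

/* After: the worker simply returns its status; the wrapper collapses. */
static int write_table(int fail)
{
    return fail ? -5 : 0;
}

static int write_table_sync_new(int fail)
{
    return write_table(fail);
}

int main(void)
{
    printf("old: %d, new: %d\n", write_table_sync_old(1), write_table_sync_new(1));
    return 0;
}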
Deleted patch
1
Note that this code is generally not running in coroutine context, so
2
this is an actual blocking synchronous operation. We'll fix this in a
3
moment.
4
1
5
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
6
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
7
---
8
block/qed.c | 61 +++++++++++++++++++------------------------------------------
9
1 file changed, 19 insertions(+), 42 deletions(-)
10
11
diff --git a/block/qed.c b/block/qed.c
12
index XXXXXXX..XXXXXXX 100644
13
--- a/block/qed.c
14
+++ b/block/qed.c
15
@@ -XXX,XX +XXX,XX @@ static void qed_aio_start_io(QEDAIOCB *acb)
16
qed_aio_next_io(acb, 0);
17
}
18
19
-static void qed_aio_next_io_cb(void *opaque, int ret)
20
-{
21
- QEDAIOCB *acb = opaque;
22
-
23
- qed_aio_next_io(acb, ret);
24
-}
25
-
26
static void qed_plug_allocating_write_reqs(BDRVQEDState *s)
27
{
28
assert(!s->allocating_write_reqs_plugged);
29
@@ -XXX,XX +XXX,XX @@ err:
30
qed_aio_complete(acb, ret);
31
}
32
33
-static void qed_aio_write_l2_update_cb(void *opaque, int ret)
34
-{
35
- QEDAIOCB *acb = opaque;
36
- qed_aio_write_l2_update(acb, ret, acb->cur_cluster);
37
-}
38
-
39
-/**
40
- * Flush new data clusters before updating the L2 table
41
- *
42
- * This flush is necessary when a backing file is in use. A crash during an
43
- * allocating write could result in empty clusters in the image. If the write
44
- * only touched a subregion of the cluster, then backing image sectors have
45
- * been lost in the untouched region. The solution is to flush after writing a
46
- * new data cluster and before updating the L2 table.
47
- */
48
-static void qed_aio_write_flush_before_l2_update(void *opaque, int ret)
49
-{
50
- QEDAIOCB *acb = opaque;
51
- BDRVQEDState *s = acb_to_s(acb);
52
-
53
- if (!bdrv_aio_flush(s->bs->file->bs, qed_aio_write_l2_update_cb, opaque)) {
54
- qed_aio_complete(acb, -EIO);
55
- }
56
-}
57
-
58
/**
59
* Write data to the image file
60
*/
61
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_main(void *opaque, int ret)
62
BDRVQEDState *s = acb_to_s(acb);
63
uint64_t offset = acb->cur_cluster +
64
qed_offset_into_cluster(s, acb->cur_pos);
65
- BlockCompletionFunc *next_fn;
66
67
trace_qed_aio_write_main(s, acb, ret, offset, acb->cur_qiov.size);
68
69
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_main(void *opaque, int ret)
70
return;
71
}
72
73
+ BLKDBG_EVENT(s->bs->file, BLKDBG_WRITE_AIO);
74
+ ret = bdrv_pwritev(s->bs->file, offset, &acb->cur_qiov);
75
+ if (ret >= 0) {
76
+ ret = 0;
77
+ }
78
+
79
if (acb->find_cluster_ret == QED_CLUSTER_FOUND) {
80
- next_fn = qed_aio_next_io_cb;
81
+ qed_aio_next_io(acb, ret);
82
} else {
83
if (s->bs->backing) {
84
- next_fn = qed_aio_write_flush_before_l2_update;
85
- } else {
86
- next_fn = qed_aio_write_l2_update_cb;
87
+ /*
88
+ * Flush new data clusters before updating the L2 table
89
+ *
90
+ * This flush is necessary when a backing file is in use. A crash
91
+ * during an allocating write could result in empty clusters in the
92
+ * image. If the write only touched a subregion of the cluster,
93
+ * then backing image sectors have been lost in the untouched
94
+ * region. The solution is to flush after writing a new data
95
+ * cluster and before updating the L2 table.
96
+ */
97
+ ret = bdrv_flush(s->bs->file->bs);
98
}
99
+ qed_aio_write_l2_update(acb, ret, acb->cur_cluster);
100
}
101
-
102
- BLKDBG_EVENT(s->bs->file, BLKDBG_WRITE_AIO);
103
- bdrv_aio_writev(s->bs->file, offset / BDRV_SECTOR_SIZE,
104
- &acb->cur_qiov, acb->cur_qiov.size / BDRV_SECTOR_SIZE,
105
- next_fn, acb);
106
}
107
108
/**
109
--
110
1.8.3.1
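The comment introduced above ("flush after writing a new data cluster and before updating the L2 table") states a general crash-consistency rule: newly written data must be stable before any metadata that points at it. A minimal POSIX sketch of that ordering, with an invented file name and invented offsets standing in for the data cluster and the L2 table, so it is an illustration rather than qed's actual on-disk layout:

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("toy.img", O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    const char data[] = "new data cluster";
    if (pwrite(fd, data, sizeof(data), 4096) < 0 ||   /* 1. write the data */
        fsync(fd) < 0) {                              /* 2. make it durable */
        perror("data write/flush");
        close(fd);
        return 1;
    }

    const char l2[] = "L2 entry -> offset 4096";
    if (pwrite(fd, l2, sizeof(l2), 0) < 0) {          /* 3. only now update metadata */
        perror("metadata write");
        close(fd);
        return 1;
    }

    close(fd);
    return 0;
}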
Deleted patch

qed_commit_l2_update() is unconditionally called at the end of
qed_aio_write_l1_update(). Inline it.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
block/qed.c | 36 ++++++++++++++----------------------
1 file changed, 14 insertions(+), 22 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@ static void qed_aio_complete(QEDAIOCB *acb, int ret)
}

/**
- * Commit the current L2 table to the cache
+ * Update L1 table with new L2 table offset and write it out
*/
-static void qed_commit_l2_update(void *opaque, int ret)
+static void qed_aio_write_l1_update(void *opaque, int ret)
{
QEDAIOCB *acb = opaque;
BDRVQEDState *s = acb_to_s(acb);
CachedL2Table *l2_table = acb->request.l2_table;
uint64_t l2_offset = l2_table->offset;
+ int index;
+
+ if (ret) {
+ qed_aio_complete(acb, ret);
+ return;
+ }

+ index = qed_l1_index(s, acb->cur_pos);
+ s->l1_table->offsets[index] = l2_table->offset;
+
+ ret = qed_write_l1_table(s, index, 1);
+
+ /* Commit the current L2 table to the cache */
qed_commit_l2_cache_entry(&s->l2_cache, l2_table);

/* This is guaranteed to succeed because we just committed the entry to the
@@ -XXX,XX +XXX,XX @@ static void qed_commit_l2_update(void *opaque, int ret)
qed_aio_next_io(acb, ret);
}

-/**
- * Update L1 table with new L2 table offset and write it out
- */
-static void qed_aio_write_l1_update(void *opaque, int ret)
-{
- QEDAIOCB *acb = opaque;
- BDRVQEDState *s = acb_to_s(acb);
- int index;
-
- if (ret) {
- qed_aio_complete(acb, ret);
- return;
- }
-
- index = qed_l1_index(s, acb->cur_pos);
- s->l1_table->offsets[index] = acb->request.l2_table->offset;
-
- ret = qed_write_l1_table(s, index, 1);
- qed_commit_l2_update(acb, ret);
-}

/**
* Update L2 table with new cluster offsets and write them out
--
1.8.3.1
Deleted patch

Don't recurse into qed_aio_next_io() and qed_aio_complete() here, but
just return an error code and let the caller handle it.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
block/qed.c | 19 +++++++++----------
1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@ static void qed_aio_complete(QEDAIOCB *acb, int ret)
/**
* Update L1 table with new L2 table offset and write it out
*/
-static void qed_aio_write_l1_update(void *opaque, int ret)
+static int qed_aio_write_l1_update(QEDAIOCB *acb)
{
- QEDAIOCB *acb = opaque;
BDRVQEDState *s = acb_to_s(acb);
CachedL2Table *l2_table = acb->request.l2_table;
uint64_t l2_offset = l2_table->offset;
- int index;
-
- if (ret) {
- qed_aio_complete(acb, ret);
- return;
- }
+ int index, ret;

index = qed_l1_index(s, acb->cur_pos);
s->l1_table->offsets[index] = l2_table->offset;
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_l1_update(void *opaque, int ret)
acb->request.l2_table = qed_find_l2_cache_entry(&s->l2_cache, l2_offset);
assert(acb->request.l2_table != NULL);

- qed_aio_next_io(acb, ret);
+ return ret;
}

@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_l2_update(QEDAIOCB *acb, int ret, uint64_t offset)
if (need_alloc) {
/* Write out the whole new L2 table */
ret = qed_write_l2_table(s, &acb->request, 0, s->table_nelems, true);
- qed_aio_write_l1_update(acb, ret);
+ if (ret) {
+ goto err;
+ }
+ ret = qed_aio_write_l1_update(acb);
+ qed_aio_next_io(acb, ret);
+
} else {
/* Write out only the updated part of the L2 table */
ret = qed_write_l2_table(s, &acb->request, index, acb->cur_nclusters,
--
1.8.3.1
Deleted patch
1
Don't recurse into qed_aio_next_io() and qed_aio_complete() here, but
2
just return an error code and let the caller handle it.
3
1
4
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
5
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
6
---
7
block/qed.c | 43 ++++++++++++++++++++++++++-----------------
8
1 file changed, 26 insertions(+), 17 deletions(-)
9
10
diff --git a/block/qed.c b/block/qed.c
11
index XXXXXXX..XXXXXXX 100644
12
--- a/block/qed.c
13
+++ b/block/qed.c
14
@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_l1_update(QEDAIOCB *acb)
15
/**
16
* Update L2 table with new cluster offsets and write them out
17
*/
18
-static void qed_aio_write_l2_update(QEDAIOCB *acb, int ret, uint64_t offset)
19
+static int qed_aio_write_l2_update(QEDAIOCB *acb, uint64_t offset)
20
{
21
BDRVQEDState *s = acb_to_s(acb);
22
bool need_alloc = acb->find_cluster_ret == QED_CLUSTER_L1;
23
- int index;
24
-
25
- if (ret) {
26
- goto err;
27
- }
28
+ int index, ret;
29
30
if (need_alloc) {
31
qed_unref_l2_cache_entry(acb->request.l2_table);
32
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_l2_update(QEDAIOCB *acb, int ret, uint64_t offset)
33
/* Write out the whole new L2 table */
34
ret = qed_write_l2_table(s, &acb->request, 0, s->table_nelems, true);
35
if (ret) {
36
- goto err;
37
+ return ret;
38
}
39
- ret = qed_aio_write_l1_update(acb);
40
- qed_aio_next_io(acb, ret);
41
-
42
+ return qed_aio_write_l1_update(acb);
43
} else {
44
/* Write out only the updated part of the L2 table */
45
ret = qed_write_l2_table(s, &acb->request, index, acb->cur_nclusters,
46
false);
47
- qed_aio_next_io(acb, ret);
48
+ if (ret) {
49
+ return ret;
50
+ }
51
}
52
- return;
53
-
54
-err:
55
- qed_aio_complete(acb, ret);
56
+ return 0;
57
}
58
59
/**
60
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_main(void *opaque, int ret)
61
*/
62
ret = bdrv_flush(s->bs->file->bs);
63
}
64
- qed_aio_write_l2_update(acb, ret, acb->cur_cluster);
65
+ if (ret) {
66
+ goto err;
67
+ }
68
+ ret = qed_aio_write_l2_update(acb, acb->cur_cluster);
69
+ if (ret) {
70
+ goto err;
71
+ }
72
+ qed_aio_next_io(acb, 0);
73
}
74
+ return;
75
+
76
+err:
77
+ qed_aio_complete(acb, ret);
78
}
79
80
/**
81
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_zero_cluster(void *opaque, int ret)
82
return;
83
}
84
85
- qed_aio_write_l2_update(acb, 0, 1);
86
+ ret = qed_aio_write_l2_update(acb, 1);
87
+ if (ret < 0) {
88
+ qed_aio_complete(acb, ret);
89
+ return;
90
+ }
91
+ qed_aio_next_io(acb, 0);
92
}
93
94
/**
95
--
96
1.8.3.1
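After this change the allocating write path has a single shape: every step returns 0 or a negative error, and only the top-level function completes the request through one error label. A standalone model of that control flow, with invented step names standing in for the data write, flush and table updates (not the real qed functions):

#include <stdio.h>

/* Toy pipeline, stand-ins for the qed write steps, not QEMU code. */
static int write_data(int fail)      { return fail == 1 ? -5 : 0; }
static int flush_data(int fail)      { return fail == 2 ? -5 : 0; }
static int update_l2_table(int fail) { return fail == 3 ? -5 : 0; }

static void complete(int ret)
{
    printf("request completed with %d\n", ret);
}

/*
 * Each step only reports success or failure; the caller owns the single
 * completion/error path, mirroring the goto err pattern in the patch above.
 */
static void write_request(int fail)
{
    int ret;

    ret = write_data(fail);
    if (ret) {
        goto err;
    }
    ret = flush_data(fail);
    if (ret) {
        goto err;
    }
    ret = update_l2_table(fail);
    if (ret) {
        goto err;
    }
    complete(0);
    return;

err:
    complete(ret);
}

int main(void)
{
    write_request(0);   /* success */
    write_request(3);   /* fails in the metadata update step */
    return 0;
}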
Deleted patch
1
Now that we're running in coroutine context, the ad-hoc serialisation
2
code (which drops a request that has to wait out of coroutine context)
3
can be replaced by a CoQueue.
4
1
5
This means that when we resume a serialised request, it is running in
6
coroutine context again and its I/O isn't blocking any more.
7
8
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
9
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
10
---
11
block/qed.c | 49 +++++++++++++++++--------------------------------
12
block/qed.h | 3 ++-
13
2 files changed, 19 insertions(+), 33 deletions(-)
14
15
diff --git a/block/qed.c b/block/qed.c
16
index XXXXXXX..XXXXXXX 100644
17
--- a/block/qed.c
18
+++ b/block/qed.c
19
@@ -XXX,XX +XXX,XX @@ static void qed_plug_allocating_write_reqs(BDRVQEDState *s)
20
21
static void qed_unplug_allocating_write_reqs(BDRVQEDState *s)
22
{
23
- QEDAIOCB *acb;
24
-
25
assert(s->allocating_write_reqs_plugged);
26
27
s->allocating_write_reqs_plugged = false;
28
-
29
- acb = QSIMPLEQ_FIRST(&s->allocating_write_reqs);
30
- if (acb) {
31
- qed_aio_start_io(acb);
32
- }
33
+ qemu_co_enter_next(&s->allocating_write_reqs);
34
}
35
36
static void qed_clear_need_check(void *opaque, int ret)
37
@@ -XXX,XX +XXX,XX @@ static void qed_need_check_timer_cb(void *opaque)
38
BDRVQEDState *s = opaque;
39
40
/* The timer should only fire when allocating writes have drained */
41
- assert(!QSIMPLEQ_FIRST(&s->allocating_write_reqs));
42
+ assert(!s->allocating_acb);
43
44
trace_qed_need_check_timer_cb(s);
45
46
@@ -XXX,XX +XXX,XX @@ static int bdrv_qed_do_open(BlockDriverState *bs, QDict *options, int flags,
47
int ret;
48
49
s->bs = bs;
50
- QSIMPLEQ_INIT(&s->allocating_write_reqs);
51
+ qemu_co_queue_init(&s->allocating_write_reqs);
52
53
ret = bdrv_pread(bs->file, 0, &le_header, sizeof(le_header));
54
if (ret < 0) {
55
@@ -XXX,XX +XXX,XX @@ static void qed_aio_complete_bh(void *opaque)
56
qed_release(s);
57
}
58
59
-static void qed_resume_alloc_bh(void *opaque)
60
-{
61
- qed_aio_start_io(opaque);
62
-}
63
-
64
static void qed_aio_complete(QEDAIOCB *acb, int ret)
65
{
66
BDRVQEDState *s = acb_to_s(acb);
67
@@ -XXX,XX +XXX,XX @@ static void qed_aio_complete(QEDAIOCB *acb, int ret)
68
* next request in the queue. This ensures that we don't cycle through
69
* requests multiple times but rather finish one at a time completely.
70
*/
71
- if (acb == QSIMPLEQ_FIRST(&s->allocating_write_reqs)) {
72
- QEDAIOCB *next_acb;
73
- QSIMPLEQ_REMOVE_HEAD(&s->allocating_write_reqs, next);
74
- next_acb = QSIMPLEQ_FIRST(&s->allocating_write_reqs);
75
- if (next_acb) {
76
- aio_bh_schedule_oneshot(bdrv_get_aio_context(acb->common.bs),
77
- qed_resume_alloc_bh, next_acb);
78
+ if (acb == s->allocating_acb) {
79
+ s->allocating_acb = NULL;
80
+ if (!qemu_co_queue_empty(&s->allocating_write_reqs)) {
81
+ qemu_co_enter_next(&s->allocating_write_reqs);
82
} else if (s->header.features & QED_F_NEED_CHECK) {
83
qed_start_need_check_timer(s);
84
}
85
@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
86
int ret;
87
88
/* Cancel timer when the first allocating request comes in */
89
- if (QSIMPLEQ_EMPTY(&s->allocating_write_reqs)) {
90
+ if (s->allocating_acb == NULL) {
91
qed_cancel_need_check_timer(s);
92
}
93
94
/* Freeze this request if another allocating write is in progress */
95
- if (acb != QSIMPLEQ_FIRST(&s->allocating_write_reqs)) {
96
- QSIMPLEQ_INSERT_TAIL(&s->allocating_write_reqs, acb, next);
97
- }
98
- if (acb != QSIMPLEQ_FIRST(&s->allocating_write_reqs) ||
99
- s->allocating_write_reqs_plugged) {
100
- return -EINPROGRESS; /* wait for existing request to finish */
101
+ if (s->allocating_acb != acb || s->allocating_write_reqs_plugged) {
102
+ if (s->allocating_acb != NULL) {
103
+ qemu_co_queue_wait(&s->allocating_write_reqs, NULL);
104
+ assert(s->allocating_acb == NULL);
105
+ }
106
+ s->allocating_acb = acb;
107
+ return -EAGAIN; /* start over with looking up table entries */
108
}
109
110
acb->cur_nclusters = qed_bytes_to_clusters(s,
111
@@ -XXX,XX +XXX,XX @@ static void qed_aio_next_io(QEDAIOCB *acb)
112
ret = qed_aio_read_data(acb, ret, offset, len);
113
}
114
115
- if (ret < 0) {
116
- if (ret != -EINPROGRESS) {
117
- qed_aio_complete(acb, ret);
118
- }
119
+ if (ret < 0 && ret != -EAGAIN) {
120
+ qed_aio_complete(acb, ret);
121
return;
122
}
123
}
124
diff --git a/block/qed.h b/block/qed.h
125
index XXXXXXX..XXXXXXX 100644
126
--- a/block/qed.h
127
+++ b/block/qed.h
128
@@ -XXX,XX +XXX,XX @@ typedef struct {
129
uint32_t l2_mask;
130
131
/* Allocating write request queue */
132
- QSIMPLEQ_HEAD(, QEDAIOCB) allocating_write_reqs;
133
+ QEDAIOCB *allocating_acb;
134
+ CoQueue allocating_write_reqs;
135
bool allocating_write_reqs_plugged;
136
137
/* Periodic flush and clear need check flag */
138
--
139
1.8.3.1
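The CoQueue introduced above serialises allocating writes: one request owns the allocation path, later ones queue up and are woken one at a time. The same idea can be modelled with ordinary threads, shown here as a sketch only; QEMU uses coroutines rather than threads for this, and unlike CoQueue a plain condition variable does not promise FIFO wake-up order. Build with -pthread.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  next_waiter = PTHREAD_COND_INITIALIZER;
static int allocating_busy;

static void *allocating_write(void *arg)
{
    int id = *(int *)arg;

    pthread_mutex_lock(&lock);
    while (allocating_busy) {
        pthread_cond_wait(&next_waiter, &lock);   /* "queue" this request */
    }
    allocating_busy = 1;
    pthread_mutex_unlock(&lock);

    printf("request %d: allocating cluster and updating tables\n", id);
    usleep(1000);                                  /* pretend I/O */

    pthread_mutex_lock(&lock);
    allocating_busy = 0;
    pthread_cond_signal(&next_waiter);             /* resume one waiter */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    int id[4];

    for (int i = 0; i < 4; i++) {
        id[i] = i;
        pthread_create(&t[i], NULL, allocating_write, &id[i]);
    }
    for (int i = 0; i < 4; i++) {
        pthread_join(t[i], NULL);
    }
    return 0;
}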
Old patch

This fixes the last place where we degraded from AIO to actual blocking
synchronous I/O requests. Putting it into a coroutine means that instead
of blocking, the coroutine simply yields while doing I/O.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
block/qed.c | 33 +++++++++++++++++----------------
1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@ static void qed_unplug_allocating_write_reqs(BDRVQEDState *s)
qemu_co_enter_next(&s->allocating_write_reqs);
}

-static void qed_clear_need_check(void *opaque, int ret)
+static void qed_need_check_timer_entry(void *opaque)
{
BDRVQEDState *s = opaque;
+ int ret;

- if (ret) {
+ /* The timer should only fire when allocating writes have drained */
+ assert(!s->allocating_acb);
+
+ trace_qed_need_check_timer_cb(s);
+
+ qed_acquire(s);
+ qed_plug_allocating_write_reqs(s);
+
+ /* Ensure writes are on disk before clearing flag */
+ ret = bdrv_co_flush(s->bs->file->bs);
+ qed_release(s);
+ if (ret < 0) {
qed_unplug_allocating_write_reqs(s);
return;
}
@@ -XXX,XX +XXX,XX @@ static void qed_clear_need_check(void *opaque, int ret)

qed_unplug_allocating_write_reqs(s);

- ret = bdrv_flush(s->bs);
+ ret = bdrv_co_flush(s->bs);
(void) ret;
}

static void qed_need_check_timer_cb(void *opaque)
{
- BDRVQEDState *s = opaque;
-
- /* The timer should only fire when allocating writes have drained */
- assert(!s->allocating_acb);
-
- trace_qed_need_check_timer_cb(s);
-
- qed_acquire(s);
- qed_plug_allocating_write_reqs(s);
-
- /* Ensure writes are on disk before clearing flag */
- bdrv_aio_flush(s->bs->file->bs, qed_clear_need_check, s);
- qed_release(s);
+ Coroutine *co = qemu_coroutine_create(qed_need_check_timer_entry, opaque);
+ qemu_coroutine_enter(co);
}

void qed_acquire(BDRVQEDState *s)
--
1.8.3.1

New patch

The 'name' option for NBD exports is optional. Add a note that the
default for the option is the node name (people could otherwise expect
that it's the empty string like for qemu-nbd).

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210305094856.18964-1-kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
docs/tools/qemu-storage-daemon.rst | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/docs/tools/qemu-storage-daemon.rst b/docs/tools/qemu-storage-daemon.rst
index XXXXXXX..XXXXXXX 100644
--- a/docs/tools/qemu-storage-daemon.rst
+++ b/docs/tools/qemu-storage-daemon.rst
@@ -XXX,XX +XXX,XX @@ Standard options:
requests for modifying data (the default is off).

The ``nbd`` export type requires ``--nbd-server`` (see below). ``name`` is
- the NBD export name. ``bitmap`` is the name of a dirty bitmap reachable from
- the block node, so the NBD client can use NBD_OPT_SET_META_CONTEXT with the
+ the NBD export name (if not specified, it defaults to the given
+ ``node-name``). ``bitmap`` is the name of a dirty bitmap reachable from the
+ block node, so the NBD client can use NBD_OPT_SET_META_CONTEXT with the
metadata context name "qemu:dirty-bitmap:BITMAP" to inspect the bitmap.

The ``vhost-user-blk`` export type takes a vhost-user socket address on which
--
2.29.2