Series comparison

-[Qemu-devel] [PULL 00/61] Block layer patches
+[Qemu-devel] [PULL 00/35] Block layer patches
-The following changes since commit 4c8c1cc544dbd5e2564868e61c5037258e393832:
+The following changes since commit ae49fbbcd8e4e9d8bf7131add34773f579e1aff7:
-  Merge remote-tracking branch 'remotes/vivier/tags/m68k-for-2.10-pull-request' into staging (2017-06-22 19:01:58 +0100)
+  Merge remote-tracking branch 'remotes/rth/tags/pull-tcg-20171025' into staging (2017-10-25 16:38:57 +0100)
 are available in the git repository at:
   git://repo.or.cz/qemu/kevin.git tags/for-upstream
-for you to fetch changes up to 1512008812410ca4054506a7c44343088abdd977:
+for you to fetch changes up to 4254d01ce4eec9a3ccf320d14e2da132b8ad4a51:
-  Merge remote-tracking branch 'mreitz/tags/pull-block-2017-06-23' into queue-block (2017-06-23 14:09:12 +0200)
+  Merge remote-tracking branch 'mreitz/tags/pull-block-2017-10-26' into queue-block (2017-10-26 15:02:40 +0200)
 ----------------------------------------------------------------
 Block layer patches
 ----------------------------------------------------------------
-Alberto Garcia (9):
+Alberto Garcia (1):
-      throttle: Update throttle-groups.c documentation
+      qcow2: Use BDRV_SECTOR_BITS instead of its literal value
       qcow2: Remove unused Error variable in do_perform_cow()
       qcow2: Use unsigned int for both members of Qcow2COWRegion
       qcow2: Make perform_cow() call do_perform_cow() twice
       qcow2: Split do_perform_cow() into _read(), _encrypt() and _write()
       qcow2: Allow reading both COW regions with only one request
       qcow2: Pass a QEMUIOVector to do_perform_cow_{read,write}()
       qcow2: Merge the writing of the COW regions with the guest data
       qcow2: Use offset_into_cluster() and offset_to_l2_index()
-Kevin Wolf (37):
+Eric Blake (24):
-      commit: Fix completion with extra reference
+      block: Allow NULL file for bdrv_get_block_status()
-      qemu-iotests: Allow starting new qemu after cleanup
+      block: Add flag to avoid wasted work in bdrv_is_allocated()
-      qemu-iotests: Test exiting qemu with running job
+      block: Make bdrv_round_to_clusters() signature more useful
-      doc: Document generic -blockdev options
+      qcow2: Switch is_zero_sectors() to byte-based
-      doc: Document driver-specific -blockdev options
+      block: Switch bdrv_make_zero() to byte-based
-      qed: Use bottom half to resume waiting requests
+      qemu-img: Switch get_block_status() to byte-based
-      qed: Make qed_read_table() synchronous
+      block: Convert bdrv_get_block_status() to bytes
-      qed: Remove callback from qed_read_table()
+      block: Switch bdrv_co_get_block_status() to byte-based
-      qed: Remove callback from qed_read_l2_table()
+      block: Switch BdrvCoGetBlockStatusData to byte-based
-      qed: Remove callback from qed_find_cluster()
+      block: Switch bdrv_common_block_status_above() to byte-based
-      qed: Make qed_read_backing_file() synchronous
+      block: Switch bdrv_co_get_block_status_above() to byte-based
-      qed: Make qed_copy_from_backing_file() synchronous
+      block: Convert bdrv_get_block_status_above() to bytes
-      qed: Remove callback from qed_copy_from_backing_file()
+      qemu-img: Simplify logic in img_compare()
-      qed: Make qed_write_header() synchronous
+      qemu-img: Speed up compare on pre-allocated larger file
-      qed: Remove callback from qed_write_header()
+      qemu-img: Add find_nonzero()
-      qed: Make qed_write_table() synchronous
+      qemu-img: Drop redundant error message in compare
-      qed: Remove GenericCB
+      qemu-img: Change check_empty_sectors() to byte-based
-      qed: Remove callback from qed_write_table()
+      qemu-img: Change compare_sectors() to be byte-based
-      qed: Make qed_aio_read_data() synchronous
+      qemu-img: Change img_rebase() to be byte-based
-      qed: Make qed_aio_write_main() synchronous
+      qemu-img: Change img_compare() to be byte-based
-      qed: Inline qed_commit_l2_update()
+      block: Align block status requests
-      qed: Add return value to qed_aio_write_l1_update()
+      block: Reduce bdrv_aligned_preadv() rounding
-      qed: Add return value to qed_aio_write_l2_update()
+      qcow2: Reduce is_zero() rounding
-      qed: Add return value to qed_aio_write_main()
+      qemu-io: Relax 'alloc' now that block-status doesn't assert
       qed: Add return value to qed_aio_write_cow()
       qed: Add return value to qed_aio_write_inplace/alloc()
       qed: Add return value to qed_aio_read/write_data()
       qed: Remove ret argument from qed_aio_next_io()
       qed: Remove recursion in qed_aio_next_io()
       qed: Implement .bdrv_co_readv/writev
       qed: Use CoQueue for serialising allocations
       qed: Simplify request handling
       qed: Use a coroutine for need_check_timer
       qed: Add coroutine_fn to I/O path functions
       qed: Use bdrv_co_* for coroutine_fns
       block: Remove bdrv_aio_readv/writev/flush()
       Merge remote-tracking branch 'mreitz/tags/pull-block-2017-06-23' into queue-block
-Manos Pitsidianakis (1):
+Kevin Wolf (2):
-      block: change variable names in BlockDriverState
+      qemu-iotests: Test backing_fmt with backing node reference
       Merge remote-tracking branch 'mreitz/tags/pull-block-2017-10-26' into queue-block
-Max Reitz (3):
+Max Reitz (8):
-      blkdebug: Catch bs->exact_filename overflow
+      qemu-img.1: Image invalidation on qemu-img commit
-      blkverify: Catch bs->exact_filename overflow
+      iotests: Add test for dataplane mirroring
-      block: Do not strcmp() with NULL uri->scheme
+      iotests: Pull _filter_actual_image_size from 67/87
       iotests: Filter actual image size in 184 and 191
       qcow2: Emit errp when truncating the image tail
       qcow2: Fix unaligned preallocated truncation
       qcow2: Always execute preallocate() in a coroutine
       iotests: Add cluster_size=64k to 125
-Stefan Hajnoczi (10):
+Peter Krempa (1):
-      block: count bdrv_co_rw_vmstate() requests
+      block: don't add 'driver' to options when referring to backing via node name
       block: use BDRV_POLL_WHILE() in bdrv_rw_vmstate()
       migration: avoid recursive AioContext locking in save_vmstate()
       migration: use bdrv_drain_all_begin/end() instead bdrv_drain_all()
       virtio-pci: use ioeventfd even when KVM is disabled
       migration: hold AioContext lock for loadvm qemu_fclose()
       qemu-iotests: 068: extract _qemu() function
       qemu-iotests: 068: use -drive/-device instead of -hda
       qemu-iotests: 068: test iothread mode
       qemu-img: don't shadow opts variable in img_dd()
-Stephen Bates (1):
+ include/block/block.h            |  29 ++-
-      nvme: Add support for Read Data and Write Data in CMBs.
+ include/block/block_int.h        |  11 +-
  block.c                          |   3 +-
  block/blkdebug.c                 |  13 +-
  block/io.c                       | 334 ++++++++++++++++-----------
  block/mirror.c                   |  26 +--
  block/qcow2-cluster.c            |   2 +-
  block/qcow2.c                    | 116 ++++++----
  qemu-img.c                       | 381 ++++++++++++++-----------------
  qemu-io-cmds.c                   |  13 --
  block/trace-events               |   2 +-
  qemu-img.texi                    |   9 +-
  tests/qemu-iotests/067           |   2 +-
  tests/qemu-iotests/074.out       |   2 -
  tests/qemu-iotests/087           |   2 +-
  tests/qemu-iotests/125           |   7 +-
  tests/qemu-iotests/125.out       | 480 +++++++++++++++++++++++++++++++++++----
  tests/qemu-iotests/127           |  97 ++++++++
  tests/qemu-iotests/127.out       |  14 ++
  tests/qemu-iotests/177           |  12 +-
  tests/qemu-iotests/177.out       |  19 +-
  tests/qemu-iotests/184           |   3 +-
  tests/qemu-iotests/184.out       |   6 +-
  tests/qemu-iotests/191           |   7 +-
  tests/qemu-iotests/191.out       |  48 ++--
  tests/qemu-iotests/common.filter |   6 +
  tests/qemu-iotests/group         |   1 +
 files changed, 1102 insertions(+), 543 deletions(-)
  create mode 100755 tests/qemu-iotests/127
  create mode 100644 tests/qemu-iotests/127.out
-sochin.jiang (1):
-      fix: avoid an infinite loop or a dangling pointer problem in img_commit
- block/Makefile.objs            |   2 +-
- block/blkdebug.c               |  46 +--
- block/blkreplay.c              |   8 +-
- block/blkverify.c              |  12 +-
- block/block-backend.c          |  22 +-
- block/commit.c                 |   7 +
- block/file-posix.c             |  34 +-
- block/io.c                     | 240 ++-----------
- block/iscsi.c                  |  20 +-
- block/mirror.c                 |   8 +-
- block/nbd-client.c             |   8 +-
- block/nbd-client.h             |   4 +-
- block/nbd.c                    |   6 +-
- block/nfs.c                    |   2 +-
- block/qcow2-cluster.c          | 201 ++++++++---
- block/qcow2.c                  |  94 +++--
- block/qcow2.h                  |  11 +-
- block/qed-cluster.c            | 124 +++----
- block/qed-gencb.c              |  33 --
- block/qed-table.c              | 261 +++++---------
- block/qed.c                    | 779 ++++++++++++++++-------------------------
- block/qed.h                    |  54 +--
- block/raw-format.c             |   8 +-
- block/rbd.c                    |   4 +-
- block/sheepdog.c               |  12 +-
- block/ssh.c                    |   2 +-
- block/throttle-groups.c        |   2 +-
- block/trace-events             |   3 -
- blockjob.c                     |   4 +-
- hw/block/nvme.c                |  83 +++--
- hw/block/nvme.h                |   1 +
- hw/virtio/virtio-pci.c         |   2 +-
- include/block/block.h          |  16 +-
- include/block/block_int.h      |   6 +-
- include/block/blockjob.h       |  18 +
- include/sysemu/block-backend.h |  20 +-
- migration/savevm.c             |  32 +-
- qemu-img.c                     |  29 +-
- qemu-io-cmds.c                 |  46 +--
- qemu-options.hx                | 221 ++++++++++--
- tests/qemu-iotests/068         |  37 +-
- tests/qemu-iotests/068.out     |  11 +-
- tests/qemu-iotests/185         | 206 +++++++++++
- tests/qemu-iotests/185.out     |  59 ++++
- tests/qemu-iotests/common.qemu |   3 +
- tests/qemu-iotests/group       |   1 +
-files changed, 1477 insertions(+), 1325 deletions(-)
- delete mode 100644 block/qed-gencb.c
- create mode 100755 tests/qemu-iotests/185
- create mode 100644 tests/qemu-iotests/185.out

-[Qemu-devel] [PULL 01/61] commit: Fix completion with extra reference
+Deleted patch
-commit_complete() can't assume that after its block_job_completed() the
-job is actually immediately freed; someone else may still be holding
-references. In this case, the op blockers on the intermediate nodes make
-the graph reconfiguration in the completion code fail.
-Call block_job_remove_all_bdrv() manually so that we know for sure that
-any blockers on intermediate nodes are given up.
-Cc: qemu-stable@nongnu.org
-Signed-off-by: Kevin Wolf <kwolf@redhat.com>
-Reviewed-by: Eric Blake <eblake@redhat.com>
-Reviewed-by: Max Reitz <mreitz@redhat.com>
----
- block/commit.c | 7 +++++++
-file changed, 7 insertions(+)
-diff --git a/block/commit.c b/block/commit.c
-index XXXXXXX..XXXXXXX 100644
---- a/block/commit.c
-+++ b/block/commit.c
-@@ -XXX,XX +XXX,XX @@ static void commit_complete(BlockJob *job, void *opaque)
-     }
-     g_free(s->backing_file_str);
-     blk_unref(s->top);
-+
-+    /* If there is more than one reference to the job (e.g. if called from
-+     * block_job_finish_sync()), block_job_completed() won't free it and
-+     * therefore the blockers on the intermediate nodes remain. This would
-+     * cause bdrv_set_backing_hd() to fail. */
-+    block_job_remove_all_bdrv(job);
-+
-     block_job_completed(&s->common, ret);
-     g_free(data);
---
-.8.3.1

-[Qemu-devel] [PULL 02/61] qemu-iotests: Allow starting new qemu after cleanup
+Deleted patch
-After _cleanup_qemu(), test cases should be able to start the next qemu
-process and call _cleanup_qemu() for that one as well. For this to work
-cleanly, we need to improve the cleanup so that the second invocation
-doesn't try to kill the qemu instances from the first invocation a
-second time (which would result in error messages).
-Signed-off-by: Kevin Wolf <kwolf@redhat.com>
-Reviewed-by: Eric Blake <eblake@redhat.com>
-Reviewed-by: Max Reitz <mreitz@redhat.com>
----
- tests/qemu-iotests/common.qemu | 3 +++
-file changed, 3 insertions(+)
-diff --git a/tests/qemu-iotests/common.qemu b/tests/qemu-iotests/common.qemu
-index XXXXXXX..XXXXXXX 100644
---- a/tests/qemu-iotests/common.qemu
-+++ b/tests/qemu-iotests/common.qemu
-@@ -XXX,XX +XXX,XX @@ function _cleanup_qemu()
-         rm -f "${QEMU_FIFO_IN}_${i}" "${QEMU_FIFO_OUT}_${i}"
-         eval "exec ${QEMU_IN[$i]}<&-"   # close file descriptors
-         eval "exec ${QEMU_OUT[$i]}<&-"
-+
-+        unset QEMU_IN[$i]
-+        unset QEMU_OUT[$i]
-     done
- }
---
-.8.3.1

-[Qemu-devel] [PULL 50/61] qed: Use CoQueue for serialising allocations
+[Qemu-devel] [PULL 01/35] block: don't add 'driver' to options when referring to backing via node name
-Now that we're running in coroutine context, the ad-hoc serialisation
+From: Peter Krempa <pkrempa@redhat.com>
 code (which drops a request that has to wait out of coroutine context)
 can be replaced by a CoQueue.
-This means that when we resume a serialised request, it is running in
+When referring to a backing file of an image via node name
-coroutine context again and its I/O isn't blocking any more.
+bdrv_open_backing_file would add the 'driver' option to the option list
 filling it with the backing format driver. This breaks construction of
 the backing chain via -blockdev, as bdrv_open_inherit reports an error
 if both 'reference' and 'options' are provided.
+$ qemu-img create -f raw /tmp/backing.raw 64M
+$ qemu-img create -f qcow2 -F raw -b /tmp/backing.raw /tmp/test.qcow2
+$ qemu-system-x86_64 \
+  -blockdev driver=file,filename=/tmp/backing.raw,node-name=backing \
+  -blockdev driver=qcow2,file.driver=file,file.filename=/tmp/test.qcow2,node-name=root,backing=backing
+qemu-system-x86_64: -blockdev driver=qcow2,file.driver=file,file.filename=/tmp/test.qcow2,node-name=root,backing=backing: Could not open backing file: Cannot reference an existing block device with additional options or a new filename
+Signed-off-by: Peter Krempa <pkrempa@redhat.com>
 Signed-off-by: Kevin Wolf <kwolf@redhat.com>
-Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
 ---
- block/qed.c | 49 +++++++++++++++++--------------------------------
+ block.c | 3 ++-
- block/qed.h |  3 ++-
+file changed, 2 insertions(+), 1 deletion(-)
 files changed, 19 insertions(+), 33 deletions(-)
-diff --git a/block/qed.c b/block/qed.c
+diff --git a/block.c b/block.c
 index XXXXXXX..XXXXXXX 100644
---- a/block/qed.c
+--- a/block.c
-+++ b/block/qed.c
++++ b/block.c
-@@ -XXX,XX +XXX,XX @@ static void qed_plug_allocating_write_reqs(BDRVQEDState *s)
+@@ -XXX,XX +XXX,XX @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict *parent_options,
+         goto free_exit;
  static void qed_unplug_allocating_write_reqs(BDRVQEDState *s)
  {
 -    QEDAIOCB *acb;
 -
      assert(s->allocating_write_reqs_plugged);
      s->allocating_write_reqs_plugged = false;
 -
 -    acb = QSIMPLEQ_FIRST(&s->allocating_write_reqs);
 -    if (acb) {
 -        qed_aio_start_io(acb);
 -    }
 +    qemu_co_enter_next(&s->allocating_write_reqs);
  }
  static void qed_clear_need_check(void *opaque, int ret)
@@ -XXX,XX +XXX,XX @@ static void qed_need_check_timer_cb(void *opaque)
      BDRVQEDState *s = opaque;
      /* The timer should only fire when allocating writes have drained */
 -    assert(!QSIMPLEQ_FIRST(&s->allocating_write_reqs));
 +    assert(!s->allocating_acb);
      trace_qed_need_check_timer_cb(s);
@@ -XXX,XX +XXX,XX @@ static int bdrv_qed_do_open(BlockDriverState *bs, QDict *options, int flags,
      int ret;
      s->bs = bs;
 -    QSIMPLEQ_INIT(&s->allocating_write_reqs);
 +    qemu_co_queue_init(&s->allocating_write_reqs);
      ret = bdrv_pread(bs->file, 0, &le_header, sizeof(le_header));
      if (ret < 0) {
@@ -XXX,XX +XXX,XX @@ static void qed_aio_complete_bh(void *opaque)
      qed_release(s);
  }
 -static void qed_resume_alloc_bh(void *opaque)
 -{
 -    qed_aio_start_io(opaque);
 -}
 -
  static void qed_aio_complete(QEDAIOCB *acb, int ret)
  {
      BDRVQEDState *s = acb_to_s(acb);
@@ -XXX,XX +XXX,XX @@ static void qed_aio_complete(QEDAIOCB *acb, int ret)
       * next request in the queue.  This ensures that we don't cycle through
       * requests multiple times but rather finish one at a time completely.
       */
 -    if (acb == QSIMPLEQ_FIRST(&s->allocating_write_reqs)) {
 -        QEDAIOCB *next_acb;
 -        QSIMPLEQ_REMOVE_HEAD(&s->allocating_write_reqs, next);
 -        next_acb = QSIMPLEQ_FIRST(&s->allocating_write_reqs);
 -        if (next_acb) {
 -            aio_bh_schedule_oneshot(bdrv_get_aio_context(acb->common.bs),
 -                                    qed_resume_alloc_bh, next_acb);
 +    if (acb == s->allocating_acb) {
 +        s->allocating_acb = NULL;
 +        if (!qemu_co_queue_empty(&s->allocating_write_reqs)) {
 +            qemu_co_enter_next(&s->allocating_write_reqs);
          } else if (s->header.features & QED_F_NEED_CHECK) {
              qed_start_need_check_timer(s);
          }
@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
      int ret;
      /* Cancel timer when the first allocating request comes in */
 -    if (QSIMPLEQ_EMPTY(&s->allocating_write_reqs)) {
 +    if (s->allocating_acb == NULL) {
          qed_cancel_need_check_timer(s);
      }
-     /* Freeze this request if another allocating write is in progress */
+-    if (bs->backing_format[0] != '\0' && !qdict_haskey(options, "driver")) {
--    if (acb != QSIMPLEQ_FIRST(&s->allocating_write_reqs)) {
++    if (!reference &&
--        QSIMPLEQ_INSERT_TAIL(&s->allocating_write_reqs, acb, next);
++        bs->backing_format[0] != '\0' && !qdict_haskey(options, "driver")) {
--    }
+         qdict_put_str(options, "driver", bs->backing_format);
 -    if (acb != QSIMPLEQ_FIRST(&s->allocating_write_reqs) ||
 -        s->allocating_write_reqs_plugged) {
 -        return -EINPROGRESS; /* wait for existing request to finish */
 +    if (s->allocating_acb != acb || s->allocating_write_reqs_plugged) {
 +        if (s->allocating_acb != NULL) {
 +            qemu_co_queue_wait(&s->allocating_write_reqs, NULL);
 +            assert(s->allocating_acb == NULL);
 +        }
 +        s->allocating_acb = acb;
 +        return -EAGAIN; /* start over with looking up table entries */
      }
-     acb->cur_nclusters = qed_bytes_to_clusters(s,
-@@ -XXX,XX +XXX,XX @@ static void qed_aio_next_io(QEDAIOCB *acb)
-             ret = qed_aio_read_data(acb, ret, offset, len);
-         }
--        if (ret < 0) {
--            if (ret != -EINPROGRESS) {
--                qed_aio_complete(acb, ret);
--            }
-+        if (ret < 0 && ret != -EAGAIN) {
-+            qed_aio_complete(acb, ret);
-             return;
-         }
-     }
-diff --git a/block/qed.h b/block/qed.h
-index XXXXXXX..XXXXXXX 100644
---- a/block/qed.h
-+++ b/block/qed.h
-@@ -XXX,XX +XXX,XX @@ typedef struct {
-     uint32_t l2_mask;
-     /* Allocating write request queue */
--    QSIMPLEQ_HEAD(, QEDAIOCB) allocating_write_reqs;
-+    QEDAIOCB *allocating_acb;
-+    CoQueue allocating_write_reqs;
-     bool allocating_write_reqs_plugged;
-     /* Periodic flush and clear need check flag */
 --
-.8.3.1
+.13.6
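
The condition changed by the block.c hunk above can be modelled in isolation. This is a minimal sketch, not QEMU code: should_add_driver_default() is a hypothetical helper, and its three parameters stand in for the 'reference' variable, bs->backing_format and the qdict_haskey(options, "driver") check that appear in the patch. It only illustrates why the added !reference test keeps a node-name reference free of extra options.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of QEMU): decides whether the backing
 * format stored in the image should be injected as a default 'driver'
 * option when the backing file is opened.
 */
static bool should_add_driver_default(const char *reference,
                                      const char *backing_format,
                                      bool options_has_driver)
{
    assert(backing_format != NULL);

    /* A reference to an existing node must not carry additional options,
     * so never add 'driver' in that case. */
    if (reference != NULL) {
        return false;
    }

    /* Otherwise fall back to the format recorded in the image, unless the
     * caller already chose a driver explicitly. */
    return backing_format[0] != '\0' && !options_has_driver;
}
```

With a reference such as "backing" the helper returns false regardless of the recorded format, which corresponds to the case that previously failed with "Cannot reference an existing block device with additional options or a new filename".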

-[Qemu-devel] [PULL 09/61] doc: Document driver-specific -blockdev options
+[Qemu-devel] [PULL 02/35] qemu-iotests: Test backing_fmt with backing node reference
-This documents the driver-specific options for the raw, qcow2 and file
+This changes test case 191 to include a backing image that has
-block drivers for the man page. For everything else, we refer to the
+backing_fmt set in the image file, but is referenced by node name in the
-QAPI documentation.
+qemu command line.
 Signed-off-by: Kevin Wolf <kwolf@redhat.com>
 Reviewed-by: Eric Blake <eblake@redhat.com>
-Reviewed-by: Max Reitz <mreitz@redhat.com>
 ---
- qemu-options.hx | 115 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
+ tests/qemu-iotests/191     | 3 ++-
-file changed, 114 insertions(+), 1 deletion(-)
+ tests/qemu-iotests/191.out | 2 +-
 files changed, 3 insertions(+), 2 deletions(-)
-diff --git a/qemu-options.hx b/qemu-options.hx
+diff --git a/tests/qemu-iotests/191 b/tests/qemu-iotests/191
 index XXXXXXX..XXXXXXX 100755
 --- a/tests/qemu-iotests/191
 +++ b/tests/qemu-iotests/191
@@ -XXX,XX +XXX,XX @@ echo === Preparing and starting VM ===
  echo
  TEST_IMG="${TEST_IMG}.base" _make_test_img $size
 -TEST_IMG="${TEST_IMG}.mid" _make_test_img -b "${TEST_IMG}.base"
 +IMGOPTS=$(_optstr_add "$IMGOPTS" "backing_fmt=$IMGFMT") \
 +    TEST_IMG="${TEST_IMG}.mid" _make_test_img -b "${TEST_IMG}.base"
  _make_test_img -b "${TEST_IMG}.mid"
  TEST_IMG="${TEST_IMG}.ovl2" _make_test_img -b "${TEST_IMG}.mid"
 diff --git a/tests/qemu-iotests/191.out b/tests/qemu-iotests/191.out
 index XXXXXXX..XXXXXXX 100644
---- a/qemu-options.hx
+--- a/tests/qemu-iotests/191.out
-+++ b/qemu-options.hx
++++ b/tests/qemu-iotests/191.out
-@@ -XXX,XX +XXX,XX @@ STEXI
+@@ -XXX,XX +XXX,XX @@ QA output created by 191
- @item -blockdev @var{option}[,@var{option}[,@var{option}[,...]]]
+ === Preparing and starting VM ===
- @findex -blockdev
+ Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=67108864
--Define a new block driver node.
+-Formatting 'TEST_DIR/t.IMGFMT.mid', fmt=IMGFMT size=67108864 backing_file=TEST_DIR/t.IMGFMT.base
-+Define a new block driver node. Some of the options apply to all block drivers,
++Formatting 'TEST_DIR/t.IMGFMT.mid', fmt=IMGFMT size=67108864 backing_file=TEST_DIR/t.IMGFMT.base backing_fmt=IMGFMT
-+other options are only accepted for a specific block driver. See below for a
+ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864 backing_file=TEST_DIR/t.IMGFMT.mid
-+list of generic options and options for the most common block drivers.
+ Formatting 'TEST_DIR/t.IMGFMT.ovl2', fmt=IMGFMT size=67108864 backing_file=TEST_DIR/t.IMGFMT.mid
-+
+ wrote 65536/65536 bytes at offset 1048576
 +Options that expect a reference to another node (e.g. @code{file}) can be
 +given in two ways. Either you specify the node name of an already existing node
 +(file=@var{node-name}), or you define a new node inline, adding options
 +for the referenced node after a dot (file.filename=@var{path},file.aio=native).
 +
 +A block driver node created with @option{-blockdev} can be used for a guest
 +device by specifying its node name for the @code{drive} property in a
 +@option{-device} argument that defines a block device.
  @table @option
  @item Valid options for any block driver node:
@@ -XXX,XX +XXX,XX @@ zero write commands. You may even choose "unmap" if @var{discard} is set
  to "unmap" to allow a zero write to be converted to an @code{unmap} operation.
  @end table
 +@item Driver-specific options for @code{file}
 +
 +This is the protocol-level block driver for accessing regular files.
 +
 +@table @code
 +@item filename
 +The path to the image file in the local filesystem
 +@item aio
 +Specifies the AIO backend (threads/native, default: threads)
 +@end table
 +Example:
 +@example
 +-blockdev driver=file,node-name=disk,filename=disk.img
 +@end example
 +
 +@item Driver-specific options for @code{raw}
 +
 +This is the image format block driver for raw images. It is usually
 +stacked on top of a protocol level block driver such as @code{file}.
 +
 +@table @code
 +@item file
 +Reference to or definition of the data source block driver node
 +(e.g. a @code{file} driver node)
 +@end table
 +Example 1:
 +@example
 +-blockdev driver=file,node-name=disk_file,filename=disk.img
 +-blockdev driver=raw,node-name=disk,file=disk_file
 +@end example
 +Example 2:
 +@example
 +-blockdev driver=raw,node-name=disk,file.driver=file,file.filename=disk.img
 +@end example
 +
 +@item Driver-specific options for @code{qcow2}
 +
 +This is the image format block driver for qcow2 images. It is usually
 +stacked on top of a protocol level block driver such as @code{file}.
 +
 +@table @code
 +@item file
 +Reference to or definition of the data source block driver node
 +(e.g. a @code{file} driver node)
 +
 +@item backing
 +Reference to or definition of the backing file block device (default is taken
 +from the image file). It is allowed to pass an empty string here in order to
 +disable the default backing file.
 +
 +@item lazy-refcounts
 +Whether to enable the lazy refcounts feature (on/off; default is taken from the
 +image file)
 +
 +@item cache-size
 +The maximum total size of the L2 table and refcount block caches in bytes
 +(default: 1048576 bytes or 8 clusters, whichever is larger)
 +
 +@item l2-cache-size
 +The maximum size of the L2 table cache in bytes
 +(default: 4/5 of the total cache size)
 +
 +@item refcount-cache-size
 +The maximum size of the refcount block cache in bytes
 +(default: 1/5 of the total cache size)
 +
 +@item cache-clean-interval
 +Clean unused entries in the L2 and refcount caches. The interval is in seconds.
 +The default value is 0 and it disables this feature.
 +
 +@item pass-discard-request
 +Whether discard requests to the qcow2 device should be forwarded to the data
 +source (on/off; default: on if discard=unmap is specified, off otherwise)
 +
 +@item pass-discard-snapshot
 +Whether discard requests for the data source should be issued when a snapshot
 +operation (e.g. deleting a snapshot) frees clusters in the qcow2 file (on/off;
 +default: on)
 +
 +@item pass-discard-other
 +Whether discard requests for the data source should be issued on other
 +occasions where a cluster gets freed (on/off; default: off)
 +
 +@item overlap-check
 +Which overlap checks to perform for writes to the image
 +(none/constant/cached/all; default: cached). For details or finer
 +granularity control refer to the QAPI documentation of @code{blockdev-add}.
 +@end table
 +
 +Example 1:
 +@example
 +-blockdev driver=file,node-name=my_file,filename=/tmp/disk.qcow2
 +-blockdev driver=qcow2,node-name=hda,file=my_file,overlap-check=none,cache-size=16777216
 +@end example
 +Example 2:
 +@example
 +-blockdev driver=qcow2,node-name=disk,file.driver=http,file.filename=http://example.com/image.qcow2
 +@end example
 +
 +@item Driver-specific options for other drivers
 +Please refer to the QAPI documentation of the @code{blockdev-add} QMP command.
 +
  @end table
  ETEXI
 --
-.8.3.1
+.13.6

-[Qemu-devel] [PULL 52/61] qed: Use a coroutine for need_check_timer
+[Qemu-devel] [PULL 03/35] block: Allow NULL file for bdrv_get_block_status()
-This fixes the last place where we degraded from AIO to actual blocking
+From: Eric Blake <eblake@redhat.com>
-synchronous I/O requests. Putting it into a coroutine means that instead
-of blocking, the coroutine simply yields while doing I/O.
+Not all callers care about which BDS owns the mapping for a given
+range of the file.  This patch merely simplifies the callers by
 consolidating the logic in the common call point, while guaranteeing
 a non-NULL file to all the driver callbacks, for no semantic change.
 The only caller that does not care about pnum is bdrv_is_allocated,
 as invoked by vvfat; we can likewise add assertions that the rest
 of the stack does not have to worry about a NULL pnum.
 Furthermore, this will also set the stage for a future cleanup: when
 a caller does not care about which BDS owns an offset, it would be
 nice to allow the driver to optimize things to not have to return
 BDRV_BLOCK_OFFSET_VALID in the first place.  In the case of fragmented
 allocation (for example, it's fairly easy to create a qcow2 image
 where consecutive guest addresses are not at consecutive host
 addresses), the current contract requires bdrv_get_block_status()
 to clamp *pnum to the limit where host addresses are no longer
 consecutive, but allowing a NULL file means that *pnum could be
 set to the full length of known-allocated data.
 Signed-off-by: Eric Blake <eblake@redhat.com>
 Signed-off-by: Kevin Wolf <kwolf@redhat.com>
-Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
 ---
- block/qed.c | 33 +++++++++++++++++----------------
+ include/block/block_int.h | 10 ++++++----
-file changed, 17 insertions(+), 16 deletions(-)
+ block/io.c                | 49 ++++++++++++++++++++++++++---------------------
+ block/mirror.c            |  3 +--
-diff --git a/block/qed.c b/block/qed.c
+ block/qcow2.c             |  8 ++------
-index XXXXXXX..XXXXXXX 100644
+ qemu-img.c                | 10 ++++------
---- a/block/qed.c
+files changed, 40 insertions(+), 40 deletions(-)
-+++ b/block/qed.c
-@@ -XXX,XX +XXX,XX @@ static void qed_unplug_allocating_write_reqs(BDRVQEDState *s)
+diff --git a/include/block/block_int.h b/include/block/block_int.h
-     qemu_co_enter_next(&s->allocating_write_reqs);
+index XXXXXXX..XXXXXXX 100644
 --- a/include/block/block_int.h
 +++ b/include/block/block_int.h
@@ -XXX,XX +XXX,XX @@ struct BlockDriver {
          int64_t offset, int bytes);
      /*
 -     * Building block for bdrv_block_status[_above]. The driver should
 -     * answer only according to the current layer, and should not
 -     * set BDRV_BLOCK_ALLOCATED, but may set BDRV_BLOCK_RAW.  See block.h
 -     * for the meaning of _DATA, _ZERO, and _OFFSET_VALID.
 +     * Building block for bdrv_block_status[_above] and
 +     * bdrv_is_allocated[_above].  The driver should answer only
 +     * according to the current layer, and should not set
 +     * BDRV_BLOCK_ALLOCATED, but may set BDRV_BLOCK_RAW.  See block.h
 +     * for the meaning of _DATA, _ZERO, and _OFFSET_VALID.  The block
 +     * layer guarantees non-NULL pnum and file.
       */
      int64_t coroutine_fn (*bdrv_co_get_block_status)(BlockDriverState *bs,
          int64_t sector_num, int nb_sectors, int *pnum,
 diff --git a/block/io.c b/block/io.c
 index XXXXXXX..XXXXXXX 100644
 --- a/block/io.c
 +++ b/block/io.c
@@ -XXX,XX +XXX,XX @@ int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags flags)
  {
      int64_t target_sectors, ret, nb_sectors, sector_num = 0;
      BlockDriverState *bs = child->bs;
 -    BlockDriverState *file;
      int n;
      target_sectors = bdrv_nb_sectors(bs);
@@ -XXX,XX +XXX,XX @@ int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags flags)
          if (nb_sectors <= 0) {
              return 0;
          }
 -        ret = bdrv_get_block_status(bs, sector_num, nb_sectors, &n, &file);
 +        ret = bdrv_get_block_status(bs, sector_num, nb_sectors, &n, NULL);
          if (ret < 0) {
              error_report("error getting block status at sector %" PRId64 ": %s",
                           sector_num, strerror(-ret));
@@ -XXX,XX +XXX,XX @@ int64_t coroutine_fn bdrv_co_get_block_status_from_backing(BlockDriverState *bs,
   * beyond the end of the disk image it will be clamped; if 'pnum' is set to
   * the end of the image, then the returned value will include BDRV_BLOCK_EOF.
   *
 - * If returned value is positive and BDRV_BLOCK_OFFSET_VALID bit is set, 'file'
 - * points to the BDS which the sector range is allocated in.
 + * If returned value is positive, BDRV_BLOCK_OFFSET_VALID bit is set, and
 + * 'file' is non-NULL, then '*file' points to the BDS which the sector range
 + * is allocated in.
   */
  static int64_t coroutine_fn bdrv_co_get_block_status(BlockDriverState *bs,
                                                       int64_t sector_num,
@@ -XXX,XX +XXX,XX @@ static int64_t coroutine_fn bdrv_co_get_block_status(BlockDriverState *bs,
      int64_t total_sectors;
      int64_t n;
      int64_t ret, ret2;
 +    BlockDriverState *local_file = NULL;
 -    *file = NULL;
 +    assert(pnum);
 +    *pnum = 0;
      total_sectors = bdrv_nb_sectors(bs);
      if (total_sectors < 0) {
 -        return total_sectors;
 +        ret = total_sectors;
 +        goto early_out;
      }
      if (sector_num >= total_sectors) {
 -        *pnum = 0;
 -        return BDRV_BLOCK_EOF;
 +        ret = BDRV_BLOCK_EOF;
 +        goto early_out;
      }
      if (!nb_sectors) {
 -        *pnum = 0;
 -        return 0;
 +        ret = 0;
 +        goto early_out;
      }
      n = total_sectors - sector_num;
@@ -XXX,XX +XXX,XX @@ static int64_t coroutine_fn bdrv_co_get_block_status(BlockDriverState *bs,
          }
          if (bs->drv->protocol_name) {
              ret |= BDRV_BLOCK_OFFSET_VALID | (sector_num * BDRV_SECTOR_SIZE);
 -            *file = bs;
 +            local_file = bs;
          }
 -        return ret;
 +        goto early_out;
      }
      bdrv_inc_in_flight(bs);
      ret = bs->drv->bdrv_co_get_block_status(bs, sector_num, nb_sectors, pnum,
 -                                            file);
 +                                            &local_file);
      if (ret < 0) {
          *pnum = 0;
          goto out;
      }
      if (ret & BDRV_BLOCK_RAW) {
 -        assert(ret & BDRV_BLOCK_OFFSET_VALID && *file);
 -        ret = bdrv_co_get_block_status(*file, ret >> BDRV_SECTOR_BITS,
 -                                       *pnum, pnum, file);
 +        assert(ret & BDRV_BLOCK_OFFSET_VALID && local_file);
 +        ret = bdrv_co_get_block_status(local_file, ret >> BDRV_SECTOR_BITS,
 +                                       *pnum, pnum, &local_file);
          goto out;
      }
@@ -XXX,XX +XXX,XX @@ static int64_t coroutine_fn bdrv_co_get_block_status(BlockDriverState *bs,
          }
      }
 -    if (*file && *file != bs &&
 +    if (local_file && local_file != bs &&
          (ret & BDRV_BLOCK_DATA) && !(ret & BDRV_BLOCK_ZERO) &&
          (ret & BDRV_BLOCK_OFFSET_VALID)) {
 -        BlockDriverState *file2;
          int file_pnum;
 -        ret2 = bdrv_co_get_block_status(*file, ret >> BDRV_SECTOR_BITS,
 -                                        *pnum, &file_pnum, &file2);
 +        ret2 = bdrv_co_get_block_status(local_file, ret >> BDRV_SECTOR_BITS,
 +                                        *pnum, &file_pnum, NULL);
          if (ret2 >= 0) {
              /* Ignore errors.  This is just providing extra information, it
               * is useful but not necessary.
@@ -XXX,XX +XXX,XX @@ out:
      if (ret >= 0 && sector_num + *pnum == total_sectors) {
          ret |= BDRV_BLOCK_EOF;
      }
 +early_out:
 +    if (file) {
 +        *file = local_file;
 +    }
      return ret;
  }
--static void qed_clear_need_check(void *opaque, int ret)
+@@ -XXX,XX +XXX,XX @@ int64_t bdrv_get_block_status(BlockDriverState *bs,
-+static void qed_need_check_timer_entry(void *opaque)
+ int coroutine_fn bdrv_is_allocated(BlockDriverState *bs, int64_t offset,
                                     int64_t bytes, int64_t *pnum)
  {
-     BDRVQEDState *s = opaque;
+-    BlockDriverState *file;
-+    int ret;
+     int64_t sector_num = offset >> BDRV_SECTOR_BITS;
+     int nb_sectors = bytes >> BDRV_SECTOR_BITS;
--    if (ret) {
+     int64_t ret;
-+    /* The timer should only fire when allocating writes have drained */
+@@ -XXX,XX +XXX,XX @@ int coroutine_fn bdrv_is_allocated(BlockDriverState *bs, int64_t offset,
-+    assert(!s->allocating_acb);
+     assert(QEMU_IS_ALIGNED(offset, BDRV_SECTOR_SIZE));
-+
+     assert(QEMU_IS_ALIGNED(bytes, BDRV_SECTOR_SIZE) && bytes < INT_MAX);
-+    trace_qed_need_check_timer_cb(s);
+     ret = bdrv_get_block_status(bs, sector_num, nb_sectors, &psectors,
-+
+-                                &file);
-+    qed_acquire(s);
++                                NULL);
-+    qed_plug_allocating_write_reqs(s);
+     if (ret < 0) {
-+
+         return ret;
-+    /* Ensure writes are on disk before clearing flag */
+     }
-+    ret = bdrv_co_flush(s->bs->file->bs);
+diff --git a/block/mirror.c b/block/mirror.c
-+    qed_release(s);
+index XXXXXXX..XXXXXXX 100644
-+    if (ret < 0) {
+--- a/block/mirror.c
-         qed_unplug_allocating_write_reqs(s);
++++ b/block/mirror.c
-         return;
+@@ -XXX,XX +XXX,XX @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
-     }
+         int io_sectors;
-@@ -XXX,XX +XXX,XX @@ static void qed_clear_need_check(void *opaque, int ret)
+         unsigned int io_bytes;
+         int64_t io_bytes_acct;
-     qed_unplug_allocating_write_reqs(s);
+-        BlockDriverState *file;
+         enum MirrorMethod {
--    ret = bdrv_flush(s->bs);
+             MIRROR_METHOD_COPY,
-+    ret = bdrv_co_flush(s->bs);
+             MIRROR_METHOD_ZERO,
-     (void) ret;
+@@ -XXX,XX +XXX,XX @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
          ret = bdrv_get_block_status_above(source, NULL,
                                            offset >> BDRV_SECTOR_BITS,
                                            nb_chunks * sectors_per_chunk,
 -                                          &io_sectors, &file);
 +                                          &io_sectors, NULL);
          io_bytes = io_sectors * BDRV_SECTOR_SIZE;
          if (ret < 0) {
              io_bytes = MIN(nb_chunks * s->granularity, max_io_bytes);
 diff --git a/block/qcow2.c b/block/qcow2.c
 index XXXXXXX..XXXXXXX 100644
 --- a/block/qcow2.c
 +++ b/block/qcow2.c
@@ -XXX,XX +XXX,XX @@ static bool is_zero_sectors(BlockDriverState *bs, int64_t start,
                              uint32_t count)
  {
      int nr;
 -    BlockDriverState *file;
      int64_t res;
      if (start + count > bs->total_sectors) {
@@ -XXX,XX +XXX,XX @@ static bool is_zero_sectors(BlockDriverState *bs, int64_t start,
      if (!count) {
          return true;
      }
 -    res = bdrv_get_block_status_above(bs, NULL, start, count,
 -                                      &nr, &file);
 +    res = bdrv_get_block_status_above(bs, NULL, start, count, &nr, NULL);
      return res >= 0 && (res & BDRV_BLOCK_ZERO) && nr == count;
  }
- static void qed_need_check_timer_cb(void *opaque)
+@@ -XXX,XX +XXX,XX @@ static BlockMeasureInfo *qcow2_measure(QemuOpts *opts, BlockDriverState *in_bs,
- {
+                  offset += pnum * BDRV_SECTOR_SIZE) {
--    BDRVQEDState *s = opaque;
+                 int nb_sectors = MIN(ssize - offset,
--
+                                      BDRV_REQUEST_MAX_BYTES) / BDRV_SECTOR_SIZE;
--    /* The timer should only fire when allocating writes have drained */
+-                BlockDriverState *file;
--    assert(!s->allocating_acb);
+                 int64_t ret;
--
--    trace_qed_need_check_timer_cb(s);
+                 ret = bdrv_get_block_status_above(in_bs, NULL,
--
+                                                   offset >> BDRV_SECTOR_BITS,
--    qed_acquire(s);
+-                                                  nb_sectors,
--    qed_plug_allocating_write_reqs(s);
+-                                                  &pnum, &file);
--
++                                                  nb_sectors, &pnum, NULL);
--    /* Ensure writes are on disk before clearing flag */
+                 if (ret < 0) {
--    bdrv_aio_flush(s->bs->file->bs, qed_clear_need_check, s);
+                     error_setg_errno(&local_err, -ret,
--    qed_release(s);
+                                      "Unable to get block status");
-+    Coroutine *co = qemu_coroutine_create(qed_need_check_timer_entry, opaque);
+diff --git a/qemu-img.c b/qemu-img.c
-+    qemu_coroutine_enter(co);
+index XXXXXXX..XXXXXXX 100644
- }
+--- a/qemu-img.c
++++ b/qemu-img.c
- void qed_acquire(BDRVQEDState *s)
+@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
      for (;;) {
          int64_t status1, status2;
 -        BlockDriverState *file;
          nb_sectors = sectors_to_process(total_sectors, sector_num);
          if (nb_sectors <= 0) {
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
          }
          status1 = bdrv_get_block_status_above(bs1, NULL, sector_num,
                                                total_sectors1 - sector_num,
 -                                              &pnum1, &file);
 +                                              &pnum1, NULL);
          if (status1 < 0) {
              ret = 3;
              error_report("Sector allocation test failed for %s", filename1);
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
          status2 = bdrv_get_block_status_above(bs2, NULL, sector_num,
                                                total_sectors2 - sector_num,
 -                                              &pnum2, &file);
 +                                              &pnum2, NULL);
          if (status2 < 0) {
              ret = 3;
              error_report("Sector allocation test failed for %s", filename2);
@@ -XXX,XX +XXX,XX @@ static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num)
      n = MIN(s->total_sectors - sector_num, BDRV_REQUEST_MAX_SECTORS);
      if (s->sector_next_status <= sector_num) {
 -        BlockDriverState *file;
          if (s->target_has_backing) {
              ret = bdrv_get_block_status(blk_bs(s->src[src_cur]),
                                          sector_num - src_cur_offset,
 -                                        n, &n, &file);
 +                                        n, &n, NULL);
          } else {
              ret = bdrv_get_block_status_above(blk_bs(s->src[src_cur]), NULL,
                                                sector_num - src_cur_offset,
 -                                              n, &n, &file);
 +                                              n, &n, NULL);
          }
          if (ret < 0) {
              return ret;
 --
-.8.3.1
+.13.6

-[Qemu-devel] [PULL 54/61] qed: Use bdrv_co_* for coroutine_fns
+[Qemu-devel] [PULL 04/35] block: Add flag to avoid wasted work in bdrv_is_allocated()
-All functions that are marked coroutine_fn can directly call the
+From: Eric Blake <eblake@redhat.com>
-bdrv_co_* version of functions instead of going through the wrapper.
+Not all callers care about which BDS owns the mapping for a given
 range of the file, or where the zeroes lie within that mapping.  In
 particular, bdrv_is_allocated() cares more about finding the
 largest run of allocated data from the guest perspective, whether
 or not that data is consecutive from the host perspective, and
 whether or not the data reads as zero.  Therefore, doing subsequent
 refinements such as checking how much of the format-layer
 allocation also satisfies BDRV_BLOCK_ZERO at the protocol layer is
 wasted work - in the best case, it just costs extra CPU cycles
 during a single bdrv_is_allocated(), but in the worst case, it
 results in a smaller *pnum, and forces callers to iterate through
 more status probes when visiting the entire file for even more
 extra CPU cycles.
 This patch only optimizes the block layer (no behavior change when
 want_zero is true, but skip unnecessary effort when it is false).
 Then when subsequent patches tweak the driver callback to be
 byte-based, we can also pass this hint through to the driver.
 Tweak BdrvCoGetBlockStatusData to declare arguments in parameter
 order, rather than mixing things up (minimizing padding is not
 necessary here).
 Signed-off-by: Eric Blake <eblake@redhat.com>
 Signed-off-by: Kevin Wolf <kwolf@redhat.com>
-Reviewed-by: Manos Pitsidianakis <el13635@mail.ntua.gr>
-Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
 ---
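Note (illustration only, not part of the patch): the effect of want_zero on
*pnum described in the commit message can be sketched with a toy model. All
names below (ToySector, the TOY_* flags, toy_block_status(),
toy_count_probes()) are hypothetical and do not exist in QEMU; the sketch only
assumes that a status probe stops at the first sector whose state differs, and
that the zero state is only compared when want_zero is true.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flags; they only mirror BDRV_BLOCK_* in spirit. */
enum { TOY_DATA = 1, TOY_ZERO = 2, TOY_ALLOCATED = 4 };

/* Per-sector state of a toy image. */
typedef struct ToySector {
    bool allocated;   /* sector is allocated at this layer */
    bool zero;        /* sector is known to read as zeroes */
} ToySector;

/*
 * Report the status of the run starting at 'start' and store its length in
 * *pnum.  With want_zero, a change in the zero state also ends the run;
 * without it, only a change in the allocation state does.
 */
static int toy_block_status(const ToySector *img, int64_t total,
                            bool want_zero, int64_t start, int64_t *pnum)
{
    int flags = 0;
    int64_t n;

    assert(pnum);
    *pnum = 0;
    if (start >= total) {
        return 0;
    }

    bool alloc = img[start].allocated;
    bool zero = img[start].zero;

    for (n = start; n < total; n++) {
        if (img[n].allocated != alloc) {
            break;
        }
        if (want_zero && img[n].zero != zero) {
            break;
        }
    }
    *pnum = n - start;

    if (alloc) {
        flags |= TOY_ALLOCATED | TOY_DATA;
    }
    if (want_zero && zero) {
        flags |= TOY_ZERO;
    }
    return flags;
}

/* Count how many status probes are needed to visit the whole image. */
static int toy_count_probes(const ToySector *img, int64_t total,
                            bool want_zero)
{
    int probes = 0;
    int64_t pos = 0;
    int64_t pnum;

    while (pos < total) {
        toy_block_status(img, total, want_zero, pos, &pnum);
        pos += pnum;   /* pnum is at least 1 whenever pos < total */
        probes++;
    }
    return probes;
}
```

For a toy image of four allocated sectors that alternate between non-zero and
zero, followed by two unallocated sectors, walking the image takes five probes
with want_zero set and two without it, which is the "smaller *pnum" cost the
commit message refers to.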
- block/qed.c | 16 +++++++++-------
+ block/io.c | 57 +++++++++++++++++++++++++++++++++++++++++----------------
-file changed, 9 insertions(+), 7 deletions(-)
+file changed, 41 insertions(+), 16 deletions(-)
-diff --git a/block/qed.c b/block/qed.c
+diff --git a/block/io.c b/block/io.c
 index XXXXXXX..XXXXXXX 100644
---- a/block/qed.c
+--- a/block/io.c
-+++ b/block/qed.c
++++ b/block/io.c
-@@ -XXX,XX +XXX,XX @@ static int coroutine_fn qed_write_header(BDRVQEDState *s)
+@@ -XXX,XX +XXX,XX @@ int bdrv_flush_all(void)
-     };
+ typedef struct BdrvCoGetBlockStatusData {
-     qemu_iovec_init_external(&qiov, &iov, 1);
+     BlockDriverState *bs;
+     BlockDriverState *base;
--    ret = bdrv_preadv(s->bs->file, 0, &qiov);
+-    BlockDriverState **file;
-+    ret = bdrv_co_preadv(s->bs->file, 0, qiov.size, &qiov, 0);
++    bool want_zero;
-     if (ret < 0) {
+     int64_t sector_num;
      int nb_sectors;
      int *pnum;
 +    BlockDriverState **file;
      int64_t ret;
      bool done;
  } BdrvCoGetBlockStatusData;
@@ -XXX,XX +XXX,XX @@ int64_t coroutine_fn bdrv_co_get_block_status_from_backing(BlockDriverState *bs,
   * Drivers not implementing the functionality are assumed to not support
   * backing files, hence all their sectors are reported as allocated.
   *
 + * If 'want_zero' is true, the caller is querying for mapping purposes,
 + * and the result should include BDRV_BLOCK_OFFSET_VALID and
 + * BDRV_BLOCK_ZERO where possible; otherwise, the result may omit those
 + * bits particularly if it allows for a larger value in 'pnum'.
 + *
   * If 'sector_num' is beyond the end of the disk image the return value is
   * BDRV_BLOCK_EOF and 'pnum' is set to 0.
   *
@@ -XXX,XX +XXX,XX @@ int64_t coroutine_fn bdrv_co_get_block_status_from_backing(BlockDriverState *bs,
   * is allocated in.
   */
  static int64_t coroutine_fn bdrv_co_get_block_status(BlockDriverState *bs,
 +                                                     bool want_zero,
                                                       int64_t sector_num,
                                                       int nb_sectors, int *pnum,
                                                       BlockDriverState **file)
@@ -XXX,XX +XXX,XX @@ static int64_t coroutine_fn bdrv_co_get_block_status(BlockDriverState *bs,
      if (ret & BDRV_BLOCK_RAW) {
          assert(ret & BDRV_BLOCK_OFFSET_VALID && local_file);
 -        ret = bdrv_co_get_block_status(local_file, ret >> BDRV_SECTOR_BITS,
 +        ret = bdrv_co_get_block_status(local_file, want_zero,
 +                                       ret >> BDRV_SECTOR_BITS,
                                         *pnum, pnum, &local_file);
          goto out;
      }
-@@ -XXX,XX +XXX,XX @@ static int coroutine_fn qed_write_header(BDRVQEDState *s)
-     /* Update header */
+     if (ret & (BDRV_BLOCK_DATA | BDRV_BLOCK_ZERO)) {
-     qed_header_cpu_to_le(&s->header, (QEDHeader *) buf);
+         ret |= BDRV_BLOCK_ALLOCATED;
+-    } else {
--    ret = bdrv_pwritev(s->bs->file, 0, &qiov);
++    } else if (want_zero) {
-+    ret = bdrv_co_pwritev(s->bs->file, 0, qiov.size,  &qiov, 0);
+         if (bdrv_unallocated_blocks_are_zero(bs)) {
-     if (ret < 0) {
+             ret |= BDRV_BLOCK_ZERO;
-         goto out;
+         } else if (bs->backing) {
              BlockDriverState *bs2 = bs->backing->bs;
              int64_t nb_sectors2 = bdrv_nb_sectors(bs2);
 +
              if (nb_sectors2 >= 0 && sector_num >= nb_sectors2) {
                  ret |= BDRV_BLOCK_ZERO;
              }
          }
      }
-@@ -XXX,XX +XXX,XX @@ static int coroutine_fn qed_read_backing_file(BDRVQEDState *s, uint64_t pos,
-     qemu_iovec_concat(*backing_qiov, qiov, 0, size);
+-    if (local_file && local_file != bs &&
++    if (want_zero && local_file && local_file != bs &&
-     BLKDBG_EVENT(s->bs->file, BLKDBG_READ_BACKING_AIO);
+         (ret & BDRV_BLOCK_DATA) && !(ret & BDRV_BLOCK_ZERO) &&
--    ret = bdrv_preadv(s->bs->backing, pos, *backing_qiov);
+         (ret & BDRV_BLOCK_OFFSET_VALID)) {
-+    ret = bdrv_co_preadv(s->bs->backing, pos, size, *backing_qiov, 0);
+         int file_pnum;
-     if (ret < 0) {
-         return ret;
+-        ret2 = bdrv_co_get_block_status(local_file, ret >> BDRV_SECTOR_BITS,
-     }
++        ret2 = bdrv_co_get_block_status(local_file, want_zero,
-@@ -XXX,XX +XXX,XX @@ static int coroutine_fn qed_copy_from_backing_file(BDRVQEDState *s,
++                                        ret >> BDRV_SECTOR_BITS,
-     }
+                                         *pnum, &file_pnum, NULL);
+         if (ret2 >= 0) {
-     BLKDBG_EVENT(s->bs->file, BLKDBG_COW_WRITE);
+             /* Ignore errors.  This is just providing extra information, it
--    ret = bdrv_pwritev(s->bs->file, offset, &qiov);
+@@ -XXX,XX +XXX,XX @@ early_out:
-+    ret = bdrv_co_pwritev(s->bs->file, offset, qiov.size, &qiov, 0);
-     if (ret < 0) {
+ static int64_t coroutine_fn bdrv_co_get_block_status_above(BlockDriverState *bs,
-         goto out;
+         BlockDriverState *base,
-     }
++        bool want_zero,
-@@ -XXX,XX +XXX,XX @@ static int coroutine_fn qed_aio_write_main(QEDAIOCB *acb)
+         int64_t sector_num,
-     trace_qed_aio_write_main(s, acb, 0, offset, acb->cur_qiov.size);
+         int nb_sectors,
+         int *pnum,
-     BLKDBG_EVENT(s->bs->file, BLKDBG_WRITE_AIO);
+@@ -XXX,XX +XXX,XX @@ static int64_t coroutine_fn bdrv_co_get_block_status_above(BlockDriverState *bs,
--    ret = bdrv_pwritev(s->bs->file, offset, &acb->cur_qiov);
-+    ret = bdrv_co_pwritev(s->bs->file, offset, acb->cur_qiov.size,
+     assert(bs != base);
-+                          &acb->cur_qiov, 0);
+     for (p = bs; p != base; p = backing_bs(p)) {
-     if (ret < 0) {
+-        ret = bdrv_co_get_block_status(p, sector_num, nb_sectors, pnum, file);
-         return ret;
++        ret = bdrv_co_get_block_status(p, want_zero, sector_num, nb_sectors,
-     }
++                                       pnum, file);
-@@ -XXX,XX +XXX,XX @@ static int coroutine_fn qed_aio_write_main(QEDAIOCB *acb)
+         if (ret < 0) {
-              * region.  The solution is to flush after writing a new data
+             break;
-              * cluster and before updating the L2 table.
+         }
-              */
+@@ -XXX,XX +XXX,XX @@ static void coroutine_fn bdrv_get_block_status_above_co_entry(void *opaque)
--            ret = bdrv_flush(s->bs->file->bs);
+     BdrvCoGetBlockStatusData *data = opaque;
-+            ret = bdrv_co_flush(s->bs->file->bs);
-             if (ret < 0) {
+     data->ret = bdrv_co_get_block_status_above(data->bs, data->base,
-                 return ret;
++                                               data->want_zero,
-             }
+                                                data->sector_num,
-@@ -XXX,XX +XXX,XX @@ static int coroutine_fn qed_aio_read_data(void *opaque, int ret,
+                                                data->nb_sectors,
-     }
+                                                data->pnum,
+@@ -XXX,XX +XXX,XX @@ static void coroutine_fn bdrv_get_block_status_above_co_entry(void *opaque)
-     BLKDBG_EVENT(bs->file, BLKDBG_READ_AIO);
+  *
--    ret = bdrv_preadv(bs->file, offset, &acb->cur_qiov);
+  * See bdrv_co_get_block_status_above() for details.
-+    ret = bdrv_co_preadv(bs->file, offset, acb->cur_qiov.size,
+  */
-+                         &acb->cur_qiov, 0);
+-int64_t bdrv_get_block_status_above(BlockDriverState *bs,
 -                                    BlockDriverState *base,
 -                                    int64_t sector_num,
 -                                    int nb_sectors, int *pnum,
 -                                    BlockDriverState **file)
 +static int64_t bdrv_common_block_status_above(BlockDriverState *bs,
 +                                              BlockDriverState *base,
 +                                              bool want_zero,
 +                                              int64_t sector_num,
 +                                              int nb_sectors, int *pnum,
 +                                              BlockDriverState **file)
  {
      Coroutine *co;
      BdrvCoGetBlockStatusData data = {
          .bs = bs,
          .base = base,
 -        .file = file,
 +        .want_zero = want_zero,
          .sector_num = sector_num,
          .nb_sectors = nb_sectors,
          .pnum = pnum,
 +        .file = file,
          .done = false,
      };
@@ -XXX,XX +XXX,XX @@ int64_t bdrv_get_block_status_above(BlockDriverState *bs,
      return data.ret;
  }
 +int64_t bdrv_get_block_status_above(BlockDriverState *bs,
 +                                    BlockDriverState *base,
 +                                    int64_t sector_num,
 +                                    int nb_sectors, int *pnum,
 +                                    BlockDriverState **file)
 +{
 +    return bdrv_common_block_status_above(bs, base, true, sector_num,
 +                                          nb_sectors, pnum, file);
 +}
 +
  int64_t bdrv_get_block_status(BlockDriverState *bs,
                                int64_t sector_num,
                                int nb_sectors, int *pnum,
@@ -XXX,XX +XXX,XX @@ int64_t bdrv_get_block_status(BlockDriverState *bs,
  int coroutine_fn bdrv_is_allocated(BlockDriverState *bs, int64_t offset,
                                     int64_t bytes, int64_t *pnum)
  {
 -    int64_t sector_num = offset >> BDRV_SECTOR_BITS;
 -    int nb_sectors = bytes >> BDRV_SECTOR_BITS;
      int64_t ret;
      int psectors;
      assert(QEMU_IS_ALIGNED(offset, BDRV_SECTOR_SIZE));
      assert(QEMU_IS_ALIGNED(bytes, BDRV_SECTOR_SIZE) && bytes < INT_MAX);
 -    ret = bdrv_get_block_status(bs, sector_num, nb_sectors, &psectors,
 -                                NULL);
 +    ret = bdrv_common_block_status_above(bs, backing_bs(bs), false,
 +                                         offset >> BDRV_SECTOR_BITS,
 +                                         bytes >> BDRV_SECTOR_BITS, &psectors,
 +                                         NULL);
      if (ret < 0) {
          return ret;
      }
 --
-.8.3.1
+.13.6

-[Qemu-devel] [PULL 53/61] qed: Add coroutine_fn to I/O path functions
+[Qemu-devel] [PULL 05/35] block: Make bdrv_round_to_clusters() signature more useful
-Now that we stay in coroutine context for the whole request when doing
+From: Eric Blake <eblake@redhat.com>
 reads or writes, we can add coroutine_fn annotations to many functions
 that can do I/O or yield directly.
+In the process of converting sector-based interfaces to bytes,
+I'm finding it easier to represent a byte count as a 64-bit
+integer at the block layer (even if we are internally capped
+by SIZE_MAX or even INT_MAX for individual transactions, it's
+still nicer to not have to worry about truncation/overflow
+issues on as many variables).  Update the signature of
+bdrv_round_to_clusters() to uniformly use int64_t, matching
+the signature already chosen for bdrv_is_allocated and the
+fact that off_t is also a signed type, then adjust clients
+according to the required fallout (even where the result could
+now exceed 32 bits, no client is directly assigning the result
+into a 32-bit value without breaking things into a loop first).
+Signed-off-by: Eric Blake <eblake@redhat.com>
 Signed-off-by: Kevin Wolf <kwolf@redhat.com>
-Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
 ---
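Note (illustration only, not part of the patch): a minimal sketch of rounding
a byte range out to cluster boundaries with int64_t throughout, in the spirit
of the new bdrv_round_to_clusters() signature. toy_round_to_clusters() is a
hypothetical stand-in: it takes the cluster size as a parameter instead of
querying the driver, and it is not QEMU's implementation.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Round [offset, offset + bytes) outwards to multiples of cluster_size.
 * All quantities are int64_t, so a rounded length above 4 GiB is
 * representable without truncation.  Overflow of the rounded-up end is
 * not handled in this sketch.
 */
static void toy_round_to_clusters(int64_t cluster_size,
                                  int64_t offset, int64_t bytes,
                                  int64_t *cluster_offset,
                                  int64_t *cluster_bytes)
{
    assert(cluster_size > 0 && offset >= 0 && bytes >= 0);
    assert(bytes <= INT64_MAX - offset);

    int64_t start = offset - offset % cluster_size;   /* align down */
    int64_t end = offset + bytes;
    int64_t rem = end % cluster_size;

    if (rem) {
        end += cluster_size - rem;                    /* align up */
    }

    *cluster_offset = start;
    *cluster_bytes = end - start;
}
```

With a 64 KiB cluster, a 100-byte request at offset 70000 rounds to the single
cluster starting at 65536. A request of 5 GiB plus one byte at offset 0 rounds
to a length that no longer fits in an unsigned int, which is the case the
wider type is meant to cover.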
- block/qed-cluster.c |  5 +++--
+ include/block/block.h | 4 ++--
- block/qed.c         | 44 ++++++++++++++++++++++++--------------------
+ block/io.c            | 6 +++---
- block/qed.h         |  5 +++--
+ block/mirror.c        | 7 +++----
-files changed, 30 insertions(+), 24 deletions(-)
+ block/trace-events    | 2 +-
 files changed, 9 insertions(+), 10 deletions(-)
-diff --git a/block/qed-cluster.c b/block/qed-cluster.c
+diff --git a/include/block/block.h b/include/block/block.h
 index XXXXXXX..XXXXXXX 100644
---- a/block/qed-cluster.c
+--- a/include/block/block.h
-+++ b/block/qed-cluster.c
++++ b/include/block/block.h
-@@ -XXX,XX +XXX,XX @@ static unsigned int qed_count_contiguous_clusters(BDRVQEDState *s,
+@@ -XXX,XX +XXX,XX @@ int bdrv_get_flags(BlockDriverState *bs);
-  * On failure QED_CLUSTER_L2 or QED_CLUSTER_L1 is returned for missing L2 or L1
+ int bdrv_get_info(BlockDriverState *bs, BlockDriverInfo *bdi);
-  * table offset, respectively. len is number of contiguous unallocated bytes.
+ ImageInfoSpecific *bdrv_get_specific_info(BlockDriverState *bs);
  void bdrv_round_to_clusters(BlockDriverState *bs,
 -                            int64_t offset, unsigned int bytes,
 +                            int64_t offset, int64_t bytes,
                              int64_t *cluster_offset,
 -                            unsigned int *cluster_bytes);
 +                            int64_t *cluster_bytes);
  const char *bdrv_get_encrypted_filename(BlockDriverState *bs);
  void bdrv_get_backing_filename(BlockDriverState *bs,
 diff --git a/block/io.c b/block/io.c
 index XXXXXXX..XXXXXXX 100644
 --- a/block/io.c
 +++ b/block/io.c
@@ -XXX,XX +XXX,XX @@ static void mark_request_serialising(BdrvTrackedRequest *req, uint64_t align)
   * Round a region to cluster boundaries
   */
--int qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
+ void bdrv_round_to_clusters(BlockDriverState *bs,
--                     size_t *len, uint64_t *img_offset)
+-                            int64_t offset, unsigned int bytes,
-+int coroutine_fn qed_find_cluster(BDRVQEDState *s, QEDRequest *request,
++                            int64_t offset, int64_t bytes,
-+                                  uint64_t pos, size_t *len,
+                             int64_t *cluster_offset,
-+                                  uint64_t *img_offset)
+-                            unsigned int *cluster_bytes)
 +                            int64_t *cluster_bytes)
  {
-     uint64_t l2_offset;
+     BlockDriverInfo bdi;
-     uint64_t offset = 0;
-diff --git a/block/qed.c b/block/qed.c
+@@ -XXX,XX +XXX,XX @@ static int coroutine_fn bdrv_co_do_copy_on_readv(BdrvChild *child,
      struct iovec iov;
      QEMUIOVector local_qiov;
      int64_t cluster_offset;
 -    unsigned int cluster_bytes;
 +    int64_t cluster_bytes;
      size_t skip_bytes;
      int ret;
      int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
 diff --git a/block/mirror.c b/block/mirror.c
 index XXXXXXX..XXXXXXX 100644
---- a/block/qed.c
+--- a/block/mirror.c
-+++ b/block/qed.c
++++ b/block/mirror.c
-@@ -XXX,XX +XXX,XX @@ int qed_write_header_sync(BDRVQEDState *s)
+@@ -XXX,XX +XXX,XX @@ static int mirror_cow_align(MirrorBlockJob *s, int64_t *offset,
-  * This function only updates known header fields in-place and does not affect
+     bool need_cow;
-  * extra data after the QED header.
+     int ret = 0;
-  */
+     int64_t align_offset = *offset;
--static int qed_write_header(BDRVQEDState *s)
+-    unsigned int align_bytes = *bytes;
-+static int coroutine_fn qed_write_header(BDRVQEDState *s)
++    int64_t align_bytes = *bytes;
- {
+     int max_bytes = s->granularity * s->max_iov;
-     /* We must write full sectors for O_DIRECT but cannot necessarily generate
-      * the data following the header if an unrecognized compat feature is
+-    assert(*bytes < INT_MAX);
-@@ -XXX,XX +XXX,XX @@ static void qed_unplug_allocating_write_reqs(BDRVQEDState *s)
+     need_cow = !test_bit(*offset / s->granularity, s->cow_bitmap);
-     qemu_co_enter_next(&s->allocating_write_reqs);
+     need_cow |= !test_bit((*offset + *bytes - 1) / s->granularity,
- }
+                           s->cow_bitmap);
+@@ -XXX,XX +XXX,XX @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
--static void qed_need_check_timer_entry(void *opaque)
+     while (nb_chunks > 0 && offset < s->bdev_length) {
-+static void coroutine_fn qed_need_check_timer_entry(void *opaque)
+         int64_t ret;
- {
+         int io_sectors;
-     BDRVQEDState *s = opaque;
+-        unsigned int io_bytes;
-     int ret;
++        int64_t io_bytes;
-@@ -XXX,XX +XXX,XX @@ static BDRVQEDState *acb_to_s(QEDAIOCB *acb)
+         int64_t io_bytes_acct;
-  * This function reads qiov->size bytes starting at pos from the backing file.
+         enum MirrorMethod {
-  * If there is no backing file then zeroes are read.
+             MIRROR_METHOD_COPY,
-  */
+@@ -XXX,XX +XXX,XX @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
--static int qed_read_backing_file(BDRVQEDState *s, uint64_t pos,
+             io_bytes = s->granularity;
--                                 QEMUIOVector *qiov,
+         } else if (ret >= 0 && !(ret & BDRV_BLOCK_DATA)) {
--                                 QEMUIOVector **backing_qiov)
+             int64_t target_offset;
-+static int coroutine_fn qed_read_backing_file(BDRVQEDState *s, uint64_t pos,
+-            unsigned int target_bytes;
-+                                              QEMUIOVector *qiov,
++            int64_t target_bytes;
-+                                              QEMUIOVector **backing_qiov)
+             bdrv_round_to_clusters(blk_bs(s->target), offset, io_bytes,
- {
+                                    &target_offset, &target_bytes);
-     uint64_t backing_length = 0;
+             if (target_offset == offset &&
-     size_t size;
+diff --git a/block/trace-events b/block/trace-events
@@ -XXX,XX +XXX,XX @@ static int qed_read_backing_file(BDRVQEDState *s, uint64_t pos,
   * @len:        Number of bytes
   * @offset:     Byte offset in image file
   */
 -static int qed_copy_from_backing_file(BDRVQEDState *s, uint64_t pos,
 -                                      uint64_t len, uint64_t offset)
 +static int coroutine_fn qed_copy_from_backing_file(BDRVQEDState *s,
 +                                                   uint64_t pos, uint64_t len,
 +                                                   uint64_t offset)
  {
      QEMUIOVector qiov;
      QEMUIOVector *backing_qiov = NULL;
@@ -XXX,XX +XXX,XX @@ out:
   * The cluster offset may be an allocated byte offset in the image file, the
   * zero cluster marker, or the unallocated cluster marker.
   */
 -static void qed_update_l2_table(BDRVQEDState *s, QEDTable *table, int index,
 -                                unsigned int n, uint64_t cluster)
 +static void coroutine_fn qed_update_l2_table(BDRVQEDState *s, QEDTable *table,
 +                                             int index, unsigned int n,
 +                                             uint64_t cluster)
  {
      int i;
      for (i = index; i < index + n; i++) {
@@ -XXX,XX +XXX,XX @@ static void qed_update_l2_table(BDRVQEDState *s, QEDTable *table, int index,
      }
  }
 -static void qed_aio_complete(QEDAIOCB *acb)
 +static void coroutine_fn qed_aio_complete(QEDAIOCB *acb)
  {
      BDRVQEDState *s = acb_to_s(acb);
@@ -XXX,XX +XXX,XX @@ static void qed_aio_complete(QEDAIOCB *acb)
  /**
   * Update L1 table with new L2 table offset and write it out
   */
 -static int qed_aio_write_l1_update(QEDAIOCB *acb)
 +static int coroutine_fn qed_aio_write_l1_update(QEDAIOCB *acb)
  {
      BDRVQEDState *s = acb_to_s(acb);
      CachedL2Table *l2_table = acb->request.l2_table;
@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_l1_update(QEDAIOCB *acb)
  /**
   * Update L2 table with new cluster offsets and write them out
   */
 -static int qed_aio_write_l2_update(QEDAIOCB *acb, uint64_t offset)
 +static int coroutine_fn qed_aio_write_l2_update(QEDAIOCB *acb, uint64_t offset)
  {
      BDRVQEDState *s = acb_to_s(acb);
      bool need_alloc = acb->find_cluster_ret == QED_CLUSTER_L1;
@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_l2_update(QEDAIOCB *acb, uint64_t offset)
  /**
   * Write data to the image file
   */
 -static int qed_aio_write_main(QEDAIOCB *acb)
 +static int coroutine_fn qed_aio_write_main(QEDAIOCB *acb)
  {
      BDRVQEDState *s = acb_to_s(acb);
      uint64_t offset = acb->cur_cluster +
@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_main(QEDAIOCB *acb)
  /**
   * Populate untouched regions of new data cluster
   */
 -static int qed_aio_write_cow(QEDAIOCB *acb)
 +static int coroutine_fn qed_aio_write_cow(QEDAIOCB *acb)
  {
      BDRVQEDState *s = acb_to_s(acb);
      uint64_t start, len, offset;
@@ -XXX,XX +XXX,XX @@ static bool qed_should_set_need_check(BDRVQEDState *s)
   *
   * This path is taken when writing to previously unallocated clusters.
   */
 -static int qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
 +static int coroutine_fn qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
  {
      BDRVQEDState *s = acb_to_s(acb);
      int ret;
@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
   *
   * This path is taken when writing to already allocated clusters.
   */
 -static int qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset, size_t len)
 +static int coroutine_fn qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset,
 +                                              size_t len)
  {
      /* Allocate buffer for zero writes */
      if (acb->flags & QED_AIOCB_ZERO) {
@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset, size_t len)
   * @offset:     Cluster offset in bytes
   * @len:        Length in bytes
   */
 -static int qed_aio_write_data(void *opaque, int ret,
 -                              uint64_t offset, size_t len)
 +static int coroutine_fn qed_aio_write_data(void *opaque, int ret,
 +                                           uint64_t offset, size_t len)
  {
      QEDAIOCB *acb = opaque;
@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_data(void *opaque, int ret,
   * @offset:     Cluster offset in bytes
   * @len:        Length in bytes
   */
 -static int qed_aio_read_data(void *opaque, int ret, uint64_t offset, size_t len)
 +static int coroutine_fn qed_aio_read_data(void *opaque, int ret,
 +                                          uint64_t offset, size_t len)
  {
      QEDAIOCB *acb = opaque;
      BDRVQEDState *s = acb_to_s(acb);
@@ -XXX,XX +XXX,XX @@ static int qed_aio_read_data(void *opaque, int ret, uint64_t offset, size_t len)
  /**
   * Begin next I/O or complete the request
   */
 -static int qed_aio_next_io(QEDAIOCB *acb)
 +static int coroutine_fn qed_aio_next_io(QEDAIOCB *acb)
  {
      BDRVQEDState *s = acb_to_s(acb);
      uint64_t offset;
 diff --git a/block/qed.h b/block/qed.h
 index XXXXXXX..XXXXXXX 100644
---- a/block/qed.h
+--- a/block/trace-events
-+++ b/block/qed.h
++++ b/block/trace-events
-@@ -XXX,XX +XXX,XX @@ int qed_write_l2_table_sync(BDRVQEDState *s, QEDRequest *request,
+@@ -XXX,XX +XXX,XX @@ blk_co_pwritev(void *blk, void *bs, int64_t offset, unsigned int bytes, int flag
- /**
+ bdrv_co_preadv(void *bs, int64_t offset, int64_t nbytes, unsigned int flags) "bs %p offset %"PRId64" nbytes %"PRId64" flags 0x%x"
-  * Cluster functions
+ bdrv_co_pwritev(void *bs, int64_t offset, int64_t nbytes, unsigned int flags) "bs %p offset %"PRId64" nbytes %"PRId64" flags 0x%x"
-  */
+ bdrv_co_pwrite_zeroes(void *bs, int64_t offset, int count, int flags) "bs %p offset %"PRId64" count %d flags 0x%x"
--int qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
+-bdrv_co_do_copy_on_readv(void *bs, int64_t offset, unsigned int bytes, int64_t cluster_offset, unsigned int cluster_bytes) "bs %p offset %"PRId64" bytes %u cluster_offset %"PRId64" cluster_bytes %u"
--                     size_t *len, uint64_t *img_offset);
++bdrv_co_do_copy_on_readv(void *bs, int64_t offset, unsigned int bytes, int64_t cluster_offset, int64_t cluster_bytes) "bs %p offset %"PRId64" bytes %u cluster_offset %"PRId64" cluster_bytes %"PRId64
-+int coroutine_fn qed_find_cluster(BDRVQEDState *s, QEDRequest *request,
-+                                  uint64_t pos, size_t *len,
+ # block/stream.c
-+                                  uint64_t *img_offset);
+ stream_one_iteration(void *s, int64_t offset, uint64_t bytes, int is_allocated) "s %p offset %" PRId64 " bytes %" PRIu64 " is_allocated %d"
  /**
   * Consistency check
 --
-.8.3.1
+.13.6

-[Qemu-devel] [PULL 32/61] qed: Remove callback from qed_copy_from_backing_file()
+[Qemu-devel] [PULL 06/35] qcow2: Switch is_zero_sectors() to byte-based
-With this change, qed_aio_write_prefill() and qed_aio_write_postfill()
+From: Eric Blake <eblake@redhat.com>
 collapse into a single function. This is reflected by a rename of the
 combined function to qed_aio_write_cow().
+We are gradually converting to byte-based interfaces, as they are
+easier to reason about than sector-based.  Convert another internal
+function (no semantic change), and rename it to is_zero() in the
+process.
+Signed-off-by: Eric Blake <eblake@redhat.com>
+Reviewed-by: Fam Zheng <famz@redhat.com>
+Reviewed-by: John Snow <jsnow@redhat.com>
 Signed-off-by: Kevin Wolf <kwolf@redhat.com>
-Reviewed-by: Eric Blake <eblake@redhat.com>
-Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
 ---
- block/qed.c | 57 +++++++++++++++++++++++----------------------------------
+ block/qcow2.c | 33 +++++++++++++++++++--------------
-file changed, 23 insertions(+), 34 deletions(-)
+file changed, 19 insertions(+), 14 deletions(-)
-diff --git a/block/qed.c b/block/qed.c
+diff --git a/block/qcow2.c b/block/qcow2.c
 index XXXXXXX..XXXXXXX 100644
---- a/block/qed.c
+--- a/block/qcow2.c
-+++ b/block/qed.c
++++ b/block/qcow2.c
-@@ -XXX,XX +XXX,XX @@ static int qed_read_backing_file(BDRVQEDState *s, uint64_t pos,
+@@ -XXX,XX +XXX,XX @@ finish:
-  * @pos:        Byte position in device
+ }
-  * @len:        Number of bytes
-  * @offset:     Byte offset in image file
-- * @cb:         Completion function
+-static bool is_zero_sectors(BlockDriverState *bs, int64_t start,
-- * @opaque:     User data for completion function
+-                            uint32_t count)
-  */
++static bool is_zero(BlockDriverState *bs, int64_t offset, int64_t bytes)
 -static void qed_copy_from_backing_file(BDRVQEDState *s, uint64_t pos,
 -                                       uint64_t len, uint64_t offset,
 -                                       BlockCompletionFunc *cb,
 -                                       void *opaque)
 +static int qed_copy_from_backing_file(BDRVQEDState *s, uint64_t pos,
 +                                      uint64_t len, uint64_t offset)
  {
-     QEMUIOVector qiov;
+     int nr;
-     QEMUIOVector *backing_qiov = NULL;
+     int64_t res;
-@@ -XXX,XX +XXX,XX @@ static void qed_copy_from_backing_file(BDRVQEDState *s, uint64_t pos,
++    int64_t start;
-     /* Skip copy entirely if there is no work to do */
+-    if (start + count > bs->total_sectors) {
-     if (len == 0) {
+-        count = bs->total_sectors - start;
--        cb(opaque, 0);
++    /* TODO: Widening to sector boundaries should only be needed as
--        return;
++     * long as we can't query finer granularity. */
-+        return 0;
++    start = QEMU_ALIGN_DOWN(offset, BDRV_SECTOR_SIZE);
 +    bytes = QEMU_ALIGN_UP(offset + bytes, BDRV_SECTOR_SIZE) - start;
 +
 +    /* Clamp to image length, before checking status of underlying sectors */
 +    if (start + bytes > bs->total_sectors * BDRV_SECTOR_SIZE) {
 +        bytes = bs->total_sectors * BDRV_SECTOR_SIZE - start;
      }
-     iov = (struct iovec) {
+-    if (!count) {
-@@ -XXX,XX +XXX,XX @@ static void qed_copy_from_backing_file(BDRVQEDState *s, uint64_t pos,
++    if (!bytes) {
-     ret = 0;
+         return true;
- out:
+     }
-     qemu_vfree(iov.iov_base);
+-    res = bdrv_get_block_status_above(bs, NULL, start, count, &nr, NULL);
--    cb(opaque, ret);
+-    return res >= 0 && (res & BDRV_BLOCK_ZERO) && nr == count;
-+    return ret;
++    res = bdrv_get_block_status_above(bs, NULL, start >> BDRV_SECTOR_BITS,
 +                                      bytes >> BDRV_SECTOR_BITS, &nr, NULL);
 +    return res >= 0 && (res & BDRV_BLOCK_ZERO) &&
 +        nr * BDRV_SECTOR_SIZE == bytes;
  }
- /**
+ static coroutine_fn int qcow2_co_pwrite_zeroes(BlockDriverState *bs,
-@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_main(void *opaque, int ret)
+@@ -XXX,XX +XXX,XX @@ static coroutine_fn int qcow2_co_pwrite_zeroes(BlockDriverState *bs,
  }
  /**
 - * Populate back untouched region of new data cluster
 + * Populate untouched regions of new data cluster
   */
 -static void qed_aio_write_postfill(void *opaque, int ret)
 +static void qed_aio_write_cow(void *opaque, int ret)
  {
      QEDAIOCB *acb = opaque;
      BDRVQEDState *s = acb_to_s(acb);
 -    uint64_t start = acb->cur_pos + acb->cur_qiov.size;
 -    uint64_t len =
 -        qed_start_of_cluster(s, start + s->header.cluster_size - 1) - start;
 -    uint64_t offset = acb->cur_cluster +
 -                      qed_offset_into_cluster(s, acb->cur_pos) +
 -                      acb->cur_qiov.size;
 +    uint64_t start, len, offset;
 +
 +    /* Populate front untouched region of new data cluster */
 +    start = qed_start_of_cluster(s, acb->cur_pos);
 +    len = qed_offset_into_cluster(s, acb->cur_pos);
 +    trace_qed_aio_write_prefill(s, acb, start, len, acb->cur_cluster);
 +    ret = qed_copy_from_backing_file(s, start, len, acb->cur_cluster);
      if (ret) {
          qed_aio_complete(acb, ret);
          return;
      }
--    trace_qed_aio_write_postfill(s, acb, start, len, offset);
+     if (head || tail) {
--    qed_copy_from_backing_file(s, start, len, offset,
+-        int64_t cl_start = (offset - head) >> BDRV_SECTOR_BITS;
--                                qed_aio_write_main, acb);
+         uint64_t off;
--}
+         unsigned int nr;
-+    /* Populate back untouched region of new data cluster */
-+    start = acb->cur_pos + acb->cur_qiov.size;
+         assert(head + bytes <= s->cluster_size);
-+    len = qed_start_of_cluster(s, start + s->header.cluster_size - 1) - start;
-+    offset = acb->cur_cluster +
+         /* check whether remainder of cluster already reads as zero */
-+             qed_offset_into_cluster(s, acb->cur_pos) +
+-        if (!(is_zero_sectors(bs, cl_start,
-+             acb->cur_qiov.size;
+-                              DIV_ROUND_UP(head, BDRV_SECTOR_SIZE)) &&
+-              is_zero_sectors(bs, (offset + bytes) >> BDRV_SECTOR_BITS,
--/**
+-                              DIV_ROUND_UP(-tail & (s->cluster_size - 1),
-- * Populate front untouched region of new data cluster
+-                                           BDRV_SECTOR_SIZE)))) {
-- */
++        if (!(is_zero(bs, offset - head, head) &&
--static void qed_aio_write_prefill(void *opaque, int ret)
++              is_zero(bs, offset + bytes,
--{
++                      tail ? s->cluster_size - tail : 0))) {
--    QEDAIOCB *acb = opaque;
+             return -ENOTSUP;
--    BDRVQEDState *s = acb_to_s(acb);
+         }
--    uint64_t start = qed_start_of_cluster(s, acb->cur_pos);
--    uint64_t len = qed_offset_into_cluster(s, acb->cur_pos);
+         qemu_co_mutex_lock(&s->lock);
-+    trace_qed_aio_write_postfill(s, acb, start, len, offset);
+         /* We can have new write after previous check */
-+    ret = qed_copy_from_backing_file(s, start, len, offset);
+-        offset = cl_start << BDRV_SECTOR_BITS;
++        offset = QEMU_ALIGN_DOWN(offset, s->cluster_size);
--    trace_qed_aio_write_prefill(s, acb, start, len, acb->cur_cluster);
+         bytes = s->cluster_size;
--    qed_copy_from_backing_file(s, start, len, acb->cur_cluster,
+         nr = s->cluster_size;
--                                qed_aio_write_postfill, acb);
+         ret = qcow2_get_cluster_offset(bs, offset, &nr, &off);
 +    qed_aio_write_main(acb, ret);
  }
  /**
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
          cb = qed_aio_write_zero_cluster;
      } else {
 -        cb = qed_aio_write_prefill;
 +        cb = qed_aio_write_cow;
          acb->cur_cluster = qed_alloc_clusters(s, acb->cur_nclusters);
      }
 --
-.8.3.1
+.13.6
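
The widening and clamping that is_zero() performs before it queries block
status can be looked at in isolation. The following is only a minimal sketch,
not the qcow2 code: SECTOR_BITS, align_down(), align_up(), ByteRange and
widen_and_clamp() are hypothetical stand-ins for BDRV_SECTOR_BITS,
QEMU_ALIGN_DOWN(), QEMU_ALIGN_UP() and the open-coded logic in the patch, and
it assumes 512-byte sectors and an offset that lies inside the image.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values: 512-byte sectors, as with BDRV_SECTOR_BITS == 9. */
#define SECTOR_BITS 9
#define SECTOR_SIZE (1LL << SECTOR_BITS)

/* Hypothetical stand-ins for QEMU_ALIGN_DOWN() and QEMU_ALIGN_UP(). */
int64_t align_down(int64_t n, int64_t m)
{
    return n / m * m;
}

int64_t align_up(int64_t n, int64_t m)
{
    return align_down(n + m - 1, m);
}

/* Illustrative result type: a sector-aligned byte range. */
typedef struct {
    int64_t start;
    int64_t bytes;
} ByteRange;

/*
 * Widen [offset, offset + bytes) to sector boundaries, then clamp the
 * result to the image length (given here as a sector count).
 */
ByteRange widen_and_clamp(int64_t offset, int64_t bytes, int64_t total_sectors)
{
    ByteRange r;
    int64_t image_bytes = total_sectors * SECTOR_SIZE;

    assert(offset >= 0 && bytes >= 0 && total_sectors >= 0);

    r.start = align_down(offset, SECTOR_SIZE);
    r.bytes = align_up(offset + bytes, SECTOR_SIZE) - r.start;

    /* Assumption of this sketch: the request starts inside the image. */
    assert(r.start <= image_bytes);

    if (r.start + r.bytes > image_bytes) {
        r.bytes = image_bytes - r.start;
    }
    return r;
}
```

With a four-sector image, a 100-byte request at offset 700 widens to the whole
second sector, and a request that runs past the end of the image is cut off at
the image length.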

-[Qemu-devel] [PULL 05/61] block: use BDRV_POLL_WHILE() in bdrv_rw_vmstate()
+[Qemu-devel] [PULL 07/35] block: Switch bdrv_make_zero() to byte-based
-From: Stefan Hajnoczi <stefanha@redhat.com>
+From: Eric Blake <eblake@redhat.com>
-Calling aio_poll() directly may have been fine previously, but this is
+We are gradually converting to byte-based interfaces, as they are
-the future, man!  The difference between an aio_poll() loop and
+easier to reason about than sector-based.  Change the internal
-BDRV_POLL_WHILE() is that BDRV_POLL_WHILE() releases the AioContext
+loop iteration of zeroing a device to track by bytes instead of
-around aio_poll().
+sectors (although we are still guaranteed that we iterate by steps
 that are sector-aligned).
-This allows the IOThread to run fd handlers or BHs to complete the
+Signed-off-by: Eric Blake <eblake@redhat.com>
-request.  Failure to release the AioContext causes deadlocks.
+Reviewed-by: Fam Zheng <famz@redhat.com>
+Reviewed-by: John Snow <jsnow@redhat.com>
 Using BDRV_POLL_WHILE() partially fixes a 'savevm' hang with -object
 iothread.
 Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
 Reviewed-by: Eric Blake <eblake@redhat.com>
 Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
 Signed-off-by: Kevin Wolf <kwolf@redhat.com>
 ---
- block/io.c | 4 +---
+ block/io.c | 32 ++++++++++++++++----------------
-file changed, 1 insertion(+), 3 deletions(-)
+file changed, 16 insertions(+), 16 deletions(-)
 diff --git a/block/io.c b/block/io.c
 index XXXXXXX..XXXXXXX 100644
 --- a/block/io.c
 +++ b/block/io.c
-@@ -XXX,XX +XXX,XX @@ bdrv_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
+@@ -XXX,XX +XXX,XX @@ int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
-         Coroutine *co = qemu_coroutine_create(bdrv_co_rw_vmstate_entry, &data);
+  */
+ int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags flags)
-         bdrv_coroutine_enter(bs, co);
+ {
--        while (data.ret == -EINPROGRESS) {
+-    int64_t target_sectors, ret, nb_sectors, sector_num = 0;
--            aio_poll(bdrv_get_aio_context(bs), true);
++    int64_t target_size, ret, bytes, offset = 0;
--        }
+     BlockDriverState *bs = child->bs;
-+        BDRV_POLL_WHILE(bs, data.ret == -EINPROGRESS);
+-    int n;
-         return data.ret;
++    int n; /* sectors */
 -    target_sectors = bdrv_nb_sectors(bs);
 -    if (target_sectors < 0) {
 -        return target_sectors;
 +    target_size = bdrv_getlength(bs);
 +    if (target_size < 0) {
 +        return target_size;
      }
      for (;;) {
 -        nb_sectors = MIN(target_sectors - sector_num, BDRV_REQUEST_MAX_SECTORS);
 -        if (nb_sectors <= 0) {
 +        bytes = MIN(target_size - offset, BDRV_REQUEST_MAX_BYTES);
 +        if (bytes <= 0) {
              return 0;
          }
 -        ret = bdrv_get_block_status(bs, sector_num, nb_sectors, &n, NULL);
 +        ret = bdrv_get_block_status(bs, offset >> BDRV_SECTOR_BITS,
 +                                    bytes >> BDRV_SECTOR_BITS, &n, NULL);
          if (ret < 0) {
 -            error_report("error getting block status at sector %" PRId64 ": %s",
 -                         sector_num, strerror(-ret));
 +            error_report("error getting block status at offset %" PRId64 ": %s",
 +                         offset, strerror(-ret));
              return ret;
          }
          if (ret & BDRV_BLOCK_ZERO) {
 -            sector_num += n;
 +            offset += n * BDRV_SECTOR_BITS;
              continue;
          }
 -        ret = bdrv_pwrite_zeroes(child, sector_num << BDRV_SECTOR_BITS,
 -                                 n << BDRV_SECTOR_BITS, flags);
 +        ret = bdrv_pwrite_zeroes(child, offset, n * BDRV_SECTOR_SIZE, flags);
          if (ret < 0) {
 -            error_report("error writing zeroes at sector %" PRId64 ": %s",
 -                         sector_num, strerror(-ret));
 +            error_report("error writing zeroes at offset %" PRId64 ": %s",
 +                         offset, strerror(-ret));
              return ret;
          }
 -        sector_num += n;
 +        offset += n * BDRV_SECTOR_SIZE;
      }
  }
 --
-.8.3.1
+.13.6
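
While the status query still reports its result in sectors, every step of a
byte-based loop has to scale that count back to bytes. The sketch below is a
simplified model of such a loop, not the block/io.c code: StatusFn,
demo_status(), REQUEST_MAX_BYTES and count_bytes_to_zero() are hypothetical,
and the image length is assumed to be sector-aligned. Note that the scaling
factor is the sector size (or a shift by the sector bits); the two constants
are not interchangeable.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values: 512-byte sectors and an arbitrary per-request cap. */
#define SECTOR_BITS 9
#define SECTOR_SIZE (1LL << SECTOR_BITS)
#define REQUEST_MAX_BYTES (1LL << 20)

/* Scale a sector count to bytes: multiply by the size, not by the bits. */
int64_t sectors_to_bytes(int64_t n)
{
    return n * SECTOR_SIZE;
}

/*
 * Hypothetical sector-based status callback: stores the length of the run
 * starting at 'sector' in *pnum and returns nonzero if it reads as zero.
 */
typedef int (*StatusFn)(int64_t sector, int64_t nb_sectors, int64_t *pnum);

/* Walk the image by bytes and add up how much needs explicit zeroing. */
int64_t count_bytes_to_zero(int64_t target_size, StatusFn status)
{
    int64_t offset = 0;
    int64_t to_write = 0;

    assert(target_size >= 0 && target_size % SECTOR_SIZE == 0);

    for (;;) {
        int64_t n = 0;
        int64_t bytes = target_size - offset;
        int reads_as_zero;

        if (bytes > REQUEST_MAX_BYTES) {
            bytes = REQUEST_MAX_BYTES;
        }
        if (bytes <= 0) {
            return to_write;
        }

        reads_as_zero = status(offset >> SECTOR_BITS,
                               bytes >> SECTOR_BITS, &n);
        assert(n > 0);

        if (!reads_as_zero) {
            to_write += sectors_to_bytes(n);
        }
        offset += sectors_to_bytes(n);
    }
}

/* Example callback: the first four sectors hold data, the rest are zero. */
int demo_status(int64_t sector, int64_t nb_sectors, int64_t *pnum)
{
    if (sector < 4) {
        int64_t left = 4 - sector;
        *pnum = nb_sectors < left ? nb_sectors : left;
        return 0;
    }
    *pnum = nb_sectors;
    return 1;
}
```

For an eight-sector image and demo_status(), the loop visits two runs and
reports that only the first four sectors would need a zero write.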

-[Qemu-devel] [PULL 16/61] nvme: Add support for Read Data and Write Data in CMBs.
+[Qemu-devel] [PULL 08/35] qemu-img: Switch get_block_status() to byte-based
-From: Stephen Bates <sbates@raithlin.com>
+From: Eric Blake <eblake@redhat.com>
-Add the ability for the NVMe model to support both the RDS and WDS
+We are gradually converting to byte-based interfaces, as they are
-modes in the Controller Memory Buffer.
+easier to reason about than sector-based.  Continue by converting
 an internal function (no semantic change), and simplifying its
 caller accordingly.
-Although not currently supported in the upstreamed Linux kernel a fork
+Signed-off-by: Eric Blake <eblake@redhat.com>
-with support exists [1] and user-space test programs that build on
+Reviewed-by: Fam Zheng <famz@redhat.com>
-this also exist [2].
+Reviewed-by: John Snow <jsnow@redhat.com>
 Useful for testing CMB functionality in preparation for real CMB
 enabled NVMe devices (coming soon).
 [1] https://github.com/sbates130272/linux-p2pmem
 [2] https://github.com/sbates130272/p2pmem-test
 Signed-off-by: Stephen Bates <sbates@raithlin.com>
 Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
 Reviewed-by: Keith Busch <keith.busch@intel.com>
 Signed-off-by: Kevin Wolf <kwolf@redhat.com>
 ---
- hw/block/nvme.c | 83 +++++++++++++++++++++++++++++++++++++++------------------
+ qemu-img.c | 24 +++++++++++-------------
- hw/block/nvme.h |  1 +
+file changed, 11 insertions(+), 13 deletions(-)
 files changed, 58 insertions(+), 26 deletions(-)
-diff --git a/hw/block/nvme.c b/hw/block/nvme.c
+diff --git a/qemu-img.c b/qemu-img.c
 index XXXXXXX..XXXXXXX 100644
---- a/hw/block/nvme.c
+--- a/qemu-img.c
-+++ b/hw/block/nvme.c
++++ b/qemu-img.c
-@@ -XXX,XX +XXX,XX @@
+@@ -XXX,XX +XXX,XX @@ static void dump_map_entry(OutputFormat output_format, MapEntry *e,
   *              cmb_size_mb=<cmb_size_mb[optional]>
   *
   * Note cmb_size_mb denotes size of CMB in MB. CMB is assumed to be at
 - * offset 0 in BAR2 and supports SQS only for now.
 + * offset 0 in BAR2 and supports only WDS, RDS and SQS for now.
   */
  #include "qemu/osdep.h"
@@ -XXX,XX +XXX,XX @@ static void nvme_isr_notify(NvmeCtrl *n, NvmeCQueue *cq)
      }
  }
--static uint16_t nvme_map_prp(QEMUSGList *qsg, uint64_t prp1, uint64_t prp2,
+-static int get_block_status(BlockDriverState *bs, int64_t sector_num,
--    uint32_t len, NvmeCtrl *n)
+-                            int nb_sectors, MapEntry *e)
-+static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1,
++static int get_block_status(BlockDriverState *bs, int64_t offset,
-+                             uint64_t prp2, uint32_t len, NvmeCtrl *n)
++                            int64_t bytes, MapEntry *e)
  {
-     hwaddr trans_len = n->page_size - (prp1 % n->page_size);
+     int64_t ret;
-     trans_len = MIN(len, trans_len);
+     int depth;
-@@ -XXX,XX +XXX,XX @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, uint64_t prp1, uint64_t prp2,
+     BlockDriverState *file;
+     bool has_offset;
-     if (!prp1) {
++    int nb_sectors = bytes >> BDRV_SECTOR_BITS;
-         return NVME_INVALID_FIELD | NVME_DNR;
-+    } else if (n->cmbsz && prp1 >= n->ctrl_mem.addr &&
++    assert(bytes < INT_MAX);
-+               prp1 < n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size)) {
+     /* As an optimization, we could cache the current range of unallocated
-+        qsg->nsg = 0;
+      * clusters in each file of the chain, and avoid querying the same
-+        qemu_iovec_init(iov, num_prps);
+      * range repeatedly.
-+        qemu_iovec_add(iov, (void *)&n->cmbuf[prp1 - n->ctrl_mem.addr], trans_len);
+@@ -XXX,XX +XXX,XX @@ static int get_block_status(BlockDriverState *bs, int64_t sector_num,
-+    } else {
-+        pci_dma_sglist_init(qsg, &n->parent_obj, num_prps);
+     depth = 0;
-+        qemu_sglist_add(qsg, prp1, trans_len);
+     for (;;) {
-     }
+-        ret = bdrv_get_block_status(bs, sector_num, nb_sectors, &nb_sectors,
 -                                    &file);
 +        ret = bdrv_get_block_status(bs, offset >> BDRV_SECTOR_BITS, nb_sectors,
 +                                    &nb_sectors, &file);
          if (ret < 0) {
              return ret;
          }
@@ -XXX,XX +XXX,XX @@ static int get_block_status(BlockDriverState *bs, int64_t sector_num,
      has_offset = !!(ret & BDRV_BLOCK_OFFSET_VALID);
      *e = (MapEntry) {
 -        .start = sector_num * BDRV_SECTOR_SIZE,
 +        .start = offset,
          .length = nb_sectors * BDRV_SECTOR_SIZE,
          .data = !!(ret & BDRV_BLOCK_DATA),
          .zero = !!(ret & BDRV_BLOCK_ZERO),
@@ -XXX,XX +XXX,XX @@ static int img_map(int argc, char **argv)
      length = blk_getlength(blk);
      while (curr.start + curr.length < length) {
 -        int64_t nsectors_left;
 -        int64_t sector_num;
 -        int n;
 -
--    pci_dma_sglist_init(qsg, &n->parent_obj, num_prps);
+-        sector_num = (curr.start + curr.length) >> BDRV_SECTOR_BITS;
--    qemu_sglist_add(qsg, prp1, trans_len);
++        int64_t offset = curr.start + curr.length;
-     len -= trans_len;
++        int64_t n;
-     if (len) {
-         if (!prp2) {
+         /* Probe up to 1 GiB at a time.  */
-@@ -XXX,XX +XXX,XX @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, uint64_t prp1, uint64_t prp2,
+-        nsectors_left = DIV_ROUND_UP(length, BDRV_SECTOR_SIZE) - sector_num;
+-        n = MIN(1 << (30 - BDRV_SECTOR_BITS), nsectors_left);
-             nents = (len + n->page_size - 1) >> n->page_bits;
+-        ret = get_block_status(bs, sector_num, n, &next);
-             prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
++        n = QEMU_ALIGN_DOWN(MIN(1 << 30, length - offset), BDRV_SECTOR_SIZE);
--            pci_dma_read(&n->parent_obj, prp2, (void *)prp_list, prp_trans);
++        ret = get_block_status(bs, offset, n, &next);
-+            nvme_addr_read(n, prp2, (void *)prp_list, prp_trans);
-             while (len != 0) {
+         if (ret < 0) {
-                 uint64_t prp_ent = le64_to_cpu(prp_list[i]);
+             error_report("Could not read file metadata: %s", strerror(-ret));
@@ -XXX,XX +XXX,XX @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, uint64_t prp1, uint64_t prp2,
                      i = 0;
                      nents = (len + n->page_size - 1) >> n->page_bits;
                      prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
 -                    pci_dma_read(&n->parent_obj, prp_ent, (void *)prp_list,
 +                    nvme_addr_read(n, prp_ent, (void *)prp_list,
                          prp_trans);
                      prp_ent = le64_to_cpu(prp_list[i]);
                  }
@@ -XXX,XX +XXX,XX @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, uint64_t prp1, uint64_t prp2,
                  }
                  trans_len = MIN(len, n->page_size);
 -                qemu_sglist_add(qsg, prp_ent, trans_len);
 +                if (qsg->nsg){
 +                    qemu_sglist_add(qsg, prp_ent, trans_len);
 +                } else {
 +                    qemu_iovec_add(iov, (void *)&n->cmbuf[prp_ent - n->ctrl_mem.addr], trans_len);
 +                }
                  len -= trans_len;
                  i++;
              }
@@ -XXX,XX +XXX,XX @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, uint64_t prp1, uint64_t prp2,
              if (prp2 & (n->page_size - 1)) {
                  goto unmap;
              }
 -            qemu_sglist_add(qsg, prp2, len);
 +            if (qsg->nsg) {
 +                qemu_sglist_add(qsg, prp2, len);
 +            } else {
 +                qemu_iovec_add(iov, (void *)&n->cmbuf[prp2 - n->ctrl_mem.addr], trans_len);
 +            }
          }
      }
      return NVME_SUCCESS;
@@ -XXX,XX +XXX,XX @@ static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
      uint64_t prp1, uint64_t prp2)
  {
      QEMUSGList qsg;
 +    QEMUIOVector iov;
 +    uint16_t status = NVME_SUCCESS;
 -    if (nvme_map_prp(&qsg, prp1, prp2, len, n)) {
 +    if (nvme_map_prp(&qsg, &iov, prp1, prp2, len, n)) {
          return NVME_INVALID_FIELD | NVME_DNR;
      }
 -    if (dma_buf_read(ptr, len, &qsg)) {
 +    if (qsg.nsg > 0) {
 +        if (dma_buf_read(ptr, len, &qsg)) {
 +            status = NVME_INVALID_FIELD | NVME_DNR;
 +        }
          qemu_sglist_destroy(&qsg);
 -        return NVME_INVALID_FIELD | NVME_DNR;
 +    } else {
 +        if (qemu_iovec_to_buf(&iov, 0, ptr, len) != len) {
 +            status = NVME_INVALID_FIELD | NVME_DNR;
 +        }
 +        qemu_iovec_destroy(&iov);
      }
 -    qemu_sglist_destroy(&qsg);
 -    return NVME_SUCCESS;
 +    return status;
  }
  static void nvme_post_cqes(void *opaque)
@@ -XXX,XX +XXX,XX @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
          return NVME_LBA_RANGE | NVME_DNR;
      }
 -    if (nvme_map_prp(&req->qsg, prp1, prp2, data_size, n)) {
 +    if (nvme_map_prp(&req->qsg, &req->iov, prp1, prp2, data_size, n)) {
          block_acct_invalid(blk_get_stats(n->conf.blk), acct);
          return NVME_INVALID_FIELD | NVME_DNR;
      }
 -    assert((nlb << data_shift) == req->qsg.size);
 -
 -    req->has_sg = true;
      dma_acct_start(n->conf.blk, &req->acct, &req->qsg, acct);
 -    req->aiocb = is_write ?
 -        dma_blk_write(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
 -                      nvme_rw_cb, req) :
 -        dma_blk_read(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
 -                     nvme_rw_cb, req);
 +    if (req->qsg.nsg > 0) {
 +        req->has_sg = true;
 +        req->aiocb = is_write ?
 +            dma_blk_write(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
 +                          nvme_rw_cb, req) :
 +            dma_blk_read(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
 +                         nvme_rw_cb, req);
 +    } else {
 +        req->has_sg = false;
 +        req->aiocb = is_write ?
 +            blk_aio_pwritev(n->conf.blk, data_offset, &req->iov, 0, nvme_rw_cb,
 +                            req) :
 +            blk_aio_preadv(n->conf.blk, data_offset, &req->iov, 0, nvme_rw_cb,
 +                           req);
 +    }
      return NVME_NO_COMPLETE;
  }
@@ -XXX,XX +XXX,XX @@ static int nvme_init(PCIDevice *pci_dev)
          NVME_CMBSZ_SET_SQS(n->bar.cmbsz, 1);
          NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 0);
          NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 0);
 -        NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 0);
 -        NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 0);
 +        NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 1);
 +        NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 1);
          NVME_CMBSZ_SET_SZU(n->bar.cmbsz, 2); /* MBs */
          NVME_CMBSZ_SET_SZ(n->bar.cmbsz, n->cmb_size_mb);
 +        n->cmbloc = n->bar.cmbloc;
 +        n->cmbsz = n->bar.cmbsz;
 +
          n->cmbuf = g_malloc0(NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
          memory_region_init_io(&n->ctrl_mem, OBJECT(n), &nvme_cmb_ops, n,
                                "nvme-cmb", NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
 diff --git a/hw/block/nvme.h b/hw/block/nvme.h
 index XXXXXXX..XXXXXXX 100644
 --- a/hw/block/nvme.h
 +++ b/hw/block/nvme.h
@@ -XXX,XX +XXX,XX @@ typedef struct NvmeRequest {
      NvmeCqe                 cqe;
      BlockAcctCookie         acct;
      QEMUSGList              qsg;
 +    QEMUIOVector            iov;
      QTAILQ_ENTRY(NvmeRequest)entry;
  } NvmeRequest;
 --
-.8.3.1
+.13.6
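
The chunk size that img_map() passes to get_block_status() after this change
is the remaining length, capped at 1 GiB and rounded down to a sector
boundary. A minimal sketch of that arithmetic follows; PROBE_MAX and
next_probe_bytes() are hypothetical names, and 512-byte sectors are assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values: 512-byte sectors and the 1 GiB probe limit. */
#define SECTOR_SIZE 512LL
#define PROBE_MAX (1LL << 30)

/*
 * Number of bytes to probe next: the remaining length, capped at
 * PROBE_MAX and rounded down to a whole number of sectors.
 */
int64_t next_probe_bytes(int64_t length, int64_t offset)
{
    int64_t remaining;
    int64_t n;

    assert(offset >= 0 && offset <= length);

    remaining = length - offset;
    n = remaining < PROBE_MAX ? remaining : PROBE_MAX;
    return n / SECTOR_SIZE * SECTOR_SIZE;
}
```

In this sketch a trailing partial sector rounds down to zero bytes. A caller
would have to treat that case separately, which is outside what the sketch
covers.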

-[Qemu-devel] [PULL 18/61] qcow2: Use unsigned int for both members of Qcow2COWRegion
+[Qemu-devel] [PULL 09/35] block: Convert bdrv_get_block_status() to bytes
-From: Alberto Garcia <berto@igalia.com>
+From: Eric Blake <eblake@redhat.com>
-Qcow2COWRegion has two attributes:
+We are gradually moving away from sector-based interfaces, towards
+byte-based.  In the common case, allocation is unlikely to ever use
-- The offset of the COW region from the start of the first cluster
+values that are not naturally sector-aligned, but it is possible
-  touched by the I/O request. Since it's always going to be positive
+that byte-based values will let us be more precise about allocation
-  and the maximum request size is at most INT_MAX, we can use a
+at the end of an unaligned file that can do byte-based access.
-  regular unsigned int to store this offset.
+Changing the name of the function from bdrv_get_block_status() to
-- The size of the COW region in bytes. This is guaranteed to be >= 0,
+bdrv_block_status() ensures that the compiler enforces that all
-  so we should use an unsigned type instead.
+callers are updated.  For now, the io.c layer still assert()s that
+all callers are sector-aligned, but that can be relaxed when a later
-In x86_64 this reduces the size of Qcow2COWRegion from 16 to 8 bytes.
+patch implements byte-based block status in the drivers.
-It will also help keep some assertions simpler now that we know that
-there are no negative numbers.
+There was an inherent limitation in returning the offset via the
+return value: we only have room for BDRV_BLOCK_OFFSET_MASK bits, which
-The prototype of do_perform_cow() is also updated to reflect these
+means an offset can only be mapped for sector-aligned queries (or,
-changes.
+if we declare that non-aligned input is at the same relative position
+modulo 512 of the answer), so the new interface also changes things to
-Signed-off-by: Alberto Garcia <berto@igalia.com>
+return the offset via output through a parameter by reference rather
-Reviewed-by: Eric Blake <eblake@redhat.com>
+than mashed into the return value.  We'll have some glue code that
-Reviewed-by: Kevin Wolf <kwolf@redhat.com>
+munges between the two styles until we finish converting all uses.
 For the most part this patch is just the addition of scaling at the
 callers followed by inverse scaling at bdrv_block_status(), coupled
 with the tweak in calling convention.  But some code, particularly
 bdrv_is_allocated(), gets a lot simpler because it no longer has to
 mess with sectors.
 For ease of review, bdrv_get_block_status_above() will be tackled
 separately.
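
 The limitation of the old calling convention can be illustrated with a
 small sketch. It is not the block layer code: OFFSET_MASK is written out
 under the assumption that it covers bits 9-62 as described above,
 FLAG_OFFSET_VALID is an illustrative flag value, and old_style_status()
 and new_style_status() are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: bits 9-62 of the return value carry the offset. */
#define OFFSET_MASK 0x7FFFFFFFFFFFFE00LL
/* Illustrative flag value that lives in the low bits. */
#define FLAG_OFFSET_VALID 0x04

/* Old style: the offset shares the return value with the flag bits. */
int64_t old_style_status(int64_t host_offset)
{
    return (host_offset & OFFSET_MASK) | FLAG_OFFSET_VALID;
}

/* New style: flags are returned, the full offset goes out by reference. */
int new_style_status(int64_t host_offset, int64_t *map)
{
    *map = host_offset;
    return FLAG_OFFSET_VALID;
}
```

 An offset such as 0x12345 comes back as 0x12200 from the old style,
 because its low nine bits are reserved for flags, while the new style
 returns it unchanged through the map parameter.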
 Signed-off-by: Eric Blake <eblake@redhat.com>
 Signed-off-by: Kevin Wolf <kwolf@redhat.com>
 ---
- block/qcow2-cluster.c | 4 ++--
+ include/block/block.h | 17 +++++++++--------
- block/qcow2.h         | 4 ++--
+ block/io.c            | 47 ++++++++++++++++++++++++++++++++++-------------
-files changed, 4 insertions(+), 4 deletions(-)
+ block/qcow2-cluster.c |  2 +-
+ qemu-img.c            | 25 ++++++++++++++-----------
 files changed, 58 insertions(+), 33 deletions(-)
 diff --git a/include/block/block.h b/include/block/block.h
 index XXXXXXX..XXXXXXX 100644
 --- a/include/block/block.h
 +++ b/include/block/block.h
@@ -XXX,XX +XXX,XX @@ typedef struct HDGeometry {
  #define BDRV_REQUEST_MAX_BYTES (BDRV_REQUEST_MAX_SECTORS << BDRV_SECTOR_BITS)
  /*
 - * Allocation status flags for bdrv_get_block_status() and friends.
 + * Allocation status flags for bdrv_block_status() and friends.
   *
   * Public flags:
   * BDRV_BLOCK_DATA: allocation for data at offset is tied to this layer
@@ -XXX,XX +XXX,XX @@ typedef struct HDGeometry {
   *                 that the block layer recompute the answer from the returned
   *                 BDS; must be accompanied by just BDRV_BLOCK_OFFSET_VALID.
   *
 - * If BDRV_BLOCK_OFFSET_VALID is set, bits 9-62 (BDRV_BLOCK_OFFSET_MASK)
 - * represent the offset in the returned BDS that is allocated for the
 - * corresponding raw data; however, whether that offset actually contains
 - * data also depends on BDRV_BLOCK_DATA and BDRV_BLOCK_ZERO, as follows:
 + * If BDRV_BLOCK_OFFSET_VALID is set, bits 9-62 (BDRV_BLOCK_OFFSET_MASK) of
 + * the return value (old interface) or the entire map parameter (new
 + * interface) represent the offset in the returned BDS that is allocated for
 + * the corresponding raw data.  However, whether that offset actually
 + * contains data also depends on BDRV_BLOCK_DATA, as follows:
   *
   * DATA ZERO OFFSET_VALID
   *  t    t        t       sectors read as zero, returned file is zero at offset
@@ -XXX,XX +XXX,XX @@ int bdrv_has_zero_init_1(BlockDriverState *bs);
  int bdrv_has_zero_init(BlockDriverState *bs);
  bool bdrv_unallocated_blocks_are_zero(BlockDriverState *bs);
  bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs);
 -int64_t bdrv_get_block_status(BlockDriverState *bs, int64_t sector_num,
 -                              int nb_sectors, int *pnum,
 -                              BlockDriverState **file);
 +int bdrv_block_status(BlockDriverState *bs, int64_t offset,
 +                      int64_t bytes, int64_t *pnum, int64_t *map,
 +                      BlockDriverState **file);
  int64_t bdrv_get_block_status_above(BlockDriverState *bs,
                                      BlockDriverState *base,
                                      int64_t sector_num,
 diff --git a/block/io.c b/block/io.c
 index XXXXXXX..XXXXXXX 100644
 --- a/block/io.c
 +++ b/block/io.c
@@ -XXX,XX +XXX,XX @@ int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
   */
  int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags flags)
  {
 -    int64_t target_size, ret, bytes, offset = 0;
 +    int ret;
 +    int64_t target_size, bytes, offset = 0;
      BlockDriverState *bs = child->bs;
 -    int n; /* sectors */
      target_size = bdrv_getlength(bs);
      if (target_size < 0) {
@@ -XXX,XX +XXX,XX @@ int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags flags)
          if (bytes <= 0) {
              return 0;
          }
 -        ret = bdrv_get_block_status(bs, offset >> BDRV_SECTOR_BITS,
 -                                    bytes >> BDRV_SECTOR_BITS, &n, NULL);
 +        ret = bdrv_block_status(bs, offset, bytes, &bytes, NULL, NULL);
          if (ret < 0) {
              error_report("error getting block status at offset %" PRId64 ": %s",
                           offset, strerror(-ret));
              return ret;
          }
          if (ret & BDRV_BLOCK_ZERO) {
 -            offset += n * BDRV_SECTOR_BITS;
 +            offset += bytes;
              continue;
          }
 -        ret = bdrv_pwrite_zeroes(child, offset, n * BDRV_SECTOR_SIZE, flags);
 +        ret = bdrv_pwrite_zeroes(child, offset, bytes, flags);
          if (ret < 0) {
              error_report("error writing zeroes at offset %" PRId64 ": %s",
                           offset, strerror(-ret));
              return ret;
          }
 -        offset += n * BDRV_SECTOR_SIZE;
 +        offset += bytes;
      }
  }
@@ -XXX,XX +XXX,XX @@ int64_t bdrv_get_block_status_above(BlockDriverState *bs,
                                            nb_sectors, pnum, file);
  }
 -int64_t bdrv_get_block_status(BlockDriverState *bs,
 -                              int64_t sector_num,
 -                              int nb_sectors, int *pnum,
 -                              BlockDriverState **file)
 +int bdrv_block_status(BlockDriverState *bs, int64_t offset, int64_t bytes,
 +                      int64_t *pnum, int64_t *map, BlockDriverState **file)
  {
 -    return bdrv_get_block_status_above(bs, backing_bs(bs),
 -                                       sector_num, nb_sectors, pnum, file);
 +    int64_t ret;
 +    int n;
 +
 +    assert(QEMU_IS_ALIGNED(offset | bytes, BDRV_SECTOR_SIZE));
 +    assert(pnum);
 +    /*
 +     * The contract allows us to return pnum smaller than bytes, even
 +     * if the next query would see the same status; we truncate the
 +     * request to avoid overflowing the driver's 32-bit interface.
 +     */
 +    bytes = MIN(bytes, BDRV_REQUEST_MAX_BYTES);
 +    ret = bdrv_get_block_status_above(bs, backing_bs(bs),
 +                                      offset >> BDRV_SECTOR_BITS,
 +                                      bytes >> BDRV_SECTOR_BITS, &n, file);
 +    if (ret < 0) {
 +        assert(INT_MIN <= ret);
 +        *pnum = 0;
 +        return ret;
 +    }
 +    *pnum = n * BDRV_SECTOR_SIZE;
 +    if (map) {
 +        *map = ret & BDRV_BLOCK_OFFSET_MASK;
 +    } else {
 +        ret &= ~BDRV_BLOCK_OFFSET_VALID;
 +    }
 +    return ret & ~BDRV_BLOCK_OFFSET_MASK;
  }
  int coroutine_fn bdrv_is_allocated(BlockDriverState *bs, int64_t offset,
 diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
 index XXXXXXX..XXXXXXX 100644
 --- a/block/qcow2-cluster.c
 +++ b/block/qcow2-cluster.c
-@@ -XXX,XX +XXX,XX @@ int qcow2_encrypt_sectors(BDRVQcow2State *s, int64_t sector_num,
+@@ -XXX,XX +XXX,XX @@ static int discard_single_l2(BlockDriverState *bs, uint64_t offset,
- static int coroutine_fn do_perform_cow(BlockDriverState *bs,
+          * cluster is already marked as zero, or if it's unallocated and we
-                                        uint64_t src_cluster_offset,
+          * don't have a backing file.
-                                        uint64_t cluster_offset,
+          *
--                                       int offset_in_cluster,
+-         * TODO We might want to use bdrv_get_block_status(bs) here, but we're
--                                       int bytes)
++         * TODO We might want to use bdrv_block_status(bs) here, but we're
-+                                       unsigned offset_in_cluster,
+          * holding s->lock, so that doesn't work today.
-+                                       unsigned bytes)
+          *
           * If full_discard is true, the sector should not read back as zeroes,
 diff --git a/qemu-img.c b/qemu-img.c
 index XXXXXXX..XXXXXXX 100644
 --- a/qemu-img.c
 +++ b/qemu-img.c
@@ -XXX,XX +XXX,XX @@ static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num)
      if (s->sector_next_status <= sector_num) {
          if (s->target_has_backing) {
 -            ret = bdrv_get_block_status(blk_bs(s->src[src_cur]),
 -                                        sector_num - src_cur_offset,
 -                                        n, &n, NULL);
 +            int64_t count = n * BDRV_SECTOR_SIZE;
 +
 +            ret = bdrv_block_status(blk_bs(s->src[src_cur]),
 +                                    (sector_num - src_cur_offset) *
 +                                    BDRV_SECTOR_SIZE,
 +                                    count, &count, NULL, NULL);
 +            assert(ret < 0 || QEMU_IS_ALIGNED(count, BDRV_SECTOR_SIZE));
 +            n = count >> BDRV_SECTOR_BITS;
          } else {
              ret = bdrv_get_block_status_above(blk_bs(s->src[src_cur]), NULL,
                                                sector_num - src_cur_offset,
@@ -XXX,XX +XXX,XX @@ static void dump_map_entry(OutputFormat output_format, MapEntry *e,
  static int get_block_status(BlockDriverState *bs, int64_t offset,
                              int64_t bytes, MapEntry *e)
  {
-     BDRVQcow2State *s = bs->opaque;
+-    int64_t ret;
-     QEMUIOVector qiov;
++    int ret;
-diff --git a/block/qcow2.h b/block/qcow2.h
+     int depth;
-index XXXXXXX..XXXXXXX 100644
+     BlockDriverState *file;
---- a/block/qcow2.h
+     bool has_offset;
-+++ b/block/qcow2.h
+-    int nb_sectors = bytes >> BDRV_SECTOR_BITS;
-@@ -XXX,XX +XXX,XX @@ typedef struct Qcow2COWRegion {
++    int64_t map;
-      * Offset of the COW region in bytes from the start of the first cluster
-      * touched by the request.
+-    assert(bytes < INT_MAX);
-      */
+     /* As an optimization, we could cache the current range of unallocated
--    uint64_t    offset;
+      * clusters in each file of the chain, and avoid querying the same
-+    unsigned    offset;
+      * range repeatedly.
+@@ -XXX,XX +XXX,XX @@ static int get_block_status(BlockDriverState *bs, int64_t offset,
-     /** Number of bytes to copy */
--    int         nb_bytes;
+     depth = 0;
-+    unsigned    nb_bytes;
+     for (;;) {
- } Qcow2COWRegion;
+-        ret = bdrv_get_block_status(bs, offset >> BDRV_SECTOR_BITS, nb_sectors,
+-                                    &nb_sectors, &file);
- /**
++        ret = bdrv_block_status(bs, offset, bytes, &bytes, &map, &file);
          if (ret < 0) {
              return ret;
          }
 -        assert(nb_sectors);
 +        assert(bytes);
          if (ret & (BDRV_BLOCK_ZERO|BDRV_BLOCK_DATA)) {
              break;
          }
@@ -XXX,XX +XXX,XX @@ static int get_block_status(BlockDriverState *bs, int64_t offset,
      *e = (MapEntry) {
          .start = offset,
 -        .length = nb_sectors * BDRV_SECTOR_SIZE,
 +        .length = bytes,
          .data = !!(ret & BDRV_BLOCK_DATA),
          .zero = !!(ret & BDRV_BLOCK_ZERO),
 -        .offset = ret & BDRV_BLOCK_OFFSET_MASK,
 +        .offset = map,
          .has_offset = has_offset,
          .depth = depth,
          .has_filename = file && has_offset,
 --
-.8.3.1
+.13.6

-[Qemu-devel] [PULL 28/61] qed: Remove callback from qed_read_l2_table()
+[Qemu-devel] [PULL 10/35] block: Switch bdrv_co_get_block_status() to byte-based
+From: Eric Blake <eblake@redhat.com>
+We are gradually converting to byte-based interfaces, as they are
+easier to reason about than sector-based.  Convert another internal
+function (no semantic change); and as with its public counterpart,
+rename to bdrv_co_block_status() and split the offset return, to
+make the compiler enforce that we catch all uses.  For now, we
+assert that callers and the return value still use aligned data,
+but ultimately, this will be the function where we hand off to a
+byte-based driver callback; it will eventually need logic to round
+calls according to the driver's request_alignment and then touch up
+the result handed back to the caller, so that callers may pass
+unaligned offsets.
+Note that we are now prepared to accept 'bytes' larger than INT_MAX;
+this is okay as long as we clamp things internally before violating
+any 32-bit limits, and makes no difference to how a client will
+use the information (clients looping over the entire file must
+already be prepared for consecutive calls to return the same status,
+as drivers are already free to return shorter-than-maximal status
+due to any other convenient split points, such as when the L2 table
+crosses cluster boundaries in qcow2).
+Signed-off-by: Eric Blake <eblake@redhat.com>
 Signed-off-by: Kevin Wolf <kwolf@redhat.com>
-Reviewed-by: Eric Blake <eblake@redhat.com>
-Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
 ---
- block/qed-cluster.c | 94 ++++++++++++++++++-----------------------------------
+ block/io.c | 124 ++++++++++++++++++++++++++++++++++++++++---------------------
- block/qed-table.c   | 15 +++------
+file changed, 81 insertions(+), 43 deletions(-)
- block/qed.h         |  3 +-
-files changed, 36 insertions(+), 76 deletions(-)
+diff --git a/block/io.c b/block/io.c
 diff --git a/block/qed-cluster.c b/block/qed-cluster.c
 index XXXXXXX..XXXXXXX 100644
---- a/block/qed-cluster.c
+--- a/block/io.c
-+++ b/block/qed-cluster.c
++++ b/block/io.c
-@@ -XXX,XX +XXX,XX @@ static unsigned int qed_count_contiguous_clusters(BDRVQEDState *s,
+@@ -XXX,XX +XXX,XX @@ int64_t coroutine_fn bdrv_co_get_block_status_from_backing(BlockDriverState *bs,
-     return i - index;
+  * BDRV_BLOCK_ZERO where possible; otherwise, the result may omit those
- }
+  * bits particularly if it allows for a larger value in 'pnum'.
+  *
--typedef struct {
+- * If 'sector_num' is beyond the end of the disk image the return value is
--    BDRVQEDState *s;
++ * If 'offset' is beyond the end of the disk image the return value is
--    uint64_t pos;
+  * BDRV_BLOCK_EOF and 'pnum' is set to 0.
--    size_t len;
+  *
--
+- * 'pnum' is set to the number of sectors (including and immediately following
--    QEDRequest *request;
+- * the specified sector) that are known to be in the same
--
+- * allocated/unallocated state.
--    /* User callback */
+- *
--    QEDFindClusterFunc *cb;
+- * 'nb_sectors' is the max value 'pnum' should be set to.  If nb_sectors goes
--    void *opaque;
++ * 'bytes' is the max value 'pnum' should be set to.  If bytes goes
--} QEDFindClusterCB;
+  * beyond the end of the disk image it will be clamped; if 'pnum' is set to
--
+  * the end of the image, then the returned value will include BDRV_BLOCK_EOF.
--static void qed_find_cluster_cb(void *opaque, int ret)
+  *
 - * If returned value is positive, BDRV_BLOCK_OFFSET_VALID bit is set, and
 - * 'file' is non-NULL, then '*file' points to the BDS which the sector range
 - * is allocated in.
 + * 'pnum' is set to the number of bytes (including and immediately
 + * following the specified offset) that are easily known to be in the
 + * same allocated/unallocated state.  Note that a second call starting
 + * at the original offset plus returned pnum may have the same status.
 + * The returned value is non-zero on success except at end-of-file.
 + *
 + * Returns negative errno on failure.  Otherwise, if the
 + * BDRV_BLOCK_OFFSET_VALID bit is set, 'map' and 'file' (if non-NULL) are
 + * set to the host mapping and BDS corresponding to the guest offset.
   */
 -static int64_t coroutine_fn bdrv_co_get_block_status(BlockDriverState *bs,
 -                                                     bool want_zero,
 -                                                     int64_t sector_num,
 -                                                     int nb_sectors, int *pnum,
 -                                                     BlockDriverState **file)
 -{
--    QEDFindClusterCB *find_cluster_cb = opaque;
+-    int64_t total_sectors;
--    BDRVQEDState *s = find_cluster_cb->s;
+-    int64_t n;
--    QEDRequest *request = find_cluster_cb->request;
+-    int64_t ret, ret2;
--    uint64_t offset = 0;
++static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
--    size_t len = 0;
++                                             bool want_zero,
--    unsigned int index;
++                                             int64_t offset, int64_t bytes,
--    unsigned int n;
++                                             int64_t *pnum, int64_t *map,
--
++                                             BlockDriverState **file)
--    qed_acquire(s);
++{
--    if (ret) {
++    int64_t total_size;
--        goto out;
++    int64_t n; /* bytes */
--    }
++    int64_t ret;
--
++    int64_t local_map = 0;
--    index = qed_l2_index(s, find_cluster_cb->pos);
+     BlockDriverState *local_file = NULL;
--    n = qed_bytes_to_clusters(s,
++    int count; /* sectors */
--                              qed_offset_into_cluster(s, find_cluster_cb->pos) +
--                              find_cluster_cb->len);
+     assert(pnum);
--    n = qed_count_contiguous_clusters(s, request->l2_table->table,
+     *pnum = 0;
--                                      index, n, &offset);
+-    total_sectors = bdrv_nb_sectors(bs);
--
+-    if (total_sectors < 0) {
--    if (qed_offset_is_unalloc_cluster(offset)) {
+-        ret = total_sectors;
--        ret = QED_CLUSTER_L2;
++    total_size = bdrv_getlength(bs);
--    } else if (qed_offset_is_zero_cluster(offset)) {
++    if (total_size < 0) {
--        ret = QED_CLUSTER_ZERO;
++        ret = total_size;
--    } else if (qed_check_cluster_offset(s, offset)) {
+         goto early_out;
--        ret = QED_CLUSTER_FOUND;
+     }
--    } else {
--        ret = -EINVAL;
+-    if (sector_num >= total_sectors) {
--    }
++    if (offset >= total_size) {
--
+         ret = BDRV_BLOCK_EOF;
--    len = MIN(find_cluster_cb->len, n * s->header.cluster_size -
+         goto early_out;
--              qed_offset_into_cluster(s, find_cluster_cb->pos));
+     }
--
+-    if (!nb_sectors) {
--out:
++    if (!bytes) {
--    find_cluster_cb->cb(find_cluster_cb->opaque, ret, offset, len);
+         ret = 0;
--    qed_release(s);
+         goto early_out;
--    g_free(find_cluster_cb);
+     }
--}
--
+-    n = total_sectors - sector_num;
- /**
+-    if (n < nb_sectors) {
-  * Find the offset of a data cluster
+-        nb_sectors = n;
-  *
++    n = total_size - offset;
-@@ -XXX,XX +XXX,XX @@ out:
++    if (n < bytes) {
- void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
++        bytes = n;
-                       size_t len, QEDFindClusterFunc *cb, void *opaque)
+     }
- {
--    QEDFindClusterCB *find_cluster_cb;
+     if (!bs->drv->bdrv_co_get_block_status) {
-     uint64_t l2_offset;
+-        *pnum = nb_sectors;
-+    uint64_t offset = 0;
++        *pnum = bytes;
-+    unsigned int index;
+         ret = BDRV_BLOCK_DATA | BDRV_BLOCK_ALLOCATED;
-+    unsigned int n;
+-        if (sector_num + nb_sectors == total_sectors) {
-+    int ret;
++        if (offset + bytes == total_size) {
+             ret |= BDRV_BLOCK_EOF;
-     /* Limit length to L2 boundary.  Requests are broken up at the L2 boundary
+         }
-      * so that a request acts on one L2 table at a time.
+         if (bs->drv->protocol_name) {
-@@ -XXX,XX +XXX,XX @@ void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
+-            ret |= BDRV_BLOCK_OFFSET_VALID | (sector_num * BDRV_SECTOR_SIZE);
-         return;
++            ret |= BDRV_BLOCK_OFFSET_VALID;
-     }
++            local_map = offset;
+             local_file = bs;
--    find_cluster_cb = g_malloc(sizeof(*find_cluster_cb));
+         }
--    find_cluster_cb->s = s;
+         goto early_out;
--    find_cluster_cb->pos = pos;
+     }
--    find_cluster_cb->len = len;
--    find_cluster_cb->cb = cb;
+     bdrv_inc_in_flight(bs);
--    find_cluster_cb->opaque = opaque;
+-    ret = bs->drv->bdrv_co_get_block_status(bs, sector_num, nb_sectors, pnum,
--    find_cluster_cb->request = request;
++    /*
-+    ret = qed_read_l2_table(s, request, l2_offset);
++     * TODO: Rather than require aligned offsets, we could instead
-+    qed_acquire(s);
++     * round to the driver's request_alignment here, then touch up
-+    if (ret) {
++     * count afterwards back to the caller's expectations.
-+        goto out;
++     */
 +    assert(QEMU_IS_ALIGNED(offset | bytes, BDRV_SECTOR_SIZE));
 +    /*
 +     * The contract allows us to return pnum smaller than bytes, even
 +     * if the next query would see the same status; we truncate the
 +     * request to avoid overflowing the driver's 32-bit interface.
 +     */
 +    bytes = MIN(bytes, BDRV_REQUEST_MAX_BYTES);
 +    ret = bs->drv->bdrv_co_get_block_status(bs, offset >> BDRV_SECTOR_BITS,
 +                                            bytes >> BDRV_SECTOR_BITS, &count,
                                              &local_file);
      if (ret < 0) {
 -        *pnum = 0;
          goto out;
      }
 +    if (ret & BDRV_BLOCK_OFFSET_VALID) {
 +        local_map = ret & BDRV_BLOCK_OFFSET_MASK;
 +    }
-+
++    *pnum = count * BDRV_SECTOR_SIZE;
-+    index = qed_l2_index(s, pos);
-+    n = qed_bytes_to_clusters(s,
+     if (ret & BDRV_BLOCK_RAW) {
-+                              qed_offset_into_cluster(s, pos) + len);
+         assert(ret & BDRV_BLOCK_OFFSET_VALID && local_file);
-+    n = qed_count_contiguous_clusters(s, request->l2_table->table,
+-        ret = bdrv_co_get_block_status(local_file, want_zero,
-+                                      index, n, &offset);
+-                                       ret >> BDRV_SECTOR_BITS,
-+
+-                                       *pnum, pnum, &local_file);
-+    if (qed_offset_is_unalloc_cluster(offset)) {
++        ret = bdrv_co_block_status(local_file, want_zero, local_map,
-+        ret = QED_CLUSTER_L2;
++                                   *pnum, pnum, &local_map, &local_file);
-+    } else if (qed_offset_is_zero_cluster(offset)) {
++        assert(ret < 0 ||
-+        ret = QED_CLUSTER_ZERO;
++               QEMU_IS_ALIGNED(*pnum | local_map, BDRV_SECTOR_SIZE));
-+    } else if (qed_check_cluster_offset(s, offset)) {
+         goto out;
-+        ret = QED_CLUSTER_FOUND;
+     }
@@ -XXX,XX +XXX,XX @@ static int64_t coroutine_fn bdrv_co_get_block_status(BlockDriverState *bs,
              ret |= BDRV_BLOCK_ZERO;
          } else if (bs->backing) {
              BlockDriverState *bs2 = bs->backing->bs;
 -            int64_t nb_sectors2 = bdrv_nb_sectors(bs2);
 +            int64_t size2 = bdrv_getlength(bs2);
 -            if (nb_sectors2 >= 0 && sector_num >= nb_sectors2) {
 +            if (size2 >= 0 && offset >= size2) {
                  ret |= BDRV_BLOCK_ZERO;
              }
          }
@@ -XXX,XX +XXX,XX @@ static int64_t coroutine_fn bdrv_co_get_block_status(BlockDriverState *bs,
      if (want_zero && local_file && local_file != bs &&
          (ret & BDRV_BLOCK_DATA) && !(ret & BDRV_BLOCK_ZERO) &&
          (ret & BDRV_BLOCK_OFFSET_VALID)) {
 -        int file_pnum;
 +        int64_t file_pnum;
 +        int ret2;
 -        ret2 = bdrv_co_get_block_status(local_file, want_zero,
 -                                        ret >> BDRV_SECTOR_BITS,
 -                                        *pnum, &file_pnum, NULL);
 +        ret2 = bdrv_co_block_status(local_file, want_zero, local_map,
 +                                    *pnum, &file_pnum, NULL, NULL);
          if (ret2 >= 0) {
              /* Ignore errors.  This is just providing extra information, it
               * is useful but not necessary.
@@ -XXX,XX +XXX,XX @@ static int64_t coroutine_fn bdrv_co_get_block_status(BlockDriverState *bs,
  out:
      bdrv_dec_in_flight(bs);
 -    if (ret >= 0 && sector_num + *pnum == total_sectors) {
 +    if (ret >= 0 && offset + *pnum == total_size) {
          ret |= BDRV_BLOCK_EOF;
      }
  early_out:
      if (file) {
          *file = local_file;
      }
 +    if (map) {
 +        *map = local_map;
 +    }
 +    if (ret >= 0) {
 +        ret &= ~BDRV_BLOCK_OFFSET_MASK;
 +    } else {
-+        ret = -EINVAL;
++        assert(INT_MIN <= ret);
 +    }
-+
-+    len = MIN(len,
-+              n * s->header.cluster_size - qed_offset_into_cluster(s, pos));
--    qed_read_l2_table(s, request, l2_offset,
--                      qed_find_cluster_cb, find_cluster_cb);
-+out:
-+    cb(opaque, ret, offset, len);
-+    qed_release(s);
- }
-diff --git a/block/qed-table.c b/block/qed-table.c
-index XXXXXXX..XXXXXXX 100644
---- a/block/qed-table.c
-+++ b/block/qed-table.c
-@@ -XXX,XX +XXX,XX @@ int qed_write_l1_table_sync(BDRVQEDState *s, unsigned int index,
      return ret;
  }
--void qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset,
+@@ -XXX,XX +XXX,XX @@ static int64_t coroutine_fn bdrv_co_get_block_status_above(BlockDriverState *bs,
--                       BlockCompletionFunc *cb, void *opaque)
+     BlockDriverState *p;
-+int qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset)
+     int64_t ret = 0;
- {
+     bool first = true;
-     int ret;
++    int64_t map = 0;
-@@ -XXX,XX +XXX,XX @@ void qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset,
+     assert(bs != base);
-     /* Check for cached L2 entry */
+     for (p = bs; p != base; p = backing_bs(p)) {
-     request->l2_table = qed_find_l2_cache_entry(&s->l2_cache, offset);
+-        ret = bdrv_co_get_block_status(p, want_zero, sector_num, nb_sectors,
-     if (request->l2_table) {
+-                                       pnum, file);
--        cb(opaque, 0);
++        int64_t count;
--        return;
++
-+        return 0;
++        ret = bdrv_co_block_status(p, want_zero,
-     }
++                                   sector_num * BDRV_SECTOR_SIZE,
++                                   nb_sectors * BDRV_SECTOR_SIZE, &count,
-     request->l2_table = qed_alloc_l2_cache_entry(&s->l2_cache);
++                                   &map, file);
-@@ -XXX,XX +XXX,XX @@ void qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset,
+         if (ret < 0) {
-     }
+             break;
-     qed_release(s);
+         }
++        assert(QEMU_IS_ALIGNED(count | map, BDRV_SECTOR_SIZE));
--    cb(opaque, ret);
++        ret |= map;
-+    return ret;
++        *pnum = count >> BDRV_SECTOR_BITS;
- }
+         if (ret & BDRV_BLOCK_ZERO && ret & BDRV_BLOCK_EOF && !first) {
+             /*
- int qed_read_l2_table_sync(BDRVQEDState *s, QEDRequest *request, uint64_t offset)
+              * Reading beyond the end of the file continues to read
  {
 -    int ret = -EINPROGRESS;
 -
 -    qed_read_l2_table(s, request, offset, qed_sync_cb, &ret);
 -    BDRV_POLL_WHILE(s->bs, ret == -EINPROGRESS);
 -
 -    return ret;
 +    return qed_read_l2_table(s, request, offset);
  }
  void qed_write_l2_table(BDRVQEDState *s, QEDRequest *request,
 diff --git a/block/qed.h b/block/qed.h
 index XXXXXXX..XXXXXXX 100644
 --- a/block/qed.h
 +++ b/block/qed.h
@@ -XXX,XX +XXX,XX @@ int qed_write_l1_table_sync(BDRVQEDState *s, unsigned int index,
                              unsigned int n);
  int qed_read_l2_table_sync(BDRVQEDState *s, QEDRequest *request,
                             uint64_t offset);
 -void qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset,
 -                       BlockCompletionFunc *cb, void *opaque);
 +int qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset);
  void qed_write_l2_table(BDRVQEDState *s, QEDRequest *request,
                          unsigned int index, unsigned int n, bool flush,
                          BlockCompletionFunc *cb, void *opaque);
 --
-.8.3.1
+.13.6

-[Qemu-devel] [PULL 47/61] qed: Remove ret argument from qed_aio_next_io()
+[Qemu-devel] [PULL 11/35] block: Switch BdrvCoGetBlockStatusData to byte-based
-All callers pass ret = 0, so we can just remove it.
+From: Eric Blake <eblake@redhat.com>
+We are gradually converting to byte-based interfaces, as they are
+easier to reason about than sector-based.  Convert another internal
+type (no semantic change), and rename it to match the corresponding
+public function rename.
+Signed-off-by: Eric Blake <eblake@redhat.com>
 Signed-off-by: Kevin Wolf <kwolf@redhat.com>
-Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
 ---
- block/qed.c | 17 ++++++-----------
+ block/io.c | 56 ++++++++++++++++++++++++++++++++++++++------------------
-file changed, 6 insertions(+), 11 deletions(-)
+file changed, 38 insertions(+), 18 deletions(-)
-diff --git a/block/qed.c b/block/qed.c
+diff --git a/block/io.c b/block/io.c
 index XXXXXXX..XXXXXXX 100644
---- a/block/qed.c
+--- a/block/io.c
-+++ b/block/qed.c
++++ b/block/io.c
-@@ -XXX,XX +XXX,XX @@ static CachedL2Table *qed_new_l2_table(BDRVQEDState *s)
+@@ -XXX,XX +XXX,XX @@ int bdrv_flush_all(void)
      return l2_table;
  }
--static void qed_aio_next_io(QEDAIOCB *acb, int ret);
-+static void qed_aio_next_io(QEDAIOCB *acb);
+-typedef struct BdrvCoGetBlockStatusData {
++typedef struct BdrvCoBlockStatusData {
- static void qed_aio_start_io(QEDAIOCB *acb)
+     BlockDriverState *bs;
      BlockDriverState *base;
      bool want_zero;
 -    int64_t sector_num;
 -    int nb_sectors;
 -    int *pnum;
 +    int64_t offset;
 +    int64_t bytes;
 +    int64_t *pnum;
 +    int64_t *map;
      BlockDriverState **file;
 -    int64_t ret;
 +    int ret;
      bool done;
 -} BdrvCoGetBlockStatusData;
 +} BdrvCoBlockStatusData;
  int64_t coroutine_fn bdrv_co_get_block_status_from_file(BlockDriverState *bs,
                                                          int64_t sector_num,
@@ -XXX,XX +XXX,XX @@ static int64_t coroutine_fn bdrv_co_get_block_status_above(BlockDriverState *bs,
  /* Coroutine wrapper for bdrv_get_block_status_above() */
  static void coroutine_fn bdrv_get_block_status_above_co_entry(void *opaque)
  {
--    qed_aio_next_io(acb, 0);
+-    BdrvCoGetBlockStatusData *data = opaque;
-+    qed_aio_next_io(acb);
++    BdrvCoBlockStatusData *data = opaque;
 +    int n = 0;
 +    int64_t ret;
 -    data->ret = bdrv_co_get_block_status_above(data->bs, data->base,
 -                                               data->want_zero,
 -                                               data->sector_num,
 -                                               data->nb_sectors,
 -                                               data->pnum,
 -                                               data->file);
 +    ret = bdrv_co_get_block_status_above(data->bs, data->base,
 +                                         data->want_zero,
 +                                         data->offset >> BDRV_SECTOR_BITS,
 +                                         data->bytes >> BDRV_SECTOR_BITS,
 +                                         &n,
 +                                         data->file);
 +    if (ret < 0) {
 +        assert(INT_MIN <= ret);
 +        data->ret = ret;
 +    } else {
 +        *data->pnum = n * BDRV_SECTOR_SIZE;
 +        *data->map = ret & BDRV_BLOCK_OFFSET_MASK;
 +        data->ret = ret & ~BDRV_BLOCK_OFFSET_MASK;
 +    }
      data->done = true;
  }
- static void qed_plug_allocating_write_reqs(BDRVQEDState *s)
+@@ -XXX,XX +XXX,XX @@ static int64_t bdrv_common_block_status_above(BlockDriverState *bs,
-@@ -XXX,XX +XXX,XX @@ static int qed_aio_read_data(void *opaque, int ret, uint64_t offset, size_t len)
+                                               BlockDriverState **file)
  /**
   * Begin next I/O or complete the request
   */
 -static void qed_aio_next_io(QEDAIOCB *acb, int ret)
 +static void qed_aio_next_io(QEDAIOCB *acb)
  {
-     BDRVQEDState *s = acb_to_s(acb);
+     Coroutine *co;
-     uint64_t offset;
+-    BdrvCoGetBlockStatusData data = {
-     size_t len;
++    int64_t n;
-+    int ret;
++    int64_t map;
++    BdrvCoBlockStatusData data = {
--    trace_qed_aio_next_io(s, acb, ret, acb->cur_pos + acb->cur_qiov.size);
+         .bs = bs,
-+    trace_qed_aio_next_io(s, acb, 0, acb->cur_pos + acb->cur_qiov.size);
+         .base = base,
+         .want_zero = want_zero,
-     if (acb->backing_qiov) {
+-        .sector_num = sector_num,
-         qemu_iovec_destroy(acb->backing_qiov);
+-        .nb_sectors = nb_sectors,
-@@ -XXX,XX +XXX,XX @@ static void qed_aio_next_io(QEDAIOCB *acb, int ret)
+-        .pnum = pnum,
-         acb->backing_qiov = NULL;
++        .offset = sector_num * BDRV_SECTOR_SIZE,
 +        .bytes = nb_sectors * BDRV_SECTOR_SIZE,
 +        .pnum = &n,
 +        .map = &map,
          .file = file,
          .done = false,
      };
@@ -XXX,XX +XXX,XX @@ static int64_t bdrv_common_block_status_above(BlockDriverState *bs,
          bdrv_coroutine_enter(bs, co);
          BDRV_POLL_WHILE(bs, !data.done);
      }
+-    return data.ret;
--    /* Handle I/O error */
++    if (data.ret < 0) {
--    if (ret) {
++        *pnum = 0;
--        qed_aio_complete(acb, ret);
++        return data.ret;
--        return;
++    }
--    }
++    assert(QEMU_IS_ALIGNED(n | map, BDRV_SECTOR_SIZE));
--
++    *pnum = n >> BDRV_SECTOR_BITS;
-     acb->qiov_offset += acb->cur_qiov.size;
++    return data.ret | map;
      acb->cur_pos += acb->cur_qiov.size;
      qemu_iovec_reset(&acb->cur_qiov);
@@ -XXX,XX +XXX,XX @@ static void qed_aio_next_io(QEDAIOCB *acb, int ret)
          }
          return;
      }
 -    qed_aio_next_io(acb, 0);
 +    qed_aio_next_io(acb);
  }
- static BlockAIOCB *qed_aio_setup(BlockDriverState *bs,
+ int64_t bdrv_get_block_status_above(BlockDriverState *bs,
 --
-.8.3.1
+.13.6

-[Qemu-devel] [PULL 22/61] qcow2: Pass a QEMUIOVector to do_perform_cow_{read, write}()
+[Qemu-devel] [PULL 12/35] block: Switch bdrv_common_block_status_above() to byte-based
-From: Alberto Garcia <berto@igalia.com>
+From: Eric Blake <eblake@redhat.com>
-Instead of passing a single buffer pointer to do_perform_cow_write(),
+We are gradually converting to byte-based interfaces, as they are
-pass a QEMUIOVector. This will allow us to merge the write requests
+easier to reason about than sector-based.  Convert another internal
-for the COW regions and the actual data into a single one.
+function (no semantic change).
-Although do_perform_cow_read() does not strictly need to change its
+Signed-off-by: Eric Blake <eblake@redhat.com>
 API, we're doing it here as well for consistency.
 Signed-off-by: Alberto Garcia <berto@igalia.com>
 Reviewed-by: Kevin Wolf <kwolf@redhat.com>
 Signed-off-by: Kevin Wolf <kwolf@redhat.com>
 ---
- block/qcow2-cluster.c | 51 ++++++++++++++++++++++++---------------------------
+ block/io.c | 61 ++++++++++++++++++++++++++++++-------------------------------
-file changed, 24 insertions(+), 27 deletions(-)
+file changed, 30 insertions(+), 31 deletions(-)
-diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
+diff --git a/block/io.c b/block/io.c
 index XXXXXXX..XXXXXXX 100644
---- a/block/qcow2-cluster.c
+--- a/block/io.c
-+++ b/block/qcow2-cluster.c
++++ b/block/io.c
-@@ -XXX,XX +XXX,XX @@ int qcow2_encrypt_sectors(BDRVQcow2State *s, int64_t sector_num,
+@@ -XXX,XX +XXX,XX @@ static void coroutine_fn bdrv_get_block_status_above_co_entry(void *opaque)
- static int coroutine_fn do_perform_cow_read(BlockDriverState *bs,
+  *
-                                             uint64_t src_cluster_offset,
+  * See bdrv_co_get_block_status_above() for details.
-                                             unsigned offset_in_cluster,
+  */
--                                            uint8_t *buffer,
+-static int64_t bdrv_common_block_status_above(BlockDriverState *bs,
--                                            unsigned bytes)
+-                                              BlockDriverState *base,
-+                                            QEMUIOVector *qiov)
+-                                              bool want_zero,
 -                                              int64_t sector_num,
 -                                              int nb_sectors, int *pnum,
 -                                              BlockDriverState **file)
 +static int bdrv_common_block_status_above(BlockDriverState *bs,
 +                                          BlockDriverState *base,
 +                                          bool want_zero, int64_t offset,
 +                                          int64_t bytes, int64_t *pnum,
 +                                          int64_t *map,
 +                                          BlockDriverState **file)
  {
--    QEMUIOVector qiov;
+     Coroutine *co;
--    struct iovec iov = { .iov_base = buffer, .iov_len = bytes };
+-    int64_t n;
-     int ret;
+-    int64_t map;
+     BdrvCoBlockStatusData data = {
--    if (bytes == 0) {
+         .bs = bs,
-+    if (qiov->size == 0) {
+         .base = base,
-         return 0;
+         .want_zero = want_zero,
 -        .offset = sector_num * BDRV_SECTOR_SIZE,
 -        .bytes = nb_sectors * BDRV_SECTOR_SIZE,
 -        .pnum = &n,
 -        .map = &map,
 +        .offset = offset,
 +        .bytes = bytes,
 +        .pnum = pnum,
 +        .map = map,
          .file = file,
          .done = false,
      };
@@ -XXX,XX +XXX,XX @@ static int64_t bdrv_common_block_status_above(BlockDriverState *bs,
          bdrv_coroutine_enter(bs, co);
          BDRV_POLL_WHILE(bs, !data.done);
      }
+-    if (data.ret < 0) {
--    qemu_iovec_init_external(&qiov, &iov, 1);
+-        *pnum = 0;
--
+-        return data.ret;
-     BLKDBG_EVENT(bs->file, BLKDBG_COW_READ);
+-    }
+-    assert(QEMU_IS_ALIGNED(n | map, BDRV_SECTOR_SIZE));
-     if (!bs->drv) {
+-    *pnum = n >> BDRV_SECTOR_BITS;
-@@ -XXX,XX +XXX,XX @@ static int coroutine_fn do_perform_cow_read(BlockDriverState *bs,
+-    return data.ret | map;
-      * which can lead to deadlock when block layer copy-on-read is enabled.
++    return data.ret;
-      */
+ }
-     ret = bs->drv->bdrv_co_preadv(bs, src_cluster_offset + offset_in_cluster,
--                                  bytes, &qiov, 0);
+ int64_t bdrv_get_block_status_above(BlockDriverState *bs,
-+                                  qiov->size, qiov, 0);
+@@ -XXX,XX +XXX,XX @@ int64_t bdrv_get_block_status_above(BlockDriverState *bs,
                                      int nb_sectors, int *pnum,
                                      BlockDriverState **file)
  {
 -    return bdrv_common_block_status_above(bs, base, true, sector_num,
 -                                          nb_sectors, pnum, file);
 +    int64_t ret;
 +    int64_t n;
 +    int64_t map;
 +
 +    ret = bdrv_common_block_status_above(bs, base, true,
 +                                         sector_num * BDRV_SECTOR_SIZE,
 +                                         nb_sectors * BDRV_SECTOR_SIZE,
 +                                         &n, &map, file);
 +    if (ret < 0) {
 +        *pnum = 0;
 +        return ret;
 +    }
 +    assert(QEMU_IS_ALIGNED(n | map, BDRV_SECTOR_SIZE));
 +    *pnum = n >> BDRV_SECTOR_BITS;
 +    return ret | map;
  }
  int bdrv_block_status(BlockDriverState *bs, int64_t offset, int64_t bytes,
@@ -XXX,XX +XXX,XX @@ int bdrv_block_status(BlockDriverState *bs, int64_t offset, int64_t bytes,
  int coroutine_fn bdrv_is_allocated(BlockDriverState *bs, int64_t offset,
                                     int64_t bytes, int64_t *pnum)
  {
 -    int64_t ret;
 -    int psectors;
 +    int ret;
 +    int64_t dummy;
 -    assert(QEMU_IS_ALIGNED(offset, BDRV_SECTOR_SIZE));
 -    assert(QEMU_IS_ALIGNED(bytes, BDRV_SECTOR_SIZE) && bytes < INT_MAX);
 -    ret = bdrv_common_block_status_above(bs, backing_bs(bs), false,
 -                                         offset >> BDRV_SECTOR_BITS,
 -                                         bytes >> BDRV_SECTOR_BITS, &psectors,
 +    ret = bdrv_common_block_status_above(bs, backing_bs(bs), false, offset,
 +                                         bytes, pnum ? pnum : &dummy, NULL,
                                           NULL);
      if (ret < 0) {
          return ret;
      }
-@@ -XXX,XX +XXX,XX @@ static bool coroutine_fn do_perform_cow_encrypt(BlockDriverState *bs,
+-    if (pnum) {
- static int coroutine_fn do_perform_cow_write(BlockDriverState *bs,
+-        *pnum = psectors * BDRV_SECTOR_SIZE;
-                                              uint64_t cluster_offset,
+-    }
-                                              unsigned offset_in_cluster,
+     return !!(ret & BDRV_BLOCK_ALLOCATED);
 -                                             uint8_t *buffer,
 -                                             unsigned bytes)
 +                                             QEMUIOVector *qiov)
  {
 -    QEMUIOVector qiov;
 -    struct iovec iov = { .iov_base = buffer, .iov_len = bytes };
      int ret;
 -    if (bytes == 0) {
 +    if (qiov->size == 0) {
          return 0;
      }
 -    qemu_iovec_init_external(&qiov, &iov, 1);
 -
      ret = qcow2_pre_write_overlap_check(bs, 0,
 -            cluster_offset + offset_in_cluster, bytes);
 +            cluster_offset + offset_in_cluster, qiov->size);
      if (ret < 0) {
          return ret;
      }
      BLKDBG_EVENT(bs->file, BLKDBG_COW_WRITE);
      ret = bdrv_co_pwritev(bs->file, cluster_offset + offset_in_cluster,
 -                          bytes, &qiov, 0);
 +                          qiov->size, qiov, 0);
      if (ret < 0) {
          return ret;
      }
@@ -XXX,XX +XXX,XX @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
      unsigned data_bytes = end->offset - (start->offset + start->nb_bytes);
      bool merge_reads;
      uint8_t *start_buffer, *end_buffer;
 +    QEMUIOVector qiov;
      int ret;
      assert(start->nb_bytes <= UINT_MAX - end->nb_bytes);
@@ -XXX,XX +XXX,XX @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
      /* The part of the buffer where the end region is located */
      end_buffer = start_buffer + buffer_size - end->nb_bytes;
 +    qemu_iovec_init(&qiov, 1);
 +
      qemu_co_mutex_unlock(&s->lock);
      /* First we read the existing data from both COW regions. We
       * either read the whole region in one go, or the start and end
       * regions separately. */
      if (merge_reads) {
 -        ret = do_perform_cow_read(bs, m->offset, start->offset,
 -                                  start_buffer, buffer_size);
 +        qemu_iovec_add(&qiov, start_buffer, buffer_size);
 +        ret = do_perform_cow_read(bs, m->offset, start->offset, &qiov);
      } else {
 -        ret = do_perform_cow_read(bs, m->offset, start->offset,
 -                                  start_buffer, start->nb_bytes);
 +        qemu_iovec_add(&qiov, start_buffer, start->nb_bytes);
 +        ret = do_perform_cow_read(bs, m->offset, start->offset, &qiov);
          if (ret < 0) {
              goto fail;
          }
 -        ret = do_perform_cow_read(bs, m->offset, end->offset,
 -                                  end_buffer, end->nb_bytes);
 +        qemu_iovec_reset(&qiov);
 +        qemu_iovec_add(&qiov, end_buffer, end->nb_bytes);
 +        ret = do_perform_cow_read(bs, m->offset, end->offset, &qiov);
      }
      if (ret < 0) {
          goto fail;
@@ -XXX,XX +XXX,XX @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
      }
      /* And now we can write everything */
 -    ret = do_perform_cow_write(bs, m->alloc_offset, start->offset,
 -                               start_buffer, start->nb_bytes);
 +    qemu_iovec_reset(&qiov);
 +    qemu_iovec_add(&qiov, start_buffer, start->nb_bytes);
 +    ret = do_perform_cow_write(bs, m->alloc_offset, start->offset, &qiov);
      if (ret < 0) {
          goto fail;
      }
 -    ret = do_perform_cow_write(bs, m->alloc_offset, end->offset,
 -                               end_buffer, end->nb_bytes);
 +    qemu_iovec_reset(&qiov);
 +    qemu_iovec_add(&qiov, end_buffer, end->nb_bytes);
 +    ret = do_perform_cow_write(bs, m->alloc_offset, end->offset, &qiov);
  fail:
      qemu_co_mutex_lock(&s->lock);
@@ -XXX,XX +XXX,XX @@ fail:
      }
      qemu_vfree(start_buffer);
 +    qemu_iovec_destroy(&qiov);
      return ret;
  }
 --
-.8.3.1
+.13.6

-[Qemu-devel] [PULL 19/61] qcow2: Make perform_cow() call do_perform_cow() twice
+[Qemu-devel] [PULL 13/35] block: Switch bdrv_co_get_block_status_above() to byte-based
-From: Alberto Garcia <berto@igalia.com>
+From: Eric Blake <eblake@redhat.com>
-Instead of calling perform_cow() twice with a different COW region
+We are gradually converting to byte-based interfaces, as they are
-each time, call it just once and make perform_cow() handle both
+easier to reason about than sector-based.  Convert another internal
-regions.
+type (no semantic change), and rename it to match the corresponding
 public function rename.
-This patch simply moves code around. The next one will do the actual
+Signed-off-by: Eric Blake <eblake@redhat.com>
 reordering of the COW operations.
 Signed-off-by: Alberto Garcia <berto@igalia.com>
 Reviewed-by: Eric Blake <eblake@redhat.com>
 Reviewed-by: Kevin Wolf <kwolf@redhat.com>
 Signed-off-by: Kevin Wolf <kwolf@redhat.com>
 ---
- block/qcow2-cluster.c | 36 ++++++++++++++++++++++--------------
+ block/io.c | 68 ++++++++++++++++++++++----------------------------------------
-file changed, 22 insertions(+), 14 deletions(-)
+file changed, 24 insertions(+), 44 deletions(-)
-diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
+diff --git a/block/io.c b/block/io.c
 index XXXXXXX..XXXXXXX 100644
---- a/block/qcow2-cluster.c
+--- a/block/io.c
-+++ b/block/qcow2-cluster.c
++++ b/block/io.c
-@@ -XXX,XX +XXX,XX @@ static int coroutine_fn do_perform_cow(BlockDriverState *bs,
+@@ -XXX,XX +XXX,XX @@ early_out:
-     struct iovec iov;
+     return ret;
      int ret;
 +    if (bytes == 0) {
 +        return 0;
 +    }
 +
      iov.iov_len = bytes;
      iov.iov_base = qemu_try_blockalign(bs, iov.iov_len);
      if (iov.iov_base == NULL) {
@@ -XXX,XX +XXX,XX @@ uint64_t qcow2_alloc_compressed_cluster_offset(BlockDriverState *bs,
      return cluster_offset;
  }
--static int perform_cow(BlockDriverState *bs, QCowL2Meta *m, Qcow2COWRegion *r)
+-static int64_t coroutine_fn bdrv_co_get_block_status_above(BlockDriverState *bs,
-+static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
+-        BlockDriverState *base,
 -        bool want_zero,
 -        int64_t sector_num,
 -        int nb_sectors,
 -        int *pnum,
 -        BlockDriverState **file)
 +static int coroutine_fn bdrv_co_block_status_above(BlockDriverState *bs,
 +                                                   BlockDriverState *base,
 +                                                   bool want_zero,
 +                                                   int64_t offset,
 +                                                   int64_t bytes,
 +                                                   int64_t *pnum,
 +                                                   int64_t *map,
 +                                                   BlockDriverState **file)
  {
-     BDRVQcow2State *s = bs->opaque;
+     BlockDriverState *p;
-+    Qcow2COWRegion *start = &m->cow_start;
+-    int64_t ret = 0;
-+    Qcow2COWRegion *end = &m->cow_end;
++    int ret = 0;
-     int ret;
+     bool first = true;
+-    int64_t map = 0;
--    if (r->nb_bytes == 0) {
-+    if (start->nb_bytes == 0 && end->nb_bytes == 0) {
+     assert(bs != base);
-         return 0;
+     for (p = bs; p != base; p = backing_bs(p)) {
 -        int64_t count;
 -
 -        ret = bdrv_co_block_status(p, want_zero,
 -                                   sector_num * BDRV_SECTOR_SIZE,
 -                                   nb_sectors * BDRV_SECTOR_SIZE, &count,
 -                                   &map, file);
 +        ret = bdrv_co_block_status(p, want_zero, offset, bytes, pnum, map,
 +                                   file);
          if (ret < 0) {
              break;
          }
 -        assert(QEMU_IS_ALIGNED(count | map, BDRV_SECTOR_SIZE));
 -        ret |= map;
 -        *pnum = count >> BDRV_SECTOR_BITS;
          if (ret & BDRV_BLOCK_ZERO && ret & BDRV_BLOCK_EOF && !first) {
              /*
               * Reading beyond the end of the file continues to read
@@ -XXX,XX +XXX,XX @@ static int64_t coroutine_fn bdrv_co_get_block_status_above(BlockDriverState *bs,
               * unallocated length we learned from an earlier
               * iteration.
               */
 -            *pnum = nb_sectors;
 +            *pnum = bytes;
          }
          if (ret & (BDRV_BLOCK_ZERO | BDRV_BLOCK_DATA)) {
              break;
          }
 -        /* [sector_num, pnum] unallocated on this layer, which could be only
 -         * the first part of [sector_num, nb_sectors].  */
 -        nb_sectors = MIN(nb_sectors, *pnum);
 +        /* [offset, pnum] unallocated on this layer, which could be only
 +         * the first part of [offset, bytes].  */
 +        bytes = MIN(bytes, *pnum);
          first = false;
      }
+     return ret;
      qemu_co_mutex_unlock(&s->lock);
 -    ret = do_perform_cow(bs, m->offset, m->alloc_offset, r->offset, r->nb_bytes);
 -    qemu_co_mutex_lock(&s->lock);
 -
 +    ret = do_perform_cow(bs, m->offset, m->alloc_offset,
 +                         start->offset, start->nb_bytes);
      if (ret < 0) {
 -        return ret;
 +        goto fail;
      }
 +    ret = do_perform_cow(bs, m->offset, m->alloc_offset,
 +                         end->offset, end->nb_bytes);
 +
 +fail:
 +    qemu_co_mutex_lock(&s->lock);
 +
      /*
       * Before we update the L2 table to actually point to the new cluster, we
       * need to be sure that the refcounts have been increased and COW was
       * handled.
       */
 -    qcow2_cache_depends_on_flush(s->l2_table_cache);
 +    if (ret == 0) {
 +        qcow2_cache_depends_on_flush(s->l2_table_cache);
 +    }
 -    return 0;
 +    return ret;
  }
- int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m)
+ /* Coroutine wrapper for bdrv_get_block_status_above() */
-@@ -XXX,XX +XXX,XX @@ int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m)
+-static void coroutine_fn bdrv_get_block_status_above_co_entry(void *opaque)
-     }
++static void coroutine_fn bdrv_block_status_above_co_entry(void *opaque)
+ {
-     /* copy content of unmodified sectors */
+     BdrvCoBlockStatusData *data = opaque;
--    ret = perform_cow(bs, m, &m->cow_start);
+-    int n = 0;
 -    int64_t ret;
 -    ret = bdrv_co_get_block_status_above(data->bs, data->base,
 -                                         data->want_zero,
 -                                         data->offset >> BDRV_SECTOR_BITS,
 -                                         data->bytes >> BDRV_SECTOR_BITS,
 -                                         &n,
 -                                         data->file);
 -    if (ret < 0) {
--        goto err;
+-        assert(INT_MIN <= ret);
 -        data->ret = ret;
 -    } else {
 -        *data->pnum = n * BDRV_SECTOR_SIZE;
 -        *data->map = ret & BDRV_BLOCK_OFFSET_MASK;
 -        data->ret = ret & ~BDRV_BLOCK_OFFSET_MASK;
 -    }
--
++    data->ret = bdrv_co_block_status_above(data->bs, data->base,
--    ret = perform_cow(bs, m, &m->cow_end);
++                                           data->want_zero,
-+    ret = perform_cow(bs, m);
++                                           data->offset, data->bytes,
-     if (ret < 0) {
++                                           data->pnum, data->map, data->file);
-         goto err;
+     data->done = true;
  }
  /*
 - * Synchronous wrapper around bdrv_co_get_block_status_above().
 + * Synchronous wrapper around bdrv_co_block_status_above().
   *
 - * See bdrv_co_get_block_status_above() for details.
 + * See bdrv_co_block_status_above() for details.
   */
  static int bdrv_common_block_status_above(BlockDriverState *bs,
                                            BlockDriverState *base,
@@ -XXX,XX +XXX,XX @@ static int bdrv_common_block_status_above(BlockDriverState *bs,
      if (qemu_in_coroutine()) {
          /* Fast-path if already in coroutine context */
 -        bdrv_get_block_status_above_co_entry(&data);
 +        bdrv_block_status_above_co_entry(&data);
      } else {
 -        co = qemu_coroutine_create(bdrv_get_block_status_above_co_entry,
 -                                   &data);
 +        co = qemu_coroutine_create(bdrv_block_status_above_co_entry, &data);
          bdrv_coroutine_enter(bs, co);
          BDRV_POLL_WHILE(bs, !data.done);
      }
 --
-.8.3.1
+.13.6

-[Qemu-devel] [PULL 23/61] qcow2: Merge the writing of the COW regions with the guest data
+[Qemu-devel] [PULL 14/35] block: Convert bdrv_get_block_status_above() to bytes
-From: Alberto Garcia <berto@igalia.com>
+From: Eric Blake <eblake@redhat.com>
-If the guest tries to write data that results in the allocation of a
+We are gradually moving away from sector-based interfaces, towards
-new cluster, instead of writing the guest data first and then the data
+byte-based.  In the common case, allocation is unlikely to ever use
-from the COW regions, write everything together using one single I/O
+values that are not naturally sector-aligned, but it is possible
-operation.
+that byte-based values will let us be more precise about allocation
+at the end of an unaligned file that can do byte-based access.
-This can improve the write performance by 25% or more, depending on
-several factors such as the media type, the cluster size and the I/O
+Changing the name of the function from bdrv_get_block_status_above()
-request size.
+to bdrv_block_status_above() ensures that the compiler enforces that
+all callers are updated.  Likewise, since a byte interface allows
-Signed-off-by: Alberto Garcia <berto@igalia.com>
+an offset mapping that might not be sector aligned, split the mapping
-Reviewed-by: Kevin Wolf <kwolf@redhat.com>
+out of the return value and into a pass-by-reference parameter.  For
 now, the io.c layer still assert()s that all uses are sector-aligned,
 but that can be relaxed when a later patch implements byte-based
 block status in the drivers.
 For the most part this patch is just the addition of scaling at the
 callers followed by inverse scaling at bdrv_block_status(), plus
 updates for the new split return interface.  But some code,
 particularly bdrv_block_status(), gets a lot simpler because it no
 longer has to mess with sectors.  Likewise, mirror code no longer
 computes s->granularity >> BDRV_SECTOR_BITS, and can therefore drop
 an assertion about alignment because the loop no longer depends on
 alignment (never mind that we don't really have a driver that
 reports sub-sector alignments, so it's not really possible to test
 the effect of sub-sector mirroring).  Fix a neighboring assertion to
 use is_power_of_2 while there.
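
The assertion change mentioned above can be sketched in isolation. The open-coded test removed by this patch also accepts zero, whereas a helper that checks for a non-zero value first does not. The function names below are hypothetical stand-ins written for this example; that QEMU's is_power_of_2() rejects zero is an assumption here, not something the patch text states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Open-coded form, as in the removed assertion: true for 0 as well. */
static bool open_coded_pow2(uint64_t v)
{
    return (v & (v - 1)) == 0;
}

/* Stand-in for a helper that treats 0 as "not a power of two". */
static bool pow2_nonzero(uint64_t v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

/* Hypothetical granularity check in the spirit of the new assertion. */
static void check_granularity(uint64_t granularity)
{
    assert(pow2_nonzero(granularity));
}
```

With the stricter form, a granularity of 0 trips the assertion instead of slipping through to later divisions by the granularity.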
 For ease of review, bdrv_get_block_status() was tackled separately.
 Signed-off-by: Eric Blake <eblake@redhat.com>
 Signed-off-by: Kevin Wolf <kwolf@redhat.com>
 ---
- block/qcow2-cluster.c | 40 ++++++++++++++++++++++++--------
+ include/block/block.h |  8 +++-----
- block/qcow2.c         | 64 +++++++++++++++++++++++++++++++++++++++++++--------
+ block/io.c            | 55 ++++++++-------------------------------------------
- block/qcow2.h         |  7 ++++++
+ block/mirror.c        | 18 ++++++-----------
-files changed, 91 insertions(+), 20 deletions(-)
+ block/qcow2.c         | 30 +++++++++++-----------------
+ qemu-img.c            | 49 +++++++++++++++++++++++++--------------------
-diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
+files changed, 57 insertions(+), 103 deletions(-)
-index XXXXXXX..XXXXXXX 100644
---- a/block/qcow2-cluster.c
+diff --git a/include/block/block.h b/include/block/block.h
-+++ b/block/qcow2-cluster.c
+index XXXXXXX..XXXXXXX 100644
-@@ -XXX,XX +XXX,XX @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
+--- a/include/block/block.h
-     assert(start->nb_bytes <= UINT_MAX - end->nb_bytes);
++++ b/include/block/block.h
-     assert(start->nb_bytes + end->nb_bytes <= UINT_MAX - data_bytes);
+@@ -XXX,XX +XXX,XX @@ bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs);
-     assert(start->offset + start->nb_bytes <= end->offset);
+ int bdrv_block_status(BlockDriverState *bs, int64_t offset,
-+    assert(!m->data_qiov || m->data_qiov->size == data_bytes);
+                       int64_t bytes, int64_t *pnum, int64_t *map,
+                       BlockDriverState **file);
-     if (start->nb_bytes == 0 && end->nb_bytes == 0) {
+-int64_t bdrv_get_block_status_above(BlockDriverState *bs,
-         return 0;
+-                                    BlockDriverState *base,
-@@ -XXX,XX +XXX,XX @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
+-                                    int64_t sector_num,
-     /* The part of the buffer where the end region is located */
+-                                    int nb_sectors, int *pnum,
-     end_buffer = start_buffer + buffer_size - end->nb_bytes;
+-                                    BlockDriverState **file);
++int bdrv_block_status_above(BlockDriverState *bs, BlockDriverState *base,
--    qemu_iovec_init(&qiov, 1);
++                            int64_t offset, int64_t bytes, int64_t *pnum,
-+    qemu_iovec_init(&qiov, 2 + (m->data_qiov ? m->data_qiov->niov : 0));
++                            int64_t *map, BlockDriverState **file);
+ int bdrv_is_allocated(BlockDriverState *bs, int64_t offset, int64_t bytes,
-     qemu_co_mutex_unlock(&s->lock);
+                       int64_t *pnum);
-     /* First we read the existing data from both COW regions. We
+ int bdrv_is_allocated_above(BlockDriverState *top, BlockDriverState *base,
-@@ -XXX,XX +XXX,XX @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
+diff --git a/block/io.c b/block/io.c
-         }
+index XXXXXXX..XXXXXXX 100644
 --- a/block/io.c
 +++ b/block/io.c
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn bdrv_co_block_status_above(BlockDriverState *bs,
      return ret;
  }
 -/* Coroutine wrapper for bdrv_get_block_status_above() */
 +/* Coroutine wrapper for bdrv_block_status_above() */
  static void coroutine_fn bdrv_block_status_above_co_entry(void *opaque)
  {
      BdrvCoBlockStatusData *data = opaque;
@@ -XXX,XX +XXX,XX @@ static int bdrv_common_block_status_above(BlockDriverState *bs,
      return data.ret;
  }
 -int64_t bdrv_get_block_status_above(BlockDriverState *bs,
 -                                    BlockDriverState *base,
 -                                    int64_t sector_num,
 -                                    int nb_sectors, int *pnum,
 -                                    BlockDriverState **file)
 +int bdrv_block_status_above(BlockDriverState *bs, BlockDriverState *base,
 +                            int64_t offset, int64_t bytes, int64_t *pnum,
 +                            int64_t *map, BlockDriverState **file)
  {
 -    int64_t ret;
 -    int64_t n;
 -    int64_t map;
 -
 -    ret = bdrv_common_block_status_above(bs, base, true,
 -                                         sector_num * BDRV_SECTOR_SIZE,
 -                                         nb_sectors * BDRV_SECTOR_SIZE,
 -                                         &n, &map, file);
 -    if (ret < 0) {
 -        *pnum = 0;
 -        return ret;
 -    }
 -    assert(QEMU_IS_ALIGNED(n | map, BDRV_SECTOR_SIZE));
 -    *pnum = n >> BDRV_SECTOR_BITS;
 -    return ret | map;
 +    return bdrv_common_block_status_above(bs, base, true, offset, bytes,
 +                                          pnum, map, file);
  }
  int bdrv_block_status(BlockDriverState *bs, int64_t offset, int64_t bytes,
                        int64_t *pnum, int64_t *map, BlockDriverState **file)
  {
 -    int64_t ret;
 -    int n;
 -
 -    assert(QEMU_IS_ALIGNED(offset | bytes, BDRV_SECTOR_SIZE));
 -    assert(pnum);
 -    /*
 -     * The contract allows us to return pnum smaller than bytes, even
 -     * if the next query would see the same status; we truncate the
 -     * request to avoid overflowing the driver's 32-bit interface.
 -     */
 -    bytes = MIN(bytes, BDRV_REQUEST_MAX_BYTES);
 -    ret = bdrv_get_block_status_above(bs, backing_bs(bs),
 -                                      offset >> BDRV_SECTOR_BITS,
 -                                      bytes >> BDRV_SECTOR_BITS, &n, file);
 -    if (ret < 0) {
 -        assert(INT_MIN <= ret);
 -        *pnum = 0;
 -        return ret;
 -    }
 -    *pnum = n * BDRV_SECTOR_SIZE;
 -    if (map) {
 -        *map = ret & BDRV_BLOCK_OFFSET_MASK;
 -    } else {
 -        ret &= ~BDRV_BLOCK_OFFSET_VALID;
 -    }
 -    return ret & ~BDRV_BLOCK_OFFSET_MASK;
 +    return bdrv_block_status_above(bs, backing_bs(bs),
 +                                   offset, bytes, pnum, map, file);
  }
  int coroutine_fn bdrv_is_allocated(BlockDriverState *bs, int64_t offset,
 diff --git a/block/mirror.c b/block/mirror.c
 index XXXXXXX..XXXXXXX 100644
 --- a/block/mirror.c
 +++ b/block/mirror.c
@@ -XXX,XX +XXX,XX @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
      uint64_t delay_ns = 0;
      /* At least the first dirty chunk is mirrored in one iteration. */
      int nb_chunks = 1;
 -    int sectors_per_chunk = s->granularity >> BDRV_SECTOR_BITS;
      bool write_zeroes_ok = bdrv_can_write_zeroes_with_unmap(blk_bs(s->target));
      int max_io_bytes = MAX(s->buf_size / MAX_IN_FLIGHT, MAX_IO_BYTES);
@@ -XXX,XX +XXX,XX @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
      }
--    /* And now we can write everything */
+     /* Clear dirty bits before querying the block status, because
--    qemu_iovec_reset(&qiov);
+-     * calling bdrv_get_block_status_above could yield - if some blocks are
--    qemu_iovec_add(&qiov, start_buffer, start->nb_bytes);
++     * calling bdrv_block_status_above could yield - if some blocks are
--    ret = do_perform_cow_write(bs, m->alloc_offset, start->offset, &qiov);
+      * marked dirty in this window, we need to know.
--    if (ret < 0) {
+      */
--        goto fail;
+     bdrv_reset_dirty_bitmap_locked(s->dirty_bitmap, offset,
-+    /* And now we can write everything. If we have the guest data we
+@@ -XXX,XX +XXX,XX @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
-+     * can write everything in one single operation */
-+    if (m->data_qiov) {
+     bitmap_set(s->in_flight_bitmap, offset / s->granularity, nb_chunks);
-+        qemu_iovec_reset(&qiov);
+     while (nb_chunks > 0 && offset < s->bdev_length) {
-+        if (start->nb_bytes) {
+-        int64_t ret;
-+            qemu_iovec_add(&qiov, start_buffer, start->nb_bytes);
+-        int io_sectors;
-+        }
++        int ret;
-+        qemu_iovec_concat(&qiov, m->data_qiov, 0, data_bytes);
+         int64_t io_bytes;
-+        if (end->nb_bytes) {
+         int64_t io_bytes_acct;
-+            qemu_iovec_add(&qiov, end_buffer, end->nb_bytes);
+         enum MirrorMethod {
-+        }
+@@ -XXX,XX +XXX,XX @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
-+        /* NOTE: we have a write_aio blkdebug event here followed by
+         } mirror_method = MIRROR_METHOD_COPY;
-+         * a cow_write one in do_perform_cow_write(), but there's only
-+         * one single I/O operation */
+         assert(!(offset % s->granularity));
-+        BLKDBG_EVENT(bs->file, BLKDBG_WRITE_AIO);
+-        ret = bdrv_get_block_status_above(source, NULL,
-+        ret = do_perform_cow_write(bs, m->alloc_offset, start->offset, &qiov);
+-                                          offset >> BDRV_SECTOR_BITS,
-+    } else {
+-                                          nb_chunks * sectors_per_chunk,
-+        /* If there's no guest data then write both COW regions separately */
+-                                          &io_sectors, NULL);
-+        qemu_iovec_reset(&qiov);
+-        io_bytes = io_sectors * BDRV_SECTOR_SIZE;
-+        qemu_iovec_add(&qiov, start_buffer, start->nb_bytes);
++        ret = bdrv_block_status_above(source, NULL, offset,
-+        ret = do_perform_cow_write(bs, m->alloc_offset, start->offset, &qiov);
++                                      nb_chunks * s->granularity,
-+        if (ret < 0) {
++                                      &io_bytes, NULL, NULL);
-+            goto fail;
+         if (ret < 0) {
-+        }
+             io_bytes = MIN(nb_chunks * s->granularity, max_io_bytes);
-+
+         } else if (ret & BDRV_BLOCK_DATA) {
-+        qemu_iovec_reset(&qiov);
+@@ -XXX,XX +XXX,XX @@ static void mirror_start_job(const char *job_id, BlockDriverState *bs,
-+        qemu_iovec_add(&qiov, end_buffer, end->nb_bytes);
+         granularity = bdrv_get_default_bitmap_granularity(target);
 +        ret = do_perform_cow_write(bs, m->alloc_offset, end->offset, &qiov);
      }
--    qemu_iovec_reset(&qiov);
+-    assert ((granularity & (granularity - 1)) == 0);
--    qemu_iovec_add(&qiov, end_buffer, end->nb_bytes);
+-    /* Granularity must be large enough for sector-based dirty bitmap */
--    ret = do_perform_cow_write(bs, m->alloc_offset, end->offset, &qiov);
+-    assert(granularity >= BDRV_SECTOR_SIZE);
- fail:
++    assert(is_power_of_2(granularity));
-     qemu_co_mutex_lock(&s->lock);
+     if (buf_size < 0) {
          error_setg(errp, "Invalid parameter 'buf-size'");
 diff --git a/block/qcow2.c b/block/qcow2.c
 index XXXXXXX..XXXXXXX 100644
 --- a/block/qcow2.c
 +++ b/block/qcow2.c
-@@ -XXX,XX +XXX,XX @@ fail:
+@@ -XXX,XX +XXX,XX @@ finish:
-     return ret;
- }
+ static bool is_zero(BlockDriverState *bs, int64_t offset, int64_t bytes)
+ {
-+/* Check if it's possible to merge a write request with the writing of
+-    int nr;
-+ * the data from the COW regions */
+-    int64_t res;
-+static bool merge_cow(uint64_t offset, unsigned bytes,
++    int64_t nr;
-+                      QEMUIOVector *hd_qiov, QCowL2Meta *l2meta)
++    int res;
-+{
+     int64_t start;
-+    QCowL2Meta *m;
      /* TODO: Widening to sector boundaries should only be needed as
@@ -XXX,XX +XXX,XX @@ static bool is_zero(BlockDriverState *bs, int64_t offset, int64_t bytes)
      if (!bytes) {
          return true;
      }
 -    res = bdrv_get_block_status_above(bs, NULL, start >> BDRV_SECTOR_BITS,
 -                                      bytes >> BDRV_SECTOR_BITS, &nr, NULL);
 -    return res >= 0 && (res & BDRV_BLOCK_ZERO) &&
 -        nr * BDRV_SECTOR_SIZE == bytes;
 +    res = bdrv_block_status_above(bs, NULL, start, bytes, &nr, NULL, NULL);
 +    return res >= 0 && (res & BDRV_BLOCK_ZERO) && nr == bytes;
  }
  static coroutine_fn int qcow2_co_pwrite_zeroes(BlockDriverState *bs,
@@ -XXX,XX +XXX,XX @@ static BlockMeasureInfo *qcow2_measure(QemuOpts *opts, BlockDriverState *in_bs,
              required = virtual_size;
          } else {
              int64_t offset;
 -            int pnum = 0;
 +            int64_t pnum = 0;
 -            for (offset = 0; offset < ssize;
 -                 offset += pnum * BDRV_SECTOR_SIZE) {
 -                int nb_sectors = MIN(ssize - offset,
 -                                     BDRV_REQUEST_MAX_BYTES) / BDRV_SECTOR_SIZE;
 -                int64_t ret;
 +            for (offset = 0; offset < ssize; offset += pnum) {
 +                int ret;
 -                ret = bdrv_get_block_status_above(in_bs, NULL,
 -                                                  offset >> BDRV_SECTOR_BITS,
 -                                                  nb_sectors, &pnum, NULL);
 +                ret = bdrv_block_status_above(in_bs, NULL, offset,
 +                                              ssize - offset, &pnum, NULL,
 +                                              NULL);
                  if (ret < 0) {
                      error_setg_errno(&local_err, -ret,
                                       "Unable to get block status");
@@ -XXX,XX +XXX,XX @@ static BlockMeasureInfo *qcow2_measure(QemuOpts *opts, BlockDriverState *in_bs,
                  } else if ((ret & (BDRV_BLOCK_DATA | BDRV_BLOCK_ALLOCATED)) ==
                             (BDRV_BLOCK_DATA | BDRV_BLOCK_ALLOCATED)) {
                      /* Extend pnum to end of cluster for next iteration */
 -                    pnum = (ROUND_UP(offset + pnum * BDRV_SECTOR_SIZE,
 -                                 cluster_size) - offset) >> BDRV_SECTOR_BITS;
 +                    pnum = ROUND_UP(offset + pnum, cluster_size) - offset;
                      /* Count clusters we've seen */
 -                    required += offset % cluster_size + pnum * BDRV_SECTOR_SIZE;
 +                    required += offset % cluster_size + pnum;
                  }
              }
          }
 diff --git a/qemu-img.c b/qemu-img.c
 index XXXXXXX..XXXXXXX 100644
 --- a/qemu-img.c
 +++ b/qemu-img.c
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
      BlockDriverState *bs1, *bs2;
      int64_t total_sectors1, total_sectors2;
      uint8_t *buf1 = NULL, *buf2 = NULL;
 -    int pnum1, pnum2;
 +    int64_t pnum1, pnum2;
      int allocated1, allocated2;
      int ret = 0; /* return value - 0 Ident, 1 Different, >1 Error */
      bool progress = false, quiet = false, strict = false;
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
      }
      for (;;) {
 -        int64_t status1, status2;
 +        int status1, status2;
          nb_sectors = sectors_to_process(total_sectors, sector_num);
          if (nb_sectors <= 0) {
              break;
          }
 -        status1 = bdrv_get_block_status_above(bs1, NULL, sector_num,
 -                                              total_sectors1 - sector_num,
 -                                              &pnum1, NULL);
 +        status1 = bdrv_block_status_above(bs1, NULL,
 +                                          sector_num * BDRV_SECTOR_SIZE,
 +                                          (total_sectors1 - sector_num) *
 +                                          BDRV_SECTOR_SIZE,
 +                                          &pnum1, NULL, NULL);
          if (status1 < 0) {
              ret = 3;
              error_report("Sector allocation test failed for %s", filename1);
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
          }
          allocated1 = status1 & BDRV_BLOCK_ALLOCATED;
 -        status2 = bdrv_get_block_status_above(bs2, NULL, sector_num,
 -                                              total_sectors2 - sector_num,
 -                                              &pnum2, NULL);
 +        status2 = bdrv_block_status_above(bs2, NULL,
 +                                          sector_num * BDRV_SECTOR_SIZE,
 +                                          (total_sectors2 - sector_num) *
 +                                          BDRV_SECTOR_SIZE,
 +                                          &pnum2, NULL, NULL);
          if (status2 < 0) {
              ret = 3;
              error_report("Sector allocation test failed for %s", filename2);
              goto out;
          }
          allocated2 = status2 & BDRV_BLOCK_ALLOCATED;
 +        /* TODO: Relax this once comparison is byte-based, and we no longer
 +         * have to worry about sector alignment */
 +        assert(QEMU_IS_ALIGNED(pnum1 | pnum2, BDRV_SECTOR_SIZE));
          if (pnum1) {
 -            nb_sectors = MIN(nb_sectors, pnum1);
 +            nb_sectors = MIN(nb_sectors, pnum1 >> BDRV_SECTOR_BITS);
          }
          if (pnum2) {
 -            nb_sectors = MIN(nb_sectors, pnum2);
 +            nb_sectors = MIN(nb_sectors, pnum2 >> BDRV_SECTOR_BITS);
          }
          if (strict) {
 -            if ((status1 & ~BDRV_BLOCK_OFFSET_MASK) !=
 -                (status2 & ~BDRV_BLOCK_OFFSET_MASK)) {
 +            if (status1 != status2) {
                  ret = 1;
                  qprintf(quiet, "Strict mode: Offset %" PRId64
                          " block status mismatch!\n",
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
              }
          }
          if ((status1 & BDRV_BLOCK_ZERO) && (status2 & BDRV_BLOCK_ZERO)) {
 -            nb_sectors = MIN(pnum1, pnum2);
 +            nb_sectors = DIV_ROUND_UP(MIN(pnum1, pnum2), BDRV_SECTOR_SIZE);
          } else if (allocated1 == allocated2) {
              if (allocated1) {
                  ret = blk_pread(blk1, sector_num << BDRV_SECTOR_BITS, buf1,
@@ -XXX,XX +XXX,XX @@ static void convert_select_part(ImgConvertState *s, int64_t sector_num,
  static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num)
  {
 -    int64_t ret, src_cur_offset;
 -    int n, src_cur;
 +    int64_t src_cur_offset;
 +    int ret, n, src_cur;
      convert_select_part(s, sector_num, &src_cur, &src_cur_offset);
@@ -XXX,XX +XXX,XX @@ static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num)
      n = MIN(s->total_sectors - sector_num, BDRV_REQUEST_MAX_SECTORS);
      if (s->sector_next_status <= sector_num) {
 +        int64_t count = n * BDRV_SECTOR_SIZE;
 +
-+    for (m = l2meta; m != NULL; m = m->next) {
+         if (s->target_has_backing) {
-+        /* If both COW regions are empty then there's nothing to merge */
+-            int64_t count = n * BDRV_SECTOR_SIZE;
-+        if (m->cow_start.nb_bytes == 0 && m->cow_end.nb_bytes == 0) {
-+            continue;
+             ret = bdrv_block_status(blk_bs(s->src[src_cur]),
-+        }
+                                     (sector_num - src_cur_offset) *
-+
+                                     BDRV_SECTOR_SIZE,
-+        /* The data (middle) region must be immediately after the
+                                     count, &count, NULL, NULL);
-+         * start region */
+-            assert(ret < 0 || QEMU_IS_ALIGNED(count, BDRV_SECTOR_SIZE));
-+        if (l2meta_cow_start(m) + m->cow_start.nb_bytes != offset) {
+-            n = count >> BDRV_SECTOR_BITS;
-+            continue;
+         } else {
-+        }
+-            ret = bdrv_get_block_status_above(blk_bs(s->src[src_cur]), NULL,
-+
+-                                              sector_num - src_cur_offset,
-+        /* The end region must be immediately after the data (middle)
+-                                              n, &n, NULL);
-+         * region */
++            ret = bdrv_block_status_above(blk_bs(s->src[src_cur]), NULL,
-+        if (m->offset + m->cow_end.offset != offset + bytes) {
++                                          (sector_num - src_cur_offset) *
-+            continue;
++                                          BDRV_SECTOR_SIZE,
-+        }
++                                          count, &count, NULL, NULL);
-+
+         }
-+        /* Make sure that adding both COW regions to the QEMUIOVector
+         if (ret < 0) {
-+         * does not exceed IOV_MAX */
+             return ret;
-+        if (hd_qiov->niov > IOV_MAX - 2) {
+         }
-+            continue;
++        n = DIV_ROUND_UP(count, BDRV_SECTOR_SIZE);
-+        }
-+
+         if (ret & BDRV_BLOCK_ZERO) {
-+        m->data_qiov = hd_qiov;
+             s->status = BLK_ZERO;
 +        return true;
 +    }
 +
 +    return false;
 +}
 +
  static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
                                           uint64_t bytes, QEMUIOVector *qiov,
                                           int flags)
@@ -XXX,XX +XXX,XX @@ static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
              goto fail;
          }
 -        qemu_co_mutex_unlock(&s->lock);
 -        BLKDBG_EVENT(bs->file, BLKDBG_WRITE_AIO);
 -        trace_qcow2_writev_data(qemu_coroutine_self(),
 -                                cluster_offset + offset_in_cluster);
 -        ret = bdrv_co_pwritev(bs->file,
 -                              cluster_offset + offset_in_cluster,
 -                              cur_bytes, &hd_qiov, 0);
 -        qemu_co_mutex_lock(&s->lock);
 -        if (ret < 0) {
 -            goto fail;
 +        /* If we need to do COW, check if it's possible to merge the
 +         * writing of the guest data together with that of the COW regions.
 +         * If it's not possible (or not necessary) then write the
 +         * guest data now. */
 +        if (!merge_cow(offset, cur_bytes, &hd_qiov, l2meta)) {
 +            qemu_co_mutex_unlock(&s->lock);
 +            BLKDBG_EVENT(bs->file, BLKDBG_WRITE_AIO);
 +            trace_qcow2_writev_data(qemu_coroutine_self(),
 +                                    cluster_offset + offset_in_cluster);
 +            ret = bdrv_co_pwritev(bs->file,
 +                                  cluster_offset + offset_in_cluster,
 +                                  cur_bytes, &hd_qiov, 0);
 +            qemu_co_mutex_lock(&s->lock);
 +            if (ret < 0) {
 +                goto fail;
 +            }
          }
          while (l2meta != NULL) {
 diff --git a/block/qcow2.h b/block/qcow2.h
 index XXXXXXX..XXXXXXX 100644
 --- a/block/qcow2.h
 +++ b/block/qcow2.h
@@ -XXX,XX +XXX,XX @@ typedef struct QCowL2Meta
       */
      Qcow2COWRegion cow_end;
 +    /**
 +     * The I/O vector with the data from the actual guest write request.
 +     * If non-NULL, this is meant to be merged together with the data
 +     * from @cow_start and @cow_end into one single write operation.
 +     */
 +    QEMUIOVector *data_qiov;
 +
      /** Pointer to next L2Meta of the same write request */
      struct QCowL2Meta *next;
 --
-.8.3.1
+.13.6

-[Qemu-devel] [PULL 51/61] qed: Simplify request handling
+[Qemu-devel] [PULL 15/35] qemu-img: Simplify logic in img_compare()
-Now that we process a request in the same coroutine from beginning to
+From: Eric Blake <eblake@redhat.com>
 end and don't drop out of it any more, we can look like a proper
 coroutine-based driver and simply call qed_aio_next_io() and get a
 return value from it instead of spawning an additional coroutine that
 reenters the parent when it's done.
+As long as we are querying the status for a chunk smaller than
+the known image size, we are guaranteed that a successful return
+will have set pnum to a non-zero size (pnum is zero only for
+queries beyond the end of the file).  Use that to slightly
+simplify the calculation of the current chunk size being compared.
+Likewise, we don't have to shrink the amount of data operated on
+until we know we have to read the file, and therefore have to fit
+in the bounds of our buffer.  Also, note that 'total_sectors_over'
+is equivalent to 'progress_base'.
+With these changes in place, sectors_to_process() is now dead code,
+and can be removed.
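A minimal sketch of the chunk-size rule described above, assuming both status queries succeeded inside the image bounds. The helper name chunk_sectors(), the must_read flag, and the IO_BUF_BYTES value are hypothetical stand-ins for illustration; they are not taken from qemu-img.c.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_BITS  9                      /* stand-in for BDRV_SECTOR_BITS */
#define SECTOR_SIZE  (1LL << SECTOR_BITS)   /* stand-in for BDRV_SECTOR_SIZE */
#define IO_BUF_BYTES (2LL * 1024 * 1024)    /* assumed I/O buffer size */

static int64_t min64(int64_t a, int64_t b)
{
    return a < b ? a : b;
}

/*
 * Hypothetical helper: number of sectors to handle in this iteration.
 * pnum1 and pnum2 are the byte counts reported by the two status queries.
 */
static int64_t chunk_sectors(int64_t pnum1, int64_t pnum2, int must_read)
{
    int64_t nb_sectors;

    /* A successful in-bounds query never reports a zero-length extent. */
    assert(pnum1 > 0 && pnum2 > 0);
    /* The comparison loop is still sector-based, so require alignment. */
    assert(((pnum1 | pnum2) & (SECTOR_SIZE - 1)) == 0);

    /* Both images keep the same status for the shorter of the two extents. */
    nb_sectors = min64(pnum1, pnum2) >> SECTOR_BITS;

    /* Clamp to the buffer only when data must actually be read. */
    if (must_read) {
        nb_sectors = min64(nb_sectors, IO_BUF_BYTES >> SECTOR_BITS);
    }
    return nb_sectors;
}
```

When both extents read as zero, must_read would be 0 and the whole common extent can be skipped in one step, which is why the clamp moves into the branches that read.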
+Signed-off-by: Eric Blake <eblake@redhat.com>
 Signed-off-by: Kevin Wolf <kwolf@redhat.com>
-Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
 ---
- block/qed.c | 101 +++++++++++++-----------------------------------------------
+ qemu-img.c | 38 +++++++++++---------------------------
- block/qed.h |   3 +-
+file changed, 11 insertions(+), 27 deletions(-)
 files changed, 22 insertions(+), 82 deletions(-)
-diff --git a/block/qed.c b/block/qed.c
+diff --git a/qemu-img.c b/qemu-img.c
 index XXXXXXX..XXXXXXX 100644
---- a/block/qed.c
+--- a/qemu-img.c
-+++ b/block/qed.c
++++ b/qemu-img.c
-@@ -XXX,XX +XXX,XX @@
+@@ -XXX,XX +XXX,XX @@ static int64_t sectors_to_bytes(int64_t sectors)
- #include "qapi/qmp/qerror.h"
+     return sectors << BDRV_SECTOR_BITS;
  #include "sysemu/block-backend.h"
 -static const AIOCBInfo qed_aiocb_info = {
 -    .aiocb_size         = sizeof(QEDAIOCB),
 -};
 -
  static int bdrv_qed_probe(const uint8_t *buf, int buf_size,
                            const char *filename)
  {
@@ -XXX,XX +XXX,XX @@ static CachedL2Table *qed_new_l2_table(BDRVQEDState *s)
      return l2_table;
  }
--static void qed_aio_next_io(QEDAIOCB *acb);
+-static int64_t sectors_to_process(int64_t total, int64_t from)
 -
 -static void qed_aio_start_io(QEDAIOCB *acb)
 -{
--    qed_aio_next_io(acb);
+-    return MIN(total - from, IO_BUF_SIZE >> BDRV_SECTOR_BITS);
 -}
 -
- static void qed_plug_allocating_write_reqs(BDRVQEDState *s)
+ /*
- {
+  * Check if passed sectors are empty (not allocated or contain only 0 bytes)
-     assert(!s->allocating_write_reqs_plugged);
+  *
-@@ -XXX,XX +XXX,XX @@ static int64_t coroutine_fn bdrv_qed_co_get_block_status(BlockDriverState *bs,
+@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
+         goto out;
  static BDRVQEDState *acb_to_s(QEDAIOCB *acb)
  {
 -    return acb->common.bs->opaque;
 +    return acb->bs->opaque;
  }
  /**
@@ -XXX,XX +XXX,XX @@ static void qed_update_l2_table(BDRVQEDState *s, QEDTable *table, int index,
      }
- }
+-    for (;;) {
--static void qed_aio_complete_bh(void *opaque)
++    while (sector_num < total_sectors) {
--{
+         int status1, status2;
--    QEDAIOCB *acb = opaque;
--    BDRVQEDState *s = acb_to_s(acb);
+-        nb_sectors = sectors_to_process(total_sectors, sector_num);
--    BlockCompletionFunc *cb = acb->common.cb;
+-        if (nb_sectors <= 0) {
--    void *user_opaque = acb->common.opaque;
+-            break;
--    int ret = acb->bh_ret;
+-        }
          status1 = bdrv_block_status_above(bs1, NULL,
                                            sector_num * BDRV_SECTOR_SIZE,
                                            (total_sectors1 - sector_num) *
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
          /* TODO: Relax this once comparison is byte-based, and we no longer
           * have to worry about sector alignment */
          assert(QEMU_IS_ALIGNED(pnum1 | pnum2, BDRV_SECTOR_SIZE));
 -        if (pnum1) {
 -            nb_sectors = MIN(nb_sectors, pnum1 >> BDRV_SECTOR_BITS);
 -        }
 -        if (pnum2) {
 -            nb_sectors = MIN(nb_sectors, pnum2 >> BDRV_SECTOR_BITS);
 -        }
 +
 +        assert(pnum1 && pnum2);
 +        nb_sectors = MIN(pnum1, pnum2) >> BDRV_SECTOR_BITS;
          if (strict) {
              if (status1 != status2) {
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
              }
          }
          if ((status1 & BDRV_BLOCK_ZERO) && (status2 & BDRV_BLOCK_ZERO)) {
 -            nb_sectors = DIV_ROUND_UP(MIN(pnum1, pnum2), BDRV_SECTOR_SIZE);
 +            /* nothing to do */
          } else if (allocated1 == allocated2) {
              if (allocated1) {
 +                nb_sectors = MIN(nb_sectors, IO_BUF_SIZE >> BDRV_SECTOR_BITS);
                  ret = blk_pread(blk1, sector_num << BDRV_SECTOR_BITS, buf1,
                                  nb_sectors << BDRV_SECTOR_BITS);
                  if (ret < 0) {
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
                  }
              }
          } else {
 -
--    qemu_aio_unref(acb);
++            nb_sectors = MIN(nb_sectors, IO_BUF_SIZE >> BDRV_SECTOR_BITS);
--
+             if (allocated1) {
--    /* Invoke callback */
+                 ret = check_empty_sectors(blk1, sector_num, nb_sectors,
--    qed_acquire(s);
+                                           filename1, buf1, quiet);
--    cb(user_opaque, ret);
+@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
--    qed_release(s);
--}
+     if (total_sectors1 != total_sectors2) {
--
+         BlockBackend *blk_over;
--static void qed_aio_complete(QEDAIOCB *acb, int ret)
+-        int64_t total_sectors_over;
-+static void qed_aio_complete(QEDAIOCB *acb)
+         const char *filename_over;
- {
-     BDRVQEDState *s = acb_to_s(acb);
+         qprintf(quiet, "Warning: Image size mismatch!\n");
+         if (total_sectors1 > total_sectors2) {
--    trace_qed_aio_complete(s, acb, ret);
+-            total_sectors_over = total_sectors1;
--
+             blk_over = blk1;
-     /* Free resources */
+             filename_over = filename1;
-     qemu_iovec_destroy(&acb->cur_qiov);
+         } else {
-     qed_unref_l2_cache_entry(acb->request.l2_table);
+-            total_sectors_over = total_sectors2;
-@@ -XXX,XX +XXX,XX @@ static void qed_aio_complete(QEDAIOCB *acb, int ret)
+             blk_over = blk2;
-         acb->qiov->iov[0].iov_base = NULL;
+             filename_over = filename2;
      }
 -    /* Arrange for a bh to invoke the completion function */
 -    acb->bh_ret = ret;
 -    aio_bh_schedule_oneshot(bdrv_get_aio_context(acb->common.bs),
 -                            qed_aio_complete_bh, acb);
 -
      /* Start next allocating write request waiting behind this one.  Note that
       * requests enqueue themselves when they first hit an unallocated cluster
       * but they wait until the entire request is finished before waking up the
@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset, size_t len)
          struct iovec *iov = acb->qiov->iov;
          if (!iov->iov_base) {
 -            iov->iov_base = qemu_try_blockalign(acb->common.bs, iov->iov_len);
 +            iov->iov_base = qemu_try_blockalign(acb->bs, iov->iov_len);
              if (iov->iov_base == NULL) {
                  return -ENOMEM;
              }
@@ -XXX,XX +XXX,XX @@ static int qed_aio_read_data(void *opaque, int ret, uint64_t offset, size_t len)
  {
      QEDAIOCB *acb = opaque;
      BDRVQEDState *s = acb_to_s(acb);
 -    BlockDriverState *bs = acb->common.bs;
 +    BlockDriverState *bs = acb->bs;
      /* Adjust offset into cluster */
      offset += qed_offset_into_cluster(s, acb->cur_pos);
@@ -XXX,XX +XXX,XX @@ static int qed_aio_read_data(void *opaque, int ret, uint64_t offset, size_t len)
  /**
   * Begin next I/O or complete the request
   */
 -static void qed_aio_next_io(QEDAIOCB *acb)
 +static int qed_aio_next_io(QEDAIOCB *acb)
  {
      BDRVQEDState *s = acb_to_s(acb);
      uint64_t offset;
@@ -XXX,XX +XXX,XX @@ static void qed_aio_next_io(QEDAIOCB *acb)
          /* Complete request */
          if (acb->cur_pos >= acb->end_pos) {
 -            qed_aio_complete(acb, 0);
 -            return;
 +            ret = 0;
 +            break;
          }
-         /* Find next cluster and start I/O */
+-        for (;;) {
-         len = acb->end_pos - acb->cur_pos;
++        while (sector_num < progress_base) {
-         ret = qed_find_cluster(s, &acb->request, acb->cur_pos, &len, &offset);
+             int64_t count;
-         if (ret < 0) {
--            qed_aio_complete(acb, ret);
+-            nb_sectors = sectors_to_process(total_sectors_over, sector_num);
--            return;
+-            if (nb_sectors <= 0) {
-+            break;
+-                break;
-         }
+-            }
+             ret = bdrv_is_allocated_above(blk_bs(blk_over), NULL,
-         if (acb->flags & QED_AIOCB_WRITE) {
+                                           sector_num * BDRV_SECTOR_SIZE,
-@@ -XXX,XX +XXX,XX @@ static void qed_aio_next_io(QEDAIOCB *acb)
+-                                          nb_sectors * BDRV_SECTOR_SIZE,
-         }
++                                          (progress_base - sector_num) *
++                                          BDRV_SECTOR_SIZE,
-         if (ret < 0 && ret != -EAGAIN) {
+                                           &count);
--            qed_aio_complete(acb, ret);
+             if (ret < 0) {
--            return;
+                 ret = 3;
-+            break;
+@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
-         }
+             assert(QEMU_IS_ALIGNED(count, BDRV_SECTOR_SIZE));
-     }
+             nb_sectors = count >> BDRV_SECTOR_BITS;
--}
+             if (ret) {
++                nb_sectors = MIN(nb_sectors, IO_BUF_SIZE >> BDRV_SECTOR_BITS);
--typedef struct QEDRequestCo {
+                 ret = check_empty_sectors(blk_over, sector_num, nb_sectors,
--    Coroutine *co;
+                                           filename_over, buf1, quiet);
--    bool done;
+                 if (ret) {
 -    int ret;
 -} QEDRequestCo;
 -
 -static void qed_co_request_cb(void *opaque, int ret)
 -{
 -    QEDRequestCo *co = opaque;
 -
 -    co->done = true;
 -    co->ret = ret;
 -    qemu_coroutine_enter_if_inactive(co->co);
 +    trace_qed_aio_complete(s, acb, ret);
 +    qed_aio_complete(acb);
 +    return ret;
  }
  static int coroutine_fn qed_co_request(BlockDriverState *bs, int64_t sector_num,
                                         QEMUIOVector *qiov, int nb_sectors,
                                         int flags)
  {
 -    QEDRequestCo co = {
 -        .co     = qemu_coroutine_self(),
 -        .done   = false,
 +    QEDAIOCB acb = {
 +        .bs         = bs,
 +        .cur_pos    = (uint64_t) sector_num * BDRV_SECTOR_SIZE,
 +        .end_pos    = (sector_num + nb_sectors) * BDRV_SECTOR_SIZE,
 +        .qiov       = qiov,
 +        .flags      = flags,
      };
 -    QEDAIOCB *acb = qemu_aio_get(&qed_aiocb_info, bs, qed_co_request_cb, &co);
 -
 -    trace_qed_aio_setup(bs->opaque, acb, sector_num, nb_sectors, &co, flags);
 +    qemu_iovec_init(&acb.cur_qiov, qiov->niov);
 -    acb->flags = flags;
 -    acb->qiov = qiov;
 -    acb->qiov_offset = 0;
 -    acb->cur_pos = (uint64_t)sector_num * BDRV_SECTOR_SIZE;
 -    acb->end_pos = acb->cur_pos + nb_sectors * BDRV_SECTOR_SIZE;
 -    acb->backing_qiov = NULL;
 -    acb->request.l2_table = NULL;
 -    qemu_iovec_init(&acb->cur_qiov, qiov->niov);
 +    trace_qed_aio_setup(bs->opaque, &acb, sector_num, nb_sectors, NULL, flags);
      /* Start request */
 -    qed_aio_start_io(acb);
 -
 -    if (!co.done) {
 -        qemu_coroutine_yield();
 -    }
 -
 -    return co.ret;
 +    return qed_aio_next_io(&acb);
  }
  static int coroutine_fn bdrv_qed_co_readv(BlockDriverState *bs,
 diff --git a/block/qed.h b/block/qed.h
 index XXXXXXX..XXXXXXX 100644
 --- a/block/qed.h
 +++ b/block/qed.h
@@ -XXX,XX +XXX,XX @@ enum {
  };
  typedef struct QEDAIOCB {
 -    BlockAIOCB common;
 -    int bh_ret;                     /* final return status for completion bh */
 +    BlockDriverState *bs;
      QSIMPLEQ_ENTRY(QEDAIOCB) next;  /* next request */
      int flags;                      /* QED_AIOCB_* bits ORed together */
      uint64_t end_pos;               /* request end on block device, in bytes */
 --
-.8.3.1
+.13.6

-[Qemu-devel] [PULL 57/61] fix: avoid an infinite loop or a dangling pointer problem in img_commit
+[Qemu-devel] [PULL 16/35] qemu-img: Speed up compare on pre-allocated larger file
-From: "sochin.jiang" <sochin.jiang@huawei.com>
+From: Eric Blake <eblake@redhat.com>
-img_commit could fall into an infinite loop calling run_block_job() if
+Compare the following images with all-zero contents:
-its blockjob fails on any I/O error; fix this already-known problem.
+$ truncate --size 1M A
 $ qemu-img create -f qcow2 -o preallocation=off B 1G
 $ qemu-img create -f qcow2 -o preallocation=metadata C 1G
-Signed-off-by: sochin.jiang <sochin.jiang@huawei.com>
+On my machine, the difference is noticeable for pre-patch speeds,
-Message-id: 1497509253-28941-1-git-send-email-sochin.jiang@huawei.com
+with more than an order of magnitude in difference caused by the
-Signed-off-by: Max Reitz <mreitz@redhat.com>
+choice of preallocation in the qcow2 file:
 $ time ./qemu-img compare -f raw -F qcow2 A B
 Warning: Image size mismatch!
 Images are identical.
 real    0m0.014s
 user    0m0.007s
 sys    0m0.007s
 $ time ./qemu-img compare -f raw -F qcow2 A C
 Warning: Image size mismatch!
 Images are identical.
 real    0m0.341s
 user    0m0.144s
 sys    0m0.188s
 Why? Because bdrv_is_allocated() returns false for image B but
 true for image C, throwing away the fact that both images know
 via lseek(SEEK_HOLE) that the entire image still reads as zero.
 From there, qemu-img ends up calling bdrv_pread() for every byte
 of the tail, instead of quickly looking for the next allocation.
 The solution: use block_status instead of is_allocated, giving:
 $ time ./qemu-img compare -f raw -F qcow2 A C
 Warning: Image size mismatch!
 Images are identical.
 real    0m0.014s
 user    0m0.011s
 sys    0m0.003s
 which is on par with the speeds for no pre-allocation.
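The difference between the two queries can be sketched as follows. The flag values and both helper names are assumptions chosen for illustration; only the final condition mirrors the check this patch adds.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values; the real ones live in QEMU's block headers. */
#define BLOCK_ZERO      0x02
#define BLOCK_ALLOCATED 0x10

/* Pre-patch behaviour (hypothetical helper): allocation alone forces a read. */
static bool needs_read_is_allocated(int status)
{
    assert(status >= 0);    /* negative values are errors, handled earlier */
    return (status & BLOCK_ALLOCATED) != 0;
}

/* Post-patch behaviour: skip extents already known to read as zero. */
static bool needs_read_block_status(int status)
{
    assert(status >= 0);
    return (status & BLOCK_ALLOCATED) && !(status & BLOCK_ZERO);
}
```

For a metadata-preallocated image such as C, the tail would typically report both flags, so only the second helper lets the loop skip the read.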
 Signed-off-by: Eric Blake <eblake@redhat.com>
 Reviewed-by: John Snow <jsnow@redhat.com>
 Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
 Signed-off-by: Kevin Wolf <kwolf@redhat.com>
 ---
- blockjob.c               |  4 ++--
+ qemu-img.c | 8 ++++----
- include/block/blockjob.h | 18 ++++++++++++++++++
+file changed, 4 insertions(+), 4 deletions(-)
  qemu-img.c               | 20 +++++++++++++-------
 files changed, 33 insertions(+), 9 deletions(-)
-diff --git a/blockjob.c b/blockjob.c
-index XXXXXXX..XXXXXXX 100644
---- a/blockjob.c
-+++ b/blockjob.c
-@@ -XXX,XX +XXX,XX @@ static void block_job_resume(BlockJob *job)
-     block_job_enter(job);
- }
--static void block_job_ref(BlockJob *job)
-+void block_job_ref(BlockJob *job)
- {
-     ++job->refcnt;
- }
-@@ -XXX,XX +XXX,XX @@ static void block_job_attached_aio_context(AioContext *new_context,
-                                            void *opaque);
- static void block_job_detach_aio_context(void *opaque);
--static void block_job_unref(BlockJob *job)
-+void block_job_unref(BlockJob *job)
- {
-     if (--job->refcnt == 0) {
-         BlockDriverState *bs = blk_bs(job->blk);
-diff --git a/include/block/blockjob.h b/include/block/blockjob.h
-index XXXXXXX..XXXXXXX 100644
---- a/include/block/blockjob.h
-+++ b/include/block/blockjob.h
-@@ -XXX,XX +XXX,XX @@ void block_job_iostatus_reset(BlockJob *job);
- BlockJobTxn *block_job_txn_new(void);
- /**
-+ * block_job_ref:
-+ *
-+ * Add a reference to the BlockJob refcnt. The reference is dropped with
-+ * block_job_unref(), which frees the job once the last reference is
-+ * released.
-+ */
-+void block_job_ref(BlockJob *job);
-+
-+/**
-+ * block_job_unref:
-+ *
-+ * Release a reference that was previously acquired with block_job_ref
-+ * or block_job_create. If it's the last reference to the object, it will be
-+ * freed.
-+ */
-+void block_job_unref(BlockJob *job);
-+
-+/**
-  * block_job_txn_unref:
-  *
-  * Release a reference that was previously acquired with block_job_txn_add_job
 diff --git a/qemu-img.c b/qemu-img.c
 index XXXXXXX..XXXXXXX 100644
 --- a/qemu-img.c
 +++ b/qemu-img.c
-@@ -XXX,XX +XXX,XX @@ static void common_block_job_cb(void *opaque, int ret)
+@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
- static void run_block_job(BlockJob *job, Error **errp)
+         while (sector_num < progress_base) {
- {
+             int64_t count;
-     AioContext *aio_context = blk_get_aio_context(job->blk);
-+    int ret = 0;
+-            ret = bdrv_is_allocated_above(blk_bs(blk_over), NULL,
++            ret = bdrv_block_status_above(blk_bs(blk_over), NULL,
--    /* FIXME In error cases, the job simply goes away and we access a dangling
+                                           sector_num * BDRV_SECTOR_SIZE,
--     * pointer below. */
+                                           (progress_base - sector_num) *
-     aio_context_acquire(aio_context);
+                                           BDRV_SECTOR_SIZE,
-+    block_job_ref(job);
+-                                          &count);
-     do {
++                                          &count, NULL, NULL);
-         aio_poll(aio_context, true);
+             if (ret < 0) {
-         qemu_progress_print(job->len ?
+                 ret = 3;
-                             ((float)job->offset / job->len * 100.f) : 0.0f, 0);
+                 error_report("Sector allocation test failed for %s",
--    } while (!job->ready);
+@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
-+    } while (!job->ready && !job->completed);
+                 goto out;
--    block_job_complete_sync(job, errp);
+             }
-+    if (!job->completed) {
+-            /* TODO relax this once bdrv_is_allocated_above does not enforce
-+        ret = block_job_complete_sync(job, errp);
++            /* TODO relax this once bdrv_block_status_above does not enforce
-+    } else {
+              * sector alignment */
-+        ret = job->ret;
+             assert(QEMU_IS_ALIGNED(count, BDRV_SECTOR_SIZE));
-+    }
+             nb_sectors = count >> BDRV_SECTOR_BITS;
-+    block_job_unref(job);
+-            if (ret) {
-     aio_context_release(aio_context);
++            if (ret & BDRV_BLOCK_ALLOCATED && !(ret & BDRV_BLOCK_ZERO)) {
+                 nb_sectors = MIN(nb_sectors, IO_BUF_SIZE >> BDRV_SECTOR_BITS);
--    /* A block job may finish instantaneously without publishing any progress,
+                 ret = check_empty_sectors(blk_over, sector_num, nb_sectors,
--     * so just signal completion here */
+                                           filename_over, buf1, quiet);
 -    qemu_progress_print(100.f, 0);
 +    /* publish completion progress only on success */
 +    if (!ret) {
 +        qemu_progress_print(100.f, 0);
 +    }
  }
  static int img_commit(int argc, char **argv)
 --
-.8.3.1
+.13.6

-[Qemu-devel] [PULL 13/61] qemu-iotests: 068: extract _qemu() function
+[Qemu-devel] [PULL 17/35] qemu-img: Add find_nonzero()
-From: Stefan Hajnoczi <stefanha@redhat.com>
+From: Eric Blake <eblake@redhat.com>
-Avoid duplicating the QEMU command-line.
+During 'qemu-img compare', when we are checking that an allocated
 portion of one file is all zeros, we don't need to waste time
 computing how many additional sectors after the first non-zero
 byte are also non-zero.  Create a new helper find_nonzero() to do
 the check for a first non-zero sector, and rebase
 check_empty_sectors() to use it.
-Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
+The new helper intentionally uses bytes in its interface, even
 though it still crawls the buffer a sector at a time; it is robust
 to a partial sector at the end of the buffer.
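A standalone sketch of the sector-at-a-time scan described above, including the partial trailing sector. The names find_nonzero_sketch() and all_zero() are hypothetical; all_zero() is a plain byte loop standing in for QEMU's buffer_is_zero(), and SECTOR_SIZE stands in for BDRV_SECTOR_SIZE.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512     /* stand-in for BDRV_SECTOR_SIZE */

/* Portable stand-in for buffer_is_zero(): 1 if all len bytes are zero. */
static int all_zero(const uint8_t *p, int64_t len)
{
    int64_t i;

    for (i = 0; i < len; i++) {
        if (p[i]) {
            return 0;
        }
    }
    return 1;
}

/*
 * Returns -1 if buf holds only zeroes, otherwise the byte offset of the
 * start of the first sector that contains a non-zero byte.
 */
static int64_t find_nonzero_sketch(const uint8_t *buf, int64_t n)
{
    int64_t i;
    int64_t end;

    assert(n >= 0);
    end = n - (n % SECTOR_SIZE);    /* last full-sector boundary */

    for (i = 0; i < end; i += SECTOR_SIZE) {
        if (!all_zero(buf + i, SECTOR_SIZE)) {
            return i;
        }
    }
    /* Partial sector at the end of the buffer, if any. */
    if (i < n && !all_zero(buf + i, n - end)) {
        return i;
    }
    return -1;
}
```

The return value is a sector-aligned byte offset, so a caller can add it to the byte position of the buffer to report where the mismatch begins.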
 Signed-off-by: Eric Blake <eblake@redhat.com>
 Reviewed-by: John Snow <jsnow@redhat.com>
 Signed-off-by: Kevin Wolf <kwolf@redhat.com>
 ---
- tests/qemu-iotests/068 | 15 +++++++++------
+ qemu-img.c | 32 ++++++++++++++++++++++++++++----
-file changed, 9 insertions(+), 6 deletions(-)
+file changed, 28 insertions(+), 4 deletions(-)
-diff --git a/tests/qemu-iotests/068 b/tests/qemu-iotests/068
+diff --git a/qemu-img.c b/qemu-img.c
-index XXXXXXX..XXXXXXX 100755
+index XXXXXXX..XXXXXXX 100644
---- a/tests/qemu-iotests/068
+--- a/qemu-img.c
-+++ b/tests/qemu-iotests/068
++++ b/qemu-img.c
-@@ -XXX,XX +XXX,XX @@ case "$QEMU_DEFAULT_MACHINE" in
+@@ -XXX,XX +XXX,XX @@ done:
-       ;;
+ }
- esac
+ /*
--# Give qemu some time to boot before saving the VM state
++ * Returns -1 if 'buf' contains only zeroes, otherwise the byte index
--bash -c 'sleep 1; echo -e "savevm 0\nquit"' |\
++ * of the first sector boundary within buf where the sector contains a
--    $QEMU $platform_parm -nographic -monitor stdio -serial none -hda "$TEST_IMG" |\
++ * non-zero byte.  This function is robust to a buffer that is not
-+_qemu()
++ * sector-aligned.
 + */
 +static int64_t find_nonzero(const uint8_t *buf, int64_t n)
 +{
-+    $QEMU $platform_parm -nographic -monitor stdio -serial none -hda "$TEST_IMG" \
++    int64_t i;
-+          "$@" |\
++    int64_t end = QEMU_ALIGN_DOWN(n, BDRV_SECTOR_SIZE);
-     _filter_qemu | _filter_hmp
++
 +    for (i = 0; i < end; i += BDRV_SECTOR_SIZE) {
 +        if (!buffer_is_zero(buf + i, BDRV_SECTOR_SIZE)) {
 +            return i;
 +        }
 +    }
 +    if (i < n && !buffer_is_zero(buf + i, n - end)) {
 +        return i;
 +    }
 +    return -1;
 +}
 +
-+# Give qemu some time to boot before saving the VM state
++/*
-+bash -c 'sleep 1; echo -e "savevm 0\nquit"' | _qemu
+  * Returns true iff the first sector pointed to by 'buf' contains at least
- # Now try to continue from that VM state (this should just work)
+  * a non-NUL byte.
--echo quit |\
+  *
--    $QEMU $platform_parm -nographic -monitor stdio -serial none -hda "$TEST_IMG" -loadvm 0 |\
+@@ -XXX,XX +XXX,XX @@ static int check_empty_sectors(BlockBackend *blk, int64_t sect_num,
--    _filter_qemu | _filter_hmp
+                                int sect_count, const char *filename,
-+echo quit | _qemu -loadvm 0
+                                uint8_t *buffer, bool quiet)
+ {
- # success, all done
+-    int pnum, ret = 0;
- echo "*** done"
++    int ret = 0;
 +    int64_t idx;
 +
      ret = blk_pread(blk, sect_num << BDRV_SECTOR_BITS, buffer,
                      sect_count << BDRV_SECTOR_BITS);
      if (ret < 0) {
@@ -XXX,XX +XXX,XX @@ static int check_empty_sectors(BlockBackend *blk, int64_t sect_num,
                       sectors_to_bytes(sect_num), filename, strerror(-ret));
          return ret;
      }
 -    ret = is_allocated_sectors(buffer, sect_count, &pnum);
 -    if (ret || pnum != sect_count) {
 +    idx = find_nonzero(buffer, sect_count * BDRV_SECTOR_SIZE);
 +    if (idx >= 0) {
          qprintf(quiet, "Content mismatch at offset %" PRId64 "!\n",
 -                sectors_to_bytes(ret ? sect_num : sect_num + pnum));
 +                sectors_to_bytes(sect_num) + idx);
          return 1;
      }
 --
-.8.3.1
+.13.6
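
For readers who want to try the new helper outside of QEMU, the following is a minimal standalone sketch of the find_nonzero() logic added above. It is not the code from qemu-img.c: SECTOR_SIZE and is_zero() are stand-ins assumed here for BDRV_SECTOR_SIZE and buffer_is_zero(), and the name find_nonzero_sketch() is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE 512 /* assumed stand-in for BDRV_SECTOR_SIZE */

/* Simplified stand-in for buffer_is_zero(): true if all len bytes are 0. */
static bool is_zero(const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != 0) {
            return false;
        }
    }
    return true;
}

/*
 * Returns -1 if buf holds only zeroes, otherwise the byte offset of the
 * start of the first sector that contains a non-zero byte.  A trailing
 * partial sector (n not a multiple of SECTOR_SIZE) is checked as well.
 */
static int64_t find_nonzero_sketch(const uint8_t *buf, int64_t n)
{
    int64_t i;
    int64_t end;

    assert(n >= 0);
    end = n - (n % SECTOR_SIZE); /* round n down to a sector multiple */

    for (i = 0; i < end; i += SECTOR_SIZE) {
        if (!is_zero(buf + i, SECTOR_SIZE)) {
            return i;
        }
    }
    /* Here i == end; check the partial sector at the tail, if any. */
    if (i < n && !is_zero(buf + i, (size_t)(n - end))) {
        return i;
    }
    return -1;
}
```

The return value is always sector-aligned, so a caller can add it to the starting byte offset of the buffer to report where a mismatch begins.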

-[Qemu-devel] [PULL 12/61] migration: hold AioContext lock for loadvm qemu_fclose()
+[Qemu-devel] [PULL 18/35] qemu-img: Drop redundant error message in compare
-From: Stefan Hajnoczi <stefanha@redhat.com>
+From: Eric Blake <eblake@redhat.com>
-migration_incoming_state_destroy() uses qemu_fclose() on the vmstate
+If a read error is encountered during 'qemu-img compare', we
-file.  Make sure to call it inside an AioContext acquire/release region.
+were printing the "Error while reading offset ..." message twice;
 this was because our helper function was awkward, printing output
 on some but not all paths.  Fix it to consistently report errors
 on all paths, so that the callers do not risk a redundant message,
 and update the testsuite for the improved output.
-This fixes an 'qemu: qemu_mutex_unlock: Operation not permitted' abort
+Further simplify the code by hoisting the conversion from an error
-in loadvm.
+message to an exit code into the helper function, rather than
 repeating that logic at all callers (yes, the helper function is
 now less generic, but it's a net win in lines of code).
-This patch closes the vmstate file before ending the drained region.
+Signed-off-by: Eric Blake <eblake@redhat.com>
-Previously we closed the vmstate file after ending the drained region.
+Reviewed-by: John Snow <jsnow@redhat.com>
 The order does not matter.
 Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
 Signed-off-by: Kevin Wolf <kwolf@redhat.com>
 ---
- migration/savevm.c | 2 +-
+ qemu-img.c                 | 19 +++++--------------
-file changed, 1 insertion(+), 1 deletion(-)
+ tests/qemu-iotests/074.out |  2 --
 files changed, 5 insertions(+), 16 deletions(-)
-diff --git a/migration/savevm.c b/migration/savevm.c
+diff --git a/qemu-img.c b/qemu-img.c
 index XXXXXXX..XXXXXXX 100644
---- a/migration/savevm.c
+--- a/qemu-img.c
-+++ b/migration/savevm.c
++++ b/qemu-img.c
-@@ -XXX,XX +XXX,XX @@ int load_snapshot(const char *name, Error **errp)
+@@ -XXX,XX +XXX,XX @@ static int64_t sectors_to_bytes(int64_t sectors)
+ /*
-     aio_context_acquire(aio_context);
+  * Check if passed sectors are empty (not allocated or contain only 0 bytes)
-     ret = qemu_loadvm_state(f);
+  *
-+    migration_incoming_state_destroy();
+- * Returns 0 in case sectors are filled with 0, 1 if sectors contain non-zero
-     aio_context_release(aio_context);
+- * data and negative value on error.
++ * Intended for use by 'qemu-img compare': Returns 0 in case sectors are
-     bdrv_drain_all_end();
++ * filled with 0, 1 if sectors contain non-zero data (this is a comparison
++ * failure), and 4 on error (the exit status for read errors), after emitting
--    migration_incoming_state_destroy();
++ * an error message.
   *
   * @param blk:  BlockBackend for the image
   * @param sect_num: Number of first sector to check
@@ -XXX,XX +XXX,XX @@ static int check_empty_sectors(BlockBackend *blk, int64_t sect_num,
      if (ret < 0) {
-         error_setg(errp, "Error %d while loading VM state", ret);
+         error_report("Error while reading offset %" PRId64 " of %s: %s",
-         return ret;
+                      sectors_to_bytes(sect_num), filename, strerror(-ret));
 -        return ret;
 +        return 4;
      }
      idx = find_nonzero(buffer, sect_count * BDRV_SECTOR_SIZE);
      if (idx >= 0) {
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
                                            filename2, buf1, quiet);
              }
              if (ret) {
 -                if (ret < 0) {
 -                    error_report("Error while reading offset %" PRId64 ": %s",
 -                                 sectors_to_bytes(sector_num), strerror(-ret));
 -                    ret = 4;
 -                }
                  goto out;
              }
          }
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
                  ret = check_empty_sectors(blk_over, sector_num, nb_sectors,
                                            filename_over, buf1, quiet);
                  if (ret) {
 -                    if (ret < 0) {
 -                        error_report("Error while reading offset %" PRId64
 -                                     " of %s: %s", sectors_to_bytes(sector_num),
 -                                     filename_over, strerror(-ret));
 -                        ret = 4;
 -                    }
                      goto out;
                  }
              }
 diff --git a/tests/qemu-iotests/074.out b/tests/qemu-iotests/074.out
 index XXXXXXX..XXXXXXX 100644
 --- a/tests/qemu-iotests/074.out
 +++ b/tests/qemu-iotests/074.out
@@ -XXX,XX +XXX,XX @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
  wrote 512/512 bytes at offset 512
 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  qemu-img: Error while reading offset 0 of blkdebug:TEST_DIR/blkdebug.conf:TEST_DIR/t.IMGFMT: Input/output error
 -qemu-img: Error while reading offset 0: Input/output error
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
  Formatting 'TEST_DIR/t.IMGFMT.2', fmt=IMGFMT size=0
@@ -XXX,XX +XXX,XX @@ Formatting 'TEST_DIR/t.IMGFMT.2', fmt=IMGFMT size=0
  wrote 512/512 bytes at offset 512
 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  qemu-img: Error while reading offset 0 of blkdebug:TEST_DIR/blkdebug.conf:TEST_DIR/t.IMGFMT: Input/output error
 -qemu-img: Error while reading offset 0 of blkdebug:TEST_DIR/blkdebug.conf:TEST_DIR/t.IMGFMT: Input/output error
  Warning: Image size mismatch!
  Cleanup
 --
-.8.3.1
+.13.6
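
The return-value convention described in this patch (0 for all-zero data, 1 for a comparison failure, 4 for a read error, with the message printed by the helper itself) can be illustrated with a small standalone sketch. Everything below is an assumption made for illustration: read_fn is a hypothetical callback standing in for blk_pread(), the three example readers are invented, and the mismatch message that the real helper prints via qprintf() is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical reader standing in for blk_pread(): returns 0 or -errno. */
typedef int (*read_fn)(int64_t offset, uint8_t *buf, int64_t bytes);

/*
 * Returns 0 if the range reads back as zeroes, 1 if it holds non-zero
 * data, and 4 on a read error.  The error message is emitted here, so
 * callers only need "if (ret) goto out;" and never print it again.
 */
static int check_empty_sketch(read_fn rd, int64_t offset, int64_t bytes,
                              const char *filename, uint8_t *buffer)
{
    int ret;

    assert(bytes >= 0);
    ret = rd(offset, buffer, bytes);
    if (ret < 0) {
        fprintf(stderr, "Error while reading offset %" PRId64 " of %s: %s\n",
                offset, filename, strerror(-ret));
        return 4; /* exit status used for read errors */
    }
    for (int64_t i = 0; i < bytes; i++) {
        if (buffer[i] != 0) {
            return 1; /* comparison failure */
        }
    }
    return 0;
}

/* Invented example readers used to exercise the three outcomes. */
static int read_zeroes(int64_t offset, uint8_t *buf, int64_t bytes)
{
    (void)offset;
    memset(buf, 0, (size_t)bytes);
    return 0;
}

static int read_data(int64_t offset, uint8_t *buf, int64_t bytes)
{
    (void)offset;
    memset(buf, 0xff, (size_t)bytes);
    return 0;
}

static int read_fails(int64_t offset, uint8_t *buf, int64_t bytes)
{
    (void)offset;
    (void)buf;
    (void)bytes;
    return -EIO;
}
```

Because the helper already maps a negative errno to 4, a caller that forwards the result unchanged ends up with the right process exit status and exactly one error line on stderr.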

-[Qemu-devel] [PULL 34/61] qed: Remove callback from qed_write_header()
+[Qemu-devel] [PULL 19/35] qemu-img: Change check_empty_sectors() to byte-based
+From: Eric Blake <eblake@redhat.com>
+Continue on the quest to make more things byte-based instead of
+sector-based.
+Signed-off-by: Eric Blake <eblake@redhat.com>
+Reviewed-by: John Snow <jsnow@redhat.com>
 Signed-off-by: Kevin Wolf <kwolf@redhat.com>
-Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
 ---
- block/qed.c | 32 ++++++++++++--------------------
+ qemu-img.c | 27 +++++++++++++++------------
-file changed, 12 insertions(+), 20 deletions(-)
+file changed, 15 insertions(+), 12 deletions(-)
-diff --git a/block/qed.c b/block/qed.c
+diff --git a/qemu-img.c b/qemu-img.c
 index XXXXXXX..XXXXXXX 100644
---- a/block/qed.c
+--- a/qemu-img.c
-+++ b/block/qed.c
++++ b/qemu-img.c
-@@ -XXX,XX +XXX,XX @@ int qed_write_header_sync(BDRVQEDState *s)
+@@ -XXX,XX +XXX,XX @@ static int64_t sectors_to_bytes(int64_t sectors)
-  * This function only updates known header fields in-place and does not affect
+  * an error message.
-  * extra data after the QED header.
+  *
   * @param blk:  BlockBackend for the image
 - * @param sect_num: Number of first sector to check
 - * @param sect_count: Number of sectors to check
 + * @param offset: Starting offset to check
 + * @param bytes: Number of bytes to check
   * @param filename: Name of disk file we are checking (logging purpose)
   * @param buffer: Allocated buffer for storing read data
   * @param quiet: Flag for quiet mode
   */
--static void qed_write_header(BDRVQEDState *s, BlockCompletionFunc cb,
+-static int check_empty_sectors(BlockBackend *blk, int64_t sect_num,
--                             void *opaque)
+-                               int sect_count, const char *filename,
-+static int qed_write_header(BDRVQEDState *s)
++static int check_empty_sectors(BlockBackend *blk, int64_t offset,
 +                               int64_t bytes, const char *filename,
                                 uint8_t *buffer, bool quiet)
  {
-     /* We must write full sectors for O_DIRECT but cannot necessarily generate
+     int ret = 0;
-      * the data following the header if an unrecognized compat feature is
+     int64_t idx;
-@@ -XXX,XX +XXX,XX @@ static void qed_write_header(BDRVQEDState *s, BlockCompletionFunc cb,
-     ret = 0;
+-    ret = blk_pread(blk, sect_num << BDRV_SECTOR_BITS, buffer,
- out:
+-                    sect_count << BDRV_SECTOR_BITS);
-     qemu_vfree(buf);
++    ret = blk_pread(blk, offset, buffer, bytes);
--    cb(opaque, ret);
+     if (ret < 0) {
-+    return ret;
+         error_report("Error while reading offset %" PRId64 " of %s: %s",
- }
+-                     sectors_to_bytes(sect_num), filename, strerror(-ret));
++                     offset, filename, strerror(-ret));
- static uint64_t qed_max_image_size(uint32_t cluster_size, uint32_t table_size)
+         return 4;
@@ -XXX,XX +XXX,XX @@ static void qed_unplug_allocating_write_reqs(BDRVQEDState *s)
      }
- }
+-    idx = find_nonzero(buffer, sect_count * BDRV_SECTOR_SIZE);
++    idx = find_nonzero(buffer, bytes);
--static void qed_finish_clear_need_check(void *opaque, int ret)
+     if (idx >= 0) {
--{
+         qprintf(quiet, "Content mismatch at offset %" PRId64 "!\n",
--    /* Do nothing */
+-                sectors_to_bytes(sect_num) + idx);
--}
++                offset + idx);
--
+         return 1;
 -static void qed_flush_after_clear_need_check(void *opaque, int ret)
 -{
 -    BDRVQEDState *s = opaque;
 -
 -    bdrv_aio_flush(s->bs, qed_finish_clear_need_check, s);
 -
 -    /* No need to wait until flush completes */
 -    qed_unplug_allocating_write_reqs(s);
 -}
 -
  static void qed_clear_need_check(void *opaque, int ret)
  {
      BDRVQEDState *s = opaque;
@@ -XXX,XX +XXX,XX @@ static void qed_clear_need_check(void *opaque, int ret)
      }
-     s->header.features &= ~QED_F_NEED_CHECK;
+@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
--    qed_write_header(s, qed_flush_after_clear_need_check, s);
+         } else {
-+    ret = qed_write_header(s);
+             nb_sectors = MIN(nb_sectors, IO_BUF_SIZE >> BDRV_SECTOR_BITS);
-+    (void) ret;
+             if (allocated1) {
-+
+-                ret = check_empty_sectors(blk1, sector_num, nb_sectors,
-+    qed_unplug_allocating_write_reqs(s);
++                ret = check_empty_sectors(blk1, sector_num * BDRV_SECTOR_SIZE,
-+
++                                          nb_sectors * BDRV_SECTOR_SIZE,
-+    ret = bdrv_flush(s->bs);
+                                           filename1, buf1, quiet);
-+    (void) ret;
+             } else {
- }
+-                ret = check_empty_sectors(blk2, sector_num, nb_sectors,
++                ret = check_empty_sectors(blk2, sector_num * BDRV_SECTOR_SIZE,
- static void qed_need_check_timer_cb(void *opaque)
++                                          nb_sectors * BDRV_SECTOR_SIZE,
-@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
+                                           filename2, buf1, quiet);
- {
+             }
-     BDRVQEDState *s = acb_to_s(acb);
+             if (ret) {
-     BlockCompletionFunc *cb;
+@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
-+    int ret;
+             nb_sectors = count >> BDRV_SECTOR_BITS;
+             if (ret & BDRV_BLOCK_ALLOCATED && !(ret & BDRV_BLOCK_ZERO)) {
-     /* Cancel timer when the first allocating request comes in */
+                 nb_sectors = MIN(nb_sectors, IO_BUF_SIZE >> BDRV_SECTOR_BITS);
-     if (QSIMPLEQ_EMPTY(&s->allocating_write_reqs)) {
+-                ret = check_empty_sectors(blk_over, sector_num, nb_sectors,
-@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
++                ret = check_empty_sectors(blk_over,
++                                          sector_num * BDRV_SECTOR_SIZE,
-     if (qed_should_set_need_check(s)) {
++                                          nb_sectors * BDRV_SECTOR_SIZE,
-         s->header.features |= QED_F_NEED_CHECK;
+                                           filename_over, buf1, quiet);
--        qed_write_header(s, cb, acb);
+                 if (ret) {
-+        ret = qed_write_header(s);
+                     goto out;
 +        cb(acb, ret);
      } else {
          cb(acb, 0);
      }
 --
-.8.3.1
+.13.6

-[Qemu-devel] [PULL 39/61] qed: Make qed_aio_write_main() synchronous
+[Qemu-devel] [PULL 20/35] qemu-img: Change compare_sectors() to be byte-based
-Note that this code is generally not running in coroutine context, so
+From: Eric Blake <eblake@redhat.com>
 this is an actual blocking synchronous operation. We'll fix this in a
 moment.
+In the continuing quest to make more things byte-based, change
+compare_sectors(), renaming it to compare_buffers() in the
+process.  Note that one caller (qemu-img compare) only cares
+about the first difference, while the other (qemu-img rebase)
+cares about how many consecutive sectors have the same
+equal/different status; however, this patch does not bother to
+micro-optimize the compare case to avoid the comparisons of
+sectors beyond the first mismatch.  Both callers are always
+passing valid buffers in, so the initial check for buffer size
+can be turned into an assertion.
+Signed-off-by: Eric Blake <eblake@redhat.com>
+Reviewed-by: John Snow <jsnow@redhat.com>
 Signed-off-by: Kevin Wolf <kwolf@redhat.com>
-Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
 ---
- block/qed.c | 61 +++++++++++++++++++------------------------------------------
+ qemu-img.c | 55 +++++++++++++++++++++++++++----------------------------
-file changed, 19 insertions(+), 42 deletions(-)
+file changed, 27 insertions(+), 28 deletions(-)
-diff --git a/block/qed.c b/block/qed.c
+diff --git a/qemu-img.c b/qemu-img.c
 index XXXXXXX..XXXXXXX 100644
---- a/block/qed.c
+--- a/qemu-img.c
-+++ b/block/qed.c
++++ b/qemu-img.c
-@@ -XXX,XX +XXX,XX @@ static void qed_aio_start_io(QEDAIOCB *acb)
+@@ -XXX,XX +XXX,XX @@ static int is_allocated_sectors_min(const uint8_t *buf, int n, int *pnum,
      qed_aio_next_io(acb, 0);
  }
--static void qed_aio_next_io_cb(void *opaque, int ret)
+ /*
--{
+- * Compares two buffers sector by sector. Returns 0 if the first sector of both
--    QEDAIOCB *acb = opaque;
+- * buffers matches, non-zero otherwise.
--
++ * Compares two buffers sector by sector. Returns 0 if the first
--    qed_aio_next_io(acb, ret);
++ * sector of each buffer matches, non-zero otherwise.
--}
+  *
--
+- * pnum is set to the number of sectors (including and immediately following
- static void qed_plug_allocating_write_reqs(BDRVQEDState *s)
+- * the first one) that are known to have the same comparison result
 + * pnum is set to the sector-aligned size of the buffer prefix that
 + * has the same matching status as the first sector.
   */
 -static int compare_sectors(const uint8_t *buf1, const uint8_t *buf2, int n,
 -    int *pnum)
 +static int compare_buffers(const uint8_t *buf1, const uint8_t *buf2,
 +                           int64_t bytes, int64_t *pnum)
  {
-     assert(!s->allocating_write_reqs_plugged);
+     bool res;
-@@ -XXX,XX +XXX,XX @@ err:
+-    int i;
-     qed_aio_complete(acb, ret);
++    int64_t i = MIN(bytes, BDRV_SECTOR_SIZE);
- }
+-    if (n <= 0) {
--static void qed_aio_write_l2_update_cb(void *opaque, int ret)
+-        *pnum = 0;
--{
+-        return 0;
 -    QEDAIOCB *acb = opaque;
 -    qed_aio_write_l2_update(acb, ret, acb->cur_cluster);
 -}
 -
 -/**
 - * Flush new data clusters before updating the L2 table
 - *
 - * This flush is necessary when a backing file is in use.  A crash during an
 - * allocating write could result in empty clusters in the image.  If the write
 - * only touched a subregion of the cluster, then backing image sectors have
 - * been lost in the untouched region.  The solution is to flush after writing a
 - * new data cluster and before updating the L2 table.
 - */
 -static void qed_aio_write_flush_before_l2_update(void *opaque, int ret)
 -{
 -    QEDAIOCB *acb = opaque;
 -    BDRVQEDState *s = acb_to_s(acb);
 -
 -    if (!bdrv_aio_flush(s->bs->file->bs, qed_aio_write_l2_update_cb, opaque)) {
 -        qed_aio_complete(acb, -EIO);
 -    }
--}
++    assert(bytes > 0);
--
- /**
+-    res = !!memcmp(buf1, buf2, 512);
-  * Write data to the image file
+-    for(i = 1; i < n; i++) {
-  */
+-        buf1 += 512;
-@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_main(void *opaque, int ret)
+-        buf2 += 512;
-     BDRVQEDState *s = acb_to_s(acb);
++    res = !!memcmp(buf1, buf2, i);
-     uint64_t offset = acb->cur_cluster +
++    while (i < bytes) {
-                       qed_offset_into_cluster(s, acb->cur_pos);
++        int64_t len = MIN(bytes - i, BDRV_SECTOR_SIZE);
--    BlockCompletionFunc *next_fn;
+-        if (!!memcmp(buf1, buf2, 512) != res) {
-     trace_qed_aio_write_main(s, acb, ret, offset, acb->cur_qiov.size);
++        if (!!memcmp(buf1 + i, buf2 + i, len) != res) {
+             break;
-@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_main(void *opaque, int ret)
+         }
-         return;
++        i += len;
      }
-+    BLKDBG_EVENT(s->bs->file, BLKDBG_WRITE_AIO);
+     *pnum = i;
-+    ret = bdrv_pwritev(s->bs->file, offset, &acb->cur_qiov);
+@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
-+    if (ret >= 0) {
+     int64_t total_sectors;
-+        ret = 0;
+     int64_t sector_num = 0;
-+    }
+     int64_t nb_sectors;
 -    int c, pnum;
 +    int c;
      uint64_t progress_base;
      bool image_opts = false;
      bool force_share = false;
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
              /* nothing to do */
          } else if (allocated1 == allocated2) {
              if (allocated1) {
 +                int64_t pnum;
 +
-     if (acb->find_cluster_ret == QED_CLUSTER_FOUND) {
+                 nb_sectors = MIN(nb_sectors, IO_BUF_SIZE >> BDRV_SECTOR_BITS);
--        next_fn = qed_aio_next_io_cb;
+                 ret = blk_pread(blk1, sector_num << BDRV_SECTOR_BITS, buf1,
-+        qed_aio_next_io(acb, ret);
+                                 nb_sectors << BDRV_SECTOR_BITS);
-     } else {
+@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
-         if (s->bs->backing) {
+                     ret = 4;
--            next_fn = qed_aio_write_flush_before_l2_update;
+                     goto out;
--        } else {
+                 }
--            next_fn = qed_aio_write_l2_update_cb;
+-                ret = compare_sectors(buf1, buf2, nb_sectors, &pnum);
-+            /*
+-                if (ret || pnum != nb_sectors) {
-+             * Flush new data clusters before updating the L2 table
++                ret = compare_buffers(buf1, buf2,
-+             *
++                                      nb_sectors * BDRV_SECTOR_SIZE, &pnum);
-+             * This flush is necessary when a backing file is in use.  A crash
++                if (ret || pnum != nb_sectors * BDRV_SECTOR_SIZE) {
-+             * during an allocating write could result in empty clusters in the
+                     qprintf(quiet, "Content mismatch at offset %" PRId64 "!\n",
-+             * image.  If the write only touched a subregion of the cluster,
+-                            sectors_to_bytes(
-+             * then backing image sectors have been lost in the untouched
+-                                ret ? sector_num : sector_num + pnum));
-+             * region.  The solution is to flush after writing a new data
++                            sectors_to_bytes(sector_num) + (ret ? 0 : pnum));
-+             * cluster and before updating the L2 table.
+                     ret = 1;
-+             */
+                     goto out;
-+            ret = bdrv_flush(s->bs->file->bs);
+                 }
-         }
+@@ -XXX,XX +XXX,XX @@ static int img_rebase(int argc, char **argv)
-+        qed_aio_write_l2_update(acb, ret, acb->cur_cluster);
+             /* If they differ, we need to write to the COW file */
-     }
+             uint64_t written = 0;
--
--    BLKDBG_EVENT(s->bs->file, BLKDBG_WRITE_AIO);
+-            while (written < n) {
--    bdrv_aio_writev(s->bs->file, offset / BDRV_SECTOR_SIZE,
+-                int pnum;
--                    &acb->cur_qiov, acb->cur_qiov.size / BDRV_SECTOR_SIZE,
++            while (written < n * BDRV_SECTOR_SIZE) {
--                    next_fn, acb);
++                int64_t pnum;
- }
+-                if (compare_sectors(buf_old + written * 512,
- /**
+-                    buf_new + written * 512, n - written, &pnum))
 +                if (compare_buffers(buf_old + written,
 +                                    buf_new + written,
 +                                    n * BDRV_SECTOR_SIZE - written, &pnum))
                  {
                      ret = blk_pwrite(blk,
 -                                     (sector + written) << BDRV_SECTOR_BITS,
 -                                     buf_old + written * 512,
 -                                     pnum << BDRV_SECTOR_BITS, 0);
 +                                     (sector << BDRV_SECTOR_BITS) + written,
 +                                     buf_old + written, pnum, 0);
                      if (ret < 0) {
                          error_report("Error while writing to COW image: %s",
                              strerror(-ret));
 --
-.8.3.1
+.13.6
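
The byte-based comparison introduced by this patch can be exercised in isolation with the sketch below. It follows the shape of compare_buffers() as shown in the diff, but it is a standalone approximation: SECTOR_SIZE and MIN_I64() are local stand-ins assumed for BDRV_SECTOR_SIZE and QEMU's MIN(), and the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512 /* assumed stand-in for BDRV_SECTOR_SIZE */
#define MIN_I64(a, b) ((int64_t)(a) < (int64_t)(b) ? (int64_t)(a) : (int64_t)(b))

/*
 * Compares two buffers sector by sector.  Returns 0 if the first sector
 * of each buffer matches, 1 otherwise.  *pnum is set to the length in
 * bytes of the prefix whose sectors all share that same status; the last
 * sector may be partial if bytes is not a multiple of SECTOR_SIZE.
 */
static int compare_buffers_sketch(const uint8_t *buf1, const uint8_t *buf2,
                                  int64_t bytes, int64_t *pnum)
{
    bool res;
    int64_t i = MIN_I64(bytes, SECTOR_SIZE);

    assert(bytes > 0);

    /* Status of the first (possibly short) sector. */
    res = !!memcmp(buf1, buf2, (size_t)i);
    while (i < bytes) {
        int64_t len = MIN_I64(bytes - i, SECTOR_SIZE);

        /* Stop at the first sector whose status differs from the first. */
        if (!!memcmp(buf1 + i, buf2 + i, (size_t)len) != res) {
            break;
        }
        i += len;
    }

    *pnum = i;
    return res;
}
```

A caller that only cares about the first difference, as in the compare case, can treat "return value 0 and *pnum equal to bytes" as a full match and otherwise report the offset of the buffer plus (ret ? 0 : *pnum).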

-[Qemu-devel] [PULL 07/61] migration: use bdrv_drain_all_begin/end() instead bdrv_drain_all()
+[Qemu-devel] [PULL 21/35] qemu-img: Change img_rebase() to be byte-based
-From: Stefan Hajnoczi <stefanha@redhat.com>
+From: Eric Blake <eblake@redhat.com>
-blk/bdrv_drain_all() only takes effect for a single instant and then
+In the continuing quest to make more things byte-based, change
-resumes block jobs, guest devices, and other external clients like the
+the internal iteration of img_rebase().  We can finally drop the
-NBD server.  This can be handy when performing a synchronous drain
+TODO assertion added earlier, now that the entire algorithm is
-before terminating the program, for example.
+byte-based and no longer has to shift from bytes to sectors.
-Monitor commands usually need to quiesce I/O across an entire code
+Most of the change is mechanical ('num_sectors' becomes 'size',
-region so blk/bdrv_drain_all() is not suitable.  They must use
+'sector' becomes 'offset', 'n' goes from sectors to bytes); some
-bdrv_drain_all_begin/end() to mark the region.  This prevents new I/O
+of it is also a cleanup (use of MIN() instead of open-coding,
-requests from slipping in or worse - block jobs completing and modifying
+loss of variable 'count' added earlier in commit d6a644bb).
 the graph.
-I audited other blk/bdrv_drain_all() callers but did not find anything
+Signed-off-by: Eric Blake <eblake@redhat.com>
-that needs a similar fix.  This patch fixes the savevm/loadvm commands.
+Reviewed-by: John Snow <jsnow@redhat.com>
 Although I haven't encountered a real world issue this makes the code
 safer.
 Suggested-by: Kevin Wolf <kwolf@redhat.com>
 Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
 Reviewed-by: Eric Blake <eblake@redhat.com>
 Signed-off-by: Kevin Wolf <kwolf@redhat.com>
 ---
- migration/savevm.c | 18 +++++++++++++++---
+ qemu-img.c | 84 +++++++++++++++++++++++++-------------------------------------
-file changed, 15 insertions(+), 3 deletions(-)
+file changed, 34 insertions(+), 50 deletions(-)
-diff --git a/migration/savevm.c b/migration/savevm.c
+diff --git a/qemu-img.c b/qemu-img.c
 index XXXXXXX..XXXXXXX 100644
---- a/migration/savevm.c
+--- a/qemu-img.c
-+++ b/migration/savevm.c
++++ b/qemu-img.c
-@@ -XXX,XX +XXX,XX @@ int save_snapshot(const char *name, Error **errp)
+@@ -XXX,XX +XXX,XX @@ static int img_rebase(int argc, char **argv)
-     }
+      * the image is the same as the original one at any time.
-     vm_stop(RUN_STATE_SAVE_VM);
+      */
+     if (!unsafe) {
-+    bdrv_drain_all_begin();
+-        int64_t num_sectors;
-+
+-        int64_t old_backing_num_sectors;
-     aio_context_acquire(aio_context);
+-        int64_t new_backing_num_sectors = 0;
+-        uint64_t sector;
-     memset(sn, 0, sizeof(*sn));
+-        int n;
-@@ -XXX,XX +XXX,XX @@ int save_snapshot(const char *name, Error **errp)
+-        int64_t count;
-     if (aio_context) {
++        int64_t size;
-         aio_context_release(aio_context);
++        int64_t old_backing_size;
-     }
++        int64_t new_backing_size = 0;
-+
++        uint64_t offset;
-+    bdrv_drain_all_end();
++        int64_t n;
-+
+         float local_progress = 0;
-     if (saved_vm_running) {
-         vm_start();
+         buf_old = blk_blockalign(blk, IO_BUF_SIZE);
-     }
+         buf_new = blk_blockalign(blk, IO_BUF_SIZE);
-@@ -XXX,XX +XXX,XX @@ int load_snapshot(const char *name, Error **errp)
-     }
+-        num_sectors = blk_nb_sectors(blk);
+-        if (num_sectors < 0) {
-     /* Flush all IO requests so they don't interfere with the new state.  */
++        size = blk_getlength(blk);
--    bdrv_drain_all();
++        if (size < 0) {
-+    bdrv_drain_all_begin();
+             error_report("Could not get size of '%s': %s",
+-                         filename, strerror(-num_sectors));
-     ret = bdrv_all_goto_snapshot(name, &bs);
++                         filename, strerror(-size));
-     if (ret < 0) {
+             ret = -1;
-         error_setg(errp, "Error %d while activating snapshot '%s' on '%s'",
+             goto out;
-                      ret, name, bdrv_get_device_name(bs));
+         }
--        return ret;
+-        old_backing_num_sectors = blk_nb_sectors(blk_old_backing);
-+        goto err_drain;
+-        if (old_backing_num_sectors < 0) {
-     }
++        old_backing_size = blk_getlength(blk_old_backing);
++        if (old_backing_size < 0) {
-     /* restore the VM state */
+             char backing_name[PATH_MAX];
-     f = qemu_fopen_bdrv(bs_vm_state, 0);
-     if (!f) {
+             bdrv_get_backing_filename(bs, backing_name, sizeof(backing_name));
-         error_setg(errp, "Could not open VM state file");
+             error_report("Could not get size of '%s': %s",
--        return -EINVAL;
+-                         backing_name, strerror(-old_backing_num_sectors));
-+        ret = -EINVAL;
++                         backing_name, strerror(-old_backing_size));
-+        goto err_drain;
+             ret = -1;
-     }
+             goto out;
+         }
-     qemu_system_reset(SHUTDOWN_CAUSE_NONE);
+         if (blk_new_backing) {
-@@ -XXX,XX +XXX,XX @@ int load_snapshot(const char *name, Error **errp)
+-            new_backing_num_sectors = blk_nb_sectors(blk_new_backing);
-     ret = qemu_loadvm_state(f);
+-            if (new_backing_num_sectors < 0) {
-     aio_context_release(aio_context);
++            new_backing_size = blk_getlength(blk_new_backing);
++            if (new_backing_size < 0) {
-+    bdrv_drain_all_end();
+                 error_report("Could not get size of '%s': %s",
-+
+-                             out_baseimg, strerror(-new_backing_num_sectors));
-     migration_incoming_state_destroy();
++                             out_baseimg, strerror(-new_backing_size));
-     if (ret < 0) {
+                 ret = -1;
-         error_setg(errp, "Error %d while loading VM state", ret);
+                 goto out;
-@@ -XXX,XX +XXX,XX @@ int load_snapshot(const char *name, Error **errp)
+             }
-     }
+         }
-     return 0;
+-        if (num_sectors != 0) {
-+
+-            local_progress = (float)100 /
-+err_drain:
+-                (num_sectors / MIN(num_sectors, IO_BUF_SIZE / 512));
-+    bdrv_drain_all_end();
++        if (size != 0) {
-+    return ret;
++            local_progress = (float)100 / (size / MIN(size, IO_BUF_SIZE));
- }
+         }
- void vmstate_register_ram(MemoryRegion *mr, DeviceState *dev)
+-        for (sector = 0; sector < num_sectors; sector += n) {
 -
 -            /* How many sectors can we handle with the next read? */
 -            if (sector + (IO_BUF_SIZE / 512) <= num_sectors) {
 -                n = (IO_BUF_SIZE / 512);
 -            } else {
 -                n = num_sectors - sector;
 -            }
 +        for (offset = 0; offset < size; offset += n) {
 +            /* How many bytes can we handle with the next read? */
 +            n = MIN(IO_BUF_SIZE, size - offset);
              /* If the cluster is allocated, we don't need to take action */
 -            ret = bdrv_is_allocated(bs, sector << BDRV_SECTOR_BITS,
 -                                    n << BDRV_SECTOR_BITS, &count);
 +            ret = bdrv_is_allocated(bs, offset, n, &n);
              if (ret < 0) {
                  error_report("error while reading image metadata: %s",
                               strerror(-ret));
                  goto out;
              }
 -            /* TODO relax this once bdrv_is_allocated does not enforce
 -             * sector alignment */
 -            assert(QEMU_IS_ALIGNED(count, BDRV_SECTOR_SIZE));
 -            n = count >> BDRV_SECTOR_BITS;
              if (ret) {
                  continue;
              }
@@ -XXX,XX +XXX,XX @@ static int img_rebase(int argc, char **argv)
               * Read old and new backing file and take into consideration that
               * backing files may be smaller than the COW image.
               */
 -            if (sector >= old_backing_num_sectors) {
 -                memset(buf_old, 0, n * BDRV_SECTOR_SIZE);
 +            if (offset >= old_backing_size) {
 +                memset(buf_old, 0, n);
              } else {
 -                if (sector + n > old_backing_num_sectors) {
 -                    n = old_backing_num_sectors - sector;
 +                if (offset + n > old_backing_size) {
 +                    n = old_backing_size - offset;
                  }
 -                ret = blk_pread(blk_old_backing, sector << BDRV_SECTOR_BITS,
 -                                buf_old, n << BDRV_SECTOR_BITS);
 +                ret = blk_pread(blk_old_backing, offset, buf_old, n);
                  if (ret < 0) {
                      error_report("error while reading from old backing file");
                      goto out;
                  }
              }
 -            if (sector >= new_backing_num_sectors || !blk_new_backing) {
 -                memset(buf_new, 0, n * BDRV_SECTOR_SIZE);
 +            if (offset >= new_backing_size || !blk_new_backing) {
 +                memset(buf_new, 0, n);
              } else {
 -                if (sector + n > new_backing_num_sectors) {
 -                    n = new_backing_num_sectors - sector;
 +                if (offset + n > new_backing_size) {
 +                    n = new_backing_size - offset;
                  }
 -                ret = blk_pread(blk_new_backing, sector << BDRV_SECTOR_BITS,
 -                                buf_new, n << BDRV_SECTOR_BITS);
 +                ret = blk_pread(blk_new_backing, offset, buf_new, n);
                  if (ret < 0) {
                      error_report("error while reading from new backing file");
                      goto out;
@@ -XXX,XX +XXX,XX @@ static int img_rebase(int argc, char **argv)
              /* If they differ, we need to write to the COW file */
              uint64_t written = 0;
 -            while (written < n * BDRV_SECTOR_SIZE) {
 +            while (written < n) {
                  int64_t pnum;
 -                if (compare_buffers(buf_old + written,
 -                                    buf_new + written,
 -                                    n * BDRV_SECTOR_SIZE - written, &pnum))
 +                if (compare_buffers(buf_old + written, buf_new + written,
 +                                    n - written, &pnum))
                  {
 -                    ret = blk_pwrite(blk,
 -                                     (sector << BDRV_SECTOR_BITS) + written,
 +                    ret = blk_pwrite(blk, offset + written,
                                       buf_old + written, pnum, 0);
                      if (ret < 0) {
                          error_report("Error while writing to COW image: %s",
 --
-.8.3.1
+.13.6

-[Qemu-devel] [PULL 37/61] qed: Remove callback from qed_write_table()
+[Qemu-devel] [PULL 22/35] qemu-img: Change img_compare() to be byte-based
+From: Eric Blake <eblake@redhat.com>
+In the continuing quest to make more things byte-based, change
+the internal iteration of img_compare().  We can finally drop the
+TODO assertions added earlier, now that the entire algorithm is
+byte-based and no longer has to shift from bytes to sectors.
+Most of the change is mechanical ('total_sectors' becomes
+'total_size', 'sector_num' becomes 'offset', 'nb_sectors' becomes
+'chunk', 'progress_base' goes from sectors to bytes); some of it
+is also a cleanup (sectors_to_bytes() is now unused, loss of
+variable 'count' added earlier in commit 51b0a488).
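As a rough illustration of what a byte-based iteration looks like, the sketch below walks two in-memory buffers in chunks capped by a buffer size and reports the first differing byte offset. It is not the qemu-img code: `first_mismatch` and its parameters are hypothetical names, plain arrays stand in for the block backends, and allocation status is ignored entirely.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper, not part of qemu-img: compare two in-memory
 * "images" of total_size bytes, reading at most buf_size bytes per
 * step, and return the byte offset of the first difference, or -1
 * if the compared range is identical.
 */
static int64_t first_mismatch(const uint8_t *img1, const uint8_t *img2,
                              int64_t total_size, int64_t buf_size)
{
    int64_t offset = 0;

    while (offset < total_size) {
        /* The chunk is a byte count; no sector shifts are needed. */
        int64_t chunk = total_size - offset;
        if (chunk > buf_size) {
            chunk = buf_size;
        }
        if (memcmp(img1 + offset, img2 + offset, (size_t)chunk) != 0) {
            /* Locate the exact mismatching byte inside this chunk. */
            for (int64_t i = 0; i < chunk; i++) {
                if (img1[offset + i] != img2[offset + i]) {
                    return offset + i;
                }
            }
        }
        offset += chunk;
    }
    return -1;
}
```

Because both the position and the step are in bytes, the total size does not have to be a multiple of 512, which is the property the patch relies on when dropping the alignment assertions.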
+Signed-off-by: Eric Blake <eblake@redhat.com>
+Reviewed-by: John Snow <jsnow@redhat.com>
 Signed-off-by: Kevin Wolf <kwolf@redhat.com>
-Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
 ---
- block/qed-table.c | 47 ++++++++++++-----------------------------------
+ qemu-img.c | 124 ++++++++++++++++++++++++-------------------------------------
- block/qed.c       | 12 +++++++-----
+file changed, 48 insertions(+), 76 deletions(-)
- block/qed.h       |  8 +++-----
-files changed, 22 insertions(+), 45 deletions(-)
+diff --git a/qemu-img.c b/qemu-img.c
 diff --git a/block/qed-table.c b/block/qed-table.c
 index XXXXXXX..XXXXXXX 100644
---- a/block/qed-table.c
+--- a/qemu-img.c
-+++ b/block/qed-table.c
++++ b/qemu-img.c
-@@ -XXX,XX +XXX,XX @@ out:
+@@ -XXX,XX +XXX,XX @@ static int compare_buffers(const uint8_t *buf1, const uint8_t *buf2,
-  * @index:      Index of first element
-  * @n:          Number of elements
+ #define IO_BUF_SIZE (2 * 1024 * 1024)
-  * @flush:      Whether or not to sync to disk
-- * @cb:         Completion function
+-static int64_t sectors_to_bytes(int64_t sectors)
-- * @opaque:     Argument for completion function
+-{
-  */
+-    return sectors << BDRV_SECTOR_BITS;
 -static void qed_write_table(BDRVQEDState *s, uint64_t offset, QEDTable *table,
 -                            unsigned int index, unsigned int n, bool flush,
 -                            BlockCompletionFunc *cb, void *opaque)
 +static int qed_write_table(BDRVQEDState *s, uint64_t offset, QEDTable *table,
 +                           unsigned int index, unsigned int n, bool flush)
  {
      unsigned int sector_mask = BDRV_SECTOR_SIZE / sizeof(uint64_t) - 1;
      unsigned int start, end, i;
@@ -XXX,XX +XXX,XX @@ static void qed_write_table(BDRVQEDState *s, uint64_t offset, QEDTable *table,
      ret = 0;
  out:
      qemu_vfree(new_table);
 -    cb(opaque, ret);
 -}
 -
--/**
+ /*
-- * Propagate return value from async callback
+  * Check if passed sectors are empty (not allocated or contain only 0 bytes)
-- */
+  *
--static void qed_sync_cb(void *opaque, int ret)
+@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
--{
+     const char *fmt1 = NULL, *fmt2 = NULL, *cache, *filename1, *filename2;
--    *(int *)opaque = ret;
+     BlockBackend *blk1, *blk2;
-+    return ret;
+     BlockDriverState *bs1, *bs2;
- }
+-    int64_t total_sectors1, total_sectors2;
++    int64_t total_size1, total_size2;
- int qed_read_l1_table_sync(BDRVQEDState *s)
+     uint8_t *buf1 = NULL, *buf2 = NULL;
-@@ -XXX,XX +XXX,XX @@ int qed_read_l1_table_sync(BDRVQEDState *s)
+     int64_t pnum1, pnum2;
-     return qed_read_table(s, s->header.l1_table_offset, s->l1_table);
+     int allocated1, allocated2;
- }
+@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
+     bool progress = false, quiet = false, strict = false;
--void qed_write_l1_table(BDRVQEDState *s, unsigned int index, unsigned int n,
+     int flags;
--                        BlockCompletionFunc *cb, void *opaque)
+     bool writethrough;
-+int qed_write_l1_table(BDRVQEDState *s, unsigned int index, unsigned int n)
+-    int64_t total_sectors;
- {
+-    int64_t sector_num = 0;
-     BLKDBG_EVENT(s->bs->file, BLKDBG_L1_UPDATE);
+-    int64_t nb_sectors;
--    qed_write_table(s, s->header.l1_table_offset,
++    int64_t total_size;
--                    s->l1_table, index, n, false, cb, opaque);
++    int64_t offset = 0;
-+    return qed_write_table(s, s->header.l1_table_offset,
++    int64_t chunk;
-+                           s->l1_table, index, n, false);
+     int c;
- }
+     uint64_t progress_base;
+     bool image_opts = false;
- int qed_write_l1_table_sync(BDRVQEDState *s, unsigned int index,
+@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
-                             unsigned int n)
- {
+     buf1 = blk_blockalign(blk1, IO_BUF_SIZE);
--    int ret = -EINPROGRESS;
+     buf2 = blk_blockalign(blk2, IO_BUF_SIZE);
 -    total_sectors1 = blk_nb_sectors(blk1);
 -    if (total_sectors1 < 0) {
 +    total_size1 = blk_getlength(blk1);
 +    if (total_size1 < 0) {
          error_report("Can't get size of %s: %s",
 -                     filename1, strerror(-total_sectors1));
 +                     filename1, strerror(-total_size1));
          ret = 4;
          goto out;
      }
 -    total_sectors2 = blk_nb_sectors(blk2);
 -    if (total_sectors2 < 0) {
 +    total_size2 = blk_getlength(blk2);
 +    if (total_size2 < 0) {
          error_report("Can't get size of %s: %s",
 -                     filename2, strerror(-total_sectors2));
 +                     filename2, strerror(-total_size2));
          ret = 4;
          goto out;
      }
 -    total_sectors = MIN(total_sectors1, total_sectors2);
 -    progress_base = MAX(total_sectors1, total_sectors2);
 +    total_size = MIN(total_size1, total_size2);
 +    progress_base = MAX(total_size1, total_size2);
      qemu_progress_print(0, 100);
 -    if (strict && total_sectors1 != total_sectors2) {
 +    if (strict && total_size1 != total_size2) {
          ret = 1;
          qprintf(quiet, "Strict mode: Image size mismatch!\n");
          goto out;
      }
 -    while (sector_num < total_sectors) {
 +    while (offset < total_size) {
          int status1, status2;
 -        status1 = bdrv_block_status_above(bs1, NULL,
 -                                          sector_num * BDRV_SECTOR_SIZE,
 -                                          (total_sectors1 - sector_num) *
 -                                          BDRV_SECTOR_SIZE,
 -                                          &pnum1, NULL, NULL);
 +        status1 = bdrv_block_status_above(bs1, NULL, offset,
 +                                          total_size1 - offset, &pnum1, NULL,
 +                                          NULL);
          if (status1 < 0) {
              ret = 3;
              error_report("Sector allocation test failed for %s", filename1);
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
          }
          allocated1 = status1 & BDRV_BLOCK_ALLOCATED;
 -        status2 = bdrv_block_status_above(bs2, NULL,
 -                                          sector_num * BDRV_SECTOR_SIZE,
 -                                          (total_sectors2 - sector_num) *
 -                                          BDRV_SECTOR_SIZE,
 -                                          &pnum2, NULL, NULL);
 +        status2 = bdrv_block_status_above(bs2, NULL, offset,
 +                                          total_size2 - offset, &pnum2, NULL,
 +                                          NULL);
          if (status2 < 0) {
              ret = 3;
              error_report("Sector allocation test failed for %s", filename2);
              goto out;
          }
          allocated2 = status2 & BDRV_BLOCK_ALLOCATED;
 -        /* TODO: Relax this once comparison is byte-based, and we no longer
 -         * have to worry about sector alignment */
 -        assert(QEMU_IS_ALIGNED(pnum1 | pnum2, BDRV_SECTOR_SIZE));
          assert(pnum1 && pnum2);
 -        nb_sectors = MIN(pnum1, pnum2) >> BDRV_SECTOR_BITS;
 +        chunk = MIN(pnum1, pnum2);
          if (strict) {
              if (status1 != status2) {
                  ret = 1;
                  qprintf(quiet, "Strict mode: Offset %" PRId64
 -                        " block status mismatch!\n",
 -                        sectors_to_bytes(sector_num));
 +                        " block status mismatch!\n", offset);
                  goto out;
              }
          }
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
              if (allocated1) {
                  int64_t pnum;
 -                nb_sectors = MIN(nb_sectors, IO_BUF_SIZE >> BDRV_SECTOR_BITS);
 -                ret = blk_pread(blk1, sector_num << BDRV_SECTOR_BITS, buf1,
 -                                nb_sectors << BDRV_SECTOR_BITS);
 +                chunk = MIN(chunk, IO_BUF_SIZE);
 +                ret = blk_pread(blk1, offset, buf1, chunk);
                  if (ret < 0) {
 -                    error_report("Error while reading offset %" PRId64 " of %s:"
 -                                 " %s", sectors_to_bytes(sector_num), filename1,
 -                                 strerror(-ret));
 +                    error_report("Error while reading offset %" PRId64
 +                                 " of %s: %s",
 +                                 offset, filename1, strerror(-ret));
                      ret = 4;
                      goto out;
                  }
 -                ret = blk_pread(blk2, sector_num << BDRV_SECTOR_BITS, buf2,
 -                                nb_sectors << BDRV_SECTOR_BITS);
 +                ret = blk_pread(blk2, offset, buf2, chunk);
                  if (ret < 0) {
                      error_report("Error while reading offset %" PRId64
 -                                 " of %s: %s", sectors_to_bytes(sector_num),
 -                                 filename2, strerror(-ret));
 +                                 " of %s: %s",
 +                                 offset, filename2, strerror(-ret));
                      ret = 4;
                      goto out;
                  }
 -                ret = compare_buffers(buf1, buf2,
 -                                      nb_sectors * BDRV_SECTOR_SIZE, &pnum);
 -                if (ret || pnum != nb_sectors * BDRV_SECTOR_SIZE) {
 +                ret = compare_buffers(buf1, buf2, chunk, &pnum);
 +                if (ret || pnum != chunk) {
                      qprintf(quiet, "Content mismatch at offset %" PRId64 "!\n",
 -                            sectors_to_bytes(sector_num) + (ret ? 0 : pnum));
 +                            offset + (ret ? 0 : pnum));
                      ret = 1;
                      goto out;
                  }
              }
          } else {
 -            nb_sectors = MIN(nb_sectors, IO_BUF_SIZE >> BDRV_SECTOR_BITS);
 +            chunk = MIN(chunk, IO_BUF_SIZE);
              if (allocated1) {
 -                ret = check_empty_sectors(blk1, sector_num * BDRV_SECTOR_SIZE,
 -                                          nb_sectors * BDRV_SECTOR_SIZE,
 +                ret = check_empty_sectors(blk1, offset, chunk,
                                            filename1, buf1, quiet);
              } else {
 -                ret = check_empty_sectors(blk2, sector_num * BDRV_SECTOR_SIZE,
 -                                          nb_sectors * BDRV_SECTOR_SIZE,
 +                ret = check_empty_sectors(blk2, offset, chunk,
                                            filename2, buf1, quiet);
              }
              if (ret) {
                  goto out;
              }
          }
 -        sector_num += nb_sectors;
 -        qemu_progress_print(((float) nb_sectors / progress_base)*100, 100);
 +        offset += chunk;
 +        qemu_progress_print(((float) chunk / progress_base) * 100, 100);
      }
 -    if (total_sectors1 != total_sectors2) {
 +    if (total_size1 != total_size2) {
          BlockBackend *blk_over;
          const char *filename_over;
          qprintf(quiet, "Warning: Image size mismatch!\n");
 -        if (total_sectors1 > total_sectors2) {
 +        if (total_size1 > total_size2) {
              blk_over = blk1;
              filename_over = filename1;
          } else {
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
              filename_over = filename2;
          }
 -        while (sector_num < progress_base) {
 -            int64_t count;
 -
--    qed_write_l1_table(s, index, n, qed_sync_cb, &ret);
+-            ret = bdrv_block_status_above(blk_bs(blk_over), NULL,
--    BDRV_POLL_WHILE(s->bs, ret == -EINPROGRESS);
+-                                          sector_num * BDRV_SECTOR_SIZE,
--
+-                                          (progress_base - sector_num) *
--    return ret;
+-                                          BDRV_SECTOR_SIZE,
-+    return qed_write_l1_table(s, index, n);
+-                                          &count, NULL, NULL);
- }
++        while (offset < progress_base) {
++            ret = bdrv_block_status_above(blk_bs(blk_over), NULL, offset,
- int qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset)
++                                          progress_base - offset, &chunk,
-@@ -XXX,XX +XXX,XX @@ int qed_read_l2_table_sync(BDRVQEDState *s, QEDRequest *request, uint64_t offset
++                                          NULL, NULL);
-     return qed_read_l2_table(s, request, offset);
+             if (ret < 0) {
- }
+                 ret = 3;
+                 error_report("Sector allocation test failed for %s",
--void qed_write_l2_table(BDRVQEDState *s, QEDRequest *request,
+@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
--                        unsigned int index, unsigned int n, bool flush,
+                 goto out;
--                        BlockCompletionFunc *cb, void *opaque)
-+int qed_write_l2_table(BDRVQEDState *s, QEDRequest *request,
+             }
-+                       unsigned int index, unsigned int n, bool flush)
+-            /* TODO relax this once bdrv_block_status_above does not enforce
- {
+-             * sector alignment */
-     BLKDBG_EVENT(s->bs->file, BLKDBG_L2_UPDATE);
+-            assert(QEMU_IS_ALIGNED(count, BDRV_SECTOR_SIZE));
--    qed_write_table(s, request->l2_table->offset,
+-            nb_sectors = count >> BDRV_SECTOR_BITS;
--                    request->l2_table->table, index, n, flush, cb, opaque);
+             if (ret & BDRV_BLOCK_ALLOCATED && !(ret & BDRV_BLOCK_ZERO)) {
-+    return qed_write_table(s, request->l2_table->offset,
+-                nb_sectors = MIN(nb_sectors, IO_BUF_SIZE >> BDRV_SECTOR_BITS);
-+                           request->l2_table->table, index, n, flush);
+-                ret = check_empty_sectors(blk_over,
- }
+-                                          sector_num * BDRV_SECTOR_SIZE,
+-                                          nb_sectors * BDRV_SECTOR_SIZE,
- int qed_write_l2_table_sync(BDRVQEDState *s, QEDRequest *request,
++                chunk = MIN(chunk, IO_BUF_SIZE);
-                             unsigned int index, unsigned int n, bool flush)
++                ret = check_empty_sectors(blk_over, offset, chunk,
- {
+                                           filename_over, buf1, quiet);
--    int ret = -EINPROGRESS;
+                 if (ret) {
--
+                     goto out;
--    qed_write_l2_table(s, request, index, n, flush, qed_sync_cb, &ret);
+                 }
--    BDRV_POLL_WHILE(s->bs, ret == -EINPROGRESS);
+             }
--
+-            sector_num += nb_sectors;
--    return ret;
+-            qemu_progress_print(((float) nb_sectors / progress_base)*100, 100);
-+    return qed_write_l2_table(s, request, index, n, flush);
++            offset += chunk;
- }
++            qemu_progress_print(((float) chunk / progress_base) * 100, 100);
-diff --git a/block/qed.c b/block/qed.c
+         }
-index XXXXXXX..XXXXXXX 100644
+     }
 --- a/block/qed.c
 +++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_l1_update(void *opaque, int ret)
      index = qed_l1_index(s, acb->cur_pos);
      s->l1_table->offsets[index] = acb->request.l2_table->offset;
 -    qed_write_l1_table(s, index, 1, qed_commit_l2_update, acb);
 +    ret = qed_write_l1_table(s, index, 1);
 +    qed_commit_l2_update(acb, ret);
  }
  /**
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_l2_update(QEDAIOCB *acb, int ret, uint64_t offset)
      if (need_alloc) {
          /* Write out the whole new L2 table */
 -        qed_write_l2_table(s, &acb->request, 0, s->table_nelems, true,
 -                           qed_aio_write_l1_update, acb);
 +        ret = qed_write_l2_table(s, &acb->request, 0, s->table_nelems, true);
 +        qed_aio_write_l1_update(acb, ret);
      } else {
          /* Write out only the updated part of the L2 table */
 -        qed_write_l2_table(s, &acb->request, index, acb->cur_nclusters, false,
 -                           qed_aio_next_io_cb, acb);
 +        ret = qed_write_l2_table(s, &acb->request, index, acb->cur_nclusters,
 +                                 false);
 +        qed_aio_next_io(acb, ret);
      }
      return;
 diff --git a/block/qed.h b/block/qed.h
 index XXXXXXX..XXXXXXX 100644
 --- a/block/qed.h
 +++ b/block/qed.h
@@ -XXX,XX +XXX,XX @@ void qed_commit_l2_cache_entry(L2TableCache *l2_cache, CachedL2Table *l2_table);
   * Table I/O functions
   */
  int qed_read_l1_table_sync(BDRVQEDState *s);
 -void qed_write_l1_table(BDRVQEDState *s, unsigned int index, unsigned int n,
 -                        BlockCompletionFunc *cb, void *opaque);
 +int qed_write_l1_table(BDRVQEDState *s, unsigned int index, unsigned int n);
  int qed_write_l1_table_sync(BDRVQEDState *s, unsigned int index,
                              unsigned int n);
  int qed_read_l2_table_sync(BDRVQEDState *s, QEDRequest *request,
                             uint64_t offset);
  int qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset);
 -void qed_write_l2_table(BDRVQEDState *s, QEDRequest *request,
 -                        unsigned int index, unsigned int n, bool flush,
 -                        BlockCompletionFunc *cb, void *opaque);
 +int qed_write_l2_table(BDRVQEDState *s, QEDRequest *request,
 +                       unsigned int index, unsigned int n, bool flush);
  int qed_write_l2_table_sync(BDRVQEDState *s, QEDRequest *request,
                              unsigned int index, unsigned int n, bool flush);
 --
-.8.3.1
+.13.6

-[Qemu-devel] [PULL 20/61] qcow2: Split do_perform_cow() into _read(), _encrypt() and _write()
+[Qemu-devel] [PULL 23/35] block: Align block status requests
-From: Alberto Garcia <berto@igalia.com>
+From: Eric Blake <eblake@redhat.com>
-This patch splits do_perform_cow() into three separate functions to
+Any device that has request_alignment greater than 512 should be
-read, encrypt and write the COW regions.
+unable to report status at a finer granularity; it may also be
 simpler for such devices to be guaranteed that the block layer
 has rounded things out to the granularity boundary (the way the
 block layer already rounds all other I/O out).  Besides, getting
 the code correct for super-sector alignment also benefits us
 for the fact that our public interface now has byte granularity,
 even though none of our drivers have byte-level callbacks.
-perform_cow() can now read both regions first, then encrypt them and
+Add an assertion in blkdebug that proves that the block layer
-finally write them to disk. The memory allocation is also done in
+never requests status of unaligned sections, similar to what it
-this function now, using one single buffer large enough to hold both
+does on other requests (while still keeping the generic helper
-regions.
+in place for when future patches add a throttle driver).  Note
 that iotest 177 already covers this (it would fail if you use
 just the blkdebug.c hunk without the io.c changes).  Meanwhile,
 we can drop assertions in callers that no longer have to pass
 in sector-aligned addresses.
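The rounding described above can be sketched in isolation. The helpers below are hypothetical stand-ins (the patch itself uses QEMU_ALIGN_DOWN and ROUND_UP inside bdrv_co_block_status()); they only show the arithmetic of widening a byte range to an alignment boundary and then trimming the driver's answer back to what the caller asked for. The sketch assumes a nonzero alignment and non-negative inputs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for QEMU's alignment macros; align must be > 0. */
static int64_t align_down(int64_t n, int64_t align)
{
    return n / align * align;
}

static int64_t align_up(int64_t n, int64_t align)
{
    return align_down(n + align - 1, align);
}

typedef struct {
    int64_t offset; /* start of the widened range, in bytes */
    int64_t bytes;  /* length of the widened range, in bytes */
} AlignedRange;

/* Round [offset, offset + bytes) out to align boundaries on both ends. */
static AlignedRange round_out(int64_t offset, int64_t bytes, int64_t align)
{
    AlignedRange r;

    r.offset = align_down(offset, align);
    r.bytes = align_up(offset + bytes, align) - r.offset;
    return r;
}

/*
 * The driver reports driver_pnum bytes starting at aligned_offset.
 * Drop the part before the caller's offset, then clamp the result
 * to the caller's original length.
 */
static int64_t clamp_pnum(int64_t driver_pnum, int64_t offset,
                          int64_t aligned_offset, int64_t bytes)
{
    int64_t pnum = driver_pnum - (offset - aligned_offset);

    return pnum > bytes ? bytes : pnum;
}
```

For example, a 100-byte query at offset 1000 on a device with 4096-byte alignment is widened to the range starting at 0 with length 4096; if the driver then reports 4096 bytes of uniform status, the caller sees 100.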
-Signed-off-by: Alberto Garcia <berto@igalia.com>
+There is a mid-function scope added for 'count' and 'longret',
-Reviewed-by: Kevin Wolf <kwolf@redhat.com>
+for a couple of reasons: first, an upcoming patch will add an
 'if' statement that checks whether a driver has an old- or
 new-style callback, and can conveniently use the same scope for
 less indentation churn at that time.  Second, since we are
 trying to get rid of sector-based computations, wrapping things
 in a scope makes it easier to group and see what will be
 deleted in a final cleanup patch once all drivers have been
 converted to the new-style callback.
 Signed-off-by: Eric Blake <eblake@redhat.com>
 Signed-off-by: Kevin Wolf <kwolf@redhat.com>
 ---
- block/qcow2-cluster.c | 117 +++++++++++++++++++++++++++++++++++++-------------
+ include/block/block_int.h |  3 +-
-file changed, 87 insertions(+), 30 deletions(-)
+ block/blkdebug.c          | 13 ++++++++-
  block/io.c                | 71 ++++++++++++++++++++++++++++++-----------------
 files changed, 59 insertions(+), 28 deletions(-)
-diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
+diff --git a/include/block/block_int.h b/include/block/block_int.h
 index XXXXXXX..XXXXXXX 100644
---- a/block/qcow2-cluster.c
+--- a/include/block/block_int.h
-+++ b/block/qcow2-cluster.c
++++ b/include/block/block_int.h
-@@ -XXX,XX +XXX,XX @@ int qcow2_encrypt_sectors(BDRVQcow2State *s, int64_t sector_num,
+@@ -XXX,XX +XXX,XX @@ struct BlockDriver {
-     return 0;
+      * according to the current layer, and should not set
       * BDRV_BLOCK_ALLOCATED, but may set BDRV_BLOCK_RAW.  See block.h
       * for the meaning of _DATA, _ZERO, and _OFFSET_VALID.  The block
 -     * layer guarantees non-NULL pnum and file.
 +     * layer guarantees input aligned to request_alignment, as well as
 +     * non-NULL pnum and file.
       */
      int64_t coroutine_fn (*bdrv_co_get_block_status)(BlockDriverState *bs,
          int64_t sector_num, int nb_sectors, int *pnum,
 diff --git a/block/blkdebug.c b/block/blkdebug.c
 index XXXXXXX..XXXXXXX 100644
 --- a/block/blkdebug.c
 +++ b/block/blkdebug.c
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn blkdebug_co_pdiscard(BlockDriverState *bs,
      return bdrv_co_pdiscard(bs->file->bs, offset, bytes);
  }
--static int coroutine_fn do_perform_cow(BlockDriverState *bs,
++static int64_t coroutine_fn blkdebug_co_get_block_status(
--                                       uint64_t src_cluster_offset,
++    BlockDriverState *bs, int64_t sector_num, int nb_sectors, int *pnum,
--                                       uint64_t cluster_offset,
++    BlockDriverState **file)
--                                       unsigned offset_in_cluster,
++{
--                                       unsigned bytes)
++    assert(QEMU_IS_ALIGNED(sector_num | nb_sectors,
-+static int coroutine_fn do_perform_cow_read(BlockDriverState *bs,
++                           DIV_ROUND_UP(bs->bl.request_alignment,
-+                                            uint64_t src_cluster_offset,
++                                        BDRV_SECTOR_SIZE)));
-+                                            unsigned offset_in_cluster,
++    return bdrv_co_get_block_status_from_file(bs, sector_num, nb_sectors,
-+                                            uint8_t *buffer,
++                                              pnum, file);
 +                                            unsigned bytes)
  {
 -    BDRVQcow2State *s = bs->opaque;
      QEMUIOVector qiov;
 -    struct iovec iov;
 +    struct iovec iov = { .iov_base = buffer, .iov_len = bytes };
      int ret;
      if (bytes == 0) {
          return 0;
      }
 -    iov.iov_len = bytes;
 -    iov.iov_base = qemu_try_blockalign(bs, iov.iov_len);
 -    if (iov.iov_base == NULL) {
 -        return -ENOMEM;
 -    }
 -
      qemu_iovec_init_external(&qiov, &iov, 1);
      BLKDBG_EVENT(bs->file, BLKDBG_COW_READ);
      if (!bs->drv) {
 -        ret = -ENOMEDIUM;
 -        goto out;
 +        return -ENOMEDIUM;
      }
      /* Call .bdrv_co_readv() directly instead of using the public block-layer
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn do_perform_cow(BlockDriverState *bs,
      ret = bs->drv->bdrv_co_preadv(bs, src_cluster_offset + offset_in_cluster,
                                    bytes, &qiov, 0);
      if (ret < 0) {
 -        goto out;
 +        return ret;
      }
 -    if (bs->encrypted) {
 +    return 0;
 +}
 +
-+static bool coroutine_fn do_perform_cow_encrypt(BlockDriverState *bs,
+ static void blkdebug_close(BlockDriverState *bs)
-+                                                uint64_t src_cluster_offset,
+ {
-+                                                unsigned offset_in_cluster,
+     BDRVBlkdebugState *s = bs->opaque;
-+                                                uint8_t *buffer,
+@@ -XXX,XX +XXX,XX @@ static BlockDriver bdrv_blkdebug = {
-+                                                unsigned bytes)
+     .bdrv_co_flush_to_disk  = blkdebug_co_flush,
-+{
+     .bdrv_co_pwrite_zeroes  = blkdebug_co_pwrite_zeroes,
-+    if (bytes && bs->encrypted) {
+     .bdrv_co_pdiscard       = blkdebug_co_pdiscard,
-+        BDRVQcow2State *s = bs->opaque;
+-    .bdrv_co_get_block_status = bdrv_co_get_block_status_from_file,
-         int64_t sector = (src_cluster_offset + offset_in_cluster)
++    .bdrv_co_get_block_status = blkdebug_co_get_block_status,
-                          >> BDRV_SECTOR_BITS;
-         assert(s->cipher);
+     .bdrv_debug_event           = blkdebug_debug_event,
-         assert((offset_in_cluster & ~BDRV_SECTOR_MASK) == 0);
+     .bdrv_debug_breakpoint      = blkdebug_debug_breakpoint,
-         assert((bytes & ~BDRV_SECTOR_MASK) == 0);
+diff --git a/block/io.c b/block/io.c
--        if (qcow2_encrypt_sectors(s, sector, iov.iov_base, iov.iov_base,
+index XXXXXXX..XXXXXXX 100644
-+        if (qcow2_encrypt_sectors(s, sector, buffer, buffer,
+--- a/block/io.c
-                                   bytes >> BDRV_SECTOR_BITS, true, NULL) < 0) {
++++ b/block/io.c
--            ret = -EIO;
+@@ -XXX,XX +XXX,XX @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
--            goto out;
+ {
-+            return false;
+     int64_t total_size;
-         }
+     int64_t n; /* bytes */
 -    int64_t ret;
 +    int ret;
      int64_t local_map = 0;
      BlockDriverState *local_file = NULL;
 -    int count; /* sectors */
 +    int64_t aligned_offset, aligned_bytes;
 +    uint32_t align;
      assert(pnum);
      *pnum = 0;
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
      }
-+    return true;
-+}
+     bdrv_inc_in_flight(bs);
 +
-+static int coroutine_fn do_perform_cow_write(BlockDriverState *bs,
++    /* Round out to request_alignment boundaries */
-+                                             uint64_t cluster_offset,
++    /* TODO: until we have a byte-based driver callback, we also have to
-+                                             unsigned offset_in_cluster,
++     * round out to sectors, even if that is bigger than request_alignment */
-+                                             uint8_t *buffer,
++    align = MAX(bs->bl.request_alignment, BDRV_SECTOR_SIZE);
-+                                             unsigned bytes)
++    aligned_offset = QEMU_ALIGN_DOWN(offset, align);
-+{
++    aligned_bytes = ROUND_UP(offset + bytes, align) - aligned_offset;
 +    QEMUIOVector qiov;
 +    struct iovec iov = { .iov_base = buffer, .iov_len = bytes };
 +    int ret;
 +
-+    if (bytes == 0) {
++    {
-+        return 0;
++        int count; /* sectors */
 +        int64_t longret;
 +
 +        assert(QEMU_IS_ALIGNED(aligned_offset | aligned_bytes,
 +                               BDRV_SECTOR_SIZE));
 +        /*
 +         * The contract allows us to return pnum smaller than bytes, even
 +         * if the next query would see the same status; we truncate the
 +         * request to avoid overflowing the driver's 32-bit interface.
 +         */
 +        longret = bs->drv->bdrv_co_get_block_status(
 +            bs, aligned_offset >> BDRV_SECTOR_BITS,
 +            MIN(INT_MAX, aligned_bytes) >> BDRV_SECTOR_BITS, &count,
 +            &local_file);
 +        if (longret < 0) {
 +            assert(INT_MIN <= longret);
 +            ret = longret;
 +            goto out;
 +        }
 +        if (longret & BDRV_BLOCK_OFFSET_VALID) {
 +            local_map = longret & BDRV_BLOCK_OFFSET_MASK;
 +        }
 +        ret = longret & ~BDRV_BLOCK_OFFSET_MASK;
 +        *pnum = count * BDRV_SECTOR_SIZE;
 +    }
 +
-+    qemu_iovec_init_external(&qiov, &iov, 1);
+     /*
+-     * TODO: Rather than require aligned offsets, we could instead
-     ret = qcow2_pre_write_overlap_check(bs, 0,
+-     * round to the driver's request_alignment here, then touch up
-             cluster_offset + offset_in_cluster, bytes);
+-     * count afterwards back to the caller's expectations.
-     if (ret < 0) {
+-     */
 -    assert(QEMU_IS_ALIGNED(offset | bytes, BDRV_SECTOR_SIZE));
 -    /*
 -     * The contract allows us to return pnum smaller than bytes, even
 -     * if the next query would see the same status; we truncate the
 -     * request to avoid overflowing the driver's 32-bit interface.
 +     * The driver's result must be a multiple of request_alignment.
 +     * Clamp pnum and adjust map to original request.
       */
 -    bytes = MIN(bytes, BDRV_REQUEST_MAX_BYTES);
 -    ret = bs->drv->bdrv_co_get_block_status(bs, offset >> BDRV_SECTOR_BITS,
 -                                            bytes >> BDRV_SECTOR_BITS, &count,
 -                                            &local_file);
 -    if (ret < 0) {
 -        goto out;
-+        return ret;
++    assert(QEMU_IS_ALIGNED(*pnum, align) && align > offset - aligned_offset);
 +    *pnum -= offset - aligned_offset;
 +    if (*pnum > bytes) {
 +        *pnum = bytes;
      }
+     if (ret & BDRV_BLOCK_OFFSET_VALID) {
-     BLKDBG_EVENT(bs->file, BLKDBG_COW_WRITE);
+-        local_map = ret & BDRV_BLOCK_OFFSET_MASK;
-     ret = bdrv_co_pwritev(bs->file, cluster_offset + offset_in_cluster,
++        local_map += offset - aligned_offset;
                            bytes, &qiov, 0);
      if (ret < 0) {
 -        goto out;
 +        return ret;
      }
+-    *pnum = count * BDRV_SECTOR_SIZE;
--    ret = 0;
--out:
+     if (ret & BDRV_BLOCK_RAW) {
--    qemu_vfree(iov.iov_base);
+         assert(ret & BDRV_BLOCK_OFFSET_VALID && local_file);
--    return ret;
+         ret = bdrv_co_block_status(local_file, want_zero, local_map,
-+    return 0;
+                                    *pnum, pnum, &local_map, &local_file);
- }
+-        assert(ret < 0 ||
+-               QEMU_IS_ALIGNED(*pnum | local_map, BDRV_SECTOR_SIZE));
+         goto out;
@@ -XXX,XX +XXX,XX @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
      BDRVQcow2State *s = bs->opaque;
      Qcow2COWRegion *start = &m->cow_start;
      Qcow2COWRegion *end = &m->cow_end;
 +    unsigned buffer_size;
 +    uint8_t *start_buffer, *end_buffer;
      int ret;
 +    assert(start->nb_bytes <= UINT_MAX - end->nb_bytes);
 +
      if (start->nb_bytes == 0 && end->nb_bytes == 0) {
          return 0;
      }
-+    /* Reserve a buffer large enough to store the data from both the
+@@ -XXX,XX +XXX,XX @@ early_out:
-+     * start and end COW regions. Add some padding in the middle if
+     if (map) {
-+     * necessary to make sure that the end region is optimally aligned */
+         *map = local_map;
 +    buffer_size = QEMU_ALIGN_UP(start->nb_bytes, bdrv_opt_mem_align(bs)) +
 +        end->nb_bytes;
 +    start_buffer = qemu_try_blockalign(bs, buffer_size);
 +    if (start_buffer == NULL) {
 +        return -ENOMEM;
 +    }
 +    /* The part of the buffer where the end region is located */
 +    end_buffer = start_buffer + buffer_size - end->nb_bytes;
 +
      qemu_co_mutex_unlock(&s->lock);
 -    ret = do_perform_cow(bs, m->offset, m->alloc_offset,
 -                         start->offset, start->nb_bytes);
 +    /* First we read the existing data from both COW regions */
 +    ret = do_perform_cow_read(bs, m->offset, start->offset,
 +                              start_buffer, start->nb_bytes);
      if (ret < 0) {
          goto fail;
      }
+-    if (ret >= 0) {
--    ret = do_perform_cow(bs, m->offset, m->alloc_offset,
+-        ret &= ~BDRV_BLOCK_OFFSET_MASK;
--                         end->offset, end->nb_bytes);
+-    } else {
-+    ret = do_perform_cow_read(bs, m->offset, end->offset,
+-        assert(INT_MIN <= ret);
-+                              end_buffer, end->nb_bytes);
+-    }
 +    if (ret < 0) {
 +        goto fail;
 +    }
 +
 +    /* Encrypt the data if necessary before writing it */
 +    if (bs->encrypted) {
 +        if (!do_perform_cow_encrypt(bs, m->offset, start->offset,
 +                                    start_buffer, start->nb_bytes) ||
 +            !do_perform_cow_encrypt(bs, m->offset, end->offset,
 +                                    end_buffer, end->nb_bytes)) {
 +            ret = -EIO;
 +            goto fail;
 +        }
 +    }
 +
 +    /* And now we can write everything */
 +    ret = do_perform_cow_write(bs, m->alloc_offset, start->offset,
 +                               start_buffer, start->nb_bytes);
 +    if (ret < 0) {
 +        goto fail;
 +    }
 +    ret = do_perform_cow_write(bs, m->alloc_offset, end->offset,
 +                               end_buffer, end->nb_bytes);
  fail:
      qemu_co_mutex_lock(&s->lock);
@@ -XXX,XX +XXX,XX @@ fail:
          qcow2_cache_depends_on_flush(s->l2_table_cache);
      }
 +    qemu_vfree(start_buffer);
      return ret;
  }
 --
-.8.3.1
+.13.6
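
For illustration only, the buffer layout that the perform_cow() hunk above sets up can be sketched in standalone C. This is a minimal sketch, not QEMU code: align_up() stands in for QEMU_ALIGN_UP(), mem_align stands in for the value returned by bdrv_opt_mem_align(), and the helper names cow_buffer_size() and cow_end_offset() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for QEMU_ALIGN_UP(); assumes align is non-zero. */
static size_t align_up(size_t n, size_t align)
{
    return (n + align - 1) / align * align;
}

/* Hypothetical helper: size of one buffer holding both COW regions.
 * The start region is padded up to mem_align so that the end region
 * begins on an aligned boundary. */
static size_t cow_buffer_size(size_t start_bytes, size_t end_bytes,
                              size_t mem_align)
{
    assert(mem_align > 0);
    return align_up(start_bytes, mem_align) + end_bytes;
}

/* Hypothetical helper: offset of the end region inside that buffer,
 * computed the same way as end_buffer in the patch
 * (start + buffer_size - end_bytes). */
static size_t cow_end_offset(size_t start_bytes, size_t end_bytes,
                             size_t mem_align)
{
    return cow_buffer_size(start_bytes, end_bytes, mem_align) - end_bytes;
}
```

With a 1000-byte start region, a 300-byte end region and 512-byte alignment, the sketch reserves 1324 bytes and places the end region at offset 1024.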

-[Qemu-devel] [PULL 06/61] migration: avoid recursive AioContext locking in save_vmstate()
+[Qemu-devel] [PULL 24/35] block: Reduce bdrv_aligned_preadv() rounding
-From: Stefan Hajnoczi <stefanha@redhat.com>
+From: Eric Blake <eblake@redhat.com>
-AioContext was designed to allow nested acquire/release calls.  It uses
+Now that bdrv_is_allocated accepts non-aligned inputs, we can
-a recursive mutex so callers don't need to worry about nesting...or so
+remove the TODO added in commit d6a644bb.
 we thought.
-BDRV_POLL_WHILE() is used to wait for block I/O requests.  It releases
+Signed-off-by: Eric Blake <eblake@redhat.com>
-the AioContext temporarily around aio_poll().  This gives IOThreads a
+Reviewed-by: John Snow <jsnow@redhat.com>
 chance to acquire the AioContext to process I/O completions.
 It turns out that recursive locking and BDRV_POLL_WHILE() don't mix.
 BDRV_POLL_WHILE() only releases the AioContext once, so the IOThread
 will not be able to acquire the AioContext if it was acquired
 multiple times.
 Instead of trying to release AioContext n times in BDRV_POLL_WHILE(),
 this patch simply avoids nested locking in save_vmstate().  It's the
 simplest fix and we should step back to consider the big picture with
 all the recent changes to block layer threading.
 This patch is the final fix to solve 'savevm' hanging with -object
 iothread.
 Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
 Reviewed-by: Eric Blake <eblake@redhat.com>
 Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
 Signed-off-by: Kevin Wolf <kwolf@redhat.com>
 ---
- migration/savevm.c | 12 +++++++++++-
+ block/io.c | 8 ++------
- 1 file changed, 11 insertions(+), 1 deletion(-)
+ 1 file changed, 2 insertions(+), 6 deletions(-)
-diff --git a/migration/savevm.c b/migration/savevm.c
+diff --git a/block/io.c b/block/io.c
 index XXXXXXX..XXXXXXX 100644
---- a/migration/savevm.c
+--- a/block/io.c
-+++ b/migration/savevm.c
++++ b/block/io.c
-@@ -XXX,XX +XXX,XX @@ int save_snapshot(const char *name, Error **errp)
+@@ -XXX,XX +XXX,XX @@ static int coroutine_fn bdrv_aligned_preadv(BdrvChild *child,
          goto the_end;
      }
-+    /* The bdrv_all_create_snapshot() call that follows acquires the AioContext
+     if (flags & BDRV_REQ_COPY_ON_READ) {
-+     * for itself.  BDRV_POLL_WHILE() does not support nested locking because
+-        /* TODO: Simplify further once bdrv_is_allocated no longer
-+     * it only releases the lock once.  Therefore synchronous I/O will deadlock
+-         * requires sector alignment */
-+     * unless we release the AioContext before bdrv_all_create_snapshot().
+-        int64_t start = QEMU_ALIGN_DOWN(offset, BDRV_SECTOR_SIZE);
-+     */
+-        int64_t end = QEMU_ALIGN_UP(offset + bytes, BDRV_SECTOR_SIZE);
-+    aio_context_release(aio_context);
+         int64_t pnum;
-+    aio_context = NULL;
-+
+-        ret = bdrv_is_allocated(bs, start, end - start, &pnum);
-     ret = bdrv_all_create_snapshot(sn, bs, vm_state_size, &bs);
++        ret = bdrv_is_allocated(bs, offset, bytes, &pnum);
-     if (ret < 0) {
+         if (ret < 0) {
-         error_setg(errp, "Error while creating snapshot on '%s'",
+             goto out;
-@@ -XXX,XX +XXX,XX @@ int save_snapshot(const char *name, Error **errp)
+         }
-     ret = 0;
+-        if (!ret || pnum != end - start) {
-  the_end:
++        if (!ret || pnum != bytes) {
--    aio_context_release(aio_context);
+             ret = bdrv_co_do_copy_on_readv(child, offset, bytes, qiov);
-+    if (aio_context) {
+             goto out;
-+        aio_context_release(aio_context);
+         }
 +    }
      if (saved_vm_running) {
          vm_start();
      }
 --
-.8.3.1
+.13.6
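
As a rough illustration of what the removed rounding did, the following standalone sketch compares the sector-widened query range with the byte-exact one. It is not taken from QEMU: SECTOR_SIZE mirrors BDRV_SECTOR_SIZE (512), and QueryRange, widened_query() and exact_query() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512  /* mirrors BDRV_SECTOR_SIZE; assumed here */

/* Hypothetical type describing the range handed to bdrv_is_allocated(). */
typedef struct {
    int64_t start;
    int64_t length;
} QueryRange;

/* Old behaviour: widen the request to sector boundaries first. */
static QueryRange widened_query(int64_t offset, int64_t bytes)
{
    assert(offset >= 0 && bytes >= 0);
    int64_t start = offset / SECTOR_SIZE * SECTOR_SIZE;
    int64_t end = (offset + bytes + SECTOR_SIZE - 1)
                  / SECTOR_SIZE * SECTOR_SIZE;
    return (QueryRange){ start, end - start };
}

/* New behaviour: pass the caller's byte range through unchanged. */
static QueryRange exact_query(int64_t offset, int64_t bytes)
{
    return (QueryRange){ offset, bytes };
}
```

For a 100-byte read at offset 1000, the widened form queries 1024 bytes starting at 512, while the exact form queries only the 100 bytes requested, so the later comparison against pnum uses bytes rather than end - start.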

-[Qemu-devel] [PULL 24/61] qcow2: Use offset_into_cluster() and offset_to_l2_index()
+[Qemu-devel] [PULL 25/35] qcow2: Reduce is_zero() rounding
-From: Alberto Garcia <berto@igalia.com>
+From: Eric Blake <eblake@redhat.com>
-We already have functions for doing these calculations, so let's use
+Now that bdrv_is_allocated accepts non-aligned inputs, we can
-them instead of doing everything by hand. This makes the code a bit
+remove the TODO added in earlier refactoring.
 more readable.
-Signed-off-by: Alberto Garcia <berto@igalia.com>
+Signed-off-by: Eric Blake <eblake@redhat.com>
 Signed-off-by: Kevin Wolf <kwolf@redhat.com>
 ---
- block/qcow2-cluster.c | 4 ++--
+ block/qcow2.c | 12 +++---------
- block/qcow2.c         | 2 +-
+ 1 file changed, 3 insertions(+), 9 deletions(-)
  2 files changed, 3 insertions(+), 3 deletions(-)
-diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
-index XXXXXXX..XXXXXXX 100644
---- a/block/qcow2-cluster.c
-+++ b/block/qcow2-cluster.c
-@@ -XXX,XX +XXX,XX @@ int qcow2_get_cluster_offset(BlockDriverState *bs, uint64_t offset,
-     /* find the cluster offset for the given disk offset */
--    l2_index = (offset >> s->cluster_bits) & (s->l2_size - 1);
-+    l2_index = offset_to_l2_index(s, offset);
-     *cluster_offset = be64_to_cpu(l2_table[l2_index]);
-     nb_clusters = size_to_clusters(s, bytes_needed);
-@@ -XXX,XX +XXX,XX @@ static int get_cluster_table(BlockDriverState *bs, uint64_t offset,
-     /* find the cluster offset for the given disk offset */
--    l2_index = (offset >> s->cluster_bits) & (s->l2_size - 1);
-+    l2_index = offset_to_l2_index(s, offset);
-     *new_l2_table = l2_table;
-     *new_l2_index = l2_index;
 diff --git a/block/qcow2.c b/block/qcow2.c
 index XXXXXXX..XXXXXXX 100644
 --- a/block/qcow2.c
 +++ b/block/qcow2.c
-@@ -XXX,XX +XXX,XX @@ static int validate_table_offset(BlockDriverState *bs, uint64_t offset,
+@@ -XXX,XX +XXX,XX @@ static bool is_zero(BlockDriverState *bs, int64_t offset, int64_t bytes)
  {
      int64_t nr;
      int res;
 -    int64_t start;
 -
 -    /* TODO: Widening to sector boundaries should only be needed as
 -     * long as we can't query finer granularity. */
 -    start = QEMU_ALIGN_DOWN(offset, BDRV_SECTOR_SIZE);
 -    bytes = QEMU_ALIGN_UP(offset + bytes, BDRV_SECTOR_SIZE) - start;
      /* Clamp to image length, before checking status of underlying sectors */
 -    if (start + bytes > bs->total_sectors * BDRV_SECTOR_SIZE) {
 -        bytes = bs->total_sectors * BDRV_SECTOR_SIZE - start;
 +    if (offset + bytes > bs->total_sectors * BDRV_SECTOR_SIZE) {
 +        bytes = bs->total_sectors * BDRV_SECTOR_SIZE - offset;
      }
-     /* Tables must be cluster aligned */
+     if (!bytes) {
--    if (offset & (s->cluster_size - 1)) {
+         return true;
 +    if (offset_into_cluster(s, offset) != 0) {
          return -EINVAL;
      }
+-    res = bdrv_block_status_above(bs, NULL, start, bytes, &nr, NULL, NULL);
++    res = bdrv_block_status_above(bs, NULL, offset, bytes, &nr, NULL, NULL);
+     return res >= 0 && (res & BDRV_BLOCK_ZERO) && nr == bytes;
+ }
 --
-.8.3.1
+.13.6
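
The clamp at the top of is_zero() can be sketched in isolation as follows. This is a simplified illustration under stated assumptions: SECTOR_SIZE stands in for BDRV_SECTOR_SIZE, clamp_to_image() is a hypothetical helper, and the comparison is written as bytes > image_bytes - offset (rather than offset + bytes > image_bytes as in the patch) purely to sidestep signed overflow in the sketch.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512  /* mirrors BDRV_SECTOR_SIZE; assumed here */

/* Hypothetical helper: limit a byte range so it does not extend past
 * the end of an image that is total_sectors sectors long. */
static int64_t clamp_to_image(int64_t offset, int64_t bytes,
                              int64_t total_sectors)
{
    int64_t image_bytes = total_sectors * SECTOR_SIZE;

    assert(offset >= 0 && bytes >= 0 && offset <= image_bytes);
    if (bytes > image_bytes - offset) {
        bytes = image_bytes - offset;
    }
    return bytes;
}
```

On a two-sector (1024-byte) image, a 100-byte request at offset 1000 is clamped to 24 bytes; because the offset is no longer rounded down to a sector boundary, the clamp works directly on the caller's byte offset.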

-[Qemu-devel] [PULL 44/61] qed: Add return value to qed_aio_write_cow()
+[Qemu-devel] [PULL 26/35] qemu-io: Relax 'alloc' now that block-status doesn't assert
-Don't recurse into qed_aio_next_io() and qed_aio_complete() here, but
+From: Eric Blake <eblake@redhat.com>
 just return an error code and let the caller handle it.
-While refactoring qed_aio_write_alloc() to accommodate the change,
+Previously, the alloc command required that input parameters be
-qed_aio_write_zero_cluster() ended up with a single line, so I chose to
+sector-aligned and clamped to 32 bits, because the underlying
-inline that line and remove the function completely.
+bdrv_is_allocated used a 32-bit parameter and asserted aligned
 inputs.  But now that we have fixed block status to report a
 64-bit bytes value, and to properly round requests on behalf of
 guests, we can pass any values, and can use qemu-io to add
 coverage that our rounding is correct regardless of the guest
 alignment constraints.
+Update iotest 177 to intentionally probe block status at
+unaligned boundaries as well as with a bytes value that does not
+map to 32-bit sectors, which also required tweaking the image
+prep to leave an unallocated portion to the image under test.
+Signed-off-by: Eric Blake <eblake@redhat.com>
 Signed-off-by: Kevin Wolf <kwolf@redhat.com>
-Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
 ---
- block/qed.c | 58 +++++++++++++++++++++-------------------------------------
+ qemu-io-cmds.c             | 13 -------------
- 1 file changed, 21 insertions(+), 37 deletions(-)
+ tests/qemu-iotests/177     | 12 ++++++++++--
  tests/qemu-iotests/177.out | 19 ++++++++++++++-----
  3 files changed, 24 insertions(+), 20 deletions(-)
-diff --git a/block/qed.c b/block/qed.c
+diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
 index XXXXXXX..XXXXXXX 100644
---- a/block/qed.c
+--- a/qemu-io-cmds.c
-+++ b/block/qed.c
++++ b/qemu-io-cmds.c
-@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_main(QEDAIOCB *acb)
+@@ -XXX,XX +XXX,XX @@ static int alloc_f(BlockBackend *blk, int argc, char **argv)
- /**
+     if (offset < 0) {
-  * Populate untouched regions of new data cluster
+         print_cvtnum_err(offset, argv[1]);
-  */
+         return 0;
--static void qed_aio_write_cow(void *opaque, int ret)
+-    } else if (!QEMU_IS_ALIGNED(offset, BDRV_SECTOR_SIZE)) {
-+static int qed_aio_write_cow(QEDAIOCB *acb)
+-        printf("%" PRId64 " is not a sector-aligned value for 'offset'\n",
- {
+-               offset);
--    QEDAIOCB *acb = opaque;
+-        return 0;
      BDRVQEDState *s = acb_to_s(acb);
      uint64_t start, len, offset;
 +    int ret;
      /* Populate front untouched region of new data cluster */
      start = qed_start_of_cluster(s, acb->cur_pos);
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_cow(void *opaque, int ret)
      trace_qed_aio_write_prefill(s, acb, start, len, acb->cur_cluster);
      ret = qed_copy_from_backing_file(s, start, len, acb->cur_cluster);
 -    if (ret) {
 -        qed_aio_complete(acb, ret);
 -        return;
 +    if (ret < 0) {
 +        return ret;
      }
-     /* Populate back untouched region of new data cluster */
+     if (argc == 3) {
-@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_cow(void *opaque, int ret)
+@@ -XXX,XX +XXX,XX @@ static int alloc_f(BlockBackend *blk, int argc, char **argv)
+         if (count < 0) {
-     trace_qed_aio_write_postfill(s, acb, start, len, offset);
+             print_cvtnum_err(count, argv[2]);
-     ret = qed_copy_from_backing_file(s, start, len, offset);
+             return 0;
--    if (ret) {
+-        } else if (count > INT_MAX * BDRV_SECTOR_SIZE) {
--        qed_aio_complete(acb, ret);
+-            printf("length argument cannot exceed %llu, given %s\n",
--        return;
+-                   INT_MAX * BDRV_SECTOR_SIZE, argv[2]);
 -            return 0;
          }
      } else {
          count = BDRV_SECTOR_SIZE;
      }
 -    if (!QEMU_IS_ALIGNED(count, BDRV_SECTOR_SIZE)) {
 -        printf("%" PRId64 " is not a sector-aligned value for 'count'\n",
 -               count);
 -        return 0;
 -    }
--
--    ret = qed_aio_write_main(acb);
+     remaining = count;
-     if (ret < 0) {
+     sum_alloc = 0;
--        qed_aio_complete(acb, ret);
+diff --git a/tests/qemu-iotests/177 b/tests/qemu-iotests/177
--        return;
+index XXXXXXX..XXXXXXX 100755
-+        return ret;
+--- a/tests/qemu-iotests/177
-     }
++++ b/tests/qemu-iotests/177
--    qed_aio_next_io(acb, 0);
+@@ -XXX,XX +XXX,XX @@ echo "== setting up files =="
  TEST_IMG="$TEST_IMG.base" _make_test_img $size
  $QEMU_IO -c "write -P 11 0 $size" "$TEST_IMG.base" | _filter_qemu_io
  _make_test_img -b "$TEST_IMG.base"
 -$QEMU_IO -c "write -P 22 0 $size" "$TEST_IMG" | _filter_qemu_io
 +$QEMU_IO -c "write -P 22 0 110M" "$TEST_IMG" | _filter_qemu_io
  # Limited to 64k max-transfer
  echo
@@ -XXX,XX +XXX,XX @@ $QEMU_IO -c "open -o $options,$limits blkdebug::$TEST_IMG" \
           -c "discard 80000001 30M" | _filter_qemu_io
  echo
 +echo "== block status smaller than alignment =="
 +limits=align=4k
 +$QEMU_IO -c "open -o $options,$limits blkdebug::$TEST_IMG" \
 +     -c "alloc 1 1" -c "alloc 0x6dffff0 1000" -c "alloc 127m 5P" \
 +     -c map | _filter_qemu_io
 +
-+    return qed_aio_write_main(acb);
++echo
  echo "== verify image content =="
  function verify_io()
@@ -XXX,XX +XXX,XX @@ function verify_io()
      echo read -P 0 32M 32M
      echo read -P 22 64M 13M
      echo read -P $discarded 77M 29M
 -    echo read -P 22 106M 22M
 +    echo read -P 22 106M 4M
 +    echo read -P 11 110M 18M
  }
- /**
+ verify_io | $QEMU_IO -r "$TEST_IMG" | _filter_qemu_io
-@@ -XXX,XX +XXX,XX @@ static bool qed_should_set_need_check(BDRVQEDState *s)
+diff --git a/tests/qemu-iotests/177.out b/tests/qemu-iotests/177.out
-     return !(s->header.features & QED_F_NEED_CHECK);
+index XXXXXXX..XXXXXXX 100644
- }
+--- a/tests/qemu-iotests/177.out
++++ b/tests/qemu-iotests/177.out
--static void qed_aio_write_zero_cluster(void *opaque, int ret)
+@@ -XXX,XX +XXX,XX @@ Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=134217728
--{
+ wrote 134217728/134217728 bytes at offset 0
--    QEDAIOCB *acb = opaque;
+ 128 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
--
+ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728 backing_file=TEST_DIR/t.IMGFMT.base
--    if (ret) {
+-wrote 134217728/134217728 bytes at offset 0
--        qed_aio_complete(acb, ret);
+-128 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
--        return;
++wrote 115343360/115343360 bytes at offset 0
--    }
++110 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
--
--    ret = qed_aio_write_l2_update(acb, 1);
+ == constrained alignment and max-transfer ==
--    if (ret < 0) {
+ wrote 131072/131072 bytes at offset 1000
--        qed_aio_complete(acb, ret);
+@@ -XXX,XX +XXX,XX @@ wrote 33554432/33554432 bytes at offset 33554432
--        return;
+ discard 31457280/31457280 bytes at offset 80000001
--    }
+ 30 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
--    qed_aio_next_io(acb, 0);
--}
++== block status smaller than alignment ==
--
++1/1 bytes allocated at offset 1 bytes
- /**
++16/1000 bytes allocated at offset 110 MiB
-  * Write new data cluster
++0/1048576 bytes allocated at offset 127 MiB
-  *
++110 MiB (0x6e00000) bytes     allocated at offset 0 bytes (0x0)
-@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_zero_cluster(void *opaque, int ret)
++18 MiB (0x1200000) bytes not allocated at offset 110 MiB (0x6e00000)
  static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
  {
      BDRVQEDState *s = acb_to_s(acb);
 -    BlockCompletionFunc *cb;
      int ret;
      /* Cancel timer when the first allocating request comes in */
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
              qed_aio_start_io(acb);
              return;
          }
 -
 -        cb = qed_aio_write_zero_cluster;
      } else {
 -        cb = qed_aio_write_cow;
          acb->cur_cluster = qed_alloc_clusters(s, acb->cur_nclusters);
      }
      if (qed_should_set_need_check(s)) {
          s->header.features |= QED_F_NEED_CHECK;
          ret = qed_write_header(s);
 -        cb(acb, ret);
 +        if (ret < 0) {
 +            qed_aio_complete(acb, ret);
 +            return;
 +        }
 +    }
 +
-+    if (acb->flags & QED_AIOCB_ZERO) {
+ == verify image content ==
-+        ret = qed_aio_write_l2_update(acb, 1);
+ read 1000/1000 bytes at offset 0
-     } else {
+ 1000 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
--        cb(acb, 0);
+@@ -XXX,XX +XXX,XX @@ read 13631488/13631488 bytes at offset 67108864
-+        ret = qed_aio_write_cow(acb);
+ 13 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-     }
+ read 30408704/30408704 bytes at offset 80740352
-+    if (ret < 0) {
+ 29 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-+        qed_aio_complete(acb, ret);
+-read 23068672/23068672 bytes at offset 111149056
-+        return;
+-22 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-+    }
++read 4194304/4194304 bytes at offset 111149056
-+    qed_aio_next_io(acb, 0);
++4 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
- }
++read 18874368/18874368 bytes at offset 115343360
++18 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
- /**
+ Offset          Length          File
  0               0x800000        TEST_DIR/t.IMGFMT
  0x900000        0x2400000       TEST_DIR/t.IMGFMT
  0x3c00000       0x1100000       TEST_DIR/t.IMGFMT
 -0x6a00000       0x1600000       TEST_DIR/t.IMGFMT
 +0x6a00000       0x400000        TEST_DIR/t.IMGFMT
  No errors were found on the image.
  *** done
 --
-.8.3.1
+.13.6
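
The three 'alloc' results expected in 177.out can be checked with a little arithmetic. The sketch below is a simplified model, not the qemu-io implementation: it assumes the overlay is allocated in exactly [0, 110 MiB) of a 128 MiB image, as the test setup arranges, and AllocResult and model_alloc() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define MiB (INT64_C(1) << 20)

/* Hypothetical result pair: "allocated/examined bytes". */
typedef struct {
    int64_t allocated;
    int64_t examined;
} AllocResult;

static int64_t min64(int64_t a, int64_t b)
{
    return a < b ? a : b;
}

/* Model an image whose first alloc_end bytes are allocated and whose
 * remainder, up to image_size, is not. The examined length is the
 * requested count clamped to the end of the image. */
static AllocResult model_alloc(int64_t offset, int64_t count,
                               int64_t alloc_end, int64_t image_size)
{
    AllocResult r;

    assert(offset >= 0 && count >= 0);
    assert(offset <= image_size && alloc_end <= image_size);

    r.examined = min64(count, image_size - offset);
    int64_t stop = min64(offset + r.examined, alloc_end);
    r.allocated = stop > offset ? stop - offset : 0;
    return r;
}
```

Under these assumptions, "alloc 0x6dffff0 1000" yields 16/1000 because only the 16 bytes below 0x6e00000 (110 MiB) are allocated, and "alloc 127m 5P" yields 0/1048576 because only 1 MiB remains before the end of the image.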

-[Qemu-devel] [PULL 11/61] virtio-pci: use ioeventfd even when KVM is disabled
+[Qemu-devel] [PULL 27/35] qemu-img.1: Image invalidation on qemu-img commit
-From: Stefan Hajnoczi <stefanha@redhat.com>
+From: Max Reitz <mreitz@redhat.com>
-Old kvm.ko versions only supported a tiny number of ioeventfds so
+qemu-img commit invalidates all images between base and top.  This
-virtio-pci avoids ioeventfds when kvm_has_many_ioeventfds() returns 0.
+should be mentioned in the man page.
-Do not check kvm_has_many_ioeventfds() when KVM is disabled since it
+Suggested-by: Ping Li <pingl@redhat.com>
-always returns 0.  Since commit 8c56c1a592b5092d91da8d8943c17777d6462a6f
+Signed-off-by: Max Reitz <mreitz@redhat.com>
-("memory: emulate ioeventfd") it has been possible to use ioeventfds in
+Reviewed-by: Jeff Cody <jcody@redhat.com>
 qtest or TCG mode.
 This patch makes -device virtio-blk-pci,iothread=iothread0 work even
 when KVM is disabled.
 I have tested that virtio-blk-pci works under TCG both with and without
 iothread.
 Cc: Michael S. Tsirkin <mst@redhat.com>
 Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
 Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
 Signed-off-by: Kevin Wolf <kwolf@redhat.com>
 ---
- hw/virtio/virtio-pci.c | 2 +-
+ qemu-img.texi | 9 ++++-----
- 1 file changed, 1 insertion(+), 1 deletion(-)
+ 1 file changed, 4 insertions(+), 5 deletions(-)
-diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
+diff --git a/qemu-img.texi b/qemu-img.texi
 index XXXXXXX..XXXXXXX 100644
---- a/hw/virtio/virtio-pci.c
+--- a/qemu-img.texi
-+++ b/hw/virtio/virtio-pci.c
++++ b/qemu-img.texi
-@@ -XXX,XX +XXX,XX @@ static void virtio_pci_realize(PCIDevice *pci_dev, Error **errp)
+@@ -XXX,XX +XXX,XX @@ If the backing chain of the given image file @var{filename} has more than one
-     bool pcie_port = pci_bus_is_express(pci_dev->bus) &&
+ layer, the backing file into which the changes will be committed may be
-                      !pci_bus_is_root(pci_dev->bus);
+ specified as @var{base} (which has to be part of @var{filename}'s backing
+ chain). If @var{base} is not specified, the immediate backing file of the top
--    if (!kvm_has_many_ioeventfds()) {
+-image (which is @var{filename}) will be used. For reasons of consistency,
-+    if (kvm_enabled() && !kvm_has_many_ioeventfds()) {
+-explicitly specifying @var{base} will always imply @code{-d} (since emptying an
-         proxy->flags &= ~VIRTIO_PCI_FLAG_USE_IOEVENTFD;
+-image after committing to an indirect backing file would lead to different data
-     }
+-being read from the image due to content in the intermediate backing chain
 -overruling the commit target).
 +image (which is @var{filename}) will be used. Note that after a commit operation
 +all images between @var{base} and the top image will be invalid and may return
 +garbage data when read. For this reason, @code{-b} implies @code{-d} (so that
 +the top image stays valid).
  @item compare [-f @var{fmt}] [-F @var{fmt}] [-T @var{src_cache}] [-p] [-s] [-q] @var{filename1} @var{filename2}
 --
-.8.3.1
+.13.6

-[Qemu-devel] [PULL 10/61] throttle: Update throttle-groups.c documentation
+[Qemu-devel] [PULL 28/35] qcow2: Use BDRV_SECTOR_BITS instead of its literal value
 From: Alberto Garcia <berto@igalia.com>
-There used to be throttle_timers_{detach,attach}_aio_context() calls
+BDRV_SECTOR_BITS is defined to be 9 in block.h (and BDRV_SECTOR_SIZE
-in bdrv_set_aio_context(), but since 7ca7f0f6db1fedd28d490795d778cf239
+is calculated from that), but there are still a couple of places where
-they are now in blk_set_aio_context().
+we are using the literal value instead of the macro.
 Signed-off-by: Alberto Garcia <berto@igalia.com>
-Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
+Message-id: 20171009153856.20387-1-berto@igalia.com
-Signed-off-by: Kevin Wolf <kwolf@redhat.com>
+Signed-off-by: Max Reitz <mreitz@redhat.com>
 ---
- block/throttle-groups.c | 2 +-
+ block/qcow2.c | 4 ++--
- 1 file changed, 1 insertion(+), 1 deletion(-)
+ 1 file changed, 2 insertions(+), 2 deletions(-)
-diff --git a/block/throttle-groups.c b/block/throttle-groups.c
+diff --git a/block/qcow2.c b/block/qcow2.c
 index XXXXXXX..XXXXXXX 100644
---- a/block/throttle-groups.c
+--- a/block/qcow2.c
-+++ b/block/throttle-groups.c
++++ b/block/qcow2.c
-@@ -XXX,XX +XXX,XX @@
+@@ -XXX,XX +XXX,XX @@ static int qcow2_do_open(BlockDriverState *bs, QDict *options, int flags,
-  * Again, all this is handled internally and is mostly transparent to
-  * the outside. The 'throttle_timers' field however has an additional
+     s->cluster_bits = header.cluster_bits;
-  * constraint because it may be temporarily invalid (see for example
+     s->cluster_size = 1 << s->cluster_bits;
-- * bdrv_set_aio_context()). Therefore in this file a thread will
+-    s->cluster_sectors = 1 << (s->cluster_bits - 9);
-+ * blk_set_aio_context()). Therefore in this file a thread will
++    s->cluster_sectors = 1 << (s->cluster_bits - BDRV_SECTOR_BITS);
-  * access some other BlockBackend's timers only after verifying that
-  * that BlockBackend has throttled requests in the queue.
+     /* Initialise version 3 header fields */
-  */
+     if (header.version == 2) {
@@ -XXX,XX +XXX,XX @@ static int64_t coroutine_fn qcow2_co_get_block_status(BlockDriverState *bs,
      bytes = MIN(INT_MAX, nb_sectors * BDRV_SECTOR_SIZE);
      qemu_co_mutex_lock(&s->lock);
 -    ret = qcow2_get_cluster_offset(bs, sector_num << 9, &bytes,
 +    ret = qcow2_get_cluster_offset(bs, sector_num << BDRV_SECTOR_BITS, &bytes,
                                     &cluster_offset);
      qemu_co_mutex_unlock(&s->lock);
      if (ret < 0) {
 --
-.8.3.1
+.13.6
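
The two expressions touched by this patch can be illustrated in standalone C. This is only a sketch: SECTOR_BITS and SECTOR_SIZE mirror BDRV_SECTOR_BITS (9, per the commit message) and BDRV_SECTOR_SIZE, and the helper names cluster_sectors() and sector_to_byte_offset() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_BITS 9                   /* value of BDRV_SECTOR_BITS */
#define SECTOR_SIZE (1 << SECTOR_BITS)  /* 512 bytes */

/* Number of sectors per cluster, as in
 * 1 << (s->cluster_bits - BDRV_SECTOR_BITS). */
static int cluster_sectors(int cluster_bits)
{
    assert(cluster_bits >= SECTOR_BITS && cluster_bits < 31);
    return 1 << (cluster_bits - SECTOR_BITS);
}

/* Byte offset of a sector; multiplying by SECTOR_SIZE is equivalent to
 * sector_num << BDRV_SECTOR_BITS for non-negative sector numbers. */
static int64_t sector_to_byte_offset(int64_t sector_num)
{
    assert(sector_num >= 0);
    return sector_num * SECTOR_SIZE;
}
```

With 64k clusters (cluster_bits of 16) this gives 128 sectors per cluster; the named constant yields the same result as the literal 9 while documenting where the shift amount comes from.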

-[Qemu-devel] [PULL 03/61] qemu-iotests: Test exiting qemu with running job
+[Qemu-devel] [PULL 29/35] iotests: Add test for dataplane mirroring
-When qemu is exited, all running jobs should be cancelled successfully.
+From: Max Reitz <mreitz@redhat.com>
 This adds a test for this for all types of block jobs that currently
 exist in qemu.
-Signed-off-by: Kevin Wolf <kwolf@redhat.com>
+Signed-off-by: Max Reitz <mreitz@redhat.com>
 Message-id: 20170929170843.3711-1-mreitz@redhat.com
 Reviewed-by: Eric Blake <eblake@redhat.com>
+Signed-off-by: Max Reitz <mreitz@redhat.com>
 ---
- tests/qemu-iotests/185     | 206 +++++++++++++++++++++++++++++++++++++++++++++
+ tests/qemu-iotests/127     | 97 ++++++++++++++++++++++++++++++++++++++++++++++
- tests/qemu-iotests/185.out |  59 +++++++++++++
+ tests/qemu-iotests/127.out | 14 +++++++
- tests/qemu-iotests/group   |   1 +
+ tests/qemu-iotests/group   |  1 +
- 3 files changed, 266 insertions(+)
+ 3 files changed, 112 insertions(+)
- create mode 100755 tests/qemu-iotests/185
+ create mode 100755 tests/qemu-iotests/127
- create mode 100644 tests/qemu-iotests/185.out
+ create mode 100644 tests/qemu-iotests/127.out
-diff --git a/tests/qemu-iotests/185 b/tests/qemu-iotests/185
+diff --git a/tests/qemu-iotests/127 b/tests/qemu-iotests/127
 new file mode 100755
 index XXXXXXX..XXXXXXX
 --- /dev/null
-+++ b/tests/qemu-iotests/185
++++ b/tests/qemu-iotests/127
 @@ -XXX,XX +XXX,XX @@
 +#!/bin/bash
 +#
-+# Test exiting qemu while jobs are still running
++# Test case for mirroring with dataplane
 +#
 +# Copyright (C) 2017 Red Hat, Inc.
 +#
 +# This program is free software; you can redistribute it and/or modify
 +# it under the terms of the GNU General Public License as published by
 ...
 +# You should have received a copy of the GNU General Public License
 +# along with this program.  If not, see <http://www.gnu.org/licenses/>.
 +#
 +
 +# creator
-+owner=kwolf@redhat.com
++owner=mreitz@redhat.com
 +
-+seq=`basename $0`
++seq=$(basename $0)
 +echo "QA output created by $seq"
 +
-+here=`pwd`
++here=$PWD
-+status=1 # failure is the default!
++status=1    # failure is the default!
 +
 +MIG_SOCKET="${TEST_DIR}/migrate"
 +
 +_cleanup()
 +{
-+    rm -f "${TEST_IMG}.mid"
++    _cleanup_qemu
 +    rm -f "${TEST_IMG}.copy"
 +    _cleanup_test_img
-+    _cleanup_qemu
++    _rm_test_img "$TEST_IMG.overlay0"
 +    _rm_test_img "$TEST_IMG.overlay1"
 +}
 +trap "_cleanup; exit \$status" 0 1 2 3 15
 +
-+# get standard environment, filters and checks
++# get standard environment, filters and qemu instance handling
 +. ./common.rc
 +. ./common.filter
 +. ./common.qemu
 +
 +_supported_fmt qcow2
 +_supported_proto file
 +_supported_os Linux
 +
-+size=64M
++IMG_SIZE=64K
 +TEST_IMG="${TEST_IMG}.base" _make_test_img $size
 +
-+echo
++_make_test_img $IMG_SIZE
-+echo === Starting VM ===
++TEST_IMG="$TEST_IMG.overlay0" _make_test_img -b "$TEST_IMG" $IMG_SIZE
-+echo
++TEST_IMG="$TEST_IMG.overlay1" _make_test_img -b "$TEST_IMG" $IMG_SIZE
 +
-+qemu_comm_method="qmp"
++# So that we actually have something to mirror and the job does not return
 +# immediately (which may be bad because then we cannot know whether the
 +# 'return' or the 'BLOCK_JOB_READY' comes first).
 +$QEMU_IO -c 'write 0 42' "$TEST_IMG.overlay0" | _filter_qemu_io
 +
++# We cannot use virtio-blk here because that does not actually set the attached
++# BB's AioContext in qtest mode
 +_launch_qemu \
-+    -drive file="${TEST_IMG}.base",cache=$CACHEMODE,driver=$IMGFMT,id=disk
++    -object iothread,id=iothr \
-+h=$QEMU_HANDLE
++    -blockdev node-name=source,driver=$IMGFMT,file.driver=file,file.filename="$TEST_IMG.overlay0" \
-+_send_qemu_cmd $h "{ 'execute': 'qmp_capabilities' }" 'return'
++    -device virtio-scsi,id=scsi-bus,iothread=iothr \
 +    -device scsi-hd,bus=scsi-bus.0,drive=source
 +
-+echo
++_send_qemu_cmd $QEMU_HANDLE \
-+echo === Creating backing chain ===
++    "{ 'execute': 'qmp_capabilities' }" \
-+echo
++    'return'
 +
-+_send_qemu_cmd $h \
++_send_qemu_cmd $QEMU_HANDLE \
-+    "{ 'execute': 'blockdev-snapshot-sync',
++    "{ 'execute': 'drive-mirror',
-+       'arguments': { 'device': 'disk',
++       'arguments': {
-+                      'snapshot-file': '$TEST_IMG.mid',
++           'job-id': 'mirror',
-+                      'format': '$IMGFMT',
++           'device': 'source',
-+                      'mode': 'absolute-paths' } }" \
++           'target': '$TEST_IMG.overlay1',
-+    "return"
++           'mode':   'existing',
 +           'sync':   'top'
 +       } }" \
 +    'BLOCK_JOB_READY'
 +
-+_send_qemu_cmd $h \
++# The backing BDS should be assigned the overlay's AioContext
-+    "{ 'execute': 'human-monitor-command',
++_send_qemu_cmd $QEMU_HANDLE \
-+       'arguments': { 'command-line':
++    "{ 'execute': 'block-job-complete',
-+                      'qemu-io disk \"write 0 4M\"' } }" \
++       'arguments': { 'device': 'mirror' } }" \
-+    "return"
++    'BLOCK_JOB_COMPLETED'
 +
-+_send_qemu_cmd $h \
++_send_qemu_cmd $QEMU_HANDLE \
-+    "{ 'execute': 'blockdev-snapshot-sync',
++    "{ 'execute': 'quit' }" \
-+       'arguments': { 'device': 'disk',
++    'return'
 +                      'snapshot-file': '$TEST_IMG',
 +                      'format': '$IMGFMT',
 +                      'mode': 'absolute-paths' } }" \
 +    "return"
 +
-+echo
++wait=yes _cleanup_qemu
 +echo === Start commit job and exit qemu ===
 +echo
 +
 +# Note that the reference output intentionally includes the 'offset' field in
 +# BLOCK_JOB_CANCELLED events for all of the following block jobs. They are
 +# predictable and any change in the offsets would hint at a bug in the job
 +# throttling code.
 +#
 +# In order to achieve these predictable offsets, all of the following tests
 +# use speed=65536. Each job will perform exactly one iteration before it has
 +# to sleep at least for a second, which is plenty of time for the 'quit' QMP
 +# command to be received (after receiving the command, the rest runs
 +# synchronously, so jobs can arbitrarily continue or complete).
 +#
 +# The buffer size for commit and streaming is 512k (waiting for 8 seconds after
 +# the first request), for active commit and mirror it's large enough to cover
 +# the full 4M, and for backup it's the qcow2 cluster size, which we know is
 +# 64k. As all of these are at least as large as the speed, we are sure that the
 +# offset doesn't advance after the first iteration before qemu exits.
 +
 +_send_qemu_cmd $h \
 +    "{ 'execute': 'block-commit',
 +       'arguments': { 'device': 'disk',
 +                      'base':'$TEST_IMG.base',
 +                      'top': '$TEST_IMG.mid',
 +                      'speed': 65536 } }" \
 +    "return"
 +
 +_send_qemu_cmd $h "{ 'execute': 'quit' }" "return"
 +wait=1 _cleanup_qemu
 +
 +echo
 +echo === Start active commit job and exit qemu ===
 +echo
 +
 +_launch_qemu \
 +    -drive file="${TEST_IMG}",cache=$CACHEMODE,driver=$IMGFMT,id=disk
 +h=$QEMU_HANDLE
 +_send_qemu_cmd $h "{ 'execute': 'qmp_capabilities' }" 'return'
 +
 +_send_qemu_cmd $h \
 +    "{ 'execute': 'block-commit',
 +       'arguments': { 'device': 'disk',
 +                      'base':'$TEST_IMG.base',
 +                      'speed': 65536 } }" \
 +    "return"
 +
 +_send_qemu_cmd $h "{ 'execute': 'quit' }" "return"
 +wait=1 _cleanup_qemu
 +
 +echo
 +echo === Start mirror job and exit qemu ===
 +echo
 +
 +_launch_qemu \
 +    -drive file="${TEST_IMG}",cache=$CACHEMODE,driver=$IMGFMT,id=disk
 +h=$QEMU_HANDLE
 +_send_qemu_cmd $h "{ 'execute': 'qmp_capabilities' }" 'return'
 +
 +_send_qemu_cmd $h \
 +    "{ 'execute': 'drive-mirror',
 +       'arguments': { 'device': 'disk',
 +                      'target': '$TEST_IMG.copy',
 +                      'format': '$IMGFMT',
 +                      'sync': 'full',
 +                      'speed': 65536 } }" \
 +    "return"
 +
 +_send_qemu_cmd $h "{ 'execute': 'quit' }" "return"
 +wait=1 _cleanup_qemu
 +
 +echo
 +echo === Start backup job and exit qemu ===
 +echo
 +
 +_launch_qemu \
 +    -drive file="${TEST_IMG}",cache=$CACHEMODE,driver=$IMGFMT,id=disk
 +h=$QEMU_HANDLE
 +_send_qemu_cmd $h "{ 'execute': 'qmp_capabilities' }" 'return'
 +
 +_send_qemu_cmd $h \
 +    "{ 'execute': 'drive-backup',
 +       'arguments': { 'device': 'disk',
 +                      'target': '$TEST_IMG.copy',
 +                      'format': '$IMGFMT',
 +                      'sync': 'full',
 +                      'speed': 65536 } }" \
 +    "return"
 +
 +_send_qemu_cmd $h "{ 'execute': 'quit' }" "return"
 +wait=1 _cleanup_qemu
 +
 +echo
 +echo === Start streaming job and exit qemu ===
 +echo
 +
 +_launch_qemu \
 +    -drive file="${TEST_IMG}",cache=$CACHEMODE,driver=$IMGFMT,id=disk
 +h=$QEMU_HANDLE
 +_send_qemu_cmd $h "{ 'execute': 'qmp_capabilities' }" 'return'
 +
 +_send_qemu_cmd $h \
 +    "{ 'execute': 'block-stream',
 +       'arguments': { 'device': 'disk',
 +                      'speed': 65536 } }" \
 +    "return"
 +
 +_send_qemu_cmd $h "{ 'execute': 'quit' }" "return"
 +wait=1 _cleanup_qemu
 +
 +_check_test_img
 +
 +# success, all done
-+echo "*** done"
++echo '*** done'
 +rm -f $seq.full
 +status=0
-diff --git a/tests/qemu-iotests/185.out b/tests/qemu-iotests/185.out
+diff --git a/tests/qemu-iotests/127.out b/tests/qemu-iotests/127.out
 new file mode 100644
 index XXXXXXX..XXXXXXX
 --- /dev/null
-+++ b/tests/qemu-iotests/185.out
++++ b/tests/qemu-iotests/127.out
 @@ -XXX,XX +XXX,XX @@
-+QA output created by 185
++QA output created by 127
-+Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=67108864
++Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=65536
-+
++Formatting 'TEST_DIR/t.IMGFMT.overlay0', fmt=IMGFMT size=65536 backing_file=TEST_DIR/t.IMGFMT
-+=== Starting VM ===
++Formatting 'TEST_DIR/t.IMGFMT.overlay1', fmt=IMGFMT size=65536 backing_file=TEST_DIR/t.IMGFMT
-+
++wrote 42/42 bytes at offset 0
-+{"return": {}}
++42 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +=== Creating backing chain ===
 +
 +Formatting 'TEST_DIR/t.qcow2.mid', fmt=qcow2 size=67108864 backing_file=TEST_DIR/t.qcow2.base backing_fmt=qcow2 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
 +{"return": {}}
 +wrote 4194304/4194304 bytes at offset 0
 +4 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +{"return": ""}
 +Formatting 'TEST_DIR/t.qcow2', fmt=qcow2 size=67108864 backing_file=TEST_DIR/t.qcow2.mid backing_fmt=qcow2 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
 +{"return": {}}
 +
 +=== Start commit job and exit qemu ===
 +
 +{"return": {}}
 +{"return": {}}
-+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false}}
++{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_READY", "data": {"device": "mirror", "len": 65536, "offset": 65536, "speed": 0, "type": "mirror"}}
 +{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "disk", "len": 67108864, "offset": 524288, "speed": 65536, "type": "commit"}}
 +
 +=== Start active commit job and exit qemu ===
 +
 +{"return": {}}
-+{"return": {}}
++{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "mirror", "len": 65536, "offset": 65536, "speed": 0, "type": "mirror"}}
 +{"return": {}}
 +{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false}}
-+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "disk", "len": 4194304, "offset": 4194304, "speed": 65536, "type": "commit"}}
-+
-+=== Start mirror job and exit qemu ===
-+
-+{"return": {}}
-+Formatting 'TEST_DIR/t.qcow2.copy', fmt=qcow2 size=67108864 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
-+{"return": {}}
-+{"return": {}}
-+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false}}
-+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "disk", "len": 4194304, "offset": 4194304, "speed": 65536, "type": "mirror"}}
-+
-+=== Start backup job and exit qemu ===
-+
-+{"return": {}}
-+Formatting 'TEST_DIR/t.qcow2.copy', fmt=qcow2 size=67108864 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
-+{"return": {}}
-+{"return": {}}
-+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false}}
-+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "disk", "len": 67108864, "offset": 65536, "speed": 65536, "type": "backup"}}
-+
-+=== Start streaming job and exit qemu ===
-+
-+{"return": {}}
-+{"return": {}}
-+{"return": {}}
-+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false}}
-+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "disk", "len": 67108864, "offset": 524288, "speed": 65536, "type": "stream"}}
-+No errors were found on the image.
 +*** done
 diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
 index XXXXXXX..XXXXXXX 100644
 --- a/tests/qemu-iotests/group
 +++ b/tests/qemu-iotests/group
 @@ -XXX,XX +XXX,XX @@
-rw auto migration
+rw auto backing
-rw auto quick
+rw auto
-rw auto migration
+rw auto backing
-+185 rw auto
++127 rw auto backing quick
 rw auto quick
 rw auto quick
 rw auto quick
 --
-.8.3.1
+.13.6
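
The throttling comment in test 185 above relies on simple arithmetic: with speed=65536, a job that copies one buffer must then sleep for roughly buffer size divided by speed. The following is a minimal sketch of that calculation, not code from the test. The helper name delay_after_first_iteration is hypothetical, and the plain "bytes / speed" model is an assumption that happens to match the 8 seconds quoted in the comment.

```shell
#!/bin/sh
# Hypothetical helper, not part of qemu-iotests: estimate how long a throttled
# block job sleeps after its first iteration, assuming delay = bytes / speed.
speed=65536   # bytes per second, the value used by every job in the test

delay_after_first_iteration()
{
    # $1 = number of bytes copied in the first iteration
    echo $(( $1 / speed ))
}

commit_delay=$(delay_after_first_iteration $((512 * 1024)))        # commit, stream
backup_delay=$(delay_after_first_iteration $((64 * 1024)))         # backup (one qcow2 cluster)
mirror_delay=$(delay_after_first_iteration $((4 * 1024 * 1024)))   # mirror, active commit

echo "commit/stream: ${commit_delay}s, backup: ${backup_delay}s, mirror/active commit: ${mirror_delay}s"
```

Under this model every job sleeps for at least one second after its first iteration, which is the window the test uses to deliver the 'quit' command before any offset can advance.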

-[Qemu-devel] [PULL 08/61] doc: Document generic -blockdev options
+Deleted patch
-This adds documentation for the -blockdev options that apply to all
-nodes independent of the block driver used.
-All options that are shared by -blockdev and -drive are now explained in
-the section for -blockdev. The documentation of -drive mentions that all
--blockdev options are accepted as well.
-Signed-off-by: Kevin Wolf <kwolf@redhat.com>
-Reviewed-by: Eric Blake <eblake@redhat.com>
-Reviewed-by: Max Reitz <mreitz@redhat.com>
----
- qemu-options.hx | 108 +++++++++++++++++++++++++++++++++++++++++---------------
-file changed, 79 insertions(+), 29 deletions(-)
-diff --git a/qemu-options.hx b/qemu-options.hx
-index XXXXXXX..XXXXXXX 100644
---- a/qemu-options.hx
-+++ b/qemu-options.hx
-@@ -XXX,XX +XXX,XX @@ DEF("blockdev", HAS_ARG, QEMU_OPTION_blockdev,
-     "          [,read-only=on|off][,detect-zeroes=on|off|unmap]\n"
-     "          [,driver specific parameters...]\n"
-     "                configure a block backend\n", QEMU_ARCH_ALL)
-+STEXI
-+@item -blockdev @var{option}[,@var{option}[,@var{option}[,...]]]
-+@findex -blockdev
-+
-+Define a new block driver node.
-+
-+@table @option
-+@item Valid options for any block driver node:
-+
-+@table @code
-+@item driver
-+Specifies the block driver to use for the given node.
-+@item node-name
-+This defines the name of the block driver node by which it will be referenced
-+later. The name must be unique, i.e. it must not match the name of a different
-+block driver node, or (if you use @option{-drive} as well) the ID of a drive.
-+
-+If no node name is specified, it is automatically generated. The generated node
-+name is not intended to be predictable and changes between QEMU invocations.
-+For the top level, an explicit node name must be specified.
-+@item read-only
-+Open the node read-only. Guest write attempts will fail.
-+@item cache.direct
-+The host page cache can be avoided with @option{cache.direct=on}. This will
-+attempt to do disk IO directly to the guest's memory. QEMU may still perform an
-+internal copy of the data.
-+@item cache.no-flush
-+In case you don't care about data integrity over host failures, you can use
-+@option{cache.no-flush=on}. This option tells QEMU that it never needs to write
-+any data to the disk but can instead keep things in cache. If anything goes
-+wrong, like your host losing power, the disk storage getting disconnected
-+accidentally, etc. your image will most probably be rendered unusable.
-+@item discard=@var{discard}
-+@var{discard} is one of "ignore" (or "off") or "unmap" (or "on") and controls
-+whether @code{discard} (also known as @code{trim} or @code{unmap}) requests are
-+ignored or passed to the filesystem. Some machine types may not support
-+discard requests.
-+@item detect-zeroes=@var{detect-zeroes}
-+@var{detect-zeroes} is "off", "on" or "unmap" and enables the automatic
-+conversion of plain zero writes by the OS to driver specific optimized
-+zero write commands. You may even choose "unmap" if @var{discard} is set
-+to "unmap" to allow a zero write to be converted to an @code{unmap} operation.
-+@end table
-+
-+@end table
-+
-+ETEXI
- DEF("drive", HAS_ARG, QEMU_OPTION_drive,
-     "-drive [file=file][,if=type][,bus=n][,unit=m][,media=d][,index=i]\n"
-@@ -XXX,XX +XXX,XX @@ STEXI
- @item -drive @var{option}[,@var{option}[,@var{option}[,...]]]
- @findex -drive
--Define a new drive. Valid options are:
-+Define a new drive. This includes creating a block driver node (the backend) as
-+well as a guest device, and is mostly a shortcut for defining the corresponding
-+@option{-blockdev} and @option{-device} options.
-+
-+@option{-drive} accepts all options that are accepted by @option{-blockdev}. In
-+addition, it knows the following options:
- @table @option
- @item file=@var{file}
-@@ -XXX,XX +XXX,XX @@ These options have the same definition as they have in @option{-hdachs}.
- @var{snapshot} is "on" or "off" and controls snapshot mode for the given drive
- (see @option{-snapshot}).
- @item cache=@var{cache}
--@var{cache} is "none", "writeback", "unsafe", "directsync" or "writethrough" and controls how the host cache is used to access block data.
-+@var{cache} is "none", "writeback", "unsafe", "directsync" or "writethrough"
-+and controls how the host cache is used to access block data. This is a
-+shortcut that sets the @option{cache.direct} and @option{cache.no-flush}
-+options (as in @option{-blockdev}), and additionally @option{cache.writeback},
-+which provides a default for the @option{write-cache} option of block guest
-+devices (as in @option{-device}). The modes correspond to the following
-+settings:
-+
-+@c Our texi2pod.pl script doesn't support @multitable, so fall back to using
-+@c plain ASCII art (well, UTF-8 art really). This looks okay both in the manpage
-+@c and the HTML output.
-+@example
-+@             │ cache.writeback   cache.direct   cache.no-flush
-+─────────────┼─────────────────────────────────────────────────
-+writeback    │ on                off            off
-+none         │ on                on             off
-+writethrough │ off               off            off
-+directsync   │ off               on             off
-+unsafe       │ on                off            on
-+@end example
-+
-+The default mode is @option{cache=writeback}.
-+
- @item aio=@var{aio}
- @var{aio} is "threads", or "native" and selects between pthread based disk I/O and native Linux AIO.
--@item discard=@var{discard}
--@var{discard} is one of "ignore" (or "off") or "unmap" (or "on") and controls whether @dfn{discard} (also known as @dfn{trim} or @dfn{unmap}) requests are ignored or passed to the filesystem.  Some machine types may not support discard requests.
- @item format=@var{format}
- Specify which disk @var{format} will be used rather than detecting
- the format.  Can be used to specify format=raw to avoid interpreting
-@@ -XXX,XX +XXX,XX @@ Specify which @var{action} to take on write and read errors. Valid actions are:
- "report" (report the error to the guest), "enospc" (pause QEMU only if the
- host disk is full; report the error to the guest otherwise).
- The default setting is @option{werror=enospc} and @option{rerror=report}.
--@item readonly
--Open drive @option{file} as read-only. Guest write attempts will fail.
- @item copy-on-read=@var{copy-on-read}
- @var{copy-on-read} is "on" or "off" and enables whether to copy read backing
- file sectors into the image file.
--@item detect-zeroes=@var{detect-zeroes}
--@var{detect-zeroes} is "off", "on" or "unmap" and enables the automatic
--conversion of plain zero writes by the OS to driver specific optimized
--zero write commands. You may even choose "unmap" if @var{discard} is set
--to "unmap" to allow a zero write to be converted to an UNMAP operation.
- @item bps=@var{b},bps_rd=@var{r},bps_wr=@var{w}
- Specify bandwidth throttling limits in bytes per second, either for all request
- types or for reads or writes only.  Small values can lead to timeouts or hangs
-@@ -XXX,XX +XXX,XX @@ prevent guests from circumventing throttling limits by using many small disks
- instead of a single larger disk.
- @end table
--By default, the @option{cache=writeback} mode is used. It will report data
-+By default, the @option{cache.writeback=on} mode is used. It will report data
- writes as completed as soon as the data is present in the host page cache.
- This is safe as long as your guest OS makes sure to correctly flush disk caches
- where needed. If your guest OS does not handle volatile disk write caches
- correctly and your host crashes or loses power, then the guest may experience
- data corruption.
--For such guests, you should consider using @option{cache=writethrough}. This
-+For such guests, you should consider using @option{cache.writeback=off}. This
- means that the host page cache will be used to read and write data, but write
- notification will be sent to the guest only after QEMU has made sure to flush
- each write to the disk. Be aware that this has a major impact on performance.
--The host page cache can be avoided entirely with @option{cache=none}.  This will
--attempt to do disk IO directly to the guest's memory.  QEMU may still perform
--an internal copy of the data. Note that this is considered a writeback mode and
--the guest OS must handle the disk write cache correctly in order to avoid data
--corruption on host crashes.
--
--The host page cache can be avoided while only sending write notifications to
--the guest when the data has been flushed to the disk using
--@option{cache=directsync}.
--
--In case you don't care about data integrity over host failures, use
--@option{cache=unsafe}. This option tells QEMU that it never needs to write any
--data to the disk but can instead keep things in cache. If anything goes wrong,
--like your host losing power, the disk storage getting disconnected accidentally,
--etc. your image will most probably be rendered unusable.   When using
--the @option{-snapshot} option, unsafe caching is always used.
-+When using the @option{-snapshot} option, unsafe caching is always used.
- Copy-on-read avoids accessing the same backing file sectors repeatedly and is
- useful when the backing file is over a slow network.  By default copy-on-read
---
-.8.3.1

-[Qemu-devel] [PULL 49/61] qed: Implement .bdrv_co_readv/writev
+[Qemu-devel] [PULL 30/35] iotests: Pull _filter_actual_image_size from 67/87
-Most of the qed code is now synchronous and matches the coroutine model.
+From: Max Reitz <mreitz@redhat.com>
 One notable exception is the serialisation between requests which can
 still schedule a callback. Before we can replace this with coroutine
 locks, let's convert the driver's external interfaces to the coroutine
 versions.
-We need to be careful to handle both requests that call the completion
+Tests 067 and 087 filter the actual image size because it depends on the
-callback directly from the calling coroutine (i.e. fully synchronous
+host filesystem (and is not part of the respective test).  Since this is
-code) and requests that involve some callback, so that we need to yield
+generally true, we should have a common filter function for this, so
-and wait for the completion callback coming from outside the coroutine.
+let's pull out the sed line from both tests into such a function.
-Signed-off-by: Kevin Wolf <kwolf@redhat.com>
+Signed-off-by: Max Reitz <mreitz@redhat.com>
-Reviewed-by: Manos Pitsidianakis <el13635@mail.ntua.gr>
+Message-id: 20171009163456.485-2-mreitz@redhat.com
-Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
+Reviewed-by: Eric Blake <eblake@redhat.com>
 Reviewed-by: Jeff Cody <jcody@redhat.com>
 Signed-off-by: Max Reitz <mreitz@redhat.com>
 ---
- block/qed.c | 97 ++++++++++++++++++++++++++-----------------------------------
+ tests/qemu-iotests/067           | 2 +-
-file changed, 42 insertions(+), 55 deletions(-)
+ tests/qemu-iotests/087           | 2 +-
  tests/qemu-iotests/common.filter | 6 ++++++
 files changed, 8 insertions(+), 2 deletions(-)
-diff --git a/block/qed.c b/block/qed.c
+diff --git a/tests/qemu-iotests/067 b/tests/qemu-iotests/067
 index XXXXXXX..XXXXXXX 100755
 --- a/tests/qemu-iotests/067
 +++ b/tests/qemu-iotests/067
@@ -XXX,XX +XXX,XX @@ _filter_qmp_events()
  function run_qemu()
  {
      do_run_qemu "$@" 2>&1 | _filter_testdir | _filter_qmp | _filter_qemu \
 -                          | sed -e 's/\("actual-size":\s*\)[0-9]\+/\1SIZE/g' \
 +                          | _filter_actual_image_size \
                            | _filter_generated_node_ids | _filter_qmp_events
  }
 diff --git a/tests/qemu-iotests/087 b/tests/qemu-iotests/087
 index XXXXXXX..XXXXXXX 100755
 --- a/tests/qemu-iotests/087
 +++ b/tests/qemu-iotests/087
@@ -XXX,XX +XXX,XX @@ function run_qemu()
  {
      do_run_qemu "$@" 2>&1 | _filter_testdir | _filter_qmp \
                            | _filter_qemu | _filter_imgfmt \
 -                          | sed -e 's/\("actual-size":\s*\)[0-9]\+/\1SIZE/g'
 +                          | _filter_actual_image_size
  }
  size=128M
 diff --git a/tests/qemu-iotests/common.filter b/tests/qemu-iotests/common.filter
 index XXXXXXX..XXXXXXX 100644
---- a/block/qed.c
+--- a/tests/qemu-iotests/common.filter
-+++ b/block/qed.c
++++ b/tests/qemu-iotests/common.filter
-@@ -XXX,XX +XXX,XX @@ static void qed_aio_next_io(QEDAIOCB *acb)
+@@ -XXX,XX +XXX,XX @@ _filter_block_job_len()
-     }
+     sed -e 's/, "len": [0-9]\+,/, "len": LEN,/g'
  }
--static BlockAIOCB *qed_aio_setup(BlockDriverState *bs,
++# replace actual image size (depends on the host filesystem)
--                                 int64_t sector_num,
++_filter_actual_image_size()
--                                 QEMUIOVector *qiov, int nb_sectors,
++{
--                                 BlockCompletionFunc *cb,
++    sed -s 's/\("actual-size":\s*\)[0-9]\+/\1SIZE/g'
 -                                 void *opaque, int flags)
 +typedef struct QEDRequestCo {
 +    Coroutine *co;
 +    bool done;
 +    int ret;
 +} QEDRequestCo;
 +
 +static void qed_co_request_cb(void *opaque, int ret)
  {
 -    QEDAIOCB *acb = qemu_aio_get(&qed_aiocb_info, bs, cb, opaque);
 +    QEDRequestCo *co = opaque;
 -    trace_qed_aio_setup(bs->opaque, acb, sector_num, nb_sectors,
 -                        opaque, flags);
 +    co->done = true;
 +    co->ret = ret;
 +    qemu_coroutine_enter_if_inactive(co->co);
 +}
 +
-+static int coroutine_fn qed_co_request(BlockDriverState *bs, int64_t sector_num,
+ # replace driver-specific options in the "Formatting..." line
-+                                       QEMUIOVector *qiov, int nb_sectors,
+ _filter_img_create()
 +                                       int flags)
 +{
 +    QEDRequestCo co = {
 +        .co     = qemu_coroutine_self(),
 +        .done   = false,
 +    };
 +    QEDAIOCB *acb = qemu_aio_get(&qed_aiocb_info, bs, qed_co_request_cb, &co);
 +
 +    trace_qed_aio_setup(bs->opaque, acb, sector_num, nb_sectors, &co, flags);
      acb->flags = flags;
      acb->qiov = qiov;
@@ -XXX,XX +XXX,XX @@ static BlockAIOCB *qed_aio_setup(BlockDriverState *bs,
      /* Start request */
      qed_aio_start_io(acb);
 -    return &acb->common;
 -}
 -static BlockAIOCB *bdrv_qed_aio_readv(BlockDriverState *bs,
 -                                      int64_t sector_num,
 -                                      QEMUIOVector *qiov, int nb_sectors,
 -                                      BlockCompletionFunc *cb,
 -                                      void *opaque)
 -{
 -    return qed_aio_setup(bs, sector_num, qiov, nb_sectors, cb, opaque, 0);
 +    if (!co.done) {
 +        qemu_coroutine_yield();
 +    }
 +
 +    return co.ret;
  }
 -static BlockAIOCB *bdrv_qed_aio_writev(BlockDriverState *bs,
 -                                       int64_t sector_num,
 -                                       QEMUIOVector *qiov, int nb_sectors,
 -                                       BlockCompletionFunc *cb,
 -                                       void *opaque)
 +static int coroutine_fn bdrv_qed_co_readv(BlockDriverState *bs,
 +                                          int64_t sector_num, int nb_sectors,
 +                                          QEMUIOVector *qiov)
  {
--    return qed_aio_setup(bs, sector_num, qiov, nb_sectors, cb,
--                         opaque, QED_AIOCB_WRITE);
-+    return qed_co_request(bs, sector_num, qiov, nb_sectors, 0);
- }
--typedef struct {
--    Coroutine *co;
--    int ret;
--    bool done;
--} QEDWriteZeroesCB;
--
--static void coroutine_fn qed_co_pwrite_zeroes_cb(void *opaque, int ret)
-+static int coroutine_fn bdrv_qed_co_writev(BlockDriverState *bs,
-+                                           int64_t sector_num, int nb_sectors,
-+                                           QEMUIOVector *qiov)
- {
--    QEDWriteZeroesCB *cb = opaque;
--
--    cb->done = true;
--    cb->ret = ret;
--    if (cb->co) {
--        aio_co_wake(cb->co);
--    }
-+    return qed_co_request(bs, sector_num, qiov, nb_sectors, QED_AIOCB_WRITE);
- }
- static int coroutine_fn bdrv_qed_co_pwrite_zeroes(BlockDriverState *bs,
-@@ -XXX,XX +XXX,XX @@ static int coroutine_fn bdrv_qed_co_pwrite_zeroes(BlockDriverState *bs,
-                                                   int count,
-                                                   BdrvRequestFlags flags)
- {
--    BlockAIOCB *blockacb;
-     BDRVQEDState *s = bs->opaque;
--    QEDWriteZeroesCB cb = { .done = false };
-     QEMUIOVector qiov;
-     struct iovec iov;
-@@ -XXX,XX +XXX,XX @@ static int coroutine_fn bdrv_qed_co_pwrite_zeroes(BlockDriverState *bs,
-     iov.iov_len = count;
-     qemu_iovec_init_external(&qiov, &iov, 1);
--    blockacb = qed_aio_setup(bs, offset >> BDRV_SECTOR_BITS, &qiov,
--                             count >> BDRV_SECTOR_BITS,
--                             qed_co_pwrite_zeroes_cb, &cb,
--                             QED_AIOCB_WRITE | QED_AIOCB_ZERO);
--    if (!blockacb) {
--        return -EIO;
--    }
--    if (!cb.done) {
--        cb.co = qemu_coroutine_self();
--        qemu_coroutine_yield();
--    }
--    assert(cb.done);
--    return cb.ret;
-+    return qed_co_request(bs, offset >> BDRV_SECTOR_BITS, &qiov,
-+                          count >> BDRV_SECTOR_BITS,
-+                          QED_AIOCB_WRITE | QED_AIOCB_ZERO);
- }
- static int bdrv_qed_truncate(BlockDriverState *bs, int64_t offset, Error **errp)
-@@ -XXX,XX +XXX,XX @@ static BlockDriver bdrv_qed = {
-     .bdrv_create              = bdrv_qed_create,
-     .bdrv_has_zero_init       = bdrv_has_zero_init_1,
-     .bdrv_co_get_block_status = bdrv_qed_co_get_block_status,
--    .bdrv_aio_readv           = bdrv_qed_aio_readv,
--    .bdrv_aio_writev          = bdrv_qed_aio_writev,
-+    .bdrv_co_readv            = bdrv_qed_co_readv,
-+    .bdrv_co_writev           = bdrv_qed_co_writev,
-     .bdrv_co_pwrite_zeroes    = bdrv_qed_co_pwrite_zeroes,
-     .bdrv_truncate            = bdrv_qed_truncate,
-     .bdrv_getlength           = bdrv_qed_getlength,
 --
-.8.3.1
+.13.6
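
The effect of the filter introduced by patch 30/35 can be tried outside the test harness. The following is a minimal sketch, not the function from common.filter: the name filter_actual_size_demo is hypothetical, and the expression is re-spelled with POSIX character classes instead of the GNU sed escapes (\s and \+) used in the patch, on the assumption that the two match the same input here.

```shell
#!/bin/sh
# Hypothetical stand-alone variant of the filter from the patch. It replaces the
# number after "actual-size": with the literal SIZE, because the real value
# depends on the host filesystem.
filter_actual_size_demo()
{
    sed -e 's/\("actual-size":[[:space:]]*\)[0-9][0-9]*/\1SIZE/g'
}

# A line as it appears in the query-named-block-nodes output of test 184.
sample='                "actual-size": 200704,'
filtered=$(printf '%s\n' "$sample" | filter_actual_size_demo)

# Other size fields must pass through unchanged.
untouched=$(printf '%s\n' '"virtual-size": 197120,' | filter_actual_size_demo)

printf '%s\n%s\n' "$filtered" "$untouched"
```

Only the digits are rewritten, so indentation and the trailing comma survive and the reference output stays aligned with the unfiltered JSON.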

-[Qemu-devel] [PULL 15/61] qemu-iotests: 068: test iothread mode
+[Qemu-devel] [PULL 31/35] iotests: Filter actual image size in 184 and 191
-From: Stefan Hajnoczi <stefanha@redhat.com>
+From: Max Reitz <mreitz@redhat.com>
-Perform the savevm/loadvm test with both iothread on and off.  This
+Whenever the actual image size is not part of the test, it should be
-covers the recently found savevm/loadvm hang when iothread is enabled.
+filtered as it depends on the host filesystem.
-Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
+Signed-off-by: Max Reitz <mreitz@redhat.com>
-Signed-off-by: Kevin Wolf <kwolf@redhat.com>
+Message-id: 20171009163456.485-3-mreitz@redhat.com
 Reviewed-by: Eric Blake <eblake@redhat.com>
 Reviewed-by: Jeff Cody <jcody@redhat.com>
 Signed-off-by: Max Reitz <mreitz@redhat.com>
 ---
- tests/qemu-iotests/068     | 23 ++++++++++++++---------
+ tests/qemu-iotests/184     |  3 ++-
- tests/qemu-iotests/068.out | 11 ++++++++++-
+ tests/qemu-iotests/184.out |  6 +++---
-files changed, 24 insertions(+), 10 deletions(-)
+ tests/qemu-iotests/191     |  4 ++--
+ tests/qemu-iotests/191.out | 46 +++++++++++++++++++++++-----------------------
-diff --git a/tests/qemu-iotests/068 b/tests/qemu-iotests/068
+files changed, 30 insertions(+), 29 deletions(-)
 diff --git a/tests/qemu-iotests/184 b/tests/qemu-iotests/184
 index XXXXXXX..XXXXXXX 100755
---- a/tests/qemu-iotests/068
+--- a/tests/qemu-iotests/184
-+++ b/tests/qemu-iotests/068
++++ b/tests/qemu-iotests/184
-@@ -XXX,XX +XXX,XX @@ _supported_os Linux
+@@ -XXX,XX +XXX,XX @@ function do_run_qemu()
- IMGOPTS="compat=1.1"
+ function run_qemu()
- IMG_SIZE=128K
+ {
+     do_run_qemu "$@" 2>&1 | _filter_testdir | _filter_qemu | _filter_qmp\
--echo
+-                          | _filter_qemu_io | _filter_generated_node_ids
--echo "=== Saving and reloading a VM state to/from a qcow2 image ==="
++                          | _filter_qemu_io | _filter_generated_node_ids \
--echo
++                          | _filter_actual_image_size
 -_make_test_img $IMG_SIZE
 -
  case "$QEMU_DEFAULT_MACHINE" in
    s390-ccw-virtio)
        platform_parm="-no-shutdown"
@@ -XXX,XX +XXX,XX @@ _qemu()
      _filter_qemu | _filter_hmp
  }
--# Give qemu some time to boot before saving the VM state
+ _make_test_img 64M
--bash -c 'sleep 1; echo -e "savevm 0\nquit"' | _qemu
+diff --git a/tests/qemu-iotests/184.out b/tests/qemu-iotests/184.out
 -# Now try to continue from that VM state (this should just work)
 -echo quit | _qemu -loadvm 0
 +for extra_args in \
 +    "" \
 +    "-object iothread,id=iothread0 -set device.hba0.iothread=iothread0"; do
 +    echo
 +    echo "=== Saving and reloading a VM state to/from a qcow2 image ($extra_args) ==="
 +    echo
 +
 +    _make_test_img $IMG_SIZE
 +
 +    # Give qemu some time to boot before saving the VM state
 +    bash -c 'sleep 1; echo -e "savevm 0\nquit"' | _qemu $extra_args
 +    # Now try to continue from that VM state (this should just work)
 +    echo quit | _qemu $extra_args -loadvm 0
 +done
  # success, all done
  echo "*** done"
 diff --git a/tests/qemu-iotests/068.out b/tests/qemu-iotests/068.out
 index XXXXXXX..XXXXXXX 100644
---- a/tests/qemu-iotests/068.out
+--- a/tests/qemu-iotests/184.out
-+++ b/tests/qemu-iotests/068.out
++++ b/tests/qemu-iotests/184.out
-@@ -XXX,XX +XXX,XX @@
+@@ -XXX,XX +XXX,XX @@ Testing:
- QA output created by 068
+                 "filename": "json:{\"throttle-group\": \"group0\", \"driver\": \"throttle\", \"file\": {\"driver\": \"qcow2\", \"file\": {\"driver\": \"file\", \"filename\": \"TEST_DIR/t.qcow2\"}}}",
+                 "cluster-size": 65536,
--=== Saving and reloading a VM state to/from a qcow2 image ===
+                 "format": "throttle",
-+=== Saving and reloading a VM state to/from a qcow2 image () ===
+-                "actual-size": 200704,
-+
++                "actual-size": SIZE,
-+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=131072
+                 "dirty-flag": false
-+QEMU X.Y.Z monitor - type 'help' for more information
+             },
-+(qemu) savevm 0
+             "iops_wr": 0,
-+(qemu) quit
+@@ -XXX,XX +XXX,XX @@ Testing:
-+QEMU X.Y.Z monitor - type 'help' for more information
+                 "filename": "TEST_DIR/t.qcow2",
-+(qemu) quit
+                 "cluster-size": 65536,
-+
+                 "format": "qcow2",
-+=== Saving and reloading a VM state to/from a qcow2 image (-object iothread,id=iothread0 -set device.hba0.iothread=iothread0) ===
+-                "actual-size": 200704,
++                "actual-size": SIZE,
- Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=131072
+                 "format-specific": {
- QEMU X.Y.Z monitor - type 'help' for more information
+                     "type": "qcow2",
                      "data": {
@@ -XXX,XX +XXX,XX @@ Testing:
                  "virtual-size": 197120,
                  "filename": "TEST_DIR/t.qcow2",
                  "format": "file",
 -                "actual-size": 200704,
 +                "actual-size": SIZE,
                  "dirty-flag": false
              },
              "iops_wr": 0,
 diff --git a/tests/qemu-iotests/191 b/tests/qemu-iotests/191
 index XXXXXXX..XXXXXXX 100755
 --- a/tests/qemu-iotests/191
 +++ b/tests/qemu-iotests/191
@@ -XXX,XX +XXX,XX @@ echo === Check that both top and top2 point to base now ===
  echo
  _send_qemu_cmd $h "{ 'execute': 'query-named-block-nodes' }" "^}" |
 -    _filter_generated_node_ids
 +    _filter_generated_node_ids | _filter_actual_image_size
  _send_qemu_cmd $h "{ 'execute': 'quit' }" "^}"
  wait=1 _cleanup_qemu
@@ -XXX,XX +XXX,XX @@ echo === Check that both top and top2 point to base now ===
  echo
  _send_qemu_cmd $h "{ 'execute': 'query-named-block-nodes' }" "^}" |
 -    _filter_generated_node_ids
 +    _filter_generated_node_ids | _filter_actual_image_size
  _send_qemu_cmd $h "{ 'execute': 'quit' }" "^}"
  wait=1 _cleanup_qemu
 diff --git a/tests/qemu-iotests/191.out b/tests/qemu-iotests/191.out
 index XXXXXXX..XXXXXXX 100644
 --- a/tests/qemu-iotests/191.out
 +++ b/tests/qemu-iotests/191.out
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                      "filename": "TEST_DIR/t.qcow2.base",
                      "cluster-size": 65536,
                      "format": "qcow2",
 -                    "actual-size": 397312,
 +                    "actual-size": SIZE,
                      "format-specific": {
                          "type": "qcow2",
                          "data": {
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                  "filename": "TEST_DIR/t.qcow2.ovl2",
                  "cluster-size": 65536,
                  "format": "qcow2",
 -                "actual-size": 200704,
 +                "actual-size": SIZE,
                  "format-specific": {
                      "type": "qcow2",
                      "data": {
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                  "virtual-size": 197120,
                  "filename": "TEST_DIR/t.qcow2.ovl2",
                  "format": "file",
 -                "actual-size": 200704,
 +                "actual-size": SIZE,
                  "dirty-flag": false
              },
              "iops_wr": 0,
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                      "filename": "TEST_DIR/t.qcow2.base",
                      "cluster-size": 65536,
                      "format": "qcow2",
 -                    "actual-size": 397312,
 +                    "actual-size": SIZE,
                      "format-specific": {
                          "type": "qcow2",
                          "data": {
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                  "filename": "TEST_DIR/t.qcow2",
                  "cluster-size": 65536,
                  "format": "qcow2",
 -                "actual-size": 200704,
 +                "actual-size": SIZE,
                  "format-specific": {
                      "type": "qcow2",
                      "data": {
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                  "virtual-size": 197120,
                  "filename": "TEST_DIR/t.qcow2",
                  "format": "file",
 -                "actual-size": 200704,
 +                "actual-size": SIZE,
                  "dirty-flag": false
              },
              "iops_wr": 0,
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                      "filename": "TEST_DIR/t.qcow2.base",
                      "cluster-size": 65536,
                      "format": "qcow2",
 -                    "actual-size": 397312,
 +                    "actual-size": SIZE,
                      "format-specific": {
                          "type": "qcow2",
                          "data": {
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                  "filename": "TEST_DIR/t.qcow2.mid",
                  "cluster-size": 65536,
                  "format": "qcow2",
 -                "actual-size": 397312,
 +                "actual-size": SIZE,
                  "format-specific": {
                      "type": "qcow2",
                      "data": {
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                  "virtual-size": 393216,
                  "filename": "TEST_DIR/t.qcow2.mid",
                  "format": "file",
 -                "actual-size": 397312,
 +                "actual-size": SIZE,
                  "dirty-flag": false
              },
              "iops_wr": 0,
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                  "filename": "TEST_DIR/t.qcow2.base",
                  "cluster-size": 65536,
                  "format": "qcow2",
 -                "actual-size": 397312,
 +                "actual-size": SIZE,
                  "format-specific": {
                      "type": "qcow2",
                      "data": {
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                  "virtual-size": 393216,
                  "filename": "TEST_DIR/t.qcow2.base",
                  "format": "file",
 -                "actual-size": 397312,
 +                "actual-size": SIZE,
                  "dirty-flag": false
              },
              "iops_wr": 0,
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                      "filename": "TEST_DIR/t.qcow2.base",
                      "cluster-size": 65536,
                      "format": "qcow2",
 -                    "actual-size": 397312,
 +                    "actual-size": SIZE,
                      "format-specific": {
                          "type": "qcow2",
                          "data": {
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                  "filename": "TEST_DIR/t.qcow2.ovl2",
                  "cluster-size": 65536,
                  "format": "qcow2",
 -                "actual-size": 200704,
 +                "actual-size": SIZE,
                  "format-specific": {
                      "type": "qcow2",
                      "data": {
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                  "virtual-size": 197120,
                  "filename": "TEST_DIR/t.qcow2.ovl2",
                  "format": "file",
 -                "actual-size": 200704,
 +                "actual-size": SIZE,
                  "dirty-flag": false
              },
              "iops_wr": 0,
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                          "filename": "TEST_DIR/t.qcow2.base",
                          "cluster-size": 65536,
                          "format": "qcow2",
 -                        "actual-size": 397312,
 +                        "actual-size": SIZE,
                          "format-specific": {
                              "type": "qcow2",
                              "data": {
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                      "filename": "TEST_DIR/t.qcow2.ovl2",
                      "cluster-size": 65536,
                      "format": "qcow2",
 -                    "actual-size": 200704,
 +                    "actual-size": SIZE,
                      "format-specific": {
                          "type": "qcow2",
                          "data": {
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                  "filename": "TEST_DIR/t.qcow2.ovl3",
                  "cluster-size": 65536,
                  "format": "qcow2",
 -                "actual-size": 200704,
 +                "actual-size": SIZE,
                  "format-specific": {
                      "type": "qcow2",
                      "data": {
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                  "virtual-size": 197120,
                  "filename": "TEST_DIR/t.qcow2.ovl3",
                  "format": "file",
 -                "actual-size": 200704,
 +                "actual-size": SIZE,
                  "dirty-flag": false
              },
              "iops_wr": 0,
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                  "filename": "TEST_DIR/t.qcow2.base",
                  "cluster-size": 65536,
                  "format": "qcow2",
 -                "actual-size": 397312,
 +                "actual-size": SIZE,
                  "format-specific": {
                      "type": "qcow2",
                      "data": {
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                  "virtual-size": 393216,
                  "filename": "TEST_DIR/t.qcow2.base",
                  "format": "file",
 -                "actual-size": 397312,
 +                "actual-size": SIZE,
                  "dirty-flag": false
              },
              "iops_wr": 0,
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                      "filename": "TEST_DIR/t.qcow2.base",
                      "cluster-size": 65536,
                      "format": "qcow2",
 -                    "actual-size": 397312,
 +                    "actual-size": SIZE,
                      "format-specific": {
                          "type": "qcow2",
                          "data": {
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                  "filename": "TEST_DIR/t.qcow2",
                  "cluster-size": 65536,
                  "format": "qcow2",
 -                "actual-size": 200704,
 +                "actual-size": SIZE,
                  "format-specific": {
                      "type": "qcow2",
                      "data": {
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                  "virtual-size": 197120,
                  "filename": "TEST_DIR/t.qcow2",
                  "format": "file",
 -                "actual-size": 200704,
 +                "actual-size": SIZE,
                  "dirty-flag": false
              },
              "iops_wr": 0,
 --
-.8.3.1
+.13.6

-[Qemu-devel] [PULL 29/61] qed: Remove callback from qed_find_cluster()
+Deleted patch
-Signed-off-by: Kevin Wolf <kwolf@redhat.com>
-Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
----
- block/qed-cluster.c | 39 ++++++++++++++++++++++-----------------
- block/qed.c         | 24 +++++++++++-------------
- block/qed.h         |  4 ++--
-files changed, 35 insertions(+), 32 deletions(-)
-diff --git a/block/qed-cluster.c b/block/qed-cluster.c
-index XXXXXXX..XXXXXXX 100644
---- a/block/qed-cluster.c
-+++ b/block/qed-cluster.c
-@@ -XXX,XX +XXX,XX @@ static unsigned int qed_count_contiguous_clusters(BDRVQEDState *s,
-  * @s:          QED state
-  * @request:    L2 cache entry
-  * @pos:        Byte position in device
-- * @len:        Number of bytes
-- * @cb:         Completion function
-- * @opaque:     User data for completion function
-+ * @len:        Number of bytes (may be shortened on return)
-+ * @img_offset: Contains offset in the image file on success
-  *
-  * This function translates a position in the block device to an offset in the
-- * image file.  It invokes the cb completion callback to report back the
-- * translated offset or unallocated range in the image file.
-+ * image file. The translated offset or unallocated range in the image file is
-+ * reported back in *img_offset and *len.
-  *
-  * If the L2 table exists, request->l2_table points to the L2 table cache entry
-  * and the caller must free the reference when they are finished.  The cache
-  * entry is exposed in this way to avoid callers having to read the L2 table
-  * again later during request processing.  If request->l2_table is non-NULL it
-  * will be unreferenced before taking on the new cache entry.
-+ *
-+ * On success QED_CLUSTER_FOUND is returned and img_offset/len are a contiguous
-+ * range in the image file.
-+ *
-+ * On failure QED_CLUSTER_L2 or QED_CLUSTER_L1 is returned for missing L2 or L1
-+ * table offset, respectively. len is number of contiguous unallocated bytes.
-  */
--void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
--                      size_t len, QEDFindClusterFunc *cb, void *opaque)
-+int qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
-+                     size_t *len, uint64_t *img_offset)
- {
-     uint64_t l2_offset;
-     uint64_t offset = 0;
-@@ -XXX,XX +XXX,XX @@ void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
-     /* Limit length to L2 boundary.  Requests are broken up at the L2 boundary
-      * so that a request acts on one L2 table at a time.
-      */
--    len = MIN(len, (((pos >> s->l1_shift) + 1) << s->l1_shift) - pos);
-+    *len = MIN(*len, (((pos >> s->l1_shift) + 1) << s->l1_shift) - pos);
-     l2_offset = s->l1_table->offsets[qed_l1_index(s, pos)];
-     if (qed_offset_is_unalloc_cluster(l2_offset)) {
--        cb(opaque, QED_CLUSTER_L1, 0, len);
--        return;
-+        *img_offset = 0;
-+        return QED_CLUSTER_L1;
-     }
-     if (!qed_check_table_offset(s, l2_offset)) {
--        cb(opaque, -EINVAL, 0, 0);
--        return;
-+        *img_offset = *len = 0;
-+        return -EINVAL;
-     }
-     ret = qed_read_l2_table(s, request, l2_offset);
-@@ -XXX,XX +XXX,XX @@ void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
-     }
-     index = qed_l2_index(s, pos);
--    n = qed_bytes_to_clusters(s,
--                              qed_offset_into_cluster(s, pos) + len);
-+    n = qed_bytes_to_clusters(s, qed_offset_into_cluster(s, pos) + *len);
-     n = qed_count_contiguous_clusters(s, request->l2_table->table,
-                                       index, n, &offset);
-@@ -XXX,XX +XXX,XX @@ void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
-         ret = -EINVAL;
-     }
--    len = MIN(len,
--              n * s->header.cluster_size - qed_offset_into_cluster(s, pos));
-+    *len = MIN(*len,
-+               n * s->header.cluster_size - qed_offset_into_cluster(s, pos));
- out:
--    cb(opaque, ret, offset, len);
-+    *img_offset = offset;
-     qed_release(s);
-+    return ret;
- }
-diff --git a/block/qed.c b/block/qed.c
-index XXXXXXX..XXXXXXX 100644
---- a/block/qed.c
-+++ b/block/qed.c
-@@ -XXX,XX +XXX,XX @@ static int64_t coroutine_fn bdrv_qed_co_get_block_status(BlockDriverState *bs,
-         .file = file,
-     };
-     QEDRequest request = { .l2_table = NULL };
-+    uint64_t offset;
-+    int ret;
--    qed_find_cluster(s, &request, cb.pos, len, qed_is_allocated_cb, &cb);
-+    ret = qed_find_cluster(s, &request, cb.pos, &len, &offset);
-+    qed_is_allocated_cb(&cb, ret, offset, len);
--    /* Now sleep if the callback wasn't invoked immediately */
--    while (cb.status == BDRV_BLOCK_OFFSET_MASK) {
--        cb.co = qemu_coroutine_self();
--        qemu_coroutine_yield();
--    }
-+    /* The callback was invoked immediately */
-+    assert(cb.status != BDRV_BLOCK_OFFSET_MASK);
-     qed_unref_l2_cache_entry(request.l2_table);
-@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset, size_t len)
-  *              or -errno
-  * @offset:     Cluster offset in bytes
-  * @len:        Length in bytes
-- *
-- * Callback from qed_find_cluster().
-  */
- static void qed_aio_write_data(void *opaque, int ret,
-                                uint64_t offset, size_t len)
-@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_data(void *opaque, int ret,
-  *              or -errno
-  * @offset:     Cluster offset in bytes
-  * @len:        Length in bytes
-- *
-- * Callback from qed_find_cluster().
-  */
- static void qed_aio_read_data(void *opaque, int ret,
-                               uint64_t offset, size_t len)
-@@ -XXX,XX +XXX,XX @@ static void qed_aio_next_io(QEDAIOCB *acb, int ret)
-     BDRVQEDState *s = acb_to_s(acb);
-     QEDFindClusterFunc *io_fn = (acb->flags & QED_AIOCB_WRITE) ?
-                                 qed_aio_write_data : qed_aio_read_data;
-+    uint64_t offset;
-+    size_t len;
-     trace_qed_aio_next_io(s, acb, ret, acb->cur_pos + acb->cur_qiov.size);
-@@ -XXX,XX +XXX,XX @@ static void qed_aio_next_io(QEDAIOCB *acb, int ret)
-     }
-     /* Find next cluster and start I/O */
--    qed_find_cluster(s, &acb->request,
--                      acb->cur_pos, acb->end_pos - acb->cur_pos,
--                      io_fn, acb);
-+    len = acb->end_pos - acb->cur_pos;
-+    ret = qed_find_cluster(s, &acb->request, acb->cur_pos, &len, &offset);
-+    io_fn(acb, ret, offset, len);
- }
- static BlockAIOCB *qed_aio_setup(BlockDriverState *bs,
-diff --git a/block/qed.h b/block/qed.h
-index XXXXXXX..XXXXXXX 100644
---- a/block/qed.h
-+++ b/block/qed.h
-@@ -XXX,XX +XXX,XX @@ int qed_write_l2_table_sync(BDRVQEDState *s, QEDRequest *request,
- /**
-  * Cluster functions
-  */
--void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
--                      size_t len, QEDFindClusterFunc *cb, void *opaque);
-+int qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
-+                     size_t *len, uint64_t *img_offset);
- /**
-  * Consistency check
---
-.8.3.1
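
The qed_find_cluster() conversion above replaces a completion callback with a return code plus out-parameters (*len is shortened in place, *img_offset receives the result). A minimal sketch of that calling convention follows; it is not QED code. The toy_* names, the fixed 4 KiB cluster size and the lookup table are assumptions made only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes, loosely modelled on QED_CLUSTER_FOUND etc. */
enum { TOY_FOUND = 0, TOY_UNALLOCATED = 1 };

#define TOY_CLUSTER_SIZE 4096u
#define TOY_NUM_CLUSTERS 4u

/* Hypothetical mapping table: entry 0 means "cluster not allocated". */
static const uint64_t toy_table[TOY_NUM_CLUSTERS] = {
    0x10000, 0, 0x30000, 0x31000
};

/*
 * Translate a device position into an image offset.
 *
 * *len is an in/out parameter: on return it has been shortened so that the
 * range does not cross a cluster boundary.  *img_offset is only meaningful
 * when TOY_FOUND is returned.  Errors are reported as negative errno values
 * instead of being passed to a callback.
 */
static int toy_find_cluster(uint64_t pos, size_t *len, uint64_t *img_offset)
{
    uint64_t index = pos / TOY_CLUSTER_SIZE;
    uint64_t in_cluster = pos % TOY_CLUSTER_SIZE;
    size_t max_len;

    assert(len != NULL && img_offset != NULL);

    if (index >= TOY_NUM_CLUSTERS) {
        *img_offset = 0;
        *len = 0;
        return -EINVAL;
    }

    /* Limit the length to the end of the current cluster */
    max_len = (size_t)(TOY_CLUSTER_SIZE - in_cluster);
    if (*len > max_len) {
        *len = max_len;
    }

    if (toy_table[index] == 0) {
        *img_offset = 0;
        return TOY_UNALLOCATED;
    }

    *img_offset = toy_table[index] + in_cluster;
    return TOY_FOUND;
}
```

With this shape the caller decides what to do with the result right after the call, which is what lets the patch replace the yield loop in bdrv_qed_co_get_block_status() with a plain assertion.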

-[Qemu-devel] [PULL 40/61] qed: Inline qed_commit_l2_update()
+Deleted patch
-qed_commit_l2_update() is unconditionally called at the end of
-qed_aio_write_l1_update(). Inline it.
-Signed-off-by: Kevin Wolf <kwolf@redhat.com>
-Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
----
- block/qed.c | 36 ++++++++++++++----------------------
-file changed, 14 insertions(+), 22 deletions(-)
-diff --git a/block/qed.c b/block/qed.c
-index XXXXXXX..XXXXXXX 100644
---- a/block/qed.c
-+++ b/block/qed.c
-@@ -XXX,XX +XXX,XX @@ static void qed_aio_complete(QEDAIOCB *acb, int ret)
- }
- /**
-- * Commit the current L2 table to the cache
-+ * Update L1 table with new L2 table offset and write it out
-  */
--static void qed_commit_l2_update(void *opaque, int ret)
-+static void qed_aio_write_l1_update(void *opaque, int ret)
- {
-     QEDAIOCB *acb = opaque;
-     BDRVQEDState *s = acb_to_s(acb);
-     CachedL2Table *l2_table = acb->request.l2_table;
-     uint64_t l2_offset = l2_table->offset;
-+    int index;
-+
-+    if (ret) {
-+        qed_aio_complete(acb, ret);
-+        return;
-+    }
-+    index = qed_l1_index(s, acb->cur_pos);
-+    s->l1_table->offsets[index] = l2_table->offset;
-+
-+    ret = qed_write_l1_table(s, index, 1);
-+
-+    /* Commit the current L2 table to the cache */
-     qed_commit_l2_cache_entry(&s->l2_cache, l2_table);
-     /* This is guaranteed to succeed because we just committed the entry to the
-@@ -XXX,XX +XXX,XX @@ static void qed_commit_l2_update(void *opaque, int ret)
-     qed_aio_next_io(acb, ret);
- }
--/**
-- * Update L1 table with new L2 table offset and write it out
-- */
--static void qed_aio_write_l1_update(void *opaque, int ret)
--{
--    QEDAIOCB *acb = opaque;
--    BDRVQEDState *s = acb_to_s(acb);
--    int index;
--
--    if (ret) {
--        qed_aio_complete(acb, ret);
--        return;
--    }
--
--    index = qed_l1_index(s, acb->cur_pos);
--    s->l1_table->offsets[index] = acb->request.l2_table->offset;
--
--    ret = qed_write_l1_table(s, index, 1);
--    qed_commit_l2_update(acb, ret);
--}
- /**
-  * Update L2 table with new cluster offsets and write them out
---
-.8.3.1

-[Qemu-devel] [PULL 41/61] qed: Add return value to qed_aio_write_l1_update()
+Deleted patch
-Don't recurse into qed_aio_next_io() and qed_aio_complete() here, but
-just return an error code and let the caller handle it.
-Signed-off-by: Kevin Wolf <kwolf@redhat.com>
-Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
----
- block/qed.c | 19 +++++++++----------
-file changed, 9 insertions(+), 10 deletions(-)
-diff --git a/block/qed.c b/block/qed.c
-index XXXXXXX..XXXXXXX 100644
---- a/block/qed.c
-+++ b/block/qed.c
-@@ -XXX,XX +XXX,XX @@ static void qed_aio_complete(QEDAIOCB *acb, int ret)
- /**
-  * Update L1 table with new L2 table offset and write it out
-  */
--static void qed_aio_write_l1_update(void *opaque, int ret)
-+static int qed_aio_write_l1_update(QEDAIOCB *acb)
- {
--    QEDAIOCB *acb = opaque;
-     BDRVQEDState *s = acb_to_s(acb);
-     CachedL2Table *l2_table = acb->request.l2_table;
-     uint64_t l2_offset = l2_table->offset;
--    int index;
--
--    if (ret) {
--        qed_aio_complete(acb, ret);
--        return;
--    }
-+    int index, ret;
-     index = qed_l1_index(s, acb->cur_pos);
-     s->l1_table->offsets[index] = l2_table->offset;
-@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_l1_update(void *opaque, int ret)
-     acb->request.l2_table = qed_find_l2_cache_entry(&s->l2_cache, l2_offset);
-     assert(acb->request.l2_table != NULL);
--    qed_aio_next_io(acb, ret);
-+    return ret;
- }
-@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_l2_update(QEDAIOCB *acb, int ret, uint64_t offset)
-     if (need_alloc) {
-         /* Write out the whole new L2 table */
-         ret = qed_write_l2_table(s, &acb->request, 0, s->table_nelems, true);
--        qed_aio_write_l1_update(acb, ret);
-+        if (ret) {
-+            goto err;
-+        }
-+        ret = qed_aio_write_l1_update(acb);
-+        qed_aio_next_io(acb, ret);
-+
-     } else {
-         /* Write out only the updated part of the L2 table */
-         ret = qed_write_l2_table(s, &acb->request, index, acb->cur_nclusters,
---
-.8.3.1

-[Qemu-devel] [PULL 42/61] qed: Add return value to qed_aio_write_l2_update()
+Deleted patch
-Don't recurse into qed_aio_next_io() and qed_aio_complete() here, but
-just return an error code and let the caller handle it.
-Signed-off-by: Kevin Wolf <kwolf@redhat.com>
-Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
----
- block/qed.c | 43 ++++++++++++++++++++++++++-----------------
-file changed, 26 insertions(+), 17 deletions(-)
-diff --git a/block/qed.c b/block/qed.c
-index XXXXXXX..XXXXXXX 100644
---- a/block/qed.c
-+++ b/block/qed.c
-@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_l1_update(QEDAIOCB *acb)
- /**
-  * Update L2 table with new cluster offsets and write them out
-  */
--static void qed_aio_write_l2_update(QEDAIOCB *acb, int ret, uint64_t offset)
-+static int qed_aio_write_l2_update(QEDAIOCB *acb, uint64_t offset)
- {
-     BDRVQEDState *s = acb_to_s(acb);
-     bool need_alloc = acb->find_cluster_ret == QED_CLUSTER_L1;
--    int index;
--
--    if (ret) {
--        goto err;
--    }
-+    int index, ret;
-     if (need_alloc) {
-         qed_unref_l2_cache_entry(acb->request.l2_table);
-@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_l2_update(QEDAIOCB *acb, int ret, uint64_t offset)
-         /* Write out the whole new L2 table */
-         ret = qed_write_l2_table(s, &acb->request, 0, s->table_nelems, true);
-         if (ret) {
--            goto err;
-+            return ret;
-         }
--        ret = qed_aio_write_l1_update(acb);
--        qed_aio_next_io(acb, ret);
--
-+        return qed_aio_write_l1_update(acb);
-     } else {
-         /* Write out only the updated part of the L2 table */
-         ret = qed_write_l2_table(s, &acb->request, index, acb->cur_nclusters,
-                                  false);
--        qed_aio_next_io(acb, ret);
-+        if (ret) {
-+            return ret;
-+        }
-     }
--    return;
--
--err:
--    qed_aio_complete(acb, ret);
-+    return 0;
- }
- /**
-@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_main(void *opaque, int ret)
-              */
-             ret = bdrv_flush(s->bs->file->bs);
-         }
--        qed_aio_write_l2_update(acb, ret, acb->cur_cluster);
-+        if (ret) {
-+            goto err;
-+        }
-+        ret = qed_aio_write_l2_update(acb, acb->cur_cluster);
-+        if (ret) {
-+            goto err;
-+        }
-+        qed_aio_next_io(acb, 0);
-     }
-+    return;
-+
-+err:
-+    qed_aio_complete(acb, ret);
- }
- /**
-@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_zero_cluster(void *opaque, int ret)
-         return;
-     }
--    qed_aio_write_l2_update(acb, 0, 1);
-+    ret = qed_aio_write_l2_update(acb, 1);
-+    if (ret < 0) {
-+        qed_aio_complete(acb, ret);
-+        return;
-+    }
-+    qed_aio_next_io(acb, 0);
- }
- /**
---
-.8.3.1

-[Qemu-devel] [PULL 45/61] qed: Add return value to qed_aio_write_inplace/alloc()
+[Qemu-devel] [PULL 32/35] qcow2: Emit errp when truncating the image tail
-Don't recurse into qed_aio_next_io() and qed_aio_complete() here, but
+From: Max Reitz <mreitz@redhat.com>
 just return an error code and let the caller handle it.
-Signed-off-by: Kevin Wolf <kwolf@redhat.com>
+bdrv_truncate() has an errp parameter which is always set when an error
-Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
+occurs.  Let's use that instead of a plain strerror().
 Signed-off-by: Max Reitz <mreitz@redhat.com>
 Message-id: 20171009155431.14093-1-mreitz@redhat.com
 Reviewed-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
 Reviewed-by: Jeff Cody <jcody@redhat.com>
 Signed-off-by: Max Reitz <mreitz@redhat.com>
 ---
- block/qed.c | 43 ++++++++++++++++++++-----------------------
+ block/qcow2.c | 13 +++++++------
-file changed, 20 insertions(+), 23 deletions(-)
+file changed, 7 insertions(+), 6 deletions(-)
-diff --git a/block/qed.c b/block/qed.c
+diff --git a/block/qcow2.c b/block/qcow2.c
 index XXXXXXX..XXXXXXX 100644
---- a/block/qed.c
+--- a/block/qcow2.c
-+++ b/block/qed.c
++++ b/block/qcow2.c
-@@ -XXX,XX +XXX,XX @@ static bool qed_should_set_need_check(BDRVQEDState *s)
+@@ -XXX,XX +XXX,XX @@ static int qcow2_truncate(BlockDriverState *bs, int64_t offset,
-  *
+             return last_cluster;
-  * This path is taken when writing to previously unallocated clusters.
+         }
-  */
+         if ((last_cluster + 1) * s->cluster_size < old_file_size) {
--static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
+-            ret = bdrv_truncate(bs->file, (last_cluster + 1) * s->cluster_size,
-+static int qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
+-                                PREALLOC_MODE_OFF, NULL);
- {
+-            if (ret < 0) {
-     BDRVQEDState *s = acb_to_s(acb);
+-                warn_report("Failed to truncate the tail of the image: %s",
-     int ret;
+-                            strerror(-ret));
-@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
+-                ret = 0;
-     }
++            Error *local_err = NULL;
-     if (acb != QSIMPLEQ_FIRST(&s->allocating_write_reqs) ||
++
-         s->allocating_write_reqs_plugged) {
++            bdrv_truncate(bs->file, (last_cluster + 1) * s->cluster_size,
--        return; /* wait for existing request to finish */
++                          PREALLOC_MODE_OFF, &local_err);
-+        return -EINPROGRESS; /* wait for existing request to finish */
++            if (local_err) {
-     }
++                warn_reportf_err(local_err,
++                                 "Failed to truncate the tail of the image: ");
-     acb->cur_nclusters = qed_bytes_to_clusters(s,
+             }
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
      if (acb->flags & QED_AIOCB_ZERO) {
          /* Skip ahead if the clusters are already zero */
          if (acb->find_cluster_ret == QED_CLUSTER_ZERO) {
 -            qed_aio_start_io(acb);
 -            return;
 +            return 0;
          }
      } else {
-         acb->cur_cluster = qed_alloc_clusters(s, acb->cur_nclusters);
-@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
-         s->header.features |= QED_F_NEED_CHECK;
-         ret = qed_write_header(s);
-         if (ret < 0) {
--            qed_aio_complete(acb, ret);
--            return;
-+            return ret;
-         }
-     }
-@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
-         ret = qed_aio_write_cow(acb);
-     }
-     if (ret < 0) {
--        qed_aio_complete(acb, ret);
--        return;
-+        return ret;
-     }
--    qed_aio_next_io(acb, 0);
-+    return 0;
- }
- /**
-@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
-  *
-  * This path is taken when writing to already allocated clusters.
-  */
--static void qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset, size_t len)
-+static int qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset, size_t len)
- {
--    int ret;
--
-     /* Allocate buffer for zero writes */
-     if (acb->flags & QED_AIOCB_ZERO) {
-         struct iovec *iov = acb->qiov->iov;
-@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset, size_t len)
-         if (!iov->iov_base) {
-             iov->iov_base = qemu_try_blockalign(acb->common.bs, iov->iov_len);
-             if (iov->iov_base == NULL) {
--                qed_aio_complete(acb, -ENOMEM);
--                return;
-+                return -ENOMEM;
-             }
-             memset(iov->iov_base, 0, iov->iov_len);
-         }
-@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset, size_t len)
-     qemu_iovec_concat(&acb->cur_qiov, acb->qiov, acb->qiov_offset, len);
-     /* Do the actual write */
--    ret = qed_aio_write_main(acb);
--    if (ret < 0) {
--        qed_aio_complete(acb, ret);
--        return;
--    }
--    qed_aio_next_io(acb, 0);
-+    return qed_aio_write_main(acb);
- }
- /**
-@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_data(void *opaque, int ret,
-     switch (ret) {
-     case QED_CLUSTER_FOUND:
--        qed_aio_write_inplace(acb, offset, len);
-+        ret = qed_aio_write_inplace(acb, offset, len);
-         break;
-     case QED_CLUSTER_L2:
-     case QED_CLUSTER_L1:
-     case QED_CLUSTER_ZERO:
--        qed_aio_write_alloc(acb, len);
-+        ret = qed_aio_write_alloc(acb, len);
-         break;
-     default:
--        qed_aio_complete(acb, ret);
-+        assert(ret < 0);
-         break;
-     }
-+
-+    if (ret < 0) {
-+        if (ret != -EINPROGRESS) {
-+            qed_aio_complete(acb, ret);
-+        }
-+        return;
-+    }
-+    qed_aio_next_io(acb, 0);
- }
- /**
 --
-.8.3.1
+.13.6
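
The qed_aio_write_data() hunk of the removed patch 45/61 centralises completion handling: the helpers only return a code, and the caller distinguishes "failed", "still waiting for another request" (-EINPROGRESS) and "continue with the next I/O". The sketch below models only that dispatch decision. The toy_* names are hypothetical, and it assumes the helpers return 0 or a negative errno value.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical outcomes of the caller-side dispatch */
enum toy_action {
    TOY_NEXT_IO,             /* success: carry on with the next request part */
    TOY_COMPLETE_WITH_ERROR, /* real failure: complete the request with ret */
    TOY_WAIT                 /* queued behind another request: do nothing yet */
};

/*
 * Decide what the caller does with a helper's return value.
 * Assumption for this sketch: ret is 0 on success or a negative errno.
 */
static enum toy_action toy_dispatch(int ret)
{
    assert(ret <= 0);

    if (ret < 0) {
        if (ret != -EINPROGRESS) {
            return TOY_COMPLETE_WITH_ERROR;
        }
        return TOY_WAIT;
    }
    return TOY_NEXT_IO;
}
```

Keeping this decision in one place is what allows the helpers to stop calling the completion and next-I/O functions recursively.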

-[Qemu-devel] [PULL 60/61] block: Do not strcmp() with NULL uri->scheme
+[Qemu-devel] [PULL 33/35] qcow2: Fix unaligned preallocated truncation
 From: Max Reitz <mreitz@redhat.com>
-uri_parse(...)->scheme may be NULL. In fact, probably every field may be
+A qcow2 image file's length is not required to be a
-NULL, and the callers do test this for all of the other fields but not
+multiple of the cluster size.  However, qcow2_refcount_area() expects an
-for scheme (except for block/gluster.c; block/vxhs.c does not access
+aligned value for its @start_offset parameter, so we need to round
-that field at all).
+@old_file_size up to the next cluster boundary.
-We can easily fix this by using g_strcmp0() instead of strcmp().
+Reported-by: Ping Li <pingl@redhat.com>
+Bug: https://bugzilla.redhat.com/show_bug.cgi?id=1414049
 Signed-off-by: Max Reitz <mreitz@redhat.com>
 Message-id: 20171009215533.12530-2-mreitz@redhat.com
 Cc: qemu-stable@nongnu.org
-Signed-off-by: Max Reitz <mreitz@redhat.com>
+Reviewed-by: Eric Blake <eblake@redhat.com>
-Message-id: 20170613205726.13544-1-mreitz@redhat.com
+Reviewed-by: Jeff Cody <jcody@redhat.com>
 Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
 Signed-off-by: Max Reitz <mreitz@redhat.com>
 ---
- block/nbd.c      | 6 +++---
+ block/qcow2.c | 1 +
- block/nfs.c      | 2 +-
+file changed, 1 insertion(+)
  block/sheepdog.c | 6 +++---
  block/ssh.c      | 2 +-
 files changed, 8 insertions(+), 8 deletions(-)
-diff --git a/block/nbd.c b/block/nbd.c
+diff --git a/block/qcow2.c b/block/qcow2.c
 index XXXXXXX..XXXXXXX 100644
---- a/block/nbd.c
+--- a/block/qcow2.c
-+++ b/block/nbd.c
++++ b/block/qcow2.c
-@@ -XXX,XX +XXX,XX @@ static int nbd_parse_uri(const char *filename, QDict *options)
+@@ -XXX,XX +XXX,XX @@ static int qcow2_truncate(BlockDriverState *bs, int64_t offset,
-     }
+                              "Failed to inquire current file length");
+             return old_file_size;
-     /* transport */
+         }
--    if (!strcmp(uri->scheme, "nbd")) {
++        old_file_size = ROUND_UP(old_file_size, s->cluster_size);
-+    if (!g_strcmp0(uri->scheme, "nbd")) {
-         is_unix = false;
+         nb_new_data_clusters = DIV_ROUND_UP(offset - old_length,
--    } else if (!strcmp(uri->scheme, "nbd+tcp")) {
+                                             s->cluster_size);
 +    } else if (!g_strcmp0(uri->scheme, "nbd+tcp")) {
          is_unix = false;
 -    } else if (!strcmp(uri->scheme, "nbd+unix")) {
 +    } else if (!g_strcmp0(uri->scheme, "nbd+unix")) {
          is_unix = true;
      } else {
          ret = -EINVAL;
 diff --git a/block/nfs.c b/block/nfs.c
 index XXXXXXX..XXXXXXX 100644
 --- a/block/nfs.c
 +++ b/block/nfs.c
@@ -XXX,XX +XXX,XX @@ static int nfs_parse_uri(const char *filename, QDict *options, Error **errp)
          error_setg(errp, "Invalid URI specified");
          goto out;
      }
 -    if (strcmp(uri->scheme, "nfs") != 0) {
 +    if (g_strcmp0(uri->scheme, "nfs") != 0) {
          error_setg(errp, "URI scheme must be 'nfs'");
          goto out;
      }
 diff --git a/block/sheepdog.c b/block/sheepdog.c
 index XXXXXXX..XXXXXXX 100644
 --- a/block/sheepdog.c
 +++ b/block/sheepdog.c
@@ -XXX,XX +XXX,XX @@ static void sd_parse_uri(SheepdogConfig *cfg, const char *filename,
      }
      /* transport */
 -    if (!strcmp(uri->scheme, "sheepdog")) {
 +    if (!g_strcmp0(uri->scheme, "sheepdog")) {
          is_unix = false;
 -    } else if (!strcmp(uri->scheme, "sheepdog+tcp")) {
 +    } else if (!g_strcmp0(uri->scheme, "sheepdog+tcp")) {
          is_unix = false;
 -    } else if (!strcmp(uri->scheme, "sheepdog+unix")) {
 +    } else if (!g_strcmp0(uri->scheme, "sheepdog+unix")) {
          is_unix = true;
      } else {
          error_setg(&err, "URI scheme must be 'sheepdog', 'sheepdog+tcp',"
 diff --git a/block/ssh.c b/block/ssh.c
 index XXXXXXX..XXXXXXX 100644
 --- a/block/ssh.c
 +++ b/block/ssh.c
@@ -XXX,XX +XXX,XX @@ static int parse_uri(const char *filename, QDict *options, Error **errp)
          return -EINVAL;
      }
 -    if (strcmp(uri->scheme, "ssh") != 0) {
 +    if (g_strcmp0(uri->scheme, "ssh") != 0) {
          error_setg(errp, "URI scheme must be 'ssh'");
          goto err;
      }
 --
-.8.3.1
+.13.6
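
The removed patch 60/61 swaps strcmp() for g_strcmp0() because uri->scheme may be NULL, and passing NULL to strcmp() is undefined behaviour. The sketch below shows the NULL handling this relies on without depending on GLib. toy_strcmp0() and toy_scheme_is() are hypothetical stand-ins written for illustration, mirroring the documented g_strcmp0() behaviour (NULL sorts before any non-NULL string, and two NULLs compare equal).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * NULL-tolerant string comparison (illustrative stand-in for g_strcmp0()):
 * NULL compares equal to NULL and sorts before any non-NULL string.
 */
static int toy_strcmp0(const char *a, const char *b)
{
    if (a == NULL) {
        return b == NULL ? 0 : -1;
    }
    if (b == NULL) {
        return 1;
    }
    return strcmp(a, b);
}

/*
 * Returns non-zero if the (possibly NULL) scheme matches the expected one.
 * A missing scheme simply fails the match instead of crashing.
 */
static int toy_scheme_is(const char *scheme, const char *expected)
{
    assert(expected != NULL);
    return toy_strcmp0(scheme, expected) == 0;
}
```

A URI without a scheme then falls through to the existing "URI scheme must be ..." error paths rather than dereferencing a NULL pointer.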

-[Qemu-devel] [PULL 59/61] blkverify: Catch bs->exact_filename overflow
+[Qemu-devel] [PULL 34/35] qcow2: Always execute preallocate() in a coroutine
 From: Max Reitz <mreitz@redhat.com>
-The bs->exact_filename field may not be sufficient to store the full
+Some qcow2 functions (at least perform_cow()) expect s->lock to be
-blkverify node filename. In this case, we should not generate a filename
+taken.  Therefore, if we want to make use of them, we should execute
-at all instead of an unusable one.
+preallocate() (as "preallocate_co") in a coroutine so that we can use
 the qemu_co_mutex_* functions.
+Signed-off-by: Max Reitz <mreitz@redhat.com>
+Message-id: 20171009215533.12530-3-mreitz@redhat.com
 Cc: qemu-stable@nongnu.org
-Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
+Reviewed-by: Eric Blake <eblake@redhat.com>
-Signed-off-by: Max Reitz <mreitz@redhat.com>
+Reviewed-by: Jeff Cody <jcody@redhat.com>
 Message-id: 20170613172006.19685-3-mreitz@redhat.com
 Reviewed-by: Alberto Garcia <berto@igalia.com>
 Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
 Signed-off-by: Max Reitz <mreitz@redhat.com>
 ---
- block/blkverify.c | 12 ++++++++----
+ block/qcow2.c | 41 ++++++++++++++++++++++++++++++++++-------
-file changed, 8 insertions(+), 4 deletions(-)
+file changed, 34 insertions(+), 7 deletions(-)
-diff --git a/block/blkverify.c b/block/blkverify.c
+diff --git a/block/qcow2.c b/block/qcow2.c
 index XXXXXXX..XXXXXXX 100644
---- a/block/blkverify.c
+--- a/block/qcow2.c
-+++ b/block/blkverify.c
++++ b/block/qcow2.c
-@@ -XXX,XX +XXX,XX @@ static void blkverify_refresh_filename(BlockDriverState *bs, QDict *options)
+@@ -XXX,XX +XXX,XX @@ static int qcow2_set_up_encryption(BlockDriverState *bs, const char *encryptfmt,
-     if (bs->file->bs->exact_filename[0]
+ }
-         && s->test_file->bs->exact_filename[0])
-     {
--        snprintf(bs->exact_filename, sizeof(bs->exact_filename),
++typedef struct PreallocCo {
--                 "blkverify:%s:%s",
++    BlockDriverState *bs;
--                 bs->file->bs->exact_filename,
++    uint64_t offset;
--                 s->test_file->bs->exact_filename);
++    uint64_t new_length;
-+        int ret = snprintf(bs->exact_filename, sizeof(bs->exact_filename),
++
-+                           "blkverify:%s:%s",
++    int ret;
-+                           bs->file->bs->exact_filename,
++} PreallocCo;
-+                           s->test_file->bs->exact_filename);
++
-+        if (ret >= sizeof(bs->exact_filename)) {
+ /**
-+            /* An overflow makes the filename unusable, so do not report any */
+  * Preallocates metadata structures for data clusters between @offset (in the
-+            bs->exact_filename[0] = 0;
+  * guest disk) and @new_length (which is thus generally the new guest disk
-+        }
+@@ -XXX,XX +XXX,XX @@ static int qcow2_set_up_encryption(BlockDriverState *bs, const char *encryptfmt,
   *
   * Returns: 0 on success, -errno on failure.
   */
 -static int preallocate(BlockDriverState *bs,
 -                       uint64_t offset, uint64_t new_length)
 +static void coroutine_fn preallocate_co(void *opaque)
  {
 +    PreallocCo *params = opaque;
 +    BlockDriverState *bs = params->bs;
 +    uint64_t offset = params->offset;
 +    uint64_t new_length = params->new_length;
      BDRVQcow2State *s = bs->opaque;
      uint64_t bytes;
      uint64_t host_offset = 0;
@@ -XXX,XX +XXX,XX @@ static int preallocate(BlockDriverState *bs,
      int ret;
      QCowL2Meta *meta;
 -    if (qemu_in_coroutine()) {
 -        qemu_co_mutex_lock(&s->lock);
 -    }
 +    qemu_co_mutex_lock(&s->lock);
      assert(offset <= new_length);
      bytes = new_length - offset;
@@ -XXX,XX +XXX,XX @@ static int preallocate(BlockDriverState *bs,
      ret = 0;
  done:
 +    qemu_co_mutex_unlock(&s->lock);
 +    params->ret = ret;
 +}
 +
 +static int preallocate(BlockDriverState *bs,
 +                       uint64_t offset, uint64_t new_length)
 +{
 +    PreallocCo params = {
 +        .bs         = bs,
 +        .offset     = offset,
 +        .new_length = new_length,
 +        .ret        = -EINPROGRESS,
 +    };
 +
      if (qemu_in_coroutine()) {
 -        qemu_co_mutex_unlock(&s->lock);
 +        preallocate_co(&params);
 +    } else {
 +        Coroutine *co = qemu_coroutine_create(preallocate_co, &params);
 +        bdrv_coroutine_enter(bs, co);
 +        BDRV_POLL_WHILE(bs, params.ret == -EINPROGRESS);
      }
+-    return ret;
++    return params.ret;
  }
+ /* qcow2_refcount_metadata_size:
 --
-.8.3.1
+.13.6
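
Reviewer note on the preallocate() patch above: the wrapper packs its arguments and a result slot into a struct, presets the result to -EINPROGRESS, hands the struct to a void(void *) entry point, and polls until the sentinel is overwritten. The sketch below shows only that hand-off. It is not QEMU code: a hypothetical one-slot task queue (task_schedule(), task_poll()) stands in for qemu_coroutine_create(), bdrv_coroutine_enter() and BDRV_POLL_WHILE(), and GrowParams, grow_entry() and grow() are invented names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

typedef void TaskFn(void *opaque);

/* Hypothetical one-slot queue standing in for the coroutine scheduler. */
static TaskFn *pending_fn;
static void *pending_opaque;

static void task_schedule(TaskFn *fn, void *opaque)
{
    assert(pending_fn == NULL); /* the single slot must be free */
    pending_fn = fn;
    pending_opaque = opaque;
}

/* Run the pending task, if any. Returns 1 if a task ran, 0 otherwise. */
static int task_poll(void)
{
    TaskFn *fn = pending_fn;

    if (fn == NULL) {
        return 0;
    }
    pending_fn = NULL;
    fn(pending_opaque);
    return 1;
}

/* Arguments plus a result slot, analogous to PreallocCo in the patch. */
typedef struct GrowParams {
    uint64_t offset;
    uint64_t new_length;
    int ret;
} GrowParams;

/* Entry point: cannot return a value, so it reports through params->ret. */
static void grow_entry(void *opaque)
{
    GrowParams *params = opaque;

    params->ret = params->offset <= params->new_length ? 0 : -EINVAL;
}

/* Synchronous wrapper: schedule the entry point, then poll for the result. */
int grow(uint64_t offset, uint64_t new_length)
{
    GrowParams params = {
        .offset     = offset,
        .new_length = new_length,
        .ret        = -EINPROGRESS, /* sentinel: not finished yet */
    };

    task_schedule(grow_entry, &params);
    while (params.ret == -EINPROGRESS) {
        if (!task_poll()) {
            return -EIO; /* nothing left to run; do not spin forever */
        }
    }
    return params.ret;
}
```

The sentinel works only because the entry point never reports -EINPROGRESS as a real result. The patch relies on the same property when it polls on params.ret.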

-[Qemu-devel] [PULL 58/61] blkdebug: Catch bs->exact_filename overflow
+[Qemu-devel] [PULL 35/35] iotests: Add cluster_size=64k to 125
 From: Max Reitz <mreitz@redhat.com>
-The bs->exact_filename field may not be sufficient to store the full
+Apparently it would be a good idea to test that, too.
 blkdebug node filename. In this case, we should not generate a filename
 at all instead of an unusable one.
-Cc: qemu-stable@nongnu.org
-Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
 Signed-off-by: Max Reitz <mreitz@redhat.com>
-Message-id: 20170613172006.19685-2-mreitz@redhat.com
+Message-id: 20171009215533.12530-4-mreitz@redhat.com
-Reviewed-by: Alberto Garcia <berto@igalia.com>
+Reviewed-by: Eric Blake <eblake@redhat.com>
 Reviewed-by: Jeff Cody <jcody@redhat.com>
 Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
 Signed-off-by: Max Reitz <mreitz@redhat.com>
 ---
- block/blkdebug.c | 10 +++++++---
+ tests/qemu-iotests/125     |   7 +-
-file changed, 7 insertions(+), 3 deletions(-)
+ tests/qemu-iotests/125.out | 480 ++++++++++++++++++++++++++++++++++++++++-----
 files changed, 437 insertions(+), 50 deletions(-)
-diff --git a/block/blkdebug.c b/block/blkdebug.c
+diff --git a/tests/qemu-iotests/125 b/tests/qemu-iotests/125
 index XXXXXXX..XXXXXXX 100755
 --- a/tests/qemu-iotests/125
 +++ b/tests/qemu-iotests/125
@@ -XXX,XX +XXX,XX @@ fi
  # in B
  CREATION_SIZE=$((2 * 1024 * 1024 - 48 * 1024))
 +# 512 is the actual test -- but it's good to test 64k as well, just to be sure.
 +for cluster_size in 512 64k; do
  # in kB
  for GROWTH_SIZE in 16 48 80; do
      for create_mode in off metadata falloc full; do
          for growth_mode in off metadata falloc full; do
 -            echo "--- growth_size=$GROWTH_SIZE create_mode=$create_mode growth_mode=$growth_mode ---"
 +            echo "--- cluster_size=$cluster_size growth_size=$GROWTH_SIZE create_mode=$create_mode growth_mode=$growth_mode ---"
 -            IMGOPTS="preallocation=$create_mode,cluster_size=512" _make_test_img ${CREATION_SIZE}
 +            IMGOPTS="preallocation=$create_mode,cluster_size=$cluster_size" _make_test_img ${CREATION_SIZE}
              $QEMU_IMG resize -f "$IMGFMT" --preallocation=$growth_mode "$TEST_IMG" +${GROWTH_SIZE}K
              host_size_0=$(get_image_size_on_host)
@@ -XXX,XX +XXX,XX @@ for GROWTH_SIZE in 16 48 80; do
          done
      done
  done
 +done
  # success, all done
  echo '*** done'
 diff --git a/tests/qemu-iotests/125.out b/tests/qemu-iotests/125.out
 index XXXXXXX..XXXXXXX 100644
---- a/block/blkdebug.c
+--- a/tests/qemu-iotests/125.out
-+++ b/block/blkdebug.c
++++ b/tests/qemu-iotests/125.out
-@@ -XXX,XX +XXX,XX @@ static void blkdebug_refresh_filename(BlockDriverState *bs, QDict *options)
+@@ -XXX,XX +XXX,XX @@
-     }
+ QA output created by 125
+---- growth_size=16 create_mode=off growth_mode=off ---
-     if (!force_json && bs->file->bs->exact_filename[0]) {
++--- cluster_size=512 growth_size=16 create_mode=off growth_mode=off ---
--        snprintf(bs->exact_filename, sizeof(bs->exact_filename),
+ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
--                 "blkdebug:%s:%s", s->config_file ?: "",
+ Image resized.
--                 bs->file->bs->exact_filename);
+ wrote 2048000/2048000 bytes at offset 0
-+        int ret = snprintf(bs->exact_filename, sizeof(bs->exact_filename),
+@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
-+                           "blkdebug:%s:%s", s->config_file ?: "",
+ wrote 16384/16384 bytes at offset 2048000
-+                           bs->file->bs->exact_filename);
+KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-+        if (ret >= sizeof(bs->exact_filename)) {
-+            /* An overflow makes the filename unusable, so do not report any */
+---- growth_size=16 create_mode=off growth_mode=metadata ---
-+            bs->exact_filename[0] = 0;
++--- cluster_size=512 growth_size=16 create_mode=off growth_mode=metadata ---
-+        }
+ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
-     }
+ Image resized.
+ wrote 2048000/2048000 bytes at offset 0
-     opts = qdict_new();
+@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 16384/16384 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=16 create_mode=off growth_mode=falloc ---
 +--- cluster_size=512 growth_size=16 create_mode=off growth_mode=falloc ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 16384/16384 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=16 create_mode=off growth_mode=full ---
 +--- cluster_size=512 growth_size=16 create_mode=off growth_mode=full ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 16384/16384 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=16 create_mode=metadata growth_mode=off ---
 +--- cluster_size=512 growth_size=16 create_mode=metadata growth_mode=off ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 16384/16384 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=16 create_mode=metadata growth_mode=metadata ---
 +--- cluster_size=512 growth_size=16 create_mode=metadata growth_mode=metadata ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 16384/16384 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=16 create_mode=metadata growth_mode=falloc ---
 +--- cluster_size=512 growth_size=16 create_mode=metadata growth_mode=falloc ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 16384/16384 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=16 create_mode=metadata growth_mode=full ---
 +--- cluster_size=512 growth_size=16 create_mode=metadata growth_mode=full ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 16384/16384 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=16 create_mode=falloc growth_mode=off ---
 +--- cluster_size=512 growth_size=16 create_mode=falloc growth_mode=off ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 16384/16384 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=16 create_mode=falloc growth_mode=metadata ---
 +--- cluster_size=512 growth_size=16 create_mode=falloc growth_mode=metadata ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 16384/16384 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=16 create_mode=falloc growth_mode=falloc ---
 +--- cluster_size=512 growth_size=16 create_mode=falloc growth_mode=falloc ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 16384/16384 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=16 create_mode=falloc growth_mode=full ---
 +--- cluster_size=512 growth_size=16 create_mode=falloc growth_mode=full ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 16384/16384 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=16 create_mode=full growth_mode=off ---
 +--- cluster_size=512 growth_size=16 create_mode=full growth_mode=off ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 16384/16384 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=16 create_mode=full growth_mode=metadata ---
 +--- cluster_size=512 growth_size=16 create_mode=full growth_mode=metadata ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 16384/16384 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=16 create_mode=full growth_mode=falloc ---
 +--- cluster_size=512 growth_size=16 create_mode=full growth_mode=falloc ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 16384/16384 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=16 create_mode=full growth_mode=full ---
 +--- cluster_size=512 growth_size=16 create_mode=full growth_mode=full ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 16384/16384 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=48 create_mode=off growth_mode=off ---
 +--- cluster_size=512 growth_size=48 create_mode=off growth_mode=off ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 49152/49152 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=48 create_mode=off growth_mode=metadata ---
 +--- cluster_size=512 growth_size=48 create_mode=off growth_mode=metadata ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 49152/49152 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=48 create_mode=off growth_mode=falloc ---
 +--- cluster_size=512 growth_size=48 create_mode=off growth_mode=falloc ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 49152/49152 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=48 create_mode=off growth_mode=full ---
 +--- cluster_size=512 growth_size=48 create_mode=off growth_mode=full ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 49152/49152 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=48 create_mode=metadata growth_mode=off ---
 +--- cluster_size=512 growth_size=48 create_mode=metadata growth_mode=off ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 49152/49152 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=48 create_mode=metadata growth_mode=metadata ---
 +--- cluster_size=512 growth_size=48 create_mode=metadata growth_mode=metadata ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 49152/49152 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=48 create_mode=metadata growth_mode=falloc ---
 +--- cluster_size=512 growth_size=48 create_mode=metadata growth_mode=falloc ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 49152/49152 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=48 create_mode=metadata growth_mode=full ---
 +--- cluster_size=512 growth_size=48 create_mode=metadata growth_mode=full ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 49152/49152 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=48 create_mode=falloc growth_mode=off ---
 +--- cluster_size=512 growth_size=48 create_mode=falloc growth_mode=off ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 49152/49152 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=48 create_mode=falloc growth_mode=metadata ---
 +--- cluster_size=512 growth_size=48 create_mode=falloc growth_mode=metadata ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 49152/49152 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=48 create_mode=falloc growth_mode=falloc ---
 +--- cluster_size=512 growth_size=48 create_mode=falloc growth_mode=falloc ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 49152/49152 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=48 create_mode=falloc growth_mode=full ---
 +--- cluster_size=512 growth_size=48 create_mode=falloc growth_mode=full ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 49152/49152 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=48 create_mode=full growth_mode=off ---
 +--- cluster_size=512 growth_size=48 create_mode=full growth_mode=off ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 49152/49152 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=48 create_mode=full growth_mode=metadata ---
 +--- cluster_size=512 growth_size=48 create_mode=full growth_mode=metadata ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 49152/49152 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=48 create_mode=full growth_mode=falloc ---
 +--- cluster_size=512 growth_size=48 create_mode=full growth_mode=falloc ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 49152/49152 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=48 create_mode=full growth_mode=full ---
 +--- cluster_size=512 growth_size=48 create_mode=full growth_mode=full ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 49152/49152 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=80 create_mode=off growth_mode=off ---
 +--- cluster_size=512 growth_size=80 create_mode=off growth_mode=off ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 81920/81920 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=80 create_mode=off growth_mode=metadata ---
 +--- cluster_size=512 growth_size=80 create_mode=off growth_mode=metadata ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 81920/81920 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=80 create_mode=off growth_mode=falloc ---
 +--- cluster_size=512 growth_size=80 create_mode=off growth_mode=falloc ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 81920/81920 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=80 create_mode=off growth_mode=full ---
 +--- cluster_size=512 growth_size=80 create_mode=off growth_mode=full ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 81920/81920 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=80 create_mode=metadata growth_mode=off ---
 +--- cluster_size=512 growth_size=80 create_mode=metadata growth_mode=off ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 81920/81920 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=80 create_mode=metadata growth_mode=metadata ---
 +--- cluster_size=512 growth_size=80 create_mode=metadata growth_mode=metadata ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 81920/81920 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=80 create_mode=metadata growth_mode=falloc ---
 +--- cluster_size=512 growth_size=80 create_mode=metadata growth_mode=falloc ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 81920/81920 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=80 create_mode=metadata growth_mode=full ---
 +--- cluster_size=512 growth_size=80 create_mode=metadata growth_mode=full ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 81920/81920 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=80 create_mode=falloc growth_mode=off ---
 +--- cluster_size=512 growth_size=80 create_mode=falloc growth_mode=off ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 81920/81920 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=80 create_mode=falloc growth_mode=metadata ---
 +--- cluster_size=512 growth_size=80 create_mode=falloc growth_mode=metadata ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 81920/81920 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=80 create_mode=falloc growth_mode=falloc ---
 +--- cluster_size=512 growth_size=80 create_mode=falloc growth_mode=falloc ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 81920/81920 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=80 create_mode=falloc growth_mode=full ---
 +--- cluster_size=512 growth_size=80 create_mode=falloc growth_mode=full ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 81920/81920 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=80 create_mode=full growth_mode=off ---
 +--- cluster_size=512 growth_size=80 create_mode=full growth_mode=off ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 81920/81920 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=80 create_mode=full growth_mode=metadata ---
 +--- cluster_size=512 growth_size=80 create_mode=full growth_mode=metadata ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 81920/81920 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=80 create_mode=full growth_mode=falloc ---
 +--- cluster_size=512 growth_size=80 create_mode=full growth_mode=falloc ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
  wrote 81920/81920 bytes at offset 2048000
 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ---- growth_size=80 create_mode=full growth_mode=full ---
 +--- cluster_size=512 growth_size=80 create_mode=full growth_mode=full ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 81920/81920 bytes at offset 2048000
 +80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=16 create_mode=off growth_mode=off ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 16384/16384 bytes at offset 2048000
 +16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=16 create_mode=off growth_mode=metadata ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 16384/16384 bytes at offset 2048000
 +16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=16 create_mode=off growth_mode=falloc ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 16384/16384 bytes at offset 2048000
 +16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=16 create_mode=off growth_mode=full ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 16384/16384 bytes at offset 2048000
 +16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=16 create_mode=metadata growth_mode=off ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 16384/16384 bytes at offset 2048000
 +16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=16 create_mode=metadata growth_mode=metadata ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 16384/16384 bytes at offset 2048000
 +16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=16 create_mode=metadata growth_mode=falloc ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 16384/16384 bytes at offset 2048000
 +16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=16 create_mode=metadata growth_mode=full ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 16384/16384 bytes at offset 2048000
 +16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=16 create_mode=falloc growth_mode=off ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 16384/16384 bytes at offset 2048000
 +16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=16 create_mode=falloc growth_mode=metadata ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 16384/16384 bytes at offset 2048000
 +16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=16 create_mode=falloc growth_mode=falloc ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 16384/16384 bytes at offset 2048000
 +16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=16 create_mode=falloc growth_mode=full ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 16384/16384 bytes at offset 2048000
 +16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=16 create_mode=full growth_mode=off ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 16384/16384 bytes at offset 2048000
 +16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=16 create_mode=full growth_mode=metadata ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 16384/16384 bytes at offset 2048000
 +16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=16 create_mode=full growth_mode=falloc ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 16384/16384 bytes at offset 2048000
 +16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=16 create_mode=full growth_mode=full ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 16384/16384 bytes at offset 2048000
 +16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=48 create_mode=off growth_mode=off ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 49152/49152 bytes at offset 2048000
 +48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=48 create_mode=off growth_mode=metadata ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 49152/49152 bytes at offset 2048000
 +48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=48 create_mode=off growth_mode=falloc ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 49152/49152 bytes at offset 2048000
 +48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=48 create_mode=off growth_mode=full ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 49152/49152 bytes at offset 2048000
 +48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=48 create_mode=metadata growth_mode=off ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 49152/49152 bytes at offset 2048000
 +48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=48 create_mode=metadata growth_mode=metadata ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 49152/49152 bytes at offset 2048000
 +48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=48 create_mode=metadata growth_mode=falloc ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 49152/49152 bytes at offset 2048000
 +48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=48 create_mode=metadata growth_mode=full ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 49152/49152 bytes at offset 2048000
 +48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=48 create_mode=falloc growth_mode=off ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 49152/49152 bytes at offset 2048000
 +48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=48 create_mode=falloc growth_mode=metadata ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 49152/49152 bytes at offset 2048000
 +48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=48 create_mode=falloc growth_mode=falloc ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 49152/49152 bytes at offset 2048000
 +48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=48 create_mode=falloc growth_mode=full ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 49152/49152 bytes at offset 2048000
 +48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=48 create_mode=full growth_mode=off ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 49152/49152 bytes at offset 2048000
 +48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=48 create_mode=full growth_mode=metadata ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 49152/49152 bytes at offset 2048000
 +48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=48 create_mode=full growth_mode=falloc ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 49152/49152 bytes at offset 2048000
 +48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=48 create_mode=full growth_mode=full ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 49152/49152 bytes at offset 2048000
 +48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=80 create_mode=off growth_mode=off ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 81920/81920 bytes at offset 2048000
 +80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=80 create_mode=off growth_mode=metadata ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 81920/81920 bytes at offset 2048000
 +80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=80 create_mode=off growth_mode=falloc ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 81920/81920 bytes at offset 2048000
 +80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=80 create_mode=off growth_mode=full ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 81920/81920 bytes at offset 2048000
 +80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=80 create_mode=metadata growth_mode=off ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 81920/81920 bytes at offset 2048000
 +80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=80 create_mode=metadata growth_mode=metadata ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 81920/81920 bytes at offset 2048000
 +80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=80 create_mode=metadata growth_mode=falloc ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 81920/81920 bytes at offset 2048000
 +80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=80 create_mode=metadata growth_mode=full ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 81920/81920 bytes at offset 2048000
 +80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=80 create_mode=falloc growth_mode=off ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 81920/81920 bytes at offset 2048000
 +80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=80 create_mode=falloc growth_mode=metadata ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 81920/81920 bytes at offset 2048000
 +80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=80 create_mode=falloc growth_mode=falloc ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 81920/81920 bytes at offset 2048000
 +80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=80 create_mode=falloc growth_mode=full ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 81920/81920 bytes at offset 2048000
 +80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=80 create_mode=full growth_mode=off ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 81920/81920 bytes at offset 2048000
 +80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=80 create_mode=full growth_mode=metadata ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 81920/81920 bytes at offset 2048000
 +80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=80 create_mode=full growth_mode=falloc ---
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
 +Image resized.
 +wrote 2048000/2048000 bytes at offset 0
 +1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +wrote 81920/81920 bytes at offset 2048000
 +80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +
 +--- cluster_size=64k growth_size=80 create_mode=full growth_mode=full ---
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
  Image resized.
  wrote 2048000/2048000 bytes at offset 0
 --
-.8.3.1
+.13.6

The following changes since commit 4c8c1cc544dbd5e2564868e61c5037258e393832:

  Merge remote-tracking branch 'remotes/vivier/tags/m68k-for-2.10-pull-request' into staging (2017-06-22 19:01:58 +0100)

are available in the git repository at:

  git://repo.or.cz/qemu/kevin.git tags/for-upstream

for you to fetch changes up to 1512008812410ca4054506a7c44343088abdd977:

  Merge remote-tracking branch 'mreitz/tags/pull-block-2017-06-23' into queue-block (2017-06-23 14:09:12 +0200)

----------------------------------------------------------------
Block layer patches

----------------------------------------------------------------
Alberto Garcia (9):
      throttle: Update throttle-groups.c documentation
      qcow2: Remove unused Error variable in do_perform_cow()
      qcow2: Use unsigned int for both members of Qcow2COWRegion
      qcow2: Make perform_cow() call do_perform_cow() twice
      qcow2: Split do_perform_cow() into _read(), _encrypt() and _write()
      qcow2: Allow reading both COW regions with only one request
      qcow2: Pass a QEMUIOVector to do_perform_cow_{read,write}()
      qcow2: Merge the writing of the COW regions with the guest data
      qcow2: Use offset_into_cluster() and offset_to_l2_index()

Kevin Wolf (37):
      commit: Fix completion with extra reference
      qemu-iotests: Allow starting new qemu after cleanup
      qemu-iotests: Test exiting qemu with running job
      doc: Document generic -blockdev options
      doc: Document driver-specific -blockdev options
      qed: Use bottom half to resume waiting requests
      qed: Make qed_read_table() synchronous
      qed: Remove callback from qed_read_table()
      qed: Remove callback from qed_read_l2_table()
      qed: Remove callback from qed_find_cluster()
      qed: Make qed_read_backing_file() synchronous
      qed: Make qed_copy_from_backing_file() synchronous
      qed: Remove callback from qed_copy_from_backing_file()
      qed: Make qed_write_header() synchronous
      qed: Remove callback from qed_write_header()
      qed: Make qed_write_table() synchronous
      qed: Remove GenericCB
      qed: Remove callback from qed_write_table()
      qed: Make qed_aio_read_data() synchronous
      qed: Make qed_aio_write_main() synchronous
      qed: Inline qed_commit_l2_update()
      qed: Add return value to qed_aio_write_l1_update()
      qed: Add return value to qed_aio_write_l2_update()
      qed: Add return value to qed_aio_write_main()
      qed: Add return value to qed_aio_write_cow()
      qed: Add return value to qed_aio_write_inplace/alloc()
      qed: Add return value to qed_aio_read/write_data()
      qed: Remove ret argument from qed_aio_next_io()
      qed: Remove recursion in qed_aio_next_io()
      qed: Implement .bdrv_co_readv/writev
      qed: Use CoQueue for serialising allocations
      qed: Simplify request handling
      qed: Use a coroutine for need_check_timer
      qed: Add coroutine_fn to I/O path functions
      qed: Use bdrv_co_* for coroutine_fns
      block: Remove bdrv_aio_readv/writev/flush()
      Merge remote-tracking branch 'mreitz/tags/pull-block-2017-06-23' into queue-block

Manos Pitsidianakis (1):
      block: change variable names in BlockDriverState

Max Reitz (3):
      blkdebug: Catch bs->exact_filename overflow
      blkverify: Catch bs->exact_filename overflow
      block: Do not strcmp() with NULL uri->scheme

Stefan Hajnoczi (10):
      block: count bdrv_co_rw_vmstate() requests
      block: use BDRV_POLL_WHILE() in bdrv_rw_vmstate()
      migration: avoid recursive AioContext locking in save_vmstate()
      migration: use bdrv_drain_all_begin/end() instead bdrv_drain_all()
      virtio-pci: use ioeventfd even when KVM is disabled
      migration: hold AioContext lock for loadvm qemu_fclose()
      qemu-iotests: 068: extract _qemu() function
      qemu-iotests: 068: use -drive/-device instead of -hda
      qemu-iotests: 068: test iothread mode
      qemu-img: don't shadow opts variable in img_dd()

Stephen Bates (1):
      nvme: Add support for Read Data and Write Data in CMBs.

sochin.jiang (1):
      fix: avoid an infinite loop or a dangling pointer problem in img_commit

commit_complete() can't assume that the job is freed immediately after
it calls block_job_completed(); someone else may still be holding
references. In this case, the op blockers on the intermediate nodes make
the graph reconfiguration in the completion code fail.

Call block_job_remove_all_bdrv() manually so that we know for sure that
any blockers on intermediate nodes are given up.

Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
 block/commit.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/block/commit.c b/block/commit.c
index XXXXXXX..XXXXXXX 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -XXX,XX +XXX,XX @@ static void commit_complete(BlockJob *job, void *opaque)
     }
     g_free(s->backing_file_str);
     blk_unref(s->top);
+
+    /* If there is more than one reference to the job (e.g. if called from
+     * block_job_finish_sync()), block_job_completed() won't free it and
+     * therefore the blockers on the intermediate nodes remain. This would
+     * cause bdrv_set_backing_hd() to fail. */
+    block_job_remove_all_bdrv(job);
+
     block_job_completed(&s->common, ret);
     g_free(data);
 
-- 
1.8.3.1

After _cleanup_qemu(), test cases should be able to start the next qemu
process and call _cleanup_qemu() for that one as well. For this to work
cleanly, we need to improve the cleanup so that the second invocation
doesn't try to kill the qemu instances from the first invocation a
second time (which would result in error messages).

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
 tests/qemu-iotests/common.qemu | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tests/qemu-iotests/common.qemu b/tests/qemu-iotests/common.qemu
index XXXXXXX..XXXXXXX 100644
--- a/tests/qemu-iotests/common.qemu
+++ b/tests/qemu-iotests/common.qemu
@@ -XXX,XX +XXX,XX @@ function _cleanup_qemu()
         rm -f "${QEMU_FIFO_IN}_${i}" "${QEMU_FIFO_OUT}_${i}"
         eval "exec ${QEMU_IN[$i]}<&-"   # close file descriptors
         eval "exec ${QEMU_OUT[$i]}<&-"
+
+        unset QEMU_IN[$i]
+        unset QEMU_OUT[$i]
     done
 }
-- 
1.8.3.1

When qemu exits, all running jobs should be cancelled successfully.
Add a test that checks this for every type of block job that currently
exists in qemu.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 tests/qemu-iotests/185     | 206 +++++++++++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/185.out |  59 +++++++++++++
 tests/qemu-iotests/group   |   1 +
 3 files changed, 266 insertions(+)
 create mode 100755 tests/qemu-iotests/185
 create mode 100644 tests/qemu-iotests/185.out

diff --git a/tests/qemu-iotests/185 b/tests/qemu-iotests/185
new file mode 100755
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/tests/qemu-iotests/185
@@ -XXX,XX +XXX,XX @@
+#!/bin/bash
+#
+# Test exiting qemu while jobs are still running
+#
+# Copyright (C) 2017 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+# creator
+owner=kwolf@redhat.com
+
+seq=`basename $0`
+echo "QA output created by $seq"
+
+here=`pwd`
+status=1 # failure is the default!
+
+MIG_SOCKET="${TEST_DIR}/migrate"
+
+_cleanup()
+{
+    rm -f "${TEST_IMG}.mid"
+    rm -f "${TEST_IMG}.copy"
+    _cleanup_test_img
+    _cleanup_qemu
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+. ./common.qemu
+
+_supported_fmt qcow2
+_supported_proto file
+_supported_os Linux
+
+size=64M
+TEST_IMG="${TEST_IMG}.base" _make_test_img $size
+
+echo
+echo === Starting VM ===
+echo
+
+qemu_comm_method="qmp"
+
+_launch_qemu \
+    -drive file="${TEST_IMG}.base",cache=$CACHEMODE,driver=$IMGFMT,id=disk
+h=$QEMU_HANDLE
+_send_qemu_cmd $h "{ 'execute': 'qmp_capabilities' }" 'return'
+
+echo
+echo === Creating backing chain ===
+echo
+
+_send_qemu_cmd $h \
+    "{ 'execute': 'blockdev-snapshot-sync',
+       'arguments': { 'device': 'disk',
+                      'snapshot-file': '$TEST_IMG.mid',
+                      'format': '$IMGFMT',
+                      'mode': 'absolute-paths' } }" \
+    "return"
+
+_send_qemu_cmd $h \
+    "{ 'execute': 'human-monitor-command',
+       'arguments': { 'command-line':
+                      'qemu-io disk \"write 0 4M\"' } }" \
+    "return"
+
+_send_qemu_cmd $h \
+    "{ 'execute': 'blockdev-snapshot-sync',
+       'arguments': { 'device': 'disk',
+                      'snapshot-file': '$TEST_IMG',
+                      'format': '$IMGFMT',
+                      'mode': 'absolute-paths' } }" \
+    "return"
+
+echo
+echo === Start commit job and exit qemu ===
+echo
+
+# Note that the reference output intentionally includes the 'offset' field in
+# BLOCK_JOB_CANCELLED events for all of the following block jobs. They are
+# predictable and any change in the offsets would hint at a bug in the job
+# throttling code.
+#
+# In order to achieve these predictable offsets, all of the following tests
+# use speed=65536. Each job will perform exactly one iteration before it has
+# to sleep at least for a second, which is plenty of time for the 'quit' QMP
+# command to be received (after receiving the command, the rest runs
+# synchronously, so jobs can arbitrarily continue or complete).
+#
+# The buffer size for commit and streaming is 512k (waiting for 8 seconds after
+# the first request), for active commit and mirror it's large enough to cover
+# the full 4M, and for backup it's the qcow2 cluster size, which we know is
+# 64k. As all of these are at least as large as the speed, we are sure that the
+# offset doesn't advance after the first iteration before qemu exits.
+
+_send_qemu_cmd $h \
+    "{ 'execute': 'block-commit',
+       'arguments': { 'device': 'disk',
+                      'base':'$TEST_IMG.base',
+                      'top': '$TEST_IMG.mid',
+                      'speed': 65536 } }" \
+    "return"
+
+_send_qemu_cmd $h "{ 'execute': 'quit' }" "return"
+wait=1 _cleanup_qemu
+
+echo
+echo === Start active commit job and exit qemu ===
+echo
+
+_launch_qemu \
+    -drive file="${TEST_IMG}",cache=$CACHEMODE,driver=$IMGFMT,id=disk
+h=$QEMU_HANDLE
+_send_qemu_cmd $h "{ 'execute': 'qmp_capabilities' }" 'return'
+
+_send_qemu_cmd $h \
+    "{ 'execute': 'block-commit',
+       'arguments': { 'device': 'disk',
+                      'base':'$TEST_IMG.base',
+                      'speed': 65536 } }" \
+    "return"
+
+_send_qemu_cmd $h "{ 'execute': 'quit' }" "return"
+wait=1 _cleanup_qemu
+
+echo
+echo === Start mirror job and exit qemu ===
+echo
+
+_launch_qemu \
+    -drive file="${TEST_IMG}",cache=$CACHEMODE,driver=$IMGFMT,id=disk
+h=$QEMU_HANDLE
+_send_qemu_cmd $h "{ 'execute': 'qmp_capabilities' }" 'return'
+
+_send_qemu_cmd $h \
+    "{ 'execute': 'drive-mirror',
+       'arguments': { 'device': 'disk',
+                      'target': '$TEST_IMG.copy',
+                      'format': '$IMGFMT',
+                      'sync': 'full',
+                      'speed': 65536 } }" \
+    "return"
+
+_send_qemu_cmd $h "{ 'execute': 'quit' }" "return"
+wait=1 _cleanup_qemu
+
+echo
+echo === Start backup job and exit qemu ===
+echo
+
+_launch_qemu \
+    -drive file="${TEST_IMG}",cache=$CACHEMODE,driver=$IMGFMT,id=disk
+h=$QEMU_HANDLE
+_send_qemu_cmd $h "{ 'execute': 'qmp_capabilities' }" 'return'
+
+_send_qemu_cmd $h \
+    "{ 'execute': 'drive-backup',
+       'arguments': { 'device': 'disk',
+                      'target': '$TEST_IMG.copy',
+                      'format': '$IMGFMT',
+                      'sync': 'full',
+                      'speed': 65536 } }" \
+    "return"
+
+_send_qemu_cmd $h "{ 'execute': 'quit' }" "return"
+wait=1 _cleanup_qemu
+
+echo
+echo === Start streaming job and exit qemu ===
+echo
+
+_launch_qemu \
+    -drive file="${TEST_IMG}",cache=$CACHEMODE,driver=$IMGFMT,id=disk
+h=$QEMU_HANDLE
+_send_qemu_cmd $h "{ 'execute': 'qmp_capabilities' }" 'return'
+
+_send_qemu_cmd $h \
+    "{ 'execute': 'block-stream',
+       'arguments': { 'device': 'disk',
+                      'speed': 65536 } }" \
+    "return"
+
+_send_qemu_cmd $h "{ 'execute': 'quit' }" "return"
+wait=1 _cleanup_qemu
+
+_check_test_img
+
+# success, all done
+echo "*** done"
+rm -f $seq.full
+status=0
diff --git a/tests/qemu-iotests/185.out b/tests/qemu-iotests/185.out
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/tests/qemu-iotests/185.out
@@ -XXX,XX +XXX,XX @@
+QA output created by 185
+Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=67108864
+
+=== Starting VM ===
+
+{"return": {}}
+
+=== Creating backing chain ===
+
+Formatting 'TEST_DIR/t.qcow2.mid', fmt=qcow2 size=67108864 backing_file=TEST_DIR/t.qcow2.base backing_fmt=qcow2 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
+{"return": {}}
+wrote 4194304/4194304 bytes at offset 0
+4 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+{"return": ""}
+Formatting 'TEST_DIR/t.qcow2', fmt=qcow2 size=67108864 backing_file=TEST_DIR/t.qcow2.mid backing_fmt=qcow2 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
+{"return": {}}
+
+=== Start commit job and exit qemu ===
+
+{"return": {}}
+{"return": {}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "disk", "len": 67108864, "offset": 524288, "speed": 65536, "type": "commit"}}
+
+=== Start active commit job and exit qemu ===
+
+{"return": {}}
+{"return": {}}
+{"return": {}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "disk", "len": 4194304, "offset": 4194304, "speed": 65536, "type": "commit"}}
+
+=== Start mirror job and exit qemu ===
+
+{"return": {}}
+Formatting 'TEST_DIR/t.qcow2.copy', fmt=qcow2 size=67108864 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
+{"return": {}}
+{"return": {}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "disk", "len": 4194304, "offset": 4194304, "speed": 65536, "type": "mirror"}}
+
+=== Start backup job and exit qemu ===
+
+{"return": {}}
+Formatting 'TEST_DIR/t.qcow2.copy', fmt=qcow2 size=67108864 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
+{"return": {}}
+{"return": {}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "disk", "len": 67108864, "offset": 65536, "speed": 65536, "type": "backup"}}
+
+=== Start streaming job and exit qemu ===
+
+{"return": {}}
+{"return": {}}
+{"return": {}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "disk", "len": 67108864, "offset": 524288, "speed": 65536, "type": "stream"}}
+No errors were found on the image.
+*** done
diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
index XXXXXXX..XXXXXXX 100644
--- a/tests/qemu-iotests/group
+++ b/tests/qemu-iotests/group
@@ -XXX,XX +XXX,XX @@
 181 rw auto migration
 182 rw auto quick
 183 rw auto migration
+185 rw auto
-- 
1.8.3.1

From: Stefan Hajnoczi <stefanha@redhat.com>

Calling aio_poll() directly may have been fine previously, but this is
the future, man!  The difference between an aio_poll() loop and
BDRV_POLL_WHILE() is that BDRV_POLL_WHILE() releases the AioContext
around aio_poll().

This allows the IOThread to run fd handlers or BHs to complete the
request.  Failure to release the AioContext causes deadlocks.

Using BDRV_POLL_WHILE() partially fixes a 'savevm' hang with -object
iothread.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/io.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/block/io.c b/block/io.c
index XXXXXXX..XXXXXXX 100644
--- a/block/io.c
+++ b/block/io.c
@@ -XXX,XX +XXX,XX @@ bdrv_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
         Coroutine *co = qemu_coroutine_create(bdrv_co_rw_vmstate_entry, &data);
 
         bdrv_coroutine_enter(bs, co);
-        while (data.ret == -EINPROGRESS) {
-            aio_poll(bdrv_get_aio_context(bs), true);
-        }
+        BDRV_POLL_WHILE(bs, data.ret == -EINPROGRESS);
         return data.ret;
     }
 }
-- 
1.8.3.1

From: Stefan Hajnoczi <stefanha@redhat.com>

AioContext was designed to allow nested acquire/release calls.  It uses
a recursive mutex so callers don't need to worry about nesting...or so
we thought.

BDRV_POLL_WHILE() is used to wait for block I/O requests.  It releases
the AioContext temporarily around aio_poll().  This gives IOThreads a
chance to acquire the AioContext to process I/O completions.

It turns out that recursive locking and BDRV_POLL_WHILE() don't mix.
BDRV_POLL_WHILE() only releases the AioContext once, so the IOThread
will not be able to acquire the AioContext if it was acquired
multiple times.
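
A minimal sketch of why a single release is not enough under nested locking.
ToyRecursiveLock and the toy_* functions are hypothetical and only count
acquisitions; they are not the AioContext implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a recursive lock owned by one thread. */
typedef struct {
    int depth;  /* how many times the owner has acquired the lock */
} ToyRecursiveLock;

static void toy_acquire(ToyRecursiveLock *l)
{
    l->depth++;
}

static void toy_release(ToyRecursiveLock *l)
{
    assert(l->depth > 0);
    l->depth--;
}

/* Another thread can take the lock only when the owner holds it zero times. */
static bool toy_available_to_others(const ToyRecursiveLock *l)
{
    return l->depth == 0;
}

/* Models a poll helper that drops the lock exactly once around the wait.
 * Returns whether another thread could have acquired the lock meanwhile. */
static bool toy_poll_once(ToyRecursiveLock *l)
{
    bool other_thread_could_run;

    toy_release(l);
    other_thread_could_run = toy_available_to_others(l);
    toy_acquire(l);
    return other_thread_could_run;
}
```

With one acquisition the helper reports that another thread could run; with
two nested acquisitions it reports that it could not, which corresponds to the
hang.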

Instead of trying to release AioContext n times in BDRV_POLL_WHILE(),
this patch simply avoids nested locking in save_vmstate().  It's the
simplest fix and we should step back to consider the big picture with
all the recent changes to block layer threading.

This patch is the final fix to solve 'savevm' hanging with -object
iothread.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 migration/savevm.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/migration/savevm.c b/migration/savevm.c
index XXXXXXX..XXXXXXX 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -XXX,XX +XXX,XX @@ int save_snapshot(const char *name, Error **errp)
         goto the_end;
     }
 
+    /* The bdrv_all_create_snapshot() call that follows acquires the AioContext
+     * for itself.  BDRV_POLL_WHILE() does not support nested locking because
+     * it only releases the lock once.  Therefore synchronous I/O will deadlock
+     * unless we release the AioContext before bdrv_all_create_snapshot().
+     */
+    aio_context_release(aio_context);
+    aio_context = NULL;
+
     ret = bdrv_all_create_snapshot(sn, bs, vm_state_size, &bs);
     if (ret < 0) {
         error_setg(errp, "Error while creating snapshot on '%s'",
@@ -XXX,XX +XXX,XX @@ int save_snapshot(const char *name, Error **errp)
     ret = 0;
 
  the_end:
-    aio_context_release(aio_context);
+    if (aio_context) {
+        aio_context_release(aio_context);
+    }
     if (saved_vm_running) {
         vm_start();
     }
-- 
1.8.3.1

From: Stefan Hajnoczi <stefanha@redhat.com>

blk/bdrv_drain_all() only takes effect for a single instant and then
resumes block jobs, guest devices, and other external clients like the
NBD server.  This can be handy when performing a synchronous drain
before terminating the program, for example.

Monitor commands usually need to quiesce I/O across an entire code
region so blk/bdrv_drain_all() is not suitable.  They must use
bdrv_drain_all_begin/end() to mark the region.  This prevents new I/O
requests from slipping in or worse - block jobs completing and modifying
the graph.
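
The begin/end pattern can be sketched as a counter that stays raised for the
whole region, with every exit path passing through the matching end call.
This is a simplified model with hypothetical toy_* names; it does not
reproduce what bdrv_drain_all_begin/end() do internally.

```c
#include <assert.h>
#include <stdbool.h>

/* Number of drained sections currently open (toy model). */
static int toy_quiesce_counter;

static void toy_drain_begin(void)
{
    toy_quiesce_counter++;
}

static void toy_drain_end(void)
{
    assert(toy_quiesce_counter > 0);
    toy_quiesce_counter--;
}

/* New I/O may start only while no drained section is open. */
static bool toy_new_io_allowed(void)
{
    return toy_quiesce_counter == 0;
}

/* Models a monitor command with two steps that may fail.
 * fail_step: 0 = succeed, 1 or 2 = fail at that step. */
static int toy_monitor_command(int fail_step)
{
    int ret = 0;

    toy_drain_begin();

    if (fail_step == 1) {
        ret = -1;
        goto out;
    }

    /* I/O stays quiesced between the steps, not just for an instant. */
    assert(!toy_new_io_allowed());

    if (fail_step == 2) {
        ret = -2;
        goto out;
    }

out:
    /* Every path, including the error paths, ends the drained section. */
    toy_drain_end();
    return ret;
}
```

Routing the error paths through a single label mirrors the way the patch
replaces early returns with a jump to a cleanup label.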

I audited other blk/bdrv_drain_all() callers but did not find anything
that needs a similar fix.  This patch fixes the savevm/loadvm commands.
Although I haven't encountered a real world issue this makes the code
safer.

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 migration/savevm.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/migration/savevm.c b/migration/savevm.c
index XXXXXXX..XXXXXXX 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -XXX,XX +XXX,XX @@ int save_snapshot(const char *name, Error **errp)
     }
     vm_stop(RUN_STATE_SAVE_VM);
 
+    bdrv_drain_all_begin();
+
     aio_context_acquire(aio_context);
 
     memset(sn, 0, sizeof(*sn));
@@ -XXX,XX +XXX,XX @@ int save_snapshot(const char *name, Error **errp)
     if (aio_context) {
         aio_context_release(aio_context);
     }
+
+    bdrv_drain_all_end();
+
     if (saved_vm_running) {
         vm_start();
     }
@@ -XXX,XX +XXX,XX @@ int load_snapshot(const char *name, Error **errp)
     }
 
     /* Flush all IO requests so they don't interfere with the new state.  */
-    bdrv_drain_all();
+    bdrv_drain_all_begin();
 
     ret = bdrv_all_goto_snapshot(name, &bs);
     if (ret < 0) {
         error_setg(errp, "Error %d while activating snapshot '%s' on '%s'",
                      ret, name, bdrv_get_device_name(bs));
-        return ret;
+        goto err_drain;
     }
 
     /* restore the VM state */
     f = qemu_fopen_bdrv(bs_vm_state, 0);
     if (!f) {
         error_setg(errp, "Could not open VM state file");
-        return -EINVAL;
+        ret = -EINVAL;
+        goto err_drain;
     }
 
     qemu_system_reset(SHUTDOWN_CAUSE_NONE);
@@ -XXX,XX +XXX,XX @@ int load_snapshot(const char *name, Error **errp)
     ret = qemu_loadvm_state(f);
     aio_context_release(aio_context);
 
+    bdrv_drain_all_end();
+
     migration_incoming_state_destroy();
     if (ret < 0) {
         error_setg(errp, "Error %d while loading VM state", ret);
@@ -XXX,XX +XXX,XX @@ int load_snapshot(const char *name, Error **errp)
     }
 
     return 0;
+
+err_drain:
+    bdrv_drain_all_end();
+    return ret;
 }
 
 void vmstate_register_ram(MemoryRegion *mr, DeviceState *dev)
-- 
1.8.3.1

This adds documentation for the -blockdev options that apply to all
nodes independent of the block driver used.

All options that are shared by -blockdev and -drive are now explained in
the section for -blockdev. The documentation of -drive mentions that all
-blockdev options are accepted as well.
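
For illustration, the mapping from the -drive cache= shortcut to the three
cache options, as listed in the table this patch adds to the documentation,
can be written as a lookup table.  The struct and function names are
hypothetical; this is not how QEMU parses the option.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* One row of the cache mode table from the documentation. */
typedef struct {
    const char *mode;
    bool writeback;  /* cache.writeback */
    bool direct;     /* cache.direct */
    bool no_flush;   /* cache.no-flush */
} ToyCacheMode;

static const ToyCacheMode toy_cache_modes[] = {
    { "writeback",    true,  false, false },
    { "none",         true,  true,  false },
    { "writethrough", false, false, false },
    { "directsync",   false, true,  false },
    { "unsafe",       true,  false, true  },
};

/* Returns the table row for a mode name, or NULL if the name is unknown. */
static const ToyCacheMode *toy_lookup_cache_mode(const char *mode)
{
    size_t n = sizeof(toy_cache_modes) / sizeof(toy_cache_modes[0]);

    assert(mode != NULL);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(toy_cache_modes[i].mode, mode) == 0) {
            return &toy_cache_modes[i];
        }
    }
    return NULL;
}
```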

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
 qemu-options.hx | 108 +++++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 79 insertions(+), 29 deletions(-)

diff --git a/qemu-options.hx b/qemu-options.hx
index XXXXXXX..XXXXXXX 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -XXX,XX +XXX,XX @@ DEF("blockdev", HAS_ARG, QEMU_OPTION_blockdev,
     "          [,read-only=on|off][,detect-zeroes=on|off|unmap]\n"
     "          [,driver specific parameters...]\n"
     "                configure a block backend\n", QEMU_ARCH_ALL)
+STEXI
+@item -blockdev @var{option}[,@var{option}[,@var{option}[,...]]]
+@findex -blockdev
+
+Define a new block driver node.
+
+@table @option
+@item Valid options for any block driver node:
+
+@table @code
+@item driver
+Specifies the block driver to use for the given node.
+@item node-name
+This defines the name of the block driver node by which it will be referenced
+later. The name must be unique, i.e. it must not match the name of a different
+block driver node, or (if you use @option{-drive} as well) the ID of a drive.
+
+If no node name is specified, it is automatically generated. The generated node
+name is not intended to be predictable and changes between QEMU invocations.
+For the top level, an explicit node name must be specified.
+@item read-only
+Open the node read-only. Guest write attempts will fail.
+@item cache.direct
+The host page cache can be avoided with @option{cache.direct=on}. This will
+attempt to do disk IO directly to the guest's memory. QEMU may still perform an
+internal copy of the data.
+@item cache.no-flush
+In case you don't care about data integrity over host failures, you can use
+@option{cache.no-flush=on}. This option tells QEMU that it never needs to write
+any data to the disk but can instead keep things in cache. If anything goes
+wrong, like your host losing power, the disk storage getting disconnected
+accidentally, etc. your image will most probably be rendered unusable.
+@item discard=@var{discard}
+@var{discard} is one of "ignore" (or "off") or "unmap" (or "on") and controls
+whether @code{discard} (also known as @code{trim} or @code{unmap}) requests are
+ignored or passed to the filesystem. Some machine types may not support
+discard requests.
+@item detect-zeroes=@var{detect-zeroes}
+@var{detect-zeroes} is "off", "on" or "unmap" and enables the automatic
+conversion of plain zero writes by the OS to driver specific optimized
+zero write commands. You may even choose "unmap" if @var{discard} is set
+to "unmap" to allow a zero write to be converted to an @code{unmap} operation.
+@end table
+
+@end table
+
+ETEXI
 
 DEF("drive", HAS_ARG, QEMU_OPTION_drive,
     "-drive [file=file][,if=type][,bus=n][,unit=m][,media=d][,index=i]\n"
@@ -XXX,XX +XXX,XX @@ STEXI
 @item -drive @var{option}[,@var{option}[,@var{option}[,...]]]
 @findex -drive
 
-Define a new drive. Valid options are:
+Define a new drive. This includes creating a block driver node (the backend) as
+well as a guest device, and is mostly a shortcut for defining the corresponding
+@option{-blockdev} and @option{-device} options.
+
+@option{-drive} accepts all options that are accepted by @option{-blockdev}. In
+addition, it knows the following options:
 
 @table @option
 @item file=@var{file}
@@ -XXX,XX +XXX,XX @@ These options have the same definition as they have in @option{-hdachs}.
 @var{snapshot} is "on" or "off" and controls snapshot mode for the given drive
 (see @option{-snapshot}).
 @item cache=@var{cache}
-@var{cache} is "none", "writeback", "unsafe", "directsync" or "writethrough" and controls how the host cache is used to access block data.
+@var{cache} is "none", "writeback", "unsafe", "directsync" or "writethrough"
+and controls how the host cache is used to access block data. This is a
+shortcut that sets the @option{cache.direct} and @option{cache.no-flush}
+options (as in @option{-blockdev}), and additionally @option{cache.writeback},
+which provides a default for the @option{write-cache} option of block guest
+devices (as in @option{-device}). The modes correspond to the following
+settings:
+
+@c Our texi2pod.pl script doesn't support @multitable, so fall back to using
+@c plain ASCII art (well, UTF-8 art really). This looks okay both in the manpage
+@c and the HTML output.
+@example
+@             │ cache.writeback   cache.direct   cache.no-flush
+─────────────┼─────────────────────────────────────────────────
+writeback    │ on                off            off
+none         │ on                on             off
+writethrough │ off               off            off
+directsync   │ off               on             off
+unsafe       │ on                off            on
+@end example
+
+The default mode is @option{cache=writeback}.
+
 @item aio=@var{aio}
 @var{aio} is "threads", or "native" and selects between pthread based disk I/O and native Linux AIO.
-@item discard=@var{discard}
-@var{discard} is one of "ignore" (or "off") or "unmap" (or "on") and controls whether @dfn{discard} (also known as @dfn{trim} or @dfn{unmap}) requests are ignored or passed to the filesystem.  Some machine types may not support discard requests.
 @item format=@var{format}
 Specify which disk @var{format} will be used rather than detecting
 the format.  Can be used to specify format=raw to avoid interpreting
@@ -XXX,XX +XXX,XX @@ Specify which @var{action} to take on write and read errors. Valid actions are:
 "report" (report the error to the guest), "enospc" (pause QEMU only if the
 host disk is full; report the error to the guest otherwise).
 The default setting is @option{werror=enospc} and @option{rerror=report}.
-@item readonly
-Open drive @option{file} as read-only. Guest write attempts will fail.
 @item copy-on-read=@var{copy-on-read}
 @var{copy-on-read} is "on" or "off" and enables whether to copy read backing
 file sectors into the image file.
-@item detect-zeroes=@var{detect-zeroes}
-@var{detect-zeroes} is "off", "on" or "unmap" and enables the automatic
-conversion of plain zero writes by the OS to driver specific optimized
-zero write commands. You may even choose "unmap" if @var{discard} is set
-to "unmap" to allow a zero write to be converted to an UNMAP operation.
 @item bps=@var{b},bps_rd=@var{r},bps_wr=@var{w}
 Specify bandwidth throttling limits in bytes per second, either for all request
 types or for reads or writes only.  Small values can lead to timeouts or hangs
@@ -XXX,XX +XXX,XX @@ prevent guests from circumventing throttling limits by using many small disks
 instead of a single larger disk.
 @end table
 
-By default, the @option{cache=writeback} mode is used. It will report data
+By default, the @option{cache.writeback=on} mode is used. It will report data
 writes as completed as soon as the data is present in the host page cache.
 This is safe as long as your guest OS makes sure to correctly flush disk caches
 where needed. If your guest OS does not handle volatile disk write caches
 correctly and your host crashes or loses power, then the guest may experience
 data corruption.
 
-For such guests, you should consider using @option{cache=writethrough}. This
+For such guests, you should consider using @option{cache.writeback=off}. This
 means that the host page cache will be used to read and write data, but write
 notification will be sent to the guest only after QEMU has made sure to flush
 each write to the disk. Be aware that this has a major impact on performance.
 
-The host page cache can be avoided entirely with @option{cache=none}.  This will
-attempt to do disk IO directly to the guest's memory.  QEMU may still perform
-an internal copy of the data. Note that this is considered a writeback mode and
-the guest OS must handle the disk write cache correctly in order to avoid data
-corruption on host crashes.
-
-The host page cache can be avoided while only sending write notifications to
-the guest when the data has been flushed to the disk using
-@option{cache=directsync}.
-
-In case you don't care about data integrity over host failures, use
-@option{cache=unsafe}. This option tells QEMU that it never needs to write any
-data to the disk but can instead keep things in cache. If anything goes wrong,
-like your host losing power, the disk storage getting disconnected accidentally,
-etc. your image will most probably be rendered unusable.   When using
-the @option{-snapshot} option, unsafe caching is always used.
+When using the @option{-snapshot} option, unsafe caching is always used.
 
 Copy-on-read avoids accessing the same backing file sectors repeatedly and is
 useful when the backing file is over a slow network.  By default copy-on-read
-- 
1.8.3.1

This documents the driver-specific options for the raw, qcow2 and file
block drivers for the man page. For everything else, we refer to the
QAPI documentation.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
 qemu-options.hx | 115 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 114 insertions(+), 1 deletion(-)

diff --git a/qemu-options.hx b/qemu-options.hx
index XXXXXXX..XXXXXXX 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -XXX,XX +XXX,XX @@ STEXI
 @item -blockdev @var{option}[,@var{option}[,@var{option}[,...]]]
 @findex -blockdev
 
-Define a new block driver node.
+Define a new block driver node. Some of the options apply to all block drivers,
+other options are only accepted for a specific block driver. See below for a
+list of generic options and options for the most common block drivers.
+
+Options that expect a reference to another node (e.g. @code{file}) can be
+given in two ways. Either you specify the node name of an already existing node
+(file=@var{node-name}), or you define a new node inline, adding options
+for the referenced node after a dot (file.filename=@var{path},file.aio=native).
+
+A block driver node created with @option{-blockdev} can be used for a guest
+device by specifying its node name for the @code{drive} property in a
+@option{-device} argument that defines a block device.
 
 @table @option
 @item Valid options for any block driver node:
@@ -XXX,XX +XXX,XX @@ zero write commands. You may even choose "unmap" if @var{discard} is set
 to "unmap" to allow a zero write to be converted to an @code{unmap} operation.
 @end table
 
+@item Driver-specific options for @code{file}
+
+This is the protocol-level block driver for accessing regular files.
+
+@table @code
+@item filename
+The path to the image file in the local filesystem
+@item aio
+Specifies the AIO backend (threads/native, default: threads)
+@end table
+Example:
+@example
+-blockdev driver=file,node-name=disk,filename=disk.img
+@end example
+
+@item Driver-specific options for @code{raw}
+
+This is the image format block driver for raw images. It is usually
+stacked on top of a protocol level block driver such as @code{file}.
+
+@table @code
+@item file
+Reference to or definition of the data source block driver node
+(e.g. a @code{file} driver node)
+@end table
+Example 1:
+@example
+-blockdev driver=file,node-name=disk_file,filename=disk.img
+-blockdev driver=raw,node-name=disk,file=disk_file
+@end example
+Example 2:
+@example
+-blockdev driver=raw,node-name=disk,file.driver=file,file.filename=disk.img
+@end example
+
+@item Driver-specific options for @code{qcow2}
+
+This is the image format block driver for qcow2 images. It is usually
+stacked on top of a protocol level block driver such as @code{file}.
+
+@table @code
+@item file
+Reference to or definition of the data source block driver node
+(e.g. a @code{file} driver node)
+
+@item backing
+Reference to or definition of the backing file block device (default is taken
+from the image file). It is allowed to pass an empty string here in order to
+disable the default backing file.
+
+@item lazy-refcounts
+Whether to enable the lazy refcounts feature (on/off; default is taken from the
+image file)
+
+@item cache-size
+The maximum total size of the L2 table and refcount block caches in bytes
+(default: 1048576 bytes or 8 clusters, whichever is larger)
+
+@item l2-cache-size
+The maximum size of the L2 table cache in bytes
+(default: 4/5 of the total cache size)
+
+@item refcount-cache-size
+The maximum size of the refcount block cache in bytes
+(default: 1/5 of the total cache size)
+
+@item cache-clean-interval
+Clean unused entries in the L2 and refcount caches. The interval is in seconds.
+The default value is 0 and it disables this feature.
+
+@item pass-discard-request
+Whether discard requests to the qcow2 device should be forwarded to the data
+source (on/off; default: on if discard=unmap is specified, off otherwise)
+
+@item pass-discard-snapshot
+Whether discard requests for the data source should be issued when a snapshot
+operation (e.g. deleting a snapshot) frees clusters in the qcow2 file (on/off;
+default: on)
+
+@item pass-discard-other
+Whether discard requests for the data source should be issued on other
+occasions where a cluster gets freed (on/off; default: off)
+
+@item overlap-check
+Which overlap checks to perform for writes to the image
+(none/constant/cached/all; default: cached). For details or finer
+granularity control refer to the QAPI documentation of @code{blockdev-add}.
+@end table
+
+Example 1:
+@example
+-blockdev driver=file,node-name=my_file,filename=/tmp/disk.qcow2
+-blockdev driver=qcow2,node-name=hda,file=my_file,overlap-check=none,cache-size=16777216
+@end example
+Example 2:
+@example
+-blockdev driver=qcow2,node-name=disk,file.driver=http,file.filename=http://example.com/image.qcow2
+@end example
+
+@item Driver-specific options for other drivers
+Please refer to the QAPI documentation of the @code{blockdev-add} QMP command.
+
 @end table
 
 ETEXI
-- 
1.8.3.1

From: Alberto Garcia <berto@igalia.com>

There used to be throttle_timers_{detach,attach}_aio_context() calls
in bdrv_set_aio_context(), but since 7ca7f0f6db1fedd28d490795d778cf239
they are now in blk_set_aio_context().

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/throttle-groups.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/throttle-groups.c b/block/throttle-groups.c
index XXXXXXX..XXXXXXX 100644
--- a/block/throttle-groups.c
+++ b/block/throttle-groups.c
@@ -XXX,XX +XXX,XX @@
  * Again, all this is handled internally and is mostly transparent to
  * the outside. The 'throttle_timers' field however has an additional
  * constraint because it may be temporarily invalid (see for example
- * bdrv_set_aio_context()). Therefore in this file a thread will
+ * blk_set_aio_context()). Therefore in this file a thread will
  * access some other BlockBackend's timers only after verifying that
  * that BlockBackend has throttled requests in the queue.
  */
-- 
1.8.3.1

From: Stefan Hajnoczi <stefanha@redhat.com>

Old kvm.ko versions only supported a tiny number of ioeventfds so
virtio-pci avoids ioeventfds when kvm_has_many_ioeventfds() returns 0.

Do not check kvm_has_many_ioeventfds() when KVM is disabled since it
always returns 0.  Since commit 8c56c1a592b5092d91da8d8943c17777d6462a6f
("memory: emulate ioeventfd") it has been possible to use ioeventfds in
qtest or TCG mode.
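
The change in the condition can be checked with a small truth-table sketch.
The toy_* functions are hypothetical stand-ins; toy_probe() only models the
statement above that the check returns 0 whenever KVM is disabled.

```c
#include <assert.h>
#include <stdbool.h>

/* Models the probe: it reports false whenever KVM is disabled. */
static bool toy_probe(bool kvm_on, bool kernel_supports_many)
{
    return kvm_on && kernel_supports_many;
}

/* Old condition: ioeventfds are kept only if the probe succeeds. */
static bool toy_keep_ioeventfd_old(bool kvm_on, bool probe_result)
{
    (void)kvm_on;
    return probe_result;
}

/* New condition: the probe result matters only when KVM is enabled. */
static bool toy_keep_ioeventfd_new(bool kvm_on, bool probe_result)
{
    return !(kvm_on && !probe_result);
}

/* Sanity check for the case the patch targets: KVM disabled (e.g. TCG). */
static void toy_check_tcg_case(void)
{
    bool probe = toy_probe(false, true);

    assert(!toy_keep_ioeventfd_old(false, probe));
    assert(toy_keep_ioeventfd_new(false, probe));
}
```

Only the KVM-disabled rows differ between the two conditions; with KVM enabled
both give the same answer.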

This patch makes -device virtio-blk-pci,iothread=iothread0 work even
when KVM is disabled.

I have tested that virtio-blk-pci works under TCG both with and without
iothread.

Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 hw/virtio/virtio-pci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/virtio/virtio-pci.c
+++ b/hw/virtio/virtio-pci.c
@@ -XXX,XX +XXX,XX @@ static void virtio_pci_realize(PCIDevice *pci_dev, Error **errp)
     bool pcie_port = pci_bus_is_express(pci_dev->bus) &&
                      !pci_bus_is_root(pci_dev->bus);
 
-    if (!kvm_has_many_ioeventfds()) {
+    if (kvm_enabled() && !kvm_has_many_ioeventfds()) {
         proxy->flags &= ~VIRTIO_PCI_FLAG_USE_IOEVENTFD;
     }
 
-- 
1.8.3.1

From: Stefan Hajnoczi <stefanha@redhat.com>

migration_incoming_state_destroy() uses qemu_fclose() on the vmstate
file.  Make sure to call it inside an AioContext acquire/release region.

This fixes a 'qemu: qemu_mutex_unlock: Operation not permitted' abort
in loadvm.

This patch closes the vmstate file before ending the drained region.
Previously we closed the vmstate file after ending the drained region.
The order does not matter.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 migration/savevm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/migration/savevm.c b/migration/savevm.c
index XXXXXXX..XXXXXXX 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -XXX,XX +XXX,XX @@ int load_snapshot(const char *name, Error **errp)
 
     aio_context_acquire(aio_context);
     ret = qemu_loadvm_state(f);
+    migration_incoming_state_destroy();
     aio_context_release(aio_context);
 
     bdrv_drain_all_end();
 
-    migration_incoming_state_destroy();
     if (ret < 0) {
         error_setg(errp, "Error %d while loading VM state", ret);
         return ret;
-- 
1.8.3.1

From: Stefan Hajnoczi <stefanha@redhat.com>

Avoid duplicating the QEMU command-line.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 tests/qemu-iotests/068 | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/tests/qemu-iotests/068 b/tests/qemu-iotests/068
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/068
+++ b/tests/qemu-iotests/068
@@ -XXX,XX +XXX,XX @@ case "$QEMU_DEFAULT_MACHINE" in
       ;;
 esac
 
-# Give qemu some time to boot before saving the VM state
-bash -c 'sleep 1; echo -e "savevm 0\nquit"' |\
-    $QEMU $platform_parm -nographic -monitor stdio -serial none -hda "$TEST_IMG" |\
+_qemu()
+{
+    $QEMU $platform_parm -nographic -monitor stdio -serial none -hda "$TEST_IMG" \
+          "$@" |\
     _filter_qemu | _filter_hmp
+}
+
+# Give qemu some time to boot before saving the VM state
+bash -c 'sleep 1; echo -e "savevm 0\nquit"' | _qemu
 # Now try to continue from that VM state (this should just work)
-echo quit |\
-    $QEMU $platform_parm -nographic -monitor stdio -serial none -hda "$TEST_IMG" -loadvm 0 |\
-    _filter_qemu | _filter_hmp
+echo quit | _qemu -loadvm 0
 
 # success, all done
 echo "*** done"
-- 
1.8.3.1

From: Stefan Hajnoczi <stefanha@redhat.com>

Perform the savevm/loadvm test with both iothread on and off.  This
covers the recently found savevm/loadvm hang when iothread is enabled.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 tests/qemu-iotests/068     | 23 ++++++++++++++---------
 tests/qemu-iotests/068.out | 11 ++++++++++-
 2 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/tests/qemu-iotests/068 b/tests/qemu-iotests/068
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/068
+++ b/tests/qemu-iotests/068
@@ -XXX,XX +XXX,XX @@ _supported_os Linux
 IMGOPTS="compat=1.1"
 IMG_SIZE=128K
 
-echo
-echo "=== Saving and reloading a VM state to/from a qcow2 image ==="
-echo
-_make_test_img $IMG_SIZE
-
 case "$QEMU_DEFAULT_MACHINE" in
   s390-ccw-virtio)
       platform_parm="-no-shutdown"
@@ -XXX,XX +XXX,XX @@ _qemu()
     _filter_qemu | _filter_hmp
 }
 
-# Give qemu some time to boot before saving the VM state
-bash -c 'sleep 1; echo -e "savevm 0\nquit"' | _qemu
-# Now try to continue from that VM state (this should just work)
-echo quit | _qemu -loadvm 0
+for extra_args in \
+    "" \
+    "-object iothread,id=iothread0 -set device.hba0.iothread=iothread0"; do
+    echo
+    echo "=== Saving and reloading a VM state to/from a qcow2 image ($extra_args) ==="
+    echo
+
+    _make_test_img $IMG_SIZE
+
+    # Give qemu some time to boot before saving the VM state
+    bash -c 'sleep 1; echo -e "savevm 0\nquit"' | _qemu $extra_args
+    # Now try to continue from that VM state (this should just work)
+    echo quit | _qemu $extra_args -loadvm 0
+done
 
 # success, all done
 echo "*** done"
diff --git a/tests/qemu-iotests/068.out b/tests/qemu-iotests/068.out
index XXXXXXX..XXXXXXX 100644
--- a/tests/qemu-iotests/068.out
+++ b/tests/qemu-iotests/068.out
@@ -XXX,XX +XXX,XX @@
 QA output created by 068
 
-=== Saving and reloading a VM state to/from a qcow2 image ===
+=== Saving and reloading a VM state to/from a qcow2 image () ===
+
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=131072
+QEMU X.Y.Z monitor - type 'help' for more information
+(qemu) savevm 0
+(qemu) quit
+QEMU X.Y.Z monitor - type 'help' for more information
+(qemu) quit
+
+=== Saving and reloading a VM state to/from a qcow2 image (-object iothread,id=iothread0 -set device.hba0.iothread=iothread0) ===
 
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=131072
 QEMU X.Y.Z monitor - type 'help' for more information
-- 
1.8.3.1

From: Stephen Bates <sbates@raithlin.com>

Add the ability for the NVMe model to support both the RDS and WDS
modes in the Controller Memory Buffer.

Although not currently supported in the upstream Linux kernel, a fork
with support exists [1], and user-space test programs that build on
it also exist [2].

This is useful for testing CMB functionality in preparation for real
CMB-enabled NVMe devices (coming soon).

[1] https://github.com/sbates130272/linux-p2pmem
[2] https://github.com/sbates130272/p2pmem-test

Signed-off-by: Stephen Bates <sbates@raithlin.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 hw/block/nvme.c | 83 +++++++++++++++++++++++++++++++++++++++------------------
 hw/block/nvme.h |  1 +
 2 files changed, 58 insertions(+), 26 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -XXX,XX +XXX,XX @@
  *              cmb_size_mb=<cmb_size_mb[optional]>
  *
  * Note cmb_size_mb denotes size of CMB in MB. CMB is assumed to be at
- * offset 0 in BAR2 and supports SQS only for now.
+ * offset 0 in BAR2 and supports only WDS, RDS and SQS for now.
  */
 
 #include "qemu/osdep.h"
@@ -XXX,XX +XXX,XX @@ static void nvme_isr_notify(NvmeCtrl *n, NvmeCQueue *cq)
     }
 }
 
-static uint16_t nvme_map_prp(QEMUSGList *qsg, uint64_t prp1, uint64_t prp2,
-    uint32_t len, NvmeCtrl *n)
+static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1,
+                             uint64_t prp2, uint32_t len, NvmeCtrl *n)
 {
     hwaddr trans_len = n->page_size - (prp1 % n->page_size);
     trans_len = MIN(len, trans_len);
@@ -XXX,XX +XXX,XX @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, uint64_t prp1, uint64_t prp2,
 
     if (!prp1) {
         return NVME_INVALID_FIELD | NVME_DNR;
+    } else if (n->cmbsz && prp1 >= n->ctrl_mem.addr &&
+               prp1 < n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size)) {
+        qsg->nsg = 0;
+        qemu_iovec_init(iov, num_prps);
+        qemu_iovec_add(iov, (void *)&n->cmbuf[prp1 - n->ctrl_mem.addr], trans_len);
+    } else {
+        pci_dma_sglist_init(qsg, &n->parent_obj, num_prps);
+        qemu_sglist_add(qsg, prp1, trans_len);
     }
-
-    pci_dma_sglist_init(qsg, &n->parent_obj, num_prps);
-    qemu_sglist_add(qsg, prp1, trans_len);
     len -= trans_len;
     if (len) {
         if (!prp2) {
@@ -XXX,XX +XXX,XX @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, uint64_t prp1, uint64_t prp2,
 
             nents = (len + n->page_size - 1) >> n->page_bits;
             prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
-            pci_dma_read(&n->parent_obj, prp2, (void *)prp_list, prp_trans);
+            nvme_addr_read(n, prp2, (void *)prp_list, prp_trans);
             while (len != 0) {
                 uint64_t prp_ent = le64_to_cpu(prp_list[i]);
 
@@ -XXX,XX +XXX,XX @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, uint64_t prp1, uint64_t prp2,
                     i = 0;
                     nents = (len + n->page_size - 1) >> n->page_bits;
                     prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
-                    pci_dma_read(&n->parent_obj, prp_ent, (void *)prp_list,
+                    nvme_addr_read(n, prp_ent, (void *)prp_list,
                         prp_trans);
                     prp_ent = le64_to_cpu(prp_list[i]);
                 }
@@ -XXX,XX +XXX,XX @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, uint64_t prp1, uint64_t prp2,
                 }
 
                 trans_len = MIN(len, n->page_size);
-                qemu_sglist_add(qsg, prp_ent, trans_len);
+                if (qsg->nsg) {
+                    qemu_sglist_add(qsg, prp_ent, trans_len);
+                } else {
+                    qemu_iovec_add(iov, (void *)&n->cmbuf[prp_ent - n->ctrl_mem.addr], trans_len);
+                }
                 len -= trans_len;
                 i++;
             }
@@ -XXX,XX +XXX,XX @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, uint64_t prp1, uint64_t prp2,
             if (prp2 & (n->page_size - 1)) {
                 goto unmap;
             }
-            qemu_sglist_add(qsg, prp2, len);
+            if (qsg->nsg) {
+                qemu_sglist_add(qsg, prp2, len);
+            } else {
+                qemu_iovec_add(iov, (void *)&n->cmbuf[prp2 - n->ctrl_mem.addr], len);
+            }
         }
     }
     return NVME_SUCCESS;
@@ -XXX,XX +XXX,XX @@ static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
     uint64_t prp1, uint64_t prp2)
 {
     QEMUSGList qsg;
+    QEMUIOVector iov;
+    uint16_t status = NVME_SUCCESS;
 
-    if (nvme_map_prp(&qsg, prp1, prp2, len, n)) {
+    if (nvme_map_prp(&qsg, &iov, prp1, prp2, len, n)) {
         return NVME_INVALID_FIELD | NVME_DNR;
     }
-    if (dma_buf_read(ptr, len, &qsg)) {
+    if (qsg.nsg > 0) {
+        if (dma_buf_read(ptr, len, &qsg)) {
+            status = NVME_INVALID_FIELD | NVME_DNR;
+        }
         qemu_sglist_destroy(&qsg);
-        return NVME_INVALID_FIELD | NVME_DNR;
+    } else {
+        if (qemu_iovec_to_buf(&iov, 0, ptr, len) != len) {
+            status = NVME_INVALID_FIELD | NVME_DNR;
+        }
+        qemu_iovec_destroy(&iov);
     }
-    qemu_sglist_destroy(&qsg);
-    return NVME_SUCCESS;
+    return status;
 }
 
 static void nvme_post_cqes(void *opaque)
@@ -XXX,XX +XXX,XX @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
         return NVME_LBA_RANGE | NVME_DNR;
     }
 
-    if (nvme_map_prp(&req->qsg, prp1, prp2, data_size, n)) {
+    if (nvme_map_prp(&req->qsg, &req->iov, prp1, prp2, data_size, n)) {
         block_acct_invalid(blk_get_stats(n->conf.blk), acct);
         return NVME_INVALID_FIELD | NVME_DNR;
     }
 
-    assert((nlb << data_shift) == req->qsg.size);
-
-    req->has_sg = true;
     dma_acct_start(n->conf.blk, &req->acct, &req->qsg, acct);
-    req->aiocb = is_write ?
-        dma_blk_write(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
-                      nvme_rw_cb, req) :
-        dma_blk_read(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
-                     nvme_rw_cb, req);
+    if (req->qsg.nsg > 0) {
+        req->has_sg = true;
+        req->aiocb = is_write ?
+            dma_blk_write(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
+                          nvme_rw_cb, req) :
+            dma_blk_read(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
+                         nvme_rw_cb, req);
+    } else {
+        req->has_sg = false;
+        req->aiocb = is_write ?
+            blk_aio_pwritev(n->conf.blk, data_offset, &req->iov, 0, nvme_rw_cb,
+                            req) :
+            blk_aio_preadv(n->conf.blk, data_offset, &req->iov, 0, nvme_rw_cb,
+                           req);
+    }
 
     return NVME_NO_COMPLETE;
 }
@@ -XXX,XX +XXX,XX @@ static int nvme_init(PCIDevice *pci_dev)
         NVME_CMBSZ_SET_SQS(n->bar.cmbsz, 1);
         NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 0);
         NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 0);
-        NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 0);
-        NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 0);
+        NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 1);
+        NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 1);
         NVME_CMBSZ_SET_SZU(n->bar.cmbsz, 2); /* MBs */
         NVME_CMBSZ_SET_SZ(n->bar.cmbsz, n->cmb_size_mb);
 
+        n->cmbloc = n->bar.cmbloc;
+        n->cmbsz = n->bar.cmbsz;
+
         n->cmbuf = g_malloc0(NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
         memory_region_init_io(&n->ctrl_mem, OBJECT(n), &nvme_cmb_ops, n,
                               "nvme-cmb", NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index XXXXXXX..XXXXXXX 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -XXX,XX +XXX,XX @@ typedef struct NvmeRequest {
     NvmeCqe                 cqe;
     BlockAcctCookie         acct;
     QEMUSGList              qsg;
+    QEMUIOVector            iov;
     QTAILQ_ENTRY(NvmeRequest)entry;
 } NvmeRequest;
 
-- 
1.8.3.1

From: Alberto Garcia <berto@igalia.com>

Qcow2COWRegion has two attributes:

- The offset of the COW region from the start of the first cluster
  touched by the I/O request. Since it's always going to be non-negative
  and the maximum request size is at most INT_MAX, we can use a
  regular unsigned int to store this offset.

- The size of the COW region in bytes. This is guaranteed to be >= 0,
  so we should use an unsigned type instead.

On x86_64 this reduces the size of Qcow2COWRegion from 16 to 8 bytes.
It will also help keep some assertions simpler now that we know that
there are no negative numbers.

The prototype of do_perform_cow() is also updated to reflect these
changes.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2-cluster.c | 4 ++--
 block/qcow2.h         | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)
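
The size claim in the commit message can be checked with a small
standalone sketch. The diff for this patch is not shown here, so the
pre-patch field types below (a 64-bit offset and a signed int count)
are an assumption, and the struct names are hypothetical stand-ins for
Qcow2COWRegion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed pre-patch layout: 64-bit offset plus a signed byte count.
 * On x86_64 the compiler pads this to 16 bytes. */
struct cow_region_old {
    uint64_t offset;
    int nb_bytes;
};

/* Layout described by the commit message: two unsigned ints. */
struct cow_region_new {
    unsigned offset;
    unsigned nb_bytes;
};

/* Bytes saved per region by the narrower layout. */
static size_t cow_region_saving(void)
{
    return sizeof(struct cow_region_old) - sizeof(struct cow_region_new);
}
```

The exact numbers depend on the ABI; the 16 and 8 byte figures are the
ones the commit message gives for x86_64.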

From: Alberto Garcia <berto@igalia.com>

Instead of calling perform_cow() twice with a different COW region
each time, call it just once and make perform_cow() handle both
regions.

This patch simply moves code around. The next one will do the actual
reordering of the COW operations.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2-cluster.c | 36 ++++++++++++++++++++++--------------
 1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn do_perform_cow(BlockDriverState *bs,
     struct iovec iov;
     int ret;
 
+    if (bytes == 0) {
+        return 0;
+    }
+
     iov.iov_len = bytes;
     iov.iov_base = qemu_try_blockalign(bs, iov.iov_len);
     if (iov.iov_base == NULL) {
@@ -XXX,XX +XXX,XX @@ uint64_t qcow2_alloc_compressed_cluster_offset(BlockDriverState *bs,
     return cluster_offset;
 }
 
-static int perform_cow(BlockDriverState *bs, QCowL2Meta *m, Qcow2COWRegion *r)
+static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
 {
     BDRVQcow2State *s = bs->opaque;
+    Qcow2COWRegion *start = &m->cow_start;
+    Qcow2COWRegion *end = &m->cow_end;
     int ret;
 
-    if (r->nb_bytes == 0) {
+    if (start->nb_bytes == 0 && end->nb_bytes == 0) {
         return 0;
     }
 
     qemu_co_mutex_unlock(&s->lock);
-    ret = do_perform_cow(bs, m->offset, m->alloc_offset, r->offset, r->nb_bytes);
-    qemu_co_mutex_lock(&s->lock);
-
+    ret = do_perform_cow(bs, m->offset, m->alloc_offset,
+                         start->offset, start->nb_bytes);
     if (ret < 0) {
-        return ret;
+        goto fail;
     }
 
+    ret = do_perform_cow(bs, m->offset, m->alloc_offset,
+                         end->offset, end->nb_bytes);
+
+fail:
+    qemu_co_mutex_lock(&s->lock);
+
     /*
      * Before we update the L2 table to actually point to the new cluster, we
      * need to be sure that the refcounts have been increased and COW was
      * handled.
      */
-    qcow2_cache_depends_on_flush(s->l2_table_cache);
+    if (ret == 0) {
+        qcow2_cache_depends_on_flush(s->l2_table_cache);
+    }
 
-    return 0;
+    return ret;
 }
 
 int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m)
@@ -XXX,XX +XXX,XX @@ int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m)
     }
 
     /* copy content of unmodified sectors */
-    ret = perform_cow(bs, m, &m->cow_start);
-    if (ret < 0) {
-        goto err;
-    }
-
-    ret = perform_cow(bs, m, &m->cow_end);
+    ret = perform_cow(bs, m);
     if (ret < 0) {
         goto err;
     }
-- 
1.8.3.1

From: Alberto Garcia <berto@igalia.com>

This patch splits do_perform_cow() into three separate functions to
read, encrypt and write the COW regions.

perform_cow() can now read both regions first, then encrypt them and
finally write them to disk. The memory allocation is also done in
this function now, using one single buffer large enough to hold both
regions.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2-cluster.c | 117 +++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 87 insertions(+), 30 deletions(-)
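
A minimal sketch of the single-buffer layout described above, using
plain sizes instead of QEMU types. The names cow_layout, align_up and
cow_buffer_layout are hypothetical; the arithmetic mirrors the
buffer_size and end_buffer computation in the diff below.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: total allocation size and the byte offset
 * at which the end region starts inside the buffer. */
struct cow_layout {
    size_t buffer_size;
    size_t end_offset;
};

/* Round n up to the next multiple of align (align must be non-zero). */
static size_t align_up(size_t n, size_t align)
{
    return (n + align - 1) / align * align;
}

/* The start region sits at offset 0. Padding is added after it so that
 * the end region begins on a mem_align boundary. */
static struct cow_layout cow_buffer_layout(size_t start_bytes,
                                           size_t end_bytes,
                                           size_t mem_align)
{
    struct cow_layout l;

    assert(mem_align != 0);
    l.buffer_size = align_up(start_bytes, mem_align) + end_bytes;
    l.end_offset = l.buffer_size - end_bytes;
    return l;
}
```

With a 1000-byte start region, a 512-byte end region and 4096-byte
alignment, the buffer is 4608 bytes and the end region starts at 4096.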

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -XXX,XX +XXX,XX @@ int qcow2_encrypt_sectors(BDRVQcow2State *s, int64_t sector_num,
     return 0;
 }
 
-static int coroutine_fn do_perform_cow(BlockDriverState *bs,
-                                       uint64_t src_cluster_offset,
-                                       uint64_t cluster_offset,
-                                       unsigned offset_in_cluster,
-                                       unsigned bytes)
+static int coroutine_fn do_perform_cow_read(BlockDriverState *bs,
+                                            uint64_t src_cluster_offset,
+                                            unsigned offset_in_cluster,
+                                            uint8_t *buffer,
+                                            unsigned bytes)
 {
-    BDRVQcow2State *s = bs->opaque;
     QEMUIOVector qiov;
-    struct iovec iov;
+    struct iovec iov = { .iov_base = buffer, .iov_len = bytes };
     int ret;
 
     if (bytes == 0) {
         return 0;
     }
 
-    iov.iov_len = bytes;
-    iov.iov_base = qemu_try_blockalign(bs, iov.iov_len);
-    if (iov.iov_base == NULL) {
-        return -ENOMEM;
-    }
-
     qemu_iovec_init_external(&qiov, &iov, 1);
 
     BLKDBG_EVENT(bs->file, BLKDBG_COW_READ);
 
     if (!bs->drv) {
-        ret = -ENOMEDIUM;
-        goto out;
+        return -ENOMEDIUM;
     }
 
     /* Call .bdrv_co_readv() directly instead of using the public block-layer
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn do_perform_cow(BlockDriverState *bs,
     ret = bs->drv->bdrv_co_preadv(bs, src_cluster_offset + offset_in_cluster,
                                   bytes, &qiov, 0);
     if (ret < 0) {
-        goto out;
+        return ret;
     }
 
-    if (bs->encrypted) {
+    return 0;
+}
+
+static bool coroutine_fn do_perform_cow_encrypt(BlockDriverState *bs,
+                                                uint64_t src_cluster_offset,
+                                                unsigned offset_in_cluster,
+                                                uint8_t *buffer,
+                                                unsigned bytes)
+{
+    if (bytes && bs->encrypted) {
+        BDRVQcow2State *s = bs->opaque;
         int64_t sector = (src_cluster_offset + offset_in_cluster)
                          >> BDRV_SECTOR_BITS;
         assert(s->cipher);
         assert((offset_in_cluster & ~BDRV_SECTOR_MASK) == 0);
         assert((bytes & ~BDRV_SECTOR_MASK) == 0);
-        if (qcow2_encrypt_sectors(s, sector, iov.iov_base, iov.iov_base,
+        if (qcow2_encrypt_sectors(s, sector, buffer, buffer,
                                   bytes >> BDRV_SECTOR_BITS, true, NULL) < 0) {
-            ret = -EIO;
-            goto out;
+            return false;
         }
     }
+    return true;
+}
+
+static int coroutine_fn do_perform_cow_write(BlockDriverState *bs,
+                                             uint64_t cluster_offset,
+                                             unsigned offset_in_cluster,
+                                             uint8_t *buffer,
+                                             unsigned bytes)
+{
+    QEMUIOVector qiov;
+    struct iovec iov = { .iov_base = buffer, .iov_len = bytes };
+    int ret;
+
+    if (bytes == 0) {
+        return 0;
+    }
+
+    qemu_iovec_init_external(&qiov, &iov, 1);
 
     ret = qcow2_pre_write_overlap_check(bs, 0,
             cluster_offset + offset_in_cluster, bytes);
     if (ret < 0) {
-        goto out;
+        return ret;
     }
 
     BLKDBG_EVENT(bs->file, BLKDBG_COW_WRITE);
     ret = bdrv_co_pwritev(bs->file, cluster_offset + offset_in_cluster,
                           bytes, &qiov, 0);
     if (ret < 0) {
-        goto out;
+        return ret;
     }
 
-    ret = 0;
-out:
-    qemu_vfree(iov.iov_base);
-    return ret;
+    return 0;
 }
 
 
@@ -XXX,XX +XXX,XX @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
     BDRVQcow2State *s = bs->opaque;
     Qcow2COWRegion *start = &m->cow_start;
     Qcow2COWRegion *end = &m->cow_end;
+    unsigned buffer_size;
+    uint8_t *start_buffer, *end_buffer;
     int ret;
 
+    assert(start->nb_bytes <= UINT_MAX - end->nb_bytes);
+
     if (start->nb_bytes == 0 && end->nb_bytes == 0) {
         return 0;
     }
 
+    /* Reserve a buffer large enough to store the data from both the
+     * start and end COW regions. Add some padding in the middle if
+     * necessary to make sure that the end region is optimally aligned */
+    buffer_size = QEMU_ALIGN_UP(start->nb_bytes, bdrv_opt_mem_align(bs)) +
+        end->nb_bytes;
+    start_buffer = qemu_try_blockalign(bs, buffer_size);
+    if (start_buffer == NULL) {
+        return -ENOMEM;
+    }
+    /* The part of the buffer where the end region is located */
+    end_buffer = start_buffer + buffer_size - end->nb_bytes;
+
     qemu_co_mutex_unlock(&s->lock);
-    ret = do_perform_cow(bs, m->offset, m->alloc_offset,
-                         start->offset, start->nb_bytes);
+    /* First we read the existing data from both COW regions */
+    ret = do_perform_cow_read(bs, m->offset, start->offset,
+                              start_buffer, start->nb_bytes);
     if (ret < 0) {
         goto fail;
     }
 
-    ret = do_perform_cow(bs, m->offset, m->alloc_offset,
-                         end->offset, end->nb_bytes);
+    ret = do_perform_cow_read(bs, m->offset, end->offset,
+                              end_buffer, end->nb_bytes);
+    if (ret < 0) {
+        goto fail;
+    }
+
+    /* Encrypt the data if necessary before writing it */
+    if (bs->encrypted) {
+        if (!do_perform_cow_encrypt(bs, m->offset, start->offset,
+                                    start_buffer, start->nb_bytes) ||
+            !do_perform_cow_encrypt(bs, m->offset, end->offset,
+                                    end_buffer, end->nb_bytes)) {
+            ret = -EIO;
+            goto fail;
+        }
+    }
+
+    /* And now we can write everything */
+    ret = do_perform_cow_write(bs, m->alloc_offset, start->offset,
+                               start_buffer, start->nb_bytes);
+    if (ret < 0) {
+        goto fail;
+    }
 
+    ret = do_perform_cow_write(bs, m->alloc_offset, end->offset,
+                               end_buffer, end->nb_bytes);
 fail:
     qemu_co_mutex_lock(&s->lock);
 
@@ -XXX,XX +XXX,XX @@ fail:
         qcow2_cache_depends_on_flush(s->l2_table_cache);
     }
 
+    qemu_vfree(start_buffer);
     return ret;
 }
 
-- 
1.8.3.1

From: Alberto Garcia <berto@igalia.com>

Instead of passing a single buffer pointer to do_perform_cow_write(),
pass a QEMUIOVector. This will allow us to merge the write requests
for the COW regions and the actual data into a single one.

Although do_perform_cow_read() does not strictly need to change its
API, we're doing it here as well for consistency.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2-cluster.c | 51 ++++++++++++++++++++++++---------------------------
 1 file changed, 24 insertions(+), 27 deletions(-)
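
To illustrate why a vector argument makes later merging possible, here
is a sketch using the POSIX struct iovec rather than QEMUIOVector. The
helpers iov_total and iov_flatten are hypothetical and only show that a
callee taking a vector does not care how many buffers it is made of.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Sum of all element lengths, comparable to the size field that the
 * patch reads from the vector instead of a separate bytes argument. */
static size_t iov_total(const struct iovec *iov, int niov)
{
    size_t total = 0;

    for (int i = 0; i < niov; i++) {
        total += iov[i].iov_len;
    }
    return total;
}

/* Copy the scattered elements into one flat buffer, stopping at
 * dst_len. Returns the number of bytes copied. */
static size_t iov_flatten(const struct iovec *iov, int niov,
                          void *dst, size_t dst_len)
{
    size_t done = 0;

    for (int i = 0; i < niov && done < dst_len; i++) {
        size_t n = iov[i].iov_len;

        if (n > dst_len - done) {
            n = dst_len - done;
        }
        memcpy((char *)dst + done, iov[i].iov_base, n);
        done += n;
    }
    return done;
}
```

A caller can pass one element for a single COW region or several
elements for a combined request without changing the callee.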

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -XXX,XX +XXX,XX @@ int qcow2_encrypt_sectors(BDRVQcow2State *s, int64_t sector_num,
 static int coroutine_fn do_perform_cow_read(BlockDriverState *bs,
                                             uint64_t src_cluster_offset,
                                             unsigned offset_in_cluster,
-                                            uint8_t *buffer,
-                                            unsigned bytes)
+                                            QEMUIOVector *qiov)
 {
-    QEMUIOVector qiov;
-    struct iovec iov = { .iov_base = buffer, .iov_len = bytes };
     int ret;
 
-    if (bytes == 0) {
+    if (qiov->size == 0) {
         return 0;
     }
 
-    qemu_iovec_init_external(&qiov, &iov, 1);
-
     BLKDBG_EVENT(bs->file, BLKDBG_COW_READ);
 
     if (!bs->drv) {
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn do_perform_cow_read(BlockDriverState *bs,
      * which can lead to deadlock when block layer copy-on-read is enabled.
      */
     ret = bs->drv->bdrv_co_preadv(bs, src_cluster_offset + offset_in_cluster,
-                                  bytes, &qiov, 0);
+                                  qiov->size, qiov, 0);
     if (ret < 0) {
         return ret;
     }
@@ -XXX,XX +XXX,XX @@ static bool coroutine_fn do_perform_cow_encrypt(BlockDriverState *bs,
 static int coroutine_fn do_perform_cow_write(BlockDriverState *bs,
                                              uint64_t cluster_offset,
                                              unsigned offset_in_cluster,
-                                             uint8_t *buffer,
-                                             unsigned bytes)
+                                             QEMUIOVector *qiov)
 {
-    QEMUIOVector qiov;
-    struct iovec iov = { .iov_base = buffer, .iov_len = bytes };
     int ret;
 
-    if (bytes == 0) {
+    if (qiov->size == 0) {
         return 0;
     }
 
-    qemu_iovec_init_external(&qiov, &iov, 1);
-
     ret = qcow2_pre_write_overlap_check(bs, 0,
-            cluster_offset + offset_in_cluster, bytes);
+            cluster_offset + offset_in_cluster, qiov->size);
     if (ret < 0) {
         return ret;
     }
 
     BLKDBG_EVENT(bs->file, BLKDBG_COW_WRITE);
     ret = bdrv_co_pwritev(bs->file, cluster_offset + offset_in_cluster,
-                          bytes, &qiov, 0);
+                          qiov->size, qiov, 0);
     if (ret < 0) {
         return ret;
     }
@@ -XXX,XX +XXX,XX @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
     unsigned data_bytes = end->offset - (start->offset + start->nb_bytes);
     bool merge_reads;
     uint8_t *start_buffer, *end_buffer;
+    QEMUIOVector qiov;
     int ret;
 
     assert(start->nb_bytes <= UINT_MAX - end->nb_bytes);
@@ -XXX,XX +XXX,XX @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
     /* The part of the buffer where the end region is located */
     end_buffer = start_buffer + buffer_size - end->nb_bytes;
 
+    qemu_iovec_init(&qiov, 1);
+
     qemu_co_mutex_unlock(&s->lock);
     /* First we read the existing data from both COW regions. We
      * either read the whole region in one go, or the start and end
      * regions separately. */
     if (merge_reads) {
-        ret = do_perform_cow_read(bs, m->offset, start->offset,
-                                  start_buffer, buffer_size);
+        qemu_iovec_add(&qiov, start_buffer, buffer_size);
+        ret = do_perform_cow_read(bs, m->offset, start->offset, &qiov);
     } else {
-        ret = do_perform_cow_read(bs, m->offset, start->offset,
-                                  start_buffer, start->nb_bytes);
+        qemu_iovec_add(&qiov, start_buffer, start->nb_bytes);
+        ret = do_perform_cow_read(bs, m->offset, start->offset, &qiov);
         if (ret < 0) {
             goto fail;
         }
 
-        ret = do_perform_cow_read(bs, m->offset, end->offset,
-                                  end_buffer, end->nb_bytes);
+        qemu_iovec_reset(&qiov);
+        qemu_iovec_add(&qiov, end_buffer, end->nb_bytes);
+        ret = do_perform_cow_read(bs, m->offset, end->offset, &qiov);
     }
     if (ret < 0) {
         goto fail;
@@ -XXX,XX +XXX,XX @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
     }
 
     /* And now we can write everything */
-    ret = do_perform_cow_write(bs, m->alloc_offset, start->offset,
-                               start_buffer, start->nb_bytes);
+    qemu_iovec_reset(&qiov);
+    qemu_iovec_add(&qiov, start_buffer, start->nb_bytes);
+    ret = do_perform_cow_write(bs, m->alloc_offset, start->offset, &qiov);
     if (ret < 0) {
         goto fail;
     }
 
-    ret = do_perform_cow_write(bs, m->alloc_offset, end->offset,
-                               end_buffer, end->nb_bytes);
+    qemu_iovec_reset(&qiov);
+    qemu_iovec_add(&qiov, end_buffer, end->nb_bytes);
+    ret = do_perform_cow_write(bs, m->alloc_offset, end->offset, &qiov);
 fail:
     qemu_co_mutex_lock(&s->lock);
 
@@ -XXX,XX +XXX,XX @@ fail:
     }
 
     qemu_vfree(start_buffer);
+    qemu_iovec_destroy(&qiov);
     return ret;
 }
 
-- 
1.8.3.1

From: Alberto Garcia <berto@igalia.com>

If the guest tries to write data that results in the allocation of a
new cluster, instead of writing the guest data first and then the data
from the COW regions, write everything together using one single I/O
operation.

This can improve the write performance by 25% or more, depending on
several factors such as the media type, the cluster size and the I/O
request size.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2-cluster.c | 40 ++++++++++++++++++++++++--------
 block/qcow2.c         | 64 +++++++++++++++++++++++++++++++++++++++++++--------
 block/qcow2.h         |  7 ++++++
 3 files changed, 91 insertions(+), 20 deletions(-)
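
The merge conditions can be sketched in isolation as follows. This is a
simplified model, not the code from the diff: struct cow_meta and
can_merge are hypothetical, all offsets are absolute guest offsets, and
the vector limit is passed in as iov_max instead of using IOV_MAX.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the per-request COW metadata. */
struct cow_meta {
    uint64_t cow_start_off;   /* where the start COW region begins */
    unsigned cow_start_bytes; /* its length, may be 0 */
    uint64_t cow_end_off;     /* where the end COW region begins */
    unsigned cow_end_bytes;   /* its length, may be 0 */
};

/* Returns true when a guest write of `bytes` at `offset`, made of
 * `niov` vector elements, can be combined with both COW regions. */
static bool can_merge(const struct cow_meta *m, uint64_t offset,
                      unsigned bytes, int niov, int iov_max)
{
    /* Nothing to merge if there is no COW data at all. */
    if (m->cow_start_bytes == 0 && m->cow_end_bytes == 0) {
        return false;
    }
    /* Guest data must begin right after the start region. */
    if (m->cow_start_off + m->cow_start_bytes != offset) {
        return false;
    }
    /* The end region must begin right after the guest data. */
    if (m->cow_end_off != offset + bytes) {
        return false;
    }
    /* Leave room for two extra elements, one per COW region. */
    if (niov > iov_max - 2) {
        return false;
    }
    return true;
}
```

When any check fails, the guest data is written on its own and the COW
regions are written separately, as in the fallback path of the diff.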

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -XXX,XX +XXX,XX @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
     assert(start->nb_bytes <= UINT_MAX - end->nb_bytes);
     assert(start->nb_bytes + end->nb_bytes <= UINT_MAX - data_bytes);
     assert(start->offset + start->nb_bytes <= end->offset);
+    assert(!m->data_qiov || m->data_qiov->size == data_bytes);
 
     if (start->nb_bytes == 0 && end->nb_bytes == 0) {
         return 0;
@@ -XXX,XX +XXX,XX @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
     /* The part of the buffer where the end region is located */
     end_buffer = start_buffer + buffer_size - end->nb_bytes;
 
-    qemu_iovec_init(&qiov, 1);
+    qemu_iovec_init(&qiov, 2 + (m->data_qiov ? m->data_qiov->niov : 0));
 
     qemu_co_mutex_unlock(&s->lock);
     /* First we read the existing data from both COW regions. We
@@ -XXX,XX +XXX,XX @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
         }
     }
 
-    /* And now we can write everything */
-    qemu_iovec_reset(&qiov);
-    qemu_iovec_add(&qiov, start_buffer, start->nb_bytes);
-    ret = do_perform_cow_write(bs, m->alloc_offset, start->offset, &qiov);
-    if (ret < 0) {
-        goto fail;
+    /* And now we can write everything. If we have the guest data we
+     * can write everything in one single operation */
+    if (m->data_qiov) {
+        qemu_iovec_reset(&qiov);
+        if (start->nb_bytes) {
+            qemu_iovec_add(&qiov, start_buffer, start->nb_bytes);
+        }
+        qemu_iovec_concat(&qiov, m->data_qiov, 0, data_bytes);
+        if (end->nb_bytes) {
+            qemu_iovec_add(&qiov, end_buffer, end->nb_bytes);
+        }
+        /* NOTE: we have a write_aio blkdebug event here followed by
+         * a cow_write one in do_perform_cow_write(), but there's only
+         * one single I/O operation */
+        BLKDBG_EVENT(bs->file, BLKDBG_WRITE_AIO);
+        ret = do_perform_cow_write(bs, m->alloc_offset, start->offset, &qiov);
+    } else {
+        /* If there's no guest data then write both COW regions separately */
+        qemu_iovec_reset(&qiov);
+        qemu_iovec_add(&qiov, start_buffer, start->nb_bytes);
+        ret = do_perform_cow_write(bs, m->alloc_offset, start->offset, &qiov);
+        if (ret < 0) {
+            goto fail;
+        }
+
+        qemu_iovec_reset(&qiov);
+        qemu_iovec_add(&qiov, end_buffer, end->nb_bytes);
+        ret = do_perform_cow_write(bs, m->alloc_offset, end->offset, &qiov);
     }
 
-    qemu_iovec_reset(&qiov);
-    qemu_iovec_add(&qiov, end_buffer, end->nb_bytes);
-    ret = do_perform_cow_write(bs, m->alloc_offset, end->offset, &qiov);
 fail:
     qemu_co_mutex_lock(&s->lock);
 
diff --git a/block/qcow2.c b/block/qcow2.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -XXX,XX +XXX,XX @@ fail:
     return ret;
 }
 
+/* Check if it's possible to merge a write request with the writing of
+ * the data from the COW regions */
+static bool merge_cow(uint64_t offset, unsigned bytes,
+                      QEMUIOVector *hd_qiov, QCowL2Meta *l2meta)
+{
+    QCowL2Meta *m;
+
+    for (m = l2meta; m != NULL; m = m->next) {
+        /* If both COW regions are empty then there's nothing to merge */
+        if (m->cow_start.nb_bytes == 0 && m->cow_end.nb_bytes == 0) {
+            continue;
+        }
+
+        /* The data (middle) region must be immediately after the
+         * start region */
+        if (l2meta_cow_start(m) + m->cow_start.nb_bytes != offset) {
+            continue;
+        }
+
+        /* The end region must be immediately after the data (middle)
+         * region */
+        if (m->offset + m->cow_end.offset != offset + bytes) {
+            continue;
+        }
+
+        /* Make sure that adding both COW regions to the QEMUIOVector
+         * does not exceed IOV_MAX */
+        if (hd_qiov->niov > IOV_MAX - 2) {
+            continue;
+        }
+
+        m->data_qiov = hd_qiov;
+        return true;
+    }
+
+    return false;
+}
+
 static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
                                          uint64_t bytes, QEMUIOVector *qiov,
                                          int flags)
@@ -XXX,XX +XXX,XX @@ static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
             goto fail;
         }
 
-        qemu_co_mutex_unlock(&s->lock);
-        BLKDBG_EVENT(bs->file, BLKDBG_WRITE_AIO);
-        trace_qcow2_writev_data(qemu_coroutine_self(),
-                                cluster_offset + offset_in_cluster);
-        ret = bdrv_co_pwritev(bs->file,
-                              cluster_offset + offset_in_cluster,
-                              cur_bytes, &hd_qiov, 0);
-        qemu_co_mutex_lock(&s->lock);
-        if (ret < 0) {
-            goto fail;
+        /* If we need to do COW, check if it's possible to merge the
+         * writing of the guest data together with that of the COW regions.
+         * If it's not possible (or not necessary) then write the
+         * guest data now. */
+        if (!merge_cow(offset, cur_bytes, &hd_qiov, l2meta)) {
+            qemu_co_mutex_unlock(&s->lock);
+            BLKDBG_EVENT(bs->file, BLKDBG_WRITE_AIO);
+            trace_qcow2_writev_data(qemu_coroutine_self(),
+                                    cluster_offset + offset_in_cluster);
+            ret = bdrv_co_pwritev(bs->file,
+                                  cluster_offset + offset_in_cluster,
+                                  cur_bytes, &hd_qiov, 0);
+            qemu_co_mutex_lock(&s->lock);
+            if (ret < 0) {
+                goto fail;
+            }
         }
 
         while (l2meta != NULL) {
diff --git a/block/qcow2.h b/block/qcow2.h
index XXXXXXX..XXXXXXX 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -XXX,XX +XXX,XX @@ typedef struct QCowL2Meta
      */
     Qcow2COWRegion cow_end;
 
+    /**
+     * The I/O vector with the data from the actual guest write request.
+     * If non-NULL, this is meant to be merged together with the data
+     * from @cow_start and @cow_end into one single write operation.
+     */
+    QEMUIOVector *data_qiov;
+
     /** Pointer to next L2Meta of the same write request */
     struct QCowL2Meta *next;
 
-- 
1.8.3.1

From: Alberto Garcia <berto@igalia.com>

We already have functions for doing these calculations, so let's use
them instead of doing everything by hand. This makes the code a bit
more readable.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2-cluster.c | 4 ++--
 block/qcow2.c         | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)
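
For reference, the two calculations being replaced can be written as
standalone helpers. These sketches take plain parameters instead of the
driver state, so the names and signatures are hypothetical and differ
from the real offset_into_cluster() and offset_to_l2_index().

```c
#include <assert.h>
#include <stdint.h>

/* Byte position of `offset` inside its cluster; equivalent to
 * offset & (cluster_size - 1) with cluster_size = 1 << cluster_bits. */
static uint64_t offset_into_cluster_sketch(unsigned cluster_bits,
                                           uint64_t offset)
{
    return offset & ((UINT64_C(1) << cluster_bits) - 1);
}

/* Index of the L2 entry covering `offset`; l2_size is the number of
 * entries per L2 table and must be a non-zero power of two. */
static unsigned l2_index_sketch(unsigned cluster_bits, unsigned l2_size,
                                uint64_t offset)
{
    assert(l2_size != 0 && (l2_size & (l2_size - 1)) == 0);
    return (unsigned)((offset >> cluster_bits) & (l2_size - 1));
}
```

With 64 KiB clusters (cluster_bits of 16) and 8192 entries per table,
offset 3 * 65536 + 5 lies 5 bytes into its cluster at L2 index 3.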

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -XXX,XX +XXX,XX @@ int qcow2_get_cluster_offset(BlockDriverState *bs, uint64_t offset,
 
     /* find the cluster offset for the given disk offset */
 
-    l2_index = (offset >> s->cluster_bits) & (s->l2_size - 1);
+    l2_index = offset_to_l2_index(s, offset);
     *cluster_offset = be64_to_cpu(l2_table[l2_index]);
 
     nb_clusters = size_to_clusters(s, bytes_needed);
@@ -XXX,XX +XXX,XX @@ static int get_cluster_table(BlockDriverState *bs, uint64_t offset,
 
     /* find the cluster offset for the given disk offset */
 
-    l2_index = (offset >> s->cluster_bits) & (s->l2_size - 1);
+    l2_index = offset_to_l2_index(s, offset);
 
     *new_l2_table = l2_table;
     *new_l2_index = l2_index;
diff --git a/block/qcow2.c b/block/qcow2.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -XXX,XX +XXX,XX @@ static int validate_table_offset(BlockDriverState *bs, uint64_t offset,
     }
 
     /* Tables must be cluster aligned */
-    if (offset & (s->cluster_size - 1)) {
+    if (offset_into_cluster(s, offset) != 0) {
         return -EINVAL;
     }
 
-- 
1.8.3.1

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/qed-cluster.c | 94 ++++++++++++++++++-----------------------------------
 block/qed-table.c   | 15 +++------
 block/qed.h         |  3 +-
 3 files changed, 36 insertions(+), 76 deletions(-)

diff --git a/block/qed-cluster.c b/block/qed-cluster.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed-cluster.c
+++ b/block/qed-cluster.c
@@ -XXX,XX +XXX,XX @@ static unsigned int qed_count_contiguous_clusters(BDRVQEDState *s,
     return i - index;
 }
 
-typedef struct {
-    BDRVQEDState *s;
-    uint64_t pos;
-    size_t len;
-
-    QEDRequest *request;
-
-    /* User callback */
-    QEDFindClusterFunc *cb;
-    void *opaque;
-} QEDFindClusterCB;
-
-static void qed_find_cluster_cb(void *opaque, int ret)
-{
-    QEDFindClusterCB *find_cluster_cb = opaque;
-    BDRVQEDState *s = find_cluster_cb->s;
-    QEDRequest *request = find_cluster_cb->request;
-    uint64_t offset = 0;
-    size_t len = 0;
-    unsigned int index;
-    unsigned int n;
-
-    qed_acquire(s);
-    if (ret) {
-        goto out;
-    }
-
-    index = qed_l2_index(s, find_cluster_cb->pos);
-    n = qed_bytes_to_clusters(s,
-                              qed_offset_into_cluster(s, find_cluster_cb->pos) +
-                              find_cluster_cb->len);
-    n = qed_count_contiguous_clusters(s, request->l2_table->table,
-                                      index, n, &offset);
-
-    if (qed_offset_is_unalloc_cluster(offset)) {
-        ret = QED_CLUSTER_L2;
-    } else if (qed_offset_is_zero_cluster(offset)) {
-        ret = QED_CLUSTER_ZERO;
-    } else if (qed_check_cluster_offset(s, offset)) {
-        ret = QED_CLUSTER_FOUND;
-    } else {
-        ret = -EINVAL;
-    }
-
-    len = MIN(find_cluster_cb->len, n * s->header.cluster_size -
-              qed_offset_into_cluster(s, find_cluster_cb->pos));
-
-out:
-    find_cluster_cb->cb(find_cluster_cb->opaque, ret, offset, len);
-    qed_release(s);
-    g_free(find_cluster_cb);
-}
-
 /**
  * Find the offset of a data cluster
  *
@@ -XXX,XX +XXX,XX @@ out:
 void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
                       size_t len, QEDFindClusterFunc *cb, void *opaque)
 {
-    QEDFindClusterCB *find_cluster_cb;
     uint64_t l2_offset;
+    uint64_t offset = 0;
+    unsigned int index;
+    unsigned int n;
+    int ret;
 
     /* Limit length to L2 boundary.  Requests are broken up at the L2 boundary
      * so that a request acts on one L2 table at a time.
@@ -XXX,XX +XXX,XX @@ void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
         return;
     }
 
-    find_cluster_cb = g_malloc(sizeof(*find_cluster_cb));
-    find_cluster_cb->s = s;
-    find_cluster_cb->pos = pos;
-    find_cluster_cb->len = len;
-    find_cluster_cb->cb = cb;
-    find_cluster_cb->opaque = opaque;
-    find_cluster_cb->request = request;
+    ret = qed_read_l2_table(s, request, l2_offset);
+    qed_acquire(s);
+    if (ret) {
+        goto out;
+    }
+
+    index = qed_l2_index(s, pos);
+    n = qed_bytes_to_clusters(s,
+                              qed_offset_into_cluster(s, pos) + len);
+    n = qed_count_contiguous_clusters(s, request->l2_table->table,
+                                      index, n, &offset);
+
+    if (qed_offset_is_unalloc_cluster(offset)) {
+        ret = QED_CLUSTER_L2;
+    } else if (qed_offset_is_zero_cluster(offset)) {
+        ret = QED_CLUSTER_ZERO;
+    } else if (qed_check_cluster_offset(s, offset)) {
+        ret = QED_CLUSTER_FOUND;
+    } else {
+        ret = -EINVAL;
+    }
+
+    len = MIN(len,
+              n * s->header.cluster_size - qed_offset_into_cluster(s, pos));
 
-    qed_read_l2_table(s, request, l2_offset,
-                      qed_find_cluster_cb, find_cluster_cb);
+out:
+    cb(opaque, ret, offset, len);
+    qed_release(s);
 }
diff --git a/block/qed-table.c b/block/qed-table.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed-table.c
+++ b/block/qed-table.c
@@ -XXX,XX +XXX,XX @@ int qed_write_l1_table_sync(BDRVQEDState *s, unsigned int index,
     return ret;
 }
 
-void qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset,
-                       BlockCompletionFunc *cb, void *opaque)
+int qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset)
 {
     int ret;
 
@@ -XXX,XX +XXX,XX @@ void qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset,
     /* Check for cached L2 entry */
     request->l2_table = qed_find_l2_cache_entry(&s->l2_cache, offset);
     if (request->l2_table) {
-        cb(opaque, 0);
-        return;
+        return 0;
     }
 
     request->l2_table = qed_alloc_l2_cache_entry(&s->l2_cache);
@@ -XXX,XX +XXX,XX @@ void qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset,
     }
     qed_release(s);
 
-    cb(opaque, ret);
+    return ret;
 }
 
 int qed_read_l2_table_sync(BDRVQEDState *s, QEDRequest *request, uint64_t offset)
 {
-    int ret = -EINPROGRESS;
-
-    qed_read_l2_table(s, request, offset, qed_sync_cb, &ret);
-    BDRV_POLL_WHILE(s->bs, ret == -EINPROGRESS);
-
-    return ret;
+    return qed_read_l2_table(s, request, offset);
 }
 
 void qed_write_l2_table(BDRVQEDState *s, QEDRequest *request,
diff --git a/block/qed.h b/block/qed.h
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.h
+++ b/block/qed.h
@@ -XXX,XX +XXX,XX @@ int qed_write_l1_table_sync(BDRVQEDState *s, unsigned int index,
                             unsigned int n);
 int qed_read_l2_table_sync(BDRVQEDState *s, QEDRequest *request,
                            uint64_t offset);
-void qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset,
-                       BlockCompletionFunc *cb, void *opaque);
+int qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset);
 void qed_write_l2_table(BDRVQEDState *s, QEDRequest *request,
                         unsigned int index, unsigned int n, bool flush,
                         BlockCompletionFunc *cb, void *opaque);
-- 
1.8.3.1

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/qed-cluster.c | 39 ++++++++++++++++++++++-----------------
 block/qed.c         | 24 +++++++++++-------------
 block/qed.h         |  4 ++--
 3 files changed, 35 insertions(+), 32 deletions(-)

diff --git a/block/qed-cluster.c b/block/qed-cluster.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed-cluster.c
+++ b/block/qed-cluster.c
@@ -XXX,XX +XXX,XX @@ static unsigned int qed_count_contiguous_clusters(BDRVQEDState *s,
  * @s:          QED state
  * @request:    L2 cache entry
  * @pos:        Byte position in device
- * @len:        Number of bytes
- * @cb:         Completion function
- * @opaque:     User data for completion function
+ * @len:        Number of bytes (may be shortened on return)
+ * @img_offset: Contains offset in the image file on success
  *
  * This function translates a position in the block device to an offset in the
- * image file.  It invokes the cb completion callback to report back the
- * translated offset or unallocated range in the image file.
+ * image file. The translated offset or unallocated range in the image file is
+ * reported back in *img_offset and *len.
  *
  * If the L2 table exists, request->l2_table points to the L2 table cache entry
  * and the caller must free the reference when they are finished.  The cache
  * entry is exposed in this way to avoid callers having to read the L2 table
  * again later during request processing.  If request->l2_table is non-NULL it
  * will be unreferenced before taking on the new cache entry.
+ *
+ * On success QED_CLUSTER_FOUND is returned and img_offset/len are a contiguous
+ * range in the image file.
+ *
+ * On failure QED_CLUSTER_L2 or QED_CLUSTER_L1 is returned for missing L2 or L1
+ * table offset, respectively. len is number of contiguous unallocated bytes.
  */
-void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
-                      size_t len, QEDFindClusterFunc *cb, void *opaque)
+int qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
+                     size_t *len, uint64_t *img_offset)
 {
     uint64_t l2_offset;
     uint64_t offset = 0;
@@ -XXX,XX +XXX,XX @@ void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
     /* Limit length to L2 boundary.  Requests are broken up at the L2 boundary
      * so that a request acts on one L2 table at a time.
      */
-    len = MIN(len, (((pos >> s->l1_shift) + 1) << s->l1_shift) - pos);
+    *len = MIN(*len, (((pos >> s->l1_shift) + 1) << s->l1_shift) - pos);
 
     l2_offset = s->l1_table->offsets[qed_l1_index(s, pos)];
     if (qed_offset_is_unalloc_cluster(l2_offset)) {
-        cb(opaque, QED_CLUSTER_L1, 0, len);
-        return;
+        *img_offset = 0;
+        return QED_CLUSTER_L1;
     }
     if (!qed_check_table_offset(s, l2_offset)) {
-        cb(opaque, -EINVAL, 0, 0);
-        return;
+        *img_offset = *len = 0;
+        return -EINVAL;
     }
 
     ret = qed_read_l2_table(s, request, l2_offset);
@@ -XXX,XX +XXX,XX @@ void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
     }
 
     index = qed_l2_index(s, pos);
-    n = qed_bytes_to_clusters(s,
-                              qed_offset_into_cluster(s, pos) + len);
+    n = qed_bytes_to_clusters(s, qed_offset_into_cluster(s, pos) + *len);
     n = qed_count_contiguous_clusters(s, request->l2_table->table,
                                       index, n, &offset);
 
@@ -XXX,XX +XXX,XX @@ void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
         ret = -EINVAL;
     }
 
-    len = MIN(len,
-              n * s->header.cluster_size - qed_offset_into_cluster(s, pos));
+    *len = MIN(*len,
+               n * s->header.cluster_size - qed_offset_into_cluster(s, pos));
 
 out:
-    cb(opaque, ret, offset, len);
+    *img_offset = offset;
     qed_release(s);
+    return ret;
 }
diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@ static int64_t coroutine_fn bdrv_qed_co_get_block_status(BlockDriverState *bs,
         .file = file,
     };
     QEDRequest request = { .l2_table = NULL };
+    uint64_t offset;
+    int ret;
 
-    qed_find_cluster(s, &request, cb.pos, len, qed_is_allocated_cb, &cb);
+    ret = qed_find_cluster(s, &request, cb.pos, &len, &offset);
+    qed_is_allocated_cb(&cb, ret, offset, len);
 
-    /* Now sleep if the callback wasn't invoked immediately */
-    while (cb.status == BDRV_BLOCK_OFFSET_MASK) {
-        cb.co = qemu_coroutine_self();
-        qemu_coroutine_yield();
-    }
+    /* The callback was invoked immediately */
+    assert(cb.status != BDRV_BLOCK_OFFSET_MASK);
 
     qed_unref_l2_cache_entry(request.l2_table);
 
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset, size_t len)
  *              or -errno
  * @offset:     Cluster offset in bytes
  * @len:        Length in bytes
- *
- * Callback from qed_find_cluster().
  */
 static void qed_aio_write_data(void *opaque, int ret,
                                uint64_t offset, size_t len)
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_data(void *opaque, int ret,
  *              or -errno
  * @offset:     Cluster offset in bytes
  * @len:        Length in bytes
- *
- * Callback from qed_find_cluster().
  */
 static void qed_aio_read_data(void *opaque, int ret,
                               uint64_t offset, size_t len)
@@ -XXX,XX +XXX,XX @@ static void qed_aio_next_io(QEDAIOCB *acb, int ret)
     BDRVQEDState *s = acb_to_s(acb);
     QEDFindClusterFunc *io_fn = (acb->flags & QED_AIOCB_WRITE) ?
                                 qed_aio_write_data : qed_aio_read_data;
+    uint64_t offset;
+    size_t len;
 
     trace_qed_aio_next_io(s, acb, ret, acb->cur_pos + acb->cur_qiov.size);
 
@@ -XXX,XX +XXX,XX @@ static void qed_aio_next_io(QEDAIOCB *acb, int ret)
     }
 
     /* Find next cluster and start I/O */
-    qed_find_cluster(s, &acb->request,
-                      acb->cur_pos, acb->end_pos - acb->cur_pos,
-                      io_fn, acb);
+    len = acb->end_pos - acb->cur_pos;
+    ret = qed_find_cluster(s, &acb->request, acb->cur_pos, &len, &offset);
+    io_fn(acb, ret, offset, len);
 }
 
 static BlockAIOCB *qed_aio_setup(BlockDriverState *bs,
diff --git a/block/qed.h b/block/qed.h
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.h
+++ b/block/qed.h
@@ -XXX,XX +XXX,XX @@ int qed_write_l2_table_sync(BDRVQEDState *s, QEDRequest *request,
 /**
  * Cluster functions
  */
-void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
-                      size_t len, QEDFindClusterFunc *cb, void *opaque);
+int qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
+                     size_t *len, uint64_t *img_offset);
 
 /**
  * Consistency check
-- 
1.8.3.1

With this change, qed_aio_write_prefill() and qed_aio_write_postfill()
collapse into a single function. This is reflected by a rename of the
combined function to qed_aio_write_cow().
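
As a rough illustration of the two regions that the combined function
populates, the arithmetic can be sketched in isolation. This is only a
sketch: start_of_cluster(), offset_into_cluster(), struct cow_region,
cow_front() and cow_back() are hypothetical stand-ins, not the QED
helpers, and the cluster size is assumed to be a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for qed_start_of_cluster() and
 * qed_offset_into_cluster(); cs (cluster size) must be a power of two. */
static inline uint64_t start_of_cluster(uint64_t cs, uint64_t pos)
{
    assert(cs != 0 && (cs & (cs - 1)) == 0);
    return pos & ~(cs - 1);
}

static inline uint64_t offset_into_cluster(uint64_t cs, uint64_t pos)
{
    assert(cs != 0 && (cs & (cs - 1)) == 0);
    return pos & (cs - 1);
}

/* A byte range in the device that the guest write does not touch */
struct cow_region {
    uint64_t start;
    uint64_t len;
};

/* Front region: from the start of the cluster up to the write position */
static inline struct cow_region cow_front(uint64_t cs, uint64_t pos)
{
    struct cow_region r;
    r.start = start_of_cluster(cs, pos);
    r.len = offset_into_cluster(cs, pos);
    return r;
}

/* Back region: from the end of the write up to the next cluster boundary */
static inline struct cow_region cow_back(uint64_t cs, uint64_t pos,
                                         uint64_t size)
{
    struct cow_region r;
    r.start = pos + size;
    r.len = start_of_cluster(cs, r.start + cs - 1) - r.start;
    return r;
}
```

For a 64 KiB cluster and an 8 KiB write that starts 4 KiB into the
second cluster, the front region covers the first 4 KiB of that cluster
and the back region the remaining 52 KiB, so the two regions plus the
guest data add up to exactly one cluster.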

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/qed.c | 57 +++++++++++++++++++++++----------------------------------
 1 file changed, 23 insertions(+), 34 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@ static int qed_read_backing_file(BDRVQEDState *s, uint64_t pos,
  * @pos:        Byte position in device
  * @len:        Number of bytes
  * @offset:     Byte offset in image file
- * @cb:         Completion function
- * @opaque:     User data for completion function
  */
-static void qed_copy_from_backing_file(BDRVQEDState *s, uint64_t pos,
-                                       uint64_t len, uint64_t offset,
-                                       BlockCompletionFunc *cb,
-                                       void *opaque)
+static int qed_copy_from_backing_file(BDRVQEDState *s, uint64_t pos,
+                                      uint64_t len, uint64_t offset)
 {
     QEMUIOVector qiov;
     QEMUIOVector *backing_qiov = NULL;
@@ -XXX,XX +XXX,XX @@ static void qed_copy_from_backing_file(BDRVQEDState *s, uint64_t pos,
 
     /* Skip copy entirely if there is no work to do */
     if (len == 0) {
-        cb(opaque, 0);
-        return;
+        return 0;
     }
 
     iov = (struct iovec) {
@@ -XXX,XX +XXX,XX @@ static void qed_copy_from_backing_file(BDRVQEDState *s, uint64_t pos,
     ret = 0;
 out:
     qemu_vfree(iov.iov_base);
-    cb(opaque, ret);
+    return ret;
 }
 
 /**
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_main(void *opaque, int ret)
 }
 
 /**
- * Populate back untouched region of new data cluster
+ * Populate untouched regions of new data cluster
  */
-static void qed_aio_write_postfill(void *opaque, int ret)
+static void qed_aio_write_cow(void *opaque, int ret)
 {
     QEDAIOCB *acb = opaque;
     BDRVQEDState *s = acb_to_s(acb);
-    uint64_t start = acb->cur_pos + acb->cur_qiov.size;
-    uint64_t len =
-        qed_start_of_cluster(s, start + s->header.cluster_size - 1) - start;
-    uint64_t offset = acb->cur_cluster +
-                      qed_offset_into_cluster(s, acb->cur_pos) +
-                      acb->cur_qiov.size;
+    uint64_t start, len, offset;
+
+    /* Populate front untouched region of new data cluster */
+    start = qed_start_of_cluster(s, acb->cur_pos);
+    len = qed_offset_into_cluster(s, acb->cur_pos);
 
+    trace_qed_aio_write_prefill(s, acb, start, len, acb->cur_cluster);
+    ret = qed_copy_from_backing_file(s, start, len, acb->cur_cluster);
     if (ret) {
         qed_aio_complete(acb, ret);
         return;
     }
 
-    trace_qed_aio_write_postfill(s, acb, start, len, offset);
-    qed_copy_from_backing_file(s, start, len, offset,
-                                qed_aio_write_main, acb);
-}
+    /* Populate back untouched region of new data cluster */
+    start = acb->cur_pos + acb->cur_qiov.size;
+    len = qed_start_of_cluster(s, start + s->header.cluster_size - 1) - start;
+    offset = acb->cur_cluster +
+             qed_offset_into_cluster(s, acb->cur_pos) +
+             acb->cur_qiov.size;
 
-/**
- * Populate front untouched region of new data cluster
- */
-static void qed_aio_write_prefill(void *opaque, int ret)
-{
-    QEDAIOCB *acb = opaque;
-    BDRVQEDState *s = acb_to_s(acb);
-    uint64_t start = qed_start_of_cluster(s, acb->cur_pos);
-    uint64_t len = qed_offset_into_cluster(s, acb->cur_pos);
+    trace_qed_aio_write_postfill(s, acb, start, len, offset);
+    ret = qed_copy_from_backing_file(s, start, len, offset);
 
-    trace_qed_aio_write_prefill(s, acb, start, len, acb->cur_cluster);
-    qed_copy_from_backing_file(s, start, len, acb->cur_cluster,
-                                qed_aio_write_postfill, acb);
+    qed_aio_write_main(acb, ret);
 }
 
 /**
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
 
         cb = qed_aio_write_zero_cluster;
     } else {
-        cb = qed_aio_write_prefill;
+        cb = qed_aio_write_cow;
         acb->cur_cluster = qed_alloc_clusters(s, acb->cur_nclusters);
     }
 
-- 
1.8.3.1

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/qed.c | 32 ++++++++++++--------------------
 1 file changed, 12 insertions(+), 20 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@ int qed_write_header_sync(BDRVQEDState *s)
  * This function only updates known header fields in-place and does not affect
  * extra data after the QED header.
  */
-static void qed_write_header(BDRVQEDState *s, BlockCompletionFunc cb,
-                             void *opaque)
+static int qed_write_header(BDRVQEDState *s)
 {
     /* We must write full sectors for O_DIRECT but cannot necessarily generate
      * the data following the header if an unrecognized compat feature is
@@ -XXX,XX +XXX,XX @@ static void qed_write_header(BDRVQEDState *s, BlockCompletionFunc cb,
     ret = 0;
 out:
     qemu_vfree(buf);
-    cb(opaque, ret);
+    return ret;
 }
 
 static uint64_t qed_max_image_size(uint32_t cluster_size, uint32_t table_size)
@@ -XXX,XX +XXX,XX @@ static void qed_unplug_allocating_write_reqs(BDRVQEDState *s)
     }
 }
 
-static void qed_finish_clear_need_check(void *opaque, int ret)
-{
-    /* Do nothing */
-}
-
-static void qed_flush_after_clear_need_check(void *opaque, int ret)
-{
-    BDRVQEDState *s = opaque;
-
-    bdrv_aio_flush(s->bs, qed_finish_clear_need_check, s);
-
-    /* No need to wait until flush completes */
-    qed_unplug_allocating_write_reqs(s);
-}
-
 static void qed_clear_need_check(void *opaque, int ret)
 {
     BDRVQEDState *s = opaque;
@@ -XXX,XX +XXX,XX @@ static void qed_clear_need_check(void *opaque, int ret)
     }
 
     s->header.features &= ~QED_F_NEED_CHECK;
-    qed_write_header(s, qed_flush_after_clear_need_check, s);
+    ret = qed_write_header(s);
+    (void) ret;
+
+    qed_unplug_allocating_write_reqs(s);
+
+    ret = bdrv_flush(s->bs);
+    (void) ret;
 }
 
 static void qed_need_check_timer_cb(void *opaque)
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
 {
     BDRVQEDState *s = acb_to_s(acb);
     BlockCompletionFunc *cb;
+    int ret;
 
     /* Cancel timer when the first allocating request comes in */
     if (QSIMPLEQ_EMPTY(&s->allocating_write_reqs)) {
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
 
     if (qed_should_set_need_check(s)) {
         s->header.features |= QED_F_NEED_CHECK;
-        qed_write_header(s, cb, acb);
+        ret = qed_write_header(s);
+        cb(acb, ret);
     } else {
         cb(acb, 0);
     }
-- 
1.8.3.1

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/qed-table.c | 47 ++++++++++++-----------------------------------
 block/qed.c       | 12 +++++++-----
 block/qed.h       |  8 +++-----
 3 files changed, 22 insertions(+), 45 deletions(-)

diff --git a/block/qed-table.c b/block/qed-table.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed-table.c
+++ b/block/qed-table.c
@@ -XXX,XX +XXX,XX @@ out:
  * @index:      Index of first element
  * @n:          Number of elements
  * @flush:      Whether or not to sync to disk
- * @cb:         Completion function
- * @opaque:     Argument for completion function
  */
-static void qed_write_table(BDRVQEDState *s, uint64_t offset, QEDTable *table,
-                            unsigned int index, unsigned int n, bool flush,
-                            BlockCompletionFunc *cb, void *opaque)
+static int qed_write_table(BDRVQEDState *s, uint64_t offset, QEDTable *table,
+                           unsigned int index, unsigned int n, bool flush)
 {
     unsigned int sector_mask = BDRV_SECTOR_SIZE / sizeof(uint64_t) - 1;
     unsigned int start, end, i;
@@ -XXX,XX +XXX,XX @@ static void qed_write_table(BDRVQEDState *s, uint64_t offset, QEDTable *table,
     ret = 0;
 out:
     qemu_vfree(new_table);
-    cb(opaque, ret);
-}
-
-/**
- * Propagate return value from async callback
- */
-static void qed_sync_cb(void *opaque, int ret)
-{
-    *(int *)opaque = ret;
+    return ret;
 }
 
 int qed_read_l1_table_sync(BDRVQEDState *s)
@@ -XXX,XX +XXX,XX @@ int qed_read_l1_table_sync(BDRVQEDState *s)
     return qed_read_table(s, s->header.l1_table_offset, s->l1_table);
 }
 
-void qed_write_l1_table(BDRVQEDState *s, unsigned int index, unsigned int n,
-                        BlockCompletionFunc *cb, void *opaque)
+int qed_write_l1_table(BDRVQEDState *s, unsigned int index, unsigned int n)
 {
     BLKDBG_EVENT(s->bs->file, BLKDBG_L1_UPDATE);
-    qed_write_table(s, s->header.l1_table_offset,
-                    s->l1_table, index, n, false, cb, opaque);
+    return qed_write_table(s, s->header.l1_table_offset,
+                           s->l1_table, index, n, false);
 }
 
 int qed_write_l1_table_sync(BDRVQEDState *s, unsigned int index,
                             unsigned int n)
 {
-    int ret = -EINPROGRESS;
-
-    qed_write_l1_table(s, index, n, qed_sync_cb, &ret);
-    BDRV_POLL_WHILE(s->bs, ret == -EINPROGRESS);
-
-    return ret;
+    return qed_write_l1_table(s, index, n);
 }
 
 int qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset)
@@ -XXX,XX +XXX,XX @@ int qed_read_l2_table_sync(BDRVQEDState *s, QEDRequest *request, uint64_t offset
     return qed_read_l2_table(s, request, offset);
 }
 
-void qed_write_l2_table(BDRVQEDState *s, QEDRequest *request,
-                        unsigned int index, unsigned int n, bool flush,
-                        BlockCompletionFunc *cb, void *opaque)
+int qed_write_l2_table(BDRVQEDState *s, QEDRequest *request,
+                       unsigned int index, unsigned int n, bool flush)
 {
     BLKDBG_EVENT(s->bs->file, BLKDBG_L2_UPDATE);
-    qed_write_table(s, request->l2_table->offset,
-                    request->l2_table->table, index, n, flush, cb, opaque);
+    return qed_write_table(s, request->l2_table->offset,
+                           request->l2_table->table, index, n, flush);
 }
 
 int qed_write_l2_table_sync(BDRVQEDState *s, QEDRequest *request,
                             unsigned int index, unsigned int n, bool flush)
 {
-    int ret = -EINPROGRESS;
-
-    qed_write_l2_table(s, request, index, n, flush, qed_sync_cb, &ret);
-    BDRV_POLL_WHILE(s->bs, ret == -EINPROGRESS);
-
-    return ret;
+    return qed_write_l2_table(s, request, index, n, flush);
 }
diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_l1_update(void *opaque, int ret)
     index = qed_l1_index(s, acb->cur_pos);
     s->l1_table->offsets[index] = acb->request.l2_table->offset;
 
-    qed_write_l1_table(s, index, 1, qed_commit_l2_update, acb);
+    ret = qed_write_l1_table(s, index, 1);
+    qed_commit_l2_update(acb, ret);
 }
 
 /**
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_l2_update(QEDAIOCB *acb, int ret, uint64_t offset)
 
     if (need_alloc) {
         /* Write out the whole new L2 table */
-        qed_write_l2_table(s, &acb->request, 0, s->table_nelems, true,
-                           qed_aio_write_l1_update, acb);
+        ret = qed_write_l2_table(s, &acb->request, 0, s->table_nelems, true);
+        qed_aio_write_l1_update(acb, ret);
     } else {
         /* Write out only the updated part of the L2 table */
-        qed_write_l2_table(s, &acb->request, index, acb->cur_nclusters, false,
-                           qed_aio_next_io_cb, acb);
+        ret = qed_write_l2_table(s, &acb->request, index, acb->cur_nclusters,
+                                 false);
+        qed_aio_next_io(acb, ret);
     }
     return;
 
diff --git a/block/qed.h b/block/qed.h
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.h
+++ b/block/qed.h
@@ -XXX,XX +XXX,XX @@ void qed_commit_l2_cache_entry(L2TableCache *l2_cache, CachedL2Table *l2_table);
  * Table I/O functions
  */
 int qed_read_l1_table_sync(BDRVQEDState *s);
-void qed_write_l1_table(BDRVQEDState *s, unsigned int index, unsigned int n,
-                        BlockCompletionFunc *cb, void *opaque);
+int qed_write_l1_table(BDRVQEDState *s, unsigned int index, unsigned int n);
 int qed_write_l1_table_sync(BDRVQEDState *s, unsigned int index,
                             unsigned int n);
 int qed_read_l2_table_sync(BDRVQEDState *s, QEDRequest *request,
                            uint64_t offset);
 int qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset);
-void qed_write_l2_table(BDRVQEDState *s, QEDRequest *request,
-                        unsigned int index, unsigned int n, bool flush,
-                        BlockCompletionFunc *cb, void *opaque);
+int qed_write_l2_table(BDRVQEDState *s, QEDRequest *request,
+                       unsigned int index, unsigned int n, bool flush);
 int qed_write_l2_table_sync(BDRVQEDState *s, QEDRequest *request,
                             unsigned int index, unsigned int n, bool flush);
 
-- 
1.8.3.1

Note that this code is generally not running in coroutine context, so
this is an actual blocking synchronous operation. A later patch in this
series fixes this.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/qed.c | 61 +++++++++++++++++++------------------------------------------
 1 file changed, 19 insertions(+), 42 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@ static void qed_aio_start_io(QEDAIOCB *acb)
     qed_aio_next_io(acb, 0);
 }
 
-static void qed_aio_next_io_cb(void *opaque, int ret)
-{
-    QEDAIOCB *acb = opaque;
-
-    qed_aio_next_io(acb, ret);
-}
-
 static void qed_plug_allocating_write_reqs(BDRVQEDState *s)
 {
     assert(!s->allocating_write_reqs_plugged);
@@ -XXX,XX +XXX,XX @@ err:
     qed_aio_complete(acb, ret);
 }
 
-static void qed_aio_write_l2_update_cb(void *opaque, int ret)
-{
-    QEDAIOCB *acb = opaque;
-    qed_aio_write_l2_update(acb, ret, acb->cur_cluster);
-}
-
-/**
- * Flush new data clusters before updating the L2 table
- *
- * This flush is necessary when a backing file is in use.  A crash during an
- * allocating write could result in empty clusters in the image.  If the write
- * only touched a subregion of the cluster, then backing image sectors have
- * been lost in the untouched region.  The solution is to flush after writing a
- * new data cluster and before updating the L2 table.
- */
-static void qed_aio_write_flush_before_l2_update(void *opaque, int ret)
-{
-    QEDAIOCB *acb = opaque;
-    BDRVQEDState *s = acb_to_s(acb);
-
-    if (!bdrv_aio_flush(s->bs->file->bs, qed_aio_write_l2_update_cb, opaque)) {
-        qed_aio_complete(acb, -EIO);
-    }
-}
-
 /**
  * Write data to the image file
  */
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_main(void *opaque, int ret)
     BDRVQEDState *s = acb_to_s(acb);
     uint64_t offset = acb->cur_cluster +
                       qed_offset_into_cluster(s, acb->cur_pos);
-    BlockCompletionFunc *next_fn;
 
     trace_qed_aio_write_main(s, acb, ret, offset, acb->cur_qiov.size);
 
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_main(void *opaque, int ret)
         return;
     }
 
+    BLKDBG_EVENT(s->bs->file, BLKDBG_WRITE_AIO);
+    ret = bdrv_pwritev(s->bs->file, offset, &acb->cur_qiov);
+    if (ret >= 0) {
+        ret = 0;
+    }
+
     if (acb->find_cluster_ret == QED_CLUSTER_FOUND) {
-        next_fn = qed_aio_next_io_cb;
+        qed_aio_next_io(acb, ret);
     } else {
         if (s->bs->backing) {
-            next_fn = qed_aio_write_flush_before_l2_update;
-        } else {
-            next_fn = qed_aio_write_l2_update_cb;
+            /*
+             * Flush new data clusters before updating the L2 table
+             *
+             * This flush is necessary when a backing file is in use.  A crash
+             * during an allocating write could result in empty clusters in the
+             * image.  If the write only touched a subregion of the cluster,
+             * then backing image sectors have been lost in the untouched
+             * region.  The solution is to flush after writing a new data
+             * cluster and before updating the L2 table.
+             */
+            ret = bdrv_flush(s->bs->file->bs);
         }
+        qed_aio_write_l2_update(acb, ret, acb->cur_cluster);
     }
-
-    BLKDBG_EVENT(s->bs->file, BLKDBG_WRITE_AIO);
-    bdrv_aio_writev(s->bs->file, offset / BDRV_SECTOR_SIZE,
-                    &acb->cur_qiov, acb->cur_qiov.size / BDRV_SECTOR_SIZE,
-                    next_fn, acb);
 }
 
 /**
-- 
1.8.3.1

qed_commit_l2_update() is unconditionally called at the end of
qed_aio_write_l1_update(). Inline it.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/qed.c | 36 ++++++++++++++----------------------
 1 file changed, 14 insertions(+), 22 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@ static void qed_aio_complete(QEDAIOCB *acb, int ret)
 }
 
 /**
- * Commit the current L2 table to the cache
+ * Update L1 table with new L2 table offset and write it out
  */
-static void qed_commit_l2_update(void *opaque, int ret)
+static void qed_aio_write_l1_update(void *opaque, int ret)
 {
     QEDAIOCB *acb = opaque;
     BDRVQEDState *s = acb_to_s(acb);
     CachedL2Table *l2_table = acb->request.l2_table;
     uint64_t l2_offset = l2_table->offset;
+    int index;
+
+    if (ret) {
+        qed_aio_complete(acb, ret);
+        return;
+    }
 
+    index = qed_l1_index(s, acb->cur_pos);
+    s->l1_table->offsets[index] = l2_table->offset;
+
+    ret = qed_write_l1_table(s, index, 1);
+
+    /* Commit the current L2 table to the cache */
     qed_commit_l2_cache_entry(&s->l2_cache, l2_table);
 
     /* This is guaranteed to succeed because we just committed the entry to the
@@ -XXX,XX +XXX,XX @@ static void qed_commit_l2_update(void *opaque, int ret)
     qed_aio_next_io(acb, ret);
 }
 
-/**
- * Update L1 table with new L2 table offset and write it out
- */
-static void qed_aio_write_l1_update(void *opaque, int ret)
-{
-    QEDAIOCB *acb = opaque;
-    BDRVQEDState *s = acb_to_s(acb);
-    int index;
-
-    if (ret) {
-        qed_aio_complete(acb, ret);
-        return;
-    }
-
-    index = qed_l1_index(s, acb->cur_pos);
-    s->l1_table->offsets[index] = acb->request.l2_table->offset;
-
-    ret = qed_write_l1_table(s, index, 1);
-    qed_commit_l2_update(acb, ret);
-}
 
 /**
  * Update L2 table with new cluster offsets and write them out
-- 
1.8.3.1

Don't recurse into qed_aio_next_io() and qed_aio_complete() here, but
just return an error code and let the caller handle it.
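
The shape of this conversion can be sketched outside of QEMU. The
following is a minimal illustration under stated assumptions, not the
QED code: struct request, request_complete(), update_table() and
write_step() are hypothetical names, and the simulate_error flag only
exists to exercise the failure path.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical request state; this is not QEMU's QEDAIOCB. */
struct request {
    int completed;     /* set once the request has finished */
    int result;        /* final status reported to the user */
    int steps_done;    /* number of helper steps that succeeded */
};

/* Hypothetical counterpart of a completion function */
static inline void request_complete(struct request *req, int ret)
{
    req->completed = 1;
    req->result = ret;
}

/* Helper in the new style: do the work and return 0 or -errno.
 * It neither completes the request nor starts the next step. */
static inline int update_table(struct request *req, int simulate_error)
{
    if (simulate_error) {
        return -EIO;
    }
    req->steps_done++;
    return 0;
}

/* The caller owns the control flow: it completes the request on error
 * and would otherwise continue with the next step. */
static inline void write_step(struct request *req, int simulate_error)
{
    int ret = update_table(req, simulate_error);

    if (ret < 0) {
        request_complete(req, ret);
        return;
    }
    /* The next I/O step would be started here. */
}
```

Keeping completion and continuation in the caller means the helper has
a single exit path that reports status, which is what allows the error
handling in the diff below to move up one level.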

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/qed.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@ static void qed_aio_complete(QEDAIOCB *acb, int ret)
 /**
  * Update L1 table with new L2 table offset and write it out
  */
-static void qed_aio_write_l1_update(void *opaque, int ret)
+static int qed_aio_write_l1_update(QEDAIOCB *acb)
 {
-    QEDAIOCB *acb = opaque;
     BDRVQEDState *s = acb_to_s(acb);
     CachedL2Table *l2_table = acb->request.l2_table;
     uint64_t l2_offset = l2_table->offset;
-    int index;
-
-    if (ret) {
-        qed_aio_complete(acb, ret);
-        return;
-    }
+    int index, ret;
 
     index = qed_l1_index(s, acb->cur_pos);
     s->l1_table->offsets[index] = l2_table->offset;
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_l1_update(void *opaque, int ret)
     acb->request.l2_table = qed_find_l2_cache_entry(&s->l2_cache, l2_offset);
     assert(acb->request.l2_table != NULL);
 
-    qed_aio_next_io(acb, ret);
+    return ret;
 }
 
 
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_l2_update(QEDAIOCB *acb, int ret, uint64_t offset)
     if (need_alloc) {
         /* Write out the whole new L2 table */
         ret = qed_write_l2_table(s, &acb->request, 0, s->table_nelems, true);
-        qed_aio_write_l1_update(acb, ret);
+        if (ret) {
+            goto err;
+        }
+        ret = qed_aio_write_l1_update(acb);
+        qed_aio_next_io(acb, ret);
+
     } else {
         /* Write out only the updated part of the L2 table */
         ret = qed_write_l2_table(s, &acb->request, index, acb->cur_nclusters,
-- 
1.8.3.1

Don't recurse into qed_aio_next_io() and qed_aio_complete() here, but
just return an error code and let the caller handle it.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/qed.c | 43 ++++++++++++++++++++++++++-----------------
 1 file changed, 26 insertions(+), 17 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_l1_update(QEDAIOCB *acb)
 /**
  * Update L2 table with new cluster offsets and write them out
  */
-static void qed_aio_write_l2_update(QEDAIOCB *acb, int ret, uint64_t offset)
+static int qed_aio_write_l2_update(QEDAIOCB *acb, uint64_t offset)
 {
     BDRVQEDState *s = acb_to_s(acb);
     bool need_alloc = acb->find_cluster_ret == QED_CLUSTER_L1;
-    int index;
-
-    if (ret) {
-        goto err;
-    }
+    int index, ret;
 
     if (need_alloc) {
         qed_unref_l2_cache_entry(acb->request.l2_table);
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_l2_update(QEDAIOCB *acb, int ret, uint64_t offset)
         /* Write out the whole new L2 table */
         ret = qed_write_l2_table(s, &acb->request, 0, s->table_nelems, true);
         if (ret) {
-            goto err;
+            return ret;
         }
-        ret = qed_aio_write_l1_update(acb);
-        qed_aio_next_io(acb, ret);
-
+        return qed_aio_write_l1_update(acb);
     } else {
         /* Write out only the updated part of the L2 table */
         ret = qed_write_l2_table(s, &acb->request, index, acb->cur_nclusters,
                                  false);
-        qed_aio_next_io(acb, ret);
+        if (ret) {
+            return ret;
+        }
     }
-    return;
-
-err:
-    qed_aio_complete(acb, ret);
+    return 0;
 }
 
 /**
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_main(void *opaque, int ret)
              */
             ret = bdrv_flush(s->bs->file->bs);
         }
-        qed_aio_write_l2_update(acb, ret, acb->cur_cluster);
+        if (ret) {
+            goto err;
+        }
+        ret = qed_aio_write_l2_update(acb, acb->cur_cluster);
+        if (ret) {
+            goto err;
+        }
+        qed_aio_next_io(acb, 0);
     }
+    return;
+
+err:
+    qed_aio_complete(acb, ret);
 }
 
 /**
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_zero_cluster(void *opaque, int ret)
         return;
     }
 
-    qed_aio_write_l2_update(acb, 0, 1);
+    ret = qed_aio_write_l2_update(acb, 1);
+    if (ret < 0) {
+        qed_aio_complete(acb, ret);
+        return;
+    }
+    qed_aio_next_io(acb, 0);
 }
 
 /**
-- 
1.8.3.1

Don't recurse into qed_aio_next_io() and qed_aio_complete() here, but
just return an error code and let the caller handle it.

While refactoring qed_aio_write_alloc() to accommodate the change,
qed_aio_write_zero_cluster() ended up with a single line, so I chose to
inline that line and remove the function completely.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/qed.c | 58 +++++++++++++++++++++-------------------------------------
 1 file changed, 21 insertions(+), 37 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_main(QEDAIOCB *acb)
 /**
  * Populate untouched regions of new data cluster
  */
-static void qed_aio_write_cow(void *opaque, int ret)
+static int qed_aio_write_cow(QEDAIOCB *acb)
 {
-    QEDAIOCB *acb = opaque;
     BDRVQEDState *s = acb_to_s(acb);
     uint64_t start, len, offset;
+    int ret;
 
     /* Populate front untouched region of new data cluster */
     start = qed_start_of_cluster(s, acb->cur_pos);
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_cow(void *opaque, int ret)
 
     trace_qed_aio_write_prefill(s, acb, start, len, acb->cur_cluster);
     ret = qed_copy_from_backing_file(s, start, len, acb->cur_cluster);
-    if (ret) {
-        qed_aio_complete(acb, ret);
-        return;
+    if (ret < 0) {
+        return ret;
     }
 
     /* Populate back untouched region of new data cluster */
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_cow(void *opaque, int ret)
 
     trace_qed_aio_write_postfill(s, acb, start, len, offset);
     ret = qed_copy_from_backing_file(s, start, len, offset);
-    if (ret) {
-        qed_aio_complete(acb, ret);
-        return;
-    }
-
-    ret = qed_aio_write_main(acb);
     if (ret < 0) {
-        qed_aio_complete(acb, ret);
-        return;
+        return ret;
     }
-    qed_aio_next_io(acb, 0);
+
+    return qed_aio_write_main(acb);
 }
 
 /**
@@ -XXX,XX +XXX,XX @@ static bool qed_should_set_need_check(BDRVQEDState *s)
     return !(s->header.features & QED_F_NEED_CHECK);
 }
 
-static void qed_aio_write_zero_cluster(void *opaque, int ret)
-{
-    QEDAIOCB *acb = opaque;
-
-    if (ret) {
-        qed_aio_complete(acb, ret);
-        return;
-    }
-
-    ret = qed_aio_write_l2_update(acb, 1);
-    if (ret < 0) {
-        qed_aio_complete(acb, ret);
-        return;
-    }
-    qed_aio_next_io(acb, 0);
-}
-
 /**
  * Write new data cluster
  *
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_zero_cluster(void *opaque, int ret)
 static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
 {
     BDRVQEDState *s = acb_to_s(acb);
-    BlockCompletionFunc *cb;
     int ret;
 
     /* Cancel timer when the first allocating request comes in */
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
             qed_aio_start_io(acb);
             return;
         }
-
-        cb = qed_aio_write_zero_cluster;
     } else {
-        cb = qed_aio_write_cow;
         acb->cur_cluster = qed_alloc_clusters(s, acb->cur_nclusters);
     }
 
     if (qed_should_set_need_check(s)) {
         s->header.features |= QED_F_NEED_CHECK;
         ret = qed_write_header(s);
-        cb(acb, ret);
+        if (ret < 0) {
+            qed_aio_complete(acb, ret);
+            return;
+        }
+    }
+
+    if (acb->flags & QED_AIOCB_ZERO) {
+        ret = qed_aio_write_l2_update(acb, 1);
     } else {
-        cb(acb, 0);
+        ret = qed_aio_write_cow(acb);
     }
+    if (ret < 0) {
+        qed_aio_complete(acb, ret);
+        return;
+    }
+    qed_aio_next_io(acb, 0);
 }
 
 /**
-- 
1.8.3.1

Don't recurse into qed_aio_next_io() and qed_aio_complete() here, but
just return an error code and let the caller handle it.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/qed.c | 43 ++++++++++++++++++++-----------------------
 1 file changed, 20 insertions(+), 23 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@ static bool qed_should_set_need_check(BDRVQEDState *s)
  *
  * This path is taken when writing to previously unallocated clusters.
  */
-static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
+static int qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
 {
     BDRVQEDState *s = acb_to_s(acb);
     int ret;
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
     }
     if (acb != QSIMPLEQ_FIRST(&s->allocating_write_reqs) ||
         s->allocating_write_reqs_plugged) {
-        return; /* wait for existing request to finish */
+        return -EINPROGRESS; /* wait for existing request to finish */
     }
 
     acb->cur_nclusters = qed_bytes_to_clusters(s,
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
     if (acb->flags & QED_AIOCB_ZERO) {
         /* Skip ahead if the clusters are already zero */
         if (acb->find_cluster_ret == QED_CLUSTER_ZERO) {
-            qed_aio_start_io(acb);
-            return;
+            return 0;
         }
     } else {
         acb->cur_cluster = qed_alloc_clusters(s, acb->cur_nclusters);
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
         s->header.features |= QED_F_NEED_CHECK;
         ret = qed_write_header(s);
         if (ret < 0) {
-            qed_aio_complete(acb, ret);
-            return;
+            return ret;
         }
     }
 
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
         ret = qed_aio_write_cow(acb);
     }
     if (ret < 0) {
-        qed_aio_complete(acb, ret);
-        return;
+        return ret;
     }
-    qed_aio_next_io(acb, 0);
+    return 0;
 }
 
 /**
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
  *
  * This path is taken when writing to already allocated clusters.
  */
-static void qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset, size_t len)
+static int qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset, size_t len)
 {
-    int ret;
-
     /* Allocate buffer for zero writes */
     if (acb->flags & QED_AIOCB_ZERO) {
         struct iovec *iov = acb->qiov->iov;
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset, size_t len)
         if (!iov->iov_base) {
             iov->iov_base = qemu_try_blockalign(acb->common.bs, iov->iov_len);
             if (iov->iov_base == NULL) {
-                qed_aio_complete(acb, -ENOMEM);
-                return;
+                return -ENOMEM;
             }
             memset(iov->iov_base, 0, iov->iov_len);
         }
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset, size_t len)
     qemu_iovec_concat(&acb->cur_qiov, acb->qiov, acb->qiov_offset, len);
 
     /* Do the actual write */
-    ret = qed_aio_write_main(acb);
-    if (ret < 0) {
-        qed_aio_complete(acb, ret);
-        return;
-    }
-    qed_aio_next_io(acb, 0);
+    return qed_aio_write_main(acb);
 }
 
 /**
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_data(void *opaque, int ret,
 
     switch (ret) {
     case QED_CLUSTER_FOUND:
-        qed_aio_write_inplace(acb, offset, len);
+        ret = qed_aio_write_inplace(acb, offset, len);
         break;
 
     case QED_CLUSTER_L2:
     case QED_CLUSTER_L1:
     case QED_CLUSTER_ZERO:
-        qed_aio_write_alloc(acb, len);
+        ret = qed_aio_write_alloc(acb, len);
         break;
 
     default:
-        qed_aio_complete(acb, ret);
+        assert(ret < 0);
         break;
     }
+
+    if (ret < 0) {
+        if (ret != -EINPROGRESS) {
+            qed_aio_complete(acb, ret);
+        }
+        return;
+    }
+    qed_aio_next_io(acb, 0);
 }
 
 /**
-- 
1.8.3.1

All callers pass ret = 0, so we can just remove it.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/qed.c | 17 ++++++-----------
 1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@ static CachedL2Table *qed_new_l2_table(BDRVQEDState *s)
     return l2_table;
 }
 
-static void qed_aio_next_io(QEDAIOCB *acb, int ret);
+static void qed_aio_next_io(QEDAIOCB *acb);
 
 static void qed_aio_start_io(QEDAIOCB *acb)
 {
-    qed_aio_next_io(acb, 0);
+    qed_aio_next_io(acb);
 }
 
 static void qed_plug_allocating_write_reqs(BDRVQEDState *s)
@@ -XXX,XX +XXX,XX @@ static int qed_aio_read_data(void *opaque, int ret, uint64_t offset, size_t len)
 /**
  * Begin next I/O or complete the request
  */
-static void qed_aio_next_io(QEDAIOCB *acb, int ret)
+static void qed_aio_next_io(QEDAIOCB *acb)
 {
     BDRVQEDState *s = acb_to_s(acb);
     uint64_t offset;
     size_t len;
+    int ret;
 
-    trace_qed_aio_next_io(s, acb, ret, acb->cur_pos + acb->cur_qiov.size);
+    trace_qed_aio_next_io(s, acb, 0, acb->cur_pos + acb->cur_qiov.size);
 
     if (acb->backing_qiov) {
         qemu_iovec_destroy(acb->backing_qiov);
@@ -XXX,XX +XXX,XX @@ static void qed_aio_next_io(QEDAIOCB *acb, int ret)
         acb->backing_qiov = NULL;
     }
 
-    /* Handle I/O error */
-    if (ret) {
-        qed_aio_complete(acb, ret);
-        return;
-    }
-
     acb->qiov_offset += acb->cur_qiov.size;
     acb->cur_pos += acb->cur_qiov.size;
     qemu_iovec_reset(&acb->cur_qiov);
@@ -XXX,XX +XXX,XX @@ static void qed_aio_next_io(QEDAIOCB *acb, int ret)
         }
         return;
     }
-    qed_aio_next_io(acb, 0);
+    qed_aio_next_io(acb);
 }
 
 static BlockAIOCB *qed_aio_setup(BlockDriverState *bs,
-- 
1.8.3.1

Most of the qed code is now synchronous and matches the coroutine model.
One notable exception is the serialisation between requests which can
still schedule a callback. Before we can replace this with coroutine
locks, let's convert the driver's external interfaces to the coroutine
versions.

We need to be careful to handle both requests that call the completion
callback directly from the calling coroutine (i.e. fully synchronous
code) and requests that involve some callback, so that we need to yield
and wait for the completion callback coming from outside the coroutine.
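
The two cases can be modelled roughly as follows. This is only a sketch of
the control flow under stated assumptions: it uses no QEMU coroutines, and
start_request(), wait_for_completion() and the pending_* variables are
hypothetical stand-ins, with wait_for_completion() playing the role of a
yield that returns once the completion callback has run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; a real driver would use QEMU coroutines here. */
typedef void CompletionFunc(void *opaque, int ret);

typedef struct RequestCo {
    bool done;
    int ret;
} RequestCo;

/* One deferred completion, modelling a callback that fires later. */
static CompletionFunc *pending_cb;
static void *pending_opaque;
static int pending_ret;

static void request_cb(void *opaque, int ret)
{
    RequestCo *co = opaque;
    co->done = true;
    co->ret = ret;
}

/* Start a request: complete inline, or defer the callback. */
static void start_request(bool inline_completion, int ret,
                          CompletionFunc *cb, void *opaque)
{
    if (inline_completion) {
        cb(opaque, ret);
        return;
    }
    pending_cb = cb;
    pending_opaque = opaque;
    pending_ret = ret;
}

/* Stand-in for yielding: control returns once the callback has run. */
static void wait_for_completion(void)
{
    CompletionFunc *cb = pending_cb;
    assert(cb != NULL);
    pending_cb = NULL;
    cb(pending_opaque, pending_ret);
}

static int do_request(bool inline_completion, int ret)
{
    RequestCo co = { .done = false };

    start_request(inline_completion, ret, request_cb, &co);
    if (!co.done) {
        wait_for_completion();   /* only wait if the callback has not run */
    }
    assert(co.done);
    return co.ret;
}
```

Checking the done flag before waiting is what keeps the fully synchronous
case from waiting for a callback that has already been delivered.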

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Manos Pitsidianakis <el13635@mail.ntua.gr>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/qed.c | 97 ++++++++++++++++++++++++++-----------------------------------
 1 file changed, 42 insertions(+), 55 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@ static void qed_aio_next_io(QEDAIOCB *acb)
     }
 }
 
-static BlockAIOCB *qed_aio_setup(BlockDriverState *bs,
-                                 int64_t sector_num,
-                                 QEMUIOVector *qiov, int nb_sectors,
-                                 BlockCompletionFunc *cb,
-                                 void *opaque, int flags)
+typedef struct QEDRequestCo {
+    Coroutine *co;
+    bool done;
+    int ret;
+} QEDRequestCo;
+
+static void qed_co_request_cb(void *opaque, int ret)
 {
-    QEDAIOCB *acb = qemu_aio_get(&qed_aiocb_info, bs, cb, opaque);
+    QEDRequestCo *co = opaque;
 
-    trace_qed_aio_setup(bs->opaque, acb, sector_num, nb_sectors,
-                        opaque, flags);
+    co->done = true;
+    co->ret = ret;
+    qemu_coroutine_enter_if_inactive(co->co);
+}
+
+static int coroutine_fn qed_co_request(BlockDriverState *bs, int64_t sector_num,
+                                       QEMUIOVector *qiov, int nb_sectors,
+                                       int flags)
+{
+    QEDRequestCo co = {
+        .co     = qemu_coroutine_self(),
+        .done   = false,
+    };
+    QEDAIOCB *acb = qemu_aio_get(&qed_aiocb_info, bs, qed_co_request_cb, &co);
+
+    trace_qed_aio_setup(bs->opaque, acb, sector_num, nb_sectors, &co, flags);
 
     acb->flags = flags;
     acb->qiov = qiov;
@@ -XXX,XX +XXX,XX @@ static BlockAIOCB *qed_aio_setup(BlockDriverState *bs,
 
     /* Start request */
     qed_aio_start_io(acb);
-    return &acb->common;
-}
 
-static BlockAIOCB *bdrv_qed_aio_readv(BlockDriverState *bs,
-                                      int64_t sector_num,
-                                      QEMUIOVector *qiov, int nb_sectors,
-                                      BlockCompletionFunc *cb,
-                                      void *opaque)
-{
-    return qed_aio_setup(bs, sector_num, qiov, nb_sectors, cb, opaque, 0);
+    if (!co.done) {
+        qemu_coroutine_yield();
+    }
+
+    return co.ret;
 }
 
-static BlockAIOCB *bdrv_qed_aio_writev(BlockDriverState *bs,
-                                       int64_t sector_num,
-                                       QEMUIOVector *qiov, int nb_sectors,
-                                       BlockCompletionFunc *cb,
-                                       void *opaque)
+static int coroutine_fn bdrv_qed_co_readv(BlockDriverState *bs,
+                                          int64_t sector_num, int nb_sectors,
+                                          QEMUIOVector *qiov)
 {
-    return qed_aio_setup(bs, sector_num, qiov, nb_sectors, cb,
-                         opaque, QED_AIOCB_WRITE);
+    return qed_co_request(bs, sector_num, qiov, nb_sectors, 0);
 }
 
-typedef struct {
-    Coroutine *co;
-    int ret;
-    bool done;
-} QEDWriteZeroesCB;
-
-static void coroutine_fn qed_co_pwrite_zeroes_cb(void *opaque, int ret)
+static int coroutine_fn bdrv_qed_co_writev(BlockDriverState *bs,
+                                           int64_t sector_num, int nb_sectors,
+                                           QEMUIOVector *qiov)
 {
-    QEDWriteZeroesCB *cb = opaque;
-
-    cb->done = true;
-    cb->ret = ret;
-    if (cb->co) {
-        aio_co_wake(cb->co);
-    }
+    return qed_co_request(bs, sector_num, qiov, nb_sectors, QED_AIOCB_WRITE);
 }
 
 static int coroutine_fn bdrv_qed_co_pwrite_zeroes(BlockDriverState *bs,
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn bdrv_qed_co_pwrite_zeroes(BlockDriverState *bs,
                                                   int count,
                                                   BdrvRequestFlags flags)
 {
-    BlockAIOCB *blockacb;
     BDRVQEDState *s = bs->opaque;
-    QEDWriteZeroesCB cb = { .done = false };
     QEMUIOVector qiov;
     struct iovec iov;
 
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn bdrv_qed_co_pwrite_zeroes(BlockDriverState *bs,
     iov.iov_len = count;
 
     qemu_iovec_init_external(&qiov, &iov, 1);
-    blockacb = qed_aio_setup(bs, offset >> BDRV_SECTOR_BITS, &qiov,
-                             count >> BDRV_SECTOR_BITS,
-                             qed_co_pwrite_zeroes_cb, &cb,
-                             QED_AIOCB_WRITE | QED_AIOCB_ZERO);
-    if (!blockacb) {
-        return -EIO;
-    }
-    if (!cb.done) {
-        cb.co = qemu_coroutine_self();
-        qemu_coroutine_yield();
-    }
-    assert(cb.done);
-    return cb.ret;
+    return qed_co_request(bs, offset >> BDRV_SECTOR_BITS, &qiov,
+                          count >> BDRV_SECTOR_BITS,
+                          QED_AIOCB_WRITE | QED_AIOCB_ZERO);
 }
 
 static int bdrv_qed_truncate(BlockDriverState *bs, int64_t offset, Error **errp)
@@ -XXX,XX +XXX,XX @@ static BlockDriver bdrv_qed = {
     .bdrv_create              = bdrv_qed_create,
     .bdrv_has_zero_init       = bdrv_has_zero_init_1,
     .bdrv_co_get_block_status = bdrv_qed_co_get_block_status,
-    .bdrv_aio_readv           = bdrv_qed_aio_readv,
-    .bdrv_aio_writev          = bdrv_qed_aio_writev,
+    .bdrv_co_readv            = bdrv_qed_co_readv,
+    .bdrv_co_writev           = bdrv_qed_co_writev,
     .bdrv_co_pwrite_zeroes    = bdrv_qed_co_pwrite_zeroes,
     .bdrv_truncate            = bdrv_qed_truncate,
     .bdrv_getlength           = bdrv_qed_getlength,
-- 
1.8.3.1

Now that we're running in coroutine context, the ad-hoc serialisation
code (which drops a request that has to wait out of coroutine context)
can be replaced by a CoQueue.

This means that when we resume a serialised request, it is running in
coroutine context again and its I/O isn't blocking any more.
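
The restart after a wait can be modelled roughly as below. This is a sketch
with hypothetical names (State, write_alloc(), run_request()) and no real
coroutine queue: wait_for_owner() merely stands in for waiting until the
previous allocating request has finished, after which the request takes
ownership and returns -EAGAIN so that its caller starts over with the
table lookup.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model; names are illustrative, not taken from block/qed.c. */
typedef struct State {
    const void *owner;   /* request currently allowed to allocate */
    int lookups;         /* how many table lookups were performed */
} State;

/* Stand-in for waiting on a queue: the previous owner finishes meanwhile. */
static void wait_for_owner(State *s)
{
    s->owner = NULL;
}

/* Returns -EAGAIN after taking ownership so the caller redoes its lookup. */
static int write_alloc(State *s, const void *req)
{
    if (s->owner != req) {
        if (s->owner != NULL) {
            wait_for_owner(s);
            assert(s->owner == NULL);
        }
        s->owner = req;
        return -EAGAIN;
    }
    return 0;   /* we own the allocation path; do the write */
}

static int run_request(State *s, const void *req)
{
    int ret;

    for (;;) {
        s->lookups++;                 /* look up table entries */
        ret = write_alloc(s, req);
        if (ret == -EAGAIN) {
            continue;                 /* start over with the lookup */
        }
        break;
    }
    if (s->owner == req) {
        s->owner = NULL;              /* release on completion */
    }
    return ret;
}
```

In this model every request performs its lookup twice, once before taking
ownership and once after, whether or not it actually had to wait.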

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/qed.c | 49 +++++++++++++++++--------------------------------
 block/qed.h |  3 ++-
 2 files changed, 19 insertions(+), 33 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@ static void qed_plug_allocating_write_reqs(BDRVQEDState *s)
 
 static void qed_unplug_allocating_write_reqs(BDRVQEDState *s)
 {
-    QEDAIOCB *acb;
-
     assert(s->allocating_write_reqs_plugged);
 
     s->allocating_write_reqs_plugged = false;
-
-    acb = QSIMPLEQ_FIRST(&s->allocating_write_reqs);
-    if (acb) {
-        qed_aio_start_io(acb);
-    }
+    qemu_co_enter_next(&s->allocating_write_reqs);
 }
 
 static void qed_clear_need_check(void *opaque, int ret)
@@ -XXX,XX +XXX,XX @@ static void qed_need_check_timer_cb(void *opaque)
     BDRVQEDState *s = opaque;
 
     /* The timer should only fire when allocating writes have drained */
-    assert(!QSIMPLEQ_FIRST(&s->allocating_write_reqs));
+    assert(!s->allocating_acb);
 
     trace_qed_need_check_timer_cb(s);
 
@@ -XXX,XX +XXX,XX @@ static int bdrv_qed_do_open(BlockDriverState *bs, QDict *options, int flags,
     int ret;
 
     s->bs = bs;
-    QSIMPLEQ_INIT(&s->allocating_write_reqs);
+    qemu_co_queue_init(&s->allocating_write_reqs);
 
     ret = bdrv_pread(bs->file, 0, &le_header, sizeof(le_header));
     if (ret < 0) {
@@ -XXX,XX +XXX,XX @@ static void qed_aio_complete_bh(void *opaque)
     qed_release(s);
 }
 
-static void qed_resume_alloc_bh(void *opaque)
-{
-    qed_aio_start_io(opaque);
-}
-
 static void qed_aio_complete(QEDAIOCB *acb, int ret)
 {
     BDRVQEDState *s = acb_to_s(acb);
@@ -XXX,XX +XXX,XX @@ static void qed_aio_complete(QEDAIOCB *acb, int ret)
      * next request in the queue.  This ensures that we don't cycle through
      * requests multiple times but rather finish one at a time completely.
      */
-    if (acb == QSIMPLEQ_FIRST(&s->allocating_write_reqs)) {
-        QEDAIOCB *next_acb;
-        QSIMPLEQ_REMOVE_HEAD(&s->allocating_write_reqs, next);
-        next_acb = QSIMPLEQ_FIRST(&s->allocating_write_reqs);
-        if (next_acb) {
-            aio_bh_schedule_oneshot(bdrv_get_aio_context(acb->common.bs),
-                                    qed_resume_alloc_bh, next_acb);
+    if (acb == s->allocating_acb) {
+        s->allocating_acb = NULL;
+        if (!qemu_co_queue_empty(&s->allocating_write_reqs)) {
+            qemu_co_enter_next(&s->allocating_write_reqs);
         } else if (s->header.features & QED_F_NEED_CHECK) {
             qed_start_need_check_timer(s);
         }
@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
     int ret;
 
     /* Cancel timer when the first allocating request comes in */
-    if (QSIMPLEQ_EMPTY(&s->allocating_write_reqs)) {
+    if (s->allocating_acb == NULL) {
         qed_cancel_need_check_timer(s);
     }
 
     /* Freeze this request if another allocating write is in progress */
-    if (acb != QSIMPLEQ_FIRST(&s->allocating_write_reqs)) {
-        QSIMPLEQ_INSERT_TAIL(&s->allocating_write_reqs, acb, next);
-    }
-    if (acb != QSIMPLEQ_FIRST(&s->allocating_write_reqs) ||
-        s->allocating_write_reqs_plugged) {
-        return -EINPROGRESS; /* wait for existing request to finish */
+    if (s->allocating_acb != acb || s->allocating_write_reqs_plugged) {
+        if (s->allocating_acb != NULL) {
+            qemu_co_queue_wait(&s->allocating_write_reqs, NULL);
+            assert(s->allocating_acb == NULL);
+        }
+        s->allocating_acb = acb;
+        return -EAGAIN; /* start over with looking up table entries */
     }
 
     acb->cur_nclusters = qed_bytes_to_clusters(s,
@@ -XXX,XX +XXX,XX @@ static void qed_aio_next_io(QEDAIOCB *acb)
             ret = qed_aio_read_data(acb, ret, offset, len);
         }
 
-        if (ret < 0) {
-            if (ret != -EINPROGRESS) {
-                qed_aio_complete(acb, ret);
-            }
+        if (ret < 0 && ret != -EAGAIN) {
+            qed_aio_complete(acb, ret);
             return;
         }
     }
diff --git a/block/qed.h b/block/qed.h
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.h
+++ b/block/qed.h
@@ -XXX,XX +XXX,XX @@ typedef struct {
     uint32_t l2_mask;
 
     /* Allocating write request queue */
-    QSIMPLEQ_HEAD(, QEDAIOCB) allocating_write_reqs;
+    QEDAIOCB *allocating_acb;
+    CoQueue allocating_write_reqs;
     bool allocating_write_reqs_plugged;
 
     /* Periodic flush and clear need check flag */
-- 
1.8.3.1

Now that we process a request in the same coroutine from beginning to
end and don't drop out of it any more, we can look like a proper
coroutine-based driver and simply call qed_aio_next_io() and get a
return value from it instead of spawning an additional coroutine that
reenters the parent when it's done.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/qed.c | 101 +++++++++++++-----------------------------------------------
 block/qed.h |   3 +-
 2 files changed, 22 insertions(+), 82 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@
 #include "qapi/qmp/qerror.h"
 #include "sysemu/block-backend.h"
 
-static const AIOCBInfo qed_aiocb_info = {
-    .aiocb_size         = sizeof(QEDAIOCB),
-};
-
 static int bdrv_qed_probe(const uint8_t *buf, int buf_size,
                           const char *filename)
 {
@@ -XXX,XX +XXX,XX @@ static CachedL2Table *qed_new_l2_table(BDRVQEDState *s)
     return l2_table;
 }
 
-static void qed_aio_next_io(QEDAIOCB *acb);
-
-static void qed_aio_start_io(QEDAIOCB *acb)
-{
-    qed_aio_next_io(acb);
-}
-
 static void qed_plug_allocating_write_reqs(BDRVQEDState *s)
 {
     assert(!s->allocating_write_reqs_plugged);
@@ -XXX,XX +XXX,XX @@ static int64_t coroutine_fn bdrv_qed_co_get_block_status(BlockDriverState *bs,
 
 static BDRVQEDState *acb_to_s(QEDAIOCB *acb)
 {
-    return acb->common.bs->opaque;
+    return acb->bs->opaque;
 }
 
 /**
@@ -XXX,XX +XXX,XX @@ static void qed_update_l2_table(BDRVQEDState *s, QEDTable *table, int index,
     }
 }
 
-static void qed_aio_complete_bh(void *opaque)
-{
-    QEDAIOCB *acb = opaque;
-    BDRVQEDState *s = acb_to_s(acb);
-    BlockCompletionFunc *cb = acb->common.cb;
-    void *user_opaque = acb->common.opaque;
-    int ret = acb->bh_ret;
-
-    qemu_aio_unref(acb);
-
-    /* Invoke callback */
-    qed_acquire(s);
-    cb(user_opaque, ret);
-    qed_release(s);
-}
-
-static void qed_aio_complete(QEDAIOCB *acb, int ret)
+static void qed_aio_complete(QEDAIOCB *acb)
 {
     BDRVQEDState *s = acb_to_s(acb);
 
-    trace_qed_aio_complete(s, acb, ret);
-
     /* Free resources */
     qemu_iovec_destroy(&acb->cur_qiov);
     qed_unref_l2_cache_entry(acb->request.l2_table);
@@ -XXX,XX +XXX,XX @@ static void qed_aio_complete(QEDAIOCB *acb, int ret)
         acb->qiov->iov[0].iov_base = NULL;
     }
 
-    /* Arrange for a bh to invoke the completion function */
-    acb->bh_ret = ret;
-    aio_bh_schedule_oneshot(bdrv_get_aio_context(acb->common.bs),
-                            qed_aio_complete_bh, acb);
-
     /* Start next allocating write request waiting behind this one.  Note that
      * requests enqueue themselves when they first hit an unallocated cluster
      * but they wait until the entire request is finished before waking up the
@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset, size_t len)
         struct iovec *iov = acb->qiov->iov;
 
         if (!iov->iov_base) {
-            iov->iov_base = qemu_try_blockalign(acb->common.bs, iov->iov_len);
+            iov->iov_base = qemu_try_blockalign(acb->bs, iov->iov_len);
             if (iov->iov_base == NULL) {
                 return -ENOMEM;
             }
@@ -XXX,XX +XXX,XX @@ static int qed_aio_read_data(void *opaque, int ret, uint64_t offset, size_t len)
 {
     QEDAIOCB *acb = opaque;
     BDRVQEDState *s = acb_to_s(acb);
-    BlockDriverState *bs = acb->common.bs;
+    BlockDriverState *bs = acb->bs;
 
     /* Adjust offset into cluster */
     offset += qed_offset_into_cluster(s, acb->cur_pos);
@@ -XXX,XX +XXX,XX @@ static int qed_aio_read_data(void *opaque, int ret, uint64_t offset, size_t len)
 /**
  * Begin next I/O or complete the request
  */
-static void qed_aio_next_io(QEDAIOCB *acb)
+static int qed_aio_next_io(QEDAIOCB *acb)
 {
     BDRVQEDState *s = acb_to_s(acb);
     uint64_t offset;
@@ -XXX,XX +XXX,XX @@ static void qed_aio_next_io(QEDAIOCB *acb)
 
         /* Complete request */
         if (acb->cur_pos >= acb->end_pos) {
-            qed_aio_complete(acb, 0);
-            return;
+            ret = 0;
+            break;
         }
 
         /* Find next cluster and start I/O */
         len = acb->end_pos - acb->cur_pos;
         ret = qed_find_cluster(s, &acb->request, acb->cur_pos, &len, &offset);
         if (ret < 0) {
-            qed_aio_complete(acb, ret);
-            return;
+            break;
         }
 
         if (acb->flags & QED_AIOCB_WRITE) {
@@ -XXX,XX +XXX,XX @@ static void qed_aio_next_io(QEDAIOCB *acb)
         }
 
         if (ret < 0 && ret != -EAGAIN) {
-            qed_aio_complete(acb, ret);
-            return;
+            break;
         }
     }
-}
 
-typedef struct QEDRequestCo {
-    Coroutine *co;
-    bool done;
-    int ret;
-} QEDRequestCo;
-
-static void qed_co_request_cb(void *opaque, int ret)
-{
-    QEDRequestCo *co = opaque;
-
-    co->done = true;
-    co->ret = ret;
-    qemu_coroutine_enter_if_inactive(co->co);
+    trace_qed_aio_complete(s, acb, ret);
+    qed_aio_complete(acb);
+    return ret;
 }
 
 static int coroutine_fn qed_co_request(BlockDriverState *bs, int64_t sector_num,
                                        QEMUIOVector *qiov, int nb_sectors,
                                        int flags)
 {
-    QEDRequestCo co = {
-        .co     = qemu_coroutine_self(),
-        .done   = false,
+    QEDAIOCB acb = {
+        .bs         = bs,
+        .cur_pos    = (uint64_t) sector_num * BDRV_SECTOR_SIZE,
+        .end_pos    = (sector_num + nb_sectors) * BDRV_SECTOR_SIZE,
+        .qiov       = qiov,
+        .flags      = flags,
     };
-    QEDAIOCB *acb = qemu_aio_get(&qed_aiocb_info, bs, qed_co_request_cb, &co);
-
-    trace_qed_aio_setup(bs->opaque, acb, sector_num, nb_sectors, &co, flags);
+    qemu_iovec_init(&acb.cur_qiov, qiov->niov);
 
-    acb->flags = flags;
-    acb->qiov = qiov;
-    acb->qiov_offset = 0;
-    acb->cur_pos = (uint64_t)sector_num * BDRV_SECTOR_SIZE;
-    acb->end_pos = acb->cur_pos + nb_sectors * BDRV_SECTOR_SIZE;
-    acb->backing_qiov = NULL;
-    acb->request.l2_table = NULL;
-    qemu_iovec_init(&acb->cur_qiov, qiov->niov);
+    trace_qed_aio_setup(bs->opaque, &acb, sector_num, nb_sectors, NULL, flags);
 
     /* Start request */
-    qed_aio_start_io(acb);
-
-    if (!co.done) {
-        qemu_coroutine_yield();
-    }
-
-    return co.ret;
+    return qed_aio_next_io(&acb);
 }
 
 static int coroutine_fn bdrv_qed_co_readv(BlockDriverState *bs,
diff --git a/block/qed.h b/block/qed.h
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.h
+++ b/block/qed.h
@@ -XXX,XX +XXX,XX @@ enum {
 };
 
 typedef struct QEDAIOCB {
-    BlockAIOCB common;
-    int bh_ret;                     /* final return status for completion bh */
+    BlockDriverState *bs;
     QSIMPLEQ_ENTRY(QEDAIOCB) next;  /* next request */
     int flags;                      /* QED_AIOCB_* bits ORed together */
     uint64_t end_pos;               /* request end on block device, in bytes */
-- 
1.8.3.1

This fixes the last place where we degraded from AIO to actual blocking
synchronous I/O requests. Putting it into a coroutine means that instead
of blocking, the coroutine simply yields while doing I/O.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/qed.c | 33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@ static void qed_unplug_allocating_write_reqs(BDRVQEDState *s)
     qemu_co_enter_next(&s->allocating_write_reqs);
 }
 
-static void qed_clear_need_check(void *opaque, int ret)
+static void qed_need_check_timer_entry(void *opaque)
 {
     BDRVQEDState *s = opaque;
+    int ret;
 
-    if (ret) {
+    /* The timer should only fire when allocating writes have drained */
+    assert(!s->allocating_acb);
+
+    trace_qed_need_check_timer_cb(s);
+
+    qed_acquire(s);
+    qed_plug_allocating_write_reqs(s);
+
+    /* Ensure writes are on disk before clearing flag */
+    ret = bdrv_co_flush(s->bs->file->bs);
+    qed_release(s);
+    if (ret < 0) {
         qed_unplug_allocating_write_reqs(s);
         return;
     }
@@ -XXX,XX +XXX,XX @@ static void qed_clear_need_check(void *opaque, int ret)
 
     qed_unplug_allocating_write_reqs(s);
 
-    ret = bdrv_flush(s->bs);
+    ret = bdrv_co_flush(s->bs);
     (void) ret;
 }
 
 static void qed_need_check_timer_cb(void *opaque)
 {
-    BDRVQEDState *s = opaque;
-
-    /* The timer should only fire when allocating writes have drained */
-    assert(!s->allocating_acb);
-
-    trace_qed_need_check_timer_cb(s);
-
-    qed_acquire(s);
-    qed_plug_allocating_write_reqs(s);
-
-    /* Ensure writes are on disk before clearing flag */
-    bdrv_aio_flush(s->bs->file->bs, qed_clear_need_check, s);
-    qed_release(s);
+    Coroutine *co = qemu_coroutine_create(qed_need_check_timer_entry, opaque);
+    qemu_coroutine_enter(co);
 }
 
 void qed_acquire(BDRVQEDState *s)
-- 
1.8.3.1

Now that we stay in coroutine context for the whole request when doing
reads or writes, we can add coroutine_fn annotations to many functions
that can do I/O or yield directly.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/qed-cluster.c |  5 +++--
 block/qed.c         | 44 ++++++++++++++++++++++++--------------------
 block/qed.h         |  5 +++--
 3 files changed, 30 insertions(+), 24 deletions(-)

diff --git a/block/qed-cluster.c b/block/qed-cluster.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed-cluster.c
+++ b/block/qed-cluster.c
@@ -XXX,XX +XXX,XX @@ static unsigned int qed_count_contiguous_clusters(BDRVQEDState *s,
  * On failure QED_CLUSTER_L2 or QED_CLUSTER_L1 is returned for missing L2 or L1
  * table offset, respectively. len is number of contiguous unallocated bytes.
  */
-int qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
-                     size_t *len, uint64_t *img_offset)
+int coroutine_fn qed_find_cluster(BDRVQEDState *s, QEDRequest *request,
+                                  uint64_t pos, size_t *len,
+                                  uint64_t *img_offset)
 {
     uint64_t l2_offset;
     uint64_t offset = 0;
diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@ int qed_write_header_sync(BDRVQEDState *s)
  * This function only updates known header fields in-place and does not affect
  * extra data after the QED header.
  */
-static int qed_write_header(BDRVQEDState *s)
+static int coroutine_fn qed_write_header(BDRVQEDState *s)
 {
     /* We must write full sectors for O_DIRECT but cannot necessarily generate
      * the data following the header if an unrecognized compat feature is
@@ -XXX,XX +XXX,XX @@ static void qed_unplug_allocating_write_reqs(BDRVQEDState *s)
     qemu_co_enter_next(&s->allocating_write_reqs);
 }
 
-static void qed_need_check_timer_entry(void *opaque)
+static void coroutine_fn qed_need_check_timer_entry(void *opaque)
 {
     BDRVQEDState *s = opaque;
     int ret;
@@ -XXX,XX +XXX,XX @@ static BDRVQEDState *acb_to_s(QEDAIOCB *acb)
  * This function reads qiov->size bytes starting at pos from the backing file.
  * If there is no backing file then zeroes are read.
  */
-static int qed_read_backing_file(BDRVQEDState *s, uint64_t pos,
-                                 QEMUIOVector *qiov,
-                                 QEMUIOVector **backing_qiov)
+static int coroutine_fn qed_read_backing_file(BDRVQEDState *s, uint64_t pos,
+                                              QEMUIOVector *qiov,
+                                              QEMUIOVector **backing_qiov)
 {
     uint64_t backing_length = 0;
     size_t size;
@@ -XXX,XX +XXX,XX @@ static int qed_read_backing_file(BDRVQEDState *s, uint64_t pos,
  * @len:        Number of bytes
  * @offset:     Byte offset in image file
  */
-static int qed_copy_from_backing_file(BDRVQEDState *s, uint64_t pos,
-                                      uint64_t len, uint64_t offset)
+static int coroutine_fn qed_copy_from_backing_file(BDRVQEDState *s,
+                                                   uint64_t pos, uint64_t len,
+                                                   uint64_t offset)
 {
     QEMUIOVector qiov;
     QEMUIOVector *backing_qiov = NULL;
@@ -XXX,XX +XXX,XX @@ out:
  * The cluster offset may be an allocated byte offset in the image file, the
  * zero cluster marker, or the unallocated cluster marker.
  */
-static void qed_update_l2_table(BDRVQEDState *s, QEDTable *table, int index,
-                                unsigned int n, uint64_t cluster)
+static void coroutine_fn qed_update_l2_table(BDRVQEDState *s, QEDTable *table,
+                                             int index, unsigned int n,
+                                             uint64_t cluster)
 {
     int i;
     for (i = index; i < index + n; i++) {
@@ -XXX,XX +XXX,XX @@ static void qed_update_l2_table(BDRVQEDState *s, QEDTable *table, int index,
     }
 }
 
-static void qed_aio_complete(QEDAIOCB *acb)
+static void coroutine_fn qed_aio_complete(QEDAIOCB *acb)
 {
     BDRVQEDState *s = acb_to_s(acb);
 
@@ -XXX,XX +XXX,XX @@ static void qed_aio_complete(QEDAIOCB *acb)
 /**
  * Update L1 table with new L2 table offset and write it out
  */
-static int qed_aio_write_l1_update(QEDAIOCB *acb)
+static int coroutine_fn qed_aio_write_l1_update(QEDAIOCB *acb)
 {
     BDRVQEDState *s = acb_to_s(acb);
     CachedL2Table *l2_table = acb->request.l2_table;
@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_l1_update(QEDAIOCB *acb)
 /**
  * Update L2 table with new cluster offsets and write them out
  */
-static int qed_aio_write_l2_update(QEDAIOCB *acb, uint64_t offset)
+static int coroutine_fn qed_aio_write_l2_update(QEDAIOCB *acb, uint64_t offset)
 {
     BDRVQEDState *s = acb_to_s(acb);
     bool need_alloc = acb->find_cluster_ret == QED_CLUSTER_L1;
@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_l2_update(QEDAIOCB *acb, uint64_t offset)
 /**
  * Write data to the image file
  */
-static int qed_aio_write_main(QEDAIOCB *acb)
+static int coroutine_fn qed_aio_write_main(QEDAIOCB *acb)
 {
     BDRVQEDState *s = acb_to_s(acb);
     uint64_t offset = acb->cur_cluster +
@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_main(QEDAIOCB *acb)
 /**
  * Populate untouched regions of new data cluster
  */
-static int qed_aio_write_cow(QEDAIOCB *acb)
+static int coroutine_fn qed_aio_write_cow(QEDAIOCB *acb)
 {
     BDRVQEDState *s = acb_to_s(acb);
     uint64_t start, len, offset;
@@ -XXX,XX +XXX,XX @@ static bool qed_should_set_need_check(BDRVQEDState *s)
  *
  * This path is taken when writing to previously unallocated clusters.
  */
-static int qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
+static int coroutine_fn qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
 {
     BDRVQEDState *s = acb_to_s(acb);
     int ret;
@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
  *
  * This path is taken when writing to already allocated clusters.
  */
-static int qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset, size_t len)
+static int coroutine_fn qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset,
+                                              size_t len)
 {
     /* Allocate buffer for zero writes */
     if (acb->flags & QED_AIOCB_ZERO) {
@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset, size_t len)
  * @offset:     Cluster offset in bytes
  * @len:        Length in bytes
  */
-static int qed_aio_write_data(void *opaque, int ret,
-                              uint64_t offset, size_t len)
+static int coroutine_fn qed_aio_write_data(void *opaque, int ret,
+                                           uint64_t offset, size_t len)
 {
     QEDAIOCB *acb = opaque;
 
@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_data(void *opaque, int ret,
  * @offset:     Cluster offset in bytes
  * @len:        Length in bytes
  */
-static int qed_aio_read_data(void *opaque, int ret, uint64_t offset, size_t len)
+static int coroutine_fn qed_aio_read_data(void *opaque, int ret,
+                                          uint64_t offset, size_t len)
 {
     QEDAIOCB *acb = opaque;
     BDRVQEDState *s = acb_to_s(acb);
@@ -XXX,XX +XXX,XX @@ static int qed_aio_read_data(void *opaque, int ret, uint64_t offset, size_t len)
 /**
  * Begin next I/O or complete the request
  */
-static int qed_aio_next_io(QEDAIOCB *acb)
+static int coroutine_fn qed_aio_next_io(QEDAIOCB *acb)
 {
     BDRVQEDState *s = acb_to_s(acb);
     uint64_t offset;
diff --git a/block/qed.h b/block/qed.h
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.h
+++ b/block/qed.h
@@ -XXX,XX +XXX,XX @@ int qed_write_l2_table_sync(BDRVQEDState *s, QEDRequest *request,
 /**
  * Cluster functions
  */
-int qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
-                     size_t *len, uint64_t *img_offset);
+int coroutine_fn qed_find_cluster(BDRVQEDState *s, QEDRequest *request,
+                                  uint64_t pos, size_t *len,
+                                  uint64_t *img_offset);
 
 /**
  * Consistency check
-- 
1.8.3.1

All functions that are marked coroutine_fn can directly call the
bdrv_co_* version of functions instead of going through the wrapper.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Manos Pitsidianakis <el13635@mail.ntua.gr>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/qed.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn qed_write_header(BDRVQEDState *s)
     };
     qemu_iovec_init_external(&qiov, &iov, 1);
 
-    ret = bdrv_preadv(s->bs->file, 0, &qiov);
+    ret = bdrv_co_preadv(s->bs->file, 0, qiov.size, &qiov, 0);
     if (ret < 0) {
         goto out;
     }
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn qed_write_header(BDRVQEDState *s)
     /* Update header */
     qed_header_cpu_to_le(&s->header, (QEDHeader *) buf);
 
-    ret = bdrv_pwritev(s->bs->file, 0, &qiov);
+    ret = bdrv_co_pwritev(s->bs->file, 0, qiov.size,  &qiov, 0);
     if (ret < 0) {
         goto out;
     }
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn qed_read_backing_file(BDRVQEDState *s, uint64_t pos,
     qemu_iovec_concat(*backing_qiov, qiov, 0, size);
 
     BLKDBG_EVENT(s->bs->file, BLKDBG_READ_BACKING_AIO);
-    ret = bdrv_preadv(s->bs->backing, pos, *backing_qiov);
+    ret = bdrv_co_preadv(s->bs->backing, pos, size, *backing_qiov, 0);
     if (ret < 0) {
         return ret;
     }
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn qed_copy_from_backing_file(BDRVQEDState *s,
     }
 
     BLKDBG_EVENT(s->bs->file, BLKDBG_COW_WRITE);
-    ret = bdrv_pwritev(s->bs->file, offset, &qiov);
+    ret = bdrv_co_pwritev(s->bs->file, offset, qiov.size, &qiov, 0);
     if (ret < 0) {
         goto out;
     }
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn qed_aio_write_main(QEDAIOCB *acb)
     trace_qed_aio_write_main(s, acb, 0, offset, acb->cur_qiov.size);
 
     BLKDBG_EVENT(s->bs->file, BLKDBG_WRITE_AIO);
-    ret = bdrv_pwritev(s->bs->file, offset, &acb->cur_qiov);
+    ret = bdrv_co_pwritev(s->bs->file, offset, acb->cur_qiov.size,
+                          &acb->cur_qiov, 0);
     if (ret < 0) {
         return ret;
     }
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn qed_aio_write_main(QEDAIOCB *acb)
              * region.  The solution is to flush after writing a new data
              * cluster and before updating the L2 table.
              */
-            ret = bdrv_flush(s->bs->file->bs);
+            ret = bdrv_co_flush(s->bs->file->bs);
             if (ret < 0) {
                 return ret;
             }
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn qed_aio_read_data(void *opaque, int ret,
     }
 
     BLKDBG_EVENT(bs->file, BLKDBG_READ_AIO);
-    ret = bdrv_preadv(bs->file, offset, &acb->cur_qiov);
+    ret = bdrv_co_preadv(bs->file, offset, acb->cur_qiov.size,
+                         &acb->cur_qiov, 0);
     if (ret < 0) {
         return ret;
     }
-- 
1.8.3.1

From: "sochin.jiang" <sochin.jiang@huawei.com>

img_commit could fall into an infinite loop calling run_block_job() if
its block job fails on an I/O error. Fix this known problem.
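
The idea can be sketched outside QEMU as follows. This is a minimal toy
model, not the real block job code: ToyJob, toy_poll() and toy_run_job()
are hypothetical names, and -5 merely stands in for -EIO. It shows the two
points of the fix: the caller takes its own reference so the job cannot
go away under it, and the polling loop also ends when the job completes
without ever becoming ready.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a block job; not QEMU's BlockJob. */
typedef struct ToyJob {
    int refcnt;        /* number of holders; "freed" when it reaches 0 */
    bool ready;        /* job reached the point where it can be completed */
    bool completed;    /* job finished, successfully or not */
    bool freed;        /* set instead of calling free(), so it can be inspected */
    int ret;           /* final result: 0 or a negative errno-style value */
    int polls_left;    /* polls until the job becomes ready or fails */
    bool will_fail;    /* simulate an I/O error instead of becoming ready */
} ToyJob;

static void toy_job_ref(ToyJob *job)
{
    ++job->refcnt;
}

static void toy_job_unref(ToyJob *job)
{
    if (--job->refcnt == 0) {
        job->freed = true;
    }
}

/* One event-loop iteration: the job makes progress, fails, or becomes ready. */
static void toy_poll(ToyJob *job)
{
    if (job->ready || job->completed) {
        return;
    }
    if (job->polls_left > 0 && --job->polls_left > 0) {
        return;
    }
    if (job->will_fail) {
        job->completed = true;
        job->ret = -5;          /* stands in for -EIO */
        toy_job_unref(job);     /* a finished job drops its own reference */
    } else {
        job->ready = true;
    }
}

/* Runs the job to the end and returns its result. */
static int toy_run_job(ToyJob *job)
{
    int ret;

    toy_job_ref(job);           /* keep the job alive while we look at it */
    do {
        toy_poll(job);
    } while (!job->ready && !job->completed);

    if (!job->completed) {
        /* Stand-in for completing a ready job synchronously. */
        job->completed = true;
        job->ret = 0;
        toy_job_unref(job);     /* a finished job drops its own reference */
    }

    assert(!job->freed);        /* our reference is still held here */
    ret = job->ret;
    toy_job_unref(job);
    return ret;
}
```

Without the extra reference, the failing case would read job->ret from a
job that has already been released; without the completed check, the
loop would wait forever for a ready flag that never gets set.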

Signed-off-by: sochin.jiang <sochin.jiang@huawei.com>
Message-id: 1497509253-28941-1-git-send-email-sochin.jiang@huawei.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 blockjob.c               |  4 ++--
 include/block/blockjob.h | 18 ++++++++++++++++++
 qemu-img.c               | 20 +++++++++++++-------
 3 files changed, 33 insertions(+), 9 deletions(-)

diff --git a/blockjob.c b/blockjob.c
index XXXXXXX..XXXXXXX 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -XXX,XX +XXX,XX @@ static void block_job_resume(BlockJob *job)
     block_job_enter(job);
 }
 
-static void block_job_ref(BlockJob *job)
+void block_job_ref(BlockJob *job)
 {
     ++job->refcnt;
 }
@@ -XXX,XX +XXX,XX @@ static void block_job_attached_aio_context(AioContext *new_context,
                                            void *opaque);
 static void block_job_detach_aio_context(void *opaque);
 
-static void block_job_unref(BlockJob *job)
+void block_job_unref(BlockJob *job)
 {
     if (--job->refcnt == 0) {
         BlockDriverState *bs = blk_bs(job->blk);
diff --git a/include/block/blockjob.h b/include/block/blockjob.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/blockjob.h
+++ b/include/block/blockjob.h
@@ -XXX,XX +XXX,XX @@ void block_job_iostatus_reset(BlockJob *job);
 BlockJobTxn *block_job_txn_new(void);
 
 /**
+ * block_job_ref:
+ *
+ * Add a reference to the BlockJob. The reference is released with
+ * block_job_unref, which frees the job when the last reference is
+ * dropped.
+ */
+void block_job_ref(BlockJob *job);
+
+/**
+ * block_job_unref:
+ *
+ * Release a reference that was previously acquired with block_job_ref
+ * or block_job_create. If it's the last reference to the object, it will be
+ * freed.
+ */
+void block_job_unref(BlockJob *job);
+
+/**
  * block_job_txn_unref:
  *
  * Release a reference that was previously acquired with block_job_txn_add_job
diff --git a/qemu-img.c b/qemu-img.c
index XXXXXXX..XXXXXXX 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -XXX,XX +XXX,XX @@ static void common_block_job_cb(void *opaque, int ret)
 static void run_block_job(BlockJob *job, Error **errp)
 {
     AioContext *aio_context = blk_get_aio_context(job->blk);
+    int ret = 0;
 
-    /* FIXME In error cases, the job simply goes away and we access a dangling
-     * pointer below. */
     aio_context_acquire(aio_context);
+    block_job_ref(job);
     do {
         aio_poll(aio_context, true);
         qemu_progress_print(job->len ?
                             ((float)job->offset / job->len * 100.f) : 0.0f, 0);
-    } while (!job->ready);
+    } while (!job->ready && !job->completed);
 
-    block_job_complete_sync(job, errp);
+    if (!job->completed) {
+        ret = block_job_complete_sync(job, errp);
+    } else {
+        ret = job->ret;
+    }
+    block_job_unref(job);
     aio_context_release(aio_context);
 
-    /* A block job may finish instantaneously without publishing any progress,
-     * so just signal completion here */
-    qemu_progress_print(100.f, 0);
+    /* publish completion progress only when success */
+    if (!ret) {
+        qemu_progress_print(100.f, 0);
+    }
 }
 
 static int img_commit(int argc, char **argv)
-- 
1.8.3.1

From: Max Reitz <mreitz@redhat.com>

The bs->exact_filename field may not be sufficient to store the full
blkdebug node filename. In this case, we should not generate a filename
at all instead of an unusable one.
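
The truncation check relies on the standard snprintf() contract: the
return value is the length the full string would have had, so a value
greater than or equal to the buffer size means the output was cut off.
A minimal sketch, assuming a hypothetical helper format_node_name() that
is not part of QEMU:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not a QEMU function): format "prefix:a:b" into buf.
 * Returns true on success.  If the result does not fit, or snprintf()
 * reports an error, buf is set to the empty string and false is returned,
 * so callers never see a truncated name.
 */
static bool format_node_name(char *buf, size_t size, const char *prefix,
                             const char *a, const char *b)
{
    int ret;

    assert(size > 0);
    ret = snprintf(buf, size, "%s:%s:%s", prefix, a, b);
    if (ret < 0 || (size_t)ret >= size) {
        /* An overflow makes the name unusable, so report none at all */
        buf[0] = '\0';
        return false;
    }
    return true;
}
```

Unlike the patch, the sketch also treats a negative return value as a
failure; that is an extra precaution in the example, not something the
patch itself does.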

Cc: qemu-stable@nongnu.org
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20170613172006.19685-2-mreitz@redhat.com
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/blkdebug.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/block/blkdebug.c b/block/blkdebug.c
index XXXXXXX..XXXXXXX 100644
--- a/block/blkdebug.c
+++ b/block/blkdebug.c
@@ -XXX,XX +XXX,XX @@ static void blkdebug_refresh_filename(BlockDriverState *bs, QDict *options)
     }
 
     if (!force_json && bs->file->bs->exact_filename[0]) {
-        snprintf(bs->exact_filename, sizeof(bs->exact_filename),
-                 "blkdebug:%s:%s", s->config_file ?: "",
-                 bs->file->bs->exact_filename);
+        int ret = snprintf(bs->exact_filename, sizeof(bs->exact_filename),
+                           "blkdebug:%s:%s", s->config_file ?: "",
+                           bs->file->bs->exact_filename);
+        if (ret >= sizeof(bs->exact_filename)) {
+            /* An overflow makes the filename unusable, so do not report any */
+            bs->exact_filename[0] = 0;
+        }
     }
 
     opts = qdict_new();
-- 
1.8.3.1

From: Max Reitz <mreitz@redhat.com>

The bs->exact_filename field may not be sufficient to store the full
blkverify node filename. In this case, we should not generate a filename
at all instead of an unusable one.

Cc: qemu-stable@nongnu.org
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20170613172006.19685-3-mreitz@redhat.com
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/blkverify.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/block/blkverify.c b/block/blkverify.c
index XXXXXXX..XXXXXXX 100644
--- a/block/blkverify.c
+++ b/block/blkverify.c
@@ -XXX,XX +XXX,XX @@ static void blkverify_refresh_filename(BlockDriverState *bs, QDict *options)
     if (bs->file->bs->exact_filename[0]
         && s->test_file->bs->exact_filename[0])
     {
-        snprintf(bs->exact_filename, sizeof(bs->exact_filename),
-                 "blkverify:%s:%s",
-                 bs->file->bs->exact_filename,
-                 s->test_file->bs->exact_filename);
+        int ret = snprintf(bs->exact_filename, sizeof(bs->exact_filename),
+                           "blkverify:%s:%s",
+                           bs->file->bs->exact_filename,
+                           s->test_file->bs->exact_filename);
+        if (ret >= sizeof(bs->exact_filename)) {
+            /* An overflow makes the filename unusable, so do not report any */
+            bs->exact_filename[0] = 0;
+        }
     }
 }
 
-- 
1.8.3.1

From: Max Reitz <mreitz@redhat.com>

uri_parse(...)->scheme may be NULL. In fact, probably every field may be
NULL, and the callers do test this for all of the other fields but not
for scheme (except for block/gluster.c; block/vxhs.c does not access
that field at all).

We can easily fix this by using g_strcmp0() instead of strcmp().
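
To illustrate why this is enough, here is a sketch that does not depend
on GLib. null_safe_strcmp() is a hypothetical stand-in that mirrors the
documented behaviour of g_strcmp0() (NULL equals NULL and sorts before
any string), and classify_scheme() is an invented helper modelled on the
transport checks touched by this patch:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in mirroring the documented behaviour of GLib's
 * g_strcmp0(): two NULL pointers compare equal, and NULL sorts before
 * any non-NULL string.
 */
static int null_safe_strcmp(const char *a, const char *b)
{
    if (!a) {
        return b ? -1 : 0;
    }
    if (!b) {
        return 1;
    }
    return strcmp(a, b);
}

/*
 * Hypothetical helper in the style of the URI parsers: returns 0 for a
 * TCP transport, 1 for a UNIX socket transport, and -1 for an invalid or
 * missing scheme.  A NULL scheme simply falls through to the error case.
 */
static int classify_scheme(const char *scheme)
{
    if (!null_safe_strcmp(scheme, "nbd")) {
        return 0;
    } else if (!null_safe_strcmp(scheme, "nbd+tcp")) {
        return 0;
    } else if (!null_safe_strcmp(scheme, "nbd+unix")) {
        return 1;
    }
    return -1;
}
```

With plain strcmp(), the NULL case would be undefined behaviour; with the
NULL-safe comparison, it takes the same path as any other unknown scheme.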

Cc: qemu-stable@nongnu.org
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20170613205726.13544-1-mreitz@redhat.com
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/nbd.c      | 6 +++---
 block/nfs.c      | 2 +-
 block/sheepdog.c | 6 +++---
 block/ssh.c      | 2 +-
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index XXXXXXX..XXXXXXX 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -XXX,XX +XXX,XX @@ static int nbd_parse_uri(const char *filename, QDict *options)
     }
 
     /* transport */
-    if (!strcmp(uri->scheme, "nbd")) {
+    if (!g_strcmp0(uri->scheme, "nbd")) {
         is_unix = false;
-    } else if (!strcmp(uri->scheme, "nbd+tcp")) {
+    } else if (!g_strcmp0(uri->scheme, "nbd+tcp")) {
         is_unix = false;
-    } else if (!strcmp(uri->scheme, "nbd+unix")) {
+    } else if (!g_strcmp0(uri->scheme, "nbd+unix")) {
         is_unix = true;
     } else {
         ret = -EINVAL;
diff --git a/block/nfs.c b/block/nfs.c
index XXXXXXX..XXXXXXX 100644
--- a/block/nfs.c
+++ b/block/nfs.c
@@ -XXX,XX +XXX,XX @@ static int nfs_parse_uri(const char *filename, QDict *options, Error **errp)
         error_setg(errp, "Invalid URI specified");
         goto out;
     }
-    if (strcmp(uri->scheme, "nfs") != 0) {
+    if (g_strcmp0(uri->scheme, "nfs") != 0) {
         error_setg(errp, "URI scheme must be 'nfs'");
         goto out;
     }
diff --git a/block/sheepdog.c b/block/sheepdog.c
index XXXXXXX..XXXXXXX 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -XXX,XX +XXX,XX @@ static void sd_parse_uri(SheepdogConfig *cfg, const char *filename,
     }
 
     /* transport */
-    if (!strcmp(uri->scheme, "sheepdog")) {
+    if (!g_strcmp0(uri->scheme, "sheepdog")) {
         is_unix = false;
-    } else if (!strcmp(uri->scheme, "sheepdog+tcp")) {
+    } else if (!g_strcmp0(uri->scheme, "sheepdog+tcp")) {
         is_unix = false;
-    } else if (!strcmp(uri->scheme, "sheepdog+unix")) {
+    } else if (!g_strcmp0(uri->scheme, "sheepdog+unix")) {
         is_unix = true;
     } else {
         error_setg(&err, "URI scheme must be 'sheepdog', 'sheepdog+tcp',"
diff --git a/block/ssh.c b/block/ssh.c
index XXXXXXX..XXXXXXX 100644
--- a/block/ssh.c
+++ b/block/ssh.c
@@ -XXX,XX +XXX,XX @@ static int parse_uri(const char *filename, QDict *options, Error **errp)
         return -EINVAL;
     }
 
-    if (strcmp(uri->scheme, "ssh") != 0) {
+    if (g_strcmp0(uri->scheme, "ssh") != 0) {
         error_setg(errp, "URI scheme must be 'ssh'");
         goto err;
     }
-- 
1.8.3.1

The following changes since commit ae49fbbcd8e4e9d8bf7131add34773f579e1aff7:

Merge remote-tracking branch 'remotes/rth/tags/pull-tcg-20171025' into staging (2017-10-25 16:38:57 +0100)

are available in the git repository at:

git://repo.or.cz/qemu/kevin.git tags/for-upstream

for you to fetch changes up to 4254d01ce4eec9a3ccf320d14e2da132b8ad4a51:

Merge remote-tracking branch 'mreitz/tags/pull-block-2017-10-26' into queue-block (2017-10-26 15:02:40 +0200)

----------------------------------------------------------------
Block layer patches

----------------------------------------------------------------
Alberto Garcia (1):
      qcow2: Use BDRV_SECTOR_BITS instead of its literal value

Eric Blake (24):
      block: Allow NULL file for bdrv_get_block_status()
      block: Add flag to avoid wasted work in bdrv_is_allocated()
      block: Make bdrv_round_to_clusters() signature more useful
      qcow2: Switch is_zero_sectors() to byte-based
      block: Switch bdrv_make_zero() to byte-based
      qemu-img: Switch get_block_status() to byte-based
      block: Convert bdrv_get_block_status() to bytes
      block: Switch bdrv_co_get_block_status() to byte-based
      block: Switch BdrvCoGetBlockStatusData to byte-based
      block: Switch bdrv_common_block_status_above() to byte-based
      block: Switch bdrv_co_get_block_status_above() to byte-based
      block: Convert bdrv_get_block_status_above() to bytes
      qemu-img: Simplify logic in img_compare()
      qemu-img: Speed up compare on pre-allocated larger file
      qemu-img: Add find_nonzero()
      qemu-img: Drop redundant error message in compare
      qemu-img: Change check_empty_sectors() to byte-based
      qemu-img: Change compare_sectors() to be byte-based
      qemu-img: Change img_rebase() to be byte-based
      qemu-img: Change img_compare() to be byte-based
      block: Align block status requests
      block: Reduce bdrv_aligned_preadv() rounding
      qcow2: Reduce is_zero() rounding
      qemu-io: Relax 'alloc' now that block-status doesn't assert

Kevin Wolf (2):
      qemu-iotests: Test backing_fmt with backing node reference
      Merge remote-tracking branch 'mreitz/tags/pull-block-2017-10-26' into queue-block

Max Reitz (8):
      qemu-img.1: Image invalidation on qemu-img commit
      iotests: Add test for dataplane mirroring
      iotests: Pull _filter_actual_image_size from 67/87
      iotests: Filter actual image size in 184 and 191
      qcow2: Emit errp when truncating the image tail
      qcow2: Fix unaligned preallocated truncation
      qcow2: Always execute preallocate() in a coroutine
      iotests: Add cluster_size=64k to 125

Peter Krempa (1):
      block: don't add 'driver' to options when referring to backing via node name

From: Peter Krempa <pkrempa@redhat.com>

When referring to a backing file of an image via node name,
bdrv_open_backing_file would add the 'driver' option to the option list,
filling it with the backing format driver. This breaks construction of
the backing chain via -blockdev, as bdrv_open_inherit reports an error
if both 'reference' and 'options' are provided.

$ qemu-img create -f raw /tmp/backing.raw 64M
$ qemu-img create -f qcow2 -F raw -b /tmp/backing.raw /tmp/test.qcow2
$ qemu-system-x86_64 \
  -blockdev driver=file,filename=/tmp/backing.raw,node-name=backing \
  -blockdev driver=qcow2,file.driver=file,file.filename=/tmp/test.qcow2,node-name=root,backing=backing
qemu-system-x86_64: -blockdev driver=qcow2,file.driver=file,file.filename=/tmp/test.qcow2,node-name=root,backing=backing: Could not open backing file: Cannot reference an existing block device with additional options or a new filename
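
The decision the fix makes can be written as a small predicate. This is
only a sketch: should_add_driver() is a hypothetical function, and its
parameters stand in for the node-name reference, bs->backing_format and
the result of checking the options for a "driver" key.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical predicate: should the backing format be added to the
 * options as "driver"?  Only when the backing file is opened from its
 * filename (no node-name reference), the image records a backing format,
 * and the caller did not already choose a driver.
 */
static bool should_add_driver(const char *reference,
                              const char *backing_format,
                              bool options_has_driver)
{
    return !reference &&
           backing_format[0] != '\0' &&
           !options_has_driver;
}
```

When a reference is given, the options stay empty, so the check for
"reference plus additional options" no longer triggers.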

Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block.c b/block.c
index XXXXXXX..XXXXXXX 100644
--- a/block.c
+++ b/block.c
@@ -XXX,XX +XXX,XX @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict *parent_options,
         goto free_exit;
     }
 
-    if (bs->backing_format[0] != '\0' && !qdict_haskey(options, "driver")) {
+    if (!reference &&
+        bs->backing_format[0] != '\0' && !qdict_haskey(options, "driver")) {
         qdict_put_str(options, "driver", bs->backing_format);
     }
 
-- 
2.13.6

This changes test case 191 to include a backing image that has
backing_fmt set in the image file, but is referenced by node name in the
qemu command line.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 tests/qemu-iotests/191     | 3 ++-
 tests/qemu-iotests/191.out | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/tests/qemu-iotests/191 b/tests/qemu-iotests/191
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/191
+++ b/tests/qemu-iotests/191
@@ -XXX,XX +XXX,XX @@ echo === Preparing and starting VM ===
 echo
 
 TEST_IMG="${TEST_IMG}.base" _make_test_img $size
-TEST_IMG="${TEST_IMG}.mid" _make_test_img -b "${TEST_IMG}.base"
+IMGOPTS=$(_optstr_add "$IMGOPTS" "backing_fmt=$IMGFMT") \
+    TEST_IMG="${TEST_IMG}.mid" _make_test_img -b "${TEST_IMG}.base"
 _make_test_img -b "${TEST_IMG}.mid"
 TEST_IMG="${TEST_IMG}.ovl2" _make_test_img -b "${TEST_IMG}.mid"
 
diff --git a/tests/qemu-iotests/191.out b/tests/qemu-iotests/191.out
index XXXXXXX..XXXXXXX 100644
--- a/tests/qemu-iotests/191.out
+++ b/tests/qemu-iotests/191.out
@@ -XXX,XX +XXX,XX @@ QA output created by 191
 === Preparing and starting VM ===
 
 Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=67108864
-Formatting 'TEST_DIR/t.IMGFMT.mid', fmt=IMGFMT size=67108864 backing_file=TEST_DIR/t.IMGFMT.base
+Formatting 'TEST_DIR/t.IMGFMT.mid', fmt=IMGFMT size=67108864 backing_file=TEST_DIR/t.IMGFMT.base backing_fmt=IMGFMT
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864 backing_file=TEST_DIR/t.IMGFMT.mid
 Formatting 'TEST_DIR/t.IMGFMT.ovl2', fmt=IMGFMT size=67108864 backing_file=TEST_DIR/t.IMGFMT.mid
 wrote 65536/65536 bytes at offset 1048576
-- 
2.13.6

From: Eric Blake <eblake@redhat.com>

Not all callers care about which BDS owns the mapping for a given
range of the file.  This patch merely simplifies the callers by
consolidating the logic in the common call point, while guaranteeing
a non-NULL file to all the driver callbacks, for no semantic change.
The only caller that does not care about pnum is bdrv_is_allocated,
as invoked by vvfat; we can likewise add assertions that the rest
of the stack does not have to worry about a NULL pnum.

Furthermore, this will also set the stage for a future cleanup: when
a caller does not care about which BDS owns an offset, it would be
nice to allow the driver to optimize things to not have to return
BDRV_BLOCK_OFFSET_VALID in the first place.  In the case of fragmented
allocation (for example, it's fairly easy to create a qcow2 image
where consecutive guest addresses are not at consecutive host
addresses), the current contract requires bdrv_get_block_status()
to clamp *pnum to the limit where host addresses are no longer
consecutive, but allowing a NULL file means that *pnum could be
set to the full length of known-allocated data.
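
The pattern used for the optional parameter can be sketched in isolation.
Everything here is hypothetical (Layer, query_status() and the unit
counts are not QEMU types or functions); the point is only that the
function works on a local variable throughout and writes through the
caller's pointer at a single exit, and only if that pointer is non-NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layer type; stands in for a BlockDriverState. */
typedef struct Layer {
    const char *name;
    int length;          /* number of units this layer can answer for */
} Layer;

/*
 * Hypothetical query: report how many units starting at 'start' are
 * mapped (*pnum, required) and, optionally, which layer owns them
 * (*owner, may be NULL).  Returns 0 on success, negative on error.
 */
static int query_status(Layer *layer, int start, int count,
                        int *pnum, Layer **owner)
{
    Layer *local_owner = NULL;
    int ret = 0;

    assert(pnum);
    *pnum = 0;

    if (start < 0 || count < 0) {
        ret = -1;
        goto out;
    }
    if (start >= layer->length || count == 0) {
        goto out;               /* nothing mapped; owner stays NULL */
    }

    /* Clamp the answer to the end of the layer */
    *pnum = count < layer->length - start ? count : layer->length - start;
    local_owner = layer;

out:
    /* Single exit: only callers that asked for the owner get it */
    if (owner) {
        *owner = local_owner;
    }
    return ret;
}
```

Callers that do not care about the owner pass NULL and no longer need a
dummy variable, while the internal logic never has to test the caller's
pointer before using the owner.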

Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 include/block/block_int.h | 10 ++++++----
 block/io.c                | 49 ++++++++++++++++++++++++++---------------------
 block/mirror.c            |  3 +--
 block/qcow2.c             |  8 ++------
 qemu-img.c                | 10 ++++------
 5 files changed, 40 insertions(+), 40 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -XXX,XX +XXX,XX @@ struct BlockDriver {
         int64_t offset, int bytes);
 
     /*
-     * Building block for bdrv_block_status[_above]. The driver should
-     * answer only according to the current layer, and should not
-     * set BDRV_BLOCK_ALLOCATED, but may set BDRV_BLOCK_RAW.  See block.h
-     * for the meaning of _DATA, _ZERO, and _OFFSET_VALID.
+     * Building block for bdrv_block_status[_above] and
+     * bdrv_is_allocated[_above].  The driver should answer only
+     * according to the current layer, and should not set
+     * BDRV_BLOCK_ALLOCATED, but may set BDRV_BLOCK_RAW.  See block.h
+     * for the meaning of _DATA, _ZERO, and _OFFSET_VALID.  The block
+     * layer guarantees non-NULL pnum and file.
      */
     int64_t coroutine_fn (*bdrv_co_get_block_status)(BlockDriverState *bs,
         int64_t sector_num, int nb_sectors, int *pnum,
diff --git a/block/io.c b/block/io.c
index XXXXXXX..XXXXXXX 100644
--- a/block/io.c
+++ b/block/io.c
@@ -XXX,XX +XXX,XX @@ int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags flags)
 {
     int64_t target_sectors, ret, nb_sectors, sector_num = 0;
     BlockDriverState *bs = child->bs;
-    BlockDriverState *file;
     int n;
 
     target_sectors = bdrv_nb_sectors(bs);
@@ -XXX,XX +XXX,XX @@ int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags flags)
         if (nb_sectors <= 0) {
             return 0;
         }
-        ret = bdrv_get_block_status(bs, sector_num, nb_sectors, &n, &file);
+        ret = bdrv_get_block_status(bs, sector_num, nb_sectors, &n, NULL);
         if (ret < 0) {
             error_report("error getting block status at sector %" PRId64 ": %s",
                          sector_num, strerror(-ret));
@@ -XXX,XX +XXX,XX @@ int64_t coroutine_fn bdrv_co_get_block_status_from_backing(BlockDriverState *bs,
  * beyond the end of the disk image it will be clamped; if 'pnum' is set to
  * the end of the image, then the returned value will include BDRV_BLOCK_EOF.
  *
- * If returned value is positive and BDRV_BLOCK_OFFSET_VALID bit is set, 'file'
- * points to the BDS which the sector range is allocated in.
+ * If returned value is positive, BDRV_BLOCK_OFFSET_VALID bit is set, and
+ * 'file' is non-NULL, then '*file' points to the BDS which the sector range
+ * is allocated in.
  */
 static int64_t coroutine_fn bdrv_co_get_block_status(BlockDriverState *bs,
                                                      int64_t sector_num,
@@ -XXX,XX +XXX,XX @@ static int64_t coroutine_fn bdrv_co_get_block_status(BlockDriverState *bs,
     int64_t total_sectors;
     int64_t n;
     int64_t ret, ret2;
+    BlockDriverState *local_file = NULL;
 
-    *file = NULL;
+    assert(pnum);
+    *pnum = 0;
     total_sectors = bdrv_nb_sectors(bs);
     if (total_sectors < 0) {
-        return total_sectors;
+        ret = total_sectors;
+        goto early_out;
     }
 
     if (sector_num >= total_sectors) {
-        *pnum = 0;
-        return BDRV_BLOCK_EOF;
+        ret = BDRV_BLOCK_EOF;
+        goto early_out;
     }
     if (!nb_sectors) {
-        *pnum = 0;
-        return 0;
+        ret = 0;
+        goto early_out;
     }
 
     n = total_sectors - sector_num;
@@ -XXX,XX +XXX,XX @@ static int64_t coroutine_fn bdrv_co_get_block_status(BlockDriverState *bs,
         }
         if (bs->drv->protocol_name) {
             ret |= BDRV_BLOCK_OFFSET_VALID | (sector_num * BDRV_SECTOR_SIZE);
-            *file = bs;
+            local_file = bs;
         }
-        return ret;
+        goto early_out;
     }
 
     bdrv_inc_in_flight(bs);
     ret = bs->drv->bdrv_co_get_block_status(bs, sector_num, nb_sectors, pnum,
-                                            file);
+                                            &local_file);
     if (ret < 0) {
         *pnum = 0;
         goto out;
     }
 
     if (ret & BDRV_BLOCK_RAW) {
-        assert(ret & BDRV_BLOCK_OFFSET_VALID && *file);
-        ret = bdrv_co_get_block_status(*file, ret >> BDRV_SECTOR_BITS,
-                                       *pnum, pnum, file);
+        assert(ret & BDRV_BLOCK_OFFSET_VALID && local_file);
+        ret = bdrv_co_get_block_status(local_file, ret >> BDRV_SECTOR_BITS,
+                                       *pnum, pnum, &local_file);
         goto out;
     }
 
@@ -XXX,XX +XXX,XX @@ static int64_t coroutine_fn bdrv_co_get_block_status(BlockDriverState *bs,
         }
     }
 
-    if (*file && *file != bs &&
+    if (local_file && local_file != bs &&
         (ret & BDRV_BLOCK_DATA) && !(ret & BDRV_BLOCK_ZERO) &&
         (ret & BDRV_BLOCK_OFFSET_VALID)) {
-        BlockDriverState *file2;
         int file_pnum;
 
-        ret2 = bdrv_co_get_block_status(*file, ret >> BDRV_SECTOR_BITS,
-                                        *pnum, &file_pnum, &file2);
+        ret2 = bdrv_co_get_block_status(local_file, ret >> BDRV_SECTOR_BITS,
+                                        *pnum, &file_pnum, NULL);
         if (ret2 >= 0) {
             /* Ignore errors.  This is just providing extra information, it
              * is useful but not necessary.
@@ -XXX,XX +XXX,XX @@ out:
     if (ret >= 0 && sector_num + *pnum == total_sectors) {
         ret |= BDRV_BLOCK_EOF;
     }
+early_out:
+    if (file) {
+        *file = local_file;
+    }
     return ret;
 }
 
@@ -XXX,XX +XXX,XX @@ int64_t bdrv_get_block_status(BlockDriverState *bs,
 int coroutine_fn bdrv_is_allocated(BlockDriverState *bs, int64_t offset,
                                    int64_t bytes, int64_t *pnum)
 {
-    BlockDriverState *file;
     int64_t sector_num = offset >> BDRV_SECTOR_BITS;
     int nb_sectors = bytes >> BDRV_SECTOR_BITS;
     int64_t ret;
@@ -XXX,XX +XXX,XX @@ int coroutine_fn bdrv_is_allocated(BlockDriverState *bs, int64_t offset,
     assert(QEMU_IS_ALIGNED(offset, BDRV_SECTOR_SIZE));
     assert(QEMU_IS_ALIGNED(bytes, BDRV_SECTOR_SIZE) && bytes < INT_MAX);
     ret = bdrv_get_block_status(bs, sector_num, nb_sectors, &psectors,
-                                &file);
+                                NULL);
     if (ret < 0) {
         return ret;
     }
diff --git a/block/mirror.c b/block/mirror.c
index XXXXXXX..XXXXXXX 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -XXX,XX +XXX,XX @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
         int io_sectors;
         unsigned int io_bytes;
         int64_t io_bytes_acct;
-        BlockDriverState *file;
         enum MirrorMethod {
             MIRROR_METHOD_COPY,
             MIRROR_METHOD_ZERO,
@@ -XXX,XX +XXX,XX @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
         ret = bdrv_get_block_status_above(source, NULL,
                                           offset >> BDRV_SECTOR_BITS,
                                           nb_chunks * sectors_per_chunk,
-                                          &io_sectors, &file);
+                                          &io_sectors, NULL);
         io_bytes = io_sectors * BDRV_SECTOR_SIZE;
         if (ret < 0) {
             io_bytes = MIN(nb_chunks * s->granularity, max_io_bytes);
diff --git a/block/qcow2.c b/block/qcow2.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -XXX,XX +XXX,XX @@ static bool is_zero_sectors(BlockDriverState *bs, int64_t start,
                             uint32_t count)
 {
     int nr;
-    BlockDriverState *file;
     int64_t res;
 
     if (start + count > bs->total_sectors) {
@@ -XXX,XX +XXX,XX @@ static bool is_zero_sectors(BlockDriverState *bs, int64_t start,
     if (!count) {
         return true;
     }
-    res = bdrv_get_block_status_above(bs, NULL, start, count,
-                                      &nr, &file);
+    res = bdrv_get_block_status_above(bs, NULL, start, count, &nr, NULL);
     return res >= 0 && (res & BDRV_BLOCK_ZERO) && nr == count;
 }
 
@@ -XXX,XX +XXX,XX @@ static BlockMeasureInfo *qcow2_measure(QemuOpts *opts, BlockDriverState *in_bs,
                  offset += pnum * BDRV_SECTOR_SIZE) {
                 int nb_sectors = MIN(ssize - offset,
                                      BDRV_REQUEST_MAX_BYTES) / BDRV_SECTOR_SIZE;
-                BlockDriverState *file;
                 int64_t ret;
 
                 ret = bdrv_get_block_status_above(in_bs, NULL,
                                                   offset >> BDRV_SECTOR_BITS,
-                                                  nb_sectors,
-                                                  &pnum, &file);
+                                                  nb_sectors, &pnum, NULL);
                 if (ret < 0) {
                     error_setg_errno(&local_err, -ret,
                                      "Unable to get block status");
diff --git a/qemu-img.c b/qemu-img.c
index XXXXXXX..XXXXXXX 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
 
     for (;;) {
         int64_t status1, status2;
-        BlockDriverState *file;
 
         nb_sectors = sectors_to_process(total_sectors, sector_num);
         if (nb_sectors <= 0) {
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
         }
         status1 = bdrv_get_block_status_above(bs1, NULL, sector_num,
                                               total_sectors1 - sector_num,
-                                              &pnum1, &file);
+                                              &pnum1, NULL);
         if (status1 < 0) {
             ret = 3;
             error_report("Sector allocation test failed for %s", filename1);
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
 
         status2 = bdrv_get_block_status_above(bs2, NULL, sector_num,
                                               total_sectors2 - sector_num,
-                                              &pnum2, &file);
+                                              &pnum2, NULL);
         if (status2 < 0) {
             ret = 3;
             error_report("Sector allocation test failed for %s", filename2);
@@ -XXX,XX +XXX,XX @@ static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num)
     n = MIN(s->total_sectors - sector_num, BDRV_REQUEST_MAX_SECTORS);
 
     if (s->sector_next_status <= sector_num) {
-        BlockDriverState *file;
         if (s->target_has_backing) {
             ret = bdrv_get_block_status(blk_bs(s->src[src_cur]),
                                         sector_num - src_cur_offset,
-                                        n, &n, &file);
+                                        n, &n, NULL);
         } else {
             ret = bdrv_get_block_status_above(blk_bs(s->src[src_cur]), NULL,
                                               sector_num - src_cur_offset,
-                                              n, &n, &file);
+                                              n, &n, NULL);
         }
         if (ret < 0) {
             return ret;
-- 
2.13.6

From: Eric Blake <eblake@redhat.com>

Not all callers care about which BDS owns the mapping for a given
range of the file, or where the zeroes lie within that mapping.  In
particular, bdrv_is_allocated() cares more about finding the
largest run of allocated data from the guest perspective, whether
or not that data is consecutive from the host perspective, and
whether or not the data reads as zero.  For such callers, subsequent
refinements, such as checking how much of the format-layer
allocation also satisfies BDRV_BLOCK_ZERO at the protocol layer, are
wasted work.  In the best case, they just cost extra CPU cycles
during a single bdrv_is_allocated().  In the worst case, they
result in a smaller *pnum, which forces callers to issue more
status probes when visiting the entire file, costing even more
CPU cycles.

This patch only optimizes the block layer (no behavior change when
want_zero is true, but skip unnecessary effort when it is false).
Then when subsequent patches tweak the driver callback to be
byte-based, we can also pass this hint through to the driver.

Tweak BdrvCoGetBlockStatusData to declare arguments in parameter
order, rather than mixing things up (minimizing padding is not
necessary here).

Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/io.c | 57 +++++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 41 insertions(+), 16 deletions(-)
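
Reviewer note (not part of the commit): the cost argument above can be
illustrated with a toy model.  This is only a sketch under stated
assumptions: ToySector, toy_status() and toy_count_probes() are
hypothetical names, and the per-sector array stands in for the real
format/protocol layers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy image: one entry per sector. */
typedef struct ToySector {
    bool allocated;   /* allocated at the format layer */
    bool reads_zero;  /* protocol layer reports zeroes */
} ToySector;

/*
 * Report the allocation state at 'sector' and store in *pnum how many
 * sectors starting there share it.  With want_zero, a run also ends
 * where the zero state changes, which can only make *pnum smaller.
 */
static bool toy_status(const ToySector *img, size_t total, bool want_zero,
                       size_t sector, size_t *pnum)
{
    size_t i = sector + 1;

    assert(sector < total);
    while (i < total && img[i].allocated == img[sector].allocated &&
           (!want_zero || img[i].reads_zero == img[sector].reads_zero)) {
        i++;
    }
    *pnum = i - sector;
    return img[sector].allocated;
}

/* Count the status probes needed to walk the whole image. */
static size_t toy_count_probes(const ToySector *img, size_t total,
                               bool want_zero)
{
    size_t sector = 0, probes = 0, n;

    while (sector < total) {
        toy_status(img, total, want_zero, sector, &n);
        sector += n;
        probes++;
    }
    return probes;
}
```

In this model, a caller that only asks "is it allocated?" never needs
more probes with want_zero=false than with want_zero=true.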

diff --git a/block/io.c b/block/io.c
index XXXXXXX..XXXXXXX 100644
--- a/block/io.c
+++ b/block/io.c
@@ -XXX,XX +XXX,XX @@ int bdrv_flush_all(void)
 typedef struct BdrvCoGetBlockStatusData {
     BlockDriverState *bs;
     BlockDriverState *base;
-    BlockDriverState **file;
+    bool want_zero;
     int64_t sector_num;
     int nb_sectors;
     int *pnum;
+    BlockDriverState **file;
     int64_t ret;
     bool done;
 } BdrvCoGetBlockStatusData;
@@ -XXX,XX +XXX,XX @@ int64_t coroutine_fn bdrv_co_get_block_status_from_backing(BlockDriverState *bs,
  * Drivers not implementing the functionality are assumed to not support
  * backing files, hence all their sectors are reported as allocated.
  *
+ * If 'want_zero' is true, the caller is querying for mapping purposes,
+ * and the result should include BDRV_BLOCK_OFFSET_VALID and
+ * BDRV_BLOCK_ZERO where possible; otherwise, the result may omit those
+ * bits particularly if it allows for a larger value in 'pnum'.
+ *
  * If 'sector_num' is beyond the end of the disk image the return value is
  * BDRV_BLOCK_EOF and 'pnum' is set to 0.
  *
@@ -XXX,XX +XXX,XX @@ int64_t coroutine_fn bdrv_co_get_block_status_from_backing(BlockDriverState *bs,
  * is allocated in.
  */
 static int64_t coroutine_fn bdrv_co_get_block_status(BlockDriverState *bs,
+                                                     bool want_zero,
                                                      int64_t sector_num,
                                                      int nb_sectors, int *pnum,
                                                      BlockDriverState **file)
@@ -XXX,XX +XXX,XX @@ static int64_t coroutine_fn bdrv_co_get_block_status(BlockDriverState *bs,
 
     if (ret & BDRV_BLOCK_RAW) {
         assert(ret & BDRV_BLOCK_OFFSET_VALID && local_file);
-        ret = bdrv_co_get_block_status(local_file, ret >> BDRV_SECTOR_BITS,
+        ret = bdrv_co_get_block_status(local_file, want_zero,
+                                       ret >> BDRV_SECTOR_BITS,
                                        *pnum, pnum, &local_file);
         goto out;
     }
 
     if (ret & (BDRV_BLOCK_DATA | BDRV_BLOCK_ZERO)) {
         ret |= BDRV_BLOCK_ALLOCATED;
-    } else {
+    } else if (want_zero) {
         if (bdrv_unallocated_blocks_are_zero(bs)) {
             ret |= BDRV_BLOCK_ZERO;
         } else if (bs->backing) {
             BlockDriverState *bs2 = bs->backing->bs;
             int64_t nb_sectors2 = bdrv_nb_sectors(bs2);
+
             if (nb_sectors2 >= 0 && sector_num >= nb_sectors2) {
                 ret |= BDRV_BLOCK_ZERO;
             }
         }
     }
 
-    if (local_file && local_file != bs &&
+    if (want_zero && local_file && local_file != bs &&
         (ret & BDRV_BLOCK_DATA) && !(ret & BDRV_BLOCK_ZERO) &&
         (ret & BDRV_BLOCK_OFFSET_VALID)) {
         int file_pnum;
 
-        ret2 = bdrv_co_get_block_status(local_file, ret >> BDRV_SECTOR_BITS,
+        ret2 = bdrv_co_get_block_status(local_file, want_zero,
+                                        ret >> BDRV_SECTOR_BITS,
                                         *pnum, &file_pnum, NULL);
         if (ret2 >= 0) {
             /* Ignore errors.  This is just providing extra information, it
@@ -XXX,XX +XXX,XX @@ early_out:
 
 static int64_t coroutine_fn bdrv_co_get_block_status_above(BlockDriverState *bs,
         BlockDriverState *base,
+        bool want_zero,
         int64_t sector_num,
         int nb_sectors,
         int *pnum,
@@ -XXX,XX +XXX,XX @@ static int64_t coroutine_fn bdrv_co_get_block_status_above(BlockDriverState *bs,
 
     assert(bs != base);
     for (p = bs; p != base; p = backing_bs(p)) {
-        ret = bdrv_co_get_block_status(p, sector_num, nb_sectors, pnum, file);
+        ret = bdrv_co_get_block_status(p, want_zero, sector_num, nb_sectors,
+                                       pnum, file);
         if (ret < 0) {
             break;
         }
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn bdrv_get_block_status_above_co_entry(void *opaque)
     BdrvCoGetBlockStatusData *data = opaque;
 
     data->ret = bdrv_co_get_block_status_above(data->bs, data->base,
+                                               data->want_zero,
                                                data->sector_num,
                                                data->nb_sectors,
                                                data->pnum,
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn bdrv_get_block_status_above_co_entry(void *opaque)
  *
  * See bdrv_co_get_block_status_above() for details.
  */
-int64_t bdrv_get_block_status_above(BlockDriverState *bs,
-                                    BlockDriverState *base,
-                                    int64_t sector_num,
-                                    int nb_sectors, int *pnum,
-                                    BlockDriverState **file)
+static int64_t bdrv_common_block_status_above(BlockDriverState *bs,
+                                              BlockDriverState *base,
+                                              bool want_zero,
+                                              int64_t sector_num,
+                                              int nb_sectors, int *pnum,
+                                              BlockDriverState **file)
 {
     Coroutine *co;
     BdrvCoGetBlockStatusData data = {
         .bs = bs,
         .base = base,
-        .file = file,
+        .want_zero = want_zero,
         .sector_num = sector_num,
         .nb_sectors = nb_sectors,
         .pnum = pnum,
+        .file = file,
         .done = false,
     };
 
@@ -XXX,XX +XXX,XX @@ int64_t bdrv_get_block_status_above(BlockDriverState *bs,
     return data.ret;
 }
 
+int64_t bdrv_get_block_status_above(BlockDriverState *bs,
+                                    BlockDriverState *base,
+                                    int64_t sector_num,
+                                    int nb_sectors, int *pnum,
+                                    BlockDriverState **file)
+{
+    return bdrv_common_block_status_above(bs, base, true, sector_num,
+                                          nb_sectors, pnum, file);
+}
+
 int64_t bdrv_get_block_status(BlockDriverState *bs,
                               int64_t sector_num,
                               int nb_sectors, int *pnum,
@@ -XXX,XX +XXX,XX @@ int64_t bdrv_get_block_status(BlockDriverState *bs,
 int coroutine_fn bdrv_is_allocated(BlockDriverState *bs, int64_t offset,
                                    int64_t bytes, int64_t *pnum)
 {
-    int64_t sector_num = offset >> BDRV_SECTOR_BITS;
-    int nb_sectors = bytes >> BDRV_SECTOR_BITS;
     int64_t ret;
     int psectors;
 
     assert(QEMU_IS_ALIGNED(offset, BDRV_SECTOR_SIZE));
     assert(QEMU_IS_ALIGNED(bytes, BDRV_SECTOR_SIZE) && bytes < INT_MAX);
-    ret = bdrv_get_block_status(bs, sector_num, nb_sectors, &psectors,
-                                NULL);
+    ret = bdrv_common_block_status_above(bs, backing_bs(bs), false,
+                                         offset >> BDRV_SECTOR_BITS,
+                                         bytes >> BDRV_SECTOR_BITS, &psectors,
+                                         NULL);
     if (ret < 0) {
         return ret;
     }
-- 
2.13.6

From: Eric Blake <eblake@redhat.com>

In the process of converting sector-based interfaces to bytes,
I'm finding it easier to represent a byte count as a 64-bit
integer at the block layer.  Even if we are internally capped
by SIZE_MAX or even INT_MAX for individual transactions, it's
still nicer to not have to worry about truncation/overflow
issues on as many variables.  Update the signature of
bdrv_round_to_clusters() to uniformly use int64_t, matching
the signature already chosen for bdrv_is_allocated() and the
fact that off_t is also a signed type, then adjust clients
to cope with the fallout.  Even where the result could now
exceed 32 bits, no client directly assigns the result into a
32-bit value without breaking things into a loop first.

Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 include/block/block.h | 4 ++--
 block/io.c            | 6 +++---
 block/mirror.c        | 7 +++----
 block/trace-events    | 2 +-
 4 files changed, 9 insertions(+), 10 deletions(-)
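
Reviewer note (not part of the commit): a minimal sketch of rounding a
region to cluster boundaries with int64_t throughout.
toy_round_to_clusters() is a hypothetical stand-in, and it takes the
cluster size as a parameter rather than querying the driver as the
real function does.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for bdrv_round_to_clusters(): 'cluster_size' is
 * passed in directly instead of being looked up from the driver.
 */
static void toy_round_to_clusters(int64_t cluster_size,
                                  int64_t offset, int64_t bytes,
                                  int64_t *cluster_offset,
                                  int64_t *cluster_bytes)
{
    int64_t end;

    assert(cluster_size > 0 && offset >= 0 && bytes >= 0);
    /* Round the start down and the end up to cluster boundaries. */
    *cluster_offset = offset - offset % cluster_size;
    end = offset + bytes;
    end += (cluster_size - end % cluster_size) % cluster_size;
    *cluster_bytes = end - *cluster_offset;
}
```

With 64 KiB clusters, a 5 GiB request starting one byte past a cluster
boundary rounds to 5 GiB + 64 KiB, which would not fit in the old
unsigned int output.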

diff --git a/include/block/block.h b/include/block/block.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -XXX,XX +XXX,XX @@ int bdrv_get_flags(BlockDriverState *bs);
 int bdrv_get_info(BlockDriverState *bs, BlockDriverInfo *bdi);
 ImageInfoSpecific *bdrv_get_specific_info(BlockDriverState *bs);
 void bdrv_round_to_clusters(BlockDriverState *bs,
-                            int64_t offset, unsigned int bytes,
+                            int64_t offset, int64_t bytes,
                             int64_t *cluster_offset,
-                            unsigned int *cluster_bytes);
+                            int64_t *cluster_bytes);
 
 const char *bdrv_get_encrypted_filename(BlockDriverState *bs);
 void bdrv_get_backing_filename(BlockDriverState *bs,
diff --git a/block/io.c b/block/io.c
index XXXXXXX..XXXXXXX 100644
--- a/block/io.c
+++ b/block/io.c
@@ -XXX,XX +XXX,XX @@ static void mark_request_serialising(BdrvTrackedRequest *req, uint64_t align)
  * Round a region to cluster boundaries
  */
 void bdrv_round_to_clusters(BlockDriverState *bs,
-                            int64_t offset, unsigned int bytes,
+                            int64_t offset, int64_t bytes,
                             int64_t *cluster_offset,
-                            unsigned int *cluster_bytes)
+                            int64_t *cluster_bytes)
 {
     BlockDriverInfo bdi;
 
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn bdrv_co_do_copy_on_readv(BdrvChild *child,
     struct iovec iov;
     QEMUIOVector local_qiov;
     int64_t cluster_offset;
-    unsigned int cluster_bytes;
+    int64_t cluster_bytes;
     size_t skip_bytes;
     int ret;
     int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
diff --git a/block/mirror.c b/block/mirror.c
index XXXXXXX..XXXXXXX 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -XXX,XX +XXX,XX @@ static int mirror_cow_align(MirrorBlockJob *s, int64_t *offset,
     bool need_cow;
     int ret = 0;
     int64_t align_offset = *offset;
-    unsigned int align_bytes = *bytes;
+    int64_t align_bytes = *bytes;
     int max_bytes = s->granularity * s->max_iov;
 
-    assert(*bytes < INT_MAX);
     need_cow = !test_bit(*offset / s->granularity, s->cow_bitmap);
     need_cow |= !test_bit((*offset + *bytes - 1) / s->granularity,
                           s->cow_bitmap);
@@ -XXX,XX +XXX,XX @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
     while (nb_chunks > 0 && offset < s->bdev_length) {
         int64_t ret;
         int io_sectors;
-        unsigned int io_bytes;
+        int64_t io_bytes;
         int64_t io_bytes_acct;
         enum MirrorMethod {
             MIRROR_METHOD_COPY,
@@ -XXX,XX +XXX,XX @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
             io_bytes = s->granularity;
         } else if (ret >= 0 && !(ret & BDRV_BLOCK_DATA)) {
             int64_t target_offset;
-            unsigned int target_bytes;
+            int64_t target_bytes;
             bdrv_round_to_clusters(blk_bs(s->target), offset, io_bytes,
                                    &target_offset, &target_bytes);
             if (target_offset == offset &&
diff --git a/block/trace-events b/block/trace-events
index XXXXXXX..XXXXXXX 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -XXX,XX +XXX,XX @@ blk_co_pwritev(void *blk, void *bs, int64_t offset, unsigned int bytes, int flag
 bdrv_co_preadv(void *bs, int64_t offset, int64_t nbytes, unsigned int flags) "bs %p offset %"PRId64" nbytes %"PRId64" flags 0x%x"
 bdrv_co_pwritev(void *bs, int64_t offset, int64_t nbytes, unsigned int flags) "bs %p offset %"PRId64" nbytes %"PRId64" flags 0x%x"
 bdrv_co_pwrite_zeroes(void *bs, int64_t offset, int count, int flags) "bs %p offset %"PRId64" count %d flags 0x%x"
-bdrv_co_do_copy_on_readv(void *bs, int64_t offset, unsigned int bytes, int64_t cluster_offset, unsigned int cluster_bytes) "bs %p offset %"PRId64" bytes %u cluster_offset %"PRId64" cluster_bytes %u"
+bdrv_co_do_copy_on_readv(void *bs, int64_t offset, unsigned int bytes, int64_t cluster_offset, int64_t cluster_bytes) "bs %p offset %"PRId64" bytes %u cluster_offset %"PRId64" cluster_bytes %"PRId64
 
 # block/stream.c
 stream_one_iteration(void *s, int64_t offset, uint64_t bytes, int is_allocated) "s %p offset %" PRId64 " bytes %" PRIu64 " is_allocated %d"
-- 
2.13.6

From: Eric Blake <eblake@redhat.com>

We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based.  Convert another internal
function (no semantic change), and rename it to is_zero() in the
process.

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2.c | 33 +++++++++++++++++++--------------
 1 file changed, 19 insertions(+), 14 deletions(-)
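
Reviewer note (not part of the commit): the widen-then-clamp step in
is_zero() can be sketched in isolation.  toy_widen_to_sectors() and
TOY_SECTOR_SIZE are hypothetical names, and the image length is
assumed to be sector-aligned, as total_sectors * BDRV_SECTOR_SIZE is.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_SECTOR_SIZE 512  /* mirrors BDRV_SECTOR_SIZE */

/*
 * Widen [offset, offset + bytes) to sector boundaries, then clamp the
 * end to 'image_bytes'.  Returns the widened length and stores the
 * aligned start in *start.
 */
static int64_t toy_widen_to_sectors(int64_t offset, int64_t bytes,
                                    int64_t image_bytes, int64_t *start)
{
    int64_t end;

    assert(offset >= 0 && bytes >= 0 && offset <= image_bytes);
    assert(image_bytes % TOY_SECTOR_SIZE == 0);
    *start = offset - offset % TOY_SECTOR_SIZE;
    end = offset + bytes;
    end += (TOY_SECTOR_SIZE - end % TOY_SECTOR_SIZE) % TOY_SECTOR_SIZE;
    if (end > image_bytes) {
        end = image_bytes;
    }
    return end - *start;
}
```

An empty, aligned request stays empty after widening, which is the
case the "if (!bytes) return true" shortcut relies on.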

diff --git a/block/qcow2.c b/block/qcow2.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -XXX,XX +XXX,XX @@ finish:
 }
 
 
-static bool is_zero_sectors(BlockDriverState *bs, int64_t start,
-                            uint32_t count)
+static bool is_zero(BlockDriverState *bs, int64_t offset, int64_t bytes)
 {
     int nr;
     int64_t res;
+    int64_t start;
 
-    if (start + count > bs->total_sectors) {
-        count = bs->total_sectors - start;
+    /* TODO: Widening to sector boundaries should only be needed as
+     * long as we can't query finer granularity. */
+    start = QEMU_ALIGN_DOWN(offset, BDRV_SECTOR_SIZE);
+    bytes = QEMU_ALIGN_UP(offset + bytes, BDRV_SECTOR_SIZE) - start;
+
+    /* Clamp to image length, before checking status of underlying sectors */
+    if (start + bytes > bs->total_sectors * BDRV_SECTOR_SIZE) {
+        bytes = bs->total_sectors * BDRV_SECTOR_SIZE - start;
     }
 
-    if (!count) {
+    if (!bytes) {
         return true;
     }
-    res = bdrv_get_block_status_above(bs, NULL, start, count, &nr, NULL);
-    return res >= 0 && (res & BDRV_BLOCK_ZERO) && nr == count;
+    res = bdrv_get_block_status_above(bs, NULL, start >> BDRV_SECTOR_BITS,
+                                      bytes >> BDRV_SECTOR_BITS, &nr, NULL);
+    return res >= 0 && (res & BDRV_BLOCK_ZERO) &&
+        nr * BDRV_SECTOR_SIZE == bytes;
 }
 
 static coroutine_fn int qcow2_co_pwrite_zeroes(BlockDriverState *bs,
@@ -XXX,XX +XXX,XX @@ static coroutine_fn int qcow2_co_pwrite_zeroes(BlockDriverState *bs,
     }
 
     if (head || tail) {
-        int64_t cl_start = (offset - head) >> BDRV_SECTOR_BITS;
         uint64_t off;
         unsigned int nr;
 
         assert(head + bytes <= s->cluster_size);
 
         /* check whether remainder of cluster already reads as zero */
-        if (!(is_zero_sectors(bs, cl_start,
-                              DIV_ROUND_UP(head, BDRV_SECTOR_SIZE)) &&
-              is_zero_sectors(bs, (offset + bytes) >> BDRV_SECTOR_BITS,
-                              DIV_ROUND_UP(-tail & (s->cluster_size - 1),
-                                           BDRV_SECTOR_SIZE)))) {
+        if (!(is_zero(bs, offset - head, head) &&
+              is_zero(bs, offset + bytes,
+                      tail ? s->cluster_size - tail : 0))) {
             return -ENOTSUP;
         }
 
         qemu_co_mutex_lock(&s->lock);
         /* We can have new write after previous check */
-        offset = cl_start << BDRV_SECTOR_BITS;
+        offset = QEMU_ALIGN_DOWN(offset, s->cluster_size);
         bytes = s->cluster_size;
         nr = s->cluster_size;
         ret = qcow2_get_cluster_offset(bs, offset, &nr, &off);
-- 
2.13.6

From: Eric Blake <eblake@redhat.com>

We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based.  Change the internal
loop iteration of zeroing a device to track by bytes instead of
sectors (although we are still guaranteed that we iterate by steps
that are sector-aligned).

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/io.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)
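
Reviewer note (not part of the commit): a toy version of the
byte-tracking loop, assuming an in-memory image.  The names
toy_make_zero() and toy_sector_is_zero() are hypothetical.  The point
it illustrates is that a sector count 'n' advances a byte offset by
n * sector size (equivalently n << sector bits).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_SECTOR_BITS 9
#define TOY_SECTOR_SIZE (1 << TOY_SECTOR_BITS)

/* Is the sector starting at byte 'offset' entirely zero? */
static int toy_sector_is_zero(const uint8_t *img, int64_t offset)
{
    for (int i = 0; i < TOY_SECTOR_SIZE; i++) {
        if (img[offset + i]) {
            return 0;
        }
    }
    return 1;
}

/*
 * Zero a sector-aligned in-memory image while tracking progress in
 * bytes.  Returns the number of bytes that had to be written.
 */
static int64_t toy_make_zero(uint8_t *img, int64_t size)
{
    int64_t offset = 0, written = 0;

    assert(size % TOY_SECTOR_SIZE == 0);
    while (offset < size) {
        int zero = toy_sector_is_zero(img, offset);
        int64_t n = 1; /* sectors sharing the same status */

        while (offset + n * TOY_SECTOR_SIZE < size &&
               toy_sector_is_zero(img, offset + n * TOY_SECTOR_SIZE)
               == zero) {
            n++;
        }
        if (!zero) {
            memset(img + offset, 0, (size_t)(n * TOY_SECTOR_SIZE));
            written += n * TOY_SECTOR_SIZE;
        }
        /* n counts sectors; scale by the sector size to step in bytes. */
        offset += n * TOY_SECTOR_SIZE;
    }
    return written;
}
```

Running it twice on the same buffer writes nothing the second time,
since every run then already reads as zero.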

diff --git a/block/io.c b/block/io.c
index XXXXXXX..XXXXXXX 100644
--- a/block/io.c
+++ b/block/io.c
@@ -XXX,XX +XXX,XX @@ int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
  */
 int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags flags)
 {
-    int64_t target_sectors, ret, nb_sectors, sector_num = 0;
+    int64_t target_size, ret, bytes, offset = 0;
     BlockDriverState *bs = child->bs;
-    int n;
+    int n; /* sectors */
 
-    target_sectors = bdrv_nb_sectors(bs);
-    if (target_sectors < 0) {
-        return target_sectors;
+    target_size = bdrv_getlength(bs);
+    if (target_size < 0) {
+        return target_size;
     }
 
     for (;;) {
-        nb_sectors = MIN(target_sectors - sector_num, BDRV_REQUEST_MAX_SECTORS);
-        if (nb_sectors <= 0) {
+        bytes = MIN(target_size - offset, BDRV_REQUEST_MAX_BYTES);
+        if (bytes <= 0) {
             return 0;
         }
-        ret = bdrv_get_block_status(bs, sector_num, nb_sectors, &n, NULL);
+        ret = bdrv_get_block_status(bs, offset >> BDRV_SECTOR_BITS,
+                                    bytes >> BDRV_SECTOR_BITS, &n, NULL);
         if (ret < 0) {
-            error_report("error getting block status at sector %" PRId64 ": %s",
-                         sector_num, strerror(-ret));
+            error_report("error getting block status at offset %" PRId64 ": %s",
+                         offset, strerror(-ret));
             return ret;
         }
         if (ret & BDRV_BLOCK_ZERO) {
-            sector_num += n;
+            offset += n * BDRV_SECTOR_SIZE;
             continue;
         }
-        ret = bdrv_pwrite_zeroes(child, sector_num << BDRV_SECTOR_BITS,
-                                 n << BDRV_SECTOR_BITS, flags);
+        ret = bdrv_pwrite_zeroes(child, offset, n * BDRV_SECTOR_SIZE, flags);
         if (ret < 0) {
-            error_report("error writing zeroes at sector %" PRId64 ": %s",
-                         sector_num, strerror(-ret));
+            error_report("error writing zeroes at offset %" PRId64 ": %s",
+                         offset, strerror(-ret));
             return ret;
         }
-        sector_num += n;
+        offset += n * BDRV_SECTOR_SIZE;
     }
 }
 
-- 
2.13.6

From: Eric Blake <eblake@redhat.com>

We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based.  Continue by converting
an internal function (no semantic change), and simplifying its
caller accordingly.

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 qemu-img.c | 24 +++++++++++-------------
 1 file changed, 11 insertions(+), 13 deletions(-)
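
Reviewer note (not part of the commit): the per-iteration probe size
in img_map() can be sketched as a pure function.  toy_probe_bytes()
and the TOY_* constants are hypothetical names; the 1 GiB cap and the
round-down to a sector multiple follow the hunk below.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_SECTOR_SIZE 512               /* mirrors BDRV_SECTOR_SIZE */
#define TOY_PROBE_MAX (INT64_C(1) << 30)  /* probe up to 1 GiB at a time */

/* Bytes to probe at 'offset' in an image of 'length' bytes. */
static int64_t toy_probe_bytes(int64_t length, int64_t offset)
{
    int64_t n;

    assert(offset >= 0 && offset <= length);
    n = length - offset;
    if (n > TOY_PROBE_MAX) {
        n = TOY_PROBE_MAX;
    }
    /* Round down to a whole number of sectors. */
    return n - n % TOY_SECTOR_SIZE;
}
```

Note that in this sketch a remainder shorter than one sector rounds
down to zero; the patch text does not say whether the real caller can
ever be handed such a length.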

diff --git a/qemu-img.c b/qemu-img.c
index XXXXXXX..XXXXXXX 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -XXX,XX +XXX,XX @@ static void dump_map_entry(OutputFormat output_format, MapEntry *e,
     }
 }
 
-static int get_block_status(BlockDriverState *bs, int64_t sector_num,
-                            int nb_sectors, MapEntry *e)
+static int get_block_status(BlockDriverState *bs, int64_t offset,
+                            int64_t bytes, MapEntry *e)
 {
     int64_t ret;
     int depth;
     BlockDriverState *file;
     bool has_offset;
+    int nb_sectors = bytes >> BDRV_SECTOR_BITS;
 
+    assert(bytes < INT_MAX);
     /* As an optimization, we could cache the current range of unallocated
      * clusters in each file of the chain, and avoid querying the same
      * range repeatedly.
@@ -XXX,XX +XXX,XX @@ static int get_block_status(BlockDriverState *bs, int64_t sector_num,
 
     depth = 0;
     for (;;) {
-        ret = bdrv_get_block_status(bs, sector_num, nb_sectors, &nb_sectors,
-                                    &file);
+        ret = bdrv_get_block_status(bs, offset >> BDRV_SECTOR_BITS, nb_sectors,
+                                    &nb_sectors, &file);
         if (ret < 0) {
             return ret;
         }
@@ -XXX,XX +XXX,XX @@ static int get_block_status(BlockDriverState *bs, int64_t sector_num,
     has_offset = !!(ret & BDRV_BLOCK_OFFSET_VALID);
 
     *e = (MapEntry) {
-        .start = sector_num * BDRV_SECTOR_SIZE,
+        .start = offset,
         .length = nb_sectors * BDRV_SECTOR_SIZE,
         .data = !!(ret & BDRV_BLOCK_DATA),
         .zero = !!(ret & BDRV_BLOCK_ZERO),
@@ -XXX,XX +XXX,XX @@ static int img_map(int argc, char **argv)
 
     length = blk_getlength(blk);
     while (curr.start + curr.length < length) {
-        int64_t nsectors_left;
-        int64_t sector_num;
-        int n;
-
-        sector_num = (curr.start + curr.length) >> BDRV_SECTOR_BITS;
+        int64_t offset = curr.start + curr.length;
+        int64_t n;
 
         /* Probe up to 1 GiB at a time.  */
-        nsectors_left = DIV_ROUND_UP(length, BDRV_SECTOR_SIZE) - sector_num;
-        n = MIN(1 << (30 - BDRV_SECTOR_BITS), nsectors_left);
-        ret = get_block_status(bs, sector_num, n, &next);
+        n = QEMU_ALIGN_DOWN(MIN(1 << 30, length - offset), BDRV_SECTOR_SIZE);
+        ret = get_block_status(bs, offset, n, &next);
 
         if (ret < 0) {
             error_report("Could not read file metadata: %s", strerror(-ret));
-- 
2.13.6

From: Eric Blake <eblake@redhat.com>

We are gradually moving away from sector-based interfaces, towards
byte-based.  In the common case, allocation is unlikely to ever use
values that are not naturally sector-aligned, but it is possible
that byte-based values will let us be more precise about allocation
at the end of an unaligned file that can do byte-based access.

Changing the name of the function from bdrv_get_block_status() to
bdrv_block_status() ensures that the compiler enforces that all
callers are updated.  For now, the io.c layer still assert()s that
all callers are sector-aligned, but that can be relaxed when a later
patch implements byte-based block status in the drivers.

There was an inherent limitation in returning the offset via the
return value: we only have room for BDRV_BLOCK_OFFSET_MASK bits, which
means an offset can only be mapped for sector-aligned queries (unless
we declare that non-aligned input is at the same relative position
modulo 512 of the answer).  The new interface therefore returns the
offset through an output parameter by reference rather than mashed
into the return value.  We'll have some glue code that munges between
the two styles until we finish converting all uses.

For the most part this patch is just the addition of scaling at the
callers followed by inverse scaling at bdrv_block_status(), coupled
with the tweak in calling convention.  But some code, particularly
bdrv_is_allocated(), gets a lot simpler because it no longer has to
mess with sectors.

For ease of review, bdrv_get_block_status_above() will be tackled
separately.

Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 include/block/block.h | 17 +++++++++--------
 block/io.c            | 47 ++++++++++++++++++++++++++++++++++-------------
 block/qcow2-cluster.c |  2 +-
 qemu-img.c            | 25 ++++++++++++++-----------
 4 files changed, 58 insertions(+), 33 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -XXX,XX +XXX,XX @@ typedef struct HDGeometry {
 #define BDRV_REQUEST_MAX_BYTES (BDRV_REQUEST_MAX_SECTORS << BDRV_SECTOR_BITS)
 
 /*
- * Allocation status flags for bdrv_get_block_status() and friends.
+ * Allocation status flags for bdrv_block_status() and friends.
  *
  * Public flags:
  * BDRV_BLOCK_DATA: allocation for data at offset is tied to this layer
@@ -XXX,XX +XXX,XX @@ typedef struct HDGeometry {
  *                 that the block layer recompute the answer from the returned
  *                 BDS; must be accompanied by just BDRV_BLOCK_OFFSET_VALID.
  *
- * If BDRV_BLOCK_OFFSET_VALID is set, bits 9-62 (BDRV_BLOCK_OFFSET_MASK)
- * represent the offset in the returned BDS that is allocated for the
- * corresponding raw data; however, whether that offset actually contains
- * data also depends on BDRV_BLOCK_DATA and BDRV_BLOCK_ZERO, as follows:
+ * If BDRV_BLOCK_OFFSET_VALID is set, bits 9-62 (BDRV_BLOCK_OFFSET_MASK) of
+ * the return value (old interface) or the entire map parameter (new
+ * interface) represent the offset in the returned BDS that is allocated for
+ * the corresponding raw data.  However, whether that offset actually
+ * contains data also depends on BDRV_BLOCK_DATA, as follows:
  *
  * DATA ZERO OFFSET_VALID
  *  t    t        t       sectors read as zero, returned file is zero at offset
@@ -XXX,XX +XXX,XX @@ int bdrv_has_zero_init_1(BlockDriverState *bs);
 int bdrv_has_zero_init(BlockDriverState *bs);
 bool bdrv_unallocated_blocks_are_zero(BlockDriverState *bs);
 bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs);
-int64_t bdrv_get_block_status(BlockDriverState *bs, int64_t sector_num,
-                              int nb_sectors, int *pnum,
-                              BlockDriverState **file);
+int bdrv_block_status(BlockDriverState *bs, int64_t offset,
+                      int64_t bytes, int64_t *pnum, int64_t *map,
+                      BlockDriverState **file);
 int64_t bdrv_get_block_status_above(BlockDriverState *bs,
                                     BlockDriverState *base,
                                     int64_t sector_num,
diff --git a/block/io.c b/block/io.c
index XXXXXXX..XXXXXXX 100644
--- a/block/io.c
+++ b/block/io.c
@@ -XXX,XX +XXX,XX @@ int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
  */
 int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags flags)
 {
-    int64_t target_size, ret, bytes, offset = 0;
+    int ret;
+    int64_t target_size, bytes, offset = 0;
     BlockDriverState *bs = child->bs;
-    int n; /* sectors */
 
     target_size = bdrv_getlength(bs);
     if (target_size < 0) {
@@ -XXX,XX +XXX,XX @@ int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags flags)
         if (bytes <= 0) {
             return 0;
         }
-        ret = bdrv_get_block_status(bs, offset >> BDRV_SECTOR_BITS,
-                                    bytes >> BDRV_SECTOR_BITS, &n, NULL);
+        ret = bdrv_block_status(bs, offset, bytes, &bytes, NULL, NULL);
         if (ret < 0) {
             error_report("error getting block status at offset %" PRId64 ": %s",
                          offset, strerror(-ret));
             return ret;
         }
         if (ret & BDRV_BLOCK_ZERO) {
-            offset += n * BDRV_SECTOR_BITS;
+            offset += bytes;
             continue;
         }
-        ret = bdrv_pwrite_zeroes(child, offset, n * BDRV_SECTOR_SIZE, flags);
+        ret = bdrv_pwrite_zeroes(child, offset, bytes, flags);
         if (ret < 0) {
             error_report("error writing zeroes at offset %" PRId64 ": %s",
                          offset, strerror(-ret));
             return ret;
         }
-        offset += n * BDRV_SECTOR_SIZE;
+        offset += bytes;
     }
 }
 
@@ -XXX,XX +XXX,XX @@ int64_t bdrv_get_block_status_above(BlockDriverState *bs,
                                           nb_sectors, pnum, file);
 }
 
-int64_t bdrv_get_block_status(BlockDriverState *bs,
-                              int64_t sector_num,
-                              int nb_sectors, int *pnum,
-                              BlockDriverState **file)
+int bdrv_block_status(BlockDriverState *bs, int64_t offset, int64_t bytes,
+                      int64_t *pnum, int64_t *map, BlockDriverState **file)
 {
-    return bdrv_get_block_status_above(bs, backing_bs(bs),
-                                       sector_num, nb_sectors, pnum, file);
+    int64_t ret;
+    int n;
+
+    assert(QEMU_IS_ALIGNED(offset | bytes, BDRV_SECTOR_SIZE));
+    assert(pnum);
+    /*
+     * The contract allows us to return pnum smaller than bytes, even
+     * if the next query would see the same status; we truncate the
+     * request to avoid overflowing the driver's 32-bit interface.
+     */
+    bytes = MIN(bytes, BDRV_REQUEST_MAX_BYTES);
+    ret = bdrv_get_block_status_above(bs, backing_bs(bs),
+                                      offset >> BDRV_SECTOR_BITS,
+                                      bytes >> BDRV_SECTOR_BITS, &n, file);
+    if (ret < 0) {
+        assert(INT_MIN <= ret);
+        *pnum = 0;
+        return ret;
+    }
+    *pnum = n * BDRV_SECTOR_SIZE;
+    if (map) {
+        *map = ret & BDRV_BLOCK_OFFSET_MASK;
+    } else {
+        ret &= ~BDRV_BLOCK_OFFSET_VALID;
+    }
+    return ret & ~BDRV_BLOCK_OFFSET_MASK;
 }
 
 int coroutine_fn bdrv_is_allocated(BlockDriverState *bs, int64_t offset,
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -XXX,XX +XXX,XX @@ static int discard_single_l2(BlockDriverState *bs, uint64_t offset,
          * cluster is already marked as zero, or if it's unallocated and we
          * don't have a backing file.
          *
-         * TODO We might want to use bdrv_get_block_status(bs) here, but we're
+         * TODO We might want to use bdrv_block_status(bs) here, but we're
          * holding s->lock, so that doesn't work today.
          *
          * If full_discard is true, the sector should not read back as zeroes,
diff --git a/qemu-img.c b/qemu-img.c
index XXXXXXX..XXXXXXX 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -XXX,XX +XXX,XX @@ static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num)
 
     if (s->sector_next_status <= sector_num) {
         if (s->target_has_backing) {
-            ret = bdrv_get_block_status(blk_bs(s->src[src_cur]),
-                                        sector_num - src_cur_offset,
-                                        n, &n, NULL);
+            int64_t count = n * BDRV_SECTOR_SIZE;
+
+            ret = bdrv_block_status(blk_bs(s->src[src_cur]),
+                                    (sector_num - src_cur_offset) *
+                                    BDRV_SECTOR_SIZE,
+                                    count, &count, NULL, NULL);
+            assert(ret < 0 || QEMU_IS_ALIGNED(count, BDRV_SECTOR_SIZE));
+            n = count >> BDRV_SECTOR_BITS;
         } else {
             ret = bdrv_get_block_status_above(blk_bs(s->src[src_cur]), NULL,
                                               sector_num - src_cur_offset,
@@ -XXX,XX +XXX,XX @@ static void dump_map_entry(OutputFormat output_format, MapEntry *e,
 static int get_block_status(BlockDriverState *bs, int64_t offset,
                             int64_t bytes, MapEntry *e)
 {
-    int64_t ret;
+    int ret;
     int depth;
     BlockDriverState *file;
     bool has_offset;
-    int nb_sectors = bytes >> BDRV_SECTOR_BITS;
+    int64_t map;
 
-    assert(bytes < INT_MAX);
     /* As an optimization, we could cache the current range of unallocated
      * clusters in each file of the chain, and avoid querying the same
      * range repeatedly.
@@ -XXX,XX +XXX,XX @@ static int get_block_status(BlockDriverState *bs, int64_t offset,
 
     depth = 0;
     for (;;) {
-        ret = bdrv_get_block_status(bs, offset >> BDRV_SECTOR_BITS, nb_sectors,
-                                    &nb_sectors, &file);
+        ret = bdrv_block_status(bs, offset, bytes, &bytes, &map, &file);
         if (ret < 0) {
             return ret;
         }
-        assert(nb_sectors);
+        assert(bytes);
         if (ret & (BDRV_BLOCK_ZERO|BDRV_BLOCK_DATA)) {
             break;
         }
@@ -XXX,XX +XXX,XX @@ static int get_block_status(BlockDriverState *bs, int64_t offset,
 
     *e = (MapEntry) {
         .start = offset,
-        .length = nb_sectors * BDRV_SECTOR_SIZE,
+        .length = bytes,
         .data = !!(ret & BDRV_BLOCK_DATA),
         .zero = !!(ret & BDRV_BLOCK_ZERO),
-        .offset = ret & BDRV_BLOCK_OFFSET_MASK,
+        .offset = map,
         .has_offset = has_offset,
         .depth = depth,
         .has_filename = file && has_offset,
-- 
2.13.6

From: Eric Blake <eblake@redhat.com>

We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based.  Convert another internal
function (no semantic change); and as with its public counterpart,
rename to bdrv_co_block_status() and split the offset return, to
make the compiler enforce that we catch all uses.  For now, we
assert that callers and the return value still use aligned data.
Ultimately, though, this will be the function where we hand off to a
byte-based driver callback; at that point we will need to add logic
that rounds calls to the driver's request_alignment and then touches
up the result handed back to the caller, so that callers can start
passing unaligned offsets.
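
One possible shape for that future rounding, sketched under the
assumption that the alignment is a power of two and that the driver
answers in non-zero multiples of it.  Both helpers are hypothetical and
are not part of this patch, which still asserts on alignment instead.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Widen [offset, offset + bytes) outward to multiples of 'align'
 * (assumed to be a power of two).
 */
static void align_request(int64_t offset, int64_t bytes, int64_t align,
                          int64_t *aligned_offset, int64_t *aligned_bytes)
{
    int64_t end = offset + bytes;

    assert(align > 0 && (align & (align - 1)) == 0);
    *aligned_offset = offset & ~(align - 1);
    *aligned_bytes = ((end + align - 1) & ~(align - 1)) - *aligned_offset;
}

/*
 * Trim a driver answer that covers [aligned_offset, aligned_offset + pnum)
 * back to the range the caller actually asked about.
 */
static int64_t trim_pnum(int64_t offset, int64_t bytes,
                         int64_t aligned_offset, int64_t pnum)
{
    int64_t n = aligned_offset + pnum - offset;

    assert(n > 0);
    return n > bytes ? bytes : n;
}
```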

Note that we are now prepared to accept 'bytes' larger than INT_MAX;
this is okay as long as we clamp things internally before violating
any 32-bit limits, and makes no difference to how a client will
use the information (clients looping over the entire file must
already be prepared for consecutive calls to return the same status,
as drivers are already free to return shorter-than-maximal status
due to any other convenient split points, such as when the L2 table
crosses cluster boundaries in qcow2).
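
A minimal sketch of why the clamping is harmless to clients: a caller
that walks the whole image and merges consecutive answers with the same
flags sees the same extents no matter how short each individual answer
is.  The mock status function, the flag values and the CHUNK limit are
all assumptions made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define FLAG_DATA 0x01
#define FLAG_ZERO 0x02
#define CHUNK     4096   /* hypothetical per-call limit of the mock driver */

/* Mock status query: data below 'split', zeroes from 'split' to 'size'. */
static int mock_block_status(int64_t size, int64_t split, int64_t offset,
                             int64_t bytes, int64_t *pnum)
{
    int64_t end;
    int flags;

    assert(offset >= 0 && offset < size && bytes > 0);
    flags = offset < split ? FLAG_DATA : FLAG_ZERO;
    end = offset < split ? split : size;
    if (bytes > end - offset) {
        bytes = end - offset;
    }
    if (bytes > CHUNK) {
        bytes = CHUNK;   /* shorter-than-maximal answer */
    }
    *pnum = bytes;
    return flags;
}

/* Walk the whole image, merging consecutive answers with equal flags. */
static int count_extents(int64_t size, int64_t split)
{
    int64_t offset = 0;
    int64_t pnum = 0;
    int extents = 0;
    int prev = -1;

    while (offset < size) {
        int flags = mock_block_status(size, split, offset, size - offset,
                                      &pnum);
        assert(pnum > 0);
        if (flags != prev) {
            extents++;
            prev = flags;
        }
        offset += pnum;
    }
    return extents;
}
```

The loop passes the full remaining length on every call and relies only
on pnum being non-zero, so it works whether the answer is truncated by
a 32-bit limit or by any other split point.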

Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/io.c | 124 ++++++++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 81 insertions(+), 43 deletions(-)

diff --git a/block/io.c b/block/io.c
index XXXXXXX..XXXXXXX 100644
--- a/block/io.c
+++ b/block/io.c
@@ -XXX,XX +XXX,XX @@ int64_t coroutine_fn bdrv_co_get_block_status_from_backing(BlockDriverState *bs,
  * BDRV_BLOCK_ZERO where possible; otherwise, the result may omit those
  * bits particularly if it allows for a larger value in 'pnum'.
  *
- * If 'sector_num' is beyond the end of the disk image the return value is
+ * If 'offset' is beyond the end of the disk image the return value is
  * BDRV_BLOCK_EOF and 'pnum' is set to 0.
  *
- * 'pnum' is set to the number of sectors (including and immediately following
- * the specified sector) that are known to be in the same
- * allocated/unallocated state.
- *
- * 'nb_sectors' is the max value 'pnum' should be set to.  If nb_sectors goes
+ * 'bytes' is the max value 'pnum' should be set to.  If bytes goes
  * beyond the end of the disk image it will be clamped; if 'pnum' is set to
  * the end of the image, then the returned value will include BDRV_BLOCK_EOF.
  *
- * If returned value is positive, BDRV_BLOCK_OFFSET_VALID bit is set, and
- * 'file' is non-NULL, then '*file' points to the BDS which the sector range
- * is allocated in.
+ * 'pnum' is set to the number of bytes (including and immediately
+ * following the specified offset) that are easily known to be in the
+ * same allocated/unallocated state.  Note that a second call starting
+ * at the original offset plus returned pnum may have the same status.
+ * The returned value is non-zero on success except at end-of-file.
+ *
+ * Returns negative errno on failure.  Otherwise, if the
+ * BDRV_BLOCK_OFFSET_VALID bit is set, 'map' and 'file' (if non-NULL) are
+ * set to the host mapping and BDS corresponding to the guest offset.
  */
-static int64_t coroutine_fn bdrv_co_get_block_status(BlockDriverState *bs,
-                                                     bool want_zero,
-                                                     int64_t sector_num,
-                                                     int nb_sectors, int *pnum,
-                                                     BlockDriverState **file)
-{
-    int64_t total_sectors;
-    int64_t n;
-    int64_t ret, ret2;
+static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
+                                             bool want_zero,
+                                             int64_t offset, int64_t bytes,
+                                             int64_t *pnum, int64_t *map,
+                                             BlockDriverState **file)
+{
+    int64_t total_size;
+    int64_t n; /* bytes */
+    int64_t ret;
+    int64_t local_map = 0;
     BlockDriverState *local_file = NULL;
+    int count; /* sectors */
 
     assert(pnum);
     *pnum = 0;
-    total_sectors = bdrv_nb_sectors(bs);
-    if (total_sectors < 0) {
-        ret = total_sectors;
+    total_size = bdrv_getlength(bs);
+    if (total_size < 0) {
+        ret = total_size;
         goto early_out;
     }
 
-    if (sector_num >= total_sectors) {
+    if (offset >= total_size) {
         ret = BDRV_BLOCK_EOF;
         goto early_out;
     }
-    if (!nb_sectors) {
+    if (!bytes) {
         ret = 0;
         goto early_out;
     }
 
-    n = total_sectors - sector_num;
-    if (n < nb_sectors) {
-        nb_sectors = n;
+    n = total_size - offset;
+    if (n < bytes) {
+        bytes = n;
     }
 
     if (!bs->drv->bdrv_co_get_block_status) {
-        *pnum = nb_sectors;
+        *pnum = bytes;
         ret = BDRV_BLOCK_DATA | BDRV_BLOCK_ALLOCATED;
-        if (sector_num + nb_sectors == total_sectors) {
+        if (offset + bytes == total_size) {
             ret |= BDRV_BLOCK_EOF;
         }
         if (bs->drv->protocol_name) {
-            ret |= BDRV_BLOCK_OFFSET_VALID | (sector_num * BDRV_SECTOR_SIZE);
+            ret |= BDRV_BLOCK_OFFSET_VALID;
+            local_map = offset;
             local_file = bs;
         }
         goto early_out;
     }
 
     bdrv_inc_in_flight(bs);
-    ret = bs->drv->bdrv_co_get_block_status(bs, sector_num, nb_sectors, pnum,
+    /*
+     * TODO: Rather than require aligned offsets, we could instead
+     * round to the driver's request_alignment here, then touch up
+     * count afterwards back to the caller's expectations.
+     */
+    assert(QEMU_IS_ALIGNED(offset | bytes, BDRV_SECTOR_SIZE));
+    /*
+     * The contract allows us to return pnum smaller than bytes, even
+     * if the next query would see the same status; we truncate the
+     * request to avoid overflowing the driver's 32-bit interface.
+     */
+    bytes = MIN(bytes, BDRV_REQUEST_MAX_BYTES);
+    ret = bs->drv->bdrv_co_get_block_status(bs, offset >> BDRV_SECTOR_BITS,
+                                            bytes >> BDRV_SECTOR_BITS, &count,
                                             &local_file);
     if (ret < 0) {
-        *pnum = 0;
         goto out;
     }
+    if (ret & BDRV_BLOCK_OFFSET_VALID) {
+        local_map = ret & BDRV_BLOCK_OFFSET_MASK;
+    }
+    *pnum = count * BDRV_SECTOR_SIZE;
 
     if (ret & BDRV_BLOCK_RAW) {
         assert(ret & BDRV_BLOCK_OFFSET_VALID && local_file);
-        ret = bdrv_co_get_block_status(local_file, want_zero,
-                                       ret >> BDRV_SECTOR_BITS,
-                                       *pnum, pnum, &local_file);
+        ret = bdrv_co_block_status(local_file, want_zero, local_map,
+                                   *pnum, pnum, &local_map, &local_file);
+        assert(ret < 0 ||
+               QEMU_IS_ALIGNED(*pnum | local_map, BDRV_SECTOR_SIZE));
         goto out;
     }
 
@@ -XXX,XX +XXX,XX @@ static int64_t coroutine_fn bdrv_co_get_block_status(BlockDriverState *bs,
             ret |= BDRV_BLOCK_ZERO;
         } else if (bs->backing) {
             BlockDriverState *bs2 = bs->backing->bs;
-            int64_t nb_sectors2 = bdrv_nb_sectors(bs2);
+            int64_t size2 = bdrv_getlength(bs2);
 
-            if (nb_sectors2 >= 0 && sector_num >= nb_sectors2) {
+            if (size2 >= 0 && offset >= size2) {
                 ret |= BDRV_BLOCK_ZERO;
             }
         }
@@ -XXX,XX +XXX,XX @@ static int64_t coroutine_fn bdrv_co_get_block_status(BlockDriverState *bs,
     if (want_zero && local_file && local_file != bs &&
         (ret & BDRV_BLOCK_DATA) && !(ret & BDRV_BLOCK_ZERO) &&
         (ret & BDRV_BLOCK_OFFSET_VALID)) {
-        int file_pnum;
+        int64_t file_pnum;
+        int ret2;
 
-        ret2 = bdrv_co_get_block_status(local_file, want_zero,
-                                        ret >> BDRV_SECTOR_BITS,
-                                        *pnum, &file_pnum, NULL);
+        ret2 = bdrv_co_block_status(local_file, want_zero, local_map,
+                                    *pnum, &file_pnum, NULL, NULL);
         if (ret2 >= 0) {
             /* Ignore errors.  This is just providing extra information, it
              * is useful but not necessary.
@@ -XXX,XX +XXX,XX @@ static int64_t coroutine_fn bdrv_co_get_block_status(BlockDriverState *bs,
 
 out:
     bdrv_dec_in_flight(bs);
-    if (ret >= 0 && sector_num + *pnum == total_sectors) {
+    if (ret >= 0 && offset + *pnum == total_size) {
         ret |= BDRV_BLOCK_EOF;
     }
 early_out:
     if (file) {
         *file = local_file;
     }
+    if (map) {
+        *map = local_map;
+    }
+    if (ret >= 0) {
+        ret &= ~BDRV_BLOCK_OFFSET_MASK;
+    } else {
+        assert(INT_MIN <= ret);
+    }
     return ret;
 }
 
@@ -XXX,XX +XXX,XX @@ static int64_t coroutine_fn bdrv_co_get_block_status_above(BlockDriverState *bs,
     BlockDriverState *p;
     int64_t ret = 0;
     bool first = true;
+    int64_t map = 0;
 
     assert(bs != base);
     for (p = bs; p != base; p = backing_bs(p)) {
-        ret = bdrv_co_get_block_status(p, want_zero, sector_num, nb_sectors,
-                                       pnum, file);
+        int64_t count;
+
+        ret = bdrv_co_block_status(p, want_zero,
+                                   sector_num * BDRV_SECTOR_SIZE,
+                                   nb_sectors * BDRV_SECTOR_SIZE, &count,
+                                   &map, file);
         if (ret < 0) {
             break;
         }
+        assert(QEMU_IS_ALIGNED(count | map, BDRV_SECTOR_SIZE));
+        ret |= map;
+        *pnum = count >> BDRV_SECTOR_BITS;
         if (ret & BDRV_BLOCK_ZERO && ret & BDRV_BLOCK_EOF && !first) {
             /*
              * Reading beyond the end of the file continues to read
-- 
2.13.6

From: Eric Blake <eblake@redhat.com>

We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based.  Convert another internal
type (no semantic change), and rename it to match the corresponding
public function rename.

Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/io.c | 56 ++++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 38 insertions(+), 18 deletions(-)

diff --git a/block/io.c b/block/io.c
index XXXXXXX..XXXXXXX 100644
--- a/block/io.c
+++ b/block/io.c
@@ -XXX,XX +XXX,XX @@ int bdrv_flush_all(void)
 }
 
 
-typedef struct BdrvCoGetBlockStatusData {
+typedef struct BdrvCoBlockStatusData {
     BlockDriverState *bs;
     BlockDriverState *base;
     bool want_zero;
-    int64_t sector_num;
-    int nb_sectors;
-    int *pnum;
+    int64_t offset;
+    int64_t bytes;
+    int64_t *pnum;
+    int64_t *map;
     BlockDriverState **file;
-    int64_t ret;
+    int ret;
     bool done;
-} BdrvCoGetBlockStatusData;
+} BdrvCoBlockStatusData;
 
 int64_t coroutine_fn bdrv_co_get_block_status_from_file(BlockDriverState *bs,
                                                         int64_t sector_num,
@@ -XXX,XX +XXX,XX @@ static int64_t coroutine_fn bdrv_co_get_block_status_above(BlockDriverState *bs,
 /* Coroutine wrapper for bdrv_get_block_status_above() */
 static void coroutine_fn bdrv_get_block_status_above_co_entry(void *opaque)
 {
-    BdrvCoGetBlockStatusData *data = opaque;
+    BdrvCoBlockStatusData *data = opaque;
+    int n = 0;
+    int64_t ret;
 
-    data->ret = bdrv_co_get_block_status_above(data->bs, data->base,
-                                               data->want_zero,
-                                               data->sector_num,
-                                               data->nb_sectors,
-                                               data->pnum,
-                                               data->file);
+    ret = bdrv_co_get_block_status_above(data->bs, data->base,
+                                         data->want_zero,
+                                         data->offset >> BDRV_SECTOR_BITS,
+                                         data->bytes >> BDRV_SECTOR_BITS,
+                                         &n,
+                                         data->file);
+    if (ret < 0) {
+        assert(INT_MIN <= ret);
+        data->ret = ret;
+    } else {
+        *data->pnum = n * BDRV_SECTOR_SIZE;
+        *data->map = ret & BDRV_BLOCK_OFFSET_MASK;
+        data->ret = ret & ~BDRV_BLOCK_OFFSET_MASK;
+    }
     data->done = true;
 }
 
@@ -XXX,XX +XXX,XX @@ static int64_t bdrv_common_block_status_above(BlockDriverState *bs,
                                               BlockDriverState **file)
 {
     Coroutine *co;
-    BdrvCoGetBlockStatusData data = {
+    int64_t n;
+    int64_t map;
+    BdrvCoBlockStatusData data = {
         .bs = bs,
         .base = base,
         .want_zero = want_zero,
-        .sector_num = sector_num,
-        .nb_sectors = nb_sectors,
-        .pnum = pnum,
+        .offset = sector_num * BDRV_SECTOR_SIZE,
+        .bytes = nb_sectors * BDRV_SECTOR_SIZE,
+        .pnum = &n,
+        .map = &map,
         .file = file,
         .done = false,
     };
@@ -XXX,XX +XXX,XX @@ static int64_t bdrv_common_block_status_above(BlockDriverState *bs,
         bdrv_coroutine_enter(bs, co);
         BDRV_POLL_WHILE(bs, !data.done);
     }
-    return data.ret;
+    if (data.ret < 0) {
+        *pnum = 0;
+        return data.ret;
+    }
+    assert(QEMU_IS_ALIGNED(n | map, BDRV_SECTOR_SIZE));
+    *pnum = n >> BDRV_SECTOR_BITS;
+    return data.ret | map;
 }
 
 int64_t bdrv_get_block_status_above(BlockDriverState *bs,
-- 
2.13.6

From: Eric Blake <eblake@redhat.com>

We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based.  Convert another internal
function (no semantic change).

Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/io.c | 61 ++++++++++++++++++++++++++++++-------------------------------
 1 file changed, 30 insertions(+), 31 deletions(-)

diff --git a/block/io.c b/block/io.c
index XXXXXXX..XXXXXXX 100644
--- a/block/io.c
+++ b/block/io.c
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn bdrv_get_block_status_above_co_entry(void *opaque)
  *
  * See bdrv_co_get_block_status_above() for details.
  */
-static int64_t bdrv_common_block_status_above(BlockDriverState *bs,
-                                              BlockDriverState *base,
-                                              bool want_zero,
-                                              int64_t sector_num,
-                                              int nb_sectors, int *pnum,
-                                              BlockDriverState **file)
+static int bdrv_common_block_status_above(BlockDriverState *bs,
+                                          BlockDriverState *base,
+                                          bool want_zero, int64_t offset,
+                                          int64_t bytes, int64_t *pnum,
+                                          int64_t *map,
+                                          BlockDriverState **file)
 {
     Coroutine *co;
-    int64_t n;
-    int64_t map;
     BdrvCoBlockStatusData data = {
         .bs = bs,
         .base = base,
         .want_zero = want_zero,
-        .offset = sector_num * BDRV_SECTOR_SIZE,
-        .bytes = nb_sectors * BDRV_SECTOR_SIZE,
-        .pnum = &n,
-        .map = &map,
+        .offset = offset,
+        .bytes = bytes,
+        .pnum = pnum,
+        .map = map,
         .file = file,
         .done = false,
     };
@@ -XXX,XX +XXX,XX @@ static int64_t bdrv_common_block_status_above(BlockDriverState *bs,
         bdrv_coroutine_enter(bs, co);
         BDRV_POLL_WHILE(bs, !data.done);
     }
-    if (data.ret < 0) {
-        *pnum = 0;
-        return data.ret;
-    }
-    assert(QEMU_IS_ALIGNED(n | map, BDRV_SECTOR_SIZE));
-    *pnum = n >> BDRV_SECTOR_BITS;
-    return data.ret | map;
+    return data.ret;
 }
 
 int64_t bdrv_get_block_status_above(BlockDriverState *bs,
@@ -XXX,XX +XXX,XX @@ int64_t bdrv_get_block_status_above(BlockDriverState *bs,
                                     int nb_sectors, int *pnum,
                                     BlockDriverState **file)
 {
-    return bdrv_common_block_status_above(bs, base, true, sector_num,
-                                          nb_sectors, pnum, file);
+    int64_t ret;
+    int64_t n;
+    int64_t map;
+
+    ret = bdrv_common_block_status_above(bs, base, true,
+                                         sector_num * BDRV_SECTOR_SIZE,
+                                         nb_sectors * BDRV_SECTOR_SIZE,
+                                         &n, &map, file);
+    if (ret < 0) {
+        *pnum = 0;
+        return ret;
+    }
+    assert(QEMU_IS_ALIGNED(n | map, BDRV_SECTOR_SIZE));
+    *pnum = n >> BDRV_SECTOR_BITS;
+    return ret | map;
 }
 
 int bdrv_block_status(BlockDriverState *bs, int64_t offset, int64_t bytes,
@@ -XXX,XX +XXX,XX @@ int bdrv_block_status(BlockDriverState *bs, int64_t offset, int64_t bytes,
 int coroutine_fn bdrv_is_allocated(BlockDriverState *bs, int64_t offset,
                                    int64_t bytes, int64_t *pnum)
 {
-    int64_t ret;
-    int psectors;
+    int ret;
+    int64_t dummy;
 
-    assert(QEMU_IS_ALIGNED(offset, BDRV_SECTOR_SIZE));
-    assert(QEMU_IS_ALIGNED(bytes, BDRV_SECTOR_SIZE) && bytes < INT_MAX);
-    ret = bdrv_common_block_status_above(bs, backing_bs(bs), false,
-                                         offset >> BDRV_SECTOR_BITS,
-                                         bytes >> BDRV_SECTOR_BITS, &psectors,
+    ret = bdrv_common_block_status_above(bs, backing_bs(bs), false, offset,
+                                         bytes, pnum ? pnum : &dummy, NULL,
                                          NULL);
     if (ret < 0) {
         return ret;
     }
-    if (pnum) {
-        *pnum = psectors * BDRV_SECTOR_SIZE;
-    }
     return !!(ret & BDRV_BLOCK_ALLOCATED);
 }
 
-- 
2.13.6

From: Eric Blake <eblake@redhat.com>

Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/io.c | 68 ++++++++++++++++++++++----------------------------------------
 1 file changed, 24 insertions(+), 44 deletions(-)

diff --git a/block/io.c b/block/io.c
index XXXXXXX..XXXXXXX 100644
--- a/block/io.c
+++ b/block/io.c
@@ -XXX,XX +XXX,XX @@ early_out:
     return ret;
 }
 
-static int64_t coroutine_fn bdrv_co_get_block_status_above(BlockDriverState *bs,
-        BlockDriverState *base,
-        bool want_zero,
-        int64_t sector_num,
-        int nb_sectors,
-        int *pnum,
-        BlockDriverState **file)
+static int coroutine_fn bdrv_co_block_status_above(BlockDriverState *bs,
+                                                   BlockDriverState *base,
+                                                   bool want_zero,
+                                                   int64_t offset,
+                                                   int64_t bytes,
+                                                   int64_t *pnum,
+                                                   int64_t *map,
+                                                   BlockDriverState **file)
 {
     BlockDriverState *p;
-    int64_t ret = 0;
+    int ret = 0;
     bool first = true;
-    int64_t map = 0;
 
     assert(bs != base);
     for (p = bs; p != base; p = backing_bs(p)) {
-        int64_t count;
-
-        ret = bdrv_co_block_status(p, want_zero,
-                                   sector_num * BDRV_SECTOR_SIZE,
-                                   nb_sectors * BDRV_SECTOR_SIZE, &count,
-                                   &map, file);
+        ret = bdrv_co_block_status(p, want_zero, offset, bytes, pnum, map,
+                                   file);
         if (ret < 0) {
             break;
         }
-        assert(QEMU_IS_ALIGNED(count | map, BDRV_SECTOR_SIZE));
-        ret |= map;
-        *pnum = count >> BDRV_SECTOR_BITS;
         if (ret & BDRV_BLOCK_ZERO && ret & BDRV_BLOCK_EOF && !first) {
             /*
              * Reading beyond the end of the file continues to read
@@ -XXX,XX +XXX,XX @@ static int64_t coroutine_fn bdrv_co_get_block_status_above(BlockDriverState *bs,
              * unallocated length we learned from an earlier
              * iteration.
              */
-            *pnum = nb_sectors;
+            *pnum = bytes;
         }
         if (ret & (BDRV_BLOCK_ZERO | BDRV_BLOCK_DATA)) {
             break;
         }
-        /* [sector_num, pnum] unallocated on this layer, which could be only
-         * the first part of [sector_num, nb_sectors].  */
-        nb_sectors = MIN(nb_sectors, *pnum);
+        /* [offset, pnum] unallocated on this layer, which could be only
+         * the first part of [offset, bytes].  */
+        bytes = MIN(bytes, *pnum);
         first = false;
     }
     return ret;
 }
 
 /* Coroutine wrapper for bdrv_get_block_status_above() */
-static void coroutine_fn bdrv_get_block_status_above_co_entry(void *opaque)
+static void coroutine_fn bdrv_block_status_above_co_entry(void *opaque)
 {
     BdrvCoBlockStatusData *data = opaque;
-    int n = 0;
-    int64_t ret;
 
-    ret = bdrv_co_get_block_status_above(data->bs, data->base,
-                                         data->want_zero,
-                                         data->offset >> BDRV_SECTOR_BITS,
-                                         data->bytes >> BDRV_SECTOR_BITS,
-                                         &n,
-                                         data->file);
-    if (ret < 0) {
-        assert(INT_MIN <= ret);
-        data->ret = ret;
-    } else {
-        *data->pnum = n * BDRV_SECTOR_SIZE;
-        *data->map = ret & BDRV_BLOCK_OFFSET_MASK;
-        data->ret = ret & ~BDRV_BLOCK_OFFSET_MASK;
-    }
+    data->ret = bdrv_co_block_status_above(data->bs, data->base,
+                                           data->want_zero,
+                                           data->offset, data->bytes,
+                                           data->pnum, data->map, data->file);
     data->done = true;
 }
 
 /*
- * Synchronous wrapper around bdrv_co_get_block_status_above().
+ * Synchronous wrapper around bdrv_co_block_status_above().
  *
- * See bdrv_co_get_block_status_above() for details.
+ * See bdrv_co_block_status_above() for details.
  */
 static int bdrv_common_block_status_above(BlockDriverState *bs,
                                           BlockDriverState *base,
@@ -XXX,XX +XXX,XX @@ static int bdrv_common_block_status_above(BlockDriverState *bs,
 
     if (qemu_in_coroutine()) {
         /* Fast-path if already in coroutine context */
-        bdrv_get_block_status_above_co_entry(&data);
+        bdrv_block_status_above_co_entry(&data);
     } else {
-        co = qemu_coroutine_create(bdrv_get_block_status_above_co_entry,
-                                   &data);
+        co = qemu_coroutine_create(bdrv_block_status_above_co_entry, &data);
         bdrv_coroutine_enter(bs, co);
         BDRV_POLL_WHILE(bs, !data.done);
     }
-- 
2.13.6

From: Eric Blake <eblake@redhat.com>

Changing the name of the function from bdrv_get_block_status_above()
to bdrv_block_status_above() ensures that the compiler enforces that
all callers are updated.  Likewise, since a byte interface allows
an offset mapping that might not be sector aligned, split the mapping
out of the return value and into a pass-by-reference parameter.  For
now, the io.c layer still assert()s that all uses are sector-aligned,
but that can be relaxed when a later patch implements byte-based
block status in the drivers.
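
As a rough illustration of the interface change (not the actual QEMU
code), the sketch below contrasts a packed return value with a split
one.  The macro names, flag values, and the functions status_packed()
and status_split() are hypothetical stand-ins chosen for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real definitions live in QEMU's headers. */
#define SECTOR_BITS       9
#define SECTOR_SIZE       (1LL << SECTOR_BITS)
#define FLAG_DATA         0x01
#define FLAG_OFFSET_VALID 0x04
#define OFFSET_MASK       (~(SECTOR_SIZE - 1))  /* sector-aligned bits */

/* Old style: status flags and the host offset share one int64_t, which
 * only works while the offset has its low SECTOR_BITS bits clear. */
static int64_t status_packed(int64_t host_offset)
{
    assert((host_offset & ~OFFSET_MASK) == 0);
    return host_offset | FLAG_DATA | FLAG_OFFSET_VALID;
}

/* New style: flags come back in a plain int and the offset is stored
 * through a pointer, so it no longer needs to be sector-aligned. */
static int status_split(int64_t host_offset, int64_t *map)
{
    if (map) {
        *map = host_offset;
    }
    return FLAG_DATA | FLAG_OFFSET_VALID;
}
```

With the split form, a caller that does not care about the mapping can
pass NULL, and an unaligned offset such as 1000 survives intact.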

For the most part this patch is just the addition of scaling at the
callers followed by inverse scaling at bdrv_block_status(), plus
updates for the new split return interface.  But some code,
particularly bdrv_block_status(), gets a lot simpler because it no
longer has to mess with sectors.  Likewise, mirror code no longer
computes s->granularity >> BDRV_SECTOR_BITS, and can therefore drop
an assertion about alignment because the loop no longer depends on
alignment (never mind that we don't really have a driver that
reports sub-sector alignments, so it's not really possible to test
the effect of sub-sector mirroring).  Fix a neighboring assertion to
use is_power_of_2 while there.

For ease of review, bdrv_get_block_status() was tackled separately.

Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 include/block/block.h |  8 +++-----
 block/io.c            | 55 ++++++++-------------------------------------------
 block/mirror.c        | 18 ++++++-----------
 block/qcow2.c         | 30 +++++++++++-----------------
 qemu-img.c            | 49 +++++++++++++++++++++++++--------------------
 5 files changed, 57 insertions(+), 103 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -XXX,XX +XXX,XX @@ bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs);
 int bdrv_block_status(BlockDriverState *bs, int64_t offset,
                       int64_t bytes, int64_t *pnum, int64_t *map,
                       BlockDriverState **file);
-int64_t bdrv_get_block_status_above(BlockDriverState *bs,
-                                    BlockDriverState *base,
-                                    int64_t sector_num,
-                                    int nb_sectors, int *pnum,
-                                    BlockDriverState **file);
+int bdrv_block_status_above(BlockDriverState *bs, BlockDriverState *base,
+                            int64_t offset, int64_t bytes, int64_t *pnum,
+                            int64_t *map, BlockDriverState **file);
 int bdrv_is_allocated(BlockDriverState *bs, int64_t offset, int64_t bytes,
                       int64_t *pnum);
 int bdrv_is_allocated_above(BlockDriverState *top, BlockDriverState *base,
diff --git a/block/io.c b/block/io.c
index XXXXXXX..XXXXXXX 100644
--- a/block/io.c
+++ b/block/io.c
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn bdrv_co_block_status_above(BlockDriverState *bs,
     return ret;
 }
 
-/* Coroutine wrapper for bdrv_get_block_status_above() */
+/* Coroutine wrapper for bdrv_block_status_above() */
 static void coroutine_fn bdrv_block_status_above_co_entry(void *opaque)
 {
     BdrvCoBlockStatusData *data = opaque;
@@ -XXX,XX +XXX,XX @@ static int bdrv_common_block_status_above(BlockDriverState *bs,
     return data.ret;
 }
 
-int64_t bdrv_get_block_status_above(BlockDriverState *bs,
-                                    BlockDriverState *base,
-                                    int64_t sector_num,
-                                    int nb_sectors, int *pnum,
-                                    BlockDriverState **file)
+int bdrv_block_status_above(BlockDriverState *bs, BlockDriverState *base,
+                            int64_t offset, int64_t bytes, int64_t *pnum,
+                            int64_t *map, BlockDriverState **file)
 {
-    int64_t ret;
-    int64_t n;
-    int64_t map;
-
-    ret = bdrv_common_block_status_above(bs, base, true,
-                                         sector_num * BDRV_SECTOR_SIZE,
-                                         nb_sectors * BDRV_SECTOR_SIZE,
-                                         &n, &map, file);
-    if (ret < 0) {
-        *pnum = 0;
-        return ret;
-    }
-    assert(QEMU_IS_ALIGNED(n | map, BDRV_SECTOR_SIZE));
-    *pnum = n >> BDRV_SECTOR_BITS;
-    return ret | map;
+    return bdrv_common_block_status_above(bs, base, true, offset, bytes,
+                                          pnum, map, file);
 }
 
 int bdrv_block_status(BlockDriverState *bs, int64_t offset, int64_t bytes,
                       int64_t *pnum, int64_t *map, BlockDriverState **file)
 {
-    int64_t ret;
-    int n;
-
-    assert(QEMU_IS_ALIGNED(offset | bytes, BDRV_SECTOR_SIZE));
-    assert(pnum);
-    /*
-     * The contract allows us to return pnum smaller than bytes, even
-     * if the next query would see the same status; we truncate the
-     * request to avoid overflowing the driver's 32-bit interface.
-     */
-    bytes = MIN(bytes, BDRV_REQUEST_MAX_BYTES);
-    ret = bdrv_get_block_status_above(bs, backing_bs(bs),
-                                      offset >> BDRV_SECTOR_BITS,
-                                      bytes >> BDRV_SECTOR_BITS, &n, file);
-    if (ret < 0) {
-        assert(INT_MIN <= ret);
-        *pnum = 0;
-        return ret;
-    }
-    *pnum = n * BDRV_SECTOR_SIZE;
-    if (map) {
-        *map = ret & BDRV_BLOCK_OFFSET_MASK;
-    } else {
-        ret &= ~BDRV_BLOCK_OFFSET_VALID;
-    }
-    return ret & ~BDRV_BLOCK_OFFSET_MASK;
+    return bdrv_block_status_above(bs, backing_bs(bs),
+                                   offset, bytes, pnum, map, file);
 }
 
 int coroutine_fn bdrv_is_allocated(BlockDriverState *bs, int64_t offset,
diff --git a/block/mirror.c b/block/mirror.c
index XXXXXXX..XXXXXXX 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -XXX,XX +XXX,XX @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
     uint64_t delay_ns = 0;
     /* At least the first dirty chunk is mirrored in one iteration. */
     int nb_chunks = 1;
-    int sectors_per_chunk = s->granularity >> BDRV_SECTOR_BITS;
     bool write_zeroes_ok = bdrv_can_write_zeroes_with_unmap(blk_bs(s->target));
     int max_io_bytes = MAX(s->buf_size / MAX_IN_FLIGHT, MAX_IO_BYTES);
 
@@ -XXX,XX +XXX,XX @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
     }
 
     /* Clear dirty bits before querying the block status, because
-     * calling bdrv_get_block_status_above could yield - if some blocks are
+     * calling bdrv_block_status_above could yield - if some blocks are
      * marked dirty in this window, we need to know.
      */
     bdrv_reset_dirty_bitmap_locked(s->dirty_bitmap, offset,
@@ -XXX,XX +XXX,XX @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
 
     bitmap_set(s->in_flight_bitmap, offset / s->granularity, nb_chunks);
     while (nb_chunks > 0 && offset < s->bdev_length) {
-        int64_t ret;
-        int io_sectors;
+        int ret;
         int64_t io_bytes;
         int64_t io_bytes_acct;
         enum MirrorMethod {
@@ -XXX,XX +XXX,XX @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
         } mirror_method = MIRROR_METHOD_COPY;
 
         assert(!(offset % s->granularity));
-        ret = bdrv_get_block_status_above(source, NULL,
-                                          offset >> BDRV_SECTOR_BITS,
-                                          nb_chunks * sectors_per_chunk,
-                                          &io_sectors, NULL);
-        io_bytes = io_sectors * BDRV_SECTOR_SIZE;
+        ret = bdrv_block_status_above(source, NULL, offset,
+                                      nb_chunks * s->granularity,
+                                      &io_bytes, NULL, NULL);
         if (ret < 0) {
             io_bytes = MIN(nb_chunks * s->granularity, max_io_bytes);
         } else if (ret & BDRV_BLOCK_DATA) {
@@ -XXX,XX +XXX,XX @@ static void mirror_start_job(const char *job_id, BlockDriverState *bs,
         granularity = bdrv_get_default_bitmap_granularity(target);
     }
 
-    assert ((granularity & (granularity - 1)) == 0);
-    /* Granularity must be large enough for sector-based dirty bitmap */
-    assert(granularity >= BDRV_SECTOR_SIZE);
+    assert(is_power_of_2(granularity));
 
     if (buf_size < 0) {
         error_setg(errp, "Invalid parameter 'buf-size'");
diff --git a/block/qcow2.c b/block/qcow2.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -XXX,XX +XXX,XX @@ finish:
 
 static bool is_zero(BlockDriverState *bs, int64_t offset, int64_t bytes)
 {
-    int nr;
-    int64_t res;
+    int64_t nr;
+    int res;
     int64_t start;
 
     /* TODO: Widening to sector boundaries should only be needed as
@@ -XXX,XX +XXX,XX @@ static bool is_zero(BlockDriverState *bs, int64_t offset, int64_t bytes)
     if (!bytes) {
         return true;
     }
-    res = bdrv_get_block_status_above(bs, NULL, start >> BDRV_SECTOR_BITS,
-                                      bytes >> BDRV_SECTOR_BITS, &nr, NULL);
-    return res >= 0 && (res & BDRV_BLOCK_ZERO) &&
-        nr * BDRV_SECTOR_SIZE == bytes;
+    res = bdrv_block_status_above(bs, NULL, start, bytes, &nr, NULL, NULL);
+    return res >= 0 && (res & BDRV_BLOCK_ZERO) && nr == bytes;
 }
 
 static coroutine_fn int qcow2_co_pwrite_zeroes(BlockDriverState *bs,
@@ -XXX,XX +XXX,XX @@ static BlockMeasureInfo *qcow2_measure(QemuOpts *opts, BlockDriverState *in_bs,
             required = virtual_size;
         } else {
             int64_t offset;
-            int pnum = 0;
+            int64_t pnum = 0;
 
-            for (offset = 0; offset < ssize;
-                 offset += pnum * BDRV_SECTOR_SIZE) {
-                int nb_sectors = MIN(ssize - offset,
-                                     BDRV_REQUEST_MAX_BYTES) / BDRV_SECTOR_SIZE;
-                int64_t ret;
+            for (offset = 0; offset < ssize; offset += pnum) {
+                int ret;
 
-                ret = bdrv_get_block_status_above(in_bs, NULL,
-                                                  offset >> BDRV_SECTOR_BITS,
-                                                  nb_sectors, &pnum, NULL);
+                ret = bdrv_block_status_above(in_bs, NULL, offset,
+                                              ssize - offset, &pnum, NULL,
+                                              NULL);
                 if (ret < 0) {
                     error_setg_errno(&local_err, -ret,
                                      "Unable to get block status");
@@ -XXX,XX +XXX,XX @@ static BlockMeasureInfo *qcow2_measure(QemuOpts *opts, BlockDriverState *in_bs,
                 } else if ((ret & (BDRV_BLOCK_DATA | BDRV_BLOCK_ALLOCATED)) ==
                            (BDRV_BLOCK_DATA | BDRV_BLOCK_ALLOCATED)) {
                     /* Extend pnum to end of cluster for next iteration */
-                    pnum = (ROUND_UP(offset + pnum * BDRV_SECTOR_SIZE,
-                                 cluster_size) - offset) >> BDRV_SECTOR_BITS;
+                    pnum = ROUND_UP(offset + pnum, cluster_size) - offset;
 
                     /* Count clusters we've seen */
-                    required += offset % cluster_size + pnum * BDRV_SECTOR_SIZE;
+                    required += offset % cluster_size + pnum;
                 }
             }
         }
diff --git a/qemu-img.c b/qemu-img.c
index XXXXXXX..XXXXXXX 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
     BlockDriverState *bs1, *bs2;
     int64_t total_sectors1, total_sectors2;
     uint8_t *buf1 = NULL, *buf2 = NULL;
-    int pnum1, pnum2;
+    int64_t pnum1, pnum2;
     int allocated1, allocated2;
     int ret = 0; /* return value - 0 Ident, 1 Different, >1 Error */
     bool progress = false, quiet = false, strict = false;
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
     }
 
     for (;;) {
-        int64_t status1, status2;
+        int status1, status2;
 
         nb_sectors = sectors_to_process(total_sectors, sector_num);
         if (nb_sectors <= 0) {
             break;
         }
-        status1 = bdrv_get_block_status_above(bs1, NULL, sector_num,
-                                              total_sectors1 - sector_num,
-                                              &pnum1, NULL);
+        status1 = bdrv_block_status_above(bs1, NULL,
+                                          sector_num * BDRV_SECTOR_SIZE,
+                                          (total_sectors1 - sector_num) *
+                                          BDRV_SECTOR_SIZE,
+                                          &pnum1, NULL, NULL);
         if (status1 < 0) {
             ret = 3;
             error_report("Sector allocation test failed for %s", filename1);
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
         }
         allocated1 = status1 & BDRV_BLOCK_ALLOCATED;
 
-        status2 = bdrv_get_block_status_above(bs2, NULL, sector_num,
-                                              total_sectors2 - sector_num,
-                                              &pnum2, NULL);
+        status2 = bdrv_block_status_above(bs2, NULL,
+                                          sector_num * BDRV_SECTOR_SIZE,
+                                          (total_sectors2 - sector_num) *
+                                          BDRV_SECTOR_SIZE,
+                                          &pnum2, NULL, NULL);
         if (status2 < 0) {
             ret = 3;
             error_report("Sector allocation test failed for %s", filename2);
             goto out;
         }
         allocated2 = status2 & BDRV_BLOCK_ALLOCATED;
+        /* TODO: Relax this once comparison is byte-based, and we no longer
+         * have to worry about sector alignment */
+        assert(QEMU_IS_ALIGNED(pnum1 | pnum2, BDRV_SECTOR_SIZE));
         if (pnum1) {
-            nb_sectors = MIN(nb_sectors, pnum1);
+            nb_sectors = MIN(nb_sectors, pnum1 >> BDRV_SECTOR_BITS);
         }
         if (pnum2) {
-            nb_sectors = MIN(nb_sectors, pnum2);
+            nb_sectors = MIN(nb_sectors, pnum2 >> BDRV_SECTOR_BITS);
         }
 
         if (strict) {
-            if ((status1 & ~BDRV_BLOCK_OFFSET_MASK) !=
-                (status2 & ~BDRV_BLOCK_OFFSET_MASK)) {
+            if (status1 != status2) {
                 ret = 1;
                 qprintf(quiet, "Strict mode: Offset %" PRId64
                         " block status mismatch!\n",
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
             }
         }
         if ((status1 & BDRV_BLOCK_ZERO) && (status2 & BDRV_BLOCK_ZERO)) {
-            nb_sectors = MIN(pnum1, pnum2);
+            nb_sectors = DIV_ROUND_UP(MIN(pnum1, pnum2), BDRV_SECTOR_SIZE);
         } else if (allocated1 == allocated2) {
             if (allocated1) {
                 ret = blk_pread(blk1, sector_num << BDRV_SECTOR_BITS, buf1,
@@ -XXX,XX +XXX,XX @@ static void convert_select_part(ImgConvertState *s, int64_t sector_num,
 
 static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num)
 {
-    int64_t ret, src_cur_offset;
-    int n, src_cur;
+    int64_t src_cur_offset;
+    int ret, n, src_cur;
 
     convert_select_part(s, sector_num, &src_cur, &src_cur_offset);
 
@@ -XXX,XX +XXX,XX @@ static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num)
     n = MIN(s->total_sectors - sector_num, BDRV_REQUEST_MAX_SECTORS);
 
     if (s->sector_next_status <= sector_num) {
+        int64_t count = n * BDRV_SECTOR_SIZE;
+
         if (s->target_has_backing) {
-            int64_t count = n * BDRV_SECTOR_SIZE;
 
             ret = bdrv_block_status(blk_bs(s->src[src_cur]),
                                     (sector_num - src_cur_offset) *
                                     BDRV_SECTOR_SIZE,
                                     count, &count, NULL, NULL);
-            assert(ret < 0 || QEMU_IS_ALIGNED(count, BDRV_SECTOR_SIZE));
-            n = count >> BDRV_SECTOR_BITS;
         } else {
-            ret = bdrv_get_block_status_above(blk_bs(s->src[src_cur]), NULL,
-                                              sector_num - src_cur_offset,
-                                              n, &n, NULL);
+            ret = bdrv_block_status_above(blk_bs(s->src[src_cur]), NULL,
+                                          (sector_num - src_cur_offset) *
+                                          BDRV_SECTOR_SIZE,
+                                          count, &count, NULL, NULL);
         }
         if (ret < 0) {
             return ret;
         }
+        n = DIV_ROUND_UP(count, BDRV_SECTOR_SIZE);
 
         if (ret & BDRV_BLOCK_ZERO) {
             s->status = BLK_ZERO;
-- 
2.13.6

From: Eric Blake <eblake@redhat.com>

As long as we are querying the status for a chunk smaller than
the known image size, we are guaranteed that a successful return
will have set pnum to a non-zero size (pnum is zero only for
queries beyond the end of the file).  Use that to slightly
simplify the calculation of the current chunk size being compared.
Likewise, we don't have to shrink the amount of data operated on
until we know we have to read the file, and therefore have to fit
in the bounds of our buffer.  Also, note that 'total_sectors_over'
is equivalent to 'progress_base'.

With these changes in place, sectors_to_process() is now dead code,
and can be removed.
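
The chunk-size calculation described above can be sketched in
isolation as follows.  This is a simplified model, not the patched
function: chunk_sectors() and its must_read parameter are hypothetical,
and the IO_BUF_SIZE value here is an assumption made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_BITS 9
#define IO_BUF_SIZE (2 * 1024 * 1024)   /* assumed buffer size */
#define MIN(a, b)   ((a) < (b) ? (a) : (b))

/*
 * Number of sectors to advance, given the byte counts reported by the
 * two block status queries.  The clamp to the I/O buffer is applied
 * only when the data actually has to be read.
 */
static int64_t chunk_sectors(int64_t pnum1, int64_t pnum2, int must_read)
{
    int64_t nb_sectors;

    /* Queries inside the image always report a non-zero length. */
    assert(pnum1 > 0 && pnum2 > 0);
    nb_sectors = MIN(pnum1, pnum2) >> SECTOR_BITS;
    if (must_read) {
        nb_sectors = MIN(nb_sectors, IO_BUF_SIZE >> SECTOR_BITS);
    }
    return nb_sectors;
}
```

When both regions read as zero, must_read is false and the loop can
skip the whole common extent in one step instead of one buffer at a
time.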

Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 qemu-img.c | 38 +++++++++++---------------------------
 1 file changed, 11 insertions(+), 27 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index XXXXXXX..XXXXXXX 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -XXX,XX +XXX,XX @@ static int64_t sectors_to_bytes(int64_t sectors)
     return sectors << BDRV_SECTOR_BITS;
 }
 
-static int64_t sectors_to_process(int64_t total, int64_t from)
-{
-    return MIN(total - from, IO_BUF_SIZE >> BDRV_SECTOR_BITS);
-}
-
 /*
  * Check if passed sectors are empty (not allocated or contain only 0 bytes)
  *
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
         goto out;
     }
 
-    for (;;) {
+    while (sector_num < total_sectors) {
         int status1, status2;
 
-        nb_sectors = sectors_to_process(total_sectors, sector_num);
-        if (nb_sectors <= 0) {
-            break;
-        }
         status1 = bdrv_block_status_above(bs1, NULL,
                                           sector_num * BDRV_SECTOR_SIZE,
                                           (total_sectors1 - sector_num) *
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
         /* TODO: Relax this once comparison is byte-based, and we no longer
          * have to worry about sector alignment */
         assert(QEMU_IS_ALIGNED(pnum1 | pnum2, BDRV_SECTOR_SIZE));
-        if (pnum1) {
-            nb_sectors = MIN(nb_sectors, pnum1 >> BDRV_SECTOR_BITS);
-        }
-        if (pnum2) {
-            nb_sectors = MIN(nb_sectors, pnum2 >> BDRV_SECTOR_BITS);
-        }
+
+        assert(pnum1 && pnum2);
+        nb_sectors = MIN(pnum1, pnum2) >> BDRV_SECTOR_BITS;
 
         if (strict) {
             if (status1 != status2) {
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
             }
         }
         if ((status1 & BDRV_BLOCK_ZERO) && (status2 & BDRV_BLOCK_ZERO)) {
-            nb_sectors = DIV_ROUND_UP(MIN(pnum1, pnum2), BDRV_SECTOR_SIZE);
+            /* nothing to do */
         } else if (allocated1 == allocated2) {
             if (allocated1) {
+                nb_sectors = MIN(nb_sectors, IO_BUF_SIZE >> BDRV_SECTOR_BITS);
                 ret = blk_pread(blk1, sector_num << BDRV_SECTOR_BITS, buf1,
                                 nb_sectors << BDRV_SECTOR_BITS);
                 if (ret < 0) {
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
                 }
             }
         } else {
-
+            nb_sectors = MIN(nb_sectors, IO_BUF_SIZE >> BDRV_SECTOR_BITS);
             if (allocated1) {
                 ret = check_empty_sectors(blk1, sector_num, nb_sectors,
                                           filename1, buf1, quiet);
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
 
     if (total_sectors1 != total_sectors2) {
         BlockBackend *blk_over;
-        int64_t total_sectors_over;
         const char *filename_over;
 
         qprintf(quiet, "Warning: Image size mismatch!\n");
         if (total_sectors1 > total_sectors2) {
-            total_sectors_over = total_sectors1;
             blk_over = blk1;
             filename_over = filename1;
         } else {
-            total_sectors_over = total_sectors2;
             blk_over = blk2;
             filename_over = filename2;
         }
 
-        for (;;) {
+        while (sector_num < progress_base) {
             int64_t count;
 
-            nb_sectors = sectors_to_process(total_sectors_over, sector_num);
-            if (nb_sectors <= 0) {
-                break;
-            }
             ret = bdrv_is_allocated_above(blk_bs(blk_over), NULL,
                                           sector_num * BDRV_SECTOR_SIZE,
-                                          nb_sectors * BDRV_SECTOR_SIZE,
+                                          (progress_base - sector_num) *
+                                          BDRV_SECTOR_SIZE,
                                           &count);
             if (ret < 0) {
                 ret = 3;
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
             assert(QEMU_IS_ALIGNED(count, BDRV_SECTOR_SIZE));
             nb_sectors = count >> BDRV_SECTOR_BITS;
             if (ret) {
+                nb_sectors = MIN(nb_sectors, IO_BUF_SIZE >> BDRV_SECTOR_BITS);
                 ret = check_empty_sectors(blk_over, sector_num, nb_sectors,
                                           filename_over, buf1, quiet);
                 if (ret) {
-- 
2.13.6

From: Eric Blake <eblake@redhat.com>

Compare the following images with all-zero contents:
$ truncate --size 1M A
$ qemu-img create -f qcow2 -o preallocation=off B 1G
$ qemu-img create -f qcow2 -o preallocation=metadata C 1G

On my machine, pre-patch speeds differ by more than an order of
magnitude depending on the choice of preallocation in the qcow2
file:

$ time ./qemu-img compare -f raw -F qcow2 A B
Warning: Image size mismatch!
Images are identical.

real	0m0.014s
user	0m0.007s
sys	0m0.007s

$ time ./qemu-img compare -f raw -F qcow2 A C
Warning: Image size mismatch!
Images are identical.

real	0m0.341s
user	0m0.144s
sys	0m0.188s

Why? Because bdrv_is_allocated() returns false for image B but
true for image C, throwing away the fact that both images know
via lseek(SEEK_HOLE) that the entire image still reads as zero.
From there, qemu-img ends up calling bdrv_pread() for every byte
of the tail, instead of quickly looking for the next allocation.
The solution: use block_status instead of is_allocated, giving:

$ time ./qemu-img compare -f raw -F qcow2 A C
Warning: Image size mismatch!
Images are identical.

real	0m0.014s
user	0m0.011s
sys	0m0.003s

which is on par with the speeds for no pre-allocation.
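
The decision this patch makes for the tail of the larger image can be
sketched as a small predicate.  The flag values and the name
tail_needs_read() below are hypothetical, picked only for the example;
the real flags are the BDRV_BLOCK_* constants.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values, for illustration only. */
#define BLOCK_DATA      0x01
#define BLOCK_ZERO      0x02
#define BLOCK_ALLOCATED 0x10

/*
 * An is_allocated-style query only answers "allocated or not", so a
 * preallocated region that is known to read as zero still gets read.
 * With the full status, the read is needed only when the region is
 * allocated and not already known to be zero.
 */
static bool tail_needs_read(int status)
{
    return (status & BLOCK_ALLOCATED) && !(status & BLOCK_ZERO);
}
```

Under this model, image C's preallocated but zero clusters report both
the allocated and zero flags, so they are skipped just like image B's
unallocated ones.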

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 qemu-img.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index XXXXXXX..XXXXXXX 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
         while (sector_num < progress_base) {
             int64_t count;
 
-            ret = bdrv_is_allocated_above(blk_bs(blk_over), NULL,
+            ret = bdrv_block_status_above(blk_bs(blk_over), NULL,
                                           sector_num * BDRV_SECTOR_SIZE,
                                           (progress_base - sector_num) *
                                           BDRV_SECTOR_SIZE,
-                                          &count);
+                                          &count, NULL, NULL);
             if (ret < 0) {
                 ret = 3;
                 error_report("Sector allocation test failed for %s",
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
                 goto out;
 
             }
-            /* TODO relax this once bdrv_is_allocated_above does not enforce
+            /* TODO relax this once bdrv_block_status_above does not enforce
              * sector alignment */
             assert(QEMU_IS_ALIGNED(count, BDRV_SECTOR_SIZE));
             nb_sectors = count >> BDRV_SECTOR_BITS;
-            if (ret) {
+            if (ret & BDRV_BLOCK_ALLOCATED && !(ret & BDRV_BLOCK_ZERO)) {
                 nb_sectors = MIN(nb_sectors, IO_BUF_SIZE >> BDRV_SECTOR_BITS);
                 ret = check_empty_sectors(blk_over, sector_num, nb_sectors,
                                           filename_over, buf1, quiet);
-- 
2.13.6

From: Eric Blake <eblake@redhat.com>

During 'qemu-img compare', when we are checking that an allocated
portion of one file is all zeros, we don't need to waste time
computing how many additional sectors after the first non-zero
byte are also non-zero.  Create a new helper find_nonzero() to do
the check for a first non-zero sector, and rebase
check_empty_sectors() to use it.

The new helper intentionally uses bytes in its interface, even
though it still crawls the buffer a sector at a time; it is robust
to a partial sector at the end of the buffer.
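
A standalone approximation of the helper is shown below so that the
partial-sector behavior can be checked outside of QEMU.  The function
all_zero() is a naive stand-in for buffer_is_zero(), and the names
find_nonzero_sketch() and SECTOR_SIZE are assumptions of this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE 512

/* Naive stand-in for QEMU's optimized buffer_is_zero(). */
static bool all_zero(const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i]) {
            return false;
        }
    }
    return true;
}

/*
 * Returns -1 if the n bytes at buf are all zero, otherwise the byte
 * index of the start of the first sector holding a non-zero byte.
 */
static int64_t find_nonzero_sketch(const uint8_t *buf, int64_t n)
{
    int64_t i;
    int64_t end = n - (n % SECTOR_SIZE);   /* round down to a sector */

    for (i = 0; i < end; i += SECTOR_SIZE) {
        if (!all_zero(buf + i, SECTOR_SIZE)) {
            return i;
        }
    }
    /* Trailing partial sector, if any. */
    if (i < n && !all_zero(buf + i, (size_t)(n - end))) {
        return i;
    }
    return -1;
}
```

Note that the result is always a multiple of the sector size, even
when the non-zero byte sits in the middle of a sector or in the
trailing partial sector.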

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 qemu-img.c | 32 ++++++++++++++++++++++++++++----
 1 file changed, 28 insertions(+), 4 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index XXXXXXX..XXXXXXX 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -XXX,XX +XXX,XX @@ done:
 }
 
 /*
+ * Returns -1 if 'buf' contains only zeroes, otherwise the byte index
+ * of the first sector boundary within buf where the sector contains a
+ * non-zero byte.  This function is robust to a buffer that is not
+ * sector-aligned.
+ */
+static int64_t find_nonzero(const uint8_t *buf, int64_t n)
+{
+    int64_t i;
+    int64_t end = QEMU_ALIGN_DOWN(n, BDRV_SECTOR_SIZE);
+
+    for (i = 0; i < end; i += BDRV_SECTOR_SIZE) {
+        if (!buffer_is_zero(buf + i, BDRV_SECTOR_SIZE)) {
+            return i;
+        }
+    }
+    if (i < n && !buffer_is_zero(buf + i, n - end)) {
+        return i;
+    }
+    return -1;
+}
+
+/*
  * Returns true iff the first sector pointed to by 'buf' contains at least
  * a non-NUL byte.
  *
@@ -XXX,XX +XXX,XX @@ static int check_empty_sectors(BlockBackend *blk, int64_t sect_num,
                                int sect_count, const char *filename,
                                uint8_t *buffer, bool quiet)
 {
-    int pnum, ret = 0;
+    int ret = 0;
+    int64_t idx;
+
     ret = blk_pread(blk, sect_num << BDRV_SECTOR_BITS, buffer,
                     sect_count << BDRV_SECTOR_BITS);
     if (ret < 0) {
@@ -XXX,XX +XXX,XX @@ static int check_empty_sectors(BlockBackend *blk, int64_t sect_num,
                      sectors_to_bytes(sect_num), filename, strerror(-ret));
         return ret;
     }
-    ret = is_allocated_sectors(buffer, sect_count, &pnum);
-    if (ret || pnum != sect_count) {
+    idx = find_nonzero(buffer, sect_count * BDRV_SECTOR_SIZE);
+    if (idx >= 0) {
         qprintf(quiet, "Content mismatch at offset %" PRId64 "!\n",
-                sectors_to_bytes(ret ? sect_num : sect_num + pnum));
+                sectors_to_bytes(sect_num) + idx);
         return 1;
     }
 
-- 
2.13.6

From: Eric Blake <eblake@redhat.com>

If a read error is encountered during 'qemu-img compare', we
were printing the "Error while reading offset ..." message twice;
this was because our helper function was awkward, printing output
on some but not all paths.  Fix it to consistently report errors
on all paths, so that the callers do not risk a redundant message,
and update the testsuite for the improved output.

Further simplify the code by hoisting the conversion from an error
message to an exit code into the helper function, rather than
repeating that logic at all callers (yes, the helper function is
now less generic, but it's a net win in lines of code).
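
The resulting contract of the helper can be modelled roughly as below.
This is a simplified sketch rather than the patched code: the function
check_empty_sketch(), its parameters, and the enum names are
hypothetical; only the exit codes 0, 1 and 4 come from this patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Exit codes used by 'qemu-img compare' as described in this patch. */
enum { CMP_IDENTICAL = 0, CMP_DIFFERENT = 1, CMP_READ_ERROR = 4 };

/*
 * read_result:   0 on success, negative errno on a failed read.
 * first_nonzero: -1 if the data read was all zero, else a byte offset.
 * The helper prints the message itself and hands back an exit code, so
 * callers only need "if (ret) goto out;".
 */
static int check_empty_sketch(int read_result, long long first_nonzero,
                              const char *filename)
{
    if (read_result < 0) {
        fprintf(stderr, "Error while reading %s: %s\n",
                filename, strerror(-read_result));
        return CMP_READ_ERROR;
    }
    if (first_nonzero >= 0) {
        printf("Content mismatch at offset %lld!\n", first_nonzero);
        return CMP_DIFFERENT;
    }
    return CMP_IDENTICAL;
}
```

Because every failing path reports before returning, a caller that
also printed a message would duplicate it, which is the behavior the
updated test output removes.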

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 qemu-img.c                 | 19 +++++--------------
 tests/qemu-iotests/074.out |  2 --
 2 files changed, 5 insertions(+), 16 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index XXXXXXX..XXXXXXX 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -XXX,XX +XXX,XX @@ static int64_t sectors_to_bytes(int64_t sectors)
 /*
  * Check if passed sectors are empty (not allocated or contain only 0 bytes)
  *
- * Returns 0 in case sectors are filled with 0, 1 if sectors contain non-zero
- * data and negative value on error.
+ * Intended for use by 'qemu-img compare': Returns 0 in case sectors are
+ * filled with 0, 1 if sectors contain non-zero data (this is a comparison
+ * failure), and 4 on error (the exit status for read errors), after emitting
+ * an error message.
  *
  * @param blk:  BlockBackend for the image
  * @param sect_num: Number of first sector to check
@@ -XXX,XX +XXX,XX @@ static int check_empty_sectors(BlockBackend *blk, int64_t sect_num,
     if (ret < 0) {
         error_report("Error while reading offset %" PRId64 " of %s: %s",
                      sectors_to_bytes(sect_num), filename, strerror(-ret));
-        return ret;
+        return 4;
     }
     idx = find_nonzero(buffer, sect_count * BDRV_SECTOR_SIZE);
     if (idx >= 0) {
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
                                           filename2, buf1, quiet);
             }
             if (ret) {
-                if (ret < 0) {
-                    error_report("Error while reading offset %" PRId64 ": %s",
-                                 sectors_to_bytes(sector_num), strerror(-ret));
-                    ret = 4;
-                }
                 goto out;
             }
         }
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
                 ret = check_empty_sectors(blk_over, sector_num, nb_sectors,
                                           filename_over, buf1, quiet);
                 if (ret) {
-                    if (ret < 0) {
-                        error_report("Error while reading offset %" PRId64
-                                     " of %s: %s", sectors_to_bytes(sector_num),
-                                     filename_over, strerror(-ret));
-                        ret = 4;
-                    }
                     goto out;
                 }
             }
diff --git a/tests/qemu-iotests/074.out b/tests/qemu-iotests/074.out
index XXXXXXX..XXXXXXX 100644
--- a/tests/qemu-iotests/074.out
+++ b/tests/qemu-iotests/074.out
@@ -XXX,XX +XXX,XX @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 wrote 512/512 bytes at offset 512
 512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 qemu-img: Error while reading offset 0 of blkdebug:TEST_DIR/blkdebug.conf:TEST_DIR/t.IMGFMT: Input/output error
-qemu-img: Error while reading offset 0: Input/output error
 4
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Formatting 'TEST_DIR/t.IMGFMT.2', fmt=IMGFMT size=0
@@ -XXX,XX +XXX,XX @@ Formatting 'TEST_DIR/t.IMGFMT.2', fmt=IMGFMT size=0
 wrote 512/512 bytes at offset 512
 512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 qemu-img: Error while reading offset 0 of blkdebug:TEST_DIR/blkdebug.conf:TEST_DIR/t.IMGFMT: Input/output error
-qemu-img: Error while reading offset 0 of blkdebug:TEST_DIR/blkdebug.conf:TEST_DIR/t.IMGFMT: Input/output error
 Warning: Image size mismatch!
 4
 Cleanup
-- 
2.13.6

From: Eric Blake <eblake@redhat.com>

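Continue on the quest to make more things byte-based instead of
sector-based: check_empty_sectors() now takes a byte offset and a
byte count rather than a sector number and a sector count.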

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 qemu-img.c | 27 +++++++++++++++------------
 1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index XXXXXXX..XXXXXXX 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -XXX,XX +XXX,XX @@ static int64_t sectors_to_bytes(int64_t sectors)
  * an error message.
  *
  * @param blk:  BlockBackend for the image
- * @param sect_num: Number of first sector to check
- * @param sect_count: Number of sectors to check
+ * @param offset: Starting offset to check
+ * @param bytes: Number of bytes to check
  * @param filename: Name of disk file we are checking (logging purpose)
  * @param buffer: Allocated buffer for storing read data
  * @param quiet: Flag for quiet mode
  */
-static int check_empty_sectors(BlockBackend *blk, int64_t sect_num,
-                               int sect_count, const char *filename,
+static int check_empty_sectors(BlockBackend *blk, int64_t offset,
+                               int64_t bytes, const char *filename,
                                uint8_t *buffer, bool quiet)
 {
     int ret = 0;
     int64_t idx;
 
-    ret = blk_pread(blk, sect_num << BDRV_SECTOR_BITS, buffer,
-                    sect_count << BDRV_SECTOR_BITS);
+    ret = blk_pread(blk, offset, buffer, bytes);
     if (ret < 0) {
         error_report("Error while reading offset %" PRId64 " of %s: %s",
-                     sectors_to_bytes(sect_num), filename, strerror(-ret));
+                     offset, filename, strerror(-ret));
         return 4;
     }
-    idx = find_nonzero(buffer, sect_count * BDRV_SECTOR_SIZE);
+    idx = find_nonzero(buffer, bytes);
     if (idx >= 0) {
         qprintf(quiet, "Content mismatch at offset %" PRId64 "!\n",
-                sectors_to_bytes(sect_num) + idx);
+                offset + idx);
         return 1;
     }
 
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
         } else {
             nb_sectors = MIN(nb_sectors, IO_BUF_SIZE >> BDRV_SECTOR_BITS);
             if (allocated1) {
-                ret = check_empty_sectors(blk1, sector_num, nb_sectors,
+                ret = check_empty_sectors(blk1, sector_num * BDRV_SECTOR_SIZE,
+                                          nb_sectors * BDRV_SECTOR_SIZE,
                                           filename1, buf1, quiet);
             } else {
-                ret = check_empty_sectors(blk2, sector_num, nb_sectors,
+                ret = check_empty_sectors(blk2, sector_num * BDRV_SECTOR_SIZE,
+                                          nb_sectors * BDRV_SECTOR_SIZE,
                                           filename2, buf1, quiet);
             }
             if (ret) {
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
             nb_sectors = count >> BDRV_SECTOR_BITS;
             if (ret & BDRV_BLOCK_ALLOCATED && !(ret & BDRV_BLOCK_ZERO)) {
                 nb_sectors = MIN(nb_sectors, IO_BUF_SIZE >> BDRV_SECTOR_BITS);
-                ret = check_empty_sectors(blk_over, sector_num, nb_sectors,
+                ret = check_empty_sectors(blk_over,
+                                          sector_num * BDRV_SECTOR_SIZE,
+                                          nb_sectors * BDRV_SECTOR_SIZE,
                                           filename_over, buf1, quiet);
                 if (ret) {
                     goto out;
-- 
2.13.6

From: Eric Blake <eblake@redhat.com>

In the continuing quest to make more things byte-based, change
compare_sectors(), renaming it to compare_buffers() in the
process.  Note that one caller (qemu-img compare) only cares
about the first difference, while the other (qemu-img rebase)
cares about how many consecutive sectors have the same
equal/different status; however, this patch does not bother to
micro-optimize the compare case to avoid the comparisons of
sectors beyond the first mismatch.  Both callers are always
passing valid buffers in, so the initial check for buffer size
can be turned into an assertion.
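
For illustration only, here is a minimal standalone sketch of the
byte-based comparison described above.  It mirrors the logic of the
patch, but SECTOR_SIZE, MIN() and the name compare_buffers_sketch()
are local stand-ins assumed for this example, not QEMU definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512 /* assumed stand-in for BDRV_SECTOR_SIZE */
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/*
 * Returns 0 if the first sector of each buffer matches, non-zero
 * otherwise.  *pnum is set to the length in bytes of the prefix that
 * has the same matching status as the first sector.
 */
static int compare_buffers_sketch(const uint8_t *buf1, const uint8_t *buf2,
                                  int64_t bytes, int64_t *pnum)
{
    bool res;
    int64_t i = MIN(bytes, SECTOR_SIZE);

    assert(bytes > 0);

    /* Status of the first (possibly short) sector */
    res = !!memcmp(buf1, buf2, i);
    while (i < bytes) {
        /* The tail may be shorter than a full sector */
        int64_t len = MIN(bytes - i, SECTOR_SIZE);

        if (!!memcmp(buf1 + i, buf2 + i, len) != res) {
            break;
        }
        i += len;
    }

    *pnum = i;
    return res;
}
```

A compare-style caller would report the first mismatch at
offset + (ret ? 0 : pnum), while a rebase-style caller would write
pnum bytes whenever the return value is non-zero.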

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 qemu-img.c | 55 +++++++++++++++++++++++++++----------------------------
 1 file changed, 27 insertions(+), 28 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index XXXXXXX..XXXXXXX 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -XXX,XX +XXX,XX @@ static int is_allocated_sectors_min(const uint8_t *buf, int n, int *pnum,
 }
 
 /*
- * Compares two buffers sector by sector. Returns 0 if the first sector of both
- * buffers matches, non-zero otherwise.
+ * Compares two buffers sector by sector. Returns 0 if the first
+ * sector of each buffer matches, non-zero otherwise.
  *
- * pnum is set to the number of sectors (including and immediately following
- * the first one) that are known to have the same comparison result
+ * pnum is set to the sector-aligned size of the buffer prefix that
+ * has the same matching status as the first sector.
  */
-static int compare_sectors(const uint8_t *buf1, const uint8_t *buf2, int n,
-    int *pnum)
+static int compare_buffers(const uint8_t *buf1, const uint8_t *buf2,
+                           int64_t bytes, int64_t *pnum)
 {
     bool res;
-    int i;
+    int64_t i = MIN(bytes, BDRV_SECTOR_SIZE);
 
-    if (n <= 0) {
-        *pnum = 0;
-        return 0;
-    }
+    assert(bytes > 0);
 
-    res = !!memcmp(buf1, buf2, 512);
-    for(i = 1; i < n; i++) {
-        buf1 += 512;
-        buf2 += 512;
+    res = !!memcmp(buf1, buf2, i);
+    while (i < bytes) {
+        int64_t len = MIN(bytes - i, BDRV_SECTOR_SIZE);
 
-        if (!!memcmp(buf1, buf2, 512) != res) {
+        if (!!memcmp(buf1 + i, buf2 + i, len) != res) {
             break;
         }
+        i += len;
     }
 
     *pnum = i;
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
     int64_t total_sectors;
     int64_t sector_num = 0;
     int64_t nb_sectors;
-    int c, pnum;
+    int c;
     uint64_t progress_base;
     bool image_opts = false;
     bool force_share = false;
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
             /* nothing to do */
         } else if (allocated1 == allocated2) {
             if (allocated1) {
+                int64_t pnum;
+
                 nb_sectors = MIN(nb_sectors, IO_BUF_SIZE >> BDRV_SECTOR_BITS);
                 ret = blk_pread(blk1, sector_num << BDRV_SECTOR_BITS, buf1,
                                 nb_sectors << BDRV_SECTOR_BITS);
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
                     ret = 4;
                     goto out;
                 }
-                ret = compare_sectors(buf1, buf2, nb_sectors, &pnum);
-                if (ret || pnum != nb_sectors) {
+                ret = compare_buffers(buf1, buf2,
+                                      nb_sectors * BDRV_SECTOR_SIZE, &pnum);
+                if (ret || pnum != nb_sectors * BDRV_SECTOR_SIZE) {
                     qprintf(quiet, "Content mismatch at offset %" PRId64 "!\n",
-                            sectors_to_bytes(
-                                ret ? sector_num : sector_num + pnum));
+                            sectors_to_bytes(sector_num) + (ret ? 0 : pnum));
                     ret = 1;
                     goto out;
                 }
@@ -XXX,XX +XXX,XX @@ static int img_rebase(int argc, char **argv)
             /* If they differ, we need to write to the COW file */
             uint64_t written = 0;
 
-            while (written < n) {
-                int pnum;
+            while (written < n * BDRV_SECTOR_SIZE) {
+                int64_t pnum;
 
-                if (compare_sectors(buf_old + written * 512,
-                    buf_new + written * 512, n - written, &pnum))
+                if (compare_buffers(buf_old + written,
+                                    buf_new + written,
+                                    n * BDRV_SECTOR_SIZE - written, &pnum))
                 {
                     ret = blk_pwrite(blk,
-                                     (sector + written) << BDRV_SECTOR_BITS,
-                                     buf_old + written * 512,
-                                     pnum << BDRV_SECTOR_BITS, 0);
+                                     (sector << BDRV_SECTOR_BITS) + written,
+                                     buf_old + written, pnum, 0);
                     if (ret < 0) {
                         error_report("Error while writing to COW image: %s",
                             strerror(-ret));
-- 
2.13.6

From: Eric Blake <eblake@redhat.com>

In the continuing quest to make more things byte-based, change
the internal iteration of img_rebase().  We can finally drop the
TODO assertion added earlier, now that the entire algorithm is
byte-based and no longer has to shift from bytes to sectors.

Most of the change is mechanical ('num_sectors' becomes 'size',
'sector' becomes 'offset', 'n' goes from sectors to bytes); some
of it is also a cleanup (use of MIN() instead of open-coding,
loss of variable 'count' added earlier in commit d6a644bb).
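
As a rough sketch of the MIN() cleanup, the two helpers below put
the old open-coded sector limit next to the new byte-based one.  The
helper names and SECTOR_SIZE are hypothetical and exist only for
this example; the real loop additionally lets bdrv_is_allocated()
shrink 'n', which is ignored here.

```c
#include <assert.h>
#include <stdint.h>

#define IO_BUF_SIZE (2 * 1024 * 1024) /* same value as in qemu-img.c */
#define SECTOR_SIZE 512               /* assumed stand-in for BDRV_SECTOR_SIZE */
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Old style: positions in sectors, buffer limit open-coded */
static int64_t next_chunk_sectors(int64_t num_sectors, int64_t sector)
{
    if (sector + (IO_BUF_SIZE / SECTOR_SIZE) <= num_sectors) {
        return IO_BUF_SIZE / SECTOR_SIZE;
    }
    return num_sectors - sector;
}

/* New style: positions in bytes, buffer limit via MIN() */
static int64_t next_chunk_bytes(int64_t size, int64_t offset)
{
    return MIN(IO_BUF_SIZE, size - offset);
}

/* Hypothetical helper: iterations needed to walk 'size' bytes */
static int count_iterations(int64_t size)
{
    int iterations = 0;
    int64_t offset, n;

    for (offset = 0; offset < size; offset += n) {
        n = next_chunk_bytes(size, offset);
        iterations++;
    }
    return iterations;
}
```

For sector-aligned sizes both helpers agree once the sector result
is scaled to bytes; only the byte-based one also handles a size that
is not a multiple of 512.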

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 qemu-img.c | 84 +++++++++++++++++++++++++-------------------------------------
 1 file changed, 34 insertions(+), 50 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index XXXXXXX..XXXXXXX 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -XXX,XX +XXX,XX @@ static int img_rebase(int argc, char **argv)
      * the image is the same as the original one at any time.
      */
     if (!unsafe) {
-        int64_t num_sectors;
-        int64_t old_backing_num_sectors;
-        int64_t new_backing_num_sectors = 0;
-        uint64_t sector;
-        int n;
-        int64_t count;
+        int64_t size;
+        int64_t old_backing_size;
+        int64_t new_backing_size = 0;
+        uint64_t offset;
+        int64_t n;
         float local_progress = 0;
 
         buf_old = blk_blockalign(blk, IO_BUF_SIZE);
         buf_new = blk_blockalign(blk, IO_BUF_SIZE);
 
-        num_sectors = blk_nb_sectors(blk);
-        if (num_sectors < 0) {
+        size = blk_getlength(blk);
+        if (size < 0) {
             error_report("Could not get size of '%s': %s",
-                         filename, strerror(-num_sectors));
+                         filename, strerror(-size));
             ret = -1;
             goto out;
         }
-        old_backing_num_sectors = blk_nb_sectors(blk_old_backing);
-        if (old_backing_num_sectors < 0) {
+        old_backing_size = blk_getlength(blk_old_backing);
+        if (old_backing_size < 0) {
             char backing_name[PATH_MAX];
 
             bdrv_get_backing_filename(bs, backing_name, sizeof(backing_name));
             error_report("Could not get size of '%s': %s",
-                         backing_name, strerror(-old_backing_num_sectors));
+                         backing_name, strerror(-old_backing_size));
             ret = -1;
             goto out;
         }
         if (blk_new_backing) {
-            new_backing_num_sectors = blk_nb_sectors(blk_new_backing);
-            if (new_backing_num_sectors < 0) {
+            new_backing_size = blk_getlength(blk_new_backing);
+            if (new_backing_size < 0) {
                 error_report("Could not get size of '%s': %s",
-                             out_baseimg, strerror(-new_backing_num_sectors));
+                             out_baseimg, strerror(-new_backing_size));
                 ret = -1;
                 goto out;
             }
         }
 
-        if (num_sectors != 0) {
-            local_progress = (float)100 /
-                (num_sectors / MIN(num_sectors, IO_BUF_SIZE / 512));
+        if (size != 0) {
+            local_progress = (float)100 / (size / MIN(size, IO_BUF_SIZE));
         }
 
-        for (sector = 0; sector < num_sectors; sector += n) {
-
-            /* How many sectors can we handle with the next read? */
-            if (sector + (IO_BUF_SIZE / 512) <= num_sectors) {
-                n = (IO_BUF_SIZE / 512);
-            } else {
-                n = num_sectors - sector;
-            }
+        for (offset = 0; offset < size; offset += n) {
+            /* How many bytes can we handle with the next read? */
+            n = MIN(IO_BUF_SIZE, size - offset);
 
             /* If the cluster is allocated, we don't need to take action */
-            ret = bdrv_is_allocated(bs, sector << BDRV_SECTOR_BITS,
-                                    n << BDRV_SECTOR_BITS, &count);
+            ret = bdrv_is_allocated(bs, offset, n, &n);
             if (ret < 0) {
                 error_report("error while reading image metadata: %s",
                              strerror(-ret));
                 goto out;
             }
-            /* TODO relax this once bdrv_is_allocated does not enforce
-             * sector alignment */
-            assert(QEMU_IS_ALIGNED(count, BDRV_SECTOR_SIZE));
-            n = count >> BDRV_SECTOR_BITS;
             if (ret) {
                 continue;
             }
@@ -XXX,XX +XXX,XX @@ static int img_rebase(int argc, char **argv)
              * Read old and new backing file and take into consideration that
              * backing files may be smaller than the COW image.
              */
-            if (sector >= old_backing_num_sectors) {
-                memset(buf_old, 0, n * BDRV_SECTOR_SIZE);
+            if (offset >= old_backing_size) {
+                memset(buf_old, 0, n);
             } else {
-                if (sector + n > old_backing_num_sectors) {
-                    n = old_backing_num_sectors - sector;
+                if (offset + n > old_backing_size) {
+                    n = old_backing_size - offset;
                 }
 
-                ret = blk_pread(blk_old_backing, sector << BDRV_SECTOR_BITS,
-                                buf_old, n << BDRV_SECTOR_BITS);
+                ret = blk_pread(blk_old_backing, offset, buf_old, n);
                 if (ret < 0) {
                     error_report("error while reading from old backing file");
                     goto out;
                 }
             }
 
-            if (sector >= new_backing_num_sectors || !blk_new_backing) {
-                memset(buf_new, 0, n * BDRV_SECTOR_SIZE);
+            if (offset >= new_backing_size || !blk_new_backing) {
+                memset(buf_new, 0, n);
             } else {
-                if (sector + n > new_backing_num_sectors) {
-                    n = new_backing_num_sectors - sector;
+                if (offset + n > new_backing_size) {
+                    n = new_backing_size - offset;
                 }
 
-                ret = blk_pread(blk_new_backing, sector << BDRV_SECTOR_BITS,
-                                buf_new, n << BDRV_SECTOR_BITS);
+                ret = blk_pread(blk_new_backing, offset, buf_new, n);
                 if (ret < 0) {
                     error_report("error while reading from new backing file");
                     goto out;
@@ -XXX,XX +XXX,XX @@ static int img_rebase(int argc, char **argv)
             /* If they differ, we need to write to the COW file */
             uint64_t written = 0;
 
-            while (written < n * BDRV_SECTOR_SIZE) {
+            while (written < n) {
                 int64_t pnum;
 
-                if (compare_buffers(buf_old + written,
-                                    buf_new + written,
-                                    n * BDRV_SECTOR_SIZE - written, &pnum))
+                if (compare_buffers(buf_old + written, buf_new + written,
+                                    n - written, &pnum))
                 {
-                    ret = blk_pwrite(blk,
-                                     (sector << BDRV_SECTOR_BITS) + written,
+                    ret = blk_pwrite(blk, offset + written,
                                      buf_old + written, pnum, 0);
                     if (ret < 0) {
                         error_report("Error while writing to COW image: %s",
-- 
2.13.6

From: Eric Blake <eblake@redhat.com>

In the continuing quest to make more things byte-based, change
the internal iteration of img_compare().  We can finally drop the
TODO assertions added earlier, now that the entire algorithm is
byte-based and no longer has to shift from bytes to sectors.

Most of the change is mechanical ('total_sectors' becomes
'total_size', 'sector_num' becomes 'offset', 'nb_sectors' becomes
'chunk', 'progress_base' goes from sectors to bytes); some of it
is also a cleanup (sectors_to_bytes() is now unused, loss of
variable 'count' added earlier in commit 51b0a488).
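
A small sketch of the resulting per-iteration arithmetic, assuming
hypothetical helpers pick_chunk() and progress_step() that do not
exist in qemu-img.c: the chunk is the smaller of the two status
extents, capped at the I/O buffer size, and no longer needs to be a
multiple of the sector size.  (In the patch itself the IO_BUF_SIZE
cap is applied only on the paths that actually read data.)

```c
#include <assert.h>
#include <stdint.h>

#define IO_BUF_SIZE (2 * 1024 * 1024) /* same value as in qemu-img.c */
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Hypothetical helper: bytes to handle in one loop iteration */
static int64_t pick_chunk(int64_t pnum1, int64_t pnum2)
{
    int64_t chunk;

    /* Both status queries must have made progress */
    assert(pnum1 > 0 && pnum2 > 0);
    chunk = MIN(pnum1, pnum2);
    return MIN(chunk, IO_BUF_SIZE);
}

/* Hypothetical helper: progress increment in percent, base in bytes */
static float progress_step(int64_t chunk, uint64_t progress_base)
{
    return ((float) chunk / progress_base) * 100;
}
```

With byte units on both sides, pick_chunk(1000, 700) is a valid
700-byte step, which the sector-based loop could not express.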

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 qemu-img.c | 124 ++++++++++++++++++++++++-------------------------------------
 1 file changed, 48 insertions(+), 76 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index XXXXXXX..XXXXXXX 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -XXX,XX +XXX,XX @@ static int compare_buffers(const uint8_t *buf1, const uint8_t *buf2,
 
 #define IO_BUF_SIZE (2 * 1024 * 1024)
 
-static int64_t sectors_to_bytes(int64_t sectors)
-{
-    return sectors << BDRV_SECTOR_BITS;
-}
-
 /*
  * Check if passed sectors are empty (not allocated or contain only 0 bytes)
  *
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
     const char *fmt1 = NULL, *fmt2 = NULL, *cache, *filename1, *filename2;
     BlockBackend *blk1, *blk2;
     BlockDriverState *bs1, *bs2;
-    int64_t total_sectors1, total_sectors2;
+    int64_t total_size1, total_size2;
     uint8_t *buf1 = NULL, *buf2 = NULL;
     int64_t pnum1, pnum2;
     int allocated1, allocated2;
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
     bool progress = false, quiet = false, strict = false;
     int flags;
     bool writethrough;
-    int64_t total_sectors;
-    int64_t sector_num = 0;
-    int64_t nb_sectors;
+    int64_t total_size;
+    int64_t offset = 0;
+    int64_t chunk;
     int c;
     uint64_t progress_base;
     bool image_opts = false;
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
 
     buf1 = blk_blockalign(blk1, IO_BUF_SIZE);
     buf2 = blk_blockalign(blk2, IO_BUF_SIZE);
-    total_sectors1 = blk_nb_sectors(blk1);
-    if (total_sectors1 < 0) {
+    total_size1 = blk_getlength(blk1);
+    if (total_size1 < 0) {
         error_report("Can't get size of %s: %s",
-                     filename1, strerror(-total_sectors1));
+                     filename1, strerror(-total_size1));
         ret = 4;
         goto out;
     }
-    total_sectors2 = blk_nb_sectors(blk2);
-    if (total_sectors2 < 0) {
+    total_size2 = blk_getlength(blk2);
+    if (total_size2 < 0) {
         error_report("Can't get size of %s: %s",
-                     filename2, strerror(-total_sectors2));
+                     filename2, strerror(-total_size2));
         ret = 4;
         goto out;
     }
-    total_sectors = MIN(total_sectors1, total_sectors2);
-    progress_base = MAX(total_sectors1, total_sectors2);
+    total_size = MIN(total_size1, total_size2);
+    progress_base = MAX(total_size1, total_size2);
 
     qemu_progress_print(0, 100);
 
-    if (strict && total_sectors1 != total_sectors2) {
+    if (strict && total_size1 != total_size2) {
         ret = 1;
         qprintf(quiet, "Strict mode: Image size mismatch!\n");
         goto out;
     }
 
-    while (sector_num < total_sectors) {
+    while (offset < total_size) {
         int status1, status2;
 
-        status1 = bdrv_block_status_above(bs1, NULL,
-                                          sector_num * BDRV_SECTOR_SIZE,
-                                          (total_sectors1 - sector_num) *
-                                          BDRV_SECTOR_SIZE,
-                                          &pnum1, NULL, NULL);
+        status1 = bdrv_block_status_above(bs1, NULL, offset,
+                                          total_size1 - offset, &pnum1, NULL,
+                                          NULL);
         if (status1 < 0) {
             ret = 3;
             error_report("Sector allocation test failed for %s", filename1);
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
         }
         allocated1 = status1 & BDRV_BLOCK_ALLOCATED;
 
-        status2 = bdrv_block_status_above(bs2, NULL,
-                                          sector_num * BDRV_SECTOR_SIZE,
-                                          (total_sectors2 - sector_num) *
-                                          BDRV_SECTOR_SIZE,
-                                          &pnum2, NULL, NULL);
+        status2 = bdrv_block_status_above(bs2, NULL, offset,
+                                          total_size2 - offset, &pnum2, NULL,
+                                          NULL);
         if (status2 < 0) {
             ret = 3;
             error_report("Sector allocation test failed for %s", filename2);
             goto out;
         }
         allocated2 = status2 & BDRV_BLOCK_ALLOCATED;
-        /* TODO: Relax this once comparison is byte-based, and we no longer
-         * have to worry about sector alignment */
-        assert(QEMU_IS_ALIGNED(pnum1 | pnum2, BDRV_SECTOR_SIZE));
 
         assert(pnum1 && pnum2);
-        nb_sectors = MIN(pnum1, pnum2) >> BDRV_SECTOR_BITS;
+        chunk = MIN(pnum1, pnum2);
 
         if (strict) {
             if (status1 != status2) {
                 ret = 1;
                 qprintf(quiet, "Strict mode: Offset %" PRId64
-                        " block status mismatch!\n",
-                        sectors_to_bytes(sector_num));
+                        " block status mismatch!\n", offset);
                 goto out;
             }
         }
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
             if (allocated1) {
                 int64_t pnum;
 
-                nb_sectors = MIN(nb_sectors, IO_BUF_SIZE >> BDRV_SECTOR_BITS);
-                ret = blk_pread(blk1, sector_num << BDRV_SECTOR_BITS, buf1,
-                                nb_sectors << BDRV_SECTOR_BITS);
+                chunk = MIN(chunk, IO_BUF_SIZE);
+                ret = blk_pread(blk1, offset, buf1, chunk);
                 if (ret < 0) {
-                    error_report("Error while reading offset %" PRId64 " of %s:"
-                                 " %s", sectors_to_bytes(sector_num), filename1,
-                                 strerror(-ret));
+                    error_report("Error while reading offset %" PRId64
+                                 " of %s: %s",
+                                 offset, filename1, strerror(-ret));
                     ret = 4;
                     goto out;
                 }
-                ret = blk_pread(blk2, sector_num << BDRV_SECTOR_BITS, buf2,
-                                nb_sectors << BDRV_SECTOR_BITS);
+                ret = blk_pread(blk2, offset, buf2, chunk);
                 if (ret < 0) {
                     error_report("Error while reading offset %" PRId64
-                                 " of %s: %s", sectors_to_bytes(sector_num),
-                                 filename2, strerror(-ret));
+                                 " of %s: %s",
+                                 offset, filename2, strerror(-ret));
                     ret = 4;
                     goto out;
                 }
-                ret = compare_buffers(buf1, buf2,
-                                      nb_sectors * BDRV_SECTOR_SIZE, &pnum);
-                if (ret || pnum != nb_sectors * BDRV_SECTOR_SIZE) {
+                ret = compare_buffers(buf1, buf2, chunk, &pnum);
+                if (ret || pnum != chunk) {
                     qprintf(quiet, "Content mismatch at offset %" PRId64 "!\n",
-                            sectors_to_bytes(sector_num) + (ret ? 0 : pnum));
+                            offset + (ret ? 0 : pnum));
                     ret = 1;
                     goto out;
                 }
             }
         } else {
-            nb_sectors = MIN(nb_sectors, IO_BUF_SIZE >> BDRV_SECTOR_BITS);
+            chunk = MIN(chunk, IO_BUF_SIZE);
             if (allocated1) {
-                ret = check_empty_sectors(blk1, sector_num * BDRV_SECTOR_SIZE,
-                                          nb_sectors * BDRV_SECTOR_SIZE,
+                ret = check_empty_sectors(blk1, offset, chunk,
                                           filename1, buf1, quiet);
             } else {
-                ret = check_empty_sectors(blk2, sector_num * BDRV_SECTOR_SIZE,
-                                          nb_sectors * BDRV_SECTOR_SIZE,
+                ret = check_empty_sectors(blk2, offset, chunk,
                                           filename2, buf1, quiet);
             }
             if (ret) {
                 goto out;
             }
         }
-        sector_num += nb_sectors;
-        qemu_progress_print(((float) nb_sectors / progress_base)*100, 100);
+        offset += chunk;
+        qemu_progress_print(((float) chunk / progress_base) * 100, 100);
     }
 
-    if (total_sectors1 != total_sectors2) {
+    if (total_size1 != total_size2) {
         BlockBackend *blk_over;
         const char *filename_over;
 
         qprintf(quiet, "Warning: Image size mismatch!\n");
-        if (total_sectors1 > total_sectors2) {
+        if (total_size1 > total_size2) {
             blk_over = blk1;
             filename_over = filename1;
         } else {
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
             filename_over = filename2;
         }
 
-        while (sector_num < progress_base) {
-            int64_t count;
-
-            ret = bdrv_block_status_above(blk_bs(blk_over), NULL,
-                                          sector_num * BDRV_SECTOR_SIZE,
-                                          (progress_base - sector_num) *
-                                          BDRV_SECTOR_SIZE,
-                                          &count, NULL, NULL);
+        while (offset < progress_base) {
+            ret = bdrv_block_status_above(blk_bs(blk_over), NULL, offset,
+                                          progress_base - offset, &chunk,
+                                          NULL, NULL);
             if (ret < 0) {
                 ret = 3;
                 error_report("Sector allocation test failed for %s",
@@ -XXX,XX +XXX,XX @@ static int img_compare(int argc, char **argv)
                 goto out;
 
             }
-            /* TODO relax this once bdrv_block_status_above does not enforce
-             * sector alignment */
-            assert(QEMU_IS_ALIGNED(count, BDRV_SECTOR_SIZE));
-            nb_sectors = count >> BDRV_SECTOR_BITS;
             if (ret & BDRV_BLOCK_ALLOCATED && !(ret & BDRV_BLOCK_ZERO)) {
-                nb_sectors = MIN(nb_sectors, IO_BUF_SIZE >> BDRV_SECTOR_BITS);
-                ret = check_empty_sectors(blk_over,
-                                          sector_num * BDRV_SECTOR_SIZE,
-                                          nb_sectors * BDRV_SECTOR_SIZE,
+                chunk = MIN(chunk, IO_BUF_SIZE);
+                ret = check_empty_sectors(blk_over, offset, chunk,
                                           filename_over, buf1, quiet);
                 if (ret) {
                     goto out;
                 }
             }
-            sector_num += nb_sectors;
-            qemu_progress_print(((float) nb_sectors / progress_base)*100, 100);
+            offset += chunk;
+            qemu_progress_print(((float) chunk / progress_base) * 100, 100);
         }
     }
 
-- 
2.13.6

From: Eric Blake <eblake@redhat.com>

Any device that has request_alignment greater than 512 should be
unable to report status at a finer granularity; it may also be
simpler for such devices to be guaranteed that the block layer
has rounded things out to the granularity boundary (the way the
block layer already rounds all other I/O out).  Besides, getting
the code correct for super-sector alignment also helps now that
our public interface has byte granularity, even though none of
our drivers have byte-level callbacks.

Add an assertion in blkdebug that proves that the block layer
never requests status of unaligned sections, similar to what it
does on other requests (while still keeping the generic helper
in place for when future patches add a throttle driver).  Note
that iotest 177 already covers this (it would fail if you use
just the blkdebug.c hunk without the io.c changes).  Meanwhile,
we can drop assertions in callers that no longer have to pass
in sector-aligned addresses.
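
To make the rounding concrete, here is a standalone sketch of the
arithmetic the io.c hunk performs.  The macros, the AlignedRange
struct and round_out() are assumptions made for this example and are
not QEMU's definitions; the sketch only shows how a byte range is
widened to alignment boundaries before the driver is queried.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512 /* assumed stand-in for BDRV_SECTOR_SIZE */
#define MAX(a, b) ((a) > (b) ? (a) : (b))
#define ALIGN_DOWN(n, m) ((n) / (m) * (m))
#define ALIGN_UP(n, m) ALIGN_DOWN((n) + (m) - 1, (m))

/* Hypothetical result type for the example */
typedef struct {
    int64_t offset; /* start, rounded down to the alignment */
    int64_t bytes;  /* length, so that the end is rounded up */
} AlignedRange;

static AlignedRange round_out(int64_t offset, int64_t bytes,
                              uint32_t request_alignment)
{
    /* Never go below sector granularity for the old-style callback */
    int64_t align = MAX((int64_t)request_alignment, SECTOR_SIZE);
    AlignedRange r;

    r.offset = ALIGN_DOWN(offset, align);
    r.bytes = ALIGN_UP(offset + bytes, align) - r.offset;
    return r;
}
```

For example, a 100-byte query at offset 1000 on a device with a
request_alignment of 4096 is widened to the range [0, 4096), so a
driver-side alignment assertion like the one added to blkdebug
holds.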

There is a mid-function scope added for 'count' and 'longret',
for a couple of reasons: first, an upcoming patch will add an
'if' statement that checks whether a driver has an old- or
new-style callback, and can conveniently use the same scope for
less indentation churn at that time.  Second, since we are
trying to get rid of sector-based computations, wrapping things
in a scope makes it easier to group and see what will be
deleted in a final cleanup patch once all drivers have been
converted to the new-style callback.

Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 include/block/block_int.h |  3 +-
 block/blkdebug.c          | 13 ++++++++-
 block/io.c                | 71 ++++++++++++++++++++++++++++++-----------------
 3 files changed, 59 insertions(+), 28 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -XXX,XX +XXX,XX @@ struct BlockDriver {
      * according to the current layer, and should not set
      * BDRV_BLOCK_ALLOCATED, but may set BDRV_BLOCK_RAW.  See block.h
      * for the meaning of _DATA, _ZERO, and _OFFSET_VALID.  The block
-     * layer guarantees non-NULL pnum and file.
+     * layer guarantees input aligned to request_alignment, as well as
+     * non-NULL pnum and file.
      */
     int64_t coroutine_fn (*bdrv_co_get_block_status)(BlockDriverState *bs,
         int64_t sector_num, int nb_sectors, int *pnum,
diff --git a/block/blkdebug.c b/block/blkdebug.c
index XXXXXXX..XXXXXXX 100644
--- a/block/blkdebug.c
+++ b/block/blkdebug.c
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn blkdebug_co_pdiscard(BlockDriverState *bs,
     return bdrv_co_pdiscard(bs->file->bs, offset, bytes);
 }
 
+static int64_t coroutine_fn blkdebug_co_get_block_status(
+    BlockDriverState *bs, int64_t sector_num, int nb_sectors, int *pnum,
+    BlockDriverState **file)
+{
+    assert(QEMU_IS_ALIGNED(sector_num | nb_sectors,
+                           DIV_ROUND_UP(bs->bl.request_alignment,
+                                        BDRV_SECTOR_SIZE)));
+    return bdrv_co_get_block_status_from_file(bs, sector_num, nb_sectors,
+                                              pnum, file);
+}
+
 static void blkdebug_close(BlockDriverState *bs)
 {
     BDRVBlkdebugState *s = bs->opaque;
@@ -XXX,XX +XXX,XX @@ static BlockDriver bdrv_blkdebug = {
     .bdrv_co_flush_to_disk  = blkdebug_co_flush,
     .bdrv_co_pwrite_zeroes  = blkdebug_co_pwrite_zeroes,
     .bdrv_co_pdiscard       = blkdebug_co_pdiscard,
-    .bdrv_co_get_block_status = bdrv_co_get_block_status_from_file,
+    .bdrv_co_get_block_status = blkdebug_co_get_block_status,
 
     .bdrv_debug_event           = blkdebug_debug_event,
     .bdrv_debug_breakpoint      = blkdebug_debug_breakpoint,
diff --git a/block/io.c b/block/io.c
index XXXXXXX..XXXXXXX 100644
--- a/block/io.c
+++ b/block/io.c
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
 {
     int64_t total_size;
     int64_t n; /* bytes */
-    int64_t ret;
+    int ret;
     int64_t local_map = 0;
     BlockDriverState *local_file = NULL;
-    int count; /* sectors */
+    int64_t aligned_offset, aligned_bytes;
+    uint32_t align;
 
     assert(pnum);
     *pnum = 0;
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
     }
 
     bdrv_inc_in_flight(bs);
+
+    /* Round out to request_alignment boundaries */
+    /* TODO: until we have a byte-based driver callback, we also have to
+     * round out to sectors, even if that is bigger than request_alignment */
+    align = MAX(bs->bl.request_alignment, BDRV_SECTOR_SIZE);
+    aligned_offset = QEMU_ALIGN_DOWN(offset, align);
+    aligned_bytes = ROUND_UP(offset + bytes, align) - aligned_offset;
+
+    {
+        int count; /* sectors */
+        int64_t longret;
+
+        assert(QEMU_IS_ALIGNED(aligned_offset | aligned_bytes,
+                               BDRV_SECTOR_SIZE));
+        /*
+         * The contract allows us to return pnum smaller than bytes, even
+         * if the next query would see the same status; we truncate the
+         * request to avoid overflowing the driver's 32-bit interface.
+         */
+        longret = bs->drv->bdrv_co_get_block_status(
+            bs, aligned_offset >> BDRV_SECTOR_BITS,
+            MIN(INT_MAX, aligned_bytes) >> BDRV_SECTOR_BITS, &count,
+            &local_file);
+        if (longret < 0) {
+            assert(INT_MIN <= longret);
+            ret = longret;
+            goto out;
+        }
+        if (longret & BDRV_BLOCK_OFFSET_VALID) {
+            local_map = longret & BDRV_BLOCK_OFFSET_MASK;
+        }
+        ret = longret & ~BDRV_BLOCK_OFFSET_MASK;
+        *pnum = count * BDRV_SECTOR_SIZE;
+    }
+
     /*
-     * TODO: Rather than require aligned offsets, we could instead
-     * round to the driver's request_alignment here, then touch up
-     * count afterwards back to the caller's expectations.
-     */
-    assert(QEMU_IS_ALIGNED(offset | bytes, BDRV_SECTOR_SIZE));
-    /*
-     * The contract allows us to return pnum smaller than bytes, even
-     * if the next query would see the same status; we truncate the
-     * request to avoid overflowing the driver's 32-bit interface.
+     * The driver's result must be a multiple of request_alignment.
+     * Clamp pnum and adjust map to original request.
      */
-    bytes = MIN(bytes, BDRV_REQUEST_MAX_BYTES);
-    ret = bs->drv->bdrv_co_get_block_status(bs, offset >> BDRV_SECTOR_BITS,
-                                            bytes >> BDRV_SECTOR_BITS, &count,
-                                            &local_file);
-    if (ret < 0) {
-        goto out;
+    assert(QEMU_IS_ALIGNED(*pnum, align) && align > offset - aligned_offset);
+    *pnum -= offset - aligned_offset;
+    if (*pnum > bytes) {
+        *pnum = bytes;
     }
     if (ret & BDRV_BLOCK_OFFSET_VALID) {
-        local_map = ret & BDRV_BLOCK_OFFSET_MASK;
+        local_map += offset - aligned_offset;
     }
-    *pnum = count * BDRV_SECTOR_SIZE;
 
     if (ret & BDRV_BLOCK_RAW) {
         assert(ret & BDRV_BLOCK_OFFSET_VALID && local_file);
         ret = bdrv_co_block_status(local_file, want_zero, local_map,
                                    *pnum, pnum, &local_map, &local_file);
-        assert(ret < 0 ||
-               QEMU_IS_ALIGNED(*pnum | local_map, BDRV_SECTOR_SIZE));
         goto out;
     }
 
@@ -XXX,XX +XXX,XX @@ early_out:
     if (map) {
         *map = local_map;
     }
-    if (ret >= 0) {
-        ret &= ~BDRV_BLOCK_OFFSET_MASK;
-    } else {
-        assert(INT_MIN <= ret);
-    }
     return ret;
 }
 
-- 
2.13.6

From: Eric Blake <eblake@redhat.com>

Now that bdrv_is_allocated accepts non-aligned inputs, we can
remove the TODO added in commit d6a644bb.

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/io.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/block/io.c b/block/io.c
index XXXXXXX..XXXXXXX 100644
--- a/block/io.c
+++ b/block/io.c
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn bdrv_aligned_preadv(BdrvChild *child,
     }
 
     if (flags & BDRV_REQ_COPY_ON_READ) {
-        /* TODO: Simplify further once bdrv_is_allocated no longer
-         * requires sector alignment */
-        int64_t start = QEMU_ALIGN_DOWN(offset, BDRV_SECTOR_SIZE);
-        int64_t end = QEMU_ALIGN_UP(offset + bytes, BDRV_SECTOR_SIZE);
         int64_t pnum;
 
-        ret = bdrv_is_allocated(bs, start, end - start, &pnum);
+        ret = bdrv_is_allocated(bs, offset, bytes, &pnum);
         if (ret < 0) {
             goto out;
         }
 
-        if (!ret || pnum != end - start) {
+        if (!ret || pnum != bytes) {
             ret = bdrv_co_do_copy_on_readv(child, offset, bytes, qiov);
             goto out;
         }
-- 
2.13.6

From: Eric Blake <eblake@redhat.com>

Now that bdrv_is_allocated accepts non-aligned inputs, we can
remove the TODO added in earlier refactoring.
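
As a rough sketch of what is_zero() reduces to once the widening is
gone: clamp the byte range to the image length, then require that the
whole range reports the zero status.  The helper names and the
EXAMPLE_BLOCK_ZERO value are assumptions made for this illustration
only; the real flag is BDRV_BLOCK_ZERO from block.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder flag bit for this sketch, standing in for BDRV_BLOCK_ZERO. */
#define EXAMPLE_BLOCK_ZERO 0x02

/* Shorten bytes so that [offset, offset + bytes) ends at the image end. */
static int64_t clamp_to_image(int64_t offset, int64_t bytes,
                              int64_t image_len)
{
    assert(offset >= 0 && bytes >= 0 && offset <= image_len);
    if (offset + bytes > image_len) {
        bytes = image_len - offset;
    }
    return bytes;
}

/*
 * Decide whether a status query answers "all zero" for the full range:
 * res is the status (negative on error) and nr the number of bytes it
 * covers.  An empty range counts as zero without needing a query.
 */
static bool range_reads_as_zero(int res, int64_t nr, int64_t bytes)
{
    if (!bytes) {
        return true;
    }
    return res >= 0 && (res & EXAMPLE_BLOCK_ZERO) && nr == bytes;
}
```

A status that covers only part of the range is treated as "not zero",
which is conservative: the caller then falls back to the slower path.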

Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2.c | 12 +++---------
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -XXX,XX +XXX,XX @@ static bool is_zero(BlockDriverState *bs, int64_t offset, int64_t bytes)
 {
     int64_t nr;
     int res;
-    int64_t start;
-
-    /* TODO: Widening to sector boundaries should only be needed as
-     * long as we can't query finer granularity. */
-    start = QEMU_ALIGN_DOWN(offset, BDRV_SECTOR_SIZE);
-    bytes = QEMU_ALIGN_UP(offset + bytes, BDRV_SECTOR_SIZE) - start;
 
     /* Clamp to image length, before checking status of underlying sectors */
-    if (start + bytes > bs->total_sectors * BDRV_SECTOR_SIZE) {
-        bytes = bs->total_sectors * BDRV_SECTOR_SIZE - start;
+    if (offset + bytes > bs->total_sectors * BDRV_SECTOR_SIZE) {
+        bytes = bs->total_sectors * BDRV_SECTOR_SIZE - offset;
     }
 
     if (!bytes) {
         return true;
     }
-    res = bdrv_block_status_above(bs, NULL, start, bytes, &nr, NULL, NULL);
+    res = bdrv_block_status_above(bs, NULL, offset, bytes, &nr, NULL, NULL);
     return res >= 0 && (res & BDRV_BLOCK_ZERO) && nr == bytes;
 }
 
-- 
2.13.6

From: Eric Blake <eblake@redhat.com>

Previously, the alloc command required that input parameters be
sector-aligned and clamped to 32 bits, because the underlying
bdrv_is_allocated used a 32-bit parameter and asserted aligned
inputs.  But now that we have fixed block status to report a
64-bit bytes value, and to properly round requests on behalf of
guests, we can pass any values, and can use qemu-io to add
coverage that our rounding is correct regardless of the guest
alignment constraints.

Update iotest 177 to intentionally probe block status at
unaligned boundaries as well as with a bytes value that does not
map to 32-bit sectors, which also required tweaking the image
prep to leave an unallocated portion to the image under test.
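
The expected values in 177.out can be checked by hand with a small
model.  This is only a sketch under stated assumptions: the overlay is
128 MiB with the first 110 MiB allocated (as set up by the test), and
clamp_count() and allocated_bytes() are hypothetical helpers, not
qemu-io functions.

```c
#include <assert.h>
#include <stdint.h>

#define MiB (1024 * 1024LL)

/* Clamp a query length so it does not run past the end of the image. */
static int64_t clamp_count(int64_t offset, int64_t count, int64_t image_len)
{
    assert(offset >= 0 && count >= 0 && offset <= image_len);
    return count > image_len - offset ? image_len - offset : count;
}

/* Bytes of [offset, offset + count) that fall inside [0, alloc_end). */
static int64_t allocated_bytes(int64_t offset, int64_t count,
                               int64_t alloc_end)
{
    int64_t end = offset + count;

    if (offset >= alloc_end) {
        return 0;
    }
    return (end < alloc_end ? end : alloc_end) - offset;
}
```

"alloc 0x6dffff0 1000" starts 16 bytes before the 110 MiB boundary, so
16 of the 1000 bytes are allocated.  "alloc 127m 5P" is cut down to the
remaining 1 MiB of the image, none of which is allocated.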

Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 qemu-io-cmds.c             | 13 -------------
 tests/qemu-iotests/177     | 12 ++++++++++--
 tests/qemu-iotests/177.out | 19 ++++++++++++++-----
 3 files changed, 24 insertions(+), 20 deletions(-)

diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
index XXXXXXX..XXXXXXX 100644
--- a/qemu-io-cmds.c
+++ b/qemu-io-cmds.c
@@ -XXX,XX +XXX,XX @@ static int alloc_f(BlockBackend *blk, int argc, char **argv)
     if (offset < 0) {
         print_cvtnum_err(offset, argv[1]);
         return 0;
-    } else if (!QEMU_IS_ALIGNED(offset, BDRV_SECTOR_SIZE)) {
-        printf("%" PRId64 " is not a sector-aligned value for 'offset'\n",
-               offset);
-        return 0;
     }
 
     if (argc == 3) {
@@ -XXX,XX +XXX,XX @@ static int alloc_f(BlockBackend *blk, int argc, char **argv)
         if (count < 0) {
             print_cvtnum_err(count, argv[2]);
             return 0;
-        } else if (count > INT_MAX * BDRV_SECTOR_SIZE) {
-            printf("length argument cannot exceed %llu, given %s\n",
-                   INT_MAX * BDRV_SECTOR_SIZE, argv[2]);
-            return 0;
         }
     } else {
         count = BDRV_SECTOR_SIZE;
     }
-    if (!QEMU_IS_ALIGNED(count, BDRV_SECTOR_SIZE)) {
-        printf("%" PRId64 " is not a sector-aligned value for 'count'\n",
-               count);
-        return 0;
-    }
 
     remaining = count;
     sum_alloc = 0;
diff --git a/tests/qemu-iotests/177 b/tests/qemu-iotests/177
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/177
+++ b/tests/qemu-iotests/177
@@ -XXX,XX +XXX,XX @@ echo "== setting up files =="
 TEST_IMG="$TEST_IMG.base" _make_test_img $size
 $QEMU_IO -c "write -P 11 0 $size" "$TEST_IMG.base" | _filter_qemu_io
 _make_test_img -b "$TEST_IMG.base"
-$QEMU_IO -c "write -P 22 0 $size" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "write -P 22 0 110M" "$TEST_IMG" | _filter_qemu_io
 
 # Limited to 64k max-transfer
 echo
@@ -XXX,XX +XXX,XX @@ $QEMU_IO -c "open -o $options,$limits blkdebug::$TEST_IMG" \
          -c "discard 80000001 30M" | _filter_qemu_io
 
 echo
+echo "== block status smaller than alignment =="
+limits=align=4k
+$QEMU_IO -c "open -o $options,$limits blkdebug::$TEST_IMG" \
+	 -c "alloc 1 1" -c "alloc 0x6dffff0 1000" -c "alloc 127m 5P" \
+	 -c map | _filter_qemu_io
+
+echo
 echo "== verify image content =="
 
 function verify_io()
@@ -XXX,XX +XXX,XX @@ function verify_io()
     echo read -P 0 32M 32M
     echo read -P 22 64M 13M
     echo read -P $discarded 77M 29M
-    echo read -P 22 106M 22M
+    echo read -P 22 106M 4M
+    echo read -P 11 110M 18M
 }
 
 verify_io | $QEMU_IO -r "$TEST_IMG" | _filter_qemu_io
diff --git a/tests/qemu-iotests/177.out b/tests/qemu-iotests/177.out
index XXXXXXX..XXXXXXX 100644
--- a/tests/qemu-iotests/177.out
+++ b/tests/qemu-iotests/177.out
@@ -XXX,XX +XXX,XX @@ Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=134217728
 wrote 134217728/134217728 bytes at offset 0
 128 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728 backing_file=TEST_DIR/t.IMGFMT.base
-wrote 134217728/134217728 bytes at offset 0
-128 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 115343360/115343360 bytes at offset 0
+110 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
 == constrained alignment and max-transfer ==
 wrote 131072/131072 bytes at offset 1000
@@ -XXX,XX +XXX,XX @@ wrote 33554432/33554432 bytes at offset 33554432
 discard 31457280/31457280 bytes at offset 80000001
 30 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
+== block status smaller than alignment ==
+1/1 bytes allocated at offset 1 bytes
+16/1000 bytes allocated at offset 110 MiB
+0/1048576 bytes allocated at offset 127 MiB
+110 MiB (0x6e00000) bytes     allocated at offset 0 bytes (0x0)
+18 MiB (0x1200000) bytes not allocated at offset 110 MiB (0x6e00000)
+
 == verify image content ==
 read 1000/1000 bytes at offset 0
 1000 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
@@ -XXX,XX +XXX,XX @@ read 13631488/13631488 bytes at offset 67108864
 13 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 read 30408704/30408704 bytes at offset 80740352
 29 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-read 23068672/23068672 bytes at offset 111149056
-22 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 4194304/4194304 bytes at offset 111149056
+4 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 18874368/18874368 bytes at offset 115343360
+18 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 Offset          Length          File
 0               0x800000        TEST_DIR/t.IMGFMT
 0x900000        0x2400000       TEST_DIR/t.IMGFMT
 0x3c00000       0x1100000       TEST_DIR/t.IMGFMT
-0x6a00000       0x1600000       TEST_DIR/t.IMGFMT
+0x6a00000       0x400000        TEST_DIR/t.IMGFMT
 No errors were found on the image.
 *** done
-- 
2.13.6

From: Max Reitz <mreitz@redhat.com>

qemu-img commit invalidates all images between base and top.  This
should be mentioned in the man page.

Suggested-by: Ping Li <pingl@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 qemu-img.texi | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/qemu-img.texi b/qemu-img.texi
index XXXXXXX..XXXXXXX 100644
--- a/qemu-img.texi
+++ b/qemu-img.texi
@@ -XXX,XX +XXX,XX @@ If the backing chain of the given image file @var{filename} has more than one
 layer, the backing file into which the changes will be committed may be
 specified as @var{base} (which has to be part of @var{filename}'s backing
 chain). If @var{base} is not specified, the immediate backing file of the top
-image (which is @var{filename}) will be used. For reasons of consistency,
-explicitly specifying @var{base} will always imply @code{-d} (since emptying an
-image after committing to an indirect backing file would lead to different data
-being read from the image due to content in the intermediate backing chain
-overruling the commit target).
+image (which is @var{filename}) will be used. Note that after a commit operation
+all images between @var{base} and the top image will be invalid and may return
+garbage data when read. For this reason, @code{-b} implies @code{-d} (so that
+the top image stays valid).
 
 @item compare [-f @var{fmt}] [-F @var{fmt}] [-T @var{src_cache}] [-p] [-s] [-q] @var{filename1} @var{filename2}
 
-- 
2.13.6

From: Alberto Garcia <berto@igalia.com>

BDRV_SECTOR_BITS is defined to be 9 in block.h (and BDRV_SECTOR_SIZE
is calculated from that), but there are still a couple of places where
we are using the literal value instead of the macro.
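
For illustration, the relationship between the two macros and the two
converted expressions looks roughly like this.  The macro definitions
are repeated here so the sketch stands alone (the value 9 is the one
stated above); cluster_sectors() and sector_to_byte() are hypothetical
wrappers around the expressions the diff touches.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies for this sketch; the real definitions live in block.h. */
#define BDRV_SECTOR_BITS 9
#define BDRV_SECTOR_SIZE (1ULL << BDRV_SECTOR_BITS)

/* Number of sectors per cluster for a given cluster_bits value. */
static int cluster_sectors(int cluster_bits)
{
    assert(cluster_bits >= BDRV_SECTOR_BITS && cluster_bits < 31);
    return 1 << (cluster_bits - BDRV_SECTOR_BITS);
}

/* Convert a sector number to a byte offset. */
static int64_t sector_to_byte(int64_t sector_num)
{
    assert(sector_num >= 0);
    return sector_num << BDRV_SECTOR_BITS;
}
```

With the default 64k clusters (cluster_bits = 16) this gives 128
sectors per cluster, the same result as the literal 9 produced before.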

Signed-off-by: Alberto Garcia <berto@igalia.com>
Message-id: 20171009153856.20387-1-berto@igalia.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/qcow2.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -XXX,XX +XXX,XX @@ static int qcow2_do_open(BlockDriverState *bs, QDict *options, int flags,
 
     s->cluster_bits = header.cluster_bits;
     s->cluster_size = 1 << s->cluster_bits;
-    s->cluster_sectors = 1 << (s->cluster_bits - 9);
+    s->cluster_sectors = 1 << (s->cluster_bits - BDRV_SECTOR_BITS);
 
     /* Initialise version 3 header fields */
     if (header.version == 2) {
@@ -XXX,XX +XXX,XX @@ static int64_t coroutine_fn qcow2_co_get_block_status(BlockDriverState *bs,
 
     bytes = MIN(INT_MAX, nb_sectors * BDRV_SECTOR_SIZE);
     qemu_co_mutex_lock(&s->lock);
-    ret = qcow2_get_cluster_offset(bs, sector_num << 9, &bytes,
+    ret = qcow2_get_cluster_offset(bs, sector_num << BDRV_SECTOR_BITS, &bytes,
                                    &cluster_offset);
     qemu_co_mutex_unlock(&s->lock);
     if (ret < 0) {
-- 
2.13.6

From: Max Reitz <mreitz@redhat.com>

Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20170929170843.3711-1-mreitz@redhat.com
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 tests/qemu-iotests/127     | 97 ++++++++++++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/127.out | 14 +++++++
 tests/qemu-iotests/group   |  1 +
 3 files changed, 112 insertions(+)
 create mode 100755 tests/qemu-iotests/127
 create mode 100644 tests/qemu-iotests/127.out

diff --git a/tests/qemu-iotests/127 b/tests/qemu-iotests/127
new file mode 100755
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/tests/qemu-iotests/127
@@ -XXX,XX +XXX,XX @@
+#!/bin/bash
+#
+# Test case for mirroring with dataplane
+#
+# Copyright (C) 2017 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+# creator
+owner=mreitz@redhat.com
+
+seq=$(basename $0)
+echo "QA output created by $seq"
+
+here=$PWD
+status=1    # failure is the default!
+
+_cleanup()
+{
+    _cleanup_qemu
+    _cleanup_test_img
+    _rm_test_img "$TEST_IMG.overlay0"
+    _rm_test_img "$TEST_IMG.overlay1"
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and qemu instance handling
+. ./common.rc
+. ./common.filter
+. ./common.qemu
+
+_supported_fmt qcow2
+_supported_proto file
+_supported_os Linux
+
+IMG_SIZE=64K
+
+_make_test_img $IMG_SIZE
+TEST_IMG="$TEST_IMG.overlay0" _make_test_img -b "$TEST_IMG" $IMG_SIZE
+TEST_IMG="$TEST_IMG.overlay1" _make_test_img -b "$TEST_IMG" $IMG_SIZE
+
+# So that we actually have something to mirror and the job does not return
+# immediately (which may be bad because then we cannot know whether the
+# 'return' or the 'BLOCK_JOB_READY' comes first).
+$QEMU_IO -c 'write 0 42' "$TEST_IMG.overlay0" | _filter_qemu_io
+
+# We cannot use virtio-blk here because that does not actually set the attached
+# BB's AioContext in qtest mode
+_launch_qemu \
+    -object iothread,id=iothr \
+    -blockdev node-name=source,driver=$IMGFMT,file.driver=file,file.filename="$TEST_IMG.overlay0" \
+    -device virtio-scsi,id=scsi-bus,iothread=iothr \
+    -device scsi-hd,bus=scsi-bus.0,drive=source
+
+_send_qemu_cmd $QEMU_HANDLE \
+    "{ 'execute': 'qmp_capabilities' }" \
+    'return'
+
+_send_qemu_cmd $QEMU_HANDLE \
+    "{ 'execute': 'drive-mirror',
+       'arguments': {
+           'job-id': 'mirror',
+           'device': 'source',
+           'target': '$TEST_IMG.overlay1',
+           'mode':   'existing',
+           'sync':   'top'
+       } }" \
+    'BLOCK_JOB_READY'
+
+# The backing BDS should be assigned the overlay's AioContext
+_send_qemu_cmd $QEMU_HANDLE \
+    "{ 'execute': 'block-job-complete',
+       'arguments': { 'device': 'mirror' } }" \
+    'BLOCK_JOB_COMPLETED'
+
+_send_qemu_cmd $QEMU_HANDLE \
+    "{ 'execute': 'quit' }" \
+    'return'
+
+wait=yes _cleanup_qemu
+
+# success, all done
+echo '*** done'
+rm -f $seq.full
+status=0
diff --git a/tests/qemu-iotests/127.out b/tests/qemu-iotests/127.out
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/tests/qemu-iotests/127.out
@@ -XXX,XX +XXX,XX @@
+QA output created by 127
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=65536
+Formatting 'TEST_DIR/t.IMGFMT.overlay0', fmt=IMGFMT size=65536 backing_file=TEST_DIR/t.IMGFMT
+Formatting 'TEST_DIR/t.IMGFMT.overlay1', fmt=IMGFMT size=65536 backing_file=TEST_DIR/t.IMGFMT
+wrote 42/42 bytes at offset 0
+42 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+{"return": {}}
+{"return": {}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_READY", "data": {"device": "mirror", "len": 65536, "offset": 65536, "speed": 0, "type": "mirror"}}
+{"return": {}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "mirror", "len": 65536, "offset": 65536, "speed": 0, "type": "mirror"}}
+{"return": {}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false}}
+*** done
diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
index XXXXXXX..XXXXXXX 100644
--- a/tests/qemu-iotests/group
+++ b/tests/qemu-iotests/group
@@ -XXX,XX +XXX,XX @@
 124 rw auto backing
 125 rw auto
 126 rw auto backing
+127 rw auto backing quick
 128 rw auto quick
 129 rw auto quick
 130 rw auto quick
-- 
2.13.6

From: Max Reitz <mreitz@redhat.com>

Tests 067 and 087 filter the actual image size because it depends on the
host filesystem (and is not part of the respective test).  Since this is
generally true, we should have a common filter function for this, so
let's pull out the sed line from both tests into such a function.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20171009163456.485-2-mreitz@redhat.com
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 tests/qemu-iotests/067           | 2 +-
 tests/qemu-iotests/087           | 2 +-
 tests/qemu-iotests/common.filter | 6 ++++++
 3 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/tests/qemu-iotests/067 b/tests/qemu-iotests/067
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/067
+++ b/tests/qemu-iotests/067
@@ -XXX,XX +XXX,XX @@ _filter_qmp_events()
 function run_qemu()
 {
     do_run_qemu "$@" 2>&1 | _filter_testdir | _filter_qmp | _filter_qemu \
-                          | sed -e 's/\("actual-size":\s*\)[0-9]\+/\1SIZE/g' \
+                          | _filter_actual_image_size \
                           | _filter_generated_node_ids | _filter_qmp_events
 }
 
diff --git a/tests/qemu-iotests/087 b/tests/qemu-iotests/087
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/087
+++ b/tests/qemu-iotests/087
@@ -XXX,XX +XXX,XX @@ function run_qemu()
 {
     do_run_qemu "$@" 2>&1 | _filter_testdir | _filter_qmp \
                           | _filter_qemu | _filter_imgfmt \
-                          | sed -e 's/\("actual-size":\s*\)[0-9]\+/\1SIZE/g'
+                          | _filter_actual_image_size
 }
 
 size=128M
diff --git a/tests/qemu-iotests/common.filter b/tests/qemu-iotests/common.filter
index XXXXXXX..XXXXXXX 100644
--- a/tests/qemu-iotests/common.filter
+++ b/tests/qemu-iotests/common.filter
@@ -XXX,XX +XXX,XX @@ _filter_block_job_len()
     sed -e 's/, "len": [0-9]\+,/, "len": LEN,/g'
 }
 
+# replace actual image size (depends on the host filesystem)
+_filter_actual_image_size()
+{
+    sed -s 's/\("actual-size":\s*\)[0-9]\+/\1SIZE/g'
+}
+
 # replace driver-specific options in the "Formatting..." line
 _filter_img_create()
 {
-- 
2.13.6

From: Max Reitz <mreitz@redhat.com>

Whenever the actual image size is not part of the test, it should be
filtered as it depends on the host filesystem.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20171009163456.485-3-mreitz@redhat.com
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 tests/qemu-iotests/184     |  3 ++-
 tests/qemu-iotests/184.out |  6 +++---
 tests/qemu-iotests/191     |  4 ++--
 tests/qemu-iotests/191.out | 46 +++++++++++++++++++++++-----------------------
 4 files changed, 30 insertions(+), 29 deletions(-)

diff --git a/tests/qemu-iotests/184 b/tests/qemu-iotests/184
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/184
+++ b/tests/qemu-iotests/184
@@ -XXX,XX +XXX,XX @@ function do_run_qemu()
 function run_qemu()
 {
     do_run_qemu "$@" 2>&1 | _filter_testdir | _filter_qemu | _filter_qmp\
-                          | _filter_qemu_io | _filter_generated_node_ids
+                          | _filter_qemu_io | _filter_generated_node_ids \
+                          | _filter_actual_image_size
 }
 
 _make_test_img 64M
diff --git a/tests/qemu-iotests/184.out b/tests/qemu-iotests/184.out
index XXXXXXX..XXXXXXX 100644
--- a/tests/qemu-iotests/184.out
+++ b/tests/qemu-iotests/184.out
@@ -XXX,XX +XXX,XX @@ Testing:
                 "filename": "json:{\"throttle-group\": \"group0\", \"driver\": \"throttle\", \"file\": {\"driver\": \"qcow2\", \"file\": {\"driver\": \"file\", \"filename\": \"TEST_DIR/t.qcow2\"}}}",
                 "cluster-size": 65536,
                 "format": "throttle",
-                "actual-size": 200704,
+                "actual-size": SIZE,
                 "dirty-flag": false
             },
             "iops_wr": 0,
@@ -XXX,XX +XXX,XX @@ Testing:
                 "filename": "TEST_DIR/t.qcow2",
                 "cluster-size": 65536,
                 "format": "qcow2",
-                "actual-size": 200704,
+                "actual-size": SIZE,
                 "format-specific": {
                     "type": "qcow2",
                     "data": {
@@ -XXX,XX +XXX,XX @@ Testing:
                 "virtual-size": 197120,
                 "filename": "TEST_DIR/t.qcow2",
                 "format": "file",
-                "actual-size": 200704,
+                "actual-size": SIZE,
                 "dirty-flag": false
             },
             "iops_wr": 0,
diff --git a/tests/qemu-iotests/191 b/tests/qemu-iotests/191
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/191
+++ b/tests/qemu-iotests/191
@@ -XXX,XX +XXX,XX @@ echo === Check that both top and top2 point to base now ===
 echo
 
 _send_qemu_cmd $h "{ 'execute': 'query-named-block-nodes' }" "^}" |
-    _filter_generated_node_ids
+    _filter_generated_node_ids | _filter_actual_image_size
 
 _send_qemu_cmd $h "{ 'execute': 'quit' }" "^}"
 wait=1 _cleanup_qemu
@@ -XXX,XX +XXX,XX @@ echo === Check that both top and top2 point to base now ===
 echo
 
 _send_qemu_cmd $h "{ 'execute': 'query-named-block-nodes' }" "^}" |
-    _filter_generated_node_ids
+    _filter_generated_node_ids | _filter_actual_image_size
 
 _send_qemu_cmd $h "{ 'execute': 'quit' }" "^}"
 wait=1 _cleanup_qemu
diff --git a/tests/qemu-iotests/191.out b/tests/qemu-iotests/191.out
index XXXXXXX..XXXXXXX 100644
--- a/tests/qemu-iotests/191.out
+++ b/tests/qemu-iotests/191.out
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                     "filename": "TEST_DIR/t.qcow2.base",
                     "cluster-size": 65536,
                     "format": "qcow2",
-                    "actual-size": 397312,
+                    "actual-size": SIZE,
                     "format-specific": {
                         "type": "qcow2",
                         "data": {
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                 "filename": "TEST_DIR/t.qcow2.ovl2",
                 "cluster-size": 65536,
                 "format": "qcow2",
-                "actual-size": 200704,
+                "actual-size": SIZE,
                 "format-specific": {
                     "type": "qcow2",
                     "data": {
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                 "virtual-size": 197120,
                 "filename": "TEST_DIR/t.qcow2.ovl2",
                 "format": "file",
-                "actual-size": 200704,
+                "actual-size": SIZE,
                 "dirty-flag": false
             },
             "iops_wr": 0,
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                     "filename": "TEST_DIR/t.qcow2.base",
                     "cluster-size": 65536,
                     "format": "qcow2",
-                    "actual-size": 397312,
+                    "actual-size": SIZE,
                     "format-specific": {
                         "type": "qcow2",
                         "data": {
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                 "filename": "TEST_DIR/t.qcow2",
                 "cluster-size": 65536,
                 "format": "qcow2",
-                "actual-size": 200704,
+                "actual-size": SIZE,
                 "format-specific": {
                     "type": "qcow2",
                     "data": {
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                 "virtual-size": 197120,
                 "filename": "TEST_DIR/t.qcow2",
                 "format": "file",
-                "actual-size": 200704,
+                "actual-size": SIZE,
                 "dirty-flag": false
             },
             "iops_wr": 0,
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                     "filename": "TEST_DIR/t.qcow2.base",
                     "cluster-size": 65536,
                     "format": "qcow2",
-                    "actual-size": 397312,
+                    "actual-size": SIZE,
                     "format-specific": {
                         "type": "qcow2",
                         "data": {
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                 "filename": "TEST_DIR/t.qcow2.mid",
                 "cluster-size": 65536,
                 "format": "qcow2",
-                "actual-size": 397312,
+                "actual-size": SIZE,
                 "format-specific": {
                     "type": "qcow2",
                     "data": {
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                 "virtual-size": 393216,
                 "filename": "TEST_DIR/t.qcow2.mid",
                 "format": "file",
-                "actual-size": 397312,
+                "actual-size": SIZE,
                 "dirty-flag": false
             },
             "iops_wr": 0,
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                 "filename": "TEST_DIR/t.qcow2.base",
                 "cluster-size": 65536,
                 "format": "qcow2",
-                "actual-size": 397312,
+                "actual-size": SIZE,
                 "format-specific": {
                     "type": "qcow2",
                     "data": {
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                 "virtual-size": 393216,
                 "filename": "TEST_DIR/t.qcow2.base",
                 "format": "file",
-                "actual-size": 397312,
+                "actual-size": SIZE,
                 "dirty-flag": false
             },
             "iops_wr": 0,
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                     "filename": "TEST_DIR/t.qcow2.base",
                     "cluster-size": 65536,
                     "format": "qcow2",
-                    "actual-size": 397312,
+                    "actual-size": SIZE,
                     "format-specific": {
                         "type": "qcow2",
                         "data": {
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                 "filename": "TEST_DIR/t.qcow2.ovl2",
                 "cluster-size": 65536,
                 "format": "qcow2",
-                "actual-size": 200704,
+                "actual-size": SIZE,
                 "format-specific": {
                     "type": "qcow2",
                     "data": {
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                 "virtual-size": 197120,
                 "filename": "TEST_DIR/t.qcow2.ovl2",
                 "format": "file",
-                "actual-size": 200704,
+                "actual-size": SIZE,
                 "dirty-flag": false
             },
             "iops_wr": 0,
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                         "filename": "TEST_DIR/t.qcow2.base",
                         "cluster-size": 65536,
                         "format": "qcow2",
-                        "actual-size": 397312,
+                        "actual-size": SIZE,
                         "format-specific": {
                             "type": "qcow2",
                             "data": {
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                     "filename": "TEST_DIR/t.qcow2.ovl2",
                     "cluster-size": 65536,
                     "format": "qcow2",
-                    "actual-size": 200704,
+                    "actual-size": SIZE,
                     "format-specific": {
                         "type": "qcow2",
                         "data": {
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                 "filename": "TEST_DIR/t.qcow2.ovl3",
                 "cluster-size": 65536,
                 "format": "qcow2",
-                "actual-size": 200704,
+                "actual-size": SIZE,
                 "format-specific": {
                     "type": "qcow2",
                     "data": {
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                 "virtual-size": 197120,
                 "filename": "TEST_DIR/t.qcow2.ovl3",
                 "format": "file",
-                "actual-size": 200704,
+                "actual-size": SIZE,
                 "dirty-flag": false
             },
             "iops_wr": 0,
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                 "filename": "TEST_DIR/t.qcow2.base",
                 "cluster-size": 65536,
                 "format": "qcow2",
-                "actual-size": 397312,
+                "actual-size": SIZE,
                 "format-specific": {
                     "type": "qcow2",
                     "data": {
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                 "virtual-size": 393216,
                 "filename": "TEST_DIR/t.qcow2.base",
                 "format": "file",
-                "actual-size": 397312,
+                "actual-size": SIZE,
                 "dirty-flag": false
             },
             "iops_wr": 0,
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                     "filename": "TEST_DIR/t.qcow2.base",
                     "cluster-size": 65536,
                     "format": "qcow2",
-                    "actual-size": 397312,
+                    "actual-size": SIZE,
                     "format-specific": {
                         "type": "qcow2",
                         "data": {
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                 "filename": "TEST_DIR/t.qcow2",
                 "cluster-size": 65536,
                 "format": "qcow2",
-                "actual-size": 200704,
+                "actual-size": SIZE,
                 "format-specific": {
                     "type": "qcow2",
                     "data": {
@@ -XXX,XX +XXX,XX @@ wrote 65536/65536 bytes at offset 1048576
                 "virtual-size": 197120,
                 "filename": "TEST_DIR/t.qcow2",
                 "format": "file",
-                "actual-size": 200704,
+                "actual-size": SIZE,
                 "dirty-flag": false
             },
             "iops_wr": 0,
-- 
2.13.6

From: Max Reitz <mreitz@redhat.com>

bdrv_truncate() has an errp parameter which is always set when an error
occurs.  Let's use that instead of a plain strerror().

Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20171009155431.14093-1-mreitz@redhat.com
Reviewed-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/qcow2.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -XXX,XX +XXX,XX @@ static int qcow2_truncate(BlockDriverState *bs, int64_t offset,
             return last_cluster;
         }
         if ((last_cluster + 1) * s->cluster_size < old_file_size) {
-            ret = bdrv_truncate(bs->file, (last_cluster + 1) * s->cluster_size,
-                                PREALLOC_MODE_OFF, NULL);
-            if (ret < 0) {
-                warn_report("Failed to truncate the tail of the image: %s",
-                            strerror(-ret));
-                ret = 0;
+            Error *local_err = NULL;
+
+            bdrv_truncate(bs->file, (last_cluster + 1) * s->cluster_size,
+                          PREALLOC_MODE_OFF, &local_err);
+            if (local_err) {
+                warn_reportf_err(local_err,
+                                 "Failed to truncate the tail of the image: ");
             }
         }
     } else {
-- 
2.13.6

From: Max Reitz <mreitz@redhat.com>

A qcow2 image file's length is not required to be a
multiple of the cluster size.  However, qcow2_refcount_area() expects an
aligned value for its @start_offset parameter, so we need to round
@old_file_size up to the next cluster boundary.
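
As a rough illustration of the rounding described above, the sketch below
uses a hypothetical helper, round_up_pow2(), which is an assumption made for
this example and not QEMU's own ROUND_UP macro.  It relies on the alignment
being a power of two, which holds for qcow2 cluster sizes.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper for illustration only (not QEMU's ROUND_UP macro):
 * round n up to the next multiple of align.  align must be a non-zero
 * power of two, as qcow2 cluster sizes are.
 */
static inline uint64_t round_up_pow2(uint64_t n, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    /* Adding align - 1 and masking off the low bits rounds upwards;
     * values that are already aligned are returned unchanged. */
    return (n + align - 1) & ~(align - 1);
}
```

With an illustrative cluster size of 65536, a file length of 200704 is
rounded up to 262144, while an already aligned length of 196608 stays
as it is.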

Reported-by: Ping Li <pingl@redhat.com>
Bug: https://bugzilla.redhat.com/show_bug.cgi?id=1414049
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20171009215533.12530-2-mreitz@redhat.com
Cc: qemu-stable@nongnu.org
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/qcow2.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/qcow2.c b/block/qcow2.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -XXX,XX +XXX,XX @@ static int qcow2_truncate(BlockDriverState *bs, int64_t offset,
                              "Failed to inquire current file length");
             return old_file_size;
         }
+        old_file_size = ROUND_UP(old_file_size, s->cluster_size);
 
         nb_new_data_clusters = DIV_ROUND_UP(offset - old_length,
                                             s->cluster_size);
-- 
2.13.6

From: Max Reitz <mreitz@redhat.com>

Some qcow2 functions (at least perform_cow()) expect s->lock to be
taken.  Therefore, if we want to make use of them, we should execute
preallocate() (as "preallocate_co") in a coroutine so that we can use
the qemu_co_mutex_* functions.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20171009215533.12530-3-mreitz@redhat.com
Cc: qemu-stable@nongnu.org
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/qcow2.c | 41 ++++++++++++++++++++++++++++++++++-------
 1 file changed, 34 insertions(+), 7 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -XXX,XX +XXX,XX @@ static int qcow2_set_up_encryption(BlockDriverState *bs, const char *encryptfmt,
 }
 
 
+typedef struct PreallocCo {
+    BlockDriverState *bs;
+    uint64_t offset;
+    uint64_t new_length;
+
+    int ret;
+} PreallocCo;
+
 /**
  * Preallocates metadata structures for data clusters between @offset (in the
  * guest disk) and @new_length (which is thus generally the new guest disk
@@ -XXX,XX +XXX,XX @@ static int qcow2_set_up_encryption(BlockDriverState *bs, const char *encryptfmt,
  *
  * Returns: 0 on success, -errno on failure.
  */
-static int preallocate(BlockDriverState *bs,
-                       uint64_t offset, uint64_t new_length)
+static void coroutine_fn preallocate_co(void *opaque)
 {
+    PreallocCo *params = opaque;
+    BlockDriverState *bs = params->bs;
+    uint64_t offset = params->offset;
+    uint64_t new_length = params->new_length;
     BDRVQcow2State *s = bs->opaque;
     uint64_t bytes;
     uint64_t host_offset = 0;
@@ -XXX,XX +XXX,XX @@ static int preallocate(BlockDriverState *bs,
     int ret;
     QCowL2Meta *meta;
 
-    if (qemu_in_coroutine()) {
-        qemu_co_mutex_lock(&s->lock);
-    }
+    qemu_co_mutex_lock(&s->lock);
 
     assert(offset <= new_length);
     bytes = new_length - offset;
@@ -XXX,XX +XXX,XX @@ static int preallocate(BlockDriverState *bs,
     ret = 0;
 
 done:
+    qemu_co_mutex_unlock(&s->lock);
+    params->ret = ret;
+}
+
+static int preallocate(BlockDriverState *bs,
+                       uint64_t offset, uint64_t new_length)
+{
+    PreallocCo params = {
+        .bs         = bs,
+        .offset     = offset,
+        .new_length = new_length,
+        .ret        = -EINPROGRESS,
+    };
+
     if (qemu_in_coroutine()) {
-        qemu_co_mutex_unlock(&s->lock);
+        preallocate_co(&params);
+    } else {
+        Coroutine *co = qemu_coroutine_create(preallocate_co, &params);
+        bdrv_coroutine_enter(bs, co);
+        BDRV_POLL_WHILE(bs, params.ret == -EINPROGRESS);
     }
-    return ret;
+    return params.ret;
 }
 
 /* qcow2_refcount_metadata_size:
-- 
2.13.6

From: Max Reitz <mreitz@redhat.com>

Apparently it would be a good idea to test a cluster size of 64k, too,
rather than only 512.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20171009215533.12530-4-mreitz@redhat.com
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 tests/qemu-iotests/125     |   7 +-
 tests/qemu-iotests/125.out | 480 ++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 437 insertions(+), 50 deletions(-)

diff --git a/tests/qemu-iotests/125 b/tests/qemu-iotests/125
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/125
+++ b/tests/qemu-iotests/125
@@ -XXX,XX +XXX,XX @@ fi
 # in B
 CREATION_SIZE=$((2 * 1024 * 1024 - 48 * 1024))
 
+# 512 is the actual test -- but it's good to test 64k as well, just to be sure.
+for cluster_size in 512 64k; do
 # in kB
 for GROWTH_SIZE in 16 48 80; do
     for create_mode in off metadata falloc full; do
         for growth_mode in off metadata falloc full; do
-            echo "--- growth_size=$GROWTH_SIZE create_mode=$create_mode growth_mode=$growth_mode ---"
+            echo "--- cluster_size=$cluster_size growth_size=$GROWTH_SIZE create_mode=$create_mode growth_mode=$growth_mode ---"
 
-            IMGOPTS="preallocation=$create_mode,cluster_size=512" _make_test_img ${CREATION_SIZE}
+            IMGOPTS="preallocation=$create_mode,cluster_size=$cluster_size" _make_test_img ${CREATION_SIZE}
             $QEMU_IMG resize -f "$IMGFMT" --preallocation=$growth_mode "$TEST_IMG" +${GROWTH_SIZE}K
 
             host_size_0=$(get_image_size_on_host)
@@ -XXX,XX +XXX,XX @@ for GROWTH_SIZE in 16 48 80; do
         done
     done
 done
+done
 
 # success, all done
 echo '*** done'
diff --git a/tests/qemu-iotests/125.out b/tests/qemu-iotests/125.out
index XXXXXXX..XXXXXXX 100644
--- a/tests/qemu-iotests/125.out
+++ b/tests/qemu-iotests/125.out
@@ -XXX,XX +XXX,XX @@
 QA output created by 125
---- growth_size=16 create_mode=off growth_mode=off ---
+--- cluster_size=512 growth_size=16 create_mode=off growth_mode=off ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 16384/16384 bytes at offset 2048000
 16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=16 create_mode=off growth_mode=metadata ---
+--- cluster_size=512 growth_size=16 create_mode=off growth_mode=metadata ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 16384/16384 bytes at offset 2048000
 16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=16 create_mode=off growth_mode=falloc ---
+--- cluster_size=512 growth_size=16 create_mode=off growth_mode=falloc ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 16384/16384 bytes at offset 2048000
 16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=16 create_mode=off growth_mode=full ---
+--- cluster_size=512 growth_size=16 create_mode=off growth_mode=full ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 16384/16384 bytes at offset 2048000
 16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=16 create_mode=metadata growth_mode=off ---
+--- cluster_size=512 growth_size=16 create_mode=metadata growth_mode=off ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 16384/16384 bytes at offset 2048000
 16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=16 create_mode=metadata growth_mode=metadata ---
+--- cluster_size=512 growth_size=16 create_mode=metadata growth_mode=metadata ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 16384/16384 bytes at offset 2048000
 16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=16 create_mode=metadata growth_mode=falloc ---
+--- cluster_size=512 growth_size=16 create_mode=metadata growth_mode=falloc ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 16384/16384 bytes at offset 2048000
 16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=16 create_mode=metadata growth_mode=full ---
+--- cluster_size=512 growth_size=16 create_mode=metadata growth_mode=full ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 16384/16384 bytes at offset 2048000
 16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=16 create_mode=falloc growth_mode=off ---
+--- cluster_size=512 growth_size=16 create_mode=falloc growth_mode=off ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 16384/16384 bytes at offset 2048000
 16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=16 create_mode=falloc growth_mode=metadata ---
+--- cluster_size=512 growth_size=16 create_mode=falloc growth_mode=metadata ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 16384/16384 bytes at offset 2048000
 16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=16 create_mode=falloc growth_mode=falloc ---
+--- cluster_size=512 growth_size=16 create_mode=falloc growth_mode=falloc ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 16384/16384 bytes at offset 2048000
 16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=16 create_mode=falloc growth_mode=full ---
+--- cluster_size=512 growth_size=16 create_mode=falloc growth_mode=full ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 16384/16384 bytes at offset 2048000
 16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=16 create_mode=full growth_mode=off ---
+--- cluster_size=512 growth_size=16 create_mode=full growth_mode=off ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 16384/16384 bytes at offset 2048000
 16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=16 create_mode=full growth_mode=metadata ---
+--- cluster_size=512 growth_size=16 create_mode=full growth_mode=metadata ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 16384/16384 bytes at offset 2048000
 16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=16 create_mode=full growth_mode=falloc ---
+--- cluster_size=512 growth_size=16 create_mode=full growth_mode=falloc ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 16384/16384 bytes at offset 2048000
 16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=16 create_mode=full growth_mode=full ---
+--- cluster_size=512 growth_size=16 create_mode=full growth_mode=full ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 16384/16384 bytes at offset 2048000
 16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=48 create_mode=off growth_mode=off ---
+--- cluster_size=512 growth_size=48 create_mode=off growth_mode=off ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 49152/49152 bytes at offset 2048000
 48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=48 create_mode=off growth_mode=metadata ---
+--- cluster_size=512 growth_size=48 create_mode=off growth_mode=metadata ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 49152/49152 bytes at offset 2048000
 48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=48 create_mode=off growth_mode=falloc ---
+--- cluster_size=512 growth_size=48 create_mode=off growth_mode=falloc ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 49152/49152 bytes at offset 2048000
 48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=48 create_mode=off growth_mode=full ---
+--- cluster_size=512 growth_size=48 create_mode=off growth_mode=full ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 49152/49152 bytes at offset 2048000
 48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=48 create_mode=metadata growth_mode=off ---
+--- cluster_size=512 growth_size=48 create_mode=metadata growth_mode=off ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 49152/49152 bytes at offset 2048000
 48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=48 create_mode=metadata growth_mode=metadata ---
+--- cluster_size=512 growth_size=48 create_mode=metadata growth_mode=metadata ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 49152/49152 bytes at offset 2048000
 48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=48 create_mode=metadata growth_mode=falloc ---
+--- cluster_size=512 growth_size=48 create_mode=metadata growth_mode=falloc ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 49152/49152 bytes at offset 2048000
 48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=48 create_mode=metadata growth_mode=full ---
+--- cluster_size=512 growth_size=48 create_mode=metadata growth_mode=full ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 49152/49152 bytes at offset 2048000
 48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=48 create_mode=falloc growth_mode=off ---
+--- cluster_size=512 growth_size=48 create_mode=falloc growth_mode=off ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 49152/49152 bytes at offset 2048000
 48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=48 create_mode=falloc growth_mode=metadata ---
+--- cluster_size=512 growth_size=48 create_mode=falloc growth_mode=metadata ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 49152/49152 bytes at offset 2048000
 48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=48 create_mode=falloc growth_mode=falloc ---
+--- cluster_size=512 growth_size=48 create_mode=falloc growth_mode=falloc ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 49152/49152 bytes at offset 2048000
 48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=48 create_mode=falloc growth_mode=full ---
+--- cluster_size=512 growth_size=48 create_mode=falloc growth_mode=full ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 49152/49152 bytes at offset 2048000
 48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=48 create_mode=full growth_mode=off ---
+--- cluster_size=512 growth_size=48 create_mode=full growth_mode=off ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 49152/49152 bytes at offset 2048000
 48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=48 create_mode=full growth_mode=metadata ---
+--- cluster_size=512 growth_size=48 create_mode=full growth_mode=metadata ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 49152/49152 bytes at offset 2048000
 48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=48 create_mode=full growth_mode=falloc ---
+--- cluster_size=512 growth_size=48 create_mode=full growth_mode=falloc ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 49152/49152 bytes at offset 2048000
 48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=48 create_mode=full growth_mode=full ---
+--- cluster_size=512 growth_size=48 create_mode=full growth_mode=full ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 49152/49152 bytes at offset 2048000
 48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=80 create_mode=off growth_mode=off ---
+--- cluster_size=512 growth_size=80 create_mode=off growth_mode=off ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 81920/81920 bytes at offset 2048000
 80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=80 create_mode=off growth_mode=metadata ---
+--- cluster_size=512 growth_size=80 create_mode=off growth_mode=metadata ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 81920/81920 bytes at offset 2048000
 80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=80 create_mode=off growth_mode=falloc ---
+--- cluster_size=512 growth_size=80 create_mode=off growth_mode=falloc ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 81920/81920 bytes at offset 2048000
 80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=80 create_mode=off growth_mode=full ---
+--- cluster_size=512 growth_size=80 create_mode=off growth_mode=full ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 81920/81920 bytes at offset 2048000
 80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=80 create_mode=metadata growth_mode=off ---
+--- cluster_size=512 growth_size=80 create_mode=metadata growth_mode=off ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 81920/81920 bytes at offset 2048000
 80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=80 create_mode=metadata growth_mode=metadata ---
+--- cluster_size=512 growth_size=80 create_mode=metadata growth_mode=metadata ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 81920/81920 bytes at offset 2048000
 80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=80 create_mode=metadata growth_mode=falloc ---
+--- cluster_size=512 growth_size=80 create_mode=metadata growth_mode=falloc ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 81920/81920 bytes at offset 2048000
 80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=80 create_mode=metadata growth_mode=full ---
+--- cluster_size=512 growth_size=80 create_mode=metadata growth_mode=full ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 81920/81920 bytes at offset 2048000
 80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=80 create_mode=falloc growth_mode=off ---
+--- cluster_size=512 growth_size=80 create_mode=falloc growth_mode=off ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 81920/81920 bytes at offset 2048000
 80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=80 create_mode=falloc growth_mode=metadata ---
+--- cluster_size=512 growth_size=80 create_mode=falloc growth_mode=metadata ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 81920/81920 bytes at offset 2048000
 80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=80 create_mode=falloc growth_mode=falloc ---
+--- cluster_size=512 growth_size=80 create_mode=falloc growth_mode=falloc ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 81920/81920 bytes at offset 2048000
 80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=80 create_mode=falloc growth_mode=full ---
+--- cluster_size=512 growth_size=80 create_mode=falloc growth_mode=full ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 81920/81920 bytes at offset 2048000
 80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=80 create_mode=full growth_mode=off ---
+--- cluster_size=512 growth_size=80 create_mode=full growth_mode=off ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 81920/81920 bytes at offset 2048000
 80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=80 create_mode=full growth_mode=metadata ---
+--- cluster_size=512 growth_size=80 create_mode=full growth_mode=metadata ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 81920/81920 bytes at offset 2048000
 80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=80 create_mode=full growth_mode=falloc ---
+--- cluster_size=512 growth_size=80 create_mode=full growth_mode=falloc ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
@@ -XXX,XX +XXX,XX @@ wrote 2048000/2048000 bytes at offset 0
 wrote 81920/81920 bytes at offset 2048000
 80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
---- growth_size=80 create_mode=full growth_mode=full ---
+--- cluster_size=512 growth_size=80 create_mode=full growth_mode=full ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 81920/81920 bytes at offset 2048000
+80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=16 create_mode=off growth_mode=off ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 16384/16384 bytes at offset 2048000
+16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=16 create_mode=off growth_mode=metadata ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 16384/16384 bytes at offset 2048000
+16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=16 create_mode=off growth_mode=falloc ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 16384/16384 bytes at offset 2048000
+16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=16 create_mode=off growth_mode=full ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 16384/16384 bytes at offset 2048000
+16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=16 create_mode=metadata growth_mode=off ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 16384/16384 bytes at offset 2048000
+16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=16 create_mode=metadata growth_mode=metadata ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 16384/16384 bytes at offset 2048000
+16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=16 create_mode=metadata growth_mode=falloc ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 16384/16384 bytes at offset 2048000
+16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=16 create_mode=metadata growth_mode=full ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 16384/16384 bytes at offset 2048000
+16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=16 create_mode=falloc growth_mode=off ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 16384/16384 bytes at offset 2048000
+16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=16 create_mode=falloc growth_mode=metadata ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 16384/16384 bytes at offset 2048000
+16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=16 create_mode=falloc growth_mode=falloc ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 16384/16384 bytes at offset 2048000
+16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=16 create_mode=falloc growth_mode=full ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 16384/16384 bytes at offset 2048000
+16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=16 create_mode=full growth_mode=off ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 16384/16384 bytes at offset 2048000
+16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=16 create_mode=full growth_mode=metadata ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 16384/16384 bytes at offset 2048000
+16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=16 create_mode=full growth_mode=falloc ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 16384/16384 bytes at offset 2048000
+16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=16 create_mode=full growth_mode=full ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 16384/16384 bytes at offset 2048000
+16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=48 create_mode=off growth_mode=off ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 49152/49152 bytes at offset 2048000
+48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=48 create_mode=off growth_mode=metadata ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 49152/49152 bytes at offset 2048000
+48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=48 create_mode=off growth_mode=falloc ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 49152/49152 bytes at offset 2048000
+48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=48 create_mode=off growth_mode=full ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 49152/49152 bytes at offset 2048000
+48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=48 create_mode=metadata growth_mode=off ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 49152/49152 bytes at offset 2048000
+48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=48 create_mode=metadata growth_mode=metadata ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 49152/49152 bytes at offset 2048000
+48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=48 create_mode=metadata growth_mode=falloc ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 49152/49152 bytes at offset 2048000
+48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=48 create_mode=metadata growth_mode=full ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 49152/49152 bytes at offset 2048000
+48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=48 create_mode=falloc growth_mode=off ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 49152/49152 bytes at offset 2048000
+48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=48 create_mode=falloc growth_mode=metadata ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 49152/49152 bytes at offset 2048000
+48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=48 create_mode=falloc growth_mode=falloc ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 49152/49152 bytes at offset 2048000
+48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=48 create_mode=falloc growth_mode=full ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 49152/49152 bytes at offset 2048000
+48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=48 create_mode=full growth_mode=off ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 49152/49152 bytes at offset 2048000
+48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=48 create_mode=full growth_mode=metadata ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 49152/49152 bytes at offset 2048000
+48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=48 create_mode=full growth_mode=falloc ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 49152/49152 bytes at offset 2048000
+48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=48 create_mode=full growth_mode=full ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 49152/49152 bytes at offset 2048000
+48 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=80 create_mode=off growth_mode=off ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 81920/81920 bytes at offset 2048000
+80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=80 create_mode=off growth_mode=metadata ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 81920/81920 bytes at offset 2048000
+80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=80 create_mode=off growth_mode=falloc ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 81920/81920 bytes at offset 2048000
+80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=80 create_mode=off growth_mode=full ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=off
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 81920/81920 bytes at offset 2048000
+80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=80 create_mode=metadata growth_mode=off ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 81920/81920 bytes at offset 2048000
+80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=80 create_mode=metadata growth_mode=metadata ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 81920/81920 bytes at offset 2048000
+80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=80 create_mode=metadata growth_mode=falloc ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 81920/81920 bytes at offset 2048000
+80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=80 create_mode=metadata growth_mode=full ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=metadata
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 81920/81920 bytes at offset 2048000
+80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=80 create_mode=falloc growth_mode=off ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 81920/81920 bytes at offset 2048000
+80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=80 create_mode=falloc growth_mode=metadata ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 81920/81920 bytes at offset 2048000
+80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=80 create_mode=falloc growth_mode=falloc ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 81920/81920 bytes at offset 2048000
+80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=80 create_mode=falloc growth_mode=full ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=falloc
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 81920/81920 bytes at offset 2048000
+80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=80 create_mode=full growth_mode=off ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 81920/81920 bytes at offset 2048000
+80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=80 create_mode=full growth_mode=metadata ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 81920/81920 bytes at offset 2048000
+80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=80 create_mode=full growth_mode=falloc ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
+Image resized.
+wrote 2048000/2048000 bytes at offset 0
+1.953 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 81920/81920 bytes at offset 2048000
+80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+--- cluster_size=64k growth_size=80 create_mode=full growth_mode=full ---
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2048000 preallocation=full
 Image resized.
 wrote 2048000/2048000 bytes at offset 0
-- 
2.13.6