The following changes since commit b150cb8f67bf491a49a1cb1c7da151eeacbdbcc9:

  Merge remote-tracking branch 'remotes/mst/tags/for_upstream' into staging (2020-09-29 13:18:54 +0100)

are available in the Git repository at:

  https://gitlab.com/stefanha/qemu.git tags/block-pull-request

for you to fetch changes up to bc47831ff28d6f5830c9c8d74220131dc54c5253:

  util/vfio-helpers: Rework the IOVA allocator to avoid IOVA reserved regions (2020-09-30 10:23:05 +0100)

----------------------------------------------------------------
Pull request

Note I have switched from GitHub to GitLab.

----------------------------------------------------------------

Eric Auger (2):
      util/vfio-helpers: Collect IOVA reserved regions
      util/vfio-helpers: Rework the IOVA allocator to avoid IOVA reserved
        regions

Philippe Mathieu-Daudé (6):
      util/vfio-helpers: Pass page protections to qemu_vfio_pci_map_bar()
      block/nvme: Map doorbells pages write-only
      block/nvme: Reduce I/O registers scope
      block/nvme: Drop NVMeRegs structure, directly use NvmeBar
      block/nvme: Use register definitions from 'block/nvme.h'
      block/nvme: Replace magic value by SCALE_MS definition

Stefano Garzarella (1):
      docs: add 'io_uring' option to 'aio' param in qemu-options.hx

Vladimir Sementsov-Ogievskiy (8):
      block: return error-code from bdrv_invalidate_cache
      block/io: refactor coroutine wrappers
      block: declare some coroutine functions in block/coroutines.h
      scripts: add block-coroutine-wrapper.py
      block: generate coroutine-wrapper code
      block: drop bdrv_prwv
      block/io: refactor save/load vmstate
      include/block/block.h: drop non-ascii quotation mark

 block/block-gen.h                      |  49 ++++
 block/coroutines.h                     |  65 +++++
 include/block/block.h                  |  36 ++-
 include/qemu/vfio-helpers.h            |   2 +-
 block.c                                |  97 +------
 block/io.c                             | 339 ++++---------------------
 block/nvme.c                           |  73 +++---
 tests/test-bdrv-drain.c                |   2 +-
 util/vfio-helpers.c                    | 133 +++++++++-
 block/meson.build                      |   8 +
 docs/devel/block-coroutine-wrapper.rst |  54 ++++
 docs/devel/index.rst                   |   1 +
 qemu-options.hx                        |  10 +-
 scripts/block-coroutine-wrapper.py     | 188 ++++++++++++++
 14 files changed, 629 insertions(+), 428 deletions(-)
 create mode 100644 block/block-gen.h
 create mode 100644 block/coroutines.h
 create mode 100644 docs/devel/block-coroutine-wrapper.rst
 create mode 100644 scripts/block-coroutine-wrapper.py

--
2.26.2

The following changes since commit 848a6caa88b9f082c89c9b41afa975761262981d:

  Merge tag 'migration-20230602-pull-request' of https://gitlab.com/juan.quintela/qemu into staging (2023-06-02 17:33:29 -0700)

are available in the Git repository at:

  https://gitlab.com/hreitz/qemu.git tags/pull-block-2023-06-05

for you to fetch changes up to 42a2890a76f4783cd1c212f27856edcf2b5e8a75:

  qcow2: add discard-no-unref option (2023-06-05 13:15:42 +0200)

----------------------------------------------------------------
Block patches

- Fix padding of unaligned vectored requests to match the host alignment
  for vectors with 1023 or 1024 buffers
- Refactor and fix bugs in parallels's image check functionality
- Add an option to the qcow2 driver to retain (qcow2-level) allocations
  on discard requests from the guest (while still forwarding the discard
  to the lower level and marking the range as zero)

----------------------------------------------------------------
Alexander Ivanov (12):
      parallels: Out of image offset in BAT leads to image inflation
      parallels: Fix high_off calculation in parallels_co_check()
      parallels: Fix image_end_offset and data_end after out-of-image check
      parallels: create parallels_set_bat_entry_helper() to assign BAT value
      parallels: Use generic infrastructure for BAT writing in
        parallels_co_check()
      parallels: Move check of unclean image to a separate function
      parallels: Move check of cluster outside image to a separate function
      parallels: Fix statistics calculation
      parallels: Move check of leaks to a separate function
      parallels: Move statistic collection to a separate function
      parallels: Replace qemu_co_mutex_lock by WITH_QEMU_LOCK_GUARD
      parallels: Incorrect condition in out-of-image check

Hanna Czenczek (4):
      util/iov: Make qiov_slice() public
      block: Collapse padded I/O vecs exceeding IOV_MAX
      util/iov: Remove qemu_iovec_init_extended()
      iotests/iov-padding: New test

Jean-Louis Dupond (1):
      qcow2: add discard-no-unref option

 qapi/block-core.json                     |  12 ++
 block/qcow2.h                            |   3 +
 include/qemu/iov.h                       |   8 +-
 block/io.c                               | 166 ++++++++++++++++++--
 block/parallels.c                        | 190 ++++++++++++++++-------
 block/qcow2-cluster.c                    |  32 +++-
 block/qcow2.c                            |  18 +++
 util/iov.c                               |  89 ++---------
 qemu-options.hx                          |  12 ++
 tests/qemu-iotests/tests/iov-padding     |  85 ++++++++++
 tests/qemu-iotests/tests/iov-padding.out |  59 +++++++
 11 files changed, 523 insertions(+), 151 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/iov-padding
 create mode 100644 tests/qemu-iotests/tests/iov-padding.out

--
2.40.1
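The padding fix summarized in the second cover letter above comes from request-alignment arithmetic: an unaligned request gains a head and/or tail bounce buffer, each of which adds one element to the I/O vector. The following is an illustrative sketch of that arithmetic, not QEMU code; `padding_for` is an invented name, and QEMU additionally requires `align` to be a power of two:

```c
#include <stdint.h>

/*
 * Illustrative sketch (not QEMU's bdrv_init_padding()): compute the head
 * and tail padding needed to align the byte range [offset, offset + bytes)
 * to `align`.  When both head and tail are nonzero, padding adds two
 * elements to the request's I/O vector -- which is how a 1023- or
 * 1024-element guest vector can come to exceed IOV_MAX (1024).
 */
static void padding_for(uint64_t offset, uint64_t bytes, uint64_t align,
                        uint64_t *head, uint64_t *tail)
{
    *head = offset % align;                             /* bytes before */
    *tail = (align - (offset + bytes) % align) % align; /* bytes after  */
}
```

For example, a 1536-byte request at offset 512 against a 4096-byte alignment needs 512 head bytes and 2048 tail bytes of padding.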
From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

This is the only coroutine wrapper from block.c and block/io.c which
doesn't return a value, so let's convert it to the common behavior, to
simplify moving to generated coroutine wrappers in a further commit.

Also, bdrv_invalidate_cache is a void function, returning error only
through **errp parameter, which is considered to be bad practice, as
it forces callers to define and propagate local_err variable, so
conversion is good anyway.

This patch leaves the conversion of .bdrv_co_invalidate_cache() driver
callbacks and bdrv_invalidate_cache_all() for another day.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20200924185414.28642-2-vsementsov@virtuozzo.com>
---
 include/block/block.h |  2 +-
 block.c               | 32 ++++++++++++++++++--------------
 2 files changed, 19 insertions(+), 15 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -XXX,XX +XXX,XX @@ void bdrv_aio_cancel_async(BlockAIOCB *acb);
 int bdrv_co_ioctl(BlockDriverState *bs, int req, void *buf);
 
 /* Invalidate any cached metadata used by image formats */
-void bdrv_invalidate_cache(BlockDriverState *bs, Error **errp);
+int bdrv_invalidate_cache(BlockDriverState *bs, Error **errp);
 void bdrv_invalidate_cache_all(Error **errp);
 int bdrv_inactivate_all(void);
 
diff --git a/block.c b/block.c
index XXXXXXX..XXXXXXX 100644
--- a/block.c
+++ b/block.c
@@ -XXX,XX +XXX,XX @@ void bdrv_init_with_whitelist(void)
     bdrv_init();
 }
 
-static void coroutine_fn bdrv_co_invalidate_cache(BlockDriverState *bs,
-                                                  Error **errp)
+static int coroutine_fn bdrv_co_invalidate_cache(BlockDriverState *bs,
+                                                 Error **errp)
 {
     BdrvChild *child, *parent;
     uint64_t perm, shared_perm;
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn bdrv_co_invalidate_cache(BlockDriverState *bs,
     BdrvDirtyBitmap *bm;
 
     if (!bs->drv) {
-        return;
+        return -ENOMEDIUM;
     }
 
     QLIST_FOREACH(child, &bs->children, next) {
         bdrv_co_invalidate_cache(child->bs, &local_err);
         if (local_err) {
             error_propagate(errp, local_err);
-            return;
+            return -EINVAL;
         }
     }
 
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn bdrv_co_invalidate_cache(BlockDriverState *bs,
     ret = bdrv_check_perm(bs, NULL, perm, shared_perm, NULL, NULL, errp);
     if (ret < 0) {
         bs->open_flags |= BDRV_O_INACTIVE;
-        return;
+        return ret;
     }
     bdrv_set_perm(bs, perm, shared_perm);
 
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn bdrv_co_invalidate_cache(BlockDriverState *bs,
         if (local_err) {
             bs->open_flags |= BDRV_O_INACTIVE;
             error_propagate(errp, local_err);
-            return;
+            return -EINVAL;
         }
     }
 
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn bdrv_co_invalidate_cache(BlockDriverState *bs,
         if (ret < 0) {
             bs->open_flags |= BDRV_O_INACTIVE;
             error_setg_errno(errp, -ret, "Could not refresh total sector count");
-            return;
+            return ret;
         }
     }
 
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn bdrv_co_invalidate_cache(BlockDriverState *bs,
             if (local_err) {
                 bs->open_flags |= BDRV_O_INACTIVE;
                 error_propagate(errp, local_err);
-                return;
+                return -EINVAL;
             }
         }
     }
+
+    return 0;
 }
 
 typedef struct InvalidateCacheCo {
     BlockDriverState *bs;
     Error **errp;
     bool done;
+    int ret;
 } InvalidateCacheCo;
 
 static void coroutine_fn bdrv_invalidate_cache_co_entry(void *opaque)
 {
     InvalidateCacheCo *ico = opaque;
-    bdrv_co_invalidate_cache(ico->bs, ico->errp);
+    ico->ret = bdrv_co_invalidate_cache(ico->bs, ico->errp);
     ico->done = true;
     aio_wait_kick();
 }
 
-void bdrv_invalidate_cache(BlockDriverState *bs, Error **errp)
+int bdrv_invalidate_cache(BlockDriverState *bs, Error **errp)
 {
     Coroutine *co;
     InvalidateCacheCo ico = {
@@ -XXX,XX +XXX,XX @@ void bdrv_invalidate_cache(BlockDriverState *bs, Error **errp)
         bdrv_coroutine_enter(bs, co);
         BDRV_POLL_WHILE(bs, !ico.done);
     }
+
+    return ico.ret;
 }
 
 void bdrv_invalidate_cache_all(Error **errp)
 {
     BlockDriverState *bs;
-    Error *local_err = NULL;
     BdrvNextIterator it;
 
     for (bs = bdrv_first(&it); bs; bs = bdrv_next(&it)) {
         AioContext *aio_context = bdrv_get_aio_context(bs);
+        int ret;
 
         aio_context_acquire(aio_context);
-        bdrv_invalidate_cache(bs, &local_err);
+        ret = bdrv_invalidate_cache(bs, errp);
         aio_context_release(aio_context);
-        if (local_err) {
-            error_propagate(errp, local_err);
+        if (ret < 0) {
             bdrv_next_cleanup(&it);
             return;
         }
     }
--
2.26.2

We want to inline qemu_iovec_init_extended() in block/io.c for padding
requests, and having access to qiov_slice() is useful for this. As a
public function, it is renamed to qemu_iovec_slice().

(We will need to count the number of I/O vector elements of a slice
there, and then later process this slice. Without qiov_slice(), we
would need to call qemu_iovec_subvec_niov(), and all further
IOV-processing functions may need to skip prefixing elements to
accomodate for a qiov_offset. Because qemu_iovec_subvec_niov()
internally calls qiov_slice(), we can just have the block/io.c code call
qiov_slice() itself, thus get the number of elements, and also create an
iovec array with the superfluous prefixing elements stripped, so the
following processing functions no longer need to skip them.)

Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-Id: <20230411173418.19549-2-hreitz@redhat.com>
---
 include/qemu/iov.h |  3 +++
 util/iov.c         | 14 +++++++-------
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/include/qemu/iov.h b/include/qemu/iov.h
index XXXXXXX..XXXXXXX 100644
--- a/include/qemu/iov.h
+++ b/include/qemu/iov.h
@@ -XXX,XX +XXX,XX @@ int qemu_iovec_init_extended(
         void *tail_buf, size_t tail_len);
 void qemu_iovec_init_slice(QEMUIOVector *qiov, QEMUIOVector *source,
                            size_t offset, size_t len);
+struct iovec *qemu_iovec_slice(QEMUIOVector *qiov,
+                               size_t offset, size_t len,
+                               size_t *head, size_t *tail, int *niov);
 int qemu_iovec_subvec_niov(QEMUIOVector *qiov, size_t offset, size_t len);
 
 void qemu_iovec_add(QEMUIOVector *qiov, void *base, size_t len);
 void qemu_iovec_concat(QEMUIOVector *dst,
diff --git a/util/iov.c b/util/iov.c
index XXXXXXX..XXXXXXX 100644
--- a/util/iov.c
+++ b/util/iov.c
@@ -XXX,XX +XXX,XX @@ static struct iovec *iov_skip_offset(struct iovec *iov, size_t offset,
 }
 
 /*
- * qiov_slice
+ * qemu_iovec_slice
  *
  * Find subarray of iovec's, containing requested range. @head would
  * be offset in first iov (returned by the function), @tail would be
  * count of extra bytes in last iovec (returned iov + @niov - 1).
  */
-static struct iovec *qiov_slice(QEMUIOVector *qiov,
-                                size_t offset, size_t len,
-                                size_t *head, size_t *tail, int *niov)
+struct iovec *qemu_iovec_slice(QEMUIOVector *qiov,
+                               size_t offset, size_t len,
+                               size_t *head, size_t *tail, int *niov)
 {
     struct iovec *iov, *end_iov;
 
@@ -XXX,XX +XXX,XX @@ int qemu_iovec_subvec_niov(QEMUIOVector *qiov, size_t offset, size_t len)
     size_t head, tail;
     int niov;
 
-    qiov_slice(qiov, offset, len, &head, &tail, &niov);
+    qemu_iovec_slice(qiov, offset, len, &head, &tail, &niov);
 
     return niov;
 }
@@ -XXX,XX +XXX,XX @@ int qemu_iovec_init_extended(
     }
 
     if (mid_len) {
-        mid_iov = qiov_slice(mid_qiov, mid_offset, mid_len,
-                             &mid_head, &mid_tail, &mid_niov);
+        mid_iov = qemu_iovec_slice(mid_qiov, mid_offset, mid_len,
+                                   &mid_head, &mid_tail, &mid_niov);
     }
 
     total_niov = !!head_len + mid_niov + !!tail_len;
--
2.40.1
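The slicing contract described in the qemu_iovec_slice() patch above (return the sub-array of iovec elements covering a byte range, with `*head` the offset into the first returned element and `*tail` the surplus bytes in the last) can be sketched against a plain iovec array. This is an illustrative toy, not QEMU's implementation; `iov_slice` is an invented name and the caller must ensure the range lies inside the vector:

```c
#include <stddef.h>
#include <sys/uio.h>

/*
 * Toy model of the qiov_slice()/qemu_iovec_slice() contract for a plain
 * iovec array: return the subarray covering [offset, offset + len);
 * *head is the byte offset into the first returned element, *tail the
 * count of surplus bytes at the end of the last one, *niov the number
 * of returned elements.
 */
static struct iovec *iov_slice(struct iovec *iov, size_t offset, size_t len,
                               size_t *head, size_t *tail, int *niov)
{
    /* Skip whole elements that lie entirely before the requested range */
    while (offset >= iov->iov_len) {
        offset -= iov->iov_len;
        iov++;
    }
    *head = offset;

    /* Walk forward until [offset, offset + len) is covered */
    struct iovec *last = iov;
    size_t covered = last->iov_len - offset;
    while (covered < len) {
        last++;
        covered += last->iov_len;
    }
    *tail = covered - len;
    *niov = (int)(last - iov) + 1;
    return iov;
}
```

For instance, slicing 100 bytes at offset 150 out of three 100-byte elements returns the second element with head 50, tail 50, and niov 2.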
From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

Most of our coroutine wrappers already follow this convention:

We have 'coroutine_fn bdrv_co_<something>(<normal argument list>)' as
the core function, and a wrapper 'bdrv_<something>(<same argument
list>)' which does parameter packing and calls bdrv_run_co().

The only outsiders are the bdrv_prwv_co and
bdrv_common_block_status_above wrappers. Let's refactor them to behave
as the others, it simplifies further conversion of coroutine wrappers.

This patch adds an indirection layer, but it will be compensated by
a further commit, which will drop bdrv_co_prwv together with the
is_write logic, to keep the read and write paths separate.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20200924185414.28642-3-vsementsov@virtuozzo.com>
---
 block/io.c | 60 +++++++++++++++++++++++++++++-------------------------
 1 file changed, 32 insertions(+), 28 deletions(-)

diff --git a/block/io.c b/block/io.c
index XXXXXXX..XXXXXXX 100644
--- a/block/io.c
+++ b/block/io.c
@@ -XXX,XX +XXX,XX @@ typedef struct RwCo {
     BdrvRequestFlags flags;
 } RwCo;
 
+static int coroutine_fn bdrv_co_prwv(BdrvChild *child, int64_t offset,
+                                     QEMUIOVector *qiov, bool is_write,
+                                     BdrvRequestFlags flags)
+{
+    if (is_write) {
+        return bdrv_co_pwritev(child, offset, qiov->size, qiov, flags);
+    } else {
+        return bdrv_co_preadv(child, offset, qiov->size, qiov, flags);
+    }
+}
+
 static int coroutine_fn bdrv_rw_co_entry(void *opaque)
 {
     RwCo *rwco = opaque;
 
-    if (!rwco->is_write) {
-        return bdrv_co_preadv(rwco->child, rwco->offset,
-                              rwco->qiov->size, rwco->qiov,
-                              rwco->flags);
-    } else {
-        return bdrv_co_pwritev(rwco->child, rwco->offset,
-                               rwco->qiov->size, rwco->qiov,
-                               rwco->flags);
-    }
+    return bdrv_co_prwv(rwco->child, rwco->offset, rwco->qiov,
+                        rwco->is_write, rwco->flags);
 }
 
 /*
  * Process a vectored synchronous request using coroutines
  */
-static int bdrv_prwv_co(BdrvChild *child, int64_t offset,
-                        QEMUIOVector *qiov, bool is_write,
-                        BdrvRequestFlags flags)
+static int bdrv_prwv(BdrvChild *child, int64_t offset,
+                     QEMUIOVector *qiov, bool is_write,
+                     BdrvRequestFlags flags)
 {
     RwCo rwco = {
         .child = child,
@@ -XXX,XX +XXX,XX @@ int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
 {
     QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, NULL, bytes);
 
-    return bdrv_prwv_co(child, offset, &qiov, true,
-                        BDRV_REQ_ZERO_WRITE | flags);
+    return bdrv_prwv(child, offset, &qiov, true, BDRV_REQ_ZERO_WRITE | flags);
 }
 
 /*
@@ -XXX,XX +XXX,XX @@ int bdrv_preadv(BdrvChild *child, int64_t offset, QEMUIOVector *qiov)
 {
     int ret;
 
-    ret = bdrv_prwv_co(child, offset, qiov, false, 0);
+    ret = bdrv_prwv(child, offset, qiov, false, 0);
     if (ret < 0) {
         return ret;
     }
@@ -XXX,XX +XXX,XX @@ int bdrv_pwritev(BdrvChild *child, int64_t offset, QEMUIOVector *qiov)
 {
     int ret;
 
-    ret = bdrv_prwv_co(child, offset, qiov, true, 0);
+    ret = bdrv_prwv(child, offset, qiov, true, 0);
     if (ret < 0) {
         return ret;
     }
@@ -XXX,XX +XXX,XX @@ early_out:
-static int coroutine_fn bdrv_co_block_status_above(BlockDriverState *bs,
-                                                   BlockDriverState *base,
-                                                   bool want_zero,
-                                                   int64_t offset,
-                                                   int64_t bytes,
-                                                   int64_t *pnum,
-                                                   int64_t *map,
-                                                   BlockDriverState **file)
+static int coroutine_fn
+bdrv_co_common_block_status_above(BlockDriverState *bs,
+                                  BlockDriverState *base,
+                                  bool want_zero,
+                                  int64_t offset,
+                                  int64_t bytes,
+                                  int64_t *pnum,
+                                  int64_t *map,
+                                  BlockDriverState **file)
 {
     BlockDriverState *p;
     int ret = 0;
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn bdrv_block_status_above_co_entry(void *opaque)
 {
     BdrvCoBlockStatusData *data = opaque;
 
-    return bdrv_co_block_status_above(data->bs, data->base,
-                                      data->want_zero,
-                                      data->offset, data->bytes,
-                                      data->pnum, data->map, data->file);
+    return bdrv_co_common_block_status_above(data->bs, data->base,
+                                             data->want_zero,
+                                             data->offset, data->bytes,
+                                             data->pnum, data->map, data->file);
 }
 
 /*
--
2.26.2

When processing vectored guest requests that are not aligned to the
storage request alignment, we pad them by adding head and/or tail
buffers for a read-modify-write cycle.

The guest can submit I/O vectors up to IOV_MAX (1024) in length, but
with this padding, the vector can exceed that limit. As of
4c002cef0e9abe7135d7916c51abce47f7fc1ee2 ("util/iov: make
qemu_iovec_init_extended() honest"), we refuse to pad vectors beyond the
limit, instead returning an error to the guest.

To the guest, this appears as a random I/O error. We should not return
an I/O error to the guest when it issued a perfectly valid request.

Before 4c002cef0e9abe7135d7916c51abce47f7fc1ee2, we just made the vector
longer than IOV_MAX, which generally seems to work (because the guest
assumes a smaller alignment than we really have, file-posix's
raw_co_prw() will generally see bdrv_qiov_is_aligned() return false, and
so emulate the request, so that the IOV_MAX does not matter). However,
that does not seem exactly great.

I see two ways to fix this problem:
1. We split such long requests into two requests.
2. We join some elements of the vector into new buffers to make it
   shorter.

I am wary of (1), because it seems like it may have unintended side
effects.

(2) on the other hand seems relatively simple to implement, with
hopefully few side effects, so this patch does that.

To do this, the use of qemu_iovec_init_extended() in bdrv_pad_request()
is effectively replaced by the new function bdrv_create_padded_qiov(),
which not only wraps the request IOV with padding head/tail, but also
ensures that the resulting vector will not have more than IOV_MAX
elements. Putting that functionality into qemu_iovec_init_extended() is
infeasible because it requires allocating a bounce buffer; doing so
would require many more parameters (buffer alignment, how to initialize
the buffer, and out parameters like the buffer, its length, and the
original elements), which is not reasonable.

Conversely, it is not difficult to move qemu_iovec_init_extended()'s
functionality into bdrv_create_padded_qiov() by using public
qemu_iovec_* functions, so that is what this patch does.

Because bdrv_pad_request() was the only "serious" user of
qemu_iovec_init_extended(), the next patch will remove the latter
function, so the functionality is not implemented twice.

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2141964
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-Id: <20230411173418.19549-3-hreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
---
 block/io.c | 166 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 151 insertions(+), 15 deletions(-)

diff --git a/block/io.c b/block/io.c
index XXXXXXX..XXXXXXX 100644
--- a/block/io.c
+++ b/block/io.c
@@ -XXX,XX +XXX,XX @@ out:
  * @merge_reads is true for small requests,
  * if @buf_len == @head + bytes + @tail. In this case it is possible that both
  * head and tail exist but @buf_len == align and @tail_buf == @buf.
+ *
+ * @write is true for write requests, false for read requests.
+ *
+ * If padding makes the vector too long (exceeding IOV_MAX), then we need to
+ * merge existing vector elements into a single one.  @collapse_bounce_buf acts
+ * as the bounce buffer in such cases.  @pre_collapse_qiov has the pre-collapse
+ * I/O vector elements so for read requests, the data can be copied back after
+ * the read is done.
  */
 typedef struct BdrvRequestPadding {
     uint8_t *buf;
@@ -XXX,XX +XXX,XX @@ typedef struct BdrvRequestPadding {
     size_t head;
     size_t tail;
     bool merge_reads;
+    bool write;
     QEMUIOVector local_qiov;
+
+    uint8_t *collapse_bounce_buf;
+    size_t collapse_len;
+    QEMUIOVector pre_collapse_qiov;
 } BdrvRequestPadding;
 
 static bool bdrv_init_padding(BlockDriverState *bs,
                               int64_t offset, int64_t bytes,
+                              bool write,
                               BdrvRequestPadding *pad)
 {
     int64_t align = bs->bl.request_alignment;
@@ -XXX,XX +XXX,XX @@ static bool bdrv_init_padding(BlockDriverState *bs,
         pad->tail_buf = pad->buf + pad->buf_len - align;
     }
 
+    pad->write = write;
+
     return true;
 }
 
@@ -XXX,XX +XXX,XX @@ zero_mem:
     return 0;
 }
 
-static void bdrv_padding_destroy(BdrvRequestPadding *pad)
+/**
+ * Free *pad's associated buffers, and perform any necessary finalization steps.
+ */
+static void bdrv_padding_finalize(BdrvRequestPadding *pad)
 {
+    if (pad->collapse_bounce_buf) {
+        if (!pad->write) {
+            /*
+             * If padding required elements in the vector to be collapsed into a
+             * bounce buffer, copy the bounce buffer content back
+             */
+            qemu_iovec_from_buf(&pad->pre_collapse_qiov, 0,
+                                pad->collapse_bounce_buf, pad->collapse_len);
+        }
+        qemu_vfree(pad->collapse_bounce_buf);
+        qemu_iovec_destroy(&pad->pre_collapse_qiov);
+    }
     if (pad->buf) {
         qemu_vfree(pad->buf);
         qemu_iovec_destroy(&pad->local_qiov);
@@ -XXX,XX +XXX,XX @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
     memset(pad, 0, sizeof(*pad));
 }
 
+/*
+ * Create pad->local_qiov by wrapping @iov in the padding head and tail, while
+ * ensuring that the resulting vector will not exceed IOV_MAX elements.
+ *
+ * To ensure this, when necessary, the first two or three elements of @iov are
+ * merged into pad->collapse_bounce_buf and replaced by a reference to that
+ * bounce buffer in pad->local_qiov.
+ *
+ * After performing a read request, the data from the bounce buffer must be
+ * copied back into pad->pre_collapse_qiov (e.g. by bdrv_padding_finalize()).
+ */
+static int bdrv_create_padded_qiov(BlockDriverState *bs,
+                                   BdrvRequestPadding *pad,
+                                   struct iovec *iov, int niov,
+                                   size_t iov_offset, size_t bytes)
+{
+    int padded_niov, surplus_count, collapse_count;
+
+    /* Assert this invariant */
+    assert(niov <= IOV_MAX);
+
+    /*
+     * Cannot pad if resulting length would exceed SIZE_MAX.  Returning an error
+     * to the guest is not ideal, but there is little else we can do.  At least
+     * this will practically never happen on 64-bit systems.
+     */
+    if (SIZE_MAX - pad->head < bytes ||
+        SIZE_MAX - pad->head - bytes < pad->tail)
+    {
+        return -EINVAL;
+    }
+
+    /* Length of the resulting IOV if we just concatenated everything */
+    padded_niov = !!pad->head + niov + !!pad->tail;
+
+    qemu_iovec_init(&pad->local_qiov, MIN(padded_niov, IOV_MAX));
+
+    if (pad->head) {
+        qemu_iovec_add(&pad->local_qiov, pad->buf, pad->head);
+    }
+
+    /*
+     * If padded_niov > IOV_MAX, we cannot just concatenate everything.
+     * Instead, merge the first two or three elements of @iov to reduce the
+     * number of vector elements as necessary.
+     */
+    if (padded_niov > IOV_MAX) {
+        /*
+         * Only head and tail can have lead to the number of entries exceeding
+         * IOV_MAX, so we can exceed it by the head and tail at most.  We need
+         * to reduce the number of elements by `surplus_count`, so we merge that
+         * many elements plus one into one element.
+         */
+        surplus_count = padded_niov - IOV_MAX;
+        assert(surplus_count <= !!pad->head + !!pad->tail);
+        collapse_count = surplus_count + 1;
+
+        /*
+         * Move the elements to collapse into `pad->pre_collapse_qiov`, then
+         * advance `iov` (and associated variables) by those elements.
+         */
+        qemu_iovec_init(&pad->pre_collapse_qiov, collapse_count);
+        qemu_iovec_concat_iov(&pad->pre_collapse_qiov, iov,
+                              collapse_count, iov_offset, SIZE_MAX);
+        iov += collapse_count;
+        iov_offset = 0;
+        niov -= collapse_count;
+        bytes -= pad->pre_collapse_qiov.size;
+
+        /*
+         * Construct the bounce buffer to match the length of the to-collapse
+         * vector elements, and for write requests, initialize it with the data
+         * from those elements.  Then add it to `pad->local_qiov`.
+         */
+        pad->collapse_len = pad->pre_collapse_qiov.size;
+        pad->collapse_bounce_buf = qemu_blockalign(bs, pad->collapse_len);
+        if (pad->write) {
+            qemu_iovec_to_buf(&pad->pre_collapse_qiov, 0,
+                              pad->collapse_bounce_buf, pad->collapse_len);
+        }
+        qemu_iovec_add(&pad->local_qiov,
+                       pad->collapse_bounce_buf, pad->collapse_len);
+    }
+
+    qemu_iovec_concat_iov(&pad->local_qiov, iov, niov, iov_offset, bytes);
+
+    if (pad->tail) {
+        qemu_iovec_add(&pad->local_qiov,
+                       pad->buf + pad->buf_len - pad->tail, pad->tail);
+    }
+
+    assert(pad->local_qiov.niov == MIN(padded_niov, IOV_MAX));
+    return 0;
+}
+
 /*
  * bdrv_pad_request
  *
@@ -XXX,XX +XXX,XX @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
  * read of padding, bdrv_padding_rmw_read() should be called separately if
  * needed.
  *
+ * @write is true for write requests, false for read requests.
+ *
  * Request parameters (@qiov, &qiov_offset, &offset, &bytes) are in-out:
  * - on function start they represent original request
  * - on failure or when padding is not needed they are unchanged
@@ -XXX,XX +XXX,XX @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
 static int bdrv_pad_request(BlockDriverState *bs,
                             QEMUIOVector **qiov, size_t *qiov_offset,
                             int64_t *offset, int64_t *bytes,
+                            bool write,
                             BdrvRequestPadding *pad, bool *padded,
                             BdrvRequestFlags *flags)
 {
     int ret;
+    struct iovec *sliced_iov;
+    int sliced_niov;
+    size_t sliced_head, sliced_tail;
 
     bdrv_check_qiov_request(*offset, *bytes, *qiov, *qiov_offset, &error_abort);
 
-    if (!bdrv_init_padding(bs, *offset, *bytes, pad)) {
+    if (!bdrv_init_padding(bs, *offset, *bytes, write, pad)) {
         if (padded) {
             *padded = false;
         }
         return 0;
     }
 
-    ret = qemu_iovec_init_extended(&pad->local_qiov, pad->buf, pad->head,
-                                   *qiov, *qiov_offset, *bytes,
-                                   pad->buf + pad->buf_len - pad->tail,
-                                   pad->tail);
+    sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
+                                  &sliced_head, &sliced_tail,
+                                  &sliced_niov);
+
+    /* Guaranteed by bdrv_check_qiov_request() */
+    assert(*bytes <= SIZE_MAX);
+    ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
+                                  sliced_head, *bytes);
     if (ret < 0) {
-        bdrv_padding_destroy(pad);
+        bdrv_padding_finalize(pad);
         return ret;
     }
     *bytes += pad->head + pad->tail;
@@ -XXX,XX +XXX,XX @@ int coroutine_fn bdrv_co_preadv_part(BdrvChild *child,
         flags |= BDRV_REQ_COPY_ON_READ;
     }
 
-    ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad,
-                           NULL, &flags);
+    ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, false,
+                           &pad, NULL, &flags);
     if (ret < 0) {
         goto fail;
     }
@@ -XXX,XX +XXX,XX @@ int coroutine_fn bdrv_co_preadv_part(BdrvChild *child,
                               bs->bl.request_alignment,
                               qiov, qiov_offset, flags);
     tracked_request_end(&req);
-    bdrv_padding_destroy(&pad);
+    bdrv_padding_finalize(&pad);
 
 fail:
     bdrv_dec_in_flight(bs);
@@ -XXX,XX +XXX,XX @@ bdrv_co_do_zero_pwritev(BdrvChild *child, int64_t offset, int64_t bytes,
     /* This flag doesn't make sense for padding or zero writes */
     flags &= ~BDRV_REQ_REGISTERED_BUF;
 
-    padding = bdrv_init_padding(bs, offset, bytes, &pad);
+    padding = bdrv_init_padding(bs, offset, bytes, true, &pad);
     if (padding) {
         assert(!(flags & BDRV_REQ_NO_WAIT));
         bdrv_make_request_serialising(req, align);
@@ -XXX,XX +XXX,XX @@ bdrv_co_do_zero_pwritev(BdrvChild *child, int64_t offset, int64_t bytes,
     }
 
 out:
-    bdrv_padding_destroy(&pad);
+    bdrv_padding_finalize(&pad);
 
     return ret;
 }
 
@@ -XXX,XX +XXX,XX @@ int coroutine_fn bdrv_co_pwritev_part(BdrvChild *child,
      * bdrv_co_do_zero_pwritev() does aligning by itself, so, we do
      * alignment only if there is no ZERO flag.
      */
-    ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad,
-                           &padded, &flags);
+    ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, true,
+                           &pad, &padded, &flags);
     if (ret < 0) {
         return ret;
     }
@@ -XXX,XX +XXX,XX @@ int coroutine_fn bdrv_co_pwritev_part(BdrvChild *child,
     ret = bdrv_aligned_pwritev(child, &req, offset, bytes, align,
                                qiov, qiov_offset, flags);
 
-    bdrv_padding_destroy(&pad);
+    bdrv_padding_finalize(&pad);
 
 out:
     tracked_request_end(&req);
--
2.40.1
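The collapse arithmetic quoted in bdrv_create_padded_qiov() above can be isolated into a few lines. The sketch below is illustrative only (not QEMU's code; `toy_collapse_count` and `TOY_IOV_MAX` are invented names): given the sliced guest vector's element count and whether head/tail padding buffers exist, it computes how many leading elements must be merged into one bounce buffer so the padded vector fits into IOV_MAX. Only the head and tail can push the count over the limit, so the surplus is at most two, and merging (surplus + 1) elements into one removes exactly `surplus` elements:

```c
#define TOY_IOV_MAX 1024

/*
 * How many leading vector elements must be collapsed into a single
 * bounce buffer so that head + niov + tail fits into TOY_IOV_MAX?
 * Returns 0 when no collapsing is needed.
 */
static int toy_collapse_count(int niov, int has_head, int has_tail)
{
    int padded_niov = !!has_head + niov + !!has_tail;
    int surplus = padded_niov - TOY_IOV_MAX;

    /* Merging (surplus + 1) elements into one removes `surplus` elements */
    return surplus > 0 ? surplus + 1 : 0;
}
```

For a full 1024-element guest vector with both head and tail padding, three elements are merged; with only one padding buffer, two.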
From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

Use code generation implemented in previous commit to generate
coroutine wrappers in block.c and block/io.c

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20200924185414.28642-6-vsementsov@virtuozzo.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/coroutines.h    |   6 +-
 include/block/block.h |  16 ++--
 block.c               |  73 ---------------
 block/io.c            | 212 ------------------------------------------
 4 files changed, 13 insertions(+), 294 deletions(-)

diff --git a/block/coroutines.h b/block/coroutines.h
index XXXXXXX..XXXXXXX 100644
--- a/block/coroutines.h
+++ b/block/coroutines.h
@@ -XXX,XX +XXX,XX @@ int coroutine_fn bdrv_co_invalidate_cache(BlockDriverState *bs, Error **errp);
 int coroutine_fn
 bdrv_co_prwv(BdrvChild *child, int64_t offset, QEMUIOVector *qiov,
              bool is_write, BdrvRequestFlags flags);
-int
+int generated_co_wrapper
 bdrv_prwv(BdrvChild *child, int64_t offset, QEMUIOVector *qiov,
           bool is_write, BdrvRequestFlags flags);
 
@@ -XXX,XX +XXX,XX @@ bdrv_co_common_block_status_above(BlockDriverState *bs,
                                   int64_t *pnum,
                                   int64_t *map,
                                   BlockDriverState **file);
-int
+int generated_co_wrapper
 bdrv_common_block_status_above(BlockDriverState *bs,
                                BlockDriverState *base,
                                bool want_zero,
@@ -XXX,XX +XXX,XX @@ bdrv_common_block_status_above(BlockDriverState *bs,
 int coroutine_fn
 bdrv_co_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
                    bool is_read);
-int
+int generated_co_wrapper
 bdrv_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
                 bool is_read);
 
diff --git a/include/block/block.h b/include/block/block.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -XXX,XX +XXX,XX @@ void bdrv_refresh_filename(BlockDriverState *bs);
 int coroutine_fn bdrv_co_truncate(BdrvChild *child, int64_t offset, bool exact,
                                   PreallocMode prealloc, BdrvRequestFlags flags,
                                   Error **errp);
-int bdrv_truncate(BdrvChild *child, int64_t offset, bool exact,
-                  PreallocMode prealloc, BdrvRequestFlags flags, Error **errp);
+int generated_co_wrapper
+bdrv_truncate(BdrvChild *child, int64_t offset, bool exact,
+              PreallocMode prealloc, BdrvRequestFlags flags, Error **errp);
 
 int64_t bdrv_nb_sectors(BlockDriverState *bs);
 int64_t bdrv_getlength(BlockDriverState *bs);
@@ -XXX,XX +XXX,XX @@ typedef enum {
     BDRV_FIX_ERRORS   = 2,
 } BdrvCheckMode;
 
-int bdrv_check(BlockDriverState *bs, BdrvCheckResult *res, BdrvCheckMode fix);
+int generated_co_wrapper bdrv_check(BlockDriverState *bs, BdrvCheckResult *res,
+                                    BdrvCheckMode fix);
 
 /* The units of offset and total_work_size may be chosen arbitrarily by the
  * block driver; total_work_size may change during the course of the amendment
@@ -XXX,XX +XXX,XX @@ void bdrv_aio_cancel_async(BlockAIOCB *acb);
 int bdrv_co_ioctl(BlockDriverState *bs, int req, void *buf);
 
 /* Invalidate any cached metadata used by image formats */
-int bdrv_invalidate_cache(BlockDriverState *bs, Error **errp);
+int generated_co_wrapper bdrv_invalidate_cache(BlockDriverState *bs,
+                                               Error **errp);
 void bdrv_invalidate_cache_all(Error **errp);
 int bdrv_inactivate_all(void);
 
 /* Ensure contents are flushed to disk.  */
-int bdrv_flush(BlockDriverState *bs);
+int generated_co_wrapper bdrv_flush(BlockDriverState *bs);
 int coroutine_fn bdrv_co_flush(BlockDriverState *bs);
 int bdrv_flush_all(void);
 void bdrv_close_all(void);
@@ -XXX,XX +XXX,XX @@ void bdrv_drain_all(void);
     AIO_WAIT_WHILE(bdrv_get_aio_context(bs_),  \
                    cond); })
 
-int bdrv_pdiscard(BdrvChild *child, int64_t offset, int64_t bytes);
+int generated_co_wrapper bdrv_pdiscard(BdrvChild *child, int64_t offset,
+                                       int64_t bytes);
 int bdrv_co_pdiscard(BdrvChild *child, int64_t offset, int64_t bytes);
 int bdrv_has_zero_init_1(BlockDriverState *bs);
 int bdrv_has_zero_init(BlockDriverState *bs);
diff --git a/block.c b/block.c
index XXXXXXX..XXXXXXX 100644
--- a/block.c
+++ b/block.c
@@ -XXX,XX +XXX,XX @@ int coroutine_fn bdrv_co_check(BlockDriverState *bs,
     return bs->drv->bdrv_co_check(bs, res, fix);
 }
 
-typedef struct CheckCo {
-    BlockDriverState *bs;
-    BdrvCheckResult *res;
-    BdrvCheckMode fix;
-    int ret;
-} CheckCo;
-
-static void coroutine_fn bdrv_check_co_entry(void *opaque)
-{
-    CheckCo *cco = opaque;
-    cco->ret = bdrv_co_check(cco->bs, cco->res, cco->fix);
-    aio_wait_kick();
-}
-
-int bdrv_check(BlockDriverState *bs,
-               BdrvCheckResult *res, BdrvCheckMode fix)
-{
-    Coroutine *co;
-    CheckCo cco = {
-        .bs = bs,
-        .res = res,
-        .ret = -EINPROGRESS,
-        .fix = fix,
-    };
-
-    if (qemu_in_coroutine()) {
-        /* Fast-path if already in coroutine context */
-        bdrv_check_co_entry(&cco);
-    } else {
-        co = qemu_coroutine_create(bdrv_check_co_entry, &cco);
-        bdrv_coroutine_enter(bs, co);
-        BDRV_POLL_WHILE(bs, cco.ret == -EINPROGRESS);
-    }
-
-    return cco.ret;
-}
-
 /*
  * Return values:
  * 0        - success
@@ -XXX,XX +XXX,XX @@ int coroutine_fn bdrv_co_invalidate_cache(BlockDriverState *bs, Error **errp)
     return 0;
 }
 
-typedef struct InvalidateCacheCo {
-    BlockDriverState *bs;
-    Error **errp;
-    bool done;
-    int ret;
-} InvalidateCacheCo;
-
-static void coroutine_fn bdrv_invalidate_cache_co_entry(void *opaque)
-{
-    InvalidateCacheCo *ico = opaque;
-    ico->ret = bdrv_co_invalidate_cache(ico->bs, ico->errp);
-    ico->done = true;
-    aio_wait_kick();
-}
-
-int bdrv_invalidate_cache(BlockDriverState *bs, Error **errp)
-{
-    Coroutine *co;
-    InvalidateCacheCo ico = {
-        .bs = bs,
-        .done = false,
-        .errp = errp
-    };
-
-    if (qemu_in_coroutine()) {
-        /* Fast-path if already in coroutine context */
-        bdrv_invalidate_cache_co_entry(&ico);
-    } else {
-        co = qemu_coroutine_create(bdrv_invalidate_cache_co_entry, &ico);
-        bdrv_coroutine_enter(bs, co);
-        BDRV_POLL_WHILE(bs, !ico.done);
-    }
-
-    return ico.ret;
-}
-
 void bdrv_invalidate_cache_all(Error **errp)
 {
     BlockDriverState *bs;
diff --git a/block/io.c b/block/io.c
index XXXXXXX..XXXXXXX 100644
--- a/block/io.c
+++ b/block/io.c
@@ -XXX,XX +XXX,XX @@ static int bdrv_check_byte_request(BlockDriverState *bs, int64_t offset,
     return 0;
 }
 
-typedef int coroutine_fn BdrvRequestEntry(void *opaque);
-typedef struct BdrvRunCo {
-    BdrvRequestEntry *entry;
-    void *opaque;
-    int ret;
-    bool done;
-    Coroutine *co; /* Coroutine, running bdrv_run_co_entry, for debugging */
-} BdrvRunCo;
-
-static void coroutine_fn bdrv_run_co_entry(void *opaque)
-{
-    BdrvRunCo *arg = opaque;
-
-    arg->ret = arg->entry(arg->opaque);
-    arg->done = true;
-    aio_wait_kick();
-}
-
-static int bdrv_run_co(BlockDriverState *bs, BdrvRequestEntry *entry,
-                       void *opaque)
-{
-    if (qemu_in_coroutine()) {
-        /* Fast-path if already in coroutine context */
-        return entry(opaque);
-    } else {
-        BdrvRunCo s = { .entry = entry, .opaque = opaque };
-
-        s.co = qemu_coroutine_create(bdrv_run_co_entry, &s);
-        bdrv_coroutine_enter(bs, s.co);
-
-        BDRV_POLL_WHILE(bs, !s.done);
-
-        return s.ret;
-    }
-}
-
-typedef struct RwCo {
-    BdrvChild *child;
-    int64_t offset;
-    QEMUIOVector *qiov;
-    bool is_write;
-    BdrvRequestFlags flags;
-} RwCo;
-
 int coroutine_fn bdrv_co_prwv(BdrvChild *child, int64_t offset,
                               QEMUIOVector *qiov, bool is_write,
                               BdrvRequestFlags flags)
@@ -XXX,XX +XXX,XX @@ int coroutine_fn bdrv_co_prwv(BdrvChild *child, int64_t offset,
     }
 }
 
-static int coroutine_fn bdrv_rw_co_entry(void *opaque)
-{
-    RwCo *rwco = opaque;
-
-    return bdrv_co_prwv(rwco->child, rwco->offset, rwco->qiov,
-                        rwco->is_write, rwco->flags);
-}
-
-/*
- * Process a vectored synchronous request using coroutines
- */
-int bdrv_prwv(BdrvChild *child, int64_t offset,
-              QEMUIOVector *qiov, bool is_write,
-              BdrvRequestFlags flags)
-{
-    RwCo rwco = {
-        .child = child,
-        .offset = offset,
-        .qiov = qiov,
-        .is_write = is_write,
-        .flags = flags,
-    };
-
-    return bdrv_run_co(child->bs, bdrv_rw_co_entry, &rwco);
-}
-
 int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
                        int bytes, BdrvRequestFlags flags)
 {
@@ -XXX,XX +XXX,XX @@ int bdrv_flush_all(void)
     return result;
 }
 
-
-typedef struct BdrvCoBlockStatusData {
-    BlockDriverState *bs;
-    BlockDriverState *base;
-    bool want_zero;
-    int64_t offset;
-    int64_t bytes;
-    int64_t *pnum;
-    int64_t *map;
-    BlockDriverState **file;
-} BdrvCoBlockStatusData;
-
 /*
  * Returns the allocation status of the specified sectors.
  * Drivers not implementing the functionality are assumed to not support
@@ -XXX,XX +XXX,XX @@ bdrv_co_common_block_status_above(BlockDriverState *bs,
     return ret;
 }
 
-/* Coroutine wrapper for bdrv_block_status_above() */
-static int coroutine_fn bdrv_block_status_above_co_entry(void *opaque)
-{
-    BdrvCoBlockStatusData *data = opaque;
-
-    return bdrv_co_common_block_status_above(data->bs, data->base,
-                                             data->want_zero,
-                                             data->offset, data->bytes,
-                                             data->pnum, data->map, data->file);
-}
-
-/*
- * Synchronous wrapper around bdrv_co_block_status_above().
- *
- * See bdrv_co_block_status_above() for details.
- */
-int bdrv_common_block_status_above(BlockDriverState *bs,
-                                   BlockDriverState *base,
-                                   bool want_zero, int64_t offset,
-                                   int64_t bytes, int64_t *pnum,
-                                   int64_t *map,
-                                   BlockDriverState **file)
-{
-    BdrvCoBlockStatusData data = {
-        .bs = bs,
-        .base = base,
-        .want_zero = want_zero,
-        .offset = offset,
-        .bytes = bytes,
-        .pnum = pnum,
-        .map = map,
-        .file = file,
-    };
-
-    return bdrv_run_co(bs, bdrv_block_status_above_co_entry, &data);
-}
-
 int bdrv_block_status_above(BlockDriverState *bs, BlockDriverState *base,
                             int64_t offset, int64_t bytes, int64_t *pnum,
                             int64_t *map, BlockDriverState **file)
@@ -XXX,XX +XXX,XX @@ int bdrv_is_allocated_above(BlockDriverState *top,
     return 0;
 }
 
-typedef struct BdrvVmstateCo {
-    BlockDriverState *bs;
-    QEMUIOVector *qiov;
-    int64_t pos;
-    bool is_read;
-} BdrvVmstateCo;
-
 int coroutine_fn
 bdrv_co_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
                    bool is_read)
@@ -XXX,XX +XXX,XX @@ bdrv_co_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
     return ret;
 }
 
-static int coroutine_fn bdrv_co_rw_vmstate_entry(void *opaque)
-{
-    BdrvVmstateCo *co = opaque;
-
-    return bdrv_co_rw_vmstate(co->bs, co->qiov, co->pos, co->is_read);
-}
-
-int bdrv_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
-                    bool is_read)
-{
-    BdrvVmstateCo data = {
-        .bs = bs,
-        .qiov = qiov,
-        .pos = pos,
-        .is_read = is_read,
-    };
-
-    return bdrv_run_co(bs, bdrv_co_rw_vmstate_entry, &data);
-}
-
 int bdrv_save_vmstate(BlockDriverState *bs, const uint8_t *buf,
                       int64_t pos, int size)
 {
@@ -XXX,XX +XXX,XX @@ void bdrv_aio_cancel_async(BlockAIOCB *acb)
 /**************************************************************/
 /* Coroutine block device emulation */
 
-static int coroutine_fn bdrv_flush_co_entry(void *opaque)
-{
-    return bdrv_co_flush(opaque);
-}
-
 int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
 {
     BdrvChild *primary_child = bdrv_primary_child(bs);
@@ -XXX,XX +XXX,XX @@ early_exit:
     return ret;
 }
 
-int bdrv_flush(BlockDriverState *bs)
-{
-    return bdrv_run_co(bs, bdrv_flush_co_entry, bs);
-}
-
-typedef struct DiscardCo {
-    BdrvChild *child;
-    int64_t offset;
-    int64_t bytes;
-} DiscardCo;
-
-static int coroutine_fn bdrv_pdiscard_co_entry(void *opaque)
-{
-    DiscardCo *rwco = opaque;
-
-    return bdrv_co_pdiscard(rwco->child, rwco->offset, rwco->bytes);
-}
-
 int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset,
                                   int64_t bytes)
 {
@@ -XXX,XX +XXX,XX @@ out:
     return ret;
 }
 
-int bdrv_pdiscard(BdrvChild *child, int64_t offset, int64_t bytes)
-{
-    DiscardCo rwco = {
-        .child = child,
-        .offset = offset,
-        .bytes = bytes,
-    };
-
-    return bdrv_run_co(child->bs, bdrv_pdiscard_co_entry, &rwco);
-}
-
 int bdrv_co_ioctl(BlockDriverState *bs, int req, void *buf)
 {
     BlockDriver *drv = bs->drv;
@@ -XXX,XX +XXX,XX @@ out:
 
     return ret;
 }
-
-typedef struct TruncateCo {
-    BdrvChild *child;
-    int64_t offset;
-    bool exact;
-    PreallocMode prealloc;
-    BdrvRequestFlags flags;
-    Error **errp;
-} TruncateCo;
-
-static int coroutine_fn bdrv_truncate_co_entry(void *opaque)
-{
-    TruncateCo *tco = opaque;
-
-    return bdrv_co_truncate(tco->child, tco->offset, tco->exact,
-                            tco->prealloc, tco->flags, tco->errp);
-}
-
-int bdrv_truncate(BdrvChild *child, int64_t offset, bool exact,
-                  PreallocMode prealloc, BdrvRequestFlags flags, Error **errp)
-{
-    TruncateCo tco = {
-        .child = child,
-        .offset = offset,
-        .exact = exact,
-        .prealloc = prealloc,
-        .flags = flags,
-        .errp = errp,
-    };
-
-    return bdrv_run_co(child->bs, bdrv_truncate_co_entry, &tco);
-}
--
2.26.2

bdrv_pad_request() was the main user of qemu_iovec_init_extended().
HEAD^ has removed that use, so we can remove qemu_iovec_init_extended()
now.

The only remaining user is qemu_iovec_init_slice(), which can easily
inline the small part it really needs.

Note that qemu_iovec_init_extended() offered a memcpy() optimization to
initialize the new I/O vector.  qemu_iovec_concat_iov(), which is used
to replace its functionality, does not, but calls qemu_iovec_add() for
every single element.  If we decide this optimization was important, we
will need to re-implement it in qemu_iovec_concat_iov(), which might
also benefit its pre-existing users.

Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-Id: <20230411173418.19549-4-hreitz@redhat.com>
---
 include/qemu/iov.h |  5 ---
 util/iov.c         | 79 +++++++---------------------------------------
 2 files changed, 11 insertions(+), 73 deletions(-)

diff --git a/include/qemu/iov.h b/include/qemu/iov.h
index XXXXXXX..XXXXXXX 100644
--- a/include/qemu/iov.h
+++ b/include/qemu/iov.h
@@ -XXX,XX +XXX,XX @@ static inline void *qemu_iovec_buf(QEMUIOVector *qiov)
 
 void qemu_iovec_init(QEMUIOVector *qiov, int alloc_hint);
 void qemu_iovec_init_external(QEMUIOVector *qiov, struct iovec *iov, int niov);
-int qemu_iovec_init_extended(
-        QEMUIOVector *qiov,
-        void *head_buf, size_t head_len,
-        QEMUIOVector *mid_qiov, size_t mid_offset, size_t mid_len,
-        void *tail_buf, size_t tail_len);
 void qemu_iovec_init_slice(QEMUIOVector *qiov, QEMUIOVector *source,
                            size_t offset, size_t len);
 struct iovec *qemu_iovec_slice(QEMUIOVector *qiov,
diff --git a/util/iov.c b/util/iov.c
index XXXXXXX..XXXXXXX 100644
--- a/util/iov.c
+++ b/util/iov.c
@@ -XXX,XX +XXX,XX @@ int qemu_iovec_subvec_niov(QEMUIOVector *qiov, size_t offset, size_t len)
     return niov;
 }
 
-/*
- * Compile new iovec, combining @head_buf buffer, sub-qiov of @mid_qiov,
- * and @tail_buf buffer into new qiov.
- */
-int qemu_iovec_init_extended(
-        QEMUIOVector *qiov,
-        void *head_buf, size_t head_len,
-        QEMUIOVector *mid_qiov, size_t mid_offset, size_t mid_len,
-        void *tail_buf, size_t tail_len)
-{
-    size_t mid_head, mid_tail;
-    int total_niov, mid_niov = 0;
-    struct iovec *p, *mid_iov = NULL;
-
-    assert(mid_qiov->niov <= IOV_MAX);
-
-    if (SIZE_MAX - head_len < mid_len ||
-        SIZE_MAX - head_len - mid_len < tail_len)
-    {
-        return -EINVAL;
-    }
-
-    if (mid_len) {
-        mid_iov = qemu_iovec_slice(mid_qiov, mid_offset, mid_len,
-                                   &mid_head, &mid_tail, &mid_niov);
-    }
-
-    total_niov = !!head_len + mid_niov + !!tail_len;
-    if (total_niov > IOV_MAX) {
-        return -EINVAL;
-    }
-
-    if (total_niov == 1) {
-        qemu_iovec_init_buf(qiov, NULL, 0);
-        p = &qiov->local_iov;
-    } else {
-        qiov->niov = qiov->nalloc = total_niov;
-        qiov->size = head_len + mid_len + tail_len;
-        p = qiov->iov = g_new(struct iovec, qiov->niov);
-    }
-
-    if (head_len) {
-        p->iov_base = head_buf;
-        p->iov_len = head_len;
-        p++;
-    }
-
-    assert(!mid_niov == !mid_len);
-    if (mid_niov) {
-        memcpy(p, mid_iov, mid_niov * sizeof(*p));
-        p[0].iov_base = (uint8_t *)p[0].iov_base + mid_head;
-        p[0].iov_len -= mid_head;
-        p[mid_niov - 1].iov_len -= mid_tail;
-        p += mid_niov;
-    }
-
-    if (tail_len) {
-        p->iov_base = tail_buf;
-        p->iov_len = tail_len;
-    }
-
-    return 0;
-}
-
 /*
  * Check if the contents of subrange of qiov data is all zeroes.
  */
@@ -XXX,XX +XXX,XX @@ bool qemu_iovec_is_zero(QEMUIOVector *qiov, size_t offset, size_t bytes)
 void qemu_iovec_init_slice(QEMUIOVector *qiov, QEMUIOVector *source,
                            size_t offset, size_t len)
 {
-    int ret;
+    struct iovec *slice_iov;
+    int slice_niov;
+    size_t slice_head, slice_tail;
 
     assert(source->size >= len);
     assert(source->size - len >= offset);
 
-    /* We shrink the request, so we can't overflow neither size_t nor MAX_IOV */
-    ret = qemu_iovec_init_extended(qiov, NULL, 0, source, offset, len, NULL, 0);
-    assert(ret == 0);
+    slice_iov = qemu_iovec_slice(source, offset, len,
+                                 &slice_head, &slice_tail, &slice_niov);
+    if (slice_niov == 1) {
+        qemu_iovec_init_buf(qiov, slice_iov[0].iov_base + slice_head, len);
+    } else {
+        qemu_iovec_init(qiov, slice_niov);
+        qemu_iovec_concat_iov(qiov, slice_iov, slice_niov, slice_head, len);
+    }
 }
 
 void qemu_iovec_destroy(QEMUIOVector *qiov)
--
2.40.1
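qemu_iovec_slice(), which the patch above keeps as the sole slicing primitive, conceptually has to find where a byte range begins inside an I/O vector. A minimal standalone sketch of that lookup (hypothetical `iov_locate()`; this is not the real qemu_iovec_slice() signature, which also returns the tail remainder and element count):

```c
#include <stddef.h>
#include <sys/uio.h>

/* Sketch, not QEMU code: find the element of @iov in which byte @offset
 * falls. On success, *first is that element's index and *head is the
 * number of bytes to skip inside it; returns -1 if @offset lies past
 * the end of the vector. */
static int iov_locate(const struct iovec *iov, int niov,
                      size_t offset, int *first, size_t *head)
{
    for (int i = 0; i < niov; i++) {
        if (offset < iov[i].iov_len) {
            *first = i;
            *head = offset;
            return 0;
        }
        offset -= iov[i].iov_len; /* skip this element entirely */
    }
    return -1;
}
```

With elements of 512, 512, and 4096 bytes, offset 700 lands 188 bytes into the second element; the real function additionally trims the tail of the last element so the slice covers exactly `len` bytes.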
1
From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
1
Test that even vectored IO requests with 1024 vector elements that are
2
not aligned to the device's request alignment will succeed.
2
3
3
We have a very frequent pattern of creating a coroutine from a function
4
Reviewed-by: Eric Blake <eblake@redhat.com>
4
with several arguments:
5
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
6
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
7
Message-Id: <20230411173418.19549-5-hreitz@redhat.com>
8
---
9
tests/qemu-iotests/tests/iov-padding | 85 ++++++++++++++++++++++++
10
tests/qemu-iotests/tests/iov-padding.out | 59 ++++++++++++++++
11
2 files changed, 144 insertions(+)
12
create mode 100755 tests/qemu-iotests/tests/iov-padding
13
create mode 100644 tests/qemu-iotests/tests/iov-padding.out
5
14
6
- create a structure to pack parameters
15
diff --git a/tests/qemu-iotests/tests/iov-padding b/tests/qemu-iotests/tests/iov-padding
7
- create _entry function to call original function taking parameters
16
new file mode 100755
8
from struct
17
index XXXXXXX..XXXXXXX
9
- do different magic to handle completion: set ret to NOT_DONE or
18
--- /dev/null
10
EINPROGRESS or use separate bool field
19
+++ b/tests/qemu-iotests/tests/iov-padding
11
- fill the struct and create coroutine from _entry function with this
20
@@ -XXX,XX +XXX,XX @@
12
struct as a parameter
21
+#!/usr/bin/env bash
13
- do coroutine enter and BDRV_POLL_WHILE loop
22
+# group: rw quick
14
23
+#
15
Let's reduce code duplication by generating coroutine wrappers.
24
+# Check the interaction of request padding (to fit alignment restrictions) with
16
25
+# vectored I/O from the guest
17
This patch adds scripts/block-coroutine-wrapper.py together with some
26
+#
18
friends, which will generate functions with declared prototypes marked
27
+# Copyright Red Hat
19
by the 'generated_co_wrapper' specifier.
28
+#
20
29
+# This program is free software; you can redistribute it and/or modify
21
The usage of new code generation is as follows:
30
+# it under the terms of the GNU General Public License as published by
22
31
+# the Free Software Foundation; either version 2 of the License, or
23
1. define the coroutine function somewhere
32
+# (at your option) any later version.
24
33
+#
25
int coroutine_fn bdrv_co_NAME(...) {...}
34
+# This program is distributed in the hope that it will be useful,
26
35
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
27
2. declare in some header file
36
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
28
37
+# GNU General Public License for more details.
29
int generated_co_wrapper bdrv_NAME(...);
38
+#
30
39
+# You should have received a copy of the GNU General Public License
31
with same list of parameters (generated_co_wrapper is
40
+# along with this program. If not, see <http://www.gnu.org/licenses/>.
32
defined in "include/block/block.h").
41
+#
33
42
+
34
3. Make sure the block_gen_c declaration in block/meson.build
43
+seq=$(basename $0)
35
mentions the file with your marker function.
44
+echo "QA output created by $seq"
36
45
+
37
Still, no function is now marked, this work is for the following
46
+status=1    # failure is the default!
38
commit.
47
+
39
48
+_cleanup()
40
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
49
+{
41
Reviewed-by: Eric Blake <eblake@redhat.com>
50
+ _cleanup_test_img
42
Message-Id: <20200924185414.28642-5-vsementsov@virtuozzo.com>
51
+}
43
[Added encoding='utf-8' to open() calls as requested by Vladimir. Fixed
52
+trap "_cleanup; exit \$status" 0 1 2 3 15
44
typo and grammar issues pointed out by Eric Blake.
53
+
45
--Stefan]
54
+# get standard environment, filters and checks
46
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
55
+cd ..
47
---
56
+. ./common.rc
48
block/block-gen.h | 49 +++++++
57
+. ./common.filter
49
include/block/block.h | 10 ++
58
+
50
block/meson.build | 8 ++
59
+_supported_fmt raw
51
docs/devel/block-coroutine-wrapper.rst | 54 +++++++
60
+_supported_proto file
52
docs/devel/index.rst | 1 +
61
+
53
scripts/block-coroutine-wrapper.py | 188 +++++++++++++++++++++++++
62
+_make_test_img 1M
54
6 files changed, 310 insertions(+)
63
+
55
create mode 100644 block/block-gen.h
64
+IMGSPEC="driver=blkdebug,align=4096,image.driver=file,image.filename=$TEST_IMG"
56
create mode 100644 docs/devel/block-coroutine-wrapper.rst
65
+
57
create mode 100644 scripts/block-coroutine-wrapper.py
66
+# Four combinations:
58
67
+# - Offset 4096, length 1023 * 512 + 512: Fully aligned to 4k
59
diff --git a/block/block-gen.h b/block/block-gen.h
68
+# - Offset 4096, length 1023 * 512 + 4096: Head is aligned, tail is not
69
+# - Offset 512, length 1023 * 512 + 512: Neither head nor tail are aligned
70
+# - Offset 512, length 1023 * 512 + 4096: Tail is aligned, head is not
71
+for start_offset in 4096 512; do
72
+ for last_element_length in 512 4096; do
73
+ length=$((1023 * 512 + $last_element_length))
74
+
75
+ echo
76
+ echo "== performing 1024-element vectored requests to image (offset: $start_offset; length: $length) =="
77
+
78
+ # Fill with data for testing
79
+ $QEMU_IO -c 'write -P 1 0 1M' "$TEST_IMG" | _filter_qemu_io
80
+
81
+ # 1023 512-byte buffers, and then one with length $last_element_length
82
+ cmd_params="-P 2 $start_offset $(yes 512 | head -n 1023 | tr '\n' ' ') $last_element_length"
83
+ QEMU_IO_OPTIONS="$QEMU_IO_OPTIONS_NO_FMT" $QEMU_IO \
84
+ -c "writev $cmd_params" \
85
+ --image-opts \
86
+ "$IMGSPEC" \
87
+ | _filter_qemu_io
88
+
89
+ # Read all patterns -- read the part we just wrote with writev twice,
90
+ # once "normally", and once with a readv, so we see that that works, too
91
+ QEMU_IO_OPTIONS="$QEMU_IO_OPTIONS_NO_FMT" $QEMU_IO \
92
+ -c "read -P 1 0 $start_offset" \
93
+ -c "read -P 2 $start_offset $length" \
94
+ -c "readv $cmd_params" \
95
+ -c "read -P 1 $((start_offset + length)) $((1024 * 1024 - length - start_offset))" \
96
+ --image-opts \
97
+ "$IMGSPEC" \
98
+ | _filter_qemu_io
99
+ done
100
+done
101
+
102
+# success, all done
103
+echo "*** done"
104
+rm -f $seq.full
105
+status=0
106
diff --git a/tests/qemu-iotests/tests/iov-padding.out b/tests/qemu-iotests/tests/iov-padding.out
60
new file mode 100644
107
new file mode 100644
61
index XXXXXXX..XXXXXXX
108
index XXXXXXX..XXXXXXX
62
--- /dev/null
109
--- /dev/null
63
+++ b/block/block-gen.h
110
+++ b/tests/qemu-iotests/tests/iov-padding.out
64
@@ -XXX,XX +XXX,XX @@
111
@@ -XXX,XX +XXX,XX @@
65
+/*
112
+QA output created by iov-padding
66
+ * Block coroutine wrapping core, used by auto-generated block/block-gen.c
113
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576
67
+ *
68
+ * Copyright (c) 2003 Fabrice Bellard
69
+ * Copyright (c) 2020 Virtuozzo International GmbH
70
+ *
71
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
72
+ * of this software and associated documentation files (the "Software"), to deal
73
+ * in the Software without restriction, including without limitation the rights
74
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
75
+ * copies of the Software, and to permit persons to whom the Software is
76
+ * furnished to do so, subject to the following conditions:
77
+ *
78
+ * The above copyright notice and this permission notice shall be included in
79
+ * all copies or substantial portions of the Software.
80
+ *
81
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
82
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
83
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
84
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
85
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
86
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
87
+ * THE SOFTWARE.
88
+ */
89
+
114
+
90
+#ifndef BLOCK_BLOCK_GEN_H
115
+== performing 1024-element vectored requests to image (offset: 4096; length: 524288) ==
91
+#define BLOCK_BLOCK_GEN_H
116
+wrote 1048576/1048576 bytes at offset 0
117
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
118
+wrote 524288/524288 bytes at offset 4096
119
+512 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
120
+read 4096/4096 bytes at offset 0
121
+4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
122
+read 524288/524288 bytes at offset 4096
123
+512 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
124
+read 524288/524288 bytes at offset 4096
125
+512 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
126
+read 520192/520192 bytes at offset 528384
127
+508 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
92
+
128
+
93
+#include "block/block_int.h"
129
+== performing 1024-element vectored requests to image (offset: 4096; length: 527872) ==
130
+wrote 1048576/1048576 bytes at offset 0
131
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
132
+wrote 527872/527872 bytes at offset 4096
133
+515.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
134
+read 4096/4096 bytes at offset 0
135
+4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
136
+read 527872/527872 bytes at offset 4096
137
+515.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
138
+read 527872/527872 bytes at offset 4096
139
+515.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
140
+read 516608/516608 bytes at offset 531968
141
+504.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
94
+
142
+
95
+/* Base structure for argument packing structures */
143
+== performing 1024-element vectored requests to image (offset: 512; length: 524288) ==
96
+typedef struct BdrvPollCo {
144
+wrote 1048576/1048576 bytes at offset 0
97
+ BlockDriverState *bs;
145
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
98
+ bool in_progress;
146
+wrote 524288/524288 bytes at offset 512
99
+ int ret;
147
+512 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
100
+ Coroutine *co; /* Keep pointer here for debugging */
148
+read 512/512 bytes at offset 0
101
+} BdrvPollCo;
149
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
150
+read 524288/524288 bytes at offset 512
151
+512 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
152
+read 524288/524288 bytes at offset 512
153
+512 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
154
+read 523776/523776 bytes at offset 524800
155
+511.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
102
+
156
+
103
+static inline int bdrv_poll_co(BdrvPollCo *s)
157
+== performing 1024-element vectored requests to image (offset: 512; length: 527872) ==
104
+{
158
+wrote 1048576/1048576 bytes at offset 0
105
+ assert(!qemu_in_coroutine());
159
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
106
+
160
+wrote 527872/527872 bytes at offset 512
107
+ bdrv_coroutine_enter(s->bs, s->co);
161
+515.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
108
+ BDRV_POLL_WHILE(s->bs, s->in_progress);
162
+read 512/512 bytes at offset 0
109
+
163
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
110
+ return s->ret;
164
+read 527872/527872 bytes at offset 512
111
+}
165
+515.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
112
+
166
+read 527872/527872 bytes at offset 512
113
+#endif /* BLOCK_BLOCK_GEN_H */
167
+515.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
114
diff --git a/include/block/block.h b/include/block/block.h
168
+read 520192/520192 bytes at offset 528384
115
index XXXXXXX..XXXXXXX 100644
169
+508 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
116
--- a/include/block/block.h
170
+*** done
117
+++ b/include/block/block.h
118
@@ -XXX,XX +XXX,XX @@
119
#include "block/blockjob.h"
120
#include "qemu/hbitmap.h"
121
122
+/*
123
+ * generated_co_wrapper
124
+ *
125
+ * Function specifier, which does nothing but mark functions to be
126
+ * generated by scripts/block-coroutine-wrapper.py
127
+ *
128
+ * Read more in docs/devel/block-coroutine-wrapper.rst
129
+ */
130
+#define generated_co_wrapper
131
+
132
/* block.c */
133
typedef struct BlockDriver BlockDriver;
134
typedef struct BdrvChild BdrvChild;
135
diff --git a/block/meson.build b/block/meson.build
136
index XXXXXXX..XXXXXXX 100644
137
--- a/block/meson.build
138
+++ b/block/meson.build
139
@@ -XXX,XX +XXX,XX @@ module_block_h = custom_target('module_block.h',
140
command: [module_block_py, '@OUTPUT0@', modsrc])
141
block_ss.add(module_block_h)
142
143
+wrapper_py = find_program('../scripts/block-coroutine-wrapper.py')
144
+block_gen_c = custom_target('block-gen.c',
145
+ output: 'block-gen.c',
146
+ input: files('../include/block/block.h',
147
+ 'coroutines.h'),
148
+ command: [wrapper_py, '@OUTPUT@', '@INPUT@'])
149
+block_ss.add(block_gen_c)
150
+
151
block_ss.add(files('stream.c'))
152
153
softmmu_ss.add(files('qapi-sysemu.c'))
154
diff --git a/docs/devel/block-coroutine-wrapper.rst b/docs/devel/block-coroutine-wrapper.rst
155
new file mode 100644
156
index XXXXXXX..XXXXXXX
157
--- /dev/null
158
+++ b/docs/devel/block-coroutine-wrapper.rst
159
@@ -XXX,XX +XXX,XX @@
160
+=======================
161
+block-coroutine-wrapper
162
+=======================
163
+
164
+A lot of functions in QEMU block layer (see ``block/*``) can only be
165
+called in coroutine context. Such functions are normally marked by the
166
+coroutine_fn specifier. Still, sometimes we need to call them from
167
+non-coroutine context; for this we need to start a coroutine, run the
168
+needed function from it and wait for the coroutine to finish in a
169
+BDRV_POLL_WHILE() loop. To run a coroutine we need a function with one
170
+void* argument. So for each coroutine_fn function which needs a
171
+non-coroutine interface, we should define a structure to pack the
172
+parameters, define a separate function to unpack the parameters and
173
+call the original function and finally define a new interface function
174
+with the same list of arguments as the original one, which will pack the
175
+parameters into a struct, create a coroutine, run it and wait in
176
+BDRV_POLL_WHILE() loop. It's boring to create such wrappers by hand,
177
+so we have a script to generate them.
178
+
179
+Usage
180
+=====
181
+
182
+Assume we have defined the ``coroutine_fn`` function
183
+``bdrv_co_foo(<some args>)`` and need a non-coroutine interface for it,
184
+called ``bdrv_foo(<same args>)``. In this case the script can help. To
185
+trigger the generation:
186
+
187
+1. You need a ``bdrv_foo`` declaration somewhere (for example, in
188
+ ``block/coroutines.h``) with the ``generated_co_wrapper`` mark,
189
+ like this:
190
+
191
+.. code-block:: c
192
+
193
+ int generated_co_wrapper bdrv_foo(<some args>);
194
+
195
+2. You need to feed this declaration to the block-coroutine-wrapper script.
196
+ For this, add the .h (or .c) file with the declaration to the
197
+ ``input: files(...)`` list of ``block_gen_c`` target declaration in
198
+ ``block/meson.build``
199
+
200
+You are done. During the build, coroutine wrappers will be generated in
201
+``<BUILD_DIR>/block/block-gen.c``.
202
+
203
+Links
204
+=====
205
+
206
+1. The script location is ``scripts/block-coroutine-wrapper.py``.
207
+
208
+2. Generic place for private ``generated_co_wrapper`` declarations is
209
+ ``block/coroutines.h``, for public declarations:
210
+ ``include/block/block.h``
211
+
212
+3. The core API of generated coroutine wrappers is placed in
213
+ (not generated) ``block/block-gen.h``
214
diff --git a/docs/devel/index.rst b/docs/devel/index.rst
215
index XXXXXXX..XXXXXXX 100644
216
--- a/docs/devel/index.rst
217
+++ b/docs/devel/index.rst
218
@@ -XXX,XX +XXX,XX @@ Contents:
219
reset
220
s390-dasd-ipl
221
clocks
222
+ block-coroutine-wrapper
223
diff --git a/scripts/block-coroutine-wrapper.py b/scripts/block-coroutine-wrapper.py
224
new file mode 100644
225
index XXXXXXX..XXXXXXX
226
--- /dev/null
227
+++ b/scripts/block-coroutine-wrapper.py
228
@@ -XXX,XX +XXX,XX @@
229
+#! /usr/bin/env python3
230
+"""Generate coroutine wrappers for block subsystem.
231
+
232
+The program parses one or several concatenated C files given on the
233
+command line, searches for functions with the 'generated_co_wrapper'
234
+specifier and generates corresponding wrappers in the output file.
235
+
236
+Usage: block-coroutine-wrapper.py generated-file.c FILE.[ch]...
237
+
238
+Copyright (c) 2020 Virtuozzo International GmbH.
239
+
240
+This program is free software; you can redistribute it and/or modify
241
+it under the terms of the GNU General Public License as published by
242
+the Free Software Foundation; either version 2 of the License, or
243
+(at your option) any later version.
244
+
245
+This program is distributed in the hope that it will be useful,
246
+but WITHOUT ANY WARRANTY; without even the implied warranty of
247
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
248
+GNU General Public License for more details.
249
+
250
+You should have received a copy of the GNU General Public License
251
+along with this program. If not, see <http://www.gnu.org/licenses/>.
252
+"""
253
+
254
+import sys
255
+import re
256
+import subprocess
257
+import json
258
+from typing import Iterator
259
+
260
+
261
+def prettify(code: str) -> str:
262
+ """Prettify code using clang-format if available"""
263
+
264
+ try:
265
+ style = json.dumps({
266
+ 'IndentWidth': 4,
267
+ 'BraceWrapping': {'AfterFunction': True},
268
+ 'BreakBeforeBraces': 'Custom',
269
+ 'SortIncludes': False,
270
+ 'MaxEmptyLinesToKeep': 2,
271
+ })
272
+ p = subprocess.run(['clang-format', f'-style={style}'], check=True,
273
+ encoding='utf-8', input=code,
274
+ stdout=subprocess.PIPE)
275
+ return p.stdout
276
+ except FileNotFoundError:
277
+ return code
278
+
279
+
280
+def gen_header():
281
+ copyright = re.sub('^.*Copyright', 'Copyright', __doc__, flags=re.DOTALL)
282
+ copyright = re.sub('^(?=.)', ' * ', copyright.strip(), flags=re.MULTILINE)
283
+ copyright = re.sub('^$', ' *', copyright, flags=re.MULTILINE)
284
+ return f"""\
285
+/*
286
+ * File is generated by scripts/block-coroutine-wrapper.py
287
+ *
288
+{copyright}
289
+ */
290
+
291
+#include "qemu/osdep.h"
292
+#include "block/coroutines.h"
293
+#include "block/block-gen.h"
294
+#include "block/block_int.h"\
295
+"""
296
+
297
+
298
+class ParamDecl:
299
+ param_re = re.compile(r'(?P<decl>'
300
+ r'(?P<type>.*[ *])'
301
+ r'(?P<name>[a-z][a-z0-9_]*)'
302
+ r')')
303
+
304
+ def __init__(self, param_decl: str) -> None:
305
+ m = self.param_re.match(param_decl.strip())
306
+ if m is None:
307
+ raise ValueError(f'Wrong parameter declaration: "{param_decl}"')
308
+ self.decl = m.group('decl')
309
+ self.type = m.group('type')
310
+ self.name = m.group('name')
311
+
312
+
313
+class FuncDecl:
314
+ def __init__(self, return_type: str, name: str, args: str) -> None:
315
+ self.return_type = return_type.strip()
316
+ self.name = name.strip()
317
+ self.args = [ParamDecl(arg.strip()) for arg in args.split(',')]
318
+
319
+ def gen_list(self, format: str) -> str:
320
+ return ', '.join(format.format_map(arg.__dict__) for arg in self.args)
321
+
322
+ def gen_block(self, format: str) -> str:
323
+ return '\n'.join(format.format_map(arg.__dict__) for arg in self.args)
324
+
325
+
326
+# Match wrappers declared with a generated_co_wrapper mark
327
+func_decl_re = re.compile(r'^int\s*generated_co_wrapper\s*'
328
+ r'(?P<wrapper_name>[a-z][a-z0-9_]*)'
329
+ r'\((?P<args>[^)]*)\);$', re.MULTILINE)
330
+
331
+
332
+def func_decl_iter(text: str) -> Iterator:
333
+ for m in func_decl_re.finditer(text):
334
+ yield FuncDecl(return_type='int',
335
+ name=m.group('wrapper_name'),
336
+ args=m.group('args'))
337
+
338
+
339
+def snake_to_camel(func_name: str) -> str:
340
+ """
341
+ Convert underscore names like 'some_function_name' to camel-case like
342
+ 'SomeFunctionName'
343
+ """
344
+ words = func_name.split('_')
345
+ words = [w[0].upper() + w[1:] for w in words]
346
+ return ''.join(words)
347
+
348
+
349
+def gen_wrapper(func: FuncDecl) -> str:
350
+ assert func.name.startswith('bdrv_')
351
+ assert not func.name.startswith('bdrv_co_')
352
+ assert func.return_type == 'int'
353
+ assert func.args[0].type in ['BlockDriverState *', 'BdrvChild *']
354
+
355
+ name = 'bdrv_co_' + func.name[5:]
356
+ bs = 'bs' if func.args[0].type == 'BlockDriverState *' else 'child->bs'
357
+ struct_name = snake_to_camel(name)
358
+
359
+ return f"""\
360
+/*
361
+ * Wrappers for {name}
362
+ */
363
+
364
+typedef struct {struct_name} {{
365
+ BdrvPollCo poll_state;
366
+{ func.gen_block(' {decl};') }
367
+}} {struct_name};
368
+
369
+static void coroutine_fn {name}_entry(void *opaque)
370
+{{
371
+ {struct_name} *s = opaque;
372
+
373
+ s->poll_state.ret = {name}({ func.gen_list('s->{name}') });
374
+ s->poll_state.in_progress = false;
375
+
376
+ aio_wait_kick();
377
+}}
378
+
379
+int {func.name}({ func.gen_list('{decl}') })
380
+{{
381
+ if (qemu_in_coroutine()) {{
382
+ return {name}({ func.gen_list('{name}') });
383
+ }} else {{
384
+ {struct_name} s = {{
385
+ .poll_state.bs = {bs},
386
+ .poll_state.in_progress = true,
387
+
388
+{ func.gen_block(' .{name} = {name},') }
389
+ }};
390
+
391
+ s.poll_state.co = qemu_coroutine_create({name}_entry, &s);
392
+
393
+ return bdrv_poll_co(&s.poll_state);
394
+ }}
395
+}}"""
396
+
397
+
398
+def gen_wrappers(input_code: str) -> str:
399
+ res = ''
400
+ for func in func_decl_iter(input_code):
401
+ res += '\n\n\n'
402
+ res += gen_wrapper(func)
403
+
404
+ return prettify(res) # prettify to wrap long lines
405
+
406
+
407
+if __name__ == '__main__':
408
+ if len(sys.argv) < 3:
409
+ exit(f'Usage: {sys.argv[0]} OUT_FILE.c IN_FILE.[ch]...')
410
+
411
+ with open(sys.argv[1], 'w', encoding='utf-8') as f_out:
412
+ f_out.write(gen_header())
413
+ for fname in sys.argv[2:]:
414
+ with open(fname, encoding='utf-8') as f_in:
415
+ f_out.write(gen_wrappers(f_in.read()))
416
+ f_out.write('\n')
417
--
171
--
418
2.26.2
172
2.40.1
419
diff view generated by jsdifflib
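As an aside on the naming convention the generator applies: the declaration regex and ``snake_to_camel()`` from the script above can be exercised standalone. The sketch below reuses those two pieces verbatim; ``bdrv_foo`` is the same hypothetical function used in the documentation patch.

```python
import re

# Declaration pattern from scripts/block-coroutine-wrapper.py: an 'int'
# function marked with the generated_co_wrapper specifier.
func_decl_re = re.compile(r'^int\s*generated_co_wrapper\s*'
                          r'(?P<wrapper_name>[a-z][a-z0-9_]*)'
                          r'\((?P<args>[^)]*)\);$', re.MULTILINE)


def snake_to_camel(func_name: str) -> str:
    """Convert 'some_function_name' to 'SomeFunctionName'."""
    return ''.join(w[0].upper() + w[1:] for w in func_name.split('_'))


decl = 'int generated_co_wrapper bdrv_foo(BlockDriverState *bs, int flags);'
m = func_decl_re.match(decl)
assert m is not None
# The wrapper bdrv_foo() calls the coroutine version bdrv_co_foo(),
# whose parameters are packed into a struct named BdrvCoFoo.
co_name = 'bdrv_co_' + m.group('wrapper_name')[len('bdrv_'):]
struct_name = snake_to_camel(co_name)
print(co_name, struct_name)  # bdrv_co_foo BdrvCoFoo
```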
1
From: Philippe Mathieu-Daudé <philmd@redhat.com>
1
From: Alexander Ivanov <alexander.ivanov@virtuozzo.com>
2
2
3
Use the NVMe register definitions from "block/nvme.h" which
3
The data_end field in BDRVParallelsState is set to the biggest offset present
4
eases reviewing the code a bit while matching it against the datasheet.
4
in BAT. If this offset is outside of the image, any further write will
5
create the cluster at this offset and/or the image will be truncated to
6
this offset on close. This is definitely not correct.
5
7
6
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
8
Raise an error in parallels_open() if data_end points outside the image
7
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
9
and it is not a check (let the check repair the image). Set data_end
8
Message-Id: <20200922083821.578519-6-philmd@redhat.com>
10
to the end of the cluster with the last correct offset.
11
12
Signed-off-by: Alexander Ivanov <alexander.ivanov@virtuozzo.com>
13
Message-Id: <20230424093147.197643-2-alexander.ivanov@virtuozzo.com>
14
Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
15
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
9
---
16
---
10
block/nvme.c | 21 +++++++++++----------
17
block/parallels.c | 17 +++++++++++++++++
11
1 file changed, 11 insertions(+), 10 deletions(-)
18
1 file changed, 17 insertions(+)
12
19
13
diff --git a/block/nvme.c b/block/nvme.c
20
diff --git a/block/parallels.c b/block/parallels.c
14
index XXXXXXX..XXXXXXX 100644
21
index XXXXXXX..XXXXXXX 100644
15
--- a/block/nvme.c
22
--- a/block/parallels.c
16
+++ b/block/nvme.c
23
+++ b/block/parallels.c
17
@@ -XXX,XX +XXX,XX @@ static int nvme_init(BlockDriverState *bs, const char *device, int namespace,
24
@@ -XXX,XX +XXX,XX @@ static int parallels_open(BlockDriverState *bs, QDict *options, int flags,
18
* Initialization". */
25
BDRVParallelsState *s = bs->opaque;
19
26
ParallelsHeader ph;
20
cap = le64_to_cpu(regs->cap);
27
int ret, size, i;
21
- if (!(cap & (1ULL << 37))) {
28
+ int64_t file_nb_sectors;
22
+ if (!NVME_CAP_CSS(cap)) {
29
QemuOpts *opts = NULL;
23
error_setg(errp, "Device doesn't support NVMe command set");
30
Error *local_err = NULL;
24
ret = -EINVAL;
31
char *buf;
25
goto out;
32
@@ -XXX,XX +XXX,XX @@ static int parallels_open(BlockDriverState *bs, QDict *options, int flags,
33
return ret;
26
}
34
}
27
35
28
- s->page_size = MAX(4096, 1 << (12 + ((cap >> 48) & 0xF)));
36
+ file_nb_sectors = bdrv_nb_sectors(bs->file->bs);
29
- s->doorbell_scale = (4 << (((cap >> 32) & 0xF))) / sizeof(uint32_t);
37
+ if (file_nb_sectors < 0) {
30
+ s->page_size = MAX(4096, 1 << NVME_CAP_MPSMIN(cap));
38
+ return -EINVAL;
31
+ s->doorbell_scale = (4 << NVME_CAP_DSTRD(cap)) / sizeof(uint32_t);
39
+ }
32
bs->bl.opt_mem_alignment = s->page_size;
40
+
33
- timeout_ms = MIN(500 * ((cap >> 24) & 0xFF), 30000);
41
ret = bdrv_pread(bs->file, 0, sizeof(ph), &ph, 0);
34
+ timeout_ms = MIN(500 * NVME_CAP_TO(cap), 30000);
42
if (ret < 0) {
35
43
goto fail;
36
/* Reset device to get a clean state. */
44
@@ -XXX,XX +XXX,XX @@ static int parallels_open(BlockDriverState *bs, QDict *options, int flags,
37
regs->cc = cpu_to_le32(le32_to_cpu(regs->cc) & 0xFE);
45
38
/* Wait for CSTS.RDY = 0. */
46
for (i = 0; i < s->bat_size; i++) {
39
deadline = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) + timeout_ms * SCALE_MS;
47
int64_t off = bat2sect(s, i);
40
- while (le32_to_cpu(regs->csts) & 0x1) {
48
+ if (off >= file_nb_sectors) {
41
+ while (NVME_CSTS_RDY(le32_to_cpu(regs->csts))) {
49
+ if (flags & BDRV_O_CHECK) {
42
if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) > deadline) {
50
+ continue;
43
error_setg(errp, "Timeout while waiting for device to reset (%"
51
+ }
44
PRId64 " ms)",
52
+ error_setg(errp, "parallels: Offset %" PRIi64 " in BAT[%d] entry "
45
@@ -XXX,XX +XXX,XX @@ static int nvme_init(BlockDriverState *bs, const char *device, int namespace,
53
+ "is larger than file size (%" PRIi64 ")",
46
}
54
+ off << BDRV_SECTOR_BITS, i,
47
s->nr_queues = 1;
55
+ file_nb_sectors << BDRV_SECTOR_BITS);
48
QEMU_BUILD_BUG_ON(NVME_QUEUE_SIZE & 0xF000);
56
+ ret = -EINVAL;
49
- regs->aqa = cpu_to_le32((NVME_QUEUE_SIZE << 16) | NVME_QUEUE_SIZE);
57
+ goto fail;
50
+ regs->aqa = cpu_to_le32((NVME_QUEUE_SIZE << AQA_ACQS_SHIFT) |
58
+ }
51
+ (NVME_QUEUE_SIZE << AQA_ASQS_SHIFT));
59
if (off >= s->data_end) {
52
regs->asq = cpu_to_le64(s->queues[INDEX_ADMIN]->sq.iova);
60
s->data_end = off + s->tracks;
53
regs->acq = cpu_to_le64(s->queues[INDEX_ADMIN]->cq.iova);
61
}
54
55
/* After setting up all control registers we can enable device now. */
56
- regs->cc = cpu_to_le32((ctz32(NVME_CQ_ENTRY_BYTES) << 20) |
57
- (ctz32(NVME_SQ_ENTRY_BYTES) << 16) |
58
- 0x1);
59
+ regs->cc = cpu_to_le32((ctz32(NVME_CQ_ENTRY_BYTES) << CC_IOCQES_SHIFT) |
60
+ (ctz32(NVME_SQ_ENTRY_BYTES) << CC_IOSQES_SHIFT) |
61
+ CC_EN_MASK);
62
/* Wait for CSTS.RDY = 1. */
63
now = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
64
deadline = now + timeout_ms * 1000000;
65
- while (!(le32_to_cpu(regs->csts) & 0x1)) {
66
+ while (!NVME_CSTS_RDY(le32_to_cpu(regs->csts))) {
67
if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) > deadline) {
68
error_setg(errp, "Timeout while waiting for device to start (%"
69
PRId64 " ms)",
70
--
62
--
71
2.26.2
63
2.40.1
72
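The failure mode fixed above can be modeled in a few lines of Python (a sketch with a hypothetical function name, not QEMU code): every non-zero BAT entry is a sector offset into the image file, and an entry at or beyond the file end must be rejected rather than silently inflating data_end.

```python
def parallels_check_bat(bat_sectors, file_nb_sectors, tracks):
    """Hypothetical model of the parallels_open() check above: non-zero
    BAT entries are sector offsets into the image file; entries at or
    beyond the file end are errors.  'tracks' is the cluster size in
    sectors; returns data_end, the end of the last allocated cluster."""
    data_end = 0
    for i, off in enumerate(bat_sectors):
        if off == 0:
            continue  # unallocated cluster
        if off >= file_nb_sectors:
            raise ValueError(f'BAT[{i}] offset {off} is beyond the file '
                             f'end ({file_nb_sectors} sectors)')
        data_end = max(data_end, off + tracks)
    return data_end
```

With a 1024-sector file and 256-sector clusters, offsets 256 and 512 are fine (data_end becomes 768), while an offset of 2048 raises an error instead of extending the image.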
1
From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
1
From: Alexander Ivanov <alexander.ivanov@virtuozzo.com>
2
2
3
This is the only non-ascii character in the file and it doesn't really
3
Don't let high_off be more than the file size even if we don't fix the
4
needed here. Let's use normal "'" symbol for consistency with the rest
4
image.
5
11 occurrences of "'" in the file.
6
5
7
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
6
Signed-off-by: Alexander Ivanov <alexander.ivanov@virtuozzo.com>
8
Reviewed-by: Eric Blake <eblake@redhat.com>
7
Reviewed-by: Denis V. Lunev <den@openvz.org>
9
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
8
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
9
Message-Id: <20230424093147.197643-3-alexander.ivanov@virtuozzo.com>
10
Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
11
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
10
---
12
---
11
include/block/block.h | 2 +-
13
block/parallels.c | 4 ++--
12
1 file changed, 1 insertion(+), 1 deletion(-)
14
1 file changed, 2 insertions(+), 2 deletions(-)
13
15
14
diff --git a/include/block/block.h b/include/block/block.h
16
diff --git a/block/parallels.c b/block/parallels.c
15
index XXXXXXX..XXXXXXX 100644
17
index XXXXXXX..XXXXXXX 100644
16
--- a/include/block/block.h
18
--- a/block/parallels.c
17
+++ b/include/block/block.h
19
+++ b/block/parallels.c
18
@@ -XXX,XX +XXX,XX @@ enum BdrvChildRoleBits {
20
@@ -XXX,XX +XXX,XX @@ parallels_co_check(BlockDriverState *bs, BdrvCheckResult *res,
19
BDRV_CHILD_FILTERED = (1 << 2),
21
fix & BDRV_FIX_ERRORS ? "Repairing" : "ERROR", i);
20
22
res->corruptions++;
21
/*
23
if (fix & BDRV_FIX_ERRORS) {
22
- * Child from which to read all data that isn’t allocated in the
24
- prev_off = 0;
23
+ * Child from which to read all data that isn't allocated in the
25
s->bat_bitmap[i] = 0;
24
* parent (i.e., the backing child); such data is copied to the
26
res->corruptions_fixed++;
25
* parent through COW (and optionally COR).
27
flush_bat = true;
26
* This field is mutually exclusive with DATA, METADATA, and
28
- continue;
29
}
30
+ prev_off = 0;
31
+ continue;
32
}
33
34
res->bfi.allocated_clusters++;
27
--
35
--
28
2.26.2
36
2.40.1
29
1
From: Philippe Mathieu-Daudé <philmd@redhat.com>
1
From: Alexander Ivanov <alexander.ivanov@virtuozzo.com>
2
2
3
Per the datasheet sections 3.1.13/3.1.14:
3
Set data_end to the end of the last cluster inside the image. In such a
4
"The host should not read the doorbell registers."
4
way we can be sure that corrupted offsets in the BAT can't affect the
5
image size. If there are no allocated clusters, set image_end_offset to
6
data_end.
5
7
6
As we don't need read access, map the doorbells with write-only
8
Signed-off-by: Alexander Ivanov <alexander.ivanov@virtuozzo.com>
7
permission. We keep a reference to this mapped address in the
9
Reviewed-by: Denis V. Lunev <den@openvz.org>
8
BDRVNVMeState structure.
10
Message-Id: <20230424093147.197643-4-alexander.ivanov@virtuozzo.com>
11
Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
12
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
13
---
14
block/parallels.c | 8 +++++++-
15
1 file changed, 7 insertions(+), 1 deletion(-)
9
16
10
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
17
diff --git a/block/parallels.c b/block/parallels.c
11
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
12
Message-Id: <20200922083821.578519-3-philmd@redhat.com>
13
---
14
block/nvme.c | 29 +++++++++++++++++++----------
15
1 file changed, 19 insertions(+), 10 deletions(-)
16
17
diff --git a/block/nvme.c b/block/nvme.c
18
index XXXXXXX..XXXXXXX 100644
18
index XXXXXXX..XXXXXXX 100644
19
--- a/block/nvme.c
19
--- a/block/parallels.c
20
+++ b/block/nvme.c
20
+++ b/block/parallels.c
21
@@ -XXX,XX +XXX,XX @@
21
@@ -XXX,XX +XXX,XX @@ parallels_co_check(BlockDriverState *bs, BdrvCheckResult *res,
22
#define NVME_SQ_ENTRY_BYTES 64
23
#define NVME_CQ_ENTRY_BYTES 16
24
#define NVME_QUEUE_SIZE 128
25
-#define NVME_BAR_SIZE 8192
26
+#define NVME_DOORBELL_SIZE 4096
27
28
/*
29
* We have to leave one slot empty as that is the full queue case where
30
@@ -XXX,XX +XXX,XX @@ typedef struct {
31
/* Memory mapped registers */
32
typedef volatile struct {
33
NvmeBar ctrl;
34
- struct {
35
- uint32_t sq_tail;
36
- uint32_t cq_head;
37
- } doorbells[];
38
} NVMeRegs;
39
40
#define INDEX_ADMIN 0
41
@@ -XXX,XX +XXX,XX @@ struct BDRVNVMeState {
42
AioContext *aio_context;
43
QEMUVFIOState *vfio;
44
NVMeRegs *regs;
45
+ /* Memory mapped registers */
46
+ volatile struct {
47
+ uint32_t sq_tail;
48
+ uint32_t cq_head;
49
+ } *doorbells;
50
/* The submission/completion queue pairs.
51
* [0]: admin queue.
52
* [1..]: io queues.
53
@@ -XXX,XX +XXX,XX @@ static NVMeQueuePair *nvme_create_queue_pair(BDRVNVMeState *s,
54
error_propagate(errp, local_err);
55
goto fail;
56
}
57
- q->sq.doorbell = &s->regs->doorbells[idx * s->doorbell_scale].sq_tail;
58
+ q->sq.doorbell = &s->doorbells[idx * s->doorbell_scale].sq_tail;
59
60
nvme_init_queue(s, &q->cq, size, NVME_CQ_ENTRY_BYTES, &local_err);
61
if (local_err) {
62
error_propagate(errp, local_err);
63
goto fail;
64
}
65
- q->cq.doorbell = &s->regs->doorbells[idx * s->doorbell_scale].cq_head;
66
+ q->cq.doorbell = &s->doorbells[idx * s->doorbell_scale].cq_head;
67
68
return q;
69
fail:
70
@@ -XXX,XX +XXX,XX @@ static int nvme_init(BlockDriverState *bs, const char *device, int namespace,
71
goto out;
72
}
73
74
- s->regs = qemu_vfio_pci_map_bar(s->vfio, 0, 0, NVME_BAR_SIZE,
75
+ s->regs = qemu_vfio_pci_map_bar(s->vfio, 0, 0, sizeof(NvmeBar),
76
PROT_READ | PROT_WRITE, errp);
77
if (!s->regs) {
78
ret = -EINVAL;
79
goto out;
80
}
81
-
82
/* Perform initialize sequence as described in NVMe spec "7.6.1
83
* Initialization". */
84
85
@@ -XXX,XX +XXX,XX @@ static int nvme_init(BlockDriverState *bs, const char *device, int namespace,
86
}
22
}
87
}
23
}
88
24
89
+ s->doorbells = qemu_vfio_pci_map_bar(s->vfio, 0, sizeof(NvmeBar),
25
- res->image_end_offset = high_off + s->cluster_size;
90
+ NVME_DOORBELL_SIZE, PROT_WRITE, errp);
26
+ if (high_off == 0) {
91
+ if (!s->doorbells) {
27
+ res->image_end_offset = s->data_end << BDRV_SECTOR_BITS;
92
+ ret = -EINVAL;
28
+ } else {
93
+ goto out;
29
+ res->image_end_offset = high_off + s->cluster_size;
30
+ s->data_end = res->image_end_offset >> BDRV_SECTOR_BITS;
94
+ }
31
+ }
95
+
32
+
96
/* Set up admin queue. */
33
if (size > res->image_end_offset) {
97
s->queues = g_new(NVMeQueuePair *, 1);
34
int64_t count;
98
s->queues[INDEX_ADMIN] = nvme_create_queue_pair(s, aio_context, 0,
35
count = DIV_ROUND_UP(size - res->image_end_offset, s->cluster_size);
99
@@ -XXX,XX +XXX,XX @@ static void nvme_close(BlockDriverState *bs)
100
&s->irq_notifier[MSIX_SHARED_IRQ_IDX],
101
false, NULL, NULL);
102
event_notifier_cleanup(&s->irq_notifier[MSIX_SHARED_IRQ_IDX]);
103
- qemu_vfio_pci_unmap_bar(s->vfio, 0, (void *)s->regs, 0, NVME_BAR_SIZE);
104
+ qemu_vfio_pci_unmap_bar(s->vfio, 0, (void *)s->doorbells,
105
+ sizeof(NvmeBar), NVME_DOORBELL_SIZE);
106
+ qemu_vfio_pci_unmap_bar(s->vfio, 0, (void *)s->regs, 0, sizeof(NvmeBar));
107
qemu_vfio_close(s->vfio);
108
109
g_free(s->device);
110
--
36
--
111
2.26.2
37
2.40.1
112
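The new branch in parallels_co_check() can be summarised as follows (a minimal sketch under QEMU's 512-byte-sector convention; the function name is illustrative, not QEMU code):

```python
BDRV_SECTOR_BITS = 9  # 512-byte sectors, as in QEMU


def parallels_image_end(high_off, data_end_sectors, cluster_size):
    """If no cluster is allocated (high_off == 0), the image ends at
    data_end; otherwise it ends one cluster past the highest allocated
    offset (and data_end is clamped to the same point)."""
    if high_off == 0:
        return data_end_sectors << BDRV_SECTOR_BITS
    return high_off + cluster_size
```

For example, with data_end at sector 2 and no allocated clusters the image ends at byte 1024; with a highest allocated offset of 4096 bytes and 1024-byte clusters it ends at 5120.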
1
From: Eric Auger <eric.auger@redhat.com>
1
From: Alexander Ivanov <alexander.ivanov@virtuozzo.com>
2
2
3
The IOVA allocator currently ignores host reserved regions.
3
This helper will be reused in the next patches during the parallels_co_check
4
As a result some chosen IOVAs may collide with some of them,
4
rework to simplify its code.
5
resulting in VFIO MAP_DMA errors later on. This happens on ARM
6
where the MSI reserved window quickly is encountered:
7
[0x8000000, 0x8100000]. Since kernel 5.4, VFIO returns the usable
8
IOVA regions. So let's enumerate them in the prospect to avoid
9
them, later on.
10
5
11
Signed-off-by: Eric Auger <eric.auger@redhat.com>
6
Signed-off-by: Alexander Ivanov <alexander.ivanov@virtuozzo.com>
12
Message-id: 20200929085550.30926-2-eric.auger@redhat.com
7
Reviewed-by: Denis V. Lunev <den@openvz.org>
13
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
8
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
9
Message-Id: <20230424093147.197643-5-alexander.ivanov@virtuozzo.com>
10
Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
11
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
14
---
12
---
15
util/vfio-helpers.c | 72 +++++++++++++++++++++++++++++++++++++++++++--
13
block/parallels.c | 11 ++++++++---
16
1 file changed, 70 insertions(+), 2 deletions(-)
14
1 file changed, 8 insertions(+), 3 deletions(-)
17
15
18
diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
16
diff --git a/block/parallels.c b/block/parallels.c
19
index XXXXXXX..XXXXXXX 100644
17
index XXXXXXX..XXXXXXX 100644
20
--- a/util/vfio-helpers.c
18
--- a/block/parallels.c
21
+++ b/util/vfio-helpers.c
19
+++ b/block/parallels.c
22
@@ -XXX,XX +XXX,XX @@ typedef struct {
20
@@ -XXX,XX +XXX,XX @@ static int64_t block_status(BDRVParallelsState *s, int64_t sector_num,
23
uint64_t iova;
21
return start_off;
24
} IOVAMapping;
25
26
+struct IOVARange {
27
+ uint64_t start;
28
+ uint64_t end;
29
+};
30
+
31
struct QEMUVFIOState {
32
QemuMutex lock;
33
34
@@ -XXX,XX +XXX,XX @@ struct QEMUVFIOState {
35
int device;
36
RAMBlockNotifier ram_notifier;
37
struct vfio_region_info config_region_info, bar_region_info[6];
38
+ struct IOVARange *usable_iova_ranges;
39
+ uint8_t nb_iova_ranges;
40
41
/* These fields are protected by @lock */
42
/* VFIO's IO virtual address space is managed by splitting into a few
43
@@ -XXX,XX +XXX,XX @@ static int qemu_vfio_pci_write_config(QEMUVFIOState *s, void *buf, int size, int
44
return ret == size ? 0 : -errno;
45
}
22
}
46
23
47
+static void collect_usable_iova_ranges(QEMUVFIOState *s, void *buf)
24
+static void parallels_set_bat_entry(BDRVParallelsState *s,
25
+ uint32_t index, uint32_t offset)
48
+{
26
+{
49
+ struct vfio_iommu_type1_info *info = (struct vfio_iommu_type1_info *)buf;
27
+ s->bat_bitmap[index] = cpu_to_le32(offset);
50
+ struct vfio_info_cap_header *cap = (void *)buf + info->cap_offset;
28
+ bitmap_set(s->bat_dirty_bmap, bat_entry_off(index) / s->bat_dirty_block, 1);
51
+ struct vfio_iommu_type1_info_cap_iova_range *cap_iova_range;
52
+ int i;
53
+
54
+ while (cap->id != VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE) {
55
+ if (!cap->next) {
56
+ return;
57
+ }
58
+ cap = (struct vfio_info_cap_header *)(buf + cap->next);
59
+ }
60
+
61
+ cap_iova_range = (struct vfio_iommu_type1_info_cap_iova_range *)cap;
62
+
63
+ s->nb_iova_ranges = cap_iova_range->nr_iovas;
64
+ if (s->nb_iova_ranges > 1) {
65
+ s->usable_iova_ranges =
66
+ g_realloc(s->usable_iova_ranges,
67
+ s->nb_iova_ranges * sizeof(struct IOVARange));
68
+ }
69
+
70
+ for (i = 0; i < s->nb_iova_ranges; i++) {
71
+ s->usable_iova_ranges[i].start = cap_iova_range->iova_ranges[i].start;
72
+ s->usable_iova_ranges[i].end = cap_iova_range->iova_ranges[i].end;
73
+ }
74
+}
29
+}
75
+
30
+
76
static int qemu_vfio_init_pci(QEMUVFIOState *s, const char *device,
31
static int64_t coroutine_fn GRAPH_RDLOCK
77
Error **errp)
32
allocate_clusters(BlockDriverState *bs, int64_t sector_num,
78
{
33
int nb_sectors, int *pnum)
79
@@ -XXX,XX +XXX,XX @@ static int qemu_vfio_init_pci(QEMUVFIOState *s, const char *device,
34
@@ -XXX,XX +XXX,XX @@ allocate_clusters(BlockDriverState *bs, int64_t sector_num,
80
int i;
81
uint16_t pci_cmd;
82
struct vfio_group_status group_status = { .argsz = sizeof(group_status) };
83
- struct vfio_iommu_type1_info iommu_info = { .argsz = sizeof(iommu_info) };
84
+ struct vfio_iommu_type1_info *iommu_info = NULL;
85
+ size_t iommu_info_size = sizeof(*iommu_info);
86
struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
87
char *group_file = NULL;
88
89
+ s->usable_iova_ranges = NULL;
90
+
91
/* Create a new container */
92
s->container = open("/dev/vfio/vfio", O_RDWR);
93
94
@@ -XXX,XX +XXX,XX @@ static int qemu_vfio_init_pci(QEMUVFIOState *s, const char *device,
95
goto fail;
96
}
35
}
97
36
98
+ iommu_info = g_malloc0(iommu_info_size);
37
for (i = 0; i < to_allocate; i++) {
99
+ iommu_info->argsz = iommu_info_size;
38
- s->bat_bitmap[idx + i] = cpu_to_le32(s->data_end / s->off_multiplier);
100
+
39
+ parallels_set_bat_entry(s, idx + i, s->data_end / s->off_multiplier);
101
/* Get additional IOMMU info */
40
s->data_end += s->tracks;
102
- if (ioctl(s->container, VFIO_IOMMU_GET_INFO, &iommu_info)) {
41
- bitmap_set(s->bat_dirty_bmap,
103
+ if (ioctl(s->container, VFIO_IOMMU_GET_INFO, iommu_info)) {
42
- bat_entry_off(idx + i) / s->bat_dirty_block, 1);
104
error_setg_errno(errp, errno, "Failed to get IOMMU info");
105
ret = -errno;
106
goto fail;
107
}
43
}
108
44
109
+ /*
45
return bat2sect(s, idx) + sector_num % s->tracks;
110
+ * if the kernel does not report usable IOVA regions, choose
111
+ * the legacy [QEMU_VFIO_IOVA_MIN, QEMU_VFIO_IOVA_MAX -1] region
112
+ */
113
+ s->nb_iova_ranges = 1;
114
+ s->usable_iova_ranges = g_new0(struct IOVARange, 1);
115
+ s->usable_iova_ranges[0].start = QEMU_VFIO_IOVA_MIN;
116
+ s->usable_iova_ranges[0].end = QEMU_VFIO_IOVA_MAX - 1;
117
+
118
+ if (iommu_info->argsz > iommu_info_size) {
119
+ iommu_info_size = iommu_info->argsz;
120
+ iommu_info = g_realloc(iommu_info, iommu_info_size);
121
+ if (ioctl(s->container, VFIO_IOMMU_GET_INFO, iommu_info)) {
122
+ ret = -errno;
123
+ goto fail;
124
+ }
125
+ collect_usable_iova_ranges(s, iommu_info);
126
+ }
127
+
128
s->device = ioctl(s->group, VFIO_GROUP_GET_DEVICE_FD, device);
129
130
if (s->device < 0) {
131
@@ -XXX,XX +XXX,XX @@ static int qemu_vfio_init_pci(QEMUVFIOState *s, const char *device,
132
if (ret) {
133
goto fail;
134
}
135
+ g_free(iommu_info);
136
return 0;
137
fail:
138
+ g_free(s->usable_iova_ranges);
139
+ s->usable_iova_ranges = NULL;
140
+ s->nb_iova_ranges = 0;
141
+ g_free(iommu_info);
142
close(s->group);
143
fail_container:
144
close(s->container);
145
@@ -XXX,XX +XXX,XX @@ void qemu_vfio_close(QEMUVFIOState *s)
146
qemu_vfio_undo_mapping(s, &s->mappings[i], NULL);
147
}
148
ram_block_notifier_remove(&s->ram_notifier);
149
+ g_free(s->usable_iova_ranges);
150
+ s->nb_iova_ranges = 0;
151
qemu_vfio_reset(s);
152
close(s->device);
153
close(s->group);
154
--
46
--
155
2.26.2
47
2.40.1
156
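The point of recording the usable IOVA ranges is that the allocator can later skip the reserved windows. A toy allocator along those lines (an assumption for illustration only, not the QEMU implementation) might look like:

```python
def alloc_iova(usable_ranges, size, align):
    """Return the first 'align'-aligned address whose [addr, addr+size)
    window fits entirely inside one of the usable IOVA ranges, each
    given as an inclusive (start, end) pair as in the patch above."""
    for start, end in usable_ranges:
        addr = (start + align - 1) & ~(align - 1)  # round up to alignment
        if addr + size - 1 <= end:
            return addr
    raise MemoryError('no usable IOVA range can hold the mapping')


# With the ARM MSI window [0x8000000, 0x8100000] carved out, the usable
# ranges exclude it, so allocations never land inside the window:
ranges = [(0x0, 0x7ffffff), (0x8100000, 0xfffffff)]
```

Here ``alloc_iova(ranges, 0x1000, 0x1000)`` returns an address from the first usable range, while a range list starting at 0x8100000 yields addresses past the reserved window.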
1
From: Philippe Mathieu-Daudé <philmd@redhat.com>
1
From: Alexander Ivanov <alexander.ivanov@virtuozzo.com>
2
2
3
We only access the I/O register in nvme_init().
3
BAT is written in the context of conventional operations over the image
4
Remove the reference in BDRVNVMeState and reduce its scope.
4
inside bdrv_co_flush() when it calls parallels_co_flush_to_os() callback.
5
Thus we should not modify BAT array directly, but call
6
parallels_set_bat_entry() helper and bdrv_co_flush() further on. After
7
that there is no need to manually write BAT and track its modification.
5
8
6
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
9
This makes the code more generic and allows splitting parallels_set_bat_entry()
7
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
10
into independent pieces.
8
Message-Id: <20200922083821.578519-4-philmd@redhat.com>
11
12
Signed-off-by: Alexander Ivanov <alexander.ivanov@virtuozzo.com>
13
Reviewed-by: Denis V. Lunev <den@openvz.org>
14
Message-Id: <20230424093147.197643-6-alexander.ivanov@virtuozzo.com>
15
Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
16
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
9
---
17
---
10
block/nvme.c | 29 ++++++++++++++++-------------
18
block/parallels.c | 23 ++++++++++-------------
11
1 file changed, 16 insertions(+), 13 deletions(-)
19
1 file changed, 10 insertions(+), 13 deletions(-)
12
20
13
diff --git a/block/nvme.c b/block/nvme.c
21
diff --git a/block/parallels.c b/block/parallels.c
14
index XXXXXXX..XXXXXXX 100644
22
index XXXXXXX..XXXXXXX 100644
15
--- a/block/nvme.c
23
--- a/block/parallels.c
16
+++ b/block/nvme.c
24
+++ b/block/parallels.c
17
@@ -XXX,XX +XXX,XX @@ enum {
25
@@ -XXX,XX +XXX,XX @@ parallels_co_check(BlockDriverState *bs, BdrvCheckResult *res,
18
struct BDRVNVMeState {
26
{
19
AioContext *aio_context;
27
BDRVParallelsState *s = bs->opaque;
20
QEMUVFIOState *vfio;
28
int64_t size, prev_off, high_off;
21
- NVMeRegs *regs;
29
- int ret;
22
/* Memory mapped registers */
30
+ int ret = 0;
23
volatile struct {
31
uint32_t i;
24
uint32_t sq_tail;
32
- bool flush_bat = false;
25
@@ -XXX,XX +XXX,XX @@ static int nvme_init(BlockDriverState *bs, const char *device, int namespace,
33
26
uint64_t timeout_ms;
34
size = bdrv_getlength(bs->file->bs);
27
uint64_t deadline, now;
35
if (size < 0) {
28
Error *local_err = NULL;
36
@@ -XXX,XX +XXX,XX @@ parallels_co_check(BlockDriverState *bs, BdrvCheckResult *res,
29
+ NVMeRegs *regs;
37
fix & BDRV_FIX_ERRORS ? "Repairing" : "ERROR", i);
30
38
res->corruptions++;
31
qemu_co_mutex_init(&s->dma_map_lock);
39
if (fix & BDRV_FIX_ERRORS) {
32
qemu_co_queue_init(&s->dma_flush_queue);
40
- s->bat_bitmap[i] = 0;
33
@@ -XXX,XX +XXX,XX @@ static int nvme_init(BlockDriverState *bs, const char *device, int namespace,
41
+ parallels_set_bat_entry(s, i, 0);
34
goto out;
42
res->corruptions_fixed++;
43
- flush_bat = true;
44
}
45
prev_off = 0;
46
continue;
47
@@ -XXX,XX +XXX,XX @@ parallels_co_check(BlockDriverState *bs, BdrvCheckResult *res,
48
prev_off = off;
35
}
49
}
36
50
37
- s->regs = qemu_vfio_pci_map_bar(s->vfio, 0, 0, sizeof(NvmeBar),
51
- ret = 0;
38
- PROT_READ | PROT_WRITE, errp);
52
- if (flush_bat) {
39
- if (!s->regs) {
53
- ret = bdrv_co_pwrite_sync(bs->file, 0, s->header_size, s->header, 0);
40
+ regs = qemu_vfio_pci_map_bar(s->vfio, 0, 0, sizeof(NvmeBar),
54
- if (ret < 0) {
41
+ PROT_READ | PROT_WRITE, errp);
55
- res->check_errors++;
42
+ if (!regs) {
56
- goto out;
43
ret = -EINVAL;
57
- }
44
goto out;
58
- }
45
}
59
-
46
/* Perform initialize sequence as described in NVMe spec "7.6.1
60
if (high_off == 0) {
47
* Initialization". */
61
res->image_end_offset = s->data_end << BDRV_SECTOR_BITS;
48
62
} else {
49
- cap = le64_to_cpu(s->regs->ctrl.cap);
63
@@ -XXX,XX +XXX,XX @@ parallels_co_check(BlockDriverState *bs, BdrvCheckResult *res,
50
+ cap = le64_to_cpu(regs->ctrl.cap);
64
51
if (!(cap & (1ULL << 37))) {
52
error_setg(errp, "Device doesn't support NVMe command set");
53
ret = -EINVAL;
54
@@ -XXX,XX +XXX,XX @@ static int nvme_init(BlockDriverState *bs, const char *device, int namespace,
55
timeout_ms = MIN(500 * ((cap >> 24) & 0xFF), 30000);
56
57
/* Reset device to get a clean state. */
58
- s->regs->ctrl.cc = cpu_to_le32(le32_to_cpu(s->regs->ctrl.cc) & 0xFE);
59
+ regs->ctrl.cc = cpu_to_le32(le32_to_cpu(regs->ctrl.cc) & 0xFE);
60
/* Wait for CSTS.RDY = 0. */
61
deadline = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) + timeout_ms * SCALE_MS;
62
- while (le32_to_cpu(s->regs->ctrl.csts) & 0x1) {
63
+ while (le32_to_cpu(regs->ctrl.csts) & 0x1) {
64
if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) > deadline) {
65
error_setg(errp, "Timeout while waiting for device to reset (%"
66
PRId64 " ms)",
67
@@ -XXX,XX +XXX,XX @@ static int nvme_init(BlockDriverState *bs, const char *device, int namespace,
68
}
69
s->nr_queues = 1;
70
QEMU_BUILD_BUG_ON(NVME_QUEUE_SIZE & 0xF000);
71
- s->regs->ctrl.aqa = cpu_to_le32((NVME_QUEUE_SIZE << 16) | NVME_QUEUE_SIZE);
72
- s->regs->ctrl.asq = cpu_to_le64(s->queues[INDEX_ADMIN]->sq.iova);
73
- s->regs->ctrl.acq = cpu_to_le64(s->queues[INDEX_ADMIN]->cq.iova);
74
+ regs->ctrl.aqa = cpu_to_le32((NVME_QUEUE_SIZE << 16) | NVME_QUEUE_SIZE);
75
+ regs->ctrl.asq = cpu_to_le64(s->queues[INDEX_ADMIN]->sq.iova);
76
+ regs->ctrl.acq = cpu_to_le64(s->queues[INDEX_ADMIN]->cq.iova);
77
78
/* After setting up all control registers we can enable device now. */
79
- s->regs->ctrl.cc = cpu_to_le32((ctz32(NVME_CQ_ENTRY_BYTES) << 20) |
80
+ regs->ctrl.cc = cpu_to_le32((ctz32(NVME_CQ_ENTRY_BYTES) << 20) |
81
(ctz32(NVME_SQ_ENTRY_BYTES) << 16) |
82
0x1);
83
/* Wait for CSTS.RDY = 1. */
84
now = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
85
deadline = now + timeout_ms * 1000000;
86
- while (!(le32_to_cpu(s->regs->ctrl.csts) & 0x1)) {
87
+ while (!(le32_to_cpu(regs->ctrl.csts) & 0x1)) {
88
if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) > deadline) {
89
error_setg(errp, "Timeout while waiting for device to start (%"
90
PRId64 " ms)",
91
@@ -XXX,XX +XXX,XX @@ static int nvme_init(BlockDriverState *bs, const char *device, int namespace,
92
ret = -EIO;
93
}
94
out:
65
out:
95
+ if (regs) {
66
qemu_co_mutex_unlock(&s->lock);
96
+ qemu_vfio_pci_unmap_bar(s->vfio, 0, (void *)regs, 0, sizeof(NvmeBar));
67
+
68
+ if (ret == 0) {
69
+ ret = bdrv_co_flush(bs);
70
+ if (ret < 0) {
71
+ res->check_errors++;
72
+ }
97
+ }
73
+ }
98
+
74
+
99
/* Cleaning up is done in nvme_file_open() upon error. */
100
return ret;
75
return ret;
101
}
76
}
102
@@ -XXX,XX +XXX,XX @@ static void nvme_close(BlockDriverState *bs)
77
103
event_notifier_cleanup(&s->irq_notifier[MSIX_SHARED_IRQ_IDX]);
104
qemu_vfio_pci_unmap_bar(s->vfio, 0, (void *)s->doorbells,
105
sizeof(NvmeBar), NVME_DOORBELL_SIZE);
106
- qemu_vfio_pci_unmap_bar(s->vfio, 0, (void *)s->regs, 0, sizeof(NvmeBar));
107
qemu_vfio_close(s->vfio);
108
109
g_free(s->device);
110
--
78
--
111
2.26.2
79
2.40.1
112
diff view generated by jsdifflib
1
From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
1
From: Alexander Ivanov <alexander.ivanov@virtuozzo.com>
2
2
3
We are going to keep coroutine-wrappers code (structure-packing
3
We will add more and more checks, so we need a better code structure
4
parameters, BDRV_POLL wrapper functions) in separate auto-generated
4
in parallels_co_check. Perform each check in a separate loop
5
files. So we'll need a header with declarations of the original _co_
5
in a separate helper.
6
functions, for those that are currently static. We'll also need
7
declarations for wrapper functions. Do these declarations now, as a
8
preparation step.
9
6
10
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
7
Signed-off-by: Alexander Ivanov <alexander.ivanov@virtuozzo.com>
11
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
8
Reviewed-by: Denis V. Lunev <den@openvz.org>
12
Reviewed-by: Eric Blake <eblake@redhat.com>
9
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
13
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
10
Message-Id: <20230424093147.197643-7-alexander.ivanov@virtuozzo.com>
14
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
11
Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
15
Message-Id: <20200924185414.28642-4-vsementsov@virtuozzo.com>
12
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
16
---
13
---
17
block/coroutines.h | 67 ++++++++++++++++++++++++++++++++++++++++++++++
14
block/parallels.c | 31 +++++++++++++++++++++----------
18
block.c | 8 +++---
15
1 file changed, 21 insertions(+), 10 deletions(-)
19
block/io.c | 34 +++++++++++------------
20
3 files changed, 88 insertions(+), 21 deletions(-)
21
create mode 100644 block/coroutines.h
22
16
23
diff --git a/block/coroutines.h b/block/coroutines.h
17
diff --git a/block/parallels.c b/block/parallels.c
24
new file mode 100644
25
index XXXXXXX..XXXXXXX
26
--- /dev/null
27
+++ b/block/coroutines.h
28
@@ -XXX,XX +XXX,XX @@
29
+/*
30
+ * Block layer I/O functions
31
+ *
32
+ * Copyright (c) 2003 Fabrice Bellard
33
+ *
34
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
35
+ * of this software and associated documentation files (the "Software"), to deal
36
+ * in the Software without restriction, including without limitation the rights
37
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
38
+ * copies of the Software, and to permit persons to whom the Software is
39
+ * furnished to do so, subject to the following conditions:
40
+ *
41
+ * The above copyright notice and this permission notice shall be included in
42
+ * all copies or substantial portions of the Software.
43
+ *
44
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
45
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
46
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
47
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
48
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
49
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
50
+ * THE SOFTWARE.
51
+ */
52
+
53
+#ifndef BLOCK_COROUTINES_INT_H
54
+#define BLOCK_COROUTINES_INT_H
55
+
56
+#include "block/block_int.h"
57
+
58
+int coroutine_fn bdrv_co_check(BlockDriverState *bs,
59
+ BdrvCheckResult *res, BdrvCheckMode fix);
60
+int coroutine_fn bdrv_co_invalidate_cache(BlockDriverState *bs, Error **errp);
61
+
62
+int coroutine_fn
63
+bdrv_co_prwv(BdrvChild *child, int64_t offset, QEMUIOVector *qiov,
64
+ bool is_write, BdrvRequestFlags flags);
65
+int
66
+bdrv_prwv(BdrvChild *child, int64_t offset, QEMUIOVector *qiov,
67
+ bool is_write, BdrvRequestFlags flags);
68
+
69
+int coroutine_fn
70
+bdrv_co_common_block_status_above(BlockDriverState *bs,
71
+ BlockDriverState *base,
72
+ bool want_zero,
73
+ int64_t offset,
74
+ int64_t bytes,
75
+ int64_t *pnum,
76
+ int64_t *map,
77
+ BlockDriverState **file);
78
+int
79
+bdrv_common_block_status_above(BlockDriverState *bs,
80
+ BlockDriverState *base,
81
+ bool want_zero,
82
+ int64_t offset,
83
+ int64_t bytes,
84
+ int64_t *pnum,
85
+ int64_t *map,
86
+ BlockDriverState **file);
87
+
88
+int coroutine_fn
89
+bdrv_co_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
90
+ bool is_read);
91
+int
92
+bdrv_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
93
+ bool is_read);
94
+
95
+#endif /* BLOCK_COROUTINES_INT_H */
96
diff --git a/block.c b/block.c
97
index XXXXXXX..XXXXXXX 100644
18
index XXXXXXX..XXXXXXX 100644
98
--- a/block.c
19
--- a/block/parallels.c
99
+++ b/block.c
20
+++ b/block/parallels.c
100
@@ -XXX,XX +XXX,XX @@
21
@@ -XXX,XX +XXX,XX @@ parallels_co_readv(BlockDriverState *bs, int64_t sector_num, int nb_sectors,
101
#include "qemu/timer.h"
102
#include "qemu/cutils.h"
103
#include "qemu/id.h"
104
+#include "block/coroutines.h"
105
106
#ifdef CONFIG_BSD
107
#include <sys/ioctl.h>
108
@@ -XXX,XX +XXX,XX @@ static void bdrv_delete(BlockDriverState *bs)
109
* free of errors) or -errno when an internal error occurred. The results of the
110
* check are stored in res.
111
*/
112
-static int coroutine_fn bdrv_co_check(BlockDriverState *bs,
113
- BdrvCheckResult *res, BdrvCheckMode fix)
114
+int coroutine_fn bdrv_co_check(BlockDriverState *bs,
115
+ BdrvCheckResult *res, BdrvCheckMode fix)
116
{
117
if (bs->drv == NULL) {
118
return -ENOMEDIUM;
119
@@ -XXX,XX +XXX,XX @@ void bdrv_init_with_whitelist(void)
120
bdrv_init();
121
}
122
123
-static int coroutine_fn bdrv_co_invalidate_cache(BlockDriverState *bs,
124
- Error **errp)
125
+int coroutine_fn bdrv_co_invalidate_cache(BlockDriverState *bs, Error **errp)
126
{
127
BdrvChild *child, *parent;
128
uint64_t perm, shared_perm;
129
diff --git a/block/io.c b/block/io.c
130
index XXXXXXX..XXXXXXX 100644
131
--- a/block/io.c
132
+++ b/block/io.c
133
@@ -XXX,XX +XXX,XX @@
134
#include "block/blockjob.h"
135
#include "block/blockjob_int.h"
136
#include "block/block_int.h"
137
+#include "block/coroutines.h"
138
#include "qemu/cutils.h"
139
#include "qapi/error.h"
140
#include "qemu/error-report.h"
141
@@ -XXX,XX +XXX,XX @@ typedef struct RwCo {
142
BdrvRequestFlags flags;
143
} RwCo;
144
145
-static int coroutine_fn bdrv_co_prwv(BdrvChild *child, int64_t offset,
146
- QEMUIOVector *qiov, bool is_write,
147
- BdrvRequestFlags flags)
148
+int coroutine_fn bdrv_co_prwv(BdrvChild *child, int64_t offset,
149
+ QEMUIOVector *qiov, bool is_write,
150
+ BdrvRequestFlags flags)
151
{
152
if (is_write) {
153
return bdrv_co_pwritev(child, offset, qiov->size, qiov, flags);
154
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn bdrv_rw_co_entry(void *opaque)
155
/*
156
* Process a vectored synchronous request using coroutines
157
*/
158
-static int bdrv_prwv(BdrvChild *child, int64_t offset,
159
- QEMUIOVector *qiov, bool is_write,
160
- BdrvRequestFlags flags)
161
+int bdrv_prwv(BdrvChild *child, int64_t offset,
162
+ QEMUIOVector *qiov, bool is_write,
163
+ BdrvRequestFlags flags)
164
{
165
RwCo rwco = {
166
.child = child,
167
@@ -XXX,XX +XXX,XX @@ early_out:
168
return ret;
22
return ret;
169
}
23
}
170
24
171
-static int coroutine_fn
25
+static void parallels_check_unclean(BlockDriverState *bs,
172
+int coroutine_fn
26
+ BdrvCheckResult *res,
173
bdrv_co_common_block_status_above(BlockDriverState *bs,
27
+ BdrvCheckMode fix)
174
BlockDriverState *base,
28
+{
175
bool want_zero,
29
+ BDRVParallelsState *s = bs->opaque;
176
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn bdrv_block_status_above_co_entry(void *opaque)
30
+
177
*
31
+ if (!s->header_unclean) {
178
* See bdrv_co_block_status_above() for details.
32
+ return;
179
*/
33
+ }
180
-static int bdrv_common_block_status_above(BlockDriverState *bs,
34
+
181
- BlockDriverState *base,
35
+ fprintf(stderr, "%s image was not closed correctly\n",
182
- bool want_zero, int64_t offset,
36
+ fix & BDRV_FIX_ERRORS ? "Repairing" : "ERROR");
183
- int64_t bytes, int64_t *pnum,
37
+ res->corruptions++;
184
- int64_t *map,
38
+ if (fix & BDRV_FIX_ERRORS) {
185
- BlockDriverState **file)
39
+ /* parallels_close will do the job right */
186
+int bdrv_common_block_status_above(BlockDriverState *bs,
40
+ res->corruptions_fixed++;
187
+ BlockDriverState *base,
41
+ s->header_unclean = false;
188
+ bool want_zero, int64_t offset,
42
+ }
189
+ int64_t bytes, int64_t *pnum,
43
+}
190
+ int64_t *map,
44
191
+ BlockDriverState **file)
45
static int coroutine_fn GRAPH_RDLOCK
192
{
46
parallels_co_check(BlockDriverState *bs, BdrvCheckResult *res,
193
BdrvCoBlockStatusData data = {
47
@@ -XXX,XX +XXX,XX @@ parallels_co_check(BlockDriverState *bs, BdrvCheckResult *res,
194
.bs = bs,
48
}
195
@@ -XXX,XX +XXX,XX @@ typedef struct BdrvVmstateCo {
49
196
bool is_read;
50
qemu_co_mutex_lock(&s->lock);
197
} BdrvVmstateCo;
51
- if (s->header_unclean) {
198
52
- fprintf(stderr, "%s image was not closed correctly\n",
199
-static int coroutine_fn
53
- fix & BDRV_FIX_ERRORS ? "Repairing" : "ERROR");
200
+int coroutine_fn
54
- res->corruptions++;
201
bdrv_co_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
55
- if (fix & BDRV_FIX_ERRORS) {
202
bool is_read)
56
- /* parallels_close will do the job right */
203
{
57
- res->corruptions_fixed++;
204
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn bdrv_co_rw_vmstate_entry(void *opaque)
58
- s->header_unclean = false;
205
return bdrv_co_rw_vmstate(co->bs, co->qiov, co->pos, co->is_read);
59
- }
206
}
60
- }
207
61
+
208
-static inline int
62
+ parallels_check_unclean(bs, res, fix);
209
-bdrv_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
63
210
- bool is_read)
64
res->bfi.total_clusters = s->bat_size;
211
+int bdrv_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
65
res->bfi.compressed_clusters = 0; /* compression is not supported */
212
+ bool is_read)
213
{
214
BdrvVmstateCo data = {
215
.bs = bs,
216
--
66
--
217
2.26.2
67
2.40.1
218
1
From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
1
From: Alexander Ivanov <alexander.ivanov@virtuozzo.com>
2
2
3
As with read/write in a previous commit, drop the extra indirection layer and
3
We will add more and more checks, so we need a better code structure in
4
generate bdrv_readv_vmstate() and bdrv_writev_vmstate() directly.
4
parallels_co_check. Perform each check in a separate loop in a
5
separate helper.
5
6
6
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
7
Signed-off-by: Alexander Ivanov <alexander.ivanov@virtuozzo.com>
7
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
8
Reviewed-by: Denis V. Lunev <den@openvz.org>
8
Reviewed-by: Eric Blake <eblake@redhat.com>
9
Message-Id: <20230424093147.197643-8-alexander.ivanov@virtuozzo.com>
9
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
10
Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
10
Message-Id: <20200924185414.28642-8-vsementsov@virtuozzo.com>
11
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
11
---
12
---
12
block/coroutines.h | 10 +++----
13
block/parallels.c | 75 +++++++++++++++++++++++++++++++----------------
13
include/block/block.h | 6 ++--
14
1 file changed, 49 insertions(+), 26 deletions(-)
14
block/io.c | 70 ++++++++++++++++++++++---------------------
15
3 files changed, 44 insertions(+), 42 deletions(-)
16
15
17
diff --git a/block/coroutines.h b/block/coroutines.h
16
diff --git a/block/parallels.c b/block/parallels.c
18
index XXXXXXX..XXXXXXX 100644
17
index XXXXXXX..XXXXXXX 100644
19
--- a/block/coroutines.h
18
--- a/block/parallels.c
20
+++ b/block/coroutines.h
19
+++ b/block/parallels.c
21
@@ -XXX,XX +XXX,XX @@ bdrv_common_block_status_above(BlockDriverState *bs,
20
@@ -XXX,XX +XXX,XX @@ static void parallels_check_unclean(BlockDriverState *bs,
22
int64_t *map,
21
}
23
BlockDriverState **file);
24
25
-int coroutine_fn
26
-bdrv_co_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
27
- bool is_read);
28
-int generated_co_wrapper
29
-bdrv_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
30
- bool is_read);
31
+int coroutine_fn bdrv_co_readv_vmstate(BlockDriverState *bs,
32
+ QEMUIOVector *qiov, int64_t pos);
33
+int coroutine_fn bdrv_co_writev_vmstate(BlockDriverState *bs,
34
+ QEMUIOVector *qiov, int64_t pos);
35
36
#endif /* BLOCK_COROUTINES_INT_H */
37
diff --git a/include/block/block.h b/include/block/block.h
38
index XXXXXXX..XXXXXXX 100644
39
--- a/include/block/block.h
40
+++ b/include/block/block.h
41
@@ -XXX,XX +XXX,XX @@ int path_has_protocol(const char *path);
42
int path_is_absolute(const char *path);
43
char *path_combine(const char *base_path, const char *filename);
44
45
-int bdrv_readv_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos);
46
-int bdrv_writev_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos);
47
+int generated_co_wrapper
48
+bdrv_readv_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos);
49
+int generated_co_wrapper
50
+bdrv_writev_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos);
51
int bdrv_save_vmstate(BlockDriverState *bs, const uint8_t *buf,
52
int64_t pos, int size);
53
54
diff --git a/block/io.c b/block/io.c
55
index XXXXXXX..XXXXXXX 100644
56
--- a/block/io.c
57
+++ b/block/io.c
58
@@ -XXX,XX +XXX,XX @@ int bdrv_is_allocated_above(BlockDriverState *top,
59
}
22
}
60
23
61
int coroutine_fn
24
+static int coroutine_fn GRAPH_RDLOCK
62
-bdrv_co_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
25
+parallels_check_outside_image(BlockDriverState *bs, BdrvCheckResult *res,
63
- bool is_read)
26
+ BdrvCheckMode fix)
64
+bdrv_co_readv_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos)
27
+{
65
{
28
+ BDRVParallelsState *s = bs->opaque;
66
BlockDriver *drv = bs->drv;
29
+ uint32_t i;
67
BlockDriverState *child_bs = bdrv_primary_bs(bs);
30
+ int64_t off, high_off, size;
68
int ret = -ENOTSUP;
31
+
69
32
+ size = bdrv_getlength(bs->file->bs);
70
+ if (!drv) {
33
+ if (size < 0) {
71
+ return -ENOMEDIUM;
34
+ res->check_errors++;
35
+ return size;
72
+ }
36
+ }
73
+
37
+
74
bdrv_inc_in_flight(bs);
38
+ high_off = 0;
75
39
+ for (i = 0; i < s->bat_size; i++) {
76
+ if (drv->bdrv_load_vmstate) {
40
+ off = bat2sect(s, i) << BDRV_SECTOR_BITS;
77
+ ret = drv->bdrv_load_vmstate(bs, qiov, pos);
41
+ if (off > size) {
78
+ } else if (child_bs) {
42
+ fprintf(stderr, "%s cluster %u is outside image\n",
79
+ ret = bdrv_co_readv_vmstate(child_bs, qiov, pos);
43
+ fix & BDRV_FIX_ERRORS ? "Repairing" : "ERROR", i);
44
+ res->corruptions++;
45
+ if (fix & BDRV_FIX_ERRORS) {
46
+ parallels_set_bat_entry(s, i, 0);
47
+ res->corruptions_fixed++;
48
+ }
49
+ continue;
50
+ }
51
+ if (high_off < off) {
52
+ high_off = off;
53
+ }
80
+ }
54
+ }
81
+
55
+
82
+ bdrv_dec_in_flight(bs);
56
+ if (high_off == 0) {
57
+ res->image_end_offset = s->data_end << BDRV_SECTOR_BITS;
58
+ } else {
59
+ res->image_end_offset = high_off + s->cluster_size;
60
+ s->data_end = res->image_end_offset >> BDRV_SECTOR_BITS;
61
+ }
83
+
62
+
84
+ return ret;
63
+ return 0;
85
+}
64
+}
86
+
65
+
87
+int coroutine_fn
66
static int coroutine_fn GRAPH_RDLOCK
88
+bdrv_co_writev_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos)
67
parallels_co_check(BlockDriverState *bs, BdrvCheckResult *res,
89
+{
68
BdrvCheckMode fix)
90
+ BlockDriver *drv = bs->drv;
69
{
91
+ BlockDriverState *child_bs = bdrv_primary_bs(bs);
70
BDRVParallelsState *s = bs->opaque;
92
+ int ret = -ENOTSUP;
71
- int64_t size, prev_off, high_off;
93
+
72
- int ret = 0;
94
if (!drv) {
73
+ int64_t size, prev_off;
95
- ret = -ENOMEDIUM;
74
+ int ret;
96
- } else if (drv->bdrv_load_vmstate) {
75
uint32_t i;
97
- if (is_read) {
76
98
- ret = drv->bdrv_load_vmstate(bs, qiov, pos);
77
size = bdrv_getlength(bs->file->bs);
99
- } else {
78
@@ -XXX,XX +XXX,XX @@ parallels_co_check(BlockDriverState *bs, BdrvCheckResult *res,
100
- ret = drv->bdrv_save_vmstate(bs, qiov, pos);
79
101
- }
80
parallels_check_unclean(bs, res, fix);
102
+ return -ENOMEDIUM;
81
82
+ ret = parallels_check_outside_image(bs, res, fix);
83
+ if (ret < 0) {
84
+ goto out;
103
+ }
85
+ }
104
+
86
+
105
+ bdrv_inc_in_flight(bs);
87
res->bfi.total_clusters = s->bat_size;
106
+
88
res->bfi.compressed_clusters = 0; /* compression is not supported */
107
+ if (drv->bdrv_save_vmstate) {
89
108
+ ret = drv->bdrv_save_vmstate(bs, qiov, pos);
90
- high_off = 0;
109
} else if (child_bs) {
91
prev_off = 0;
110
- ret = bdrv_co_rw_vmstate(child_bs, qiov, pos, is_read);
92
for (i = 0; i < s->bat_size; i++) {
111
+ ret = bdrv_co_writev_vmstate(child_bs, qiov, pos);
93
int64_t off = bat2sect(s, i) << BDRV_SECTOR_BITS;
94
@@ -XXX,XX +XXX,XX @@ parallels_co_check(BlockDriverState *bs, BdrvCheckResult *res,
95
continue;
96
}
97
98
- /* cluster outside the image */
99
- if (off > size) {
100
- fprintf(stderr, "%s cluster %u is outside image\n",
101
- fix & BDRV_FIX_ERRORS ? "Repairing" : "ERROR", i);
102
- res->corruptions++;
103
- if (fix & BDRV_FIX_ERRORS) {
104
- parallels_set_bat_entry(s, i, 0);
105
- res->corruptions_fixed++;
106
- }
107
- prev_off = 0;
108
- continue;
109
- }
110
-
111
res->bfi.allocated_clusters++;
112
- if (off > high_off) {
113
- high_off = off;
114
- }
115
116
if (prev_off != 0 && (prev_off + s->cluster_size) != off) {
117
res->bfi.fragmented_clusters++;
118
@@ -XXX,XX +XXX,XX @@ parallels_co_check(BlockDriverState *bs, BdrvCheckResult *res,
119
prev_off = off;
112
}
120
}
113
121
114
bdrv_dec_in_flight(bs);
122
- if (high_off == 0) {
115
+
123
- res->image_end_offset = s->data_end << BDRV_SECTOR_BITS;
116
return ret;
124
- } else {
117
}
125
- res->image_end_offset = high_off + s->cluster_size;
118
126
- s->data_end = res->image_end_offset >> BDRV_SECTOR_BITS;
119
@@ -XXX,XX +XXX,XX @@ int bdrv_save_vmstate(BlockDriverState *bs, const uint8_t *buf,
120
int64_t pos, int size)
121
{
122
QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, size);
123
- int ret;
124
+ int ret = bdrv_writev_vmstate(bs, &qiov, pos);
125
126
- ret = bdrv_writev_vmstate(bs, &qiov, pos);
127
- if (ret < 0) {
128
- return ret;
129
- }
127
- }
130
-
128
-
131
- return size;
129
if (size > res->image_end_offset) {
132
-}
130
int64_t count;
133
-
131
count = DIV_ROUND_UP(size - res->image_end_offset, s->cluster_size);
134
-int bdrv_writev_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos)
135
-{
136
- return bdrv_rw_vmstate(bs, qiov, pos, false);
137
+ return ret < 0 ? ret : size;
138
}
139
140
int bdrv_load_vmstate(BlockDriverState *bs, uint8_t *buf,
141
int64_t pos, int size)
142
{
143
QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, size);
144
- int ret;
145
+ int ret = bdrv_readv_vmstate(bs, &qiov, pos);
146
147
- ret = bdrv_readv_vmstate(bs, &qiov, pos);
148
- if (ret < 0) {
149
- return ret;
150
- }
151
-
152
- return size;
153
-}
154
-
155
-int bdrv_readv_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos)
156
-{
157
- return bdrv_rw_vmstate(bs, qiov, pos, true);
158
+ return ret < 0 ? ret : size;
159
}
160
161
/**************************************************************/
162
--
132
--
163
2.26.2
133
2.40.1
164
diff view generated by jsdifflib
1
From: Philippe Mathieu-Daudé <philmd@redhat.com>
1
From: Alexander Ivanov <alexander.ivanov@virtuozzo.com>
2
2
3
NVMeRegs only contains NvmeBar. Simplify the code by using NvmeBar
3
Exclude out-of-image clusters from allocated and fragmented clusters
4
directly.
4
calculation.
5
5
6
This triggers a checkpatch.pl error:
6
Signed-off-by: Alexander Ivanov <alexander.ivanov@virtuozzo.com>
7
Message-Id: <20230424093147.197643-9-alexander.ivanov@virtuozzo.com>
8
Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
9
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
10
---
11
block/parallels.c | 6 +++++-
12
1 file changed, 5 insertions(+), 1 deletion(-)
7
13
8
ERROR: Use of volatile is usually wrong, please add a comment
14
diff --git a/block/parallels.c b/block/parallels.c
9
#30: FILE: block/nvme.c:691:
10
+ volatile NvmeBar *regs;
11
12
This is a false positive: in our case we are accessing I/O registers,
13
so the 'volatile' use is justified.
14
15
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
16
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
17
Message-Id: <20200922083821.578519-5-philmd@redhat.com>
18
---
19
block/nvme.c | 23 +++++++++--------------
20
1 file changed, 9 insertions(+), 14 deletions(-)
21
22
diff --git a/block/nvme.c b/block/nvme.c
23
index XXXXXXX..XXXXXXX 100644
15
index XXXXXXX..XXXXXXX 100644
24
--- a/block/nvme.c
16
--- a/block/parallels.c
25
+++ b/block/nvme.c
17
+++ b/block/parallels.c
26
@@ -XXX,XX +XXX,XX @@ typedef struct {
18
@@ -XXX,XX +XXX,XX @@ parallels_co_check(BlockDriverState *bs, BdrvCheckResult *res,
27
QEMUBH *completion_bh;
19
prev_off = 0;
28
} NVMeQueuePair;
20
for (i = 0; i < s->bat_size; i++) {
29
21
int64_t off = bat2sect(s, i) << BDRV_SECTOR_BITS;
30
-/* Memory mapped registers */
22
- if (off == 0) {
31
-typedef volatile struct {
23
+ /*
32
- NvmeBar ctrl;
24
+ * If BDRV_FIX_ERRORS is not set, out-of-image BAT entries were not
33
-} NVMeRegs;
25
+ * fixed. Skip not allocated and out-of-image BAT entries.
34
-
26
+ */
35
#define INDEX_ADMIN 0
27
+ if (off == 0 || off + s->cluster_size > res->image_end_offset) {
36
#define INDEX_IO(n) (1 + n)
28
prev_off = 0;
37
29
continue;
38
@@ -XXX,XX +XXX,XX @@ static int nvme_init(BlockDriverState *bs, const char *device, int namespace,
30
}
39
uint64_t timeout_ms;
40
uint64_t deadline, now;
41
Error *local_err = NULL;
42
- NVMeRegs *regs;
43
+ volatile NvmeBar *regs = NULL;
44
45
qemu_co_mutex_init(&s->dma_map_lock);
46
qemu_co_queue_init(&s->dma_flush_queue);
47
@@ -XXX,XX +XXX,XX @@ static int nvme_init(BlockDriverState *bs, const char *device, int namespace,
48
/* Perform initialize sequence as described in NVMe spec "7.6.1
49
* Initialization". */
50
51
- cap = le64_to_cpu(regs->ctrl.cap);
52
+ cap = le64_to_cpu(regs->cap);
53
if (!(cap & (1ULL << 37))) {
54
error_setg(errp, "Device doesn't support NVMe command set");
55
ret = -EINVAL;
56
@@ -XXX,XX +XXX,XX @@ static int nvme_init(BlockDriverState *bs, const char *device, int namespace,
57
timeout_ms = MIN(500 * ((cap >> 24) & 0xFF), 30000);
58
59
/* Reset device to get a clean state. */
60
- regs->ctrl.cc = cpu_to_le32(le32_to_cpu(regs->ctrl.cc) & 0xFE);
61
+ regs->cc = cpu_to_le32(le32_to_cpu(regs->cc) & 0xFE);
62
/* Wait for CSTS.RDY = 0. */
63
deadline = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) + timeout_ms * SCALE_MS;
64
- while (le32_to_cpu(regs->ctrl.csts) & 0x1) {
65
+ while (le32_to_cpu(regs->csts) & 0x1) {
66
if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) > deadline) {
67
error_setg(errp, "Timeout while waiting for device to reset (%"
68
PRId64 " ms)",
69
@@ -XXX,XX +XXX,XX @@ static int nvme_init(BlockDriverState *bs, const char *device, int namespace,
70
}
71
s->nr_queues = 1;
72
QEMU_BUILD_BUG_ON(NVME_QUEUE_SIZE & 0xF000);
73
- regs->ctrl.aqa = cpu_to_le32((NVME_QUEUE_SIZE << 16) | NVME_QUEUE_SIZE);
74
- regs->ctrl.asq = cpu_to_le64(s->queues[INDEX_ADMIN]->sq.iova);
75
- regs->ctrl.acq = cpu_to_le64(s->queues[INDEX_ADMIN]->cq.iova);
76
+ regs->aqa = cpu_to_le32((NVME_QUEUE_SIZE << 16) | NVME_QUEUE_SIZE);
77
+ regs->asq = cpu_to_le64(s->queues[INDEX_ADMIN]->sq.iova);
78
+ regs->acq = cpu_to_le64(s->queues[INDEX_ADMIN]->cq.iova);
79
80
/* After setting up all control registers we can enable device now. */
81
- regs->ctrl.cc = cpu_to_le32((ctz32(NVME_CQ_ENTRY_BYTES) << 20) |
82
+ regs->cc = cpu_to_le32((ctz32(NVME_CQ_ENTRY_BYTES) << 20) |
83
(ctz32(NVME_SQ_ENTRY_BYTES) << 16) |
84
0x1);
85
/* Wait for CSTS.RDY = 1. */
86
now = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
87
deadline = now + timeout_ms * 1000000;
88
- while (!(le32_to_cpu(regs->ctrl.csts) & 0x1)) {
89
+ while (!(le32_to_cpu(regs->csts) & 0x1)) {
90
if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) > deadline) {
91
error_setg(errp, "Timeout while waiting for device to start (%"
92
PRId64 " ms)",
93
--
31
--
94
2.26.2
32
2.40.1
95
1
From: Eric Auger <eric.auger@redhat.com>
1
From: Alexander Ivanov <alexander.ivanov@virtuozzo.com>
2
2
3
Introduce the qemu_vfio_find_fixed/temp_iova helpers which
3
We will add more and more checks, so we need a better code structure
4
respectively allocate IOVAs from the bottom/top parts of the
4
in parallels_co_check. Perform each check in a separate loop
5
usable IOVA range, without picking within host IOVA reserved
5
in a separate helper.
6
windows. The allocation remains basic: if the size is too big
7
for the remainder of the current usable IOVA range, we jump
8
to the next one, leaving a hole in the address map.
9
6
10
Signed-off-by: Eric Auger <eric.auger@redhat.com>
7
Signed-off-by: Alexander Ivanov <alexander.ivanov@virtuozzo.com>
11
Message-id: 20200929085550.30926-3-eric.auger@redhat.com
8
Message-Id: <20230424093147.197643-10-alexander.ivanov@virtuozzo.com>
12
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
9
Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
10
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
13
---
11
---
14
util/vfio-helpers.c | 57 +++++++++++++++++++++++++++++++++++++++++----
12
block/parallels.c | 74 ++++++++++++++++++++++++++++-------------------
15
1 file changed, 53 insertions(+), 4 deletions(-)
13
1 file changed, 45 insertions(+), 29 deletions(-)
16
14
17
diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
15
diff --git a/block/parallels.c b/block/parallels.c
18
index XXXXXXX..XXXXXXX 100644
16
index XXXXXXX..XXXXXXX 100644
19
--- a/util/vfio-helpers.c
17
--- a/block/parallels.c
20
+++ b/util/vfio-helpers.c
18
+++ b/block/parallels.c
21
@@ -XXX,XX +XXX,XX @@ static bool qemu_vfio_verify_mappings(QEMUVFIOState *s)
19
@@ -XXX,XX +XXX,XX @@ parallels_check_outside_image(BlockDriverState *bs, BdrvCheckResult *res,
22
return true;
23
}
20
}
24
21
25
+static int
22
static int coroutine_fn GRAPH_RDLOCK
26
+qemu_vfio_find_fixed_iova(QEMUVFIOState *s, size_t size, uint64_t *iova)
23
-parallels_co_check(BlockDriverState *bs, BdrvCheckResult *res,
27
+{
24
- BdrvCheckMode fix)
28
+ int i;
25
+parallels_check_leak(BlockDriverState *bs, BdrvCheckResult *res,
26
+ BdrvCheckMode fix)
27
{
28
BDRVParallelsState *s = bs->opaque;
29
- int64_t size, prev_off;
30
+ int64_t size;
31
int ret;
32
- uint32_t i;
33
34
size = bdrv_getlength(bs->file->bs);
35
if (size < 0) {
36
@@ -XXX,XX +XXX,XX @@ parallels_co_check(BlockDriverState *bs, BdrvCheckResult *res,
37
return size;
38
}
39
40
+ if (size > res->image_end_offset) {
41
+ int64_t count;
42
+ count = DIV_ROUND_UP(size - res->image_end_offset, s->cluster_size);
43
+ fprintf(stderr, "%s space leaked at the end of the image %" PRId64 "\n",
44
+ fix & BDRV_FIX_LEAKS ? "Repairing" : "ERROR",
45
+ size - res->image_end_offset);
46
+ res->leaks += count;
47
+ if (fix & BDRV_FIX_LEAKS) {
48
+ Error *local_err = NULL;
29
+
49
+
30
+ for (i = 0; i < s->nb_iova_ranges; i++) {
50
+ /*
31
+ if (s->usable_iova_ranges[i].end < s->low_water_mark) {
51
+ * In order to really repair the image, we must shrink it.
32
+ continue;
52
+ * That means we have to pass exact=true.
33
+ }
53
+ */
34
+ s->low_water_mark =
54
+ ret = bdrv_co_truncate(bs->file, res->image_end_offset, true,
35
+ MAX(s->low_water_mark, s->usable_iova_ranges[i].start);
55
+ PREALLOC_MODE_OFF, 0, &local_err);
36
+
56
+ if (ret < 0) {
37
+ if (s->usable_iova_ranges[i].end - s->low_water_mark + 1 >= size ||
57
+ error_report_err(local_err);
38
+ s->usable_iova_ranges[i].end - s->low_water_mark + 1 == 0) {
58
+ res->check_errors++;
39
+ *iova = s->low_water_mark;
59
+ return ret;
40
+ s->low_water_mark += size;
60
+ }
41
+ return 0;
61
+ res->leaks_fixed += count;
42
+ }
62
+ }
43
+ }
63
+ }
44
+ return -ENOMEM;
64
+
65
+ return 0;
45
+}
66
+}
46
+
67
+
47
+static int
68
+static int coroutine_fn GRAPH_RDLOCK
48
+qemu_vfio_find_temp_iova(QEMUVFIOState *s, size_t size, uint64_t *iova)
69
+parallels_co_check(BlockDriverState *bs, BdrvCheckResult *res,
70
+ BdrvCheckMode fix)
49
+{
71
+{
50
+ int i;
72
+ BDRVParallelsState *s = bs->opaque;
73
+ int64_t prev_off;
74
+ int ret;
75
+ uint32_t i;
51
+
76
+
52
+ for (i = s->nb_iova_ranges - 1; i >= 0; i--) {
77
qemu_co_mutex_lock(&s->lock);
53
+ if (s->usable_iova_ranges[i].start > s->high_water_mark) {
78
54
+ continue;
79
parallels_check_unclean(bs, res, fix);
55
+ }
80
@@ -XXX,XX +XXX,XX @@ parallels_co_check(BlockDriverState *bs, BdrvCheckResult *res,
56
+ s->high_water_mark =
81
goto out;
57
+ MIN(s->high_water_mark, s->usable_iova_ranges[i].end + 1);
82
}
83
84
+ ret = parallels_check_leak(bs, res, fix);
85
+ if (ret < 0) {
86
+ goto out;
87
+ }
58
+
88
+
59
+ if (s->high_water_mark - s->usable_iova_ranges[i].start + 1 >= size ||
89
res->bfi.total_clusters = s->bat_size;
60
+ s->high_water_mark - s->usable_iova_ranges[i].start + 1 == 0) {
90
res->bfi.compressed_clusters = 0; /* compression is not supported */
61
+ *iova = s->high_water_mark - size;
91
62
+ s->high_water_mark = *iova;
92
@@ -XXX,XX +XXX,XX @@ parallels_co_check(BlockDriverState *bs, BdrvCheckResult *res,
63
+ return 0;
93
prev_off = off;
64
+ }
65
+ }
66
+ return -ENOMEM;
67
+}
68
+
69
/* Map [host, host + size) area into a contiguous IOVA address space, and store
70
* the result in @iova if not NULL. The caller need to make sure the area is
71
* aligned to page size, and mustn't overlap with existing mapping areas (split
72
@@ -XXX,XX +XXX,XX @@ int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t size,
73
goto out;
74
}
75
if (!temporary) {
76
- iova0 = s->low_water_mark;
77
+ if (qemu_vfio_find_fixed_iova(s, size, &iova0)) {
78
+ ret = -ENOMEM;
79
+ goto out;
80
+ }
81
+
82
mapping = qemu_vfio_add_mapping(s, host, size, index + 1, iova0);
83
if (!mapping) {
84
ret = -ENOMEM;
85
@@ -XXX,XX +XXX,XX @@ int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t size,
86
qemu_vfio_undo_mapping(s, mapping, NULL);
87
goto out;
88
}
89
- s->low_water_mark += size;
90
qemu_vfio_dump_mappings(s);
91
} else {
92
- iova0 = s->high_water_mark - size;
93
+ if (qemu_vfio_find_temp_iova(s, size, &iova0)) {
94
+ ret = -ENOMEM;
95
+ goto out;
96
+ }
97
ret = qemu_vfio_do_mapping(s, host, size, iova0);
98
if (ret) {
99
goto out;
100
}
101
- s->high_water_mark -= size;
102
}
103
}
94
}
104
if (iova) {
95
96
- if (size > res->image_end_offset) {
97
- int64_t count;
98
- count = DIV_ROUND_UP(size - res->image_end_offset, s->cluster_size);
99
- fprintf(stderr, "%s space leaked at the end of the image %" PRId64 "\n",
100
- fix & BDRV_FIX_LEAKS ? "Repairing" : "ERROR",
101
- size - res->image_end_offset);
102
- res->leaks += count;
103
- if (fix & BDRV_FIX_LEAKS) {
104
- Error *local_err = NULL;
105
-
106
- /*
107
- * In order to really repair the image, we must shrink it.
108
- * That means we have to pass exact=true.
109
- */
110
- ret = bdrv_co_truncate(bs->file, res->image_end_offset, true,
111
- PREALLOC_MODE_OFF, 0, &local_err);
112
- if (ret < 0) {
113
- error_report_err(local_err);
114
- res->check_errors++;
115
- goto out;
116
- }
117
- res->leaks_fixed += count;
118
- }
119
- }
120
-
121
out:
122
qemu_co_mutex_unlock(&s->lock);
123
105
--
124
--
106
2.26.2
125
2.40.1
107
diff view generated by jsdifflib
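The hunks above split the old single-water-mark scheme into two helpers: qemu_vfio_find_fixed_iova() walks the usable ranges bottom-up from a low-water mark for long-lived mappings, while qemu_vfio_find_temp_iova() walks top-down from a high-water mark for temporary ones, so neither ever hands out an address inside a reserved region. A toy Python model of that approach (names mirror the QEMU fields, but this is an illustration, not the actual implementation):

```python
class IOVAAllocator:
    """Toy model: fixed mappings grow up from a low-water mark, temporary
    mappings grow down from a high-water mark, both skipping the reserved
    gaps between the usable ranges."""

    def __init__(self, usable_ranges):
        # usable_ranges: ascending list of inclusive (start, end) pairs
        self.ranges = usable_ranges
        self.low_water_mark = usable_ranges[0][0]
        self.high_water_mark = usable_ranges[-1][1] + 1

    def find_fixed_iova(self, size):
        for start, end in self.ranges:
            if end < self.low_water_mark:
                continue                      # range already used up
            # bump the mark into this range if it fell inside a gap
            self.low_water_mark = max(self.low_water_mark, start)
            if end - self.low_water_mark + 1 >= size:
                iova = self.low_water_mark
                self.low_water_mark += size
                return iova
        raise MemoryError("no usable IOVA window")  # C code returns -ENOMEM

    def find_temp_iova(self, size):
        for start, end in reversed(self.ranges):
            if start > self.high_water_mark:
                continue                      # range entirely above the mark
            self.high_water_mark = min(self.high_water_mark, end + 1)
            if self.high_water_mark - start >= size:
                self.high_water_mark -= size
                return self.high_water_mark
        raise MemoryError("no usable IOVA window")
```

The point of the rework is visible when an allocation would straddle a gap: instead of blindly advancing into a reserved region (e.g. an IOMMU MSI window), the allocator skips to the next usable range or fails with -ENOMEM.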
1
From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
1
From: Alexander Ivanov <alexander.ivanov@virtuozzo.com>
2
2
3
Now that we are not maintaining boilerplate code for coroutine
3
We will add more and more checks, so we need a better code structure
4
wrappers, there is no more sense in keeping the extra indirection layer
4
in parallels_co_check. Let each check be performed in a separate loop
5
of bdrv_prwv(). Let's drop it and instead generate pure bdrv_preadv()
5
in a separate helper.
6
and bdrv_pwritev().
7
6
8
Currently, bdrv_pwritev() and bdrv_preadv() are returning bytes on
7
Signed-off-by: Alexander Ivanov <alexander.ivanov@virtuozzo.com>
9
success; the auto-generated functions will instead return zero, as their
8
Reviewed-by: Denis V. Lunev <den@openvz.org>
10
_co_ prototype. Still, it's simple to make the conversion safe: the
9
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
11
only external user of bdrv_pwritev() is test-bdrv-drain, and it is
10
Message-Id: <20230424093147.197643-11-alexander.ivanov@virtuozzo.com>
12
comfortable enough with bdrv_co_pwritev() instead. So prototypes are
11
Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
13
moved to local block/coroutines.h. Next, the only internal use is
12
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
14
bdrv_pread() and bdrv_pwrite(), which are modified to return bytes on
13
---
15
success.
14
block/parallels.c | 52 +++++++++++++++++++++++++++--------------------
15
1 file changed, 30 insertions(+), 22 deletions(-)
16
16
17
Of course, it would be great to convert bdrv_pread() and bdrv_pwrite()
17
diff --git a/block/parallels.c b/block/parallels.c
18
to return 0 on success. But this requires audit (and probably
19
conversion) of all their users, so let's leave it for a future
20
refactoring.
21
22
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
23
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
24
Reviewed-by: Eric Blake <eblake@redhat.com>
25
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
26
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
27
Message-Id: <20200924185414.28642-7-vsementsov@virtuozzo.com>
28
---
29
block/coroutines.h | 10 ++++-----
30
include/block/block.h | 2 --
31
block/io.c | 49 ++++++++---------------------------------
32
tests/test-bdrv-drain.c | 2 +-
33
4 files changed, 15 insertions(+), 48 deletions(-)
34
35
diff --git a/block/coroutines.h b/block/coroutines.h
36
index XXXXXXX..XXXXXXX 100644
18
index XXXXXXX..XXXXXXX 100644
37
--- a/block/coroutines.h
19
--- a/block/parallels.c
38
+++ b/block/coroutines.h
20
+++ b/block/parallels.c
39
@@ -XXX,XX +XXX,XX @@ int coroutine_fn bdrv_co_check(BlockDriverState *bs,
21
@@ -XXX,XX +XXX,XX @@ parallels_check_leak(BlockDriverState *bs, BdrvCheckResult *res,
40
BdrvCheckResult *res, BdrvCheckMode fix);
41
int coroutine_fn bdrv_co_invalidate_cache(BlockDriverState *bs, Error **errp);
42
43
-int coroutine_fn
44
-bdrv_co_prwv(BdrvChild *child, int64_t offset, QEMUIOVector *qiov,
45
- bool is_write, BdrvRequestFlags flags);
46
int generated_co_wrapper
47
-bdrv_prwv(BdrvChild *child, int64_t offset, QEMUIOVector *qiov,
48
- bool is_write, BdrvRequestFlags flags);
49
+bdrv_preadv(BdrvChild *child, int64_t offset, unsigned int bytes,
50
+ QEMUIOVector *qiov, BdrvRequestFlags flags);
51
+int generated_co_wrapper
52
+bdrv_pwritev(BdrvChild *child, int64_t offset, unsigned int bytes,
53
+ QEMUIOVector *qiov, BdrvRequestFlags flags);
54
55
int coroutine_fn
56
bdrv_co_common_block_status_above(BlockDriverState *bs,
57
diff --git a/include/block/block.h b/include/block/block.h
58
index XXXXXXX..XXXXXXX 100644
59
--- a/include/block/block.h
60
+++ b/include/block/block.h
61
@@ -XXX,XX +XXX,XX @@ int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
62
int bytes, BdrvRequestFlags flags);
63
int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags flags);
64
int bdrv_pread(BdrvChild *child, int64_t offset, void *buf, int bytes);
65
-int bdrv_preadv(BdrvChild *child, int64_t offset, QEMUIOVector *qiov);
66
int bdrv_pwrite(BdrvChild *child, int64_t offset, const void *buf, int bytes);
67
-int bdrv_pwritev(BdrvChild *child, int64_t offset, QEMUIOVector *qiov);
68
int bdrv_pwrite_sync(BdrvChild *child, int64_t offset,
69
const void *buf, int count);
70
/*
71
diff --git a/block/io.c b/block/io.c
72
index XXXXXXX..XXXXXXX 100644
73
--- a/block/io.c
74
+++ b/block/io.c
75
@@ -XXX,XX +XXX,XX @@ static int bdrv_check_byte_request(BlockDriverState *bs, int64_t offset,
76
return 0;
22
return 0;
77
}
23
}
78
24
79
-int coroutine_fn bdrv_co_prwv(BdrvChild *child, int64_t offset,
25
-static int coroutine_fn GRAPH_RDLOCK
80
- QEMUIOVector *qiov, bool is_write,
26
-parallels_co_check(BlockDriverState *bs, BdrvCheckResult *res,
81
- BdrvRequestFlags flags)
27
- BdrvCheckMode fix)
82
-{
28
+static void parallels_collect_statistics(BlockDriverState *bs,
83
- if (is_write) {
29
+ BdrvCheckResult *res,
84
- return bdrv_co_pwritev(child, offset, qiov->size, qiov, flags);
30
+ BdrvCheckMode fix)
85
- } else {
31
{
86
- return bdrv_co_preadv(child, offset, qiov->size, qiov, flags);
32
BDRVParallelsState *s = bs->opaque;
87
- }
33
- int64_t prev_off;
88
-}
34
- int ret;
35
+ int64_t off, prev_off;
36
uint32_t i;
37
38
- qemu_co_mutex_lock(&s->lock);
89
-
39
-
90
int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
40
- parallels_check_unclean(bs, res, fix);
91
int bytes, BdrvRequestFlags flags)
92
{
93
- QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, NULL, bytes);
94
-
41
-
95
- return bdrv_prwv(child, offset, &qiov, true, BDRV_REQ_ZERO_WRITE | flags);
42
- ret = parallels_check_outside_image(bs, res, fix);
96
+ return bdrv_pwritev(child, offset, bytes, NULL,
97
+ BDRV_REQ_ZERO_WRITE | flags);
98
}
99
100
/*
101
@@ -XXX,XX +XXX,XX @@ int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags flags)
102
}
103
}
104
105
-/* return < 0 if error. See bdrv_pwrite() for the return codes */
106
-int bdrv_preadv(BdrvChild *child, int64_t offset, QEMUIOVector *qiov)
107
-{
108
- int ret;
109
-
110
- ret = bdrv_prwv(child, offset, qiov, false, 0);
111
- if (ret < 0) {
43
- if (ret < 0) {
112
- return ret;
44
- goto out;
113
- }
45
- }
114
-
46
-
115
- return qiov->size;
47
- ret = parallels_check_leak(bs, res, fix);
116
-}
117
-
118
/* See bdrv_pwrite() for the return codes */
119
int bdrv_pread(BdrvChild *child, int64_t offset, void *buf, int bytes)
120
{
121
+ int ret;
122
QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, bytes);
123
124
if (bytes < 0) {
125
return -EINVAL;
126
}
127
128
- return bdrv_preadv(child, offset, &qiov);
129
-}
130
+ ret = bdrv_preadv(child, offset, bytes, &qiov, 0);
131
132
-int bdrv_pwritev(BdrvChild *child, int64_t offset, QEMUIOVector *qiov)
133
-{
134
- int ret;
135
-
136
- ret = bdrv_prwv(child, offset, qiov, true, 0);
137
- if (ret < 0) {
48
- if (ret < 0) {
138
- return ret;
49
- goto out;
139
- }
50
- }
140
-
51
-
141
- return qiov->size;
52
res->bfi.total_clusters = s->bat_size;
142
+ return ret < 0 ? ret : bytes;
53
res->bfi.compressed_clusters = 0; /* compression is not supported */
143
}
54
144
55
prev_off = 0;
145
/* Return no. of bytes on success or < 0 on error. Important errors are:
56
for (i = 0; i < s->bat_size; i++) {
146
@@ -XXX,XX +XXX,XX @@ int bdrv_pwritev(BdrvChild *child, int64_t offset, QEMUIOVector *qiov)
57
- int64_t off = bat2sect(s, i) << BDRV_SECTOR_BITS;
147
*/
58
+ off = bat2sect(s, i) << BDRV_SECTOR_BITS;
148
int bdrv_pwrite(BdrvChild *child, int64_t offset, const void *buf, int bytes)
59
/*
149
{
60
* If BDRV_FIX_ERRORS is not set, out-of-image BAT entries were not
61
* fixed. Skip not allocated and out-of-image BAT entries.
62
@@ -XXX,XX +XXX,XX @@ parallels_co_check(BlockDriverState *bs, BdrvCheckResult *res,
63
continue;
64
}
65
66
- res->bfi.allocated_clusters++;
67
-
68
if (prev_off != 0 && (prev_off + s->cluster_size) != off) {
69
res->bfi.fragmented_clusters++;
70
}
71
prev_off = off;
72
+ res->bfi.allocated_clusters++;
73
}
74
+}
75
+
76
+static int coroutine_fn GRAPH_RDLOCK
77
+parallels_co_check(BlockDriverState *bs, BdrvCheckResult *res,
78
+ BdrvCheckMode fix)
79
+{
80
+ BDRVParallelsState *s = bs->opaque;
150
+ int ret;
81
+ int ret;
151
QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, bytes);
152
153
if (bytes < 0) {
154
return -EINVAL;
155
}
156
157
- return bdrv_pwritev(child, offset, &qiov);
158
+ ret = bdrv_pwritev(child, offset, bytes, &qiov, 0);
159
+
82
+
160
+ return ret < 0 ? ret : bytes;
83
+ qemu_co_mutex_lock(&s->lock);
161
}
84
+
162
85
+ parallels_check_unclean(bs, res, fix);
163
/*
86
+
164
diff --git a/tests/test-bdrv-drain.c b/tests/test-bdrv-drain.c
87
+ ret = parallels_check_outside_image(bs, res, fix);
165
index XXXXXXX..XXXXXXX 100644
88
+ if (ret < 0) {
166
--- a/tests/test-bdrv-drain.c
89
+ goto out;
167
+++ b/tests/test-bdrv-drain.c
90
+ }
168
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn bdrv_replace_test_co_preadv(BlockDriverState *bs,
91
+
169
}
92
+ ret = parallels_check_leak(bs, res, fix);
170
s->io_co = NULL;
93
+ if (ret < 0) {
171
94
+ goto out;
172
- ret = bdrv_preadv(bs->backing, offset, qiov);
95
+ }
173
+ ret = bdrv_co_preadv(bs->backing, offset, bytes, qiov, 0);
96
+
174
s->has_read = true;
97
+ parallels_collect_statistics(bs, res, fix);
175
98
176
/* Wake up drain_co if it runs */
99
out:
100
qemu_co_mutex_unlock(&s->lock);
177
--
101
--
178
2.26.2
102
2.40.1
179
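The conversion described above hinges on one shim: the generated vectored helpers now return 0 on success like their _co_ prototypes, while the byte-buffer helpers keep the historical bytes-on-success convention by translating at the call site ("return ret < 0 ? ret : bytes;"). A minimal Python sketch of that return-convention shim (hypothetical names, for illustration only):

```python
EINVAL = 22

def preadv(dst, offset, backing):
    """New-style helper: returns 0 on success, -errno on failure."""
    if offset < 0 or offset + len(dst) > len(backing):
        return -EINVAL
    dst[:] = backing[offset:offset + len(dst)]
    return 0

def pread(nbytes, offset, backing):
    """Old-style helper: returns bytes read on success, -errno on failure,
    by translating the 0-on-success result at the call site."""
    buf = bytearray(nbytes)
    ret = preadv(buf, offset, backing)
    return ret if ret < 0 else nbytes   # mirrors "ret < 0 ? ret : bytes"
```

This keeps every existing bdrv_pread()/bdrv_pwrite() caller working unchanged while the vectored layer moves to the cleaner 0-on-success convention.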
1
From: Philippe Mathieu-Daudé <philmd@redhat.com>
1
From: Alexander Ivanov <alexander.ivanov@virtuozzo.com>
2
2
3
Pages are currently mapped READ/WRITE. To be able to use different
3
Replace the way we use the mutex in parallels_co_check() for simpler
4
protections, add a new argument to qemu_vfio_pci_map_bar().
4
and less error-prone code.
5
5
6
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
6
Signed-off-by: Alexander Ivanov <alexander.ivanov@virtuozzo.com>
7
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
7
Reviewed-by: Denis V. Lunev <den@openvz.org>
8
Message-Id: <20200922083821.578519-2-philmd@redhat.com>
8
Message-Id: <20230424093147.197643-12-alexander.ivanov@virtuozzo.com>
9
Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
10
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
9
---
11
---
10
include/qemu/vfio-helpers.h | 2 +-
12
block/parallels.c | 33 ++++++++++++++-------------------
11
block/nvme.c | 3 ++-
13
1 file changed, 14 insertions(+), 19 deletions(-)
12
util/vfio-helpers.c | 4 ++--
13
3 files changed, 5 insertions(+), 4 deletions(-)
14
14
15
diff --git a/include/qemu/vfio-helpers.h b/include/qemu/vfio-helpers.h
15
diff --git a/block/parallels.c b/block/parallels.c
16
index XXXXXXX..XXXXXXX 100644
16
index XXXXXXX..XXXXXXX 100644
17
--- a/include/qemu/vfio-helpers.h
17
--- a/block/parallels.c
18
+++ b/include/qemu/vfio-helpers.h
18
+++ b/block/parallels.c
19
@@ -XXX,XX +XXX,XX @@ int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t size,
19
@@ -XXX,XX +XXX,XX @@ parallels_co_check(BlockDriverState *bs, BdrvCheckResult *res,
20
int qemu_vfio_dma_reset_temporary(QEMUVFIOState *s);
20
BDRVParallelsState *s = bs->opaque;
21
void qemu_vfio_dma_unmap(QEMUVFIOState *s, void *host);
21
int ret;
22
void *qemu_vfio_pci_map_bar(QEMUVFIOState *s, int index,
22
23
- uint64_t offset, uint64_t size,
23
- qemu_co_mutex_lock(&s->lock);
24
+ uint64_t offset, uint64_t size, int prot,
24
+ WITH_QEMU_LOCK_GUARD(&s->lock) {
25
Error **errp);
25
+ parallels_check_unclean(bs, res, fix);
26
void qemu_vfio_pci_unmap_bar(QEMUVFIOState *s, int index, void *bar,
26
27
uint64_t offset, uint64_t size);
27
- parallels_check_unclean(bs, res, fix);
28
diff --git a/block/nvme.c b/block/nvme.c
28
+ ret = parallels_check_outside_image(bs, res, fix);
29
index XXXXXXX..XXXXXXX 100644
29
+ if (ret < 0) {
30
--- a/block/nvme.c
30
+ return ret;
31
+++ b/block/nvme.c
31
+ }
32
@@ -XXX,XX +XXX,XX @@ static int nvme_init(BlockDriverState *bs, const char *device, int namespace,
32
33
goto out;
33
- ret = parallels_check_outside_image(bs, res, fix);
34
- if (ret < 0) {
35
- goto out;
36
- }
37
+ ret = parallels_check_leak(bs, res, fix);
38
+ if (ret < 0) {
39
+ return ret;
40
+ }
41
42
- ret = parallels_check_leak(bs, res, fix);
43
- if (ret < 0) {
44
- goto out;
45
+ parallels_collect_statistics(bs, res, fix);
34
}
46
}
35
47
36
- s->regs = qemu_vfio_pci_map_bar(s->vfio, 0, 0, NVME_BAR_SIZE, errp);
48
- parallels_collect_statistics(bs, res, fix);
37
+ s->regs = qemu_vfio_pci_map_bar(s->vfio, 0, 0, NVME_BAR_SIZE,
49
-
38
+ PROT_READ | PROT_WRITE, errp);
50
-out:
39
if (!s->regs) {
51
- qemu_co_mutex_unlock(&s->lock);
40
ret = -EINVAL;
52
-
41
goto out;
53
- if (ret == 0) {
42
diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
54
- ret = bdrv_co_flush(bs);
43
index XXXXXXX..XXXXXXX 100644
55
- if (ret < 0) {
44
--- a/util/vfio-helpers.c
56
- res->check_errors++;
45
+++ b/util/vfio-helpers.c
57
- }
46
@@ -XXX,XX +XXX,XX @@ static int qemu_vfio_pci_init_bar(QEMUVFIOState *s, int index, Error **errp)
58
+ ret = bdrv_co_flush(bs);
47
* Map a PCI bar area.
59
+ if (ret < 0) {
48
*/
60
+ res->check_errors++;
49
void *qemu_vfio_pci_map_bar(QEMUVFIOState *s, int index,
61
}
50
- uint64_t offset, uint64_t size,
62
51
+ uint64_t offset, uint64_t size, int prot,
63
return ret;
52
Error **errp)
53
{
54
void *p;
55
assert_bar_index_valid(s, index);
56
p = mmap(NULL, MIN(size, s->bar_region_info[index].size - offset),
57
- PROT_READ | PROT_WRITE, MAP_SHARED,
58
+ prot, MAP_SHARED,
59
s->device, s->bar_region_info[index].offset + offset);
60
if (p == MAP_FAILED) {
61
error_setg_errno(errp, errno, "Failed to map BAR region");
62
--
64
--
63
2.26.2
65
2.40.1
64
1
From: Philippe Mathieu-Daudé <philmd@redhat.com>
1
From: Alexander Ivanov <alexander.ivanov@virtuozzo.com>
2
2
3
Use self-explicit SCALE_MS definition instead of magic value
3
All the offsets in the BAT must be lower than the file size.
4
(missed in similar commit e4f310fe7f5).
4
Fix the check condition so that the whole cluster must fit in the image.
5
5
6
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
6
Signed-off-by: Alexander Ivanov <alexander.ivanov@virtuozzo.com>
7
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
7
Reviewed-by: Denis V. Lunev <den@openvz.org>
8
Message-Id: <20200922083821.578519-7-philmd@redhat.com>
8
Message-Id: <20230424093147.197643-13-alexander.ivanov@virtuozzo.com>
9
Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
10
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
9
---
11
---
10
block/nvme.c | 2 +-
12
block/parallels.c | 2 +-
11
1 file changed, 1 insertion(+), 1 deletion(-)
13
1 file changed, 1 insertion(+), 1 deletion(-)
12
14
13
diff --git a/block/nvme.c b/block/nvme.c
15
diff --git a/block/parallels.c b/block/parallels.c
14
index XXXXXXX..XXXXXXX 100644
16
index XXXXXXX..XXXXXXX 100644
15
--- a/block/nvme.c
17
--- a/block/parallels.c
16
+++ b/block/nvme.c
18
+++ b/block/parallels.c
17
@@ -XXX,XX +XXX,XX @@ static int nvme_init(BlockDriverState *bs, const char *device, int namespace,
19
@@ -XXX,XX +XXX,XX @@ parallels_check_outside_image(BlockDriverState *bs, BdrvCheckResult *res,
18
CC_EN_MASK);
20
high_off = 0;
19
/* Wait for CSTS.RDY = 1. */
21
for (i = 0; i < s->bat_size; i++) {
20
now = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
22
off = bat2sect(s, i) << BDRV_SECTOR_BITS;
21
- deadline = now + timeout_ms * 1000000;
23
- if (off > size) {
22
+ deadline = now + timeout_ms * SCALE_MS;
24
+ if (off + s->cluster_size > size) {
23
while (!NVME_CSTS_RDY(le32_to_cpu(regs->csts))) {
25
fprintf(stderr, "%s cluster %u is outside image\n",
24
if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) > deadline) {
26
fix & BDRV_FIX_ERRORS ? "Repairing" : "ERROR", i);
25
error_setg(errp, "Timeout while waiting for device to start (%"
27
res->corruptions++;
26
--
28
--
27
2.26.2
29
2.40.1
28
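The one-line fix above changes the bounds check from "off > size" to "off + s->cluster_size > size": a BAT entry is only valid if the entire cluster it points to lies inside the file, so a cluster that starts at or just before end-of-file must also be flagged. A tiny model of the corrected predicate (the 64 KiB cluster size here is just an illustrative value):

```python
CLUSTER_SIZE = 64 * 1024   # hypothetical cluster size for illustration

def cluster_outside_image(off, image_size, cluster_size=CLUSTER_SIZE):
    # Patched condition: the entry is bad unless the *whole* cluster
    # it references fits inside the image file.
    return off + cluster_size > image_size
```

The old "off > size" form accepted an offset equal to the file size, i.e. a cluster lying entirely past end-of-file, which is exactly the case the check exists to catch.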
1
From: Stefano Garzarella <sgarzare@redhat.com>
1
From: Jean-Louis Dupond <jean-louis@dupond.be>
2
2
3
When we added the io_uring AIO engine, we forgot to update qemu-options.hx,
3
When we, for example, have a sparse qcow2 image and discard=unmap is enabled,
4
so the qemu(1) man page and qemu help were outdated.
4
there can be a lot of fragmentation in the image after some time, especially on VMs
5
5
that do a lot of writes/deletes.
6
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
6
This causes the qcow2 image to grow even over 110% of its virtual size,
7
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
7
because the free gaps in the image get too small to allocate new
8
Reviewed-by: Julia Suvorova <jusual@redhat.com>
8
continuous clusters. So it allocates new space at the end of the image.
9
Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
9
10
Message-Id: <20200924151511.131471-1-sgarzare@redhat.com>
10
Disabling discard is not an option, as discard is needed to keep the
11
incremental backup size as low as possible. Without discard, the
12
incremental backups would become large, as qemu thinks it's just dirty
13
blocks but does not know the blocks are unneeded.
14
So we need to avoid fragmentation but also 'empty' the unneeded blocks in
15
the image to have a small incremental backup.
16
17
In addition, we also want to send the discards further down the stack, so
18
the underlying blocks are still discarded.
19
20
Therefore we introduce a new qcow2 option "discard-no-unref".
21
When setting this option to true, discards will no longer have the qcow2
22
driver relinquish cluster allocations. Other than that, the request is
23
handled as normal: All clusters in range are marked as zero, and, if
24
pass-discard-request is true, it is passed further down the stack.
25
The only difference is that the now-zero clusters are preallocated
26
instead of being unallocated.
27
This will avoid fragmentation on the qcow2 image.
28
29
Fixes: https://gitlab.com/qemu-project/qemu/-/issues/1621
30
Signed-off-by: Jean-Louis Dupond <jean-louis@dupond.be>
31
Message-Id: <20230605084523.34134-2-jean-louis@dupond.be>
32
Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
33
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
11
---
34
---
12
qemu-options.hx | 10 ++++++----
35
qapi/block-core.json | 12 ++++++++++++
13
1 file changed, 6 insertions(+), 4 deletions(-)
36
block/qcow2.h | 3 +++
14
37
block/qcow2-cluster.c | 32 ++++++++++++++++++++++++++++----
38
block/qcow2.c | 18 ++++++++++++++++++
39
qemu-options.hx | 12 ++++++++++++
40
5 files changed, 73 insertions(+), 4 deletions(-)
41
42
diff --git a/qapi/block-core.json b/qapi/block-core.json
43
index XXXXXXX..XXXXXXX 100644
44
--- a/qapi/block-core.json
45
+++ b/qapi/block-core.json
46
@@ -XXX,XX +XXX,XX @@
47
# @pass-discard-other: whether discard requests for the data source
48
# should be issued on other occasions where a cluster gets freed
49
#
50
+# @discard-no-unref: when enabled, discards from the guest will not cause
51
+# cluster allocations to be relinquished. This prevents qcow2 fragmentation
52
+# that would be caused by such discards. Besides potential
53
+# performance degradation, such fragmentation can lead to increased
54
+# allocation of clusters past the end of the image file,
55
+# resulting in image files whose file length can grow much larger
56
+# than their guest disk size would suggest.
57
+# If image file length is of concern (e.g. when storing qcow2
58
+# images directly on block devices), you should consider enabling
59
+# this option. (since 8.1)
60
+#
61
# @overlap-check: which overlap checks to perform for writes to the
62
# image, defaults to 'cached' (since 2.2)
63
#
64
@@ -XXX,XX +XXX,XX @@
65
'*pass-discard-request': 'bool',
66
'*pass-discard-snapshot': 'bool',
67
'*pass-discard-other': 'bool',
68
+ '*discard-no-unref': 'bool',
69
'*overlap-check': 'Qcow2OverlapChecks',
70
'*cache-size': 'int',
71
'*l2-cache-size': 'int',
72
diff --git a/block/qcow2.h b/block/qcow2.h
73
index XXXXXXX..XXXXXXX 100644
74
--- a/block/qcow2.h
75
+++ b/block/qcow2.h
76
@@ -XXX,XX +XXX,XX @@
77
#define QCOW2_OPT_DISCARD_REQUEST "pass-discard-request"
78
#define QCOW2_OPT_DISCARD_SNAPSHOT "pass-discard-snapshot"
79
#define QCOW2_OPT_DISCARD_OTHER "pass-discard-other"
80
+#define QCOW2_OPT_DISCARD_NO_UNREF "discard-no-unref"
81
#define QCOW2_OPT_OVERLAP "overlap-check"
82
#define QCOW2_OPT_OVERLAP_TEMPLATE "overlap-check.template"
83
#define QCOW2_OPT_OVERLAP_MAIN_HEADER "overlap-check.main-header"
84
@@ -XXX,XX +XXX,XX @@ typedef struct BDRVQcow2State {
85
86
bool discard_passthrough[QCOW2_DISCARD_MAX];
87
88
+ bool discard_no_unref;
89
+
90
int overlap_check; /* bitmask of Qcow2MetadataOverlap values */
91
bool signaled_corruption;
92
93
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
94
index XXXXXXX..XXXXXXX 100644
95
--- a/block/qcow2-cluster.c
96
+++ b/block/qcow2-cluster.c
97
@@ -XXX,XX +XXX,XX @@ static int discard_in_l2_slice(BlockDriverState *bs, uint64_t offset,
98
uint64_t new_l2_bitmap = old_l2_bitmap;
99
QCow2ClusterType cluster_type =
100
qcow2_get_cluster_type(bs, old_l2_entry);
101
+ bool keep_reference = (cluster_type != QCOW2_CLUSTER_COMPRESSED) &&
102
+ !full_discard &&
103
+ (s->discard_no_unref &&
104
+ type == QCOW2_DISCARD_REQUEST);
105
106
/*
107
* If full_discard is true, the cluster should not read back as zeroes,
108
@@ -XXX,XX +XXX,XX @@ static int discard_in_l2_slice(BlockDriverState *bs, uint64_t offset,
109
new_l2_entry = new_l2_bitmap = 0;
110
} else if (bs->backing || qcow2_cluster_is_allocated(cluster_type)) {
111
if (has_subclusters(s)) {
112
- new_l2_entry = 0;
113
+ if (keep_reference) {
114
+ new_l2_entry = old_l2_entry;
115
+ } else {
116
+ new_l2_entry = 0;
117
+ }
118
new_l2_bitmap = QCOW_L2_BITMAP_ALL_ZEROES;
119
} else {
120
- new_l2_entry = s->qcow_version >= 3 ? QCOW_OFLAG_ZERO : 0;
121
+ if (s->qcow_version >= 3) {
122
+ if (keep_reference) {
123
+ new_l2_entry |= QCOW_OFLAG_ZERO;
124
+ } else {
125
+ new_l2_entry = QCOW_OFLAG_ZERO;
126
+ }
127
+ } else {
128
+ new_l2_entry = 0;
129
+ }
130
}
131
}
132
133
@@ -XXX,XX +XXX,XX @@ static int discard_in_l2_slice(BlockDriverState *bs, uint64_t offset,
134
if (has_subclusters(s)) {
135
set_l2_bitmap(s, l2_slice, l2_index + i, new_l2_bitmap);
136
}
137
- /* Then decrease the refcount */
138
- qcow2_free_any_cluster(bs, old_l2_entry, type);
139
+ if (!keep_reference) {
140
+ /* Then decrease the refcount */
141
+ qcow2_free_any_cluster(bs, old_l2_entry, type);
142
+ } else if (s->discard_passthrough[type] &&
143
+ (cluster_type == QCOW2_CLUSTER_NORMAL ||
144
+ cluster_type == QCOW2_CLUSTER_ZERO_ALLOC)) {
145
+ /* If we keep the reference, pass on the discard still */
146
+ bdrv_pdiscard(s->data_file, old_l2_entry & L2E_OFFSET_MASK,
147
+ s->cluster_size);
148
+ }
149
}
150
151
qcow2_cache_put(s->l2_table_cache, (void **) &l2_slice);
152
diff --git a/block/qcow2.c b/block/qcow2.c
153
index XXXXXXX..XXXXXXX 100644
154
--- a/block/qcow2.c
155
+++ b/block/qcow2.c
156
@@ -XXX,XX +XXX,XX @@ static const char *const mutable_opts[] = {
157
QCOW2_OPT_DISCARD_REQUEST,
158
QCOW2_OPT_DISCARD_SNAPSHOT,
159
QCOW2_OPT_DISCARD_OTHER,
160
+ QCOW2_OPT_DISCARD_NO_UNREF,
161
QCOW2_OPT_OVERLAP,
162
QCOW2_OPT_OVERLAP_TEMPLATE,
163
QCOW2_OPT_OVERLAP_MAIN_HEADER,
164
@@ -XXX,XX +XXX,XX @@ static QemuOptsList qcow2_runtime_opts = {
165
.type = QEMU_OPT_BOOL,
166
.help = "Generate discard requests when other clusters are freed",
167
},
168
+ {
169
+ .name = QCOW2_OPT_DISCARD_NO_UNREF,
170
+ .type = QEMU_OPT_BOOL,
171
+ .help = "Do not unreference discarded clusters",
172
+ },
173
{
174
.name = QCOW2_OPT_OVERLAP,
175
.type = QEMU_OPT_STRING,
176
@@ -XXX,XX +XXX,XX @@ typedef struct Qcow2ReopenState {
177
bool use_lazy_refcounts;
178
int overlap_check;
179
bool discard_passthrough[QCOW2_DISCARD_MAX];
180
+ bool discard_no_unref;
181
uint64_t cache_clean_interval;
182
QCryptoBlockOpenOptions *crypto_opts; /* Disk encryption runtime options */
183
} Qcow2ReopenState;
184
@@ -XXX,XX +XXX,XX @@ static int qcow2_update_options_prepare(BlockDriverState *bs,
185
r->discard_passthrough[QCOW2_DISCARD_OTHER] =
186
qemu_opt_get_bool(opts, QCOW2_OPT_DISCARD_OTHER, false);
187
188
+ r->discard_no_unref = qemu_opt_get_bool(opts, QCOW2_OPT_DISCARD_NO_UNREF,
189
+ false);
190
+ if (r->discard_no_unref && s->qcow_version < 3) {
191
+ error_setg(errp,
192
+ "discard-no-unref is only supported since qcow2 version 3");
193
+ ret = -EINVAL;
194
+ goto fail;
195
+ }
196
+
197
switch (s->crypt_method_header) {
198
case QCOW_CRYPT_NONE:
199
if (encryptfmt) {
200
@@ -XXX,XX +XXX,XX @@ static void qcow2_update_options_commit(BlockDriverState *bs,
201
s->discard_passthrough[i] = r->discard_passthrough[i];
202
}
203
204
+ s->discard_no_unref = r->discard_no_unref;
205
+
206
if (s->cache_clean_interval != r->cache_clean_interval) {
207
cache_clean_timer_del(bs);
208
s->cache_clean_interval = r->cache_clean_interval;
15
diff --git a/qemu-options.hx b/qemu-options.hx
209
diff --git a/qemu-options.hx b/qemu-options.hx
16
index XXXXXXX..XXXXXXX 100644
210
index XXXXXXX..XXXXXXX 100644
17
--- a/qemu-options.hx
211
--- a/qemu-options.hx
18
+++ b/qemu-options.hx
212
+++ b/qemu-options.hx
19
@@ -XXX,XX +XXX,XX @@ SRST
213
@@ -XXX,XX +XXX,XX @@ SRST
20
The path to the image file in the local filesystem
214
issued on other occasions where a cluster gets freed
21
215
(on/off; default: off)
22
``aio``
216
23
- Specifies the AIO backend (threads/native, default: threads)
217
+ ``discard-no-unref``
24
+ Specifies the AIO backend (threads/native/io_uring,
218
+ When enabled, discards from the guest will not cause cluster
25
+ default: threads)
219
+ allocations to be relinquished. This prevents qcow2 fragmentation
26
220
+ that would be caused by such discards. Besides potential
27
``locking``
221
+ performance degradation, such fragmentation can lead to increased
28
Specifies whether the image file is protected with Linux OFD
222
+ allocation of clusters past the end of the image file,
29
@@ -XXX,XX +XXX,XX @@ DEF("drive", HAS_ARG, QEMU_OPTION_drive,
223
+ resulting in image files whose file length can grow much larger
30
"-drive [file=file][,if=type][,bus=n][,unit=m][,media=d][,index=i]\n"
224
+ than their guest disk size would suggest.
31
" [,cache=writethrough|writeback|none|directsync|unsafe][,format=f]\n"
225
+ If image file length is of concern (e.g. when storing qcow2
32
" [,snapshot=on|off][,rerror=ignore|stop|report]\n"
226
+ images directly on block devices), you should consider enabling
33
- " [,werror=ignore|stop|report|enospc][,id=name][,aio=threads|native]\n"
227
+ this option.
34
+ " [,werror=ignore|stop|report|enospc][,id=name]\n"
228
+
35
+ " [,aio=threads|native|io_uring]\n"
229
``overlap-check``
36
" [,readonly=on|off][,copy-on-read=on|off]\n"
230
Which overlap checks to perform for writes to the image
37
" [,discard=ignore|unmap][,detect-zeroes=on|off|unmap]\n"
231
(none/constant/cached/all; default: cached). For details or
38
" [[,bps=b]|[[,bps_rd=r][,bps_wr=w]]]\n"
39
@@ -XXX,XX +XXX,XX @@ SRST
40
The default mode is ``cache=writeback``.
41
42
``aio=aio``
43
- aio is "threads", or "native" and selects between pthread based
44
- disk I/O and native Linux AIO.
45
+ aio is "threads", "native", or "io_uring" and selects between pthread
46
+ based disk I/O, native Linux AIO, or Linux io_uring API.
47
48
``format=format``
49
Specify which disk format will be used rather than detecting the
50
--
232
--
51
2.26.2
233
2.40.1
52
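The core of the discard-no-unref patch is the keep_reference predicate added to discard_in_l2_slice(): on a guest discard request with the option enabled, the cluster is marked zero but its allocation (refcount) is retained, except for compressed clusters and full-discard operations. A toy Python model of that decision, transcribed from the hunk above:

```python
# Cluster types and discard reasons, modelled as plain strings.
COMPRESSED, NORMAL, ZERO_ALLOC = "compressed", "normal", "zero-alloc"
DISCARD_REQUEST, DISCARD_OTHER = "request", "other"

def keep_reference(cluster_type, full_discard, discard_no_unref, discard_type):
    """Mirror of the keep_reference condition in discard_in_l2_slice():
    retain the allocation only for guest-initiated discards of
    non-compressed clusters when discard-no-unref is on, and never
    when a full discard was requested."""
    return (cluster_type != COMPRESSED
            and not full_discard
            and discard_no_unref
            and discard_type == DISCARD_REQUEST)
```

When keep_reference is true the driver still zeroes the L2 bitmap and, if pass-discard-request is set, forwards the discard to the data file; only the refcount release is skipped. On the command line the option should look like any other qcow2 runtime option, e.g. `-blockdev driver=qcow2,discard-no-unref=on,...` (name taken from the QAPI addition above).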