Series comparison

-[Qemu-devel] [PULL for-2.9 0/4] Block patches
+[PULL v3 0/5] Block patches
-The following changes since commit 55a19ad8b2d0797e3a8fe90ab99a9bb713824059:
+The following changes since commit 813bac3d8d70d85cb7835f7945eb9eed84c2d8d0:
-  Update version for v2.9.0-rc1 release (2017-03-21 17:13:29 +0000)
+  Merge tag '2023q3-bsd-user-pull-request' of https://gitlab.com/bsdimp/qemu into staging (2023-08-29 08:58:00 -0400)
-are available in the git repository at:
+are available in the Git repository at:
-  https://github.com/codyprime/qemu-kvm-jtc.git tags/block-pull-request
+  https://gitlab.com/stefanha/qemu.git tags/block-pull-request
-for you to fetch changes up to 600ac6a0ef5c06418446ef2f37407bddcc51b21c:
+for you to fetch changes up to 87ec6f55af38e29be5b2b65a8acf84da73e06d06:
-  blockjob: add devops to blockjob backends (2017-03-22 13:26:27 -0400)
+  aio-posix: zero out io_uring sqe user_data (2023-08-30 07:39:59 -0400)
 ----------------------------------------------------------------
-Block patches for 2.9
+Pull request
 v3:
 - Drop UFS emulation due to CI failures
 - Add "aio-posix: zero out io_uring sqe user_data"
 ----------------------------------------------------------------
-John Snow (3):
+Andrey Drobyshev (3):
-  blockjob: add block_job_start_shim
+  block: add subcluster_size field to BlockDriverInfo
-  block-backend: add drained_begin / drained_end ops
+  block/io: align requests to subcluster_size
-  blockjob: add devops to blockjob backends
+  tests/qemu-iotests/197: add testcase for CoR with subclusters
-Paolo Bonzini (1):
+Fabiano Rosas (1):
-  blockjob: avoid recursive AioContext locking
+  block-migration: Ensure we don't crash during migration cleanup
- block/block-backend.c          | 24 ++++++++++++++--
+Stefan Hajnoczi (1):
- blockjob.c                     | 63 ++++++++++++++++++++++++++++++++----------
+  aio-posix: zero out io_uring sqe user_data
- include/sysemu/block-backend.h |  8 ++++++
- 3 files changed, 79 insertions(+), 16 deletions(-)
+ include/block/block-common.h |  5 ++++
  include/block/block-io.h     |  8 +++---
  block.c                      |  7 +++++
  block/io.c                   | 50 ++++++++++++++++++------------------
  block/mirror.c               |  8 +++---
  block/qcow2.c                |  1 +
  migration/block.c            | 11 ++++++--
  util/fdmon-io_uring.c        |  2 ++
  tests/qemu-iotests/197       | 29 +++++++++++++++++++++
  tests/qemu-iotests/197.out   | 24 +++++++++++++++++
  10 files changed, 110 insertions(+), 35 deletions(-)
 --
-.9.3
+.41.0

-New patch
+[PULL v3 1/5] block-migration: Ensure we don't crash during migration cleanup
+From: Fabiano Rosas <farosas@suse.de>
+We can fail the blk_insert_bs() at init_blk_migration(), leaving the
+BlkMigDevState without a dirty_bitmap and BlockDriverState. Account
+for the possibly missing elements when doing cleanup.
+Fix the following crashes:
+Thread 1 "qemu-system-x86" received signal SIGSEGV, Segmentation fault.
+0x0000555555ec83ef in bdrv_release_dirty_bitmap (bitmap=0x0) at ../block/dirty-bitmap.c:359
+BlockDriverState *bs = bitmap->bs;
+ #0  0x0000555555ec83ef in bdrv_release_dirty_bitmap (bitmap=0x0) at ../block/dirty-bitmap.c:359
+ #1  0x0000555555bba331 in unset_dirty_tracking () at ../migration/block.c:371
+ #2  0x0000555555bbad98 in block_migration_cleanup_bmds () at ../migration/block.c:681
+Thread 1 "qemu-system-x86" received signal SIGSEGV, Segmentation fault.
+0x0000555555e971ff in bdrv_op_unblock (bs=0x0, op=BLOCK_OP_TYPE_BACKUP_SOURCE, reason=0x0) at ../block.c:7073
+QLIST_FOREACH_SAFE(blocker, &bs->op_blockers[op], list, next) {
+ #0  0x0000555555e971ff in bdrv_op_unblock (bs=0x0, op=BLOCK_OP_TYPE_BACKUP_SOURCE, reason=0x0) at ../block.c:7073
+ #1  0x0000555555e9734a in bdrv_op_unblock_all (bs=0x0, reason=0x0) at ../block.c:7095
+ #2  0x0000555555bbae13 in block_migration_cleanup_bmds () at ../migration/block.c:690
+Signed-off-by: Fabiano Rosas <farosas@suse.de>
+Message-id: 20230731203338.27581-1-farosas@suse.de
+Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
+---
+ migration/block.c | 11 +++++++++--
+ 1 file changed, 9 insertions(+), 2 deletions(-)
+diff --git a/migration/block.c b/migration/block.c
+index XXXXXXX..XXXXXXX 100644
+--- a/migration/block.c
++++ b/migration/block.c
+@@ -XXX,XX +XXX,XX @@ static void unset_dirty_tracking(void)
+     BlkMigDevState *bmds;
+     QSIMPLEQ_FOREACH(bmds, &block_mig_state.bmds_list, entry) {
+-        bdrv_release_dirty_bitmap(bmds->dirty_bitmap);
++        if (bmds->dirty_bitmap) {
++            bdrv_release_dirty_bitmap(bmds->dirty_bitmap);
++        }
+     }
+ }
+@@ -XXX,XX +XXX,XX @@ static int64_t get_remaining_dirty(void)
+ static void block_migration_cleanup_bmds(void)
+ {
+     BlkMigDevState *bmds;
++    BlockDriverState *bs;
+     AioContext *ctx;
+     unset_dirty_tracking();
+     while ((bmds = QSIMPLEQ_FIRST(&block_mig_state.bmds_list)) != NULL) {
+         QSIMPLEQ_REMOVE_HEAD(&block_mig_state.bmds_list, entry);
+-        bdrv_op_unblock_all(blk_bs(bmds->blk), bmds->blocker);
++
++        bs = blk_bs(bmds->blk);
++        if (bs) {
++            bdrv_op_unblock_all(bs, bmds->blocker);
++        }
+         error_free(bmds->blocker);
+         /* Save ctx, because bmds->blk can disappear during blk_unref.  */
+--
+.41.0
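
The guarded cleanup in the patch above can be illustrated outside QEMU. The following is a minimal sketch under stated assumptions: `demo_res`, `demo_dev`, `demo_release()` and `demo_cleanup()` are hypothetical stand-ins invented for this example, not QEMU APIs. Like bdrv_release_dirty_bitmap(), the release helper dereferences its argument unconditionally, so the caller has to skip members that were never set up.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical resource with a reference count. */
struct demo_res {
    int refcnt;
};

/* Hypothetical per-device state; either member may be NULL if
 * initialization failed part-way through. */
struct demo_dev {
    struct demo_res *bitmap;
    struct demo_res *node;
};

/* Dereferences its argument unconditionally, so passing NULL would crash. */
static void demo_release(struct demo_res *res)
{
    res->refcnt--;
}

/* Release whatever each device actually holds; return how many
 * resources were released. */
static int demo_cleanup(struct demo_dev *devs, size_t n)
{
    int released = 0;

    assert(devs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (devs[i].bitmap) {
            demo_release(devs[i].bitmap);
            devs[i].bitmap = NULL;
            released++;
        }
        if (devs[i].node) {
            demo_release(devs[i].node);
            devs[i].node = NULL;
            released++;
        }
    }
    return released;
}
```

A device whose setup failed before either member was assigned is simply skipped, which mirrors the intent of the two added `if` checks.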

-[Qemu-devel] [PULL for-2.9 4/4] blockjob: add devops to blockjob backends
+[PULL v3 2/5] block: add subcluster_size field to BlockDriverInfo
-From: John Snow <jsnow@redhat.com>
+From: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
-This lets us hook into drained_begin and drained_end requests from the
+This is going to be used in the subsequent commit as requests alignment
-backend level, which is particularly useful for making sure that all
+(in particular, during copy-on-read).  This value only makes sense for
-jobs associated with a particular node (whether the source or the target)
+the formats which support subclusters (currently QCOW2 only).  If this
-receive a drain request.
+field isn't set by driver's own bdrv_get_info() implementation, we
 simply set it equal to the cluster size thus treating each cluster as
 having a single subcluster.
-Suggested-by: Kevin Wolf <kwolf@redhat.com>
+Reviewed-by: Eric Blake <eblake@redhat.com>
-Signed-off-by: John Snow <jsnow@redhat.com>
+Reviewed-by: Denis V. Lunev <den@openvz.org>
-Reviewed-by: Jeff Cody <jcody@redhat.com>
+Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
-Message-id: 20170316212351.13797-4-jsnow@redhat.com
+Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
-Signed-off-by: Jeff Cody <jcody@redhat.com>
+Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
 Message-ID: <20230711172553.234055-2-andrey.drobyshev@virtuozzo.com>
 ---
- blockjob.c | 29 ++++++++++++++++++++++++-----
+ include/block/block-common.h | 5 +++++
- 1 file changed, 24 insertions(+), 5 deletions(-)
+ block.c                      | 7 +++++++
  block/qcow2.c                | 1 +
  3 files changed, 13 insertions(+)
-diff --git a/blockjob.c b/blockjob.c
+diff --git a/include/block/block-common.h b/include/block/block-common.h
 index XXXXXXX..XXXXXXX 100644
---- a/blockjob.c
+--- a/include/block/block-common.h
-+++ b/blockjob.c
++++ b/include/block/block-common.h
-@@ -XXX,XX +XXX,XX @@ static const BdrvChildRole child_job = {
+@@ -XXX,XX +XXX,XX @@ typedef struct BlockZoneWps {
-     .stay_at_node       = true,
+ typedef struct BlockDriverInfo {
- };
+     /* in bytes, 0 if irrelevant */
+     int cluster_size;
-+static void block_job_drained_begin(void *opaque)
++    /*
-+{
++     * A fraction of cluster_size, if supported (currently QCOW2 only); if
-+    BlockJob *job = opaque;
++     * disabled or unsupported, set equal to cluster_size.
-+    block_job_pause(job);
++     */
-+}
++    int subcluster_size;
-+
+     /* offset at which the VM state can be saved (0 if not possible) */
-+static void block_job_drained_end(void *opaque)
+     int64_t vm_state_offset;
-+{
+     bool is_dirty;
-+    BlockJob *job = opaque;
+diff --git a/block.c b/block.c
-+    block_job_resume(job);
+index XXXXXXX..XXXXXXX 100644
-+}
+--- a/block.c
-+
++++ b/block.c
-+static const BlockDevOps block_job_dev_ops = {
+@@ -XXX,XX +XXX,XX @@ int coroutine_fn bdrv_co_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
-+    .drained_begin = block_job_drained_begin,
+     }
-+    .drained_end = block_job_drained_end,
+     memset(bdi, 0, sizeof(*bdi));
-+};
+     ret = drv->bdrv_co_get_info(bs, bdi);
-+
++    if (bdi->subcluster_size == 0) {
- BlockJob *block_job_next(BlockJob *job)
++        /*
 +         * If the driver left this unset, subclusters are not supported.
 +         * Then it is safe to treat each cluster as having only one subcluster.
 +         */
 +        bdi->subcluster_size = bdi->cluster_size;
 +    }
      if (ret < 0) {
          return ret;
      }
 diff --git a/block/qcow2.c b/block/qcow2.c
 index XXXXXXX..XXXXXXX 100644
 --- a/block/qcow2.c
 +++ b/block/qcow2.c
@@ -XXX,XX +XXX,XX @@ qcow2_co_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
  {
-     if (!job) {
+     BDRVQcow2State *s = bs->opaque;
-@@ -XXX,XX +XXX,XX @@ void *block_job_create(const char *job_id, const BlockJobDriver *driver,
+     bdi->cluster_size = s->cluster_size;
-     }
++    bdi->subcluster_size = s->subcluster_size;
+     bdi->vm_state_offset = qcow2_vm_state_offset(s);
-     job = g_malloc0(driver->instance_size);
+     bdi->is_dirty = s->incompatible_features & QCOW2_INCOMPAT_DIRTY;
--    error_setg(&job->blocker, "block device is in use by block job: %s",
+     return 0;
 -               BlockJobType_lookup[driver->job_type]);
 -    block_job_add_bdrv(job, "main node", bs, 0, BLK_PERM_ALL, &error_abort);
 -    bdrv_op_unblock(bs, BLOCK_OP_TYPE_DATAPLANE, job->blocker);
 -
      job->driver        = driver;
      job->id            = g_strdup(job_id);
      job->blk           = blk;
@@ -XXX,XX +XXX,XX @@ void *block_job_create(const char *job_id, const BlockJobDriver *driver,
      job->paused        = true;
      job->pause_count   = 1;
      job->refcnt        = 1;
 +
 +    error_setg(&job->blocker, "block device is in use by block job: %s",
 +               BlockJobType_lookup[driver->job_type]);
 +    block_job_add_bdrv(job, "main node", bs, 0, BLK_PERM_ALL, &error_abort);
      bs->job = job;
 +    blk_set_dev_ops(blk, &block_job_dev_ops, job);
 +    bdrv_op_unblock(bs, BLOCK_OP_TYPE_DATAPLANE, job->blocker);
 +
      QLIST_INSERT_HEAD(&block_jobs, job, job_list);
      blk_add_aio_context_notifier(blk, block_job_attached_aio_context,
 --
-.9.3
+.41.0
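
The defaulting rule from the "add subcluster_size field" patch can be sketched as a standalone program. This is an illustration only: `demo_info`, `demo_get_info()` and the two driver callbacks are hypothetical names, and the 2 KiB subcluster value is an assumption chosen for the example.

```c
#include <assert.h>
#include <string.h>

/* Trimmed-down, hypothetical stand-in for BlockDriverInfo. */
struct demo_info {
    int cluster_size;
    int subcluster_size;
};

typedef int (*demo_get_info_fn)(struct demo_info *info);

/* Zero the struct, ask the driver, then fall back to one subcluster
 * per cluster if the driver left subcluster_size unset. */
static int demo_get_info(demo_get_info_fn driver_cb, struct demo_info *info)
{
    int ret;

    assert(driver_cb != NULL && info != NULL);
    memset(info, 0, sizeof(*info));
    ret = driver_cb(info);
    if (info->subcluster_size == 0) {
        info->subcluster_size = info->cluster_size;
    }
    return ret;
}

/* Hypothetical driver that knows nothing about subclusters. */
static int demo_plain_driver(struct demo_info *info)
{
    info->cluster_size = 65536;
    return 0;
}

/* Hypothetical driver that reports subclusters (value assumed). */
static int demo_subcluster_driver(struct demo_info *info)
{
    info->cluster_size = 65536;
    info->subcluster_size = 2048;
    return 0;
}
```

With `demo_plain_driver` the caller sees `subcluster_size == cluster_size`; with `demo_subcluster_driver` the driver's own value is kept, so callers can use `subcluster_size` without checking which kind of driver they have.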

-[Qemu-devel] [PULL for-2.9 3/4] block-backend: add drained_begin / drained_end ops
+[PULL v3 3/5] block/io: align requests to subcluster_size
-From: John Snow <jsnow@redhat.com>
+From: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
-Allow block backends to forward drain requests to their devices/users.
+When target image is using subclusters, and we align the request during
-The initial intended purpose for this patch is to allow BBs to forward
+copy-on-read, it makes sense to align to subcluster_size rather than
-requests along to BlockJobs, which will want to pause if their associated
+cluster_size.  Otherwise we end up with unnecessary allocations.
-BB has entered a drained region.
+This commit renames bdrv_round_to_clusters() to bdrv_round_to_subclusters()
-Signed-off-by: John Snow <jsnow@redhat.com>
+and utilizes subcluster_size field of BlockDriverInfo to make necessary
-Reviewed-by: Jeff Cody <jcody@redhat.com>
+alignments.  It affects copy-on-read as well as mirror job (which is
-Message-id: 20170316212351.13797-3-jsnow@redhat.com
+using bdrv_round_to_clusters()).
-Signed-off-by: Jeff Cody <jcody@redhat.com>
 This change also fixes the following bug with failing assert (covered by
 the test in the subsequent commit):
 qemu-img create -f qcow2 base.qcow2 64K
 qemu-img create -f qcow2 -o extended_l2=on,backing_file=base.qcow2,backing_fmt=qcow2 img.qcow2 64K
 qemu-io -c "write -P 0xaa 0 2K" img.qcow2
 qemu-io -C -c "read -P 0x00 2K 62K" img.qcow2
 qemu-io: ../block/io.c:1236: bdrv_co_do_copy_on_readv: Assertion `skip_bytes < pnum' failed.
 Reviewed-by: Eric Blake <eblake@redhat.com>
 Reviewed-by: Denis V. Lunev <den@openvz.org>
 Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
 Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
 Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
 Message-ID: <20230711172553.234055-3-andrey.drobyshev@virtuozzo.com>
 ---
- block/block-backend.c          | 24 ++++++++++++++++++++++--
+ include/block/block-io.h |  8 +++----
- include/sysemu/block-backend.h |  8 ++++++++
+ block/io.c               | 50 ++++++++++++++++++++--------------------
- 2 files changed, 30 insertions(+), 2 deletions(-)
+ block/mirror.c           |  8 +++----
+ 3 files changed, 33 insertions(+), 33 deletions(-)
-diff --git a/block/block-backend.c b/block/block-backend.c
 diff --git a/include/block/block-io.h b/include/block/block-io.h
 index XXXXXXX..XXXXXXX 100644
---- a/block/block-backend.c
+--- a/include/block/block-io.h
-+++ b/block/block-backend.c
++++ b/include/block/block-io.h
-@@ -XXX,XX +XXX,XX @@ struct BlockBackend {
+@@ -XXX,XX +XXX,XX @@ bdrv_get_info(BlockDriverState *bs, BlockDriverInfo *bdi);
-     bool allow_write_beyond_eof;
+ ImageInfoSpecific *bdrv_get_specific_info(BlockDriverState *bs,
+                                           Error **errp);
-     NotifierList remove_bs_notifiers, insert_bs_notifiers;
+ BlockStatsSpecific *bdrv_get_specific_stats(BlockDriverState *bs);
-+
+-void bdrv_round_to_clusters(BlockDriverState *bs,
-+    int quiesce_counter;
+-                            int64_t offset, int64_t bytes,
- };
+-                            int64_t *cluster_offset,
+-                            int64_t *cluster_bytes);
- typedef struct BlockBackendAIOCB {
++void bdrv_round_to_subclusters(BlockDriverState *bs,
-@@ -XXX,XX +XXX,XX @@ void blk_set_dev_ops(BlockBackend *blk, const BlockDevOps *ops,
++                               int64_t offset, int64_t bytes,
-                      void *opaque)
++                               int64_t *cluster_offset,
 +                               int64_t *cluster_bytes);
  void bdrv_get_backing_filename(BlockDriverState *bs,
                                 char *filename, int filename_size);
 diff --git a/block/io.c b/block/io.c
 index XXXXXXX..XXXXXXX 100644
 --- a/block/io.c
 +++ b/block/io.c
@@ -XXX,XX +XXX,XX @@ BdrvTrackedRequest *coroutine_fn bdrv_co_get_self_request(BlockDriverState *bs)
  }
  /**
 - * Round a region to cluster boundaries
 + * Round a region to subcluster (if supported) or cluster boundaries
   */
  void coroutine_fn GRAPH_RDLOCK
 -bdrv_round_to_clusters(BlockDriverState *bs, int64_t offset, int64_t bytes,
 -                       int64_t *cluster_offset, int64_t *cluster_bytes)
 +bdrv_round_to_subclusters(BlockDriverState *bs, int64_t offset, int64_t bytes,
 +                          int64_t *align_offset, int64_t *align_bytes)
  {
-     /* All drivers that use blk_set_dev_ops() are qdevified and we want to keep
+     BlockDriverInfo bdi;
--     * it that way, so we can assume blk->dev is a DeviceState if blk->dev_ops
+     IO_CODE();
--     * is set. */
+-    if (bdrv_co_get_info(bs, &bdi) < 0 || bdi.cluster_size == 0) {
-+     * it that way, so we can assume blk->dev, if present, is a DeviceState if
+-        *cluster_offset = offset;
-+     * blk->dev_ops is set. Non-device users may use dev_ops without device. */
+-        *cluster_bytes = bytes;
-     assert(!blk->legacy_dev);
++    if (bdrv_co_get_info(bs, &bdi) < 0 || bdi.subcluster_size == 0) {
++        *align_offset = offset;
-     blk->dev_ops = ops;
++        *align_bytes = bytes;
-     blk->dev_opaque = opaque;
+     } else {
-+
+-        int64_t c = bdi.cluster_size;
-+    /* Are we currently quiesced? Should we enforce this right now? */
+-        *cluster_offset = QEMU_ALIGN_DOWN(offset, c);
-+    if (blk->quiesce_counter && ops->drained_begin) {
+-        *cluster_bytes = QEMU_ALIGN_UP(offset - *cluster_offset + bytes, c);
-+        ops->drained_begin(opaque);
++        int64_t c = bdi.subcluster_size;
-+    }
++        *align_offset = QEMU_ALIGN_DOWN(offset, c);
 +        *align_bytes = QEMU_ALIGN_UP(offset - *align_offset + bytes, c);
      }
  }
- /*
+@@ -XXX,XX +XXX,XX @@ bdrv_co_do_copy_on_readv(BdrvChild *child, int64_t offset, int64_t bytes,
-@@ -XXX,XX +XXX,XX @@ static void blk_root_drained_begin(BdrvChild *child)
+     void *bounce_buffer = NULL;
- {
-     BlockBackend *blk = child->opaque;
+     BlockDriver *drv = bs->drv;
+-    int64_t cluster_offset;
-+    if (++blk->quiesce_counter == 1) {
+-    int64_t cluster_bytes;
-+        if (blk->dev_ops && blk->dev_ops->drained_begin) {
++    int64_t align_offset;
-+            blk->dev_ops->drained_begin(blk->dev_opaque);
++    int64_t align_bytes;
-+        }
+     int64_t skip_bytes;
-+    }
+     int ret;
-+
+     int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
-     /* Note that blk->root may not be accessible here yet if we are just
+@@ -XXX,XX +XXX,XX @@ bdrv_co_do_copy_on_readv(BdrvChild *child, int64_t offset, int64_t bytes,
-      * attaching to a BlockDriverState that is drained. Use child instead. */
+      * BDRV_REQUEST_MAX_BYTES (even when the original read did not), which
+      * is one reason we loop rather than doing it all at once.
-@@ -XXX,XX +XXX,XX @@ static void blk_root_drained_begin(BdrvChild *child)
+      */
- static void blk_root_drained_end(BdrvChild *child)
+-    bdrv_round_to_clusters(bs, offset, bytes, &cluster_offset, &cluster_bytes);
- {
+-    skip_bytes = offset - cluster_offset;
-     BlockBackend *blk = child->opaque;
++    bdrv_round_to_subclusters(bs, offset, bytes, &align_offset, &align_bytes);
-+    assert(blk->quiesce_counter);
++    skip_bytes = offset - align_offset;
-     assert(blk->public.io_limits_disabled);
+     trace_bdrv_co_do_copy_on_readv(bs, offset, bytes,
-     --blk->public.io_limits_disabled;
+-                                   cluster_offset, cluster_bytes);
-+
++                                   align_offset, align_bytes);
-+    if (--blk->quiesce_counter == 0) {
-+        if (blk->dev_ops && blk->dev_ops->drained_end) {
+-    while (cluster_bytes) {
-+            blk->dev_ops->drained_end(blk->dev_opaque);
++    while (align_bytes) {
-+        }
+         int64_t pnum;
-+    }
- }
+         if (skip_write) {
-diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
+             ret = 1; /* "already allocated", so nothing will be copied */
 -            pnum = MIN(cluster_bytes, max_transfer);
 +            pnum = MIN(align_bytes, max_transfer);
          } else {
 -            ret = bdrv_is_allocated(bs, cluster_offset,
 -                                    MIN(cluster_bytes, max_transfer), &pnum);
 +            ret = bdrv_is_allocated(bs, align_offset,
 +                                    MIN(align_bytes, max_transfer), &pnum);
              if (ret < 0) {
                  /*
                   * Safe to treat errors in querying allocation as if
                   * unallocated; we'll probably fail again soon on the
                   * read, but at least that will set a decent errno.
                   */
 -                pnum = MIN(cluster_bytes, max_transfer);
 +                pnum = MIN(align_bytes, max_transfer);
              }
              /* Stop at EOF if the image ends in the middle of the cluster */
@@ -XXX,XX +XXX,XX @@ bdrv_co_do_copy_on_readv(BdrvChild *child, int64_t offset, int64_t bytes,
              /* Must copy-on-read; use the bounce buffer */
              pnum = MIN(pnum, MAX_BOUNCE_BUFFER);
              if (!bounce_buffer) {
 -                int64_t max_we_need = MAX(pnum, cluster_bytes - pnum);
 +                int64_t max_we_need = MAX(pnum, align_bytes - pnum);
                  int64_t max_allowed = MIN(max_transfer, MAX_BOUNCE_BUFFER);
                  int64_t bounce_buffer_len = MIN(max_we_need, max_allowed);
@@ -XXX,XX +XXX,XX @@ bdrv_co_do_copy_on_readv(BdrvChild *child, int64_t offset, int64_t bytes,
              }
              qemu_iovec_init_buf(&local_qiov, bounce_buffer, pnum);
 -            ret = bdrv_driver_preadv(bs, cluster_offset, pnum,
 +            ret = bdrv_driver_preadv(bs, align_offset, pnum,
                                       &local_qiov, 0, 0);
              if (ret < 0) {
                  goto err;
@@ -XXX,XX +XXX,XX @@ bdrv_co_do_copy_on_readv(BdrvChild *child, int64_t offset, int64_t bytes,
                  /* FIXME: Should we (perhaps conditionally) be setting
                   * BDRV_REQ_MAY_UNMAP, if it will allow for a sparser copy
                   * that still correctly reads as zero? */
 -                ret = bdrv_co_do_pwrite_zeroes(bs, cluster_offset, pnum,
 +                ret = bdrv_co_do_pwrite_zeroes(bs, align_offset, pnum,
                                                 BDRV_REQ_WRITE_UNCHANGED);
              } else {
                  /* This does not change the data on the disk, it is not
                   * necessary to flush even in cache=writethrough mode.
                   */
 -                ret = bdrv_driver_pwritev(bs, cluster_offset, pnum,
 +                ret = bdrv_driver_pwritev(bs, align_offset, pnum,
                                            &local_qiov, 0,
                                            BDRV_REQ_WRITE_UNCHANGED);
              }
@@ -XXX,XX +XXX,XX @@ bdrv_co_do_copy_on_readv(BdrvChild *child, int64_t offset, int64_t bytes,
              }
          }
 -        cluster_offset += pnum;
 -        cluster_bytes -= pnum;
 +        align_offset += pnum;
 +        align_bytes -= pnum;
          progress += pnum - skip_bytes;
          skip_bytes = 0;
      }
 diff --git a/block/mirror.c b/block/mirror.c
 index XXXXXXX..XXXXXXX 100644
---- a/include/sysemu/block-backend.h
+--- a/block/mirror.c
-+++ b/include/sysemu/block-backend.h
++++ b/block/mirror.c
-@@ -XXX,XX +XXX,XX @@ typedef struct BlockDevOps {
+@@ -XXX,XX +XXX,XX @@ static int coroutine_fn mirror_cow_align(MirrorBlockJob *s, int64_t *offset,
-      * Runs when the size changed (e.g. monitor command block_resize)
+     need_cow |= !test_bit((*offset + *bytes - 1) / s->granularity,
-      */
+                           s->cow_bitmap);
-     void (*resize_cb)(void *opaque);
+     if (need_cow) {
-+    /*
+-        bdrv_round_to_clusters(blk_bs(s->target), *offset, *bytes,
-+     * Runs when the backend receives a drain request.
+-                               &align_offset, &align_bytes);
-+     */
++        bdrv_round_to_subclusters(blk_bs(s->target), *offset, *bytes,
-+    void (*drained_begin)(void *opaque);
++                                  &align_offset, &align_bytes);
-+    /*
+     }
-+     * Runs when the backend's last drain request ends.
-+     */
+     if (align_bytes > max_bytes) {
-+    void (*drained_end)(void *opaque);
+@@ -XXX,XX +XXX,XX @@ static void coroutine_fn mirror_iteration(MirrorBlockJob *s)
- } BlockDevOps;
+             int64_t target_offset;
+             int64_t target_bytes;
- /* This struct is embedded in (the private) BlockBackend struct and contains
+             WITH_GRAPH_RDLOCK_GUARD() {
 -                bdrv_round_to_clusters(blk_bs(s->target), offset, io_bytes,
 -                                       &target_offset, &target_bytes);
 +                bdrv_round_to_subclusters(blk_bs(s->target), offset, io_bytes,
 +                                          &target_offset, &target_bytes);
              }
              if (target_offset == offset &&
                  target_bytes == io_bytes) {
 --
-.9.3
+.41.0
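
The rounding arithmetic in the "align requests to subcluster_size" patch can be checked with a small standalone sketch. The macros below are local stand-ins for QEMU_ALIGN_DOWN / QEMU_ALIGN_UP, and `round_region()` and `skip_bytes_for()` are hypothetical helpers written for this example. The 2 KiB subcluster size is an assumption, consistent with the 0x800-byte extents in the test output of the following patch.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for QEMU's alignment macros. */
#define ALIGN_DOWN(n, m) ((n) / (m) * (m))
#define ALIGN_UP(n, m)   ALIGN_DOWN((n) + (m) - 1, (m))

/* Hypothetical helper: widen [offset, offset + bytes) to multiples of
 * granularity, using the same arithmetic as the patch. */
static void round_region(int64_t offset, int64_t bytes, int64_t granularity,
                         int64_t *align_offset, int64_t *align_bytes)
{
    assert(granularity > 0 && offset >= 0 && bytes >= 0);
    *align_offset = ALIGN_DOWN(offset, granularity);
    *align_bytes = ALIGN_UP(offset - *align_offset + bytes, granularity);
}

/* Bytes at the start of the aligned region that precede the request. */
static int64_t skip_bytes_for(int64_t offset, int64_t bytes,
                              int64_t granularity)
{
    int64_t align_offset, align_bytes;

    round_region(offset, bytes, granularity, &align_offset, &align_bytes);
    return offset - align_offset;
}
```

For the reproducer in the commit message (read of 62K at offset 2K), rounding to a 64K cluster gives the region [0, 64K) with 2K of skipped bytes, whereas rounding to an assumed 2K subcluster leaves the request unchanged with nothing to skip. The first case is presumably what tripped the `skip_bytes < pnum` assertion, since the only allocated extent in the top image is the 2K written at offset 0.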

-[Qemu-devel] [PULL for-2.9 1/4] blockjob: avoid recursive AioContext locking
+[PULL v3 4/5] tests/qemu-iotests/197: add testcase for CoR with subclusters
-From: Paolo Bonzini <pbonzini@redhat.com>
+From: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
-Streaming or any other block job hangs when performed on a block device
+Add testcase which checks that allocations during copy-on-read are
-that has a non-default iothread.  This happens because the AioContext
+performed on the subcluster basis when subclusters are enabled in target
-is acquired twice by block_job_defer_to_main_loop_bh and then released
+image.
 only once by BDRV_POLL_WHILE.  (Insert rants on recursive mutexes, which
 unfortunately are a temporary but necessary evil for iothreads at the
 moment).
-Luckily, the reason for the double acquisition is simple; the function
+This testcase also triggers the following assert with previous commit
-acquires the AioContext for both the job iothread and the BDS iothread,
+not being applied, so we check that as well:
 in case the BDS iothread was changed while the job was running.  It
 is therefore enough to skip the second acquisition when the two
 AioContexts are one and the same.
-Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
+qemu-io: ../block/io.c:1236: bdrv_co_do_copy_on_readv: Assertion `skip_bytes < pnum' failed.
 Reviewed-by: Eric Blake <eblake@redhat.com>
-Reviewed-by: Jeff Cody <jcody@redhat.com>
+Reviewed-by: Denis V. Lunev <den@openvz.org>
-Message-id: 1490118490-5597-1-git-send-email-pbonzini@redhat.com
+Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
-Signed-off-by: Jeff Cody <jcody@redhat.com>
+Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
 Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
 Message-ID: <20230711172553.234055-4-andrey.drobyshev@virtuozzo.com>
 ---
- blockjob.c | 8 ++++++--
+ tests/qemu-iotests/197     | 29 +++++++++++++++++++++++++++++
- 1 file changed, 6 insertions(+), 2 deletions(-)
+ tests/qemu-iotests/197.out | 24 ++++++++++++++++++++++++
  2 files changed, 53 insertions(+)
-diff --git a/blockjob.c b/blockjob.c
+diff --git a/tests/qemu-iotests/197 b/tests/qemu-iotests/197
 index XXXXXXX..XXXXXXX 100755
 --- a/tests/qemu-iotests/197
 +++ b/tests/qemu-iotests/197
@@ -XXX,XX +XXX,XX @@ $QEMU_IO -f qcow2 -C -c 'read 0 1024' "$TEST_WRAP" | _filter_qemu_io
  $QEMU_IO -f qcow2 -c map "$TEST_WRAP"
  _check_test_img
 +echo
 +echo '=== Copy-on-read with subclusters ==='
 +echo
 +
 +# Create base and top images 64K (1 cluster) each.  Make subclusters enabled
 +# for the top image
 +_make_test_img 64K
 +IMGPROTO=file IMGFMT=qcow2 TEST_IMG_FILE="$TEST_WRAP" \
 +    _make_test_img --no-opts -o extended_l2=true -F "$IMGFMT" -b "$TEST_IMG" \
 +    64K | _filter_img_create
 +
 +$QEMU_IO -c "write -P 0xaa 0 64k" "$TEST_IMG" | _filter_qemu_io
 +
 +# Allocate individual subclusters in the top image, and not the whole cluster
 +$QEMU_IO -c "write -P 0xbb 28K 2K" -c "write -P 0xcc 34K 2K" "$TEST_WRAP" \
 +    | _filter_qemu_io
 +
 +# Only 2 subclusters should be allocated in the top image at this point
 +$QEMU_IMG map "$TEST_WRAP" | _filter_qemu_img_map
 +
 +# Actual copy-on-read operation
 +$QEMU_IO -C -c "read -P 0xaa 30K 4K" "$TEST_WRAP" | _filter_qemu_io
 +
 +# And here we should have 4 subclusters allocated right in the middle of the
 +# top image. Make sure the whole cluster remains unallocated
 +$QEMU_IMG map "$TEST_WRAP" | _filter_qemu_img_map
 +
 +_check_test_img
 +
  # success, all done
  echo '*** done'
  status=0
 diff --git a/tests/qemu-iotests/197.out b/tests/qemu-iotests/197.out
 index XXXXXXX..XXXXXXX 100644
---- a/blockjob.c
+--- a/tests/qemu-iotests/197.out
-+++ b/blockjob.c
++++ b/tests/qemu-iotests/197.out
-@@ -XXX,XX +XXX,XX @@ static void block_job_defer_to_main_loop_bh(void *opaque)
+@@ -XXX,XX +XXX,XX @@ read 1024/1024 bytes at offset 0
+ 1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-     /* Fetch BDS AioContext again, in case it has changed */
+ 1 KiB (0x400) bytes     allocated at offset 0 bytes (0x0)
-     aio_context = blk_get_aio_context(data->job->blk);
+ No errors were found on the image.
--    aio_context_acquire(aio_context);
++
-+    if (aio_context != data->aio_context) {
++=== Copy-on-read with subclusters ===
-+        aio_context_acquire(aio_context);
++
-+    }
++Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=65536
++Formatting 'TEST_DIR/t.wrap.IMGFMT', fmt=IMGFMT size=65536 backing_file=TEST_DIR/t.IMGFMT backing_fmt=IMGFMT
-     data->job->deferred_to_main_loop = false;
++wrote 65536/65536 bytes at offset 0
-     data->fn(data->job, data->opaque);
++64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
++wrote 2048/2048 bytes at offset 28672
--    aio_context_release(aio_context);
++2 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-+    if (aio_context != data->aio_context) {
++wrote 2048/2048 bytes at offset 34816
-+        aio_context_release(aio_context);
++2 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-+    }
++Offset          Length          File
++0               0x7000          TEST_DIR/t.IMGFMT
-     aio_context_release(data->aio_context);
++0x7000          0x800           TEST_DIR/t.wrap.IMGFMT
++0x7800          0x1000          TEST_DIR/t.IMGFMT
 +0x8800          0x800           TEST_DIR/t.wrap.IMGFMT
 +0x9000          0x7000          TEST_DIR/t.IMGFMT
 +read 4096/4096 bytes at offset 30720
 +4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +Offset          Length          File
 +0               0x7000          TEST_DIR/t.IMGFMT
 +0x7000          0x2000          TEST_DIR/t.wrap.IMGFMT
 +0x9000          0x7000          TEST_DIR/t.IMGFMT
 +No errors were found on the image.
  *** done
 --
-.9.3
+.41.0

-[Qemu-devel] [PULL for-2.9 2/4] blockjob: add block_job_start_shim
+[PULL v3 5/5] aio-posix: zero out io_uring sqe user_data
-From: John Snow <jsnow@redhat.com>
+liburing does not clear sqe->user_data. We must do it ourselves to avoid
 undefined behavior in process_cqe() when user_data is used.
-The purpose of this shim is to allow us to pause pre-started jobs.
+Note that fdmon-io_uring is currently disabled, so this is a latent bug
-The purpose of *that* is to allow us to buffer a pause request that
+that does not affect users. Let's merge this fix now to make it easier
-will be able to take effect before the job ever does any work, allowing
+to enable fdmon-io_uring in the future (and I'm working on that).
 us to create jobs during a quiescent state (under which they will be
 automatically paused), then resuming the jobs after the critical section
 in any order, either:
-(1) -block_job_start
+Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
-    -block_job_resume (via e.g. drained_end)
+Message-ID: <20230426212639.82310-1-stefanha@redhat.com>
 ---
  util/fdmon-io_uring.c | 2 ++
  1 file changed, 2 insertions(+)
-(2) -block_job_resume (via e.g. drained_end)
+diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
     -block_job_start
 The problem that requires a startup wrapper is the idea that a job must
 start in the busy=true state only its first time-- all subsequent entries
 require busy to be false, and the toggling of this state is otherwise
 handled during existing pause and yield points.
 The wrapper simply allows us to mandate that a job can "start," set busy
 to true, then immediately pause only if necessary. We could avoid
 requiring a wrapper, but all jobs would need to do it, so it's been
 factored out here.
 Signed-off-by: John Snow <jsnow@redhat.com>
 Reviewed-by: Jeff Cody <jcody@redhat.com>
 Message-id: 20170316212351.13797-2-jsnow@redhat.com
 Signed-off-by: Jeff Cody <jcody@redhat.com>
 ---
  blockjob.c | 26 +++++++++++++++++++-------
  1 file changed, 19 insertions(+), 7 deletions(-)
 diff --git a/blockjob.c b/blockjob.c
 index XXXXXXX..XXXXXXX 100644
---- a/blockjob.c
+--- a/util/fdmon-io_uring.c
-+++ b/blockjob.c
++++ b/util/fdmon-io_uring.c
-@@ -XXX,XX +XXX,XX @@ static bool block_job_started(BlockJob *job)
+@@ -XXX,XX +XXX,XX @@ static void add_poll_remove_sqe(AioContext *ctx, AioHandler *node)
-     return job->co;
+ #else
      io_uring_prep_poll_remove(sqe, node);
  #endif
 +    io_uring_sqe_set_data(sqe, NULL);
  }
-+/**
+ /* Add a timeout that self-cancels when another cqe becomes ready */
-+ * All jobs must allow a pause point before entering their job proper. This
+@@ -XXX,XX +XXX,XX @@ static void add_timeout_sqe(AioContext *ctx, int64_t ns)
-+ * ensures that jobs can be paused prior to being started, then resumed later.
-+ */
+     sqe = get_sqe(ctx);
-+static void coroutine_fn block_job_co_entry(void *opaque)
+     io_uring_prep_timeout(sqe, &ts, 1, 0);
-+{
++    io_uring_sqe_set_data(sqe, NULL);
 +    BlockJob *job = opaque;
 +
 +    assert(job && job->driver && job->driver->start);
 +    block_job_pause_point(job);
 +    job->driver->start(job);
 +}
 +
  void block_job_start(BlockJob *job)
  {
      assert(job && !block_job_started(job) && job->paused &&
 -           !job->busy && job->driver->start);
 -    job->co = qemu_coroutine_create(job->driver->start, job);
 -    if (--job->pause_count == 0) {
 -        job->paused = false;
 -        job->busy = true;
 -        qemu_coroutine_enter(job->co);
 -    }
 +           job->driver && job->driver->start);
 +    job->co = qemu_coroutine_create(block_job_co_entry, job);
 +    job->pause_count--;
 +    job->busy = true;
 +    job->paused = false;
 +    qemu_coroutine_enter(job->co);
  }
- void block_job_ref(BlockJob *job)
+ /* Add sqes from ctx->submit_list for submission */
 --
-.9.3
+.41.0

The following changes since commit 55a19ad8b2d0797e3a8fe90ab99a9bb713824059:

Update version for v2.9.0-rc1 release (2017-03-21 17:13:29 +0000)

are available in the git repository at:

https://github.com/codyprime/qemu-kvm-jtc.git tags/block-pull-request

for you to fetch changes up to 600ac6a0ef5c06418446ef2f37407bddcc51b21c:

blockjob: add devops to blockjob backends (2017-03-22 13:26:27 -0400)

----------------------------------------------------------------
Block patches for 2.9
----------------------------------------------------------------

John Snow (3):
  blockjob: add block_job_start_shim
  block-backend: add drained_begin / drained_end ops
  blockjob: add devops to blockjob backends

Paolo Bonzini (1):
  blockjob: avoid recursive AioContext locking

 block/block-backend.c          | 24 ++++++++++++++--
 blockjob.c                     | 63 ++++++++++++++++++++++++++++++++----------
 include/sysemu/block-backend.h |  8 ++++++
 3 files changed, 79 insertions(+), 16 deletions(-)

-- 
2.9.3

From: Paolo Bonzini <pbonzini@redhat.com>

Streaming or any other block job hangs when performed on a block device
that has a non-default iothread.  This happens because the AioContext
is acquired twice by block_job_defer_to_main_loop_bh and then released
only once by BDRV_POLL_WHILE.  (Insert rants on recursive mutexes, which
unfortunately are a temporary but necessary evil for iothreads at the
moment).

Luckily, the reason for the double acquisition is simple; the function
acquires the AioContext for both the job iothread and the BDS iothread,
in case the BDS iothread was changed while the job was running.  It
is therefore enough to skip the second acquisition when the two
AioContexts are one and the same.
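
The effect of the double acquisition can be illustrated with a toy
model. This is only a sketch: ToyContext and the toy_* helpers are
hypothetical stand-ins, not QEMU's AioContext API, and the "poll"
helper is assumed to drop exactly one level of a recursive lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a recursively lockable context (not AioContext). */
typedef struct ToyContext {
    int depth;   /* how many times the lock is currently held */
} ToyContext;

static void toy_acquire(ToyContext *ctx)
{
    ctx->depth++;
}

static void toy_release(ToyContext *ctx)
{
    assert(ctx->depth > 0);
    ctx->depth--;
}

/*
 * Hypothetical polling helper: drops one level of the lock while
 * waiting.  Returns true only if that really left the lock free.
 */
static bool toy_poll_once(ToyContext *ctx)
{
    bool unlocked;

    toy_release(ctx);
    unlocked = (ctx->depth == 0);
    toy_acquire(ctx);
    return unlocked;
}

/* Run the deferred work; report whether the poll saw the lock free. */
static bool toy_deferred_run(ToyContext *job_ctx, ToyContext *bds_ctx,
                             bool skip_if_same)
{
    bool need_second = !(skip_if_same && bds_ctx == job_ctx);
    bool ok;

    toy_acquire(job_ctx);
    if (need_second) {
        toy_acquire(bds_ctx);
    }

    ok = toy_poll_once(bds_ctx);

    if (need_second) {
        toy_release(bds_ctx);
    }
    toy_release(job_ctx);
    return ok;
}
```

With skip_if_same set to false and both pointers referring to the same
context, the poll never observes the lock as free, which corresponds to
the hang described above.  Skipping the second acquisition restores the
expected behaviour in this model.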

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 1490118490-5597-1-git-send-email-pbonzini@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
---
 blockjob.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/blockjob.c b/blockjob.c
index XXXXXXX..XXXXXXX 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -XXX,XX +XXX,XX @@ static void block_job_defer_to_main_loop_bh(void *opaque)
 
     /* Fetch BDS AioContext again, in case it has changed */
     aio_context = blk_get_aio_context(data->job->blk);
-    aio_context_acquire(aio_context);
+    if (aio_context != data->aio_context) {
+        aio_context_acquire(aio_context);
+    }
 
     data->job->deferred_to_main_loop = false;
     data->fn(data->job, data->opaque);
 
-    aio_context_release(aio_context);
+    if (aio_context != data->aio_context) {
+        aio_context_release(aio_context);
+    }
 
     aio_context_release(data->aio_context);
 
-- 
2.9.3

From: John Snow <jsnow@redhat.com>

The purpose of this shim is to allow us to pause pre-started jobs.
The purpose of *that* is to allow us to buffer a pause request that
will be able to take effect before the job ever does any work, allowing
us to create jobs during a quiescent state (under which they will be
automatically paused), and then to resume the jobs after the critical
section in any order, either:

(1) -block_job_start
    -block_job_resume (via e.g. drained_end)

(2) -block_job_resume (via e.g. drained_end)
    -block_job_start

A startup wrapper is needed because a job must be entered in the
busy=true state only the first time; all subsequent entries require
busy to be false, and the toggling of this state is otherwise handled
at the existing pause and yield points.

The wrapper simply allows us to mandate that a job can "start," set busy
to true, then immediately pause only if necessary. We could avoid
requiring a wrapper, but all jobs would need to do it, so it's been
factored out here.
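
The two orderings can be checked against a small state model.  This is
a sketch under simplifying assumptions: there are no coroutines, ToyJob
and the toy_job_* functions are hypothetical names, and "entering the
job" is reduced to setting a flag.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct ToyJob {
    int pause_count;  /* outstanding pause requests */
    bool started;     /* toy_job_start() has been called */
    bool paused;      /* parked at a pause point, or not yet started */
    bool running;     /* the job body has been entered */
} ToyJob;

static void toy_job_init(ToyJob *job)
{
    job->pause_count = 1;  /* initial reference, dropped by start */
    job->started = false;
    job->paused = true;
    job->running = false;
}

static void toy_job_pause(ToyJob *job)
{
    job->pause_count++;
}

/* Pause point: park while any pause request is pending, else run. */
static void toy_job_pause_point(ToyJob *job)
{
    if (job->pause_count > 0) {
        job->paused = true;
    } else {
        job->paused = false;
        job->running = true;
    }
}

/* Start always enters the wrapper, whose first action is a pause point. */
static void toy_job_start(ToyJob *job)
{
    assert(!job->started && job->paused);
    job->started = true;
    job->pause_count--;
    job->paused = false;
    toy_job_pause_point(job);
}

/* Resume drops one pause request and re-enters a parked, started job. */
static void toy_job_resume(ToyJob *job)
{
    assert(job->pause_count > 0);
    job->pause_count--;
    if (job->started && job->paused) {
        toy_job_pause_point(job);
    }
}
```

In this model a job created while a pause request is pending reaches
the running state only after both toy_job_start() and toy_job_resume()
have been called, regardless of which comes first.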

Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 20170316212351.13797-2-jsnow@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
---
 blockjob.c | 26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/blockjob.c b/blockjob.c
index XXXXXXX..XXXXXXX 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -XXX,XX +XXX,XX @@ static bool block_job_started(BlockJob *job)
     return job->co;
 }
 
+/**
+ * All jobs must allow a pause point before entering their job proper. This
+ * ensures that jobs can be paused prior to being started, then resumed later.
+ */
+static void coroutine_fn block_job_co_entry(void *opaque)
+{
+    BlockJob *job = opaque;
+
+    assert(job && job->driver && job->driver->start);
+    block_job_pause_point(job);
+    job->driver->start(job);
+}
+
 void block_job_start(BlockJob *job)
 {
     assert(job && !block_job_started(job) && job->paused &&
-           !job->busy && job->driver->start);
-    job->co = qemu_coroutine_create(job->driver->start, job);
-    if (--job->pause_count == 0) {
-        job->paused = false;
-        job->busy = true;
-        qemu_coroutine_enter(job->co);
-    }
+           job->driver && job->driver->start);
+    job->co = qemu_coroutine_create(block_job_co_entry, job);
+    job->pause_count--;
+    job->busy = true;
+    job->paused = false;
+    qemu_coroutine_enter(job->co);
 }
 
 void block_job_ref(BlockJob *job)
-- 
2.9.3

From: John Snow <jsnow@redhat.com>

Allow block backends to forward drain requests to their devices/users.
The initial intended purpose for this patch is to allow BBs to forward
requests along to BlockJobs, which will want to pause if their associated
BB has entered a drained region.
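
The forwarding logic can be sketched in isolation as follows.  The
ToyBackend, ToyDevOps and ToyUser types are hypothetical and only mimic
the shape of the change: nested drain requests are counted, the user is
notified on the first begin and the last end, and a user that registers
while the backend is already drained is notified immediately.

```c
#include <assert.h>
#include <stddef.h>

typedef struct ToyDevOps {
    void (*drained_begin)(void *opaque);
    void (*drained_end)(void *opaque);
} ToyDevOps;

typedef struct ToyBackend {
    const ToyDevOps *ops;
    void *opaque;
    int quiesce_counter;   /* number of nested drained sections */
} ToyBackend;

/* Only the outermost begin is forwarded to the user. */
static void toy_drained_begin(ToyBackend *blk)
{
    if (++blk->quiesce_counter == 1) {
        if (blk->ops && blk->ops->drained_begin) {
            blk->ops->drained_begin(blk->opaque);
        }
    }
}

/* Only the end that closes the outermost section is forwarded. */
static void toy_drained_end(ToyBackend *blk)
{
    assert(blk->quiesce_counter > 0);
    if (--blk->quiesce_counter == 0) {
        if (blk->ops && blk->ops->drained_end) {
            blk->ops->drained_end(blk->opaque);
        }
    }
}

/* Registering while drained delivers the missed begin right away. */
static void toy_set_ops(ToyBackend *blk, const ToyDevOps *ops, void *opaque)
{
    blk->ops = ops;
    blk->opaque = opaque;
    if (blk->quiesce_counter && ops->drained_begin) {
        ops->drained_begin(opaque);
    }
}

/* Example user that simply counts how often it has been paused. */
typedef struct ToyUser {
    int pause_count;
} ToyUser;

static void toy_user_begin(void *opaque)
{
    ((ToyUser *)opaque)->pause_count++;
}

static void toy_user_end(void *opaque)
{
    ((ToyUser *)opaque)->pause_count--;
}

static const ToyDevOps toy_user_ops = {
    .drained_begin = toy_user_begin,
    .drained_end = toy_user_end,
};
```

The counter keeps begin and end notifications balanced from the user's
point of view even when drained sections nest.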

Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 20170316212351.13797-3-jsnow@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
---
 block/block-backend.c          | 24 ++++++++++++++++++++++--
 include/sysemu/block-backend.h |  8 ++++++++
 2 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index XXXXXXX..XXXXXXX 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -XXX,XX +XXX,XX @@ struct BlockBackend {
     bool allow_write_beyond_eof;
 
     NotifierList remove_bs_notifiers, insert_bs_notifiers;
+
+    int quiesce_counter;
 };
 
 typedef struct BlockBackendAIOCB {
@@ -XXX,XX +XXX,XX @@ void blk_set_dev_ops(BlockBackend *blk, const BlockDevOps *ops,
                      void *opaque)
 {
     /* All drivers that use blk_set_dev_ops() are qdevified and we want to keep
-     * it that way, so we can assume blk->dev is a DeviceState if blk->dev_ops
-     * is set. */
+     * it that way, so we can assume blk->dev, if present, is a DeviceState if
+     * blk->dev_ops is set. Non-device users may use dev_ops without device. */
     assert(!blk->legacy_dev);
 
     blk->dev_ops = ops;
     blk->dev_opaque = opaque;
+
+    /* Are we currently quiesced? Should we enforce this right now? */
+    if (blk->quiesce_counter && ops->drained_begin) {
+        ops->drained_begin(opaque);
+    }
 }
 
 /*
@@ -XXX,XX +XXX,XX @@ static void blk_root_drained_begin(BdrvChild *child)
 {
     BlockBackend *blk = child->opaque;
 
+    if (++blk->quiesce_counter == 1) {
+        if (blk->dev_ops && blk->dev_ops->drained_begin) {
+            blk->dev_ops->drained_begin(blk->dev_opaque);
+        }
+    }
+
     /* Note that blk->root may not be accessible here yet if we are just
      * attaching to a BlockDriverState that is drained. Use child instead. */
 
@@ -XXX,XX +XXX,XX @@ static void blk_root_drained_begin(BdrvChild *child)
 static void blk_root_drained_end(BdrvChild *child)
 {
     BlockBackend *blk = child->opaque;
+    assert(blk->quiesce_counter);
 
     assert(blk->public.io_limits_disabled);
     --blk->public.io_limits_disabled;
+
+    if (--blk->quiesce_counter == 0) {
+        if (blk->dev_ops && blk->dev_ops->drained_end) {
+            blk->dev_ops->drained_end(blk->dev_opaque);
+        }
+    }
 }
diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
index XXXXXXX..XXXXXXX 100644
--- a/include/sysemu/block-backend.h
+++ b/include/sysemu/block-backend.h
@@ -XXX,XX +XXX,XX @@ typedef struct BlockDevOps {
      * Runs when the size changed (e.g. monitor command block_resize)
      */
     void (*resize_cb)(void *opaque);
+    /*
+     * Runs when the backend receives a drain request.
+     */
+    void (*drained_begin)(void *opaque);
+    /*
+     * Runs when the backend's last drain request ends.
+     */
+    void (*drained_end)(void *opaque);
 } BlockDevOps;
 
 /* This struct is embedded in (the private) BlockBackend struct and contains
-- 
2.9.3

From: John Snow <jsnow@redhat.com>

This lets us hook into drained_begin and drained_end requests from the
backend level, which is particularly useful for making sure that all
jobs associated with a particular node (whether the source or the target)
receive a drain request.

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 20170316212351.13797-4-jsnow@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
---
 blockjob.c | 29 ++++++++++++++++++++++++-----
 1 file changed, 24 insertions(+), 5 deletions(-)

diff --git a/blockjob.c b/blockjob.c
index XXXXXXX..XXXXXXX 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -XXX,XX +XXX,XX @@ static const BdrvChildRole child_job = {
     .stay_at_node       = true,
 };
 
+static void block_job_drained_begin(void *opaque)
+{
+    BlockJob *job = opaque;
+    block_job_pause(job);
+}
+
+static void block_job_drained_end(void *opaque)
+{
+    BlockJob *job = opaque;
+    block_job_resume(job);
+}
+
+static const BlockDevOps block_job_dev_ops = {
+    .drained_begin = block_job_drained_begin,
+    .drained_end = block_job_drained_end,
+};
+
 BlockJob *block_job_next(BlockJob *job)
 {
     if (!job) {
@@ -XXX,XX +XXX,XX @@ void *block_job_create(const char *job_id, const BlockJobDriver *driver,
     }
 
     job = g_malloc0(driver->instance_size);
-    error_setg(&job->blocker, "block device is in use by block job: %s",
-               BlockJobType_lookup[driver->job_type]);
-    block_job_add_bdrv(job, "main node", bs, 0, BLK_PERM_ALL, &error_abort);
-    bdrv_op_unblock(bs, BLOCK_OP_TYPE_DATAPLANE, job->blocker);
-
     job->driver        = driver;
     job->id            = g_strdup(job_id);
     job->blk           = blk;
@@ -XXX,XX +XXX,XX @@ void *block_job_create(const char *job_id, const BlockJobDriver *driver,
     job->paused        = true;
     job->pause_count   = 1;
     job->refcnt        = 1;
+
+    error_setg(&job->blocker, "block device is in use by block job: %s",
+               BlockJobType_lookup[driver->job_type]);
+    block_job_add_bdrv(job, "main node", bs, 0, BLK_PERM_ALL, &error_abort);
     bs->job = job;
 
+    blk_set_dev_ops(blk, &block_job_dev_ops, job);
+    bdrv_op_unblock(bs, BLOCK_OP_TYPE_DATAPLANE, job->blocker);
+
     QLIST_INSERT_HEAD(&block_jobs, job, job_list);
 
     blk_add_aio_context_notifier(blk, block_job_attached_aio_context,
-- 
2.9.3

The following changes since commit 813bac3d8d70d85cb7835f7945eb9eed84c2d8d0:

Merge tag '2023q3-bsd-user-pull-request' of https://gitlab.com/bsdimp/qemu into staging (2023-08-29 08:58:00 -0400)

are available in the Git repository at:

https://gitlab.com/stefanha/qemu.git tags/block-pull-request

for you to fetch changes up to 87ec6f55af38e29be5b2b65a8acf84da73e06d06:

aio-posix: zero out io_uring sqe user_data (2023-08-30 07:39:59 -0400)

----------------------------------------------------------------
Pull request

v3:
- Drop UFS emulation due to CI failures
- Add "aio-posix: zero out io_uring sqe user_data"

----------------------------------------------------------------

Andrey Drobyshev (3):
  block: add subcluster_size field to BlockDriverInfo
  block/io: align requests to subcluster_size
  tests/qemu-iotests/197: add testcase for CoR with subclusters

Fabiano Rosas (1):
  block-migration: Ensure we don't crash during migration cleanup

Stefan Hajnoczi (1):
  aio-posix: zero out io_uring sqe user_data

-- 
2.41.0

From: Fabiano Rosas <farosas@suse.de>

We can fail the blk_insert_bs() at init_blk_migration(), leaving the
BlkMigDevState without a dirty_bitmap and BlockDriverState. Account
for the possibly missing elements when doing cleanup.
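
The pattern is a cleanup path that tolerates partially initialized
entries.  A minimal sketch, using hypothetical ToyEntry fields in place
of the real BlkMigDevState members and release helpers that, like the
functions in the backtraces below, require non-NULL arguments:

```c
#include <assert.h>
#include <stddef.h>

typedef struct ToyEntry {
    int *bitmap;   /* NULL if setup failed before the bitmap existed */
    int *node;     /* NULL if no node was ever attached */
} ToyEntry;

/* Release helpers that must not be called with NULL. */
static void toy_release_bitmap(int *bitmap)
{
    assert(bitmap);
    *bitmap = 0;
}

static void toy_unblock_node(int *node)
{
    assert(node);
    *node = 0;
}

/* Clean up every entry, skipping whatever was never set up. */
static int toy_cleanup(ToyEntry *entries, size_t n)
{
    int released = 0;

    for (size_t i = 0; i < n; i++) {
        if (entries[i].bitmap) {
            toy_release_bitmap(entries[i].bitmap);
            entries[i].bitmap = NULL;
            released++;
        }
        if (entries[i].node) {
            toy_unblock_node(entries[i].node);
            entries[i].node = NULL;
            released++;
        }
    }
    return released;
}
```

An entry whose initialization failed early contributes nothing to the
cleanup instead of dereferencing a NULL pointer.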

Fix the following crashes:

Thread 1 "qemu-system-x86" received signal SIGSEGV, Segmentation fault.
0x0000555555ec83ef in bdrv_release_dirty_bitmap (bitmap=0x0) at ../block/dirty-bitmap.c:359
359         BlockDriverState *bs = bitmap->bs;
 #0  0x0000555555ec83ef in bdrv_release_dirty_bitmap (bitmap=0x0) at ../block/dirty-bitmap.c:359
 #1  0x0000555555bba331 in unset_dirty_tracking () at ../migration/block.c:371
 #2  0x0000555555bbad98 in block_migration_cleanup_bmds () at ../migration/block.c:681

Thread 1 "qemu-system-x86" received signal SIGSEGV, Segmentation fault.
0x0000555555e971ff in bdrv_op_unblock (bs=0x0, op=BLOCK_OP_TYPE_BACKUP_SOURCE, reason=0x0) at ../block.c:7073
7073        QLIST_FOREACH_SAFE(blocker, &bs->op_blockers[op], list, next) {
 #0  0x0000555555e971ff in bdrv_op_unblock (bs=0x0, op=BLOCK_OP_TYPE_BACKUP_SOURCE, reason=0x0) at ../block.c:7073
 #1  0x0000555555e9734a in bdrv_op_unblock_all (bs=0x0, reason=0x0) at ../block.c:7095
 #2  0x0000555555bbae13 in block_migration_cleanup_bmds () at ../migration/block.c:690

Signed-off-by: Fabiano Rosas <farosas@suse.de>
Message-id: 20230731203338.27581-1-farosas@suse.de
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 migration/block.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/migration/block.c b/migration/block.c
index XXXXXXX..XXXXXXX 100644
--- a/migration/block.c
+++ b/migration/block.c
@@ -XXX,XX +XXX,XX @@ static void unset_dirty_tracking(void)
     BlkMigDevState *bmds;
 
     QSIMPLEQ_FOREACH(bmds, &block_mig_state.bmds_list, entry) {
-        bdrv_release_dirty_bitmap(bmds->dirty_bitmap);
+        if (bmds->dirty_bitmap) {
+            bdrv_release_dirty_bitmap(bmds->dirty_bitmap);
+        }
     }
 }
 
@@ -XXX,XX +XXX,XX @@ static int64_t get_remaining_dirty(void)
 static void block_migration_cleanup_bmds(void)
 {
     BlkMigDevState *bmds;
+    BlockDriverState *bs;
     AioContext *ctx;
 
     unset_dirty_tracking();
 
     while ((bmds = QSIMPLEQ_FIRST(&block_mig_state.bmds_list)) != NULL) {
         QSIMPLEQ_REMOVE_HEAD(&block_mig_state.bmds_list, entry);
-        bdrv_op_unblock_all(blk_bs(bmds->blk), bmds->blocker);
+
+        bs = blk_bs(bmds->blk);
+        if (bs) {
+            bdrv_op_unblock_all(bs, bmds->blocker);
+        }
         error_free(bmds->blocker);
 
         /* Save ctx, because bmds->blk can disappear during blk_unref.  */
-- 
2.41.0

From: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>

This field will be used in the subsequent commit to align requests
(in particular, during copy-on-read).  The value only makes sense for
formats that support subclusters (currently QCOW2 only).  If the
driver's own bdrv_get_info() implementation does not set this field, we
simply set it equal to the cluster size, thus treating each cluster as
having a single subcluster.
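
The fallback can be sketched with two hypothetical drivers, one that
leaves the field unset and one that fills it in.  ToyDriverInfo and the
toy_* functions are illustrative names only, and the 64 KiB cluster
with 32 subclusters is an assumed example configuration.

```c
#include <assert.h>
#include <string.h>

typedef struct ToyDriverInfo {
    int cluster_size;      /* in bytes, 0 if irrelevant */
    int subcluster_size;   /* in bytes, 0 if the driver left it unset */
} ToyDriverInfo;

typedef int (*ToyGetInfoFn)(ToyDriverInfo *info);

/* Driver without subcluster support: only reports the cluster size. */
static int toy_plain_get_info(ToyDriverInfo *info)
{
    info->cluster_size = 65536;
    return 0;
}

/* Driver with subclusters: assumed 32 subclusters per 64 KiB cluster. */
static int toy_subcluster_get_info(ToyDriverInfo *info)
{
    info->cluster_size = 65536;
    info->subcluster_size = 65536 / 32;
    return 0;
}

/* Generic wrapper: zero the struct, ask the driver, apply the default. */
static int toy_get_info(ToyGetInfoFn driver_get_info, ToyDriverInfo *info)
{
    int ret;

    memset(info, 0, sizeof(*info));
    ret = driver_get_info(info);
    if (info->subcluster_size == 0) {
        /* Unset means unsupported: one subcluster per cluster. */
        info->subcluster_size = info->cluster_size;
    }
    if (ret < 0) {
        return ret;
    }
    return 0;
}
```

Callers can then use subcluster_size unconditionally as the alignment
granularity, since it is never smaller than what the format supports.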

Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20230711172553.234055-2-andrey.drobyshev@virtuozzo.com>
---
 include/block/block-common.h | 5 +++++
 block.c                      | 7 +++++++
 block/qcow2.c                | 1 +
 3 files changed, 13 insertions(+)

diff --git a/include/block/block-common.h b/include/block/block-common.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/block-common.h
+++ b/include/block/block-common.h
@@ -XXX,XX +XXX,XX @@ typedef struct BlockZoneWps {
 typedef struct BlockDriverInfo {
     /* in bytes, 0 if irrelevant */
     int cluster_size;
+    /*
+     * A fraction of cluster_size, if supported (currently QCOW2 only); if
+     * disabled or unsupported, set equal to cluster_size.
+     */
+    int subcluster_size;
     /* offset at which the VM state can be saved (0 if not possible) */
     int64_t vm_state_offset;
     bool is_dirty;
diff --git a/block.c b/block.c
index XXXXXXX..XXXXXXX 100644
--- a/block.c
+++ b/block.c
@@ -XXX,XX +XXX,XX @@ int coroutine_fn bdrv_co_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
     }
     memset(bdi, 0, sizeof(*bdi));
     ret = drv->bdrv_co_get_info(bs, bdi);
+    if (bdi->subcluster_size == 0) {
+        /*
+         * If the driver left this unset, subclusters are not supported.
+         * Then it is safe to treat each cluster as having only one subcluster.
+         */
+        bdi->subcluster_size = bdi->cluster_size;
+    }
     if (ret < 0) {
         return ret;
     }
diff --git a/block/qcow2.c b/block/qcow2.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -XXX,XX +XXX,XX @@ qcow2_co_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
 {
     BDRVQcow2State *s = bs->opaque;
     bdi->cluster_size = s->cluster_size;
+    bdi->subcluster_size = s->subcluster_size;
     bdi->vm_state_offset = qcow2_vm_state_offset(s);
     bdi->is_dirty = s->incompatible_features & QCOW2_INCOMPAT_DIRTY;
     return 0;
-- 
2.41.0

From: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>

When the target image uses subclusters and we align the request during
copy-on-read, it makes sense to align to subcluster_size rather than
cluster_size.  Otherwise we end up with unnecessary allocations.

This commit renames bdrv_round_to_clusters() to bdrv_round_to_subclusters()
and uses the subcluster_size field of BlockDriverInfo to compute the
alignment.  It affects copy-on-read as well as the mirror job (which
previously used bdrv_round_to_clusters()).
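
The difference in granularity is easy to see with a worked example.
The sketch below reimplements the rounding arithmetic with hypothetical
toy_* helpers (it does not use the QEMU_ALIGN_* macros), taking the
granularity as a plain parameter.

```c
#include <assert.h>
#include <stdint.h>

/* Round n down to a multiple of m (n >= 0, m > 0). */
static int64_t toy_align_down(int64_t n, int64_t m)
{
    return n / m * m;
}

/* Round n up to a multiple of m (n >= 0, m > 0). */
static int64_t toy_align_up(int64_t n, int64_t m)
{
    return toy_align_down(n + m - 1, m);
}

/*
 * Widen [offset, offset + bytes) so that it starts and ends on a
 * multiple of granularity.  A granularity of 0 leaves it unchanged.
 */
static void toy_round_region(int64_t offset, int64_t bytes,
                             int64_t granularity,
                             int64_t *align_offset, int64_t *align_bytes)
{
    if (granularity == 0) {
        *align_offset = offset;
        *align_bytes = bytes;
    } else {
        *align_offset = toy_align_down(offset, granularity);
        *align_bytes = toy_align_up(offset - *align_offset + bytes,
                                    granularity);
    }
}
```

For a 4 KiB read at offset 30 KiB, a 2 KiB granularity leaves the
region at (30720, 4096), while a 64 KiB granularity widens it to
(0, 65536), that is, the whole cluster.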

This change also fixes the following assertion failure (covered by the
test in the subsequent commit):

qemu-img create -f qcow2 base.qcow2 64K
qemu-img create -f qcow2 -o extended_l2=on,backing_file=base.qcow2,backing_fmt=qcow2 img.qcow2 64K
qemu-io -c "write -P 0xaa 0 2K" img.qcow2
qemu-io -C -c "read -P 0x00 2K 62K" img.qcow2

qemu-io: ../block/io.c:1236: bdrv_co_do_copy_on_readv: Assertion `skip_bytes < pnum' failed.

Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20230711172553.234055-3-andrey.drobyshev@virtuozzo.com>
---
 include/block/block-io.h |  8 +++----
 block/io.c               | 50 ++++++++++++++++++++--------------------
 block/mirror.c           |  8 +++----
 3 files changed, 33 insertions(+), 33 deletions(-)

diff --git a/include/block/block-io.h b/include/block/block-io.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -XXX,XX +XXX,XX @@ bdrv_get_info(BlockDriverState *bs, BlockDriverInfo *bdi);
 ImageInfoSpecific *bdrv_get_specific_info(BlockDriverState *bs,
                                           Error **errp);
 BlockStatsSpecific *bdrv_get_specific_stats(BlockDriverState *bs);
-void bdrv_round_to_clusters(BlockDriverState *bs,
-                            int64_t offset, int64_t bytes,
-                            int64_t *cluster_offset,
-                            int64_t *cluster_bytes);
+void bdrv_round_to_subclusters(BlockDriverState *bs,
+                               int64_t offset, int64_t bytes,
+                               int64_t *cluster_offset,
+                               int64_t *cluster_bytes);
 
 void bdrv_get_backing_filename(BlockDriverState *bs,
                                char *filename, int filename_size);
diff --git a/block/io.c b/block/io.c
index XXXXXXX..XXXXXXX 100644
--- a/block/io.c
+++ b/block/io.c
@@ -XXX,XX +XXX,XX @@ BdrvTrackedRequest *coroutine_fn bdrv_co_get_self_request(BlockDriverState *bs)
 }
 
 /**
- * Round a region to cluster boundaries
+ * Round a region to subcluster (if supported) or cluster boundaries
  */
 void coroutine_fn GRAPH_RDLOCK
-bdrv_round_to_clusters(BlockDriverState *bs, int64_t offset, int64_t bytes,
-                       int64_t *cluster_offset, int64_t *cluster_bytes)
+bdrv_round_to_subclusters(BlockDriverState *bs, int64_t offset, int64_t bytes,
+                          int64_t *align_offset, int64_t *align_bytes)
 {
     BlockDriverInfo bdi;
     IO_CODE();
-    if (bdrv_co_get_info(bs, &bdi) < 0 || bdi.cluster_size == 0) {
-        *cluster_offset = offset;
-        *cluster_bytes = bytes;
+    if (bdrv_co_get_info(bs, &bdi) < 0 || bdi.subcluster_size == 0) {
+        *align_offset = offset;
+        *align_bytes = bytes;
     } else {
-        int64_t c = bdi.cluster_size;
-        *cluster_offset = QEMU_ALIGN_DOWN(offset, c);
-        *cluster_bytes = QEMU_ALIGN_UP(offset - *cluster_offset + bytes, c);
+        int64_t c = bdi.subcluster_size;
+        *align_offset = QEMU_ALIGN_DOWN(offset, c);
+        *align_bytes = QEMU_ALIGN_UP(offset - *align_offset + bytes, c);
     }
 }
 
@@ -XXX,XX +XXX,XX @@ bdrv_co_do_copy_on_readv(BdrvChild *child, int64_t offset, int64_t bytes,
     void *bounce_buffer = NULL;
 
     BlockDriver *drv = bs->drv;
-    int64_t cluster_offset;
-    int64_t cluster_bytes;
+    int64_t align_offset;
+    int64_t align_bytes;
     int64_t skip_bytes;
     int ret;
     int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
@@ -XXX,XX +XXX,XX @@ bdrv_co_do_copy_on_readv(BdrvChild *child, int64_t offset, int64_t bytes,
      * BDRV_REQUEST_MAX_BYTES (even when the original read did not), which
      * is one reason we loop rather than doing it all at once.
      */
-    bdrv_round_to_clusters(bs, offset, bytes, &cluster_offset, &cluster_bytes);
-    skip_bytes = offset - cluster_offset;
+    bdrv_round_to_subclusters(bs, offset, bytes, &align_offset, &align_bytes);
+    skip_bytes = offset - align_offset;
 
     trace_bdrv_co_do_copy_on_readv(bs, offset, bytes,
-                                   cluster_offset, cluster_bytes);
+                                   align_offset, align_bytes);
 
-    while (cluster_bytes) {
+    while (align_bytes) {
         int64_t pnum;
 
         if (skip_write) {
             ret = 1; /* "already allocated", so nothing will be copied */
-            pnum = MIN(cluster_bytes, max_transfer);
+            pnum = MIN(align_bytes, max_transfer);
         } else {
-            ret = bdrv_is_allocated(bs, cluster_offset,
-                                    MIN(cluster_bytes, max_transfer), &pnum);
+            ret = bdrv_is_allocated(bs, align_offset,
+                                    MIN(align_bytes, max_transfer), &pnum);
             if (ret < 0) {
                 /*
                  * Safe to treat errors in querying allocation as if
                  * unallocated; we'll probably fail again soon on the
                  * read, but at least that will set a decent errno.
                  */
-                pnum = MIN(cluster_bytes, max_transfer);
+                pnum = MIN(align_bytes, max_transfer);
             }
 
             /* Stop at EOF if the image ends in the middle of the cluster */
@@ -XXX,XX +XXX,XX @@ bdrv_co_do_copy_on_readv(BdrvChild *child, int64_t offset, int64_t bytes,
             /* Must copy-on-read; use the bounce buffer */
             pnum = MIN(pnum, MAX_BOUNCE_BUFFER);
             if (!bounce_buffer) {
-                int64_t max_we_need = MAX(pnum, cluster_bytes - pnum);
+                int64_t max_we_need = MAX(pnum, align_bytes - pnum);
                 int64_t max_allowed = MIN(max_transfer, MAX_BOUNCE_BUFFER);
                 int64_t bounce_buffer_len = MIN(max_we_need, max_allowed);
 
@@ -XXX,XX +XXX,XX @@ bdrv_co_do_copy_on_readv(BdrvChild *child, int64_t offset, int64_t bytes,
             }
             qemu_iovec_init_buf(&local_qiov, bounce_buffer, pnum);
 
-            ret = bdrv_driver_preadv(bs, cluster_offset, pnum,
+            ret = bdrv_driver_preadv(bs, align_offset, pnum,
                                      &local_qiov, 0, 0);
             if (ret < 0) {
                 goto err;
@@ -XXX,XX +XXX,XX @@ bdrv_co_do_copy_on_readv(BdrvChild *child, int64_t offset, int64_t bytes,
                 /* FIXME: Should we (perhaps conditionally) be setting
                  * BDRV_REQ_MAY_UNMAP, if it will allow for a sparser copy
                  * that still correctly reads as zero? */
-                ret = bdrv_co_do_pwrite_zeroes(bs, cluster_offset, pnum,
+                ret = bdrv_co_do_pwrite_zeroes(bs, align_offset, pnum,
                                                BDRV_REQ_WRITE_UNCHANGED);
             } else {
                 /* This does not change the data on the disk, it is not
                  * necessary to flush even in cache=writethrough mode.
                  */
-                ret = bdrv_driver_pwritev(bs, cluster_offset, pnum,
+                ret = bdrv_driver_pwritev(bs, align_offset, pnum,
                                           &local_qiov, 0,
                                           BDRV_REQ_WRITE_UNCHANGED);
             }
@@ -XXX,XX +XXX,XX @@ bdrv_co_do_copy_on_readv(BdrvChild *child, int64_t offset, int64_t bytes,
             }
         }
 
-        cluster_offset += pnum;
-        cluster_bytes -= pnum;
+        align_offset += pnum;
+        align_bytes -= pnum;
         progress += pnum - skip_bytes;
         skip_bytes = 0;
     }
diff --git a/block/mirror.c b/block/mirror.c
index XXXXXXX..XXXXXXX 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn mirror_cow_align(MirrorBlockJob *s, int64_t *offset,
     need_cow |= !test_bit((*offset + *bytes - 1) / s->granularity,
                           s->cow_bitmap);
     if (need_cow) {
-        bdrv_round_to_clusters(blk_bs(s->target), *offset, *bytes,
-                               &align_offset, &align_bytes);
+        bdrv_round_to_subclusters(blk_bs(s->target), *offset, *bytes,
+                                  &align_offset, &align_bytes);
     }
 
     if (align_bytes > max_bytes) {
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn mirror_iteration(MirrorBlockJob *s)
             int64_t target_offset;
             int64_t target_bytes;
             WITH_GRAPH_RDLOCK_GUARD() {
-                bdrv_round_to_clusters(blk_bs(s->target), offset, io_bytes,
-                                       &target_offset, &target_bytes);
+                bdrv_round_to_subclusters(blk_bs(s->target), offset, io_bytes,
+                                          &target_offset, &target_bytes);
             }
             if (target_offset == offset &&
                 target_bytes == io_bytes) {
-- 
2.41.0

From: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>

Add a test case which checks that allocations during copy-on-read are
performed on a subcluster basis when subclusters are enabled in the
target image.

This test case also triggers the following assertion failure when the
previous commit is not applied, so we check that as well:

qemu-io: ../block/io.c:1236: bdrv_co_do_copy_on_readv: Assertion `skip_bytes < pnum' failed.

Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20230711172553.234055-4-andrey.drobyshev@virtuozzo.com>
---
 tests/qemu-iotests/197     | 29 +++++++++++++++++++++++++++++
 tests/qemu-iotests/197.out | 24 ++++++++++++++++++++++++
 2 files changed, 53 insertions(+)

diff --git a/tests/qemu-iotests/197 b/tests/qemu-iotests/197
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/197
+++ b/tests/qemu-iotests/197
@@ -XXX,XX +XXX,XX @@ $QEMU_IO -f qcow2 -C -c 'read 0 1024' "$TEST_WRAP" | _filter_qemu_io
 $QEMU_IO -f qcow2 -c map "$TEST_WRAP"
 _check_test_img
 
+echo
+echo '=== Copy-on-read with subclusters ==='
+echo
+
+# Create base and top images 64K (1 cluster) each.  Make subclusters enabled
+# for the top image
+_make_test_img 64K
+IMGPROTO=file IMGFMT=qcow2 TEST_IMG_FILE="$TEST_WRAP" \
+    _make_test_img --no-opts -o extended_l2=true -F "$IMGFMT" -b "$TEST_IMG" \
+    64K | _filter_img_create
+
+$QEMU_IO -c "write -P 0xaa 0 64k" "$TEST_IMG" | _filter_qemu_io
+
+# Allocate individual subclusters in the top image, and not the whole cluster
+$QEMU_IO -c "write -P 0xbb 28K 2K" -c "write -P 0xcc 34K 2K" "$TEST_WRAP" \
+    | _filter_qemu_io
+
+# Only 2 subclusters should be allocated in the top image at this point
+$QEMU_IMG map "$TEST_WRAP" | _filter_qemu_img_map
+
+# Actual copy-on-read operation
+$QEMU_IO -C -c "read -P 0xaa 30K 4K" "$TEST_WRAP" | _filter_qemu_io
+
+# And here we should have 4 subclusters allocated right in the middle of the
+# top image. Make sure the whole cluster remains unallocated
+$QEMU_IMG map "$TEST_WRAP" | _filter_qemu_img_map
+
+_check_test_img
+
 # success, all done
 echo '*** done'
 status=0
diff --git a/tests/qemu-iotests/197.out b/tests/qemu-iotests/197.out
index XXXXXXX..XXXXXXX 100644
--- a/tests/qemu-iotests/197.out
+++ b/tests/qemu-iotests/197.out
@@ -XXX,XX +XXX,XX @@ read 1024/1024 bytes at offset 0
 1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 1 KiB (0x400) bytes     allocated at offset 0 bytes (0x0)
 No errors were found on the image.
+
+=== Copy-on-read with subclusters ===
+
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=65536
+Formatting 'TEST_DIR/t.wrap.IMGFMT', fmt=IMGFMT size=65536 backing_file=TEST_DIR/t.IMGFMT backing_fmt=IMGFMT
+wrote 65536/65536 bytes at offset 0
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 2048/2048 bytes at offset 28672
+2 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 2048/2048 bytes at offset 34816
+2 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Offset          Length          File
+0               0x7000          TEST_DIR/t.IMGFMT
+0x7000          0x800           TEST_DIR/t.wrap.IMGFMT
+0x7800          0x1000          TEST_DIR/t.IMGFMT
+0x8800          0x800           TEST_DIR/t.wrap.IMGFMT
+0x9000          0x7000          TEST_DIR/t.IMGFMT
+read 4096/4096 bytes at offset 30720
+4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Offset          Length          File
+0               0x7000          TEST_DIR/t.IMGFMT
+0x7000          0x2000          TEST_DIR/t.wrap.IMGFMT
+0x9000          0x7000          TEST_DIR/t.IMGFMT
+No errors were found on the image.
 *** done
-- 
2.41.0

liburing does not clear sqe->user_data. We must do it ourselves to avoid
undefined behavior in process_cqe() when user_data is used.
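
The hazard can be modelled without liburing.  In the sketch below,
ToyRing, ToySqe and the toy_* functions are hypothetical; the only
properties assumed are that submission slots are reused without being
cleared, that the prepare helper does not touch user_data, and that the
completion side treats NULL user_data as "no handler attached".

```c
#include <assert.h>
#include <stddef.h>

#define TOY_RING_SIZE 2

typedef struct ToySqe {
    int opcode;
    void *user_data;
} ToySqe;

typedef struct ToyRing {
    ToySqe slots[TOY_RING_SIZE];
    unsigned next;
} ToyRing;

/* Hand out the next slot without clearing it; old contents remain. */
static ToySqe *toy_get_sqe(ToyRing *ring)
{
    return &ring->slots[ring->next++ % TOY_RING_SIZE];
}

/* Prepare helper: sets the opcode but leaves user_data untouched. */
static void toy_prep(ToySqe *sqe, int opcode)
{
    sqe->opcode = opcode;
}

static void toy_set_data(ToySqe *sqe, void *data)
{
    sqe->user_data = data;
}

/* Completion side: a non-NULL user_data is taken to be a handler. */
static int toy_has_handler(const ToySqe *sqe)
{
    return sqe->user_data != NULL;
}
```

When a slot that once carried a handler pointer is reused for a request
without one, the stale pointer survives unless user_data is cleared
explicitly after preparing the request.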

Note that fdmon-io_uring is currently disabled, so this is a latent bug
that does not affect users. Let's merge this fix now to make it easier
to enable fdmon-io_uring in the future (and I'm working on that).

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20230426212639.82310-1-stefanha@redhat.com>
---
 util/fdmon-io_uring.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
index XXXXXXX..XXXXXXX 100644
--- a/util/fdmon-io_uring.c
+++ b/util/fdmon-io_uring.c
@@ -XXX,XX +XXX,XX @@ static void add_poll_remove_sqe(AioContext *ctx, AioHandler *node)
 #else
     io_uring_prep_poll_remove(sqe, node);
 #endif
+    io_uring_sqe_set_data(sqe, NULL);
 }
 
 /* Add a timeout that self-cancels when another cqe becomes ready */
@@ -XXX,XX +XXX,XX @@ static void add_timeout_sqe(AioContext *ctx, int64_t ns)
 
     sqe = get_sqe(ctx);
     io_uring_prep_timeout(sqe, &ts, 1, 0);
+    io_uring_sqe_set_data(sqe, NULL);
 }
 
 /* Add sqes from ctx->submit_list for submission */
-- 
2.41.0