1 | The following changes since commit 22dbfdecc3c52228d3489da3fe81da92b21197bf: | 1 | The following changes since commit ac5f7bf8e208cd7893dbb1a9520559e569a4677c: |
---|---|---|---|
2 | 2 | ||
3 | Merge remote-tracking branch 'remotes/awilliam/tags/vfio-update-20191010.0' into staging (2019-10-14 15:09:08 +0100) | 3 | Merge tag 'migration-20230424-pull-request' of https://gitlab.com/juan.quintela/qemu into staging (2023-04-24 15:00:39 +0100) |
4 | 4 | ||
5 | are available in the Git repository at: | 5 | are available in the Git repository at: |
6 | 6 | ||
7 | git://repo.or.cz/qemu/kevin.git tags/for-upstream | 7 | https://repo.or.cz/qemu/kevin.git tags/for-upstream |
8 | 8 | ||
9 | for you to fetch changes up to a1406a9262a087d9ec9627b88da13c4590b61dae: | 9 | for you to fetch changes up to 8c1e8fb2e7fc2cbeb57703e143965a4cd3ad301a: |
10 | 10 | ||
11 | iotests: Test large write request to qcow2 file (2019-10-14 17:12:48 +0200) | 11 | block/monitor: Fix crash when executing HMP commit (2023-04-25 15:11:57 +0200) |
12 | 12 | ||
13 | ---------------------------------------------------------------- | 13 | ---------------------------------------------------------------- |
14 | Block layer patches: | 14 | Block layer patches |
15 | 15 | ||
16 | - block: Fix crash with qcow2 partial cluster COW with small cluster | 16 | - Protect BlockBackend.queued_requests with its own lock |
17 | sizes (misaligned write requests with BDRV_REQ_NO_FALLBACK) | 17 | - Switch to AIO_WAIT_WHILE_UNLOCKED() where possible |
18 | - qcow2: Fix integer overflow potentially causing corruption with huge | 18 | - AioContext removal: LinuxAioState/LuringState/ThreadPool |
19 | requests | 19 | - Add more coroutine_fn annotations, use bdrv/blk_co_* |
20 | - vhdx: Detect truncated image files | 20 | - Fix crash when execute hmp_commit |
21 | - tools: Support help options for --object | ||
22 | - Various block-related replay improvements | ||
23 | - iotests/028: Fix for long $TEST_DIRs | ||
24 | 21 | ||
25 | ---------------------------------------------------------------- | 22 | ---------------------------------------------------------------- |
26 | Alberto Garcia (1): | 23 | Emanuele Giuseppe Esposito (4): |
27 | block: Reject misaligned write requests with BDRV_REQ_NO_FALLBACK | 24 | linux-aio: use LinuxAioState from the running thread |
25 | io_uring: use LuringState from the running thread | ||
26 | thread-pool: use ThreadPool from the running thread | ||
27 | thread-pool: avoid passing the pool parameter every time | ||
28 | 28 | ||
29 | Kevin Wolf (4): | 29 | Paolo Bonzini (9): |
30 | vl: Split off user_creatable_print_help() | 30 | vvfat: mark various functions as coroutine_fn |
31 | qemu-io: Support help options for --object | 31 | blkdebug: add missing coroutine_fn annotation |
32 | qemu-img: Support help options for --object | 32 | mirror: make mirror_flush a coroutine_fn, do not use co_wrappers |
33 | qemu-nbd: Support help options for --object | 33 | nbd: mark more coroutine_fns, do not use co_wrappers |
34 | 9pfs: mark more coroutine_fns | ||
35 | qemu-pr-helper: mark more coroutine_fns | ||
36 | tests: mark more coroutine_fns | ||
37 | qcow2: mark various functions as coroutine_fn and GRAPH_RDLOCK | ||
38 | vmdk: make vmdk_is_cid_valid a coroutine_fn | ||
34 | 39 | ||
35 | Max Reitz (3): | 40 | Stefan Hajnoczi (10): |
36 | iotests/028: Fix for long $TEST_DIRs | 41 | block: make BlockBackend->quiesce_counter atomic |
37 | qcow2: Limit total allocation range to INT_MAX | 42 | block: make BlockBackend->disable_request_queuing atomic |
38 | iotests: Test large write request to qcow2 file | 43 | block: protect BlockBackend->queued_requests with a lock |
44 | block: don't acquire AioContext lock in bdrv_drain_all() | ||
45 | block: convert blk_exp_close_all_type() to AIO_WAIT_WHILE_UNLOCKED() | ||
46 | block: convert bdrv_graph_wrlock() to AIO_WAIT_WHILE_UNLOCKED() | ||
47 | block: convert bdrv_drain_all_begin() to AIO_WAIT_WHILE_UNLOCKED() | ||
48 | hmp: convert handle_hmp_command() to AIO_WAIT_WHILE_UNLOCKED() | ||
49 | monitor: convert monitor_cleanup() to AIO_WAIT_WHILE_UNLOCKED() | ||
50 | block: add missing coroutine_fn to bdrv_sum_allocated_file_size() | ||
39 | 51 | ||
40 | Pavel Dovgaluk (6): | 52 | Wang Liang (1): |
41 | block: implement bdrv_snapshot_goto for blkreplay | 53 | block/monitor: Fix crash when executing HMP commit |
42 | replay: disable default snapshot for record/replay | ||
43 | replay: update docs for record/replay with block devices | ||
44 | replay: don't drain/flush bdrv queue while RR is working | ||
45 | replay: finish record/replay before closing the disks | ||
46 | replay: add BH oneshot event for block layer | ||
47 | 54 | ||
48 | Peter Lieven (1): | 55 | Wilfred Mallawa (1): |
49 | block/vhdx: add check for truncated image files | 56 | include/block: fixup typos |
50 | 57 | ||
51 | docs/replay.txt | 12 +++- | 58 | block/qcow2.h | 15 +++++----- |
52 | include/qom/object_interfaces.h | 12 ++++ | 59 | hw/9pfs/9p.h | 4 +-- |
53 | include/sysemu/replay.h | 4 ++ | 60 | include/block/aio-wait.h | 2 +- |
54 | replay/replay-internal.h | 1 + | 61 | include/block/aio.h | 8 ------ |
55 | block/blkreplay.c | 8 +++ | 62 | include/block/block_int-common.h | 2 +- |
56 | block/block-backend.c | 9 ++- | 63 | include/block/raw-aio.h | 33 +++++++++++++++------- |
57 | block/io.c | 39 ++++++++++++- | 64 | include/block/thread-pool.h | 15 ++++++---- |
58 | block/iscsi.c | 5 +- | 65 | include/sysemu/block-backend-io.h | 5 ++++ |
59 | block/nfs.c | 6 +- | 66 | backends/tpm/tpm_backend.c | 4 +-- |
60 | block/null.c | 4 +- | 67 | block.c | 2 +- |
61 | block/nvme.c | 6 +- | 68 | block/blkdebug.c | 4 +-- |
62 | block/qcow2-cluster.c | 5 +- | 69 | block/block-backend.c | 45 ++++++++++++++++++------------ |
63 | block/rbd.c | 5 +- | 70 | block/export/export.c | 2 +- |
64 | block/vhdx.c | 120 ++++++++++++++++++++++++++++++++++------ | 71 | block/file-posix.c | 45 ++++++++++++------------------ |
65 | block/vxhs.c | 5 +- | 72 | block/file-win32.c | 4 +-- |
66 | cpus.c | 2 - | 73 | block/graph-lock.c | 2 +- |
67 | qemu-img.c | 34 +++++++----- | 74 | block/io.c | 2 +- |
68 | qemu-io.c | 9 ++- | 75 | block/io_uring.c | 23 ++++++++++------ |
69 | qemu-nbd.c | 9 ++- | 76 | block/linux-aio.c | 29 ++++++++++++-------- |
70 | qom/object_interfaces.c | 61 ++++++++++++++++++++ | 77 | block/mirror.c | 4 +-- |
71 | replay/replay-events.c | 16 ++++++ | 78 | block/monitor/block-hmp-cmds.c | 10 ++++--- |
72 | replay/replay.c | 2 + | 79 | block/qcow2-bitmap.c | 2 +- |
73 | stubs/replay-user.c | 9 +++ | 80 | block/qcow2-cluster.c | 21 ++++++++------ |
74 | vl.c | 63 ++++----------------- | 81 | block/qcow2-refcount.c | 8 +++--- |
75 | stubs/Makefile.objs | 1 + | 82 | block/qcow2-snapshot.c | 25 +++++++++-------- |
76 | tests/qemu-iotests/028 | 11 +++- | 83 | block/qcow2-threads.c | 3 +- |
77 | tests/qemu-iotests/028.out | 1 - | 84 | block/qcow2.c | 27 +++++++++--------- |
78 | tests/qemu-iotests/268 | 55 ++++++++++++++++++ | 85 | block/vmdk.c | 2 +- |
79 | tests/qemu-iotests/268.out | 7 +++ | 86 | block/vvfat.c | 58 ++++++++++++++++++++------------------- |
80 | tests/qemu-iotests/270 | 83 +++++++++++++++++++++++++++ | 87 | hw/9pfs/codir.c | 6 ++-- |
81 | tests/qemu-iotests/270.out | 9 +++ | 88 | hw/9pfs/coth.c | 3 +- |
82 | tests/qemu-iotests/group | 2 + | 89 | hw/ppc/spapr_nvdimm.c | 6 ++-- |
83 | 32 files changed, 504 insertions(+), 111 deletions(-) | 90 | hw/virtio/virtio-pmem.c | 3 +- |
84 | create mode 100644 stubs/replay-user.c | 91 | monitor/hmp.c | 2 +- |
85 | create mode 100755 tests/qemu-iotests/268 | 92 | monitor/monitor.c | 4 +-- |
86 | create mode 100644 tests/qemu-iotests/268.out | 93 | nbd/server.c | 48 ++++++++++++++++---------------- |
87 | create mode 100755 tests/qemu-iotests/270 | 94 | scsi/pr-manager.c | 3 +- |
88 | create mode 100644 tests/qemu-iotests/270.out | 95 | scsi/qemu-pr-helper.c | 25 ++++++++--------- |
89 | 96 | tests/unit/test-thread-pool.c | 14 ++++------ | |
97 | util/thread-pool.c | 25 ++++++++--------- | ||
98 | 40 files changed, 283 insertions(+), 262 deletions(-) | diff view generated by jsdifflib |
1 | From: Pavel Dovgalyuk <Pavel.Dovgaluk@ispras.ru> | 1 | From: Stefan Hajnoczi <stefanha@redhat.com> |
---|---|---|---|
2 | 2 | ||
3 | Replay is capable of recording normal BH events, but sometimes | 3 | The main loop thread increments/decrements BlockBackend->quiesce_counter |
4 | there are single use callbacks scheduled with aio_bh_schedule_oneshot | 4 | when drained sections begin/end. The counter is read in the I/O code |
5 | function. This patch enables recording and replaying such callbacks. | 5 | path. Therefore this field is used to communicate between threads |
6 | Block layer uses these events for calling the completion function. | 6 | without a lock. |
7 | Replaying these calls makes the execution deterministic. | ||
8 | 7 | ||
9 | Signed-off-by: Pavel Dovgalyuk <Pavel.Dovgaluk@ispras.ru> | 8 | Acquire/release are not necessary because the BlockBackend->in_flight |
10 | Acked-by: Kevin Wolf <kwolf@redhat.com> | 9 | counter already uses sequentially consistent accesses and running I/O |
10 | requests hold that counter when blk_wait_while_drained() is called. | ||
11 | qatomic_read() can be used. | ||
12 | |||
13 | Use qatomic_fetch_inc()/qatomic_fetch_dec() for modifications even | ||
14 | though sequentially consistent atomic accesses are not strictly required | ||
15 | here. They are, however, nicer to read than multiple calls to | ||
16 | qatomic_read() and qatomic_set(). Since beginning and ending drain is | ||
17 | not a hot path the extra cost doesn't matter. | ||
18 | |||
19 | Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> | ||
20 | Message-Id: <20230307210427.269214-2-stefanha@redhat.com> | ||
21 | Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> | ||
22 | Reviewed-by: Kevin Wolf <kwolf@redhat.com> | ||
11 | Signed-off-by: Kevin Wolf <kwolf@redhat.com> | 23 | Signed-off-by: Kevin Wolf <kwolf@redhat.com> |
12 | --- | 24 | --- |
13 | include/sysemu/replay.h | 4 ++++ | 25 | block/block-backend.c | 14 +++++++------- |
14 | replay/replay-internal.h | 1 + | 26 | 1 file changed, 7 insertions(+), 7 deletions(-) |
15 | block/block-backend.c | 9 ++++++--- | ||
16 | block/io.c | 4 ++-- | ||
17 | block/iscsi.c | 5 +++-- | ||
18 | block/nfs.c | 6 ++++-- | ||
19 | block/null.c | 4 +++- | ||
20 | block/nvme.c | 6 ++++-- | ||
21 | block/rbd.c | 5 +++-- | ||
22 | block/vxhs.c | 5 +++-- | ||
23 | replay/replay-events.c | 16 ++++++++++++++++ | ||
24 | stubs/replay-user.c | 9 +++++++++ | ||
25 | stubs/Makefile.objs | 1 + | ||
26 | 13 files changed, 59 insertions(+), 16 deletions(-) | ||
27 | create mode 100644 stubs/replay-user.c | ||
28 | 27 | ||
29 | diff --git a/include/sysemu/replay.h b/include/sysemu/replay.h | ||
30 | index XXXXXXX..XXXXXXX 100644 | ||
31 | --- a/include/sysemu/replay.h | ||
32 | +++ b/include/sysemu/replay.h | ||
33 | @@ -XXX,XX +XXX,XX @@ | ||
34 | #include "qapi/qapi-types-misc.h" | ||
35 | #include "qapi/qapi-types-run-state.h" | ||
36 | #include "qapi/qapi-types-ui.h" | ||
37 | +#include "block/aio.h" | ||
38 | |||
39 | /* replay clock kinds */ | ||
40 | enum ReplayClockKind { | ||
41 | @@ -XXX,XX +XXX,XX @@ void replay_enable_events(void); | ||
42 | bool replay_events_enabled(void); | ||
43 | /*! Adds bottom half event to the queue */ | ||
44 | void replay_bh_schedule_event(QEMUBH *bh); | ||
45 | +/* Adds oneshot bottom half event to the queue */ | ||
46 | +void replay_bh_schedule_oneshot_event(AioContext *ctx, | ||
47 | + QEMUBHFunc *cb, void *opaque); | ||
48 | /*! Adds input event to the queue */ | ||
49 | void replay_input_event(QemuConsole *src, InputEvent *evt); | ||
50 | /*! Adds input sync event to the queue */ | ||
51 | diff --git a/replay/replay-internal.h b/replay/replay-internal.h | ||
52 | index XXXXXXX..XXXXXXX 100644 | ||
53 | --- a/replay/replay-internal.h | ||
54 | +++ b/replay/replay-internal.h | ||
55 | @@ -XXX,XX +XXX,XX @@ enum ReplayEvents { | ||
56 | |||
57 | enum ReplayAsyncEventKind { | ||
58 | REPLAY_ASYNC_EVENT_BH, | ||
59 | + REPLAY_ASYNC_EVENT_BH_ONESHOT, | ||
60 | REPLAY_ASYNC_EVENT_INPUT, | ||
61 | REPLAY_ASYNC_EVENT_INPUT_SYNC, | ||
62 | REPLAY_ASYNC_EVENT_CHAR_READ, | ||
63 | diff --git a/block/block-backend.c b/block/block-backend.c | 28 | diff --git a/block/block-backend.c b/block/block-backend.c |
64 | index XXXXXXX..XXXXXXX 100644 | 29 | index XXXXXXX..XXXXXXX 100644 |
65 | --- a/block/block-backend.c | 30 | --- a/block/block-backend.c |
66 | +++ b/block/block-backend.c | 31 | +++ b/block/block-backend.c |
67 | @@ -XXX,XX +XXX,XX @@ | 32 | @@ -XXX,XX +XXX,XX @@ struct BlockBackend { |
68 | #include "hw/qdev-core.h" | 33 | NotifierList remove_bs_notifiers, insert_bs_notifiers; |
69 | #include "sysemu/blockdev.h" | 34 | QLIST_HEAD(, BlockBackendAioNotifier) aio_notifiers; |
70 | #include "sysemu/runstate.h" | 35 | |
71 | +#include "sysemu/sysemu.h" | 36 | - int quiesce_counter; |
72 | +#include "sysemu/replay.h" | 37 | + int quiesce_counter; /* atomic: written under BQL, read by other threads */ |
73 | #include "qapi/error.h" | 38 | CoQueue queued_requests; |
74 | #include "qapi/qapi-events-block.h" | 39 | bool disable_request_queuing; |
75 | #include "qemu/id.h" | 40 | |
76 | @@ -XXX,XX +XXX,XX @@ BlockAIOCB *blk_abort_aio_request(BlockBackend *blk, | 41 | @@ -XXX,XX +XXX,XX @@ void blk_set_dev_ops(BlockBackend *blk, const BlockDevOps *ops, |
77 | acb->blk = blk; | 42 | blk->dev_opaque = opaque; |
78 | acb->ret = ret; | 43 | |
79 | 44 | /* Are we currently quiesced? Should we enforce this right now? */ | |
80 | - aio_bh_schedule_oneshot(blk_get_aio_context(blk), error_callback_bh, acb); | 45 | - if (blk->quiesce_counter && ops && ops->drained_begin) { |
81 | + replay_bh_schedule_oneshot_event(blk_get_aio_context(blk), | 46 | + if (qatomic_read(&blk->quiesce_counter) && ops && ops->drained_begin) { |
82 | + error_callback_bh, acb); | 47 | ops->drained_begin(opaque); |
83 | return &acb->common; | ||
84 | } | ||
85 | |||
86 | @@ -XXX,XX +XXX,XX @@ static BlockAIOCB *blk_aio_prwv(BlockBackend *blk, int64_t offset, int bytes, | ||
87 | |||
88 | acb->has_returned = true; | ||
89 | if (acb->rwco.ret != NOT_DONE) { | ||
90 | - aio_bh_schedule_oneshot(blk_get_aio_context(blk), | ||
91 | - blk_aio_complete_bh, acb); | ||
92 | + replay_bh_schedule_oneshot_event(blk_get_aio_context(blk), | ||
93 | + blk_aio_complete_bh, acb); | ||
94 | } | ||
95 | |||
96 | return &acb->common; | ||
97 | diff --git a/block/io.c b/block/io.c | ||
98 | index XXXXXXX..XXXXXXX 100644 | ||
99 | --- a/block/io.c | ||
100 | +++ b/block/io.c | ||
101 | @@ -XXX,XX +XXX,XX @@ static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs, | ||
102 | if (bs) { | ||
103 | bdrv_inc_in_flight(bs); | ||
104 | } | ||
105 | - aio_bh_schedule_oneshot(bdrv_get_aio_context(bs), | ||
106 | - bdrv_co_drain_bh_cb, &data); | ||
107 | + replay_bh_schedule_oneshot_event(bdrv_get_aio_context(bs), | ||
108 | + bdrv_co_drain_bh_cb, &data); | ||
109 | |||
110 | qemu_coroutine_yield(); | ||
111 | /* If we are resumed from some other event (such as an aio completion or a | ||
112 | diff --git a/block/iscsi.c b/block/iscsi.c | ||
113 | index XXXXXXX..XXXXXXX 100644 | ||
114 | --- a/block/iscsi.c | ||
115 | +++ b/block/iscsi.c | ||
116 | @@ -XXX,XX +XXX,XX @@ | ||
117 | #include "qemu/module.h" | ||
118 | #include "qemu/option.h" | ||
119 | #include "qemu/uuid.h" | ||
120 | +#include "sysemu/replay.h" | ||
121 | #include "qapi/error.h" | ||
122 | #include "qapi/qapi-commands-misc.h" | ||
123 | #include "qapi/qmp/qdict.h" | ||
124 | @@ -XXX,XX +XXX,XX @@ iscsi_co_generic_cb(struct iscsi_context *iscsi, int status, | ||
125 | } | ||
126 | |||
127 | if (iTask->co) { | ||
128 | - aio_bh_schedule_oneshot(iTask->iscsilun->aio_context, | ||
129 | - iscsi_co_generic_bh_cb, iTask); | ||
130 | + replay_bh_schedule_oneshot_event(iTask->iscsilun->aio_context, | ||
131 | + iscsi_co_generic_bh_cb, iTask); | ||
132 | } else { | ||
133 | iTask->complete = 1; | ||
134 | } | ||
135 | diff --git a/block/nfs.c b/block/nfs.c | ||
136 | index XXXXXXX..XXXXXXX 100644 | ||
137 | --- a/block/nfs.c | ||
138 | +++ b/block/nfs.c | ||
139 | @@ -XXX,XX +XXX,XX @@ | ||
140 | #include "qemu/option.h" | ||
141 | #include "qemu/uri.h" | ||
142 | #include "qemu/cutils.h" | ||
143 | +#include "sysemu/sysemu.h" | ||
144 | +#include "sysemu/replay.h" | ||
145 | #include "qapi/qapi-visit-block-core.h" | ||
146 | #include "qapi/qmp/qdict.h" | ||
147 | #include "qapi/qmp/qstring.h" | ||
148 | @@ -XXX,XX +XXX,XX @@ nfs_co_generic_cb(int ret, struct nfs_context *nfs, void *data, | ||
149 | if (task->ret < 0) { | ||
150 | error_report("NFS Error: %s", nfs_get_error(nfs)); | ||
151 | } | ||
152 | - aio_bh_schedule_oneshot(task->client->aio_context, | ||
153 | - nfs_co_generic_bh_cb, task); | ||
154 | + replay_bh_schedule_oneshot_event(task->client->aio_context, | ||
155 | + nfs_co_generic_bh_cb, task); | ||
156 | } | ||
157 | |||
158 | static int coroutine_fn nfs_co_preadv(BlockDriverState *bs, uint64_t offset, | ||
159 | diff --git a/block/null.c b/block/null.c | ||
160 | index XXXXXXX..XXXXXXX 100644 | ||
161 | --- a/block/null.c | ||
162 | +++ b/block/null.c | ||
163 | @@ -XXX,XX +XXX,XX @@ | ||
164 | #include "qemu/module.h" | ||
165 | #include "qemu/option.h" | ||
166 | #include "block/block_int.h" | ||
167 | +#include "sysemu/replay.h" | ||
168 | |||
169 | #define NULL_OPT_LATENCY "latency-ns" | ||
170 | #define NULL_OPT_ZEROES "read-zeroes" | ||
171 | @@ -XXX,XX +XXX,XX @@ static inline BlockAIOCB *null_aio_common(BlockDriverState *bs, | ||
172 | timer_mod_ns(&acb->timer, | ||
173 | qemu_clock_get_ns(QEMU_CLOCK_REALTIME) + s->latency_ns); | ||
174 | } else { | ||
175 | - aio_bh_schedule_oneshot(bdrv_get_aio_context(bs), null_bh_cb, acb); | ||
176 | + replay_bh_schedule_oneshot_event(bdrv_get_aio_context(bs), | ||
177 | + null_bh_cb, acb); | ||
178 | } | ||
179 | return &acb->common; | ||
180 | } | ||
181 | diff --git a/block/nvme.c b/block/nvme.c | ||
182 | index XXXXXXX..XXXXXXX 100644 | ||
183 | --- a/block/nvme.c | ||
184 | +++ b/block/nvme.c | ||
185 | @@ -XXX,XX +XXX,XX @@ | ||
186 | #include "qemu/option.h" | ||
187 | #include "qemu/vfio-helpers.h" | ||
188 | #include "block/block_int.h" | ||
189 | +#include "sysemu/replay.h" | ||
190 | #include "trace.h" | ||
191 | |||
192 | #include "block/nvme.h" | ||
193 | @@ -XXX,XX +XXX,XX @@ static bool nvme_process_completion(BDRVNVMeState *s, NVMeQueuePair *q) | ||
194 | smp_mb_release(); | ||
195 | *q->cq.doorbell = cpu_to_le32(q->cq.head); | ||
196 | if (!qemu_co_queue_empty(&q->free_req_queue)) { | ||
197 | - aio_bh_schedule_oneshot(s->aio_context, nvme_free_req_queue_cb, q); | ||
198 | + replay_bh_schedule_oneshot_event(s->aio_context, | ||
199 | + nvme_free_req_queue_cb, q); | ||
200 | } | ||
201 | } | ||
202 | q->busy = false; | ||
203 | @@ -XXX,XX +XXX,XX @@ static void nvme_rw_cb(void *opaque, int ret) | ||
204 | /* The rw coroutine hasn't yielded, don't try to enter. */ | ||
205 | return; | ||
206 | } | ||
207 | - aio_bh_schedule_oneshot(data->ctx, nvme_rw_cb_bh, data); | ||
208 | + replay_bh_schedule_oneshot_event(data->ctx, nvme_rw_cb_bh, data); | ||
209 | } | ||
210 | |||
211 | static coroutine_fn int nvme_co_prw_aligned(BlockDriverState *bs, | ||
212 | diff --git a/block/rbd.c b/block/rbd.c | ||
213 | index XXXXXXX..XXXXXXX 100644 | ||
214 | --- a/block/rbd.c | ||
215 | +++ b/block/rbd.c | ||
216 | @@ -XXX,XX +XXX,XX @@ | ||
217 | #include "block/qdict.h" | ||
218 | #include "crypto/secret.h" | ||
219 | #include "qemu/cutils.h" | ||
220 | +#include "sysemu/replay.h" | ||
221 | #include "qapi/qmp/qstring.h" | ||
222 | #include "qapi/qmp/qdict.h" | ||
223 | #include "qapi/qmp/qjson.h" | ||
224 | @@ -XXX,XX +XXX,XX @@ static void rbd_finish_aiocb(rbd_completion_t c, RADOSCB *rcb) | ||
225 | rcb->ret = rbd_aio_get_return_value(c); | ||
226 | rbd_aio_release(c); | ||
227 | |||
228 | - aio_bh_schedule_oneshot(bdrv_get_aio_context(acb->common.bs), | ||
229 | - rbd_finish_bh, rcb); | ||
230 | + replay_bh_schedule_oneshot_event(bdrv_get_aio_context(acb->common.bs), | ||
231 | + rbd_finish_bh, rcb); | ||
232 | } | ||
233 | |||
234 | static int rbd_aio_discard_wrapper(rbd_image_t image, | ||
235 | diff --git a/block/vxhs.c b/block/vxhs.c | ||
236 | index XXXXXXX..XXXXXXX 100644 | ||
237 | --- a/block/vxhs.c | ||
238 | +++ b/block/vxhs.c | ||
239 | @@ -XXX,XX +XXX,XX @@ | ||
240 | #include "qapi/error.h" | ||
241 | #include "qemu/uuid.h" | ||
242 | #include "crypto/tlscredsx509.h" | ||
243 | +#include "sysemu/replay.h" | ||
244 | |||
245 | #define VXHS_OPT_FILENAME "filename" | ||
246 | #define VXHS_OPT_VDISK_ID "vdisk-id" | ||
247 | @@ -XXX,XX +XXX,XX @@ static void vxhs_iio_callback(void *ctx, uint32_t opcode, uint32_t error) | ||
248 | trace_vxhs_iio_callback(error); | ||
249 | } | ||
250 | |||
251 | - aio_bh_schedule_oneshot(bdrv_get_aio_context(acb->common.bs), | ||
252 | - vxhs_complete_aio_bh, acb); | ||
253 | + replay_bh_schedule_oneshot_event(bdrv_get_aio_context(acb->common.bs), | ||
254 | + vxhs_complete_aio_bh, acb); | ||
255 | break; | ||
256 | |||
257 | default: | ||
258 | diff --git a/replay/replay-events.c b/replay/replay-events.c | ||
259 | index XXXXXXX..XXXXXXX 100644 | ||
260 | --- a/replay/replay-events.c | ||
261 | +++ b/replay/replay-events.c | ||
262 | @@ -XXX,XX +XXX,XX @@ static void replay_run_event(Event *event) | ||
263 | case REPLAY_ASYNC_EVENT_BH: | ||
264 | aio_bh_call(event->opaque); | ||
265 | break; | ||
266 | + case REPLAY_ASYNC_EVENT_BH_ONESHOT: | ||
267 | + ((QEMUBHFunc *)event->opaque)(event->opaque2); | ||
268 | + break; | ||
269 | case REPLAY_ASYNC_EVENT_INPUT: | ||
270 | qemu_input_event_send_impl(NULL, (InputEvent *)event->opaque); | ||
271 | qapi_free_InputEvent((InputEvent *)event->opaque); | ||
272 | @@ -XXX,XX +XXX,XX @@ void replay_bh_schedule_event(QEMUBH *bh) | ||
273 | } | 48 | } |
274 | } | 49 | } |
275 | 50 | @@ -XXX,XX +XXX,XX @@ static void coroutine_fn blk_wait_while_drained(BlockBackend *blk) | |
276 | +void replay_bh_schedule_oneshot_event(AioContext *ctx, | ||
277 | + QEMUBHFunc *cb, void *opaque) | ||
278 | +{ | ||
279 | + if (events_enabled) { | ||
280 | + uint64_t id = replay_get_current_icount(); | ||
281 | + replay_add_event(REPLAY_ASYNC_EVENT_BH_ONESHOT, cb, opaque, id); | ||
282 | + } else { | ||
283 | + aio_bh_schedule_oneshot(ctx, cb, opaque); | ||
284 | + } | ||
285 | +} | ||
286 | + | ||
287 | void replay_add_input_event(struct InputEvent *event) | ||
288 | { | 51 | { |
289 | replay_add_event(REPLAY_ASYNC_EVENT_INPUT, event, NULL, 0); | 52 | assert(blk->in_flight > 0); |
290 | @@ -XXX,XX +XXX,XX @@ static void replay_save_event(Event *event, int checkpoint) | 53 | |
291 | /* save event-specific data */ | 54 | - if (blk->quiesce_counter && !blk->disable_request_queuing) { |
292 | switch (event->event_kind) { | 55 | + if (qatomic_read(&blk->quiesce_counter) && !blk->disable_request_queuing) { |
293 | case REPLAY_ASYNC_EVENT_BH: | 56 | blk_dec_in_flight(blk); |
294 | + case REPLAY_ASYNC_EVENT_BH_ONESHOT: | 57 | qemu_co_queue_wait(&blk->queued_requests, NULL); |
295 | replay_put_qword(event->id); | 58 | blk_inc_in_flight(blk); |
296 | break; | 59 | @@ -XXX,XX +XXX,XX @@ static void blk_root_drained_begin(BdrvChild *child) |
297 | case REPLAY_ASYNC_EVENT_INPUT: | 60 | BlockBackend *blk = child->opaque; |
298 | @@ -XXX,XX +XXX,XX @@ static Event *replay_read_event(int checkpoint) | 61 | ThrottleGroupMember *tgm = &blk->public.throttle_group_member; |
299 | /* Events that has not to be in the queue */ | 62 | |
300 | switch (replay_state.read_event_kind) { | 63 | - if (++blk->quiesce_counter == 1) { |
301 | case REPLAY_ASYNC_EVENT_BH: | 64 | + if (qatomic_fetch_inc(&blk->quiesce_counter) == 0) { |
302 | + case REPLAY_ASYNC_EVENT_BH_ONESHOT: | 65 | if (blk->dev_ops && blk->dev_ops->drained_begin) { |
303 | if (replay_state.read_event_id == -1) { | 66 | blk->dev_ops->drained_begin(blk->dev_opaque); |
304 | replay_state.read_event_id = replay_get_qword(); | ||
305 | } | 67 | } |
306 | diff --git a/stubs/replay-user.c b/stubs/replay-user.c | 68 | @@ -XXX,XX +XXX,XX @@ static bool blk_root_drained_poll(BdrvChild *child) |
307 | new file mode 100644 | 69 | { |
308 | index XXXXXXX..XXXXXXX | 70 | BlockBackend *blk = child->opaque; |
309 | --- /dev/null | 71 | bool busy = false; |
310 | +++ b/stubs/replay-user.c | 72 | - assert(blk->quiesce_counter); |
311 | @@ -XXX,XX +XXX,XX @@ | 73 | + assert(qatomic_read(&blk->quiesce_counter)); |
312 | +#include "qemu/osdep.h" | 74 | |
313 | +#include "sysemu/replay.h" | 75 | if (blk->dev_ops && blk->dev_ops->drained_poll) { |
314 | +#include "sysemu/sysemu.h" | 76 | busy = blk->dev_ops->drained_poll(blk->dev_opaque); |
315 | + | 77 | @@ -XXX,XX +XXX,XX @@ static bool blk_root_drained_poll(BdrvChild *child) |
316 | +void replay_bh_schedule_oneshot_event(AioContext *ctx, | 78 | static void blk_root_drained_end(BdrvChild *child) |
317 | + QEMUBHFunc *cb, void *opaque) | 79 | { |
318 | +{ | 80 | BlockBackend *blk = child->opaque; |
319 | + aio_bh_schedule_oneshot(ctx, cb, opaque); | 81 | - assert(blk->quiesce_counter); |
320 | +} | 82 | + assert(qatomic_read(&blk->quiesce_counter)); |
321 | diff --git a/stubs/Makefile.objs b/stubs/Makefile.objs | 83 | |
322 | index XXXXXXX..XXXXXXX 100644 | 84 | assert(blk->public.throttle_group_member.io_limits_disabled); |
323 | --- a/stubs/Makefile.objs | 85 | qatomic_dec(&blk->public.throttle_group_member.io_limits_disabled); |
324 | +++ b/stubs/Makefile.objs | 86 | |
325 | @@ -XXX,XX +XXX,XX @@ stub-obj-y += monitor.o | 87 | - if (--blk->quiesce_counter == 0) { |
326 | stub-obj-y += notify-event.o | 88 | + if (qatomic_fetch_dec(&blk->quiesce_counter) == 1) { |
327 | stub-obj-y += qtest.o | 89 | if (blk->dev_ops && blk->dev_ops->drained_end) { |
328 | stub-obj-y += replay.o | 90 | blk->dev_ops->drained_end(blk->dev_opaque); |
329 | +stub-obj-y += replay-user.o | 91 | } |
330 | stub-obj-y += runstate-check.o | ||
331 | stub-obj-y += set-fd-handler.o | ||
332 | stub-obj-y += sysbus.o | ||
333 | -- | 92 | -- |
334 | 2.20.1 | 93 | 2.40.0 |
335 | 94 | ||
336 | 95 | diff view generated by jsdifflib |
1 | From: Pavel Dovgalyuk <Pavel.Dovgaluk@ispras.ru> | 1 | From: Stefan Hajnoczi <stefanha@redhat.com> |
---|---|---|---|
2 | 2 | ||
3 | After recent updates block devices cannot be closed on qemu exit. | 3 | This field is accessed by multiple threads without a lock. Use explicit |
4 | This happens due to the block request polling when replay is not finished. | 4 | qatomic_read()/qatomic_set() calls. There is no need for acquire/release |
5 | Therefore now we stop execution recording before closing the block devices. | 5 | because blk_set_disable_request_queuing() doesn't provide any |
6 | guarantees (it helps that it's used at BlockBackend creation time and | ||
7 | not when there is I/O in flight). | ||
6 | 8 | ||
7 | Signed-off-by: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru> | 9 | Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> |
10 | Reviewed-by: Hanna Czenczek <hreitz@redhat.com> | ||
11 | Message-Id: <20230307210427.269214-3-stefanha@redhat.com> | ||
12 | Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> | ||
13 | Reviewed-by: Kevin Wolf <kwolf@redhat.com> | ||
8 | Signed-off-by: Kevin Wolf <kwolf@redhat.com> | 14 | Signed-off-by: Kevin Wolf <kwolf@redhat.com> |
9 | --- | 15 | --- |
10 | replay/replay.c | 2 ++ | 16 | block/block-backend.c | 7 ++++--- |
11 | vl.c | 1 + | 17 | 1 file changed, 4 insertions(+), 3 deletions(-) |
12 | 2 files changed, 3 insertions(+) | ||
13 | 18 | ||
14 | diff --git a/replay/replay.c b/replay/replay.c | 19 | diff --git a/block/block-backend.c b/block/block-backend.c |
15 | index XXXXXXX..XXXXXXX 100644 | 20 | index XXXXXXX..XXXXXXX 100644 |
16 | --- a/replay/replay.c | 21 | --- a/block/block-backend.c |
17 | +++ b/replay/replay.c | 22 | +++ b/block/block-backend.c |
18 | @@ -XXX,XX +XXX,XX @@ void replay_finish(void) | 23 | @@ -XXX,XX +XXX,XX @@ struct BlockBackend { |
19 | g_free(replay_snapshot); | 24 | |
20 | replay_snapshot = NULL; | 25 | int quiesce_counter; /* atomic: written under BQL, read by other threads */ |
21 | 26 | CoQueue queued_requests; | |
22 | + replay_mode = REPLAY_MODE_NONE; | 27 | - bool disable_request_queuing; |
23 | + | 28 | + bool disable_request_queuing; /* atomic */ |
24 | replay_finish_events(); | 29 | |
30 | VMChangeStateEntry *vmsh; | ||
31 | bool force_allow_inactivate; | ||
32 | @@ -XXX,XX +XXX,XX @@ void blk_set_allow_aio_context_change(BlockBackend *blk, bool allow) | ||
33 | void blk_set_disable_request_queuing(BlockBackend *blk, bool disable) | ||
34 | { | ||
35 | IO_CODE(); | ||
36 | - blk->disable_request_queuing = disable; | ||
37 | + qatomic_set(&blk->disable_request_queuing, disable); | ||
25 | } | 38 | } |
26 | 39 | ||
27 | diff --git a/vl.c b/vl.c | 40 | static int coroutine_fn GRAPH_RDLOCK |
28 | index XXXXXXX..XXXXXXX 100644 | 41 | @@ -XXX,XX +XXX,XX @@ static void coroutine_fn blk_wait_while_drained(BlockBackend *blk) |
29 | --- a/vl.c | 42 | { |
30 | +++ b/vl.c | 43 | assert(blk->in_flight > 0); |
31 | @@ -XXX,XX +XXX,XX @@ int main(int argc, char **argv, char **envp) | 44 | |
32 | 45 | - if (qatomic_read(&blk->quiesce_counter) && !blk->disable_request_queuing) { | |
33 | /* No more vcpu or device emulation activity beyond this point */ | 46 | + if (qatomic_read(&blk->quiesce_counter) && |
34 | vm_shutdown(); | 47 | + !qatomic_read(&blk->disable_request_queuing)) { |
35 | + replay_finish(); | 48 | blk_dec_in_flight(blk); |
36 | 49 | qemu_co_queue_wait(&blk->queued_requests, NULL); | |
37 | job_cancel_sync_all(); | 50 | blk_inc_in_flight(blk); |
38 | bdrv_close_all(); | ||
39 | -- | 51 | -- |
40 | 2.20.1 | 52 | 2.40.0 |
41 | 53 | ||
42 | 54 | diff view generated by jsdifflib |
New patch | |||
---|---|---|---|
1 | From: Stefan Hajnoczi <stefanha@redhat.com> | ||
1 | 2 | ||
3 | The CoQueue API offers thread-safety via the lock argument that | ||
4 | qemu_co_queue_wait() and qemu_co_enter_next() take. BlockBackend | ||
5 | currently does not make use of the lock argument. This means that | ||
6 | multiple threads submitting I/O requests can corrupt the CoQueue's | ||
7 | QSIMPLEQ. | ||
8 | |||
9 | Add a QemuMutex and pass it to CoQueue APIs so that the queue is | ||
10 | protected. While we're at it, also assert that the queue is empty when | ||
11 | the BlockBackend is deleted. | ||
12 | |||
13 | Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> | ||
14 | Reviewed-by: Hanna Czenczek <hreitz@redhat.com> | ||
15 | Message-Id: <20230307210427.269214-4-stefanha@redhat.com> | ||
16 | Reviewed-by: Kevin Wolf <kwolf@redhat.com> | ||
17 | Signed-off-by: Kevin Wolf <kwolf@redhat.com> | ||
18 | --- | ||
19 | block/block-backend.c | 18 ++++++++++++++++-- | ||
20 | 1 file changed, 16 insertions(+), 2 deletions(-) | ||
21 | |||
22 | diff --git a/block/block-backend.c b/block/block-backend.c | ||
23 | index XXXXXXX..XXXXXXX 100644 | ||
24 | --- a/block/block-backend.c | ||
25 | +++ b/block/block-backend.c | ||
26 | @@ -XXX,XX +XXX,XX @@ struct BlockBackend { | ||
27 | QLIST_HEAD(, BlockBackendAioNotifier) aio_notifiers; | ||
28 | |||
29 | int quiesce_counter; /* atomic: written under BQL, read by other threads */ | ||
30 | + QemuMutex queued_requests_lock; /* protects queued_requests */ | ||
31 | CoQueue queued_requests; | ||
32 | bool disable_request_queuing; /* atomic */ | ||
33 | |||
34 | @@ -XXX,XX +XXX,XX @@ BlockBackend *blk_new(AioContext *ctx, uint64_t perm, uint64_t shared_perm) | ||
35 | |||
36 | block_acct_init(&blk->stats); | ||
37 | |||
38 | + qemu_mutex_init(&blk->queued_requests_lock); | ||
39 | qemu_co_queue_init(&blk->queued_requests); | ||
40 | notifier_list_init(&blk->remove_bs_notifiers); | ||
41 | notifier_list_init(&blk->insert_bs_notifiers); | ||
42 | @@ -XXX,XX +XXX,XX @@ static void blk_delete(BlockBackend *blk) | ||
43 | assert(QLIST_EMPTY(&blk->remove_bs_notifiers.notifiers)); | ||
44 | assert(QLIST_EMPTY(&blk->insert_bs_notifiers.notifiers)); | ||
45 | assert(QLIST_EMPTY(&blk->aio_notifiers)); | ||
46 | + assert(qemu_co_queue_empty(&blk->queued_requests)); | ||
47 | + qemu_mutex_destroy(&blk->queued_requests_lock); | ||
48 | QTAILQ_REMOVE(&block_backends, blk, link); | ||
49 | drive_info_del(blk->legacy_dinfo); | ||
50 | block_acct_cleanup(&blk->stats); | ||
51 | @@ -XXX,XX +XXX,XX @@ static void coroutine_fn blk_wait_while_drained(BlockBackend *blk) | ||
52 | |||
53 | if (qatomic_read(&blk->quiesce_counter) && | ||
54 | !qatomic_read(&blk->disable_request_queuing)) { | ||
55 | + /* | ||
56 | + * Take lock before decrementing in flight counter so main loop thread | ||
57 | + * waits for us to enqueue ourselves before it can leave the drained | ||
58 | + * section. | ||
59 | + */ | ||
60 | + qemu_mutex_lock(&blk->queued_requests_lock); | ||
61 | blk_dec_in_flight(blk); | ||
62 | - qemu_co_queue_wait(&blk->queued_requests, NULL); | ||
63 | + qemu_co_queue_wait(&blk->queued_requests, &blk->queued_requests_lock); | ||
64 | blk_inc_in_flight(blk); | ||
65 | + qemu_mutex_unlock(&blk->queued_requests_lock); | ||
66 | } | ||
67 | } | ||
68 | |||
69 | @@ -XXX,XX +XXX,XX @@ static void blk_root_drained_end(BdrvChild *child) | ||
70 | if (blk->dev_ops && blk->dev_ops->drained_end) { | ||
71 | blk->dev_ops->drained_end(blk->dev_opaque); | ||
72 | } | ||
73 | - while (qemu_co_enter_next(&blk->queued_requests, NULL)) { | ||
74 | + qemu_mutex_lock(&blk->queued_requests_lock); | ||
75 | + while (qemu_co_enter_next(&blk->queued_requests, | ||
76 | + &blk->queued_requests_lock)) { | ||
77 | /* Resume all queued requests */ | ||
78 | } | ||
79 | + qemu_mutex_unlock(&blk->queued_requests_lock); | ||
80 | } | ||
81 | } | ||
82 | |||
83 | -- | ||
84 | 2.40.0 | diff view generated by jsdifflib |
New patch | |||
---|---|---|---|
1 | From: Stefan Hajnoczi <stefanha@redhat.com> | ||
1 | 2 | ||
3 | There is no need for the AioContext lock in bdrv_drain_all() because | ||
4 | nothing in AIO_WAIT_WHILE() needs the lock and the condition is atomic. | ||
5 | |||
6 | AIO_WAIT_WHILE_UNLOCKED() has no use for the AioContext parameter other | ||
7 | than performing a check that is nowadays already done by the | ||
8 | GLOBAL_STATE_CODE()/IO_CODE() macros. Set the ctx argument to NULL here | ||
9 | to help us keep track of all converted callers. Eventually all callers | ||
10 | will have been converted and then the argument can be dropped entirely. | ||
11 | |||
12 | Reviewed-by: Kevin Wolf <kwolf@redhat.com> | ||
13 | Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> | ||
14 | Message-Id: <20230309190855.414275-2-stefanha@redhat.com> | ||
15 | Reviewed-by: Kevin Wolf <kwolf@redhat.com> | ||
16 | Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com> | ||
17 | Signed-off-by: Kevin Wolf <kwolf@redhat.com> | ||
18 | --- | ||
19 | block/block-backend.c | 8 +------- | ||
20 | 1 file changed, 1 insertion(+), 7 deletions(-) | ||
21 | |||
22 | diff --git a/block/block-backend.c b/block/block-backend.c | ||
23 | index XXXXXXX..XXXXXXX 100644 | ||
24 | --- a/block/block-backend.c | ||
25 | +++ b/block/block-backend.c | ||
26 | @@ -XXX,XX +XXX,XX @@ void blk_drain_all(void) | ||
27 | bdrv_drain_all_begin(); | ||
28 | |||
29 | while ((blk = blk_all_next(blk)) != NULL) { | ||
30 | - AioContext *ctx = blk_get_aio_context(blk); | ||
31 | - | ||
32 | - aio_context_acquire(ctx); | ||
33 | - | ||
34 | /* We may have -ENOMEDIUM completions in flight */ | ||
35 | - AIO_WAIT_WHILE(ctx, qatomic_read(&blk->in_flight) > 0); | ||
36 | - | ||
37 | - aio_context_release(ctx); | ||
38 | + AIO_WAIT_WHILE_UNLOCKED(NULL, qatomic_read(&blk->in_flight) > 0); | ||
39 | } | ||
40 | |||
41 | bdrv_drain_all_end(); | ||
42 | -- | ||
43 | 2.40.0 | diff view generated by jsdifflib |
1 | From: Alberto Garcia <berto@igalia.com> | 1 | From: Stefan Hajnoczi <stefanha@redhat.com> |
---|---|---|---|
2 | 2 | ||
3 | The BDRV_REQ_NO_FALLBACK flag means that an operation should only be | 3 | There is no change in behavior. Switch to AIO_WAIT_WHILE_UNLOCKED() |
4 | performed if it can be offloaded or otherwise performed efficiently. | 4 | instead of AIO_WAIT_WHILE() to document that this code has already been |
5 | audited and converted. The AioContext argument is already NULL so | ||
6 | aio_context_release() is never called anyway. | ||
5 | 7 | ||
6 | However a misaligned write request requires a RMW so we should return | 8 | Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> |
7 | an error and let the caller decide how to proceed. | 9 | Tested-by: Philippe Mathieu-Daudé <philmd@linaro.org> |
8 | 10 | Reviewed-by: Kevin Wolf <kwolf@redhat.com> | |
9 | This hits an assertion since commit c8bb23cbdb if the required | 11 | Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> |
10 | alignment is larger than the cluster size: | 12 | Message-Id: <20230309190855.414275-3-stefanha@redhat.com> |
11 | 13 | Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com> | |
12 | qemu-img create -f qcow2 -o cluster_size=2k img.qcow2 4G | ||
13 | qemu-io -c "open -o driver=qcow2,file.align=4k blkdebug::img.qcow2" \ | ||
14 | -c 'write 0 512' | ||
15 | qemu-io: block/io.c:1127: bdrv_driver_pwritev: Assertion `!(flags & BDRV_REQ_NO_FALLBACK)' failed. | ||
16 | Aborted | ||
17 | |||
18 | The reason is that when writing to an unallocated cluster we try to | ||
19 | skip the copy-on-write part and zeroize it using BDRV_REQ_NO_FALLBACK | ||
20 | instead, resulting in a write request that is too small (2KB cluster | ||
21 | size vs 4KB required alignment). | ||
22 | |||
23 | Signed-off-by: Alberto Garcia <berto@igalia.com> | ||
24 | Signed-off-by: Kevin Wolf <kwolf@redhat.com> | 14 | Signed-off-by: Kevin Wolf <kwolf@redhat.com> |
25 | --- | 15 | --- |
26 | block/io.c | 7 +++++ | 16 | block/export/export.c | 2 +- |
27 | tests/qemu-iotests/268 | 55 ++++++++++++++++++++++++++++++++++++++ | 17 | 1 file changed, 1 insertion(+), 1 deletion(-) |
28 | tests/qemu-iotests/268.out | 7 +++++ | ||
29 | tests/qemu-iotests/group | 1 + | ||
30 | 4 files changed, 70 insertions(+) | ||
31 | create mode 100755 tests/qemu-iotests/268 | ||
32 | create mode 100644 tests/qemu-iotests/268.out | ||
33 | 18 | ||
34 | diff --git a/block/io.c b/block/io.c | 19 | diff --git a/block/export/export.c b/block/export/export.c |
35 | index XXXXXXX..XXXXXXX 100644 | 20 | index XXXXXXX..XXXXXXX 100644 |
36 | --- a/block/io.c | 21 | --- a/block/export/export.c |
37 | +++ b/block/io.c | 22 | +++ b/block/export/export.c |
38 | @@ -XXX,XX +XXX,XX @@ int coroutine_fn bdrv_co_pwritev_part(BdrvChild *child, | 23 | @@ -XXX,XX +XXX,XX @@ void blk_exp_close_all_type(BlockExportType type) |
39 | return ret; | 24 | blk_exp_request_shutdown(exp); |
40 | } | 25 | } |
41 | 26 | ||
42 | + /* If the request is misaligned then we can't make it efficient */ | 27 | - AIO_WAIT_WHILE(NULL, blk_exp_has_type(type)); |
43 | + if ((flags & BDRV_REQ_NO_FALLBACK) && | 28 | + AIO_WAIT_WHILE_UNLOCKED(NULL, blk_exp_has_type(type)); |
44 | + !QEMU_IS_ALIGNED(offset | bytes, align)) | 29 | } |
45 | + { | 30 | |
46 | + return -ENOTSUP; | 31 | void blk_exp_close_all(void) |
47 | + } | ||
48 | + | ||
49 | bdrv_inc_in_flight(bs); | ||
50 | /* | ||
51 | * Align write if necessary by performing a read-modify-write cycle. | ||
52 | diff --git a/tests/qemu-iotests/268 b/tests/qemu-iotests/268 | ||
53 | new file mode 100755 | ||
54 | index XXXXXXX..XXXXXXX | ||
55 | --- /dev/null | ||
56 | +++ b/tests/qemu-iotests/268 | ||
57 | @@ -XXX,XX +XXX,XX @@ | ||
58 | +#!/usr/bin/env bash | ||
59 | +# | ||
60 | +# Test write request with required alignment larger than the cluster size | ||
61 | +# | ||
62 | +# Copyright (C) 2019 Igalia, S.L. | ||
63 | +# Author: Alberto Garcia <berto@igalia.com> | ||
64 | +# | ||
65 | +# This program is free software; you can redistribute it and/or modify | ||
66 | +# it under the terms of the GNU General Public License as published by | ||
67 | +# the Free Software Foundation; either version 2 of the License, or | ||
68 | +# (at your option) any later version. | ||
69 | +# | ||
70 | +# This program is distributed in the hope that it will be useful, | ||
71 | +# but WITHOUT ANY WARRANTY; without even the implied warranty of | ||
72 | +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | ||
73 | +# GNU General Public License for more details. | ||
74 | +# | ||
75 | +# You should have received a copy of the GNU General Public License | ||
76 | +# along with this program. If not, see <http://www.gnu.org/licenses/>. | ||
77 | +# | ||
78 | + | ||
79 | +# creator | ||
80 | +owner=berto@igalia.com | ||
81 | + | ||
82 | +seq=`basename $0` | ||
83 | +echo "QA output created by $seq" | ||
84 | + | ||
85 | +status=1 # failure is the default! | ||
86 | + | ||
87 | +_cleanup() | ||
88 | +{ | ||
89 | + _cleanup_test_img | ||
90 | +} | ||
91 | +trap "_cleanup; exit \$status" 0 1 2 3 15 | ||
92 | + | ||
93 | +# get standard environment, filters and checks | ||
94 | +. ./common.rc | ||
95 | +. ./common.filter | ||
96 | + | ||
97 | +_supported_fmt qcow2 | ||
98 | +_supported_proto file | ||
99 | + | ||
100 | +echo | ||
101 | +echo "== Required alignment larger than cluster size ==" | ||
102 | + | ||
103 | +CLUSTER_SIZE=2k _make_test_img 1M | ||
104 | +# Since commit c8bb23cbdb writing to an unallocated cluster fills the | ||
105 | +# empty COW areas with bdrv_write_zeroes(flags=BDRV_REQ_NO_FALLBACK) | ||
106 | +$QEMU_IO -c "open -o driver=$IMGFMT,file.align=4k blkdebug::$TEST_IMG" \ | ||
107 | + -c "write 0 512" | _filter_qemu_io | ||
108 | + | ||
109 | +# success, all done | ||
110 | +echo "*** done" | ||
111 | +rm -f $seq.full | ||
112 | +status=0 | ||
113 | diff --git a/tests/qemu-iotests/268.out b/tests/qemu-iotests/268.out | ||
114 | new file mode 100644 | ||
115 | index XXXXXXX..XXXXXXX | ||
116 | --- /dev/null | ||
117 | +++ b/tests/qemu-iotests/268.out | ||
118 | @@ -XXX,XX +XXX,XX @@ | ||
119 | +QA output created by 268 | ||
120 | + | ||
121 | +== Required alignment larger than cluster size == | ||
122 | +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576 | ||
123 | +wrote 512/512 bytes at offset 0 | ||
124 | +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) | ||
125 | +*** done | ||
126 | diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group | ||
127 | index XXXXXXX..XXXXXXX 100644 | ||
128 | --- a/tests/qemu-iotests/group | ||
129 | +++ b/tests/qemu-iotests/group | ||
130 | @@ -XXX,XX +XXX,XX @@ | ||
131 | 265 rw auto quick | ||
132 | 266 rw quick | ||
133 | 267 rw auto quick snapshot | ||
134 | +268 rw auto quick | ||
135 | -- | 32 | -- |
136 | 2.20.1 | 33 | 2.40.0 |
137 | 34 | ||
138 | 35 | diff view generated by jsdifflib |
1 | From: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru> | 1 | From: Stefan Hajnoczi <stefanha@redhat.com> |
---|---|---|---|
2 | 2 | ||
3 | This patch disables setting '-snapshot' option on by default | 3 | The following conversion is safe and does not change behavior: |
4 | in record/replay mode. This is needed for creating vmstates in record | ||
5 | and replay modes. | ||
6 | 4 | ||
7 | Signed-off-by: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru> | 5 | GLOBAL_STATE_CODE(); |
8 | Acked-by: Kevin Wolf <kwolf@redhat.com> | 6 | ... |
7 | - AIO_WAIT_WHILE(qemu_get_aio_context(), ...); | ||
8 | + AIO_WAIT_WHILE_UNLOCKED(NULL, ...); | ||
9 | |||
10 | Since we're in GLOBAL_STATE_CODE(), qemu_get_aio_context() is our home | ||
11 | thread's AioContext. Thus AIO_WAIT_WHILE() does not unlock the | ||
12 | AioContext: | ||
13 | |||
14 | if (ctx_ && in_aio_context_home_thread(ctx_)) { \ | ||
15 | while ((cond)) { \ | ||
16 | aio_poll(ctx_, true); \ | ||
17 | waited_ = true; \ | ||
18 | } \ | ||
19 | |||
20 | And that means AIO_WAIT_WHILE_UNLOCKED(NULL, ...) can be substituted. | ||
21 | |||
22 | Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> | ||
23 | Tested-by: Philippe Mathieu-Daudé <philmd@linaro.org> | ||
24 | Reviewed-by: Kevin Wolf <kwolf@redhat.com> | ||
25 | Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> | ||
26 | Message-Id: <20230309190855.414275-4-stefanha@redhat.com> | ||
27 | Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com> | ||
9 | Signed-off-by: Kevin Wolf <kwolf@redhat.com> | 28 | Signed-off-by: Kevin Wolf <kwolf@redhat.com> |
10 | --- | 29 | --- |
11 | vl.c | 10 ++++++++-- | 30 | block/graph-lock.c | 2 +- |
12 | 1 file changed, 8 insertions(+), 2 deletions(-) | 31 | 1 file changed, 1 insertion(+), 1 deletion(-) |
13 | 32 | ||
14 | diff --git a/vl.c b/vl.c | 33 | diff --git a/block/graph-lock.c b/block/graph-lock.c |
15 | index XXXXXXX..XXXXXXX 100644 | 34 | index XXXXXXX..XXXXXXX 100644 |
16 | --- a/vl.c | 35 | --- a/block/graph-lock.c |
17 | +++ b/vl.c | 36 | +++ b/block/graph-lock.c |
18 | @@ -XXX,XX +XXX,XX @@ static void configure_blockdev(BlockdevOptionsQueue *bdo_queue, | 37 | @@ -XXX,XX +XXX,XX @@ void bdrv_graph_wrlock(void) |
19 | qapi_free_BlockdevOptions(bdo->bdo); | 38 | * reader lock. |
20 | g_free(bdo); | 39 | */ |
21 | } | 40 | qatomic_set(&has_writer, 0); |
22 | - if (snapshot || replay_mode != REPLAY_MODE_NONE) { | 41 | - AIO_WAIT_WHILE(qemu_get_aio_context(), reader_count() >= 1); |
23 | + if (snapshot) { | 42 | + AIO_WAIT_WHILE_UNLOCKED(NULL, reader_count() >= 1); |
24 | qemu_opts_foreach(qemu_find_opts("drive"), drive_enable_snapshot, | 43 | qatomic_set(&has_writer, 1); |
25 | NULL, NULL); | 44 | |
26 | } | 45 | /* |
27 | @@ -XXX,XX +XXX,XX @@ int main(int argc, char **argv, char **envp) | ||
28 | drive_add(IF_PFLASH, -1, optarg, PFLASH_OPTS); | ||
29 | break; | ||
30 | case QEMU_OPTION_snapshot: | ||
31 | - snapshot = 1; | ||
32 | + { | ||
33 | + Error *blocker = NULL; | ||
34 | + snapshot = 1; | ||
35 | + error_setg(&blocker, QERR_REPLAY_NOT_SUPPORTED, | ||
36 | + "-snapshot"); | ||
37 | + replay_add_blocker(blocker); | ||
38 | + } | ||
39 | break; | ||
40 | case QEMU_OPTION_numa: | ||
41 | opts = qemu_opts_parse_noisily(qemu_find_opts("numa"), | ||
42 | -- | 46 | -- |
43 | 2.20.1 | 47 | 2.40.0 |
44 | 48 | ||
45 | 49 | diff view generated by jsdifflib |
1 | From: Pavel Dovgalyuk <Pavel.Dovgaluk@ispras.ru> | 1 | From: Stefan Hajnoczi <stefanha@redhat.com> |
---|---|---|---|
2 | 2 | ||
3 | In record/replay mode bdrv queue is controlled by replay mechanism. | 3 | Since the AioContext argument was already NULL, AIO_WAIT_WHILE() was |
4 | It does not allow saving or loading the snapshots | 4 | never going to unlock the AioContext. Therefore it is possible to |
5 | when bdrv queue is not empty. Stopping the VM is not blocked by nonempty | 5 | replace AIO_WAIT_WHILE() with AIO_WAIT_WHILE_UNLOCKED(). |
6 | queue, but flushing the queue is still impossible there, | ||
7 | because it may cause deadlocks in replay mode. | ||
8 | This patch disables bdrv_drain_all and bdrv_flush_all in | ||
9 | record/replay mode. | ||
10 | 6 | ||
11 | Stopping the machine when the IO requests are not finished is needed | 7 | Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> |
12 | for the debugging. E.g., breakpoint may be set at the specified step, | 8 | Tested-by: Philippe Mathieu-Daudé <philmd@linaro.org> |
13 | and forcing the IO requests to finish may break the determinism | 9 | Reviewed-by: Kevin Wolf <kwolf@redhat.com> |
14 | of the execution. | 10 | Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> |
15 | 11 | Message-Id: <20230309190855.414275-5-stefanha@redhat.com> | |
16 | Signed-off-by: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru> | 12 | Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com> |
17 | Acked-by: Kevin Wolf <kwolf@redhat.com> | ||
18 | Signed-off-by: Kevin Wolf <kwolf@redhat.com> | 13 | Signed-off-by: Kevin Wolf <kwolf@redhat.com> |
19 | --- | 14 | --- |
20 | block/io.c | 28 ++++++++++++++++++++++++++++ | 15 | block/io.c | 2 +- |
21 | cpus.c | 2 -- | 16 | 1 file changed, 1 insertion(+), 1 deletion(-) |
22 | 2 files changed, 28 insertions(+), 2 deletions(-) | ||
23 | 17 | ||
24 | diff --git a/block/io.c b/block/io.c | 18 | diff --git a/block/io.c b/block/io.c |
25 | index XXXXXXX..XXXXXXX 100644 | 19 | index XXXXXXX..XXXXXXX 100644 |
26 | --- a/block/io.c | 20 | --- a/block/io.c |
27 | +++ b/block/io.c | 21 | +++ b/block/io.c |
28 | @@ -XXX,XX +XXX,XX @@ | ||
29 | #include "qapi/error.h" | ||
30 | #include "qemu/error-report.h" | ||
31 | #include "qemu/main-loop.h" | ||
32 | +#include "sysemu/replay.h" | ||
33 | |||
34 | #define NOT_DONE 0x7fffffff /* used while emulated sync operation in progress */ | ||
35 | |||
36 | @@ -XXX,XX +XXX,XX @@ void bdrv_drain_all_begin(void) | 22 | @@ -XXX,XX +XXX,XX @@ void bdrv_drain_all_begin(void) |
37 | return; | 23 | bdrv_drain_all_begin_nopoll(); |
38 | } | 24 | |
39 | 25 | /* Now poll the in-flight requests */ | |
40 | + /* | 26 | - AIO_WAIT_WHILE(NULL, bdrv_drain_all_poll()); |
41 | + * bdrv queue is managed by record/replay, | 27 | + AIO_WAIT_WHILE_UNLOCKED(NULL, bdrv_drain_all_poll()); |
42 | + * waiting for finishing the I/O requests may | 28 | |
43 | + * be infinite | ||
44 | + */ | ||
45 | + if (replay_events_enabled()) { | ||
46 | + return; | ||
47 | + } | ||
48 | + | ||
49 | /* AIO_WAIT_WHILE() with a NULL context can only be called from the main | ||
50 | * loop AioContext, so make sure we're in the main context. */ | ||
51 | assert(qemu_get_current_aio_context() == qemu_get_aio_context()); | ||
52 | @@ -XXX,XX +XXX,XX @@ void bdrv_drain_all_end(void) | ||
53 | BlockDriverState *bs = NULL; | ||
54 | int drained_end_counter = 0; | ||
55 | |||
56 | + /* | ||
57 | + * bdrv queue is managed by record/replay, | ||
58 | + * waiting for finishing the I/O requests may | ||
59 | + * be endless | ||
60 | + */ | ||
61 | + if (replay_events_enabled()) { | ||
62 | + return; | ||
63 | + } | ||
64 | + | ||
65 | while ((bs = bdrv_next_all_states(bs))) { | 29 | while ((bs = bdrv_next_all_states(bs))) { |
66 | AioContext *aio_context = bdrv_get_aio_context(bs); | 30 | bdrv_drain_assert_idle(bs); |
67 | |||
68 | @@ -XXX,XX +XXX,XX @@ int bdrv_flush_all(void) | ||
69 | BlockDriverState *bs = NULL; | ||
70 | int result = 0; | ||
71 | |||
72 | + /* | ||
73 | + * bdrv queue is managed by record/replay, | ||
74 | + * creating new flush request for stopping | ||
75 | + * the VM may break the determinism | ||
76 | + */ | ||
77 | + if (replay_events_enabled()) { | ||
78 | + return result; | ||
79 | + } | ||
80 | + | ||
81 | for (bs = bdrv_first(&it); bs; bs = bdrv_next(&it)) { | ||
82 | AioContext *aio_context = bdrv_get_aio_context(bs); | ||
83 | int ret; | ||
84 | diff --git a/cpus.c b/cpus.c | ||
85 | index XXXXXXX..XXXXXXX 100644 | ||
86 | --- a/cpus.c | ||
87 | +++ b/cpus.c | ||
88 | @@ -XXX,XX +XXX,XX @@ static int do_vm_stop(RunState state, bool send_stop) | ||
89 | } | ||
90 | |||
91 | bdrv_drain_all(); | ||
92 | - replay_disable_events(); | ||
93 | ret = bdrv_flush_all(); | ||
94 | |||
95 | return ret; | ||
96 | @@ -XXX,XX +XXX,XX @@ int vm_prepare_start(void) | ||
97 | /* We are sending this now, but the CPUs will be resumed shortly later */ | ||
98 | qapi_event_send_resume(); | ||
99 | |||
100 | - replay_enable_events(); | ||
101 | cpu_enable_ticks(); | ||
102 | runstate_set(RUN_STATE_RUNNING); | ||
103 | vm_state_notify(1, RUN_STATE_RUNNING); | ||
104 | -- | 31 | -- |
105 | 2.20.1 | 32 | 2.40.0 |
106 | 33 | ||
107 | 34 | diff view generated by jsdifflib |
1 | From: Max Reitz <mreitz@redhat.com> | 1 | From: Stefan Hajnoczi <stefanha@redhat.com> |
---|---|---|---|
2 | 2 | ||
3 | For long test image paths, the order of the "Formatting" line and the | 3 | The HMP monitor runs in the main loop thread. Calling |
4 | "(qemu)" prompt after a drive_backup HMP command may be reversed. In | 4 | AIO_WAIT_WHILE(qemu_get_aio_context(), ...) from the main loop thread is |
5 | fact, the interaction between the prompt and the line may lead to the | 5 | equivalent to AIO_WAIT_WHILE_UNLOCKED(NULL, ...) because neither unlocks |
6 | "Formatting" to being greppable at all after "read"-ing it (if the | 6 | the AioContext and the latter's assertion that we're in the main loop |
7 | prompt injects an IFS character into the "Formatting" string). | 7 | succeeds. |
8 | 8 | ||
9 | So just wait until we get a prompt. At that point, the block job must | 9 | Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> |
10 | have been started, so "info block-jobs" will only return "No active | 10 | Tested-by: Philippe Mathieu-Daudé <philmd@linaro.org> |
11 | jobs" once it is done. | 11 | Reviewed-by: Markus Armbruster <armbru@redhat.com> |
12 | 12 | Reviewed-by: Kevin Wolf <kwolf@redhat.com> | |
13 | Reported-by: Thomas Huth <thuth@redhat.com> | 13 | Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> |
14 | Signed-off-by: Max Reitz <mreitz@redhat.com> | 14 | Message-Id: <20230309190855.414275-6-stefanha@redhat.com> |
15 | Reviewed-by: John Snow <jsnow@redhat.com> | 15 | Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com> |
16 | Signed-off-by: Kevin Wolf <kwolf@redhat.com> | 16 | Signed-off-by: Kevin Wolf <kwolf@redhat.com> |
17 | --- | 17 | --- |
18 | tests/qemu-iotests/028 | 11 ++++++++--- | 18 | monitor/hmp.c | 2 +- |
19 | tests/qemu-iotests/028.out | 1 - | 19 | 1 file changed, 1 insertion(+), 1 deletion(-) |
20 | 2 files changed, 8 insertions(+), 4 deletions(-) | ||
21 | 20 | ||
22 | diff --git a/tests/qemu-iotests/028 b/tests/qemu-iotests/028 | 21 | diff --git a/monitor/hmp.c b/monitor/hmp.c |
23 | index XXXXXXX..XXXXXXX 100755 | ||
24 | --- a/tests/qemu-iotests/028 | ||
25 | +++ b/tests/qemu-iotests/028 | ||
26 | @@ -XXX,XX +XXX,XX @@ fi | ||
27 | # Silence output since it contains the disk image path and QEMU's readline | ||
28 | # character echoing makes it very hard to filter the output. Plus, there | ||
29 | # is no telling how many times the command will repeat before succeeding. | ||
30 | -_send_qemu_cmd $h "drive_backup disk ${TEST_IMG}.copy" "(qemu)" >/dev/null | ||
31 | -_send_qemu_cmd $h "" "Formatting" | _filter_img_create | ||
32 | -qemu_cmd_repeat=20 _send_qemu_cmd $h "info block-jobs" "No active jobs" >/dev/null | ||
33 | +# (Note that creating the image results in a "Formatting..." message over | ||
34 | +# stdout, which is the same channel the monitor uses. We cannot reliably | ||
35 | +# wait for it because the monitor output may interact with it in such a | ||
36 | +# way that _timed_wait_for cannot read it. However, once the block job is | ||
37 | +# done, we know that the "Formatting..." message must have appeared | ||
38 | +# already, so the output is still deterministic.) | ||
39 | +silent=y _send_qemu_cmd $h "drive_backup disk ${TEST_IMG}.copy" "(qemu)" | ||
40 | +silent=y qemu_cmd_repeat=20 _send_qemu_cmd $h "info block-jobs" "No active jobs" | ||
41 | _send_qemu_cmd $h "info block-jobs" "No active jobs" | ||
42 | _send_qemu_cmd $h 'quit' "" | ||
43 | |||
44 | diff --git a/tests/qemu-iotests/028.out b/tests/qemu-iotests/028.out | ||
45 | index XXXXXXX..XXXXXXX 100644 | 22 | index XXXXXXX..XXXXXXX 100644 |
46 | --- a/tests/qemu-iotests/028.out | 23 | --- a/monitor/hmp.c |
47 | +++ b/tests/qemu-iotests/028.out | 24 | +++ b/monitor/hmp.c |
48 | @@ -XXX,XX +XXX,XX @@ No errors were found on the image. | 25 | @@ -XXX,XX +XXX,XX @@ void handle_hmp_command(MonitorHMP *mon, const char *cmdline) |
49 | 26 | Coroutine *co = qemu_coroutine_create(handle_hmp_command_co, &data); | |
50 | block-backup | 27 | monitor_set_cur(co, &mon->common); |
51 | 28 | aio_co_enter(qemu_get_aio_context(), co); | |
52 | -Formatting 'TEST_DIR/t.IMGFMT.copy', fmt=IMGFMT size=4294968832 backing_file=TEST_DIR/t.IMGFMT.base backing_fmt=IMGFMT | 29 | - AIO_WAIT_WHILE(qemu_get_aio_context(), !data.done); |
53 | (qemu) info block-jobs | 30 | + AIO_WAIT_WHILE_UNLOCKED(NULL, !data.done); |
54 | No active jobs | 31 | } |
55 | === IO: pattern 195 | 32 | |
33 | qobject_unref(qdict); | ||
56 | -- | 34 | -- |
57 | 2.20.1 | 35 | 2.40.0 |
58 | 36 | ||
59 | 37 | diff view generated by jsdifflib |
1 | From: Peter Lieven <pl@kamp.de> | 1 | From: Stefan Hajnoczi <stefanha@redhat.com> |
---|---|---|---|
2 | 2 | ||
3 | qemu is currently not able to detect truncated vhdx image files. | 3 | monitor_cleanup() is called from the main loop thread. Calling |
4 | Add a basic check if all allocated blocks are reachable at open and | 4 | AIO_WAIT_WHILE(qemu_get_aio_context(), ...) from the main loop thread is |
5 | report all errors during bdrv_co_check. | 5 | equivalent to AIO_WAIT_WHILE_UNLOCKED(NULL, ...) because neither unlocks |
6 | the AioContext and the latter's assertion that we're in the main loop | ||
7 | succeeds. | ||
6 | 8 | ||
7 | Signed-off-by: Peter Lieven <pl@kamp.de> | 9 | Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> |
10 | Tested-by: Philippe Mathieu-Daudé <philmd@linaro.org> | ||
11 | Reviewed-by: Markus Armbruster <armbru@redhat.com> | ||
12 | Reviewed-by: Kevin Wolf <kwolf@redhat.com> | ||
13 | Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> | ||
14 | Message-Id: <20230309190855.414275-7-stefanha@redhat.com> | ||
15 | Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com> | ||
8 | Signed-off-by: Kevin Wolf <kwolf@redhat.com> | 16 | Signed-off-by: Kevin Wolf <kwolf@redhat.com> |
9 | --- | 17 | --- |
10 | block/vhdx.c | 120 +++++++++++++++++++++++++++++++++++++++++++-------- | 18 | monitor/monitor.c | 4 ++-- |
11 | 1 file changed, 103 insertions(+), 17 deletions(-) | 19 | 1 file changed, 2 insertions(+), 2 deletions(-) |
12 | 20 | ||
13 | diff --git a/block/vhdx.c b/block/vhdx.c | 21 | diff --git a/monitor/monitor.c b/monitor/monitor.c |
14 | index XXXXXXX..XXXXXXX 100644 | 22 | index XXXXXXX..XXXXXXX 100644 |
15 | --- a/block/vhdx.c | 23 | --- a/monitor/monitor.c |
16 | +++ b/block/vhdx.c | 24 | +++ b/monitor/monitor.c |
17 | @@ -XXX,XX +XXX,XX @@ | 25 | @@ -XXX,XX +XXX,XX @@ void monitor_cleanup(void) |
18 | #include "qemu/option.h" | 26 | * We need to poll both qemu_aio_context and iohandler_ctx to make |
19 | #include "qemu/crc32c.h" | 27 | * sure that the dispatcher coroutine keeps making progress and |
20 | #include "qemu/bswap.h" | 28 | * eventually terminates. qemu_aio_context is automatically |
21 | +#include "qemu/error-report.h" | 29 | - * polled by calling AIO_WAIT_WHILE on it, but we must poll |
22 | #include "vhdx.h" | 30 | + * polled by calling AIO_WAIT_WHILE_UNLOCKED on it, but we must poll |
23 | #include "migration/blocker.h" | 31 | * iohandler_ctx manually. |
24 | #include "qemu/uuid.h" | 32 | * |
25 | @@ -XXX,XX +XXX,XX @@ static int vhdx_region_check(BDRVVHDXState *s, uint64_t start, uint64_t length) | 33 | * Letting the iothread continue while shutting down the dispatcher |
26 | end = start + length; | 34 | @@ -XXX,XX +XXX,XX @@ void monitor_cleanup(void) |
27 | QLIST_FOREACH(r, &s->regions, entries) { | 35 | aio_co_wake(qmp_dispatcher_co); |
28 | if (!((start >= r->end) || (end <= r->start))) { | ||
29 | + error_report("VHDX region %" PRIu64 "-%" PRIu64 " overlaps with " | ||
30 | + "region %" PRIu64 "-%." PRIu64, start, end, r->start, | ||
31 | + r->end); | ||
32 | ret = -EINVAL; | ||
33 | goto exit; | ||
34 | } | ||
35 | @@ -XXX,XX +XXX,XX @@ static void vhdx_calc_bat_entries(BDRVVHDXState *s) | ||
36 | |||
37 | } | ||
38 | |||
39 | +static int vhdx_check_bat_entries(BlockDriverState *bs, int *errcnt) | ||
40 | +{ | ||
41 | + BDRVVHDXState *s = bs->opaque; | ||
42 | + int64_t image_file_size = bdrv_getlength(bs->file->bs); | ||
43 | + uint64_t payblocks = s->chunk_ratio; | ||
44 | + uint64_t i; | ||
45 | + int ret = 0; | ||
46 | + | ||
47 | + if (image_file_size < 0) { | ||
48 | + error_report("Could not determinate VHDX image file size."); | ||
49 | + return image_file_size; | ||
50 | + } | ||
51 | + | ||
52 | + for (i = 0; i < s->bat_entries; i++) { | ||
53 | + if ((s->bat[i] & VHDX_BAT_STATE_BIT_MASK) == | ||
54 | + PAYLOAD_BLOCK_FULLY_PRESENT) { | ||
55 | + uint64_t offset = s->bat[i] & VHDX_BAT_FILE_OFF_MASK; | ||
56 | + /* | ||
57 | + * Allow that the last block exists only partially. The VHDX spec | ||
58 | + * states that the image file can only grow in blocksize increments, | ||
59 | + * but QEMU created images with partial last blocks in the past. | ||
60 | + */ | ||
61 | + uint32_t block_length = MIN(s->block_size, | ||
62 | + bs->total_sectors * BDRV_SECTOR_SIZE - i * s->block_size); | ||
63 | + /* | ||
64 | + * Check for BAT entry overflow. | ||
65 | + */ | ||
66 | + if (offset > INT64_MAX - s->block_size) { | ||
67 | + error_report("VHDX BAT entry %" PRIu64 " offset overflow.", i); | ||
68 | + ret = -EINVAL; | ||
69 | + if (!errcnt) { | ||
70 | + break; | ||
71 | + } | ||
72 | + (*errcnt)++; | ||
73 | + } | ||
74 | + /* | ||
75 | + * Check if fully allocated BAT entries do not reside after | ||
76 | + * end of the image file. | ||
77 | + */ | ||
78 | + if (offset >= image_file_size) { | ||
79 | + error_report("VHDX BAT entry %" PRIu64 " start offset %" PRIu64 | ||
80 | + " points after end of file (%" PRIi64 "). Image" | ||
81 | + " has probably been truncated.", | ||
82 | + i, offset, image_file_size); | ||
83 | + ret = -EINVAL; | ||
84 | + if (!errcnt) { | ||
85 | + break; | ||
86 | + } | ||
87 | + (*errcnt)++; | ||
88 | + } else if (offset + block_length > image_file_size) { | ||
89 | + error_report("VHDX BAT entry %" PRIu64 " end offset %" PRIu64 | ||
90 | + " points after end of file (%" PRIi64 "). Image" | ||
91 | + " has probably been truncated.", | ||
92 | + i, offset + block_length - 1, image_file_size); | ||
93 | + ret = -EINVAL; | ||
94 | + if (!errcnt) { | ||
95 | + break; | ||
96 | + } | ||
97 | + (*errcnt)++; | ||
98 | + } | ||
99 | + | ||
100 | + /* | ||
101 | + * verify populated BAT field file offsets against | ||
102 | + * region table and log entries | ||
103 | + */ | ||
104 | + if (payblocks--) { | ||
105 | + /* payload bat entries */ | ||
106 | + int ret2; | ||
107 | + ret2 = vhdx_region_check(s, offset, s->block_size); | ||
108 | + if (ret2 < 0) { | ||
109 | + ret = -EINVAL; | ||
110 | + if (!errcnt) { | ||
111 | + break; | ||
112 | + } | ||
113 | + (*errcnt)++; | ||
114 | + } | ||
115 | + } else { | ||
116 | + payblocks = s->chunk_ratio; | ||
117 | + /* | ||
118 | + * Once differencing files are supported, verify sector bitmap | ||
119 | + * blocks here | ||
120 | + */ | ||
121 | + } | ||
122 | + } | ||
123 | + } | ||
124 | + | ||
125 | + return ret; | ||
126 | +} | ||
127 | + | ||
128 | static void vhdx_close(BlockDriverState *bs) | ||
129 | { | ||
130 | BDRVVHDXState *s = bs->opaque; | ||
131 | @@ -XXX,XX +XXX,XX @@ static int vhdx_open(BlockDriverState *bs, QDict *options, int flags, | ||
132 | goto fail; | ||
133 | } | 36 | } |
134 | 37 | ||
135 | - uint64_t payblocks = s->chunk_ratio; | 38 | - AIO_WAIT_WHILE(qemu_get_aio_context(), |
136 | - /* endian convert, and verify populated BAT field file offsets against | 39 | + AIO_WAIT_WHILE_UNLOCKED(NULL, |
137 | - * region table and log entries */ | 40 | (aio_poll(iohandler_get_aio_context(), false), |
138 | + /* endian convert populated BAT field entires */ | 41 | qatomic_mb_read(&qmp_dispatcher_co_busy))); |
139 | for (i = 0; i < s->bat_entries; i++) { | ||
140 | s->bat[i] = le64_to_cpu(s->bat[i]); | ||
141 | - if (payblocks--) { | ||
142 | - /* payload bat entries */ | ||
143 | - if ((s->bat[i] & VHDX_BAT_STATE_BIT_MASK) == | ||
144 | - PAYLOAD_BLOCK_FULLY_PRESENT) { | ||
145 | - ret = vhdx_region_check(s, s->bat[i] & VHDX_BAT_FILE_OFF_MASK, | ||
146 | - s->block_size); | ||
147 | - if (ret < 0) { | ||
148 | - goto fail; | ||
149 | - } | ||
150 | - } | ||
151 | - } else { | ||
152 | - payblocks = s->chunk_ratio; | ||
153 | - /* Once differencing files are supported, verify sector bitmap | ||
154 | - * blocks here */ | ||
155 | + } | ||
156 | + | ||
157 | + if (!(flags & BDRV_O_CHECK)) { | ||
158 | + ret = vhdx_check_bat_entries(bs, NULL); | ||
159 | + if (ret < 0) { | ||
160 | + goto fail; | ||
161 | } | ||
162 | } | ||
163 | |||
164 | @@ -XXX,XX +XXX,XX @@ static int coroutine_fn vhdx_co_check(BlockDriverState *bs, | ||
165 | if (s->log_replayed_on_open) { | ||
166 | result->corruptions_fixed++; | ||
167 | } | ||
168 | + | ||
169 | + vhdx_check_bat_entries(bs, &result->corruptions); | ||
170 | + | ||
171 | return 0; | ||
172 | } | ||
173 | 42 | ||
174 | -- | 43 | -- |
175 | 2.20.1 | 44 | 2.40.0 |
176 | 45 | ||
177 | 46 | diff view generated by jsdifflib |
1 | From: Pavel Dovgalyuk <Pavel.Dovgaluk@ispras.ru> | 1 | From: Wilfred Mallawa <wilfred.mallawa@wdc.com> |
---|---|---|---|
2 | 2 | ||
3 | This patch updates the description of the command lines for using | 3 | Fixup a few minor typos |
4 | record/replay with attached block devices. | ||
5 | 4 | ||
6 | Signed-off-by: Pavel Dovgalyuk <Pavel.Dovgaluk@ispras.ru> | 5 | Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com> |
6 | Message-Id: <20230313003744.55476-1-wilfred.mallawa@opensource.wdc.com> | ||
7 | Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> | ||
8 | Reviewed-by: Kevin Wolf <kwolf@redhat.com> | ||
7 | Signed-off-by: Kevin Wolf <kwolf@redhat.com> | 9 | Signed-off-by: Kevin Wolf <kwolf@redhat.com> |
8 | --- | 10 | --- |
9 | docs/replay.txt | 12 +++++++++--- | 11 | include/block/aio-wait.h | 2 +- |
10 | 1 file changed, 9 insertions(+), 3 deletions(-) | 12 | include/block/block_int-common.h | 2 +- |
13 | 2 files changed, 2 insertions(+), 2 deletions(-) | ||
11 | 14 | ||
12 | diff --git a/docs/replay.txt b/docs/replay.txt | 15 | diff --git a/include/block/aio-wait.h b/include/block/aio-wait.h |
13 | index XXXXXXX..XXXXXXX 100644 | 16 | index XXXXXXX..XXXXXXX 100644 |
14 | --- a/docs/replay.txt | 17 | --- a/include/block/aio-wait.h |
15 | +++ b/docs/replay.txt | 18 | +++ b/include/block/aio-wait.h |
16 | @@ -XXX,XX +XXX,XX @@ Usage of the record/replay: | 19 | @@ -XXX,XX +XXX,XX @@ extern AioWait global_aio_wait; |
17 | * First, record the execution with the following command line: | 20 | * @ctx: the aio context, or NULL if multiple aio contexts (for which the |
18 | qemu-system-i386 \ | 21 | * caller does not hold a lock) are involved in the polling condition. |
19 | -icount shift=7,rr=record,rrfile=replay.bin \ | 22 | * @cond: wait while this conditional expression is true |
20 | - -drive file=disk.qcow2,if=none,id=img-direct \ | 23 | - * @unlock: whether to unlock and then lock again @ctx. This apples |
21 | + -drive file=disk.qcow2,if=none,snapshot,id=img-direct \ | 24 | + * @unlock: whether to unlock and then lock again @ctx. This applies |
22 | -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay \ | 25 | * only when waiting for another AioContext from the main loop. |
23 | -device ide-hd,drive=img-blkreplay \ | 26 | * Otherwise it's ignored. |
24 | -netdev user,id=net1 -device rtl8139,netdev=net1 \ | 27 | * |
25 | @@ -XXX,XX +XXX,XX @@ Usage of the record/replay: | 28 | diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h |
26 | * After recording, you can replay it by using another command line: | 29 | index XXXXXXX..XXXXXXX 100644 |
27 | qemu-system-i386 \ | 30 | --- a/include/block/block_int-common.h |
28 | -icount shift=7,rr=replay,rrfile=replay.bin \ | 31 | +++ b/include/block/block_int-common.h |
29 | - -drive file=disk.qcow2,if=none,id=img-direct \ | 32 | @@ -XXX,XX +XXX,XX @@ extern QemuOptsList bdrv_create_opts_simple; |
30 | + -drive file=disk.qcow2,if=none,snapshot,id=img-direct \ | 33 | /* |
31 | -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay \ | 34 | * Common functions that are neither I/O nor Global State. |
32 | -device ide-hd,drive=img-blkreplay \ | 35 | * |
33 | -netdev user,id=net1 -device rtl8139,netdev=net1 \ | 36 | - * See include/block/block-commmon.h for more information about |
34 | @@ -XXX,XX +XXX,XX @@ Block devices record/replay module intercepts calls of | 37 | + * See include/block/block-common.h for more information about |
35 | bdrv coroutine functions at the top of block drivers stack. | 38 | * the Common API. |
36 | To record and replay block operations the drive must be configured | 39 | */ |
37 | as following: | 40 | |
38 | - -drive file=disk.qcow2,if=none,id=img-direct | ||
39 | + -drive file=disk.qcow2,if=none,snapshot,id=img-direct | ||
40 | -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay | ||
41 | -device ide-hd,drive=img-blkreplay | ||
42 | |||
43 | @@ -XXX,XX +XXX,XX @@ This snapshot is created at start of recording and restored at start | ||
44 | of replaying. It also can be loaded while replaying to roll back | ||
45 | the execution. | ||
46 | |||
47 | +'snapshot' flag of the disk image must be removed to save the snapshots | ||
48 | +in the overlay (or original image) instead of using the temporary overlay. | ||
49 | + -drive file=disk.ovl,if=none,id=img-direct | ||
50 | + -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay | ||
51 | + -device ide-hd,drive=img-blkreplay | ||
52 | + | ||
53 | Use QEMU monitor to create additional snapshots. 'savevm <name>' command | ||
54 | created the snapshot and 'loadvm <name>' restores it. To prevent corruption | ||
55 | of the original disk image, use overlay files linked to the original images. | ||
56 | -- | 41 | -- |
57 | 2.20.1 | 42 | 2.40.0 |
58 | 43 | ||
59 | 44 | diff view generated by jsdifflib |
New patch | |||
---|---|---|---|
1 | From: Stefan Hajnoczi <stefanha@redhat.com> | ||
1 | 2 | ||
3 | Not a coroutine_fn, you say? | ||
4 | |||
5 | static int64_t bdrv_sum_allocated_file_size(BlockDriverState *bs) | ||
6 | { | ||
7 | BdrvChild *child; | ||
8 | int64_t child_size, sum = 0; | ||
9 | |||
10 | QLIST_FOREACH(child, &bs->children, next) { | ||
11 | if (child->role & (BDRV_CHILD_DATA | BDRV_CHILD_METADATA | | ||
12 | BDRV_CHILD_FILTERED)) | ||
13 | { | ||
14 | child_size = bdrv_co_get_allocated_file_size(child->bs); | ||
15 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
16 | |||
17 | Well what do we have here?! | ||
18 | |||
19 | I rest my case, your honor. | ||
20 | |||
21 | Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> | ||
22 | Message-Id: <20230308211435.346375-1-stefanha@redhat.com> | ||
23 | Reviewed-by: Kevin Wolf <kwolf@redhat.com> | ||
24 | Signed-off-by: Kevin Wolf <kwolf@redhat.com> | ||
25 | --- | ||
26 | block.c | 2 +- | ||
27 | 1 file changed, 1 insertion(+), 1 deletion(-) | ||
28 | |||
29 | diff --git a/block.c b/block.c | ||
30 | index XXXXXXX..XXXXXXX 100644 | ||
31 | --- a/block.c | ||
32 | +++ b/block.c | ||
33 | @@ -XXX,XX +XXX,XX @@ exit: | ||
34 | * sums the size of all data-bearing children. (This excludes backing | ||
35 | * children.) | ||
36 | */ | ||
37 | -static int64_t bdrv_sum_allocated_file_size(BlockDriverState *bs) | ||
38 | +static int64_t coroutine_fn bdrv_sum_allocated_file_size(BlockDriverState *bs) | ||
39 | { | ||
40 | BdrvChild *child; | ||
41 | int64_t child_size, sum = 0; | ||
42 | -- | ||
43 | 2.40.0 | diff view generated by jsdifflib |
New patch | |||
---|---|---|---|
1 | 1 | From: Emanuele Giuseppe Esposito <eesposit@redhat.com> | |
2 | |||
3 | Remove usage of aio_context_acquire by always submitting asynchronous | ||
4 | AIO to the current thread's LinuxAioState. | ||
5 | |||
6 | In order to prevent mistakes from the caller side, avoid passing LinuxAioState | ||
7 | in laio_io_{plug/unplug} and laio_co_submit, and document the functions | ||
8 | to make clear that they work in the current thread's AioContext. | ||
9 | |||
10 | Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com> | ||
11 | Message-Id: <20230203131731.851116-2-eesposit@redhat.com> | ||
12 | Reviewed-by: Kevin Wolf <kwolf@redhat.com> | ||
13 | Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> | ||
14 | Signed-off-by: Kevin Wolf <kwolf@redhat.com> | ||
15 | --- | ||
16 | include/block/aio.h | 4 ---- | ||
17 | include/block/raw-aio.h | 18 ++++++++++++------ | ||
18 | include/sysemu/block-backend-io.h | 5 +++++ | ||
19 | block/file-posix.c | 10 +++------- | ||
20 | block/linux-aio.c | 29 +++++++++++++++++------------ | ||
21 | 5 files changed, 37 insertions(+), 29 deletions(-) | ||
22 | |||
23 | diff --git a/include/block/aio.h b/include/block/aio.h | ||
24 | index XXXXXXX..XXXXXXX 100644 | ||
25 | --- a/include/block/aio.h | ||
26 | +++ b/include/block/aio.h | ||
27 | @@ -XXX,XX +XXX,XX @@ struct AioContext { | ||
28 | struct ThreadPool *thread_pool; | ||
29 | |||
30 | #ifdef CONFIG_LINUX_AIO | ||
31 | - /* | ||
32 | - * State for native Linux AIO. Uses aio_context_acquire/release for | ||
33 | - * locking. | ||
34 | - */ | ||
35 | struct LinuxAioState *linux_aio; | ||
36 | #endif | ||
37 | #ifdef CONFIG_LINUX_IO_URING | ||
38 | diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h | ||
39 | index XXXXXXX..XXXXXXX 100644 | ||
40 | --- a/include/block/raw-aio.h | ||
41 | +++ b/include/block/raw-aio.h | ||
42 | @@ -XXX,XX +XXX,XX @@ | ||
43 | typedef struct LinuxAioState LinuxAioState; | ||
44 | LinuxAioState *laio_init(Error **errp); | ||
45 | void laio_cleanup(LinuxAioState *s); | ||
46 | -int coroutine_fn laio_co_submit(BlockDriverState *bs, LinuxAioState *s, int fd, | ||
47 | - uint64_t offset, QEMUIOVector *qiov, int type, | ||
48 | - uint64_t dev_max_batch); | ||
49 | + | ||
50 | +/* laio_co_submit: submit I/O requests in the thread's current AioContext. */ | ||
51 | +int coroutine_fn laio_co_submit(int fd, uint64_t offset, QEMUIOVector *qiov, | ||
52 | + int type, uint64_t dev_max_batch); | ||
53 | + | ||
54 | void laio_detach_aio_context(LinuxAioState *s, AioContext *old_context); | ||
55 | void laio_attach_aio_context(LinuxAioState *s, AioContext *new_context); | ||
56 | -void laio_io_plug(BlockDriverState *bs, LinuxAioState *s); | ||
57 | -void laio_io_unplug(BlockDriverState *bs, LinuxAioState *s, | ||
58 | - uint64_t dev_max_batch); | ||
59 | + | ||
60 | +/* | ||
61 | + * laio_io_plug/unplug work in the thread's current AioContext, therefore the | ||
62 | + * caller must ensure that they are paired in the same IOThread. | ||
63 | + */ | ||
64 | +void laio_io_plug(void); | ||
65 | +void laio_io_unplug(uint64_t dev_max_batch); | ||
66 | #endif | ||
67 | /* io_uring.c - Linux io_uring implementation */ | ||
68 | #ifdef CONFIG_LINUX_IO_URING | ||
69 | diff --git a/include/sysemu/block-backend-io.h b/include/sysemu/block-backend-io.h | ||
70 | index XXXXXXX..XXXXXXX 100644 | ||
71 | --- a/include/sysemu/block-backend-io.h | ||
72 | +++ b/include/sysemu/block-backend-io.h | ||
73 | @@ -XXX,XX +XXX,XX @@ void blk_iostatus_set_err(BlockBackend *blk, int error); | ||
74 | int blk_get_max_iov(BlockBackend *blk); | ||
75 | int blk_get_max_hw_iov(BlockBackend *blk); | ||
76 | |||
77 | +/* | ||
78 | + * blk_io_plug/unplug are thread-local operations. This means that multiple | ||
79 | + * IOThreads can simultaneously call plug/unplug, but the caller must ensure | ||
80 | + * that each unplug() is called in the same IOThread of the matching plug(). | ||
81 | + */ | ||
82 | void coroutine_fn blk_co_io_plug(BlockBackend *blk); | ||
83 | void co_wrapper blk_io_plug(BlockBackend *blk); | ||
84 | |||
85 | diff --git a/block/file-posix.c b/block/file-posix.c | ||
86 | index XXXXXXX..XXXXXXX 100644 | ||
87 | --- a/block/file-posix.c | ||
88 | +++ b/block/file-posix.c | ||
89 | @@ -XXX,XX +XXX,XX @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, uint64_t offset, | ||
90 | #endif | ||
91 | #ifdef CONFIG_LINUX_AIO | ||
92 | } else if (s->use_linux_aio) { | ||
93 | - LinuxAioState *aio = aio_get_linux_aio(bdrv_get_aio_context(bs)); | ||
94 | assert(qiov->size == bytes); | ||
95 | - return laio_co_submit(bs, aio, s->fd, offset, qiov, type, | ||
96 | - s->aio_max_batch); | ||
97 | + return laio_co_submit(s->fd, offset, qiov, type, s->aio_max_batch); | ||
98 | #endif | ||
99 | } | ||
100 | |||
101 | @@ -XXX,XX +XXX,XX @@ static void coroutine_fn raw_co_io_plug(BlockDriverState *bs) | ||
102 | BDRVRawState __attribute__((unused)) *s = bs->opaque; | ||
103 | #ifdef CONFIG_LINUX_AIO | ||
104 | if (s->use_linux_aio) { | ||
105 | - LinuxAioState *aio = aio_get_linux_aio(bdrv_get_aio_context(bs)); | ||
106 | - laio_io_plug(bs, aio); | ||
107 | + laio_io_plug(); | ||
108 | } | ||
109 | #endif | ||
110 | #ifdef CONFIG_LINUX_IO_URING | ||
111 | @@ -XXX,XX +XXX,XX @@ static void coroutine_fn raw_co_io_unplug(BlockDriverState *bs) | ||
112 | BDRVRawState __attribute__((unused)) *s = bs->opaque; | ||
113 | #ifdef CONFIG_LINUX_AIO | ||
114 | if (s->use_linux_aio) { | ||
115 | - LinuxAioState *aio = aio_get_linux_aio(bdrv_get_aio_context(bs)); | ||
116 | - laio_io_unplug(bs, aio, s->aio_max_batch); | ||
117 | + laio_io_unplug(s->aio_max_batch); | ||
118 | } | ||
119 | #endif | ||
120 | #ifdef CONFIG_LINUX_IO_URING | ||
121 | diff --git a/block/linux-aio.c b/block/linux-aio.c | ||
122 | index XXXXXXX..XXXXXXX 100644 | ||
123 | --- a/block/linux-aio.c | ||
124 | +++ b/block/linux-aio.c | ||
125 | @@ -XXX,XX +XXX,XX @@ | ||
126 | #include "qemu/coroutine.h" | ||
127 | #include "qapi/error.h" | ||
128 | |||
129 | +/* Only used for assertions. */ | ||
130 | +#include "qemu/coroutine_int.h" | ||
131 | + | ||
132 | #include <libaio.h> | ||
133 | |||
134 | /* | ||
135 | @@ -XXX,XX +XXX,XX @@ struct LinuxAioState { | ||
136 | io_context_t ctx; | ||
137 | EventNotifier e; | ||
138 | |||
139 | - /* io queue for submit at batch. Protected by AioContext lock. */ | ||
140 | + /* No locking required, only accessed from AioContext home thread */ | ||
141 | LaioQueue io_q; | ||
142 | - | ||
143 | - /* I/O completion processing. Only runs in I/O thread. */ | ||
144 | QEMUBH *completion_bh; | ||
145 | int event_idx; | ||
146 | int event_max; | ||
147 | @@ -XXX,XX +XXX,XX @@ static void qemu_laio_process_completion(struct qemu_laiocb *laiocb) | ||
148 | * later. Coroutines cannot be entered recursively so avoid doing | ||
149 | * that! | ||
150 | */ | ||
151 | + assert(laiocb->co->ctx == laiocb->ctx->aio_context); | ||
152 | if (!qemu_coroutine_entered(laiocb->co)) { | ||
153 | aio_co_wake(laiocb->co); | ||
154 | } | ||
155 | @@ -XXX,XX +XXX,XX @@ static void qemu_laio_process_completions(LinuxAioState *s) | ||
156 | |||
157 | static void qemu_laio_process_completions_and_submit(LinuxAioState *s) | ||
158 | { | ||
159 | - aio_context_acquire(s->aio_context); | ||
160 | qemu_laio_process_completions(s); | ||
161 | |||
162 | if (!s->io_q.plugged && !QSIMPLEQ_EMPTY(&s->io_q.pending)) { | ||
163 | ioq_submit(s); | ||
164 | } | ||
165 | - aio_context_release(s->aio_context); | ||
166 | } | ||
167 | |||
168 | static void qemu_laio_completion_bh(void *opaque) | ||
169 | @@ -XXX,XX +XXX,XX @@ static uint64_t laio_max_batch(LinuxAioState *s, uint64_t dev_max_batch) | ||
170 | return max_batch; | ||
171 | } | ||
172 | |||
173 | -void laio_io_plug(BlockDriverState *bs, LinuxAioState *s) | ||
174 | +void laio_io_plug(void) | ||
175 | { | ||
176 | + AioContext *ctx = qemu_get_current_aio_context(); | ||
177 | + LinuxAioState *s = aio_get_linux_aio(ctx); | ||
178 | + | ||
179 | s->io_q.plugged++; | ||
180 | } | ||
181 | |||
182 | -void laio_io_unplug(BlockDriverState *bs, LinuxAioState *s, | ||
183 | - uint64_t dev_max_batch) | ||
184 | +void laio_io_unplug(uint64_t dev_max_batch) | ||
185 | { | ||
186 | + AioContext *ctx = qemu_get_current_aio_context(); | ||
187 | + LinuxAioState *s = aio_get_linux_aio(ctx); | ||
188 | + | ||
189 | assert(s->io_q.plugged); | ||
190 | s->io_q.plugged--; | ||
191 | |||
192 | @@ -XXX,XX +XXX,XX @@ static int laio_do_submit(int fd, struct qemu_laiocb *laiocb, off_t offset, | ||
193 | return 0; | ||
194 | } | ||
195 | |||
196 | -int coroutine_fn laio_co_submit(BlockDriverState *bs, LinuxAioState *s, int fd, | ||
197 | - uint64_t offset, QEMUIOVector *qiov, int type, | ||
198 | - uint64_t dev_max_batch) | ||
199 | +int coroutine_fn laio_co_submit(int fd, uint64_t offset, QEMUIOVector *qiov, | ||
200 | + int type, uint64_t dev_max_batch) | ||
201 | { | ||
202 | int ret; | ||
203 | + AioContext *ctx = qemu_get_current_aio_context(); | ||
204 | struct qemu_laiocb laiocb = { | ||
205 | .co = qemu_coroutine_self(), | ||
206 | .nbytes = qiov->size, | ||
207 | - .ctx = s, | ||
208 | + .ctx = aio_get_linux_aio(ctx), | ||
209 | .ret = -EINPROGRESS, | ||
210 | .is_read = (type == QEMU_AIO_READ), | ||
211 | .qiov = qiov, | ||
212 | -- | ||
213 | 2.40.0 | diff view generated by jsdifflib |
New patch | |||
---|---|---|---|
1 | From: Emanuele Giuseppe Esposito <eesposit@redhat.com> | ||
1 | 2 | ||
3 | Remove usage of aio_context_acquire by always submitting asynchronous | ||
4 | AIO to the current thread's LuringState. | ||
5 | |||
6 | In order to prevent mistakes from the caller side, avoid passing LuringState | ||
7 | in luring_io_{plug/unplug} and luring_co_submit, and document the functions | ||
8 | to make clear that they work in the current thread's AioContext. | ||
9 | |||
10 | Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com> | ||
11 | Message-Id: <20230203131731.851116-3-eesposit@redhat.com> | ||
12 | Reviewed-by: Kevin Wolf <kwolf@redhat.com> | ||
13 | Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> | ||
14 | Signed-off-by: Kevin Wolf <kwolf@redhat.com> | ||
15 | --- | ||
16 | include/block/aio.h | 4 ---- | ||
17 | include/block/raw-aio.h | 15 +++++++++++---- | ||
18 | block/file-posix.c | 12 ++++-------- | ||
19 | block/io_uring.c | 23 +++++++++++++++-------- | ||
20 | 4 files changed, 30 insertions(+), 24 deletions(-) | ||
21 | |||
22 | diff --git a/include/block/aio.h b/include/block/aio.h | ||
23 | index XXXXXXX..XXXXXXX 100644 | ||
24 | --- a/include/block/aio.h | ||
25 | +++ b/include/block/aio.h | ||
26 | @@ -XXX,XX +XXX,XX @@ struct AioContext { | ||
27 | struct LinuxAioState *linux_aio; | ||
28 | #endif | ||
29 | #ifdef CONFIG_LINUX_IO_URING | ||
30 | - /* | ||
31 | - * State for Linux io_uring. Uses aio_context_acquire/release for | ||
32 | - * locking. | ||
33 | - */ | ||
34 | struct LuringState *linux_io_uring; | ||
35 | |||
36 | /* State for file descriptor monitoring using Linux io_uring */ | ||
37 | diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h | ||
38 | index XXXXXXX..XXXXXXX 100644 | ||
39 | --- a/include/block/raw-aio.h | ||
40 | +++ b/include/block/raw-aio.h | ||
41 | @@ -XXX,XX +XXX,XX @@ void laio_io_unplug(uint64_t dev_max_batch); | ||
42 | typedef struct LuringState LuringState; | ||
43 | LuringState *luring_init(Error **errp); | ||
44 | void luring_cleanup(LuringState *s); | ||
45 | -int coroutine_fn luring_co_submit(BlockDriverState *bs, LuringState *s, int fd, | ||
46 | - uint64_t offset, QEMUIOVector *qiov, int type); | ||
47 | + | ||
48 | +/* luring_co_submit: submit I/O requests in the thread's current AioContext. */ | ||
49 | +int coroutine_fn luring_co_submit(BlockDriverState *bs, int fd, uint64_t offset, | ||
50 | + QEMUIOVector *qiov, int type); | ||
51 | void luring_detach_aio_context(LuringState *s, AioContext *old_context); | ||
52 | void luring_attach_aio_context(LuringState *s, AioContext *new_context); | ||
53 | -void luring_io_plug(BlockDriverState *bs, LuringState *s); | ||
54 | -void luring_io_unplug(BlockDriverState *bs, LuringState *s); | ||
55 | + | ||
56 | +/* | ||
57 | + * luring_io_plug/unplug work in the thread's current AioContext, therefore the | ||
58 | + * caller must ensure that they are paired in the same IOThread. | ||
59 | + */ | ||
60 | +void luring_io_plug(void); | ||
61 | +void luring_io_unplug(void); | ||
62 | #endif | ||
63 | |||
64 | #ifdef _WIN32 | ||
65 | diff --git a/block/file-posix.c b/block/file-posix.c | ||
66 | index XXXXXXX..XXXXXXX 100644 | ||
67 | --- a/block/file-posix.c | ||
68 | +++ b/block/file-posix.c | ||
69 | @@ -XXX,XX +XXX,XX @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, uint64_t offset, | ||
70 | type |= QEMU_AIO_MISALIGNED; | ||
71 | #ifdef CONFIG_LINUX_IO_URING | ||
72 | } else if (s->use_linux_io_uring) { | ||
73 | - LuringState *aio = aio_get_linux_io_uring(bdrv_get_aio_context(bs)); | ||
74 | assert(qiov->size == bytes); | ||
75 | - return luring_co_submit(bs, aio, s->fd, offset, qiov, type); | ||
76 | + return luring_co_submit(bs, s->fd, offset, qiov, type); | ||
77 | #endif | ||
78 | #ifdef CONFIG_LINUX_AIO | ||
79 | } else if (s->use_linux_aio) { | ||
80 | @@ -XXX,XX +XXX,XX @@ static void coroutine_fn raw_co_io_plug(BlockDriverState *bs) | ||
81 | #endif | ||
82 | #ifdef CONFIG_LINUX_IO_URING | ||
83 | if (s->use_linux_io_uring) { | ||
84 | - LuringState *aio = aio_get_linux_io_uring(bdrv_get_aio_context(bs)); | ||
85 | - luring_io_plug(bs, aio); | ||
86 | + luring_io_plug(); | ||
87 | } | ||
88 | #endif | ||
89 | } | ||
90 | @@ -XXX,XX +XXX,XX @@ static void coroutine_fn raw_co_io_unplug(BlockDriverState *bs) | ||
91 | #endif | ||
92 | #ifdef CONFIG_LINUX_IO_URING | ||
93 | if (s->use_linux_io_uring) { | ||
94 | - LuringState *aio = aio_get_linux_io_uring(bdrv_get_aio_context(bs)); | ||
95 | - luring_io_unplug(bs, aio); | ||
96 | + luring_io_unplug(); | ||
97 | } | ||
98 | #endif | ||
99 | } | ||
100 | @@ -XXX,XX +XXX,XX @@ static int coroutine_fn raw_co_flush_to_disk(BlockDriverState *bs) | ||
101 | |||
102 | #ifdef CONFIG_LINUX_IO_URING | ||
103 | if (s->use_linux_io_uring) { | ||
104 | - LuringState *aio = aio_get_linux_io_uring(bdrv_get_aio_context(bs)); | ||
105 | - return luring_co_submit(bs, aio, s->fd, 0, NULL, QEMU_AIO_FLUSH); | ||
106 | + return luring_co_submit(bs, s->fd, 0, NULL, QEMU_AIO_FLUSH); | ||
107 | } | ||
108 | #endif | ||
109 | return raw_thread_pool_submit(bs, handle_aiocb_flush, &acb); | ||
110 | diff --git a/block/io_uring.c b/block/io_uring.c | ||
111 | index XXXXXXX..XXXXXXX 100644 | ||
112 | --- a/block/io_uring.c | ||
113 | +++ b/block/io_uring.c | ||
114 | @@ -XXX,XX +XXX,XX @@ | ||
115 | #include "qapi/error.h" | ||
116 | #include "trace.h" | ||
117 | |||
118 | +/* Only used for assertions. */ | ||
119 | +#include "qemu/coroutine_int.h" | ||
120 | + | ||
121 | /* io_uring ring size */ | ||
122 | #define MAX_ENTRIES 128 | ||
123 | |||
124 | @@ -XXX,XX +XXX,XX @@ typedef struct LuringState { | ||
125 | |||
126 | struct io_uring ring; | ||
127 | |||
128 | - /* io queue for submit at batch. Protected by AioContext lock. */ | ||
129 | + /* No locking required, only accessed from AioContext home thread */ | ||
130 | LuringQueue io_q; | ||
131 | |||
132 | - /* I/O completion processing. Only runs in I/O thread. */ | ||
133 | QEMUBH *completion_bh; | ||
134 | } LuringState; | ||
135 | |||
136 | @@ -XXX,XX +XXX,XX @@ end: | ||
137 | * eventually runs later. Coroutines cannot be entered recursively | ||
138 | * so avoid doing that! | ||
139 | */ | ||
140 | + assert(luringcb->co->ctx == s->aio_context); | ||
141 | if (!qemu_coroutine_entered(luringcb->co)) { | ||
142 | aio_co_wake(luringcb->co); | ||
143 | } | ||
144 | @@ -XXX,XX +XXX,XX @@ static int ioq_submit(LuringState *s) | ||
145 | |||
146 | static void luring_process_completions_and_submit(LuringState *s) | ||
147 | { | ||
148 | - aio_context_acquire(s->aio_context); | ||
149 | luring_process_completions(s); | ||
150 | |||
151 | if (!s->io_q.plugged && s->io_q.in_queue > 0) { | ||
152 | ioq_submit(s); | ||
153 | } | ||
154 | - aio_context_release(s->aio_context); | ||
155 | } | ||
156 | |||
157 | static void qemu_luring_completion_bh(void *opaque) | ||
158 | @@ -XXX,XX +XXX,XX @@ static void ioq_init(LuringQueue *io_q) | ||
159 | io_q->blocked = false; | ||
160 | } | ||
161 | |||
162 | -void luring_io_plug(BlockDriverState *bs, LuringState *s) | ||
163 | +void luring_io_plug(void) | ||
164 | { | ||
165 | + AioContext *ctx = qemu_get_current_aio_context(); | ||
166 | + LuringState *s = aio_get_linux_io_uring(ctx); | ||
167 | trace_luring_io_plug(s); | ||
168 | s->io_q.plugged++; | ||
169 | } | ||
170 | |||
171 | -void luring_io_unplug(BlockDriverState *bs, LuringState *s) | ||
172 | +void luring_io_unplug(void) | ||
173 | { | ||
174 | + AioContext *ctx = qemu_get_current_aio_context(); | ||
175 | + LuringState *s = aio_get_linux_io_uring(ctx); | ||
176 | assert(s->io_q.plugged); | ||
177 | trace_luring_io_unplug(s, s->io_q.blocked, s->io_q.plugged, | ||
178 | s->io_q.in_queue, s->io_q.in_flight); | ||
179 | @@ -XXX,XX +XXX,XX @@ static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s, | ||
180 | return 0; | ||
181 | } | ||
182 | |||
183 | -int coroutine_fn luring_co_submit(BlockDriverState *bs, LuringState *s, int fd, | ||
184 | - uint64_t offset, QEMUIOVector *qiov, int type) | ||
185 | +int coroutine_fn luring_co_submit(BlockDriverState *bs, int fd, uint64_t offset, | ||
186 | + QEMUIOVector *qiov, int type) | ||
187 | { | ||
188 | int ret; | ||
189 | + AioContext *ctx = qemu_get_current_aio_context(); | ||
190 | + LuringState *s = aio_get_linux_io_uring(ctx); | ||
191 | LuringAIOCB luringcb = { | ||
192 | .co = qemu_coroutine_self(), | ||
193 | .ret = -EINPROGRESS, | ||
194 | -- | ||
195 | 2.40.0 | diff view generated by jsdifflib |
New patch | |||
---|---|---|---|
1 | 1 | From: Emanuele Giuseppe Esposito <eesposit@redhat.com> | |
2 | |||
3 | Use qemu_get_current_aio_context() where possible, since we always | ||
4 | submit work to the current thread anyways. | ||
5 | |||
6 | We want to also be sure that the thread submitting the work is | ||
7 | the same as the one processing the pool, to avoid adding | ||
8 | synchronization to the pool list. | ||
9 | |||
10 | Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com> | ||
11 | Message-Id: <20230203131731.851116-4-eesposit@redhat.com> | ||
12 | Reviewed-by: Kevin Wolf <kwolf@redhat.com> | ||
13 | Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> | ||
14 | Signed-off-by: Kevin Wolf <kwolf@redhat.com> | ||
15 | --- | ||
16 | include/block/thread-pool.h | 5 +++++ | ||
17 | block/file-posix.c | 21 ++++++++++----------- | ||
18 | block/file-win32.c | 2 +- | ||
19 | block/qcow2-threads.c | 2 +- | ||
20 | util/thread-pool.c | 9 ++++----- | ||
21 | 5 files changed, 21 insertions(+), 18 deletions(-) | ||
22 | |||
23 | diff --git a/include/block/thread-pool.h b/include/block/thread-pool.h | ||
24 | index XXXXXXX..XXXXXXX 100644 | ||
25 | --- a/include/block/thread-pool.h | ||
26 | +++ b/include/block/thread-pool.h | ||
27 | @@ -XXX,XX +XXX,XX @@ typedef struct ThreadPool ThreadPool; | ||
28 | ThreadPool *thread_pool_new(struct AioContext *ctx); | ||
29 | void thread_pool_free(ThreadPool *pool); | ||
30 | |||
31 | +/* | ||
32 | + * thread_pool_submit* API: submit I/O requests in the thread's | ||
33 | + * current AioContext. | ||
34 | + */ | ||
35 | BlockAIOCB *thread_pool_submit_aio(ThreadPool *pool, | ||
36 | ThreadPoolFunc *func, void *arg, | ||
37 | BlockCompletionFunc *cb, void *opaque); | ||
38 | int coroutine_fn thread_pool_submit_co(ThreadPool *pool, | ||
39 | ThreadPoolFunc *func, void *arg); | ||
40 | void thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func, void *arg); | ||
41 | + | ||
42 | void thread_pool_update_params(ThreadPool *pool, struct AioContext *ctx); | ||
43 | |||
44 | #endif | ||
45 | diff --git a/block/file-posix.c b/block/file-posix.c | ||
46 | index XXXXXXX..XXXXXXX 100644 | ||
47 | --- a/block/file-posix.c | ||
48 | +++ b/block/file-posix.c | ||
49 | @@ -XXX,XX +XXX,XX @@ out: | ||
50 | return result; | ||
51 | } | ||
52 | |||
53 | -static int coroutine_fn raw_thread_pool_submit(BlockDriverState *bs, | ||
54 | - ThreadPoolFunc func, void *arg) | ||
55 | +static int coroutine_fn raw_thread_pool_submit(ThreadPoolFunc func, void *arg) | ||
56 | { | ||
57 | /* @bs can be NULL, bdrv_get_aio_context() returns the main context then */ | ||
58 | - ThreadPool *pool = aio_get_thread_pool(bdrv_get_aio_context(bs)); | ||
59 | + ThreadPool *pool = aio_get_thread_pool(qemu_get_current_aio_context()); | ||
60 | return thread_pool_submit_co(pool, func, arg); | ||
61 | } | ||
62 | |||
63 | @@ -XXX,XX +XXX,XX @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, uint64_t offset, | ||
64 | }; | ||
65 | |||
66 | assert(qiov->size == bytes); | ||
67 | - return raw_thread_pool_submit(bs, handle_aiocb_rw, &acb); | ||
68 | + return raw_thread_pool_submit(handle_aiocb_rw, &acb); | ||
69 | } | ||
70 | |||
71 | static int coroutine_fn raw_co_preadv(BlockDriverState *bs, int64_t offset, | ||
72 | @@ -XXX,XX +XXX,XX @@ static int coroutine_fn raw_co_flush_to_disk(BlockDriverState *bs) | ||
73 | return luring_co_submit(bs, s->fd, 0, NULL, QEMU_AIO_FLUSH); | ||
74 | } | ||
75 | #endif | ||
76 | - return raw_thread_pool_submit(bs, handle_aiocb_flush, &acb); | ||
77 | + return raw_thread_pool_submit(handle_aiocb_flush, &acb); | ||
78 | } | ||
79 | |||
80 | static void raw_aio_attach_aio_context(BlockDriverState *bs, | ||
81 | @@ -XXX,XX +XXX,XX @@ raw_regular_truncate(BlockDriverState *bs, int fd, int64_t offset, | ||
82 | }, | ||
83 | }; | ||
84 | |||
85 | - return raw_thread_pool_submit(bs, handle_aiocb_truncate, &acb); | ||
86 | + return raw_thread_pool_submit(handle_aiocb_truncate, &acb); | ||
87 | } | ||
88 | |||
89 | static int coroutine_fn raw_co_truncate(BlockDriverState *bs, int64_t offset, | ||
90 | @@ -XXX,XX +XXX,XX @@ raw_do_pdiscard(BlockDriverState *bs, int64_t offset, int64_t bytes, | ||
91 | acb.aio_type |= QEMU_AIO_BLKDEV; | ||
92 | } | ||
93 | |||
94 | - ret = raw_thread_pool_submit(bs, handle_aiocb_discard, &acb); | ||
95 | + ret = raw_thread_pool_submit(handle_aiocb_discard, &acb); | ||
96 | raw_account_discard(s, bytes, ret); | ||
97 | return ret; | ||
98 | } | ||
99 | @@ -XXX,XX +XXX,XX @@ raw_do_pwrite_zeroes(BlockDriverState *bs, int64_t offset, int64_t bytes, | ||
100 | handler = handle_aiocb_write_zeroes; | ||
101 | } | ||
102 | |||
103 | - return raw_thread_pool_submit(bs, handler, &acb); | ||
104 | + return raw_thread_pool_submit(handler, &acb); | ||
105 | } | ||
106 | |||
107 | static int coroutine_fn raw_co_pwrite_zeroes( | ||
108 | @@ -XXX,XX +XXX,XX @@ raw_co_copy_range_to(BlockDriverState *bs, | ||
109 | }, | ||
110 | }; | ||
111 | |||
112 | - return raw_thread_pool_submit(bs, handle_aiocb_copy_range, &acb); | ||
113 | + return raw_thread_pool_submit(handle_aiocb_copy_range, &acb); | ||
114 | } | ||
115 | |||
116 | BlockDriver bdrv_file = { | ||
117 | @@ -XXX,XX +XXX,XX @@ hdev_co_ioctl(BlockDriverState *bs, unsigned long int req, void *buf) | ||
118 | struct sg_io_hdr *io_hdr = buf; | ||
119 | if (io_hdr->cmdp[0] == PERSISTENT_RESERVE_OUT || | ||
120 | io_hdr->cmdp[0] == PERSISTENT_RESERVE_IN) { | ||
121 | - return pr_manager_execute(s->pr_mgr, bdrv_get_aio_context(bs), | ||
122 | + return pr_manager_execute(s->pr_mgr, qemu_get_current_aio_context(), | ||
123 | s->fd, io_hdr); | ||
124 | } | ||
125 | } | ||
126 | @@ -XXX,XX +XXX,XX @@ hdev_co_ioctl(BlockDriverState *bs, unsigned long int req, void *buf) | ||
127 | }, | ||
128 | }; | ||
129 | |||
130 | - return raw_thread_pool_submit(bs, handle_aiocb_ioctl, &acb); | ||
131 | + return raw_thread_pool_submit(handle_aiocb_ioctl, &acb); | ||
132 | } | ||
133 | #endif /* linux */ | ||
134 | |||
135 | diff --git a/block/file-win32.c b/block/file-win32.c | ||
136 | index XXXXXXX..XXXXXXX 100644 | ||
137 | --- a/block/file-win32.c | ||
138 | +++ b/block/file-win32.c | ||
139 | @@ -XXX,XX +XXX,XX @@ static BlockAIOCB *paio_submit(BlockDriverState *bs, HANDLE hfile, | ||
140 | acb->aio_offset = offset; | ||
141 | |||
142 | trace_file_paio_submit(acb, opaque, offset, count, type); | ||
143 | - pool = aio_get_thread_pool(bdrv_get_aio_context(bs)); | ||
144 | + pool = aio_get_thread_pool(qemu_get_current_aio_context()); | ||
145 | return thread_pool_submit_aio(pool, aio_worker, acb, cb, opaque); | ||
146 | } | ||
147 | |||
148 | diff --git a/block/qcow2-threads.c b/block/qcow2-threads.c | ||
149 | index XXXXXXX..XXXXXXX 100644 | ||
150 | --- a/block/qcow2-threads.c | ||
151 | +++ b/block/qcow2-threads.c | ||
152 | @@ -XXX,XX +XXX,XX @@ qcow2_co_process(BlockDriverState *bs, ThreadPoolFunc *func, void *arg) | ||
153 | { | ||
154 | int ret; | ||
155 | BDRVQcow2State *s = bs->opaque; | ||
156 | - ThreadPool *pool = aio_get_thread_pool(bdrv_get_aio_context(bs)); | ||
157 | + ThreadPool *pool = aio_get_thread_pool(qemu_get_current_aio_context()); | ||
158 | |||
159 | qemu_co_mutex_lock(&s->lock); | ||
160 | while (s->nb_threads >= QCOW2_MAX_THREADS) { | ||
161 | diff --git a/util/thread-pool.c b/util/thread-pool.c | ||
162 | index XXXXXXX..XXXXXXX 100644 | ||
163 | --- a/util/thread-pool.c | ||
164 | +++ b/util/thread-pool.c | ||
165 | @@ -XXX,XX +XXX,XX @@ struct ThreadPoolElement { | ||
166 | /* Access to this list is protected by lock. */ | ||
167 | QTAILQ_ENTRY(ThreadPoolElement) reqs; | ||
168 | |||
169 | - /* Access to this list is protected by the global mutex. */ | ||
170 | + /* This list is only written by the thread pool's mother thread. */ | ||
171 | QLIST_ENTRY(ThreadPoolElement) all; | ||
172 | }; | ||
173 | |||
174 | @@ -XXX,XX +XXX,XX @@ static void thread_pool_completion_bh(void *opaque) | ||
175 | ThreadPool *pool = opaque; | ||
176 | ThreadPoolElement *elem, *next; | ||
177 | |||
178 | - aio_context_acquire(pool->ctx); | ||
179 | restart: | ||
180 | QLIST_FOREACH_SAFE(elem, &pool->head, all, next) { | ||
181 | if (elem->state != THREAD_DONE) { | ||
182 | @@ -XXX,XX +XXX,XX @@ restart: | ||
183 | */ | ||
184 | qemu_bh_schedule(pool->completion_bh); | ||
185 | |||
186 | - aio_context_release(pool->ctx); | ||
187 | elem->common.cb(elem->common.opaque, elem->ret); | ||
188 | - aio_context_acquire(pool->ctx); | ||
189 | |||
190 | /* We can safely cancel the completion_bh here regardless of someone | ||
191 | * else having scheduled it meanwhile because we reenter the | ||
192 | @@ -XXX,XX +XXX,XX @@ restart: | ||
193 | qemu_aio_unref(elem); | ||
194 | } | ||
195 | } | ||
196 | - aio_context_release(pool->ctx); | ||
197 | } | ||
198 | |||
199 | static void thread_pool_cancel(BlockAIOCB *acb) | ||
200 | @@ -XXX,XX +XXX,XX @@ BlockAIOCB *thread_pool_submit_aio(ThreadPool *pool, | ||
201 | { | ||
202 | ThreadPoolElement *req; | ||
203 | |||
204 | + /* Assert that the thread submitting work is the same running the pool */ | ||
205 | + assert(pool->ctx == qemu_get_current_aio_context()); | ||
206 | + | ||
207 | req = qemu_aio_get(&thread_pool_aiocb_info, NULL, cb, opaque); | ||
208 | req->func = func; | ||
209 | req->arg = arg; | ||
210 | -- | ||
211 | 2.40.0 | diff view generated by jsdifflib |
New patch | |||
---|---|---|---|
1 | From: Emanuele Giuseppe Esposito <eesposit@redhat.com> | ||
1 | 2 | ||
3 | thread_pool_submit_aio() is always called on a pool taken from | ||
4 | qemu_get_current_aio_context(), and that is the only intended | ||
5 | use: each pool runs only in the same thread that is submitting | ||
6 | work to it, it can't run anywhere else. | ||
7 | |||
8 | Therefore simplify the thread_pool_submit* API and remove the | ||
9 | ThreadPool function parameter. | ||
10 | |||
11 | Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com> | ||
12 | Message-Id: <20230203131731.851116-5-eesposit@redhat.com> | ||
13 | Reviewed-by: Kevin Wolf <kwolf@redhat.com> | ||
14 | Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> | ||
15 | Signed-off-by: Kevin Wolf <kwolf@redhat.com> | ||
16 | --- | ||
17 | include/block/thread-pool.h | 10 ++++------ | ||
18 | backends/tpm/tpm_backend.c | 4 +--- | ||
19 | block/file-posix.c | 4 +--- | ||
20 | block/file-win32.c | 4 +--- | ||
21 | block/qcow2-threads.c | 3 +-- | ||
22 | hw/9pfs/coth.c | 3 +-- | ||
23 | hw/ppc/spapr_nvdimm.c | 6 ++---- | ||
24 | hw/virtio/virtio-pmem.c | 3 +-- | ||
25 | scsi/pr-manager.c | 3 +-- | ||
26 | scsi/qemu-pr-helper.c | 3 +-- | ||
27 | tests/unit/test-thread-pool.c | 12 +++++------- | ||
28 | util/thread-pool.c | 16 ++++++++-------- | ||
29 | 12 files changed, 27 insertions(+), 44 deletions(-) | ||
30 | |||
31 | diff --git a/include/block/thread-pool.h b/include/block/thread-pool.h | ||
32 | index XXXXXXX..XXXXXXX 100644 | ||
33 | --- a/include/block/thread-pool.h | ||
34 | +++ b/include/block/thread-pool.h | ||
35 | @@ -XXX,XX +XXX,XX @@ void thread_pool_free(ThreadPool *pool); | ||
36 | * thread_pool_submit* API: submit I/O requests in the thread's | ||
37 | * current AioContext. | ||
38 | */ | ||
39 | -BlockAIOCB *thread_pool_submit_aio(ThreadPool *pool, | ||
40 | - ThreadPoolFunc *func, void *arg, | ||
41 | - BlockCompletionFunc *cb, void *opaque); | ||
42 | -int coroutine_fn thread_pool_submit_co(ThreadPool *pool, | ||
43 | - ThreadPoolFunc *func, void *arg); | ||
44 | -void thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func, void *arg); | ||
45 | +BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg, | ||
46 | + BlockCompletionFunc *cb, void *opaque); | ||
47 | +int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg); | ||
48 | +void thread_pool_submit(ThreadPoolFunc *func, void *arg); | ||
49 | |||
50 | void thread_pool_update_params(ThreadPool *pool, struct AioContext *ctx); | ||
51 | |||
52 | diff --git a/backends/tpm/tpm_backend.c b/backends/tpm/tpm_backend.c | ||
53 | index XXXXXXX..XXXXXXX 100644 | ||
54 | --- a/backends/tpm/tpm_backend.c | ||
55 | +++ b/backends/tpm/tpm_backend.c | ||
56 | @@ -XXX,XX +XXX,XX @@ bool tpm_backend_had_startup_error(TPMBackend *s) | ||
57 | |||
58 | void tpm_backend_deliver_request(TPMBackend *s, TPMBackendCmd *cmd) | ||
59 | { | ||
60 | - ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context()); | ||
61 | - | ||
62 | if (s->cmd != NULL) { | ||
63 | error_report("There is a TPM request pending"); | ||
64 | return; | ||
65 | @@ -XXX,XX +XXX,XX @@ void tpm_backend_deliver_request(TPMBackend *s, TPMBackendCmd *cmd) | ||
66 | |||
67 | s->cmd = cmd; | ||
68 | object_ref(OBJECT(s)); | ||
69 | - thread_pool_submit_aio(pool, tpm_backend_worker_thread, s, | ||
70 | + thread_pool_submit_aio(tpm_backend_worker_thread, s, | ||
71 | tpm_backend_request_completed, s); | ||
72 | } | ||
73 | |||
74 | diff --git a/block/file-posix.c b/block/file-posix.c | ||
75 | index XXXXXXX..XXXXXXX 100644 | ||
76 | --- a/block/file-posix.c | ||
77 | +++ b/block/file-posix.c | ||
78 | @@ -XXX,XX +XXX,XX @@ out: | ||
79 | |||
80 | static int coroutine_fn raw_thread_pool_submit(ThreadPoolFunc func, void *arg) | ||
81 | { | ||
82 | - /* @bs can be NULL, bdrv_get_aio_context() returns the main context then */ | ||
83 | - ThreadPool *pool = aio_get_thread_pool(qemu_get_current_aio_context()); | ||
84 | - return thread_pool_submit_co(pool, func, arg); | ||
85 | + return thread_pool_submit_co(func, arg); | ||
86 | } | ||
87 | |||
88 | /* | ||
89 | diff --git a/block/file-win32.c b/block/file-win32.c | ||
90 | index XXXXXXX..XXXXXXX 100644 | ||
91 | --- a/block/file-win32.c | ||
92 | +++ b/block/file-win32.c | ||
93 | @@ -XXX,XX +XXX,XX @@ static BlockAIOCB *paio_submit(BlockDriverState *bs, HANDLE hfile, | ||
94 | BlockCompletionFunc *cb, void *opaque, int type) | ||
95 | { | ||
96 | RawWin32AIOData *acb = g_new(RawWin32AIOData, 1); | ||
97 | - ThreadPool *pool; | ||
98 | |||
99 | acb->bs = bs; | ||
100 | acb->hfile = hfile; | ||
101 | @@ -XXX,XX +XXX,XX @@ static BlockAIOCB *paio_submit(BlockDriverState *bs, HANDLE hfile, | ||
102 | acb->aio_offset = offset; | ||
103 | |||
104 | trace_file_paio_submit(acb, opaque, offset, count, type); | ||
105 | - pool = aio_get_thread_pool(qemu_get_current_aio_context()); | ||
106 | - return thread_pool_submit_aio(pool, aio_worker, acb, cb, opaque); | ||
107 | + return thread_pool_submit_aio(aio_worker, acb, cb, opaque); | ||
108 | } | ||
109 | |||
110 | int qemu_ftruncate64(int fd, int64_t length) | ||
111 | diff --git a/block/qcow2-threads.c b/block/qcow2-threads.c | ||
112 | index XXXXXXX..XXXXXXX 100644 | ||
113 | --- a/block/qcow2-threads.c | ||
114 | +++ b/block/qcow2-threads.c | ||
115 | @@ -XXX,XX +XXX,XX @@ qcow2_co_process(BlockDriverState *bs, ThreadPoolFunc *func, void *arg) | ||
116 | { | ||
117 | int ret; | ||
118 | BDRVQcow2State *s = bs->opaque; | ||
119 | - ThreadPool *pool = aio_get_thread_pool(qemu_get_current_aio_context()); | ||
120 | |||
121 | qemu_co_mutex_lock(&s->lock); | ||
122 | while (s->nb_threads >= QCOW2_MAX_THREADS) { | ||
123 | @@ -XXX,XX +XXX,XX @@ qcow2_co_process(BlockDriverState *bs, ThreadPoolFunc *func, void *arg) | ||
124 | s->nb_threads++; | ||
125 | qemu_co_mutex_unlock(&s->lock); | ||
126 | |||
127 | - ret = thread_pool_submit_co(pool, func, arg); | ||
128 | + ret = thread_pool_submit_co(func, arg); | ||
129 | |||
130 | qemu_co_mutex_lock(&s->lock); | ||
131 | s->nb_threads--; | ||
132 | diff --git a/hw/9pfs/coth.c b/hw/9pfs/coth.c | ||
133 | index XXXXXXX..XXXXXXX 100644 | ||
134 | --- a/hw/9pfs/coth.c | ||
135 | +++ b/hw/9pfs/coth.c | ||
136 | @@ -XXX,XX +XXX,XX @@ static int coroutine_enter_func(void *arg) | ||
137 | void co_run_in_worker_bh(void *opaque) | ||
138 | { | ||
139 | Coroutine *co = opaque; | ||
140 | - thread_pool_submit_aio(aio_get_thread_pool(qemu_get_aio_context()), | ||
141 | - coroutine_enter_func, co, coroutine_enter_cb, co); | ||
142 | + thread_pool_submit_aio(coroutine_enter_func, co, coroutine_enter_cb, co); | ||
143 | } | ||
144 | diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c | ||
145 | index XXXXXXX..XXXXXXX 100644 | ||
146 | --- a/hw/ppc/spapr_nvdimm.c | ||
147 | +++ b/hw/ppc/spapr_nvdimm.c | ||
148 | @@ -XXX,XX +XXX,XX @@ static int spapr_nvdimm_flush_post_load(void *opaque, int version_id) | ||
149 | { | ||
150 | SpaprNVDIMMDevice *s_nvdimm = (SpaprNVDIMMDevice *)opaque; | ||
151 | SpaprNVDIMMDeviceFlushState *state; | ||
152 | - ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context()); | ||
153 | HostMemoryBackend *backend = MEMORY_BACKEND(PC_DIMM(s_nvdimm)->hostmem); | ||
154 | bool is_pmem = object_property_get_bool(OBJECT(backend), "pmem", NULL); | ||
155 | bool pmem_override = object_property_get_bool(OBJECT(s_nvdimm), | ||
156 | @@ -XXX,XX +XXX,XX @@ static int spapr_nvdimm_flush_post_load(void *opaque, int version_id) | ||
157 | } | ||
158 | |||
159 | QLIST_FOREACH(state, &s_nvdimm->pending_nvdimm_flush_states, node) { | ||
160 | - thread_pool_submit_aio(pool, flush_worker_cb, state, | ||
161 | + thread_pool_submit_aio(flush_worker_cb, state, | ||
162 | spapr_nvdimm_flush_completion_cb, state); | ||
163 | } | ||
164 | |||
165 | @@ -XXX,XX +XXX,XX @@ static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr, | ||
166 | PCDIMMDevice *dimm; | ||
167 | HostMemoryBackend *backend = NULL; | ||
168 | SpaprNVDIMMDeviceFlushState *state; | ||
169 | - ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context()); | ||
170 | int fd; | ||
171 | |||
172 | if (!drc || !drc->dev || | ||
173 | @@ -XXX,XX +XXX,XX @@ static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr, | ||
174 | |||
175 | state->drcidx = drc_index; | ||
176 | |||
177 | - thread_pool_submit_aio(pool, flush_worker_cb, state, | ||
178 | + thread_pool_submit_aio(flush_worker_cb, state, | ||
179 | spapr_nvdimm_flush_completion_cb, state); | ||
180 | |||
181 | continue_token = state->continue_token; | ||
182 | diff --git a/hw/virtio/virtio-pmem.c b/hw/virtio/virtio-pmem.c | ||
183 | index XXXXXXX..XXXXXXX 100644 | ||
184 | --- a/hw/virtio/virtio-pmem.c | ||
185 | +++ b/hw/virtio/virtio-pmem.c | ||
186 | @@ -XXX,XX +XXX,XX @@ static void virtio_pmem_flush(VirtIODevice *vdev, VirtQueue *vq) | ||
187 | VirtIODeviceRequest *req_data; | ||
188 | VirtIOPMEM *pmem = VIRTIO_PMEM(vdev); | ||
189 | HostMemoryBackend *backend = MEMORY_BACKEND(pmem->memdev); | ||
190 | - ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context()); | ||
191 | |||
192 | trace_virtio_pmem_flush_request(); | ||
193 | req_data = virtqueue_pop(vq, sizeof(VirtIODeviceRequest)); | ||
194 | @@ -XXX,XX +XXX,XX @@ static void virtio_pmem_flush(VirtIODevice *vdev, VirtQueue *vq) | ||
195 | req_data->fd = memory_region_get_fd(&backend->mr); | ||
196 | req_data->pmem = pmem; | ||
197 | req_data->vdev = vdev; | ||
198 | - thread_pool_submit_aio(pool, worker_cb, req_data, done_cb, req_data); | ||
199 | + thread_pool_submit_aio(worker_cb, req_data, done_cb, req_data); | ||
200 | } | ||
201 | |||
202 | static void virtio_pmem_get_config(VirtIODevice *vdev, uint8_t *config) | ||
203 | diff --git a/scsi/pr-manager.c b/scsi/pr-manager.c | ||
204 | index XXXXXXX..XXXXXXX 100644 | ||
205 | --- a/scsi/pr-manager.c | ||
206 | +++ b/scsi/pr-manager.c | ||
207 | @@ -XXX,XX +XXX,XX @@ static int pr_manager_worker(void *opaque) | ||
208 | int coroutine_fn pr_manager_execute(PRManager *pr_mgr, AioContext *ctx, int fd, | ||
209 | struct sg_io_hdr *hdr) | ||
210 | { | ||
211 | - ThreadPool *pool = aio_get_thread_pool(ctx); | ||
212 | PRManagerData data = { | ||
213 | .pr_mgr = pr_mgr, | ||
214 | .fd = fd, | ||
215 | @@ -XXX,XX +XXX,XX @@ int coroutine_fn pr_manager_execute(PRManager *pr_mgr, AioContext *ctx, int fd, | ||
216 | |||
217 | /* The matching object_unref is in pr_manager_worker. */ | ||
218 | object_ref(OBJECT(pr_mgr)); | ||
219 | - return thread_pool_submit_co(pool, pr_manager_worker, &data); | ||
220 | + return thread_pool_submit_co(pr_manager_worker, &data); | ||
221 | } | ||
222 | |||
223 | bool pr_manager_is_connected(PRManager *pr_mgr) | ||
224 | diff --git a/scsi/qemu-pr-helper.c b/scsi/qemu-pr-helper.c | ||
225 | index XXXXXXX..XXXXXXX 100644 | ||
226 | --- a/scsi/qemu-pr-helper.c | ||
227 | +++ b/scsi/qemu-pr-helper.c | ||
228 | @@ -XXX,XX +XXX,XX @@ static int do_sgio_worker(void *opaque) | ||
229 | static int do_sgio(int fd, const uint8_t *cdb, uint8_t *sense, | ||
230 | uint8_t *buf, int *sz, int dir) | ||
231 | { | ||
232 | - ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context()); | ||
233 | int r; | ||
234 | |||
235 | PRHelperSGIOData data = { | ||
236 | @@ -XXX,XX +XXX,XX @@ static int do_sgio(int fd, const uint8_t *cdb, uint8_t *sense, | ||
237 | .dir = dir, | ||
238 | }; | ||
239 | |||
240 | - r = thread_pool_submit_co(pool, do_sgio_worker, &data); | ||
241 | + r = thread_pool_submit_co(do_sgio_worker, &data); | ||
242 | *sz = data.sz; | ||
243 | return r; | ||
244 | } | ||
245 | diff --git a/tests/unit/test-thread-pool.c b/tests/unit/test-thread-pool.c | ||
246 | index XXXXXXX..XXXXXXX 100644 | ||
247 | --- a/tests/unit/test-thread-pool.c | ||
248 | +++ b/tests/unit/test-thread-pool.c | ||
249 | @@ -XXX,XX +XXX,XX @@ | ||
250 | #include "qemu/main-loop.h" | ||
251 | |||
252 | static AioContext *ctx; | ||
253 | -static ThreadPool *pool; | ||
254 | static int active; | ||
255 | |||
256 | typedef struct { | ||
257 | @@ -XXX,XX +XXX,XX @@ static void done_cb(void *opaque, int ret) | ||
258 | static void test_submit(void) | ||
259 | { | ||
260 | WorkerTestData data = { .n = 0 }; | ||
261 | - thread_pool_submit(pool, worker_cb, &data); | ||
262 | + thread_pool_submit(worker_cb, &data); | ||
263 | while (data.n == 0) { | ||
264 | aio_poll(ctx, true); | ||
265 | } | ||
266 | @@ -XXX,XX +XXX,XX @@ static void test_submit(void) | ||
267 | static void test_submit_aio(void) | ||
268 | { | ||
269 | WorkerTestData data = { .n = 0, .ret = -EINPROGRESS }; | ||
270 | - data.aiocb = thread_pool_submit_aio(pool, worker_cb, &data, | ||
271 | + data.aiocb = thread_pool_submit_aio(worker_cb, &data, | ||
272 | done_cb, &data); | ||
273 | |||
274 | /* The callbacks are not called until after the first wait. */ | ||
275 | @@ -XXX,XX +XXX,XX @@ static void co_test_cb(void *opaque) | ||
276 | active = 1; | ||
277 | data->n = 0; | ||
278 | data->ret = -EINPROGRESS; | ||
279 | - thread_pool_submit_co(pool, worker_cb, data); | ||
280 | + thread_pool_submit_co(worker_cb, data); | ||
281 | |||
282 | /* The test continues in test_submit_co, after qemu_coroutine_enter... */ | ||
283 | |||
284 | @@ -XXX,XX +XXX,XX @@ static void test_submit_many(void) | ||
285 | for (i = 0; i < 100; i++) { | ||
286 | data[i].n = 0; | ||
287 | data[i].ret = -EINPROGRESS; | ||
288 | - thread_pool_submit_aio(pool, worker_cb, &data[i], done_cb, &data[i]); | ||
289 | + thread_pool_submit_aio(worker_cb, &data[i], done_cb, &data[i]); | ||
290 | } | ||
291 | |||
292 | active = 100; | ||
293 | @@ -XXX,XX +XXX,XX @@ static void do_test_cancel(bool sync) | ||
294 | for (i = 0; i < 100; i++) { | ||
295 | data[i].n = 0; | ||
296 | data[i].ret = -EINPROGRESS; | ||
297 | - data[i].aiocb = thread_pool_submit_aio(pool, long_cb, &data[i], | ||
298 | + data[i].aiocb = thread_pool_submit_aio(long_cb, &data[i], | ||
299 | done_cb, &data[i]); | ||
300 | } | ||
301 | |||
302 | @@ -XXX,XX +XXX,XX @@ int main(int argc, char **argv) | ||
303 | { | ||
304 | qemu_init_main_loop(&error_abort); | ||
305 | ctx = qemu_get_current_aio_context(); | ||
306 | - pool = aio_get_thread_pool(ctx); | ||
307 | |||
308 | g_test_init(&argc, &argv, NULL); | ||
309 | g_test_add_func("/thread-pool/submit", test_submit); | ||
310 | diff --git a/util/thread-pool.c b/util/thread-pool.c | ||
311 | index XXXXXXX..XXXXXXX 100644 | ||
312 | --- a/util/thread-pool.c | ||
313 | +++ b/util/thread-pool.c | ||
314 | @@ -XXX,XX +XXX,XX @@ static const AIOCBInfo thread_pool_aiocb_info = { | ||
315 | .get_aio_context = thread_pool_get_aio_context, | ||
316 | }; | ||
317 | |||
318 | -BlockAIOCB *thread_pool_submit_aio(ThreadPool *pool, | ||
319 | - ThreadPoolFunc *func, void *arg, | ||
320 | - BlockCompletionFunc *cb, void *opaque) | ||
321 | +BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg, | ||
322 | + BlockCompletionFunc *cb, void *opaque) | ||
323 | { | ||
324 | ThreadPoolElement *req; | ||
325 | + AioContext *ctx = qemu_get_current_aio_context(); | ||
326 | + ThreadPool *pool = aio_get_thread_pool(ctx); | ||
327 | |||
328 | /* Assert that the thread submitting work is the same running the pool */ | ||
329 | assert(pool->ctx == qemu_get_current_aio_context()); | ||
330 | @@ -XXX,XX +XXX,XX @@ static void thread_pool_co_cb(void *opaque, int ret) | ||
331 | aio_co_wake(co->co); | ||
332 | } | ||
333 | |||
334 | -int coroutine_fn thread_pool_submit_co(ThreadPool *pool, ThreadPoolFunc *func, | ||
335 | - void *arg) | ||
336 | +int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg) | ||
337 | { | ||
338 | ThreadPoolCo tpc = { .co = qemu_coroutine_self(), .ret = -EINPROGRESS }; | ||
339 | assert(qemu_in_coroutine()); | ||
340 | - thread_pool_submit_aio(pool, func, arg, thread_pool_co_cb, &tpc); | ||
341 | + thread_pool_submit_aio(func, arg, thread_pool_co_cb, &tpc); | ||
342 | qemu_coroutine_yield(); | ||
343 | return tpc.ret; | ||
344 | } | ||
345 | |||
346 | -void thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func, void *arg) | ||
347 | +void thread_pool_submit(ThreadPoolFunc *func, void *arg) | ||
348 | { | ||
349 | - thread_pool_submit_aio(pool, func, arg, NULL, NULL); | ||
350 | + thread_pool_submit_aio(func, arg, NULL, NULL); | ||
351 | } | ||
352 | |||
353 | void thread_pool_update_params(ThreadPool *pool, AioContext *ctx) | ||
354 | -- | ||
355 | 2.40.0 | diff view generated by jsdifflib |
New patch | |||
---|---|---|---|
1 | 1 | From: Paolo Bonzini <pbonzini@redhat.com> | |
2 | |||
3 | Functions that can do I/O are prime candidates for being coroutine_fns. Make the | ||
4 | change for those that are themselves called only from coroutine_fns. | ||
5 | |||
6 | In addition, coroutine_fns should do I/O using bdrv_co_*() functions, for | ||
7 | which it is required to hold the BlockDriverState graph lock. So also nnotate | ||
8 | functions on the I/O path with TSA attributes, making it possible to | ||
9 | switch them to use bdrv_co_*() functions. | ||
10 | |||
11 | Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> | ||
12 | Message-Id: <20230309084456.304669-2-pbonzini@redhat.com> | ||
13 | Reviewed-by: Kevin Wolf <kwolf@redhat.com> | ||
14 | Signed-off-by: Kevin Wolf <kwolf@redhat.com> | ||
15 | --- | ||
16 | block/vvfat.c | 58 ++++++++++++++++++++++++++------------------------- | ||
17 | 1 file changed, 30 insertions(+), 28 deletions(-) | ||
18 | |||
19 | diff --git a/block/vvfat.c b/block/vvfat.c | ||
20 | index XXXXXXX..XXXXXXX 100644 | ||
21 | --- a/block/vvfat.c | ||
22 | +++ b/block/vvfat.c | ||
23 | @@ -XXX,XX +XXX,XX @@ static BDRVVVFATState *vvv = NULL; | ||
24 | #endif | ||
25 | |||
26 | static int enable_write_target(BlockDriverState *bs, Error **errp); | ||
27 | -static int is_consistent(BDRVVVFATState *s); | ||
28 | +static int coroutine_fn is_consistent(BDRVVVFATState *s); | ||
29 | |||
30 | static QemuOptsList runtime_opts = { | ||
31 | .name = "vvfat", | ||
32 | @@ -XXX,XX +XXX,XX @@ static void print_mapping(const mapping_t* mapping) | ||
33 | } | ||
34 | #endif | ||
35 | |||
36 | -static int vvfat_read(BlockDriverState *bs, int64_t sector_num, | ||
37 | - uint8_t *buf, int nb_sectors) | ||
38 | +static int coroutine_fn GRAPH_RDLOCK | ||
39 | +vvfat_read(BlockDriverState *bs, int64_t sector_num, uint8_t *buf, int nb_sectors) | ||
40 | { | ||
41 | BDRVVVFATState *s = bs->opaque; | ||
42 | int i; | ||
43 | @@ -XXX,XX +XXX,XX @@ static int vvfat_read(BlockDriverState *bs, int64_t sector_num, | ||
44 | DLOG(fprintf(stderr, "sectors %" PRId64 "+%" PRId64 | ||
45 | " allocated\n", sector_num, | ||
46 | n >> BDRV_SECTOR_BITS)); | ||
47 | - if (bdrv_pread(s->qcow, sector_num * BDRV_SECTOR_SIZE, n, | ||
48 | - buf + i * 0x200, 0) < 0) { | ||
49 | + if (bdrv_co_pread(s->qcow, sector_num * BDRV_SECTOR_SIZE, n, | ||
50 | + buf + i * 0x200, 0) < 0) { | ||
51 | return -1; | ||
52 | } | ||
53 | i += (n >> BDRV_SECTOR_BITS) - 1; | ||
54 | @@ -XXX,XX +XXX,XX @@ static int vvfat_read(BlockDriverState *bs, int64_t sector_num, | ||
55 | return 0; | ||
56 | } | ||
57 | |||
58 | -static int coroutine_fn | ||
59 | +static int coroutine_fn GRAPH_RDLOCK | ||
60 | vvfat_co_preadv(BlockDriverState *bs, int64_t offset, int64_t bytes, | ||
61 | QEMUIOVector *qiov, BdrvRequestFlags flags) | ||
62 | { | ||
63 | @@ -XXX,XX +XXX,XX @@ static inline uint32_t modified_fat_get(BDRVVVFATState* s, | ||
64 | } | ||
65 | } | ||
66 | |||
67 | -static inline bool cluster_was_modified(BDRVVVFATState *s, | ||
68 | - uint32_t cluster_num) | ||
69 | +static inline bool coroutine_fn GRAPH_RDLOCK | ||
70 | +cluster_was_modified(BDRVVVFATState *s, uint32_t cluster_num) | ||
71 | { | ||
72 | int was_modified = 0; | ||
73 | int i; | ||
74 | @@ -XXX,XX +XXX,XX @@ typedef enum { | ||
75 | * Further, the files/directories handled by this function are | ||
76 | * assumed to be *not* deleted (and *only* those). | ||
77 | */ | ||
78 | -static uint32_t get_cluster_count_for_direntry(BDRVVVFATState* s, | ||
79 | - direntry_t* direntry, const char* path) | ||
80 | +static uint32_t coroutine_fn GRAPH_RDLOCK | ||
81 | +get_cluster_count_for_direntry(BDRVVVFATState* s, direntry_t* direntry, const char* path) | ||
82 | { | ||
83 | /* | ||
84 | * This is a little bit tricky: | ||
85 | @@ -XXX,XX +XXX,XX @@ static uint32_t get_cluster_count_for_direntry(BDRVVVFATState* s, | ||
86 | if (res) { | ||
87 | return -1; | ||
88 | } | ||
89 | - res = bdrv_pwrite(s->qcow, offset * BDRV_SECTOR_SIZE, | ||
90 | - BDRV_SECTOR_SIZE, s->cluster_buffer, | ||
91 | - 0); | ||
92 | + res = bdrv_co_pwrite(s->qcow, offset * BDRV_SECTOR_SIZE, | ||
93 | + BDRV_SECTOR_SIZE, s->cluster_buffer, | ||
94 | + 0); | ||
95 | if (res < 0) { | ||
96 | return -2; | ||
97 | } | ||
98 | @@ -XXX,XX +XXX,XX @@ static uint32_t get_cluster_count_for_direntry(BDRVVVFATState* s, | ||
99 | * It returns 0 upon inconsistency or error, and the number of clusters | ||
100 | * used by the directory, its subdirectories and their files. | ||
101 | */ | ||
102 | -static int check_directory_consistency(BDRVVVFATState *s, | ||
103 | - int cluster_num, const char* path) | ||
104 | +static int coroutine_fn GRAPH_RDLOCK | ||
105 | +check_directory_consistency(BDRVVVFATState *s, int cluster_num, const char* path) | ||
106 | { | ||
107 | int ret = 0; | ||
108 | unsigned char* cluster = g_malloc(s->cluster_size); | ||
109 | @@ -XXX,XX +XXX,XX @@ DLOG(fprintf(stderr, "check direntry %d:\n", i); print_direntry(direntries + i)) | ||
110 | } | ||
111 | |||
112 | /* returns 1 on success */ | ||
113 | -static int is_consistent(BDRVVVFATState* s) | ||
114 | +static int coroutine_fn GRAPH_RDLOCK | ||
115 | +is_consistent(BDRVVVFATState* s) | ||
116 | { | ||
117 | int i, check; | ||
118 | int used_clusters_count = 0; | ||
119 | @@ -XXX,XX +XXX,XX @@ static int commit_mappings(BDRVVVFATState* s, | ||
120 | return 0; | ||
121 | } | ||
122 | |||
123 | -static int commit_direntries(BDRVVVFATState* s, | ||
124 | - int dir_index, int parent_mapping_index) | ||
125 | +static int coroutine_fn GRAPH_RDLOCK | ||
126 | +commit_direntries(BDRVVVFATState* s, int dir_index, int parent_mapping_index) | ||
127 | { | ||
128 | direntry_t* direntry = array_get(&(s->directory), dir_index); | ||
129 | uint32_t first_cluster = dir_index == 0 ? 0 : begin_of_direntry(direntry); | ||
130 | @@ -XXX,XX +XXX,XX @@ static int commit_direntries(BDRVVVFATState* s, | ||
131 | |||
132 | /* commit one file (adjust contents, adjust mapping), | ||
133 | return first_mapping_index */ | ||
134 | -static int commit_one_file(BDRVVVFATState* s, | ||
135 | - int dir_index, uint32_t offset) | ||
136 | +static int coroutine_fn GRAPH_RDLOCK | ||
137 | +commit_one_file(BDRVVVFATState* s, int dir_index, uint32_t offset) | ||
138 | { | ||
139 | direntry_t* direntry = array_get(&(s->directory), dir_index); | ||
140 | uint32_t c = begin_of_direntry(direntry); | ||
141 | @@ -XXX,XX +XXX,XX @@ static int handle_renames_and_mkdirs(BDRVVVFATState* s) | ||
142 | /* | ||
143 | * TODO: make sure that the short name is not matching *another* file | ||
144 | */ | ||
145 | -static int handle_commits(BDRVVVFATState* s) | ||
146 | +static int coroutine_fn GRAPH_RDLOCK handle_commits(BDRVVVFATState* s) | ||
147 | { | ||
148 | int i, fail = 0; | ||
149 | |||
150 | @@ -XXX,XX +XXX,XX @@ static int handle_deletes(BDRVVVFATState* s) | ||
151 | * - recurse direntries from root (using bs->bdrv_pread) | ||
152 | * - delete files corresponding to mappings marked as deleted | ||
153 | */ | ||
154 | -static int do_commit(BDRVVVFATState* s) | ||
155 | +static int coroutine_fn GRAPH_RDLOCK do_commit(BDRVVVFATState* s) | ||
156 | { | ||
157 | int ret = 0; | ||
158 | |||
159 | @@ -XXX,XX +XXX,XX @@ DLOG(checkpoint()); | ||
160 | return 0; | ||
161 | } | ||
162 | |||
163 | -static int try_commit(BDRVVVFATState* s) | ||
164 | +static int coroutine_fn GRAPH_RDLOCK try_commit(BDRVVVFATState* s) | ||
165 | { | ||
166 | vvfat_close_current_file(s); | ||
167 | DLOG(checkpoint()); | ||
168 | @@ -XXX,XX +XXX,XX @@ DLOG(checkpoint()); | ||
169 | return do_commit(s); | ||
170 | } | ||
171 | |||
172 | -static int vvfat_write(BlockDriverState *bs, int64_t sector_num, | ||
173 | - const uint8_t *buf, int nb_sectors) | ||
174 | +static int coroutine_fn GRAPH_RDLOCK | ||
175 | +vvfat_write(BlockDriverState *bs, int64_t sector_num, | ||
176 | + const uint8_t *buf, int nb_sectors) | ||
177 | { | ||
178 | BDRVVVFATState *s = bs->opaque; | ||
179 | int i, ret; | ||
180 | @@ -XXX,XX +XXX,XX @@ DLOG(checkpoint()); | ||
181 | * Use qcow backend. Commit later. | ||
182 | */ | ||
183 | DLOG(fprintf(stderr, "Write to qcow backend: %d + %d\n", (int)sector_num, nb_sectors)); | ||
184 | - ret = bdrv_pwrite(s->qcow, sector_num * BDRV_SECTOR_SIZE, | ||
185 | - nb_sectors * BDRV_SECTOR_SIZE, buf, 0); | ||
186 | + ret = bdrv_co_pwrite(s->qcow, sector_num * BDRV_SECTOR_SIZE, | ||
187 | + nb_sectors * BDRV_SECTOR_SIZE, buf, 0); | ||
188 | if (ret < 0) { | ||
189 | fprintf(stderr, "Error writing to qcow backend\n"); | ||
190 | return ret; | ||
191 | @@ -XXX,XX +XXX,XX @@ DLOG(checkpoint()); | ||
192 | return 0; | ||
193 | } | ||
194 | |||
195 | -static int coroutine_fn | ||
196 | +static int coroutine_fn GRAPH_RDLOCK | ||
197 | vvfat_co_pwritev(BlockDriverState *bs, int64_t offset, int64_t bytes, | ||
198 | QEMUIOVector *qiov, BdrvRequestFlags flags) | ||
199 | { | ||
200 | -- | ||
201 | 2.40.0 | diff view generated by jsdifflib |
New patch | |||
---|---|---|---|
1 | From: Paolo Bonzini <pbonzini@redhat.com> | ||
1 | 2 | ||
3 | Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> | ||
4 | Message-Id: <20230309084456.304669-3-pbonzini@redhat.com> | ||
5 | Reviewed-by: Kevin Wolf <kwolf@redhat.com> | ||
6 | Signed-off-by: Kevin Wolf <kwolf@redhat.com> | ||
7 | --- | ||
8 | block/blkdebug.c | 4 ++-- | ||
9 | 1 file changed, 2 insertions(+), 2 deletions(-) | ||
10 | |||
11 | diff --git a/block/blkdebug.c b/block/blkdebug.c | ||
12 | index XXXXXXX..XXXXXXX 100644 | ||
13 | --- a/block/blkdebug.c | ||
14 | +++ b/block/blkdebug.c | ||
15 | @@ -XXX,XX +XXX,XX @@ out: | ||
16 | return ret; | ||
17 | } | ||
18 | |||
19 | -static int rule_check(BlockDriverState *bs, uint64_t offset, uint64_t bytes, | ||
20 | - BlkdebugIOType iotype) | ||
21 | +static int coroutine_fn rule_check(BlockDriverState *bs, uint64_t offset, | ||
22 | + uint64_t bytes, BlkdebugIOType iotype) | ||
23 | { | ||
24 | BDRVBlkdebugState *s = bs->opaque; | ||
25 | BlkdebugRule *rule = NULL; | ||
26 | -- | ||
27 | 2.40.0 | diff view generated by jsdifflib |
New patch | |||
---|---|---|---|
1 | From: Paolo Bonzini <pbonzini@redhat.com> | ||
1 | 2 | ||
3 | mirror_flush calls a mixed function blk_flush but it is only called | ||
4 | from mirror_run; so call the coroutine version and make mirror_flush | ||
5 | a coroutine_fn too. | ||
6 | |||
7 | Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> | ||
8 | Message-Id: <20230309084456.304669-4-pbonzini@redhat.com> | ||
9 | Reviewed-by: Kevin Wolf <kwolf@redhat.com> | ||
10 | Signed-off-by: Kevin Wolf <kwolf@redhat.com> | ||
11 | --- | ||
12 | block/mirror.c | 4 ++-- | ||
13 | 1 file changed, 2 insertions(+), 2 deletions(-) | ||
14 | |||
15 | diff --git a/block/mirror.c b/block/mirror.c | ||
16 | index XXXXXXX..XXXXXXX 100644 | ||
17 | --- a/block/mirror.c | ||
18 | +++ b/block/mirror.c | ||
19 | @@ -XXX,XX +XXX,XX @@ static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s) | ||
20 | /* Called when going out of the streaming phase to flush the bulk of the | ||
21 | * data to the medium, or just before completing. | ||
22 | */ | ||
23 | -static int mirror_flush(MirrorBlockJob *s) | ||
24 | +static int coroutine_fn mirror_flush(MirrorBlockJob *s) | ||
25 | { | ||
26 | - int ret = blk_flush(s->target); | ||
27 | + int ret = blk_co_flush(s->target); | ||
28 | if (ret < 0) { | ||
29 | if (mirror_error_action(s, false, -ret) == BLOCK_ERROR_ACTION_REPORT) { | ||
30 | s->ret = ret; | ||
31 | -- | ||
32 | 2.40.0 | diff view generated by jsdifflib |
New patch | |||
---|---|---|---|
1 | From: Paolo Bonzini <pbonzini@redhat.com> | ||
1 | 2 | ||
3 | Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> | ||
4 | Reviewed-by: Kevin Wolf <kwolf@redhat.com> | ||
5 | Signed-off-by: Kevin Wolf <kwolf@redhat.com> | ||
6 | --- | ||
7 | nbd/server.c | 48 ++++++++++++++++++++++++------------------------ | ||
8 | 1 file changed, 24 insertions(+), 24 deletions(-) | ||
9 | |||
10 | diff --git a/nbd/server.c b/nbd/server.c | ||
11 | index XXXXXXX..XXXXXXX 100644 | ||
12 | --- a/nbd/server.c | ||
13 | +++ b/nbd/server.c | ||
14 | @@ -XXX,XX +XXX,XX @@ nbd_read_eof(NBDClient *client, void *buffer, size_t size, Error **errp) | ||
15 | return 1; | ||
16 | } | ||
17 | |||
18 | -static int nbd_receive_request(NBDClient *client, NBDRequest *request, | ||
19 | - Error **errp) | ||
20 | +static int coroutine_fn nbd_receive_request(NBDClient *client, NBDRequest *request, | ||
21 | + Error **errp) | ||
22 | { | ||
23 | uint8_t buf[NBD_REQUEST_SIZE]; | ||
24 | uint32_t magic; | ||
25 | @@ -XXX,XX +XXX,XX @@ static inline void set_be_simple_reply(NBDSimpleReply *reply, uint64_t error, | ||
26 | stq_be_p(&reply->handle, handle); | ||
27 | } | ||
28 | |||
29 | -static int nbd_co_send_simple_reply(NBDClient *client, | ||
30 | - uint64_t handle, | ||
31 | - uint32_t error, | ||
32 | - void *data, | ||
33 | - size_t len, | ||
34 | - Error **errp) | ||
35 | +static int coroutine_fn nbd_co_send_simple_reply(NBDClient *client, | ||
36 | + uint64_t handle, | ||
37 | + uint32_t error, | ||
38 | + void *data, | ||
39 | + size_t len, | ||
40 | + Error **errp) | ||
41 | { | ||
42 | NBDSimpleReply reply; | ||
43 | int nbd_err = system_errno_to_nbd_errno(error); | ||
44 | @@ -XXX,XX +XXX,XX @@ static int coroutine_fn nbd_co_send_sparse_read(NBDClient *client, | ||
45 | stl_be_p(&chunk.length, pnum); | ||
46 | ret = nbd_co_send_iov(client, iov, 1, errp); | ||
47 | } else { | ||
48 | - ret = blk_pread(exp->common.blk, offset + progress, pnum, | ||
49 | - data + progress, 0); | ||
50 | + ret = blk_co_pread(exp->common.blk, offset + progress, pnum, | ||
51 | + data + progress, 0); | ||
52 | if (ret < 0) { | ||
53 | error_setg_errno(errp, -ret, "reading from file failed"); | ||
54 | break; | ||
55 | @@ -XXX,XX +XXX,XX @@ static int coroutine_fn blockalloc_to_extents(BlockBackend *blk, | ||
56 | * @ea is converted to BE by the function | ||
57 | * @last controls whether NBD_REPLY_FLAG_DONE is sent. | ||
58 | */ | ||
59 | -static int nbd_co_send_extents(NBDClient *client, uint64_t handle, | ||
60 | - NBDExtentArray *ea, | ||
61 | - bool last, uint32_t context_id, Error **errp) | ||
62 | +static int coroutine_fn | ||
63 | +nbd_co_send_extents(NBDClient *client, uint64_t handle, NBDExtentArray *ea, | ||
64 | + bool last, uint32_t context_id, Error **errp) | ||
65 | { | ||
66 | NBDStructuredMeta chunk; | ||
67 | struct iovec iov[] = { | ||
68 | @@ -XXX,XX +XXX,XX @@ static void bitmap_to_extents(BdrvDirtyBitmap *bitmap, | ||
69 | bdrv_dirty_bitmap_unlock(bitmap); | ||
70 | } | ||
71 | |||
72 | -static int nbd_co_send_bitmap(NBDClient *client, uint64_t handle, | ||
73 | - BdrvDirtyBitmap *bitmap, uint64_t offset, | ||
74 | - uint32_t length, bool dont_fragment, bool last, | ||
75 | - uint32_t context_id, Error **errp) | ||
76 | +static int coroutine_fn nbd_co_send_bitmap(NBDClient *client, uint64_t handle, | ||
77 | + BdrvDirtyBitmap *bitmap, uint64_t offset, | ||
78 | + uint32_t length, bool dont_fragment, bool last, | ||
79 | + uint32_t context_id, Error **errp) | ||
80 | { | ||
81 | unsigned int nb_extents = dont_fragment ? 1 : NBD_MAX_BLOCK_STATUS_EXTENTS; | ||
82 | g_autoptr(NBDExtentArray) ea = nbd_extent_array_new(nb_extents); | ||
83 | @@ -XXX,XX +XXX,XX @@ static int nbd_co_send_bitmap(NBDClient *client, uint64_t handle, | ||
84 | * to the client (although the caller may still need to disconnect after | ||
85 | * reporting the error). | ||
86 | */ | ||
87 | -static int nbd_co_receive_request(NBDRequestData *req, NBDRequest *request, | ||
88 | - Error **errp) | ||
89 | +static int coroutine_fn nbd_co_receive_request(NBDRequestData *req, NBDRequest *request, | ||
90 | + Error **errp) | ||
91 | { | ||
92 | NBDClient *client = req->client; | ||
93 | int valid_flags; | ||
94 | @@ -XXX,XX +XXX,XX @@ static coroutine_fn int nbd_do_cmd_read(NBDClient *client, NBDRequest *request, | ||
95 | data, request->len, errp); | ||
96 | } | ||
97 | |||
98 | - ret = blk_pread(exp->common.blk, request->from, request->len, data, 0); | ||
99 | + ret = blk_co_pread(exp->common.blk, request->from, request->len, data, 0); | ||
100 | if (ret < 0) { | ||
101 | return nbd_send_generic_reply(client, request->handle, ret, | ||
102 | "reading from file failed", errp); | ||
103 | @@ -XXX,XX +XXX,XX @@ static coroutine_fn int nbd_handle_request(NBDClient *client, | ||
104 | if (request->flags & NBD_CMD_FLAG_FUA) { | ||
105 | flags |= BDRV_REQ_FUA; | ||
106 | } | ||
107 | - ret = blk_pwrite(exp->common.blk, request->from, request->len, data, | ||
108 | - flags); | ||
109 | + ret = blk_co_pwrite(exp->common.blk, request->from, request->len, data, | ||
110 | + flags); | ||
111 | return nbd_send_generic_reply(client, request->handle, ret, | ||
112 | "writing to file failed", errp); | ||
113 | |||
114 | @@ -XXX,XX +XXX,XX @@ static coroutine_fn int nbd_handle_request(NBDClient *client, | ||
115 | if (request->flags & NBD_CMD_FLAG_FAST_ZERO) { | ||
116 | flags |= BDRV_REQ_NO_FALLBACK; | ||
117 | } | ||
118 | - ret = blk_pwrite_zeroes(exp->common.blk, request->from, request->len, | ||
119 | - flags); | ||
120 | + ret = blk_co_pwrite_zeroes(exp->common.blk, request->from, request->len, | ||
121 | + flags); | ||
122 | return nbd_send_generic_reply(client, request->handle, ret, | ||
123 | "writing to file failed", errp); | ||
124 | |||
125 | -- | ||
126 | 2.40.0 | diff view generated by jsdifflib |
New patch | |||
---|---|---|---|
1 | From: Paolo Bonzini <pbonzini@redhat.com> | ||
1 | 2 | ||
3 | Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> | ||
4 | Message-Id: <20230309084456.304669-6-pbonzini@redhat.com> | ||
5 | Reviewed-by: Kevin Wolf <kwolf@redhat.com> | ||
6 | Signed-off-by: Kevin Wolf <kwolf@redhat.com> | ||
7 | --- | ||
8 | hw/9pfs/9p.h | 4 ++-- | ||
9 | hw/9pfs/codir.c | 6 +++--- | ||
10 | 2 files changed, 5 insertions(+), 5 deletions(-) | ||
11 | |||
12 | diff --git a/hw/9pfs/9p.h b/hw/9pfs/9p.h | ||
13 | index XXXXXXX..XXXXXXX 100644 | ||
14 | --- a/hw/9pfs/9p.h | ||
15 | +++ b/hw/9pfs/9p.h | ||
16 | @@ -XXX,XX +XXX,XX @@ typedef struct V9fsDir { | ||
17 | QemuMutex readdir_mutex_L; | ||
18 | } V9fsDir; | ||
19 | |||
20 | -static inline void v9fs_readdir_lock(V9fsDir *dir) | ||
21 | +static inline void coroutine_fn v9fs_readdir_lock(V9fsDir *dir) | ||
22 | { | ||
23 | if (dir->proto_version == V9FS_PROTO_2000U) { | ||
24 | qemu_co_mutex_lock(&dir->readdir_mutex_u); | ||
25 | @@ -XXX,XX +XXX,XX @@ static inline void v9fs_readdir_lock(V9fsDir *dir) | ||
26 | } | ||
27 | } | ||
28 | |||
29 | -static inline void v9fs_readdir_unlock(V9fsDir *dir) | ||
30 | +static inline void coroutine_fn v9fs_readdir_unlock(V9fsDir *dir) | ||
31 | { | ||
32 | if (dir->proto_version == V9FS_PROTO_2000U) { | ||
33 | qemu_co_mutex_unlock(&dir->readdir_mutex_u); | ||
34 | diff --git a/hw/9pfs/codir.c b/hw/9pfs/codir.c | ||
35 | index XXXXXXX..XXXXXXX 100644 | ||
36 | --- a/hw/9pfs/codir.c | ||
37 | +++ b/hw/9pfs/codir.c | ||
38 | @@ -XXX,XX +XXX,XX @@ int coroutine_fn v9fs_co_readdir(V9fsPDU *pdu, V9fsFidState *fidp, | ||
39 | * | ||
40 | * See v9fs_co_readdir_many() (as its only user) below for details. | ||
41 | */ | ||
42 | -static int do_readdir_many(V9fsPDU *pdu, V9fsFidState *fidp, | ||
43 | - struct V9fsDirEnt **entries, off_t offset, | ||
44 | - int32_t maxsize, bool dostat) | ||
45 | +static int coroutine_fn | ||
46 | +do_readdir_many(V9fsPDU *pdu, V9fsFidState *fidp, struct V9fsDirEnt **entries, | ||
47 | + off_t offset, int32_t maxsize, bool dostat) | ||
48 | { | ||
49 | V9fsState *s = pdu->s; | ||
50 | V9fsString name; | ||
51 | -- | ||
52 | 2.40.0 | diff view generated by jsdifflib |
New patch | |||
---|---|---|---|
1 | From: Paolo Bonzini <pbonzini@redhat.com> | ||
1 | 2 | ||
3 | do_sgio can suspend via the coroutine function thread_pool_submit_co, so it | ||
4 | has to be coroutine_fn as well---and the same is true of all its direct and | ||
5 | indirect callers. | ||
6 | |||
7 | Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> | ||
8 | Message-Id: <20230309084456.304669-7-pbonzini@redhat.com> | ||
9 | Reviewed-by: Kevin Wolf <kwolf@redhat.com> | ||
10 | Signed-off-by: Kevin Wolf <kwolf@redhat.com> | ||
11 | --- | ||
12 | scsi/qemu-pr-helper.c | 22 +++++++++++----------- | ||
13 | 1 file changed, 11 insertions(+), 11 deletions(-) | ||
14 | |||
15 | diff --git a/scsi/qemu-pr-helper.c b/scsi/qemu-pr-helper.c | ||
16 | index XXXXXXX..XXXXXXX 100644 | ||
17 | --- a/scsi/qemu-pr-helper.c | ||
18 | +++ b/scsi/qemu-pr-helper.c | ||
19 | @@ -XXX,XX +XXX,XX @@ static int do_sgio_worker(void *opaque) | ||
20 | return status; | ||
21 | } | ||
22 | |||
23 | -static int do_sgio(int fd, const uint8_t *cdb, uint8_t *sense, | ||
24 | - uint8_t *buf, int *sz, int dir) | ||
25 | +static int coroutine_fn do_sgio(int fd, const uint8_t *cdb, uint8_t *sense, | ||
26 | + uint8_t *buf, int *sz, int dir) | ||
27 | { | ||
28 | int r; | ||
29 | |||
30 | @@ -XXX,XX +XXX,XX @@ static SCSISense mpath_generic_sense(int r) | ||
31 | } | ||
32 | } | ||
33 | |||
34 | -static int mpath_reconstruct_sense(int fd, int r, uint8_t *sense) | ||
35 | +static int coroutine_fn mpath_reconstruct_sense(int fd, int r, uint8_t *sense) | ||
36 | { | ||
37 | switch (r) { | ||
38 | case MPATH_PR_SUCCESS: | ||
39 | @@ -XXX,XX +XXX,XX @@ static int mpath_reconstruct_sense(int fd, int r, uint8_t *sense) | ||
40 | } | ||
41 | } | ||
42 | |||
43 | -static int multipath_pr_in(int fd, const uint8_t *cdb, uint8_t *sense, | ||
44 | - uint8_t *data, int sz) | ||
45 | +static int coroutine_fn multipath_pr_in(int fd, const uint8_t *cdb, uint8_t *sense, | ||
46 | + uint8_t *data, int sz) | ||
47 | { | ||
48 | int rq_servact = cdb[1]; | ||
49 | struct prin_resp resp; | ||
50 | @@ -XXX,XX +XXX,XX @@ static int multipath_pr_in(int fd, const uint8_t *cdb, uint8_t *sense, | ||
51 | return mpath_reconstruct_sense(fd, r, sense); | ||
52 | } | ||
53 | |||
54 | -static int multipath_pr_out(int fd, const uint8_t *cdb, uint8_t *sense, | ||
55 | - const uint8_t *param, int sz) | ||
56 | +static int coroutine_fn multipath_pr_out(int fd, const uint8_t *cdb, uint8_t *sense, | ||
57 | + const uint8_t *param, int sz) | ||
58 | { | ||
59 | int rq_servact = cdb[1]; | ||
60 | int rq_scope = cdb[2] >> 4; | ||
61 | @@ -XXX,XX +XXX,XX @@ static int multipath_pr_out(int fd, const uint8_t *cdb, uint8_t *sense, | ||
62 | } | ||
63 | #endif | ||
64 | |||
65 | -static int do_pr_in(int fd, const uint8_t *cdb, uint8_t *sense, | ||
66 | - uint8_t *data, int *resp_sz) | ||
67 | +static int coroutine_fn do_pr_in(int fd, const uint8_t *cdb, uint8_t *sense, | ||
68 | + uint8_t *data, int *resp_sz) | ||
69 | { | ||
70 | #ifdef CONFIG_MPATH | ||
71 | if (is_mpath(fd)) { | ||
72 | @@ -XXX,XX +XXX,XX @@ static int do_pr_in(int fd, const uint8_t *cdb, uint8_t *sense, | ||
73 | SG_DXFER_FROM_DEV); | ||
74 | } | ||
75 | |||
76 | -static int do_pr_out(int fd, const uint8_t *cdb, uint8_t *sense, | ||
77 | - const uint8_t *param, int sz) | ||
78 | +static int coroutine_fn do_pr_out(int fd, const uint8_t *cdb, uint8_t *sense, | ||
79 | + const uint8_t *param, int sz) | ||
80 | { | ||
81 | int resp_sz; | ||
82 | |||
83 | -- | ||
84 | 2.40.0 | diff view generated by jsdifflib |
New patch | |||
---|---|---|---|
1 | From: Paolo Bonzini <pbonzini@redhat.com> | ||
1 | 2 | ||
3 | Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> | ||
4 | Message-Id: <20230309084456.304669-8-pbonzini@redhat.com> | ||
5 | Reviewed-by: Kevin Wolf <kwolf@redhat.com> | ||
6 | Signed-off-by: Kevin Wolf <kwolf@redhat.com> | ||
7 | --- | ||
8 | tests/unit/test-thread-pool.c | 2 +- | ||
9 | 1 file changed, 1 insertion(+), 1 deletion(-) | ||
10 | |||
11 | diff --git a/tests/unit/test-thread-pool.c b/tests/unit/test-thread-pool.c | ||
12 | index XXXXXXX..XXXXXXX 100644 | ||
13 | --- a/tests/unit/test-thread-pool.c | ||
14 | +++ b/tests/unit/test-thread-pool.c | ||
15 | @@ -XXX,XX +XXX,XX @@ static void test_submit_aio(void) | ||
16 | g_assert_cmpint(data.ret, ==, 0); | ||
17 | } | ||
18 | |||
19 | -static void co_test_cb(void *opaque) | ||
20 | +static void coroutine_fn co_test_cb(void *opaque) | ||
21 | { | ||
22 | WorkerTestData *data = opaque; | ||
23 | |||
24 | -- | ||
25 | 2.40.0 | diff view generated by jsdifflib |
New patch | |||
---|---|---|---|
1 | From: Paolo Bonzini <pbonzini@redhat.com> | ||
1 | 2 | ||
3 | Functions that can do I/O (including calling bdrv_is_allocated | ||
4 | and bdrv_block_status functions) are prime candidates for being | ||
5 | coroutine_fns. Make the change for those that are themselves called | ||
6 | only from coroutine_fns. Also annotate that they are called with the | ||
7 | graph rdlock taken, thus allowing them to call bdrv_co_*() functions | ||
8 | for I/O. | ||
9 | |||
10 | Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> | ||
11 | Message-Id: <20230309084456.304669-9-pbonzini@redhat.com> | ||
12 | Reviewed-by: Kevin Wolf <kwolf@redhat.com> | ||
13 | Signed-off-by: Kevin Wolf <kwolf@redhat.com> | ||
14 | --- | ||
15 | block/qcow2.h | 15 ++++++++------- | ||
16 | block/qcow2-bitmap.c | 2 +- | ||
17 | block/qcow2-cluster.c | 21 +++++++++++++-------- | ||
18 | block/qcow2-refcount.c | 8 ++++---- | ||
19 | block/qcow2-snapshot.c | 25 +++++++++++++------------ | ||
20 | block/qcow2.c | 27 ++++++++++++++------------- | ||
21 | 6 files changed, 53 insertions(+), 45 deletions(-) | ||
22 | |||
23 | diff --git a/block/qcow2.h b/block/qcow2.h | ||
24 | index XXXXXXX..XXXXXXX 100644 | ||
25 | --- a/block/qcow2.h | ||
26 | +++ b/block/qcow2.h | ||
27 | @@ -XXX,XX +XXX,XX @@ int64_t qcow2_refcount_area(BlockDriverState *bs, uint64_t offset, | ||
28 | uint64_t new_refblock_offset); | ||
29 | |||
30 | int64_t qcow2_alloc_clusters(BlockDriverState *bs, uint64_t size); | ||
31 | -int64_t qcow2_alloc_clusters_at(BlockDriverState *bs, uint64_t offset, | ||
32 | - int64_t nb_clusters); | ||
33 | -int64_t qcow2_alloc_bytes(BlockDriverState *bs, int size); | ||
34 | +int64_t coroutine_fn qcow2_alloc_clusters_at(BlockDriverState *bs, uint64_t offset, | ||
35 | + int64_t nb_clusters); | ||
36 | +int64_t coroutine_fn qcow2_alloc_bytes(BlockDriverState *bs, int size); | ||
37 | void qcow2_free_clusters(BlockDriverState *bs, | ||
38 | int64_t offset, int64_t size, | ||
39 | enum qcow2_discard_type type); | ||
40 | @@ -XXX,XX +XXX,XX @@ int qcow2_change_refcount_order(BlockDriverState *bs, int refcount_order, | ||
41 | BlockDriverAmendStatusCB *status_cb, | ||
42 | void *cb_opaque, Error **errp); | ||
43 | int coroutine_fn GRAPH_RDLOCK qcow2_shrink_reftable(BlockDriverState *bs); | ||
44 | -int64_t qcow2_get_last_cluster(BlockDriverState *bs, int64_t size); | ||
45 | +int64_t coroutine_fn qcow2_get_last_cluster(BlockDriverState *bs, int64_t size); | ||
46 | int coroutine_fn qcow2_detect_metadata_preallocation(BlockDriverState *bs); | ||
47 | |||
48 | /* qcow2-cluster.c functions */ | ||
49 | @@ -XXX,XX +XXX,XX @@ void qcow2_parse_compressed_l2_entry(BlockDriverState *bs, uint64_t l2_entry, | ||
50 | int coroutine_fn GRAPH_RDLOCK | ||
51 | qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m); | ||
52 | |||
53 | -void qcow2_alloc_cluster_abort(BlockDriverState *bs, QCowL2Meta *m); | ||
54 | +void coroutine_fn qcow2_alloc_cluster_abort(BlockDriverState *bs, QCowL2Meta *m); | ||
55 | int qcow2_cluster_discard(BlockDriverState *bs, uint64_t offset, | ||
56 | uint64_t bytes, enum qcow2_discard_type type, | ||
57 | bool full_discard); | ||
58 | @@ -XXX,XX +XXX,XX @@ int qcow2_snapshot_load_tmp(BlockDriverState *bs, | ||
59 | Error **errp); | ||
60 | |||
61 | void qcow2_free_snapshots(BlockDriverState *bs); | ||
62 | -int qcow2_read_snapshots(BlockDriverState *bs, Error **errp); | ||
63 | +int coroutine_fn GRAPH_RDLOCK | ||
64 | +qcow2_read_snapshots(BlockDriverState *bs, Error **errp); | ||
65 | int qcow2_write_snapshots(BlockDriverState *bs); | ||
66 | |||
67 | int coroutine_fn GRAPH_RDLOCK | ||
68 | @@ -XXX,XX +XXX,XX @@ bool coroutine_fn qcow2_load_dirty_bitmaps(BlockDriverState *bs, | ||
69 | bool qcow2_get_bitmap_info_list(BlockDriverState *bs, | ||
70 | Qcow2BitmapInfoList **info_list, Error **errp); | ||
71 | int qcow2_reopen_bitmaps_rw(BlockDriverState *bs, Error **errp); | ||
72 | -int qcow2_truncate_bitmaps_check(BlockDriverState *bs, Error **errp); | ||
73 | +int coroutine_fn qcow2_truncate_bitmaps_check(BlockDriverState *bs, Error **errp); | ||
74 | bool qcow2_store_persistent_dirty_bitmaps(BlockDriverState *bs, | ||
75 | bool release_stored, Error **errp); | ||
76 | int qcow2_reopen_bitmaps_ro(BlockDriverState *bs, Error **errp); | ||
77 | diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c | ||
78 | index XXXXXXX..XXXXXXX 100644 | ||
79 | --- a/block/qcow2-bitmap.c | ||
80 | +++ b/block/qcow2-bitmap.c | ||
81 | @@ -XXX,XX +XXX,XX @@ out: | ||
82 | } | ||
83 | |||
84 | /* Checks to see if it's safe to resize bitmaps */ | ||
85 | -int qcow2_truncate_bitmaps_check(BlockDriverState *bs, Error **errp) | ||
86 | +int coroutine_fn qcow2_truncate_bitmaps_check(BlockDriverState *bs, Error **errp) | ||
87 | { | ||
88 | BDRVQcow2State *s = bs->opaque; | ||
89 | Qcow2BitmapList *bm_list; | ||
90 | diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c | ||
91 | index XXXXXXX..XXXXXXX 100644 | ||
92 | --- a/block/qcow2-cluster.c | ||
93 | +++ b/block/qcow2-cluster.c | ||
94 | @@ -XXX,XX +XXX,XX @@ err: | ||
95 | * Frees the allocated clusters because the request failed and they won't | ||
96 | * actually be linked. | ||
97 | */ | ||
98 | -void qcow2_alloc_cluster_abort(BlockDriverState *bs, QCowL2Meta *m) | ||
99 | +void coroutine_fn qcow2_alloc_cluster_abort(BlockDriverState *bs, QCowL2Meta *m) | ||
100 | { | ||
101 | BDRVQcow2State *s = bs->opaque; | ||
102 | if (!has_data_file(bs) && !m->keep_old_clusters) { | ||
103 | @@ -XXX,XX +XXX,XX @@ void qcow2_alloc_cluster_abort(BlockDriverState *bs, QCowL2Meta *m) | ||
104 | * | ||
105 | * Returns 0 on success, -errno on failure. | ||
106 | */ | ||
107 | -static int calculate_l2_meta(BlockDriverState *bs, uint64_t host_cluster_offset, | ||
108 | - uint64_t guest_offset, unsigned bytes, | ||
109 | - uint64_t *l2_slice, QCowL2Meta **m, bool keep_old) | ||
110 | +static int coroutine_fn calculate_l2_meta(BlockDriverState *bs, | ||
111 | + uint64_t host_cluster_offset, | ||
112 | + uint64_t guest_offset, unsigned bytes, | ||
113 | + uint64_t *l2_slice, QCowL2Meta **m, | ||
114 | + bool keep_old) | ||
115 | { | ||
116 | BDRVQcow2State *s = bs->opaque; | ||
117 | int sc_index, l2_index = offset_to_l2_slice_index(s, guest_offset); | ||
118 | @@ -XXX,XX +XXX,XX @@ out: | ||
119 | * function has been waiting for another request and the allocation must be | ||
120 | * restarted, but the whole request should not be failed. | ||
121 | */ | ||
122 | -static int do_alloc_cluster_offset(BlockDriverState *bs, uint64_t guest_offset, | ||
123 | - uint64_t *host_offset, uint64_t *nb_clusters) | ||
124 | +static int coroutine_fn do_alloc_cluster_offset(BlockDriverState *bs, | ||
125 | + uint64_t guest_offset, | ||
126 | + uint64_t *host_offset, | ||
127 | + uint64_t *nb_clusters) | ||
128 | { | ||
129 | BDRVQcow2State *s = bs->opaque; | ||
130 | |||
131 | @@ -XXX,XX +XXX,XX @@ static int zero_in_l2_slice(BlockDriverState *bs, uint64_t offset, | ||
132 | return nb_clusters; | ||
133 | } | ||
134 | |||
135 | -static int zero_l2_subclusters(BlockDriverState *bs, uint64_t offset, | ||
136 | - unsigned nb_subclusters) | ||
137 | +static int coroutine_fn | ||
138 | +zero_l2_subclusters(BlockDriverState *bs, uint64_t offset, | ||
139 | + unsigned nb_subclusters) | ||
140 | { | ||
141 | BDRVQcow2State *s = bs->opaque; | ||
142 | uint64_t *l2_slice; | ||
143 | diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c | ||
144 | index XXXXXXX..XXXXXXX 100644 | ||
145 | --- a/block/qcow2-refcount.c | ||
146 | +++ b/block/qcow2-refcount.c | ||
147 | @@ -XXX,XX +XXX,XX @@ int64_t qcow2_alloc_clusters(BlockDriverState *bs, uint64_t size) | ||
148 | return offset; | ||
149 | } | ||
150 | |||
151 | -int64_t qcow2_alloc_clusters_at(BlockDriverState *bs, uint64_t offset, | ||
152 | - int64_t nb_clusters) | ||
153 | +int64_t coroutine_fn qcow2_alloc_clusters_at(BlockDriverState *bs, uint64_t offset, | ||
154 | + int64_t nb_clusters) | ||
155 | { | ||
156 | BDRVQcow2State *s = bs->opaque; | ||
157 | uint64_t cluster_index, refcount; | ||
158 | @@ -XXX,XX +XXX,XX @@ int64_t qcow2_alloc_clusters_at(BlockDriverState *bs, uint64_t offset, | ||
159 | |||
160 | /* only used to allocate compressed sectors. We try to allocate | ||
161 | contiguous sectors. size must be <= cluster_size */ | ||
162 | -int64_t qcow2_alloc_bytes(BlockDriverState *bs, int size) | ||
163 | +int64_t coroutine_fn qcow2_alloc_bytes(BlockDriverState *bs, int size) | ||
164 | { | ||
165 | BDRVQcow2State *s = bs->opaque; | ||
166 | int64_t offset; | ||
167 | @@ -XXX,XX +XXX,XX @@ out: | ||
168 | return ret; | ||
169 | } | ||
170 | |||
171 | -int64_t qcow2_get_last_cluster(BlockDriverState *bs, int64_t size) | ||
172 | +int64_t coroutine_fn qcow2_get_last_cluster(BlockDriverState *bs, int64_t size) | ||
173 | { | ||
174 | BDRVQcow2State *s = bs->opaque; | ||
175 | int64_t i; | ||
176 | diff --git a/block/qcow2-snapshot.c b/block/qcow2-snapshot.c | ||
177 | index XXXXXXX..XXXXXXX 100644 | ||
178 | --- a/block/qcow2-snapshot.c | ||
179 | +++ b/block/qcow2-snapshot.c | ||
180 | @@ -XXX,XX +XXX,XX @@ void qcow2_free_snapshots(BlockDriverState *bs) | ||
181 | * qcow2_check_refcounts() does not do anything with snapshots' | ||
182 | * extra data.) | ||
183 | */ | ||
184 | -static int qcow2_do_read_snapshots(BlockDriverState *bs, bool repair, | ||
185 | - int *nb_clusters_reduced, | ||
186 | - int *extra_data_dropped, | ||
187 | - Error **errp) | ||
188 | +static coroutine_fn GRAPH_RDLOCK | ||
189 | +int qcow2_do_read_snapshots(BlockDriverState *bs, bool repair, | ||
190 | + int *nb_clusters_reduced, | ||
191 | + int *extra_data_dropped, | ||
192 | + Error **errp) | ||
193 | { | ||
194 | BDRVQcow2State *s = bs->opaque; | ||
195 | QCowSnapshotHeader h; | ||
196 | @@ -XXX,XX +XXX,XX @@ static int qcow2_do_read_snapshots(BlockDriverState *bs, bool repair, | ||
197 | |||
198 | /* Read statically sized part of the snapshot header */ | ||
199 | offset = ROUND_UP(offset, 8); | ||
200 | - ret = bdrv_pread(bs->file, offset, sizeof(h), &h, 0); | ||
201 | + ret = bdrv_co_pread(bs->file, offset, sizeof(h), &h, 0); | ||
202 | if (ret < 0) { | ||
203 | error_setg_errno(errp, -ret, "Failed to read snapshot table"); | ||
204 | goto fail; | ||
205 | @@ -XXX,XX +XXX,XX @@ static int qcow2_do_read_snapshots(BlockDriverState *bs, bool repair, | ||
206 | } | ||
207 | |||
208 | /* Read known extra data */ | ||
209 | - ret = bdrv_pread(bs->file, offset, | ||
210 | - MIN(sizeof(extra), sn->extra_data_size), &extra, 0); | ||
211 | + ret = bdrv_co_pread(bs->file, offset, | ||
212 | + MIN(sizeof(extra), sn->extra_data_size), &extra, 0); | ||
213 | if (ret < 0) { | ||
214 | error_setg_errno(errp, -ret, "Failed to read snapshot table"); | ||
215 | goto fail; | ||
216 | @@ -XXX,XX +XXX,XX @@ static int qcow2_do_read_snapshots(BlockDriverState *bs, bool repair, | ||
217 | /* Store unknown extra data */ | ||
218 | unknown_extra_data_size = sn->extra_data_size - sizeof(extra); | ||
219 | sn->unknown_extra_data = g_malloc(unknown_extra_data_size); | ||
220 | - ret = bdrv_pread(bs->file, offset, unknown_extra_data_size, | ||
221 | - sn->unknown_extra_data, 0); | ||
222 | + ret = bdrv_co_pread(bs->file, offset, unknown_extra_data_size, | ||
223 | + sn->unknown_extra_data, 0); | ||
224 | if (ret < 0) { | ||
225 | error_setg_errno(errp, -ret, | ||
226 | "Failed to read snapshot table"); | ||
227 | @@ -XXX,XX +XXX,XX @@ static int qcow2_do_read_snapshots(BlockDriverState *bs, bool repair, | ||
228 | |||
229 | /* Read snapshot ID */ | ||
230 | sn->id_str = g_malloc(id_str_size + 1); | ||
231 | - ret = bdrv_pread(bs->file, offset, id_str_size, sn->id_str, 0); | ||
232 | + ret = bdrv_co_pread(bs->file, offset, id_str_size, sn->id_str, 0); | ||
233 | if (ret < 0) { | ||
234 | error_setg_errno(errp, -ret, "Failed to read snapshot table"); | ||
235 | goto fail; | ||
236 | @@ -XXX,XX +XXX,XX @@ static int qcow2_do_read_snapshots(BlockDriverState *bs, bool repair, | ||
237 | |||
238 | /* Read snapshot name */ | ||
239 | sn->name = g_malloc(name_size + 1); | ||
240 | - ret = bdrv_pread(bs->file, offset, name_size, sn->name, 0); | ||
241 | + ret = bdrv_co_pread(bs->file, offset, name_size, sn->name, 0); | ||
242 | if (ret < 0) { | ||
243 | error_setg_errno(errp, -ret, "Failed to read snapshot table"); | ||
244 | goto fail; | ||
245 | @@ -XXX,XX +XXX,XX @@ fail: | ||
246 | return ret; | ||
247 | } | ||
248 | |||
249 | -int qcow2_read_snapshots(BlockDriverState *bs, Error **errp) | ||
250 | +int coroutine_fn qcow2_read_snapshots(BlockDriverState *bs, Error **errp) | ||
251 | { | ||
252 | return qcow2_do_read_snapshots(bs, false, NULL, NULL, errp); | ||
253 | } | ||
254 | diff --git a/block/qcow2.c b/block/qcow2.c | ||
255 | index XXXXXXX..XXXXXXX 100644 | ||
256 | --- a/block/qcow2.c | ||
257 | +++ b/block/qcow2.c | ||
258 | @@ -XXX,XX +XXX,XX @@ qcow2_extract_crypto_opts(QemuOpts *opts, const char *fmt, Error **errp) | ||
259 | * unknown magic is skipped (future extension this version knows nothing about) | ||
260 | * return 0 upon success, non-0 otherwise | ||
261 | */ | ||
262 | -static int qcow2_read_extensions(BlockDriverState *bs, uint64_t start_offset, | ||
263 | - uint64_t end_offset, void **p_feature_table, | ||
264 | - int flags, bool *need_update_header, | ||
265 | - Error **errp) | ||
266 | +static int coroutine_fn GRAPH_RDLOCK | ||
267 | +qcow2_read_extensions(BlockDriverState *bs, uint64_t start_offset, | ||
268 | + uint64_t end_offset, void **p_feature_table, | ||
269 | + int flags, bool *need_update_header, Error **errp) | ||
270 | { | ||
271 | BDRVQcow2State *s = bs->opaque; | ||
272 | QCowExtension ext; | ||
273 | @@ -XXX,XX +XXX,XX @@ static int qcow2_read_extensions(BlockDriverState *bs, uint64_t start_offset, | ||
274 | printf("attempting to read extended header in offset %lu\n", offset); | ||
275 | #endif | ||
276 | |||
277 | - ret = bdrv_pread(bs->file, offset, sizeof(ext), &ext, 0); | ||
278 | + ret = bdrv_co_pread(bs->file, offset, sizeof(ext), &ext, 0); | ||
279 | if (ret < 0) { | ||
280 | error_setg_errno(errp, -ret, "qcow2_read_extension: ERROR: " | ||
281 | "pread fail from offset %" PRIu64, offset); | ||
282 | @@ -XXX,XX +XXX,XX @@ static int qcow2_read_extensions(BlockDriverState *bs, uint64_t start_offset, | ||
283 | sizeof(bs->backing_format)); | ||
284 | return 2; | ||
285 | } | ||
286 | - ret = bdrv_pread(bs->file, offset, ext.len, bs->backing_format, 0); | ||
287 | + ret = bdrv_co_pread(bs->file, offset, ext.len, bs->backing_format, 0); | ||
288 | if (ret < 0) { | ||
289 | error_setg_errno(errp, -ret, "ERROR: ext_backing_format: " | ||
290 | "Could not read format name"); | ||
291 | @@ -XXX,XX +XXX,XX @@ static int qcow2_read_extensions(BlockDriverState *bs, uint64_t start_offset, | ||
292 | case QCOW2_EXT_MAGIC_FEATURE_TABLE: | ||
293 | if (p_feature_table != NULL) { | ||
294 | void *feature_table = g_malloc0(ext.len + 2 * sizeof(Qcow2Feature)); | ||
295 | - ret = bdrv_pread(bs->file, offset, ext.len, feature_table, 0); | ||
296 | + ret = bdrv_co_pread(bs->file, offset, ext.len, feature_table, 0); | ||
297 | if (ret < 0) { | ||
298 | error_setg_errno(errp, -ret, "ERROR: ext_feature_table: " | ||
299 | "Could not read table"); | ||
300 | @@ -XXX,XX +XXX,XX @@ static int qcow2_read_extensions(BlockDriverState *bs, uint64_t start_offset, | ||
301 | return -EINVAL; | ||
302 | } | ||
303 | |||
304 | - ret = bdrv_pread(bs->file, offset, ext.len, &s->crypto_header, 0); | ||
305 | + ret = bdrv_co_pread(bs->file, offset, ext.len, &s->crypto_header, 0); | ||
306 | if (ret < 0) { | ||
307 | error_setg_errno(errp, -ret, | ||
308 | "Unable to read CRYPTO header extension"); | ||
309 | @@ -XXX,XX +XXX,XX @@ static int qcow2_read_extensions(BlockDriverState *bs, uint64_t start_offset, | ||
310 | break; | ||
311 | } | ||
312 | |||
313 | - ret = bdrv_pread(bs->file, offset, ext.len, &bitmaps_ext, 0); | ||
314 | + ret = bdrv_co_pread(bs->file, offset, ext.len, &bitmaps_ext, 0); | ||
315 | if (ret < 0) { | ||
316 | error_setg_errno(errp, -ret, "bitmaps_ext: " | ||
317 | "Could not read ext header"); | ||
318 | @@ -XXX,XX +XXX,XX @@ static int qcow2_read_extensions(BlockDriverState *bs, uint64_t start_offset, | ||
319 | case QCOW2_EXT_MAGIC_DATA_FILE: | ||
320 | { | ||
321 | s->image_data_file = g_malloc0(ext.len + 1); | ||
322 | - ret = bdrv_pread(bs->file, offset, ext.len, s->image_data_file, 0); | ||
323 | + ret = bdrv_co_pread(bs->file, offset, ext.len, s->image_data_file, 0); | ||
324 | if (ret < 0) { | ||
325 | error_setg_errno(errp, -ret, | ||
326 | "ERROR: Could not read data file name"); | ||
327 | @@ -XXX,XX +XXX,XX @@ static int qcow2_read_extensions(BlockDriverState *bs, uint64_t start_offset, | ||
328 | uext->len = ext.len; | ||
329 | QLIST_INSERT_HEAD(&s->unknown_header_ext, uext, next); | ||
330 | |||
331 | - ret = bdrv_pread(bs->file, offset, uext->len, uext->data, 0); | ||
332 | + ret = bdrv_co_pread(bs->file, offset, uext->len, uext->data, 0); | ||
333 | if (ret < 0) { | ||
334 | error_setg_errno(errp, -ret, "ERROR: unknown extension: " | ||
335 | "Could not read data"); | ||
336 | @@ -XXX,XX +XXX,XX @@ static void qcow2_update_options_abort(BlockDriverState *bs, | ||
337 | qapi_free_QCryptoBlockOpenOptions(r->crypto_opts); | ||
338 | } | ||
339 | |||
340 | -static int qcow2_update_options(BlockDriverState *bs, QDict *options, | ||
341 | - int flags, Error **errp) | ||
342 | +static int coroutine_fn | ||
343 | +qcow2_update_options(BlockDriverState *bs, QDict *options, int flags, | ||
344 | + Error **errp) | ||
345 | { | ||
346 | Qcow2ReopenState r = {}; | ||
347 | int ret; | ||
348 | -- | ||
349 | 2.40.0 | diff view generated by jsdifflib |
1 | From: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru> | 1 | From: Paolo Bonzini <pbonzini@redhat.com> |
---|---|---|---|
2 | 2 | ||
3 | This patch enables making snapshots with blkreplay used in | 3 | Functions that can do I/O are prime candidates for being coroutine_fns. Make the |
4 | block devices. | 4 | change for the one that is itself called only from coroutine_fns. Unfortunately |
5 | This function is required to make bdrv_snapshot_goto work without | 5 | vmdk does not use a coroutine_fn for the bulk of the open (like qcow2 does) so
6 | calling .bdrv_open, which is not implemented. | 6 | vmdk_read_cid cannot have the same treatment.
7 | 7 | ||
8 | Signed-off-by: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru> | 8 | Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
9 | Acked-by: Kevin Wolf <kwolf@redhat.com> | 9 | Message-Id: <20230309084456.304669-10-pbonzini@redhat.com> |
10 | Reviewed-by: Kevin Wolf <kwolf@redhat.com> | ||
10 | Signed-off-by: Kevin Wolf <kwolf@redhat.com> | 11 | Signed-off-by: Kevin Wolf <kwolf@redhat.com> |
11 | --- | 12 | --- |
12 | block/blkreplay.c | 8 ++++++++ | 13 | block/vmdk.c | 2 +- |
13 | 1 file changed, 8 insertions(+) | 14 | 1 file changed, 1 insertion(+), 1 deletion(-) |
14 | 15 | ||
15 | diff --git a/block/blkreplay.c b/block/blkreplay.c | 16 | diff --git a/block/vmdk.c b/block/vmdk.c |
16 | index XXXXXXX..XXXXXXX 100644 | 17 | index XXXXXXX..XXXXXXX 100644 |
17 | --- a/block/blkreplay.c | 18 | --- a/block/vmdk.c |
18 | +++ b/block/blkreplay.c | 19 | +++ b/block/vmdk.c |
19 | @@ -XXX,XX +XXX,XX @@ static int coroutine_fn blkreplay_co_flush(BlockDriverState *bs) | 20 | @@ -XXX,XX +XXX,XX @@ out: |
20 | return ret; | 21 | return ret; |
21 | } | 22 | } |
22 | 23 | ||
23 | +static int blkreplay_snapshot_goto(BlockDriverState *bs, | 24 | -static int vmdk_is_cid_valid(BlockDriverState *bs) |
24 | + const char *snapshot_id) | 25 | +static int coroutine_fn vmdk_is_cid_valid(BlockDriverState *bs) |
25 | +{ | 26 | { |
26 | + return bdrv_snapshot_goto(bs->file->bs, snapshot_id, NULL); | 27 | BDRVVmdkState *s = bs->opaque; |
27 | +} | 28 | uint32_t cur_pcid; |
28 | + | ||
29 | static BlockDriver bdrv_blkreplay = { | ||
30 | .format_name = "blkreplay", | ||
31 | .instance_size = 0, | ||
32 | @@ -XXX,XX +XXX,XX @@ static BlockDriver bdrv_blkreplay = { | ||
33 | .bdrv_co_pwrite_zeroes = blkreplay_co_pwrite_zeroes, | ||
34 | .bdrv_co_pdiscard = blkreplay_co_pdiscard, | ||
35 | .bdrv_co_flush = blkreplay_co_flush, | ||
36 | + | ||
37 | + .bdrv_snapshot_goto = blkreplay_snapshot_goto, | ||
38 | }; | ||
39 | |||
40 | static void bdrv_blkreplay_init(void) | ||
41 | -- | 29 | -- |
42 | 2.20.1 | 30 | 2.40.0 |
43 | |||
44 | diff view generated by jsdifflib |
New patch | |||
---|---|---|---|
1 | From: Wang Liang <wangliangzz@inspur.com> | ||
1 | 2 | ||
3 | hmp_commit() calls blk_is_available() from a non-coroutine context (and in | ||
4 | the main loop). blk_is_available() is a co_wrapper_mixed_bdrv_rdlock | ||
5 | function, and in the non-coroutine context it calls AIO_WAIT_WHILE(), | ||
6 | which crashes if the aio_context lock is not taken before. | ||
7 | |||
8 | Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1615 | ||
9 | Signed-off-by: Wang Liang <wangliangzz@inspur.com> | ||
10 | Message-Id: <20230424103902.45265-1-wangliangzz@126.com> | ||
11 | Reviewed-by: Emanuele Giuseppe Esposito <eesposit@redhat.com> | ||
12 | Reviewed-by: Kevin Wolf <kwolf@redhat.com> | ||
13 | Signed-off-by: Kevin Wolf <kwolf@redhat.com> | ||
14 | --- | ||
15 | block/monitor/block-hmp-cmds.c | 10 ++++++---- | ||
16 | 1 file changed, 6 insertions(+), 4 deletions(-) | ||
17 | |||
18 | diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c | ||
19 | index XXXXXXX..XXXXXXX 100644 | ||
20 | --- a/block/monitor/block-hmp-cmds.c | ||
21 | +++ b/block/monitor/block-hmp-cmds.c | ||
22 | @@ -XXX,XX +XXX,XX @@ void hmp_commit(Monitor *mon, const QDict *qdict) | ||
23 | error_report("Device '%s' not found", device); | ||
24 | return; | ||
25 | } | ||
26 | - if (!blk_is_available(blk)) { | ||
27 | - error_report("Device '%s' has no medium", device); | ||
28 | - return; | ||
29 | - } | ||
30 | |||
31 | bs = bdrv_skip_implicit_filters(blk_bs(blk)); | ||
32 | aio_context = bdrv_get_aio_context(bs); | ||
33 | aio_context_acquire(aio_context); | ||
34 | |||
35 | + if (!blk_is_available(blk)) { | ||
36 | + error_report("Device '%s' has no medium", device); | ||
37 | + aio_context_release(aio_context); | ||
38 | + return; | ||
39 | + } | ||
40 | + | ||
41 | ret = bdrv_commit(bs); | ||
42 | |||
43 | aio_context_release(aio_context); | ||
44 | -- | ||
45 | 2.40.0 | diff view generated by jsdifflib |