The following changes since commit 4c8c1cc544dbd5e2564868e61c5037258e393832:

  Merge remote-tracking branch 'remotes/vivier/tags/m68k-for-2.10-pull-request' into staging (2017-06-22 19:01:58 +0100)

are available in the git repository at:

  git://repo.or.cz/qemu/kevin.git tags/for-upstream

for you to fetch changes up to 1512008812410ca4054506a7c44343088abdd977:

  Merge remote-tracking branch 'mreitz/tags/pull-block-2017-06-23' into queue-block (2017-06-23 14:09:12 +0200)

----------------------------------------------------------------
Block layer patches

----------------------------------------------------------------
Alberto Garcia (9):
      throttle: Update throttle-groups.c documentation
      qcow2: Remove unused Error variable in do_perform_cow()
      qcow2: Use unsigned int for both members of Qcow2COWRegion
      qcow2: Make perform_cow() call do_perform_cow() twice
      qcow2: Split do_perform_cow() into _read(), _encrypt() and _write()
      qcow2: Allow reading both COW regions with only one request
      qcow2: Pass a QEMUIOVector to do_perform_cow_{read,write}()
      qcow2: Merge the writing of the COW regions with the guest data
      qcow2: Use offset_into_cluster() and offset_to_l2_index()

Kevin Wolf (37):
      commit: Fix completion with extra reference
      qemu-iotests: Allow starting new qemu after cleanup
      qemu-iotests: Test exiting qemu with running job
      doc: Document generic -blockdev options
      doc: Document driver-specific -blockdev options
      qed: Use bottom half to resume waiting requests
      qed: Make qed_read_table() synchronous
      qed: Remove callback from qed_read_table()
      qed: Remove callback from qed_read_l2_table()
      qed: Remove callback from qed_find_cluster()
      qed: Make qed_read_backing_file() synchronous
      qed: Make qed_copy_from_backing_file() synchronous
      qed: Remove callback from qed_copy_from_backing_file()
      qed: Make qed_write_header() synchronous
      qed: Remove callback from qed_write_header()
      qed: Make qed_write_table() synchronous
      qed: Remove GenericCB
      qed: Remove callback from qed_write_table()
      qed: Make qed_aio_read_data() synchronous
      qed: Make qed_aio_write_main() synchronous
      qed: Inline qed_commit_l2_update()
      qed: Add return value to qed_aio_write_l1_update()
      qed: Add return value to qed_aio_write_l2_update()
      qed: Add return value to qed_aio_write_main()
      qed: Add return value to qed_aio_write_cow()
      qed: Add return value to qed_aio_write_inplace/alloc()
      qed: Add return value to qed_aio_read/write_data()
      qed: Remove ret argument from qed_aio_next_io()
      qed: Remove recursion in qed_aio_next_io()
      qed: Implement .bdrv_co_readv/writev
      qed: Use CoQueue for serialising allocations
      qed: Simplify request handling
      qed: Use a coroutine for need_check_timer
      qed: Add coroutine_fn to I/O path functions
      qed: Use bdrv_co_* for coroutine_fns
      block: Remove bdrv_aio_readv/writev/flush()
      Merge remote-tracking branch 'mreitz/tags/pull-block-2017-06-23' into queue-block

Manos Pitsidianakis (1):
      block: change variable names in BlockDriverState

Max Reitz (3):
      blkdebug: Catch bs->exact_filename overflow
      blkverify: Catch bs->exact_filename overflow
      block: Do not strcmp() with NULL uri->scheme

Stefan Hajnoczi (10):
      block: count bdrv_co_rw_vmstate() requests
      block: use BDRV_POLL_WHILE() in bdrv_rw_vmstate()
      migration: avoid recursive AioContext locking in save_vmstate()
      migration: use bdrv_drain_all_begin/end() instead bdrv_drain_all()
      virtio-pci: use ioeventfd even when KVM is disabled
      migration: hold AioContext lock for loadvm qemu_fclose()
      qemu-iotests: 068: extract _qemu() function
      qemu-iotests: 068: use -drive/-device instead of -hda
      qemu-iotests: 068: test iothread mode
      qemu-img: don't shadow opts variable in img_dd()

Stephen Bates (1):
      nvme: Add support for Read Data and Write Data in CMBs.

sochin.jiang (1):
      fix: avoid an infinite loop or a dangling pointer problem in img_commit

 block/Makefile.objs            |   2 +-
 block/blkdebug.c               |  46 +--
 block/blkreplay.c              |   8 +-
 block/blkverify.c              |  12 +-
 block/block-backend.c          |  22 +-
 block/commit.c                 |   7 +
 block/file-posix.c             |  34 +-
 block/io.c                     | 240 ++-----------
 block/iscsi.c                  |  20 +-
 block/mirror.c                 |   8 +-
 block/nbd-client.c             |   8 +-
 block/nbd-client.h             |   4 +-
 block/nbd.c                    |   6 +-
 block/nfs.c                    |   2 +-
 block/qcow2-cluster.c          | 201 ++++++++---
 block/qcow2.c                  |  94 +++--
 block/qcow2.h                  |  11 +-
 block/qed-cluster.c            | 124 +++----
 block/qed-gencb.c              |  33 --
 block/qed-table.c              | 261 +++++---------
 block/qed.c                    | 779 ++++++++++++++++-------------------------
 block/qed.h                    |  54 +--
 block/raw-format.c             |   8 +-
 block/rbd.c                    |   4 +-
 block/sheepdog.c               |  12 +-
 block/ssh.c                    |   2 +-
 block/throttle-groups.c        |   2 +-
 block/trace-events             |   3 -
 blockjob.c                     |   4 +-
 hw/block/nvme.c                |  83 +++--
 hw/block/nvme.h                |   1 +
 hw/virtio/virtio-pci.c         |   2 +-
 include/block/block.h          |  16 +-
 include/block/block_int.h      |   6 +-
 include/block/blockjob.h       |  18 +
 include/sysemu/block-backend.h |  20 +-
 migration/savevm.c             |  32 +-
 qemu-img.c                     |  29 +-
 qemu-io-cmds.c                 |  46 +--
 qemu-options.hx                | 221 ++++++++++--
 tests/qemu-iotests/068         |  37 +-
 tests/qemu-iotests/068.out     |  11 +-
 tests/qemu-iotests/185         | 206 +++++++++++
 tests/qemu-iotests/185.out     |  59 ++++
 tests/qemu-iotests/common.qemu |   3 +
 tests/qemu-iotests/group       |   1 +
 46 files changed, 1477 insertions(+), 1325 deletions(-)
 delete mode 100644 block/qed-gencb.c
 create mode 100755 tests/qemu-iotests/185
 create mode 100644 tests/qemu-iotests/185.out

The following changes since commit ac5f7bf8e208cd7893dbb1a9520559e569a4677c:

  Merge tag 'migration-20230424-pull-request' of https://gitlab.com/juan.quintela/qemu into staging (2023-04-24 15:00:39 +0100)

are available in the Git repository at:

  https://repo.or.cz/qemu/kevin.git tags/for-upstream

for you to fetch changes up to 8c1e8fb2e7fc2cbeb57703e143965a4cd3ad301a:

  block/monitor: Fix crash when executing HMP commit (2023-04-25 15:11:57 +0200)

----------------------------------------------------------------
Block layer patches

- Protect BlockBackend.queued_requests with its own lock
- Switch to AIO_WAIT_WHILE_UNLOCKED() where possible
- AioContext removal: LinuxAioState/LuringState/ThreadPool
- Add more coroutine_fn annotations, use bdrv/blk_co_*
- Fix crash when executing HMP commit

----------------------------------------------------------------
Emanuele Giuseppe Esposito (4):
      linux-aio: use LinuxAioState from the running thread
      io_uring: use LuringState from the running thread
      thread-pool: use ThreadPool from the running thread
      thread-pool: avoid passing the pool parameter every time

Paolo Bonzini (9):
      vvfat: mark various functions as coroutine_fn
      blkdebug: add missing coroutine_fn annotation
      mirror: make mirror_flush a coroutine_fn, do not use co_wrappers
      nbd: mark more coroutine_fns, do not use co_wrappers
      9pfs: mark more coroutine_fns
      qemu-pr-helper: mark more coroutine_fns
      tests: mark more coroutine_fns
      qcow2: mark various functions as coroutine_fn and GRAPH_RDLOCK
      vmdk: make vmdk_is_cid_valid a coroutine_fn

Stefan Hajnoczi (10):
      block: make BlockBackend->quiesce_counter atomic
      block: make BlockBackend->disable_request_queuing atomic
      block: protect BlockBackend->queued_requests with a lock
      block: don't acquire AioContext lock in bdrv_drain_all()
      block: convert blk_exp_close_all_type() to AIO_WAIT_WHILE_UNLOCKED()
      block: convert bdrv_graph_wrlock() to AIO_WAIT_WHILE_UNLOCKED()
      block: convert bdrv_drain_all_begin() to AIO_WAIT_WHILE_UNLOCKED()
      hmp: convert handle_hmp_command() to AIO_WAIT_WHILE_UNLOCKED()
      monitor: convert monitor_cleanup() to AIO_WAIT_WHILE_UNLOCKED()
      block: add missing coroutine_fn to bdrv_sum_allocated_file_size()

Wang Liang (1):
      block/monitor: Fix crash when executing HMP commit

Wilfred Mallawa (1):
      include/block: fixup typos

 block/qcow2.h                     | 15 +++++-----
 hw/9pfs/9p.h                      |  4 +--
 include/block/aio-wait.h          |  2 +-
 include/block/aio.h               |  8 ------
 include/block/block_int-common.h  |  2 +-
 include/block/raw-aio.h           | 33 +++++++++++++++-------
 include/block/thread-pool.h       | 15 ++++++----
 include/sysemu/block-backend-io.h |  5 ++++
 backends/tpm/tpm_backend.c        |  4 +--
 block.c                           |  2 +-
 block/blkdebug.c                  |  4 +--
 block/block-backend.c             | 45 ++++++++++++++++++------------
 block/export/export.c             |  2 +-
 block/file-posix.c                | 45 ++++++++++++------------------
 block/file-win32.c                |  4 +--
 block/graph-lock.c                |  2 +-
 block/io.c                        |  2 +-
 block/io_uring.c                  | 23 ++++++++++------
 block/linux-aio.c                 | 29 ++++++++++++--------
 block/mirror.c                    |  4 +--
 block/monitor/block-hmp-cmds.c    | 10 ++++---
 block/qcow2-bitmap.c              |  2 +-
 block/qcow2-cluster.c             | 21 ++++++++------
 block/qcow2-refcount.c            |  8 +++---
 block/qcow2-snapshot.c            | 25 +++++++++--------
 block/qcow2-threads.c             |  3 +-
 block/qcow2.c                     | 27 +++++++++---------
 block/vmdk.c                      |  2 +-
 block/vvfat.c                     | 58 ++++++++++++++++++++-------------------
 hw/9pfs/codir.c                   |  6 ++--
 hw/9pfs/coth.c                    |  3 +-
 hw/ppc/spapr_nvdimm.c             |  6 ++--
 hw/virtio/virtio-pmem.c           |  3 +-
 monitor/hmp.c                     |  2 +-
 monitor/monitor.c                 |  4 +--
 nbd/server.c                      | 48 ++++++++++++++++----------------
 scsi/pr-manager.c                 |  3 +-
 scsi/qemu-pr-helper.c             | 25 ++++++++---------
 tests/unit/test-thread-pool.c     | 14 ++++------
 util/thread-pool.c                | 25 ++++++++---------
 40 files changed, 283 insertions(+), 262 deletions(-)
Deleted patch
commit_complete() can't assume that after its block_job_completed() the
job is actually immediately freed; someone else may still be holding
references. In this case, the op blockers on the intermediate nodes make
the graph reconfiguration in the completion code fail.

Call block_job_remove_all_bdrv() manually so that we know for sure that
any blockers on intermediate nodes are given up.

Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
 block/commit.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/block/commit.c b/block/commit.c
index XXXXXXX..XXXXXXX 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -XXX,XX +XXX,XX @@ static void commit_complete(BlockJob *job, void *opaque)
     }
     g_free(s->backing_file_str);
     blk_unref(s->top);
+
+    /* If there is more than one reference to the job (e.g. if called from
+     * block_job_finish_sync()), block_job_completed() won't free it and
+     * therefore the blockers on the intermediate nodes remain. This would
+     * cause bdrv_set_backing_hd() to fail. */
+    block_job_remove_all_bdrv(job);
+
     block_job_completed(&s->common, ret);
     g_free(data);

--
1.8.3.1
Deleted patch
After _cleanup_qemu(), test cases should be able to start the next qemu
process and call _cleanup_qemu() for that one as well. For this to work
cleanly, we need to improve the cleanup so that the second invocation
doesn't try to kill the qemu instances from the first invocation a
second time (which would result in error messages).

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
 tests/qemu-iotests/common.qemu | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tests/qemu-iotests/common.qemu b/tests/qemu-iotests/common.qemu
index XXXXXXX..XXXXXXX 100644
--- a/tests/qemu-iotests/common.qemu
+++ b/tests/qemu-iotests/common.qemu
@@ -XXX,XX +XXX,XX @@ function _cleanup_qemu()
         rm -f "${QEMU_FIFO_IN}_${i}" "${QEMU_FIFO_OUT}_${i}"
         eval "exec ${QEMU_IN[$i]}<&-"   # close file descriptors
         eval "exec ${QEMU_OUT[$i]}<&-"
+
+        unset QEMU_IN[$i]
+        unset QEMU_OUT[$i]
     done
 }
--
1.8.3.1
Deleted patch
When qemu is exited, all running jobs should be cancelled successfully.
This adds a test for this for all types of block jobs that currently
exist in qemu.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 tests/qemu-iotests/185     | 206 +++++++++++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/185.out |  59 +++++++++++++
 tests/qemu-iotests/group   |   1 +
 3 files changed, 266 insertions(+)
 create mode 100755 tests/qemu-iotests/185
 create mode 100644 tests/qemu-iotests/185.out

diff --git a/tests/qemu-iotests/185 b/tests/qemu-iotests/185
new file mode 100755
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/tests/qemu-iotests/185
@@ -XXX,XX +XXX,XX @@
+#!/bin/bash
+#
+# Test exiting qemu while jobs are still running
+#
+# Copyright (C) 2017 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+# creator
+owner=kwolf@redhat.com
+
+seq=`basename $0`
+echo "QA output created by $seq"
+
+here=`pwd`
+status=1	# failure is the default!
+
+MIG_SOCKET="${TEST_DIR}/migrate"
+
+_cleanup()
+{
+    rm -f "${TEST_IMG}.mid"
+    rm -f "${TEST_IMG}.copy"
+    _cleanup_test_img
+    _cleanup_qemu
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+. ./common.qemu
+
+_supported_fmt qcow2
+_supported_proto file
+_supported_os Linux
+
+size=64M
+TEST_IMG="${TEST_IMG}.base" _make_test_img $size
+
+echo
+echo === Starting VM ===
+echo
+
+qemu_comm_method="qmp"
+
+_launch_qemu \
+    -drive file="${TEST_IMG}.base",cache=$CACHEMODE,driver=$IMGFMT,id=disk
+h=$QEMU_HANDLE
+_send_qemu_cmd $h "{ 'execute': 'qmp_capabilities' }" 'return'
+
+echo
+echo === Creating backing chain ===
+echo
+
+_send_qemu_cmd $h \
+    "{ 'execute': 'blockdev-snapshot-sync',
+       'arguments': { 'device': 'disk',
+                      'snapshot-file': '$TEST_IMG.mid',
+                      'format': '$IMGFMT',
+                      'mode': 'absolute-paths' } }" \
+    "return"
+
+_send_qemu_cmd $h \
+    "{ 'execute': 'human-monitor-command',
+       'arguments': { 'command-line':
+                      'qemu-io disk \"write 0 4M\"' } }" \
+    "return"
+
+_send_qemu_cmd $h \
+    "{ 'execute': 'blockdev-snapshot-sync',
+       'arguments': { 'device': 'disk',
+                      'snapshot-file': '$TEST_IMG',
+                      'format': '$IMGFMT',
+                      'mode': 'absolute-paths' } }" \
+    "return"
+
+echo
+echo === Start commit job and exit qemu ===
+echo
+
+# Note that the reference output intentionally includes the 'offset' field in
+# BLOCK_JOB_CANCELLED events for all of the following block jobs. They are
+# predictable and any change in the offsets would hint at a bug in the job
+# throttling code.
+#
+# In order to achieve these predictable offsets, all of the following tests
+# use speed=65536. Each job will perform exactly one iteration before it has
+# to sleep at least for a second, which is plenty of time for the 'quit' QMP
+# command to be received (after receiving the command, the rest runs
+# synchronously, so jobs can arbitrarily continue or complete).
+#
+# The buffer size for commit and streaming is 512k (waiting for 8 seconds after
+# the first request), for active commit and mirror it's large enough to cover
+# the full 4M, and for backup it's the qcow2 cluster size, which we know is
+# 64k. As all of these are at least as large as the speed, we are sure that the
+# offset doesn't advance after the first iteration before qemu exits.
+
+_send_qemu_cmd $h \
+    "{ 'execute': 'block-commit',
+       'arguments': { 'device': 'disk',
+                      'base':'$TEST_IMG.base',
+                      'top': '$TEST_IMG.mid',
+                      'speed': 65536 } }" \
+    "return"
+
+_send_qemu_cmd $h "{ 'execute': 'quit' }" "return"
+wait=1 _cleanup_qemu
+
+echo
+echo === Start active commit job and exit qemu ===
+echo
+
+_launch_qemu \
+    -drive file="${TEST_IMG}",cache=$CACHEMODE,driver=$IMGFMT,id=disk
+h=$QEMU_HANDLE
+_send_qemu_cmd $h "{ 'execute': 'qmp_capabilities' }" 'return'
+
+_send_qemu_cmd $h \
+    "{ 'execute': 'block-commit',
+       'arguments': { 'device': 'disk',
+                      'base':'$TEST_IMG.base',
+                      'speed': 65536 } }" \
+    "return"
+
+_send_qemu_cmd $h "{ 'execute': 'quit' }" "return"
+wait=1 _cleanup_qemu
+
+echo
+echo === Start mirror job and exit qemu ===
+echo
+
+_launch_qemu \
+    -drive file="${TEST_IMG}",cache=$CACHEMODE,driver=$IMGFMT,id=disk
+h=$QEMU_HANDLE
+_send_qemu_cmd $h "{ 'execute': 'qmp_capabilities' }" 'return'
+
+_send_qemu_cmd $h \
+    "{ 'execute': 'drive-mirror',
+       'arguments': { 'device': 'disk',
+                      'target': '$TEST_IMG.copy',
+                      'format': '$IMGFMT',
+                      'sync': 'full',
+                      'speed': 65536 } }" \
+    "return"
+
+_send_qemu_cmd $h "{ 'execute': 'quit' }" "return"
+wait=1 _cleanup_qemu
+
+echo
+echo === Start backup job and exit qemu ===
+echo
+
+_launch_qemu \
+    -drive file="${TEST_IMG}",cache=$CACHEMODE,driver=$IMGFMT,id=disk
+h=$QEMU_HANDLE
+_send_qemu_cmd $h "{ 'execute': 'qmp_capabilities' }" 'return'
+
+_send_qemu_cmd $h \
+    "{ 'execute': 'drive-backup',
+       'arguments': { 'device': 'disk',
+                      'target': '$TEST_IMG.copy',
+                      'format': '$IMGFMT',
+                      'sync': 'full',
+                      'speed': 65536 } }" \
+    "return"
+
+_send_qemu_cmd $h "{ 'execute': 'quit' }" "return"
+wait=1 _cleanup_qemu
+
+echo
+echo === Start streaming job and exit qemu ===
+echo
+
+_launch_qemu \
+    -drive file="${TEST_IMG}",cache=$CACHEMODE,driver=$IMGFMT,id=disk
+h=$QEMU_HANDLE
+_send_qemu_cmd $h "{ 'execute': 'qmp_capabilities' }" 'return'
+
+_send_qemu_cmd $h \
+    "{ 'execute': 'block-stream',
+       'arguments': { 'device': 'disk',
+                      'speed': 65536 } }" \
+    "return"
+
+_send_qemu_cmd $h "{ 'execute': 'quit' }" "return"
+wait=1 _cleanup_qemu
+
+_check_test_img
+
+# success, all done
+echo "*** done"
+rm -f $seq.full
+status=0
diff --git a/tests/qemu-iotests/185.out b/tests/qemu-iotests/185.out
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/tests/qemu-iotests/185.out
@@ -XXX,XX +XXX,XX @@
+QA output created by 185
+Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=67108864
+
+=== Starting VM ===
+
+{"return": {}}
+
+=== Creating backing chain ===
+
+Formatting 'TEST_DIR/t.qcow2.mid', fmt=qcow2 size=67108864 backing_file=TEST_DIR/t.qcow2.base backing_fmt=qcow2 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
+{"return": {}}
+wrote 4194304/4194304 bytes at offset 0
+4 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+{"return": ""}
+Formatting 'TEST_DIR/t.qcow2', fmt=qcow2 size=67108864 backing_file=TEST_DIR/t.qcow2.mid backing_fmt=qcow2 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
+{"return": {}}
+
+=== Start commit job and exit qemu ===
+
+{"return": {}}
+{"return": {}}
+{"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false}}
+{"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "disk", "len": 67108864, "offset": 524288, "speed": 65536, "type": "commit"}}
+
+=== Start active commit job and exit qemu ===
+
+{"return": {}}
+{"return": {}}
+{"return": {}}
+{"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false}}
+{"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "disk", "len": 4194304, "offset": 4194304, "speed": 65536, "type": "commit"}}
+
+=== Start mirror job and exit qemu ===
+
+{"return": {}}
+Formatting 'TEST_DIR/t.qcow2.copy', fmt=qcow2 size=67108864 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
+{"return": {}}
+{"return": {}}
+{"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false}}
+{"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "disk", "len": 4194304, "offset": 4194304, "speed": 65536, "type": "mirror"}}
+
+=== Start backup job and exit qemu ===
+
+{"return": {}}
+Formatting 'TEST_DIR/t.qcow2.copy', fmt=qcow2 size=67108864 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
+{"return": {}}
+{"return": {}}
+{"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false}}
+{"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "disk", "len": 67108864, "offset": 65536, "speed": 65536, "type": "backup"}}
+
+=== Start streaming job and exit qemu ===
+
+{"return": {}}
+{"return": {}}
+{"return": {}}
+{"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false}}
+{"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "disk", "len": 67108864, "offset": 524288, "speed": 65536, "type": "stream"}}
+No errors were found on the image.
+*** done
diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
index XXXXXXX..XXXXXXX 100644
--- a/tests/qemu-iotests/group
+++ b/tests/qemu-iotests/group
@@ -XXX,XX +XXX,XX @@
 181 rw auto migration
 182 rw auto quick
 183 rw auto migration
+185 rw auto
--
1.8.3.1
From: Stefan Hajnoczi <stefanha@redhat.com>

Avoid duplicating the QEMU command-line.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 tests/qemu-iotests/068 | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/tests/qemu-iotests/068 b/tests/qemu-iotests/068
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/068
+++ b/tests/qemu-iotests/068
@@ -XXX,XX +XXX,XX @@ case "$QEMU_DEFAULT_MACHINE" in
       ;;
 esac
 
-# Give qemu some time to boot before saving the VM state
-bash -c 'sleep 1; echo -e "savevm 0\nquit"' |\
-    $QEMU $platform_parm -nographic -monitor stdio -serial none -hda "$TEST_IMG" |\
+_qemu()
+{
+    $QEMU $platform_parm -nographic -monitor stdio -serial none -hda "$TEST_IMG" \
+        "$@" |\
     _filter_qemu | _filter_hmp
+}
+
+# Give qemu some time to boot before saving the VM state
+bash -c 'sleep 1; echo -e "savevm 0\nquit"' | _qemu
 # Now try to continue from that VM state (this should just work)
-echo quit |\
-    $QEMU $platform_parm -nographic -monitor stdio -serial none -hda "$TEST_IMG" -loadvm 0 |\
-    _filter_qemu | _filter_hmp
+echo quit | _qemu -loadvm 0
 
 # success, all done
 echo "*** done"
--
1.8.3.1

From: Stefan Hajnoczi <stefanha@redhat.com>

The main loop thread increments/decrements BlockBackend->quiesce_counter
when drained sections begin/end. The counter is read in the I/O code
path. Therefore this field is used to communicate between threads
without a lock.

Acquire/release are not necessary because the BlockBackend->in_flight
counter already uses sequentially consistent accesses and running I/O
requests hold that counter when blk_wait_while_drained() is called.
qatomic_read() can be used.

Use qatomic_fetch_inc()/qatomic_fetch_dec() for modifications even
though sequentially consistent atomic accesses are not strictly required
here. They are, however, nicer to read than multiple calls to
qatomic_read() and qatomic_set(). Since beginning and ending drain is
not a hot path the extra cost doesn't matter.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20230307210427.269214-2-stefanha@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/block-backend.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index XXXXXXX..XXXXXXX 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -XXX,XX +XXX,XX @@ struct BlockBackend {
     NotifierList remove_bs_notifiers, insert_bs_notifiers;
     QLIST_HEAD(, BlockBackendAioNotifier) aio_notifiers;
 
-    int quiesce_counter;
+    int quiesce_counter; /* atomic: written under BQL, read by other threads */
     CoQueue queued_requests;
     bool disable_request_queuing;
 
@@ -XXX,XX +XXX,XX @@ void blk_set_dev_ops(BlockBackend *blk, const BlockDevOps *ops,
     blk->dev_opaque = opaque;
 
     /* Are we currently quiesced? Should we enforce this right now? */
-    if (blk->quiesce_counter && ops && ops->drained_begin) {
+    if (qatomic_read(&blk->quiesce_counter) && ops && ops->drained_begin) {
         ops->drained_begin(opaque);
     }
 }
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn blk_wait_while_drained(BlockBackend *blk)
 {
     assert(blk->in_flight > 0);
 
-    if (blk->quiesce_counter && !blk->disable_request_queuing) {
+    if (qatomic_read(&blk->quiesce_counter) && !blk->disable_request_queuing) {
         blk_dec_in_flight(blk);
         qemu_co_queue_wait(&blk->queued_requests, NULL);
         blk_inc_in_flight(blk);
@@ -XXX,XX +XXX,XX @@ static void blk_root_drained_begin(BdrvChild *child)
     BlockBackend *blk = child->opaque;
     ThrottleGroupMember *tgm = &blk->public.throttle_group_member;
 
-    if (++blk->quiesce_counter == 1) {
+    if (qatomic_fetch_inc(&blk->quiesce_counter) == 0) {
         if (blk->dev_ops && blk->dev_ops->drained_begin) {
             blk->dev_ops->drained_begin(blk->dev_opaque);
         }
@@ -XXX,XX +XXX,XX @@ static bool blk_root_drained_poll(BdrvChild *child)
 {
     BlockBackend *blk = child->opaque;
     bool busy = false;
-    assert(blk->quiesce_counter);
+    assert(qatomic_read(&blk->quiesce_counter));
 
     if (blk->dev_ops && blk->dev_ops->drained_poll) {
         busy = blk->dev_ops->drained_poll(blk->dev_opaque);
@@ -XXX,XX +XXX,XX @@ static bool blk_root_drained_poll(BdrvChild *child)
 static void blk_root_drained_end(BdrvChild *child)
 {
     BlockBackend *blk = child->opaque;
-    assert(blk->quiesce_counter);
+    assert(qatomic_read(&blk->quiesce_counter));
 
     assert(blk->public.throttle_group_member.io_limits_disabled);
     qatomic_dec(&blk->public.throttle_group_member.io_limits_disabled);
 
-    if (--blk->quiesce_counter == 0) {
+    if (qatomic_fetch_dec(&blk->quiesce_counter) == 1) {
         if (blk->dev_ops && blk->dev_ops->drained_end) {
             blk->dev_ops->drained_end(blk->dev_opaque);
         }
--
2.40.0
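
To illustrate the quiesce_counter pattern from the patch above outside of
block-backend.c: a minimal sketch, assuming QEMU's "qemu/atomic.h" helpers;
the struct and function names are illustrative, not actual QEMU code. The
writer runs under the BQL and uses atomic read-modify-write, while readers
in IOThreads use plain atomic loads:

    #include "qemu/osdep.h"
    #include "qemu/atomic.h"

    typedef struct {
        int quiesce_counter; /* atomic: written under BQL, read anywhere */
    } Counter;

    /* Main loop thread (BQL held): nesting levels are counted atomically */
    static void drained_begin(Counter *c)
    {
        if (qatomic_fetch_inc(&c->quiesce_counter) == 0) {
            /* first level of nesting: actually quiesce users */
        }
    }

    static void drained_end(Counter *c)
    {
        if (qatomic_fetch_dec(&c->quiesce_counter) == 1) {
            /* last level of nesting: resume users */
        }
    }

    /* I/O thread: a lock-free read is enough; no acquire/release is
     * needed because, as the commit message argues, the in_flight
     * counter already orders these accesses. */
    static bool is_drained(Counter *c)
    {
        return qatomic_read(&c->quiesce_counter) > 0;
    }
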
From: Stefan Hajnoczi <stefanha@redhat.com>

Perform the savevm/loadvm test with both iothread on and off. This
covers the recently found savevm/loadvm hang when iothread is enabled.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 tests/qemu-iotests/068     | 23 ++++++++++++++---------
 tests/qemu-iotests/068.out | 11 ++++++++++-
 2 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/tests/qemu-iotests/068 b/tests/qemu-iotests/068
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/068
+++ b/tests/qemu-iotests/068
@@ -XXX,XX +XXX,XX @@ _supported_os Linux
 IMGOPTS="compat=1.1"
 IMG_SIZE=128K
 
-echo
-echo "=== Saving and reloading a VM state to/from a qcow2 image ==="
-echo
-_make_test_img $IMG_SIZE
-
 case "$QEMU_DEFAULT_MACHINE" in
   s390-ccw-virtio)
       platform_parm="-no-shutdown"
@@ -XXX,XX +XXX,XX @@ _qemu()
     _filter_qemu | _filter_hmp
 }
 
-# Give qemu some time to boot before saving the VM state
-bash -c 'sleep 1; echo -e "savevm 0\nquit"' | _qemu
-# Now try to continue from that VM state (this should just work)
-echo quit | _qemu -loadvm 0
+for extra_args in \
+    "" \
+    "-object iothread,id=iothread0 -set device.hba0.iothread=iothread0"; do
+    echo
+    echo "=== Saving and reloading a VM state to/from a qcow2 image ($extra_args) ==="
+    echo
+
+    _make_test_img $IMG_SIZE
+
+    # Give qemu some time to boot before saving the VM state
+    bash -c 'sleep 1; echo -e "savevm 0\nquit"' | _qemu $extra_args
+    # Now try to continue from that VM state (this should just work)
+    echo quit | _qemu $extra_args -loadvm 0
+done
 
 # success, all done
 echo "*** done"
diff --git a/tests/qemu-iotests/068.out b/tests/qemu-iotests/068.out
index XXXXXXX..XXXXXXX 100644
--- a/tests/qemu-iotests/068.out
+++ b/tests/qemu-iotests/068.out
@@ -XXX,XX +XXX,XX @@
 QA output created by 068
 
-=== Saving and reloading a VM state to/from a qcow2 image ===
+=== Saving and reloading a VM state to/from a qcow2 image () ===
+
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=131072
+QEMU X.Y.Z monitor - type 'help' for more information
+(qemu) savevm 0
+(qemu) quit
+QEMU X.Y.Z monitor - type 'help' for more information
+(qemu) quit
+
+=== Saving and reloading a VM state to/from a qcow2 image (-object iothread,id=iothread0 -set device.hba0.iothread=iothread0) ===
 
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=131072
 QEMU X.Y.Z monitor - type 'help' for more information
--
1.8.3.1

From: Stefan Hajnoczi <stefanha@redhat.com>

This field is accessed by multiple threads without a lock. Use explicit
qatomic_read()/qatomic_set() calls. There is no need for acquire/release
because blk_set_disable_request_queuing() doesn't provide any
guarantees (it helps that it's used at BlockBackend creation time and
not when there is I/O in flight).

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
Message-Id: <20230307210427.269214-3-stefanha@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/block-backend.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index XXXXXXX..XXXXXXX 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -XXX,XX +XXX,XX @@ struct BlockBackend {
 
     int quiesce_counter; /* atomic: written under BQL, read by other threads */
     CoQueue queued_requests;
-    bool disable_request_queuing;
+    bool disable_request_queuing; /* atomic */
 
     VMChangeStateEntry *vmsh;
     bool force_allow_inactivate;
@@ -XXX,XX +XXX,XX @@ void blk_set_allow_aio_context_change(BlockBackend *blk, bool allow)
 void blk_set_disable_request_queuing(BlockBackend *blk, bool disable)
 {
     IO_CODE();
-    blk->disable_request_queuing = disable;
+    qatomic_set(&blk->disable_request_queuing, disable);
 }
 
 static int coroutine_fn GRAPH_RDLOCK
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn blk_wait_while_drained(BlockBackend *blk)
 {
     assert(blk->in_flight > 0);
 
-    if (qatomic_read(&blk->quiesce_counter) && !blk->disable_request_queuing) {
+    if (qatomic_read(&blk->quiesce_counter) &&
+        !qatomic_read(&blk->disable_request_queuing)) {
         blk_dec_in_flight(blk);
         qemu_co_queue_wait(&blk->queued_requests, NULL);
         blk_inc_in_flight(blk);
--
2.40.0
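
The disable_request_queuing conversion above reduces to an even simpler
sketch (again assuming "qemu/atomic.h"; the names are hypothetical): a
plain atomic store/load pair is enough precisely because the setter
promises no ordering guarantees:

    static bool flag; /* atomic */

    static void set_flag(bool value)
    {
        qatomic_set(&flag, value);   /* no barrier semantics implied */
    }

    static bool get_flag(void)
    {
        return qatomic_read(&flag);  /* safe to call from any thread */
    }
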
Now that we process a request in the same coroutine from beginning to
end and don't drop out of it any more, we can look like a proper
coroutine-based driver and simply call qed_aio_next_io() and get a
return value from it instead of spawning an additional coroutine that
reenters the parent when it's done.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/qed.c | 101 +++++++++++++-----------------------------------------------
 block/qed.h |   3 +-
 2 files changed, 22 insertions(+), 82 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@
 #include "qapi/qmp/qerror.h"
 #include "sysemu/block-backend.h"
 
-static const AIOCBInfo qed_aiocb_info = {
-    .aiocb_size         = sizeof(QEDAIOCB),
-};
-
 static int bdrv_qed_probe(const uint8_t *buf, int buf_size,
                           const char *filename)
 {
@@ -XXX,XX +XXX,XX @@ static CachedL2Table *qed_new_l2_table(BDRVQEDState *s)
     return l2_table;
 }
 
-static void qed_aio_next_io(QEDAIOCB *acb);
-
-static void qed_aio_start_io(QEDAIOCB *acb)
-{
-    qed_aio_next_io(acb);
-}
-
 static void qed_plug_allocating_write_reqs(BDRVQEDState *s)
 {
     assert(!s->allocating_write_reqs_plugged);
@@ -XXX,XX +XXX,XX @@ static int64_t coroutine_fn bdrv_qed_co_get_block_status(BlockDriverState *bs,
 
 static BDRVQEDState *acb_to_s(QEDAIOCB *acb)
 {
-    return acb->common.bs->opaque;
+    return acb->bs->opaque;
 }
 
 /**
@@ -XXX,XX +XXX,XX @@ static void qed_update_l2_table(BDRVQEDState *s, QEDTable *table, int index,
     }
 }
 
-static void qed_aio_complete_bh(void *opaque)
-{
-    QEDAIOCB *acb = opaque;
-    BDRVQEDState *s = acb_to_s(acb);
-    BlockCompletionFunc *cb = acb->common.cb;
-    void *user_opaque = acb->common.opaque;
-    int ret = acb->bh_ret;
-
-    qemu_aio_unref(acb);
-
-    /* Invoke callback */
-    qed_acquire(s);
-    cb(user_opaque, ret);
-    qed_release(s);
-}
-
-static void qed_aio_complete(QEDAIOCB *acb, int ret)
+static void qed_aio_complete(QEDAIOCB *acb)
 {
     BDRVQEDState *s = acb_to_s(acb);
 
-    trace_qed_aio_complete(s, acb, ret);
-
     /* Free resources */
     qemu_iovec_destroy(&acb->cur_qiov);
     qed_unref_l2_cache_entry(acb->request.l2_table);
@@ -XXX,XX +XXX,XX @@ static void qed_aio_complete(QEDAIOCB *acb, int ret)
         acb->qiov->iov[0].iov_base = NULL;
     }
 
-    /* Arrange for a bh to invoke the completion function */
-    acb->bh_ret = ret;
-    aio_bh_schedule_oneshot(bdrv_get_aio_context(acb->common.bs),
-                            qed_aio_complete_bh, acb);
-
     /* Start next allocating write request waiting behind this one. Note that
      * requests enqueue themselves when they first hit an unallocated cluster
      * but they wait until the entire request is finished before waking up the
@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset, size_t len)
     struct iovec *iov = acb->qiov->iov;
 
     if (!iov->iov_base) {
-        iov->iov_base = qemu_try_blockalign(acb->common.bs, iov->iov_len);
+        iov->iov_base = qemu_try_blockalign(acb->bs, iov->iov_len);
         if (iov->iov_base == NULL) {
             return -ENOMEM;
         }
@@ -XXX,XX +XXX,XX @@ static int qed_aio_read_data(void *opaque, int ret, uint64_t offset, size_t len)
 {
     QEDAIOCB *acb = opaque;
     BDRVQEDState *s = acb_to_s(acb);
-    BlockDriverState *bs = acb->common.bs;
+    BlockDriverState *bs = acb->bs;
 
     /* Adjust offset into cluster */
     offset += qed_offset_into_cluster(s, acb->cur_pos);
@@ -XXX,XX +XXX,XX @@ static int qed_aio_read_data(void *opaque, int ret, uint64_t offset, size_t len)
 /**
  * Begin next I/O or complete the request
  */
-static void qed_aio_next_io(QEDAIOCB *acb)
+static int qed_aio_next_io(QEDAIOCB *acb)
 {
     BDRVQEDState *s = acb_to_s(acb);
     uint64_t offset;
@@ -XXX,XX +XXX,XX @@ static void qed_aio_next_io(QEDAIOCB *acb)
 
         /* Complete request */
         if (acb->cur_pos >= acb->end_pos) {
-            qed_aio_complete(acb, 0);
-            return;
+            ret = 0;
+            break;
         }
 
         /* Find next cluster and start I/O */
         len = acb->end_pos - acb->cur_pos;
         ret = qed_find_cluster(s, &acb->request, acb->cur_pos, &len, &offset);
         if (ret < 0) {
-            qed_aio_complete(acb, ret);
-            return;
+            break;
         }
 
         if (acb->flags & QED_AIOCB_WRITE) {
@@ -XXX,XX +XXX,XX @@ static void qed_aio_next_io(QEDAIOCB *acb)
         }
 
         if (ret < 0 && ret != -EAGAIN) {
-            qed_aio_complete(acb, ret);
-            return;
+            break;
         }
     }
-}
 
-typedef struct QEDRequestCo {
-    Coroutine *co;
-    bool done;
-    int ret;
-} QEDRequestCo;
-
-static void qed_co_request_cb(void *opaque, int ret)
-{
-    QEDRequestCo *co = opaque;
-
-    co->done = true;
-    co->ret = ret;
-    qemu_coroutine_enter_if_inactive(co->co);
+    trace_qed_aio_complete(s, acb, ret);
+    qed_aio_complete(acb);
+    return ret;
 }
 
 static int coroutine_fn qed_co_request(BlockDriverState *bs, int64_t sector_num,
                                        QEMUIOVector *qiov, int nb_sectors,
                                        int flags)
 {
-    QEDRequestCo co = {
-        .co = qemu_coroutine_self(),
-        .done = false,
+    QEDAIOCB acb = {
+        .bs = bs,
+        .cur_pos = (uint64_t) sector_num * BDRV_SECTOR_SIZE,
+        .end_pos = (sector_num + nb_sectors) * BDRV_SECTOR_SIZE,
+        .qiov = qiov,
+        .flags = flags,
     };
-    QEDAIOCB *acb = qemu_aio_get(&qed_aiocb_info, bs, qed_co_request_cb, &co);
-
-    trace_qed_aio_setup(bs->opaque, acb, sector_num, nb_sectors, &co, flags);
+    qemu_iovec_init(&acb.cur_qiov, qiov->niov);
 
-    acb->flags = flags;
-    acb->qiov = qiov;
-    acb->qiov_offset = 0;
-    acb->cur_pos = (uint64_t)sector_num * BDRV_SECTOR_SIZE;
-    acb->end_pos = acb->cur_pos + nb_sectors * BDRV_SECTOR_SIZE;
-    acb->backing_qiov = NULL;
-    acb->request.l2_table = NULL;
-    qemu_iovec_init(&acb->cur_qiov, qiov->niov);
+    trace_qed_aio_setup(bs->opaque, &acb, sector_num, nb_sectors, NULL, flags);
 
     /* Start request */
-    qed_aio_start_io(acb);
-
-    if (!co.done) {
-        qemu_coroutine_yield();
-    }
-
-    return co.ret;
+    return qed_aio_next_io(&acb);
 }
 
 static int coroutine_fn bdrv_qed_co_readv(BlockDriverState *bs,
diff --git a/block/qed.h b/block/qed.h
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.h
+++ b/block/qed.h
@@ -XXX,XX +XXX,XX @@ enum {
 };
 
 typedef struct QEDAIOCB {
-    BlockAIOCB common;
-    int bh_ret;                     /* final return status for completion bh */
+    BlockDriverState *bs;
     QSIMPLEQ_ENTRY(QEDAIOCB) next;  /* next request */
     int flags;                      /* QED_AIOCB_* bits ORed together */
     uint64_t end_pos;               /* request end on block device, in bytes */
--
1.8.3.1

From: Stefan Hajnoczi <stefanha@redhat.com>

The CoQueue API offers thread-safety via the lock argument that
qemu_co_queue_wait() and qemu_co_enter_next() take. BlockBackend
currently does not make use of the lock argument. This means that
multiple threads submitting I/O requests can corrupt the CoQueue's
QSIMPLEQ.

Add a QemuMutex and pass it to CoQueue APIs so that the queue is
protected. While we're at it, also assert that the queue is empty when
the BlockBackend is deleted.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
Message-Id: <20230307210427.269214-4-stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/block-backend.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index XXXXXXX..XXXXXXX 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -XXX,XX +XXX,XX @@ struct BlockBackend {
     QLIST_HEAD(, BlockBackendAioNotifier) aio_notifiers;
 
     int quiesce_counter; /* atomic: written under BQL, read by other threads */
+    QemuMutex queued_requests_lock; /* protects queued_requests */
     CoQueue queued_requests;
     bool disable_request_queuing; /* atomic */
 
@@ -XXX,XX +XXX,XX @@ BlockBackend *blk_new(AioContext *ctx, uint64_t perm, uint64_t shared_perm)
 
     block_acct_init(&blk->stats);
 
+    qemu_mutex_init(&blk->queued_requests_lock);
     qemu_co_queue_init(&blk->queued_requests);
     notifier_list_init(&blk->remove_bs_notifiers);
     notifier_list_init(&blk->insert_bs_notifiers);
@@ -XXX,XX +XXX,XX @@ static void blk_delete(BlockBackend *blk)
     assert(QLIST_EMPTY(&blk->remove_bs_notifiers.notifiers));
     assert(QLIST_EMPTY(&blk->insert_bs_notifiers.notifiers));
     assert(QLIST_EMPTY(&blk->aio_notifiers));
+    assert(qemu_co_queue_empty(&blk->queued_requests));
+    qemu_mutex_destroy(&blk->queued_requests_lock);
     QTAILQ_REMOVE(&block_backends, blk, link);
     drive_info_del(blk->legacy_dinfo);
     block_acct_cleanup(&blk->stats);
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn blk_wait_while_drained(BlockBackend *blk)
 
     if (qatomic_read(&blk->quiesce_counter) &&
         !qatomic_read(&blk->disable_request_queuing)) {
+        /*
+         * Take lock before decrementing in flight counter so main loop thread
+         * waits for us to enqueue ourselves before it can leave the drained
+         * section.
+         */
+        qemu_mutex_lock(&blk->queued_requests_lock);
         blk_dec_in_flight(blk);
-        qemu_co_queue_wait(&blk->queued_requests, NULL);
+        qemu_co_queue_wait(&blk->queued_requests, &blk->queued_requests_lock);
         blk_inc_in_flight(blk);
+        qemu_mutex_unlock(&blk->queued_requests_lock);
     }
 }
 
@@ -XXX,XX +XXX,XX @@ static void blk_root_drained_end(BdrvChild *child)
         if (blk->dev_ops && blk->dev_ops->drained_end) {
             blk->dev_ops->drained_end(blk->dev_opaque);
         }
-        while (qemu_co_enter_next(&blk->queued_requests, NULL)) {
+        qemu_mutex_lock(&blk->queued_requests_lock);
+        while (qemu_co_enter_next(&blk->queued_requests,
+                                  &blk->queued_requests_lock)) {
             /* Resume all queued requests */
         }
+        qemu_mutex_unlock(&blk->queued_requests_lock);
     }
 }
 
--
2.40.0
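
The thread-safe CoQueue usage introduced above follows a general pattern; a
minimal sketch, assuming QEMU's coroutine and thread APIs (qemu_co_queue_wait(),
qemu_co_enter_next(), QemuMutex), with illustrative names rather than the real
block-backend.c code:

    typedef struct {
        QemuMutex lock;    /* protects the CoQueue's internal QSIMPLEQ */
        CoQueue waiters;
    } WaitQueue;

    static void wait_queue_init(WaitQueue *q)
    {
        qemu_mutex_init(&q->lock);
        qemu_co_queue_init(&q->waiters);
    }

    /* Called from a coroutine that must park itself */
    static void coroutine_fn wait_queue_wait(WaitQueue *q)
    {
        qemu_mutex_lock(&q->lock);
        /* drops the lock while asleep and re-acquires it on wakeup */
        qemu_co_queue_wait(&q->waiters, &q->lock);
        qemu_mutex_unlock(&q->lock);
    }

    /* Called by whoever ends the quiescent period */
    static void wait_queue_wake_all(WaitQueue *q)
    {
        qemu_mutex_lock(&q->lock);
        while (qemu_co_enter_next(&q->waiters, &q->lock)) {
            /* resumes one queued coroutine per iteration */
        }
        qemu_mutex_unlock(&q->lock);
    }
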
From: Stefan Hajnoczi <stefanha@redhat.com>

AioContext was designed to allow nested acquire/release calls. It uses
a recursive mutex so callers don't need to worry about nesting...or so
we thought.

BDRV_POLL_WHILE() is used to wait for block I/O requests. It releases
the AioContext temporarily around aio_poll(). This gives IOThreads a
chance to acquire the AioContext to process I/O completions.

It turns out that recursive locking and BDRV_POLL_WHILE() don't mix.
BDRV_POLL_WHILE() only releases the AioContext once, so the IOThread
will not be able to acquire the AioContext if it was acquired
multiple times.

Instead of trying to release AioContext n times in BDRV_POLL_WHILE(),
this patch simply avoids nested locking in save_vmstate(). It's the
simplest fix and we should step back to consider the big picture with
all the recent changes to block layer threading.

This patch is the final fix to solve 'savevm' hanging with -object
iothread.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 migration/savevm.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/migration/savevm.c b/migration/savevm.c
index XXXXXXX..XXXXXXX 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -XXX,XX +XXX,XX @@ int save_snapshot(const char *name, Error **errp)
         goto the_end;
     }
 
+    /* The bdrv_all_create_snapshot() call that follows acquires the AioContext
+     * for itself. BDRV_POLL_WHILE() does not support nested locking because
+     * it only releases the lock once. Therefore synchronous I/O will deadlock
+     * unless we release the AioContext before bdrv_all_create_snapshot().
+     */
+    aio_context_release(aio_context);
+    aio_context = NULL;
+
     ret = bdrv_all_create_snapshot(sn, bs, vm_state_size, &bs);
     if (ret < 0) {
         error_setg(errp, "Error while creating snapshot on '%s'",
@@ -XXX,XX +XXX,XX @@ int save_snapshot(const char *name, Error **errp)
     ret = 0;
 
 the_end:
-    aio_context_release(aio_context);
+    if (aio_context) {
+        aio_context_release(aio_context);
+    }
     if (saved_vm_running) {
         vm_start();
     }
--
1.8.3.1

From: Stefan Hajnoczi <stefanha@redhat.com>

There is no need for the AioContext lock in bdrv_drain_all() because
nothing in AIO_WAIT_WHILE() needs the lock and the condition is atomic.

AIO_WAIT_WHILE_UNLOCKED() has no use for the AioContext parameter other
than performing a check that is nowadays already done by the
GLOBAL_STATE_CODE()/IO_CODE() macros. Set the ctx argument to NULL here
to help us keep track of all converted callers. Eventually all callers
will have been converted and then the argument can be dropped entirely.

Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20230309190855.414275-2-stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/block-backend.c | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index XXXXXXX..XXXXXXX 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -XXX,XX +XXX,XX @@ void blk_drain_all(void)
     bdrv_drain_all_begin();
 
     while ((blk = blk_all_next(blk)) != NULL) {
-        AioContext *ctx = blk_get_aio_context(blk);
-
-        aio_context_acquire(ctx);
-
         /* We may have -ENOMEDIUM completions in flight */
-        AIO_WAIT_WHILE(ctx, qatomic_read(&blk->in_flight) > 0);
-
-        aio_context_release(ctx);
+        AIO_WAIT_WHILE_UNLOCKED(NULL, qatomic_read(&blk->in_flight) > 0);
     }
 
     bdrv_drain_all_end();
--
2.40.0
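
The nested-locking hazard described in the savevm patch above can be
reproduced with any recursive mutex. A self-contained sketch in plain
pthreads (deliberately not QEMU code) of why "acquire twice, release once"
leaves the lock held:

    #include <pthread.h>

    int main(void)
    {
        pthread_mutexattr_t attr;
        pthread_mutex_t m;

        pthread_mutexattr_init(&attr);
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
        pthread_mutex_init(&m, &attr);

        pthread_mutex_lock(&m);    /* outer acquire, e.g. the monitor    */
        pthread_mutex_lock(&m);    /* nested acquire, e.g. save_vmstate  */
        pthread_mutex_unlock(&m);  /* BDRV_POLL_WHILE() releases once    */

        /* The recursion count is still 1 here: pthread_mutex_lock(&m)
         * in any other thread (the IOThread in this bug) would block
         * forever, which is exactly the reported 'savevm' hang. */

        pthread_mutex_unlock(&m);  /* what the fix effectively adds      */
        pthread_mutex_destroy(&m);
        pthread_mutexattr_destroy(&attr);
        return 0;
    }
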
With this change, qed_aio_write_prefill() and qed_aio_write_postfill()
collapse into a single function. This is reflected by a rename of the
combined function to qed_aio_write_cow().

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/qed.c | 57 +++++++++++++++++++++++----------------------------------
 1 file changed, 23 insertions(+), 34 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@ static int qed_read_backing_file(BDRVQEDState *s, uint64_t pos,
  * @pos: Byte position in device
  * @len: Number of bytes
  * @offset: Byte offset in image file
- * @cb: Completion function
- * @opaque: User data for completion function
 */
-static void qed_copy_from_backing_file(BDRVQEDState *s, uint64_t pos,
-                                       uint64_t len, uint64_t offset,
-                                       BlockCompletionFunc *cb,
-                                       void *opaque)
+static int qed_copy_from_backing_file(BDRVQEDState *s, uint64_t pos,
+                                      uint64_t len, uint64_t offset)
 {
     QEMUIOVector qiov;
     QEMUIOVector *backing_qiov = NULL;
@@ -XXX,XX +XXX,XX @@ static void qed_copy_from_backing_file(BDRVQEDState *s, uint64_t pos,
 
     /* Skip copy entirely if there is no work to do */
     if (len == 0) {
-        cb(opaque, 0);
-        return;
+        return 0;
     }
 
     iov = (struct iovec) {
@@ -XXX,XX +XXX,XX @@ static void qed_copy_from_backing_file(BDRVQEDState *s, uint64_t pos,
     ret = 0;
 out:
     qemu_vfree(iov.iov_base);
-    cb(opaque, ret);
+    return ret;
 }
 
 /**
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_main(void *opaque, int ret)
 }
 
 /**
- * Populate back untouched region of new data cluster
+ * Populate untouched regions of new data cluster
 */
-static void qed_aio_write_postfill(void *opaque, int ret)
+static void qed_aio_write_cow(void *opaque, int ret)
 {
     QEDAIOCB *acb = opaque;
     BDRVQEDState *s = acb_to_s(acb);
-    uint64_t start = acb->cur_pos + acb->cur_qiov.size;
-    uint64_t len =
-        qed_start_of_cluster(s, start + s->header.cluster_size - 1) - start;
-    uint64_t offset = acb->cur_cluster +
-                      qed_offset_into_cluster(s, acb->cur_pos) +
-                      acb->cur_qiov.size;
+    uint64_t start, len, offset;
+
+    /* Populate front untouched region of new data cluster */
+    start = qed_start_of_cluster(s, acb->cur_pos);
+    len = qed_offset_into_cluster(s, acb->cur_pos);
 
+    trace_qed_aio_write_prefill(s, acb, start, len, acb->cur_cluster);
+    ret = qed_copy_from_backing_file(s, start, len, acb->cur_cluster);
     if (ret) {
         qed_aio_complete(acb, ret);
         return;
     }
 
-    trace_qed_aio_write_postfill(s, acb, start, len, offset);
-    qed_copy_from_backing_file(s, start, len, offset,
-                               qed_aio_write_main, acb);
-}
+    /* Populate back untouched region of new data cluster */
+    start = acb->cur_pos + acb->cur_qiov.size;
+    len = qed_start_of_cluster(s, start + s->header.cluster_size - 1) - start;
+    offset = acb->cur_cluster +
+             qed_offset_into_cluster(s, acb->cur_pos) +
+             acb->cur_qiov.size;
 
-/**
- * Populate front untouched region of new data cluster
- */
-static void qed_aio_write_prefill(void *opaque, int ret)
-{
-    QEDAIOCB *acb = opaque;
-    BDRVQEDState *s = acb_to_s(acb);
-    uint64_t start = qed_start_of_cluster(s, acb->cur_pos);
-    uint64_t len = qed_offset_into_cluster(s, acb->cur_pos);
+    trace_qed_aio_write_postfill(s, acb, start, len, offset);
+    ret = qed_copy_from_backing_file(s, start, len, offset);
 
-    trace_qed_aio_write_prefill(s, acb, start, len, acb->cur_cluster);
-    qed_copy_from_backing_file(s, start, len, acb->cur_cluster,
-                               qed_aio_write_postfill, acb);
+    qed_aio_write_main(acb, ret);
 }
 
 /**
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
 
         cb = qed_aio_write_zero_cluster;
     } else {
-        cb = qed_aio_write_prefill;
+        cb = qed_aio_write_cow;
         acb->cur_cluster = qed_alloc_clusters(s, acb->cur_nclusters);
     }
 
--
1.8.3.1

From: Stefan Hajnoczi <stefanha@redhat.com>

There is no change in behavior. Switch to AIO_WAIT_WHILE_UNLOCKED()
instead of AIO_WAIT_WHILE() to document that this code has already been
audited and converted. The AioContext argument is already NULL so
aio_context_release() is never called anyway.

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Tested-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20230309190855.414275-3-stefanha@redhat.com>
Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/export/export.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/export/export.c b/block/export/export.c
index XXXXXXX..XXXXXXX 100644
--- a/block/export/export.c
+++ b/block/export/export.c
@@ -XXX,XX +XXX,XX @@ void blk_exp_close_all_type(BlockExportType type)
         blk_exp_request_shutdown(exp);
     }
 
-    AIO_WAIT_WHILE(NULL, blk_exp_has_type(type));
+    AIO_WAIT_WHILE_UNLOCKED(NULL, blk_exp_has_type(type));
 }
 
 void blk_exp_close_all(void)
--
2.40.0
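
The qed patch above is one instance of a conversion applied throughout the
series: once the caller itself runs in a coroutine, a helper that reported
completion through a callback can simply return its result. A before/after
sketch with hypothetical names (State and do_copy are stand-ins, not the
actual qed code; BlockCompletionFunc is QEMU's real callback typedef):

    typedef struct State State;
    static int do_copy(State *s, uint64_t pos, uint64_t len);

    /* Before: the result can only be delivered asynchronously. */
    static void copy_region(State *s, uint64_t pos, uint64_t len,
                            BlockCompletionFunc *cb, void *opaque)
    {
        int ret = do_copy(s, pos, len);
        cb(opaque, ret);               /* caller resumes in the callback */
    }

    /* After: the coroutine just gets the value back. */
    static int coroutine_fn copy_region_co(State *s, uint64_t pos,
                                           uint64_t len)
    {
        return do_copy(s, pos, len);   /* may yield internally */
    }
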
From: Stefan Hajnoczi <stefanha@redhat.com>

blk/bdrv_drain_all() only takes effect for a single instant and then
resumes block jobs, guest devices, and other external clients like the
NBD server. This can be handy when performing a synchronous drain
before terminating the program, for example.

Monitor commands usually need to quiesce I/O across an entire code
region so blk/bdrv_drain_all() is not suitable. They must use
bdrv_drain_all_begin/end() to mark the region. This prevents new I/O
requests from slipping in or worse - block jobs completing and modifying
the graph.

I audited other blk/bdrv_drain_all() callers but did not find anything
that needs a similar fix. This patch fixes the savevm/loadvm commands.
Although I haven't encountered a real-world issue, this makes the code
safer.

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 migration/savevm.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/migration/savevm.c b/migration/savevm.c
index XXXXXXX..XXXXXXX 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -XXX,XX +XXX,XX @@ int save_snapshot(const char *name, Error **errp)
     }
     vm_stop(RUN_STATE_SAVE_VM);
 
+    bdrv_drain_all_begin();
+
     aio_context_acquire(aio_context);
 
     memset(sn, 0, sizeof(*sn));
@@ -XXX,XX +XXX,XX @@ int save_snapshot(const char *name, Error **errp)
     if (aio_context) {
         aio_context_release(aio_context);
     }
+
+    bdrv_drain_all_end();
+
     if (saved_vm_running) {
         vm_start();
     }
@@ -XXX,XX +XXX,XX @@ int load_snapshot(const char *name, Error **errp)
     }
 
     /* Flush all IO requests so they don't interfere with the new state. */
-    bdrv_drain_all();
+    bdrv_drain_all_begin();
 
     ret = bdrv_all_goto_snapshot(name, &bs);
     if (ret < 0) {
         error_setg(errp, "Error %d while activating snapshot '%s' on '%s'",
                    ret, name, bdrv_get_device_name(bs));
-        return ret;
+        goto err_drain;
     }
 
     /* restore the VM state */
     f = qemu_fopen_bdrv(bs_vm_state, 0);
     if (!f) {
         error_setg(errp, "Could not open VM state file");
-        return -EINVAL;
+        ret = -EINVAL;
+        goto err_drain;
     }
 
     qemu_system_reset(SHUTDOWN_CAUSE_NONE);
@@ -XXX,XX +XXX,XX @@ int load_snapshot(const char *name, Error **errp)
     ret = qemu_loadvm_state(f);
     aio_context_release(aio_context);
 
+    bdrv_drain_all_end();
+
     migration_incoming_state_destroy();
     if (ret < 0) {
         error_setg(errp, "Error %d while loading VM state", ret);
@@ -XXX,XX +XXX,XX @@ int load_snapshot(const char *name, Error **errp)
     }
 
     return 0;
+
+err_drain:
+    bdrv_drain_all_end();
+    return ret;
 }
 
 void vmstate_register_ram(MemoryRegion *mr, DeviceState *dev)
--
1.8.3.1

From: Stefan Hajnoczi <stefanha@redhat.com>

The following conversion is safe and does not change behavior:

     GLOBAL_STATE_CODE();
     ...
-    AIO_WAIT_WHILE(qemu_get_aio_context(), ...);
+    AIO_WAIT_WHILE_UNLOCKED(NULL, ...);

Since we're in GLOBAL_STATE_CODE(), qemu_get_aio_context() is our home
thread's AioContext. Thus AIO_WAIT_WHILE() does not unlock the
AioContext:

    if (ctx_ && in_aio_context_home_thread(ctx_)) {                \
        while ((cond)) {                                           \
            aio_poll(ctx_, true);                                  \
            waited_ = true;                                        \
        }                                                          \

And that means AIO_WAIT_WHILE_UNLOCKED(NULL, ...) can be substituted.

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Tested-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20230309190855.414275-4-stefanha@redhat.com>
Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/graph-lock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/graph-lock.c b/block/graph-lock.c
index XXXXXXX..XXXXXXX 100644
--- a/block/graph-lock.c
+++ b/block/graph-lock.c
@@ -XXX,XX +XXX,XX @@ void bdrv_graph_wrlock(void)
          * reader lock.
          */
         qatomic_set(&has_writer, 0);
-        AIO_WAIT_WHILE(qemu_get_aio_context(), reader_count() >= 1);
+        AIO_WAIT_WHILE_UNLOCKED(NULL, reader_count() >= 1);
         qatomic_set(&has_writer, 1);
 
         /*
--
2.40.0
diff view generated by jsdifflib
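The error-path rewrite in the savevm patch above (early returns becoming
goto err_drain) is the usual C cleanup idiom. A minimal, self-contained
sketch of the invariant it restores - every exit taken after drain_begin()
passes through drain_end() - using stand-in functions, not QEMU's real
bdrv_drain_all_begin/end():

    /* Toy model of the paired begin/end pattern; drain_begin, drain_end
     * and activate are illustrative stand-ins. */
    #include <stdio.h>

    static int depth;
    static void drain_begin(void) { depth++; }
    static void drain_end(void)   { depth--; }
    static int activate(int fail) { return fail ? -1 : 0; }

    static int load_with_drain(int fail)
    {
        int ret;

        drain_begin();
        ret = activate(fail);
        if (ret < 0) {
            goto err_drain;      /* was "return ret" before the fix */
        }
        drain_end();
        return 0;

    err_drain:
        drain_end();             /* keep begin/end balanced on errors */
        return ret;
    }

    int main(void)
    {
        load_with_drain(1);
        printf("depth after error path: %d\n", depth); /* prints 0 */
        return 0;
    }
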
From: Stefan Hajnoczi <stefanha@redhat.com>

Calling aio_poll() directly may have been fine previously, but this is
the future, man! The difference between an aio_poll() loop and
BDRV_POLL_WHILE() is that BDRV_POLL_WHILE() releases the AioContext
around aio_poll().

This allows the IOThread to run fd handlers or BHs to complete the
request. Failure to release the AioContext causes deadlocks.

Using BDRV_POLL_WHILE() partially fixes a 'savevm' hang with -object
iothread.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
block/io.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/block/io.c b/block/io.c
index XXXXXXX..XXXXXXX 100644
--- a/block/io.c
+++ b/block/io.c
@@ -XXX,XX +XXX,XX @@ bdrv_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
Coroutine *co = qemu_coroutine_create(bdrv_co_rw_vmstate_entry, &data);

bdrv_coroutine_enter(bs, co);
- while (data.ret == -EINPROGRESS) {
- aio_poll(bdrv_get_aio_context(bs), true);
- }
+ BDRV_POLL_WHILE(bs, data.ret == -EINPROGRESS);
return data.ret;
}
--
1.8.3.1

From: Stefan Hajnoczi <stefanha@redhat.com>

Since the AioContext argument was already NULL, AIO_WAIT_WHILE() was
never going to unlock the AioContext. Therefore it is possible to
replace AIO_WAIT_WHILE() with AIO_WAIT_WHILE_UNLOCKED().

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Tested-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20230309190855.414275-5-stefanha@redhat.com>
Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
block/io.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/io.c b/block/io.c
index XXXXXXX..XXXXXXX 100644
--- a/block/io.c
+++ b/block/io.c
@@ -XXX,XX +XXX,XX @@ void bdrv_drain_all_begin(void)
bdrv_drain_all_begin_nopoll();

/* Now poll the in-flight requests */
- AIO_WAIT_WHILE(NULL, bdrv_drain_all_poll());
+ AIO_WAIT_WHILE_UNLOCKED(NULL, bdrv_drain_all_poll());

while ((bs = bdrv_next_all_states(bs))) {
bdrv_drain_assert_idle(bs);
}
--
2.40.0

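In the home thread, the AIO_WAIT_WHILE() body quoted two patches above
reduces to a plain poll loop: keep servicing events until the condition
clears. A minimal model, with service_events() as a stand-in for
aio_poll(ctx, true) and no QEMU types involved:

    #include <stdio.h>

    static int in_flight = 3;

    static void service_events(void)
    {
        in_flight--;               /* pretend one request completed */
    }

    #define WAIT_WHILE_UNLOCKED(cond)  \
        do {                           \
            while (cond) {             \
                service_events();      \
            }                          \
        } while (0)

    int main(void)
    {
        WAIT_WHILE_UNLOCKED(in_flight > 0);
        printf("in_flight=%d\n", in_flight);   /* prints 0 */
        return 0;
    }
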
Deleted patch

This adds documentation for the -blockdev options that apply to all
nodes independent of the block driver used.

All options that are shared by -blockdev and -drive are now explained in
the section for -blockdev. The documentation of -drive mentions that all
-blockdev options are accepted as well.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
qemu-options.hx | 108 +++++++++++++++++++++++++++++++++++++++++---------------
1 file changed, 79 insertions(+), 29 deletions(-)

diff --git a/qemu-options.hx b/qemu-options.hx
index XXXXXXX..XXXXXXX 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -XXX,XX +XXX,XX @@ DEF("blockdev", HAS_ARG, QEMU_OPTION_blockdev,
" [,read-only=on|off][,detect-zeroes=on|off|unmap]\n"
" [,driver specific parameters...]\n"
" configure a block backend\n", QEMU_ARCH_ALL)
+STEXI
+@item -blockdev @var{option}[,@var{option}[,@var{option}[,...]]]
+@findex -blockdev
+
+Define a new block driver node.
+
+@table @option
+@item Valid options for any block driver node:
+
+@table @code
+@item driver
+Specifies the block driver to use for the given node.
+@item node-name
+This defines the name of the block driver node by which it will be referenced
+later. The name must be unique, i.e. it must not match the name of a different
+block driver node, or (if you use @option{-drive} as well) the ID of a drive.
+
+If no node name is specified, it is automatically generated. The generated node
+name is not intended to be predictable and changes between QEMU invocations.
+For the top level, an explicit node name must be specified.
+@item read-only
+Open the node read-only. Guest write attempts will fail.
+@item cache.direct
+The host page cache can be avoided with @option{cache.direct=on}. This will
+attempt to do disk IO directly to the guest's memory. QEMU may still perform an
+internal copy of the data.
+@item cache.no-flush
+In case you don't care about data integrity over host failures, you can use
+@option{cache.no-flush=on}. This option tells QEMU that it never needs to write
+any data to the disk but can instead keep things in cache. If anything goes
+wrong, like your host losing power, the disk storage getting disconnected
+accidentally, etc. your image will most probably be rendered unusable.
+@item discard=@var{discard}
+@var{discard} is one of "ignore" (or "off") or "unmap" (or "on") and controls
+whether @code{discard} (also known as @code{trim} or @code{unmap}) requests are
+ignored or passed to the filesystem. Some machine types may not support
+discard requests.
+@item detect-zeroes=@var{detect-zeroes}
+@var{detect-zeroes} is "off", "on" or "unmap" and enables the automatic
+conversion of plain zero writes by the OS to driver specific optimized
+zero write commands. You may even choose "unmap" if @var{discard} is set
+to "unmap" to allow a zero write to be converted to an @code{unmap} operation.
+@end table
+
+@end table
+
+ETEXI

DEF("drive", HAS_ARG, QEMU_OPTION_drive,
"-drive [file=file][,if=type][,bus=n][,unit=m][,media=d][,index=i]\n"
@@ -XXX,XX +XXX,XX @@ STEXI
@item -drive @var{option}[,@var{option}[,@var{option}[,...]]]
@findex -drive

-Define a new drive. Valid options are:
+Define a new drive. This includes creating a block driver node (the backend) as
+well as a guest device, and is mostly a shortcut for defining the corresponding
+@option{-blockdev} and @option{-device} options.
+
+@option{-drive} accepts all options that are accepted by @option{-blockdev}. In
+addition, it knows the following options:

@table @option
@item file=@var{file}
@@ -XXX,XX +XXX,XX @@ These options have the same definition as they have in @option{-hdachs}.
@var{snapshot} is "on" or "off" and controls snapshot mode for the given drive
(see @option{-snapshot}).
@item cache=@var{cache}
-@var{cache} is "none", "writeback", "unsafe", "directsync" or "writethrough" and controls how the host cache is used to access block data.
+@var{cache} is "none", "writeback", "unsafe", "directsync" or "writethrough"
+and controls how the host cache is used to access block data. This is a
+shortcut that sets the @option{cache.direct} and @option{cache.no-flush}
+options (as in @option{-blockdev}), and additionally @option{cache.writeback},
+which provides a default for the @option{write-cache} option of block guest
+devices (as in @option{-device}). The modes correspond to the following
+settings:
+
+@c Our texi2pod.pl script doesn't support @multitable, so fall back to using
+@c plain ASCII art (well, UTF-8 art really). This looks okay both in the manpage
+@c and the HTML output.
+@example
+@ │ cache.writeback cache.direct cache.no-flush
+─────────────┼─────────────────────────────────────────────────
+writeback │ on off off
+none │ on on off
+writethrough │ off off off
+directsync │ off on off
+unsafe │ on off on
+@end example
+
+The default mode is @option{cache=writeback}.
+
@item aio=@var{aio}
@var{aio} is "threads", or "native" and selects between pthread based disk I/O and native Linux AIO.
-@item discard=@var{discard}
-@var{discard} is one of "ignore" (or "off") or "unmap" (or "on") and controls whether @dfn{discard} (also known as @dfn{trim} or @dfn{unmap}) requests are ignored or passed to the filesystem. Some machine types may not support discard requests.
@item format=@var{format}
Specify which disk @var{format} will be used rather than detecting
the format. Can be used to specify format=raw to avoid interpreting
@@ -XXX,XX +XXX,XX @@ Specify which @var{action} to take on write and read errors. Valid actions are:
"report" (report the error to the guest), "enospc" (pause QEMU only if the
host disk is full; report the error to the guest otherwise).
The default setting is @option{werror=enospc} and @option{rerror=report}.
-@item readonly
-Open drive @option{file} as read-only. Guest write attempts will fail.
@item copy-on-read=@var{copy-on-read}
@var{copy-on-read} is "on" or "off" and enables whether to copy read backing
file sectors into the image file.
-@item detect-zeroes=@var{detect-zeroes}
-@var{detect-zeroes} is "off", "on" or "unmap" and enables the automatic
-conversion of plain zero writes by the OS to driver specific optimized
-zero write commands. You may even choose "unmap" if @var{discard} is set
-to "unmap" to allow a zero write to be converted to an UNMAP operation.
@item bps=@var{b},bps_rd=@var{r},bps_wr=@var{w}
Specify bandwidth throttling limits in bytes per second, either for all request
types or for reads or writes only. Small values can lead to timeouts or hangs
@@ -XXX,XX +XXX,XX @@ prevent guests from circumventing throttling limits by using many small disks
instead of a single larger disk.
@end table

-By default, the @option{cache=writeback} mode is used. It will report data
+By default, the @option{cache.writeback=on} mode is used. It will report data
writes as completed as soon as the data is present in the host page cache.
This is safe as long as your guest OS makes sure to correctly flush disk caches
where needed. If your guest OS does not handle volatile disk write caches
correctly and your host crashes or loses power, then the guest may experience
data corruption.

-For such guests, you should consider using @option{cache=writethrough}. This
+For such guests, you should consider using @option{cache.writeback=off}. This
means that the host page cache will be used to read and write data, but write
notification will be sent to the guest only after QEMU has made sure to flush
each write to the disk. Be aware that this has a major impact on performance.

-The host page cache can be avoided entirely with @option{cache=none}. This will
-attempt to do disk IO directly to the guest's memory. QEMU may still perform
-an internal copy of the data. Note that this is considered a writeback mode and
-the guest OS must handle the disk write cache correctly in order to avoid data
-corruption on host crashes.
-
-The host page cache can be avoided while only sending write notifications to
-the guest when the data has been flushed to the disk using
-@option{cache=directsync}.
-
-In case you don't care about data integrity over host failures, use
-@option{cache=unsafe}. This option tells QEMU that it never needs to write any
-data to the disk but can instead keep things in cache. If anything goes wrong,
-like your host losing power, the disk storage getting disconnected accidentally,
-etc. your image will most probably be rendered unusable. When using
-the @option{-snapshot} option, unsafe caching is always used.
+When using the @option{-snapshot} option, unsafe caching is always used.

Copy-on-read avoids accessing the same backing file sectors repeatedly and is
useful when the backing file is over a slow network. By default copy-on-read
--
1.8.3.1

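The cache mode table in the patch above maps each -drive cache=MODE
shortcut onto three boolean options. The same mapping, restated as a
small stand-alone C table - a plain restatement of the documented
values, not QEMU code:

    #include <stdbool.h>
    #include <stdio.h>

    struct cache_mode { const char *name; bool writeback, direct, no_flush; };

    /* Values copied from the cache mode table documented above. */
    static const struct cache_mode modes[] = {
        { "writeback",    true,  false, false },
        { "none",         true,  true,  false },
        { "writethrough", false, false, false },
        { "directsync",   false, true,  false },
        { "unsafe",       true,  false, true  },
    };

    int main(void)
    {
        for (unsigned i = 0; i < sizeof(modes) / sizeof(modes[0]); i++) {
            printf("%-12s writeback=%d direct=%d no-flush=%d\n",
                   modes[i].name, modes[i].writeback, modes[i].direct,
                   modes[i].no_flush);
        }
        return 0;
    }
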
Deleted patch

This documents the driver-specific options for the raw, qcow2 and file
block drivers for the man page. For everything else, we refer to the
QAPI documentation.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
qemu-options.hx | 115 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 114 insertions(+), 1 deletion(-)

diff --git a/qemu-options.hx b/qemu-options.hx
index XXXXXXX..XXXXXXX 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -XXX,XX +XXX,XX @@ STEXI
@item -blockdev @var{option}[,@var{option}[,@var{option}[,...]]]
@findex -blockdev

-Define a new block driver node.
+Define a new block driver node. Some of the options apply to all block drivers,
+other options are only accepted for a specific block driver. See below for a
+list of generic options and options for the most common block drivers.
+
+Options that expect a reference to another node (e.g. @code{file}) can be
+given in two ways. Either you specify the node name of an already existing node
+(file=@var{node-name}), or you define a new node inline, adding options
+for the referenced node after a dot (file.filename=@var{path},file.aio=native).
+
+A block driver node created with @option{-blockdev} can be used for a guest
+device by specifying its node name for the @code{drive} property in a
+@option{-device} argument that defines a block device.

@table @option
@item Valid options for any block driver node:
@@ -XXX,XX +XXX,XX @@ zero write commands. You may even choose "unmap" if @var{discard} is set
to "unmap" to allow a zero write to be converted to an @code{unmap} operation.
@end table

+@item Driver-specific options for @code{file}
+
+This is the protocol-level block driver for accessing regular files.
+
+@table @code
+@item filename
+The path to the image file in the local filesystem
+@item aio
+Specifies the AIO backend (threads/native, default: threads)
+@end table
+Example:
+@example
+-blockdev driver=file,node-name=disk,filename=disk.img
+@end example
+
+@item Driver-specific options for @code{raw}
+
+This is the image format block driver for raw images. It is usually
+stacked on top of a protocol level block driver such as @code{file}.
+
+@table @code
+@item file
+Reference to or definition of the data source block driver node
+(e.g. a @code{file} driver node)
+@end table
+Example 1:
+@example
+-blockdev driver=file,node-name=disk_file,filename=disk.img
+-blockdev driver=raw,node-name=disk,file=disk_file
+@end example
+Example 2:
+@example
+-blockdev driver=raw,node-name=disk,file.driver=file,file.filename=disk.img
+@end example
+
+@item Driver-specific options for @code{qcow2}
+
+This is the image format block driver for qcow2 images. It is usually
+stacked on top of a protocol level block driver such as @code{file}.
+
+@table @code
+@item file
+Reference to or definition of the data source block driver node
+(e.g. a @code{file} driver node)
+
+@item backing
+Reference to or definition of the backing file block device (default is taken
+from the image file). It is allowed to pass an empty string here in order to
+disable the default backing file.
+
+@item lazy-refcounts
+Whether to enable the lazy refcounts feature (on/off; default is taken from the
+image file)
+
+@item cache-size
+The maximum total size of the L2 table and refcount block caches in bytes
+(default: 1048576 bytes or 8 clusters, whichever is larger)
+
+@item l2-cache-size
+The maximum size of the L2 table cache in bytes
+(default: 4/5 of the total cache size)
+
+@item refcount-cache-size
+The maximum size of the refcount block cache in bytes
+(default: 1/5 of the total cache size)
+
+@item cache-clean-interval
+Clean unused entries in the L2 and refcount caches. The interval is in seconds.
+The default value is 0 and it disables this feature.
+
+@item pass-discard-request
+Whether discard requests to the qcow2 device should be forwarded to the data
+source (on/off; default: on if discard=unmap is specified, off otherwise)
+
+@item pass-discard-snapshot
+Whether discard requests for the data source should be issued when a snapshot
+operation (e.g. deleting a snapshot) frees clusters in the qcow2 file (on/off;
+default: on)
+
+@item pass-discard-other
+Whether discard requests for the data source should be issued on other
+occasions where a cluster gets freed (on/off; default: off)
+
+@item overlap-check
+Which overlap checks to perform for writes to the image
+(none/constant/cached/all; default: cached). For details or finer
+granularity control refer to the QAPI documentation of @code{blockdev-add}.
+@end table
+
+Example 1:
+@example
+-blockdev driver=file,node-name=my_file,filename=/tmp/disk.qcow2
+-blockdev driver=qcow2,node-name=hda,file=my_file,overlap-check=none,cache-size=16777216
+@end example
+Example 2:
+@example
+-blockdev driver=qcow2,node-name=disk,file.driver=http,file.filename=http://example.com/image.qcow2
+@end example
+
+@item Driver-specific options for other drivers
+Please refer to the QAPI documentation of the @code{blockdev-add} QMP command.
+
@end table

ETEXI
--
1.8.3.1

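The qcow2 cache defaults documented above split the total cache 4/5 to
1/5 between the L2 and refcount caches. The arithmetic for the
documented 1 MiB minimum, as a stand-alone snippet (numbers are only
the documented defaults, nothing QEMU-specific):

    #include <stdio.h>

    int main(void)
    {
        unsigned long long total = 1048576;          /* default cache-size */
        unsigned long long l2 = total * 4 / 5;       /* l2-cache-size */
        unsigned long long refcount = total / 5;     /* refcount-cache-size */
        printf("l2=%llu refcount=%llu\n", l2, refcount);
        return 0;
    }
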
From: Stefan Hajnoczi <stefanha@redhat.com>

Old kvm.ko versions only supported a tiny number of ioeventfds so
virtio-pci avoids ioeventfds when kvm_has_many_ioeventfds() returns 0.

Do not check kvm_has_many_ioeventfds() when KVM is disabled since it
always returns 0. Since commit 8c56c1a592b5092d91da8d8943c17777d6462a6f
("memory: emulate ioeventfd") it has been possible to use ioeventfds in
qtest or TCG mode.

This patch makes -device virtio-blk-pci,iothread=iothread0 work even
when KVM is disabled.

I have tested that virtio-blk-pci works under TCG both with and without
iothread.

Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
hw/virtio/virtio-pci.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/virtio/virtio-pci.c
+++ b/hw/virtio/virtio-pci.c
@@ -XXX,XX +XXX,XX @@ static void virtio_pci_realize(PCIDevice *pci_dev, Error **errp)
bool pcie_port = pci_bus_is_express(pci_dev->bus) &&
!pci_bus_is_root(pci_dev->bus);

- if (!kvm_has_many_ioeventfds()) {
+ if (kvm_enabled() && !kvm_has_many_ioeventfds()) {
proxy->flags &= ~VIRTIO_PCI_FLAG_USE_IOEVENTFD;
}

--
1.8.3.1

From: Stefan Hajnoczi <stefanha@redhat.com>

The HMP monitor runs in the main loop thread. Calling
AIO_WAIT_WHILE(qemu_get_aio_context(), ...) from the main loop thread is
equivalent to AIO_WAIT_WHILE_UNLOCKED(NULL, ...) because neither unlocks
the AioContext and the latter's assertion that we're in the main loop
succeeds.

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Tested-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20230309190855.414275-6-stefanha@redhat.com>
Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
monitor/hmp.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/monitor/hmp.c b/monitor/hmp.c
index XXXXXXX..XXXXXXX 100644
--- a/monitor/hmp.c
+++ b/monitor/hmp.c
@@ -XXX,XX +XXX,XX @@ void handle_hmp_command(MonitorHMP *mon, const char *cmdline)
Coroutine *co = qemu_coroutine_create(handle_hmp_command_co, &data);
monitor_set_cur(co, &mon->common);
aio_co_enter(qemu_get_aio_context(), co);
- AIO_WAIT_WHILE(qemu_get_aio_context(), !data.done);
+ AIO_WAIT_WHILE_UNLOCKED(NULL, !data.done);
}

qobject_unref(qdict);
--
2.40.0

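The shape of handle_hmp_command() above - enter a coroutine, then wait
on a completion flag - can be modelled without any QEMU infrastructure.
A toy version in which run_pending() stands in for the event loop that
eventually runs the coroutine to completion:

    #include <stdbool.h>
    #include <stdio.h>

    struct cmd { bool done; int steps; };

    static void run_pending(struct cmd *c)
    {
        if (--c->steps == 0) {
            c->done = true;        /* command handler finished */
        }
    }

    int main(void)
    {
        struct cmd data = { .done = false, .steps = 2 };
        while (!data.done) {       /* AIO_WAIT_WHILE_UNLOCKED(NULL, !data.done) */
            run_pending(&data);
        }
        printf("done after polling\n");
        return 0;
    }
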
From: Alberto Garcia <berto@igalia.com>

We already have functions for doing these calculations, so let's use
them instead of doing everything by hand. This makes the code a bit
more readable.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
block/qcow2-cluster.c | 4 ++--
block/qcow2.c | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -XXX,XX +XXX,XX @@ int qcow2_get_cluster_offset(BlockDriverState *bs, uint64_t offset,

/* find the cluster offset for the given disk offset */

- l2_index = (offset >> s->cluster_bits) & (s->l2_size - 1);
+ l2_index = offset_to_l2_index(s, offset);
*cluster_offset = be64_to_cpu(l2_table[l2_index]);

nb_clusters = size_to_clusters(s, bytes_needed);
@@ -XXX,XX +XXX,XX @@ static int get_cluster_table(BlockDriverState *bs, uint64_t offset,

/* find the cluster offset for the given disk offset */

- l2_index = (offset >> s->cluster_bits) & (s->l2_size - 1);
+ l2_index = offset_to_l2_index(s, offset);

*new_l2_table = l2_table;
*new_l2_index = l2_index;
diff --git a/block/qcow2.c b/block/qcow2.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -XXX,XX +XXX,XX @@ static int validate_table_offset(BlockDriverState *bs, uint64_t offset,
}

/* Tables must be cluster aligned */
- if (offset & (s->cluster_size - 1)) {
+ if (offset_into_cluster(s, offset) != 0) {
return -EINVAL;
}

--
1.8.3.1

From: Stefan Hajnoczi <stefanha@redhat.com>

monitor_cleanup() is called from the main loop thread. Calling
AIO_WAIT_WHILE(qemu_get_aio_context(), ...) from the main loop thread is
equivalent to AIO_WAIT_WHILE_UNLOCKED(NULL, ...) because neither unlocks
the AioContext and the latter's assertion that we're in the main loop
succeeds.

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Tested-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20230309190855.414275-7-stefanha@redhat.com>
Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
monitor/monitor.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/monitor/monitor.c b/monitor/monitor.c
index XXXXXXX..XXXXXXX 100644
--- a/monitor/monitor.c
+++ b/monitor/monitor.c
@@ -XXX,XX +XXX,XX @@ void monitor_cleanup(void)
* We need to poll both qemu_aio_context and iohandler_ctx to make
* sure that the dispatcher coroutine keeps making progress and
* eventually terminates. qemu_aio_context is automatically
- * polled by calling AIO_WAIT_WHILE on it, but we must poll
+ * polled by calling AIO_WAIT_WHILE_UNLOCKED on it, but we must poll
* iohandler_ctx manually.
*
* Letting the iothread continue while shutting down the dispatcher
@@ -XXX,XX +XXX,XX @@ void monitor_cleanup(void)
aio_co_wake(qmp_dispatcher_co);
}

- AIO_WAIT_WHILE(qemu_get_aio_context(),
+ AIO_WAIT_WHILE_UNLOCKED(NULL,
(aio_poll(iohandler_get_aio_context(), false),
qatomic_mb_read(&qmp_dispatcher_co_busy)));

--
2.40.0

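The helpers the qcow2 patch above switches to wrap exactly the bit
arithmetic being removed. A self-contained sketch mirroring those
expressions (the real definitions live in block/qcow2.h; the struct
here is a cut-down stand-in for BDRVQcow2State):

    #include <stdint.h>
    #include <stdio.h>

    struct state { unsigned cluster_bits, cluster_size, l2_size; };

    /* Bytes into the containing cluster: offset & (cluster_size - 1) */
    static int64_t offset_into_cluster(struct state *s, int64_t offset)
    {
        return offset & (s->cluster_size - 1);
    }

    /* Index into the L2 table: (offset >> cluster_bits) & (l2_size - 1) */
    static int offset_to_l2_index(struct state *s, int64_t offset)
    {
        return (offset >> s->cluster_bits) & (s->l2_size - 1);
    }

    int main(void)
    {
        struct state s = { .cluster_bits = 16, .cluster_size = 1 << 16,
                           .l2_size = 1 << 13 };
        int64_t off = (5LL << 16) + 42;   /* cluster 5, byte 42 */
        printf("l2_index=%d into_cluster=%lld\n",
               offset_to_l2_index(&s, off),
               (long long)offset_into_cluster(&s, off));
        return 0;
    }
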
From: Max Reitz <mreitz@redhat.com>

The bs->exact_filename field may not be sufficient to store the full
blkverify node filename. In this case, we should not generate a filename
at all instead of an unusable one.

Cc: qemu-stable@nongnu.org
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20170613172006.19685-3-mreitz@redhat.com
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
---
block/blkverify.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/block/blkverify.c b/block/blkverify.c
index XXXXXXX..XXXXXXX 100644
--- a/block/blkverify.c
+++ b/block/blkverify.c
@@ -XXX,XX +XXX,XX @@ static void blkverify_refresh_filename(BlockDriverState *bs, QDict *options)
if (bs->file->bs->exact_filename[0]
&& s->test_file->bs->exact_filename[0])
{
- snprintf(bs->exact_filename, sizeof(bs->exact_filename),
- "blkverify:%s:%s",
- bs->file->bs->exact_filename,
- s->test_file->bs->exact_filename);
+ int ret = snprintf(bs->exact_filename, sizeof(bs->exact_filename),
+ "blkverify:%s:%s",
+ bs->file->bs->exact_filename,
+ s->test_file->bs->exact_filename);
+ if (ret >= sizeof(bs->exact_filename)) {
+ /* An overflow makes the filename unusable, so do not report any */
+ bs->exact_filename[0] = 0;
+ }
}
}

--
1.8.3.1

From: Wilfred Mallawa <wilfred.mallawa@wdc.com>

Fix up a few minor typos

Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Message-Id: <20230313003744.55476-1-wilfred.mallawa@opensource.wdc.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
include/block/aio-wait.h | 2 +-
include/block/block_int-common.h | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/block/aio-wait.h b/include/block/aio-wait.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/aio-wait.h
+++ b/include/block/aio-wait.h
@@ -XXX,XX +XXX,XX @@ extern AioWait global_aio_wait;
* @ctx: the aio context, or NULL if multiple aio contexts (for which the
* caller does not hold a lock) are involved in the polling condition.
* @cond: wait while this conditional expression is true
- * @unlock: whether to unlock and then lock again @ctx. This apples
+ * @unlock: whether to unlock and then lock again @ctx. This applies
* only when waiting for another AioContext from the main loop.
* Otherwise it's ignored.
*
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -XXX,XX +XXX,XX @@ extern QemuOptsList bdrv_create_opts_simple;
/*
* Common functions that are neither I/O nor Global State.
*
- * See include/block/block-commmon.h for more information about
+ * See include/block/block-common.h for more information about
* the Common API.
*/

--
2.40.0

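The blkverify fix above hinges on snprintf() returning the length it
would have written had the buffer been large enough, so a return value
greater than or equal to the buffer size signals truncation. A
stand-alone demonstration of that check (file names here are made up):

    #include <stdio.h>

    int main(void)
    {
        char buf[16];
        int ret = snprintf(buf, sizeof(buf), "blkverify:%s:%s",
                           "averylongfilename.img", "test.img");
        if (ret >= (int)sizeof(buf)) {
            buf[0] = '\0';   /* unusable if truncated, report nothing */
        }
        printf("result: \"%s\" (needed %d bytes)\n", buf, ret + 1);
        return 0;
    }
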
From: Stefan Hajnoczi <stefanha@redhat.com>

migration_incoming_state_destroy() uses qemu_fclose() on the vmstate
file. Make sure to call it inside an AioContext acquire/release region.

This fixes a 'qemu: qemu_mutex_unlock: Operation not permitted' abort
in loadvm.

This patch closes the vmstate file before ending the drained region.
Previously we closed the vmstate file after ending the drained region.
The order does not matter.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
migration/savevm.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/migration/savevm.c b/migration/savevm.c
index XXXXXXX..XXXXXXX 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -XXX,XX +XXX,XX @@ int load_snapshot(const char *name, Error **errp)

aio_context_acquire(aio_context);
ret = qemu_loadvm_state(f);
+ migration_incoming_state_destroy();
aio_context_release(aio_context);

bdrv_drain_all_end();

- migration_incoming_state_destroy();
if (ret < 0) {
error_setg(errp, "Error %d while loading VM state", ret);
return ret;
--
1.8.3.1

From: Stefan Hajnoczi <stefanha@redhat.com>

Not a coroutine_fn, you say?

static int64_t bdrv_sum_allocated_file_size(BlockDriverState *bs)
{
    BdrvChild *child;
    int64_t child_size, sum = 0;

    QLIST_FOREACH(child, &bs->children, next) {
        if (child->role & (BDRV_CHILD_DATA | BDRV_CHILD_METADATA |
                           BDRV_CHILD_FILTERED))
        {
            child_size = bdrv_co_get_allocated_file_size(child->bs);
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Well what do we have here?!

I rest my case, your honor.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20230308211435.346375-1-stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
block.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block.c b/block.c
index XXXXXXX..XXXXXXX 100644
--- a/block.c
+++ b/block.c
@@ -XXX,XX +XXX,XX @@ exit:
* sums the size of all data-bearing children. (This excludes backing
* children.)
*/
-static int64_t bdrv_sum_allocated_file_size(BlockDriverState *bs)
+static int64_t coroutine_fn bdrv_sum_allocated_file_size(BlockDriverState *bs)
{
BdrvChild *child;
int64_t child_size, sum = 0;
--
2.40.0

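coroutine_fn is first of all an annotation: a coroutine_fn may only be
called from coroutine context, which is exactly the rule the patch
above restores for bdrv_sum_allocated_file_size(). A toy illustration
with a stand-in marker macro (QEMU's real macro can expand to a
static-analysis attribute; everything else here is invented for the
example):

    #include <stdio.h>

    /* Stand-in marker: callable only from coroutine context. */
    #define coroutine_fn

    static int coroutine_fn child_allocated_size(int child)
    {
        return child * 512;
    }

    static int coroutine_fn sum_allocated_size(const int *children, int n)
    {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            /* coroutine_fn calling coroutine_fn: allowed */
            sum += child_allocated_size(children[i]);
        }
        return sum;
    }

    int main(void)
    {
        int kids[] = { 1, 2, 3 };
        printf("%d\n", sum_allocated_size(kids, 3));
        return 0;
    }
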
From: Alberto Garcia <berto@igalia.com>

This patch splits do_perform_cow() into three separate functions to
read, encrypt and write the COW regions.

perform_cow() can now read both regions first, then encrypt them and
finally write them to disk. The memory allocation is also done in
this function now, using one single buffer large enough to hold both
regions.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
block/qcow2-cluster.c | 117 +++++++++++++++++++++++++++++++++++++-------------
1 file changed, 87 insertions(+), 30 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -XXX,XX +XXX,XX @@ int qcow2_encrypt_sectors(BDRVQcow2State *s, int64_t sector_num,
return 0;
}

-static int coroutine_fn do_perform_cow(BlockDriverState *bs,
- uint64_t src_cluster_offset,
- uint64_t cluster_offset,
- unsigned offset_in_cluster,
- unsigned bytes)
+static int coroutine_fn do_perform_cow_read(BlockDriverState *bs,
+ uint64_t src_cluster_offset,
+ unsigned offset_in_cluster,
+ uint8_t *buffer,
+ unsigned bytes)
{
- BDRVQcow2State *s = bs->opaque;
QEMUIOVector qiov;
- struct iovec iov;
+ struct iovec iov = { .iov_base = buffer, .iov_len = bytes };
int ret;

if (bytes == 0) {
return 0;
}

- iov.iov_len = bytes;
- iov.iov_base = qemu_try_blockalign(bs, iov.iov_len);
- if (iov.iov_base == NULL) {
- return -ENOMEM;
- }
-
qemu_iovec_init_external(&qiov, &iov, 1);

BLKDBG_EVENT(bs->file, BLKDBG_COW_READ);

if (!bs->drv) {
- ret = -ENOMEDIUM;
- goto out;
+ return -ENOMEDIUM;
}

/* Call .bdrv_co_readv() directly instead of using the public block-layer
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn do_perform_cow(BlockDriverState *bs,
ret = bs->drv->bdrv_co_preadv(bs, src_cluster_offset + offset_in_cluster,
bytes, &qiov, 0);
if (ret < 0) {
- goto out;
+ return ret;
}

- if (bs->encrypted) {
+ return 0;
+}
+
+static bool coroutine_fn do_perform_cow_encrypt(BlockDriverState *bs,
+ uint64_t src_cluster_offset,
+ unsigned offset_in_cluster,
+ uint8_t *buffer,
+ unsigned bytes)
+{
+ if (bytes && bs->encrypted) {
+ BDRVQcow2State *s = bs->opaque;
int64_t sector = (src_cluster_offset + offset_in_cluster)
>> BDRV_SECTOR_BITS;
assert(s->cipher);
assert((offset_in_cluster & ~BDRV_SECTOR_MASK) == 0);
assert((bytes & ~BDRV_SECTOR_MASK) == 0);
- if (qcow2_encrypt_sectors(s, sector, iov.iov_base, iov.iov_base,
+ if (qcow2_encrypt_sectors(s, sector, buffer, buffer,
bytes >> BDRV_SECTOR_BITS, true, NULL) < 0) {
- ret = -EIO;
- goto out;
+ return false;
}
}
+ return true;
+}
+
+static int coroutine_fn do_perform_cow_write(BlockDriverState *bs,
+ uint64_t cluster_offset,
+ unsigned offset_in_cluster,
+ uint8_t *buffer,
+ unsigned bytes)
+{
+ QEMUIOVector qiov;
+ struct iovec iov = { .iov_base = buffer, .iov_len = bytes };
+ int ret;
+
+ if (bytes == 0) {
+ return 0;
+ }
+
+ qemu_iovec_init_external(&qiov, &iov, 1);

ret = qcow2_pre_write_overlap_check(bs, 0,
cluster_offset + offset_in_cluster, bytes);
if (ret < 0) {
- goto out;
+ return ret;
}

BLKDBG_EVENT(bs->file, BLKDBG_COW_WRITE);
ret = bdrv_co_pwritev(bs->file, cluster_offset + offset_in_cluster,
bytes, &qiov, 0);
if (ret < 0) {
- goto out;
+ return ret;
}

- ret = 0;
-out:
- qemu_vfree(iov.iov_base);
- return ret;
+ return 0;
}


@@ -XXX,XX +XXX,XX @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
BDRVQcow2State *s = bs->opaque;
Qcow2COWRegion *start = &m->cow_start;
Qcow2COWRegion *end = &m->cow_end;
+ unsigned buffer_size;
+ uint8_t *start_buffer, *end_buffer;
int ret;

+ assert(start->nb_bytes <= UINT_MAX - end->nb_bytes);
+
if (start->nb_bytes == 0 && end->nb_bytes == 0) {
return 0;
}

+ /* Reserve a buffer large enough to store the data from both the
+ * start and end COW regions. Add some padding in the middle if
+ * necessary to make sure that the end region is optimally aligned */
+ buffer_size = QEMU_ALIGN_UP(start->nb_bytes, bdrv_opt_mem_align(bs)) +
+ end->nb_bytes;
+ start_buffer = qemu_try_blockalign(bs, buffer_size);
+ if (start_buffer == NULL) {
+ return -ENOMEM;
+ }
+ /* The part of the buffer where the end region is located */
+ end_buffer = start_buffer + buffer_size - end->nb_bytes;
+
qemu_co_mutex_unlock(&s->lock);
- ret = do_perform_cow(bs, m->offset, m->alloc_offset,
- start->offset, start->nb_bytes);
+ /* First we read the existing data from both COW regions */
+ ret = do_perform_cow_read(bs, m->offset, start->offset,
+ start_buffer, start->nb_bytes);
if (ret < 0) {
goto fail;
}

- ret = do_perform_cow(bs, m->offset, m->alloc_offset,
- end->offset, end->nb_bytes);
+ ret = do_perform_cow_read(bs, m->offset, end->offset,
+ end_buffer, end->nb_bytes);
+ if (ret < 0) {
+ goto fail;
+ }
+
+ /* Encrypt the data if necessary before writing it */
+ if (bs->encrypted) {
+ if (!do_perform_cow_encrypt(bs, m->offset, start->offset,
+ start_buffer, start->nb_bytes) ||
+ !do_perform_cow_encrypt(bs, m->offset, end->offset,
+ end_buffer, end->nb_bytes)) {
+ ret = -EIO;
+ goto fail;
+ }
+ }
+
+ /* And now we can write everything */
+ ret = do_perform_cow_write(bs, m->alloc_offset, start->offset,
+ start_buffer, start->nb_bytes);
+ if (ret < 0) {
+ goto fail;
+ }

+ ret = do_perform_cow_write(bs, m->alloc_offset, end->offset,
+ end_buffer, end->nb_bytes);
fail:
qemu_co_mutex_lock(&s->lock);

@@ -XXX,XX +XXX,XX @@ fail:
qcow2_cache_depends_on_flush(s->l2_table_cache);
}

+ qemu_vfree(start_buffer);
return ret;
}

--
1.8.3.1

From: Emanuele Giuseppe Esposito <eesposit@redhat.com>

Remove usage of aio_context_acquire by always submitting asynchronous
AIO to the current thread's LinuxAioState.

In order to prevent mistakes from the caller side, avoid passing LinuxAioState
in laio_io_{plug/unplug} and laio_co_submit, and document the functions
to make clear that they work in the current thread's AioContext.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20230203131731.851116-2-eesposit@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
include/block/aio.h | 4 ----
include/block/raw-aio.h | 18 ++++++++++++------
include/sysemu/block-backend-io.h | 5 +++++
block/file-posix.c | 10 +++-------
block/linux-aio.c | 29 +++++++++++++++++------------
5 files changed, 37 insertions(+), 29 deletions(-)

diff --git a/include/block/aio.h b/include/block/aio.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -XXX,XX +XXX,XX @@ struct AioContext {
struct ThreadPool *thread_pool;

#ifdef CONFIG_LINUX_AIO
- /*
- * State for native Linux AIO. Uses aio_context_acquire/release for
- * locking.
- */
struct LinuxAioState *linux_aio;
#endif
#ifdef CONFIG_LINUX_IO_URING
diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/raw-aio.h
+++ b/include/block/raw-aio.h
@@ -XXX,XX +XXX,XX @@
typedef struct LinuxAioState LinuxAioState;
LinuxAioState *laio_init(Error **errp);
void laio_cleanup(LinuxAioState *s);
-int coroutine_fn laio_co_submit(BlockDriverState *bs, LinuxAioState *s, int fd,
- uint64_t offset, QEMUIOVector *qiov, int type,
- uint64_t dev_max_batch);
+
+/* laio_co_submit: submit I/O requests in the thread's current AioContext. */
+int coroutine_fn laio_co_submit(int fd, uint64_t offset, QEMUIOVector *qiov,
+ int type, uint64_t dev_max_batch);
+
void laio_detach_aio_context(LinuxAioState *s, AioContext *old_context);
void laio_attach_aio_context(LinuxAioState *s, AioContext *new_context);
-void laio_io_plug(BlockDriverState *bs, LinuxAioState *s);
-void laio_io_unplug(BlockDriverState *bs, LinuxAioState *s,
- uint64_t dev_max_batch);
+
+/*
+ * laio_io_plug/unplug work in the thread's current AioContext, therefore the
+ * caller must ensure that they are paired in the same IOThread.
+ */
+void laio_io_plug(void);
+void laio_io_unplug(uint64_t dev_max_batch);
#endif
/* io_uring.c - Linux io_uring implementation */
#ifdef CONFIG_LINUX_IO_URING
diff --git a/include/sysemu/block-backend-io.h b/include/sysemu/block-backend-io.h
index XXXXXXX..XXXXXXX 100644
--- a/include/sysemu/block-backend-io.h
+++ b/include/sysemu/block-backend-io.h
@@ -XXX,XX +XXX,XX @@ void blk_iostatus_set_err(BlockBackend *blk, int error);
int blk_get_max_iov(BlockBackend *blk);
int blk_get_max_hw_iov(BlockBackend *blk);

+/*
+ * blk_io_plug/unplug are thread-local operations. This means that multiple
+ * IOThreads can simultaneously call plug/unplug, but the caller must ensure
+ * that each unplug() is called in the same IOThread of the matching plug().
+ */
void coroutine_fn blk_co_io_plug(BlockBackend *blk);
void co_wrapper blk_io_plug(BlockBackend *blk);

diff --git a/block/file-posix.c b/block/file-posix.c
index XXXXXXX..XXXXXXX 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, uint64_t offset,
#endif
#ifdef CONFIG_LINUX_AIO
} else if (s->use_linux_aio) {
- LinuxAioState *aio = aio_get_linux_aio(bdrv_get_aio_context(bs));
assert(qiov->size == bytes);
- return laio_co_submit(bs, aio, s->fd, offset, qiov, type,
- s->aio_max_batch);
+ return laio_co_submit(s->fd, offset, qiov, type, s->aio_max_batch);
#endif
}

@@ -XXX,XX +XXX,XX @@ static void coroutine_fn raw_co_io_plug(BlockDriverState *bs)
BDRVRawState __attribute__((unused)) *s = bs->opaque;
#ifdef CONFIG_LINUX_AIO
if (s->use_linux_aio) {
- LinuxAioState *aio = aio_get_linux_aio(bdrv_get_aio_context(bs));
- laio_io_plug(bs, aio);
+ laio_io_plug();
}
#endif
#ifdef CONFIG_LINUX_IO_URING
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn raw_co_io_unplug(BlockDriverState *bs)
BDRVRawState __attribute__((unused)) *s = bs->opaque;
#ifdef CONFIG_LINUX_AIO
if (s->use_linux_aio) {
- LinuxAioState *aio = aio_get_linux_aio(bdrv_get_aio_context(bs));
- laio_io_unplug(bs, aio, s->aio_max_batch);
+ laio_io_unplug(s->aio_max_batch);
}
#endif
#ifdef CONFIG_LINUX_IO_URING
diff --git a/block/linux-aio.c b/block/linux-aio.c
index XXXXXXX..XXXXXXX 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -XXX,XX +XXX,XX @@
#include "qemu/coroutine.h"
#include "qapi/error.h"

+/* Only used for assertions. */
+#include "qemu/coroutine_int.h"
+
#include <libaio.h>

/*
@@ -XXX,XX +XXX,XX @@ struct LinuxAioState {
io_context_t ctx;
EventNotifier e;

- /* io queue for submit at batch. Protected by AioContext lock. */
+ /* No locking required, only accessed from AioContext home thread */
LaioQueue io_q;
-
- /* I/O completion processing. Only runs in I/O thread. */
QEMUBH *completion_bh;
int event_idx;
int event_max;
@@ -XXX,XX +XXX,XX @@ static void qemu_laio_process_completion(struct qemu_laiocb *laiocb)
* later. Coroutines cannot be entered recursively so avoid doing
* that!
*/
+ assert(laiocb->co->ctx == laiocb->ctx->aio_context);
if (!qemu_coroutine_entered(laiocb->co)) {
aio_co_wake(laiocb->co);
}
@@ -XXX,XX +XXX,XX @@ static void qemu_laio_process_completions(LinuxAioState *s)

static void qemu_laio_process_completions_and_submit(LinuxAioState *s)
{
- aio_context_acquire(s->aio_context);
qemu_laio_process_completions(s);

if (!s->io_q.plugged && !QSIMPLEQ_EMPTY(&s->io_q.pending)) {
ioq_submit(s);
}
- aio_context_release(s->aio_context);
}

static void qemu_laio_completion_bh(void *opaque)
@@ -XXX,XX +XXX,XX @@ static uint64_t laio_max_batch(LinuxAioState *s, uint64_t dev_max_batch)
return max_batch;
}

-void laio_io_plug(BlockDriverState *bs, LinuxAioState *s)
+void laio_io_plug(void)
{
+ AioContext *ctx = qemu_get_current_aio_context();
+ LinuxAioState *s = aio_get_linux_aio(ctx);
+
s->io_q.plugged++;
}

-void laio_io_unplug(BlockDriverState *bs, LinuxAioState *s,
- uint64_t dev_max_batch)
+void laio_io_unplug(uint64_t dev_max_batch)
{
+ AioContext *ctx = qemu_get_current_aio_context();
+ LinuxAioState *s = aio_get_linux_aio(ctx);
+
assert(s->io_q.plugged);
s->io_q.plugged--;

@@ -XXX,XX +XXX,XX @@ static int laio_do_submit(int fd, struct qemu_laiocb *laiocb, off_t offset,
return 0;
}

-int coroutine_fn laio_co_submit(BlockDriverState *bs, LinuxAioState *s, int fd,
- uint64_t offset, QEMUIOVector *qiov, int type,
- uint64_t dev_max_batch)
+int coroutine_fn laio_co_submit(int fd, uint64_t offset, QEMUIOVector *qiov,
+ int type, uint64_t dev_max_batch)
{
int ret;
+ AioContext *ctx = qemu_get_current_aio_context();
struct qemu_laiocb laiocb = {
.co = qemu_coroutine_self(),
.nbytes = qiov->size,
- .ctx = s,
+ .ctx = aio_get_linux_aio(ctx),
.ret = -EINPROGRESS,
.is_read = (type == QEMU_AIO_READ),
.qiov = qiov,
--
2.40.0

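perform_cow() above sizes one allocation so that the end COW region
starts on an aligned boundary. The layout computation in isolation,
with ALIGN_UP standing in for QEMU_ALIGN_UP and the sizes and alignment
chosen purely for illustration:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define ALIGN_UP(n, a) (((n) + (a) - 1) / (a) * (a))

    int main(void)
    {
        size_t align = 4096;                /* e.g. bdrv_opt_mem_align() */
        size_t start_bytes = 1536, end_bytes = 2560;
        size_t buffer_size = ALIGN_UP(start_bytes, align) + end_bytes;
        /* aligned_alloc needs a size that is a multiple of the alignment */
        uint8_t *start_buffer = aligned_alloc(align,
                                              ALIGN_UP(buffer_size, align));
        uint8_t *end_buffer = start_buffer + buffer_size - end_bytes;

        /* end_buffer lands on the alignment boundary by construction */
        printf("buffer_size=%zu end offset=%td aligned=%d\n",
               buffer_size, end_buffer - start_buffer,
               (end_buffer - start_buffer) % align == 0);
        free(start_buffer);
        return 0;
    }
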
1
From: Emanuele Giuseppe Esposito <eesposit@redhat.com>
2
3
Remove usage of aio_context_acquire by always submitting asynchronous
4
AIO to the current thread's LuringState.
5
6
In order to prevent mistakes from the caller side, avoid passing LuringState
7
in luring_io_{plug/unplug} and luring_co_submit, and document the functions
8
to make clear that they work in the current thread's AioContext.
9
10
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
11
Message-Id: <20230203131731.851116-3-eesposit@redhat.com>
12
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
13
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
1
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
14
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
3
---
15
---
4
block/qed-table.c | 47 ++++++++++++-----------------------------------
16
include/block/aio.h | 4 ----
5
block/qed.c | 12 +++++++-----
17
include/block/raw-aio.h | 15 +++++++++++----
6
block/qed.h | 8 +++-----
18
block/file-posix.c | 12 ++++--------
7
3 files changed, 22 insertions(+), 45 deletions(-)
19
block/io_uring.c | 23 +++++++++++++++--------
20
4 files changed, 30 insertions(+), 24 deletions(-)
8
21
9
diff --git a/block/qed-table.c b/block/qed-table.c
22
diff --git a/include/block/aio.h b/include/block/aio.h
10
index XXXXXXX..XXXXXXX 100644
23
index XXXXXXX..XXXXXXX 100644
11
--- a/block/qed-table.c
24
--- a/include/block/aio.h
12
+++ b/block/qed-table.c
25
+++ b/include/block/aio.h
13
@@ -XXX,XX +XXX,XX @@ out:
26
@@ -XXX,XX +XXX,XX @@ struct AioContext {
14
* @index: Index of first element
27
struct LinuxAioState *linux_aio;
15
* @n: Number of elements
28
#endif
16
* @flush: Whether or not to sync to disk
29
#ifdef CONFIG_LINUX_IO_URING
17
- * @cb: Completion function
30
- /*
18
- * @opaque: Argument for completion function
31
- * State for Linux io_uring. Uses aio_context_acquire/release for
19
*/
32
- * locking.
20
-static void qed_write_table(BDRVQEDState *s, uint64_t offset, QEDTable *table,
33
- */
21
- unsigned int index, unsigned int n, bool flush,
34
struct LuringState *linux_io_uring;
22
- BlockCompletionFunc *cb, void *opaque)
35
23
+static int qed_write_table(BDRVQEDState *s, uint64_t offset, QEDTable *table,
36
/* State for file descriptor monitoring using Linux io_uring */
24
+ unsigned int index, unsigned int n, bool flush)
37
diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
38
index XXXXXXX..XXXXXXX 100644
39
--- a/include/block/raw-aio.h
40
+++ b/include/block/raw-aio.h
41
@@ -XXX,XX +XXX,XX @@ void laio_io_unplug(uint64_t dev_max_batch);
42
typedef struct LuringState LuringState;
43
LuringState *luring_init(Error **errp);
44
void luring_cleanup(LuringState *s);
45
-int coroutine_fn luring_co_submit(BlockDriverState *bs, LuringState *s, int fd,
46
- uint64_t offset, QEMUIOVector *qiov, int type);
47
+
48
+/* luring_co_submit: submit I/O requests in the thread's current AioContext. */
49
+int coroutine_fn luring_co_submit(BlockDriverState *bs, int fd, uint64_t offset,
50
+ QEMUIOVector *qiov, int type);
51
void luring_detach_aio_context(LuringState *s, AioContext *old_context);
52
void luring_attach_aio_context(LuringState *s, AioContext *new_context);
53
-void luring_io_plug(BlockDriverState *bs, LuringState *s);
54
-void luring_io_unplug(BlockDriverState *bs, LuringState *s);
55
+
56
+/*
57
+ * luring_io_plug/unplug work in the thread's current AioContext, therefore the
58
+ * caller must ensure that they are paired in the same IOThread.
59
+ */
60
+void luring_io_plug(void);
61
+void luring_io_unplug(void);
62
#endif
63
64
#ifdef _WIN32
65
diff --git a/block/file-posix.c b/block/file-posix.c
66
index XXXXXXX..XXXXXXX 100644
67
--- a/block/file-posix.c
68
+++ b/block/file-posix.c
69
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, uint64_t offset,
70
type |= QEMU_AIO_MISALIGNED;
71
#ifdef CONFIG_LINUX_IO_URING
72
} else if (s->use_linux_io_uring) {
73
- LuringState *aio = aio_get_linux_io_uring(bdrv_get_aio_context(bs));
74
assert(qiov->size == bytes);
75
- return luring_co_submit(bs, aio, s->fd, offset, qiov, type);
76
+ return luring_co_submit(bs, s->fd, offset, qiov, type);
77
#endif
78
#ifdef CONFIG_LINUX_AIO
79
} else if (s->use_linux_aio) {
80
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn raw_co_io_plug(BlockDriverState *bs)
81
#endif
82
#ifdef CONFIG_LINUX_IO_URING
83
if (s->use_linux_io_uring) {
84
- LuringState *aio = aio_get_linux_io_uring(bdrv_get_aio_context(bs));
85
- luring_io_plug(bs, aio);
86
+ luring_io_plug();
87
}
88
#endif
89
}
90
@@ -XXX,XX +XXX,XX @@ static void coroutine_fn raw_co_io_unplug(BlockDriverState *bs)
91
#endif
92
#ifdef CONFIG_LINUX_IO_URING
93
if (s->use_linux_io_uring) {
94
- LuringState *aio = aio_get_linux_io_uring(bdrv_get_aio_context(bs));
95
- luring_io_unplug(bs, aio);
96
+ luring_io_unplug();
97
}
98
#endif
99
}
100
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn raw_co_flush_to_disk(BlockDriverState *bs)
101
102
#ifdef CONFIG_LINUX_IO_URING
103
if (s->use_linux_io_uring) {
104
- LuringState *aio = aio_get_linux_io_uring(bdrv_get_aio_context(bs));
105
- return luring_co_submit(bs, aio, s->fd, 0, NULL, QEMU_AIO_FLUSH);
106
+ return luring_co_submit(bs, s->fd, 0, NULL, QEMU_AIO_FLUSH);
107
}
108
#endif
109
return raw_thread_pool_submit(bs, handle_aiocb_flush, &acb);
110
diff --git a/block/io_uring.c b/block/io_uring.c
111
index XXXXXXX..XXXXXXX 100644
112
--- a/block/io_uring.c
113
+++ b/block/io_uring.c
114
@@ -XXX,XX +XXX,XX @@
115
#include "qapi/error.h"
116
#include "trace.h"
117
118
+/* Only used for assertions. */
119
+#include "qemu/coroutine_int.h"
120
+
121
/* io_uring ring size */
122
#define MAX_ENTRIES 128
123
124
@@ -XXX,XX +XXX,XX @@ typedef struct LuringState {
125
126
struct io_uring ring;
127
128
- /* io queue for submit at batch. Protected by AioContext lock. */
129
+ /* No locking required, only accessed from AioContext home thread */
130
LuringQueue io_q;
131
132
- /* I/O completion processing. Only runs in I/O thread. */
133
QEMUBH *completion_bh;
134
} LuringState;
135
136
@@ -XXX,XX +XXX,XX @@ end:
137
* eventually runs later. Coroutines cannot be entered recursively
138
* so avoid doing that!
139
*/
140
+ assert(luringcb->co->ctx == s->aio_context);
141
if (!qemu_coroutine_entered(luringcb->co)) {
142
aio_co_wake(luringcb->co);
143
}
144
@@ -XXX,XX +XXX,XX @@ static int ioq_submit(LuringState *s)
145
146
static void luring_process_completions_and_submit(LuringState *s)
25
{
147
{
26
unsigned int sector_mask = BDRV_SECTOR_SIZE / sizeof(uint64_t) - 1;
148
- aio_context_acquire(s->aio_context);
27
unsigned int start, end, i;
149
luring_process_completions(s);
28
@@ -XXX,XX +XXX,XX @@ static void qed_write_table(BDRVQEDState *s, uint64_t offset, QEDTable *table,
150
29
ret = 0;
151
if (!s->io_q.plugged && s->io_q.in_queue > 0) {
30
out:
152
ioq_submit(s);
31
qemu_vfree(new_table);
153
}
32
- cb(opaque, ret);
154
- aio_context_release(s->aio_context);
33
-}
34
-
35
-/**
36
- * Propagate return value from async callback
37
- */
38
-static void qed_sync_cb(void *opaque, int ret)
39
-{
40
- *(int *)opaque = ret;
41
+ return ret;
42
}
155
}
43
156
44
int qed_read_l1_table_sync(BDRVQEDState *s)
157
static void qemu_luring_completion_bh(void *opaque)
45
@@ -XXX,XX +XXX,XX @@ int qed_read_l1_table_sync(BDRVQEDState *s)
158
@@ -XXX,XX +XXX,XX @@ static void ioq_init(LuringQueue *io_q)
46
return qed_read_table(s, s->header.l1_table_offset, s->l1_table);
159
io_q->blocked = false;
47
}
160
}
48
161
49
-void qed_write_l1_table(BDRVQEDState *s, unsigned int index, unsigned int n,
162
-void luring_io_plug(BlockDriverState *bs, LuringState *s)
50
- BlockCompletionFunc *cb, void *opaque)
163
+void luring_io_plug(void)
51
+int qed_write_l1_table(BDRVQEDState *s, unsigned int index, unsigned int n)
52
{
164
{
53
BLKDBG_EVENT(s->bs->file, BLKDBG_L1_UPDATE);
165
+ AioContext *ctx = qemu_get_current_aio_context();
54
- qed_write_table(s, s->header.l1_table_offset,
166
+ LuringState *s = aio_get_linux_io_uring(ctx);
55
- s->l1_table, index, n, false, cb, opaque);
167
trace_luring_io_plug(s);
56
+ return qed_write_table(s, s->header.l1_table_offset,
168
s->io_q.plugged++;
57
+ s->l1_table, index, n, false);
58
}
169
}
59
170
60
int qed_write_l1_table_sync(BDRVQEDState *s, unsigned int index,
171
-void luring_io_unplug(BlockDriverState *bs, LuringState *s)
61
unsigned int n)
172
+void luring_io_unplug(void)
62
{
173
{
63
- int ret = -EINPROGRESS;
174
+ AioContext *ctx = qemu_get_current_aio_context();
64
-
175
+ LuringState *s = aio_get_linux_io_uring(ctx);
65
- qed_write_l1_table(s, index, n, qed_sync_cb, &ret);
176
assert(s->io_q.plugged);
66
- BDRV_POLL_WHILE(s->bs, ret == -EINPROGRESS);
177
trace_luring_io_unplug(s, s->io_q.blocked, s->io_q.plugged,
67
-
178
s->io_q.in_queue, s->io_q.in_flight);
68
- return ret;
179
@@ -XXX,XX +XXX,XX @@ static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s,
69
+ return qed_write_l1_table(s, index, n);
180
return 0;
70
}
181
}
71
182
72
int qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset)
183
-int coroutine_fn luring_co_submit(BlockDriverState *bs, LuringState *s, int fd,
73
@@ -XXX,XX +XXX,XX @@ int qed_read_l2_table_sync(BDRVQEDState *s, QEDRequest *request, uint64_t offset
184
- uint64_t offset, QEMUIOVector *qiov, int type)
74
return qed_read_l2_table(s, request, offset);
185
+int coroutine_fn luring_co_submit(BlockDriverState *bs, int fd, uint64_t offset,
75
}
186
+ QEMUIOVector *qiov, int type)
76
77
-void qed_write_l2_table(BDRVQEDState *s, QEDRequest *request,
78
- unsigned int index, unsigned int n, bool flush,
79
- BlockCompletionFunc *cb, void *opaque)
80
+int qed_write_l2_table(BDRVQEDState *s, QEDRequest *request,
81
+ unsigned int index, unsigned int n, bool flush)
82
{
187
{
83
BLKDBG_EVENT(s->bs->file, BLKDBG_L2_UPDATE);
188
int ret;
84
- qed_write_table(s, request->l2_table->offset,
189
+ AioContext *ctx = qemu_get_current_aio_context();
85
- request->l2_table->table, index, n, flush, cb, opaque);
190
+ LuringState *s = aio_get_linux_io_uring(ctx);
86
+ return qed_write_table(s, request->l2_table->offset,
191
LuringAIOCB luringcb = {
87
+ request->l2_table->table, index, n, flush);
192
.co = qemu_coroutine_self(),
88
}
193
.ret = -EINPROGRESS,
89
90
int qed_write_l2_table_sync(BDRVQEDState *s, QEDRequest *request,
91
unsigned int index, unsigned int n, bool flush)
92
{
93
- int ret = -EINPROGRESS;
94
-
95
- qed_write_l2_table(s, request, index, n, flush, qed_sync_cb, &ret);
96
- BDRV_POLL_WHILE(s->bs, ret == -EINPROGRESS);
97
-
98
- return ret;
99
+ return qed_write_l2_table(s, request, index, n, flush);
100
}
101
diff --git a/block/qed.c b/block/qed.c
102
index XXXXXXX..XXXXXXX 100644
103
--- a/block/qed.c
104
+++ b/block/qed.c
105
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_l1_update(void *opaque, int ret)
106
index = qed_l1_index(s, acb->cur_pos);
107
s->l1_table->offsets[index] = acb->request.l2_table->offset;
108
109
- qed_write_l1_table(s, index, 1, qed_commit_l2_update, acb);
110
+ ret = qed_write_l1_table(s, index, 1);
111
+ qed_commit_l2_update(acb, ret);
112
}
113
114
/**
115
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_l2_update(QEDAIOCB *acb, int ret, uint64_t offset)
116
117
if (need_alloc) {
118
/* Write out the whole new L2 table */
119
- qed_write_l2_table(s, &acb->request, 0, s->table_nelems, true,
120
- qed_aio_write_l1_update, acb);
121
+ ret = qed_write_l2_table(s, &acb->request, 0, s->table_nelems, true);
122
+ qed_aio_write_l1_update(acb, ret);
123
} else {
124
/* Write out only the updated part of the L2 table */
125
- qed_write_l2_table(s, &acb->request, index, acb->cur_nclusters, false,
126
- qed_aio_next_io_cb, acb);
127
+ ret = qed_write_l2_table(s, &acb->request, index, acb->cur_nclusters,
128
+ false);
129
+ qed_aio_next_io(acb, ret);
130
}
131
return;
132
133
diff --git a/block/qed.h b/block/qed.h
134
index XXXXXXX..XXXXXXX 100644
135
--- a/block/qed.h
136
+++ b/block/qed.h
137
@@ -XXX,XX +XXX,XX @@ void qed_commit_l2_cache_entry(L2TableCache *l2_cache, CachedL2Table *l2_table);
138
* Table I/O functions
139
*/
140
int qed_read_l1_table_sync(BDRVQEDState *s);
141
-void qed_write_l1_table(BDRVQEDState *s, unsigned int index, unsigned int n,
142
- BlockCompletionFunc *cb, void *opaque);
143
+int qed_write_l1_table(BDRVQEDState *s, unsigned int index, unsigned int n);
144
int qed_write_l1_table_sync(BDRVQEDState *s, unsigned int index,
145
unsigned int n);
146
int qed_read_l2_table_sync(BDRVQEDState *s, QEDRequest *request,
147
uint64_t offset);
148
int qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset);
149
-void qed_write_l2_table(BDRVQEDState *s, QEDRequest *request,
150
- unsigned int index, unsigned int n, bool flush,
151
- BlockCompletionFunc *cb, void *opaque);
152
+int qed_write_l2_table(BDRVQEDState *s, QEDRequest *request,
153
+ unsigned int index, unsigned int n, bool flush);
154
int qed_write_l2_table_sync(BDRVQEDState *s, QEDRequest *request,
155
unsigned int index, unsigned int n, bool flush);
156
157
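In sketch form, the new submission path above: luring_co_submit() now resolves the per-thread LuringState itself rather than taking it as a parameter. A minimal sketch, using only calls visible in the hunks above:

    /* Inside luring_co_submit(): io_uring state belongs to the
     * submitting thread's AioContext and is looked up on demand. */
    AioContext *ctx = qemu_get_current_aio_context();
    LuringState *s = aio_get_linux_io_uring(ctx);
    /* ... fill in the LuringAIOCB and submit on this thread's ring ... */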
--
194
--
158
1.8.3.1
195
2.40.0
159
160
1
From: Emanuele Giuseppe Esposito <eesposit@redhat.com>
2
3
Use qemu_get_current_aio_context() where possible, since we always
4
submit work to the current thread anyway.
5
6
We also want to be sure that the thread submitting the work is
7
the same as the one processing the pool, to avoid adding
8
synchronization to the pool list.
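A minimal sketch of the invariant this establishes at every submission point; the assert is the one added by the util/thread-pool.c hunk below, the rest of the body is abbreviated:

    BlockAIOCB *thread_pool_submit_aio(ThreadPool *pool,
                                       ThreadPoolFunc *func, void *arg,
                                       BlockCompletionFunc *cb, void *opaque)
    {
        /* Submitter and pool share one AioContext, so the request
         * list needs no additional locking. */
        assert(pool->ctx == qemu_get_current_aio_context());
        /* ... allocate and enqueue the ThreadPoolElement ... */
    }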
9
10
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
11
Message-Id: <20230203131731.851116-4-eesposit@redhat.com>
12
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
13
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
1
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
14
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2
Reviewed-by: Eric Blake <eblake@redhat.com>
3
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
4
---
15
---
5
block/qed-cluster.c | 94 ++++++++++++++++++-----------------------------------
16
include/block/thread-pool.h | 5 +++++
6
block/qed-table.c | 15 +++------
17
block/file-posix.c | 21 ++++++++++-----------
7
block/qed.h | 3 +-
18
block/file-win32.c | 2 +-
8
3 files changed, 36 insertions(+), 76 deletions(-)
19
block/qcow2-threads.c | 2 +-
9
20
util/thread-pool.c | 9 ++++-----
10
diff --git a/block/qed-cluster.c b/block/qed-cluster.c
21
5 files changed, 21 insertions(+), 18 deletions(-)
11
index XXXXXXX..XXXXXXX 100644
22
12
--- a/block/qed-cluster.c
23
diff --git a/include/block/thread-pool.h b/include/block/thread-pool.h
13
+++ b/block/qed-cluster.c
24
index XXXXXXX..XXXXXXX 100644
14
@@ -XXX,XX +XXX,XX @@ static unsigned int qed_count_contiguous_clusters(BDRVQEDState *s,
25
--- a/include/block/thread-pool.h
15
return i - index;
26
+++ b/include/block/thread-pool.h
16
}
27
@@ -XXX,XX +XXX,XX @@ typedef struct ThreadPool ThreadPool;
17
28
ThreadPool *thread_pool_new(struct AioContext *ctx);
18
-typedef struct {
29
void thread_pool_free(ThreadPool *pool);
19
- BDRVQEDState *s;
30
20
- uint64_t pos;
31
+/*
21
- size_t len;
32
+ * thread_pool_submit* API: submit I/O requests in the thread's
22
-
33
+ * current AioContext.
23
- QEDRequest *request;
34
+ */
24
-
35
BlockAIOCB *thread_pool_submit_aio(ThreadPool *pool,
25
- /* User callback */
36
ThreadPoolFunc *func, void *arg,
26
- QEDFindClusterFunc *cb;
37
BlockCompletionFunc *cb, void *opaque);
27
- void *opaque;
38
int coroutine_fn thread_pool_submit_co(ThreadPool *pool,
28
-} QEDFindClusterCB;
39
ThreadPoolFunc *func, void *arg);
29
-
40
void thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func, void *arg);
30
-static void qed_find_cluster_cb(void *opaque, int ret)
41
+
31
-{
42
void thread_pool_update_params(ThreadPool *pool, struct AioContext *ctx);
32
- QEDFindClusterCB *find_cluster_cb = opaque;
43
33
- BDRVQEDState *s = find_cluster_cb->s;
44
#endif
34
- QEDRequest *request = find_cluster_cb->request;
45
diff --git a/block/file-posix.c b/block/file-posix.c
35
- uint64_t offset = 0;
46
index XXXXXXX..XXXXXXX 100644
36
- size_t len = 0;
47
--- a/block/file-posix.c
37
- unsigned int index;
48
+++ b/block/file-posix.c
38
- unsigned int n;
39
-
40
- qed_acquire(s);
41
- if (ret) {
42
- goto out;
43
- }
44
-
45
- index = qed_l2_index(s, find_cluster_cb->pos);
46
- n = qed_bytes_to_clusters(s,
47
- qed_offset_into_cluster(s, find_cluster_cb->pos) +
48
- find_cluster_cb->len);
49
- n = qed_count_contiguous_clusters(s, request->l2_table->table,
50
- index, n, &offset);
51
-
52
- if (qed_offset_is_unalloc_cluster(offset)) {
53
- ret = QED_CLUSTER_L2;
54
- } else if (qed_offset_is_zero_cluster(offset)) {
55
- ret = QED_CLUSTER_ZERO;
56
- } else if (qed_check_cluster_offset(s, offset)) {
57
- ret = QED_CLUSTER_FOUND;
58
- } else {
59
- ret = -EINVAL;
60
- }
61
-
62
- len = MIN(find_cluster_cb->len, n * s->header.cluster_size -
63
- qed_offset_into_cluster(s, find_cluster_cb->pos));
64
-
65
-out:
66
- find_cluster_cb->cb(find_cluster_cb->opaque, ret, offset, len);
67
- qed_release(s);
68
- g_free(find_cluster_cb);
69
-}
70
-
71
/**
72
* Find the offset of a data cluster
73
*
74
@@ -XXX,XX +XXX,XX @@ out:
49
@@ -XXX,XX +XXX,XX @@ out:
75
void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
50
return result;
76
size_t len, QEDFindClusterFunc *cb, void *opaque)
51
}
52
53
-static int coroutine_fn raw_thread_pool_submit(BlockDriverState *bs,
54
- ThreadPoolFunc func, void *arg)
55
+static int coroutine_fn raw_thread_pool_submit(ThreadPoolFunc func, void *arg)
77
{
56
{
78
- QEDFindClusterCB *find_cluster_cb;
57
/* @bs can be NULL, bdrv_get_aio_context() returns the main context then */
79
uint64_t l2_offset;
58
- ThreadPool *pool = aio_get_thread_pool(bdrv_get_aio_context(bs));
80
+ uint64_t offset = 0;
59
+ ThreadPool *pool = aio_get_thread_pool(qemu_get_current_aio_context());
81
+ unsigned int index;
60
return thread_pool_submit_co(pool, func, arg);
82
+ unsigned int n;
61
}
83
+ int ret;
62
84
63
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, uint64_t offset,
85
/* Limit length to L2 boundary. Requests are broken up at the L2 boundary
64
};
86
* so that a request acts on one L2 table at a time.
65
87
@@ -XXX,XX +XXX,XX @@ void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
66
assert(qiov->size == bytes);
88
return;
67
- return raw_thread_pool_submit(bs, handle_aiocb_rw, &acb);
89
}
68
+ return raw_thread_pool_submit(handle_aiocb_rw, &acb);
90
69
}
91
- find_cluster_cb = g_malloc(sizeof(*find_cluster_cb));
70
92
- find_cluster_cb->s = s;
71
static int coroutine_fn raw_co_preadv(BlockDriverState *bs, int64_t offset,
93
- find_cluster_cb->pos = pos;
72
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn raw_co_flush_to_disk(BlockDriverState *bs)
94
- find_cluster_cb->len = len;
73
return luring_co_submit(bs, s->fd, 0, NULL, QEMU_AIO_FLUSH);
95
- find_cluster_cb->cb = cb;
74
}
96
- find_cluster_cb->opaque = opaque;
75
#endif
97
- find_cluster_cb->request = request;
76
- return raw_thread_pool_submit(bs, handle_aiocb_flush, &acb);
98
+ ret = qed_read_l2_table(s, request, l2_offset);
77
+ return raw_thread_pool_submit(handle_aiocb_flush, &acb);
99
+ qed_acquire(s);
78
}
100
+ if (ret) {
79
101
+ goto out;
80
static void raw_aio_attach_aio_context(BlockDriverState *bs,
102
+ }
81
@@ -XXX,XX +XXX,XX @@ raw_regular_truncate(BlockDriverState *bs, int fd, int64_t offset,
103
+
82
},
104
+ index = qed_l2_index(s, pos);
83
};
105
+ n = qed_bytes_to_clusters(s,
84
106
+ qed_offset_into_cluster(s, pos) + len);
85
- return raw_thread_pool_submit(bs, handle_aiocb_truncate, &acb);
107
+ n = qed_count_contiguous_clusters(s, request->l2_table->table,
86
+ return raw_thread_pool_submit(handle_aiocb_truncate, &acb);
108
+ index, n, &offset);
87
}
109
+
88
110
+ if (qed_offset_is_unalloc_cluster(offset)) {
89
static int coroutine_fn raw_co_truncate(BlockDriverState *bs, int64_t offset,
111
+ ret = QED_CLUSTER_L2;
90
@@ -XXX,XX +XXX,XX @@ raw_do_pdiscard(BlockDriverState *bs, int64_t offset, int64_t bytes,
112
+ } else if (qed_offset_is_zero_cluster(offset)) {
91
acb.aio_type |= QEMU_AIO_BLKDEV;
113
+ ret = QED_CLUSTER_ZERO;
92
}
114
+ } else if (qed_check_cluster_offset(s, offset)) {
93
115
+ ret = QED_CLUSTER_FOUND;
94
- ret = raw_thread_pool_submit(bs, handle_aiocb_discard, &acb);
116
+ } else {
95
+ ret = raw_thread_pool_submit(handle_aiocb_discard, &acb);
117
+ ret = -EINVAL;
96
raw_account_discard(s, bytes, ret);
118
+ }
119
+
120
+ len = MIN(len,
121
+ n * s->header.cluster_size - qed_offset_into_cluster(s, pos));
122
123
- qed_read_l2_table(s, request, l2_offset,
124
- qed_find_cluster_cb, find_cluster_cb);
125
+out:
126
+ cb(opaque, ret, offset, len);
127
+ qed_release(s);
128
}
129
diff --git a/block/qed-table.c b/block/qed-table.c
130
index XXXXXXX..XXXXXXX 100644
131
--- a/block/qed-table.c
132
+++ b/block/qed-table.c
133
@@ -XXX,XX +XXX,XX @@ int qed_write_l1_table_sync(BDRVQEDState *s, unsigned int index,
134
return ret;
97
return ret;
135
}
98
}
136
99
@@ -XXX,XX +XXX,XX @@ raw_do_pwrite_zeroes(BlockDriverState *bs, int64_t offset, int64_t bytes,
137
-void qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset,
100
handler = handle_aiocb_write_zeroes;
138
- BlockCompletionFunc *cb, void *opaque)
101
}
139
+int qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset)
102
103
- return raw_thread_pool_submit(bs, handler, &acb);
104
+ return raw_thread_pool_submit(handler, &acb);
105
}
106
107
static int coroutine_fn raw_co_pwrite_zeroes(
108
@@ -XXX,XX +XXX,XX @@ raw_co_copy_range_to(BlockDriverState *bs,
109
},
110
};
111
112
- return raw_thread_pool_submit(bs, handle_aiocb_copy_range, &acb);
113
+ return raw_thread_pool_submit(handle_aiocb_copy_range, &acb);
114
}
115
116
BlockDriver bdrv_file = {
117
@@ -XXX,XX +XXX,XX @@ hdev_co_ioctl(BlockDriverState *bs, unsigned long int req, void *buf)
118
struct sg_io_hdr *io_hdr = buf;
119
if (io_hdr->cmdp[0] == PERSISTENT_RESERVE_OUT ||
120
io_hdr->cmdp[0] == PERSISTENT_RESERVE_IN) {
121
- return pr_manager_execute(s->pr_mgr, bdrv_get_aio_context(bs),
122
+ return pr_manager_execute(s->pr_mgr, qemu_get_current_aio_context(),
123
s->fd, io_hdr);
124
}
125
}
126
@@ -XXX,XX +XXX,XX @@ hdev_co_ioctl(BlockDriverState *bs, unsigned long int req, void *buf)
127
},
128
};
129
130
- return raw_thread_pool_submit(bs, handle_aiocb_ioctl, &acb);
131
+ return raw_thread_pool_submit(handle_aiocb_ioctl, &acb);
132
}
133
#endif /* linux */
134
135
diff --git a/block/file-win32.c b/block/file-win32.c
136
index XXXXXXX..XXXXXXX 100644
137
--- a/block/file-win32.c
138
+++ b/block/file-win32.c
139
@@ -XXX,XX +XXX,XX @@ static BlockAIOCB *paio_submit(BlockDriverState *bs, HANDLE hfile,
140
acb->aio_offset = offset;
141
142
trace_file_paio_submit(acb, opaque, offset, count, type);
143
- pool = aio_get_thread_pool(bdrv_get_aio_context(bs));
144
+ pool = aio_get_thread_pool(qemu_get_current_aio_context());
145
return thread_pool_submit_aio(pool, aio_worker, acb, cb, opaque);
146
}
147
148
diff --git a/block/qcow2-threads.c b/block/qcow2-threads.c
149
index XXXXXXX..XXXXXXX 100644
150
--- a/block/qcow2-threads.c
151
+++ b/block/qcow2-threads.c
152
@@ -XXX,XX +XXX,XX @@ qcow2_co_process(BlockDriverState *bs, ThreadPoolFunc *func, void *arg)
140
{
153
{
141
int ret;
154
int ret;
142
155
BDRVQcow2State *s = bs->opaque;
143
@@ -XXX,XX +XXX,XX @@ void qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset,
156
- ThreadPool *pool = aio_get_thread_pool(bdrv_get_aio_context(bs));
144
/* Check for cached L2 entry */
157
+ ThreadPool *pool = aio_get_thread_pool(qemu_get_current_aio_context());
145
request->l2_table = qed_find_l2_cache_entry(&s->l2_cache, offset);
158
146
if (request->l2_table) {
159
qemu_co_mutex_lock(&s->lock);
147
- cb(opaque, 0);
160
while (s->nb_threads >= QCOW2_MAX_THREADS) {
148
- return;
161
diff --git a/util/thread-pool.c b/util/thread-pool.c
149
+ return 0;
162
index XXXXXXX..XXXXXXX 100644
150
}
163
--- a/util/thread-pool.c
151
164
+++ b/util/thread-pool.c
152
request->l2_table = qed_alloc_l2_cache_entry(&s->l2_cache);
165
@@ -XXX,XX +XXX,XX @@ struct ThreadPoolElement {
153
@@ -XXX,XX +XXX,XX @@ void qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset,
166
/* Access to this list is protected by lock. */
154
}
167
QTAILQ_ENTRY(ThreadPoolElement) reqs;
155
qed_release(s);
168
156
169
- /* Access to this list is protected by the global mutex. */
157
- cb(opaque, ret);
170
+ /* This list is only written by the thread pool's mother thread. */
158
+ return ret;
171
QLIST_ENTRY(ThreadPoolElement) all;
159
}
172
};
160
173
161
int qed_read_l2_table_sync(BDRVQEDState *s, QEDRequest *request, uint64_t offset)
174
@@ -XXX,XX +XXX,XX @@ static void thread_pool_completion_bh(void *opaque)
175
ThreadPool *pool = opaque;
176
ThreadPoolElement *elem, *next;
177
178
- aio_context_acquire(pool->ctx);
179
restart:
180
QLIST_FOREACH_SAFE(elem, &pool->head, all, next) {
181
if (elem->state != THREAD_DONE) {
182
@@ -XXX,XX +XXX,XX @@ restart:
183
*/
184
qemu_bh_schedule(pool->completion_bh);
185
186
- aio_context_release(pool->ctx);
187
elem->common.cb(elem->common.opaque, elem->ret);
188
- aio_context_acquire(pool->ctx);
189
190
/* We can safely cancel the completion_bh here regardless of someone
191
* else having scheduled it meanwhile because we reenter the
192
@@ -XXX,XX +XXX,XX @@ restart:
193
qemu_aio_unref(elem);
194
}
195
}
196
- aio_context_release(pool->ctx);
197
}
198
199
static void thread_pool_cancel(BlockAIOCB *acb)
200
@@ -XXX,XX +XXX,XX @@ BlockAIOCB *thread_pool_submit_aio(ThreadPool *pool,
162
{
201
{
163
- int ret = -EINPROGRESS;
202
ThreadPoolElement *req;
164
-
203
165
- qed_read_l2_table(s, request, offset, qed_sync_cb, &ret);
204
+ /* Assert that the thread submitting work is the same running the pool */
166
- BDRV_POLL_WHILE(s->bs, ret == -EINPROGRESS);
205
+ assert(pool->ctx == qemu_get_current_aio_context());
167
-
206
+
168
- return ret;
207
req = qemu_aio_get(&thread_pool_aiocb_info, NULL, cb, opaque);
169
+ return qed_read_l2_table(s, request, offset);
208
req->func = func;
170
}
209
req->arg = arg;
171
172
void qed_write_l2_table(BDRVQEDState *s, QEDRequest *request,
173
diff --git a/block/qed.h b/block/qed.h
174
index XXXXXXX..XXXXXXX 100644
175
--- a/block/qed.h
176
+++ b/block/qed.h
177
@@ -XXX,XX +XXX,XX @@ int qed_write_l1_table_sync(BDRVQEDState *s, unsigned int index,
178
unsigned int n);
179
int qed_read_l2_table_sync(BDRVQEDState *s, QEDRequest *request,
180
uint64_t offset);
181
-void qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset,
182
- BlockCompletionFunc *cb, void *opaque);
183
+int qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset);
184
void qed_write_l2_table(BDRVQEDState *s, QEDRequest *request,
185
unsigned int index, unsigned int n, bool flush,
186
BlockCompletionFunc *cb, void *opaque);
187
--
210
--
188
1.8.3.1
211
2.40.0
189
190
1
From: Alberto Garcia <berto@igalia.com>
1
From: Emanuele Giuseppe Esposito <eesposit@redhat.com>
2
2
3
There used to be throttle_timers_{detach,attach}_aio_context() calls
3
thread_pool_submit_aio() is always called on a pool taken from
4
in bdrv_set_aio_context(), but since 7ca7f0f6db1fedd28d490795d778cf239
4
qemu_get_current_aio_context(), and that is the only intended
5
they are now in blk_set_aio_context().
5
use: each pool runs only in the same thread that is submitting
6
work to it; it can't run anywhere else.
6
7
7
Signed-off-by: Alberto Garcia <berto@igalia.com>
8
Therefore simplify the thread_pool_submit* API and remove the
9
ThreadPool function parameter.
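The call-site change is mechanical; condensed from the hunks below:

    /* Before: every caller fetched the pool explicitly. */
    ThreadPool *pool = aio_get_thread_pool(qemu_get_current_aio_context());
    thread_pool_submit_aio(pool, func, arg, cb, opaque);

    /* After: the pool is derived from the current AioContext internally. */
    thread_pool_submit_aio(func, arg, cb, opaque);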
10
11
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
12
Message-Id: <20230203131731.851116-5-eesposit@redhat.com>
13
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
8
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
14
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
9
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
15
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
10
---
16
---
11
block/throttle-groups.c | 2 +-
17
include/block/thread-pool.h | 10 ++++------
12
1 file changed, 1 insertion(+), 1 deletion(-)
18
backends/tpm/tpm_backend.c | 4 +---
19
block/file-posix.c | 4 +---
20
block/file-win32.c | 4 +---
21
block/qcow2-threads.c | 3 +--
22
hw/9pfs/coth.c | 3 +--
23
hw/ppc/spapr_nvdimm.c | 6 ++----
24
hw/virtio/virtio-pmem.c | 3 +--
25
scsi/pr-manager.c | 3 +--
26
scsi/qemu-pr-helper.c | 3 +--
27
tests/unit/test-thread-pool.c | 12 +++++-------
28
util/thread-pool.c | 16 ++++++++--------
29
12 files changed, 27 insertions(+), 44 deletions(-)
13
30
14
diff --git a/block/throttle-groups.c b/block/throttle-groups.c
31
diff --git a/include/block/thread-pool.h b/include/block/thread-pool.h
15
index XXXXXXX..XXXXXXX 100644
32
index XXXXXXX..XXXXXXX 100644
16
--- a/block/throttle-groups.c
33
--- a/include/block/thread-pool.h
17
+++ b/block/throttle-groups.c
34
+++ b/include/block/thread-pool.h
35
@@ -XXX,XX +XXX,XX @@ void thread_pool_free(ThreadPool *pool);
36
* thread_pool_submit* API: submit I/O requests in the thread's
37
* current AioContext.
38
*/
39
-BlockAIOCB *thread_pool_submit_aio(ThreadPool *pool,
40
- ThreadPoolFunc *func, void *arg,
41
- BlockCompletionFunc *cb, void *opaque);
42
-int coroutine_fn thread_pool_submit_co(ThreadPool *pool,
43
- ThreadPoolFunc *func, void *arg);
44
-void thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func, void *arg);
45
+BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
46
+ BlockCompletionFunc *cb, void *opaque);
47
+int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg);
48
+void thread_pool_submit(ThreadPoolFunc *func, void *arg);
49
50
void thread_pool_update_params(ThreadPool *pool, struct AioContext *ctx);
51
52
diff --git a/backends/tpm/tpm_backend.c b/backends/tpm/tpm_backend.c
53
index XXXXXXX..XXXXXXX 100644
54
--- a/backends/tpm/tpm_backend.c
55
+++ b/backends/tpm/tpm_backend.c
56
@@ -XXX,XX +XXX,XX @@ bool tpm_backend_had_startup_error(TPMBackend *s)
57
58
void tpm_backend_deliver_request(TPMBackend *s, TPMBackendCmd *cmd)
59
{
60
- ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
61
-
62
if (s->cmd != NULL) {
63
error_report("There is a TPM request pending");
64
return;
65
@@ -XXX,XX +XXX,XX @@ void tpm_backend_deliver_request(TPMBackend *s, TPMBackendCmd *cmd)
66
67
s->cmd = cmd;
68
object_ref(OBJECT(s));
69
- thread_pool_submit_aio(pool, tpm_backend_worker_thread, s,
70
+ thread_pool_submit_aio(tpm_backend_worker_thread, s,
71
tpm_backend_request_completed, s);
72
}
73
74
diff --git a/block/file-posix.c b/block/file-posix.c
75
index XXXXXXX..XXXXXXX 100644
76
--- a/block/file-posix.c
77
+++ b/block/file-posix.c
78
@@ -XXX,XX +XXX,XX @@ out:
79
80
static int coroutine_fn raw_thread_pool_submit(ThreadPoolFunc func, void *arg)
81
{
82
- /* @bs can be NULL, bdrv_get_aio_context() returns the main context then */
83
- ThreadPool *pool = aio_get_thread_pool(qemu_get_current_aio_context());
84
- return thread_pool_submit_co(pool, func, arg);
85
+ return thread_pool_submit_co(func, arg);
86
}
87
88
/*
89
diff --git a/block/file-win32.c b/block/file-win32.c
90
index XXXXXXX..XXXXXXX 100644
91
--- a/block/file-win32.c
92
+++ b/block/file-win32.c
93
@@ -XXX,XX +XXX,XX @@ static BlockAIOCB *paio_submit(BlockDriverState *bs, HANDLE hfile,
94
BlockCompletionFunc *cb, void *opaque, int type)
95
{
96
RawWin32AIOData *acb = g_new(RawWin32AIOData, 1);
97
- ThreadPool *pool;
98
99
acb->bs = bs;
100
acb->hfile = hfile;
101
@@ -XXX,XX +XXX,XX @@ static BlockAIOCB *paio_submit(BlockDriverState *bs, HANDLE hfile,
102
acb->aio_offset = offset;
103
104
trace_file_paio_submit(acb, opaque, offset, count, type);
105
- pool = aio_get_thread_pool(qemu_get_current_aio_context());
106
- return thread_pool_submit_aio(pool, aio_worker, acb, cb, opaque);
107
+ return thread_pool_submit_aio(aio_worker, acb, cb, opaque);
108
}
109
110
int qemu_ftruncate64(int fd, int64_t length)
111
diff --git a/block/qcow2-threads.c b/block/qcow2-threads.c
112
index XXXXXXX..XXXXXXX 100644
113
--- a/block/qcow2-threads.c
114
+++ b/block/qcow2-threads.c
115
@@ -XXX,XX +XXX,XX @@ qcow2_co_process(BlockDriverState *bs, ThreadPoolFunc *func, void *arg)
116
{
117
int ret;
118
BDRVQcow2State *s = bs->opaque;
119
- ThreadPool *pool = aio_get_thread_pool(qemu_get_current_aio_context());
120
121
qemu_co_mutex_lock(&s->lock);
122
while (s->nb_threads >= QCOW2_MAX_THREADS) {
123
@@ -XXX,XX +XXX,XX @@ qcow2_co_process(BlockDriverState *bs, ThreadPoolFunc *func, void *arg)
124
s->nb_threads++;
125
qemu_co_mutex_unlock(&s->lock);
126
127
- ret = thread_pool_submit_co(pool, func, arg);
128
+ ret = thread_pool_submit_co(func, arg);
129
130
qemu_co_mutex_lock(&s->lock);
131
s->nb_threads--;
132
diff --git a/hw/9pfs/coth.c b/hw/9pfs/coth.c
133
index XXXXXXX..XXXXXXX 100644
134
--- a/hw/9pfs/coth.c
135
+++ b/hw/9pfs/coth.c
136
@@ -XXX,XX +XXX,XX @@ static int coroutine_enter_func(void *arg)
137
void co_run_in_worker_bh(void *opaque)
138
{
139
Coroutine *co = opaque;
140
- thread_pool_submit_aio(aio_get_thread_pool(qemu_get_aio_context()),
141
- coroutine_enter_func, co, coroutine_enter_cb, co);
142
+ thread_pool_submit_aio(coroutine_enter_func, co, coroutine_enter_cb, co);
143
}
144
diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
145
index XXXXXXX..XXXXXXX 100644
146
--- a/hw/ppc/spapr_nvdimm.c
147
+++ b/hw/ppc/spapr_nvdimm.c
148
@@ -XXX,XX +XXX,XX @@ static int spapr_nvdimm_flush_post_load(void *opaque, int version_id)
149
{
150
SpaprNVDIMMDevice *s_nvdimm = (SpaprNVDIMMDevice *)opaque;
151
SpaprNVDIMMDeviceFlushState *state;
152
- ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
153
HostMemoryBackend *backend = MEMORY_BACKEND(PC_DIMM(s_nvdimm)->hostmem);
154
bool is_pmem = object_property_get_bool(OBJECT(backend), "pmem", NULL);
155
bool pmem_override = object_property_get_bool(OBJECT(s_nvdimm),
156
@@ -XXX,XX +XXX,XX @@ static int spapr_nvdimm_flush_post_load(void *opaque, int version_id)
157
}
158
159
QLIST_FOREACH(state, &s_nvdimm->pending_nvdimm_flush_states, node) {
160
- thread_pool_submit_aio(pool, flush_worker_cb, state,
161
+ thread_pool_submit_aio(flush_worker_cb, state,
162
spapr_nvdimm_flush_completion_cb, state);
163
}
164
165
@@ -XXX,XX +XXX,XX @@ static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
166
PCDIMMDevice *dimm;
167
HostMemoryBackend *backend = NULL;
168
SpaprNVDIMMDeviceFlushState *state;
169
- ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
170
int fd;
171
172
if (!drc || !drc->dev ||
173
@@ -XXX,XX +XXX,XX @@ static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
174
175
state->drcidx = drc_index;
176
177
- thread_pool_submit_aio(pool, flush_worker_cb, state,
178
+ thread_pool_submit_aio(flush_worker_cb, state,
179
spapr_nvdimm_flush_completion_cb, state);
180
181
continue_token = state->continue_token;
182
diff --git a/hw/virtio/virtio-pmem.c b/hw/virtio/virtio-pmem.c
183
index XXXXXXX..XXXXXXX 100644
184
--- a/hw/virtio/virtio-pmem.c
185
+++ b/hw/virtio/virtio-pmem.c
186
@@ -XXX,XX +XXX,XX @@ static void virtio_pmem_flush(VirtIODevice *vdev, VirtQueue *vq)
187
VirtIODeviceRequest *req_data;
188
VirtIOPMEM *pmem = VIRTIO_PMEM(vdev);
189
HostMemoryBackend *backend = MEMORY_BACKEND(pmem->memdev);
190
- ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
191
192
trace_virtio_pmem_flush_request();
193
req_data = virtqueue_pop(vq, sizeof(VirtIODeviceRequest));
194
@@ -XXX,XX +XXX,XX @@ static void virtio_pmem_flush(VirtIODevice *vdev, VirtQueue *vq)
195
req_data->fd = memory_region_get_fd(&backend->mr);
196
req_data->pmem = pmem;
197
req_data->vdev = vdev;
198
- thread_pool_submit_aio(pool, worker_cb, req_data, done_cb, req_data);
199
+ thread_pool_submit_aio(worker_cb, req_data, done_cb, req_data);
200
}
201
202
static void virtio_pmem_get_config(VirtIODevice *vdev, uint8_t *config)
203
diff --git a/scsi/pr-manager.c b/scsi/pr-manager.c
204
index XXXXXXX..XXXXXXX 100644
205
--- a/scsi/pr-manager.c
206
+++ b/scsi/pr-manager.c
207
@@ -XXX,XX +XXX,XX @@ static int pr_manager_worker(void *opaque)
208
int coroutine_fn pr_manager_execute(PRManager *pr_mgr, AioContext *ctx, int fd,
209
struct sg_io_hdr *hdr)
210
{
211
- ThreadPool *pool = aio_get_thread_pool(ctx);
212
PRManagerData data = {
213
.pr_mgr = pr_mgr,
214
.fd = fd,
215
@@ -XXX,XX +XXX,XX @@ int coroutine_fn pr_manager_execute(PRManager *pr_mgr, AioContext *ctx, int fd,
216
217
/* The matching object_unref is in pr_manager_worker. */
218
object_ref(OBJECT(pr_mgr));
219
- return thread_pool_submit_co(pool, pr_manager_worker, &data);
220
+ return thread_pool_submit_co(pr_manager_worker, &data);
221
}
222
223
bool pr_manager_is_connected(PRManager *pr_mgr)
224
diff --git a/scsi/qemu-pr-helper.c b/scsi/qemu-pr-helper.c
225
index XXXXXXX..XXXXXXX 100644
226
--- a/scsi/qemu-pr-helper.c
227
+++ b/scsi/qemu-pr-helper.c
228
@@ -XXX,XX +XXX,XX @@ static int do_sgio_worker(void *opaque)
229
static int do_sgio(int fd, const uint8_t *cdb, uint8_t *sense,
230
uint8_t *buf, int *sz, int dir)
231
{
232
- ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
233
int r;
234
235
PRHelperSGIOData data = {
236
@@ -XXX,XX +XXX,XX @@ static int do_sgio(int fd, const uint8_t *cdb, uint8_t *sense,
237
.dir = dir,
238
};
239
240
- r = thread_pool_submit_co(pool, do_sgio_worker, &data);
241
+ r = thread_pool_submit_co(do_sgio_worker, &data);
242
*sz = data.sz;
243
return r;
244
}
245
diff --git a/tests/unit/test-thread-pool.c b/tests/unit/test-thread-pool.c
246
index XXXXXXX..XXXXXXX 100644
247
--- a/tests/unit/test-thread-pool.c
248
+++ b/tests/unit/test-thread-pool.c
18
@@ -XXX,XX +XXX,XX @@
249
@@ -XXX,XX +XXX,XX @@
19
* Again, all this is handled internally and is mostly transparent to
250
#include "qemu/main-loop.h"
20
* the outside. The 'throttle_timers' field however has an additional
251
21
* constraint because it may be temporarily invalid (see for example
252
static AioContext *ctx;
22
- * bdrv_set_aio_context()). Therefore in this file a thread will
253
-static ThreadPool *pool;
23
+ * blk_set_aio_context()). Therefore in this file a thread will
254
static int active;
24
* access some other BlockBackend's timers only after verifying that
255
25
* that BlockBackend has throttled requests in the queue.
256
typedef struct {
26
*/
257
@@ -XXX,XX +XXX,XX @@ static void done_cb(void *opaque, int ret)
258
static void test_submit(void)
259
{
260
WorkerTestData data = { .n = 0 };
261
- thread_pool_submit(pool, worker_cb, &data);
262
+ thread_pool_submit(worker_cb, &data);
263
while (data.n == 0) {
264
aio_poll(ctx, true);
265
}
266
@@ -XXX,XX +XXX,XX @@ static void test_submit(void)
267
static void test_submit_aio(void)
268
{
269
WorkerTestData data = { .n = 0, .ret = -EINPROGRESS };
270
- data.aiocb = thread_pool_submit_aio(pool, worker_cb, &data,
271
+ data.aiocb = thread_pool_submit_aio(worker_cb, &data,
272
done_cb, &data);
273
274
/* The callbacks are not called until after the first wait. */
275
@@ -XXX,XX +XXX,XX @@ static void co_test_cb(void *opaque)
276
active = 1;
277
data->n = 0;
278
data->ret = -EINPROGRESS;
279
- thread_pool_submit_co(pool, worker_cb, data);
280
+ thread_pool_submit_co(worker_cb, data);
281
282
/* The test continues in test_submit_co, after qemu_coroutine_enter... */
283
284
@@ -XXX,XX +XXX,XX @@ static void test_submit_many(void)
285
for (i = 0; i < 100; i++) {
286
data[i].n = 0;
287
data[i].ret = -EINPROGRESS;
288
- thread_pool_submit_aio(pool, worker_cb, &data[i], done_cb, &data[i]);
289
+ thread_pool_submit_aio(worker_cb, &data[i], done_cb, &data[i]);
290
}
291
292
active = 100;
293
@@ -XXX,XX +XXX,XX @@ static void do_test_cancel(bool sync)
294
for (i = 0; i < 100; i++) {
295
data[i].n = 0;
296
data[i].ret = -EINPROGRESS;
297
- data[i].aiocb = thread_pool_submit_aio(pool, long_cb, &data[i],
298
+ data[i].aiocb = thread_pool_submit_aio(long_cb, &data[i],
299
done_cb, &data[i]);
300
}
301
302
@@ -XXX,XX +XXX,XX @@ int main(int argc, char **argv)
303
{
304
qemu_init_main_loop(&error_abort);
305
ctx = qemu_get_current_aio_context();
306
- pool = aio_get_thread_pool(ctx);
307
308
g_test_init(&argc, &argv, NULL);
309
g_test_add_func("/thread-pool/submit", test_submit);
310
diff --git a/util/thread-pool.c b/util/thread-pool.c
311
index XXXXXXX..XXXXXXX 100644
312
--- a/util/thread-pool.c
313
+++ b/util/thread-pool.c
314
@@ -XXX,XX +XXX,XX @@ static const AIOCBInfo thread_pool_aiocb_info = {
315
.get_aio_context = thread_pool_get_aio_context,
316
};
317
318
-BlockAIOCB *thread_pool_submit_aio(ThreadPool *pool,
319
- ThreadPoolFunc *func, void *arg,
320
- BlockCompletionFunc *cb, void *opaque)
321
+BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
322
+ BlockCompletionFunc *cb, void *opaque)
323
{
324
ThreadPoolElement *req;
325
+ AioContext *ctx = qemu_get_current_aio_context();
326
+ ThreadPool *pool = aio_get_thread_pool(ctx);
327
328
/* Assert that the thread submitting work is the same running the pool */
329
assert(pool->ctx == qemu_get_current_aio_context());
330
@@ -XXX,XX +XXX,XX @@ static void thread_pool_co_cb(void *opaque, int ret)
331
aio_co_wake(co->co);
332
}
333
334
-int coroutine_fn thread_pool_submit_co(ThreadPool *pool, ThreadPoolFunc *func,
335
- void *arg)
336
+int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg)
337
{
338
ThreadPoolCo tpc = { .co = qemu_coroutine_self(), .ret = -EINPROGRESS };
339
assert(qemu_in_coroutine());
340
- thread_pool_submit_aio(pool, func, arg, thread_pool_co_cb, &tpc);
341
+ thread_pool_submit_aio(func, arg, thread_pool_co_cb, &tpc);
342
qemu_coroutine_yield();
343
return tpc.ret;
344
}
345
346
-void thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func, void *arg)
347
+void thread_pool_submit(ThreadPoolFunc *func, void *arg)
348
{
349
- thread_pool_submit_aio(pool, func, arg, NULL, NULL);
350
+ thread_pool_submit_aio(func, arg, NULL, NULL);
351
}
352
353
void thread_pool_update_params(ThreadPool *pool, AioContext *ctx)
27
--
354
--
28
1.8.3.1
355
2.40.0
29
30
1
Now that we stay in coroutine context for the whole request when doing
1
From: Paolo Bonzini <pbonzini@redhat.com>
2
reads or writes, we can add coroutine_fn annotations to many functions
2
3
that can do I/O or yield directly.
3
Functions that can do I/O are prime candidates for being coroutine_fns. Make the
4
4
change for those that are themselves called only from coroutine_fns.
5
6
In addition, coroutine_fns should do I/O using bdrv_co_*() functions, for
7
which it is required to hold the BlockDriverState graph lock. So also annotate
8
functions on the I/O path with TSA attributes, making it possible to
9
switch them to use bdrv_co_*() functions.
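A representative conversion from the vvfat hunks below: a helper that is only reached from coroutine context and only reads the graph gains both markers, which in turn lets it use the coroutine I/O helpers (body abbreviated):

    static int coroutine_fn GRAPH_RDLOCK
    vvfat_read(BlockDriverState *bs, int64_t sector_num,
               uint8_t *buf, int nb_sectors)
    {
        /* ... can now call bdrv_co_pread() instead of bdrv_pread() ... */
    }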
10
11
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
12
Message-Id: <20230309084456.304669-2-pbonzini@redhat.com>
13
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
5
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
14
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
6
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
7
---
15
---
8
block/qed-cluster.c | 5 +++--
16
block/vvfat.c | 58 ++++++++++++++++++++++++++-------------------------
9
block/qed.c | 44 ++++++++++++++++++++++++--------------------
17
1 file changed, 30 insertions(+), 28 deletions(-)
10
block/qed.h | 5 +++--
18
11
3 files changed, 30 insertions(+), 24 deletions(-)
19
diff --git a/block/vvfat.c b/block/vvfat.c
12
13
diff --git a/block/qed-cluster.c b/block/qed-cluster.c
14
index XXXXXXX..XXXXXXX 100644
20
index XXXXXXX..XXXXXXX 100644
15
--- a/block/qed-cluster.c
21
--- a/block/vvfat.c
16
+++ b/block/qed-cluster.c
22
+++ b/block/vvfat.c
17
@@ -XXX,XX +XXX,XX @@ static unsigned int qed_count_contiguous_clusters(BDRVQEDState *s,
23
@@ -XXX,XX +XXX,XX @@ static BDRVVVFATState *vvv = NULL;
18
* On failure QED_CLUSTER_L2 or QED_CLUSTER_L1 is returned for missing L2 or L1
24
#endif
19
* table offset, respectively. len is number of contiguous unallocated bytes.
25
20
*/
26
static int enable_write_target(BlockDriverState *bs, Error **errp);
21
-int qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
27
-static int is_consistent(BDRVVVFATState *s);
22
- size_t *len, uint64_t *img_offset)
28
+static int coroutine_fn is_consistent(BDRVVVFATState *s);
23
+int coroutine_fn qed_find_cluster(BDRVQEDState *s, QEDRequest *request,
29
24
+ uint64_t pos, size_t *len,
30
static QemuOptsList runtime_opts = {
25
+ uint64_t *img_offset)
31
.name = "vvfat",
26
{
32
@@ -XXX,XX +XXX,XX @@ static void print_mapping(const mapping_t* mapping)
27
uint64_t l2_offset;
33
}
28
uint64_t offset = 0;
34
#endif
29
diff --git a/block/qed.c b/block/qed.c
35
30
index XXXXXXX..XXXXXXX 100644
36
-static int vvfat_read(BlockDriverState *bs, int64_t sector_num,
31
--- a/block/qed.c
37
- uint8_t *buf, int nb_sectors)
32
+++ b/block/qed.c
38
+static int coroutine_fn GRAPH_RDLOCK
33
@@ -XXX,XX +XXX,XX @@ int qed_write_header_sync(BDRVQEDState *s)
39
+vvfat_read(BlockDriverState *bs, int64_t sector_num, uint8_t *buf, int nb_sectors)
34
* This function only updates known header fields in-place and does not affect
40
{
35
* extra data after the QED header.
41
BDRVVVFATState *s = bs->opaque;
36
*/
37
-static int qed_write_header(BDRVQEDState *s)
38
+static int coroutine_fn qed_write_header(BDRVQEDState *s)
39
{
40
/* We must write full sectors for O_DIRECT but cannot necessarily generate
41
* the data following the header if an unrecognized compat feature is
42
@@ -XXX,XX +XXX,XX @@ static void qed_unplug_allocating_write_reqs(BDRVQEDState *s)
43
qemu_co_enter_next(&s->allocating_write_reqs);
44
}
45
46
-static void qed_need_check_timer_entry(void *opaque)
47
+static void coroutine_fn qed_need_check_timer_entry(void *opaque)
48
{
49
BDRVQEDState *s = opaque;
50
int ret;
51
@@ -XXX,XX +XXX,XX @@ static BDRVQEDState *acb_to_s(QEDAIOCB *acb)
52
* This function reads qiov->size bytes starting at pos from the backing file.
53
* If there is no backing file then zeroes are read.
54
*/
55
-static int qed_read_backing_file(BDRVQEDState *s, uint64_t pos,
56
- QEMUIOVector *qiov,
57
- QEMUIOVector **backing_qiov)
58
+static int coroutine_fn qed_read_backing_file(BDRVQEDState *s, uint64_t pos,
59
+ QEMUIOVector *qiov,
60
+ QEMUIOVector **backing_qiov)
61
{
62
uint64_t backing_length = 0;
63
size_t size;
64
@@ -XXX,XX +XXX,XX @@ static int qed_read_backing_file(BDRVQEDState *s, uint64_t pos,
65
* @len: Number of bytes
66
* @offset: Byte offset in image file
67
*/
68
-static int qed_copy_from_backing_file(BDRVQEDState *s, uint64_t pos,
69
- uint64_t len, uint64_t offset)
70
+static int coroutine_fn qed_copy_from_backing_file(BDRVQEDState *s,
71
+ uint64_t pos, uint64_t len,
72
+ uint64_t offset)
73
{
74
QEMUIOVector qiov;
75
QEMUIOVector *backing_qiov = NULL;
76
@@ -XXX,XX +XXX,XX @@ out:
77
* The cluster offset may be an allocated byte offset in the image file, the
78
* zero cluster marker, or the unallocated cluster marker.
79
*/
80
-static void qed_update_l2_table(BDRVQEDState *s, QEDTable *table, int index,
81
- unsigned int n, uint64_t cluster)
82
+static void coroutine_fn qed_update_l2_table(BDRVQEDState *s, QEDTable *table,
83
+ int index, unsigned int n,
84
+ uint64_t cluster)
85
{
86
int i;
42
int i;
87
for (i = index; i < index + n; i++) {
43
@@ -XXX,XX +XXX,XX @@ static int vvfat_read(BlockDriverState *bs, int64_t sector_num,
88
@@ -XXX,XX +XXX,XX @@ static void qed_update_l2_table(BDRVQEDState *s, QEDTable *table, int index,
44
DLOG(fprintf(stderr, "sectors %" PRId64 "+%" PRId64
45
" allocated\n", sector_num,
46
n >> BDRV_SECTOR_BITS));
47
- if (bdrv_pread(s->qcow, sector_num * BDRV_SECTOR_SIZE, n,
48
- buf + i * 0x200, 0) < 0) {
49
+ if (bdrv_co_pread(s->qcow, sector_num * BDRV_SECTOR_SIZE, n,
50
+ buf + i * 0x200, 0) < 0) {
51
return -1;
52
}
53
i += (n >> BDRV_SECTOR_BITS) - 1;
54
@@ -XXX,XX +XXX,XX @@ static int vvfat_read(BlockDriverState *bs, int64_t sector_num,
55
return 0;
56
}
57
58
-static int coroutine_fn
59
+static int coroutine_fn GRAPH_RDLOCK
60
vvfat_co_preadv(BlockDriverState *bs, int64_t offset, int64_t bytes,
61
QEMUIOVector *qiov, BdrvRequestFlags flags)
62
{
63
@@ -XXX,XX +XXX,XX @@ static inline uint32_t modified_fat_get(BDRVVVFATState* s,
89
}
64
}
90
}
65
}
91
66
92
-static void qed_aio_complete(QEDAIOCB *acb)
67
-static inline bool cluster_was_modified(BDRVVVFATState *s,
93
+static void coroutine_fn qed_aio_complete(QEDAIOCB *acb)
68
- uint32_t cluster_num)
94
{
69
+static inline bool coroutine_fn GRAPH_RDLOCK
95
BDRVQEDState *s = acb_to_s(acb);
70
+cluster_was_modified(BDRVVVFATState *s, uint32_t cluster_num)
96
71
{
97
@@ -XXX,XX +XXX,XX @@ static void qed_aio_complete(QEDAIOCB *acb)
72
int was_modified = 0;
98
/**
73
int i;
99
* Update L1 table with new L2 table offset and write it out
74
@@ -XXX,XX +XXX,XX @@ typedef enum {
100
*/
75
* Further, the files/directories handled by this function are
101
-static int qed_aio_write_l1_update(QEDAIOCB *acb)
76
* assumed to be *not* deleted (and *only* those).
102
+static int coroutine_fn qed_aio_write_l1_update(QEDAIOCB *acb)
77
*/
103
{
78
-static uint32_t get_cluster_count_for_direntry(BDRVVVFATState* s,
104
BDRVQEDState *s = acb_to_s(acb);
79
- direntry_t* direntry, const char* path)
105
CachedL2Table *l2_table = acb->request.l2_table;
80
+static uint32_t coroutine_fn GRAPH_RDLOCK
106
@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_l1_update(QEDAIOCB *acb)
81
+get_cluster_count_for_direntry(BDRVVVFATState* s, direntry_t* direntry, const char* path)
107
/**
82
{
108
* Update L2 table with new cluster offsets and write them out
83
/*
109
*/
84
* This is a little bit tricky:
110
-static int qed_aio_write_l2_update(QEDAIOCB *acb, uint64_t offset)
85
@@ -XXX,XX +XXX,XX @@ static uint32_t get_cluster_count_for_direntry(BDRVVVFATState* s,
111
+static int coroutine_fn qed_aio_write_l2_update(QEDAIOCB *acb, uint64_t offset)
86
if (res) {
112
{
87
return -1;
113
BDRVQEDState *s = acb_to_s(acb);
88
}
114
bool need_alloc = acb->find_cluster_ret == QED_CLUSTER_L1;
89
- res = bdrv_pwrite(s->qcow, offset * BDRV_SECTOR_SIZE,
115
@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_l2_update(QEDAIOCB *acb, uint64_t offset)
90
- BDRV_SECTOR_SIZE, s->cluster_buffer,
116
/**
91
- 0);
117
* Write data to the image file
92
+ res = bdrv_co_pwrite(s->qcow, offset * BDRV_SECTOR_SIZE,
118
*/
93
+ BDRV_SECTOR_SIZE, s->cluster_buffer,
119
-static int qed_aio_write_main(QEDAIOCB *acb)
94
+ 0);
120
+static int coroutine_fn qed_aio_write_main(QEDAIOCB *acb)
95
if (res < 0) {
121
{
96
return -2;
122
BDRVQEDState *s = acb_to_s(acb);
97
}
123
uint64_t offset = acb->cur_cluster +
98
@@ -XXX,XX +XXX,XX @@ static uint32_t get_cluster_count_for_direntry(BDRVVVFATState* s,
124
@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_main(QEDAIOCB *acb)
99
* It returns 0 upon inconsistency or error, and the number of clusters
125
/**
100
* used by the directory, its subdirectories and their files.
126
* Populate untouched regions of new data cluster
101
*/
127
*/
102
-static int check_directory_consistency(BDRVVVFATState *s,
128
-static int qed_aio_write_cow(QEDAIOCB *acb)
103
- int cluster_num, const char* path)
129
+static int coroutine_fn qed_aio_write_cow(QEDAIOCB *acb)
104
+static int coroutine_fn GRAPH_RDLOCK
130
{
105
+check_directory_consistency(BDRVVVFATState *s, int cluster_num, const char* path)
131
BDRVQEDState *s = acb_to_s(acb);
106
{
132
uint64_t start, len, offset;
107
int ret = 0;
133
@@ -XXX,XX +XXX,XX @@ static bool qed_should_set_need_check(BDRVQEDState *s)
108
unsigned char* cluster = g_malloc(s->cluster_size);
134
*
109
@@ -XXX,XX +XXX,XX @@ DLOG(fprintf(stderr, "check direntry %d:\n", i); print_direntry(direntries + i))
135
* This path is taken when writing to previously unallocated clusters.
110
}
136
*/
111
137
-static int qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
112
/* returns 1 on success */
138
+static int coroutine_fn qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
113
-static int is_consistent(BDRVVVFATState* s)
139
{
114
+static int coroutine_fn GRAPH_RDLOCK
140
BDRVQEDState *s = acb_to_s(acb);
115
+is_consistent(BDRVVVFATState* s)
141
int ret;
116
{
142
@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
117
int i, check;
143
*
118
int used_clusters_count = 0;
144
* This path is taken when writing to already allocated clusters.
119
@@ -XXX,XX +XXX,XX @@ static int commit_mappings(BDRVVVFATState* s,
145
*/
120
return 0;
146
-static int qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset, size_t len)
121
}
147
+static int coroutine_fn qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset,
122
148
+ size_t len)
123
-static int commit_direntries(BDRVVVFATState* s,
149
{
124
- int dir_index, int parent_mapping_index)
150
/* Allocate buffer for zero writes */
125
+static int coroutine_fn GRAPH_RDLOCK
151
if (acb->flags & QED_AIOCB_ZERO) {
126
+commit_direntries(BDRVVVFATState* s, int dir_index, int parent_mapping_index)
152
@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset, size_t len)
127
{
153
* @offset: Cluster offset in bytes
128
direntry_t* direntry = array_get(&(s->directory), dir_index);
154
* @len: Length in bytes
129
uint32_t first_cluster = dir_index == 0 ? 0 : begin_of_direntry(direntry);
155
*/
130
@@ -XXX,XX +XXX,XX @@ static int commit_direntries(BDRVVVFATState* s,
156
-static int qed_aio_write_data(void *opaque, int ret,
131
157
- uint64_t offset, size_t len)
132
/* commit one file (adjust contents, adjust mapping),
158
+static int coroutine_fn qed_aio_write_data(void *opaque, int ret,
133
return first_mapping_index */
159
+ uint64_t offset, size_t len)
134
-static int commit_one_file(BDRVVVFATState* s,
160
{
135
- int dir_index, uint32_t offset)
161
QEDAIOCB *acb = opaque;
136
+static int coroutine_fn GRAPH_RDLOCK
162
137
+commit_one_file(BDRVVVFATState* s, int dir_index, uint32_t offset)
163
@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_data(void *opaque, int ret,
138
{
164
* @offset: Cluster offset in bytes
139
direntry_t* direntry = array_get(&(s->directory), dir_index);
165
* @len: Length in bytes
140
uint32_t c = begin_of_direntry(direntry);
166
*/
141
@@ -XXX,XX +XXX,XX @@ static int handle_renames_and_mkdirs(BDRVVVFATState* s)
167
-static int qed_aio_read_data(void *opaque, int ret, uint64_t offset, size_t len)
142
/*
168
+static int coroutine_fn qed_aio_read_data(void *opaque, int ret,
143
* TODO: make sure that the short name is not matching *another* file
169
+ uint64_t offset, size_t len)
144
*/
170
{
145
-static int handle_commits(BDRVVVFATState* s)
171
QEDAIOCB *acb = opaque;
146
+static int coroutine_fn GRAPH_RDLOCK handle_commits(BDRVVVFATState* s)
172
BDRVQEDState *s = acb_to_s(acb);
147
{
173
@@ -XXX,XX +XXX,XX @@ static int qed_aio_read_data(void *opaque, int ret, uint64_t offset, size_t len)
148
int i, fail = 0;
174
/**
149
175
* Begin next I/O or complete the request
150
@@ -XXX,XX +XXX,XX @@ static int handle_deletes(BDRVVVFATState* s)
176
*/
151
* - recurse direntries from root (using bs->bdrv_pread)
177
-static int qed_aio_next_io(QEDAIOCB *acb)
152
* - delete files corresponding to mappings marked as deleted
178
+static int coroutine_fn qed_aio_next_io(QEDAIOCB *acb)
153
*/
179
{
154
-static int do_commit(BDRVVVFATState* s)
180
BDRVQEDState *s = acb_to_s(acb);
155
+static int coroutine_fn GRAPH_RDLOCK do_commit(BDRVVVFATState* s)
181
uint64_t offset;
156
{
182
diff --git a/block/qed.h b/block/qed.h
157
int ret = 0;
183
index XXXXXXX..XXXXXXX 100644
158
184
--- a/block/qed.h
159
@@ -XXX,XX +XXX,XX @@ DLOG(checkpoint());
185
+++ b/block/qed.h
160
return 0;
186
@@ -XXX,XX +XXX,XX @@ int qed_write_l2_table_sync(BDRVQEDState *s, QEDRequest *request,
161
}
187
/**
162
188
* Cluster functions
163
-static int try_commit(BDRVVVFATState* s)
189
*/
164
+static int coroutine_fn GRAPH_RDLOCK try_commit(BDRVVVFATState* s)
190
-int qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
165
{
191
- size_t *len, uint64_t *img_offset);
166
vvfat_close_current_file(s);
192
+int coroutine_fn qed_find_cluster(BDRVQEDState *s, QEDRequest *request,
167
DLOG(checkpoint());
193
+ uint64_t pos, size_t *len,
168
@@ -XXX,XX +XXX,XX @@ DLOG(checkpoint());
194
+ uint64_t *img_offset);
169
return do_commit(s);
195
170
}
196
/**
171
197
* Consistency check
172
-static int vvfat_write(BlockDriverState *bs, int64_t sector_num,
173
- const uint8_t *buf, int nb_sectors)
174
+static int coroutine_fn GRAPH_RDLOCK
175
+vvfat_write(BlockDriverState *bs, int64_t sector_num,
176
+ const uint8_t *buf, int nb_sectors)
177
{
178
BDRVVVFATState *s = bs->opaque;
179
int i, ret;
180
@@ -XXX,XX +XXX,XX @@ DLOG(checkpoint());
181
* Use qcow backend. Commit later.
182
*/
183
DLOG(fprintf(stderr, "Write to qcow backend: %d + %d\n", (int)sector_num, nb_sectors));
184
- ret = bdrv_pwrite(s->qcow, sector_num * BDRV_SECTOR_SIZE,
185
- nb_sectors * BDRV_SECTOR_SIZE, buf, 0);
186
+ ret = bdrv_co_pwrite(s->qcow, sector_num * BDRV_SECTOR_SIZE,
187
+ nb_sectors * BDRV_SECTOR_SIZE, buf, 0);
188
if (ret < 0) {
189
fprintf(stderr, "Error writing to qcow backend\n");
190
return ret;
191
@@ -XXX,XX +XXX,XX @@ DLOG(checkpoint());
192
return 0;
193
}
194
195
-static int coroutine_fn
196
+static int coroutine_fn GRAPH_RDLOCK
197
vvfat_co_pwritev(BlockDriverState *bs, int64_t offset, int64_t bytes,
198
QEMUIOVector *qiov, BdrvRequestFlags flags)
199
{
198
--
200
--
199
1.8.3.1
201
2.40.0
200
201
1
From: Max Reitz <mreitz@redhat.com>
1
From: Paolo Bonzini <pbonzini@redhat.com>
2
2
3
The bs->exact_filename field may not be sufficient to store the full
3
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4
blkdebug node filename. In this case, we should not generate a filename
4
Message-Id: <20230309084456.304669-3-pbonzini@redhat.com>
5
at all instead of an unusable one.
5
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
6
6
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
7
Cc: qemu-stable@nongnu.org
8
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
9
Signed-off-by: Max Reitz <mreitz@redhat.com>
10
Message-id: 20170613172006.19685-2-mreitz@redhat.com
11
Reviewed-by: Alberto Garcia <berto@igalia.com>
12
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
13
Signed-off-by: Max Reitz <mreitz@redhat.com>
14
---
7
---
15
block/blkdebug.c | 10 +++++++---
8
block/blkdebug.c | 4 ++--
16
1 file changed, 7 insertions(+), 3 deletions(-)
9
1 file changed, 2 insertions(+), 2 deletions(-)
17
10
18
diff --git a/block/blkdebug.c b/block/blkdebug.c
11
diff --git a/block/blkdebug.c b/block/blkdebug.c
19
index XXXXXXX..XXXXXXX 100644
12
index XXXXXXX..XXXXXXX 100644
20
--- a/block/blkdebug.c
13
--- a/block/blkdebug.c
21
+++ b/block/blkdebug.c
14
+++ b/block/blkdebug.c
22
@@ -XXX,XX +XXX,XX @@ static void blkdebug_refresh_filename(BlockDriverState *bs, QDict *options)
15
@@ -XXX,XX +XXX,XX @@ out:
23
}
16
return ret;
24
17
}
25
if (!force_json && bs->file->bs->exact_filename[0]) {
18
26
- snprintf(bs->exact_filename, sizeof(bs->exact_filename),
19
-static int rule_check(BlockDriverState *bs, uint64_t offset, uint64_t bytes,
27
- "blkdebug:%s:%s", s->config_file ?: "",
20
- BlkdebugIOType iotype)
28
- bs->file->bs->exact_filename);
21
+static int coroutine_fn rule_check(BlockDriverState *bs, uint64_t offset,
29
+ int ret = snprintf(bs->exact_filename, sizeof(bs->exact_filename),
22
+ uint64_t bytes, BlkdebugIOType iotype)
30
+ "blkdebug:%s:%s", s->config_file ?: "",
23
{
31
+ bs->file->bs->exact_filename);
24
BDRVBlkdebugState *s = bs->opaque;
32
+ if (ret >= sizeof(bs->exact_filename)) {
25
BlkdebugRule *rule = NULL;
33
+ /* An overflow makes the filename unusable, so do not report any */
34
+ bs->exact_filename[0] = 0;
35
+ }
36
}
37
38
opts = qdict_new();
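The exact_filename fix above comes down to checking snprintf()'s return value for truncation. A standalone sketch of the idiom, with illustrative buffer and input names that are not from the patch:

    const char *config_file = "blkdebug.cfg";   /* illustrative only */
    const char *inner_name = "test.qcow2";      /* illustrative only */
    char filename[128];
    int ret = snprintf(filename, sizeof(filename), "blkdebug:%s:%s",
                       config_file, inner_name);
    if (ret >= sizeof(filename)) {
        /* A truncated filename would be unusable, so report none at all. */
        filename[0] = 0;
    }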
39
--
26
--
40
1.8.3.1
27
2.40.0
41
42
1
From: Alberto Garcia <berto@igalia.com>
1
From: Paolo Bonzini <pbonzini@redhat.com>
2
2
3
Qcow2COWRegion has two attributes:
3
mirror_flush calls the mixed function blk_flush(), but it is only called
4
from mirror_run; so call the coroutine version and make mirror_flush
5
a coroutine_fn too.
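The resulting change is minimal, as the hunk below shows; in sketch form:

    static int coroutine_fn mirror_flush(MirrorBlockJob *s)
    {
        int ret = blk_co_flush(s->target);    /* was: blk_flush() */
        /* ... error handling unchanged ... */
    }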
4
6
5
- The offset of the COW region from the start of the first cluster
7
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
6
touched by the I/O request. Since it's always going to be positive
8
Message-Id: <20230309084456.304669-4-pbonzini@redhat.com>
7
and the maximum request size is at most INT_MAX, we can use a
8
regular unsigned int to store this offset.
9
10
- The size of the COW region in bytes. This is guaranteed to be >= 0,
11
so we should use an unsigned type instead.
12
13
On x86_64 this reduces the size of Qcow2COWRegion from 16 to 8 bytes.
14
It will also help keep some assertions simpler now that we know that
15
there are no negative numbers.
16
17
The prototype of do_perform_cow() is also updated to reflect these
18
changes.
19
20
Signed-off-by: Alberto Garcia <berto@igalia.com>
21
Reviewed-by: Eric Blake <eblake@redhat.com>
22
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
9
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
23
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
10
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
24
---
11
---
25
block/qcow2-cluster.c | 4 ++--
12
block/mirror.c | 4 ++--
26
block/qcow2.h | 4 ++--
13
1 file changed, 2 insertions(+), 2 deletions(-)
27
2 files changed, 4 insertions(+), 4 deletions(-)
28
14
29
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
15
diff --git a/block/mirror.c b/block/mirror.c
30
index XXXXXXX..XXXXXXX 100644
16
index XXXXXXX..XXXXXXX 100644
31
--- a/block/qcow2-cluster.c
17
--- a/block/mirror.c
32
+++ b/block/qcow2-cluster.c
18
+++ b/block/mirror.c
33
@@ -XXX,XX +XXX,XX @@ int qcow2_encrypt_sectors(BDRVQcow2State *s, int64_t sector_num,
19
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s)
34
static int coroutine_fn do_perform_cow(BlockDriverState *bs,
20
/* Called when going out of the streaming phase to flush the bulk of the
35
uint64_t src_cluster_offset,
21
* data to the medium, or just before completing.
36
uint64_t cluster_offset,
22
*/
37
- int offset_in_cluster,
23
-static int mirror_flush(MirrorBlockJob *s)
38
- int bytes)
24
+static int coroutine_fn mirror_flush(MirrorBlockJob *s)
39
+ unsigned offset_in_cluster,
40
+ unsigned bytes)
41
{
25
{
42
BDRVQcow2State *s = bs->opaque;
26
- int ret = blk_flush(s->target);
43
QEMUIOVector qiov;
27
+ int ret = blk_co_flush(s->target);
44
diff --git a/block/qcow2.h b/block/qcow2.h
28
if (ret < 0) {
45
index XXXXXXX..XXXXXXX 100644
29
if (mirror_error_action(s, false, -ret) == BLOCK_ERROR_ACTION_REPORT) {
46
--- a/block/qcow2.h
30
s->ret = ret;
47
+++ b/block/qcow2.h
48
@@ -XXX,XX +XXX,XX @@ typedef struct Qcow2COWRegion {
49
* Offset of the COW region in bytes from the start of the first cluster
50
* touched by the request.
51
*/
52
- uint64_t offset;
53
+ unsigned offset;
54
55
/** Number of bytes to copy */
56
- int nb_bytes;
57
+ unsigned nb_bytes;
58
} Qcow2COWRegion;
59
60
/**
61
--
31
--
62
1.8.3.1
32
2.40.0
63
64
1
All callers pass ret = 0, so we can just remove it.
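Both series apply the same rule in this pair of patches: code that already runs as a coroutine_fn should call the coroutine variants directly instead of the mixed wrappers. Condensed from the nbd/server.c hunks below:

    /* nbd_do_cmd_read() is a coroutine_fn, so use blk_co_pread(). */
    ret = blk_co_pread(exp->common.blk, request->from, request->len, data, 0);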
1
From: Paolo Bonzini <pbonzini@redhat.com>
2
2
3
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
3
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
5
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
4
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
5
---
6
---
6
block/qed.c | 17 ++++++-----------
7
nbd/server.c | 48 ++++++++++++++++++++++++------------------------
7
1 file changed, 6 insertions(+), 11 deletions(-)
8
1 file changed, 24 insertions(+), 24 deletions(-)
8
9
9
diff --git a/block/qed.c b/block/qed.c
10
diff --git a/nbd/server.c b/nbd/server.c
10
index XXXXXXX..XXXXXXX 100644
11
index XXXXXXX..XXXXXXX 100644
11
--- a/block/qed.c
12
--- a/nbd/server.c
12
+++ b/block/qed.c
13
+++ b/nbd/server.c
13
@@ -XXX,XX +XXX,XX @@ static CachedL2Table *qed_new_l2_table(BDRVQEDState *s)
14
@@ -XXX,XX +XXX,XX @@ nbd_read_eof(NBDClient *client, void *buffer, size_t size, Error **errp)
14
return l2_table;
15
return 1;
15
}
16
}
16
17
17
-static void qed_aio_next_io(QEDAIOCB *acb, int ret);
18
-static int nbd_receive_request(NBDClient *client, NBDRequest *request,
18
+static void qed_aio_next_io(QEDAIOCB *acb);
19
- Error **errp)
19
20
+static int coroutine_fn nbd_receive_request(NBDClient *client, NBDRequest *request,
20
static void qed_aio_start_io(QEDAIOCB *acb)
21
+ Error **errp)
21
{
22
{
22
- qed_aio_next_io(acb, 0);
23
uint8_t buf[NBD_REQUEST_SIZE];
23
+ qed_aio_next_io(acb);
24
uint32_t magic;
25
@@ -XXX,XX +XXX,XX @@ static inline void set_be_simple_reply(NBDSimpleReply *reply, uint64_t error,
26
stq_be_p(&reply->handle, handle);
24
}
27
}
25
28
26
static void qed_plug_allocating_write_reqs(BDRVQEDState *s)
29
-static int nbd_co_send_simple_reply(NBDClient *client,
27
@@ -XXX,XX +XXX,XX @@ static int qed_aio_read_data(void *opaque, int ret, uint64_t offset, size_t len)
30
- uint64_t handle,
28
/**
31
- uint32_t error,
29
* Begin next I/O or complete the request
32
- void *data,
33
- size_t len,
34
- Error **errp)
35
+static int coroutine_fn nbd_co_send_simple_reply(NBDClient *client,
36
+ uint64_t handle,
37
+ uint32_t error,
38
+ void *data,
39
+ size_t len,
40
+ Error **errp)
41
{
42
NBDSimpleReply reply;
43
int nbd_err = system_errno_to_nbd_errno(error);
44
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn nbd_co_send_sparse_read(NBDClient *client,
45
stl_be_p(&chunk.length, pnum);
46
ret = nbd_co_send_iov(client, iov, 1, errp);
47
} else {
48
- ret = blk_pread(exp->common.blk, offset + progress, pnum,
49
- data + progress, 0);
50
+ ret = blk_co_pread(exp->common.blk, offset + progress, pnum,
51
+ data + progress, 0);
52
if (ret < 0) {
53
error_setg_errno(errp, -ret, "reading from file failed");
54
break;
55
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn blockalloc_to_extents(BlockBackend *blk,
56
 * @ea is converted to BE by the function
 * @last controls whether NBD_REPLY_FLAG_DONE is sent.
 */
-static int nbd_co_send_extents(NBDClient *client, uint64_t handle,
-                               NBDExtentArray *ea,
-                               bool last, uint32_t context_id, Error **errp)
+static int coroutine_fn
+nbd_co_send_extents(NBDClient *client, uint64_t handle, NBDExtentArray *ea,
+                    bool last, uint32_t context_id, Error **errp)
 {
     NBDStructuredMeta chunk;
     struct iovec iov[] = {
@@ -XXX,XX +XXX,XX @@ static void bitmap_to_extents(BdrvDirtyBitmap *bitmap,
     bdrv_dirty_bitmap_unlock(bitmap);
 }

-static int nbd_co_send_bitmap(NBDClient *client, uint64_t handle,
-                              BdrvDirtyBitmap *bitmap, uint64_t offset,
-                              uint32_t length, bool dont_fragment, bool last,
-                              uint32_t context_id, Error **errp)
+static int coroutine_fn nbd_co_send_bitmap(NBDClient *client, uint64_t handle,
+                                           BdrvDirtyBitmap *bitmap, uint64_t offset,
+                                           uint32_t length, bool dont_fragment, bool last,
+                                           uint32_t context_id, Error **errp)
 {
     unsigned int nb_extents = dont_fragment ? 1 : NBD_MAX_BLOCK_STATUS_EXTENTS;
     g_autoptr(NBDExtentArray) ea = nbd_extent_array_new(nb_extents);
@@ -XXX,XX +XXX,XX @@ static int nbd_co_send_bitmap(NBDClient *client, uint64_t handle,
  * to the client (although the caller may still need to disconnect after
  * reporting the error).
  */
-static int nbd_co_receive_request(NBDRequestData *req, NBDRequest *request,
-                                  Error **errp)
+static int coroutine_fn nbd_co_receive_request(NBDRequestData *req, NBDRequest *request,
+                                               Error **errp)
 {
     NBDClient *client = req->client;
     int valid_flags;
@@ -XXX,XX +XXX,XX @@ static coroutine_fn int nbd_do_cmd_read(NBDClient *client, NBDRequest *request,
                           data, request->len, errp);
     }

-    ret = blk_pread(exp->common.blk, request->from, request->len, data, 0);
+    ret = blk_co_pread(exp->common.blk, request->from, request->len, data, 0);
     if (ret < 0) {
         return nbd_send_generic_reply(client, request->handle, ret,
                                       "reading from file failed", errp);
@@ -XXX,XX +XXX,XX @@ static coroutine_fn int nbd_handle_request(NBDClient *client,
         if (request->flags & NBD_CMD_FLAG_FUA) {
             flags |= BDRV_REQ_FUA;
         }
-        ret = blk_pwrite(exp->common.blk, request->from, request->len, data,
-                         flags);
+        ret = blk_co_pwrite(exp->common.blk, request->from, request->len, data,
+                            flags);
         return nbd_send_generic_reply(client, request->handle, ret,
                                       "writing to file failed", errp);

@@ -XXX,XX +XXX,XX @@ static coroutine_fn int nbd_handle_request(NBDClient *client,
         if (request->flags & NBD_CMD_FLAG_FAST_ZERO) {
             flags |= BDRV_REQ_NO_FALLBACK;
         }
-        ret = blk_pwrite_zeroes(exp->common.blk, request->from, request->len,
-                                flags);
+        ret = blk_co_pwrite_zeroes(exp->common.blk, request->from, request->len,
+                                   flags);
         return nbd_send_generic_reply(client, request->handle, ret,
                                       "writing to file failed", errp);
--
2.40.0

 */
-static void qed_aio_next_io(QEDAIOCB *acb, int ret)
+static void qed_aio_next_io(QEDAIOCB *acb)
 {
     BDRVQEDState *s = acb_to_s(acb);
     uint64_t offset;
     size_t len;
+    int ret;

-    trace_qed_aio_next_io(s, acb, ret, acb->cur_pos + acb->cur_qiov.size);
+    trace_qed_aio_next_io(s, acb, 0, acb->cur_pos + acb->cur_qiov.size);

     if (acb->backing_qiov) {
         qemu_iovec_destroy(acb->backing_qiov);
@@ -XXX,XX +XXX,XX @@ static void qed_aio_next_io(QEDAIOCB *acb, int ret)
         acb->backing_qiov = NULL;
     }

-    /* Handle I/O error */
-    if (ret) {
-        qed_aio_complete(acb, ret);
-        return;
-    }
-
     acb->qiov_offset += acb->cur_qiov.size;
     acb->cur_pos += acb->cur_qiov.size;
     qemu_iovec_reset(&acb->cur_qiov);
@@ -XXX,XX +XXX,XX @@ static void qed_aio_next_io(QEDAIOCB *acb, int ret)
         }
         return;
     }
-    qed_aio_next_io(acb, 0);
+    qed_aio_next_io(acb);
 }

 static BlockAIOCB *qed_aio_setup(BlockDriverState *bs,
--
1.8.3.1
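A side note on the conversion pattern in the NBD patch above, reduced to a minimal sketch: inside a coroutine_fn, the synchronous wrappers (blk_pread(), blk_pwrite(), blk_pwrite_zeroes()) are replaced by their blk_co_*() variants while the error handling stays unchanged. example_read is a made-up name; only the blk_co_pread() call itself is taken from the patch:

    static int coroutine_fn example_read(BlockBackend *blk, int64_t offset,
                                         int64_t bytes, void *buf)
    {
        /* blk_co_pread() yields instead of blocking the event loop */
        int ret = blk_co_pread(blk, offset, bytes, buf, 0);
        if (ret < 0) {
            return ret;    /* the caller turns this into an error reply */
        }
        return 0;
    }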
From: Alberto Garcia <berto@igalia.com>

Instead of calling perform_cow() twice with a different COW region
each time, call it just once and make perform_cow() handle both
regions.

This patch simply moves code around. The next one will do the actual
reordering of the COW operations.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2-cluster.c | 36 ++++++++++++++++++++++--------------
 1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn do_perform_cow(BlockDriverState *bs,
     struct iovec iov;
     int ret;

+    if (bytes == 0) {
+        return 0;
+    }
+
     iov.iov_len = bytes;
     iov.iov_base = qemu_try_blockalign(bs, iov.iov_len);
     if (iov.iov_base == NULL) {
@@ -XXX,XX +XXX,XX @@ uint64_t qcow2_alloc_compressed_cluster_offset(BlockDriverState *bs,
     return cluster_offset;
 }

-static int perform_cow(BlockDriverState *bs, QCowL2Meta *m, Qcow2COWRegion *r)
+static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
 {
     BDRVQcow2State *s = bs->opaque;
+    Qcow2COWRegion *start = &m->cow_start;
+    Qcow2COWRegion *end = &m->cow_end;
     int ret;

-    if (r->nb_bytes == 0) {
+    if (start->nb_bytes == 0 && end->nb_bytes == 0) {
         return 0;
     }

     qemu_co_mutex_unlock(&s->lock);
-    ret = do_perform_cow(bs, m->offset, m->alloc_offset, r->offset, r->nb_bytes);
-    qemu_co_mutex_lock(&s->lock);
-
+    ret = do_perform_cow(bs, m->offset, m->alloc_offset,
+                         start->offset, start->nb_bytes);
     if (ret < 0) {
-        return ret;
+        goto fail;
     }

+    ret = do_perform_cow(bs, m->offset, m->alloc_offset,
+                         end->offset, end->nb_bytes);
+
+fail:
+    qemu_co_mutex_lock(&s->lock);
+
     /*
      * Before we update the L2 table to actually point to the new cluster, we
      * need to be sure that the refcounts have been increased and COW was
      * handled.
      */
-    qcow2_cache_depends_on_flush(s->l2_table_cache);
+    if (ret == 0) {
+        qcow2_cache_depends_on_flush(s->l2_table_cache);
+    }

-    return 0;
+    return ret;
 }

 int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m)
@@ -XXX,XX +XXX,XX @@ int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m)
     }

     /* copy content of unmodified sectors */
-    ret = perform_cow(bs, m, &m->cow_start);
-    if (ret < 0) {
-        goto err;
-    }
-
-    ret = perform_cow(bs, m, &m->cow_end);
+    ret = perform_cow(bs, m);
     if (ret < 0) {
         goto err;
     }
--
1.8.3.1

From: Paolo Bonzini <pbonzini@redhat.com>

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20230309084456.304669-6-pbonzini@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 hw/9pfs/9p.h    | 4 ++--
 hw/9pfs/codir.c | 6 +++---
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/hw/9pfs/9p.h b/hw/9pfs/9p.h
index XXXXXXX..XXXXXXX 100644
--- a/hw/9pfs/9p.h
+++ b/hw/9pfs/9p.h
@@ -XXX,XX +XXX,XX @@ typedef struct V9fsDir {
     QemuMutex readdir_mutex_L;
 } V9fsDir;

-static inline void v9fs_readdir_lock(V9fsDir *dir)
+static inline void coroutine_fn v9fs_readdir_lock(V9fsDir *dir)
 {
     if (dir->proto_version == V9FS_PROTO_2000U) {
         qemu_co_mutex_lock(&dir->readdir_mutex_u);
@@ -XXX,XX +XXX,XX @@ static inline void v9fs_readdir_lock(V9fsDir *dir)
     }
 }

-static inline void v9fs_readdir_unlock(V9fsDir *dir)
+static inline void coroutine_fn v9fs_readdir_unlock(V9fsDir *dir)
 {
     if (dir->proto_version == V9FS_PROTO_2000U) {
         qemu_co_mutex_unlock(&dir->readdir_mutex_u);
diff --git a/hw/9pfs/codir.c b/hw/9pfs/codir.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/9pfs/codir.c
+++ b/hw/9pfs/codir.c
@@ -XXX,XX +XXX,XX @@ int coroutine_fn v9fs_co_readdir(V9fsPDU *pdu, V9fsFidState *fidp,
  *
  * See v9fs_co_readdir_many() (as its only user) below for details.
  */
-static int do_readdir_many(V9fsPDU *pdu, V9fsFidState *fidp,
-                           struct V9fsDirEnt **entries, off_t offset,
-                           int32_t maxsize, bool dostat)
+static int coroutine_fn
+do_readdir_many(V9fsPDU *pdu, V9fsFidState *fidp, struct V9fsDirEnt **entries,
+                off_t offset, int32_t maxsize, bool dostat)
 {
     V9fsState *s = pdu->s;
     V9fsString name;
--
2.40.0
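The 9pfs hunks above follow a general rule that is worth spelling out: qemu_co_mutex_lock() can suspend the calling coroutine, so any helper that (transitively) takes a CoMutex must be marked coroutine_fn, and the marking then propagates to every caller. A minimal sketch with made-up names:

    static inline void coroutine_fn example_lock(CoMutex *lock)
    {
        qemu_co_mutex_lock(lock);    /* may yield until the lock is free */
    }

    static void coroutine_fn example_caller(CoMutex *lock)
    {
        example_lock(lock);          /* hence coroutine_fn as well */
        /* ... critical section ... */
        qemu_co_mutex_unlock(lock);
    }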
Most of the qed code is now synchronous and matches the coroutine model.
One notable exception is the serialisation between requests, which can
still schedule a callback. Before we can replace this with coroutine
locks, let's convert the driver's external interfaces to the coroutine
versions.

We need to be careful to handle both requests that call the completion
callback directly from the calling coroutine (i.e. fully synchronous
code) and requests that involve a real callback, where we need to yield
and wait for the completion callback coming from outside the coroutine.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Manos Pitsidianakis <el13635@mail.ntua.gr>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/qed.c | 97 ++++++++++++++++++++++++++-----------------------------
 1 file changed, 42 insertions(+), 55 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@ static void qed_aio_next_io(QEDAIOCB *acb)
     }
 }

-static BlockAIOCB *qed_aio_setup(BlockDriverState *bs,
-                                 int64_t sector_num,
-                                 QEMUIOVector *qiov, int nb_sectors,
-                                 BlockCompletionFunc *cb,
-                                 void *opaque, int flags)
+typedef struct QEDRequestCo {
+    Coroutine *co;
+    bool done;
+    int ret;
+} QEDRequestCo;
+
+static void qed_co_request_cb(void *opaque, int ret)
 {
-    QEDAIOCB *acb = qemu_aio_get(&qed_aiocb_info, bs, cb, opaque);
+    QEDRequestCo *co = opaque;

-    trace_qed_aio_setup(bs->opaque, acb, sector_num, nb_sectors,
-                        opaque, flags);
+    co->done = true;
+    co->ret = ret;
+    qemu_coroutine_enter_if_inactive(co->co);
+}
+
+static int coroutine_fn qed_co_request(BlockDriverState *bs, int64_t sector_num,
+                                       QEMUIOVector *qiov, int nb_sectors,
+                                       int flags)
+{
+    QEDRequestCo co = {
+        .co = qemu_coroutine_self(),
+        .done = false,
+    };
+    QEDAIOCB *acb = qemu_aio_get(&qed_aiocb_info, bs, qed_co_request_cb, &co);
+
+    trace_qed_aio_setup(bs->opaque, acb, sector_num, nb_sectors, &co, flags);

     acb->flags = flags;
     acb->qiov = qiov;
@@ -XXX,XX +XXX,XX @@ static BlockAIOCB *qed_aio_setup(BlockDriverState *bs,

     /* Start request */
     qed_aio_start_io(acb);
-    return &acb->common;
-}

-static BlockAIOCB *bdrv_qed_aio_readv(BlockDriverState *bs,
-                                      int64_t sector_num,
-                                      QEMUIOVector *qiov, int nb_sectors,
-                                      BlockCompletionFunc *cb,
-                                      void *opaque)
-{
-    return qed_aio_setup(bs, sector_num, qiov, nb_sectors, cb, opaque, 0);
+    if (!co.done) {
+        qemu_coroutine_yield();
+    }
+
+    return co.ret;
 }

-static BlockAIOCB *bdrv_qed_aio_writev(BlockDriverState *bs,
-                                       int64_t sector_num,
-                                       QEMUIOVector *qiov, int nb_sectors,
-                                       BlockCompletionFunc *cb,
-                                       void *opaque)
+static int coroutine_fn bdrv_qed_co_readv(BlockDriverState *bs,
+                                          int64_t sector_num, int nb_sectors,
+                                          QEMUIOVector *qiov)
 {
-    return qed_aio_setup(bs, sector_num, qiov, nb_sectors, cb,
-                         opaque, QED_AIOCB_WRITE);
+    return qed_co_request(bs, sector_num, qiov, nb_sectors, 0);
 }

-typedef struct {
-    Coroutine *co;
-    int ret;
-    bool done;
-} QEDWriteZeroesCB;
-
-static void coroutine_fn qed_co_pwrite_zeroes_cb(void *opaque, int ret)
+static int coroutine_fn bdrv_qed_co_writev(BlockDriverState *bs,
+                                           int64_t sector_num, int nb_sectors,
+                                           QEMUIOVector *qiov)
 {
-    QEDWriteZeroesCB *cb = opaque;
-
-    cb->done = true;
-    cb->ret = ret;
-    if (cb->co) {
-        aio_co_wake(cb->co);
-    }
+    return qed_co_request(bs, sector_num, qiov, nb_sectors, QED_AIOCB_WRITE);
 }

 static int coroutine_fn bdrv_qed_co_pwrite_zeroes(BlockDriverState *bs,
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn bdrv_qed_co_pwrite_zeroes(BlockDriverState *bs,
                                                   int count,
                                                   BdrvRequestFlags flags)
 {
-    BlockAIOCB *blockacb;
     BDRVQEDState *s = bs->opaque;
-    QEDWriteZeroesCB cb = { .done = false };
     QEMUIOVector qiov;
     struct iovec iov;

@@ -XXX,XX +XXX,XX @@ static int coroutine_fn bdrv_qed_co_pwrite_zeroes(BlockDriverState *bs,
     iov.iov_len = count;

     qemu_iovec_init_external(&qiov, &iov, 1);
-    blockacb = qed_aio_setup(bs, offset >> BDRV_SECTOR_BITS, &qiov,
-                             count >> BDRV_SECTOR_BITS,
-                             qed_co_pwrite_zeroes_cb, &cb,
-                             QED_AIOCB_WRITE | QED_AIOCB_ZERO);
-    if (!blockacb) {
-        return -EIO;
-    }
-    if (!cb.done) {
-        cb.co = qemu_coroutine_self();
-        qemu_coroutine_yield();
-    }
-    assert(cb.done);
-    return cb.ret;
+    return qed_co_request(bs, offset >> BDRV_SECTOR_BITS, &qiov,
+                          count >> BDRV_SECTOR_BITS,
+                          QED_AIOCB_WRITE | QED_AIOCB_ZERO);
 }

 static int bdrv_qed_truncate(BlockDriverState *bs, int64_t offset, Error **errp)
@@ -XXX,XX +XXX,XX @@ static BlockDriver bdrv_qed = {
     .bdrv_create              = bdrv_qed_create,
     .bdrv_has_zero_init       = bdrv_has_zero_init_1,
     .bdrv_co_get_block_status = bdrv_qed_co_get_block_status,
-    .bdrv_aio_readv           = bdrv_qed_aio_readv,
-    .bdrv_aio_writev          = bdrv_qed_aio_writev,
+    .bdrv_co_readv            = bdrv_qed_co_readv,
+    .bdrv_co_writev           = bdrv_qed_co_writev,
     .bdrv_co_pwrite_zeroes    = bdrv_qed_co_pwrite_zeroes,
     .bdrv_truncate            = bdrv_qed_truncate,
     .bdrv_getlength           = bdrv_qed_getlength,
--
1.8.3.1

From: Paolo Bonzini <pbonzini@redhat.com>

do_sgio can suspend via the coroutine function thread_pool_submit_co, so it
has to be coroutine_fn as well---and the same is true of all its direct and
indirect callers.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20230309084456.304669-7-pbonzini@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 scsi/qemu-pr-helper.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/scsi/qemu-pr-helper.c b/scsi/qemu-pr-helper.c
index XXXXXXX..XXXXXXX 100644
--- a/scsi/qemu-pr-helper.c
+++ b/scsi/qemu-pr-helper.c
@@ -XXX,XX +XXX,XX @@ static int do_sgio_worker(void *opaque)
     return status;
 }

-static int do_sgio(int fd, const uint8_t *cdb, uint8_t *sense,
-                   uint8_t *buf, int *sz, int dir)
+static int coroutine_fn do_sgio(int fd, const uint8_t *cdb, uint8_t *sense,
+                                uint8_t *buf, int *sz, int dir)
 {
     int r;

@@ -XXX,XX +XXX,XX @@ static SCSISense mpath_generic_sense(int r)
     }
 }

-static int mpath_reconstruct_sense(int fd, int r, uint8_t *sense)
+static int coroutine_fn mpath_reconstruct_sense(int fd, int r, uint8_t *sense)
 {
     switch (r) {
     case MPATH_PR_SUCCESS:
@@ -XXX,XX +XXX,XX @@ static int mpath_reconstruct_sense(int fd, int r, uint8_t *sense)
     }
 }

-static int multipath_pr_in(int fd, const uint8_t *cdb, uint8_t *sense,
-                           uint8_t *data, int sz)
+static int coroutine_fn multipath_pr_in(int fd, const uint8_t *cdb, uint8_t *sense,
+                                        uint8_t *data, int sz)
 {
     int rq_servact = cdb[1];
     struct prin_resp resp;
@@ -XXX,XX +XXX,XX @@ static int multipath_pr_in(int fd, const uint8_t *cdb, uint8_t *sense,
     return mpath_reconstruct_sense(fd, r, sense);
 }

-static int multipath_pr_out(int fd, const uint8_t *cdb, uint8_t *sense,
-                            const uint8_t *param, int sz)
+static int coroutine_fn multipath_pr_out(int fd, const uint8_t *cdb, uint8_t *sense,
+                                         const uint8_t *param, int sz)
 {
     int rq_servact = cdb[1];
     int rq_scope = cdb[2] >> 4;
@@ -XXX,XX +XXX,XX @@ static int multipath_pr_out(int fd, const uint8_t *cdb, uint8_t *sense,
 }
 #endif

-static int do_pr_in(int fd, const uint8_t *cdb, uint8_t *sense,
-                    uint8_t *data, int *resp_sz)
+static int coroutine_fn do_pr_in(int fd, const uint8_t *cdb, uint8_t *sense,
+                                 uint8_t *data, int *resp_sz)
 {
 #ifdef CONFIG_MPATH
     if (is_mpath(fd)) {
@@ -XXX,XX +XXX,XX @@ static int do_pr_in(int fd, const uint8_t *cdb, uint8_t *sense,
                     SG_DXFER_FROM_DEV);
 }

-static int do_pr_out(int fd, const uint8_t *cdb, uint8_t *sense,
-                     const uint8_t *param, int sz)
+static int coroutine_fn do_pr_out(int fd, const uint8_t *cdb, uint8_t *sense,
+                                  const uint8_t *param, int sz)
 {
     int resp_sz;

--
2.40.0
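The QEDRequestCo pattern in the qed patch above generalises to any callback-based API whose completion may be either synchronous or asynchronous. A distilled sketch, with start_async_op() standing in as a hypothetical name for whatever actually submits the request:

    typedef struct ExampleCo {
        Coroutine *co;
        bool done;
        int ret;
    } ExampleCo;

    static void example_cb(void *opaque, int ret)
    {
        ExampleCo *c = opaque;

        c->done = true;
        c->ret = ret;
        /* wakes the submitter only if it has already yielded */
        qemu_coroutine_enter_if_inactive(c->co);
    }

    static int coroutine_fn example_request(void)
    {
        ExampleCo c = {
            .co = qemu_coroutine_self(),
            .done = false,
        };

        start_async_op(example_cb, &c);    /* hypothetical submission call */

        if (!c.done) {                     /* synchronous completion skips this */
            qemu_coroutine_yield();
        }
        return c.ret;
    }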
From: Stephen Bates <sbates@raithlin.com>

Add the ability for the NVMe model to support both the RDS and WDS
modes in the Controller Memory Buffer.

Although not currently supported in the upstreamed Linux kernel, a fork
with support exists [1] and user-space test programs that build on
this also exist [2].

Useful for testing CMB functionality in preparation for real
CMB-enabled NVMe devices (coming soon).

[1] https://github.com/sbates130272/linux-p2pmem
[2] https://github.com/sbates130272/p2pmem-test

Signed-off-by: Stephen Bates <sbates@raithlin.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 hw/block/nvme.c | 83 +++++++++++++++++++++++++++++++++++++++------------------
 hw/block/nvme.h |  1 +
 2 files changed, 58 insertions(+), 26 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -XXX,XX +XXX,XX @@
  *  cmb_size_mb=<cmb_size_mb[optional]>
  *
  * Note cmb_size_mb denotes size of CMB in MB. CMB is assumed to be at
- * offset 0 in BAR2 and supports SQS only for now.
+ * offset 0 in BAR2 and supports only WDS, RDS and SQS for now.
  */

 #include "qemu/osdep.h"
@@ -XXX,XX +XXX,XX @@ static void nvme_isr_notify(NvmeCtrl *n, NvmeCQueue *cq)
     }
 }

-static uint16_t nvme_map_prp(QEMUSGList *qsg, uint64_t prp1, uint64_t prp2,
-                             uint32_t len, NvmeCtrl *n)
+static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1,
+                             uint64_t prp2, uint32_t len, NvmeCtrl *n)
 {
     hwaddr trans_len = n->page_size - (prp1 % n->page_size);
     trans_len = MIN(len, trans_len);
@@ -XXX,XX +XXX,XX @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, uint64_t prp1, uint64_t prp2,

     if (!prp1) {
         return NVME_INVALID_FIELD | NVME_DNR;
+    } else if (n->cmbsz && prp1 >= n->ctrl_mem.addr &&
+               prp1 < n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size)) {
+        qsg->nsg = 0;
+        qemu_iovec_init(iov, num_prps);
+        qemu_iovec_add(iov, (void *)&n->cmbuf[prp1 - n->ctrl_mem.addr], trans_len);
+    } else {
+        pci_dma_sglist_init(qsg, &n->parent_obj, num_prps);
+        qemu_sglist_add(qsg, prp1, trans_len);
     }
-
-    pci_dma_sglist_init(qsg, &n->parent_obj, num_prps);
-    qemu_sglist_add(qsg, prp1, trans_len);
     len -= trans_len;
     if (len) {
         if (!prp2) {
@@ -XXX,XX +XXX,XX @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, uint64_t prp1, uint64_t prp2,

             nents = (len + n->page_size - 1) >> n->page_bits;
             prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
-            pci_dma_read(&n->parent_obj, prp2, (void *)prp_list, prp_trans);
+            nvme_addr_read(n, prp2, (void *)prp_list, prp_trans);
             while (len != 0) {
                 uint64_t prp_ent = le64_to_cpu(prp_list[i]);

@@ -XXX,XX +XXX,XX @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, uint64_t prp1, uint64_t prp2,
                     i = 0;
                     nents = (len + n->page_size - 1) >> n->page_bits;
                     prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
-                    pci_dma_read(&n->parent_obj, prp_ent, (void *)prp_list,
+                    nvme_addr_read(n, prp_ent, (void *)prp_list,
                                  prp_trans);
                     prp_ent = le64_to_cpu(prp_list[i]);
                 }
@@ -XXX,XX +XXX,XX @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, uint64_t prp1, uint64_t prp2,
                 }

                 trans_len = MIN(len, n->page_size);
-                qemu_sglist_add(qsg, prp_ent, trans_len);
+                if (qsg->nsg){
+                    qemu_sglist_add(qsg, prp_ent, trans_len);
+                } else {
+                    qemu_iovec_add(iov, (void *)&n->cmbuf[prp_ent - n->ctrl_mem.addr], trans_len);
+                }
                 len -= trans_len;
                 i++;
             }
@@ -XXX,XX +XXX,XX @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, uint64_t prp1, uint64_t prp2,
             if (prp2 & (n->page_size - 1)) {
                 goto unmap;
             }
-            qemu_sglist_add(qsg, prp2, len);
+            if (qsg->nsg) {
+                qemu_sglist_add(qsg, prp2, len);
+            } else {
+                qemu_iovec_add(iov, (void *)&n->cmbuf[prp2 - n->ctrl_mem.addr], trans_len);
+            }
         }
     }
     return NVME_SUCCESS;
@@ -XXX,XX +XXX,XX @@ static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
     uint64_t prp1, uint64_t prp2)
 {
     QEMUSGList qsg;
+    QEMUIOVector iov;
+    uint16_t status = NVME_SUCCESS;

-    if (nvme_map_prp(&qsg, prp1, prp2, len, n)) {
+    if (nvme_map_prp(&qsg, &iov, prp1, prp2, len, n)) {
         return NVME_INVALID_FIELD | NVME_DNR;
     }
-    if (dma_buf_read(ptr, len, &qsg)) {
+    if (qsg.nsg > 0) {
+        if (dma_buf_read(ptr, len, &qsg)) {
+            status = NVME_INVALID_FIELD | NVME_DNR;
+        }
         qemu_sglist_destroy(&qsg);
-        return NVME_INVALID_FIELD | NVME_DNR;
+    } else {
+        if (qemu_iovec_to_buf(&iov, 0, ptr, len) != len) {
+            status = NVME_INVALID_FIELD | NVME_DNR;
+        }
+        qemu_iovec_destroy(&iov);
     }
-    qemu_sglist_destroy(&qsg);
-    return NVME_SUCCESS;
+    return status;
 }

 static void nvme_post_cqes(void *opaque)
@@ -XXX,XX +XXX,XX @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
         return NVME_LBA_RANGE | NVME_DNR;
     }

-    if (nvme_map_prp(&req->qsg, prp1, prp2, data_size, n)) {
+    if (nvme_map_prp(&req->qsg, &req->iov, prp1, prp2, data_size, n)) {
         block_acct_invalid(blk_get_stats(n->conf.blk), acct);
         return NVME_INVALID_FIELD | NVME_DNR;
     }

-    assert((nlb << data_shift) == req->qsg.size);
-
-    req->has_sg = true;
     dma_acct_start(n->conf.blk, &req->acct, &req->qsg, acct);
-    req->aiocb = is_write ?
-        dma_blk_write(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
-                      nvme_rw_cb, req) :
-        dma_blk_read(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
-                     nvme_rw_cb, req);
+    if (req->qsg.nsg > 0) {
+        req->has_sg = true;
+        req->aiocb = is_write ?
+            dma_blk_write(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
+                          nvme_rw_cb, req) :
+            dma_blk_read(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
+                         nvme_rw_cb, req);
+    } else {
+        req->has_sg = false;
+        req->aiocb = is_write ?
+            blk_aio_pwritev(n->conf.blk, data_offset, &req->iov, 0, nvme_rw_cb,
+                            req) :
+            blk_aio_preadv(n->conf.blk, data_offset, &req->iov, 0, nvme_rw_cb,
+                           req);
+    }

     return NVME_NO_COMPLETE;
 }
@@ -XXX,XX +XXX,XX @@ static int nvme_init(PCIDevice *pci_dev)
         NVME_CMBSZ_SET_SQS(n->bar.cmbsz, 1);
         NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 0);
         NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 0);
-        NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 0);
-        NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 0);
+        NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 1);
+        NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 1);
         NVME_CMBSZ_SET_SZU(n->bar.cmbsz, 2); /* MBs */
         NVME_CMBSZ_SET_SZ(n->bar.cmbsz, n->cmb_size_mb);

+        n->cmbloc = n->bar.cmbloc;
+        n->cmbsz = n->bar.cmbsz;
+
         n->cmbuf = g_malloc0(NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
         memory_region_init_io(&n->ctrl_mem, OBJECT(n), &nvme_cmb_ops, n,
                               "nvme-cmb", NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index XXXXXXX..XXXXXXX 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -XXX,XX +XXX,XX @@ typedef struct NvmeRequest {
     NvmeCqe                 cqe;
     BlockAcctCookie         acct;
     QEMUSGList              qsg;
+    QEMUIOVector            iov;
     QTAILQ_ENTRY(NvmeRequest)entry;
 } NvmeRequest;

--
1.8.3.1

From: Paolo Bonzini <pbonzini@redhat.com>

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20230309084456.304669-8-pbonzini@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 tests/unit/test-thread-pool.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/unit/test-thread-pool.c b/tests/unit/test-thread-pool.c
index XXXXXXX..XXXXXXX 100644
--- a/tests/unit/test-thread-pool.c
+++ b/tests/unit/test-thread-pool.c
@@ -XXX,XX +XXX,XX @@ static void test_submit_aio(void)
     g_assert_cmpint(data.ret, ==, 0);
 }

-static void co_test_cb(void *opaque)
+static void coroutine_fn co_test_cb(void *opaque)
 {
     WorkerTestData *data = opaque;

--
2.40.0
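One NVMe spec detail behind the CMBSZ programming in the NVMe CMB patch above (background knowledge, not something stated in the patch itself): SZU encodes the size granularity as 4 KiB << (4 * SZU), so SZU=2 selects 1 MiB units and SZ can simply be set to the size in megabytes, which is what the "/* MBs */" comment refers to. As a sanity check in plain C:

    /* SZU=2 -> 4 KiB * 16^2 = 1 MiB granularity (per the NVMe spec) */
    uint64_t szu_bytes = 4096ULL << (4 * 2);        /* == 1048576 bytes */
    uint64_t cmb_bytes = szu_bytes * cmb_size_mb;   /* total CMB size */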
From: Alberto Garcia <berto@igalia.com>

If the guest tries to write data that results in the allocation of a
new cluster, instead of writing the guest data first and then the data
from the COW regions, write everything together using one single I/O
operation.

This can improve the write performance by 25% or more, depending on
several factors such as the media type, the cluster size and the I/O
request size.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2-cluster.c | 40 ++++++++++++++++++++++--------
 block/qcow2.c         | 64 +++++++++++++++++++++++++++++++++++++++++++--------
 block/qcow2.h         |  7 ++++++
 3 files changed, 91 insertions(+), 20 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -XXX,XX +XXX,XX @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
     assert(start->nb_bytes <= UINT_MAX - end->nb_bytes);
     assert(start->nb_bytes + end->nb_bytes <= UINT_MAX - data_bytes);
     assert(start->offset + start->nb_bytes <= end->offset);
+    assert(!m->data_qiov || m->data_qiov->size == data_bytes);

     if (start->nb_bytes == 0 && end->nb_bytes == 0) {
         return 0;
@@ -XXX,XX +XXX,XX @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
     /* The part of the buffer where the end region is located */
     end_buffer = start_buffer + buffer_size - end->nb_bytes;

-    qemu_iovec_init(&qiov, 1);
+    qemu_iovec_init(&qiov, 2 + (m->data_qiov ? m->data_qiov->niov : 0));

     qemu_co_mutex_unlock(&s->lock);
     /* First we read the existing data from both COW regions. We
@@ -XXX,XX +XXX,XX @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
         }
     }

-    /* And now we can write everything */
-    qemu_iovec_reset(&qiov);
-    qemu_iovec_add(&qiov, start_buffer, start->nb_bytes);
-    ret = do_perform_cow_write(bs, m->alloc_offset, start->offset, &qiov);
-    if (ret < 0) {
-        goto fail;
+    /* And now we can write everything. If we have the guest data we
+     * can write everything in one single operation */
+    if (m->data_qiov) {
+        qemu_iovec_reset(&qiov);
+        if (start->nb_bytes) {
+            qemu_iovec_add(&qiov, start_buffer, start->nb_bytes);
+        }
+        qemu_iovec_concat(&qiov, m->data_qiov, 0, data_bytes);
+        if (end->nb_bytes) {
+            qemu_iovec_add(&qiov, end_buffer, end->nb_bytes);
+        }
+        /* NOTE: we have a write_aio blkdebug event here followed by
+         * a cow_write one in do_perform_cow_write(), but there's only
+         * one single I/O operation */
+        BLKDBG_EVENT(bs->file, BLKDBG_WRITE_AIO);
+        ret = do_perform_cow_write(bs, m->alloc_offset, start->offset, &qiov);
+    } else {
+        /* If there's no guest data then write both COW regions separately */
+        qemu_iovec_reset(&qiov);
+        qemu_iovec_add(&qiov, start_buffer, start->nb_bytes);
+        ret = do_perform_cow_write(bs, m->alloc_offset, start->offset, &qiov);
+        if (ret < 0) {
+            goto fail;
+        }
+
+        qemu_iovec_reset(&qiov);
+        qemu_iovec_add(&qiov, end_buffer, end->nb_bytes);
+        ret = do_perform_cow_write(bs, m->alloc_offset, end->offset, &qiov);
     }

-    qemu_iovec_reset(&qiov);
-    qemu_iovec_add(&qiov, end_buffer, end->nb_bytes);
-    ret = do_perform_cow_write(bs, m->alloc_offset, end->offset, &qiov);
 fail:
     qemu_co_mutex_lock(&s->lock);

diff --git a/block/qcow2.c b/block/qcow2.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -XXX,XX +XXX,XX @@ fail:
     return ret;
 }

+/* Check if it's possible to merge a write request with the writing of
+ * the data from the COW regions */
+static bool merge_cow(uint64_t offset, unsigned bytes,
+                      QEMUIOVector *hd_qiov, QCowL2Meta *l2meta)
+{
+    QCowL2Meta *m;
+
+    for (m = l2meta; m != NULL; m = m->next) {
+        /* If both COW regions are empty then there's nothing to merge */
+        if (m->cow_start.nb_bytes == 0 && m->cow_end.nb_bytes == 0) {
+            continue;
+        }
+
+        /* The data (middle) region must be immediately after the
+         * start region */
+        if (l2meta_cow_start(m) + m->cow_start.nb_bytes != offset) {
+            continue;
+        }
+
+        /* The end region must be immediately after the data (middle)
+         * region */
+        if (m->offset + m->cow_end.offset != offset + bytes) {
+            continue;
+        }
+
+        /* Make sure that adding both COW regions to the QEMUIOVector
+         * does not exceed IOV_MAX */
+        if (hd_qiov->niov > IOV_MAX - 2) {
+            continue;
+        }
+
+        m->data_qiov = hd_qiov;
+        return true;
+    }
+
+    return false;
+}
+
 static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
                                          uint64_t bytes, QEMUIOVector *qiov,
                                          int flags)
@@ -XXX,XX +XXX,XX @@ static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
             goto fail;
         }

-        qemu_co_mutex_unlock(&s->lock);
-        BLKDBG_EVENT(bs->file, BLKDBG_WRITE_AIO);
-        trace_qcow2_writev_data(qemu_coroutine_self(),
-                                cluster_offset + offset_in_cluster);
-        ret = bdrv_co_pwritev(bs->file,
-                              cluster_offset + offset_in_cluster,
-                              cur_bytes, &hd_qiov, 0);
-        qemu_co_mutex_lock(&s->lock);
-        if (ret < 0) {
-            goto fail;
+        /* If we need to do COW, check if it's possible to merge the
+         * writing of the guest data together with that of the COW regions.
+         * If it's not possible (or not necessary) then write the
+         * guest data now. */
+        if (!merge_cow(offset, cur_bytes, &hd_qiov, l2meta)) {
+            qemu_co_mutex_unlock(&s->lock);
+            BLKDBG_EVENT(bs->file, BLKDBG_WRITE_AIO);
+            trace_qcow2_writev_data(qemu_coroutine_self(),
+                                    cluster_offset + offset_in_cluster);
+            ret = bdrv_co_pwritev(bs->file,
+                                  cluster_offset + offset_in_cluster,
+                                  cur_bytes, &hd_qiov, 0);
+            qemu_co_mutex_lock(&s->lock);
+            if (ret < 0) {
+                goto fail;
+            }
         }

         while (l2meta != NULL) {
diff --git a/block/qcow2.h b/block/qcow2.h
index XXXXXXX..XXXXXXX 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -XXX,XX +XXX,XX @@ typedef struct QCowL2Meta
      */
     Qcow2COWRegion cow_end;

+    /**
+     * The I/O vector with the data from the actual guest write request.
+     * If non-NULL, this is meant to be merged together with the data
+     * from @cow_start and @cow_end into one single write operation.
+     */
+    QEMUIOVector *data_qiov;
+
     /** Pointer to next L2Meta of the same write request */
     struct QCowL2Meta *next;

--
1.8.3.1

From: Paolo Bonzini <pbonzini@redhat.com>

Functions that can do I/O (including calling bdrv_is_allocated
and bdrv_block_status functions) are prime candidates for being
coroutine_fns. Make the change for those that are themselves called
only from coroutine_fns. Also annotate that they are called with the
graph rdlock taken, thus allowing them to call bdrv_co_*() functions
for I/O.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20230309084456.304669-9-pbonzini@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2.h          | 15 ++++++++-------
 block/qcow2-bitmap.c   |  2 +-
 block/qcow2-cluster.c  | 21 +++++++++++++--------
 block/qcow2-refcount.c |  8 ++++----
 block/qcow2-snapshot.c | 25 +++++++++++++------------
 block/qcow2.c          | 27 ++++++++++++++-------------
 6 files changed, 53 insertions(+), 45 deletions(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index XXXXXXX..XXXXXXX 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -XXX,XX +XXX,XX @@ int64_t qcow2_refcount_area(BlockDriverState *bs, uint64_t offset,
                             uint64_t new_refblock_offset);

 int64_t qcow2_alloc_clusters(BlockDriverState *bs, uint64_t size);
-int64_t qcow2_alloc_clusters_at(BlockDriverState *bs, uint64_t offset,
-                                int64_t nb_clusters);
-int64_t qcow2_alloc_bytes(BlockDriverState *bs, int size);
+int64_t coroutine_fn qcow2_alloc_clusters_at(BlockDriverState *bs, uint64_t offset,
+                                             int64_t nb_clusters);
+int64_t coroutine_fn qcow2_alloc_bytes(BlockDriverState *bs, int size);
 void qcow2_free_clusters(BlockDriverState *bs,
                          int64_t offset, int64_t size,
                          enum qcow2_discard_type type);
@@ -XXX,XX +XXX,XX @@ int qcow2_change_refcount_order(BlockDriverState *bs, int refcount_order,
                                 BlockDriverAmendStatusCB *status_cb,
                                 void *cb_opaque, Error **errp);
 int coroutine_fn GRAPH_RDLOCK qcow2_shrink_reftable(BlockDriverState *bs);
-int64_t qcow2_get_last_cluster(BlockDriverState *bs, int64_t size);
+int64_t coroutine_fn qcow2_get_last_cluster(BlockDriverState *bs, int64_t size);
 int coroutine_fn qcow2_detect_metadata_preallocation(BlockDriverState *bs);

 /* qcow2-cluster.c functions */
@@ -XXX,XX +XXX,XX @@ void qcow2_parse_compressed_l2_entry(BlockDriverState *bs, uint64_t l2_entry,
 int coroutine_fn GRAPH_RDLOCK
 qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m);

-void qcow2_alloc_cluster_abort(BlockDriverState *bs, QCowL2Meta *m);
+void coroutine_fn qcow2_alloc_cluster_abort(BlockDriverState *bs, QCowL2Meta *m);
 int qcow2_cluster_discard(BlockDriverState *bs, uint64_t offset,
                           uint64_t bytes, enum qcow2_discard_type type,
                           bool full_discard);
@@ -XXX,XX +XXX,XX @@ int qcow2_snapshot_load_tmp(BlockDriverState *bs,
                             Error **errp);

 void qcow2_free_snapshots(BlockDriverState *bs);
-int qcow2_read_snapshots(BlockDriverState *bs, Error **errp);
+int coroutine_fn GRAPH_RDLOCK
+qcow2_read_snapshots(BlockDriverState *bs, Error **errp);
 int qcow2_write_snapshots(BlockDriverState *bs);

 int coroutine_fn GRAPH_RDLOCK
@@ -XXX,XX +XXX,XX @@ bool coroutine_fn qcow2_load_dirty_bitmaps(BlockDriverState *bs,
 bool qcow2_get_bitmap_info_list(BlockDriverState *bs,
                                 Qcow2BitmapInfoList **info_list, Error **errp);
 int qcow2_reopen_bitmaps_rw(BlockDriverState *bs, Error **errp);
-int qcow2_truncate_bitmaps_check(BlockDriverState *bs, Error **errp);
+int coroutine_fn qcow2_truncate_bitmaps_check(BlockDriverState *bs, Error **errp);
 bool qcow2_store_persistent_dirty_bitmaps(BlockDriverState *bs,
                                           bool release_stored, Error **errp);
 int qcow2_reopen_bitmaps_ro(BlockDriverState *bs, Error **errp);
diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qcow2-bitmap.c
+++ b/block/qcow2-bitmap.c
@@ -XXX,XX +XXX,XX @@ out:
 }

 /* Checks to see if it's safe to resize bitmaps */
-int qcow2_truncate_bitmaps_check(BlockDriverState *bs, Error **errp)
+int coroutine_fn qcow2_truncate_bitmaps_check(BlockDriverState *bs, Error **errp)
 {
     BDRVQcow2State *s = bs->opaque;
     Qcow2BitmapList *bm_list;
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -XXX,XX +XXX,XX @@ err:
 * Frees the allocated clusters because the request failed and they won't
 * actually be linked.
 */
-void qcow2_alloc_cluster_abort(BlockDriverState *bs, QCowL2Meta *m)
+void coroutine_fn qcow2_alloc_cluster_abort(BlockDriverState *bs, QCowL2Meta *m)
 {
     BDRVQcow2State *s = bs->opaque;
     if (!has_data_file(bs) && !m->keep_old_clusters) {
@@ -XXX,XX +XXX,XX @@ void qcow2_alloc_cluster_abort(BlockDriverState *bs, QCowL2Meta *m)
 *
 * Returns 0 on success, -errno on failure.
 */
-static int calculate_l2_meta(BlockDriverState *bs, uint64_t host_cluster_offset,
-                             uint64_t guest_offset, unsigned bytes,
-                             uint64_t *l2_slice, QCowL2Meta **m, bool keep_old)
+static int coroutine_fn calculate_l2_meta(BlockDriverState *bs,
+                                          uint64_t host_cluster_offset,
+                                          uint64_t guest_offset, unsigned bytes,
+                                          uint64_t *l2_slice, QCowL2Meta **m,
+                                          bool keep_old)
 {
     BDRVQcow2State *s = bs->opaque;
     int sc_index, l2_index = offset_to_l2_slice_index(s, guest_offset);
@@ -XXX,XX +XXX,XX @@ out:
 * function has been waiting for another request and the allocation must be
 * restarted, but the whole request should not be failed.
 */
-static int do_alloc_cluster_offset(BlockDriverState *bs, uint64_t guest_offset,
-                                   uint64_t *host_offset, uint64_t *nb_clusters)
+static int coroutine_fn do_alloc_cluster_offset(BlockDriverState *bs,
+                                                uint64_t guest_offset,
+                                                uint64_t *host_offset,
+                                                uint64_t *nb_clusters)
 {
     BDRVQcow2State *s = bs->opaque;

@@ -XXX,XX +XXX,XX @@ static int zero_in_l2_slice(BlockDriverState *bs, uint64_t offset,
     return nb_clusters;
 }

-static int zero_l2_subclusters(BlockDriverState *bs, uint64_t offset,
-                               unsigned nb_subclusters)
+static int coroutine_fn
+zero_l2_subclusters(BlockDriverState *bs, uint64_t offset,
+                    unsigned nb_subclusters)
 {
     BDRVQcow2State *s = bs->opaque;
     uint64_t *l2_slice;
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -XXX,XX +XXX,XX @@ int64_t qcow2_alloc_clusters(BlockDriverState *bs, uint64_t size)
     return offset;
 }

-int64_t qcow2_alloc_clusters_at(BlockDriverState *bs, uint64_t offset,
-                                int64_t nb_clusters)
+int64_t coroutine_fn qcow2_alloc_clusters_at(BlockDriverState *bs, uint64_t offset,
+                                             int64_t nb_clusters)
 {
     BDRVQcow2State *s = bs->opaque;
     uint64_t cluster_index, refcount;
@@ -XXX,XX +XXX,XX @@ int64_t qcow2_alloc_clusters_at(BlockDriverState *bs, uint64_t offset,

 /* only used to allocate compressed sectors. We try to allocate
    contiguous sectors. size must be <= cluster_size */
-int64_t qcow2_alloc_bytes(BlockDriverState *bs, int size)
+int64_t coroutine_fn qcow2_alloc_bytes(BlockDriverState *bs, int size)
 {
     BDRVQcow2State *s = bs->opaque;
     int64_t offset;
@@ -XXX,XX +XXX,XX @@ out:
     return ret;
 }

-int64_t qcow2_get_last_cluster(BlockDriverState *bs, int64_t size)
+int64_t coroutine_fn qcow2_get_last_cluster(BlockDriverState *bs, int64_t size)
 {
     BDRVQcow2State *s = bs->opaque;
     int64_t i;
diff --git a/block/qcow2-snapshot.c b/block/qcow2-snapshot.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qcow2-snapshot.c
+++ b/block/qcow2-snapshot.c
@@ -XXX,XX +XXX,XX @@ void qcow2_free_snapshots(BlockDriverState *bs)
 * qcow2_check_refcounts() does not do anything with snapshots'
 * extra data.)
 */
-static int qcow2_do_read_snapshots(BlockDriverState *bs, bool repair,
-                                   int *nb_clusters_reduced,
-                                   int *extra_data_dropped,
-                                   Error **errp)
+static coroutine_fn GRAPH_RDLOCK
+int qcow2_do_read_snapshots(BlockDriverState *bs, bool repair,
+                            int *nb_clusters_reduced,
+                            int *extra_data_dropped,
+                            Error **errp)
 {
     BDRVQcow2State *s = bs->opaque;
     QCowSnapshotHeader h;
@@ -XXX,XX +XXX,XX @@ static int qcow2_do_read_snapshots(BlockDriverState *bs, bool repair,

         /* Read statically sized part of the snapshot header */
         offset = ROUND_UP(offset, 8);
-        ret = bdrv_pread(bs->file, offset, sizeof(h), &h, 0);
+        ret = bdrv_co_pread(bs->file, offset, sizeof(h), &h, 0);
         if (ret < 0) {
             error_setg_errno(errp, -ret, "Failed to read snapshot table");
             goto fail;
@@ -XXX,XX +XXX,XX @@ static int qcow2_do_read_snapshots(BlockDriverState *bs, bool repair,
         }

         /* Read known extra data */
-        ret = bdrv_pread(bs->file, offset,
-                         MIN(sizeof(extra), sn->extra_data_size), &extra, 0);
+        ret = bdrv_co_pread(bs->file, offset,
+                            MIN(sizeof(extra), sn->extra_data_size), &extra, 0);
         if (ret < 0) {
             error_setg_errno(errp, -ret, "Failed to read snapshot table");
             goto fail;
@@ -XXX,XX +XXX,XX @@ static int qcow2_do_read_snapshots(BlockDriverState *bs, bool repair,
             /* Store unknown extra data */
             unknown_extra_data_size = sn->extra_data_size - sizeof(extra);
             sn->unknown_extra_data = g_malloc(unknown_extra_data_size);
-            ret = bdrv_pread(bs->file, offset, unknown_extra_data_size,
-                             sn->unknown_extra_data, 0);
+            ret = bdrv_co_pread(bs->file, offset, unknown_extra_data_size,
+                                sn->unknown_extra_data, 0);
             if (ret < 0) {
                 error_setg_errno(errp, -ret,
                                  "Failed to read snapshot table");
@@ -XXX,XX +XXX,XX @@ static int qcow2_do_read_snapshots(BlockDriverState *bs, bool repair,

         /* Read snapshot ID */
         sn->id_str = g_malloc(id_str_size + 1);
-        ret = bdrv_pread(bs->file, offset, id_str_size, sn->id_str, 0);
+        ret = bdrv_co_pread(bs->file, offset, id_str_size, sn->id_str, 0);
         if (ret < 0) {
             error_setg_errno(errp, -ret, "Failed to read snapshot table");
             goto fail;
@@ -XXX,XX +XXX,XX @@ static int qcow2_do_read_snapshots(BlockDriverState *bs, bool repair,

         /* Read snapshot name */
         sn->name = g_malloc(name_size + 1);
-        ret = bdrv_pread(bs->file, offset, name_size, sn->name, 0);
+        ret = bdrv_co_pread(bs->file, offset, name_size, sn->name, 0);
         if (ret < 0) {
             error_setg_errno(errp, -ret, "Failed to read snapshot table");
             goto fail;
@@ -XXX,XX +XXX,XX @@ fail:
     return ret;
 }

-int qcow2_read_snapshots(BlockDriverState *bs, Error **errp)
+int coroutine_fn qcow2_read_snapshots(BlockDriverState *bs, Error **errp)
 {
     return qcow2_do_read_snapshots(bs, false, NULL, NULL, errp);
 }
diff --git a/block/qcow2.c b/block/qcow2.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -XXX,XX +XXX,XX @@ qcow2_extract_crypto_opts(QemuOpts *opts, const char *fmt, Error **errp)
 * unknown magic is skipped (future extension this version knows nothing about)
 * return 0 upon success, non-0 otherwise
 */
-static int qcow2_read_extensions(BlockDriverState *bs, uint64_t start_offset,
-                                 uint64_t end_offset, void **p_feature_table,
-                                 int flags, bool *need_update_header,
-                                 Error **errp)
+static int coroutine_fn GRAPH_RDLOCK
+qcow2_read_extensions(BlockDriverState *bs, uint64_t start_offset,
+                      uint64_t end_offset, void **p_feature_table,
+                      int flags, bool *need_update_header, Error **errp)
 {
     BDRVQcow2State *s = bs->opaque;
     QCowExtension ext;
@@ -XXX,XX +XXX,XX @@ static int qcow2_read_extensions(BlockDriverState *bs, uint64_t start_offset,
         printf("attempting to read extended header in offset %lu\n", offset);
 #endif

-        ret = bdrv_pread(bs->file, offset, sizeof(ext), &ext, 0);
+        ret = bdrv_co_pread(bs->file, offset, sizeof(ext), &ext, 0);
         if (ret < 0) {
             error_setg_errno(errp, -ret, "qcow2_read_extension: ERROR: "
                              "pread fail from offset %" PRIu64, offset);
@@ -XXX,XX +XXX,XX @@ static int qcow2_read_extensions(BlockDriverState *bs, uint64_t start_offset,
                        sizeof(bs->backing_format));
                 return 2;
             }
-            ret = bdrv_pread(bs->file, offset, ext.len, bs->backing_format, 0);
+            ret = bdrv_co_pread(bs->file, offset, ext.len, bs->backing_format, 0);
             if (ret < 0) {
                 error_setg_errno(errp, -ret, "ERROR: ext_backing_format: "
                                  "Could not read format name");
@@ -XXX,XX +XXX,XX @@ static int qcow2_read_extensions(BlockDriverState *bs, uint64_t start_offset,
         case QCOW2_EXT_MAGIC_FEATURE_TABLE:
             if (p_feature_table != NULL) {
                 void *feature_table = g_malloc0(ext.len + 2 * sizeof(Qcow2Feature));
-                ret = bdrv_pread(bs->file, offset, ext.len, feature_table, 0);
+                ret = bdrv_co_pread(bs->file, offset, ext.len, feature_table, 0);
                 if (ret < 0) {
                     error_setg_errno(errp, -ret, "ERROR: ext_feature_table: "
                                      "Could not read table");
@@ -XXX,XX +XXX,XX @@ static int qcow2_read_extensions(BlockDriverState *bs, uint64_t start_offset,
                 return -EINVAL;
             }

-            ret = bdrv_pread(bs->file, offset, ext.len, &s->crypto_header, 0);
+            ret = bdrv_co_pread(bs->file, offset, ext.len, &s->crypto_header, 0);
             if (ret < 0) {
                 error_setg_errno(errp, -ret,
                                  "Unable to read CRYPTO header extension");
@@ -XXX,XX +XXX,XX @@ static int qcow2_read_extensions(BlockDriverState *bs, uint64_t start_offset,
                 break;
             }

-            ret = bdrv_pread(bs->file, offset, ext.len, &bitmaps_ext, 0);
+            ret = bdrv_co_pread(bs->file, offset, ext.len, &bitmaps_ext, 0);
             if (ret < 0) {
                 error_setg_errno(errp, -ret, "bitmaps_ext: "
                                  "Could not read ext header");
@@ -XXX,XX +XXX,XX @@ static int qcow2_read_extensions(BlockDriverState *bs, uint64_t start_offset,
         case QCOW2_EXT_MAGIC_DATA_FILE:
         {
             s->image_data_file = g_malloc0(ext.len + 1);
-            ret = bdrv_pread(bs->file, offset, ext.len, s->image_data_file, 0);
+            ret = bdrv_co_pread(bs->file, offset, ext.len, s->image_data_file, 0);
             if (ret < 0) {
                 error_setg_errno(errp, -ret,
                                  "ERROR: Could not read data file name");
@@ -XXX,XX +XXX,XX @@ static int qcow2_read_extensions(BlockDriverState *bs, uint64_t start_offset,
             uext->len = ext.len;
             QLIST_INSERT_HEAD(&s->unknown_header_ext, uext, next);

-            ret = bdrv_pread(bs->file, offset, uext->len, uext->data, 0);
+            ret = bdrv_co_pread(bs->file, offset, uext->len, uext->data, 0);
             if (ret < 0) {
                 error_setg_errno(errp, -ret, "ERROR: unknown extension: "
                                  "Could not read data");
@@ -XXX,XX +XXX,XX @@ static void qcow2_update_options_abort(BlockDriverState *bs,
     qapi_free_QCryptoBlockOpenOptions(r->crypto_opts);
 }

-static int qcow2_update_options(BlockDriverState *bs, QDict *options,
-                                int flags, Error **errp)
+static int coroutine_fn
+qcow2_update_options(BlockDriverState *bs, QDict *options, int flags,
+                     Error **errp)
 {
     Qcow2ReopenState r = {};
     int ret;
--
2.40.0
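To make the merged-write path of the qcow2 COW patch above concrete: when merge_cow() succeeds, perform_cow() builds one QEMUIOVector that covers the head COW region, the guest data and the tail COW region back to back, so a single write replaces what used to be up to three. Its core, using the same calls as the patch (the emptiness checks omitted for brevity):

    qemu_iovec_reset(&qiov);
    qemu_iovec_add(&qiov, start_buffer, start->nb_bytes);     /* head COW   */
    qemu_iovec_concat(&qiov, m->data_qiov, 0, data_bytes);    /* guest data */
    qemu_iovec_add(&qiov, end_buffer, end->nb_bytes);         /* tail COW   */
    ret = do_perform_cow_write(bs, m->alloc_offset, start->offset, &qiov);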
From: Alberto Garcia <berto@igalia.com>

Instead of passing a single buffer pointer to do_perform_cow_write(),
pass a QEMUIOVector. This will allow us to merge the write requests
for the COW regions and the actual data into a single one.

Although do_perform_cow_read() does not strictly need to change its
API, we're doing it here as well for consistency.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2-cluster.c | 51 ++++++++++++++++++++++++---------------------------
 1 file changed, 24 insertions(+), 27 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -XXX,XX +XXX,XX @@ int qcow2_encrypt_sectors(BDRVQcow2State *s, int64_t sector_num,
 static int coroutine_fn do_perform_cow_read(BlockDriverState *bs,
                                             uint64_t src_cluster_offset,
                                             unsigned offset_in_cluster,
-                                            uint8_t *buffer,
-                                            unsigned bytes)
+                                            QEMUIOVector *qiov)
 {
-    QEMUIOVector qiov;
-    struct iovec iov = { .iov_base = buffer, .iov_len = bytes };
     int ret;

-    if (bytes == 0) {
+    if (qiov->size == 0) {
         return 0;
     }

-    qemu_iovec_init_external(&qiov, &iov, 1);
-
     BLKDBG_EVENT(bs->file, BLKDBG_COW_READ);

     if (!bs->drv) {
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn do_perform_cow_read(BlockDriverState *bs,
      * which can lead to deadlock when block layer copy-on-read is enabled.
      */
     ret = bs->drv->bdrv_co_preadv(bs, src_cluster_offset + offset_in_cluster,
-                                  bytes, &qiov, 0);
+                                  qiov->size, qiov, 0);
     if (ret < 0) {
         return ret;
     }
@@ -XXX,XX +XXX,XX @@ static bool coroutine_fn do_perform_cow_encrypt(BlockDriverState *bs,
 static int coroutine_fn do_perform_cow_write(BlockDriverState *bs,
                                              uint64_t cluster_offset,
                                              unsigned offset_in_cluster,
-                                             uint8_t *buffer,
-                                             unsigned bytes)
+                                             QEMUIOVector *qiov)
 {
-    QEMUIOVector qiov;
-    struct iovec iov = { .iov_base = buffer, .iov_len = bytes };
     int ret;

-    if (bytes == 0) {
+    if (qiov->size == 0) {
         return 0;
     }

-    qemu_iovec_init_external(&qiov, &iov, 1);
-
     ret = qcow2_pre_write_overlap_check(bs, 0,
-            cluster_offset + offset_in_cluster, bytes);
+            cluster_offset + offset_in_cluster, qiov->size);
     if (ret < 0) {
         return ret;
     }

     BLKDBG_EVENT(bs->file, BLKDBG_COW_WRITE);
     ret = bdrv_co_pwritev(bs->file, cluster_offset + offset_in_cluster,
-                          bytes, &qiov, 0);
+                          qiov->size, qiov, 0);
     if (ret < 0) {
         return ret;
     }
@@ -XXX,XX +XXX,XX @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
     unsigned data_bytes = end->offset - (start->offset + start->nb_bytes);
     bool merge_reads;
     uint8_t *start_buffer, *end_buffer;
+    QEMUIOVector qiov;
     int ret;

     assert(start->nb_bytes <= UINT_MAX - end->nb_bytes);
@@ -XXX,XX +XXX,XX @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
     /* The part of the buffer where the end region is located */
     end_buffer = start_buffer + buffer_size - end->nb_bytes;

+    qemu_iovec_init(&qiov, 1);
+
     qemu_co_mutex_unlock(&s->lock);
     /* First we read the existing data from both COW regions. We
      * either read the whole region in one go, or the start and end
      * regions separately. */
     if (merge_reads) {
-        ret = do_perform_cow_read(bs, m->offset, start->offset,
-                                  start_buffer, buffer_size);
+        qemu_iovec_add(&qiov, start_buffer, buffer_size);
+        ret = do_perform_cow_read(bs, m->offset, start->offset, &qiov);
     } else {
-        ret = do_perform_cow_read(bs, m->offset, start->offset,
-                                  start_buffer, start->nb_bytes);
+        qemu_iovec_add(&qiov, start_buffer, start->nb_bytes);
+        ret = do_perform_cow_read(bs, m->offset, start->offset, &qiov);
         if (ret < 0) {
             goto fail;
         }

-        ret = do_perform_cow_read(bs, m->offset, end->offset,
-                                  end_buffer, end->nb_bytes);
+        qemu_iovec_reset(&qiov);
+        qemu_iovec_add(&qiov, end_buffer, end->nb_bytes);
+        ret = do_perform_cow_read(bs, m->offset, end->offset, &qiov);
     }
     if (ret < 0) {
         goto fail;
@@ -XXX,XX +XXX,XX @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
     }

     /* And now we can write everything */
-    ret = do_perform_cow_write(bs, m->alloc_offset, start->offset,
-                               start_buffer, start->nb_bytes);
+    qemu_iovec_reset(&qiov);
+    qemu_iovec_add(&qiov, start_buffer, start->nb_bytes);
+    ret = do_perform_cow_write(bs, m->alloc_offset, start->offset, &qiov);
     if (ret < 0) {
         goto fail;
     }

-    ret = do_perform_cow_write(bs, m->alloc_offset, end->offset,
-                               end_buffer, end->nb_bytes);
+    qemu_iovec_reset(&qiov);
+    qemu_iovec_add(&qiov, end_buffer, end->nb_bytes);
+    ret = do_perform_cow_write(bs, m->alloc_offset, end->offset, &qiov);
 fail:
     qemu_co_mutex_lock(&s->lock);

@@ -XXX,XX +XXX,XX @@ fail:
     }

     qemu_vfree(start_buffer);
+    qemu_iovec_destroy(&qiov);
     return ret;
 }

--
1.8.3.1

From: Paolo Bonzini <pbonzini@redhat.com>

Functions that can do I/O are prime candidates for being coroutine_fns. Make the
change for the one that is itself called only from coroutine_fns. Unfortunately
vmdk does not use a coroutine_fn for the bulk of the open (like qcow2 does) so
vmdk_read_cid cannot have the same treatment.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20230309084456.304669-10-pbonzini@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/vmdk.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/vmdk.c b/block/vmdk.c
index XXXXXXX..XXXXXXX 100644
--- a/block/vmdk.c
+++ b/block/vmdk.c
@@ -XXX,XX +XXX,XX @@ out:
     return ret;
 }

-static int vmdk_is_cid_valid(BlockDriverState *bs)
+static int coroutine_fn vmdk_is_cid_valid(BlockDriverState *bs)
 {
     BDRVVmdkState *s = bs->opaque;
     uint32_t cur_pcid;
--
2.40.0
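The QEMUIOVector lifecycle that the qcow2 patch above switches to, shown in isolation with made-up buffer names: one heap-backed vector is initialised once and reused across requests by resetting it, instead of wrapping each raw buffer with qemu_iovec_init_external():

    QEMUIOVector qiov;

    qemu_iovec_init(&qiov, 1);                   /* allocate iovec storage */

    qemu_iovec_add(&qiov, start_buffer, start_len);
    /* ... first request uses &qiov; qiov.size == start_len ... */

    qemu_iovec_reset(&qiov);                     /* empty it, keep storage */
    qemu_iovec_add(&qiov, end_buffer, end_len);
    /* ... second request uses &qiov ... */

    qemu_iovec_destroy(&qiov);                   /* release iovec storage */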
Deleted patch

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/qed-cluster.c | 39 ++++++++++++++++++++++-----------------
 block/qed.c         | 24 +++++++++++-------------
 block/qed.h         |  4 ++--
 3 files changed, 35 insertions(+), 32 deletions(-)

diff --git a/block/qed-cluster.c b/block/qed-cluster.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed-cluster.c
+++ b/block/qed-cluster.c
@@ -XXX,XX +XXX,XX @@ static unsigned int qed_count_contiguous_clusters(BDRVQEDState *s,
 * @s: QED state
 * @request: L2 cache entry
 * @pos: Byte position in device
- * @len: Number of bytes
- * @cb: Completion function
- * @opaque: User data for completion function
+ * @len: Number of bytes (may be shortened on return)
+ * @img_offset: Contains offset in the image file on success
 *
 * This function translates a position in the block device to an offset in the
- * image file. It invokes the cb completion callback to report back the
- * translated offset or unallocated range in the image file.
+ * image file. The translated offset or unallocated range in the image file is
+ * reported back in *img_offset and *len.
 *
 * If the L2 table exists, request->l2_table points to the L2 table cache entry
 * and the caller must free the reference when they are finished. The cache
 * entry is exposed in this way to avoid callers having to read the L2 table
 * again later during request processing. If request->l2_table is non-NULL it
 * will be unreferenced before taking on the new cache entry.
+ *
+ * On success QED_CLUSTER_FOUND is returned and img_offset/len are a contiguous
+ * range in the image file.
+ *
+ * On failure QED_CLUSTER_L2 or QED_CLUSTER_L1 is returned for missing L2 or L1
+ * table offset, respectively. len is number of contiguous unallocated bytes.
 */
-void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
-                      size_t len, QEDFindClusterFunc *cb, void *opaque)
+int qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
+                     size_t *len, uint64_t *img_offset)
 {
     uint64_t l2_offset;
     uint64_t offset = 0;
@@ -XXX,XX +XXX,XX @@ void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
     /* Limit length to L2 boundary. Requests are broken up at the L2 boundary
      * so that a request acts on one L2 table at a time.
      */
-    len = MIN(len, (((pos >> s->l1_shift) + 1) << s->l1_shift) - pos);
+    *len = MIN(*len, (((pos >> s->l1_shift) + 1) << s->l1_shift) - pos);

     l2_offset = s->l1_table->offsets[qed_l1_index(s, pos)];
     if (qed_offset_is_unalloc_cluster(l2_offset)) {
-        cb(opaque, QED_CLUSTER_L1, 0, len);
-        return;
+        *img_offset = 0;
+        return QED_CLUSTER_L1;
     }
     if (!qed_check_table_offset(s, l2_offset)) {
-        cb(opaque, -EINVAL, 0, 0);
-        return;
+        *img_offset = *len = 0;
+        return -EINVAL;
     }

     ret = qed_read_l2_table(s, request, l2_offset);
@@ -XXX,XX +XXX,XX @@ void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
     }

     index = qed_l2_index(s, pos);
-    n = qed_bytes_to_clusters(s,
-                              qed_offset_into_cluster(s, pos) + len);
+    n = qed_bytes_to_clusters(s, qed_offset_into_cluster(s, pos) + *len);
     n = qed_count_contiguous_clusters(s, request->l2_table->table,
                                       index, n, &offset);

@@ -XXX,XX +XXX,XX @@ void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
         ret = -EINVAL;
     }

-    len = MIN(len,
-              n * s->header.cluster_size - qed_offset_into_cluster(s, pos));
+    *len = MIN(*len,
+               n * s->header.cluster_size - qed_offset_into_cluster(s, pos));

 out:
-    cb(opaque, ret, offset, len);
+    *img_offset = offset;
     qed_release(s);
+    return ret;
 }
diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@ static int64_t coroutine_fn bdrv_qed_co_get_block_status(BlockDriverState *bs,
         .file = file,
     };
     QEDRequest request = { .l2_table = NULL };
+    uint64_t offset;
+    int ret;

-    qed_find_cluster(s, &request, cb.pos, len, qed_is_allocated_cb, &cb);
+    ret = qed_find_cluster(s, &request, cb.pos, &len, &offset);
+    qed_is_allocated_cb(&cb, ret, offset, len);

-    /* Now sleep if the callback wasn't invoked immediately */
-    while (cb.status == BDRV_BLOCK_OFFSET_MASK) {
-        cb.co = qemu_coroutine_self();
-        qemu_coroutine_yield();
-    }
+    /* The callback was invoked immediately */
+    assert(cb.status != BDRV_BLOCK_OFFSET_MASK);

     qed_unref_l2_cache_entry(request.l2_table);

@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset, size_t len)
 * or -errno
 * @offset: Cluster offset in bytes
 * @len: Length in bytes
- *
- * Callback from qed_find_cluster().
 */
 static void qed_aio_write_data(void *opaque, int ret,
                                uint64_t offset, size_t len)
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_data(void *opaque, int ret,
 * or -errno
 * @offset: Cluster offset in bytes
 * @len: Length in bytes
- *
- * Callback from qed_find_cluster().
 */
 static void qed_aio_read_data(void *opaque, int ret,
                               uint64_t offset, size_t len)
@@ -XXX,XX +XXX,XX @@ static void qed_aio_next_io(QEDAIOCB *acb, int ret)
     BDRVQEDState *s = acb_to_s(acb);
     QEDFindClusterFunc *io_fn = (acb->flags & QED_AIOCB_WRITE) ?
                                 qed_aio_write_data : qed_aio_read_data;
+    uint64_t offset;
+    size_t len;

     trace_qed_aio_next_io(s, acb, ret, acb->cur_pos + acb->cur_qiov.size);

@@ -XXX,XX +XXX,XX @@ static void qed_aio_next_io(QEDAIOCB *acb, int ret)
     }

     /* Find next cluster and start I/O */
-    qed_find_cluster(s, &acb->request,
-                     acb->cur_pos, acb->end_pos - acb->cur_pos,
-                     io_fn, acb);
+    len = acb->end_pos - acb->cur_pos;
+    ret = qed_find_cluster(s, &acb->request, acb->cur_pos, &len, &offset);
+    io_fn(acb, ret, offset, len);
 }

 static BlockAIOCB *qed_aio_setup(BlockDriverState *bs,
diff --git a/block/qed.h b/block/qed.h
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.h
+++ b/block/qed.h
@@ -XXX,XX +XXX,XX @@ int qed_write_l2_table_sync(BDRVQEDState *s, QEDRequest *request,
 /**
 * Cluster functions
 */
-void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
-                      size_t len, QEDFindClusterFunc *cb, void *opaque);
+int qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
+                     size_t *len, uint64_t *img_offset);

 /**
 * Consistency check
 */
--
1.8.3.1
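The qed_find_cluster() change above follows a mechanical recipe that this series applies repeatedly: a (status, offset, len) completion callback becomes a return value plus out parameters, and the length turns into an in/out parameter because the callee may shorten the request. Sketched with hypothetical names:

    /* before: void find(uint64_t pos, size_t len, FindFunc *cb, void *opaque); */
    int find(uint64_t pos, size_t *len, uint64_t *img_offset);

    void caller(uint64_t pos, size_t want)
    {
        uint64_t offset;
        size_t len = want;                   /* in/out: may be shortened */
        int ret = find(pos, &len, &offset);

        /* this replaces the old cb(opaque, ret, offset, len) invocation */
        handle_result(ret, offset, len);
    }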
Deleted patch

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/qed.c | 32 ++++++++++++--------------------
 1 file changed, 12 insertions(+), 20 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@ int qed_write_header_sync(BDRVQEDState *s)
 * This function only updates known header fields in-place and does not affect
 * extra data after the QED header.
 */
-static void qed_write_header(BDRVQEDState *s, BlockCompletionFunc cb,
-                             void *opaque)
+static int qed_write_header(BDRVQEDState *s)
 {
     /* We must write full sectors for O_DIRECT but cannot necessarily generate
      * the data following the header if an unrecognized compat feature is
@@ -XXX,XX +XXX,XX @@ static void qed_write_header(BDRVQEDState *s, BlockCompletionFunc cb,
     ret = 0;
 out:
     qemu_vfree(buf);
-    cb(opaque, ret);
+    return ret;
 }

 static uint64_t qed_max_image_size(uint32_t cluster_size, uint32_t table_size)
@@ -XXX,XX +XXX,XX @@ static void qed_unplug_allocating_write_reqs(BDRVQEDState *s)
     }
 }

-static void qed_finish_clear_need_check(void *opaque, int ret)
-{
-    /* Do nothing */
-}
-
-static void qed_flush_after_clear_need_check(void *opaque, int ret)
-{
-    BDRVQEDState *s = opaque;
-
-    bdrv_aio_flush(s->bs, qed_finish_clear_need_check, s);
-
-    /* No need to wait until flush completes */
-    qed_unplug_allocating_write_reqs(s);
-}
-
 static void qed_clear_need_check(void *opaque, int ret)
 {
     BDRVQEDState *s = opaque;
@@ -XXX,XX +XXX,XX @@ static void qed_clear_need_check(void *opaque, int ret)
     }

     s->header.features &= ~QED_F_NEED_CHECK;
-    qed_write_header(s, qed_flush_after_clear_need_check, s);
+    ret = qed_write_header(s);
+    (void) ret;
+
+    qed_unplug_allocating_write_reqs(s);
+
+    ret = bdrv_flush(s->bs);
+    (void) ret;
 }

 static void qed_need_check_timer_cb(void *opaque)
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
 {
     BDRVQEDState *s = acb_to_s(acb);
     BlockCompletionFunc *cb;
+    int ret;

     /* Cancel timer when the first allocating request comes in */
     if (QSIMPLEQ_EMPTY(&s->allocating_write_reqs)) {
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)

     if (qed_should_set_need_check(s)) {
         s->header.features |= QED_F_NEED_CHECK;
-        qed_write_header(s, cb, acb);
+        ret = qed_write_header(s);
+        cb(acb, ret);
     } else {
         cb(acb, 0);
     }
--
1.8.3.1
Deleted patch

Note that this code is generally not running in coroutine context, so
this is an actual blocking synchronous operation. We'll fix this in a
moment.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/qed.c | 61 +++++++++++++++++++------------------------------------------
 1 file changed, 19 insertions(+), 42 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@ static void qed_aio_start_io(QEDAIOCB *acb)
     qed_aio_next_io(acb, 0);
 }

-static void qed_aio_next_io_cb(void *opaque, int ret)
-{
-    QEDAIOCB *acb = opaque;
-
-    qed_aio_next_io(acb, ret);
-}
-
 static void qed_plug_allocating_write_reqs(BDRVQEDState *s)
 {
     assert(!s->allocating_write_reqs_plugged);
@@ -XXX,XX +XXX,XX @@ err:
     qed_aio_complete(acb, ret);
 }

-static void qed_aio_write_l2_update_cb(void *opaque, int ret)
-{
-    QEDAIOCB *acb = opaque;
-    qed_aio_write_l2_update(acb, ret, acb->cur_cluster);
-}
-
-/**
- * Flush new data clusters before updating the L2 table
- *
- * This flush is necessary when a backing file is in use. A crash during an
- * allocating write could result in empty clusters in the image. If the write
- * only touched a subregion of the cluster, then backing image sectors have
- * been lost in the untouched region. The solution is to flush after writing a
- * new data cluster and before updating the L2 table.
- */
-static void qed_aio_write_flush_before_l2_update(void *opaque, int ret)
-{
-    QEDAIOCB *acb = opaque;
-    BDRVQEDState *s = acb_to_s(acb);
-
-    if (!bdrv_aio_flush(s->bs->file->bs, qed_aio_write_l2_update_cb, opaque)) {
-        qed_aio_complete(acb, -EIO);
-    }
-}
-
 /**
  * Write data to the image file
  */
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_main(void *opaque, int ret)
     BDRVQEDState *s = acb_to_s(acb);
     uint64_t offset = acb->cur_cluster +
                       qed_offset_into_cluster(s, acb->cur_pos);
-    BlockCompletionFunc *next_fn;

     trace_qed_aio_write_main(s, acb, ret, offset, acb->cur_qiov.size);

@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_main(void *opaque, int ret)
         return;
     }

+    BLKDBG_EVENT(s->bs->file, BLKDBG_WRITE_AIO);
+    ret = bdrv_pwritev(s->bs->file, offset, &acb->cur_qiov);
+    if (ret >= 0) {
+        ret = 0;
+    }
+
     if (acb->find_cluster_ret == QED_CLUSTER_FOUND) {
-        next_fn = qed_aio_next_io_cb;
+        qed_aio_next_io(acb, ret);
     } else {
         if (s->bs->backing) {
-            next_fn = qed_aio_write_flush_before_l2_update;
-        } else {
-            next_fn = qed_aio_write_l2_update_cb;
+            /*
+             * Flush new data clusters before updating the L2 table
+             *
+             * This flush is necessary when a backing file is in use. A crash
+             * during an allocating write could result in empty clusters in the
+             * image. If the write only touched a subregion of the cluster,
+             * then backing image sectors have been lost in the untouched
+             * region. The solution is to flush after writing a new data
+             * cluster and before updating the L2 table.
+             */
+            ret = bdrv_flush(s->bs->file->bs);
         }
+        qed_aio_write_l2_update(acb, ret, acb->cur_cluster);
     }
-
-    BLKDBG_EVENT(s->bs->file, BLKDBG_WRITE_AIO);
-    bdrv_aio_writev(s->bs->file, offset / BDRV_SECTOR_SIZE,
-                    &acb->cur_qiov, acb->cur_qiov.size / BDRV_SECTOR_SIZE,
-                    next_fn, acb);
 }

 /**
--
1.8.3.1

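The crash-consistency ordering described in the inlined comment can be summarised in a standalone sketch (write_data/flush_file/update_l2 are illustrative stubs, not QEMU functions): the new data cluster must be durable before the L2 table is allowed to point at it.

    static int write_data(void) { return 0; }   /* write guest data */
    static int flush_file(void) { return 0; }   /* durability barrier */
    static int update_l2(void)  { return 0; }   /* publish the mapping */

    static int allocating_write(int have_backing_file)
    {
        int ret;

        ret = write_data();          /* 1. put the new cluster on disk */
        if (ret < 0) {
            return ret;
        }
        if (have_backing_file) {
            ret = flush_file();      /* 2. make it durable first... */
            if (ret < 0) {
                return ret;
            }
        }
        return update_l2();          /* 3. ...then point the L2 table at it */
    }
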
Deleted patch

qed_commit_l2_update() is unconditionally called at the end of
qed_aio_write_l1_update(). Inline it.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/qed.c | 36 ++++++++++++++----------------------
 1 file changed, 14 insertions(+), 22 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@ static void qed_aio_complete(QEDAIOCB *acb, int ret)
 }

 /**
- * Commit the current L2 table to the cache
+ * Update L1 table with new L2 table offset and write it out
  */
-static void qed_commit_l2_update(void *opaque, int ret)
+static void qed_aio_write_l1_update(void *opaque, int ret)
 {
     QEDAIOCB *acb = opaque;
     BDRVQEDState *s = acb_to_s(acb);
     CachedL2Table *l2_table = acb->request.l2_table;
     uint64_t l2_offset = l2_table->offset;
+    int index;
+
+    if (ret) {
+        qed_aio_complete(acb, ret);
+        return;
+    }

+    index = qed_l1_index(s, acb->cur_pos);
+    s->l1_table->offsets[index] = l2_table->offset;
+
+    ret = qed_write_l1_table(s, index, 1);
+
+    /* Commit the current L2 table to the cache */
     qed_commit_l2_cache_entry(&s->l2_cache, l2_table);

     /* This is guaranteed to succeed because we just committed the entry to the
@@ -XXX,XX +XXX,XX @@ static void qed_commit_l2_update(void *opaque, int ret)
     qed_aio_next_io(acb, ret);
 }

-/**
- * Update L1 table with new L2 table offset and write it out
- */
-static void qed_aio_write_l1_update(void *opaque, int ret)
-{
-    QEDAIOCB *acb = opaque;
-    BDRVQEDState *s = acb_to_s(acb);
-    int index;
-
-    if (ret) {
-        qed_aio_complete(acb, ret);
-        return;
-    }
-
-    index = qed_l1_index(s, acb->cur_pos);
-    s->l1_table->offsets[index] = acb->request.l2_table->offset;
-
-    ret = qed_write_l1_table(s, index, 1);
-    qed_commit_l2_update(acb, ret);
-}

 /**
  * Update L2 table with new cluster offsets and write them out
--
1.8.3.1

Deleted patch

Don't recurse into qed_aio_next_io() and qed_aio_complete() here, but
just return an error code and let the caller handle it.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/qed.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@ static void qed_aio_complete(QEDAIOCB *acb, int ret)
 /**
  * Update L1 table with new L2 table offset and write it out
  */
-static void qed_aio_write_l1_update(void *opaque, int ret)
+static int qed_aio_write_l1_update(QEDAIOCB *acb)
 {
-    QEDAIOCB *acb = opaque;
     BDRVQEDState *s = acb_to_s(acb);
     CachedL2Table *l2_table = acb->request.l2_table;
     uint64_t l2_offset = l2_table->offset;
-    int index;
-
-    if (ret) {
-        qed_aio_complete(acb, ret);
-        return;
-    }
+    int index, ret;

     index = qed_l1_index(s, acb->cur_pos);
     s->l1_table->offsets[index] = l2_table->offset;
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_l1_update(void *opaque, int ret)
     acb->request.l2_table = qed_find_l2_cache_entry(&s->l2_cache, l2_offset);
     assert(acb->request.l2_table != NULL);

-    qed_aio_next_io(acb, ret);
+    return ret;
 }


@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_l2_update(QEDAIOCB *acb, int ret, uint64_t offset)
     if (need_alloc) {
         /* Write out the whole new L2 table */
         ret = qed_write_l2_table(s, &acb->request, 0, s->table_nelems, true);
-        qed_aio_write_l1_update(acb, ret);
+        if (ret) {
+            goto err;
+        }
+        ret = qed_aio_write_l1_update(acb);
+        qed_aio_next_io(acb, ret);
+
     } else {
         /* Write out only the updated part of the L2 table */
         ret = qed_write_l2_table(s, &acb->request, index, acb->cur_nclusters,
--
1.8.3.1

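The refactoring pattern used in this and the following patches, as a minimal sketch (stub names, not the QEMU code): helpers stop completing the request themselves and just return a negative errno, so the request is completed in exactly one place and the call chain cannot recurse.

    static int write_l2_table(void) { return 0; }   /* illustrative stub */
    static int write_l1_table(void) { return 0; }   /* illustrative stub */

    static void complete(int ret)
    {
        /* finish the request exactly once */
    }

    static void update_tables(void)
    {
        int ret;

        ret = write_l2_table();
        if (ret < 0) {
            goto out;                /* no recursion into complete() here */
        }
        ret = write_l1_table();
    out:
        complete(ret);               /* single completion point */
    }
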
Deleted patch

Don't recurse into qed_aio_next_io() and qed_aio_complete() here, but
just return an error code and let the caller handle it.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/qed.c | 43 ++++++++++++++++++++++++++-----------------
 1 file changed, 26 insertions(+), 17 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_l1_update(QEDAIOCB *acb)
 /**
  * Update L2 table with new cluster offsets and write them out
  */
-static void qed_aio_write_l2_update(QEDAIOCB *acb, int ret, uint64_t offset)
+static int qed_aio_write_l2_update(QEDAIOCB *acb, uint64_t offset)
 {
     BDRVQEDState *s = acb_to_s(acb);
     bool need_alloc = acb->find_cluster_ret == QED_CLUSTER_L1;
-    int index;
-
-    if (ret) {
-        goto err;
-    }
+    int index, ret;

     if (need_alloc) {
         qed_unref_l2_cache_entry(acb->request.l2_table);
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_l2_update(QEDAIOCB *acb, int ret, uint64_t offset)
         /* Write out the whole new L2 table */
         ret = qed_write_l2_table(s, &acb->request, 0, s->table_nelems, true);
         if (ret) {
-            goto err;
+            return ret;
         }
-        ret = qed_aio_write_l1_update(acb);
-        qed_aio_next_io(acb, ret);
-
+        return qed_aio_write_l1_update(acb);
     } else {
         /* Write out only the updated part of the L2 table */
         ret = qed_write_l2_table(s, &acb->request, index, acb->cur_nclusters,
                                  false);
+        if (ret) {
+            return ret;
+        }
     }
-    return;
-
-err:
-    qed_aio_complete(acb, ret);
+    return 0;
 }

 /**
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_main(void *opaque, int ret)
          */
         ret = bdrv_flush(s->bs->file->bs);
     }
-    qed_aio_write_l2_update(acb, ret, acb->cur_cluster);
+    if (ret) {
+        goto err;
+    }
+    ret = qed_aio_write_l2_update(acb, acb->cur_cluster);
+    if (ret) {
+        goto err;
+    }
+    qed_aio_next_io(acb, 0);
     }
+    return;
+
+err:
+    qed_aio_complete(acb, ret);
 }

 /**
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_zero_cluster(void *opaque, int ret)
         return;
     }

-    qed_aio_write_l2_update(acb, 0, 1);
+    ret = qed_aio_write_l2_update(acb, 1);
+    if (ret < 0) {
+        qed_aio_complete(acb, ret);
+        return;
+    }
+    qed_aio_next_io(acb, 0);
 }

 /**
--
1.8.3.1

Deleted patch

Don't recurse into qed_aio_next_io() and qed_aio_complete() here, but
just return an error code and let the caller handle it.

While refactoring qed_aio_write_alloc() to accommodate the change,
qed_aio_write_zero_cluster() ended up with a single line, so I chose to
inline that line and remove the function completely.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/qed.c | 58 +++++++++++++++++++-------------------------------------
 1 file changed, 21 insertions(+), 37 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_main(QEDAIOCB *acb)
 /**
  * Populate untouched regions of new data cluster
  */
-static void qed_aio_write_cow(void *opaque, int ret)
+static int qed_aio_write_cow(QEDAIOCB *acb)
 {
-    QEDAIOCB *acb = opaque;
     BDRVQEDState *s = acb_to_s(acb);
     uint64_t start, len, offset;
+    int ret;

     /* Populate front untouched region of new data cluster */
     start = qed_start_of_cluster(s, acb->cur_pos);
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_cow(void *opaque, int ret)

     trace_qed_aio_write_prefill(s, acb, start, len, acb->cur_cluster);
     ret = qed_copy_from_backing_file(s, start, len, acb->cur_cluster);
-    if (ret) {
-        qed_aio_complete(acb, ret);
-        return;
+    if (ret < 0) {
+        return ret;
     }

     /* Populate back untouched region of new data cluster */
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_cow(void *opaque, int ret)

     trace_qed_aio_write_postfill(s, acb, start, len, offset);
     ret = qed_copy_from_backing_file(s, start, len, offset);
-    if (ret) {
-        qed_aio_complete(acb, ret);
-        return;
-    }
-
-    ret = qed_aio_write_main(acb);
     if (ret < 0) {
-        qed_aio_complete(acb, ret);
-        return;
+        return ret;
     }
-    qed_aio_next_io(acb, 0);
+
+    return qed_aio_write_main(acb);
 }

 /**
@@ -XXX,XX +XXX,XX @@ static bool qed_should_set_need_check(BDRVQEDState *s)
     return !(s->header.features & QED_F_NEED_CHECK);
 }

-static void qed_aio_write_zero_cluster(void *opaque, int ret)
-{
-    QEDAIOCB *acb = opaque;
-
-    if (ret) {
-        qed_aio_complete(acb, ret);
-        return;
-    }
-
-    ret = qed_aio_write_l2_update(acb, 1);
-    if (ret < 0) {
-        qed_aio_complete(acb, ret);
-        return;
-    }
-    qed_aio_next_io(acb, 0);
-}
-
 /**
  * Write new data cluster
  *
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_zero_cluster(void *opaque, int ret)
 static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
 {
     BDRVQEDState *s = acb_to_s(acb);
-    BlockCompletionFunc *cb;
     int ret;

     /* Cancel timer when the first allocating request comes in */
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
             qed_aio_start_io(acb);
             return;
         }
-
-        cb = qed_aio_write_zero_cluster;
     } else {
-        cb = qed_aio_write_cow;
         acb->cur_cluster = qed_alloc_clusters(s, acb->cur_nclusters);
     }

     if (qed_should_set_need_check(s)) {
         s->header.features |= QED_F_NEED_CHECK;
         ret = qed_write_header(s);
-        cb(acb, ret);
+        if (ret < 0) {
+            qed_aio_complete(acb, ret);
+            return;
+        }
+    }
+
+    if (acb->flags & QED_AIOCB_ZERO) {
+        ret = qed_aio_write_l2_update(acb, 1);
     } else {
-        cb(acb, 0);
+        ret = qed_aio_write_cow(acb);
     }
+    if (ret < 0) {
+        qed_aio_complete(acb, ret);
+        return;
+    }
+    qed_aio_next_io(acb, 0);
 }

 /**
--
1.8.3.1

From: Wang Liang <wangliangzz@inspur.com>

hmp_commit() calls blk_is_available() from a non-coroutine context (and in
the main loop). blk_is_available() is a co_wrapper_mixed_bdrv_rdlock
function, and in the non-coroutine context it calls AIO_WAIT_WHILE(),
which crashes if the aio_context lock is not taken before.

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1615
Signed-off-by: Wang Liang <wangliangzz@inspur.com>
Message-Id: <20230424103902.45265-1-wangliangzz@126.com>
Reviewed-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/monitor/block-hmp-cmds.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c
index XXXXXXX..XXXXXXX 100644
--- a/block/monitor/block-hmp-cmds.c
+++ b/block/monitor/block-hmp-cmds.c
@@ -XXX,XX +XXX,XX @@ void hmp_commit(Monitor *mon, const QDict *qdict)
         error_report("Device '%s' not found", device);
         return;
     }
-    if (!blk_is_available(blk)) {
-        error_report("Device '%s' has no medium", device);
-        return;
-    }

     bs = bdrv_skip_implicit_filters(blk_bs(blk));
     aio_context = bdrv_get_aio_context(bs);
     aio_context_acquire(aio_context);

+    if (!blk_is_available(blk)) {
+        error_report("Device '%s' has no medium", device);
+        aio_context_release(aio_context);
+        return;
+    }
+
     ret = bdrv_commit(bs);

     aio_context_release(aio_context);
--
2.40.0

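The shape of the hmp_commit fix, reduced to a standalone sketch (lock/unlock and the other names are illustrative stand-ins for aio_context_acquire()/aio_context_release() and the QEMU calls, not the real API): anything that can expand to AIO_WAIT_WHILE() must run with the AioContext lock held, and every early return must drop the lock again.

    #include <stdbool.h>
    #include <stdio.h>

    static void lock(void)           { /* ~ aio_context_acquire() */ }
    static void unlock(void)         { /* ~ aio_context_release() */ }
    static bool medium_present(void) { return true; }  /* may poll: lock */
    static void do_commit(void)      { }

    static void commit_command(void)
    {
        lock();
        if (!medium_present()) {          /* checked under the lock now */
            fprintf(stderr, "no medium\n");
            unlock();                     /* don't leak the lock on error */
            return;
        }
        do_commit();
        unlock();
    }
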
Deleted patch

Don't recurse into qed_aio_next_io() and qed_aio_complete() here, but
just return an error code and let the caller handle it.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/qed.c | 43 ++++++++++++++++++++-----------------------
 1 file changed, 20 insertions(+), 23 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@ static bool qed_should_set_need_check(BDRVQEDState *s)
  *
  * This path is taken when writing to previously unallocated clusters.
  */
-static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
+static int qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
 {
     BDRVQEDState *s = acb_to_s(acb);
     int ret;
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
     }
     if (acb != QSIMPLEQ_FIRST(&s->allocating_write_reqs) ||
         s->allocating_write_reqs_plugged) {
-        return; /* wait for existing request to finish */
+        return -EINPROGRESS; /* wait for existing request to finish */
     }

     acb->cur_nclusters = qed_bytes_to_clusters(s,
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
     if (acb->flags & QED_AIOCB_ZERO) {
         /* Skip ahead if the clusters are already zero */
         if (acb->find_cluster_ret == QED_CLUSTER_ZERO) {
-            qed_aio_start_io(acb);
-            return;
+            return 0;
         }
     } else {
         acb->cur_cluster = qed_alloc_clusters(s, acb->cur_nclusters);
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
         s->header.features |= QED_F_NEED_CHECK;
         ret = qed_write_header(s);
         if (ret < 0) {
-            qed_aio_complete(acb, ret);
-            return;
+            return ret;
         }
     }

@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
         ret = qed_aio_write_cow(acb);
     }
     if (ret < 0) {
-        qed_aio_complete(acb, ret);
-        return;
+        return ret;
     }
-    qed_aio_next_io(acb, 0);
+    return 0;
 }

 /**
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
  *
  * This path is taken when writing to already allocated clusters.
  */
-static void qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset, size_t len)
+static int qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset, size_t len)
 {
-    int ret;
-
     /* Allocate buffer for zero writes */
     if (acb->flags & QED_AIOCB_ZERO) {
         struct iovec *iov = acb->qiov->iov;
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset, size_t len)
         if (!iov->iov_base) {
             iov->iov_base = qemu_try_blockalign(acb->common.bs, iov->iov_len);
             if (iov->iov_base == NULL) {
-                qed_aio_complete(acb, -ENOMEM);
-                return;
+                return -ENOMEM;
             }
             memset(iov->iov_base, 0, iov->iov_len);
         }
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset, size_t len)
     qemu_iovec_concat(&acb->cur_qiov, acb->qiov, acb->qiov_offset, len);

     /* Do the actual write */
-    ret = qed_aio_write_main(acb);
-    if (ret < 0) {
-        qed_aio_complete(acb, ret);
-        return;
-    }
-    qed_aio_next_io(acb, 0);
+    return qed_aio_write_main(acb);
 }

 /**
@@ -XXX,XX +XXX,XX @@ static void qed_aio_write_data(void *opaque, int ret,

     switch (ret) {
     case QED_CLUSTER_FOUND:
-        qed_aio_write_inplace(acb, offset, len);
+        ret = qed_aio_write_inplace(acb, offset, len);
         break;

     case QED_CLUSTER_L2:
     case QED_CLUSTER_L1:
     case QED_CLUSTER_ZERO:
-        qed_aio_write_alloc(acb, len);
+        ret = qed_aio_write_alloc(acb, len);
         break;

     default:
-        qed_aio_complete(acb, ret);
+        assert(ret < 0);
         break;
     }
+
+    if (ret < 0) {
+        if (ret != -EINPROGRESS) {
+            qed_aio_complete(acb, ret);
+        }
+        return;
+    }
+    qed_aio_next_io(acb, 0);
 }

 /**
--
1.8.3.1

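One detail worth calling out: -EINPROGRESS is used here as an in-band sentinel meaning "parked behind another allocating request; someone else will resume us", so the dispatcher must complete the request on every negative value except that one. A compact sketch with stub names (not the QEMU code):

    #include <errno.h>

    static int do_step(void)    { return 0; }  /* may return -EINPROGRESS */
    static void complete(int r) { }
    static void next_io(void)   { }

    static void dispatch(void)
    {
        int ret = do_step();

        if (ret < 0) {
            if (ret != -EINPROGRESS) {
                complete(ret);   /* real failure: finish the request */
            }
            return;              /* parked: the queue owner resumes us */
        }
        next_io();               /* success: continue with the next chunk */
    }
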
Deleted patch

Now that we're running in coroutine context, the ad-hoc serialisation
code (which drops a request that has to wait out of coroutine context)
can be replaced by a CoQueue.

This means that when we resume a serialised request, it is running in
coroutine context again and its I/O isn't blocking any more.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/qed.c | 49 +++++++++++++++++--------------------------------
 block/qed.h |  3 ++-
 2 files changed, 19 insertions(+), 33 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@ static void qed_plug_allocating_write_reqs(BDRVQEDState *s)

 static void qed_unplug_allocating_write_reqs(BDRVQEDState *s)
 {
-    QEDAIOCB *acb;
-
     assert(s->allocating_write_reqs_plugged);

     s->allocating_write_reqs_plugged = false;
-
-    acb = QSIMPLEQ_FIRST(&s->allocating_write_reqs);
-    if (acb) {
-        qed_aio_start_io(acb);
-    }
+    qemu_co_enter_next(&s->allocating_write_reqs);
 }

 static void qed_clear_need_check(void *opaque, int ret)
@@ -XXX,XX +XXX,XX @@ static void qed_need_check_timer_cb(void *opaque)
     BDRVQEDState *s = opaque;

     /* The timer should only fire when allocating writes have drained */
-    assert(!QSIMPLEQ_FIRST(&s->allocating_write_reqs));
+    assert(!s->allocating_acb);

     trace_qed_need_check_timer_cb(s);

@@ -XXX,XX +XXX,XX @@ static int bdrv_qed_do_open(BlockDriverState *bs, QDict *options, int flags,
     int ret;

     s->bs = bs;
-    QSIMPLEQ_INIT(&s->allocating_write_reqs);
+    qemu_co_queue_init(&s->allocating_write_reqs);

     ret = bdrv_pread(bs->file, 0, &le_header, sizeof(le_header));
     if (ret < 0) {
@@ -XXX,XX +XXX,XX @@ static void qed_aio_complete_bh(void *opaque)
     qed_release(s);
 }

-static void qed_resume_alloc_bh(void *opaque)
-{
-    qed_aio_start_io(opaque);
-}
-
 static void qed_aio_complete(QEDAIOCB *acb, int ret)
 {
     BDRVQEDState *s = acb_to_s(acb);
@@ -XXX,XX +XXX,XX @@ static void qed_aio_complete(QEDAIOCB *acb, int ret)
      * next request in the queue. This ensures that we don't cycle through
      * requests multiple times but rather finish one at a time completely.
      */
-    if (acb == QSIMPLEQ_FIRST(&s->allocating_write_reqs)) {
-        QEDAIOCB *next_acb;
-        QSIMPLEQ_REMOVE_HEAD(&s->allocating_write_reqs, next);
-        next_acb = QSIMPLEQ_FIRST(&s->allocating_write_reqs);
-        if (next_acb) {
-            aio_bh_schedule_oneshot(bdrv_get_aio_context(acb->common.bs),
-                                    qed_resume_alloc_bh, next_acb);
+    if (acb == s->allocating_acb) {
+        s->allocating_acb = NULL;
+        if (!qemu_co_queue_empty(&s->allocating_write_reqs)) {
+            qemu_co_enter_next(&s->allocating_write_reqs);
         } else if (s->header.features & QED_F_NEED_CHECK) {
             qed_start_need_check_timer(s);
         }
@@ -XXX,XX +XXX,XX @@ static int qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
     int ret;

     /* Cancel timer when the first allocating request comes in */
-    if (QSIMPLEQ_EMPTY(&s->allocating_write_reqs)) {
+    if (s->allocating_acb == NULL) {
         qed_cancel_need_check_timer(s);
     }

     /* Freeze this request if another allocating write is in progress */
-    if (acb != QSIMPLEQ_FIRST(&s->allocating_write_reqs)) {
-        QSIMPLEQ_INSERT_TAIL(&s->allocating_write_reqs, acb, next);
-    }
-    if (acb != QSIMPLEQ_FIRST(&s->allocating_write_reqs) ||
-        s->allocating_write_reqs_plugged) {
-        return -EINPROGRESS; /* wait for existing request to finish */
+    if (s->allocating_acb != acb || s->allocating_write_reqs_plugged) {
+        if (s->allocating_acb != NULL) {
+            qemu_co_queue_wait(&s->allocating_write_reqs, NULL);
+            assert(s->allocating_acb == NULL);
+        }
+        s->allocating_acb = acb;
+        return -EAGAIN; /* start over with looking up table entries */
     }

     acb->cur_nclusters = qed_bytes_to_clusters(s,
@@ -XXX,XX +XXX,XX @@ static void qed_aio_next_io(QEDAIOCB *acb)
         ret = qed_aio_read_data(acb, ret, offset, len);
     }

-    if (ret < 0) {
-        if (ret != -EINPROGRESS) {
-            qed_aio_complete(acb, ret);
-        }
+    if (ret < 0 && ret != -EAGAIN) {
+        qed_aio_complete(acb, ret);
         return;
     }
 }
diff --git a/block/qed.h b/block/qed.h
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.h
+++ b/block/qed.h
@@ -XXX,XX +XXX,XX @@ typedef struct {
     uint32_t l2_mask;

     /* Allocating write request queue */
-    QSIMPLEQ_HEAD(, QEDAIOCB) allocating_write_reqs;
+    QEDAIOCB *allocating_acb;
+    CoQueue allocating_write_reqs;
     bool allocating_write_reqs_plugged;

     /* Periodic flush and clear need check flag */
--
1.8.3.1

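For readers unfamiliar with CoQueue, the same serialisation idea can be expressed with pthreads standing in for coroutines; a condition variable plays the role of qemu_co_queue_wait()/qemu_co_enter_next(). This is an analogy, not the QEMU primitive (coroutines yield cooperatively instead of blocking a thread):

    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  queue = PTHREAD_COND_INITIALIZER;
    static bool allocating;           /* ~ s->allocating_acb != NULL */

    static void allocating_write(void)
    {
        pthread_mutex_lock(&lock);
        while (allocating) {
            pthread_cond_wait(&queue, &lock);  /* ~ qemu_co_queue_wait() */
        }
        allocating = true;
        pthread_mutex_unlock(&lock);

        /* ... perform the allocating write ... */

        pthread_mutex_lock(&lock);
        allocating = false;
        pthread_cond_signal(&queue);           /* ~ qemu_co_enter_next() */
        pthread_mutex_unlock(&lock);
    }
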
Deleted patch

This fixes the last place where we degraded from AIO to actual blocking
synchronous I/O requests. Putting it into a coroutine means that instead
of blocking, the coroutine simply yields while doing I/O.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/qed.c | 33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@ static void qed_unplug_allocating_write_reqs(BDRVQEDState *s)
     qemu_co_enter_next(&s->allocating_write_reqs);
 }

-static void qed_clear_need_check(void *opaque, int ret)
+static void qed_need_check_timer_entry(void *opaque)
 {
     BDRVQEDState *s = opaque;
+    int ret;

-    if (ret) {
+    /* The timer should only fire when allocating writes have drained */
+    assert(!s->allocating_acb);
+
+    trace_qed_need_check_timer_cb(s);
+
+    qed_acquire(s);
+    qed_plug_allocating_write_reqs(s);
+
+    /* Ensure writes are on disk before clearing flag */
+    ret = bdrv_co_flush(s->bs->file->bs);
+    qed_release(s);
+    if (ret < 0) {
         qed_unplug_allocating_write_reqs(s);
         return;
     }
@@ -XXX,XX +XXX,XX @@ static void qed_clear_need_check(void *opaque, int ret)

     qed_unplug_allocating_write_reqs(s);

-    ret = bdrv_flush(s->bs);
+    ret = bdrv_co_flush(s->bs);
     (void) ret;
 }

 static void qed_need_check_timer_cb(void *opaque)
 {
-    BDRVQEDState *s = opaque;
-
-    /* The timer should only fire when allocating writes have drained */
-    assert(!s->allocating_acb);
-
-    trace_qed_need_check_timer_cb(s);
-
-    qed_acquire(s);
-    qed_plug_allocating_write_reqs(s);
-
-    /* Ensure writes are on disk before clearing flag */
-    bdrv_aio_flush(s->bs->file->bs, qed_clear_need_check, s);
-    qed_release(s);
+    Coroutine *co = qemu_coroutine_create(qed_need_check_timer_entry, opaque);
+    qemu_coroutine_enter(co);
 }

 void qed_acquire(BDRVQEDState *s)
--
1.8.3.1

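The two calls at the end are the standard QEMU pattern for entering coroutine context from a plain callback; in outline (usage shape only, not compilable standalone; the entry function takes a single void * argument):

    /* A timer (or BH, or fd) callback runs outside coroutine context, so
     * it cannot yield. Spawning a coroutine moves the potentially blocking
     * work somewhere that can:
     *
     *     static void coroutine_fn entry(void *opaque)
     *     {
     *         ...bdrv_co_flush() and friends may yield here...
     *     }
     *
     *     static void timer_cb(void *opaque)
     *     {
     *         Coroutine *co = qemu_coroutine_create(entry, opaque);
     *         qemu_coroutine_enter(co);  // runs entry() until first yield
     *     }
     */
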
Deleted patch

All functions that are marked coroutine_fn can directly call the
bdrv_co_* version of functions instead of going through the wrapper.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Manos Pitsidianakis <el13635@mail.ntua.gr>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/qed.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index XXXXXXX..XXXXXXX 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn qed_write_header(BDRVQEDState *s)
     };
     qemu_iovec_init_external(&qiov, &iov, 1);

-    ret = bdrv_preadv(s->bs->file, 0, &qiov);
+    ret = bdrv_co_preadv(s->bs->file, 0, qiov.size, &qiov, 0);
     if (ret < 0) {
         goto out;
     }
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn qed_write_header(BDRVQEDState *s)
     /* Update header */
     qed_header_cpu_to_le(&s->header, (QEDHeader *) buf);

-    ret = bdrv_pwritev(s->bs->file, 0, &qiov);
+    ret = bdrv_co_pwritev(s->bs->file, 0, qiov.size, &qiov, 0);
     if (ret < 0) {
         goto out;
     }
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn qed_read_backing_file(BDRVQEDState *s, uint64_t pos,
     qemu_iovec_concat(*backing_qiov, qiov, 0, size);

     BLKDBG_EVENT(s->bs->file, BLKDBG_READ_BACKING_AIO);
-    ret = bdrv_preadv(s->bs->backing, pos, *backing_qiov);
+    ret = bdrv_co_preadv(s->bs->backing, pos, size, *backing_qiov, 0);
     if (ret < 0) {
         return ret;
     }
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn qed_copy_from_backing_file(BDRVQEDState *s,
     }

     BLKDBG_EVENT(s->bs->file, BLKDBG_COW_WRITE);
-    ret = bdrv_pwritev(s->bs->file, offset, &qiov);
+    ret = bdrv_co_pwritev(s->bs->file, offset, qiov.size, &qiov, 0);
     if (ret < 0) {
         goto out;
     }
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn qed_aio_write_main(QEDAIOCB *acb)
     trace_qed_aio_write_main(s, acb, 0, offset, acb->cur_qiov.size);

     BLKDBG_EVENT(s->bs->file, BLKDBG_WRITE_AIO);
-    ret = bdrv_pwritev(s->bs->file, offset, &acb->cur_qiov);
+    ret = bdrv_co_pwritev(s->bs->file, offset, acb->cur_qiov.size,
+                          &acb->cur_qiov, 0);
     if (ret < 0) {
         return ret;
     }
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn qed_aio_write_main(QEDAIOCB *acb)
      * region. The solution is to flush after writing a new data
      * cluster and before updating the L2 table.
      */
-    ret = bdrv_flush(s->bs->file->bs);
+    ret = bdrv_co_flush(s->bs->file->bs);
     if (ret < 0) {
         return ret;
     }
@@ -XXX,XX +XXX,XX @@ static int coroutine_fn qed_aio_read_data(void *opaque, int ret,
     }

     BLKDBG_EVENT(bs->file, BLKDBG_READ_AIO);
-    ret = bdrv_preadv(bs->file, offset, &acb->cur_qiov);
+    ret = bdrv_co_preadv(bs->file, offset, acb->cur_qiov.size,
+                         &acb->cur_qiov, 0);
     if (ret < 0) {
         return ret;
     }
--
1.8.3.1

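Note the signature difference the conversion has to account for, visible in every hunk above: the wrappers took (child, offset, qiov) and derived the size from the qiov, while the coroutine versions take an explicit byte count plus a flags argument. In outline:

    ret = bdrv_preadv(child, offset, &qiov);                  /* wrapper */
    ret = bdrv_co_preadv(child, offset, qiov.size, &qiov, 0); /* direct  */

The trailing 0 is the flags argument (BdrvRequestFlags); the patch passes no flags anywhere.
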
Deleted patch

From: "sochin.jiang" <sochin.jiang@huawei.com>

img_commit could fall into an infinite loop calling run_block_job() if
its blockjob fails on an I/O error; fix this already-known problem.

Signed-off-by: sochin.jiang <sochin.jiang@huawei.com>
Message-id: 1497509253-28941-1-git-send-email-sochin.jiang@huawei.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 blockjob.c               |  4 ++--
 include/block/blockjob.h | 18 ++++++++++++++++++
 qemu-img.c               | 20 +++++++++++++-------
 3 files changed, 33 insertions(+), 9 deletions(-)

diff --git a/blockjob.c b/blockjob.c
index XXXXXXX..XXXXXXX 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -XXX,XX +XXX,XX @@ static void block_job_resume(BlockJob *job)
     block_job_enter(job);
 }

-static void block_job_ref(BlockJob *job)
+void block_job_ref(BlockJob *job)
 {
     ++job->refcnt;
 }
@@ -XXX,XX +XXX,XX @@ static void block_job_attached_aio_context(AioContext *new_context,
                                            void *opaque);
 static void block_job_detach_aio_context(void *opaque);

-static void block_job_unref(BlockJob *job)
+void block_job_unref(BlockJob *job)
 {
     if (--job->refcnt == 0) {
         BlockDriverState *bs = blk_bs(job->blk);
diff --git a/include/block/blockjob.h b/include/block/blockjob.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/blockjob.h
+++ b/include/block/blockjob.h
@@ -XXX,XX +XXX,XX @@ void block_job_iostatus_reset(BlockJob *job);
 BlockJobTxn *block_job_txn_new(void);

 /**
+ * block_job_ref:
+ *
+ * Add a reference to BlockJob refcnt, it will be decreased with
+ * block_job_unref, and then be freed if it comes to be the last
+ * reference.
+ */
+void block_job_ref(BlockJob *job);
+
+/**
+ * block_job_unref:
+ *
+ * Release a reference that was previously acquired with block_job_ref
+ * or block_job_create. If it's the last reference to the object, it will be
+ * freed.
+ */
+void block_job_unref(BlockJob *job);
+
+/**
  * block_job_txn_unref:
  *
  * Release a reference that was previously acquired with block_job_txn_add_job
diff --git a/qemu-img.c b/qemu-img.c
index XXXXXXX..XXXXXXX 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -XXX,XX +XXX,XX @@ static void common_block_job_cb(void *opaque, int ret)
 static void run_block_job(BlockJob *job, Error **errp)
 {
     AioContext *aio_context = blk_get_aio_context(job->blk);
+    int ret = 0;

-    /* FIXME In error cases, the job simply goes away and we access a dangling
-     * pointer below. */
     aio_context_acquire(aio_context);
+    block_job_ref(job);
     do {
         aio_poll(aio_context, true);
         qemu_progress_print(job->len ?
                             ((float)job->offset / job->len * 100.f) : 0.0f, 0);
-    } while (!job->ready);
+    } while (!job->ready && !job->completed);

-    block_job_complete_sync(job, errp);
+    if (!job->completed) {
+        ret = block_job_complete_sync(job, errp);
+    } else {
+        ret = job->ret;
+    }
+    block_job_unref(job);
     aio_context_release(aio_context);

-    /* A block job may finish instantaneously without publishing any progress,
-     * so just signal completion here */
-    qemu_progress_print(100.f, 0);
+    /* publish completion progress only when success */
+    if (!ret) {
+        qemu_progress_print(100.f, 0);
+    }
 }

 static int img_commit(int argc, char **argv)
--
1.8.3.1

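The general pattern behind the qemu-img fix, as a standalone sketch (stub types and functions, not the QEMU job API): take a reference before a polling loop that can observe the object's completion, so a failing job cannot be freed while the loop still dereferences it.

    #include <stdbool.h>
    #include <stdlib.h>

    typedef struct Job {
        int refcnt;
        bool completed;
        int ret;
    } Job;

    static void poll_once(Job *j) { /* drives the job; may complete it */ }

    static void job_ref(Job *j)   { ++j->refcnt; }
    static void job_unref(Job *j) { if (--j->refcnt == 0) { free(j); } }

    static int run_job(Job *j)
    {
        int ret;

        job_ref(j);                  /* keep j alive across the loop */
        while (!j->completed) {
            poll_once(j);            /* may drop the job's own reference */
        }
        ret = j->ret;                /* safe: our reference is still held */
        job_unref(j);
        return ret;
    }
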
Deleted patch

From: Max Reitz <mreitz@redhat.com>

uri_parse(...)->scheme may be NULL. In fact, probably every field may be
NULL, and the callers do test this for all of the other fields but not
for scheme (except for block/gluster.c; block/vxhs.c does not access
that field at all).

We can easily fix this by using g_strcmp0() instead of strcmp().

Cc: qemu-stable@nongnu.org
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20170613205726.13544-1-mreitz@redhat.com
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/nbd.c      | 6 +++---
 block/nfs.c      | 2 +-
 block/sheepdog.c | 6 +++---
 block/ssh.c      | 2 +-
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index XXXXXXX..XXXXXXX 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -XXX,XX +XXX,XX @@ static int nbd_parse_uri(const char *filename, QDict *options)
     }

     /* transport */
-    if (!strcmp(uri->scheme, "nbd")) {
+    if (!g_strcmp0(uri->scheme, "nbd")) {
         is_unix = false;
-    } else if (!strcmp(uri->scheme, "nbd+tcp")) {
+    } else if (!g_strcmp0(uri->scheme, "nbd+tcp")) {
         is_unix = false;
-    } else if (!strcmp(uri->scheme, "nbd+unix")) {
+    } else if (!g_strcmp0(uri->scheme, "nbd+unix")) {
         is_unix = true;
     } else {
         ret = -EINVAL;
diff --git a/block/nfs.c b/block/nfs.c
index XXXXXXX..XXXXXXX 100644
--- a/block/nfs.c
+++ b/block/nfs.c
@@ -XXX,XX +XXX,XX @@ static int nfs_parse_uri(const char *filename, QDict *options, Error **errp)
         error_setg(errp, "Invalid URI specified");
         goto out;
     }
-    if (strcmp(uri->scheme, "nfs") != 0) {
+    if (g_strcmp0(uri->scheme, "nfs") != 0) {
         error_setg(errp, "URI scheme must be 'nfs'");
         goto out;
     }
diff --git a/block/sheepdog.c b/block/sheepdog.c
index XXXXXXX..XXXXXXX 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -XXX,XX +XXX,XX @@ static void sd_parse_uri(SheepdogConfig *cfg, const char *filename,
     }

     /* transport */
-    if (!strcmp(uri->scheme, "sheepdog")) {
+    if (!g_strcmp0(uri->scheme, "sheepdog")) {
         is_unix = false;
-    } else if (!strcmp(uri->scheme, "sheepdog+tcp")) {
+    } else if (!g_strcmp0(uri->scheme, "sheepdog+tcp")) {
         is_unix = false;
-    } else if (!strcmp(uri->scheme, "sheepdog+unix")) {
+    } else if (!g_strcmp0(uri->scheme, "sheepdog+unix")) {
         is_unix = true;
     } else {
         error_setg(&err, "URI scheme must be 'sheepdog', 'sheepdog+tcp',"
diff --git a/block/ssh.c b/block/ssh.c
index XXXXXXX..XXXXXXX 100644
--- a/block/ssh.c
+++ b/block/ssh.c
@@ -XXX,XX +XXX,XX @@ static int parse_uri(const char *filename, QDict *options, Error **errp)
         return -EINVAL;
     }

-    if (strcmp(uri->scheme, "ssh") != 0) {
+    if (g_strcmp0(uri->scheme, "ssh") != 0) {
         error_setg(errp, "URI scheme must be 'ssh'");
         goto err;
     }
--
1.8.3.1

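A minimal GLib example of the difference (compile with pkg-config --cflags --libs glib-2.0): strcmp() on a NULL pointer is undefined behaviour, while g_strcmp0() defines NULL as sorting before any non-NULL string, which is exactly what a URI with no scheme needs.

    #include <glib.h>

    int main(void)
    {
        const char *scheme = NULL;  /* what a scheme-less URI can yield */

        /* strcmp(scheme, "nfs") would dereference NULL here. */
        g_assert(g_strcmp0(scheme, "nfs") != 0);  /* NULL != "nfs", safe */
        g_assert(g_strcmp0("nfs", "nfs") == 0);
        g_assert(g_strcmp0(NULL, NULL) == 0);
        return 0;
    }
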