Series comparison

-[PULL 0/9] Block patches
+[Qemu-devel] [PULL v2 00/12] Block patches
-The following changes since commit 67f17e23baca5dd545fe98b01169cc351a70fe35:
+The following changes since commit 4c55b1d0bad8a703f0499fe62e3761a0cd288da3:
-  Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging (2020-03-06 17:15:36 +0000)
+  Merge remote-tracking branch 'remotes/armbru/tags/pull-error-2017-04-24' into staging (2017-04-24 14:49:48 +0100)
-are available in the Git repository at:
+are available in the git repository at:
-  https://github.com/stefanha/qemu.git tags/block-pull-request
+  git://github.com/codyprime/qemu-kvm-jtc.git tags/block-pull-request
-for you to fetch changes up to d37d0e365afb6825a90d8356fc6adcc1f58f40f3:
+for you to fetch changes up to ecfa185400ade2abc9915efa924cbad1e15a21a4:
-  aio-posix: remove idle poll handlers to improve scalability (2020-03-09 16:45:16 +0000)
+  qemu-iotests: _cleanup_qemu must be called on exit (2017-04-24 15:09:33 -0400)
 ----------------------------------------------------------------
-Pull request
+Pull v2, with 32-bit errors fixed.  I don't have OS X to test compile on,
+but I think it is safe to assume the cause of the compile error was the same.
 ----------------------------------------------------------------
-Stefan Hajnoczi (9):
+Ashish Mittal (2):
-  qemu/queue.h: clear linked list pointers on remove
+  block/vxhs.c: Add support for a new block device type called "vxhs"
-  aio-posix: remove confusing QLIST_SAFE_REMOVE()
+  block/vxhs.c: Add qemu-iotests for new block device type "vxhs"
   aio-posix: completely stop polling when disabled
   aio-posix: move RCU_READ_LOCK() into run_poll_handlers()
   aio-posix: extract ppoll(2) and epoll(7) fd monitoring
   aio-posix: simplify FDMonOps->update() prototype
   aio-posix: add io_uring fd monitoring implementation
   aio-posix: support userspace polling of fd monitoring
   aio-posix: remove idle poll handlers to improve scalability
- MAINTAINERS           |   2 +
+Jeff Cody (10):
- configure             |   5 +
+  qemu-iotests: exclude vxhs from image creation via protocol
- include/block/aio.h   |  71 ++++++-
+  block: add bdrv_set_read_only() helper function
- include/qemu/queue.h  |  19 +-
+  block: do not set BDS read_only if copy_on_read enabled
- util/Makefile.objs    |   3 +
+  block: honor BDRV_O_ALLOW_RDWR when clearing bs->read_only
- util/aio-posix.c      | 451 ++++++++++++++----------------------------
+  block: code movement
- util/aio-posix.h      |  81 ++++++++
+  block: introduce bdrv_can_set_read_only()
- util/fdmon-epoll.c    | 155 +++++++++++++++
+  block: use bdrv_can_set_read_only() during reopen
- util/fdmon-io_uring.c | 332 +++++++++++++++++++++++++++++++
+  block/rbd - update variable names to more apt names
- util/fdmon-poll.c     | 107 ++++++++++
+  block/rbd: Add support for reopen()
- util/trace-events     |   2 +
+  qemu-iotests: _cleanup_qemu must be called on exit
-files changed, 915 insertions(+), 313 deletions(-)
- create mode 100644 util/aio-posix.h
+ block.c                          |  56 +++-
- create mode 100644 util/fdmon-epoll.c
+ block/Makefile.objs              |   2 +
- create mode 100644 util/fdmon-io_uring.c
+ block/bochs.c                    |   5 +-
- create mode 100644 util/fdmon-poll.c
+ block/cloop.c                    |   5 +-
  block/dmg.c                      |   6 +-
  block/rbd.c                      |  65 +++--
  block/trace-events               |  17 ++
  block/vvfat.c                    |  19 +-
  block/vxhs.c                     | 575 +++++++++++++++++++++++++++++++++++++++
  configure                        |  39 +++
  include/block/block.h            |   2 +
  qapi/block-core.json             |  23 +-
  tests/qemu-iotests/017           |   1 +
  tests/qemu-iotests/020           |   1 +
  tests/qemu-iotests/028           |   1 +
  tests/qemu-iotests/029           |   1 +
  tests/qemu-iotests/073           |   1 +
  tests/qemu-iotests/094           |  11 +-
  tests/qemu-iotests/102           |   5 +-
  tests/qemu-iotests/109           |   1 +
  tests/qemu-iotests/114           |   1 +
  tests/qemu-iotests/117           |   1 +
  tests/qemu-iotests/130           |   2 +
  tests/qemu-iotests/134           |   1 +
  tests/qemu-iotests/140           |   1 +
  tests/qemu-iotests/141           |   1 +
  tests/qemu-iotests/143           |   1 +
  tests/qemu-iotests/156           |   2 +
  tests/qemu-iotests/158           |   1 +
  tests/qemu-iotests/common        |   6 +
  tests/qemu-iotests/common.config |  13 +
  tests/qemu-iotests/common.filter |   1 +
  tests/qemu-iotests/common.rc     |  19 ++
 files changed, 844 insertions(+), 42 deletions(-)
  create mode 100644 block/vxhs.c
 --
-.24.1
+.9.3

-[PULL 5/9] aio-posix: extract ppoll(2) and epoll(7) fd monitoring
+[Qemu-devel] [PULL v2 01/12] block/vxhs.c: Add support for a new block device type called "vxhs"
-The ppoll(2) and epoll(7) file descriptor monitoring implementations are
+From: Ashish Mittal <ashmit602@gmail.com>
 mixed with the core util/aio-posix.c code.  Before adding another
 implementation for Linux io_uring, extract out the existing
 ones so there is a clear interface and the core code is simpler.
-The new interface is AioContext->fdmon_ops, a pointer to a FDMonOps
+Source code for the qnio library that this code loads can be downloaded from:
-struct.  See the patch for details.
+https://github.com/VeritasHyperScale/libqnio.git
-Semantic changes:
+Sample command line using JSON syntax:
-. ppoll(2) now reflects events from pollfds[] back into AioHandlers
+./x86_64-softmmu/qemu-system-x86_64 -name instance-00000008 -S -vnc 0.0.0.0:0
-   while we're still on the clock for adaptive polling.  This was
+-k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5
-   already happening for epoll(7), so if it's really an issue then we'll
+-msg timestamp=on
-   need to fix both in the future.
+'json:{"driver":"vxhs","vdisk-id":"c3e9095a-a5ee-4dce-afeb-2a59fb387410",
-. epoll(7)'s fallback to ppoll(2) while external events are disabled
+"server":{"host":"172.172.17.4","port":"9999"}}'
    was broken when the number of fds exceeded the epoll(7) upgrade
    threshold.  I guess this code path simply wasn't tested and no one
    noticed the bug.  I didn't go out of my way to fix it but the correct
    code is simpler than preserving the bug.
-I also took some liberties in removing the unnecessary
+Sample command line using URI syntax:
-AioContext->epoll_available (just check AioContext->epollfd != -1
+qemu-img convert -f raw -O raw -n
-instead) and AioContext->epoll_enabled (it's implicit if our
+/var/lib/nova/instances/_base/0c5eacd5ebea5ed914b6a3e7b18f1ce734c386ad
-AioContext->fdmon_ops callbacks are being invoked) fields.
+vxhs://192.168.0.1:9999/c6718f6b-0401-441d-a8c3-1f0064d75ee0
-Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
+Sample command line using TLS credentials (run in secure mode):
-Link: https://lore.kernel.org/r/20200305170806.1313245-4-stefanha@redhat.com
+./qemu-io --object
-Message-Id: <20200305170806.1313245-4-stefanha@redhat.com>
+tls-creds-x509,id=tls0,dir=/etc/pki/qemu/vxhs,endpoint=client -c 'read
 -v 66000 2.5k' 'json:{"server.host": "127.0.0.1", "server.port": "9999",
 "vdisk-id": "/test.raw", "driver": "vxhs", "tls-creds":"tls0"}'
 [Jeff: Modified trace-events with the correct string formatting]
 Signed-off-by: Ashish Mittal <Ashish.Mittal@veritas.com>
 Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
 Reviewed-by: Jeff Cody <jcody@redhat.com>
 Signed-off-by: Jeff Cody <jcody@redhat.com>
 Message-id: 1491277689-24949-2-git-send-email-Ashish.Mittal@veritas.com
 ---
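Note (illustration only, not part of either series): the URI form in the
commit message above, vxhs://host:port/vdisk-id, carries the same three
values as the JSON form ("server.host", "server.port", "vdisk-id").  The
sketch below shows that mapping in plain C.  It is not the driver code,
which uses QEMU's uri_parse() and a QDict; struct vxhs_opts and
vxhs_split_uri() are hypothetical names invented for this example.  The
default port 9999 comes from the runtime_tcp_opts help text, and the leading
'/' of the path is kept, mirroring the "/test.raw" vdisk-id in the TLS
sample.  The sketch is stricter than the patch in that it also rejects a
path consisting of "/" alone.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the three option keys named in the patch. */
struct vxhs_opts {
    char host[256];      /* "server.host" */
    int  port;           /* "server.port" */
    char vdisk_id[256];  /* "vdisk-id"    */
};

/*
 * Split "vxhs://host[:port]/path" into its parts.
 * Returns 0 on success, -1 if the scheme, host, or path is missing.
 * IPv6 literals and percent-escapes are deliberately not handled.
 */
static int vxhs_split_uri(const char *uri, struct vxhs_opts *o)
{
    const char *prefix = "vxhs://";
    const char *rest, *slash, *colon;
    size_t host_len;

    assert(uri && o);

    if (strncmp(uri, prefix, strlen(prefix)) != 0) {
        return -1;
    }
    rest = uri + strlen(prefix);
    slash = strchr(rest, '/');
    if (!slash || slash == rest || slash[1] == '\0') {
        return -1;                      /* no path, or no host */
    }

    /* Only look for ':' inside the authority part, before the path. */
    colon = memchr(rest, ':', (size_t)(slash - rest));
    host_len = (size_t)((colon ? colon : slash) - rest);
    if (host_len == 0 || host_len >= sizeof(o->host) ||
        strlen(slash) >= sizeof(o->vdisk_id)) {
        return -1;
    }
    memcpy(o->host, rest, host_len);
    o->host[host_len] = '\0';

    o->port = 9999;                     /* default when no port is given */
    if (colon && sscanf(colon + 1, "%d", &o->port) != 1) {
        return -1;
    }

    strcpy(o->vdisk_id, slash);         /* length checked above */
    return 0;
}
```

Calling vxhs_split_uri() on the qemu-img sample URI from the commit message
fills in host "192.168.0.1", port 9999, and the UUID path as the vdisk-id.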
- MAINTAINERS         |   2 +
+ block/Makefile.objs  |   2 +
- include/block/aio.h |  36 +++++-
+ block/trace-events   |  17 ++
- util/Makefile.objs  |   2 +
+ block/vxhs.c         | 575 +++++++++++++++++++++++++++++++++++++++++++++++++++
- util/aio-posix.c    | 286 ++------------------------------------------
+ configure            |  39 ++++
- util/aio-posix.h    |  61 ++++++++++
+ qapi/block-core.json |  23 ++-
- util/fdmon-epoll.c  | 151 +++++++++++++++++++++++
+files changed, 654 insertions(+), 2 deletions(-)
- util/fdmon-poll.c   | 104 ++++++++++++++++
+ create mode 100644 block/vxhs.c
 files changed, 366 insertions(+), 276 deletions(-)
  create mode 100644 util/aio-posix.h
  create mode 100644 util/fdmon-epoll.c
  create mode 100644 util/fdmon-poll.c
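Note (illustration only, not part of either series): the old-side patch
describes an ops table whose update() callback distinguishes three cases
(events == 0 means remove, !is_new means modify, is_new means add).  A
minimal sketch of that dispatch pattern follows, assuming toy stand-in types.
ToyCtx, ToyHandler, ToyFDMonOps, and the toy_* functions are hypothetical and
are not the QEMU AioContext, AioHandler, or FDMonOps definitions; the backend
here only records which case it saw instead of calling ppoll(2) or epoll(7).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for AioContext / AioHandler; not the QEMU types. */
typedef struct ToyCtx ToyCtx;
typedef struct { int fd; int events; } ToyHandler;

typedef struct {
    void (*update)(ToyCtx *ctx, ToyHandler *node, bool is_new);
} ToyFDMonOps;

enum toy_action { TOY_NONE, TOY_REMOVE, TOY_MODIFY, TOY_ADD };

struct ToyCtx {
    const ToyFDMonOps *ops;   /* plays the role of ctx->fdmon_ops */
    enum toy_action last;     /* what the backend was last asked to do */
};

/* A backend that only records which of the three cases it saw. */
static void toy_update(ToyCtx *ctx, ToyHandler *node, bool is_new)
{
    if (node->events == 0) {
        ctx->last = TOY_REMOVE;     /* case 1: no events left to monitor */
    } else if (!is_new) {
        ctx->last = TOY_MODIFY;     /* case 2: fd is already monitored */
    } else {
        ctx->last = TOY_ADD;        /* case 3: fd is new */
    }
}

static const ToyFDMonOps toy_ops = { .update = toy_update };

/* Caller side: always go through the ops table, never a fixed backend. */
static void toy_set_events(ToyCtx *ctx, ToyHandler *node,
                           int events, bool is_new)
{
    assert(ctx->ops && ctx->ops->update);
    node->events = events;
    ctx->ops->update(ctx, node, is_new);
}
```

Swapping the backend then means pointing ctx->ops at a different table,
which is the role ctx->fdmon_ops plays when the patch switches between
fdmon_poll_ops and the epoll implementation.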
-diff --git a/MAINTAINERS b/MAINTAINERS
+diff --git a/block/Makefile.objs b/block/Makefile.objs
 index XXXXXXX..XXXXXXX 100644
---- a/MAINTAINERS
+--- a/block/Makefile.objs
-+++ b/MAINTAINERS
++++ b/block/Makefile.objs
-@@ -XXX,XX +XXX,XX @@ L: qemu-block@nongnu.org
+@@ -XXX,XX +XXX,XX @@ block-obj-$(CONFIG_LIBNFS) += nfs.o
- S: Supported
+ block-obj-$(CONFIG_CURL) += curl.o
- F: util/async.c
+ block-obj-$(CONFIG_RBD) += rbd.o
- F: util/aio-*.c
+ block-obj-$(CONFIG_GLUSTERFS) += gluster.o
-+F: util/aio-*.h
++block-obj-$(CONFIG_VXHS) += vxhs.o
-+F: util/fdmon-*.c
+ block-obj-$(CONFIG_LIBSSH2) += ssh.o
- F: block/io.c
+ block-obj-y += accounting.o dirty-bitmap.o
- F: migration/block*
+ block-obj-y += write-threshold.o
- F: include/block/aio.h
+@@ -XXX,XX +XXX,XX @@ rbd.o-cflags       := $(RBD_CFLAGS)
-diff --git a/include/block/aio.h b/include/block/aio.h
+ rbd.o-libs         := $(RBD_LIBS)
  gluster.o-cflags   := $(GLUSTERFS_CFLAGS)
  gluster.o-libs     := $(GLUSTERFS_LIBS)
 +vxhs.o-libs        := $(VXHS_LIBS)
  ssh.o-cflags       := $(LIBSSH2_CFLAGS)
  ssh.o-libs         := $(LIBSSH2_LIBS)
  block-obj-$(if $(CONFIG_BZIP2),m,n) += dmg-bz2.o
 diff --git a/block/trace-events b/block/trace-events
 index XXXXXXX..XXXXXXX 100644
---- a/include/block/aio.h
+--- a/block/trace-events
-+++ b/include/block/aio.h
++++ b/block/trace-events
-@@ -XXX,XX +XXX,XX @@ struct ThreadPool;
+@@ -XXX,XX +XXX,XX @@ qed_aio_write_data(void *s, void *acb, int ret, uint64_t offset, size_t len) "s
- struct LinuxAioState;
+ qed_aio_write_prefill(void *s, void *acb, uint64_t start, size_t len, uint64_t offset) "s %p acb %p start %"PRIu64" len %zu offset %"PRIu64
- struct LuringState;
+ qed_aio_write_postfill(void *s, void *acb, uint64_t start, size_t len, uint64_t offset) "s %p acb %p start %"PRIu64" len %zu offset %"PRIu64
+ qed_aio_write_main(void *s, void *acb, int ret, uint64_t offset, size_t len) "s %p acb %p ret %d offset %"PRIu64" len %zu"
-+/* Callbacks for file descriptor monitoring implementations */
++
-+typedef struct {
++# block/vxhs.c
-+    /*
++vxhs_iio_callback(int error) "ctx is NULL: error %d"
-+     * update:
++vxhs_iio_callback_chnfail(int err, int error) "QNIO channel failed, no i/o %d, %d"
-+     * @ctx: the AioContext
++vxhs_iio_callback_unknwn(int opcode, int err) "unexpected opcode %d, errno %d"
-+     * @node: the handler
++vxhs_aio_rw_invalid(int req) "Invalid I/O request iodir %d"
-+     * @is_new: is the file descriptor already being monitored?
++vxhs_aio_rw_ioerr(char *guid, int iodir, uint64_t size, uint64_t off, void *acb, int ret, int err) "IO ERROR (vDisk %s) FOR : Read/Write = %d size = %"PRIu64" offset = %"PRIu64" ACB = %p. Error = %d, errno = %d"
-+     *
++vxhs_get_vdisk_stat_err(char *guid, int ret, int err) "vDisk (%s) stat ioctl failed, ret = %d, errno = %d"
-+     * Add/remove/modify a monitored file descriptor.  There are three cases:
++vxhs_get_vdisk_stat(char *vdisk_guid, uint64_t vdisk_size) "vDisk %s stat ioctl returned size %"PRIu64
-+     * 1. node->pfd.events == 0 means remove the file descriptor.
++vxhs_complete_aio(void *acb, uint64_t ret) "aio failed acb %p ret %"PRIu64
-+     * 2. !is_new means modify an already monitored file descriptor.
++vxhs_parse_uri_filename(const char *filename) "URI passed via bdrv_parse_filename %s"
-+     * 3. is_new means add a new file descriptor.
++vxhs_open_vdiskid(const char *vdisk_id) "Opening vdisk-id %s"
-+     *
++vxhs_open_hostinfo(char *of_vsa_addr, int port) "Adding host %s:%d to BDRVVXHSState"
-+     * Called with ctx->list_lock acquired.
++vxhs_open_iio_open(const char *host) "Failed to connect to storage agent on host %s"
-+     */
++vxhs_parse_uri_hostinfo(char *host, int port) "Host: IP %s, Port %d"
-+    void (*update)(AioContext *ctx, AioHandler *node, bool is_new);
++vxhs_close(char *vdisk_guid) "Closing vdisk %s"
-+
++vxhs_get_creds(const char *cacert, const char *client_key, const char *client_cert) "cacert %s, client_key %s, client_cert %s"
-+    /*
+diff --git a/block/vxhs.c b/block/vxhs.c
 +     * wait:
 +     * @ctx: the AioContext
 +     * @ready_list: list for handlers that become ready
 +     * @timeout: maximum duration to wait, in nanoseconds
 +     *
 +     * Wait for file descriptors to become ready and place them on ready_list.
 +     *
 +     * Called with ctx->list_lock incremented but not locked.
 +     *
 +     * Returns: number of ready file descriptors.
 +     */
 +    int (*wait)(AioContext *ctx, AioHandlerList *ready_list, int64_t timeout);
 +} FDMonOps;
 +
  /*
   * Each aio_bh_poll() call carves off a slice of the BH list, so that newly
   * scheduled BHs are not processed until the next aio_bh_poll() call.  All
@@ -XXX,XX +XXX,XX @@ struct AioContext {
      /* epoll(7) state used when built with CONFIG_EPOLL */
      int epollfd;
 -    bool epoll_enabled;
 -    bool epoll_available;
 +
 +    const FDMonOps *fdmon_ops;
  };
  /**
 diff --git a/util/Makefile.objs b/util/Makefile.objs
 index XXXXXXX..XXXXXXX 100644
 --- a/util/Makefile.objs
 +++ b/util/Makefile.objs
@@ -XXX,XX +XXX,XX @@ util-obj-y += aiocb.o async.o aio-wait.o thread-pool.o qemu-timer.o
  util-obj-y += main-loop.o
  util-obj-$(call lnot,$(CONFIG_ATOMIC64)) += atomic64.o
  util-obj-$(CONFIG_POSIX) += aio-posix.o
 +util-obj-$(CONFIG_POSIX) += fdmon-poll.o
 +util-obj-$(CONFIG_EPOLL_CREATE1) += fdmon-epoll.o
  util-obj-$(CONFIG_POSIX) += compatfd.o
  util-obj-$(CONFIG_POSIX) += event_notifier-posix.o
  util-obj-$(CONFIG_POSIX) += mmap-alloc.o
 diff --git a/util/aio-posix.c b/util/aio-posix.c
 index XXXXXXX..XXXXXXX 100644
 --- a/util/aio-posix.c
 +++ b/util/aio-posix.c
@@ -XXX,XX +XXX,XX @@
  #include "qemu/sockets.h"
  #include "qemu/cutils.h"
  #include "trace.h"
 -#ifdef CONFIG_EPOLL_CREATE1
 -#include <sys/epoll.h>
 -#endif
 +#include "aio-posix.h"
 -struct AioHandler
 -{
 -    GPollFD pfd;
 -    IOHandler *io_read;
 -    IOHandler *io_write;
 -    AioPollFn *io_poll;
 -    IOHandler *io_poll_begin;
 -    IOHandler *io_poll_end;
 -    void *opaque;
 -    bool is_external;
 -    QLIST_ENTRY(AioHandler) node;
 -    QLIST_ENTRY(AioHandler) node_ready; /* only used during aio_poll() */
 -    QLIST_ENTRY(AioHandler) node_deleted;
 -};
 -
 -/* Add a handler to a ready list */
 -static void add_ready_handler(AioHandlerList *ready_list,
 -                              AioHandler *node,
 -                              int revents)
 +void aio_add_ready_handler(AioHandlerList *ready_list,
 +                           AioHandler *node,
 +                           int revents)
  {
      QLIST_SAFE_REMOVE(node, node_ready); /* remove from nested parent's list */
      node->pfd.revents = revents;
      QLIST_INSERT_HEAD(ready_list, node, node_ready);
  }
 -#ifdef CONFIG_EPOLL_CREATE1
 -
 -/* The fd number threshold to switch to epoll */
 -#define EPOLL_ENABLE_THRESHOLD 64
 -
 -static void aio_epoll_disable(AioContext *ctx)
 -{
 -    ctx->epoll_enabled = false;
 -    if (!ctx->epoll_available) {
 -        return;
 -    }
 -    ctx->epoll_available = false;
 -    close(ctx->epollfd);
 -}
 -
 -static inline int epoll_events_from_pfd(int pfd_events)
 -{
 -    return (pfd_events & G_IO_IN ? EPOLLIN : 0) |
 -           (pfd_events & G_IO_OUT ? EPOLLOUT : 0) |
 -           (pfd_events & G_IO_HUP ? EPOLLHUP : 0) |
 -           (pfd_events & G_IO_ERR ? EPOLLERR : 0);
 -}
 -
 -static bool aio_epoll_try_enable(AioContext *ctx)
 -{
 -    AioHandler *node;
 -    struct epoll_event event;
 -
 -    QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
 -        int r;
 -        if (QLIST_IS_INSERTED(node, node_deleted) || !node->pfd.events) {
 -            continue;
 -        }
 -        event.events = epoll_events_from_pfd(node->pfd.events);
 -        event.data.ptr = node;
 -        r = epoll_ctl(ctx->epollfd, EPOLL_CTL_ADD, node->pfd.fd, &event);
 -        if (r) {
 -            return false;
 -        }
 -    }
 -    ctx->epoll_enabled = true;
 -    return true;
 -}
 -
 -static void aio_epoll_update(AioContext *ctx, AioHandler *node, bool is_new)
 -{
 -    struct epoll_event event;
 -    int r;
 -    int ctl;
 -
 -    if (!ctx->epoll_enabled) {
 -        return;
 -    }
 -    if (!node->pfd.events) {
 -        ctl = EPOLL_CTL_DEL;
 -    } else {
 -        event.data.ptr = node;
 -        event.events = epoll_events_from_pfd(node->pfd.events);
 -        ctl = is_new ? EPOLL_CTL_ADD : EPOLL_CTL_MOD;
 -    }
 -
 -    r = epoll_ctl(ctx->epollfd, ctl, node->pfd.fd, &event);
 -    if (r) {
 -        aio_epoll_disable(ctx);
 -    }
 -}
 -
 -static int aio_epoll(AioContext *ctx, AioHandlerList *ready_list,
 -                     int64_t timeout)
 -{
 -    GPollFD pfd = {
 -        .fd = ctx->epollfd,
 -        .events = G_IO_IN | G_IO_OUT | G_IO_HUP | G_IO_ERR,
 -    };
 -    AioHandler *node;
 -    int i, ret = 0;
 -    struct epoll_event events[128];
 -
 -    if (timeout > 0) {
 -        ret = qemu_poll_ns(&pfd, 1, timeout);
 -        if (ret > 0) {
 -            timeout = 0;
 -        }
 -    }
 -    if (timeout <= 0 || ret > 0) {
 -        ret = epoll_wait(ctx->epollfd, events,
 -                         ARRAY_SIZE(events),
 -                         timeout);
 -        if (ret <= 0) {
 -            goto out;
 -        }
 -        for (i = 0; i < ret; i++) {
 -            int ev = events[i].events;
 -            int revents = (ev & EPOLLIN ? G_IO_IN : 0) |
 -                          (ev & EPOLLOUT ? G_IO_OUT : 0) |
 -                          (ev & EPOLLHUP ? G_IO_HUP : 0) |
 -                          (ev & EPOLLERR ? G_IO_ERR : 0);
 -
 -            node = events[i].data.ptr;
 -            add_ready_handler(ready_list, node, revents);
 -        }
 -    }
 -out:
 -    return ret;
 -}
 -
 -static bool aio_epoll_enabled(AioContext *ctx)
 -{
 -    /* Fall back to ppoll when external clients are disabled. */
 -    return !aio_external_disabled(ctx) && ctx->epoll_enabled;
 -}
 -
 -static bool aio_epoll_check_poll(AioContext *ctx, GPollFD *pfds,
 -                                 unsigned npfd, int64_t timeout)
 -{
 -    if (!ctx->epoll_available) {
 -        return false;
 -    }
 -    if (aio_epoll_enabled(ctx)) {
 -        return true;
 -    }
 -    if (npfd >= EPOLL_ENABLE_THRESHOLD) {
 -        if (aio_epoll_try_enable(ctx)) {
 -            return true;
 -        } else {
 -            aio_epoll_disable(ctx);
 -        }
 -    }
 -    return false;
 -}
 -
 -#else
 -
 -static void aio_epoll_update(AioContext *ctx, AioHandler *node, bool is_new)
 -{
 -}
 -
 -static int aio_epoll(AioContext *ctx, AioHandlerList *ready_list,
 -                     int64_t timeout)
 -{
 -    assert(false);
 -}
 -
 -static bool aio_epoll_enabled(AioContext *ctx)
 -{
 -    return false;
 -}
 -
 -static bool aio_epoll_check_poll(AioContext *ctx, GPollFD *pfds,
 -                          unsigned npfd, int64_t timeout)
 -{
 -    return false;
 -}
 -
 -#endif
 -
  static AioHandler *find_aio_handler(AioContext *ctx, int fd)
  {
      AioHandler *node;
@@ -XXX,XX +XXX,XX @@ void aio_set_fd_handler(AioContext *ctx,
                 atomic_read(&ctx->poll_disable_cnt) + poll_disable_change);
      if (new_node) {
 -        aio_epoll_update(ctx, new_node, is_new);
 +        ctx->fdmon_ops->update(ctx, new_node, is_new);
      } else if (node) {
          /* Unregister deleted fd_handler */
 -        aio_epoll_update(ctx, node, false);
 +        ctx->fdmon_ops->update(ctx, node, false);
      }
      qemu_lockcnt_unlock(&ctx->list_lock);
      aio_notify(ctx);
@@ -XXX,XX +XXX,XX @@ void aio_dispatch(AioContext *ctx)
      timerlistgroup_run_timers(&ctx->tlg);
  }
 -/* These thread-local variables are used only in a small part of aio_poll
 - * around the call to the poll() system call.  In particular they are not
 - * used while aio_poll is performing callbacks, which makes it much easier
 - * to think about reentrancy!
 - *
 - * Stack-allocated arrays would be perfect but they have size limitations;
 - * heap allocation is expensive enough that we want to reuse arrays across
 - * calls to aio_poll().  And because poll() has to be called without holding
 - * any lock, the arrays cannot be stored in AioContext.  Thread-local data
 - * has none of the disadvantages of these three options.
 - */
 -static __thread GPollFD *pollfds;
 -static __thread AioHandler **nodes;
 -static __thread unsigned npfd, nalloc;
 -static __thread Notifier pollfds_cleanup_notifier;
 -
 -static void pollfds_cleanup(Notifier *n, void *unused)
 -{
 -    g_assert(npfd == 0);
 -    g_free(pollfds);
 -    g_free(nodes);
 -    nalloc = 0;
 -}
 -
 -static void add_pollfd(AioHandler *node)
 -{
 -    if (npfd == nalloc) {
 -        if (nalloc == 0) {
 -            pollfds_cleanup_notifier.notify = pollfds_cleanup;
 -            qemu_thread_atexit_add(&pollfds_cleanup_notifier);
 -            nalloc = 8;
 -        } else {
 -            g_assert(nalloc <= INT_MAX);
 -            nalloc *= 2;
 -        }
 -        pollfds = g_renew(GPollFD, pollfds, nalloc);
 -        nodes = g_renew(AioHandler *, nodes, nalloc);
 -    }
 -    nodes[npfd] = node;
 -    pollfds[npfd] = (GPollFD) {
 -        .fd = node->pfd.fd,
 -        .events = node->pfd.events,
 -    };
 -    npfd++;
 -}
 -
  static bool run_poll_handlers_once(AioContext *ctx, int64_t *timeout)
  {
      bool progress = false;
@@ -XXX,XX +XXX,XX @@ static bool try_poll_mode(AioContext *ctx, int64_t *timeout)
  bool aio_poll(AioContext *ctx, bool blocking)
  {
      AioHandlerList ready_list = QLIST_HEAD_INITIALIZER(ready_list);
 -    AioHandler *node;
 -    int i;
      int ret = 0;
      bool progress;
      int64_t timeout;
@@ -XXX,XX +XXX,XX @@ bool aio_poll(AioContext *ctx, bool blocking)
       * system call---a single round of run_poll_handlers_once suffices.
       */
      if (timeout || atomic_read(&ctx->poll_disable_cnt)) {
 -        assert(npfd == 0);
 -
 -        /* fill pollfds */
 -
 -        if (!aio_epoll_enabled(ctx)) {
 -            QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
 -                if (!QLIST_IS_INSERTED(node, node_deleted) && node->pfd.events
 -                    && aio_node_check(ctx, node->is_external)) {
 -                    add_pollfd(node);
 -                }
 -            }
 -        }
 -
 -        /* wait until next event */
 -        if (aio_epoll_check_poll(ctx, pollfds, npfd, timeout)) {
 -            npfd = 0; /* pollfds[] is not being used */
 -            ret = aio_epoll(ctx, &ready_list, timeout);
 -        } else  {
 -            ret = qemu_poll_ns(pollfds, npfd, timeout);
 -        }
 +        ret = ctx->fdmon_ops->wait(ctx, &ready_list, timeout);
      }
      if (blocking) {
@@ -XXX,XX +XXX,XX @@ bool aio_poll(AioContext *ctx, bool blocking)
          }
      }
 -    /* if we have any readable fds, dispatch event */
 -    if (ret > 0) {
 -        for (i = 0; i < npfd; i++) {
 -            int revents = pollfds[i].revents;
 -
 -            if (revents) {
 -                add_ready_handler(&ready_list, nodes[i], revents);
 -            }
 -        }
 -    }
 -
 -    npfd = 0;
 -
      progress |= aio_bh_poll(ctx);
      if (ret > 0) {
@@ -XXX,XX +XXX,XX @@ bool aio_poll(AioContext *ctx, bool blocking)
  void aio_context_setup(AioContext *ctx)
  {
 -#ifdef CONFIG_EPOLL_CREATE1
 -    assert(!ctx->epollfd);
 -    ctx->epollfd = epoll_create1(EPOLL_CLOEXEC);
 -    if (ctx->epollfd == -1) {
 -        fprintf(stderr, "Failed to create epoll instance: %s", strerror(errno));
 -        ctx->epoll_available = false;
 -    } else {
 -        ctx->epoll_available = true;
 -    }
 -#endif
 +    ctx->fdmon_ops = &fdmon_poll_ops;
 +    ctx->epollfd = -1;
 +
 +    fdmon_epoll_setup(ctx);
  }
  void aio_context_destroy(AioContext *ctx)
  {
 -#ifdef CONFIG_EPOLL_CREATE1
 -    aio_epoll_disable(ctx);
 -#endif
 +    fdmon_epoll_disable(ctx);
  }
  void aio_context_set_poll_params(AioContext *ctx, int64_t max_ns,
 diff --git a/util/aio-posix.h b/util/aio-posix.h
 new file mode 100644
 index XXXXXXX..XXXXXXX
 --- /dev/null
-+++ b/util/aio-posix.h
++++ b/block/vxhs.c
 @@ -XXX,XX +XXX,XX @@
 +/*
-+ * AioContext POSIX event loop implementation internal APIs
++ * QEMU Block driver for Veritas HyperScale (VxHS)
 + *
-+ * Copyright IBM, Corp. 2008
++ * Copyright (c) 2017 Veritas Technologies LLC.
 + * Copyright Red Hat, Inc. 2020
 + *
-+ * Authors:
++ * This work is licensed under the terms of the GNU GPL, version 2 or later.
-+ *  Anthony Liguori   <aliguori@us.ibm.com>
++ * See the COPYING file in the top-level directory.
 + *
-+ * This work is licensed under the terms of the GNU GPL, version 2.  See
-+ * the COPYING file in the top-level directory.
-+ *
-+ * Contributions after 2012-01-13 are licensed under the terms of the
-+ * GNU GPL, version 2 or (at your option) any later version.
 + */
 +
-+#ifndef AIO_POSIX_H
++#include "qemu/osdep.h"
-+#define AIO_POSIX_H
++#include <qnio/qnio_api.h>
-+
++#include <sys/param.h>
-+#include "block/aio.h"
++#include "block/block_int.h"
-+
++#include "qapi/qmp/qerror.h"
-+struct AioHandler {
++#include "qapi/qmp/qdict.h"
-+    GPollFD pfd;
++#include "qapi/qmp/qstring.h"
-+    IOHandler *io_read;
++#include "trace.h"
-+    IOHandler *io_write;
++#include "qemu/uri.h"
-+    AioPollFn *io_poll;
++#include "qapi/error.h"
-+    IOHandler *io_poll_begin;
++#include "qemu/uuid.h"
-+    IOHandler *io_poll_end;
++#include "crypto/tlscredsx509.h"
-+    void *opaque;
++
-+    bool is_external;
++#define VXHS_OPT_FILENAME           "filename"
-+    QLIST_ENTRY(AioHandler) node;
++#define VXHS_OPT_VDISK_ID           "vdisk-id"
-+    QLIST_ENTRY(AioHandler) node_ready; /* only used during aio_poll() */
++#define VXHS_OPT_SERVER             "server"
-+    QLIST_ENTRY(AioHandler) node_deleted;
++#define VXHS_OPT_HOST               "host"
-+};
++#define VXHS_OPT_PORT               "port"
 +
-+/* Add a handler to a ready list */
++/* Only accessed under QEMU global mutex */
-+void aio_add_ready_handler(AioHandlerList *ready_list, AioHandler *node,
++static uint32_t vxhs_ref;
-+                           int revents);
++
-+
++typedef enum {
-+extern const FDMonOps fdmon_poll_ops;
++    VDISK_AIO_READ,
-+
++    VDISK_AIO_WRITE,
-+#ifdef CONFIG_EPOLL_CREATE1
++} VDISKAIOCmd;
-+bool fdmon_epoll_try_upgrade(AioContext *ctx, unsigned npfd);
++
 +void fdmon_epoll_setup(AioContext *ctx);
 +void fdmon_epoll_disable(AioContext *ctx);
 +#else
 +static inline bool fdmon_epoll_try_upgrade(AioContext *ctx, unsigned npfd)
 +{
 +    return false;
 +}
 +
 +static inline void fdmon_epoll_setup(AioContext *ctx)
 +{
 +}
 +
 +static inline void fdmon_epoll_disable(AioContext *ctx)
 +{
 +}
 +#endif /* !CONFIG_EPOLL_CREATE1 */
 +
 +#endif /* AIO_POSIX_H */
 diff --git a/util/fdmon-epoll.c b/util/fdmon-epoll.c
 new file mode 100644
 index XXXXXXX..XXXXXXX
 --- /dev/null
 +++ b/util/fdmon-epoll.c
@@ -XXX,XX +XXX,XX @@
 +/* SPDX-License-Identifier: GPL-2.0-or-later */
 +/*
-+ * epoll(7) file descriptor monitoring
++ * HyperScale AIO callbacks structure
 + */
-+
++typedef struct VXHSAIOCB {
-+#include "qemu/osdep.h"
++    BlockAIOCB common;
-+#include <sys/epoll.h>
++    int err;
-+#include "qemu/rcu_queue.h"
++} VXHSAIOCB;
-+#include "aio-posix.h"
++
-+
++typedef struct VXHSvDiskHostsInfo {
-+/* The fd number threshold to switch to epoll */
++    void *dev_handle; /* Device handle */
-+#define EPOLL_ENABLE_THRESHOLD 64
++    char *host; /* Host name or IP */
-+
++    int port; /* Host's port number */
-+void fdmon_epoll_disable(AioContext *ctx)
++} VXHSvDiskHostsInfo;
-+{
++
-+    if (ctx->epollfd >= 0) {
++/*
-+        close(ctx->epollfd);
++ * Structure per vDisk maintained for state
-+        ctx->epollfd = -1;
++ */
-+    }
++typedef struct BDRVVXHSState {
-+
++    VXHSvDiskHostsInfo vdisk_hostinfo; /* Per host info */
-+    /* Switch back */
++    char *vdisk_guid;
-+    ctx->fdmon_ops = &fdmon_poll_ops;
++    char *tlscredsid; /* tlscredsid */
-+}
++} BDRVVXHSState;
 +
-+static inline int epoll_events_from_pfd(int pfd_events)
++static void vxhs_complete_aio_bh(void *opaque)
 +{
-+    return (pfd_events & G_IO_IN ? EPOLLIN : 0) |
++    VXHSAIOCB *acb = opaque;
-+           (pfd_events & G_IO_OUT ? EPOLLOUT : 0) |
++    BlockCompletionFunc *cb = acb->common.cb;
-+           (pfd_events & G_IO_HUP ? EPOLLHUP : 0) |
++    void *cb_opaque = acb->common.opaque;
-+           (pfd_events & G_IO_ERR ? EPOLLERR : 0);
++    int ret = 0;
-+}
++
-+
++    if (acb->err != 0) {
-+static void fdmon_epoll_update(AioContext *ctx, AioHandler *node, bool is_new)
++        trace_vxhs_complete_aio(acb, acb->err);
-+{
++        ret = (-EIO);
-+    struct epoll_event event;
++    }
-+    int r;
++
-+    int ctl;
++    qemu_aio_unref(acb);
-+
++    cb(cb_opaque, ret);
-+    if (!node->pfd.events) {
++}
-+        ctl = EPOLL_CTL_DEL;
++
-+    } else {
++/*
-+        event.data.ptr = node;
++ * Called from a libqnio thread
-+        event.events = epoll_events_from_pfd(node->pfd.events);
++ */
-+        ctl = is_new ? EPOLL_CTL_ADD : EPOLL_CTL_MOD;
++static void vxhs_iio_callback(void *ctx, uint32_t opcode, uint32_t error)
-+    }
++{
-+
++    VXHSAIOCB *acb = NULL;
-+    r = epoll_ctl(ctx->epollfd, ctl, node->pfd.fd, &event);
++
-+    if (r) {
++    switch (opcode) {
-+        fdmon_epoll_disable(ctx);
++    case IRP_READ_REQUEST:
-+    }
++    case IRP_WRITE_REQUEST:
-+}
++
-+
++        /*
-+static int fdmon_epoll_wait(AioContext *ctx, AioHandlerList *ready_list,
++         * ctx is VXHSAIOCB*
-+                            int64_t timeout)
++         * ctx is NULL if error is QNIOERROR_CHANNEL_HUP
-+{
++         */
-+    GPollFD pfd = {
++        if (ctx) {
-+        .fd = ctx->epollfd,
++            acb = ctx;
-+        .events = G_IO_IN | G_IO_OUT | G_IO_HUP | G_IO_ERR,
++        } else {
-+    };
++            trace_vxhs_iio_callback(error);
 +    AioHandler *node;
 +    int i, ret = 0;
 +    struct epoll_event events[128];
 +
 +    /* Fall back while external clients are disabled */
 +    if (atomic_read(&ctx->external_disable_cnt)) {
 +        return fdmon_poll_ops.wait(ctx, ready_list, timeout);
 +    }
 +
 +    if (timeout > 0) {
 +        ret = qemu_poll_ns(&pfd, 1, timeout);
 +        if (ret > 0) {
 +            timeout = 0;
 +        }
 +    }
 +    if (timeout <= 0 || ret > 0) {
 +        ret = epoll_wait(ctx->epollfd, events,
 +                         ARRAY_SIZE(events),
 +                         timeout);
 +        if (ret <= 0) {
 +            goto out;
 +        }
-+        for (i = 0; i < ret; i++) {
++
-+            int ev = events[i].events;
++        if (error) {
-+            int revents = (ev & EPOLLIN ? G_IO_IN : 0) |
++            if (!acb->err) {
-+                          (ev & EPOLLOUT ? G_IO_OUT : 0) |
++                acb->err = error;
-+                          (ev & EPOLLHUP ? G_IO_HUP : 0) |
++            }
-+                          (ev & EPOLLERR ? G_IO_ERR : 0);
++            trace_vxhs_iio_callback(error);
 +
 +            node = events[i].data.ptr;
 +            aio_add_ready_handler(ready_list, node, revents);
 +        }
++
++        aio_bh_schedule_oneshot(bdrv_get_aio_context(acb->common.bs),
++                                vxhs_complete_aio_bh, acb);
++        break;
++
++    default:
++        if (error == QNIOERROR_HUP) {
++            /*
++             * Channel failed, spontaneous notification,
++             * not in response to I/O
++             */
++            trace_vxhs_iio_callback_chnfail(error, errno);
++        } else {
++            trace_vxhs_iio_callback_unknwn(opcode, error);
++        }
++        break;
 +    }
 +out:
++    return;
++}
++
++static QemuOptsList runtime_opts = {
++    .name = "vxhs",
++    .head = QTAILQ_HEAD_INITIALIZER(runtime_opts.head),
++    .desc = {
++        {
++            .name = VXHS_OPT_FILENAME,
++            .type = QEMU_OPT_STRING,
++            .help = "URI to the Veritas HyperScale image",
++        },
++        {
++            .name = VXHS_OPT_VDISK_ID,
++            .type = QEMU_OPT_STRING,
++            .help = "UUID of the VxHS vdisk",
++        },
++        {
++            .name = "tls-creds",
++            .type = QEMU_OPT_STRING,
++            .help = "ID of the TLS/SSL credentials to use",
++        },
++        { /* end of list */ }
++    },
++};
++
++static QemuOptsList runtime_tcp_opts = {
++    .name = "vxhs_tcp",
++    .head = QTAILQ_HEAD_INITIALIZER(runtime_tcp_opts.head),
++    .desc = {
++        {
++            .name = VXHS_OPT_HOST,
++            .type = QEMU_OPT_STRING,
++            .help = "host address (ipv4 addresses)",
++        },
++        {
++            .name = VXHS_OPT_PORT,
++            .type = QEMU_OPT_NUMBER,
++            .help = "port number on which VxHSD is listening (default 9999)",
++            .def_value_str = "9999"
++        },
++        { /* end of list */ }
++    },
++};
++
++/*
++ * Parse incoming URI and populate *options with the host
++ * and device information
++ */
++static int vxhs_parse_uri(const char *filename, QDict *options)
++{
++    URI *uri = NULL;
++    char *port;
++    int ret = 0;
++
++    trace_vxhs_parse_uri_filename(filename);
++    uri = uri_parse(filename);
++    if (!uri || !uri->server || !uri->path) {
++        uri_free(uri);
++        return -EINVAL;
++    }
++
++    qdict_put(options, VXHS_OPT_SERVER".host", qstring_from_str(uri->server));
++
++    if (uri->port) {
++        port = g_strdup_printf("%d", uri->port);
++        qdict_put(options, VXHS_OPT_SERVER".port", qstring_from_str(port));
++        g_free(port);
++    }
++
++    qdict_put(options, "vdisk-id", qstring_from_str(uri->path));
++
++    trace_vxhs_parse_uri_hostinfo(uri->server, uri->port);
++    uri_free(uri);
++
 +    return ret;
 +}
 +
-+static const FDMonOps fdmon_epoll_ops = {
++static void vxhs_parse_filename(const char *filename, QDict *options,
-+    .update = fdmon_epoll_update,
++                                Error **errp)
-+    .wait = fdmon_epoll_wait,
++{
 +    if (qdict_haskey(options, "vdisk-id") || qdict_haskey(options, "server")) {
 +        error_setg(errp, "vdisk-id/server and a file name may not be specified "
 +                         "at the same time");
 +        return;
 +    }
 +
 +    if (strstr(filename, "://")) {
 +        int ret = vxhs_parse_uri(filename, options);
 +        if (ret < 0) {
 +            error_setg(errp, "Invalid URI. URI should be of the form "
 +                       "  vxhs://<host_ip>:<port>/<vdisk-id>");
 +        }
 +    }
 +}
 +
 +static int vxhs_init_and_ref(void)
 +{
 +    if (vxhs_ref++ == 0) {
 +        if (iio_init(QNIO_VERSION, vxhs_iio_callback)) {
 +            return -ENODEV;
 +        }
 +    }
 +    return 0;
 +}
 +
 +static void vxhs_unref(void)
 +{
 +    if (--vxhs_ref == 0) {
 +        iio_fini();
 +    }
 +}
 +
 +static void vxhs_get_tls_creds(const char *id, char **cacert,
 +                               char **key, char **cert, Error **errp)
 +{
 +    Object *obj;
 +    QCryptoTLSCreds *creds;
 +    QCryptoTLSCredsX509 *creds_x509;
 +
 +    obj = object_resolve_path_component(
 +        object_get_objects_root(), id);
 +
 +    if (!obj) {
 +        error_setg(errp, "No TLS credentials with id '%s'",
 +                   id);
 +        return;
 +    }
 +
 +    creds_x509 = (QCryptoTLSCredsX509 *)
 +        object_dynamic_cast(obj, TYPE_QCRYPTO_TLS_CREDS_X509);
 +
 +    if (!creds_x509) {
 +        error_setg(errp, "Object with id '%s' is not TLS credentials",
 +                   id);
 +        return;
 +    }
 +
 +    creds = &creds_x509->parent_obj;
 +
 +    if (creds->endpoint != QCRYPTO_TLS_CREDS_ENDPOINT_CLIENT) {
 +        error_setg(errp,
 +                   "Expecting TLS credentials with a client endpoint");
 +        return;
 +    }
 +
 +    /*
 +     * Get the cacert, client_cert and client_key file names.
 +     */
 +    if (!creds->dir) {
 +        error_setg(errp, "TLS object missing 'dir' property value");
 +        return;
 +    }
 +
 +    *cacert = g_strdup_printf("%s/%s", creds->dir,
 +                              QCRYPTO_TLS_CREDS_X509_CA_CERT);
 +    *cert = g_strdup_printf("%s/%s", creds->dir,
 +                            QCRYPTO_TLS_CREDS_X509_CLIENT_CERT);
 +    *key = g_strdup_printf("%s/%s", creds->dir,
 +                           QCRYPTO_TLS_CREDS_X509_CLIENT_KEY);
 +}
 +
 +static int vxhs_open(BlockDriverState *bs, QDict *options,
 +                     int bdrv_flags, Error **errp)
 +{
 +    BDRVVXHSState *s = bs->opaque;
 +    void *dev_handlep;
 +    QDict *backing_options = NULL;
 +    QemuOpts *opts = NULL;
 +    QemuOpts *tcp_opts = NULL;
 +    char *of_vsa_addr = NULL;
 +    Error *local_err = NULL;
 +    const char *vdisk_id_opt;
 +    const char *server_host_opt;
 +    int ret = 0;
 +    char *cacert = NULL;
 +    char *client_key = NULL;
 +    char *client_cert = NULL;
 +
 +    ret = vxhs_init_and_ref();
 +    if (ret < 0) {
 +        ret = -EINVAL;
 +        goto out;
 +    }
 +
 +    /* Create opts info from runtime_opts and runtime_tcp_opts list */
 +    opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
 +    tcp_opts = qemu_opts_create(&runtime_tcp_opts, NULL, 0, &error_abort);
 +
 +    qemu_opts_absorb_qdict(opts, options, &local_err);
 +    if (local_err) {
 +        ret = -EINVAL;
 +        goto out;
 +    }
 +
 +    /* vdisk-id is the disk UUID */
 +    vdisk_id_opt = qemu_opt_get(opts, VXHS_OPT_VDISK_ID);
 +    if (!vdisk_id_opt) {
 +        error_setg(&local_err, QERR_MISSING_PARAMETER, VXHS_OPT_VDISK_ID);
 +        ret = -EINVAL;
 +        goto out;
 +    }
 +
 +    /* vdisk-id may contain a leading '/' */
 +    if (strlen(vdisk_id_opt) > UUID_FMT_LEN + 1) {
 +        error_setg(&local_err, "vdisk-id cannot be more than %d characters",
 +                   UUID_FMT_LEN);
 +        ret = -EINVAL;
 +        goto out;
 +    }
 +
 +    s->vdisk_guid = g_strdup(vdisk_id_opt);
 +    trace_vxhs_open_vdiskid(vdisk_id_opt);
 +
 +    /* get the 'server.' arguments */
 +    qdict_extract_subqdict(options, &backing_options, VXHS_OPT_SERVER".");
 +
 +    qemu_opts_absorb_qdict(tcp_opts, backing_options, &local_err);
 +    if (local_err != NULL) {
 +        ret = -EINVAL;
 +        goto out;
 +    }
 +
 +    server_host_opt = qemu_opt_get(tcp_opts, VXHS_OPT_HOST);
 +    if (!server_host_opt) {
 +        error_setg(&local_err, QERR_MISSING_PARAMETER,
 +                   VXHS_OPT_SERVER"."VXHS_OPT_HOST);
 +        ret = -EINVAL;
 +        goto out;
 +    }
 +
 +    if (strlen(server_host_opt) > MAXHOSTNAMELEN) {
 +        error_setg(&local_err, "server.host cannot be more than %d characters",
 +                   MAXHOSTNAMELEN);
 +        ret = -EINVAL;
 +        goto out;
 +    }
 +
 +    /* check if we got tls-creds via the --object argument */
 +    s->tlscredsid = g_strdup(qemu_opt_get(opts, "tls-creds"));
 +    if (s->tlscredsid) {
 +        vxhs_get_tls_creds(s->tlscredsid, &cacert, &client_key,
 +                           &client_cert, &local_err);
 +        if (local_err != NULL) {
 +            ret = -EINVAL;
 +            goto out;
 +        }
 +        trace_vxhs_get_creds(cacert, client_key, client_cert);
 +    }
 +
 +    s->vdisk_hostinfo.host = g_strdup(server_host_opt);
 +    s->vdisk_hostinfo.port = g_ascii_strtoll(qemu_opt_get(tcp_opts,
 +                                                          VXHS_OPT_PORT),
 +                                                          NULL, 0);
 +
 +    trace_vxhs_open_hostinfo(s->vdisk_hostinfo.host,
 +                             s->vdisk_hostinfo.port);
 +
 +    of_vsa_addr = g_strdup_printf("of://%s:%d",
 +                                  s->vdisk_hostinfo.host,
 +                                  s->vdisk_hostinfo.port);
 +
 +    /*
 +     * Open qnio channel to storage agent if not opened before
 +     */
 +    dev_handlep = iio_open(of_vsa_addr, s->vdisk_guid, 0,
 +                           cacert, client_key, client_cert);
 +    if (dev_handlep == NULL) {
 +        trace_vxhs_open_iio_open(of_vsa_addr);
 +        ret = -ENODEV;
 +        goto out;
 +    }
 +    s->vdisk_hostinfo.dev_handle = dev_handlep;
 +
 +out:
 +    g_free(of_vsa_addr);
 +    QDECREF(backing_options);
 +    qemu_opts_del(tcp_opts);
 +    qemu_opts_del(opts);
 +    g_free(cacert);
 +    g_free(client_key);
 +    g_free(client_cert);
 +
 +    if (ret < 0) {
 +        vxhs_unref();
 +        error_propagate(errp, local_err);
 +        g_free(s->vdisk_hostinfo.host);
 +        g_free(s->vdisk_guid);
 +        g_free(s->tlscredsid);
 +        s->vdisk_guid = NULL;
 +    }
 +
 +    return ret;
 +}
 +
 +static const AIOCBInfo vxhs_aiocb_info = {
 +    .aiocb_size = sizeof(VXHSAIOCB)
 +};
 +
-+static bool fdmon_epoll_try_enable(AioContext *ctx)
++/*
-+{
++ * This allocates QEMU-VXHS callback for each IO
-+    AioHandler *node;
++ * and is passed to QNIO. When QNIO completes the work,
-+    struct epoll_event event;
++ * it will be passed back through the callback.
-+
++ */
-+    QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
++static BlockAIOCB *vxhs_aio_rw(BlockDriverState *bs, int64_t sector_num,
-+        int r;
++                               QEMUIOVector *qiov, int nb_sectors,
-+        if (QLIST_IS_INSERTED(node, node_deleted) || !node->pfd.events) {
++                               BlockCompletionFunc *cb, void *opaque,
-+            continue;
++                               VDISKAIOCmd iodir)
-+        }
++{
-+        event.events = epoll_events_from_pfd(node->pfd.events);
++    VXHSAIOCB *acb = NULL;
-+        event.data.ptr = node;
++    BDRVVXHSState *s = bs->opaque;
-+        r = epoll_ctl(ctx->epollfd, EPOLL_CTL_ADD, node->pfd.fd, &event);
++    size_t size;
-+        if (r) {
++    uint64_t offset;
-+            return false;
++    int iio_flags = 0;
-+        }
++    int ret = 0;
-+    }
++    void *dev_handle = s->vdisk_hostinfo.dev_handle;
 +
-+    ctx->fdmon_ops = &fdmon_epoll_ops;
++    offset = sector_num * BDRV_SECTOR_SIZE;
-+    return true;
++    size = nb_sectors * BDRV_SECTOR_SIZE;
-+}
++    acb = qemu_aio_get(&vxhs_aiocb_info, bs, cb, opaque);
 +
-+bool fdmon_epoll_try_upgrade(AioContext *ctx, unsigned npfd)
++    /*
-+{
++     * Initialize VXHSAIOCB.
-+    if (ctx->epollfd < 0) {
++     */
-+        return false;
++    acb->err = 0;
-+    }
++
-+
++    iio_flags = IIO_FLAG_ASYNC;
-+    /* Do not upgrade while external clients are disabled */
++
-+    if (atomic_read(&ctx->external_disable_cnt)) {
++    switch (iodir) {
-+        return false;
++    case VDISK_AIO_WRITE:
-+    }
++            ret = iio_writev(dev_handle, acb, qiov->iov, qiov->niov,
-+
++                             offset, (uint64_t)size, iio_flags);
-+    if (npfd >= EPOLL_ENABLE_THRESHOLD) {
++            break;
-+        if (fdmon_epoll_try_enable(ctx)) {
++    case VDISK_AIO_READ:
-+            return true;
++            ret = iio_readv(dev_handle, acb, qiov->iov, qiov->niov,
-+        } else {
++                            offset, (uint64_t)size, iio_flags);
-+            fdmon_epoll_disable(ctx);
++            break;
-+        }
++    default:
-+    }
++            trace_vxhs_aio_rw_invalid(iodir);
-+    return false;
++            goto errout;
-+}
++    }
 +
-+void fdmon_epoll_setup(AioContext *ctx)
++    if (ret != 0) {
-+{
++        trace_vxhs_aio_rw_ioerr(s->vdisk_guid, iodir, size, offset,
-+    ctx->epollfd = epoll_create1(EPOLL_CLOEXEC);
++                                acb, ret, errno);
-+    if (ctx->epollfd == -1) {
++        goto errout;
-+        fprintf(stderr, "Failed to create epoll instance: %s", strerror(errno));
++    }
-+    }
++    return &acb->common;
-+}
++
-diff --git a/util/fdmon-poll.c b/util/fdmon-poll.c
++errout:
-new file mode 100644
++    qemu_aio_unref(acb);
-index XXXXXXX..XXXXXXX
++    return NULL;
---- /dev/null
++}
-+++ b/util/fdmon-poll.c
++
 +static BlockAIOCB *vxhs_aio_readv(BlockDriverState *bs,
 +                                   int64_t sector_num, QEMUIOVector *qiov,
 +                                   int nb_sectors,
 +                                   BlockCompletionFunc *cb, void *opaque)
 +{
 +    return vxhs_aio_rw(bs, sector_num, qiov, nb_sectors, cb,
 +                       opaque, VDISK_AIO_READ);
 +}
 +
 +static BlockAIOCB *vxhs_aio_writev(BlockDriverState *bs,
 +                                   int64_t sector_num, QEMUIOVector *qiov,
 +                                   int nb_sectors,
 +                                   BlockCompletionFunc *cb, void *opaque)
 +{
 +    return vxhs_aio_rw(bs, sector_num, qiov, nb_sectors,
 +                       cb, opaque, VDISK_AIO_WRITE);
 +}
 +
 +static void vxhs_close(BlockDriverState *bs)
 +{
 +    BDRVVXHSState *s = bs->opaque;
 +
 +    trace_vxhs_close(s->vdisk_guid);
 +
 +    g_free(s->vdisk_guid);
 +    s->vdisk_guid = NULL;
 +
 +    /*
 +     * Close vDisk device
 +     */
 +    if (s->vdisk_hostinfo.dev_handle) {
 +        iio_close(s->vdisk_hostinfo.dev_handle);
 +        s->vdisk_hostinfo.dev_handle = NULL;
 +    }
 +
 +    vxhs_unref();
 +
 +    /*
 +     * Free the dynamically allocated host string etc
 +     */
 +    g_free(s->vdisk_hostinfo.host);
 +    g_free(s->tlscredsid);
 +    s->tlscredsid = NULL;
 +    s->vdisk_hostinfo.host = NULL;
 +    s->vdisk_hostinfo.port = 0;
 +}
 +
 +static int64_t vxhs_get_vdisk_stat(BDRVVXHSState *s)
 +{
 +    int64_t vdisk_size = -1;
 +    int ret = 0;
 +    void *dev_handle = s->vdisk_hostinfo.dev_handle;
 +
 +    ret = iio_ioctl(dev_handle, IOR_VDISK_STAT, &vdisk_size, 0);
 +    if (ret < 0) {
 +        trace_vxhs_get_vdisk_stat_err(s->vdisk_guid, ret, errno);
 +        return -EIO;
 +    }
 +
 +    trace_vxhs_get_vdisk_stat(s->vdisk_guid, vdisk_size);
 +    return vdisk_size;
 +}
 +
 +/*
 + * Returns the size of vDisk in bytes. This is required
 + * by QEMU block upper block layer so that it is visible
 + * to guest.
 + */
 +static int64_t vxhs_getlength(BlockDriverState *bs)
 +{
 +    BDRVVXHSState *s = bs->opaque;
 +    int64_t vdisk_size;
 +
 +    vdisk_size = vxhs_get_vdisk_stat(s);
 +    if (vdisk_size < 0) {
 +        return -EIO;
 +    }
 +
 +    return vdisk_size;
 +}
 +
 +static BlockDriver bdrv_vxhs = {
 +    .format_name                  = "vxhs",
 +    .protocol_name                = "vxhs",
 +    .instance_size                = sizeof(BDRVVXHSState),
 +    .bdrv_file_open               = vxhs_open,
 +    .bdrv_parse_filename          = vxhs_parse_filename,
 +    .bdrv_close                   = vxhs_close,
 +    .bdrv_getlength               = vxhs_getlength,
 +    .bdrv_aio_readv               = vxhs_aio_readv,
 +    .bdrv_aio_writev              = vxhs_aio_writev,
 +};
 +
 +static void bdrv_vxhs_init(void)
 +{
 +    bdrv_register(&bdrv_vxhs);
 +}
 +
 +block_init(bdrv_vxhs_init);
 diff --git a/configure b/configure
 index XXXXXXX..XXXXXXX 100755
 --- a/configure
 +++ b/configure
@@ -XXX,XX +XXX,XX @@ numa=""
  tcmalloc="no"
  jemalloc="no"
  replication="yes"
 +vxhs=""
  supported_cpu="no"
  supported_os="no"
@@ -XXX,XX +XXX,XX @@ for opt do
    ;;
    --enable-replication) replication="yes"
    ;;
 +  --disable-vxhs) vxhs="no"
 +  ;;
 +  --enable-vxhs) vxhs="yes"
 +  ;;
    *)
        echo "ERROR: unknown option $opt"
        echo "Try '$0 --help' for more information"
@@ -XXX,XX +XXX,XX @@ disabled with --disable-FEATURE, default is enabled if available:
    xfsctl          xfsctl support
    qom-cast-debug  cast debugging support
    tools           build qemu-io, qemu-nbd and qemu-image tools
 +  vxhs            Veritas HyperScale vDisk backend support
  NOTE: The object files are built at the place where configure is launched
  EOF
@@ -XXX,XX +XXX,XX @@ if compile_prog "" "" ; then
  fi
  ##########################################
 +# Veritas HyperScale block driver VxHS
 +# Check if libvxhs is installed
 +
 +if test "$vxhs" != "no" ; then
 +  cat > $TMPC <<EOF
 +#include <stdint.h>
 +#include <qnio/qnio_api.h>
 +
 +void *vxhs_callback;
 +
 +int main(void) {
 +    iio_init(QNIO_VERSION, vxhs_callback);
 +    return 0;
 +}
 +EOF
 +  vxhs_libs="-lvxhs -lssl"
 +  if compile_prog "" "$vxhs_libs" ; then
 +    vxhs=yes
 +  else
 +    if test "$vxhs" = "yes" ; then
 +      feature_not_found "vxhs block device" "Install libvxhs See github"
 +    fi
 +    vxhs=no
 +  fi
 +fi
 +
 +##########################################
  # End of CC checks
  # After here, no more $cc or $ld runs
@@ -XXX,XX +XXX,XX @@ echo "tcmalloc support  $tcmalloc"
  echo "jemalloc support  $jemalloc"
  echo "avx2 optimization $avx2_opt"
  echo "replication support $replication"
 +echo "VxHS block device $vxhs"
  if test "$sdl_too_old" = "yes"; then
  echo "-> Your SDL version is too old - please upgrade to have SDL support"
@@ -XXX,XX +XXX,XX @@ if test "$pthread_setname_np" = "yes" ; then
    echo "CONFIG_PTHREAD_SETNAME_NP=y" >> $config_host_mak
  fi
 +if test "$vxhs" = "yes" ; then
 +  echo "CONFIG_VXHS=y" >> $config_host_mak
 +  echo "VXHS_LIBS=$vxhs_libs" >> $config_host_mak
 +fi
 +
  if test "$tcg_interpreter" = "yes"; then
    QEMU_INCLUDES="-I\$(SRC_PATH)/tcg/tci $QEMU_INCLUDES"
  elif test "$ARCH" = "sparc64" ; then
 diff --git a/qapi/block-core.json b/qapi/block-core.json
 index XXXXXXX..XXXXXXX 100644
 --- a/qapi/block-core.json
 +++ b/qapi/block-core.json
 @@ -XXX,XX +XXX,XX @@
-+/* SPDX-License-Identifier: GPL-2.0-or-later */
+ #
-+/*
+ # Drivers that are supported in block device operations.
-+ * poll(2) file descriptor monitoring
+ #
-+ *
++# @vxhs: Since 2.10
-+ * Uses ppoll(2) when available, g_poll() otherwise.
++#
-+ */
+ # Since: 2.9
-+
+ ##
-+#include "qemu/osdep.h"
+ { 'enum': 'BlockdevDriver',
-+#include "aio-posix.h"
+@@ -XXX,XX +XXX,XX @@
-+#include "qemu/rcu_queue.h"
+             'host_device', 'http', 'https', 'iscsi', 'luks', 'nbd', 'nfs',
-+
+             'null-aio', 'null-co', 'parallels', 'qcow', 'qcow2', 'qed',
-+/*
+             'quorum', 'raw', 'rbd', 'replication', 'sheepdog', 'ssh',
-+ * These thread-local variables are used only in fdmon_poll_wait() around the
+-            'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
-+ * call to the poll() system call.  In particular they are not used while
++            'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat', 'vxhs' ] }
-+ * aio_poll is performing callbacks, which makes it much easier to think about
-+ * reentrancy!
+ ##
-+ *
+ # @BlockdevOptionsFile:
-+ * Stack-allocated arrays would be perfect but they have size limitations;
+@@ -XXX,XX +XXX,XX @@
-+ * heap allocation is expensive enough that we want to reuse arrays across
+   'data': { '*offset': 'int', '*size': 'int' } }
-+ * calls to aio_poll().  And because poll() has to be called without holding
-+ * any lock, the arrays cannot be stored in AioContext.  Thread-local data
+ ##
-+ * has none of the disadvantages of these three options.
++# @BlockdevOptionsVxHS:
-+ */
++#
-+static __thread GPollFD *pollfds;
++# Driver specific block device options for VxHS
-+static __thread AioHandler **nodes;
++#
-+static __thread unsigned npfd, nalloc;
++# @vdisk-id:    UUID of VxHS volume
-+static __thread Notifier pollfds_cleanup_notifier;
++# @server:      vxhs server IP, port
-+
++# @tls-creds:   TLS credentials ID
-+static void pollfds_cleanup(Notifier *n, void *unused)
++#
-+{
++# Since: 2.10
-+    g_assert(npfd == 0);
++##
-+    g_free(pollfds);
++{ 'struct': 'BlockdevOptionsVxHS',
-+    g_free(nodes);
++  'data': { 'vdisk-id': 'str',
-+    nalloc = 0;
++            'server': 'InetSocketAddressBase',
-+}
++            '*tls-creds': 'str' } }
 +
-+static void add_pollfd(AioHandler *node)
++##
-+{
+ # @BlockdevOptions:
-+    if (npfd == nalloc) {
+ #
-+        if (nalloc == 0) {
+ # Options for creating a block device.  Many options are available for all
-+            pollfds_cleanup_notifier.notify = pollfds_cleanup;
+@@ -XXX,XX +XXX,XX @@
-+            qemu_thread_atexit_add(&pollfds_cleanup_notifier);
+       'vhdx':       'BlockdevOptionsGenericFormat',
-+            nalloc = 8;
+       'vmdk':       'BlockdevOptionsGenericCOWFormat',
-+        } else {
+       'vpc':        'BlockdevOptionsGenericFormat',
-+            g_assert(nalloc <= INT_MAX);
+-      'vvfat':      'BlockdevOptionsVVFAT'
-+            nalloc *= 2;
++      'vvfat':      'BlockdevOptionsVVFAT',
-+        }
++      'vxhs':       'BlockdevOptionsVxHS'
-+        pollfds = g_renew(GPollFD, pollfds, nalloc);
+   } }
-+        nodes = g_renew(AioHandler *, nodes, nalloc);
-+    }
+ ##
 +    nodes[npfd] = node;
 +    pollfds[npfd] = (GPollFD) {
 +        .fd = node->pfd.fd,
 +        .events = node->pfd.events,
 +    };
 +    npfd++;
 +}
 +
 +static int fdmon_poll_wait(AioContext *ctx, AioHandlerList *ready_list,
 +                            int64_t timeout)
 +{
 +    AioHandler *node;
 +    int ret;
 +
 +    assert(npfd == 0);
 +
 +    QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
 +        if (!QLIST_IS_INSERTED(node, node_deleted) && node->pfd.events
 +                && aio_node_check(ctx, node->is_external)) {
 +            add_pollfd(node);
 +        }
 +    }
 +
 +    /* epoll(7) is faster above a certain number of fds */
 +    if (fdmon_epoll_try_upgrade(ctx, npfd)) {
 +        return ctx->fdmon_ops->wait(ctx, ready_list, timeout);
 +    }
 +
 +    ret = qemu_poll_ns(pollfds, npfd, timeout);
 +    if (ret > 0) {
 +        int i;
 +
 +        for (i = 0; i < npfd; i++) {
 +            int revents = pollfds[i].revents;
 +
 +            if (revents) {
 +                aio_add_ready_handler(ready_list, nodes[i], revents);
 +            }
 +        }
 +    }
 +
 +    npfd = 0;
 +    return ret;
 +}
 +
 +static void fdmon_poll_update(AioContext *ctx, AioHandler *node, bool is_new)
 +{
 +    /* Do nothing, AioHandler already contains the state we'll need */
 +}
 +
 +const FDMonOps fdmon_poll_ops = {
 +    .update = fdmon_poll_update,
 +    .wait = fdmon_poll_wait,
 +};
 --
-.24.1
+.9.3

-New patch
+[Qemu-devel] [PULL v2 02/12] block/vxhs.c: Add qemu-iotests for new block device type "vxhs"
+From: Ashish Mittal <ashmit602@gmail.com>
+These changes use a vxhs test server that is a part of the following
+repository:
+https://github.com/VeritasHyperScale/libqnio.git
+Signed-off-by: Ashish Mittal <Ashish.Mittal@veritas.com>
+Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
+Reviewed-by: Jeff Cody <jcody@redhat.com>
+Signed-off-by: Jeff Cody <jcody@redhat.com>
+Message-id: 1491277689-24949-3-git-send-email-Ashish.Mittal@veritas.com
+---
+ tests/qemu-iotests/common        |  6 ++++++
+ tests/qemu-iotests/common.config | 13 +++++++++++++
+ tests/qemu-iotests/common.filter |  1 +
+ tests/qemu-iotests/common.rc     | 19 +++++++++++++++++++
+files changed, 39 insertions(+)
+diff --git a/tests/qemu-iotests/common b/tests/qemu-iotests/common
+index XXXXXXX..XXXXXXX 100644
+--- a/tests/qemu-iotests/common
++++ b/tests/qemu-iotests/common
+@@ -XXX,XX +XXX,XX @@ check options
+     -ssh                test ssh
+     -nfs                test nfs
+     -luks               test luks
++    -vxhs               test vxhs
+     -xdiff              graphical mode diff
+     -nocache            use O_DIRECT on backing file
+     -misalign           misalign memory allocations
+@@ -XXX,XX +XXX,XX @@ testlist options
+             xpand=false
+             ;;
++        -vxhs)
++            IMGPROTO=vxhs
++            xpand=false
++            ;;
++
+         -ssh)
+             IMGPROTO=ssh
+             xpand=false
+diff --git a/tests/qemu-iotests/common.config b/tests/qemu-iotests/common.config
+index XXXXXXX..XXXXXXX 100644
+--- a/tests/qemu-iotests/common.config
++++ b/tests/qemu-iotests/common.config
+@@ -XXX,XX +XXX,XX @@ if [ -z "$QEMU_NBD_PROG" ]; then
+     export QEMU_NBD_PROG="`set_prog_path qemu-nbd`"
+ fi
++if [ -z "$QEMU_VXHS_PROG" ]; then
++    export QEMU_VXHS_PROG="`set_prog_path qnio_server`"
++fi
++
+ _qemu_wrapper()
+ {
+     (
+@@ -XXX,XX +XXX,XX @@ _qemu_nbd_wrapper()
+     )
+ }
++_qemu_vxhs_wrapper()
++{
++    (
++        echo $BASHPID > "${TEST_DIR}/qemu-vxhs.pid"
++        exec "$QEMU_VXHS_PROG" $QEMU_VXHS_OPTIONS "$@"
++    )
++}
++
+ export QEMU=_qemu_wrapper
+ export QEMU_IMG=_qemu_img_wrapper
+ export QEMU_IO=_qemu_io_wrapper
+ export QEMU_NBD=_qemu_nbd_wrapper
++export QEMU_VXHS=_qemu_vxhs_wrapper
+ QEMU_IMG_EXTRA_ARGS=
+ if [ "$IMGOPTSSYNTAX" = "true" ]; then
+diff --git a/tests/qemu-iotests/common.filter b/tests/qemu-iotests/common.filter
+index XXXXXXX..XXXXXXX 100644
+--- a/tests/qemu-iotests/common.filter
++++ b/tests/qemu-iotests/common.filter
+@@ -XXX,XX +XXX,XX @@ _filter_img_info()
+         -e "s#$TEST_DIR#TEST_DIR#g" \
+         -e "s#$IMGFMT#IMGFMT#g" \
+         -e 's#nbd://127.0.0.1:10810$#TEST_DIR/t.IMGFMT#g' \
++        -e 's#json.*vdisk-id.*vxhs"}}#TEST_DIR/t.IMGFMT#' \
+         -e "/encrypted: yes/d" \
+         -e "/cluster_size: [0-9]\\+/d" \
+         -e "/table_size: [0-9]\\+/d" \
+diff --git a/tests/qemu-iotests/common.rc b/tests/qemu-iotests/common.rc
+index XXXXXXX..XXXXXXX 100644
+--- a/tests/qemu-iotests/common.rc
++++ b/tests/qemu-iotests/common.rc
+@@ -XXX,XX +XXX,XX @@ else
+     elif [ "$IMGPROTO" = "nfs" ]; then
+         TEST_DIR="nfs://127.0.0.1/$TEST_DIR"
+         TEST_IMG=$TEST_DIR/t.$IMGFMT
++    elif [ "$IMGPROTO" = "vxhs" ]; then
++        TEST_IMG_FILE=$TEST_DIR/t.$IMGFMT
++        TEST_IMG="vxhs://127.0.0.1:9999/t.$IMGFMT"
+     else
+         TEST_IMG=$IMGPROTO:$TEST_DIR/t.$IMGFMT
+     fi
+@@ -XXX,XX +XXX,XX @@ _make_test_img()
+         eval "$QEMU_NBD -v -t -b 127.0.0.1 -p 10810 -f $IMGFMT  $TEST_IMG_FILE >/dev/null &"
+         sleep 1 # FIXME: qemu-nbd needs to be listening before we continue
+     fi
++
++    # Start QNIO server on image directory for vxhs protocol
++    if [ $IMGPROTO = "vxhs" ]; then
++        eval "$QEMU_VXHS -d  $TEST_DIR > /dev/null &"
++        sleep 1 # Wait for server to come up.
++    fi
+ }
+ _rm_test_img()
+@@ -XXX,XX +XXX,XX @@ _cleanup_test_img()
+             fi
+             rm -f "$TEST_IMG_FILE"
+             ;;
++        vxhs)
++            if [ -f "${TEST_DIR}/qemu-vxhs.pid" ]; then
++                local QEMU_VXHS_PID
++                read QEMU_VXHS_PID < "${TEST_DIR}/qemu-vxhs.pid"
++                kill ${QEMU_VXHS_PID} >/dev/null 2>&1
++                rm -f "${TEST_DIR}/qemu-vxhs.pid"
++            fi
++            rm -f "$TEST_IMG_FILE"
++            ;;
++
+         file)
+             _rm_test_img "$TEST_DIR/t.$IMGFMT"
+             _rm_test_img "$TEST_DIR/t.$IMGFMT.orig"
+--
+.9.3

-New patch
+[Qemu-devel] [PULL v2 03/12] qemu-iotests: exclude vxhs from image creation via protocol
+The protocol VXHS does not support image creation.  Some tests expect
+to be able to create images through the protocol.  Exclude VXHS from
+these tests.
+Signed-off-by: Jeff Cody <jcody@redhat.com>
+---
+ tests/qemu-iotests/017 | 1 +
+ tests/qemu-iotests/020 | 1 +
+ tests/qemu-iotests/029 | 1 +
+ tests/qemu-iotests/073 | 1 +
+ tests/qemu-iotests/114 | 1 +
+ tests/qemu-iotests/130 | 1 +
+ tests/qemu-iotests/134 | 1 +
+ tests/qemu-iotests/156 | 1 +
+ tests/qemu-iotests/158 | 1 +
+files changed, 9 insertions(+)
+diff --git a/tests/qemu-iotests/017 b/tests/qemu-iotests/017
+index XXXXXXX..XXXXXXX 100755
+--- a/tests/qemu-iotests/017
++++ b/tests/qemu-iotests/017
+@@ -XXX,XX +XXX,XX @@ trap "_cleanup; exit \$status" 0 1 2 3 15
+ # Any format supporting backing files
+ _supported_fmt qcow qcow2 vmdk qed
+ _supported_proto generic
++_unsupported_proto vxhs
+ _supported_os Linux
+ _unsupported_imgopts "subformat=monolithicFlat" "subformat=twoGbMaxExtentFlat"
+diff --git a/tests/qemu-iotests/020 b/tests/qemu-iotests/020
+index XXXXXXX..XXXXXXX 100755
+--- a/tests/qemu-iotests/020
++++ b/tests/qemu-iotests/020
+@@ -XXX,XX +XXX,XX @@ trap "_cleanup; exit \$status" 0 1 2 3 15
+ # Any format supporting backing files
+ _supported_fmt qcow qcow2 vmdk qed
+ _supported_proto generic
++_unsupported_proto vxhs
+ _supported_os Linux
+ _unsupported_imgopts "subformat=monolithicFlat" \
+                      "subformat=twoGbMaxExtentFlat" \
+diff --git a/tests/qemu-iotests/029 b/tests/qemu-iotests/029
+index XXXXXXX..XXXXXXX 100755
+--- a/tests/qemu-iotests/029
++++ b/tests/qemu-iotests/029
+@@ -XXX,XX +XXX,XX @@ trap "_cleanup; exit \$status" 0 1 2 3 15
+ # Any format supporting intenal snapshots
+ _supported_fmt qcow2
+ _supported_proto generic
++_unsupported_proto vxhs
+ _supported_os Linux
+ # Internal snapshots are (currently) impossible with refcount_bits=1
+ _unsupported_imgopts 'refcount_bits=1[^0-9]'
+diff --git a/tests/qemu-iotests/073 b/tests/qemu-iotests/073
+index XXXXXXX..XXXXXXX 100755
+--- a/tests/qemu-iotests/073
++++ b/tests/qemu-iotests/073
+@@ -XXX,XX +XXX,XX @@ trap "_cleanup; exit \$status" 0 1 2 3 15
+ _supported_fmt qcow2
+ _supported_proto generic
++_unsupported_proto vxhs
+ _supported_os Linux
+ CLUSTER_SIZE=64k
+diff --git a/tests/qemu-iotests/114 b/tests/qemu-iotests/114
+index XXXXXXX..XXXXXXX 100755
+--- a/tests/qemu-iotests/114
++++ b/tests/qemu-iotests/114
+@@ -XXX,XX +XXX,XX @@ trap "_cleanup; exit \$status" 0 1 2 3 15
+ _supported_fmt qcow2
+ _supported_proto generic
++_unsupported_proto vxhs
+ _supported_os Linux
+diff --git a/tests/qemu-iotests/130 b/tests/qemu-iotests/130
+index XXXXXXX..XXXXXXX 100755
+--- a/tests/qemu-iotests/130
++++ b/tests/qemu-iotests/130
+@@ -XXX,XX +XXX,XX @@ trap "_cleanup; exit \$status" 0 1 2 3 15
+ _supported_fmt qcow2
+ _supported_proto generic
++_unsupported_proto vxhs
+ _supported_os Linux
+ qemu_comm_method="monitor"
+diff --git a/tests/qemu-iotests/134 b/tests/qemu-iotests/134
+index XXXXXXX..XXXXXXX 100755
+--- a/tests/qemu-iotests/134
++++ b/tests/qemu-iotests/134
+@@ -XXX,XX +XXX,XX @@ trap "_cleanup; exit \$status" 0 1 2 3 15
+ _supported_fmt qcow2
+ _supported_proto generic
++_unsupported_proto vxhs
+ _supported_os Linux
+diff --git a/tests/qemu-iotests/156 b/tests/qemu-iotests/156
+index XXXXXXX..XXXXXXX 100755
+--- a/tests/qemu-iotests/156
++++ b/tests/qemu-iotests/156
+@@ -XXX,XX +XXX,XX @@ trap "_cleanup; exit \$status" 0 1 2 3 15
+ _supported_fmt qcow2 qed
+ _supported_proto generic
++_unsupported_proto vxhs
+ _supported_os Linux
+ # Create source disk
+diff --git a/tests/qemu-iotests/158 b/tests/qemu-iotests/158
+index XXXXXXX..XXXXXXX 100755
+--- a/tests/qemu-iotests/158
++++ b/tests/qemu-iotests/158
+@@ -XXX,XX +XXX,XX @@ trap "_cleanup; exit \$status" 0 1 2 3 15
+ _supported_fmt qcow2
+ _supported_proto generic
++_unsupported_proto vxhs
+ _supported_os Linux
+--
+.9.3

-[PULL 4/9] aio-posix: move RCU_READ_LOCK() into run_poll_handlers()
+[Qemu-devel] [PULL v2 04/12] block: add bdrv_set_read_only() helper function
-Now that run_poll_handlers_once() is only called by run_poll_handlers()
+We have a helper wrapper for checking for the BDS read_only flag,
-we can improve the CPU time profile by moving the expensive
+add a helper wrapper to set the read_only flag as well.
-RCU_READ_LOCK() out of the polling loop.
-This reduces the run_poll_handlers() from 40% CPU to 10% CPU in perf's
+Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
-sampling profiler output.
+Signed-off-by: Jeff Cody <jcody@redhat.com>
 Reviewed-by: John Snow <jsnow@redhat.com>
 Message-id: 9b18972d05f5fa2ac16c014f0af98d680553048d.1491597120.git.jcody@redhat.com
 ---
  block.c               | 5 +++++
  block/bochs.c         | 2 +-
  block/cloop.c         | 2 +-
  block/dmg.c           | 2 +-
  block/rbd.c           | 2 +-
  block/vvfat.c         | 4 ++--
  include/block/block.h | 1 +
 files changed, 12 insertions(+), 6 deletions(-)
-Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
+diff --git a/block.c b/block.c
-Link: https://lore.kernel.org/r/20200305170806.1313245-3-stefanha@redhat.com
+index XXXXXXX..XXXXXXX 100644
-Message-Id: <20200305170806.1313245-3-stefanha@redhat.com>
+--- a/block.c
----
++++ b/block.c
- util/aio-posix.c | 20 ++++++++++----------
+@@ -XXX,XX +XXX,XX @@ void path_combine(char *dest, int dest_size,
-file changed, 10 insertions(+), 10 deletions(-)
+     }
  }
 +void bdrv_set_read_only(BlockDriverState *bs, bool read_only)
 +{
 +    bs->read_only = read_only;
 +}
 +
  void bdrv_get_full_backing_filename_from_filename(const char *backed,
                                                    const char *backing,
                                                    char *dest, size_t sz,
 diff --git a/block/bochs.c b/block/bochs.c
 index XXXXXXX..XXXXXXX 100644
 --- a/block/bochs.c
 +++ b/block/bochs.c
@@ -XXX,XX +XXX,XX @@ static int bochs_open(BlockDriverState *bs, QDict *options, int flags,
          return -EINVAL;
      }
 -    bs->read_only = true; /* no write support yet */
 +    bdrv_set_read_only(bs, true); /* no write support yet */
      ret = bdrv_pread(bs->file, 0, &bochs, sizeof(bochs));
      if (ret < 0) {
 diff --git a/block/cloop.c b/block/cloop.c
 index XXXXXXX..XXXXXXX 100644
 --- a/block/cloop.c
 +++ b/block/cloop.c
@@ -XXX,XX +XXX,XX @@ static int cloop_open(BlockDriverState *bs, QDict *options, int flags,
          return -EINVAL;
      }
 -    bs->read_only = true;
 +    bdrv_set_read_only(bs, true);
      /* read header */
      ret = bdrv_pread(bs->file, 128, &s->block_size, 4);
 diff --git a/block/dmg.c b/block/dmg.c
 index XXXXXXX..XXXXXXX 100644
 --- a/block/dmg.c
 +++ b/block/dmg.c
@@ -XXX,XX +XXX,XX @@ static int dmg_open(BlockDriverState *bs, QDict *options, int flags,
      }
      block_module_load_one("dmg-bz2");
 -    bs->read_only = true;
 +    bdrv_set_read_only(bs, true);
      s->n_chunks = 0;
      s->offsets = s->lengths = s->sectors = s->sectorcounts = NULL;
 diff --git a/block/rbd.c b/block/rbd.c
 index XXXXXXX..XXXXXXX 100644
 --- a/block/rbd.c
 +++ b/block/rbd.c
@@ -XXX,XX +XXX,XX @@ static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
          goto failed_open;
      }
 -    bs->read_only = (s->snap != NULL);
 +    bdrv_set_read_only(bs, (s->snap != NULL));
      qemu_opts_del(opts);
      return 0;
 diff --git a/block/vvfat.c b/block/vvfat.c
 index XXXXXXX..XXXXXXX 100644
 --- a/block/vvfat.c
 +++ b/block/vvfat.c
@@ -XXX,XX +XXX,XX @@ static int vvfat_open(BlockDriverState *bs, QDict *options, int flags,
      s->current_cluster=0xffffffff;
      /* read only is the default for safety */
 -    bs->read_only = true;
 +    bdrv_set_read_only(bs, true);
      s->qcow = NULL;
      s->qcow_filename = NULL;
      s->fat2 = NULL;
@@ -XXX,XX +XXX,XX @@ static int vvfat_open(BlockDriverState *bs, QDict *options, int flags,
          if (ret < 0) {
              goto fail;
          }
 -        bs->read_only = false;
 +        bdrv_set_read_only(bs, false);
      }
      bs->total_sectors = cyls * heads * secs;
 diff --git a/include/block/block.h b/include/block/block.h
 index XXXXXXX..XXXXXXX 100644
 --- a/include/block/block.h
 +++ b/include/block/block.h
@@ -XXX,XX +XXX,XX @@ int bdrv_is_allocated_above(BlockDriverState *top, BlockDriverState *base,
                              int64_t sector_num, int nb_sectors, int *pnum);
  bool bdrv_is_read_only(BlockDriverState *bs);
 +void bdrv_set_read_only(BlockDriverState *bs, bool read_only);
  bool bdrv_is_sg(BlockDriverState *bs);
  bool bdrv_is_inserted(BlockDriverState *bs);
  int bdrv_media_changed(BlockDriverState *bs);
 --
 .9.3
-diff --git a/util/aio-posix.c b/util/aio-posix.c
-index XXXXXXX..XXXXXXX 100644
---- a/util/aio-posix.c
-+++ b/util/aio-posix.c
-@@ -XXX,XX +XXX,XX @@ static bool run_poll_handlers_once(AioContext *ctx, int64_t *timeout)
-     bool progress = false;
-     AioHandler *node;
--    /*
--     * Optimization: ->io_poll() handlers often contain RCU read critical
--     * sections and we therefore see many rcu_read_lock() -> rcu_read_unlock()
--     * -> rcu_read_lock() -> ... sequences with expensive memory
--     * synchronization primitives.  Make the entire polling loop an RCU
--     * critical section because nested rcu_read_lock()/rcu_read_unlock() calls
--     * are cheap.
--     */
--    RCU_READ_LOCK_GUARD();
--
-     QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
-         if (!QLIST_IS_INSERTED(node, node_deleted) && node->io_poll &&
-             aio_node_check(ctx, node->is_external) &&
-@@ -XXX,XX +XXX,XX @@ static bool run_poll_handlers(AioContext *ctx, int64_t max_ns, int64_t *timeout)
-     trace_run_poll_handlers_begin(ctx, max_ns, *timeout);
-+    /*
-+     * Optimization: ->io_poll() handlers often contain RCU read critical
-+     * sections and we therefore see many rcu_read_lock() -> rcu_read_unlock()
-+     * -> rcu_read_lock() -> ... sequences with expensive memory
-+     * synchronization primitives.  Make the entire polling loop an RCU
-+     * critical section because nested rcu_read_lock()/rcu_read_unlock() calls
-+     * are cheap.
-+     */
-+    RCU_READ_LOCK_GUARD();
-+
-     start_time = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
-     do {
-         progress = run_poll_handlers_once(ctx, timeout);
---
-.24.1

-[PULL 9/9] aio-posix: remove idle poll handlers to improve scalability
+[Qemu-devel] [PULL v2 05/12] block: do not set BDS read_only if copy_on_read enabled
-When there are many poll handlers it's likely that some of them are idle
+A few block drivers will set the BDS read_only flag from their
-most of the time.  Remove handlers that haven't had activity recently so
+.bdrv_open() function.  This means the bs->read_only flag could
-that the polling loop scales better for guests with a large number of
+be set after we enable copy_on_read, as the BDRV_O_COPY_ON_READ
-devices.
+flag check occurs prior to the call to bdrv->bdrv_open().
-This feature only takes effect for the Linux io_uring fd monitoring
+This adds an error return to bdrv_set_read_only(), and an error will be
-implementation because it is capable of combining fd monitoring with
+returned if we try to set the BDS to read_only while copy_on_read is
-userspace polling.  The other implementations can't do that and risk
+enabled.
 starving fds in favor of poll handlers, so don't try this optimization
 when they are in use.
-IOPS improves from 10k to 105k when the guest has 100
+This patch also changes the behavior of vvfat.  Before, vvfat could
-virtio-blk-pci,num-queues=32 devices and 1 virtio-blk-pci,num-queues=1
+override the drive 'readonly' flag with its own, internal 'rw' flag.
 device for rw=randread,iodepth=1,bs=4k,ioengine=libaio on NVMe.
-[Clarified aio_poll_handlers locking discipline explanation in comment
+For instance, this -drive parameter would result in a writable image:
 after discussion with Paolo Bonzini <pbonzini@redhat.com>.
 --Stefan]
-Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
+"-drive format=vvfat,dir=/tmp/vvfat,rw,if=virtio,readonly=on"
-Link: https://lore.kernel.org/r/20200305170806.1313245-8-stefanha@redhat.com
-Message-Id: <20200305170806.1313245-8-stefanha@redhat.com>
+This is not correct.  Now, attempting to use the above -drive parameter
 will result in an error (i.e., 'rw' is incompatible with 'readonly=on').
 Signed-off-by: Jeff Cody <jcody@redhat.com>
 Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
 Reviewed-by: John Snow <jsnow@redhat.com>
 Message-id: 0c5b4c1cc2c651471b131f21376dfd5ea24d2196.1491597120.git.jcody@redhat.com
 ---
- include/block/aio.h |  8 ++++
+ block.c               | 10 +++++++++-
- util/aio-posix.c    | 93 +++++++++++++++++++++++++++++++++++++++++----
+ block/bochs.c         |  5 ++++-
- util/aio-posix.h    |  2 +
+ block/cloop.c         |  5 ++++-
- util/trace-events   |  2 +
+ block/dmg.c           |  6 +++++-
-files changed, 98 insertions(+), 7 deletions(-)
+ block/rbd.c           | 11 ++++++++++-
  block/vvfat.c         | 19 +++++++++++++++----
  include/block/block.h |  2 +-
 files changed, 48 insertions(+), 10 deletions(-)
-diff --git a/include/block/aio.h b/include/block/aio.h
+diff --git a/block.c b/block.c
 index XXXXXXX..XXXXXXX 100644
---- a/include/block/aio.h
+--- a/block.c
-+++ b/include/block/aio.h
++++ b/block.c
-@@ -XXX,XX +XXX,XX @@ struct AioContext {
+@@ -XXX,XX +XXX,XX @@ void path_combine(char *dest, int dest_size,
-     int64_t poll_grow;      /* polling time growth factor */
+     }
-     int64_t poll_shrink;    /* polling time shrink factor */
+ }
-+    /*
+-void bdrv_set_read_only(BlockDriverState *bs, bool read_only)
-+     * List of handlers participating in userspace polling.  Protected by
++int bdrv_set_read_only(BlockDriverState *bs, bool read_only, Error **errp)
 +     * ctx->list_lock.  Iterated and modified mostly by the event loop thread
 +     * from aio_poll() with ctx->list_lock incremented.  aio_set_fd_handler()
 +     * only touches the list to delete nodes if ctx->list_lock's count is zero.
 +     */
 +    AioHandlerList poll_aio_handlers;
 +
      /* Are we in polling mode or monitoring file descriptors? */
      bool poll_started;
 diff --git a/util/aio-posix.c b/util/aio-posix.c
 index XXXXXXX..XXXXXXX 100644
 --- a/util/aio-posix.c
 +++ b/util/aio-posix.c
@@ -XXX,XX +XXX,XX @@
  #include "trace.h"
  #include "aio-posix.h"
 +/* Stop userspace polling on a handler if it isn't active for some time */
 +#define POLL_IDLE_INTERVAL_NS (7 * NANOSECONDS_PER_SECOND)
 +
  bool aio_poll_disabled(AioContext *ctx)
  {
-     return atomic_read(&ctx->poll_disable_cnt);
++    /* Do not set read_only if copy_on_read is enabled */
-@@ -XXX,XX +XXX,XX @@ static bool aio_remove_fd_handler(AioContext *ctx, AioHandler *node)
++    if (bs->copy_on_read && read_only) {
-      * deleted because deleted nodes are only cleaned up while
++        error_setg(errp, "Can't set node '%s' to r/o with copy-on-read enabled",
-      * no one is walking the handlers list.
++                   bdrv_get_device_or_node_name(bs));
-      */
++        return -EINVAL;
 +    QLIST_SAFE_REMOVE(node, node_poll);
      QLIST_REMOVE(node, node);
      return true;
  }
@@ -XXX,XX +XXX,XX @@ static bool poll_set_started(AioContext *ctx, bool started)
      ctx->poll_started = started;
      qemu_lockcnt_inc(&ctx->list_lock);
 -    QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
 +    QLIST_FOREACH(node, &ctx->poll_aio_handlers, node_poll) {
          IOHandler *fn;
          if (QLIST_IS_INSERTED(node, node_deleted)) {
@@ -XXX,XX +XXX,XX @@ static void aio_free_deleted_handlers(AioContext *ctx)
      while ((node = QLIST_FIRST_RCU(&ctx->deleted_aio_handlers))) {
          QLIST_REMOVE(node, node);
          QLIST_REMOVE(node, node_deleted);
 +        QLIST_SAFE_REMOVE(node, node_poll);
          g_free(node);
      }
@@ -XXX,XX +XXX,XX @@ static bool aio_dispatch_handler(AioContext *ctx, AioHandler *node)
      revents = node->pfd.revents & node->pfd.events;
      node->pfd.revents = 0;
 +    /*
 +     * Start polling AioHandlers when they become ready because activity is
 +     * likely to continue.  Note that starvation is theoretically possible when
 +     * fdmon_supports_polling(), but only until the fd fires for the first
 +     * time.
 +     */
 +    if (!QLIST_IS_INSERTED(node, node_deleted) &&
 +        !QLIST_IS_INSERTED(node, node_poll) &&
 +        node->io_poll) {
 +        trace_poll_add(ctx, node, node->pfd.fd, revents);
 +        if (ctx->poll_started && node->io_poll_begin) {
 +            node->io_poll_begin(node->opaque);
 +        }
 +        QLIST_INSERT_HEAD(&ctx->poll_aio_handlers, node, node_poll);
 +    }
 +
-     if (!QLIST_IS_INSERTED(node, node_deleted) &&
+     bs->read_only = read_only;
-         (revents & (G_IO_IN | G_IO_HUP | G_IO_ERR)) &&
++    return 0;
          aio_node_check(ctx, node->is_external) &&
@@ -XXX,XX +XXX,XX @@ void aio_dispatch(AioContext *ctx)
      timerlistgroup_run_timers(&ctx->tlg);
  }
--static bool run_poll_handlers_once(AioContext *ctx, int64_t *timeout)
+ void bdrv_get_full_backing_filename_from_filename(const char *backed,
-+static bool run_poll_handlers_once(AioContext *ctx,
+diff --git a/block/bochs.c b/block/bochs.c
-+                                   int64_t now,
+index XXXXXXX..XXXXXXX 100644
-+                                   int64_t *timeout)
+--- a/block/bochs.c
- {
++++ b/block/bochs.c
-     bool progress = false;
+@@ -XXX,XX +XXX,XX @@ static int bochs_open(BlockDriverState *bs, QDict *options, int flags,
-     AioHandler *node;
+         return -EINVAL;
-+    AioHandler *tmp;
+     }
--    QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
+-    bdrv_set_read_only(bs, true); /* no write support yet */
--        if (!QLIST_IS_INSERTED(node, node_deleted) && node->io_poll &&
++    ret = bdrv_set_read_only(bs, true, errp); /* no write support yet */
--            aio_node_check(ctx, node->is_external) &&
++    if (ret < 0) {
-+    QLIST_FOREACH_SAFE(node, &ctx->poll_aio_handlers, node_poll, tmp) {
++        return ret;
-+        if (aio_node_check(ctx, node->is_external) &&
++    }
-             node->io_poll(node->opaque)) {
-+            node->poll_idle_timeout = now + POLL_IDLE_INTERVAL_NS;
+     ret = bdrv_pread(bs->file, 0, &bochs, sizeof(bochs));
-+
+     if (ret < 0) {
-             /*
+diff --git a/block/cloop.c b/block/cloop.c
-              * Polling was successful, exit try_poll_mode immediately
+index XXXXXXX..XXXXXXX 100644
-              * to adjust the next polling time.
+--- a/block/cloop.c
-@@ -XXX,XX +XXX,XX @@ static bool run_poll_handlers_once(AioContext *ctx, int64_t *timeout)
++++ b/block/cloop.c
-     return progress;
+@@ -XXX,XX +XXX,XX @@ static int cloop_open(BlockDriverState *bs, QDict *options, int flags,
- }
+         return -EINVAL;
+     }
-+static bool fdmon_supports_polling(AioContext *ctx)
-+{
+-    bdrv_set_read_only(bs, true);
-+    return ctx->fdmon_ops->need_wait != aio_poll_disabled;
++    ret = bdrv_set_read_only(bs, true, errp);
-+}
++    if (ret < 0) {
-+
++        return ret;
-+static bool remove_idle_poll_handlers(AioContext *ctx, int64_t now)
++    }
-+{
-+    AioHandler *node;
+     /* read header */
-+    AioHandler *tmp;
+     ret = bdrv_pread(bs->file, 128, &s->block_size, 4);
-+    bool progress = false;
+diff --git a/block/dmg.c b/block/dmg.c
-+
+index XXXXXXX..XXXXXXX 100644
-+    /*
+--- a/block/dmg.c
-+     * File descriptor monitoring implementations without userspace polling
++++ b/block/dmg.c
-+     * support suffer from starvation when a subset of handlers is polled
+@@ -XXX,XX +XXX,XX @@ static int dmg_open(BlockDriverState *bs, QDict *options, int flags,
-+     * because fds will not be processed in a timely fashion.  Don't remove
+         return -EINVAL;
-+     * idle poll handlers.
+     }
-+     */
-+    if (!fdmon_supports_polling(ctx)) {
++    ret = bdrv_set_read_only(bs, true, errp);
-+        return false;
++    if (ret < 0) {
 +        return ret;
 +    }
 +
-+    QLIST_FOREACH_SAFE(node, &ctx->poll_aio_handlers, node_poll, tmp) {
+     block_module_load_one("dmg-bz2");
-+        if (node->poll_idle_timeout == 0LL) {
+-    bdrv_set_read_only(bs, true);
-+            node->poll_idle_timeout = now + POLL_IDLE_INTERVAL_NS;
-+        } else if (now >= node->poll_idle_timeout) {
+     s->n_chunks = 0;
-+            trace_poll_remove(ctx, node, node->pfd.fd);
+     s->offsets = s->lengths = s->sectors = s->sectorcounts = NULL;
-+            node->poll_idle_timeout = 0LL;
+diff --git a/block/rbd.c b/block/rbd.c
-+            QLIST_SAFE_REMOVE(node, node_poll);
+index XXXXXXX..XXXXXXX 100644
-+            if (ctx->poll_started && node->io_poll_end) {
+--- a/block/rbd.c
-+                node->io_poll_end(node->opaque);
++++ b/block/rbd.c
-+
+@@ -XXX,XX +XXX,XX @@ static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
-+                /*
+         goto failed_shutdown;
-+                 * Final poll in case ->io_poll_end() races with an event.
+     }
-+                 * Nevermind about re-adding the handler in the rare case where
-+                 * this causes progress.
++    /* rbd_open is always r/w */
-+                 */
+     r = rbd_open(s->io_ctx, s->name, &s->image, s->snap);
-+                progress = node->io_poll(node->opaque) || progress;
+     if (r < 0) {
-+            }
+         error_setg_errno(errp, -r, "error reading header from %s", s->name);
          goto failed_open;
      }
 -    bdrv_set_read_only(bs, (s->snap != NULL));
 +    /* If we are using an rbd snapshot, we must be r/o, otherwise
 +     * leave as-is */
 +    if (s->snap != NULL) {
 +        r = bdrv_set_read_only(bs, true, &local_err);
 +        if (r < 0) {
 +            error_propagate(errp, local_err);
 +            goto failed_open;
 +        }
 +    }
-+
-+    return progress;
+     qemu_opts_del(opts);
-+}
+     return 0;
-+
+diff --git a/block/vvfat.c b/block/vvfat.c
  /* run_poll_handlers:
   * @ctx: the AioContext
   * @max_ns: maximum time to poll for, in nanoseconds
@@ -XXX,XX +XXX,XX @@ static bool run_poll_handlers(AioContext *ctx, int64_t max_ns, int64_t *timeout)
      start_time = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
      do {
 -        progress = run_poll_handlers_once(ctx, timeout);
 +        progress = run_poll_handlers_once(ctx, start_time, timeout);
          elapsed_time = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - start_time;
          max_ns = qemu_soonest_timeout(*timeout, max_ns);
          assert(!(max_ns && progress));
      } while (elapsed_time < max_ns && !ctx->fdmon_ops->need_wait(ctx));
 +    if (remove_idle_poll_handlers(ctx, start_time + elapsed_time)) {
 +        *timeout = 0;
 +        progress = true;
 +    }
 +
      /* If time has passed with no successful polling, adjust *timeout to
       * keep the same ending time.
       */
@@ -XXX,XX +XXX,XX @@ static bool run_poll_handlers(AioContext *ctx, int64_t max_ns, int64_t *timeout)
   */
  static bool try_poll_mode(AioContext *ctx, int64_t *timeout)
  {
 -    int64_t max_ns = qemu_soonest_timeout(*timeout, ctx->poll_ns);
 +    int64_t max_ns;
 +
 +    if (QLIST_EMPTY_RCU(&ctx->poll_aio_handlers)) {
 +        return false;
 +    }
 +    max_ns = qemu_soonest_timeout(*timeout, ctx->poll_ns);
      if (max_ns && !ctx->fdmon_ops->need_wait(ctx)) {
          poll_set_started(ctx, true);
 diff --git a/util/aio-posix.h b/util/aio-posix.h
 index XXXXXXX..XXXXXXX 100644
---- a/util/aio-posix.h
+--- a/block/vvfat.c
-+++ b/util/aio-posix.h
++++ b/block/vvfat.c
-@@ -XXX,XX +XXX,XX @@ struct AioHandler {
+@@ -XXX,XX +XXX,XX @@ static int vvfat_open(BlockDriverState *bs, QDict *options, int flags,
-     QLIST_ENTRY(AioHandler) node;
-     QLIST_ENTRY(AioHandler) node_ready; /* only used during aio_poll() */
+     s->current_cluster=0xffffffff;
-     QLIST_ENTRY(AioHandler) node_deleted;
-+    QLIST_ENTRY(AioHandler) node_poll;
+-    /* read only is the default for safety */
- #ifdef CONFIG_LINUX_IO_URING
+-    bdrv_set_read_only(bs, true);
-     QSLIST_ENTRY(AioHandler) node_submitted;
+     s->qcow = NULL;
-     unsigned flags; /* see fdmon-io_uring.c */
+     s->qcow_filename = NULL;
- #endif
+     s->fat2 = NULL;
-+    int64_t poll_idle_timeout; /* when to stop userspace polling */
+@@ -XXX,XX +XXX,XX @@ static int vvfat_open(BlockDriverState *bs, QDict *options, int flags,
-     bool is_external;
+     s->sector_count = cyls * heads * secs - (s->first_sectors_number - 1);
- };
+     if (qemu_opt_get_bool(opts, "rw", false)) {
-diff --git a/util/trace-events b/util/trace-events
+-        ret = enable_write_target(bs, errp);
 +        if (!bdrv_is_read_only(bs)) {
 +            ret = enable_write_target(bs, errp);
 +            if (ret < 0) {
 +                goto fail;
 +            }
 +        } else {
 +            ret = -EPERM;
 +            error_setg(errp,
 +                       "Unable to set VVFAT to 'rw' when drive is read-only");
 +            goto fail;
 +        }
 +    } else  {
 +        /* read only is the default for safety */
 +        ret = bdrv_set_read_only(bs, true, &local_err);
          if (ret < 0) {
 +            error_propagate(errp, local_err);
              goto fail;
          }
 -        bdrv_set_read_only(bs, false);
      }
      bs->total_sectors = cyls * heads * secs;
 diff --git a/include/block/block.h b/include/block/block.h
 index XXXXXXX..XXXXXXX 100644
---- a/util/trace-events
+--- a/include/block/block.h
-+++ b/util/trace-events
++++ b/include/block/block.h
-@@ -XXX,XX +XXX,XX @@ run_poll_handlers_begin(void *ctx, int64_t max_ns, int64_t timeout) "ctx %p max_
+@@ -XXX,XX +XXX,XX @@ int bdrv_is_allocated_above(BlockDriverState *top, BlockDriverState *base,
- run_poll_handlers_end(void *ctx, bool progress, int64_t timeout) "ctx %p progress %d new timeout %"PRId64
+                             int64_t sector_num, int nb_sectors, int *pnum);
- poll_shrink(void *ctx, int64_t old, int64_t new) "ctx %p old %"PRId64" new %"PRId64
- poll_grow(void *ctx, int64_t old, int64_t new) "ctx %p old %"PRId64" new %"PRId64
+ bool bdrv_is_read_only(BlockDriverState *bs);
-+poll_add(void *ctx, void *node, int fd, unsigned revents) "ctx %p node %p fd %d revents 0x%x"
+-void bdrv_set_read_only(BlockDriverState *bs, bool read_only);
-+poll_remove(void *ctx, void *node, int fd) "ctx %p node %p fd %d"
++int bdrv_set_read_only(BlockDriverState *bs, bool read_only, Error **errp);
+ bool bdrv_is_sg(BlockDriverState *bs);
- # async.c
+ bool bdrv_is_inserted(BlockDriverState *bs);
- aio_co_schedule(void *ctx, void *co) "ctx %p co %p"
+ int bdrv_media_changed(BlockDriverState *bs);
 --
-.24.1
+.9.3
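Note on 05/12: the copy-on-read guard added to bdrv_set_read_only() can be
tried outside of QEMU with a small stand-alone model. This is only a sketch
under stated assumptions. ModelBDS, model_set_read_only() and the
errbuf/errlen pair are hypothetical stand-ins for BlockDriverState,
bdrv_set_read_only() and Error **errp. Only the condition, the message text
and the -EINVAL return value are taken from the patch above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for BlockDriverState; only the fields used here. */
typedef struct ModelBDS {
    const char *node_name;
    bool read_only;
    int copy_on_read;   /* non-zero while copy-on-read is enabled */
} ModelBDS;

/*
 * Model of the setter from 05/12: refuse to switch the node to read-only
 * while copy-on-read is enabled.  errbuf/errlen are an assumption that
 * replaces QEMU's Error **errp so that the sketch has no dependencies.
 */
static int model_set_read_only(ModelBDS *bs, bool read_only,
                               char *errbuf, size_t errlen)
{
    assert(bs != NULL);

    /* Do not set read_only if copy_on_read is enabled */
    if (bs->copy_on_read && read_only) {
        if (errbuf != NULL && errlen > 0) {
            snprintf(errbuf, errlen,
                     "Can't set node '%s' to r/o with copy-on-read enabled",
                     bs->node_name);
        }
        return -EINVAL;
    }

    bs->read_only = read_only;
    return 0;
}
```

A caller would check the return value before continuing, the same way the
converted drivers (bochs, cloop, dmg, rbd, vvfat) do in this patch. On
failure the flag is left unchanged.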

-[PULL 6/9] aio-posix: simplify FDMonOps->update() prototype
+[Qemu-devel] [PULL v2 06/12] block: honor BDRV_O_ALLOW_RDWR when clearing bs->read_only
-The AioHandler *node, bool is_new arguments are more complicated to
+The BDRV_O_ALLOW_RDWR flag allows / prohibits the changing of
-think about than simply being given AioHandler *old_node, AioHandler
+the BDS 'read_only' state, but there are a few places where it
-*new_node.
+is ignored.  In the bdrv_set_read_only() helper, make sure to
 honor the flag.
-Furthermore, the new Linux io_uring file descriptor monitoring mechanism
+Signed-off-by: Jeff Cody <jcody@redhat.com>
-added by the new patch requires access to both the old and the new
+Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
-nodes.  Make this change now in preparation.
+Reviewed-by: John Snow <jsnow@redhat.com>
 Message-id: be2e5fb2d285cbece2b6d06bed54a6f56520d251.1491597120.git.jcody@redhat.com
 ---
  block.c | 7 +++++++
 file changed, 7 insertions(+)
-Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
+diff --git a/block.c b/block.c
 Link: https://lore.kernel.org/r/20200305170806.1313245-5-stefanha@redhat.com
 Message-Id: <20200305170806.1313245-5-stefanha@redhat.com>
 ---
  include/block/aio.h | 13 ++++++-------
  util/aio-posix.c    |  7 +------
  util/fdmon-epoll.c  | 21 ++++++++++++---------
  util/fdmon-poll.c   |  4 +++-
 files changed, 22 insertions(+), 23 deletions(-)
 diff --git a/include/block/aio.h b/include/block/aio.h
 index XXXXXXX..XXXXXXX 100644
---- a/include/block/aio.h
+--- a/block.c
-+++ b/include/block/aio.h
++++ b/block.c
-@@ -XXX,XX +XXX,XX @@ typedef struct {
+@@ -XXX,XX +XXX,XX @@ int bdrv_set_read_only(BlockDriverState *bs, bool read_only, Error **errp)
-     /*
+         return -EINVAL;
       * update:
       * @ctx: the AioContext
 -     * @node: the handler
 -     * @is_new: is the file descriptor already being monitored?
 +     * @old_node: the existing handler or NULL if this file descriptor is being
 +     *            monitored for the first time
 +     * @new_node: the new handler or NULL if this file descriptor is being
 +     *            removed
       *
 -     * Add/remove/modify a monitored file descriptor.  There are three cases:
 -     * 1. node->pfd.events == 0 means remove the file descriptor.
 -     * 2. !is_new means modify an already monitored file descriptor.
 -     * 3. is_new means add a new file descriptor.
 +     * Add/remove/modify a monitored file descriptor.
       *
       * Called with ctx->list_lock acquired.
       */
 -    void (*update)(AioContext *ctx, AioHandler *node, bool is_new);
 +    void (*update)(AioContext *ctx, AioHandler *old_node, AioHandler *new_node);
      /*
       * wait:
 diff --git a/util/aio-posix.c b/util/aio-posix.c
 index XXXXXXX..XXXXXXX 100644
 --- a/util/aio-posix.c
 +++ b/util/aio-posix.c
@@ -XXX,XX +XXX,XX @@ void aio_set_fd_handler(AioContext *ctx,
      atomic_set(&ctx->poll_disable_cnt,
                 atomic_read(&ctx->poll_disable_cnt) + poll_disable_change);
 -    if (new_node) {
 -        ctx->fdmon_ops->update(ctx, new_node, is_new);
 -    } else if (node) {
 -        /* Unregister deleted fd_handler */
 -        ctx->fdmon_ops->update(ctx, node, false);
 -    }
 +    ctx->fdmon_ops->update(ctx, node, new_node);
      qemu_lockcnt_unlock(&ctx->list_lock);
      aio_notify(ctx);
 diff --git a/util/fdmon-epoll.c b/util/fdmon-epoll.c
 index XXXXXXX..XXXXXXX 100644
 --- a/util/fdmon-epoll.c
 +++ b/util/fdmon-epoll.c
@@ -XXX,XX +XXX,XX @@ static inline int epoll_events_from_pfd(int pfd_events)
             (pfd_events & G_IO_ERR ? EPOLLERR : 0);
  }
 -static void fdmon_epoll_update(AioContext *ctx, AioHandler *node, bool is_new)
 +static void fdmon_epoll_update(AioContext *ctx,
 +                               AioHandler *old_node,
 +                               AioHandler *new_node)
  {
 -    struct epoll_event event;
 +    struct epoll_event event = {
 +        .data.ptr = new_node,
 +        .events = new_node ? epoll_events_from_pfd(new_node->pfd.events) : 0,
 +    };
      int r;
 -    int ctl;
 -    if (!node->pfd.events) {
 -        ctl = EPOLL_CTL_DEL;
 +    if (!new_node) {
 +        r = epoll_ctl(ctx->epollfd, EPOLL_CTL_DEL, old_node->pfd.fd, &event);
 +    } else if (!old_node) {
 +        r = epoll_ctl(ctx->epollfd, EPOLL_CTL_ADD, new_node->pfd.fd, &event);
      } else {
 -        event.data.ptr = node;
 -        event.events = epoll_events_from_pfd(node->pfd.events);
 -        ctl = is_new ? EPOLL_CTL_ADD : EPOLL_CTL_MOD;
 +        r = epoll_ctl(ctx->epollfd, EPOLL_CTL_MOD, new_node->pfd.fd, &event);
      }
--    r = epoll_ctl(ctx->epollfd, ctl, node->pfd.fd, &event);
++    /* Do not clear read_only if it is prohibited */
-     if (r) {
++    if (!read_only && !(bs->open_flags & BDRV_O_ALLOW_RDWR)) {
-         fdmon_epoll_disable(ctx);
++        error_setg(errp, "Node '%s' is read only",
-     }
++                   bdrv_get_device_or_node_name(bs));
-diff --git a/util/fdmon-poll.c b/util/fdmon-poll.c
++        return -EPERM;
-index XXXXXXX..XXXXXXX 100644
++    }
---- a/util/fdmon-poll.c
++
-+++ b/util/fdmon-poll.c
+     bs->read_only = read_only;
-@@ -XXX,XX +XXX,XX @@ static int fdmon_poll_wait(AioContext *ctx, AioHandlerList *ready_list,
+     return 0;
      return ret;
  }
 -static void fdmon_poll_update(AioContext *ctx, AioHandler *node, bool is_new)
 +static void fdmon_poll_update(AioContext *ctx,
 +                              AioHandler *old_node,
 +                              AioHandler *new_node)
  {
      /* Do nothing, AioHandler already contains the state we'll need */
  }
 --
-.24.1
+.9.3
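Note on 06/12: the rule "read_only may only be cleared when
BDRV_O_ALLOW_RDWR is set" can be modelled in isolation as follows. This is a
sketch, not the QEMU code. ModelNode, model_update_read_only() and the
MODEL_O_ALLOW_RDWR bit value are hypothetical; only the condition and the
-EPERM return value come from the hunk above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit; the real BDRV_O_ALLOW_RDWR value is not assumed here. */
#define MODEL_O_ALLOW_RDWR 0x0800

/* Hypothetical stand-in for the two BlockDriverState fields involved. */
typedef struct ModelNode {
    int open_flags;
    bool read_only;
} ModelNode;

/*
 * Setting read_only is always accepted in this model; clearing it is only
 * accepted when the node was opened with the allow-read-write flag.
 */
static int model_update_read_only(ModelNode *bs, bool read_only)
{
    assert(bs != NULL);

    /* Do not clear read_only if it is prohibited */
    if (!read_only && !(bs->open_flags & MODEL_O_ALLOW_RDWR)) {
        return -EPERM;
    }

    bs->read_only = read_only;
    return 0;
}
```

The check is asymmetric on purpose: it only rejects the transition to
writable, so a node without the flag can still be made read-only.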

-[PULL 8/9] aio-posix: support userspace polling of fd monitoring
+[Qemu-devel] [PULL v2 07/12] block: code movement
-Unlike ppoll(2) and epoll(7), Linux io_uring completions can be polled
+Move bdrv_is_read_only() up with its friends.
 from userspace.  Previously userspace polling was only allowed when all
 AioHandler's had an ->io_poll() callback.  This prevented starvation of
 fds by userspace pollable handlers.
-Add the FDMonOps->need_wait() callback that enables userspace polling
+Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
-even when some AioHandlers lack ->io_poll().
+Reviewed-by: John Snow <jsnow@redhat.com>
 Signed-off-by: Jeff Cody <jcody@redhat.com>
 Message-id: 73b2399459760c32506f9407efb9dddb3a2789de.1491597120.git.jcody@redhat.com
 ---
  block.c | 10 +++++-----
 file changed, 5 insertions(+), 5 deletions(-)
-For example, it's now possible to do userspace polling when a TCP/IP
+diff --git a/block.c b/block.c
 socket is monitored thanks to Linux io_uring.
 Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
 Link: https://lore.kernel.org/r/20200305170806.1313245-7-stefanha@redhat.com
 Message-Id: <20200305170806.1313245-7-stefanha@redhat.com>
 ---
  include/block/aio.h   | 19 +++++++++++++++++++
  util/aio-posix.c      | 11 ++++++++---
  util/fdmon-epoll.c    |  1 +
  util/fdmon-io_uring.c |  6 ++++++
  util/fdmon-poll.c     |  1 +
 files changed, 35 insertions(+), 3 deletions(-)
 diff --git a/include/block/aio.h b/include/block/aio.h
 index XXXXXXX..XXXXXXX 100644
---- a/include/block/aio.h
+--- a/block.c
-+++ b/include/block/aio.h
++++ b/block.c
-@@ -XXX,XX +XXX,XX @@ struct ThreadPool;
+@@ -XXX,XX +XXX,XX @@ void path_combine(char *dest, int dest_size,
- struct LinuxAioState;
+     }
- struct LuringState;
+ }
-+/* Is polling disabled? */
++bool bdrv_is_read_only(BlockDriverState *bs)
 +bool aio_poll_disabled(AioContext *ctx);
 +
  /* Callbacks for file descriptor monitoring implementations */
  typedef struct {
      /*
@@ -XXX,XX +XXX,XX @@ typedef struct {
       * Returns: number of ready file descriptors.
       */
      int (*wait)(AioContext *ctx, AioHandlerList *ready_list, int64_t timeout);
 +
 +    /*
 +     * need_wait:
 +     * @ctx: the AioContext
 +     *
 +     * Tell aio_poll() when to stop userspace polling early because ->wait()
 +     * has fds ready.
 +     *
 +     * File descriptor monitoring implementations that cannot poll fd readiness
 +     * from userspace should use aio_poll_disabled() here.  This ensures that
 +     * file descriptors are not starved by handlers that frequently make
 +     * progress via userspace polling.
 +     *
 +     * Returns: true if ->wait() should be called, false otherwise.
 +     */
 +    bool (*need_wait)(AioContext *ctx);
  } FDMonOps;
  /*
 diff --git a/util/aio-posix.c b/util/aio-posix.c
 index XXXXXXX..XXXXXXX 100644
 --- a/util/aio-posix.c
 +++ b/util/aio-posix.c
@@ -XXX,XX +XXX,XX @@
  #include "trace.h"
  #include "aio-posix.h"
 +bool aio_poll_disabled(AioContext *ctx)
 +{
-+    return atomic_read(&ctx->poll_disable_cnt);
++    return bs->read_only;
 +}
 +
- void aio_add_ready_handler(AioHandlerList *ready_list,
+ int bdrv_set_read_only(BlockDriverState *bs, bool read_only, Error **errp)
                             AioHandler *node,
                             int revents)
@@ -XXX,XX +XXX,XX @@ static bool run_poll_handlers(AioContext *ctx, int64_t max_ns, int64_t *timeout)
          elapsed_time = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - start_time;
          max_ns = qemu_soonest_timeout(*timeout, max_ns);
          assert(!(max_ns && progress));
 -    } while (elapsed_time < max_ns && !atomic_read(&ctx->poll_disable_cnt));
 +    } while (elapsed_time < max_ns && !ctx->fdmon_ops->need_wait(ctx));
      /* If time has passed with no successful polling, adjust *timeout to
       * keep the same ending time.
@@ -XXX,XX +XXX,XX @@ static bool try_poll_mode(AioContext *ctx, int64_t *timeout)
  {
-     int64_t max_ns = qemu_soonest_timeout(*timeout, ctx->poll_ns);
+     /* Do not set read_only if copy_on_read is enabled */
+@@ -XXX,XX +XXX,XX @@ void bdrv_get_geometry(BlockDriverState *bs, uint64_t *nb_sectors_ptr)
--    if (max_ns && !atomic_read(&ctx->poll_disable_cnt)) {
+     *nb_sectors_ptr = nb_sectors < 0 ? 0 : nb_sectors;
 +    if (max_ns && !ctx->fdmon_ops->need_wait(ctx)) {
          poll_set_started(ctx, true);
          if (run_poll_handlers(ctx, max_ns, timeout)) {
@@ -XXX,XX +XXX,XX @@ bool aio_poll(AioContext *ctx, bool blocking)
      /* If polling is allowed, non-blocking aio_poll does not need the
       * system call---a single round of run_poll_handlers_once suffices.
       */
 -    if (timeout || atomic_read(&ctx->poll_disable_cnt)) {
 +    if (timeout || ctx->fdmon_ops->need_wait(ctx)) {
          ret = ctx->fdmon_ops->wait(ctx, &ready_list, timeout);
      }
 diff --git a/util/fdmon-epoll.c b/util/fdmon-epoll.c
 index XXXXXXX..XXXXXXX 100644
 --- a/util/fdmon-epoll.c
 +++ b/util/fdmon-epoll.c
@@ -XXX,XX +XXX,XX @@ out:
  static const FDMonOps fdmon_epoll_ops = {
      .update = fdmon_epoll_update,
      .wait = fdmon_epoll_wait,
 +    .need_wait = aio_poll_disabled,
  };
  static bool fdmon_epoll_try_enable(AioContext *ctx)
 diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
 index XXXXXXX..XXXXXXX 100644
 --- a/util/fdmon-io_uring.c
 +++ b/util/fdmon-io_uring.c
@@ -XXX,XX +XXX,XX @@ static int fdmon_io_uring_wait(AioContext *ctx, AioHandlerList *ready_list,
      return process_cq_ring(ctx, ready_list);
  }
-+static bool fdmon_io_uring_need_wait(AioContext *ctx)
+-bool bdrv_is_read_only(BlockDriverState *bs)
-+{
+-{
-+    return io_uring_cq_ready(&ctx->fdmon_io_uring);
+-    return bs->read_only;
-+}
+-}
-+
+-
- static const FDMonOps fdmon_io_uring_ops = {
+ bool bdrv_is_sg(BlockDriverState *bs)
-     .update = fdmon_io_uring_update,
+ {
-     .wait = fdmon_io_uring_wait,
+     return bs->sg;
 +    .need_wait = fdmon_io_uring_need_wait,
  };
  bool fdmon_io_uring_setup(AioContext *ctx)
 diff --git a/util/fdmon-poll.c b/util/fdmon-poll.c
 index XXXXXXX..XXXXXXX 100644
 --- a/util/fdmon-poll.c
 +++ b/util/fdmon-poll.c
@@ -XXX,XX +XXX,XX @@ static void fdmon_poll_update(AioContext *ctx,
  const FDMonOps fdmon_poll_ops = {
      .update = fdmon_poll_update,
      .wait = fdmon_poll_wait,
 +    .need_wait = aio_poll_disabled,
  };
 --
-.24.1
+.9.3
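
The need_wait hook in the hunks above lets each fd monitoring backend decide whether the event loop must fall through to the blocking wait() call instead of polling in userspace. A minimal sketch of that dispatch follows. The stand-in types and names (Ctx, MonOps, may_poll, cq_ready) are hypothetical simplifications for illustration, not QEMU identifiers.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for AioContext and FDMonOps. */
typedef struct Ctx Ctx;

typedef struct {
    bool (*need_wait)(Ctx *ctx);   /* must the event loop call wait()? */
} MonOps;

struct Ctx {
    const MonOps *ops;
    int poll_disable_cnt;          /* > 0 while userspace polling is disabled */
    unsigned cq_ready;             /* pretend count of queued completions */
};

/* poll/epoll-style backends: wait only when polling is disabled. */
static bool poll_disabled(Ctx *ctx)
{
    return ctx->poll_disable_cnt > 0;
}

/* io_uring-style backend: wait when completions are already queued. */
static bool uring_need_wait(Ctx *ctx)
{
    return ctx->cq_ready > 0;
}

static const MonOps poll_ops = { .need_wait = poll_disabled };
static const MonOps uring_ops = { .need_wait = uring_need_wait };

/*
 * Same shape as the condition in try_poll_mode(): poll only if there is
 * a time budget and the backend does not insist on waiting.
 */
static bool may_poll(Ctx *ctx, long max_ns)
{
    return max_ns > 0 && !ctx->ops->need_wait(ctx);
}
```

With poll_ops, may_poll() turns false as soon as poll_disable_cnt is non-zero. With uring_ops it depends only on whether completions are pending, which is what the per-backend callback is meant to allow.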

-[PULL 1/9] qemu/queue.h: clear linked list pointers on remove
+[Qemu-devel] [PULL v2 08/12] block: introduce bdrv_can_set_read_only()
-Do not leave stale linked list pointers around after removal.  It's
+Introduce check function for setting read_only flags.  Will return < 0 on
-safer to set them to NULL so that use-after-removal results in an
+error, with appropriate Error value set.  Does not alter any flags.
 immediate segfault.
-The RCU queue removal macros are unchanged since nodes may still be
+Signed-off-by: Jeff Cody <jcody@redhat.com>
-traversed after removal.
+Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
 Reviewed-by: John Snow <jsnow@redhat.com>
 Message-id: e2bba34ac3bc76a0c42adc390413f358ae0566e8.1491597120.git.jcody@redhat.com
 ---
  block.c               | 14 +++++++++++++-
  include/block/block.h |  1 +
  2 files changed, 14 insertions(+), 1 deletion(-)
-Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
+diff --git a/block.c b/block.c
-Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
+index XXXXXXX..XXXXXXX 100644
-Link: https://lore.kernel.org/r/20200224103406.1894923-2-stefanha@redhat.com
+--- a/block.c
-Message-Id: <20200224103406.1894923-2-stefanha@redhat.com>
++++ b/block.c
----
+@@ -XXX,XX +XXX,XX @@ bool bdrv_is_read_only(BlockDriverState *bs)
- include/qemu/queue.h | 19 +++++++++++++++----
+     return bs->read_only;
- 1 file changed, 15 insertions(+), 4 deletions(-)
+ }
 -int bdrv_set_read_only(BlockDriverState *bs, bool read_only, Error **errp)
 +int bdrv_can_set_read_only(BlockDriverState *bs, bool read_only, Error **errp)
  {
      /* Do not set read_only if copy_on_read is enabled */
      if (bs->copy_on_read && read_only) {
@@ -XXX,XX +XXX,XX @@ int bdrv_set_read_only(BlockDriverState *bs, bool read_only, Error **errp)
          return -EPERM;
      }
 +    return 0;
 +}
 +
 +int bdrv_set_read_only(BlockDriverState *bs, bool read_only, Error **errp)
 +{
 +    int ret = 0;
 +
 +    ret = bdrv_can_set_read_only(bs, read_only, errp);
 +    if (ret < 0) {
 +        return ret;
 +    }
 +
      bs->read_only = read_only;
      return 0;
  }
 diff --git a/include/block/block.h b/include/block/block.h
 index XXXXXXX..XXXXXXX 100644
 --- a/include/block/block.h
 +++ b/include/block/block.h
@@ -XXX,XX +XXX,XX @@ int bdrv_is_allocated_above(BlockDriverState *top, BlockDriverState *base,
                              int64_t sector_num, int nb_sectors, int *pnum);
  bool bdrv_is_read_only(BlockDriverState *bs);
 +int bdrv_can_set_read_only(BlockDriverState *bs, bool read_only, Error **errp);
  int bdrv_set_read_only(BlockDriverState *bs, bool read_only, Error **errp);
  bool bdrv_is_sg(BlockDriverState *bs);
  bool bdrv_is_inserted(BlockDriverState *bs);
 --
 .9.3
-diff --git a/include/qemu/queue.h b/include/qemu/queue.h
-index XXXXXXX..XXXXXXX 100644
---- a/include/qemu/queue.h
-+++ b/include/qemu/queue.h
-@@ -XXX,XX +XXX,XX @@ struct {                                                                \
-                 (elm)->field.le_next->field.le_prev =                   \
-                     (elm)->field.le_prev;                               \
-         *(elm)->field.le_prev = (elm)->field.le_next;                   \
-+        (elm)->field.le_next = NULL;                                    \
-+        (elm)->field.le_prev = NULL;                                    \
- } while (/*CONSTCOND*/0)
- /*
-@@ -XXX,XX +XXX,XX @@ struct {                                                                \
- } while (/*CONSTCOND*/0)
- #define QSLIST_REMOVE_HEAD(head, field) do {                             \
--        (head)->slh_first = (head)->slh_first->field.sle_next;          \
-+        typeof((head)->slh_first) elm = (head)->slh_first;               \
-+        (head)->slh_first = elm->field.sle_next;                         \
-+        elm->field.sle_next = NULL;                                      \
- } while (/*CONSTCOND*/0)
- #define QSLIST_REMOVE_AFTER(slistelm, field) do {                       \
--        (slistelm)->field.sle_next =                                    \
--            QSLIST_NEXT(QSLIST_NEXT((slistelm), field), field);         \
-+        typeof(slistelm) next = (slistelm)->field.sle_next;             \
-+        (slistelm)->field.sle_next = next->field.sle_next;              \
-+        next->field.sle_next = NULL;                                    \
- } while (/*CONSTCOND*/0)
- #define QSLIST_REMOVE(head, elm, type, field) do {                      \
-@@ -XXX,XX +XXX,XX @@ struct {                                                                \
-         while (curelm->field.sle_next != (elm))                         \
-             curelm = curelm->field.sle_next;                            \
-         curelm->field.sle_next = curelm->field.sle_next->field.sle_next; \
-+        (elm)->field.sle_next = NULL;                                   \
-     }                                                                   \
- } while (/*CONSTCOND*/0)
-@@ -XXX,XX +XXX,XX @@ struct {                                                                \
- } while (/*CONSTCOND*/0)
- #define QSIMPLEQ_REMOVE_HEAD(head, field) do {                          \
--    if (((head)->sqh_first = (head)->sqh_first->field.sqe_next) == NULL)\
-+    typeof((head)->sqh_first) elm = (head)->sqh_first;                  \
-+    if (((head)->sqh_first = elm->field.sqe_next) == NULL)              \
-         (head)->sqh_last = &(head)->sqh_first;                          \
-+    elm->field.sqe_next = NULL;                                         \
- } while (/*CONSTCOND*/0)
- #define QSIMPLEQ_SPLIT_AFTER(head, elm, field, removed) do {            \
-@@ -XXX,XX +XXX,XX @@ struct {                                                                \
-         if ((curelm->field.sqe_next =                                   \
-             curelm->field.sqe_next->field.sqe_next) == NULL)            \
-                 (head)->sqh_last = &(curelm)->field.sqe_next;           \
-+        (elm)->field.sqe_next = NULL;                                   \
-     }                                                                   \
- } while (/*CONSTCOND*/0)
-@@ -XXX,XX +XXX,XX @@ union {                                                                 \
-             (head)->tqh_circ.tql_prev = (elm)->field.tqe_circ.tql_prev; \
-         (elm)->field.tqe_circ.tql_prev->tql_next = (elm)->field.tqe_next; \
-         (elm)->field.tqe_circ.tql_prev = NULL;                          \
-+        (elm)->field.tqe_circ.tql_next = NULL;                          \
-+        (elm)->field.tqe_next = NULL;                                   \
- } while (/*CONSTCOND*/0)
- /* remove @left, @right and all elements in between from @head */
---
-.24.1
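
The bdrv_can_set_read_only() patch above splits validation from mutation: the check reports whether a read-only change would be allowed, and the setter calls the check before touching the flag. A minimal sketch of that split follows. The Node type and node_* functions are hypothetical stand-ins, and the error code is illustrative rather than taken from the patch.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, pared-down stand-in for BlockDriverState. */
typedef struct {
    bool read_only;
    bool copy_on_read;
} Node;

/* Pure check: says whether the change is allowed, alters nothing. */
static int node_can_set_read_only(const Node *n, bool read_only)
{
    /* Do not allow read-only while copy-on-read is enabled. */
    if (n->copy_on_read && read_only) {
        return -EINVAL;            /* illustrative error code */
    }
    return 0;
}

/* Setter: validate first, then commit the flag. */
static int node_set_read_only(Node *n, bool read_only)
{
    int ret = node_can_set_read_only(n, read_only);

    if (ret < 0) {
        return ret;
    }
    n->read_only = read_only;
    return 0;
}
```

Keeping the check free of side effects means callers that only need to know whether a change is possible can reuse it without risking a half-applied state.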

-New patch
+[Qemu-devel] [PULL v2 09/12] block: use bdrv_can_set_read_only() during reopen
+Signed-off-by: Jeff Cody <jcody@redhat.com>
+Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
+Reviewed-by: John Snow <jsnow@redhat.com>
+Message-id: 00aed7ffdd7be4b9ed9ce1007d50028a72b34ebe.1491597120.git.jcody@redhat.com
+---
+ block.c | 14 ++++++++------
+ 1 file changed, 8 insertions(+), 6 deletions(-)
+diff --git a/block.c b/block.c
+index XXXXXXX..XXXXXXX 100644
+--- a/block.c
++++ b/block.c
+@@ -XXX,XX +XXX,XX @@ int bdrv_reopen_prepare(BDRVReopenState *reopen_state, BlockReopenQueue *queue,
+     BlockDriver *drv;
+     QemuOpts *opts;
+     const char *value;
++    bool read_only;
+     assert(reopen_state != NULL);
+     assert(reopen_state->bs->drv != NULL);
+@@ -XXX,XX +XXX,XX @@ int bdrv_reopen_prepare(BDRVReopenState *reopen_state, BlockReopenQueue *queue,
+         qdict_put(reopen_state->options, "driver", qstring_from_str(value));
+     }
+-    /* if we are to stay read-only, do not allow permission change
+-     * to r/w */
+-    if (!(reopen_state->bs->open_flags & BDRV_O_ALLOW_RDWR) &&
+-        reopen_state->flags & BDRV_O_RDWR) {
+-        error_setg(errp, "Node '%s' is read only",
+-                   bdrv_get_device_or_node_name(reopen_state->bs));
++    /* If we are to stay read-only, do not allow permission change
++     * to r/w. Attempting to set to r/w may fail if either BDRV_O_ALLOW_RDWR is
++     * not set, or if the BDS still has copy_on_read enabled */
++    read_only = !(reopen_state->flags & BDRV_O_RDWR);
++    ret = bdrv_can_set_read_only(reopen_state->bs, read_only, &local_err);
++    if (local_err) {
++        error_propagate(errp, local_err);
+         goto error;
+     }
+--
+.9.3
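
The reopen patch above derives the requested mode from the new flags and runs only the check during the prepare phase, leaving the actual flag change for later. A minimal sketch of that two-phase flow follows. NODE_O_RDWR, Node, reopen_prepare and reopen_commit are hypothetical names, and the two failure conditions follow the comment in the patch (BDRV_O_ALLOW_RDWR not set, or copy_on_read enabled) with illustrative error codes.

```c
#include <errno.h>
#include <stdbool.h>

enum { NODE_O_RDWR = 1 << 0 };     /* hypothetical, stands in for BDRV_O_RDWR */

/* Hypothetical, pared-down stand-in for BlockDriverState. */
typedef struct {
    bool read_only;
    bool allow_rdwr;               /* stands in for BDRV_O_ALLOW_RDWR */
    bool copy_on_read;
} Node;

/* Check only: never modifies the node. */
static int node_can_set_read_only(const Node *n, bool read_only)
{
    if (n->copy_on_read && read_only) {
        return -EINVAL;            /* illustrative error code */
    }
    if (!read_only && !n->allow_rdwr) {
        return -EPERM;             /* illustrative error code */
    }
    return 0;
}

/* Prepare phase: derive the requested mode from the new flags, validate. */
static int reopen_prepare(const Node *n, int new_flags)
{
    bool read_only = !(new_flags & NODE_O_RDWR);

    return node_can_set_read_only(n, read_only);
}

/* Commit phase: only reached after prepare succeeded. */
static void reopen_commit(Node *n, int new_flags)
{
    n->read_only = !(new_flags & NODE_O_RDWR);
}
```

Because prepare never mutates the node, a failed reopen can be abandoned without any rollback of the read-only state.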

-[PULL 2/9] aio-posix: remove confusing QLIST_SAFE_REMOVE()
+[Qemu-devel] [PULL v2 10/12] block/rbd - update variable names to more apt names
-QLIST_SAFE_REMOVE() is confusing here because the node must be on the
+Update 'clientname' to be 'user', which tracks better with both
-list.  We actually just wanted to clear the linked list pointers when
+the QAPI and rados variable naming.
 removing it from the list.  QLIST_REMOVE() now does this, so switch to
 it.
-Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
+Update 'name' to be 'image_name', as it indicates the rbd image.
-Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
+Naming it 'image' would have been ideal, but we are using that for
-Link: https://lore.kernel.org/r/20200224103406.1894923-3-stefanha@redhat.com
+the rados_image_t value returned by rbd_open().
-Message-Id: <20200224103406.1894923-3-stefanha@redhat.com>
 Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
 Signed-off-by: Jeff Cody <jcody@redhat.com>
 Reviewed-by: John Snow <jsnow@redhat.com>
 Message-id: b7ec1fb2e1cf36f9b6911631447a5b0422590b7d.1491597120.git.jcody@redhat.com
 ---
- util/aio-posix.c | 2 +-
+ block/rbd.c | 33 +++++++++++++++++----------------
- 1 file changed, 1 insertion(+), 1 deletion(-)
+ 1 file changed, 17 insertions(+), 16 deletions(-)
-diff --git a/util/aio-posix.c b/util/aio-posix.c
+diff --git a/block/rbd.c b/block/rbd.c
 index XXXXXXX..XXXXXXX 100644
---- a/util/aio-posix.c
+--- a/block/rbd.c
-+++ b/util/aio-posix.c
++++ b/block/rbd.c
-@@ -XXX,XX +XXX,XX @@ static bool aio_dispatch_ready_handlers(AioContext *ctx,
+@@ -XXX,XX +XXX,XX @@ typedef struct BDRVRBDState {
-     AioHandler *node;
+     rados_t cluster;
+     rados_ioctx_t io_ctx;
-     while ((node = QLIST_FIRST(ready_list))) {
+     rbd_image_t image;
--        QLIST_SAFE_REMOVE(node, node_ready);
+-    char *name;
-+        QLIST_REMOVE(node, node_ready);
++    char *image_name;
-         progress = aio_dispatch_handler(ctx, node) || progress;
+     char *snap;
  } BDRVRBDState;
@@ -XXX,XX +XXX,XX @@ static int qemu_rbd_create(const char *filename, QemuOpts *opts, Error **errp)
      int64_t bytes = 0;
      int64_t objsize;
      int obj_order = 0;
 -    const char *pool, *name, *conf, *clientname, *keypairs;
 +    const char *pool, *image_name, *conf, *user, *keypairs;
      const char *secretid;
      rados_t cluster;
      rados_ioctx_t io_ctx;
@@ -XXX,XX +XXX,XX @@ static int qemu_rbd_create(const char *filename, QemuOpts *opts, Error **errp)
       */
      pool       = qdict_get_try_str(options, "pool");
      conf       = qdict_get_try_str(options, "conf");
 -    clientname = qdict_get_try_str(options, "user");
 -    name       = qdict_get_try_str(options, "image");
 +    user       = qdict_get_try_str(options, "user");
 +    image_name = qdict_get_try_str(options, "image");
      keypairs   = qdict_get_try_str(options, "=keyvalue-pairs");
 -    ret = rados_create(&cluster, clientname);
 +    ret = rados_create(&cluster, user);
      if (ret < 0) {
          error_setg_errno(errp, -ret, "error initializing");
          goto exit;
@@ -XXX,XX +XXX,XX @@ static int qemu_rbd_create(const char *filename, QemuOpts *opts, Error **errp)
          goto shutdown;
      }
+-    ret = rbd_create(io_ctx, name, bytes, &obj_order);
++    ret = rbd_create(io_ctx, image_name, bytes, &obj_order);
+     if (ret < 0) {
+         error_setg_errno(errp, -ret, "error rbd create");
+     }
+@@ -XXX,XX +XXX,XX @@ static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
+                          Error **errp)
+ {
+     BDRVRBDState *s = bs->opaque;
+-    const char *pool, *snap, *conf, *clientname, *name, *keypairs;
++    const char *pool, *snap, *conf, *user, *image_name, *keypairs;
+     const char *secretid;
+     QemuOpts *opts;
+     Error *local_err = NULL;
+@@ -XXX,XX +XXX,XX @@ static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
+     pool           = qemu_opt_get(opts, "pool");
+     conf           = qemu_opt_get(opts, "conf");
+     snap           = qemu_opt_get(opts, "snapshot");
+-    clientname     = qemu_opt_get(opts, "user");
+-    name           = qemu_opt_get(opts, "image");
++    user           = qemu_opt_get(opts, "user");
++    image_name     = qemu_opt_get(opts, "image");
+     keypairs       = qemu_opt_get(opts, "=keyvalue-pairs");
+-    if (!pool || !name) {
++    if (!pool || !image_name) {
+         error_setg(errp, "Parameters 'pool' and 'image' are required");
+         r = -EINVAL;
+         goto failed_opts;
+     }
+-    r = rados_create(&s->cluster, clientname);
++    r = rados_create(&s->cluster, user);
+     if (r < 0) {
+         error_setg_errno(errp, -r, "error initializing");
+         goto failed_opts;
+     }
+     s->snap = g_strdup(snap);
+-    s->name = g_strdup(name);
++    s->image_name = g_strdup(image_name);
+     /* try default location when conf=NULL, but ignore failure */
+     r = rados_conf_read_file(s->cluster, conf);
+@@ -XXX,XX +XXX,XX @@ static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
+     }
+     /* rbd_open is always r/w */
+-    r = rbd_open(s->io_ctx, s->name, &s->image, s->snap);
++    r = rbd_open(s->io_ctx, s->image_name, &s->image, s->snap);
+     if (r < 0) {
+-        error_setg_errno(errp, -r, "error reading header from %s", s->name);
++        error_setg_errno(errp, -r, "error reading header from %s",
++                         s->image_name);
+         goto failed_open;
+     }
+@@ -XXX,XX +XXX,XX @@ failed_open:
+ failed_shutdown:
+     rados_shutdown(s->cluster);
+     g_free(s->snap);
+-    g_free(s->name);
++    g_free(s->image_name);
+ failed_opts:
+     qemu_opts_del(opts);
+     g_free(mon_host);
+@@ -XXX,XX +XXX,XX @@ static void qemu_rbd_close(BlockDriverState *bs)
+     rbd_close(s->image);
+     rados_ioctx_destroy(s->io_ctx);
+     g_free(s->snap);
+-    g_free(s->name);
++    g_free(s->image_name);
+     rados_shutdown(s->cluster);
+ }
 --
-.24.1
+.9.3

-[PULL 7/9] aio-posix: add io_uring fd monitoring implementation
+[Qemu-devel] [PULL v2 11/12] block/rbd: Add support for reopen()
-The recent Linux io_uring API has several advantages over ppoll(2) and
+This adds support for reopen in rbd, for changing between r/w and r/o.
 epoll(7).  Details are given in the source code.
-Add an io_uring implementation and make it the default on Linux.
+Note that this is only a flag change, but we will block a change from
-Performance is the same as with epoll(7) but later patches add
+r/o to r/w if we are using an RBD internal snapshot.
 optimizations that take advantage of io_uring.
-It is necessary to change how aio_set_fd_handler() deals with deleting
+Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
-AioHandlers since removing monitored file descriptors is asynchronous in
+Signed-off-by: Jeff Cody <jcody@redhat.com>
-io_uring.  fdmon_io_uring_remove() marks the AioHandler deleted and
+Reviewed-by: John Snow <jsnow@redhat.com>
-aio_set_fd_handler() will let it handle deletion in that case.
+Message-id: d4e87539167ec6527d44c97b164eabcccf96e4f3.1491597120.git.jcody@redhat.com
 ---
  block/rbd.c | 21 +++++++++++++++++++++
  1 file changed, 21 insertions(+)
-Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
+diff --git a/block/rbd.c b/block/rbd.c
-Link: https://lore.kernel.org/r/20200305170806.1313245-6-stefanha@redhat.com
+index XXXXXXX..XXXXXXX 100644
-Message-Id: <20200305170806.1313245-6-stefanha@redhat.com>
+--- a/block/rbd.c
----
++++ b/block/rbd.c
- configure             |   5 +
+@@ -XXX,XX +XXX,XX @@ failed_opts:
- include/block/aio.h   |   9 ++
+     return r;
- util/Makefile.objs    |   1 +
+ }
- util/aio-posix.c      |  20 ++-
  util/aio-posix.h      |  20 ++-
  util/fdmon-io_uring.c | 326 ++++++++++++++++++++++++++++++++++++++++++
 files changed, 376 insertions(+), 5 deletions(-)
  create mode 100644 util/fdmon-io_uring.c
 diff --git a/configure b/configure
 index XXXXXXX..XXXXXXX 100755
 --- a/configure
 +++ b/configure
@@ -XXX,XX +XXX,XX @@ if test "$linux_io_uring" != "no" ; then
      linux_io_uring_cflags=$($pkg_config --cflags liburing)
      linux_io_uring_libs=$($pkg_config --libs liburing)
      linux_io_uring=yes
 +
-+    # io_uring is used in libqemuutil.a where per-file -libs variables are not
++/* Since RBD is currently always opened R/W via the API,
-+    # seen by programs linking the archive.  It's not ideal, but just add the
++ * we just need to check if we are using a snapshot or not, in
-+    # library dependency globally.
++ * order to determine if we will allow it to be R/W */
-+    LIBS="$linux_io_uring_libs $LIBS"
++static int qemu_rbd_reopen_prepare(BDRVReopenState *state,
-   else
++                                   BlockReopenQueue *queue, Error **errp)
-     if test "$linux_io_uring" = "yes" ; then
++{
-       feature_not_found "linux io_uring" "Install liburing devel"
++    BDRVRBDState *s = state->bs->opaque;
-diff --git a/include/block/aio.h b/include/block/aio.h
++    int ret = 0;
 index XXXXXXX..XXXXXXX 100644
 --- a/include/block/aio.h
 +++ b/include/block/aio.h
@@ -XXX,XX +XXX,XX @@
  #ifndef QEMU_AIO_H
  #define QEMU_AIO_H
 +#ifdef CONFIG_LINUX_IO_URING
 +#include <liburing.h>
 +#endif
  #include "qemu/queue.h"
  #include "qemu/event_notifier.h"
  #include "qemu/thread.h"
@@ -XXX,XX +XXX,XX @@ struct BHListSlice {
      QSIMPLEQ_ENTRY(BHListSlice) next;
  };
 +typedef QSLIST_HEAD(, AioHandler) AioHandlerSList;
 +
- struct AioContext {
++    if (s->snap && state->flags & BDRV_O_RDWR) {
-     GSource source;
++        error_setg(errp,
++                   "Cannot change node '%s' to r/w when using RBD snapshot",
-@@ -XXX,XX +XXX,XX @@ struct AioContext {
++                   bdrv_get_device_or_node_name(state->bs));
-      * locking.
++        ret = -EINVAL;
       */
      struct LuringState *linux_io_uring;
 +
 +    /* State for file descriptor monitoring using Linux io_uring */
 +    struct io_uring fdmon_io_uring;
 +    AioHandlerSList submit_list;
  #endif
      /* TimerLists for calling timers - one per clock type.  Has its own
 diff --git a/util/Makefile.objs b/util/Makefile.objs
 index XXXXXXX..XXXXXXX 100644
 --- a/util/Makefile.objs
 +++ b/util/Makefile.objs
@@ -XXX,XX +XXX,XX @@ util-obj-$(call lnot,$(CONFIG_ATOMIC64)) += atomic64.o
  util-obj-$(CONFIG_POSIX) += aio-posix.o
  util-obj-$(CONFIG_POSIX) += fdmon-poll.o
  util-obj-$(CONFIG_EPOLL_CREATE1) += fdmon-epoll.o
 +util-obj-$(CONFIG_LINUX_IO_URING) += fdmon-io_uring.o
  util-obj-$(CONFIG_POSIX) += compatfd.o
  util-obj-$(CONFIG_POSIX) += event_notifier-posix.o
  util-obj-$(CONFIG_POSIX) += mmap-alloc.o
 diff --git a/util/aio-posix.c b/util/aio-posix.c
 index XXXXXXX..XXXXXXX 100644
 --- a/util/aio-posix.c
 +++ b/util/aio-posix.c
@@ -XXX,XX +XXX,XX @@ static bool aio_remove_fd_handler(AioContext *ctx, AioHandler *node)
          g_source_remove_poll(&ctx->source, &node->pfd);
      }
 +    node->pfd.revents = 0;
 +
 +    /* If the fd monitor has already marked it deleted, leave it alone */
 +    if (QLIST_IS_INSERTED(node, node_deleted)) {
 +        return false;
 +    }
 +
-     /* If a read is in progress, just mark the node as deleted */
++    return ret;
      if (qemu_lockcnt_count(&ctx->list_lock)) {
          QLIST_INSERT_HEAD_RCU(&ctx->deleted_aio_handlers, node, node_deleted);
 -        node->pfd.revents = 0;
          return false;
      }
      /* Otherwise, delete it for real.  We can't just mark it as
@@ -XXX,XX +XXX,XX @@ void aio_set_fd_handler(AioContext *ctx,
          QLIST_INSERT_HEAD_RCU(&ctx->aio_handlers, new_node, node);
      }
 -    if (node) {
 -        deleted = aio_remove_fd_handler(ctx, node);
 -    }
      /* No need to order poll_disable_cnt writes against other updates;
       * the counter is only used to avoid wasting time and latency on
@@ -XXX,XX +XXX,XX @@ void aio_set_fd_handler(AioContext *ctx,
                 atomic_read(&ctx->poll_disable_cnt) + poll_disable_change);
      ctx->fdmon_ops->update(ctx, node, new_node);
 +    if (node) {
 +        deleted = aio_remove_fd_handler(ctx, node);
 +    }
      qemu_lockcnt_unlock(&ctx->list_lock);
      aio_notify(ctx);
@@ -XXX,XX +XXX,XX @@ void aio_context_setup(AioContext *ctx)
      ctx->fdmon_ops = &fdmon_poll_ops;
      ctx->epollfd = -1;
 +    /* Use the fastest fd monitoring implementation if available */
 +    if (fdmon_io_uring_setup(ctx)) {
 +        return;
 +    }
 +
      fdmon_epoll_setup(ctx);
  }
  void aio_context_destroy(AioContext *ctx)
  {
 +    fdmon_io_uring_destroy(ctx);
      fdmon_epoll_disable(ctx);
  }
 diff --git a/util/aio-posix.h b/util/aio-posix.h
 index XXXXXXX..XXXXXXX 100644
 --- a/util/aio-posix.h
 +++ b/util/aio-posix.h
@@ -XXX,XX +XXX,XX @@ struct AioHandler {
      IOHandler *io_poll_begin;
      IOHandler *io_poll_end;
      void *opaque;
 -    bool is_external;
      QLIST_ENTRY(AioHandler) node;
      QLIST_ENTRY(AioHandler) node_ready; /* only used during aio_poll() */
      QLIST_ENTRY(AioHandler) node_deleted;
 +#ifdef CONFIG_LINUX_IO_URING
 +    QSLIST_ENTRY(AioHandler) node_submitted;
 +    unsigned flags; /* see fdmon-io_uring.c */
 +#endif
 +    bool is_external;
  };
  /* Add a handler to a ready list */
@@ -XXX,XX +XXX,XX @@ static inline void fdmon_epoll_disable(AioContext *ctx)
  }
  #endif /* !CONFIG_EPOLL_CREATE1 */
 +#ifdef CONFIG_LINUX_IO_URING
 +bool fdmon_io_uring_setup(AioContext *ctx);
 +void fdmon_io_uring_destroy(AioContext *ctx);
 +#else
 +static inline bool fdmon_io_uring_setup(AioContext *ctx)
 +{
 +    return false;
 +}
 +
-+static inline void fdmon_io_uring_destroy(AioContext *ctx)
+ static void qemu_rbd_close(BlockDriverState *bs)
-+{
+ {
-+}
+     BDRVRBDState *s = bs->opaque;
-+#endif /* !CONFIG_LINUX_IO_URING */
+@@ -XXX,XX +XXX,XX @@ static BlockDriver bdrv_rbd = {
-+
+     .bdrv_parse_filename    = qemu_rbd_parse_filename,
- #endif /* AIO_POSIX_H */
+     .bdrv_file_open         = qemu_rbd_open,
-diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
+     .bdrv_close             = qemu_rbd_close,
-new file mode 100644
++    .bdrv_reopen_prepare    = qemu_rbd_reopen_prepare,
-index XXXXXXX..XXXXXXX
+     .bdrv_create            = qemu_rbd_create,
---- /dev/null
+     .bdrv_has_zero_init     = bdrv_has_zero_init_1,
-+++ b/util/fdmon-io_uring.c
+     .bdrv_get_info          = qemu_rbd_getinfo,
@@ -XXX,XX +XXX,XX @@
 +/* SPDX-License-Identifier: GPL-2.0-or-later */
 +/*
 + * Linux io_uring file descriptor monitoring
 + *
 + * The Linux io_uring API supports file descriptor monitoring with a few
 + * advantages over existing APIs like poll(2) and epoll(7):
 + *
 + * 1. Userspace polling of events is possible because the completion queue (cq
 + *    ring) is shared between the kernel and userspace.  This allows
 + *    applications that rely on userspace polling to also monitor file
 + *    descriptors in the same userspace polling loop.
 + *
 + * 2. Submission and completion is batched and done together in a single system
 + *    call.  This minimizes the number of system calls.
 + *
 + * 3. File descriptor monitoring is O(1) like epoll(7) so it scales better than
 + *    poll(2).
 + *
 + * 4. Nanosecond timeouts are supported so it requires fewer syscalls than
 + *    epoll(7).
 + *
 + * This code only monitors file descriptors and does not do asynchronous disk
 + * I/O.  Implementing disk I/O efficiently has other requirements and should
 + * use a separate io_uring so it does not make sense to unify the code.
 + *
 + * File descriptor monitoring is implemented using the following operations:
 + *
 + * 1. IORING_OP_POLL_ADD - adds a file descriptor to be monitored.
 + * 2. IORING_OP_POLL_REMOVE - removes a file descriptor being monitored.  When
 + *    the poll mask changes for a file descriptor it is first removed and then
 + *    re-added with the new poll mask, so this operation is also used as part
 + *    of modifying an existing monitored file descriptor.
 + * 3. IORING_OP_TIMEOUT - added every time a blocking syscall is made to wait
 + *    for events.  This operation self-cancels if another event completes
 + *    before the timeout.
 + *
 + * io_uring calls the submission queue the "sq ring" and the completion queue
 + * the "cq ring".  Ring entries are called "sqe" and "cqe", respectively.
 + *
 + * The code is structured so that sq/cq rings are only modified within
 + * fdmon_io_uring_wait().  Changes to AioHandlers are made by enqueuing them on
 + * ctx->submit_list so that fdmon_io_uring_wait() can submit IORING_OP_POLL_ADD
 + * and/or IORING_OP_POLL_REMOVE sqes for them.
 + */
 +
 +#include "qemu/osdep.h"
 +#include <poll.h>
 +#include "qemu/rcu_queue.h"
 +#include "aio-posix.h"
 +
 +enum {
 +    FDMON_IO_URING_ENTRIES  = 128, /* sq/cq ring size */
 +
 +    /* AioHandler::flags */
 +    FDMON_IO_URING_PENDING  = (1 << 0),
 +    FDMON_IO_URING_ADD      = (1 << 1),
 +    FDMON_IO_URING_REMOVE   = (1 << 2),
 +};
 +
 +static inline int poll_events_from_pfd(int pfd_events)
 +{
 +    return (pfd_events & G_IO_IN ? POLLIN : 0) |
 +           (pfd_events & G_IO_OUT ? POLLOUT : 0) |
 +           (pfd_events & G_IO_HUP ? POLLHUP : 0) |
 +           (pfd_events & G_IO_ERR ? POLLERR : 0);
 +}
 +
 +static inline int pfd_events_from_poll(int poll_events)
 +{
 +    return (poll_events & POLLIN ? G_IO_IN : 0) |
 +           (poll_events & POLLOUT ? G_IO_OUT : 0) |
 +           (poll_events & POLLHUP ? G_IO_HUP : 0) |
 +           (poll_events & POLLERR ? G_IO_ERR : 0);
 +}
 +
 +/*
 + * Returns an sqe for submitting a request.  Only to be called within
 + * fdmon_io_uring_wait().
 + */
 +static struct io_uring_sqe *get_sqe(AioContext *ctx)
 +{
 +    struct io_uring *ring = &ctx->fdmon_io_uring;
 +    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
 +    int ret;
 +
 +    if (likely(sqe)) {
 +        return sqe;
 +    }
 +
 +    /* No free sqes left, submit pending sqes first */
 +    ret = io_uring_submit(ring);
 +    assert(ret > 1);
 +    sqe = io_uring_get_sqe(ring);
 +    assert(sqe);
 +    return sqe;
 +}
 +
 +/* Atomically enqueue an AioHandler for sq ring submission */
 +static void enqueue(AioHandlerSList *head, AioHandler *node, unsigned flags)
 +{
 +    unsigned old_flags;
 +
 +    old_flags = atomic_fetch_or(&node->flags, FDMON_IO_URING_PENDING | flags);
 +    if (!(old_flags & FDMON_IO_URING_PENDING)) {
 +        QSLIST_INSERT_HEAD_ATOMIC(head, node, node_submitted);
 +    }
 +}
 +
 +/* Dequeue an AioHandler for sq ring submission.  Called by fill_sq_ring(). */
 +static AioHandler *dequeue(AioHandlerSList *head, unsigned *flags)
 +{
 +    AioHandler *node = QSLIST_FIRST(head);
 +
 +    if (!node) {
 +        return NULL;
 +    }
 +
 +    /* Doesn't need to be atomic since fill_sq_ring() moves the list */
 +    QSLIST_REMOVE_HEAD(head, node_submitted);
 +
 +    /*
 +     * Don't clear FDMON_IO_URING_REMOVE.  It's sticky so it can serve two
 +     * purposes: telling fill_sq_ring() to submit IORING_OP_POLL_REMOVE and
 +     * telling process_cqe() to delete the AioHandler when its
 +     * IORING_OP_POLL_ADD completes.
 +     */
 +    *flags = atomic_fetch_and(&node->flags, ~(FDMON_IO_URING_PENDING |
 +                                              FDMON_IO_URING_ADD));
 +    return node;
 +}
 +
 +static void fdmon_io_uring_update(AioContext *ctx,
 +                                  AioHandler *old_node,
 +                                  AioHandler *new_node)
 +{
 +    if (new_node) {
 +        enqueue(&ctx->submit_list, new_node, FDMON_IO_URING_ADD);
 +    }
 +
 +    if (old_node) {
 +        /*
 +         * Deletion is tricky because IORING_OP_POLL_ADD and
 +         * IORING_OP_POLL_REMOVE are async.  We need to wait for the original
 +         * IORING_OP_POLL_ADD to complete before this handler can be freed
 +         * safely.
 +         *
 +         * It's possible that the file descriptor becomes ready and the
 +         * IORING_OP_POLL_ADD cqe is enqueued before IORING_OP_POLL_REMOVE is
 +         * submitted, too.
 +         *
 +         * Mark this handler deleted right now but don't place it on
 +         * ctx->deleted_aio_handlers yet.  Instead, manually fudge the list
 +         * entry to make QLIST_IS_INSERTED() think this handler has been
 +         * inserted and other code recognizes this AioHandler as deleted.
 +         *
 +         * Once the original IORING_OP_POLL_ADD completes we enqueue the
 +         * handler on the real ctx->deleted_aio_handlers list to be freed.
 +         */
 +        assert(!QLIST_IS_INSERTED(old_node, node_deleted));
 +        old_node->node_deleted.le_prev = &old_node->node_deleted.le_next;
 +
 +        enqueue(&ctx->submit_list, old_node, FDMON_IO_URING_REMOVE);
 +    }
 +}
 +
 +static void add_poll_add_sqe(AioContext *ctx, AioHandler *node)
 +{
 +    struct io_uring_sqe *sqe = get_sqe(ctx);
 +    int events = poll_events_from_pfd(node->pfd.events);
 +
 +    io_uring_prep_poll_add(sqe, node->pfd.fd, events);
 +    io_uring_sqe_set_data(sqe, node);
 +}
 +
 +static void add_poll_remove_sqe(AioContext *ctx, AioHandler *node)
 +{
 +    struct io_uring_sqe *sqe = get_sqe(ctx);
 +
 +    io_uring_prep_poll_remove(sqe, node);
 +}
 +
 +/* Add a timeout that self-cancels when another cqe becomes ready */
 +static void add_timeout_sqe(AioContext *ctx, int64_t ns)
 +{
 +    struct io_uring_sqe *sqe;
 +    struct __kernel_timespec ts = {
 +        .tv_sec = ns / NANOSECONDS_PER_SECOND,
 +        .tv_nsec = ns % NANOSECONDS_PER_SECOND,
 +    };
 +
 +    sqe = get_sqe(ctx);
 +    io_uring_prep_timeout(sqe, &ts, 1, 0);
 +}
 +
 +/* Add sqes from ctx->submit_list for submission */
 +static void fill_sq_ring(AioContext *ctx)
 +{
 +    AioHandlerSList submit_list;
 +    AioHandler *node;
 +    unsigned flags;
 +
 +    QSLIST_MOVE_ATOMIC(&submit_list, &ctx->submit_list);
 +
 +    while ((node = dequeue(&submit_list, &flags))) {
 +        /* Order matters, just in case both flags were set */
 +        if (flags & FDMON_IO_URING_ADD) {
 +            add_poll_add_sqe(ctx, node);
 +        }
 +        if (flags & FDMON_IO_URING_REMOVE) {
 +            add_poll_remove_sqe(ctx, node);
 +        }
 +    }
 +}
 +
 +/* Returns true if a handler became ready */
 +static bool process_cqe(AioContext *ctx,
 +                        AioHandlerList *ready_list,
 +                        struct io_uring_cqe *cqe)
 +{
 +    AioHandler *node = io_uring_cqe_get_data(cqe);
 +    unsigned flags;
 +
 +    /* poll_timeout and poll_remove have a zero user_data field */
 +    if (!node) {
 +        return false;
 +    }
 +
 +    /*
 +     * Deletion can only happen when IORING_OP_POLL_ADD completes.  If we race
 +     * with enqueue() here then we can safely clear the FDMON_IO_URING_REMOVE
 +     * bit before IORING_OP_POLL_REMOVE is submitted.
 +     */
 +    flags = atomic_fetch_and(&node->flags, ~FDMON_IO_URING_REMOVE);
 +    if (flags & FDMON_IO_URING_REMOVE) {
 +        QLIST_INSERT_HEAD_RCU(&ctx->deleted_aio_handlers, node, node_deleted);
 +        return false;
 +    }
 +
 +    aio_add_ready_handler(ready_list, node, pfd_events_from_poll(cqe->res));
 +
 +    /* IORING_OP_POLL_ADD is one-shot so we must re-arm it */
 +    add_poll_add_sqe(ctx, node);
 +    return true;
 +}
 +
 +static int process_cq_ring(AioContext *ctx, AioHandlerList *ready_list)
 +{
 +    struct io_uring *ring = &ctx->fdmon_io_uring;
 +    struct io_uring_cqe *cqe;
 +    unsigned num_cqes = 0;
 +    unsigned num_ready = 0;
 +    unsigned head;
 +
 +    io_uring_for_each_cqe(ring, head, cqe) {
 +        if (process_cqe(ctx, ready_list, cqe)) {
 +            num_ready++;
 +        }
 +
 +        num_cqes++;
 +    }
 +
 +    io_uring_cq_advance(ring, num_cqes);
 +    return num_ready;
 +}
 +
 +static int fdmon_io_uring_wait(AioContext *ctx, AioHandlerList *ready_list,
 +                               int64_t timeout)
 +{
 +    unsigned wait_nr = 1; /* block until at least one cqe is ready */
 +    int ret;
 +
 +    /* Fall back while external clients are disabled */
 +    if (atomic_read(&ctx->external_disable_cnt)) {
 +        return fdmon_poll_ops.wait(ctx, ready_list, timeout);
 +    }
 +
 +    if (timeout == 0) {
 +        wait_nr = 0; /* non-blocking */
 +    } else if (timeout > 0) {
 +        add_timeout_sqe(ctx, timeout);
 +    }
 +
 +    fill_sq_ring(ctx);
 +
 +    ret = io_uring_submit_and_wait(&ctx->fdmon_io_uring, wait_nr);
 +    assert(ret >= 0);
 +
 +    return process_cq_ring(ctx, ready_list);
 +}
 +
 +static const FDMonOps fdmon_io_uring_ops = {
 +    .update = fdmon_io_uring_update,
 +    .wait = fdmon_io_uring_wait,
 +};
 +
 +bool fdmon_io_uring_setup(AioContext *ctx)
 +{
 +    int ret;
 +
 +    ret = io_uring_queue_init(FDMON_IO_URING_ENTRIES, &ctx->fdmon_io_uring, 0);
 +    if (ret != 0) {
 +        return false;
 +    }
 +
 +    QSLIST_INIT(&ctx->submit_list);
 +    ctx->fdmon_ops = &fdmon_io_uring_ops;
 +    return true;
 +}
 +
 +void fdmon_io_uring_destroy(AioContext *ctx)
 +{
 +    if (ctx->fdmon_ops == &fdmon_io_uring_ops) {
 +        AioHandler *node;
 +
 +        io_uring_queue_exit(&ctx->fdmon_io_uring);
 +
 +        /* No need to submit these anymore, just free them. */
 +        while ((node = QSLIST_FIRST_RCU(&ctx->submit_list))) {
 +            QSLIST_REMOVE_HEAD_RCU(&ctx->submit_list, node_submitted);
 +            QLIST_REMOVE(node, node);
 +            g_free(node);
 +        }
 +
 +        ctx->fdmon_ops = &fdmon_poll_ops;
 +    }
 +}
 --
-2.24.1
+2.9.3

-[PULL 3/9] aio-posix: completely stop polling when disabled
+[Qemu-devel] [PULL v2 12/12] qemu-iotests: _cleanup_qemu must be called on exit
-One iteration of polling is always performed even when polling is
+For the tests that use the common.qemu functions for running a QEMU
-disabled.  This is done because:
+process, _cleanup_qemu must be called in the exit function.
 1. Userspace polling is cheaper than making a syscall.  We might get
    lucky.
 2. We must poll once more after polling has stopped in case an event
    occurred while stopping polling.
-However, there are downsides:
+If it is not, if the qemu process aborts, then not all of the droppings
-1. Polling becomes a bottleneck when the number of event sources is very
+are cleaned up (e.g. pidfile, fifos).
    high.  It's more efficient to monitor fds in that case.
 2. A high-frequency polling event source can starve non-polling event
    sources because ppoll(2)/epoll(7) is never invoked.
-This patch removes the forced polling iteration so that poll_ns=0 really
+This updates those tests that did not have a cleanup in qemu-iotests.
 means no polling.
-IOPS increases from 10k to 60k when the guest has 100
+(I swapped spaces for tabs in test 102 as well)
 virtio-blk-pci,num-queues=32 devices and 1 virtio-blk-pci,num-queues=1
 device because the large number of event sources being polled slows down
 the event loop.
-Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
+Reported-by: Eric Blake <eblake@redhat.com>
-Link: https://lore.kernel.org/r/20200305170806.1313245-2-stefanha@redhat.com
+Reviewed-by: Eric Blake <eblake@redhat.com>
-Message-Id: <20200305170806.1313245-2-stefanha@redhat.com>
+Signed-off-by: Jeff Cody <jcody@redhat.com>
 Message-id: d59c2f6ad6c1da8b9b3c7f357c94a7122ccfc55a.1492544096.git.jcody@redhat.com
 ---
- util/aio-posix.c | 22 +++++++++++++++-------
+ tests/qemu-iotests/028 |  1 +
-1 file changed, 15 insertions(+), 7 deletions(-)
+ tests/qemu-iotests/094 | 11 ++++++++---
  tests/qemu-iotests/102 |  5 +++--
  tests/qemu-iotests/109 |  1 +
  tests/qemu-iotests/117 |  1 +
  tests/qemu-iotests/130 |  1 +
  tests/qemu-iotests/140 |  1 +
  tests/qemu-iotests/141 |  1 +
  tests/qemu-iotests/143 |  1 +
  tests/qemu-iotests/156 |  1 +
 10 files changed, 19 insertions(+), 5 deletions(-)
-diff --git a/util/aio-posix.c b/util/aio-posix.c
+diff --git a/tests/qemu-iotests/028 b/tests/qemu-iotests/028
-index XXXXXXX..XXXXXXX 100644
+index XXXXXXX..XXXXXXX 100755
---- a/util/aio-posix.c
+--- a/tests/qemu-iotests/028
-+++ b/util/aio-posix.c
++++ b/tests/qemu-iotests/028
-@@ -XXX,XX +XXX,XX @@ void aio_set_event_notifier_poll(AioContext *ctx,
+@@ -XXX,XX +XXX,XX @@ status=1    # failure is the default!
-                     (IOHandler *)io_poll_end);
  _cleanup()
  {
 +    _cleanup_qemu
      rm -f "${TEST_IMG}.copy"
      _cleanup_test_img
  }
+diff --git a/tests/qemu-iotests/094 b/tests/qemu-iotests/094
--static void poll_set_started(AioContext *ctx, bool started)
+index XXXXXXX..XXXXXXX 100755
-+static bool poll_set_started(AioContext *ctx, bool started)
+--- a/tests/qemu-iotests/094
 +++ b/tests/qemu-iotests/094
@@ -XXX,XX +XXX,XX @@ echo "QA output created by $seq"
  here="$PWD"
  status=1    # failure is the default!
 -trap "exit \$status" 0 1 2 3 15
 +_cleanup()
 +{
 +    _cleanup_qemu
 +    _cleanup_test_img
 +    rm -f "$TEST_DIR/source.$IMGFMT"
 +}
 +
 +trap "_cleanup; exit \$status" 0 1 2 3 15
  # get standard environment, filters and checks
  . ./common.rc
@@ -XXX,XX +XXX,XX @@ _send_qemu_cmd $QEMU_HANDLE \
  wait=1 _cleanup_qemu
 -_cleanup_test_img
 -rm -f "$TEST_DIR/source.$IMGFMT"
  # success, all done
  echo '*** done'
 diff --git a/tests/qemu-iotests/102 b/tests/qemu-iotests/102
 index XXXXXXX..XXXXXXX 100755
 --- a/tests/qemu-iotests/102
 +++ b/tests/qemu-iotests/102
@@ -XXX,XX +XXX,XX @@ seq=$(basename $0)
  echo "QA output created by $seq"
  here=$PWD
 -status=1    # failure is the default!
 +status=1    # failure is the default!
  _cleanup()
  {
-     AioHandler *node;
+-    _cleanup_test_img
-+    bool progress = false;
++    _cleanup_qemu
++    _cleanup_test_img
      if (started == ctx->poll_started) {
 -        return;
 +        return false;
      }
      ctx->poll_started = started;
@@ -XXX,XX +XXX,XX @@ static void poll_set_started(AioContext *ctx, bool started)
          if (fn) {
              fn(node->opaque);
          }
 +
 +        /* Poll one last time in case ->io_poll_end() raced with the event */
 +        if (!started) {
 +            progress = node->io_poll(node->opaque) || progress;
 +        }
      }
      qemu_lockcnt_dec(&ctx->list_lock);
 +
 +    return progress;
  }
+ trap "_cleanup; exit \$status" 0 1 2 3 15
-@@ -XXX,XX +XXX,XX @@ static bool try_poll_mode(AioContext *ctx, int64_t *timeout)
+diff --git a/tests/qemu-iotests/109 b/tests/qemu-iotests/109
-         }
+index XXXXXXX..XXXXXXX 100755
-     }
+--- a/tests/qemu-iotests/109
++++ b/tests/qemu-iotests/109
--    poll_set_started(ctx, false);
+@@ -XXX,XX +XXX,XX @@ status=1    # failure is the default!
-+    if (poll_set_started(ctx, false)) {
-+        *timeout = 0;
+ _cleanup()
-+        return true;
+ {
-+    }
++    _cleanup_qemu
+     rm -f $TEST_IMG.src
--    /* Even if we don't run busy polling, try polling once in case it can make
+     _cleanup_test_img
 -     * progress and the caller will be able to avoid ppoll(2)/epoll_wait(2).
 -     */
 -    return run_poll_handlers_once(ctx, timeout);
 +    return false;
  }
+diff --git a/tests/qemu-iotests/117 b/tests/qemu-iotests/117
- bool aio_poll(AioContext *ctx, bool blocking)
+index XXXXXXX..XXXXXXX 100755
 --- a/tests/qemu-iotests/117
 +++ b/tests/qemu-iotests/117
@@ -XXX,XX +XXX,XX @@ status=1    # failure is the default!
  _cleanup()
  {
 +    _cleanup_qemu
      _cleanup_test_img
  }
  trap "_cleanup; exit \$status" 0 1 2 3 15
 diff --git a/tests/qemu-iotests/130 b/tests/qemu-iotests/130
 index XXXXXXX..XXXXXXX 100755
 --- a/tests/qemu-iotests/130
 +++ b/tests/qemu-iotests/130
@@ -XXX,XX +XXX,XX @@ status=1    # failure is the default!
  _cleanup()
  {
 +    _cleanup_qemu
      _cleanup_test_img
  }
  trap "_cleanup; exit \$status" 0 1 2 3 15
 diff --git a/tests/qemu-iotests/140 b/tests/qemu-iotests/140
 index XXXXXXX..XXXXXXX 100755
 --- a/tests/qemu-iotests/140
 +++ b/tests/qemu-iotests/140
@@ -XXX,XX +XXX,XX @@ status=1    # failure is the default!
  _cleanup()
  {
 +    _cleanup_qemu
      _cleanup_test_img
      rm -f "$TEST_DIR/nbd"
  }
 diff --git a/tests/qemu-iotests/141 b/tests/qemu-iotests/141
 index XXXXXXX..XXXXXXX 100755
 --- a/tests/qemu-iotests/141
 +++ b/tests/qemu-iotests/141
@@ -XXX,XX +XXX,XX @@ status=1    # failure is the default!
  _cleanup()
  {
 +    _cleanup_qemu
      _cleanup_test_img
      rm -f "$TEST_DIR/{b,m,o}.$IMGFMT"
  }
 diff --git a/tests/qemu-iotests/143 b/tests/qemu-iotests/143
 index XXXXXXX..XXXXXXX 100755
 --- a/tests/qemu-iotests/143
 +++ b/tests/qemu-iotests/143
@@ -XXX,XX +XXX,XX @@ status=1    # failure is the default!
  _cleanup()
  {
 +    _cleanup_qemu
      rm -f "$TEST_DIR/nbd"
  }
  trap "_cleanup; exit \$status" 0 1 2 3 15
 diff --git a/tests/qemu-iotests/156 b/tests/qemu-iotests/156
 index XXXXXXX..XXXXXXX 100755
 --- a/tests/qemu-iotests/156
 +++ b/tests/qemu-iotests/156
@@ -XXX,XX +XXX,XX @@ status=1    # failure is the default!
  _cleanup()
  {
 +    _cleanup_qemu
      rm -f "$TEST_IMG{,.target}{,.backing,.overlay}"
  }
  trap "_cleanup; exit \$status" 0 1 2 3 15
 --
-2.24.1
+2.9.3

The following changes since commit 67f17e23baca5dd545fe98b01169cc351a70fe35:

Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging (2020-03-06 17:15:36 +0000)

are available in the Git repository at:

https://github.com/stefanha/qemu.git tags/block-pull-request

for you to fetch changes up to d37d0e365afb6825a90d8356fc6adcc1f58f40f3:

aio-posix: remove idle poll handlers to improve scalability (2020-03-09 16:45:16 +0000)

----------------------------------------------------------------
Pull request

----------------------------------------------------------------

Stefan Hajnoczi (9):
  qemu/queue.h: clear linked list pointers on remove
  aio-posix: remove confusing QLIST_SAFE_REMOVE()
  aio-posix: completely stop polling when disabled
  aio-posix: move RCU_READ_LOCK() into run_poll_handlers()
  aio-posix: extract ppoll(2) and epoll(7) fd monitoring
  aio-posix: simplify FDMonOps->update() prototype
  aio-posix: add io_uring fd monitoring implementation
  aio-posix: support userspace polling of fd monitoring
  aio-posix: remove idle poll handlers to improve scalability

 MAINTAINERS           |   2 +
 configure             |   5 +
 include/block/aio.h   |  71 ++++++-
 include/qemu/queue.h  |  19 +-
 util/Makefile.objs    |   3 +
 util/aio-posix.c      | 451 ++++++++++++++----------------------------
 util/aio-posix.h      |  81 ++++++++
 util/fdmon-epoll.c    | 155 +++++++++++++++
 util/fdmon-io_uring.c | 332 +++++++++++++++++++++++++++++++
 util/fdmon-poll.c     | 107 ++++++++++
 util/trace-events     |   2 +
 11 files changed, 915 insertions(+), 313 deletions(-)
 create mode 100644 util/aio-posix.h
 create mode 100644 util/fdmon-epoll.c
 create mode 100644 util/fdmon-io_uring.c
 create mode 100644 util/fdmon-poll.c

-- 
2.24.1

Do not leave stale linked list pointers around after removal.  It's
safer to set them to NULL so that use-after-removal results in an
immediate segfault.

The RCU queue removal macros are unchanged since nodes may still be
traversed after removal.
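
As a rough illustration of the idea (not the actual queue.h macros), the
sketch below uses a hypothetical singly linked item_list whose remove helper
clears the removed element's link. The type and function names are invented
for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node and list types for illustration; not QEMU's queue.h. */
struct item {
    int value;
    struct item *next;
};

struct item_list {
    struct item *first;
};

/* Push elm at the head of the list. */
static void item_list_insert_head(struct item_list *list, struct item *elm)
{
    elm->next = list->first;
    list->first = elm;
}

/* Pop the head and clear its link so a stale use is easy to spot. */
static struct item *item_list_remove_head(struct item_list *list)
{
    struct item *elm = list->first;

    if (elm == NULL) {
        return NULL;
    }
    list->first = elm->next;
    elm->next = NULL; /* no dangling pointer back into the list */
    return elm;
}

/* Returns 1 if the removed element no longer points into the list. */
static int removed_item_is_detached(void)
{
    struct item a = { .value = 1, .next = NULL };
    struct item b = { .value = 2, .next = NULL };
    struct item_list list = { .first = NULL };
    struct item *removed;

    item_list_insert_head(&list, &a);
    item_list_insert_head(&list, &b);
    removed = item_list_remove_head(&list);
    assert(removed != NULL);

    return removed == &b && removed->next == NULL && list.first == &a;
}
```

With the link cleared, code that mistakenly walks on from a removed element
dereferences NULL instead of silently traversing nodes it no longer owns.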

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Link: https://lore.kernel.org/r/20200224103406.1894923-2-stefanha@redhat.com
Message-Id: <20200224103406.1894923-2-stefanha@redhat.com>
---
 include/qemu/queue.h | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/include/qemu/queue.h b/include/qemu/queue.h
index XXXXXXX..XXXXXXX 100644
--- a/include/qemu/queue.h
+++ b/include/qemu/queue.h
@@ -XXX,XX +XXX,XX @@ struct {                                                                \
                 (elm)->field.le_next->field.le_prev =                   \
                     (elm)->field.le_prev;                               \
         *(elm)->field.le_prev = (elm)->field.le_next;                   \
+        (elm)->field.le_next = NULL;                                    \
+        (elm)->field.le_prev = NULL;                                    \
 } while (/*CONSTCOND*/0)
 
 /*
@@ -XXX,XX +XXX,XX @@ struct {                                                                \
 } while (/*CONSTCOND*/0)
 
 #define QSLIST_REMOVE_HEAD(head, field) do {                             \
-        (head)->slh_first = (head)->slh_first->field.sle_next;          \
+        typeof((head)->slh_first) elm = (head)->slh_first;               \
+        (head)->slh_first = elm->field.sle_next;                         \
+        elm->field.sle_next = NULL;                                      \
 } while (/*CONSTCOND*/0)
 
 #define QSLIST_REMOVE_AFTER(slistelm, field) do {                       \
-        (slistelm)->field.sle_next =                                    \
-            QSLIST_NEXT(QSLIST_NEXT((slistelm), field), field);         \
+        typeof(slistelm) next = (slistelm)->field.sle_next;             \
+        (slistelm)->field.sle_next = next->field.sle_next;              \
+        next->field.sle_next = NULL;                                    \
 } while (/*CONSTCOND*/0)
 
 #define QSLIST_REMOVE(head, elm, type, field) do {                      \
@@ -XXX,XX +XXX,XX @@ struct {                                                                \
         while (curelm->field.sle_next != (elm))                         \
             curelm = curelm->field.sle_next;                            \
         curelm->field.sle_next = curelm->field.sle_next->field.sle_next; \
+        (elm)->field.sle_next = NULL;                                   \
     }                                                                   \
 } while (/*CONSTCOND*/0)
 
@@ -XXX,XX +XXX,XX @@ struct {                                                                \
 } while (/*CONSTCOND*/0)
 
 #define QSIMPLEQ_REMOVE_HEAD(head, field) do {                          \
-    if (((head)->sqh_first = (head)->sqh_first->field.sqe_next) == NULL)\
+    typeof((head)->sqh_first) elm = (head)->sqh_first;                  \
+    if (((head)->sqh_first = elm->field.sqe_next) == NULL)              \
         (head)->sqh_last = &(head)->sqh_first;                          \
+    elm->field.sqe_next = NULL;                                         \
 } while (/*CONSTCOND*/0)
 
 #define QSIMPLEQ_SPLIT_AFTER(head, elm, field, removed) do {            \
@@ -XXX,XX +XXX,XX @@ struct {                                                                \
         if ((curelm->field.sqe_next =                                   \
             curelm->field.sqe_next->field.sqe_next) == NULL)            \
                 (head)->sqh_last = &(curelm)->field.sqe_next;           \
+        (elm)->field.sqe_next = NULL;                                   \
     }                                                                   \
 } while (/*CONSTCOND*/0)
 
@@ -XXX,XX +XXX,XX @@ union {                                                                 \
             (head)->tqh_circ.tql_prev = (elm)->field.tqe_circ.tql_prev; \
         (elm)->field.tqe_circ.tql_prev->tql_next = (elm)->field.tqe_next; \
         (elm)->field.tqe_circ.tql_prev = NULL;                          \
+        (elm)->field.tqe_circ.tql_next = NULL;                          \
+        (elm)->field.tqe_next = NULL;                                   \
 } while (/*CONSTCOND*/0)
 
 /* remove @left, @right and all elements in between from @head */
-- 
2.24.1

One iteration of polling is always performed even when polling is
disabled.  This is done because:
1. Userspace polling is cheaper than making a syscall.  We might get
   lucky.
2. We must poll once more after polling has stopped in case an event
   occurred while stopping polling.

However, there are downsides:
1. Polling becomes a bottleneck when the number of event sources is very
   high.  It's more efficient to monitor fds in that case.
2. A high-frequency polling event source can starve non-polling event
   sources because ppoll(2)/epoll(7) is never invoked.

This patch removes the forced polling iteration so that poll_ns=0 really
means no polling.

IOPS increases from 10k to 60k when the guest has 100
virtio-blk-pci,num-queues=32 devices and 1 virtio-blk-pci,num-queues=1
device because the large number of event sources being polled slows down
the event loop.
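
The "poll one last time when stopping" behaviour can be sketched outside of
QEMU roughly as follows. This is a simplified model under stated assumptions:
struct loop, struct handler and loop_set_poll_started() are hypothetical
stand-ins, and locking and the begin/end callbacks are omitted.

```c
#include <stdbool.h>
#include <stddef.h>

typedef bool PollFn(void *opaque);

/* Hypothetical handler: a poll callback plus its argument. */
struct handler {
    PollFn *poll;
    void *opaque;
};

/* Hypothetical event loop state. */
struct loop {
    bool poll_started;
    struct handler *handlers;
    size_t nhandlers;
};

/* Example poll callback: counts its invocations and reports progress. */
static bool counting_poll(void *opaque)
{
    int *calls = opaque;

    (*calls)++;
    return true;
}

/*
 * Switch polling on or off.  When polling is switched off, every handler
 * is polled once more so an event that raced with the switch is not lost.
 * Returns true if any of those final polls made progress.
 */
static bool loop_set_poll_started(struct loop *l, bool started)
{
    bool progress = false;
    size_t i;

    if (started == l->poll_started) {
        return false; /* no transition, so no extra poll */
    }
    l->poll_started = started;

    for (i = 0; i < l->nhandlers; i++) {
        if (!started) {
            progress = l->handlers[i].poll(l->handlers[i].opaque) || progress;
        }
    }
    return progress;
}
```

The extra poll only happens on the started-to-stopped transition, so a loop
that never starts polling never calls the poll callbacks at all.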

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Link: https://lore.kernel.org/r/20200305170806.1313245-2-stefanha@redhat.com
Message-Id: <20200305170806.1313245-2-stefanha@redhat.com>
---
 util/aio-posix.c | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/util/aio-posix.c b/util/aio-posix.c
index XXXXXXX..XXXXXXX 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -XXX,XX +XXX,XX @@ void aio_set_event_notifier_poll(AioContext *ctx,
                     (IOHandler *)io_poll_end);
 }
 
-static void poll_set_started(AioContext *ctx, bool started)
+static bool poll_set_started(AioContext *ctx, bool started)
 {
     AioHandler *node;
+    bool progress = false;
 
     if (started == ctx->poll_started) {
-        return;
+        return false;
     }
 
     ctx->poll_started = started;
@@ -XXX,XX +XXX,XX @@ static void poll_set_started(AioContext *ctx, bool started)
         if (fn) {
             fn(node->opaque);
         }
+
+        /* Poll one last time in case ->io_poll_end() raced with the event */
+        if (!started) {
+            progress = node->io_poll(node->opaque) || progress;
+        }
     }
     qemu_lockcnt_dec(&ctx->list_lock);
+
+    return progress;
 }
 
 
@@ -XXX,XX +XXX,XX @@ static bool try_poll_mode(AioContext *ctx, int64_t *timeout)
         }
     }
 
-    poll_set_started(ctx, false);
+    if (poll_set_started(ctx, false)) {
+        *timeout = 0;
+        return true;
+    }
 
-    /* Even if we don't run busy polling, try polling once in case it can make
-     * progress and the caller will be able to avoid ppoll(2)/epoll_wait(2).
-     */
-    return run_poll_handlers_once(ctx, timeout);
+    return false;
 }
 
 bool aio_poll(AioContext *ctx, bool blocking)
-- 
2.24.1

Now that run_poll_handlers_once() is only called by run_poll_handlers()
we can improve the CPU time profile by moving the expensive
RCU_READ_LOCK() out of the polling loop.

This reduces run_poll_handlers() from 40% CPU to 10% CPU in perf's
sampling profiler output.
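
A toy model of why hoisting the lock helps, assuming a nestable read lock
where only the outermost acquisition is expensive. The counters and function
names below are hypothetical and do not model real RCU memory barriers.

```c
#include <stdbool.h>

/* Toy nestable read lock: only the outermost entry counts as "expensive". */
static unsigned lock_depth;
static unsigned outermost_entries;

static void toy_read_lock(void)
{
    if (lock_depth++ == 0) {
        outermost_entries++;
    }
}

static void toy_read_unlock(void)
{
    lock_depth--;
}

/* A poll handler that contains its own read-side critical section. */
static void toy_poll_handler(void)
{
    toy_read_lock();
    toy_read_unlock();
}

/*
 * Run the handler `iterations` times and return how many expensive
 * outermost lock entries occurred.  With guard_whole_loop set, the
 * handler's lock calls become cheap nested ones.
 */
static unsigned toy_poll_loop(unsigned iterations, bool guard_whole_loop)
{
    unsigned i;

    lock_depth = 0;
    outermost_entries = 0;

    if (guard_whole_loop) {
        toy_read_lock();
    }
    for (i = 0; i < iterations; i++) {
        toy_poll_handler();
    }
    if (guard_whole_loop) {
        toy_read_unlock();
    }
    return outermost_entries;
}
```

In this model the guarded loop pays for one outermost entry regardless of the
iteration count, while the unguarded loop pays once per iteration.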

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Link: https://lore.kernel.org/r/20200305170806.1313245-3-stefanha@redhat.com
Message-Id: <20200305170806.1313245-3-stefanha@redhat.com>
---
 util/aio-posix.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/util/aio-posix.c b/util/aio-posix.c
index XXXXXXX..XXXXXXX 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -XXX,XX +XXX,XX @@ static bool run_poll_handlers_once(AioContext *ctx, int64_t *timeout)
     bool progress = false;
     AioHandler *node;
 
-    /*
-     * Optimization: ->io_poll() handlers often contain RCU read critical
-     * sections and we therefore see many rcu_read_lock() -> rcu_read_unlock()
-     * -> rcu_read_lock() -> ... sequences with expensive memory
-     * synchronization primitives.  Make the entire polling loop an RCU
-     * critical section because nested rcu_read_lock()/rcu_read_unlock() calls
-     * are cheap.
-     */
-    RCU_READ_LOCK_GUARD();
-
     QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
         if (!QLIST_IS_INSERTED(node, node_deleted) && node->io_poll &&
             aio_node_check(ctx, node->is_external) &&
@@ -XXX,XX +XXX,XX @@ static bool run_poll_handlers(AioContext *ctx, int64_t max_ns, int64_t *timeout)
 
     trace_run_poll_handlers_begin(ctx, max_ns, *timeout);
 
+    /*
+     * Optimization: ->io_poll() handlers often contain RCU read critical
+     * sections and we therefore see many rcu_read_lock() -> rcu_read_unlock()
+     * -> rcu_read_lock() -> ... sequences with expensive memory
+     * synchronization primitives.  Make the entire polling loop an RCU
+     * critical section because nested rcu_read_lock()/rcu_read_unlock() calls
+     * are cheap.
+     */
+    RCU_READ_LOCK_GUARD();
+
     start_time = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
     do {
         progress = run_poll_handlers_once(ctx, timeout);
-- 
2.24.1

The ppoll(2) and epoll(7) file descriptor monitoring implementations are
mixed with the core util/aio-posix.c code.  Before adding another
implementation for Linux io_uring, extract the existing ones so there is
a clear interface and the core code is simpler.

The new interface is AioContext->fdmon_ops, a pointer to a FDMonOps
struct.  See the patch for details.
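
The general shape of such an ops table can be sketched as below. This is a
minimal stand-alone illustration: Monitor, MonitorOps and the counting
backend are hypothetical names, and the signatures are deliberately simpler
than the FDMonOps callbacks in the patch.

```c
#include <stdbool.h>

typedef struct Monitor Monitor;

/* Hypothetical callback table, loosely modelled on an update/wait pair. */
typedef struct {
    void (*update)(Monitor *m, int fd, bool add);
    int (*wait)(Monitor *m);
} MonitorOps;

struct Monitor {
    const MonitorOps *ops; /* selected backend */
    int nfds;              /* state used by the example backend */
};

/* Example backend: only tracks how many fds are registered. */
static void counting_update(Monitor *m, int fd, bool add)
{
    (void)fd;
    m->nfds += add ? 1 : -1;
}

/* Example backend: pretends every registered fd is ready. */
static int counting_wait(Monitor *m)
{
    return m->nfds;
}

static const MonitorOps counting_ops = {
    .update = counting_update,
    .wait = counting_wait,
};

/* Core code calls through the table and never names a backend. */
static void monitor_add_fd(Monitor *m, int fd)
{
    m->ops->update(m, fd, true);
}

static int monitor_wait(Monitor *m)
{
    return m->ops->wait(m);
}
```

Switching backends then amounts to pointing ops at a different table, which
is the property that lets a new implementation be added without touching the
core loop.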

Semantic changes:
1. ppoll(2) now reflects events from pollfds[] back into AioHandlers
   while we're still on the clock for adaptive polling.  This was
   already happening for epoll(7), so if it's really an issue then we'll
   need to fix both in the future.
2. epoll(7)'s fallback to ppoll(2) while external events are disabled
   was broken when the number of fds exceeded the epoll(7) upgrade
   threshold.  I guess this code path simply wasn't tested and no one
   noticed the bug.  I didn't go out of my way to fix it but the correct
   code is simpler than preserving the bug.

I also took some liberties in removing the unnecessary
AioContext->epoll_available (just check AioContext->epollfd != -1
instead) and AioContext->epoll_enabled (it's implicit if our
AioContext->fdmon_ops callbacks are being invoked) fields.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Link: https://lore.kernel.org/r/20200305170806.1313245-4-stefanha@redhat.com
Message-Id: <20200305170806.1313245-4-stefanha@redhat.com>
---
 MAINTAINERS         |   2 +
 include/block/aio.h |  36 +++++-
 util/Makefile.objs  |   2 +
 util/aio-posix.c    | 286 ++------------------------------------------
 util/aio-posix.h    |  61 ++++++++++
 util/fdmon-epoll.c  | 151 +++++++++++++++++++++++
 util/fdmon-poll.c   | 104 ++++++++++++++++
 7 files changed, 366 insertions(+), 276 deletions(-)
 create mode 100644 util/aio-posix.h
 create mode 100644 util/fdmon-epoll.c
 create mode 100644 util/fdmon-poll.c

diff --git a/MAINTAINERS b/MAINTAINERS
index XXXXXXX..XXXXXXX 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -XXX,XX +XXX,XX @@ L: qemu-block@nongnu.org
 S: Supported
 F: util/async.c
 F: util/aio-*.c
+F: util/aio-*.h
+F: util/fdmon-*.c
 F: block/io.c
 F: migration/block*
 F: include/block/aio.h
diff --git a/include/block/aio.h b/include/block/aio.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -XXX,XX +XXX,XX @@ struct ThreadPool;
 struct LinuxAioState;
 struct LuringState;
 
+/* Callbacks for file descriptor monitoring implementations */
+typedef struct {
+    /*
+     * update:
+     * @ctx: the AioContext
+     * @node: the handler
+     * @is_new: is the file descriptor new, i.e. not yet being monitored?
+     *
+     * Add/remove/modify a monitored file descriptor.  There are three cases:
+     * 1. node->pfd.events == 0 means remove the file descriptor.
+     * 2. !is_new means modify an already monitored file descriptor.
+     * 3. is_new means add a new file descriptor.
+     *
+     * Called with ctx->list_lock acquired.
+     */
+    void (*update)(AioContext *ctx, AioHandler *node, bool is_new);
+
+    /*
+     * wait:
+     * @ctx: the AioContext
+     * @ready_list: list for handlers that become ready
+     * @timeout: maximum duration to wait, in nanoseconds
+     *
+     * Wait for file descriptors to become ready and place them on ready_list.
+     *
+     * Called with ctx->list_lock incremented but not locked.
+     *
+     * Returns: number of ready file descriptors.
+     */
+    int (*wait)(AioContext *ctx, AioHandlerList *ready_list, int64_t timeout);
+} FDMonOps;
+
 /*
  * Each aio_bh_poll() call carves off a slice of the BH list, so that newly
  * scheduled BHs are not processed until the next aio_bh_poll() call.  All
@@ -XXX,XX +XXX,XX @@ struct AioContext {
 
     /* epoll(7) state used when built with CONFIG_EPOLL */
     int epollfd;
-    bool epoll_enabled;
-    bool epoll_available;
+
+    const FDMonOps *fdmon_ops;
 };
 
 /**
diff --git a/util/Makefile.objs b/util/Makefile.objs
index XXXXXXX..XXXXXXX 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -XXX,XX +XXX,XX @@ util-obj-y += aiocb.o async.o aio-wait.o thread-pool.o qemu-timer.o
 util-obj-y += main-loop.o
 util-obj-$(call lnot,$(CONFIG_ATOMIC64)) += atomic64.o
 util-obj-$(CONFIG_POSIX) += aio-posix.o
+util-obj-$(CONFIG_POSIX) += fdmon-poll.o
+util-obj-$(CONFIG_EPOLL_CREATE1) += fdmon-epoll.o
 util-obj-$(CONFIG_POSIX) += compatfd.o
 util-obj-$(CONFIG_POSIX) += event_notifier-posix.o
 util-obj-$(CONFIG_POSIX) += mmap-alloc.o
diff --git a/util/aio-posix.c b/util/aio-posix.c
index XXXXXXX..XXXXXXX 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -XXX,XX +XXX,XX @@
 #include "qemu/sockets.h"
 #include "qemu/cutils.h"
 #include "trace.h"
-#ifdef CONFIG_EPOLL_CREATE1
-#include <sys/epoll.h>
-#endif
+#include "aio-posix.h"
 
-struct AioHandler
-{
-    GPollFD pfd;
-    IOHandler *io_read;
-    IOHandler *io_write;
-    AioPollFn *io_poll;
-    IOHandler *io_poll_begin;
-    IOHandler *io_poll_end;
-    void *opaque;
-    bool is_external;
-    QLIST_ENTRY(AioHandler) node;
-    QLIST_ENTRY(AioHandler) node_ready; /* only used during aio_poll() */
-    QLIST_ENTRY(AioHandler) node_deleted;
-};
-
-/* Add a handler to a ready list */
-static void add_ready_handler(AioHandlerList *ready_list,
-                              AioHandler *node,
-                              int revents)
+void aio_add_ready_handler(AioHandlerList *ready_list,
+                           AioHandler *node,
+                           int revents)
 {
     QLIST_SAFE_REMOVE(node, node_ready); /* remove from nested parent's list */
     node->pfd.revents = revents;
     QLIST_INSERT_HEAD(ready_list, node, node_ready);
 }
 
-#ifdef CONFIG_EPOLL_CREATE1
-
-/* The fd number threshold to switch to epoll */
-#define EPOLL_ENABLE_THRESHOLD 64
-
-static void aio_epoll_disable(AioContext *ctx)
-{
-    ctx->epoll_enabled = false;
-    if (!ctx->epoll_available) {
-        return;
-    }
-    ctx->epoll_available = false;
-    close(ctx->epollfd);
-}
-
-static inline int epoll_events_from_pfd(int pfd_events)
-{
-    return (pfd_events & G_IO_IN ? EPOLLIN : 0) |
-           (pfd_events & G_IO_OUT ? EPOLLOUT : 0) |
-           (pfd_events & G_IO_HUP ? EPOLLHUP : 0) |
-           (pfd_events & G_IO_ERR ? EPOLLERR : 0);
-}
-
-static bool aio_epoll_try_enable(AioContext *ctx)
-{
-    AioHandler *node;
-    struct epoll_event event;
-
-    QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
-        int r;
-        if (QLIST_IS_INSERTED(node, node_deleted) || !node->pfd.events) {
-            continue;
-        }
-        event.events = epoll_events_from_pfd(node->pfd.events);
-        event.data.ptr = node;
-        r = epoll_ctl(ctx->epollfd, EPOLL_CTL_ADD, node->pfd.fd, &event);
-        if (r) {
-            return false;
-        }
-    }
-    ctx->epoll_enabled = true;
-    return true;
-}
-
-static void aio_epoll_update(AioContext *ctx, AioHandler *node, bool is_new)
-{
-    struct epoll_event event;
-    int r;
-    int ctl;
-
-    if (!ctx->epoll_enabled) {
-        return;
-    }
-    if (!node->pfd.events) {
-        ctl = EPOLL_CTL_DEL;
-    } else {
-        event.data.ptr = node;
-        event.events = epoll_events_from_pfd(node->pfd.events);
-        ctl = is_new ? EPOLL_CTL_ADD : EPOLL_CTL_MOD;
-    }
-
-    r = epoll_ctl(ctx->epollfd, ctl, node->pfd.fd, &event);
-    if (r) {
-        aio_epoll_disable(ctx);
-    }
-}
-
-static int aio_epoll(AioContext *ctx, AioHandlerList *ready_list,
-                     int64_t timeout)
-{
-    GPollFD pfd = {
-        .fd = ctx->epollfd,
-        .events = G_IO_IN | G_IO_OUT | G_IO_HUP | G_IO_ERR,
-    };
-    AioHandler *node;
-    int i, ret = 0;
-    struct epoll_event events[128];
-
-    if (timeout > 0) {
-        ret = qemu_poll_ns(&pfd, 1, timeout);
-        if (ret > 0) {
-            timeout = 0;
-        }
-    }
-    if (timeout <= 0 || ret > 0) {
-        ret = epoll_wait(ctx->epollfd, events,
-                         ARRAY_SIZE(events),
-                         timeout);
-        if (ret <= 0) {
-            goto out;
-        }
-        for (i = 0; i < ret; i++) {
-            int ev = events[i].events;
-            int revents = (ev & EPOLLIN ? G_IO_IN : 0) |
-                          (ev & EPOLLOUT ? G_IO_OUT : 0) |
-                          (ev & EPOLLHUP ? G_IO_HUP : 0) |
-                          (ev & EPOLLERR ? G_IO_ERR : 0);
-
-            node = events[i].data.ptr;
-            add_ready_handler(ready_list, node, revents);
-        }
-    }
-out:
-    return ret;
-}
-
-static bool aio_epoll_enabled(AioContext *ctx)
-{
-    /* Fall back to ppoll when external clients are disabled. */
-    return !aio_external_disabled(ctx) && ctx->epoll_enabled;
-}
-
-static bool aio_epoll_check_poll(AioContext *ctx, GPollFD *pfds,
-                                 unsigned npfd, int64_t timeout)
-{
-    if (!ctx->epoll_available) {
-        return false;
-    }
-    if (aio_epoll_enabled(ctx)) {
-        return true;
-    }
-    if (npfd >= EPOLL_ENABLE_THRESHOLD) {
-        if (aio_epoll_try_enable(ctx)) {
-            return true;
-        } else {
-            aio_epoll_disable(ctx);
-        }
-    }
-    return false;
-}
-
-#else
-
-static void aio_epoll_update(AioContext *ctx, AioHandler *node, bool is_new)
-{
-}
-
-static int aio_epoll(AioContext *ctx, AioHandlerList *ready_list,
-                     int64_t timeout)
-{
-    assert(false);
-}
-
-static bool aio_epoll_enabled(AioContext *ctx)
-{
-    return false;
-}
-
-static bool aio_epoll_check_poll(AioContext *ctx, GPollFD *pfds,
-                          unsigned npfd, int64_t timeout)
-{
-    return false;
-}
-
-#endif
-
 static AioHandler *find_aio_handler(AioContext *ctx, int fd)
 {
     AioHandler *node;
@@ -XXX,XX +XXX,XX @@ void aio_set_fd_handler(AioContext *ctx,
                atomic_read(&ctx->poll_disable_cnt) + poll_disable_change);
 
     if (new_node) {
-        aio_epoll_update(ctx, new_node, is_new);
+        ctx->fdmon_ops->update(ctx, new_node, is_new);
     } else if (node) {
         /* Unregister deleted fd_handler */
-        aio_epoll_update(ctx, node, false);
+        ctx->fdmon_ops->update(ctx, node, false);
     }
     qemu_lockcnt_unlock(&ctx->list_lock);
     aio_notify(ctx);
@@ -XXX,XX +XXX,XX @@ void aio_dispatch(AioContext *ctx)
     timerlistgroup_run_timers(&ctx->tlg);
 }
 
-/* These thread-local variables are used only in a small part of aio_poll
- * around the call to the poll() system call.  In particular they are not
- * used while aio_poll is performing callbacks, which makes it much easier
- * to think about reentrancy!
- *
- * Stack-allocated arrays would be perfect but they have size limitations;
- * heap allocation is expensive enough that we want to reuse arrays across
- * calls to aio_poll().  And because poll() has to be called without holding
- * any lock, the arrays cannot be stored in AioContext.  Thread-local data
- * has none of the disadvantages of these three options.
- */
-static __thread GPollFD *pollfds;
-static __thread AioHandler **nodes;
-static __thread unsigned npfd, nalloc;
-static __thread Notifier pollfds_cleanup_notifier;
-
-static void pollfds_cleanup(Notifier *n, void *unused)
-{
-    g_assert(npfd == 0);
-    g_free(pollfds);
-    g_free(nodes);
-    nalloc = 0;
-}
-
-static void add_pollfd(AioHandler *node)
-{
-    if (npfd == nalloc) {
-        if (nalloc == 0) {
-            pollfds_cleanup_notifier.notify = pollfds_cleanup;
-            qemu_thread_atexit_add(&pollfds_cleanup_notifier);
-            nalloc = 8;
-        } else {
-            g_assert(nalloc <= INT_MAX);
-            nalloc *= 2;
-        }
-        pollfds = g_renew(GPollFD, pollfds, nalloc);
-        nodes = g_renew(AioHandler *, nodes, nalloc);
-    }
-    nodes[npfd] = node;
-    pollfds[npfd] = (GPollFD) {
-        .fd = node->pfd.fd,
-        .events = node->pfd.events,
-    };
-    npfd++;
-}
-
 static bool run_poll_handlers_once(AioContext *ctx, int64_t *timeout)
 {
     bool progress = false;
@@ -XXX,XX +XXX,XX @@ static bool try_poll_mode(AioContext *ctx, int64_t *timeout)
 bool aio_poll(AioContext *ctx, bool blocking)
 {
     AioHandlerList ready_list = QLIST_HEAD_INITIALIZER(ready_list);
-    AioHandler *node;
-    int i;
     int ret = 0;
     bool progress;
     int64_t timeout;
@@ -XXX,XX +XXX,XX @@ bool aio_poll(AioContext *ctx, bool blocking)
      * system call---a single round of run_poll_handlers_once suffices.
      */
     if (timeout || atomic_read(&ctx->poll_disable_cnt)) {
-        assert(npfd == 0);
-
-        /* fill pollfds */
-
-        if (!aio_epoll_enabled(ctx)) {
-            QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
-                if (!QLIST_IS_INSERTED(node, node_deleted) && node->pfd.events
-                    && aio_node_check(ctx, node->is_external)) {
-                    add_pollfd(node);
-                }
-            }
-        }
-
-        /* wait until next event */
-        if (aio_epoll_check_poll(ctx, pollfds, npfd, timeout)) {
-            npfd = 0; /* pollfds[] is not being used */
-            ret = aio_epoll(ctx, &ready_list, timeout);
-        } else  {
-            ret = qemu_poll_ns(pollfds, npfd, timeout);
-        }
+        ret = ctx->fdmon_ops->wait(ctx, &ready_list, timeout);
     }
 
     if (blocking) {
@@ -XXX,XX +XXX,XX @@ bool aio_poll(AioContext *ctx, bool blocking)
         }
     }
 
-    /* if we have any readable fds, dispatch event */
-    if (ret > 0) {
-        for (i = 0; i < npfd; i++) {
-            int revents = pollfds[i].revents;
-
-            if (revents) {
-                add_ready_handler(&ready_list, nodes[i], revents);
-            }
-        }
-    }
-
-    npfd = 0;
-
     progress |= aio_bh_poll(ctx);
 
     if (ret > 0) {
@@ -XXX,XX +XXX,XX @@ bool aio_poll(AioContext *ctx, bool blocking)
 
 void aio_context_setup(AioContext *ctx)
 {
-#ifdef CONFIG_EPOLL_CREATE1
-    assert(!ctx->epollfd);
-    ctx->epollfd = epoll_create1(EPOLL_CLOEXEC);
-    if (ctx->epollfd == -1) {
-        fprintf(stderr, "Failed to create epoll instance: %s", strerror(errno));
-        ctx->epoll_available = false;
-    } else {
-        ctx->epoll_available = true;
-    }
-#endif
+    ctx->fdmon_ops = &fdmon_poll_ops;
+    ctx->epollfd = -1;
+
+    fdmon_epoll_setup(ctx);
 }
 
 void aio_context_destroy(AioContext *ctx)
 {
-#ifdef CONFIG_EPOLL_CREATE1
-    aio_epoll_disable(ctx);
-#endif
+    fdmon_epoll_disable(ctx);
 }
 
 void aio_context_set_poll_params(AioContext *ctx, int64_t max_ns,
diff --git a/util/aio-posix.h b/util/aio-posix.h
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/util/aio-posix.h
@@ -XXX,XX +XXX,XX @@
+/*
+ * AioContext POSIX event loop implementation internal APIs
+ *
+ * Copyright IBM, Corp. 2008
+ * Copyright Red Hat, Inc. 2020
+ *
+ * Authors:
+ *  Anthony Liguori   <aliguori@us.ibm.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * Contributions after 2012-01-13 are licensed under the terms of the
+ * GNU GPL, version 2 or (at your option) any later version.
+ */
+
+#ifndef AIO_POSIX_H
+#define AIO_POSIX_H
+
+#include "block/aio.h"
+
+struct AioHandler {
+    GPollFD pfd;
+    IOHandler *io_read;
+    IOHandler *io_write;
+    AioPollFn *io_poll;
+    IOHandler *io_poll_begin;
+    IOHandler *io_poll_end;
+    void *opaque;
+    bool is_external;
+    QLIST_ENTRY(AioHandler) node;
+    QLIST_ENTRY(AioHandler) node_ready; /* only used during aio_poll() */
+    QLIST_ENTRY(AioHandler) node_deleted;
+};
+
+/* Add a handler to a ready list */
+void aio_add_ready_handler(AioHandlerList *ready_list, AioHandler *node,
+                           int revents);
+
+extern const FDMonOps fdmon_poll_ops;
+
+#ifdef CONFIG_EPOLL_CREATE1
+bool fdmon_epoll_try_upgrade(AioContext *ctx, unsigned npfd);
+void fdmon_epoll_setup(AioContext *ctx);
+void fdmon_epoll_disable(AioContext *ctx);
+#else
+static inline bool fdmon_epoll_try_upgrade(AioContext *ctx, unsigned npfd)
+{
+    return false;
+}
+
+static inline void fdmon_epoll_setup(AioContext *ctx)
+{
+}
+
+static inline void fdmon_epoll_disable(AioContext *ctx)
+{
+}
+#endif /* !CONFIG_EPOLL_CREATE1 */
+
+#endif /* AIO_POSIX_H */
diff --git a/util/fdmon-epoll.c b/util/fdmon-epoll.c
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/util/fdmon-epoll.c
@@ -XXX,XX +XXX,XX @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * epoll(7) file descriptor monitoring
+ */
+
+#include "qemu/osdep.h"
+#include <sys/epoll.h>
+#include "qemu/rcu_queue.h"
+#include "aio-posix.h"
+
+/* The fd number threshold to switch to epoll */
+#define EPOLL_ENABLE_THRESHOLD 64
+
+void fdmon_epoll_disable(AioContext *ctx)
+{
+    if (ctx->epollfd >= 0) {
+        close(ctx->epollfd);
+        ctx->epollfd = -1;
+    }
+
+    /* Switch back */
+    ctx->fdmon_ops = &fdmon_poll_ops;
+}
+
+static inline int epoll_events_from_pfd(int pfd_events)
+{
+    return (pfd_events & G_IO_IN ? EPOLLIN : 0) |
+           (pfd_events & G_IO_OUT ? EPOLLOUT : 0) |
+           (pfd_events & G_IO_HUP ? EPOLLHUP : 0) |
+           (pfd_events & G_IO_ERR ? EPOLLERR : 0);
+}
+
+static void fdmon_epoll_update(AioContext *ctx, AioHandler *node, bool is_new)
+{
+    struct epoll_event event;
+    int r;
+    int ctl;
+
+    if (!node->pfd.events) {
+        ctl = EPOLL_CTL_DEL;
+    } else {
+        event.data.ptr = node;
+        event.events = epoll_events_from_pfd(node->pfd.events);
+        ctl = is_new ? EPOLL_CTL_ADD : EPOLL_CTL_MOD;
+    }
+
+    r = epoll_ctl(ctx->epollfd, ctl, node->pfd.fd, &event);
+    if (r) {
+        fdmon_epoll_disable(ctx);
+    }
+}
+
+static int fdmon_epoll_wait(AioContext *ctx, AioHandlerList *ready_list,
+                            int64_t timeout)
+{
+    GPollFD pfd = {
+        .fd = ctx->epollfd,
+        .events = G_IO_IN | G_IO_OUT | G_IO_HUP | G_IO_ERR,
+    };
+    AioHandler *node;
+    int i, ret = 0;
+    struct epoll_event events[128];
+
+    /* Fall back while external clients are disabled */
+    if (atomic_read(&ctx->external_disable_cnt)) {
+        return fdmon_poll_ops.wait(ctx, ready_list, timeout);
+    }
+
+    if (timeout > 0) {
+        ret = qemu_poll_ns(&pfd, 1, timeout);
+        if (ret > 0) {
+            timeout = 0;
+        }
+    }
+    if (timeout <= 0 || ret > 0) {
+        ret = epoll_wait(ctx->epollfd, events,
+                         ARRAY_SIZE(events),
+                         timeout);
+        if (ret <= 0) {
+            goto out;
+        }
+        for (i = 0; i < ret; i++) {
+            int ev = events[i].events;
+            int revents = (ev & EPOLLIN ? G_IO_IN : 0) |
+                          (ev & EPOLLOUT ? G_IO_OUT : 0) |
+                          (ev & EPOLLHUP ? G_IO_HUP : 0) |
+                          (ev & EPOLLERR ? G_IO_ERR : 0);
+
+            node = events[i].data.ptr;
+            aio_add_ready_handler(ready_list, node, revents);
+        }
+    }
+out:
+    return ret;
+}
+
+static const FDMonOps fdmon_epoll_ops = {
+    .update = fdmon_epoll_update,
+    .wait = fdmon_epoll_wait,
+};
+
+static bool fdmon_epoll_try_enable(AioContext *ctx)
+{
+    AioHandler *node;
+    struct epoll_event event;
+
+    QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
+        int r;
+        if (QLIST_IS_INSERTED(node, node_deleted) || !node->pfd.events) {
+            continue;
+        }
+        event.events = epoll_events_from_pfd(node->pfd.events);
+        event.data.ptr = node;
+        r = epoll_ctl(ctx->epollfd, EPOLL_CTL_ADD, node->pfd.fd, &event);
+        if (r) {
+            return false;
+        }
+    }
+
+    ctx->fdmon_ops = &fdmon_epoll_ops;
+    return true;
+}
+
+bool fdmon_epoll_try_upgrade(AioContext *ctx, unsigned npfd)
+{
+    if (ctx->epollfd < 0) {
+        return false;
+    }
+
+    /* Do not upgrade while external clients are disabled */
+    if (atomic_read(&ctx->external_disable_cnt)) {
+        return false;
+    }
+
+    if (npfd >= EPOLL_ENABLE_THRESHOLD) {
+        if (fdmon_epoll_try_enable(ctx)) {
+            return true;
+        } else {
+            fdmon_epoll_disable(ctx);
+        }
+    }
+    return false;
+}
+
+void fdmon_epoll_setup(AioContext *ctx)
+{
+    ctx->epollfd = epoll_create1(EPOLL_CLOEXEC);
+    if (ctx->epollfd == -1) {
+        fprintf(stderr, "Failed to create epoll instance: %s", strerror(errno));
+    }
+}
diff --git a/util/fdmon-poll.c b/util/fdmon-poll.c
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/util/fdmon-poll.c
@@ -XXX,XX +XXX,XX @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * poll(2) file descriptor monitoring
+ *
+ * Uses ppoll(2) when available, g_poll() otherwise.
+ */
+
+#include "qemu/osdep.h"
+#include "aio-posix.h"
+#include "qemu/rcu_queue.h"
+
+/*
+ * These thread-local variables are used only in fdmon_poll_wait() around the
+ * call to the poll() system call.  In particular they are not used while
+ * aio_poll is performing callbacks, which makes it much easier to think about
+ * reentrancy!
+ *
+ * Stack-allocated arrays would be perfect but they have size limitations;
+ * heap allocation is expensive enough that we want to reuse arrays across
+ * calls to aio_poll().  And because poll() has to be called without holding
+ * any lock, the arrays cannot be stored in AioContext.  Thread-local data
+ * has none of the disadvantages of these three options.
+ */
+static __thread GPollFD *pollfds;
+static __thread AioHandler **nodes;
+static __thread unsigned npfd, nalloc;
+static __thread Notifier pollfds_cleanup_notifier;
+
+static void pollfds_cleanup(Notifier *n, void *unused)
+{
+    g_assert(npfd == 0);
+    g_free(pollfds);
+    g_free(nodes);
+    nalloc = 0;
+}
+
+static void add_pollfd(AioHandler *node)
+{
+    if (npfd == nalloc) {
+        if (nalloc == 0) {
+            pollfds_cleanup_notifier.notify = pollfds_cleanup;
+            qemu_thread_atexit_add(&pollfds_cleanup_notifier);
+            nalloc = 8;
+        } else {
+            g_assert(nalloc <= INT_MAX);
+            nalloc *= 2;
+        }
+        pollfds = g_renew(GPollFD, pollfds, nalloc);
+        nodes = g_renew(AioHandler *, nodes, nalloc);
+    }
+    nodes[npfd] = node;
+    pollfds[npfd] = (GPollFD) {
+        .fd = node->pfd.fd,
+        .events = node->pfd.events,
+    };
+    npfd++;
+}
+
+static int fdmon_poll_wait(AioContext *ctx, AioHandlerList *ready_list,
+                            int64_t timeout)
+{
+    AioHandler *node;
+    int ret;
+
+    assert(npfd == 0);
+
+    QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
+        if (!QLIST_IS_INSERTED(node, node_deleted) && node->pfd.events
+                && aio_node_check(ctx, node->is_external)) {
+            add_pollfd(node);
+        }
+    }
+
+    /* epoll(7) is faster above a certain number of fds */
+    if (fdmon_epoll_try_upgrade(ctx, npfd)) {
+        return ctx->fdmon_ops->wait(ctx, ready_list, timeout);
+    }
+
+    ret = qemu_poll_ns(pollfds, npfd, timeout);
+    if (ret > 0) {
+        int i;
+
+        for (i = 0; i < npfd; i++) {
+            int revents = pollfds[i].revents;
+
+            if (revents) {
+                aio_add_ready_handler(ready_list, nodes[i], revents);
+            }
+        }
+    }
+
+    npfd = 0;
+    return ret;
+}
+
+static void fdmon_poll_update(AioContext *ctx, AioHandler *node, bool is_new)
+{
+    /* Do nothing, AioHandler already contains the state we'll need */
+}
+
+const FDMonOps fdmon_poll_ops = {
+    .update = fdmon_poll_update,
+    .wait = fdmon_poll_wait,
+};
-- 
2.24.1

The AioHandler *node, bool is_new arguments are more complicated to
think about than simply being given AioHandler *old_node, AioHandler
*new_node.

Furthermore, the new Linux io_uring file descriptor monitoring mechanism
added by the next patch requires access to both the old and the new
nodes.  Make this change now in preparation.
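
As a rough illustration of why the two-pointer form is easier to reason
about, the sketch below derives the operation purely from which node is
present.  The Handler type, the FdOp enum and classify_update() are
hypothetical names used only for this example; they are not part of the
patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for AioHandler; only the fd matters here. */
typedef struct Handler {
    int fd;
} Handler;

/* Hypothetical operation names, mirroring EPOLL_CTL_ADD/MOD/DEL. */
typedef enum {
    FD_OP_ADD,
    FD_OP_MODIFY,
    FD_OP_REMOVE,
} FdOp;

/*
 * Derive the operation from which of the two nodes is present.
 * At least one of old_node/new_node must be non-NULL.
 */
static FdOp classify_update(const Handler *old_node, const Handler *new_node)
{
    assert(old_node || new_node);

    if (!new_node) {
        return FD_OP_REMOVE;    /* fd is no longer monitored */
    }
    if (!old_node) {
        return FD_OP_ADD;       /* fd is monitored for the first time */
    }
    return FD_OP_MODIFY;        /* an existing handler is being replaced */
}
```

With this shape a caller no longer has to infer removal from
pfd.events == 0, and an implementation that needs the old node (for
example to cancel an in-flight request) receives it directly.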

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Link: https://lore.kernel.org/r/20200305170806.1313245-5-stefanha@redhat.com
Message-Id: <20200305170806.1313245-5-stefanha@redhat.com>
---
 include/block/aio.h | 13 ++++++-------
 util/aio-posix.c    |  7 +------
 util/fdmon-epoll.c  | 21 ++++++++++++---------
 util/fdmon-poll.c   |  4 +++-
 4 files changed, 22 insertions(+), 23 deletions(-)

diff --git a/include/block/aio.h b/include/block/aio.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -XXX,XX +XXX,XX @@ typedef struct {
     /*
      * update:
      * @ctx: the AioContext
-     * @node: the handler
-     * @is_new: is the file descriptor already being monitored?
+     * @old_node: the existing handler or NULL if this file descriptor is being
+     *            monitored for the first time
+     * @new_node: the new handler or NULL if this file descriptor is being
+     *            removed
      *
-     * Add/remove/modify a monitored file descriptor.  There are three cases:
-     * 1. node->pfd.events == 0 means remove the file descriptor.
-     * 2. !is_new means modify an already monitored file descriptor.
-     * 3. is_new means add a new file descriptor.
+     * Add/remove/modify a monitored file descriptor.
      *
      * Called with ctx->list_lock acquired.
      */
-    void (*update)(AioContext *ctx, AioHandler *node, bool is_new);
+    void (*update)(AioContext *ctx, AioHandler *old_node, AioHandler *new_node);
 
     /*
      * wait:
diff --git a/util/aio-posix.c b/util/aio-posix.c
index XXXXXXX..XXXXXXX 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -XXX,XX +XXX,XX @@ void aio_set_fd_handler(AioContext *ctx,
     atomic_set(&ctx->poll_disable_cnt,
                atomic_read(&ctx->poll_disable_cnt) + poll_disable_change);
 
-    if (new_node) {
-        ctx->fdmon_ops->update(ctx, new_node, is_new);
-    } else if (node) {
-        /* Unregister deleted fd_handler */
-        ctx->fdmon_ops->update(ctx, node, false);
-    }
+    ctx->fdmon_ops->update(ctx, node, new_node);
     qemu_lockcnt_unlock(&ctx->list_lock);
     aio_notify(ctx);
 
diff --git a/util/fdmon-epoll.c b/util/fdmon-epoll.c
index XXXXXXX..XXXXXXX 100644
--- a/util/fdmon-epoll.c
+++ b/util/fdmon-epoll.c
@@ -XXX,XX +XXX,XX @@ static inline int epoll_events_from_pfd(int pfd_events)
            (pfd_events & G_IO_ERR ? EPOLLERR : 0);
 }
 
-static void fdmon_epoll_update(AioContext *ctx, AioHandler *node, bool is_new)
+static void fdmon_epoll_update(AioContext *ctx,
+                               AioHandler *old_node,
+                               AioHandler *new_node)
 {
-    struct epoll_event event;
+    struct epoll_event event = {
+        .data.ptr = new_node,
+        .events = new_node ? epoll_events_from_pfd(new_node->pfd.events) : 0,
+    };
     int r;
-    int ctl;
 
-    if (!node->pfd.events) {
-        ctl = EPOLL_CTL_DEL;
+    if (!new_node) {
+        r = epoll_ctl(ctx->epollfd, EPOLL_CTL_DEL, old_node->pfd.fd, &event);
+    } else if (!old_node) {
+        r = epoll_ctl(ctx->epollfd, EPOLL_CTL_ADD, new_node->pfd.fd, &event);
     } else {
-        event.data.ptr = node;
-        event.events = epoll_events_from_pfd(node->pfd.events);
-        ctl = is_new ? EPOLL_CTL_ADD : EPOLL_CTL_MOD;
+        r = epoll_ctl(ctx->epollfd, EPOLL_CTL_MOD, new_node->pfd.fd, &event);
     }
 
-    r = epoll_ctl(ctx->epollfd, ctl, node->pfd.fd, &event);
     if (r) {
         fdmon_epoll_disable(ctx);
     }
diff --git a/util/fdmon-poll.c b/util/fdmon-poll.c
index XXXXXXX..XXXXXXX 100644
--- a/util/fdmon-poll.c
+++ b/util/fdmon-poll.c
@@ -XXX,XX +XXX,XX @@ static int fdmon_poll_wait(AioContext *ctx, AioHandlerList *ready_list,
     return ret;
 }
 
-static void fdmon_poll_update(AioContext *ctx, AioHandler *node, bool is_new)
+static void fdmon_poll_update(AioContext *ctx,
+                              AioHandler *old_node,
+                              AioHandler *new_node)
 {
     /* Do nothing, AioHandler already contains the state we'll need */
 }
-- 
2.24.1

The recent Linux io_uring API has several advantages over ppoll(2) and
epoll(7).  Details are given in the source code.

Add an io_uring implementation and make it the default on Linux.
Performance is the same as with epoll(7) but later patches add
optimizations that take advantage of io_uring.

It is necessary to change how aio_set_fd_handler() deals with deleting
AioHandlers since removing monitored file descriptors is asynchronous in
io_uring.  fdmon_io_uring_update() marks the AioHandler deleted and
aio_set_fd_handler() will let it handle deletion in that case.
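
The deferred deletion relies on per-handler flag bits.  The following is
a simplified, single-threaded sketch of that idea using C11 atomics: a
PENDING bit keeps a node from being queued twice, and a sticky REMOVE
bit survives dequeueing so that the completion path knows the node must
be freed.  The Node type, the NODE_* constants and the node_*() helpers
are hypothetical; the patch itself uses QEMU's atomic helpers and an
atomic QSLIST, which this plain linked list does not reproduce.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum {
    NODE_PENDING = 1 << 0,  /* node is currently on the submit list */
    NODE_ADD     = 1 << 1,  /* an "add" request is wanted */
    NODE_REMOVE  = 1 << 2,  /* sticky: kept until completion */
};

/* Hypothetical stand-in for AioHandler */
typedef struct Node {
    atomic_uint flags;
    struct Node *next;
} Node;

/* Queue a node at most once; returns true if it was linked in. */
static bool node_enqueue(Node **head, Node *node, unsigned flags)
{
    unsigned old = atomic_fetch_or(&node->flags, NODE_PENDING | flags);

    if (old & NODE_PENDING) {
        return false;       /* already queued, the flags were merged */
    }
    node->next = *head;     /* not atomic: single-threaded sketch only */
    *head = node;
    return true;
}

/* Pop a node and report its flags; NODE_REMOVE is left set. */
static Node *node_dequeue(Node **head, unsigned *flags)
{
    Node *node = *head;

    if (!node) {
        return NULL;
    }
    *head = node->next;
    *flags = atomic_fetch_and(&node->flags,
                              ~(unsigned)(NODE_PENDING | NODE_ADD));
    return node;
}

/* Completion path: returns true if the node may now be freed. */
static bool node_complete(Node *node)
{
    unsigned old = atomic_fetch_and(&node->flags, ~(unsigned)NODE_REMOVE);

    return (old & NODE_REMOVE) != 0;
}
```

A node queued first with NODE_ADD and then with NODE_REMOVE is linked
only once, is dequeued with both requests visible, and is reported as
freeable by the first completion that observes NODE_REMOVE.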

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Link: https://lore.kernel.org/r/20200305170806.1313245-6-stefanha@redhat.com
Message-Id: <20200305170806.1313245-6-stefanha@redhat.com>
---
 configure             |   5 +
 include/block/aio.h   |   9 ++
 util/Makefile.objs    |   1 +
 util/aio-posix.c      |  20 ++-
 util/aio-posix.h      |  20 ++-
 util/fdmon-io_uring.c | 326 ++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 376 insertions(+), 5 deletions(-)
 create mode 100644 util/fdmon-io_uring.c

diff --git a/configure b/configure
index XXXXXXX..XXXXXXX 100755
--- a/configure
+++ b/configure
@@ -XXX,XX +XXX,XX @@ if test "$linux_io_uring" != "no" ; then
     linux_io_uring_cflags=$($pkg_config --cflags liburing)
     linux_io_uring_libs=$($pkg_config --libs liburing)
     linux_io_uring=yes
+
+    # io_uring is used in libqemuutil.a where per-file -libs variables are not
+    # seen by programs linking the archive.  It's not ideal, but just add the
+    # library dependency globally.
+    LIBS="$linux_io_uring_libs $LIBS"
   else
     if test "$linux_io_uring" = "yes" ; then
       feature_not_found "linux io_uring" "Install liburing devel"
diff --git a/include/block/aio.h b/include/block/aio.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -XXX,XX +XXX,XX @@
 #ifndef QEMU_AIO_H
 #define QEMU_AIO_H
 
+#ifdef CONFIG_LINUX_IO_URING
+#include <liburing.h>
+#endif
 #include "qemu/queue.h"
 #include "qemu/event_notifier.h"
 #include "qemu/thread.h"
@@ -XXX,XX +XXX,XX @@ struct BHListSlice {
     QSIMPLEQ_ENTRY(BHListSlice) next;
 };
 
+typedef QSLIST_HEAD(, AioHandler) AioHandlerSList;
+
 struct AioContext {
     GSource source;
 
@@ -XXX,XX +XXX,XX @@ struct AioContext {
      * locking.
      */
     struct LuringState *linux_io_uring;
+
+    /* State for file descriptor monitoring using Linux io_uring */
+    struct io_uring fdmon_io_uring;
+    AioHandlerSList submit_list;
 #endif
 
     /* TimerLists for calling timers - one per clock type.  Has its own
diff --git a/util/Makefile.objs b/util/Makefile.objs
index XXXXXXX..XXXXXXX 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -XXX,XX +XXX,XX @@ util-obj-$(call lnot,$(CONFIG_ATOMIC64)) += atomic64.o
 util-obj-$(CONFIG_POSIX) += aio-posix.o
 util-obj-$(CONFIG_POSIX) += fdmon-poll.o
 util-obj-$(CONFIG_EPOLL_CREATE1) += fdmon-epoll.o
+util-obj-$(CONFIG_LINUX_IO_URING) += fdmon-io_uring.o
 util-obj-$(CONFIG_POSIX) += compatfd.o
 util-obj-$(CONFIG_POSIX) += event_notifier-posix.o
 util-obj-$(CONFIG_POSIX) += mmap-alloc.o
diff --git a/util/aio-posix.c b/util/aio-posix.c
index XXXXXXX..XXXXXXX 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -XXX,XX +XXX,XX @@ static bool aio_remove_fd_handler(AioContext *ctx, AioHandler *node)
         g_source_remove_poll(&ctx->source, &node->pfd);
     }
 
+    node->pfd.revents = 0;
+
+    /* If the fd monitor has already marked it deleted, leave it alone */
+    if (QLIST_IS_INSERTED(node, node_deleted)) {
+        return false;
+    }
+
     /* If a read is in progress, just mark the node as deleted */
     if (qemu_lockcnt_count(&ctx->list_lock)) {
         QLIST_INSERT_HEAD_RCU(&ctx->deleted_aio_handlers, node, node_deleted);
-        node->pfd.revents = 0;
         return false;
     }
     /* Otherwise, delete it for real.  We can't just mark it as
@@ -XXX,XX +XXX,XX @@ void aio_set_fd_handler(AioContext *ctx,
 
         QLIST_INSERT_HEAD_RCU(&ctx->aio_handlers, new_node, node);
     }
-    if (node) {
-        deleted = aio_remove_fd_handler(ctx, node);
-    }
 
     /* No need to order poll_disable_cnt writes against other updates;
      * the counter is only used to avoid wasting time and latency on
@@ -XXX,XX +XXX,XX @@ void aio_set_fd_handler(AioContext *ctx,
                atomic_read(&ctx->poll_disable_cnt) + poll_disable_change);
 
     ctx->fdmon_ops->update(ctx, node, new_node);
+    if (node) {
+        deleted = aio_remove_fd_handler(ctx, node);
+    }
     qemu_lockcnt_unlock(&ctx->list_lock);
     aio_notify(ctx);
 
@@ -XXX,XX +XXX,XX @@ void aio_context_setup(AioContext *ctx)
     ctx->fdmon_ops = &fdmon_poll_ops;
     ctx->epollfd = -1;
 
+    /* Use the fastest fd monitoring implementation if available */
+    if (fdmon_io_uring_setup(ctx)) {
+        return;
+    }
+
     fdmon_epoll_setup(ctx);
 }
 
 void aio_context_destroy(AioContext *ctx)
 {
+    fdmon_io_uring_destroy(ctx);
     fdmon_epoll_disable(ctx);
 }
 
diff --git a/util/aio-posix.h b/util/aio-posix.h
index XXXXXXX..XXXXXXX 100644
--- a/util/aio-posix.h
+++ b/util/aio-posix.h
@@ -XXX,XX +XXX,XX @@ struct AioHandler {
     IOHandler *io_poll_begin;
     IOHandler *io_poll_end;
     void *opaque;
-    bool is_external;
     QLIST_ENTRY(AioHandler) node;
     QLIST_ENTRY(AioHandler) node_ready; /* only used during aio_poll() */
     QLIST_ENTRY(AioHandler) node_deleted;
+#ifdef CONFIG_LINUX_IO_URING
+    QSLIST_ENTRY(AioHandler) node_submitted;
+    unsigned flags; /* see fdmon-io_uring.c */
+#endif
+    bool is_external;
 };
 
 /* Add a handler to a ready list */
@@ -XXX,XX +XXX,XX @@ static inline void fdmon_epoll_disable(AioContext *ctx)
 }
 #endif /* !CONFIG_EPOLL_CREATE1 */
 
+#ifdef CONFIG_LINUX_IO_URING
+bool fdmon_io_uring_setup(AioContext *ctx);
+void fdmon_io_uring_destroy(AioContext *ctx);
+#else
+static inline bool fdmon_io_uring_setup(AioContext *ctx)
+{
+    return false;
+}
+
+static inline void fdmon_io_uring_destroy(AioContext *ctx)
+{
+}
+#endif /* !CONFIG_LINUX_IO_URING */
+
 #endif /* AIO_POSIX_H */
diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/util/fdmon-io_uring.c
@@ -XXX,XX +XXX,XX @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Linux io_uring file descriptor monitoring
+ *
+ * The Linux io_uring API supports file descriptor monitoring with a few
+ * advantages over existing APIs like poll(2) and epoll(7):
+ *
+ * 1. Userspace polling of events is possible because the completion queue (cq
+ *    ring) is shared between the kernel and userspace.  This allows
+ *    applications that rely on userspace polling to also monitor file
+ *    descriptors in the same userspace polling loop.
+ *
+ * 2. Submission and completion are batched and done together in a single
+ *    system call.  This minimizes the number of system calls.
+ *
+ * 3. File descriptor monitoring is O(1) like epoll(7) so it scales better than
+ *    poll(2).
+ *
+ * 4. Nanosecond timeouts are supported so it requires fewer syscalls than
+ *    epoll(7).
+ *
+ * This code only monitors file descriptors and does not do asynchronous disk
+ * I/O.  Implementing disk I/O efficiently has other requirements and should
+ * use a separate io_uring so it does not make sense to unify the code.
+ *
+ * File descriptor monitoring is implemented using the following operations:
+ *
+ * 1. IORING_OP_POLL_ADD - adds a file descriptor to be monitored.
+ * 2. IORING_OP_POLL_REMOVE - removes a file descriptor being monitored.  When
+ *    the poll mask changes for a file descriptor it is first removed and then
+ *    re-added with the new poll mask, so this operation is also used as part
+ *    of modifying an existing monitored file descriptor.
+ * 3. IORING_OP_TIMEOUT - added every time a blocking syscall is made to wait
+ *    for events.  This operation self-cancels if another event completes
+ *    before the timeout.
+ *
+ * io_uring calls the submission queue the "sq ring" and the completion queue
+ * the "cq ring".  Ring entries are called "sqe" and "cqe", respectively.
+ *
+ * The code is structured so that sq/cq rings are only modified within
+ * fdmon_io_uring_wait().  Changes to AioHandlers are made by enqueuing them on
+ * ctx->submit_list so that fdmon_io_uring_wait() can submit IORING_OP_POLL_ADD
+ * and/or IORING_OP_POLL_REMOVE sqes for them.
+ */
+
+#include "qemu/osdep.h"
+#include <poll.h>
+#include "qemu/rcu_queue.h"
+#include "aio-posix.h"
+
+enum {
+    FDMON_IO_URING_ENTRIES  = 128, /* sq/cq ring size */
+
+    /* AioHandler::flags */
+    FDMON_IO_URING_PENDING  = (1 << 0),
+    FDMON_IO_URING_ADD      = (1 << 1),
+    FDMON_IO_URING_REMOVE   = (1 << 2),
+};
+
+static inline int poll_events_from_pfd(int pfd_events)
+{
+    return (pfd_events & G_IO_IN ? POLLIN : 0) |
+           (pfd_events & G_IO_OUT ? POLLOUT : 0) |
+           (pfd_events & G_IO_HUP ? POLLHUP : 0) |
+           (pfd_events & G_IO_ERR ? POLLERR : 0);
+}
+
+static inline int pfd_events_from_poll(int poll_events)
+{
+    return (poll_events & POLLIN ? G_IO_IN : 0) |
+           (poll_events & POLLOUT ? G_IO_OUT : 0) |
+           (poll_events & POLLHUP ? G_IO_HUP : 0) |
+           (poll_events & POLLERR ? G_IO_ERR : 0);
+}
+
+/*
+ * Returns an sqe for submitting a request.  Must only be called within
+ * fdmon_io_uring_wait().
+ */
+static struct io_uring_sqe *get_sqe(AioContext *ctx)
+{
+    struct io_uring *ring = &ctx->fdmon_io_uring;
+    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
+    int ret;
+
+    if (likely(sqe)) {
+        return sqe;
+    }
+
+    /* No free sqes left, submit pending sqes first */
+    ret = io_uring_submit(ring);
+    assert(ret > 1);
+    sqe = io_uring_get_sqe(ring);
+    assert(sqe);
+    return sqe;
+}
+
+/* Atomically enqueue an AioHandler for sq ring submission */
+static void enqueue(AioHandlerSList *head, AioHandler *node, unsigned flags)
+{
+    unsigned old_flags;
+
+    old_flags = atomic_fetch_or(&node->flags, FDMON_IO_URING_PENDING | flags);
+    if (!(old_flags & FDMON_IO_URING_PENDING)) {
+        QSLIST_INSERT_HEAD_ATOMIC(head, node, node_submitted);
+    }
+}
+
+/* Dequeue an AioHandler for sq ring submission.  Called by fill_sq_ring(). */
+static AioHandler *dequeue(AioHandlerSList *head, unsigned *flags)
+{
+    AioHandler *node = QSLIST_FIRST(head);
+
+    if (!node) {
+        return NULL;
+    }
+
+    /* Doesn't need to be atomic since fill_sq_ring() moves the list */
+    QSLIST_REMOVE_HEAD(head, node_submitted);
+
+    /*
+     * Don't clear FDMON_IO_URING_REMOVE.  It's sticky so it can serve two
+     * purposes: telling fill_sq_ring() to submit IORING_OP_POLL_REMOVE and
+     * telling process_cqe() to delete the AioHandler when its
+     * IORING_OP_POLL_ADD completes.
+     */
+    *flags = atomic_fetch_and(&node->flags, ~(FDMON_IO_URING_PENDING |
+                                              FDMON_IO_URING_ADD));
+    return node;
+}
+
+static void fdmon_io_uring_update(AioContext *ctx,
+                                  AioHandler *old_node,
+                                  AioHandler *new_node)
+{
+    if (new_node) {
+        enqueue(&ctx->submit_list, new_node, FDMON_IO_URING_ADD);
+    }
+
+    if (old_node) {
+        /*
+         * Deletion is tricky because IORING_OP_POLL_ADD and
+         * IORING_OP_POLL_REMOVE are async.  We need to wait for the original
+         * IORING_OP_POLL_ADD to complete before this handler can be freed
+         * safely.
+         *
+         * It's possible that the file descriptor becomes ready and the
+         * IORING_OP_POLL_ADD cqe is enqueued before IORING_OP_POLL_REMOVE is
+         * submitted, too.
+         *
+         * Mark this handler deleted right now but don't place it on
+         * ctx->deleted_aio_handlers yet.  Instead, manually fudge the list
+         * entry to make QLIST_IS_INSERTED() think this handler has been
+         * inserted and other code recognizes this AioHandler as deleted.
+         *
+         * Once the original IORING_OP_POLL_ADD completes we enqueue the
+         * handler on the real ctx->deleted_aio_handlers list to be freed.
+         */
+        assert(!QLIST_IS_INSERTED(old_node, node_deleted));
+        old_node->node_deleted.le_prev = &old_node->node_deleted.le_next;
+
+        enqueue(&ctx->submit_list, old_node, FDMON_IO_URING_REMOVE);
+    }
+}
+
+static void add_poll_add_sqe(AioContext *ctx, AioHandler *node)
+{
+    struct io_uring_sqe *sqe = get_sqe(ctx);
+    int events = poll_events_from_pfd(node->pfd.events);
+
+    io_uring_prep_poll_add(sqe, node->pfd.fd, events);
+    io_uring_sqe_set_data(sqe, node);
+}
+
+static void add_poll_remove_sqe(AioContext *ctx, AioHandler *node)
+{
+    struct io_uring_sqe *sqe = get_sqe(ctx);
+
+    io_uring_prep_poll_remove(sqe, node);
+}
+
+/* Add a timeout that self-cancels when another cqe becomes ready */
+static void add_timeout_sqe(AioContext *ctx, int64_t ns)
+{
+    struct io_uring_sqe *sqe;
+    struct __kernel_timespec ts = {
+        .tv_sec = ns / NANOSECONDS_PER_SECOND,
+        .tv_nsec = ns % NANOSECONDS_PER_SECOND,
+    };
+
+    sqe = get_sqe(ctx);
+    io_uring_prep_timeout(sqe, &ts, 1, 0);
+}
+
+/* Add sqes from ctx->submit_list for submission */
+static void fill_sq_ring(AioContext *ctx)
+{
+    AioHandlerSList submit_list;
+    AioHandler *node;
+    unsigned flags;
+
+    QSLIST_MOVE_ATOMIC(&submit_list, &ctx->submit_list);
+
+    while ((node = dequeue(&submit_list, &flags))) {
+        /* Order matters, just in case both flags were set */
+        if (flags & FDMON_IO_URING_ADD) {
+            add_poll_add_sqe(ctx, node);
+        }
+        if (flags & FDMON_IO_URING_REMOVE) {
+            add_poll_remove_sqe(ctx, node);
+        }
+    }
+}
+
+/* Returns true if a handler became ready */
+static bool process_cqe(AioContext *ctx,
+                        AioHandlerList *ready_list,
+                        struct io_uring_cqe *cqe)
+{
+    AioHandler *node = io_uring_cqe_get_data(cqe);
+    unsigned flags;
+
+    /* poll_timeout and poll_remove have a zero user_data field */
+    if (!node) {
+        return false;
+    }
+
+    /*
+     * Deletion can only happen when IORING_OP_POLL_ADD completes.  If we race
+     * with enqueue() here then we can safely clear the FDMON_IO_URING_REMOVE
+     * bit before IORING_OP_POLL_REMOVE is submitted.
+     */
+    flags = atomic_fetch_and(&node->flags, ~FDMON_IO_URING_REMOVE);
+    if (flags & FDMON_IO_URING_REMOVE) {
+        QLIST_INSERT_HEAD_RCU(&ctx->deleted_aio_handlers, node, node_deleted);
+        return false;
+    }
+
+    aio_add_ready_handler(ready_list, node, pfd_events_from_poll(cqe->res));
+
+    /* IORING_OP_POLL_ADD is one-shot so we must re-arm it */
+    add_poll_add_sqe(ctx, node);
+    return true;
+}
+
+static int process_cq_ring(AioContext *ctx, AioHandlerList *ready_list)
+{
+    struct io_uring *ring = &ctx->fdmon_io_uring;
+    struct io_uring_cqe *cqe;
+    unsigned num_cqes = 0;
+    unsigned num_ready = 0;
+    unsigned head;
+
+    io_uring_for_each_cqe(ring, head, cqe) {
+        if (process_cqe(ctx, ready_list, cqe)) {
+            num_ready++;
+        }
+
+        num_cqes++;
+    }
+
+    io_uring_cq_advance(ring, num_cqes);
+    return num_ready;
+}
+
+static int fdmon_io_uring_wait(AioContext *ctx, AioHandlerList *ready_list,
+                               int64_t timeout)
+{
+    unsigned wait_nr = 1; /* block until at least one cqe is ready */
+    int ret;
+
+    /* Fall back while external clients are disabled */
+    if (atomic_read(&ctx->external_disable_cnt)) {
+        return fdmon_poll_ops.wait(ctx, ready_list, timeout);
+    }
+
+    if (timeout == 0) {
+        wait_nr = 0; /* non-blocking */
+    } else if (timeout > 0) {
+        add_timeout_sqe(ctx, timeout);
+    }
+
+    fill_sq_ring(ctx);
+
+    ret = io_uring_submit_and_wait(&ctx->fdmon_io_uring, wait_nr);
+    assert(ret >= 0);
+
+    return process_cq_ring(ctx, ready_list);
+}
+
+static const FDMonOps fdmon_io_uring_ops = {
+    .update = fdmon_io_uring_update,
+    .wait = fdmon_io_uring_wait,
+};
+
+bool fdmon_io_uring_setup(AioContext *ctx)
+{
+    int ret;
+
+    ret = io_uring_queue_init(FDMON_IO_URING_ENTRIES, &ctx->fdmon_io_uring, 0);
+    if (ret != 0) {
+        return false;
+    }
+
+    QSLIST_INIT(&ctx->submit_list);
+    ctx->fdmon_ops = &fdmon_io_uring_ops;
+    return true;
+}
+
+void fdmon_io_uring_destroy(AioContext *ctx)
+{
+    if (ctx->fdmon_ops == &fdmon_io_uring_ops) {
+        AioHandler *node;
+
+        io_uring_queue_exit(&ctx->fdmon_io_uring);
+
+        /* No need to submit these anymore, just free them. */
+        while ((node = QSLIST_FIRST_RCU(&ctx->submit_list))) {
+            QSLIST_REMOVE_HEAD_RCU(&ctx->submit_list, node_submitted);
+            QLIST_REMOVE(node, node);
+            g_free(node);
+        }
+
+        ctx->fdmon_ops = &fdmon_poll_ops;
+    }
+}
-- 
2.24.1

Unlike ppoll(2) and epoll(7), Linux io_uring completions can be polled
from userspace.  Previously userspace polling was only allowed when all
AioHandlers had an ->io_poll() callback.  This prevented starvation of
fds by userspace pollable handlers.

Add the FDMonOps->need_wait() callback that enables userspace polling
even when some AioHandlers lack ->io_poll().

For example, it's now possible to do userspace polling when a TCP/IP
socket is monitored thanks to Linux io_uring.
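
As a rough illustration of the dispatch described above, the following
standalone sketch models how a per-implementation need_wait() hook decides
whether userspace polling may proceed.  All Sketch* and sketch_* names are
hypothetical stand-ins invented for this example; they are not the QEMU
types, and the real callbacks operate on AioContext and the io_uring
completion queue.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for AioContext and FDMonOps. */
typedef struct SketchCtx SketchCtx;

typedef struct {
    bool (*need_wait)(SketchCtx *ctx);
} SketchFDMonOps;

struct SketchCtx {
    const SketchFDMonOps *ops;
    int poll_disable_cnt;   /* non-zero: some handler lacks ->io_poll() */
    unsigned cq_ready;      /* completions already visible from userspace */
};

/* ppoll/epoll style: fd readiness cannot be observed from userspace,
 * so ->wait() is needed whenever polling is disabled. */
static bool sketch_poll_disabled(SketchCtx *ctx)
{
    return ctx->poll_disable_cnt != 0;
}

/* io_uring style: ->wait() is only needed once completions are pending. */
static bool sketch_uring_need_wait(SketchCtx *ctx)
{
    return ctx->cq_ready != 0;
}

static const SketchFDMonOps sketch_poll_ops = {
    .need_wait = sketch_poll_disabled,
};

static const SketchFDMonOps sketch_uring_ops = {
    .need_wait = sketch_uring_need_wait,
};

/* Same shape as the check in try_poll_mode(): poll from userspace only
 * if there is a time budget and the fd monitor does not need ->wait(). */
static bool sketch_may_poll(SketchCtx *ctx, long max_ns)
{
    return max_ns != 0 && !ctx->ops->need_wait(ctx);
}
```

With the io_uring-style hook, a context that contains handlers without
->io_poll() can still poll until a completion shows up, which is the
behavior change this patch is after.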

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Link: https://lore.kernel.org/r/20200305170806.1313245-7-stefanha@redhat.com
Message-Id: <20200305170806.1313245-7-stefanha@redhat.com>
---
 include/block/aio.h   | 19 +++++++++++++++++++
 util/aio-posix.c      | 11 ++++++++---
 util/fdmon-epoll.c    |  1 +
 util/fdmon-io_uring.c |  6 ++++++
 util/fdmon-poll.c     |  1 +
 5 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/include/block/aio.h b/include/block/aio.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -XXX,XX +XXX,XX @@ struct ThreadPool;
 struct LinuxAioState;
 struct LuringState;
 
+/* Is polling disabled? */
+bool aio_poll_disabled(AioContext *ctx);
+
 /* Callbacks for file descriptor monitoring implementations */
 typedef struct {
     /*
@@ -XXX,XX +XXX,XX @@ typedef struct {
      * Returns: number of ready file descriptors.
      */
     int (*wait)(AioContext *ctx, AioHandlerList *ready_list, int64_t timeout);
+
+    /*
+     * need_wait:
+     * @ctx: the AioContext
+     *
+     * Tell aio_poll() when to stop userspace polling early because ->wait()
+     * has fds ready.
+     *
+     * File descriptor monitoring implementations that cannot poll fd readiness
+     * from userspace should use aio_poll_disabled() here.  This ensures that
+     * file descriptors are not starved by handlers that frequently make
+     * progress via userspace polling.
+     *
+     * Returns: true if ->wait() should be called, false otherwise.
+     */
+    bool (*need_wait)(AioContext *ctx);
 } FDMonOps;
 
 /*
diff --git a/util/aio-posix.c b/util/aio-posix.c
index XXXXXXX..XXXXXXX 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -XXX,XX +XXX,XX @@
 #include "trace.h"
 #include "aio-posix.h"
 
+bool aio_poll_disabled(AioContext *ctx)
+{
+    return atomic_read(&ctx->poll_disable_cnt);
+}
+
 void aio_add_ready_handler(AioHandlerList *ready_list,
                            AioHandler *node,
                            int revents)
@@ -XXX,XX +XXX,XX @@ static bool run_poll_handlers(AioContext *ctx, int64_t max_ns, int64_t *timeout)
         elapsed_time = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - start_time;
         max_ns = qemu_soonest_timeout(*timeout, max_ns);
         assert(!(max_ns && progress));
-    } while (elapsed_time < max_ns && !atomic_read(&ctx->poll_disable_cnt));
+    } while (elapsed_time < max_ns && !ctx->fdmon_ops->need_wait(ctx));
 
     /* If time has passed with no successful polling, adjust *timeout to
      * keep the same ending time.
@@ -XXX,XX +XXX,XX @@ static bool try_poll_mode(AioContext *ctx, int64_t *timeout)
 {
     int64_t max_ns = qemu_soonest_timeout(*timeout, ctx->poll_ns);
 
-    if (max_ns && !atomic_read(&ctx->poll_disable_cnt)) {
+    if (max_ns && !ctx->fdmon_ops->need_wait(ctx)) {
         poll_set_started(ctx, true);
 
         if (run_poll_handlers(ctx, max_ns, timeout)) {
@@ -XXX,XX +XXX,XX @@ bool aio_poll(AioContext *ctx, bool blocking)
     /* If polling is allowed, non-blocking aio_poll does not need the
      * system call---a single round of run_poll_handlers_once suffices.
      */
-    if (timeout || atomic_read(&ctx->poll_disable_cnt)) {
+    if (timeout || ctx->fdmon_ops->need_wait(ctx)) {
         ret = ctx->fdmon_ops->wait(ctx, &ready_list, timeout);
     }
 
diff --git a/util/fdmon-epoll.c b/util/fdmon-epoll.c
index XXXXXXX..XXXXXXX 100644
--- a/util/fdmon-epoll.c
+++ b/util/fdmon-epoll.c
@@ -XXX,XX +XXX,XX @@ out:
 static const FDMonOps fdmon_epoll_ops = {
     .update = fdmon_epoll_update,
     .wait = fdmon_epoll_wait,
+    .need_wait = aio_poll_disabled,
 };
 
 static bool fdmon_epoll_try_enable(AioContext *ctx)
diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
index XXXXXXX..XXXXXXX 100644
--- a/util/fdmon-io_uring.c
+++ b/util/fdmon-io_uring.c
@@ -XXX,XX +XXX,XX @@ static int fdmon_io_uring_wait(AioContext *ctx, AioHandlerList *ready_list,
     return process_cq_ring(ctx, ready_list);
 }
 
+static bool fdmon_io_uring_need_wait(AioContext *ctx)
+{
+    return io_uring_cq_ready(&ctx->fdmon_io_uring);
+}
+
 static const FDMonOps fdmon_io_uring_ops = {
     .update = fdmon_io_uring_update,
     .wait = fdmon_io_uring_wait,
+    .need_wait = fdmon_io_uring_need_wait,
 };
 
 bool fdmon_io_uring_setup(AioContext *ctx)
diff --git a/util/fdmon-poll.c b/util/fdmon-poll.c
index XXXXXXX..XXXXXXX 100644
--- a/util/fdmon-poll.c
+++ b/util/fdmon-poll.c
@@ -XXX,XX +XXX,XX @@ static void fdmon_poll_update(AioContext *ctx,
 const FDMonOps fdmon_poll_ops = {
     .update = fdmon_poll_update,
     .wait = fdmon_poll_wait,
+    .need_wait = aio_poll_disabled,
 };
-- 
2.24.1

When there are many poll handlers it's likely that some of them are idle
most of the time.  Remove handlers that haven't had activity recently so
that the polling loop scales better for guests with a large number of
devices.

This feature only takes effect for the Linux io_uring fd monitoring
implementation because it is capable of combining fd monitoring with
userspace polling.  The other implementations can't do that and risk
starving fds in favor of poll handlers, so don't try this optimization
when they are in use.

IOPS improves from 10k to 105k when the guest has 100
virtio-blk-pci,num-queues=32 devices and 1 virtio-blk-pci,num-queues=1
device for rw=randread,iodepth=1,bs=4k,ioengine=libaio on NVMe.
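
The idle bookkeeping can be pictured with the standalone sketch below.  It
is a simplified model under stated assumptions: SketchHandler and the
sketch_* helpers are hypothetical names for this example, the list
membership is reduced to a single flag, and the ->io_poll_end() call and
final poll that the patch performs on removal are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same 7 second interval as POLL_IDLE_INTERVAL_NS in the patch. */
#define SKETCH_IDLE_INTERVAL_NS (7 * INT64_C(1000000000))

/* Hypothetical stand-in for the relevant AioHandler fields. */
typedef struct {
    int64_t poll_idle_timeout; /* 0 means the deadline is not armed yet */
    bool polled;               /* models membership in the poll list */
} SketchHandler;

/* Called when the handler's poll callback made progress: push the
 * idle deadline further into the future. */
static void sketch_mark_active(SketchHandler *h, int64_t now)
{
    h->poll_idle_timeout = now + SKETCH_IDLE_INTERVAL_NS;
}

/* Called after a polling round.  Returns true if the handler was
 * dropped from userspace polling because it stayed idle too long. */
static bool sketch_check_idle(SketchHandler *h, int64_t now)
{
    if (!h->polled) {
        return false;
    }
    if (h->poll_idle_timeout == 0) {
        /* First visit: arm the deadline instead of removing. */
        h->poll_idle_timeout = now + SKETCH_IDLE_INTERVAL_NS;
        return false;
    }
    if (now >= h->poll_idle_timeout) {
        h->poll_idle_timeout = 0;
        h->polled = false;
        return true;
    }
    return false;
}
```

A removed handler is not lost: in the patch it rejoins the poll list the
next time its fd becomes ready in aio_dispatch_handler().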

[Clarified aio_poll_handlers locking discipline explanation in comment
after discussion with Paolo Bonzini <pbonzini@redhat.com>.
--Stefan]

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Link: https://lore.kernel.org/r/20200305170806.1313245-8-stefanha@redhat.com
Message-Id: <20200305170806.1313245-8-stefanha@redhat.com>
---
 include/block/aio.h |  8 ++++
 util/aio-posix.c    | 93 +++++++++++++++++++++++++++++++++++++++++----
 util/aio-posix.h    |  2 +
 util/trace-events   |  2 +
 4 files changed, 98 insertions(+), 7 deletions(-)

diff --git a/include/block/aio.h b/include/block/aio.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -XXX,XX +XXX,XX @@ struct AioContext {
     int64_t poll_grow;      /* polling time growth factor */
     int64_t poll_shrink;    /* polling time shrink factor */
 
+    /*
+     * List of handlers participating in userspace polling.  Protected by
+     * ctx->list_lock.  Iterated and modified mostly by the event loop thread
+     * from aio_poll() with ctx->list_lock incremented.  aio_set_fd_handler()
+     * only touches the list to delete nodes if ctx->list_lock's count is zero.
+     */
+    AioHandlerList poll_aio_handlers;
+
     /* Are we in polling mode or monitoring file descriptors? */
     bool poll_started;
 
diff --git a/util/aio-posix.c b/util/aio-posix.c
index XXXXXXX..XXXXXXX 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -XXX,XX +XXX,XX @@
 #include "trace.h"
 #include "aio-posix.h"
 
+/* Stop userspace polling on a handler if it isn't active for some time */
+#define POLL_IDLE_INTERVAL_NS (7 * NANOSECONDS_PER_SECOND)
+
 bool aio_poll_disabled(AioContext *ctx)
 {
     return atomic_read(&ctx->poll_disable_cnt);
@@ -XXX,XX +XXX,XX @@ static bool aio_remove_fd_handler(AioContext *ctx, AioHandler *node)
      * deleted because deleted nodes are only cleaned up while
      * no one is walking the handlers list.
      */
+    QLIST_SAFE_REMOVE(node, node_poll);
     QLIST_REMOVE(node, node);
     return true;
 }
@@ -XXX,XX +XXX,XX @@ static bool poll_set_started(AioContext *ctx, bool started)
     ctx->poll_started = started;
 
     qemu_lockcnt_inc(&ctx->list_lock);
-    QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
+    QLIST_FOREACH(node, &ctx->poll_aio_handlers, node_poll) {
         IOHandler *fn;
 
         if (QLIST_IS_INSERTED(node, node_deleted)) {
@@ -XXX,XX +XXX,XX @@ static void aio_free_deleted_handlers(AioContext *ctx)
     while ((node = QLIST_FIRST_RCU(&ctx->deleted_aio_handlers))) {
         QLIST_REMOVE(node, node);
         QLIST_REMOVE(node, node_deleted);
+        QLIST_SAFE_REMOVE(node, node_poll);
         g_free(node);
     }
 
@@ -XXX,XX +XXX,XX @@ static bool aio_dispatch_handler(AioContext *ctx, AioHandler *node)
     revents = node->pfd.revents & node->pfd.events;
     node->pfd.revents = 0;
 
+    /*
+     * Start polling AioHandlers when they become ready because activity is
+     * likely to continue.  Note that starvation is theoretically possible when
+     * fdmon_supports_polling(), but only until the fd fires for the first
+     * time.
+     */
+    if (!QLIST_IS_INSERTED(node, node_deleted) &&
+        !QLIST_IS_INSERTED(node, node_poll) &&
+        node->io_poll) {
+        trace_poll_add(ctx, node, node->pfd.fd, revents);
+        if (ctx->poll_started && node->io_poll_begin) {
+            node->io_poll_begin(node->opaque);
+        }
+        QLIST_INSERT_HEAD(&ctx->poll_aio_handlers, node, node_poll);
+    }
+
     if (!QLIST_IS_INSERTED(node, node_deleted) &&
         (revents & (G_IO_IN | G_IO_HUP | G_IO_ERR)) &&
         aio_node_check(ctx, node->is_external) &&
@@ -XXX,XX +XXX,XX @@ void aio_dispatch(AioContext *ctx)
     timerlistgroup_run_timers(&ctx->tlg);
 }
 
-static bool run_poll_handlers_once(AioContext *ctx, int64_t *timeout)
+static bool run_poll_handlers_once(AioContext *ctx,
+                                   int64_t now,
+                                   int64_t *timeout)
 {
     bool progress = false;
     AioHandler *node;
+    AioHandler *tmp;
 
-    QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
-        if (!QLIST_IS_INSERTED(node, node_deleted) && node->io_poll &&
-            aio_node_check(ctx, node->is_external) &&
+    QLIST_FOREACH_SAFE(node, &ctx->poll_aio_handlers, node_poll, tmp) {
+        if (aio_node_check(ctx, node->is_external) &&
             node->io_poll(node->opaque)) {
+            node->poll_idle_timeout = now + POLL_IDLE_INTERVAL_NS;
+
             /*
              * Polling was successful, exit try_poll_mode immediately
              * to adjust the next polling time.
@@ -XXX,XX +XXX,XX @@ static bool run_poll_handlers_once(AioContext *ctx, int64_t *timeout)
     return progress;
 }
 
+static bool fdmon_supports_polling(AioContext *ctx)
+{
+    return ctx->fdmon_ops->need_wait != aio_poll_disabled;
+}
+
+static bool remove_idle_poll_handlers(AioContext *ctx, int64_t now)
+{
+    AioHandler *node;
+    AioHandler *tmp;
+    bool progress = false;
+
+    /*
+     * File descriptor monitoring implementations without userspace polling
+     * support suffer from starvation when a subset of handlers is polled
+     * because fds will not be processed in a timely fashion.  Don't remove
+     * idle poll handlers.
+     */
+    if (!fdmon_supports_polling(ctx)) {
+        return false;
+    }
+
+    QLIST_FOREACH_SAFE(node, &ctx->poll_aio_handlers, node_poll, tmp) {
+        if (node->poll_idle_timeout == 0LL) {
+            node->poll_idle_timeout = now + POLL_IDLE_INTERVAL_NS;
+        } else if (now >= node->poll_idle_timeout) {
+            trace_poll_remove(ctx, node, node->pfd.fd);
+            node->poll_idle_timeout = 0LL;
+            QLIST_SAFE_REMOVE(node, node_poll);
+            if (ctx->poll_started && node->io_poll_end) {
+                node->io_poll_end(node->opaque);
+
+                /*
+                 * Final poll in case ->io_poll_end() races with an event.
+                 * Nevermind about re-adding the handler in the rare case where

+                 * this causes progress.
+                 */
+                progress = node->io_poll(node->opaque) || progress;
+            }
+        }
+    }
+
+    return progress;
+}
+
 /* run_poll_handlers:
  * @ctx: the AioContext
  * @max_ns: maximum time to poll for, in nanoseconds
@@ -XXX,XX +XXX,XX @@ static bool run_poll_handlers(AioContext *ctx, int64_t max_ns, int64_t *timeout)
 
     start_time = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
     do {
-        progress = run_poll_handlers_once(ctx, timeout);
+        progress = run_poll_handlers_once(ctx, start_time, timeout);
         elapsed_time = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - start_time;
         max_ns = qemu_soonest_timeout(*timeout, max_ns);
         assert(!(max_ns && progress));
     } while (elapsed_time < max_ns && !ctx->fdmon_ops->need_wait(ctx));
 
+    if (remove_idle_poll_handlers(ctx, start_time + elapsed_time)) {
+        *timeout = 0;
+        progress = true;
+    }
+
     /* If time has passed with no successful polling, adjust *timeout to
      * keep the same ending time.
      */
@@ -XXX,XX +XXX,XX @@ static bool run_poll_handlers(AioContext *ctx, int64_t max_ns, int64_t *timeout)
  */
 static bool try_poll_mode(AioContext *ctx, int64_t *timeout)
 {
-    int64_t max_ns = qemu_soonest_timeout(*timeout, ctx->poll_ns);
+    int64_t max_ns;
+
+    if (QLIST_EMPTY_RCU(&ctx->poll_aio_handlers)) {
+        return false;
+    }
 
+    max_ns = qemu_soonest_timeout(*timeout, ctx->poll_ns);
     if (max_ns && !ctx->fdmon_ops->need_wait(ctx)) {
         poll_set_started(ctx, true);
 
diff --git a/util/aio-posix.h b/util/aio-posix.h
index XXXXXXX..XXXXXXX 100644
--- a/util/aio-posix.h
+++ b/util/aio-posix.h
@@ -XXX,XX +XXX,XX @@ struct AioHandler {
     QLIST_ENTRY(AioHandler) node;
     QLIST_ENTRY(AioHandler) node_ready; /* only used during aio_poll() */
     QLIST_ENTRY(AioHandler) node_deleted;
+    QLIST_ENTRY(AioHandler) node_poll;
 #ifdef CONFIG_LINUX_IO_URING
     QSLIST_ENTRY(AioHandler) node_submitted;
     unsigned flags; /* see fdmon-io_uring.c */
 #endif
+    int64_t poll_idle_timeout; /* when to stop userspace polling */
     bool is_external;
 };
 
diff --git a/util/trace-events b/util/trace-events
index XXXXXXX..XXXXXXX 100644
--- a/util/trace-events
+++ b/util/trace-events
@@ -XXX,XX +XXX,XX @@ run_poll_handlers_begin(void *ctx, int64_t max_ns, int64_t timeout) "ctx %p max_
 run_poll_handlers_end(void *ctx, bool progress, int64_t timeout) "ctx %p progress %d new timeout %"PRId64
 poll_shrink(void *ctx, int64_t old, int64_t new) "ctx %p old %"PRId64" new %"PRId64
 poll_grow(void *ctx, int64_t old, int64_t new) "ctx %p old %"PRId64" new %"PRId64
+poll_add(void *ctx, void *node, int fd, unsigned revents) "ctx %p node %p fd %d revents 0x%x"
+poll_remove(void *ctx, void *node, int fd) "ctx %p node %p fd %d"
 
 # async.c
 aio_co_schedule(void *ctx, void *co) "ctx %p co %p"
-- 
2.24.1

The following changes since commit 4c55b1d0bad8a703f0499fe62e3761a0cd288da3:

Merge remote-tracking branch 'remotes/armbru/tags/pull-error-2017-04-24' into staging (2017-04-24 14:49:48 +0100)

are available in the git repository at:

git://github.com/codyprime/qemu-kvm-jtc.git tags/block-pull-request

for you to fetch changes up to ecfa185400ade2abc9915efa924cbad1e15a21a4:

qemu-iotests: _cleanup_qemu must be called on exit (2017-04-24 15:09:33 -0400)

----------------------------------------------------------------
Pull v2, with 32-bit errors fixed.  I don't have OS X to test compile on,
but I think it is safe to assume the cause of the compile error was the same.
----------------------------------------------------------------

Ashish Mittal (2):
  block/vxhs.c: Add support for a new block device type called "vxhs"
  block/vxhs.c: Add qemu-iotests for new block device type "vxhs"

Jeff Cody (10):
  qemu-iotests: exclude vxhs from image creation via protocol
  block: add bdrv_set_read_only() helper function
  block: do not set BDS read_only if copy_on_read enabled
  block: honor BDRV_O_ALLOW_RDWR when clearing bs->read_only
  block: code movement
  block: introduce bdrv_can_set_read_only()
  block: use bdrv_can_set_read_only() during reopen
  block/rbd - update variable names to more apt names
  block/rbd: Add support for reopen()
  qemu-iotests: _cleanup_qemu must be called on exit

-- 
2.9.3

From: Ashish Mittal <ashmit602@gmail.com>

Source code for the qnio library that this code loads can be downloaded from:
https://github.com/VeritasHyperScale/libqnio.git

Sample command line using JSON syntax:
./x86_64-softmmu/qemu-system-x86_64 -name instance-00000008 -S -vnc 0.0.0.0:0
-k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5
-msg timestamp=on
'json:{"driver":"vxhs","vdisk-id":"c3e9095a-a5ee-4dce-afeb-2a59fb387410",
"server":{"host":"172.172.17.4","port":"9999"}}'

Sample command line using URI syntax:
qemu-img convert -f raw -O raw -n
/var/lib/nova/instances/_base/0c5eacd5ebea5ed914b6a3e7b18f1ce734c386ad
vxhs://192.168.0.1:9999/c6718f6b-0401-441d-a8c3-1f0064d75ee0
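
To make the URI form concrete, here is a minimal standalone sketch of how
such a URI breaks down into the same host, port and vdisk-id values that
the JSON form spells out.  It uses only the C standard library and is not
the driver's parser (the driver relies on QEMU's uri_parse());
SketchVxhsTarget and sketch_parse_vxhs_uri() are hypothetical names, and
the buffer sizes are arbitrary choices for the example.  The default port
of 9999 and the retained leading '/' in the vdisk-id follow the driver
code in this patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result type for this sketch. */
typedef struct {
    char host[256];
    int port;
    char vdisk_id[64];  /* keeps the leading '/' from the URI path */
} SketchVxhsTarget;

/* Parse "vxhs://<host>[:<port>]/<vdisk-id>".
 * Returns 0 on success, -1 if the URI is malformed. */
static int sketch_parse_vxhs_uri(const char *uri, SketchVxhsTarget *out)
{
    static const char prefix[] = "vxhs://";
    const char *p, *slash, *colon;
    size_t host_len;

    if (strncmp(uri, prefix, sizeof(prefix) - 1) != 0) {
        return -1;
    }
    p = uri + sizeof(prefix) - 1;

    /* The path (vdisk-id) starts at the first '/' after the authority. */
    slash = strchr(p, '/');
    if (!slash || slash == p || slash[1] == '\0') {
        return -1;
    }

    /* Only look for ':' inside the authority part. */
    colon = memchr(p, ':', (size_t)(slash - p));
    host_len = (size_t)((colon ? colon : slash) - p);
    if (host_len == 0 || host_len >= sizeof(out->host)) {
        return -1;
    }
    memcpy(out->host, p, host_len);
    out->host[host_len] = '\0';

    out->port = 9999;   /* default when the URI carries no port */
    if (colon) {
        char *end;
        long port = strtol(colon + 1, &end, 10);

        if (end == colon + 1 || end != slash || port <= 0 || port > 65535) {
            return -1;
        }
        out->port = (int)port;
    }

    if (strlen(slash) >= sizeof(out->vdisk_id)) {
        return -1;
    }
    strcpy(out->vdisk_id, slash);
    return 0;
}
```

Applied to the sample above, this yields host 192.168.0.1, port 9999 and
vdisk-id /c6718f6b-0401-441d-a8c3-1f0064d75ee0.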

Sample command line using TLS credentials (run in secure mode):
./qemu-io --object
tls-creds-x509,id=tls0,dir=/etc/pki/qemu/vxhs,endpoint=client -c 'read
-v 66000 2.5k' 'json:{"server.host": "127.0.0.1", "server.port": "9999",
"vdisk-id": "/test.raw", "driver": "vxhs", "tls-creds":"tls0"}'

[Jeff: Modified trace-events with the correct string formatting]

Signed-off-by: Ashish Mittal <Ashish.Mittal@veritas.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
Message-id: 1491277689-24949-2-git-send-email-Ashish.Mittal@veritas.com
---
 block/Makefile.objs  |   2 +
 block/trace-events   |  17 ++
 block/vxhs.c         | 575 +++++++++++++++++++++++++++++++++++++++++++++++++++
 configure            |  39 ++++
 qapi/block-core.json |  23 ++-
 5 files changed, 654 insertions(+), 2 deletions(-)
 create mode 100644 block/vxhs.c

diff --git a/block/Makefile.objs b/block/Makefile.objs
index XXXXXXX..XXXXXXX 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -XXX,XX +XXX,XX @@ block-obj-$(CONFIG_LIBNFS) += nfs.o
 block-obj-$(CONFIG_CURL) += curl.o
 block-obj-$(CONFIG_RBD) += rbd.o
 block-obj-$(CONFIG_GLUSTERFS) += gluster.o
+block-obj-$(CONFIG_VXHS) += vxhs.o
 block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o dirty-bitmap.o
 block-obj-y += write-threshold.o
@@ -XXX,XX +XXX,XX @@ rbd.o-cflags       := $(RBD_CFLAGS)
 rbd.o-libs         := $(RBD_LIBS)
 gluster.o-cflags   := $(GLUSTERFS_CFLAGS)
 gluster.o-libs     := $(GLUSTERFS_LIBS)
+vxhs.o-libs        := $(VXHS_LIBS)
 ssh.o-cflags       := $(LIBSSH2_CFLAGS)
 ssh.o-libs         := $(LIBSSH2_LIBS)
 block-obj-$(if $(CONFIG_BZIP2),m,n) += dmg-bz2.o
diff --git a/block/trace-events b/block/trace-events
index XXXXXXX..XXXXXXX 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -XXX,XX +XXX,XX @@ qed_aio_write_data(void *s, void *acb, int ret, uint64_t offset, size_t len) "s
 qed_aio_write_prefill(void *s, void *acb, uint64_t start, size_t len, uint64_t offset) "s %p acb %p start %"PRIu64" len %zu offset %"PRIu64
 qed_aio_write_postfill(void *s, void *acb, uint64_t start, size_t len, uint64_t offset) "s %p acb %p start %"PRIu64" len %zu offset %"PRIu64
 qed_aio_write_main(void *s, void *acb, int ret, uint64_t offset, size_t len) "s %p acb %p ret %d offset %"PRIu64" len %zu"
+
+# block/vxhs.c
+vxhs_iio_callback(int error) "ctx is NULL: error %d"
+vxhs_iio_callback_chnfail(int err, int error) "QNIO channel failed, no i/o %d, %d"
+vxhs_iio_callback_unknwn(int opcode, int err) "unexpected opcode %d, errno %d"
+vxhs_aio_rw_invalid(int req) "Invalid I/O request iodir %d"
+vxhs_aio_rw_ioerr(char *guid, int iodir, uint64_t size, uint64_t off, void *acb, int ret, int err) "IO ERROR (vDisk %s) FOR : Read/Write = %d size = %"PRIu64" offset = %"PRIu64" ACB = %p. Error = %d, errno = %d"
+vxhs_get_vdisk_stat_err(char *guid, int ret, int err) "vDisk (%s) stat ioctl failed, ret = %d, errno = %d"
+vxhs_get_vdisk_stat(char *vdisk_guid, uint64_t vdisk_size) "vDisk %s stat ioctl returned size %"PRIu64
+vxhs_complete_aio(void *acb, uint64_t ret) "aio failed acb %p ret %"PRIu64
+vxhs_parse_uri_filename(const char *filename) "URI passed via bdrv_parse_filename %s"
+vxhs_open_vdiskid(const char *vdisk_id) "Opening vdisk-id %s"
+vxhs_open_hostinfo(char *of_vsa_addr, int port) "Adding host %s:%d to BDRVVXHSState"
+vxhs_open_iio_open(const char *host) "Failed to connect to storage agent on host %s"
+vxhs_parse_uri_hostinfo(char *host, int port) "Host: IP %s, Port %d"
+vxhs_close(char *vdisk_guid) "Closing vdisk %s"
+vxhs_get_creds(const char *cacert, const char *client_key, const char *client_cert) "cacert %s, client_key %s, client_cert %s"
diff --git a/block/vxhs.c b/block/vxhs.c
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/block/vxhs.c
@@ -XXX,XX +XXX,XX @@
+/*
+ * QEMU Block driver for Veritas HyperScale (VxHS)
+ *
+ * Copyright (c) 2017 Veritas Technologies LLC.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include <qnio/qnio_api.h>
+#include <sys/param.h>
+#include "block/block_int.h"
+#include "qapi/qmp/qerror.h"
+#include "qapi/qmp/qdict.h"
+#include "qapi/qmp/qstring.h"
+#include "trace.h"
+#include "qemu/uri.h"
+#include "qapi/error.h"
+#include "qemu/uuid.h"
+#include "crypto/tlscredsx509.h"
+
+#define VXHS_OPT_FILENAME           "filename"
+#define VXHS_OPT_VDISK_ID           "vdisk-id"
+#define VXHS_OPT_SERVER             "server"
+#define VXHS_OPT_HOST               "host"
+#define VXHS_OPT_PORT               "port"
+
+/* Only accessed under QEMU global mutex */
+static uint32_t vxhs_ref;
+
+typedef enum {
+    VDISK_AIO_READ,
+    VDISK_AIO_WRITE,
+} VDISKAIOCmd;
+
+/*
+ * HyperScale AIO callbacks structure
+ */
+typedef struct VXHSAIOCB {
+    BlockAIOCB common;
+    int err;
+} VXHSAIOCB;
+
+typedef struct VXHSvDiskHostsInfo {
+    void *dev_handle; /* Device handle */
+    char *host; /* Host name or IP */
+    int port; /* Host's port number */
+} VXHSvDiskHostsInfo;
+
+/*
+ * Structure per vDisk maintained for state
+ */
+typedef struct BDRVVXHSState {
+    VXHSvDiskHostsInfo vdisk_hostinfo; /* Per host info */
+    char *vdisk_guid;
+    char *tlscredsid; /* tlscredsid */
+} BDRVVXHSState;
+
+static void vxhs_complete_aio_bh(void *opaque)
+{
+    VXHSAIOCB *acb = opaque;
+    BlockCompletionFunc *cb = acb->common.cb;
+    void *cb_opaque = acb->common.opaque;
+    int ret = 0;
+
+    if (acb->err != 0) {
+        trace_vxhs_complete_aio(acb, acb->err);
+        ret = (-EIO);
+    }
+
+    qemu_aio_unref(acb);
+    cb(cb_opaque, ret);
+}
+
+/*
+ * Called from a libqnio thread
+ */
+static void vxhs_iio_callback(void *ctx, uint32_t opcode, uint32_t error)
+{
+    VXHSAIOCB *acb = NULL;
+
+    switch (opcode) {
+    case IRP_READ_REQUEST:
+    case IRP_WRITE_REQUEST:
+
+        /*
+         * ctx is VXHSAIOCB*
+         * ctx is NULL if error is QNIOERROR_CHANNEL_HUP
+         */
+        if (ctx) {
+            acb = ctx;
+        } else {
+            trace_vxhs_iio_callback(error);
+            goto out;
+        }
+
+        if (error) {
+            if (!acb->err) {
+                acb->err = error;
+            }
+            trace_vxhs_iio_callback(error);
+        }
+
+        aio_bh_schedule_oneshot(bdrv_get_aio_context(acb->common.bs),
+                                vxhs_complete_aio_bh, acb);
+        break;
+
+    default:
+        if (error == QNIOERROR_HUP) {
+            /*
+             * Channel failed, spontaneous notification,
+             * not in response to I/O
+             */
+            trace_vxhs_iio_callback_chnfail(error, errno);
+        } else {
+            trace_vxhs_iio_callback_unknwn(opcode, error);
+        }
+        break;
+    }
+out:
+    return;
+}
+
+static QemuOptsList runtime_opts = {
+    .name = "vxhs",
+    .head = QTAILQ_HEAD_INITIALIZER(runtime_opts.head),
+    .desc = {
+        {
+            .name = VXHS_OPT_FILENAME,
+            .type = QEMU_OPT_STRING,
+            .help = "URI to the Veritas HyperScale image",
+        },
+        {
+            .name = VXHS_OPT_VDISK_ID,
+            .type = QEMU_OPT_STRING,
+            .help = "UUID of the VxHS vdisk",
+        },
+        {
+            .name = "tls-creds",
+            .type = QEMU_OPT_STRING,
+            .help = "ID of the TLS/SSL credentials to use",
+        },
+        { /* end of list */ }
+    },
+};
+
+static QemuOptsList runtime_tcp_opts = {
+    .name = "vxhs_tcp",
+    .head = QTAILQ_HEAD_INITIALIZER(runtime_tcp_opts.head),
+    .desc = {
+        {
+            .name = VXHS_OPT_HOST,
+            .type = QEMU_OPT_STRING,
+            .help = "host address (ipv4 addresses)",
+        },
+        {
+            .name = VXHS_OPT_PORT,
+            .type = QEMU_OPT_NUMBER,
+            .help = "port number on which VxHSD is listening (default 9999)",
+            .def_value_str = "9999"
+        },
+        { /* end of list */ }
+    },
+};
+
+/*
+ * Parse incoming URI and populate *options with the host
+ * and device information
+ */
+static int vxhs_parse_uri(const char *filename, QDict *options)
+{
+    URI *uri = NULL;
+    char *port;
+    int ret = 0;
+
+    trace_vxhs_parse_uri_filename(filename);
+    uri = uri_parse(filename);
+    if (!uri || !uri->server || !uri->path) {
+        uri_free(uri);
+        return -EINVAL;
+    }
+
+    qdict_put(options, VXHS_OPT_SERVER".host", qstring_from_str(uri->server));
+
+    if (uri->port) {
+        port = g_strdup_printf("%d", uri->port);
+        qdict_put(options, VXHS_OPT_SERVER".port", qstring_from_str(port));
+        g_free(port);
+    }
+
+    qdict_put(options, "vdisk-id", qstring_from_str(uri->path));
+
+    trace_vxhs_parse_uri_hostinfo(uri->server, uri->port);
+    uri_free(uri);
+
+    return ret;
+}
+
+static void vxhs_parse_filename(const char *filename, QDict *options,
+                                Error **errp)
+{
+    if (qdict_haskey(options, "vdisk-id") || qdict_haskey(options, "server")) {
+        error_setg(errp, "vdisk-id/server and a file name may not be specified "
+                         "at the same time");
+        return;
+    }
+
+    if (strstr(filename, "://")) {
+        int ret = vxhs_parse_uri(filename, options);
+        if (ret < 0) {
+            error_setg(errp, "Invalid URI. URI should be of the form "
+                       "  vxhs://<host_ip>:<port>/<vdisk-id>");
+        }
+    }
+}
+
+static int vxhs_init_and_ref(void)
+{
+    if (vxhs_ref++ == 0) {
+        if (iio_init(QNIO_VERSION, vxhs_iio_callback)) {
+            return -ENODEV;
+        }
+    }
+    return 0;
+}
+
+static void vxhs_unref(void)
+{
+    if (--vxhs_ref == 0) {
+        iio_fini();
+    }
+}
+
+static void vxhs_get_tls_creds(const char *id, char **cacert,
+                               char **key, char **cert, Error **errp)
+{
+    Object *obj;
+    QCryptoTLSCreds *creds;
+    QCryptoTLSCredsX509 *creds_x509;
+
+    obj = object_resolve_path_component(
+        object_get_objects_root(), id);
+
+    if (!obj) {
+        error_setg(errp, "No TLS credentials with id '%s'",
+                   id);
+        return;
+    }
+
+    creds_x509 = (QCryptoTLSCredsX509 *)
+        object_dynamic_cast(obj, TYPE_QCRYPTO_TLS_CREDS_X509);
+
+    if (!creds_x509) {
+        error_setg(errp, "Object with id '%s' is not TLS credentials",
+                   id);
+        return;
+    }
+
+    creds = &creds_x509->parent_obj;
+
+    if (creds->endpoint != QCRYPTO_TLS_CREDS_ENDPOINT_CLIENT) {
+        error_setg(errp,
+                   "Expecting TLS credentials with a client endpoint");
+        return;
+    }
+
+    /*
+     * Get the cacert, client_cert and client_key file names.
+     */
+    if (!creds->dir) {
+        error_setg(errp, "TLS object missing 'dir' property value");
+        return;
+    }
+
+    *cacert = g_strdup_printf("%s/%s", creds->dir,
+                              QCRYPTO_TLS_CREDS_X509_CA_CERT);
+    *cert = g_strdup_printf("%s/%s", creds->dir,
+                            QCRYPTO_TLS_CREDS_X509_CLIENT_CERT);
+    *key = g_strdup_printf("%s/%s", creds->dir,
+                           QCRYPTO_TLS_CREDS_X509_CLIENT_KEY);
+}
+
+static int vxhs_open(BlockDriverState *bs, QDict *options,
+                     int bdrv_flags, Error **errp)
+{
+    BDRVVXHSState *s = bs->opaque;
+    void *dev_handlep;
+    QDict *backing_options = NULL;
+    QemuOpts *opts = NULL;
+    QemuOpts *tcp_opts = NULL;
+    char *of_vsa_addr = NULL;
+    Error *local_err = NULL;
+    const char *vdisk_id_opt;
+    const char *server_host_opt;
+    int ret = 0;
+    char *cacert = NULL;
+    char *client_key = NULL;
+    char *client_cert = NULL;
+
+    ret = vxhs_init_and_ref();
+    if (ret < 0) {
+        ret = -EINVAL;
+        goto out;
+    }
+
+    /* Create opts info from runtime_opts and runtime_tcp_opts list */
+    opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
+    tcp_opts = qemu_opts_create(&runtime_tcp_opts, NULL, 0, &error_abort);
+
+    qemu_opts_absorb_qdict(opts, options, &local_err);
+    if (local_err) {
+        ret = -EINVAL;
+        goto out;
+    }
+
+    /* vdisk-id is the disk UUID */
+    vdisk_id_opt = qemu_opt_get(opts, VXHS_OPT_VDISK_ID);
+    if (!vdisk_id_opt) {
+        error_setg(&local_err, QERR_MISSING_PARAMETER, VXHS_OPT_VDISK_ID);
+        ret = -EINVAL;
+        goto out;
+    }
+
+    /* vdisk-id may contain a leading '/' */
+    if (strlen(vdisk_id_opt) > UUID_FMT_LEN + 1) {
+        error_setg(&local_err, "vdisk-id cannot be more than %d characters",
+                   UUID_FMT_LEN);
+        ret = -EINVAL;
+        goto out;
+    }
+
+    s->vdisk_guid = g_strdup(vdisk_id_opt);
+    trace_vxhs_open_vdiskid(vdisk_id_opt);
+
+    /* get the 'server.' arguments */
+    qdict_extract_subqdict(options, &backing_options, VXHS_OPT_SERVER".");
+
+    qemu_opts_absorb_qdict(tcp_opts, backing_options, &local_err);
+    if (local_err != NULL) {
+        ret = -EINVAL;
+        goto out;
+    }
+
+    server_host_opt = qemu_opt_get(tcp_opts, VXHS_OPT_HOST);
+    if (!server_host_opt) {
+        error_setg(&local_err, QERR_MISSING_PARAMETER,
+                   VXHS_OPT_SERVER"."VXHS_OPT_HOST);
+        ret = -EINVAL;
+        goto out;
+    }
+
+    if (strlen(server_host_opt) > MAXHOSTNAMELEN) {
+        error_setg(&local_err, "server.host cannot be more than %d characters",
+                   MAXHOSTNAMELEN);
+        ret = -EINVAL;
+        goto out;
+    }
+
+    /* check if we got tls-creds via the --object argument */
+    s->tlscredsid = g_strdup(qemu_opt_get(opts, "tls-creds"));
+    if (s->tlscredsid) {
+        vxhs_get_tls_creds(s->tlscredsid, &cacert, &client_key,
+                           &client_cert, &local_err);
+        if (local_err != NULL) {
+            ret = -EINVAL;
+            goto out;
+        }
+        trace_vxhs_get_creds(cacert, client_key, client_cert);
+    }
+
+    s->vdisk_hostinfo.host = g_strdup(server_host_opt);
+    s->vdisk_hostinfo.port = g_ascii_strtoll(qemu_opt_get(tcp_opts,
+                                                          VXHS_OPT_PORT),
+                                                          NULL, 0);
+
+    trace_vxhs_open_hostinfo(s->vdisk_hostinfo.host,
+                             s->vdisk_hostinfo.port);
+
+    of_vsa_addr = g_strdup_printf("of://%s:%d",
+                                  s->vdisk_hostinfo.host,
+                                  s->vdisk_hostinfo.port);
+
+    /*
+     * Open qnio channel to storage agent if not opened before
+     */
+    dev_handlep = iio_open(of_vsa_addr, s->vdisk_guid, 0,
+                           cacert, client_key, client_cert);
+    if (dev_handlep == NULL) {
+        trace_vxhs_open_iio_open(of_vsa_addr);
+        ret = -ENODEV;
+        goto out;
+    }
+    s->vdisk_hostinfo.dev_handle = dev_handlep;
+
+out:
+    g_free(of_vsa_addr);
+    QDECREF(backing_options);
+    qemu_opts_del(tcp_opts);
+    qemu_opts_del(opts);
+    g_free(cacert);
+    g_free(client_key);
+    g_free(client_cert);
+
+    if (ret < 0) {
+        vxhs_unref();
+        error_propagate(errp, local_err);
+        g_free(s->vdisk_hostinfo.host);
+        g_free(s->vdisk_guid);
+        g_free(s->tlscredsid);
+        s->vdisk_guid = NULL;
+    }
+
+    return ret;
+}
+
+static const AIOCBInfo vxhs_aiocb_info = {
+    .aiocb_size = sizeof(VXHSAIOCB)
+};
+
+/*
+ * This allocates QEMU-VXHS callback for each IO
+ * and is passed to QNIO. When QNIO completes the work,
+ * it will be passed back through the callback.
+ */
+static BlockAIOCB *vxhs_aio_rw(BlockDriverState *bs, int64_t sector_num,
+                               QEMUIOVector *qiov, int nb_sectors,
+                               BlockCompletionFunc *cb, void *opaque,
+                               VDISKAIOCmd iodir)
+{
+    VXHSAIOCB *acb = NULL;
+    BDRVVXHSState *s = bs->opaque;
+    size_t size;
+    uint64_t offset;
+    int iio_flags = 0;
+    int ret = 0;
+    void *dev_handle = s->vdisk_hostinfo.dev_handle;
+
+    offset = sector_num * BDRV_SECTOR_SIZE;
+    size = nb_sectors * BDRV_SECTOR_SIZE;
+    acb = qemu_aio_get(&vxhs_aiocb_info, bs, cb, opaque);
+
+    /*
+     * Initialize VXHSAIOCB.
+     */
+    acb->err = 0;
+
+    iio_flags = IIO_FLAG_ASYNC;
+
+    switch (iodir) {
+    case VDISK_AIO_WRITE:
+            ret = iio_writev(dev_handle, acb, qiov->iov, qiov->niov,
+                             offset, (uint64_t)size, iio_flags);
+            break;
+    case VDISK_AIO_READ:
+            ret = iio_readv(dev_handle, acb, qiov->iov, qiov->niov,
+                            offset, (uint64_t)size, iio_flags);
+            break;
+    default:
+            trace_vxhs_aio_rw_invalid(iodir);
+            goto errout;
+    }
+
+    if (ret != 0) {
+        trace_vxhs_aio_rw_ioerr(s->vdisk_guid, iodir, size, offset,
+                                acb, ret, errno);
+        goto errout;
+    }
+    return &acb->common;
+
+errout:
+    qemu_aio_unref(acb);
+    return NULL;
+}
+
+static BlockAIOCB *vxhs_aio_readv(BlockDriverState *bs,
+                                   int64_t sector_num, QEMUIOVector *qiov,
+                                   int nb_sectors,
+                                   BlockCompletionFunc *cb, void *opaque)
+{
+    return vxhs_aio_rw(bs, sector_num, qiov, nb_sectors, cb,
+                       opaque, VDISK_AIO_READ);
+}
+
+static BlockAIOCB *vxhs_aio_writev(BlockDriverState *bs,
+                                   int64_t sector_num, QEMUIOVector *qiov,
+                                   int nb_sectors,
+                                   BlockCompletionFunc *cb, void *opaque)
+{
+    return vxhs_aio_rw(bs, sector_num, qiov, nb_sectors,
+                       cb, opaque, VDISK_AIO_WRITE);
+}
+
+static void vxhs_close(BlockDriverState *bs)
+{
+    BDRVVXHSState *s = bs->opaque;
+
+    trace_vxhs_close(s->vdisk_guid);
+
+    g_free(s->vdisk_guid);
+    s->vdisk_guid = NULL;
+
+    /*
+     * Close vDisk device
+     */
+    if (s->vdisk_hostinfo.dev_handle) {
+        iio_close(s->vdisk_hostinfo.dev_handle);
+        s->vdisk_hostinfo.dev_handle = NULL;
+    }
+
+    vxhs_unref();
+
+    /*
+     * Free the dynamically allocated host string etc
+     */
+    g_free(s->vdisk_hostinfo.host);
+    g_free(s->tlscredsid);
+    s->tlscredsid = NULL;
+    s->vdisk_hostinfo.host = NULL;
+    s->vdisk_hostinfo.port = 0;
+}
+
+static int64_t vxhs_get_vdisk_stat(BDRVVXHSState *s)
+{
+    int64_t vdisk_size = -1;
+    int ret = 0;
+    void *dev_handle = s->vdisk_hostinfo.dev_handle;
+
+    ret = iio_ioctl(dev_handle, IOR_VDISK_STAT, &vdisk_size, 0);
+    if (ret < 0) {
+        trace_vxhs_get_vdisk_stat_err(s->vdisk_guid, ret, errno);
+        return -EIO;
+    }
+
+    trace_vxhs_get_vdisk_stat(s->vdisk_guid, vdisk_size);
+    return vdisk_size;
+}
+
+/*
+ * Returns the size of vDisk in bytes. This is required
+ * by QEMU block upper block layer so that it is visible
+ * to guest.
+ */
+static int64_t vxhs_getlength(BlockDriverState *bs)
+{
+    BDRVVXHSState *s = bs->opaque;
+    int64_t vdisk_size;
+
+    vdisk_size = vxhs_get_vdisk_stat(s);
+    if (vdisk_size < 0) {
+        return -EIO;
+    }
+
+    return vdisk_size;
+}
+
+static BlockDriver bdrv_vxhs = {
+    .format_name                  = "vxhs",
+    .protocol_name                = "vxhs",
+    .instance_size                = sizeof(BDRVVXHSState),
+    .bdrv_file_open               = vxhs_open,
+    .bdrv_parse_filename          = vxhs_parse_filename,
+    .bdrv_close                   = vxhs_close,
+    .bdrv_getlength               = vxhs_getlength,
+    .bdrv_aio_readv               = vxhs_aio_readv,
+    .bdrv_aio_writev              = vxhs_aio_writev,
+};
+
+static void bdrv_vxhs_init(void)
+{
+    bdrv_register(&bdrv_vxhs);
+}
+
+block_init(bdrv_vxhs_init);
diff --git a/configure b/configure
index XXXXXXX..XXXXXXX 100755
--- a/configure
+++ b/configure
@@ -XXX,XX +XXX,XX @@ numa=""
 tcmalloc="no"
 jemalloc="no"
 replication="yes"
+vxhs=""
 
 supported_cpu="no"
 supported_os="no"
@@ -XXX,XX +XXX,XX @@ for opt do
   ;;
   --enable-replication) replication="yes"
   ;;
+  --disable-vxhs) vxhs="no"
+  ;;
+  --enable-vxhs) vxhs="yes"
+  ;;
   *)
       echo "ERROR: unknown option $opt"
       echo "Try '$0 --help' for more information"
@@ -XXX,XX +XXX,XX @@ disabled with --disable-FEATURE, default is enabled if available:
   xfsctl          xfsctl support
   qom-cast-debug  cast debugging support
   tools           build qemu-io, qemu-nbd and qemu-image tools
+  vxhs            Veritas HyperScale vDisk backend support
 
 NOTE: The object files are built at the place where configure is launched
 EOF
@@ -XXX,XX +XXX,XX @@ if compile_prog "" "" ; then
 fi
 
 ##########################################
+# Veritas HyperScale block driver VxHS
+# Check if libvxhs is installed
+
+if test "$vxhs" != "no" ; then
+  cat > $TMPC <<EOF
+#include <stdint.h>
+#include <qnio/qnio_api.h>
+
+void *vxhs_callback;
+
+int main(void) {
+    iio_init(QNIO_VERSION, vxhs_callback);
+    return 0;
+}
+EOF
+  vxhs_libs="-lvxhs -lssl"
+  if compile_prog "" "$vxhs_libs" ; then
+    vxhs=yes
+  else
+    if test "$vxhs" = "yes" ; then
+      feature_not_found "vxhs block device" "Install libvxhs See github"
+    fi
+    vxhs=no
+  fi
+fi
+
+##########################################
 # End of CC checks
 # After here, no more $cc or $ld runs
 
@@ -XXX,XX +XXX,XX @@ echo "tcmalloc support  $tcmalloc"
 echo "jemalloc support  $jemalloc"
 echo "avx2 optimization $avx2_opt"
 echo "replication support $replication"
+echo "VxHS block device $vxhs"
 
 if test "$sdl_too_old" = "yes"; then
 echo "-> Your SDL version is too old - please upgrade to have SDL support"
@@ -XXX,XX +XXX,XX @@ if test "$pthread_setname_np" = "yes" ; then
   echo "CONFIG_PTHREAD_SETNAME_NP=y" >> $config_host_mak
 fi
 
+if test "$vxhs" = "yes" ; then
+  echo "CONFIG_VXHS=y" >> $config_host_mak
+  echo "VXHS_LIBS=$vxhs_libs" >> $config_host_mak
+fi
+
 if test "$tcg_interpreter" = "yes"; then
   QEMU_INCLUDES="-I\$(SRC_PATH)/tcg/tci $QEMU_INCLUDES"
 elif test "$ARCH" = "sparc64" ; then
diff --git a/qapi/block-core.json b/qapi/block-core.json
index XXXXXXX..XXXXXXX 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -XXX,XX +XXX,XX @@
 #
 # Drivers that are supported in block device operations.
 #
+# @vxhs: Since 2.10
+#
 # Since: 2.9
 ##
 { 'enum': 'BlockdevDriver',
@@ -XXX,XX +XXX,XX @@
             'host_device', 'http', 'https', 'iscsi', 'luks', 'nbd', 'nfs',
             'null-aio', 'null-co', 'parallels', 'qcow', 'qcow2', 'qed',
             'quorum', 'raw', 'rbd', 'replication', 'sheepdog', 'ssh',
-            'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
+            'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat', 'vxhs' ] }
 
 ##
 # @BlockdevOptionsFile:
@@ -XXX,XX +XXX,XX @@
   'data': { '*offset': 'int', '*size': 'int' } }
 
 ##
+# @BlockdevOptionsVxHS:
+#
+# Driver specific block device options for VxHS
+#
+# @vdisk-id:    UUID of VxHS volume
+# @server:      vxhs server IP, port
+# @tls-creds:   TLS credentials ID
+#
+# Since: 2.10
+##
+{ 'struct': 'BlockdevOptionsVxHS',
+  'data': { 'vdisk-id': 'str',
+            'server': 'InetSocketAddressBase',
+            '*tls-creds': 'str' } }
+
+##
 # @BlockdevOptions:
 #
 # Options for creating a block device.  Many options are available for all
@@ -XXX,XX +XXX,XX @@
       'vhdx':       'BlockdevOptionsGenericFormat',
       'vmdk':       'BlockdevOptionsGenericCOWFormat',
       'vpc':        'BlockdevOptionsGenericFormat',
-      'vvfat':      'BlockdevOptionsVVFAT'
+      'vvfat':      'BlockdevOptionsVVFAT',
+      'vxhs':       'BlockdevOptionsVxHS'
   } }
 
 ##
-- 
2.9.3

From: Ashish Mittal <ashmit602@gmail.com>

These changes use a vxhs test server that is a part of the following
repository:
https://github.com/VeritasHyperScale/libqnio.git

Signed-off-by: Ashish Mittal <Ashish.Mittal@veritas.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
Message-id: 1491277689-24949-3-git-send-email-Ashish.Mittal@veritas.com
---
 tests/qemu-iotests/common        |  6 ++++++
 tests/qemu-iotests/common.config | 13 +++++++++++++
 tests/qemu-iotests/common.filter |  1 +
 tests/qemu-iotests/common.rc     | 19 +++++++++++++++++++
 4 files changed, 39 insertions(+)

diff --git a/tests/qemu-iotests/common b/tests/qemu-iotests/common
index XXXXXXX..XXXXXXX 100644
--- a/tests/qemu-iotests/common
+++ b/tests/qemu-iotests/common
@@ -XXX,XX +XXX,XX @@ check options
     -ssh                test ssh
     -nfs                test nfs
     -luks               test luks
+    -vxhs               test vxhs
     -xdiff              graphical mode diff
     -nocache            use O_DIRECT on backing file
     -misalign           misalign memory allocations
@@ -XXX,XX +XXX,XX @@ testlist options
             xpand=false
             ;;
 
+        -vxhs)
+            IMGPROTO=vxhs
+            xpand=false
+            ;;
+
         -ssh)
             IMGPROTO=ssh
             xpand=false
diff --git a/tests/qemu-iotests/common.config b/tests/qemu-iotests/common.config
index XXXXXXX..XXXXXXX 100644
--- a/tests/qemu-iotests/common.config
+++ b/tests/qemu-iotests/common.config
@@ -XXX,XX +XXX,XX @@ if [ -z "$QEMU_NBD_PROG" ]; then
     export QEMU_NBD_PROG="`set_prog_path qemu-nbd`"
 fi
 
+if [ -z "$QEMU_VXHS_PROG" ]; then
+    export QEMU_VXHS_PROG="`set_prog_path qnio_server`"
+fi
+
 _qemu_wrapper()
 {
     (
@@ -XXX,XX +XXX,XX @@ _qemu_nbd_wrapper()
     )
 }
 
+_qemu_vxhs_wrapper()
+{
+    (
+        echo $BASHPID > "${TEST_DIR}/qemu-vxhs.pid"
+        exec "$QEMU_VXHS_PROG" $QEMU_VXHS_OPTIONS "$@"
+    )
+}
+
 export QEMU=_qemu_wrapper
 export QEMU_IMG=_qemu_img_wrapper
 export QEMU_IO=_qemu_io_wrapper
 export QEMU_NBD=_qemu_nbd_wrapper
+export QEMU_VXHS=_qemu_vxhs_wrapper
 
 QEMU_IMG_EXTRA_ARGS=
 if [ "$IMGOPTSSYNTAX" = "true" ]; then
diff --git a/tests/qemu-iotests/common.filter b/tests/qemu-iotests/common.filter
index XXXXXXX..XXXXXXX 100644
--- a/tests/qemu-iotests/common.filter
+++ b/tests/qemu-iotests/common.filter
@@ -XXX,XX +XXX,XX @@ _filter_img_info()
         -e "s#$TEST_DIR#TEST_DIR#g" \
         -e "s#$IMGFMT#IMGFMT#g" \
         -e 's#nbd://127.0.0.1:10810$#TEST_DIR/t.IMGFMT#g' \
+        -e 's#json.*vdisk-id.*vxhs"}}#TEST_DIR/t.IMGFMT#' \
         -e "/encrypted: yes/d" \
         -e "/cluster_size: [0-9]\\+/d" \
         -e "/table_size: [0-9]\\+/d" \
diff --git a/tests/qemu-iotests/common.rc b/tests/qemu-iotests/common.rc
index XXXXXXX..XXXXXXX 100644
--- a/tests/qemu-iotests/common.rc
+++ b/tests/qemu-iotests/common.rc
@@ -XXX,XX +XXX,XX @@ else
     elif [ "$IMGPROTO" = "nfs" ]; then
         TEST_DIR="nfs://127.0.0.1/$TEST_DIR"
         TEST_IMG=$TEST_DIR/t.$IMGFMT
+    elif [ "$IMGPROTO" = "vxhs" ]; then
+        TEST_IMG_FILE=$TEST_DIR/t.$IMGFMT
+        TEST_IMG="vxhs://127.0.0.1:9999/t.$IMGFMT"
     else
         TEST_IMG=$IMGPROTO:$TEST_DIR/t.$IMGFMT
     fi
@@ -XXX,XX +XXX,XX @@ _make_test_img()
         eval "$QEMU_NBD -v -t -b 127.0.0.1 -p 10810 -f $IMGFMT  $TEST_IMG_FILE >/dev/null &"
         sleep 1 # FIXME: qemu-nbd needs to be listening before we continue
     fi
+
+    # Start QNIO server on image directory for vxhs protocol
+    if [ $IMGPROTO = "vxhs" ]; then
+        eval "$QEMU_VXHS -d  $TEST_DIR > /dev/null &"
+        sleep 1 # Wait for server to come up.
+    fi
 }
 
 _rm_test_img()
@@ -XXX,XX +XXX,XX @@ _cleanup_test_img()
             fi
             rm -f "$TEST_IMG_FILE"
             ;;
+        vxhs)
+            if [ -f "${TEST_DIR}/qemu-vxhs.pid" ]; then
+                local QEMU_VXHS_PID
+                read QEMU_VXHS_PID < "${TEST_DIR}/qemu-vxhs.pid"
+                kill ${QEMU_VXHS_PID} >/dev/null 2>&1
+                rm -f "${TEST_DIR}/qemu-vxhs.pid"
+            fi
+            rm -f "$TEST_IMG_FILE"
+            ;;
+
         file)
             _rm_test_img "$TEST_DIR/t.$IMGFMT"
             _rm_test_img "$TEST_DIR/t.$IMGFMT.orig"
-- 
2.9.3

The protocol VXHS does not support image creation.  Some tests expect
to be able to create images through the protocol.  Exclude VXHS from
these tests.

diff --git a/tests/qemu-iotests/017 b/tests/qemu-iotests/017
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/017
+++ b/tests/qemu-iotests/017
@@ -XXX,XX +XXX,XX @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 # Any format supporting backing files
 _supported_fmt qcow qcow2 vmdk qed
 _supported_proto generic
+_unsupported_proto vxhs
 _supported_os Linux
 _unsupported_imgopts "subformat=monolithicFlat" "subformat=twoGbMaxExtentFlat"
 
diff --git a/tests/qemu-iotests/020 b/tests/qemu-iotests/020
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/020
+++ b/tests/qemu-iotests/020
@@ -XXX,XX +XXX,XX @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 # Any format supporting backing files
 _supported_fmt qcow qcow2 vmdk qed
 _supported_proto generic
+_unsupported_proto vxhs
 _supported_os Linux
 _unsupported_imgopts "subformat=monolithicFlat" \
                      "subformat=twoGbMaxExtentFlat" \
diff --git a/tests/qemu-iotests/029 b/tests/qemu-iotests/029
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/029
+++ b/tests/qemu-iotests/029
@@ -XXX,XX +XXX,XX @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 # Any format supporting intenal snapshots
 _supported_fmt qcow2
 _supported_proto generic
+_unsupported_proto vxhs
 _supported_os Linux
 # Internal snapshots are (currently) impossible with refcount_bits=1
 _unsupported_imgopts 'refcount_bits=1[^0-9]'
diff --git a/tests/qemu-iotests/073 b/tests/qemu-iotests/073
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/073
+++ b/tests/qemu-iotests/073
@@ -XXX,XX +XXX,XX @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 
 _supported_fmt qcow2
 _supported_proto generic
+_unsupported_proto vxhs
 _supported_os Linux
 
 CLUSTER_SIZE=64k
diff --git a/tests/qemu-iotests/114 b/tests/qemu-iotests/114
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/114
+++ b/tests/qemu-iotests/114
@@ -XXX,XX +XXX,XX @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 
 _supported_fmt qcow2
 _supported_proto generic
+_unsupported_proto vxhs
 _supported_os Linux
 
 
diff --git a/tests/qemu-iotests/130 b/tests/qemu-iotests/130
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/130
+++ b/tests/qemu-iotests/130
@@ -XXX,XX +XXX,XX @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 
 _supported_fmt qcow2
 _supported_proto generic
+_unsupported_proto vxhs
 _supported_os Linux
 
 qemu_comm_method="monitor"
diff --git a/tests/qemu-iotests/134 b/tests/qemu-iotests/134
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/134
+++ b/tests/qemu-iotests/134
@@ -XXX,XX +XXX,XX @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 
 _supported_fmt qcow2
 _supported_proto generic
+_unsupported_proto vxhs
 _supported_os Linux
 
 
diff --git a/tests/qemu-iotests/156 b/tests/qemu-iotests/156
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/156
+++ b/tests/qemu-iotests/156
@@ -XXX,XX +XXX,XX @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 
 _supported_fmt qcow2 qed
 _supported_proto generic
+_unsupported_proto vxhs
 _supported_os Linux
 
 # Create source disk
diff --git a/tests/qemu-iotests/158 b/tests/qemu-iotests/158
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/158
+++ b/tests/qemu-iotests/158
@@ -XXX,XX +XXX,XX @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 
 _supported_fmt qcow2
 _supported_proto generic
+_unsupported_proto vxhs
 _supported_os Linux
 
 
-- 
2.9.3

We have a helper wrapper for checking the BDS read_only flag;
add a helper wrapper to set the read_only flag as well.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 9b18972d05f5fa2ac16c014f0af98d680553048d.1491597120.git.jcody@redhat.com
---
 block.c               | 5 +++++
 block/bochs.c         | 2 +-
 block/cloop.c         | 2 +-
 block/dmg.c           | 2 +-
 block/rbd.c           | 2 +-
 block/vvfat.c         | 4 ++--
 include/block/block.h | 1 +
 7 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/block.c b/block.c
index XXXXXXX..XXXXXXX 100644
--- a/block.c
+++ b/block.c
@@ -XXX,XX +XXX,XX @@ void path_combine(char *dest, int dest_size,
     }
 }
 
+void bdrv_set_read_only(BlockDriverState *bs, bool read_only)
+{
+    bs->read_only = read_only;
+}
+
 void bdrv_get_full_backing_filename_from_filename(const char *backed,
                                                   const char *backing,
                                                   char *dest, size_t sz,
diff --git a/block/bochs.c b/block/bochs.c
index XXXXXXX..XXXXXXX 100644
--- a/block/bochs.c
+++ b/block/bochs.c
@@ -XXX,XX +XXX,XX @@ static int bochs_open(BlockDriverState *bs, QDict *options, int flags,
         return -EINVAL;
     }
 
-    bs->read_only = true; /* no write support yet */
+    bdrv_set_read_only(bs, true); /* no write support yet */
 
     ret = bdrv_pread(bs->file, 0, &bochs, sizeof(bochs));
     if (ret < 0) {
diff --git a/block/cloop.c b/block/cloop.c
index XXXXXXX..XXXXXXX 100644
--- a/block/cloop.c
+++ b/block/cloop.c
@@ -XXX,XX +XXX,XX @@ static int cloop_open(BlockDriverState *bs, QDict *options, int flags,
         return -EINVAL;
     }
 
-    bs->read_only = true;
+    bdrv_set_read_only(bs, true);
 
     /* read header */
     ret = bdrv_pread(bs->file, 128, &s->block_size, 4);
diff --git a/block/dmg.c b/block/dmg.c
index XXXXXXX..XXXXXXX 100644
--- a/block/dmg.c
+++ b/block/dmg.c
@@ -XXX,XX +XXX,XX @@ static int dmg_open(BlockDriverState *bs, QDict *options, int flags,
     }
 
     block_module_load_one("dmg-bz2");
-    bs->read_only = true;
+    bdrv_set_read_only(bs, true);
 
     s->n_chunks = 0;
     s->offsets = s->lengths = s->sectors = s->sectorcounts = NULL;
diff --git a/block/rbd.c b/block/rbd.c
index XXXXXXX..XXXXXXX 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -XXX,XX +XXX,XX @@ static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
         goto failed_open;
     }
 
-    bs->read_only = (s->snap != NULL);
+    bdrv_set_read_only(bs, (s->snap != NULL));
 
     qemu_opts_del(opts);
     return 0;
diff --git a/block/vvfat.c b/block/vvfat.c
index XXXXXXX..XXXXXXX 100644
--- a/block/vvfat.c
+++ b/block/vvfat.c
@@ -XXX,XX +XXX,XX @@ static int vvfat_open(BlockDriverState *bs, QDict *options, int flags,
     s->current_cluster=0xffffffff;
 
     /* read only is the default for safety */
-    bs->read_only = true;
+    bdrv_set_read_only(bs, true);
     s->qcow = NULL;
     s->qcow_filename = NULL;
     s->fat2 = NULL;
@@ -XXX,XX +XXX,XX @@ static int vvfat_open(BlockDriverState *bs, QDict *options, int flags,
         if (ret < 0) {
             goto fail;
         }
-        bs->read_only = false;
+        bdrv_set_read_only(bs, false);
     }
 
     bs->total_sectors = cyls * heads * secs;
diff --git a/include/block/block.h b/include/block/block.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -XXX,XX +XXX,XX @@ int bdrv_is_allocated_above(BlockDriverState *top, BlockDriverState *base,
                             int64_t sector_num, int nb_sectors, int *pnum);
 
 bool bdrv_is_read_only(BlockDriverState *bs);
+void bdrv_set_read_only(BlockDriverState *bs, bool read_only);
 bool bdrv_is_sg(BlockDriverState *bs);
 bool bdrv_is_inserted(BlockDriverState *bs);
 int bdrv_media_changed(BlockDriverState *bs);
-- 
2.9.3

A few block drivers will set the BDS read_only flag from their
.bdrv_open() function.  This means the bs->read_only flag could
be set after we enable copy_on_read, as the BDRV_O_COPY_ON_READ
flag check occurs prior to the call to bdrv->bdrv_open().

This adds an error return to bdrv_set_read_only(), and an error will be
returned if we try to set the BDS to read_only while copy_on_read is
enabled.

This patch also changes the behavior of vvfat.  Before, vvfat could
override the drive 'readonly' flag with its own, internal 'rw' flag.

For instance, this -drive parameter would result in a writable image:

"-drive format=vvfat,dir=/tmp/vvfat,rw,if=virtio,readonly=on"

This is not correct.  Now, attempting to use the above -drive parameter
will result in an error (i.e., 'rw' is incompatible with 'readonly=on').
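
As an illustration only, the guard described above can be modelled outside
of QEMU roughly as follows.  This is a minimal sketch, not the patch itself:
MockBDS, mock_set_read_only() and mock_driver_open() are hypothetical names,
and a plain message buffer stands in for QEMU's Error ** reporting.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for BlockDriverState; only the fields used here. */
typedef struct MockBDS {
    bool read_only;
    bool copy_on_read;
} MockBDS;

/*
 * Simplified model of the guarded setter: refuse to switch a node to
 * read-only while copy-on-read is enabled, otherwise store the flag.
 * The message buffer replaces QEMU's Error ** mechanism (an assumption
 * made to keep the sketch self-contained).
 */
static int mock_set_read_only(MockBDS *bs, bool read_only,
                              char *msg, size_t msg_len)
{
    assert(bs != NULL);

    if (bs->copy_on_read && read_only) {
        snprintf(msg, msg_len,
                 "Can't set node to r/o with copy-on-read enabled");
        return -EINVAL;
    }

    bs->read_only = read_only;
    return 0;
}

/* Caller pattern used by the drivers: check the return value and bail out. */
static int mock_driver_open(MockBDS *bs, char *msg, size_t msg_len)
{
    int ret = mock_set_read_only(bs, true, msg, msg_len);
    if (ret < 0) {
        return ret;
    }
    return 0;
}
```

The point of the sketch is that a driver which forces read-only from its
open function now has to propagate a failure instead of silently
overriding the state chosen by the caller.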

Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 0c5b4c1cc2c651471b131f21376dfd5ea24d2196.1491597120.git.jcody@redhat.com
---
 block.c               | 10 +++++++++-
 block/bochs.c         |  5 ++++-
 block/cloop.c         |  5 ++++-
 block/dmg.c           |  6 +++++-
 block/rbd.c           | 11 ++++++++++-
 block/vvfat.c         | 19 +++++++++++++++----
 include/block/block.h |  2 +-
 7 files changed, 48 insertions(+), 10 deletions(-)

diff --git a/block.c b/block.c
index XXXXXXX..XXXXXXX 100644
--- a/block.c
+++ b/block.c
@@ -XXX,XX +XXX,XX @@ void path_combine(char *dest, int dest_size,
     }
 }
 
-void bdrv_set_read_only(BlockDriverState *bs, bool read_only)
+int bdrv_set_read_only(BlockDriverState *bs, bool read_only, Error **errp)
 {
+    /* Do not set read_only if copy_on_read is enabled */
+    if (bs->copy_on_read && read_only) {
+        error_setg(errp, "Can't set node '%s' to r/o with copy-on-read enabled",
+                   bdrv_get_device_or_node_name(bs));
+        return -EINVAL;
+    }
+
     bs->read_only = read_only;
+    return 0;
 }
 
 void bdrv_get_full_backing_filename_from_filename(const char *backed,
diff --git a/block/bochs.c b/block/bochs.c
index XXXXXXX..XXXXXXX 100644
--- a/block/bochs.c
+++ b/block/bochs.c
@@ -XXX,XX +XXX,XX @@ static int bochs_open(BlockDriverState *bs, QDict *options, int flags,
         return -EINVAL;
     }
 
-    bdrv_set_read_only(bs, true); /* no write support yet */
+    ret = bdrv_set_read_only(bs, true, errp); /* no write support yet */
+    if (ret < 0) {
+        return ret;
+    }
 
     ret = bdrv_pread(bs->file, 0, &bochs, sizeof(bochs));
     if (ret < 0) {
diff --git a/block/cloop.c b/block/cloop.c
index XXXXXXX..XXXXXXX 100644
--- a/block/cloop.c
+++ b/block/cloop.c
@@ -XXX,XX +XXX,XX @@ static int cloop_open(BlockDriverState *bs, QDict *options, int flags,
         return -EINVAL;
     }
 
-    bdrv_set_read_only(bs, true);
+    ret = bdrv_set_read_only(bs, true, errp);
+    if (ret < 0) {
+        return ret;
+    }
 
     /* read header */
     ret = bdrv_pread(bs->file, 128, &s->block_size, 4);
diff --git a/block/dmg.c b/block/dmg.c
index XXXXXXX..XXXXXXX 100644
--- a/block/dmg.c
+++ b/block/dmg.c
@@ -XXX,XX +XXX,XX @@ static int dmg_open(BlockDriverState *bs, QDict *options, int flags,
         return -EINVAL;
     }
 
+    ret = bdrv_set_read_only(bs, true, errp);
+    if (ret < 0) {
+        return ret;
+    }
+
     block_module_load_one("dmg-bz2");
-    bdrv_set_read_only(bs, true);
 
     s->n_chunks = 0;
     s->offsets = s->lengths = s->sectors = s->sectorcounts = NULL;
diff --git a/block/rbd.c b/block/rbd.c
index XXXXXXX..XXXXXXX 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -XXX,XX +XXX,XX @@ static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
         goto failed_shutdown;
     }
 
+    /* rbd_open is always r/w */
     r = rbd_open(s->io_ctx, s->name, &s->image, s->snap);
     if (r < 0) {
         error_setg_errno(errp, -r, "error reading header from %s", s->name);
         goto failed_open;
     }
 
-    bdrv_set_read_only(bs, (s->snap != NULL));
+    /* If we are using an rbd snapshot, we must be r/o, otherwise
+     * leave as-is */
+    if (s->snap != NULL) {
+        r = bdrv_set_read_only(bs, true, &local_err);
+        if (r < 0) {
+            error_propagate(errp, local_err);
+            goto failed_open;
+        }
+    }
 
     qemu_opts_del(opts);
     return 0;
diff --git a/block/vvfat.c b/block/vvfat.c
index XXXXXXX..XXXXXXX 100644
--- a/block/vvfat.c
+++ b/block/vvfat.c
@@ -XXX,XX +XXX,XX @@ static int vvfat_open(BlockDriverState *bs, QDict *options, int flags,
 
     s->current_cluster=0xffffffff;
 
-    /* read only is the default for safety */
-    bdrv_set_read_only(bs, true);
     s->qcow = NULL;
     s->qcow_filename = NULL;
     s->fat2 = NULL;
@@ -XXX,XX +XXX,XX @@ static int vvfat_open(BlockDriverState *bs, QDict *options, int flags,
     s->sector_count = cyls * heads * secs - (s->first_sectors_number - 1);
 
     if (qemu_opt_get_bool(opts, "rw", false)) {
-        ret = enable_write_target(bs, errp);
+        if (!bdrv_is_read_only(bs)) {
+            ret = enable_write_target(bs, errp);
+            if (ret < 0) {
+                goto fail;
+            }
+        } else {
+            ret = -EPERM;
+            error_setg(errp,
+                       "Unable to set VVFAT to 'rw' when drive is read-only");
+            goto fail;
+        }
+    } else  {
+        /* read only is the default for safety */
+        ret = bdrv_set_read_only(bs, true, &local_err);
         if (ret < 0) {
+            error_propagate(errp, local_err);
             goto fail;
         }
-        bdrv_set_read_only(bs, false);
     }
 
     bs->total_sectors = cyls * heads * secs;
diff --git a/include/block/block.h b/include/block/block.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -XXX,XX +XXX,XX @@ int bdrv_is_allocated_above(BlockDriverState *top, BlockDriverState *base,
                             int64_t sector_num, int nb_sectors, int *pnum);
 
 bool bdrv_is_read_only(BlockDriverState *bs);
-void bdrv_set_read_only(BlockDriverState *bs, bool read_only);
+int bdrv_set_read_only(BlockDriverState *bs, bool read_only, Error **errp);
 bool bdrv_is_sg(BlockDriverState *bs);
 bool bdrv_is_inserted(BlockDriverState *bs);
 int bdrv_media_changed(BlockDriverState *bs);
-- 
2.9.3

The BDRV_O_ALLOW_RDWR flag allows / prohibits the changing of
the BDS 'read_only' state, but there are a few places where it
is ignored.  In the bdrv_set_read_only() helper, make sure to
honor the flag.
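
A rough standalone model of this check, for illustration only.  MockNode,
MOCK_O_ALLOW_RDWR and mock_update_read_only() are hypothetical names, and
the flag value is made up; the real BDRV_O_ALLOW_RDWR is defined by QEMU.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag value standing in for BDRV_O_ALLOW_RDWR. */
#define MOCK_O_ALLOW_RDWR 0x0800

/* Hypothetical stand-in for BlockDriverState; only the fields used here. */
typedef struct MockNode {
    bool read_only;
    int open_flags;
} MockNode;

/*
 * Clearing read_only is allowed only if the node was opened with the
 * allow flag; setting read_only is not restricted by this check.
 */
static int mock_update_read_only(MockNode *bs, bool read_only)
{
    assert(bs != NULL);

    if (!read_only && !(bs->open_flags & MOCK_O_ALLOW_RDWR)) {
        return -EPERM;
    }

    bs->read_only = read_only;
    return 0;
}
```

Note the asymmetry: a node can always be made more restrictive, but it can
only be made writable when the open flags permit it.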

Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: be2e5fb2d285cbece2b6d06bed54a6f56520d251.1491597120.git.jcody@redhat.com
---
 block.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/block.c b/block.c
index XXXXXXX..XXXXXXX 100644
--- a/block.c
+++ b/block.c
@@ -XXX,XX +XXX,XX @@ int bdrv_set_read_only(BlockDriverState *bs, bool read_only, Error **errp)
         return -EINVAL;
     }
 
+    /* Do not clear read_only if it is prohibited */
+    if (!read_only && !(bs->open_flags & BDRV_O_ALLOW_RDWR)) {
+        error_setg(errp, "Node '%s' is read only",
+                   bdrv_get_device_or_node_name(bs));
+        return -EPERM;
+    }
+
     bs->read_only = read_only;
     return 0;
 }
-- 
2.9.3

Move bdrv_is_read_only() up with its friends.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
Message-id: 73b2399459760c32506f9407efb9dddb3a2789de.1491597120.git.jcody@redhat.com
---
 block.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/block.c b/block.c
index XXXXXXX..XXXXXXX 100644
--- a/block.c
+++ b/block.c
@@ -XXX,XX +XXX,XX @@ void path_combine(char *dest, int dest_size,
     }
 }
 
+bool bdrv_is_read_only(BlockDriverState *bs)
+{
+    return bs->read_only;
+}
+
 int bdrv_set_read_only(BlockDriverState *bs, bool read_only, Error **errp)
 {
     /* Do not set read_only if copy_on_read is enabled */
@@ -XXX,XX +XXX,XX @@ void bdrv_get_geometry(BlockDriverState *bs, uint64_t *nb_sectors_ptr)
     *nb_sectors_ptr = nb_sectors < 0 ? 0 : nb_sectors;
 }
 
-bool bdrv_is_read_only(BlockDriverState *bs)
-{
-    return bs->read_only;
-}
-
 bool bdrv_is_sg(BlockDriverState *bs)
 {
     return bs->sg;
-- 
2.9.3

Introduce a check function, bdrv_can_set_read_only(), for validating a
change to the read_only flag.  It returns < 0 on error, with an appropriate
Error value set, and does not alter any flags.
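
The check/apply split can be sketched in isolation like this. The names
and the struct below are hypothetical and only model the copy_on_read
rule; the real functions also take an Error **errp, which is left out
here.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a node; not QEMU's BlockDriverState. */
struct demo_node {
    bool copy_on_read;
    bool read_only;
};

/* Validation only: reports whether the change would be allowed and
 * never writes to the node (hence the const pointer). */
static int demo_can_set_read_only(const struct demo_node *n, bool read_only)
{
    if (n->copy_on_read && read_only) {
        return -EINVAL;
    }
    return 0;
}

/* Applies the change, but only after the check above has passed. */
static int demo_set_read_only(struct demo_node *n, bool read_only)
{
    int ret = demo_can_set_read_only(n, read_only);
    if (ret < 0) {
        return ret;
    }
    n->read_only = read_only;
    return 0;
}
```

Keeping the check free of side effects is what lets a caller that only
needs to validate a pending change use it without touching the node.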

Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: e2bba34ac3bc76a0c42adc390413f358ae0566e8.1491597120.git.jcody@redhat.com
---
 block.c               | 14 +++++++++++++-
 include/block/block.h |  1 +
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/block.c b/block.c
index XXXXXXX..XXXXXXX 100644
--- a/block.c
+++ b/block.c
@@ -XXX,XX +XXX,XX @@ bool bdrv_is_read_only(BlockDriverState *bs)
     return bs->read_only;
 }
 
-int bdrv_set_read_only(BlockDriverState *bs, bool read_only, Error **errp)
+int bdrv_can_set_read_only(BlockDriverState *bs, bool read_only, Error **errp)
 {
     /* Do not set read_only if copy_on_read is enabled */
     if (bs->copy_on_read && read_only) {
@@ -XXX,XX +XXX,XX @@ int bdrv_set_read_only(BlockDriverState *bs, bool read_only, Error **errp)
         return -EPERM;
     }
 
+    return 0;
+}
+
+int bdrv_set_read_only(BlockDriverState *bs, bool read_only, Error **errp)
+{
+    int ret = 0;
+
+    ret = bdrv_can_set_read_only(bs, read_only, errp);
+    if (ret < 0) {
+        return ret;
+    }
+
     bs->read_only = read_only;
     return 0;
 }
diff --git a/include/block/block.h b/include/block/block.h
index XXXXXXX..XXXXXXX 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -XXX,XX +XXX,XX @@ int bdrv_is_allocated_above(BlockDriverState *top, BlockDriverState *base,
                             int64_t sector_num, int nb_sectors, int *pnum);
 
 bool bdrv_is_read_only(BlockDriverState *bs);
+int bdrv_can_set_read_only(BlockDriverState *bs, bool read_only, Error **errp);
 int bdrv_set_read_only(BlockDriverState *bs, bool read_only, Error **errp);
 bool bdrv_is_sg(BlockDriverState *bs);
 bool bdrv_is_inserted(BlockDriverState *bs);
-- 
2.9.3

Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 00aed7ffdd7be4b9ed9ce1007d50028a72b34ebe.1491597120.git.jcody@redhat.com
---
 block.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/block.c b/block.c
index XXXXXXX..XXXXXXX 100644
--- a/block.c
+++ b/block.c
@@ -XXX,XX +XXX,XX @@ int bdrv_reopen_prepare(BDRVReopenState *reopen_state, BlockReopenQueue *queue,
     BlockDriver *drv;
     QemuOpts *opts;
     const char *value;
+    bool read_only;
 
     assert(reopen_state != NULL);
     assert(reopen_state->bs->drv != NULL);
@@ -XXX,XX +XXX,XX @@ int bdrv_reopen_prepare(BDRVReopenState *reopen_state, BlockReopenQueue *queue,
         qdict_put(reopen_state->options, "driver", qstring_from_str(value));
     }
 
-    /* if we are to stay read-only, do not allow permission change
-     * to r/w */
-    if (!(reopen_state->bs->open_flags & BDRV_O_ALLOW_RDWR) &&
-        reopen_state->flags & BDRV_O_RDWR) {
-        error_setg(errp, "Node '%s' is read only",
-                   bdrv_get_device_or_node_name(reopen_state->bs));
+    /* If we are to stay read-only, do not allow permission change
+     * to r/w. Attempting to set to r/w may fail if either BDRV_O_ALLOW_RDWR is
+     * not set, or if the BDS still has copy_on_read enabled */
+    read_only = !(reopen_state->flags & BDRV_O_RDWR);
+    ret = bdrv_can_set_read_only(reopen_state->bs, read_only, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
         goto error;
     }
 
-- 
2.9.3

Update 'clientname' to be 'user', which tracks better with both
the QAPI and rados variable naming.

Update 'name' to be 'image_name', as it indicates the rbd image.
Naming it 'image' would have been ideal, but we are using that for
the rados_image_t value returned by rbd_open().

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: b7ec1fb2e1cf36f9b6911631447a5b0422590b7d.1491597120.git.jcody@redhat.com
---
 block/rbd.c | 33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index XXXXXXX..XXXXXXX 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -XXX,XX +XXX,XX @@ typedef struct BDRVRBDState {
     rados_t cluster;
     rados_ioctx_t io_ctx;
     rbd_image_t image;
-    char *name;
+    char *image_name;
     char *snap;
 } BDRVRBDState;
 
@@ -XXX,XX +XXX,XX @@ static int qemu_rbd_create(const char *filename, QemuOpts *opts, Error **errp)
     int64_t bytes = 0;
     int64_t objsize;
     int obj_order = 0;
-    const char *pool, *name, *conf, *clientname, *keypairs;
+    const char *pool, *image_name, *conf, *user, *keypairs;
     const char *secretid;
     rados_t cluster;
     rados_ioctx_t io_ctx;
@@ -XXX,XX +XXX,XX @@ static int qemu_rbd_create(const char *filename, QemuOpts *opts, Error **errp)
      */
     pool       = qdict_get_try_str(options, "pool");
     conf       = qdict_get_try_str(options, "conf");
-    clientname = qdict_get_try_str(options, "user");
-    name       = qdict_get_try_str(options, "image");
+    user       = qdict_get_try_str(options, "user");
+    image_name = qdict_get_try_str(options, "image");
     keypairs   = qdict_get_try_str(options, "=keyvalue-pairs");
 
-    ret = rados_create(&cluster, clientname);
+    ret = rados_create(&cluster, user);
     if (ret < 0) {
         error_setg_errno(errp, -ret, "error initializing");
         goto exit;
@@ -XXX,XX +XXX,XX @@ static int qemu_rbd_create(const char *filename, QemuOpts *opts, Error **errp)
         goto shutdown;
     }
 
-    ret = rbd_create(io_ctx, name, bytes, &obj_order);
+    ret = rbd_create(io_ctx, image_name, bytes, &obj_order);
     if (ret < 0) {
         error_setg_errno(errp, -ret, "error rbd create");
     }
@@ -XXX,XX +XXX,XX @@ static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
                          Error **errp)
 {
     BDRVRBDState *s = bs->opaque;
-    const char *pool, *snap, *conf, *clientname, *name, *keypairs;
+    const char *pool, *snap, *conf, *user, *image_name, *keypairs;
     const char *secretid;
     QemuOpts *opts;
     Error *local_err = NULL;
@@ -XXX,XX +XXX,XX @@ static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
     pool           = qemu_opt_get(opts, "pool");
     conf           = qemu_opt_get(opts, "conf");
     snap           = qemu_opt_get(opts, "snapshot");
-    clientname     = qemu_opt_get(opts, "user");
-    name           = qemu_opt_get(opts, "image");
+    user           = qemu_opt_get(opts, "user");
+    image_name     = qemu_opt_get(opts, "image");
     keypairs       = qemu_opt_get(opts, "=keyvalue-pairs");
 
-    if (!pool || !name) {
+    if (!pool || !image_name) {
         error_setg(errp, "Parameters 'pool' and 'image' are required");
         r = -EINVAL;
         goto failed_opts;
     }
 
-    r = rados_create(&s->cluster, clientname);
+    r = rados_create(&s->cluster, user);
     if (r < 0) {
         error_setg_errno(errp, -r, "error initializing");
         goto failed_opts;
     }
 
     s->snap = g_strdup(snap);
-    s->name = g_strdup(name);
+    s->image_name = g_strdup(image_name);
 
     /* try default location when conf=NULL, but ignore failure */
     r = rados_conf_read_file(s->cluster, conf);
@@ -XXX,XX +XXX,XX @@ static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
     }
 
     /* rbd_open is always r/w */
-    r = rbd_open(s->io_ctx, s->name, &s->image, s->snap);
+    r = rbd_open(s->io_ctx, s->image_name, &s->image, s->snap);
     if (r < 0) {
-        error_setg_errno(errp, -r, "error reading header from %s", s->name);
+        error_setg_errno(errp, -r, "error reading header from %s",
+                         s->image_name);
         goto failed_open;
     }
 
@@ -XXX,XX +XXX,XX @@ failed_open:
 failed_shutdown:
     rados_shutdown(s->cluster);
     g_free(s->snap);
-    g_free(s->name);
+    g_free(s->image_name);
 failed_opts:
     qemu_opts_del(opts);
     g_free(mon_host);
@@ -XXX,XX +XXX,XX @@ static void qemu_rbd_close(BlockDriverState *bs)
     rbd_close(s->image);
     rados_ioctx_destroy(s->io_ctx);
     g_free(s->snap);
-    g_free(s->name);
+    g_free(s->image_name);
     rados_shutdown(s->cluster);
 }
 
-- 
2.9.3

This adds support for reopen in rbd, for changing between r/w and r/o.

Note that this is only a flag change, but we will block a change from
r/o to r/w if we are using an RBD internal snapshot.
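
As a rough sketch of that rule, assuming a hypothetical helper name and
an illustrative flag value (neither is taken from QEMU or librbd), the
decision reduces to:

```c
#include <errno.h>
#include <stddef.h>

#define SKETCH_O_RDWR 0x0002  /* illustrative value only */

/* Returns 0 if the requested flags are acceptable, -EINVAL if r/w is
 * requested while a snapshot name is in use. */
static int sketch_rbd_reopen_check(const char *snap, int new_flags)
{
    if (snap != NULL && (new_flags & SKETCH_O_RDWR)) {
        return -EINVAL;
    }
    return 0;
}
```

A reopen to r/o is accepted regardless of the snapshot, and a reopen to
r/w is accepted only when no snapshot is in use.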

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: d4e87539167ec6527d44c97b164eabcccf96e4f3.1491597120.git.jcody@redhat.com
---
 block/rbd.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/block/rbd.c b/block/rbd.c
index XXXXXXX..XXXXXXX 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -XXX,XX +XXX,XX @@ failed_opts:
     return r;
 }
 
+
+/* Since RBD is currently always opened R/W via the API,
+ * we just need to check if we are using a snapshot or not, in
+ * order to determine if we will allow it to be R/W */
+static int qemu_rbd_reopen_prepare(BDRVReopenState *state,
+                                   BlockReopenQueue *queue, Error **errp)
+{
+    BDRVRBDState *s = state->bs->opaque;
+    int ret = 0;
+
+    if (s->snap && state->flags & BDRV_O_RDWR) {
+        error_setg(errp,
+                   "Cannot change node '%s' to r/w when using RBD snapshot",
+                   bdrv_get_device_or_node_name(state->bs));
+        ret = -EINVAL;
+    }
+
+    return ret;
+}
+
 static void qemu_rbd_close(BlockDriverState *bs)
 {
     BDRVRBDState *s = bs->opaque;
@@ -XXX,XX +XXX,XX @@ static BlockDriver bdrv_rbd = {
     .bdrv_parse_filename    = qemu_rbd_parse_filename,
     .bdrv_file_open         = qemu_rbd_open,
     .bdrv_close             = qemu_rbd_close,
+    .bdrv_reopen_prepare    = qemu_rbd_reopen_prepare,
     .bdrv_create            = qemu_rbd_create,
     .bdrv_has_zero_init     = bdrv_has_zero_init_1,
     .bdrv_get_info          = qemu_rbd_getinfo,
-- 
2.9.3

For the tests that use the common.qemu functions for running a QEMU
process, _cleanup_qemu must be called in the exit function.

Otherwise, if the qemu process aborts, not all of the droppings
(e.g. pidfile, fifos) are cleaned up.

This updates those tests that did not have a cleanup in qemu-iotests.

(I also replaced tabs with spaces in test 102.)

Reported-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
Message-id: d59c2f6ad6c1da8b9b3c7f357c94a7122ccfc55a.1492544096.git.jcody@redhat.com
---
 tests/qemu-iotests/028 |  1 +
 tests/qemu-iotests/094 | 11 ++++++++---
 tests/qemu-iotests/102 |  5 +++--
 tests/qemu-iotests/109 |  1 +
 tests/qemu-iotests/117 |  1 +
 tests/qemu-iotests/130 |  1 +
 tests/qemu-iotests/140 |  1 +
 tests/qemu-iotests/141 |  1 +
 tests/qemu-iotests/143 |  1 +
 tests/qemu-iotests/156 |  1 +
 10 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/tests/qemu-iotests/028 b/tests/qemu-iotests/028
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/028
+++ b/tests/qemu-iotests/028
@@ -XXX,XX +XXX,XX @@ status=1	# failure is the default!
 
 _cleanup()
 {
+    _cleanup_qemu
     rm -f "${TEST_IMG}.copy"
     _cleanup_test_img
 }
diff --git a/tests/qemu-iotests/094 b/tests/qemu-iotests/094
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/094
+++ b/tests/qemu-iotests/094
@@ -XXX,XX +XXX,XX @@ echo "QA output created by $seq"
 here="$PWD"
 status=1	# failure is the default!
 
-trap "exit \$status" 0 1 2 3 15
+_cleanup()
+{
+    _cleanup_qemu
+    _cleanup_test_img
+    rm -f "$TEST_DIR/source.$IMGFMT"
+}
+
+trap "_cleanup; exit \$status" 0 1 2 3 15
 
 # get standard environment, filters and checks
 . ./common.rc
@@ -XXX,XX +XXX,XX @@ _send_qemu_cmd $QEMU_HANDLE \
 
 wait=1 _cleanup_qemu
 
-_cleanup_test_img
-rm -f "$TEST_DIR/source.$IMGFMT"
 
 # success, all done
 echo '*** done'
diff --git a/tests/qemu-iotests/102 b/tests/qemu-iotests/102
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/102
+++ b/tests/qemu-iotests/102
@@ -XXX,XX +XXX,XX @@ seq=$(basename $0)
 echo "QA output created by $seq"
 
 here=$PWD
-status=1	# failure is the default!
+status=1    # failure is the default!
 
 _cleanup()
 {
-	_cleanup_test_img
+    _cleanup_qemu
+    _cleanup_test_img
 }
 trap "_cleanup; exit \$status" 0 1 2 3 15
 
diff --git a/tests/qemu-iotests/109 b/tests/qemu-iotests/109
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/109
+++ b/tests/qemu-iotests/109
@@ -XXX,XX +XXX,XX @@ status=1	# failure is the default!
 
 _cleanup()
 {
+    _cleanup_qemu
     rm -f $TEST_IMG.src
 	_cleanup_test_img
 }
diff --git a/tests/qemu-iotests/117 b/tests/qemu-iotests/117
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/117
+++ b/tests/qemu-iotests/117
@@ -XXX,XX +XXX,XX @@ status=1	# failure is the default!
 
 _cleanup()
 {
+    _cleanup_qemu
 	_cleanup_test_img
 }
 trap "_cleanup; exit \$status" 0 1 2 3 15
diff --git a/tests/qemu-iotests/130 b/tests/qemu-iotests/130
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/130
+++ b/tests/qemu-iotests/130
@@ -XXX,XX +XXX,XX @@ status=1	# failure is the default!
 
 _cleanup()
 {
+    _cleanup_qemu
     _cleanup_test_img
 }
 trap "_cleanup; exit \$status" 0 1 2 3 15
diff --git a/tests/qemu-iotests/140 b/tests/qemu-iotests/140
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/140
+++ b/tests/qemu-iotests/140
@@ -XXX,XX +XXX,XX @@ status=1	# failure is the default!
 
 _cleanup()
 {
+    _cleanup_qemu
     _cleanup_test_img
     rm -f "$TEST_DIR/nbd"
 }
diff --git a/tests/qemu-iotests/141 b/tests/qemu-iotests/141
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/141
+++ b/tests/qemu-iotests/141
@@ -XXX,XX +XXX,XX @@ status=1	# failure is the default!
 
 _cleanup()
 {
+    _cleanup_qemu
     _cleanup_test_img
     rm -f "$TEST_DIR/{b,m,o}.$IMGFMT"
 }
diff --git a/tests/qemu-iotests/143 b/tests/qemu-iotests/143
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/143
+++ b/tests/qemu-iotests/143
@@ -XXX,XX +XXX,XX @@ status=1	# failure is the default!
 
 _cleanup()
 {
+    _cleanup_qemu
     rm -f "$TEST_DIR/nbd"
 }
 trap "_cleanup; exit \$status" 0 1 2 3 15
diff --git a/tests/qemu-iotests/156 b/tests/qemu-iotests/156
index XXXXXXX..XXXXXXX 100755
--- a/tests/qemu-iotests/156
+++ b/tests/qemu-iotests/156
@@ -XXX,XX +XXX,XX @@ status=1	# failure is the default!
 
 _cleanup()
 {
+    _cleanup_qemu
     rm -f "$TEST_IMG{,.target}{,.backing,.overlay}"
 }
 trap "_cleanup; exit \$status" 0 1 2 3 15
-- 
2.9.3