Series comparison

 [PULL 0/1] Block patches
-The following changes since commit 887cba855bb6ff4775256f7968409281350b568c:
+The following changes since commit ddc27d2ad9361a81c2b3800d14143bf420dae172:
-  configure: Fix cross-building for RISCV host (v5) (2023-07-11 17:56:09 +0100)
+  Merge tag 'pull-request-2024-03-18' of https://gitlab.com/thuth/qemu into staging (2024-03-19 10:25:25 +0000)
 are available in the Git repository at:
   https://gitlab.com/stefanha/qemu.git tags/block-pull-request
-for you to fetch changes up to 75dcb4d790bbe5327169fd72b185960ca58e2fa6:
+for you to fetch changes up to 86a637e48104ae74d8be53bed6441ce32be33433:
-  virtio-blk: fix host notifier issues during dataplane start/stop (2023-07-12 15:20:32 -0400)
+  coroutine: cap per-thread local pool size (2024-03-19 10:49:31 -0400)
 ----------------------------------------------------------------
 Pull request
+This fix solves the "failed to set up stack guard page" error that has been
+reported on Linux hosts where the QEMU coroutine pool exceeds the
+vm.max_map_count limit.
 ----------------------------------------------------------------
 Stefan Hajnoczi (1):
-  virtio-blk: fix host notifier issues during dataplane start/stop
+  coroutine: cap per-thread local pool size
- hw/block/dataplane/virtio-blk.c | 67 +++++++++++++++++++--------------
+ util/qemu-coroutine.c | 282 +++++++++++++++++++++++++++++++++---------
-file changed, 38 insertions(+), 29 deletions(-)
+file changed, 223 insertions(+), 59 deletions(-)
 --
-.40.1
+.44.0

-[PULL 1/1] virtio-blk: fix host notifier issues during dataplane start/stop
+[PULL 1/1] coroutine: cap per-thread local pool size
-The main loop thread can consume 100% CPU when using --device
+The coroutine pool implementation can hit the Linux vm.max_map_count
-virtio-blk-pci,iothread=<iothread>. ppoll() constantly returns but
+limit, causing QEMU to abort with "failed to allocate memory for stack"
-reading virtqueue host notifiers fails with EAGAIN. The file descriptors
+or "failed to set up stack guard page" during coroutine creation.
-are stale and remain registered with the AioContext because of bugs in
-the virtio-blk dataplane start/stop code.
+This happens because per-thread pools can grow to tens of thousands of
+coroutines. Each coroutine causes 2 virtual memory areas to be created.
-The problem is that the dataplane start/stop code involves drain
+Eventually vm.max_map_count is reached and memory-related syscalls fail.
-operations, which call virtio_blk_drained_begin() and
+The per-thread pool sizes are non-uniform and depend on past coroutine
-virtio_blk_drained_end() at points where the host notifier is not
+usage in each thread, so it's possible for one thread to have a large
-operational:
+pool while another thread's pool is empty.
-- In virtio_blk_data_plane_start(), blk_set_aio_context() drains after
-  vblk->dataplane_started has been set to true but the host notifier has
+Switch to a new coroutine pool implementation with a global pool that
-  not been attached yet.
+grows to a maximum number of coroutines and per-thread local pools that
-- In virtio_blk_data_plane_stop(), blk_drain() and blk_set_aio_context()
+are capped at a hardcoded small number of coroutines.
-  drain after the host notifier has already been detached but with
-  vblk->dataplane_started still set to true.
+This approach does not leave large numbers of coroutines pooled in a
+thread that may not use them again. In order to perform well it
-I would like to simplify ->ioeventfd_start/stop() to avoid interactions
+amortizes the cost of global pool accesses by working in batches of
-with drain entirely, but couldn't find a way to do that. Instead, this
+coroutines instead of individual coroutines.
-patch accepts the fragile nature of the code and reorders it so that
-vblk->dataplane_started is false during drain operations. This way the
+The global pool is a list. Threads donate batches of coroutines to it when
-virtio_blk_drained_begin() and virtio_blk_drained_end() calls don't
+they have too many and take batches from it when they have too few:
-touch the host notifier. The result is that
-virtio_blk_data_plane_start() and virtio_blk_data_plane_stop() have
+.-----------------------------------.
-complete control over the host notifier and stale file descriptors are
+| Batch 1 | Batch 2 | Batch 3 | ... | global_pool
-no longer left in the AioContext.
+`-----------------------------------'
-This patch fixes the 100% CPU consumption in the main loop thread and
+Each thread has up to 2 batches of coroutines:
-correctly moves host notifier processing to the IOThread.
+.-------------------.
-Fixes: 1665d9326fd2 ("virtio-blk: implement BlockDevOps->drained_begin()")
+| Batch 1 | Batch 2 | per-thread local_pool (maximum 2 batches)
-Reported-by: Lukáš Doktor <ldoktor@redhat.com>
+`-------------------'
 The goal of this change is to reduce the excessive number of pooled
 coroutines that cause QEMU to abort when vm.max_map_count is reached
 without losing the performance of an adequately sized coroutine pool.
 Here are virtio-blk disk I/O benchmark results:
       RW BLKSIZE IODEPTH    OLD    NEW CHANGE
 randread      4k       1 113725 117451 +3.3%
 randread      4k       8 192968 198510 +2.9%
 randread      4k      16 207138 209429 +1.1%
 randread      4k      32 212399 215145 +1.3%
 randread      4k      64 218319 221277 +1.4%
 randread    128k       1  17587  17535 -0.3%
 randread    128k       8  17614  17616 +0.0%
 randread    128k      16  17608  17609 +0.0%
 randread    128k      32  17552  17553 +0.0%
 randread    128k      64  17484  17484 +0.0%
 See files/{fio.sh,test.xml.j2} for the benchmark configuration:
 https://gitlab.com/stefanha/virt-playbooks/-/tree/coroutine-pool-fix-sizing
 Buglink: https://issues.redhat.com/browse/RHEL-28947
 Reported-by: Sanjay Rao <srao@redhat.com>
 Reported-by: Boaz Ben Shabat <bbenshab@redhat.com>
 Reported-by: Joe Mario <jmario@redhat.com>
 Reviewed-by: Kevin Wolf <kwolf@redhat.com>
 Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
-Tested-by: Lukas Doktor <ldoktor@redhat.com>
+Message-ID: <20240318183429.1039340-1-stefanha@redhat.com>
 Message-id: 20230704151527.193586-1-stefanha@redhat.com
 Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
 ---
- hw/block/dataplane/virtio-blk.c | 67 +++++++++++++++++++--------------
+ util/qemu-coroutine.c | 282 +++++++++++++++++++++++++++++++++---------
-file changed, 38 insertions(+), 29 deletions(-)
+file changed, 223 insertions(+), 59 deletions(-)
-diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
+diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
 index XXXXXXX..XXXXXXX 100644
---- a/hw/block/dataplane/virtio-blk.c
+--- a/util/qemu-coroutine.c
-+++ b/hw/block/dataplane/virtio-blk.c
++++ b/util/qemu-coroutine.c
-@@ -XXX,XX +XXX,XX @@ int virtio_blk_data_plane_start(VirtIODevice *vdev)
+@@ -XXX,XX +XXX,XX @@
+ #include "qemu/atomic.h"
-     memory_region_transaction_commit();
+ #include "qemu/coroutine_int.h"
+ #include "qemu/coroutine-tls.h"
--    /*
++#include "qemu/cutils.h"
--     * These fields are visible to the IOThread so we rely on implicit barriers
+ #include "block/aio.h"
--     * in aio_context_acquire() on the write side and aio_notify_accept() on
--     * the read side.
+-/**
--     */
+- * The minimal batch size is always 64, coroutines from the release_pool are
--    s->starting = false;
+- * reused as soon as there are 64 coroutines in it. The maximum pool size starts
--    vblk->dataplane_started = true;
+- * with 64 and is increased on demand so that coroutines are not deleted even if
-     trace_virtio_blk_data_plane_start(s);
+- * they are not immediately reused.
+- */
-     old_context = blk_get_aio_context(s->conf->conf.blk);
+ enum {
-@@ -XXX,XX +XXX,XX @@ int virtio_blk_data_plane_start(VirtIODevice *vdev)
+-    POOL_MIN_BATCH_SIZE = 64,
-         event_notifier_set(virtio_queue_get_host_notifier(vq));
+-    POOL_INITIAL_MAX_SIZE = 64,
 +    COROUTINE_POOL_BATCH_MAX_SIZE = 128,
  };
 -/** Free list to speed up creation */
 -static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
 -static unsigned int pool_max_size = POOL_INITIAL_MAX_SIZE;
 -static unsigned int release_pool_size;
 +/*
 + * Coroutine creation and deletion is expensive so a pool of unused coroutines
 + * is kept as a cache. When the pool has coroutines available, they are
 + * recycled instead of creating new ones from scratch. Coroutines are added to
 + * the pool upon termination.
 + *
 + * The pool is global but each thread maintains a small local pool to avoid
 + * global pool contention. Threads fetch and return batches of coroutines from
 + * the global pool to maintain their local pool. The local pool holds up to two
 + * batches whereas the maximum size of the global pool is controlled by the
 + * qemu_coroutine_inc_pool_size() API.
 + *
 + * .-----------------------------------.
 + * | Batch 1 | Batch 2 | Batch 3 | ... | global_pool
 + * `-----------------------------------'
 + *
 + * .-------------------.
 + * | Batch 1 | Batch 2 | per-thread local_pool (maximum 2 batches)
 + * `-------------------'
 + */
 +typedef struct CoroutinePoolBatch {
 +    /* Batches are kept in a list */
 +    QSLIST_ENTRY(CoroutinePoolBatch) next;
 -typedef QSLIST_HEAD(, Coroutine) CoroutineQSList;
 -QEMU_DEFINE_STATIC_CO_TLS(CoroutineQSList, alloc_pool);
 -QEMU_DEFINE_STATIC_CO_TLS(unsigned int, alloc_pool_size);
 -QEMU_DEFINE_STATIC_CO_TLS(Notifier, coroutine_pool_cleanup_notifier);
 +    /* This batch holds up to @COROUTINE_POOL_BATCH_MAX_SIZE coroutines */
 +    QSLIST_HEAD(, Coroutine) list;
 +    unsigned int size;
 +} CoroutinePoolBatch;
 -static void coroutine_pool_cleanup(Notifier *n, void *value)
 +typedef QSLIST_HEAD(, CoroutinePoolBatch) CoroutinePool;
 +
 +/* Host operating system limit on number of pooled coroutines */
 +static unsigned int global_pool_hard_max_size;
 +
 +static QemuMutex global_pool_lock; /* protects the following variables */
 +static CoroutinePool global_pool = QSLIST_HEAD_INITIALIZER(global_pool);
 +static unsigned int global_pool_size;
 +static unsigned int global_pool_max_size = COROUTINE_POOL_BATCH_MAX_SIZE;
 +
 +QEMU_DEFINE_STATIC_CO_TLS(CoroutinePool, local_pool);
 +QEMU_DEFINE_STATIC_CO_TLS(Notifier, local_pool_cleanup_notifier);
 +
 +static CoroutinePoolBatch *coroutine_pool_batch_new(void)
 +{
 +    CoroutinePoolBatch *batch = g_new(CoroutinePoolBatch, 1);
 +
 +    QSLIST_INIT(&batch->list);
 +    batch->size = 0;
 +    return batch;
 +}
 +
 +static void coroutine_pool_batch_delete(CoroutinePoolBatch *batch)
  {
      Coroutine *co;
      Coroutine *tmp;
 -    CoroutineQSList *alloc_pool = get_ptr_alloc_pool();
 -    QSLIST_FOREACH_SAFE(co, alloc_pool, pool_next, tmp) {
 -        QSLIST_REMOVE_HEAD(alloc_pool, pool_next);
 +    QSLIST_FOREACH_SAFE(co, &batch->list, pool_next, tmp) {
 +        QSLIST_REMOVE_HEAD(&batch->list, pool_next);
          qemu_coroutine_delete(co);
      }
++    g_free(batch);
 +}
 +
 +static void local_pool_cleanup(Notifier *n, void *value)
 +{
 +    CoroutinePool *local_pool = get_ptr_local_pool();
 +    CoroutinePoolBatch *batch;
 +    CoroutinePoolBatch *tmp;
 +
 +    QSLIST_FOREACH_SAFE(batch, local_pool, next, tmp) {
 +        QSLIST_REMOVE_HEAD(local_pool, next);
 +        coroutine_pool_batch_delete(batch);
 +    }
 +}
 +
 +/* Ensure the atexit notifier is registered */
 +static void local_pool_cleanup_init_once(void)
 +{
 +    Notifier *notifier = get_ptr_local_pool_cleanup_notifier();
 +    if (!notifier->notify) {
 +        notifier->notify = local_pool_cleanup;
 +        qemu_thread_atexit_add(notifier);
 +    }
 +}
 +
 +/* Helper to get the next unused coroutine from the local pool */
 +static Coroutine *coroutine_pool_get_local(void)
 +{
 +    CoroutinePool *local_pool = get_ptr_local_pool();
 +    CoroutinePoolBatch *batch = QSLIST_FIRST(local_pool);
 +    Coroutine *co;
 +
 +    if (unlikely(!batch)) {
 +        return NULL;
 +    }
 +
 +    co = QSLIST_FIRST(&batch->list);
 +    QSLIST_REMOVE_HEAD(&batch->list, pool_next);
 +    batch->size--;
 +
 +    if (batch->size == 0) {
 +        QSLIST_REMOVE_HEAD(local_pool, next);
 +        coroutine_pool_batch_delete(batch);
 +    }
 +    return co;
 +}
 +
 +/* Get the next batch from the global pool */
 +static void coroutine_pool_refill_local(void)
 +{
 +    CoroutinePool *local_pool = get_ptr_local_pool();
 +    CoroutinePoolBatch *batch;
 +
 +    WITH_QEMU_LOCK_GUARD(&global_pool_lock) {
 +        batch = QSLIST_FIRST(&global_pool);
 +
 +        if (batch) {
 +            QSLIST_REMOVE_HEAD(&global_pool, next);
 +            global_pool_size -= batch->size;
 +        }
 +    }
 +
 +    if (batch) {
 +        QSLIST_INSERT_HEAD(local_pool, batch, next);
 +        local_pool_cleanup_init_once();
 +    }
 +}
 +
 +/* Add a batch of coroutines to the global pool */
 +static void coroutine_pool_put_global(CoroutinePoolBatch *batch)
 +{
 +    WITH_QEMU_LOCK_GUARD(&global_pool_lock) {
 +        unsigned int max = MIN(global_pool_max_size,
 +                               global_pool_hard_max_size);
 +
 +        if (global_pool_size < max) {
 +            QSLIST_INSERT_HEAD(&global_pool, batch, next);
 +
 +            /* Overshooting the max pool size is allowed */
 +            global_pool_size += batch->size;
 +            return;
 +        }
 +    }
 +
 +    /* The global pool was full, so throw away this batch */
 +    coroutine_pool_batch_delete(batch);
 +}
 +
 +/* Get the next unused coroutine from the pool or return NULL */
 +static Coroutine *coroutine_pool_get(void)
 +{
 +    Coroutine *co;
 +
 +    co = coroutine_pool_get_local();
 +    if (!co) {
 +        coroutine_pool_refill_local();
 +        co = coroutine_pool_get_local();
 +    }
 +    return co;
 +}
 +
 +static void coroutine_pool_put(Coroutine *co)
 +{
 +    CoroutinePool *local_pool = get_ptr_local_pool();
 +    CoroutinePoolBatch *batch = QSLIST_FIRST(local_pool);
 +
 +    if (unlikely(!batch)) {
 +        batch = coroutine_pool_batch_new();
 +        QSLIST_INSERT_HEAD(local_pool, batch, next);
 +        local_pool_cleanup_init_once();
 +    }
 +
 +    if (unlikely(batch->size >= COROUTINE_POOL_BATCH_MAX_SIZE)) {
 +        CoroutinePoolBatch *next = QSLIST_NEXT(batch, next);
 +
 +        /* Is the local pool full? */
 +        if (next) {
 +            QSLIST_REMOVE_HEAD(local_pool, next);
 +            coroutine_pool_put_global(batch);
 +        }
 +
 +        batch = coroutine_pool_batch_new();
 +        QSLIST_INSERT_HEAD(local_pool, batch, next);
 +    }
 +
 +    QSLIST_INSERT_HEAD(&batch->list, co, pool_next);
 +    batch->size++;
  }
  Coroutine *qemu_coroutine_create(CoroutineEntry *entry, void *opaque)
@@ -XXX,XX +XXX,XX @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry, void *opaque)
      Coroutine *co = NULL;
      if (IS_ENABLED(CONFIG_COROUTINE_POOL)) {
 -        CoroutineQSList *alloc_pool = get_ptr_alloc_pool();
 -
 -        co = QSLIST_FIRST(alloc_pool);
 -        if (!co) {
 -            if (release_pool_size > POOL_MIN_BATCH_SIZE) {
 -                /* Slow path; a good place to register the destructor, too.  */
 -                Notifier *notifier = get_ptr_coroutine_pool_cleanup_notifier();
 -                if (!notifier->notify) {
 -                    notifier->notify = coroutine_pool_cleanup;
 -                    qemu_thread_atexit_add(notifier);
 -                }
 -
 -                /* This is not exact; there could be a little skew between
 -                 * release_pool_size and the actual size of release_pool.  But
 -                 * it is just a heuristic, it does not need to be perfect.
 -                 */
 -                set_alloc_pool_size(qatomic_xchg(&release_pool_size, 0));
 -                QSLIST_MOVE_ATOMIC(alloc_pool, &release_pool);
 -                co = QSLIST_FIRST(alloc_pool);
 -            }
 -        }
 -        if (co) {
 -            QSLIST_REMOVE_HEAD(alloc_pool, pool_next);
 -            set_alloc_pool_size(get_alloc_pool_size() - 1);
 -        }
 +        co = coroutine_pool_get();
      }
      if (!co) {
@@ -XXX,XX +XXX,XX @@ static void coroutine_delete(Coroutine *co)
      co->caller = NULL;
      if (IS_ENABLED(CONFIG_COROUTINE_POOL)) {
 -        if (release_pool_size < qatomic_read(&pool_max_size) * 2) {
 -            QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
 -            qatomic_inc(&release_pool_size);
 -            return;
 -        }
 -        if (get_alloc_pool_size() < qatomic_read(&pool_max_size)) {
 -            QSLIST_INSERT_HEAD(get_ptr_alloc_pool(), co, pool_next);
 -            set_alloc_pool_size(get_alloc_pool_size() + 1);
 -            return;
 -        }
 +        coroutine_pool_put(co);
 +    } else {
 +        qemu_coroutine_delete(co);
      }
 -
 -    qemu_coroutine_delete(co);
  }
  void qemu_aio_coroutine_enter(AioContext *ctx, Coroutine *co)
@@ -XXX,XX +XXX,XX @@ AioContext *qemu_coroutine_get_aio_context(Coroutine *co)
  void qemu_coroutine_inc_pool_size(unsigned int additional_pool_size)
  {
 -    qatomic_add(&pool_max_size, additional_pool_size);
 +    QEMU_LOCK_GUARD(&global_pool_lock);
 +    global_pool_max_size += additional_pool_size;
  }
  void qemu_coroutine_dec_pool_size(unsigned int removing_pool_size)
  {
 -    qatomic_sub(&pool_max_size, removing_pool_size);
 +    QEMU_LOCK_GUARD(&global_pool_lock);
 +    global_pool_max_size -= removing_pool_size;
 +}
 +
 +static unsigned int get_global_pool_hard_max_size(void)
 +{
 +#ifdef __linux__
 +    g_autofree char *contents = NULL;
 +    int max_map_count;
 +
 +    /*
-+     * These fields must be visible to the IOThread when it processes the
++     * Linux processes can have up to max_map_count virtual memory areas
-+     * virtqueue, otherwise it will think dataplane has not started yet.
++     * (VMAs). mmap(2), mprotect(2), etc fail with ENOMEM beyond this limit. We
-+     *
++     * must limit the coroutine pool to a safe size to avoid running out of
-+     * Make sure ->dataplane_started is false when blk_set_aio_context() is
++     * VMAs.
 +     * called above so that draining does not cause the host notifier to be
 +     * detached/attached prematurely.
 +     */
-+    s->starting = false;
++    if (g_file_get_contents("/proc/sys/vm/max_map_count", &contents, NULL,
-+    vblk->dataplane_started = true;
++                            NULL) &&
-+    smp_wmb(); /* paired with aio_notify_accept() on the read side */
++        qemu_strtoi(contents, NULL, 10, &max_map_count) == 0) {
-+
++        /*
-     /* Get this show started by hooking up our callbacks */
++         * This is a conservative upper bound that avoids exceeding
-     if (!blk_in_drain(s->conf->conf.blk)) {
++         * max_map_count. Leave half for non-coroutine users like library
-         aio_context_acquire(s->ctx);
++         * dependencies, vhost-user, etc. Each coroutine takes up 2 VMAs so
-@@ -XXX,XX +XXX,XX @@ int virtio_blk_data_plane_start(VirtIODevice *vdev)
++         * halve the amount again.
-   fail_guest_notifiers:
++         */
-     vblk->dataplane_disabled = true;
++        return max_map_count / 4;
-     s->starting = false;
++    }
--    vblk->dataplane_started = true;
++#endif
-     return -ENOSYS;
++
- }
++    return UINT_MAX;
++}
-@@ -XXX,XX +XXX,XX @@ void virtio_blk_data_plane_stop(VirtIODevice *vdev)
++
-         aio_wait_bh_oneshot(s->ctx, virtio_blk_data_plane_stop_bh, s);
++static void __attribute__((constructor)) qemu_coroutine_init(void)
-     }
++{
++    qemu_mutex_init(&global_pool_lock);
-+    /*
++    global_pool_hard_max_size = get_global_pool_hard_max_size();
 +     * Batch all the host notifiers in a single transaction to avoid
 +     * quadratic time complexity in address_space_update_ioeventfds().
 +     */
 +    memory_region_transaction_begin();
 +
 +    for (i = 0; i < nvqs; i++) {
 +        virtio_bus_set_host_notifier(VIRTIO_BUS(qbus), i, false);
 +    }
 +
 +    /*
 +     * The transaction expects the ioeventfds to be open when it
 +     * commits. Do it now, before the cleanup loop.
 +     */
 +    memory_region_transaction_commit();
 +
 +    for (i = 0; i < nvqs; i++) {
 +        virtio_bus_cleanup_host_notifier(VIRTIO_BUS(qbus), i);
 +    }
 +
 +    /*
 +     * Set ->dataplane_started to false before draining so that host notifiers
 +     * are not detached/attached anymore.
 +     */
 +    vblk->dataplane_started = false;
 +
      aio_context_acquire(s->ctx);
      /* Wait for virtio_blk_dma_restart_bh() and in flight I/O to complete */
@@ -XXX,XX +XXX,XX @@ void virtio_blk_data_plane_stop(VirtIODevice *vdev)
      aio_context_release(s->ctx);
 -    /*
 -     * Batch all the host notifiers in a single transaction to avoid
 -     * quadratic time complexity in address_space_update_ioeventfds().
 -     */
 -    memory_region_transaction_begin();
 -
 -    for (i = 0; i < nvqs; i++) {
 -        virtio_bus_set_host_notifier(VIRTIO_BUS(qbus), i, false);
 -    }
 -
 -    /*
 -     * The transaction expects the ioeventfds to be open when it
 -     * commits. Do it now, before the cleanup loop.
 -     */
 -    memory_region_transaction_commit();
 -
 -    for (i = 0; i < nvqs; i++) {
 -        virtio_bus_cleanup_host_notifier(VIRTIO_BUS(qbus), i);
 -    }
 -
      qemu_bh_cancel(s->bh);
      notify_guest_bh(s); /* final chance to notify guest */
      /* Clean up guest notifier (irq) */
      k->set_guest_notifiers(qbus->parent, nvqs, false);
 -    vblk->dataplane_started = false;
      s->stopping = false;
  }
 --
-.40.1
+.44.0

The main loop thread can consume 100% CPU when using --device
virtio-blk-pci,iothread=<iothread>. ppoll() constantly returns but
reading virtqueue host notifiers fails with EAGAIN. The file descriptors
are stale and remain registered with the AioContext because of bugs in
the virtio-blk dataplane start/stop code.

The problem is that the dataplane start/stop code involves drain
operations, which call virtio_blk_drained_begin() and
virtio_blk_drained_end() at points where the host notifier is not
operational:
- In virtio_blk_data_plane_start(), blk_set_aio_context() drains after
  vblk->dataplane_started has been set to true but the host notifier has
  not been attached yet.
- In virtio_blk_data_plane_stop(), blk_drain() and blk_set_aio_context()
  drain after the host notifier has already been detached but with
  vblk->dataplane_started still set to true.

I would like to simplify ->ioeventfd_start/stop() to avoid interactions
with drain entirely, but couldn't find a way to do that. Instead, this
patch accepts the fragile nature of the code and reorders it so that
vblk->dataplane_started is false during drain operations. This way the
virtio_blk_drained_begin() and virtio_blk_drained_end() calls don't
touch the host notifier. The result is that
virtio_blk_data_plane_start() and virtio_blk_data_plane_stop() have
complete control over the host notifier and stale file descriptors are
no longer left in the AioContext.

This patch fixes the 100% CPU consumption in the main loop thread and
correctly moves host notifier processing to the IOThread.

Fixes: 1665d9326fd2 ("virtio-blk: implement BlockDevOps->drained_begin()")
Reported-by: Lukáš Doktor <ldoktor@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Lukas Doktor <ldoktor@redhat.com>
Message-id: 20230704151527.193586-1-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 hw/block/dataplane/virtio-blk.c | 67 +++++++++++++++++++--------------
 1 file changed, 38 insertions(+), 29 deletions(-)

diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -XXX,XX +XXX,XX @@ int virtio_blk_data_plane_start(VirtIODevice *vdev)
 
     memory_region_transaction_commit();
 
-    /*
-     * These fields are visible to the IOThread so we rely on implicit barriers
-     * in aio_context_acquire() on the write side and aio_notify_accept() on
-     * the read side.
-     */
-    s->starting = false;
-    vblk->dataplane_started = true;
     trace_virtio_blk_data_plane_start(s);
 
     old_context = blk_get_aio_context(s->conf->conf.blk);
@@ -XXX,XX +XXX,XX @@ int virtio_blk_data_plane_start(VirtIODevice *vdev)
         event_notifier_set(virtio_queue_get_host_notifier(vq));
     }
 
+    /*
+     * These fields must be visible to the IOThread when it processes the
+     * virtqueue, otherwise it will think dataplane has not started yet.
+     *
+     * Make sure ->dataplane_started is false when blk_set_aio_context() is
+     * called above so that draining does not cause the host notifier to be
+     * detached/attached prematurely.
+     */
+    s->starting = false;
+    vblk->dataplane_started = true;
+    smp_wmb(); /* paired with aio_notify_accept() on the read side */
+
     /* Get this show started by hooking up our callbacks */
     if (!blk_in_drain(s->conf->conf.blk)) {
         aio_context_acquire(s->ctx);
@@ -XXX,XX +XXX,XX @@ int virtio_blk_data_plane_start(VirtIODevice *vdev)
   fail_guest_notifiers:
     vblk->dataplane_disabled = true;
     s->starting = false;
-    vblk->dataplane_started = true;
     return -ENOSYS;
 }
 
@@ -XXX,XX +XXX,XX @@ void virtio_blk_data_plane_stop(VirtIODevice *vdev)
         aio_wait_bh_oneshot(s->ctx, virtio_blk_data_plane_stop_bh, s);
     }
 
+    /*
+     * Batch all the host notifiers in a single transaction to avoid
+     * quadratic time complexity in address_space_update_ioeventfds().
+     */
+    memory_region_transaction_begin();
+
+    for (i = 0; i < nvqs; i++) {
+        virtio_bus_set_host_notifier(VIRTIO_BUS(qbus), i, false);
+    }
+
+    /*
+     * The transaction expects the ioeventfds to be open when it
+     * commits. Do it now, before the cleanup loop.
+     */
+    memory_region_transaction_commit();
+
+    for (i = 0; i < nvqs; i++) {
+        virtio_bus_cleanup_host_notifier(VIRTIO_BUS(qbus), i);
+    }
+
+    /*
+     * Set ->dataplane_started to false before draining so that host notifiers
+     * are not detached/attached anymore.
+     */
+    vblk->dataplane_started = false;
+
     aio_context_acquire(s->ctx);
 
     /* Wait for virtio_blk_dma_restart_bh() and in flight I/O to complete */
@@ -XXX,XX +XXX,XX @@ void virtio_blk_data_plane_stop(VirtIODevice *vdev)
 
     aio_context_release(s->ctx);
 
-    /*
-     * Batch all the host notifiers in a single transaction to avoid
-     * quadratic time complexity in address_space_update_ioeventfds().
-     */
-    memory_region_transaction_begin();
-
-    for (i = 0; i < nvqs; i++) {
-        virtio_bus_set_host_notifier(VIRTIO_BUS(qbus), i, false);
-    }
-
-    /*
-     * The transaction expects the ioeventfds to be open when it
-     * commits. Do it now, before the cleanup loop.
-     */
-    memory_region_transaction_commit();
-
-    for (i = 0; i < nvqs; i++) {
-        virtio_bus_cleanup_host_notifier(VIRTIO_BUS(qbus), i);
-    }
-
     qemu_bh_cancel(s->bh);
     notify_guest_bh(s); /* final chance to notify guest */
 
     /* Clean up guest notifier (irq) */
     k->set_guest_notifiers(qbus->parent, nvqs, false);
 
-    vblk->dataplane_started = false;
     s->stopping = false;
 }
-- 
2.40.1

The coroutine pool implementation can hit the Linux vm.max_map_count
limit, causing QEMU to abort with "failed to allocate memory for stack"
or "failed to set up stack guard page" during coroutine creation.

This happens because per-thread pools can grow to tens of thousands of
coroutines. Each coroutine causes 2 virtual memory areas to be created.
Eventually vm.max_map_count is reached and memory-related syscalls fail.
The per-thread pool sizes are non-uniform and depend on past coroutine
usage in each thread, so it's possible for one thread to have a large
pool while another thread's pool is empty.
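
As a rough illustration of the arithmetic above, the sketch below
estimates how many pooled coroutines fit under a given
vm.max_map_count. It follows the heuristic this patch uses (leave half
of the VMAs to other users, then divide by the 2 VMAs each coroutine
needs). The helper names are hypothetical and not part of QEMU, and
plain stdio is used here instead of the GLib and QEMU helpers that the
patch itself relies on.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Hypothetical helper: pooled coroutines that fit under a VMA limit. */
static unsigned int pooled_coroutine_cap(int max_map_count)
{
    assert(max_map_count >= 0);
    /* Half of the VMAs stay free for other users; 2 VMAs per coroutine. */
    return (unsigned int)max_map_count / 4;
}

/* Hypothetical helper: read the Linux limit, UINT_MAX if unavailable. */
static unsigned int pooled_coroutine_cap_from_procfs(void)
{
    int max_map_count;
    unsigned int cap = UINT_MAX;
    FILE *f = fopen("/proc/sys/vm/max_map_count", "r");

    if (f) {
        if (fscanf(f, "%d", &max_map_count) == 1 && max_map_count >= 0) {
            cap = pooled_coroutine_cap(max_map_count);
        }
        fclose(f);
    }
    return cap;
}
```

With the usual Linux default of 65530 (some distributions set a higher
value), pooled_coroutine_cap(65530) yields 16382 coroutines.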

Switch to a new coroutine pool implementation with a global pool that
grows to a maximum number of coroutines and per-thread local pools that
are capped at a hardcoded small number of coroutines.

This approach does not leave large numbers of coroutines pooled in a
thread that may not use them again. In order to perform well it
amortizes the cost of global pool accesses by working in batches of
coroutines instead of individual coroutines.

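The global pool is a list. Threads donate batches of coroutines to it
when they have too many and take batches from it when they have too few:

.-----------------------------------.
| Batch 1 | Batch 2 | Batch 3 | ... | global_pool
`-----------------------------------'

Each thread has up to 2 batches of coroutines: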

.-------------------.
| Batch 1 | Batch 2 | per-thread local_pool (maximum 2 batches)
`-------------------'
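
The donate/take scheme described above can be sketched in standalone C.
This is a simplified illustration, not the code from this patch: it
uses hand-rolled lists instead of QSLIST, a single "local" pool instead
of thread-local storage, no mutex around the global pool, and no cap on
the global pool size. All names below are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

enum { BATCH_MAX = 128, LOCAL_MAX_BATCHES = 2 };

typedef struct Item { struct Item *next; } Item;

typedef struct Batch {
    struct Batch *next;
    Item *items;            /* up to BATCH_MAX pooled items */
    unsigned int size;
} Batch;

typedef struct Pool { Batch *head; unsigned int nbatches; } Pool;

static Pool global_pool;    /* a real implementation needs a lock here */
static Pool local_pool;     /* a real implementation keeps one per thread */

static void pool_push(Pool *p, Batch *b)
{
    b->next = p->head;
    p->head = b;
    p->nbatches++;
}

static Batch *pool_pop(Pool *p)
{
    Batch *b = p->head;

    if (b) {
        p->head = b->next;
        p->nbatches--;
    }
    return b;
}

/* Return an item to the pool, donating a full batch when local is full */
static void pool_put(Item *it)
{
    Batch *b = local_pool.head;

    if (!b || b->size >= BATCH_MAX) {
        if (b && local_pool.nbatches >= LOCAL_MAX_BATCHES) {
            pool_push(&global_pool, pool_pop(&local_pool));
        }
        b = calloc(1, sizeof(*b));
        assert(b);
        pool_push(&local_pool, b);
    }
    it->next = b->items;
    b->items = it;
    b->size++;
}

/* Take an item, refilling from the global pool one batch at a time */
static Item *pool_get(void)
{
    Batch *b = local_pool.head;
    Item *it;

    if (!b) {
        b = pool_pop(&global_pool);
        if (!b) {
            return NULL;    /* caller must create a fresh item */
        }
        pool_push(&local_pool, b);
    }
    it = b->items;
    assert(it);             /* batches in a pool are never empty */
    b->items = it->next;
    if (--b->size == 0) {
        free(pool_pop(&local_pool));
    }
    return it;
}
```

In this sketch the global pool is touched at most once per BATCH_MAX
puts or gets, which is where the amortization comes from.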

The goal of this change is to reduce the excessive number of pooled
coroutines that cause QEMU to abort when vm.max_map_count is reached
without losing the performance of an adequately sized coroutine pool.

Here are virtio-blk disk I/O benchmark results:

      RW BLKSIZE IODEPTH    OLD    NEW CHANGE
randread      4k       1 113725 117451 +3.3%
randread      4k       8 192968 198510 +2.9%
randread      4k      16 207138 209429 +1.1%
randread      4k      32 212399 215145 +1.3%
randread      4k      64 218319 221277 +1.4%
randread    128k       1  17587  17535 -0.3%
randread    128k       8  17614  17616 +0.0%
randread    128k      16  17608  17609 +0.0%
randread    128k      32  17552  17553 +0.0%
randread    128k      64  17484  17484 +0.0%

See files/{fio.sh,test.xml.j2} for the benchmark configuration:
https://gitlab.com/stefanha/virt-playbooks/-/tree/coroutine-pool-fix-sizing
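
For readers checking the table, the CHANGE column appears to be the
relative difference between the NEW and OLD columns. That reading is an
assumption, as the message does not define the column, and the helper
name below is hypothetical:

```c
#include <assert.h>

/* Hypothetical helper: relative change from old_val to new_val, in %. */
static double percent_change(double old_val, double new_val)
{
    assert(old_val != 0.0);
    return (new_val - old_val) / old_val * 100.0;
}
```

For the first row, percent_change(113725, 117451) evaluates to roughly
3.28, which is consistent with the +3.3% shown.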

Buglink: https://issues.redhat.com/browse/RHEL-28947
Reported-by: Sanjay Rao <srao@redhat.com>
Reported-by: Boaz Ben Shabat <bbenshab@redhat.com>
Reported-by: Joe Mario <jmario@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20240318183429.1039340-1-stefanha@redhat.com>
---
 util/qemu-coroutine.c | 282 +++++++++++++++++++++++++++++++++---------
 1 file changed, 223 insertions(+), 59 deletions(-)

diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
index XXXXXXX..XXXXXXX 100644
--- a/util/qemu-coroutine.c
+++ b/util/qemu-coroutine.c
@@ -XXX,XX +XXX,XX @@
 #include "qemu/atomic.h"
 #include "qemu/coroutine_int.h"
 #include "qemu/coroutine-tls.h"
+#include "qemu/cutils.h"
 #include "block/aio.h"
 
-/**
- * The minimal batch size is always 64, coroutines from the release_pool are
- * reused as soon as there are 64 coroutines in it. The maximum pool size starts
- * with 64 and is increased on demand so that coroutines are not deleted even if
- * they are not immediately reused.
- */
 enum {
-    POOL_MIN_BATCH_SIZE = 64,
-    POOL_INITIAL_MAX_SIZE = 64,
+    COROUTINE_POOL_BATCH_MAX_SIZE = 128,
 };
 
-/** Free list to speed up creation */
-static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
-static unsigned int pool_max_size = POOL_INITIAL_MAX_SIZE;
-static unsigned int release_pool_size;
+/*
+ * Coroutine creation and deletion is expensive so a pool of unused coroutines
+ * is kept as a cache. When the pool has coroutines available, they are
+ * recycled instead of creating new ones from scratch. Coroutines are added to
+ * the pool upon termination.
+ *
+ * The pool is global but each thread maintains a small local pool to avoid
+ * global pool contention. Threads fetch and return batches of coroutines from
+ * the global pool to maintain their local pool. The local pool holds up to two
+ * batches whereas the maximum size of the global pool is controlled by the
+ * qemu_coroutine_inc_pool_size() API.
+ *
+ * .-----------------------------------.
+ * | Batch 1 | Batch 2 | Batch 3 | ... | global_pool
+ * `-----------------------------------'
+ *
+ * .-------------------.
+ * | Batch 1 | Batch 2 | per-thread local_pool (maximum 2 batches)
+ * `-------------------'
+ */
+typedef struct CoroutinePoolBatch {
+    /* Batches are kept in a list */
+    QSLIST_ENTRY(CoroutinePoolBatch) next;
 
-typedef QSLIST_HEAD(, Coroutine) CoroutineQSList;
-QEMU_DEFINE_STATIC_CO_TLS(CoroutineQSList, alloc_pool);
-QEMU_DEFINE_STATIC_CO_TLS(unsigned int, alloc_pool_size);
-QEMU_DEFINE_STATIC_CO_TLS(Notifier, coroutine_pool_cleanup_notifier);
+    /* This batch holds up to @COROUTINE_POOL_BATCH_MAX_SIZE coroutines */
+    QSLIST_HEAD(, Coroutine) list;
+    unsigned int size;
+} CoroutinePoolBatch;
 
-static void coroutine_pool_cleanup(Notifier *n, void *value)
+typedef QSLIST_HEAD(, CoroutinePoolBatch) CoroutinePool;
+
+/* Host operating system limit on number of pooled coroutines */
+static unsigned int global_pool_hard_max_size;
+
+static QemuMutex global_pool_lock; /* protects the following variables */
+static CoroutinePool global_pool = QSLIST_HEAD_INITIALIZER(global_pool);
+static unsigned int global_pool_size;
+static unsigned int global_pool_max_size = COROUTINE_POOL_BATCH_MAX_SIZE;
+
+QEMU_DEFINE_STATIC_CO_TLS(CoroutinePool, local_pool);
+QEMU_DEFINE_STATIC_CO_TLS(Notifier, local_pool_cleanup_notifier);
+
+static CoroutinePoolBatch *coroutine_pool_batch_new(void)
+{
+    CoroutinePoolBatch *batch = g_new(CoroutinePoolBatch, 1);
+
+    QSLIST_INIT(&batch->list);
+    batch->size = 0;
+    return batch;
+}
+
+static void coroutine_pool_batch_delete(CoroutinePoolBatch *batch)
 {
     Coroutine *co;
     Coroutine *tmp;
-    CoroutineQSList *alloc_pool = get_ptr_alloc_pool();
 
-    QSLIST_FOREACH_SAFE(co, alloc_pool, pool_next, tmp) {
-        QSLIST_REMOVE_HEAD(alloc_pool, pool_next);
+    QSLIST_FOREACH_SAFE(co, &batch->list, pool_next, tmp) {
+        QSLIST_REMOVE_HEAD(&batch->list, pool_next);
         qemu_coroutine_delete(co);
     }
+    g_free(batch);
+}
+
+static void local_pool_cleanup(Notifier *n, void *value)
+{
+    CoroutinePool *local_pool = get_ptr_local_pool();
+    CoroutinePoolBatch *batch;
+    CoroutinePoolBatch *tmp;
+
+    QSLIST_FOREACH_SAFE(batch, local_pool, next, tmp) {
+        QSLIST_REMOVE_HEAD(local_pool, next);
+        coroutine_pool_batch_delete(batch);
+    }
+}
+
+/* Ensure the atexit notifier is registered */
+static void local_pool_cleanup_init_once(void)
+{
+    Notifier *notifier = get_ptr_local_pool_cleanup_notifier();
+    if (!notifier->notify) {
+        notifier->notify = local_pool_cleanup;
+        qemu_thread_atexit_add(notifier);
+    }
+}
+
+/* Helper to get the next unused coroutine from the local pool */
+static Coroutine *coroutine_pool_get_local(void)
+{
+    CoroutinePool *local_pool = get_ptr_local_pool();
+    CoroutinePoolBatch *batch = QSLIST_FIRST(local_pool);
+    Coroutine *co;
+
+    if (unlikely(!batch)) {
+        return NULL;
+    }
+
+    co = QSLIST_FIRST(&batch->list);
+    QSLIST_REMOVE_HEAD(&batch->list, pool_next);
+    batch->size--;
+
+    if (batch->size == 0) {
+        QSLIST_REMOVE_HEAD(local_pool, next);
+        coroutine_pool_batch_delete(batch);
+    }
+    return co;
+}
+
+/* Get the next batch from the global pool */
+static void coroutine_pool_refill_local(void)
+{
+    CoroutinePool *local_pool = get_ptr_local_pool();
+    CoroutinePoolBatch *batch;
+
+    WITH_QEMU_LOCK_GUARD(&global_pool_lock) {
+        batch = QSLIST_FIRST(&global_pool);
+
+        if (batch) {
+            QSLIST_REMOVE_HEAD(&global_pool, next);
+            global_pool_size -= batch->size;
+        }
+    }
+
+    if (batch) {
+        QSLIST_INSERT_HEAD(local_pool, batch, next);
+        local_pool_cleanup_init_once();
+    }
+}
+
+/* Add a batch of coroutines to the global pool */
+static void coroutine_pool_put_global(CoroutinePoolBatch *batch)
+{
+    WITH_QEMU_LOCK_GUARD(&global_pool_lock) {
+        unsigned int max = MIN(global_pool_max_size,
+                               global_pool_hard_max_size);
+
+        if (global_pool_size < max) {
+            QSLIST_INSERT_HEAD(&global_pool, batch, next);
+
+            /* Overshooting the max pool size is allowed */
+            global_pool_size += batch->size;
+            return;
+        }
+    }
+
+    /* The global pool was full, so throw away this batch */
+    coroutine_pool_batch_delete(batch);
+}
+
+/* Get the next unused coroutine from the pool or return NULL */
+static Coroutine *coroutine_pool_get(void)
+{
+    Coroutine *co;
+
+    co = coroutine_pool_get_local();
+    if (!co) {
+        coroutine_pool_refill_local();
+        co = coroutine_pool_get_local();
+    }
+    return co;
+}
+
+static void coroutine_pool_put(Coroutine *co)
+{
+    CoroutinePool *local_pool = get_ptr_local_pool();
+    CoroutinePoolBatch *batch = QSLIST_FIRST(local_pool);
+
+    if (unlikely(!batch)) {
+        batch = coroutine_pool_batch_new();
+        QSLIST_INSERT_HEAD(local_pool, batch, next);
+        local_pool_cleanup_init_once();
+    }
+
+    if (unlikely(batch->size >= COROUTINE_POOL_BATCH_MAX_SIZE)) {
+        CoroutinePoolBatch *next = QSLIST_NEXT(batch, next);
+
+        /* Is the local pool full? */
+        if (next) {
+            QSLIST_REMOVE_HEAD(local_pool, next);
+            coroutine_pool_put_global(batch);
+        }
+
+        batch = coroutine_pool_batch_new();
+        QSLIST_INSERT_HEAD(local_pool, batch, next);
+    }
+
+    QSLIST_INSERT_HEAD(&batch->list, co, pool_next);
+    batch->size++;
 }
 
 Coroutine *qemu_coroutine_create(CoroutineEntry *entry, void *opaque)
@@ -XXX,XX +XXX,XX @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry, void *opaque)
     Coroutine *co = NULL;
 
     if (IS_ENABLED(CONFIG_COROUTINE_POOL)) {
-        CoroutineQSList *alloc_pool = get_ptr_alloc_pool();
-
-        co = QSLIST_FIRST(alloc_pool);
-        if (!co) {
-            if (release_pool_size > POOL_MIN_BATCH_SIZE) {
-                /* Slow path; a good place to register the destructor, too.  */
-                Notifier *notifier = get_ptr_coroutine_pool_cleanup_notifier();
-                if (!notifier->notify) {
-                    notifier->notify = coroutine_pool_cleanup;
-                    qemu_thread_atexit_add(notifier);
-                }
-
-                /* This is not exact; there could be a little skew between
-                 * release_pool_size and the actual size of release_pool.  But
-                 * it is just a heuristic, it does not need to be perfect.
-                 */
-                set_alloc_pool_size(qatomic_xchg(&release_pool_size, 0));
-                QSLIST_MOVE_ATOMIC(alloc_pool, &release_pool);
-                co = QSLIST_FIRST(alloc_pool);
-            }
-        }
-        if (co) {
-            QSLIST_REMOVE_HEAD(alloc_pool, pool_next);
-            set_alloc_pool_size(get_alloc_pool_size() - 1);
-        }
+        co = coroutine_pool_get();
     }
 
     if (!co) {
@@ -XXX,XX +XXX,XX @@ static void coroutine_delete(Coroutine *co)
     co->caller = NULL;
 
     if (IS_ENABLED(CONFIG_COROUTINE_POOL)) {
-        if (release_pool_size < qatomic_read(&pool_max_size) * 2) {
-            QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
-            qatomic_inc(&release_pool_size);
-            return;
-        }
-        if (get_alloc_pool_size() < qatomic_read(&pool_max_size)) {
-            QSLIST_INSERT_HEAD(get_ptr_alloc_pool(), co, pool_next);
-            set_alloc_pool_size(get_alloc_pool_size() + 1);
-            return;
-        }
+        coroutine_pool_put(co);
+    } else {
+        qemu_coroutine_delete(co);
     }
-
-    qemu_coroutine_delete(co);
 }
 
 void qemu_aio_coroutine_enter(AioContext *ctx, Coroutine *co)
@@ -XXX,XX +XXX,XX @@ AioContext *qemu_coroutine_get_aio_context(Coroutine *co)
 
 void qemu_coroutine_inc_pool_size(unsigned int additional_pool_size)
 {
-    qatomic_add(&pool_max_size, additional_pool_size);
+    QEMU_LOCK_GUARD(&global_pool_lock);
+    global_pool_max_size += additional_pool_size;
 }
 
 void qemu_coroutine_dec_pool_size(unsigned int removing_pool_size)
 {
-    qatomic_sub(&pool_max_size, removing_pool_size);
+    QEMU_LOCK_GUARD(&global_pool_lock);
+    global_pool_max_size -= removing_pool_size;
+}
+
+static unsigned int get_global_pool_hard_max_size(void)
+{
+#ifdef __linux__
+    g_autofree char *contents = NULL;
+    int max_map_count;
+
+    /*
+     * Linux processes can have up to max_map_count virtual memory areas
+     * (VMAs). mmap(2), mprotect(2), etc. fail with ENOMEM beyond this limit. We
+     * must limit the coroutine pool to a safe size to avoid running out of
+     * VMAs.
+     */
+    if (g_file_get_contents("/proc/sys/vm/max_map_count", &contents, NULL,
+                            NULL) &&
+        qemu_strtoi(g_strchomp(contents), NULL, 10, &max_map_count) == 0) {
+        /*
+         * This is a conservative upper bound that avoids exceeding
+         * max_map_count. Leave half for non-coroutine users like library
+         * dependencies, vhost-user, etc. Each coroutine takes up 2 VMAs so
+         * halve the amount again.
+         */
+        return max_map_count / 4;
+    }
+#endif
+
+    return UINT_MAX;
+}
+
+static void __attribute__((constructor)) qemu_coroutine_init(void)
+{
+    qemu_mutex_init(&global_pool_lock);
+    global_pool_hard_max_size = get_global_pool_hard_max_size();
 }
-- 
2.44.0
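
The hard limit arithmetic in get_global_pool_hard_max_size() can be sketched
in isolation as follows. This is a minimal illustration using only the C
standard library, not the code from the patch: pool_hard_max_from_text() is a
hypothetical helper that takes the text of /proc/sys/vm/max_map_count as a
string, tolerates the trailing newline that procfs files normally carry, and
applies the same "leave half for other users, then 2 VMAs per coroutine"
division by 4. Like the patch, it falls back to UINT_MAX (no hard limit) when
the value cannot be parsed.

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of the patch): derive the hard pool limit
 * from the textual contents of /proc/sys/vm/max_map_count.
 */
static unsigned int pool_hard_max_from_text(const char *text)
{
    char *end;
    long value;

    errno = 0;
    value = strtol(text, &end, 10);

    /* No digits, overflow, or a value that does not fit in an int */
    if (errno != 0 || end == text || value < 0 || value > INT_MAX) {
        return UINT_MAX;
    }

    /* Accept trailing whitespace such as the newline written by procfs */
    while (isspace((unsigned char)*end)) {
        end++;
    }
    if (*end != '\0') {
        return UINT_MAX;
    }

    /*
     * Half of the VMAs are left for non-coroutine users and each coroutine
     * takes up 2 VMAs, hence the division by 4.
     */
    return (unsigned int)value / 4;
}
```

With the common Linux default of 65530 the helper yields 16382 pooled
coroutines, and unparsable input leaves the pool uncapped.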