[PATCH] coroutine: cap per-thread local pool size

git fetch https://github.com/patchew-project/qemu tags/patchew/20240318183429.1039340-1-stefanha@redhat.com
Maintainers: Stefan Hajnoczi <stefanha@redhat.com>, Kevin Wolf <kwolf@redhat.com>
Posted by Stefan Hajnoczi 1 month, 1 week ago
The coroutine pool implementation can hit the Linux vm.max_map_count
limit, causing QEMU to abort with "failed to allocate memory for stack"
or "failed to set up stack guard page" during coroutine creation.

This happens because per-thread pools can grow to tens of thousands of
coroutines. Each coroutine causes 2 virtual memory areas to be created.
Eventually vm.max_map_count is reached and memory-related syscalls fail.
The per-thread pool sizes are non-uniform and depend on past coroutine
usage in each thread, so it's possible for one thread to have a large
pool while another thread's pool is empty.

Switch to a new coroutine pool implementation with a global pool that
grows to a maximum number of coroutines and per-thread local pools that
are capped at a small, hardcoded number of coroutines.

This approach does not leave large numbers of coroutines pooled in a
thread that may not use them again. To perform well, it amortizes the
cost of global pool accesses by working in batches of coroutines instead
of individual coroutines.

The global pool is a list. Threads donate batches of coroutines to it
when they have too many and take batches from it when they have too few:

.-----------------------------------.
| Batch 1 | Batch 2 | Batch 3 | ... | global_pool
`-----------------------------------'

Each thread has up to 2 batches of coroutines:

.-------------------.
| Batch 1 | Batch 2 | per-thread local_pool (maximum 2 batches)
`-------------------'

The goal of this change is to reduce the excessive number of pooled
coroutines that causes QEMU to abort when vm.max_map_count is reached,
without losing the performance of an adequately sized coroutine pool.
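
As a rough worked example of the new per-thread cap (2 batches per thread,
COROUTINE_POOL_BATCH_MAX_SIZE = 128 from the patch below, 2 VMAs per
coroutine as described above):

  per-thread local pool:  2 * 128  =  256 coroutines max
  e.g. 8 IOThreads:       8 * 256  = 2048 coroutines
  VMAs pinned locally:    2048 * 2 = 4096 VMAs

Everything beyond the local caps sits in the global pool, whose size is
bounded by qemu_coroutine_inc_pool_size() and the max_map_count-derived
hard limit.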

Here are virtio-blk disk I/O benchmark results:

      RW BLKSIZE IODEPTH    OLD    NEW CHANGE
randread      4k       1 113725 117451 +3.3%
randread      4k       8 192968 198510 +2.9%
randread      4k      16 207138 209429 +1.1%
randread      4k      32 212399 215145 +1.3%
randread      4k      64 218319 221277 +1.4%
randread    128k       1  17587  17535 -0.3%
randread    128k       8  17614  17616 +0.0%
randread    128k      16  17608  17609 +0.0%
randread    128k      32  17552  17553 +0.0%
randread    128k      64  17484  17484 +0.0%

See files/{fio.sh,test.xml.j2} for the benchmark configuration:
https://gitlab.com/stefanha/virt-playbooks/-/tree/coroutine-pool-fix-sizing

Buglink: https://issues.redhat.com/browse/RHEL-28947
Reported-by: Sanjay Rao <srao@redhat.com>
Reported-by: Boaz Ben Shabat <bbenshab@redhat.com>
Reported-by: Joe Mario <jmario@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
This patch obsoletes "[PATCH v2] virtio-blk: iothread-vq-mapping
coroutine pool sizing" because the pool size is now global instead of
per thread:
https://lore.kernel.org/qemu-devel/20240312151204.412624-1-stefanha@redhat.com/

Please don't apply "[PATCH v2] virtio-blk: iothread-vq-mapping coroutine
pool sizing".

 util/qemu-coroutine.c | 282 +++++++++++++++++++++++++++++++++---------
 1 file changed, 223 insertions(+), 59 deletions(-)

diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
index 5fd2dbaf8b..2790959eaf 100644
--- a/util/qemu-coroutine.c
+++ b/util/qemu-coroutine.c
@@ -18,39 +18,200 @@
 #include "qemu/atomic.h"
 #include "qemu/coroutine_int.h"
 #include "qemu/coroutine-tls.h"
+#include "qemu/cutils.h"
 #include "block/aio.h"
 
-/**
- * The minimal batch size is always 64, coroutines from the release_pool are
- * reused as soon as there are 64 coroutines in it. The maximum pool size starts
- * with 64 and is increased on demand so that coroutines are not deleted even if
- * they are not immediately reused.
- */
 enum {
-    POOL_MIN_BATCH_SIZE = 64,
-    POOL_INITIAL_MAX_SIZE = 64,
+    COROUTINE_POOL_BATCH_MAX_SIZE = 128,
 };
 
-/** Free list to speed up creation */
-static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
-static unsigned int pool_max_size = POOL_INITIAL_MAX_SIZE;
-static unsigned int release_pool_size;
+/*
+ * Coroutine creation and deletion is expensive so a pool of unused coroutines
+ * is kept as a cache. When the pool has coroutines available, they are
+ * recycled instead of creating new ones from scratch. Coroutines are added to
+ * the pool upon termination.
+ *
+ * The pool is global but each thread maintains a small local pool to avoid
+ * global pool contention. Threads fetch and return batches of coroutines from
+ * the global pool to maintain their local pool. The local pool holds up to two
+ * batches whereas the maximum size of the global pool is controlled by the
+ * qemu_coroutine_inc_pool_size() API.
+ *
+ * .-----------------------------------.
+ * | Batch 1 | Batch 2 | Batch 3 | ... | global_pool
+ * `-----------------------------------'
+ *
+ * .-------------------.
+ * | Batch 1 | Batch 2 | per-thread local_pool (maximum 2 batches)
+ * `-------------------'
+ */
+typedef struct CoroutinePoolBatch {
+    /* Batches are kept in a list */
+    QSLIST_ENTRY(CoroutinePoolBatch) next;
 
-typedef QSLIST_HEAD(, Coroutine) CoroutineQSList;
-QEMU_DEFINE_STATIC_CO_TLS(CoroutineQSList, alloc_pool);
-QEMU_DEFINE_STATIC_CO_TLS(unsigned int, alloc_pool_size);
-QEMU_DEFINE_STATIC_CO_TLS(Notifier, coroutine_pool_cleanup_notifier);
+    /* This batch holds up to @COROUTINE_POOL_BATCH_MAX_SIZE coroutines */
+    QSLIST_HEAD(, Coroutine) list;
+    unsigned int size;
+} CoroutinePoolBatch;
 
-static void coroutine_pool_cleanup(Notifier *n, void *value)
+typedef QSLIST_HEAD(, CoroutinePoolBatch) CoroutinePool;
+
+/* Host operating system limit on number of pooled coroutines */
+static unsigned int global_pool_hard_max_size;
+
+static QemuMutex global_pool_lock; /* protects the following variables */
+static CoroutinePool global_pool = QSLIST_HEAD_INITIALIZER(global_pool);
+static unsigned int global_pool_size;
+static unsigned int global_pool_max_size = COROUTINE_POOL_BATCH_MAX_SIZE;
+
+QEMU_DEFINE_STATIC_CO_TLS(CoroutinePool, local_pool);
+QEMU_DEFINE_STATIC_CO_TLS(Notifier, local_pool_cleanup_notifier);
+
+static CoroutinePoolBatch *coroutine_pool_batch_new(void)
+{
+    CoroutinePoolBatch *batch = g_new(CoroutinePoolBatch, 1);
+
+    QSLIST_INIT(&batch->list);
+    batch->size = 0;
+    return batch;
+}
+
+static void coroutine_pool_batch_delete(CoroutinePoolBatch *batch)
 {
     Coroutine *co;
     Coroutine *tmp;
-    CoroutineQSList *alloc_pool = get_ptr_alloc_pool();
 
-    QSLIST_FOREACH_SAFE(co, alloc_pool, pool_next, tmp) {
-        QSLIST_REMOVE_HEAD(alloc_pool, pool_next);
+    QSLIST_FOREACH_SAFE(co, &batch->list, pool_next, tmp) {
+        QSLIST_REMOVE_HEAD(&batch->list, pool_next);
         qemu_coroutine_delete(co);
     }
+    g_free(batch);
+}
+
+static void local_pool_cleanup(Notifier *n, void *value)
+{
+    CoroutinePool *local_pool = get_ptr_local_pool();
+    CoroutinePoolBatch *batch;
+    CoroutinePoolBatch *tmp;
+
+    QSLIST_FOREACH_SAFE(batch, local_pool, next, tmp) {
+        QSLIST_REMOVE_HEAD(local_pool, next);
+        coroutine_pool_batch_delete(batch);
+    }
+}
+
+/* Ensure the atexit notifier is registered */
+static void local_pool_cleanup_init_once(void)
+{
+    Notifier *notifier = get_ptr_local_pool_cleanup_notifier();
+    if (!notifier->notify) {
+        notifier->notify = local_pool_cleanup;
+        qemu_thread_atexit_add(notifier);
+    }
+}
+
+/* Helper to get the next unused coroutine from the local pool */
+static Coroutine *coroutine_pool_get_local(void)
+{
+    CoroutinePool *local_pool = get_ptr_local_pool();
+    CoroutinePoolBatch *batch = QSLIST_FIRST(local_pool);
+    Coroutine *co;
+
+    if (unlikely(!batch)) {
+        return NULL;
+    }
+
+    co = QSLIST_FIRST(&batch->list);
+    QSLIST_REMOVE_HEAD(&batch->list, pool_next);
+    batch->size--;
+
+    if (batch->size == 0) {
+        QSLIST_REMOVE_HEAD(local_pool, next);
+        coroutine_pool_batch_delete(batch);
+    }
+    return co;
+}
+
+/* Get the next batch from the global pool */
+static void coroutine_pool_refill_local(void)
+{
+    CoroutinePool *local_pool = get_ptr_local_pool();
+    CoroutinePoolBatch *batch;
+
+    WITH_QEMU_LOCK_GUARD(&global_pool_lock) {
+        batch = QSLIST_FIRST(&global_pool);
+
+        if (batch) {
+            QSLIST_REMOVE_HEAD(&global_pool, next);
+            global_pool_size -= batch->size;
+        }
+    }
+
+    if (batch) {
+        QSLIST_INSERT_HEAD(local_pool, batch, next);
+        local_pool_cleanup_init_once();
+    }
+}
+
+/* Add a batch of coroutines to the global pool */
+static void coroutine_pool_put_global(CoroutinePoolBatch *batch)
+{
+    WITH_QEMU_LOCK_GUARD(&global_pool_lock) {
+        unsigned int max = MIN(global_pool_max_size,
+                               global_pool_hard_max_size);
+
+        if (global_pool_size < max) {
+            QSLIST_INSERT_HEAD(&global_pool, batch, next);
+
+            /* Overshooting the max pool size is allowed */
+            global_pool_size += batch->size;
+            return;
+        }
+    }
+
+    /* The global pool was full, so throw away this batch */
+    coroutine_pool_batch_delete(batch);
+}
+
+/* Get the next unused coroutine from the pool or return NULL */
+static Coroutine *coroutine_pool_get(void)
+{
+    Coroutine *co;
+
+    co = coroutine_pool_get_local();
+    if (!co) {
+        coroutine_pool_refill_local();
+        co = coroutine_pool_get_local();
+    }
+    return co;
+}
+
+static void coroutine_pool_put(Coroutine *co)
+{
+    CoroutinePool *local_pool = get_ptr_local_pool();
+    CoroutinePoolBatch *batch = QSLIST_FIRST(local_pool);
+
+    if (unlikely(!batch)) {
+        batch = coroutine_pool_batch_new();
+        QSLIST_INSERT_HEAD(local_pool, batch, next);
+        local_pool_cleanup_init_once();
+    }
+
+    if (unlikely(batch->size >= COROUTINE_POOL_BATCH_MAX_SIZE)) {
+        CoroutinePoolBatch *next = QSLIST_NEXT(batch, next);
+
+        /* Is the local pool full? */
+        if (next) {
+            QSLIST_REMOVE_HEAD(local_pool, next);
+            coroutine_pool_put_global(batch);
+        }
+
+        batch = coroutine_pool_batch_new();
+        QSLIST_INSERT_HEAD(local_pool, batch, next);
+    }
+
+    QSLIST_INSERT_HEAD(&batch->list, co, pool_next);
+    batch->size++;
 }
 
 Coroutine *qemu_coroutine_create(CoroutineEntry *entry, void *opaque)
@@ -58,31 +219,7 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry, void *opaque)
     Coroutine *co = NULL;
 
     if (IS_ENABLED(CONFIG_COROUTINE_POOL)) {
-        CoroutineQSList *alloc_pool = get_ptr_alloc_pool();
-
-        co = QSLIST_FIRST(alloc_pool);
-        if (!co) {
-            if (release_pool_size > POOL_MIN_BATCH_SIZE) {
-                /* Slow path; a good place to register the destructor, too.  */
-                Notifier *notifier = get_ptr_coroutine_pool_cleanup_notifier();
-                if (!notifier->notify) {
-                    notifier->notify = coroutine_pool_cleanup;
-                    qemu_thread_atexit_add(notifier);
-                }
-
-                /* This is not exact; there could be a little skew between
-                 * release_pool_size and the actual size of release_pool.  But
-                 * it is just a heuristic, it does not need to be perfect.
-                 */
-                set_alloc_pool_size(qatomic_xchg(&release_pool_size, 0));
-                QSLIST_MOVE_ATOMIC(alloc_pool, &release_pool);
-                co = QSLIST_FIRST(alloc_pool);
-            }
-        }
-        if (co) {
-            QSLIST_REMOVE_HEAD(alloc_pool, pool_next);
-            set_alloc_pool_size(get_alloc_pool_size() - 1);
-        }
+        co = coroutine_pool_get();
     }
 
     if (!co) {
@@ -100,19 +237,10 @@ static void coroutine_delete(Coroutine *co)
     co->caller = NULL;
 
     if (IS_ENABLED(CONFIG_COROUTINE_POOL)) {
-        if (release_pool_size < qatomic_read(&pool_max_size) * 2) {
-            QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
-            qatomic_inc(&release_pool_size);
-            return;
-        }
-        if (get_alloc_pool_size() < qatomic_read(&pool_max_size)) {
-            QSLIST_INSERT_HEAD(get_ptr_alloc_pool(), co, pool_next);
-            set_alloc_pool_size(get_alloc_pool_size() + 1);
-            return;
-        }
+        coroutine_pool_put(co);
+    } else {
+        qemu_coroutine_delete(co);
     }
-
-    qemu_coroutine_delete(co);
 }
 
 void qemu_aio_coroutine_enter(AioContext *ctx, Coroutine *co)
@@ -223,10 +351,46 @@ AioContext *qemu_coroutine_get_aio_context(Coroutine *co)
 
 void qemu_coroutine_inc_pool_size(unsigned int additional_pool_size)
 {
-    qatomic_add(&pool_max_size, additional_pool_size);
+    QEMU_LOCK_GUARD(&global_pool_lock);
+    global_pool_max_size += additional_pool_size;
 }
 
 void qemu_coroutine_dec_pool_size(unsigned int removing_pool_size)
 {
-    qatomic_sub(&pool_max_size, removing_pool_size);
+    QEMU_LOCK_GUARD(&global_pool_lock);
+    global_pool_max_size -= removing_pool_size;
+}
+
+static unsigned int get_global_pool_hard_max_size(void)
+{
+#ifdef __linux__
+    g_autofree char *contents = NULL;
+    int max_map_count;
+
+    /*
+     * Linux processes can have up to max_map_count virtual memory areas
+     * (VMAs). mmap(2), mprotect(2), etc fail with ENOMEM beyond this limit. We
+     * must limit the coroutine pool to a safe size to avoid running out of
+     * VMAs.
+     */
+    if (g_file_get_contents("/proc/sys/vm/max_map_count", &contents, NULL,
+                            NULL) &&
+        qemu_strtoi(contents, NULL, 10, &max_map_count) == 0) {
+        /*
+         * This is a conservative upper bound that avoids exceeding
+         * max_map_count. Leave half for non-coroutine users like library
+         * dependencies, vhost-user, etc. Each coroutine takes up 2 VMAs so
+         * halve the amount again.
+         */
+        return max_map_count / 4;
+    }
+#endif
+
+    return UINT_MAX;
+}
+
+static void __attribute__((constructor)) qemu_coroutine_init(void)
+{
+    qemu_mutex_init(&global_pool_lock);
+    global_pool_hard_max_size = get_global_pool_hard_max_size();
 }
-- 
2.44.0
Re: [PATCH] coroutine: cap per-thread local pool size
Posted by Daniel P. Berrangé 1 month, 1 week ago
On Mon, Mar 18, 2024 at 02:34:29PM -0400, Stefan Hajnoczi wrote:
> The coroutine pool implementation can hit the Linux vm.max_map_count
> limit, causing QEMU to abort with "failed to allocate memory for stack"
> or "failed to set up stack guard page" during coroutine creation.
> 
> This happens because per-thread pools can grow to tens of thousands of
> coroutines. Each coroutine causes 2 virtual memory areas to be created.

This sounds quite alarming. What usage scenario is justified in
creating so many coroutines?

IIUC, coroutine stack size is 1 MB, and so tens of thousands of
coroutines implies 10's of GB of memory just on stacks alone.

> Eventually vm.max_map_count is reached and memory-related syscalls fail.

On my system max_map_count is 1048576, quite a lot higher than
tens of thousands. Hitting that would imply ~500,000 coroutines and
~500 GB of stacks!

> diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
> index 5fd2dbaf8b..2790959eaf 100644
> --- a/util/qemu-coroutine.c
> +++ b/util/qemu-coroutine.c

> +static unsigned int get_global_pool_hard_max_size(void)
> +{
> +#ifdef __linux__
> +    g_autofree char *contents = NULL;
> +    int max_map_count;
> +
> +    /*
> +     * Linux processes can have up to max_map_count virtual memory areas
> +     * (VMAs). mmap(2), mprotect(2), etc fail with ENOMEM beyond this limit. We
> +     * must limit the coroutine pool to a safe size to avoid running out of
> +     * VMAs.
> +     */
> +    if (g_file_get_contents("/proc/sys/vm/max_map_count", &contents, NULL,
> +                            NULL) &&
> +        qemu_strtoi(contents, NULL, 10, &max_map_count) == 0) {
> +        /*
> +         * This is a conservative upper bound that avoids exceeding
> +         * max_map_count. Leave half for non-coroutine users like library
> +         * dependencies, vhost-user, etc. Each coroutine takes up 2 VMAs so
> +         * halve the amount again.
> +         */
> +        return max_map_count / 4;

That's 256,000 coroutines, which still sounds incredibly large
to me.

> +    }
> +#endif
> +
> +    return UINT_MAX;

Why UINT_MAX as a default? If we can't read procfs, we should
assume some much smaller sane default IMHO, one that corresponds to
the current Linux default max_map_count.

> +}
> +
> +static void __attribute__((constructor)) qemu_coroutine_init(void)
> +{
> +    qemu_mutex_init(&global_pool_lock);
> +    global_pool_hard_max_size = get_global_pool_hard_max_size();
>  }
> -- 
> 2.44.0
> 
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Re: [PATCH] coroutine: cap per-thread local pool size
Posted by Stefan Hajnoczi 1 month, 1 week ago
On Tue, Mar 19, 2024 at 01:43:32PM +0000, Daniel P. Berrangé wrote:
> On Mon, Mar 18, 2024 at 02:34:29PM -0400, Stefan Hajnoczi wrote:
> > The coroutine pool implementation can hit the Linux vm.max_map_count
> > limit, causing QEMU to abort with "failed to allocate memory for stack"
> > or "failed to set up stack guard page" during coroutine creation.
> > 
> > This happens because per-thread pools can grow to tens of thousands of
> > coroutines. Each coroutine causes 2 virtual memory areas to be created.
> 
> This sounds quite alarming. What usage scenario is justified in
> creating so many coroutines ?

The coroutine pool hides creation and deletion latency. The pool
initially has a modest size of 64, but each virtio-blk device increases
the pool size by num_queues * queue_size (256) / 2.

The issue pops up with large SMP guests (i.e. large num_queues) with
multiple virtio-blk devices.
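
As a minimal sketch of that accounting (illustrative only: the function and
parameter names below are made up, but qemu_coroutine_inc_pool_size() and
qemu_coroutine_dec_pool_size() are the real APIs touched by this patch):

    /* assumes "qemu/coroutine.h" for the pool-size APIs */

    /* Hypothetical device realize: reserve pool space for every request
     * that could be in flight across all of its virtqueues. */
    static void my_dev_realize(unsigned int num_queues, unsigned int queue_size)
    {
        /* e.g. 64 queues * 256 entries / 2 = 8192 extra pooled coroutines */
        qemu_coroutine_inc_pool_size(num_queues * queue_size / 2);
    }

    /* Hypothetical unrealize: drop the reservation again. */
    static void my_dev_unrealize(unsigned int num_queues, unsigned int queue_size)
    {
        qemu_coroutine_dec_pool_size(num_queues * queue_size / 2);
    }

With the old per-thread algorithm that reservation effectively applied to
every thread's pool; with this patch it applies once, to the global pool.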

> IIUC, coroutine stack size is 1 MB, and so tens of thousands of
> coroutines implies 10's of GB of memory just on stacks alone.
> 
> > Eventually vm.max_map_count is reached and memory-related syscalls fail.
> 
> On my system max_map_count is 1048576, quite alot higher than
> 10's of 1000's. Hitting that would imply ~500,000 coroutines and
> ~500 GB of stacks !

Fedora recently increased the limit to 1048576. Before that it was
65k-ish and still is on most other distros.

Regarding why QEMU might have 65k coroutines pooled, it's because the
existing coroutine pool algorithm is per thread. So if the max pool size
is 15k but you have 4 IOThreads then up to 4 x 15k total coroutines can
be sitting in pools. This patch addresses this by setting a small fixed
size on per thread pools (256).
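
Spelling that out against the common default limit (max_map_count of 65530,
each coroutine needing a stack VMA plus a guard page VMA):

  old:  4 IOThreads * 15k pooled coroutines * 2 VMAs = 120k VMAs  (> 65530)
  new:  4 IOThreads * 256 pooled coroutines * 2 VMAs = 2048 VMAs

plus whatever the global pool holds, which is now bounded explicitly.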

> 
> > diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
> > index 5fd2dbaf8b..2790959eaf 100644
> > --- a/util/qemu-coroutine.c
> > +++ b/util/qemu-coroutine.c
> 
> > +static unsigned int get_global_pool_hard_max_size(void)
> > +{
> > +#ifdef __linux__
> > +    g_autofree char *contents = NULL;
> > +    int max_map_count;
> > +
> > +    /*
> > +     * Linux processes can have up to max_map_count virtual memory areas
> > +     * (VMAs). mmap(2), mprotect(2), etc fail with ENOMEM beyond this limit. We
> > +     * must limit the coroutine pool to a safe size to avoid running out of
> > +     * VMAs.
> > +     */
> > +    if (g_file_get_contents("/proc/sys/vm/max_map_count", &contents, NULL,
> > +                            NULL) &&
> > +        qemu_strtoi(contents, NULL, 10, &max_map_count) == 0) {
> > +        /*
> > +         * This is a conservative upper bound that avoids exceeding
> > +         * max_map_count. Leave half for non-coroutine users like library
> > +         * dependencies, vhost-user, etc. Each coroutine takes up 2 VMAs so
> > +         * halve the amount again.
> > +         */
> > +        return max_map_count / 4;
> 
> That's 256,000 coroutines, which still sounds incredibly large
> to me.

Any ideas for tweaking this heuristic?

> 
> > +    }
> > +#endif
> > +
> > +    return UINT_MAX;
> 
> Why UINT_MAX as a default ?  If we can't read procfs, we should
> assume some much smaller sane default IMHO, that corresponds to
> what current linux default max_map_count would be.

This line is not Linux-specific. I don't know if other OSes have an
equivalent to max_map_count.

I agree with defaulting to 64k-ish on Linux.

Stefan

> 
> > +}
> > +
> > +static void __attribute__((constructor)) qemu_coroutine_init(void)
> > +{
> > +    qemu_mutex_init(&global_pool_lock);
> > +    global_pool_hard_max_size = get_global_pool_hard_max_size();
> >  }
> > -- 
> > 2.44.0
> > 
> > 
> 
> With regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> 
Re: [PATCH] coroutine: cap per-thread local pool size
Posted by Daniel P. Berrangé 1 month, 1 week ago
On Tue, Mar 19, 2024 at 01:55:10PM -0400, Stefan Hajnoczi wrote:
> On Tue, Mar 19, 2024 at 01:43:32PM +0000, Daniel P. Berrangé wrote:
> > On Mon, Mar 18, 2024 at 02:34:29PM -0400, Stefan Hajnoczi wrote:
> > > diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
> > > index 5fd2dbaf8b..2790959eaf 100644
> > > --- a/util/qemu-coroutine.c
> > > +++ b/util/qemu-coroutine.c
> > 
> > > +static unsigned int get_global_pool_hard_max_size(void)
> > > +{
> > > +#ifdef __linux__
> > > +    g_autofree char *contents = NULL;
> > > +    int max_map_count;
> > > +
> > > +    /*
> > > +     * Linux processes can have up to max_map_count virtual memory areas
> > > +     * (VMAs). mmap(2), mprotect(2), etc fail with ENOMEM beyond this limit. We
> > > +     * must limit the coroutine pool to a safe size to avoid running out of
> > > +     * VMAs.
> > > +     */
> > > +    if (g_file_get_contents("/proc/sys/vm/max_map_count", &contents, NULL,
> > > +                            NULL) &&
> > > +        qemu_strtoi(contents, NULL, 10, &max_map_count) == 0) {
> > > +        /*
> > > +         * This is a conservative upper bound that avoids exceeding
> > > +         * max_map_count. Leave half for non-coroutine users like library
> > > +         * dependencies, vhost-user, etc. Each coroutine takes up 2 VMAs so
> > > +         * halve the amount again.

Leaving half for loaded libraries, etc. is quite conservative
if max_map_count is the small-ish 64k default.

That reservation could perhaps be a fixed number like 5,000?

> > > +         */
> > > +        return max_map_count / 4;
> > 
> > That's 256,000 coroutines, which still sounds incredibly large
> > to me.
> 
> Any ideas for tweaking this heuristic?

The awkward thing about this limit is that it's hardcoded, and
since it is indeed a "heuristic", we know it is going to be
sub-optimal for some use cases / scenarios.

The worst case upper limit is

   num virtio-blk * num threads * num queues

Reducing the number of devices isn't practical if the guest
genuinely needs that many volumes.

Reducing the threads or queues artificially limits the peak
performance of a single disk handled in isolation, while
other disks are idle, so that's not desirable.

So there's no way to cap the worst case scenario, while
still maximising the single disk performance possibilities.

With large VMs with many CPUs and many disks, it could be
reasonable to not expect a real guest to need to maximise
I/O on every disk at the same time, and thus want to put
some cap there to control worst case resource usage.

It feels like it leans towards being able to control the
coroutine pool limit explicitly, as a CLI option, to override
this default heuristic.

> > > +    }
> > > +#endif
> > > +
> > > +    return UINT_MAX;
> > 
> > Why UINT_MAX as a default ?  If we can't read procfs, we should
> > assume some much smaller sane default IMHO, that corresponds to
> > what current linux default max_map_count would be.
> 
> This line is not Linux-specific. I don't know if other OSes have an
> equivalent to max_map_count.
> 
> I agree with defaulting to 64k-ish on Linux.



With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


Re: [PATCH] coroutine: cap per-thread local pool size
Posted by Stefan Hajnoczi 1 month, 1 week ago
On Tue, Mar 19, 2024 at 08:10:49PM +0000, Daniel P. Berrangé wrote:
> On Tue, Mar 19, 2024 at 01:55:10PM -0400, Stefan Hajnoczi wrote:
> > On Tue, Mar 19, 2024 at 01:43:32PM +0000, Daniel P. Berrangé wrote:
> > > On Mon, Mar 18, 2024 at 02:34:29PM -0400, Stefan Hajnoczi wrote:
> > > > diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
> > > > index 5fd2dbaf8b..2790959eaf 100644
> > > > --- a/util/qemu-coroutine.c
> > > > +++ b/util/qemu-coroutine.c
> > > 
> > > > +static unsigned int get_global_pool_hard_max_size(void)
> > > > +{
> > > > +#ifdef __linux__
> > > > +    g_autofree char *contents = NULL;
> > > > +    int max_map_count;
> > > > +
> > > > +    /*
> > > > +     * Linux processes can have up to max_map_count virtual memory areas
> > > > +     * (VMAs). mmap(2), mprotect(2), etc fail with ENOMEM beyond this limit. We
> > > > +     * must limit the coroutine pool to a safe size to avoid running out of
> > > > +     * VMAs.
> > > > +     */
> > > > +    if (g_file_get_contents("/proc/sys/vm/max_map_count", &contents, NULL,
> > > > +                            NULL) &&
> > > > +        qemu_strtoi(contents, NULL, 10, &max_map_count) == 0) {
> > > > +        /*
> > > > +         * This is a conservative upper bound that avoids exceeding
> > > > +         * max_map_count. Leave half for non-coroutine users like library
> > > > +         * dependencies, vhost-user, etc. Each coroutine takes up 2 VMAs so
> > > > +         * halve the amount again.
> 
> Leaving half for loaded libraries, etc is quite conservative
> if max_map_count is the small-ish 64k default.
> 
> That reservation could perhaps a fixed number like 5,000 ?

While I don't want QEMU to abort, once this heuristic is in the code it
will be scary to make it more optimistic and we may never change it. So
now is the best time to try 5,000.

I'll send a follow-up patch that reserves 5,000 mappings. If that turns
out to be too optimistic we can increase the reservation.

> > > > +         */
> > > > +        return max_map_count / 4;
> > > 
> > > That's 256,000 coroutines, which still sounds incredibly large
> > > to me.
> > 
> > Any ideas for tweaking this heuristic?
> 
> The awkward thing about this limit is that its hardcoded, and
> since it is indeed a "heuristic", we know it is going to be
> sub-optimal for some use cases / scenarios.
> 
> The worst case upper limit is
> 
>    num virtio-blk * num threads * num queues
> 
> Reducing the number of devices isn't practical if the guest
> genuinely needs that many volumes.
> 
> Reducing the threads or queues artificially limits the peak
> performance of a single disk handling in isolation, while
> other disks are idle, so that's not desirable.
> 
> So there's no way to cap the worst case scenario, while
> still maximising the single disk performance possibilities.
> 
> With large VMs with many CPUs and many disks, it could be
> reasonable to not expect a real guest to need to maximise
> I/O on every disk at the same time, and thus want to put
> some cap there to control worst case resource usage.
> 
> It feels like it leans towards being able to control the
> coroutine pool limit explicitly, as a CLI option, to override
> this default hueristic.
> 
> > > > +    }
> > > > +#endif
> > > > +
> > > > +    return UINT_MAX;
> > > 
> > > Why UINT_MAX as a default ?  If we can't read procfs, we should
> > > assume some much smaller sane default IMHO, that corresponds to
> > > what current linux default max_map_count would be.
> > 
> > This line is not Linux-specific. I don't know if other OSes have an
> > equivalent to max_map_count.
> > 
> > I agree with defaulting to 64k-ish on Linux.
> 
> 
> 
> With regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> 
Re: [PATCH] coroutine: cap per-thread local pool size
Posted by Daniel P. Berrangé 1 month, 1 week ago
On Wed, Mar 20, 2024 at 09:35:39AM -0400, Stefan Hajnoczi wrote:
> On Tue, Mar 19, 2024 at 08:10:49PM +0000, Daniel P. Berrangé wrote:
> > On Tue, Mar 19, 2024 at 01:55:10PM -0400, Stefan Hajnoczi wrote:
> > > On Tue, Mar 19, 2024 at 01:43:32PM +0000, Daniel P. Berrangé wrote:
> > > > On Mon, Mar 18, 2024 at 02:34:29PM -0400, Stefan Hajnoczi wrote:
> > > > > diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
> > > > > index 5fd2dbaf8b..2790959eaf 100644
> > > > > --- a/util/qemu-coroutine.c
> > > > > +++ b/util/qemu-coroutine.c
> > > > 
> > > > > +static unsigned int get_global_pool_hard_max_size(void)
> > > > > +{
> > > > > +#ifdef __linux__
> > > > > +    g_autofree char *contents = NULL;
> > > > > +    int max_map_count;
> > > > > +
> > > > > +    /*
> > > > > +     * Linux processes can have up to max_map_count virtual memory areas
> > > > > +     * (VMAs). mmap(2), mprotect(2), etc fail with ENOMEM beyond this limit. We
> > > > > +     * must limit the coroutine pool to a safe size to avoid running out of
> > > > > +     * VMAs.
> > > > > +     */
> > > > > +    if (g_file_get_contents("/proc/sys/vm/max_map_count", &contents, NULL,
> > > > > +                            NULL) &&
> > > > > +        qemu_strtoi(contents, NULL, 10, &max_map_count) == 0) {
> > > > > +        /*
> > > > > +         * This is a conservative upper bound that avoids exceeding
> > > > > +         * max_map_count. Leave half for non-coroutine users like library
> > > > > +         * dependencies, vhost-user, etc. Each coroutine takes up 2 VMAs so
> > > > > +         * halve the amount again.
> > 
> > Leaving half for loaded libraries, etc is quite conservative
> > if max_map_count is the small-ish 64k default.
> > 
> > That reservation could perhaps a fixed number like 5,000 ?
> 
> While I don't want QEMU to abort, once this heuristic is in the code it
> will be scary to make it more optimistic and we may never change it. So
> now is the best time to try 5,000.
> 
> I'll send a follow-up patch that reserves 5,000 mappings. If that turns
> out to be too optimistic we can increase the reservation.

BTW, I suggested 5,000, because I looked at a few QEMU processes I have
running on Fedora and saw just under 1,000 lines in /proc/$PID/maps,
of which only a subset is library mappings. So multiplying that x5 felt
like a fairly generous overhead for more complex build configurations.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


Re: [PATCH] coroutine: cap per-thread local pool size
Posted by Kevin Wolf 1 month, 1 week ago
Am 20.03.2024 um 15:09 hat Daniel P. Berrangé geschrieben:
> On Wed, Mar 20, 2024 at 09:35:39AM -0400, Stefan Hajnoczi wrote:
> > On Tue, Mar 19, 2024 at 08:10:49PM +0000, Daniel P. Berrangé wrote:
> > > On Tue, Mar 19, 2024 at 01:55:10PM -0400, Stefan Hajnoczi wrote:
> > > > On Tue, Mar 19, 2024 at 01:43:32PM +0000, Daniel P. Berrangé wrote:
> > > > > On Mon, Mar 18, 2024 at 02:34:29PM -0400, Stefan Hajnoczi wrote:
> > > > > > diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
> > > > > > index 5fd2dbaf8b..2790959eaf 100644
> > > > > > --- a/util/qemu-coroutine.c
> > > > > > +++ b/util/qemu-coroutine.c
> > > > > 
> > > > > > +static unsigned int get_global_pool_hard_max_size(void)
> > > > > > +{
> > > > > > +#ifdef __linux__
> > > > > > +    g_autofree char *contents = NULL;
> > > > > > +    int max_map_count;
> > > > > > +
> > > > > > +    /*
> > > > > > +     * Linux processes can have up to max_map_count virtual memory areas
> > > > > > +     * (VMAs). mmap(2), mprotect(2), etc fail with ENOMEM beyond this limit. We
> > > > > > +     * must limit the coroutine pool to a safe size to avoid running out of
> > > > > > +     * VMAs.
> > > > > > +     */
> > > > > > +    if (g_file_get_contents("/proc/sys/vm/max_map_count", &contents, NULL,
> > > > > > +                            NULL) &&
> > > > > > +        qemu_strtoi(contents, NULL, 10, &max_map_count) == 0) {
> > > > > > +        /*
> > > > > > +         * This is a conservative upper bound that avoids exceeding
> > > > > > +         * max_map_count. Leave half for non-coroutine users like library
> > > > > > +         * dependencies, vhost-user, etc. Each coroutine takes up 2 VMAs so
> > > > > > +         * halve the amount again.
> > > 
> > > Leaving half for loaded libraries, etc is quite conservative
> > > if max_map_count is the small-ish 64k default.
> > > 
> > > That reservation could perhaps a fixed number like 5,000 ?
> > 
> > While I don't want QEMU to abort, once this heuristic is in the code it
> > will be scary to make it more optimistic and we may never change it. So
> > now is the best time to try 5,000.
> > 
> > I'll send a follow-up patch that reserves 5,000 mappings. If that turns
> > out to be too optimistic we can increase the reservation.
> 
> BTW, I suggested 5,000, because I looked at a few QEM processes I have
> running on Fedora and saw just under 1,000 lines in /proc/$PID/maps,
> of which only a subset is library mappings. So multiplying that x5 felt
> like a fairly generous overhead for more complex build configurations.

On my system, the boring desktop VM with no special hardware or other
advanced configuration takes ~1500 mappings, most of which are
libraries. I'm not concerned about the library mappings, it's unlikely
that we'll double the number of libraries soon.

But I'm not sure about dynamic mappings outside of coroutines, maybe
when enabling features my simple desktop VM doesn't even use at all. If
we're sure that nothing else uses any number worth mentioning, fine with
me. But I couldn't tell.

Staying in the area we know reasonably well, how many libblkio bounce
buffers could be in use at the same time? I think each one is an
individual mmap(), right?

Kevin
Re: [PATCH] coroutine: cap per-thread local pool size
Posted by Stefan Hajnoczi 1 month, 1 week ago
On Thu, 21 Mar 2024 at 08:22, Kevin Wolf <kwolf@redhat.com> wrote:
>
> Am 20.03.2024 um 15:09 hat Daniel P. Berrangé geschrieben:
> > On Wed, Mar 20, 2024 at 09:35:39AM -0400, Stefan Hajnoczi wrote:
> > > On Tue, Mar 19, 2024 at 08:10:49PM +0000, Daniel P. Berrangé wrote:
> > > > On Tue, Mar 19, 2024 at 01:55:10PM -0400, Stefan Hajnoczi wrote:
> > > > > On Tue, Mar 19, 2024 at 01:43:32PM +0000, Daniel P. Berrangé wrote:
> > > > > > On Mon, Mar 18, 2024 at 02:34:29PM -0400, Stefan Hajnoczi wrote:
> > > > > > > diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
> > > > > > > index 5fd2dbaf8b..2790959eaf 100644
> > > > > > > --- a/util/qemu-coroutine.c
> > > > > > > +++ b/util/qemu-coroutine.c
> > > > > >
> > > > > > > +static unsigned int get_global_pool_hard_max_size(void)
> > > > > > > +{
> > > > > > > +#ifdef __linux__
> > > > > > > +    g_autofree char *contents = NULL;
> > > > > > > +    int max_map_count;
> > > > > > > +
> > > > > > > +    /*
> > > > > > > +     * Linux processes can have up to max_map_count virtual memory areas
> > > > > > > +     * (VMAs). mmap(2), mprotect(2), etc fail with ENOMEM beyond this limit. We
> > > > > > > +     * must limit the coroutine pool to a safe size to avoid running out of
> > > > > > > +     * VMAs.
> > > > > > > +     */
> > > > > > > +    if (g_file_get_contents("/proc/sys/vm/max_map_count", &contents, NULL,
> > > > > > > +                            NULL) &&
> > > > > > > +        qemu_strtoi(contents, NULL, 10, &max_map_count) == 0) {
> > > > > > > +        /*
> > > > > > > +         * This is a conservative upper bound that avoids exceeding
> > > > > > > +         * max_map_count. Leave half for non-coroutine users like library
> > > > > > > +         * dependencies, vhost-user, etc. Each coroutine takes up 2 VMAs so
> > > > > > > +         * halve the amount again.
> > > >
> > > > Leaving half for loaded libraries, etc is quite conservative
> > > > if max_map_count is the small-ish 64k default.
> > > >
> > > > That reservation could perhaps a fixed number like 5,000 ?
> > >
> > > While I don't want QEMU to abort, once this heuristic is in the code it
> > > will be scary to make it more optimistic and we may never change it. So
> > > now is the best time to try 5,000.
> > >
> > > I'll send a follow-up patch that reserves 5,000 mappings. If that turns
> > > out to be too optimistic we can increase the reservation.
> >
> > BTW, I suggested 5,000, because I looked at a few QEM processes I have
> > running on Fedora and saw just under 1,000 lines in /proc/$PID/maps,
> > of which only a subset is library mappings. So multiplying that x5 felt
> > like a fairly generous overhead for more complex build configurations.
>
> On my system, the boring desktop VM with no special hardware or other
> advanced configuration takes ~1500 mappings, most of which are
> libraries. I'm not concerned about the library mappings, it's unlikely
> that we'll double the number of libraries soon.
>
> But I'm not sure about dynamic mappings outside of coroutines, maybe
> when enabling features my simple desktop VM doesn't even use at all. If
> we're sure that nothing else uses any number worth mentioning, fine with
> me. But I couldn't tell.
>
> Staying the area we know reasonably well, how many libblkio bounce
> buffers could be in use at the same time? I think each one is an
> individual mmap(), right?

libblkio's mapping requirements are similar to vhost-user. There is
one general-purpose bounce buffer mapping plus a mapping for each QEMU
RAMBlock. That means the number of in-flight I/Os does not directly
influence the number of mappings.
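
(i.e. roughly 1 + the number of RAMBlocks per libblkio device, so the
mapping count scales with the guest memory layout rather than with queue
depth)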

Stefan
Re: [PATCH] coroutine: cap per-thread local pool size
Posted by Kevin Wolf 1 month, 1 week ago
Am 19.03.2024 um 14:43 hat Daniel P. Berrangé geschrieben:
> On Mon, Mar 18, 2024 at 02:34:29PM -0400, Stefan Hajnoczi wrote:
> > The coroutine pool implementation can hit the Linux vm.max_map_count
> > limit, causing QEMU to abort with "failed to allocate memory for stack"
> > or "failed to set up stack guard page" during coroutine creation.
> > 
> > This happens because per-thread pools can grow to tens of thousands of
> > coroutines. Each coroutine causes 2 virtual memory areas to be created.
> 
> This sounds quite alarming. What usage scenario is justified in
> creating so many coroutines?

Basically we try to allow pooling coroutines for as many requests as
there can be in flight at the same time. That is, adding a virtio-blk
device increases the maximum pool size by num_queues * queue_size. If
you have a guest with many CPUs, the default num_queues is relatively
large (the bug referenced by Stefan had 64), and queue_size is 256 by
default. That's 16k potential requests in flight per disk.

Another part of it is just that our calculation didn't make a lot of
sense. Instead of applying this number to the pool size of the iothread
that would actually get the requests, we applied it to _every_ iothread.
This is fixed with this patch: it's a global number applied to a global
pool now.

> IIUC, coroutine stack size is 1 MB, and so tens of thousands of
> coroutines implies 10's of GB of memory just on stacks alone.

That's only virtual memory, though. Not sure how much of it is actually
used in practice.

> > Eventually vm.max_map_count is reached and memory-related syscalls fail.
> 
> On my system max_map_count is 1048576, quite alot higher than
> 10's of 1000's. Hitting that would imply ~500,000 coroutines and
> ~500 GB of stacks !

Did you change the configuration some time in the past, or is this just
a newer default? I get 65530, and that's the same default number I've
seen in the bug reports.

> > diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
> > index 5fd2dbaf8b..2790959eaf 100644
> > --- a/util/qemu-coroutine.c
> > +++ b/util/qemu-coroutine.c
> 
> > +static unsigned int get_global_pool_hard_max_size(void)
> > +{
> > +#ifdef __linux__
> > +    g_autofree char *contents = NULL;
> > +    int max_map_count;
> > +
> > +    /*
> > +     * Linux processes can have up to max_map_count virtual memory areas
> > +     * (VMAs). mmap(2), mprotect(2), etc fail with ENOMEM beyond this limit. We
> > +     * must limit the coroutine pool to a safe size to avoid running out of
> > +     * VMAs.
> > +     */
> > +    if (g_file_get_contents("/proc/sys/vm/max_map_count", &contents, NULL,
> > +                            NULL) &&
> > +        qemu_strtoi(contents, NULL, 10, &max_map_count) == 0) {
> > +        /*
> > +         * This is a conservative upper bound that avoids exceeding
> > +         * max_map_count. Leave half for non-coroutine users like library
> > +         * dependencies, vhost-user, etc. Each coroutine takes up 2 VMAs so
> > +         * halve the amount again.
> > +         */
> > +        return max_map_count / 4;
> 
> That's 256,000 coroutines, which still sounds incredibly large
> to me.

The whole purpose of the limitation is that you won't ever get -ENOMEM
back, which will likely crash your VM. Even if this hard limit is high,
that doesn't mean that it's fully used. Your setting of 1048576 probably
means that you would never have hit the crash anyway.

Even the benchmarks that used to hit the problem don't even get close to
this hard limit any more because the actual number of coroutines stays
much smaller after applying this patch.

> > +    }
> > +#endif
> > +
> > +    return UINT_MAX;
> 
> Why UINT_MAX as a default ?  If we can't read procfs, we should
> assume some much smaller sane default IMHO, that corresponds to
> what current linux default max_map_count would be.

I don't think we should artificially limit the pool size, and with it
potentially the performance, even if the host could do more if we only
allowed it to. If we can't read the limit from procfs, then it's your
responsibility as a user to make sure that it's large enough for your VM
configuration.

Kevin
Re: [PATCH] coroutine: cap per-thread local pool size
Posted by Daniel P. Berrangé 1 month, 1 week ago
On Tue, Mar 19, 2024 at 05:54:38PM +0100, Kevin Wolf wrote:
> Am 19.03.2024 um 14:43 hat Daniel P. Berrangé geschrieben:
> > On Mon, Mar 18, 2024 at 02:34:29PM -0400, Stefan Hajnoczi wrote:
> > > The coroutine pool implementation can hit the Linux vm.max_map_count
> > > limit, causing QEMU to abort with "failed to allocate memory for stack"
> > > or "failed to set up stack guard page" during coroutine creation.
> > > 
> > > This happens because per-thread pools can grow to tens of thousands of
> > > coroutines. Each coroutine causes 2 virtual memory areas to be created.
> > 
> > This sounds quite alarming. What usage scenario is justified in
> > creating so many coroutines?
> 
> Basically we try to allow pooling coroutines for as many requests as
> there can be in flight at the same time. That is, adding a virtio-blk
> device increases the maximum pool size by num_queues * queue_size. If
> you have a guest with many CPUs, the default num_queues is relatively
> large (the bug referenced by Stefan had 64), and queue_size is 256 by
> default. That's 16k potential requests in flight per disk.

If we have more than 1 virtio-blk device, does that scale
up the max coroutines too?

e.g. would 32 virtio-blk devices imply 16k * 32 -> 512k potential
requests/coroutines?

> > IIUC, coroutine stack size is 1 MB, and so tens of thousands of
> > coroutines implies 10's of GB of memory just on stacks alone.
> 
> That's only virtual memory, though. Not sure how much of it is actually
> used in practice.

True, by default Linux wouldn't care too much about virtual memory.
Only if 'vm.overcommit_memory' is changed from its default, such
that Linux applies an overcommit ratio on RAM, would total virtual
memory be relevant.
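
For reference, that is the strict accounting mode; the sysctls below are
shown only to illustrate when the stacks' virtual memory starts to matter:

  sysctl vm.overcommit_memory       # default 0: heuristic overcommit
  sysctl -w vm.overcommit_memory=2  # strict: commit limit = swap + RAM * ratio%
  sysctl -w vm.overcommit_ratio=50  # default ratio used in strict mode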



> > > Eventually vm.max_map_count is reached and memory-related syscalls fail.
> > 
> > On my system max_map_count is 1048576, quite alot higher than
> > 10's of 1000's. Hitting that would imply ~500,000 coroutines and
> > ~500 GB of stacks !
> 
> Did you change the configuration some time in the past, or is this just
> a newer default? I get 65530, and that's the same default number I've
> seen in the bug reports.

It turns out it is a Fedora change, rather than a kernel change:

  https://fedoraproject.org/wiki/Changes/IncreaseVmMaxMapCount

> > > diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
> > > index 5fd2dbaf8b..2790959eaf 100644
> > > --- a/util/qemu-coroutine.c
> > > +++ b/util/qemu-coroutine.c
> > 
> > > +static unsigned int get_global_pool_hard_max_size(void)
> > > +{
> > > +#ifdef __linux__
> > > +    g_autofree char *contents = NULL;
> > > +    int max_map_count;
> > > +
> > > +    /*
> > > +     * Linux processes can have up to max_map_count virtual memory areas
> > > +     * (VMAs). mmap(2), mprotect(2), etc fail with ENOMEM beyond this limit. We
> > > +     * must limit the coroutine pool to a safe size to avoid running out of
> > > +     * VMAs.
> > > +     */
> > > +    if (g_file_get_contents("/proc/sys/vm/max_map_count", &contents, NULL,
> > > +                            NULL) &&
> > > +        qemu_strtoi(contents, NULL, 10, &max_map_count) == 0) {
> > > +        /*
> > > +         * This is a conservative upper bound that avoids exceeding
> > > +         * max_map_count. Leave half for non-coroutine users like library
> > > +         * dependencies, vhost-user, etc. Each coroutine takes up 2 VMAs so
> > > +         * halve the amount again.
> > > +         */
> > > +        return max_map_count / 4;
> > 
> > That's 256,000 coroutines, which still sounds incredibly large
> > to me.
> 
> The whole purpose of the limitation is that you won't ever get -ENOMEM
> back, which will likely crash your VM. Even if this hard limit is high,
> that doesn't mean that it's fully used. Your setting of 1048576 probably
> means that you would never have hit the crash anyway.
> 
> Even the benchmarks that used to hit the problem don't even get close to
> this hard limit any more because the actual number of coroutines stays
> much smaller after applying this patch.

I'm more thinking about what's the worst case behaviour that a
malicious guest can inflict on QEMU, causing unexpectedly
high memory usage on the host.

ENOMEM is bad for a friendly VM, but there's also the risk to
the host from an unfriendly VM exploiting the high limits.

> 
> > > +    }
> > > +#endif
> > > +
> > > +    return UINT_MAX;
> > 
> > Why UINT_MAX as a default ?  If we can't read procfs, we should
> > assume some much smaller sane default IMHO, that corresponds to
> > what current linux default max_map_count would be.
> 
> I don't think we should artificially limit the pool size and with this
> potentially limit the performance with it even if the host could do more
> if we only allowed it to. If we can't read it from procfs, then it's
> your responsibility as a user to make sure that it's large enough for
> your VM configuration.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


Re: [PATCH] coroutine: cap per-thread local pool size
Posted by Kevin Wolf 1 month, 1 week ago
Am 19.03.2024 um 18:10 hat Daniel P. Berrangé geschrieben:
> On Tue, Mar 19, 2024 at 05:54:38PM +0100, Kevin Wolf wrote:
> > Am 19.03.2024 um 14:43 hat Daniel P. Berrangé geschrieben:
> > > On Mon, Mar 18, 2024 at 02:34:29PM -0400, Stefan Hajnoczi wrote:
> > > > The coroutine pool implementation can hit the Linux vm.max_map_count
> > > > limit, causing QEMU to abort with "failed to allocate memory for stack"
> > > > or "failed to set up stack guard page" during coroutine creation.
> > > > 
> > > > This happens because per-thread pools can grow to tens of thousands of
> > > > coroutines. Each coroutine causes 2 virtual memory areas to be created.
> > > 
> > > This sounds quite alarming. What usage scenario is justified in
> > > creating so many coroutines?
> > 
> > Basically we try to allow pooling coroutines for as many requests as
> > there can be in flight at the same time. That is, adding a virtio-blk
> > device increases the maximum pool size by num_queues * queue_size. If
> > you have a guest with many CPUs, the default num_queues is relatively
> > large (the bug referenced by Stefan had 64), and queue_size is 256 by
> > default. That's 16k potential requests in flight per disk.
> 
> If we have more than 1 virtio-blk device, does that scale up the max
> coroutines too ?
> 
> eg would 32 virtio-blks devices imply 16k * 32 -> 512k potential
> requests/coroutines ?

Yes. This is the number of request descriptors that fit in the
virtqueues, and if you add another device with additional virtqueues,
then obviously that increases the number of theoretically possible
parallel requests.

The limits of what you can actually achieve in practice might be lower
because I/O might complete faster than the time we need to process all
of the queued requests, depending on how many vcpus are trying to
"compete" with how many iothreads. Of course, the practical limits in
five years might be different from today.

> > > IIUC, coroutine stack size is 1 MB, and so tens of thousands of
> > > coroutines implies 10's of GB of memory just on stacks alone.
> > 
> > That's only virtual memory, though. Not sure how much of it is actually
> > used in practice.
> 
> True, by default Linux wouldn't care too much about virtual memory,
> Only if 'vm.overcommit_memory' is changed from its default, such
> that Linux applies an overcommit ratio on RAM, then total virtual
> memory would be relevant.

That's a good point and one that I don't have a good answer for, short
of just replacing the whole QEMU block layer with rsd and switching to
stackless coroutines/futures this way.

> > > > Eventually vm.max_map_count is reached and memory-related syscalls fail.
> > > 
> > > On my system max_map_count is 1048576, quite alot higher than
> > > 10's of 1000's. Hitting that would imply ~500,000 coroutines and
> > > ~500 GB of stacks !
> > 
> > Did you change the configuration some time in the past, or is this just
> > a newer default? I get 65530, and that's the same default number I've
> > seen in the bug reports.
> 
> It turns out it is a Fedora change, rather than a kernel change:
> 
>   https://fedoraproject.org/wiki/Changes/IncreaseVmMaxMapCount

Good to know, thanks.

> > > > diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
> > > > index 5fd2dbaf8b..2790959eaf 100644
> > > > --- a/util/qemu-coroutine.c
> > > > +++ b/util/qemu-coroutine.c
> > > 
> > > > +static unsigned int get_global_pool_hard_max_size(void)
> > > > +{
> > > > +#ifdef __linux__
> > > > +    g_autofree char *contents = NULL;
> > > > +    int max_map_count;
> > > > +
> > > > +    /*
> > > > +     * Linux processes can have up to max_map_count virtual memory areas
> > > > +     * (VMAs). mmap(2), mprotect(2), etc fail with ENOMEM beyond this limit. We
> > > > +     * must limit the coroutine pool to a safe size to avoid running out of
> > > > +     * VMAs.
> > > > +     */
> > > > +    if (g_file_get_contents("/proc/sys/vm/max_map_count", &contents, NULL,
> > > > +                            NULL) &&
> > > > +        qemu_strtoi(contents, NULL, 10, &max_map_count) == 0) {
> > > > +        /*
> > > > +         * This is a conservative upper bound that avoids exceeding
> > > > +         * max_map_count. Leave half for non-coroutine users like library
> > > > +         * dependencies, vhost-user, etc. Each coroutine takes up 2 VMAs so
> > > > +         * halve the amount again.
> > > > +         */
> > > > +        return max_map_count / 4;
> > > 
> > > That's 256,000 coroutines, which still sounds incredibly large
> > > to me.
> > 
> > The whole purpose of the limitation is that you won't ever get -ENOMEM
> > back, which will likely crash your VM. Even if this hard limit is high,
> > that doesn't mean that it's fully used. Your setting of 1048576 probably
> > means that you would never have hit the crash anyway.
> > 
> > Even the benchmarks that used to hit the problem don't even get close to
> > this hard limit any more because the actual number of coroutines stays
> > much smaller after applying this patch.
> 
> I'm more thinking about what's the worst case behaviour that a
> malicious guest can inflict on QEMU, and cause unexpectedly high
> memory usage in the host.
> 
> ENOMEM is bad for a friendy VM, but there's also the risk to the host
> from a unfriendly VM exploiting the high limits

But from a QEMU perspective, what is the difference between a friendly
high-performance VM that exhausts the available bandwidth to do its job
as well and as fast as possible, and a malicious VM that does the same
just to waste host resources? I don't think QEMU can decide this; they
look the same.

If you want a VM not to send 16k requests in parallel, you can configure
its disk to expose fewer queues or a smaller queue size. The values I
mentioned above are only defaults that allow friendly VMs to perform
well out of the box; nothing prevents you from changing them to restrict
the amount of resources a VM can use.
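
For instance (illustrative values; num-queues and queue-size are existing
virtio-blk properties):

  -device virtio-blk-pci,drive=disk0,num-queues=4,queue-size=128

caps that device's contribution at 4 * 128 = 512 potential parallel
requests instead of the 16k discussed above.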

Kevin
Re: [PATCH] coroutine: cap per-thread local pool size
Posted by Daniel P. Berrangé 1 month, 1 week ago
On Tue, Mar 19, 2024 at 06:41:28PM +0100, Kevin Wolf wrote:
> Am 19.03.2024 um 18:10 hat Daniel P. Berrangé geschrieben:
> > On Tue, Mar 19, 2024 at 05:54:38PM +0100, Kevin Wolf wrote:
> > > Am 19.03.2024 um 14:43 hat Daniel P. Berrangé geschrieben:
> > > > On Mon, Mar 18, 2024 at 02:34:29PM -0400, Stefan Hajnoczi wrote:
> > > > > The coroutine pool implementation can hit the Linux vm.max_map_count
> > > > > limit, causing QEMU to abort with "failed to allocate memory for stack"
> > > > > or "failed to set up stack guard page" during coroutine creation.
> > > > > 
> > > > > This happens because per-thread pools can grow to tens of thousands of
> > > > > coroutines. Each coroutine causes 2 virtual memory areas to be created.
> > > > 
> > > > This sounds quite alarming. What usage scenario is justified in
> > > > creating so many coroutines?
> > > 
> > > Basically we try to allow pooling coroutines for as many requests as
> > > there can be in flight at the same time. That is, adding a virtio-blk
> > > device increases the maximum pool size by num_queues * queue_size. If
> > > you have a guest with many CPUs, the default num_queues is relatively
> > > large (the bug referenced by Stefan had 64), and queue_size is 256 by
> > > default. That's 16k potential requests in flight per disk.
> > 
> > If we have more than 1 virtio-blk device, does that scale up the max
> > coroutines too ?
> > 
> > eg would 32 virtio-blks devices imply 16k * 32 -> 512k potential
> > requests/coroutines ?
> 
> Yes. This is the number of request descriptors that fit in the
> virtqueues, and if you add another device with additional virtqueues,
> then obviously that increases the number of theoretically possible
> parallel requests.
> 
> The limits of what you can actually achieve in practice might be lower
> because I/O might complete faster than the time we need to process all
> of the queued requests, depending on how many vcpus are trying to
> "compete" with how many iothreads. Of course, the practical limits in
> five years might be different from today.
> 
> > > > IIUC, coroutine stack size is 1 MB, and so tens of thousands of
> > > > coroutines implies 10's of GB of memory just on stacks alone.
> > > 
> > > That's only virtual memory, though. Not sure how much of it is actually
> > > used in practice.
> > 
> > True, by default Linux doesn't care too much about virtual memory.
> > Only if 'vm.overcommit_memory' is changed from its default, so that
> > Linux applies an overcommit ratio on RAM, does total virtual memory
> > become relevant.
> 
> That's a good point and one that I don't have a good answer for, short
> of just replacing the whole QEMU block layer with rsd and switching to
> stackless coroutines/futures this way.
> 
> > > > > Eventually vm.max_map_count is reached and memory-related syscalls fail.
> > > > 
> > > > On my system max_map_count is 1048576, quite a lot higher than
> > > > tens of thousands. Hitting that would imply ~500,000 coroutines and
> > > > ~500 GB of stacks!
> > > 
> > > Did you change the configuration some time in the past, or is this just
> > > a newer default? I get 65530, and that's the same default number I've
> > > seen in the bug reports.
> > 
> > It turns out it is a Fedora change, rather than a kernel change:
> > 
> >   https://fedoraproject.org/wiki/Changes/IncreaseVmMaxMapCount
> 
> Good to know, thanks.
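
For the two defaults mentioned here, the max_map_count / 4 cap from the
function quoted below works out to:

    65530   / 4 = 16382  pooled coroutines at most (kernel default)
    1048576 / 4 = 262144 pooled coroutines at most (Fedora default)
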
> 
> > > > > diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
> > > > > index 5fd2dbaf8b..2790959eaf 100644
> > > > > --- a/util/qemu-coroutine.c
> > > > > +++ b/util/qemu-coroutine.c
> > > > 
> > > > > +static unsigned int get_global_pool_hard_max_size(void)
> > > > > +{
> > > > > +#ifdef __linux__
> > > > > +    g_autofree char *contents = NULL;
> > > > > +    int max_map_count;
> > > > > +
> > > > > +    /*
> > > > > +     * Linux processes can have up to max_map_count virtual memory areas
> > > > > +     * (VMAs). mmap(2), mprotect(2), etc fail with ENOMEM beyond this limit. We
> > > > > +     * must limit the coroutine pool to a safe size to avoid running out of
> > > > > +     * VMAs.
> > > > > +     */
> > > > > +    if (g_file_get_contents("/proc/sys/vm/max_map_count", &contents, NULL,
> > > > > +                            NULL) &&
> > > > > +        qemu_strtoi(contents, NULL, 10, &max_map_count) == 0) {
> > > > > +        /*
> > > > > +         * This is a conservative upper bound that avoids exceeding
> > > > > +         * max_map_count. Leave half for non-coroutine users like library
> > > > > +         * dependencies, vhost-user, etc. Each coroutine takes up 2 VMAs so
> > > > > +         * halve the amount again.
> > > > > +         */
> > > > > +        return max_map_count / 4;
> > > > 
> > > > That's 256,000 coroutines, which still sounds incredibly large
> > > > to me.
> > > 
> > > The whole purpose of the limitation is that you won't ever get -ENOMEM
> > > back, which will likely crash your VM. Even if this hard limit is high,
> > > that doesn't mean that it's fully used. Your setting of 1048576 probably
> > > means that you would never have hit the crash anyway.
> > > 
> > > Even the benchmarks that used to hit the problem don't even get close to
> > > this hard limit any more because the actual number of coroutines stays
> > > much smaller after applying this patch.
> > 
> > I'm thinking more about the worst-case behaviour that a malicious
> > guest can inflict on QEMU to cause unexpectedly high memory usage
> > on the host.
> > 
> > ENOMEM is bad for a friendly VM, but there's also the risk to the
> > host from an unfriendly VM exploiting the high limits.
> 
> But from a QEMU perspective, what is the difference between a friendly
> high-performance VM that exhausts the available bandwidth to do its job
> as well and as fast as possible, and a malicious VM that does the same
> just to waste host resources? I don't think QEMU can tell them apart;
> they look the same.
> 
> If you want a VM not to send 16k requests in parallel, you can configure
> its disk to expose fewer queues or a smaller queue size. The values I
> mentioned above are only defaults that allow friendly VMs to perform
> well out of the box; nothing prevents you from changing them to restrict
> the amount of resources a VM can use.

Reducing queues is a no-win scenario: it limits the performance of a
single disk used in isolation just to cap the worst case when all disks
are used concurrently :-( It would be nicer to let a single disk burst
to a high level, and only limit coroutines when many disks all try to
burst to a high level at the same time.
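
Just to illustrate the kind of policy I mean, here is a rough sketch of
a shared burst budget on top of a per-disk baseline. This is purely
hypothetical, not QEMU code; the names, numbers and plain C11 atomics
are made up, and what the caller does when no budget is left is the
real policy question:

    #include <stdatomic.h>
    #include <stdbool.h>

    #define BASELINE_PER_DISK 1024       /* every disk may always burst this far */

    static atomic_long shared_burst_budget = 32768; /* extra slots shared by all disks */

    typedef struct {
        atomic_long in_flight;           /* requests currently holding a coroutine */
    } DiskBudget;

    /* Returns true if the request may take a coroutine now. *borrowed records
     * whether a slot came from the shared budget so it can be given back. */
    static bool disk_budget_get(DiskBudget *d, bool *borrowed)
    {
        *borrowed = false;
        if (atomic_fetch_add(&d->in_flight, 1) < BASELINE_PER_DISK) {
            return true;                 /* within this disk's own baseline */
        }
        if (atomic_fetch_sub(&shared_burst_budget, 1) > 0) {
            *borrowed = true;            /* borrowed one slot from the shared pool */
            return true;
        }
        atomic_fetch_add(&shared_burst_budget, 1); /* undo, budget exhausted */
        atomic_fetch_sub(&d->in_flight, 1);
        return false;                    /* caller must throttle or fall back */
    }

    static void disk_budget_put(DiskBudget *d, bool borrowed)
    {
        atomic_fetch_sub(&d->in_flight, 1);
        if (borrowed) {
            atomic_fetch_add(&shared_burst_budget, 1);
        }
    }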


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


Re: [PATCH] coroutine: cap per-thread local pool size
Posted by Kevin Wolf 1 month, 1 week ago
Am 18.03.2024 um 19:34 hat Stefan Hajnoczi geschrieben:
> The coroutine pool implementation can hit the Linux vm.max_map_count
> limit, causing QEMU to abort with "failed to allocate memory for stack"
> or "failed to set up stack guard page" during coroutine creation.
> 
> This happens because per-thread pools can grow to tens of thousands of
> coroutines. Each coroutine causes 2 virtual memory areas to be created.
> Eventually vm.max_map_count is reached and memory-related syscalls fail.
> The per-thread pool sizes are non-uniform and depend on past coroutine
> usage in each thread, so it's possible for one thread to have a large
> pool while another thread's pool is empty.
> 
> Switch to a new coroutine pool implementation with a global pool that
> grows to a maximum number of coroutines and per-thread local pools that
> are capped at hardcoded small number of coroutines.
> 
> This approach does not leave large numbers of coroutines pooled in a
> thread that may not use them again. In order to perform well it
> amortizes the cost of global pool accesses by working in batches of
> coroutines instead of individual coroutines.
> 
> The global pool is a list. Threads donate batches of coroutines to when
> they have too many and take batches from when they have too few:
> 
> .-----------------------------------.
> | Batch 1 | Batch 2 | Batch 3 | ... | global_pool
> `-----------------------------------'
> 
> Each thread has up to 2 batches of coroutines:
> 
> .-------------------.
> | Batch 1 | Batch 2 | per-thread local_pool (maximum 2 batches)
> `-------------------'
> 
> The goal of this change is to reduce the excessive number of pooled
> coroutines that cause QEMU to abort when vm.max_map_count is reached
> without losing the performance of an adequately sized coroutine pool.
> 
> Here are virtio-blk disk I/O benchmark results:
> 
>       RW BLKSIZE IODEPTH    OLD    NEW CHANGE
> randread      4k       1 113725 117451 +3.3%
> randread      4k       8 192968 198510 +2.9%
> randread      4k      16 207138 209429 +1.1%
> randread      4k      32 212399 215145 +1.3%
> randread      4k      64 218319 221277 +1.4%
> randread    128k       1  17587  17535 -0.3%
> randread    128k       8  17614  17616 +0.0%
> randread    128k      16  17608  17609 +0.0%
> randread    128k      32  17552  17553 +0.0%
> randread    128k      64  17484  17484 +0.0%
> 
> See files/{fio.sh,test.xml.j2} for the benchmark configuration:
> https://gitlab.com/stefanha/virt-playbooks/-/tree/coroutine-pool-fix-sizing
> 
> Buglink: https://issues.redhat.com/browse/RHEL-28947
> Reported-by: Sanjay Rao <srao@redhat.com>
> Reported-by: Boaz Ben Shabat <bbenshab@redhat.com>
> Reported-by: Joe Mario <jmario@redhat.com>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

Reviewed-by: Kevin Wolf <kwolf@redhat.com>

Though I do wonder if we can do something about the slight performance
degradation that Sanjay reported. We seem to stay well under the hard
limit, so the reduced global pool size shouldn't be the issue. Maybe
it's the locking?

Either way, even though it could be called a fix, I don't think this is
for 9.0, right?

Kevin
Re: [PATCH] coroutine: cap per-thread local pool size
Posted by Sanjay Rao 1 month, 1 week ago
On Tue, Mar 19, 2024 at 9:32 AM Kevin Wolf <kwolf@redhat.com> wrote:

> Am 18.03.2024 um 19:34 hat Stefan Hajnoczi geschrieben:
> > The coroutine pool implementation can hit the Linux vm.max_map_count
> > limit, causing QEMU to abort with "failed to allocate memory for stack"
> > or "failed to set up stack guard page" during coroutine creation.
> >
> > This happens because per-thread pools can grow to tens of thousands of
> > coroutines. Each coroutine causes 2 virtual memory areas to be created.
> > Eventually vm.max_map_count is reached and memory-related syscalls fail.
> > The per-thread pool sizes are non-uniform and depend on past coroutine
> > usage in each thread, so it's possible for one thread to have a large
> > pool while another thread's pool is empty.
> >
> > Switch to a new coroutine pool implementation with a global pool that
> > grows to a maximum number of coroutines and per-thread local pools that
> > are capped at hardcoded small number of coroutines.
> >
> > This approach does not leave large numbers of coroutines pooled in a
> > thread that may not use them again. In order to perform well it
> > amortizes the cost of global pool accesses by working in batches of
> > coroutines instead of individual coroutines.
> >
> > The global pool is a list. Threads donate batches of coroutines to when
> > they have too many and take batches from when they have too few:
> >
> > .-----------------------------------.
> > | Batch 1 | Batch 2 | Batch 3 | ... | global_pool
> > `-----------------------------------'
> >
> > Each thread has up to 2 batches of coroutines:
> >
> > .-------------------.
> > | Batch 1 | Batch 2 | per-thread local_pool (maximum 2 batches)
> > `-------------------'
> >
> > The goal of this change is to reduce the excessive number of pooled
> > coroutines that cause QEMU to abort when vm.max_map_count is reached
> > without losing the performance of an adequately sized coroutine pool.
> >
> > Here are virtio-blk disk I/O benchmark results:
> >
> >       RW BLKSIZE IODEPTH    OLD    NEW CHANGE
> > randread      4k       1 113725 117451 +3.3%
> > randread      4k       8 192968 198510 +2.9%
> > randread      4k      16 207138 209429 +1.1%
> > randread      4k      32 212399 215145 +1.3%
> > randread      4k      64 218319 221277 +1.4%
> > randread    128k       1  17587  17535 -0.3%
> > randread    128k       8  17614  17616 +0.0%
> > randread    128k      16  17608  17609 +0.0%
> > randread    128k      32  17552  17553 +0.0%
> > randread    128k      64  17484  17484 +0.0%
> >
> > See files/{fio.sh,test.xml.j2} for the benchmark configuration:
> > https://gitlab.com/stefanha/virt-playbooks/-/tree/coroutine-pool-fix-sizing
> >
> > Buglink: https://issues.redhat.com/browse/RHEL-28947
> > Reported-by: Sanjay Rao <srao@redhat.com>
> > Reported-by: Boaz Ben Shabat <bbenshab@redhat.com>
> > Reported-by: Joe Mario <jmario@redhat.com>
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
>
> Reviewed-by: Kevin Wolf <kwolf@redhat.com>
>
> Though I do wonder if we can do something about the slight performance
> degradation that Sanjay reported. We seem to stay well under the hard
> limit, so the reduced global pool size shouldn't be the issue. Maybe
> it's the locking?
>
We are only seeing a slight falloff from our much improved numbers with
the addition of iothreads. I am not very concerned. With database
workloads, there is always run-to-run variation, especially when there
are a lot of idle CPUs on the host. To reduce the run-to-run variation,
we use CPU/NUMA pinning and other methods like PCI passthrough. If I get
a chance, I will do some runs with CPU pinning to see what the numbers
look like.



> Either way, even though it could be called a fix, I don't think this is
> for 9.0, right?
>
> Kevin
>
>

-- 

Sanjay Rao
Sr. Principal Performance Engineer                   Phone: 978-392-2479
Red Hat, Inc.                                    FAX: 978-392-1001
314 Littleton Road                               Email: srao@redhat.com
Westford, MA 01886
Re: [PATCH] coroutine: cap per-thread local pool size
Posted by Stefan Hajnoczi 1 month, 1 week ago
On Tue, Mar 19, 2024 at 02:32:06PM +0100, Kevin Wolf wrote:
> Am 18.03.2024 um 19:34 hat Stefan Hajnoczi geschrieben:
> > The coroutine pool implementation can hit the Linux vm.max_map_count
> > limit, causing QEMU to abort with "failed to allocate memory for stack"
> > or "failed to set up stack guard page" during coroutine creation.
> > 
> > This happens because per-thread pools can grow to tens of thousands of
> > coroutines. Each coroutine causes 2 virtual memory areas to be created.
> > Eventually vm.max_map_count is reached and memory-related syscalls fail.
> > The per-thread pool sizes are non-uniform and depend on past coroutine
> > usage in each thread, so it's possible for one thread to have a large
> > pool while another thread's pool is empty.
> > 
> > Switch to a new coroutine pool implementation with a global pool that
> > grows to a maximum number of coroutines and per-thread local pools that
> > are capped at hardcoded small number of coroutines.
> > 
> > This approach does not leave large numbers of coroutines pooled in a
> > thread that may not use them again. In order to perform well it
> > amortizes the cost of global pool accesses by working in batches of
> > coroutines instead of individual coroutines.
> > 
> > The global pool is a list. Threads donate batches of coroutines to when
> > they have too many and take batches from when they have too few:
> > 
> > .-----------------------------------.
> > | Batch 1 | Batch 2 | Batch 3 | ... | global_pool
> > `-----------------------------------'
> > 
> > Each thread has up to 2 batches of coroutines:
> > 
> > .-------------------.
> > | Batch 1 | Batch 2 | per-thread local_pool (maximum 2 batches)
> > `-------------------'
> > 
> > The goal of this change is to reduce the excessive number of pooled
> > coroutines that cause QEMU to abort when vm.max_map_count is reached
> > without losing the performance of an adequately sized coroutine pool.
> > 
> > Here are virtio-blk disk I/O benchmark results:
> > 
> >       RW BLKSIZE IODEPTH    OLD    NEW CHANGE
> > randread      4k       1 113725 117451 +3.3%
> > randread      4k       8 192968 198510 +2.9%
> > randread      4k      16 207138 209429 +1.1%
> > randread      4k      32 212399 215145 +1.3%
> > randread      4k      64 218319 221277 +1.4%
> > randread    128k       1  17587  17535 -0.3%
> > randread    128k       8  17614  17616 +0.0%
> > randread    128k      16  17608  17609 +0.0%
> > randread    128k      32  17552  17553 +0.0%
> > randread    128k      64  17484  17484 +0.0%
> > 
> > See files/{fio.sh,test.xml.j2} for the benchmark configuration:
> > https://gitlab.com/stefanha/virt-playbooks/-/tree/coroutine-pool-fix-sizing
> > 
> > Buglink: https://issues.redhat.com/browse/RHEL-28947
> > Reported-by: Sanjay Rao <srao@redhat.com>
> > Reported-by: Boaz Ben Shabat <bbenshab@redhat.com>
> > Reported-by: Joe Mario <jmario@redhat.com>
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> 
> Reviewed-by: Kevin Wolf <kwolf@redhat.com>
> 
> Though I do wonder if we can do something about the slight performance
> degradation that Sanjay reported. We seem to stay well under the hard
> limit, so the reduced global pool size shouldn't be the issue. Maybe
> it's the locking?

I'm not sure if it's the lock. When writing the code I tried to avoid
thresholds that would cause batches to bounce between the global pool
and the thread-local pools, because that is another way to lose
performance. So maybe it's something related to the algorithm.
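
The idea is that the "donate" and "refill" thresholds are far apart, so
a freed coroutine cannot immediately bounce back to the global pool. A
simplified sketch of that spacing (not the actual patch code: it ignores
the global pool lock and the hard cap, moves elements one by one instead
of splicing whole batches, and the sizes are only examples):

    #include <stddef.h>

    #define BATCH_SIZE      128
    #define LOCAL_POOL_MAX  (2 * BATCH_SIZE)   /* at most two batches per thread */

    typedef struct PoolItem { struct PoolItem *next; } PoolItem;
    typedef struct { PoolItem *head; size_t size; } Pool;

    /* Move up to n items from src to dst. */
    static void pool_move(Pool *dst, Pool *src, size_t n)
    {
        for (size_t i = 0; i < n && src->head; i++) {
            PoolItem *it = src->head;
            src->head = it->next;
            src->size--;
            it->next = dst->head;
            dst->head = it;
            dst->size++;
        }
    }

    /* Free path: only when the local pool is completely full do we give one
     * whole batch back to the global pool, leaving one full batch behind. */
    static void local_put(Pool *local, Pool *global, PoolItem *it)
    {
        it->next = local->head;
        local->head = it;
        local->size++;
        if (local->size >= LOCAL_POOL_MAX) {
            pool_move(global, local, BATCH_SIZE);
        }
    }

    /* Alloc path: only when the local pool is completely empty do we pull one
     * batch from the global pool. The gap between "empty" and "full" is what
     * keeps a batch from ping-ponging on every alloc/free pair. */
    static PoolItem *local_get(Pool *local, Pool *global)
    {
        if (local->size == 0) {
            pool_move(local, global, BATCH_SIZE);
        }
        PoolItem *it = local->head;
        if (it) {
            local->head = it->next;
            local->size--;
        }
        return it;   /* NULL means the caller creates a fresh coroutine */
    }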

> Either way, even though it could be called a fix, I don't think this is
> for 9.0, right?

There is a bug report for the max_map_count issue (RHEL-28947), so I
think this fix should go into QEMU 9.0.

Stefan