Increase MAX_MEM_PREALLOC_THREAD_COUNT from 16 to 32. This was last
touched in 2017 [1] and, since then, physical machine sizes and VMs
therein have continued to get even bigger, both on average and on the
extremes.
For very large VMs, using 16 threads to preallocate memory can be a
non-trivial bottleneck during VM start-up and migration. Increasing
this limit to 32 threads reduces the time taken for these operations.
Test results from quad socket Intel 8490H (4x 60 cores) show a fairly
linear gain of 50% with the 2x thread count increase.
---------------------------------------------
Idle Guest w/ 2M HugePages | Start-up time
---------------------------------------------
240 vCPU, 7.5TB (16 threads) | 2m41.955s
---------------------------------------------
240 vCPU, 7.5TB (32 threads) | 1m19.404s
---------------------------------------------
[1] 1e356fc14bea ("mem-prealloc: reduce large guest start-up and migration time.")
Signed-off-by: Jon Kohler <jon@nutanix.com>
---
util/oslib-posix.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 3c14b72665..dc001da66d 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -61,7 +61,7 @@
#include "qemu/memalign.h"
#include "qemu/mmap-alloc.h"
-#define MAX_MEM_PREALLOC_THREAD_COUNT 16
+#define MAX_MEM_PREALLOC_THREAD_COUNT 32
struct MemsetThread;
--
2.43.0
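
For context, the constant being raised here caps how many memset threads are
used to preallocate each memory region. A minimal sketch of that kind of
clamping logic (illustrative only; prealloc_num_threads() below is a
hypothetical helper, not the code in util/oslib-posix.c):

/*
 * Illustrative sketch only -- prealloc_num_threads() is a hypothetical
 * helper, not the code in util/oslib-posix.c. The idea: the effective
 * thread count is the smallest of the online host CPUs, any requested
 * thread count, and the compile-time cap, and never more threads than
 * there are pages to touch.
 */
#include <stdio.h>
#include <unistd.h>

#define MAX_MEM_PREALLOC_THREAD_COUNT 32

static int prealloc_num_threads(size_t numpages, int requested)
{
    long host_procs = sysconf(_SC_NPROCESSORS_ONLN);
    long n = host_procs > 0 ? host_procs : 1;

    if (requested > 0 && requested < n) {
        n = requested;
    }
    if (n > MAX_MEM_PREALLOC_THREAD_COUNT) {
        n = MAX_MEM_PREALLOC_THREAD_COUNT;
    }
    if ((size_t)n > numpages) {
        n = numpages ? (long)numpages : 1;
    }
    return (int)n;
}

int main(void)
{
    /* 7.5 TiB of 2M huge pages, as in the test configuration above */
    size_t numpages = (7680ULL * 1024 * 1024 * 1024) / (2 * 1024 * 1024);

    printf("prealloc threads: %d\n", prealloc_num_threads(numpages, 240));
    return 0;
}

With this style of clamp, even a request for 240 threads on a 240-CPU host is
bounded by the compile-time constant, which is exactly the limit the patch
raises from 16 to 32.
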
On Mon, Nov 03, 2025 at 11:57:50AM -0700, Jon Kohler wrote:
> Increase MAX_MEM_PREALLOC_THREAD_COUNT from 16 to 32. This was last
> touched in 2017 [1] and, since then, physical machine sizes and VMs
> therein have continued to get even bigger, both on average and on the
> extremes.
>
> For very large VMs, using 16 threads to preallocate memory can be a
> non-trivial bottleneck during VM start-up and migration. Increasing
> this limit to 32 threads reduces the time taken for these operations.
>
> Test results from quad socket Intel 8490H (4x 60 cores) show a fairly
> linear gain of 50% with the 2x thread count increase.
>
> ---------------------------------------------
> Idle Guest w/ 2M HugePages | Start-up time
> ---------------------------------------------
> 240 vCPU, 7.5TB (16 threads) | 2m41.955s
> ---------------------------------------------
> 240 vCPU, 7.5TB (32 threads) | 1m19.404s
> ---------------------------------------------
If we're configuring a guest with 240 vCPUs, then this implies the admin
is expecting that the guest will consume up to 240 host CPUs worth of
compute time.
What is the purpose of limiting the number of prealloc threads to a
value that is an order of magnitude less than the number of vCPUs the
guest has been given ?
Have you measured what startup time would look like with 240 prealloc
threads ? Do we hit some scaling limit before that point making more
prealloc threads counter-productive ?
I guess there could be a different impact for hot-add vs cold add. With
cold startup the vCPU threads are not yet consuming CPU time, so we
can reasonably consume that resource for prealloc, whereas for
hot-add any prealloc is on top of what vCPUs are already consuming.
> [1] 1e356fc14bea ("mem-prealloc: reduce large guest start-up and migration time.")
>
> Signed-off-by: Jon Kohler <jon@nutanix.com>
> ---
> util/oslib-posix.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> index 3c14b72665..dc001da66d 100644
> --- a/util/oslib-posix.c
> +++ b/util/oslib-posix.c
> @@ -61,7 +61,7 @@
> #include "qemu/memalign.h"
> #include "qemu/mmap-alloc.h"
>
> -#define MAX_MEM_PREALLOC_THREAD_COUNT 16
> +#define MAX_MEM_PREALLOC_THREAD_COUNT 32
>
> struct MemsetThread;
>
> --
> 2.43.0
>
>
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
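
For reference, the preallocation being discussed amounts to a pool of worker
threads, each touching the pages of one slice of the region so the kernel
faults them in. A minimal sketch using POSIX threads, illustrative only and
not QEMU's actual implementation in util/oslib-posix.c:

/*
 * Illustrative sketch only (POSIX threads; compile with -pthread) --
 * not QEMU's actual implementation in util/oslib-posix.c. Each worker
 * writes one byte per page in its slice so the kernel populates
 * (faults in and clears) the pages; the caller waits for all workers.
 */
#include <pthread.h>
#include <stdlib.h>

struct touch_arg {
    char *start;
    size_t npages;
    size_t pagesize;
};

static void *touch_pages(void *opaque)
{
    struct touch_arg *a = opaque;

    for (size_t i = 0; i < a->npages; i++) {
        a->start[i * a->pagesize] = 0; /* first touch populates the page */
    }
    return NULL;
}

static void prealloc_parallel(char *area, size_t npages, size_t pagesize,
                              int nthreads)
{
    pthread_t *tids = calloc(nthreads, sizeof(*tids));
    struct touch_arg *args = calloc(nthreads, sizeof(*args));
    size_t per = npages / nthreads;

    for (int i = 0; i < nthreads; i++) {
        args[i].start = area + (size_t)i * per * pagesize;
        /* last worker picks up the remainder */
        args[i].npages = (i == nthreads - 1) ? npages - per * i : per;
        args[i].pagesize = pagesize;
        pthread_create(&tids[i], NULL, touch_pages, &args[i]);
    }
    for (int i = 0; i < nthreads; i++) {
        pthread_join(tids[i], NULL);
    }
    free(args);
    free(tids);
}

Splitting the range this way is why the wall-clock time scales with the thread
count, at least until the memory subsystem itself becomes the bottleneck.
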
> On Nov 3, 2025, at 4:14 PM, Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Mon, Nov 03, 2025 at 11:57:50AM -0700, Jon Kohler wrote:
>> Increase MAX_MEM_PREALLOC_THREAD_COUNT from 16 to 32. This was last
>> touched in 2017 [1] and, since then, physical machine sizes and VMs
>> therein have continued to get even bigger, both on average and on the
>> extremes.
>>
>> For very large VMs, using 16 threads to preallocate memory can be a
>> non-trivial bottleneck during VM start-up and migration. Increasing
>> this limit to 32 threads reduces the time taken for these operations.
>>
>> Test results from quad socket Intel 8490H (4x 60 cores) show a fairly
>> linear gain of 50% with the 2x thread count increase.
>>
>> ---------------------------------------------
>> Idle Guest w/ 2M HugePages | Start-up time
>> ---------------------------------------------
>> 240 vCPU, 7.5TB (16 threads) | 2m41.955s
>> ---------------------------------------------
>> 240 vCPU, 7.5TB (32 threads) | 1m19.404s
>> ---------------------------------------------
>
> If we're configuring a guest with 240 vCPUs, then this implies the admin
> is expecting that the guest will consume up to 240 host CPUs worth of
> compute time.
>
> What is the purpose of limiting the number of prealloc threads to a
> value that is an order of magnitude less than the number of vCPUs the
> guest has been given ?

Daniel - thanks for the quick review and thoughts here.

I looked back through the original commits that led up to the current 16
thread max, and it wasn’t immediately clear to me why we clamped it at
16. Perhaps there was some other contention at the time.

> Have you measured what startup time would look like with 240 prealloc
> threads ? Do we hit some scaling limit before that point making more
> prealloc threads counter-productive ?

I have, and it isn’t wildly better; it comes down to about 50-ish seconds,
as you start running into practical limitations on the speed of memory, as
well as context switching if you’re doing other things on the host at the
same time.

In playing around with some other values, here’s how they shake out:
32 threads: 1m19s
48 threads: 1m4s
64 threads: 59s
…
240 threads: 50s

This also looks much less exciting when the amount of memory is
smaller. For smaller memory sizes (I’m testing with 7.5TB), anything
smaller than that gets less and less fun from a speedup perspective.

Putting that all together, 32 seemed like a sane number with a solid
speedup on fairly modern hardware.

For posterity, I am testing with kernel 6.12 LTS, but could also try newer
kernels if you were curious. Most of the time is spent in clear_pages_erms,
and outside of an experimental series on LKML [1], there really aren’t any
improvements on this state of the art.

For posterity, also adding Ankur into the mix as the author of that series,
as this is something they’ve been looking at for a while I believe.

[1] https://patchwork.kernel.org/project/linux-mm/cover/20251027202109.678022-1-ankur.a.arora@oracle.com/

> I guess there could be a different impact for hot-add vs cold add. With
> cold startup the vCPU threads are not yet consuming CPU time, so we
> can reasonably consume that resource for prealloc, whereas for
> hot-add any prealloc is on top of what vCPUs are already consuming.
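
As a rough back-of-envelope on the numbers above (a hypothetical calculation,
treating 7.5TB as 7.5 TiB of guest memory), the effective aggregate fill rate
at each reported time works out as follows:

/*
 * Hypothetical back-of-envelope helper (not part of QEMU): compute the
 * effective aggregate fill rate implied by the start-up times reported
 * in this thread, treating 7.5TB as 7.5 TiB of guest memory.
 */
#include <stdio.h>

int main(void)
{
    const double bytes = 7.5 * 1024.0 * 1024.0 * 1024.0 * 1024.0; /* 7.5 TiB */
    const struct { int threads; double secs; } runs[] = {
        { 16, 161.955 }, { 32, 79.404 }, { 48, 64.0 }, { 64, 59.0 }, { 240, 50.0 },
    };

    for (unsigned i = 0; i < sizeof(runs) / sizeof(runs[0]); i++) {
        printf("%3d threads: ~%.0f GiB/s aggregate\n", runs[i].threads,
               bytes / runs[i].secs / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}

The flattening of the curve past roughly 64 threads is consistent with the
memory subsystem, rather than the thread count, becoming the limit, which
matches the "practical limitations on the speed of memory" noted above.
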
On Tue, Nov 04, 2025 at 08:33:05PM +0000, Jon Kohler wrote:
>
>
> > On Nov 3, 2025, at 4:14 PM, Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > On Mon, Nov 03, 2025 at 11:57:50AM -0700, Jon Kohler wrote:
> >> Increase MAX_MEM_PREALLOC_THREAD_COUNT from 16 to 32. This was last
> >> touched in 2017 [1] and, since then, physical machine sizes and VMs
> >> therein have continued to get even bigger, both on average and on the
> >> extremes.
> >>
> >> For very large VMs, using 16 threads to preallocate memory can be a
> >> non-trivial bottleneck during VM start-up and migration. Increasing
> >> this limit to 32 threads reduces the time taken for these operations.
> >>
> >> Test results from quad socket Intel 8490H (4x 60 cores) show a fairly
> >> linear gain of 50% with the 2x thread count increase.
> >>
> >> ---------------------------------------------
> >> Idle Guest w/ 2M HugePages | Start-up time
> >> ---------------------------------------------
> >> 240 vCPU, 7.5TB (16 threads) | 2m41.955s
> >> ---------------------------------------------
> >> 240 vCPU, 7.5TB (32 threads) | 1m19.404s
> >> ---------------------------------------------
> >
> > If we're configuring a guest with 240 vCPUs, then this implies the admin
> > is expecting that the guest will consume up to 240 host CPUs worth of
> > compute time.
> >
> > What is the purpose of limiting the number of prealloc threads to a
> > value that is an order of magnitude less than the number of vCPUs the
> > guest has been given ?
>
> Daniel - thanks for the quick review and thoughts here.
>
> I looked back through the original commits that led up to the current 16
> thread max, and it wasn’t immediately clear to me why we clamped it at
> 16. Perhaps there was some other contention at the time.
>
> > Have you measured what startup time would look like with 240 prealloc
> > threads ? Do we hit some scaling limit before that point making more
> > prealloc threads counter-productive ?
>
> I have, and it isn’t wildly better; it comes down to about 50-ish seconds,
> as you start running into practical limitations on the speed of memory, as
> well as context switching if you’re doing other things on the host at the
> same time.
>
> In playing around with some other values, here’s how they shake out:
> 32 threads: 1m19s
> 48 threads: 1m4s
> 64 threads: 59s
> …
> 240 threads: 50s
>
> This also looks much less exciting when the amount of memory is
> smaller. For smaller memory sizes (I’m testing with 7.5TB), anything
> smaller than that gets less and less fun from a speedup perspective.
>
> Putting that all together, 32 seemed like a sane number with a solid
> speedup on fairly modern hardware.

Yep, that's useful background, I've no objection to picking 32.

Perhaps worth putting a bit more of this detail into the
commit message as background.

With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
> On Nov 5, 2025, at 4:05 AM, Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Tue, Nov 04, 2025 at 08:33:05PM +0000, Jon Kohler wrote:
>>
>>
>>> On Nov 3, 2025, at 4:14 PM, Daniel P. Berrangé <berrange@redhat.com> wrote:
>>>
>>> On Mon, Nov 03, 2025 at 11:57:50AM -0700, Jon Kohler wrote:
>>>> Increase MAX_MEM_PREALLOC_THREAD_COUNT from 16 to 32. This was last
>>>> touched in 2017 [1] and, since then, physical machine sizes and VMs
>>>> therein have continued to get even bigger, both on average and on the
>>>> extremes.
>>>>
>>>> For very large VMs, using 16 threads to preallocate memory can be a
>>>> non-trivial bottleneck during VM start-up and migration. Increasing
>>>> this limit to 32 threads reduces the time taken for these operations.
>>>>
>>>> Test results from quad socket Intel 8490H (4x 60 cores) show a fairly
>>>> linear gain of 50% with the 2x thread count increase.
>>>>
>>>> ---------------------------------------------
>>>> Idle Guest w/ 2M HugePages | Start-up time
>>>> ---------------------------------------------
>>>> 240 vCPU, 7.5TB (16 threads) | 2m41.955s
>>>> ---------------------------------------------
>>>> 240 vCPU, 7.5TB (32 threads) | 1m19.404s
>>>> ---------------------------------------------
>>>
>>> If we're configuring a guest with 240 vCPUs, then this implies the admin
>>> is expecting that the guest will consume up to 240 host CPUs worth of
>>> compute time.
>>>
>>> What is the purpose of limiting the number of prealloc threads to a
>>> value that is an order of magnitude less than the number of vCPUs the
>>> guest has been given ?
>>
>> Daniel - thanks for the quick review and thoughts here.
>>
>> I looked back through the original commits that led up to the current 16
>> thread max, and it wasn’t immediately clear to me why we clamped it at
>> 16. Perhaps there was some other contention at the time.
>>
>>> Have you measured what startup time would look like with 240 prealloc
>>> threads ? Do we hit some scaling limit before that point making more
>>> prealloc threads counter-productive ?
>>
>> I have, and it isn’t wildly better; it comes down to about 50-ish seconds,
>> as you start running into practical limitations on the speed of memory, as
>> well as context switching if you’re doing other things on the host at the
>> same time.
>>
>> In playing around with some other values, here’s how they shake out:
>> 32 threads: 1m19s
>> 48 threads: 1m4s
>> 64 threads: 59s
>> …
>> 240 threads: 50s
>>
>> This also looks much less exciting when the amount of memory is
>> smaller. For smaller memory sizes (I’m testing with 7.5TB), anything
>> smaller than that gets less and less fun from a speedup perspective.
>>
>> Putting that all together, 32 seemed like a sane number with a solid
>> speedup on fairly modern hardware.
>
> Yep, that's useful background, I've no objection to picking 32.
>
> Perhaps worth putting a bit more of this detail into the
> commit message as background.

Ok, thank you for the advice, I’ll spruce up the commit msg and send a v2.

>
>
> With regards,
> Daniel