Increase MAX_MEM_PREALLOC_THREAD_COUNT from 16 to 32. This was last
touched in 2017 [1] and, since then, physical machine sizes and VMs
therein have continued to get even bigger, both on average and on the
extremes.
For very large VMs, using 16 threads to preallocate memory can be a
non-trivial bottleneck during VM start-up and migration. Increasing
this limit to 32 threads reduces the time taken for these operations.
Test results from quad socket Intel 8490H (4x 60 cores) show a fairly
linear gain of 50% with the 2x thread count increase.
---------------------------------------------
Idle Guest w/ 2M HugePages | Start-up time
---------------------------------------------
240 vCPU, 7.5TB (16 threads) | 2m41.955s
---------------------------------------------
240 vCPU, 7.5TB (32 threads) | 1m19.404s
---------------------------------------------
[1] 1e356fc14bea ("mem-prealloc: reduce large guest start-up and migration time.")
Signed-off-by: Jon Kohler <jon@nutanix.com>
---
util/oslib-posix.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 3c14b72665..dc001da66d 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -61,7 +61,7 @@
#include "qemu/memalign.h"
#include "qemu/mmap-alloc.h"
-#define MAX_MEM_PREALLOC_THREAD_COUNT 16
+#define MAX_MEM_PREALLOC_THREAD_COUNT 32
struct MemsetThread;
--
2.43.0
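
For context, the constant being raised here caps how many memset threads are
used to preallocate each memory region. A minimal sketch of that kind of
clamping logic (illustrative only; prealloc_num_threads() below is a
hypothetical helper, not the code in util/oslib-posix.c):

/*
 * Illustrative sketch only -- prealloc_num_threads() is a hypothetical
 * helper, not the code in util/oslib-posix.c. The idea: the effective
 * thread count is the smallest of the online host CPUs, any requested
 * thread count, and the compile-time cap, and never more threads than
 * there are pages to touch.
 */
#include <stdio.h>
#include <unistd.h>

#define MAX_MEM_PREALLOC_THREAD_COUNT 32

static int prealloc_num_threads(size_t numpages, int requested)
{
    long host_procs = sysconf(_SC_NPROCESSORS_ONLN);
    long n = host_procs > 0 ? host_procs : 1;

    if (requested > 0 && requested < n) {
        n = requested;
    }
    if (n > MAX_MEM_PREALLOC_THREAD_COUNT) {
        n = MAX_MEM_PREALLOC_THREAD_COUNT;
    }
    if ((size_t)n > numpages) {
        n = numpages ? (long)numpages : 1;
    }
    return (int)n;
}

int main(void)
{
    /* 7.5 TiB of 2M huge pages, as in the test configuration above */
    size_t numpages = (7680ULL * 1024 * 1024 * 1024) / (2 * 1024 * 1024);

    printf("prealloc threads: %d\n", prealloc_num_threads(numpages, 240));
    return 0;
}

With this style of clamp, even a request for 240 threads on a 240-CPU host is
bounded by the compile-time constant, which is exactly the limit the patch
raises from 16 to 32.
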
On Mon, Nov 03, 2025 at 11:57:50AM -0700, Jon Kohler wrote:
> Increase MAX_MEM_PREALLOC_THREAD_COUNT from 16 to 32. This was last
> touched in 2017 [1] and, since then, physical machine sizes and VMs
> therein have continued to get even bigger, both on average and on the
> extremes.
>
> For very large VMs, using 16 threads to preallocate memory can be a
> non-trivial bottleneck during VM start-up and migration. Increasing
> this limit to 32 threads reduces the time taken for these operations.
>
> Test results from quad socket Intel 8490H (4x 60 cores) show a fairly
> linear gain of 50% with the 2x thread count increase.
>
> ---------------------------------------------
> Idle Guest w/ 2M HugePages | Start-up time
> ---------------------------------------------
> 240 vCPU, 7.5TB (16 threads) | 2m41.955s
> ---------------------------------------------
> 240 vCPU, 7.5TB (32 threads) | 1m19.404s
> ---------------------------------------------
If we're configuring a guest with 240 vCPUs, then this implies the admin
is expecting that the guest will consume up to 240 host CPUs worth of
compute time.
What is the purpose of limiting the number of prealloc threads to a
value that is an order of magnitude less than the number of vCPUs the
guest has been given ?
Have you measured what startup time would look like with 240 prealloc
threads ? Do we hit some scaling limit before that point making more
prealloc threads counter-productive ?
I guess there could be a different impact for hot-add vs cold add. With
cold startup the vCPU threads are not yet consuming CPU time, so we
can reasonably consume that resource for prealloc, whereas for
hot-add any prealloc is on top of what vCPUs are already consuming.
> [1] 1e356fc14bea ("mem-prealloc: reduce large guest start-up and migration time.")
>
> Signed-off-by: Jon Kohler <jon@nutanix.com>
> ---
> util/oslib-posix.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> index 3c14b72665..dc001da66d 100644
> --- a/util/oslib-posix.c
> +++ b/util/oslib-posix.c
> @@ -61,7 +61,7 @@
> #include "qemu/memalign.h"
> #include "qemu/mmap-alloc.h"
>
> -#define MAX_MEM_PREALLOC_THREAD_COUNT 16
> +#define MAX_MEM_PREALLOC_THREAD_COUNT 32
>
> struct MemsetThread;
>
> --
> 2.43.0
>
>
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
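
For reference, the preallocation being discussed amounts to a pool of worker
threads, each touching the pages of one slice of the region so the kernel
faults them in. A minimal sketch using POSIX threads, illustrative only and
not QEMU's actual implementation in util/oslib-posix.c:

/*
 * Illustrative sketch only (POSIX threads; compile with -pthread) --
 * not QEMU's actual implementation in util/oslib-posix.c. Each worker
 * writes one byte per page in its slice so the kernel populates
 * (faults in and clears) the pages; the caller waits for all workers.
 */
#include <pthread.h>
#include <stdlib.h>

struct touch_arg {
    char *start;
    size_t npages;
    size_t pagesize;
};

static void *touch_pages(void *opaque)
{
    struct touch_arg *a = opaque;

    for (size_t i = 0; i < a->npages; i++) {
        a->start[i * a->pagesize] = 0; /* first touch populates the page */
    }
    return NULL;
}

static void prealloc_parallel(char *area, size_t npages, size_t pagesize,
                              int nthreads)
{
    pthread_t *tids = calloc(nthreads, sizeof(*tids));
    struct touch_arg *args = calloc(nthreads, sizeof(*args));
    size_t per = npages / nthreads;

    for (int i = 0; i < nthreads; i++) {
        args[i].start = area + (size_t)i * per * pagesize;
        /* last worker picks up the remainder */
        args[i].npages = (i == nthreads - 1) ? npages - per * i : per;
        args[i].pagesize = pagesize;
        pthread_create(&tids[i], NULL, touch_pages, &args[i]);
    }
    for (int i = 0; i < nthreads; i++) {
        pthread_join(tids[i], NULL);
    }
    free(args);
    free(tids);
}

Splitting the range this way is why the wall-clock time scales with the thread
count, at least until the memory subsystem itself becomes the bottleneck.
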
> On Nov 3, 2025, at 4:14 PM, Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Mon, Nov 03, 2025 at 11:57:50AM -0700, Jon Kohler wrote:
>> Increase MAX_MEM_PREALLOC_THREAD_COUNT from 16 to 32. This was last
>> touched in 2017 [1] and, since then, physical machine sizes and VMs
>> therein have continued to get even bigger, both on average and on the
>> extremes.
>>
>> For very large VMs, using 16 threads to preallocate memory can be a
>> non-trivial bottleneck during VM start-up and migration. Increasing
>> this limit to 32 threads reduces the time taken for these operations.
>>
>> Test results from quad socket Intel 8490H (4x 60 cores) show a fairly
>> linear gain of 50% with the 2x thread count increase.
>>
>> ---------------------------------------------
>> Idle Guest w/ 2M HugePages | Start-up time
>> ---------------------------------------------
>> 240 vCPU, 7.5TB (16 threads) | 2m41.955s
>> ---------------------------------------------
>> 240 vCPU, 7.5TB (32 threads) | 1m19.404s
>> ---------------------------------------------
>
> If we're configuring a guest with 240 vCPUs, then this implies the admin
> is expecting that the guest will consume up to 240 host CPUs worth of
> compute time.
>
> What is the purpose of limiting the number of prealloc threads to a
> value that is an order of magnitude less than the number of vCPUs the
> guest has been given ?

Daniel - thanks for the quick review and thoughts here.

I looked back through the original commits that led up to the current 16
thread max, and it wasn’t immediately clear to me why we clamped it at
16. Perhaps there was some other contention at the time.

> Have you measured what startup time would look like with 240 prealloc
> threads ? Do we hit some scaling limit before that point making more
> prealloc threads counter-productive ?

I have, and it isn’t wildly better; it comes down to about 50-ish seconds,
as you start running into practical limitations on the speed of memory, as
well as context switching if you’re doing other things on the host at the
same time.

In playing around with some other values, here’s how they shake out:
32 threads: 1m19s
48 threads: 1m4s
64 threads: 59s
…
240 threads: 50s

This also looks much less exciting when the amount of memory is
smaller. For smaller memory sizes (I’m testing with 7.5TB), anything
smaller than that gets less and less fun from a speedup perspective.

Putting that all together, 32 seemed like a sane number with a solid
speedup on fairly modern hardware.

For posterity, I am testing with kernel 6.12 LTS, but could also try newer
kernels if you were curious. Most of the time is spent in clear_pages_erms,
and outside of an experimental series on LKML [1], there really aren’t any
improvements on this state of the art.

For posterity, also adding Ankur into the mix as the author of that series,
as this is something they’ve been looking at for a while I believe.

[1] https://patchwork.kernel.org/project/linux-mm/cover/20251027202109.678022-1-ankur.a.arora@oracle.com/

> I guess there could be a different impact for hot-add vs cold add. With
> cold startup the vCPU threads are not yet consuming CPU time, so we
> can reasonably consume that resource for prealloc, whereas for
> hot-add any prealloc is on top of what vCPUs are already consuming.
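
As a rough back-of-envelope on the numbers above (a hypothetical calculation,
treating 7.5TB as 7.5 TiB of guest memory), the effective aggregate fill rate
at each reported time works out as follows:

/*
 * Hypothetical back-of-envelope helper (not part of QEMU): compute the
 * effective aggregate fill rate implied by the start-up times reported
 * in this thread, treating 7.5TB as 7.5 TiB of guest memory.
 */
#include <stdio.h>

int main(void)
{
    const double bytes = 7.5 * 1024.0 * 1024.0 * 1024.0 * 1024.0; /* 7.5 TiB */
    const struct { int threads; double secs; } runs[] = {
        { 16, 161.955 }, { 32, 79.404 }, { 48, 64.0 }, { 64, 59.0 }, { 240, 50.0 },
    };

    for (unsigned i = 0; i < sizeof(runs) / sizeof(runs[0]); i++) {
        printf("%3d threads: ~%.0f GiB/s aggregate\n", runs[i].threads,
               bytes / runs[i].secs / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}

The flattening of the curve past roughly 64 threads is consistent with the
memory subsystem, rather than the thread count, becoming the limit, which
matches the "practical limitations on the speed of memory" noted above.
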
On Tue, Nov 04, 2025 at 08:33:05PM +0000, Jon Kohler wrote:
>
>
> > On Nov 3, 2025, at 4:14 PM, Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > On Mon, Nov 03, 2025 at 11:57:50AM -0700, Jon Kohler wrote:
> >> Increase MAX_MEM_PREALLOC_THREAD_COUNT from 16 to 32. This was last
> >> touched in 2017 [1] and, since then, physical machine sizes and VMs
> >> therein have continued to get even bigger, both on average and on the
> >> extremes.
> >>
> >> For very large VMs, using 16 threads to preallocate memory can be a
> >> non-trivial bottleneck during VM start-up and migration. Increasing
> >> this limit to 32 threads reduces the time taken for these operations.
> >>
> >> Test results from quad socket Intel 8490H (4x 60 cores) show a fairly
> >> linear gain of 50% with the 2x thread count increase.
> >>
> >> ---------------------------------------------
> >> Idle Guest w/ 2M HugePages | Start-up time
> >> ---------------------------------------------
> >> 240 vCPU, 7.5TB (16 threads) | 2m41.955s
> >> ---------------------------------------------
> >> 240 vCPU, 7.5TB (32 threads) | 1m19.404s
> >> ---------------------------------------------
> >
> > If we're configuring a guest with 240 vCPUs, then this implies the admin
> > is expecting that the guest will consume up to 240 host CPUs worth of
> > compute time.
> >
> > What is the purpose of limiting the number of prealloc threads to a
> > value that is an order of magnitude less than the number of vCPUs the
> > guest has been given ?
>
> Daniel - thanks for the quick review and thoughts here.
>
> I looked back through the original commits that led up to the current 16
> thread max, and it wasn’t immediately clear to me why we clamped it at
> 16. Perhaps there was some other contention at the time.
>
> > Have you measured what startup time would look like with 240 prealloc
> > threads ? Do we hit some scaling limit before that point making more
> > prealloc threads counter-productive ?
>
> I have, and it isn’t wildly better; it comes down to about 50-ish seconds,
> as you start running into practical limitations on the speed of memory, as
> well as context switching if you’re doing other things on the host at the
> same time.
>
> In playing around with some other values, here’s how they shake out:
> 32 threads: 1m19s
> 48 threads: 1m4s
> 64 threads: 59s
> …
> 240 threads: 50s
>
> This also looks much less exciting when the amount of memory is
> smaller. For smaller memory sizes (I’m testing with 7.5TB), anything
> smaller than that gets less and less fun from a speedup perspective.
>
> Putting that all together, 32 seemed like a sane number with a solid
> speedup on fairly modern hardware.

Yep, that's useful background, I've no objection to picking 32.

Perhaps worth putting a bit more of this detail into the
commit message as background.

With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
> On Nov 5, 2025, at 4:05 AM, Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Tue, Nov 04, 2025 at 08:33:05PM +0000, Jon Kohler wrote:
>>
>>
>>> On Nov 3, 2025, at 4:14 PM, Daniel P. Berrangé <berrange@redhat.com> wrote:
>>>
>>> On Mon, Nov 03, 2025 at 11:57:50AM -0700, Jon Kohler wrote:
>>>> Increase MAX_MEM_PREALLOC_THREAD_COUNT from 16 to 32. This was last
>>>> touched in 2017 [1] and, since then, physical machine sizes and VMs
>>>> therein have continued to get even bigger, both on average and on the
>>>> extremes.
>>>>
>>>> For very large VMs, using 16 threads to preallocate memory can be a
>>>> non-trivial bottleneck during VM start-up and migration. Increasing
>>>> this limit to 32 threads reduces the time taken for these operations.
>>>>
>>>> Test results from quad socket Intel 8490H (4x 60 cores) show a fairly
>>>> linear gain of 50% with the 2x thread count increase.
>>>>
>>>> ---------------------------------------------
>>>> Idle Guest w/ 2M HugePages | Start-up time
>>>> ---------------------------------------------
>>>> 240 vCPU, 7.5TB (16 threads) | 2m41.955s
>>>> ---------------------------------------------
>>>> 240 vCPU, 7.5TB (32 threads) | 1m19.404s
>>>> ---------------------------------------------
>>>
>>> If we're configuring a guest with 240 vCPUs, then this implies the admin
>>> is expecting that the guest will consume up to 240 host CPUs worth of
>>> compute time.
>>>
>>> What is the purpose of limiting the number of prealloc threads to a
>>> value that is an order of magnitude less than the number of vCPUs the
>>> guest has been given ?
>>
>> Daniel - thanks for the quick review and thoughts here.
>>
>> I looked back through the original commits that led up to the current 16
>> thread max, and it wasn’t immediately clear to me why we clamped it at
>> 16. Perhaps there was some other contention at the time.
>>
>>> Have you measured what startup time would look like with 240 prealloc
>>> threads ? Do we hit some scaling limit before that point making more
>>> prealloc threads counter-productive ?
>>
>> I have, and it isn’t wildly better; it comes down to about 50-ish seconds,
>> as you start running into practical limitations on the speed of memory, as
>> well as context switching if you’re doing other things on the host at the
>> same time.
>>
>> In playing around with some other values, here’s how they shake out:
>> 32 threads: 1m19s
>> 48 threads: 1m4s
>> 64 threads: 59s
>> …
>> 240 threads: 50s
>>
>> This also looks much less exciting when the amount of memory is
>> smaller. For smaller memory sizes (I’m testing with 7.5TB), anything
>> smaller than that gets less and less fun from a speedup perspective.
>>
>> Putting that all together, 32 seemed like a sane number with a solid
>> speedup on fairly modern hardware.
>
> Yep, that's useful background, I've no objection to picking 32.
>
> Perhaps worth putting a bit more of this detail into the
> commit message as background.

Ok, thank you for the advice, I’ll spruce up the commit msg and send a v2.

>
>
> With regards,
> Daniel