[Qemu-devel] [PATCH v2] mem-prealloc: reduce large guest start-up and migration time.

Jitendra Kolhe posted 1 patch 7 years, 2 months ago
git fetch https://github.com/patchew-project/qemu tags/patchew/1486976456-17657-1-git-send-email-jitendra.kolhe@hpe.com
[Qemu-devel] [PATCH v2] mem-prealloc: reduce large guest start-up and migration time.
Posted by Jitendra Kolhe 7 years, 2 months ago
Using "-mem-prealloc" option for a large guest leads to higher guest
start-up and migration time. This is because with "-mem-prealloc" option
qemu tries to map every guest page (create address translations), and
make sure the pages are available during runtime. virsh/libvirt by
default, seems to use "-mem-prealloc" option in case the guest is
configured to use huge pages. The patch tries to map all guest pages
simultaneously by spawning multiple threads. Currently limiting the
change to QEMU library functions on POSIX compliant host only, as we are
not sure if the problem exists on win32. Below are some stats with
"-mem-prealloc" option for guest configured to use huge pages.

------------------------------------------------------------------------
Idle Guest      | Start-up time | Migration time
------------------------------------------------------------------------
Guest stats with 2M HugePage usage - single threaded (existing code)
------------------------------------------------------------------------
64 Core - 4TB   | 54m11.796s    | 75m43.843s
64 Core - 1TB   | 8m56.576s     | 14m29.049s
64 Core - 256GB | 2m11.245s     | 3m26.598s
------------------------------------------------------------------------
Guest stats with 2M HugePage usage - map guest pages using 8 threads
------------------------------------------------------------------------
64 Core - 4TB   | 5m1.027s      | 34m10.565s
64 Core - 1TB   | 1m10.366s     | 8m28.188s
64 Core - 256GB | 0m19.040s     | 2m10.148s
-----------------------------------------------------------------------
Guest stats with 2M HugePage usage - map guest pages using 16 threads
-----------------------------------------------------------------------
64 Core - 4TB   | 1m58.970s     | 31m43.400s
64 Core - 1TB   | 0m39.885s     | 7m55.289s
64 Core - 256GB | 0m11.960s     | 2m0.135s
-----------------------------------------------------------------------
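
For reference, the kind of invocation measured above would look roughly
like this (sizes and paths are illustrative; libvirt generates a more
elaborate command line):

    qemu-system-x86_64 -m 256G -smp 64 \
        -mem-prealloc -mem-path /dev/hugepages \
        ...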

Changed in v2:
 - modify number of memset threads spawned to min(smp_cpus, 16).
 - removed 64GB memory restriction for spawning memset threads.

Signed-off-by: Jitendra Kolhe <jitendra.kolhe@hpe.com>
---
 backends/hostmem.c   |  4 ++--
 exec.c               |  2 +-
 include/qemu/osdep.h |  3 ++-
 util/oslib-posix.c   | 68 +++++++++++++++++++++++++++++++++++++++++++++++-----
 util/oslib-win32.c   |  3 ++-
 5 files changed, 69 insertions(+), 11 deletions(-)

diff --git a/backends/hostmem.c b/backends/hostmem.c
index 7f5de70..162c218 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -224,7 +224,7 @@ static void host_memory_backend_set_prealloc(Object *obj, bool value,
         void *ptr = memory_region_get_ram_ptr(&backend->mr);
         uint64_t sz = memory_region_size(&backend->mr);
 
-        os_mem_prealloc(fd, ptr, sz, &local_err);
+        os_mem_prealloc(fd, ptr, sz, smp_cpus, &local_err);
         if (local_err) {
             error_propagate(errp, local_err);
             return;
@@ -328,7 +328,7 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
          */
         if (backend->prealloc) {
             os_mem_prealloc(memory_region_get_fd(&backend->mr), ptr, sz,
-                            &local_err);
+                            smp_cpus, &local_err);
             if (local_err) {
                 goto out;
             }
diff --git a/exec.c b/exec.c
index 8b9ed73..53afcd2 100644
--- a/exec.c
+++ b/exec.c
@@ -1379,7 +1379,7 @@ static void *file_ram_alloc(RAMBlock *block,
     }
 
     if (mem_prealloc) {
-        os_mem_prealloc(fd, area, memory, errp);
+        os_mem_prealloc(fd, area, memory, smp_cpus, errp);
         if (errp && *errp) {
             goto error;
         }
diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index 56c9e22..fb1d22b 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -401,7 +401,8 @@ unsigned long qemu_getauxval(unsigned long type);
 
 void qemu_set_tty_echo(int fd, bool echo);
 
-void os_mem_prealloc(int fd, char *area, size_t sz, Error **errp);
+void os_mem_prealloc(int fd, char *area, size_t sz, int smp_cpus,
+                     Error **errp);
 
 int qemu_read_password(char *buf, int buf_size);
 
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index f631464..17da029 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -55,6 +55,16 @@
 #include "qemu/error-report.h"
 #endif
 
+#define MAX_MEM_PREALLOC_THREAD_COUNT 16
+struct PageRange {
+    char *addr;
+    uint64_t numpages;
+    uint64_t hpagesize;
+};
+typedef struct PageRange PageRange;
+
+static PageRange *page_range;
+
 int qemu_get_thread_id(void)
 {
 #if defined(__linux__)
@@ -323,7 +333,56 @@ static void sigbus_handler(int signal)
     siglongjmp(sigjump, 1);
 }
 
-void os_mem_prealloc(int fd, char *area, size_t memory, Error **errp)
+static void *do_touch_pages(void *arg)
+{
+    PageRange *range = (PageRange *)arg;
+    char *addr = range->addr;
+    uint64_t numpages = range->numpages;
+    uint64_t hpagesize = range->hpagesize;
+    int i = 0;
+
+    for (i = 0; i < numpages; i++) {
+        memset(addr, 0, 1);
+        addr += hpagesize;
+    }
+    return NULL;
+}
+
+static void touch_all_pages(char *area, size_t hpagesize, size_t numpages,
+                            int smp_cpus)
+{
+    QemuThread page_threads[MAX_MEM_PREALLOC_THREAD_COUNT];
+    uint64_t numpages_per_thread, size_per_thread;
+    char *addr = area;
+    int i = 0;
+    int num_threads = MIN(smp_cpus, MAX_MEM_PREALLOC_THREAD_COUNT);
+
+    page_range = g_new0(PageRange, num_threads);
+    numpages_per_thread = (numpages / num_threads);
+    size_per_thread = (hpagesize * numpages_per_thread);
+    for (i = 0; i < (num_threads - 1); i++) {
+        page_range[i].addr = addr;
+        page_range[i].numpages = numpages_per_thread;
+        page_range[i].hpagesize = hpagesize;
+        qemu_thread_create(page_threads + i, "touch_pages",
+                           do_touch_pages, (page_range + i),
+                           QEMU_THREAD_JOINABLE);
+        addr += size_per_thread;
+        numpages -= numpages_per_thread;
+    }
+    for (i = 0; i < numpages; i++) {
+        memset(addr, 0, 1);
+        addr += hpagesize;
+    }
+    for (i = 0; i < (num_threads - 1); i++) {
+        qemu_thread_join(page_threads + i);
+    }
+    g_free(page_range);
+    page_range = NULL;
+}
+
+void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
+                     Error **errp)
 {
     int ret;
     struct sigaction act, oldact;
@@ -349,14 +408,11 @@ void os_mem_prealloc(int fd, char *area, size_t memory, Error **errp)
         error_setg(errp, "os_mem_prealloc: Insufficient free host memory "
             "pages available to allocate guest RAM\n");
     } else {
-        int i;
         size_t hpagesize = qemu_fd_getpagesize(fd);
         size_t numpages = DIV_ROUND_UP(memory, hpagesize);
 
-        /* MAP_POPULATE silently ignores failures */
-        for (i = 0; i < numpages; i++) {
-            memset(area + (hpagesize * i), 0, 1);
-        }
+        /* touch pages simultaneously */
+        touch_all_pages(area, hpagesize, numpages, smp_cpus);
     }
 
     ret = sigaction(SIGBUS, &oldact, NULL);
diff --git a/util/oslib-win32.c b/util/oslib-win32.c
index 0b1890f..80e4668 100644
--- a/util/oslib-win32.c
+++ b/util/oslib-win32.c
@@ -541,7 +541,8 @@ int getpagesize(void)
     return system_info.dwPageSize;
 }
 
-void os_mem_prealloc(int fd, char *area, size_t memory, Error **errp)
+void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
+                     Error **errp)
 {
     int i;
     size_t pagesize = getpagesize();
-- 
1.8.3.1


Re: [Qemu-devel] [PATCH v2] mem-prealloc: reduce large guest start-up and migration time.
Posted by Igor Mammedov 7 years, 2 months ago
On Mon, 13 Feb 2017 14:30:56 +0530
Jitendra Kolhe <jitendra.kolhe@hpe.com> wrote:

> Using the "-mem-prealloc" option for a large guest leads to higher guest
> start-up and migration times. [...]
> 
> +#define MAX_MEM_PREALLOC_THREAD_COUNT 16
Running with -smp 16 or bigger on a host with fewer than 16 CPUs
would be not quite optimal.
Why not change the MAX_MEM_PREALLOC_THREAD_COUNT constant to
something like sysconf(_SC_NPROCESSORS_ONLN)?
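
(A minimal sketch of that suggestion, assuming QEMU's MIN macro and the
POSIX sysconf() call; variable names are illustrative and this is not
part of the patch:)

    /* Clamp the memset thread count to the number of online host CPUs.
     * sysconf() may return -1 on error, so fall back to one thread.
     */
    long host_cpus = sysconf(_SC_NPROCESSORS_ONLN);
    int num_threads = MIN(smp_cpus, host_cpus > 0 ? (int)host_cpus : 1);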


Re: [Qemu-devel] [PATCH v2] mem-prealloc: reduce large guest start-up and migration time.
Posted by Daniel P. Berrange 7 years, 2 months ago
On Mon, Feb 13, 2017 at 11:45:46AM +0100, Igor Mammedov wrote:
> On Mon, 13 Feb 2017 14:30:56 +0530
> Jitendra Kolhe <jitendra.kolhe@hpe.com> wrote:
> 
> > Using the "-mem-prealloc" option for a large guest leads to higher guest
> > start-up and migration times. [...]
> > 
> > +#define MAX_MEM_PREALLOC_THREAD_COUNT 16
> Running with -smp 16 or bigger on a host with fewer than 16 CPUs
> would be not quite optimal.
> Why not change the MAX_MEM_PREALLOC_THREAD_COUNT constant to
> something like sysconf(_SC_NPROCESSORS_ONLN)?

The point is to not consume more host resources than would otherwise
be consumed by running the guest CPUs. I.e., if running a KVM guest
with -smp 4 on a 16-CPU host, QEMU should not consume more than
4 pCPUs' worth of resources on the host. Using sysconf would cause
QEMU to consume all host resources, likely harming other guests'
workloads.

If the person launching QEMU gives a -smp value that's larger than
the host CPU count, then they've already accepted that they're
asking QEMU to do more than the host is really capable of. IOW, I
don't think we need to special-case memsetting for that, since
VCPU execution itself is already going to overcommit the host.

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://entangle-photo.org       -o-    http://search.cpan.org/~danberr/ :|

Re: [Qemu-devel] [PATCH v2] mem-prealloc: reduce large guest start-up and migration time.
Posted by Igor Mammedov 7 years, 2 months ago
On Mon, 13 Feb 2017 11:23:17 +0000
"Daniel P. Berrange" <berrange@redhat.com> wrote:

> On Mon, Feb 13, 2017 at 11:45:46AM +0100, Igor Mammedov wrote:
> > On Mon, 13 Feb 2017 14:30:56 +0530
> > Jitendra Kolhe <jitendra.kolhe@hpe.com> wrote:
> >   
> > > Using the "-mem-prealloc" option for a large guest leads to higher guest
> > > start-up and migration times. [...]
> > > 
> > > +#define MAX_MEM_PREALLOC_THREAD_COUNT 16  
> > Running with -smp 16 or bigger on a host with fewer than 16 CPUs
> > would be not quite optimal.
> > Why not change the MAX_MEM_PREALLOC_THREAD_COUNT constant to
> > something like sysconf(_SC_NPROCESSORS_ONLN)?
> 
> The point is to not consume more host resources than would otherwise
> be consumed by running the guest CPUs. I.e., if running a KVM guest
> with -smp 4 on a 16-CPU host, QEMU should not consume more than
> 4 pCPUs' worth of resources on the host. Using sysconf would cause
> QEMU to consume all host resources, likely harming other guests'
> workloads.
> 
> If the person launching QEMU gives a -smp value that's larger than
> the host CPU count, then they've already accepted that they're
> asking QEMU to do more than the host is really capable of. IOW, I
> don't think we need to special-case memsetting for that, since
> VCPU execution itself is already going to overcommit the host.
Doing overcommit at preallocation time doesn't make much sense.
If MAX_MEM_PREALLOC_THREAD_COUNT is replaced with
sysconf(_SC_NPROCESSORS_ONLN), then QEMU will end up with
MIN(-smp, sysconf(_SC_NPROCESSORS_ONLN)), which puts a cap on the
upper value and avoids useless overcommit at preallocation time.


Re: [Qemu-devel] [PATCH v2] mem-prealloc: reduce large guest start-up and migration time.
Posted by Jitendra Kolhe 7 years, 2 months ago
On 2/13/2017 5:34 PM, Igor Mammedov wrote:
> On Mon, 13 Feb 2017 11:23:17 +0000
> "Daniel P. Berrange" <berrange@redhat.com> wrote:
> 
>> On Mon, Feb 13, 2017 at 11:45:46AM +0100, Igor Mammedov wrote:
>>> On Mon, 13 Feb 2017 14:30:56 +0530
>>> Jitendra Kolhe <jitendra.kolhe@hpe.com> wrote:
>>>   
>>>> Using the "-mem-prealloc" option for a large guest leads to higher guest
>>>> start-up and migration times. [...]
>>>> 
>>>> +#define MAX_MEM_PREALLOC_THREAD_COUNT 16  
>>> Running with -smp 16 or bigger on a host with fewer than 16 CPUs
>>> would be not quite optimal.
>>> Why not change the MAX_MEM_PREALLOC_THREAD_COUNT constant to
>>> something like sysconf(_SC_NPROCESSORS_ONLN)?
>>
>> The point is to not consume more host resources than would otherwise
>> be consumed by running the guest CPUs. I.e., if running a KVM guest
>> with -smp 4 on a 16-CPU host, QEMU should not consume more than
>> 4 pCPUs' worth of resources on the host. Using sysconf would cause
>> QEMU to consume all host resources, likely harming other guests'
>> workloads.
>>
>> If the person launching QEMU gives a -smp value that's larger than
>> the host CPU count, then they've already accepted that they're
>> asking QEMU to do more than the host is really capable of. IOW, I
>> don't think we need to special-case memsetting for that, since
>> VCPU execution itself is already going to overcommit the host.
> Doing overcommit at preallocation time doesn't make much sense.
> If MAX_MEM_PREALLOC_THREAD_COUNT is replaced with
> sysconf(_SC_NPROCESSORS_ONLN), then QEMU will end up with
> MIN(-smp, sysconf(_SC_NPROCESSORS_ONLN)), which puts a cap on the
> upper value and avoids useless overcommit at preallocation time.
> 

I agree, we should consider the case where we run with -smp >= 16,
which is overcommitted on a host with < 16 CPUs. At the same time we
should also make sure that we don't end up spawning too many memset
threads. For example, I have been running fat guests with -smp > 64
on hosts with 384 CPUs.

Thanks,
- Jitendra


Re: [Qemu-devel] [PATCH v2] mem-prealloc: reduce large guest start-up and migration time.
Posted by Jitendra Kolhe 7 years, 2 months ago
On 2/13/2017 9:22 PM, Jitendra Kolhe wrote:
> On 2/13/2017 5:34 PM, Igor Mammedov wrote:
>> On Mon, 13 Feb 2017 11:23:17 +0000
>> "Daniel P. Berrange" <berrange@redhat.com> wrote:
>>
>>> On Mon, Feb 13, 2017 at 11:45:46AM +0100, Igor Mammedov wrote:
>>>> On Mon, 13 Feb 2017 14:30:56 +0530
>>>> Jitendra Kolhe <jitendra.kolhe@hpe.com> wrote:
>>>>   
>>>>> Using the "-mem-prealloc" option for a large guest leads to higher guest
>>>>> start-up and migration times. [...]
>>>>> 
>>>>> +#define MAX_MEM_PREALLOC_THREAD_COUNT 16  
>>>> Running with -smp 16 or bigger on a host with fewer than 16 CPUs
>>>> would be not quite optimal.
>>>> Why not change the MAX_MEM_PREALLOC_THREAD_COUNT constant to
>>>> something like sysconf(_SC_NPROCESSORS_ONLN)?
>>>
>>> The point is to not consume more host resources than would otherwise
>>> be consumed by running the guest CPUs. I.e., if running a KVM guest
>>> with -smp 4 on a 16-CPU host, QEMU should not consume more than
>>> 4 pCPUs' worth of resources on the host. Using sysconf would cause
>>> QEMU to consume all host resources, likely harming other guests'
>>> workloads.
>>>
>>> If the person launching QEMU gives a -smp value that's larger than
>>> the host CPU count, then they've already accepted that they're
>>> asking QEMU to do more than the host is really capable of. IOW, I
>>> don't think we need to special-case memsetting for that, since
>>> VCPU execution itself is already going to overcommit the host.
>> Doing overcommit at preallocation time doesn't make much sense.
>> If MAX_MEM_PREALLOC_THREAD_COUNT is replaced with
>> sysconf(_SC_NPROCESSORS_ONLN), then QEMU will end up with
>> MIN(-smp, sysconf(_SC_NPROCESSORS_ONLN)), which puts a cap on the
>> upper value and avoids useless overcommit at preallocation time.
>>
> 
> I agree, we should consider the case where we run with -smp >= 16,
> which is overcommitted on a host with < 16 CPUs. At the same time we
> should also make sure that we don't end up spawning too many memset
> threads. For example, I have been running fat guests with -smp > 64
> on hosts with 384 CPUs.
> 
How about putting a cap on MAX_MEM_PREALLOC_THREAD_COUNT of
MIN(sysconf(_SC_NPROCESSORS_ONLN), 16)? The number of memset threads
can then be calculated as MIN(smp_cpus, MAX_MEM_PREALLOC_THREAD_COUNT).
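
(A rough sketch of that capping scheme, with a hypothetical helper name
and an assumed fallback for sysconf() failure; not from the patch:)

    static int get_memset_num_threads(int smp_cpus)
    {
        long host_procs = sysconf(_SC_NPROCESSORS_ONLN);

        /* Cap at both the online host CPU count and the compile-time
         * maximum, then at the guest's -smp value. */
        if (host_procs > 0) {
            return MIN(smp_cpus, MIN((int)host_procs,
                                     MAX_MEM_PREALLOC_THREAD_COUNT));
        }
        return 1; /* assumed fallback: single-threaded preallocation */
    }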

Thanks,
- Jitendra


Re: [Qemu-devel] [PATCH v2] mem-prealloc: reduce large guest start-up and migration time.
Posted by Igor Mammedov 7 years, 2 months ago
On Tue, 14 Feb 2017 11:53:11 +0530
Jitendra Kolhe <jitendra.kolhe@hpe.com> wrote:

> [...]
> 
> How about putting a cap on MAX_MEM_PREALLOC_THREAD_COUNT of
> MIN(sysconf(_SC_NPROCESSORS_ONLN), 16)? The number of memset threads
> can then be calculated as MIN(smp_cpus, MAX_MEM_PREALLOC_THREAD_COUNT).
It looks OK to me.
