[Qemu-devel] [PATCH v3] rcu: reduce more than 7MB heap memory by malloc_trim()

Yang Zhong posted 1 patch 6 years, 4 months ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/1511505030-3669-1-git-send-email-yang.zhong@intel.com
Test checkpatch passed
Test docker passed
Test ppc passed
Test s390x passed
There is a newer version of this series
[Qemu-devel] [PATCH v3] rcu: reduce more than 7MB heap memory by malloc_trim()
Posted by Yang Zhong 6 years, 4 months ago
glibc's malloc handles frequent allocation and freeing of small
chunks poorly: when QEMU frequently allocates and frees small
chunks, glibc does not reuse them from its free list and keeps
allocating from the OS, so the heap grows bigger and bigger.

This patch introduces a call to malloc_trim(), which releases free
heap memory back to the OS.

Below are test results from smaps file.
(1)without patch
55f0783e1000-55f07992a000 rw-p 00000000 00:00 0  [heap]
Size:              21796 kB
Rss:               14260 kB
Pss:               14260 kB

(2)with patch
55cc5fadf000-55cc61008000 rw-p 00000000 00:00 0  [heap]
Size:              21668 kB
Rss:                6940 kB
Pss:                6940 kB

Signed-off-by: Yang Zhong <yang.zhong@intel.com>
---
 configure  | 29 +++++++++++++++++++++++++++++
 util/rcu.c |  6 ++++++
 2 files changed, 35 insertions(+)

diff --git a/configure b/configure
index 0c6e757..6292ab0 100755
--- a/configure
+++ b/configure
@@ -426,6 +426,7 @@ vxhs=""
 supported_cpu="no"
 supported_os="no"
 bogus_os="no"
+malloc_trim="yes"
 
 # parse CC options first
 for opt do
@@ -3857,6 +3858,30 @@ if test "$tcmalloc" = "yes" && test "$jemalloc" = "yes" ; then
     exit 1
 fi
 
+# Even if malloc_trim() is available, these non-libc memory allocators
+# do not support it.
+if test "$tcmalloc" = "yes" || test "$jemalloc" = "yes" ; then
+    if test "$malloc_trim" = "yes" ; then
+        echo "Disabling malloc_trim with non-libc memory allocator"
+    fi
+    malloc_trim="no"
+fi
+
+#######################################
+# malloc_trim
+
+if test "$malloc_trim" != "no" ; then
+    cat > $TMPC << EOF
+#include <malloc.h>
+int main(void) { malloc_trim(0); return 0; }
+EOF
+    if compile_prog "" "" ; then
+        malloc_trim="yes"
+    else
+        malloc_trim="no"
+    fi
+fi
+
 ##########################################
 # tcmalloc probe
 
@@ -6012,6 +6037,10 @@ if test "$opengl" = "yes" ; then
   fi
 fi
 
+if test "$malloc_trim" = "yes" ; then
+  echo "CONFIG_MALLOC_TRIM=y" >> $config_host_mak
+fi
+
 if test "$avx2_opt" = "yes" ; then
   echo "CONFIG_AVX2_OPT=y" >> $config_host_mak
 fi
diff --git a/util/rcu.c b/util/rcu.c
index ca5a63e..f403b77 100644
--- a/util/rcu.c
+++ b/util/rcu.c
@@ -32,6 +32,9 @@
 #include "qemu/atomic.h"
 #include "qemu/thread.h"
 #include "qemu/main-loop.h"
+#if defined(CONFIG_MALLOC_TRIM)
+#include <malloc.h>
+#endif
 
 /*
  * Global grace period counter.  Bit 0 is always one in rcu_gp_ctr.
@@ -272,6 +275,9 @@ static void *call_rcu_thread(void *opaque)
             node->func(node);
         }
         qemu_mutex_unlock_iothread();
+#if defined(CONFIG_MALLOC_TRIM)
+        malloc_trim(4 * 1024 * 1024);
+#endif
     }
     abort();
 }
-- 
1.9.1
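
The heap figures in the commit message are taken from the smaps file of the QEMU process. A minimal sketch of how such figures could be extracted is shown here; heap_stats is a hypothetical helper name, not part of the patch, and it assumes the usual Linux /proc/<pid>/smaps layout (a mapping header line followed by "Field: value kB" lines).

```shell
#!/bin/sh
# heap_stats is a hypothetical helper, not part of the patch.
# It prints the Size, Rss and Pss fields of the [heap] mapping
# found in the smaps file given as $1.
heap_stats() {
    awk '
        # A mapping header starts with an address range; its last
        # field is the mapping name (for example "[heap]").
        /^[0-9a-f]+-[0-9a-f]+ / { in_heap = ($NF == "[heap]") }
        # Inside the heap mapping, keep only the three fields of interest.
        in_heap && /^(Size|Rss|Pss):/ { print $1, $2, $3 }
    ' "$1"
}
```

It would be called as heap_stats /proc/<pid>/smaps, once for a build without the patch and once for a build with it, so the Rss values can be compared.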


Re: [Qemu-devel] [PATCH v3] rcu: reduce more than 7MB heap memory by malloc_trim()
Posted by Stefan Hajnoczi 6 years, 4 months ago
On Fri, Nov 24, 2017 at 02:30:30PM +0800, Yang Zhong wrote:
> diff --git a/configure b/configure
> index 0c6e757..6292ab0 100755
> --- a/configure
> +++ b/configure
> @@ -426,6 +426,7 @@ vxhs=""
>  supported_cpu="no"
>  supported_os="no"
>  bogus_os="no"
> +malloc_trim="yes"

Looks pretty good, sorry I forgot to mention two things:

Please add the --enable-malloc-trim/--disable-malloc-trim options so
it's easy to build QEMU with or without this feature.  For example, if
someone is debugging a performance issue they may wish to rebuild with
--disable-malloc-trim to confirm that trimming hasn't caused a
regression.

Also please change this line to malloc_trim="" so the "Disabling
malloc_trim with non-libc memory allocator" error message is only
printed when --enable-malloc-trim was explicitly given by the user.
Otherwise the message is always printed when QEMU is built with
jemalloc/tcmalloc - that's too noisy.
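
A rough sketch of the behaviour requested above, written as standalone shell functions rather than as an actual change to configure. The function names parse_malloc_trim_opt and resolve_malloc_trim are hypothetical; only the option names, the malloc_trim variable and the message text come from the thread.

```shell
#!/bin/sh
# Sketch only: both functions are hypothetical helpers, not QEMU code.

# Parse the command line in the style of configure's "for opt do" loop.
# Prints "yes", "no", or "" (no option given, so auto-detect later).
parse_malloc_trim_opt() {
    malloc_trim=""
    for opt do
        case "$opt" in
        --enable-malloc-trim) malloc_trim="yes" ;;
        --disable-malloc-trim) malloc_trim="no" ;;
        esac
    done
    echo "$malloc_trim"
}

# $1 = value from option parsing ("yes", "no" or "")
# $2 = "yes" if tcmalloc or jemalloc was selected, otherwise "no"
# Prints the resulting malloc_trim value on stdout. The message goes to
# stderr, and only when the user explicitly asked for malloc_trim.
resolve_malloc_trim() {
    malloc_trim=$1
    alt_allocator=$2
    if test "$alt_allocator" = "yes" ; then
        if test "$malloc_trim" = "yes" ; then
            echo "Disabling malloc_trim with non-libc memory allocator" >&2
        fi
        malloc_trim="no"
    fi
    echo "$malloc_trim"
}
```

With the default set to the empty string, a plain jemalloc/tcmalloc build disables trimming silently, while an empty result with the libc allocator would still fall through to the compile probe.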
Re: [Qemu-devel] [PATCH v3] rcu: reduce more than 7MB heap memory by malloc_trim()
Posted by Shannon Zhao 6 years, 4 months ago
Hi,

On 2017/11/24 14:30, Yang Zhong wrote:
> glibc's malloc handles frequent allocation and freeing of small
> chunks poorly: when QEMU frequently allocates and frees small
> chunks, glibc does not reuse them from its free list and keeps
> allocating from the OS, so the heap grows bigger and bigger.
> 
> This patch introduces a call to malloc_trim(), which releases free
> heap memory back to the OS.
> 
> Below are test results from smaps file.
> (1)without patch
> 55f0783e1000-55f07992a000 rw-p 00000000 00:00 0  [heap]
> Size:              21796 kB
> Rss:               14260 kB
> Pss:               14260 kB
> 
> (2)with patch
> 55cc5fadf000-55cc61008000 rw-p 00000000 00:00 0  [heap]
> Size:              21668 kB
> Rss:                6940 kB
> Pss:                6940 kB
> 
> Signed-off-by: Yang Zhong <yang.zhong@intel.com>
> ---
>  configure  | 29 +++++++++++++++++++++++++++++
>  util/rcu.c |  6 ++++++
>  2 files changed, 35 insertions(+)
> 
> diff --git a/configure b/configure
> index 0c6e757..6292ab0 100755
> --- a/configure
> +++ b/configure
> @@ -426,6 +426,7 @@ vxhs=""
>  supported_cpu="no"
>  supported_os="no"
>  bogus_os="no"
> +malloc_trim="yes"
>  
>  # parse CC options first
>  for opt do
> @@ -3857,6 +3858,30 @@ if test "$tcmalloc" = "yes" && test "$jemalloc" = "yes" ; then
>      exit 1
>  fi
>  
> +# Even if malloc_trim() is available, these non-libc memory allocators
> +# do not support it.
> +if test "$tcmalloc" = "yes" || test "$jemalloc" = "yes" ; then
> +    if test "$malloc_trim" = "yes" ; then
> +        echo "Disabling malloc_trim with non-libc memory allocator"
> +    fi
> +    malloc_trim="no"
> +fi
> +
> +#######################################
> +# malloc_trim
> +
> +if test "$malloc_trim" != "no" ; then
> +    cat > $TMPC << EOF
> +#include <malloc.h>
> +int main(void) { malloc_trim(0); return 0; }
> +EOF
> +    if compile_prog "" "" ; then
> +        malloc_trim="yes"
> +    else
> +        malloc_trim="no"
> +    fi
> +fi
> +
>  ##########################################
>  # tcmalloc probe
>  
> @@ -6012,6 +6037,10 @@ if test "$opengl" = "yes" ; then
>    fi
>  fi
>  
> +if test "$malloc_trim" = "yes" ; then
> +  echo "CONFIG_MALLOC_TRIM=y" >> $config_host_mak
> +fi
> +
>  if test "$avx2_opt" = "yes" ; then
>    echo "CONFIG_AVX2_OPT=y" >> $config_host_mak
>  fi
> diff --git a/util/rcu.c b/util/rcu.c
> index ca5a63e..f403b77 100644
> --- a/util/rcu.c
> +++ b/util/rcu.c
> @@ -32,6 +32,9 @@
>  #include "qemu/atomic.h"
>  #include "qemu/thread.h"
>  #include "qemu/main-loop.h"
> +#if defined(CONFIG_MALLOC_TRIM)
> +#include <malloc.h>
> +#endif
>  
>  /*
>   * Global grace period counter.  Bit 0 is always one in rcu_gp_ctr.
> @@ -272,6 +275,9 @@ static void *call_rcu_thread(void *opaque)
>              node->func(node);
>          }
>          qemu_mutex_unlock_iothread();
> +#if defined(CONFIG_MALLOC_TRIM)
> +        malloc_trim(4 * 1024 * 1024);
> +#endif
>      }
>      abort();
>  }
> 

Looks like this patch introduces a performance regression. With this
patch, booting a VM with 60 SCSI disks on ARM64 takes 200+ seconds
longer.

Thanks,
-- 
Shannon


Re: [Qemu-devel] [PATCH v3] rcu: reduce more than 7MB heap memory by malloc_trim()
Posted by Zhong Yang 6 years, 4 months ago
On Sun, Nov 26, 2017 at 02:17:18PM +0800, Shannon Zhao wrote:
> Hi,
> 
> On 2017/11/24 14:30, Yang Zhong wrote:
> > glibc's malloc handles frequent allocation and freeing of small
> > chunks poorly: when QEMU frequently allocates and frees small
> > chunks, glibc does not reuse them from its free list and keeps
> > allocating from the OS, so the heap grows bigger and bigger.
> > 
> > This patch introduces a call to malloc_trim(), which releases free
> > heap memory back to the OS.
> > 
> > Below are test results from smaps file.
> > (1)without patch
> > 55f0783e1000-55f07992a000 rw-p 00000000 00:00 0  [heap]
> > Size:              21796 kB
> > Rss:               14260 kB
> > Pss:               14260 kB
> > 
> > (2)with patch
> > 55cc5fadf000-55cc61008000 rw-p 00000000 00:00 0  [heap]
> > Size:              21668 kB
> > Rss:                6940 kB
> > Pss:                6940 kB
> > 
> > Signed-off-by: Yang Zhong <yang.zhong@intel.com>
> > ---
> >  configure  | 29 +++++++++++++++++++++++++++++
> >  util/rcu.c |  6 ++++++
> >  2 files changed, 35 insertions(+)
> > 
> > diff --git a/configure b/configure
> > index 0c6e757..6292ab0 100755
> > --- a/configure
> > +++ b/configure
> > @@ -426,6 +426,7 @@ vxhs=""
> >  supported_cpu="no"
> >  supported_os="no"
> >  bogus_os="no"
> > +malloc_trim="yes"
> >  
> >  # parse CC options first
> >  for opt do
> > @@ -3857,6 +3858,30 @@ if test "$tcmalloc" = "yes" && test "$jemalloc" = "yes" ; then
> >      exit 1
> >  fi
> >  
> > +# Even if malloc_trim() is available, these non-libc memory allocators
> > +# do not support it.
> > +if test "$tcmalloc" = "yes" || test "$jemalloc" = "yes" ; then
> > +    if test "$malloc_trim" = "yes" ; then
> > +        echo "Disabling malloc_trim with non-libc memory allocator"
> > +    fi
> > +    malloc_trim="no"
> > +fi
> > +
> > +#######################################
> > +# malloc_trim
> > +
> > +if test "$malloc_trim" != "no" ; then
> > +    cat > $TMPC << EOF
> > +#include <malloc.h>
> > +int main(void) { malloc_trim(0); return 0; }
> > +EOF
> > +    if compile_prog "" "" ; then
> > +        malloc_trim="yes"
> > +    else
> > +        malloc_trim="no"
> > +    fi
> > +fi
> > +
> >  ##########################################
> >  # tcmalloc probe
> >  
> > @@ -6012,6 +6037,10 @@ if test "$opengl" = "yes" ; then
> >    fi
> >  fi
> >  
> > +if test "$malloc_trim" = "yes" ; then
> > +  echo "CONFIG_MALLOC_TRIM=y" >> $config_host_mak
> > +fi
> > +
> >  if test "$avx2_opt" = "yes" ; then
> >    echo "CONFIG_AVX2_OPT=y" >> $config_host_mak
> >  fi
> > diff --git a/util/rcu.c b/util/rcu.c
> > index ca5a63e..f403b77 100644
> > --- a/util/rcu.c
> > +++ b/util/rcu.c
> > @@ -32,6 +32,9 @@
> >  #include "qemu/atomic.h"
> >  #include "qemu/thread.h"
> >  #include "qemu/main-loop.h"
> > +#if defined(CONFIG_MALLOC_TRIM)
> > +#include <malloc.h>
> > +#endif
> >  
> >  /*
> >   * Global grace period counter.  Bit 0 is always one in rcu_gp_ctr.
> > @@ -272,6 +275,9 @@ static void *call_rcu_thread(void *opaque)
> >              node->func(node);
> >          }
> >          qemu_mutex_unlock_iothread();
> > +#if defined(CONFIG_MALLOC_TRIM)
> > +        malloc_trim(4 * 1024 * 1024);
> > +#endif
> >      }
> >      abort();
> >  }
> > 
> 
> Looks like this patch introduces a performance regression. With this
> patch, booting a VM with 60 SCSI disks on ARM64 takes 200+ seconds
> longer.
> 
  Hello Shannon,

  Thanks for your reply!
  To address your concern, I ran comparative VM boot tests; the results are below:

  #test command
  ./qemu-system-x86_64 -enable-kvm -cpu host -m 2G -smp cpus=4,cores=4,\
                       threads=1,sockets=1 -drive format=raw,\
                       file=test.img,index=0,media=disk -nographic

  #without patch
  root@intel-internal-corei7-64:~# systemd-analyze
  Startup finished in 4.979s (kernel) + 1.214s (userspace) = 6.193s

  root@intel-internal-corei7-64:~# systemd-analyze
  Startup finished in 4.922s (kernel) + 1.175s (userspace) = 6.097s

  root@intel-internal-corei7-64:~# systemd-analyze
  Startup finished in 4.990s (kernel) + 1.301s (userspace) = 6.291s

  root@intel-internal-corei7-64:~# systemd-analyze
  Startup finished in 5.063s (kernel) + 1.336s (userspace) = 6.400s

  root@intel-internal-corei7-64:~# systemd-analyze
  Startup finished in 4.820s (kernel) + 1.237s (userspace) = 6.057s

  avg: kernel 4.9548s, userspace 1.2526s


  #with this patch
  root@intel-internal-corei7-64:~# systemd-analyze
  Startup finished in 5.099s (kernel) + 1.579s (userspace) = 6.679s

  root@intel-internal-corei7-64:~# systemd-analyze
  Startup finished in 5.003s (kernel) + 1.343s (userspace) = 6.347s

  root@intel-internal-corei7-64:~# systemd-analyze
  Startup finished in 4.853s (kernel) + 1.220s (userspace) = 6.074s

  root@intel-internal-corei7-64:~# systemd-analyze
  Startup finished in 4.836s (kernel) + 1.111s (userspace) = 5.948s

  root@intel-internal-corei7-64:~# systemd-analyze
  Startup finished in 4.917s (kernel) + 1.166s (userspace) = 6.083s

  avg: kernel 4.9416s, userspace 1.2838s

  From the above test results, there is almost no performance regression
  on the x86 platform. Sorry, I don't have an ARM-based platform at hand,
  so I can't provide the corresponding data. Thanks!
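
  The averages above can be recomputed from saved systemd-analyze output with a short script. This is only a sketch: avg_startup is a hypothetical helper name, and it assumes every result line has the "Startup finished in Xs (kernel) + Ys (userspace) = Zs" form shown above, with all times in plain seconds.

```shell
#!/bin/sh
# avg_startup is a hypothetical helper, not part of the patch.
# It reads saved `systemd-analyze` output on stdin and prints the
# average kernel and userspace times in seconds.
avg_startup() {
    awk '
        /Startup finished in/ {
            for (i = 2; i <= NF; i++) {
                # The time precedes its label, e.g. "4.979s (kernel)".
                # awk converts "4.979s" to 4.979 when used as a number.
                if ($i == "(kernel)")    { kernel += $(i - 1); n++ }
                if ($i == "(userspace)") { user   += $(i - 1) }
            }
        }
        END {
            if (n > 0)
                printf "avg: kernel %.4fs, userspace %.4fs\n", kernel / n, user / n
        }
    '
}
```

  Prompt lines and blank lines in the log are ignored, so the terminal output can be piped in unchanged, for example avg_startup < without-patch.log (the file name is illustrative).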

  Regards,

  Yang


> Thanks,
> -- 
> Shannon

Re: [Qemu-devel] [PATCH v3] rcu: reduce more than 7MB heap memory by malloc_trim()
Posted by Paolo Bonzini 6 years, 4 months ago
On 27/11/2017 04:06, Zhong Yang wrote:
>   #test command
>   ./qemu-system-x86_64 -enable-kvm -cpu host -m 2G -smp cpus=4,cores=4,\
>                        threads=1,sockets=1 -drive format=raw,\
>                        file=test.img,index=0,media=disk -nographic
> 
>   #without patch
>   root@intel-internal-corei7-64:~# systemd-analyze
>   Startup finished in 4.979s (kernel) + 1.214s (userspace) = 6.193s
> 
>   root@intel-internal-corei7-64:~# systemd-analyze
>   Startup finished in 4.922s (kernel) + 1.175s (userspace) = 6.097s
> 
>   root@intel-internal-corei7-64:~# systemd-analyze
>   Startup finished in 4.990s (kernel) + 1.301s (userspace) = 6.291s
> 
>   root@intel-internal-corei7-64:~# systemd-analyze
>   Startup finished in 5.063s (kernel) + 1.336s (userspace) = 6.400s
> 
>   root@intel-internal-corei7-64:~# systemd-analyze
>   Startup finished in 4.820s (kernel) + 1.237s (userspace) = 6.057s
> 
>   avg: kernel 4.9548s, userspace 1.2526s
> 
> 
>   #with this patch
>   root@intel-internal-corei7-64:~# systemd-analyze
>   Startup finished in 5.099s (kernel) + 1.579s (userspace) = 6.679s
> 
>   root@intel-internal-corei7-64:~# systemd-analyze
>   Startup finished in 5.003s (kernel) + 1.343s (userspace) = 6.347s
> 
>   root@intel-internal-corei7-64:~# systemd-analyze
>   Startup finished in 4.853s (kernel) + 1.220s (userspace) = 6.074s
> 
>   root@intel-internal-corei7-64:~# systemd-analyze
>   Startup finished in 4.836s (kernel) + 1.111s (userspace) = 5.948s
> 
>   root@intel-internal-corei7-64:~# systemd-analyze
>   Startup finished in 4.917s (kernel) + 1.166s (userspace) = 6.083s
> 
>   avg: kernel 4.9416s, userspace 1.2838s
> 
>   From the above test results, there is almost no performance regression
>   on the x86 platform. Sorry, I don't have an ARM-based platform at hand,
>   so I can't provide the corresponding data. Thanks!

You are using only one disk; Shannon is using 60.  That may make a
difference, as PCI BAR setup in the guest becomes very expensive as you
add more devices.
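
To get closer to the many-disk case, the drive arguments could be generated instead of typed by hand. The sketch below is an assumption-heavy illustration: Shannon's actual command line is not given in the thread, scsi_disk_args is a hypothetical helper, and the image names (disk0.img, disk1.img, ...) are made up. It only prints arguments and does not start QEMU.

```shell
#!/bin/sh
# scsi_disk_args is a hypothetical helper; the image naming scheme is
# invented for illustration. It prints one -drive/-device pair per line
# for $1 SCSI disks, to be attached to a SCSI controller that the caller
# adds separately (for example -device virtio-scsi-pci).
scsi_disk_args() {
    n=$1
    i=0
    while [ "$i" -lt "$n" ]; do
        printf '%s\n' "-drive file=disk${i}.img,format=raw,if=none,id=drive${i} -device scsi-hd,drive=drive${i}"
        i=$((i + 1))
    done
}
```

The output of scsi_disk_args 60 could then be appended to the test command, with and without the patch, to check whether the boot-time difference shows up on x86 as well.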

Thanks,

Paolo