[Qemu-devel] [PATCH 0/2] linux-user: Change mmap_lock to rwlock
Posted by Richard Henderson 5 years, 10 months ago
The objective here is to avoid serializing the first-fault
and no-fault aarch64 sve loads.

When I first thought of this, we still had tb_lock protecting
code generation and merely needed mmap_lock to prevent mapping
changes while doing so.  Now (or very soon), tb_lock is gone
and user-only code generation dual-purposes the mmap_lock for
the code generation lock.
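
As a rough sketch of the direction (not the patches themselves;
this assumes a plain process-wide pthread_rwlock_t and omits the
recursion counting done by the existing linux-user lock):

/* Sketch only.  A process-wide rwlock replaces the mmap_lock
 * mutex; recursion handling is omitted for brevity. */
#include <pthread.h>

static pthread_rwlock_t mmap_rwlock = PTHREAD_RWLOCK_INITIALIZER;

/* Readers, e.g. SVE first-fault/no-fault loads probing page
 * validity, can now proceed concurrently. */
void mmap_rdlock(void)
{
    pthread_rwlock_rdlock(&mmap_rwlock);
}

/* Writers -- target mmap/munmap/mprotect, and user-only code
 * generation, which dual-purposes this lock -- stay exclusive. */
void mmap_wrlock(void)
{
    pthread_rwlock_wrlock(&mmap_rwlock);
}

void mmap_unlock(void)
{
    pthread_rwlock_unlock(&mmap_rwlock);
}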

There are two other legacy users of mmap_lock within linux-user,
for ppc and mips.  I believe these are dead code from before these
ports were converted to use tcg atomics.

Thoughts?


r~


Based-on: <20180621143715.27176-1-richard.henderson@linaro.org> [tcg-next]
Based-on: <20180621015359.12018-1-richard.henderson@linaro.org> [v5 SVE]


Richard Henderson (2):
  exec: Split mmap_lock to mmap_rdlock/mmap_wrlock
  linux-user: Use pthread_rwlock_t for mmap_rd/wrlock

 include/exec/exec-all.h    |  6 ++--
 accel/tcg/cpu-exec.c       |  8 ++---
 accel/tcg/translate-all.c  |  4 +--
 bsd-user/mmap.c            | 18 +++++++++---
 exec.c                     |  6 ++--
 linux-user/elfload.c       |  2 +-
 linux-user/mips/cpu_loop.c |  2 +-
 linux-user/mmap.c          | 60 +++++++++++++++++++++++++-------------
 linux-user/ppc/cpu_loop.c  |  2 +-
 linux-user/syscall.c       |  4 +--
 target/arm/sve_helper.c    | 10 ++-----
 target/xtensa/op_helper.c  |  2 +-
 12 files changed, 76 insertions(+), 48 deletions(-)

-- 
2.17.1

Re: [Qemu-devel] [PATCH 0/2] linux-user: Change mmap_lock to rwlock
Posted by Emilio G. Cota 5 years, 10 months ago
On Thu, Jun 21, 2018 at 07:36:33 -1000, Richard Henderson wrote:
> The objective here is to avoid serializing the first-fault
> and no-fault aarch64 sve loads.
> 
> When I first thought of this, we still had tb_lock protecting
> code generation and merely needed mmap_lock to prevent mapping
> changes while doing so.  Now (or very soon), tb_lock is gone
> and user-only code generation dual-purposes the mmap_lock for
> the code generation lock.
> 
> There are two other legacy users of mmap_lock within linux-user,
> for ppc and mips.  I believe these are dead code from before these
> ports were converted to use tcg atomics.
> 
> Thoughts?

I'm curious to see how much perf could be gained. It seems that the hold
times in SVE code for readers might not be very large, which
then wouldn't let us amortize the atomic inc of the read lock
(IOW, we might not see much of a difference compared to a regular
mutex).

Are you using any benchmark that shows any perf difference?

Thanks,

		E.


Re: [Qemu-devel] [PATCH 0/2] linux-user: Change mmap_lock to rwlock
Posted by Richard Henderson 5 years, 10 months ago
On 06/22/2018 02:12 PM, Emilio G. Cota wrote:
> I'm curious to see how much perf could be gained. It seems that the hold
> times in SVE code for readers might not be very large, which
> then wouldn't let us amortize the atomic inc of the read lock
> (IOW, we might not see much of a difference compared to a regular
> mutex).

In theory, the uncontended case for rwlocks is the same as a mutex.

> Are you using any benchmark that shows any perf difference?

Not so far.  Glibc has some microbenchmarks for strings, which I will try next
week, but they are not multi-threaded.  Maybe just run 4 threads of those
benchmarks?


r~

Re: [Qemu-devel] [PATCH 0/2] linux-user: Change mmap_lock to rwlock
Posted by Emilio G. Cota 5 years, 10 months ago
On Sat, Jun 23, 2018 at 08:25:52 -0700, Richard Henderson wrote:
> On 06/22/2018 02:12 PM, Emilio G. Cota wrote:
> > I'm curious to see how much perf could be gained. It seems that the hold
> > times in SVE code for readers might not be very large, which
> > then wouldn't let us amortize the atomic inc of the read lock
> > (IOW, we might not see much of a difference compared to a regular
> > mutex).
> 
> In theory, the uncontended case for rwlocks is the same as a mutex.

In the fast path, wr_lock/unlock have one more atomic than
mutex_lock/unlock. The perf difference is quite large in
microbenchmarks, e.g. changing tests/atomic_add-bench to
use pthread_mutex or pthread_rwlock_wrlock instead of
an atomic operation (this is enabled with the added -m flag):

$ taskset -c 0 perf record tests/atomic_add-bench-mutex  -d 4 -m
 Throughput:         62.05 Mops/s

$ taskset -c 0 perf record tests/atomic_add-bench-rwlock  -d 4 -m
 Throughput:         37.68 Mops/s
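
The modified loop is roughly of this shape (a self-contained
approximation, not the exact change; the real harness uses
per-thread counters and a duration timer, omitted here):

/* Approximation of the modified atomic_add-bench inner loop:
 * bump a counter under a lock instead of with an atomic add.
 * Compile with -pthread. */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;
static volatile bool stop;
static bool use_rwlock;          /* selects the -rwlock variant */
static unsigned long long count; /* throughput = count / duration */

static void *bench_thread(void *arg)
{
    (void)arg;
    while (!stop) {
        if (use_rwlock) {
            /* one more atomic op in the fast path than the mutex */
            pthread_rwlock_wrlock(&rwlock);
            count++;
            pthread_rwlock_unlock(&rwlock);
        } else {
            pthread_mutex_lock(&mutex);
            count++;
            pthread_mutex_unlock(&mutex);
        }
    }
    return NULL;
}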

That said, it's unlikely that real user-space code
(i.e. not microbenchmarks) would be sensitive to
the additional delay and/or lower scalability. It is common to
avoid frequent calls to mmap(2) due to potential serialization
in the kernel -- think for instance of memory allocators: they
do a few large mmap calls and then manage the memory themselves.
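
For illustration, that pattern looks roughly like this (a
hypothetical bump allocator, single-threaded for brevity; real
allocators grow and recycle their arenas):

/* Hypothetical bump allocator: one large mmap(2) up front, then
 * pure userspace bookkeeping, so the mapping lock is rarely taken. */
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define ARENA_SIZE (64 * 1024 * 1024)

static uint8_t *arena;
static size_t arena_used;

void *arena_alloc(size_t size)
{
    if (arena == NULL) {
        void *p = mmap(NULL, ARENA_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            return NULL;
        }
        arena = p;
    }
    size = (size + 15) & ~(size_t)15;   /* 16-byte alignment */
    if (ARENA_SIZE - arena_used < size) {
        return NULL;   /* a real allocator would map another arena */
    }
    void *p = arena + arena_used;
    arena_used += size;
    return p;
}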

To double-check I ran some multi-threaded benchmarks from
Hoard[1] under qemu-linux-user, with and without the rwlock change,
and couldn't measure a significant difference.

[1] https://github.com/emeryberger/Hoard/tree/master/benchmarks

> > Are you using any benchmark that shows any perf difference?
> 
> Not so far.  Glibc has some microbenchmarks for strings, which I will try next
> week, but they are not multi-threaded.  Maybe just run 4 threads of those
> benchmarks?

I'd run more threads if possible. I have access to a 64-core machine,
so ping me once you identify benchmarks that are of interest.

		Emilio