[PATCH v4 00/10] Optimize buffer_is_zero
Posted by Richard Henderson 9 months, 2 weeks ago
v3: https://patchew.org/QEMU/20240206204809.9859-1-amonakov@ispras.ru/

Changes for v4:
  - Keep separate >= 256 entry point, but only keep constant length
    check inline.  This allows the indirect function call to be hidden
    and optimized away when the pointer is constant.
  - Split out a >= 256 integer routine.
  - Simplify acceleration selection for testing.
  - Add function pointer typedef.
  - Implement new aarch64 accelerations.
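
On the first item, the split I'm aiming for looks roughly like this (a sketch, not
the actual patch; of these names only buffer_is_zero_ge256 exists in the series):

    /* include/qemu/cutils.h -- sketch */
    bool buffer_is_zero_ool(const void *buf, size_t len);     /* any length */
    bool buffer_is_zero_ge256(const void *buf, size_t len);   /* len >= 256 */

    static inline bool buffer_is_zero(const void *buf, size_t len)
    {
        /*
         * Only the constant-length check stays inline; the accel function
         * pointer stays hidden inside util/bufferiszero.c.
         */
        if (__builtin_constant_p(len) && len >= 256) {
            return buffer_is_zero_ge256(buf, len);
        }
        return buffer_is_zero_ool(buf, len);
    }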


r~


Alexander Monakov (5):
  util/bufferiszero: Remove SSE4.1 variant
  util/bufferiszero: Remove AVX512 variant
  util/bufferiszero: Reorganize for early test for acceleration
  util/bufferiszero: Remove useless prefetches
  util/bufferiszero: Optimize SSE2 and AVX2 variants

Richard Henderson (5):
  util/bufferiszero: Improve scalar variant
  util/bufferiszero: Introduce biz_accel_fn typedef
  util/bufferiszero: Simplify test_buffer_is_zero_next_accel
  util/bufferiszero: Add simd acceleration for aarch64
  util/bufferiszero: Add sve acceleration for aarch64

 host/include/aarch64/host/cpuinfo.h |   1 +
 include/qemu/cutils.h               |  15 +-
 util/bufferiszero.c                 | 500 ++++++++++++++++------------
 util/cpuinfo-aarch64.c              |   1 +
 meson.build                         |  13 +
 5 files changed, 323 insertions(+), 207 deletions(-)

-- 
2.34.1
Re: [PATCH v4 00/10] Optimize buffer_is_zero
Posted by Alexander Monakov 9 months, 2 weeks ago
On Wed, 14 Feb 2024, Richard Henderson wrote:

> v3: https://patchew.org/QEMU/20240206204809.9859-1-amonakov@ispras.ru/
> 
> Changes for v4:
>   - Keep separate >= 256 entry point, but only keep constant length
>     check inline.  This allows the indirect function call to be hidden
>     and optimized away when the pointer is constant.

Sorry, I don't understand this. Most of the improvement (at least in our
testing) comes from inlining the byte checks, which often fail and eliminate
call overhead entirely. Moving them out-of-line seems to lose most of the
speedup the patchset was bringing, doesn't it? Is there some concern I am
not seeing?

>   - Split out a >= 256 integer routine.
>   - Simplify acceleration selection for testing.
>   - Add function pointer typedef.
>   - Implement new aarch64 accelerations.
> 
> 
> r~
> 
> 
> Alexander Monakov (5):
>   util/bufferiszero: Remove SSE4.1 variant
>   util/bufferiszero: Remove AVX512 variant
>   util/bufferiszero: Reorganize for early test for acceleration
>   util/bufferiszero: Remove useless prefetches
>   util/bufferiszero: Optimize SSE2 and AVX2 variants
> 
> Richard Henderson (5):
>   util/bufferiszero: Improve scalar variant
>   util/bufferiszero: Introduce biz_accel_fn typedef
>   util/bufferiszero: Simplify test_buffer_is_zero_next_accel
>   util/bufferiszero: Add simd acceleration for aarch64
>   util/bufferiszero: Add sve acceleration for aarch64
> 
>  host/include/aarch64/host/cpuinfo.h |   1 +
>  include/qemu/cutils.h               |  15 +-
>  util/bufferiszero.c                 | 500 ++++++++++++++++------------
>  util/cpuinfo-aarch64.c              |   1 +
>  meson.build                         |  13 +
>  5 files changed, 323 insertions(+), 207 deletions(-)
> 
>
Re: [PATCH v4 00/10] Optimize buffer_is_zero
Posted by Richard Henderson 9 months, 2 weeks ago
On 2/14/24 22:57, Alexander Monakov wrote:
> 
> On Wed, 14 Feb 2024, Richard Henderson wrote:
> 
>> v3: https://patchew.org/QEMU/20240206204809.9859-1-amonakov@ispras.ru/
>>
>> Changes for v4:
>>    - Keep separate >= 256 entry point, but only keep constant length
>>      check inline.  This allows the indirect function call to be hidden
>>      and optimized away when the pointer is constant.
> 
> Sorry, I don't understand this. Most of the improvement (at least in our
> testing) comes from inlining the byte checks, which often fail and eliminate
> call overhead entirely. Moving them out-of-line seems to lose most of the
> speedup the patchset was bringing, doesn't it? Is there some concern I am
> not seeing?

What is your benchmarking method?

It was my guess that most of the improvement came from performing those early byte checks 
*at all*, and that the overhead of a function call to a small out of line wrapper would be 
negligible.

By not exposing the function pointer outside the bufferiszero translation unit, the 
compiler can see when the pointer is never modified for a given host, and then transform 
the indirect branch to a direct branch.
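
Concretely, the shape I mean (a rough sketch, not the actual patch; of these names,
only biz_accel_fn and buffer_is_zero_ge256 come from the series):

    /* util/bufferiszero.c -- sketch; assumes len is a multiple of 8
       and buf is 8-byte aligned */
    typedef bool (*biz_accel_fn)(const void *, size_t);

    static bool buffer_is_zero_int_ge256(const void *buf, size_t len)
    {
        const uint64_t *p = buf;
        uint64_t t = 0;

        for (size_t i = 0; i < len / 8; i++) {
            t |= p[i];
        }
        return t == 0;
    }

    /*
     * The pointer never escapes this translation unit, so on a host where
     * no SIMD variant is ever installed the compiler can prove it always
     * holds buffer_is_zero_int_ge256 and emit a direct call below.
     */
    static biz_accel_fn buffer_is_zero_accel = buffer_is_zero_int_ge256;

    bool buffer_is_zero_ge256(const void *buf, size_t len)
    {
        return buffer_is_zero_accel(buf, len);
    }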


r~
Re: [PATCH v4 00/10] Optimize buffer_is_zero
Posted by Alexander Monakov 9 months, 2 weeks ago
On Thu, 15 Feb 2024, Richard Henderson wrote:

> On 2/14/24 22:57, Alexander Monakov wrote:
> > 
> > On Wed, 14 Feb 2024, Richard Henderson wrote:
> > 
> >> v3: https://patchew.org/QEMU/20240206204809.9859-1-amonakov@ispras.ru/
> >>
> >> Changes for v4:
> >>    - Keep separate >= 256 entry point, but only keep constant length
> >>      check inline.  This allows the indirect function call to be hidden
> >>      and optimized away when the pointer is constant.
> > 
> > Sorry, I don't understand this. Most of the improvement (at least in our
> > testing) comes from inlining the byte checks, which often fail and eliminate
> > call overhead entirely. Moving them out-of-line seems to lose most of the
> > speedup the patchset was bringing, doesn't it? Is there some concern I am
> > not seeing?
> 
> What is your benchmarking method?

Converting a 4.4 GiB Windows 10 image to qcow2. It was mentioned in v1 and v2,
are you saying they did not reach your inbox?
https://lore.kernel.org/qemu-devel/20231013155856.21475-1-mmromanov@ispras.ru/
https://lore.kernel.org/qemu-devel/20231027143704.7060-1-mmromanov@ispras.ru/

> It was my guess that most of the improvement came from performing those early
> byte checks *at all*, and that the overhead of a function call to a small out
> of line wrapper would be negligible.

qemu-img invokes buffer_is_zero in a fairly tight loop. Let us know if you
need numbers showing how much the out-of-line version loses.

> By not exposing the function pointer outside the bufferiszero translation
> unit, the compiler can see when the pointer is never modified for a given
> host, and then transform the indirect branch to a direct branch.

Okay, but that does not make it necessary to move byte checks out of line.
I was preparing a rebase that does not expose the function pointer to the
inline wrapper. I was completely unaware that you're taking over the patchset.

Alexander
Re: [PATCH v4 00/10] Optimize buffer_is_zero
Posted by Richard Henderson 9 months, 2 weeks ago
On 2/15/24 11:36, Alexander Monakov wrote:
> 
> On Thu, 15 Feb 2024, Richard Henderson wrote:
> 
>> On 2/14/24 22:57, Alexander Monakov wrote:
>>>
>>> On Wed, 14 Feb 2024, Richard Henderson wrote:
>>>
>>>> v3: https://patchew.org/QEMU/20240206204809.9859-1-amonakov@ispras.ru/
>>>>
>>>> Changes for v4:
>>>>     - Keep separate >= 256 entry point, but only keep constant length
>>>>       check inline.  This allows the indirect function call to be hidden
>>>>       and optimized away when the pointer is constant.
>>>
>>> Sorry, I don't understand this. Most of the improvement (at least in our
>>> testing) comes from inlining the byte checks, which often fail and eliminate
>>> call overhead entirely. Moving them out-of-line seems to lose most of the
>>> speedup the patchset was bringing, doesn't it? Is there some concern I am
>>> not seeing?
>>
>> What is your benchmarking method?
> 
> Converting a 4.4 GiB Windows 10 image to qcow2. It was mentioned in v1 and v2,
> are you saying they did not reach your inbox?
> https://lore.kernel.org/qemu-devel/20231013155856.21475-1-mmromanov@ispras.ru/
> https://lore.kernel.org/qemu-devel/20231027143704.7060-1-mmromanov@ispras.ru/

I'm saying that this is not a reproducible description of methodology.

With master, so with neither of our changes:

I tried converting an 80G win7 image that I happened to have lying about, I see 
buffer_zero_avx2 with only 3.03% perf overhead.  Then I tried truncating the image to 16G 
to see if having the entire image in ram would help -- not yet, still only 3.4% perf 
overhead.  Finally, I truncated the image to 4G and saw 2.9% overhead.

So... help me out here.  I would like to be able to see results that are at least vaguely 
similar.


r~
Re: [PATCH v4 00/10] Optimize buffer_is_zero
Posted by Alexander Monakov 9 months, 2 weeks ago
On Thu, 15 Feb 2024, Richard Henderson wrote:

> > Converting a 4.4 GiB Windows 10 image to qcow2. It was mentioned in v1 and
> > v2,
> > are you saying they did not reach your inbox?
> > https://lore.kernel.org/qemu-devel/20231013155856.21475-1-mmromanov@ispras.ru/
> > https://lore.kernel.org/qemu-devel/20231027143704.7060-1-mmromanov@ispras.ru/
> 
> I'm saying that this is not a reproducible description of methodology.
> 
> With master, so with neither of our changes:
> 
> I tried converting an 80G win7 image that I happened to have lying about, I
> see buffer_zero_avx2 with only 3.03% perf overhead.  Then I tried truncating
> the image to 16G to see if having the entire image in ram would help -- not
> yet, still only 3.4% perf overhead.  Finally, I truncated the image to 4G and
> saw 2.9% overhead.
> 
> So... help be out here.  I would like to be able to see results that are at
> least vaguely similar.

Ah, I guess you might be running at low perf_event_paranoid setting that
allows unprivileged sampling of kernel events? In our submissions the
percentage was for perf_event_paranoid=2, i.e. relative to Qemu only,
excluding kernel time under syscalls.

Retrieve IE11.Win7.VirtualBox.zip from
https://archive.org/details/ie11.win7.virtualbox
and use

  unzip -p IE11.Win7.VirtualBox.zip | tar xv

to extract 'IE11 - Win7-disk001.vmdk'.

(Mikhail used a different image when preparing the patch)

On this image, I get 70% in buffer_zero_sse2 on a Sandy Bridge running

  qemu-img convert 'IE11 - Win7-disk001.vmdk' -O qcow2 /tmp/t.qcow2

user:kernel time is about 0.15:2.3, so 70% relative to user time does
roughly correspond to single-digits percentage relative to (user+kernel) time.

(which does tell us that qemu-img is doing I/O inefficiently, it shouldn't
need two seconds to read a fully cached 5 Gigabyte file)

Alexander
Re: [PATCH v4 00/10] Optimize buffer_is_zero
Posted by Richard Henderson 9 months, 2 weeks ago
On 2/15/24 13:37, Alexander Monakov wrote:
> Ah, I guess you might be running at low perf_event_paranoid setting that
> allows unprivileged sampling of kernel events? In our submissions the
> percentage was for perf_event_paranoid=2, i.e. relative to Qemu only,
> excluding kernel time under syscalls.

Ok.  Eliminating kernel samples makes things easier to see.
But I still do not see a 40% reduction in runtime.

Just so we're on the same page:

> Retrieve IE11.Win7.VirtualBox.zip from
> https://archive.org/details/ie11.win7.virtualbox
> and use
> 
>   unzip -p IE11.Win7.VirtualBox.zip | tar xv
> 
> to extract 'IE11 - Win7-disk001.vmdk'.
> 
> (Mikhail used a different image when preparing the patch)
> 
> On this image, I get 70% in buffer_zero_sse2 on a Sandy Bridge running
> 
>   qemu-img convert 'IE11 - Win7-disk001.vmdk' -O qcow2 /tmp/t.qcow2

With this, I see virtually all of the runtime in libz.so.
Therefore I converted this to raw first, to focus on the issue.

For avoidance of doubt:

$ ls -lsh test.raw && sha256sum test.raw
  12G -rw-r--r--  1 rth  rth   40G Feb 15 21:14 test.raw
3b056d839952538fed42fa898c6063646f4fda1bf7ea0180fbb5f29d21fe8e80  test.raw

Host: 11th Gen Intel(R) Core(TM) i7-1195G7 @ 2.90GHz
Compiler: gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)

master:
   57.48%  qemu-img-m  [.] buffer_zero_avx2
    3.60%  qemu-img-m  [.] is_allocated_sectors.part.0
    2.61%  qemu-img-m  [.] buffer_is_zero
   63.69%  -- total

v3:
   48.86%  qemu-img-v3  [.] is_allocated_sectors.part.0
    3.79%  qemu-img-v3  [.] buffer_zero_avx2
   52.65%  -- total
     -17%  -- reduction from master

v4:
   54.60%  qemu-img-v4  [.] buffer_is_zero_ge256
    3.30%  qemu-img-v4  [.] buffer_zero_avx2
    3.17%  qemu-img-v4  [.] is_allocated_sectors.part.0
   61.07%  -- total
      -4%  -- reduction from master

v4+:
   46.65%  qemu-img  [.] is_allocated_sectors.part.0
    3.49%  qemu-img  [.] buffer_zero_avx2
    0.05%  qemu-img  [.] buffer_is_zero_ge256
   50.19%  -- total
     -21%  -- reduction from master

The v4+ puts the 3 byte test back inline, like in your v3.

Importantly, it must be done as 3 short-circuiting tests, where my v4 "simplified" this
to (s | m | e) != 0, on the assumption that the reduced number of branches would help.

Diving into perf, it becomes clear why:

  57.36 │       cmpb   $0x0,(%rbx)
   4.02 │     ↓ jne    89
  21.84 │       cmpb   $0x0,0x1ff(%rbx)
   0.64 │     ↓ jne    89
   8.45 │       cmpb   $0x0,0x100(%rbx)
   0.26 │     ↓ jne    89
   0.06 │       mov    $0x200,%esi
        │       mov    %rbx,%rdi
   0.07 │     → call   buffer_is_zero_ge256

The three bytes are on 3 different cachelines.  Judging by the relative percentages, it 
would seem that the first byte alone eliminates slightly more than half of all blocks; the 
last byte eliminates more than half again; the middle byte eliminates a fair fraction of 
the rest.  With the short-circuit, the extra cachelines are not touched.

This is so important that it should be spelled out in a comment.
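
Roughly what I have in mind for that comment (helper name illustrative; checks in the
same order as the annotate output above):

    static inline bool buffer_is_zero_sample3(const char *buf, size_t len)
    {
        /*
         * The three sampled bytes sit on three different cache lines.  In
         * the qemu-img profile above, the first byte alone rejects just
         * over half of all blocks, the last byte over half of the rest,
         * and the middle byte a fair fraction of what remains.  Keep the
         * tests short-circuiting so that a hit on an earlier byte never
         * touches the later cache lines; folding them into
         * (s | m | e) != 0 forces all three loads.
         */
        return buf[0] == 0 && buf[len - 1] == 0 && buf[len / 2] == 0;
    }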

With that settled, I guess we need to talk about how much the out-of-line implementation 
matters at all.  I'm thinking about writing a test/bench/bufferiszero, with all-zero 
buffers of various sizes and alignments.  With that it would be easier to talk about 
whether any given implementation is an improvement for that final 4% not eliminated by 
the three bytes.

> (which does tell us that qemu-img is doing I/O inefficiently, it shouldn't
> need two seconds to read a fully cached 5 Gigabyte file)

Indeed!


r~

Re: [PATCH v4 00/10] Optimize buffer_is_zero
Posted by Alexander Monakov 9 months, 1 week ago
On Thu, 15 Feb 2024, Richard Henderson wrote:

> On 2/15/24 13:37, Alexander Monakov wrote:
> > Ah, I guess you might be running at low perf_event_paranoid setting that
> > allows unprivileged sampling of kernel events? In our submissions the
> > percentage was for perf_event_paranoid=2, i.e. relative to Qemu only,
> > excluding kernel time under syscalls.
> 
> Ok.  Eliminating kernel samples makes things easier to see.
> But I still do not see a 40% reduction in runtime.

I suspect Mikhail's image was less sparse, so the impact from inlining
was greater.

> With this, I see virtually all of the runtime in libz.so.
> Therefore I converted this to raw first, to focus on the issue.

Ah, apologies for that. I built with --disable-default-features and
did not notice my qemu-img lacked support for vmdk and treated it
as a raw image instead. I was assuming it was similar to what Mikhail
used, but obviously it's not, due to the compression.

> For avoidance of doubt:
> 
> $ ls -lsh test.raw && sha256sum test.raw
>  12G -rw-r--r--  1 rth  rth   40G Feb 15 21:14 test.raw
> 3b056d839952538fed42fa898c6063646f4fda1bf7ea0180fbb5f29d21fe8e80  test.raw
> 
> Host: 11th Gen Intel(R) Core(TM) i7-1195G7 @ 2.90GHz
> Compiler: gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)
> 
> master:
>   57.48%  qemu-img-m  [.] buffer_zero_avx2
>    3.60%  qemu-img-m  [.] is_allocated_sectors.part.0
>    2.61%  qemu-img-m  [.] buffer_is_zero
>   63.69%  -- total
> 
> v3:
>   48.86%  qemu-img-v3  [.] is_allocated_sectors.part.0
>   3.79%  qemu-img-v3  [.] buffer_zero_avx2
>   52.65%  -- total
>     -17%  -- reduction from master
> 
> v4:
>   54.60%  qemu-img-v4  [.] buffer_is_zero_ge256
>    3.30%  qemu-img-v4  [.] buffer_zero_avx2
>    3.17%  qemu-img-v4  [.] is_allocated_sectors.part.0
>   61.07%  -- total
>      -4%  -- reduction from master
> 
> v4+:
>   46.65%  qemu-img  [.] is_allocated_sectors.part.0
>    3.49%  qemu-img  [.] buffer_zero_avx2
>    0.05%  qemu-img  [.] buffer_is_zero_ge256
>   50.19%  -- total
>     -21%  -- reduction from master

Any ideas where the -21% vs v3's -17% difference comes from?

FWIW, in situations like these I always recommend running perf with a fixed
sampling rate, i.e. 'perf record -e cycles:P -c 100000' or 'perf record -e
cycles/period=100000/P' to make sample counts between runs of different
duration directly comparable (displayed with 'perf report -n').

> The v4+ puts the 3 byte test back inline, like in your v3.
> 
> Importantly, it must be done as 3 short-circuiting tests, where my v4 "simplified"
> this to (s | m | e) != 0, on the assumption that the reduced number of
> branches would help.

Yes, we also noticed that when preparing our patch. We also tried mixed
variants like (s | e) != 0 || m != 0, but they did not turn out faster.

> With that settled, I guess we need to talk about how much the out-of-line
> implementation matters at all.  I'm thinking about writing a
> test/bench/bufferiszero, with all-zero buffers of various sizes and
> alignments.  With that it would be easier to talk about whether any given
> implementation is is an improvement for that final 4% not eliminated by the
> three bytes.

Yeah, initially I suggested this task to Mikhail as a practice exercise
outside of Qemu, and we had a benchmark that measures buffer_is_zero via
perf_event_open. This allows us to see exactly how close the implementation
runs to the performance ceiling given by max L1 fetch rate (two loads
per cycle on x86).
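
For reference, a minimal sketch of that kind of harness (not Mikhail's actual benchmark,
just the standard perf_event_open counting pattern around the routine under test):

    /* bench-biz.c -- sketch: cycles spent in buffer_is_zero on an all-zero buffer */
    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    bool buffer_is_zero(const void *buf, size_t len);   /* routine under test */

    static int open_cycle_counter(void)
    {
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        attr.disabled = 1;
        attr.exclude_kernel = 1;   /* usable with perf_event_paranoid=2 */
        attr.exclude_hv = 1;
        return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    }

    int main(void)
    {
        const size_t len = 16 * 1024;        /* fits in L1d */
        const long iters = 100000;
        void *buf = calloc(1, len);          /* all-zero input */
        int fd = open_cycle_counter();
        uint64_t cycles = 0;
        volatile bool r = true;

        if (fd < 0 || !buf) {
            perror("setup");
            return 1;
        }
        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        for (long i = 0; i < iters; i++) {
            r = buffer_is_zero(buf, len);
        }
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        if (read(fd, &cycles, sizeof(cycles)) != sizeof(cycles)) {
            return 1;
        }

        /* Bytes per cycle, to compare against the host's L1 load bandwidth. */
        printf("%.2f bytes/cycle\n", (double)len * iters / cycles);
        return r ? 0 : 1;
    }

Dividing the bytes/cycle figure by the vector load width then shows how far a given
variant sits from that two-loads-per-cycle ceiling.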

Alexander
Re: [PATCH v4 00/10] Optimize buffer_is_zero
Posted by Richard Henderson 9 months, 1 week ago
On 2/16/24 10:20, Alexander Monakov wrote:
> FWIW, in situations like these I always recommend running perf with a fixed
> sampling rate, i.e. 'perf record -e cycles:P -c 100000' or 'perf record -e
> cycles/period=100000/P' to make sample counts between runs of different
> duration directly comparable (displayed with 'perf report -n').

I've re-done the numbers with fixed period, as suggested, and the difference between v3 
and v4+ is in the sampling noise, differing by about 0.3%.


r~