v3: https://patchew.org/QEMU/20240206204809.9859-1-amonakov@ispras.ru/

Changes for v4:
  - Keep separate >= 256 entry point, but only keep constant length
    check inline.  This allows the indirect function call to be hidden
    and optimized away when the pointer is constant.
  - Split out a >= 256 integer routine.
  - Simplify acceleration selection for testing.
  - Add function pointer typedef.
  - Implement new aarch64 accelerations.


r~


Alexander Monakov (5):
  util/bufferiszero: Remove SSE4.1 variant
  util/bufferiszero: Remove AVX512 variant
  util/bufferiszero: Reorganize for early test for acceleration
  util/bufferiszero: Remove useless prefetches
  util/bufferiszero: Optimize SSE2 and AVX2 variants

Richard Henderson (5):
  util/bufferiszero: Improve scalar variant
  util/bufferiszero: Introduce biz_accel_fn typedef
  util/bufferiszero: Simplify test_buffer_is_zero_next_accel
  util/bufferiszero: Add simd acceleration for aarch64
  util/bufferiszero: Add sve acceleration for aarch64

 host/include/aarch64/host/cpuinfo.h |   1 +
 include/qemu/cutils.h               |  15 +-
 util/bufferiszero.c                 | 500 ++++++++++++++++------------
 util/cpuinfo-aarch64.c              |   1 +
 meson.build                         |  13 +
 5 files changed, 323 insertions(+), 207 deletions(-)

-- 
2.34.1
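For illustration, a minimal sketch of the inline arrangement the first bullet describes (assumed shape and names, not the actual patch contents; buffer_is_zero_small is hypothetical): only the length test stays inline, so it folds away when 'len' is a compile-time constant and the caller branches directly to the right entry point.

    /* Sketch only: assumed shape of the inline part in include/qemu/cutils.h. */
    #include <stdbool.h>
    #include <stddef.h>

    bool buffer_is_zero_ge256(const void *buf, size_t len);  /* out of line */
    bool buffer_is_zero_small(const void *buf, size_t len);  /* hypothetical */

    static inline bool buffer_is_zero(const void *buf, size_t len)
    {
        /* The only inline check: resolved at compile time for constant len. */
        return len >= 256 ? buffer_is_zero_ge256(buf, len)
                          : buffer_is_zero_small(buf, len);
    }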
On Wed, 14 Feb 2024, Richard Henderson wrote:

> v3: https://patchew.org/QEMU/20240206204809.9859-1-amonakov@ispras.ru/
>
> Changes for v4:
>   - Keep separate >= 256 entry point, but only keep constant length
>     check inline.  This allows the indirect function call to be hidden
>     and optimized away when the pointer is constant.

Sorry, I don't understand this.  Most of the improvement (at least in our
testing) comes from inlining the byte checks, which often fail and eliminate
call overhead entirely.  Moving them out of line seems to lose most of the
speedup the patchset was bringing, doesn't it?  Is there some concern I am
not seeing?

>   - Split out a >= 256 integer routine.
>   - Simplify acceleration selection for testing.
>   - Add function pointer typedef.
>   - Implement new aarch64 accelerations.
>
> r~
>
> Alexander Monakov (5):
>   util/bufferiszero: Remove SSE4.1 variant
>   util/bufferiszero: Remove AVX512 variant
>   util/bufferiszero: Reorganize for early test for acceleration
>   util/bufferiszero: Remove useless prefetches
>   util/bufferiszero: Optimize SSE2 and AVX2 variants
>
> Richard Henderson (5):
>   util/bufferiszero: Improve scalar variant
>   util/bufferiszero: Introduce biz_accel_fn typedef
>   util/bufferiszero: Simplify test_buffer_is_zero_next_accel
>   util/bufferiszero: Add simd acceleration for aarch64
>   util/bufferiszero: Add sve acceleration for aarch64
>
>  host/include/aarch64/host/cpuinfo.h |   1 +
>  include/qemu/cutils.h               |  15 +-
>  util/bufferiszero.c                 | 500 ++++++++++++++++------------
>  util/cpuinfo-aarch64.c              |   1 +
>  meson.build                         |  13 +
>  5 files changed, 323 insertions(+), 207 deletions(-)
On 2/14/24 22:57, Alexander Monakov wrote:
> On Wed, 14 Feb 2024, Richard Henderson wrote:
>
>> v3: https://patchew.org/QEMU/20240206204809.9859-1-amonakov@ispras.ru/
>>
>> Changes for v4:
>>   - Keep separate >= 256 entry point, but only keep constant length
>>     check inline.  This allows the indirect function call to be hidden
>>     and optimized away when the pointer is constant.
>
> Sorry, I don't understand this.  Most of the improvement (at least in our
> testing) comes from inlining the byte checks, which often fail and
> eliminate call overhead entirely.  Moving them out of line seems to lose
> most of the speedup the patchset was bringing, doesn't it?  Is there some
> concern I am not seeing?

What is your benchmarking method?

It was my guess that most of the improvement came from performing those
early byte checks *at all*, and that the overhead of a function call to a
small out-of-line wrapper would be negligible.

By not exposing the function pointer outside the bufferiszero translation
unit, the compiler can see when the pointer is never modified for a given
host, and then transform the indirect branch to a direct branch.


r~
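A minimal C sketch of the pointer-hiding idea described here (the names biz_accel_fn and buffer_is_zero_ge256 follow the patch titles and later discussion; the bodies are assumptions, not the actual patch): because the accelerator pointer is a static local to util/bufferiszero.c, the compiler can prove it is never reassigned on hosts with a single candidate and turn the indirect call into a direct, inlinable one.

    /* Sketch only -- assumed shapes, not the actual patch contents. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    typedef bool (*biz_accel_fn)(const void *, size_t);

    /* Portable integer fallback: OR together 8-byte words, then the tail. */
    static bool buffer_is_zero_int_ge256(const void *buf, size_t len)
    {
        const unsigned char *p = buf;
        uint64_t t = 0;
        size_t i;

        for (i = 0; i + 8 <= len; i += 8) {
            uint64_t v;
            memcpy(&v, p + i, 8);   /* unaligned-safe load */
            t |= v;
        }
        for (; i < len; i++) {
            t |= p[i];
        }
        return t == 0;
    }

    /* Static, so the pointer never escapes this translation unit. */
    static biz_accel_fn buffer_is_zero_accel = buffer_is_zero_int_ge256;

    bool buffer_is_zero_ge256(const void *buf, size_t len)
    {
        return buffer_is_zero_accel(buf, len);
    }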
On Thu, 15 Feb 2024, Richard Henderson wrote:

> On 2/14/24 22:57, Alexander Monakov wrote:
>> On Wed, 14 Feb 2024, Richard Henderson wrote:
>>
>>> v3: https://patchew.org/QEMU/20240206204809.9859-1-amonakov@ispras.ru/
>>>
>>> Changes for v4:
>>>   - Keep separate >= 256 entry point, but only keep constant length
>>>     check inline.  This allows the indirect function call to be hidden
>>>     and optimized away when the pointer is constant.
>>
>> Sorry, I don't understand this.  Most of the improvement (at least in our
>> testing) comes from inlining the byte checks, which often fail and
>> eliminate call overhead entirely.  Moving them out of line seems to lose
>> most of the speedup the patchset was bringing, doesn't it?  Is there some
>> concern I am not seeing?
>
> What is your benchmarking method?

Converting a 4.4 GiB Windows 10 image to qcow2.  It was mentioned in v1 and
v2; are you saying they did not reach your inbox?
https://lore.kernel.org/qemu-devel/20231013155856.21475-1-mmromanov@ispras.ru/
https://lore.kernel.org/qemu-devel/20231027143704.7060-1-mmromanov@ispras.ru/

> It was my guess that most of the improvement came from performing those
> early byte checks *at all*, and that the overhead of a function call to a
> small out-of-line wrapper would be negligible.

qemu-img invokes buffer_is_zero in a fairly tight loop.  Let us know if you
need numbers on how much the out-of-line version loses.

> By not exposing the function pointer outside the bufferiszero translation
> unit, the compiler can see when the pointer is never modified for a given
> host, and then transform the indirect branch to a direct branch.

Okay, but that does not make it necessary to move the byte checks out of
line.  I was preparing a rebase that does not expose the function pointer to
the inline wrapper.  I was completely unaware that you were taking over the
patchset.

Alexander
On 2/15/24 11:36, Alexander Monakov wrote:
> On Thu, 15 Feb 2024, Richard Henderson wrote:
>
>> On 2/14/24 22:57, Alexander Monakov wrote:
>>> On Wed, 14 Feb 2024, Richard Henderson wrote:
>>>
>>>> v3: https://patchew.org/QEMU/20240206204809.9859-1-amonakov@ispras.ru/
>>>>
>>>> Changes for v4:
>>>>   - Keep separate >= 256 entry point, but only keep constant length
>>>>     check inline.  This allows the indirect function call to be hidden
>>>>     and optimized away when the pointer is constant.
>>>
>>> Sorry, I don't understand this.  Most of the improvement (at least in
>>> our testing) comes from inlining the byte checks, which often fail and
>>> eliminate call overhead entirely.  Moving them out of line seems to
>>> lose most of the speedup the patchset was bringing, doesn't it?  Is
>>> there some concern I am not seeing?
>>
>> What is your benchmarking method?
>
> Converting a 4.4 GiB Windows 10 image to qcow2.  It was mentioned in v1
> and v2; are you saying they did not reach your inbox?
> https://lore.kernel.org/qemu-devel/20231013155856.21475-1-mmromanov@ispras.ru/
> https://lore.kernel.org/qemu-devel/20231027143704.7060-1-mmromanov@ispras.ru/

I'm saying that this is not a reproducible description of methodology.

With master, so with neither of our changes: I tried converting an 80G win7
image that I happened to have lying about, and I see buffer_zero_avx2 with
only 3.03% perf overhead.  Then I tried truncating the image to 16G to see
if having the entire image in RAM would help -- not yet, still only 3.4%
perf overhead.  Finally, I truncated the image to 4G and saw 2.9% overhead.

So... help me out here.  I would like to be able to see results that are at
least vaguely similar.


r~
On Thu, 15 Feb 2024, Richard Henderson wrote:

>> Converting a 4.4 GiB Windows 10 image to qcow2.  It was mentioned in v1
>> and v2; are you saying they did not reach your inbox?
>> https://lore.kernel.org/qemu-devel/20231013155856.21475-1-mmromanov@ispras.ru/
>> https://lore.kernel.org/qemu-devel/20231027143704.7060-1-mmromanov@ispras.ru/
>
> I'm saying that this is not a reproducible description of methodology.
>
> With master, so with neither of our changes: I tried converting an 80G
> win7 image that I happened to have lying about, and I see buffer_zero_avx2
> with only 3.03% perf overhead.  Then I tried truncating the image to 16G
> to see if having the entire image in RAM would help -- not yet, still only
> 3.4% perf overhead.  Finally, I truncated the image to 4G and saw 2.9%
> overhead.
>
> So... help me out here.  I would like to be able to see results that are
> at least vaguely similar.

Ah, I guess you might be running at a low perf_event_paranoid setting that
allows unprivileged sampling of kernel events?  In our submissions the
percentage was for perf_event_paranoid=2, i.e. relative to QEMU only,
excluding kernel time under syscalls.

Retrieve IE11.Win7.VirtualBox.zip from
https://archive.org/details/ie11.win7.virtualbox
and use

  unzip -p IE11.Win7.VirtualBox.zip | tar xv

to extract 'IE11 - Win7-disk001.vmdk'.  (Mikhail used a different image when
preparing the patch.)

On this image, I get 70% in buffer_zero_sse2 on a Sandy Bridge running

  qemu-img convert 'IE11 - Win7-disk001.vmdk' -O qcow2 /tmp/t.qcow2

The user:kernel time split is about 0.15:2.3, so 70% relative to user time
does roughly correspond to a single-digit percentage relative to
(user+kernel) time.

(which does tell us that qemu-img is doing I/O inefficiently; it shouldn't
need two seconds to read a fully cached 5 Gigabyte file)

Alexander
On 2/15/24 13:37, Alexander Monakov wrote:
> Ah, I guess you might be running at a low perf_event_paranoid setting that
> allows unprivileged sampling of kernel events?  In our submissions the
> percentage was for perf_event_paranoid=2, i.e. relative to QEMU only,
> excluding kernel time under syscalls.

Ok.  Eliminating kernel samples makes things easier to see.
But I still do not see a 40% reduction in runtime.  Just so we're on the
same page:

> Retrieve IE11.Win7.VirtualBox.zip from
> https://archive.org/details/ie11.win7.virtualbox
> and use
>
>   unzip -p IE11.Win7.VirtualBox.zip | tar xv
>
> to extract 'IE11 - Win7-disk001.vmdk'.  (Mikhail used a different image
> when preparing the patch.)
>
> On this image, I get 70% in buffer_zero_sse2 on a Sandy Bridge running
>
>   qemu-img convert 'IE11 - Win7-disk001.vmdk' -O qcow2 /tmp/t.qcow2

With this, I see virtually all of the runtime in libz.so.
Therefore I converted this to raw first, to focus on the issue.

For avoidance of doubt:

  $ ls -lsh test.raw && sha256sum test.raw
  12G -rw-r--r-- 1 rth rth 40G Feb 15 21:14 test.raw
  3b056d839952538fed42fa898c6063646f4fda1bf7ea0180fbb5f29d21fe8e80  test.raw

Host: 11th Gen Intel(R) Core(TM) i7-1195G7 @ 2.90GHz
Compiler: gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)

master:
  57.48%  qemu-img-m   [.] buffer_zero_avx2
   3.60%  qemu-img-m   [.] is_allocated_sectors.part.0
   2.61%  qemu-img-m   [.] buffer_is_zero
  63.69%  -- total

v3:
  48.86%  qemu-img-v3  [.] is_allocated_sectors.part.0
   3.79%  qemu-img-v3  [.] buffer_zero_avx2
  52.65%  -- total
   -17%   -- reduction from master

v4:
  54.60%  qemu-img-v4  [.] buffer_is_zero_ge256
   3.30%  qemu-img-v4  [.] buffer_zero_avx2
   3.17%  qemu-img-v4  [.] is_allocated_sectors.part.0
  61.07%  -- total
    -4%   -- reduction from master

v4+:
  46.65%  qemu-img     [.] is_allocated_sectors.part.0
   3.49%  qemu-img     [.] buffer_zero_avx2
   0.05%  qemu-img     [.] buffer_is_zero_ge256
  50.19%  -- total
   -21%   -- reduction from master

The v4+ puts the 3-byte test back inline, like in your v3.

Importantly, it must be written as 3 short-circuiting tests, where my v4
"simplified" this to (s | m | e) != 0, on the assumption that the reduced
number of branches would help.

Diving into perf, it becomes clear why:

  57.36 │       cmpb   $0x0,(%rbx)
   4.02 │     ↓ jne    89
  21.84 │       cmpb   $0x0,0x1ff(%rbx)
   0.64 │     ↓ jne    89
   8.45 │       cmpb   $0x0,0x100(%rbx)
   0.26 │     ↓ jne    89
   0.06 │       mov    $0x200,%esi
        │       mov    %rbx,%rdi
   0.07 │     → call   buffer_is_zero_ge256

The three bytes are on 3 different cachelines.  Judging by the relative
percentages, it would seem that the first byte alone eliminates slightly
more than half of all blocks; the last byte eliminates more than half again;
the middle byte eliminates a fair fraction of the rest.  With the
short-circuit, the extra cachelines are not touched.

This is so important that it should be spelled out in a comment.

With that settled, I guess we need to talk about how much the out-of-line
implementation matters at all.  I'm thinking about writing a
test/bench/bufferiszero, with all-zero buffers of various sizes and
alignments.  With that it would be easier to talk about whether any given
implementation is an improvement for that final 4% not eliminated by the
three bytes.

> (which does tell us that qemu-img is doing I/O inefficiently; it shouldn't
> need two seconds to read a fully cached 5 Gigabyte file)

Indeed!


r~
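For illustration, a minimal sketch of the inline wrapper with the three byte probes written as short-circuiting tests, matching the disassembly above; the helper name buffer_is_zero_ge256 comes from the perf output, the rest of the shape is an assumption, not the actual patch.

    /* Sketch only: assumes callers pass len >= 256 (qemu-img checks
     * 512-byte sectors), and the out-of-line helper named above. */
    #include <stdbool.h>
    #include <stddef.h>

    bool buffer_is_zero_ge256(const void *buf, size_t len);

    static inline bool buffer_is_zero_sketch(const void *vbuf, size_t len)
    {
        const unsigned char *buf = vbuf;

        /* Probe the first, last and middle byte.  The || operators
         * short-circuit, so a non-zero hit returns before the next
         * cacheline is ever touched. */
        if (buf[0] || buf[len - 1] || buf[len / 2]) {
            return false;
        }
        /* All three probes were zero; scan the whole buffer. */
        return buffer_is_zero_ge256(buf, len);
    }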
On Thu, 15 Feb 2024, Richard Henderson wrote:

> On 2/15/24 13:37, Alexander Monakov wrote:
>> Ah, I guess you might be running at a low perf_event_paranoid setting
>> that allows unprivileged sampling of kernel events?  In our submissions
>> the percentage was for perf_event_paranoid=2, i.e. relative to QEMU
>> only, excluding kernel time under syscalls.
>
> Ok.  Eliminating kernel samples makes things easier to see.
> But I still do not see a 40% reduction in runtime.

I suspect Mikhail's image was less sparse, so the impact from inlining was
greater.

> With this, I see virtually all of the runtime in libz.so.
> Therefore I converted this to raw first, to focus on the issue.

Ah, apologies for that.  I built with --disable-default-features and did
not notice that my qemu-img lacked support for vmdk and treated it as a raw
image instead.  I was assuming it was similar to what Mikhail used, but
obviously it's not, due to the compression.

> For avoidance of doubt:
>
>   $ ls -lsh test.raw && sha256sum test.raw
>   12G -rw-r--r-- 1 rth rth 40G Feb 15 21:14 test.raw
>   3b056d839952538fed42fa898c6063646f4fda1bf7ea0180fbb5f29d21fe8e80  test.raw
>
> Host: 11th Gen Intel(R) Core(TM) i7-1195G7 @ 2.90GHz
> Compiler: gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)
>
> master:
>   57.48%  qemu-img-m   [.] buffer_zero_avx2
>    3.60%  qemu-img-m   [.] is_allocated_sectors.part.0
>    2.61%  qemu-img-m   [.] buffer_is_zero
>   63.69%  -- total
>
> v3:
>   48.86%  qemu-img-v3  [.] is_allocated_sectors.part.0
>    3.79%  qemu-img-v3  [.] buffer_zero_avx2
>   52.65%  -- total
>    -17%   -- reduction from master
>
> v4:
>   54.60%  qemu-img-v4  [.] buffer_is_zero_ge256
>    3.30%  qemu-img-v4  [.] buffer_zero_avx2
>    3.17%  qemu-img-v4  [.] is_allocated_sectors.part.0
>   61.07%  -- total
>     -4%   -- reduction from master
>
> v4+:
>   46.65%  qemu-img     [.] is_allocated_sectors.part.0
>    3.49%  qemu-img     [.] buffer_zero_avx2
>    0.05%  qemu-img     [.] buffer_is_zero_ge256
>   50.19%  -- total
>    -21%   -- reduction from master

Any ideas where the -21% vs. v3's -17% difference comes from?

FWIW, in situations like these I always recommend running perf with a fixed
sampling period, i.e. 'perf record -e cycles:P -c 100000' or 'perf record
-e cycles/period=100000/P', to make sample counts between runs of different
duration directly comparable (displayed with 'perf report -n').

> The v4+ puts the 3-byte test back inline, like in your v3.
>
> Importantly, it must be written as 3 short-circuiting tests, where my v4
> "simplified" this to (s | m | e) != 0, on the assumption that the reduced
> number of branches would help.

Yes, we also noticed that when preparing our patch.  We also tried mixed
variants like (s | e) != 0 || m != 0, but they did not turn out faster.

> With that settled, I guess we need to talk about how much the out-of-line
> implementation matters at all.  I'm thinking about writing a
> test/bench/bufferiszero, with all-zero buffers of various sizes and
> alignments.  With that it would be easier to talk about whether any given
> implementation is an improvement for that final 4% not eliminated by the
> three bytes.

Yeah, initially I suggested this task to Mikhail as a practice exercise
outside of QEMU, and we had a benchmark that measures buffer_is_zero via
perf_event_open.  This makes it possible to see exactly how close the
implementation runs to the performance ceiling given by the maximum L1
fetch rate (two loads per cycle on x86).

Alexander
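For what it's worth, a rough sketch of such a microbenchmark (this is not Mikhail's actual harness; it assumes the buffer_is_zero prototype from qemu/cutils.h and Linux perf_event_open counting user-space cycles): it times repeated calls on a zero-filled buffer and reports bytes per cycle for comparison against the L1 load-bandwidth ceiling.

    /* Sketch only: standalone cycle-counting benchmark for buffer_is_zero. */
    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    bool buffer_is_zero(const void *buf, size_t len);  /* from util/bufferiszero.c */

    int main(void)
    {
        struct perf_event_attr attr = {
            .type = PERF_TYPE_HARDWARE,
            .size = sizeof(attr),
            .config = PERF_COUNT_HW_CPU_CYCLES,
            .disabled = 1,
            .exclude_kernel = 1,
        };
        int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0) {
            perror("perf_event_open");
            return 1;
        }

        enum { LEN = 4096, ITER = 1000000 };
        void *buf = calloc(1, LEN);

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        for (int i = 0; i < ITER; i++) {
            if (!buffer_is_zero(buf, LEN)) {
                abort();
            }
        }
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t cycles;
        read(fd, &cycles, sizeof(cycles));
        /* Compare against the host's L1 load-bandwidth ceiling. */
        printf("%.2f bytes/cycle\n", (double)LEN * ITER / cycles);
        return 0;
    }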
On 2/16/24 10:20, Alexander Monakov wrote:
> FWIW, in situations like these I always recommend running perf with a
> fixed sampling period, i.e. 'perf record -e cycles:P -c 100000' or
> 'perf record -e cycles/period=100000/P', to make sample counts between
> runs of different duration directly comparable (displayed with
> 'perf report -n').

I've re-done the numbers with a fixed period, as suggested, and the
difference between v3 and v4+ is in the sampling noise, differing by about
0.3%.


r~