[RFC PATCH 0/6] Improve the performance of RISC-V vector unit-stride ld/st instructions

Max Chou posted 6 patches 8 months, 2 weeks ago
Patches applied successfully
git fetch https://github.com/patchew-project/qemu tags/patchew/20240215192823.729209-1-max.chou@sifive.com
Maintainers: Richard Henderson <richard.henderson@linaro.org>, Paolo Bonzini <pbonzini@redhat.com>, Riku Voipio <riku.voipio@iki.fi>, Palmer Dabbelt <palmer@dabbelt.com>, Alistair Francis <alistair.francis@wdc.com>, Bin Meng <bin.meng@windriver.com>, Weiwei Li <liwei1518@gmail.com>, Daniel Henrique Barboza <dbarboza@ventanamicro.com>, Liu Zhiwei <zhiwei_liu@linux.alibaba.com>
[RFC PATCH 0/6] Improve the performance of RISC-V vector unit-stride ld/st instructions
Posted by Max Chou 8 months, 2 weeks ago
Hi all,

When glibc is built with RVV support [1], the memcpy benchmark runs 2x to 60x
slower than the scalar equivalent on QEMU, which hurts developer
productivity.

From the performance analysis, we can observe that the glibc
memcpy spends most of its time in the vector unit-stride load/store
helper functions.

Samples: 465K of event 'cycles:u', Event count (approx.): 1707645730664
  Children      Self  Command       Shared Object            Symbol
+   28.46%    27.85%  qemu-riscv64  qemu-riscv64             [.] vext_ldst_us
+   26.92%     0.00%  qemu-riscv64  [unknown]                [.] 0x00000000000000ff
+   14.41%    14.41%  qemu-riscv64  qemu-riscv64             [.] qemu_plugin_vcpu_mem_cb
+   13.85%    13.85%  qemu-riscv64  qemu-riscv64             [.] lde_b
+   13.64%    13.64%  qemu-riscv64  qemu-riscv64             [.] cpu_stb_mmu
+    9.25%     9.19%  qemu-riscv64  qemu-riscv64             [.] cpu_ldb_mmu
+    7.81%     7.81%  qemu-riscv64  qemu-riscv64             [.] cpu_mmu_lookup
+    7.70%     7.70%  qemu-riscv64  qemu-riscv64             [.] ste_b
+    5.53%     0.00%  qemu-riscv64  qemu-riscv64             [.] adjust_addr (inlined)   
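
To make the shape of this hot path concrete, here is a simplified,
self-contained toy sketch (an illustration only, not the actual QEMU
code; the toy_* names are hypothetical stand-ins for vext_ldst_us,
lde_b and cpu_ldb_mmu, and the real lookup goes through the user-mode
or softmmu access path rather than an array index). The point it shows
is that every element of the vector currently pays for its own full
guest-address lookup:

#include <stdint.h>
#include <stddef.h>
#include <string.h>

static uint8_t guest_ram[1 << 20];

/*
 * Stand-in for the per-access lookup (cpu_mmu_lookup in the profile):
 * the real code redoes the address checks/translation on every call.
 */
static uint8_t *toy_mmu_lookup(uint64_t addr)
{
    return &guest_ram[addr % sizeof(guest_ram)];
}

/* Stand-in for lde_b()/cpu_ldb_mmu(): one full lookup per byte. */
static uint8_t toy_ldb(uint64_t addr)
{
    return *toy_mmu_lookup(addr);
}

/* Stand-in for vext_ldst_us() with 8-bit elements (vle8.v). */
static void toy_vle8(uint8_t *vd, uint64_t base, size_t vl)
{
    for (size_t i = 0; i < vl; i++) {
        vd[i] = toy_ldb(base + i);   /* lookup repeated vl times */
    }
}

int main(void)
{
    uint8_t vreg[256];
    memset(guest_ram, 0xab, sizeof(guest_ram));
    toy_vle8(vreg, 0x1000, sizeof(vreg));
    return vreg[0] == 0xab ? 0 : 1;
}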


So this patchset tries to improve the performance of the RVV version of
glibc memcpy on QEMU by improving the quality of the corresponding helper
functions.

The overall performance improvement reaches the following numbers
(depending on the copy size).
Average: 2.86X / Smallest: 1.15X / Largest: 4.49X

PS: This RFC patchset focuses only on the vle8.v & vse8.v instructions;
the next version or next series will cover the remaining vector ld/st
instructions.

Regards,
Max.

[1] https://inbox.sourceware.org/libc-alpha/20230504074851.38763-1-hau.hsu@sifive.com

Max Chou (6):
  target/riscv: Separate vector segment ld/st instructions
  accel/tcg: Avoid unnecessary call overhead from qemu_plugin_vcpu_mem_cb
  target/riscv: Inline vext_ldst_us and corresponding function for
    performance
  accel/tcg: Inline cpu_mmu_lookup function
  accel/tcg: Inline do_ld1_mmu function
  accel/tcg: Inline do_st1_mmu function

 accel/tcg/ldst_common.c.inc             |  40 ++++++--
 accel/tcg/user-exec.c                   |  17 ++--
 target/riscv/helper.h                   |   4 +
 target/riscv/insn32.decode              |  11 +-
 target/riscv/insn_trans/trans_rvv.c.inc |  61 +++++++++++
 target/riscv/vector_helper.c            | 130 +++++++++++++++++++-----
 6 files changed, 221 insertions(+), 42 deletions(-)

-- 
2.34.1
Re: [RFC PATCH 0/6] Improve the performance of RISC-V vector unit-stride ld/st instructions
Posted by Richard Henderson 8 months, 2 weeks ago
On 2/15/24 09:28, Max Chou wrote:
> Hi all,
> 
> When glibc is built with RVV support [1], the memcpy benchmark runs 2x to 60x
> slower than the scalar equivalent on QEMU, which hurts developer
> productivity.
> 
>  From the performance analysis, we can observe that the glibc
> memcpy spends most of its time in the vector unit-stride load/store
> helper functions.
> 
> Samples: 465K of event 'cycles:u', Event count (approx.): 1707645730664
>    Children      Self  Command       Shared Object            Symbol
> +   28.46%    27.85%  qemu-riscv64  qemu-riscv64             [.] vext_ldst_us
> +   26.92%     0.00%  qemu-riscv64  [unknown]                [.] 0x00000000000000ff
> +   14.41%    14.41%  qemu-riscv64  qemu-riscv64             [.] qemu_plugin_vcpu_mem_cb
> +   13.85%    13.85%  qemu-riscv64  qemu-riscv64             [.] lde_b
> +   13.64%    13.64%  qemu-riscv64  qemu-riscv64             [.] cpu_stb_mmu
> +    9.25%     9.19%  qemu-riscv64  qemu-riscv64             [.] cpu_ldb_mmu
> +    7.81%     7.81%  qemu-riscv64  qemu-riscv64             [.] cpu_mmu_lookup
> +    7.70%     7.70%  qemu-riscv64  qemu-riscv64             [.] ste_b
> +    5.53%     0.00%  qemu-riscv64  qemu-riscv64             [.] adjust_addr (inlined)
> 
> 
> So this patchset tries to improve the performance of the RVV version of
> glibc memcpy on QEMU by improving the quality of the corresponding helper
> functions.
> 
> The overall performance improvement reaches the following numbers
> (depending on the copy size).
> Average: 2.86X / Smallest: 1.15X / Largest: 4.49X
> 
> PS: This RFC patchset focuses only on the vle8.v & vse8.v instructions;
> the next version or next series will cover the remaining vector ld/st
> instructions.

You are still not tackling the root problem, which is over-use of the full out-of-line 
load/store routines.  The reason that cpu_mmu_lookup is in that list is because you are 
performing the full virtual address resolution for each and every byte.

The only way to make a real improvement is to perform virtual address resolution *once* 
for the entire vector.  I refer to my previous advice:

https://gitlab.com/qemu-project/qemu/-/issues/2137#note_1757501369
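
A rough, self-contained sketch of that idea (an illustration only; it is
neither QEMU code nor exactly what the linked note proposes, and
toy_translate_page is a hypothetical stand-in for a page-granular
probe/TLB lookup; a real implementation still needs the per-element slow
path for faulting page crossings, MMIO, watchpoints, etc.):

#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096u

static uint8_t guest_ram[1 << 20];

/* Stand-in for one page-granular translation of the guest address. */
static uint8_t *toy_translate_page(uint64_t addr)
{
    uint64_t page = (addr % sizeof(guest_ram)) & ~(uint64_t)(PAGE_SIZE - 1);
    return &guest_ram[page];
}

/*
 * vle8.v-style load: translate once per page, then copy the bytes
 * inside that page directly from host memory.
 */
static void toy_vle8_fast(uint8_t *vd, uint64_t base, size_t vl)
{
    size_t done = 0;

    while (done < vl) {
        uint64_t addr = base + done;
        size_t in_page = PAGE_SIZE - (addr & (PAGE_SIZE - 1));
        size_t chunk = in_page < vl - done ? in_page : vl - done;
        uint8_t *host = toy_translate_page(addr) + (addr & (PAGE_SIZE - 1));

        /* One translation covers the whole chunk, not one per byte. */
        memcpy(vd + done, host, chunk);
        done += chunk;
    }
}

int main(void)
{
    uint8_t vreg[256];
    memset(guest_ram, 0xcd, sizeof(guest_ram));
    toy_vle8_fast(vreg, 0x1ff0, sizeof(vreg));  /* crosses a page boundary */
    return vreg[0] == 0xcd ? 0 : 1;
}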


r~
Re: [RFC PATCH 0/6] Improve the performance of RISC-V vector unit-stride ld/st instructions
Posted by Max Chou 8 months, 2 weeks ago
Hi Richard,

Thank you for the suggestion and the reference.
I'm trying to implement it by following the reference, and I'll send
another version.

Thanks a lot,
Max

On 2024/2/16 4:24 AM, Richard Henderson wrote:
>
> You are still not tackling the root problem, which is over-use of the 
> full out-of-line load/store routines.  The reason that cpu_mmu_lookup 
> is in that list is because you are performing the full virtual address 
> resolution for each and every byte.
>
> The only way to make a real improvement is to perform virtual address 
> resolution *once* for the entire vector.  I refer to my previous advice:
>
> https://gitlab.com/qemu-project/qemu/-/issues/2137#note_1757501369
>
>
> r~