[RFC 0/2] Improve the performance of unit-stride RVV ld/st on

Paolo Savini posted 2 patches 1 month, 3 weeks ago
target/riscv/vector_helper.c | 63 +++++++++++++++++++++++++++++++++++-
1 file changed, 62 insertions(+), 1 deletion(-)
Posted by Paolo Savini 1 month, 3 weeks ago
This series of patches builds on top of Max Chou's patches:

https://lore.kernel.org/all/20240613175122.1299212-1-max.chou@sifive.com/

The aim of these patches is to improve the performance of QEMU emulation
of RVV unit-stride load and store instructions in the following cases:

1. when the data being loaded/stored per iteration amounts to 8 bytes or less.
2. when the vector length is 16 bytes (VLEN=128) and there is no grouping of the
   vector registers (LMUL=1).
3. when the data being loaded/stored per iteration is more than 64 bytes.
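
As a rough illustration of how the three cases partition the work, here is
a minimal sketch of the selection logic. The enum, the function name
pick_path and its parameters are hypothetical and do not come from the
patches; LMUL is modelled as a plain integer, so fractional LMUL is not
covered.

```c
#include <stdint.h>

/* Hypothetical labels for the paths described in the cover letter. */
enum ldst_path {
    LDST_SIMPLE_LOOP,  /* cases 1 and 2: plain chunked load/store loop */
    LDST_DEFAULT,      /* everything else: existing code path */
    LDST_BULK_MEMCPY   /* case 3: large blocks copied with memcpy */
};

/*
 * Illustrative only: choose a path from the number of bytes handled per
 * iteration, VLEN in bits and an integer LMUL.
 */
static enum ldst_path pick_path(uint32_t bytes_per_iter,
                                uint32_t vlen_bits, uint32_t lmul)
{
    if (bytes_per_iter <= 8) {
        return LDST_SIMPLE_LOOP;          /* case 1 */
    }
    if (vlen_bits == 128 && lmul == 1) {
        return LDST_SIMPLE_LOOP;          /* case 2 */
    }
    if (bytes_per_iter > 64) {
        return LDST_BULK_MEMCPY;          /* case 3 */
    }
    return LDST_DEFAULT;
}
```

With VLEN=128 and LMUL=1 a register holds at most 16 bytes, so cases 2
and 3 cannot overlap and the order of the last two checks does not matter.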

In the first two cases the optimization avoids the overhead of probing
the host machine's RAM and instead performs a simple load/store loop over
the data, grouped in chunks of as many bytes as possible (8, 4, 2 or 1).
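
The chunking idea can be sketched in isolation as follows. This is not the
code from patch 1: copy_in_chunks is a hypothetical helper that works on
plain host buffers and ignores guest address translation, masking and
endianness, all of which the real helper has to handle.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: copy `len` bytes from `src` to `dst`, always using
 * the widest chunk (8, 4, 2 or 1 bytes) that still fits. Fixed-size memcpy
 * calls are used so that unaligned buffers are handled safely; compilers
 * usually turn them into single load/store instructions.
 */
static void copy_in_chunks(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t off = 0;

    while (len - off >= 8) {
        uint64_t v;
        memcpy(&v, src + off, sizeof(v));
        memcpy(dst + off, &v, sizeof(v));
        off += sizeof(v);
    }
    if (len - off >= 4) {
        uint32_t v;
        memcpy(&v, src + off, sizeof(v));
        memcpy(dst + off, &v, sizeof(v));
        off += sizeof(v);
    }
    if (len - off >= 2) {
        uint16_t v;
        memcpy(&v, src + off, sizeof(v));
        memcpy(dst + off, &v, sizeof(v));
        off += sizeof(v);
    }
    if (len - off >= 1) {
        dst[off] = src[off];
    }
}
```

For a 15-byte transfer, for example, this performs one 8-byte, one 4-byte,
one 2-byte and one 1-byte copy.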

The third case is optimized by calling the __builtin_memcpy function on
data chunks of 128 bytes and 256 bytes at a time.
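
A minimal sketch of the block-copy approach, taking the 128-byte and
256-byte chunk sizes stated above at face value. The function copy_large
is hypothetical and is not the code from patch 2; it relies on
__builtin_memcpy, which is available in GCC and Clang, and operates on
plain host buffers.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: copy as much of `len` as possible in 256-byte
 * blocks, then in one 128-byte block, and return the number of bytes
 * copied. Any remaining tail (fewer than 128 bytes) is left to the caller.
 */
static size_t copy_large(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t off = 0;

    while (len - off >= 256) {
        __builtin_memcpy(dst + off, src + off, 256);
        off += 256;
    }
    if (len - off >= 128) {
        __builtin_memcpy(dst + off, src + off, 128);
        off += 128;
    }
    return off;
}
```

Because the sizes passed to __builtin_memcpy are compile-time constants,
the compiler is free to expand each call inline, for instance with host
vector instructions, rather than emitting a library call.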

These patches, applied on top of Max Chou's patches, have been tested with
SPEC CPU 2017. On average they reduce the time QEMU needs to run the
benchmarks by 13% compared with the master branch of QEMU.

The source code under development is available here: https://github.com/embecosm/rise-rvv-tcg-qemu
Regular updates and more statistics about the patches are available here: https://github.com/embecosm/rise-rvv-tcg-qemu-reports

Changes:
- patch 1:
  - Modify vext_ldst_us to run the simple loop load/store if we
    are in one of the first two cases above.
- patch 2:
  - Modify vext_group_ldst_host to use __builtin_memcpy for data sizes
    of 128 bits and above.

Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Alistair Francis <alistair.francis@wdc.com>
Cc: Bin Meng <bmeng.cn@gmail.com>
Cc: Weiwei Li <liwei1518@gmail.com>
Cc: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
Cc: Liu Zhiwei <zhiwei_liu@linux.alibaba.com>
Cc: Helene Chelin <helene.chelin@embecosm.com>
Cc: Max Chou <max.chou@sifive.com>

Helene CHELIN (1):
  target/riscv: rvv: reduce the overhead for simple RISC-V vector
    unit-stride loads and stores

Paolo Savini (1):
  target/riscv: rvv: improve performance of RISC-V vector loads and
    stores on large amounts of data.

 target/riscv/vector_helper.c | 63 +++++++++++++++++++++++++++++++++++-
 1 file changed, 62 insertions(+), 1 deletion(-)

-- 
2.17.1
Re: [RFC 0/2] Improve the performance of unit-stride RVV ld/st on
Posted by Daniel Henrique Barboza 1 month, 1 week ago
Hi Paolo,


I suggest adding a "riscv:" prefix at the start of the cover letter subject in the
next version. This will make it easier for everyone else to quickly identify which
arch the patches are changing.

Other than that, and the checkpatch.pl style fixes, the series looks good to me.


Thanks,


Daniel


On 7/17/24 12:30 PM, Paolo Savini wrote:
> [...]