[RFC 2/2] target/riscv: rvv: improve performance of RISC-V vector loads and stores on large amounts of data.

Paolo Savini posted 2 patches 2 months ago
[RFC 2/2] target/riscv: rvv: improve performance of RISC-V vector loads and stores on large amounts of data.
Posted by Paolo Savini 2 months ago
This patch optimizes the emulation of unit-stride load/store RVV instructions
when the data being loaded/stored per iteration amounts to 64 bytes or more.
The optimization consists of calling __builtin_memcpy on chunks of data of 128
and 256 bytes between the memory address of the simulated vector register and
the destination memory address and vice versa.
This is done only if we have direct access to the RAM of the host machine.

Signed-off-by: Paolo Savini <paolo.savini@embecosm.com>
---
 target/riscv/vector_helper.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
index 4b444c6bc5..7674972784 100644
--- a/target/riscv/vector_helper.c
+++ b/target/riscv/vector_helper.c
@@ -486,7 +486,22 @@ vext_group_ldst_host(CPURISCVState *env, void *vd, uint32_t byte_end,
     }
 
     fn = fns[is_load][group_size];
-    fn(vd, byte_offset, host + byte_offset);
+
+    if (byte_offset + 32 < byte_end) {
+      group_size = MO_256;
+      if (is_load)
+        __builtin_memcpy((uint8_t *)(vd + byte_offset), (uint8_t *)(host + byte_offset), 32);
+      else
+        __builtin_memcpy((uint8_t *)(host + byte_offset), (uint8_t *)(vd + byte_offset), 32);
+    } else if (byte_offset + 16 < byte_end) {
+      group_size = MO_128;
+      if (is_load)
+        __builtin_memcpy((uint8_t *)(vd + byte_offset), (uint8_t *)(host + byte_offset), 16);
+      else
+        __builtin_memcpy((uint8_t *)(host + byte_offset), (uint8_t *)(vd + byte_offset), 16);
+    } else {
+      fn(vd, byte_offset, host + byte_offset);
+    }
 
     return 1 << group_size;
 }
-- 
2.17.1
Re: [RFC 2/2] target/riscv: rvv: improve performance of RISC-V vector loads and stores on large amounts of data.
Posted by Richard Henderson 1 month, 3 weeks ago
On 7/18/24 01:30, Paolo Savini wrote:
> This patch optimizes the emulation of unit-stride load/store RVV instructions
> when the data being loaded/stored per iteration amounts to 64 bytes or more.
> The optimization consists of calling __builtin_memcpy on chunks of data of 128
> and 256 bytes between the memory address of the simulated vector register and
> the destination memory address and vice versa.
> This is done only if we have direct access to the RAM of the host machine.
> 
> Signed-off-by: Paolo Savini <paolo.savini@embecosm.com>
> ---
>   target/riscv/vector_helper.c | 17 ++++++++++++++++-
>   1 file changed, 16 insertions(+), 1 deletion(-)
> 
> diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
> index 4b444c6bc5..7674972784 100644
> --- a/target/riscv/vector_helper.c
> +++ b/target/riscv/vector_helper.c
> @@ -486,7 +486,22 @@ vext_group_ldst_host(CPURISCVState *env, void *vd, uint32_t byte_end,
>       }
>   
>       fn = fns[is_load][group_size];
> -    fn(vd, byte_offset, host + byte_offset);
> +
> +    if (byte_offset + 32 < byte_end) {
> +      group_size = MO_256;
> +      if (is_load)
> +        __builtin_memcpy((uint8_t *)(vd + byte_offset), (uint8_t *)(host + byte_offset), 32);
> +      else
> +        __builtin_memcpy((uint8_t *)(host + byte_offset), (uint8_t *)(vd + byte_offset), 32);
> +    } else if (byte_offset + 16 < byte_end) {
> +      group_size = MO_128;
> +      if (is_load)
> +        __builtin_memcpy((uint8_t *)(vd + byte_offset), (uint8_t *)(host + byte_offset), 16);
> +      else
> +        __builtin_memcpy((uint8_t *)(host + byte_offset), (uint8_t *)(vd + byte_offset), 16);
> +    } else {
> +      fn(vd, byte_offset, host + byte_offset);
> +    }
>   

This will not work for big-endian hosts.

This may have atomicity issues, depending on the spec, the compiler options, and the host 
capabilities.


r~
Re: [RFC 2/2] target/riscv: rvv: improve performance of RISC-V vector loads and stores on large amounts of data.
Posted by Paolo Savini 6 days, 8 hours ago
Thanks for the feedback Richard, I'm working on the endianness. Could 
you please give me more details about the atomicity issues you are 
referring to?

Best wishes

Paolo

On 7/27/24 08:15, Richard Henderson wrote:
> On 7/18/24 01:30, Paolo Savini wrote:
>> This patch optimizes the emulation of unit-stride load/store RVV 
>> instructions
>> when the data being loaded/stored per iteration amounts to 64 bytes 
>> or more.
>> The optimization consists of calling __builtin_memcpy on chunks of 
>> data of 128
>> and 256 bytes between the memory address of the simulated vector 
>> register and
>> the destination memory address and vice versa.
>> This is done only if we have direct access to the RAM of the host 
>> machine.
>>
>> Signed-off-by: Paolo Savini <paolo.savini@embecosm.com>
>> ---
>>   target/riscv/vector_helper.c | 17 ++++++++++++++++-
>>   1 file changed, 16 insertions(+), 1 deletion(-)
>>
>> diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
>> index 4b444c6bc5..7674972784 100644
>> --- a/target/riscv/vector_helper.c
>> +++ b/target/riscv/vector_helper.c
>> @@ -486,7 +486,22 @@ vext_group_ldst_host(CPURISCVState *env, void 
>> *vd, uint32_t byte_end,
>>       }
>>         fn = fns[is_load][group_size];
>> -    fn(vd, byte_offset, host + byte_offset);
>> +
>> +    if (byte_offset + 32 < byte_end) {
>> +      group_size = MO_256;
>> +      if (is_load)
>> +        __builtin_memcpy((uint8_t *)(vd + byte_offset), (uint8_t 
>> *)(host + byte_offset), 32);
>> +      else
>> +        __builtin_memcpy((uint8_t *)(host + byte_offset), (uint8_t 
>> *)(vd + byte_offset), 32);
>> +    } else if (byte_offset + 16 < byte_end) {
>> +      group_size = MO_128;
>> +      if (is_load)
>> +        __builtin_memcpy((uint8_t *)(vd + byte_offset), (uint8_t 
>> *)(host + byte_offset), 16);
>> +      else
>> +        __builtin_memcpy((uint8_t *)(host + byte_offset), (uint8_t 
>> *)(vd + byte_offset), 16);
>> +    } else {
>> +      fn(vd, byte_offset, host + byte_offset);
>> +    }
>
> This will not work for big-endian hosts.
>
> This may have atomicity issues, depending on the spec, the compiler 
> options, and the host capabilities.
>
>
> r~
>

Re: [RFC 2/2] target/riscv: rvv: improve performance of RISC-V vector loads and stores on large amounts of data.
Posted by Richard Henderson 6 days, 1 hour ago
On 9/10/24 04:20, Paolo Savini wrote:
> Thanks for the feedback Richard, I'm working on the endianness. Could you please give me 
> more details about the atomicity issues you are referring to?

For instance a 32-bit atomic memory operation in the guest must be implemented with a >= 
32-bit atomic memory operation in the host.

The main thing to remember is that memcpy() has no atomicity guarantee.  It could be 
implemented as a byte loop.  Thus you may only use memcpy with guest byte vectors.



r~
Re: [RFC 2/2] target/riscv: rvv: improve performance of RISC-V vector loads and stores on large amounts of data.
Posted by Daniel Henrique Barboza 1 month, 3 weeks ago

On 7/17/24 12:30 PM, Paolo Savini wrote:
> This patch optimizes the emulation of unit-stride load/store RVV instructions
> when the data being loaded/stored per iteration amounts to 64 bytes or more.
> The optimization consists of calling __builtin_memcpy on chunks of data of 128
> and 256 bytes between the memory address of the simulated vector register and
> the destination memory address and vice versa.
> This is done only if we have direct access to the RAM of the host machine.
> 
> Signed-off-by: Paolo Savini <paolo.savini@embecosm.com>
> ---
>   target/riscv/vector_helper.c | 17 ++++++++++++++++-
>   1 file changed, 16 insertions(+), 1 deletion(-)
> 
> diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
> index 4b444c6bc5..7674972784 100644
> --- a/target/riscv/vector_helper.c
> +++ b/target/riscv/vector_helper.c
> @@ -486,7 +486,22 @@ vext_group_ldst_host(CPURISCVState *env, void *vd, uint32_t byte_end,
>       }
>   
>       fn = fns[is_load][group_size];
> -    fn(vd, byte_offset, host + byte_offset);
> +
> +    if (byte_offset + 32 < byte_end) {
> +      group_size = MO_256;
> +      if (is_load)
> +        __builtin_memcpy((uint8_t *)(vd + byte_offset), (uint8_t *)(host + byte_offset), 32);
> +      else
> +        __builtin_memcpy((uint8_t *)(host + byte_offset), (uint8_t *)(vd + byte_offset), 32);
> +    } else if (byte_offset + 16 < byte_end) {
> +      group_size = MO_128;
> +      if (is_load)
> +        __builtin_memcpy((uint8_t *)(vd + byte_offset), (uint8_t *)(host + byte_offset), 16);
> +      else
> +        __builtin_memcpy((uint8_t *)(host + byte_offset), (uint8_t *)(vd + byte_offset), 16);
> +    } else {
> +      fn(vd, byte_offset, host + byte_offset);
> +    }
>  

I see that we don't have any precedence with this particular built-in in the TCG code. We do have
some instances in other parts of QEMU though (e.g. util/guest-random.c).

If we're ok with adding these builtin calls in the execution helpers in TCG, and aside from the
style warnings that ./scripts/checkpatch.pl will give, LGTM.


Thanks,

Daniel

>       return 1 << group_size;
>   }