[PATCH 1/2] target/riscv/vector_helper.c: skip set tail when vta is zero

Daniel Henrique Barboza posted 2 patches 2 years, 9 months ago
Maintainers: Palmer Dabbelt <palmer@dabbelt.com>, Alistair Francis <alistair.francis@wdc.com>, Bin Meng <bin.meng@windriver.com>, Weiwei Li <liweiwei@iscas.ac.cn>, Daniel Henrique Barboza <dbarboza@ventanamicro.com>, Liu Zhiwei <zhiwei_liu@linux.alibaba.com>
[PATCH 1/2] target/riscv/vector_helper.c: skip set tail when vta is zero
Posted by Daniel Henrique Barboza 2 years, 9 months ago
vext_set_tail_elems_1s() is a no-op if 'vta' is zero, but we still do all
of its setup work regardless. Every vext_set_elems_1s() call then does
nothing (since vta is zero), so that work is wasted.

Return early in this case. Aside from the code simplification, there's a
noticeable emulation performance gain from doing it. For a regular C
binary that does a vector operation like this:

=======
 #include <stdlib.h>
 #define SZ 10000000

int main ()
{
  int *a = malloc (SZ * sizeof (int));
  int *b = malloc (SZ * sizeof (int));
  int *c = malloc (SZ * sizeof (int));

  for (int i = 0; i < SZ; i++)
    c[i] = a[i] + b[i];
  return c[SZ - 1];
}
=======

Emulating it with qemu-riscv64 and RVV takes ~0.3 sec:

$ time ~/work/qemu/build/qemu-riscv64 \
    -cpu rv64,debug=false,vext_spec=v1.0,v=true,vlen=128 ./foo.out

real    0m0.303s
user    0m0.281s
sys     0m0.023s

With this skip we take ~0.275 sec:

$ time ~/work/qemu/build/qemu-riscv64 \
    -cpu rv64,debug=false,vext_spec=v1.0,v=true,vlen=128 ./foo.out

real    0m0.274s
user    0m0.252s
sys     0m0.019s

This performance gain adds up fast when executing heavy benchmarks like
SPEC.
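
As a quick check of the figures above, the relative saving can be worked
out from the reported `time` output. The helper name percent_saved() is
made up for illustration and is not part of QEMU; the inputs are the
single-run numbers quoted in this message, so the result is indicative
only.

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of QEMU): percentage of run time saved
 * when going from 'before' seconds to 'after' seconds.
 */
double percent_saved(double before, double after)
{
    assert(before > 0.0);   /* avoid dividing by zero */
    return (before - after) / before * 100.0;
}
```

With the numbers above, percent_saved(0.303, 0.274) comes to roughly 9.6%
of wall-clock time and percent_saved(0.281, 0.252) to roughly 10.3% of
user time.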

Signed-off-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
---
 target/riscv/vector_helper.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
index f4d0438988..8e6c99e573 100644
--- a/target/riscv/vector_helper.c
+++ b/target/riscv/vector_helper.c
@@ -268,12 +268,17 @@ static void vext_set_tail_elems_1s(CPURISCVState *env, target_ulong vl,
                                    void *vd, uint32_t desc, uint32_t nf,
                                    uint32_t esz, uint32_t max_elems)
 {
-    uint32_t total_elems = vext_get_total_elems(env, desc, esz);
-    uint32_t vlenb = riscv_cpu_cfg(env)->vlen >> 3;
+    uint32_t total_elems, vlenb, registers_used;
     uint32_t vta = vext_vta(desc);
-    uint32_t registers_used;
     int k;
 
+    if (vta == 0) {
+        return;
+    }
+
+    total_elems = vext_get_total_elems(env, desc, esz);
+    vlenb = riscv_cpu_cfg(env)->vlen >> 3;
+
     for (k = 0; k < nf; ++k) {
         vext_set_elems_1s(vd, vta, (k * max_elems + vl) * esz,
                           (k * max_elems + max_elems) * esz);
-- 
2.40.0
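
For readers without the QEMU tree at hand, the following is a minimal
standalone sketch of the behaviour the patch relies on. It is a
simplified model under stated assumptions, not the QEMU code: the names
set_elems_1s_model() and set_tail_elems_1s_model() are invented here,
'vta' is passed in directly instead of being decoded from 'desc', and the
rest of the real function (the part that uses total_elems, vlenb and
registers_used, which lies outside the diff context) is not modeled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Assumed stand-in for vext_set_elems_1s(): set bytes [cnt, tot) of
 * 'base' to all 1s, unless the policy is "undisturbed" (flag is zero).
 */
static void set_elems_1s_model(void *base, uint32_t is_agnostic,
                               uint32_t cnt, uint32_t tot)
{
    if (is_agnostic == 0 || tot == cnt) {
        return;
    }
    memset((uint8_t *)base + cnt, 0xff, tot - cnt);
}

/*
 * Simplified model of the patched helper: with vta == 0 every call in
 * the loop would be a no-op, so return before doing any work.
 */
void set_tail_elems_1s_model(uint32_t vl, void *vd, uint32_t vta,
                             uint32_t nf, uint32_t esz, uint32_t max_elems)
{
    uint32_t k;

    if (vta == 0) {
        return;
    }

    assert(vl <= max_elems);    /* assumption of this model */

    for (k = 0; k < nf; ++k) {
        /* Tail of field k: elements [vl, max_elems), scaled to bytes. */
        set_elems_1s_model(vd, vta, (k * max_elems + vl) * esz,
                           (k * max_elems + max_elems) * esz);
    }
}
```

For example, with vl = 3, esz = 4, max_elems = 4 and nf = 1, the tail is
bytes 12 to 15 of 'vd': they are left untouched when vta is 0 and set to
0xff when vta is 1.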
Re: [PATCH 1/2] target/riscv/vector_helper.c: skip set tail when vta is zero
Posted by Alistair Francis 2 years, 9 months ago
On Fri, Apr 28, 2023 at 6:58 AM Daniel Henrique Barboza
<dbarboza@ventanamicro.com> wrote:
> Signed-off-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>

Thanks!

Applied to riscv-to-apply.next

Alistair

Re: [PATCH 1/2] target/riscv/vector_helper.c: skip set tail when vta is zero
Posted by Alistair Francis 2 years, 9 months ago
On Fri, Apr 28, 2023 at 6:58 AM Daniel Henrique Barboza
<dbarboza@ventanamicro.com> wrote:
> Signed-off-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>

Acked-by: Alistair Francis <alistair.francis@wdc.com>

Alistair

Re: [PATCH 1/2] target/riscv/vector_helper.c: skip set tail when vta is zero
Posted by Weiwei Li 2 years, 9 months ago
On 2023/4/28 04:57, Daniel Henrique Barboza wrote:
> Signed-off-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
> ---

Reviewed-by: Weiwei Li <liweiwei@iscas.ac.cn>

Weiwei Li
