Series comparison

-[RFC 0/1] target/riscv: use tcg ops generation to emulate whole reg rvv loads/stores.
+[PATCH 0/1 v4] target/riscv: use tcg ops generation to emulate whole reg rvv loads/stores.
-Following the reviews on the previous version:
+Previous versions:
 - RFC v1: https://lore.kernel.org/all/20241218170840.1090473-1-paolo.savini@embecosm.com/
-- Review: https://lore.kernel.org/all/e8fb908d-4723-417a-bf88-b4050432ddad@linaro.org/
+- RFC v2: https://lore.kernel.org/all/20241220153834.16302-1-paolo.savini@embecosm.com/
 - RFC v3: https://lore.kernel.org/all/20250122164905.13615-1-paolo.savini@embecosm.com/
-we apply the following fixes:
+Version v4 of this patch brings the following changes:
-- Fall back to using the helper function if vstart != 0 at the beginning
+- removed the host specific conditions so that the behaviour of the emulation
-  of the iterations and refactor the setting of the function arguments
+  doesn't depend on the host we are running on.
-  accordingly.
+  The introduction of this extra complexity is not worth the very marginal
-- Add mark_vs_dirty before performing the memory operations.
+  performance improvement, when the overall performance improves anyway
-- Loosen the atomicity constraints and apply only MO_ATOM_IFALIGN_PAIR
+  considerably without it.
-  for element sizes MO_16, MO_32 and MO_64.
+- added reviewers' contacts (thanks all for reviewing the work).
-- Change the way we update vstart in order to set vstart to 0 if it's the last
+- changed the header from RFC to PATCH.
   iteration.
 - Fix the indentation.
 We also rephrase the commit message to better reflect the new behaviour of the
 patch.
 Many thanks to Richard for the thorough review and explanations.
 Cc: Richard Henderson <richard.henderson@linaro.org>
 Cc: Palmer Dabbelt <palmer@dabbelt.com>
 Cc: Alistair Francis <alistair.francis@wdc.com>
 Cc: Bin Meng <bmeng.cn@gmail.com>
 ...
 Paolo Savini (1):
   target/riscv: use tcg ops generation to emulate whole reg rvv
     loads/stores.
- target/riscv/insn_trans/trans_rvv.c.inc | 125 +++++++++++++++---------
+ target/riscv/insn_trans/trans_rvv.c.inc | 155 +++++++++++++++++-------
- 1 file changed, 78 insertions(+), 47 deletions(-)
+ 1 file changed, 108 insertions(+), 47 deletions(-)
 --
 2.34.1

-[PATCH 1/1] target/riscv: use tcg ops generation to emulate whole reg rvv loads/stores.
+[PATCH 1/1 v4] target/riscv: use tcg ops generation to emulate whole reg rvv loads/stores.
 This patch replaces the use of a helper function with direct tcg ops generation
 in order to emulate whole register loads and stores. This is done in order to
 improve the performance of QEMU.
 We still use the helper function when vstart is not 0 at the beginning of the
-emulation of the whole register load or store.
+emulation of the whole register load or store or when we would end up generating
 partial loads or stores of vector elements (e.g. emulating 64-bit element loads
 with pairs of 32-bit loads on hosts with 32-bit registers).
 The latter condition ensures that we are not surprised by a trap in mid-element
 and consequently that we can update vstart correctly.
 We also use the helper function when it performs better than tcg for specific
 combinations of vector length, number of fields and element size.
 Signed-off-by: Paolo Savini <paolo.savini@embecosm.com>
+Reviewed-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
+Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
+Reviewed-by: Max Chou <max.chou@sifive.com>
+Reviewed-by: "Alex Bennée" <alex.bennee@linaro.org>
 ---
- target/riscv/insn_trans/trans_rvv.c.inc | 125 +++++++++++++++---------
+ target/riscv/insn_trans/trans_rvv.c.inc | 155 +++++++++++++++++-------
- 1 file changed, 78 insertions(+), 47 deletions(-)
+ 1 file changed, 108 insertions(+), 47 deletions(-)
 diff --git a/target/riscv/insn_trans/trans_rvv.c.inc b/target/riscv/insn_trans/trans_rvv.c.inc
 index XXXXXXX..XXXXXXX 100644
 --- a/target/riscv/insn_trans/trans_rvv.c.inc
 +++ b/target/riscv/insn_trans/trans_rvv.c.inc
 ...
 -
      mark_vs_dirty(s);
 -    fn(dest, base, tcg_env, desc);
 +    /*
-+     * Load/store minimum vlenb bytes per iteration.
++     * Load/store multiple bytes per iteration.
 +     * When possible do this atomically.
 +     * Update vstart with the number of processed elements.
++     * Use the helper function if either:
++     * - vstart is not 0.
++     * - the host (TCG target) has 32-bit registers and we are loading/storing
++     *   64-bit elements. This is to ensure that we process every element with
++     *   a single memory instruction.
 +     */
-+    if (s->vstart_eq_zero) {
++
 +    bool use_helper_fn = !(s->vstart_eq_zero) ||
 +                          (TCG_TARGET_REG_BITS == 32 && log2_esz == 3);
 +
 +    if (!use_helper_fn) {
 +        TCGv addr = tcg_temp_new();
 +        uint32_t size = s->cfg_ptr->vlenb * nf;
-+        TCGv_i128 t16 = tcg_temp_new_i128();
++        TCGv_i64 t8 = tcg_temp_new_i64();
 +        TCGv_i32 t4 = tcg_temp_new_i32();
 +        MemOp atomicity = MO_ATOM_NONE;
 +        if (log2_esz == 0) {
 +            atomicity = MO_ATOM_NONE;
 +        } else {
 +            atomicity = MO_ATOM_IFALIGN_PAIR;
 +        }
-+        for (int i = 0; i < size; i += 16) {
++        if (TCG_TARGET_REG_BITS == 64) {
-+            addr = get_address(s, rs1, i);
++            for (int i = 0; i < size; i += 8) {
-+            if (is_load) {
++                addr = get_address(s, rs1, i);
-+                tcg_gen_qemu_ld_i128(t16, addr, s->mem_idx,
++                if (is_load) {
-+                        MO_LE | MO_128 | atomicity);
++                    tcg_gen_qemu_ld_i64(t8, addr, s->mem_idx,
-+                tcg_gen_st_i128(t16, tcg_env, vreg_ofs(s, vd) + i);
++                            MO_LE | MO_64 | atomicity);
-+            } else {
++                    tcg_gen_st_i64(t8, tcg_env, vreg_ofs(s, vd) + i);
-+                tcg_gen_ld_i128(t16, tcg_env, vreg_ofs(s, vd) + i);
++                } else {
-+                tcg_gen_qemu_st_i128(t16, addr, s->mem_idx,
++                    tcg_gen_ld_i64(t8, tcg_env, vreg_ofs(s, vd) + i);
-+                        MO_LE | MO_128 | atomicity);
++                    tcg_gen_qemu_st_i64(t8, addr, s->mem_idx,
 +                            MO_LE | MO_64 | atomicity);
 +                }
 +                if (i == size - 8) {
 +                    tcg_gen_movi_tl(cpu_vstart, 0);
 +                } else {
 +                    tcg_gen_addi_tl(cpu_vstart, cpu_vstart, 8 >> log2_esz);
 +                }
 +            }
-+            if (i == size - 16) {
++        } else {
-+                tcg_gen_movi_tl(cpu_vstart, 0);
++            for (int i = 0; i < size; i += 4) {
-+            } else {
++                addr = get_address(s, rs1, i);
-+                tcg_gen_addi_tl(cpu_vstart, cpu_vstart, 16 >> log2_esz);
++                if (is_load) {
 +                    tcg_gen_qemu_ld_i32(t4, addr, s->mem_idx,
 +                            MO_LE | MO_32 | atomicity);
 +                    tcg_gen_st_i32(t4, tcg_env, vreg_ofs(s, vd) + i);
 +                } else {
 +                    tcg_gen_ld_i32(t4, tcg_env, vreg_ofs(s, vd) + i);
 +                    tcg_gen_qemu_st_i32(t4, addr, s->mem_idx,
 +                            MO_LE | MO_32 | atomicity);
 +                }
 +                if (i == size - 4) {
 +                    tcg_gen_movi_tl(cpu_vstart, 0);
 +                } else {
 +                    tcg_gen_addi_tl(cpu_vstart, cpu_vstart, 4 >> log2_esz);
 +                }
 +            }
 +        }
 +    } else {
 +        TCGv_ptr dest;
 +        TCGv base;
 ...
  /*
   *** Vector Integer Arithmetic Instructions
 --
 2.34.1

Following the reviews on the previous version:

- RFC v1: https://lore.kernel.org/all/20241218170840.1090473-1-paolo.savini@embecosm.com/
- Review: https://lore.kernel.org/all/e8fb908d-4723-417a-bf88-b4050432ddad@linaro.org/

we apply the following fixes:

- Fall back to using the helper function if vstart != 0 at the beginning
  of the iterations and refactor the setting of the function arguments
  accordingly.
- Add mark_vs_dirty before performing the memory operations.
- Loosen the atomicity constraints and apply only MO_ATOM_IFALIGN_PAIR
  for element sizes MO_16, MO_32 and MO_64.
- Change the way we update vstart in order to set vstart to 0 if it's the last
  iteration.
- Fix the indentation.

We also rephrase the commit message to better reflect the new behaviour of the
patch.

Many thanks to Richard for the thorough review and explanations.

Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Alistair Francis <alistair.francis@wdc.com>
Cc: Bin Meng <bmeng.cn@gmail.com>
Cc: Weiwei Li <liwei1518@gmail.com>
Cc: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
Cc: Liu Zhiwei <zhiwei_liu@linux.alibaba.com>
Cc: Helene Chelin <helene.chelin@embecosm.com>
Cc: Nathan Egge <negge@google.com>
Cc: Max Chou <max.chou@sifive.com>
Cc: Jeremy Bennett <jeremy.bennett@embecosm.com>
Cc: Craig Blackmore <craig.blackmore@embecosm.com>

Paolo Savini (1):
  target/riscv: use tcg ops generation to emulate whole reg rvv
    loads/stores.

target/riscv/insn_trans/trans_rvv.c.inc | 125 +++++++++++++++---------
 1 file changed, 78 insertions(+), 47 deletions(-)

-- 
2.34.1

Signed-off-by: Paolo Savini <paolo.savini@embecosm.com>
---
 target/riscv/insn_trans/trans_rvv.c.inc | 125 +++++++++++++++---------
 1 file changed, 78 insertions(+), 47 deletions(-)

diff --git a/target/riscv/insn_trans/trans_rvv.c.inc b/target/riscv/insn_trans/trans_rvv.c.inc
index XXXXXXX..XXXXXXX 100644
--- a/target/riscv/insn_trans/trans_rvv.c.inc
+++ b/target/riscv/insn_trans/trans_rvv.c.inc
@@ -XXX,XX +XXX,XX @@ GEN_VEXT_TRANS(vle64ff_v, MO_64, r2nfvm, ldff_op, ld_us_check)
 typedef void gen_helper_ldst_whole(TCGv_ptr, TCGv, TCGv_env, TCGv_i32);
 
 static bool ldst_whole_trans(uint32_t vd, uint32_t rs1, uint32_t nf,
-                             gen_helper_ldst_whole *fn,
-                             DisasContext *s)
+                             uint32_t log2_esz, gen_helper_ldst_whole *fn,
+                             DisasContext *s, bool is_load)
 {
-    TCGv_ptr dest;
-    TCGv base;
-    TCGv_i32 desc;
-
-    uint32_t data = FIELD_DP32(0, VDATA, NF, nf);
-    data = FIELD_DP32(data, VDATA, VM, 1);
-    dest = tcg_temp_new_ptr();
-    desc = tcg_constant_i32(simd_desc(s->cfg_ptr->vlenb,
-                                      s->cfg_ptr->vlenb, data));
-
-    base = get_gpr(s, rs1, EXT_NONE);
-    tcg_gen_addi_ptr(dest, tcg_env, vreg_ofs(s, vd));
-
     mark_vs_dirty(s);
 
-    fn(dest, base, tcg_env, desc);
+    /*
+     * Load/store minimum vlenb bytes per iteration.
+     * When possible do this atomically.
+     * Update vstart with the number of processed elements.
+     */
+    if (s->vstart_eq_zero) {
+        TCGv addr = tcg_temp_new();
+        uint32_t size = s->cfg_ptr->vlenb * nf;
+        TCGv_i128 t16 = tcg_temp_new_i128();
+        MemOp atomicity = MO_ATOM_NONE;
+        if (log2_esz == 0) {
+            atomicity = MO_ATOM_NONE;
+        } else {
+            atomicity = MO_ATOM_IFALIGN_PAIR;
+        }
+        for (int i = 0; i < size; i += 16) {
+            addr = get_address(s, rs1, i);
+            if (is_load) {
+                tcg_gen_qemu_ld_i128(t16, addr, s->mem_idx,
+                        MO_LE | MO_128 | atomicity);
+                tcg_gen_st_i128(t16, tcg_env, vreg_ofs(s, vd) + i);
+            } else {
+                tcg_gen_ld_i128(t16, tcg_env, vreg_ofs(s, vd) + i);
+                tcg_gen_qemu_st_i128(t16, addr, s->mem_idx,
+                        MO_LE | MO_128 | atomicity);
+            }
+            if (i == size - 16) {
+                tcg_gen_movi_tl(cpu_vstart, 0);
+            } else {
+                tcg_gen_addi_tl(cpu_vstart, cpu_vstart, 16 >> log2_esz);
+            }
+        }
+    } else {
+        TCGv_ptr dest;
+        TCGv base;
+        TCGv_i32 desc;
+        uint32_t data = FIELD_DP32(0, VDATA, NF, nf);
+        data = FIELD_DP32(data, VDATA, VM, 1);
+        dest = tcg_temp_new_ptr();
+        desc = tcg_constant_i32(simd_desc(s->cfg_ptr->vlenb,
+                        s->cfg_ptr->vlenb, data));
+        base = get_gpr(s, rs1, EXT_NONE);
+        tcg_gen_addi_ptr(dest, tcg_env, vreg_ofs(s, vd));
+        fn(dest, base, tcg_env, desc);
+    }
 
     finalize_rvv_inst(s);
     return true;
@@ -XXX,XX +XXX,XX @@ static bool ldst_whole_trans(uint32_t vd, uint32_t rs1, uint32_t nf,
  * load and store whole register instructions ignore vtype and vl setting.
  * Thus, we don't need to check vill bit. (Section 7.9)
  */
-#define GEN_LDST_WHOLE_TRANS(NAME, ARG_NF)                                \
-static bool trans_##NAME(DisasContext *s, arg_##NAME * a)                 \
-{                                                                         \
-    if (require_rvv(s) &&                                                 \
-        QEMU_IS_ALIGNED(a->rd, ARG_NF)) {                                 \
-        return ldst_whole_trans(a->rd, a->rs1, ARG_NF,                    \
-                                gen_helper_##NAME, s);                    \
-    }                                                                     \
-    return false;                                                         \
-}
-
-GEN_LDST_WHOLE_TRANS(vl1re8_v,  1)
-GEN_LDST_WHOLE_TRANS(vl1re16_v, 1)
-GEN_LDST_WHOLE_TRANS(vl1re32_v, 1)
-GEN_LDST_WHOLE_TRANS(vl1re64_v, 1)
-GEN_LDST_WHOLE_TRANS(vl2re8_v,  2)
-GEN_LDST_WHOLE_TRANS(vl2re16_v, 2)
-GEN_LDST_WHOLE_TRANS(vl2re32_v, 2)
-GEN_LDST_WHOLE_TRANS(vl2re64_v, 2)
-GEN_LDST_WHOLE_TRANS(vl4re8_v,  4)
-GEN_LDST_WHOLE_TRANS(vl4re16_v, 4)
-GEN_LDST_WHOLE_TRANS(vl4re32_v, 4)
-GEN_LDST_WHOLE_TRANS(vl4re64_v, 4)
-GEN_LDST_WHOLE_TRANS(vl8re8_v,  8)
-GEN_LDST_WHOLE_TRANS(vl8re16_v, 8)
-GEN_LDST_WHOLE_TRANS(vl8re32_v, 8)
-GEN_LDST_WHOLE_TRANS(vl8re64_v, 8)
+#define GEN_LDST_WHOLE_TRANS(NAME, ETYPE, ARG_NF, IS_LOAD)                  \
+static bool trans_##NAME(DisasContext *s, arg_##NAME * a)                   \
+{                                                                           \
+    if (require_rvv(s) &&                                                   \
+        QEMU_IS_ALIGNED(a->rd, ARG_NF)) {                                   \
+        return ldst_whole_trans(a->rd, a->rs1, ARG_NF, ctzl(sizeof(ETYPE)), \
+                                gen_helper_##NAME, s, IS_LOAD);             \
+    }                                                                       \
+    return false;                                                           \
+}
+
+GEN_LDST_WHOLE_TRANS(vl1re8_v,  int8_t,  1, true)
+GEN_LDST_WHOLE_TRANS(vl1re16_v, int16_t, 1, true)
+GEN_LDST_WHOLE_TRANS(vl1re32_v, int32_t, 1, true)
+GEN_LDST_WHOLE_TRANS(vl1re64_v, int64_t, 1, true)
+GEN_LDST_WHOLE_TRANS(vl2re8_v,  int8_t,  2, true)
+GEN_LDST_WHOLE_TRANS(vl2re16_v, int16_t, 2, true)
+GEN_LDST_WHOLE_TRANS(vl2re32_v, int32_t, 2, true)
+GEN_LDST_WHOLE_TRANS(vl2re64_v, int64_t, 2, true)
+GEN_LDST_WHOLE_TRANS(vl4re8_v,  int8_t,  4, true)
+GEN_LDST_WHOLE_TRANS(vl4re16_v, int16_t, 4, true)
+GEN_LDST_WHOLE_TRANS(vl4re32_v, int32_t, 4, true)
+GEN_LDST_WHOLE_TRANS(vl4re64_v, int64_t, 4, true)
+GEN_LDST_WHOLE_TRANS(vl8re8_v,  int8_t,  8, true)
+GEN_LDST_WHOLE_TRANS(vl8re16_v, int16_t, 8, true)
+GEN_LDST_WHOLE_TRANS(vl8re32_v, int32_t, 8, true)
+GEN_LDST_WHOLE_TRANS(vl8re64_v, int64_t, 8, true)
 
 /*
  * The vector whole register store instructions are encoded similar to
  * unmasked unit-stride store of elements with EEW=8.
  */
-GEN_LDST_WHOLE_TRANS(vs1r_v, 1)
-GEN_LDST_WHOLE_TRANS(vs2r_v, 2)
-GEN_LDST_WHOLE_TRANS(vs4r_v, 4)
-GEN_LDST_WHOLE_TRANS(vs8r_v, 8)
+GEN_LDST_WHOLE_TRANS(vs1r_v, int8_t, 1, false)
+GEN_LDST_WHOLE_TRANS(vs2r_v, int8_t, 2, false)
+GEN_LDST_WHOLE_TRANS(vs4r_v, int8_t, 4, false)
+GEN_LDST_WHOLE_TRANS(vs8r_v, int8_t, 8, false)
 
 /*
  *** Vector Integer Arithmetic Instructions
-- 
2.34.1

Previous versions:

- RFC v1: https://lore.kernel.org/all/20241218170840.1090473-1-paolo.savini@embecosm.com/
- RFC v2: https://lore.kernel.org/all/20241220153834.16302-1-paolo.savini@embecosm.com/
- RFC v3: https://lore.kernel.org/all/20250122164905.13615-1-paolo.savini@embecosm.com/

Version v4 of this patch brings the following changes:

- removed the host specific conditions so that the behaviour of the emulation
  doesn't depend on the host we are running on.
  The introduction of this extra complexity is not worth the very marginal
  performance improvement, when the overall performance improves anyway
  considerably without it.
- added reviewers' contacts (thanks all for reviewing the work).
- changed the header from RFC to PATCH.

Paolo Savini (1):
  target/riscv: use tcg ops generation to emulate whole reg rvv
    loads/stores.

target/riscv/insn_trans/trans_rvv.c.inc | 155 +++++++++++++++++-------
 1 file changed, 108 insertions(+), 47 deletions(-)

-- 
2.34.1

This patch replaces the use of a helper function with direct tcg ops generation
in order to emulate whole register loads and stores. This is done in order to
improve the performance of QEMU.
We still use the helper function when vstart is not 0 at the beginning of the
emulation of the whole register load or store or when we would end up generating
partial loads or stores of vector elements (e.g. emulating 64-bit element loads
with pairs of 32-bit loads on hosts with 32-bit registers).
The latter condition ensures that we are not surprised by a trap in mid-element
and consequently that we can update vstart correctly.
We also use the helper function when it performs better than tcg for specific
combinations of vector length, number of fields and element size.

Signed-off-by: Paolo Savini <paolo.savini@embecosm.com>
Reviewed-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Max Chou <max.chou@sifive.com>
Reviewed-by: "Alex Bennée" <alex.bennee@linaro.org>
---
 target/riscv/insn_trans/trans_rvv.c.inc | 155 +++++++++++++++++-------
 1 file changed, 108 insertions(+), 47 deletions(-)

diff --git a/target/riscv/insn_trans/trans_rvv.c.inc b/target/riscv/insn_trans/trans_rvv.c.inc
index XXXXXXX..XXXXXXX 100644
--- a/target/riscv/insn_trans/trans_rvv.c.inc
+++ b/target/riscv/insn_trans/trans_rvv.c.inc
@@ -XXX,XX +XXX,XX @@ GEN_VEXT_TRANS(vle64ff_v, MO_64, r2nfvm, ldff_op, ld_us_check)
 typedef void gen_helper_ldst_whole(TCGv_ptr, TCGv, TCGv_env, TCGv_i32);
 
 static bool ldst_whole_trans(uint32_t vd, uint32_t rs1, uint32_t nf,
-                             gen_helper_ldst_whole *fn,
-                             DisasContext *s)
+                             uint32_t log2_esz, gen_helper_ldst_whole *fn,
+                             DisasContext *s, bool is_load)
 {
-    TCGv_ptr dest;
-    TCGv base;
-    TCGv_i32 desc;
-
-    uint32_t data = FIELD_DP32(0, VDATA, NF, nf);
-    data = FIELD_DP32(data, VDATA, VM, 1);
-    dest = tcg_temp_new_ptr();
-    desc = tcg_constant_i32(simd_desc(s->cfg_ptr->vlenb,
-                                      s->cfg_ptr->vlenb, data));
-
-    base = get_gpr(s, rs1, EXT_NONE);
-    tcg_gen_addi_ptr(dest, tcg_env, vreg_ofs(s, vd));
-
     mark_vs_dirty(s);
 
-    fn(dest, base, tcg_env, desc);
+    /*
+     * Load/store multiple bytes per iteration.
+     * When possible do this atomically.
+     * Update vstart with the number of processed elements.
+     * Use the helper function if either:
+     * - vstart is not 0.
+     * - the host (TCG target) has 32-bit registers and we are loading/storing
+     *   64-bit elements. This is to ensure that we process every element with
+     *   a single memory instruction.
+     */
+
+    bool use_helper_fn = !(s->vstart_eq_zero) ||
+                          (TCG_TARGET_REG_BITS == 32 && log2_esz == 3);
+
+    if (!use_helper_fn) {
+        TCGv addr = tcg_temp_new();
+        uint32_t size = s->cfg_ptr->vlenb * nf;
+        TCGv_i64 t8 = tcg_temp_new_i64();
+        TCGv_i32 t4 = tcg_temp_new_i32();
+        MemOp atomicity = MO_ATOM_NONE;
+        if (log2_esz == 0) {
+            atomicity = MO_ATOM_NONE;
+        } else {
+            atomicity = MO_ATOM_IFALIGN_PAIR;
+        }
+        if (TCG_TARGET_REG_BITS == 64) {
+            for (int i = 0; i < size; i += 8) {
+                addr = get_address(s, rs1, i);
+                if (is_load) {
+                    tcg_gen_qemu_ld_i64(t8, addr, s->mem_idx,
+                            MO_LE | MO_64 | atomicity);
+                    tcg_gen_st_i64(t8, tcg_env, vreg_ofs(s, vd) + i);
+                } else {
+                    tcg_gen_ld_i64(t8, tcg_env, vreg_ofs(s, vd) + i);
+                    tcg_gen_qemu_st_i64(t8, addr, s->mem_idx,
+                            MO_LE | MO_64 | atomicity);
+                }
+                if (i == size - 8) {
+                    tcg_gen_movi_tl(cpu_vstart, 0);
+                } else {
+                    tcg_gen_addi_tl(cpu_vstart, cpu_vstart, 8 >> log2_esz);
+                }
+            }
+        } else {
+            for (int i = 0; i < size; i += 4) {
+                addr = get_address(s, rs1, i);
+                if (is_load) {
+                    tcg_gen_qemu_ld_i32(t4, addr, s->mem_idx,
+                            MO_LE | MO_32 | atomicity);
+                    tcg_gen_st_i32(t4, tcg_env, vreg_ofs(s, vd) + i);
+                } else {
+                    tcg_gen_ld_i32(t4, tcg_env, vreg_ofs(s, vd) + i);
+                    tcg_gen_qemu_st_i32(t4, addr, s->mem_idx,
+                            MO_LE | MO_32 | atomicity);
+                }
+                if (i == size - 4) {
+                    tcg_gen_movi_tl(cpu_vstart, 0);
+                } else {
+                    tcg_gen_addi_tl(cpu_vstart, cpu_vstart, 4 >> log2_esz);
+                }
+            }
+        }
+    } else {
+        TCGv_ptr dest;
+        TCGv base;
+        TCGv_i32 desc;
+        uint32_t data = FIELD_DP32(0, VDATA, NF, nf);
+        data = FIELD_DP32(data, VDATA, VM, 1);
+        dest = tcg_temp_new_ptr();
+        desc = tcg_constant_i32(simd_desc(s->cfg_ptr->vlenb,
+                        s->cfg_ptr->vlenb, data));
+        base = get_gpr(s, rs1, EXT_NONE);
+        tcg_gen_addi_ptr(dest, tcg_env, vreg_ofs(s, vd));
+        fn(dest, base, tcg_env, desc);
+    }
 
     finalize_rvv_inst(s);
     return true;
@@ -XXX,XX +XXX,XX @@ static bool ldst_whole_trans(uint32_t vd, uint32_t rs1, uint32_t nf,
  * load and store whole register instructions ignore vtype and vl setting.
  * Thus, we don't need to check vill bit. (Section 7.9)
  */
-#define GEN_LDST_WHOLE_TRANS(NAME, ARG_NF)                                \
-static bool trans_##NAME(DisasContext *s, arg_##NAME * a)                 \
-{                                                                         \
-    if (require_rvv(s) &&                                                 \
-        QEMU_IS_ALIGNED(a->rd, ARG_NF)) {                                 \
-        return ldst_whole_trans(a->rd, a->rs1, ARG_NF,                    \
-                                gen_helper_##NAME, s);                    \
-    }                                                                     \
-    return false;                                                         \
-}
-
-GEN_LDST_WHOLE_TRANS(vl1re8_v,  1)
-GEN_LDST_WHOLE_TRANS(vl1re16_v, 1)
-GEN_LDST_WHOLE_TRANS(vl1re32_v, 1)
-GEN_LDST_WHOLE_TRANS(vl1re64_v, 1)
-GEN_LDST_WHOLE_TRANS(vl2re8_v,  2)
-GEN_LDST_WHOLE_TRANS(vl2re16_v, 2)
-GEN_LDST_WHOLE_TRANS(vl2re32_v, 2)
-GEN_LDST_WHOLE_TRANS(vl2re64_v, 2)
-GEN_LDST_WHOLE_TRANS(vl4re8_v,  4)
-GEN_LDST_WHOLE_TRANS(vl4re16_v, 4)
-GEN_LDST_WHOLE_TRANS(vl4re32_v, 4)
-GEN_LDST_WHOLE_TRANS(vl4re64_v, 4)
-GEN_LDST_WHOLE_TRANS(vl8re8_v,  8)
-GEN_LDST_WHOLE_TRANS(vl8re16_v, 8)
-GEN_LDST_WHOLE_TRANS(vl8re32_v, 8)
-GEN_LDST_WHOLE_TRANS(vl8re64_v, 8)
+#define GEN_LDST_WHOLE_TRANS(NAME, ETYPE, ARG_NF, IS_LOAD)                  \
+static bool trans_##NAME(DisasContext *s, arg_##NAME * a)                   \
+{                                                                           \
+    if (require_rvv(s) &&                                                   \
+        QEMU_IS_ALIGNED(a->rd, ARG_NF)) {                                   \
+        return ldst_whole_trans(a->rd, a->rs1, ARG_NF, ctzl(sizeof(ETYPE)), \
+                                gen_helper_##NAME, s, IS_LOAD);             \
+    }                                                                       \
+    return false;                                                           \
+}
+
+GEN_LDST_WHOLE_TRANS(vl1re8_v,  int8_t,  1, true)
+GEN_LDST_WHOLE_TRANS(vl1re16_v, int16_t, 1, true)
+GEN_LDST_WHOLE_TRANS(vl1re32_v, int32_t, 1, true)
+GEN_LDST_WHOLE_TRANS(vl1re64_v, int64_t, 1, true)
+GEN_LDST_WHOLE_TRANS(vl2re8_v,  int8_t,  2, true)
+GEN_LDST_WHOLE_TRANS(vl2re16_v, int16_t, 2, true)
+GEN_LDST_WHOLE_TRANS(vl2re32_v, int32_t, 2, true)
+GEN_LDST_WHOLE_TRANS(vl2re64_v, int64_t, 2, true)
+GEN_LDST_WHOLE_TRANS(vl4re8_v,  int8_t,  4, true)
+GEN_LDST_WHOLE_TRANS(vl4re16_v, int16_t, 4, true)
+GEN_LDST_WHOLE_TRANS(vl4re32_v, int32_t, 4, true)
+GEN_LDST_WHOLE_TRANS(vl4re64_v, int64_t, 4, true)
+GEN_LDST_WHOLE_TRANS(vl8re8_v,  int8_t,  8, true)
+GEN_LDST_WHOLE_TRANS(vl8re16_v, int16_t, 8, true)
+GEN_LDST_WHOLE_TRANS(vl8re32_v, int32_t, 8, true)
+GEN_LDST_WHOLE_TRANS(vl8re64_v, int64_t, 8, true)
 
 /*
  * The vector whole register store instructions are encoded similar to
  * unmasked unit-stride store of elements with EEW=8.
  */
-GEN_LDST_WHOLE_TRANS(vs1r_v, 1)
-GEN_LDST_WHOLE_TRANS(vs2r_v, 2)
-GEN_LDST_WHOLE_TRANS(vs4r_v, 4)
-GEN_LDST_WHOLE_TRANS(vs8r_v, 8)
+GEN_LDST_WHOLE_TRANS(vs1r_v, int8_t, 1, false)
+GEN_LDST_WHOLE_TRANS(vs2r_v, int8_t, 2, false)
+GEN_LDST_WHOLE_TRANS(vs4r_v, int8_t, 4, false)
+GEN_LDST_WHOLE_TRANS(vs8r_v, int8_t, 8, false)
 
 /*
  *** Vector Integer Arithmetic Instructions
-- 
2.34.1