[PATCH v2 10/12] target/arm: Optimize MVE VSHLL and VMOVL

Posted by Peter Maydell 4 years, 4 months ago
Optimize the MVE VSHLL insns by using TCG vector ops when possible.
This includes the VMOVL insn, which we handle in mve.decode as "VSHLL
with zero shift count".

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
---
The cases here that I've implemented with ANDI then shift
could also be implemented as shift-then-shift. Is one better
than another?
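
For reference, here is a scalar model of what the two expansions
compute for one lane of the unsigned bottom-half case (purely
illustrative; the function names are made up, and the patch emits
gvec ops, not this):

#include <stdint.h>

/*
 * One 16-bit output lane of VSHLLB.U8: the input byte is the low
 * half of the lane; shift is 0..8 inclusive (shift == 0 is VMOVL).
 */
static uint16_t vshllbu_lane_and_shift(uint16_t lane, unsigned shift)
{
    /* ANDI then shift, as do_gvec_vshllbu() below does */
    return (uint16_t)((lane & 0xff) << shift);
}

static uint16_t vshllbu_lane_shift_shift(uint16_t lane, unsigned shift)
{
    /* the shift-then-shift alternative: same result for shift 0..8 */
    return (uint16_t)((uint16_t)(lane << 8) >> (8 - shift));
}

/*
 * The signed top-half case (cf do_gvec_vshllts()): here shift == 0
 * (ie VMOVL) needs no mask at all, just one arithmetic shift down.
 */
static uint16_t vshllts_lane(uint16_t lane, unsigned shift)
{
    if (shift == 0) {
        return (uint16_t)((int16_t)lane >> 8);
    }
    return (uint16_t)((int16_t)(lane & 0xff00) >> (8 - shift));
}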
---
 target/arm/translate-mve.c | 67 +++++++++++++++++++++++++++++++++-----
 1 file changed, 59 insertions(+), 8 deletions(-)

diff --git a/target/arm/translate-mve.c b/target/arm/translate-mve.c
index 00fa4379a74..5d66f70657e 100644
--- a/target/arm/translate-mve.c
+++ b/target/arm/translate-mve.c
@@ -1735,16 +1735,67 @@ DO_2SHIFT_SCALAR(VQSHL_U_scalar, vqshli_u)
 DO_2SHIFT_SCALAR(VQRSHL_S_scalar, vqrshli_s)
 DO_2SHIFT_SCALAR(VQRSHL_U_scalar, vqrshli_u)
 
-#define DO_VSHLL(INSN, FN)                                      \
-    static bool trans_##INSN(DisasContext *s, arg_2shift *a)    \
-    {                                                           \
-        static MVEGenTwoOpShiftFn * const fns[] = {             \
-            gen_helper_mve_##FN##b,                             \
-            gen_helper_mve_##FN##h,                             \
-        };                                                      \
-        return do_2shift(s, a, fns[a->size], false);            \
+#define DO_VSHLL(INSN, FN)                                              \
+    static bool trans_##INSN(DisasContext *s, arg_2shift *a)            \
+    {                                                                   \
+        static MVEGenTwoOpShiftFn * const fns[] = {                     \
+            gen_helper_mve_##FN##b,                                     \
+            gen_helper_mve_##FN##h,                                     \
+        };                                                              \
+        return do_2shift_vec(s, a, fns[a->size], false, do_gvec_##FN);  \
     }
 
+/*
+ * For the VSHLL vector helpers, the vece is the size of the input
+ * (ie MO_8 or MO_16); the helpers want to work in the output size.
+ * The shift count can be 0..<input size>, inclusive. (0 is VMOVL.)
+ */
+static void do_gvec_vshllbs(unsigned vece, uint32_t dofs, uint32_t aofs,
+                            int64_t shift, uint32_t oprsz, uint32_t maxsz)
+{
+    unsigned ovece = vece + 1;
+    unsigned ibits = vece == MO_8 ? 8 : 16;
+    tcg_gen_gvec_shli(ovece, dofs, aofs, ibits, oprsz, maxsz);
+    tcg_gen_gvec_sari(ovece, dofs, dofs, ibits - shift, oprsz, maxsz);
+}
+
+static void do_gvec_vshllbu(unsigned vece, uint32_t dofs, uint32_t aofs,
+                            int64_t shift, uint32_t oprsz, uint32_t maxsz)
+{
+    unsigned ovece = vece + 1;
+    tcg_gen_gvec_andi(ovece, dofs, aofs,
+                      ovece == MO_16 ? 0xff : 0xffff, oprsz, maxsz);
+    tcg_gen_gvec_shli(ovece, dofs, dofs, shift, oprsz, maxsz);
+}
+
+static void do_gvec_vshllts(unsigned vece, uint32_t dofs, uint32_t aofs,
+                            int64_t shift, uint32_t oprsz, uint32_t maxsz)
+{
+    unsigned ovece = vece + 1;
+    unsigned ibits = vece == MO_8 ? 8 : 16;
+    if (shift == 0) {
+        tcg_gen_gvec_sari(ovece, dofs, aofs, ibits, oprsz, maxsz);
+    } else {
+        tcg_gen_gvec_andi(ovece, dofs, aofs,
+                          ovece == MO_16 ? 0xff00 : 0xffff0000, oprsz, maxsz);
+        tcg_gen_gvec_sari(ovece, dofs, dofs, ibits - shift, oprsz, maxsz);
+    }
+}
+
+static void do_gvec_vshlltu(unsigned vece, uint32_t dofs, uint32_t aofs,
+                            int64_t shift, uint32_t oprsz, uint32_t maxsz)
+{
+    unsigned ovece = vece + 1;
+    unsigned ibits = vece == MO_8 ? 8 : 16;
+    if (shift == 0) {
+        tcg_gen_gvec_shri(ovece, dofs, aofs, ibits, oprsz, maxsz);
+    } else {
+        tcg_gen_gvec_andi(ovece, dofs, aofs,
+                          ovece == MO_16 ? 0xff00 : 0xffff0000, oprsz, maxsz);
+        tcg_gen_gvec_shri(ovece, dofs, dofs, ibits - shift, oprsz, maxsz);
+    }
+}
+
 DO_VSHLL(VSHLL_BS, vshllbs)
 DO_VSHLL(VSHLL_BU, vshllbu)
 DO_VSHLL(VSHLL_TS, vshllts)
-- 
2.20.1


Re: [PATCH v2 10/12] target/arm: Optimize MVE VSHLL and VMOVL
Posted by Richard Henderson 4 years, 4 months ago
On 9/13/21 2:54 AM, Peter Maydell wrote:
> Optimize the MVE VSHLL insns by using TCG vector ops when possible.
> This includes the VMOVL insn, which we handle in mve.decode as "VSHLL
> with zero shift count".
> 
> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> ---
> The cases here that I've implemented with ANDI then shift
> could also be implemented as shift-then-shift. Is one better
> than another?

I would expect and + shift to be preferred over shift + shift.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~

Re: [PATCH v2 10/12] target/arm: Optimize MVE VSHLL and VMOVL
Posted by Peter Maydell 4 years, 4 months ago
On Mon, 13 Sept 2021 at 15:04, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 9/13/21 2:54 AM, Peter Maydell wrote:
> > Optimize the MVE VSHLL insns by using TCG vector ops when possible.
> > This includes the VMOVL insn, which we handle in mve.decode as "VSHLL
> > with zero shift count".
> >
> > Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> > ---
> > The cases here that I've implemented with ANDI then shift
> > could also be implemented as shift-then-shift. Is one better
> > than another?
>
> I would expect and + shift to be preferred over shift + shift.

OK. (I wasn't sure, because and + shift requires another insn
to assemble the immediate constant, I think.)
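
For instance on an SSE2 host I'd expect the two to come out roughly
like this (intrinsics sketch, names mine; not literally what the TCG
backend emits):

#include <emmintrin.h>

/* and + shift: the lane mask has to be materialized first */
static __m128i vshllbu_and_shift(__m128i n, int shift)
{
    __m128i mask = _mm_set1_epi16(0x00ff);  /* extra step: build constant */
    return _mm_slli_epi16(_mm_and_si128(n, mask), shift);
}

/* shift + shift: both counts are immediates, no constant needed */
static __m128i vshllbu_shift_shift(__m128i n, int shift)
{
    return _mm_srli_epi16(_mm_slli_epi16(n, 8), 8 - shift);
}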

-- PMM

Re: [PATCH v2 10/12] target/arm: Optimize MVE VSHLL and VMOVL
Posted by Richard Henderson 4 years, 4 months ago
On 9/13/21 7:22 AM, Peter Maydell wrote:
> On Mon, 13 Sept 2021 at 15:04, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> On 9/13/21 2:54 AM, Peter Maydell wrote:
>>> Optimize the MVE VSHLL insns by using TCG vector ops when possible.
>>> This includes the VMOVL insn, which we handle in mve.decode as "VSHLL
>>> with zero shift count".
>>>
>>> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
>>> ---
>>> The cases here that I've implemented with ANDI then shift
>>> could also be implemented as shift-then-shift. Is one better
>>> than another?
>>
>> I would expect and + shift to be preferred over shift + shift.
> 
> OK. (I wasn't sure, because and + shift requires another insn
> to assemble the immediate constant, I think.)

Yea, though Arm itself is good about not requiring one.  But there's
generally only one shifter across multiple pipelines.  Not that we're
doing any sort of compute resource allocation and scheduling...


r~