[v3] target/arm: Implement FEAT_SME2p1

[PATCH v3 62/97] target/arm: Split out do_whilel from helper_sve_whilel

Posted by Richard Henderson 2 months, 1 week ago

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 target/arm/tcg/sve_helper.c | 28 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/target/arm/tcg/sve_helper.c b/target/arm/tcg/sve_helper.c
index 5014fd135d..4497e9107b 100644
--- a/target/arm/tcg/sve_helper.c
+++ b/target/arm/tcg/sve_helper.c
@@ -4113,26 +4113,30 @@ static uint32_t pred_count_test(uint32_t elements, uint32_t count, bool invert)
     return flags;
 }
 
-uint32_t HELPER(sve_whilel)(void *vd, uint32_t count, uint32_t pred_desc)
+static void do_whilel(uint64_t *d, uint64_t esz_mask,
+                      uint32_t count, uint32_t oprbits)
 {
-    intptr_t oprsz = FIELD_EX32(pred_desc, PREDDESC, OPRSZ);
-    intptr_t esz = FIELD_EX32(pred_desc, PREDDESC, ESZ);
-    uint64_t esz_mask = pred_esz_masks[esz];
-    ARMPredicateReg *d = vd;
-    intptr_t i, oprbits = oprsz * 8;
+    uint32_t i;
 
     tcg_debug_assert(count <= oprbits);
 
-    /* Begin with a zero predicate register.  */
-    do_zero(d, oprsz);
-
-    /* Set all of the requested bits.  */
     for (i = 0; i < count / 64; ++i) {
-        d->p[i] = esz_mask;
+        d[i] = esz_mask;
     }
     if (count & 63) {
-        d->p[i] = MAKE_64BIT_MASK(0, count & 63) & esz_mask;
+        d[i] = MAKE_64BIT_MASK(0, count & 63) & esz_mask;
     }
+}
+
+uint32_t HELPER(sve_whilel)(void *vd, uint32_t count, uint32_t pred_desc)
+{
+    uint32_t oprsz = FIELD_EX32(pred_desc, PREDDESC, OPRSZ);
+    uint32_t esz = FIELD_EX32(pred_desc, PREDDESC, ESZ);
+    uint32_t oprbits = oprsz * 8;
+    uint64_t esz_mask = pred_esz_masks[esz];
+
+    do_zero(vd, oprsz);
+    do_whilel(vd, esz_mask, count, oprbits);
 
     return pred_count_test(oprbits, count, false);
 }
-- 
2.43.0

Re: [PATCH v3 62/97] target/arm: Split out do_whilel from helper_sve_whilel

Posted by Peter Maydell 2 months, 1 week ago

On Wed, 2 Jul 2025 at 13:38, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  target/arm/tcg/sve_helper.c | 28 ++++++++++++++++------------
>  1 file changed, 16 insertions(+), 12 deletions(-)
>
> diff --git a/target/arm/tcg/sve_helper.c b/target/arm/tcg/sve_helper.c
> index 5014fd135d..4497e9107b 100644
> --- a/target/arm/tcg/sve_helper.c
> +++ b/target/arm/tcg/sve_helper.c
> @@ -4113,26 +4113,30 @@ static uint32_t pred_count_test(uint32_t elements, uint32_t count, bool invert)
>      return flags;
>  }
>
> -uint32_t HELPER(sve_whilel)(void *vd, uint32_t count, uint32_t pred_desc)
> +static void do_whilel(uint64_t *d, uint64_t esz_mask,
> +                      uint32_t count, uint32_t oprbits)

Does the compiler generate worse code if we give the right
typed argument for d here (ArmPredicateReg *d) vs uint64_t * ?

>  {
> -    intptr_t oprsz = FIELD_EX32(pred_desc, PREDDESC, OPRSZ);
> -    intptr_t esz = FIELD_EX32(pred_desc, PREDDESC, ESZ);
> -    uint64_t esz_mask = pred_esz_masks[esz];
> -    ARMPredicateReg *d = vd;
> -    intptr_t i, oprbits = oprsz * 8;
> +    uint32_t i;
>
>      tcg_debug_assert(count <= oprbits);
>
> -    /* Begin with a zero predicate register.  */
> -    do_zero(d, oprsz);
> -
> -    /* Set all of the requested bits.  */
>      for (i = 0; i < count / 64; ++i) {
> -        d->p[i] = esz_mask;
> +        d[i] = esz_mask;
>      }
>      if (count & 63) {
> -        d->p[i] = MAKE_64BIT_MASK(0, count & 63) & esz_mask;
> +        d[i] = MAKE_64BIT_MASK(0, count & 63) & esz_mask;
>      }
> +}

Either way
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM

Re: [PATCH v3 62/97] target/arm: Split out do_whilel from helper_sve_whilel

Posted by Richard Henderson 2 months, 1 week ago

On 7/3/25 04:38, Peter Maydell wrote:
> On Wed, 2 Jul 2025 at 13:38, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
>> ---
>>   target/arm/tcg/sve_helper.c | 28 ++++++++++++++++------------
>>   1 file changed, 16 insertions(+), 12 deletions(-)
>>
>> diff --git a/target/arm/tcg/sve_helper.c b/target/arm/tcg/sve_helper.c
>> index 5014fd135d..4497e9107b 100644
>> --- a/target/arm/tcg/sve_helper.c
>> +++ b/target/arm/tcg/sve_helper.c
>> @@ -4113,26 +4113,30 @@ static uint32_t pred_count_test(uint32_t elements, uint32_t count, bool invert)
>>       return flags;
>>   }
>>
>> -uint32_t HELPER(sve_whilel)(void *vd, uint32_t count, uint32_t pred_desc)
>> +static void do_whilel(uint64_t *d, uint64_t esz_mask,
>> +                      uint32_t count, uint32_t oprbits)
> 
> Does the compiler generate worse code if we give the right
> typed argument for d here (ArmPredicateReg *d) vs uint64_t * ?

It'll almost certainly be identical.

I believe my original split for this was written with the idea that I'd be expanding 
CounterToPredicate for LD1 et al exactly like the pseudocode does.  If I had done that, I 
would have needed to pass an array of uint64_t with 4x the elements in ARMPredicateReg. 
Thus uint64_t* instead of ARMPredicateReg*.

But in the end, decode_counter does not expand all of the predicate bits explicitly.

r~