[v1] s390x/tcg: Vector Instruction Support Part 1

[Qemu-devel] [PATCH v1 03/33] s390x: Add one temporary vector register in CPU state for TCG

Posted by David Hildenbrand 6 years, 11 months ago

We sometimes want to work on a temporary vector register instead of the
actual destination, because source and destination might overlap. An
alternative would be loading the vector into two i64 variables, but than
separate handling for accessing the vector elements would be needed.
This is easier. Add one for now as that seems to be enough.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 target/s390x/cpu.h       | 11 +++++++++++
 target/s390x/translate.c |  3 +++
 2 files changed, 14 insertions(+)

diff --git a/target/s390x/cpu.h b/target/s390x/cpu.h
index cb6d77053a..a8dc0b2b83 100644
--- a/target/s390x/cpu.h
+++ b/target/s390x/cpu.h
@@ -67,6 +67,17 @@ struct CPUS390XState {
      * vregs[0][0] -> vregs[15][0] are 16 floating point registers
      */
     CPU_DoubleU vregs[32][2];  /* vector registers */
+#ifdef CONFIG_TCG
+#define TMP_VREG_0   33
+    /*
+     * Temporary vector registers used while processing vector instructions
+     * in TCG. This is helpful e.g. when source and destination registers
+     * overlap for certain instructions in translate functions. Content valid
+     * only within execution of one translated block, therefore no migration is
+     * needed. Resets don't mather, but has to be properly aligned.
+     */
+    CPU_DoubleU tmp_vregs[1][2];
+#endif
     uint32_t aregs[16];    /* access registers */
     uint8_t riccb[64];     /* runtime instrumentation control */
     uint64_t gscb[4];      /* guarded storage control */
diff --git a/target/s390x/translate.c b/target/s390x/translate.c
index d52c02c572..8733d19182 100644
--- a/target/s390x/translate.c
+++ b/target/s390x/translate.c
@@ -147,6 +147,9 @@ void s390x_translate_init(void)
 
 static inline int vec_full_reg_offset(uint8_t reg)
 {
+    if (reg == TMP_VREG_0) {
+        return offsetof(CPUS390XState, tmp_vregs[0][0].d);
+    }
     g_assert(reg < 32);
     return offsetof(CPUS390XState, vregs[reg][0].d);
 }
-- 
2.17.2

Re: [Qemu-devel] [PATCH v1 03/33] s390x: Add one temporary vector register in CPU state for TCG

Posted by Richard Henderson 6 years, 11 months ago

On 2/26/19 3:38 AM, David Hildenbrand wrote:
> We sometimes want to work on a temporary vector register instead of the
> actual destination, because source and destination might overlap. An
> alternative would be loading the vector into two i64 variables, but than
> separate handling for accessing the vector elements would be needed.
> This is easier. Add one for now as that seems to be enough.

Hmm, I'll reserve judgment until I see how this is used.

For ARM SVE, I would allocate this temporary on the stack within the helper,
and move one of the operands out of the way.  E.g.

void helper(foo)(void *vd, void *vx, *void *vy
{
    VectorReg tmp;
    TYPE *d = vd, *x = vx, *y = vy;

    if (vx == vd || vy == vd) {
        tmp = *(VectorReg *)vd;
        if (vx == vd) {
            vx = &tmp;
        }
        if (vy == vd) {
            vy = &tmp;
        }
    }

    process d, x, y as normal.
}

This minimized the amount of code inline.  However, SVE vectors are quite a bit
larger, at 256 bytes, so the copy itself was out of line most of the time anyway.

Provisionally,
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~

Re: [Qemu-devel] [PATCH v1 03/33] s390x: Add one temporary vector register in CPU state for TCG

Posted by David Hildenbrand 6 years, 11 months ago

On 26.02.19 19:36, Richard Henderson wrote:
> On 2/26/19 3:38 AM, David Hildenbrand wrote:
>> We sometimes want to work on a temporary vector register instead of the
>> actual destination, because source and destination might overlap. An
>> alternative would be loading the vector into two i64 variables, but than
>> separate handling for accessing the vector elements would be needed.
>> This is easier. Add one for now as that seems to be enough.
> 
> Hmm, I'll reserve judgment until I see how this is used.
> 
> For ARM SVE, I would allocate this temporary on the stack within the helper,
> and move one of the operands out of the way.  E.g.

Yes, I do the same for helpers. This, however is for TCG translated code :)

E.g. see

[PATCH v1 08/33] s390x/tcg: Implement VECTOR LOAD
[PATCH v1 19/33] s390x/tcg: Implement VECTOR MERGE (HIGH|LOW)
[PATCH v1 33/33] s390x/tcg: Implement VECTOR UNPACK *


> 
> void helper(foo)(void *vd, void *vx, *void *vy
> {
>     VectorReg tmp;
>     TYPE *d = vd, *x = vx, *y = vy;
> 
>     if (vx == vd || vy == vd) {
>         tmp = *(VectorReg *)vd;
>         if (vx == vd) {
>             vx = &tmp;
>         }
>         if (vy == vd) {
>             vy = &tmp;
>         }
>     }
> 
>     process d, x, y as normal.
> }
> 
> This minimized the amount of code inline.  However, SVE vectors are quite a bit
> larger, at 256 bytes, so the copy itself was out of line most of the time anyway.
> 
> Provisionally,
> Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
> 
> 
> r~
> 


-- 

Thanks,

David / dhildenb