[Qemu-devel] [PATCH v1 33/33] s390x/tcg: Implement VECTOR UNPACK *

David Hildenbrand posted 33 patches 6 years, 8 months ago
There is a newer version of this series
[Qemu-devel] [PATCH v1 33/33] s390x/tcg: Implement VECTOR UNPACK *
Posted by David Hildenbrand 6 years, 8 months ago
Combine all variant in a single handler. As source and destination
have different element sizes, we can't use gvec expansion. Expand
manually. Also watch out for overlapping source and destination and
use a temporary register in that case.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 target/s390x/insn-data.def      |  8 +++++++
 target/s390x/translate_vx.inc.c | 41 +++++++++++++++++++++++++++++++++
 2 files changed, 49 insertions(+)

diff --git a/target/s390x/insn-data.def b/target/s390x/insn-data.def
index 5d4d2ecc7e..2c49c63c59 100644
--- a/target/s390x/insn-data.def
+++ b/target/s390x/insn-data.def
@@ -1046,6 +1046,14 @@
     F(0xe73e, VSTM,    VRS_a, V,   la2, 0, 0, 0, vstm, 0, IF_VEC)
 /* VECTOR STORE WITH LENGTH */
     F(0xe73f, VSTL,    VRS_b, V,   la2, r3_32u, 0, 0, vstl, 0, IF_VEC)
+/* VECTOR UNPACK HIGH */
+    F(0xe7d7, VUPH,    VRR_a, V,   0, 0, 0, 0, vup, 0, IF_VEC)
+/* VECTOR UNPACK LOGICAL HIGH */
+    F(0xe7d5, VUPLH,   VRR_a, V,   0, 0, 0, 0, vup, 0, IF_VEC)
+/* VECTOR UNPACK LOW */
+    F(0xe7d6, VUPL,    VRR_a, V,   0, 0, 0, 0, vup, 0, IF_VEC)
+/* VECTOR UNPACK LOGICAL LOW */
+    F(0xe7d4, VUPLL,   VRR_a, V,   0, 0, 0, 0, vup, 0, IF_VEC)
 
 #ifndef CONFIG_USER_ONLY
 /* COMPARE AND SWAP AND PURGE */
diff --git a/target/s390x/translate_vx.inc.c b/target/s390x/translate_vx.inc.c
index d87f5bafcf..fde8b06953 100644
--- a/target/s390x/translate_vx.inc.c
+++ b/target/s390x/translate_vx.inc.c
@@ -842,3 +842,44 @@ static DisasJumpType op_vstl(DisasContext *s, DisasOps *o)
     tcg_temp_free_ptr(a0);
     return DISAS_NEXT;
 }
+
+static DisasJumpType op_vup(DisasContext *s, DisasOps *o)
+{
+    const bool high = s->fields->op2 == 0xd7 || s->fields->op2 == 0xd5;
+    const bool logical = s->fields->op2 == 0xd4 || s->fields->op2 == 0xd5;
+    const uint8_t v1 = get_field(s->fields, v1);
+    const uint8_t v2 = get_field(s->fields, v2);
+    const uint8_t src_es = get_field(s->fields, m3);
+    const uint8_t dst_es = src_es + 1;
+    uint8_t dst_v = v1;
+    int dst_idx, src_idx;
+    TCGv_i64 tmp;
+
+    if (src_es > MO_32) {
+        gen_program_exception(s, PGM_SPECIFICATION);
+        return DISAS_NORETURN;
+    }
+
+    /* Source and destination overlap -> use a temporary register */
+    if (v1 == v2) {
+        dst_v = TMP_VREG_0;
+    }
+
+    tmp = tcg_temp_new_i64();
+    for (dst_idx = 0; dst_idx < NUM_VEC_ELEMENTS(dst_es); dst_idx++) {
+        src_idx = dst_idx;
+        if (!high) {
+            src_idx += NUM_VEC_ELEMENTS(src_es) / 2;
+        }
+        read_vec_element_i64(tmp, v2, src_idx,
+                             src_es | (logical ? 0 : MO_SIGN));
+        write_vec_element_i64(tmp, dst_v, dst_idx, dst_es);
+    }
+    tcg_temp_free_i64(tmp);
+
+    /* move the temporary to the destination */
+    if (dst_v != v1) {
+        gen_gvec_mov(v1, dst_v);
+    }
+    return DISAS_NEXT;
+}
-- 
2.17.2


Re: [Qemu-devel] [PATCH v1 33/33] s390x/tcg: Implement VECTOR UNPACK *
Posted by Richard Henderson 6 years, 8 months ago
On 2/26/19 3:39 AM, David Hildenbrand wrote:
> Combine all variant in a single handler. As source and destination
> have different element sizes, we can't use gvec expansion. Expand
> manually. Also watch out for overlapping source and destination and
> use a temporary register in that case.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  target/s390x/insn-data.def      |  8 +++++++
>  target/s390x/translate_vx.inc.c | 41 +++++++++++++++++++++++++++++++++
>  2 files changed, 49 insertions(+)

This works as is, so
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

But the same comment applies wrt iteration order and not needing a temporary.
High unpack can iterate backward, while low unpack can iterate forward, with no
lost data.


r~

Re: [Qemu-devel] [PATCH v1 33/33] s390x/tcg: Implement VECTOR UNPACK *
Posted by David Hildenbrand 6 years, 8 months ago
On 28.02.19 01:03, Richard Henderson wrote:
> On 2/26/19 3:39 AM, David Hildenbrand wrote:
>> Combine all variant in a single handler. As source and destination
>> have different element sizes, we can't use gvec expansion. Expand
>> manually. Also watch out for overlapping source and destination and
>> use a temporary register in that case.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  target/s390x/insn-data.def      |  8 +++++++
>>  target/s390x/translate_vx.inc.c | 41 +++++++++++++++++++++++++++++++++
>>  2 files changed, 49 insertions(+)
> 
> This works as is, so
> Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
> 
> But the same comment applies wrt iteration order and not needing a temporary.
> High unpack can iterate backward, while low unpack can iterate forward, with no
> lost data.

I'll fix that right away. I guess vector pack cannot be handled like this.

The only way to get rid of the temporary would be to load both elements
from v2 and v3 and then writing the two (half sized) elements in v1.

I'll have a look.

Thanks!

-- 

Thanks,

David / dhildenb

Re: [Qemu-devel] [PATCH v1 33/33] s390x/tcg: Implement VECTOR UNPACK *
Posted by David Hildenbrand 6 years, 8 months ago
On 28.02.19 10:28, David Hildenbrand wrote:
> On 28.02.19 01:03, Richard Henderson wrote:
>> On 2/26/19 3:39 AM, David Hildenbrand wrote:
>>> Combine all variant in a single handler. As source and destination
>>> have different element sizes, we can't use gvec expansion. Expand
>>> manually. Also watch out for overlapping source and destination and
>>> use a temporary register in that case.
>>>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>>  target/s390x/insn-data.def      |  8 +++++++
>>>  target/s390x/translate_vx.inc.c | 41 +++++++++++++++++++++++++++++++++
>>>  2 files changed, 49 insertions(+)
>>
>> This works as is, so
>> Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
>>
>> But the same comment applies wrt iteration order and not needing a temporary.
>> High unpack can iterate backward, while low unpack can iterate forward, with no
>> lost data.
> 
> I'll fix that right away. I guess vector pack cannot be handled like this.
> 
> The only way to get rid of the temporary would be to load both elements
> from v2 and v3 and then writing the two (half sized) elements in v1.
> 
> I'll have a look.

Hmm, as v2 and v3 are handled concatenated it is not that easy. I am not
sure if we can handle this without a temporary vector.

I thought about packing them first interleaved

v2 = [v2e0, v2e1]
v3 = [v3e0, ve31]
v1 = [v2e0_packed, v3e0_packed, v2e1_packed, v3e1_packed]

And then restoring the right order

v1 = [v2e0_packed, v2e1_packed, v3e0_packed, v3e1_packed]

But than the second operation seems to be the problem. That shuffling
would have to be hard coded as far as I can see. (shuffling with MO_8 is
nasty -> 14 element shave to be exchanged, in my opinion needing
eventually 14 temporary variables)

Of course, we can also simply detect duplicates and if so, call into a
helper.

-- 

Thanks,

David / dhildenb

Re: [Qemu-devel] [PATCH v1 33/33] s390x/tcg: Implement VECTOR UNPACK *
Posted by Richard Henderson 6 years, 8 months ago
On 2/28/19 2:54 AM, David Hildenbrand wrote:
> Hmm, as v2 and v3 are handled concatenated it is not that easy. I am not
> sure if we can handle this without a temporary vector.
> 
> I thought about packing them first interleaved
> 
> v2 = [v2e0, v2e1]
> v3 = [v3e0, ve31]
> v1 = [v2e0_packed, v3e0_packed, v2e1_packed, v3e1_packed]
> 
> And then restoring the right order
> 
> v1 = [v2e0_packed, v2e1_packed, v3e0_packed, v3e1_packed]
> 
> But than the second operation seems to be the problem. That shuffling
> would have to be hard coded as far as I can see. (shuffling with MO_8 is
> nasty -> 14 element shave to be exchanged, in my opinion needing
> eventually 14 temporary variables)

I suppose you could do it in registers.

  load_element_i64(t1, v2, 0);
  for (i = 1; i < N; i++) {
    load_element_i64(t3, v2, i, es);
    tcg_gen_deposit_i64(t1, t1, t3, i << es, 1 << es);
  }
  // repeat for v3 into t2
  // store t1,t2 into v1.

Now you have only 3 temporaries, which is manageable.

The only question, when it comes to MO_8, is whether the code expansion of this
is reasonable (16 byte loads, 15 deposits, 2 stores -- minimum 33 insns,
probably 48 for x86_64 host), or whether a helper function would be better in
the end.  But then the same is true for all of the other merge & unpack
operations wrt MO_8.


r~

Re: [Qemu-devel] [PATCH v1 33/33] s390x/tcg: Implement VECTOR UNPACK *
Posted by David Hildenbrand 6 years, 8 months ago
On 28.02.19 19:22, Richard Henderson wrote:
> On 2/28/19 2:54 AM, David Hildenbrand wrote:
>> Hmm, as v2 and v3 are handled concatenated it is not that easy. I am not
>> sure if we can handle this without a temporary vector.
>>
>> I thought about packing them first interleaved
>>
>> v2 = [v2e0, v2e1]
>> v3 = [v3e0, ve31]
>> v1 = [v2e0_packed, v3e0_packed, v2e1_packed, v3e1_packed]
>>
>> And then restoring the right order
>>
>> v1 = [v2e0_packed, v2e1_packed, v3e0_packed, v3e1_packed]
>>
>> But than the second operation seems to be the problem. That shuffling
>> would have to be hard coded as far as I can see. (shuffling with MO_8 is
>> nasty -> 14 element shave to be exchanged, in my opinion needing
>> eventually 14 temporary variables)
> 
> I suppose you could do it in registers.
> 
>   load_element_i64(t1, v2, 0);
>   for (i = 1; i < N; i++) {
>     load_element_i64(t3, v2, i, es);
>     tcg_gen_deposit_i64(t1, t1, t3, i << es, 1 << es);
>   }
>   // repeat for v3 into t2
>   // store t1,t2 into v1.
> 
> Now you have only 3 temporaries, which is manageable.
> 
> The only question, when it comes to MO_8, is whether the code expansion of this
> is reasonable (16 byte loads, 15 deposits, 2 stores -- minimum 33 insns,
> probably 48 for x86_64 host), or whether a helper function would be better in
> the end.  But then the same is true for all of the other merge & unpack
> operations wrt MO_8.

And it would only apply when dst==src. Will have a try what looks "less
ugly" :) Thanks!

> 
> 
> r~
> 


-- 

Thanks,

David / dhildenb