[Qemu-devel] [RFC PATCH v2 0/9] target/ppc: convert VMX instructions to use TCG vector operations

Mark Cave-Ayland posted 9 patches 5 years, 3 months ago
Test checkpatch passed
Test asan passed
Test docker-mingw@fedora passed
Test docker-clang@ubuntu passed
Test docker-quick@centos7 passed
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20181217122405.18732-1-mark.cave-ayland@ilande.co.uk
linux-user/ppc/signal.c             |  24 +-
target/ppc/arch_dump.c              |  12 +-
target/ppc/cpu.h                    |  26 +-
target/ppc/gdbstub.c                |   8 +-
target/ppc/helper.h                 |   8 -
target/ppc/int_helper.c             |  63 ++-
target/ppc/internal.h               |  29 +-
target/ppc/machine.c                |  72 +++-
target/ppc/monitor.c                |   4 +-
target/ppc/translate.c              |  74 ++--
target/ppc/translate/dfp-impl.inc.c |   2 +-
target/ppc/translate/fp-impl.inc.c  | 490 +++++++++++++++++-----
target/ppc/translate/vmx-impl.inc.c | 186 ++++++---
target/ppc/translate/vsx-impl.inc.c | 782 ++++++++++++++++++++++++++----------
target/ppc/translate_init.inc.c     |  24 +-
15 files changed, 1262 insertions(+), 542 deletions(-)
[Qemu-devel] [RFC PATCH v2 0/9] target/ppc: convert VMX instructions to use TCG vector operations
Posted by Mark Cave-Ayland 5 years, 3 months ago
This patchset is an attempt at trying to improve the VMX (Altivec) instruction
performance by making use of the new TCG vector operations where possible.

In order to use TCG vector operations, the registers must be accessible from cpu_env
whilst currently they are accessed via arrays of static TCG globals. Patches 1-3
are therefore mechanical patches which introduce access helpers for FPR, AVR and VSR
registers using the supplied TCGv_i64 parameter. Meanwhile patch 4 fixes a minor issue
spotted by Richard during review to ensure that AVR registers are not modified until
after exceptions are processing during register load.

Once this is done, patch 5 enables us to remove the static TCG global arrays and updates
the access helpers to read/write to the relevant fields in cpu_env directly.

Patches 6 and 7 perform the legwork required to enable VSX instructions to be converted
to use TCG vector operations in future by rearranging the FP, VMX and VSX registers into
a single aligned VSR register array (the scope of this patchset is VMX only).

The final patches 8 and 9 convert the VMX logical instructions and addition/subtraction
instructions respectively over to the TCG vector operations.

NOTE: there are a lot of instructions that cannot (yet) be optimised to use TCG vector
operations, however it struck me that there may be some potential for converting
saturating add/sub and cmp instructions if there were a mechanism to return a set of
flags indicating the result of the saturation/comparison.

Finally thanks to Richard for taking the time to answer some of my (mostly beginner)
questions related to TCG.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>


v2:
- Rebase onto master
- Add comment explaining rationale for FPR helpers in description for patch 1
- Add R-B tags from Richard
- Add patch 3 to delay AVR register writeback as spotted by Richard
- Add patches 6 and 7 to merge FPR, VMX and VSX registers into the vsr array
  to facilitate conversion of VSX instructions to vector operations later
- Fix accidental bug whereby the conversion of get_vsr()/set_vsr() to access
  data from cpu_env was incorrectly squashed into patch 3
- Move set_fpr() further down in gen_fsqrts() and gen_frsqrtes() in patch 1


Mark Cave-Ayland (9):
  target/ppc: introduce get_fpr() and set_fpr() helpers for FP register
    access
  target/ppc: introduce get_avr64() and set_avr64() helpers for VMX
    register access
  target/ppc: introduce get_cpu_vsr{l,h}() and set_cpu_vsr{l,h}()
    helpers for VSR register access
  target/ppc: delay writeback of avr{l,h} during lvx instruction
  target/ppc: switch FPR, VMX and VSX helpers to access data directly
    from cpu_env
  target/ppc: merge ppc_vsr_t and ppc_avr_t union types
  target/ppc: move FP and VMX registers into aligned vsr register array
  target/ppc: convert VMX logical instructions to use vector operations
  target/ppc: convert vaddu[b,h,w,d] and vsubu[b,h,w,d] over to use
    vector operations

 linux-user/ppc/signal.c             |  24 +-
 target/ppc/arch_dump.c              |  12 +-
 target/ppc/cpu.h                    |  26 +-
 target/ppc/gdbstub.c                |   8 +-
 target/ppc/helper.h                 |   8 -
 target/ppc/int_helper.c             |  63 ++-
 target/ppc/internal.h               |  29 +-
 target/ppc/machine.c                |  72 +++-
 target/ppc/monitor.c                |   4 +-
 target/ppc/translate.c              |  74 ++--
 target/ppc/translate/dfp-impl.inc.c |   2 +-
 target/ppc/translate/fp-impl.inc.c  | 490 +++++++++++++++++-----
 target/ppc/translate/vmx-impl.inc.c | 186 ++++++---
 target/ppc/translate/vsx-impl.inc.c | 782 ++++++++++++++++++++++++++----------
 target/ppc/translate_init.inc.c     |  24 +-
 15 files changed, 1262 insertions(+), 542 deletions(-)

-- 
2.11.0


Re: [Qemu-devel] [RFC PATCH v2 0/9] target/ppc: convert VMX instructions to use TCG vector operations
Posted by Richard Henderson 5 years, 3 months ago
On 12/17/18 4:23 AM, Mark Cave-Ayland wrote:
> NOTE: there are a lot of instructions that cannot (yet) be optimised to use TCG vector
> operations, however it struck me that there may be some potential for converting
> saturating add/sub and cmp instructions if there were a mechanism to return a set of
> flags indicating the result of the saturation/comparison.

There are also a lot of instructions that can be converted, but aren't:

* vspltis[bhw] can use tcg_gen_gvec_dup{8,16,32}i.

* vsplt{b,h,w} can use tcg_gen_gvec_dup_mem.

  Note that you'll need something like vec_reg_offset from
  target/arm/translate-a64.h to compute the offset of the
  specific byte/word/long from which we are to splat.

* vmr should be handled by having tcg_gen_gvec_or notice aofs == bofs.
  For ARM, we do special case this during translation.
  But since tcg/tcg-op.c does these things for tcg_gen_or_i64,
  we should probably handle the same set of transformations.

* vnot would need to be handled by actually adding a tcg_gen_gvec_nor
  and then also noticing aofs == bofs.

For saturation, I think the easiest thing to do is represent SAT as a
ppc_avr_t.  We notice saturation by also computing normal arithmetic and
comparing to see if they differ.  E.g.

    tcg_gen_gvec_add(vece, offsetof_avr_tmp,
                     offsetof(ra), offsetof(rb), 16, 16);
    tcg_gen_gvec_ssadd(vece, offsetof(rt),
                       offsetof(ra), offsetof(rb), 16, 16);
    tcg_gen_gvec_cmp(TCG_COND_NE, vece, offsetof_avr_tmp,
                     offsetof_avr_tmp, offsetof(rt), 16, 16);
    tcg_gen_gvec_or(vece, offsetof_avr_sat, offsetof_avr_sat,
                    offsetof_avr_tmp, 16, 16);

You only need to convert the ppc_avr_t to a single bit when reading VSCR.

For comparisons... that's tricky.  I wonder if there's anything better than

    tcg_gen_gvec_cmp(TCG_COND_FOO, vece, offsetof(rt),
                     offsetof(ra), offsetof(rb), 16, 16);
    if (rc) {
        TCGv_i64 hi, lo, t, f;

        tcg_gen_ld_i64(hi, cpu_env, offsetof(rt));
        tcg_gen_ld_i64(lo, cpu_env, offsetof(rt) + 8);

        tcg_gen_and_i64(t, hi, lo);
        tcg_gen_or_i64(f, hi, lo);
        tcg_gen_setcondi_i64(TCG_COND_EQ, t, t, -1);
        tcg_gen_setcondi_i64(TCG_COND_EQ, f, f, 0);

        // truncate to i32, shift, or, and set to cr6.
    }


r~

Re: [Qemu-devel] [RFC PATCH v2 0/9] target/ppc: convert VMX instructions to use TCG vector operations
Posted by Mark Cave-Ayland 5 years, 3 months ago
On 17/12/2018 17:39, Richard Henderson wrote:

> On 12/17/18 4:23 AM, Mark Cave-Ayland wrote:
>> NOTE: there are a lot of instructions that cannot (yet) be optimised to use TCG vector
>> operations, however it struck me that there may be some potential for converting
>> saturating add/sub and cmp instructions if there were a mechanism to return a set of
>> flags indicating the result of the saturation/comparison.
> 
> There are also a lot of instructions that can be converted, but aren't:
> 
> * vspltis[bhw] can use tcg_gen_gvec_dup{8,16,32}i.
> 
> * vsplt{b,h,w} can use tcg_gen_gvec_dup_mem.
> 
>   Note that you'll need something like vec_reg_offset from
>   target/arm/translate-a64.h to compute the offset of the
>   specific byte/word/long from which we are to splat.

Oh okay, thanks for the hints - I remember thinking that I couldn't much with splat,
but I'll go and look again.

> * vmr should be handled by having tcg_gen_gvec_or notice aofs == bofs.
>   For ARM, we do special case this during translation.
>   But since tcg/tcg-op.c does these things for tcg_gen_or_i64,
>   we should probably handle the same set of transformations.
> 
> * vnot would need to be handled by actually adding a tcg_gen_gvec_nor
>   and then also noticing aofs == bofs.

And I'll revisit these ones too.

> For saturation, I think the easiest thing to do is represent SAT as a
> ppc_avr_t.  We notice saturation by also computing normal arithmetic and
> comparing to see if they differ.  E.g.
> 
>     tcg_gen_gvec_add(vece, offsetof_avr_tmp,
>                      offsetof(ra), offsetof(rb), 16, 16);
>     tcg_gen_gvec_ssadd(vece, offsetof(rt),
>                        offsetof(ra), offsetof(rb), 16, 16);
>     tcg_gen_gvec_cmp(TCG_COND_NE, vece, offsetof_avr_tmp,
>                      offsetof_avr_tmp, offsetof(rt), 16, 16);
>     tcg_gen_gvec_or(vece, offsetof_avr_sat, offsetof_avr_sat,
>                     offsetof_avr_tmp, 16, 16);
> 
> You only need to convert the ppc_avr_t to a single bit when reading VSCR.

I actually had a PoC that looked somewhat similar to this, except that I abandoned
the idea as I thought that the penalty of doing the add twice (plus comparison) would
slow down everything by several orders of magnitude for backends that didn't support
vector instructions. What's the best way to handle this?

> For comparisons... that's tricky.  I wonder if there's anything better than
> 
>     tcg_gen_gvec_cmp(TCG_COND_FOO, vece, offsetof(rt),
>                      offsetof(ra), offsetof(rb), 16, 16);
>     if (rc) {
>         TCGv_i64 hi, lo, t, f;
> 
>         tcg_gen_ld_i64(hi, cpu_env, offsetof(rt));
>         tcg_gen_ld_i64(lo, cpu_env, offsetof(rt) + 8);
> 
>         tcg_gen_and_i64(t, hi, lo);
>         tcg_gen_or_i64(f, hi, lo);
>         tcg_gen_setcondi_i64(TCG_COND_EQ, t, t, -1);
>         tcg_gen_setcondi_i64(TCG_COND_EQ, f, f, 0);
> 
>         // truncate to i32, shift, or, and set to cr6.
>     }

Certainly I can look at this approach, but again my concern is that we end up heavily
penalising the backends without vector instruction support :/


ATB,

Mark.