The following changes since commit 9e5319ca52a5b9e84d55ad9c36e2c0b317a122bb:

  Merge remote-tracking branch 'remotes/bonzini/tags/for-upstream' into staging (2019-10-04 18:32:34 +0100)

are available in the Git repository at:

  https://github.com/rth7680/qemu.git tags/pull-tcg-20191013

for you to fetch changes up to d2f86bba6931388e275e8eb4ccd1dbcc7cae6328:

  cpus: kick all vCPUs when running thread=single (2019-10-07 14:08:58 -0400)

----------------------------------------------------------------
Host vector support for tcg/ppc.
Fix thread=single cpu kicking.

----------------------------------------------------------------
Alex Bennée (1):
      cpus: kick all vCPUs when running thread=single

Richard Henderson (22):
      tcg/ppc: Introduce Altivec registers
      tcg/ppc: Introduce macro VX4()
      tcg/ppc: Introduce macros VRT(), VRA(), VRB(), VRC()
      tcg/ppc: Create TCGPowerISA and have_isa
      tcg/ppc: Replace HAVE_ISA_2_06
      tcg/ppc: Replace HAVE_ISEL macro with a variable
      tcg/ppc: Enable tcg backend vector compilation
      tcg/ppc: Add support for load/store/logic/comparison
      tcg/ppc: Add support for vector maximum/minimum
      tcg/ppc: Add support for vector add/subtract
      tcg/ppc: Add support for vector saturated add/subtract
      tcg/ppc: Support vector shift by immediate
      tcg/ppc: Support vector multiply
      tcg/ppc: Support vector dup2
      tcg/ppc: Enable Altivec detection
      tcg/ppc: Update vector support for VSX
      tcg/ppc: Update vector support for v2.07 Altivec
      tcg/ppc: Update vector support for v2.07 VSX
      tcg/ppc: Update vector support for v2.07 FP
      tcg/ppc: Update vector support for v3.00 Altivec
      tcg/ppc: Update vector support for v3.00 load/store
      tcg/ppc: Update vector support for v3.00 dup/dupi

 tcg/ppc/tcg-target.h | 51 ++-
 tcg/ppc/tcg-target.opc.h | 13 +
 cpus.c | 24 +-
 tcg/ppc/tcg-target.inc.c | 1118 ++++++++++++++++++++++++++++++++++++++++++----
 4 files changed, 1119 insertions(+), 87 deletions(-)
 create mode 100644 tcg/ppc/tcg-target.opc.h


The following changes since commit 7fe6cb68117ac856e03c93d18aca09de015392b0:

  Merge tag 'pull-target-arm-20230530-1' of https://git.linaro.org/people/pmaydell/qemu-arm into staging (2023-05-30 08:02:05 -0700)

are available in the Git repository at:

  https://gitlab.com/rth7680/qemu.git tags/pull-tcg-20230530

for you to fetch changes up to 276d77de503e8f5f5cbd3f7d94302ca12d1d982e:

  tests/decode: Add tests for various named-field cases (2023-05-30 10:55:39 -0700)

----------------------------------------------------------------
Improvements to 128-bit atomics:
  - Separate __int128_t type and arithmetic detection
  - Support 128-bit load/store in backend for i386, aarch64, ppc64, s390x
  - Accelerate atomics via host/include/
Decodetree:
  - Add named field syntax
  - Move tests to meson

----------------------------------------------------------------
Peter Maydell (5):
      docs: Document decodetree named field syntax
      scripts/decodetree: Pass lvalue-formatter function to str_extract()
      scripts/decodetree: Implement a topological sort
      scripts/decodetree: Implement named field support
      tests/decode: Add tests for various named-field cases

Richard Henderson (22):
      tcg: Fix register move type in tcg_out_ld_helper_ret
      accel/tcg: Fix check for page writeability in load_atomic16_or_exit
      meson: Split test for __int128_t type from __int128_t arithmetic
      qemu/atomic128: Add x86_64 atomic128-ldst.h
      tcg/i386: Support 128-bit load/store
      tcg/aarch64: Rename temporaries
      tcg/aarch64: Reserve TCG_REG_TMP1, TCG_REG_TMP2
      tcg/aarch64: Simplify constraints on qemu_ld/st
      tcg/aarch64: Support 128-bit load/store
      tcg/ppc: Support 128-bit load/store
      tcg/s390x: Support 128-bit load/store
      accel/tcg: Extract load_atom_extract_al16_or_al8 to host header
      accel/tcg: Extract store_atom_insert_al16 to host header
      accel/tcg: Add x86_64 load_atom_extract_al16_or_al8
      accel/tcg: Add aarch64 lse2 load_atom_extract_al16_or_al8
      accel/tcg: Add aarch64 store_atom_insert_al16
      tcg: Remove TCG_TARGET_TLB_DISPLACEMENT_BITS
      decodetree: Add --test-for-error
      decodetree: Fix recursion in prop_format and build_tree
      decodetree: Diagnose empty pattern group
      decodetree: Do not remove output_file from /dev
      tests/decode: Convert tests to meson

 docs/devel/decodetree.rst | 33 ++-
 meson.build | 15 +-
 host/include/aarch64/host/load-extract-al16-al8.h | 40 ++++
 host/include/aarch64/host/store-insert-al16.h | 47 ++++
 host/include/generic/host/load-extract-al16-al8.h | 45 ++++
 host/include/generic/host/store-insert-al16.h | 50 ++++
 host/include/x86_64/host/atomic128-ldst.h | 68 ++++++
 host/include/x86_64/host/load-extract-al16-al8.h | 50 ++++
 include/qemu/int128.h | 4 +-
 tcg/aarch64/tcg-target-con-set.h | 4 +-
 tcg/aarch64/tcg-target-con-str.h | 1 -
 tcg/aarch64/tcg-target.h | 12 +-
 tcg/arm/tcg-target.h | 1 -
 tcg/i386/tcg-target.h | 5 +-
 tcg/mips/tcg-target.h | 1 -
 tcg/ppc/tcg-target-con-set.h | 2 +
 tcg/ppc/tcg-target-con-str.h | 1 +
 tcg/ppc/tcg-target.h | 4 +-
 tcg/riscv/tcg-target.h | 1 -
 tcg/s390x/tcg-target-con-set.h | 2 +
 tcg/s390x/tcg-target.h | 3 +-
 tcg/sparc64/tcg-target.h | 1 -
 tcg/tci/tcg-target.h | 1 -
 tests/decode/err_field10.decode | 7 +
 tests/decode/err_field7.decode | 7 +
 tests/decode/err_field8.decode | 8 +
 tests/decode/err_field9.decode | 14 ++
 tests/decode/succ_named_field.decode | 19 ++
 tcg/tcg.c | 4 +-
 accel/tcg/ldst_atomicity.c.inc | 80 +------
 tcg/aarch64/tcg-target.c.inc | 243 +++++++++++++++-----
 tcg/i386/tcg-target.c.inc | 191 +++++++++++++++-
 tcg/ppc/tcg-target.c.inc | 108 ++++++++-
 tcg/s390x/tcg-target.c.inc | 107 ++++++++-
 scripts/decodetree.py | 265 ++++++++++++++++++--
 tests/decode/check.sh | 24 --
 tests/decode/meson.build | 64 ++++++
 tests/meson.build | 5 +-
 38 files changed, 1312 insertions(+), 225 deletions(-)
 create mode 100644 host/include/aarch64/host/load-extract-al16-al8.h
 create mode 100644 host/include/aarch64/host/store-insert-al16.h
 create mode 100644 host/include/generic/host/load-extract-al16-al8.h
 create mode 100644 host/include/generic/host/store-insert-al16.h
 create mode 100644 host/include/x86_64/host/atomic128-ldst.h
 create mode 100644 host/include/x86_64/host/load-extract-al16-al8.h
 create mode 100644 tests/decode/err_field10.decode
 create mode 100644 tests/decode/err_field7.decode
 create mode 100644 tests/decode/err_field8.decode
 create mode 100644 tests/decode/err_field9.decode
 create mode 100644 tests/decode/succ_named_field.decode
 delete mode 100755 tests/decode/check.sh
 create mode 100644 tests/decode/meson.build
The first move was incorrectly using TCG_TYPE_I32 while the second
move was correctly using TCG_TYPE_REG. This prevents a 64-bit host
from moving all 128-bits of the return value.

Fixes: ebebea53ef8 ("tcg: Support TCG_TYPE_I128 in tcg_out_{ld,st}_helper_{args,ret}")
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
---
 tcg/tcg.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

This is only used for 32-bit hosts.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Aleksandar Markovic <amarkovic@wavecomp.com>
---
 tcg/ppc/tcg-target.inc.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/tcg/ppc/tcg-target.inc.c b/tcg/ppc/tcg-target.inc.c
12
diff --git a/tcg/tcg.c b/tcg/tcg.c
10
index XXXXXXX..XXXXXXX 100644
13
index XXXXXXX..XXXXXXX 100644
11
--- a/tcg/ppc/tcg-target.inc.c
14
--- a/tcg/tcg.c
12
+++ b/tcg/ppc/tcg-target.inc.c
15
+++ b/tcg/tcg.c
13
@@ -XXX,XX +XXX,XX @@ static void tcg_out_vec_op(TCGContext *s, TCGOpcode opc,
16
@@ -XXX,XX +XXX,XX @@ static void tcg_out_ld_helper_ret(TCGContext *s, const TCGLabelQemuLdst *ldst,
14
}
17
mov[0].dst = ldst->datalo_reg;
15
break;
18
mov[0].src =
16
19
tcg_target_call_oarg_reg(TCG_CALL_RET_NORMAL, HOST_BIG_ENDIAN);
17
+ case INDEX_op_dup2_vec:
20
- mov[0].dst_type = TCG_TYPE_I32;
18
+ assert(TCG_TARGET_REG_BITS == 32);
21
- mov[0].src_type = TCG_TYPE_I32;
19
+ /* With inputs a1 = xLxx, a2 = xHxx */
22
+ mov[0].dst_type = TCG_TYPE_REG;
20
+ tcg_out32(s, VMRGHW | VRT(a0) | VRA(a2) | VRB(a1)); /* a0 = xxHL */
23
+ mov[0].src_type = TCG_TYPE_REG;
21
+ tcg_out_vsldoi(s, TCG_VEC_TMP1, a0, a0, 8); /* tmp = HLxx */
24
mov[0].src_ext = TCG_TARGET_REG_BITS == 32 ? MO_32 : MO_64;
22
+ tcg_out_vsldoi(s, a0, a0, TCG_VEC_TMP1, 8); /* a0 = HLHL */
25
23
+ return;
26
mov[1].dst = ldst->datahi_reg;
24
+
25
case INDEX_op_ppc_mrgh_vec:
26
insn = mrgh_op[vece];
27
break;
28
@@ -XXX,XX +XXX,XX @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
29
case INDEX_op_ppc_mulou_vec:
30
case INDEX_op_ppc_pkum_vec:
31
case INDEX_op_ppc_rotl_vec:
32
+ case INDEX_op_dup2_vec:
33
return &v_v_v;
34
case INDEX_op_not_vec:
35
case INDEX_op_dup_vec:
36
--
27
--
37
2.17.1
28
2.34.1
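
To see why the tcg_out_ld_helper_ret fix above matters, consider a minimal standalone C sketch (not QEMU code; the two helpers below are invented for illustration): a move typed as TCG_TYPE_I32 only preserves the low 32 bits of a 64-bit host register, so copying the two halves of an Int128 return value with that type silently drops the upper halves, while TCG_TYPE_REG moves the full register.

    #include <assert.h>
    #include <stdint.h>

    /* Illustration only: model a register move typed as TCG_TYPE_I32
     * (zero-extends from 32 bits) versus TCG_TYPE_REG (full width). */
    static uint64_t mov_typed_i32(uint64_t src) { return (uint32_t)src; }
    static uint64_t mov_typed_reg(uint64_t src) { return src; }

    int main(void)
    {
        uint64_t half = 0x123456789abcdef0ull;  /* one half of an Int128 return value */

        assert(mov_typed_reg(half) == half);            /* all 64 bits survive */
        assert(mov_typed_i32(half) == 0x9abcdef0ull);   /* high 32 bits lost */
        return 0;
    }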
38
39
PAGE_WRITE is current writability, as modified by TB protection;
PAGE_WRITE_ORG is the original page writability.

Fixes: cdfac37be0d ("accel/tcg: Honor atomicity of loads")
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 accel/tcg/ldst_atomicity.c.inc | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Introduce an enum to hold base < 2.06 < 3.00. Use macros to
preserve the existing have_isa_2_06 and have_isa_3_00 predicates.

Reviewed-by: Aleksandar Markovic <amarkovic@wavecomp.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/ppc/tcg-target.h | 12 ++++++++++--
 tcg/ppc/tcg-target.inc.c | 8 ++++----
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/tcg/ppc/tcg-target.h b/tcg/ppc/tcg-target.h
11
diff --git a/accel/tcg/ldst_atomicity.c.inc b/accel/tcg/ldst_atomicity.c.inc
12
index XXXXXXX..XXXXXXX 100644
12
index XXXXXXX..XXXXXXX 100644
13
--- a/tcg/ppc/tcg-target.h
13
--- a/accel/tcg/ldst_atomicity.c.inc
14
+++ b/tcg/ppc/tcg-target.h
14
+++ b/accel/tcg/ldst_atomicity.c.inc
15
@@ -XXX,XX +XXX,XX @@ typedef enum {
15
@@ -XXX,XX +XXX,XX @@ static uint64_t load_atomic8_or_exit(CPUArchState *env, uintptr_t ra, void *pv)
16
TCG_AREG0 = TCG_REG_R27
16
* another process, because the fallback start_exclusive solution
17
} TCGReg;
17
* provides no protection across processes.
18
18
*/
19
-extern bool have_isa_2_06;
19
- if (!page_check_range(h2g(pv), 8, PAGE_WRITE)) {
20
-extern bool have_isa_3_00;
20
+ if (!page_check_range(h2g(pv), 8, PAGE_WRITE_ORG)) {
21
+typedef enum {
21
uint64_t *p = __builtin_assume_aligned(pv, 8);
22
+ tcg_isa_base,
22
return *p;
23
+ tcg_isa_2_06,
24
+ tcg_isa_3_00,
25
+} TCGPowerISA;
26
+
27
+extern TCGPowerISA have_isa;
28
+
29
+#define have_isa_2_06 (have_isa >= tcg_isa_2_06)
30
+#define have_isa_3_00 (have_isa >= tcg_isa_3_00)
31
32
/* optional instructions automatically implemented */
33
#define TCG_TARGET_HAS_ext8u_i32 0 /* andi */
34
diff --git a/tcg/ppc/tcg-target.inc.c b/tcg/ppc/tcg-target.inc.c
35
index XXXXXXX..XXXXXXX 100644
36
--- a/tcg/ppc/tcg-target.inc.c
37
+++ b/tcg/ppc/tcg-target.inc.c
38
@@ -XXX,XX +XXX,XX @@
39
40
static tcg_insn_unit *tb_ret_addr;
41
42
-bool have_isa_2_06;
43
-bool have_isa_3_00;
44
+TCGPowerISA have_isa;
45
46
#define HAVE_ISA_2_06 have_isa_2_06
47
#define HAVE_ISEL have_isa_2_06
48
@@ -XXX,XX +XXX,XX @@ static void tcg_target_init(TCGContext *s)
49
unsigned long hwcap = qemu_getauxval(AT_HWCAP);
50
unsigned long hwcap2 = qemu_getauxval(AT_HWCAP2);
51
52
+ have_isa = tcg_isa_base;
53
if (hwcap & PPC_FEATURE_ARCH_2_06) {
54
- have_isa_2_06 = true;
55
+ have_isa = tcg_isa_2_06;
56
}
23
}
57
#ifdef PPC_FEATURE2_ARCH_3_00
24
@@ -XXX,XX +XXX,XX @@ static Int128 load_atomic16_or_exit(CPUArchState *env, uintptr_t ra, void *pv)
58
if (hwcap2 & PPC_FEATURE2_ARCH_3_00) {
25
* another process, because the fallback start_exclusive solution
59
- have_isa_3_00 = true;
26
* provides no protection across processes.
60
+ have_isa = tcg_isa_3_00;
27
*/
28
- if (!page_check_range(h2g(p), 16, PAGE_WRITE)) {
29
+ if (!page_check_range(h2g(p), 16, PAGE_WRITE_ORG)) {
30
return *p;
61
}
31
}
62
#endif
32
#endif
63
64
--
33
--
65
2.17.1
34
2.34.1
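
The PAGE_WRITE vs. PAGE_WRITE_ORG distinction above can be pictured with a standalone analogy (plain POSIX code, not QEMU internals): a page that is write-protected only temporarily, the way QEMU protects pages holding translated code, is still "originally" writable, and that original permission is what decides whether a cmpxchg-based atomic read is safe.

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        long pagesz = sysconf(_SC_PAGESIZE);
        char *p = mmap(NULL, pagesz, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            return 1;
        }
        strcpy(p, "mapped writable");

        /* Analogue of TB protection: current protection drops to read-only,
         * but the page was created writable (PAGE_WRITE_ORG, in QEMU terms). */
        mprotect(p, pagesz, PROT_READ);
        printf("read-only now, contents: %s\n", p);

        /* Restoring the original protection makes read-modify-write legal again. */
        mprotect(p, pagesz, PROT_READ | PROT_WRITE);
        strcpy(p, "writable again");
        printf("%s\n", p);

        munmap(p, pagesz);
        return 0;
    }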
66
67
Older versions of clang have missing runtime functions for arithmetic
with -fsanitize=undefined (see 464e3671f9d5c), so we cannot use
__int128_t for implementing Int128. But __int128_t is present,
data movement works, and it can be used for atomic128.

Probe for both CONFIG_INT128_TYPE and CONFIG_INT128, adjust
qemu/int128.h to define Int128Alias if CONFIG_INT128_TYPE,
and adjust the meson probe for atomics to use has_int128_type.

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 meson.build | 15 ++++++++++-----
 include/qemu/int128.h | 4 ++--
 2 files changed, 12 insertions(+), 7 deletions(-)

These new instructions are conditional on MSR.VEC for TX=1,
so we can consider these Altivec instructions.

Reviewed-by: Aleksandar Markovic <amarkovic@wavecomp.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/ppc/tcg-target.inc.c | 28 ++++++++++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/tcg/ppc/tcg-target.inc.c b/tcg/ppc/tcg-target.inc.c
17
diff --git a/meson.build b/meson.build
11
index XXXXXXX..XXXXXXX 100644
18
index XXXXXXX..XXXXXXX 100644
12
--- a/tcg/ppc/tcg-target.inc.c
19
--- a/meson.build
13
+++ b/tcg/ppc/tcg-target.inc.c
20
+++ b/meson.build
14
@@ -XXX,XX +XXX,XX @@ static int tcg_target_const_match(tcg_target_long val, TCGType type,
21
@@ -XXX,XX +XXX,XX @@ config_host_data.set('CONFIG_ATOMIC64', cc.links('''
15
22
return 0;
16
#define XXPERMDI (OPCD(60) | (10 << 3) | 7) /* v2.06, force ax=bx=tx=1 */
23
}'''))
17
#define XXSEL (OPCD(60) | (3 << 4) | 0xf) /* v2.06, force ax=bx=cx=tx=1 */
24
18
+#define XXSPLTIB (OPCD(60) | (360 << 1) | 1) /* v3.00, force tx=1 */
25
-has_int128 = cc.links('''
19
26
+has_int128_type = cc.compiles('''
20
#define MFVSRD (XO31(51) | 1) /* v2.07, force sx=1 */
27
+ __int128_t a;
21
#define MFVSRWZ (XO31(115) | 1) /* v2.07, force sx=1 */
28
+ __uint128_t b;
22
#define MTVSRD (XO31(179) | 1) /* v2.07, force tx=1 */
29
+ int main(void) { b = a; }''')
23
#define MTVSRWZ (XO31(243) | 1) /* v2.07, force tx=1 */
30
+config_host_data.set('CONFIG_INT128_TYPE', has_int128_type)
24
+#define MTVSRDD (XO31(435) | 1) /* v3.00, force tx=1 */
31
+
25
+#define MTVSRWS (XO31(403) | 1) /* v3.00, force tx=1 */
32
+has_int128 = has_int128_type and cc.links('''
26
33
__int128_t a;
27
#define RT(r) ((r)<<21)
34
__uint128_t b;
28
#define RS(r) ((r)<<21)
35
int main (void) {
29
@@ -XXX,XX +XXX,XX @@ static void tcg_out_dupi_vec(TCGContext *s, TCGType type, TCGReg ret,
36
@@ -XXX,XX +XXX,XX @@ has_int128 = cc.links('''
30
return;
37
a = a * a;
38
return 0;
39
}''')
40
-
41
config_host_data.set('CONFIG_INT128', has_int128)
42
43
-if has_int128
44
+if has_int128_type
45
# "do we have 128-bit atomics which are handled inline and specifically not
46
# via libatomic". The reason we can't use libatomic is documented in the
47
# comment starting "GCC is a house divided" in include/qemu/atomic128.h.
48
@@ -XXX,XX +XXX,XX @@ if has_int128
49
# __alignof(unsigned __int128) for the host.
50
atomic_test_128 = '''
51
int main(int ac, char **av) {
52
- unsigned __int128 *p = __builtin_assume_aligned(av[ac - 1], 16);
53
+ __uint128_t *p = __builtin_assume_aligned(av[ac - 1], 16);
54
p[1] = __atomic_load_n(&p[0], __ATOMIC_RELAXED);
55
__atomic_store_n(&p[2], p[3], __ATOMIC_RELAXED);
56
__atomic_compare_exchange_n(&p[4], &p[5], p[6], 0, __ATOMIC_RELAXED, __ATOMIC_RELAXED);
57
@@ -XXX,XX +XXX,XX @@ if has_int128
58
config_host_data.set('CONFIG_CMPXCHG128', cc.links('''
59
int main(void)
60
{
61
- unsigned __int128 x = 0, y = 0;
62
+ __uint128_t x = 0, y = 0;
63
__sync_val_compare_and_swap_16(&x, y, x);
64
return 0;
31
}
65
}
32
}
66
diff --git a/include/qemu/int128.h b/include/qemu/int128.h
33
+ if (have_isa_3_00 && val == (tcg_target_long)dup_const(MO_8, val)) {
67
index XXXXXXX..XXXXXXX 100644
34
+ tcg_out32(s, XXSPLTIB | VRT(ret) | ((val & 0xff) << 11));
68
--- a/include/qemu/int128.h
35
+ return;
69
+++ b/include/qemu/int128.h
36
+ }
70
@@ -XXX,XX +XXX,XX @@ static inline void bswap128s(Int128 *s)
37
71
* a possible structure and the native types. Ease parameter passing
38
/*
72
* via use of the transparent union extension.
39
* Otherwise we must load the value from the constant pool.
73
*/
40
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_dup_vec(TCGContext *s, TCGType type, unsigned vece,
74
-#ifdef CONFIG_INT128
41
TCGReg dst, TCGReg src)
75
+#ifdef CONFIG_INT128_TYPE
42
{
76
typedef union {
43
tcg_debug_assert(dst >= TCG_REG_V0);
77
__uint128_t u;
44
- tcg_debug_assert(src >= TCG_REG_V0);
78
__int128_t i;
45
+
79
@@ -XXX,XX +XXX,XX @@ typedef union {
46
+ /* Splat from integer reg allowed via constraints for v3.00. */
80
} Int128Alias __attribute__((transparent_union));
47
+ if (src < TCG_REG_V0) {
81
#else
48
+ tcg_debug_assert(have_isa_3_00);
82
typedef Int128 Int128Alias;
49
+ switch (vece) {
83
-#endif /* CONFIG_INT128 */
50
+ case MO_64:
84
+#endif /* CONFIG_INT128_TYPE */
51
+ tcg_out32(s, MTVSRDD | VRT(dst) | RA(src) | RB(src));
85
52
+ return true;
86
#endif /* INT128_H */
53
+ case MO_32:
54
+ tcg_out32(s, MTVSRWS | VRT(dst) | RA(src));
55
+ return true;
56
+ default:
57
+ /* Fail, so that we fall back on either dupm or mov+dup. */
58
+ return false;
59
+ }
60
+ }
61
62
/*
63
* Recall we use (or emulate) VSX integer loads, so the integer is
64
@@ -XXX,XX +XXX,XX @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
65
static const TCGTargetOpDef sub2
66
= { .args_ct_str = { "r", "r", "rI", "rZM", "r", "r" } };
67
static const TCGTargetOpDef v_r = { .args_ct_str = { "v", "r" } };
68
+ static const TCGTargetOpDef v_vr = { .args_ct_str = { "v", "vr" } };
69
static const TCGTargetOpDef v_v = { .args_ct_str = { "v", "v" } };
70
static const TCGTargetOpDef v_v_v = { .args_ct_str = { "v", "v", "v" } };
71
static const TCGTargetOpDef v_v_v_v
72
@@ -XXX,XX +XXX,XX @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
73
return &v_v_v;
74
case INDEX_op_not_vec:
75
case INDEX_op_neg_vec:
76
- case INDEX_op_dup_vec:
77
return &v_v;
78
+ case INDEX_op_dup_vec:
79
+ return have_isa_3_00 ? &v_vr : &v_v;
80
case INDEX_op_ld_vec:
81
case INDEX_op_st_vec:
82
case INDEX_op_dupm_vec:
83
--
87
--
84
2.17.1
88
2.34.1
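
The split described above boils down to two separate probes, which the meson hunk expresses as compile and link tests. Here is a standalone C sketch of the same distinction (illustrative only; the authoritative checks are the meson.build fragments in the patch): the first function needs only the __int128_t type and data movement, while the second performs arithmetic, which under -fsanitize=undefined can require compiler runtime helpers that older clang versions do not provide.

    /* Probe 1 (CONFIG_INT128_TYPE): the type exists and can be copied. */
    static __int128_t a;
    static __uint128_t b;

    int probe_int128_type(void)
    {
        b = a;              /* data movement only */
        return 0;
    }

    /* Probe 2 (CONFIG_INT128): arithmetic also works; with
     * -fsanitize=undefined this may need compiler runtime helpers
     * that older clang versions lack. */
    int probe_int128_arith(void)
    {
        a = a * a;
        return 0;
    }

    int main(void)
    {
        return probe_int128_type() | probe_int128_arith();
    }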
85
86
With CPUINFO_ATOMIC_VMOVDQA, we can perform proper atomic
load/store without cmpxchg16b.

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 host/include/x86_64/host/atomic128-ldst.h | 68 +++++++++++++++++++++++
 1 file changed, 68 insertions(+)
 create mode 100644 host/include/x86_64/host/atomic128-ldst.h

These new instructions are a mix of those like LXSD that are
conditional only on MSR.VEC and those like LXV that are
conditional on MSR.VEC for TX=1. Thus, in the end, we can
consider all of these as Altivec instructions.

Reviewed-by: Aleksandar Markovic <amarkovic@wavecomp.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/ppc/tcg-target.inc.c | 47 ++++++++++++++++++++++++++++++++--------
 1 file changed, 38 insertions(+), 9 deletions(-)

diff --git a/tcg/ppc/tcg-target.inc.c b/tcg/ppc/tcg-target.inc.c
11
diff --git a/host/include/x86_64/host/atomic128-ldst.h b/host/include/x86_64/host/atomic128-ldst.h
13
index XXXXXXX..XXXXXXX 100644
12
new file mode 100644
14
--- a/tcg/ppc/tcg-target.inc.c
13
index XXXXXXX..XXXXXXX
15
+++ b/tcg/ppc/tcg-target.inc.c
14
--- /dev/null
16
@@ -XXX,XX +XXX,XX @@ static int tcg_target_const_match(tcg_target_long val, TCGType type,
15
+++ b/host/include/x86_64/host/atomic128-ldst.h
17
#define LXSDX (XO31(588) | 1) /* v2.06, force tx=1 */
16
@@ -XXX,XX +XXX,XX @@
18
#define LXVDSX (XO31(332) | 1) /* v2.06, force tx=1 */
17
+/*
19
#define LXSIWZX (XO31(12) | 1) /* v2.07, force tx=1 */
18
+ * SPDX-License-Identifier: GPL-2.0-or-later
20
+#define LXV (OPCD(61) | 8 | 1) /* v3.00, force tx=1 */
19
+ * Load/store for 128-bit atomic operations, x86_64 version.
21
+#define LXSD (OPCD(57) | 2) /* v3.00 */
20
+ *
22
+#define LXVWSX (XO31(364) | 1) /* v3.00, force tx=1 */
21
+ * Copyright (C) 2023 Linaro, Ltd.
23
22
+ *
24
#define STVX XO31(231)
23
+ * See docs/devel/atomics.rst for discussion about the guarantees each
25
#define STVEWX XO31(199)
24
+ * atomic primitive is meant to provide.
26
#define STXSDX (XO31(716) | 1) /* v2.06, force sx=1 */
25
+ */
27
#define STXSIWX (XO31(140) | 1) /* v2.07, force sx=1 */
26
+
28
+#define STXV (OPCD(61) | 8 | 5) /* v3.00, force sx=1 */
27
+#ifndef AARCH64_ATOMIC128_LDST_H
29
+#define STXSD (OPCD(61) | 2) /* v3.00 */
28
+#define AARCH64_ATOMIC128_LDST_H
30
29
+
31
#define VADDSBS VX4(768)
30
+#ifdef CONFIG_INT128_TYPE
32
#define VADDUBS VX4(512)
31
+#include "host/cpuinfo.h"
33
@@ -XXX,XX +XXX,XX @@ static void tcg_out_mem_long(TCGContext *s, int opi, int opx, TCGReg rt,
32
+#include "tcg/debug-assert.h"
34
TCGReg base, tcg_target_long offset)
33
+
35
{
34
+/*
36
tcg_target_long orig = offset, l0, l1, extra = 0, align = 0;
35
+ * Through clang 16, with -mcx16, __atomic_load_n is incorrectly
37
- bool is_store = false;
36
+ * expanded to a read-write operation: lock cmpxchg16b.
38
+ bool is_int_store = false;
37
+ */
39
TCGReg rs = TCG_REG_TMP1;
38
+
40
39
+#define HAVE_ATOMIC128_RO likely(cpuinfo & CPUINFO_ATOMIC_VMOVDQA)
41
switch (opi) {
40
+#define HAVE_ATOMIC128_RW 1
42
@@ -XXX,XX +XXX,XX @@ static void tcg_out_mem_long(TCGContext *s, int opi, int opx, TCGReg rt,
41
+
43
break;
42
+static inline Int128 atomic16_read_ro(const Int128 *ptr)
44
}
43
+{
45
break;
44
+ Int128Alias r;
46
+ case LXSD:
45
+
47
+ case STXSD:
46
+ tcg_debug_assert(HAVE_ATOMIC128_RO);
48
+ align = 3;
47
+ asm("vmovdqa %1, %0" : "=x" (r.i) : "m" (*ptr));
49
+ break;
48
+
50
+ case LXV:
49
+ return r.s;
51
+ case STXV:
50
+}
52
+ align = 15;
51
+
53
+ break;
52
+static inline Int128 atomic16_read_rw(Int128 *ptr)
54
case STD:
53
+{
55
align = 3;
54
+ __int128_t *ptr_align = __builtin_assume_aligned(ptr, 16);
56
/* FALLTHRU */
55
+ Int128Alias r;
57
case STB: case STH: case STW:
56
+
58
- is_store = true;
57
+ if (HAVE_ATOMIC128_RO) {
59
+ is_int_store = true;
58
+ asm("vmovdqa %1, %0" : "=x" (r.i) : "m" (*ptr_align));
60
break;
59
+ } else {
61
}
60
+ r.i = __sync_val_compare_and_swap_16(ptr_align, 0, 0);
62
61
+ }
63
@@ -XXX,XX +XXX,XX @@ static void tcg_out_mem_long(TCGContext *s, int opi, int opx, TCGReg rt,
62
+ return r.s;
64
if (rs == base) {
63
+}
65
rs = TCG_REG_R0;
64
+
66
}
65
+static inline void atomic16_set(Int128 *ptr, Int128 val)
67
- tcg_debug_assert(!is_store || rs != rt);
66
+{
68
+ tcg_debug_assert(!is_int_store || rs != rt);
67
+ __int128_t *ptr_align = __builtin_assume_aligned(ptr, 16);
69
tcg_out_movi(s, TCG_TYPE_PTR, rs, orig);
68
+ Int128Alias new = { .s = val };
70
tcg_out32(s, opx | TAB(rt & 31, base, rs));
69
+
71
return;
70
+ if (HAVE_ATOMIC128_RO) {
72
@@ -XXX,XX +XXX,XX @@ static void tcg_out_ld(TCGContext *s, TCGType type, TCGReg ret,
71
+ asm("vmovdqa %1, %0" : "=m"(*ptr_align) : "x" (new.i));
73
case TCG_TYPE_V64:
72
+ } else {
74
tcg_debug_assert(ret >= TCG_REG_V0);
73
+ __int128_t old;
75
if (have_vsx) {
74
+ do {
76
- tcg_out_mem_long(s, 0, LXSDX, ret, base, offset);
75
+ old = *ptr_align;
77
+ tcg_out_mem_long(s, have_isa_3_00 ? LXSD : 0, LXSDX,
76
+ } while (!__sync_bool_compare_and_swap_16(ptr_align, old, new.i));
78
+ ret, base, offset);
77
+ }
79
break;
78
+}
80
}
79
+#else
81
tcg_debug_assert((offset & 7) == 0);
80
+/* Provide QEMU_ERROR stubs. */
82
@@ -XXX,XX +XXX,XX @@ static void tcg_out_ld(TCGContext *s, TCGType type, TCGReg ret,
81
+#include "host/include/generic/host/atomic128-ldst.h"
83
case TCG_TYPE_V128:
82
+#endif
84
tcg_debug_assert(ret >= TCG_REG_V0);
83
+
85
tcg_debug_assert((offset & 15) == 0);
84
+#endif /* AARCH64_ATOMIC128_LDST_H */
86
- tcg_out_mem_long(s, 0, LVX, ret, base, offset);
87
+ tcg_out_mem_long(s, have_isa_3_00 ? LXV : 0,
88
+ LVX, ret, base, offset);
89
break;
90
default:
91
g_assert_not_reached();
92
@@ -XXX,XX +XXX,XX @@ static void tcg_out_st(TCGContext *s, TCGType type, TCGReg arg,
93
case TCG_TYPE_V64:
94
tcg_debug_assert(arg >= TCG_REG_V0);
95
if (have_vsx) {
96
- tcg_out_mem_long(s, 0, STXSDX, arg, base, offset);
97
+ tcg_out_mem_long(s, have_isa_3_00 ? STXSD : 0,
98
+ STXSDX, arg, base, offset);
99
break;
100
}
101
tcg_debug_assert((offset & 7) == 0);
102
@@ -XXX,XX +XXX,XX @@ static void tcg_out_st(TCGContext *s, TCGType type, TCGReg arg,
103
break;
104
case TCG_TYPE_V128:
105
tcg_debug_assert(arg >= TCG_REG_V0);
106
- tcg_out_mem_long(s, 0, STVX, arg, base, offset);
107
+ tcg_out_mem_long(s, have_isa_3_00 ? STXV : 0,
108
+ STVX, arg, base, offset);
109
break;
110
default:
111
g_assert_not_reached();
112
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_dupm_vec(TCGContext *s, TCGType type, unsigned vece,
113
tcg_debug_assert(out >= TCG_REG_V0);
114
switch (vece) {
115
case MO_8:
116
- tcg_out_mem_long(s, 0, LVEBX, out, base, offset);
117
+ if (have_isa_3_00) {
118
+ tcg_out_mem_long(s, LXV, LVX, out, base, offset & -16);
119
+ } else {
120
+ tcg_out_mem_long(s, 0, LVEBX, out, base, offset);
121
+ }
122
elt = extract32(offset, 0, 4);
123
#ifndef HOST_WORDS_BIGENDIAN
124
elt ^= 15;
125
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_dupm_vec(TCGContext *s, TCGType type, unsigned vece,
126
break;
127
case MO_16:
128
tcg_debug_assert((offset & 1) == 0);
129
- tcg_out_mem_long(s, 0, LVEHX, out, base, offset);
130
+ if (have_isa_3_00) {
131
+ tcg_out_mem_long(s, LXV | 8, LVX, out, base, offset & -16);
132
+ } else {
133
+ tcg_out_mem_long(s, 0, LVEHX, out, base, offset);
134
+ }
135
elt = extract32(offset, 1, 3);
136
#ifndef HOST_WORDS_BIGENDIAN
137
elt ^= 7;
138
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_dupm_vec(TCGContext *s, TCGType type, unsigned vece,
139
tcg_out32(s, VSPLTH | VRT(out) | VRB(out) | (elt << 16));
140
break;
141
case MO_32:
142
+ if (have_isa_3_00) {
143
+ tcg_out_mem_long(s, 0, LXVWSX, out, base, offset);
144
+ break;
145
+ }
146
tcg_debug_assert((offset & 3) == 0);
147
tcg_out_mem_long(s, 0, LVEWX, out, base, offset);
148
elt = extract32(offset, 2, 2);
149
--
85
--
150
2.17.1
86
2.34.1
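
A minimal standalone sketch of the x86_64 technique introduced in atomic128-ldst.h above, assuming a CPU on which aligned 16-byte VMOVDQA is atomic and a compiler invoked with -mcx16 (the flag and helper names below are illustrative, not the QEMU identifiers): prefer the vector move when the CPU guarantees atomicity, and otherwise fall back to cmpxchg16b, which is a locked read-modify-write and therefore needs the page to be writable even for a pure read.

    /* Build sketch: gcc -O2 -mcx16 -c atomic16_sketch.c */
    #include <stdbool.h>

    typedef __int128_t i128;

    /* Assume this flag is filled in from CPUID at startup, in the same
     * spirit as CPUINFO_ATOMIC_VMOVDQA in the patch. */
    extern bool have_atomic_vmovdqa;

    static inline i128 atomic16_read(i128 *ptr)
    {
        i128 *p = __builtin_assume_aligned(ptr, 16);
        i128 r;

        if (have_atomic_vmovdqa) {
            /* Aligned 16-byte vector load; atomic when the CPU says so. */
            asm("vmovdqa %1, %0" : "=x"(r) : "m"(*p));
        } else {
            /* cmpxchg16b with equal old/new values reads atomically,
             * but it is a locked RMW, so *ptr must be writable. */
            r = __sync_val_compare_and_swap_16(p, 0, 0);
        }
        return r;
    }

    static inline void atomic16_write(i128 *ptr, i128 val)
    {
        i128 *p = __builtin_assume_aligned(ptr, 16);

        if (have_atomic_vmovdqa) {
            asm("vmovdqa %1, %0" : "=m"(*p) : "x"(val));
        } else {
            i128 old;
            do {
                old = *p;
            } while (!__sync_bool_compare_and_swap_16(p, old, val));
        }
    }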
151
87
152
88
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/i386/tcg-target.h | 4 +-
 tcg/i386/tcg-target.c.inc | 191 +++++++++++++++++++++++++++++++++++++-
 2 files changed, 190 insertions(+), 5 deletions(-)

diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
index XXXXXXX..XXXXXXX 100644

The VSX instruction set instructions include double-word loads and
stores, double-word load and splat, double-word permute, and bit
select. All of which require multiple operations in the Altivec
instruction set.

Because the VSX registers map %vsr32 to %vr0, and we have no current
intention or need to use vector registers outside %vr0-%vr19, force
on the {ax,bx,cx,tx} bits within the added VSX insns so that we don't
have to otherwise modify the VR[TABC] macros.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Aleksandar Markovic <amarkovic@wavecomp.com>
---
 tcg/ppc/tcg-target.h | 5 ++--
 tcg/ppc/tcg-target.inc.c | 52 ++++++++++++++++++++++++++++++++----
 2 files changed, 51 insertions(+), 6 deletions(-)

diff --git a/tcg/ppc/tcg-target.h b/tcg/ppc/tcg-target.h
index XXXXXXX..XXXXXXX 100644
--- a/tcg/ppc/tcg-target.h
10
--- a/tcg/i386/tcg-target.h
21
+++ b/tcg/ppc/tcg-target.h
11
+++ b/tcg/i386/tcg-target.h
22
@@ -XXX,XX +XXX,XX @@ typedef enum {
12
@@ -XXX,XX +XXX,XX @@ typedef enum {
23
13
#define have_avx1 (cpuinfo & CPUINFO_AVX1)
24
extern TCGPowerISA have_isa;
14
#define have_avx2 (cpuinfo & CPUINFO_AVX2)
25
extern bool have_altivec;
15
#define have_movbe (cpuinfo & CPUINFO_MOVBE)
26
+extern bool have_vsx;
16
-#define have_atomic16 (cpuinfo & CPUINFO_ATOMIC_VMOVDQA)
27
17
28
#define have_isa_2_06 (have_isa >= tcg_isa_2_06)
18
/*
29
#define have_isa_3_00 (have_isa >= tcg_isa_3_00)
19
* There are interesting instructions in AVX512, so long as we have AVX512VL,
30
@@ -XXX,XX +XXX,XX @@ extern bool have_altivec;
20
@@ -XXX,XX +XXX,XX @@ typedef enum {
31
* instruction and substituting two 32-bit stores makes the generated
21
#define TCG_TARGET_HAS_qemu_st8_i32 1
32
* code quite large.
22
#endif
23
24
-#define TCG_TARGET_HAS_qemu_ldst_i128 0
25
+#define TCG_TARGET_HAS_qemu_ldst_i128 \
26
+ (TCG_TARGET_REG_BITS == 64 && (cpuinfo & CPUINFO_ATOMIC_VMOVDQA))
27
28
/* We do not support older SSE systems, only beginning with AVX1. */
29
#define TCG_TARGET_HAS_v64 have_avx1
30
diff --git a/tcg/i386/tcg-target.c.inc b/tcg/i386/tcg-target.c.inc
31
index XXXXXXX..XXXXXXX 100644
32
--- a/tcg/i386/tcg-target.c.inc
33
+++ b/tcg/i386/tcg-target.c.inc
34
@@ -XXX,XX +XXX,XX @@ static const int tcg_target_reg_alloc_order[] = {
35
#endif
36
};
37
38
+#define TCG_TMP_VEC TCG_REG_XMM5
39
+
40
static const int tcg_target_call_iarg_regs[] = {
41
#if TCG_TARGET_REG_BITS == 64
42
#if defined(_WIN64)
43
@@ -XXX,XX +XXX,XX @@ static bool tcg_target_const_match(int64_t val, TCGType type, int ct)
44
#define OPC_PCMPGTW (0x65 | P_EXT | P_DATA16)
45
#define OPC_PCMPGTD (0x66 | P_EXT | P_DATA16)
46
#define OPC_PCMPGTQ (0x37 | P_EXT38 | P_DATA16)
47
+#define OPC_PEXTRD (0x16 | P_EXT3A | P_DATA16)
48
+#define OPC_PINSRD (0x22 | P_EXT3A | P_DATA16)
49
#define OPC_PMAXSB (0x3c | P_EXT38 | P_DATA16)
50
#define OPC_PMAXSW (0xee | P_EXT | P_DATA16)
51
#define OPC_PMAXSD (0x3d | P_EXT38 | P_DATA16)
52
@@ -XXX,XX +XXX,XX @@ typedef struct {
53
54
bool tcg_target_has_memory_bswap(MemOp memop)
55
{
56
- return have_movbe;
57
+ TCGAtomAlign aa;
58
+
59
+ if (!have_movbe) {
60
+ return false;
61
+ }
62
+ if ((memop & MO_SIZE) < MO_128) {
63
+ return true;
64
+ }
65
+
66
+ /*
67
+ * Reject 16-byte memop with 16-byte atomicity, i.e. VMOVDQA,
68
+ * but do allow a pair of 64-bit operations, i.e. MOVBEQ.
69
+ */
70
+ aa = atom_and_align_for_opc(tcg_ctx, memop, MO_ATOM_IFALIGN, true);
71
+ return aa.atom < MO_128;
72
}
73
74
/*
75
@@ -XXX,XX +XXX,XX @@ static const TCGLdstHelperParam ldst_helper_param = {
76
static const TCGLdstHelperParam ldst_helper_param = { };
77
#endif
78
79
+static void tcg_out_vec_to_pair(TCGContext *s, TCGType type,
80
+ TCGReg l, TCGReg h, TCGReg v)
81
+{
82
+ int rexw = type == TCG_TYPE_I32 ? 0 : P_REXW;
83
+
84
+ /* vpmov{d,q} %v, %l */
85
+ tcg_out_vex_modrm(s, OPC_MOVD_EyVy + rexw, v, 0, l);
86
+ /* vpextr{d,q} $1, %v, %h */
87
+ tcg_out_vex_modrm(s, OPC_PEXTRD + rexw, v, 0, h);
88
+ tcg_out8(s, 1);
89
+}
90
+
91
+static void tcg_out_pair_to_vec(TCGContext *s, TCGType type,
92
+ TCGReg v, TCGReg l, TCGReg h)
93
+{
94
+ int rexw = type == TCG_TYPE_I32 ? 0 : P_REXW;
95
+
96
+ /* vmov{d,q} %l, %v */
97
+ tcg_out_vex_modrm(s, OPC_MOVD_VyEy + rexw, v, 0, l);
98
+ /* vpinsr{d,q} $1, %h, %v, %v */
99
+ tcg_out_vex_modrm(s, OPC_PINSRD + rexw, v, v, h);
100
+ tcg_out8(s, 1);
101
+}
102
+
103
/*
104
* Generate code for the slow path for a load at the end of block
33
*/
105
*/
34
-#define TCG_TARGET_HAS_v64 0
106
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
35
+#define TCG_TARGET_HAS_v64 have_vsx
107
{
36
#define TCG_TARGET_HAS_v128 have_altivec
108
TCGLabelQemuLdst *ldst = NULL;
37
#define TCG_TARGET_HAS_v256 0
109
MemOp opc = get_memop(oi);
38
110
+ MemOp s_bits = opc & MO_SIZE;
39
@@ -XXX,XX +XXX,XX @@ extern bool have_altivec;
111
unsigned a_mask;
40
#define TCG_TARGET_HAS_mul_vec 1
112
41
#define TCG_TARGET_HAS_sat_vec 1
113
#ifdef CONFIG_SOFTMMU
42
#define TCG_TARGET_HAS_minmax_vec 1
114
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
43
-#define TCG_TARGET_HAS_bitsel_vec 0
115
*h = x86_guest_base;
44
+#define TCG_TARGET_HAS_bitsel_vec have_vsx
116
#endif
45
#define TCG_TARGET_HAS_cmpsel_vec 0
117
h->base = addrlo;
46
118
- h->aa = atom_and_align_for_opc(s, opc, MO_ATOM_IFALIGN, false);
47
void flush_icache_range(uintptr_t start, uintptr_t stop);
119
+ h->aa = atom_and_align_for_opc(s, opc, MO_ATOM_IFALIGN, s_bits == MO_128);
48
diff --git a/tcg/ppc/tcg-target.inc.c b/tcg/ppc/tcg-target.inc.c
120
a_mask = (1 << h->aa.align) - 1;
49
index XXXXXXX..XXXXXXX 100644
121
50
--- a/tcg/ppc/tcg-target.inc.c
122
#ifdef CONFIG_SOFTMMU
51
+++ b/tcg/ppc/tcg-target.inc.c
123
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
52
@@ -XXX,XX +XXX,XX @@ static tcg_insn_unit *tb_ret_addr;
124
TCGType tlbtype = TCG_TYPE_I32;
53
TCGPowerISA have_isa;
125
int trexw = 0, hrexw = 0, tlbrexw = 0;
54
static bool have_isel;
126
unsigned mem_index = get_mmuidx(oi);
55
bool have_altivec;
127
- unsigned s_bits = opc & MO_SIZE;
56
+bool have_vsx;
128
unsigned s_mask = (1 << s_bits) - 1;
57
129
int tlb_mask;
58
#ifndef CONFIG_SOFTMMU
130
59
#define TCG_GUEST_BASE_REG 30
131
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_ld_direct(TCGContext *s, TCGReg datalo, TCGReg datahi,
60
@@ -XXX,XX +XXX,XX @@ static int tcg_target_const_match(tcg_target_long val, TCGType type,
132
h.base, h.index, 0, h.ofs + 4);
61
#define LVEBX XO31(7)
133
}
62
#define LVEHX XO31(39)
134
break;
63
#define LVEWX XO31(71)
135
+
64
+#define LXSDX (XO31(588) | 1) /* v2.06, force tx=1 */
136
+ case MO_128:
65
+#define LXVDSX (XO31(332) | 1) /* v2.06, force tx=1 */
137
+ tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
66
138
+
67
#define STVX XO31(231)
139
+ /*
68
#define STVEWX XO31(199)
140
+ * Without 16-byte atomicity, use integer regs.
69
+#define STXSDX (XO31(716) | 1) /* v2.06, force sx=1 */
141
+ * That is where we want the data, and it allows bswaps.
70
142
+ */
71
#define VADDSBS VX4(768)
143
+ if (h.aa.atom < MO_128) {
72
#define VADDUBS VX4(512)
144
+ if (use_movbe) {
73
@@ -XXX,XX +XXX,XX @@ static int tcg_target_const_match(tcg_target_long val, TCGType type,
145
+ TCGReg t = datalo;
74
146
+ datalo = datahi;
75
#define VSLDOI VX4(44)
147
+ datahi = t;
76
148
+ }
77
+#define XXPERMDI (OPCD(60) | (10 << 3) | 7) /* v2.06, force ax=bx=tx=1 */
149
+ if (h.base == datalo || h.index == datalo) {
78
+#define XXSEL (OPCD(60) | (3 << 4) | 0xf) /* v2.06, force ax=bx=cx=tx=1 */
150
+ tcg_out_modrm_sib_offset(s, OPC_LEA + P_REXW, datahi,
79
+
151
+ h.base, h.index, 0, h.ofs);
80
#define RT(r) ((r)<<21)
152
+ tcg_out_modrm_offset(s, movop + P_REXW + h.seg,
81
#define RS(r) ((r)<<21)
153
+ datalo, datahi, 0);
82
#define RA(r) ((r)<<16)
154
+ tcg_out_modrm_offset(s, movop + P_REXW + h.seg,
83
@@ -XXX,XX +XXX,XX @@ static void tcg_out_dupi_vec(TCGContext *s, TCGType type, TCGReg ret,
155
+ datahi, datahi, 8);
84
add = 0;
156
+ } else {
85
}
157
+ tcg_out_modrm_sib_offset(s, movop + P_REXW + h.seg, datalo,
86
158
+ h.base, h.index, 0, h.ofs);
87
- load_insn = LVX | VRT(ret) | RB(TCG_REG_TMP1);
159
+ tcg_out_modrm_sib_offset(s, movop + P_REXW + h.seg, datahi,
88
- if (TCG_TARGET_REG_BITS == 64) {
160
+ h.base, h.index, 0, h.ofs + 8);
89
- new_pool_l2(s, rel, s->code_ptr, add, val, val);
161
+ }
90
+ if (have_vsx) {
91
+ load_insn = type == TCG_TYPE_V64 ? LXSDX : LXVDSX;
92
+ load_insn |= VRT(ret) | RB(TCG_REG_TMP1);
93
+ if (TCG_TARGET_REG_BITS == 64) {
94
+ new_pool_label(s, val, rel, s->code_ptr, add);
95
+ } else {
96
+ new_pool_l2(s, rel, s->code_ptr, add, val, val);
97
+ }
98
} else {
99
- new_pool_l4(s, rel, s->code_ptr, add, val, val, val, val);
100
+ load_insn = LVX | VRT(ret) | RB(TCG_REG_TMP1);
101
+ if (TCG_TARGET_REG_BITS == 64) {
102
+ new_pool_l2(s, rel, s->code_ptr, add, val, val);
103
+ } else {
104
+ new_pool_l4(s, rel, s->code_ptr, add, val, val, val, val);
105
+ }
106
}
107
108
if (USE_REG_TB) {
109
@@ -XXX,XX +XXX,XX @@ static void tcg_out_ld(TCGContext *s, TCGType type, TCGReg ret,
110
/* fallthru */
111
case TCG_TYPE_V64:
112
tcg_debug_assert(ret >= TCG_REG_V0);
113
+ if (have_vsx) {
114
+ tcg_out_mem_long(s, 0, LXSDX, ret, base, offset);
115
+ break;
162
+ break;
116
+ }
163
+ }
117
tcg_debug_assert((offset & 7) == 0);
164
+
118
tcg_out_mem_long(s, 0, LVX, ret, base, offset & -16);
165
+ /*
119
if (offset & 8) {
166
+ * With 16-byte atomicity, a vector load is required.
120
@@ -XXX,XX +XXX,XX @@ static void tcg_out_st(TCGContext *s, TCGType type, TCGReg arg,
167
+ * If we already have 16-byte alignment, then VMOVDQA always works.
121
/* fallthru */
168
+ * Else if VMOVDQU has atomicity with dynamic alignment, use that.
122
case TCG_TYPE_V64:
169
+ * Else use we require a runtime test for alignment for VMOVDQA;
123
tcg_debug_assert(arg >= TCG_REG_V0);
170
+ * use VMOVDQU on the unaligned nonatomic path for simplicity.
124
+ if (have_vsx) {
171
+ */
125
+ tcg_out_mem_long(s, 0, STXSDX, arg, base, offset);
172
+ if (h.aa.align >= MO_128) {
173
+ tcg_out_vex_modrm_sib_offset(s, OPC_MOVDQA_VxWx + h.seg,
174
+ TCG_TMP_VEC, 0,
175
+ h.base, h.index, 0, h.ofs);
176
+ } else if (cpuinfo & CPUINFO_ATOMIC_VMOVDQU) {
177
+ tcg_out_vex_modrm_sib_offset(s, OPC_MOVDQU_VxWx + h.seg,
178
+ TCG_TMP_VEC, 0,
179
+ h.base, h.index, 0, h.ofs);
180
+ } else {
181
+ TCGLabel *l1 = gen_new_label();
182
+ TCGLabel *l2 = gen_new_label();
183
+
184
+ tcg_out_testi(s, h.base, 15);
185
+ tcg_out_jxx(s, JCC_JNE, l1, true);
186
+
187
+ tcg_out_vex_modrm_sib_offset(s, OPC_MOVDQA_VxWx + h.seg,
188
+ TCG_TMP_VEC, 0,
189
+ h.base, h.index, 0, h.ofs);
190
+ tcg_out_jxx(s, JCC_JMP, l2, true);
191
+
192
+ tcg_out_label(s, l1);
193
+ tcg_out_vex_modrm_sib_offset(s, OPC_MOVDQU_VxWx + h.seg,
194
+ TCG_TMP_VEC, 0,
195
+ h.base, h.index, 0, h.ofs);
196
+ tcg_out_label(s, l2);
197
+ }
198
+ tcg_out_vec_to_pair(s, TCG_TYPE_I64, datalo, datahi, TCG_TMP_VEC);
199
+ break;
200
+
201
default:
202
g_assert_not_reached();
203
}
204
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_st_direct(TCGContext *s, TCGReg datalo, TCGReg datahi,
205
h.base, h.index, 0, h.ofs + 4);
206
}
207
break;
208
+
209
+ case MO_128:
210
+ tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
211
+
212
+ /*
213
+ * Without 16-byte atomicity, use integer regs.
214
+ * That is where we have the data, and it allows bswaps.
215
+ */
216
+ if (h.aa.atom < MO_128) {
217
+ if (use_movbe) {
218
+ TCGReg t = datalo;
219
+ datalo = datahi;
220
+ datahi = t;
221
+ }
222
+ tcg_out_modrm_sib_offset(s, movop + P_REXW + h.seg, datalo,
223
+ h.base, h.index, 0, h.ofs);
224
+ tcg_out_modrm_sib_offset(s, movop + P_REXW + h.seg, datahi,
225
+ h.base, h.index, 0, h.ofs + 8);
126
+ break;
226
+ break;
127
+ }
227
+ }
128
tcg_debug_assert((offset & 7) == 0);
228
+
129
if (offset & 8) {
229
+ /*
130
tcg_out_vsldoi(s, TCG_VEC_TMP1, arg, arg, 8);
230
+ * With 16-byte atomicity, a vector store is required.
131
@@ -XXX,XX +XXX,XX @@ int tcg_can_emit_vec_op(TCGOpcode opc, TCGType type, unsigned vece)
231
+ * If we already have 16-byte alignment, then VMOVDQA always works.
132
case INDEX_op_shri_vec:
232
+ * Else if VMOVDQU has atomicity with dynamic alignment, use that.
133
case INDEX_op_sari_vec:
233
+ * Else use we require a runtime test for alignment for VMOVDQA;
134
return vece <= MO_32 ? -1 : 0;
234
+ * use VMOVDQU on the unaligned nonatomic path for simplicity.
135
+ case INDEX_op_bitsel_vec:
235
+ */
136
+ return have_vsx;
236
+ tcg_out_pair_to_vec(s, TCG_TYPE_I64, TCG_TMP_VEC, datalo, datahi);
237
+ if (h.aa.align >= MO_128) {
238
+ tcg_out_vex_modrm_sib_offset(s, OPC_MOVDQA_WxVx + h.seg,
239
+ TCG_TMP_VEC, 0,
240
+ h.base, h.index, 0, h.ofs);
241
+ } else if (cpuinfo & CPUINFO_ATOMIC_VMOVDQU) {
242
+ tcg_out_vex_modrm_sib_offset(s, OPC_MOVDQU_WxVx + h.seg,
243
+ TCG_TMP_VEC, 0,
244
+ h.base, h.index, 0, h.ofs);
245
+ } else {
246
+ TCGLabel *l1 = gen_new_label();
247
+ TCGLabel *l2 = gen_new_label();
248
+
249
+ tcg_out_testi(s, h.base, 15);
250
+ tcg_out_jxx(s, JCC_JNE, l1, true);
251
+
252
+ tcg_out_vex_modrm_sib_offset(s, OPC_MOVDQA_WxVx + h.seg,
253
+ TCG_TMP_VEC, 0,
254
+ h.base, h.index, 0, h.ofs);
255
+ tcg_out_jxx(s, JCC_JMP, l2, true);
256
+
257
+ tcg_out_label(s, l1);
258
+ tcg_out_vex_modrm_sib_offset(s, OPC_MOVDQU_WxVx + h.seg,
259
+ TCG_TMP_VEC, 0,
260
+ h.base, h.index, 0, h.ofs);
261
+ tcg_out_label(s, l2);
262
+ }
263
+ break;
264
+
137
default:
265
default:
138
return 0;
266
g_assert_not_reached();
139
}
267
}
140
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_dup_vec(TCGContext *s, TCGType type, unsigned vece,
268
@@ -XXX,XX +XXX,XX @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
141
tcg_out32(s, VSPLTW | VRT(dst) | VRB(src) | (1 << 16));
269
tcg_out_qemu_ld(s, a0, a1, a2, args[3], args[4], TCG_TYPE_I64);
142
break;
143
case MO_64:
144
+ if (have_vsx) {
145
+ tcg_out32(s, XXPERMDI | VRT(dst) | VRA(src) | VRB(src));
146
+ break;
147
+ }
148
tcg_out_vsldoi(s, TCG_VEC_TMP1, src, src, 8);
149
tcg_out_vsldoi(s, dst, TCG_VEC_TMP1, src, 8);
150
break;
151
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_dupm_vec(TCGContext *s, TCGType type, unsigned vece,
152
tcg_out32(s, VSPLTW | VRT(out) | VRB(out) | (elt << 16));
153
break;
154
case MO_64:
155
+ if (have_vsx) {
156
+ tcg_out_mem_long(s, 0, LXVDSX, out, base, offset);
157
+ break;
158
+ }
159
tcg_debug_assert((offset & 7) == 0);
160
tcg_out_mem_long(s, 0, LVX, out, base, offset & -16);
161
tcg_out_vsldoi(s, TCG_VEC_TMP1, out, out, 8);
162
@@ -XXX,XX +XXX,XX @@ static void tcg_out_vec_op(TCGContext *s, TCGOpcode opc,
163
}
270
}
164
break;
271
break;
165
272
+ case INDEX_op_qemu_ld_a32_i128:
166
+ case INDEX_op_bitsel_vec:
273
+ case INDEX_op_qemu_ld_a64_i128:
167
+ tcg_out32(s, XXSEL | VRT(a0) | VRC(a1) | VRB(a2) | VRA(args[3]));
274
+ tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
168
+ return;
275
+ tcg_out_qemu_ld(s, a0, a1, a2, -1, args[3], TCG_TYPE_I128);
169
+
276
+ break;
170
case INDEX_op_dup2_vec:
277
171
assert(TCG_TARGET_REG_BITS == 32);
278
case INDEX_op_qemu_st_a64_i32:
172
/* With inputs a1 = xLxx, a2 = xHxx */
279
case INDEX_op_qemu_st8_a64_i32:
173
@@ -XXX,XX +XXX,XX @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
280
@@ -XXX,XX +XXX,XX @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
174
case INDEX_op_st_vec:
281
tcg_out_qemu_st(s, a0, a1, a2, args[3], args[4], TCG_TYPE_I64);
175
case INDEX_op_dupm_vec:
282
}
176
return &v_r;
283
break;
177
+ case INDEX_op_bitsel_vec:
284
+ case INDEX_op_qemu_st_a32_i128:
178
case INDEX_op_ppc_msum_vec:
285
+ case INDEX_op_qemu_st_a64_i128:
179
return &v_v_v_v;
286
+ tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
287
+ tcg_out_qemu_st(s, a0, a1, a2, -1, args[3], TCG_TYPE_I128);
288
+ break;
289
290
OP_32_64(mulu2):
291
tcg_out_modrm(s, OPC_GRP3_Ev + rexw, EXT3_MUL, args[3]);
292
@@ -XXX,XX +XXX,XX @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
293
case INDEX_op_qemu_st_a64_i64:
294
return TCG_TARGET_REG_BITS == 64 ? C_O0_I2(L, L) : C_O0_I4(L, L, L, L);
295
296
+ case INDEX_op_qemu_ld_a32_i128:
297
+ case INDEX_op_qemu_ld_a64_i128:
298
+ tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
299
+ return C_O2_I1(r, r, L);
300
+ case INDEX_op_qemu_st_a32_i128:
301
+ case INDEX_op_qemu_st_a64_i128:
302
+ tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
303
+ return C_O0_I3(L, L, L);
304
+
305
case INDEX_op_brcond2_i32:
306
return C_O0_I4(r, r, ri, ri);
180
307
181
@@ -XXX,XX +XXX,XX @@ static void tcg_target_init(TCGContext *s)
308
@@ -XXX,XX +XXX,XX @@ static void tcg_target_init(TCGContext *s)
182
309
183
if (hwcap & PPC_FEATURE_HAS_ALTIVEC) {
310
s->reserved_regs = 0;
184
have_altivec = true;
311
tcg_regset_set_reg(s->reserved_regs, TCG_REG_CALL_STACK);
185
+ /* We only care about the portion of VSX that overlaps Altivec. */
312
+ tcg_regset_set_reg(s->reserved_regs, TCG_TMP_VEC);
186
+ if (hwcap & PPC_FEATURE_HAS_VSX) {
313
#ifdef _WIN64
187
+ have_vsx = true;
314
/* These are call saved, and we don't save them, so don't use them. */
188
+ }
315
tcg_regset_set_reg(s->reserved_regs, TCG_REG_XMM6);
189
}
190
191
tcg_target_available_regs[TCG_TYPE_I32] = 0xffffffff;
192
--
316
--
193
2.17.1
317
2.34.1
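
To make the load/store dispatch in the i386 hunks above concrete, here is a hedged C-level illustration using SSE intrinsics instead of the VMOVDQA/VMOVDQU instructions the backend actually emits: when 16-byte alignment is not known statically and unaligned vector accesses are not known to be atomic, the generated code tests the address at run time and takes the aligned (potentially atomic) or unaligned (non-atomic) path.

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stdint.h>

    /* Illustrative stand-in for the runtime-alignment test emitted by
     * tcg_out_qemu_ld_direct/tcg_out_qemu_st_direct for MO_128 accesses. */
    static inline __m128i load16_dispatch(const void *addr)
    {
        if (((uintptr_t)addr & 15) == 0) {
            /* Aligned: MOVDQA/VMOVDQA, the path that can be atomic. */
            return _mm_load_si128((const __m128i *)addr);
        }
        /* Unaligned: MOVDQU/VMOVDQU, correct but with no atomicity claim. */
        return _mm_loadu_si128((const __m128i *)addr);
    }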
194
195
We will need to allocate a second general-purpose temporary.
Rename the existing temps to add a distinguishing number.

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/aarch64/tcg-target.c.inc | 50 ++++++++++++++++++------------------
 1 file changed, 25 insertions(+), 25 deletions(-)

Add various bits and pieces related mostly to load and store
operations. In that context, logic, compare, and splat Altivec
instructions are used, and, therefore, the support for emitting
them is included in this patch too.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Aleksandar Markovic <amarkovic@wavecomp.com>
---
 tcg/ppc/tcg-target.h | 6 +-
 tcg/ppc/tcg-target.inc.c | 472 ++++++++++++++++++++++++++++++++++++---
 2 files changed, 442 insertions(+), 36 deletions(-)

diff --git a/tcg/ppc/tcg-target.h b/tcg/ppc/tcg-target.h
10
diff --git a/tcg/aarch64/tcg-target.c.inc b/tcg/aarch64/tcg-target.c.inc
14
index XXXXXXX..XXXXXXX 100644
11
index XXXXXXX..XXXXXXX 100644
15
--- a/tcg/ppc/tcg-target.h
12
--- a/tcg/aarch64/tcg-target.c.inc
16
+++ b/tcg/ppc/tcg-target.h
13
+++ b/tcg/aarch64/tcg-target.c.inc
17
@@ -XXX,XX +XXX,XX @@ extern bool have_altivec;
14
@@ -XXX,XX +XXX,XX @@ static TCGReg tcg_target_call_oarg_reg(TCGCallReturnKind kind, int slot)
18
#define TCG_TARGET_HAS_v128 have_altivec
15
return TCG_REG_X0 + slot;
19
#define TCG_TARGET_HAS_v256 0
16
}
20
17
21
-#define TCG_TARGET_HAS_andc_vec 0
18
-#define TCG_REG_TMP TCG_REG_X30
22
+#define TCG_TARGET_HAS_andc_vec 1
19
-#define TCG_VEC_TMP TCG_REG_V31
23
#define TCG_TARGET_HAS_orc_vec 0
20
+#define TCG_REG_TMP0 TCG_REG_X30
24
-#define TCG_TARGET_HAS_not_vec 0
21
+#define TCG_VEC_TMP0 TCG_REG_V31
25
+#define TCG_TARGET_HAS_not_vec 1
22
26
#define TCG_TARGET_HAS_neg_vec 0
23
#ifndef CONFIG_SOFTMMU
27
#define TCG_TARGET_HAS_abs_vec 0
24
#define TCG_REG_GUEST_BASE TCG_REG_X28
28
#define TCG_TARGET_HAS_shi_vec 0
25
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_dup_vec(TCGContext *s, TCGType type, unsigned vece,
29
#define TCG_TARGET_HAS_shs_vec 0
26
static bool tcg_out_dupm_vec(TCGContext *s, TCGType type, unsigned vece,
30
#define TCG_TARGET_HAS_shv_vec 0
27
TCGReg r, TCGReg base, intptr_t offset)
31
-#define TCG_TARGET_HAS_cmp_vec 0
32
+#define TCG_TARGET_HAS_cmp_vec 1
33
#define TCG_TARGET_HAS_mul_vec 0
34
#define TCG_TARGET_HAS_sat_vec 0
35
#define TCG_TARGET_HAS_minmax_vec 0
36
diff --git a/tcg/ppc/tcg-target.inc.c b/tcg/ppc/tcg-target.inc.c
37
index XXXXXXX..XXXXXXX 100644
38
--- a/tcg/ppc/tcg-target.inc.c
39
+++ b/tcg/ppc/tcg-target.inc.c
40
@@ -XXX,XX +XXX,XX @@ static const char *target_parse_constraint(TCGArgConstraint *ct,
41
ct->ct |= TCG_CT_REG;
42
ct->u.regs = 0xffffffff;
43
break;
44
+ case 'v':
45
+ ct->ct |= TCG_CT_REG;
46
+ ct->u.regs = 0xffffffff00000000ull;
47
+ break;
48
case 'L': /* qemu_ld constraint */
49
ct->ct |= TCG_CT_REG;
50
ct->u.regs = 0xffffffff;
51
@@ -XXX,XX +XXX,XX @@ static int tcg_target_const_match(tcg_target_long val, TCGType type,
52
53
#define NOP ORI /* ori 0,0,0 */
54
55
+#define LVX XO31(103)
56
+#define LVEBX XO31(7)
57
+#define LVEHX XO31(39)
58
+#define LVEWX XO31(71)
59
+
60
+#define STVX XO31(231)
61
+#define STVEWX XO31(199)
62
+
63
+#define VCMPEQUB VX4(6)
64
+#define VCMPEQUH VX4(70)
65
+#define VCMPEQUW VX4(134)
66
+#define VCMPGTSB VX4(774)
67
+#define VCMPGTSH VX4(838)
68
+#define VCMPGTSW VX4(902)
69
+#define VCMPGTUB VX4(518)
70
+#define VCMPGTUH VX4(582)
71
+#define VCMPGTUW VX4(646)
72
+
73
+#define VAND VX4(1028)
74
+#define VANDC VX4(1092)
75
+#define VNOR VX4(1284)
76
+#define VOR VX4(1156)
77
+#define VXOR VX4(1220)
78
+
79
+#define VSPLTB VX4(524)
80
+#define VSPLTH VX4(588)
81
+#define VSPLTW VX4(652)
82
+#define VSPLTISB VX4(780)
83
+#define VSPLTISH VX4(844)
84
+#define VSPLTISW VX4(908)
85
+
86
+#define VSLDOI VX4(44)
87
+
88
#define RT(r) ((r)<<21)
89
#define RS(r) ((r)<<21)
90
#define RA(r) ((r)<<16)
91
@@ -XXX,XX +XXX,XX @@ static bool patch_reloc(tcg_insn_unit *code_ptr, int type,
92
intptr_t value, intptr_t addend)
93
{
28
{
94
tcg_insn_unit *target;
29
- TCGReg temp = TCG_REG_TMP;
95
+ int16_t lo;
30
+ TCGReg temp = TCG_REG_TMP0;
96
+ int32_t hi;
31
97
32
if (offset < -0xffffff || offset > 0xffffff) {
98
value += addend;
33
tcg_out_movi(s, TCG_TYPE_PTR, temp, offset);
99
target = (tcg_insn_unit *)value;
34
@@ -XXX,XX +XXX,XX @@ static void tcg_out_ldst(TCGContext *s, AArch64Insn insn, TCGReg rd,
100
@@ -XXX,XX +XXX,XX @@ static bool patch_reloc(tcg_insn_unit *code_ptr, int type,
101
}
102
*code_ptr = (*code_ptr & ~0xfffc) | (value & 0xfffc);
103
break;
104
+ case R_PPC_ADDR32:
105
+ /*
106
+ * We are abusing this relocation type. Again, this points to
107
+ * a pair of insns, lis + load. This is an absolute address
108
+ * relocation for PPC32 so the lis cannot be removed.
109
+ */
110
+ lo = value;
111
+ hi = value - lo;
112
+ if (hi + lo != value) {
113
+ return false;
114
+ }
115
+ code_ptr[0] = deposit32(code_ptr[0], 0, 16, hi >> 16);
116
+ code_ptr[1] = deposit32(code_ptr[1], 0, 16, lo);
117
+ break;
118
default:
119
g_assert_not_reached();
120
}
35
}
121
@@ -XXX,XX +XXX,XX @@ static void tcg_out_mem_long(TCGContext *s, int opi, int opx, TCGReg rt,
36
37
/* Worst-case scenario, move offset to temp register, use reg offset. */
38
- tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_TMP, offset);
39
- tcg_out_ldst_r(s, insn, rd, rn, TCG_TYPE_I64, TCG_REG_TMP);
40
+ tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_TMP0, offset);
41
+ tcg_out_ldst_r(s, insn, rd, rn, TCG_TYPE_I64, TCG_REG_TMP0);
42
}
122
43
123
static bool tcg_out_mov(TCGContext *s, TCGType type, TCGReg ret, TCGReg arg)
44
static bool tcg_out_mov(TCGContext *s, TCGType type, TCGReg ret, TCGReg arg)
124
{
45
@@ -XXX,XX +XXX,XX @@ static void tcg_out_call_int(TCGContext *s, const tcg_insn_unit *target)
125
- tcg_debug_assert(TCG_TARGET_REG_BITS == 64 || type == TCG_TYPE_I32);
46
if (offset == sextract64(offset, 0, 26)) {
126
- if (ret != arg) {
47
tcg_out_insn(s, 3206, BL, offset);
127
- tcg_out32(s, OR | SAB(arg, ret, arg));
48
} else {
128
+ if (ret == arg) {
49
- tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_TMP, (intptr_t)target);
129
+ return true;
50
- tcg_out_insn(s, 3207, BLR, TCG_REG_TMP);
130
+ }
51
+ tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_TMP0, (intptr_t)target);
131
+ switch (type) {
52
+ tcg_out_insn(s, 3207, BLR, TCG_REG_TMP0);
132
+ case TCG_TYPE_I64:
133
+ tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
134
+ /* fallthru */
135
+ case TCG_TYPE_I32:
136
+ if (ret < TCG_REG_V0 && arg < TCG_REG_V0) {
137
+ tcg_out32(s, OR | SAB(arg, ret, arg));
138
+ break;
139
+ } else if (ret < TCG_REG_V0 || arg < TCG_REG_V0) {
140
+ /* Altivec does not support vector/integer moves. */
141
+ return false;
142
+ }
143
+ /* fallthru */
144
+ case TCG_TYPE_V64:
145
+ case TCG_TYPE_V128:
146
+ tcg_debug_assert(ret >= TCG_REG_V0 && arg >= TCG_REG_V0);
147
+ tcg_out32(s, VOR | VRT(ret) | VRA(arg) | VRB(arg));
148
+ break;
149
+ default:
150
+ g_assert_not_reached();
151
}
152
return true;
153
}
154
@@ -XXX,XX +XXX,XX @@ static void tcg_out_movi_int(TCGContext *s, TCGType type, TCGReg ret,
155
static void tcg_out_dupi_vec(TCGContext *s, TCGType type, TCGReg ret,
156
tcg_target_long val)
157
{
158
- g_assert_not_reached();
159
+ uint32_t load_insn;
160
+ int rel, low;
161
+ intptr_t add;
162
+
163
+ low = (int8_t)val;
164
+ if (low >= -16 && low < 16) {
165
+ if (val == (tcg_target_long)dup_const(MO_8, low)) {
166
+ tcg_out32(s, VSPLTISB | VRT(ret) | ((val & 31) << 16));
167
+ return;
168
+ }
169
+ if (val == (tcg_target_long)dup_const(MO_16, low)) {
170
+ tcg_out32(s, VSPLTISH | VRT(ret) | ((val & 31) << 16));
171
+ return;
172
+ }
173
+ if (val == (tcg_target_long)dup_const(MO_32, low)) {
174
+ tcg_out32(s, VSPLTISW | VRT(ret) | ((val & 31) << 16));
175
+ return;
176
+ }
177
+ }
178
+
179
+ /*
180
+ * Otherwise we must load the value from the constant pool.
181
+ */
182
+ if (USE_REG_TB) {
183
+ rel = R_PPC_ADDR16;
184
+ add = -(intptr_t)s->code_gen_ptr;
185
+ } else {
186
+ rel = R_PPC_ADDR32;
187
+ add = 0;
188
+ }
189
+
190
+ load_insn = LVX | VRT(ret) | RB(TCG_REG_TMP1);
191
+ if (TCG_TARGET_REG_BITS == 64) {
192
+ new_pool_l2(s, rel, s->code_ptr, add, val, val);
193
+ } else {
194
+ new_pool_l4(s, rel, s->code_ptr, add, val, val, val, val);
195
+ }
196
+
197
+ if (USE_REG_TB) {
198
+ tcg_out32(s, ADDI | TAI(TCG_REG_TMP1, 0, 0));
199
+ load_insn |= RA(TCG_REG_TB);
200
+ } else {
201
+ tcg_out32(s, ADDIS | TAI(TCG_REG_TMP1, 0, 0));
202
+ tcg_out32(s, ADDI | TAI(TCG_REG_TMP1, TCG_REG_TMP1, 0));
203
+ }
204
+ tcg_out32(s, load_insn);
205
}
206
207
static void tcg_out_movi(TCGContext *s, TCGType type, TCGReg ret,
208
@@ -XXX,XX +XXX,XX @@ static void tcg_out_mem_long(TCGContext *s, int opi, int opx, TCGReg rt,
209
align = 3;
210
/* FALLTHRU */
211
default:
212
- if (rt != TCG_REG_R0) {
213
+ if (rt > TCG_REG_R0 && rt < TCG_REG_V0) {
214
rs = rt;
215
break;
216
}
217
@@ -XXX,XX +XXX,XX @@ static void tcg_out_mem_long(TCGContext *s, int opi, int opx, TCGReg rt,
218
}
219
220
/* For unaligned, or very large offsets, use the indexed form. */
221
- if (offset & align || offset != (int32_t)offset) {
222
+ if (offset & align || offset != (int32_t)offset || opi == 0) {
223
if (rs == base) {
224
rs = TCG_REG_R0;
225
}
226
tcg_debug_assert(!is_store || rs != rt);
227
tcg_out_movi(s, TCG_TYPE_PTR, rs, orig);
228
- tcg_out32(s, opx | TAB(rt, base, rs));
229
+ tcg_out32(s, opx | TAB(rt & 31, base, rs));
230
return;
231
}
232
233
@@ -XXX,XX +XXX,XX @@ static void tcg_out_mem_long(TCGContext *s, int opi, int opx, TCGReg rt,
234
base = rs;
235
}
236
if (opi != ADDI || base != rt || l0 != 0) {
237
- tcg_out32(s, opi | TAI(rt, base, l0));
238
+ tcg_out32(s, opi | TAI(rt & 31, base, l0));
239
}
53
}
240
}
54
}
241
55
242
-static inline void tcg_out_ld(TCGContext *s, TCGType type, TCGReg ret,
56
@@ -XXX,XX +XXX,XX @@ static void tcg_out_addsub2(TCGContext *s, TCGType ext, TCGReg rl,
243
- TCGReg arg1, intptr_t arg2)
57
AArch64Insn insn;
244
+static void tcg_out_vsldoi(TCGContext *s, TCGReg ret,
58
245
+ TCGReg va, TCGReg vb, int shb)
59
if (rl == ah || (!const_bh && rl == bh)) {
60
- rl = TCG_REG_TMP;
61
+ rl = TCG_REG_TMP0;
62
}
63
64
if (const_bl) {
65
@@ -XXX,XX +XXX,XX @@ static void tcg_out_addsub2(TCGContext *s, TCGType ext, TCGReg rl,
66
possibility of adding 0+const in the low part, and the
67
immediate add instructions encode XSP not XZR. Don't try
68
anything more elaborate here than loading another zero. */
69
- al = TCG_REG_TMP;
70
+ al = TCG_REG_TMP0;
71
tcg_out_movi(s, ext, al, 0);
72
}
73
tcg_out_insn_3401(s, insn, ext, rl, al, bl);
74
@@ -XXX,XX +XXX,XX @@ static void tcg_out_cltz(TCGContext *s, TCGType ext, TCGReg d,
246
{
75
{
247
- int opi, opx;
76
TCGReg a1 = a0;
248
-
77
if (is_ctz) {
249
- tcg_debug_assert(TCG_TARGET_REG_BITS == 64 || type == TCG_TYPE_I32);
78
- a1 = TCG_REG_TMP;
250
- if (type == TCG_TYPE_I32) {
79
+ a1 = TCG_REG_TMP0;
251
- opi = LWZ, opx = LWZX;
80
tcg_out_insn(s, 3507, RBIT, ext, a1, a0);
252
- } else {
81
}
253
- opi = LD, opx = LDX;
82
if (const_b && b == (ext ? 64 : 32)) {
254
- }
83
@@ -XXX,XX +XXX,XX @@ static void tcg_out_cltz(TCGContext *s, TCGType ext, TCGReg d,
255
- tcg_out_mem_long(s, opi, opx, ret, arg1, arg2);
84
AArch64Insn sel = I3506_CSEL;
256
+ tcg_out32(s, VSLDOI | VRT(ret) | VRA(va) | VRB(vb) | (shb << 6));
85
86
tcg_out_cmp(s, ext, a0, 0, 1);
87
- tcg_out_insn(s, 3507, CLZ, ext, TCG_REG_TMP, a1);
88
+ tcg_out_insn(s, 3507, CLZ, ext, TCG_REG_TMP0, a1);
89
90
if (const_b) {
91
if (b == -1) {
92
@@ -XXX,XX +XXX,XX @@ static void tcg_out_cltz(TCGContext *s, TCGType ext, TCGReg d,
93
b = d;
94
}
95
}
96
- tcg_out_insn_3506(s, sel, ext, d, TCG_REG_TMP, b, TCG_COND_NE);
97
+ tcg_out_insn_3506(s, sel, ext, d, TCG_REG_TMP0, b, TCG_COND_NE);
98
}
257
}
99
}
258
100
259
-static inline void tcg_out_st(TCGContext *s, TCGType type, TCGReg arg,
101
@@ -XXX,XX +XXX,XX @@ bool tcg_target_has_memory_bswap(MemOp memop)
260
- TCGReg arg1, intptr_t arg2)
102
}
261
+static void tcg_out_ld(TCGContext *s, TCGType type, TCGReg ret,
103
262
+ TCGReg base, intptr_t offset)
104
static const TCGLdstHelperParam ldst_helper_param = {
263
{
105
- .ntmp = 1, .tmp = { TCG_REG_TMP }
264
- int opi, opx;
106
+ .ntmp = 1, .tmp = { TCG_REG_TMP0 }
265
+ int shift;
107
};
266
108
267
- tcg_debug_assert(TCG_TARGET_REG_BITS == 64 || type == TCG_TYPE_I32);
109
static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
268
- if (type == TCG_TYPE_I32) {
110
@@ -XXX,XX +XXX,XX @@ static void tcg_out_goto_tb(TCGContext *s, int which)
269
- opi = STW, opx = STWX;
111
270
- } else {
112
set_jmp_insn_offset(s, which);
271
- opi = STD, opx = STDX;
113
tcg_out32(s, I3206_B);
272
+ switch (type) {
114
- tcg_out_insn(s, 3207, BR, TCG_REG_TMP);
273
+ case TCG_TYPE_I32:
115
+ tcg_out_insn(s, 3207, BR, TCG_REG_TMP0);
274
+ if (ret < TCG_REG_V0) {
116
set_jmp_reset_offset(s, which);
275
+ tcg_out_mem_long(s, LWZ, LWZX, ret, base, offset);
117
}
276
+ break;
118
277
+ }
119
@@ -XXX,XX +XXX,XX @@ void tb_target_set_jmp_target(const TranslationBlock *tb, int n,
278
+ tcg_debug_assert((offset & 3) == 0);
120
ptrdiff_t i_offset = i_addr - jmp_rx;
279
+ tcg_out_mem_long(s, 0, LVEWX, ret, base, offset);
121
280
+ shift = (offset - 4) & 0xc;
122
/* Note that we asserted this in range in tcg_out_goto_tb. */
281
+ if (shift) {
123
- insn = deposit32(I3305_LDR | TCG_REG_TMP, 5, 19, i_offset >> 2);
282
+ tcg_out_vsldoi(s, ret, ret, ret, shift);
124
+ insn = deposit32(I3305_LDR | TCG_REG_TMP0, 5, 19, i_offset >> 2);
283
+ }
284
+ break;
285
+ case TCG_TYPE_I64:
286
+ if (ret < TCG_REG_V0) {
287
+ tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
288
+ tcg_out_mem_long(s, LD, LDX, ret, base, offset);
289
+ break;
290
+ }
291
+ /* fallthru */
292
+ case TCG_TYPE_V64:
293
+ tcg_debug_assert(ret >= TCG_REG_V0);
294
+ tcg_debug_assert((offset & 7) == 0);
295
+ tcg_out_mem_long(s, 0, LVX, ret, base, offset & -16);
296
+ if (offset & 8) {
297
+ tcg_out_vsldoi(s, ret, ret, ret, 8);
298
+ }
299
+ break;
300
+ case TCG_TYPE_V128:
301
+ tcg_debug_assert(ret >= TCG_REG_V0);
302
+ tcg_debug_assert((offset & 15) == 0);
303
+ tcg_out_mem_long(s, 0, LVX, ret, base, offset);
304
+ break;
305
+ default:
306
+ g_assert_not_reached();
307
+ }
308
+}
309
+
310
+static void tcg_out_st(TCGContext *s, TCGType type, TCGReg arg,
311
+ TCGReg base, intptr_t offset)
312
+{
313
+ int shift;
314
+
315
+ switch (type) {
316
+ case TCG_TYPE_I32:
317
+ if (arg < TCG_REG_V0) {
318
+ tcg_out_mem_long(s, STW, STWX, arg, base, offset);
319
+ break;
320
+ }
321
+ tcg_debug_assert((offset & 3) == 0);
322
+ shift = (offset - 4) & 0xc;
323
+ if (shift) {
324
+ tcg_out_vsldoi(s, TCG_VEC_TMP1, arg, arg, shift);
325
+ arg = TCG_VEC_TMP1;
326
+ }
327
+ tcg_out_mem_long(s, 0, STVEWX, arg, base, offset);
328
+ break;
329
+ case TCG_TYPE_I64:
330
+ if (arg < TCG_REG_V0) {
331
+ tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
332
+ tcg_out_mem_long(s, STD, STDX, arg, base, offset);
333
+ break;
334
+ }
335
+ /* fallthru */
336
+ case TCG_TYPE_V64:
337
+ tcg_debug_assert(arg >= TCG_REG_V0);
338
+ tcg_debug_assert((offset & 7) == 0);
339
+ if (offset & 8) {
340
+ tcg_out_vsldoi(s, TCG_VEC_TMP1, arg, arg, 8);
341
+ arg = TCG_VEC_TMP1;
342
+ }
343
+ tcg_out_mem_long(s, 0, STVEWX, arg, base, offset);
344
+ tcg_out_mem_long(s, 0, STVEWX, arg, base, offset + 4);
345
+ break;
346
+ case TCG_TYPE_V128:
347
+ tcg_debug_assert(arg >= TCG_REG_V0);
348
+ tcg_out_mem_long(s, 0, STVX, arg, base, offset);
349
+ break;
350
+ default:
351
+ g_assert_not_reached();
352
}
125
}
353
- tcg_out_mem_long(s, opi, opx, arg, arg1, arg2);
126
qatomic_set((uint32_t *)jmp_rw, insn);
127
flush_idcache_range(jmp_rx, jmp_rw, 4);
128
@@ -XXX,XX +XXX,XX @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
129
130
case INDEX_op_rem_i64:
131
case INDEX_op_rem_i32:
132
- tcg_out_insn(s, 3508, SDIV, ext, TCG_REG_TMP, a1, a2);
133
- tcg_out_insn(s, 3509, MSUB, ext, a0, TCG_REG_TMP, a2, a1);
134
+ tcg_out_insn(s, 3508, SDIV, ext, TCG_REG_TMP0, a1, a2);
135
+ tcg_out_insn(s, 3509, MSUB, ext, a0, TCG_REG_TMP0, a2, a1);
136
break;
137
case INDEX_op_remu_i64:
138
case INDEX_op_remu_i32:
139
- tcg_out_insn(s, 3508, UDIV, ext, TCG_REG_TMP, a1, a2);
140
- tcg_out_insn(s, 3509, MSUB, ext, a0, TCG_REG_TMP, a2, a1);
141
+ tcg_out_insn(s, 3508, UDIV, ext, TCG_REG_TMP0, a1, a2);
142
+ tcg_out_insn(s, 3509, MSUB, ext, a0, TCG_REG_TMP0, a2, a1);
143
break;
144
145
case INDEX_op_shl_i64:
146
@@ -XXX,XX +XXX,XX @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
147
if (c2) {
148
tcg_out_rotl(s, ext, a0, a1, a2);
149
} else {
150
- tcg_out_insn(s, 3502, SUB, 0, TCG_REG_TMP, TCG_REG_XZR, a2);
151
- tcg_out_insn(s, 3508, RORV, ext, a0, a1, TCG_REG_TMP);
152
+ tcg_out_insn(s, 3502, SUB, 0, TCG_REG_TMP0, TCG_REG_XZR, a2);
153
+ tcg_out_insn(s, 3508, RORV, ext, a0, a1, TCG_REG_TMP0);
154
}
155
break;
156
157
@@ -XXX,XX +XXX,XX @@ static void tcg_out_vec_op(TCGContext *s, TCGOpcode opc,
158
break;
159
}
160
}
161
- tcg_out_dupi_vec(s, type, MO_8, TCG_VEC_TMP, 0);
162
- a2 = TCG_VEC_TMP;
163
+ tcg_out_dupi_vec(s, type, MO_8, TCG_VEC_TMP0, 0);
164
+ a2 = TCG_VEC_TMP0;
165
}
166
if (is_scalar) {
167
insn = cmp_scalar_insn[cond];
168
@@ -XXX,XX +XXX,XX @@ static void tcg_target_init(TCGContext *s)
169
s->reserved_regs = 0;
170
tcg_regset_set_reg(s->reserved_regs, TCG_REG_SP);
171
tcg_regset_set_reg(s->reserved_regs, TCG_REG_FP);
172
- tcg_regset_set_reg(s->reserved_regs, TCG_REG_TMP);
173
tcg_regset_set_reg(s->reserved_regs, TCG_REG_X18); /* platform register */
174
- tcg_regset_set_reg(s->reserved_regs, TCG_VEC_TMP);
175
+ tcg_regset_set_reg(s->reserved_regs, TCG_REG_TMP0);
176
+ tcg_regset_set_reg(s->reserved_regs, TCG_VEC_TMP0);
354
}
177
}
355
178
356
static inline bool tcg_out_sti(TCGContext *s, TCGType type, TCGArg val,
179
/* Saving pairs: (X19, X20) .. (X27, X28), (X29(fp), X30(lr)). */
357
@@ -XXX,XX +XXX,XX @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc, const TCGArg *args,
358
359
int tcg_can_emit_vec_op(TCGOpcode opc, TCGType type, unsigned vece)
360
{
361
- g_assert_not_reached();
362
+ switch (opc) {
363
+ case INDEX_op_and_vec:
364
+ case INDEX_op_or_vec:
365
+ case INDEX_op_xor_vec:
366
+ case INDEX_op_andc_vec:
367
+ case INDEX_op_not_vec:
368
+ return 1;
369
+ case INDEX_op_cmp_vec:
370
+ return vece <= MO_32 ? -1 : 0;
371
+ default:
372
+ return 0;
373
+ }
374
}
375
376
static bool tcg_out_dup_vec(TCGContext *s, TCGType type, unsigned vece,
377
TCGReg dst, TCGReg src)
378
{
379
- g_assert_not_reached();
380
+ tcg_debug_assert(dst >= TCG_REG_V0);
381
+ tcg_debug_assert(src >= TCG_REG_V0);
382
+
383
+ /*
384
+ * Recall we use (or emulate) VSX integer loads, so the integer is
385
+ * right justified within the left (zero-index) double-word.
386
+ */
387
+ switch (vece) {
388
+ case MO_8:
389
+ tcg_out32(s, VSPLTB | VRT(dst) | VRB(src) | (7 << 16));
390
+ break;
391
+ case MO_16:
392
+ tcg_out32(s, VSPLTH | VRT(dst) | VRB(src) | (3 << 16));
393
+ break;
394
+ case MO_32:
395
+ tcg_out32(s, VSPLTW | VRT(dst) | VRB(src) | (1 << 16));
396
+ break;
397
+ case MO_64:
398
+ tcg_out_vsldoi(s, TCG_VEC_TMP1, src, src, 8);
399
+ tcg_out_vsldoi(s, dst, TCG_VEC_TMP1, src, 8);
400
+ break;
401
+ default:
402
+ g_assert_not_reached();
403
+ }
404
+ return true;
405
}
406
407
static bool tcg_out_dupm_vec(TCGContext *s, TCGType type, unsigned vece,
408
TCGReg out, TCGReg base, intptr_t offset)
409
{
410
- g_assert_not_reached();
411
+ int elt;
412
+
413
+ tcg_debug_assert(out >= TCG_REG_V0);
414
+ switch (vece) {
415
+ case MO_8:
416
+ tcg_out_mem_long(s, 0, LVEBX, out, base, offset);
417
+ elt = extract32(offset, 0, 4);
418
+#ifndef HOST_WORDS_BIGENDIAN
419
+ elt ^= 15;
420
+#endif
421
+ tcg_out32(s, VSPLTB | VRT(out) | VRB(out) | (elt << 16));
422
+ break;
423
+ case MO_16:
424
+ tcg_debug_assert((offset & 1) == 0);
425
+ tcg_out_mem_long(s, 0, LVEHX, out, base, offset);
426
+ elt = extract32(offset, 1, 3);
427
+#ifndef HOST_WORDS_BIGENDIAN
428
+ elt ^= 7;
429
+#endif
430
+ tcg_out32(s, VSPLTH | VRT(out) | VRB(out) | (elt << 16));
431
+ break;
432
+ case MO_32:
433
+ tcg_debug_assert((offset & 3) == 0);
434
+ tcg_out_mem_long(s, 0, LVEWX, out, base, offset);
435
+ elt = extract32(offset, 2, 2);
436
+#ifndef HOST_WORDS_BIGENDIAN
437
+ elt ^= 3;
438
+#endif
439
+ tcg_out32(s, VSPLTW | VRT(out) | VRB(out) | (elt << 16));
440
+ break;
441
+ case MO_64:
442
+ tcg_debug_assert((offset & 7) == 0);
443
+ tcg_out_mem_long(s, 0, LVX, out, base, offset & -16);
444
+ tcg_out_vsldoi(s, TCG_VEC_TMP1, out, out, 8);
445
+ elt = extract32(offset, 3, 1);
446
+#ifndef HOST_WORDS_BIGENDIAN
447
+ elt = !elt;
448
+#endif
449
+ if (elt) {
450
+ tcg_out_vsldoi(s, out, out, TCG_VEC_TMP1, 8);
451
+ } else {
452
+ tcg_out_vsldoi(s, out, TCG_VEC_TMP1, out, 8);
453
+ }
454
+ break;
455
+ default:
456
+ g_assert_not_reached();
457
+ }
458
+ return true;
459
}
460
461
static void tcg_out_vec_op(TCGContext *s, TCGOpcode opc,
462
unsigned vecl, unsigned vece,
463
const TCGArg *args, const int *const_args)
464
{
465
- g_assert_not_reached();
466
+ static const uint32_t
467
+ eq_op[4] = { VCMPEQUB, VCMPEQUH, VCMPEQUW, 0 },
468
+ gts_op[4] = { VCMPGTSB, VCMPGTSH, VCMPGTSW, 0 },
469
+ gtu_op[4] = { VCMPGTUB, VCMPGTUH, VCMPGTUW, 0 };
470
+
471
+ TCGType type = vecl + TCG_TYPE_V64;
472
+ TCGArg a0 = args[0], a1 = args[1], a2 = args[2];
473
+ uint32_t insn;
474
+
475
+ switch (opc) {
476
+ case INDEX_op_ld_vec:
477
+ tcg_out_ld(s, type, a0, a1, a2);
478
+ return;
479
+ case INDEX_op_st_vec:
480
+ tcg_out_st(s, type, a0, a1, a2);
481
+ return;
482
+ case INDEX_op_dupm_vec:
483
+ tcg_out_dupm_vec(s, type, vece, a0, a1, a2);
484
+ return;
485
+
486
+ case INDEX_op_and_vec:
487
+ insn = VAND;
488
+ break;
489
+ case INDEX_op_or_vec:
490
+ insn = VOR;
491
+ break;
492
+ case INDEX_op_xor_vec:
493
+ insn = VXOR;
494
+ break;
495
+ case INDEX_op_andc_vec:
496
+ insn = VANDC;
497
+ break;
498
+ case INDEX_op_not_vec:
499
+ insn = VNOR;
500
+ a2 = a1;
501
+ break;
502
+
503
+ case INDEX_op_cmp_vec:
504
+ switch (args[3]) {
505
+ case TCG_COND_EQ:
506
+ insn = eq_op[vece];
507
+ break;
508
+ case TCG_COND_GT:
509
+ insn = gts_op[vece];
510
+ break;
511
+ case TCG_COND_GTU:
512
+ insn = gtu_op[vece];
513
+ break;
514
+ default:
515
+ g_assert_not_reached();
516
+ }
517
+ break;
518
+
519
+ case INDEX_op_mov_vec: /* Always emitted via tcg_out_mov. */
520
+ case INDEX_op_dupi_vec: /* Always emitted via tcg_out_movi. */
521
+ case INDEX_op_dup_vec: /* Always emitted via tcg_out_dup_vec. */
522
+ default:
523
+ g_assert_not_reached();
524
+ }
525
+
526
+ tcg_debug_assert(insn != 0);
527
+ tcg_out32(s, insn | VRT(a0) | VRA(a1) | VRB(a2));
528
+}
529
+
530
+static void expand_vec_cmp(TCGType type, unsigned vece, TCGv_vec v0,
531
+ TCGv_vec v1, TCGv_vec v2, TCGCond cond)
532
+{
533
+ bool need_swap = false, need_inv = false;
534
+
535
+ tcg_debug_assert(vece <= MO_32);
536
+
537
+ switch (cond) {
538
+ case TCG_COND_EQ:
539
+ case TCG_COND_GT:
540
+ case TCG_COND_GTU:
541
+ break;
542
+ case TCG_COND_NE:
543
+ case TCG_COND_LE:
544
+ case TCG_COND_LEU:
545
+ need_inv = true;
546
+ break;
547
+ case TCG_COND_LT:
548
+ case TCG_COND_LTU:
549
+ need_swap = true;
550
+ break;
551
+ case TCG_COND_GE:
552
+ case TCG_COND_GEU:
553
+ need_swap = need_inv = true;
554
+ break;
555
+ default:
556
+ g_assert_not_reached();
557
+ }
558
+
559
+ if (need_inv) {
560
+ cond = tcg_invert_cond(cond);
561
+ }
562
+ if (need_swap) {
563
+ TCGv_vec t1;
564
+ t1 = v1, v1 = v2, v2 = t1;
565
+ cond = tcg_swap_cond(cond);
566
+ }
567
+
568
+ vec_gen_4(INDEX_op_cmp_vec, type, vece, tcgv_vec_arg(v0),
569
+ tcgv_vec_arg(v1), tcgv_vec_arg(v2), cond);
570
+
571
+ if (need_inv) {
572
+ tcg_gen_not_vec(vece, v0, v0);
573
+ }
574
}
575
576
void tcg_expand_vec_op(TCGOpcode opc, TCGType type, unsigned vece,
577
TCGArg a0, ...)
578
{
579
- g_assert_not_reached();
580
+ va_list va;
581
+ TCGv_vec v0, v1, v2;
582
+
583
+ va_start(va, a0);
584
+ v0 = temp_tcgv_vec(arg_temp(a0));
585
+ v1 = temp_tcgv_vec(arg_temp(va_arg(va, TCGArg)));
586
+ v2 = temp_tcgv_vec(arg_temp(va_arg(va, TCGArg)));
587
+
588
+ switch (opc) {
589
+ case INDEX_op_cmp_vec:
590
+ expand_vec_cmp(type, vece, v0, v1, v2, va_arg(va, TCGArg));
591
+ break;
592
+ default:
593
+ g_assert_not_reached();
594
+ }
595
+ va_end(va);
596
}
597
598
static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
599
@@ -XXX,XX +XXX,XX @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
600
= { .args_ct_str = { "r", "r", "r", "r", "rI", "rZM" } };
601
static const TCGTargetOpDef sub2
602
= { .args_ct_str = { "r", "r", "rI", "rZM", "r", "r" } };
603
+ static const TCGTargetOpDef v_r = { .args_ct_str = { "v", "r" } };
604
+ static const TCGTargetOpDef v_v = { .args_ct_str = { "v", "v" } };
605
+ static const TCGTargetOpDef v_v_v = { .args_ct_str = { "v", "v", "v" } };
606
607
switch (op) {
608
case INDEX_op_goto_ptr:
609
@@ -XXX,XX +XXX,XX @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
610
return (TCG_TARGET_REG_BITS == 64 ? &S_S
611
: TARGET_LONG_BITS == 32 ? &S_S_S : &S_S_S_S);
612
613
+ case INDEX_op_and_vec:
614
+ case INDEX_op_or_vec:
615
+ case INDEX_op_xor_vec:
616
+ case INDEX_op_andc_vec:
617
+ case INDEX_op_orc_vec:
618
+ case INDEX_op_cmp_vec:
619
+ return &v_v_v;
620
+ case INDEX_op_not_vec:
621
+ case INDEX_op_dup_vec:
622
+ return &v_v;
623
+ case INDEX_op_ld_vec:
624
+ case INDEX_op_st_vec:
625
+ case INDEX_op_dupm_vec:
626
+ return &v_r;
627
+
628
default:
629
return NULL;
630
}
631
--
180
--
632
2.17.1
181
2.34.1
633
634
1
Previously we've been hard-coding knowledge that Power7 has ISEL, but
1
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
2
it was an optional instruction before that. Use the AT_HWCAP2 bit,
3
when present, to properly determine support.
4
5
Reviewed-by: Aleksandar Markovic <amarkovic@wavecomp.com>
6
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
7
---
3
---
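A rough sketch of the AT_HWCAP2 detection described in the ppc commit message
above (illustration only, not part of the patch). It assumes a Linux host
where glibc's <sys/auxv.h> provides getauxval()/AT_HWCAP2 and, on powerpc,
defines the PPC_FEATURE2_HAS_ISEL bit; the backend itself caches the result
in 'have_isel' during tcg_target_init() rather than querying repeatedly.

    #include <stdbool.h>
    #include <sys/auxv.h>   /* getauxval(), AT_HWCAP2 */

    /* Hypothetical standalone helper mirroring the patch's logic. */
    static bool detect_isel(bool have_isa_2_06)
    {
    #ifdef PPC_FEATURE2_HAS_ISEL
        /* Prefer the explicit bit reported by the kernel. */
        return (getauxval(AT_HWCAP2) & PPC_FEATURE2_HAS_ISEL) != 0;
    #else
        /* Fall back to knowing that Power7 (ISA 2.06) has ISEL. */
        return have_isa_2_06;
    #endif
    }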
8
tcg/ppc/tcg-target.inc.c | 17 ++++++++++++-----
4
tcg/aarch64/tcg-target.c.inc | 9 +++++++--
9
1 file changed, 12 insertions(+), 5 deletions(-)
5
1 file changed, 7 insertions(+), 2 deletions(-)
10
6
11
diff --git a/tcg/ppc/tcg-target.inc.c b/tcg/ppc/tcg-target.inc.c
7
diff --git a/tcg/aarch64/tcg-target.c.inc b/tcg/aarch64/tcg-target.c.inc
12
index XXXXXXX..XXXXXXX 100644
8
index XXXXXXX..XXXXXXX 100644
13
--- a/tcg/ppc/tcg-target.inc.c
9
--- a/tcg/aarch64/tcg-target.c.inc
14
+++ b/tcg/ppc/tcg-target.inc.c
10
+++ b/tcg/aarch64/tcg-target.c.inc
15
@@ -XXX,XX +XXX,XX @@
11
@@ -XXX,XX +XXX,XX @@ static const int tcg_target_reg_alloc_order[] = {
16
static tcg_insn_unit *tb_ret_addr;
12
17
13
TCG_REG_X8, TCG_REG_X9, TCG_REG_X10, TCG_REG_X11,
18
TCGPowerISA have_isa;
14
TCG_REG_X12, TCG_REG_X13, TCG_REG_X14, TCG_REG_X15,
19
-
15
- TCG_REG_X16, TCG_REG_X17,
20
-#define HAVE_ISEL have_isa_2_06
16
21
+static bool have_isel;
17
TCG_REG_X0, TCG_REG_X1, TCG_REG_X2, TCG_REG_X3,
18
TCG_REG_X4, TCG_REG_X5, TCG_REG_X6, TCG_REG_X7,
19
20
+ /* X16 reserved as temporary */
21
+ /* X17 reserved as temporary */
22
/* X18 reserved by system */
23
/* X19 reserved for AREG0 */
24
/* X29 reserved as fp */
25
@@ -XXX,XX +XXX,XX @@ static TCGReg tcg_target_call_oarg_reg(TCGCallReturnKind kind, int slot)
26
return TCG_REG_X0 + slot;
27
}
28
29
-#define TCG_REG_TMP0 TCG_REG_X30
30
+#define TCG_REG_TMP0 TCG_REG_X16
31
+#define TCG_REG_TMP1 TCG_REG_X17
32
+#define TCG_REG_TMP2 TCG_REG_X30
33
#define TCG_VEC_TMP0 TCG_REG_V31
22
34
23
#ifndef CONFIG_SOFTMMU
35
#ifndef CONFIG_SOFTMMU
24
#define TCG_GUEST_BASE_REG 30
25
@@ -XXX,XX +XXX,XX @@ static void tcg_out_setcond(TCGContext *s, TCGType type, TCGCond cond,
26
/* If we have ISEL, we can implement everything with 3 or 4 insns.
27
All other cases below are also at least 3 insns, so speed up the
28
code generator by not considering them and always using ISEL. */
29
- if (HAVE_ISEL) {
30
+ if (have_isel) {
31
int isel, tab;
32
33
tcg_out_cmp(s, cond, arg1, arg2, const_arg2, 7, type);
34
@@ -XXX,XX +XXX,XX @@ static void tcg_out_movcond(TCGContext *s, TCGType type, TCGCond cond,
35
36
tcg_out_cmp(s, cond, c1, c2, const_c2, 7, type);
37
38
- if (HAVE_ISEL) {
39
+ if (have_isel) {
40
int isel = tcg_to_isel[cond];
41
42
/* Swap the V operands if the operation indicates inversion. */
43
@@ -XXX,XX +XXX,XX @@ static void tcg_out_cntxz(TCGContext *s, TCGType type, uint32_t opc,
44
} else {
45
tcg_out_cmp(s, TCG_COND_EQ, a1, 0, 1, 7, type);
46
/* Note that the only other valid constant for a2 is 0. */
47
- if (HAVE_ISEL) {
48
+ if (have_isel) {
49
tcg_out32(s, opc | RA(TCG_REG_R0) | RS(a1));
50
tcg_out32(s, tcg_to_isel[TCG_COND_EQ] | TAB(a0, a2, TCG_REG_R0));
51
} else if (!const_a2 && a0 == a2) {
52
@@ -XXX,XX +XXX,XX @@ static void tcg_target_init(TCGContext *s)
36
@@ -XXX,XX +XXX,XX @@ static void tcg_target_init(TCGContext *s)
53
}
37
tcg_regset_set_reg(s->reserved_regs, TCG_REG_FP);
54
#endif
38
tcg_regset_set_reg(s->reserved_regs, TCG_REG_X18); /* platform register */
55
39
tcg_regset_set_reg(s->reserved_regs, TCG_REG_TMP0);
56
+#ifdef PPC_FEATURE2_HAS_ISEL
40
+ tcg_regset_set_reg(s->reserved_regs, TCG_REG_TMP1);
57
+ /* Prefer explicit instruction from the kernel. */
41
+ tcg_regset_set_reg(s->reserved_regs, TCG_REG_TMP2);
58
+ have_isel = (hwcap2 & PPC_FEATURE2_HAS_ISEL) != 0;
42
tcg_regset_set_reg(s->reserved_regs, TCG_VEC_TMP0);
59
+#else
43
}
60
+ /* Fall back to knowing Power7 (2.06) has ISEL. */
61
+ have_isel = have_isa_2_06;
62
+#endif
63
+
64
tcg_target_available_regs[TCG_TYPE_I32] = 0xffffffff;
65
tcg_target_available_regs[TCG_TYPE_I64] = 0xffffffff;
66
44
67
--
45
--
68
2.17.1
46
2.34.1
69
70
1
These new instructions are conditional only on MSR.VEC and
1
Adjust the softmmu tlb to use TMP[0-2], not any of the normally available
2
are thus part of the Altivec instruction set, and not VSX.
2
registers. Since we handle overlap between inputs and helper arguments,
3
This includes negation and compare not equal.
3
we can allow any allocatable reg.
4
4
5
Reviewed-by: Aleksandar Markovic <amarkovic@wavecomp.com>
5
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
6
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
6
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
7
---
7
---
8
tcg/ppc/tcg-target.h | 2 +-
8
tcg/aarch64/tcg-target-con-set.h | 2 --
9
tcg/ppc/tcg-target.inc.c | 23 +++++++++++++++++++++++
9
tcg/aarch64/tcg-target-con-str.h | 1 -
10
2 files changed, 24 insertions(+), 1 deletion(-)
10
tcg/aarch64/tcg-target.c.inc | 45 ++++++++++++++------------------
11
3 files changed, 19 insertions(+), 29 deletions(-)
11
12
12
diff --git a/tcg/ppc/tcg-target.h b/tcg/ppc/tcg-target.h
13
diff --git a/tcg/aarch64/tcg-target-con-set.h b/tcg/aarch64/tcg-target-con-set.h
13
index XXXXXXX..XXXXXXX 100644
14
index XXXXXXX..XXXXXXX 100644
14
--- a/tcg/ppc/tcg-target.h
15
--- a/tcg/aarch64/tcg-target-con-set.h
15
+++ b/tcg/ppc/tcg-target.h
16
+++ b/tcg/aarch64/tcg-target-con-set.h
16
@@ -XXX,XX +XXX,XX @@ extern bool have_vsx;
17
@@ -XXX,XX +XXX,XX @@
17
#define TCG_TARGET_HAS_andc_vec 1
18
* tcg-target-con-str.h; the constraint combination is inclusive or.
18
#define TCG_TARGET_HAS_orc_vec have_isa_2_07
19
*/
19
#define TCG_TARGET_HAS_not_vec 1
20
C_O0_I1(r)
20
-#define TCG_TARGET_HAS_neg_vec 0
21
-C_O0_I2(lZ, l)
21
+#define TCG_TARGET_HAS_neg_vec have_isa_3_00
22
C_O0_I2(r, rA)
22
#define TCG_TARGET_HAS_abs_vec 0
23
C_O0_I2(rZ, r)
23
#define TCG_TARGET_HAS_shi_vec 0
24
C_O0_I2(w, r)
24
#define TCG_TARGET_HAS_shs_vec 0
25
-C_O1_I1(r, l)
25
diff --git a/tcg/ppc/tcg-target.inc.c b/tcg/ppc/tcg-target.inc.c
26
C_O1_I1(r, r)
27
C_O1_I1(w, r)
28
C_O1_I1(w, w)
29
diff --git a/tcg/aarch64/tcg-target-con-str.h b/tcg/aarch64/tcg-target-con-str.h
26
index XXXXXXX..XXXXXXX 100644
30
index XXXXXXX..XXXXXXX 100644
27
--- a/tcg/ppc/tcg-target.inc.c
31
--- a/tcg/aarch64/tcg-target-con-str.h
28
+++ b/tcg/ppc/tcg-target.inc.c
32
+++ b/tcg/aarch64/tcg-target-con-str.h
29
@@ -XXX,XX +XXX,XX @@ static int tcg_target_const_match(tcg_target_long val, TCGType type,
33
@@ -XXX,XX +XXX,XX @@
30
#define VSUBUWM VX4(1152)
34
* REGS(letter, register_mask)
31
#define VSUBUDM VX4(1216) /* v2.07 */
35
*/
32
36
REGS('r', ALL_GENERAL_REGS)
33
+#define VNEGW (VX4(1538) | (6 << 16)) /* v3.00 */
37
-REGS('l', ALL_QLDST_REGS)
34
+#define VNEGD (VX4(1538) | (7 << 16)) /* v3.00 */
38
REGS('w', ALL_VECTOR_REGS)
35
+
39
36
#define VMAXSB VX4(258)
40
/*
37
#define VMAXSH VX4(322)
41
diff --git a/tcg/aarch64/tcg-target.c.inc b/tcg/aarch64/tcg-target.c.inc
38
#define VMAXSW VX4(386)
42
index XXXXXXX..XXXXXXX 100644
39
@@ -XXX,XX +XXX,XX @@ static int tcg_target_const_match(tcg_target_long val, TCGType type,
43
--- a/tcg/aarch64/tcg-target.c.inc
40
#define VCMPGTUH VX4(582)
44
+++ b/tcg/aarch64/tcg-target.c.inc
41
#define VCMPGTUW VX4(646)
45
@@ -XXX,XX +XXX,XX @@ static bool patch_reloc(tcg_insn_unit *code_ptr, int type,
42
#define VCMPGTUD VX4(711) /* v2.07 */
46
#define ALL_GENERAL_REGS 0xffffffffu
43
+#define VCMPNEB VX4(7) /* v3.00 */
47
#define ALL_VECTOR_REGS 0xffffffff00000000ull
44
+#define VCMPNEH VX4(71) /* v3.00 */
48
45
+#define VCMPNEW VX4(135) /* v3.00 */
49
-#ifdef CONFIG_SOFTMMU
46
50
-#define ALL_QLDST_REGS \
47
#define VSLB VX4(260)
51
- (ALL_GENERAL_REGS & ~((1 << TCG_REG_X0) | (1 << TCG_REG_X1) | \
48
#define VSLH VX4(324)
52
- (1 << TCG_REG_X2) | (1 << TCG_REG_X3)))
49
@@ -XXX,XX +XXX,XX @@ int tcg_can_emit_vec_op(TCGOpcode opc, TCGType type, unsigned vece)
53
-#else
50
case INDEX_op_shri_vec:
54
-#define ALL_QLDST_REGS ALL_GENERAL_REGS
51
case INDEX_op_sari_vec:
55
-#endif
52
return vece <= MO_32 || have_isa_2_07 ? -1 : 0;
56
-
53
+ case INDEX_op_neg_vec:
57
/* Match a constant valid for addition (12-bit, optionally shifted). */
54
+ return vece >= MO_32 && have_isa_3_00;
58
static inline bool is_aimm(uint64_t val)
55
case INDEX_op_mul_vec:
59
{
56
switch (vece) {
60
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
57
case MO_8:
61
unsigned s_bits = opc & MO_SIZE;
58
@@ -XXX,XX +XXX,XX @@ static void tcg_out_vec_op(TCGContext *s, TCGOpcode opc,
62
unsigned s_mask = (1u << s_bits) - 1;
59
static const uint32_t
63
unsigned mem_index = get_mmuidx(oi);
60
add_op[4] = { VADDUBM, VADDUHM, VADDUWM, VADDUDM },
64
- TCGReg x3;
61
sub_op[4] = { VSUBUBM, VSUBUHM, VSUBUWM, VSUBUDM },
65
+ TCGReg addr_adj;
62
+ neg_op[4] = { 0, 0, VNEGW, VNEGD },
66
TCGType mask_type;
63
eq_op[4] = { VCMPEQUB, VCMPEQUH, VCMPEQUW, VCMPEQUD },
67
uint64_t compare_mask;
64
+ ne_op[4] = { VCMPNEB, VCMPNEH, VCMPNEW, 0 },
68
65
gts_op[4] = { VCMPGTSB, VCMPGTSH, VCMPGTSW, VCMPGTSD },
69
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
66
gtu_op[4] = { VCMPGTUB, VCMPGTUH, VCMPGTUW, VCMPGTUD },
70
mask_type = (s->page_bits + s->tlb_dyn_max_bits > 32
67
ssadd_op[4] = { VADDSBS, VADDSHS, VADDSWS, 0 },
71
? TCG_TYPE_I64 : TCG_TYPE_I32);
68
@@ -XXX,XX +XXX,XX @@ static void tcg_out_vec_op(TCGContext *s, TCGOpcode opc,
72
69
case INDEX_op_sub_vec:
73
- /* Load env_tlb(env)->f[mmu_idx].{mask,table} into {x0,x1}. */
70
insn = sub_op[vece];
74
+ /* Load env_tlb(env)->f[mmu_idx].{mask,table} into {tmp0,tmp1}. */
71
break;
75
QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) > 0);
72
+ case INDEX_op_neg_vec:
76
QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) < -512);
73
+ insn = neg_op[vece];
77
QEMU_BUILD_BUG_ON(offsetof(CPUTLBDescFast, mask) != 0);
74
+ a2 = a1;
78
QEMU_BUILD_BUG_ON(offsetof(CPUTLBDescFast, table) != 8);
75
+ a1 = 0;
79
- tcg_out_insn(s, 3314, LDP, TCG_REG_X0, TCG_REG_X1, TCG_AREG0,
76
+ break;
80
+ tcg_out_insn(s, 3314, LDP, TCG_REG_TMP0, TCG_REG_TMP1, TCG_AREG0,
77
case INDEX_op_mul_vec:
81
TLB_MASK_TABLE_OFS(mem_index), 1, 0);
78
tcg_debug_assert(vece == MO_32 && have_isa_2_07);
82
79
insn = VMULUWM;
83
/* Extract the TLB index from the address into X0. */
80
@@ -XXX,XX +XXX,XX @@ static void tcg_out_vec_op(TCGContext *s, TCGOpcode opc,
84
tcg_out_insn(s, 3502S, AND_LSR, mask_type == TCG_TYPE_I64,
81
case TCG_COND_EQ:
85
- TCG_REG_X0, TCG_REG_X0, addr_reg,
82
insn = eq_op[vece];
86
+ TCG_REG_TMP0, TCG_REG_TMP0, addr_reg,
83
break;
87
s->page_bits - CPU_TLB_ENTRY_BITS);
84
+ case TCG_COND_NE:
88
85
+ insn = ne_op[vece];
89
- /* Add the tlb_table pointer, creating the CPUTLBEntry address into X1. */
86
+ break;
90
- tcg_out_insn(s, 3502, ADD, 1, TCG_REG_X1, TCG_REG_X1, TCG_REG_X0);
87
case TCG_COND_GT:
91
+ /* Add the tlb_table pointer, forming the CPUTLBEntry address in TMP1. */
88
insn = gts_op[vece];
92
+ tcg_out_insn(s, 3502, ADD, 1, TCG_REG_TMP1, TCG_REG_TMP1, TCG_REG_TMP0);
89
break;
93
90
@@ -XXX,XX +XXX,XX @@ static void expand_vec_cmp(TCGType type, unsigned vece, TCGv_vec v0,
94
- /* Load the tlb comparator into X0, and the fast path addend into X1. */
91
case TCG_COND_GTU:
95
- tcg_out_ld(s, addr_type, TCG_REG_X0, TCG_REG_X1,
92
break;
96
+ /* Load the tlb comparator into TMP0, and the fast path addend into TMP1. */
93
case TCG_COND_NE:
97
+ tcg_out_ld(s, addr_type, TCG_REG_TMP0, TCG_REG_TMP1,
94
+ if (have_isa_3_00 && vece <= MO_32) {
98
is_ld ? offsetof(CPUTLBEntry, addr_read)
95
+ break;
99
: offsetof(CPUTLBEntry, addr_write));
96
+ }
100
- tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_X1, TCG_REG_X1,
97
+ /* fall through */
101
+ tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TMP1, TCG_REG_TMP1,
98
case TCG_COND_LE:
102
offsetof(CPUTLBEntry, addend));
99
case TCG_COND_LEU:
103
100
need_inv = true;
104
/*
101
@@ -XXX,XX +XXX,XX @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
105
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
102
case INDEX_op_dup2_vec:
106
* cross pages using the address of the last byte of the access.
103
return &v_v_v;
107
*/
104
case INDEX_op_not_vec:
108
if (a_mask >= s_mask) {
105
+ case INDEX_op_neg_vec:
109
- x3 = addr_reg;
106
case INDEX_op_dup_vec:
110
+ addr_adj = addr_reg;
107
return &v_v;
111
} else {
108
case INDEX_op_ld_vec:
112
+ addr_adj = TCG_REG_TMP2;
113
tcg_out_insn(s, 3401, ADDI, addr_type,
114
- TCG_REG_X3, addr_reg, s_mask - a_mask);
115
- x3 = TCG_REG_X3;
116
+ addr_adj, addr_reg, s_mask - a_mask);
117
}
118
compare_mask = (uint64_t)s->page_mask | a_mask;
119
120
- /* Store the page mask part of the address into X3. */
121
- tcg_out_logicali(s, I3404_ANDI, addr_type, TCG_REG_X3, x3, compare_mask);
122
+ /* Store the page mask part of the address into TMP2. */
123
+ tcg_out_logicali(s, I3404_ANDI, addr_type, TCG_REG_TMP2,
124
+ addr_adj, compare_mask);
125
126
/* Perform the address comparison. */
127
- tcg_out_cmp(s, addr_type, TCG_REG_X0, TCG_REG_X3, 0);
128
+ tcg_out_cmp(s, addr_type, TCG_REG_TMP0, TCG_REG_TMP2, 0);
129
130
/* If not equal, we jump to the slow path. */
131
ldst->label_ptr[0] = s->code_ptr;
132
tcg_out_insn(s, 3202, B_C, TCG_COND_NE, 0);
133
134
- h->base = TCG_REG_X1,
135
+ h->base = TCG_REG_TMP1;
136
h->index = addr_reg;
137
h->index_ext = addr_type;
138
#else
139
@@ -XXX,XX +XXX,XX @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
140
case INDEX_op_qemu_ld_a64_i32:
141
case INDEX_op_qemu_ld_a32_i64:
142
case INDEX_op_qemu_ld_a64_i64:
143
- return C_O1_I1(r, l);
144
+ return C_O1_I1(r, r);
145
case INDEX_op_qemu_st_a32_i32:
146
case INDEX_op_qemu_st_a64_i32:
147
case INDEX_op_qemu_st_a32_i64:
148
case INDEX_op_qemu_st_a64_i64:
149
- return C_O0_I2(lZ, l);
150
+ return C_O0_I2(rZ, r);
151
152
case INDEX_op_deposit_i32:
153
case INDEX_op_deposit_i64:
109
--
154
--
110
2.17.1
155
2.34.1
111
112
1
For Altivec, this is always an expansion.
1
With FEAT_LSE2, LDP/STP suffices. Without FEAT_LSE2, use LDXP+STXP if
2
16-byte atomicity is required and LDP/STP otherwise.
2
3
4
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
3
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
5
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
4
Signed-off-by: Aleksandar Markovic <amarkovic@wavecomp.com>
5
---
6
---
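For reference only (not part of the patch): without FEAT_LSE2, the 16-byte
atomic load described in the aarch64 commit message above boils down to the
ldxp/stxp/cbnz retry loop this patch emits. A minimal C sketch of the same
idea, assuming an aarch64 little-endian host and GCC/Clang extended asm:

    #include <stdint.h>

    /* Illustrative helper; the memory must be writable even for a load,
       because stxp writes the value back to complete the exclusive pair. */
    static inline __uint128_t atomic16_load_ldxp(__uint128_t *ptr)
    {
        uint64_t lo, hi;
        uint32_t fail;

        do {
            asm volatile("ldxp %0, %1, [%3]\n\t"
                         "stxp %w2, %0, %1, [%3]"
                         : "=&r"(lo), "=&r"(hi), "=&r"(fail)
                         : "r"(ptr)
                         : "memory");
        } while (fail);

        /* Little-endian host: the lower-addressed doubleword is the
           least-significant half. */
        return ((__uint128_t)hi << 64) | lo;
    }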
6
tcg/ppc/tcg-target.h | 2 +-
7
tcg/aarch64/tcg-target-con-set.h | 2 +
7
tcg/ppc/tcg-target.opc.h | 8 +++
8
tcg/aarch64/tcg-target.h | 11 ++-
8
tcg/ppc/tcg-target.inc.c | 113 ++++++++++++++++++++++++++++++++++++++-
9
tcg/aarch64/tcg-target.c.inc | 141 ++++++++++++++++++++++++++++++-
9
3 files changed, 121 insertions(+), 2 deletions(-)
10
3 files changed, 151 insertions(+), 3 deletions(-)
10
11
11
diff --git a/tcg/ppc/tcg-target.h b/tcg/ppc/tcg-target.h
12
diff --git a/tcg/aarch64/tcg-target-con-set.h b/tcg/aarch64/tcg-target-con-set.h
12
index XXXXXXX..XXXXXXX 100644
13
index XXXXXXX..XXXXXXX 100644
13
--- a/tcg/ppc/tcg-target.h
14
--- a/tcg/aarch64/tcg-target-con-set.h
14
+++ b/tcg/ppc/tcg-target.h
15
+++ b/tcg/aarch64/tcg-target-con-set.h
15
@@ -XXX,XX +XXX,XX @@ extern bool have_altivec;
16
@@ -XXX,XX +XXX,XX @@ C_O0_I1(r)
16
#define TCG_TARGET_HAS_shs_vec 0
17
C_O0_I2(r, rA)
17
#define TCG_TARGET_HAS_shv_vec 1
18
C_O0_I2(rZ, r)
18
#define TCG_TARGET_HAS_cmp_vec 1
19
C_O0_I2(w, r)
19
-#define TCG_TARGET_HAS_mul_vec 0
20
+C_O0_I3(rZ, rZ, r)
20
+#define TCG_TARGET_HAS_mul_vec 1
21
C_O1_I1(r, r)
21
#define TCG_TARGET_HAS_sat_vec 1
22
C_O1_I1(w, r)
22
#define TCG_TARGET_HAS_minmax_vec 1
23
C_O1_I1(w, w)
23
#define TCG_TARGET_HAS_bitsel_vec 0
24
@@ -XXX,XX +XXX,XX @@ C_O1_I2(w, w, wO)
24
diff --git a/tcg/ppc/tcg-target.opc.h b/tcg/ppc/tcg-target.opc.h
25
C_O1_I2(w, w, wZ)
26
C_O1_I3(w, w, w, w)
27
C_O1_I4(r, r, rA, rZ, rZ)
28
+C_O2_I1(r, r, r)
29
C_O2_I4(r, r, rZ, rZ, rA, rMZ)
30
diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h
25
index XXXXXXX..XXXXXXX 100644
31
index XXXXXXX..XXXXXXX 100644
26
--- a/tcg/ppc/tcg-target.opc.h
32
--- a/tcg/aarch64/tcg-target.h
27
+++ b/tcg/ppc/tcg-target.opc.h
33
+++ b/tcg/aarch64/tcg-target.h
28
@@ -XXX,XX +XXX,XX @@
34
@@ -XXX,XX +XXX,XX @@ typedef enum {
29
* emitted by tcg_expand_vec_op. For those familiar with GCC internals,
35
#define TCG_TARGET_HAS_muluh_i64 1
30
* consider these to be UNSPEC with names.
36
#define TCG_TARGET_HAS_mulsh_i64 1
31
*/
37
32
+
38
-#define TCG_TARGET_HAS_qemu_ldst_i128 0
33
+DEF(ppc_mrgh_vec, 1, 2, 0, IMPLVEC)
39
+/*
34
+DEF(ppc_mrgl_vec, 1, 2, 0, IMPLVEC)
40
+ * Without FEAT_LSE2, we must use LDXP+STXP to implement atomic 128-bit load,
35
+DEF(ppc_msum_vec, 1, 3, 0, IMPLVEC)
41
+ * which requires writable pages. We must defer to the helper for user-only,
36
+DEF(ppc_muleu_vec, 1, 2, 0, IMPLVEC)
42
+ * but in system mode all ram is writable for the host.
37
+DEF(ppc_mulou_vec, 1, 2, 0, IMPLVEC)
43
+ */
38
+DEF(ppc_pkum_vec, 1, 2, 0, IMPLVEC)
44
+#ifdef CONFIG_USER_ONLY
39
+DEF(ppc_rotl_vec, 1, 2, 0, IMPLVEC)
45
+#define TCG_TARGET_HAS_qemu_ldst_i128 have_lse2
40
diff --git a/tcg/ppc/tcg-target.inc.c b/tcg/ppc/tcg-target.inc.c
46
+#else
47
+#define TCG_TARGET_HAS_qemu_ldst_i128 1
48
+#endif
49
50
#define TCG_TARGET_HAS_v64 1
51
#define TCG_TARGET_HAS_v128 1
52
diff --git a/tcg/aarch64/tcg-target.c.inc b/tcg/aarch64/tcg-target.c.inc
41
index XXXXXXX..XXXXXXX 100644
53
index XXXXXXX..XXXXXXX 100644
42
--- a/tcg/ppc/tcg-target.inc.c
54
--- a/tcg/aarch64/tcg-target.c.inc
43
+++ b/tcg/ppc/tcg-target.inc.c
55
+++ b/tcg/aarch64/tcg-target.c.inc
44
@@ -XXX,XX +XXX,XX @@ static int tcg_target_const_match(tcg_target_long val, TCGType type,
56
@@ -XXX,XX +XXX,XX @@ typedef enum {
45
#define VSRAB VX4(772)
57
I3305_LDR_v64 = 0x5c000000,
46
#define VSRAH VX4(836)
58
I3305_LDR_v128 = 0x9c000000,
47
#define VSRAW VX4(900)
59
48
+#define VRLB VX4(4)
60
+ /* Load/store exclusive. */
49
+#define VRLH VX4(68)
61
+ I3306_LDXP = 0xc8600000,
50
+#define VRLW VX4(132)
62
+ I3306_STXP = 0xc8200000,
51
+
63
+
52
+#define VMULEUB VX4(520)
64
/* Load/store register. Described here as 3.3.12, but the helper
53
+#define VMULEUH VX4(584)
65
that emits them can transform to 3.3.10 or 3.3.13. */
54
+#define VMULOUB VX4(8)
66
I3312_STRB = 0x38000000 | LDST_ST << 22 | MO_8 << 30,
55
+#define VMULOUH VX4(72)
67
@@ -XXX,XX +XXX,XX @@ typedef enum {
56
+#define VMSUMUHM VX4(38)
68
I3406_ADR = 0x10000000,
57
+
69
I3406_ADRP = 0x90000000,
58
+#define VMRGHB VX4(12)
70
59
+#define VMRGHH VX4(76)
71
+ /* Add/subtract extended register instructions. */
60
+#define VMRGHW VX4(140)
72
+ I3501_ADD = 0x0b200000,
61
+#define VMRGLB VX4(268)
73
+
62
+#define VMRGLH VX4(332)
74
/* Add/subtract shifted register instructions (without a shift). */
63
+#define VMRGLW VX4(396)
75
I3502_ADD = 0x0b000000,
64
+
76
I3502_ADDS = 0x2b000000,
65
+#define VPKUHUM VX4(14)
77
@@ -XXX,XX +XXX,XX @@ static void tcg_out_insn_3305(TCGContext *s, AArch64Insn insn,
66
+#define VPKUWUM VX4(78)
78
tcg_out32(s, insn | (imm19 & 0x7ffff) << 5 | rt);
67
79
}
68
#define VAND VX4(1028)
80
69
#define VANDC VX4(1092)
81
+static void tcg_out_insn_3306(TCGContext *s, AArch64Insn insn, TCGReg rs,
70
@@ -XXX,XX +XXX,XX @@ int tcg_can_emit_vec_op(TCGOpcode opc, TCGType type, unsigned vece)
82
+ TCGReg rt, TCGReg rt2, TCGReg rn)
71
case INDEX_op_sarv_vec:
83
+{
72
return vece <= MO_32;
84
+ tcg_out32(s, insn | rs << 16 | rt2 << 10 | rn << 5 | rt);
73
case INDEX_op_cmp_vec:
85
+}
74
+ case INDEX_op_mul_vec:
86
+
75
case INDEX_op_shli_vec:
87
static void tcg_out_insn_3201(TCGContext *s, AArch64Insn insn, TCGType ext,
76
case INDEX_op_shri_vec:
88
TCGReg rt, int imm19)
77
case INDEX_op_sari_vec:
89
{
78
@@ -XXX,XX +XXX,XX @@ static void tcg_out_vec_op(TCGContext *s, TCGOpcode opc,
90
@@ -XXX,XX +XXX,XX @@ static void tcg_out_insn_3406(TCGContext *s, AArch64Insn insn,
79
smax_op[4] = { VMAXSB, VMAXSH, VMAXSW, 0 },
91
tcg_out32(s, insn | (disp & 3) << 29 | (disp & 0x1ffffc) << (5 - 2) | rd);
80
shlv_op[4] = { VSLB, VSLH, VSLW, 0 },
92
}
81
shrv_op[4] = { VSRB, VSRH, VSRW, 0 },
93
82
- sarv_op[4] = { VSRAB, VSRAH, VSRAW, 0 };
94
+static inline void tcg_out_insn_3501(TCGContext *s, AArch64Insn insn,
83
+ sarv_op[4] = { VSRAB, VSRAH, VSRAW, 0 },
95
+ TCGType sf, TCGReg rd, TCGReg rn,
84
+ mrgh_op[4] = { VMRGHB, VMRGHH, VMRGHW, 0 },
96
+ TCGReg rm, int opt, int imm3)
85
+ mrgl_op[4] = { VMRGLB, VMRGLH, VMRGLW, 0 },
97
+{
86
+ muleu_op[4] = { VMULEUB, VMULEUH, 0, 0 },
98
+ tcg_out32(s, insn | sf << 31 | rm << 16 | opt << 13 |
87
+ mulou_op[4] = { VMULOUB, VMULOUH, 0, 0 },
99
+ imm3 << 10 | rn << 5 | rd);
88
+ pkum_op[4] = { VPKUHUM, VPKUWUM, 0, 0 },
100
+}
89
+ rotl_op[4] = { VRLB, VRLH, VRLW, 0 };
101
+
90
102
/* This function is for both 3.5.2 (Add/Subtract shifted register), for
91
TCGType type = vecl + TCG_TYPE_V64;
103
the rare occasion when we actually want to supply a shift amount. */
92
TCGArg a0 = args[0], a1 = args[1], a2 = args[2];
104
static inline void tcg_out_insn_3502S(TCGContext *s, AArch64Insn insn,
93
@@ -XXX,XX +XXX,XX @@ static void tcg_out_vec_op(TCGContext *s, TCGOpcode opc,
105
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
94
}
106
TCGType addr_type = s->addr_type;
95
break;
107
TCGLabelQemuLdst *ldst = NULL;
96
108
MemOp opc = get_memop(oi);
97
+ case INDEX_op_ppc_mrgh_vec:
109
+ MemOp s_bits = opc & MO_SIZE;
98
+ insn = mrgh_op[vece];
110
unsigned a_mask;
99
+ break;
111
100
+ case INDEX_op_ppc_mrgl_vec:
112
h->aa = atom_and_align_for_opc(s, opc,
101
+ insn = mrgl_op[vece];
113
have_lse2 ? MO_ATOM_WITHIN16
102
+ break;
114
: MO_ATOM_IFALIGN,
103
+ case INDEX_op_ppc_muleu_vec:
115
- false);
104
+ insn = muleu_op[vece];
116
+ s_bits == MO_128);
105
+ break;
117
a_mask = (1 << h->aa.align) - 1;
106
+ case INDEX_op_ppc_mulou_vec:
118
107
+ insn = mulou_op[vece];
119
#ifdef CONFIG_SOFTMMU
108
+ break;
120
- unsigned s_bits = opc & MO_SIZE;
109
+ case INDEX_op_ppc_pkum_vec:
121
unsigned s_mask = (1u << s_bits) - 1;
110
+ insn = pkum_op[vece];
122
unsigned mem_index = get_mmuidx(oi);
111
+ break;
123
TCGReg addr_adj;
112
+ case INDEX_op_ppc_rotl_vec:
124
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_st(TCGContext *s, TCGReg data_reg, TCGReg addr_reg,
113
+ insn = rotl_op[vece];
114
+ break;
115
+ case INDEX_op_ppc_msum_vec:
116
+ tcg_debug_assert(vece == MO_16);
117
+ tcg_out32(s, VMSUMUHM | VRT(a0) | VRA(a1) | VRB(a2) | VRC(args[3]));
118
+ return;
119
+
120
case INDEX_op_mov_vec: /* Always emitted via tcg_out_mov. */
121
case INDEX_op_dupi_vec: /* Always emitted via tcg_out_movi. */
122
case INDEX_op_dup_vec: /* Always emitted via tcg_out_dup_vec. */
123
@@ -XXX,XX +XXX,XX @@ static void expand_vec_cmp(TCGType type, unsigned vece, TCGv_vec v0,
124
}
125
}
125
}
126
}
126
127
127
+static void expand_vec_mul(TCGType type, unsigned vece, TCGv_vec v0,
128
+static void tcg_out_qemu_ldst_i128(TCGContext *s, TCGReg datalo, TCGReg datahi,
128
+ TCGv_vec v1, TCGv_vec v2)
129
+ TCGReg addr_reg, MemOpIdx oi, bool is_ld)
129
+{
130
+{
130
+ TCGv_vec t1 = tcg_temp_new_vec(type);
131
+ TCGLabelQemuLdst *ldst;
131
+ TCGv_vec t2 = tcg_temp_new_vec(type);
132
+ HostAddress h;
132
+ TCGv_vec t3, t4;
133
+ TCGReg base;
133
+
134
+ bool use_pair;
134
+ switch (vece) {
135
+
135
+ case MO_8:
136
+ ldst = prepare_host_addr(s, &h, addr_reg, oi, is_ld);
136
+ case MO_16:
137
+
137
+ vec_gen_3(INDEX_op_ppc_muleu_vec, type, vece, tcgv_vec_arg(t1),
138
+ /* Compose the final address, as LDP/STP have no indexing. */
138
+ tcgv_vec_arg(v1), tcgv_vec_arg(v2));
139
+ if (h.index == TCG_REG_XZR) {
139
+ vec_gen_3(INDEX_op_ppc_mulou_vec, type, vece, tcgv_vec_arg(t2),
140
+ base = h.base;
140
+ tcgv_vec_arg(v1), tcgv_vec_arg(v2));
141
+ } else {
141
+ vec_gen_3(INDEX_op_ppc_mrgh_vec, type, vece + 1, tcgv_vec_arg(v0),
142
+ base = TCG_REG_TMP2;
142
+ tcgv_vec_arg(t1), tcgv_vec_arg(t2));
143
+ if (h.index_ext == TCG_TYPE_I32) {
143
+ vec_gen_3(INDEX_op_ppc_mrgl_vec, type, vece + 1, tcgv_vec_arg(t1),
144
+ /* add base, base, index, uxtw */
144
+ tcgv_vec_arg(t1), tcgv_vec_arg(t2));
145
+ tcg_out_insn(s, 3501, ADD, TCG_TYPE_I64, base,
145
+ vec_gen_3(INDEX_op_ppc_pkum_vec, type, vece, tcgv_vec_arg(v0),
146
+ h.base, h.index, MO_32, 0);
146
+ tcgv_vec_arg(v0), tcgv_vec_arg(t1));
147
+ } else {
147
+    break;
148
+ /* add base, base, index */
148
+
149
+ tcg_out_insn(s, 3502, ADD, 1, base, h.base, h.index);
149
+ case MO_32:
150
+ }
150
+ t3 = tcg_temp_new_vec(type);
151
+ }
151
+ t4 = tcg_temp_new_vec(type);
152
+
152
+ tcg_gen_dupi_vec(MO_8, t4, -16);
153
+ use_pair = h.aa.atom < MO_128 || have_lse2;
153
+ vec_gen_3(INDEX_op_ppc_rotl_vec, type, MO_32, tcgv_vec_arg(t1),
154
+
154
+ tcgv_vec_arg(v2), tcgv_vec_arg(t4));
155
+ if (!use_pair) {
155
+ vec_gen_3(INDEX_op_ppc_mulou_vec, type, MO_16, tcgv_vec_arg(t2),
156
+ tcg_insn_unit *branch = NULL;
156
+ tcgv_vec_arg(v1), tcgv_vec_arg(v2));
157
+ TCGReg ll, lh, sl, sh;
157
+ tcg_gen_dupi_vec(MO_8, t3, 0);
158
+
158
+ vec_gen_4(INDEX_op_ppc_msum_vec, type, MO_16, tcgv_vec_arg(t3),
159
+ /*
159
+ tcgv_vec_arg(v1), tcgv_vec_arg(t1), tcgv_vec_arg(t3));
160
+ * If we have already checked for 16-byte alignment, that's all
160
+ vec_gen_3(INDEX_op_shlv_vec, type, MO_32, tcgv_vec_arg(t3),
161
+ * we need. Otherwise we have determined that misaligned atomicity
161
+ tcgv_vec_arg(t3), tcgv_vec_arg(t4));
162
+ * may be handled with two 8-byte loads.
162
+ tcg_gen_add_vec(MO_32, v0, t2, t3);
163
+ */
163
+ tcg_temp_free_vec(t3);
164
+ if (h.aa.align < MO_128) {
164
+ tcg_temp_free_vec(t4);
165
+ /*
166
+ * TODO: align should be MO_64, so we only need test bit 3,
167
+ * which means we could use TBNZ instead of ANDS+B_C.
168
+ */
169
+ tcg_out_logicali(s, I3404_ANDSI, 0, TCG_REG_XZR, addr_reg, 15);
170
+ branch = s->code_ptr;
171
+ tcg_out_insn(s, 3202, B_C, TCG_COND_NE, 0);
172
+ use_pair = true;
173
+ }
174
+
175
+ if (is_ld) {
176
+ /*
177
+ * 16-byte atomicity without LSE2 requires LDXP+STXP loop:
178
+ * ldxp lo, hi, [base]
179
+ * stxp t0, lo, hi, [base]
180
+ * cbnz t0, .-8
181
+ * Require no overlap between data{lo,hi} and base.
182
+ */
183
+ if (datalo == base || datahi == base) {
184
+ tcg_out_mov(s, TCG_TYPE_REG, TCG_REG_TMP2, base);
185
+ base = TCG_REG_TMP2;
186
+ }
187
+ ll = sl = datalo;
188
+ lh = sh = datahi;
189
+ } else {
190
+ /*
191
+ * 16-byte atomicity without LSE2 requires LDXP+STXP loop:
192
+ * 1: ldxp t0, t1, [base]
193
+ * stxp t0, lo, hi, [base]
194
+ * cbnz t0, 1b
195
+ */
196
+ tcg_debug_assert(base != TCG_REG_TMP0 && base != TCG_REG_TMP1);
197
+ ll = TCG_REG_TMP0;
198
+ lh = TCG_REG_TMP1;
199
+ sl = datalo;
200
+ sh = datahi;
201
+ }
202
+
203
+ tcg_out_insn(s, 3306, LDXP, TCG_REG_XZR, ll, lh, base);
204
+ tcg_out_insn(s, 3306, STXP, TCG_REG_TMP0, sl, sh, base);
205
+ tcg_out_insn(s, 3201, CBNZ, 0, TCG_REG_TMP0, -2);
206
+
207
+ if (use_pair) {
208
+ /* "b .+8", branching across the one insn of use_pair. */
209
+ tcg_out_insn(s, 3206, B, 2);
210
+ reloc_pc19(branch, tcg_splitwx_to_rx(s->code_ptr));
211
+ }
212
+ }
213
+
214
+ if (use_pair) {
215
+ if (is_ld) {
216
+ tcg_out_insn(s, 3314, LDP, datalo, datahi, base, 0, 1, 0);
217
+ } else {
218
+ tcg_out_insn(s, 3314, STP, datalo, datahi, base, 0, 1, 0);
219
+ }
220
+ }
221
+
222
+ if (ldst) {
223
+ ldst->type = TCG_TYPE_I128;
224
+ ldst->datalo_reg = datalo;
225
+ ldst->datahi_reg = datahi;
226
+ ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
227
+ }
228
+}
229
+
230
static const tcg_insn_unit *tb_ret_addr;
231
232
static void tcg_out_exit_tb(TCGContext *s, uintptr_t a0)
233
@@ -XXX,XX +XXX,XX @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
234
case INDEX_op_qemu_st_a64_i64:
235
tcg_out_qemu_st(s, REG0(0), a1, a2, ext);
236
break;
237
+ case INDEX_op_qemu_ld_a32_i128:
238
+ case INDEX_op_qemu_ld_a64_i128:
239
+ tcg_out_qemu_ldst_i128(s, a0, a1, a2, args[3], true);
165
+ break;
240
+ break;
166
+
241
+ case INDEX_op_qemu_st_a32_i128:
167
+ default:
242
+ case INDEX_op_qemu_st_a64_i128:
168
+ g_assert_not_reached();
243
+ tcg_out_qemu_ldst_i128(s, REG0(0), REG0(1), a2, args[3], false);
169
+ }
170
+ tcg_temp_free_vec(t1);
171
+ tcg_temp_free_vec(t2);
172
+}
173
+
174
void tcg_expand_vec_op(TCGOpcode opc, TCGType type, unsigned vece,
175
TCGArg a0, ...)
176
{
177
@@ -XXX,XX +XXX,XX @@ void tcg_expand_vec_op(TCGOpcode opc, TCGType type, unsigned vece,
178
v2 = temp_tcgv_vec(arg_temp(a2));
179
expand_vec_cmp(type, vece, v0, v1, v2, va_arg(va, TCGArg));
180
break;
181
+ case INDEX_op_mul_vec:
182
+ v2 = temp_tcgv_vec(arg_temp(a2));
183
+ expand_vec_mul(type, vece, v0, v1, v2);
184
+ break;
244
+ break;
185
default:
245
186
g_assert_not_reached();
246
case INDEX_op_bswap64_i64:
187
}
247
tcg_out_rev(s, TCG_TYPE_I64, MO_64, a0, a1);
188
@@ -XXX,XX +XXX,XX @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
248
@@ -XXX,XX +XXX,XX @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
189
static const TCGTargetOpDef v_r = { .args_ct_str = { "v", "r" } };
249
case INDEX_op_qemu_ld_a32_i64:
190
static const TCGTargetOpDef v_v = { .args_ct_str = { "v", "v" } };
250
case INDEX_op_qemu_ld_a64_i64:
191
static const TCGTargetOpDef v_v_v = { .args_ct_str = { "v", "v", "v" } };
251
return C_O1_I1(r, r);
192
+ static const TCGTargetOpDef v_v_v_v
252
+ case INDEX_op_qemu_ld_a32_i128:
193
+ = { .args_ct_str = { "v", "v", "v", "v" } };
253
+ case INDEX_op_qemu_ld_a64_i128:
194
254
+ return C_O2_I1(r, r, r);
195
switch (op) {
255
case INDEX_op_qemu_st_a32_i32:
196
case INDEX_op_goto_ptr:
256
case INDEX_op_qemu_st_a64_i32:
197
@@ -XXX,XX +XXX,XX @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
257
case INDEX_op_qemu_st_a32_i64:
198
258
case INDEX_op_qemu_st_a64_i64:
199
case INDEX_op_add_vec:
259
return C_O0_I2(rZ, r);
200
case INDEX_op_sub_vec:
260
+ case INDEX_op_qemu_st_a32_i128:
201
+ case INDEX_op_mul_vec:
261
+ case INDEX_op_qemu_st_a64_i128:
202
case INDEX_op_and_vec:
262
+ return C_O0_I3(rZ, rZ, r);
203
case INDEX_op_or_vec:
263
204
case INDEX_op_xor_vec:
264
case INDEX_op_deposit_i32:
205
@@ -XXX,XX +XXX,XX @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
265
case INDEX_op_deposit_i64:
206
case INDEX_op_shlv_vec:
207
case INDEX_op_shrv_vec:
208
case INDEX_op_sarv_vec:
209
+ case INDEX_op_ppc_mrgh_vec:
210
+ case INDEX_op_ppc_mrgl_vec:
211
+ case INDEX_op_ppc_muleu_vec:
212
+ case INDEX_op_ppc_mulou_vec:
213
+ case INDEX_op_ppc_pkum_vec:
214
+ case INDEX_op_ppc_rotl_vec:
215
return &v_v_v;
216
case INDEX_op_not_vec:
217
case INDEX_op_dup_vec:
218
@@ -XXX,XX +XXX,XX @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
219
case INDEX_op_st_vec:
220
case INDEX_op_dupm_vec:
221
return &v_r;
222
+ case INDEX_op_ppc_msum_vec:
223
+ return &v_v_v_v;
224
225
default:
226
return NULL;
227
--
266
--
228
2.17.1
267
2.34.1
229
230
1
These new instructions are conditional only on MSR.VEC and
1
Use LQ/STQ with ISA v2.07 when 16-byte atomicity is required.
2
are thus part of the Altivec instruction set, and not VSX.
2
Note that these instructions do not require 16-byte alignment.
3
This includes lots of double-word arithmetic and a few extra
4
logical operations.
5
3
6
Reviewed-by: Aleksandar Markovic <amarkovic@wavecomp.com>
4
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
7
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
5
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
8
---
6
---
9
tcg/ppc/tcg-target.h | 4 +-
7
tcg/ppc/tcg-target-con-set.h | 2 +
10
tcg/ppc/tcg-target.inc.c | 85 ++++++++++++++++++++++++++++++----------
8
tcg/ppc/tcg-target-con-str.h | 1 +
11
2 files changed, 67 insertions(+), 22 deletions(-)
9
tcg/ppc/tcg-target.h | 3 +-
10
tcg/ppc/tcg-target.c.inc | 108 +++++++++++++++++++++++++++++++----
11
4 files changed, 101 insertions(+), 13 deletions(-)
12
12
13
diff --git a/tcg/ppc/tcg-target-con-set.h b/tcg/ppc/tcg-target-con-set.h
14
index XXXXXXX..XXXXXXX 100644
15
--- a/tcg/ppc/tcg-target-con-set.h
16
+++ b/tcg/ppc/tcg-target-con-set.h
17
@@ -XXX,XX +XXX,XX @@ C_O0_I2(r, r)
18
C_O0_I2(r, ri)
19
C_O0_I2(v, r)
20
C_O0_I3(r, r, r)
21
+C_O0_I3(o, m, r)
22
C_O0_I4(r, r, ri, ri)
23
C_O0_I4(r, r, r, r)
24
C_O1_I1(r, r)
25
@@ -XXX,XX +XXX,XX @@ C_O1_I3(v, v, v, v)
26
C_O1_I4(r, r, ri, rZ, rZ)
27
C_O1_I4(r, r, r, ri, ri)
28
C_O2_I1(r, r, r)
29
+C_O2_I1(o, m, r)
30
C_O2_I2(r, r, r, r)
31
C_O2_I4(r, r, rI, rZM, r, r)
32
C_O2_I4(r, r, r, r, rI, rZM)
33
diff --git a/tcg/ppc/tcg-target-con-str.h b/tcg/ppc/tcg-target-con-str.h
34
index XXXXXXX..XXXXXXX 100644
35
--- a/tcg/ppc/tcg-target-con-str.h
36
+++ b/tcg/ppc/tcg-target-con-str.h
37
@@ -XXX,XX +XXX,XX @@
38
* REGS(letter, register_mask)
39
*/
40
REGS('r', ALL_GENERAL_REGS)
41
+REGS('o', ALL_GENERAL_REGS & 0xAAAAAAAAu) /* odd registers */
42
REGS('v', ALL_VECTOR_REGS)
43
44
/*
13
diff --git a/tcg/ppc/tcg-target.h b/tcg/ppc/tcg-target.h
45
diff --git a/tcg/ppc/tcg-target.h b/tcg/ppc/tcg-target.h
14
index XXXXXXX..XXXXXXX 100644
46
index XXXXXXX..XXXXXXX 100644
15
--- a/tcg/ppc/tcg-target.h
47
--- a/tcg/ppc/tcg-target.h
16
+++ b/tcg/ppc/tcg-target.h
48
+++ b/tcg/ppc/tcg-target.h
17
@@ -XXX,XX +XXX,XX @@ typedef enum {
18
typedef enum {
19
tcg_isa_base,
20
tcg_isa_2_06,
21
+ tcg_isa_2_07,
22
tcg_isa_3_00,
23
} TCGPowerISA;
24
25
@@ -XXX,XX +XXX,XX @@ extern bool have_altivec;
26
extern bool have_vsx;
27
28
#define have_isa_2_06 (have_isa >= tcg_isa_2_06)
29
+#define have_isa_2_07 (have_isa >= tcg_isa_2_07)
30
#define have_isa_3_00 (have_isa >= tcg_isa_3_00)
31
32
/* optional instructions automatically implemented */
33
@@ -XXX,XX +XXX,XX @@ extern bool have_vsx;
49
@@ -XXX,XX +XXX,XX @@ extern bool have_vsx;
34
#define TCG_TARGET_HAS_v256 0
50
#define TCG_TARGET_HAS_mulsh_i64 1
35
51
#endif
36
#define TCG_TARGET_HAS_andc_vec 1
52
37
-#define TCG_TARGET_HAS_orc_vec 0
53
-#define TCG_TARGET_HAS_qemu_ldst_i128 0
38
+#define TCG_TARGET_HAS_orc_vec have_isa_2_07
54
+#define TCG_TARGET_HAS_qemu_ldst_i128 \
39
#define TCG_TARGET_HAS_not_vec 1
55
+ (TCG_TARGET_REG_BITS == 64 && have_isa_2_07)
40
#define TCG_TARGET_HAS_neg_vec 0
56
41
#define TCG_TARGET_HAS_abs_vec 0
57
/*
42
diff --git a/tcg/ppc/tcg-target.inc.c b/tcg/ppc/tcg-target.inc.c
58
* While technically Altivec could support V64, it has no 64-bit store
43
index XXXXXXX..XXXXXXX 100644
59
diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
44
--- a/tcg/ppc/tcg-target.inc.c
60
index XXXXXXX..XXXXXXX 100644
45
+++ b/tcg/ppc/tcg-target.inc.c
61
--- a/tcg/ppc/tcg-target.c.inc
46
@@ -XXX,XX +XXX,XX @@ static int tcg_target_const_match(tcg_target_long val, TCGType type,
62
+++ b/tcg/ppc/tcg-target.c.inc
47
#define VADDSWS VX4(896)
63
@@ -XXX,XX +XXX,XX @@ static bool tcg_target_const_match(int64_t val, TCGType type, int ct)
48
#define VADDUWS VX4(640)
64
49
#define VADDUWM VX4(128)
65
#define B OPCD( 18)
50
+#define VADDUDM VX4(192) /* v2.07 */
66
#define BC OPCD( 16)
51
67
+
52
#define VSUBSBS VX4(1792)
68
#define LBZ OPCD( 34)
53
#define VSUBUBS VX4(1536)
69
#define LHZ OPCD( 40)
54
@@ -XXX,XX +XXX,XX @@ static int tcg_target_const_match(tcg_target_long val, TCGType type,
70
#define LHA OPCD( 42)
55
#define VSUBSWS VX4(1920)
71
#define LWZ OPCD( 32)
56
#define VSUBUWS VX4(1664)
72
#define LWZUX XO31( 55)
57
#define VSUBUWM VX4(1152)
73
-#define STB OPCD( 38)
58
+#define VSUBUDM VX4(1216) /* v2.07 */
74
-#define STH OPCD( 44)
59
75
-#define STW OPCD( 36)
60
#define VMAXSB VX4(258)
76
-
61
#define VMAXSH VX4(322)
77
-#define STD XO62( 0)
62
#define VMAXSW VX4(386)
78
-#define STDU XO62( 1)
63
+#define VMAXSD VX4(450) /* v2.07 */
79
-#define STDX XO31(149)
64
#define VMAXUB VX4(2)
80
-
65
#define VMAXUH VX4(66)
81
#define LD XO58( 0)
66
#define VMAXUW VX4(130)
82
#define LDX XO31( 21)
67
+#define VMAXUD VX4(194) /* v2.07 */
83
#define LDU XO58( 1)
68
#define VMINSB VX4(770)
84
#define LDUX XO31( 53)
69
#define VMINSH VX4(834)
85
#define LWA XO58( 2)
70
#define VMINSW VX4(898)
86
#define LWAX XO31(341)
71
+#define VMINSD VX4(962) /* v2.07 */
87
+#define LQ OPCD( 56)
72
#define VMINUB VX4(514)
88
+
73
#define VMINUH VX4(578)
89
+#define STB OPCD( 38)
74
#define VMINUW VX4(642)
90
+#define STH OPCD( 44)
75
+#define VMINUD VX4(706) /* v2.07 */
91
+#define STW OPCD( 36)
76
92
+#define STD XO62( 0)
77
#define VCMPEQUB VX4(6)
93
+#define STDU XO62( 1)
78
#define VCMPEQUH VX4(70)
94
+#define STDX XO31(149)
79
#define VCMPEQUW VX4(134)
95
+#define STQ XO62( 2)
80
+#define VCMPEQUD VX4(199) /* v2.07 */
96
81
#define VCMPGTSB VX4(774)
97
#define ADDIC OPCD( 12)
82
#define VCMPGTSH VX4(838)
98
#define ADDI OPCD( 14)
83
#define VCMPGTSW VX4(902)
99
@@ -XXX,XX +XXX,XX @@ typedef struct {
84
+#define VCMPGTSD VX4(967) /* v2.07 */
100
85
#define VCMPGTUB VX4(518)
101
bool tcg_target_has_memory_bswap(MemOp memop)
86
#define VCMPGTUH VX4(582)
102
{
87
#define VCMPGTUW VX4(646)
103
- return true;
88
+#define VCMPGTUD VX4(711) /* v2.07 */
104
+ TCGAtomAlign aa;
89
105
+
90
#define VSLB VX4(260)
106
+ if ((memop & MO_SIZE) <= MO_64) {
91
#define VSLH VX4(324)
107
+ return true;
92
#define VSLW VX4(388)
108
+ }
93
+#define VSLD VX4(1476) /* v2.07 */
109
+
94
#define VSRB VX4(516)
110
+ /*
95
#define VSRH VX4(580)
111
+ * Reject 16-byte memop with 16-byte atomicity,
96
#define VSRW VX4(644)
112
+ * but do allow a pair of 64-bit operations.
97
+#define VSRD VX4(1732) /* v2.07 */
113
+ */
98
#define VSRAB VX4(772)
114
+ aa = atom_and_align_for_opc(tcg_ctx, memop, MO_ATOM_IFALIGN, true);
99
#define VSRAH VX4(836)
115
+ return aa.atom <= MO_64;
100
#define VSRAW VX4(900)
116
}
101
+#define VSRAD VX4(964) /* v2.07 */
117
102
#define VRLB VX4(4)
118
/*
103
#define VRLH VX4(68)
119
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
104
#define VRLW VX4(132)
120
{
105
+#define VRLD VX4(196) /* v2.07 */
121
TCGLabelQemuLdst *ldst = NULL;
106
122
MemOp opc = get_memop(oi);
107
#define VMULEUB VX4(520)
123
- MemOp a_bits;
108
#define VMULEUH VX4(584)
124
+ MemOp a_bits, s_bits;
109
+#define VMULEUW VX4(648) /* v2.07 */
125
110
#define VMULOUB VX4(8)
126
/*
111
#define VMULOUH VX4(72)
127
* Book II, Section 1.4, Single-Copy Atomicity, specifies:
112
+#define VMULOUW VX4(136) /* v2.07 */
128
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
113
+#define VMULUWM VX4(137) /* v2.07 */
129
* As of 3.0, "the non-atomic access is performed as described in
114
#define VMSUMUHM VX4(38)
130
* the corresponding list", which matches MO_ATOM_SUBALIGN.
115
131
*/
116
#define VMRGHB VX4(12)
132
+ s_bits = opc & MO_SIZE;
117
@@ -XXX,XX +XXX,XX @@ static int tcg_target_const_match(tcg_target_long val, TCGType type,
133
h->aa = atom_and_align_for_opc(s, opc,
118
#define VNOR VX4(1284)
134
have_isa_3_00 ? MO_ATOM_SUBALIGN
119
#define VOR VX4(1156)
135
: MO_ATOM_IFALIGN,
120
#define VXOR VX4(1220)
136
- false);
121
+#define VEQV VX4(1668) /* v2.07 */
137
+ s_bits == MO_128);
122
+#define VNAND VX4(1412) /* v2.07 */
138
a_bits = h->aa.align;
123
+#define VORC VX4(1348) /* v2.07 */
139
124
140
#ifdef CONFIG_SOFTMMU
125
#define VSPLTB VX4(524)
141
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
126
#define VSPLTH VX4(588)
142
int fast_off = TLB_MASK_TABLE_OFS(mem_index);
127
@@ -XXX,XX +XXX,XX @@ int tcg_can_emit_vec_op(TCGOpcode opc, TCGType type, unsigned vece)
143
int mask_off = fast_off + offsetof(CPUTLBDescFast, mask);
128
case INDEX_op_andc_vec:
144
int table_off = fast_off + offsetof(CPUTLBDescFast, table);
129
case INDEX_op_not_vec:
145
- unsigned s_bits = opc & MO_SIZE;
130
return 1;
146
131
+ case INDEX_op_orc_vec:
147
ldst = new_ldst_label(s);
132
+ return have_isa_2_07;
148
ldst->is_ld = is_ld;
149
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_st(TCGContext *s, TCGReg datalo, TCGReg datahi,
150
}
151
}
152
153
+static void tcg_out_qemu_ldst_i128(TCGContext *s, TCGReg datalo, TCGReg datahi,
154
+ TCGReg addr_reg, MemOpIdx oi, bool is_ld)
155
+{
156
+ TCGLabelQemuLdst *ldst;
157
+ HostAddress h;
158
+ bool need_bswap;
159
+ uint32_t insn;
160
+ TCGReg index;
161
+
162
+ ldst = prepare_host_addr(s, &h, addr_reg, -1, oi, is_ld);
163
+
164
+ /* Compose the final address, as LQ/STQ have no indexing. */
165
+ index = h.index;
166
+ if (h.base != 0) {
167
+ index = TCG_REG_TMP1;
168
+ tcg_out32(s, ADD | TAB(index, h.base, h.index));
169
+ }
170
+ need_bswap = get_memop(oi) & MO_BSWAP;
171
+
172
+ if (h.aa.atom == MO_128) {
173
+ tcg_debug_assert(!need_bswap);
174
+ tcg_debug_assert(datalo & 1);
175
+ tcg_debug_assert(datahi == datalo - 1);
176
+ insn = is_ld ? LQ : STQ;
177
+ tcg_out32(s, insn | TAI(datahi, index, 0));
178
+ } else {
179
+ TCGReg d1, d2;
180
+
181
+ if (HOST_BIG_ENDIAN ^ need_bswap) {
182
+ d1 = datahi, d2 = datalo;
183
+ } else {
184
+ d1 = datalo, d2 = datahi;
185
+ }
186
+
187
+ if (need_bswap) {
188
+ tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_R0, 8);
189
+ insn = is_ld ? LDBRX : STDBRX;
190
+ tcg_out32(s, insn | TAB(d1, 0, index));
191
+ tcg_out32(s, insn | TAB(d2, index, TCG_REG_R0));
192
+ } else {
193
+ insn = is_ld ? LD : STD;
194
+ tcg_out32(s, insn | TAI(d1, index, 0));
195
+ tcg_out32(s, insn | TAI(d2, index, 8));
196
+ }
197
+ }
198
+
199
+ if (ldst) {
200
+ ldst->type = TCG_TYPE_I128;
201
+ ldst->datalo_reg = datalo;
202
+ ldst->datahi_reg = datahi;
203
+ ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
204
+ }
205
+}
206
+
207
static void tcg_out_nop_fill(tcg_insn_unit *p, int count)
208
{
209
int i;
210
@@ -XXX,XX +XXX,XX @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
211
args[4], TCG_TYPE_I64);
212
}
213
break;
214
+ case INDEX_op_qemu_ld_a32_i128:
215
+ case INDEX_op_qemu_ld_a64_i128:
216
+ tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
217
+ tcg_out_qemu_ldst_i128(s, args[0], args[1], args[2], args[3], true);
218
+ break;
219
220
case INDEX_op_qemu_st_a64_i32:
221
if (TCG_TARGET_REG_BITS == 32) {
222
@@ -XXX,XX +XXX,XX @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
223
args[4], TCG_TYPE_I64);
224
}
225
break;
226
+ case INDEX_op_qemu_st_a32_i128:
227
+ case INDEX_op_qemu_st_a64_i128:
228
+ tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
229
+ tcg_out_qemu_ldst_i128(s, args[0], args[1], args[2], args[3], false);
230
+ break;
231
232
case INDEX_op_setcond_i32:
233
tcg_out_setcond(s, TCG_TYPE_I32, args[3], args[0], args[1], args[2],
234
@@ -XXX,XX +XXX,XX @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
235
case INDEX_op_qemu_st_a64_i64:
236
return TCG_TARGET_REG_BITS == 64 ? C_O0_I2(r, r) : C_O0_I4(r, r, r, r);
237
238
+ case INDEX_op_qemu_ld_a32_i128:
239
+ case INDEX_op_qemu_ld_a64_i128:
240
+ return C_O2_I1(o, m, r);
241
+ case INDEX_op_qemu_st_a32_i128:
242
+ case INDEX_op_qemu_st_a64_i128:
243
+ return C_O0_I3(o, m, r);
244
+
133
case INDEX_op_add_vec:
245
case INDEX_op_add_vec:
134
case INDEX_op_sub_vec:
246
case INDEX_op_sub_vec:
135
case INDEX_op_smax_vec:
247
case INDEX_op_mul_vec:
136
case INDEX_op_smin_vec:
137
case INDEX_op_umax_vec:
138
case INDEX_op_umin_vec:
139
+ case INDEX_op_shlv_vec:
140
+ case INDEX_op_shrv_vec:
141
+ case INDEX_op_sarv_vec:
142
+ return vece <= MO_32 || have_isa_2_07;
143
case INDEX_op_ssadd_vec:
144
case INDEX_op_sssub_vec:
145
case INDEX_op_usadd_vec:
146
case INDEX_op_ussub_vec:
147
- case INDEX_op_shlv_vec:
148
- case INDEX_op_shrv_vec:
149
- case INDEX_op_sarv_vec:
150
return vece <= MO_32;
151
case INDEX_op_cmp_vec:
152
- case INDEX_op_mul_vec:
153
case INDEX_op_shli_vec:
154
case INDEX_op_shri_vec:
155
case INDEX_op_sari_vec:
156
- return vece <= MO_32 ? -1 : 0;
157
+ return vece <= MO_32 || have_isa_2_07 ? -1 : 0;
158
+ case INDEX_op_mul_vec:
159
+ switch (vece) {
160
+ case MO_8:
161
+ case MO_16:
162
+ return -1;
163
+ case MO_32:
164
+ return have_isa_2_07 ? 1 : -1;
165
+ }
166
+ return 0;
167
case INDEX_op_bitsel_vec:
168
return have_vsx;
169
default:
170
@@ -XXX,XX +XXX,XX @@ static void tcg_out_vec_op(TCGContext *s, TCGOpcode opc,
171
const TCGArg *args, const int *const_args)
172
{
173
static const uint32_t
174
- add_op[4] = { VADDUBM, VADDUHM, VADDUWM, 0 },
175
- sub_op[4] = { VSUBUBM, VSUBUHM, VSUBUWM, 0 },
176
- eq_op[4] = { VCMPEQUB, VCMPEQUH, VCMPEQUW, 0 },
177
- gts_op[4] = { VCMPGTSB, VCMPGTSH, VCMPGTSW, 0 },
178
- gtu_op[4] = { VCMPGTUB, VCMPGTUH, VCMPGTUW, 0 },
179
+ add_op[4] = { VADDUBM, VADDUHM, VADDUWM, VADDUDM },
180
+ sub_op[4] = { VSUBUBM, VSUBUHM, VSUBUWM, VSUBUDM },
181
+ eq_op[4] = { VCMPEQUB, VCMPEQUH, VCMPEQUW, VCMPEQUD },
182
+ gts_op[4] = { VCMPGTSB, VCMPGTSH, VCMPGTSW, VCMPGTSD },
183
+ gtu_op[4] = { VCMPGTUB, VCMPGTUH, VCMPGTUW, VCMPGTUD },
184
ssadd_op[4] = { VADDSBS, VADDSHS, VADDSWS, 0 },
185
usadd_op[4] = { VADDUBS, VADDUHS, VADDUWS, 0 },
186
sssub_op[4] = { VSUBSBS, VSUBSHS, VSUBSWS, 0 },
187
ussub_op[4] = { VSUBUBS, VSUBUHS, VSUBUWS, 0 },
188
- umin_op[4] = { VMINUB, VMINUH, VMINUW, 0 },
189
- smin_op[4] = { VMINSB, VMINSH, VMINSW, 0 },
190
- umax_op[4] = { VMAXUB, VMAXUH, VMAXUW, 0 },
191
- smax_op[4] = { VMAXSB, VMAXSH, VMAXSW, 0 },
192
- shlv_op[4] = { VSLB, VSLH, VSLW, 0 },
193
- shrv_op[4] = { VSRB, VSRH, VSRW, 0 },
194
- sarv_op[4] = { VSRAB, VSRAH, VSRAW, 0 },
195
+ umin_op[4] = { VMINUB, VMINUH, VMINUW, VMINUD },
196
+ smin_op[4] = { VMINSB, VMINSH, VMINSW, VMINSD },
197
+ umax_op[4] = { VMAXUB, VMAXUH, VMAXUW, VMAXUD },
198
+ smax_op[4] = { VMAXSB, VMAXSH, VMAXSW, VMAXSD },
199
+ shlv_op[4] = { VSLB, VSLH, VSLW, VSLD },
200
+ shrv_op[4] = { VSRB, VSRH, VSRW, VSRD },
201
+ sarv_op[4] = { VSRAB, VSRAH, VSRAW, VSRAD },
202
mrgh_op[4] = { VMRGHB, VMRGHH, VMRGHW, 0 },
203
mrgl_op[4] = { VMRGLB, VMRGLH, VMRGLW, 0 },
204
- muleu_op[4] = { VMULEUB, VMULEUH, 0, 0 },
205
- mulou_op[4] = { VMULOUB, VMULOUH, 0, 0 },
206
+ muleu_op[4] = { VMULEUB, VMULEUH, VMULEUW, 0 },
207
+ mulou_op[4] = { VMULOUB, VMULOUH, VMULOUW, 0 },
208
pkum_op[4] = { VPKUHUM, VPKUWUM, 0, 0 },
209
- rotl_op[4] = { VRLB, VRLH, VRLW, 0 };
210
+ rotl_op[4] = { VRLB, VRLH, VRLW, VRLD };
211
212
TCGType type = vecl + TCG_TYPE_V64;
213
TCGArg a0 = args[0], a1 = args[1], a2 = args[2];
214
@@ -XXX,XX +XXX,XX @@ static void tcg_out_vec_op(TCGContext *s, TCGOpcode opc,
215
case INDEX_op_sub_vec:
216
insn = sub_op[vece];
217
break;
218
+ case INDEX_op_mul_vec:
219
+ tcg_debug_assert(vece == MO_32 && have_isa_2_07);
220
+ insn = VMULUWM;
221
+ break;
222
case INDEX_op_ssadd_vec:
223
insn = ssadd_op[vece];
224
break;
225
@@ -XXX,XX +XXX,XX @@ static void tcg_out_vec_op(TCGContext *s, TCGOpcode opc,
226
insn = VNOR;
227
a2 = a1;
228
break;
229
+ case INDEX_op_orc_vec:
230
+ insn = VORC;
231
+ break;
232
233
case INDEX_op_cmp_vec:
234
switch (args[3]) {
235
@@ -XXX,XX +XXX,XX @@ static void expand_vec_cmp(TCGType type, unsigned vece, TCGv_vec v0,
236
{
237
bool need_swap = false, need_inv = false;
238
239
- tcg_debug_assert(vece <= MO_32);
240
+ tcg_debug_assert(vece <= MO_32 || have_isa_2_07);
241
242
switch (cond) {
243
case TCG_COND_EQ:
244
@@ -XXX,XX +XXX,XX @@ static void expand_vec_mul(TCGType type, unsigned vece, TCGv_vec v0,
245
    break;
246
247
case MO_32:
248
+ tcg_debug_assert(!have_isa_2_07);
249
t3 = tcg_temp_new_vec(type);
250
t4 = tcg_temp_new_vec(type);
251
tcg_gen_dupi_vec(MO_8, t4, -16);
252
@@ -XXX,XX +XXX,XX @@ static void tcg_target_init(TCGContext *s)
253
if (hwcap & PPC_FEATURE_ARCH_2_06) {
254
have_isa = tcg_isa_2_06;
255
}
256
+#ifdef PPC_FEATURE2_ARCH_2_07
257
+ if (hwcap2 & PPC_FEATURE2_ARCH_2_07) {
258
+ have_isa = tcg_isa_2_07;
259
+ }
260
+#endif
261
#ifdef PPC_FEATURE2_ARCH_3_00
262
if (hwcap2 & PPC_FEATURE2_ARCH_3_00) {
263
have_isa = tcg_isa_3_00;
264
--
248
--
265
2.17.1
249
2.34.1
266
267
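The LQ/STQ path above relies on two small invariants that are easy to restate outside of TCG. The following plain-C sketch is illustrative only, and the helper names (valid_lq_pair, split_halves) are made up rather than QEMU API: LQ/STQ operate on an even/odd register pair with the even register holding the high doubleword, and the two-instruction fallback picks which half lands at the lower address from HOST_BIG_ENDIAN ^ need_bswap.

#include <stdbool.h>

/* Mirror of the tcg_debug_assert()s before emitting LQ/STQ:
 * datalo must be the odd register, datahi the even one just below it. */
static bool valid_lq_pair(int datahi, int datalo)
{
    return (datalo & 1) != 0 && datahi == datalo - 1;
}

/* For the LD/STD (or LDBRX/STDBRX) fallback, choose which register is
 * accessed at offset 0 and which at offset 8, as in the hunk above. */
static void split_halves(bool host_big_endian, bool need_bswap,
                         int datalo, int datahi, int *d1, int *d2)
{
    if (host_big_endian ^ need_bswap) {
        *d1 = datahi;
        *d2 = datalo;
    } else {
        *d1 = datalo;
        *d2 = datahi;
    }
}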
1
From: Alex Bennée <alex.bennee@linaro.org>
1
Use LPQ/STPQ when 16-byte atomicity is required.
2
Note that these instructions require 16-byte alignment.
2
3
3
qemu_cpu_kick is used for a number of reasons including to indicate
4
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
4
there is work to be done. However, when thread=single the old
5
qemu_cpu_kick_rr_cpu only advanced the vCPU to the next executing one
6
which can lead to a hang in the case that:
7
8
a) the kick is from outside the vCPUs (e.g. iothread)
9
b) the timers are paused (i.e. iothread calling run_on_cpu)
10
11
To avoid this, let's split qemu_cpu_kick_rr into two functions. One for
12
the timer which continues to advance to the next timeslice and another
13
for all other kicks.
14
15
Message-Id: <20191001160426.26644-1-alex.bennee@linaro.org>
16
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
17
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
18
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
19
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
5
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
20
---
6
---
21
cpus.c | 24 ++++++++++++++++++------
7
tcg/s390x/tcg-target-con-set.h | 2 +
22
1 file changed, 18 insertions(+), 6 deletions(-)
8
tcg/s390x/tcg-target.h | 2 +-
9
tcg/s390x/tcg-target.c.inc | 107 ++++++++++++++++++++++++++++++++-
10
3 files changed, 107 insertions(+), 4 deletions(-)
23
11
24
diff --git a/cpus.c b/cpus.c
12
diff --git a/tcg/s390x/tcg-target-con-set.h b/tcg/s390x/tcg-target-con-set.h
25
index XXXXXXX..XXXXXXX 100644
13
index XXXXXXX..XXXXXXX 100644
26
--- a/cpus.c
14
--- a/tcg/s390x/tcg-target-con-set.h
27
+++ b/cpus.c
15
+++ b/tcg/s390x/tcg-target-con-set.h
28
@@ -XXX,XX +XXX,XX @@ static inline int64_t qemu_tcg_next_kick(void)
16
@@ -XXX,XX +XXX,XX @@ C_O0_I2(r, r)
29
return qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + TCG_KICK_PERIOD;
17
C_O0_I2(r, ri)
18
C_O0_I2(r, rA)
19
C_O0_I2(v, r)
20
+C_O0_I3(o, m, r)
21
C_O1_I1(r, r)
22
C_O1_I1(v, r)
23
C_O1_I1(v, v)
24
@@ -XXX,XX +XXX,XX @@ C_O1_I2(v, v, v)
25
C_O1_I3(v, v, v, v)
26
C_O1_I4(r, r, ri, rI, r)
27
C_O1_I4(r, r, rA, rI, r)
28
+C_O2_I1(o, m, r)
29
C_O2_I2(o, m, 0, r)
30
C_O2_I2(o, m, r, r)
31
C_O2_I3(o, m, 0, 1, r)
32
diff --git a/tcg/s390x/tcg-target.h b/tcg/s390x/tcg-target.h
33
index XXXXXXX..XXXXXXX 100644
34
--- a/tcg/s390x/tcg-target.h
35
+++ b/tcg/s390x/tcg-target.h
36
@@ -XXX,XX +XXX,XX @@ extern uint64_t s390_facilities[3];
37
#define TCG_TARGET_HAS_muluh_i64 0
38
#define TCG_TARGET_HAS_mulsh_i64 0
39
40
-#define TCG_TARGET_HAS_qemu_ldst_i128 0
41
+#define TCG_TARGET_HAS_qemu_ldst_i128 1
42
43
#define TCG_TARGET_HAS_v64 HAVE_FACILITY(VECTOR)
44
#define TCG_TARGET_HAS_v128 HAVE_FACILITY(VECTOR)
45
diff --git a/tcg/s390x/tcg-target.c.inc b/tcg/s390x/tcg-target.c.inc
46
index XXXXXXX..XXXXXXX 100644
47
--- a/tcg/s390x/tcg-target.c.inc
48
+++ b/tcg/s390x/tcg-target.c.inc
49
@@ -XXX,XX +XXX,XX @@ typedef enum S390Opcode {
50
RXY_LLGF = 0xe316,
51
RXY_LLGH = 0xe391,
52
RXY_LMG = 0xeb04,
53
+ RXY_LPQ = 0xe38f,
54
RXY_LRV = 0xe31e,
55
RXY_LRVG = 0xe30f,
56
RXY_LRVH = 0xe31f,
57
@@ -XXX,XX +XXX,XX @@ typedef enum S390Opcode {
58
RXY_STG = 0xe324,
59
RXY_STHY = 0xe370,
60
RXY_STMG = 0xeb24,
61
+ RXY_STPQ = 0xe38e,
62
RXY_STRV = 0xe33e,
63
RXY_STRVG = 0xe32f,
64
RXY_STRVH = 0xe33f,
65
@@ -XXX,XX +XXX,XX @@ typedef struct {
66
67
bool tcg_target_has_memory_bswap(MemOp memop)
68
{
69
- return true;
70
+ TCGAtomAlign aa;
71
+
72
+ if ((memop & MO_SIZE) <= MO_64) {
73
+ return true;
74
+ }
75
+
76
+ /*
77
+ * Reject 16-byte memop with 16-byte atomicity,
78
+ * but do allow a pair of 64-bit operations.
79
+ */
80
+ aa = atom_and_align_for_opc(tcg_ctx, memop, MO_ATOM_IFALIGN, true);
81
+ return aa.atom <= MO_64;
30
}
82
}
31
83
32
-/* Kick the currently round-robin scheduled vCPU */
84
static void tcg_out_qemu_ld_direct(TCGContext *s, MemOp opc, TCGReg data,
33
-static void qemu_cpu_kick_rr_cpu(void)
85
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
34
+/* Kick the currently round-robin scheduled vCPU to next */
35
+static void qemu_cpu_kick_rr_next_cpu(void)
36
{
86
{
37
CPUState *cpu;
87
TCGLabelQemuLdst *ldst = NULL;
38
do {
88
MemOp opc = get_memop(oi);
39
@@ -XXX,XX +XXX,XX @@ static void qemu_cpu_kick_rr_cpu(void)
89
+ MemOp s_bits = opc & MO_SIZE;
40
} while (cpu != atomic_mb_read(&tcg_current_rr_cpu));
90
unsigned a_mask;
91
92
- h->aa = atom_and_align_for_opc(s, opc, MO_ATOM_IFALIGN, false);
93
+ h->aa = atom_and_align_for_opc(s, opc, MO_ATOM_IFALIGN, s_bits == MO_128);
94
a_mask = (1 << h->aa.align) - 1;
95
96
#ifdef CONFIG_SOFTMMU
97
- unsigned s_bits = opc & MO_SIZE;
98
unsigned s_mask = (1 << s_bits) - 1;
99
int mem_index = get_mmuidx(oi);
100
int fast_off = TLB_MASK_TABLE_OFS(mem_index);
101
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_st(TCGContext* s, TCGReg data_reg, TCGReg addr_reg,
102
}
41
}
103
}
42
104
43
+/* Kick all RR vCPUs */
105
+static void tcg_out_qemu_ldst_i128(TCGContext *s, TCGReg datalo, TCGReg datahi,
44
+static void qemu_cpu_kick_rr_cpus(void)
106
+ TCGReg addr_reg, MemOpIdx oi, bool is_ld)
45
+{
107
+{
46
+ CPUState *cpu;
108
+ TCGLabel *l1 = NULL, *l2 = NULL;
47
+
109
+ TCGLabelQemuLdst *ldst;
48
+ CPU_FOREACH(cpu) {
110
+ HostAddress h;
49
+ cpu_exit(cpu);
111
+ bool need_bswap;
50
+ };
112
+ bool use_pair;
113
+ S390Opcode insn;
114
+
115
+ ldst = prepare_host_addr(s, &h, addr_reg, oi, is_ld);
116
+
117
+ use_pair = h.aa.atom < MO_128;
118
+ need_bswap = get_memop(oi) & MO_BSWAP;
119
+
120
+ if (!use_pair) {
121
+ /*
122
+ * Atomicity requires we use LPQ. If we've already checked for
123
+ * 16-byte alignment, that's all we need. If we arrive with
124
+ * lesser alignment, we have determined that less than 16-byte
125
+ * alignment can be satisfied with two 8-byte loads.
126
+ */
127
+ if (h.aa.align < MO_128) {
128
+ use_pair = true;
129
+ l1 = gen_new_label();
130
+ l2 = gen_new_label();
131
+
132
+ tcg_out_insn(s, RI, TMLL, addr_reg, 15);
133
+ tgen_branch(s, 7, l1); /* CC in {1,2,3} */
134
+ }
135
+
136
+ tcg_debug_assert(!need_bswap);
137
+ tcg_debug_assert(datalo & 1);
138
+ tcg_debug_assert(datahi == datalo - 1);
139
+ insn = is_ld ? RXY_LPQ : RXY_STPQ;
140
+ tcg_out_insn_RXY(s, insn, datahi, h.base, h.index, h.disp);
141
+
142
+ if (use_pair) {
143
+ tgen_branch(s, S390_CC_ALWAYS, l2);
144
+ tcg_out_label(s, l1);
145
+ }
146
+ }
147
+ if (use_pair) {
148
+ TCGReg d1, d2;
149
+
150
+ if (need_bswap) {
151
+ d1 = datalo, d2 = datahi;
152
+ insn = is_ld ? RXY_LRVG : RXY_STRVG;
153
+ } else {
154
+ d1 = datahi, d2 = datalo;
155
+ insn = is_ld ? RXY_LG : RXY_STG;
156
+ }
157
+
158
+ if (h.base == d1 || h.index == d1) {
159
+ tcg_out_insn(s, RXY, LAY, TCG_TMP0, h.base, h.index, h.disp);
160
+ h.base = TCG_TMP0;
161
+ h.index = TCG_REG_NONE;
162
+ h.disp = 0;
163
+ }
164
+ tcg_out_insn_RXY(s, insn, d1, h.base, h.index, h.disp);
165
+ tcg_out_insn_RXY(s, insn, d2, h.base, h.index, h.disp + 8);
166
+ }
167
+ if (l2) {
168
+ tcg_out_label(s, l2);
169
+ }
170
+
171
+ if (ldst) {
172
+ ldst->type = TCG_TYPE_I128;
173
+ ldst->datalo_reg = datalo;
174
+ ldst->datahi_reg = datahi;
175
+ ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
176
+ }
51
+}
177
+}
52
+
178
+
53
static void do_nothing(CPUState *cpu, run_on_cpu_data unused)
179
static void tcg_out_exit_tb(TCGContext *s, uintptr_t a0)
54
{
180
{
55
}
181
/* Reuse the zeroing that exists for goto_ptr. */
56
@@ -XXX,XX +XXX,XX @@ void qemu_timer_notify_cb(void *opaque, QEMUClockType type)
182
@@ -XXX,XX +XXX,XX @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
57
static void kick_tcg_thread(void *opaque)
183
case INDEX_op_qemu_st_a64_i64:
58
{
184
tcg_out_qemu_st(s, args[0], args[1], args[2], TCG_TYPE_I64);
59
timer_mod(tcg_kick_vcpu_timer, qemu_tcg_next_kick());
185
break;
60
- qemu_cpu_kick_rr_cpu();
186
+ case INDEX_op_qemu_ld_a32_i128:
61
+ qemu_cpu_kick_rr_next_cpu();
187
+ case INDEX_op_qemu_ld_a64_i128:
62
}
188
+ tcg_out_qemu_ldst_i128(s, args[0], args[1], args[2], args[3], true);
63
189
+ break;
64
static void start_tcg_kick_timer(void)
190
+ case INDEX_op_qemu_st_a32_i128:
65
@@ -XXX,XX +XXX,XX @@ void qemu_cpu_kick(CPUState *cpu)
191
+ case INDEX_op_qemu_st_a64_i128:
66
{
192
+ tcg_out_qemu_ldst_i128(s, args[0], args[1], args[2], args[3], false);
67
qemu_cond_broadcast(cpu->halt_cond);
193
+ break;
68
if (tcg_enabled()) {
194
69
- cpu_exit(cpu);
195
case INDEX_op_ld16s_i64:
70
- /* NOP unless doing single-thread RR */
196
tcg_out_mem(s, 0, RXY_LGH, args[0], args[1], TCG_REG_NONE, args[2]);
71
- qemu_cpu_kick_rr_cpu();
197
@@ -XXX,XX +XXX,XX @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
72
+ if (qemu_tcg_mttcg_enabled()) {
198
case INDEX_op_qemu_st_a32_i32:
73
+ cpu_exit(cpu);
199
case INDEX_op_qemu_st_a64_i32:
74
+ } else {
200
return C_O0_I2(r, r);
75
+ qemu_cpu_kick_rr_cpus();
201
+ case INDEX_op_qemu_ld_a32_i128:
76
+ }
202
+ case INDEX_op_qemu_ld_a64_i128:
77
} else {
203
+ return C_O2_I1(o, m, r);
78
if (hax_enabled()) {
204
+ case INDEX_op_qemu_st_a32_i128:
79
/*
205
+ case INDEX_op_qemu_st_a64_i128:
206
+ return C_O0_I3(o, m, r);
207
208
case INDEX_op_deposit_i32:
209
case INDEX_op_deposit_i64:
80
--
210
--
81
2.17.1
211
2.34.1
82
83
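At runtime, the s390x sequence emitted above boils down to a simple decision. This is a rough restatement in plain C, with a hypothetical helper name, not the backend logic itself: LPQ/STPQ are used only when full 16-byte atomicity is required and the address is 16-byte aligned; otherwise the TMLL test and branch fall through to a pair of 8-byte accesses, which is sufficient because an unaligned access was already allowed at most 8-byte atomicity.

#include <stdbool.h>
#include <stdint.h>

/* The decision made by the TMLL test on the low four address bits. */
static bool must_use_lpq(uintptr_t addr, int atom_bytes)
{
    return atom_bytes == 16 && (addr & 15) == 0;
}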
1
These new instructions are conditional on MSR.FP when TX=0 and
1
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
2
MSR.VEC when TX=1. Since we only care about the Altivec registers,
3
and force TX=1, we can consider these to be Altivec instructions.
4
Since Altivec is true for any use of vector types, we only need
5
to test have_isa_2_07.
6
7
This includes moves to and from the integer registers.
8
9
Reviewed-by: Aleksandar Markovic <amarkovic@wavecomp.com>
10
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
11
---
3
---
12
tcg/ppc/tcg-target.inc.c | 32 ++++++++++++++++++++++++++------
4
.../generic/host/load-extract-al16-al8.h | 45 +++++++++++++++++++
13
1 file changed, 26 insertions(+), 6 deletions(-)
5
accel/tcg/ldst_atomicity.c.inc | 36 +--------------
6
2 files changed, 47 insertions(+), 34 deletions(-)
7
create mode 100644 host/include/generic/host/load-extract-al16-al8.h
14
8
15
diff --git a/tcg/ppc/tcg-target.inc.c b/tcg/ppc/tcg-target.inc.c
9
diff --git a/host/include/generic/host/load-extract-al16-al8.h b/host/include/generic/host/load-extract-al16-al8.h
10
new file mode 100644
11
index XXXXXXX..XXXXXXX
12
--- /dev/null
13
+++ b/host/include/generic/host/load-extract-al16-al8.h
14
@@ -XXX,XX +XXX,XX @@
15
+/*
16
+ * SPDX-License-Identifier: GPL-2.0-or-later
17
+ * Atomic extract 64 from 128-bit, generic version.
18
+ *
19
+ * Copyright (C) 2023 Linaro, Ltd.
20
+ */
21
+
22
+#ifndef HOST_LOAD_EXTRACT_AL16_AL8_H
23
+#define HOST_LOAD_EXTRACT_AL16_AL8_H
24
+
25
+/**
26
+ * load_atom_extract_al16_or_al8:
27
+ * @pv: host address
28
+ * @s: object size in bytes, @s <= 8.
29
+ *
30
+ * Load @s bytes from @pv, when pv % s != 0. If [p, p+s-1] does not
31
+ * cross an 16-byte boundary then the access must be 16-byte atomic,
32
+ * otherwise the access must be 8-byte atomic.
33
+ */
34
+static inline uint64_t ATTRIBUTE_ATOMIC128_OPT
35
+load_atom_extract_al16_or_al8(void *pv, int s)
36
+{
37
+ uintptr_t pi = (uintptr_t)pv;
38
+ int o = pi & 7;
39
+ int shr = (HOST_BIG_ENDIAN ? 16 - s - o : o) * 8;
40
+ Int128 r;
41
+
42
+ pv = (void *)(pi & ~7);
43
+ if (pi & 8) {
44
+ uint64_t *p8 = __builtin_assume_aligned(pv, 16, 8);
45
+ uint64_t a = qatomic_read__nocheck(p8);
46
+ uint64_t b = qatomic_read__nocheck(p8 + 1);
47
+
48
+ if (HOST_BIG_ENDIAN) {
49
+ r = int128_make128(b, a);
50
+ } else {
51
+ r = int128_make128(a, b);
52
+ }
53
+ } else {
54
+ r = atomic16_read_ro(pv);
55
+ }
56
+ return int128_getlo(int128_urshift(r, shr));
57
+}
58
+
59
+#endif /* HOST_LOAD_EXTRACT_AL16_AL8_H */
60
diff --git a/accel/tcg/ldst_atomicity.c.inc b/accel/tcg/ldst_atomicity.c.inc
16
index XXXXXXX..XXXXXXX 100644
61
index XXXXXXX..XXXXXXX 100644
17
--- a/tcg/ppc/tcg-target.inc.c
62
--- a/accel/tcg/ldst_atomicity.c.inc
18
+++ b/tcg/ppc/tcg-target.inc.c
63
+++ b/accel/tcg/ldst_atomicity.c.inc
19
@@ -XXX,XX +XXX,XX @@ static int tcg_target_const_match(tcg_target_long val, TCGType type,
64
@@ -XXX,XX +XXX,XX @@
20
#define XXPERMDI (OPCD(60) | (10 << 3) | 7) /* v2.06, force ax=bx=tx=1 */
65
* See the COPYING file in the top-level directory.
21
#define XXSEL (OPCD(60) | (3 << 4) | 0xf) /* v2.06, force ax=bx=cx=tx=1 */
66
*/
22
67
23
+#define MFVSRD (XO31(51) | 1) /* v2.07, force sx=1 */
68
+#include "host/load-extract-al16-al8.h"
24
+#define MFVSRWZ (XO31(115) | 1) /* v2.07, force sx=1 */
25
+#define MTVSRD (XO31(179) | 1) /* v2.07, force tx=1 */
26
+#define MTVSRWZ (XO31(243) | 1) /* v2.07, force tx=1 */
27
+
69
+
28
#define RT(r) ((r)<<21)
70
#ifdef CONFIG_ATOMIC64
29
#define RS(r) ((r)<<21)
71
# define HAVE_al8 true
30
#define RA(r) ((r)<<16)
72
#else
31
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_mov(TCGContext *s, TCGType type, TCGReg ret, TCGReg arg)
73
@@ -XXX,XX +XXX,XX @@ static uint64_t load_atom_extract_al16_or_exit(CPUArchState *env, uintptr_t ra,
32
tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
74
return int128_getlo(r);
33
/* fallthru */
75
}
34
case TCG_TYPE_I32:
76
35
- if (ret < TCG_REG_V0 && arg < TCG_REG_V0) {
77
-/**
36
- tcg_out32(s, OR | SAB(arg, ret, arg));
78
- * load_atom_extract_al16_or_al8:
37
- break;
79
- * @p: host address
38
- } else if (ret < TCG_REG_V0 || arg < TCG_REG_V0) {
80
- * @s: object size in bytes, @s <= 8.
39
- /* Altivec does not support vector/integer moves. */
81
- *
40
- return false;
82
- * Load @s bytes from @p, when p % s != 0. If [p, p+s-1] does not
41
+ if (ret < TCG_REG_V0) {
83
- * cross an 16-byte boundary then the access must be 16-byte atomic,
42
+ if (arg < TCG_REG_V0) {
84
- * otherwise the access must be 8-byte atomic.
43
+ tcg_out32(s, OR | SAB(arg, ret, arg));
85
- */
44
+ break;
86
-static inline uint64_t ATTRIBUTE_ATOMIC128_OPT
45
+ } else if (have_isa_2_07) {
87
-load_atom_extract_al16_or_al8(void *pv, int s)
46
+ tcg_out32(s, (type == TCG_TYPE_I32 ? MFVSRWZ : MFVSRD)
88
-{
47
+ | VRT(arg) | RA(ret));
89
- uintptr_t pi = (uintptr_t)pv;
48
+ break;
90
- int o = pi & 7;
49
+ } else {
91
- int shr = (HOST_BIG_ENDIAN ? 16 - s - o : o) * 8;
50
+ /* Altivec does not support vector->integer moves. */
92
- Int128 r;
51
+ return false;
93
-
52
+ }
94
- pv = (void *)(pi & ~7);
53
+ } else if (arg < TCG_REG_V0) {
95
- if (pi & 8) {
54
+ if (have_isa_2_07) {
96
- uint64_t *p8 = __builtin_assume_aligned(pv, 16, 8);
55
+ tcg_out32(s, (type == TCG_TYPE_I32 ? MTVSRWZ : MTVSRD)
97
- uint64_t a = qatomic_read__nocheck(p8);
56
+ | VRT(ret) | RA(arg));
98
- uint64_t b = qatomic_read__nocheck(p8 + 1);
57
+ break;
99
-
58
+ } else {
100
- if (HOST_BIG_ENDIAN) {
59
+ /* Altivec does not support integer->vector moves. */
101
- r = int128_make128(b, a);
60
+ return false;
102
- } else {
61
+ }
103
- r = int128_make128(a, b);
62
}
104
- }
63
/* fallthru */
105
- } else {
64
case TCG_TYPE_V64:
106
- r = atomic16_read_ro(pv);
107
- }
108
- return int128_getlo(int128_urshift(r, shr));
109
-}
110
-
111
/**
112
* load_atom_4_by_2:
113
* @pv: host address
65
--
114
--
66
2.17.1
115
2.34.1
67
68
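The extraction in the new generic header is a 128-bit right shift followed by truncation. As a little-endian-only sketch of the same arithmetic on two 64-bit halves (extract_le is a hypothetical name, not part of the patch; the precondition pv % s != 0 guarantees the byte offset o is in 1..7):

#include <stdint.h>

/*
 * Little-endian illustration: a value of size s (1..8 bytes) starts at
 * byte offset o (1..7) inside the 16-byte window read as lo/hi.
 * Shifting the 128-bit pair right by o*8 bits brings it to the low bits.
 */
static uint64_t extract_le(uint64_t lo, uint64_t hi, int o, int s)
{
    int shr = o * 8;
    uint64_t v = (lo >> shr) | (hi << (64 - shr));
    return s == 8 ? v : v & ((UINT64_C(1) << (s * 8)) - 1);
}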
1
Introduce all of the flags required to enable tcg backend vector support,
1
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
2
and a runtime flag to indicate the host supports Altivec instructions.
2
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
3
---
4
host/include/generic/host/store-insert-al16.h | 50 +++++++++++++++++++
5
accel/tcg/ldst_atomicity.c.inc | 40 +--------------
6
2 files changed, 51 insertions(+), 39 deletions(-)
7
create mode 100644 host/include/generic/host/store-insert-al16.h
3
8
4
For now, do not actually set have_isa_altivec to true, because we have not
9
diff --git a/host/include/generic/host/store-insert-al16.h b/host/include/generic/host/store-insert-al16.h
5
yet added all of the code to actually generate all of the required insns.
6
However, we must define these flags in order to disable ifndefs that create
7
stub versions of the functions added here.
8
9
The change to tcg_out_movi works around a buglet in tcg.c wherein if we
10
do not define tcg_out_dupi_vec we get a declared but not defined Werror,
11
but if we only declare it we get a defined but not used Werror. We need
12
to this change to tcg_out_movi eventually anyway, so it's no biggie.
13
14
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
15
Signed-off-by: Aleksandar Markovic <amarkovic@wavecomp.com>
16
---
17
tcg/ppc/tcg-target.h | 25 ++++++++++++++++
18
tcg/ppc/tcg-target.opc.h | 5 ++++
19
tcg/ppc/tcg-target.inc.c | 62 ++++++++++++++++++++++++++++++++++++++--
20
3 files changed, 89 insertions(+), 3 deletions(-)
21
create mode 100644 tcg/ppc/tcg-target.opc.h
22
23
diff --git a/tcg/ppc/tcg-target.h b/tcg/ppc/tcg-target.h
24
index XXXXXXX..XXXXXXX 100644
25
--- a/tcg/ppc/tcg-target.h
26
+++ b/tcg/ppc/tcg-target.h
27
@@ -XXX,XX +XXX,XX @@ typedef enum {
28
} TCGPowerISA;
29
30
extern TCGPowerISA have_isa;
31
+extern bool have_altivec;
32
33
#define have_isa_2_06 (have_isa >= tcg_isa_2_06)
34
#define have_isa_3_00 (have_isa >= tcg_isa_3_00)
35
@@ -XXX,XX +XXX,XX @@ extern TCGPowerISA have_isa;
36
#define TCG_TARGET_HAS_mulsh_i64 1
37
#endif
38
39
+/*
40
+ * While technically Altivec could support V64, it has no 64-bit store
41
+ * instruction and substituting two 32-bit stores makes the generated
42
+ * code quite large.
43
+ */
44
+#define TCG_TARGET_HAS_v64 0
45
+#define TCG_TARGET_HAS_v128 have_altivec
46
+#define TCG_TARGET_HAS_v256 0
47
+
48
+#define TCG_TARGET_HAS_andc_vec 0
49
+#define TCG_TARGET_HAS_orc_vec 0
50
+#define TCG_TARGET_HAS_not_vec 0
51
+#define TCG_TARGET_HAS_neg_vec 0
52
+#define TCG_TARGET_HAS_abs_vec 0
53
+#define TCG_TARGET_HAS_shi_vec 0
54
+#define TCG_TARGET_HAS_shs_vec 0
55
+#define TCG_TARGET_HAS_shv_vec 0
56
+#define TCG_TARGET_HAS_cmp_vec 0
57
+#define TCG_TARGET_HAS_mul_vec 0
58
+#define TCG_TARGET_HAS_sat_vec 0
59
+#define TCG_TARGET_HAS_minmax_vec 0
60
+#define TCG_TARGET_HAS_bitsel_vec 0
61
+#define TCG_TARGET_HAS_cmpsel_vec 0
62
+
63
void flush_icache_range(uintptr_t start, uintptr_t stop);
64
void tb_target_set_jmp_target(uintptr_t, uintptr_t, uintptr_t);
65
66
diff --git a/tcg/ppc/tcg-target.opc.h b/tcg/ppc/tcg-target.opc.h
67
new file mode 100644
10
new file mode 100644
68
index XXXXXXX..XXXXXXX
11
index XXXXXXX..XXXXXXX
69
--- /dev/null
12
--- /dev/null
70
+++ b/tcg/ppc/tcg-target.opc.h
13
+++ b/host/include/generic/host/store-insert-al16.h
71
@@ -XXX,XX +XXX,XX @@
14
@@ -XXX,XX +XXX,XX @@
72
+/*
15
+/*
73
+ * Target-specific opcodes for host vector expansion. These will be
16
+ * SPDX-License-Identifier: GPL-2.0-or-later
74
+ * emitted by tcg_expand_vec_op. For those familiar with GCC internals,
17
+ * Atomic store insert into 128-bit, generic version.
75
+ * consider these to be UNSPEC with names.
18
+ *
19
+ * Copyright (C) 2023 Linaro, Ltd.
76
+ */
20
+ */
77
diff --git a/tcg/ppc/tcg-target.inc.c b/tcg/ppc/tcg-target.inc.c
21
+
78
index XXXXXXX..XXXXXXX 100644
22
+#ifndef HOST_STORE_INSERT_AL16_H
79
--- a/tcg/ppc/tcg-target.inc.c
23
+#define HOST_STORE_INSERT_AL16_H
80
+++ b/tcg/ppc/tcg-target.inc.c
24
+
81
@@ -XXX,XX +XXX,XX @@ static tcg_insn_unit *tb_ret_addr;
25
+/**
82
26
+ * store_atom_insert_al16:
83
TCGPowerISA have_isa;
27
+ * @p: host address
84
static bool have_isel;
28
+ * @val: shifted value to store
85
+bool have_altivec;
29
+ * @msk: mask for value to store
86
30
+ *
87
#ifndef CONFIG_SOFTMMU
31
+ * Atomically store @val to @p masked by @msk.
88
#define TCG_GUEST_BASE_REG 30
32
+ */
89
@@ -XXX,XX +XXX,XX @@ static void tcg_out_movi_int(TCGContext *s, TCGType type, TCGReg ret,
33
+static inline void ATTRIBUTE_ATOMIC128_OPT
90
}
34
+store_atom_insert_al16(Int128 *ps, Int128 val, Int128 msk)
91
}
35
+{
92
36
+#if defined(CONFIG_ATOMIC128)
93
-static inline void tcg_out_movi(TCGContext *s, TCGType type, TCGReg ret,
37
+ __uint128_t *pu;
94
- tcg_target_long arg)
38
+ Int128Alias old, new;
95
+static void tcg_out_dupi_vec(TCGContext *s, TCGType type, TCGReg ret,
39
+
96
+ tcg_target_long val)
40
+ /* With CONFIG_ATOMIC128, we can avoid the memory barriers. */
97
{
41
+ pu = __builtin_assume_aligned(ps, 16);
98
- tcg_out_movi_int(s, type, ret, arg, false);
42
+ old.u = *pu;
99
+ g_assert_not_reached();
43
+ msk = int128_not(msk);
44
+ do {
45
+ new.s = int128_and(old.s, msk);
46
+ new.s = int128_or(new.s, val);
47
+ } while (!__atomic_compare_exchange_n(pu, &old.u, new.u, true,
48
+ __ATOMIC_RELAXED, __ATOMIC_RELAXED));
49
+#else
50
+ Int128 old, new, cmp;
51
+
52
+ ps = __builtin_assume_aligned(ps, 16);
53
+ old = *ps;
54
+ msk = int128_not(msk);
55
+ do {
56
+ cmp = old;
57
+ new = int128_and(old, msk);
58
+ new = int128_or(new, val);
59
+ old = atomic16_cmpxchg(ps, cmp, new);
60
+ } while (int128_ne(cmp, old));
61
+#endif
100
+}
62
+}
101
+
63
+
102
+static void tcg_out_movi(TCGContext *s, TCGType type, TCGReg ret,
64
+#endif /* HOST_STORE_INSERT_AL16_H */
103
+ tcg_target_long arg)
65
diff --git a/accel/tcg/ldst_atomicity.c.inc b/accel/tcg/ldst_atomicity.c.inc
104
+{
66
index XXXXXXX..XXXXXXX 100644
105
+ switch (type) {
67
--- a/accel/tcg/ldst_atomicity.c.inc
106
+ case TCG_TYPE_I32:
68
+++ b/accel/tcg/ldst_atomicity.c.inc
107
+ case TCG_TYPE_I64:
69
@@ -XXX,XX +XXX,XX @@
108
+ tcg_debug_assert(ret < TCG_REG_V0);
70
*/
109
+ tcg_out_movi_int(s, type, ret, arg, false);
71
110
+ break;
72
#include "host/load-extract-al16-al8.h"
111
+
73
+#include "host/store-insert-al16.h"
112
+ case TCG_TYPE_V64:
74
113
+ case TCG_TYPE_V128:
75
#ifdef CONFIG_ATOMIC64
114
+ tcg_debug_assert(ret >= TCG_REG_V0);
76
# define HAVE_al8 true
115
+ tcg_out_dupi_vec(s, type, ret, arg);
77
@@ -XXX,XX +XXX,XX @@ static void store_atom_insert_al8(uint64_t *p, uint64_t val, uint64_t msk)
116
+ break;
78
__ATOMIC_RELAXED, __ATOMIC_RELAXED));
117
+
118
+ default:
119
+ g_assert_not_reached();
120
+ }
121
}
79
}
122
80
123
static bool mask_operand(uint32_t c, int *mb, int *me)
81
-/**
124
@@ -XXX,XX +XXX,XX @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc, const TCGArg *args,
82
- * store_atom_insert_al16:
125
}
83
- * @p: host address
126
}
84
- * @val: shifted value to store
127
85
- * @msk: mask for value to store
128
+int tcg_can_emit_vec_op(TCGOpcode opc, TCGType type, unsigned vece)
86
- *
129
+{
87
- * Atomically store @val to @p masked by @msk.
130
+ g_assert_not_reached();
88
- */
131
+}
89
-static void ATTRIBUTE_ATOMIC128_OPT
132
+
90
-store_atom_insert_al16(Int128 *ps, Int128Alias val, Int128Alias msk)
133
+static bool tcg_out_dup_vec(TCGContext *s, TCGType type, unsigned vece,
91
-{
134
+ TCGReg dst, TCGReg src)
92
-#if defined(CONFIG_ATOMIC128)
135
+{
93
- __uint128_t *pu, old, new;
136
+ g_assert_not_reached();
94
-
137
+}
95
- /* With CONFIG_ATOMIC128, we can avoid the memory barriers. */
138
+
96
- pu = __builtin_assume_aligned(ps, 16);
139
+static bool tcg_out_dupm_vec(TCGContext *s, TCGType type, unsigned vece,
97
- old = *pu;
140
+ TCGReg out, TCGReg base, intptr_t offset)
98
- do {
141
+{
99
- new = (old & ~msk.u) | val.u;
142
+ g_assert_not_reached();
100
- } while (!__atomic_compare_exchange_n(pu, &old, new, true,
143
+}
101
- __ATOMIC_RELAXED, __ATOMIC_RELAXED));
144
+
102
-#elif defined(CONFIG_CMPXCHG128)
145
+static void tcg_out_vec_op(TCGContext *s, TCGOpcode opc,
103
- __uint128_t *pu, old, new;
146
+ unsigned vecl, unsigned vece,
104
-
147
+ const TCGArg *args, const int *const_args)
105
- /*
148
+{
106
- * Without CONFIG_ATOMIC128, __atomic_compare_exchange_n will always
149
+ g_assert_not_reached();
107
- * defer to libatomic, so we must use __sync_*_compare_and_swap_16
150
+}
108
- * and accept the sequential consistency that comes with it.
151
+
109
- */
152
+void tcg_expand_vec_op(TCGOpcode opc, TCGType type, unsigned vece,
110
- pu = __builtin_assume_aligned(ps, 16);
153
+ TCGArg a0, ...)
111
- do {
154
+{
112
- old = *pu;
155
+ g_assert_not_reached();
113
- new = (old & ~msk.u) | val.u;
156
+}
114
- } while (!__sync_bool_compare_and_swap_16(pu, old, new));
157
+
115
-#else
158
static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
116
- qemu_build_not_reached();
159
{
117
-#endif
160
static const TCGTargetOpDef r = { .args_ct_str = { "r" } };
118
-}
161
@@ -XXX,XX +XXX,XX @@ static void tcg_target_init(TCGContext *s)
119
-
162
120
/**
163
tcg_target_available_regs[TCG_TYPE_I32] = 0xffffffff;
121
* store_bytes_leN:
164
tcg_target_available_regs[TCG_TYPE_I64] = 0xffffffff;
122
* @pv: host address
165
+ if (have_altivec) {
166
+ tcg_target_available_regs[TCG_TYPE_V64] = 0xffffffff00000000ull;
167
+ tcg_target_available_regs[TCG_TYPE_V128] = 0xffffffff00000000ull;
168
+ }
169
170
tcg_target_call_clobber_regs = 0;
171
tcg_regset_set_reg(tcg_target_call_clobber_regs, TCG_REG_R0);
172
--
123
--
173
2.17.1
124
2.34.1
174
175
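The masked-insert loop in the new header follows the usual compare-and-swap pattern. Here is a minimal 8-byte analogue in portable C11, parallel to the existing store_atom_insert_al8 visible in the context above; the helper name is made up and this is only an illustration, since the 16-byte version needs host-specific primitives.

#include <stdatomic.h>
#include <stdint.h>

/* Atomically replace the bits of *p selected by msk with those of val. */
static void masked_insert_u64(_Atomic uint64_t *p, uint64_t val, uint64_t msk)
{
    uint64_t old = atomic_load_explicit(p, memory_order_relaxed);
    uint64_t upd;

    do {
        upd = (old & ~msk) | (val & msk);
    } while (!atomic_compare_exchange_weak_explicit(p, &old, upd,
                                                    memory_order_relaxed,
                                                    memory_order_relaxed));
}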
1
These new instructions are conditional only on MSR.VSX and
1
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
2
are thus part of the VSX instruction set, and not Altivec.
3
This includes double-word loads and stores.
4
5
Reviewed-by: Aleksandar Markovic <amarkovic@wavecomp.com>
6
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
7
---
3
---
8
tcg/ppc/tcg-target.inc.c | 11 +++++++++++
4
.../x86_64/host/load-extract-al16-al8.h | 50 +++++++++++++++++++
9
1 file changed, 11 insertions(+)
5
1 file changed, 50 insertions(+)
6
create mode 100644 host/include/x86_64/host/load-extract-al16-al8.h
10
7
11
diff --git a/tcg/ppc/tcg-target.inc.c b/tcg/ppc/tcg-target.inc.c
8
diff --git a/host/include/x86_64/host/load-extract-al16-al8.h b/host/include/x86_64/host/load-extract-al16-al8.h
12
index XXXXXXX..XXXXXXX 100644
9
new file mode 100644
13
--- a/tcg/ppc/tcg-target.inc.c
10
index XXXXXXX..XXXXXXX
14
+++ b/tcg/ppc/tcg-target.inc.c
11
--- /dev/null
15
@@ -XXX,XX +XXX,XX @@ static int tcg_target_const_match(tcg_target_long val, TCGType type,
12
+++ b/host/include/x86_64/host/load-extract-al16-al8.h
16
#define LVEWX XO31(71)
13
@@ -XXX,XX +XXX,XX @@
17
#define LXSDX (XO31(588) | 1) /* v2.06, force tx=1 */
14
+/*
18
#define LXVDSX (XO31(332) | 1) /* v2.06, force tx=1 */
15
+ * SPDX-License-Identifier: GPL-2.0-or-later
19
+#define LXSIWZX (XO31(12) | 1) /* v2.07, force tx=1 */
16
+ * Atomic extract 64 from 128-bit, x86_64 version.
20
17
+ *
21
#define STVX XO31(231)
18
+ * Copyright (C) 2023 Linaro, Ltd.
22
#define STVEWX XO31(199)
19
+ */
23
#define STXSDX (XO31(716) | 1) /* v2.06, force sx=1 */
20
+
24
+#define STXSIWX (XO31(140) | 1) /* v2.07, force sx=1 */
21
+#ifndef X86_64_LOAD_EXTRACT_AL16_AL8_H
25
22
+#define X86_64_LOAD_EXTRACT_AL16_AL8_H
26
#define VADDSBS VX4(768)
23
+
27
#define VADDUBS VX4(512)
24
+#ifdef CONFIG_INT128_TYPE
28
@@ -XXX,XX +XXX,XX @@ static void tcg_out_ld(TCGContext *s, TCGType type, TCGReg ret,
25
+#include "host/cpuinfo.h"
29
tcg_out_mem_long(s, LWZ, LWZX, ret, base, offset);
26
+
30
break;
27
+/**
31
}
28
+ * load_atom_extract_al16_or_al8:
32
+ if (have_isa_2_07 && have_vsx) {
29
+ * @pv: host address
33
+ tcg_out_mem_long(s, 0, LXSIWZX, ret, base, offset);
30
+ * @s: object size in bytes, @s <= 8.
34
+ break;
31
+ *
35
+ }
32
+ * Load @s bytes from @pv, when pv % s != 0. If [p, p+s-1] does not
36
tcg_debug_assert((offset & 3) == 0);
33
+ * cross an 16-byte boundary then the access must be 16-byte atomic,
37
tcg_out_mem_long(s, 0, LVEWX, ret, base, offset);
34
+ * otherwise the access must be 8-byte atomic.
38
shift = (offset - 4) & 0xc;
35
+ */
39
@@ -XXX,XX +XXX,XX @@ static void tcg_out_st(TCGContext *s, TCGType type, TCGReg arg,
36
+static inline uint64_t ATTRIBUTE_ATOMIC128_OPT
40
tcg_out_mem_long(s, STW, STWX, arg, base, offset);
37
+load_atom_extract_al16_or_al8(void *pv, int s)
41
break;
38
+{
42
}
39
+ uintptr_t pi = (uintptr_t)pv;
43
+ if (have_isa_2_07 && have_vsx) {
40
+ __int128_t *ptr_align = (__int128_t *)(pi & ~7);
44
+ tcg_out_mem_long(s, 0, STXSIWX, arg, base, offset);
41
+ int shr = (pi & 7) * 8;
45
+ break;
42
+ Int128Alias r;
46
+ }
43
+
47
+ assert((offset & 3) == 0);
44
+ /*
48
tcg_debug_assert((offset & 3) == 0);
45
+ * ptr_align % 16 is now only 0 or 8.
49
shift = (offset - 4) & 0xc;
46
+ * If the host supports atomic loads with VMOVDQU, then always use that,
50
if (shift) {
47
+ * making the branch highly predictable. Otherwise we must use VMOVDQA
48
+ * when ptr_align % 16 == 0 for 16-byte atomicity.
49
+ */
50
+ if ((cpuinfo & CPUINFO_ATOMIC_VMOVDQU) || (pi & 8)) {
51
+ asm("vmovdqu %1, %0" : "=x" (r.i) : "m" (*ptr_align));
52
+ } else {
53
+ asm("vmovdqa %1, %0" : "=x" (r.i) : "m" (*ptr_align));
54
+ }
55
+ return int128_getlo(int128_urshift(r.s, shr));
56
+}
57
+#else
58
+/* Fallback definition that must be optimized away, or error. */
59
+uint64_t QEMU_ERROR("unsupported atomic")
60
+ load_atom_extract_al16_or_al8(void *pv, int s);
61
+#endif
62
+
63
+#endif /* X86_64_LOAD_EXTRACT_AL16_AL8_H */
51
--
64
--
52
2.17.1
65
2.34.1
53
54
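For the x86_64 header above, the key choice is between VMOVDQA and VMOVDQU. A simplified intrinsic-level sketch of that choice follows; the function and flag names are made up, and whether VMOVDQU is actually a single atomic access depends on the CPU, which is what the CPUINFO_ATOMIC_VMOVDQU bit in the hunk encodes.

#include <emmintrin.h>
#include <stdint.h>

/* Load 16 bytes: the aligned form is only legal when the address is
 * 16-byte aligned; the unaligned form is preferred whenever the CPU
 * guarantees it is still a single atomic access. */
static __m128i load16_sketch(const void *p, int cpu_atomic_vmovdqu)
{
    if (cpu_atomic_vmovdqu || ((uintptr_t)p & 15)) {
        return _mm_loadu_si128((const __m128i *)p);
    }
    return _mm_load_si128((const __m128i *)p);
}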
1
Now that we have implemented the required tcg operations,
1
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
2
we can enable detection of host vector support.
3
4
Tested-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk> (PPC32)
5
Reviewed-by: Aleksandar Markovic <amarkovic@wavecomp.com>
6
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
7
---
3
---
8
tcg/ppc/tcg-target.inc.c | 4 ++++
4
.../aarch64/host/load-extract-al16-al8.h | 40 +++++++++++++++++++
9
1 file changed, 4 insertions(+)
5
1 file changed, 40 insertions(+)
6
create mode 100644 host/include/aarch64/host/load-extract-al16-al8.h
10
7
11
diff --git a/tcg/ppc/tcg-target.inc.c b/tcg/ppc/tcg-target.inc.c
8
diff --git a/host/include/aarch64/host/load-extract-al16-al8.h b/host/include/aarch64/host/load-extract-al16-al8.h
12
index XXXXXXX..XXXXXXX 100644
9
new file mode 100644
13
--- a/tcg/ppc/tcg-target.inc.c
10
index XXXXXXX..XXXXXXX
14
+++ b/tcg/ppc/tcg-target.inc.c
11
--- /dev/null
15
@@ -XXX,XX +XXX,XX @@ static void tcg_target_init(TCGContext *s)
12
+++ b/host/include/aarch64/host/load-extract-al16-al8.h
16
have_isel = have_isa_2_06;
13
@@ -XXX,XX +XXX,XX @@
17
#endif
14
+/*
18
15
+ * SPDX-License-Identifier: GPL-2.0-or-later
19
+ if (hwcap & PPC_FEATURE_HAS_ALTIVEC) {
16
+ * Atomic extract 64 from 128-bit, AArch64 version.
20
+ have_altivec = true;
17
+ *
21
+ }
18
+ * Copyright (C) 2023 Linaro, Ltd.
19
+ */
22
+
20
+
23
tcg_target_available_regs[TCG_TYPE_I32] = 0xffffffff;
21
+#ifndef AARCH64_LOAD_EXTRACT_AL16_AL8_H
24
tcg_target_available_regs[TCG_TYPE_I64] = 0xffffffff;
22
+#define AARCH64_LOAD_EXTRACT_AL16_AL8_H
25
if (have_altivec) {
23
+
24
+#include "host/cpuinfo.h"
25
+#include "tcg/debug-assert.h"
26
+
27
+/**
28
+ * load_atom_extract_al16_or_al8:
29
+ * @pv: host address
30
+ * @s: object size in bytes, @s <= 8.
31
+ *
32
+ * Load @s bytes from @pv, when pv % s != 0. If [p, p+s-1] does not
33
+ * cross an 16-byte boundary then the access must be 16-byte atomic,
34
+ * otherwise the access must be 8-byte atomic.
35
+ */
36
+static inline uint64_t load_atom_extract_al16_or_al8(void *pv, int s)
37
+{
38
+ uintptr_t pi = (uintptr_t)pv;
39
+ __int128_t *ptr_align = (__int128_t *)(pi & ~7);
40
+ int shr = (pi & 7) * 8;
41
+ uint64_t l, h;
42
+
43
+ /*
44
+ * With FEAT_LSE2, LDP is single-copy atomic if 16-byte aligned
45
+ * and single-copy atomic on the parts if 8-byte aligned.
46
+ * All we need do is align the pointer mod 8.
47
+ */
48
+ tcg_debug_assert(HAVE_ATOMIC128_RO);
49
+ asm("ldp %0, %1, %2" : "=r"(l), "=r"(h) : "m"(*ptr_align));
50
+ return (l >> shr) | (h << (-shr & 63));
51
+}
52
+
53
+#endif /* AARCH64_LOAD_EXTRACT_AL16_AL8_H */
26
--
54
--
27
2.17.1
55
2.34.1
28
29
1
For Altivec, this is done via vector shift by vector,
1
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
2
and loading the immediate into a register.
2
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
3
---
4
host/include/aarch64/host/store-insert-al16.h | 47 +++++++++++++++++++
5
1 file changed, 47 insertions(+)
6
create mode 100644 host/include/aarch64/host/store-insert-al16.h
3
7
4
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
8
diff --git a/host/include/aarch64/host/store-insert-al16.h b/host/include/aarch64/host/store-insert-al16.h
5
Signed-off-by: Aleksandar Markovic <amarkovic@wavecomp.com>
9
new file mode 100644
6
---
10
index XXXXXXX..XXXXXXX
7
tcg/ppc/tcg-target.h | 2 +-
11
--- /dev/null
8
tcg/ppc/tcg-target.inc.c | 58 ++++++++++++++++++++++++++++++++++++++--
12
+++ b/host/include/aarch64/host/store-insert-al16.h
9
2 files changed, 57 insertions(+), 3 deletions(-)
13
@@ -XXX,XX +XXX,XX @@
10
14
+/*
11
diff --git a/tcg/ppc/tcg-target.h b/tcg/ppc/tcg-target.h
15
+ * SPDX-License-Identifier: GPL-2.0-or-later
12
index XXXXXXX..XXXXXXX 100644
16
+ * Atomic store insert into 128-bit, AArch64 version.
13
--- a/tcg/ppc/tcg-target.h
17
+ *
14
+++ b/tcg/ppc/tcg-target.h
18
+ * Copyright (C) 2023 Linaro, Ltd.
15
@@ -XXX,XX +XXX,XX @@ extern bool have_altivec;
19
+ */
16
#define TCG_TARGET_HAS_abs_vec 0
17
#define TCG_TARGET_HAS_shi_vec 0
18
#define TCG_TARGET_HAS_shs_vec 0
19
-#define TCG_TARGET_HAS_shv_vec 0
20
+#define TCG_TARGET_HAS_shv_vec 1
21
#define TCG_TARGET_HAS_cmp_vec 1
22
#define TCG_TARGET_HAS_mul_vec 0
23
#define TCG_TARGET_HAS_sat_vec 1
24
diff --git a/tcg/ppc/tcg-target.inc.c b/tcg/ppc/tcg-target.inc.c
25
index XXXXXXX..XXXXXXX 100644
26
--- a/tcg/ppc/tcg-target.inc.c
27
+++ b/tcg/ppc/tcg-target.inc.c
28
@@ -XXX,XX +XXX,XX @@ static int tcg_target_const_match(tcg_target_long val, TCGType type,
29
#define VCMPGTUH VX4(582)
30
#define VCMPGTUW VX4(646)
31
32
+#define VSLB VX4(260)
33
+#define VSLH VX4(324)
34
+#define VSLW VX4(388)
35
+#define VSRB VX4(516)
36
+#define VSRH VX4(580)
37
+#define VSRW VX4(644)
38
+#define VSRAB VX4(772)
39
+#define VSRAH VX4(836)
40
+#define VSRAW VX4(900)
41
+
20
+
42
#define VAND VX4(1028)
21
+#ifndef AARCH64_STORE_INSERT_AL16_H
43
#define VANDC VX4(1092)
22
+#define AARCH64_STORE_INSERT_AL16_H
44
#define VNOR VX4(1284)
23
+
45
@@ -XXX,XX +XXX,XX @@ int tcg_can_emit_vec_op(TCGOpcode opc, TCGType type, unsigned vece)
24
+/**
46
case INDEX_op_sssub_vec:
25
+ * store_atom_insert_al16:
47
case INDEX_op_usadd_vec:
26
+ * @p: host address
48
case INDEX_op_ussub_vec:
27
+ * @val: shifted value to store
49
+ case INDEX_op_shlv_vec:
28
+ * @msk: mask for value to store
50
+ case INDEX_op_shrv_vec:
29
+ *
51
+ case INDEX_op_sarv_vec:
30
+ * Atomically store @val to @p masked by @msk.
52
return vece <= MO_32;
31
+ */
53
case INDEX_op_cmp_vec:
32
+static inline void ATTRIBUTE_ATOMIC128_OPT
54
+ case INDEX_op_shli_vec:
33
+store_atom_insert_al16(Int128 *ps, Int128 val, Int128 msk)
55
+ case INDEX_op_shri_vec:
56
+ case INDEX_op_sari_vec:
57
return vece <= MO_32 ? -1 : 0;
58
default:
59
return 0;
60
@@ -XXX,XX +XXX,XX @@ static void tcg_out_vec_op(TCGContext *s, TCGOpcode opc,
61
umin_op[4] = { VMINUB, VMINUH, VMINUW, 0 },
62
smin_op[4] = { VMINSB, VMINSH, VMINSW, 0 },
63
umax_op[4] = { VMAXUB, VMAXUH, VMAXUW, 0 },
64
- smax_op[4] = { VMAXSB, VMAXSH, VMAXSW, 0 };
65
+ smax_op[4] = { VMAXSB, VMAXSH, VMAXSW, 0 },
66
+ shlv_op[4] = { VSLB, VSLH, VSLW, 0 },
67
+ shrv_op[4] = { VSRB, VSRH, VSRW, 0 },
68
+ sarv_op[4] = { VSRAB, VSRAH, VSRAW, 0 };
69
70
TCGType type = vecl + TCG_TYPE_V64;
71
TCGArg a0 = args[0], a1 = args[1], a2 = args[2];
72
@@ -XXX,XX +XXX,XX @@ static void tcg_out_vec_op(TCGContext *s, TCGOpcode opc,
73
case INDEX_op_umax_vec:
74
insn = umax_op[vece];
75
break;
76
+ case INDEX_op_shlv_vec:
77
+ insn = shlv_op[vece];
78
+ break;
79
+ case INDEX_op_shrv_vec:
80
+ insn = shrv_op[vece];
81
+ break;
82
+ case INDEX_op_sarv_vec:
83
+ insn = sarv_op[vece];
84
+ break;
85
case INDEX_op_and_vec:
86
insn = VAND;
87
break;
88
@@ -XXX,XX +XXX,XX @@ static void tcg_out_vec_op(TCGContext *s, TCGOpcode opc,
89
tcg_out32(s, insn | VRT(a0) | VRA(a1) | VRB(a2));
90
}
91
92
+static void expand_vec_shi(TCGType type, unsigned vece, TCGv_vec v0,
93
+ TCGv_vec v1, TCGArg imm, TCGOpcode opci)
94
+{
34
+{
95
+ TCGv_vec t1 = tcg_temp_new_vec(type);
35
+ /*
36
+ * GCC only implements __sync* primitives for int128 on aarch64.
37
+ * We can do better without the barriers, and integrating the
38
+ * arithmetic into the load-exclusive/store-conditional pair.
39
+ */
40
+ uint64_t tl, th, vl, vh, ml, mh;
41
+ uint32_t fail;
96
+
42
+
97
+ /* Splat w/bytes for xxspltib. */
43
+ qemu_build_assert(!HOST_BIG_ENDIAN);
98
+ tcg_gen_dupi_vec(MO_8, t1, imm & ((8 << vece) - 1));
44
+ vl = int128_getlo(val);
99
+ vec_gen_3(opci, type, vece, tcgv_vec_arg(v0),
45
+ vh = int128_gethi(val);
100
+ tcgv_vec_arg(v1), tcgv_vec_arg(t1));
46
+ ml = int128_getlo(msk);
101
+ tcg_temp_free_vec(t1);
47
+ mh = int128_gethi(msk);
48
+
49
+ asm("0: ldxp %[l], %[h], %[mem]\n\t"
50
+ "bic %[l], %[l], %[ml]\n\t"
51
+ "bic %[h], %[h], %[mh]\n\t"
52
+ "orr %[l], %[l], %[vl]\n\t"
53
+ "orr %[h], %[h], %[vh]\n\t"
54
+ "stxp %w[f], %[l], %[h], %[mem]\n\t"
55
+ "cbnz %w[f], 0b\n"
56
+ : [mem] "+Q"(*ps), [f] "=&r"(fail), [l] "=&r"(tl), [h] "=&r"(th)
57
+ : [vl] "r"(vl), [vh] "r"(vh), [ml] "r"(ml), [mh] "r"(mh));
102
+}
58
+}
103
+
59
+
104
static void expand_vec_cmp(TCGType type, unsigned vece, TCGv_vec v0,
60
+#endif /* AARCH64_STORE_INSERT_AL16_H */
105
TCGv_vec v1, TCGv_vec v2, TCGCond cond)
106
{
107
@@ -XXX,XX +XXX,XX @@ void tcg_expand_vec_op(TCGOpcode opc, TCGType type, unsigned vece,
108
{
109
va_list va;
110
TCGv_vec v0, v1, v2;
111
+ TCGArg a2;
112
113
va_start(va, a0);
114
v0 = temp_tcgv_vec(arg_temp(a0));
115
v1 = temp_tcgv_vec(arg_temp(va_arg(va, TCGArg)));
116
- v2 = temp_tcgv_vec(arg_temp(va_arg(va, TCGArg)));
117
+ a2 = va_arg(va, TCGArg);
118
119
switch (opc) {
120
+ case INDEX_op_shli_vec:
121
+ expand_vec_shi(type, vece, v0, v1, a2, INDEX_op_shlv_vec);
122
+ break;
123
+ case INDEX_op_shri_vec:
124
+ expand_vec_shi(type, vece, v0, v1, a2, INDEX_op_shrv_vec);
125
+ break;
126
+ case INDEX_op_sari_vec:
127
+ expand_vec_shi(type, vece, v0, v1, a2, INDEX_op_sarv_vec);
128
+ break;
129
case INDEX_op_cmp_vec:
130
+ v2 = temp_tcgv_vec(arg_temp(a2));
131
expand_vec_cmp(type, vece, v0, v1, v2, va_arg(va, TCGArg));
132
break;
133
default:
134
@@ -XXX,XX +XXX,XX @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
135
case INDEX_op_smin_vec:
136
case INDEX_op_umax_vec:
137
case INDEX_op_umin_vec:
138
+ case INDEX_op_shlv_vec:
139
+ case INDEX_op_shrv_vec:
140
+ case INDEX_op_sarv_vec:
141
return &v_v_v;
142
case INDEX_op_not_vec:
143
case INDEX_op_dup_vec:
144
--
61
--
145
2.17.1
62
2.34.1
146
147
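The shift-by-immediate expansion above (splat the count with a byte dup, then shift by vector) can be pictured with GCC's generic vector extensions. This is a sketch only, with a hypothetical helper name, not the TCG expansion itself; it mirrors the MO_8 case, where the count is masked to the lane width.

/* 16 lanes of 8 bits, as with VSLB: the count is splatted into every
 * lane and an element-wise shift-by-vector does the rest. */
typedef unsigned char v16qi __attribute__((vector_size(16)));

static v16qi shli_by_splat(v16qi v, unsigned imm)
{
    unsigned char c = imm & 7;   /* count modulo the lane width */
    v16qi cnt = { c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c };
    return v << cnt;             /* vector shift by vector */
}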
1
Altivec supports 32 128-bit vector registers, whose names are
1
The last use was removed by e77c89fb086a.
2
by convention v0 through v31.
3
2
3
Fixes: e77c89fb086a ("cputlb: Remove static tlb sizing")
4
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
4
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
5
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
5
Signed-off-by: Aleksandar Markovic <amarkovic@wavecomp.com>
6
---
6
---
7
tcg/ppc/tcg-target.h | 11 ++++-
7
tcg/aarch64/tcg-target.h | 1 -
8
tcg/ppc/tcg-target.inc.c | 88 +++++++++++++++++++++++++---------------
8
tcg/arm/tcg-target.h | 1 -
9
2 files changed, 65 insertions(+), 34 deletions(-)
9
tcg/i386/tcg-target.h | 1 -
10
tcg/mips/tcg-target.h | 1 -
11
tcg/ppc/tcg-target.h | 1 -
12
tcg/riscv/tcg-target.h | 1 -
13
tcg/s390x/tcg-target.h | 1 -
14
tcg/sparc64/tcg-target.h | 1 -
15
tcg/tci/tcg-target.h | 1 -
16
9 files changed, 9 deletions(-)
10
17
18
diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h
19
index XXXXXXX..XXXXXXX 100644
20
--- a/tcg/aarch64/tcg-target.h
21
+++ b/tcg/aarch64/tcg-target.h
22
@@ -XXX,XX +XXX,XX @@
23
#include "host/cpuinfo.h"
24
25
#define TCG_TARGET_INSN_UNIT_SIZE 4
26
-#define TCG_TARGET_TLB_DISPLACEMENT_BITS 24
27
#define MAX_CODE_GEN_BUFFER_SIZE ((size_t)-1)
28
29
typedef enum {
30
diff --git a/tcg/arm/tcg-target.h b/tcg/arm/tcg-target.h
31
index XXXXXXX..XXXXXXX 100644
32
--- a/tcg/arm/tcg-target.h
33
+++ b/tcg/arm/tcg-target.h
34
@@ -XXX,XX +XXX,XX @@ extern int arm_arch;
35
#define use_armv7_instructions (__ARM_ARCH >= 7 || arm_arch >= 7)
36
37
#define TCG_TARGET_INSN_UNIT_SIZE 4
38
-#define TCG_TARGET_TLB_DISPLACEMENT_BITS 16
39
#define MAX_CODE_GEN_BUFFER_SIZE UINT32_MAX
40
41
typedef enum {
42
diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
43
index XXXXXXX..XXXXXXX 100644
44
--- a/tcg/i386/tcg-target.h
45
+++ b/tcg/i386/tcg-target.h
46
@@ -XXX,XX +XXX,XX @@
47
#include "host/cpuinfo.h"
48
49
#define TCG_TARGET_INSN_UNIT_SIZE 1
50
-#define TCG_TARGET_TLB_DISPLACEMENT_BITS 31
51
52
#ifdef __x86_64__
53
# define TCG_TARGET_REG_BITS 64
54
diff --git a/tcg/mips/tcg-target.h b/tcg/mips/tcg-target.h
55
index XXXXXXX..XXXXXXX 100644
56
--- a/tcg/mips/tcg-target.h
57
+++ b/tcg/mips/tcg-target.h
58
@@ -XXX,XX +XXX,XX @@
59
#endif
60
61
#define TCG_TARGET_INSN_UNIT_SIZE 4
62
-#define TCG_TARGET_TLB_DISPLACEMENT_BITS 16
63
#define TCG_TARGET_NB_REGS 32
64
65
#define MAX_CODE_GEN_BUFFER_SIZE ((size_t)-1)
11
diff --git a/tcg/ppc/tcg-target.h b/tcg/ppc/tcg-target.h
66
diff --git a/tcg/ppc/tcg-target.h b/tcg/ppc/tcg-target.h
12
index XXXXXXX..XXXXXXX 100644
67
index XXXXXXX..XXXXXXX 100644
13
--- a/tcg/ppc/tcg-target.h
68
--- a/tcg/ppc/tcg-target.h
14
+++ b/tcg/ppc/tcg-target.h
69
+++ b/tcg/ppc/tcg-target.h
15
@@ -XXX,XX +XXX,XX @@
70
@@ -XXX,XX +XXX,XX @@
16
# define TCG_TARGET_REG_BITS 32
71
17
#endif
72
#define TCG_TARGET_NB_REGS 64
18
19
-#define TCG_TARGET_NB_REGS 32
20
+#define TCG_TARGET_NB_REGS 64
21
#define TCG_TARGET_INSN_UNIT_SIZE 4
73
#define TCG_TARGET_INSN_UNIT_SIZE 4
22
#define TCG_TARGET_TLB_DISPLACEMENT_BITS 16
74
-#define TCG_TARGET_TLB_DISPLACEMENT_BITS 16
23
75
24
@@ -XXX,XX +XXX,XX @@ typedef enum {
76
typedef enum {
25
TCG_REG_R24, TCG_REG_R25, TCG_REG_R26, TCG_REG_R27,
77
TCG_REG_R0, TCG_REG_R1, TCG_REG_R2, TCG_REG_R3,
26
TCG_REG_R28, TCG_REG_R29, TCG_REG_R30, TCG_REG_R31,
78
diff --git a/tcg/riscv/tcg-target.h b/tcg/riscv/tcg-target.h
27
28
+ TCG_REG_V0, TCG_REG_V1, TCG_REG_V2, TCG_REG_V3,
29
+ TCG_REG_V4, TCG_REG_V5, TCG_REG_V6, TCG_REG_V7,
30
+ TCG_REG_V8, TCG_REG_V9, TCG_REG_V10, TCG_REG_V11,
31
+ TCG_REG_V12, TCG_REG_V13, TCG_REG_V14, TCG_REG_V15,
32
+ TCG_REG_V16, TCG_REG_V17, TCG_REG_V18, TCG_REG_V19,
33
+ TCG_REG_V20, TCG_REG_V21, TCG_REG_V22, TCG_REG_V23,
34
+ TCG_REG_V24, TCG_REG_V25, TCG_REG_V26, TCG_REG_V27,
35
+ TCG_REG_V28, TCG_REG_V29, TCG_REG_V30, TCG_REG_V31,
36
+
37
TCG_REG_CALL_STACK = TCG_REG_R1,
38
TCG_AREG0 = TCG_REG_R27
39
} TCGReg;
40
diff --git a/tcg/ppc/tcg-target.inc.c b/tcg/ppc/tcg-target.inc.c
41
index XXXXXXX..XXXXXXX 100644
79
index XXXXXXX..XXXXXXX 100644
42
--- a/tcg/ppc/tcg-target.inc.c
80
--- a/tcg/riscv/tcg-target.h
43
+++ b/tcg/ppc/tcg-target.inc.c
81
+++ b/tcg/riscv/tcg-target.h
44
@@ -XXX,XX +XXX,XX @@
82
@@ -XXX,XX +XXX,XX @@
45
# define TCG_REG_TMP1 TCG_REG_R12
83
#define TCG_TARGET_REG_BITS 64
46
#endif
84
47
85
#define TCG_TARGET_INSN_UNIT_SIZE 4
48
+#define TCG_VEC_TMP1 TCG_REG_V0
86
-#define TCG_TARGET_TLB_DISPLACEMENT_BITS 20
49
+#define TCG_VEC_TMP2 TCG_REG_V1
87
#define TCG_TARGET_NB_REGS 32
50
+
88
#define MAX_CODE_GEN_BUFFER_SIZE ((size_t)-1)
51
#define TCG_REG_TB TCG_REG_R31
89
52
#define USE_REG_TB (TCG_TARGET_REG_BITS == 64)
90
diff --git a/tcg/s390x/tcg-target.h b/tcg/s390x/tcg-target.h
53
91
index XXXXXXX..XXXXXXX 100644
54
@@ -XXX,XX +XXX,XX @@ bool have_isa_3_00;
92
--- a/tcg/s390x/tcg-target.h
55
#endif
93
+++ b/tcg/s390x/tcg-target.h
56
94
@@ -XXX,XX +XXX,XX @@
57
#ifdef CONFIG_DEBUG_TCG
95
#define S390_TCG_TARGET_H
58
-static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
96
59
- "r0",
97
#define TCG_TARGET_INSN_UNIT_SIZE 2
60
- "r1",
98
-#define TCG_TARGET_TLB_DISPLACEMENT_BITS 19
61
- "r2",
99
62
- "r3",
100
/* We have a +- 4GB range on the branches; leave some slop. */
63
- "r4",
101
#define MAX_CODE_GEN_BUFFER_SIZE (3 * GiB)
64
- "r5",
102
diff --git a/tcg/sparc64/tcg-target.h b/tcg/sparc64/tcg-target.h
65
- "r6",
103
index XXXXXXX..XXXXXXX 100644
66
- "r7",
104
--- a/tcg/sparc64/tcg-target.h
67
- "r8",
105
+++ b/tcg/sparc64/tcg-target.h
68
- "r9",
106
@@ -XXX,XX +XXX,XX @@
69
- "r10",
107
#define SPARC_TCG_TARGET_H
70
- "r11",
108
71
- "r12",
109
#define TCG_TARGET_INSN_UNIT_SIZE 4
72
- "r13",
110
-#define TCG_TARGET_TLB_DISPLACEMENT_BITS 32
73
- "r14",
111
#define TCG_TARGET_NB_REGS 32
74
- "r15",
112
#define MAX_CODE_GEN_BUFFER_SIZE (2 * GiB)
75
- "r16",
113
76
- "r17",
114
diff --git a/tcg/tci/tcg-target.h b/tcg/tci/tcg-target.h
77
- "r18",
115
index XXXXXXX..XXXXXXX 100644
78
- "r19",
116
--- a/tcg/tci/tcg-target.h
79
- "r20",
117
+++ b/tcg/tci/tcg-target.h
80
- "r21",
118
@@ -XXX,XX +XXX,XX @@
81
- "r22",
119
82
- "r23",
120
#define TCG_TARGET_INTERPRETER 1
83
- "r24",
121
#define TCG_TARGET_INSN_UNIT_SIZE 4
84
- "r25",
122
-#define TCG_TARGET_TLB_DISPLACEMENT_BITS 32
85
- "r26",
123
#define MAX_CODE_GEN_BUFFER_SIZE ((size_t)-1)
86
- "r27",
124
87
- "r28",
125
#if UINTPTR_MAX == UINT32_MAX
88
- "r29",
89
- "r30",
90
- "r31"
91
+static const char tcg_target_reg_names[TCG_TARGET_NB_REGS][4] = {
92
+ "r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7",
93
+ "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15",
94
+ "r16", "r17", "r18", "r19", "r20", "r21", "r22", "r23",
95
+ "r24", "r25", "r26", "r27", "r28", "r29", "r30", "r31",
96
+ "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7",
97
+ "v8", "v9", "v10", "v11", "v12", "v13", "v14", "v15",
98
+ "v16", "v17", "v18", "v19", "v20", "v21", "v22", "v23",
99
+ "v24", "v25", "v26", "v27", "v28", "v29", "v30", "v31",
100
};
101
#endif
102
103
@@ -XXX,XX +XXX,XX @@ static const int tcg_target_reg_alloc_order[] = {
104
TCG_REG_R5,
105
TCG_REG_R4,
106
TCG_REG_R3,
107
+
108
+ /* V0 and V1 reserved as temporaries; V20 - V31 are call-saved */
109
+ TCG_REG_V2, /* call clobbered, vectors */
110
+ TCG_REG_V3,
111
+ TCG_REG_V4,
112
+ TCG_REG_V5,
113
+ TCG_REG_V6,
114
+ TCG_REG_V7,
115
+ TCG_REG_V8,
116
+ TCG_REG_V9,
117
+ TCG_REG_V10,
118
+ TCG_REG_V11,
119
+ TCG_REG_V12,
120
+ TCG_REG_V13,
121
+ TCG_REG_V14,
122
+ TCG_REG_V15,
123
+ TCG_REG_V16,
124
+ TCG_REG_V17,
125
+ TCG_REG_V18,
126
+ TCG_REG_V19,
127
};
128
129
static const int tcg_target_call_iarg_regs[] = {
130
@@ -XXX,XX +XXX,XX @@ static void tcg_target_init(TCGContext *s)
131
tcg_regset_set_reg(tcg_target_call_clobber_regs, TCG_REG_R11);
132
tcg_regset_set_reg(tcg_target_call_clobber_regs, TCG_REG_R12);
133
134
+ tcg_regset_set_reg(tcg_target_call_clobber_regs, TCG_REG_V0);
135
+ tcg_regset_set_reg(tcg_target_call_clobber_regs, TCG_REG_V1);
136
+ tcg_regset_set_reg(tcg_target_call_clobber_regs, TCG_REG_V2);
137
+ tcg_regset_set_reg(tcg_target_call_clobber_regs, TCG_REG_V3);
138
+ tcg_regset_set_reg(tcg_target_call_clobber_regs, TCG_REG_V4);
139
+ tcg_regset_set_reg(tcg_target_call_clobber_regs, TCG_REG_V5);
140
+ tcg_regset_set_reg(tcg_target_call_clobber_regs, TCG_REG_V6);
141
+ tcg_regset_set_reg(tcg_target_call_clobber_regs, TCG_REG_V7);
142
+ tcg_regset_set_reg(tcg_target_call_clobber_regs, TCG_REG_V8);
143
+ tcg_regset_set_reg(tcg_target_call_clobber_regs, TCG_REG_V9);
144
+ tcg_regset_set_reg(tcg_target_call_clobber_regs, TCG_REG_V10);
145
+ tcg_regset_set_reg(tcg_target_call_clobber_regs, TCG_REG_V11);
146
+ tcg_regset_set_reg(tcg_target_call_clobber_regs, TCG_REG_V12);
147
+ tcg_regset_set_reg(tcg_target_call_clobber_regs, TCG_REG_V13);
148
+ tcg_regset_set_reg(tcg_target_call_clobber_regs, TCG_REG_V14);
149
+ tcg_regset_set_reg(tcg_target_call_clobber_regs, TCG_REG_V15);
150
+ tcg_regset_set_reg(tcg_target_call_clobber_regs, TCG_REG_V16);
151
+ tcg_regset_set_reg(tcg_target_call_clobber_regs, TCG_REG_V17);
152
+ tcg_regset_set_reg(tcg_target_call_clobber_regs, TCG_REG_V18);
153
+ tcg_regset_set_reg(tcg_target_call_clobber_regs, TCG_REG_V19);
154
+
155
s->reserved_regs = 0;
156
tcg_regset_set_reg(s->reserved_regs, TCG_REG_R0); /* tcg temp */
157
tcg_regset_set_reg(s->reserved_regs, TCG_REG_R1); /* stack pointer */
158
@@ -XXX,XX +XXX,XX @@ static void tcg_target_init(TCGContext *s)
159
tcg_regset_set_reg(s->reserved_regs, TCG_REG_R13); /* thread pointer */
160
#endif
161
tcg_regset_set_reg(s->reserved_regs, TCG_REG_TMP1); /* mem temp */
162
+ tcg_regset_set_reg(s->reserved_regs, TCG_VEC_TMP1);
163
+ tcg_regset_set_reg(s->reserved_regs, TCG_VEC_TMP2);
164
if (USE_REG_TB) {
165
tcg_regset_set_reg(s->reserved_regs, TCG_REG_TB); /* tb->tc_ptr */
166
}
167
--
126
--
168
2.17.1
127
2.34.1
169
128
170
129
1
Add support for vector saturated add/subtract using Altivec
1
Invert the exit code, for use with the testsuite.
2
instructions:
3
VADDSBS, VADDSHS, VADDSWS, VADDUBS, VADDUHS, VADDUWS, and
4
VSUBSBS, VSUBSHS, VSUBSWS, VSUBUBS, VSUBUHS, VSUBUWS.
5
2
6
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
3
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
7
Signed-off-by: Aleksandar Markovic <amarkovic@wavecomp.com>
8
---
4
---
9
tcg/ppc/tcg-target.h | 2 +-
5
scripts/decodetree.py | 9 +++++++--
10
tcg/ppc/tcg-target.inc.c | 36 ++++++++++++++++++++++++++++++++++++
6
1 file changed, 7 insertions(+), 2 deletions(-)
11
2 files changed, 37 insertions(+), 1 deletion(-)
12
7
13
diff --git a/tcg/ppc/tcg-target.h b/tcg/ppc/tcg-target.h
8
diff --git a/scripts/decodetree.py b/scripts/decodetree.py
14
index XXXXXXX..XXXXXXX 100644
9
index XXXXXXX..XXXXXXX 100644
15
--- a/tcg/ppc/tcg-target.h
10
--- a/scripts/decodetree.py
16
+++ b/tcg/ppc/tcg-target.h
11
+++ b/scripts/decodetree.py
17
@@ -XXX,XX +XXX,XX @@ extern bool have_altivec;
12
@@ -XXX,XX +XXX,XX @@
18
#define TCG_TARGET_HAS_shv_vec 0
13
formats = {}
19
#define TCG_TARGET_HAS_cmp_vec 1
14
allpatterns = []
20
#define TCG_TARGET_HAS_mul_vec 0
15
anyextern = False
21
-#define TCG_TARGET_HAS_sat_vec 0
16
+testforerror = False
22
+#define TCG_TARGET_HAS_sat_vec 1
17
23
#define TCG_TARGET_HAS_minmax_vec 1
18
translate_prefix = 'trans'
24
#define TCG_TARGET_HAS_bitsel_vec 0
19
translate_scope = 'static '
25
#define TCG_TARGET_HAS_cmpsel_vec 0
20
@@ -XXX,XX +XXX,XX @@ def error_with_file(file, lineno, *args):
26
diff --git a/tcg/ppc/tcg-target.inc.c b/tcg/ppc/tcg-target.inc.c
21
if output_file and output_fd:
27
index XXXXXXX..XXXXXXX 100644
22
output_fd.close()
28
--- a/tcg/ppc/tcg-target.inc.c
23
os.remove(output_file)
29
+++ b/tcg/ppc/tcg-target.inc.c
24
- exit(1)
30
@@ -XXX,XX +XXX,XX @@ static int tcg_target_const_match(tcg_target_long val, TCGType type,
25
+ exit(0 if testforerror else 1)
31
#define STVX XO31(231)
26
# end error_with_file
32
#define STVEWX XO31(199)
27
33
28
34
+#define VADDSBS VX4(768)
29
@@ -XXX,XX +XXX,XX @@ def main():
35
+#define VADDUBS VX4(512)
30
global bitop_width
36
#define VADDUBM VX4(0)
31
global variablewidth
37
+#define VADDSHS VX4(832)
32
global anyextern
38
+#define VADDUHS VX4(576)
33
+ global testforerror
39
#define VADDUHM VX4(64)
34
40
+#define VADDSWS VX4(896)
35
decode_scope = 'static '
41
+#define VADDUWS VX4(640)
36
42
#define VADDUWM VX4(128)
37
long_opts = ['decode=', 'translate=', 'output=', 'insnwidth=',
43
38
- 'static-decode=', 'varinsnwidth=']
44
+#define VSUBSBS VX4(1792)
39
+ 'static-decode=', 'varinsnwidth=', 'test-for-error']
45
+#define VSUBUBS VX4(1536)
40
try:
46
#define VSUBUBM VX4(1024)
41
(opts, args) = getopt.gnu_getopt(sys.argv[1:], 'o:vw:', long_opts)
47
+#define VSUBSHS VX4(1856)
42
except getopt.GetoptError as err:
48
+#define VSUBUHS VX4(1600)
43
@@ -XXX,XX +XXX,XX @@ def main():
49
#define VSUBUHM VX4(1088)
44
bitop_width = 64
50
+#define VSUBSWS VX4(1920)
45
elif insnwidth != 32:
51
+#define VSUBUWS VX4(1664)
46
error(0, 'cannot handle insns of width', insnwidth)
52
#define VSUBUWM VX4(1152)
47
+ elif o == '--test-for-error':
53
48
+ testforerror = True
54
#define VMAXSB VX4(258)
49
else:
55
@@ -XXX,XX +XXX,XX @@ int tcg_can_emit_vec_op(TCGOpcode opc, TCGType type, unsigned vece)
50
assert False, 'unhandled option'
56
case INDEX_op_smin_vec:
51
57
case INDEX_op_umax_vec:
52
@@ -XXX,XX +XXX,XX @@ def main():
58
case INDEX_op_umin_vec:
53
59
+ case INDEX_op_ssadd_vec:
54
if output_file:
60
+ case INDEX_op_sssub_vec:
55
output_fd.close()
61
+ case INDEX_op_usadd_vec:
56
+ exit(1 if testforerror else 0)
62
+ case INDEX_op_ussub_vec:
57
# end main
63
return vece <= MO_32;
58
64
case INDEX_op_cmp_vec:
59
65
return vece <= MO_32 ? -1 : 0;
66
@@ -XXX,XX +XXX,XX @@ static void tcg_out_vec_op(TCGContext *s, TCGOpcode opc,
67
eq_op[4] = { VCMPEQUB, VCMPEQUH, VCMPEQUW, 0 },
68
gts_op[4] = { VCMPGTSB, VCMPGTSH, VCMPGTSW, 0 },
69
gtu_op[4] = { VCMPGTUB, VCMPGTUH, VCMPGTUW, 0 },
70
+ ssadd_op[4] = { VADDSBS, VADDSHS, VADDSWS, 0 },
71
+ usadd_op[4] = { VADDUBS, VADDUHS, VADDUWS, 0 },
72
+ sssub_op[4] = { VSUBSBS, VSUBSHS, VSUBSWS, 0 },
73
+ ussub_op[4] = { VSUBUBS, VSUBUHS, VSUBUWS, 0 },
74
umin_op[4] = { VMINUB, VMINUH, VMINUW, 0 },
75
smin_op[4] = { VMINSB, VMINSH, VMINSW, 0 },
76
umax_op[4] = { VMAXUB, VMAXUH, VMAXUW, 0 },
77
@@ -XXX,XX +XXX,XX @@ static void tcg_out_vec_op(TCGContext *s, TCGOpcode opc,
78
case INDEX_op_sub_vec:
79
insn = sub_op[vece];
80
break;
81
+ case INDEX_op_ssadd_vec:
82
+ insn = ssadd_op[vece];
83
+ break;
84
+ case INDEX_op_sssub_vec:
85
+ insn = sssub_op[vece];
86
+ break;
87
+ case INDEX_op_usadd_vec:
88
+ insn = usadd_op[vece];
89
+ break;
90
+ case INDEX_op_ussub_vec:
91
+ insn = ussub_op[vece];
92
+ break;
93
case INDEX_op_smin_vec:
94
insn = smin_op[vece];
95
break;
96
@@ -XXX,XX +XXX,XX @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
97
case INDEX_op_andc_vec:
98
case INDEX_op_orc_vec:
99
case INDEX_op_cmp_vec:
100
+ case INDEX_op_ssadd_vec:
101
+ case INDEX_op_sssub_vec:
102
+ case INDEX_op_usadd_vec:
103
+ case INDEX_op_ussub_vec:
104
case INDEX_op_smax_vec:
105
case INDEX_op_smin_vec:
106
case INDEX_op_umax_vec:
107
--
60
--
108
2.17.1
61
2.34.1
109
110
diff view generated by jsdifflib
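The new --test-for-error switch simply inverts decodetree.py's exit status, so the err_*.decode inputs can be driven as ordinary tests that are expected to fail. A minimal sketch of just that inversion, assuming nothing about the rest of the script's option handling:

    # Sketch of the --test-for-error exit-status inversion only;
    # decodetree.py itself parses many more options.
    import sys

    def finish(had_error, test_for_error):
        # Normally an error exits non-zero; under --test-for-error the
        # meaning flips: an error is the expected outcome (exit 0) and
        # a clean run is the failure (exit 1).
        if had_error:
            return 0 if test_for_error else 1
        return 1 if test_for_error else 0

    if __name__ == '__main__':
        test_for_error = '--test-for-error' in sys.argv[1:]
        # Pretend parsing failed; a harness treats exit 0 as "pass".
        sys.exit(finish(True, test_for_error))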
1
Add support for vector add/subtract using Altivec instructions:
1
Fix two copy-paste errors when walking the parse tree: prop_format() and build_tree() each recursed into the other by mistake.
2
VADDUBM, VADDUHM, VADDUWM, VSUBUBM, VSUBUHM, VSUBUWM.
3
2
4
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
3
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
5
Signed-off-by: Aleksandar Markovic <amarkovic@wavecomp.com>
6
---
4
---
7
tcg/ppc/tcg-target.inc.c | 20 ++++++++++++++++++++
5
scripts/decodetree.py | 4 ++--
8
1 file changed, 20 insertions(+)
6
1 file changed, 2 insertions(+), 2 deletions(-)
9
7
10
diff --git a/tcg/ppc/tcg-target.inc.c b/tcg/ppc/tcg-target.inc.c
8
diff --git a/scripts/decodetree.py b/scripts/decodetree.py
11
index XXXXXXX..XXXXXXX 100644
9
index XXXXXXX..XXXXXXX 100644
12
--- a/tcg/ppc/tcg-target.inc.c
10
--- a/scripts/decodetree.py
13
+++ b/tcg/ppc/tcg-target.inc.c
11
+++ b/scripts/decodetree.py
14
@@ -XXX,XX +XXX,XX @@ static int tcg_target_const_match(tcg_target_long val, TCGType type,
12
@@ -XXX,XX +XXX,XX @@ def build_tree(self):
15
#define STVX XO31(231)
13
16
#define STVEWX XO31(199)
14
def prop_format(self):
17
15
for p in self.pats:
18
+#define VADDUBM VX4(0)
16
- p.build_tree()
19
+#define VADDUHM VX4(64)
17
+ p.prop_format()
20
+#define VADDUWM VX4(128)
18
21
+
19
def prop_width(self):
22
+#define VSUBUBM VX4(1024)
20
width = None
23
+#define VSUBUHM VX4(1088)
21
@@ -XXX,XX +XXX,XX @@ def __build_tree(pats, outerbits, outermask):
24
+#define VSUBUWM VX4(1152)
22
return t
25
+
23
26
#define VMAXSB VX4(258)
24
def build_tree(self):
27
#define VMAXSH VX4(322)
25
- super().prop_format()
28
#define VMAXSW VX4(386)
26
+ super().build_tree()
29
@@ -XXX,XX +XXX,XX @@ int tcg_can_emit_vec_op(TCGOpcode opc, TCGType type, unsigned vece)
27
self.tree = self.__build_tree(self.pats, self.fixedbits,
30
case INDEX_op_andc_vec:
28
self.fixedmask)
31
case INDEX_op_not_vec:
29
32
return 1;
33
+ case INDEX_op_add_vec:
34
+ case INDEX_op_sub_vec:
35
case INDEX_op_smax_vec:
36
case INDEX_op_smin_vec:
37
case INDEX_op_umax_vec:
38
@@ -XXX,XX +XXX,XX @@ static void tcg_out_vec_op(TCGContext *s, TCGOpcode opc,
39
const TCGArg *args, const int *const_args)
40
{
41
static const uint32_t
42
+ add_op[4] = { VADDUBM, VADDUHM, VADDUWM, 0 },
43
+ sub_op[4] = { VSUBUBM, VSUBUHM, VSUBUWM, 0 },
44
eq_op[4] = { VCMPEQUB, VCMPEQUH, VCMPEQUW, 0 },
45
gts_op[4] = { VCMPGTSB, VCMPGTSH, VCMPGTSW, 0 },
46
gtu_op[4] = { VCMPGTUB, VCMPGTUH, VCMPGTUW, 0 },
47
@@ -XXX,XX +XXX,XX @@ static void tcg_out_vec_op(TCGContext *s, TCGOpcode opc,
48
tcg_out_dupm_vec(s, type, vece, a0, a1, a2);
49
return;
50
51
+ case INDEX_op_add_vec:
52
+ insn = add_op[vece];
53
+ break;
54
+ case INDEX_op_sub_vec:
55
+ insn = sub_op[vece];
56
+ break;
57
case INDEX_op_smin_vec:
58
insn = smin_op[vece];
59
break;
60
@@ -XXX,XX +XXX,XX @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
61
return (TCG_TARGET_REG_BITS == 64 ? &S_S
62
: TARGET_LONG_BITS == 32 ? &S_S_S : &S_S_S_S);
63
64
+ case INDEX_op_add_vec:
65
+ case INDEX_op_sub_vec:
66
case INDEX_op_and_vec:
67
case INDEX_op_or_vec:
68
case INDEX_op_xor_vec:
69
--
30
--
70
2.17.1
31
2.34.1
71
72
diff view generated by jsdifflib
1
Add support for vector maximum/minimum using Altivec instructions
1
Test err_pattern_group_empty.decode failed with exception:
2
VMAXSB, VMAXSH, VMAXSW, VMAXUB, VMAXUH, VMAXUW, and
2
3
VMINSB, VMINSH, VMINSW, VMINUB, VMINUH, VMINUW.
3
Traceback (most recent call last):
4
File "./scripts/decodetree.py", line 1424, in <module> main()
5
File "./scripts/decodetree.py", line 1342, in main toppat.build_tree()
6
File "./scripts/decodetree.py", line 627, in build_tree
7
self.tree = self.__build_tree(self.pats, self.fixedbits,
8
File "./scripts/decodetree.py", line 607, in __build_tree
9
fb = i.fixedbits & innermask
10
TypeError: unsupported operand type(s) for &: 'NoneType' and 'int'
4
11
5
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
12
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
6
Signed-off-by: Aleksandar Markovic <amarkovic@wavecomp.com>
7
---
13
---
8
tcg/ppc/tcg-target.h | 2 +-
14
scripts/decodetree.py | 6 ++++++
9
tcg/ppc/tcg-target.inc.c | 40 +++++++++++++++++++++++++++++++++++++++-
15
1 file changed, 6 insertions(+)
10
2 files changed, 40 insertions(+), 2 deletions(-)
11
16
12
diff --git a/tcg/ppc/tcg-target.h b/tcg/ppc/tcg-target.h
17
diff --git a/scripts/decodetree.py b/scripts/decodetree.py
13
index XXXXXXX..XXXXXXX 100644
18
index XXXXXXX..XXXXXXX 100644
14
--- a/tcg/ppc/tcg-target.h
19
--- a/scripts/decodetree.py
15
+++ b/tcg/ppc/tcg-target.h
20
+++ b/scripts/decodetree.py
16
@@ -XXX,XX +XXX,XX @@ extern bool have_altivec;
21
@@ -XXX,XX +XXX,XX @@ def output_code(self, i, extracted, outerbits, outermask):
17
#define TCG_TARGET_HAS_cmp_vec 1
22
output(ind, '}\n')
18
#define TCG_TARGET_HAS_mul_vec 0
23
else:
19
#define TCG_TARGET_HAS_sat_vec 0
24
p.output_code(i, extracted, p.fixedbits, p.fixedmask)
20
-#define TCG_TARGET_HAS_minmax_vec 0
21
+#define TCG_TARGET_HAS_minmax_vec 1
22
#define TCG_TARGET_HAS_bitsel_vec 0
23
#define TCG_TARGET_HAS_cmpsel_vec 0
24
25
diff --git a/tcg/ppc/tcg-target.inc.c b/tcg/ppc/tcg-target.inc.c
26
index XXXXXXX..XXXXXXX 100644
27
--- a/tcg/ppc/tcg-target.inc.c
28
+++ b/tcg/ppc/tcg-target.inc.c
29
@@ -XXX,XX +XXX,XX @@ static int tcg_target_const_match(tcg_target_long val, TCGType type,
30
#define STVX XO31(231)
31
#define STVEWX XO31(199)
32
33
+#define VMAXSB VX4(258)
34
+#define VMAXSH VX4(322)
35
+#define VMAXSW VX4(386)
36
+#define VMAXUB VX4(2)
37
+#define VMAXUH VX4(66)
38
+#define VMAXUW VX4(130)
39
+#define VMINSB VX4(770)
40
+#define VMINSH VX4(834)
41
+#define VMINSW VX4(898)
42
+#define VMINUB VX4(514)
43
+#define VMINUH VX4(578)
44
+#define VMINUW VX4(642)
45
+
25
+
46
#define VCMPEQUB VX4(6)
26
+ def build_tree(self):
47
#define VCMPEQUH VX4(70)
27
+ if not self.pats:
48
#define VCMPEQUW VX4(134)
28
+ error_with_file(self.file, self.lineno, 'empty pattern group')
49
@@ -XXX,XX +XXX,XX @@ int tcg_can_emit_vec_op(TCGOpcode opc, TCGType type, unsigned vece)
29
+ super().build_tree()
50
case INDEX_op_andc_vec:
30
+
51
case INDEX_op_not_vec:
31
#end IncMultiPattern
52
return 1;
32
53
+ case INDEX_op_smax_vec:
33
54
+ case INDEX_op_smin_vec:
55
+ case INDEX_op_umax_vec:
56
+ case INDEX_op_umin_vec:
57
+ return vece <= MO_32;
58
case INDEX_op_cmp_vec:
59
return vece <= MO_32 ? -1 : 0;
60
default:
61
@@ -XXX,XX +XXX,XX @@ static void tcg_out_vec_op(TCGContext *s, TCGOpcode opc,
62
static const uint32_t
63
eq_op[4] = { VCMPEQUB, VCMPEQUH, VCMPEQUW, 0 },
64
gts_op[4] = { VCMPGTSB, VCMPGTSH, VCMPGTSW, 0 },
65
- gtu_op[4] = { VCMPGTUB, VCMPGTUH, VCMPGTUW, 0 };
66
+ gtu_op[4] = { VCMPGTUB, VCMPGTUH, VCMPGTUW, 0 },
67
+ umin_op[4] = { VMINUB, VMINUH, VMINUW, 0 },
68
+ smin_op[4] = { VMINSB, VMINSH, VMINSW, 0 },
69
+ umax_op[4] = { VMAXUB, VMAXUH, VMAXUW, 0 },
70
+ smax_op[4] = { VMAXSB, VMAXSH, VMAXSW, 0 };
71
72
TCGType type = vecl + TCG_TYPE_V64;
73
TCGArg a0 = args[0], a1 = args[1], a2 = args[2];
74
@@ -XXX,XX +XXX,XX @@ static void tcg_out_vec_op(TCGContext *s, TCGOpcode opc,
75
tcg_out_dupm_vec(s, type, vece, a0, a1, a2);
76
return;
77
78
+ case INDEX_op_smin_vec:
79
+ insn = smin_op[vece];
80
+ break;
81
+ case INDEX_op_umin_vec:
82
+ insn = umin_op[vece];
83
+ break;
84
+ case INDEX_op_smax_vec:
85
+ insn = smax_op[vece];
86
+ break;
87
+ case INDEX_op_umax_vec:
88
+ insn = umax_op[vece];
89
+ break;
90
case INDEX_op_and_vec:
91
insn = VAND;
92
break;
93
@@ -XXX,XX +XXX,XX @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
94
case INDEX_op_andc_vec:
95
case INDEX_op_orc_vec:
96
case INDEX_op_cmp_vec:
97
+ case INDEX_op_smax_vec:
98
+ case INDEX_op_smin_vec:
99
+ case INDEX_op_umax_vec:
100
+ case INDEX_op_umin_vec:
101
return &v_v_v;
102
case INDEX_op_not_vec:
103
case INDEX_op_dup_vec:
104
--
34
--
105
2.17.1
35
2.34.1
106
107
diff view generated by jsdifflib
1
Introduce macros VRT(), VRA(), VRB(), VRC() used for encoding
1
Do not try to remove an output file under /dev, nor report any PermissionError on the remove.
2
elements of Altivec instructions.
2
The primary purpose is testing with -o /dev/null.
3
3
4
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
4
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
5
Signed-off-by: Aleksandar Markovic <amarkovic@wavecomp.com>
6
---
5
---
7
tcg/ppc/tcg-target.inc.c | 5 +++++
6
scripts/decodetree.py | 7 ++++++-
8
1 file changed, 5 insertions(+)
7
1 file changed, 6 insertions(+), 1 deletion(-)
9
8
10
diff --git a/tcg/ppc/tcg-target.inc.c b/tcg/ppc/tcg-target.inc.c
9
diff --git a/scripts/decodetree.py b/scripts/decodetree.py
11
index XXXXXXX..XXXXXXX 100644
10
index XXXXXXX..XXXXXXX 100644
12
--- a/tcg/ppc/tcg-target.inc.c
11
--- a/scripts/decodetree.py
13
+++ b/tcg/ppc/tcg-target.inc.c
12
+++ b/scripts/decodetree.py
14
@@ -XXX,XX +XXX,XX @@ static int tcg_target_const_match(tcg_target_long val, TCGType type,
13
@@ -XXX,XX +XXX,XX @@ def error_with_file(file, lineno, *args):
15
#define MB64(b) ((b)<<5)
14
16
#define FXM(b) (1 << (19 - (b)))
15
if output_file and output_fd:
17
16
output_fd.close()
18
+#define VRT(r) (((r) & 31) << 21)
17
- os.remove(output_file)
19
+#define VRA(r) (((r) & 31) << 16)
18
+ # Do not try to remove e.g. -o /dev/null
20
+#define VRB(r) (((r) & 31) << 11)
19
+ if not output_file.startswith("/dev"):
21
+#define VRC(r) (((r) & 31) << 6)
20
+ try:
22
+
21
+ os.remove(output_file)
23
#define LK 1
22
+ except PermissionError:
24
23
+ pass
25
#define TAB(t, a, b) (RT(t) | RA(a) | RB(b))
24
exit(0 if testforerror else 1)
25
# end error_with_file
26
26
--
27
--
27
2.17.1
28
2.34.1
28
29
diff view generated by jsdifflib
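The VRT()/VRA()/VRB()/VRC() macros above place Altivec register numbers into their VX/VA-form bit positions. As a rough illustration only, the following Python sketch assembles a complete instruction word for vaddubm; it assumes the backend's usual OPCD(opc) = opc << 26 primary-opcode macro (not shown in this hunk), while VX4() and the VADDUBM extended opcode come from later patches in the series:

    # Illustrative only: how the VRT/VRA/VRB macros compose a VX-form
    # Altivec instruction word.  OPCD(opc) == opc << 26 is assumed.

    def OPCD(opc):
        return opc << 26          # primary opcode, bits 0..5

    def VX4(opc):
        return OPCD(4) | opc      # VX-form: primary opcode 4 + extended opcode

    def VRT(r): return (r & 31) << 21
    def VRA(r): return (r & 31) << 16
    def VRB(r): return (r & 31) << 11

    VADDUBM = VX4(0)              # extended opcode 0, from the series

    def vaddubm(vrt, vra, vrb):
        # Encode "vaddubm vrt, vra, vrb" as a 32-bit instruction word.
        return VADDUBM | VRT(vrt) | VRA(vra) | VRB(vrb)

    print(hex(vaddubm(1, 2, 3)))  # 0x10221800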
1
This is identical to have_isa_2_06, so replace it.
2
3
Reviewed-by: Aleksandar Markovic <amarkovic@wavecomp.com>
4
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
1
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
5
---
2
---
6
tcg/ppc/tcg-target.inc.c | 5 ++---
3
tests/decode/check.sh | 24 ----------------
7
1 file changed, 2 insertions(+), 3 deletions(-)
4
tests/decode/meson.build | 59 ++++++++++++++++++++++++++++++++++++++++
5
tests/meson.build | 5 +---
6
3 files changed, 60 insertions(+), 28 deletions(-)
7
delete mode 100755 tests/decode/check.sh
8
create mode 100644 tests/decode/meson.build
8
9
9
diff --git a/tcg/ppc/tcg-target.inc.c b/tcg/ppc/tcg-target.inc.c
10
diff --git a/tests/decode/check.sh b/tests/decode/check.sh
11
deleted file mode 100755
12
index XXXXXXX..XXXXXXX
13
--- a/tests/decode/check.sh
14
+++ /dev/null
15
@@ -XXX,XX +XXX,XX @@
16
-#!/bin/sh
17
-# This work is licensed under the terms of the GNU LGPL, version 2 or later.
18
-# See the COPYING.LIB file in the top-level directory.
19
-
20
-PYTHON=$1
21
-DECODETREE=$2
22
-E=0
23
-
24
-# All of these tests should produce errors
25
-for i in err_*.decode; do
26
- if $PYTHON $DECODETREE $i > /dev/null 2> /dev/null; then
27
- # Pass, aka failed to fail.
28
- echo FAIL: $i 1>&2
29
- E=1
30
- fi
31
-done
32
-
33
-for i in succ_*.decode; do
34
- if ! $PYTHON $DECODETREE $i > /dev/null 2> /dev/null; then
35
- echo FAIL:$i 1>&2
36
- fi
37
-done
38
-
39
-exit $E
40
diff --git a/tests/decode/meson.build b/tests/decode/meson.build
41
new file mode 100644
42
index XXXXXXX..XXXXXXX
43
--- /dev/null
44
+++ b/tests/decode/meson.build
45
@@ -XXX,XX +XXX,XX @@
46
+err_tests = [
47
+ 'err_argset1.decode',
48
+ 'err_argset2.decode',
49
+ 'err_field1.decode',
50
+ 'err_field2.decode',
51
+ 'err_field3.decode',
52
+ 'err_field4.decode',
53
+ 'err_field5.decode',
54
+ 'err_field6.decode',
55
+ 'err_init1.decode',
56
+ 'err_init2.decode',
57
+ 'err_init3.decode',
58
+ 'err_init4.decode',
59
+ 'err_overlap1.decode',
60
+ 'err_overlap2.decode',
61
+ 'err_overlap3.decode',
62
+ 'err_overlap4.decode',
63
+ 'err_overlap5.decode',
64
+ 'err_overlap6.decode',
65
+ 'err_overlap7.decode',
66
+ 'err_overlap8.decode',
67
+ 'err_overlap9.decode',
68
+ 'err_pattern_group_empty.decode',
69
+ 'err_pattern_group_ident1.decode',
70
+ 'err_pattern_group_ident2.decode',
71
+ 'err_pattern_group_nest1.decode',
72
+ 'err_pattern_group_nest2.decode',
73
+ 'err_pattern_group_nest3.decode',
74
+ 'err_pattern_group_overlap1.decode',
75
+ 'err_width1.decode',
76
+ 'err_width2.decode',
77
+ 'err_width3.decode',
78
+ 'err_width4.decode',
79
+]
80
+
81
+succ_tests = [
82
+ 'succ_argset_type1.decode',
83
+ 'succ_function.decode',
84
+ 'succ_ident1.decode',
85
+ 'succ_pattern_group_nest1.decode',
86
+ 'succ_pattern_group_nest2.decode',
87
+ 'succ_pattern_group_nest3.decode',
88
+ 'succ_pattern_group_nest4.decode',
89
+]
90
+
91
+suite = 'decodetree'
92
+decodetree = find_program(meson.project_source_root() / 'scripts/decodetree.py')
93
+
94
+foreach t: err_tests
95
+ test(fs.replace_suffix(t, ''),
96
+ decodetree, args: ['-o', '/dev/null', '--test-for-error', files(t)],
97
+ suite: suite)
98
+endforeach
99
+
100
+foreach t: succ_tests
101
+ test(fs.replace_suffix(t, ''),
102
+ decodetree, args: ['-o', '/dev/null', files(t)],
103
+ suite: suite)
104
+endforeach
105
diff --git a/tests/meson.build b/tests/meson.build
10
index XXXXXXX..XXXXXXX 100644
106
index XXXXXXX..XXXXXXX 100644
11
--- a/tcg/ppc/tcg-target.inc.c
107
--- a/tests/meson.build
12
+++ b/tcg/ppc/tcg-target.inc.c
108
+++ b/tests/meson.build
13
@@ -XXX,XX +XXX,XX @@ static tcg_insn_unit *tb_ret_addr;
109
@@ -XXX,XX +XXX,XX @@ if have_tools and have_vhost_user and 'CONFIG_LINUX' in config_host
14
110
dependencies: [qemuutil, vhost_user])
15
TCGPowerISA have_isa;
111
endif
16
112
17
-#define HAVE_ISA_2_06 have_isa_2_06
113
-test('decodetree', sh,
18
#define HAVE_ISEL have_isa_2_06
114
- args: [ files('decode/check.sh'), config_host['PYTHON'], files('../scripts/decodetree.py') ],
19
115
- workdir: meson.current_source_dir() / 'decode',
20
#ifndef CONFIG_SOFTMMU
116
- suite: 'decodetree')
21
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, bool is_64)
117
+subdir('decode')
22
}
118
23
} else {
119
if 'CONFIG_TCG' in config_all
24
uint32_t insn = qemu_ldx_opc[opc & (MO_BSWAP | MO_SSIZE)];
120
subdir('fp')
25
- if (!HAVE_ISA_2_06 && insn == LDBRX) {
26
+ if (!have_isa_2_06 && insn == LDBRX) {
27
tcg_out32(s, ADDI | TAI(TCG_REG_R0, addrlo, 4));
28
tcg_out32(s, LWBRX | TAB(datalo, rbase, addrlo));
29
tcg_out32(s, LWBRX | TAB(TCG_REG_R0, rbase, TCG_REG_R0));
30
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is_64)
31
}
32
} else {
33
uint32_t insn = qemu_stx_opc[opc & (MO_BSWAP | MO_SIZE)];
34
- if (!HAVE_ISA_2_06 && insn == STDBRX) {
35
+ if (!have_isa_2_06 && insn == STDBRX) {
36
tcg_out32(s, STWBRX | SAB(datalo, rbase, addrlo));
37
tcg_out32(s, ADDI | TAI(TCG_REG_TMP1, addrlo, 4));
38
tcg_out_shri64(s, TCG_REG_R0, datalo, 32);
39
--
121
--
40
2.17.1
122
2.34.1
41
42
diff view generated by jsdifflib
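The meson rules above invoke decodetree.py directly with -o /dev/null, adding --test-for-error for the err_* inputs. The sketch below reproduces one such invocation by hand; the file paths are examples rather than anything fixed by the patch:

    # Hand-run equivalent of one meson test; the paths are examples.
    import subprocess
    import sys

    def run_decode_test(decodetree, decode_file, expect_error=False):
        args = [sys.executable, decodetree, '-o', '/dev/null']
        if expect_error:
            args.append('--test-for-error')  # exit 0 only if parsing fails
        args.append(decode_file)
        return subprocess.run(args).returncode == 0

    if __name__ == '__main__':
        ok = run_decode_test('scripts/decodetree.py',
                             'tests/decode/err_pattern_group_empty.decode',
                             expect_error=True)
        print('PASS' if ok else 'FAIL')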
New patch
1
From: Peter Maydell <peter.maydell@linaro.org>
1
2
3
Document the named field syntax that we want to implement for the
4
decodetree script. This allows a field to be defined in terms of
5
some other field that the instruction pattern has already set, for
6
example:
7
8
%sz_imm 10:3 sz:3 !function=expand_sz_imm
9
10
to allow a function to be passed both an immediate field from the
11
instruction and also a sz value which might have been specified by
12
the instruction pattern directly (sz=1, etc) rather than being a
13
simple field within the instruction.
14
15
Note that the restriction on not having the format referring to the
16
pattern and the pattern referring to the format simultaneously is a
17
restriction of the decoder generator rather than inherently being a
18
silly thing to do.
19
20
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
21
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
22
Message-Id: <20230523120447.728365-3-peter.maydell@linaro.org>
23
---
24
docs/devel/decodetree.rst | 33 ++++++++++++++++++++++++++++-----
25
1 file changed, 28 insertions(+), 5 deletions(-)
26
27
diff --git a/docs/devel/decodetree.rst b/docs/devel/decodetree.rst
28
index XXXXXXX..XXXXXXX 100644
29
--- a/docs/devel/decodetree.rst
30
+++ b/docs/devel/decodetree.rst
31
@@ -XXX,XX +XXX,XX @@ Fields
32
33
Syntax::
34
35
- field_def := '%' identifier ( unnamed_field )* ( !function=identifier )?
36
+ field_def := '%' identifier ( field )* ( !function=identifier )?
37
+ field := unnamed_field | named_field
38
unnamed_field := number ':' ( 's' ) number
39
+ named_field := identifier ':' ( 's' ) number
40
41
For *unnamed_field*, the first number is the least-significant bit position
42
of the field and the second number is the length of the field. If the 's' is
43
-present, the field is considered signed. If multiple ``unnamed_fields`` are
44
-present, they are concatenated. In this way one can define disjoint fields.
45
+present, the field is considered signed.
46
+
47
+A *named_field* refers to some other field in the instruction pattern
48
+or format. Regardless of the length of the other field where it is
49
+defined, it will be inserted into this field with the specified
50
+signedness and bit width.
51
+
52
+Field definitions that involve loops (i.e. where a field is defined
53
+directly or indirectly in terms of itself) are errors.
54
+
55
+A format can include fields that refer to named fields that are
56
+defined in the instruction pattern(s) that use the format.
57
+Conversely, an instruction pattern can include fields that refer to
58
+named fields that are defined in the format it uses. However you
59
+cannot currently do both at once (i.e. pattern P uses format F; F has
60
+a field A that refers to a named field B that is defined in P, and P
61
+has a field C that refers to a named field D that is defined in F).
62
+
63
+If multiple ``fields`` are present, they are concatenated.
64
+In this way one can define disjoint fields.
65
66
If ``!function`` is specified, the concatenated result is passed through the
67
named function, taking and returning an integral value.
68
69
-One may use ``!function`` with zero ``unnamed_fields``. This case is called
70
+One may use ``!function`` with zero ``fields``. This case is called
71
a *parameter*, and the named function is only passed the ``DisasContext``
72
and returns an integral value extracted from there.
73
74
-A field with no ``unnamed_fields`` and no ``!function`` is in error.
75
+A field with no ``fields`` and no ``!function`` is in error.
76
77
Field examples:
78
79
@@ -XXX,XX +XXX,XX @@ Field examples:
80
| %shimm8 5:s8 13:1 | expand_shimm8(sextract(i, 5, 8) << 1 | |
81
| !function=expand_shimm8 | extract(i, 13, 1)) |
82
+---------------------------+---------------------------------------------+
83
+| %sz_imm 10:2 sz:3 | expand_sz_imm(extract(i, 10, 2) << 3 | |
84
+| !function=expand_sz_imm | extract(a->sz, 0, 3)) |
85
++---------------------------+---------------------------------------------+
86
87
Argument Sets
88
=============
89
--
90
2.34.1
diff view generated by jsdifflib
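To make the new documentation example concrete, here is a toy Python model of the %sz_imm row from the table: the low bits come from a named field in the argument set rather than from the instruction word. extract() follows the usual QEMU semantics; expand_sz_imm is a hypothetical per-target helper that just returns its input here.

    # Toy model of the %sz_imm documentation example:
    #   %sz_imm 10:2 sz:3 !function=expand_sz_imm
    # which the table expands as
    #   expand_sz_imm(extract(i, 10, 2) << 3 | extract(a->sz, 0, 3))

    def extract(value, start, length):
        return (value >> start) & ((1 << length) - 1)

    def expand_sz_imm(x):
        return x                      # placeholder for a real target helper

    def sz_imm(insn, sz):
        # 'sz' is the named field, taken from the argument set (a->sz),
        # not from the instruction word.
        return expand_sz_imm(extract(insn, 10, 2) << 3 | extract(sz, 0, 3))

    print(sz_imm(0b11 << 10, sz=0b101))  # 29, i.e. 0b11101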
New patch
1
From: Peter Maydell <peter.maydell@linaro.org>
1
2
3
To support referring to other named fields in field definitions, we
4
need to pass the str_extract() method a function which tells it how
5
to emit the code for a previously initialized named field. (In
6
Pattern::output_code() the other field will be "u.f_foo.field", and
7
in Format::output_extract() it is "a->field".)
8
9
Refactor the two callsites that currently do "output code to
10
initialize each field", and have them pass a lambda that defines how
11
to format the lvalue in each case. This is then used both in
12
emitting the LHS of the assignment and also passed down to
13
str_extract() as a new argument (unused at the moment, but will be
14
used in the following patch).
15
16
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
17
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
18
Message-Id: <20230523120447.728365-4-peter.maydell@linaro.org>
19
---
20
scripts/decodetree.py | 26 +++++++++++++++-----------
21
1 file changed, 15 insertions(+), 11 deletions(-)
22
23
diff --git a/scripts/decodetree.py b/scripts/decodetree.py
24
index XXXXXXX..XXXXXXX 100644
25
--- a/scripts/decodetree.py
26
+++ b/scripts/decodetree.py
27
@@ -XXX,XX +XXX,XX @@ def __str__(self):
28
s = ''
29
return str(self.pos) + ':' + s + str(self.len)
30
31
- def str_extract(self):
32
+ def str_extract(self, lvalue_formatter):
33
global bitop_width
34
s = 's' if self.sign else ''
35
return f'{s}extract{bitop_width}(insn, {self.pos}, {self.len})'
36
@@ -XXX,XX +XXX,XX @@ def __init__(self, subs, mask):
37
def __str__(self):
38
return str(self.subs)
39
40
- def str_extract(self):
41
+ def str_extract(self, lvalue_formatter):
42
global bitop_width
43
ret = '0'
44
pos = 0
45
for f in reversed(self.subs):
46
- ext = f.str_extract()
47
+ ext = f.str_extract(lvalue_formatter)
48
if pos == 0:
49
ret = ext
50
else:
51
@@ -XXX,XX +XXX,XX @@ def __init__(self, value):
52
def __str__(self):
53
return str(self.value)
54
55
- def str_extract(self):
56
+ def str_extract(self, lvalue_formatter):
57
return str(self.value)
58
59
def __cmp__(self, other):
60
@@ -XXX,XX +XXX,XX @@ def __init__(self, func, base):
61
def __str__(self):
62
return self.func + '(' + str(self.base) + ')'
63
64
- def str_extract(self):
65
- return self.func + '(ctx, ' + self.base.str_extract() + ')'
66
+ def str_extract(self, lvalue_formatter):
67
+ return (self.func + '(ctx, '
68
+ + self.base.str_extract(lvalue_formatter) + ')')
69
70
def __eq__(self, other):
71
return self.func == other.func and self.base == other.base
72
@@ -XXX,XX +XXX,XX @@ def __init__(self, func):
73
def __str__(self):
74
return self.func
75
76
- def str_extract(self):
77
+ def str_extract(self, lvalue_formatter):
78
return self.func + '(ctx)'
79
80
def __eq__(self, other):
81
@@ -XXX,XX +XXX,XX @@ def __str__(self):
82
83
def str1(self, i):
84
return str_indent(i) + self.__str__()
85
+
86
+ def output_fields(self, indent, lvalue_formatter):
87
+ for n, f in self.fields.items():
88
+ output(indent, lvalue_formatter(n), ' = ',
89
+ f.str_extract(lvalue_formatter), ';\n')
90
# end General
91
92
93
@@ -XXX,XX +XXX,XX @@ def extract_name(self):
94
def output_extract(self):
95
output('static void ', self.extract_name(), '(DisasContext *ctx, ',
96
self.base.struct_name(), ' *a, ', insntype, ' insn)\n{\n')
97
- for n, f in self.fields.items():
98
- output(' a->', n, ' = ', f.str_extract(), ';\n')
99
+ self.output_fields(str_indent(4), lambda n: 'a->' + n)
100
output('}\n\n')
101
# end Format
102
103
@@ -XXX,XX +XXX,XX @@ def output_code(self, i, extracted, outerbits, outermask):
104
if not extracted:
105
output(ind, self.base.extract_name(),
106
'(ctx, &u.f_', arg, ', insn);\n')
107
- for n, f in self.fields.items():
108
- output(ind, 'u.f_', arg, '.', n, ' = ', f.str_extract(), ';\n')
109
+ self.output_fields(ind, lambda n: 'u.f_' + arg + '.' + n)
110
output(ind, 'if (', translate_prefix, '_', self.name,
111
'(ctx, &u.f_', arg, ')) return true;\n')
112
113
--
114
2.34.1
diff view generated by jsdifflib
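The refactoring boils down to passing a callback that spells the destination lvalue, so a single field emitter serves both Format::output_extract() ("a->field") and Pattern::output_code() ("u.f_foo.field"). A stand-alone sketch of that callback pattern, with the field names and extraction strings invented for the example:

    # One emitter, two call sites that differ only in how the
    # destination is spelled.

    def output_fields(fields, lvalue_formatter):
        return '\n'.join(f'    {lvalue_formatter(n)} = {rhs};'
                         for n, rhs in fields.items())

    fields = {
        'rd': 'extract32(insn, 0, 5)',
        'imm': 'sextract32(insn, 20, 12)',
    }

    # Format::output_extract() spells the destination as a->field:
    print(output_fields(fields, lambda n: 'a->' + n))
    # Pattern::output_code() spells it as u.f_<arg>.field:
    print(output_fields(fields, lambda n: 'u.f_addi.' + n))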
New patch
1
From: Peter Maydell <peter.maydell@linaro.org>
1
2
3
To support named fields, we will need to be able to do a topological
4
sort (so that we ensure that we output the assignment to field A
5
before the assignment to field B if field B refers to field A by
6
name). The good news is that there is a tsort in the python standard
7
library; the bad news is that it was only added in Python 3.9.
8
9
To bridge the gap between our current minimum supported Python
10
version and 3.9, provide a local implementation that has the
11
same API as the stdlib version for the parts we care about.
12
In future when QEMU's minimum Python version requirement reaches
13
3.9 we can delete this code and replace it with an 'import' line.
14
15
The core of this implementation is based on
16
https://code.activestate.com/recipes/578272-topological-sort/
17
which is MIT-licensed.
18
19
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
20
Acked-by: Richard Henderson <richard.henderson@linaro.org>
21
Message-Id: <20230523120447.728365-5-peter.maydell@linaro.org>
22
---
23
scripts/decodetree.py | 74 +++++++++++++++++++++++++++++++++++++++++++
24
1 file changed, 74 insertions(+)
25
26
diff --git a/scripts/decodetree.py b/scripts/decodetree.py
27
index XXXXXXX..XXXXXXX 100644
28
--- a/scripts/decodetree.py
29
+++ b/scripts/decodetree.py
30
@@ -XXX,XX +XXX,XX @@
31
re_fmt_ident = '@[a-zA-Z0-9_]*'
32
re_pat_ident = '[a-zA-Z0-9_]*'
33
34
+# Local implementation of a topological sort. We use the same API that
35
+# the Python graphlib does, so that when QEMU moves forward to a
36
+# baseline of Python 3.9 or newer this code can all be dropped and
37
+# replaced with:
38
+# from graphlib import TopologicalSorter, CycleError
39
+#
40
+# https://docs.python.org/3.9/library/graphlib.html#graphlib.TopologicalSorter
41
+#
42
+# We only implement the parts of TopologicalSorter we care about:
43
+# ts = TopologicalSorter(graph=None)
44
+# create the sorter. graph is a dictionary whose keys are
45
+# nodes and whose values are lists of the predecessors of that node.
46
+# (That is, if graph contains "A" -> ["B", "C"] then we must output
47
+# B and C before A.)
48
+# ts.static_order()
49
+# returns a list of all the nodes in sorted order, or raises CycleError
50
+# CycleError
51
+# exception raised if there are cycles in the graph. The second
52
+# element in the args attribute is a list of nodes which form a
53
+# cycle; the first and last element are the same, eg [a, b, c, a]
54
+# (Our implementation doesn't give the order correctly.)
55
+#
56
+# For our purposes we can assume that the data set is always small
57
+# (typically 10 nodes or less, actual links in the graph very rare),
58
+# so we don't need to worry about efficiency of implementation.
59
+#
60
+# The core of this implementation is from
61
+# https://code.activestate.com/recipes/578272-topological-sort/
62
+# (but updated to Python 3), and is under the MIT license.
63
+
64
+class CycleError(ValueError):
65
+ """Subclass of ValueError raised if cycles exist in the graph"""
66
+ pass
67
+
68
+class TopologicalSorter:
69
+ """Topologically sort a graph"""
70
+ def __init__(self, graph=None):
71
+ self.graph = graph
72
+
73
+ def static_order(self):
74
+ # We do the sort right here, unlike the stdlib version
75
+ from functools import reduce
76
+ data = {}
77
+ r = []
78
+
79
+ if not self.graph:
80
+ return []
81
+
82
+ # This code wants the values in the dict to be specifically sets
83
+ for k, v in self.graph.items():
84
+ data[k] = set(v)
85
+
86
+ # Find all items that don't depend on anything.
87
+ extra_items_in_deps = (reduce(set.union, data.values())
88
+ - set(data.keys()))
89
+ # Add empty dependencies where needed
90
+ data.update({item:{} for item in extra_items_in_deps})
91
+ while True:
92
+ ordered = set(item for item, dep in data.items() if not dep)
93
+ if not ordered:
94
+ break
95
+ r.extend(ordered)
96
+ data = {item: (dep - ordered)
97
+ for item, dep in data.items()
98
+ if item not in ordered}
99
+ if data:
100
+ # This doesn't give as nice results as the stdlib, which
101
+ # gives you the cycle by listing the nodes in order. Here
102
+ # we only know the nodes in the cycle but not their order.
103
+ raise CycleError(f'nodes are in a cycle', list(data.keys()))
104
+
105
+ return r
106
+# end TopologicalSorter
107
+
108
def error_with_file(file, lineno, *args):
109
"""Print an error message from file:line and args and exit."""
110
global output_file
111
--
112
2.34.1
diff view generated by jsdifflib
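The local TopologicalSorter is a drop-in for the Python 3.9 graphlib API described in the comment block, so its intended behaviour can be shown with the stdlib class: graph values list each node's predecessors, and static_order() raises CycleError when the field definitions loop.

    # Shown with the stdlib class the local code mirrors (Python 3.9+).
    # 'A': ['B', 'C'] means B and C must be emitted before A.
    from graphlib import TopologicalSorter, CycleError

    graph = {'A': ['B', 'C'], 'C': ['B']}
    print(list(TopologicalSorter(graph).static_order()))  # ['B', 'C', 'A']

    try:
        list(TopologicalSorter({'x': ['y'], 'y': ['x']}).static_order())
    except CycleError as e:
        print('cycle through:', e.args[1])  # the nodes forming the cycle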
1
Introduce macro VX4() used for encoding Altivec instructions.
1
From: Peter Maydell <peter.maydell@linaro.org>
2
2
3
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
3
Implement support for named fields, i.e. where one field is defined
4
Signed-off-by: Aleksandar Markovic <amarkovic@wavecomp.com>
4
in terms of another, rather than directly in terms of bits extracted
5
from the instruction.
6
7
The new method referenced_fields() on all the Field classes returns a
8
list of fields that this field references. This just passes through,
9
except for the new NamedField class.
10
11
We can then use referenced_fields() to:
12
* construct a list of 'dangling references' for a format or
13
pattern, which is the fields that the format/pattern uses but
14
doesn't define itself
15
* do a topological sort, so that we output "field = value"
16
assignments in an order that means that we assign a field before
17
we reference it in a subsequent assignment
18
* check when we output the code for a pattern whether we need to
19
fill in the format fields before or after the pattern fields, and
20
do other error checking
21
22
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
23
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
24
Message-Id: <20230523120447.728365-6-peter.maydell@linaro.org>
5
---
25
---
6
tcg/ppc/tcg-target.inc.c | 1 +
26
scripts/decodetree.py | 145 ++++++++++++++++++++++++++++++++++++++++--
7
1 file changed, 1 insertion(+)
27
1 file changed, 139 insertions(+), 6 deletions(-)
8
28
9
diff --git a/tcg/ppc/tcg-target.inc.c b/tcg/ppc/tcg-target.inc.c
29
diff --git a/scripts/decodetree.py b/scripts/decodetree.py
10
index XXXXXXX..XXXXXXX 100644
30
index XXXXXXX..XXXXXXX 100644
11
--- a/tcg/ppc/tcg-target.inc.c
31
--- a/scripts/decodetree.py
12
+++ b/tcg/ppc/tcg-target.inc.c
32
+++ b/scripts/decodetree.py
13
@@ -XXX,XX +XXX,XX @@ static int tcg_target_const_match(tcg_target_long val, TCGType type,
33
@@ -XXX,XX +XXX,XX @@ def str_extract(self, lvalue_formatter):
14
#define XO31(opc) (OPCD(31)|((opc)<<1))
34
s = 's' if self.sign else ''
15
#define XO58(opc) (OPCD(58)|(opc))
35
return f'{s}extract{bitop_width}(insn, {self.pos}, {self.len})'
16
#define XO62(opc) (OPCD(62)|(opc))
36
17
+#define VX4(opc) (OPCD(4)|(opc))
37
+ def referenced_fields(self):
18
38
+ return []
19
#define B OPCD( 18)
39
+
20
#define BC OPCD( 16)
40
def __eq__(self, other):
41
return self.sign == other.sign and self.mask == other.mask
42
43
@@ -XXX,XX +XXX,XX @@ def str_extract(self, lvalue_formatter):
44
pos += f.len
45
return ret
46
47
+ def referenced_fields(self):
48
+ l = []
49
+ for f in self.subs:
50
+ l.extend(f.referenced_fields())
51
+ return l
52
+
53
def __ne__(self, other):
54
if len(self.subs) != len(other.subs):
55
return True
56
@@ -XXX,XX +XXX,XX @@ def __str__(self):
57
def str_extract(self, lvalue_formatter):
58
return str(self.value)
59
60
+ def referenced_fields(self):
61
+ return []
62
+
63
def __cmp__(self, other):
64
return self.value - other.value
65
# end ConstField
66
@@ -XXX,XX +XXX,XX @@ def str_extract(self, lvalue_formatter):
67
return (self.func + '(ctx, '
68
+ self.base.str_extract(lvalue_formatter) + ')')
69
70
+ def referenced_fields(self):
71
+ return self.base.referenced_fields()
72
+
73
def __eq__(self, other):
74
return self.func == other.func and self.base == other.base
75
76
@@ -XXX,XX +XXX,XX @@ def __str__(self):
77
def str_extract(self, lvalue_formatter):
78
return self.func + '(ctx)'
79
80
+ def referenced_fields(self):
81
+ return []
82
+
83
def __eq__(self, other):
84
return self.func == other.func
85
86
@@ -XXX,XX +XXX,XX @@ def __ne__(self, other):
87
return not self.__eq__(other)
88
# end ParameterField
89
90
+class NamedField:
91
+ """Class representing a field already named in the pattern"""
92
+ def __init__(self, name, sign, len):
93
+ self.mask = 0
94
+ self.sign = sign
95
+ self.len = len
96
+ self.name = name
97
+
98
+ def __str__(self):
99
+ return self.name
100
+
101
+ def str_extract(self, lvalue_formatter):
102
+ global bitop_width
103
+ s = 's' if self.sign else ''
104
+ lvalue = lvalue_formatter(self.name)
105
+ return f'{s}extract{bitop_width}({lvalue}, 0, {self.len})'
106
+
107
+ def referenced_fields(self):
108
+ return [self.name]
109
+
110
+ def __eq__(self, other):
111
+ return self.name == other.name
112
+
113
+ def __ne__(self, other):
114
+ return not self.__eq__(other)
115
+# end NamedField
116
117
class Arguments:
118
"""Class representing the extracted fields of a format"""
119
@@ -XXX,XX +XXX,XX @@ def output_def(self):
120
output('} ', self.struct_name(), ';\n\n')
121
# end Arguments
122
123
-
124
class General:
125
"""Common code between instruction formats and instruction patterns"""
126
def __init__(self, name, lineno, base, fixb, fixm, udfm, fldm, flds, w):
127
@@ -XXX,XX +XXX,XX @@ def __init__(self, name, lineno, base, fixb, fixm, udfm, fldm, flds, w):
128
self.fieldmask = fldm
129
self.fields = flds
130
self.width = w
131
+ self.dangling = None
132
133
def __str__(self):
134
return self.name + ' ' + str_match_bits(self.fixedbits, self.fixedmask)
135
@@ -XXX,XX +XXX,XX @@ def __str__(self):
136
def str1(self, i):
137
return str_indent(i) + self.__str__()
138
139
+ def dangling_references(self):
140
+ # Return a list of all named references which aren't satisfied
141
+ # directly by this format/pattern. This will be either:
142
+ # * a format referring to a field which is specified by the
143
+ # pattern(s) using it
144
+ # * a pattern referring to a field which is specified by the
145
+ # format it uses
146
+ # * a user error (referring to a field that doesn't exist at all)
147
+ if self.dangling is None:
148
+ # Compute this once and cache the answer
149
+ dangling = []
150
+ for n, f in self.fields.items():
151
+ for r in f.referenced_fields():
152
+ if r not in self.fields:
153
+ dangling.append(r)
154
+ self.dangling = dangling
155
+ return self.dangling
156
+
157
def output_fields(self, indent, lvalue_formatter):
158
+ # We use a topological sort to ensure that any use of NamedField
159
+ # comes after the initialization of the field it is referencing.
160
+ graph = {}
161
for n, f in self.fields.items():
162
- output(indent, lvalue_formatter(n), ' = ',
163
- f.str_extract(lvalue_formatter), ';\n')
164
+ refs = f.referenced_fields()
165
+ graph[n] = refs
166
+
167
+ try:
168
+ ts = TopologicalSorter(graph)
169
+ for n in ts.static_order():
170
+ # We only want to emit assignments for the keys
171
+ # in our fields list, not for anything that ends up
172
+ # in the tsort graph only because it was referenced as
173
+ # a NamedField.
174
+ try:
175
+ f = self.fields[n]
176
+ output(indent, lvalue_formatter(n), ' = ',
177
+ f.str_extract(lvalue_formatter), ';\n')
178
+ except KeyError:
179
+ pass
180
+ except CycleError as e:
181
+ # The second element of args is a list of nodes which form
182
+ # a cycle (there might be others too, but only one is reported).
183
+ # Pretty-print it to tell the user.
184
+ cycle = ' => '.join(e.args[1])
185
+ error(self.lineno, 'field definitions form a cycle: ' + cycle)
186
# end General
187
188
189
@@ -XXX,XX +XXX,XX @@ def output_code(self, i, extracted, outerbits, outermask):
190
ind = str_indent(i)
191
arg = self.base.base.name
192
output(ind, '/* ', self.file, ':', str(self.lineno), ' */\n')
193
+ # We might have named references in the format that refer to fields
194
+ # in the pattern, or named references in the pattern that refer
195
+ # to fields in the format. This affects whether we extract the fields
196
+ # for the format before or after the ones for the pattern.
197
+ # For simplicity we don't allow cross references in both directions.
198
+ # This is also where we catch the syntax error of referring to
199
+ # a nonexistent field.
200
+ fmt_refs = self.base.dangling_references()
201
+ for r in fmt_refs:
202
+ if r not in self.fields:
203
+ error(self.lineno, f'format refers to undefined field {r}')
204
+ pat_refs = self.dangling_references()
205
+ for r in pat_refs:
206
+ if r not in self.base.fields:
207
+ error(self.lineno, f'pattern refers to undefined field {r}')
208
+ if pat_refs and fmt_refs:
209
+ error(self.lineno, ('pattern that uses fields defined in format '
210
+ 'cannot use format that uses fields defined '
211
+ 'in pattern'))
212
+ if fmt_refs:
213
+ # pattern fields first
214
+ self.output_fields(ind, lambda n: 'u.f_' + arg + '.' + n)
215
+ assert not extracted, "dangling fmt refs but it was already extracted"
216
if not extracted:
217
output(ind, self.base.extract_name(),
218
'(ctx, &u.f_', arg, ', insn);\n')
219
- self.output_fields(ind, lambda n: 'u.f_' + arg + '.' + n)
220
+ if not fmt_refs:
221
+ # pattern fields last
222
+ self.output_fields(ind, lambda n: 'u.f_' + arg + '.' + n)
223
+
224
output(ind, 'if (', translate_prefix, '_', self.name,
225
'(ctx, &u.f_', arg, ')) return true;\n')
226
227
@@ -XXX,XX +XXX,XX @@ def output_code(self, i, extracted, outerbits, outermask):
228
ind = str_indent(i)
229
230
# If we identified all nodes below have the same format,
231
- # extract the fields now.
232
- if not extracted and self.base:
233
+ # extract the fields now. But don't do it if the format relies
234
+ # on named fields from the insn pattern, as those won't have
235
+ # been initialised at this point.
236
+ if not extracted and self.base and not self.base.dangling_references():
237
output(ind, self.base.extract_name(),
238
'(ctx, &u.f_', self.base.base.name, ', insn);\n')
239
extracted = True
240
@@ -XXX,XX +XXX,XX @@ def parse_field(lineno, name, toks):
241
"""Parse one instruction field from TOKS at LINENO"""
242
global fields
243
global insnwidth
244
+ global re_C_ident
245
246
# A "simple" field will have only one entry;
247
# a "multifield" will have several.
248
@@ -XXX,XX +XXX,XX @@ def parse_field(lineno, name, toks):
249
func = func[1]
250
continue
251
252
+ if re.fullmatch(re_C_ident + ':s[0-9]+', t):
253
+ # Signed named field
254
+ subtoks = t.split(':')
255
+ n = subtoks[0]
256
+ le = int(subtoks[1])
257
+ f = NamedField(n, True, le)
258
+ subs.append(f)
259
+ width += le
260
+ continue
261
+ if re.fullmatch(re_C_ident + ':[0-9]+', t):
262
+ # Unsigned named field
263
+ subtoks = t.split(':')
264
+ n = subtoks[0]
265
+ le = int(subtoks[1])
266
+ f = NamedField(n, False, le)
267
+ subs.append(f)
268
+ width += le
269
+ continue
270
+
271
if re.fullmatch('[0-9]+:s[0-9]+', t):
272
# Signed field extract
273
subtoks = t.split(':s')
21
--
274
--
22
2.17.1
275
2.34.1
23
24
diff view generated by jsdifflib
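The key difference for a NamedField is in str_extract(): instead of pulling bits out of the instruction word, it re-extracts from the already-assigned field, formatted through the same lvalue callback. A toy comparison of the two generated expressions, assuming the default 32-bit bitop_width and the "a->" formatter used by formats:

    # Toy contrast of the C expressions str_extract() produces for an
    # ordinary field versus a NamedField.

    def unnamed_extract(pos, length, sign=False):
        s = 's' if sign else ''
        return f'{s}extract32(insn, {pos}, {length})'

    def named_extract(name, length, lvalue_formatter, sign=False):
        s = 's' if sign else ''
        return f'{s}extract32({lvalue_formatter(name)}, 0, {length})'

    lvalue = lambda n: 'a->' + n
    print(unnamed_extract(10, 2))          # extract32(insn, 10, 2)
    print(named_extract('sz', 3, lvalue))  # extract32(a->sz, 0, 3)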
New patch
1
From: Peter Maydell <peter.maydell@linaro.org>
1
2
3
Add some tests for various cases of named-field use, both ones that
4
should work and ones that should be diagnosed as errors.
5
6
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
7
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
8
Message-Id: <20230523120447.728365-7-peter.maydell@linaro.org>
9
---
10
tests/decode/err_field10.decode | 7 +++++++
11
tests/decode/err_field7.decode | 7 +++++++
12
tests/decode/err_field8.decode | 8 ++++++++
13
tests/decode/err_field9.decode | 14 ++++++++++++++
14
tests/decode/succ_named_field.decode | 19 +++++++++++++++++++
15
tests/decode/meson.build | 5 +++++
16
6 files changed, 60 insertions(+)
17
create mode 100644 tests/decode/err_field10.decode
18
create mode 100644 tests/decode/err_field7.decode
19
create mode 100644 tests/decode/err_field8.decode
20
create mode 100644 tests/decode/err_field9.decode
21
create mode 100644 tests/decode/succ_named_field.decode
22
23
diff --git a/tests/decode/err_field10.decode b/tests/decode/err_field10.decode
24
new file mode 100644
25
index XXXXXXX..XXXXXXX
26
--- /dev/null
27
+++ b/tests/decode/err_field10.decode
28
@@ -XXX,XX +XXX,XX @@
29
+# This work is licensed under the terms of the GNU LGPL, version 2 or later.
30
+# See the COPYING.LIB file in the top-level directory.
31
+
32
+# Diagnose formats which refer to undefined fields
33
+%field1 field2:3
34
+@fmt ........ ........ ........ ........ %field1
35
+insn 00000000 00000000 00000000 00000000 @fmt
36
diff --git a/tests/decode/err_field7.decode b/tests/decode/err_field7.decode
37
new file mode 100644
38
index XXXXXXX..XXXXXXX
39
--- /dev/null
40
+++ b/tests/decode/err_field7.decode
41
@@ -XXX,XX +XXX,XX @@
42
+# This work is licensed under the terms of the GNU LGPL, version 2 or later.
43
+# See the COPYING.LIB file in the top-level directory.
44
+
45
+# Diagnose fields whose definitions form a loop
46
+%field1 field2:3
47
+%field2 field1:4
48
+insn 00000000 00000000 00000000 00000000 %field1 %field2
49
diff --git a/tests/decode/err_field8.decode b/tests/decode/err_field8.decode
50
new file mode 100644
51
index XXXXXXX..XXXXXXX
52
--- /dev/null
53
+++ b/tests/decode/err_field8.decode
54
@@ -XXX,XX +XXX,XX @@
55
+# This work is licensed under the terms of the GNU LGPL, version 2 or later.
56
+# See the COPYING.LIB file in the top-level directory.
57
+
58
+# Diagnose patterns which refer to undefined fields
59
+&f1 f1 a
60
+%field1 field2:3
61
+@fmt ........ ........ ........ .... a:4 &f1
62
+insn 00000000 00000000 00000000 0000 .... @fmt f1=%field1
63
diff --git a/tests/decode/err_field9.decode b/tests/decode/err_field9.decode
64
new file mode 100644
65
index XXXXXXX..XXXXXXX
66
--- /dev/null
67
+++ b/tests/decode/err_field9.decode
68
@@ -XXX,XX +XXX,XX @@
69
+# This work is licensed under the terms of the GNU LGPL, version 2 or later.
70
+# See the COPYING.LIB file in the top-level directory.
71
+
72
+# Diagnose fields where the format refers to a field defined in the
73
+# pattern and the pattern refers to a field defined in the format.
74
+# This is theoretically not impossible to implement, but is not
75
+# supported by the script at this time.
76
+&abcd a b c d
77
+%refa a:3
78
+%refc c:4
79
+# Format defines 'c' and sets 'b' to an indirect ref to 'a'
80
+@fmt ........ ........ ........ c:8 &abcd b=%refa
81
+# Pattern defines 'a' and sets 'd' to an indirect ref to 'c'
82
+insn 00000000 00000000 00000000 ........ @fmt d=%refc a=6
83
diff --git a/tests/decode/succ_named_field.decode b/tests/decode/succ_named_field.decode
84
new file mode 100644
85
index XXXXXXX..XXXXXXX
86
--- /dev/null
87
+++ b/tests/decode/succ_named_field.decode
88
@@ -XXX,XX +XXX,XX @@
89
+# This work is licensed under the terms of the GNU LGPL, version 2 or later.
90
+# See the COPYING.LIB file in the top-level directory.
91
+
92
+# field using a named_field
93
+%imm_sz    8:8 sz:3
94
+insn 00000000 00000000 ........ 00000000 imm_sz=%imm_sz sz=1
95
+
96
+# Ditto, via a format. Here a field in the format
97
+# references a named field defined in the insn pattern:
98
+&imm_a imm alpha
99
+%foo 0:16 alpha:4
100
+@foo 00000001 ........ ........ ........ &imm_a imm=%foo
101
+i1 ........ 00000000 ........ ........ @foo alpha=1
102
+i2 ........ 00000001 ........ ........ @foo alpha=2
103
+
104
+# Here the named field is defined in the format and referenced
105
+# from the insn pattern:
106
+@bar 00000010 ........ ........ ........ &imm_a alpha=4
107
+i3 ........ 00000000 ........ ........ @bar imm=%foo
108
diff --git a/tests/decode/meson.build b/tests/decode/meson.build
109
index XXXXXXX..XXXXXXX 100644
110
--- a/tests/decode/meson.build
111
+++ b/tests/decode/meson.build
112
@@ -XXX,XX +XXX,XX @@ err_tests = [
113
'err_field4.decode',
114
'err_field5.decode',
115
'err_field6.decode',
116
+ 'err_field7.decode',
117
+ 'err_field8.decode',
118
+ 'err_field9.decode',
119
+ 'err_field10.decode',
120
'err_init1.decode',
121
'err_init2.decode',
122
'err_init3.decode',
123
@@ -XXX,XX +XXX,XX @@ succ_tests = [
124
'succ_argset_type1.decode',
125
'succ_function.decode',
126
'succ_ident1.decode',
127
+ 'succ_named_field.decode',
128
'succ_pattern_group_nest1.decode',
129
'succ_pattern_group_nest2.decode',
130
'succ_pattern_group_nest3.decode',
131
--
132
2.34.1
diff view generated by jsdifflib