Series comparison

-[PULL 00/11] tcg patch queue
+[PULL 00/27] tcg patch queue
-The following changes since commit 75d30fde55485b965a1168a21d016dd07b50ed32:
+The following changes since commit 7fe6cb68117ac856e03c93d18aca09de015392b0:
-  Merge tag 'block-pull-request' of https://gitlab.com/stefanha/qemu into staging (2022-10-30 15:07:25 -0400)
+  Merge tag 'pull-target-arm-20230530-1' of https://git.linaro.org/people/pmaydell/qemu-arm into staging (2023-05-30 08:02:05 -0700)
 are available in the Git repository at:
-  https://gitlab.com/rth7680/qemu.git tags/pull-tcg-20221031
+  https://gitlab.com/rth7680/qemu.git tags/pull-tcg-20230530
-for you to fetch changes up to cb375590983fc3d23600d02ba05a05d34fe44150:
+for you to fetch changes up to 276d77de503e8f5f5cbd3f7d94302ca12d1d982e:
-  target/i386: Expand eflags updates inline (2022-10-31 11:39:10 +1100)
+  tests/decode: Add tests for various named-field cases (2023-05-30 10:55:39 -0700)
 ----------------------------------------------------------------
-Remove sparc32plus support from tcg/sparc.
+Improvements to 128-bit atomics:
-target/i386: Use cpu_unwind_state_data for tpr access.
+  - Separate __int128_t type and arithmetic detection
-target/i386: Expand eflags updates inline
+  - Support 128-bit load/store in backend for i386, aarch64, ppc64, s390x
   - Accelerate atomics via host/include/
 Decodetree:
   - Add named field syntax
   - Move tests to meson
 ----------------------------------------------------------------
-Icenowy Zheng (1):
+Peter Maydell (5):
-      tcg/tci: fix logic error when registering helpers via FFI
+      docs: Document decodetree named field syntax
       scripts/decodetree: Pass lvalue-formatter function to str_extract()
       scripts/decodetree: Implement a topological sort
       scripts/decodetree: Implement named field support
       tests/decode: Add tests for various named-field cases
-Richard Henderson (10):
+Richard Henderson (22):
-      tcg/sparc: Remove support for sparc32plus
+      tcg: Fix register move type in tcg_out_ld_helper_ret
-      tcg/sparc64: Rename from tcg/sparc
+      accel/tcg: Fix check for page writeability in load_atomic16_or_exit
-      tcg/sparc64: Remove sparc32plus constraints
+      meson: Split test for __int128_t type from __int128_t arithmetic
-      accel/tcg: Introduce cpu_unwind_state_data
+      qemu/atomic128: Add x86_64 atomic128-ldst.h
-      target/i386: Use cpu_unwind_state_data for tpr access
+      tcg/i386: Support 128-bit load/store
-      target/openrisc: Always exit after mtspr npc
+      tcg/aarch64: Rename temporaries
-      target/openrisc: Use cpu_unwind_state_data for mfspr
+      tcg/aarch64: Reserve TCG_REG_TMP1, TCG_REG_TMP2
-      accel/tcg: Remove will_exit argument from cpu_restore_state
+      tcg/aarch64: Simplify constraints on qemu_ld/st
-      accel/tcg: Remove reset_icount argument from cpu_restore_state_from_tb
+      tcg/aarch64: Support 128-bit load/store
-      target/i386: Expand eflags updates inline
+      tcg/ppc: Support 128-bit load/store
       tcg/s390x: Support 128-bit load/store
       accel/tcg: Extract load_atom_extract_al16_or_al8 to host header
       accel/tcg: Extract store_atom_insert_al16 to host header
       accel/tcg: Add x86_64 load_atom_extract_al16_or_al8
       accel/tcg: Add aarch64 lse2 load_atom_extract_al16_or_al8
       accel/tcg: Add aarch64 store_atom_insert_al16
       tcg: Remove TCG_TARGET_TLB_DISPLACEMENT_BITS
       decodetree: Add --test-for-error
       decodetree: Fix recursion in prop_format and build_tree
       decodetree: Diagnose empty pattern group
       decodetree: Do not remove output_file from /dev
       tests/decode: Convert tests to meson
- meson.build                                 |   4 +-
+ docs/devel/decodetree.rst                         |  33 ++-
- accel/tcg/internal.h                        |   4 +-
+ meson.build                                       |  15 +-
- include/exec/exec-all.h                     |  24 ++-
+ host/include/aarch64/host/load-extract-al16-al8.h |  40 ++++
- target/i386/helper.h                        |   5 -
+ host/include/aarch64/host/store-insert-al16.h     |  47 ++++
- tcg/{sparc => sparc64}/tcg-target-con-set.h |  16 +-
+ host/include/generic/host/load-extract-al16-al8.h |  45 ++++
- tcg/{sparc => sparc64}/tcg-target-con-str.h |   3 -
+ host/include/generic/host/store-insert-al16.h     |  50 ++++
- tcg/{sparc => sparc64}/tcg-target.h         |  11 --
+ host/include/x86_64/host/atomic128-ldst.h         |  68 ++++++
- accel/tcg/cpu-exec-common.c                 |   2 +-
+ host/include/x86_64/host/load-extract-al16-al8.h  |  50 ++++
- accel/tcg/tb-maint.c                        |   4 +-
+ include/qemu/int128.h                             |   4 +-
- accel/tcg/translate-all.c                   |  91 +++++----
+ tcg/aarch64/tcg-target-con-set.h                  |   4 +-
- target/alpha/helper.c                       |   2 +-
+ tcg/aarch64/tcg-target-con-str.h                  |   1 -
- target/alpha/mem_helper.c                   |   2 +-
+ tcg/aarch64/tcg-target.h                          |  12 +-
- target/arm/op_helper.c                      |   2 +-
+ tcg/arm/tcg-target.h                              |   1 -
- target/arm/tlb_helper.c                     |   8 +-
+ tcg/i386/tcg-target.h                             |   5 +-
- target/cris/helper.c                        |   2 +-
+ tcg/mips/tcg-target.h                             |   1 -
- target/i386/helper.c                        |  21 ++-
+ tcg/ppc/tcg-target-con-set.h                      |   2 +
- target/i386/tcg/cc_helper.c                 |  41 -----
+ tcg/ppc/tcg-target-con-str.h                      |   1 +
- target/i386/tcg/sysemu/svm_helper.c         |   2 +-
+ tcg/ppc/tcg-target.h                              |   4 +-
- target/i386/tcg/translate.c                 |  30 ++-
+ tcg/riscv/tcg-target.h                            |   1 -
- target/m68k/op_helper.c                     |   4 +-
+ tcg/s390x/tcg-target-con-set.h                    |   2 +
- target/microblaze/helper.c                  |   2 +-
+ tcg/s390x/tcg-target.h                            |   3 +-
- target/nios2/op_helper.c                    |   2 +-
+ tcg/sparc64/tcg-target.h                          |   1 -
- target/openrisc/sys_helper.c                |  17 +-
+ tcg/tci/tcg-target.h                              |   1 -
- target/ppc/excp_helper.c                    |   2 +-
+ tests/decode/err_field10.decode                   |   7 +
- target/s390x/tcg/excp_helper.c              |   2 +-
+ tests/decode/err_field7.decode                    |   7 +
- target/tricore/op_helper.c                  |   2 +-
+ tests/decode/err_field8.decode                    |   8 +
- target/xtensa/helper.c                      |   6 +-
+ tests/decode/err_field9.decode                    |  14 ++
- tcg/tcg.c                                   |  81 +-------
+ tests/decode/succ_named_field.decode              |  19 ++
- tcg/{sparc => sparc64}/tcg-target.c.inc     | 275 ++++++++--------------------
+ tcg/tcg.c                                         |   4 +-
- MAINTAINERS                                 |   2 +-
+ accel/tcg/ldst_atomicity.c.inc                    |  80 +------
-files changed, 232 insertions(+), 437 deletions(-)
+ tcg/aarch64/tcg-target.c.inc                      | 243 +++++++++++++++-----
- rename tcg/{sparc => sparc64}/tcg-target-con-set.h (69%)
+ tcg/i386/tcg-target.c.inc                         | 191 +++++++++++++++-
- rename tcg/{sparc => sparc64}/tcg-target-con-str.h (77%)
+ tcg/ppc/tcg-target.c.inc                          | 108 ++++++++-
- rename tcg/{sparc => sparc64}/tcg-target.h (95%)
+ tcg/s390x/tcg-target.c.inc                        | 107 ++++++++-
- rename tcg/{sparc => sparc64}/tcg-target.c.inc (91%)
+ scripts/decodetree.py                             | 265 ++++++++++++++++++++--
  tests/decode/check.sh                             |  24 --
  tests/decode/meson.build                          |  64 ++++++
  tests/meson.build                                 |   5 +-
 files changed, 1312 insertions(+), 225 deletions(-)
  create mode 100644 host/include/aarch64/host/load-extract-al16-al8.h
  create mode 100644 host/include/aarch64/host/store-insert-al16.h
  create mode 100644 host/include/generic/host/load-extract-al16-al8.h
  create mode 100644 host/include/generic/host/store-insert-al16.h
  create mode 100644 host/include/x86_64/host/atomic128-ldst.h
  create mode 100644 host/include/x86_64/host/load-extract-al16-al8.h
  create mode 100644 tests/decode/err_field10.decode
  create mode 100644 tests/decode/err_field7.decode
  create mode 100644 tests/decode/err_field8.decode
  create mode 100644 tests/decode/err_field9.decode
  create mode 100644 tests/decode/succ_named_field.decode
  delete mode 100755 tests/decode/check.sh
  create mode 100644 tests/decode/meson.build

-[PULL 04/11] tcg/tci: fix logic error when registering helpers via FFI
+[PULL 01/27] tcg: Fix register move type in tcg_out_ld_helper_ret
-From: Icenowy Zheng <uwu@icenowy.me>
+The first move was incorrectly using TCG_TYPE_I32 while the second
 move was correctly using TCG_TYPE_REG.  This prevents a 64-bit host
 from moving all 128 bits of the return value.
-When registering helpers via FFI for TCI, the inner loop that iterates
+Fixes: ebebea53ef8 ("tcg: Support TCG_TYPE_I128 in tcg_out_{ld,st}_helper_{args,ret}")
 parameters of the helper reuses (and thus pollutes) the same variable
 used by the outer loop that iterates all helpers, thus leaving some helpers
 unregistered.
 Fix this logic error by using a dedicated temporary variable for the
 inner loop.
 Fixes: 22f15579fa ("tcg: Build ffi data structures for helpers")
 Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
 Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
 Signed-off-by: Icenowy Zheng <uwu@icenowy.me>
 Message-Id: <20221028072145.1593205-1-uwu@icenowy.me>
 [rth: Move declaration of j to the for loop itself]
 Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
+Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
 ---
- tcg/tcg.c | 6 +++---
+ tcg/tcg.c | 4 ++--
-file changed, 3 insertions(+), 3 deletions(-)
+file changed, 2 insertions(+), 2 deletions(-)
 diff --git a/tcg/tcg.c b/tcg/tcg.c
 index XXXXXXX..XXXXXXX 100644
 --- a/tcg/tcg.c
 +++ b/tcg/tcg.c
-@@ -XXX,XX +XXX,XX @@ static void tcg_context_init(unsigned max_cpus)
+@@ -XXX,XX +XXX,XX @@ static void tcg_out_ld_helper_ret(TCGContext *s, const TCGLabelQemuLdst *ldst,
+     mov[0].dst = ldst->datalo_reg;
-         if (nargs != 0) {
+     mov[0].src =
-             ca->cif.arg_types = ca->args;
+         tcg_target_call_oarg_reg(TCG_CALL_RET_NORMAL, HOST_BIG_ENDIAN);
--            for (i = 0; i < nargs; ++i) {
+-    mov[0].dst_type = TCG_TYPE_I32;
--                int typecode = extract32(typemask, (i + 1) * 3, 3);
+-    mov[0].src_type = TCG_TYPE_I32;
--                ca->args[i] = typecode_to_ffi[typecode];
++    mov[0].dst_type = TCG_TYPE_REG;
-+            for (int j = 0; j < nargs; ++j) {
++    mov[0].src_type = TCG_TYPE_REG;
-+                int typecode = extract32(typemask, (j + 1) * 3, 3);
+     mov[0].src_ext = TCG_TARGET_REG_BITS == 32 ? MO_32 : MO_64;
-+                ca->args[j] = typecode_to_ffi[typecode];
-             }
+     mov[1].dst = ldst->datahi_reg;
          }
 --
 .34.1
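
The effect described in the 01/27 commit message can be illustrated with a small model. This is only a sketch: SketchMoveType and sketch_mov are hypothetical names, not part of TCG, and the 32-bit move is assumed to zero-extend (a real backend may leave the upper bits unspecified instead).

```c
#include <stdint.h>

/* Hypothetical stand-ins for the TCG move types; not the real TCGType enum. */
typedef enum {
    SKETCH_TYPE_I32,   /* move only the low 32 bits */
    SKETCH_TYPE_REG    /* move the full host register */
} SketchMoveType;

/*
 * Model a register-to-register move on a 64-bit host.
 * Assumption: a 32-bit move zero-extends the value it copies.
 */
static uint64_t sketch_mov(SketchMoveType type, uint64_t src)
{
    if (type == SKETCH_TYPE_I32) {
        return (uint32_t)src;   /* upper half of the register is lost */
    }
    return src;                 /* full register width is preserved */
}
```

With the I32 type, the low half of a 128-bit return value would arrive truncated even though the high half is moved correctly, which matches the asymmetry the commit message describes.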

-New patch
+[PULL 02/27] accel/tcg: Fix check for page writeability in load_atomic16_or_exit
+PAGE_WRITE is current writability, as modified by TB protection;
+PAGE_WRITE_ORG is the original page writability.
+Fixes: cdfac37be0d ("accel/tcg: Honor atomicity of loads")
+Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
+Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
+---
+ accel/tcg/ldst_atomicity.c.inc | 4 ++--
+file changed, 2 insertions(+), 2 deletions(-)
+diff --git a/accel/tcg/ldst_atomicity.c.inc b/accel/tcg/ldst_atomicity.c.inc
+index XXXXXXX..XXXXXXX 100644
+--- a/accel/tcg/ldst_atomicity.c.inc
++++ b/accel/tcg/ldst_atomicity.c.inc
+@@ -XXX,XX +XXX,XX @@ static uint64_t load_atomic8_or_exit(CPUArchState *env, uintptr_t ra, void *pv)
+      * another process, because the fallback start_exclusive solution
+      * provides no protection across processes.
+      */
+-    if (!page_check_range(h2g(pv), 8, PAGE_WRITE)) {
++    if (!page_check_range(h2g(pv), 8, PAGE_WRITE_ORG)) {
+         uint64_t *p = __builtin_assume_aligned(pv, 8);
+         return *p;
+     }
+@@ -XXX,XX +XXX,XX @@ static Int128 load_atomic16_or_exit(CPUArchState *env, uintptr_t ra, void *pv)
+      * another process, because the fallback start_exclusive solution
+      * provides no protection across processes.
+      */
+-    if (!page_check_range(h2g(p), 16, PAGE_WRITE)) {
++    if (!page_check_range(h2g(p), 16, PAGE_WRITE_ORG)) {
+         return *p;
+     }
+ #endif
+--
+.34.1
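
The distinction drawn in the 02/27 commit message can be sketched as two independent flag bits. The flag values and helper names below are hypothetical and chosen only for illustration; they are not QEMU's definitions, and the sketch does not model page_check_range itself.

```c
#include <stdbool.h>

/* Hypothetical flag bits; the real PAGE_* values may differ. */
#define SKETCH_PAGE_WRITE      0x1  /* page is writable right now */
#define SKETCH_PAGE_WRITE_ORG  0x2  /* page was mapped writable originally */

/* Model write-protecting a page that holds translated code. */
static int sketch_protect_for_tb(int flags)
{
    return flags & ~SKETCH_PAGE_WRITE;   /* original writability is kept */
}

/* True if the page can be written at this moment. */
static bool sketch_currently_writable(int flags)
{
    return (flags & SKETCH_PAGE_WRITE) != 0;
}

/* True if the guest mapped the page writable, protected or not. */
static bool sketch_originally_writable(int flags)
{
    return (flags & SKETCH_PAGE_WRITE_ORG) != 0;
}
```

Under this model, a page that is temporarily write-protected still counts as originally writable, so a test for immutable data would need to look at the original bit rather than the current one.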

-[PULL 02/11] tcg/sparc64: Rename from tcg/sparc
+[PULL 03/27] meson: Split test for __int128_t type from __int128_t arithmetic
-Emphasize that we only support full 64-bit code generation.
+Older versions of clang have missing runtime functions for arithmetic
 with -fsanitize=undefined (see 464e3671f9d5c), so we cannot use
 __int128_t for implementing Int128.  But __int128_t is present,
 data movement works, and it can be used for atomic128.
 Probe for both CONFIG_INT128_TYPE and CONFIG_INT128, adjust
 qemu/int128.h to define Int128Alias if CONFIG_INT128_TYPE,
 and adjust the meson probe for atomics to use has_int128_type.
 Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
-Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
 Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
 ---
- meson.build                                 | 4 +---
+ meson.build           | 15 ++++++++++-----
- tcg/{sparc => sparc64}/tcg-target-con-set.h | 0
+ include/qemu/int128.h |  4 ++--
- tcg/{sparc => sparc64}/tcg-target-con-str.h | 0
+files changed, 12 insertions(+), 7 deletions(-)
  tcg/{sparc => sparc64}/tcg-target.h         | 0
  tcg/{sparc => sparc64}/tcg-target.c.inc     | 0
  MAINTAINERS                                 | 2 +-
 files changed, 2 insertions(+), 4 deletions(-)
  rename tcg/{sparc => sparc64}/tcg-target-con-set.h (100%)
  rename tcg/{sparc => sparc64}/tcg-target-con-str.h (100%)
  rename tcg/{sparc => sparc64}/tcg-target.h (100%)
  rename tcg/{sparc => sparc64}/tcg-target.c.inc (100%)
 diff --git a/meson.build b/meson.build
 index XXXXXXX..XXXXXXX 100644
 --- a/meson.build
 +++ b/meson.build
-@@ -XXX,XX +XXX,XX @@ qapi_trace_events = []
+@@ -XXX,XX +XXX,XX @@ config_host_data.set('CONFIG_ATOMIC64', cc.links('''
- bsd_oses = ['gnu/kfreebsd', 'freebsd', 'netbsd', 'openbsd', 'dragonfly', 'darwin']
+     return 0;
- supported_oses = ['windows', 'freebsd', 'netbsd', 'openbsd', 'darwin', 'sunos', 'linux']
+   }'''))
- supported_cpus = ['ppc', 'ppc64', 's390x', 'riscv', 'x86', 'x86_64',
--  'arm', 'aarch64', 'loongarch64', 'mips', 'mips64', 'sparc', 'sparc64']
+-has_int128 = cc.links('''
-+  'arm', 'aarch64', 'loongarch64', 'mips', 'mips64', 'sparc64']
++has_int128_type = cc.compiles('''
++  __int128_t a;
- cpu = host_machine.cpu_family()
++  __uint128_t b;
++  int main(void) { b = a; }''')
-@@ -XXX,XX +XXX,XX @@ if get_option('tcg').allowed()
++config_host_data.set('CONFIG_INT128_TYPE', has_int128_type)
-   endif
++
-   if get_option('tcg_interpreter')
++has_int128 = has_int128_type and cc.links('''
-     tcg_arch = 'tci'
+   __int128_t a;
--  elif host_arch == 'sparc64'
+   __uint128_t b;
--    tcg_arch = 'sparc'
+   int main (void) {
-   elif host_arch == 'x86_64'
+@@ -XXX,XX +XXX,XX @@ has_int128 = cc.links('''
-     tcg_arch = 'i386'
+     a = a * a;
-   elif host_arch == 'ppc64'
+     return 0;
-diff --git a/tcg/sparc/tcg-target-con-set.h b/tcg/sparc64/tcg-target-con-set.h
+   }''')
-similarity index 100%
+-
-rename from tcg/sparc/tcg-target-con-set.h
+ config_host_data.set('CONFIG_INT128', has_int128)
-rename to tcg/sparc64/tcg-target-con-set.h
-diff --git a/tcg/sparc/tcg-target-con-str.h b/tcg/sparc64/tcg-target-con-str.h
+-if has_int128
-similarity index 100%
++if has_int128_type
-rename from tcg/sparc/tcg-target-con-str.h
+   # "do we have 128-bit atomics which are handled inline and specifically not
-rename to tcg/sparc64/tcg-target-con-str.h
+   # via libatomic". The reason we can't use libatomic is documented in the
-diff --git a/tcg/sparc/tcg-target.h b/tcg/sparc64/tcg-target.h
+   # comment starting "GCC is a house divided" in include/qemu/atomic128.h.
-similarity index 100%
+@@ -XXX,XX +XXX,XX @@ if has_int128
-rename from tcg/sparc/tcg-target.h
+   # __alignof(unsigned __int128) for the host.
-rename to tcg/sparc64/tcg-target.h
+   atomic_test_128 = '''
-diff --git a/tcg/sparc/tcg-target.c.inc b/tcg/sparc64/tcg-target.c.inc
+     int main(int ac, char **av) {
-similarity index 100%
+-      unsigned __int128 *p = __builtin_assume_aligned(av[ac - 1], 16);
-rename from tcg/sparc/tcg-target.c.inc
++      __uint128_t *p = __builtin_assume_aligned(av[ac - 1], 16);
-rename to tcg/sparc64/tcg-target.c.inc
+       p[1] = __atomic_load_n(&p[0], __ATOMIC_RELAXED);
-diff --git a/MAINTAINERS b/MAINTAINERS
+       __atomic_store_n(&p[2], p[3], __ATOMIC_RELAXED);
        __atomic_compare_exchange_n(&p[4], &p[5], p[6], 0, __ATOMIC_RELAXED, __ATOMIC_RELAXED);
@@ -XXX,XX +XXX,XX @@ if has_int128
        config_host_data.set('CONFIG_CMPXCHG128', cc.links('''
          int main(void)
          {
 -          unsigned __int128 x = 0, y = 0;
 +          __uint128_t x = 0, y = 0;
            __sync_val_compare_and_swap_16(&x, y, x);
            return 0;
          }
 diff --git a/include/qemu/int128.h b/include/qemu/int128.h
 index XXXXXXX..XXXXXXX 100644
---- a/MAINTAINERS
+--- a/include/qemu/int128.h
-+++ b/MAINTAINERS
++++ b/include/qemu/int128.h
-@@ -XXX,XX +XXX,XX @@ L: qemu-s390x@nongnu.org
+@@ -XXX,XX +XXX,XX @@ static inline void bswap128s(Int128 *s)
+  * a possible structure and the native types.  Ease parameter passing
- SPARC TCG target
+  * via use of the transparent union extension.
- S: Odd Fixes
+  */
--F: tcg/sparc/
+-#ifdef CONFIG_INT128
-+F: tcg/sparc64/
++#ifdef CONFIG_INT128_TYPE
- F: disas/sparc.c
+ typedef union {
+     __uint128_t u;
- TCI TCG target
+     __int128_t i;
@@ -XXX,XX +XXX,XX @@ typedef union {
  } Int128Alias __attribute__((transparent_union));
  #else
  typedef Int128 Int128Alias;
 -#endif /* CONFIG_INT128 */
 +#endif /* CONFIG_INT128_TYPE */
  #endif /* INT128_H */
 --
 .34.1
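
The 03/27 commit message separates having the __int128_t type from being able to do arithmetic on it. A minimal sketch of the data-movement-only case, assuming a compiler that provides __int128_t, follows. SketchInt128 and sketch_copy_via_int128 are hypothetical names and not QEMU's Int128 implementation.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical two-word structure standing in for a struct-based Int128. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} SketchInt128;

/*
 * Copy a 128-bit value through __int128_t without any arithmetic,
 * so no compiler runtime helper for multiply or divide is required.
 */
static SketchInt128 sketch_copy_via_int128(const SketchInt128 *src)
{
    __int128_t tmp;
    SketchInt128 out;

    memcpy(&tmp, src, sizeof(tmp));   /* struct -> native 128-bit type */
    memcpy(&out, &tmp, sizeof(out));  /* native 128-bit type -> struct */
    return out;
}
```

Only assignment and copying are used here, which is the kind of use the split probe is meant to allow on compilers where 128-bit arithmetic is unavailable.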

-[PULL 01/11] tcg/sparc: Remove support for sparc32plus
+[PULL 04/27] qemu/atomic128: Add x86_64 atomic128-ldst.h
-Since 9b9c37c36439, we have only supported sparc64 cpus.
+With CPUINFO_ATOMIC_VMOVDQA, we can perform proper atomic
-Debian and Gentoo now only support 64-bit sparc64 userland,
+load/store without cmpxchg16b.
 so it is time to drop the 32-bit sparc64 userland: sparc32plus.
-Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
+Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
 Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
 Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
 ---
- tcg/sparc/tcg-target.h     |  11 ---
+ host/include/x86_64/host/atomic128-ldst.h | 68 +++++++++++++++++++++++
- tcg/tcg.c                  |  75 +----------------
+file changed, 68 insertions(+)
- tcg/sparc/tcg-target.c.inc | 166 +++++++------------------------------
+ create mode 100644 host/include/x86_64/host/atomic128-ldst.h
 files changed, 33 insertions(+), 219 deletions(-)
-diff --git a/tcg/sparc/tcg-target.h b/tcg/sparc/tcg-target.h
+diff --git a/host/include/x86_64/host/atomic128-ldst.h b/host/include/x86_64/host/atomic128-ldst.h
-index XXXXXXX..XXXXXXX 100644
+new file mode 100644
---- a/tcg/sparc/tcg-target.h
+index XXXXXXX..XXXXXXX
-+++ b/tcg/sparc/tcg-target.h
+--- /dev/null
 +++ b/host/include/x86_64/host/atomic128-ldst.h
 @@ -XXX,XX +XXX,XX @@
- #ifndef SPARC_TCG_TARGET_H
++/*
- #define SPARC_TCG_TARGET_H
++ * SPDX-License-Identifier: GPL-2.0-or-later
++ * Load/store for 128-bit atomic operations, x86_64 version.
--#define TCG_TARGET_REG_BITS 64
++ *
--
++ * Copyright (C) 2023 Linaro, Ltd.
- #define TCG_TARGET_INSN_UNIT_SIZE 4
++ *
- #define TCG_TARGET_TLB_DISPLACEMENT_BITS 32
++ * See docs/devel/atomics.rst for discussion about the guarantees each
- #define TCG_TARGET_NB_REGS 32
++ * atomic primitive is meant to provide.
-@@ -XXX,XX +XXX,XX @@ typedef enum {
++ */
- /* used for function call generation */
++
- #define TCG_REG_CALL_STACK TCG_REG_O6
++#ifndef AARCH64_ATOMIC128_LDST_H
++#define AARCH64_ATOMIC128_LDST_H
--#ifdef __arch64__
++
- #define TCG_TARGET_STACK_BIAS           2047
++#ifdef CONFIG_INT128_TYPE
- #define TCG_TARGET_STACK_ALIGN          16
++#include "host/cpuinfo.h"
- #define TCG_TARGET_CALL_STACK_OFFSET    (128 + 6*8 + TCG_TARGET_STACK_BIAS)
++#include "tcg/debug-assert.h"
--#else
++
--#define TCG_TARGET_STACK_BIAS           0
++/*
--#define TCG_TARGET_STACK_ALIGN          8
++ * Through clang 16, with -mcx16, __atomic_load_n is incorrectly
--#define TCG_TARGET_CALL_STACK_OFFSET    (64 + 4 + 6*4)
++ * expanded to a read-write operation: lock cmpxchg16b.
--#endif
++ */
--
++
--#ifdef __arch64__
++#define HAVE_ATOMIC128_RO  likely(cpuinfo & CPUINFO_ATOMIC_VMOVDQA)
- #define TCG_TARGET_EXTEND_ARGS 1
++#define HAVE_ATOMIC128_RW  1
--#endif
++
++static inline Int128 atomic16_read_ro(const Int128 *ptr)
- #if defined(__VIS__) && __VIS__ >= 0x300
++{
- #define use_vis3_instructions  1
++    Int128Alias r;
-diff --git a/tcg/tcg.c b/tcg/tcg.c
++
-index XXXXXXX..XXXXXXX 100644
++    tcg_debug_assert(HAVE_ATOMIC128_RO);
---- a/tcg/tcg.c
++    asm("vmovdqa %1, %0" : "=x" (r.i) : "m" (*ptr));
-+++ b/tcg/tcg.c
++
-@@ -XXX,XX +XXX,XX @@ void tcg_gen_callN(void *func, TCGTemp *ret, int nargs, TCGTemp **args)
++    return r.s;
-     }
++}
- #endif
++
++static inline Int128 atomic16_read_rw(Int128 *ptr)
--#if defined(__sparc__) && !defined(__arch64__) \
++{
--    && !defined(CONFIG_TCG_INTERPRETER)
++    __int128_t *ptr_align = __builtin_assume_aligned(ptr, 16);
--    /* We have 64-bit values in one register, but need to pass as two
++    Int128Alias r;
--       separate parameters.  Split them.  */
++
--    int orig_typemask = typemask;
++    if (HAVE_ATOMIC128_RO) {
--    int orig_nargs = nargs;
++        asm("vmovdqa %1, %0" : "=x" (r.i) : "m" (*ptr_align));
--    TCGv_i64 retl, reth;
++    } else {
--    TCGTemp *split_args[MAX_OPC_PARAM];
++        r.i = __sync_val_compare_and_swap_16(ptr_align, 0, 0);
--
++    }
--    retl = NULL;
++    return r.s;
--    reth = NULL;
++}
--    typemask = 0;
++
--    for (i = real_args = 0; i < nargs; ++i) {
++static inline void atomic16_set(Int128 *ptr, Int128 val)
--        int argtype = extract32(orig_typemask, (i + 1) * 3, 3);
++{
--        bool is_64bit = (argtype & ~1) == dh_typecode_i64;
++    __int128_t *ptr_align = __builtin_assume_aligned(ptr, 16);
--
++    Int128Alias new = { .s = val };
--        if (is_64bit) {
++
--            TCGv_i64 orig = temp_tcgv_i64(args[i]);
++    if (HAVE_ATOMIC128_RO) {
--            TCGv_i32 h = tcg_temp_new_i32();
++        asm("vmovdqa %1, %0" : "=m"(*ptr_align) : "x" (new.i));
--            TCGv_i32 l = tcg_temp_new_i32();
++    } else {
--            tcg_gen_extr_i64_i32(l, h, orig);
++        __int128_t old;
--            split_args[real_args++] = tcgv_i32_temp(h);
++        do {
--            typemask |= dh_typecode_i32 << (real_args * 3);
++            old = *ptr_align;
--            split_args[real_args++] = tcgv_i32_temp(l);
++        } while (!__sync_bool_compare_and_swap_16(ptr_align, old, new.i));
--            typemask |= dh_typecode_i32 << (real_args * 3);
++    }
--        } else {
++}
--            split_args[real_args++] = args[i];
++#else
--            typemask |= argtype << (real_args * 3);
++/* Provide QEMU_ERROR stubs. */
--        }
++#include "host/include/generic/host/atomic128-ldst.h"
 -    }
 -    nargs = real_args;
 -    args = split_args;
 -#elif defined(TCG_TARGET_EXTEND_ARGS) && TCG_TARGET_REG_BITS == 64
 +#if defined(TCG_TARGET_EXTEND_ARGS) && TCG_TARGET_REG_BITS == 64
      for (i = 0; i < nargs; ++i) {
          int argtype = extract32(typemask, (i + 1) * 3, 3);
          bool is_32bit = (argtype & ~1) == dh_typecode_i32;
@@ -XXX,XX +XXX,XX @@ void tcg_gen_callN(void *func, TCGTemp *ret, int nargs, TCGTemp **args)
      pi = 0;
      if (ret != NULL) {
 -#if defined(__sparc__) && !defined(__arch64__) \
 -    && !defined(CONFIG_TCG_INTERPRETER)
 -        if ((typemask & 6) == dh_typecode_i64) {
 -            /* The 32-bit ABI is going to return the 64-bit value in
 -               the %o0/%o1 register pair.  Prepare for this by using
 -               two return temporaries, and reassemble below.  */
 -            retl = tcg_temp_new_i64();
 -            reth = tcg_temp_new_i64();
 -            op->args[pi++] = tcgv_i64_arg(reth);
 -            op->args[pi++] = tcgv_i64_arg(retl);
 -            nb_rets = 2;
 -        } else {
 -            op->args[pi++] = temp_arg(ret);
 -            nb_rets = 1;
 -        }
 -#else
          if (TCG_TARGET_REG_BITS < 64 && (typemask & 6) == dh_typecode_i64) {
  #if HOST_BIG_ENDIAN
              op->args[pi++] = temp_arg(ret + 1);
@@ -XXX,XX +XXX,XX @@ void tcg_gen_callN(void *func, TCGTemp *ret, int nargs, TCGTemp **args)
              op->args[pi++] = temp_arg(ret);
              nb_rets = 1;
          }
 -#endif
      } else {
          nb_rets = 0;
      }
@@ -XXX,XX +XXX,XX @@ void tcg_gen_callN(void *func, TCGTemp *ret, int nargs, TCGTemp **args)
      tcg_debug_assert(TCGOP_CALLI(op) == real_args);
      tcg_debug_assert(pi <= ARRAY_SIZE(op->args));
 -#if defined(__sparc__) && !defined(__arch64__) \
 -    && !defined(CONFIG_TCG_INTERPRETER)
 -    /* Free all of the parts we allocated above.  */
 -    for (i = real_args = 0; i < orig_nargs; ++i) {
 -        int argtype = extract32(orig_typemask, (i + 1) * 3, 3);
 -        bool is_64bit = (argtype & ~1) == dh_typecode_i64;
 -
 -        if (is_64bit) {
 -            tcg_temp_free_internal(args[real_args++]);
 -            tcg_temp_free_internal(args[real_args++]);
 -        } else {
 -            real_args++;
 -        }
 -    }
 -    if ((orig_typemask & 6) == dh_typecode_i64) {
 -        /* The 32-bit ABI returned two 32-bit pieces.  Re-assemble them.
 -           Note that describing these as TCGv_i64 eliminates an unnecessary
 -           zero-extension that tcg_gen_concat_i32_i64 would create.  */
 -        tcg_gen_concat32_i64(temp_tcgv_i64(ret), retl, reth);
 -        tcg_temp_free_i64(retl);
 -        tcg_temp_free_i64(reth);
 -    }
 -#elif defined(TCG_TARGET_EXTEND_ARGS) && TCG_TARGET_REG_BITS == 64
 +#if defined(TCG_TARGET_EXTEND_ARGS) && TCG_TARGET_REG_BITS == 64
      for (i = 0; i < nargs; ++i) {
          int argtype = extract32(typemask, (i + 1) * 3, 3);
          bool is_32bit = (argtype & ~1) == dh_typecode_i32;
 diff --git a/tcg/sparc/tcg-target.c.inc b/tcg/sparc/tcg-target.c.inc
 index XXXXXXX..XXXXXXX 100644
 --- a/tcg/sparc/tcg-target.c.inc
 +++ b/tcg/sparc/tcg-target.c.inc
@@ -XXX,XX +XXX,XX @@
   * THE SOFTWARE.
   */
 +/* We only support generating code for 64-bit mode.  */
 +#ifndef __arch64__
 +#error "unsupported code generation mode"
 +#endif
 +
- #include "../tcg-pool.c.inc"
++#endif /* AARCH64_ATOMIC128_LDST_H */
  #ifdef CONFIG_DEBUG_TCG
@@ -XXX,XX +XXX,XX @@ static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
  };
  #endif
 -#ifdef __arch64__
 -# define SPARC64 1
 -#else
 -# define SPARC64 0
 -#endif
 -
  #define TCG_CT_CONST_S11  0x100
  #define TCG_CT_CONST_S13  0x200
  #define TCG_CT_CONST_ZERO 0x400
@@ -XXX,XX +XXX,XX @@ static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
   * high bits of the %i and %l registers garbage at all times.
   */
  #define ALL_GENERAL_REGS     MAKE_64BIT_MASK(0, 32)
 -#if SPARC64
  # define ALL_GENERAL_REGS64  ALL_GENERAL_REGS
 -#else
 -# define ALL_GENERAL_REGS64  MAKE_64BIT_MASK(0, 16)
 -#endif
  #define ALL_QLDST_REGS       (ALL_GENERAL_REGS & ~SOFTMMU_RESERVE_REGS)
  #define ALL_QLDST_REGS64     (ALL_GENERAL_REGS64 & ~SOFTMMU_RESERVE_REGS)
@@ -XXX,XX +XXX,XX @@ static bool check_fit_i32(int32_t val, unsigned int bits)
  }
  #define check_fit_tl    check_fit_i64
 -#if SPARC64
 -# define check_fit_ptr  check_fit_i64
 -#else
 -# define check_fit_ptr  check_fit_i32
 -#endif
 +#define check_fit_ptr   check_fit_i64
  static bool patch_reloc(tcg_insn_unit *src_rw, int type,
                          intptr_t value, intptr_t addend)
@@ -XXX,XX +XXX,XX @@ static void tcg_out_sety(TCGContext *s, TCGReg rs)
      tcg_out32(s, WRY | INSN_RS1(TCG_REG_G0) | INSN_RS2(rs));
  }
 -static void tcg_out_rdy(TCGContext *s, TCGReg rd)
 -{
 -    tcg_out32(s, RDY | INSN_RD(rd));
 -}
 -
  static void tcg_out_div32(TCGContext *s, TCGReg rd, TCGReg rs1,
                            int32_t val2, int val2const, int uns)
  {
@@ -XXX,XX +XXX,XX @@ static void emit_extend(TCGContext *s, TCGReg r, int op)
          tcg_out_arithi(s, r, r, 16, SHIFT_SRL);
          break;
      case MO_32:
 -        if (SPARC64) {
 -            tcg_out_arith(s, r, r, 0, SHIFT_SRL);
 -        }
 +        tcg_out_arith(s, r, r, 0, SHIFT_SRL);
          break;
      case MO_64:
          break;
@@ -XXX,XX +XXX,XX @@ static void build_trampolines(TCGContext *s)
      };
      int i;
 -    TCGReg ra;
      for (i = 0; i < ARRAY_SIZE(qemu_ld_helpers); ++i) {
          if (qemu_ld_helpers[i] == NULL) {
@@ -XXX,XX +XXX,XX @@ static void build_trampolines(TCGContext *s)
          }
          qemu_ld_trampoline[i] = tcg_splitwx_to_rx(s->code_ptr);
 -        if (SPARC64 || TARGET_LONG_BITS == 32) {
 -            ra = TCG_REG_O3;
 -        } else {
 -            /* Install the high part of the address.  */
 -            tcg_out_arithi(s, TCG_REG_O1, TCG_REG_O2, 32, SHIFT_SRLX);
 -            ra = TCG_REG_O4;
 -        }
 -
          /* Set the retaddr operand.  */
 -        tcg_out_mov(s, TCG_TYPE_PTR, ra, TCG_REG_O7);
 +        tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_O3, TCG_REG_O7);
          /* Tail call.  */
          tcg_out_jmpl_const(s, qemu_ld_helpers[i], true, true);
          /* delay slot -- set the env argument */
@@ -XXX,XX +XXX,XX @@ static void build_trampolines(TCGContext *s)
          }
          qemu_st_trampoline[i] = tcg_splitwx_to_rx(s->code_ptr);
 -        if (SPARC64) {
 -            emit_extend(s, TCG_REG_O2, i);
 -            ra = TCG_REG_O4;
 -        } else {
 -            ra = TCG_REG_O1;
 -            if (TARGET_LONG_BITS == 64) {
 -                /* Install the high part of the address.  */
 -                tcg_out_arithi(s, ra, ra + 1, 32, SHIFT_SRLX);
 -                ra += 2;
 -            } else {
 -                ra += 1;
 -            }
 -            if ((i & MO_SIZE) == MO_64) {
 -                /* Install the high part of the data.  */
 -                tcg_out_arithi(s, ra, ra + 1, 32, SHIFT_SRLX);
 -                ra += 2;
 -            } else {
 -                emit_extend(s, ra, i);
 -                ra += 1;
 -            }
 -            /* Skip the oi argument.  */
 -            ra += 1;
 -        }
 -
 +        emit_extend(s, TCG_REG_O2, i);
 +
          /* Set the retaddr operand.  */
 -        if (ra >= TCG_REG_O6) {
 -            tcg_out_st(s, TCG_TYPE_PTR, TCG_REG_O7, TCG_REG_CALL_STACK,
 -                       TCG_TARGET_CALL_STACK_OFFSET);
 -        } else {
 -            tcg_out_mov(s, TCG_TYPE_PTR, ra, TCG_REG_O7);
 -        }
 +        tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_O4, TCG_REG_O7);
          /* Tail call.  */
          tcg_out_jmpl_const(s, qemu_st_helpers[i], true, true);
@@ -XXX,XX +XXX,XX @@ static void build_trampolines(TCGContext *s)
              qemu_unalign_st_trampoline = tcg_splitwx_to_rx(s->code_ptr);
          }
 -        if (!SPARC64 && TARGET_LONG_BITS == 64) {
 -            /* Install the high part of the address.  */
 -            tcg_out_arithi(s, TCG_REG_O1, TCG_REG_O2, 32, SHIFT_SRLX);
 -        }
 -
          /* Tail call.  */
          tcg_out_jmpl_const(s, helper, true, true);
          /* delay slot -- set the env argument */
@@ -XXX,XX +XXX,XX @@ static TCGReg tcg_out_tlb_load(TCGContext *s, TCGReg addr, int mem_index,
      tcg_out_cmp(s, r0, r2, 0);
      /* If the guest address must be zero-extended, do so now.  */
 -    if (SPARC64 && TARGET_LONG_BITS == 32) {
 +    if (TARGET_LONG_BITS == 32) {
          tcg_out_arithi(s, r0, addr, 0, SHIFT_SRL);
          return r0;
      }
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_ld(TCGContext *s, TCGReg data, TCGReg addr,
  #ifdef CONFIG_SOFTMMU
      unsigned memi = get_mmuidx(oi);
 -    TCGReg addrz, param;
 +    TCGReg addrz;
      const tcg_insn_unit *func;
      addrz = tcg_out_tlb_load(s, addr, memi, memop,
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_ld(TCGContext *s, TCGReg data, TCGReg addr,
      /* TLB Miss.  */
 -    param = TCG_REG_O1;
 -    if (!SPARC64 && TARGET_LONG_BITS == 64) {
 -        /* Skip the high-part; we'll perform the extract in the trampoline.  */
 -        param++;
 -    }
 -    tcg_out_mov(s, TCG_TYPE_REG, param++, addrz);
 +    tcg_out_mov(s, TCG_TYPE_REG, TCG_REG_O1, addrz);
      /* We use the helpers to extend SB and SW data, leaving the case
         of SL needing explicit extending below.  */
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_ld(TCGContext *s, TCGReg data, TCGReg addr,
      tcg_debug_assert(func != NULL);
      tcg_out_call_nodelay(s, func, false);
      /* delay slot */
 -    tcg_out_movi(s, TCG_TYPE_I32, param, oi);
 +    tcg_out_movi(s, TCG_TYPE_I32, TCG_REG_O2, oi);
 -    /* Recall that all of the helpers return 64-bit results.
 -       Which complicates things for sparcv8plus.  */
 -    if (SPARC64) {
 -        /* We let the helper sign-extend SB and SW, but leave SL for here.  */
 -        if (is_64 && (memop & MO_SSIZE) == MO_SL) {
 -            tcg_out_arithi(s, data, TCG_REG_O0, 0, SHIFT_SRA);
 -        } else {
 -            tcg_out_mov(s, TCG_TYPE_REG, data, TCG_REG_O0);
 -        }
 +    /* We let the helper sign-extend SB and SW, but leave SL for here.  */
 +    if (is_64 && (memop & MO_SSIZE) == MO_SL) {
 +        tcg_out_arithi(s, data, TCG_REG_O0, 0, SHIFT_SRA);
      } else {
 -        if ((memop & MO_SIZE) == MO_64) {
 -            tcg_out_arithi(s, TCG_REG_O0, TCG_REG_O0, 32, SHIFT_SLLX);
 -            tcg_out_arithi(s, TCG_REG_O1, TCG_REG_O1, 0, SHIFT_SRL);
 -            tcg_out_arith(s, data, TCG_REG_O0, TCG_REG_O1, ARITH_OR);
 -        } else if (is_64) {
 -            /* Re-extend from 32-bit rather than reassembling when we
 -               know the high register must be an extension.  */
 -            tcg_out_arithi(s, data, TCG_REG_O1, 0,
 -                           memop & MO_SIGN ? SHIFT_SRA : SHIFT_SRL);
 -        } else {
 -            tcg_out_mov(s, TCG_TYPE_I32, data, TCG_REG_O1);
 -        }
 +        tcg_out_mov(s, TCG_TYPE_REG, data, TCG_REG_O0);
      }
      *label_ptr |= INSN_OFF19(tcg_ptr_byte_diff(s->code_ptr, label_ptr));
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_ld(TCGContext *s, TCGReg data, TCGReg addr,
      unsigned s_bits = memop & MO_SIZE;
      unsigned t_bits;
 -    if (SPARC64 && TARGET_LONG_BITS == 32) {
 +    if (TARGET_LONG_BITS == 32) {
          tcg_out_arithi(s, TCG_REG_T1, addr, 0, SHIFT_SRL);
          addr = TCG_REG_T1;
      }
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_ld(TCGContext *s, TCGReg data, TCGReg addr,
           * operation in the delay slot, and failure need only invoke the
           * handler for SIGBUS.
           */
 -        TCGReg arg_low = TCG_REG_O1 + (!SPARC64 && TARGET_LONG_BITS == 64);
          tcg_out_call_nodelay(s, qemu_unalign_ld_trampoline, false);
          /* delay slot -- move to low part of argument reg */
 -        tcg_out_mov_delay(s, arg_low, addr);
 +        tcg_out_mov_delay(s, TCG_REG_O1, addr);
      } else {
          /* Underalignment: load by pieces of minimum alignment. */
          int ld_opc, a_size, s_size, i;
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_st(TCGContext *s, TCGReg data, TCGReg addr,
  #ifdef CONFIG_SOFTMMU
      unsigned memi = get_mmuidx(oi);
 -    TCGReg addrz, param;
 +    TCGReg addrz;
      const tcg_insn_unit *func;
      addrz = tcg_out_tlb_load(s, addr, memi, memop,
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_st(TCGContext *s, TCGReg data, TCGReg addr,
      /* TLB Miss.  */
 -    param = TCG_REG_O1;
 -    if (!SPARC64 && TARGET_LONG_BITS == 64) {
 -        /* Skip the high-part; we'll perform the extract in the trampoline.  */
 -        param++;
 -    }
 -    tcg_out_mov(s, TCG_TYPE_REG, param++, addrz);
 -    if (!SPARC64 && (memop & MO_SIZE) == MO_64) {
 -        /* Skip the high-part; we'll perform the extract in the trampoline.  */
 -        param++;
 -    }
 -    tcg_out_mov(s, TCG_TYPE_REG, param++, data);
 +    tcg_out_mov(s, TCG_TYPE_REG, TCG_REG_O1, addrz);
 +    tcg_out_mov(s, TCG_TYPE_REG, TCG_REG_O2, data);
      func = qemu_st_trampoline[memop & (MO_BSWAP | MO_SIZE)];
      tcg_debug_assert(func != NULL);
      tcg_out_call_nodelay(s, func, false);
      /* delay slot */
 -    tcg_out_movi(s, TCG_TYPE_I32, param, oi);
 +    tcg_out_movi(s, TCG_TYPE_I32, TCG_REG_O3, oi);
      *label_ptr |= INSN_OFF19(tcg_ptr_byte_diff(s->code_ptr, label_ptr));
  #else
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_st(TCGContext *s, TCGReg data, TCGReg addr,
      unsigned s_bits = memop & MO_SIZE;
      unsigned t_bits;
 -    if (SPARC64 && TARGET_LONG_BITS == 32) {
 +    if (TARGET_LONG_BITS == 32) {
          tcg_out_arithi(s, TCG_REG_T1, addr, 0, SHIFT_SRL);
          addr = TCG_REG_T1;
      }
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_st(TCGContext *s, TCGReg data, TCGReg addr,
           * operation in the delay slot, and failure need only invoke the
           * handler for SIGBUS.
           */
 -        TCGReg arg_low = TCG_REG_O1 + (!SPARC64 && TARGET_LONG_BITS == 64);
          tcg_out_call_nodelay(s, qemu_unalign_st_trampoline, false);
          /* delay slot -- move to low part of argument reg */
 -        tcg_out_mov_delay(s, arg_low, addr);
 +        tcg_out_mov_delay(s, TCG_REG_O1, addr);
      } else {
          /* Underalignment: store by pieces of minimum alignment. */
          int st_opc, a_size, s_size, i;
@@ -XXX,XX +XXX,XX @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
      case INDEX_op_muls2_i32:
          c = ARITH_SMUL;
      do_mul2:
 -        /* The 32-bit multiply insns produce a full 64-bit result.  If the
 -           destination register can hold it, we can avoid the slower RDY.  */
 +        /* The 32-bit multiply insns produce a full 64-bit result. */
          tcg_out_arithc(s, a0, a2, args[3], const_args[3], c);
 -        if (SPARC64 || a0 <= TCG_REG_O7) {
 -            tcg_out_arithi(s, a1, a0, 32, SHIFT_SRLX);
 -        } else {
 -            tcg_out_rdy(s, a1);
 -        }
 +        tcg_out_arithi(s, a1, a0, 32, SHIFT_SRLX);
          break;
      case INDEX_op_qemu_ld_i32:
@@ -XXX,XX +XXX,XX @@ static void tcg_target_init(TCGContext *s)
      tcg_regset_set_reg(s->reserved_regs, TCG_REG_T2); /* for internal use */
  }
 -#if SPARC64
 -# define ELF_HOST_MACHINE  EM_SPARCV9
 -#else
 -# define ELF_HOST_MACHINE  EM_SPARC32PLUS
 -# define ELF_HOST_FLAGS    EF_SPARC_32PLUS
 -#endif
 +#define ELF_HOST_MACHINE  EM_SPARCV9
  typedef struct {
      DebugFrameHeader h;
 -    uint8_t fde_def_cfa[SPARC64 ? 4 : 2];
 +    uint8_t fde_def_cfa[4];
      uint8_t fde_win_save;
      uint8_t fde_ret_save[3];
  } DebugFrame;
@@ -XXX,XX +XXX,XX @@ static const DebugFrame debug_frame = {
      .h.fde.len = sizeof(DebugFrame) - offsetof(DebugFrame, h.fde.cie_offset),
      .fde_def_cfa = {
 -#if SPARC64
          12, 30,                         /* DW_CFA_def_cfa i6, 2047 */
          (2047 & 0x7f) | 0x80, (2047 >> 7)
 -#else
 -        13, 30                          /* DW_CFA_def_cfa_register i6 */
 -#endif
      },
      .fde_win_save = 0x2d,               /* DW_CFA_GNU_window_save */
      .fde_ret_save = { 9, 15, 31 },      /* DW_CFA_register o7, i7 */
 --
 .34.1
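
The `fde_def_cfa` bytes kept by the patch above encode the offset 2047 (the SPARC V9 stack bias) as a two-byte unsigned LEB128 value, written by hand as `(2047 & 0x7f) | 0x80, (2047 >> 7)`. As a minimal sketch of that encoding, the helper below produces the same bytes for any value. The name `uleb128_encode` is hypothetical and is not part of QEMU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encode VALUE as unsigned LEB128 into OUT and return the byte count.
 * OUT must have room for 10 bytes, enough for any 64-bit value.
 * Hypothetical helper for illustration only; not a QEMU function.
 */
static size_t uleb128_encode(uint64_t value, uint8_t *out)
{
    size_t n = 0;

    assert(out != NULL);
    do {
        uint8_t byte = value & 0x7f;    /* take the low seven bits */
        value >>= 7;
        if (value != 0) {
            byte |= 0x80;               /* continuation: more bytes follow */
        }
        out[n++] = byte;
    } while (value != 0);
    return n;
}
```

For 2047 the loop emits the low seven bits with the continuation bit set, then the remaining four bits, which matches the two hand-written initializers in `debug_frame`.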

-New patch
+[PULL 05/27] tcg/i386: Support 128-bit load/store
+Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
+Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
+---
+ tcg/i386/tcg-target.h     |   4 +-
+ tcg/i386/tcg-target.c.inc | 191 +++++++++++++++++++++++++++++++++++++-
+ 2 files changed, 190 insertions(+), 5 deletions(-)
+diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
+index XXXXXXX..XXXXXXX 100644
+--- a/tcg/i386/tcg-target.h
++++ b/tcg/i386/tcg-target.h
+@@ -XXX,XX +XXX,XX @@ typedef enum {
+ #define have_avx1         (cpuinfo & CPUINFO_AVX1)
+ #define have_avx2         (cpuinfo & CPUINFO_AVX2)
+ #define have_movbe        (cpuinfo & CPUINFO_MOVBE)
+-#define have_atomic16     (cpuinfo & CPUINFO_ATOMIC_VMOVDQA)
+ /*
+  * There are interesting instructions in AVX512, so long as we have AVX512VL,
+@@ -XXX,XX +XXX,XX @@ typedef enum {
+ #define TCG_TARGET_HAS_qemu_st8_i32     1
+ #endif
+-#define TCG_TARGET_HAS_qemu_ldst_i128   0
++#define TCG_TARGET_HAS_qemu_ldst_i128 \
++    (TCG_TARGET_REG_BITS == 64 && (cpuinfo & CPUINFO_ATOMIC_VMOVDQA))
+ /* We do not support older SSE systems, only beginning with AVX1.  */
+ #define TCG_TARGET_HAS_v64              have_avx1
+diff --git a/tcg/i386/tcg-target.c.inc b/tcg/i386/tcg-target.c.inc
+index XXXXXXX..XXXXXXX 100644
+--- a/tcg/i386/tcg-target.c.inc
++++ b/tcg/i386/tcg-target.c.inc
+@@ -XXX,XX +XXX,XX @@ static const int tcg_target_reg_alloc_order[] = {
+ #endif
+ };
++#define TCG_TMP_VEC  TCG_REG_XMM5
++
+ static const int tcg_target_call_iarg_regs[] = {
+ #if TCG_TARGET_REG_BITS == 64
+ #if defined(_WIN64)
+@@ -XXX,XX +XXX,XX @@ static bool tcg_target_const_match(int64_t val, TCGType type, int ct)
+ #define OPC_PCMPGTW     (0x65 | P_EXT | P_DATA16)
+ #define OPC_PCMPGTD     (0x66 | P_EXT | P_DATA16)
+ #define OPC_PCMPGTQ     (0x37 | P_EXT38 | P_DATA16)
++#define OPC_PEXTRD      (0x16 | P_EXT3A | P_DATA16)
++#define OPC_PINSRD      (0x22 | P_EXT3A | P_DATA16)
+ #define OPC_PMAXSB      (0x3c | P_EXT38 | P_DATA16)
+ #define OPC_PMAXSW      (0xee | P_EXT | P_DATA16)
+ #define OPC_PMAXSD      (0x3d | P_EXT38 | P_DATA16)
+@@ -XXX,XX +XXX,XX @@ typedef struct {
+ bool tcg_target_has_memory_bswap(MemOp memop)
+ {
+-    return have_movbe;
++    TCGAtomAlign aa;
++
++    if (!have_movbe) {
++        return false;
++    }
++    if ((memop & MO_SIZE) < MO_128) {
++        return true;
++    }
++
++    /*
++     * Reject 16-byte memop with 16-byte atomicity, i.e. VMOVDQA,
++     * but do allow a pair of 64-bit operations, i.e. MOVBEQ.
++     */
++    aa = atom_and_align_for_opc(tcg_ctx, memop, MO_ATOM_IFALIGN, true);
++    return aa.atom < MO_128;
+ }
+ /*
+@@ -XXX,XX +XXX,XX @@ static const TCGLdstHelperParam ldst_helper_param = {
+ static const TCGLdstHelperParam ldst_helper_param = { };
+ #endif
++static void tcg_out_vec_to_pair(TCGContext *s, TCGType type,
++                                TCGReg l, TCGReg h, TCGReg v)
++{
++    int rexw = type == TCG_TYPE_I32 ? 0 : P_REXW;
++
+    /* vmov{d,q} %v, %l */
++    tcg_out_vex_modrm(s, OPC_MOVD_EyVy + rexw, v, 0, l);
++    /* vpextr{d,q} $1, %v, %h */
++    tcg_out_vex_modrm(s, OPC_PEXTRD + rexw, v, 0, h);
++    tcg_out8(s, 1);
++}
++
++static void tcg_out_pair_to_vec(TCGContext *s, TCGType type,
++                                TCGReg v, TCGReg l, TCGReg h)
++{
++    int rexw = type == TCG_TYPE_I32 ? 0 : P_REXW;
++
++    /* vmov{d,q} %l, %v */
++    tcg_out_vex_modrm(s, OPC_MOVD_VyEy + rexw, v, 0, l);
++    /* vpinsr{d,q} $1, %h, %v, %v */
++    tcg_out_vex_modrm(s, OPC_PINSRD + rexw, v, v, h);
++    tcg_out8(s, 1);
++}
++
+ /*
+  * Generate code for the slow path for a load at the end of block
+  */
+@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
+ {
+     TCGLabelQemuLdst *ldst = NULL;
+     MemOp opc = get_memop(oi);
++    MemOp s_bits = opc & MO_SIZE;
+     unsigned a_mask;
+ #ifdef CONFIG_SOFTMMU
+@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
+     *h = x86_guest_base;
+ #endif
+     h->base = addrlo;
+-    h->aa = atom_and_align_for_opc(s, opc, MO_ATOM_IFALIGN, false);
++    h->aa = atom_and_align_for_opc(s, opc, MO_ATOM_IFALIGN, s_bits == MO_128);
+     a_mask = (1 << h->aa.align) - 1;
+ #ifdef CONFIG_SOFTMMU
+@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
+     TCGType tlbtype = TCG_TYPE_I32;
+     int trexw = 0, hrexw = 0, tlbrexw = 0;
+     unsigned mem_index = get_mmuidx(oi);
+-    unsigned s_bits = opc & MO_SIZE;
+     unsigned s_mask = (1 << s_bits) - 1;
+     int tlb_mask;
+@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_ld_direct(TCGContext *s, TCGReg datalo, TCGReg datahi,
+                                      h.base, h.index, 0, h.ofs + 4);
+         }
+         break;
++
++    case MO_128:
++        tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
++
++        /*
++         * Without 16-byte atomicity, use integer regs.
++         * That is where we want the data, and it allows bswaps.
++         */
++        if (h.aa.atom < MO_128) {
++            if (use_movbe) {
++                TCGReg t = datalo;
++                datalo = datahi;
++                datahi = t;
++            }
++            if (h.base == datalo || h.index == datalo) {
++                tcg_out_modrm_sib_offset(s, OPC_LEA + P_REXW, datahi,
++                                         h.base, h.index, 0, h.ofs);
++                tcg_out_modrm_offset(s, movop + P_REXW + h.seg,
++                                     datalo, datahi, 0);
++                tcg_out_modrm_offset(s, movop + P_REXW + h.seg,
++                                     datahi, datahi, 8);
++            } else {
++                tcg_out_modrm_sib_offset(s, movop + P_REXW + h.seg, datalo,
++                                         h.base, h.index, 0, h.ofs);
++                tcg_out_modrm_sib_offset(s, movop + P_REXW + h.seg, datahi,
++                                         h.base, h.index, 0, h.ofs + 8);
++            }
++            break;
++        }
++
++        /*
++         * With 16-byte atomicity, a vector load is required.
++         * If we already have 16-byte alignment, then VMOVDQA always works.
++         * Else if VMOVDQU has atomicity with dynamic alignment, use that.
+         * Else we require a runtime test for alignment for VMOVDQA;
++         * use VMOVDQU on the unaligned nonatomic path for simplicity.
++         */
++        if (h.aa.align >= MO_128) {
++            tcg_out_vex_modrm_sib_offset(s, OPC_MOVDQA_VxWx + h.seg,
++                                         TCG_TMP_VEC, 0,
++                                         h.base, h.index, 0, h.ofs);
++        } else if (cpuinfo & CPUINFO_ATOMIC_VMOVDQU) {
++            tcg_out_vex_modrm_sib_offset(s, OPC_MOVDQU_VxWx + h.seg,
++                                         TCG_TMP_VEC, 0,
++                                         h.base, h.index, 0, h.ofs);
++        } else {
++            TCGLabel *l1 = gen_new_label();
++            TCGLabel *l2 = gen_new_label();
++
++            tcg_out_testi(s, h.base, 15);
++            tcg_out_jxx(s, JCC_JNE, l1, true);
++
++            tcg_out_vex_modrm_sib_offset(s, OPC_MOVDQA_VxWx + h.seg,
++                                         TCG_TMP_VEC, 0,
++                                         h.base, h.index, 0, h.ofs);
++            tcg_out_jxx(s, JCC_JMP, l2, true);
++
++            tcg_out_label(s, l1);
++            tcg_out_vex_modrm_sib_offset(s, OPC_MOVDQU_VxWx + h.seg,
++                                         TCG_TMP_VEC, 0,
++                                         h.base, h.index, 0, h.ofs);
++            tcg_out_label(s, l2);
++        }
++        tcg_out_vec_to_pair(s, TCG_TYPE_I64, datalo, datahi, TCG_TMP_VEC);
++        break;
++
+     default:
+         g_assert_not_reached();
+     }
+@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_st_direct(TCGContext *s, TCGReg datalo, TCGReg datahi,
+                                      h.base, h.index, 0, h.ofs + 4);
+         }
+         break;
++
++    case MO_128:
++        tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
++
++        /*
++         * Without 16-byte atomicity, use integer regs.
++         * That is where we have the data, and it allows bswaps.
++         */
++        if (h.aa.atom < MO_128) {
++            if (use_movbe) {
++                TCGReg t = datalo;
++                datalo = datahi;
++                datahi = t;
++            }
++            tcg_out_modrm_sib_offset(s, movop + P_REXW + h.seg, datalo,
++                                     h.base, h.index, 0, h.ofs);
++            tcg_out_modrm_sib_offset(s, movop + P_REXW + h.seg, datahi,
++                                     h.base, h.index, 0, h.ofs + 8);
++            break;
++        }
++
++        /*
++         * With 16-byte atomicity, a vector store is required.
++         * If we already have 16-byte alignment, then VMOVDQA always works.
++         * Else if VMOVDQU has atomicity with dynamic alignment, use that.
+         * Else we require a runtime test for alignment for VMOVDQA;
++         * use VMOVDQU on the unaligned nonatomic path for simplicity.
++         */
++        tcg_out_pair_to_vec(s, TCG_TYPE_I64, TCG_TMP_VEC, datalo, datahi);
++        if (h.aa.align >= MO_128) {
++            tcg_out_vex_modrm_sib_offset(s, OPC_MOVDQA_WxVx + h.seg,
++                                         TCG_TMP_VEC, 0,
++                                         h.base, h.index, 0, h.ofs);
++        } else if (cpuinfo & CPUINFO_ATOMIC_VMOVDQU) {
++            tcg_out_vex_modrm_sib_offset(s, OPC_MOVDQU_WxVx + h.seg,
++                                         TCG_TMP_VEC, 0,
++                                         h.base, h.index, 0, h.ofs);
++        } else {
++            TCGLabel *l1 = gen_new_label();
++            TCGLabel *l2 = gen_new_label();
++
++            tcg_out_testi(s, h.base, 15);
++            tcg_out_jxx(s, JCC_JNE, l1, true);
++
++            tcg_out_vex_modrm_sib_offset(s, OPC_MOVDQA_WxVx + h.seg,
++                                         TCG_TMP_VEC, 0,
++                                         h.base, h.index, 0, h.ofs);
++            tcg_out_jxx(s, JCC_JMP, l2, true);
++
++            tcg_out_label(s, l1);
++            tcg_out_vex_modrm_sib_offset(s, OPC_MOVDQU_WxVx + h.seg,
++                                         TCG_TMP_VEC, 0,
++                                         h.base, h.index, 0, h.ofs);
++            tcg_out_label(s, l2);
++        }
++        break;
++
+     default:
+         g_assert_not_reached();
+     }
+@@ -XXX,XX +XXX,XX @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
+             tcg_out_qemu_ld(s, a0, a1, a2, args[3], args[4], TCG_TYPE_I64);
+         }
+         break;
++    case INDEX_op_qemu_ld_a32_i128:
++    case INDEX_op_qemu_ld_a64_i128:
++        tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
++        tcg_out_qemu_ld(s, a0, a1, a2, -1, args[3], TCG_TYPE_I128);
++        break;
+     case INDEX_op_qemu_st_a64_i32:
+     case INDEX_op_qemu_st8_a64_i32:
+@@ -XXX,XX +XXX,XX @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
+             tcg_out_qemu_st(s, a0, a1, a2, args[3], args[4], TCG_TYPE_I64);
+         }
+         break;
++    case INDEX_op_qemu_st_a32_i128:
++    case INDEX_op_qemu_st_a64_i128:
++        tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
++        tcg_out_qemu_st(s, a0, a1, a2, -1, args[3], TCG_TYPE_I128);
++        break;
+     OP_32_64(mulu2):
+         tcg_out_modrm(s, OPC_GRP3_Ev + rexw, EXT3_MUL, args[3]);
+@@ -XXX,XX +XXX,XX @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
+     case INDEX_op_qemu_st_a64_i64:
+         return TCG_TARGET_REG_BITS == 64 ? C_O0_I2(L, L) : C_O0_I4(L, L, L, L);
++    case INDEX_op_qemu_ld_a32_i128:
++    case INDEX_op_qemu_ld_a64_i128:
++        tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
++        return C_O2_I1(r, r, L);
++    case INDEX_op_qemu_st_a32_i128:
++    case INDEX_op_qemu_st_a64_i128:
++        tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
++        return C_O0_I3(L, L, L);
++
+     case INDEX_op_brcond2_i32:
+         return C_O0_I4(r, r, ri, ri);
+@@ -XXX,XX +XXX,XX @@ static void tcg_target_init(TCGContext *s)
+     s->reserved_regs = 0;
+     tcg_regset_set_reg(s->reserved_regs, TCG_REG_CALL_STACK);
++    tcg_regset_set_reg(s->reserved_regs, TCG_TMP_VEC);
+ #ifdef _WIN64
+     /* These are call saved, and we don't save them, so don't use them. */
+     tcg_regset_set_reg(s->reserved_regs, TCG_REG_XMM6);
+--
+.34.1
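
The `MO_128` cases in the patch above pick one of four code sequences from the required atomicity, the known alignment, and a host CPU feature bit. The sketch below restates that decision order as a pure function so it can be read in isolation. The enum, the function name and the `LOG2_16_BYTES` constant are hypothetical stand-ins (the patch compares against `MO_128` and tests `CPUINFO_ATOMIC_VMOVDQU`); nothing here is QEMU API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the four sequences emitted by the patch. */
enum ldst128_plan {
    PLAN_INT_PAIR,      /* two 64-bit integer moves; allows MOVBE */
    PLAN_VMOVDQA,       /* alignment known statically */
    PLAN_VMOVDQU,       /* host guarantees atomic VMOVDQU when aligned */
    PLAN_RUNTIME_TEST   /* test low 4 address bits, then VMOVDQA or VMOVDQU */
};

/* log2 of 16 bytes; stands in for MO_128 in this sketch. */
#define LOG2_16_BYTES 4u

/*
 * ATOM_LOG2 and ALIGN_LOG2 are log2 byte sizes, mirroring the aa.atom
 * and aa.align fields used in the patch.
 */
static enum ldst128_plan choose_ldst128_plan(unsigned atom_log2,
                                             unsigned align_log2,
                                             bool atomic_vmovdqu)
{
    assert(atom_log2 <= LOG2_16_BYTES && align_log2 <= LOG2_16_BYTES);

    if (atom_log2 < LOG2_16_BYTES) {
        return PLAN_INT_PAIR;           /* no 16-byte atomicity required */
    }
    if (align_log2 >= LOG2_16_BYTES) {
        return PLAN_VMOVDQA;            /* aligned: VMOVDQA always works */
    }
    if (atomic_vmovdqu) {
        return PLAN_VMOVDQU;            /* atomic when dynamically aligned */
    }
    return PLAN_RUNTIME_TEST;           /* branch on alignment at run time */
}
```

The same ordering applies to both loads and stores in the patch; only the direction of the vector move and the placement of the pair/vector conversion differ.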

-[PULL 09/11] accel/tcg: Remove will_exit argument from cpu_restore_state
+[PULL 06/27] tcg/aarch64: Rename temporaries
-The value passed is always true, and if the target's
+We will need to allocate a second general-purpose temporary.
-synchronize_from_tb hook is non-trivial, not exiting
+Rename the existing temps to add a distinguishing number.
 may be erroneous.
-Reviewed-by: Claudio Fontana <cfontana@suse.de>
+Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
 Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
 ---
- include/exec/exec-all.h             |  5 +----
+ tcg/aarch64/tcg-target.c.inc | 50 ++++++++++++++++++------------------
- accel/tcg/cpu-exec-common.c         |  2 +-
+ 1 file changed, 25 insertions(+), 25 deletions(-)
  accel/tcg/translate-all.c           | 12 ++----------
  target/alpha/helper.c               |  2 +-
  target/alpha/mem_helper.c           |  2 +-
  target/arm/op_helper.c              |  2 +-
  target/arm/tlb_helper.c             |  8 ++++----
  target/cris/helper.c                |  2 +-
  target/i386/tcg/sysemu/svm_helper.c |  2 +-
  target/m68k/op_helper.c             |  4 ++--
  target/microblaze/helper.c          |  2 +-
  target/nios2/op_helper.c            |  2 +-
  target/openrisc/sys_helper.c        |  4 ++--
  target/ppc/excp_helper.c            |  2 +-
  target/s390x/tcg/excp_helper.c      |  2 +-
  target/tricore/op_helper.c          |  2 +-
  target/xtensa/helper.c              |  6 +++---
  17 files changed, 25 insertions(+), 36 deletions(-)
-diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
+diff --git a/tcg/aarch64/tcg-target.c.inc b/tcg/aarch64/tcg-target.c.inc
 index XXXXXXX..XXXXXXX 100644
---- a/include/exec/exec-all.h
+--- a/tcg/aarch64/tcg-target.c.inc
-+++ b/include/exec/exec-all.h
++++ b/tcg/aarch64/tcg-target.c.inc
-@@ -XXX,XX +XXX,XX @@ bool cpu_unwind_state_data(CPUState *cpu, uintptr_t host_pc, uint64_t *data);
+@@ -XXX,XX +XXX,XX @@ static TCGReg tcg_target_call_oarg_reg(TCGCallReturnKind kind, int slot)
-  * cpu_restore_state:
+     return TCG_REG_X0 + slot;
-  * @cpu: the cpu context
+ }
-  * @host_pc: the host pc within the translation
-- * @will_exit: true if the TB executed will be interrupted after some
+-#define TCG_REG_TMP TCG_REG_X30
--               cpu adjustments. Required for maintaining the correct
+-#define TCG_VEC_TMP TCG_REG_V31
--               icount value
++#define TCG_REG_TMP0 TCG_REG_X30
-  * @return: true if state was restored, false otherwise
++#define TCG_VEC_TMP0 TCG_REG_V31
-  *
-  * Attempt to restore the state for a fault occurring in translated
+ #ifndef CONFIG_SOFTMMU
-  * code. If @host_pc is not in translated code no state is
+ #define TCG_REG_GUEST_BASE TCG_REG_X28
-  * restored and the function returns false.
+@@ -XXX,XX +XXX,XX @@ static bool tcg_out_dup_vec(TCGContext *s, TCGType type, unsigned vece,
-  */
+ static bool tcg_out_dupm_vec(TCGContext *s, TCGType type, unsigned vece,
--bool cpu_restore_state(CPUState *cpu, uintptr_t host_pc, bool will_exit);
+                              TCGReg r, TCGReg base, intptr_t offset)
 +bool cpu_restore_state(CPUState *cpu, uintptr_t host_pc);
  G_NORETURN void cpu_loop_exit_noexc(CPUState *cpu);
  G_NORETURN void cpu_loop_exit(CPUState *cpu);
 diff --git a/accel/tcg/cpu-exec-common.c b/accel/tcg/cpu-exec-common.c
 index XXXXXXX..XXXXXXX 100644
 --- a/accel/tcg/cpu-exec-common.c
 +++ b/accel/tcg/cpu-exec-common.c
@@ -XXX,XX +XXX,XX @@ void cpu_loop_exit(CPUState *cpu)
  void cpu_loop_exit_restore(CPUState *cpu, uintptr_t pc)
  {
-     if (pc) {
+-    TCGReg temp = TCG_REG_TMP;
--        cpu_restore_state(cpu, pc, true);
++    TCGReg temp = TCG_REG_TMP0;
-+        cpu_restore_state(cpu, pc);
      if (offset < -0xffffff || offset > 0xffffff) {
          tcg_out_movi(s, TCG_TYPE_PTR, temp, offset);
@@ -XXX,XX +XXX,XX @@ static void tcg_out_ldst(TCGContext *s, AArch64Insn insn, TCGReg rd,
      }
-     cpu_loop_exit(cpu);
      /* Worst-case scenario, move offset to temp register, use reg offset.  */
 -    tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_TMP, offset);
 -    tcg_out_ldst_r(s, insn, rd, rn, TCG_TYPE_I64, TCG_REG_TMP);
 +    tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_TMP0, offset);
 +    tcg_out_ldst_r(s, insn, rd, rn, TCG_TYPE_I64, TCG_REG_TMP0);
  }
-diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
-index XXXXXXX..XXXXXXX 100644
+ static bool tcg_out_mov(TCGContext *s, TCGType type, TCGReg ret, TCGReg arg)
---- a/accel/tcg/translate-all.c
+@@ -XXX,XX +XXX,XX @@ static void tcg_out_call_int(TCGContext *s, const tcg_insn_unit *target)
-+++ b/accel/tcg/translate-all.c
+     if (offset == sextract64(offset, 0, 26)) {
-@@ -XXX,XX +XXX,XX @@ void cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
+         tcg_out_insn(s, 3206, BL, offset);
  #endif
  }
 -bool cpu_restore_state(CPUState *cpu, uintptr_t host_pc, bool will_exit)
 +bool cpu_restore_state(CPUState *cpu, uintptr_t host_pc)
  {
 -    /*
 -     * The pc update associated with restore without exit will
 -     * break the relative pc adjustments performed by TARGET_TB_PCREL.
 -     */
 -    if (TARGET_TB_PCREL) {
 -        assert(will_exit);
 -    }
 -
      /*
       * The host_pc has to be in the rx region of the code buffer.
       * If it is not we will not be able to resolve it here.
@@ -XXX,XX +XXX,XX @@ bool cpu_restore_state(CPUState *cpu, uintptr_t host_pc, bool will_exit)
      if (in_code_gen_buffer((const void *)(host_pc - tcg_splitwx_diff))) {
          TranslationBlock *tb = tcg_tb_lookup(host_pc);
          if (tb) {
 -            cpu_restore_state_from_tb(cpu, tb, host_pc, will_exit);
 +            cpu_restore_state_from_tb(cpu, tb, host_pc, true);
              return true;
          }
      }
 diff --git a/target/alpha/helper.c b/target/alpha/helper.c
 index XXXXXXX..XXXXXXX 100644
 --- a/target/alpha/helper.c
 +++ b/target/alpha/helper.c
@@ -XXX,XX +XXX,XX @@ G_NORETURN void dynamic_excp(CPUAlphaState *env, uintptr_t retaddr,
      cs->exception_index = excp;
      env->error_code = error;
      if (retaddr) {
 -        cpu_restore_state(cs, retaddr, true);
 +        cpu_restore_state(cs, retaddr);
          /* Floating-point exceptions (our only users) point to the next PC.  */
          env->pc += 4;
      }
 diff --git a/target/alpha/mem_helper.c b/target/alpha/mem_helper.c
 index XXXXXXX..XXXXXXX 100644
 --- a/target/alpha/mem_helper.c
 +++ b/target/alpha/mem_helper.c
@@ -XXX,XX +XXX,XX @@ static void do_unaligned_access(CPUAlphaState *env, vaddr addr, uintptr_t retadd
      uint64_t pc;
      uint32_t insn;
 -    cpu_restore_state(env_cpu(env), retaddr, true);
 +    cpu_restore_state(env_cpu(env), retaddr);
      pc = env->pc;
      insn = cpu_ldl_code(env, pc);
 diff --git a/target/arm/op_helper.c b/target/arm/op_helper.c
 index XXXXXXX..XXXXXXX 100644
 --- a/target/arm/op_helper.c
 +++ b/target/arm/op_helper.c
@@ -XXX,XX +XXX,XX @@ void raise_exception_ra(CPUARMState *env, uint32_t excp, uint32_t syndrome,
       * we must restore CPU state here before setting the syndrome
       * the caller passed us, and cannot use cpu_loop_exit_restore().
       */
 -    cpu_restore_state(cs, ra, true);
 +    cpu_restore_state(cs, ra);
      raise_exception(env, excp, syndrome, target_el);
  }
 diff --git a/target/arm/tlb_helper.c b/target/arm/tlb_helper.c
 index XXXXXXX..XXXXXXX 100644
 --- a/target/arm/tlb_helper.c
 +++ b/target/arm/tlb_helper.c
@@ -XXX,XX +XXX,XX @@ void arm_cpu_do_unaligned_access(CPUState *cs, vaddr vaddr,
      ARMMMUFaultInfo fi = {};
      /* now we have a real cpu fault */
 -    cpu_restore_state(cs, retaddr, true);
 +    cpu_restore_state(cs, retaddr);
      fi.type = ARMFault_Alignment;
      arm_deliver_fault(cpu, vaddr, access_type, mmu_idx, &fi);
@@ -XXX,XX +XXX,XX @@ void arm_cpu_do_transaction_failed(CPUState *cs, hwaddr physaddr,
      ARMMMUFaultInfo fi = {};
      /* now we have a real cpu fault */
 -    cpu_restore_state(cs, retaddr, true);
 +    cpu_restore_state(cs, retaddr);
      fi.ea = arm_extabort_type(response);
      fi.type = ARMFault_SyncExternal;
@@ -XXX,XX +XXX,XX @@ bool arm_cpu_tlb_fill(CPUState *cs, vaddr address, int size,
          return false;
      } else {
-         /* now we have a real cpu fault */
+-        tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_TMP, (intptr_t)target);
--        cpu_restore_state(cs, retaddr, true);
+-        tcg_out_insn(s, 3207, BLR, TCG_REG_TMP);
-+        cpu_restore_state(cs, retaddr);
++        tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_TMP0, (intptr_t)target);
-         arm_deliver_fault(cpu, address, access_type, mmu_idx, fi);
++        tcg_out_insn(s, 3207, BLR, TCG_REG_TMP0);
      }
  }
-@@ -XXX,XX +XXX,XX @@ void arm_cpu_record_sigsegv(CPUState *cs, vaddr addr,
-      * We report both ESR and FAR to signal handlers.
+@@ -XXX,XX +XXX,XX @@ static void tcg_out_addsub2(TCGContext *s, TCGType ext, TCGReg rl,
-      * For now, it's easiest to deliver the fault normally.
+     AArch64Insn insn;
-      */
--    cpu_restore_state(cs, ra, true);
+     if (rl == ah || (!const_bh && rl == bh)) {
-+    cpu_restore_state(cs, ra);
+-        rl = TCG_REG_TMP;
-     arm_deliver_fault(cpu, addr, access_type, MMU_USER_IDX, &fi);
++        rl = TCG_REG_TMP0;
- }
+     }
-diff --git a/target/cris/helper.c b/target/cris/helper.c
+     if (const_bl) {
-index XXXXXXX..XXXXXXX 100644
+@@ -XXX,XX +XXX,XX @@ static void tcg_out_addsub2(TCGContext *s, TCGType ext, TCGReg rl,
---- a/target/cris/helper.c
+                possibility of adding 0+const in the low part, and the
-+++ b/target/cris/helper.c
+                immediate add instructions encode XSP not XZR.  Don't try
-@@ -XXX,XX +XXX,XX @@ bool cris_cpu_tlb_fill(CPUState *cs, vaddr address, int size,
+                anything more elaborate here than loading another zero.  */
-     cs->exception_index = EXCP_BUSFAULT;
+-            al = TCG_REG_TMP;
-     env->fault_vector = res.bf_vec;
++            al = TCG_REG_TMP0;
-     if (retaddr) {
+             tcg_out_movi(s, ext, al, 0);
 -        if (cpu_restore_state(cs, retaddr, true)) {
 +        if (cpu_restore_state(cs, retaddr)) {
              /* Evaluate flags after retranslation. */
              helper_top_evaluate_flags(env);
          }
-diff --git a/target/i386/tcg/sysemu/svm_helper.c b/target/i386/tcg/sysemu/svm_helper.c
+         tcg_out_insn_3401(s, insn, ext, rl, al, bl);
-index XXXXXXX..XXXXXXX 100644
+@@ -XXX,XX +XXX,XX @@ static void tcg_out_cltz(TCGContext *s, TCGType ext, TCGReg d,
 --- a/target/i386/tcg/sysemu/svm_helper.c
 +++ b/target/i386/tcg/sysemu/svm_helper.c
@@ -XXX,XX +XXX,XX @@ void cpu_vmexit(CPUX86State *env, uint32_t exit_code, uint64_t exit_info_1,
  {
-     CPUState *cs = env_cpu(env);
+     TCGReg a1 = a0;
+     if (is_ctz) {
--    cpu_restore_state(cs, retaddr, true);
+-        a1 = TCG_REG_TMP;
-+    cpu_restore_state(cs, retaddr);
++        a1 = TCG_REG_TMP0;
+         tcg_out_insn(s, 3507, RBIT, ext, a1, a0);
-     qemu_log_mask(CPU_LOG_TB_IN_ASM, "vmexit(%08x, %016" PRIx64 ", %016"
+     }
-                   PRIx64 ", " TARGET_FMT_lx ")!\n",
+     if (const_b && b == (ext ? 64 : 32)) {
-diff --git a/target/m68k/op_helper.c b/target/m68k/op_helper.c
+@@ -XXX,XX +XXX,XX @@ static void tcg_out_cltz(TCGContext *s, TCGType ext, TCGReg d,
-index XXXXXXX..XXXXXXX 100644
+         AArch64Insn sel = I3506_CSEL;
---- a/target/m68k/op_helper.c
-+++ b/target/m68k/op_helper.c
+         tcg_out_cmp(s, ext, a0, 0, 1);
-@@ -XXX,XX +XXX,XX @@ void m68k_cpu_transaction_failed(CPUState *cs, hwaddr physaddr, vaddr addr,
+-        tcg_out_insn(s, 3507, CLZ, ext, TCG_REG_TMP, a1);
-     M68kCPU *cpu = M68K_CPU(cs);
++        tcg_out_insn(s, 3507, CLZ, ext, TCG_REG_TMP0, a1);
-     CPUM68KState *env = &cpu->env;
+         if (const_b) {
--    cpu_restore_state(cs, retaddr, true);
+             if (b == -1) {
-+    cpu_restore_state(cs, retaddr);
+@@ -XXX,XX +XXX,XX @@ static void tcg_out_cltz(TCGContext *s, TCGType ext, TCGReg d,
+                 b = d;
-     if (m68k_feature(env, M68K_FEATURE_M68040)) {
+             }
-         env->mmu.mmusr = 0;
+         }
-@@ -XXX,XX +XXX,XX @@ raise_exception_format2(CPUM68KState *env, int tt, int ilen, uintptr_t raddr)
+-        tcg_out_insn_3506(s, sel, ext, d, TCG_REG_TMP, b, TCG_COND_NE);
-     cs->exception_index = tt;
++        tcg_out_insn_3506(s, sel, ext, d, TCG_REG_TMP0, b, TCG_COND_NE);
      /* Recover PC and CC_OP for the beginning of the insn.  */
 -    cpu_restore_state(cs, raddr, true);
 +    cpu_restore_state(cs, raddr);
      /* Flags are current in env->cc_*, or are undefined. */
      env->cc_op = CC_OP_FLAGS;
 diff --git a/target/microblaze/helper.c b/target/microblaze/helper.c
 index XXXXXXX..XXXXXXX 100644
 --- a/target/microblaze/helper.c
 +++ b/target/microblaze/helper.c
@@ -XXX,XX +XXX,XX @@ void mb_cpu_do_unaligned_access(CPUState *cs, vaddr addr,
      uint32_t esr, iflags;
      /* Recover the pc and iflags from the corresponding insn_start.  */
 -    cpu_restore_state(cs, retaddr, true);
 +    cpu_restore_state(cs, retaddr);
      iflags = cpu->env.iflags;
      qemu_log_mask(CPU_LOG_INT,
 diff --git a/target/nios2/op_helper.c b/target/nios2/op_helper.c
 index XXXXXXX..XXXXXXX 100644
 --- a/target/nios2/op_helper.c
 +++ b/target/nios2/op_helper.c
@@ -XXX,XX +XXX,XX @@ void nios2_cpu_loop_exit_advance(CPUNios2State *env, uintptr_t retaddr)
       * Do this here, rather than in restore_state_to_opc(),
       * lest we affect QEMU internal exceptions, like EXCP_DEBUG.
       */
 -    cpu_restore_state(cs, retaddr, true);
 +    cpu_restore_state(cs, retaddr);
      env->pc += 4;
      cpu_loop_exit(cs);
  }
 diff --git a/target/openrisc/sys_helper.c b/target/openrisc/sys_helper.c
 index XXXXXXX..XXXXXXX 100644
 --- a/target/openrisc/sys_helper.c
 +++ b/target/openrisc/sys_helper.c
@@ -XXX,XX +XXX,XX @@ void HELPER(mtspr)(CPUOpenRISCState *env, target_ulong spr, target_ulong rb)
          break;
      case TO_SPR(0, 16): /* NPC */
 -        cpu_restore_state(cs, GETPC(), true);
 +        cpu_restore_state(cs, GETPC());
          /* ??? Mirror or1ksim in not trashing delayed branch state
             when "jumping" to the current instruction.  */
          if (env->pc != rb) {
@@ -XXX,XX +XXX,XX @@ void HELPER(mtspr)(CPUOpenRISCState *env, target_ulong spr, target_ulong rb)
      case TO_SPR(8, 0):  /* PMR */
          env->pmr = rb;
          if (env->pmr & PMR_DME || env->pmr & PMR_SME) {
 -            cpu_restore_state(cs, GETPC(), true);
 +            cpu_restore_state(cs, GETPC());
              env->pc += 4;
              cs->halted = 1;
              raise_exception(cpu, EXCP_HALTED);
 diff --git a/target/ppc/excp_helper.c b/target/ppc/excp_helper.c
 index XXXXXXX..XXXXXXX 100644
 --- a/target/ppc/excp_helper.c
 +++ b/target/ppc/excp_helper.c
@@ -XXX,XX +XXX,XX @@ void ppc_cpu_do_unaligned_access(CPUState *cs, vaddr vaddr,
      uint32_t insn;
      /* Restore state and reload the insn we executed, for filling in DSISR.  */
 -    cpu_restore_state(cs, retaddr, true);
 +    cpu_restore_state(cs, retaddr);
      insn = cpu_ldl_code(env, env->nip);
      switch (env->mmu_model) {
 diff --git a/target/s390x/tcg/excp_helper.c b/target/s390x/tcg/excp_helper.c
 index XXXXXXX..XXXXXXX 100644
 --- a/target/s390x/tcg/excp_helper.c
 +++ b/target/s390x/tcg/excp_helper.c
@@ -XXX,XX +XXX,XX @@ G_NORETURN void tcg_s390_program_interrupt(CPUS390XState *env,
  {
      CPUState *cs = env_cpu(env);
 -    cpu_restore_state(cs, ra, true);
 +    cpu_restore_state(cs, ra);
      qemu_log_mask(CPU_LOG_INT, "program interrupt at %#" PRIx64 "\n",
                    env->psw.addr);
      trigger_pgm_exception(env, code);
 diff --git a/target/tricore/op_helper.c b/target/tricore/op_helper.c
 index XXXXXXX..XXXXXXX 100644
 --- a/target/tricore/op_helper.c
 +++ b/target/tricore/op_helper.c
@@ -XXX,XX +XXX,XX @@ void raise_exception_sync_internal(CPUTriCoreState *env, uint32_t class, int tin
  {
      CPUState *cs = env_cpu(env);
      /* in case we come from a helper-call we need to restore the PC */
 -    cpu_restore_state(cs, pc, true);
 +    cpu_restore_state(cs, pc);
      /* Tin is loaded into d[15] */
      env->gpr_d[15] = tin;
 diff --git a/target/xtensa/helper.c b/target/xtensa/helper.c
 index XXXXXXX..XXXXXXX 100644
 --- a/target/xtensa/helper.c
 +++ b/target/xtensa/helper.c
@@ -XXX,XX +XXX,XX @@ void xtensa_cpu_do_unaligned_access(CPUState *cs,
      assert(xtensa_option_enabled(env->config,
                                   XTENSA_OPTION_UNALIGNED_EXCEPTION));
 -    cpu_restore_state(CPU(cpu), retaddr, true);
 +    cpu_restore_state(CPU(cpu), retaddr);
      HELPER(exception_cause_vaddr)(env,
                                    env->pc, LOAD_STORE_ALIGNMENT_CAUSE,
                                    addr);
@@ -XXX,XX +XXX,XX @@ bool xtensa_cpu_tlb_fill(CPUState *cs, vaddr address, int size,
      } else if (probe) {
          return false;
      } else {
 -        cpu_restore_state(cs, retaddr, true);
 +        cpu_restore_state(cs, retaddr);
          HELPER(exception_cause_vaddr)(env, env->pc, ret, address);
      }
  }
-@@ -XXX,XX +XXX,XX @@ void xtensa_cpu_do_transaction_failed(CPUState *cs, hwaddr physaddr, vaddr addr,
-     XtensaCPU *cpu = XTENSA_CPU(cs);
+@@ -XXX,XX +XXX,XX @@ bool tcg_target_has_memory_bswap(MemOp memop)
-     CPUXtensaState *env = &cpu->env;
+ }
--    cpu_restore_state(cs, retaddr, true);
+ static const TCGLdstHelperParam ldst_helper_param = {
-+    cpu_restore_state(cs, retaddr);
+-    .ntmp = 1, .tmp = { TCG_REG_TMP }
-     HELPER(exception_cause_vaddr)(env, env->pc,
++    .ntmp = 1, .tmp = { TCG_REG_TMP0 }
-                                   access_type == MMU_INST_FETCH ?
+ };
-                                   INSTR_PIF_ADDR_ERROR_CAUSE :
  static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
@@ -XXX,XX +XXX,XX @@ static void tcg_out_goto_tb(TCGContext *s, int which)
      set_jmp_insn_offset(s, which);
      tcg_out32(s, I3206_B);
 -    tcg_out_insn(s, 3207, BR, TCG_REG_TMP);
 +    tcg_out_insn(s, 3207, BR, TCG_REG_TMP0);
      set_jmp_reset_offset(s, which);
  }
@@ -XXX,XX +XXX,XX @@ void tb_target_set_jmp_target(const TranslationBlock *tb, int n,
          ptrdiff_t i_offset = i_addr - jmp_rx;
          /* Note that we asserted this in range in tcg_out_goto_tb. */
 -        insn = deposit32(I3305_LDR | TCG_REG_TMP, 5, 19, i_offset >> 2);
 +        insn = deposit32(I3305_LDR | TCG_REG_TMP0, 5, 19, i_offset >> 2);
      }
      qatomic_set((uint32_t *)jmp_rw, insn);
      flush_idcache_range(jmp_rx, jmp_rw, 4);
@@ -XXX,XX +XXX,XX @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
      case INDEX_op_rem_i64:
      case INDEX_op_rem_i32:
 -        tcg_out_insn(s, 3508, SDIV, ext, TCG_REG_TMP, a1, a2);
 -        tcg_out_insn(s, 3509, MSUB, ext, a0, TCG_REG_TMP, a2, a1);
 +        tcg_out_insn(s, 3508, SDIV, ext, TCG_REG_TMP0, a1, a2);
 +        tcg_out_insn(s, 3509, MSUB, ext, a0, TCG_REG_TMP0, a2, a1);
          break;
      case INDEX_op_remu_i64:
      case INDEX_op_remu_i32:
 -        tcg_out_insn(s, 3508, UDIV, ext, TCG_REG_TMP, a1, a2);
 -        tcg_out_insn(s, 3509, MSUB, ext, a0, TCG_REG_TMP, a2, a1);
 +        tcg_out_insn(s, 3508, UDIV, ext, TCG_REG_TMP0, a1, a2);
 +        tcg_out_insn(s, 3509, MSUB, ext, a0, TCG_REG_TMP0, a2, a1);
          break;
      case INDEX_op_shl_i64:
@@ -XXX,XX +XXX,XX @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
          if (c2) {
              tcg_out_rotl(s, ext, a0, a1, a2);
          } else {
 -            tcg_out_insn(s, 3502, SUB, 0, TCG_REG_TMP, TCG_REG_XZR, a2);
 -            tcg_out_insn(s, 3508, RORV, ext, a0, a1, TCG_REG_TMP);
 +            tcg_out_insn(s, 3502, SUB, 0, TCG_REG_TMP0, TCG_REG_XZR, a2);
 +            tcg_out_insn(s, 3508, RORV, ext, a0, a1, TCG_REG_TMP0);
          }
          break;
@@ -XXX,XX +XXX,XX @@ static void tcg_out_vec_op(TCGContext *s, TCGOpcode opc,
                              break;
                          }
                      }
 -                    tcg_out_dupi_vec(s, type, MO_8, TCG_VEC_TMP, 0);
 -                    a2 = TCG_VEC_TMP;
 +                    tcg_out_dupi_vec(s, type, MO_8, TCG_VEC_TMP0, 0);
 +                    a2 = TCG_VEC_TMP0;
                  }
                  if (is_scalar) {
                      insn = cmp_scalar_insn[cond];
@@ -XXX,XX +XXX,XX @@ static void tcg_target_init(TCGContext *s)
      s->reserved_regs = 0;
      tcg_regset_set_reg(s->reserved_regs, TCG_REG_SP);
      tcg_regset_set_reg(s->reserved_regs, TCG_REG_FP);
 -    tcg_regset_set_reg(s->reserved_regs, TCG_REG_TMP);
      tcg_regset_set_reg(s->reserved_regs, TCG_REG_X18); /* platform register */
 -    tcg_regset_set_reg(s->reserved_regs, TCG_VEC_TMP);
 +    tcg_regset_set_reg(s->reserved_regs, TCG_REG_TMP0);
 +    tcg_regset_set_reg(s->reserved_regs, TCG_VEC_TMP0);
  }
  /* Saving pairs: (X19, X20) .. (X27, X28), (X29(fp), X30(lr)).  */
 --
 .34.1

-New patch
+[PULL 07/27] tcg/aarch64: Reserve TCG_REG_TMP1, TCG_REG_TMP2
+Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
+Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
+---
+ tcg/aarch64/tcg-target.c.inc | 9 +++++++--
+file changed, 7 insertions(+), 2 deletions(-)
+diff --git a/tcg/aarch64/tcg-target.c.inc b/tcg/aarch64/tcg-target.c.inc
+index XXXXXXX..XXXXXXX 100644
+--- a/tcg/aarch64/tcg-target.c.inc
++++ b/tcg/aarch64/tcg-target.c.inc
+@@ -XXX,XX +XXX,XX @@ static const int tcg_target_reg_alloc_order[] = {
+     TCG_REG_X8, TCG_REG_X9, TCG_REG_X10, TCG_REG_X11,
+     TCG_REG_X12, TCG_REG_X13, TCG_REG_X14, TCG_REG_X15,
+-    TCG_REG_X16, TCG_REG_X17,
+     TCG_REG_X0, TCG_REG_X1, TCG_REG_X2, TCG_REG_X3,
+     TCG_REG_X4, TCG_REG_X5, TCG_REG_X6, TCG_REG_X7,
++    /* X16 reserved as temporary */
++    /* X17 reserved as temporary */
+     /* X18 reserved by system */
+     /* X19 reserved for AREG0 */
+     /* X29 reserved as fp */
+@@ -XXX,XX +XXX,XX @@ static TCGReg tcg_target_call_oarg_reg(TCGCallReturnKind kind, int slot)
+     return TCG_REG_X0 + slot;
+ }
+-#define TCG_REG_TMP0 TCG_REG_X30
++#define TCG_REG_TMP0 TCG_REG_X16
++#define TCG_REG_TMP1 TCG_REG_X17
++#define TCG_REG_TMP2 TCG_REG_X30
+ #define TCG_VEC_TMP0 TCG_REG_V31
+ #ifndef CONFIG_SOFTMMU
+@@ -XXX,XX +XXX,XX @@ static void tcg_target_init(TCGContext *s)
+     tcg_regset_set_reg(s->reserved_regs, TCG_REG_FP);
+     tcg_regset_set_reg(s->reserved_regs, TCG_REG_X18); /* platform register */
+     tcg_regset_set_reg(s->reserved_regs, TCG_REG_TMP0);
++    tcg_regset_set_reg(s->reserved_regs, TCG_REG_TMP1);
++    tcg_regset_set_reg(s->reserved_regs, TCG_REG_TMP2);
+     tcg_regset_set_reg(s->reserved_regs, TCG_VEC_TMP0);
+ }
+--
+.34.1
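
Note on 07/27: the effect of adding X16, X17 and X30 to reserved_regs can be
illustrated with a plain bitmask. This is only a minimal sketch, not the
QEMU implementation: the enum values and the helpers regset_reserve(),
reg_is_allocatable() and reserve_scratch_regs() are hypothetical names made
up for illustration, and a 32-bit mask stands in for the real register set
type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the backend's general register numbers. */
enum { REG_X16 = 16, REG_X17 = 17, REG_X30 = 30, NB_REGS = 32 };

/* Mark one register as reserved in a 32-bit register-set bitmask. */
static uint32_t regset_reserve(uint32_t set, unsigned reg)
{
    assert(reg < NB_REGS);
    return set | (UINT32_C(1) << reg);
}

/* A register may be allocated only if its bit is clear in the reserved set. */
static bool reg_is_allocatable(uint32_t reserved, unsigned reg)
{
    assert(reg < NB_REGS);
    return (reserved & (UINT32_C(1) << reg)) == 0;
}

/* Reserve the three scratch registers the patch assigns to TMP0..TMP2. */
static uint32_t reserve_scratch_regs(uint32_t reserved)
{
    reserved = regset_reserve(reserved, REG_X16);   /* TMP0 */
    reserved = regset_reserve(reserved, REG_X17);   /* TMP1 */
    reserved = regset_reserve(reserved, REG_X30);   /* TMP2 */
    return reserved;
}
```

Under these assumptions, reserve_scratch_regs(0) sets bits 16, 17 and 30, so
an allocator consulting reg_is_allocatable() would never hand out those
registers, which is what lets the backend use them as scratch freely.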

-[PULL 03/11] tcg/sparc64: Remove sparc32plus constraints
+[PULL 08/27] tcg/aarch64: Simplify constraints on qemu_ld/st
-With sparc64 we need not distinguish between registers that
+Adjust the softmmu tlb to use TMP[0-2], not any of the normally available
-can hold 32-bit values and those that can hold 64-bit values.
+registers.  Since we handle overlap between inputs and helper arguments,
 we can allow any allocatable reg.
 Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
 Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
 ---
- tcg/sparc64/tcg-target-con-set.h |  16 +----
+ tcg/aarch64/tcg-target-con-set.h |  2 --
- tcg/sparc64/tcg-target-con-str.h |   3 -
+ tcg/aarch64/tcg-target-con-str.h |  1 -
- tcg/sparc64/tcg-target.c.inc     | 109 ++++++++++++-------------------
+ tcg/aarch64/tcg-target.c.inc     | 45 ++++++++++++++------------------
-files changed, 44 insertions(+), 84 deletions(-)
+files changed, 19 insertions(+), 29 deletions(-)
-diff --git a/tcg/sparc64/tcg-target-con-set.h b/tcg/sparc64/tcg-target-con-set.h
+diff --git a/tcg/aarch64/tcg-target-con-set.h b/tcg/aarch64/tcg-target-con-set.h
 index XXXXXXX..XXXXXXX 100644
---- a/tcg/sparc64/tcg-target-con-set.h
+--- a/tcg/aarch64/tcg-target-con-set.h
-+++ b/tcg/sparc64/tcg-target-con-set.h
++++ b/tcg/aarch64/tcg-target-con-set.h
 @@ -XXX,XX +XXX,XX @@
+  * tcg-target-con-str.h; the constraint combination is inclusive or.
   */
  C_O0_I1(r)
+-C_O0_I2(lZ, l)
+ C_O0_I2(r, rA)
  C_O0_I2(rZ, r)
--C_O0_I2(RZ, r)
+ C_O0_I2(w, r)
- C_O0_I2(rZ, rJ)
+-C_O1_I1(r, l)
 -C_O0_I2(RZ, RJ)
 -C_O0_I2(sZ, A)
 -C_O0_I2(SZ, A)
 -C_O1_I1(r, A)
 -C_O1_I1(R, A)
 +C_O0_I2(sZ, s)
 +C_O1_I1(r, s)
  C_O1_I1(r, r)
--C_O1_I1(r, R)
+ C_O1_I1(w, r)
--C_O1_I1(R, r)
+ C_O1_I1(w, w)
--C_O1_I1(R, R)
+diff --git a/tcg/aarch64/tcg-target-con-str.h b/tcg/aarch64/tcg-target-con-str.h
 -C_O1_I2(R, R, R)
 +C_O1_I2(r, r, r)
  C_O1_I2(r, rZ, rJ)
 -C_O1_I2(R, RZ, RJ)
  C_O1_I4(r, rZ, rJ, rI, 0)
 -C_O1_I4(R, RZ, RJ, RI, 0)
  C_O2_I2(r, r, rZ, rJ)
 -C_O2_I4(R, R, RZ, RZ, RJ, RI)
  C_O2_I4(r, r, rZ, rZ, rJ, rJ)
 diff --git a/tcg/sparc64/tcg-target-con-str.h b/tcg/sparc64/tcg-target-con-str.h
 index XXXXXXX..XXXXXXX 100644
---- a/tcg/sparc64/tcg-target-con-str.h
+--- a/tcg/aarch64/tcg-target-con-str.h
-+++ b/tcg/sparc64/tcg-target-con-str.h
++++ b/tcg/aarch64/tcg-target-con-str.h
 @@ -XXX,XX +XXX,XX @@
   * REGS(letter, register_mask)
   */
  REGS('r', ALL_GENERAL_REGS)
--REGS('R', ALL_GENERAL_REGS64)
+-REGS('l', ALL_QLDST_REGS)
- REGS('s', ALL_QLDST_REGS)
+ REGS('w', ALL_VECTOR_REGS)
 -REGS('S', ALL_QLDST_REGS64)
 -REGS('A', TARGET_LONG_BITS == 64 ? ALL_QLDST_REGS64 : ALL_QLDST_REGS)
  /*
-  * Define constraint letters for constants:
+diff --git a/tcg/aarch64/tcg-target.c.inc b/tcg/aarch64/tcg-target.c.inc
 diff --git a/tcg/sparc64/tcg-target.c.inc b/tcg/sparc64/tcg-target.c.inc
 index XXXXXXX..XXXXXXX 100644
---- a/tcg/sparc64/tcg-target.c.inc
+--- a/tcg/aarch64/tcg-target.c.inc
-+++ b/tcg/sparc64/tcg-target.c.inc
++++ b/tcg/aarch64/tcg-target.c.inc
-@@ -XXX,XX +XXX,XX @@ static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
+@@ -XXX,XX +XXX,XX @@ static bool patch_reloc(tcg_insn_unit *code_ptr, int type,
  #define ALL_GENERAL_REGS  0xffffffffu
  #define ALL_VECTOR_REGS   0xffffffff00000000ull
 -#ifdef CONFIG_SOFTMMU
 -#define ALL_QLDST_REGS \
 -    (ALL_GENERAL_REGS & ~((1 << TCG_REG_X0) | (1 << TCG_REG_X1) | \
 -                          (1 << TCG_REG_X2) | (1 << TCG_REG_X3)))
 -#else
 -#define ALL_QLDST_REGS   ALL_GENERAL_REGS
 -#endif
 -
  /* Match a constant valid for addition (12-bit, optionally shifted).  */
  static inline bool is_aimm(uint64_t val)
  {
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
      unsigned s_bits = opc & MO_SIZE;
      unsigned s_mask = (1u << s_bits) - 1;
      unsigned mem_index = get_mmuidx(oi);
 -    TCGReg x3;
 +    TCGReg addr_adj;
      TCGType mask_type;
      uint64_t compare_mask;
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
      mask_type = (s->page_bits + s->tlb_dyn_max_bits > 32
                   ? TCG_TYPE_I64 : TCG_TYPE_I32);
 -    /* Load env_tlb(env)->f[mmu_idx].{mask,table} into {x0,x1}.  */
 +    /* Load env_tlb(env)->f[mmu_idx].{mask,table} into {tmp0,tmp1}. */
      QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) > 0);
      QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) < -512);
      QEMU_BUILD_BUG_ON(offsetof(CPUTLBDescFast, mask) != 0);
      QEMU_BUILD_BUG_ON(offsetof(CPUTLBDescFast, table) != 8);
 -    tcg_out_insn(s, 3314, LDP, TCG_REG_X0, TCG_REG_X1, TCG_AREG0,
 +    tcg_out_insn(s, 3314, LDP, TCG_REG_TMP0, TCG_REG_TMP1, TCG_AREG0,
                   TLB_MASK_TABLE_OFS(mem_index), 1, 0);
      /* Extract the TLB index from the address into X0.  */
      tcg_out_insn(s, 3502S, AND_LSR, mask_type == TCG_TYPE_I64,
 -                 TCG_REG_X0, TCG_REG_X0, addr_reg,
 +                 TCG_REG_TMP0, TCG_REG_TMP0, addr_reg,
                   s->page_bits - CPU_TLB_ENTRY_BITS);
 -    /* Add the tlb_table pointer, creating the CPUTLBEntry address into X1.  */
 -    tcg_out_insn(s, 3502, ADD, 1, TCG_REG_X1, TCG_REG_X1, TCG_REG_X0);
 +    /* Add the tlb_table pointer, forming the CPUTLBEntry address in TMP1. */
 +    tcg_out_insn(s, 3502, ADD, 1, TCG_REG_TMP1, TCG_REG_TMP1, TCG_REG_TMP0);
 -    /* Load the tlb comparator into X0, and the fast path addend into X1.  */
 -    tcg_out_ld(s, addr_type, TCG_REG_X0, TCG_REG_X1,
 +    /* Load the tlb comparator into TMP0, and the fast path addend into TMP1. */
 +    tcg_out_ld(s, addr_type, TCG_REG_TMP0, TCG_REG_TMP1,
                 is_ld ? offsetof(CPUTLBEntry, addr_read)
                       : offsetof(CPUTLBEntry, addr_write));
 -    tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_X1, TCG_REG_X1,
 +    tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TMP1, TCG_REG_TMP1,
                 offsetof(CPUTLBEntry, addend));
      /*
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
       * cross pages using the address of the last byte of the access.
       */
      if (a_mask >= s_mask) {
 -        x3 = addr_reg;
 +        addr_adj = addr_reg;
      } else {
 +        addr_adj = TCG_REG_TMP2;
          tcg_out_insn(s, 3401, ADDI, addr_type,
 -                     TCG_REG_X3, addr_reg, s_mask - a_mask);
 -        x3 = TCG_REG_X3;
 +                     addr_adj, addr_reg, s_mask - a_mask);
      }
      compare_mask = (uint64_t)s->page_mask | a_mask;
 -    /* Store the page mask part of the address into X3.  */
 -    tcg_out_logicali(s, I3404_ANDI, addr_type, TCG_REG_X3, x3, compare_mask);
 +    /* Store the page mask part of the address into TMP2.  */
 +    tcg_out_logicali(s, I3404_ANDI, addr_type, TCG_REG_TMP2,
 +                     addr_adj, compare_mask);
      /* Perform the address comparison. */
 -    tcg_out_cmp(s, addr_type, TCG_REG_X0, TCG_REG_X3, 0);
 +    tcg_out_cmp(s, addr_type, TCG_REG_TMP0, TCG_REG_TMP2, 0);
      /* If not equal, we jump to the slow path. */
      ldst->label_ptr[0] = s->code_ptr;
      tcg_out_insn(s, 3202, B_C, TCG_COND_NE, 0);
 -    h->base = TCG_REG_X1,
 +    h->base = TCG_REG_TMP1;
      h->index = addr_reg;
      h->index_ext = addr_type;
  #else
- #define SOFTMMU_RESERVE_REGS 0
- #endif
--
--/*
-- * Note that sparcv8plus can only hold 64 bit quantities in %g and %o
-- * registers.  These are saved manually by the kernel in full 64-bit
-- * slots.  The %i and %l registers are saved by the register window
-- * mechanism, which only allocates space for 32 bits.  Given that this
-- * window spill/fill can happen on any signal, we must consider the
-- * high bits of the %i and %l registers garbage at all times.
-- */
- #define ALL_GENERAL_REGS     MAKE_64BIT_MASK(0, 32)
--# define ALL_GENERAL_REGS64  ALL_GENERAL_REGS
- #define ALL_QLDST_REGS       (ALL_GENERAL_REGS & ~SOFTMMU_RESERVE_REGS)
--#define ALL_QLDST_REGS64     (ALL_GENERAL_REGS64 & ~SOFTMMU_RESERVE_REGS)
- /* Define some temporary registers.  T2 is used for constant generation.  */
- #define TCG_REG_T1  TCG_REG_G1
 @@ -XXX,XX +XXX,XX @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
-         return C_O0_I1(r);
+     case INDEX_op_qemu_ld_a64_i32:
+     case INDEX_op_qemu_ld_a32_i64:
-     case INDEX_op_ld8u_i32:
+     case INDEX_op_qemu_ld_a64_i64:
-+    case INDEX_op_ld8u_i64:
+-        return C_O1_I1(r, l);
-     case INDEX_op_ld8s_i32:
++        return C_O1_I1(r, r);
-+    case INDEX_op_ld8s_i64:
+     case INDEX_op_qemu_st_a32_i32:
-     case INDEX_op_ld16u_i32:
+     case INDEX_op_qemu_st_a64_i32:
-+    case INDEX_op_ld16u_i64:
+     case INDEX_op_qemu_st_a32_i64:
-     case INDEX_op_ld16s_i32:
+     case INDEX_op_qemu_st_a64_i64:
-+    case INDEX_op_ld16s_i64:
+-        return C_O0_I2(lZ, l);
-     case INDEX_op_ld_i32:
++        return C_O0_I2(rZ, r);
-+    case INDEX_op_ld32u_i64:
-+    case INDEX_op_ld32s_i64:
+     case INDEX_op_deposit_i32:
-+    case INDEX_op_ld_i64:
+     case INDEX_op_deposit_i64:
      case INDEX_op_neg_i32:
 +    case INDEX_op_neg_i64:
      case INDEX_op_not_i32:
 +    case INDEX_op_not_i64:
 +    case INDEX_op_ext32s_i64:
 +    case INDEX_op_ext32u_i64:
 +    case INDEX_op_ext_i32_i64:
 +    case INDEX_op_extu_i32_i64:
 +    case INDEX_op_extrl_i64_i32:
 +    case INDEX_op_extrh_i64_i32:
          return C_O1_I1(r, r);
      case INDEX_op_st8_i32:
 +    case INDEX_op_st8_i64:
      case INDEX_op_st16_i32:
 +    case INDEX_op_st16_i64:
      case INDEX_op_st_i32:
 +    case INDEX_op_st32_i64:
 +    case INDEX_op_st_i64:
          return C_O0_I2(rZ, r);
      case INDEX_op_add_i32:
 +    case INDEX_op_add_i64:
      case INDEX_op_mul_i32:
 +    case INDEX_op_mul_i64:
      case INDEX_op_div_i32:
 +    case INDEX_op_div_i64:
      case INDEX_op_divu_i32:
 +    case INDEX_op_divu_i64:
      case INDEX_op_sub_i32:
 +    case INDEX_op_sub_i64:
      case INDEX_op_and_i32:
 +    case INDEX_op_and_i64:
      case INDEX_op_andc_i32:
 +    case INDEX_op_andc_i64:
      case INDEX_op_or_i32:
 +    case INDEX_op_or_i64:
      case INDEX_op_orc_i32:
 +    case INDEX_op_orc_i64:
      case INDEX_op_xor_i32:
 +    case INDEX_op_xor_i64:
      case INDEX_op_shl_i32:
 +    case INDEX_op_shl_i64:
      case INDEX_op_shr_i32:
 +    case INDEX_op_shr_i64:
      case INDEX_op_sar_i32:
 +    case INDEX_op_sar_i64:
      case INDEX_op_setcond_i32:
 +    case INDEX_op_setcond_i64:
          return C_O1_I2(r, rZ, rJ);
      case INDEX_op_brcond_i32:
 +    case INDEX_op_brcond_i64:
          return C_O0_I2(rZ, rJ);
      case INDEX_op_movcond_i32:
 +    case INDEX_op_movcond_i64:
          return C_O1_I4(r, rZ, rJ, rI, 0);
      case INDEX_op_add2_i32:
 +    case INDEX_op_add2_i64:
      case INDEX_op_sub2_i32:
 +    case INDEX_op_sub2_i64:
          return C_O2_I4(r, r, rZ, rZ, rJ, rJ);
      case INDEX_op_mulu2_i32:
      case INDEX_op_muls2_i32:
          return C_O2_I2(r, r, rZ, rJ);
 -
 -    case INDEX_op_ld8u_i64:
 -    case INDEX_op_ld8s_i64:
 -    case INDEX_op_ld16u_i64:
 -    case INDEX_op_ld16s_i64:
 -    case INDEX_op_ld32u_i64:
 -    case INDEX_op_ld32s_i64:
 -    case INDEX_op_ld_i64:
 -    case INDEX_op_ext_i32_i64:
 -    case INDEX_op_extu_i32_i64:
 -        return C_O1_I1(R, r);
 -
 -    case INDEX_op_st8_i64:
 -    case INDEX_op_st16_i64:
 -    case INDEX_op_st32_i64:
 -    case INDEX_op_st_i64:
 -        return C_O0_I2(RZ, r);
 -
 -    case INDEX_op_add_i64:
 -    case INDEX_op_mul_i64:
 -    case INDEX_op_div_i64:
 -    case INDEX_op_divu_i64:
 -    case INDEX_op_sub_i64:
 -    case INDEX_op_and_i64:
 -    case INDEX_op_andc_i64:
 -    case INDEX_op_or_i64:
 -    case INDEX_op_orc_i64:
 -    case INDEX_op_xor_i64:
 -    case INDEX_op_shl_i64:
 -    case INDEX_op_shr_i64:
 -    case INDEX_op_sar_i64:
 -    case INDEX_op_setcond_i64:
 -        return C_O1_I2(R, RZ, RJ);
 -
 -    case INDEX_op_neg_i64:
 -    case INDEX_op_not_i64:
 -    case INDEX_op_ext32s_i64:
 -    case INDEX_op_ext32u_i64:
 -        return C_O1_I1(R, R);
 -
 -    case INDEX_op_extrl_i64_i32:
 -    case INDEX_op_extrh_i64_i32:
 -        return C_O1_I1(r, R);
 -
 -    case INDEX_op_brcond_i64:
 -        return C_O0_I2(RZ, RJ);
 -    case INDEX_op_movcond_i64:
 -        return C_O1_I4(R, RZ, RJ, RI, 0);
 -    case INDEX_op_add2_i64:
 -    case INDEX_op_sub2_i64:
 -        return C_O2_I4(R, R, RZ, RZ, RJ, RI);
      case INDEX_op_muluh_i64:
 -        return C_O1_I2(R, R, R);
 +        return C_O1_I2(r, r, r);
      case INDEX_op_qemu_ld_i32:
 -        return C_O1_I1(r, A);
      case INDEX_op_qemu_ld_i64:
 -        return C_O1_I1(R, A);
 +        return C_O1_I1(r, s);
      case INDEX_op_qemu_st_i32:
 -        return C_O0_I2(sZ, A);
      case INDEX_op_qemu_st_i64:
 -        return C_O0_I2(SZ, A);
 +        return C_O0_I2(sZ, s);
      default:
          g_assert_not_reached();
@@ -XXX,XX +XXX,XX @@ static void tcg_target_init(TCGContext *s)
  #endif
      tcg_target_available_regs[TCG_TYPE_I32] = ALL_GENERAL_REGS;
 -    tcg_target_available_regs[TCG_TYPE_I64] = ALL_GENERAL_REGS64;
 +    tcg_target_available_regs[TCG_TYPE_I64] = ALL_GENERAL_REGS;
      tcg_target_call_clobber_regs = 0;
      tcg_regset_set_reg(tcg_target_call_clobber_regs, TCG_REG_G1);
 --
 .34.1
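
Note on 08/27: the address comparison that prepare_host_addr() emits into
TMP0..TMP2 can be modelled in plain C. The following is a rough sketch of
the arithmetic only, assuming a page-aligned comparator; the function name
tlb_fast_path_hit() and its parameters are hypothetical and do not exist in
QEMU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of the fast-path check: returns true when the access would hit,
 * false when the generated code would branch to the slow path.
 *   addr        guest virtual address of the access
 *   comparator  page-aligned TLB comparator (addr_read or addr_write)
 *   page_bits   log2 of the page size
 *   s_bits      log2 of the access size
 *   align_bits  log2 of the alignment the access is required to have
 */
static bool tlb_fast_path_hit(uint64_t addr, uint64_t comparator,
                              unsigned page_bits, unsigned s_bits,
                              unsigned align_bits)
{
    assert(page_bits < 64 && s_bits <= page_bits && align_bits <= page_bits);

    uint64_t page_mask = ~((UINT64_C(1) << page_bits) - 1);
    uint64_t s_mask = (UINT64_C(1) << s_bits) - 1;
    uint64_t a_mask = (UINT64_C(1) << align_bits) - 1;
    uint64_t addr_adj = addr;

    /*
     * For accesses that may be unaligned, test the address of the last
     * byte so that a page-crossing access fails the comparison.
     */
    if (a_mask < s_mask) {
        addr_adj = addr + (s_mask - a_mask);
    }

    /* Keep the page number plus the low bits that must be zero. */
    return (addr_adj & (page_mask | a_mask)) == comparator;
}
```

With 4 KiB pages, an unaligned 8-byte access at 0x1ffc fails the check
because its last byte lies in the next page, while an aligned-only access at
0x1004 fails because the alignment bits survive the mask.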

-New patch
+[PULL 09/27] tcg/aarch64: Support 128-bit load/store
+With FEAT_LSE2, LDP/STP suffices.  Without FEAT_LSE2, use LDXP+STXP
+when 16-byte atomicity is required and LDP/STP otherwise.
+Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
+Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
+---
+ tcg/aarch64/tcg-target-con-set.h |   2 +
+ tcg/aarch64/tcg-target.h         |  11 ++-
+ tcg/aarch64/tcg-target.c.inc     | 141 ++++++++++++++++++++++++++++++-
+files changed, 151 insertions(+), 3 deletions(-)
+diff --git a/tcg/aarch64/tcg-target-con-set.h b/tcg/aarch64/tcg-target-con-set.h
+index XXXXXXX..XXXXXXX 100644
+--- a/tcg/aarch64/tcg-target-con-set.h
++++ b/tcg/aarch64/tcg-target-con-set.h
+@@ -XXX,XX +XXX,XX @@ C_O0_I1(r)
+ C_O0_I2(r, rA)
+ C_O0_I2(rZ, r)
+ C_O0_I2(w, r)
++C_O0_I3(rZ, rZ, r)
+ C_O1_I1(r, r)
+ C_O1_I1(w, r)
+ C_O1_I1(w, w)
+@@ -XXX,XX +XXX,XX @@ C_O1_I2(w, w, wO)
+ C_O1_I2(w, w, wZ)
+ C_O1_I3(w, w, w, w)
+ C_O1_I4(r, r, rA, rZ, rZ)
++C_O2_I1(r, r, r)
+ C_O2_I4(r, r, rZ, rZ, rA, rMZ)
+diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h
+index XXXXXXX..XXXXXXX 100644
+--- a/tcg/aarch64/tcg-target.h
++++ b/tcg/aarch64/tcg-target.h
+@@ -XXX,XX +XXX,XX @@ typedef enum {
+ #define TCG_TARGET_HAS_muluh_i64        1
+ #define TCG_TARGET_HAS_mulsh_i64        1
+-#define TCG_TARGET_HAS_qemu_ldst_i128   0
++/*
++ * Without FEAT_LSE2, we must use LDXP+STXP to implement atomic 128-bit load,
++ * which requires writable pages.  We must defer to the helper for user-only,
++ * but in system mode all ram is writable for the host.
++ */
++#ifdef CONFIG_USER_ONLY
++#define TCG_TARGET_HAS_qemu_ldst_i128   have_lse2
++#else
++#define TCG_TARGET_HAS_qemu_ldst_i128   1
++#endif
+ #define TCG_TARGET_HAS_v64              1
+ #define TCG_TARGET_HAS_v128             1
+diff --git a/tcg/aarch64/tcg-target.c.inc b/tcg/aarch64/tcg-target.c.inc
+index XXXXXXX..XXXXXXX 100644
+--- a/tcg/aarch64/tcg-target.c.inc
++++ b/tcg/aarch64/tcg-target.c.inc
+@@ -XXX,XX +XXX,XX @@ typedef enum {
+     I3305_LDR_v64   = 0x5c000000,
+     I3305_LDR_v128  = 0x9c000000,
++    /* Load/store exclusive. */
++    I3306_LDXP      = 0xc8600000,
++    I3306_STXP      = 0xc8200000,
++
+     /* Load/store register.  Described here as 3.3.12, but the helper
+        that emits them can transform to 3.3.10 or 3.3.13.  */
+     I3312_STRB      = 0x38000000 | LDST_ST << 22 | MO_8 << 30,
+@@ -XXX,XX +XXX,XX @@ typedef enum {
+     I3406_ADR       = 0x10000000,
+     I3406_ADRP      = 0x90000000,
++    /* Add/subtract extended register instructions. */
++    I3501_ADD       = 0x0b200000,
++
+     /* Add/subtract shifted register instructions (without a shift).  */
+     I3502_ADD       = 0x0b000000,
+     I3502_ADDS      = 0x2b000000,
+@@ -XXX,XX +XXX,XX @@ static void tcg_out_insn_3305(TCGContext *s, AArch64Insn insn,
+     tcg_out32(s, insn | (imm19 & 0x7ffff) << 5 | rt);
+ }
++static void tcg_out_insn_3306(TCGContext *s, AArch64Insn insn, TCGReg rs,
++                              TCGReg rt, TCGReg rt2, TCGReg rn)
++{
++    tcg_out32(s, insn | rs << 16 | rt2 << 10 | rn << 5 | rt);
++}
++
+ static void tcg_out_insn_3201(TCGContext *s, AArch64Insn insn, TCGType ext,
+                               TCGReg rt, int imm19)
+ {
+@@ -XXX,XX +XXX,XX @@ static void tcg_out_insn_3406(TCGContext *s, AArch64Insn insn,
+     tcg_out32(s, insn | (disp & 3) << 29 | (disp & 0x1ffffc) << (5 - 2) | rd);
+ }
++static inline void tcg_out_insn_3501(TCGContext *s, AArch64Insn insn,
++                                     TCGType sf, TCGReg rd, TCGReg rn,
++                                     TCGReg rm, int opt, int imm3)
++{
++    tcg_out32(s, insn | sf << 31 | rm << 16 | opt << 13 |
++              imm3 << 10 | rn << 5 | rd);
++}
++
+ /* This function is for both 3.5.2 (Add/Subtract shifted register), for
+    the rare occasion when we actually want to supply a shift amount.  */
+ static inline void tcg_out_insn_3502S(TCGContext *s, AArch64Insn insn,
+@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
+     TCGType addr_type = s->addr_type;
+     TCGLabelQemuLdst *ldst = NULL;
+     MemOp opc = get_memop(oi);
++    MemOp s_bits = opc & MO_SIZE;
+     unsigned a_mask;
+     h->aa = atom_and_align_for_opc(s, opc,
+                                    have_lse2 ? MO_ATOM_WITHIN16
+                                              : MO_ATOM_IFALIGN,
+-                                   false);
++                                   s_bits == MO_128);
+     a_mask = (1 << h->aa.align) - 1;
+ #ifdef CONFIG_SOFTMMU
+-    unsigned s_bits = opc & MO_SIZE;
+     unsigned s_mask = (1u << s_bits) - 1;
+     unsigned mem_index = get_mmuidx(oi);
+     TCGReg addr_adj;
+@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_st(TCGContext *s, TCGReg data_reg, TCGReg addr_reg,
+     }
+ }
++static void tcg_out_qemu_ldst_i128(TCGContext *s, TCGReg datalo, TCGReg datahi,
++                                   TCGReg addr_reg, MemOpIdx oi, bool is_ld)
++{
++    TCGLabelQemuLdst *ldst;
++    HostAddress h;
++    TCGReg base;
++    bool use_pair;
++
++    ldst = prepare_host_addr(s, &h, addr_reg, oi, is_ld);
++
++    /* Compose the final address, as LDP/STP have no indexing. */
++    if (h.index == TCG_REG_XZR) {
++        base = h.base;
++    } else {
++        base = TCG_REG_TMP2;
++        if (h.index_ext == TCG_TYPE_I32) {
++            /* add base, base, index, uxtw */
++            tcg_out_insn(s, 3501, ADD, TCG_TYPE_I64, base,
++                         h.base, h.index, MO_32, 0);
++        } else {
++            /* add base, base, index */
++            tcg_out_insn(s, 3502, ADD, 1, base, h.base, h.index);
++        }
++    }
++
++    use_pair = h.aa.atom < MO_128 || have_lse2;
++
++    if (!use_pair) {
++        tcg_insn_unit *branch = NULL;
++        TCGReg ll, lh, sl, sh;
++
++        /*
++         * If we have already checked for 16-byte alignment, that's all
++         * we need. Otherwise we have determined that misaligned atomicity
++         * may be handled with two 8-byte loads.
++         */
++        if (h.aa.align < MO_128) {
++            /*
++             * TODO: align should be MO_64, so we only need test bit 3,
++             * which means we could use TBNZ instead of ANDS+B_C.
++             */
++            tcg_out_logicali(s, I3404_ANDSI, 0, TCG_REG_XZR, addr_reg, 15);
++            branch = s->code_ptr;
++            tcg_out_insn(s, 3202, B_C, TCG_COND_NE, 0);
++            use_pair = true;
++        }
++
++        if (is_ld) {
++            /*
++             * 16-byte atomicity without LSE2 requires LDXP+STXP loop:
++             *    ldxp lo, hi, [base]
++             *    stxp t0, lo, hi, [base]
++             *    cbnz t0, .-8
++             * Require no overlap between data{lo,hi} and base.
++             */
++            if (datalo == base || datahi == base) {
++                tcg_out_mov(s, TCG_TYPE_REG, TCG_REG_TMP2, base);
++                base = TCG_REG_TMP2;
++            }
++            ll = sl = datalo;
++            lh = sh = datahi;
++        } else {
++            /*
++             * 16-byte atomicity without LSE2 requires LDXP+STXP loop:
++             * 1: ldxp t0, t1, [base]
++             *    stxp t0, lo, hi, [base]
++             *    cbnz t0, 1b
++             */
++            tcg_debug_assert(base != TCG_REG_TMP0 && base != TCG_REG_TMP1);
++            ll = TCG_REG_TMP0;
++            lh = TCG_REG_TMP1;
++            sl = datalo;
++            sh = datahi;
++        }
++
++        tcg_out_insn(s, 3306, LDXP, TCG_REG_XZR, ll, lh, base);
++        tcg_out_insn(s, 3306, STXP, TCG_REG_TMP0, sl, sh, base);
++        tcg_out_insn(s, 3201, CBNZ, 0, TCG_REG_TMP0, -2);
++
++        if (use_pair) {
++            /* "b .+8", branching across the one insn of use_pair. */
++            tcg_out_insn(s, 3206, B, 2);
++            reloc_pc19(branch, tcg_splitwx_to_rx(s->code_ptr));
++        }
++    }
++
++    if (use_pair) {
++        if (is_ld) {
++            tcg_out_insn(s, 3314, LDP, datalo, datahi, base, 0, 1, 0);
++        } else {
++            tcg_out_insn(s, 3314, STP, datalo, datahi, base, 0, 1, 0);
++        }
++    }
++
++    if (ldst) {
++        ldst->type = TCG_TYPE_I128;
++        ldst->datalo_reg = datalo;
++        ldst->datahi_reg = datahi;
++        ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
++    }
++}
++
+ static const tcg_insn_unit *tb_ret_addr;
+ static void tcg_out_exit_tb(TCGContext *s, uintptr_t a0)
+@@ -XXX,XX +XXX,XX @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
+     case INDEX_op_qemu_st_a64_i64:
+         tcg_out_qemu_st(s, REG0(0), a1, a2, ext);
+         break;
++    case INDEX_op_qemu_ld_a32_i128:
++    case INDEX_op_qemu_ld_a64_i128:
++        tcg_out_qemu_ldst_i128(s, a0, a1, a2, args[3], true);
++        break;
++    case INDEX_op_qemu_st_a32_i128:
++    case INDEX_op_qemu_st_a64_i128:
++        tcg_out_qemu_ldst_i128(s, REG0(0), REG0(1), a2, args[3], false);
++        break;
+     case INDEX_op_bswap64_i64:
+         tcg_out_rev(s, TCG_TYPE_I64, MO_64, a0, a1);
+@@ -XXX,XX +XXX,XX @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
+     case INDEX_op_qemu_ld_a32_i64:
+     case INDEX_op_qemu_ld_a64_i64:
+         return C_O1_I1(r, r);
++    case INDEX_op_qemu_ld_a32_i128:
++    case INDEX_op_qemu_ld_a64_i128:
++        return C_O2_I1(r, r, r);
+     case INDEX_op_qemu_st_a32_i32:
+     case INDEX_op_qemu_st_a64_i32:
+     case INDEX_op_qemu_st_a32_i64:
+     case INDEX_op_qemu_st_a64_i64:
+         return C_O0_I2(rZ, r);
++    case INDEX_op_qemu_st_a32_i128:
++    case INDEX_op_qemu_st_a64_i128:
++        return C_O0_I3(rZ, rZ, r);
+     case INDEX_op_deposit_i32:
+     case INDEX_op_deposit_i64:
+--
+.34.1
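
Editorial sketch (not part of the patch): the path selection in the aarch64
tcg_out_qemu_ldst_i128 hunk above can be modelled in plain C. This is only a
reading of the control flow shown in the diff. The names sketch_* and the enum
values are local stand-ins introduced for illustration; they are not QEMU
identifiers. Part of the real decision is made at code-generation time and
part by the emitted ANDS+B.NE test, which the sketch folds into one function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for MemOp size codes (log2 of the byte count). */
enum { SKETCH_MO_64 = 3, SKETCH_MO_128 = 4 };

typedef enum {
    SKETCH_PATH_PAIR,      /* plain LDP/STP */
    SKETCH_PATH_EXCLUSIVE  /* LDXP+STXP retry loop */
} SketchPath;

/*
 * Which instruction sequence ends up executing for a 128-bit access.
 * atom and align model h.aa.atom and h.aa.align from the patch.
 */
static SketchPath sketch_aarch64_i128_path(int atom, int align,
                                           bool have_lse2, uint64_t addr)
{
    /* No 16-byte atomicity required, or LSE2 available: use the pair. */
    if (atom < SKETCH_MO_128 || have_lse2) {
        return SKETCH_PATH_PAIR;
    }
    /*
     * Alignment was not already enforced: a misaligned address takes the
     * run-time branch to the pair instruction.
     */
    if (align < SKETCH_MO_128 && (addr & 15) != 0) {
        return SKETCH_PATH_PAIR;
    }
    /* Aligned and 16-byte atomic without LSE2: exclusive loop. */
    return SKETCH_PATH_EXCLUSIVE;
}
```

For example, sketch_aarch64_i128_path(SKETCH_MO_128, SKETCH_MO_64, false, addr)
picks the exclusive loop only when addr is a multiple of 16.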

-[PULL 11/11] target/i386: Expand eflags updates inline
+[PULL 10/27] tcg/ppc: Support 128-bit load/store
-The helpers for reset_rf, cli, sti, clac, stac are
+Use LQ/STQ with ISA v2.07 when 16-byte atomicity is required.
-completely trivial; implement them inline.
+Note that these instructions do not require 16-byte alignment.
-Drop some nearby #if 0 code.
+Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
 Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
 Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
 Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
 ---
- target/i386/helper.h        |  5 -----
+ tcg/ppc/tcg-target-con-set.h |   2 +
- target/i386/tcg/cc_helper.c | 41 -------------------------------------
+ tcg/ppc/tcg-target-con-str.h |   1 +
- target/i386/tcg/translate.c | 30 ++++++++++++++++++++++-----
+ tcg/ppc/tcg-target.h         |   3 +-
-files changed, 25 insertions(+), 51 deletions(-)
+ tcg/ppc/tcg-target.c.inc     | 108 +++++++++++++++++++++++++++++++----
 files changed, 101 insertions(+), 13 deletions(-)
-diff --git a/target/i386/helper.h b/target/i386/helper.h
+diff --git a/tcg/ppc/tcg-target-con-set.h b/tcg/ppc/tcg-target-con-set.h
 index XXXXXXX..XXXXXXX 100644
---- a/target/i386/helper.h
+--- a/tcg/ppc/tcg-target-con-set.h
-+++ b/target/i386/helper.h
++++ b/tcg/ppc/tcg-target-con-set.h
-@@ -XXX,XX +XXX,XX @@ DEF_HELPER_2(syscall, void, env, int)
+@@ -XXX,XX +XXX,XX @@ C_O0_I2(r, r)
- DEF_HELPER_2(sysret, void, env, int)
+ C_O0_I2(r, ri)
  C_O0_I2(v, r)
  C_O0_I3(r, r, r)
 +C_O0_I3(o, m, r)
  C_O0_I4(r, r, ri, ri)
  C_O0_I4(r, r, r, r)
  C_O1_I1(r, r)
@@ -XXX,XX +XXX,XX @@ C_O1_I3(v, v, v, v)
  C_O1_I4(r, r, ri, rZ, rZ)
  C_O1_I4(r, r, r, ri, ri)
  C_O2_I1(r, r, r)
 +C_O2_I1(o, m, r)
  C_O2_I2(r, r, r, r)
  C_O2_I4(r, r, rI, rZM, r, r)
  C_O2_I4(r, r, r, r, rI, rZM)
 diff --git a/tcg/ppc/tcg-target-con-str.h b/tcg/ppc/tcg-target-con-str.h
 index XXXXXXX..XXXXXXX 100644
 --- a/tcg/ppc/tcg-target-con-str.h
 +++ b/tcg/ppc/tcg-target-con-str.h
@@ -XXX,XX +XXX,XX @@
   * REGS(letter, register_mask)
   */
  REGS('r', ALL_GENERAL_REGS)
 +REGS('o', ALL_GENERAL_REGS & 0xAAAAAAAAu)  /* odd registers */
  REGS('v', ALL_VECTOR_REGS)
  /*
 diff --git a/tcg/ppc/tcg-target.h b/tcg/ppc/tcg-target.h
 index XXXXXXX..XXXXXXX 100644
 --- a/tcg/ppc/tcg-target.h
 +++ b/tcg/ppc/tcg-target.h
@@ -XXX,XX +XXX,XX @@ extern bool have_vsx;
  #define TCG_TARGET_HAS_mulsh_i64        1
  #endif
- DEF_HELPER_FLAGS_2(pause, TCG_CALL_NO_WG, noreturn, env, int)
--DEF_HELPER_1(reset_rf, void, env)
+-#define TCG_TARGET_HAS_qemu_ldst_i128   0
- DEF_HELPER_FLAGS_3(raise_interrupt, TCG_CALL_NO_WG, noreturn, env, int, int)
++#define TCG_TARGET_HAS_qemu_ldst_i128   \
- DEF_HELPER_FLAGS_2(raise_exception, TCG_CALL_NO_WG, noreturn, env, int)
++    (TCG_TARGET_REG_BITS == 64 && have_isa_2_07)
--DEF_HELPER_1(cli, void, env)
--DEF_HELPER_1(sti, void, env)
+ /*
--DEF_HELPER_1(clac, void, env)
+  * While technically Altivec could support V64, it has no 64-bit store
--DEF_HELPER_1(stac, void, env)
+diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
- DEF_HELPER_3(boundw, void, env, tl, int)
+index XXXXXXX..XXXXXXX 100644
- DEF_HELPER_3(boundl, void, env, tl, int)
+--- a/tcg/ppc/tcg-target.c.inc
++++ b/tcg/ppc/tcg-target.c.inc
-diff --git a/target/i386/tcg/cc_helper.c b/target/i386/tcg/cc_helper.c
+@@ -XXX,XX +XXX,XX @@ static bool tcg_target_const_match(int64_t val, TCGType type, int ct)
-index XXXXXXX..XXXXXXX 100644
---- a/target/i386/tcg/cc_helper.c
+ #define B      OPCD( 18)
-+++ b/target/i386/tcg/cc_helper.c
+ #define BC     OPCD( 16)
-@@ -XXX,XX +XXX,XX @@ void helper_clts(CPUX86State *env)
++
-     env->cr[0] &= ~CR0_TS_MASK;
+ #define LBZ    OPCD( 34)
-     env->hflags &= ~HF_TS_MASK;
+ #define LHZ    OPCD( 40)
  #define LHA    OPCD( 42)
  #define LWZ    OPCD( 32)
  #define LWZUX  XO31( 55)
 -#define STB    OPCD( 38)
 -#define STH    OPCD( 44)
 -#define STW    OPCD( 36)
 -
 -#define STD    XO62(  0)
 -#define STDU   XO62(  1)
 -#define STDX   XO31(149)
 -
  #define LD     XO58(  0)
  #define LDX    XO31( 21)
  #define LDU    XO58(  1)
  #define LDUX   XO31( 53)
  #define LWA    XO58(  2)
  #define LWAX   XO31(341)
 +#define LQ     OPCD( 56)
 +
 +#define STB    OPCD( 38)
 +#define STH    OPCD( 44)
 +#define STW    OPCD( 36)
 +#define STD    XO62(  0)
 +#define STDU   XO62(  1)
 +#define STDX   XO31(149)
 +#define STQ    XO62(  2)
  #define ADDIC  OPCD( 12)
  #define ADDI   OPCD( 14)
@@ -XXX,XX +XXX,XX @@ typedef struct {
  bool tcg_target_has_memory_bswap(MemOp memop)
  {
 -    return true;
 +    TCGAtomAlign aa;
 +
 +    if ((memop & MO_SIZE) <= MO_64) {
 +        return true;
 +    }
 +
 +    /*
 +     * Reject 16-byte memop with 16-byte atomicity,
 +     * but do allow a pair of 64-bit operations.
 +     */
 +    aa = atom_and_align_for_opc(tcg_ctx, memop, MO_ATOM_IFALIGN, true);
 +    return aa.atom <= MO_64;
  }
--
--void helper_reset_rf(CPUX86State *env)
+ /*
--{
+@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
--    env->eflags &= ~RF_MASK;
+ {
--}
+     TCGLabelQemuLdst *ldst = NULL;
--
+     MemOp opc = get_memop(oi);
--void helper_cli(CPUX86State *env)
+-    MemOp a_bits;
--{
++    MemOp a_bits, s_bits;
--    env->eflags &= ~IF_MASK;
--}
+     /*
--
+      * Book II, Section 1.4, Single-Copy Atomicity, specifies:
--void helper_sti(CPUX86State *env)
+@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
--{
+      * As of 3.0, "the non-atomic access is performed as described in
--    env->eflags |= IF_MASK;
+      * the corresponding list", which matches MO_ATOM_SUBALIGN.
--}
+      */
--
++    s_bits = opc & MO_SIZE;
--void helper_clac(CPUX86State *env)
+     h->aa = atom_and_align_for_opc(s, opc,
--{
+                                    have_isa_3_00 ? MO_ATOM_SUBALIGN
--    env->eflags &= ~AC_MASK;
+                                                  : MO_ATOM_IFALIGN,
--}
+-                                   false);
--
++                                   s_bits == MO_128);
--void helper_stac(CPUX86State *env)
+     a_bits = h->aa.align;
--{
--    env->eflags |= AC_MASK;
+ #ifdef CONFIG_SOFTMMU
--}
+@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
--
+     int fast_off = TLB_MASK_TABLE_OFS(mem_index);
--#if 0
+     int mask_off = fast_off + offsetof(CPUTLBDescFast, mask);
--/* vm86plus instructions */
+     int table_off = fast_off + offsetof(CPUTLBDescFast, table);
--void helper_cli_vm(CPUX86State *env)
+-    unsigned s_bits = opc & MO_SIZE;
--{
--    env->eflags &= ~VIF_MASK;
+     ldst = new_ldst_label(s);
--}
+     ldst->is_ld = is_ld;
--
+@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_st(TCGContext *s, TCGReg datalo, TCGReg datahi,
 -void helper_sti_vm(CPUX86State *env)
 -{
 -    env->eflags |= VIF_MASK;
 -    if (env->eflags & VIP_MASK) {
 -        raise_exception_ra(env, EXCP0D_GPF, GETPC());
 -    }
 -}
 -#endif
 diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
 index XXXXXXX..XXXXXXX 100644
 --- a/target/i386/tcg/translate.c
 +++ b/target/i386/tcg/translate.c
@@ -XXX,XX +XXX,XX @@ static void gen_reset_hflag(DisasContext *s, uint32_t mask)
      }
  }
-+static void gen_set_eflags(DisasContext *s, target_ulong mask)
++static void tcg_out_qemu_ldst_i128(TCGContext *s, TCGReg datalo, TCGReg datahi,
 +                                   TCGReg addr_reg, MemOpIdx oi, bool is_ld)
 +{
-+    TCGv t = tcg_temp_new();
++    TCGLabelQemuLdst *ldst;
-+
++    HostAddress h;
-+    tcg_gen_ld_tl(t, cpu_env, offsetof(CPUX86State, eflags));
++    bool need_bswap;
-+    tcg_gen_ori_tl(t, t, mask);
++    uint32_t insn;
-+    tcg_gen_st_tl(t, cpu_env, offsetof(CPUX86State, eflags));
++    TCGReg index;
-+    tcg_temp_free(t);
++
 +    ldst = prepare_host_addr(s, &h, addr_reg, -1, oi, is_ld);
 +
 +    /* Compose the final address, as LQ/STQ have no indexing. */
 +    index = h.index;
 +    if (h.base != 0) {
 +        index = TCG_REG_TMP1;
 +        tcg_out32(s, ADD | TAB(index, h.base, h.index));
 +    }
 +    need_bswap = get_memop(oi) & MO_BSWAP;
 +
 +    if (h.aa.atom == MO_128) {
 +        tcg_debug_assert(!need_bswap);
 +        tcg_debug_assert(datalo & 1);
 +        tcg_debug_assert(datahi == datalo - 1);
 +        insn = is_ld ? LQ : STQ;
 +        tcg_out32(s, insn | TAI(datahi, index, 0));
 +    } else {
 +        TCGReg d1, d2;
 +
 +        if (HOST_BIG_ENDIAN ^ need_bswap) {
 +            d1 = datahi, d2 = datalo;
 +        } else {
 +            d1 = datalo, d2 = datahi;
 +        }
 +
 +        if (need_bswap) {
 +            tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_R0, 8);
 +            insn = is_ld ? LDBRX : STDBRX;
 +            tcg_out32(s, insn | TAB(d1, 0, index));
 +            tcg_out32(s, insn | TAB(d2, index, TCG_REG_R0));
 +        } else {
 +            insn = is_ld ? LD : STD;
 +            tcg_out32(s, insn | TAI(d1, index, 0));
 +            tcg_out32(s, insn | TAI(d2, index, 8));
 +        }
 +    }
 +
 +    if (ldst) {
 +        ldst->type = TCG_TYPE_I128;
 +        ldst->datalo_reg = datalo;
 +        ldst->datahi_reg = datahi;
 +        ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
 +    }
 +}
 +
-+static void gen_reset_eflags(DisasContext *s, target_ulong mask)
+ static void tcg_out_nop_fill(tcg_insn_unit *p, int count)
 +{
 +    TCGv t = tcg_temp_new();
 +
 +    tcg_gen_ld_tl(t, cpu_env, offsetof(CPUX86State, eflags));
 +    tcg_gen_andi_tl(t, t, ~mask);
 +    tcg_gen_st_tl(t, cpu_env, offsetof(CPUX86State, eflags));
 +    tcg_temp_free(t);
 +}
 +
  /* Clear BND registers during legacy branches.  */
  static void gen_bnd_jmp(DisasContext *s)
  {
-@@ -XXX,XX +XXX,XX @@ do_gen_eob_worker(DisasContext *s, bool inhibit, bool recheck_tf, bool jr)
+     int i;
-     }
+@@ -XXX,XX +XXX,XX @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
+                             args[4], TCG_TYPE_I64);
      if (s->base.tb->flags & HF_RF_MASK) {
 -        gen_helper_reset_rf(cpu_env);
 +        gen_reset_eflags(s, RF_MASK);
      }
      if (recheck_tf) {
          gen_helper_rechecking_single_step(cpu_env);
@@ -XXX,XX +XXX,XX @@ static bool disas_insn(DisasContext *s, CPUState *cpu)
  #endif
      case 0xfa: /* cli */
          if (check_iopl(s)) {
 -            gen_helper_cli(cpu_env);
 +            gen_reset_eflags(s, IF_MASK);
          }
          break;
-     case 0xfb: /* sti */
++    case INDEX_op_qemu_ld_a32_i128:
-         if (check_iopl(s)) {
++    case INDEX_op_qemu_ld_a64_i128:
--            gen_helper_sti(cpu_env);
++        tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
-+            gen_set_eflags(s, IF_MASK);
++        tcg_out_qemu_ldst_i128(s, args[0], args[1], args[2], args[3], true);
-             /* interruptions are enabled only the first insn after sti */
++        break;
-             gen_update_eip_next(s);
-             gen_eob_inhibit_irq(s, true);
+     case INDEX_op_qemu_st_a64_i32:
-@@ -XXX,XX +XXX,XX @@ static bool disas_insn(DisasContext *s, CPUState *cpu)
+         if (TCG_TARGET_REG_BITS == 32) {
-                 || CPL(s) != 0) {
+@@ -XXX,XX +XXX,XX @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
-                 goto illegal_op;
+                             args[4], TCG_TYPE_I64);
-             }
+         }
--            gen_helper_clac(cpu_env);
+         break;
-+            gen_reset_eflags(s, AC_MASK);
++    case INDEX_op_qemu_st_a32_i128:
-             s->base.is_jmp = DISAS_EOB_NEXT;
++    case INDEX_op_qemu_st_a64_i128:
-             break;
++        tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
++        tcg_out_qemu_ldst_i128(s, args[0], args[1], args[2], args[3], false);
-@@ -XXX,XX +XXX,XX @@ static bool disas_insn(DisasContext *s, CPUState *cpu)
++        break;
-                 || CPL(s) != 0) {
-                 goto illegal_op;
+     case INDEX_op_setcond_i32:
-             }
+         tcg_out_setcond(s, TCG_TYPE_I32, args[3], args[0], args[1], args[2],
--            gen_helper_stac(cpu_env);
+@@ -XXX,XX +XXX,XX @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
-+            gen_set_eflags(s, AC_MASK);
+     case INDEX_op_qemu_st_a64_i64:
-             s->base.is_jmp = DISAS_EOB_NEXT;
+         return TCG_TARGET_REG_BITS == 64 ? C_O0_I2(r, r) : C_O0_I4(r, r, r, r);
-             break;
++    case INDEX_op_qemu_ld_a32_i128:
 +    case INDEX_op_qemu_ld_a64_i128:
 +        return C_O2_I1(o, m, r);
 +    case INDEX_op_qemu_st_a32_i128:
 +    case INDEX_op_qemu_st_a64_i128:
 +        return C_O0_I3(o, m, r);
 +
      case INDEX_op_add_vec:
      case INDEX_op_sub_vec:
      case INDEX_op_mul_vec:
 --
 .34.1
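
Editorial sketch (not part of the patch): the new 'o' constraint masks the
general registers with 0xAAAAAAAAu, and the LQ/STQ path asserts that datalo is
odd and that datahi is the register just below it. The standalone C below
checks those two conditions, assuming 32 general registers numbered 0..31.
The sketch_* names are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same bit pattern as the 'o' constraint mask in the patch. */
#define SKETCH_ODD_REGS 0xAAAAAAAAu

/* True if register number reg is selected by the odd-register mask. */
static bool sketch_is_odd_reg(unsigned reg)
{
    return reg < 32 && ((SKETCH_ODD_REGS >> reg) & 1u) != 0;
}

/*
 * Mirrors the two debug assertions in the LQ/STQ path:
 *   datalo & 1
 *   datahi == datalo - 1
 */
static bool sketch_valid_quad_pair(unsigned datalo, unsigned datahi)
{
    return sketch_is_odd_reg(datalo) && datahi + 1 == datalo;
}
```

With this model, (datalo, datahi) = (5, 4) is accepted, while (4, 3) and
(5, 6) are rejected.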

-[PULL 06/11] target/i386: Use cpu_unwind_state_data for tpr access
+[PULL 11/27] tcg/s390x: Support 128-bit load/store
-Avoid cpu_restore_state, and modifying env->eip out from
+Use LPQ/STPQ when 16-byte atomicity is required.
-underneath the translator with TARGET_TB_PCREL.  There is
+Note that these instructions require 16-byte alignment.
 some slight duplication from x86_restore_state_to_opc,
 but it's just a few lines.
-Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1269
+Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
 Reviewed-by: Claudio Fontana <cfontana@suse.de>
 Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
 ---
- target/i386/helper.c | 21 +++++++++++++++++++--
+ tcg/s390x/tcg-target-con-set.h |   2 +
-file changed, 19 insertions(+), 2 deletions(-)
+ tcg/s390x/tcg-target.h         |   2 +-
  tcg/s390x/tcg-target.c.inc     | 107 ++++++++++++++++++++++++++++++++-
 files changed, 107 insertions(+), 4 deletions(-)
-diff --git a/target/i386/helper.c b/target/i386/helper.c
+diff --git a/tcg/s390x/tcg-target-con-set.h b/tcg/s390x/tcg-target-con-set.h
 index XXXXXXX..XXXXXXX 100644
---- a/target/i386/helper.c
+--- a/tcg/s390x/tcg-target-con-set.h
-+++ b/target/i386/helper.c
++++ b/tcg/s390x/tcg-target-con-set.h
-@@ -XXX,XX +XXX,XX @@ void cpu_x86_inject_mce(Monitor *mon, X86CPU *cpu, int bank,
+@@ -XXX,XX +XXX,XX @@ C_O0_I2(r, r)
  C_O0_I2(r, ri)
  C_O0_I2(r, rA)
  C_O0_I2(v, r)
 +C_O0_I3(o, m, r)
  C_O1_I1(r, r)
  C_O1_I1(v, r)
  C_O1_I1(v, v)
@@ -XXX,XX +XXX,XX @@ C_O1_I2(v, v, v)
  C_O1_I3(v, v, v, v)
  C_O1_I4(r, r, ri, rI, r)
  C_O1_I4(r, r, rA, rI, r)
 +C_O2_I1(o, m, r)
  C_O2_I2(o, m, 0, r)
  C_O2_I2(o, m, r, r)
  C_O2_I3(o, m, 0, 1, r)
 diff --git a/tcg/s390x/tcg-target.h b/tcg/s390x/tcg-target.h
 index XXXXXXX..XXXXXXX 100644
 --- a/tcg/s390x/tcg-target.h
 +++ b/tcg/s390x/tcg-target.h
@@ -XXX,XX +XXX,XX @@ extern uint64_t s390_facilities[3];
  #define TCG_TARGET_HAS_muluh_i64      0
  #define TCG_TARGET_HAS_mulsh_i64      0
 -#define TCG_TARGET_HAS_qemu_ldst_i128 0
 +#define TCG_TARGET_HAS_qemu_ldst_i128 1
  #define TCG_TARGET_HAS_v64            HAVE_FACILITY(VECTOR)
  #define TCG_TARGET_HAS_v128           HAVE_FACILITY(VECTOR)
 diff --git a/tcg/s390x/tcg-target.c.inc b/tcg/s390x/tcg-target.c.inc
 index XXXXXXX..XXXXXXX 100644
 --- a/tcg/s390x/tcg-target.c.inc
 +++ b/tcg/s390x/tcg-target.c.inc
@@ -XXX,XX +XXX,XX @@ typedef enum S390Opcode {
      RXY_LLGF    = 0xe316,
      RXY_LLGH    = 0xe391,
      RXY_LMG     = 0xeb04,
 +    RXY_LPQ     = 0xe38f,
      RXY_LRV     = 0xe31e,
      RXY_LRVG    = 0xe30f,
      RXY_LRVH    = 0xe31f,
@@ -XXX,XX +XXX,XX @@ typedef enum S390Opcode {
      RXY_STG     = 0xe324,
      RXY_STHY    = 0xe370,
      RXY_STMG    = 0xeb24,
 +    RXY_STPQ    = 0xe38e,
      RXY_STRV    = 0xe33e,
      RXY_STRVG   = 0xe32f,
      RXY_STRVH   = 0xe33f,
@@ -XXX,XX +XXX,XX @@ typedef struct {
  bool tcg_target_has_memory_bswap(MemOp memop)
  {
 -    return true;
 +    TCGAtomAlign aa;
 +
 +    if ((memop & MO_SIZE) <= MO_64) {
 +        return true;
 +    }
 +
 +    /*
 +     * Reject 16-byte memop with 16-byte atomicity,
 +     * but do allow a pair of 64-bit operations.
 +     */
 +    aa = atom_and_align_for_opc(tcg_ctx, memop, MO_ATOM_IFALIGN, true);
 +    return aa.atom <= MO_64;
  }
  static void tcg_out_qemu_ld_direct(TCGContext *s, MemOp opc, TCGReg data,
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
  {
      TCGLabelQemuLdst *ldst = NULL;
      MemOp opc = get_memop(oi);
 +    MemOp s_bits = opc & MO_SIZE;
      unsigned a_mask;
 -    h->aa = atom_and_align_for_opc(s, opc, MO_ATOM_IFALIGN, false);
 +    h->aa = atom_and_align_for_opc(s, opc, MO_ATOM_IFALIGN, s_bits == MO_128);
      a_mask = (1 << h->aa.align) - 1;
  #ifdef CONFIG_SOFTMMU
 -    unsigned s_bits = opc & MO_SIZE;
      unsigned s_mask = (1 << s_bits) - 1;
      int mem_index = get_mmuidx(oi);
      int fast_off = TLB_MASK_TABLE_OFS(mem_index);
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_st(TCGContext* s, TCGReg data_reg, TCGReg addr_reg,
      }
  }
-+static target_ulong get_memio_eip(CPUX86State *env)
++static void tcg_out_qemu_ldst_i128(TCGContext *s, TCGReg datalo, TCGReg datahi,
 +                                   TCGReg addr_reg, MemOpIdx oi, bool is_ld)
 +{
-+    uint64_t data[TARGET_INSN_START_WORDS];
++    TCGLabel *l1 = NULL, *l2 = NULL;
-+    CPUState *cs = env_cpu(env);
++    TCGLabelQemuLdst *ldst;
-+
++    HostAddress h;
-+    if (!cpu_unwind_state_data(cs, cs->mem_io_pc, data)) {
++    bool need_bswap;
-+        return env->eip;
++    bool use_pair;
-+    }
++    S390Opcode insn;
 +
-+    /* Per x86_restore_state_to_opc. */
++    ldst = prepare_host_addr(s, &h, addr_reg, oi, is_ld);
-+    if (TARGET_TB_PCREL) {
++
-+        return (env->eip & TARGET_PAGE_MASK) | data[0];
++    use_pair = h.aa.atom < MO_128;
-+    } else {
++    need_bswap = get_memop(oi) & MO_BSWAP;
-+        return data[0] - env->segs[R_CS].base;
++
 +    if (!use_pair) {
 +        /*
 +         * Atomicity requires we use LPQ.  If we've already checked for
 +         * 16-byte alignment, that's all we need.  If we arrive with
 +         * lesser alignment, we have determined that less than 16-byte
 +         * alignment can be satisfied with two 8-byte loads.
 +         */
 +        if (h.aa.align < MO_128) {
 +            use_pair = true;
 +            l1 = gen_new_label();
 +            l2 = gen_new_label();
 +
 +            tcg_out_insn(s, RI, TMLL, addr_reg, 15);
 +            tgen_branch(s, 7, l1); /* CC in {1,2,3} */
 +        }
 +
 +        tcg_debug_assert(!need_bswap);
 +        tcg_debug_assert(datalo & 1);
 +        tcg_debug_assert(datahi == datalo - 1);
 +        insn = is_ld ? RXY_LPQ : RXY_STPQ;
 +        tcg_out_insn_RXY(s, insn, datahi, h.base, h.index, h.disp);
 +
 +        if (use_pair) {
 +            tgen_branch(s, S390_CC_ALWAYS, l2);
 +            tcg_out_label(s, l1);
 +        }
 +    }
 +    if (use_pair) {
 +        TCGReg d1, d2;
 +
 +        if (need_bswap) {
 +            d1 = datalo, d2 = datahi;
 +            insn = is_ld ? RXY_LRVG : RXY_STRVG;
 +        } else {
 +            d1 = datahi, d2 = datalo;
 +            insn = is_ld ? RXY_LG : RXY_STG;
 +        }
 +
 +        if (h.base == d1 || h.index == d1) {
 +            tcg_out_insn(s, RXY, LAY, TCG_TMP0, h.base, h.index, h.disp);
 +            h.base = TCG_TMP0;
 +            h.index = TCG_REG_NONE;
 +            h.disp = 0;
 +        }
 +        tcg_out_insn_RXY(s, insn, d1, h.base, h.index, h.disp);
 +        tcg_out_insn_RXY(s, insn, d2, h.base, h.index, h.disp + 8);
 +    }
 +    if (l2) {
 +        tcg_out_label(s, l2);
 +    }
 +
 +    if (ldst) {
 +        ldst->type = TCG_TYPE_I128;
 +        ldst->datalo_reg = datalo;
 +        ldst->datahi_reg = datahi;
 +        ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
 +    }
 +}
 +
- void cpu_report_tpr_access(CPUX86State *env, TPRAccess access)
+ static void tcg_out_exit_tb(TCGContext *s, uintptr_t a0)
  {
-     X86CPU *cpu = env_archcpu(env);
+     /* Reuse the zeroing that exists for goto_ptr.  */
-@@ -XXX,XX +XXX,XX @@ void cpu_report_tpr_access(CPUX86State *env, TPRAccess access)
+@@ -XXX,XX +XXX,XX @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
+     case INDEX_op_qemu_st_a64_i64:
-         cpu_interrupt(cs, CPU_INTERRUPT_TPR);
+         tcg_out_qemu_st(s, args[0], args[1], args[2], TCG_TYPE_I64);
-     } else if (tcg_enabled()) {
+         break;
--        cpu_restore_state(cs, cs->mem_io_pc, false);
++    case INDEX_op_qemu_ld_a32_i128:
-+        target_ulong eip = get_memio_eip(env);
++    case INDEX_op_qemu_ld_a64_i128:
++        tcg_out_qemu_ldst_i128(s, args[0], args[1], args[2], args[3], true);
--        apic_handle_tpr_access_report(cpu->apic_state, env->eip, access);
++        break;
-+        apic_handle_tpr_access_report(cpu->apic_state, eip, access);
++    case INDEX_op_qemu_st_a32_i128:
-     }
++    case INDEX_op_qemu_st_a64_i128:
- }
++        tcg_out_qemu_ldst_i128(s, args[0], args[1], args[2], args[3], false);
- #endif /* !CONFIG_USER_ONLY */
++        break;
      case INDEX_op_ld16s_i64:
          tcg_out_mem(s, 0, RXY_LGH, args[0], args[1], TCG_REG_NONE, args[2]);
@@ -XXX,XX +XXX,XX @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
      case INDEX_op_qemu_st_a32_i32:
      case INDEX_op_qemu_st_a64_i32:
          return C_O0_I2(r, r);
 +    case INDEX_op_qemu_ld_a32_i128:
 +    case INDEX_op_qemu_ld_a64_i128:
 +        return C_O2_I1(o, m, r);
 +    case INDEX_op_qemu_st_a32_i128:
 +    case INDEX_op_qemu_st_a64_i128:
 +        return C_O0_I3(o, m, r);
      case INDEX_op_deposit_i32:
      case INDEX_op_deposit_i64:
 --
 .34.1
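
Editorial sketch (not part of the patch): when 16-byte atomicity is not
required, the ppc and s390x hunks above fall back to a pair of 64-bit
accesses and choose d1 (offset 0) and d2 (offset 8) from datahi/datalo
according to the byte order of the data. The C below models only the
resulting memory layout of a store, not the emitted instructions. The
sketch_* names are hypothetical; little == true corresponds to need_bswap on
a big-endian host such as s390x.

```c
#include <stdbool.h>
#include <stdint.h>

/* Write one 64-bit value byte by byte in the requested byte order. */
static void sketch_put64(uint8_t *p, uint64_t v, bool little)
{
    for (int i = 0; i < 8; i++) {
        int shift = little ? i * 8 : (7 - i) * 8;
        p[i] = (uint8_t)(v >> shift);
    }
}

/*
 * Store a 128-bit value as two 64-bit halves.
 * Big-endian data:    high half at offset 0, low half at offset 8.
 * Little-endian data: low half at offset 0, high half at offset 8.
 */
static void sketch_store128_as_pair(uint8_t buf[16], uint64_t hi,
                                    uint64_t lo, bool little)
{
    uint64_t d1 = little ? lo : hi;   /* goes to offset 0 */
    uint64_t d2 = little ? hi : lo;   /* goes to offset 8 */

    sketch_put64(buf, d1, little);
    sketch_put64(buf + 8, d2, little);
}
```

The two layouts are byte-for-byte mirror images of each other, which is why
the fallback swaps the register order together with the byte-reversing
instruction variant.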

-[PULL 10/11] accel/tcg: Remove reset_icount argument from cpu_restore_state_from_tb
+[PULL 12/27] accel/tcg: Extract load_atom_extract_al16_or_al8 to host header
-The value passed is always true.
+Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
 Reviewed-by: Claudio Fontana <cfontana@suse.de>
 Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
 ---
- accel/tcg/internal.h      |  2 +-
+ .../generic/host/load-extract-al16-al8.h      | 45 +++++++++++++++++++
- accel/tcg/tb-maint.c      |  4 ++--
+ accel/tcg/ldst_atomicity.c.inc                | 36 +--------------
- accel/tcg/translate-all.c | 15 +++++++--------
+files changed, 47 insertions(+), 34 deletions(-)
-files changed, 10 insertions(+), 11 deletions(-)
+ create mode 100644 host/include/generic/host/load-extract-al16-al8.h
-diff --git a/accel/tcg/internal.h b/accel/tcg/internal.h
+diff --git a/host/include/generic/host/load-extract-al16-al8.h b/host/include/generic/host/load-extract-al16-al8.h
 new file mode 100644
 index XXXXXXX..XXXXXXX
 --- /dev/null
 +++ b/host/include/generic/host/load-extract-al16-al8.h
@@ -XXX,XX +XXX,XX @@
 +/*
 + * SPDX-License-Identifier: GPL-2.0-or-later
 + * Atomic extract 64 from 128-bit, generic version.
 + *
 + * Copyright (C) 2023 Linaro, Ltd.
 + */
 +
 +#ifndef HOST_LOAD_EXTRACT_AL16_AL8_H
 +#define HOST_LOAD_EXTRACT_AL16_AL8_H
 +
 +/**
 + * load_atom_extract_al16_or_al8:
 + * @pv: host address
 + * @s: object size in bytes, @s <= 8.
 + *
 + * Load @s bytes from @pv, when pv % s != 0.  If [pv, pv+s-1] does not
 + * cross a 16-byte boundary then the access must be 16-byte atomic,
 + * otherwise the access must be 8-byte atomic.
 + */
 +static inline uint64_t ATTRIBUTE_ATOMIC128_OPT
 +load_atom_extract_al16_or_al8(void *pv, int s)
 +{
 +    uintptr_t pi = (uintptr_t)pv;
 +    int o = pi & 7;
 +    int shr = (HOST_BIG_ENDIAN ? 16 - s - o : o) * 8;
 +    Int128 r;
 +
 +    pv = (void *)(pi & ~7);
 +    if (pi & 8) {
 +        uint64_t *p8 = __builtin_assume_aligned(pv, 16, 8);
 +        uint64_t a = qatomic_read__nocheck(p8);
 +        uint64_t b = qatomic_read__nocheck(p8 + 1);
 +
 +        if (HOST_BIG_ENDIAN) {
 +            r = int128_make128(b, a);
 +        } else {
 +            r = int128_make128(a, b);
 +        }
 +    } else {
 +        r = atomic16_read_ro(pv);
 +    }
 +    return int128_getlo(int128_urshift(r, shr));
 +}
 +
 +#endif /* HOST_LOAD_EXTRACT_AL16_AL8_H */
 diff --git a/accel/tcg/ldst_atomicity.c.inc b/accel/tcg/ldst_atomicity.c.inc
 index XXXXXXX..XXXXXXX 100644
---- a/accel/tcg/internal.h
+--- a/accel/tcg/ldst_atomicity.c.inc
-+++ b/accel/tcg/internal.h
++++ b/accel/tcg/ldst_atomicity.c.inc
-@@ -XXX,XX +XXX,XX @@ TranslationBlock *tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
+@@ -XXX,XX +XXX,XX @@
-                                tb_page_addr_t phys_page2);
+  * See the COPYING file in the top-level directory.
- bool tb_invalidate_phys_page_unwind(tb_page_addr_t addr, uintptr_t pc);
+  */
- void cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
--                               uintptr_t host_pc, bool reset_icount);
++#include "host/load-extract-al16-al8.h"
-+                               uintptr_t host_pc);
++
+ #ifdef CONFIG_ATOMIC64
- /* Return the current PC from CPU, which may be cached in TB. */
+ # define HAVE_al8          true
- static inline target_ulong log_pc(CPUState *cpu, const TranslationBlock *tb)
+ #else
-diff --git a/accel/tcg/tb-maint.c b/accel/tcg/tb-maint.c
+@@ -XXX,XX +XXX,XX @@ static uint64_t load_atom_extract_al16_or_exit(CPUArchState *env, uintptr_t ra,
-index XXXXXXX..XXXXXXX 100644
+     return int128_getlo(r);
 --- a/accel/tcg/tb-maint.c
 +++ b/accel/tcg/tb-maint.c
@@ -XXX,XX +XXX,XX @@ tb_invalidate_phys_page_range__locked(struct page_collection *pages,
                   * restore the CPU state.
                   */
                  current_tb_modified = true;
 -                cpu_restore_state_from_tb(cpu, current_tb, retaddr, true);
 +                cpu_restore_state_from_tb(cpu, current_tb, retaddr);
              }
  #endif /* TARGET_HAS_PRECISE_SMC */
              tb_phys_invalidate__locked(tb);
@@ -XXX,XX +XXX,XX @@ bool tb_invalidate_phys_page_unwind(tb_page_addr_t addr, uintptr_t pc)
               * function to partially restore the CPU state.
               */
              current_tb_modified = true;
 -            cpu_restore_state_from_tb(cpu, current_tb, pc, true);
 +            cpu_restore_state_from_tb(cpu, current_tb, pc);
          }
  #endif /* TARGET_HAS_PRECISE_SMC */
          tb_phys_invalidate(tb, addr);
 diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
 index XXXXXXX..XXXXXXX 100644
 --- a/accel/tcg/translate-all.c
 +++ b/accel/tcg/translate-all.c
@@ -XXX,XX +XXX,XX @@ static int cpu_unwind_data_from_tb(TranslationBlock *tb, uintptr_t host_pc,
  }
- /*
+-/**
-- * The cpu state corresponding to 'host_pc' is restored.
+- * load_atom_extract_al16_or_al8:
-- * When reset_icount is true, current TB will be interrupted and
+- * @p: host address
-- * icount should be recalculated.
+- * @s: object size in bytes, @s <= 8.
-+ * The cpu state corresponding to 'host_pc' is restored in
+- *
-+ * preparation for exiting the TB.
+- * Load @s bytes from @p, when p % s != 0.  If [p, p+s-1] does not
-  */
+- * cross an 16-byte boundary then the access must be 16-byte atomic,
- void cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
+- * otherwise the access must be 8-byte atomic.
--                               uintptr_t host_pc, bool reset_icount)
+- */
-+                               uintptr_t host_pc)
+-static inline uint64_t ATTRIBUTE_ATOMIC128_OPT
- {
+-load_atom_extract_al16_or_al8(void *pv, int s)
-     uint64_t data[TARGET_INSN_START_WORDS];
+-{
- #ifdef CONFIG_PROFILER
+-    uintptr_t pi = (uintptr_t)pv;
-@@ -XXX,XX +XXX,XX @@ void cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
+-    int o = pi & 7;
-         return;
+-    int shr = (HOST_BIG_ENDIAN ? 16 - s - o : o) * 8;
-     }
+-    Int128 r;
+-
--    if (reset_icount && (tb_cflags(tb) & CF_USE_ICOUNT)) {
+-    pv = (void *)(pi & ~7);
-+    if (tb_cflags(tb) & CF_USE_ICOUNT) {
+-    if (pi & 8) {
-         assert(icount_enabled());
+-        uint64_t *p8 = __builtin_assume_aligned(pv, 16, 8);
-         /*
+-        uint64_t a = qatomic_read__nocheck(p8);
-          * Reset the cycle counter to the start of the block and
+-        uint64_t b = qatomic_read__nocheck(p8 + 1);
-@@ -XXX,XX +XXX,XX @@ bool cpu_restore_state(CPUState *cpu, uintptr_t host_pc)
+-
-     if (in_code_gen_buffer((const void *)(host_pc - tcg_splitwx_diff))) {
+-        if (HOST_BIG_ENDIAN) {
-         TranslationBlock *tb = tcg_tb_lookup(host_pc);
+-            r = int128_make128(b, a);
-         if (tb) {
+-        } else {
--            cpu_restore_state_from_tb(cpu, tb, host_pc, true);
+-            r = int128_make128(a, b);
-+            cpu_restore_state_from_tb(cpu, tb, host_pc);
+-        }
-             return true;
+-    } else {
-         }
+-        r = atomic16_read_ro(pv);
-     }
+-    }
-@@ -XXX,XX +XXX,XX @@ void tb_check_watchpoint(CPUState *cpu, uintptr_t retaddr)
+-    return int128_getlo(int128_urshift(r, shr));
-     tb = tcg_tb_lookup(retaddr);
+-}
-     if (tb) {
+-
-         /* We can use retranslation to find the PC.  */
+ /**
--        cpu_restore_state_from_tb(cpu, tb, retaddr, true);
+  * load_atom_4_by_2:
-+        cpu_restore_state_from_tb(cpu, tb, retaddr);
+  * @pv: host address
          tb_phys_invalidate(tb, -1);
      } else {
          /* The exception probably happened in a helper.  The CPU state should
@@ -XXX,XX +XXX,XX @@ void cpu_io_recompile(CPUState *cpu, uintptr_t retaddr)
          cpu_abort(cpu, "cpu_io_recompile: could not find TB for pc=%p",
                    (void *)retaddr);
      }
 -    cpu_restore_state_from_tb(cpu, tb, retaddr, true);
 +    cpu_restore_state_from_tb(cpu, tb, retaddr);
      /*
       * Some guests must re-execute the branch when re-executing a delay
 --
 .34.1
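
Editorial sketch (not part of the patch): the shift arithmetic in
load_atom_extract_al16_or_al8 can be tried out without any atomics. The C
below reproduces the shr computation from the header and, for the
little-endian case only, extracts @s bytes from a 16-byte window given as two
64-bit halves. The sketch_* names are hypothetical. The final mask is an
addition for the sketch; the function in the patch returns the low 64 bits
unmasked.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same formula as in the header: o is the offset within 8 bytes (pi & 7). */
static unsigned sketch_shr_bits(bool big_endian, unsigned s, unsigned o)
{
    return (big_endian ? 16 - s - o : o) * 8;
}

/*
 * Little-endian model: lo holds bytes 0..7 of the window, hi bytes 8..15.
 * Returns the s bytes starting at byte offset o (o < 8, 1 <= s <= 8).
 */
static uint64_t sketch_extract_le(uint64_t lo, uint64_t hi,
                                  unsigned o, unsigned s)
{
    unsigned shr = sketch_shr_bits(false, s, o);
    /* 128-bit logical right shift, keeping only the low 64 bits. */
    uint64_t r = shr ? (lo >> shr) | (hi << (64 - shr)) : lo;

    /* Mask to s bytes (sketch only). */
    return s < 8 ? r & ((UINT64_C(1) << (s * 8)) - 1) : r;
}
```

With window bytes 01 02 ... 10 (hex) in address order, a 4-byte extract at
offset 6 covers bytes 07 08 09 0a, so it straddles the two 64-bit halves.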

-[PULL 05/11] accel/tcg: Introduce cpu_unwind_state_data
+[PULL 13/27] accel/tcg: Extract store_atom_insert_al16 to host header
-Add a way to examine the unwind data without actually
+Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
 restoring the data back into env.
 Reviewed-by: Claudio Fontana <cfontana@suse.de>
 Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
 ---
- accel/tcg/internal.h      |  4 +--
+ host/include/generic/host/store-insert-al16.h | 50 +++++++++++++++++++
- include/exec/exec-all.h   | 21 ++++++++---
+ accel/tcg/ldst_atomicity.c.inc                | 40 +--------------
- accel/tcg/translate-all.c | 74 ++++++++++++++++++++++++++-------------
+files changed, 51 insertions(+), 39 deletions(-)
-files changed, 68 insertions(+), 31 deletions(-)
+ create mode 100644 host/include/generic/host/store-insert-al16.h
-diff --git a/accel/tcg/internal.h b/accel/tcg/internal.h
+diff --git a/host/include/generic/host/store-insert-al16.h b/host/include/generic/host/store-insert-al16.h
-index XXXXXXX..XXXXXXX 100644
+new file mode 100644
---- a/accel/tcg/internal.h
+index XXXXXXX..XXXXXXX
-+++ b/accel/tcg/internal.h
+--- /dev/null
-@@ -XXX,XX +XXX,XX @@ void tb_reset_jump(TranslationBlock *tb, int n);
++++ b/host/include/generic/host/store-insert-al16.h
- TranslationBlock *tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
+@@ -XXX,XX +XXX,XX @@
-                                tb_page_addr_t phys_page2);
++/*
- bool tb_invalidate_phys_page_unwind(tb_page_addr_t addr, uintptr_t pc);
++ * SPDX-License-Identifier: GPL-2.0-or-later
--int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
++ * Atomic store insert into 128-bit, generic version.
--                              uintptr_t searched_pc, bool reset_icount);
++ *
-+void cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
++ * Copyright (C) 2023 Linaro, Ltd.
-+                               uintptr_t host_pc, bool reset_icount);
++ */
++
- /* Return the current PC from CPU, which may be cached in TB. */
++#ifndef HOST_STORE_INSERT_AL16_H
- static inline target_ulong log_pc(CPUState *cpu, const TranslationBlock *tb)
++#define HOST_STORE_INSERT_AL16_H
-diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
++
 index XXXXXXX..XXXXXXX 100644
 --- a/include/exec/exec-all.h
 +++ b/include/exec/exec-all.h
@@ -XXX,XX +XXX,XX @@ typedef ram_addr_t tb_page_addr_t;
  #define TB_PAGE_ADDR_FMT RAM_ADDR_FMT
  #endif
 +/**
-+ * cpu_unwind_state_data:
++ * store_atom_insert_al16:
-+ * @cpu: the cpu context
++ * @p: host address
-+ * @host_pc: the host pc within the translation
++ * @val: shifted value to store
-+ * @data: output data
++ * @msk: mask for value to store
 + *
-+ * Attempt to load the unwind state for a host pc occurring in
++ * Atomically store @val to @p masked by @msk.
 + * translated code.  If @host_pc is not in translated code, the
 + * function returns false; otherwise @data is loaded.
 + * This is the same unwind info as given to restore_state_to_opc.
 + */
-+bool cpu_unwind_state_data(CPUState *cpu, uintptr_t host_pc, uint64_t *data);
++static inline void ATTRIBUTE_ATOMIC128_OPT
 +store_atom_insert_al16(Int128 *ps, Int128 val, Int128 msk)
 +{
 +#if defined(CONFIG_ATOMIC128)
 +    __uint128_t *pu;
 +    Int128Alias old, new;
 +
- /**
++    /* With CONFIG_ATOMIC128, we can avoid the memory barriers. */
-  * cpu_restore_state:
++    pu = __builtin_assume_aligned(ps, 16);
-- * @cpu: the vCPU state is to be restore to
++    old.u = *pu;
-- * @searched_pc: the host PC the fault occurred at
++    msk = int128_not(msk);
-+ * @cpu: the cpu context
++    do {
-+ * @host_pc: the host pc within the translation
++        new.s = int128_and(old.s, msk);
-  * @will_exit: true if the TB executed will be interrupted after some
++        new.s = int128_or(new.s, val);
-                cpu adjustments. Required for maintaining the correct
++    } while (!__atomic_compare_exchange_n(pu, &old.u, new.u, true,
-                icount value
++                                          __ATOMIC_RELAXED, __ATOMIC_RELAXED));
-  * @return: true if state was restored, false otherwise
++#else
-  *
++    Int128 old, new, cmp;
-  * Attempt to restore the state for a fault occurring in translated
++
-- * code. If the searched_pc is not in translated code no state is
++    ps = __builtin_assume_aligned(ps, 16);
-+ * code. If @host_pc is not in translated code no state is
++    old = *ps;
-  * restored and the function returns false.
++    msk = int128_not(msk);
-  */
++    do {
--bool cpu_restore_state(CPUState *cpu, uintptr_t searched_pc, bool will_exit);
++        cmp = old;
-+bool cpu_restore_state(CPUState *cpu, uintptr_t host_pc, bool will_exit);
++        new = int128_and(old, msk);
++        new = int128_or(new, val);
- G_NORETURN void cpu_loop_exit_noexc(CPUState *cpu);
++        old = atomic16_cmpxchg(ps, cmp, new);
- G_NORETURN void cpu_loop_exit(CPUState *cpu);
++    } while (int128_ne(cmp, old));
-diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
++#endif
 index XXXXXXX..XXXXXXX 100644
 --- a/accel/tcg/translate-all.c
 +++ b/accel/tcg/translate-all.c
@@ -XXX,XX +XXX,XX @@ static int encode_search(TranslationBlock *tb, uint8_t *block)
      return p - block;
  }
 -/* The cpu state corresponding to 'searched_pc' is restored.
 - * When reset_icount is true, current TB will be interrupted and
 - * icount should be recalculated.
 - */
 -int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
 -                              uintptr_t searched_pc, bool reset_icount)
 +static int cpu_unwind_data_from_tb(TranslationBlock *tb, uintptr_t host_pc,
 +                                   uint64_t *data)
  {
 -    uint64_t data[TARGET_INSN_START_WORDS];
 -    uintptr_t host_pc = (uintptr_t)tb->tc.ptr;
 +    uintptr_t iter_pc = (uintptr_t)tb->tc.ptr;
      const uint8_t *p = tb->tc.ptr + tb->tc.size;
      int i, j, num_insns = tb->icount;
 -#ifdef CONFIG_PROFILER
 -    TCGProfile *prof = &tcg_ctx->prof;
 -    int64_t ti = profile_getclock();
 -#endif
 -    searched_pc -= GETPC_ADJ;
 +    host_pc -= GETPC_ADJ;
 -    if (searched_pc < host_pc) {
 +    if (host_pc < iter_pc) {
          return -1;
      }
 -    memset(data, 0, sizeof(data));
 +    memset(data, 0, sizeof(uint64_t) * TARGET_INSN_START_WORDS);
      if (!TARGET_TB_PCREL) {
          data[0] = tb_pc(tb);
      }
 -    /* Reconstruct the stored insn data while looking for the point at
 -       which the end of the insn exceeds the searched_pc.  */
 +    /*
 +     * Reconstruct the stored insn data while looking for the point
 +     * at which the end of the insn exceeds host_pc.
 +     */
      for (i = 0; i < num_insns; ++i) {
          for (j = 0; j < TARGET_INSN_START_WORDS; ++j) {
              data[j] += decode_sleb128(&p);
          }
 -        host_pc += decode_sleb128(&p);
 -        if (host_pc > searched_pc) {
 -            goto found;
 +        iter_pc += decode_sleb128(&p);
 +        if (iter_pc > host_pc) {
 +            return num_insns - i;
          }
      }
      return -1;
 +}
 +
-+/*
++#endif /* HOST_STORE_INSERT_AL16_H */
-+ * The cpu state corresponding to 'host_pc' is restored.
+diff --git a/accel/tcg/ldst_atomicity.c.inc b/accel/tcg/ldst_atomicity.c.inc
-+ * When reset_icount is true, current TB will be interrupted and
+index XXXXXXX..XXXXXXX 100644
-+ * icount should be recalculated.
+--- a/accel/tcg/ldst_atomicity.c.inc
-+ */
++++ b/accel/tcg/ldst_atomicity.c.inc
-+void cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
+@@ -XXX,XX +XXX,XX @@
-+                               uintptr_t host_pc, bool reset_icount)
+  */
-+{
-+    uint64_t data[TARGET_INSN_START_WORDS];
+ #include "host/load-extract-al16-al8.h"
-+#ifdef CONFIG_PROFILER
++#include "host/store-insert-al16.h"
-+    TCGProfile *prof = &tcg_ctx->prof;
-+    int64_t ti = profile_getclock();
+ #ifdef CONFIG_ATOMIC64
-+#endif
+ # define HAVE_al8          true
-+    int insns_left = cpu_unwind_data_from_tb(tb, host_pc, data);
+@@ -XXX,XX +XXX,XX @@ static void store_atom_insert_al8(uint64_t *p, uint64_t val, uint64_t msk)
-+
+                                           __ATOMIC_RELAXED, __ATOMIC_RELAXED));
 +    if (insns_left < 0) {
 +        return;
 +    }
 - found:
      if (reset_icount && (tb_cflags(tb) & CF_USE_ICOUNT)) {
          assert(icount_enabled());
 -        /* Reset the cycle counter to the start of the block
 -           and shift it to the number of actually executed instructions */
 -        cpu_neg(cpu)->icount_decr.u16.low += num_insns - i;
 +        /*
 +         * Reset the cycle counter to the start of the block and
 +         * shift it to the number of actually executed instructions.
 +         */
 +        cpu_neg(cpu)->icount_decr.u16.low += insns_left;
      }
      cpu->cc->tcg_ops->restore_state_to_opc(cpu, tb, data);
@@ -XXX,XX +XXX,XX @@ int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
                  prof->restore_time + profile_getclock() - ti);
      qatomic_set(&prof->restore_count, prof->restore_count + 1);
  #endif
 -    return 0;
  }
- bool cpu_restore_state(CPUState *cpu, uintptr_t host_pc, bool will_exit)
+-/**
-@@ -XXX,XX +XXX,XX @@ bool cpu_restore_state(CPUState *cpu, uintptr_t host_pc, bool will_exit)
+- * store_atom_insert_al16:
-     return false;
+- * @p: host address
- }
+- * @val: shifted value to store
+- * @msk: mask for value to store
-+bool cpu_unwind_state_data(CPUState *cpu, uintptr_t host_pc, uint64_t *data)
+- *
-+{
+- * Atomically store @val to @p masked by @msk.
-+    if (in_code_gen_buffer((const void *)(host_pc - tcg_splitwx_diff))) {
+- */
-+        TranslationBlock *tb = tcg_tb_lookup(host_pc);
+-static void ATTRIBUTE_ATOMIC128_OPT
-+        if (tb) {
+-store_atom_insert_al16(Int128 *ps, Int128Alias val, Int128Alias msk)
-+            return cpu_unwind_data_from_tb(tb, host_pc, data) >= 0;
+-{
-+        }
+-#if defined(CONFIG_ATOMIC128)
-+    }
+-    __uint128_t *pu, old, new;
-+    return false;
+-
-+}
+-    /* With CONFIG_ATOMIC128, we can avoid the memory barriers. */
-+
+-    pu = __builtin_assume_aligned(ps, 16);
- void page_init(void)
+-    old = *pu;
- {
+-    do {
-     page_size_init();
+-        new = (old & ~msk.u) | val.u;
 -    } while (!__atomic_compare_exchange_n(pu, &old, new, true,
 -                                          __ATOMIC_RELAXED, __ATOMIC_RELAXED));
 -#elif defined(CONFIG_CMPXCHG128)
 -    __uint128_t *pu, old, new;
 -
 -    /*
 -     * Without CONFIG_ATOMIC128, __atomic_compare_exchange_n will always
 -     * defer to libatomic, so we must use __sync_*_compare_and_swap_16
 -     * and accept the sequential consistency that comes with it.
 -     */
 -    pu = __builtin_assume_aligned(ps, 16);
 -    do {
 -        old = *pu;
 -        new = (old & ~msk.u) | val.u;
 -    } while (!__sync_bool_compare_and_swap_16(pu, old, new));
 -#else
 -    qemu_build_not_reached();
 -#endif
 -}
 -
  /**
   * store_bytes_leN:
   * @pv: host address
 --
 .34.1
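
Reader's note: the masked insert that store_atom_insert_al16 performs can be modelled outside QEMU. The following is a minimal Python sketch, not QEMU code. Cell128, insert_masked and store_insert_model are hypothetical names, and Cell128.cmpxchg only imitates the "return the old value" behaviour that the generic fallback path expects from atomic16_cmpxchg. As in the patch, the value is assumed to be already shifted into position and to lie within the mask.

```python
MASK128 = (1 << 128) - 1


def insert_masked(old, val, msk):
    """Replace the bits of old selected by msk with the bits of val."""
    return ((old & ~msk) | val) & MASK128


class Cell128:
    """Hypothetical stand-in for a 16-byte aligned memory location."""

    def __init__(self, value):
        self.value = value & MASK128

    def cmpxchg(self, expected, new):
        # Store new only if the cell still holds expected.
        # Always return the value that was observed.
        old = self.value
        if old == expected:
            self.value = new & MASK128
        return old


def store_insert_model(cell, val, msk):
    """Retry until the compare-and-swap observes the value we computed from."""
    old = cell.value
    while True:
        cmp = old
        new = insert_masked(old, val, msk)
        old = cell.cmpxchg(cmp, new)
        if old == cmp:
            return
```

In this single-threaded model the loop always succeeds on the first pass; the retry only matters when another writer changes the cell between the read and the compare-and-swap.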

-New patch
+[PULL 14/27] accel/tcg: Add x86_64 load_atom_extract_al16_or_al8
+Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
+Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
+---
+ .../x86_64/host/load-extract-al16-al8.h       | 50 +++++++++++++++++++
+file changed, 50 insertions(+)
+ create mode 100644 host/include/x86_64/host/load-extract-al16-al8.h
+diff --git a/host/include/x86_64/host/load-extract-al16-al8.h b/host/include/x86_64/host/load-extract-al16-al8.h
+new file mode 100644
+index XXXXXXX..XXXXXXX
+--- /dev/null
++++ b/host/include/x86_64/host/load-extract-al16-al8.h
+@@ -XXX,XX +XXX,XX @@
++/*
++ * SPDX-License-Identifier: GPL-2.0-or-later
++ * Atomic extract 64 from 128-bit, x86_64 version.
++ *
++ * Copyright (C) 2023 Linaro, Ltd.
++ */
++
++#ifndef X86_64_LOAD_EXTRACT_AL16_AL8_H
++#define X86_64_LOAD_EXTRACT_AL16_AL8_H
++
++#ifdef CONFIG_INT128_TYPE
++#include "host/cpuinfo.h"
++
++/**
++ * load_atom_extract_al16_or_al8:
++ * @pv: host address
++ * @s: object size in bytes, @s <= 8.
++ *
++ * Load @s bytes from @pv, when pv % s != 0.  If [p, p+s-1] does not
++ * cross a 16-byte boundary then the access must be 16-byte atomic,
++ * otherwise the access must be 8-byte atomic.
++ */
++static inline uint64_t ATTRIBUTE_ATOMIC128_OPT
++load_atom_extract_al16_or_al8(void *pv, int s)
++{
++    uintptr_t pi = (uintptr_t)pv;
++    __int128_t *ptr_align = (__int128_t *)(pi & ~7);
++    int shr = (pi & 7) * 8;
++    Int128Alias r;
++
++    /*
++     * ptr_align % 16 is now only 0 or 8.
++     * If the host supports atomic loads with VMOVDQU, then always use that,
++     * making the branch highly predictable.  Otherwise we must use VMOVDQA
++     * when ptr_align % 16 == 0 for 16-byte atomicity.
++     */
++    if ((cpuinfo & CPUINFO_ATOMIC_VMOVDQU) || (pi & 8)) {
++        asm("vmovdqu %1, %0" : "=x" (r.i) : "m" (*ptr_align));
++    } else {
++        asm("vmovdqa %1, %0" : "=x" (r.i) : "m" (*ptr_align));
++    }
++    return int128_getlo(int128_urshift(r.s, shr));
++}
++#else
++/* Fallback definition that must be optimized away, or error.  */
++uint64_t QEMU_ERROR("unsupported atomic")
++    load_atom_extract_al16_or_al8(void *pv, int s);
++#endif
++
++#endif /* X86_64_LOAD_EXTRACT_AL16_AL8_H */
+--
+.34.1

-New patch
+[PULL 15/27] accel/tcg: Add aarch64 lse2 load_atom_extract_al16_or_al8
+Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
+Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
+---
+ .../aarch64/host/load-extract-al16-al8.h      | 40 +++++++++++++++++++
+file changed, 40 insertions(+)
+ create mode 100644 host/include/aarch64/host/load-extract-al16-al8.h
+diff --git a/host/include/aarch64/host/load-extract-al16-al8.h b/host/include/aarch64/host/load-extract-al16-al8.h
+new file mode 100644
+index XXXXXXX..XXXXXXX
+--- /dev/null
++++ b/host/include/aarch64/host/load-extract-al16-al8.h
+@@ -XXX,XX +XXX,XX @@
++/*
++ * SPDX-License-Identifier: GPL-2.0-or-later
++ * Atomic extract 64 from 128-bit, AArch64 version.
++ *
++ * Copyright (C) 2023 Linaro, Ltd.
++ */
++
++#ifndef AARCH64_LOAD_EXTRACT_AL16_AL8_H
++#define AARCH64_LOAD_EXTRACT_AL16_AL8_H
++
++#include "host/cpuinfo.h"
++#include "tcg/debug-assert.h"
++
++/**
++ * load_atom_extract_al16_or_al8:
++ * @pv: host address
++ * @s: object size in bytes, @s <= 8.
++ *
++ * Load @s bytes from @pv, when pv % s != 0.  If [p, p+s-1] does not
++ * cross a 16-byte boundary then the access must be 16-byte atomic,
++ * otherwise the access must be 8-byte atomic.
++ */
++static inline uint64_t load_atom_extract_al16_or_al8(void *pv, int s)
++{
++    uintptr_t pi = (uintptr_t)pv;
++    __int128_t *ptr_align = (__int128_t *)(pi & ~7);
++    int shr = (pi & 7) * 8;
++    uint64_t l, h;
++
++    /*
++     * With FEAT_LSE2, LDP is single-copy atomic if 16-byte aligned
++     * and single-copy atomic on the parts if 8-byte aligned.
++     * All we need do is align the pointer mod 8.
++     */
++    tcg_debug_assert(HAVE_ATOMIC128_RO);
++    asm("ldp %0, %1, %2" : "=r"(l), "=r"(h) : "m"(*ptr_align));
++    return (l >> shr) | (h << (-shr & 63));
++}
++
++#endif /* AARCH64_LOAD_EXTRACT_AL16_AL8_H */
+--
+.34.1
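
Reader's note: the final expression in the AArch64 helper, (l >> shr) | (h << (-shr & 63)), recombines two 64-bit halves into the unaligned 64-bit value. The sketch below models only that arithmetic in Python; extract_from_pair is a hypothetical name, and the atomicity provided by LDP is not modelled. It assumes a little-endian layout and a byte offset of 1 to 7: with a zero shift the expression would simply OR the two halves together, a case that the documented precondition pv % s != 0 appears to exclude when @s divides 8.

```python
MASK64 = (1 << 64) - 1


def extract_from_pair(lo, hi, byte_offset):
    """Return the 64-bit little-endian value starting byte_offset bytes into lo:hi."""
    if not 1 <= byte_offset <= 7:
        raise ValueError("byte_offset must be in 1..7")
    shr = byte_offset * 8
    # (-shr & 63) equals 64 - shr for shr in 8..56, so the low bytes of hi
    # land just above the surviving bytes of lo.
    return ((lo >> shr) | (hi << (-shr & 63))) & MASK64
```

The explicit MASK64 stands in for the truncation that a uint64_t performs implicitly in C.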

-New patch
+[PULL 16/27] accel/tcg: Add aarch64 store_atom_insert_al16
+Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
+Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
+---
+ host/include/aarch64/host/store-insert-al16.h | 47 +++++++++++++++++++
+file changed, 47 insertions(+)
+ create mode 100644 host/include/aarch64/host/store-insert-al16.h
+diff --git a/host/include/aarch64/host/store-insert-al16.h b/host/include/aarch64/host/store-insert-al16.h
+new file mode 100644
+index XXXXXXX..XXXXXXX
+--- /dev/null
++++ b/host/include/aarch64/host/store-insert-al16.h
+@@ -XXX,XX +XXX,XX @@
++/*
++ * SPDX-License-Identifier: GPL-2.0-or-later
++ * Atomic store insert into 128-bit, AArch64 version.
++ *
++ * Copyright (C) 2023 Linaro, Ltd.
++ */
++
++#ifndef AARCH64_STORE_INSERT_AL16_H
++#define AARCH64_STORE_INSERT_AL16_H
++
++/**
++ * store_atom_insert_al16:
++ * @p: host address
++ * @val: shifted value to store
++ * @msk: mask for value to store
++ *
++ * Atomically store @val to @p masked by @msk.
++ */
++static inline void ATTRIBUTE_ATOMIC128_OPT
++store_atom_insert_al16(Int128 *ps, Int128 val, Int128 msk)
++{
++    /*
++     * GCC only implements __sync* primitives for int128 on aarch64.
++     * We can do better without the barriers, and integrating the
++     * arithmetic into the load-exclusive/store-conditional pair.
++     */
++    uint64_t tl, th, vl, vh, ml, mh;
++    uint32_t fail;
++
++    qemu_build_assert(!HOST_BIG_ENDIAN);
++    vl = int128_getlo(val);
++    vh = int128_gethi(val);
++    ml = int128_getlo(msk);
++    mh = int128_gethi(msk);
++
++    asm("0: ldxp %[l], %[h], %[mem]\n\t"
++        "bic %[l], %[l], %[ml]\n\t"
++        "bic %[h], %[h], %[mh]\n\t"
++        "orr %[l], %[l], %[vl]\n\t"
++        "orr %[h], %[h], %[vh]\n\t"
++        "stxp %w[f], %[l], %[h], %[mem]\n\t"
++        "cbnz %w[f], 0b\n"
++        : [mem] "+Q"(*ps), [f] "=&r"(fail), [l] "=&r"(tl), [h] "=&r"(th)
++        : [vl] "r"(vl), [vh] "r"(vh), [ml] "r"(ml), [mh] "r"(mh));
++}
++
++#endif /* AARCH64_STORE_INSERT_AL16_H */
+--
+.34.1

-[PULL 07/11] target/openrisc: Always exit after mtspr npc
+[PULL 17/27] tcg: Remove TCG_TARGET_TLB_DISPLACEMENT_BITS
-We have called cpu_restore_state asserting will_exit.
+The last use was removed by e77c89fb086a.
 Do not go back on that promise.  This affects icount.
+Fixes: e77c89fb086a ("cputlb: Remove static tlb sizing")
 Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
 Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
 ---
- target/openrisc/sys_helper.c | 2 +-
+ tcg/aarch64/tcg-target.h | 1 -
-file changed, 1 insertion(+), 1 deletion(-)
+ tcg/arm/tcg-target.h     | 1 -
  tcg/i386/tcg-target.h    | 1 -
  tcg/mips/tcg-target.h    | 1 -
  tcg/ppc/tcg-target.h     | 1 -
  tcg/riscv/tcg-target.h   | 1 -
  tcg/s390x/tcg-target.h   | 1 -
  tcg/sparc64/tcg-target.h | 1 -
  tcg/tci/tcg-target.h     | 1 -
 files changed, 9 deletions(-)
-diff --git a/target/openrisc/sys_helper.c b/target/openrisc/sys_helper.c
+diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h
 index XXXXXXX..XXXXXXX 100644
---- a/target/openrisc/sys_helper.c
+--- a/tcg/aarch64/tcg-target.h
-+++ b/target/openrisc/sys_helper.c
++++ b/tcg/aarch64/tcg-target.h
-@@ -XXX,XX +XXX,XX @@ void HELPER(mtspr)(CPUOpenRISCState *env, target_ulong spr, target_ulong rb)
+@@ -XXX,XX +XXX,XX @@
-         if (env->pc != rb) {
+ #include "host/cpuinfo.h"
-             env->pc = rb;
-             env->dflag = 0;
+ #define TCG_TARGET_INSN_UNIT_SIZE  4
--            cpu_loop_exit(cs);
+-#define TCG_TARGET_TLB_DISPLACEMENT_BITS 24
-         }
+ #define MAX_CODE_GEN_BUFFER_SIZE  ((size_t)-1)
-+        cpu_loop_exit(cs);
-         break;
+ typedef enum {
+diff --git a/tcg/arm/tcg-target.h b/tcg/arm/tcg-target.h
-     case TO_SPR(0, 17): /* SR */
+index XXXXXXX..XXXXXXX 100644
 --- a/tcg/arm/tcg-target.h
 +++ b/tcg/arm/tcg-target.h
@@ -XXX,XX +XXX,XX @@ extern int arm_arch;
  #define use_armv7_instructions  (__ARM_ARCH >= 7 || arm_arch >= 7)
  #define TCG_TARGET_INSN_UNIT_SIZE 4
 -#define TCG_TARGET_TLB_DISPLACEMENT_BITS 16
  #define MAX_CODE_GEN_BUFFER_SIZE  UINT32_MAX
  typedef enum {
 diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
 index XXXXXXX..XXXXXXX 100644
 --- a/tcg/i386/tcg-target.h
 +++ b/tcg/i386/tcg-target.h
@@ -XXX,XX +XXX,XX @@
  #include "host/cpuinfo.h"
  #define TCG_TARGET_INSN_UNIT_SIZE  1
 -#define TCG_TARGET_TLB_DISPLACEMENT_BITS 31
  #ifdef __x86_64__
  # define TCG_TARGET_REG_BITS  64
 diff --git a/tcg/mips/tcg-target.h b/tcg/mips/tcg-target.h
 index XXXXXXX..XXXXXXX 100644
 --- a/tcg/mips/tcg-target.h
 +++ b/tcg/mips/tcg-target.h
@@ -XXX,XX +XXX,XX @@
  #endif
  #define TCG_TARGET_INSN_UNIT_SIZE 4
 -#define TCG_TARGET_TLB_DISPLACEMENT_BITS 16
  #define TCG_TARGET_NB_REGS 32
  #define MAX_CODE_GEN_BUFFER_SIZE  ((size_t)-1)
 diff --git a/tcg/ppc/tcg-target.h b/tcg/ppc/tcg-target.h
 index XXXXXXX..XXXXXXX 100644
 --- a/tcg/ppc/tcg-target.h
 +++ b/tcg/ppc/tcg-target.h
@@ -XXX,XX +XXX,XX @@
  #define TCG_TARGET_NB_REGS 64
  #define TCG_TARGET_INSN_UNIT_SIZE 4
 -#define TCG_TARGET_TLB_DISPLACEMENT_BITS 16
  typedef enum {
      TCG_REG_R0,  TCG_REG_R1,  TCG_REG_R2,  TCG_REG_R3,
 diff --git a/tcg/riscv/tcg-target.h b/tcg/riscv/tcg-target.h
 index XXXXXXX..XXXXXXX 100644
 --- a/tcg/riscv/tcg-target.h
 +++ b/tcg/riscv/tcg-target.h
@@ -XXX,XX +XXX,XX @@
  #define TCG_TARGET_REG_BITS 64
  #define TCG_TARGET_INSN_UNIT_SIZE 4
 -#define TCG_TARGET_TLB_DISPLACEMENT_BITS 20
  #define TCG_TARGET_NB_REGS 32
  #define MAX_CODE_GEN_BUFFER_SIZE  ((size_t)-1)
 diff --git a/tcg/s390x/tcg-target.h b/tcg/s390x/tcg-target.h
 index XXXXXXX..XXXXXXX 100644
 --- a/tcg/s390x/tcg-target.h
 +++ b/tcg/s390x/tcg-target.h
@@ -XXX,XX +XXX,XX @@
  #define S390_TCG_TARGET_H
  #define TCG_TARGET_INSN_UNIT_SIZE 2
 -#define TCG_TARGET_TLB_DISPLACEMENT_BITS 19
  /* We have a +- 4GB range on the branches; leave some slop.  */
  #define MAX_CODE_GEN_BUFFER_SIZE  (3 * GiB)
 diff --git a/tcg/sparc64/tcg-target.h b/tcg/sparc64/tcg-target.h
 index XXXXXXX..XXXXXXX 100644
 --- a/tcg/sparc64/tcg-target.h
 +++ b/tcg/sparc64/tcg-target.h
@@ -XXX,XX +XXX,XX @@
  #define SPARC_TCG_TARGET_H
  #define TCG_TARGET_INSN_UNIT_SIZE 4
 -#define TCG_TARGET_TLB_DISPLACEMENT_BITS 32
  #define TCG_TARGET_NB_REGS 32
  #define MAX_CODE_GEN_BUFFER_SIZE  (2 * GiB)
 diff --git a/tcg/tci/tcg-target.h b/tcg/tci/tcg-target.h
 index XXXXXXX..XXXXXXX 100644
 --- a/tcg/tci/tcg-target.h
 +++ b/tcg/tci/tcg-target.h
@@ -XXX,XX +XXX,XX @@
  #define TCG_TARGET_INTERPRETER 1
  #define TCG_TARGET_INSN_UNIT_SIZE 4
 -#define TCG_TARGET_TLB_DISPLACEMENT_BITS 32
  #define MAX_CODE_GEN_BUFFER_SIZE  ((size_t)-1)
  #if UINTPTR_MAX == UINT32_MAX
 --
 .34.1

-New patch
+[PULL 18/27] decodetree: Add --test-for-error
+Invert the exit code, for use with the testsuite.
+Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
+---
+ scripts/decodetree.py | 9 +++++++--
+file changed, 7 insertions(+), 2 deletions(-)
+diff --git a/scripts/decodetree.py b/scripts/decodetree.py
+index XXXXXXX..XXXXXXX 100644
+--- a/scripts/decodetree.py
++++ b/scripts/decodetree.py
+@@ -XXX,XX +XXX,XX @@
+ formats = {}
+ allpatterns = []
+ anyextern = False
++testforerror = False
+ translate_prefix = 'trans'
+ translate_scope = 'static '
+@@ -XXX,XX +XXX,XX @@ def error_with_file(file, lineno, *args):
+     if output_file and output_fd:
+         output_fd.close()
+         os.remove(output_file)
+-    exit(1)
++    exit(0 if testforerror else 1)
+ # end error_with_file
+@@ -XXX,XX +XXX,XX @@ def main():
+     global bitop_width
+     global variablewidth
+     global anyextern
++    global testforerror
+     decode_scope = 'static '
+     long_opts = ['decode=', 'translate=', 'output=', 'insnwidth=',
+-                 'static-decode=', 'varinsnwidth=']
++                 'static-decode=', 'varinsnwidth=', 'test-for-error']
+     try:
+         (opts, args) = getopt.gnu_getopt(sys.argv[1:], 'o:vw:', long_opts)
+     except getopt.GetoptError as err:
+@@ -XXX,XX +XXX,XX @@ def main():
+                 bitop_width = 64
+             elif insnwidth != 32:
+                 error(0, 'cannot handle insns of width', insnwidth)
++        elif o == '--test-for-error':
++            testforerror = True
+         else:
+             assert False, 'unhandled option'
+@@ -XXX,XX +XXX,XX @@ def main():
+     if output_file:
+         output_fd.close()
++    exit(1 if testforerror else 0)
+ # end main
+--
+.34.1
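
Reader's note: the effect of --test-for-error on the exit status reduces to a small truth table. The sketch below is illustrative only; exit_status is a hypothetical helper and is not part of decodetree.py, which calls exit() directly at the two places shown in the diff.

```python
def exit_status(had_error, test_for_error):
    """Return the process exit code, inverted when an error is the expected outcome."""
    if had_error:
        # Mirrors: exit(0 if testforerror else 1)
        return 0 if test_for_error else 1
    # Mirrors: exit(1 if testforerror else 0)
    return 1 if test_for_error else 0
```

With the flag set, an input that decodes cleanly yields a nonzero status, so a test harness can treat "failed to fail" as an ordinary test failure.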

-New patch
+[PULL 19/27] decodetree: Fix recursion in prop_format and build_tree
+Two copy-paste errors walking the parse tree.
+Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
+---
+ scripts/decodetree.py | 4 ++--
+file changed, 2 insertions(+), 2 deletions(-)
+diff --git a/scripts/decodetree.py b/scripts/decodetree.py
+index XXXXXXX..XXXXXXX 100644
+--- a/scripts/decodetree.py
++++ b/scripts/decodetree.py
+@@ -XXX,XX +XXX,XX @@ def build_tree(self):
+     def prop_format(self):
+         for p in self.pats:
+-            p.build_tree()
++            p.prop_format()
+     def prop_width(self):
+         width = None
+@@ -XXX,XX +XXX,XX @@ def __build_tree(pats, outerbits, outermask):
+         return t
+     def build_tree(self):
+-        super().prop_format()
++        super().build_tree()
+         self.tree = self.__build_tree(self.pats, self.fixedbits,
+                                       self.fixedmask)
+--
+.34.1

-New patch
+[PULL 20/27] decodetree: Diagnose empty pattern group
+Test err_pattern_group_empty.decode failed with exception:
+Traceback (most recent call last):
+  File "./scripts/decodetree.py", line 1424, in <module> main()
+  File "./scripts/decodetree.py", line 1342, in main toppat.build_tree()
+  File "./scripts/decodetree.py", line 627, in build_tree
+    self.tree = self.__build_tree(self.pats, self.fixedbits,
+  File "./scripts/decodetree.py", line 607, in __build_tree
+    fb = i.fixedbits & innermask
+TypeError: unsupported operand type(s) for &: 'NoneType' and 'int'
+Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
+---
+ scripts/decodetree.py | 6 ++++++
+file changed, 6 insertions(+)
+diff --git a/scripts/decodetree.py b/scripts/decodetree.py
+index XXXXXXX..XXXXXXX 100644
+--- a/scripts/decodetree.py
++++ b/scripts/decodetree.py
+@@ -XXX,XX +XXX,XX @@ def output_code(self, i, extracted, outerbits, outermask):
+                 output(ind, '}\n')
+             else:
+                 p.output_code(i, extracted, p.fixedbits, p.fixedmask)
++
++    def build_tree(self):
++        if not self.pats:
++            error_with_file(self.file, self.lineno, 'empty pattern group')
++        super().build_tree()
++
+ #end IncMultiPattern
+--
+.34.1

-[PULL 08/11] target/openrisc: Use cpu_unwind_state_data for mfspr
+[PULL 21/27] decodetree: Do not remove output_file from /dev
-Since we do not plan to exit, use cpu_unwind_state_data
+Nor report any PermissionError on remove.
-and extract exactly the data requested.
+The primary purpose is testing with -o /dev/null.
 This is a bug fix, in that we no longer clobber dflag.
 Consider:
         l.j       L2         // branch
         l.mfspr   r1, ppc    // delay
 L1:     boom
 L2:     l.lwa     r3, (r4)
 Here, dflag would be set by cpu_restore_state (because that is the current
 state of the cpu), but not cleared by tb_stop on exiting the TB
 (because DisasContext has recorded the current value as zero).
 The next TB begins at L2 with dflag incorrectly set.  If the load has a
 tlb miss, then the exception will be delivered as per a delay slot:
 with DSX set in the status register and PC decremented (delay slots
 restart by re-executing the branch). This will cause the return from
 interrupt to go to L1, and boom!
 Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
 ---
- target/openrisc/sys_helper.c | 11 +++++++++--
+ scripts/decodetree.py | 7 ++++++-
-file changed, 9 insertions(+), 2 deletions(-)
+file changed, 6 insertions(+), 1 deletion(-)
-diff --git a/target/openrisc/sys_helper.c b/target/openrisc/sys_helper.c
+diff --git a/scripts/decodetree.py b/scripts/decodetree.py
 index XXXXXXX..XXXXXXX 100644
---- a/target/openrisc/sys_helper.c
+--- a/scripts/decodetree.py
-+++ b/target/openrisc/sys_helper.c
++++ b/scripts/decodetree.py
-@@ -XXX,XX +XXX,XX @@ target_ulong HELPER(mfspr)(CPUOpenRISCState *env, target_ulong rd,
+@@ -XXX,XX +XXX,XX @@ def error_with_file(file, lineno, *args):
-                            target_ulong spr)
- {
+     if output_file and output_fd:
- #ifndef CONFIG_USER_ONLY
+         output_fd.close()
-+    uint64_t data[TARGET_INSN_START_WORDS];
+-        os.remove(output_file)
-     MachineState *ms = MACHINE(qdev_get_machine());
++        # Do not try to remove e.g. -o /dev/null
-     OpenRISCCPU *cpu = env_archcpu(env);
++        if not output_file.startswith("/dev"):
-     CPUState *cs = env_cpu(env);
++            try:
-@@ -XXX,XX +XXX,XX @@ target_ulong HELPER(mfspr)(CPUOpenRISCState *env, target_ulong rd,
++                os.remove(output_file)
-         return env->evbar;
++            except PermissionError:
++                pass
-     case TO_SPR(0, 16): /* NPC (equals PC) */
+     exit(0 if testforerror else 1)
--        cpu_restore_state(cs, GETPC(), false);
+ # end error_with_file
-+        if (cpu_unwind_state_data(cs, GETPC(), data)) {
 +            return data[0];
 +        }
          return env->pc;
      case TO_SPR(0, 17): /* SR */
          return cpu_get_sr(env);
      case TO_SPR(0, 18): /* PPC */
 -        cpu_restore_state(cs, GETPC(), false);
 +        if (cpu_unwind_state_data(cs, GETPC(), data)) {
 +            if (data[1] & 2) {
 +                return data[0] - 4;
 +            }
 +        }
          return env->ppc;
      case TO_SPR(0, 32): /* EPCR */
 --
 .34.1
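
Reader's note: the cleanup guard added by "decodetree: Do not remove output_file from /dev" can be tried in isolation. This is a minimal sketch under the assumption of a POSIX host; remove_output and its boolean return value are hypothetical and exist only to make the behaviour observable, whereas the patch performs the same checks inline in error_with_file.

```python
import os


def remove_output(path):
    """Remove a partial output file, skipping /dev paths and permission errors."""
    # Do not try to remove e.g. -o /dev/null
    if path.startswith("/dev"):
        return False
    try:
        os.remove(path)
    except PermissionError:
        return False
    return True
```

Other failures, such as a missing file, still propagate in this sketch, which matches the patch in catching PermissionError only.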

-New patch
+[PULL 22/27] tests/decode: Convert tests to meson
+Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
+---
+ tests/decode/check.sh    | 24 ----------------
+ tests/decode/meson.build | 59 ++++++++++++++++++++++++++++++++++++++++
+ tests/meson.build        |  5 +---
+files changed, 60 insertions(+), 28 deletions(-)
+ delete mode 100755 tests/decode/check.sh
+ create mode 100644 tests/decode/meson.build
+diff --git a/tests/decode/check.sh b/tests/decode/check.sh
+deleted file mode 100755
+index XXXXXXX..XXXXXXX
+--- a/tests/decode/check.sh
++++ /dev/null
+@@ -XXX,XX +XXX,XX @@
+-#!/bin/sh
+-# This work is licensed under the terms of the GNU LGPL, version 2 or later.
+-# See the COPYING.LIB file in the top-level directory.
+-
+-PYTHON=$1
+-DECODETREE=$2
+-E=0
+-
+-# All of these tests should produce errors
+-for i in err_*.decode; do
+-    if $PYTHON $DECODETREE $i > /dev/null 2> /dev/null; then
+-        # Pass, aka failed to fail.
+-        echo FAIL: $i 1>&2
+-        E=1
+-    fi
+-done
+-
+-for i in succ_*.decode; do
+-    if ! $PYTHON $DECODETREE $i > /dev/null 2> /dev/null; then
+-        echo FAIL:$i 1>&2
+-    fi
+-done
+-
+-exit $E
+diff --git a/tests/decode/meson.build b/tests/decode/meson.build
+new file mode 100644
+index XXXXXXX..XXXXXXX
+--- /dev/null
++++ b/tests/decode/meson.build
+@@ -XXX,XX +XXX,XX @@
++err_tests = [
++    'err_argset1.decode',
++    'err_argset2.decode',
++    'err_field1.decode',
++    'err_field2.decode',
++    'err_field3.decode',
++    'err_field4.decode',
++    'err_field5.decode',
++    'err_field6.decode',
++    'err_init1.decode',
++    'err_init2.decode',
++    'err_init3.decode',
++    'err_init4.decode',
++    'err_overlap1.decode',
++    'err_overlap2.decode',
++    'err_overlap3.decode',
++    'err_overlap4.decode',
++    'err_overlap5.decode',
++    'err_overlap6.decode',
++    'err_overlap7.decode',
++    'err_overlap8.decode',
++    'err_overlap9.decode',
++    'err_pattern_group_empty.decode',
++    'err_pattern_group_ident1.decode',
++    'err_pattern_group_ident2.decode',
++    'err_pattern_group_nest1.decode',
++    'err_pattern_group_nest2.decode',
++    'err_pattern_group_nest3.decode',
++    'err_pattern_group_overlap1.decode',
++    'err_width1.decode',
++    'err_width2.decode',
++    'err_width3.decode',
++    'err_width4.decode',
++]
++
++succ_tests = [
++    'succ_argset_type1.decode',
++    'succ_function.decode',
++    'succ_ident1.decode',
++    'succ_pattern_group_nest1.decode',
++    'succ_pattern_group_nest2.decode',
++    'succ_pattern_group_nest3.decode',
++    'succ_pattern_group_nest4.decode',
++]
++
++suite = 'decodetree'
++decodetree = find_program(meson.project_source_root() / 'scripts/decodetree.py')
++
++foreach t: err_tests
++    test(fs.replace_suffix(t, ''),
++         decodetree, args: ['-o', '/dev/null', '--test-for-error', files(t)],
++         suite: suite)
++endforeach
++
++foreach t: succ_tests
++    test(fs.replace_suffix(t, ''),
++         decodetree, args: ['-o', '/dev/null', files(t)],
++         suite: suite)
++endforeach
+diff --git a/tests/meson.build b/tests/meson.build
+index XXXXXXX..XXXXXXX 100644
+--- a/tests/meson.build
++++ b/tests/meson.build
+@@ -XXX,XX +XXX,XX @@ if have_tools and have_vhost_user and 'CONFIG_LINUX' in config_host
+              dependencies: [qemuutil, vhost_user])
+ endif
+-test('decodetree', sh,
+-     args: [ files('decode/check.sh'), config_host['PYTHON'], files('../scripts/decodetree.py') ],
+-     workdir: meson.current_source_dir() / 'decode',
+-     suite: 'decodetree')
++subdir('decode')
+ if 'CONFIG_TCG' in config_all
+   subdir('fp')
+--
+2.34.1

-New patch
+[PULL 23/27] docs: Document decodetree named field syntax
+From: Peter Maydell <peter.maydell@linaro.org>
+Document the named field syntax that we want to implement for the
+decodetree script.  This allows a field to be defined in terms of
+some other field that the instruction pattern has already set, for
+example:
+   %sz_imm 10:3 sz:3 !function=expand_sz_imm
+to allow a function to be passed both an immediate field from the
+instruction and also a sz value which might have been specified by
+the instruction pattern directly (sz=1, etc) rather than being a
+simple field within the instruction.
+Note that the restriction on not having the format referring to the
+pattern and the pattern referring to the format simultaneously is a
+restriction of the decoder generator rather than inherently being a
+silly thing to do.
+Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
+Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
+Message-Id: <20230523120447.728365-3-peter.maydell@linaro.org>
+---
+ docs/devel/decodetree.rst | 33 ++++++++++++++++++++++++++++-----
+ 1 file changed, 28 insertions(+), 5 deletions(-)
+diff --git a/docs/devel/decodetree.rst b/docs/devel/decodetree.rst
+index XXXXXXX..XXXXXXX 100644
+--- a/docs/devel/decodetree.rst
++++ b/docs/devel/decodetree.rst
+@@ -XXX,XX +XXX,XX @@ Fields
+ Syntax::
+-  field_def     := '%' identifier ( unnamed_field )* ( !function=identifier )?
++  field_def     := '%' identifier ( field )* ( !function=identifier )?
++  field         := unnamed_field | named_field
+   unnamed_field := number ':' ( 's' ) number
++  named_field   := identifier ':' ( 's' ) number
+ For *unnamed_field*, the first number is the least-significant bit position
+ of the field and the second number is the length of the field.  If the 's' is
+-present, the field is considered signed.  If multiple ``unnamed_fields`` are
+-present, they are concatenated.  In this way one can define disjoint fields.
++present, the field is considered signed.
++
++A *named_field* refers to some other field in the instruction pattern
++or format. Regardless of the length of the other field where it is
++defined, it will be inserted into this field with the specified
++signedness and bit width.
++
++Field definitions that involve loops (i.e. where a field is defined
++directly or indirectly in terms of itself) are errors.
++
++A format can include fields that refer to named fields that are
++defined in the instruction pattern(s) that use the format.
++Conversely, an instruction pattern can include fields that refer to
++named fields that are defined in the format it uses. However you
++cannot currently do both at once (i.e. pattern P uses format F; F has
++a field A that refers to a named field B that is defined in P, and P
++has a field C that refers to a named field D that is defined in F).
++
++If multiple ``fields`` are present, they are concatenated.
++In this way one can define disjoint fields.
+ If ``!function`` is specified, the concatenated result is passed through the
+ named function, taking and returning an integral value.
+-One may use ``!function`` with zero ``unnamed_fields``.  This case is called
++One may use ``!function`` with zero ``fields``.  This case is called
+ a *parameter*, and the named function is only passed the ``DisasContext``
+ and returns an integral value extracted from there.
+-A field with no ``unnamed_fields`` and no ``!function`` is in error.
++A field with no ``fields`` and no ``!function`` is in error.
+ Field examples:
+@@ -XXX,XX +XXX,XX @@ Field examples:
+ | %shimm8 5:s8 13:1         | expand_shimm8(sextract(i, 5, 8) << 1 |      |
+ |   !function=expand_shimm8 |               extract(i, 13, 1))            |
+ +---------------------------+---------------------------------------------+
++| %sz_imm 10:2 sz:3         | expand_sz_imm(extract(i, 10, 2) << 3 |      |
++|   !function=expand_sz_imm |               extract(a->sz, 0, 3))         |
+++---------------------------+---------------------------------------------+
+ Argument Sets
+ =============
+--
+2.34.1
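As a rough illustration of the documented expansion for `%sz_imm 10:2 sz:3`, the sketch below models the concatenation in Python. The helper names `extract`, `sextract` and `sz_imm_raw` are hypothetical stand-ins for the C helpers the generated decoder uses, and the `!function` step is left out.

```python
def extract(value, pos, length):
    """Unsigned bitfield extract: `length` bits starting at bit `pos`."""
    return (value >> pos) & ((1 << length) - 1)


def sextract(value, pos, length):
    """Signed bitfield extract: sign-extend the extracted bits."""
    field = extract(value, pos, length)
    sign_bit = 1 << (length - 1)
    return (field ^ sign_bit) - sign_bit


def sz_imm_raw(insn, sz):
    """Model of '%sz_imm 10:2 sz:3' before !function is applied.

    The unnamed field 10:2 comes from the instruction word; the named
    field sz:3 comes from an already-set field value, truncated to 3 bits
    regardless of how wide 'sz' was where it was defined.
    """
    return (extract(insn, 10, 2) << 3) | extract(sz, 0, 3)


insn = 0b10 << 10                      # bits [11:10] hold 0b10
print(bin(sz_imm_raw(insn, sz=1)))     # concatenation of 0b10 and 0b001
```

The named field is always re-extracted with the width and signedness given at the point of use, which is why `sz` is masked to three bits here.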

-New patch
+[PULL 24/27] scripts/decodetree: Pass lvalue-formatter function to str_extract()
+From: Peter Maydell <peter.maydell@linaro.org>
+To support referring to other named fields in field definitions, we
+need to pass the str_extract() method a function which tells it how
+to emit the code for a previously initialized named field.  (In
+Pattern::output_code() the other field will be "u.f_foo.field", and
+in Format::output_extract() it is "a->field".)
+Refactor the two callsites that currently do "output code to
+initialize each field", and have them pass a lambda that defines how
+to format the lvalue in each case.  This is then used both in
+emitting the LHS of the assignment and also passed down to
+str_extract() as a new argument (unused at the moment, but will be
+used in the following patch).
+Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
+Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
+Message-Id: <20230523120447.728365-4-peter.maydell@linaro.org>
+---
+ scripts/decodetree.py | 26 +++++++++++++++-----------
+ 1 file changed, 15 insertions(+), 11 deletions(-)
+diff --git a/scripts/decodetree.py b/scripts/decodetree.py
+index XXXXXXX..XXXXXXX 100644
+--- a/scripts/decodetree.py
++++ b/scripts/decodetree.py
+@@ -XXX,XX +XXX,XX @@ def __str__(self):
+             s = ''
+         return str(self.pos) + ':' + s + str(self.len)
+-    def str_extract(self):
++    def str_extract(self, lvalue_formatter):
+         global bitop_width
+         s = 's' if self.sign else ''
+         return f'{s}extract{bitop_width}(insn, {self.pos}, {self.len})'
+@@ -XXX,XX +XXX,XX @@ def __init__(self, subs, mask):
+     def __str__(self):
+         return str(self.subs)
+-    def str_extract(self):
++    def str_extract(self, lvalue_formatter):
+         global bitop_width
+         ret = '0'
+         pos = 0
+         for f in reversed(self.subs):
+-            ext = f.str_extract()
++            ext = f.str_extract(lvalue_formatter)
+             if pos == 0:
+                 ret = ext
+             else:
+@@ -XXX,XX +XXX,XX @@ def __init__(self, value):
+     def __str__(self):
+         return str(self.value)
+-    def str_extract(self):
++    def str_extract(self, lvalue_formatter):
+         return str(self.value)
+     def __cmp__(self, other):
+@@ -XXX,XX +XXX,XX @@ def __init__(self, func, base):
+     def __str__(self):
+         return self.func + '(' + str(self.base) + ')'
+-    def str_extract(self):
+-        return self.func + '(ctx, ' + self.base.str_extract() + ')'
++    def str_extract(self, lvalue_formatter):
++        return (self.func + '(ctx, '
++                + self.base.str_extract(lvalue_formatter) + ')')
+     def __eq__(self, other):
+         return self.func == other.func and self.base == other.base
+@@ -XXX,XX +XXX,XX @@ def __init__(self, func):
+     def __str__(self):
+         return self.func
+-    def str_extract(self):
++    def str_extract(self, lvalue_formatter):
+         return self.func + '(ctx)'
+     def __eq__(self, other):
+@@ -XXX,XX +XXX,XX @@ def __str__(self):
+     def str1(self, i):
+         return str_indent(i) + self.__str__()
++
++    def output_fields(self, indent, lvalue_formatter):
++        for n, f in self.fields.items():
++            output(indent, lvalue_formatter(n), ' = ',
++                   f.str_extract(lvalue_formatter), ';\n')
+ # end General
+@@ -XXX,XX +XXX,XX @@ def extract_name(self):
+     def output_extract(self):
+         output('static void ', self.extract_name(), '(DisasContext *ctx, ',
+                self.base.struct_name(), ' *a, ', insntype, ' insn)\n{\n')
+-        for n, f in self.fields.items():
+-            output('    a->', n, ' = ', f.str_extract(), ';\n')
++        self.output_fields(str_indent(4), lambda n: 'a->' + n)
+         output('}\n\n')
+ # end Format
+@@ -XXX,XX +XXX,XX @@ def output_code(self, i, extracted, outerbits, outermask):
+         if not extracted:
+             output(ind, self.base.extract_name(),
+                    '(ctx, &u.f_', arg, ', insn);\n')
+-        for n, f in self.fields.items():
+-            output(ind, 'u.f_', arg, '.', n, ' = ', f.str_extract(), ';\n')
++        self.output_fields(ind, lambda n: 'u.f_' + arg + '.' + n)
+         output(ind, 'if (', translate_prefix, '_', self.name,
+                '(ctx, &u.f_', arg, ')) return true;\n')
+--
+2.34.1
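The refactoring in this patch can be pictured with a small standalone sketch. It is not the decodetree code itself: `UnnamedField` and `emit_fields` are simplified, hypothetical versions that return strings instead of calling output(), and the bit-operation width is fixed at 32 as an assumption.

```python
class UnnamedField:
    """Simplified stand-in for a plain bitfield taken from the insn word."""

    def __init__(self, pos, length):
        self.pos = pos
        self.len = length

    def str_extract(self, lvalue_formatter):
        # A plain bitfield does not need the formatter; it is accepted
        # only so that every field class shares one signature.
        return f'extract32(insn, {self.pos}, {self.len})'


def emit_fields(fields, indent, lvalue_formatter):
    """Return one C assignment per field, using the caller's lvalue style."""
    return [
        f'{indent}{lvalue_formatter(name)} = '
        f'{field.str_extract(lvalue_formatter)};'
        for name, field in fields.items()
    ]


fields = {'rd': UnnamedField(0, 5)}

# Format extraction writes through the argument-set pointer ...
fmt_lines = emit_fields(fields, '    ', lambda n: 'a->' + n)
# ... while pattern code writes into the union member for the argument set.
pat_lines = emit_fields(fields, '    ', lambda n: 'u.f_foo.' + n)

print('\n'.join(fmt_lines + pat_lines))
```

Passing the formatter as a callable keeps the two call sites identical apart from the lambda, which is what lets a later patch reuse it for reading a named field.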

-New patch
+[PULL 25/27] scripts/decodetree: Implement a topological sort
+From: Peter Maydell <peter.maydell@linaro.org>
+To support named fields, we will need to be able to do a topological
+sort (so that we ensure that we output the assignment to field A
+before the assignment to field B if field B refers to field A by
+name). The good news is that there is a tsort in the python standard
+library; the bad news is that it was only added in Python 3.9.
+To bridge the gap between our current minimum supported Python
+version and 3.9, provide a local implementation that has the
+same API as the stdlib version for the parts we care about.
+In future when QEMU's minimum Python version requirement reaches
+3.9 we can delete this code and replace it with an 'import' line.
+The core of this implementation is based on
+https://code.activestate.com/recipes/578272-topological-sort/
+which is MIT-licensed.
+Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
+Acked-by: Richard Henderson <richard.henderson@linaro.org>
+Message-Id: <20230523120447.728365-5-peter.maydell@linaro.org>
+---
+ scripts/decodetree.py | 74 +++++++++++++++++++++++++++++++++++++++++++
+ 1 file changed, 74 insertions(+)
+diff --git a/scripts/decodetree.py b/scripts/decodetree.py
+index XXXXXXX..XXXXXXX 100644
+--- a/scripts/decodetree.py
++++ b/scripts/decodetree.py
+@@ -XXX,XX +XXX,XX @@
+ re_fmt_ident = '@[a-zA-Z0-9_]*'
+ re_pat_ident = '[a-zA-Z0-9_]*'
++# Local implementation of a topological sort. We use the same API that
++# the Python graphlib does, so that when QEMU moves forward to a
++# baseline of Python 3.9 or newer this code can all be dropped and
++# replaced with:
++#    from graphlib import TopologicalSorter, CycleError
++#
++# https://docs.python.org/3.9/library/graphlib.html#graphlib.TopologicalSorter
++#
++# We only implement the parts of TopologicalSorter we care about:
++#  ts = TopologicalSorter(graph=None)
++#    create the sorter. graph is a dictionary whose keys are
++#    nodes and whose values are lists of the predecessors of that node.
++#    (That is, if graph contains "A" -> ["B", "C"] then we must output
++#    B and C before A.)
++#  ts.static_order()
++#    returns a list of all the nodes in sorted order, or raises CycleError
++#  CycleError
++#    exception raised if there are cycles in the graph. The second
++#    element in the args attribute is a list of nodes which form a
++#    cycle; the first and last element are the same, eg [a, b, c, a]
++#    (Our implementation doesn't give the order correctly.)
++#
++# For our purposes we can assume that the data set is always small
++# (typically 10 nodes or less, actual links in the graph very rare),
++# so we don't need to worry about efficiency of implementation.
++#
++# The core of this implementation is from
++# https://code.activestate.com/recipes/578272-topological-sort/
++# (but updated to Python 3), and is under the MIT license.
++
++class CycleError(ValueError):
++    """Subclass of ValueError raised if cycles exist in the graph"""
++    pass
++
++class TopologicalSorter:
++    """Topologically sort a graph"""
++    def __init__(self, graph=None):
++        self.graph = graph
++
++    def static_order(self):
++        # We do the sort right here, unlike the stdlib version
++        from functools import reduce
++        data = {}
++        r = []
++
++        if not self.graph:
++            return []
++
++        # This code wants the values in the dict to be specifically sets
++        for k, v in self.graph.items():
++            data[k] = set(v)
++
++        # Find all items that don't depend on anything.
++        extra_items_in_deps = (reduce(set.union, data.values())
++                               - set(data.keys()))
++        # Add empty dependencies where needed
++        data.update({item:{} for item in extra_items_in_deps})
++        while True:
++            ordered = set(item for item, dep in data.items() if not dep)
++            if not ordered:
++                break
++            r.extend(ordered)
++            data = {item: (dep - ordered)
++                    for item, dep in data.items()
++                        if item not in ordered}
++        if data:
++            # This doesn't give as nice results as the stdlib, which
++            # gives you the cycle by listing the nodes in order. Here
++            # we only know the nodes in the cycle but not their order.
++            raise CycleError(f'nodes are in a cycle', list(data.keys()))
++
++        return r
++# end TopologicalSorter
++
+ def error_with_file(file, lineno, *args):
+     """Print an error message from file:line and args and exit."""
+     global output_file
+--
+2.34.1
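For comparison, this is how the stdlib API that the local class imitates behaves. The sketch assumes Python 3.9 or newer, where graphlib is available; the node names are made up for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Keys are nodes, values are the predecessors that must come first:
# "A" needs "B" and "C"; "B" needs "C".
graph = {'A': ['B', 'C'], 'B': ['C']}
order = list(TopologicalSorter(graph).static_order())
print(order)

# Two fields defined in terms of each other, as in a looping
# field definition, make static_order() raise CycleError.
looping = {'field1': ['field2'], 'field2': ['field1']}
try:
    list(TopologicalSorter(looping).static_order())
    cycle = None
except CycleError as e:
    # args[1] lists the nodes of one cycle; first and last are the same.
    cycle = e.args[1]
print(cycle)
```

Note that the stdlib `static_order()` returns an iterator rather than a list, so callers that only loop over the result work with either implementation.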

-New patch
+[PULL 26/27] scripts/decodetree: Implement named field support
+From: Peter Maydell <peter.maydell@linaro.org>
 Implement support for named fields, i.e.  where one field is defined
 in terms of another, rather than directly in terms of bits extracted
 from the instruction.
 The new method referenced_fields() on all the Field classes returns a
 list of fields that this field references.  This just passes through,
 except for the new NamedField class.
 We can then use referenced_fields() to:
  * construct a list of 'dangling references' for a format or
    pattern, which is the fields that the format/pattern uses but
    doesn't define itself
  * do a topological sort, so that we output "field = value"
    assignments in an order that means that we assign a field before
    we reference it in a subsequent assignment
  * check when we output the code for a pattern whether we need to
    fill in the format fields before or after the pattern fields, and
    do other error checking
 Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
 Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
 Message-Id: <20230523120447.728365-6-peter.maydell@linaro.org>
 ---
  scripts/decodetree.py | 145 ++++++++++++++++++++++++++++++++++++++++--
  1 file changed, 139 insertions(+), 6 deletions(-)
 diff --git a/scripts/decodetree.py b/scripts/decodetree.py
 index XXXXXXX..XXXXXXX 100644
 --- a/scripts/decodetree.py
 +++ b/scripts/decodetree.py
@@ -XXX,XX +XXX,XX @@ def str_extract(self, lvalue_formatter):
          s = 's' if self.sign else ''
          return f'{s}extract{bitop_width}(insn, {self.pos}, {self.len})'
 +    def referenced_fields(self):
 +        return []
 +
      def __eq__(self, other):
          return self.sign == other.sign and self.mask == other.mask
@@ -XXX,XX +XXX,XX @@ def str_extract(self, lvalue_formatter):
              pos += f.len
          return ret
 +    def referenced_fields(self):
 +        l = []
 +        for f in self.subs:
 +            l.extend(f.referenced_fields())
 +        return l
 +
      def __ne__(self, other):
          if len(self.subs) != len(other.subs):
              return True
@@ -XXX,XX +XXX,XX @@ def __str__(self):
      def str_extract(self, lvalue_formatter):
          return str(self.value)
 +    def referenced_fields(self):
 +        return []
 +
      def __cmp__(self, other):
          return self.value - other.value
  # end ConstField
@@ -XXX,XX +XXX,XX @@ def str_extract(self, lvalue_formatter):
          return (self.func + '(ctx, '
                  + self.base.str_extract(lvalue_formatter) + ')')
 +    def referenced_fields(self):
 +        return self.base.referenced_fields()
 +
      def __eq__(self, other):
          return self.func == other.func and self.base == other.base
@@ -XXX,XX +XXX,XX @@ def __str__(self):
      def str_extract(self, lvalue_formatter):
          return self.func + '(ctx)'
 +    def referenced_fields(self):
 +        return []
 +
      def __eq__(self, other):
          return self.func == other.func
@@ -XXX,XX +XXX,XX @@ def __ne__(self, other):
          return not self.__eq__(other)
  # end ParameterField
 +class NamedField:
 +    """Class representing a field already named in the pattern"""
 +    def __init__(self, name, sign, len):
 +        self.mask = 0
 +        self.sign = sign
 +        self.len = len
 +        self.name = name
 +
 +    def __str__(self):
 +        return self.name
 +
 +    def str_extract(self, lvalue_formatter):
 +        global bitop_width
 +        s = 's' if self.sign else ''
 +        lvalue = lvalue_formatter(self.name)
 +        return f'{s}extract{bitop_width}({lvalue}, 0, {self.len})'
 +
 +    def referenced_fields(self):
 +        return [self.name]
 +
 +    def __eq__(self, other):
 +        return self.name == other.name
 +
 +    def __ne__(self, other):
 +        return not self.__eq__(other)
 +# end NamedField
  class Arguments:
      """Class representing the extracted fields of a format"""
@@ -XXX,XX +XXX,XX @@ def output_def(self):
              output('} ', self.struct_name(), ';\n\n')
  # end Arguments
 -
  class General:
      """Common code between instruction formats and instruction patterns"""
      def __init__(self, name, lineno, base, fixb, fixm, udfm, fldm, flds, w):
@@ -XXX,XX +XXX,XX @@ def __init__(self, name, lineno, base, fixb, fixm, udfm, fldm, flds, w):
          self.fieldmask = fldm
          self.fields = flds
          self.width = w
 +        self.dangling = None
      def __str__(self):
          return self.name + ' ' + str_match_bits(self.fixedbits, self.fixedmask)
@@ -XXX,XX +XXX,XX @@ def __str__(self):
      def str1(self, i):
          return str_indent(i) + self.__str__()
 +    def dangling_references(self):
 +        # Return a list of all named references which aren't satisfied
 +        # directly by this format/pattern. This will be either:
 +        #  * a format referring to a field which is specified by the
 +        #    pattern(s) using it
 +        #  * a pattern referring to a field which is specified by the
 +        #    format it uses
 +        #  * a user error (referring to a field that doesn't exist at all)
 +        if self.dangling is None:
 +            # Compute this once and cache the answer
 +            dangling = []
 +            for n, f in self.fields.items():
 +                for r in f.referenced_fields():
 +                    if r not in self.fields:
 +                        dangling.append(r)
 +            self.dangling = dangling
 +        return self.dangling
 +
      def output_fields(self, indent, lvalue_formatter):
 +        # We use a topological sort to ensure that any use of NamedField
 +        # comes after the initialization of the field it is referencing.
 +        graph = {}
          for n, f in self.fields.items():
 -            output(indent, lvalue_formatter(n), ' = ',
 -                   f.str_extract(lvalue_formatter), ';\n')
 +            refs = f.referenced_fields()
 +            graph[n] = refs
 +
 +        try:
 +            ts = TopologicalSorter(graph)
 +            for n in ts.static_order():
 +                # We only want to emit assignments for the keys
 +                # in our fields list, not for anything that ends up
 +                # in the tsort graph only because it was referenced as
 +                # a NamedField.
 +                try:
 +                    f = self.fields[n]
 +                    output(indent, lvalue_formatter(n), ' = ',
 +                           f.str_extract(lvalue_formatter), ';\n')
 +                except KeyError:
 +                    pass
 +        except CycleError as e:
 +            # The second element of args is a list of nodes which form
 +            # a cycle (there might be others too, but only one is reported).
 +            # Pretty-print it to tell the user.
 +            cycle = ' => '.join(e.args[1])
 +            error(self.lineno, 'field definitions form a cycle: ' + cycle)
  # end General
@@ -XXX,XX +XXX,XX @@ def output_code(self, i, extracted, outerbits, outermask):
          ind = str_indent(i)
          arg = self.base.base.name
          output(ind, '/* ', self.file, ':', str(self.lineno), ' */\n')
 +        # We might have named references in the format that refer to fields
 +        # in the pattern, or named references in the pattern that refer
 +        # to fields in the format. This affects whether we extract the fields
 +        # for the format before or after the ones for the pattern.
 +        # For simplicity we don't allow cross references in both directions.
 +        # This is also where we catch the syntax error of referring to
 +        # a nonexistent field.
 +        fmt_refs = self.base.dangling_references()
 +        for r in fmt_refs:
 +            if r not in self.fields:
 +                error(self.lineno, f'format refers to undefined field {r}')
 +        pat_refs = self.dangling_references()
 +        for r in pat_refs:
 +            if r not in self.base.fields:
 +                error(self.lineno, f'pattern refers to undefined field {r}')
 +        if pat_refs and fmt_refs:
 +            error(self.lineno, ('pattern that uses fields defined in format '
 +                                'cannot use format that uses fields defined '
 +                                'in pattern'))
 +        if fmt_refs:
 +            # pattern fields first
 +            self.output_fields(ind, lambda n: 'u.f_' + arg + '.' + n)
 +            assert not extracted, "dangling fmt refs but it was already extracted"
          if not extracted:
              output(ind, self.base.extract_name(),
                     '(ctx, &u.f_', arg, ', insn);\n')
 -        self.output_fields(ind, lambda n: 'u.f_' + arg + '.' + n)
 +        if not fmt_refs:
 +            # pattern fields last
 +            self.output_fields(ind, lambda n: 'u.f_' + arg + '.' + n)
 +
          output(ind, 'if (', translate_prefix, '_', self.name,
                 '(ctx, &u.f_', arg, ')) return true;\n')
@@ -XXX,XX +XXX,XX @@ def output_code(self, i, extracted, outerbits, outermask):
          ind = str_indent(i)
          # If we identified all nodes below have the same format,
 -        # extract the fields now.
 -        if not extracted and self.base:
 +        # extract the fields now. But don't do it if the format relies
 +        # on named fields from the insn pattern, as those won't have
 +        # been initialised at this point.
 +        if not extracted and self.base and not self.base.dangling_references():
              output(ind, self.base.extract_name(),
                     '(ctx, &u.f_', self.base.base.name, ', insn);\n')
              extracted = True
@@ -XXX,XX +XXX,XX @@ def parse_field(lineno, name, toks):
      """Parse one instruction field from TOKS at LINENO"""
      global fields
      global insnwidth
 +    global re_C_ident
      # A "simple" field will have only one entry;
      # a "multifield" will have several.
@@ -XXX,XX +XXX,XX @@ def parse_field(lineno, name, toks):
              func = func[1]
              continue
 +        if re.fullmatch(re_C_ident + ':s[0-9]+', t):
 +            # Signed named field
 +            subtoks = t.split(':s')
 +            n = subtoks[0]
 +            le = int(subtoks[1])
 +            f = NamedField(n, True, le)
 +            subs.append(f)
 +            width += le
 +            continue
 +        if re.fullmatch(re_C_ident + ':[0-9]+', t):
 +            # Unsigned named field
 +            subtoks = t.split(':')
 +            n = subtoks[0]
 +            le = int(subtoks[1])
 +            f = NamedField(n, False, le)
 +            subs.append(f)
 +            width += le
 +            continue
 +
          if re.fullmatch('[0-9]+:s[0-9]+', t):
              # Signed field extract
              subtoks = t.split(':s')
 --
 2.34.1
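The ordering rule described in the commit message can be sketched in isolation. This is a simplified model, not the script's code: fields are reduced to a mapping from field name to the list of names it references, and `dangling_references` and `extraction_order` are hypothetical free functions that raise ValueError where the script would call error().

```python
def dangling_references(fields):
    """Names referenced by `fields` that `fields` does not itself define."""
    return [r for refs in fields.values() for r in refs if r not in fields]


def extraction_order(fmt_fields, pat_fields):
    """Decide whether format or pattern fields must be assigned first."""
    fmt_refs = dangling_references(fmt_fields)
    pat_refs = dangling_references(pat_fields)
    for r in fmt_refs:
        if r not in pat_fields:
            raise ValueError(f'format refers to undefined field {r}')
    for r in pat_refs:
        if r not in fmt_fields:
            raise ValueError(f'pattern refers to undefined field {r}')
    if fmt_refs and pat_refs:
        raise ValueError('cross references in both directions')
    # A format that reads pattern fields needs the pattern assigned first.
    return 'pattern-first' if fmt_refs else 'format-first'


# Format sets imm from a field referring to alpha; pattern sets alpha.
case_fmt_uses_pat = extraction_order({'imm': ['alpha']}, {'alpha': []})
# Format sets alpha; pattern sets imm from a field referring to alpha.
case_pat_uses_fmt = extraction_order({'alpha': []}, {'imm': ['alpha']})
print(case_fmt_uses_pat, case_pat_uses_fmt)
```

When neither side has dangling references the model falls back to 'format-first', which matches the pre-existing behaviour of extracting the format and then assigning the pattern fields.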

-New patch
+[PULL 27/27] tests/decode: Add tests for various named-field cases
+From: Peter Maydell <peter.maydell@linaro.org>
+Add some tests for various cases of named-field use, both ones that
+should work and ones that should be diagnosed as errors.
+Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
+Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
+Message-Id: <20230523120447.728365-7-peter.maydell@linaro.org>
+---
+ tests/decode/err_field10.decode      |  7 +++++++
+ tests/decode/err_field7.decode       |  7 +++++++
+ tests/decode/err_field8.decode       |  8 ++++++++
+ tests/decode/err_field9.decode       | 14 ++++++++++++++
+ tests/decode/succ_named_field.decode | 19 +++++++++++++++++++
+ tests/decode/meson.build             |  5 +++++
+ 6 files changed, 60 insertions(+)
+ create mode 100644 tests/decode/err_field10.decode
+ create mode 100644 tests/decode/err_field7.decode
+ create mode 100644 tests/decode/err_field8.decode
+ create mode 100644 tests/decode/err_field9.decode
+ create mode 100644 tests/decode/succ_named_field.decode
+diff --git a/tests/decode/err_field10.decode b/tests/decode/err_field10.decode
+new file mode 100644
+index XXXXXXX..XXXXXXX
+--- /dev/null
++++ b/tests/decode/err_field10.decode
+@@ -XXX,XX +XXX,XX @@
++# This work is licensed under the terms of the GNU LGPL, version 2 or later.
++# See the COPYING.LIB file in the top-level directory.
++
++# Diagnose formats which refer to undefined fields
++%field1        field2:3
++@fmt ........ ........ ........ ........ %field1
++insn 00000000 00000000 00000000 00000000 @fmt
+diff --git a/tests/decode/err_field7.decode b/tests/decode/err_field7.decode
+new file mode 100644
+index XXXXXXX..XXXXXXX
+--- /dev/null
++++ b/tests/decode/err_field7.decode
+@@ -XXX,XX +XXX,XX @@
++# This work is licensed under the terms of the GNU LGPL, version 2 or later.
++# See the COPYING.LIB file in the top-level directory.
++
++# Diagnose fields whose definitions form a loop
++%field1        field2:3
++%field2        field1:4
++insn 00000000 00000000 00000000 00000000 %field1 %field2
+diff --git a/tests/decode/err_field8.decode b/tests/decode/err_field8.decode
+new file mode 100644
+index XXXXXXX..XXXXXXX
+--- /dev/null
++++ b/tests/decode/err_field8.decode
+@@ -XXX,XX +XXX,XX @@
++# This work is licensed under the terms of the GNU LGPL, version 2 or later.
++# See the COPYING.LIB file in the top-level directory.
++
++# Diagnose patterns which refer to undefined fields
++&f1 f1 a
++%field1        field2:3
++@fmt ........ ........ ........ .... a:4 &f1
++insn 00000000 00000000 00000000 0000 .... @fmt f1=%field1
+diff --git a/tests/decode/err_field9.decode b/tests/decode/err_field9.decode
+new file mode 100644
+index XXXXXXX..XXXXXXX
+--- /dev/null
++++ b/tests/decode/err_field9.decode
+@@ -XXX,XX +XXX,XX @@
++# This work is licensed under the terms of the GNU LGPL, version 2 or later.
++# See the COPYING.LIB file in the top-level directory.
++
++# Diagnose fields where the format refers to a field defined in the
++# pattern and the pattern refers to a field defined in the format.
++# This is theoretically not impossible to implement, but is not
++# supported by the script at this time.
++&abcd a b c d
++%refa        a:3
++%refc        c:4
++# Format defines 'c' and sets 'b' to an indirect ref to 'a'
++@fmt ........ ........ ........ c:8 &abcd b=%refa
++# Pattern defines 'a' and sets 'd' to an indirect ref to 'c'
++insn 00000000 00000000 00000000 ........ @fmt d=%refc a=6
+diff --git a/tests/decode/succ_named_field.decode b/tests/decode/succ_named_field.decode
+new file mode 100644
+index XXXXXXX..XXXXXXX
+--- /dev/null
++++ b/tests/decode/succ_named_field.decode
+@@ -XXX,XX +XXX,XX @@
++# This work is licensed under the terms of the GNU LGPL, version 2 or later.
++# See the COPYING.LIB file in the top-level directory.
++
++# field using a named_field
++%imm_sz    8:8 sz:3
++insn 00000000 00000000 ........ 00000000 imm_sz=%imm_sz sz=1
++
++# Ditto, via a format. Here a field in the format
++# references a named field defined in the insn pattern:
++&imm_a imm alpha
++%foo 0:16 alpha:4
++@foo 00000001 ........ ........ ........ &imm_a imm=%foo
++i1   ........ 00000000 ........ ........ @foo alpha=1
++i2   ........ 00000001 ........ ........ @foo alpha=2
++
++# Here the named field is defined in the format and referenced
++# from the insn pattern:
++@bar 00000010 ........ ........ ........ &imm_a alpha=4
++i3   ........ 00000000 ........ ........ @bar imm=%foo
+diff --git a/tests/decode/meson.build b/tests/decode/meson.build
+index XXXXXXX..XXXXXXX 100644
+--- a/tests/decode/meson.build
++++ b/tests/decode/meson.build
+@@ -XXX,XX +XXX,XX @@ err_tests = [
+     'err_field4.decode',
+     'err_field5.decode',
+     'err_field6.decode',
++    'err_field7.decode',
++    'err_field8.decode',
++    'err_field9.decode',
++    'err_field10.decode',
+     'err_init1.decode',
+     'err_init2.decode',
+     'err_init3.decode',
+@@ -XXX,XX +XXX,XX @@ succ_tests = [
+     'succ_argset_type1.decode',
+     'succ_function.decode',
+     'succ_ident1.decode',
++    'succ_named_field.decode',
+     'succ_pattern_group_nest1.decode',
+     'succ_pattern_group_nest2.decode',
+     'succ_pattern_group_nest3.decode',
+--
+2.34.1

The following changes since commit 75d30fde55485b965a1168a21d016dd07b50ed32:

Merge tag 'block-pull-request' of https://gitlab.com/stefanha/qemu into staging (2022-10-30 15:07:25 -0400)

are available in the Git repository at:

https://gitlab.com/rth7680/qemu.git tags/pull-tcg-20221031

for you to fetch changes up to cb375590983fc3d23600d02ba05a05d34fe44150:

target/i386: Expand eflags updates inline (2022-10-31 11:39:10 +1100)

----------------------------------------------------------------
Remove sparc32plus support from tcg/sparc.
target/i386: Use cpu_unwind_state_data for tpr access.
target/i386: Expand eflags updates inline

----------------------------------------------------------------
Icenowy Zheng (1):
      tcg/tci: fix logic error when registering helpers via FFI

Richard Henderson (10):
      tcg/sparc: Remove support for sparc32plus
      tcg/sparc64: Rename from tcg/sparc
      tcg/sparc64: Remove sparc32plus constraints
      accel/tcg: Introduce cpu_unwind_state_data
      target/i386: Use cpu_unwind_state_data for tpr access
      target/openrisc: Always exit after mtspr npc
      target/openrisc: Use cpu_unwind_state_data for mfspr
      accel/tcg: Remove will_exit argument from cpu_restore_state
      accel/tcg: Remove reset_icount argument from cpu_restore_state_from_tb
      target/i386: Expand eflags updates inline

meson.build                                 |   4 +-
 accel/tcg/internal.h                        |   4 +-
 include/exec/exec-all.h                     |  24 ++-
 target/i386/helper.h                        |   5 -
 tcg/{sparc => sparc64}/tcg-target-con-set.h |  16 +-
 tcg/{sparc => sparc64}/tcg-target-con-str.h |   3 -
 tcg/{sparc => sparc64}/tcg-target.h         |  11 --
 accel/tcg/cpu-exec-common.c                 |   2 +-
 accel/tcg/tb-maint.c                        |   4 +-
 accel/tcg/translate-all.c                   |  91 +++++----
 target/alpha/helper.c                       |   2 +-
 target/alpha/mem_helper.c                   |   2 +-
 target/arm/op_helper.c                      |   2 +-
 target/arm/tlb_helper.c                     |   8 +-
 target/cris/helper.c                        |   2 +-
 target/i386/helper.c                        |  21 ++-
 target/i386/tcg/cc_helper.c                 |  41 -----
 target/i386/tcg/sysemu/svm_helper.c         |   2 +-
 target/i386/tcg/translate.c                 |  30 ++-
 target/m68k/op_helper.c                     |   4 +-
 target/microblaze/helper.c                  |   2 +-
 target/nios2/op_helper.c                    |   2 +-
 target/openrisc/sys_helper.c                |  17 +-
 target/ppc/excp_helper.c                    |   2 +-
 target/s390x/tcg/excp_helper.c              |   2 +-
 target/tricore/op_helper.c                  |   2 +-
 target/xtensa/helper.c                      |   6 +-
 tcg/tcg.c                                   |  81 +-------
 tcg/{sparc => sparc64}/tcg-target.c.inc     | 275 ++++++++--------------------
 MAINTAINERS                                 |   2 +-
 30 files changed, 232 insertions(+), 437 deletions(-)
 rename tcg/{sparc => sparc64}/tcg-target-con-set.h (69%)
 rename tcg/{sparc => sparc64}/tcg-target-con-str.h (77%)
 rename tcg/{sparc => sparc64}/tcg-target.h (95%)
 rename tcg/{sparc => sparc64}/tcg-target.c.inc (91%)

Since 9b9c37c36439, we have only supported sparc64 cpus.
Debian and Gentoo now only support 64-bit sparc64 userland,
so it is time to drop the 32-bit sparc64 userland: sparc32plus.

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/sparc/tcg-target.h     |  11 ---
 tcg/tcg.c                  |  75 +----------------
 tcg/sparc/tcg-target.c.inc | 166 +++++++------------------------------
 3 files changed, 33 insertions(+), 219 deletions(-)

diff --git a/tcg/sparc/tcg-target.h b/tcg/sparc/tcg-target.h
index XXXXXXX..XXXXXXX 100644
--- a/tcg/sparc/tcg-target.h
+++ b/tcg/sparc/tcg-target.h
@@ -XXX,XX +XXX,XX @@
 #ifndef SPARC_TCG_TARGET_H
 #define SPARC_TCG_TARGET_H
 
-#define TCG_TARGET_REG_BITS 64
-
 #define TCG_TARGET_INSN_UNIT_SIZE 4
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 32
 #define TCG_TARGET_NB_REGS 32
@@ -XXX,XX +XXX,XX @@ typedef enum {
 /* used for function call generation */
 #define TCG_REG_CALL_STACK TCG_REG_O6
 
-#ifdef __arch64__
 #define TCG_TARGET_STACK_BIAS           2047
 #define TCG_TARGET_STACK_ALIGN          16
 #define TCG_TARGET_CALL_STACK_OFFSET    (128 + 6*8 + TCG_TARGET_STACK_BIAS)
-#else
-#define TCG_TARGET_STACK_BIAS           0
-#define TCG_TARGET_STACK_ALIGN          8
-#define TCG_TARGET_CALL_STACK_OFFSET    (64 + 4 + 6*4)
-#endif
-
-#ifdef __arch64__
 #define TCG_TARGET_EXTEND_ARGS 1
-#endif
 
 #if defined(__VIS__) && __VIS__ >= 0x300
 #define use_vis3_instructions  1
diff --git a/tcg/tcg.c b/tcg/tcg.c
index XXXXXXX..XXXXXXX 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -XXX,XX +XXX,XX @@ void tcg_gen_callN(void *func, TCGTemp *ret, int nargs, TCGTemp **args)
     }
 #endif
 
-#if defined(__sparc__) && !defined(__arch64__) \
-    && !defined(CONFIG_TCG_INTERPRETER)
-    /* We have 64-bit values in one register, but need to pass as two
-       separate parameters.  Split them.  */
-    int orig_typemask = typemask;
-    int orig_nargs = nargs;
-    TCGv_i64 retl, reth;
-    TCGTemp *split_args[MAX_OPC_PARAM];
-
-    retl = NULL;
-    reth = NULL;
-    typemask = 0;
-    for (i = real_args = 0; i < nargs; ++i) {
-        int argtype = extract32(orig_typemask, (i + 1) * 3, 3);
-        bool is_64bit = (argtype & ~1) == dh_typecode_i64;
-
-        if (is_64bit) {
-            TCGv_i64 orig = temp_tcgv_i64(args[i]);
-            TCGv_i32 h = tcg_temp_new_i32();
-            TCGv_i32 l = tcg_temp_new_i32();
-            tcg_gen_extr_i64_i32(l, h, orig);
-            split_args[real_args++] = tcgv_i32_temp(h);
-            typemask |= dh_typecode_i32 << (real_args * 3);
-            split_args[real_args++] = tcgv_i32_temp(l);
-            typemask |= dh_typecode_i32 << (real_args * 3);
-        } else {
-            split_args[real_args++] = args[i];
-            typemask |= argtype << (real_args * 3);
-        }
-    }
-    nargs = real_args;
-    args = split_args;
-#elif defined(TCG_TARGET_EXTEND_ARGS) && TCG_TARGET_REG_BITS == 64
+#if defined(TCG_TARGET_EXTEND_ARGS) && TCG_TARGET_REG_BITS == 64
     for (i = 0; i < nargs; ++i) {
         int argtype = extract32(typemask, (i + 1) * 3, 3);
         bool is_32bit = (argtype & ~1) == dh_typecode_i32;
@@ -XXX,XX +XXX,XX @@ void tcg_gen_callN(void *func, TCGTemp *ret, int nargs, TCGTemp **args)
 
     pi = 0;
     if (ret != NULL) {
-#if defined(__sparc__) && !defined(__arch64__) \
-    && !defined(CONFIG_TCG_INTERPRETER)
-        if ((typemask & 6) == dh_typecode_i64) {
-            /* The 32-bit ABI is going to return the 64-bit value in
-               the %o0/%o1 register pair.  Prepare for this by using
-               two return temporaries, and reassemble below.  */
-            retl = tcg_temp_new_i64();
-            reth = tcg_temp_new_i64();
-            op->args[pi++] = tcgv_i64_arg(reth);
-            op->args[pi++] = tcgv_i64_arg(retl);
-            nb_rets = 2;
-        } else {
-            op->args[pi++] = temp_arg(ret);
-            nb_rets = 1;
-        }
-#else
         if (TCG_TARGET_REG_BITS < 64 && (typemask & 6) == dh_typecode_i64) {
 #if HOST_BIG_ENDIAN
             op->args[pi++] = temp_arg(ret + 1);
@@ -XXX,XX +XXX,XX @@ void tcg_gen_callN(void *func, TCGTemp *ret, int nargs, TCGTemp **args)
             op->args[pi++] = temp_arg(ret);
             nb_rets = 1;
         }
-#endif
     } else {
         nb_rets = 0;
     }
@@ -XXX,XX +XXX,XX @@ void tcg_gen_callN(void *func, TCGTemp *ret, int nargs, TCGTemp **args)
     tcg_debug_assert(TCGOP_CALLI(op) == real_args);
     tcg_debug_assert(pi <= ARRAY_SIZE(op->args));
 
-#if defined(__sparc__) && !defined(__arch64__) \
-    && !defined(CONFIG_TCG_INTERPRETER)
-    /* Free all of the parts we allocated above.  */
-    for (i = real_args = 0; i < orig_nargs; ++i) {
-        int argtype = extract32(orig_typemask, (i + 1) * 3, 3);
-        bool is_64bit = (argtype & ~1) == dh_typecode_i64;
-
-        if (is_64bit) {
-            tcg_temp_free_internal(args[real_args++]);
-            tcg_temp_free_internal(args[real_args++]);
-        } else {
-            real_args++;
-        }
-    }
-    if ((orig_typemask & 6) == dh_typecode_i64) {
-        /* The 32-bit ABI returned two 32-bit pieces.  Re-assemble them.
-           Note that describing these as TCGv_i64 eliminates an unnecessary
-           zero-extension that tcg_gen_concat_i32_i64 would create.  */
-        tcg_gen_concat32_i64(temp_tcgv_i64(ret), retl, reth);
-        tcg_temp_free_i64(retl);
-        tcg_temp_free_i64(reth);
-    }
-#elif defined(TCG_TARGET_EXTEND_ARGS) && TCG_TARGET_REG_BITS == 64
+#if defined(TCG_TARGET_EXTEND_ARGS) && TCG_TARGET_REG_BITS == 64
     for (i = 0; i < nargs; ++i) {
         int argtype = extract32(typemask, (i + 1) * 3, 3);
         bool is_32bit = (argtype & ~1) == dh_typecode_i32;
diff --git a/tcg/sparc/tcg-target.c.inc b/tcg/sparc/tcg-target.c.inc
index XXXXXXX..XXXXXXX 100644
--- a/tcg/sparc/tcg-target.c.inc
+++ b/tcg/sparc/tcg-target.c.inc
@@ -XXX,XX +XXX,XX @@
  * THE SOFTWARE.
  */
 
+/* We only support generating code for 64-bit mode.  */
+#ifndef __arch64__
+#error "unsupported code generation mode"
+#endif
+
 #include "../tcg-pool.c.inc"
 
 #ifdef CONFIG_DEBUG_TCG
@@ -XXX,XX +XXX,XX @@ static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
 };
 #endif
 
-#ifdef __arch64__
-# define SPARC64 1
-#else
-# define SPARC64 0
-#endif
-
 #define TCG_CT_CONST_S11  0x100
 #define TCG_CT_CONST_S13  0x200
 #define TCG_CT_CONST_ZERO 0x400
@@ -XXX,XX +XXX,XX @@ static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
  * high bits of the %i and %l registers garbage at all times.
  */
 #define ALL_GENERAL_REGS     MAKE_64BIT_MASK(0, 32)
-#if SPARC64
 # define ALL_GENERAL_REGS64  ALL_GENERAL_REGS
-#else
-# define ALL_GENERAL_REGS64  MAKE_64BIT_MASK(0, 16)
-#endif
 #define ALL_QLDST_REGS       (ALL_GENERAL_REGS & ~SOFTMMU_RESERVE_REGS)
 #define ALL_QLDST_REGS64     (ALL_GENERAL_REGS64 & ~SOFTMMU_RESERVE_REGS)
 
@@ -XXX,XX +XXX,XX @@ static bool check_fit_i32(int32_t val, unsigned int bits)
 }
 
 #define check_fit_tl    check_fit_i64
-#if SPARC64
-# define check_fit_ptr  check_fit_i64
-#else
-# define check_fit_ptr  check_fit_i32
-#endif
+#define check_fit_ptr   check_fit_i64
 
 static bool patch_reloc(tcg_insn_unit *src_rw, int type,
                         intptr_t value, intptr_t addend)
@@ -XXX,XX +XXX,XX @@ static void tcg_out_sety(TCGContext *s, TCGReg rs)
     tcg_out32(s, WRY | INSN_RS1(TCG_REG_G0) | INSN_RS2(rs));
 }
 
-static void tcg_out_rdy(TCGContext *s, TCGReg rd)
-{
-    tcg_out32(s, RDY | INSN_RD(rd));
-}
-
 static void tcg_out_div32(TCGContext *s, TCGReg rd, TCGReg rs1,
                           int32_t val2, int val2const, int uns)
 {
@@ -XXX,XX +XXX,XX @@ static void emit_extend(TCGContext *s, TCGReg r, int op)
         tcg_out_arithi(s, r, r, 16, SHIFT_SRL);
         break;
     case MO_32:
-        if (SPARC64) {
-            tcg_out_arith(s, r, r, 0, SHIFT_SRL);
-        }
+        tcg_out_arith(s, r, r, 0, SHIFT_SRL);
         break;
     case MO_64:
         break;
@@ -XXX,XX +XXX,XX @@ static void build_trampolines(TCGContext *s)
     };
 
     int i;
-    TCGReg ra;
 
     for (i = 0; i < ARRAY_SIZE(qemu_ld_helpers); ++i) {
         if (qemu_ld_helpers[i] == NULL) {
@@ -XXX,XX +XXX,XX @@ static void build_trampolines(TCGContext *s)
         }
         qemu_ld_trampoline[i] = tcg_splitwx_to_rx(s->code_ptr);
 
-        if (SPARC64 || TARGET_LONG_BITS == 32) {
-            ra = TCG_REG_O3;
-        } else {
-            /* Install the high part of the address.  */
-            tcg_out_arithi(s, TCG_REG_O1, TCG_REG_O2, 32, SHIFT_SRLX);
-            ra = TCG_REG_O4;
-        }
-
         /* Set the retaddr operand.  */
-        tcg_out_mov(s, TCG_TYPE_PTR, ra, TCG_REG_O7);
+        tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_O3, TCG_REG_O7);
         /* Tail call.  */
         tcg_out_jmpl_const(s, qemu_ld_helpers[i], true, true);
         /* delay slot -- set the env argument */
@@ -XXX,XX +XXX,XX @@ static void build_trampolines(TCGContext *s)
         }
         qemu_st_trampoline[i] = tcg_splitwx_to_rx(s->code_ptr);
 
-        if (SPARC64) {
-            emit_extend(s, TCG_REG_O2, i);
-            ra = TCG_REG_O4;
-        } else {
-            ra = TCG_REG_O1;
-            if (TARGET_LONG_BITS == 64) {
-                /* Install the high part of the address.  */
-                tcg_out_arithi(s, ra, ra + 1, 32, SHIFT_SRLX);
-                ra += 2;
-            } else {
-                ra += 1;
-            }
-            if ((i & MO_SIZE) == MO_64) {
-                /* Install the high part of the data.  */
-                tcg_out_arithi(s, ra, ra + 1, 32, SHIFT_SRLX);
-                ra += 2;
-            } else {
-                emit_extend(s, ra, i);
-                ra += 1;
-            }
-            /* Skip the oi argument.  */
-            ra += 1;
-        }
-                
+        emit_extend(s, TCG_REG_O2, i);
+
         /* Set the retaddr operand.  */
-        if (ra >= TCG_REG_O6) {
-            tcg_out_st(s, TCG_TYPE_PTR, TCG_REG_O7, TCG_REG_CALL_STACK,
-                       TCG_TARGET_CALL_STACK_OFFSET);
-        } else {
-            tcg_out_mov(s, TCG_TYPE_PTR, ra, TCG_REG_O7);
-        }
+        tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_O4, TCG_REG_O7);
 
         /* Tail call.  */
         tcg_out_jmpl_const(s, qemu_st_helpers[i], true, true);
@@ -XXX,XX +XXX,XX @@ static void build_trampolines(TCGContext *s)
             qemu_unalign_st_trampoline = tcg_splitwx_to_rx(s->code_ptr);
         }
 
-        if (!SPARC64 && TARGET_LONG_BITS == 64) {
-            /* Install the high part of the address.  */
-            tcg_out_arithi(s, TCG_REG_O1, TCG_REG_O2, 32, SHIFT_SRLX);
-        }
-
         /* Tail call.  */
         tcg_out_jmpl_const(s, helper, true, true);
         /* delay slot -- set the env argument */
@@ -XXX,XX +XXX,XX @@ static TCGReg tcg_out_tlb_load(TCGContext *s, TCGReg addr, int mem_index,
     tcg_out_cmp(s, r0, r2, 0);
 
     /* If the guest address must be zero-extended, do so now.  */
-    if (SPARC64 && TARGET_LONG_BITS == 32) {
+    if (TARGET_LONG_BITS == 32) {
         tcg_out_arithi(s, r0, addr, 0, SHIFT_SRL);
         return r0;
     }
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_ld(TCGContext *s, TCGReg data, TCGReg addr,
 
 #ifdef CONFIG_SOFTMMU
     unsigned memi = get_mmuidx(oi);
-    TCGReg addrz, param;
+    TCGReg addrz;
     const tcg_insn_unit *func;
 
     addrz = tcg_out_tlb_load(s, addr, memi, memop,
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_ld(TCGContext *s, TCGReg data, TCGReg addr,
 
     /* TLB Miss.  */
 
-    param = TCG_REG_O1;
-    if (!SPARC64 && TARGET_LONG_BITS == 64) {
-        /* Skip the high-part; we'll perform the extract in the trampoline.  */
-        param++;
-    }
-    tcg_out_mov(s, TCG_TYPE_REG, param++, addrz);
+    tcg_out_mov(s, TCG_TYPE_REG, TCG_REG_O1, addrz);
 
     /* We use the helpers to extend SB and SW data, leaving the case
        of SL needing explicit extending below.  */
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_ld(TCGContext *s, TCGReg data, TCGReg addr,
     tcg_debug_assert(func != NULL);
     tcg_out_call_nodelay(s, func, false);
     /* delay slot */
-    tcg_out_movi(s, TCG_TYPE_I32, param, oi);
+    tcg_out_movi(s, TCG_TYPE_I32, TCG_REG_O2, oi);
 
-    /* Recall that all of the helpers return 64-bit results.
-       Which complicates things for sparcv8plus.  */
-    if (SPARC64) {
-        /* We let the helper sign-extend SB and SW, but leave SL for here.  */
-        if (is_64 && (memop & MO_SSIZE) == MO_SL) {
-            tcg_out_arithi(s, data, TCG_REG_O0, 0, SHIFT_SRA);
-        } else {
-            tcg_out_mov(s, TCG_TYPE_REG, data, TCG_REG_O0);
-        }
+    /* We let the helper sign-extend SB and SW, but leave SL for here.  */
+    if (is_64 && (memop & MO_SSIZE) == MO_SL) {
+        tcg_out_arithi(s, data, TCG_REG_O0, 0, SHIFT_SRA);
     } else {
-        if ((memop & MO_SIZE) == MO_64) {
-            tcg_out_arithi(s, TCG_REG_O0, TCG_REG_O0, 32, SHIFT_SLLX);
-            tcg_out_arithi(s, TCG_REG_O1, TCG_REG_O1, 0, SHIFT_SRL);
-            tcg_out_arith(s, data, TCG_REG_O0, TCG_REG_O1, ARITH_OR);
-        } else if (is_64) {
-            /* Re-extend from 32-bit rather than reassembling when we
-               know the high register must be an extension.  */
-            tcg_out_arithi(s, data, TCG_REG_O1, 0,
-                           memop & MO_SIGN ? SHIFT_SRA : SHIFT_SRL);
-        } else {
-            tcg_out_mov(s, TCG_TYPE_I32, data, TCG_REG_O1);
-        }
+        tcg_out_mov(s, TCG_TYPE_REG, data, TCG_REG_O0);
     }
 
     *label_ptr |= INSN_OFF19(tcg_ptr_byte_diff(s->code_ptr, label_ptr));
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_ld(TCGContext *s, TCGReg data, TCGReg addr,
     unsigned s_bits = memop & MO_SIZE;
     unsigned t_bits;
 
-    if (SPARC64 && TARGET_LONG_BITS == 32) {
+    if (TARGET_LONG_BITS == 32) {
         tcg_out_arithi(s, TCG_REG_T1, addr, 0, SHIFT_SRL);
         addr = TCG_REG_T1;
     }
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_ld(TCGContext *s, TCGReg data, TCGReg addr,
          * operation in the delay slot, and failure need only invoke the
          * handler for SIGBUS.
          */
-        TCGReg arg_low = TCG_REG_O1 + (!SPARC64 && TARGET_LONG_BITS == 64);
         tcg_out_call_nodelay(s, qemu_unalign_ld_trampoline, false);
         /* delay slot -- move to low part of argument reg */
-        tcg_out_mov_delay(s, arg_low, addr);
+        tcg_out_mov_delay(s, TCG_REG_O1, addr);
     } else {
         /* Underalignment: load by pieces of minimum alignment. */
         int ld_opc, a_size, s_size, i;
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_st(TCGContext *s, TCGReg data, TCGReg addr,
 
 #ifdef CONFIG_SOFTMMU
     unsigned memi = get_mmuidx(oi);
-    TCGReg addrz, param;
+    TCGReg addrz;
     const tcg_insn_unit *func;
 
     addrz = tcg_out_tlb_load(s, addr, memi, memop,
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_st(TCGContext *s, TCGReg data, TCGReg addr,
 
     /* TLB Miss.  */
 
-    param = TCG_REG_O1;
-    if (!SPARC64 && TARGET_LONG_BITS == 64) {
-        /* Skip the high-part; we'll perform the extract in the trampoline.  */
-        param++;
-    }
-    tcg_out_mov(s, TCG_TYPE_REG, param++, addrz);
-    if (!SPARC64 && (memop & MO_SIZE) == MO_64) {
-        /* Skip the high-part; we'll perform the extract in the trampoline.  */
-        param++;
-    }
-    tcg_out_mov(s, TCG_TYPE_REG, param++, data);
+    tcg_out_mov(s, TCG_TYPE_REG, TCG_REG_O1, addrz);
+    tcg_out_mov(s, TCG_TYPE_REG, TCG_REG_O2, data);
 
     func = qemu_st_trampoline[memop & (MO_BSWAP | MO_SIZE)];
     tcg_debug_assert(func != NULL);
     tcg_out_call_nodelay(s, func, false);
     /* delay slot */
-    tcg_out_movi(s, TCG_TYPE_I32, param, oi);
+    tcg_out_movi(s, TCG_TYPE_I32, TCG_REG_O3, oi);
 
     *label_ptr |= INSN_OFF19(tcg_ptr_byte_diff(s->code_ptr, label_ptr));
 #else
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_st(TCGContext *s, TCGReg data, TCGReg addr,
     unsigned s_bits = memop & MO_SIZE;
     unsigned t_bits;
 
-    if (SPARC64 && TARGET_LONG_BITS == 32) {
+    if (TARGET_LONG_BITS == 32) {
         tcg_out_arithi(s, TCG_REG_T1, addr, 0, SHIFT_SRL);
         addr = TCG_REG_T1;
     }
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_st(TCGContext *s, TCGReg data, TCGReg addr,
          * operation in the delay slot, and failure need only invoke the
          * handler for SIGBUS.
          */
-        TCGReg arg_low = TCG_REG_O1 + (!SPARC64 && TARGET_LONG_BITS == 64);
         tcg_out_call_nodelay(s, qemu_unalign_st_trampoline, false);
         /* delay slot -- move to low part of argument reg */
-        tcg_out_mov_delay(s, arg_low, addr);
+        tcg_out_mov_delay(s, TCG_REG_O1, addr);
     } else {
         /* Underalignment: store by pieces of minimum alignment. */
         int st_opc, a_size, s_size, i;
@@ -XXX,XX +XXX,XX @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
     case INDEX_op_muls2_i32:
         c = ARITH_SMUL;
     do_mul2:
-        /* The 32-bit multiply insns produce a full 64-bit result.  If the
-           destination register can hold it, we can avoid the slower RDY.  */
+        /* The 32-bit multiply insns produce a full 64-bit result. */
         tcg_out_arithc(s, a0, a2, args[3], const_args[3], c);
-        if (SPARC64 || a0 <= TCG_REG_O7) {
-            tcg_out_arithi(s, a1, a0, 32, SHIFT_SRLX);
-        } else {
-            tcg_out_rdy(s, a1);
-        }
+        tcg_out_arithi(s, a1, a0, 32, SHIFT_SRLX);
         break;
 
     case INDEX_op_qemu_ld_i32:
@@ -XXX,XX +XXX,XX @@ static void tcg_target_init(TCGContext *s)
     tcg_regset_set_reg(s->reserved_regs, TCG_REG_T2); /* for internal use */
 }
 
-#if SPARC64
-# define ELF_HOST_MACHINE  EM_SPARCV9
-#else
-# define ELF_HOST_MACHINE  EM_SPARC32PLUS
-# define ELF_HOST_FLAGS    EF_SPARC_32PLUS
-#endif
+#define ELF_HOST_MACHINE  EM_SPARCV9
 
 typedef struct {
     DebugFrameHeader h;
-    uint8_t fde_def_cfa[SPARC64 ? 4 : 2];
+    uint8_t fde_def_cfa[4];
     uint8_t fde_win_save;
     uint8_t fde_ret_save[3];
 } DebugFrame;
@@ -XXX,XX +XXX,XX @@ static const DebugFrame debug_frame = {
     .h.fde.len = sizeof(DebugFrame) - offsetof(DebugFrame, h.fde.cie_offset),
 
     .fde_def_cfa = {
-#if SPARC64
         12, 30,                         /* DW_CFA_def_cfa i6, 2047 */
         (2047 & 0x7f) | 0x80, (2047 >> 7)
-#else
-        13, 30                          /* DW_CFA_def_cfa_register i6 */
-#endif
     },
     .fde_win_save = 0x2d,               /* DW_CFA_GNU_window_save */
     .fde_ret_save = { 9, 15, 31 },      /* DW_CFA_register o7, i7 */
-- 
2.34.1

Emphasize that we only support full 64-bit code generation.

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 meson.build                                 | 4 +---
 tcg/{sparc => sparc64}/tcg-target-con-set.h | 0
 tcg/{sparc => sparc64}/tcg-target-con-str.h | 0
 tcg/{sparc => sparc64}/tcg-target.h         | 0
 tcg/{sparc => sparc64}/tcg-target.c.inc     | 0
 MAINTAINERS                                 | 2 +-
 6 files changed, 2 insertions(+), 4 deletions(-)
 rename tcg/{sparc => sparc64}/tcg-target-con-set.h (100%)
 rename tcg/{sparc => sparc64}/tcg-target-con-str.h (100%)
 rename tcg/{sparc => sparc64}/tcg-target.h (100%)
 rename tcg/{sparc => sparc64}/tcg-target.c.inc (100%)

diff --git a/meson.build b/meson.build
index XXXXXXX..XXXXXXX 100644
--- a/meson.build
+++ b/meson.build
@@ -XXX,XX +XXX,XX @@ qapi_trace_events = []
 bsd_oses = ['gnu/kfreebsd', 'freebsd', 'netbsd', 'openbsd', 'dragonfly', 'darwin']
 supported_oses = ['windows', 'freebsd', 'netbsd', 'openbsd', 'darwin', 'sunos', 'linux']
 supported_cpus = ['ppc', 'ppc64', 's390x', 'riscv', 'x86', 'x86_64',
-  'arm', 'aarch64', 'loongarch64', 'mips', 'mips64', 'sparc', 'sparc64']
+  'arm', 'aarch64', 'loongarch64', 'mips', 'mips64', 'sparc64']
 
 cpu = host_machine.cpu_family()
 
@@ -XXX,XX +XXX,XX @@ if get_option('tcg').allowed()
   endif
   if get_option('tcg_interpreter')
     tcg_arch = 'tci'
-  elif host_arch == 'sparc64'
-    tcg_arch = 'sparc'
   elif host_arch == 'x86_64'
     tcg_arch = 'i386'
   elif host_arch == 'ppc64'
diff --git a/tcg/sparc/tcg-target-con-set.h b/tcg/sparc64/tcg-target-con-set.h
similarity index 100%
rename from tcg/sparc/tcg-target-con-set.h
rename to tcg/sparc64/tcg-target-con-set.h
diff --git a/tcg/sparc/tcg-target-con-str.h b/tcg/sparc64/tcg-target-con-str.h
similarity index 100%
rename from tcg/sparc/tcg-target-con-str.h
rename to tcg/sparc64/tcg-target-con-str.h
diff --git a/tcg/sparc/tcg-target.h b/tcg/sparc64/tcg-target.h
similarity index 100%
rename from tcg/sparc/tcg-target.h
rename to tcg/sparc64/tcg-target.h
diff --git a/tcg/sparc/tcg-target.c.inc b/tcg/sparc64/tcg-target.c.inc
similarity index 100%
rename from tcg/sparc/tcg-target.c.inc
rename to tcg/sparc64/tcg-target.c.inc
diff --git a/MAINTAINERS b/MAINTAINERS
index XXXXXXX..XXXXXXX 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -XXX,XX +XXX,XX @@ L: qemu-s390x@nongnu.org
 
 SPARC TCG target
 S: Odd Fixes
-F: tcg/sparc/
+F: tcg/sparc64/
 F: disas/sparc.c
 
 TCI TCG target
-- 
2.34.1

With sparc64 we need not distinguish between registers that
can hold 32-bit values and those that can hold 64-bit values.
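
For illustration only, a minimal sketch of the register masks involved.
The MAKE_64BIT_MASK definition here is assumed to match the one in QEMU's
bitops header, and the two helper functions are hypothetical names. With
registers numbered %g, %o, %l, %i, the old sparcv8plus 64-bit mask covered
only the first 16, while on sparc64 a single 32-register mask serves both
widths.

```c
#include <stdint.h>

/* Assumption: mirrors QEMU's MAKE_64BIT_MASK(shift, length). */
#define MAKE_64BIT_MASK(shift, length) \
    (((~0ULL) >> (64 - (length))) << (shift))

/* Hypothetical helper: every general register (%g, %o, %l, %i). */
static uint64_t toy_all_general_regs(void)
{
    return MAKE_64BIT_MASK(0, 32);
}

/*
 * Hypothetical helper: the subset that could hold 64-bit values under
 * sparcv8plus, i.e. only %g and %o (register numbers 0..15).
 */
static uint64_t toy_v8plus_regs64(void)
{
    return MAKE_64BIT_MASK(0, 16);
}
```

Once only sparc64 is supported, the second mask has no user, which is why
the 'R', 'S' and 'A' constraint letters can fold into 'r' and 's'.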

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/sparc64/tcg-target-con-set.h |  16 +----
 tcg/sparc64/tcg-target-con-str.h |   3 -
 tcg/sparc64/tcg-target.c.inc     | 109 ++++++++++++-------------------
 3 files changed, 44 insertions(+), 84 deletions(-)

diff --git a/tcg/sparc64/tcg-target-con-set.h b/tcg/sparc64/tcg-target-con-set.h
index XXXXXXX..XXXXXXX 100644
--- a/tcg/sparc64/tcg-target-con-set.h
+++ b/tcg/sparc64/tcg-target-con-set.h
@@ -XXX,XX +XXX,XX @@
  */
 C_O0_I1(r)
 C_O0_I2(rZ, r)
-C_O0_I2(RZ, r)
 C_O0_I2(rZ, rJ)
-C_O0_I2(RZ, RJ)
-C_O0_I2(sZ, A)
-C_O0_I2(SZ, A)
-C_O1_I1(r, A)
-C_O1_I1(R, A)
+C_O0_I2(sZ, s)
+C_O1_I1(r, s)
 C_O1_I1(r, r)
-C_O1_I1(r, R)
-C_O1_I1(R, r)
-C_O1_I1(R, R)
-C_O1_I2(R, R, R)
+C_O1_I2(r, r, r)
 C_O1_I2(r, rZ, rJ)
-C_O1_I2(R, RZ, RJ)
 C_O1_I4(r, rZ, rJ, rI, 0)
-C_O1_I4(R, RZ, RJ, RI, 0)
 C_O2_I2(r, r, rZ, rJ)
-C_O2_I4(R, R, RZ, RZ, RJ, RI)
 C_O2_I4(r, r, rZ, rZ, rJ, rJ)
diff --git a/tcg/sparc64/tcg-target-con-str.h b/tcg/sparc64/tcg-target-con-str.h
index XXXXXXX..XXXXXXX 100644
--- a/tcg/sparc64/tcg-target-con-str.h
+++ b/tcg/sparc64/tcg-target-con-str.h
@@ -XXX,XX +XXX,XX @@
  * REGS(letter, register_mask)
  */
 REGS('r', ALL_GENERAL_REGS)
-REGS('R', ALL_GENERAL_REGS64)
 REGS('s', ALL_QLDST_REGS)
-REGS('S', ALL_QLDST_REGS64)
-REGS('A', TARGET_LONG_BITS == 64 ? ALL_QLDST_REGS64 : ALL_QLDST_REGS)
 
 /*
  * Define constraint letters for constants:
diff --git a/tcg/sparc64/tcg-target.c.inc b/tcg/sparc64/tcg-target.c.inc
index XXXXXXX..XXXXXXX 100644
--- a/tcg/sparc64/tcg-target.c.inc
+++ b/tcg/sparc64/tcg-target.c.inc
@@ -XXX,XX +XXX,XX @@ static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
 #else
 #define SOFTMMU_RESERVE_REGS 0
 #endif
-
-/*
- * Note that sparcv8plus can only hold 64 bit quantities in %g and %o
- * registers.  These are saved manually by the kernel in full 64-bit
- * slots.  The %i and %l registers are saved by the register window
- * mechanism, which only allocates space for 32 bits.  Given that this
- * window spill/fill can happen on any signal, we must consider the
- * high bits of the %i and %l registers garbage at all times.
- */
 #define ALL_GENERAL_REGS     MAKE_64BIT_MASK(0, 32)
-# define ALL_GENERAL_REGS64  ALL_GENERAL_REGS
 #define ALL_QLDST_REGS       (ALL_GENERAL_REGS & ~SOFTMMU_RESERVE_REGS)
-#define ALL_QLDST_REGS64     (ALL_GENERAL_REGS64 & ~SOFTMMU_RESERVE_REGS)
 
 /* Define some temporary registers.  T2 is used for constant generation.  */
 #define TCG_REG_T1  TCG_REG_G1
@@ -XXX,XX +XXX,XX @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
         return C_O0_I1(r);
 
     case INDEX_op_ld8u_i32:
+    case INDEX_op_ld8u_i64:
     case INDEX_op_ld8s_i32:
+    case INDEX_op_ld8s_i64:
     case INDEX_op_ld16u_i32:
+    case INDEX_op_ld16u_i64:
     case INDEX_op_ld16s_i32:
+    case INDEX_op_ld16s_i64:
     case INDEX_op_ld_i32:
+    case INDEX_op_ld32u_i64:
+    case INDEX_op_ld32s_i64:
+    case INDEX_op_ld_i64:
     case INDEX_op_neg_i32:
+    case INDEX_op_neg_i64:
     case INDEX_op_not_i32:
+    case INDEX_op_not_i64:
+    case INDEX_op_ext32s_i64:
+    case INDEX_op_ext32u_i64:
+    case INDEX_op_ext_i32_i64:
+    case INDEX_op_extu_i32_i64:
+    case INDEX_op_extrl_i64_i32:
+    case INDEX_op_extrh_i64_i32:
         return C_O1_I1(r, r);
 
     case INDEX_op_st8_i32:
+    case INDEX_op_st8_i64:
     case INDEX_op_st16_i32:
+    case INDEX_op_st16_i64:
     case INDEX_op_st_i32:
+    case INDEX_op_st32_i64:
+    case INDEX_op_st_i64:
         return C_O0_I2(rZ, r);
 
     case INDEX_op_add_i32:
+    case INDEX_op_add_i64:
     case INDEX_op_mul_i32:
+    case INDEX_op_mul_i64:
     case INDEX_op_div_i32:
+    case INDEX_op_div_i64:
     case INDEX_op_divu_i32:
+    case INDEX_op_divu_i64:
     case INDEX_op_sub_i32:
+    case INDEX_op_sub_i64:
     case INDEX_op_and_i32:
+    case INDEX_op_and_i64:
     case INDEX_op_andc_i32:
+    case INDEX_op_andc_i64:
     case INDEX_op_or_i32:
+    case INDEX_op_or_i64:
     case INDEX_op_orc_i32:
+    case INDEX_op_orc_i64:
     case INDEX_op_xor_i32:
+    case INDEX_op_xor_i64:
     case INDEX_op_shl_i32:
+    case INDEX_op_shl_i64:
     case INDEX_op_shr_i32:
+    case INDEX_op_shr_i64:
     case INDEX_op_sar_i32:
+    case INDEX_op_sar_i64:
     case INDEX_op_setcond_i32:
+    case INDEX_op_setcond_i64:
         return C_O1_I2(r, rZ, rJ);
 
     case INDEX_op_brcond_i32:
+    case INDEX_op_brcond_i64:
         return C_O0_I2(rZ, rJ);
     case INDEX_op_movcond_i32:
+    case INDEX_op_movcond_i64:
         return C_O1_I4(r, rZ, rJ, rI, 0);
     case INDEX_op_add2_i32:
+    case INDEX_op_add2_i64:
     case INDEX_op_sub2_i32:
+    case INDEX_op_sub2_i64:
         return C_O2_I4(r, r, rZ, rZ, rJ, rJ);
     case INDEX_op_mulu2_i32:
     case INDEX_op_muls2_i32:
         return C_O2_I2(r, r, rZ, rJ);
-
-    case INDEX_op_ld8u_i64:
-    case INDEX_op_ld8s_i64:
-    case INDEX_op_ld16u_i64:
-    case INDEX_op_ld16s_i64:
-    case INDEX_op_ld32u_i64:
-    case INDEX_op_ld32s_i64:
-    case INDEX_op_ld_i64:
-    case INDEX_op_ext_i32_i64:
-    case INDEX_op_extu_i32_i64:
-        return C_O1_I1(R, r);
-
-    case INDEX_op_st8_i64:
-    case INDEX_op_st16_i64:
-    case INDEX_op_st32_i64:
-    case INDEX_op_st_i64:
-        return C_O0_I2(RZ, r);
-
-    case INDEX_op_add_i64:
-    case INDEX_op_mul_i64:
-    case INDEX_op_div_i64:
-    case INDEX_op_divu_i64:
-    case INDEX_op_sub_i64:
-    case INDEX_op_and_i64:
-    case INDEX_op_andc_i64:
-    case INDEX_op_or_i64:
-    case INDEX_op_orc_i64:
-    case INDEX_op_xor_i64:
-    case INDEX_op_shl_i64:
-    case INDEX_op_shr_i64:
-    case INDEX_op_sar_i64:
-    case INDEX_op_setcond_i64:
-        return C_O1_I2(R, RZ, RJ);
-
-    case INDEX_op_neg_i64:
-    case INDEX_op_not_i64:
-    case INDEX_op_ext32s_i64:
-    case INDEX_op_ext32u_i64:
-        return C_O1_I1(R, R);
-
-    case INDEX_op_extrl_i64_i32:
-    case INDEX_op_extrh_i64_i32:
-        return C_O1_I1(r, R);
-
-    case INDEX_op_brcond_i64:
-        return C_O0_I2(RZ, RJ);
-    case INDEX_op_movcond_i64:
-        return C_O1_I4(R, RZ, RJ, RI, 0);
-    case INDEX_op_add2_i64:
-    case INDEX_op_sub2_i64:
-        return C_O2_I4(R, R, RZ, RZ, RJ, RI);
     case INDEX_op_muluh_i64:
-        return C_O1_I2(R, R, R);
+        return C_O1_I2(r, r, r);
 
     case INDEX_op_qemu_ld_i32:
-        return C_O1_I1(r, A);
     case INDEX_op_qemu_ld_i64:
-        return C_O1_I1(R, A);
+        return C_O1_I1(r, s);
     case INDEX_op_qemu_st_i32:
-        return C_O0_I2(sZ, A);
     case INDEX_op_qemu_st_i64:
-        return C_O0_I2(SZ, A);
+        return C_O0_I2(sZ, s);
 
     default:
         g_assert_not_reached();
@@ -XXX,XX +XXX,XX @@ static void tcg_target_init(TCGContext *s)
 #endif
 
     tcg_target_available_regs[TCG_TYPE_I32] = ALL_GENERAL_REGS;
-    tcg_target_available_regs[TCG_TYPE_I64] = ALL_GENERAL_REGS64;
+    tcg_target_available_regs[TCG_TYPE_I64] = ALL_GENERAL_REGS;
 
     tcg_target_call_clobber_regs = 0;
     tcg_regset_set_reg(tcg_target_call_clobber_regs, TCG_REG_G1);
-- 
2.34.1

From: Icenowy Zheng <uwu@icenowy.me>

When registering helpers via FFI for TCI, the inner loop that iterates
over the parameters of a helper reuses (and thus clobbers) the variable
used by the outer loop that iterates over all helpers, which leaves some
helpers unregistered.

Fix this logic error by using a dedicated temporary variable for the
inner loop.
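
A reduced sketch of the failure mode, not the actual tcg_context_init
code: ToyHelper and both functions are hypothetical stand-ins. The buggy
form lets the inner loop overwrite the outer index, so the outer loop
resumes from the wrong entry; depending on the argument counts it can
skip entries or revisit them.

```c
#include <stddef.h>

/* Hypothetical stand-in for one helper table entry. */
typedef struct {
    int nargs;       /* number of arguments the helper takes */
    int registered;  /* set once the outer loop has processed it */
} ToyHelper;

/* Buggy shape: the inner loop reuses the outer index i. */
static int toy_register_all_buggy(ToyHelper *h, int n)
{
    int i, count = 0;

    for (i = 0; i < n; ++i) {
        int nargs = h[i].nargs;

        h[i].registered = 1;
        count++;
        if (nargs != 0) {
            for (i = 0; i < nargs; ++i) {
                /* per-argument setup would go here */
            }
            /* i == nargs now, so the outer loop continues at nargs + 1 */
        }
    }
    return count;
}

/* Fixed shape: the inner loop has its own index j. */
static int toy_register_all_fixed(ToyHelper *h, int n)
{
    int count = 0;

    for (int i = 0; i < n; ++i) {
        int nargs = h[i].nargs;

        h[i].registered = 1;
        count++;
        for (int j = 0; j < nargs; ++j) {
            /* per-argument setup would go here */
        }
    }
    return count;
}
```

With four entries whose argument counts are 2, 0, 0, 0, the buggy form
processes only entries 0 and 3; the fixed form processes all four.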

Fixes: 22f15579fa ("tcg: Build ffi data structures for helpers")
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Icenowy Zheng <uwu@icenowy.me>
Message-Id: <20221028072145.1593205-1-uwu@icenowy.me>
[rth: Move declaration of j to the for loop itself]
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/tcg.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tcg/tcg.c b/tcg/tcg.c
index XXXXXXX..XXXXXXX 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -XXX,XX +XXX,XX @@ static void tcg_context_init(unsigned max_cpus)
 
         if (nargs != 0) {
             ca->cif.arg_types = ca->args;
-            for (i = 0; i < nargs; ++i) {
-                int typecode = extract32(typemask, (i + 1) * 3, 3);
-                ca->args[i] = typecode_to_ffi[typecode];
+            for (int j = 0; j < nargs; ++j) {
+                int typecode = extract32(typemask, (j + 1) * 3, 3);
+                ca->args[j] = typecode_to_ffi[typecode];
             }
         }
 
-- 
2.34.1

Add a way to examine the unwind data without actually
restoring the data back into env.
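
As a rough sketch of the split between examining and restoring, using
hypothetical toy types rather than the real TranslationBlock search: the
lookup function only reports the unwind data for a host pc, and the
restore function is layered on top of it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical guest state and unwind table, for illustration only. */
typedef struct {
    uint64_t pc;
} ToyEnv;

typedef struct {
    uintptr_t start, end;   /* host pc range [start, end) */
    uint64_t guest_pc;      /* unwind data recorded for that range */
} ToyRange;

static const ToyRange toy_table[] = {
    { 0x1000, 0x1010, 0x400000 },
    { 0x1010, 0x1030, 0x400004 },
};

/* Examine only: fill *data and return true if host_pc is covered. */
static bool toy_unwind_state_data(uintptr_t host_pc, uint64_t *data)
{
    for (size_t i = 0; i < sizeof(toy_table) / sizeof(toy_table[0]); ++i) {
        if (host_pc >= toy_table[i].start && host_pc < toy_table[i].end) {
            *data = toy_table[i].guest_pc;
            return true;
        }
    }
    return false;
}

/* Restore: the same lookup, followed by a write back into env. */
static bool toy_restore_state(ToyEnv *env, uintptr_t host_pc)
{
    uint64_t data;

    if (!toy_unwind_state_data(host_pc, &data)) {
        return false;
    }
    env->pc = data;
    return true;
}
```

A caller that only needs the value, such as a register read helper, can
use the first function and leave env untouched.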

Reviewed-by: Claudio Fontana <cfontana@suse.de>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 accel/tcg/internal.h      |  4 +--
 include/exec/exec-all.h   | 21 ++++++++---
 accel/tcg/translate-all.c | 74 ++++++++++++++++++++++++++-------------
 3 files changed, 68 insertions(+), 31 deletions(-)

diff --git a/accel/tcg/internal.h b/accel/tcg/internal.h
index XXXXXXX..XXXXXXX 100644
--- a/accel/tcg/internal.h
+++ b/accel/tcg/internal.h
@@ -XXX,XX +XXX,XX @@ void tb_reset_jump(TranslationBlock *tb, int n);
 TranslationBlock *tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
                                tb_page_addr_t phys_page2);
 bool tb_invalidate_phys_page_unwind(tb_page_addr_t addr, uintptr_t pc);
-int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
-                              uintptr_t searched_pc, bool reset_icount);
+void cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
+                               uintptr_t host_pc, bool reset_icount);
 
 /* Return the current PC from CPU, which may be cached in TB. */
 static inline target_ulong log_pc(CPUState *cpu, const TranslationBlock *tb)
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index XXXXXXX..XXXXXXX 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -XXX,XX +XXX,XX @@ typedef ram_addr_t tb_page_addr_t;
 #define TB_PAGE_ADDR_FMT RAM_ADDR_FMT
 #endif
 
+/**
+ * cpu_unwind_state_data:
+ * @cpu: the cpu context
+ * @host_pc: the host pc within the translation
+ * @data: output data
+ *
+ * Attempt to load the unwind state for a host pc occurring in
+ * translated code.  If @host_pc is not in translated code, the
+ * function returns false; otherwise @data is loaded.
+ * This is the same unwind info as given to restore_state_to_opc.
+ */
+bool cpu_unwind_state_data(CPUState *cpu, uintptr_t host_pc, uint64_t *data);
+
 /**
  * cpu_restore_state:
- * @cpu: the vCPU state is to be restore to
- * @searched_pc: the host PC the fault occurred at
+ * @cpu: the cpu context
+ * @host_pc: the host pc within the translation
  * @will_exit: true if the TB executed will be interrupted after some
                cpu adjustments. Required for maintaining the correct
                icount valus
  * @return: true if state was restored, false otherwise
  *
  * Attempt to restore the state for a fault occurring in translated
- * code. If the searched_pc is not in translated code no state is
+ * code. If @host_pc is not in translated code no state is
  * restored and the function returns false.
  */
-bool cpu_restore_state(CPUState *cpu, uintptr_t searched_pc, bool will_exit);
+bool cpu_restore_state(CPUState *cpu, uintptr_t host_pc, bool will_exit);
 
 G_NORETURN void cpu_loop_exit_noexc(CPUState *cpu);
 G_NORETURN void cpu_loop_exit(CPUState *cpu);
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index XXXXXXX..XXXXXXX 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -XXX,XX +XXX,XX @@ static int encode_search(TranslationBlock *tb, uint8_t *block)
     return p - block;
 }
 
-/* The cpu state corresponding to 'searched_pc' is restored.
- * When reset_icount is true, current TB will be interrupted and
- * icount should be recalculated.
- */
-int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
-                              uintptr_t searched_pc, bool reset_icount)
+static int cpu_unwind_data_from_tb(TranslationBlock *tb, uintptr_t host_pc,
+                                   uint64_t *data)
 {
-    uint64_t data[TARGET_INSN_START_WORDS];
-    uintptr_t host_pc = (uintptr_t)tb->tc.ptr;
+    uintptr_t iter_pc = (uintptr_t)tb->tc.ptr;
     const uint8_t *p = tb->tc.ptr + tb->tc.size;
     int i, j, num_insns = tb->icount;
-#ifdef CONFIG_PROFILER
-    TCGProfile *prof = &tcg_ctx->prof;
-    int64_t ti = profile_getclock();
-#endif
 
-    searched_pc -= GETPC_ADJ;
+    host_pc -= GETPC_ADJ;
 
-    if (searched_pc < host_pc) {
+    if (host_pc < iter_pc) {
         return -1;
     }
 
-    memset(data, 0, sizeof(data));
+    memset(data, 0, sizeof(uint64_t) * TARGET_INSN_START_WORDS);
     if (!TARGET_TB_PCREL) {
         data[0] = tb_pc(tb);
     }
 
-    /* Reconstruct the stored insn data while looking for the point at
-       which the end of the insn exceeds the searched_pc.  */
+    /*
+     * Reconstruct the stored insn data while looking for the point
+     * at which the end of the insn exceeds host_pc.
+     */
     for (i = 0; i < num_insns; ++i) {
         for (j = 0; j < TARGET_INSN_START_WORDS; ++j) {
             data[j] += decode_sleb128(&p);
         }
-        host_pc += decode_sleb128(&p);
-        if (host_pc > searched_pc) {
-            goto found;
+        iter_pc += decode_sleb128(&p);
+        if (iter_pc > host_pc) {
+            return num_insns - i;
         }
     }
     return -1;
+}
+
+/*
+ * The cpu state corresponding to 'host_pc' is restored.
+ * When reset_icount is true, current TB will be interrupted and
+ * icount should be recalculated.
+ */
+void cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
+                               uintptr_t host_pc, bool reset_icount)
+{
+    uint64_t data[TARGET_INSN_START_WORDS];
+#ifdef CONFIG_PROFILER
+    TCGProfile *prof = &tcg_ctx->prof;
+    int64_t ti = profile_getclock();
+#endif
+    int insns_left = cpu_unwind_data_from_tb(tb, host_pc, data);
+
+    if (insns_left < 0) {
+        return;
+    }
 
- found:
     if (reset_icount && (tb_cflags(tb) & CF_USE_ICOUNT)) {
         assert(icount_enabled());
-        /* Reset the cycle counter to the start of the block
-           and shift if to the number of actually executed instructions */
-        cpu_neg(cpu)->icount_decr.u16.low += num_insns - i;
+        /*
+         * Reset the cycle counter to the start of the block and
+         * shift it to the number of actually executed instructions.
+         */
+        cpu_neg(cpu)->icount_decr.u16.low += insns_left;
     }
 
     cpu->cc->tcg_ops->restore_state_to_opc(cpu, tb, data);
@@ -XXX,XX +XXX,XX @@ int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
                 prof->restore_time + profile_getclock() - ti);
     qatomic_set(&prof->restore_count, prof->restore_count + 1);
 #endif
-    return 0;
 }
 
 bool cpu_restore_state(CPUState *cpu, uintptr_t host_pc, bool will_exit)
@@ -XXX,XX +XXX,XX @@ bool cpu_restore_state(CPUState *cpu, uintptr_t host_pc, bool will_exit)
     return false;
 }
 
+bool cpu_unwind_state_data(CPUState *cpu, uintptr_t host_pc, uint64_t *data)
+{
+    if (in_code_gen_buffer((const void *)(host_pc - tcg_splitwx_diff))) {
+        TranslationBlock *tb = tcg_tb_lookup(host_pc);
+        if (tb) {
+            return cpu_unwind_data_from_tb(tb, host_pc, data) >= 0;
+        }
+    }
+    return false;
+}
+
 void page_init(void)
 {
     page_size_init();
-- 
2.34.1

Avoid cpu_restore_state, and modifying env->eip out from
underneath the translator with TARGET_TB_PCREL.  There is
some slight duplication from x86_restore_state_to_opc,
but it's just a few lines.

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1269
Reviewed-by: Claudio Fontana <cfontana@suse.de>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 target/i386/helper.c | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/target/i386/helper.c b/target/i386/helper.c
index XXXXXXX..XXXXXXX 100644
--- a/target/i386/helper.c
+++ b/target/i386/helper.c
@@ -XXX,XX +XXX,XX @@ void cpu_x86_inject_mce(Monitor *mon, X86CPU *cpu, int bank,
     }
 }
 
+static target_ulong get_memio_eip(CPUX86State *env)
+{
+    uint64_t data[TARGET_INSN_START_WORDS];
+    CPUState *cs = env_cpu(env);
+
+    if (!cpu_unwind_state_data(cs, cs->mem_io_pc, data)) {
+        return env->eip;
+    }
+
+    /* Per x86_restore_state_to_opc. */
+    if (TARGET_TB_PCREL) {
+        return (env->eip & TARGET_PAGE_MASK) | data[0];
+    } else {
+        return data[0] - env->segs[R_CS].base;
+    }
+}
+
 void cpu_report_tpr_access(CPUX86State *env, TPRAccess access)
 {
     X86CPU *cpu = env_archcpu(env);
@@ -XXX,XX +XXX,XX @@ void cpu_report_tpr_access(CPUX86State *env, TPRAccess access)
 
         cpu_interrupt(cs, CPU_INTERRUPT_TPR);
     } else if (tcg_enabled()) {
-        cpu_restore_state(cs, cs->mem_io_pc, false);
+        target_ulong eip = get_memio_eip(env);
 
-        apic_handle_tpr_access_report(cpu->apic_state, env->eip, access);
+        apic_handle_tpr_access_report(cpu->apic_state, eip, access);
     }
 }
 #endif /* !CONFIG_USER_ONLY */
-- 
2.34.1

Since we do not plan to exit, use cpu_unwind_state_data
and extract exactly the data requested.

This is a bug fix, in that we no longer clobber dflag.

Consider:

        l.j       L2         // branch
        l.mfspr   r1, ppc    // delay

L1:     boom
L2:     l.lwa     r3, (r4)

Here, dflag would be set by cpu_restore_state (because that is the current
state of the cpu), but not cleared by tb_stop on exiting the TB
(because DisasContext has recorded the current value as zero).

The next TB begins at L2 with dflag incorrectly set.  If the load has a
tlb miss, then the exception will be delivered as per a delay slot:
with DSX set in the status register and PC decremented (delay slots
restart by re-executing the branch). This will cause the return from
interrupt to go to L1, and boom!

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 target/openrisc/sys_helper.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/target/openrisc/sys_helper.c b/target/openrisc/sys_helper.c
index XXXXXXX..XXXXXXX 100644
--- a/target/openrisc/sys_helper.c
+++ b/target/openrisc/sys_helper.c
@@ -XXX,XX +XXX,XX @@ target_ulong HELPER(mfspr)(CPUOpenRISCState *env, target_ulong rd,
                            target_ulong spr)
 {
 #ifndef CONFIG_USER_ONLY
+    uint64_t data[TARGET_INSN_START_WORDS];
     MachineState *ms = MACHINE(qdev_get_machine());
     OpenRISCCPU *cpu = env_archcpu(env);
     CPUState *cs = env_cpu(env);
@@ -XXX,XX +XXX,XX @@ target_ulong HELPER(mfspr)(CPUOpenRISCState *env, target_ulong rd,
         return env->evbar;
 
     case TO_SPR(0, 16): /* NPC (equals PC) */
-        cpu_restore_state(cs, GETPC(), false);
+        if (cpu_unwind_state_data(cs, GETPC(), data)) {
+            return data[0];
+        }
         return env->pc;
 
     case TO_SPR(0, 17): /* SR */
         return cpu_get_sr(env);
 
     case TO_SPR(0, 18): /* PPC */
-        cpu_restore_state(cs, GETPC(), false);
+        if (cpu_unwind_state_data(cs, GETPC(), data)) {
+            if (data[1] & 2) {
+                return data[0] - 4;
+            }
+        }
         return env->ppc;
 
     case TO_SPR(0, 32): /* EPCR */
-- 
2.34.1

The value passed is always true, and if the target's
synchronize_from_tb hook is non-trivial, not exiting
may be erroneous.

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index XXXXXXX..XXXXXXX 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -XXX,XX +XXX,XX @@ bool cpu_unwind_state_data(CPUState *cpu, uintptr_t host_pc, uint64_t *data);
  * cpu_restore_state:
  * @cpu: the cpu context
  * @host_pc: the host pc within the translation
- * @will_exit: true if the TB executed will be interrupted after some
-               cpu adjustments. Required for maintaining the correct
-               icount valus
  * @return: true if state was restored, false otherwise
  *
  * Attempt to restore the state for a fault occurring in translated
  * code. If @host_pc is not in translated code no state is
  * restored and the function returns false.
  */
-bool cpu_restore_state(CPUState *cpu, uintptr_t host_pc, bool will_exit);
+bool cpu_restore_state(CPUState *cpu, uintptr_t host_pc);
 
 G_NORETURN void cpu_loop_exit_noexc(CPUState *cpu);
 G_NORETURN void cpu_loop_exit(CPUState *cpu);
diff --git a/accel/tcg/cpu-exec-common.c b/accel/tcg/cpu-exec-common.c
index XXXXXXX..XXXXXXX 100644
--- a/accel/tcg/cpu-exec-common.c
+++ b/accel/tcg/cpu-exec-common.c
@@ -XXX,XX +XXX,XX @@ void cpu_loop_exit(CPUState *cpu)
 void cpu_loop_exit_restore(CPUState *cpu, uintptr_t pc)
 {
     if (pc) {
-        cpu_restore_state(cpu, pc, true);
+        cpu_restore_state(cpu, pc);
     }
     cpu_loop_exit(cpu);
 }
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index XXXXXXX..XXXXXXX 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -XXX,XX +XXX,XX @@ void cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
 #endif
 }
 
-bool cpu_restore_state(CPUState *cpu, uintptr_t host_pc, bool will_exit)
+bool cpu_restore_state(CPUState *cpu, uintptr_t host_pc)
 {
-    /*
-     * The pc update associated with restore without exit will
-     * break the relative pc adjustments performed by TARGET_TB_PCREL.
-     */
-    if (TARGET_TB_PCREL) {
-        assert(will_exit);
-    }
-
     /*
      * The host_pc has to be in the rx region of the code buffer.
      * If it is not we will not be able to resolve it here.
@@ -XXX,XX +XXX,XX @@ bool cpu_restore_state(CPUState *cpu, uintptr_t host_pc, bool will_exit)
     if (in_code_gen_buffer((const void *)(host_pc - tcg_splitwx_diff))) {
         TranslationBlock *tb = tcg_tb_lookup(host_pc);
         if (tb) {
-            cpu_restore_state_from_tb(cpu, tb, host_pc, will_exit);
+            cpu_restore_state_from_tb(cpu, tb, host_pc, true);
             return true;
         }
     }
diff --git a/target/alpha/helper.c b/target/alpha/helper.c
index XXXXXXX..XXXXXXX 100644
--- a/target/alpha/helper.c
+++ b/target/alpha/helper.c
@@ -XXX,XX +XXX,XX @@ G_NORETURN void dynamic_excp(CPUAlphaState *env, uintptr_t retaddr,
     cs->exception_index = excp;
     env->error_code = error;
     if (retaddr) {
-        cpu_restore_state(cs, retaddr, true);
+        cpu_restore_state(cs, retaddr);
         /* Floating-point exceptions (our only users) point to the next PC.  */
         env->pc += 4;
     }
diff --git a/target/alpha/mem_helper.c b/target/alpha/mem_helper.c
index XXXXXXX..XXXXXXX 100644
--- a/target/alpha/mem_helper.c
+++ b/target/alpha/mem_helper.c
@@ -XXX,XX +XXX,XX @@ static void do_unaligned_access(CPUAlphaState *env, vaddr addr, uintptr_t retadd
     uint64_t pc;
     uint32_t insn;
 
-    cpu_restore_state(env_cpu(env), retaddr, true);
+    cpu_restore_state(env_cpu(env), retaddr);
 
     pc = env->pc;
     insn = cpu_ldl_code(env, pc);
diff --git a/target/arm/op_helper.c b/target/arm/op_helper.c
index XXXXXXX..XXXXXXX 100644
--- a/target/arm/op_helper.c
+++ b/target/arm/op_helper.c
@@ -XXX,XX +XXX,XX @@ void raise_exception_ra(CPUARMState *env, uint32_t excp, uint32_t syndrome,
      * we must restore CPU state here before setting the syndrome
      * the caller passed us, and cannot use cpu_loop_exit_restore().
      */
-    cpu_restore_state(cs, ra, true);
+    cpu_restore_state(cs, ra);
     raise_exception(env, excp, syndrome, target_el);
 }
 
diff --git a/target/arm/tlb_helper.c b/target/arm/tlb_helper.c
index XXXXXXX..XXXXXXX 100644
--- a/target/arm/tlb_helper.c
+++ b/target/arm/tlb_helper.c
@@ -XXX,XX +XXX,XX @@ void arm_cpu_do_unaligned_access(CPUState *cs, vaddr vaddr,
     ARMMMUFaultInfo fi = {};
 
     /* now we have a real cpu fault */
-    cpu_restore_state(cs, retaddr, true);
+    cpu_restore_state(cs, retaddr);
 
     fi.type = ARMFault_Alignment;
     arm_deliver_fault(cpu, vaddr, access_type, mmu_idx, &fi);
@@ -XXX,XX +XXX,XX @@ void arm_cpu_do_transaction_failed(CPUState *cs, hwaddr physaddr,
     ARMMMUFaultInfo fi = {};
 
     /* now we have a real cpu fault */
-    cpu_restore_state(cs, retaddr, true);
+    cpu_restore_state(cs, retaddr);
 
     fi.ea = arm_extabort_type(response);
     fi.type = ARMFault_SyncExternal;
@@ -XXX,XX +XXX,XX @@ bool arm_cpu_tlb_fill(CPUState *cs, vaddr address, int size,
         return false;
     } else {
         /* now we have a real cpu fault */
-        cpu_restore_state(cs, retaddr, true);
+        cpu_restore_state(cs, retaddr);
         arm_deliver_fault(cpu, address, access_type, mmu_idx, fi);
     }
 }
@@ -XXX,XX +XXX,XX @@ void arm_cpu_record_sigsegv(CPUState *cs, vaddr addr,
      * We report both ESR and FAR to signal handlers.
      * For now, it's easiest to deliver the fault normally.
      */
-    cpu_restore_state(cs, ra, true);
+    cpu_restore_state(cs, ra);
     arm_deliver_fault(cpu, addr, access_type, MMU_USER_IDX, &fi);
 }
 
diff --git a/target/cris/helper.c b/target/cris/helper.c
index XXXXXXX..XXXXXXX 100644
--- a/target/cris/helper.c
+++ b/target/cris/helper.c
@@ -XXX,XX +XXX,XX @@ bool cris_cpu_tlb_fill(CPUState *cs, vaddr address, int size,
     cs->exception_index = EXCP_BUSFAULT;
     env->fault_vector = res.bf_vec;
     if (retaddr) {
-        if (cpu_restore_state(cs, retaddr, true)) {
+        if (cpu_restore_state(cs, retaddr)) {
             /* Evaluate flags after retranslation. */
             helper_top_evaluate_flags(env);
         }
diff --git a/target/i386/tcg/sysemu/svm_helper.c b/target/i386/tcg/sysemu/svm_helper.c
index XXXXXXX..XXXXXXX 100644
--- a/target/i386/tcg/sysemu/svm_helper.c
+++ b/target/i386/tcg/sysemu/svm_helper.c
@@ -XXX,XX +XXX,XX @@ void cpu_vmexit(CPUX86State *env, uint32_t exit_code, uint64_t exit_info_1,
 {
     CPUState *cs = env_cpu(env);
 
-    cpu_restore_state(cs, retaddr, true);
+    cpu_restore_state(cs, retaddr);
 
     qemu_log_mask(CPU_LOG_TB_IN_ASM, "vmexit(%08x, %016" PRIx64 ", %016"
                   PRIx64 ", " TARGET_FMT_lx ")!\n",
diff --git a/target/m68k/op_helper.c b/target/m68k/op_helper.c
index XXXXXXX..XXXXXXX 100644
--- a/target/m68k/op_helper.c
+++ b/target/m68k/op_helper.c
@@ -XXX,XX +XXX,XX @@ void m68k_cpu_transaction_failed(CPUState *cs, hwaddr physaddr, vaddr addr,
     M68kCPU *cpu = M68K_CPU(cs);
     CPUM68KState *env = &cpu->env;
 
-    cpu_restore_state(cs, retaddr, true);
+    cpu_restore_state(cs, retaddr);
 
     if (m68k_feature(env, M68K_FEATURE_M68040)) {
         env->mmu.mmusr = 0;
@@ -XXX,XX +XXX,XX @@ raise_exception_format2(CPUM68KState *env, int tt, int ilen, uintptr_t raddr)
     cs->exception_index = tt;
 
     /* Recover PC and CC_OP for the beginning of the insn.  */
-    cpu_restore_state(cs, raddr, true);
+    cpu_restore_state(cs, raddr);
 
     /* Flags are current in env->cc_*, or are undefined. */
     env->cc_op = CC_OP_FLAGS;
diff --git a/target/microblaze/helper.c b/target/microblaze/helper.c
index XXXXXXX..XXXXXXX 100644
--- a/target/microblaze/helper.c
+++ b/target/microblaze/helper.c
@@ -XXX,XX +XXX,XX @@ void mb_cpu_do_unaligned_access(CPUState *cs, vaddr addr,
     uint32_t esr, iflags;
 
     /* Recover the pc and iflags from the corresponding insn_start.  */
-    cpu_restore_state(cs, retaddr, true);
+    cpu_restore_state(cs, retaddr);
     iflags = cpu->env.iflags;
 
     qemu_log_mask(CPU_LOG_INT,
diff --git a/target/nios2/op_helper.c b/target/nios2/op_helper.c
index XXXXXXX..XXXXXXX 100644
--- a/target/nios2/op_helper.c
+++ b/target/nios2/op_helper.c
@@ -XXX,XX +XXX,XX @@ void nios2_cpu_loop_exit_advance(CPUNios2State *env, uintptr_t retaddr)
      * Do this here, rather than in restore_state_to_opc(),
      * lest we affect QEMU internal exceptions, like EXCP_DEBUG.
      */
-    cpu_restore_state(cs, retaddr, true);
+    cpu_restore_state(cs, retaddr);
     env->pc += 4;
     cpu_loop_exit(cs);
 }
diff --git a/target/openrisc/sys_helper.c b/target/openrisc/sys_helper.c
index XXXXXXX..XXXXXXX 100644
--- a/target/openrisc/sys_helper.c
+++ b/target/openrisc/sys_helper.c
@@ -XXX,XX +XXX,XX @@ void HELPER(mtspr)(CPUOpenRISCState *env, target_ulong spr, target_ulong rb)
         break;
 
     case TO_SPR(0, 16): /* NPC */
-        cpu_restore_state(cs, GETPC(), true);
+        cpu_restore_state(cs, GETPC());
         /* ??? Mirror or1ksim in not trashing delayed branch state
            when "jumping" to the current instruction.  */
         if (env->pc != rb) {
@@ -XXX,XX +XXX,XX @@ void HELPER(mtspr)(CPUOpenRISCState *env, target_ulong spr, target_ulong rb)
     case TO_SPR(8, 0):  /* PMR */
         env->pmr = rb;
         if (env->pmr & PMR_DME || env->pmr & PMR_SME) {
-            cpu_restore_state(cs, GETPC(), true);
+            cpu_restore_state(cs, GETPC());
             env->pc += 4;
             cs->halted = 1;
             raise_exception(cpu, EXCP_HALTED);
diff --git a/target/ppc/excp_helper.c b/target/ppc/excp_helper.c
index XXXXXXX..XXXXXXX 100644
--- a/target/ppc/excp_helper.c
+++ b/target/ppc/excp_helper.c
@@ -XXX,XX +XXX,XX @@ void ppc_cpu_do_unaligned_access(CPUState *cs, vaddr vaddr,
     uint32_t insn;
 
     /* Restore state and reload the insn we executed, for filling in DSISR.  */
-    cpu_restore_state(cs, retaddr, true);
+    cpu_restore_state(cs, retaddr);
     insn = cpu_ldl_code(env, env->nip);
 
     switch (env->mmu_model) {
diff --git a/target/s390x/tcg/excp_helper.c b/target/s390x/tcg/excp_helper.c
index XXXXXXX..XXXXXXX 100644
--- a/target/s390x/tcg/excp_helper.c
+++ b/target/s390x/tcg/excp_helper.c
@@ -XXX,XX +XXX,XX @@ G_NORETURN void tcg_s390_program_interrupt(CPUS390XState *env,
 {
     CPUState *cs = env_cpu(env);
 
-    cpu_restore_state(cs, ra, true);
+    cpu_restore_state(cs, ra);
     qemu_log_mask(CPU_LOG_INT, "program interrupt at %#" PRIx64 "\n",
                   env->psw.addr);
     trigger_pgm_exception(env, code);
diff --git a/target/tricore/op_helper.c b/target/tricore/op_helper.c
index XXXXXXX..XXXXXXX 100644
--- a/target/tricore/op_helper.c
+++ b/target/tricore/op_helper.c
@@ -XXX,XX +XXX,XX @@ void raise_exception_sync_internal(CPUTriCoreState *env, uint32_t class, int tin
 {
     CPUState *cs = env_cpu(env);
     /* in case we come from a helper-call we need to restore the PC */
-    cpu_restore_state(cs, pc, true);
+    cpu_restore_state(cs, pc);
 
     /* Tin is loaded into d[15] */
     env->gpr_d[15] = tin;
diff --git a/target/xtensa/helper.c b/target/xtensa/helper.c
index XXXXXXX..XXXXXXX 100644
--- a/target/xtensa/helper.c
+++ b/target/xtensa/helper.c
@@ -XXX,XX +XXX,XX @@ void xtensa_cpu_do_unaligned_access(CPUState *cs,
 
     assert(xtensa_option_enabled(env->config,
                                  XTENSA_OPTION_UNALIGNED_EXCEPTION));
-    cpu_restore_state(CPU(cpu), retaddr, true);
+    cpu_restore_state(CPU(cpu), retaddr);
     HELPER(exception_cause_vaddr)(env,
                                   env->pc, LOAD_STORE_ALIGNMENT_CAUSE,
                                   addr);
@@ -XXX,XX +XXX,XX @@ bool xtensa_cpu_tlb_fill(CPUState *cs, vaddr address, int size,
     } else if (probe) {
         return false;
     } else {
-        cpu_restore_state(cs, retaddr, true);
+        cpu_restore_state(cs, retaddr);
         HELPER(exception_cause_vaddr)(env, env->pc, ret, address);
     }
 }
@@ -XXX,XX +XXX,XX @@ void xtensa_cpu_do_transaction_failed(CPUState *cs, hwaddr physaddr, vaddr addr,
     XtensaCPU *cpu = XTENSA_CPU(cs);
     CPUXtensaState *env = &cpu->env;
 
-    cpu_restore_state(cs, retaddr, true);
+    cpu_restore_state(cs, retaddr);
     HELPER(exception_cause_vaddr)(env, env->pc,
                                   access_type == MMU_INST_FETCH ?
                                   INSTR_PIF_ADDR_ERROR_CAUSE :
-- 
2.34.1

The value passed is always true.

Reviewed-by: Claudio Fontana <cfontana@suse.de>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 accel/tcg/internal.h      |  2 +-
 accel/tcg/tb-maint.c      |  4 ++--
 accel/tcg/translate-all.c | 15 +++++++--------
 3 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/accel/tcg/internal.h b/accel/tcg/internal.h
index XXXXXXX..XXXXXXX 100644
--- a/accel/tcg/internal.h
+++ b/accel/tcg/internal.h
@@ -XXX,XX +XXX,XX @@ TranslationBlock *tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
                                tb_page_addr_t phys_page2);
 bool tb_invalidate_phys_page_unwind(tb_page_addr_t addr, uintptr_t pc);
 void cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
-                               uintptr_t host_pc, bool reset_icount);
+                               uintptr_t host_pc);
 
 /* Return the current PC from CPU, which may be cached in TB. */
 static inline target_ulong log_pc(CPUState *cpu, const TranslationBlock *tb)
diff --git a/accel/tcg/tb-maint.c b/accel/tcg/tb-maint.c
index XXXXXXX..XXXXXXX 100644
--- a/accel/tcg/tb-maint.c
+++ b/accel/tcg/tb-maint.c
@@ -XXX,XX +XXX,XX @@ tb_invalidate_phys_page_range__locked(struct page_collection *pages,
                  * restore the CPU state.
                  */
                 current_tb_modified = true;
-                cpu_restore_state_from_tb(cpu, current_tb, retaddr, true);
+                cpu_restore_state_from_tb(cpu, current_tb, retaddr);
             }
 #endif /* TARGET_HAS_PRECISE_SMC */
             tb_phys_invalidate__locked(tb);
@@ -XXX,XX +XXX,XX @@ bool tb_invalidate_phys_page_unwind(tb_page_addr_t addr, uintptr_t pc)
              * function to partially restore the CPU state.
              */
             current_tb_modified = true;
-            cpu_restore_state_from_tb(cpu, current_tb, pc, true);
+            cpu_restore_state_from_tb(cpu, current_tb, pc);
         }
 #endif /* TARGET_HAS_PRECISE_SMC */
         tb_phys_invalidate(tb, addr);
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index XXXXXXX..XXXXXXX 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -XXX,XX +XXX,XX @@ static int cpu_unwind_data_from_tb(TranslationBlock *tb, uintptr_t host_pc,
 }
 
 /*
- * The cpu state corresponding to 'host_pc' is restored.
- * When reset_icount is true, current TB will be interrupted and
- * icount should be recalculated.
+ * The cpu state corresponding to 'host_pc' is restored in
+ * preparation for exiting the TB.
  */
 void cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
-                               uintptr_t host_pc, bool reset_icount)
+                               uintptr_t host_pc)
 {
     uint64_t data[TARGET_INSN_START_WORDS];
 #ifdef CONFIG_PROFILER
@@ -XXX,XX +XXX,XX @@ void cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
         return;
     }
 
-    if (reset_icount && (tb_cflags(tb) & CF_USE_ICOUNT)) {
+    if (tb_cflags(tb) & CF_USE_ICOUNT) {
         assert(icount_enabled());
         /*
          * Reset the cycle counter to the start of the block and
@@ -XXX,XX +XXX,XX @@ bool cpu_restore_state(CPUState *cpu, uintptr_t host_pc)
     if (in_code_gen_buffer((const void *)(host_pc - tcg_splitwx_diff))) {
         TranslationBlock *tb = tcg_tb_lookup(host_pc);
         if (tb) {
-            cpu_restore_state_from_tb(cpu, tb, host_pc, true);
+            cpu_restore_state_from_tb(cpu, tb, host_pc);
             return true;
         }
     }
@@ -XXX,XX +XXX,XX @@ void tb_check_watchpoint(CPUState *cpu, uintptr_t retaddr)
     tb = tcg_tb_lookup(retaddr);
     if (tb) {
         /* We can use retranslation to find the PC.  */
-        cpu_restore_state_from_tb(cpu, tb, retaddr, true);
+        cpu_restore_state_from_tb(cpu, tb, retaddr);
         tb_phys_invalidate(tb, -1);
     } else {
         /* The exception probably happened in a helper.  The CPU state should
@@ -XXX,XX +XXX,XX @@ void cpu_io_recompile(CPUState *cpu, uintptr_t retaddr)
         cpu_abort(cpu, "cpu_io_recompile: could not find TB for pc=%p",
                   (void *)retaddr);
     }
-    cpu_restore_state_from_tb(cpu, tb, retaddr, true);
+    cpu_restore_state_from_tb(cpu, tb, retaddr);
 
     /*
      * Some guests must re-execute the branch when re-executing a delay
-- 
2.34.1

The helpers for reset_rf, cli, sti, clac, stac are
completely trivial; implement them inline.
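
As a rough plain-C model of what the inline expansion computes (a
load of eflags, an OR or AND-NOT with a constant mask, and a store),
consider the sketch below. ModelEnv and the model_* functions are
hypothetical names for this example and are not part of the
translator. The mask values are the architectural EFLAGS bit
positions.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural EFLAGS bits: IF is bit 9, RF is bit 16, AC is bit 18. */
#define IF_MASK 0x00000200u
#define RF_MASK 0x00010000u
#define AC_MASK 0x00040000u

/* Hypothetical model of the guest state; not CPUX86State. */
typedef struct {
    uint64_t eflags;
} ModelEnv;

/* Models sti and stac: set the bits in mask. */
static void model_set_eflags(ModelEnv *env, uint64_t mask)
{
    env->eflags |= mask;
}

/* Models cli, clac and the RF reset: clear the bits in mask. */
static void model_reset_eflags(ModelEnv *env, uint64_t mask)
{
    env->eflags &= ~mask;
}
```

Each operation touches only the bits in its mask, which is why a
helper call is unnecessary for these cases.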

Drop some nearby #if 0 code.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 target/i386/helper.h        |  5 -----
 target/i386/tcg/cc_helper.c | 41 -------------------------------------
 target/i386/tcg/translate.c | 30 ++++++++++++++++++++++-----
 3 files changed, 25 insertions(+), 51 deletions(-)

diff --git a/target/i386/helper.h b/target/i386/helper.h
index XXXXXXX..XXXXXXX 100644
--- a/target/i386/helper.h
+++ b/target/i386/helper.h
@@ -XXX,XX +XXX,XX @@ DEF_HELPER_2(syscall, void, env, int)
 DEF_HELPER_2(sysret, void, env, int)
 #endif
 DEF_HELPER_FLAGS_2(pause, TCG_CALL_NO_WG, noreturn, env, int)
-DEF_HELPER_1(reset_rf, void, env)
 DEF_HELPER_FLAGS_3(raise_interrupt, TCG_CALL_NO_WG, noreturn, env, int, int)
 DEF_HELPER_FLAGS_2(raise_exception, TCG_CALL_NO_WG, noreturn, env, int)
-DEF_HELPER_1(cli, void, env)
-DEF_HELPER_1(sti, void, env)
-DEF_HELPER_1(clac, void, env)
-DEF_HELPER_1(stac, void, env)
 DEF_HELPER_3(boundw, void, env, tl, int)
 DEF_HELPER_3(boundl, void, env, tl, int)
 
diff --git a/target/i386/tcg/cc_helper.c b/target/i386/tcg/cc_helper.c
index XXXXXXX..XXXXXXX 100644
--- a/target/i386/tcg/cc_helper.c
+++ b/target/i386/tcg/cc_helper.c
@@ -XXX,XX +XXX,XX @@ void helper_clts(CPUX86State *env)
     env->cr[0] &= ~CR0_TS_MASK;
     env->hflags &= ~HF_TS_MASK;
 }
-
-void helper_reset_rf(CPUX86State *env)
-{
-    env->eflags &= ~RF_MASK;
-}
-
-void helper_cli(CPUX86State *env)
-{
-    env->eflags &= ~IF_MASK;
-}
-
-void helper_sti(CPUX86State *env)
-{
-    env->eflags |= IF_MASK;
-}
-
-void helper_clac(CPUX86State *env)
-{
-    env->eflags &= ~AC_MASK;
-}
-
-void helper_stac(CPUX86State *env)
-{
-    env->eflags |= AC_MASK;
-}
-
-#if 0
-/* vm86plus instructions */
-void helper_cli_vm(CPUX86State *env)
-{
-    env->eflags &= ~VIF_MASK;
-}
-
-void helper_sti_vm(CPUX86State *env)
-{
-    env->eflags |= VIF_MASK;
-    if (env->eflags & VIP_MASK) {
-        raise_exception_ra(env, EXCP0D_GPF, GETPC());
-    }
-}
-#endif
diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index XXXXXXX..XXXXXXX 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -XXX,XX +XXX,XX @@ static void gen_reset_hflag(DisasContext *s, uint32_t mask)
     }
 }
 
+static void gen_set_eflags(DisasContext *s, target_ulong mask)
+{
+    TCGv t = tcg_temp_new();
+
+    tcg_gen_ld_tl(t, cpu_env, offsetof(CPUX86State, eflags));
+    tcg_gen_ori_tl(t, t, mask);
+    tcg_gen_st_tl(t, cpu_env, offsetof(CPUX86State, eflags));
+    tcg_temp_free(t);
+}
+
+static void gen_reset_eflags(DisasContext *s, target_ulong mask)
+{
+    TCGv t = tcg_temp_new();
+
+    tcg_gen_ld_tl(t, cpu_env, offsetof(CPUX86State, eflags));
+    tcg_gen_andi_tl(t, t, ~mask);
+    tcg_gen_st_tl(t, cpu_env, offsetof(CPUX86State, eflags));
+    tcg_temp_free(t);
+}
+
 /* Clear BND registers during legacy branches.  */
 static void gen_bnd_jmp(DisasContext *s)
 {
@@ -XXX,XX +XXX,XX @@ do_gen_eob_worker(DisasContext *s, bool inhibit, bool recheck_tf, bool jr)
     }
 
     if (s->base.tb->flags & HF_RF_MASK) {
-        gen_helper_reset_rf(cpu_env);
+        gen_reset_eflags(s, RF_MASK);
     }
     if (recheck_tf) {
         gen_helper_rechecking_single_step(cpu_env);
@@ -XXX,XX +XXX,XX @@ static bool disas_insn(DisasContext *s, CPUState *cpu)
 #endif
     case 0xfa: /* cli */
         if (check_iopl(s)) {
-            gen_helper_cli(cpu_env);
+            gen_reset_eflags(s, IF_MASK);
         }
         break;
     case 0xfb: /* sti */
         if (check_iopl(s)) {
-            gen_helper_sti(cpu_env);
+            gen_set_eflags(s, IF_MASK);
             /* interruptions are enabled only the first insn after sti */
             gen_update_eip_next(s);
             gen_eob_inhibit_irq(s, true);
@@ -XXX,XX +XXX,XX @@ static bool disas_insn(DisasContext *s, CPUState *cpu)
                 || CPL(s) != 0) {
                 goto illegal_op;
             }
-            gen_helper_clac(cpu_env);
+            gen_reset_eflags(s, AC_MASK);
             s->base.is_jmp = DISAS_EOB_NEXT;
             break;
 
@@ -XXX,XX +XXX,XX @@ static bool disas_insn(DisasContext *s, CPUState *cpu)
                 || CPL(s) != 0) {
                 goto illegal_op;
             }
-            gen_helper_stac(cpu_env);
+            gen_set_eflags(s, AC_MASK);
             s->base.is_jmp = DISAS_EOB_NEXT;
             break;
 
-- 
2.34.1

The following changes since commit 7fe6cb68117ac856e03c93d18aca09de015392b0:

  Merge tag 'pull-target-arm-20230530-1' of https://git.linaro.org/people/pmaydell/qemu-arm into staging (2023-05-30 08:02:05 -0700)

are available in the Git repository at:

  https://gitlab.com/rth7680/qemu.git tags/pull-tcg-20230530

for you to fetch changes up to 276d77de503e8f5f5cbd3f7d94302ca12d1d982e:

tests/decode: Add tests for various named-field cases (2023-05-30 10:55:39 -0700)

----------------------------------------------------------------
Improvements to 128-bit atomics:
  - Separate __int128_t type and arithmetic detection
  - Support 128-bit load/store in backend for i386, aarch64, ppc64, s390x
  - Accelerate atomics via host/include/
Decodetree:
  - Add named field syntax
  - Move tests to meson

----------------------------------------------------------------
Peter Maydell (5):
      docs: Document decodetree named field syntax
      scripts/decodetree: Pass lvalue-formatter function to str_extract()
      scripts/decodetree: Implement a topological sort
      scripts/decodetree: Implement named field support
      tests/decode: Add tests for various named-field cases

Richard Henderson (22):
      tcg: Fix register move type in tcg_out_ld_helper_ret
      accel/tcg: Fix check for page writeability in load_atomic16_or_exit
      meson: Split test for __int128_t type from __int128_t arithmetic
      qemu/atomic128: Add x86_64 atomic128-ldst.h
      tcg/i386: Support 128-bit load/store
      tcg/aarch64: Rename temporaries
      tcg/aarch64: Reserve TCG_REG_TMP1, TCG_REG_TMP2
      tcg/aarch64: Simplify constraints on qemu_ld/st
      tcg/aarch64: Support 128-bit load/store
      tcg/ppc: Support 128-bit load/store
      tcg/s390x: Support 128-bit load/store
      accel/tcg: Extract load_atom_extract_al16_or_al8 to host header
      accel/tcg: Extract store_atom_insert_al16 to host header
      accel/tcg: Add x86_64 load_atom_extract_al16_or_al8
      accel/tcg: Add aarch64 lse2 load_atom_extract_al16_or_al8
      accel/tcg: Add aarch64 store_atom_insert_al16
      tcg: Remove TCG_TARGET_TLB_DISPLACEMENT_BITS
      decodetree: Add --test-for-error
      decodetree: Fix recursion in prop_format and build_tree
      decodetree: Diagnose empty pattern group
      decodetree: Do not remove output_file from /dev
      tests/decode: Convert tests to meson

 docs/devel/decodetree.rst                         |  33 ++-
 meson.build                                       |  15 +-
 host/include/aarch64/host/load-extract-al16-al8.h |  40 ++++
 host/include/aarch64/host/store-insert-al16.h     |  47 ++++
 host/include/generic/host/load-extract-al16-al8.h |  45 ++++
 host/include/generic/host/store-insert-al16.h     |  50 ++++
 host/include/x86_64/host/atomic128-ldst.h         |  68 ++++++
 host/include/x86_64/host/load-extract-al16-al8.h  |  50 ++++
 include/qemu/int128.h                             |   4 +-
 tcg/aarch64/tcg-target-con-set.h                  |   4 +-
 tcg/aarch64/tcg-target-con-str.h                  |   1 -
 tcg/aarch64/tcg-target.h                          |  12 +-
 tcg/arm/tcg-target.h                              |   1 -
 tcg/i386/tcg-target.h                             |   5 +-
 tcg/mips/tcg-target.h                             |   1 -
 tcg/ppc/tcg-target-con-set.h                      |   2 +
 tcg/ppc/tcg-target-con-str.h                      |   1 +
 tcg/ppc/tcg-target.h                              |   4 +-
 tcg/riscv/tcg-target.h                            |   1 -
 tcg/s390x/tcg-target-con-set.h                    |   2 +
 tcg/s390x/tcg-target.h                            |   3 +-
 tcg/sparc64/tcg-target.h                          |   1 -
 tcg/tci/tcg-target.h                              |   1 -
 tests/decode/err_field10.decode                   |   7 +
 tests/decode/err_field7.decode                    |   7 +
 tests/decode/err_field8.decode                    |   8 +
 tests/decode/err_field9.decode                    |  14 ++
 tests/decode/succ_named_field.decode              |  19 ++
 tcg/tcg.c                                         |   4 +-
 accel/tcg/ldst_atomicity.c.inc                    |  80 +------
 tcg/aarch64/tcg-target.c.inc                      | 243 +++++++++++++++-----
 tcg/i386/tcg-target.c.inc                         | 191 +++++++++++++++-
 tcg/ppc/tcg-target.c.inc                          | 108 ++++++++-
 tcg/s390x/tcg-target.c.inc                        | 107 ++++++++-
 scripts/decodetree.py                             | 265 ++++++++++++++++++++--
 tests/decode/check.sh                             |  24 --
 tests/decode/meson.build                          |  64 ++++++
 tests/meson.build                                 |   5 +-
 38 files changed, 1312 insertions(+), 225 deletions(-)
 create mode 100644 host/include/aarch64/host/load-extract-al16-al8.h
 create mode 100644 host/include/aarch64/host/store-insert-al16.h
 create mode 100644 host/include/generic/host/load-extract-al16-al8.h
 create mode 100644 host/include/generic/host/store-insert-al16.h
 create mode 100644 host/include/x86_64/host/atomic128-ldst.h
 create mode 100644 host/include/x86_64/host/load-extract-al16-al8.h
 create mode 100644 tests/decode/err_field10.decode
 create mode 100644 tests/decode/err_field7.decode
 create mode 100644 tests/decode/err_field8.decode
 create mode 100644 tests/decode/err_field9.decode
 create mode 100644 tests/decode/succ_named_field.decode
 delete mode 100755 tests/decode/check.sh
 create mode 100644 tests/decode/meson.build

The first move incorrectly used TCG_TYPE_I32, while the second
move correctly used TCG_TYPE_REG.  The mismatch prevented a 64-bit
host from moving all 128 bits of the return value.

Fixes: ebebea53ef8 ("tcg: Support TCG_TYPE_I128 in tcg_out_{ld,st}_helper_{args,ret}")
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
---
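As a rough illustration of the failure mode (not QEMU code): on a 64-bit
host each half of an I128 return value fills a whole 64-bit register, so
a move typed as 32 bits loses the upper half of the low word.  The names
move_reg, move_ret128 and Pair128 are hypothetical, and the narrow move
is modelled as zero-extension; what a real host leaves in the upper bits
is host-dependent.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a typed register move: only `bits` bits survive. */
static uint64_t move_reg(uint64_t src, unsigned bits)
{
    assert(bits == 32 || bits == 64);
    return bits == 64 ? src : (uint64_t)(uint32_t)src;
}

/* A 128-bit value held as two register-sized halves. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} Pair128;

/* lo_bits models the type of mov[0], hi_bits the type of mov[1]. */
static Pair128 move_ret128(uint64_t ret_lo, uint64_t ret_hi,
                           unsigned lo_bits, unsigned hi_bits)
{
    Pair128 r = {
        .lo = move_reg(ret_lo, lo_bits),
        .hi = move_reg(ret_hi, hi_bits),
    };
    return r;
}
```

Calling move_ret128 with (32, 64) mirrors the pre-fix behaviour; (64, 64)
mirrors a 64-bit host after the fix.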
 tcg/tcg.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tcg/tcg.c b/tcg/tcg.c
index XXXXXXX..XXXXXXX 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -XXX,XX +XXX,XX @@ static void tcg_out_ld_helper_ret(TCGContext *s, const TCGLabelQemuLdst *ldst,
     mov[0].dst = ldst->datalo_reg;
     mov[0].src =
         tcg_target_call_oarg_reg(TCG_CALL_RET_NORMAL, HOST_BIG_ENDIAN);
-    mov[0].dst_type = TCG_TYPE_I32;
-    mov[0].src_type = TCG_TYPE_I32;
+    mov[0].dst_type = TCG_TYPE_REG;
+    mov[0].src_type = TCG_TYPE_REG;
     mov[0].src_ext = TCG_TARGET_REG_BITS == 32 ? MO_32 : MO_64;
 
     mov[1].dst = ldst->datahi_reg;
-- 
2.34.1

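PAGE_WRITE is current writability, as modified by TB protection;
PAGE_WRITE_ORG is the original page writability.  The check wants
the latter: a page that is only temporarily write-protected for
translated code may still be modified by the guest.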

Fixes: cdfac37be0d ("accel/tcg: Honor atomicity of loads")
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
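A small sketch of why the two flags answer different questions.  The
flag values and all SK_/sketch names are assumptions made for
illustration, not QEMU definitions; the sketch deliberately models only
the flag bits and says nothing about the page_check_range() calling
convention.

```c
#include <stdbool.h>

/* Illustrative flag bits; treat the values as assumptions. */
enum {
    SK_PAGE_WRITE     = 1 << 1,  /* writable right now */
    SK_PAGE_WRITE_ORG = 1 << 6,  /* writable as originally mapped */
};

/* Model TB protection: drop current write permission, keep the original. */
static int sketch_protect_for_tb(int flags)
{
    return flags & ~SK_PAGE_WRITE;
}

/* True only if the guest could never write the page. */
static bool sketch_page_is_immutable(int flags)
{
    return (flags & SK_PAGE_WRITE_ORG) == 0;
}

/* The weaker question: is the page non-writable at this moment? */
static bool sketch_page_not_writable_now(int flags)
{
    return (flags & SK_PAGE_WRITE) == 0;
}
```

For a writable page that holds translated code, the "now" test reports
non-writable while the "immutable" test correctly reports false.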
 accel/tcg/ldst_atomicity.c.inc | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/accel/tcg/ldst_atomicity.c.inc b/accel/tcg/ldst_atomicity.c.inc
index XXXXXXX..XXXXXXX 100644
--- a/accel/tcg/ldst_atomicity.c.inc
+++ b/accel/tcg/ldst_atomicity.c.inc
@@ -XXX,XX +XXX,XX @@ static uint64_t load_atomic8_or_exit(CPUArchState *env, uintptr_t ra, void *pv)
      * another process, because the fallback start_exclusive solution
      * provides no protection across processes.
      */
-    if (!page_check_range(h2g(pv), 8, PAGE_WRITE)) {
+    if (!page_check_range(h2g(pv), 8, PAGE_WRITE_ORG)) {
         uint64_t *p = __builtin_assume_aligned(pv, 8);
         return *p;
     }
@@ -XXX,XX +XXX,XX @@ static Int128 load_atomic16_or_exit(CPUArchState *env, uintptr_t ra, void *pv)
      * another process, because the fallback start_exclusive solution
      * provides no protection across processes.
      */
-    if (!page_check_range(h2g(p), 16, PAGE_WRITE)) {
+    if (!page_check_range(h2g(p), 16, PAGE_WRITE_ORG)) {
         return *p;
     }
 #endif
-- 
2.34.1

Older versions of clang lack the runtime functions needed for
__int128_t arithmetic with -fsanitize=undefined (see 464e3671f9d5c),
so we cannot use __int128_t to implement Int128.  The type itself is
still present and data movement works, so it can be used for atomic128.

Probe for both CONFIG_INT128_TYPE and CONFIG_INT128, adjust
qemu/int128.h to define Int128Alias if CONFIG_INT128_TYPE,
and adjust the meson probe for atomics to use has_int128_type.

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
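To make the distinction concrete, here is a minimal sketch of using the
compiler's 128-bit type purely as a 16-byte container, with no
arithmetic that could call into a runtime library.  SketchInt128,
SketchAlias and sketch_copy128 are hypothetical names loosely modelled
on Int128 and Int128Alias; the field layout is illustrative only.

```c
#include <stdint.h>

/* Struct form, as used when 128-bit arithmetic is unavailable. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} SketchInt128;

#ifdef __SIZEOF_INT128__
/* The compiler type serves only as a 16-byte container. */
typedef union {
    __uint128_t u;
    SketchInt128 s;
} SketchAlias;
#endif

/* Copy a 128-bit value without any add, multiply, or shift on the type. */
static SketchInt128 sketch_copy128(SketchInt128 in)
{
#ifdef __SIZEOF_INT128__
    SketchAlias a = { .s = in };
    SketchAlias b;

    b.u = a.u;      /* plain data movement only */
    return b.s;
#else
    return in;      /* no 128-bit type at all */
#endif
}
```

This corresponds to what the compile-only probe checks: assignment
between the types, as opposed to the link test that multiplies.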
 meson.build           | 15 ++++++++++-----
 include/qemu/int128.h |  4 ++--
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/meson.build b/meson.build
index XXXXXXX..XXXXXXX 100644
--- a/meson.build
+++ b/meson.build
@@ -XXX,XX +XXX,XX @@ config_host_data.set('CONFIG_ATOMIC64', cc.links('''
     return 0;
   }'''))
 
-has_int128 = cc.links('''
+has_int128_type = cc.compiles('''
+  __int128_t a;
+  __uint128_t b;
+  int main(void) { b = a; }''')
+config_host_data.set('CONFIG_INT128_TYPE', has_int128_type)
+
+has_int128 = has_int128_type and cc.links('''
   __int128_t a;
   __uint128_t b;
   int main (void) {
@@ -XXX,XX +XXX,XX @@ has_int128 = cc.links('''
     a = a * a;
     return 0;
   }''')
-
 config_host_data.set('CONFIG_INT128', has_int128)
 
-if has_int128
+if has_int128_type
   # "do we have 128-bit atomics which are handled inline and specifically not
   # via libatomic". The reason we can't use libatomic is documented in the
   # comment starting "GCC is a house divided" in include/qemu/atomic128.h.
@@ -XXX,XX +XXX,XX @@ if has_int128
   # __alignof(unsigned __int128) for the host.
   atomic_test_128 = '''
     int main(int ac, char **av) {
-      unsigned __int128 *p = __builtin_assume_aligned(av[ac - 1], 16);
+      __uint128_t *p = __builtin_assume_aligned(av[ac - 1], 16);
       p[1] = __atomic_load_n(&p[0], __ATOMIC_RELAXED);
       __atomic_store_n(&p[2], p[3], __ATOMIC_RELAXED);
       __atomic_compare_exchange_n(&p[4], &p[5], p[6], 0, __ATOMIC_RELAXED, __ATOMIC_RELAXED);
@@ -XXX,XX +XXX,XX @@ if has_int128
       config_host_data.set('CONFIG_CMPXCHG128', cc.links('''
         int main(void)
         {
-          unsigned __int128 x = 0, y = 0;
+          __uint128_t x = 0, y = 0;
           __sync_val_compare_and_swap_16(&x, y, x);
           return 0;
         }
diff --git a/include/qemu/int128.h b/include/qemu/int128.h
index XXXXXXX..XXXXXXX 100644
--- a/include/qemu/int128.h
+++ b/include/qemu/int128.h
@@ -XXX,XX +XXX,XX @@ static inline void bswap128s(Int128 *s)
  * a possible structure and the native types.  Ease parameter passing
  * via use of the transparent union extension.
  */
-#ifdef CONFIG_INT128
+#ifdef CONFIG_INT128_TYPE
 typedef union {
     __uint128_t u;
     __int128_t i;
@@ -XXX,XX +XXX,XX @@ typedef union {
 } Int128Alias __attribute__((transparent_union));
 #else
 typedef Int128 Int128Alias;
-#endif /* CONFIG_INT128 */
+#endif /* CONFIG_INT128_TYPE */
 
 #endif /* INT128_H */
-- 
2.34.1

With CPUINFO_ATOMIC_VMOVDQA, we can perform proper atomic
load/store without cmpxchg16b.

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
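For contrast with the vmovdqa path, the compare-and-swap fallback
pattern used in the new header can be sketched at 64 bits, which builds
without -mcx16.  The function names are hypothetical; only the
__sync_* builtins are real compiler interfaces.

```c
#include <stdint.h>

/*
 * Atomic read built from compare-and-swap: swapping 0 for 0 never
 * changes the stored value but returns the current contents.  It is
 * still a write cycle, so it cannot be used on read-only memory.
 */
static uint64_t sketch_read_via_cas(uint64_t *ptr)
{
    return __sync_val_compare_and_swap(ptr, 0, 0);
}

/*
 * Atomic store built from compare-and-swap: retry until no other
 * writer slipped in between the plain read and the swap.
 */
static void sketch_set_via_cas(uint64_t *ptr, uint64_t val)
{
    uint64_t old;

    do {
        old = *ptr;
    } while (!__sync_bool_compare_and_swap(ptr, old, val));
}
```

The read-via-CAS form is why a separate read-only variant is useful:
a true 16-byte load works on pages the host maps without write access.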
 host/include/x86_64/host/atomic128-ldst.h | 68 +++++++++++++++++++++++
 1 file changed, 68 insertions(+)
 create mode 100644 host/include/x86_64/host/atomic128-ldst.h

diff --git a/host/include/x86_64/host/atomic128-ldst.h b/host/include/x86_64/host/atomic128-ldst.h
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/host/include/x86_64/host/atomic128-ldst.h
@@ -XXX,XX +XXX,XX @@
+/*
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ * Load/store for 128-bit atomic operations, x86_64 version.
+ *
+ * Copyright (C) 2023 Linaro, Ltd.
+ *
+ * See docs/devel/atomics.rst for discussion about the guarantees each
+ * atomic primitive is meant to provide.
+ */
+
+#ifndef X86_64_ATOMIC128_LDST_H
+#define X86_64_ATOMIC128_LDST_H
+
+#ifdef CONFIG_INT128_TYPE
+#include "host/cpuinfo.h"
+#include "tcg/debug-assert.h"
+
+/*
+ * Through clang 16, with -mcx16, __atomic_load_n is incorrectly
+ * expanded to a read-write operation: lock cmpxchg16b.
+ */
+
+#define HAVE_ATOMIC128_RO  likely(cpuinfo & CPUINFO_ATOMIC_VMOVDQA)
+#define HAVE_ATOMIC128_RW  1
+
+static inline Int128 atomic16_read_ro(const Int128 *ptr)
+{
+    Int128Alias r;
+
+    tcg_debug_assert(HAVE_ATOMIC128_RO);
+    asm("vmovdqa %1, %0" : "=x" (r.i) : "m" (*ptr));
+
+    return r.s;
+}
+
+static inline Int128 atomic16_read_rw(Int128 *ptr)
+{
+    __int128_t *ptr_align = __builtin_assume_aligned(ptr, 16);
+    Int128Alias r;
+
+    if (HAVE_ATOMIC128_RO) {
+        asm("vmovdqa %1, %0" : "=x" (r.i) : "m" (*ptr_align));
+    } else {
+        r.i = __sync_val_compare_and_swap_16(ptr_align, 0, 0);
+    }
+    return r.s;
+}
+
+static inline void atomic16_set(Int128 *ptr, Int128 val)
+{
+    __int128_t *ptr_align = __builtin_assume_aligned(ptr, 16);
+    Int128Alias new = { .s = val };
+
+    if (HAVE_ATOMIC128_RO) {
+        asm("vmovdqa %1, %0" : "=m"(*ptr_align) : "x" (new.i));
+    } else {
+        __int128_t old;
+        do {
+            old = *ptr_align;
+        } while (!__sync_bool_compare_and_swap_16(ptr_align, old, new.i));
+    }
+}
+#else
+/* Provide QEMU_ERROR stubs. */
+#include "host/include/generic/host/atomic128-ldst.h"
+#endif
+
+#endif /* X86_64_ATOMIC128_LDST_H */
-- 
2.34.1

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
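The load and store paths added here choose between integer and vector
moves.  Going by the comments in the diff, the decision can be modelled
roughly as follows.  All Sketch/SK_ names are hypothetical; atom_log2
and align_log2 stand in for h.aa.atom and h.aa.align as log2 byte
counts (assuming 4 means 16 bytes), and vmovdqu_atomic stands in for
the CPUINFO_ATOMIC_VMOVDQU test.

```c
#include <stdbool.h>

typedef enum {
    SK_INT_PAIR,        /* two 64-bit integer moves (MOV or MOVBE) */
    SK_VMOVDQA,         /* aligned vector move */
    SK_VMOVDQU,         /* unaligned vector move, atomic on some CPUs */
    SK_RUNTIME_TEST,    /* test the address, then VMOVDQA or VMOVDQU */
} SketchChoice;

static SketchChoice sketch_choose_128(unsigned atom_log2,
                                      unsigned align_log2,
                                      bool vmovdqu_atomic)
{
    /* No 16-byte atomicity required: stay in integer registers. */
    if (atom_log2 < 4) {
        return SK_INT_PAIR;
    }
    /* Alignment already guaranteed: the aligned move always works. */
    if (align_log2 >= 4) {
        return SK_VMOVDQA;
    }
    /* Otherwise rely on the CPU, or fall back to a runtime test. */
    return vmovdqu_atomic ? SK_VMOVDQU : SK_RUNTIME_TEST;
}
```

The integer-pair case is also the only one that permits byte swapping,
which matches the restriction added to tcg_target_has_memory_bswap().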
 tcg/i386/tcg-target.h     |   4 +-
 tcg/i386/tcg-target.c.inc | 191 +++++++++++++++++++++++++++++++++++++-
 2 files changed, 190 insertions(+), 5 deletions(-)

diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
index XXXXXXX..XXXXXXX 100644
--- a/tcg/i386/tcg-target.h
+++ b/tcg/i386/tcg-target.h
@@ -XXX,XX +XXX,XX @@ typedef enum {
 #define have_avx1         (cpuinfo & CPUINFO_AVX1)
 #define have_avx2         (cpuinfo & CPUINFO_AVX2)
 #define have_movbe        (cpuinfo & CPUINFO_MOVBE)
-#define have_atomic16     (cpuinfo & CPUINFO_ATOMIC_VMOVDQA)
 
 /*
  * There are interesting instructions in AVX512, so long as we have AVX512VL,
@@ -XXX,XX +XXX,XX @@ typedef enum {
 #define TCG_TARGET_HAS_qemu_st8_i32     1
 #endif
 
-#define TCG_TARGET_HAS_qemu_ldst_i128   0
+#define TCG_TARGET_HAS_qemu_ldst_i128 \
+    (TCG_TARGET_REG_BITS == 64 && (cpuinfo & CPUINFO_ATOMIC_VMOVDQA))
 
 /* We do not support older SSE systems, only beginning with AVX1.  */
 #define TCG_TARGET_HAS_v64              have_avx1
diff --git a/tcg/i386/tcg-target.c.inc b/tcg/i386/tcg-target.c.inc
index XXXXXXX..XXXXXXX 100644
--- a/tcg/i386/tcg-target.c.inc
+++ b/tcg/i386/tcg-target.c.inc
@@ -XXX,XX +XXX,XX @@ static const int tcg_target_reg_alloc_order[] = {
 #endif
 };
 
+#define TCG_TMP_VEC  TCG_REG_XMM5
+
 static const int tcg_target_call_iarg_regs[] = {
 #if TCG_TARGET_REG_BITS == 64
 #if defined(_WIN64)
@@ -XXX,XX +XXX,XX @@ static bool tcg_target_const_match(int64_t val, TCGType type, int ct)
 #define OPC_PCMPGTW     (0x65 | P_EXT | P_DATA16)
 #define OPC_PCMPGTD     (0x66 | P_EXT | P_DATA16)
 #define OPC_PCMPGTQ     (0x37 | P_EXT38 | P_DATA16)
+#define OPC_PEXTRD      (0x16 | P_EXT3A | P_DATA16)
+#define OPC_PINSRD      (0x22 | P_EXT3A | P_DATA16)
 #define OPC_PMAXSB      (0x3c | P_EXT38 | P_DATA16)
 #define OPC_PMAXSW      (0xee | P_EXT | P_DATA16)
 #define OPC_PMAXSD      (0x3d | P_EXT38 | P_DATA16)
@@ -XXX,XX +XXX,XX @@ typedef struct {
 
 bool tcg_target_has_memory_bswap(MemOp memop)
 {
-    return have_movbe;
+    TCGAtomAlign aa;
+
+    if (!have_movbe) {
+        return false;
+    }
+    if ((memop & MO_SIZE) < MO_128) {
+        return true;
+    }
+
+    /*
+     * Reject 16-byte memop with 16-byte atomicity, i.e. VMOVDQA,
+     * but do allow a pair of 64-bit operations, i.e. MOVBEQ.
+     */
+    aa = atom_and_align_for_opc(tcg_ctx, memop, MO_ATOM_IFALIGN, true);
+    return aa.atom < MO_128;
 }
 
 /*
@@ -XXX,XX +XXX,XX @@ static const TCGLdstHelperParam ldst_helper_param = {
 static const TCGLdstHelperParam ldst_helper_param = { };
 #endif
 
+static void tcg_out_vec_to_pair(TCGContext *s, TCGType type,
+                                TCGReg l, TCGReg h, TCGReg v)
+{
+    int rexw = type == TCG_TYPE_I32 ? 0 : P_REXW;
+
+    /* vmov{d,q} %v, %l */
+    tcg_out_vex_modrm(s, OPC_MOVD_EyVy + rexw, v, 0, l);
+    /* vpextr{d,q} $1, %v, %h */
+    tcg_out_vex_modrm(s, OPC_PEXTRD + rexw, v, 0, h);
+    tcg_out8(s, 1);
+}
+
+static void tcg_out_pair_to_vec(TCGContext *s, TCGType type,
+                                TCGReg v, TCGReg l, TCGReg h)
+{
+    int rexw = type == TCG_TYPE_I32 ? 0 : P_REXW;
+
+    /* vmov{d,q} %l, %v */
+    tcg_out_vex_modrm(s, OPC_MOVD_VyEy + rexw, v, 0, l);
+    /* vpinsr{d,q} $1, %h, %v, %v */
+    tcg_out_vex_modrm(s, OPC_PINSRD + rexw, v, v, h);
+    tcg_out8(s, 1);
+}
+
 /*
  * Generate code for the slow path for a load at the end of block
  */
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
 {
     TCGLabelQemuLdst *ldst = NULL;
     MemOp opc = get_memop(oi);
+    MemOp s_bits = opc & MO_SIZE;
     unsigned a_mask;
 
 #ifdef CONFIG_SOFTMMU
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
     *h = x86_guest_base;
 #endif
     h->base = addrlo;
-    h->aa = atom_and_align_for_opc(s, opc, MO_ATOM_IFALIGN, false);
+    h->aa = atom_and_align_for_opc(s, opc, MO_ATOM_IFALIGN, s_bits == MO_128);
     a_mask = (1 << h->aa.align) - 1;
 
 #ifdef CONFIG_SOFTMMU
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
     TCGType tlbtype = TCG_TYPE_I32;
     int trexw = 0, hrexw = 0, tlbrexw = 0;
     unsigned mem_index = get_mmuidx(oi);
-    unsigned s_bits = opc & MO_SIZE;
     unsigned s_mask = (1 << s_bits) - 1;
     int tlb_mask;
 
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_ld_direct(TCGContext *s, TCGReg datalo, TCGReg datahi,
                                      h.base, h.index, 0, h.ofs + 4);
         }
         break;
+
+    case MO_128:
+        tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
+
+        /*
+         * Without 16-byte atomicity, use integer regs.
+         * That is where we want the data, and it allows bswaps.
+         */
+        if (h.aa.atom < MO_128) {
+            if (use_movbe) {
+                TCGReg t = datalo;
+                datalo = datahi;
+                datahi = t;
+            }
+            if (h.base == datalo || h.index == datalo) {
+                tcg_out_modrm_sib_offset(s, OPC_LEA + P_REXW, datahi,
+                                         h.base, h.index, 0, h.ofs);
+                tcg_out_modrm_offset(s, movop + P_REXW + h.seg,
+                                     datalo, datahi, 0);
+                tcg_out_modrm_offset(s, movop + P_REXW + h.seg,
+                                     datahi, datahi, 8);
+            } else {
+                tcg_out_modrm_sib_offset(s, movop + P_REXW + h.seg, datalo,
+                                         h.base, h.index, 0, h.ofs);
+                tcg_out_modrm_sib_offset(s, movop + P_REXW + h.seg, datahi,
+                                         h.base, h.index, 0, h.ofs + 8);
+            }
+            break;
+        }
+
+        /*
+         * With 16-byte atomicity, a vector load is required.
+         * If we already have 16-byte alignment, then VMOVDQA always works.
+         * Else if VMOVDQU has atomicity with dynamic alignment, use that.
+         * Else we require a runtime test for alignment for VMOVDQA;
+         * use VMOVDQU on the unaligned nonatomic path for simplicity.
+         */
+        if (h.aa.align >= MO_128) {
+            tcg_out_vex_modrm_sib_offset(s, OPC_MOVDQA_VxWx + h.seg,
+                                         TCG_TMP_VEC, 0,
+                                         h.base, h.index, 0, h.ofs);
+        } else if (cpuinfo & CPUINFO_ATOMIC_VMOVDQU) {
+            tcg_out_vex_modrm_sib_offset(s, OPC_MOVDQU_VxWx + h.seg,
+                                         TCG_TMP_VEC, 0,
+                                         h.base, h.index, 0, h.ofs);
+        } else {
+            TCGLabel *l1 = gen_new_label();
+            TCGLabel *l2 = gen_new_label();
+
+            tcg_out_testi(s, h.base, 15);
+            tcg_out_jxx(s, JCC_JNE, l1, true);
+
+            tcg_out_vex_modrm_sib_offset(s, OPC_MOVDQA_VxWx + h.seg,
+                                         TCG_TMP_VEC, 0,
+                                         h.base, h.index, 0, h.ofs);
+            tcg_out_jxx(s, JCC_JMP, l2, true);
+
+            tcg_out_label(s, l1);
+            tcg_out_vex_modrm_sib_offset(s, OPC_MOVDQU_VxWx + h.seg,
+                                         TCG_TMP_VEC, 0,
+                                         h.base, h.index, 0, h.ofs);
+            tcg_out_label(s, l2);
+        }
+        tcg_out_vec_to_pair(s, TCG_TYPE_I64, datalo, datahi, TCG_TMP_VEC);
+        break;
+
     default:
         g_assert_not_reached();
     }
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_st_direct(TCGContext *s, TCGReg datalo, TCGReg datahi,
                                      h.base, h.index, 0, h.ofs + 4);
         }
         break;
+
+    case MO_128:
+        tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
+
+        /*
+         * Without 16-byte atomicity, use integer regs.
+         * That is where we have the data, and it allows bswaps.
+         */
+        if (h.aa.atom < MO_128) {
+            if (use_movbe) {
+                TCGReg t = datalo;
+                datalo = datahi;
+                datahi = t;
+            }
+            tcg_out_modrm_sib_offset(s, movop + P_REXW + h.seg, datalo,
+                                     h.base, h.index, 0, h.ofs);
+            tcg_out_modrm_sib_offset(s, movop + P_REXW + h.seg, datahi,
+                                     h.base, h.index, 0, h.ofs + 8);
+            break;
+        }
+
+        /*
+         * With 16-byte atomicity, a vector store is required.
+         * If we already have 16-byte alignment, then VMOVDQA always works.
+         * Else if VMOVDQU has atomicity with dynamic alignment, use that.
+         * Else we require a runtime test for alignment for VMOVDQA;
+         * use VMOVDQU on the unaligned nonatomic path for simplicity.
+         */
+        tcg_out_pair_to_vec(s, TCG_TYPE_I64, TCG_TMP_VEC, datalo, datahi);
+        if (h.aa.align >= MO_128) {
+            tcg_out_vex_modrm_sib_offset(s, OPC_MOVDQA_WxVx + h.seg,
+                                         TCG_TMP_VEC, 0,
+                                         h.base, h.index, 0, h.ofs);
+        } else if (cpuinfo & CPUINFO_ATOMIC_VMOVDQU) {
+            tcg_out_vex_modrm_sib_offset(s, OPC_MOVDQU_WxVx + h.seg,
+                                         TCG_TMP_VEC, 0,
+                                         h.base, h.index, 0, h.ofs);
+        } else {
+            TCGLabel *l1 = gen_new_label();
+            TCGLabel *l2 = gen_new_label();
+
+            tcg_out_testi(s, h.base, 15);
+            tcg_out_jxx(s, JCC_JNE, l1, true);
+
+            tcg_out_vex_modrm_sib_offset(s, OPC_MOVDQA_WxVx + h.seg,
+                                         TCG_TMP_VEC, 0,
+                                         h.base, h.index, 0, h.ofs);
+            tcg_out_jxx(s, JCC_JMP, l2, true);
+
+            tcg_out_label(s, l1);
+            tcg_out_vex_modrm_sib_offset(s, OPC_MOVDQU_WxVx + h.seg,
+                                         TCG_TMP_VEC, 0,
+                                         h.base, h.index, 0, h.ofs);
+            tcg_out_label(s, l2);
+        }
+        break;
+
     default:
         g_assert_not_reached();
     }
@@ -XXX,XX +XXX,XX @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
             tcg_out_qemu_ld(s, a0, a1, a2, args[3], args[4], TCG_TYPE_I64);
         }
         break;
+    case INDEX_op_qemu_ld_a32_i128:
+    case INDEX_op_qemu_ld_a64_i128:
+        tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
+        tcg_out_qemu_ld(s, a0, a1, a2, -1, args[3], TCG_TYPE_I128);
+        break;
 
     case INDEX_op_qemu_st_a64_i32:
     case INDEX_op_qemu_st8_a64_i32:
@@ -XXX,XX +XXX,XX @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
             tcg_out_qemu_st(s, a0, a1, a2, args[3], args[4], TCG_TYPE_I64);
         }
         break;
+    case INDEX_op_qemu_st_a32_i128:
+    case INDEX_op_qemu_st_a64_i128:
+        tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
+        tcg_out_qemu_st(s, a0, a1, a2, -1, args[3], TCG_TYPE_I128);
+        break;
 
     OP_32_64(mulu2):
         tcg_out_modrm(s, OPC_GRP3_Ev + rexw, EXT3_MUL, args[3]);
@@ -XXX,XX +XXX,XX @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
     case INDEX_op_qemu_st_a64_i64:
         return TCG_TARGET_REG_BITS == 64 ? C_O0_I2(L, L) : C_O0_I4(L, L, L, L);
 
+    case INDEX_op_qemu_ld_a32_i128:
+    case INDEX_op_qemu_ld_a64_i128:
+        tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
+        return C_O2_I1(r, r, L);
+    case INDEX_op_qemu_st_a32_i128:
+    case INDEX_op_qemu_st_a64_i128:
+        tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
+        return C_O0_I3(L, L, L);
+
     case INDEX_op_brcond2_i32:
         return C_O0_I4(r, r, ri, ri);
 
@@ -XXX,XX +XXX,XX @@ static void tcg_target_init(TCGContext *s)
 
     s->reserved_regs = 0;
     tcg_regset_set_reg(s->reserved_regs, TCG_REG_CALL_STACK);
+    tcg_regset_set_reg(s->reserved_regs, TCG_TMP_VEC);
 #ifdef _WIN64
     /* These are call saved, and we don't save them, so don't use them. */
     tcg_regset_set_reg(s->reserved_regs, TCG_REG_XMM6);
-- 
2.34.1

We will need to allocate a second general-purpose temporary.
Rename the existing temps to add a distinguishing number.

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/aarch64/tcg-target.c.inc | 50 ++++++++++++++++++------------------
 1 file changed, 25 insertions(+), 25 deletions(-)

diff --git a/tcg/aarch64/tcg-target.c.inc b/tcg/aarch64/tcg-target.c.inc
index XXXXXXX..XXXXXXX 100644
--- a/tcg/aarch64/tcg-target.c.inc
+++ b/tcg/aarch64/tcg-target.c.inc
@@ -XXX,XX +XXX,XX @@ static TCGReg tcg_target_call_oarg_reg(TCGCallReturnKind kind, int slot)
     return TCG_REG_X0 + slot;
 }
 
-#define TCG_REG_TMP TCG_REG_X30
-#define TCG_VEC_TMP TCG_REG_V31
+#define TCG_REG_TMP0 TCG_REG_X30
+#define TCG_VEC_TMP0 TCG_REG_V31
 
 #ifndef CONFIG_SOFTMMU
 #define TCG_REG_GUEST_BASE TCG_REG_X28
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_dup_vec(TCGContext *s, TCGType type, unsigned vece,
 static bool tcg_out_dupm_vec(TCGContext *s, TCGType type, unsigned vece,
                              TCGReg r, TCGReg base, intptr_t offset)
 {
-    TCGReg temp = TCG_REG_TMP;
+    TCGReg temp = TCG_REG_TMP0;
 
     if (offset < -0xffffff || offset > 0xffffff) {
         tcg_out_movi(s, TCG_TYPE_PTR, temp, offset);
@@ -XXX,XX +XXX,XX @@ static void tcg_out_ldst(TCGContext *s, AArch64Insn insn, TCGReg rd,
     }
 
     /* Worst-case scenario, move offset to temp register, use reg offset.  */
-    tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_TMP, offset);
-    tcg_out_ldst_r(s, insn, rd, rn, TCG_TYPE_I64, TCG_REG_TMP);
+    tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_TMP0, offset);
+    tcg_out_ldst_r(s, insn, rd, rn, TCG_TYPE_I64, TCG_REG_TMP0);
 }
 
 static bool tcg_out_mov(TCGContext *s, TCGType type, TCGReg ret, TCGReg arg)
@@ -XXX,XX +XXX,XX @@ static void tcg_out_call_int(TCGContext *s, const tcg_insn_unit *target)
     if (offset == sextract64(offset, 0, 26)) {
         tcg_out_insn(s, 3206, BL, offset);
     } else {
-        tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_TMP, (intptr_t)target);
-        tcg_out_insn(s, 3207, BLR, TCG_REG_TMP);
+        tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_TMP0, (intptr_t)target);
+        tcg_out_insn(s, 3207, BLR, TCG_REG_TMP0);
     }
 }
 
@@ -XXX,XX +XXX,XX @@ static void tcg_out_addsub2(TCGContext *s, TCGType ext, TCGReg rl,
     AArch64Insn insn;
 
     if (rl == ah || (!const_bh && rl == bh)) {
-        rl = TCG_REG_TMP;
+        rl = TCG_REG_TMP0;
     }
 
     if (const_bl) {
@@ -XXX,XX +XXX,XX @@ static void tcg_out_addsub2(TCGContext *s, TCGType ext, TCGReg rl,
                possibility of adding 0+const in the low part, and the
                immediate add instructions encode XSP not XZR.  Don't try
                anything more elaborate here than loading another zero.  */
-            al = TCG_REG_TMP;
+            al = TCG_REG_TMP0;
             tcg_out_movi(s, ext, al, 0);
         }
         tcg_out_insn_3401(s, insn, ext, rl, al, bl);
@@ -XXX,XX +XXX,XX @@ static void tcg_out_cltz(TCGContext *s, TCGType ext, TCGReg d,
 {
     TCGReg a1 = a0;
     if (is_ctz) {
-        a1 = TCG_REG_TMP;
+        a1 = TCG_REG_TMP0;
         tcg_out_insn(s, 3507, RBIT, ext, a1, a0);
     }
     if (const_b && b == (ext ? 64 : 32)) {
@@ -XXX,XX +XXX,XX @@ static void tcg_out_cltz(TCGContext *s, TCGType ext, TCGReg d,
         AArch64Insn sel = I3506_CSEL;
 
         tcg_out_cmp(s, ext, a0, 0, 1);
-        tcg_out_insn(s, 3507, CLZ, ext, TCG_REG_TMP, a1);
+        tcg_out_insn(s, 3507, CLZ, ext, TCG_REG_TMP0, a1);
 
         if (const_b) {
             if (b == -1) {
@@ -XXX,XX +XXX,XX @@ static void tcg_out_cltz(TCGContext *s, TCGType ext, TCGReg d,
                 b = d;
             }
         }
-        tcg_out_insn_3506(s, sel, ext, d, TCG_REG_TMP, b, TCG_COND_NE);
+        tcg_out_insn_3506(s, sel, ext, d, TCG_REG_TMP0, b, TCG_COND_NE);
     }
 }
 
@@ -XXX,XX +XXX,XX @@ bool tcg_target_has_memory_bswap(MemOp memop)
 }
 
 static const TCGLdstHelperParam ldst_helper_param = {
-    .ntmp = 1, .tmp = { TCG_REG_TMP }
+    .ntmp = 1, .tmp = { TCG_REG_TMP0 }
 };
 
 static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
@@ -XXX,XX +XXX,XX @@ static void tcg_out_goto_tb(TCGContext *s, int which)
 
     set_jmp_insn_offset(s, which);
     tcg_out32(s, I3206_B);
-    tcg_out_insn(s, 3207, BR, TCG_REG_TMP);
+    tcg_out_insn(s, 3207, BR, TCG_REG_TMP0);
     set_jmp_reset_offset(s, which);
 }
 
@@ -XXX,XX +XXX,XX @@ void tb_target_set_jmp_target(const TranslationBlock *tb, int n,
         ptrdiff_t i_offset = i_addr - jmp_rx;
 
         /* Note that we asserted this in range in tcg_out_goto_tb. */
-        insn = deposit32(I3305_LDR | TCG_REG_TMP, 5, 19, i_offset >> 2);
+        insn = deposit32(I3305_LDR | TCG_REG_TMP0, 5, 19, i_offset >> 2);
     }
     qatomic_set((uint32_t *)jmp_rw, insn);
     flush_idcache_range(jmp_rx, jmp_rw, 4);
@@ -XXX,XX +XXX,XX @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
 
     case INDEX_op_rem_i64:
     case INDEX_op_rem_i32:
-        tcg_out_insn(s, 3508, SDIV, ext, TCG_REG_TMP, a1, a2);
-        tcg_out_insn(s, 3509, MSUB, ext, a0, TCG_REG_TMP, a2, a1);
+        tcg_out_insn(s, 3508, SDIV, ext, TCG_REG_TMP0, a1, a2);
+        tcg_out_insn(s, 3509, MSUB, ext, a0, TCG_REG_TMP0, a2, a1);
         break;
     case INDEX_op_remu_i64:
     case INDEX_op_remu_i32:
-        tcg_out_insn(s, 3508, UDIV, ext, TCG_REG_TMP, a1, a2);
-        tcg_out_insn(s, 3509, MSUB, ext, a0, TCG_REG_TMP, a2, a1);
+        tcg_out_insn(s, 3508, UDIV, ext, TCG_REG_TMP0, a1, a2);
+        tcg_out_insn(s, 3509, MSUB, ext, a0, TCG_REG_TMP0, a2, a1);
         break;
 
     case INDEX_op_shl_i64:
@@ -XXX,XX +XXX,XX @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
         if (c2) {
             tcg_out_rotl(s, ext, a0, a1, a2);
         } else {
-            tcg_out_insn(s, 3502, SUB, 0, TCG_REG_TMP, TCG_REG_XZR, a2);
-            tcg_out_insn(s, 3508, RORV, ext, a0, a1, TCG_REG_TMP);
+            tcg_out_insn(s, 3502, SUB, 0, TCG_REG_TMP0, TCG_REG_XZR, a2);
+            tcg_out_insn(s, 3508, RORV, ext, a0, a1, TCG_REG_TMP0);
         }
         break;
 
@@ -XXX,XX +XXX,XX @@ static void tcg_out_vec_op(TCGContext *s, TCGOpcode opc,
                             break;
                         }
                     }
-                    tcg_out_dupi_vec(s, type, MO_8, TCG_VEC_TMP, 0);
-                    a2 = TCG_VEC_TMP;
+                    tcg_out_dupi_vec(s, type, MO_8, TCG_VEC_TMP0, 0);
+                    a2 = TCG_VEC_TMP0;
                 }
                 if (is_scalar) {
                     insn = cmp_scalar_insn[cond];
@@ -XXX,XX +XXX,XX @@ static void tcg_target_init(TCGContext *s)
     s->reserved_regs = 0;
     tcg_regset_set_reg(s->reserved_regs, TCG_REG_SP);
     tcg_regset_set_reg(s->reserved_regs, TCG_REG_FP);
-    tcg_regset_set_reg(s->reserved_regs, TCG_REG_TMP);
     tcg_regset_set_reg(s->reserved_regs, TCG_REG_X18); /* platform register */
-    tcg_regset_set_reg(s->reserved_regs, TCG_VEC_TMP);
+    tcg_regset_set_reg(s->reserved_regs, TCG_REG_TMP0);
+    tcg_regset_set_reg(s->reserved_regs, TCG_VEC_TMP0);
 }
 
 /* Saving pairs: (X19, X20) .. (X27, X28), (X29(fp), X30(lr)).  */
-- 
2.34.1

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/aarch64/tcg-target.c.inc | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/tcg/aarch64/tcg-target.c.inc b/tcg/aarch64/tcg-target.c.inc
index XXXXXXX..XXXXXXX 100644
--- a/tcg/aarch64/tcg-target.c.inc
+++ b/tcg/aarch64/tcg-target.c.inc
@@ -XXX,XX +XXX,XX @@ static const int tcg_target_reg_alloc_order[] = {
 
     TCG_REG_X8, TCG_REG_X9, TCG_REG_X10, TCG_REG_X11,
     TCG_REG_X12, TCG_REG_X13, TCG_REG_X14, TCG_REG_X15,
-    TCG_REG_X16, TCG_REG_X17,
 
     TCG_REG_X0, TCG_REG_X1, TCG_REG_X2, TCG_REG_X3,
     TCG_REG_X4, TCG_REG_X5, TCG_REG_X6, TCG_REG_X7,
 
+    /* X16 reserved as temporary */
+    /* X17 reserved as temporary */
     /* X18 reserved by system */
     /* X19 reserved for AREG0 */
     /* X29 reserved as fp */
@@ -XXX,XX +XXX,XX @@ static TCGReg tcg_target_call_oarg_reg(TCGCallReturnKind kind, int slot)
     return TCG_REG_X0 + slot;
 }
 
-#define TCG_REG_TMP0 TCG_REG_X30
+#define TCG_REG_TMP0 TCG_REG_X16
+#define TCG_REG_TMP1 TCG_REG_X17
+#define TCG_REG_TMP2 TCG_REG_X30
 #define TCG_VEC_TMP0 TCG_REG_V31
 
 #ifndef CONFIG_SOFTMMU
@@ -XXX,XX +XXX,XX @@ static void tcg_target_init(TCGContext *s)
     tcg_regset_set_reg(s->reserved_regs, TCG_REG_FP);
     tcg_regset_set_reg(s->reserved_regs, TCG_REG_X18); /* platform register */
     tcg_regset_set_reg(s->reserved_regs, TCG_REG_TMP0);
+    tcg_regset_set_reg(s->reserved_regs, TCG_REG_TMP1);
+    tcg_regset_set_reg(s->reserved_regs, TCG_REG_TMP2);
     tcg_regset_set_reg(s->reserved_regs, TCG_VEC_TMP0);
 }
 
-- 
2.34.1

Adjust the softmmu tlb to use TMP[0-2], not any of the normally available
registers.  Since we handle overlap between inputs and helper arguments,
we can allow any allocatable reg.
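
For reference, the fast-path comparison that prepare_host_addr emits with
these temporaries can be modelled in plain C.  This is a sketch only: the
function name is hypothetical, and PAGE_BITS is fixed at 12 as an
assumption, whereas the backend reads s->page_bits at code-generation time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption for illustration: 4 KiB pages. */
#define PAGE_BITS 12
#define PAGE_MASK (~(uint64_t)((1u << PAGE_BITS) - 1))

/*
 * Model of the TLB comparator test.  a_bits and s_bits are log2 of the
 * required alignment and of the access size.  When the alignment is
 * smaller than the size, the address of the last byte is used so that
 * page-crossing accesses miss and fall to the slow path.
 */
static bool tlb_compare_hits(uint64_t addr, uint64_t comparator,
                             unsigned a_bits, unsigned s_bits)
{
    assert(a_bits < 32 && s_bits < 32);

    uint64_t a_mask = (1u << a_bits) - 1;
    uint64_t s_mask = (1u << s_bits) - 1;
    uint64_t addr_adj = addr;

    if (a_mask < s_mask) {
        addr_adj = addr + (s_mask - a_mask);
    }
    /* Keep the page number and the low alignment bits, then compare. */
    return (addr_adj & (PAGE_MASK | a_mask)) == comparator;
}
```

An unaligned 8-byte access that stays within the page hits, one that
crosses into the next page misses, and a misaligned access with a strict
alignment requirement misses because its low bits survive the mask.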

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/aarch64/tcg-target-con-set.h |  2 --
 tcg/aarch64/tcg-target-con-str.h |  1 -
 tcg/aarch64/tcg-target.c.inc     | 45 ++++++++++++++------------------
 3 files changed, 19 insertions(+), 29 deletions(-)

diff --git a/tcg/aarch64/tcg-target-con-set.h b/tcg/aarch64/tcg-target-con-set.h
index XXXXXXX..XXXXXXX 100644
--- a/tcg/aarch64/tcg-target-con-set.h
+++ b/tcg/aarch64/tcg-target-con-set.h
@@ -XXX,XX +XXX,XX @@
  * tcg-target-con-str.h; the constraint combination is inclusive or.
  */
 C_O0_I1(r)
-C_O0_I2(lZ, l)
 C_O0_I2(r, rA)
 C_O0_I2(rZ, r)
 C_O0_I2(w, r)
-C_O1_I1(r, l)
 C_O1_I1(r, r)
 C_O1_I1(w, r)
 C_O1_I1(w, w)
diff --git a/tcg/aarch64/tcg-target-con-str.h b/tcg/aarch64/tcg-target-con-str.h
index XXXXXXX..XXXXXXX 100644
--- a/tcg/aarch64/tcg-target-con-str.h
+++ b/tcg/aarch64/tcg-target-con-str.h
@@ -XXX,XX +XXX,XX @@
  * REGS(letter, register_mask)
  */
 REGS('r', ALL_GENERAL_REGS)
-REGS('l', ALL_QLDST_REGS)
 REGS('w', ALL_VECTOR_REGS)
 
 /*
diff --git a/tcg/aarch64/tcg-target.c.inc b/tcg/aarch64/tcg-target.c.inc
index XXXXXXX..XXXXXXX 100644
--- a/tcg/aarch64/tcg-target.c.inc
+++ b/tcg/aarch64/tcg-target.c.inc
@@ -XXX,XX +XXX,XX @@ static bool patch_reloc(tcg_insn_unit *code_ptr, int type,
 #define ALL_GENERAL_REGS  0xffffffffu
 #define ALL_VECTOR_REGS   0xffffffff00000000ull
 
-#ifdef CONFIG_SOFTMMU
-#define ALL_QLDST_REGS \
-    (ALL_GENERAL_REGS & ~((1 << TCG_REG_X0) | (1 << TCG_REG_X1) | \
-                          (1 << TCG_REG_X2) | (1 << TCG_REG_X3)))
-#else
-#define ALL_QLDST_REGS   ALL_GENERAL_REGS
-#endif
-
 /* Match a constant valid for addition (12-bit, optionally shifted).  */
 static inline bool is_aimm(uint64_t val)
 {
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
     unsigned s_bits = opc & MO_SIZE;
     unsigned s_mask = (1u << s_bits) - 1;
     unsigned mem_index = get_mmuidx(oi);
-    TCGReg x3;
+    TCGReg addr_adj;
     TCGType mask_type;
     uint64_t compare_mask;
 
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
     mask_type = (s->page_bits + s->tlb_dyn_max_bits > 32
                  ? TCG_TYPE_I64 : TCG_TYPE_I32);
 
-    /* Load env_tlb(env)->f[mmu_idx].{mask,table} into {x0,x1}.  */
+    /* Load env_tlb(env)->f[mmu_idx].{mask,table} into {tmp0,tmp1}. */
     QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) > 0);
     QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) < -512);
     QEMU_BUILD_BUG_ON(offsetof(CPUTLBDescFast, mask) != 0);
     QEMU_BUILD_BUG_ON(offsetof(CPUTLBDescFast, table) != 8);
-    tcg_out_insn(s, 3314, LDP, TCG_REG_X0, TCG_REG_X1, TCG_AREG0,
+    tcg_out_insn(s, 3314, LDP, TCG_REG_TMP0, TCG_REG_TMP1, TCG_AREG0,
                  TLB_MASK_TABLE_OFS(mem_index), 1, 0);
 
     /* Extract the TLB index from the address into X0.  */
     tcg_out_insn(s, 3502S, AND_LSR, mask_type == TCG_TYPE_I64,
-                 TCG_REG_X0, TCG_REG_X0, addr_reg,
+                 TCG_REG_TMP0, TCG_REG_TMP0, addr_reg,
                  s->page_bits - CPU_TLB_ENTRY_BITS);
 
-    /* Add the tlb_table pointer, creating the CPUTLBEntry address into X1.  */
-    tcg_out_insn(s, 3502, ADD, 1, TCG_REG_X1, TCG_REG_X1, TCG_REG_X0);
+    /* Add the tlb_table pointer, forming the CPUTLBEntry address in TMP1. */
+    tcg_out_insn(s, 3502, ADD, 1, TCG_REG_TMP1, TCG_REG_TMP1, TCG_REG_TMP0);
 
-    /* Load the tlb comparator into X0, and the fast path addend into X1.  */
-    tcg_out_ld(s, addr_type, TCG_REG_X0, TCG_REG_X1,
+    /* Load the tlb comparator into TMP0, and the fast path addend into TMP1. */
+    tcg_out_ld(s, addr_type, TCG_REG_TMP0, TCG_REG_TMP1,
                is_ld ? offsetof(CPUTLBEntry, addr_read)
                      : offsetof(CPUTLBEntry, addr_write));
-    tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_X1, TCG_REG_X1,
+    tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TMP1, TCG_REG_TMP1,
                offsetof(CPUTLBEntry, addend));
 
     /*
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
      * cross pages using the address of the last byte of the access.
      */
     if (a_mask >= s_mask) {
-        x3 = addr_reg;
+        addr_adj = addr_reg;
     } else {
+        addr_adj = TCG_REG_TMP2;
         tcg_out_insn(s, 3401, ADDI, addr_type,
-                     TCG_REG_X3, addr_reg, s_mask - a_mask);
-        x3 = TCG_REG_X3;
+                     addr_adj, addr_reg, s_mask - a_mask);
     }
     compare_mask = (uint64_t)s->page_mask | a_mask;
 
-    /* Store the page mask part of the address into X3.  */
-    tcg_out_logicali(s, I3404_ANDI, addr_type, TCG_REG_X3, x3, compare_mask);
+    /* Store the page mask part of the address into TMP2.  */
+    tcg_out_logicali(s, I3404_ANDI, addr_type, TCG_REG_TMP2,
+                     addr_adj, compare_mask);
 
     /* Perform the address comparison. */
-    tcg_out_cmp(s, addr_type, TCG_REG_X0, TCG_REG_X3, 0);
+    tcg_out_cmp(s, addr_type, TCG_REG_TMP0, TCG_REG_TMP2, 0);
 
     /* If not equal, we jump to the slow path. */
     ldst->label_ptr[0] = s->code_ptr;
     tcg_out_insn(s, 3202, B_C, TCG_COND_NE, 0);
 
-    h->base = TCG_REG_X1,
+    h->base = TCG_REG_TMP1;
     h->index = addr_reg;
     h->index_ext = addr_type;
 #else
@@ -XXX,XX +XXX,XX @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
     case INDEX_op_qemu_ld_a64_i32:
     case INDEX_op_qemu_ld_a32_i64:
     case INDEX_op_qemu_ld_a64_i64:
-        return C_O1_I1(r, l);
+        return C_O1_I1(r, r);
     case INDEX_op_qemu_st_a32_i32:
     case INDEX_op_qemu_st_a64_i32:
     case INDEX_op_qemu_st_a32_i64:
     case INDEX_op_qemu_st_a64_i64:
-        return C_O0_I2(lZ, l);
+        return C_O0_I2(rZ, r);
 
     case INDEX_op_deposit_i32:
     case INDEX_op_deposit_i64:
-- 
2.34.1

With FEAT_LSE2, LDP/STP suffices.  Without FEAT_LSE2, use LDXP+STXP if
16-byte atomicity is required and LDP/STP otherwise.
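
The selection made in tcg_out_qemu_ldst_i128 can be summarised by the
following C sketch.  The enum and function names are hypothetical; the
arguments model have_lse2, h.aa.atom and h.aa.align, with 4 standing for
16 bytes (MO_128).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the three code shapes; not QEMU identifiers. */
enum i128_strategy {
    I128_PAIR,          /* plain LDP/STP */
    I128_EXCLUSIVE,     /* LDXP+STXP retry loop */
    I128_RUNTIME_TEST,  /* test addr & 15, then take one of the above */
};

/*
 * atom_log2 and align_log2 are log2 byte counts for the required
 * atomicity and for the alignment already checked by the fast path.
 */
static enum i128_strategy pick_i128_strategy(bool have_lse2,
                                             unsigned atom_log2,
                                             unsigned align_log2)
{
    assert(atom_log2 <= 4 && align_log2 <= 4);

    if (have_lse2 || atom_log2 < 4) {
        return I128_PAIR;
    }
    if (align_log2 < 4) {
        /* Alignment unknown: aligned uses the loop, misaligned the pair. */
        return I128_RUNTIME_TEST;
    }
    return I128_EXCLUSIVE;
}
```

In the runtime-test case the generated code contains both sequences, with
a conditional branch over the exclusive loop when the address is not
16-byte aligned.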

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/aarch64/tcg-target-con-set.h |   2 +
 tcg/aarch64/tcg-target.h         |  11 ++-
 tcg/aarch64/tcg-target.c.inc     | 141 ++++++++++++++++++++++++++++++-
 3 files changed, 151 insertions(+), 3 deletions(-)

diff --git a/tcg/aarch64/tcg-target-con-set.h b/tcg/aarch64/tcg-target-con-set.h
index XXXXXXX..XXXXXXX 100644
--- a/tcg/aarch64/tcg-target-con-set.h
+++ b/tcg/aarch64/tcg-target-con-set.h
@@ -XXX,XX +XXX,XX @@ C_O0_I1(r)
 C_O0_I2(r, rA)
 C_O0_I2(rZ, r)
 C_O0_I2(w, r)
+C_O0_I3(rZ, rZ, r)
 C_O1_I1(r, r)
 C_O1_I1(w, r)
 C_O1_I1(w, w)
@@ -XXX,XX +XXX,XX @@ C_O1_I2(w, w, wO)
 C_O1_I2(w, w, wZ)
 C_O1_I3(w, w, w, w)
 C_O1_I4(r, r, rA, rZ, rZ)
+C_O2_I1(r, r, r)
 C_O2_I4(r, r, rZ, rZ, rA, rMZ)
diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h
index XXXXXXX..XXXXXXX 100644
--- a/tcg/aarch64/tcg-target.h
+++ b/tcg/aarch64/tcg-target.h
@@ -XXX,XX +XXX,XX @@ typedef enum {
 #define TCG_TARGET_HAS_muluh_i64        1
 #define TCG_TARGET_HAS_mulsh_i64        1
 
-#define TCG_TARGET_HAS_qemu_ldst_i128   0
+/*
+ * Without FEAT_LSE2, we must use LDXP+STXP to implement atomic 128-bit load,
+ * which requires writable pages.  We must defer to the helper for user-only,
+ * but in system mode all ram is writable for the host.
+ */
+#ifdef CONFIG_USER_ONLY
+#define TCG_TARGET_HAS_qemu_ldst_i128   have_lse2
+#else
+#define TCG_TARGET_HAS_qemu_ldst_i128   1
+#endif
 
 #define TCG_TARGET_HAS_v64              1
 #define TCG_TARGET_HAS_v128             1
diff --git a/tcg/aarch64/tcg-target.c.inc b/tcg/aarch64/tcg-target.c.inc
index XXXXXXX..XXXXXXX 100644
--- a/tcg/aarch64/tcg-target.c.inc
+++ b/tcg/aarch64/tcg-target.c.inc
@@ -XXX,XX +XXX,XX @@ typedef enum {
     I3305_LDR_v64   = 0x5c000000,
     I3305_LDR_v128  = 0x9c000000,
 
+    /* Load/store exclusive. */
+    I3306_LDXP      = 0xc8600000,
+    I3306_STXP      = 0xc8200000,
+
     /* Load/store register.  Described here as 3.3.12, but the helper
        that emits them can transform to 3.3.10 or 3.3.13.  */
     I3312_STRB      = 0x38000000 | LDST_ST << 22 | MO_8 << 30,
@@ -XXX,XX +XXX,XX @@ typedef enum {
     I3406_ADR       = 0x10000000,
     I3406_ADRP      = 0x90000000,
 
+    /* Add/subtract extended register instructions. */
+    I3501_ADD       = 0x0b200000,
+
     /* Add/subtract shifted register instructions (without a shift).  */
     I3502_ADD       = 0x0b000000,
     I3502_ADDS      = 0x2b000000,
@@ -XXX,XX +XXX,XX @@ static void tcg_out_insn_3305(TCGContext *s, AArch64Insn insn,
     tcg_out32(s, insn | (imm19 & 0x7ffff) << 5 | rt);
 }
 
+static void tcg_out_insn_3306(TCGContext *s, AArch64Insn insn, TCGReg rs,
+                              TCGReg rt, TCGReg rt2, TCGReg rn)
+{
+    tcg_out32(s, insn | rs << 16 | rt2 << 10 | rn << 5 | rt);
+}
+
 static void tcg_out_insn_3201(TCGContext *s, AArch64Insn insn, TCGType ext,
                               TCGReg rt, int imm19)
 {
@@ -XXX,XX +XXX,XX @@ static void tcg_out_insn_3406(TCGContext *s, AArch64Insn insn,
     tcg_out32(s, insn | (disp & 3) << 29 | (disp & 0x1ffffc) << (5 - 2) | rd);
 }
 
+static inline void tcg_out_insn_3501(TCGContext *s, AArch64Insn insn,
+                                     TCGType sf, TCGReg rd, TCGReg rn,
+                                     TCGReg rm, int opt, int imm3)
+{
+    tcg_out32(s, insn | sf << 31 | rm << 16 | opt << 13 |
+              imm3 << 10 | rn << 5 | rd);
+}
+
 /* This function is for both 3.5.2 (Add/Subtract shifted register), for
    the rare occasion when we actually want to supply a shift amount.  */
 static inline void tcg_out_insn_3502S(TCGContext *s, AArch64Insn insn,
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
     TCGType addr_type = s->addr_type;
     TCGLabelQemuLdst *ldst = NULL;
     MemOp opc = get_memop(oi);
+    MemOp s_bits = opc & MO_SIZE;
     unsigned a_mask;
 
     h->aa = atom_and_align_for_opc(s, opc,
                                    have_lse2 ? MO_ATOM_WITHIN16
                                              : MO_ATOM_IFALIGN,
-                                   false);
+                                   s_bits == MO_128);
     a_mask = (1 << h->aa.align) - 1;
 
 #ifdef CONFIG_SOFTMMU
-    unsigned s_bits = opc & MO_SIZE;
     unsigned s_mask = (1u << s_bits) - 1;
     unsigned mem_index = get_mmuidx(oi);
     TCGReg addr_adj;
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_st(TCGContext *s, TCGReg data_reg, TCGReg addr_reg,
     }
 }
 
+static void tcg_out_qemu_ldst_i128(TCGContext *s, TCGReg datalo, TCGReg datahi,
+                                   TCGReg addr_reg, MemOpIdx oi, bool is_ld)
+{
+    TCGLabelQemuLdst *ldst;
+    HostAddress h;
+    TCGReg base;
+    bool use_pair;
+
+    ldst = prepare_host_addr(s, &h, addr_reg, oi, is_ld);
+
+    /* Compose the final address, as LDP/STP have no indexing. */
+    if (h.index == TCG_REG_XZR) {
+        base = h.base;
+    } else {
+        base = TCG_REG_TMP2;
+        if (h.index_ext == TCG_TYPE_I32) {
+            /* add base, base, index, uxtw */
+            tcg_out_insn(s, 3501, ADD, TCG_TYPE_I64, base,
+                         h.base, h.index, MO_32, 0);
+        } else {
+            /* add base, base, index */
+            tcg_out_insn(s, 3502, ADD, 1, base, h.base, h.index);
+        }
+    }
+
+    use_pair = h.aa.atom < MO_128 || have_lse2;
+
+    if (!use_pair) {
+        tcg_insn_unit *branch = NULL;
+        TCGReg ll, lh, sl, sh;
+
+        /*
+         * If we have already checked for 16-byte alignment, that's all
+         * we need. Otherwise we have determined that misaligned atomicity
+         * may be handled with two 8-byte loads.
+         */
+        if (h.aa.align < MO_128) {
+            /*
+             * TODO: align should be MO_64, so we only need test bit 3,
+             * which means we could use TBNZ instead of ANDS+B_C.
+             */
+            tcg_out_logicali(s, I3404_ANDSI, 0, TCG_REG_XZR, addr_reg, 15);
+            branch = s->code_ptr;
+            tcg_out_insn(s, 3202, B_C, TCG_COND_NE, 0);
+            use_pair = true;
+        }
+
+        if (is_ld) {
+            /*
+             * 16-byte atomicity without LSE2 requires LDXP+STXP loop:
+             *    ldxp lo, hi, [base]
+             *    stxp t0, lo, hi, [base]
+             *    cbnz t0, .-8
+             * Require no overlap between data{lo,hi} and base.
+             */
+            if (datalo == base || datahi == base) {
+                tcg_out_mov(s, TCG_TYPE_REG, TCG_REG_TMP2, base);
+                base = TCG_REG_TMP2;
+            }
+            ll = sl = datalo;
+            lh = sh = datahi;
+        } else {
+            /*
+             * 16-byte atomicity without LSE2 requires LDXP+STXP loop:
+             * 1: ldxp t0, t1, [base]
+             *    stxp t0, lo, hi, [base]
+             *    cbnz t0, 1b
+             */
+            tcg_debug_assert(base != TCG_REG_TMP0 && base != TCG_REG_TMP1);
+            ll = TCG_REG_TMP0;
+            lh = TCG_REG_TMP1;
+            sl = datalo;
+            sh = datahi;
+        }
+
+        tcg_out_insn(s, 3306, LDXP, TCG_REG_XZR, ll, lh, base);
+        tcg_out_insn(s, 3306, STXP, TCG_REG_TMP0, sl, sh, base);
+        tcg_out_insn(s, 3201, CBNZ, 0, TCG_REG_TMP0, -2);
+
+        if (use_pair) {
+            /* "b .+8", branching across the one insn of use_pair. */
+            tcg_out_insn(s, 3206, B, 2);
+            reloc_pc19(branch, tcg_splitwx_to_rx(s->code_ptr));
+        }
+    }
+
+    if (use_pair) {
+        if (is_ld) {
+            tcg_out_insn(s, 3314, LDP, datalo, datahi, base, 0, 1, 0);
+        } else {
+            tcg_out_insn(s, 3314, STP, datalo, datahi, base, 0, 1, 0);
+        }
+    }
+
+    if (ldst) {
+        ldst->type = TCG_TYPE_I128;
+        ldst->datalo_reg = datalo;
+        ldst->datahi_reg = datahi;
+        ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
+    }
+}
+
 static const tcg_insn_unit *tb_ret_addr;
 
 static void tcg_out_exit_tb(TCGContext *s, uintptr_t a0)
@@ -XXX,XX +XXX,XX @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
     case INDEX_op_qemu_st_a64_i64:
         tcg_out_qemu_st(s, REG0(0), a1, a2, ext);
         break;
+    case INDEX_op_qemu_ld_a32_i128:
+    case INDEX_op_qemu_ld_a64_i128:
+        tcg_out_qemu_ldst_i128(s, a0, a1, a2, args[3], true);
+        break;
+    case INDEX_op_qemu_st_a32_i128:
+    case INDEX_op_qemu_st_a64_i128:
+        tcg_out_qemu_ldst_i128(s, REG0(0), REG0(1), a2, args[3], false);
+        break;
 
     case INDEX_op_bswap64_i64:
         tcg_out_rev(s, TCG_TYPE_I64, MO_64, a0, a1);
@@ -XXX,XX +XXX,XX @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
     case INDEX_op_qemu_ld_a32_i64:
     case INDEX_op_qemu_ld_a64_i64:
         return C_O1_I1(r, r);
+    case INDEX_op_qemu_ld_a32_i128:
+    case INDEX_op_qemu_ld_a64_i128:
+        return C_O2_I1(r, r, r);
     case INDEX_op_qemu_st_a32_i32:
     case INDEX_op_qemu_st_a64_i32:
     case INDEX_op_qemu_st_a32_i64:
     case INDEX_op_qemu_st_a64_i64:
         return C_O0_I2(rZ, r);
+    case INDEX_op_qemu_st_a32_i128:
+    case INDEX_op_qemu_st_a64_i128:
+        return C_O0_I3(rZ, rZ, r);
 
     case INDEX_op_deposit_i32:
     case INDEX_op_deposit_i64:
-- 
2.34.1

Use LQ/STQ with ISA v2.07 when 16-byte atomicity is required.
Note that these instructions do not require 16-byte alignment.
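
The new 'o' constraint and the debug assertions in the LQ/STQ path imply
a particular register pairing, which the C sketch below illustrates.  The
helper names are hypothetical; only the 0xAAAAAAAA mask and the
"datahi == datalo - 1" relation come from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bits 1, 3, 5, ... set: one bit for each odd-numbered register. */
#define ODD_REGS 0xAAAAAAAAu

/* True if general register number reg is selected by the odd mask. */
static bool is_odd_reg(unsigned reg)
{
    assert(reg < 32);
    return (ODD_REGS >> reg) & 1;
}

/*
 * True if (datalo, datahi) is a usable pair for the atomic path: the
 * low half in an odd register and the high half in the even register
 * immediately below it.
 */
static bool is_valid_pair(unsigned datalo, unsigned datahi)
{
    return is_odd_reg(datalo) && datahi == datalo - 1;
}
```

For example, (r5, r4) satisfies the check, while (r4, r3) and (r5, r6) do
not.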

Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/ppc/tcg-target-con-set.h |   2 +
 tcg/ppc/tcg-target-con-str.h |   1 +
 tcg/ppc/tcg-target.h         |   3 +-
 tcg/ppc/tcg-target.c.inc     | 108 +++++++++++++++++++++++++++++++----
 4 files changed, 101 insertions(+), 13 deletions(-)

diff --git a/tcg/ppc/tcg-target-con-set.h b/tcg/ppc/tcg-target-con-set.h
index XXXXXXX..XXXXXXX 100644
--- a/tcg/ppc/tcg-target-con-set.h
+++ b/tcg/ppc/tcg-target-con-set.h
@@ -XXX,XX +XXX,XX @@ C_O0_I2(r, r)
 C_O0_I2(r, ri)
 C_O0_I2(v, r)
 C_O0_I3(r, r, r)
+C_O0_I3(o, m, r)
 C_O0_I4(r, r, ri, ri)
 C_O0_I4(r, r, r, r)
 C_O1_I1(r, r)
@@ -XXX,XX +XXX,XX @@ C_O1_I3(v, v, v, v)
 C_O1_I4(r, r, ri, rZ, rZ)
 C_O1_I4(r, r, r, ri, ri)
 C_O2_I1(r, r, r)
+C_O2_I1(o, m, r)
 C_O2_I2(r, r, r, r)
 C_O2_I4(r, r, rI, rZM, r, r)
 C_O2_I4(r, r, r, r, rI, rZM)
diff --git a/tcg/ppc/tcg-target-con-str.h b/tcg/ppc/tcg-target-con-str.h
index XXXXXXX..XXXXXXX 100644
--- a/tcg/ppc/tcg-target-con-str.h
+++ b/tcg/ppc/tcg-target-con-str.h
@@ -XXX,XX +XXX,XX @@
  * REGS(letter, register_mask)
  */
 REGS('r', ALL_GENERAL_REGS)
+REGS('o', ALL_GENERAL_REGS & 0xAAAAAAAAu)  /* odd registers */
 REGS('v', ALL_VECTOR_REGS)
 
 /*
diff --git a/tcg/ppc/tcg-target.h b/tcg/ppc/tcg-target.h
index XXXXXXX..XXXXXXX 100644
--- a/tcg/ppc/tcg-target.h
+++ b/tcg/ppc/tcg-target.h
@@ -XXX,XX +XXX,XX @@ extern bool have_vsx;
 #define TCG_TARGET_HAS_mulsh_i64        1
 #endif
 
-#define TCG_TARGET_HAS_qemu_ldst_i128   0
+#define TCG_TARGET_HAS_qemu_ldst_i128   \
+    (TCG_TARGET_REG_BITS == 64 && have_isa_2_07)
 
 /*
  * While technically Altivec could support V64, it has no 64-bit store
diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
index XXXXXXX..XXXXXXX 100644
--- a/tcg/ppc/tcg-target.c.inc
+++ b/tcg/ppc/tcg-target.c.inc
@@ -XXX,XX +XXX,XX @@ static bool tcg_target_const_match(int64_t val, TCGType type, int ct)
 
 #define B      OPCD( 18)
 #define BC     OPCD( 16)
+
 #define LBZ    OPCD( 34)
 #define LHZ    OPCD( 40)
 #define LHA    OPCD( 42)
 #define LWZ    OPCD( 32)
 #define LWZUX  XO31( 55)
-#define STB    OPCD( 38)
-#define STH    OPCD( 44)
-#define STW    OPCD( 36)
-
-#define STD    XO62(  0)
-#define STDU   XO62(  1)
-#define STDX   XO31(149)
-
 #define LD     XO58(  0)
 #define LDX    XO31( 21)
 #define LDU    XO58(  1)
 #define LDUX   XO31( 53)
 #define LWA    XO58(  2)
 #define LWAX   XO31(341)
+#define LQ     OPCD( 56)
+
+#define STB    OPCD( 38)
+#define STH    OPCD( 44)
+#define STW    OPCD( 36)
+#define STD    XO62(  0)
+#define STDU   XO62(  1)
+#define STDX   XO31(149)
+#define STQ    XO62(  2)
 
 #define ADDIC  OPCD( 12)
 #define ADDI   OPCD( 14)
@@ -XXX,XX +XXX,XX @@ typedef struct {
 
 bool tcg_target_has_memory_bswap(MemOp memop)
 {
-    return true;
+    TCGAtomAlign aa;
+
+    if ((memop & MO_SIZE) <= MO_64) {
+        return true;
+    }
+
+    /*
+     * Reject 16-byte memop with 16-byte atomicity,
+     * but do allow a pair of 64-bit operations.
+     */
+    aa = atom_and_align_for_opc(tcg_ctx, memop, MO_ATOM_IFALIGN, true);
+    return aa.atom <= MO_64;
 }
 
 /*
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
 {
     TCGLabelQemuLdst *ldst = NULL;
     MemOp opc = get_memop(oi);
-    MemOp a_bits;
+    MemOp a_bits, s_bits;
 
     /*
      * Book II, Section 1.4, Single-Copy Atomicity, specifies:
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
      * As of 3.0, "the non-atomic access is performed as described in
      * the corresponding list", which matches MO_ATOM_SUBALIGN.
      */
+    s_bits = opc & MO_SIZE;
     h->aa = atom_and_align_for_opc(s, opc,
                                    have_isa_3_00 ? MO_ATOM_SUBALIGN
                                                  : MO_ATOM_IFALIGN,
-                                   false);
+                                   s_bits == MO_128);
     a_bits = h->aa.align;
 
 #ifdef CONFIG_SOFTMMU
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
     int fast_off = TLB_MASK_TABLE_OFS(mem_index);
     int mask_off = fast_off + offsetof(CPUTLBDescFast, mask);
     int table_off = fast_off + offsetof(CPUTLBDescFast, table);
-    unsigned s_bits = opc & MO_SIZE;
 
     ldst = new_ldst_label(s);
     ldst->is_ld = is_ld;
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_st(TCGContext *s, TCGReg datalo, TCGReg datahi,
     }
 }
 
+static void tcg_out_qemu_ldst_i128(TCGContext *s, TCGReg datalo, TCGReg datahi,
+                                   TCGReg addr_reg, MemOpIdx oi, bool is_ld)
+{
+    TCGLabelQemuLdst *ldst;
+    HostAddress h;
+    bool need_bswap;
+    uint32_t insn;
+    TCGReg index;
+
+    ldst = prepare_host_addr(s, &h, addr_reg, -1, oi, is_ld);
+
+    /* Compose the final address, as LQ/STQ have no indexing. */
+    index = h.index;
+    if (h.base != 0) {
+        index = TCG_REG_TMP1;
+        tcg_out32(s, ADD | TAB(index, h.base, h.index));
+    }
+    need_bswap = get_memop(oi) & MO_BSWAP;
+
+    if (h.aa.atom == MO_128) {
+        tcg_debug_assert(!need_bswap);
+        tcg_debug_assert(datalo & 1);
+        tcg_debug_assert(datahi == datalo - 1);
+        insn = is_ld ? LQ : STQ;
+        tcg_out32(s, insn | TAI(datahi, index, 0));
+    } else {
+        TCGReg d1, d2;
+
+        if (HOST_BIG_ENDIAN ^ need_bswap) {
+            d1 = datahi, d2 = datalo;
+        } else {
+            d1 = datalo, d2 = datahi;
+        }
+
+        if (need_bswap) {
+            tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_R0, 8);
+            insn = is_ld ? LDBRX : STDBRX;
+            tcg_out32(s, insn | TAB(d1, 0, index));
+            tcg_out32(s, insn | TAB(d2, index, TCG_REG_R0));
+        } else {
+            insn = is_ld ? LD : STD;
+            tcg_out32(s, insn | TAI(d1, index, 0));
+            tcg_out32(s, insn | TAI(d2, index, 8));
+        }
+    }
+
+    if (ldst) {
+        ldst->type = TCG_TYPE_I128;
+        ldst->datalo_reg = datalo;
+        ldst->datahi_reg = datahi;
+        ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
+    }
+}
+
 static void tcg_out_nop_fill(tcg_insn_unit *p, int count)
 {
     int i;
@@ -XXX,XX +XXX,XX @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
                             args[4], TCG_TYPE_I64);
         }
         break;
+    case INDEX_op_qemu_ld_a32_i128:
+    case INDEX_op_qemu_ld_a64_i128:
+        tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
+        tcg_out_qemu_ldst_i128(s, args[0], args[1], args[2], args[3], true);
+        break;
 
     case INDEX_op_qemu_st_a64_i32:
         if (TCG_TARGET_REG_BITS == 32) {
@@ -XXX,XX +XXX,XX @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
                             args[4], TCG_TYPE_I64);
         }
         break;
+    case INDEX_op_qemu_st_a32_i128:
+    case INDEX_op_qemu_st_a64_i128:
+        tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
+        tcg_out_qemu_ldst_i128(s, args[0], args[1], args[2], args[3], false);
+        break;
 
     case INDEX_op_setcond_i32:
         tcg_out_setcond(s, TCG_TYPE_I32, args[3], args[0], args[1], args[2],
@@ -XXX,XX +XXX,XX @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
     case INDEX_op_qemu_st_a64_i64:
         return TCG_TARGET_REG_BITS == 64 ? C_O0_I2(r, r) : C_O0_I4(r, r, r, r);
 
+    case INDEX_op_qemu_ld_a32_i128:
+    case INDEX_op_qemu_ld_a64_i128:
+        return C_O2_I1(o, m, r);
+    case INDEX_op_qemu_st_a32_i128:
+    case INDEX_op_qemu_st_a64_i128:
+        return C_O0_I3(o, m, r);
+
     case INDEX_op_add_vec:
     case INDEX_op_sub_vec:
     case INDEX_op_mul_vec:
-- 
2.34.1

Use LPQ/STPQ when 16-byte atomicity is required.
Note that these instructions require 16-byte alignment.
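
When the address is not 16-byte aligned, the code falls back to a pair of
8-byte accesses.  The C sketch below models the alignment test and the
non-byteswapped fallback load, with the high half at offset 0 and the low
half at offset 8 as on a big-endian host.  The function names are
hypothetical, and bytes are assembled one at a time so that the model
behaves the same on any host.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Equivalent of testing the low four address bits. */
static bool is_aligned16(uintptr_t addr)
{
    return (addr & 15) == 0;
}

/* Read eight bytes as one big-endian value, independent of the host. */
static uint64_t load_be64(const unsigned char *p)
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++) {
        v = (v << 8) | p[i];
    }
    return v;
}

/* Fallback path: high half from offset 0, low half from offset 8. */
static void load_pair_be(const unsigned char *p, uint64_t *hi, uint64_t *lo)
{
    assert(p && hi && lo);
    *hi = load_be64(p);
    *lo = load_be64(p + 8);
}
```

The two halves are read separately here, so this models only the case in
which 16-byte atomicity is not required for the access.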

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/s390x/tcg-target-con-set.h |   2 +
 tcg/s390x/tcg-target.h         |   2 +-
 tcg/s390x/tcg-target.c.inc     | 107 ++++++++++++++++++++++++++++++++-
 3 files changed, 107 insertions(+), 4 deletions(-)

diff --git a/tcg/s390x/tcg-target-con-set.h b/tcg/s390x/tcg-target-con-set.h
index XXXXXXX..XXXXXXX 100644
--- a/tcg/s390x/tcg-target-con-set.h
+++ b/tcg/s390x/tcg-target-con-set.h
@@ -XXX,XX +XXX,XX @@ C_O0_I2(r, r)
 C_O0_I2(r, ri)
 C_O0_I2(r, rA)
 C_O0_I2(v, r)
+C_O0_I3(o, m, r)
 C_O1_I1(r, r)
 C_O1_I1(v, r)
 C_O1_I1(v, v)
@@ -XXX,XX +XXX,XX @@ C_O1_I2(v, v, v)
 C_O1_I3(v, v, v, v)
 C_O1_I4(r, r, ri, rI, r)
 C_O1_I4(r, r, rA, rI, r)
+C_O2_I1(o, m, r)
 C_O2_I2(o, m, 0, r)
 C_O2_I2(o, m, r, r)
 C_O2_I3(o, m, 0, 1, r)
diff --git a/tcg/s390x/tcg-target.h b/tcg/s390x/tcg-target.h
index XXXXXXX..XXXXXXX 100644
--- a/tcg/s390x/tcg-target.h
+++ b/tcg/s390x/tcg-target.h
@@ -XXX,XX +XXX,XX @@ extern uint64_t s390_facilities[3];
 #define TCG_TARGET_HAS_muluh_i64      0
 #define TCG_TARGET_HAS_mulsh_i64      0
 
-#define TCG_TARGET_HAS_qemu_ldst_i128 0
+#define TCG_TARGET_HAS_qemu_ldst_i128 1
 
 #define TCG_TARGET_HAS_v64            HAVE_FACILITY(VECTOR)
 #define TCG_TARGET_HAS_v128           HAVE_FACILITY(VECTOR)
diff --git a/tcg/s390x/tcg-target.c.inc b/tcg/s390x/tcg-target.c.inc
index XXXXXXX..XXXXXXX 100644
--- a/tcg/s390x/tcg-target.c.inc
+++ b/tcg/s390x/tcg-target.c.inc
@@ -XXX,XX +XXX,XX @@ typedef enum S390Opcode {
     RXY_LLGF    = 0xe316,
     RXY_LLGH    = 0xe391,
     RXY_LMG     = 0xeb04,
+    RXY_LPQ     = 0xe38f,
     RXY_LRV     = 0xe31e,
     RXY_LRVG    = 0xe30f,
     RXY_LRVH    = 0xe31f,
@@ -XXX,XX +XXX,XX @@ typedef enum S390Opcode {
     RXY_STG     = 0xe324,
     RXY_STHY    = 0xe370,
     RXY_STMG    = 0xeb24,
+    RXY_STPQ    = 0xe38e,
     RXY_STRV    = 0xe33e,
     RXY_STRVG   = 0xe32f,
     RXY_STRVH   = 0xe33f,
@@ -XXX,XX +XXX,XX @@ typedef struct {
 
 bool tcg_target_has_memory_bswap(MemOp memop)
 {
-    return true;
+    TCGAtomAlign aa;
+
+    if ((memop & MO_SIZE) <= MO_64) {
+        return true;
+    }
+
+    /*
+     * Reject 16-byte memop with 16-byte atomicity,
+     * but do allow a pair of 64-bit operations.
+     */
+    aa = atom_and_align_for_opc(tcg_ctx, memop, MO_ATOM_IFALIGN, true);
+    return aa.atom <= MO_64;
 }
 
 static void tcg_out_qemu_ld_direct(TCGContext *s, MemOp opc, TCGReg data,
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
 {
     TCGLabelQemuLdst *ldst = NULL;
     MemOp opc = get_memop(oi);
+    MemOp s_bits = opc & MO_SIZE;
     unsigned a_mask;
 
-    h->aa = atom_and_align_for_opc(s, opc, MO_ATOM_IFALIGN, false);
+    h->aa = atom_and_align_for_opc(s, opc, MO_ATOM_IFALIGN, s_bits == MO_128);
     a_mask = (1 << h->aa.align) - 1;
 
 #ifdef CONFIG_SOFTMMU
-    unsigned s_bits = opc & MO_SIZE;
     unsigned s_mask = (1 << s_bits) - 1;
     int mem_index = get_mmuidx(oi);
     int fast_off = TLB_MASK_TABLE_OFS(mem_index);
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_st(TCGContext* s, TCGReg data_reg, TCGReg addr_reg,
     }
 }
 
+static void tcg_out_qemu_ldst_i128(TCGContext *s, TCGReg datalo, TCGReg datahi,
+                                   TCGReg addr_reg, MemOpIdx oi, bool is_ld)
+{
+    TCGLabel *l1 = NULL, *l2 = NULL;
+    TCGLabelQemuLdst *ldst;
+    HostAddress h;
+    bool need_bswap;
+    bool use_pair;
+    S390Opcode insn;
+
+    ldst = prepare_host_addr(s, &h, addr_reg, oi, is_ld);
+
+    use_pair = h.aa.atom < MO_128;
+    need_bswap = get_memop(oi) & MO_BSWAP;
+
+    if (!use_pair) {
+        /*
+         * Atomicity requires we use LPQ.  If we've already checked for
+         * 16-byte alignment, that's all we need.  If we arrive with
+         * lesser alignment, we have determined that less than 16-byte
+         * alignment can be satisfied with two 8-byte loads.
+         */
+        if (h.aa.align < MO_128) {
+            use_pair = true;
+            l1 = gen_new_label();
+            l2 = gen_new_label();
+
+            tcg_out_insn(s, RI, TMLL, addr_reg, 15);
+            tgen_branch(s, 7, l1); /* CC in {1,2,3} */
+        }
+
+        tcg_debug_assert(!need_bswap);
+        tcg_debug_assert(datalo & 1);
+        tcg_debug_assert(datahi == datalo - 1);
+        insn = is_ld ? RXY_LPQ : RXY_STPQ;
+        tcg_out_insn_RXY(s, insn, datahi, h.base, h.index, h.disp);
+
+        if (use_pair) {
+            tgen_branch(s, S390_CC_ALWAYS, l2);
+            tcg_out_label(s, l1);
+        }
+    }
+    if (use_pair) {
+        TCGReg d1, d2;
+
+        if (need_bswap) {
+            d1 = datalo, d2 = datahi;
+            insn = is_ld ? RXY_LRVG : RXY_STRVG;
+        } else {
+            d1 = datahi, d2 = datalo;
+            insn = is_ld ? RXY_LG : RXY_STG;
+        }
+
+        if (h.base == d1 || h.index == d1) {
+            tcg_out_insn(s, RXY, LAY, TCG_TMP0, h.base, h.index, h.disp);
+            h.base = TCG_TMP0;
+            h.index = TCG_REG_NONE;
+            h.disp = 0;
+        }
+        tcg_out_insn_RXY(s, insn, d1, h.base, h.index, h.disp);
+        tcg_out_insn_RXY(s, insn, d2, h.base, h.index, h.disp + 8);
+    }
+    if (l2) {
+        tcg_out_label(s, l2);
+    }
+
+    if (ldst) {
+        ldst->type = TCG_TYPE_I128;
+        ldst->datalo_reg = datalo;
+        ldst->datahi_reg = datahi;
+        ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
+    }
+}
+
 static void tcg_out_exit_tb(TCGContext *s, uintptr_t a0)
 {
     /* Reuse the zeroing that exists for goto_ptr.  */
@@ -XXX,XX +XXX,XX @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
     case INDEX_op_qemu_st_a64_i64:
         tcg_out_qemu_st(s, args[0], args[1], args[2], TCG_TYPE_I64);
         break;
+    case INDEX_op_qemu_ld_a32_i128:
+    case INDEX_op_qemu_ld_a64_i128:
+        tcg_out_qemu_ldst_i128(s, args[0], args[1], args[2], args[3], true);
+        break;
+    case INDEX_op_qemu_st_a32_i128:
+    case INDEX_op_qemu_st_a64_i128:
+        tcg_out_qemu_ldst_i128(s, args[0], args[1], args[2], args[3], false);
+        break;
 
     case INDEX_op_ld16s_i64:
         tcg_out_mem(s, 0, RXY_LGH, args[0], args[1], TCG_REG_NONE, args[2]);
@@ -XXX,XX +XXX,XX @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
     case INDEX_op_qemu_st_a32_i32:
     case INDEX_op_qemu_st_a64_i32:
         return C_O0_I2(r, r);
+    case INDEX_op_qemu_ld_a32_i128:
+    case INDEX_op_qemu_ld_a64_i128:
+        return C_O2_I1(o, m, r);
+    case INDEX_op_qemu_st_a32_i128:
+    case INDEX_op_qemu_st_a64_i128:
+        return C_O0_I3(o, m, r);
 
     case INDEX_op_deposit_i32:
     case INDEX_op_deposit_i64:
-- 
2.34.1

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 .../generic/host/load-extract-al16-al8.h      | 45 +++++++++++++++++++
 accel/tcg/ldst_atomicity.c.inc                | 36 +--------------
 2 files changed, 47 insertions(+), 34 deletions(-)
 create mode 100644 host/include/generic/host/load-extract-al16-al8.h

diff --git a/host/include/generic/host/load-extract-al16-al8.h b/host/include/generic/host/load-extract-al16-al8.h
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/host/include/generic/host/load-extract-al16-al8.h
@@ -XXX,XX +XXX,XX @@
+/*
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ * Atomic extract 64 from 128-bit, generic version.
+ *
+ * Copyright (C) 2023 Linaro, Ltd.
+ */
+
+#ifndef HOST_LOAD_EXTRACT_AL16_AL8_H
+#define HOST_LOAD_EXTRACT_AL16_AL8_H
+
+/**
+ * load_atom_extract_al16_or_al8:
+ * @pv: host address
+ * @s: object size in bytes, @s <= 8.
+ *
+ * Load @s bytes from @pv, when pv % s != 0.  If [pv, pv+s-1] does not
+ * cross a 16-byte boundary then the access must be 16-byte atomic,
+ * otherwise the access must be 8-byte atomic.
+ */
+static inline uint64_t ATTRIBUTE_ATOMIC128_OPT
+load_atom_extract_al16_or_al8(void *pv, int s)
+{
+    uintptr_t pi = (uintptr_t)pv;
+    int o = pi & 7;
+    int shr = (HOST_BIG_ENDIAN ? 16 - s - o : o) * 8;
+    Int128 r;
+
+    pv = (void *)(pi & ~7);
+    if (pi & 8) {
+        uint64_t *p8 = __builtin_assume_aligned(pv, 16, 8);
+        uint64_t a = qatomic_read__nocheck(p8);
+        uint64_t b = qatomic_read__nocheck(p8 + 1);
+
+        if (HOST_BIG_ENDIAN) {
+            r = int128_make128(b, a);
+        } else {
+            r = int128_make128(a, b);
+        }
+    } else {
+        r = atomic16_read_ro(pv);
+    }
+    return int128_getlo(int128_urshift(r, shr));
+}
+
+#endif /* HOST_LOAD_EXTRACT_AL16_AL8_H */
diff --git a/accel/tcg/ldst_atomicity.c.inc b/accel/tcg/ldst_atomicity.c.inc
index XXXXXXX..XXXXXXX 100644
--- a/accel/tcg/ldst_atomicity.c.inc
+++ b/accel/tcg/ldst_atomicity.c.inc
@@ -XXX,XX +XXX,XX @@
  * See the COPYING file in the top-level directory.
  */
 
+#include "host/load-extract-al16-al8.h"
+
 #ifdef CONFIG_ATOMIC64
 # define HAVE_al8          true
 #else
@@ -XXX,XX +XXX,XX @@ static uint64_t load_atom_extract_al16_or_exit(CPUArchState *env, uintptr_t ra,
     return int128_getlo(r);
 }
 
-/**
- * load_atom_extract_al16_or_al8:
- * @p: host address
- * @s: object size in bytes, @s <= 8.
- *
- * Load @s bytes from @p, when p % s != 0.  If [p, p+s-1] does not
- * cross an 16-byte boundary then the access must be 16-byte atomic,
- * otherwise the access must be 8-byte atomic.
- */
-static inline uint64_t ATTRIBUTE_ATOMIC128_OPT
-load_atom_extract_al16_or_al8(void *pv, int s)
-{
-    uintptr_t pi = (uintptr_t)pv;
-    int o = pi & 7;
-    int shr = (HOST_BIG_ENDIAN ? 16 - s - o : o) * 8;
-    Int128 r;
-
-    pv = (void *)(pi & ~7);
-    if (pi & 8) {
-        uint64_t *p8 = __builtin_assume_aligned(pv, 16, 8);
-        uint64_t a = qatomic_read__nocheck(p8);
-        uint64_t b = qatomic_read__nocheck(p8 + 1);
-
-        if (HOST_BIG_ENDIAN) {
-            r = int128_make128(b, a);
-        } else {
-            r = int128_make128(a, b);
-        }
-    } else {
-        r = atomic16_read_ro(pv);
-    }
-    return int128_getlo(int128_urshift(r, shr));
-}
-
 /**
  * load_atom_4_by_2:
  * @pv: host address
-- 
2.34.1

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 host/include/generic/host/store-insert-al16.h | 50 +++++++++++++++++++
 accel/tcg/ldst_atomicity.c.inc                | 40 +--------------
 2 files changed, 51 insertions(+), 39 deletions(-)
 create mode 100644 host/include/generic/host/store-insert-al16.h

diff --git a/host/include/generic/host/store-insert-al16.h b/host/include/generic/host/store-insert-al16.h
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/host/include/generic/host/store-insert-al16.h
@@ -XXX,XX +XXX,XX @@
+/*
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ * Atomic store insert into 128-bit, generic version.
+ *
+ * Copyright (C) 2023 Linaro, Ltd.
+ */
+
+#ifndef HOST_STORE_INSERT_AL16_H
+#define HOST_STORE_INSERT_AL16_H
+
+/**
+ * store_atom_insert_al16:
+ * @ps: host address
+ * @val: shifted value to store
+ * @msk: mask for value to store
+ *
+ * Atomically store @val to @ps masked by @msk.
+ */
+static inline void ATTRIBUTE_ATOMIC128_OPT
+store_atom_insert_al16(Int128 *ps, Int128 val, Int128 msk)
+{
+#if defined(CONFIG_ATOMIC128)
+    __uint128_t *pu;
+    Int128Alias old, new;
+
+    /* With CONFIG_ATOMIC128, we can avoid the memory barriers. */
+    pu = __builtin_assume_aligned(ps, 16);
+    old.u = *pu;
+    msk = int128_not(msk);
+    do {
+        new.s = int128_and(old.s, msk);
+        new.s = int128_or(new.s, val);
+    } while (!__atomic_compare_exchange_n(pu, &old.u, new.u, true,
+                                          __ATOMIC_RELAXED, __ATOMIC_RELAXED));
+#else
+    Int128 old, new, cmp;
+
+    ps = __builtin_assume_aligned(ps, 16);
+    old = *ps;
+    msk = int128_not(msk);
+    do {
+        cmp = old;
+        new = int128_and(old, msk);
+        new = int128_or(new, val);
+        old = atomic16_cmpxchg(ps, cmp, new);
+    } while (int128_ne(cmp, old));
+#endif
+}
+
+#endif /* HOST_STORE_INSERT_AL16_H */
diff --git a/accel/tcg/ldst_atomicity.c.inc b/accel/tcg/ldst_atomicity.c.inc
index XXXXXXX..XXXXXXX 100644
--- a/accel/tcg/ldst_atomicity.c.inc
+++ b/accel/tcg/ldst_atomicity.c.inc
@@ -XXX,XX +XXX,XX @@
  */
 
 #include "host/load-extract-al16-al8.h"
+#include "host/store-insert-al16.h"
 
 #ifdef CONFIG_ATOMIC64
 # define HAVE_al8          true
@@ -XXX,XX +XXX,XX @@ static void store_atom_insert_al8(uint64_t *p, uint64_t val, uint64_t msk)
                                           __ATOMIC_RELAXED, __ATOMIC_RELAXED));
 }
 
-/**
- * store_atom_insert_al16:
- * @p: host address
- * @val: shifted value to store
- * @msk: mask for value to store
- *
- * Atomically store @val to @p masked by @msk.
- */
-static void ATTRIBUTE_ATOMIC128_OPT
-store_atom_insert_al16(Int128 *ps, Int128Alias val, Int128Alias msk)
-{
-#if defined(CONFIG_ATOMIC128)
-    __uint128_t *pu, old, new;
-
-    /* With CONFIG_ATOMIC128, we can avoid the memory barriers. */
-    pu = __builtin_assume_aligned(ps, 16);
-    old = *pu;
-    do {
-        new = (old & ~msk.u) | val.u;
-    } while (!__atomic_compare_exchange_n(pu, &old, new, true,
-                                          __ATOMIC_RELAXED, __ATOMIC_RELAXED));
-#elif defined(CONFIG_CMPXCHG128)
-    __uint128_t *pu, old, new;
-
-    /*
-     * Without CONFIG_ATOMIC128, __atomic_compare_exchange_n will always
-     * defer to libatomic, so we must use __sync_*_compare_and_swap_16
-     * and accept the sequential consistency that comes with it.
-     */
-    pu = __builtin_assume_aligned(ps, 16);
-    do {
-        old = *pu;
-        new = (old & ~msk.u) | val.u;
-    } while (!__sync_bool_compare_and_swap_16(pu, old, new));
-#else
-    qemu_build_not_reached();
-#endif
-}
-
 /**
  * store_bytes_leN:
  * @pv: host address
-- 
2.34.1

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 .../x86_64/host/load-extract-al16-al8.h       | 50 +++++++++++++++++++
 1 file changed, 50 insertions(+)
 create mode 100644 host/include/x86_64/host/load-extract-al16-al8.h

diff --git a/host/include/x86_64/host/load-extract-al16-al8.h b/host/include/x86_64/host/load-extract-al16-al8.h
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/host/include/x86_64/host/load-extract-al16-al8.h
@@ -XXX,XX +XXX,XX @@
+/*
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ * Atomic extract 64 from 128-bit, x86_64 version.
+ *
+ * Copyright (C) 2023 Linaro, Ltd.
+ */
+
+#ifndef X86_64_LOAD_EXTRACT_AL16_AL8_H
+#define X86_64_LOAD_EXTRACT_AL16_AL8_H
+
+#ifdef CONFIG_INT128_TYPE
+#include "host/cpuinfo.h"
+
+/**
+ * load_atom_extract_al16_or_al8:
+ * @pv: host address
+ * @s: object size in bytes, @s <= 8.
+ *
+ * Load @s bytes from @pv, when pv % s != 0.  If [pv, pv+s-1] does not
+ * cross a 16-byte boundary then the access must be 16-byte atomic,
+ * otherwise the access must be 8-byte atomic.
+ */
+static inline uint64_t ATTRIBUTE_ATOMIC128_OPT
+load_atom_extract_al16_or_al8(void *pv, int s)
+{
+    uintptr_t pi = (uintptr_t)pv;
+    __int128_t *ptr_align = (__int128_t *)(pi & ~7);
+    int shr = (pi & 7) * 8;
+    Int128Alias r;
+
+    /*
+     * ptr_align % 16 is now only 0 or 8.
+     * If the host supports atomic loads with VMOVDQU, then always use that,
+     * making the branch highly predictable.  Otherwise we must use VMOVDQA
+     * when ptr_align % 16 == 0 for 16-byte atomicity.
+     */
+    if ((cpuinfo & CPUINFO_ATOMIC_VMOVDQU) || (pi & 8)) {
+        asm("vmovdqu %1, %0" : "=x" (r.i) : "m" (*ptr_align));
+    } else {
+        asm("vmovdqa %1, %0" : "=x" (r.i) : "m" (*ptr_align));
+    }
+    return int128_getlo(int128_urshift(r.s, shr));
+}
+#else
+/* Fallback definition that must be optimized away, or error.  */
+uint64_t QEMU_ERROR("unsupported atomic")
+    load_atom_extract_al16_or_al8(void *pv, int s);
+#endif
+
+#endif /* X86_64_LOAD_EXTRACT_AL16_AL8_H */
-- 
2.34.1

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 .../aarch64/host/load-extract-al16-al8.h      | 40 +++++++++++++++++++
 1 file changed, 40 insertions(+)
 create mode 100644 host/include/aarch64/host/load-extract-al16-al8.h

diff --git a/host/include/aarch64/host/load-extract-al16-al8.h b/host/include/aarch64/host/load-extract-al16-al8.h
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/host/include/aarch64/host/load-extract-al16-al8.h
@@ -XXX,XX +XXX,XX @@
+/*
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ * Atomic extract 64 from 128-bit, AArch64 version.
+ *
+ * Copyright (C) 2023 Linaro, Ltd.
+ */
+
+#ifndef AARCH64_LOAD_EXTRACT_AL16_AL8_H
+#define AARCH64_LOAD_EXTRACT_AL16_AL8_H
+
+#include "host/cpuinfo.h"
+#include "tcg/debug-assert.h"
+
+/**
+ * load_atom_extract_al16_or_al8:
+ * @pv: host address
+ * @s: object size in bytes, @s <= 8.
+ *
+ * Load @s bytes from @pv, when pv % s != 0.  If [pv, pv+s-1] does not
+ * cross a 16-byte boundary then the access must be 16-byte atomic,
+ * otherwise the access must be 8-byte atomic.
+ */
+static inline uint64_t load_atom_extract_al16_or_al8(void *pv, int s)
+{
+    uintptr_t pi = (uintptr_t)pv;
+    __int128_t *ptr_align = (__int128_t *)(pi & ~7);
+    int shr = (pi & 7) * 8;
+    uint64_t l, h;
+
+    /*
+     * With FEAT_LSE2, LDP is single-copy atomic if 16-byte aligned
+     * and single-copy atomic on the parts if 8-byte aligned.
+     * All we need do is align the pointer mod 8.
+     */
+    tcg_debug_assert(HAVE_ATOMIC128_RO);
+    asm("ldp %0, %1, %2" : "=r"(l), "=r"(h) : "m"(*ptr_align));
+    return (l >> shr) | (h << (-shr & 63));
+}
+
+#endif /* AARCH64_LOAD_EXTRACT_AL16_AL8_H */
-- 
2.34.1

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 host/include/aarch64/host/store-insert-al16.h | 47 +++++++++++++++++++
 1 file changed, 47 insertions(+)
 create mode 100644 host/include/aarch64/host/store-insert-al16.h

diff --git a/host/include/aarch64/host/store-insert-al16.h b/host/include/aarch64/host/store-insert-al16.h
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/host/include/aarch64/host/store-insert-al16.h
@@ -XXX,XX +XXX,XX @@
+/*
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ * Atomic store insert into 128-bit, AArch64 version.
+ *
+ * Copyright (C) 2023 Linaro, Ltd.
+ */
+
+#ifndef AARCH64_STORE_INSERT_AL16_H
+#define AARCH64_STORE_INSERT_AL16_H
+
+/**
+ * store_atom_insert_al16:
+ * @ps: host address
+ * @val: shifted value to store
+ * @msk: mask for value to store
+ *
+ * Atomically store @val to @ps masked by @msk.
+ */
+static inline void ATTRIBUTE_ATOMIC128_OPT
+store_atom_insert_al16(Int128 *ps, Int128 val, Int128 msk)
+{
+    /*
+     * GCC only implements __sync* primitives for int128 on aarch64.
+     * We can do better without the barriers, and integrating the
+     * arithmetic into the load-exclusive/store-conditional pair.
+     */
+    uint64_t tl, th, vl, vh, ml, mh;
+    uint32_t fail;
+
+    qemu_build_assert(!HOST_BIG_ENDIAN);
+    vl = int128_getlo(val);
+    vh = int128_gethi(val);
+    ml = int128_getlo(msk);
+    mh = int128_gethi(msk);
+
+    asm("0: ldxp %[l], %[h], %[mem]\n\t"
+        "bic %[l], %[l], %[ml]\n\t"
+        "bic %[h], %[h], %[mh]\n\t"
+        "orr %[l], %[l], %[vl]\n\t"
+        "orr %[h], %[h], %[vh]\n\t"
+        "stxp %w[f], %[l], %[h], %[mem]\n\t"
+        "cbnz %w[f], 0b\n"
+        : [mem] "+Q"(*ps), [f] "=&r"(fail), [l] "=&r"(tl), [h] "=&r"(th)
+        : [vl] "r"(vl), [vh] "r"(vh), [ml] "r"(ml), [mh] "r"(mh));
+}
+
+#endif /* AARCH64_STORE_INSERT_AL16_H */
-- 
2.34.1

The last use was removed by e77c89fb086a.

Fixes: e77c89fb086a ("cputlb: Remove static tlb sizing")
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/aarch64/tcg-target.h | 1 -
 tcg/arm/tcg-target.h     | 1 -
 tcg/i386/tcg-target.h    | 1 -
 tcg/mips/tcg-target.h    | 1 -
 tcg/ppc/tcg-target.h     | 1 -
 tcg/riscv/tcg-target.h   | 1 -
 tcg/s390x/tcg-target.h   | 1 -
 tcg/sparc64/tcg-target.h | 1 -
 tcg/tci/tcg-target.h     | 1 -
 9 files changed, 9 deletions(-)

diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h
index XXXXXXX..XXXXXXX 100644
--- a/tcg/aarch64/tcg-target.h
+++ b/tcg/aarch64/tcg-target.h
@@ -XXX,XX +XXX,XX @@
 #include "host/cpuinfo.h"
 
 #define TCG_TARGET_INSN_UNIT_SIZE  4
-#define TCG_TARGET_TLB_DISPLACEMENT_BITS 24
 #define MAX_CODE_GEN_BUFFER_SIZE  ((size_t)-1)
 
 typedef enum {
diff --git a/tcg/arm/tcg-target.h b/tcg/arm/tcg-target.h
index XXXXXXX..XXXXXXX 100644
--- a/tcg/arm/tcg-target.h
+++ b/tcg/arm/tcg-target.h
@@ -XXX,XX +XXX,XX @@ extern int arm_arch;
 #define use_armv7_instructions  (__ARM_ARCH >= 7 || arm_arch >= 7)
 
 #define TCG_TARGET_INSN_UNIT_SIZE 4
-#define TCG_TARGET_TLB_DISPLACEMENT_BITS 16
 #define MAX_CODE_GEN_BUFFER_SIZE  UINT32_MAX
 
 typedef enum {
diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
index XXXXXXX..XXXXXXX 100644
--- a/tcg/i386/tcg-target.h
+++ b/tcg/i386/tcg-target.h
@@ -XXX,XX +XXX,XX @@
 #include "host/cpuinfo.h"
 
 #define TCG_TARGET_INSN_UNIT_SIZE  1
-#define TCG_TARGET_TLB_DISPLACEMENT_BITS 31
 
 #ifdef __x86_64__
 # define TCG_TARGET_REG_BITS  64
diff --git a/tcg/mips/tcg-target.h b/tcg/mips/tcg-target.h
index XXXXXXX..XXXXXXX 100644
--- a/tcg/mips/tcg-target.h
+++ b/tcg/mips/tcg-target.h
@@ -XXX,XX +XXX,XX @@
 #endif
 
 #define TCG_TARGET_INSN_UNIT_SIZE 4
-#define TCG_TARGET_TLB_DISPLACEMENT_BITS 16
 #define TCG_TARGET_NB_REGS 32
 
 #define MAX_CODE_GEN_BUFFER_SIZE  ((size_t)-1)
diff --git a/tcg/ppc/tcg-target.h b/tcg/ppc/tcg-target.h
index XXXXXXX..XXXXXXX 100644
--- a/tcg/ppc/tcg-target.h
+++ b/tcg/ppc/tcg-target.h
@@ -XXX,XX +XXX,XX @@
 
 #define TCG_TARGET_NB_REGS 64
 #define TCG_TARGET_INSN_UNIT_SIZE 4
-#define TCG_TARGET_TLB_DISPLACEMENT_BITS 16
 
 typedef enum {
     TCG_REG_R0,  TCG_REG_R1,  TCG_REG_R2,  TCG_REG_R3,
diff --git a/tcg/riscv/tcg-target.h b/tcg/riscv/tcg-target.h
index XXXXXXX..XXXXXXX 100644
--- a/tcg/riscv/tcg-target.h
+++ b/tcg/riscv/tcg-target.h
@@ -XXX,XX +XXX,XX @@
 #define TCG_TARGET_REG_BITS 64
 
 #define TCG_TARGET_INSN_UNIT_SIZE 4
-#define TCG_TARGET_TLB_DISPLACEMENT_BITS 20
 #define TCG_TARGET_NB_REGS 32
 #define MAX_CODE_GEN_BUFFER_SIZE  ((size_t)-1)
 
diff --git a/tcg/s390x/tcg-target.h b/tcg/s390x/tcg-target.h
index XXXXXXX..XXXXXXX 100644
--- a/tcg/s390x/tcg-target.h
+++ b/tcg/s390x/tcg-target.h
@@ -XXX,XX +XXX,XX @@
 #define S390_TCG_TARGET_H
 
 #define TCG_TARGET_INSN_UNIT_SIZE 2
-#define TCG_TARGET_TLB_DISPLACEMENT_BITS 19
 
 /* We have a +- 4GB range on the branches; leave some slop.  */
 #define MAX_CODE_GEN_BUFFER_SIZE  (3 * GiB)
diff --git a/tcg/sparc64/tcg-target.h b/tcg/sparc64/tcg-target.h
index XXXXXXX..XXXXXXX 100644
--- a/tcg/sparc64/tcg-target.h
+++ b/tcg/sparc64/tcg-target.h
@@ -XXX,XX +XXX,XX @@
 #define SPARC_TCG_TARGET_H
 
 #define TCG_TARGET_INSN_UNIT_SIZE 4
-#define TCG_TARGET_TLB_DISPLACEMENT_BITS 32
 #define TCG_TARGET_NB_REGS 32
 #define MAX_CODE_GEN_BUFFER_SIZE  (2 * GiB)
 
diff --git a/tcg/tci/tcg-target.h b/tcg/tci/tcg-target.h
index XXXXXXX..XXXXXXX 100644
--- a/tcg/tci/tcg-target.h
+++ b/tcg/tci/tcg-target.h
@@ -XXX,XX +XXX,XX @@
 
 #define TCG_TARGET_INTERPRETER 1
 #define TCG_TARGET_INSN_UNIT_SIZE 4
-#define TCG_TARGET_TLB_DISPLACEMENT_BITS 32
 #define MAX_CODE_GEN_BUFFER_SIZE  ((size_t)-1)
 
 #if UINTPTR_MAX == UINT32_MAX
-- 
2.34.1

Add a --test-for-error option that inverts the exit code, for use with
the testsuite.
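
As a rough illustration of the inverted-exit convention (a standalone
sketch, not the code in decodetree.py; the helper name exit_status is
hypothetical):

```python
def exit_status(had_error, test_for_error=False):
    """Return the process exit status for a run of the generator.

    Hypothetical helper: without test_for_error, an error exits 1 and
    a clean run exits 0.  With test_for_error, the two are swapped, so
    a test that expects a diagnostic passes only when one is raised.
    """
    if had_error:
        return 0 if test_for_error else 1
    return 1 if test_for_error else 0


# Normal mode: errors fail, clean runs succeed.
normal = (exit_status(True), exit_status(False))
# Test-for-error mode: an expected error counts as success.
inverted = (exit_status(True, True), exit_status(False, True))
```

With this convention a test harness can treat exit status 0 as a pass
for both the err_* and succ_* inputs.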

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 scripts/decodetree.py | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/scripts/decodetree.py b/scripts/decodetree.py
index XXXXXXX..XXXXXXX 100644
--- a/scripts/decodetree.py
+++ b/scripts/decodetree.py
@@ -XXX,XX +XXX,XX @@
 formats = {}
 allpatterns = []
 anyextern = False
+testforerror = False
 
 translate_prefix = 'trans'
 translate_scope = 'static '
@@ -XXX,XX +XXX,XX @@ def error_with_file(file, lineno, *args):
     if output_file and output_fd:
         output_fd.close()
         os.remove(output_file)
-    exit(1)
+    exit(0 if testforerror else 1)
 # end error_with_file
 
 
@@ -XXX,XX +XXX,XX @@ def main():
     global bitop_width
     global variablewidth
     global anyextern
+    global testforerror
 
     decode_scope = 'static '
 
     long_opts = ['decode=', 'translate=', 'output=', 'insnwidth=',
-                 'static-decode=', 'varinsnwidth=']
+                 'static-decode=', 'varinsnwidth=', 'test-for-error']
     try:
         (opts, args) = getopt.gnu_getopt(sys.argv[1:], 'o:vw:', long_opts)
     except getopt.GetoptError as err:
@@ -XXX,XX +XXX,XX @@ def main():
                 bitop_width = 64
             elif insnwidth != 32:
                 error(0, 'cannot handle insns of width', insnwidth)
+        elif o == '--test-for-error':
+            testforerror = True
         else:
             assert False, 'unhandled option'
 
@@ -XXX,XX +XXX,XX @@ def main():
 
     if output_file:
         output_fd.close()
+    exit(1 if testforerror else 0)
 # end main
 
 
-- 
2.34.1

Test err_pattern_group_empty.decode failed with exception:

Traceback (most recent call last):
  File "./scripts/decodetree.py", line 1424, in <module>
    main()
  File "./scripts/decodetree.py", line 1342, in main
    toppat.build_tree()
  File "./scripts/decodetree.py", line 627, in build_tree
    self.tree = self.__build_tree(self.pats, self.fixedbits,
  File "./scripts/decodetree.py", line 607, in __build_tree
    fb = i.fixedbits & innermask
TypeError: unsupported operand type(s) for &: 'NoneType' and 'int'

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 scripts/decodetree.py | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/scripts/decodetree.py b/scripts/decodetree.py
index XXXXXXX..XXXXXXX 100644
--- a/scripts/decodetree.py
+++ b/scripts/decodetree.py
@@ -XXX,XX +XXX,XX @@ def output_code(self, i, extracted, outerbits, outermask):
                 output(ind, '}\n')
             else:
                 p.output_code(i, extracted, p.fixedbits, p.fixedmask)
+
+    def build_tree(self):
+        if not self.pats:
+            error_with_file(self.file, self.lineno, 'empty pattern group')
+        super().build_tree()
+
 #end IncMultiPattern
 
 
-- 
2.34.1

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tests/decode/check.sh    | 24 ----------------
 tests/decode/meson.build | 59 ++++++++++++++++++++++++++++++++++++++++
 tests/meson.build        |  5 +---
 3 files changed, 60 insertions(+), 28 deletions(-)
 delete mode 100755 tests/decode/check.sh
 create mode 100644 tests/decode/meson.build

diff --git a/tests/decode/check.sh b/tests/decode/check.sh
deleted file mode 100755
index XXXXXXX..XXXXXXX
--- a/tests/decode/check.sh
+++ /dev/null
@@ -XXX,XX +XXX,XX @@
-#!/bin/sh
-# This work is licensed under the terms of the GNU LGPL, version 2 or later.
-# See the COPYING.LIB file in the top-level directory.
-
-PYTHON=$1
-DECODETREE=$2
-E=0
-
-# All of these tests should produce errors
-for i in err_*.decode; do
-    if $PYTHON $DECODETREE $i > /dev/null 2> /dev/null; then
-        # Pass, aka failed to fail.
-        echo FAIL: $i 1>&2
-        E=1
-    fi
-done
-
-for i in succ_*.decode; do
-    if ! $PYTHON $DECODETREE $i > /dev/null 2> /dev/null; then
-        echo FAIL:$i 1>&2
-    fi
-done
-
-exit $E
diff --git a/tests/decode/meson.build b/tests/decode/meson.build
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/tests/decode/meson.build
@@ -XXX,XX +XXX,XX @@
+err_tests = [
+    'err_argset1.decode',
+    'err_argset2.decode',
+    'err_field1.decode',
+    'err_field2.decode',
+    'err_field3.decode',
+    'err_field4.decode',
+    'err_field5.decode',
+    'err_field6.decode',
+    'err_init1.decode',
+    'err_init2.decode',
+    'err_init3.decode',
+    'err_init4.decode',
+    'err_overlap1.decode',
+    'err_overlap2.decode',
+    'err_overlap3.decode',
+    'err_overlap4.decode',
+    'err_overlap5.decode',
+    'err_overlap6.decode',
+    'err_overlap7.decode',
+    'err_overlap8.decode',
+    'err_overlap9.decode',
+    'err_pattern_group_empty.decode',
+    'err_pattern_group_ident1.decode',
+    'err_pattern_group_ident2.decode',
+    'err_pattern_group_nest1.decode',
+    'err_pattern_group_nest2.decode',
+    'err_pattern_group_nest3.decode',
+    'err_pattern_group_overlap1.decode',
+    'err_width1.decode',
+    'err_width2.decode',
+    'err_width3.decode',
+    'err_width4.decode',
+]
+
+succ_tests = [
+    'succ_argset_type1.decode',
+    'succ_function.decode',
+    'succ_ident1.decode',
+    'succ_pattern_group_nest1.decode',
+    'succ_pattern_group_nest2.decode',
+    'succ_pattern_group_nest3.decode',
+    'succ_pattern_group_nest4.decode',
+]
+
+suite = 'decodetree'
+decodetree = find_program(meson.project_source_root() / 'scripts/decodetree.py')
+
+foreach t: err_tests
+    test(fs.replace_suffix(t, ''),
+         decodetree, args: ['-o', '/dev/null', '--test-for-error', files(t)],
+         suite: suite)
+endforeach
+
+foreach t: succ_tests
+    test(fs.replace_suffix(t, ''),
+         decodetree, args: ['-o', '/dev/null', files(t)],
+         suite: suite)
+endforeach
diff --git a/tests/meson.build b/tests/meson.build
index XXXXXXX..XXXXXXX 100644
--- a/tests/meson.build
+++ b/tests/meson.build
@@ -XXX,XX +XXX,XX @@ if have_tools and have_vhost_user and 'CONFIG_LINUX' in config_host
              dependencies: [qemuutil, vhost_user])
 endif
 
-test('decodetree', sh,
-     args: [ files('decode/check.sh'), config_host['PYTHON'], files('../scripts/decodetree.py') ],
-     workdir: meson.current_source_dir() / 'decode',
-     suite: 'decodetree')
+subdir('decode')
 
 if 'CONFIG_TCG' in config_all
   subdir('fp')
-- 
2.34.1

From: Peter Maydell <peter.maydell@linaro.org>

Document the named field syntax that we want to implement for the
decodetree script.  This allows a field to be defined in terms of
some other field that the instruction pattern has already set, for
example:

%sz_imm 10:3 sz:3 !function=expand_sz_imm

to allow a function to be passed both an immediate field from the
instruction and also a sz value which might have been specified by
the instruction pattern directly (sz=1, etc) rather than being a
simple field within the instruction.
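
A minimal sketch of the value such a definition would hand to the
function, assuming the listed fields are concatenated with the first
one most significant (as for existing multi-part fields).  The helper
names extract and sz_imm_raw are illustrative only, and the behaviour
of expand_sz_imm itself is not specified here:

```python
def extract(value, pos, length):
    """Unsigned bitfield extract: 'length' bits starting at bit 'pos'."""
    return (value >> pos) & ((1 << length) - 1)


def sz_imm_raw(insn, sz):
    """Concatenation for '%sz_imm 10:3 sz:3' before !function is applied.

    insn is the instruction word; sz is the already-decoded named field,
    which may have been set directly by the pattern (sz=1, etc).
    """
    return (extract(insn, 10, 3) << 3) | extract(sz, 0, 3)


# Immediate bits [12:10] = 0b101 and sz = 1 give 0b101_001.
raw = sz_imm_raw(0b101 << 10, 1)
```

The result would then be passed through expand_sz_imm() together with
the DisasContext.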

Note that the restriction on not having the format referring to the
pattern and the pattern referring to the format simultaneously is a
restriction of the decoder generator rather than inherently being a
silly thing to do.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20230523120447.728365-3-peter.maydell@linaro.org>
---
 docs/devel/decodetree.rst | 33 ++++++++++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 5 deletions(-)

diff --git a/docs/devel/decodetree.rst b/docs/devel/decodetree.rst
index XXXXXXX..XXXXXXX 100644
--- a/docs/devel/decodetree.rst
+++ b/docs/devel/decodetree.rst
@@ -XXX,XX +XXX,XX @@ Fields
 
 Syntax::
 
-  field_def     := '%' identifier ( unnamed_field )* ( !function=identifier )?
+  field_def     := '%' identifier ( field )* ( !function=identifier )?
+  field         := unnamed_field | named_field
   unnamed_field := number ':' ( 's' ) number
+  named_field   := identifier ':' ( 's' ) number
 
 For *unnamed_field*, the first number is the least-significant bit position
 of the field and the second number is the length of the field.  If the 's' is
-present, the field is considered signed.  If multiple ``unnamed_fields`` are
-present, they are concatenated.  In this way one can define disjoint fields.
+present, the field is considered signed.
+
+A *named_field* refers to some other field in the instruction pattern
+or format. Regardless of the length of the other field where it is
+defined, it will be inserted into this field with the specified
+signedness and bit width.
+
+Field definitions that involve loops (i.e. where a field is defined
+directly or indirectly in terms of itself) are errors.
+
+A format can include fields that refer to named fields that are
+defined in the instruction pattern(s) that use the format.
+Conversely, an instruction pattern can include fields that refer to
+named fields that are defined in the format it uses. However, you
+cannot currently do both at once (i.e. pattern P uses format F; F has
+a field A that refers to a named field B that is defined in P, and P
+has a field C that refers to a named field D that is defined in F).
+
+If multiple ``fields`` are present, they are concatenated.
+In this way one can define disjoint fields.
 
 If ``!function`` is specified, the concatenated result is passed through the
 named function, taking and returning an integral value.
 
-One may use ``!function`` with zero ``unnamed_fields``.  This case is called
+One may use ``!function`` with zero ``fields``.  This case is called
 a *parameter*, and the named function is only passed the ``DisasContext``
 and returns an integral value extracted from there.
 
-A field with no ``unnamed_fields`` and no ``!function`` is in error.
+A field with no ``fields`` and no ``!function`` is in error.
 
 Field examples:
 
@@ -XXX,XX +XXX,XX @@ Field examples:
 | %shimm8 5:s8 13:1         | expand_shimm8(sextract(i, 5, 8) << 1 |      |
 |   !function=expand_shimm8 |               extract(i, 13, 1))            |
 +---------------------------+---------------------------------------------+
+| %sz_imm 10:2 sz:3         | expand_sz_imm(extract(i, 10, 2) << 3 |      |
+|   !function=expand_sz_imm |               extract(a->sz, 0, 3))         |
++---------------------------+---------------------------------------------+
 
 Argument Sets
 =============
-- 
2.34.1

From: Peter Maydell <peter.maydell@linaro.org>

To support referring to other named fields in field definitions, we
need to pass the str_extract() method a function which tells it how
to emit the code for a previously initialized named field.  (In
Pattern::output_code() the other field will be "u.f_foo.field", and
in Format::output_extract() it is "a->field".)

Refactor the two callsites that currently do "output code to
initialize each field", and have them pass a lambda that defines how
to format the lvalue in each case.  This is then used both in
emitting the LHS of the assignment and also passed down to
str_extract() as a new argument (unused at the moment, but will be
used in the following patch).
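
The shape of the refactoring can be sketched in isolation as below.
This is a simplified stand-in, not the script's classes: fields are
modelled as callables that take the formatter and return a C
expression string, and the 'sz' reference is an assumed example of how
a later patch might use the formatter.

```python
def output_fields(fields, lvalue_formatter):
    """Return one C assignment per field, using lvalue_formatter for names."""
    lines = []
    for name, str_extract in fields.items():
        lines.append(f"{lvalue_formatter(name)} = "
                     f"{str_extract(lvalue_formatter)};")
    return lines


fields = {
    # A plain field ignores the formatter.
    "rd": lambda fmt: "extract32(insn, 0, 5)",
    # A field referring to another named field (assumed form) uses it.
    "imm": lambda fmt: f"extract32({fmt('sz')}, 0, 3)",
}

# Format::output_extract() style: fields live in the argument struct.
format_lines = output_fields(fields, lambda n: "a->" + n)
# Pattern::output_code() style: fields live in the decode union.
pattern_lines = output_fields(fields, lambda n: "u.f_foo." + n)
```

Passing the formatter down means the same field object can emit
correct code at either callsite.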

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20230523120447.728365-4-peter.maydell@linaro.org>
---
 scripts/decodetree.py | 26 +++++++++++++++-----------
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/scripts/decodetree.py b/scripts/decodetree.py
index XXXXXXX..XXXXXXX 100644
--- a/scripts/decodetree.py
+++ b/scripts/decodetree.py
@@ -XXX,XX +XXX,XX @@ def __str__(self):
             s = ''
         return str(self.pos) + ':' + s + str(self.len)
 
-    def str_extract(self):
+    def str_extract(self, lvalue_formatter):
         global bitop_width
         s = 's' if self.sign else ''
         return f'{s}extract{bitop_width}(insn, {self.pos}, {self.len})'
@@ -XXX,XX +XXX,XX @@ def __init__(self, subs, mask):
     def __str__(self):
         return str(self.subs)
 
-    def str_extract(self):
+    def str_extract(self, lvalue_formatter):
         global bitop_width
         ret = '0'
         pos = 0
         for f in reversed(self.subs):
-            ext = f.str_extract()
+            ext = f.str_extract(lvalue_formatter)
             if pos == 0:
                 ret = ext
             else:
@@ -XXX,XX +XXX,XX @@ def __init__(self, value):
     def __str__(self):
         return str(self.value)
 
-    def str_extract(self):
+    def str_extract(self, lvalue_formatter):
         return str(self.value)
 
     def __cmp__(self, other):
@@ -XXX,XX +XXX,XX @@ def __init__(self, func, base):
     def __str__(self):
         return self.func + '(' + str(self.base) + ')'
 
-    def str_extract(self):
-        return self.func + '(ctx, ' + self.base.str_extract() + ')'
+    def str_extract(self, lvalue_formatter):
+        return (self.func + '(ctx, '
+                + self.base.str_extract(lvalue_formatter) + ')')
 
     def __eq__(self, other):
         return self.func == other.func and self.base == other.base
@@ -XXX,XX +XXX,XX @@ def __init__(self, func):
     def __str__(self):
         return self.func
 
-    def str_extract(self):
+    def str_extract(self, lvalue_formatter):
         return self.func + '(ctx)'
 
     def __eq__(self, other):
@@ -XXX,XX +XXX,XX @@ def __str__(self):
 
     def str1(self, i):
         return str_indent(i) + self.__str__()
+
+    def output_fields(self, indent, lvalue_formatter):
+        for n, f in self.fields.items():
+            output(indent, lvalue_formatter(n), ' = ',
+                   f.str_extract(lvalue_formatter), ';\n')
 # end General
 
 
@@ -XXX,XX +XXX,XX @@ def extract_name(self):
     def output_extract(self):
         output('static void ', self.extract_name(), '(DisasContext *ctx, ',
                self.base.struct_name(), ' *a, ', insntype, ' insn)\n{\n')
-        for n, f in self.fields.items():
-            output('    a->', n, ' = ', f.str_extract(), ';\n')
+        self.output_fields(str_indent(4), lambda n: 'a->' + n)
         output('}\n\n')
 # end Format
 
@@ -XXX,XX +XXX,XX @@ def output_code(self, i, extracted, outerbits, outermask):
         if not extracted:
             output(ind, self.base.extract_name(),
                    '(ctx, &u.f_', arg, ', insn);\n')
-        for n, f in self.fields.items():
-            output(ind, 'u.f_', arg, '.', n, ' = ', f.str_extract(), ';\n')
+        self.output_fields(ind, lambda n: 'u.f_' + arg + '.' + n)
         output(ind, 'if (', translate_prefix, '_', self.name,
                '(ctx, &u.f_', arg, ')) return true;\n')
 
-- 
2.34.1

From: Peter Maydell <peter.maydell@linaro.org>

To support named fields, we need a topological sort, so that the
assignment to field A is output before the assignment to field B
whenever field B refers to field A by name. The good news is that the
Python standard library provides one (graphlib.TopologicalSorter);
the bad news is that it was only added in Python 3.9.

To bridge the gap between our current minimum supported Python
version and 3.9, provide a local implementation that has the
same API as the stdlib version for the parts we care about.
In the future, when QEMU's minimum Python version requirement reaches
3.9, we can delete this code and replace it with an 'import' line.
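
As a rough illustration of the API being mirrored, the sketch below
uses the stdlib graphlib directly (so it needs Python 3.9 or newer).
The field names are made up for the example; only TopologicalSorter,
static_order() and CycleError come from the standard library.

```python
from graphlib import TopologicalSorter, CycleError  # Python 3.9+

# Hypothetical field graph: each key maps to the fields it refers to,
# i.e. its predecessors, which must be assigned first.
graph = {"imm": ["alpha"], "alpha": [], "rd": []}

# static_order() is lazy in the stdlib, so force it with list().
order = list(TopologicalSorter(graph).static_order())

# A reference loop is reported via CycleError; the second element of
# args is a list of nodes whose first and last entries are the same.
try:
    list(TopologicalSorter({"field1": ["field2"],
                            "field2": ["field1"]}).static_order())
    cycle = None
except CycleError as e:
    cycle = e.args[1]
```

The relative order of unrelated nodes (here "rd") is not specified, so
callers should rely only on predecessors coming before their users.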

The core of this implementation is based on
https://code.activestate.com/recipes/578272-topological-sort/
which is MIT-licensed.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Acked-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20230523120447.728365-5-peter.maydell@linaro.org>
---
 scripts/decodetree.py | 74 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 74 insertions(+)

diff --git a/scripts/decodetree.py b/scripts/decodetree.py
index XXXXXXX..XXXXXXX 100644
--- a/scripts/decodetree.py
+++ b/scripts/decodetree.py
@@ -XXX,XX +XXX,XX @@
 re_fmt_ident = '@[a-zA-Z0-9_]*'
 re_pat_ident = '[a-zA-Z0-9_]*'
 
+# Local implementation of a topological sort. We use the same API that
+# the Python graphlib does, so that when QEMU moves forward to a
+# baseline of Python 3.9 or newer this code can all be dropped and
+# replaced with:
+#    from graphlib import TopologicalSorter, CycleError
+#
+# https://docs.python.org/3.9/library/graphlib.html#graphlib.TopologicalSorter
+#
+# We only implement the parts of TopologicalSorter we care about:
+#  ts = TopologicalSorter(graph=None)
+#    create the sorter. graph is a dictionary whose keys are
+#    nodes and whose values are lists of the predecessors of that node.
+#    (That is, if graph contains "A" -> ["B", "C"] then we must output
+#    B and C before A.)
+#  ts.static_order()
+#    returns a list of all the nodes in sorted order, or raises CycleError
+#  CycleError
+#    exception raised if there are cycles in the graph. The second
+#    element in the args attribute is a list of nodes which form a
+#    cycle; the first and last element are the same, eg [a, b, c, a]
+#    (Our implementation doesn't give the order correctly.)
+#
+# For our purposes we can assume that the data set is always small
+# (typically 10 nodes or less, actual links in the graph very rare),
+# so we don't need to worry about efficiency of implementation.
+#
+# The core of this implementation is from
+# https://code.activestate.com/recipes/578272-topological-sort/
+# (but updated to Python 3), and is under the MIT license.
+
+class CycleError(ValueError):
+    """Subclass of ValueError raised if cycles exist in the graph"""
+    pass
+
+class TopologicalSorter:
+    """Topologically sort a graph"""
+    def __init__(self, graph=None):
+        self.graph = graph
+
+    def static_order(self):
+        # We do the sort right here, unlike the stdlib version
+        from functools import reduce
+        data = {}
+        r = []
+
+        if not self.graph:
+            return []
+
+        # This code wants the values in the dict to be specifically sets
+        for k, v in self.graph.items():
+            data[k] = set(v)
+
+        # Find all items that don't depend on anything.
+        extra_items_in_deps = (reduce(set.union, data.values())
+                               - set(data.keys()))
+        # Add empty dependencies where needed
+        data.update({item: set() for item in extra_items_in_deps})
+        while True:
+            ordered = set(item for item, dep in data.items() if not dep)
+            if not ordered:
+                break
+            r.extend(ordered)
+            data = {item: (dep - ordered)
+                    for item, dep in data.items()
+                        if item not in ordered}
+        if data:
+            # This doesn't give as nice results as the stdlib, which
+            # gives you the cycle by listing the nodes in order. Here
+            # we only know the nodes in the cycle but not their order.
+            raise CycleError('nodes are in a cycle', list(data.keys()))
+
+        return r
+# end TopologicalSorter
+
 def error_with_file(file, lineno, *args):
     """Print an error message from file:line and args and exit."""
     global output_file
-- 
2.34.1

From: Peter Maydell <peter.maydell@linaro.org>

Implement support for named fields, i.e. fields that are defined in
terms of another field rather than directly in terms of bits
extracted from the instruction.
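
A named-field reference is written as 'name:len' (unsigned) or
'name:slen' (signed). The following is a minimal sketch of recognising
such a token; parse_named_token and RE_IDENT are hypothetical names
for this example, and the identifier pattern is an assumption rather
than the script's own re_C_ident.

```python
import re

# Assumed identifier pattern; the script's own re_C_ident may differ.
RE_IDENT = r'[a-zA-Z][a-zA-Z0-9_]*'


def parse_named_token(tok):
    """Return (name, signed, length) for 'name:len' or 'name:slen'.

    Returns None if the token is not a named-field reference, for
    example a plain bit extract such as '8:8'.
    """
    m = re.fullmatch(f'({RE_IDENT}):(s?)([0-9]+)', tok)
    if m is None:
        return None
    return m.group(1), m.group(2) == 's', int(m.group(3))
```

Capturing the optional 's' in its own group keeps the length purely
numeric, so it can be passed to int() without further splitting.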

The new method referenced_fields() on all the Field classes returns a
list of the names of the fields that this field references.  Existing
classes return an empty list or pass through to their sub-fields;
only the new NamedField class returns a name of its own.

We can then use referenced_fields() to:
 * construct a list of 'dangling references' for a format or
   pattern, which is the fields that the format/pattern uses but
   doesn't define itself
 * do a topological sort, so that we output "field = value"
   assignments in an order that means that we assign a field before
   we reference it in a subsequent assignment
 * check when we output the code for a pattern whether we need to
   fill in the format fields before or after the pattern fields, and
   do other error checking
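
The checks in the list above can be sketched in isolation as follows.
This is a simplified model, not the script's code: fields are plain
dicts mapping a field name to the names it references, errors are
raised as ValueError instead of going through error(), and both
function names are hypothetical.

```python
def dangling(fields):
    """Names that `fields` refers to but does not itself define."""
    return [r for refs in fields.values() for r in refs if r not in fields]


def extraction_order(fmt_fields, pat_fields):
    """Decide which set of fields must be assigned first.

    Both arguments map a field name to the list of names it references.
    """
    fmt_refs = dangling(fmt_fields)
    pat_refs = dangling(pat_fields)
    for r in fmt_refs:
        if r not in pat_fields:
            raise ValueError(f'format refers to undefined field {r}')
    for r in pat_refs:
        if r not in fmt_fields:
            raise ValueError(f'pattern refers to undefined field {r}')
    if fmt_refs and pat_refs:
        raise ValueError('cross references in both directions')
    # A format with dangling references needs the pattern's fields first.
    return 'pattern-first' if fmt_refs else 'format-first'
```

For instance, a format whose 'imm' refers to 'alpha', combined with a
pattern that sets 'alpha', yields 'pattern-first'.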

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20230523120447.728365-6-peter.maydell@linaro.org>
---
 scripts/decodetree.py | 145 ++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 139 insertions(+), 6 deletions(-)

diff --git a/scripts/decodetree.py b/scripts/decodetree.py
index XXXXXXX..XXXXXXX 100644
--- a/scripts/decodetree.py
+++ b/scripts/decodetree.py
@@ -XXX,XX +XXX,XX @@ def str_extract(self, lvalue_formatter):
         s = 's' if self.sign else ''
         return f'{s}extract{bitop_width}(insn, {self.pos}, {self.len})'
 
+    def referenced_fields(self):
+        return []
+
     def __eq__(self, other):
         return self.sign == other.sign and self.mask == other.mask
 
@@ -XXX,XX +XXX,XX @@ def str_extract(self, lvalue_formatter):
             pos += f.len
         return ret
 
+    def referenced_fields(self):
+        l = []
+        for f in self.subs:
+            l.extend(f.referenced_fields())
+        return l
+
     def __ne__(self, other):
         if len(self.subs) != len(other.subs):
             return True
@@ -XXX,XX +XXX,XX @@ def __str__(self):
     def str_extract(self, lvalue_formatter):
         return str(self.value)
 
+    def referenced_fields(self):
+        return []
+
     def __cmp__(self, other):
         return self.value - other.value
 # end ConstField
@@ -XXX,XX +XXX,XX @@ def str_extract(self, lvalue_formatter):
         return (self.func + '(ctx, '
                 + self.base.str_extract(lvalue_formatter) + ')')
 
+    def referenced_fields(self):
+        return self.base.referenced_fields()
+
     def __eq__(self, other):
         return self.func == other.func and self.base == other.base
 
@@ -XXX,XX +XXX,XX @@ def __str__(self):
     def str_extract(self, lvalue_formatter):
         return self.func + '(ctx)'
 
+    def referenced_fields(self):
+        return []
+
     def __eq__(self, other):
         return self.func == other.func
 
@@ -XXX,XX +XXX,XX @@ def __ne__(self, other):
         return not self.__eq__(other)
 # end ParameterField
 
+class NamedField:
+    """Class representing a field already named in the pattern"""
+    def __init__(self, name, sign, len):
+        self.mask = 0
+        self.sign = sign
+        self.len = len
+        self.name = name
+
+    def __str__(self):
+        return self.name
+
+    def str_extract(self, lvalue_formatter):
+        global bitop_width
+        s = 's' if self.sign else ''
+        lvalue = lvalue_formatter(self.name)
+        return f'{s}extract{bitop_width}({lvalue}, 0, {self.len})'
+
+    def referenced_fields(self):
+        return [self.name]
+
+    def __eq__(self, other):
+        return self.name == other.name
+
+    def __ne__(self, other):
+        return not self.__eq__(other)
+# end NamedField
 
 class Arguments:
     """Class representing the extracted fields of a format"""
@@ -XXX,XX +XXX,XX @@ def output_def(self):
             output('} ', self.struct_name(), ';\n\n')
 # end Arguments
 
-
 class General:
     """Common code between instruction formats and instruction patterns"""
     def __init__(self, name, lineno, base, fixb, fixm, udfm, fldm, flds, w):
@@ -XXX,XX +XXX,XX @@ def __init__(self, name, lineno, base, fixb, fixm, udfm, fldm, flds, w):
         self.fieldmask = fldm
         self.fields = flds
         self.width = w
+        self.dangling = None
 
     def __str__(self):
         return self.name + ' ' + str_match_bits(self.fixedbits, self.fixedmask)
@@ -XXX,XX +XXX,XX @@ def __str__(self):
     def str1(self, i):
         return str_indent(i) + self.__str__()
 
+    def dangling_references(self):
+        # Return a list of all named references which aren't satisfied
+        # directly by this format/pattern. This will be either:
+        #  * a format referring to a field which is specified by the
+        #    pattern(s) using it
+        #  * a pattern referring to a field which is specified by the
+        #    format it uses
+        #  * a user error (referring to a field that doesn't exist at all)
+        if self.dangling is None:
+            # Compute this once and cache the answer
+            dangling = []
+            for n, f in self.fields.items():
+                for r in f.referenced_fields():
+                    if r not in self.fields:
+                        dangling.append(r)
+            self.dangling = dangling
+        return self.dangling
+
     def output_fields(self, indent, lvalue_formatter):
+        # We use a topological sort to ensure that any use of NamedField
+        # comes after the initialization of the field it is referencing.
+        graph = {}
         for n, f in self.fields.items():
-            output(indent, lvalue_formatter(n), ' = ',
-                   f.str_extract(lvalue_formatter), ';\n')
+            refs = f.referenced_fields()
+            graph[n] = refs
+
+        try:
+            ts = TopologicalSorter(graph)
+            for n in ts.static_order():
+                # We only want to emit assignments for the keys
+                # in our fields list, not for anything that ends up
+                # in the tsort graph only because it was referenced as
+                # a NamedField.
+                try:
+                    f = self.fields[n]
+                    output(indent, lvalue_formatter(n), ' = ',
+                           f.str_extract(lvalue_formatter), ';\n')
+                except KeyError:
+                    pass
+        except CycleError as e:
+            # The second element of args is a list of nodes which form
+            # a cycle (there might be others too, but only one is reported).
+            # Pretty-print it to tell the user.
+            cycle = ' => '.join(e.args[1])
+            error(self.lineno, 'field definitions form a cycle: ' + cycle)
 # end General
 
 
@@ -XXX,XX +XXX,XX @@ def output_code(self, i, extracted, outerbits, outermask):
         ind = str_indent(i)
         arg = self.base.base.name
         output(ind, '/* ', self.file, ':', str(self.lineno), ' */\n')
+        # We might have named references in the format that refer to fields
+        # in the pattern, or named references in the pattern that refer
+        # to fields in the format. This affects whether we extract the fields
+        # for the format before or after the ones for the pattern.
+        # For simplicity we don't allow cross references in both directions.
+        # This is also where we catch the syntax error of referring to
+        # a nonexistent field.
+        fmt_refs = self.base.dangling_references()
+        for r in fmt_refs:
+            if r not in self.fields:
+                error(self.lineno, f'format refers to undefined field {r}')
+        pat_refs = self.dangling_references()
+        for r in pat_refs:
+            if r not in self.base.fields:
+                error(self.lineno, f'pattern refers to undefined field {r}')
+        if pat_refs and fmt_refs:
+            error(self.lineno, ('pattern that uses fields defined in format '
+                                'cannot use format that uses fields defined '
+                                'in pattern'))
+        if fmt_refs:
+            # pattern fields first
+            self.output_fields(ind, lambda n: 'u.f_' + arg + '.' + n)
+            assert not extracted, "dangling fmt refs but it was already extracted"
         if not extracted:
             output(ind, self.base.extract_name(),
                    '(ctx, &u.f_', arg, ', insn);\n')
-        self.output_fields(ind, lambda n: 'u.f_' + arg + '.' + n)
+        if not fmt_refs:
+            # pattern fields last
+            self.output_fields(ind, lambda n: 'u.f_' + arg + '.' + n)
+
         output(ind, 'if (', translate_prefix, '_', self.name,
                '(ctx, &u.f_', arg, ')) return true;\n')
 
@@ -XXX,XX +XXX,XX @@ def output_code(self, i, extracted, outerbits, outermask):
         ind = str_indent(i)
 
         # If we identified all nodes below have the same format,
-        # extract the fields now.
-        if not extracted and self.base:
+        # extract the fields now. But don't do it if the format relies
+        # on named fields from the insn pattern, as those won't have
+        # been initialised at this point.
+        if not extracted and self.base and not self.base.dangling_references():
             output(ind, self.base.extract_name(),
                    '(ctx, &u.f_', self.base.base.name, ', insn);\n')
             extracted = True
@@ -XXX,XX +XXX,XX @@ def parse_field(lineno, name, toks):
     """Parse one instruction field from TOKS at LINENO"""
     global fields
     global insnwidth
+    global re_C_ident
 
     # A "simple" field will have only one entry;
     # a "multifield" will have several.
@@ -XXX,XX +XXX,XX @@ def parse_field(lineno, name, toks):
             func = func[1]
             continue
 
+        if re.fullmatch(re_C_ident + ':s[0-9]+', t):
+            # Signed named field
+            subtoks = t.split(':s')
+            n = subtoks[0]
+            le = int(subtoks[1])
+            f = NamedField(n, True, le)
+            subs.append(f)
+            width += le
+            continue
+        if re.fullmatch(re_C_ident + ':[0-9]+', t):
+            # Unsigned named field
+            subtoks = t.split(':')
+            n = subtoks[0]
+            le = int(subtoks[1])
+            f = NamedField(n, False, le)
+            subs.append(f)
+            width += le
+            continue
+
         if re.fullmatch('[0-9]+:s[0-9]+', t):
             # Signed field extract
             subtoks = t.split(':s')
-- 
2.34.1

From: Peter Maydell <peter.maydell@linaro.org>

Add some tests for various cases of named-field use, both ones that
should work and ones that should be diagnosed as errors.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20230523120447.728365-7-peter.maydell@linaro.org>
---
 tests/decode/err_field10.decode      |  7 +++++++
 tests/decode/err_field7.decode       |  7 +++++++
 tests/decode/err_field8.decode       |  8 ++++++++
 tests/decode/err_field9.decode       | 14 ++++++++++++++
 tests/decode/succ_named_field.decode | 19 +++++++++++++++++++
 tests/decode/meson.build             |  5 +++++
 6 files changed, 60 insertions(+)
 create mode 100644 tests/decode/err_field10.decode
 create mode 100644 tests/decode/err_field7.decode
 create mode 100644 tests/decode/err_field8.decode
 create mode 100644 tests/decode/err_field9.decode
 create mode 100644 tests/decode/succ_named_field.decode

diff --git a/tests/decode/err_field10.decode b/tests/decode/err_field10.decode
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/tests/decode/err_field10.decode
@@ -XXX,XX +XXX,XX @@
+# This work is licensed under the terms of the GNU LGPL, version 2 or later.
+# See the COPYING.LIB file in the top-level directory.
+
+# Diagnose formats which refer to undefined fields
+%field1        field2:3
+@fmt ........ ........ ........ ........ %field1
+insn 00000000 00000000 00000000 00000000 @fmt
diff --git a/tests/decode/err_field7.decode b/tests/decode/err_field7.decode
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/tests/decode/err_field7.decode
@@ -XXX,XX +XXX,XX @@
+# This work is licensed under the terms of the GNU LGPL, version 2 or later.
+# See the COPYING.LIB file in the top-level directory.
+
+# Diagnose fields whose definitions form a loop
+%field1        field2:3
+%field2        field1:4
+insn 00000000 00000000 00000000 00000000 %field1 %field2
diff --git a/tests/decode/err_field8.decode b/tests/decode/err_field8.decode
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/tests/decode/err_field8.decode
@@ -XXX,XX +XXX,XX @@
+# This work is licensed under the terms of the GNU LGPL, version 2 or later.
+# See the COPYING.LIB file in the top-level directory.
+
+# Diagnose patterns which refer to undefined fields
+&f1 f1 a
+%field1        field2:3
+@fmt ........ ........ ........ .... a:4 &f1
+insn 00000000 00000000 00000000 0000 .... @fmt f1=%field1
diff --git a/tests/decode/err_field9.decode b/tests/decode/err_field9.decode
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/tests/decode/err_field9.decode
@@ -XXX,XX +XXX,XX @@
+# This work is licensed under the terms of the GNU LGPL, version 2 or later.
+# See the COPYING.LIB file in the top-level directory.
+
+# Diagnose fields where the format refers to a field defined in the
+# pattern and the pattern refers to a field defined in the format.
+# This is possible to implement in principle, but is not
+# supported by the script at this time.
+&abcd a b c d
+%refa        a:3
+%refc        c:4
+# Format defines 'c' and sets 'b' to an indirect ref to 'a'
+@fmt ........ ........ ........ c:8 &abcd b=%refa
+# Pattern defines 'a' and sets 'd' to an indirect ref to 'c'
+insn 00000000 00000000 00000000 ........ @fmt d=%refc a=6
diff --git a/tests/decode/succ_named_field.decode b/tests/decode/succ_named_field.decode
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/tests/decode/succ_named_field.decode
@@ -XXX,XX +XXX,XX @@
+# This work is licensed under the terms of the GNU LGPL, version 2 or later.
+# See the COPYING.LIB file in the top-level directory.
+
+# field using a named_field
+%imm_sz	8:8 sz:3
+insn 00000000 00000000 ........ 00000000 imm_sz=%imm_sz sz=1
+
+# Ditto, via a format. Here a field in the format
+# references a named field defined in the insn pattern:
+&imm_a imm alpha
+%foo 0:16 alpha:4
+@foo 00000001 ........ ........ ........ &imm_a imm=%foo
+i1   ........ 00000000 ........ ........ @foo alpha=1
+i2   ........ 00000001 ........ ........ @foo alpha=2
+
+# Here the named field is defined in the format and referenced
+# from the insn pattern:
+@bar 00000010 ........ ........ ........ &imm_a alpha=4
+i3   ........ 00000000 ........ ........ @bar imm=%foo
diff --git a/tests/decode/meson.build b/tests/decode/meson.build
index XXXXXXX..XXXXXXX 100644
--- a/tests/decode/meson.build
+++ b/tests/decode/meson.build
@@ -XXX,XX +XXX,XX @@ err_tests = [
     'err_field4.decode',
     'err_field5.decode',
     'err_field6.decode',
+    'err_field7.decode',
+    'err_field8.decode',
+    'err_field9.decode',
+    'err_field10.decode',
     'err_init1.decode',
     'err_init2.decode',
     'err_init3.decode',
@@ -XXX,XX +XXX,XX @@ succ_tests = [
     'succ_argset_type1.decode',
     'succ_function.decode',
     'succ_ident1.decode',
+    'succ_named_field.decode',
     'succ_pattern_group_nest1.decode',
     'succ_pattern_group_nest2.decode',
     'succ_pattern_group_nest3.decode',
-- 
2.34.1